How to remove diacritic marks from characters

…with Unicode and Perl 5.8. Originally: 2003-12-27; update: 2004-10-24. Important update: 2005-02-13, corrected bugs in code.

Let’s assume you have some ISO-8859-1 (Latin-1) data. Naturally, this data has some characters with diacritics, such as “é”. (Such characters are sometimes called “accented”.) You need to convert this data to plain ASCII but keep it readable. I’ll give you some advice on how to do that.

First, it will be useful to understand certain things about Unicode. Then I’ll show you what to do, exactly.

Virtues of Unicode

Unicode, first of all, is a database of characters. Its primary virtue is that it assigns a code to almost every letter and punctuation character of almost every language in the world. It also gives each character a name.

Selected Unicode character codepoints and their names
codepoint	character name	the character
U+005D	RIGHT SQUARE BRACKET	]
U+00E3	LATIN SMALL LETTER A WITH TILDE	ã
U+0F3D	TIBETAN MARK ANG KHANG GYAS	༽
U+1FBF	GREEK PSILI	᾿
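If you want to look such names up yourself, Perl can do it. The following is only a small sketch, assuming the standard charnames module and its viacode function (available since Perl 5.8); the helper name describe is mine, made up for illustration.

```perl
use strict;
use warnings;
use charnames ();    # load the name database; we only need viacode

# Code points taken from the table above.
my @codepoints = ( 0x005D, 0x00E3, 0x0F3D, 0x1FBF );

# Hypothetical helper: returns "U+XXXX<TAB>NAME" for one code point.
sub describe {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    $name = '(no name)' unless defined $name;
    return sprintf( "U+%04X\t%s", $cp, $name );
}

print describe($_), "\n" for @codepoints;
```

Each printed line should match the first two columns of the corresponding table row.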

But that is not even half of the story. Unicode also carries a whole lot of other information about the characters, some of which is very useful at times.

Combining sequence

The same human text in the same encoding may have several Unicode representations. To be precise, there may be several ways to write a certain character in Unicode. The funny thing is: the accented characters are such characters. And I’m not talking about encodings here.

Let’s make this clear. Unicode has separate characters for special marks, which you can combine with other characters to form a new one. For instance, there is a separate character for the acute accent mark. When you need to write “é”, you can write “e” (U+0065) and add the combining acute mark to it (U+0301). This means: put two Unicode characters, one after another: U+0065 U+0301. Here you have an example of what is called “a combining sequence”.

Or, you can write the Unicode character U+00E9 directly, which is LATIN SMALL LETTER E WITH ACUTE. Both ways are equivalent in terms of the text produced. Every Unicode-compatible application should process (e.g. display) them equally well.

Not every possible combining sequence has a single-character equivalent. But many of the accented characters used in ISO-8859-1 encoding do have two interchangeable representations.
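To see the two spellings of “é” side by side, here is a minimal sketch. It assumes the Unicode::Normalize module and its NFC and NFD functions; the variable names are mine. The raw strings differ in length and do not compare equal, but once both are brought to the same normalization form they do.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $combining   = "e\x{0301}";    # "e" followed by COMBINING ACUTE ACCENT

# As raw strings the two differ: one character versus two.
my $raw_equal = ( $precomposed eq $combining ) ? 1 : 0;

# Normalized to the same form (composed or decomposed), they are identical.
my $nfc_equal = ( NFC($precomposed) eq NFC($combining) ) ? 1 : 0;
my $nfd_equal = ( NFD($precomposed) eq NFD($combining) ) ? 1 : 0;

printf "lengths: %d and %d\n", length($precomposed), length($combining);
printf "raw equal: %d, NFC equal: %d, NFD equal: %d\n",
    $raw_equal, $nfc_equal, $nfd_equal;
```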

Unicode decomposition

Unicode defines a precise way to translate combining sequences (U+0065 U+0301) into their single-character equivalents (U+00E9), and the reverse translation as well. Translating single accented characters into the corresponding combining sequences is called decomposition. That’s the next big thing we need to solve our little problem.

Decomposition breaks each “é” into “e” followed by the combining acute accent (U+0301).
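A quick way to watch decomposition happen is to list the code points before and after. This is just a sketch, assuming Unicode::Normalize; the helper codepoints is a made-up name for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: render a string as a list of "U+XXXX" code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $e_acute = "\x{00E9}";    # single precomposed character

print codepoints($e_acute), "\n";                  # the original
print codepoints( NFD($e_acute) ), "\n";           # decomposed
print codepoints( NFC( NFD($e_acute) ) ), "\n";    # composed back again
```

The second line should show two code points, the base letter and the combining mark; the third shows that composition undoes the decomposition.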

Practical outline

So, here is the outline of our solution:

we take some data with diacritics;
convert it to Unicode;
put it through Canonical Decomposition, also known as Normalization Form D;
remove all characters that belong to the Unicode General Category “Mark” (non-spacing, spacing combining, enclosing) — thus removing the diacritics (accent marks);
prepare the data for output to an ASCII stream.

I had to tell you the whole story about Unicode because otherwise you wouldn’t understand this outline. If you still don’t get it, ask me questions.

Practice

We will need Perl 5.8 with its Encode and Unicode::Normalize modules; both ship with the standard distribution.

The code:


 require Encode;
 use Unicode::Normalize;

 for ( $str ) {  # the variable we work on
   ##  convert to Unicode first
   ##  if your data comes in Latin-1, then uncomment:
   #$_ = Encode::decode( 'iso-8859-1', $_ );  
   $_ = NFD( $_ );   ##  decompose
   s/\pM//g;         ##  strip combining characters
   s/[^\0-\x7f]//g;  ##  clear everything that is not ASCII
 }
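For a quick try-out, the same steps can be wrapped in a function and fed some Latin-1 bytes. This is only a sketch: the name latin1_to_ascii and the sample string are mine, and it assumes the input really is ISO-8859-1 encoded. Note that the final filter keeps the range \0-\x7f, which is exactly ASCII.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFD);

# Hypothetical helper: takes ISO-8859-1 bytes, returns plain ASCII text.
sub latin1_to_ascii {
    my ($bytes) = @_;
    my $str = Encode::decode( 'iso-8859-1', $bytes );    # bytes -> Unicode
    $str = NFD($str);                                    # decompose
    $str =~ s/\pM//g;                                    # strip combining marks
    $str =~ s/[^\0-\x7f]//g;                             # drop anything non-ASCII
    return $str;
}

# "café crème brûlée" written as Latin-1 bytes
my $sample = "caf\x{e9} cr\x{e8}me br\x{fb}l\x{e9}e";

print latin1_to_ascii($sample), "\n";    # cafe creme brulee
```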

Nothing is perfect

Problem: not all “funny” characters of ISO-8859-1 decompose into a base character and a combining character. Here are some that do not: “ß”, “Ø”, “æ”. These characters would simply disappear after the above code, unless we take measures.
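The problem is easy to reproduce. The sketch below (helper names decomposes and strip_simple are made up for illustration, and Unicode::Normalize is assumed) checks whether NFD changes a character at all, and then runs the simple pipeline on the German word “Straße”.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if canonical decomposition changes the character.
sub decomposes {
    my ($char) = @_;
    return NFD($char) ne $char ? 1 : 0;
}

# The simple pipeline: decompose, drop marks, drop what is still non-ASCII.
sub strip_simple {
    my ($str) = @_;
    $str = NFD($str);
    $str =~ s/\pM//g;
    $str =~ s/[^\0-\x7f]//g;
    return $str;
}

for my $cp ( 0x00E9, 0x00DF, 0x00D8 ) {    # é ß Ø
    printf "U+%04X %s\n", $cp,
        decomposes( chr($cp) ) ? 'decomposes' : 'has no decomposition';
}

print strip_simple("Stra\x{df}e"), "\n";    # the ß is lost: Strae
```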

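In fact, there are a lot of such characters. Some of them transliterate well into a pair, like “ß” -> “ss”. Others map to a single plain character, like “Ø” -> “O”. A few decomposable characters also read better as a pair, like “ä” -> “ae”, so the code replaces those before decomposition.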

Here is my code, complete with all those additional transliterations:

 require Encode;
 use Unicode::Normalize;

 for ( $str ) {  # the variable we work on

   ##  convert to Unicode first
   ##  if your data comes in Latin-1, then uncomment:
   #$_ = Encode::decode( 'iso-8859-1', $_ );  

   s/\xe4/ae/g;  ##  treat characters ä ñ ö ü ÿ before they get decomposed
   s/\xf1/ny/g;  ##  this was wrong in previous version of this doc    
   s/\xf6/oe/g;
   s/\xfc/ue/g;
   s/\xff/yu/g;

   $_ = NFD( $_ );   ##  decompose (Unicode Normalization Form D)
   s/\pM//g;         ##  strip combining characters

   # additional normalizations:

   s/\x{00df}/ss/g;  ##  German sharp s “ß” -> “ss”
   s/\x{00c6}/AE/g;  ##  Æ
   s/\x{00e6}/ae/g;  ##  æ
   s/\x{0132}/IJ/g;  ##  Ĳ
   s/\x{0133}/ij/g;  ##  ĳ
   s/\x{0152}/OE/g;  ##  Œ
   s/\x{0153}/oe/g;  ##  œ

   tr/\x{00d0}\x{0110}\x{00f0}\x{0111}\x{0126}\x{0127}/DDddHh/; # ÐĐðđĦħ
   tr/\x{0131}\x{0138}\x{013f}\x{0141}\x{0140}\x{0142}/ikLLll/; # ıĸĿŁŀł
   tr/\x{014a}\x{0149}\x{014b}\x{00d8}\x{00f8}\x{017f}/NnnOos/; # ŊŉŋØøſ
   tr/\x{00de}\x{0166}\x{00fe}\x{0167}/TTtt/;                   # ÞŦþŧ

   s/[^\0-\x7f]//g;  ##  clear everything that is not ASCII; optional
 }
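The order of the steps matters: the “ü” -> “ue” style replacements have to run before decomposition, because afterwards the single character is gone and the pattern never matches. Here is a small sketch of the difference; the two function names are made up for illustration, and Unicode::Normalize is assumed.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Replace first, then decompose: the order used in the code above.
sub replace_then_decompose {
    my ($str) = @_;
    $str =~ s/\x{fc}/ue/g;    # ü is still one character here
    $str = NFD($str);
    $str =~ s/\pM//g;
    return $str;
}

# Decompose first: by the time we look for ü it is "u" + U+0308,
# and the diaeresis has already been stripped.
sub decompose_then_replace {
    my ($str) = @_;
    $str = NFD($str);
    $str =~ s/\pM//g;
    $str =~ s/\x{fc}/ue/g;    # never matches
    return $str;
}

my $city = "M\x{fc}nchen";    # München

print replace_then_decompose($city), "\n";    # Muenchen
print decompose_then_replace($city), "\n";    # Munchen
```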

When dealing with European data this will give really good results. The code is in the public domain; you are free to use it as you please. Don’t forget it requires Perl 5.8. To compile the above special character translations I used that page. If something is wrong or incomplete, please let me know.

P.S.

Having said all the above, I suggest you go back to the original problem you had and think about it again. Is this the right way to solve it?

These days basic support for Unicode is not a miracle anymore. Operating systems, web browsers, fonts — many are already Unicode-enabled. Potentially, this could mean that transformation of data to plain ASCII is an unnecessary loss of information. Information of cultural value, by the way.