How to remove diacritic marks from characters
…with Unicode and Perl 5.8. Originally: 2003-12-27; update: 2004-10-24. Important update: 2005-02-13, corrected bugs in code.
Let’s assume you have some ISO-8859-1 (Latin-1) data. Naturally, this data has some characters with diacritics, such as “é”. (Such characters are sometimes called “accented”.) You need to convert this data to plain ASCII but keep it readable. Here is some advice on how to do that.
First, it will be useful to understand certain things about Unicode. Then I’ll show you what to do, exactly.
Virtues of Unicode
Unicode, first of all, is a database of characters. Its primary virtue is that it assigns a code to almost every letter and punctuation character of almost every language in the world. It also gives each character a name.
codepoint | character name | the character |
---|---|---|
U+005D | RIGHT SQUARE BRACKET | ] |
U+00E3 | LATIN SMALL LETTER A WITH TILDE | ã |
U+0F3D | TIBETAN MARK ANG KHANG GYAS | ༽ |
U+1FBF | GREEK PSILI | ᾿ |
But that is not even half of the story. Unicode also carries a great deal of other information about the characters, some of which is occasionally very useful.
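If you want to see the character names for yourself, Perl can look them up. This is only a small sketch: it uses the standard charnames pragma, and the variable names are my own choice for illustration.

```perl
use strict;
use warnings;
use charnames ':full';    # loads the Unicode character name table

# Code points taken from the table above.
my @codepoints = (0x005D, 0x00E3, 0x0F3D, 0x1FBF);

# Map each code point to its official Unicode name.
my %name_of;
for my $cp (@codepoints) {
    $name_of{$cp} = charnames::viacode($cp);
}

printf "U+%04X  %s\n", $_, $name_of{$_} for @codepoints;
```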
Combining sequence
The same human text in the same encoding may have several Unicode representations. To be precise, there may be several ways to write a certain character in Unicode. The funny thing is: the accented characters are such characters. And I’m not talking about encodings here.
Let’s make this clear. Unicode has separate characters for special marks, which you can combine with other characters to form a new one. For instance, there is a separate character for the acute accent mark. When you need to write “é”, you can write “e” (U+0065) and add the combining acute mark to it (U+0301). This means: put two Unicode characters, one after another: U+0065 U+0301. Here you have an example of what is called “a combining sequence”.
Or, you can write the Unicode character U+00E9 directly, which is LATIN SMALL LETTER E WITH ACUTE. Both ways are equivalent in terms of the text produced. Every Unicode-compatible application should process (e.g. display) both equally well.
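To make the two representations concrete, here is a minimal sketch. It assumes the Unicode::Normalize module and uses its NFC function (composition) only to show that both spellings end up as the same string; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{00E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $combining   = "e\x{0301}";    # "e" followed by COMBINING ACUTE ACCENT

# As raw character strings the two differ: one character versus two.
my $same_raw = ($precomposed eq $combining) ? 1 : 0;

# After normalizing both to the composed form they compare equal.
my $same_nfc = (NFC($precomposed) eq NFC($combining)) ? 1 : 0;

printf "lengths: %d and %d\n", length($precomposed), length($combining);
print  "equal as written: $same_raw, equal after NFC: $same_nfc\n";
```

So a plain string comparison is not enough to tell whether two pieces of text are “the same” in the Unicode sense.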
Not every possible combining sequence has a single-character equivalent. But many of the accented characters used in ISO-8859-1 encoding do have two interchangeable representations.
Unicode decomposition
Unicode defines a precise way to translate between the combining sequences (U+0065 U+0301) and their single-character equivalents (U+00E9), in both directions. Translating single accented characters into the corresponding combining sequences is called decomposition. That is the next piece we need to solve our little problem.
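A quick sketch of both directions, assuming Unicode::Normalize is available: NFD performs the decomposition and NFC composes the result back. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Decompose U+00E9 and list the code points of the result.
my $decomposed = NFD("\x{00E9}");
my @codes = map { sprintf 'U+%04X', ord($_) } split //, $decomposed;
print "@codes\n";    # prints: U+0065 U+0301

# Composition is the backwards translation.
my $recomposed = NFC($decomposed);
printf "U+%04X\n", ord($recomposed);    # prints: U+00E9
```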
Decomposition would break each “é” into “e” and the acute accent mark “´” combining character.
Practical outline
So, here is the outline of our solution:
- we take some data with diacritics;
- convert it to Unicode;
- put it through Canonical Decomposition, also known as Normalization Form D;
- remove all characters that belong to the Unicode General Category “Mark” (non-spacing, spacing combining, enclosing) — thus removing the diacritics (accent marks);
- prepare the data for output to an ASCII stream.
I had to tell you the whole story about Unicode, because otherwise you would not understand this outline. If you still don’t get it, ask me questions.
Practice
We will need Perl 5.8 with its Encode and Unicode::Normalize modules; both are part of the standard library as of 5.8.
The code:
require Encode;
use Unicode::Normalize;
for ( $str ) { # the variable we work on
## convert to Unicode first
## if your data comes in Latin-1, then uncomment:
#$_ = Encode::decode( 'iso-8859-1', $_ );
$_ = NFD( $_ ); ## decompose
s/\pM//g; ## strip combining characters
s/[^\0-\x7f]//g; ## clear everything else (ASCII ends at \x7f)
}
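To try this out, you could wrap the same steps in a small function. The sketch below is only an illustration: the name strip_diacritics is hypothetical, the input is assumed to be an already decoded character string, and the last step keeps the ASCII range \0-\x7f.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: takes a decoded character string, returns plain ASCII.
sub strip_diacritics {
    my ($text) = @_;
    $text = NFD($text);           # decompose accented characters
    $text =~ s/\pM//g;            # strip the combining marks
    $text =~ s/[^\0-\x7f]//g;     # drop whatever is still outside ASCII
    return $text;
}

my $plain = strip_diacritics("r\x{e9}sum\x{e9} \x{e0} la carte");
print "$plain\n";    # prints: resume a la carte
```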
Nothing is perfect
Problem: not all “funny” characters of ISO-8859-1 are decomposable into a base character and a combining character. Here are some of those: “ß”, “Ø”, “œ”. These characters would disappear after the above code, unless we take measures.
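You can check which characters are affected by comparing a character with its decomposed form. This is a rough sketch with made-up variable names; it assumes Unicode::Normalize and tests only four characters.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# 1 means NFD leaves the character as it is (nothing to strip),
# 0 means it splits into a base character plus combining marks.
my %unchanged;
for my $cp (0x00E9, 0x00DF, 0x00D8, 0x0153) {    # é ß Ø œ
    my $char = chr($cp);
    $unchanged{$cp} = (NFD($char) eq $char) ? 1 : 0;
}

printf "U+%04X unchanged by NFD: %d\n", $_, $unchanged{$_}
    for sort { $a <=> $b } keys %unchanged;
```

Only “é” is split here. The other three stay as single non-ASCII characters, so the final cleanup step removes them entirely.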
In fact, there are a lot of such characters. Some of them transliterate well into a pair, like “ä” -> “ae”. Others transliterate into a single simple character, like “Ø” -> “O”.
Here is my code, complete with all those additional transliterations:
require Encode;
use Unicode::Normalize;
for ( $str ) { # the variable we work on
## convert to Unicode first
## if your data comes in Latin-1, then uncomment:
#$_ = Encode::decode( 'iso-8859-1', $_ );
s/\xe4/ae/g; ## treat characters ä ñ ö ü ÿ
s/\xf1/ny/g; ## this was wrong in previous version of this doc
s/\xf6/oe/g;
s/\xfc/ue/g;
s/\xff/yu/g;
$_ = NFD( $_ ); ## decompose (Unicode Normalization Form D)
s/\pM//g; ## strip combining characters
# additional normalizations:
s/\x{00df}/ss/g; ## German sharp s (eszett) “ß” -> “ss”
s/\x{00c6}/AE/g; ## Æ
s/\x{00e6}/ae/g; ## æ
s/\x{0132}/IJ/g; ## Ĳ
s/\x{0133}/ij/g; ## ĳ
s/\x{0152}/Oe/g; ## Œ
s/\x{0153}/oe/g; ## œ
tr/\x{00d0}\x{0110}\x{00f0}\x{0111}\x{0126}\x{0127}/DDddHh/; # ÐĐðđĦħ
tr/\x{0131}\x{0138}\x{013f}\x{0141}\x{0140}\x{0142}/ikLLll/; # ıĸĿŁŀł
tr/\x{014a}\x{0149}\x{014b}\x{00d8}\x{00f8}\x{017f}/NnnOos/; # ŊʼnŋØøſ
tr/\x{00de}\x{0166}\x{00fe}\x{0167}/TTtt/; # ÞŦþŧ
s/[^\0-\x7f]//g; ## clear everything else (non-ASCII); optional
}
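Note that the order of the steps matters: the “ä ñ ö ü ÿ” substitutions have to run before the decomposition. Here is a small sketch of why, using a sample word of my own choosing and assuming Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $word = "M\x{fc}ller";    # "Müller" with a precomposed ü

# Substitute first, then decompose: the ü -> ue rule fires.
my $before = $word;
$before =~ s/\xfc/ue/g;
$before = NFD($before);
$before =~ s/\pM//g;

# Decompose first: ü is already "u" + U+0308, so the rule never matches
# and only the combining mark gets stripped.
my $after = NFD($word);
$after =~ s/\xfc/ue/g;
$after =~ s/\pM//g;

print "$before $after\n";    # prints: Mueller Muller
```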
P.S.
Having said all the above, I suggest you go back to the original problem you had and think about it again. Is this the right way to solve it?
These days basic support for Unicode is not a miracle anymore. Operating systems, web browsers, fonts — many are already Unicode-enabled. Potentially, this could mean that transformation of data to plain ASCII is an unnecessary loss of information. Information of cultural value, by the way.
Further reading
- On the Goodness of Unicode. By Tim Bray. A good essay about Unicode, expanding on some issues I’ve only touched on lightly here.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). By Joel Spolsky.
- perlunicode - Unicode support in Perl. A man page, part of the perl distribution. Also: perluniintro, Encode, Unicode::Normalize. It contains much more than you need.
- Text::Unidecode module, which tries to transliterate any Unicode text to US-ASCII.
- Unicode.org. The official website of the consortium. The standard is available online for free access or as a book with a CD-ROM for order. On the site you’ll also find additional guidance.
You may find my previous essay on dealing with Unicode in Perl useful.
Critique welcome.