Perl 5.6+: Unicode-processing issues and how to cope with them.
Perl version 5.6 introduced partial Unicode support. Perl 5.6.1 fixed some of its issues. Perl 5.8 has comprehensive support for it. Still, many people run into serious problems when processing Unicode. Here I explain the most common problems and offer solutions.
There is a newer version of this article. This older and longer one may be useful if you need to get your stuff running in an older perl, like version 5.6.1+.
Manpages perluniintro and perlunicode document support for Unicode since perl 5.6.0. Perl 5.8 has better Unicode support and even more Unicode-related documentation. In addition to perluniintro and perlunicode, it includes: Encode, encoding, -f open manpages, and the list is not complete.
The major problem with this documentation is its volume. A normal programmer won’t read it. (Cool programmers don’t read documentation, as we all know.) Most programmers don’t even have to read it all, because to start working with Unicode you just need to know the basic facts and rules.
I have run into several different kinds of trouble with Unicode in Perl, both in 5.6 and 5.8, in several different projects. It was always about processing and generating data in UTF-8 encoding.
The two main problems I’ve seen are:
- UTF-8 data getting double-encoded
- “Wide character in print” warning
That said, reading or at least browsing through the manpages mentioned above is still a good way to understand and solve your Unicode problems. If you don’t have time for that now, read on.
The basic facts you need to know
There is a distinction between bytes and characters. Characters are Unicode characters, and one character may consist of several bytes.
Every scalar value has a “utf8” flag, which may be “on” or “off”. When the flag is on, perl treats the value as a string of characters.
That is the source of many perl/Unicode problems. If you take a string with the utf8 flag off and concatenate it with another string with the utf8 flag on, perl converts the first one to UTF-8.
This may sound okay and obvious. But then you think: how? Perl needs to know the encoding of the string data before converting it, and perl will try to guess it.
The algorithm perl uses for guessing is documented (by default it assumes the bytes are ISO-8859-1, and some settings can change that), but my suggestion is: never let perl do that. In my experience, this is the reason for double-encoded UTF-8 strings in 99% of cases.
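The effect can be observed directly. What follows is only a minimal sketch, assuming Perl 5.8.1 or later (for utf8::is_utf8 and the Encode module); the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative values: the same smiley once as raw UTF-8 bytes (flag off)
# and once as a single character (flag on).
my $bytes = "\xE2\x98\xBA";
my $chars = "\x{263A}";

# Concatenation upgrades the byte string; perl reads each byte as one
# Latin-1 character, so the three bytes become three separate characters.
my $joined = $chars . $bytes;

printf "bytes flag:  %s\n", utf8::is_utf8($bytes)  ? "on" : "off";
printf "joined flag: %s\n", utf8::is_utf8($joined) ? "on" : "off";
printf "joined has %d characters\n", length($joined);

# Encoding the result for output shows the damage: the real smiley takes
# 3 bytes, the three misread characters take 2 bytes each.
my $encoded = Encode::encode_utf8($joined);
printf "encoded to %d bytes\n", length($encoded);
```

The joined string has four characters instead of two, and the second smiley goes out as six bytes instead of three. That is the double encoding.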
An example
Imagine you have two variables with Unicode data in them. And you print those variables…
#!/usr/bin/perl
my $ustring1 = "Hello \x{263A}!\n";
my $ustring2 = <DATA>;
print "$ustring1$ustring2";
__DATA__
Hello ☺!
Both variables here contain the same data: the string "Hello " followed by the Unicode character WHITE SMILING FACE U+263A, an exclamation mark and a new-line character. But when we print it, the first one comes out fine and the second one comes out garbled. This is because perl knows that the first string is a Unicode string stored in UTF-8, and doesn’t know the encoding of the second. When it builds a bigger string for printing, it encodes the second into UTF-8, wrongly.
(Perl 5.8, in addition, prints a warning: Wide character in print at unitest1.pl line 6, <DATA> line 1. We’ll look at it later, after we fix our output.)
There are several ways to solve the problem. One way is to avoid concatenating the strings:
#!/usr/bin/perl
my $ustring1 = "Hello \x{263A}!\n";
my $ustring2 = <DATA>;
print $ustring1, $ustring2;
__DATA__
Hello ☺!
It works, but it is wrong. Sometimes you simply can’t avoid concatenation; it is such a basic operation. In addition, it is error-prone and not future-proof.
Let perl know you are a grown-up
The real solution would be to tell perl that $ustring2 contains Unicode data. Here is a way to make it work in both perl versions (5.6.1 and 5.8.0, at least):
#!/usr/bin/perl
my $ustring1 = "Hello \x{263A}!\n";
my $ustring2 = <DATA>;
$ustring2 = pack "U0C*", unpack "C*", $ustring2;
print $ustring1, $ustring2;
__DATA__
Hello ☺!
(Perl 5.8 continues to complain about Wide characters at this stage.)
Whether or not you see the smiling face character depends on your terminal environment. But at least perl now prints two identical lines. And that is right.
The strategy
Always use Unicode (UTF-8) to store and process your data. Make sure perl knows you use it. Make sure all your scalars that contain Unicode have the utf8 flag on. Then you can safely concatenate strings. Then you can use Unicode-related regular expressions, which give you great power for international text processing.
To achieve that, you need to know all the ways data gets into your program. As soon as you get some input, mark it as Unicode or convert it to Unicode and sleep well.
Sometimes data comes into your program already in Unicode. For instance, XML parsers return string values with the utf8 flag “on”. (Unless you do something weird, like getting it in original form from the parser, which you shouldn’t anyway.) In the above example we explicitly include a Unicode character in a string ($ustring1) and perl knows its encoding.
But when you read data from input streams, from a database or from environment variables (like parameters in CGI), you need to warn perl about its encoding.
Perl 5.8 machinery
If you read data from a file or another input stream, and you expect UTF-8 data there, warn perl:
if ( open( FILE, "<:utf8", $fname ) ) {
. . .
}
or, in our case,
#!/usr/bin/perl
my $ustring1 = "Hello \x{263A}!\n";
binmode DATA, ":utf8";
my $ustring2 = <DATA>;
print "$ustring1$ustring2";
__DATA__
Hello ☺!
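To check what the :utf8 layer changes on input, the same bytes can be read twice, once without and once with the layer. This is a small sketch, assuming Perl 5.8 or later; it uses an in-memory filehandle instead of a real file so that it is self-contained, and the variable names are illustrative.

```perl
use strict;
use warnings;

# The example line as it would sit in a file: raw UTF-8 bytes.
my $file_content = "Hello \xE2\x98\xBA!\n";

# Read without a layer: perl hands back bytes.
open my $raw, '<', \$file_content or die "cannot open: $!";
my $as_bytes = <$raw>;
close $raw;

# Read with the :utf8 layer: perl hands back characters, utf8 flag on.
open my $in, '<:utf8', \$file_content or die "cannot open: $!";
my $as_chars = <$in>;
close $in;

printf "without layer: length %d\n", length($as_bytes);
printf "with layer:    length %d\n", length($as_chars);
```

The first length counts the three bytes of the smiley separately (11 in total); the second counts it as one character (9 in total).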
When you get UTF-8 data from another module (e.g. DBI, CGI) or from the environment variables, explicitly mark it as utf8. You can use the pack "U0… trick shown above. Or you can use the Encode module’s decode_utf8() function.
require Encode;
my $ustring = Encode::decode_utf8( $input );
(Do not forget, though, that not every sequence of bytes is valid UTF-8. So this operation may fail. See the Encode doc for error handling.)
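One possible way to handle such failures is to ask Encode to die on malformed input and catch that with eval. The sketch below assumes the Encode module shipped with Perl 5.8; decode_or_undef is a hypothetical helper name, not part of Encode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef if the
# input is not valid UTF-8.
sub decode_or_undef {
    my ($bytes) = @_;    # work on a copy: decode() with a CHECK value may modify its argument
    my $chars = eval { Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK() ) };
    return $chars;       # undef when decode() died
}

my $good = decode_or_undef("Hello \xE2\x98\xBA!");    # complete three-byte sequence
my $bad  = decode_or_undef("Hello \xE2\x98!");        # sequence cut short

print defined $good ? "good input decoded\n" : "good input rejected\n";
print defined $bad  ? "bad input decoded\n"  : "bad input rejected\n";
```

Whether to reject, replace or skip malformed input depends on the application; Encode offers other CHECK values for the other policies.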
If you get data in another encoding, convert it to UTF-8. Again, use the Encode module, this time its decode() function:
require Encode;
my $ustring = Encode::decode( 'iso-8859-1', $input );
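A quick way to convince yourself that the conversion happened is to compare lengths before and after. This is a sketch with a made-up input value; it assumes the Encode module from Perl 5.8.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative input: "café" as four Latin-1 bytes.
my $input = "caf\xE9";

# Bytes in, characters out.
my $ustring = Encode::decode( 'iso-8859-1', $input );

# Characters in, UTF-8 bytes out, e.g. for writing to a bytes filehandle.
my $output = Encode::encode( 'UTF-8', $ustring );

printf "%d characters, %d bytes as UTF-8\n", length($ustring), length($output);
```

The decoded string still has four characters, but the é needs two bytes once it is encoded as UTF-8.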
UTF-8 from CGI example
In ACIS we send HTML in UTF-8. We expect the HTML form input to be UTF-8 as well. To manipulate it, we need to tell perl about the encoding:
require Encode;
require CGI;
my $query = CGI -> new;
my @par_names = $query -> param;
my $form_input = {};
foreach my $name ( @par_names ) {
    my @val = $query -> param( $name );
    # decode every value from UTF-8 bytes into characters
    foreach ( @val ) {
        $_ = Encode::decode_utf8( $_ );
    }
    # parameter names may contain non-ASCII characters too
    $name = Encode::decode_utf8( $name );
    if ( scalar @val == 1 ) {
        $form_input ->{$name} = $val[0];
    } else {
        $form_input ->{$name} = \@val; # save value as an array ref
    }
}
This builds a ready- and safe-to-use hash of input parameters.
Perl 5.6 machinery
Use the pack/unpack trick to set the utf8 flag. Use the Unicode::String module to convert data from Latin-1 to Unicode and back.
Wide character in print warning (Perl 5.8+)
The warning happens when you output a Unicode string to a non-unicode filehandle.
What is a “non-unicode filehandle?”, you ask.
In Perl 5.8 a filehandle can have an encoding specified for it. Perl then will convert all input from the file automatically into its internal UTF-8. It will mark the values read from it accordingly with the utf8 flag. Equally, perl can convert output to a specific encoding for a filehandle. Additionally, perl checks that the data is valid for the encoding.
Say, if you open a file as:
open FILE, "<:encoding(iso-8859-7)", $filename;
its content will be assumed to be in iso-8859-7 encoding. Perl will use that to interpret the file’s data correctly, i.e. to convert it to internal UTF-8.
By default, a filehandle produces bytes on input and expects bytes for output. Hence, when you print Unicode characters to such a filehandle, perl will warn you. This is what happens with our example.
To get rid of the warning, you again have two ways: the wrong one and the right. The wrong way is to turn off the utf8 flag on your to-be-printed data. Then the characters will turn into bytes, and perl will push them to a bytes-filehandle smoothly. But you’ll have to do this for every piece of data.
The right way is to tell perl that you want to output UTF-8. So, if you print to a file, open it this way:
open FILE, ">:utf8", $filename;
If you print to standard output (or standard error), as in our case, do this:
#!/usr/bin/perl
my $ustring1 = "Hello \x{263A}!\n";
binmode DATA, ":utf8";
my $ustring2 = <DATA>;
binmode STDOUT, ":utf8";
print "$ustring1$ustring2";
__DATA__
Hello ☺!
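The difference between a bytes filehandle and a UTF-8 filehandle can also be checked without a terminal. The following sketch assumes Perl 5.8 or later; it prints the same character to two in-memory filehandles and collects any warnings. The variable names are illustrative.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to standard error.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my ( $plain, $layered ) = ( '', '' );

# A default (bytes) filehandle: printing a wide character warns.
open my $out1, '>', \$plain or die "cannot open: $!";
print {$out1} "\x{263A}";
close $out1;

# A filehandle with the :utf8 layer: the character is encoded silently.
open my $out2, '>:utf8', \$layered or die "cannot open: $!";
print {$out2} "\x{263A}";
close $out2;

printf "warnings seen: %d\n", scalar @warnings;
printf "bytes written through :utf8 layer: %d\n", length($layered);
```

Only the first print should produce the “Wide character in print” warning; the second writes the three UTF-8 bytes of the smiley without complaint.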
Global Unicode setting in Perl
There is another way to approach your coding/encoding problems: command perl to treat your program’s standard input, output and error streams as UTF-8 by default. -C is a perl switch which lets you do that. Just put -CS on the perl command line.
Alternatively, use the PERL_UNICODE environment variable, PERL_UNICODE=S in this case. It has to be set in the environment where you execute perl.
You can also specify UTF-8-ness for just your stdin or just stdout or just stderr. Read the section on -C in perlrun.
Off you go!
Further reading
- perluniintro manpage
- perlunicode manpage
- Encode module
- PerlIO manpage
- open pragma
- binmode() function
- open() function
- perlrun manpage
- utf8 pragma
Comments welcome.