
Perl 5.6+: Unicode-processing issues and how to cope with them.

Perl version 5.6 introduced partial Unicode support. Perl 5.6.1 fixed some of its issues, and Perl 5.8 has comprehensive support for it.  Still, many people run into serious problems when processing Unicode.  Here I explain the most common problems and offer solutions.

There is a newer version of this article.  This older and longer one may be useful if you need to get your stuff running in an older perl, like version 5.6.1+.

You can read this piece and dive into all the technical details and idiosyncrasies of perl and unicode.  Or you can hire me to fix your code.

The perluniintro and perlunicode manpages document support for Unicode since perl 5.6.0.  Perl 5.8 has better Unicode support and even more Unicode-related documentation.  In addition to perluniintro and perlunicode, it includes the Encode and encoding manpages and the documentation of open (perldoc -f open), and the list is not complete.

The major problem with this documentation is its volume.  A normal programmer won’t read it. (Cool programmers don’t read documentation, as we all know.) Most programmers don’t even have to read it all, because to start working with Unicode you just need to know the basic facts and rules.

I somehow got into several different kinds of trouble with Unicode in Perl, both in 5.6 and 5.8, in several different projects. It was always about processing and generating data in UTF-8 encoding.

The two main problems I’ve seen are:

That said, reading or at least browsing through the manpages mentioned above is still a good way to understand and solve your Unicode problems.  If you don’t have time for that now, read on.

The basic facts you need to know

There is a distinction between bytes and characters. Characters are Unicode characters, and one character may consist of several bytes.

There is a “utf8” flag on every scalar value, which might be “on” or “off”. “On” state of the flag tells perl to treat the value as a string of characters.
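To make these two facts concrete, here is a minimal sketch, assuming Perl 5.8 or later with the core Encode module. The variable names are mine and only serve the illustration. It holds the same smiley once as raw bytes and once as a character string, and reports the length and the utf8 flag of each.

```perl
use strict;
use warnings;
use Encode ();

# Three bytes: the UTF-8 encoding of U+263A WHITE SMILING FACE.
my $bytes = "\xE2\x98\xBA";

# Decoding turns the byte string into a one-character string.
my $chars = Encode::decode( 'UTF-8', $bytes );

# Only ASCII is printed here, so no output layer is needed.
printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

The byte string has length 3 with the flag off, while the decoded string has length 1 with the flag on.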

That is the source of a great many perl/unicode problems. If you take a string with the utf8 flag off and concatenate it with another string with the utf8 flag on, perl converts the first one to UTF-8.

This may sound okay and obvious. But then you think: How? Perl will need to know the encoding of the string data before converting it, and perl will try to guess it.

The algorithm perl uses for this guess is documented (it applies some defaults and may check your locale; by default the bytes are taken to be Latin-1), but my suggestion is: never let perl do that.  In my experience, this is the cause of double-encoded UTF-8 strings in 99% of cases.
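Here is a small sketch of how the double encoding comes about. It assumes Perl 5.8 or later with no locale or encoding pragma in effect, so that perl takes each byte of the unflagged string for a Latin-1 character. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $flagged = "\x{263A}";        # character string, utf8 flag on
my $raw     = "\xE2\x98\xBA";    # the same smiley as raw UTF-8 bytes, flag off

# Concatenation upgrades $raw: each of its 3 bytes becomes one character.
my $joined = $flagged . $raw;

# What a UTF-8 output would receive: the second smiley is now double-encoded.
my $out = Encode::encode( 'UTF-8', $joined );

printf "%d characters, %d bytes on output\n", length($joined), length($out);
```

Two correctly handled smileys would be 2 characters and 6 bytes. Here we get 4 characters and 9 bytes, because each of the three raw bytes was re-encoded as a two-byte sequence.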

An example

Imagine you have two variables with Unicode data in them. And you print those variables…

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;

print "$ustring1$ustring2";
__DATA__
Hello ☺!


Both variables here contain the same data: the string "Hello " followed by the Unicode character WHITE SMILING FACE (U+263A), an exclamation mark and a newline character.  But when we print them, the first one comes out fine and the second one comes out garbled.  This is because perl knows that the first string is a Unicode string stored in UTF-8, but doesn’t know the encoding of the second.  When it builds a bigger string for printing, it encodes the second into UTF-8, wrongly.

(Perl 5.8, in addition, prints a warning: Wide character in print at unitest1.pl line 6, <DATA> line 1. We’ll look at it later, after we fix our output.)

There are several ways to solve the problem.  One way is to avoid concatenating the strings:

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;

print $ustring1, $ustring2;
__DATA__
Hello ☺!


It works, but it is wrong.  Sometimes you simply can’t avoid concatenation; it is such a basic operation.  In addition, it is error-prone and not future-proof.

Let perl know you are a grown-up

The real solution is to tell perl that $ustring2 contains Unicode data.  Here is a way to make it work in both perl versions (5.6.1 and 5.8.0, at least):

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;
$ustring2 = pack "U0C*", unpack "C*", $ustring2;

print $ustring1, $ustring2;
__DATA__
Hello ☺!


(Perl 5.8 continues to complain about Wide characters at this stage.)

Whether or not you see the smiling face character depends on your terminal environment.  But at least perl prints two identical lines now.  And that is right.

The strategy

Always use Unicode (UTF-8) to store and process your data.  Make sure perl knows you use it.  Make sure all your scalars which contain Unicode have the utf8 flag on.  Then you can safely concatenate strings.  Then you can use Unicode-related regular expressions, which give you great powers for international text processing.
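As a rough illustration of what the utf8 flag buys you in regular expressions, the sketch below counts Cyrillic letters in the same data twice: once as raw bytes and once after decoding. It assumes Perl 5.8 or later and the core Encode module. The sample word and the variable names are mine.

```perl
use strict;
use warnings;
use Encode ();

# A six-letter Russian word ("privet"), given as raw UTF-8 bytes.
my $bytes = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
my $chars = Encode::decode( 'UTF-8', $bytes );

# Count the matches of a Unicode script property in each string.
my $in_chars = () = $chars =~ /\p{Cyrillic}/g;
my $in_bytes = () = $bytes =~ /\p{Cyrillic}/g;

print "Cyrillic letters found: $in_chars in characters, $in_bytes in bytes\n";
```

The decoded string yields six letters. The raw byte string yields none, because perl sees twelve unrelated characters there.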

To achieve that, you need to know all the ways data gets into your program.  As soon as you get some input, mark it as Unicode or convert it to Unicode and sleep well.

Sometimes data comes into your program already in Unicode.  For instance, XML parsers return string values with the utf8 flag “on”. (Unless you do something weird, like getting it in original form from the parser, which you shouldn’t anyway.) In the above example we explicitly include a Unicode character in a string ($ustring1) and perl knows its encoding.

But when you read data from input streams, from a database or from environment variables (like parameters in CGI), you need to warn perl about its encoding.

Perl 5.8 machinery

If you read data from a file or another input stream, and you expect UTF-8 data there, warn perl:

if ( open( FILE, "<:utf8", $fname ) ) {
  . . . 
}

or, in our case,

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
binmode DATA, ":utf8";
my $ustring2 = <DATA>;

print "$ustring1$ustring2";
__DATA__
Hello ☺!


When you get UTF-8 data from another module (e.g. DBI, CGI) or from the environment variables, explicitly mark it as utf8.  You can use the pack "U0… trick shown above.  Or you can use the Encode module’s decode_utf8() function.

require Encode;
my $ustring = Encode::decode_utf8( $input );

(Do not forget, though, that not every sequence of bytes is valid UTF-8, so this operation may fail.  See the Encode documentation for error handling.)
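One possible way to handle such failures is sketched below. It is not the only one. The helper name decode_utf8_strict is hypothetical; FB_CROAK and LEAVE_SRC are check values documented in Encode. The sketch asks decode() to die on malformed input and catches that with eval.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef if the
# input is not valid UTF-8.
sub decode_utf8_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying the variable passed to it.
    my $string = eval {
        Encode::decode( 'UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $string;    # undef if decode() died
}

my $good = decode_utf8_strict("Hello \xE2\x98\xBA!");
my $bad  = decode_utf8_strict("Hello \xE2\x98!");    # truncated sequence

print "good input: ", ( defined $good ? "decoded" : "rejected" ), "\n";
print "bad input:  ", ( defined $bad  ? "decoded" : "rejected" ), "\n";
```

Whether to reject bad input, as here, or to substitute a replacement character depends on where the data comes from and what you do with it next.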

If you get data in another encoding, convert it to UTF-8.  Again, use the Encode module, this time its decode() function:

require Encode;
my $ustring = Encode::decode( 'iso-8859-1', $input );
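To see what this conversion does to the data, here is a short sketch with a made-up Latin-1 input. It assumes Perl 5.8 or later. The final encode() step stands for what an output filehandle would do when the string is written out as UTF-8.

```perl
use strict;
use warnings;
use Encode ();

my $input   = "caf\xE9";    # "cafe" with e-acute, as ISO-8859-1 bytes
my $ustring = Encode::decode( 'iso-8859-1', $input );    # 4 characters
my $octets  = Encode::encode( 'UTF-8', $ustring );       # 5 bytes

printf "%d Latin-1 bytes -> %d characters -> %d UTF-8 bytes\n",
    length($input), length($ustring), length($octets);
```

The character count stays at 4 throughout. Only the byte representation changes: the single Latin-1 byte 0xE9 becomes the two-byte UTF-8 sequence 0xC3 0xA9.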

UTF-8 from CGI example

In ACIS we send HTML in UTF-8.  We expect the HTML form input to be UTF-8 as well.  To manipulate it, we need to tell perl about the encoding:

require Encode;
require CGI;
my $query = CGI -> new;
my @par_names  = $query -> param;
my $form_input = {};  

foreach my $name ( @par_names ) {
  my @val = $query -> param( $name );

  foreach ( @val ) {
    $_ = Encode::decode_utf8( $_ );
  }
  $name = Encode::decode_utf8( $name );

  if ( scalar @val == 1 ) {   
    $form_input ->{$name} = $val[0];
  } else {                      
    $form_input ->{$name} = \@val;  # save value as an array ref
  }
}

This builds a ready- and safe-to-use hash of input parameters.

Perl 5.6 machinery

Use the pack/unpack trick shown above to set the utf8 flag.  Use the Unicode::String module to convert data from Latin-1 to Unicode and back.
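If you need the pack/unpack trick in more than one place, you might wrap it in a small function. The sketch below does that. The name mark_as_utf8 is mine, not part of any module. The trick does not validate anything, so feed it only data you know to be well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of $bytes with the utf8 flag on.
# No validation is done; $bytes must already be well-formed UTF-8.
sub mark_as_utf8 {
    my ($bytes) = @_;
    return pack "U0C*", unpack "C*", $bytes;
}

my $raw    = "Hello \xE2\x98\xBA!";    # 10 bytes
my $marked = mark_as_utf8($raw);       # 8 characters

printf "%d bytes became %d characters\n", length($raw), length($marked);
```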

Wide character in print warning (Perl 5.8+)

The warning happens when you output a Unicode string to a non-unicode filehandle.

What is a “non-unicode filehandle”, you ask?

In Perl 5.8 a filehandle can have an encoding specified for it.  Perl then will convert all input from the file automatically into its internal UTF-8.  It will mark the values read from it accordingly with the utf8 flag. Equally, perl can convert output to a specific encoding for a filehandle.  Additionally, perl checks that the data is valid for the encoding.

Say, if you open a file as:

open FILE, "<:encoding(iso-8859-7)", $filename;

its content will be assumed to be in iso-8859-7 encoding.  Perl will use that to interpret the file’s data correctly, i.e. to convert it to internal UTF-8.

By default, a filehandle produces bytes on input and expects bytes for output.  Hence, when you print Unicode characters to such a filehandle, perl will warn you. This is what happens with our example.
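The difference between a plain filehandle and one with an encoding is easy to observe. The sketch below assumes Perl 5.8 or later and uses the core File::Temp module for a scratch file. It writes one character through an encoding layer, then reads the file back twice: once as bytes and once through the layer.

```perl
use strict;
use warnings;
use File::Temp ();

my $tmp  = File::Temp->new;    # scratch file, removed automatically
my $path = $tmp->filename;

# Write one character through an encoding layer: perl encodes it for us.
open( my $out, ">:encoding(UTF-8)", $path ) or die "open: $!";
print {$out} "\x{263A}";
close($out) or die "close: $!";

# Read it back with no encoding: we get the bytes.
open( my $raw, "<:raw", $path ) or die "open: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

# Read it back through the layer: we get the character.
open( my $in, "<:encoding(UTF-8)", $path ) or die "open: $!";
my $chars = do { local $/; <$in> };
close($in);

printf "raw read: %d bytes; layered read: %d character(s)\n",
    length($bytes), length($chars);
```

Because the output filehandle has an encoding, printing the smiley to it does not trigger the Wide character warning.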

To get rid of the warning, you again have two ways: the wrong one and the right.  The wrong way is to turn off the utf8 flag on your to-be-printed data.  Then the characters will turn into bytes, and perl will push them to a bytes-filehandle smoothly.  But you’ll have to do this for every piece of data.

The right way is to tell perl that you want to output UTF-8.  So, if you print to a file, open it this way:

open FILE, ">:utf8", $filename;

If you print to standard output (or standard error), as in our case, do this:

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
binmode DATA, ":utf8";
my $ustring2 = <DATA>;

binmode STDOUT, ":utf8";
print "$ustring1$ustring2";
__DATA__
Hello ☺!


Global Unicode setting in Perl

There is another way to approach your coding/encoding problems: command perl to treat your program’s standard input, output and error streams as UTF-8 by default. -C is a perl switch which lets you do that.  Just put -CS on the perl command line.

Alternatively, use the PERL_UNICODE environment variable, PERL_UNICODE=S in this case.  It has to be set in the environment where you execute perl.

You can also specify UTF-8-ness for just your stdin or just stdout or just stderr.  Read the section on -C in perlrun.
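If you want a script to report which of the standard streams were switched to UTF-8 this way, something like the following sketch may help. It assumes a perl recent enough to provide the ${^UNICODE} variable (5.8.2 or later, as far as I know). The helper name utf8_std_streams is hypothetical. The bit values 1, 2 and 4 are those listed in perlrun for the letters I, O and E; S is all three together.

```perl
use strict;
use warnings;

# Bit values of the -C / PERL_UNICODE letters I, O and E (see perlrun).
my %stream_bit = ( STDIN => 1, STDOUT => 2, STDERR => 4 );

# Hypothetical helper: names the standard streams that a given -C
# value switches to UTF-8.
sub utf8_std_streams {
    my ($bits) = @_;
    return grep { $bits & $stream_bit{$_} } qw(STDIN STDOUT STDERR);
}

# ${^UNICODE} reflects how this perl was started; treat "unset" as 0.
my @active = utf8_std_streams( ${^UNICODE} || 0 );
print "UTF-8 by default on: ", ( @active ? "@active" : "none" ), "\n";
```

Run with -CS, the script should list all three streams. Run with -CO, it should list only STDOUT.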

Do you still want to do it yourself?  You can hire me to do your perl & unicode coding.

Off you go!
