Unicode-processing issues in Perl and how to cope with them

Perl 5.8+ has comprehensive support for Unicode and a wide range of different text encodings.  But still many people experience problems when processing multi-language text.  Here I explain the most common problems and offer solutions.

21 Nov 2013. Some inaccuracies in the text of the article and in the test scripts were corrected.

This article has been translated into Serbo-Croatian by Anja Skrba from Webhostinggeeks.com.

An older version of this article is available.  It is not as well structured, but provides some additional perl version 5.6.1 unicode-related details.

You can read this piece and dive into all the technical details and idiosyncrasies of perl and unicode.  Or you can hire me to fix your code.

A bunch of perldoc manpages outline and explain Perl’s Unicode support: perluniintro, perlunicode, the Encode module, the binmode() function. And the list is not complete. The major problem with this documentation is its volume. Most programmers don’t need to read it all, because to start working with Unicode you just need to know some basic facts and rules.

I have experienced several kinds of trouble with Unicode in Perl, in several projects. The two main problems I’ve seen are:

- Garbled characters in the output, often double-encoded UTF-8.
- The “Wide character in print” warning.

These two problems are closely related and often solved by similar moves.

Reading or at least browsing through the related manpages is still a good way to understand and solve your Unicode problems. If you don’t have time for that now, read on.

The problem showcase: the example

Imagine two simple variables with Unicode text in them, and you print those variables to standard output. What could be easier?

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;

print "$ustring1$ustring2";
__DATA__
Hello ☺!


Both variables here contain the same data: string "Hello " followed by Unicode character WHITE SMILING FACE U+263A, an exclamation mark and a new-line character. The __DATA__ part ($ustring2) is UTF-8 encoded.

But when we print them, the first one comes out fine and the second one comes out garbled. This is because Perl knows that the first string is a Unicode string and is internally stored in UTF-8. But it doesn’t know the encoding of the second. When it builds a bigger string for printing, it re-encodes the second into UTF-8 as if every byte in it were a separate character, which is wrong.

In addition, it prints a warning: Wide character in print at unitest1.pl line 6, <DATA> line 1. We’ll look at it later, after we fix our output.

You could apparently fix things by avoiding concatenation:

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;

print $ustring1, $ustring2;
__DATA__
Hello ☺!


But this is not a solution. Sometimes you simply can’t avoid concatenation; it is such a basic operation. In addition, it is error-prone and not future-proof.

Why the problem happens

First, some basic facts.

There is a distinction between bytes and characters. Characters are Unicode characters. One character may be represented by several bytes when stored, printed or sent over a network. How exactly a character is converted into bytes depends on the encoding used. UTF-8 is just one of the ways to represent Unicode characters.
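
To make the distinction concrete, here is a minimal sketch (the variable names are mine) that encodes our smiley in two encodings with the Encode module and compares the lengths:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: WHITE SMILING FACE, U+263A.
my $char = "\x{263A}";

# The same character turned into bytes in two different encodings.
my $utf8_bytes  = encode( 'UTF-8',    $char );
my $utf16_bytes = encode( 'UTF-16BE', $char );

# length() counts characters in $char, but bytes in the encoded strings.
printf "characters: %d, UTF-8 bytes: %d, UTF-16BE bytes: %d\n",
    length($char), length($utf8_bytes), length($utf16_bytes);
```

One character, three bytes in UTF-8, two bytes in UTF-16BE: the character is the same, only its byte representation differs.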

Perl has a “utf8” flag for every scalar value, which may be “on” or “off”. The “On” state of the flag tells perl to treat the value as a string of Unicode characters. Otherwise, it is just a bunch of bytes.

If you take a string with utf8 flag off and concatenate it with a string that has utf8 flag on, perl converts the first one to Unicode.
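
You can watch the flag yourself with the utf8::is_utf8() function, which is always available. A small sketch, with variable names of my own choosing:

```perl
use strict;
use warnings;

my $bytes = "Hello ";           # plain ASCII literal: the flag is off
my $chars = "\x{263A}";         # a character above 255: the flag is on
my $both  = $bytes . $chars;    # concatenation: the result has the flag on

for my $pair ( [ bytes => $bytes ], [ chars => $chars ], [ both => $both ] ) {
    my ( $name, $value ) = @$pair;
    printf "%-5s utf8 flag is %s\n", $name,
        utf8::is_utf8($value) ? 'on' : 'off';
}
```

This is handy for debugging only; normal code should not need to look at the flag.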

This may sound okay and obvious. But then you think: How? Perl would need to know the encoding of the string data before converting it. It doesn’t know it, so it assumes one. And this is the usual source of problems.

The assumption perl makes is documented: by default it takes every byte to be one character with the same code point, which amounts to treating your data as ISO-8859-1 (Latin-1). My firm suggestion is: never let perl do that. Otherwise, there is a BIG chance that you’ll get double-encoded UTF-8 strings, or otherwise mangled data.
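
Here is a minimal sketch of how double-encoding comes about. It builds the smiley once as raw UTF-8 bytes (the way <DATA> delivers it) and once as a character, joins them, and shows the bytes you get when the result is encoded as UTF-8. The variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The smiley as three raw UTF-8 bytes (utf8 flag off).
my $raw_bytes = "\xE2\x98\xBA";
# The same smiley as one Unicode character (utf8 flag on).
my $character = "\x{263A}";

# Concatenation upgrades $raw_bytes byte by byte, so the result
# holds four characters instead of two.
my $joined = $raw_bytes . $character;

# Encode the result as UTF-8 and show the bytes in hex.
my $encoded = encode( 'UTF-8', $joined );
my $hex = join ' ', map { sprintf( '%02X', ord $_ ) } split //, $encoded;
print length($joined), " characters: $hex\n";
```

The first smiley comes out as C3 A2 C2 98 C2 BA, which is double-encoded UTF-8, while the second one is the correct E2 98 BA.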

The solution: always make data encoding explicit, both for your input and output.

Solution #1: Convert string to Unicode

One solution could be to tell perl that $ustring2 contains Unicode data in UTF-8 encoding. There are a couple of ways to do that; the orthodox way is through Encode’s decode_utf8() function:

#!/usr/bin/perl

use Encode;
my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;
$ustring2 = decode_utf8( $ustring2 );

print "$ustring1$ustring2";
__DATA__
Hello ☺!


In this simple case it does the job, but it may get quite tedious if your inputs are plentiful. And it still prints the “Wide character” warning.

But this is what you should always do for the international data you get from other Perl modules, for example from databases.

You should not forget though, that not every sequence of bytes is valid UTF-8. So the decode_utf8() operation may fail. See Encode perldoc for the error handling details.
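
As an illustration, here is one possible way to handle that, using the CHECK argument of Encode::decode(). Without it, malformed bytes are quietly replaced by the U+FFFD substitution character; with Encode::FB_CROAK the call dies instead. The helper name decode_utf8_strict is made up for this sketch.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-8 bytes strictly and return undef
# (instead of a partly broken string) when the input is malformed.
sub decode_utf8_strict {
    my ($bytes) = @_;    # a copy, because a CHECK value may modify its argument
    my $chars = eval { Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ) };
    return $chars;       # undef if decode() died
}

for my $input ( "Hello \xE2\x98\xBA!", "Hello \xE2\x98!" ) {
    my $decoded = decode_utf8_strict($input);
    print defined($decoded) ? "valid UTF-8\n" : "malformed UTF-8\n";
}
```

The second input is cut short in the middle of the smiley’s byte sequence, so it is reported as malformed.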

(Another way to let perl accept the UTF-8 data as such is with a pack “U0C*”, unpack “C*” hack. But you probably shouldn’t do that.)

If you get data in another encoding (not UTF-8), convert it to Unicode explicitly. Again, Encode module, decode() function:

require Encode;
my $ustring = Encode::decode( 'iso-8859-1', $input );

Another example: UTF-8 data from CGI

In ACIS we produce HTML pages in UTF-8.  We expect the HTML form input to be UTF-8 as well.  To manipulate it, we tell perl about the encoding:

require Encode;
require CGI;
my $query = CGI->new;
my $form_input = {};
foreach my $name ( $query->param ) {
  my @val = $query->param( $name );
  foreach ( @val ) {
    $_ = Encode::decode_utf8( $_ );   # decode every value of the parameter
  }
  $name = Encode::decode_utf8( $name );
  if ( scalar @val == 1 ) {
    $form_input->{$name} = $val[0];   # single value: save it as is
  } else {
    $form_input->{$name} = \@val;  # save value as an array ref
  }
}

This builds a ready- and safe-to-use hash of input parameters.

Solution #2: Specify IO encoding layers for your filehandles

Starting with Perl 5.8, a filehandle can have an encoding specified for it. Perl will then convert all input from the file automatically into its internal Unicode encoding. It will mark the values read from it accordingly with the utf8 flag. Equally, perl can convert output to a specific encoding for a filehandle. Additionally, perl checks that the data you output is valid for the filehandle’s encoding.

So, if you read data from a file or another input stream, and you expect UTF-8 data there, warn perl:

if ( open( FILE, "<:utf8", $fname ) ) {
  . . . 
}

or, in case of our simple test,

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
binmode DATA, ":utf8";
my $ustring2 = <DATA>;

print "$ustring1$ustring2";
__DATA__
Hello ☺!


This should print two equal lines, but it still produces the annoying warning. That’s because we still print the unicode-containing value to a filehandle that is not prepared for that: STDOUT. (And it happens implicitly, since print prints there by default.) The “Wide character in print warning” section below shows the fix.

Similarly, if you open a file as:

open FILE, "<:encoding(iso-8859-7)", $filename;

its content will be assumed to be in iso-8859-7 encoding. Perl will use that to interpret the file’s data correctly, i.e. to convert it to the internal UTF-8.

(Here and below, the ISO-8859-7 encoding is just an example.  Any of the perl-supported encodings may be used.)
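
To see the layers at work in both directions, here is a small round-trip sketch. It writes three Greek letters to a temporary file through an iso-8859-7 layer and reads them back through the same layer. The temporary file (from File::Temp) and the lexical filehandles are my choices for the example, not something the scripts above use.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "\x{3B1}\x{3B2}\x{3B3}\n";    # Greek small letters alpha, beta, gamma

# Create a temporary file just to get a name to work with.
my ( $tmp, $filename ) = tempfile( UNLINK => 1 );
close $tmp;

# Write through the layer: characters go in, ISO-8859-7 bytes reach the file.
open( my $out, '>:encoding(iso-8859-7)', $filename )
    or die "Cannot write $filename: $!";
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# Read back through the same layer: bytes come in, characters come out.
open( my $in, '<:encoding(iso-8859-7)', $filename )
    or die "Cannot read $filename: $!";
my $read_back = <$in>;
close $in;

printf "file size: %d bytes, string length: %d characters, round trip %s\n",
    -s $filename, length($read_back), $read_back eq $text ? 'ok' : 'failed';
```

Each Greek letter takes one byte in the file, yet the string you read back consists of proper Unicode characters again.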

Solution #3: Global Unicode setting in Perl

And there is yet another way to approach your coding/encoding problems. It is to command perl to treat all your program’s input and output as UTF-8 by default. -C is a perl switch which lets you do that. Just put -CS on the perl command line.

Alternatively, use the PERL_UNICODE environment variable. It has to be set in the environment where you execute perl, for instance:

god@world:~$ PERL_UNICODE=S perl script.pl

This would command perl to assume UTF-8 on the standard input, output and error streams; that is what S stands for. To make UTF-8 the default for the files your script opens as well, add D, as in PERL_UNICODE=SD or -CSD. (Unfortunately and contrary to my expectations this does not have an impact on the special DATA filehandle. So this is not a solution to our problem showcase script.)

You can also specify UTF-8-ness for just your stdin or just stdout or just stderr. Read the section on -C in perlrun for full details.
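
If you are not sure which setting your script actually runs with, the special variable ${^UNICODE} holds the numeric value of the -C / PERL_UNICODE setting. A quick sketch that decodes the three standard-stream bits (1, 2 and 4, as listed in perlrun); the %std_bit hash is only my way of naming them:

```perl
use strict;
use warnings;

# ${^UNICODE} reflects the -C switch or the PERL_UNICODE variable.
# Bits 1, 2 and 4 stand for STDIN, STDOUT and STDERR (together: "S").
my %std_bit = ( STDIN => 1, STDOUT => 2, STDERR => 4 );

for my $name ( sort keys %std_bit ) {
    my $is_on = ${^UNICODE} & $std_bit{$name};
    printf "%-6s UTF-8 by default: %s\n", $name, $is_on ? 'yes' : 'no';
}
```

Run it once as plain perl and once with -CS (or PERL_UNICODE=S) to compare.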

Wide character in print warning

The warning happens when you output a Unicode string to a non-unicode filehandle. What is a "non-unicode filehandle", you ask? That’s one with no unicode-compatible IO layer on it (see the Solution #2 section above).

The right way to fix this is to specify the output encoding explicitly, with the binmode() function or in your open() call. For example, open your file this way:

open FILE, ">:utf8", $filename;

To print UTF-8 to standard output (or standard error), as in our case, we do:

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
binmode DATA, ":utf8";
my $ustring2 = <DATA>;
binmode STDOUT, ":utf8";
print "$ustring1$ustring2";
__DATA__
Hello ☺!


Now that should finally print two equal lines (correctly) and produce no warning!
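
If you want to check this without staring at a terminal, here is a sketch that prints the same string to two in-memory filehandles, one without a layer and one with :utf8, and counts the warnings. The in-memory handles and the $SIG{__WARN__} hook are devices of this example only.

```perl
use strict;
use warnings;

my $string = "Hello \x{263A}!\n";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# 1. A handle with no encoding layer: perl warns about the wide character.
open( my $plain, '>', \my $plain_buffer )
    or die "Cannot open in-memory file: $!";
print {$plain} $string;
close $plain;

# 2. A handle with the :utf8 layer: no warning.
open( my $layered, '>:utf8', \my $utf8_buffer )
    or die "Cannot open in-memory file: $!";
print {$layered} $string;
close $layered;

printf "warnings caught: %d\n", scalar @warnings;
print "first warning: $warnings[0]" if @warnings;
```

Only the first print triggers the “Wide character in print” warning; the handle with the layer stays quiet.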

The wrong way to avoid the warning is to turn off the utf8 flag on your to-be-printed data. Then the characters will turn into bytes, and perl will push them to a bytes-filehandle smoothly. But you don’t need that, really.

On the other hand, if you open a file as:

open FILE, ">:encoding(iso-8859-7)", $filename;

the stuff you print will be output in iso-8859-7 encoding, transcoded automatically. But ISO-8859-7 covers only a small part of Unicode, so any character it cannot represent (our smiley, for instance) will produce a warning instead of being written out properly.
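
If you would rather find out in advance whether a string fits the target encoding, one option is to try Encode::encode() with Encode::FB_CROAK and catch the failure. A sketch, with a made-up helper name fits_encoding:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if every character of $string can be
# represented in $encoding, false otherwise.
sub fits_encoding {
    my ( $string, $encoding ) = @_;    # $string is a copy, safe to pass with a CHECK value
    my $ok = eval { Encode::encode( $encoding, $string, Encode::FB_CROAK ); 1 };
    return $ok ? 1 : 0;
}

# Greek small alpha and our smiley.
for my $char ( "\x{3B1}", "\x{263A}" ) {
    my $verdict = fits_encoding( $char, 'iso-8859-7' ) ? 'fits' : 'does not fit';
    printf "U+%04X %s iso-8859-7\n", ord($char), $verdict;
}
```

The Greek letter fits, the smiley does not.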

The right strategy: summary

If you can, use a Unicode encoding (such as UTF-8) to store and process your data. Always make sure perl knows which encoding your data comes in and goes out in. Make sure all your Unicode-containing scalars have the utf8 flag on. Then you can safely concatenate strings. Then you can use Unicode-related regular expressions, which give you great powers for international (multi-language) text processing.

To achieve that, you may need to know all the ways data gets into your program. As soon as you get some input, mark it as Unicode or convert it to Unicode and sleep well.

Sometimes data comes into your program already in Unicode and you shouldn’t worry. For instance, XML parsers return string values with the utf8 flag “on”. (Unless you do something weird, like getting it in original form from the parser, which you shouldn’t do anyway.) In the above example we explicitly include a unicode character into a string ($ustring1) and perl knows its encoding.

But when you read data from input streams, from a database or from environment variables (like parameters in CGI), you need to tell perl about its encoding.

Use the PERL_UNICODE environment variable (or the -C switch) to force UTF-8 IO layers on your input and/or output filehandles.

Further reading

Do you still want to do it yourself?  You can hire me to do your perl & unicode coding.
Man pages (perldocs): perluniintro, perlunicode, perlrun (the -C switch), Encode, binmode().

Comments are welcome.