Programmer's Life: Recipe 19.13. Manipulating UTF-8 Text

Recipe 19.13. Manipulating UTF-8 Text

19.13.1. Problem

You want to work with UTF-8-encoded text in your programs. For example, you want to properly calculate the length of multibyte strings and make sure that all text is output as proper UTF-8-encoded characters.

19.13.2. Solution

Use a combination of PHP functions for the variety of tasks that UTF-8 compliance demands.

If the mbstring extension is available, use its string functions for UTF-8-aware string manipulation. Example 19-26 uses the mb_strlen( ) function to compute the number of characters in each of two UTF-8-encoded strings.

Example 19-26. Using mb_strlen( )

<?php
// Set the encoding properly
mb_internal_encoding('UTF-8');
// ö is two bytes
$name = 'Kurt Gödel';
// Each of these Hangul characters is three bytes
$dinner = '불고기';

$name_len_bytes = strlen($name);
$name_len_chars = mb_strlen($name);

$dinner_len_bytes = strlen($dinner);
$dinner_len_chars = mb_strlen($dinner);

print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";
?>

Example 19-26 prints:

Kurt Gödel is 11 bytes and 10 chars
불고기 is 9 bytes and 3 chars

The iconv extension, which is available by default in PHP 5, also offers a few multibyte-aware string manipulation functions, as shown in Example 19-27.

Example 19-27. Using iconv

<?php
// Set the encoding properly
iconv_set_encoding('internal_encoding','UTF-8');
// ö is two bytes
$name = 'Kurt Gödel';
// Each of these Hangul characters is three bytes
$dinner = '불고기';

$name_len_bytes = strlen($name);
$name_len_chars = iconv_strlen($name);

$dinner_len_bytes = strlen($dinner);
$dinner_len_chars = iconv_strlen($dinner);

print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";

print "The seventh character of $name is " . iconv_substr($name,6,1) . "\n";
print "The last two characters of $dinner are " . iconv_substr($dinner,-2);
?>

Use the optional third argument to functions such as htmlentities( ) and htmlspecialchars( ) that instructs them to treat input as UTF-8 encoded, as shown in Example 19-28.

Example 19-28. UTF-8 HTML encoding

<?php
$encoded_name = htmlspecialchars($_POST['name'], ENT_QUOTES, 'UTF-8');
$encoded_dinner = htmlentities($_POST['dinner'], ENT_QUOTES, 'UTF-8');
?>
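To make the difference between the two functions concrete, here is a minimal sketch. The input string is a hypothetical stand-in for form data, and the file is assumed to be saved as UTF-8. htmlspecialchars( ) escapes only the HTML metacharacters, while htmlentities( ) also replaces characters that have named entities.

```php
<?php
// Hypothetical input standing in for $_POST['name'] (not from the original recipe)
$name = '<b>Kurt Gödel</b>';

// Escapes <, >, &, and quotes; leaves ö as a UTF-8 character
$special = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// Also converts ö to its named entity
$entities = htmlentities($name, ENT_QUOTES, 'UTF-8');

print $special . "\n";   // &lt;b&gt;Kurt Gödel&lt;/b&gt;
print $entities . "\n";  // &lt;b&gt;Kurt G&ouml;del&lt;/b&gt;
```

Either result is safe to place in a page that is served as UTF-8. The htmlentities( ) form has the side benefit of being pure ASCII.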

19.13.3. Discussion

Eternal vigilance is the price of proper character encoding, at least until PHP 6 is released. If you've followed the instructions in Recipes 19.11 and 19.12, data coming into your program should be UTF-8 encoded and browsers will properly handle data coming out of your program as UTF-8 encoded. This leaves you with two responsibilities: to operate on strings in a UTF-8-aware manner and to generate text that is UTF-8 encoded.

Fulfilling the first responsibility is made easier once you have adopted the fundamental credo of internationalization awareness: a character is not a byte. The PHP-specific corollary to this axiom is that PHP's string functions only know about bytes, not characters. For example, the strlen( ) function counts the number of bytes in a string, not the number of characters. In the prelapsarian days of ISO-8859-1 encoding, this wasn't a problem: each of the 256 characters in the character set took up one byte. A UTF-8-encoded character, on the other hand, uses between one and four bytes. The mbstring and iconv extensions provide alternatives for some string functions that operate on a character-by-character basis, not a byte-by-byte basis. These functions are listed in Table 19-3.

Table 19-3. Character-Based Functions
Regular function	mbstring function	iconv function
strlen( )	mb_strlen( )	iconv_strlen( )
strpos( )	mb_strpos( )	iconv_strpos( )
strrpos( )	mb_strrpos( )	iconv_strrpos( )
substr( )	mb_substr( )	iconv_substr( )
strtolower( )	mb_strtolower( )	-
strtoupper( )	mb_strtoupper( )	-
substr_count( )	mb_substr_count( )	-
ereg( )	mb_ereg( )	-
eregi( )	mb_eregi( )	-
ereg_replace( )	mb_ereg_replace( )	-
eregi_replace( )	mb_eregi_replace( )	-
split( )	mb_split( )	-
mail( )	mb_send_mail( )	-
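As a rough illustration of why the character-based versions matter, the sketch below runs a few of the functions from the table side by side on the same string. It assumes the mbstring extension is loaded and the file is saved as UTF-8. mb_check_encoding( ) is not part of the table; it is used here only to show that the byte-based cut produced invalid UTF-8.

```php
<?php
mb_internal_encoding('UTF-8');
$name = 'Kurt Gödel';

// Byte-oriented: the first seven bytes end in the middle of the two-byte ö
$byte_cut = substr($name, 0, 7);
// Character-oriented: the first seven characters end after a whole ö
$char_cut = mb_substr($name, 0, 7);

var_dump(mb_check_encoding($byte_cut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($char_cut, 'UTF-8')); // bool(true)

// Offsets differ too: strpos( ) counts bytes, mb_strpos( ) counts characters
print strpos($name, 'del') . "\n";     // 8
print mb_strpos($name, 'del') . "\n";  // 7

// Case conversion that knows about non-ASCII letters
print mb_strtoupper($name) . "\n";     // KURT GÖDEL
```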

For mbstring to work properly, it needs to be told to use the UTF-8 encoding scheme. As in Example 19-26, you can do this in script with the mb_internal_encoding( ) function. Or to set this value system-wide, set the mbstring.internal_encoding configuration directive to UTF-8.

iconv has similar needs. Use the iconv_set_encoding( ) function as in Example 19-27 or set the iconv.internal_encoding configuration directive.
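If you would rather not depend on a global setting, most of these functions also accept the encoding as an optional final argument. The following is a small sketch of that approach, assuming both extensions are loaded and the file is saved as UTF-8.

```php
<?php
$name = 'Kurt Gödel';

// An explicit encoding argument overrides the configured internal encoding
print mb_strlen($name, 'UTF-8') . "\n";           // 10
// Told the string is ISO-8859-1, mbstring treats every byte as a character
print mb_strlen($name, 'ISO-8859-1') . "\n";      // 11
print iconv_strlen($name, 'UTF-8') . "\n";        // 10

// The same argument works for substring extraction
print mb_substr($name, 6, 1, 'UTF-8') . "\n";     // ö
print iconv_substr($name, 6, 1, 'UTF-8') . "\n";  // ö
```

Passing the encoding on each call is more verbose, but it keeps library code from breaking when it runs under a configuration you don't control.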

mbstring provides alternatives for the ereg family of regular expression functions. However, you can always use UTF-8 strings with the PCRE (preg_*( )) regular expression functions. The u modifier tells a preg function that the pattern string is UTF-8 encoded and enables the use of various Unicode properties in patterns. Example 19-29 uses the "lowercase letter" Unicode property to count the number of lowercase letters in each of two strings.

Example 19-29. UTF-8 regular expression matching

<?php
$name = 'Kurt Gödel';
$dinner = '불고기';

$name_lower = preg_match_all('/\p{Ll}/u',$name,$match);
$dinner_lower = preg_match_all('/\p{Ll}/u',$dinner,$match);

print "There are $name_lower lowercase letters in $name. \n";
print "There are $dinner_lower lowercase letters in $dinner. \n";
?>

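Example 19-29 prints the following. Hangul syllables have no case, so they belong to the "other letter" (Lo) property and none of them match \p{Ll}:

There are 7 lowercase letters in Kurt Gödel.
There are 0 lowercase letters in 불고기.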
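The effect of the u modifier is easiest to see by running the same pattern with and without it. This is a minimal sketch, assuming the file is saved as UTF-8. The character-splitting idiom with preg_split( ) is an addition that does not appear in the recipe itself.

```php
<?php
$name = 'Kurt Gödel';

// Without u, "." matches one byte at a time, so ö is matched twice
$bytes = preg_match_all('/./', $name, $match);
// With u, "." matches one whole character at a time
$chars = preg_match_all('/./u', $name, $match);
print "$bytes bytes, $chars characters\n";   // 11 bytes, 10 characters

// An empty pattern with u splits the string into an array of characters
$letters = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);
print $letters[6] . "\n";                     // ö

// Other Unicode properties work the same way, e.g., uppercase letters
$upper = preg_match_all('/\p{Lu}/u', $name, $match);
print "$upper uppercase letters\n";           // 2 uppercase letters
```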

Other functions help you translate between other character encodings and UTF-8. The utf8_encode( ) and utf8_decode( ) functions move strings between the ISO-8859-1 encoding and UTF-8. Because ISO-8859-1 is the default encoding in many situations, these functions are a handy way to bring non-UTF-8-aware data into compliance. For example, the dictionaries that the pspell extension uses often have their entries encoded in ISO-8859-1. In Example 19-30, the utf8_encode( ) function is necessary to turn the output of pspell_suggest( ) into a proper UTF-8-encoded string.

Example 19-30. Applying UTF-8 encoding to ISO-8859-1 strings

<?php
$lang = isset($_GET['lang']) ? $_GET['lang'] : 'en';
$word = isset($_GET['word']) ? $_GET['word'] : 'asparagus';

$ps = pspell_new($lang);
$check = pspell_check($ps, $word);

print htmlspecialchars($word,ENT_QUOTES,'UTF-8');
print $check ? ' is ' : ' is not ';
print ' found in the dictionary.';
print '<hr/>';

if (! $check) {
    $suggestions = pspell_suggest($ps, $word);
    if (count($suggestions)) {
        print 'Suggestions: <ul>';
        foreach ($suggestions as $suggestion) {
            $utf8suggestion = utf8_encode($suggestion);
            $safesuggestion = htmlspecialchars($utf8suggestion,
                                               ENT_QUOTES,'UTF-8');
            print "<li>$safesuggestion</li>";
        }
        print '</ul>';
    }
}
?>
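The same conversion can also be done with the general-purpose converters in the two extensions discussed earlier, which accept arbitrary source and target encodings. This sketch is not from the recipe. It assumes mbstring and iconv are both loaded and uses a byte escape to build an ISO-8859-1 string regardless of how the file is saved.

```php
<?php
// "Gödel" in ISO-8859-1: ö is the single byte 0xF6
$latin1 = "G\xF6del";

// mb_convert_encoding( ) takes the target encoding first, then the source
$via_mbstring = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
// iconv( ) takes the source encoding first, then the target
$via_iconv = iconv('ISO-8859-1', 'UTF-8', $latin1);

print strlen($latin1) . "\n";                   // 5
print strlen($via_mbstring) . "\n";             // 6
var_dump($via_mbstring === $via_iconv);         // bool(true)

// The unconverted string is not valid UTF-8
var_dump(mb_check_encoding($latin1, 'UTF-8'));  // bool(false)
```

Note the opposite argument order of the two functions; it is an easy source of mistakes.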

It may ease the cognitive burden of proper character encoding to think of it as a task similar to HTML entity encoding. In each case, text must be processed so that it is appropriately formatted for a particular context. With entity encoding, that usually means running data retrieved from an external source through htmlentities( ) or htmlspecialchars( ). With character encoding, it means turning everything into UTF-8 before you process it, using a character-aware function for string operations, and ensuring strings are UTF-8 encoded before outputting them.
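Those three steps can be sketched as a pair of small helpers. The function names to_utf8( ) and safe_excerpt( ) are hypothetical, and the fallback rule (treat anything that is not valid UTF-8 as ISO-8859-1) is an assumption that will not suit every data source.

```php
<?php
// Hypothetical helper: turn input into UTF-8 before processing.
// Assumes that input which is not valid UTF-8 is ISO-8859-1.
function to_utf8($text) {
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: character-aware truncation, then escaping for output
function safe_excerpt($text, $max_chars) {
    $utf8 = to_utf8($text);
    $short = mb_substr($utf8, 0, $max_chars, 'UTF-8');
    return htmlspecialchars($short, ENT_QUOTES, 'UTF-8');
}

// The ISO-8859-1 byte 0xF6 becomes a proper UTF-8 ö, and < is escaped
$excerpt = safe_excerpt("Kurt G\xF6del <logician>", 12);
print $excerpt . "\n";   // Kurt Gödel &lt;
```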

19.13.4. See Also

Recipes 19.11 and 19.12 for setting up your programs for receiving and sending UTF-8-encoded strings; documentation on mbstring at http://www.php.net/mbstring, on iconv at http://www.php.net/iconv, on htmlentities( ) at http://www.php.net/htmlentities, on htmlspecialchars( ) at http://www.php.net/htmlspecialchars, on PCRE pattern syntax at http://www.php.net/reference.pcre.pattern.syntax, on utf8_encode( ) at http://www.php.net/utf8_encode, and on utf8_decode( ) at http://www.php.net/utf8_decode.

Good background resources on managing PHP and character set issues include:

"An Overview on Globalizing Oracle PHP Applications" by Peter Linsley (http://www.oracle.com/technology/tech/php/pdf/globalizing_oracle_php_applications.pdf)

Character Sets/Character Encoding Issues on the PHP WACT Wiki (http://www.phpwact.org/php/i18n/charsets)

"Characters vs. Bytes" by Tim Bray (http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF)

"A Tutorial on Character Code Issues" by Jukka Korpela (http://www.cs.tut.fi/~jkorpela/chars.html)

Programmer's Life

Wednesday, December 30, 2009
