Recipe 19.13. Manipulating UTF-8 Text
19.13.1. Problem
You want to work with UTF-8-encoded text in your programs. For example, you want to properly calculate the length of multibyte strings and make sure that all text is output as proper UTF-8-encoded characters.
19.13.2. Solution
Use a combination of PHP functions for the variety of tasks that UTF-8 compliance demands.
If the mbstring extension is available, use its string functions for UTF-8-aware string manipulation. Example 19-26 uses the mb_strlen( ) function to compute the number of characters in each of two UTF-8-encoded strings.
Example 19-26. Using mb_strlen( )
<?php
// Set the encoding properly
mb_internal_encoding('UTF-8');

// ö is two bytes
$name = 'Kurt Gödel';
// Each of these Hangul characters is three bytes
$dinner = '불고기';

$name_len_bytes = strlen($name);
$name_len_chars = mb_strlen($name);

$dinner_len_bytes = strlen($dinner);
$dinner_len_chars = mb_strlen($dinner);

print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";
?>
Example 19-26 prints:
Kurt Gödel is 11 bytes and 10 chars
불고기 is 9 bytes and 3 chars
The iconv extension, which is available by default in PHP 5, also offers a few multibyte-aware string manipulation functions, as shown in Example 19-27.
Example 19-27. Using iconv
<?php
// Set the encoding properly
iconv_set_encoding('internal_encoding','UTF-8');

// ö is two bytes
$name = 'Kurt Gödel';
// Each of these Hangul characters is three bytes
$dinner = '불고기';

$name_len_bytes = strlen($name);
$name_len_chars = iconv_strlen($name);

$dinner_len_bytes = strlen($dinner);
$dinner_len_chars = iconv_strlen($dinner);

print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";

print "The seventh character of $name is " . iconv_substr($name,6,1) . "\n";
print "The last two characters of $dinner are " . iconv_substr($dinner,-2);
?>
Use the optional third argument to functions such as htmlentities( ) and htmlspecialchars( ) that instructs them to treat input as UTF-8 encoded, as shown in Example 19-28.
Example 19-28. UTF-8 HTML encoding
<?php
$encoded_name = htmlspecialchars($_POST['name'], ENT_QUOTES, 'UTF-8');
$encoded_dinner = htmlentities($_POST['dinner'], ENT_QUOTES, 'UTF-8');
?>
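The difference between the two escaping functions is easiest to see side by side. The following is a minimal sketch, assuming the script file itself is saved as UTF-8; the literal $name value is a hypothetical stand-in for form input such as $_POST['name'].

```php
<?php
// Hypothetical input standing in for $_POST['name']; assumed to be valid UTF-8.
$name = 'Kurt <b>Gödel</b>';

// htmlspecialchars() escapes only the HTML-special characters and leaves ö alone.
$special = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// htmlentities() also replaces ö with its named entity.
$entities = htmlentities($name, ENT_QUOTES, 'UTF-8');

print "$special\n";   // Kurt &lt;b&gt;Gödel&lt;/b&gt;
print "$entities\n";  // Kurt &lt;b&gt;G&ouml;del&lt;/b&gt;
```

When the page is served as UTF-8, htmlspecialchars( ) is usually enough, because the browser can display ö directly.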
19.13.3. Discussion
Eternal vigilance is the price of proper character encoding, at least until PHP 6 is released. If you've followed the instructions in Recipes 19.11 and 19.12, data coming into your program should be UTF-8 encoded and browsers will properly handle data coming out of your program as UTF-8 encoded. This leaves you with two responsibilities: to operate on strings in a UTF-8-aware manner and to generate text that is UTF-8 encoded.
Fulfilling the first responsibility is made easier once you have adopted the fundamental credo of internationalization awareness: a character is not a byte. The PHP-specific corollary to this axiom is that PHP's string functions only know about bytes, not characters. For example, the strlen( ) function counts the number of bytes in a string, not the number of characters. In the prelapsarian days of ISO-8859-1 encoding, this wasn't a problem: each of the 256 characters in the character set took up one byte. A UTF-8-encoded character, on the other hand, uses between one and four bytes. The mbstring and iconv extensions provide alternatives for some string functions that operate on a character-by-character basis, not a byte-by-byte basis. These functions are listed in Table 19-3.
Table 19-3. Character-Based Functions
Regular function | mbstring function | iconv function
---|---|---
strlen( ) | mb_strlen( ) | iconv_strlen( )
strpos( ) | mb_strpos( ) | iconv_strpos( )
strrpos( ) | mb_strrpos( ) | iconv_strrpos( )
substr( ) | mb_substr( ) | iconv_substr( )
strtolower( ) | mb_strtolower( ) | -
strtoupper( ) | mb_strtoupper( ) | -
substr_count( ) | mb_substr_count( ) | -
ereg( ) | mb_ereg( ) | -
eregi( ) | mb_eregi( ) | -
ereg_replace( ) | mb_ereg_replace( ) | -
eregi_replace( ) | mb_eregi_replace( ) | -
split( ) | mb_split( ) | -
mail( ) | mb_send_mail( ) | -
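To see why the character-based alternatives matter, compare a few of the byte-based functions with their mbstring counterparts on the same string. This is a small sketch, assuming the mbstring extension is loaded and the script file is saved as UTF-8.

```php
<?php
mb_internal_encoding('UTF-8');
$name = 'Kurt Gödel';

// strpos() counts bytes, so positions after the two-byte ö are off by one.
$byte_pos = strpos($name, 'del');     // 8
$char_pos = mb_strpos($name, 'del');  // 7

// substr() can cut a multibyte character in half; mb_substr() cuts on
// character boundaries.
$byte_cut = substr($name, 0, 7);      // ends with only the first byte of ö
$char_cut = mb_substr($name, 0, 7);   // 'Kurt Gö'

// mb_strtoupper() knows that Ö is the uppercase form of ö.
$upper = mb_strtoupper($name);        // 'KURT GÖDEL'

print "strpos: $byte_pos, mb_strpos: $char_pos\n";
print 'substr result is valid UTF-8: '
    . var_export(mb_check_encoding($byte_cut, 'UTF-8'), true) . "\n";
print "mb_substr result: $char_cut\n";
print "mb_strtoupper result: $upper\n";
```

The byte-based substr( ) call leaves a dangling lead byte at the end of its result, which is the kind of damage that later shows up as a replacement character in a browser.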
For mbstring to work properly, it needs to be told to use the UTF-8 encoding scheme. As in Example 19-26, you can do this in a script with the mb_internal_encoding( ) function. To set this value system-wide instead, set the mbstring.internal_encoding configuration directive to UTF-8.
iconv has similar needs. Use the iconv_set_encoding( ) function as in Example 19-27 or set the iconv.internal_encoding configuration directive.
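If you would rather not depend on a global setting, the length and substring functions in both extensions also accept the encoding as an optional final argument. The sketch below assumes both extensions are loaded and the script file is saved as UTF-8.

```php
<?php
$name = 'Kurt Gödel';

// Pass the encoding explicitly instead of relying on the internal encoding.
$len_mb    = mb_strlen($name, 'UTF-8');
$len_iconv = iconv_strlen($name, 'UTF-8');

// The seventh character (offset 6), one character long.
$seventh_mb    = mb_substr($name, 6, 1, 'UTF-8');
$seventh_iconv = iconv_substr($name, 6, 1, 'UTF-8');

print "mbstring: $len_mb chars, seventh is $seventh_mb\n";   // 10 chars, ö
print "iconv: $len_iconv chars, seventh is $seventh_iconv\n"; // 10 chars, ö
```

Explicit arguments make each call self-describing, at the cost of repeating 'UTF-8' throughout the code.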
mbstring provides alternatives for the ereg family of regular expression functions. However, you can always use UTF-8 strings with the PCRE (preg_*( )) regular expression functions. The u modifier tells a preg function that the pattern and subject strings are UTF-8 encoded and enables the use of various Unicode properties in patterns. Example 19-29 uses the "lowercase letter" Unicode property to count the number of lowercase letters in each of two strings.
Example 19-29. UTF-8 regular expression matching
<?php
$name = 'Kurt Gödel';
$dinner = '불고기';

$name_lower = preg_match_all('/\p{Ll}/u',$name,$match);
$dinner_lower = preg_match_all('/\p{Ll}/u',$dinner,$match);

print "There are $name_lower lowercase letters in $name. \n";
print "There are $dinner_lower lowercase letters in $dinner. \n";
?>
Example 19-29 prints:
There are 7 lowercase letters in Kurt Gödel.
There are 0 lowercase letters in 불고기.
Hangul has no letter case, so none of its characters carry the "lowercase letter" property.
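The effect of the u modifier is easy to check by running the same pattern with and without it. This is a minimal sketch, assuming the script file is saved as UTF-8.

```php
<?php
$name = 'Kurt Gödel';

// Without u, "." matches one byte, so the two-byte ö is counted twice.
$without_u = preg_match_all('/./', $name, $match);   // 11

// With u, "." matches one character.
$with_u = preg_match_all('/./u', $name, $match);     // 10

// An empty pattern with u splits a string into an array of characters.
$chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);

// Other Unicode properties work the same way; Lu is "uppercase letter".
$upper = preg_match_all('/\p{Lu}/u', $name, $match); // 2 (K and G)

print "Matches without u: $without_u, with u: $with_u\n";
print 'Characters: ' . count($chars) . ", seventh is {$chars[6]}\n";
print "Uppercase letters: $upper\n";
```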
Other functions help you translate between other character encodings and UTF-8. The utf8_encode( ) and utf8_decode( ) functions move strings between the ISO-8859-1 encoding and UTF-8. Because ISO-8859-1 is the default encoding in many situations, these functions are a handy way to bring non-UTF-8-aware data into compliance. For example, the dictionaries that the pspell extension uses often have their entries encoded in ISO-8859-1. In Example 19-30, the utf8_encode( ) function is necessary to turn the output of pspell_suggest( ) into a proper UTF-8-encoded string.
Example 19-30. Applying UTF-8 encoding to ISO-8859-1 strings
<?php
$lang = isset($_GET['lang']) ? $_GET['lang'] : 'en';
$word = isset($_GET['word']) ? $_GET['word'] : 'asparagus';

$ps = pspell_new($lang);
$check = pspell_check($ps, $word);

print htmlspecialchars($word,ENT_QUOTES,'UTF-8');
print $check ? ' is ' : ' is not ';
print ' found in the dictionary.';
print '<hr/>';

if (! $check) {
    $suggestions = pspell_suggest($ps, $word);
    if (count($suggestions)) {
        print 'Suggestions: <ul>';
        foreach ($suggestions as $suggestion) {
            // pspell returns ISO-8859-1 text, so convert before escaping
            $utf8suggestion = utf8_encode($suggestion);
            $safesuggestion = htmlspecialchars($utf8suggestion,
                                               ENT_QUOTES,'UTF-8');
            print "<li>$safesuggestion</li>";
        }
        print '</ul>';
    }
}
?>
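The byte-level effect of this kind of conversion can be seen on a short string. The sketch below does not call utf8_encode( ); it uses iconv( ) and mb_convert_encoding( ) instead, two general-purpose conversion functions from the extensions discussed earlier, which take the source and target encodings as arguments. The input is written with a hex escape so the example does not depend on how the script file is saved.

```php
<?php
// "Gödel" in ISO-8859-1: ö is the single byte 0xF6.
$latin1 = "G\xF6del";

// Convert to UTF-8 two ways. Note the different argument order:
// iconv() takes (from, to, string); mb_convert_encoding() takes
// (string, to, from).
$via_iconv = iconv('ISO-8859-1', 'UTF-8', $latin1);
$via_mb    = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Converting back recovers the original bytes.
$round_trip = iconv('UTF-8', 'ISO-8859-1', $via_iconv);

print 'ISO-8859-1 bytes: ' . bin2hex($latin1) . "\n";    // 47f664656c
print 'UTF-8 bytes:      ' . bin2hex($via_iconv) . "\n"; // 47c3b664656c
print 'Same result from both functions: '
    . var_export($via_iconv === $via_mb, true) . "\n";
```

Because these functions name both encodings explicitly, the same calls work for sources other than ISO-8859-1.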
It may ease the cognitive burden of proper character encoding to think of it as a task similar to HTML entity encoding. In each case, text must be processed so that it is appropriately formatted for a particular context. With entity encoding, that usually means running data retrieved from an external source through htmlentities( ) or htmlspecialchars( ). With character encoding, it means turning everything into UTF-8 before you process it, using a character-aware function for string operations, and ensuring strings are UTF-8 encoded before outputting them.
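The three steps in that last sentence can be sketched as a pair of small functions. Both to_utf8( ) and html_excerpt( ) are hypothetical names introduced here for illustration, and the fallback to ISO-8859-1 is an assumption about where non-UTF-8 input comes from. The validity check is only a heuristic, since some ISO-8859-1 byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: return $text as UTF-8. Bytes that are not valid
// UTF-8 are assumed to be in $assumed (ISO-8859-1 unless stated otherwise).
function to_utf8($text, $assumed = 'ISO-8859-1')
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $assumed);
}

// Hypothetical helper: normalize to UTF-8, keep at most $max characters
// with a character-aware function, then escape for HTML output.
function html_excerpt($text, $max)
{
    $utf8  = to_utf8($text);
    $short = mb_substr($utf8, 0, $max, 'UTF-8');
    return htmlspecialchars($short, ENT_QUOTES, 'UTF-8');
}

// The same text arriving in two different encodings.
$from_latin1 = html_excerpt("G\xF6del & friends", 7);
$from_utf8   = html_excerpt("G\xC3\xB6del & friends", 7);

print "$from_latin1\n";
print "$from_utf8\n";
```

Both calls produce the same escaped UTF-8 output because the encoding is normalized first, before any truncation or escaping happens.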
19.13.4. See Also
Recipes 19.11 and 19.12 for setting up your programs for receiving and sending UTF-8-encoded strings; documentation on mbstring at http://www.php.net/mbstring, on iconv at http://www.php.net/iconv, on htmlentities( ) at http://www.php.net/htmlentities, on htmlspecialchars( ) at http://www.php.net/htmlspecialchars, on PCRE pattern syntax at http://www.php.net/reference.pcre.pattern.syntax, on utf8_encode( ) at http://www.php.net/utf8_encode, and on utf8_decode( ) at http://www.php.net/utf8_decode.
Good background resources on managing PHP and character set issues include:
"An Overview on Globalizing Oracle PHP Applications" by Peter Linsley (http://www.oracle.com/technology/tech/php/pdf/globalizing_oracle_php_applications.pdf)
Character Sets/Character Encoding Issues on the PHP WACT Wiki (http://www.phpwact.org/php/i18n/charsets)
"Characters vs. Bytes" by Tim Bray (http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF)
"A Tutorial on Character Code Issues" by Jukka Korpela (http://www.cs.tut.fi/~jkorpela/chars.html)