Wednesday, December 30, 2009

Recipe 19.13. Manipulating UTF-8 Text













19.13.1. Problem


You want to work with UTF-8-encoded text in your programs. For example, you want to properly calculate the length of multibyte strings and make sure that all text is output as proper UTF-8-encoded characters.




19.13.2. Solution


Use a combination of PHP functions for the variety of tasks that UTF-8 compliance demands.


If the mbstring extension is available, use its string functions for UTF-8-aware string manipulation. Example 19-26 uses the mb_strlen( ) function to compute the number of characters in each of two UTF-8-encoded strings.


Example 19-26. Using mb_strlen( )



<?php
// Set the encoding properly
mb_internal_encoding('UTF-8');
// ö is two bytes
$name = 'Kurt Gödel';
// Each of these Hangul characters is three bytes
$dinner = '불고기';

$name_len_bytes = strlen($name);
$name_len_chars = mb_strlen($name);

$dinner_len_bytes = strlen($dinner);
$dinner_len_chars = mb_strlen($dinner);

print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";
?>




Example 19-26 prints:


Kurt Gödel is 11 bytes and 10 chars
불고기 is 9 bytes and 3 chars



The iconv extension, which is available by default in PHP 5, also offers a few multibyte-aware string manipulation functions, as shown in Example 19-27.


Example 19-27. Using iconv



<?php
// Set the encoding properly
iconv_set_encoding('internal_encoding','UTF-8');
// ö is two bytes
$name = 'Kurt Gödel';
// Each of these Hangul characters is three bytes
$dinner = '불고기';

$name_len_bytes = strlen($name);
$name_len_chars = iconv_strlen($name);

$dinner_len_bytes = strlen($dinner);
$dinner_len_chars = iconv_strlen($dinner);

print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";

print "The seventh character of $name is " . iconv_substr($name,6,1) . "\n";
print "The last two characters of $dinner are " . iconv_substr($dinner,-2);
?>




Use the optional third argument to functions such as htmlentities( ) and htmlspecialchars( ) that instructs them to treat input as UTF-8 encoded, as shown in Example 19-28.


Example 19-28. UTF-8 HTML encoding



<?php
$encoded_name = htmlspecialchars($_POST['name'], ENT_QUOTES, 'UTF-8');
$encoded_dinner = htmlentities($_POST['dinner'], ENT_QUOTES, 'UTF-8');
?>
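
To see what each function does with the same input, the following sketch escapes one sample string both ways. The sample string is an assumption chosen for illustration, not data from the recipe, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample input; assumes this file is saved as UTF-8
$input = 'Gödel <b>"proof"</b>';

// htmlspecialchars() escapes only the characters that are special in HTML
$special = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// htmlentities() also converts characters that have named entities
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

print "$special\n";
print "$entities\n";
```

With htmlspecialchars( ), the ö passes through unchanged as UTF-8 bytes. With htmlentities( ), it becomes the &ouml; entity.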






19.13.3. Discussion


Eternal vigilance is the price of proper character encoding, at least until PHP 6 is released. If you've followed the instructions in Recipes 19.11 and 19.12, data coming into your program should be UTF-8 encoded and browsers will properly handle data coming out of your program as UTF-8 encoded. This leaves you with two responsibilities: to operate on strings in a UTF-8-aware manner and to generate text that is UTF-8 encoded.


Fulfilling the first responsibility is made easier once you have adopted the fundamental credo of internationalization awareness: a character is not a byte. The PHP-specific corollary to this axiom is that PHP's string functions only know about bytes, not characters. For example, the strlen( ) function counts the number of bytes in a string, not the number of characters. In the prelapsarian days of ISO-8859-1 encoding, this wasn't a problem: each of the 256 characters in the character set took up one byte. A UTF-8-encoded character, on the other hand, uses between one and four bytes. The mbstring and iconv extensions provide alternatives for some string functions that operate on a character-by-character basis, not a byte-by-byte basis. These functions are listed in Table 19-3.
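
A minimal sketch of why the distinction matters, assuming the mbstring extension is loaded and the file is saved as UTF-8: cutting a string at a byte offset can split a multibyte character in half, while cutting at a character offset cannot. The variable names are illustrative only.

```php
<?php
// Assumes this file is saved as UTF-8, so ö occupies two bytes
$name = 'Kurt Gödel';

// Byte-based: the seventh byte is only the first half of the two-byte ö
$byte_cut = substr($name, 0, 7);

// Character-based: the seventh character is the whole ö
$char_cut = mb_substr($name, 0, 7, 'UTF-8');

// mb_check_encoding() reports whether a string is valid in an encoding
$byte_cut_ok = mb_check_encoding($byte_cut, 'UTF-8');
$char_cut_ok = mb_check_encoding($char_cut, 'UTF-8');

print "substr() result is " . ($byte_cut_ok ? 'valid' : 'invalid') . " UTF-8\n";
print "mb_substr() result is " . ($char_cut_ok ? 'valid' : 'invalid') . " UTF-8\n";
```

The substr( ) result ends in a dangling lead byte, so it is no longer valid UTF-8 and may display as a replacement character in a browser.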


Table 19-3. Character-Based Functions

Regular function      mbstring function      iconv function
strlen( )             mb_strlen( )           iconv_strlen( )
strpos( )             mb_strpos( )           iconv_strpos( )
strrpos( )            mb_strrpos( )          iconv_strrpos( )
substr( )             mb_substr( )           iconv_substr( )
strtolower( )         mb_strtolower( )       -
strtoupper( )         mb_strtoupper( )       -
substr_count( )       mb_substr_count( )     -
ereg( )               mb_ereg( )             -
eregi( )              mb_eregi( )            -
ereg_replace( )       mb_ereg_replace( )     -
eregi_replace( )      mb_eregi_replace( )    -
split( )              mb_split( )            -
mail( )               mb_send_mail( )        -
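
As a rough illustration of a few rows from the table, the sketch below compares byte-based and character-based results on the same string. It assumes the mbstring extension is loaded and the file is saved as UTF-8; the sample strings are chosen for illustration.

```php
<?php
// Assumes the mbstring extension is loaded and this file is saved as UTF-8
$name = 'Kurt Gödel';

// Position of "d": strpos() counts bytes, mb_strpos() counts characters
$byte_pos = strpos($name, 'd');
$char_pos = mb_strpos($name, 'd', 0, 'UTF-8');

// Case conversion that understands non-ASCII letters
$upper = mb_strtoupper($name, 'UTF-8');

// Count non-overlapping occurrences of a multibyte substring
$count = mb_substr_count('ö-ö-ö', 'ö', 'UTF-8');

print "Byte offset $byte_pos, character offset $char_pos\n";
print "$upper\n";
print "$count occurrences\n";
```

The two offsets differ by one because the two-byte ö precedes the "d".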



For mbstring to work properly, it needs to be told to use the UTF-8 encoding scheme. As in Example 19-26, you can do this in a script with the mb_internal_encoding( ) function. To set this value system-wide instead, set the mbstring.internal_encoding configuration directive to UTF-8.


iconv has similar needs. Use the iconv_set_encoding( ) function as in Example 19-27 or set the iconv.internal_encoding configuration directive.
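
Both extensions also accept the encoding as an optional argument on individual calls, which avoids depending on a global setting. The sketch below shows that approach, assuming both extensions are loaded and the file is saved as UTF-8. PHP releases from 5.6 onward deprecate the per-extension internal_encoding directives in favor of the default_charset setting, so an explicit argument can be the more portable choice.

```php
<?php
// Assumes mbstring and iconv are loaded and this file is saved as UTF-8
$name = 'Kurt Gödel';

// Name the encoding per call instead of relying on a global setting
$mb_chars = mb_strlen($name, 'UTF-8');
$iconv_chars = iconv_strlen($name, 'UTF-8');

// Naming a single-byte encoding makes every byte count as one character
$mb_wrong = mb_strlen($name, 'ISO-8859-1');

// The same optional argument works for the substring functions
$seventh = iconv_substr($name, 6, 1, 'UTF-8');

print "mbstring: $mb_chars chars, iconv: $iconv_chars chars\n";
print "Treated as ISO-8859-1: $mb_wrong chars\n";
print "Seventh character: $seventh\n";
```

The ISO-8859-1 count matches strlen( ), which shows what happens when the configured encoding does not match the data.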


mbstring provides alternatives for the ereg family of regular expression functions. However, you can always use UTF-8 strings with the PCRE (preg_*( )) regular expression functions. The u modifier tells a preg function to treat the pattern and subject strings as UTF-8 encoded and enables the use of various Unicode properties in patterns. Example 19-29 uses the "lowercase letter" Unicode property to count the number of lowercase letters in each of two strings.


Example 19-29. UTF-8 regular expression matching



<?php
$name = 'Kurt Gödel';
$dinner = '불고기';

$name_lower = preg_match_all('/\p{Ll}/u',$name,$match);
$dinner_lower = preg_match_all('/\p{Ll}/u',$dinner,$match);

print "There are $name_lower lowercase letters in $name. \n";
print "There are $dinner_lower lowercase letters in $dinner. \n";
?>




Example 19-29 prints:


There are 7 lowercase letters in Kurt Gödel.
There are 0 lowercase letters in 불고기.


Hangul syllables have no case distinction, so the \p{Ll} property matches none of them.
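
The u modifier is also handy when mbstring is not available. The following sketch, which assumes only the PCRE functions and a file saved as UTF-8, splits a string into characters and counts uppercase letters with the \p{Lu} property.

```php
<?php
// Assumes this file is saved as UTF-8
$name = 'Kurt Gödel';

// With /u, the empty pattern matches between characters rather than
// between bytes, so this splits the string into whole characters
$chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);

// Count uppercase letters with the \p{Lu} Unicode property
$upper_count = preg_match_all('/\p{Lu}/u', $name, $match);

print count($chars) . " characters\n";
print "$upper_count uppercase letters: " . implode(', ', $match[0]) . "\n";
```

Without the u modifier, the same preg_split( ) call would break the ö into two single-byte pieces.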



Other functions help you translate between other character encodings and UTF-8. The utf8_encode( ) and utf8_decode( ) functions move strings between the ISO-8859-1 encoding and UTF-8. Because ISO-8859-1 is the default encoding in many situations, these functions are a handy way to bring non-UTF-8-aware data into compliance. For example, the dictionaries that the pspell extension uses often have their entries encoded in ISO-8859-1. In Example 19-30, the utf8_encode( ) function is necessary to turn the output of pspell_suggest( ) into a proper UTF-8-encoded string.


Example 19-30. Applying UTF-8 encoding to ISO-8859-1 strings



<?php
$lang = isset($_GET['lang']) ? $_GET['lang'] : 'en';
$word = isset($_GET['word']) ? $_GET['word'] : 'asparagus';

$ps = pspell_new($lang);
$check = pspell_check($ps, $word);

print htmlspecialchars($word,ENT_QUOTES,'UTF-8');
print $check ? ' is ' : ' is not ';
print ' found in the dictionary.';
print '<hr/>';

if (! $check) {
    $suggestions = pspell_suggest($ps, $word);
    if (count($suggestions)) {
        print 'Suggestions: <ul>';
        foreach ($suggestions as $suggestion) {
            // Convert the ISO-8859-1 suggestion to UTF-8 before escaping it
            $utf8suggestion = utf8_encode($suggestion);
            $safesuggestion = htmlspecialchars($utf8suggestion,
                                               ENT_QUOTES,'UTF-8');
            print "<li>$safesuggestion</li>";
        }
        print '</ul>';
    }
}
?>
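
The utf8_encode( ) function only handles ISO-8859-1 input. For the same conversion, or for other source encodings, the iconv( ) and mb_convert_encoding( ) functions accept the encoding names as arguments. A minimal sketch, assuming both extensions are loaded; the sample string is hypothetical:

```php
<?php
// Assumes the iconv and mbstring extensions are loaded.
// "\xF6" is the single byte for ö in ISO-8859-1 (hypothetical sample input).
$latin1 = "G\xF6del";

// iconv() takes the source encoding first, then the target encoding
$via_iconv = iconv('ISO-8859-1', 'UTF-8', $latin1);

// mb_convert_encoding() takes the target encoding first, then the source
$via_mbstring = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

print strlen($latin1) . " bytes before, " . strlen($via_iconv) . " bytes after\n";
```

Note the opposite argument order of the two functions, which is an easy source of mistakes.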




It may ease the cognitive burden of proper character encoding to think of it as a task similar to HTML entity encoding. In each case, text must be processed so that it is appropriately formatted for a particular context. With entity encoding, that usually means running data retrieved from an external source through htmlentities( ) or htmlspecialchars( ). With character encoding, it means turning everything into UTF-8 before you process it, using a character-aware function for string operations, and ensuring strings are UTF-8 encoded before outputting them.
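
The three steps in that last sentence can be sketched as follows. The ensure_utf8( ) helper is hypothetical, not part of PHP, and its fallback to ISO-8859-1 is an assumption that will not suit every data source. The sketch also assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper, not part of PHP: returns $text as UTF-8, assuming
// that anything that is not already valid UTF-8 is ISO-8859-1.
function ensure_utf8($text) {
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// 1. Turn input into UTF-8 ("\xF6" is ö in ISO-8859-1)
$name = ensure_utf8("Kurt G\xF6del <logician>");

// 2. Operate on characters, not bytes
$short = mb_substr($name, 0, 10, 'UTF-8');

// 3. Escape for the HTML context, naming the encoding
$safe = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

print "$short\n";
print "$safe\n";
```

Guessing the source encoding is inherently unreliable, so where the real encoding of incoming data is known, it is better to name it directly than to fall back on a default.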




19.13.4. See Also


Recipes 19.11 and 19.12 for setting up your programs for receiving and sending UTF-8-encoded strings; documentation on mbstring at http://www.php.net/mbstring, on iconv at http://www.php.net/iconv, on htmlentities( ) at http://www.php.net/htmlentities, on htmlspecialchars( ) at http://www.php.net/htmlspecialchars, on PCRE pattern syntax at http://www.php.net/reference.pcre.pattern.syntax, on utf8_encode( ) at http://www.php.net/utf8_encode, and on utf8_decode( ) at http://www.php.net/utf8_decode.


Good background resources on managing PHP and character set issues include:


  • "An Overview on Globalizing Oracle PHP Applications" by Peter Linsley (http://www.oracle.com/technology/tech/php/pdf/globalizing_oracle_php_applications.pdf)

  • Character Sets/Character Encoding Issues on the PHP WACT Wiki (http://www.phpwact.org/php/i18n/charsets)

  • "Characters vs. Bytes" by Tim Bray (http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF)

  • "A Tutorial on Character Code Issues" by Jukka Korpela (http://www.cs.tut.fi/~jkorpela/chars.html)














