Interlude: Unicode and Generic Characters
Before proceeding, it is necessary to explain briefly how Windows processes characters and how it differentiates among 8-bit, 16-bit, and generic characters. The topic is a large one and beyond the book's scope, so we provide only the minimum detail required rather than a complete chapter.
Windows supports standard 8-bit characters (type char or CHAR) and (except on Windows 9x) wide 16-bit characters (WCHAR, which is defined to be the C wchar_t type). The Microsoft documentation refers to the 8-bit character set as ASCII, but it is actually the Latin-1 character set; for convenience, this discussion uses ASCII too. The wide character support that Windows provides, using the Unicode UTF-16 encoding, can represent symbols and letters in all major languages, including English, French, Spanish, German, Japanese, and Chinese.
Here are the steps commonly used to write a generic Windows application that can be built to use either Unicode (UTF-16, as opposed to UCS-4, for example) or 8-bit ASCII characters.
Windows uses Unicode 16-bit characters (UTF-16 encoding) throughout, and NTFS file names and pathnames are represented internally in Unicode. If the UNICODE macro is defined, Windows calls require wide character strings; otherwise, 8-bit character strings are converted to wide characters. If the program is to run under Windows 9x, which is not a Unicode system, do not define the UNICODE and _UNICODE macros. Under NT and CE, the definition is optional unless the executable is to run under Windows 9x as well.
All future program examples will use TCHAR instead of the normal char for characters and character strings unless there is a clear reason to deal with individual 8-bit characters. Similarly, the type LPTSTR indicates a pointer to a generic string, and LPCTSTR indicates, in addition, a constant string. At times, this choice will add some clutter to the programs, but it is the only choice that allows the flexibility necessary to develop and test applications in either Unicode or 8-bit character form, so that a program can easily be converted to Unicode at a later date. Furthermore, this choice is consistent with common, if not universal, industry practice.
It is worthwhile to examine the system include files to see how TCHAR and the system function interfaces are defined and how they depend on whether or not UNICODE and _UNICODE are defined. A typical entry is of the following form:
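
As a hedged illustration of how such conditional definitions work, here is a minimal sketch in portable C. It does not reproduce the system include files: SKETCH_UNICODE, SK_TCHAR, SK_T, sk_tcslen, SK_LPTSTR, SK_LPCTSTR, and sk_bytes_needed are hypothetical names chosen for this example, standing in for the roles that UNICODE, TCHAR, _T, _tcslen, LPTSTR, and LPCTSTR play in the text.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* SKETCH_UNICODE is a hypothetical stand-in for UNICODE/_UNICODE. */
#ifdef SKETCH_UNICODE
typedef wchar_t SK_TCHAR;            /* wide build: plays the role of TCHAR */
#define SK_T(s) L##s                 /* turns "abc" into the wide literal L"abc" */
#define sk_tcslen wcslen             /* generic length maps to the wide function */
#else
typedef char SK_TCHAR;               /* 8-bit build: plays the role of TCHAR */
#define SK_T(s) s                    /* literal is left unchanged */
#define sk_tcslen strlen             /* generic length maps to the 8-bit function */
#endif

typedef SK_TCHAR *SK_LPTSTR;         /* pointer to a generic string */
typedef const SK_TCHAR *SK_LPCTSTR;  /* pointer to a constant generic string */

/* Bytes needed to store a generic string, including its terminator.
   Multiplying by sizeof(SK_TCHAR) keeps the result right in both builds. */
size_t sk_bytes_needed(SK_LPCTSTR s)
{
    return (sk_tcslen(s) + 1) * sizeof(SK_TCHAR);
}
```

Calling sk_bytes_needed(SK_T("abc")) compiles unchanged in either build; only the presence or absence of the SKETCH_UNICODE definition decides which character type and library function are used.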
Alternative Generic String Processing Functions
String comparisons can use lstrcmp and lstrcmpi rather than the generic _tcscmp and _tcscmpi to account for the specific language and region, or locale, at run time and also to perform word rather than string comparisons.[2] String comparisons simply compare the numerical values of the characters, whereas word comparisons consider locale-specific word order. The two methods can give opposite results for string pairs such as coop/co-op and were/we're.
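
To make the coop/co-op case concrete, the sketch below contrasts the standard strcmp with a deliberately simplified word comparison. This is an assumption-laden stand-in, not the algorithm lstrcmp uses: the hypothetical sketch_word_compare merely ignores hyphens and apostrophes on a first pass and, when the strings are otherwise equal, places the spelling without the punctuation first. It ignores locale entirely.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: punctuation this sketch skips on the first pass. */
static int is_ignored(char c)
{
    return c == '-' || c == '\'';
}

/* Simplified word comparison (NOT the lstrcmp algorithm).
   Returns -1, 0, or 1. */
int sketch_word_compare(const char *a, const char *b)
{
    const char *p = a, *q = b;
    size_t la, lb;
    int r;

    for (;;) {
        while (is_ignored(*p)) p++;      /* skip hyphens and apostrophes */
        while (is_ignored(*q)) q++;
        if (*p != *q || *p == '\0') break;
        p++;
        q++;
    }
    if (*p != *q)                        /* a real letter differs */
        return (unsigned char)*p < (unsigned char)*q ? -1 : 1;

    /* Equal apart from ignored punctuation: shorter spelling sorts first. */
    la = strlen(a);
    lb = strlen(b);
    if (la != lb)
        return la < lb ? -1 : 1;

    r = strcmp(a, b);                    /* final tie-break on raw values */
    return (r > 0) - (r < 0);
}
```

With these definitions, strcmp("coop", "co-op") is positive because 'o' has a larger character value than '-', while sketch_word_compare("coop", "co-op") is negative; the pair were/we're behaves the same way.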
There is also a group of Windows functions for dealing with Unicode characters and strings. These functions handle locale-specific characteristics transparently. Typical functions are CharUpper, which can operate on strings as well as individual characters, and IsCharAlphaNumeric. Other string functions include CompareString (which is locale-specific) and MultiByteToWideChar. Multibyte characters in Windows 3.1 and 9x extend the 8-bit character set to allow double bytes to represent character sets for languages of the Far East. The generic C library functions (_tprintf and the like) and the Windows functions (CharUpper and the like) will both appear in upcoming examples to demonstrate their use. Examples in later chapters will rely mostly on the generic C library.
The Generic Main Function
The C main function, with its argument list (argv []), should be replaced by the macro _tmain. The macro expands to either main or wmain depending on the _UNICODE definition. _tmain is defined in <tchar.h>, which must be included after <windows.h>. A typical main program heading, then, would look like this:
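
A minimal sketch of a program built around such a heading follows. The include order and the _tmain signature follow the description above; everything else is an assumption made for illustration. In particular, echo_args and the SKETCH_BUILD_PROGRAM switch are hypothetical, and the block under #else supplies 8-bit stand-ins only so that the sketch also compiles on systems without <windows.h>.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <tchar.h>                 /* included after <windows.h> */
#else
/* Assumed stand-ins so the sketch compiles off Windows (8-bit build only). */
typedef char TCHAR;
typedef TCHAR *LPTSTR;
#define _T(s) s
#define _tprintf printf
#define _tmain main
#endif

/* Prints each command-line argument after the program name and
   returns argc - 1, the number of such arguments. */
int echo_args(int argc, LPTSTR argv[])
{
    int i;
    for (i = 1; i < argc; i++)
        _tprintf(_T("%s\n"), argv[i]);
    return argc - 1;
}

#ifdef SKETCH_BUILD_PROGRAM        /* hypothetical switch: define to get an entry point */
/* Generic heading: expands to main or wmain depending on _UNICODE. */
int _tmain(int argc, LPTSTR argv[])
{
    echo_args(argc, argv);
    return 0;
}
#endif
```

Keeping the work in a separate function is only a convenience for this sketch; the point is that the heading and the string handling use generic types and macros, so the same source serves both character widths.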
The Microsoft C _tmain function also supports a third parameter for environment strings. This nonstandard extension is also common in UNIX.
Function Definitions
A function such as CreateFile is defined through a preprocessor macro as CreateFileA when UNICODE is not defined and as CreateFileW when UNICODE is defined. The definitions also describe the string parameters as 8-bit or wide character strings. Consequently, compilers will report a source code error, such as an illegal parameter to CreateFile, as an error in the use of CreateFileA or CreateFileW.
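
The name-mapping mechanism can be sketched in portable C as follows. This is an illustration under stated assumptions rather than the actual Windows header text: NameLengthA, NameLengthW, NameLength, SK_TEXT, and SKETCH_UNICODE are hypothetical names that play the parts of CreateFileA, CreateFileW, CreateFile, a generic literal macro, and UNICODE.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical pair of entry points, one per character width. */
size_t NameLengthA(const char *name)    { return strlen(name); }
size_t NameLengthW(const wchar_t *name) { return wcslen(name); }

/* SKETCH_UNICODE is a hypothetical stand-in for UNICODE. */
#ifdef SKETCH_UNICODE
#define NameLength NameLengthW     /* generic name resolves to the wide version */
#define SK_TEXT(s) L##s
#else
#define NameLength NameLengthA     /* generic name resolves to the 8-bit version */
#define SK_TEXT(s) s
#endif
```

Source code calls NameLength(SK_TEXT("data.txt")) in both builds. Because the preprocessor substitutes the suffixed name before compilation, a mismatched argument, such as a plain "data.txt" literal in a build that defines SKETCH_UNICODE, would be diagnosed against NameLengthW rather than NameLength, which mirrors the behavior described for CreateFile.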
Saturday, December 19, 2009