Saturday, December 19, 2009

Interlude: Unicode and Generic Characters











Before proceeding, it is necessary to explain briefly how Windows processes characters and differentiates between 8- and 16-bit characters and generic characters. The topic is a large one and beyond the book's scope, so we only provide the minimum detail required, rather than a complete chapter.


Windows supports standard 8-bit characters (type char or CHAR) and (except on Windows 9x) wide 16-bit characters (WCHAR, which is defined to be the C wchar_t type). The Microsoft documentation refers to the 8-bit character set as ASCII, but it is actually the Latin-1 character set; for convenience, in this discussion we use ASCII too. The wide character support that Windows provides, using the Unicode UTF-16 encoding, is capable of representing symbols and letters in all major languages, including English, French, Spanish, German, Japanese, and Chinese.


Here are the steps commonly used to write a generic Windows application that can be built to use either Unicode (UTF-16, as opposed to UCS-4, for example) or 8-bit ASCII characters.








1.
Define all characters and strings using the generic types TCHAR, LPTSTR, and LPCTSTR.

2.
Include the definitions #define UNICODE and #define _UNICODE in all source modules to get Unicode wide characters (ANSI C wchar_t); otherwise, with UNICODE and _UNICODE undefined, TCHAR will be equivalent to CHAR (ANSI C char). The definitions must precede the #include <windows.h> statement and are frequently supplied on the compiler command line. The first preprocessor variable (UNICODE) controls the Windows function definitions, and the second (_UNICODE) controls the C library.

3.
Character buffer lengths (as used, for example, in ReadFile) can be calculated using sizeof (TCHAR).

4.
Use the collection of generic C library string and character I/O functions in <tchar.h>. Representative functions are _fgettc, _itot (for itoa), _stprintf (for sprintf), _tcscpy (for strcpy), _ttoi, _totupper, _totlower, and _tprintf.[1] See the on-line help for a complete and extensive list. All these definitions depend on _UNICODE. This collection is not complete. memchr is an example of a function without a wide character implementation. New versions are provided as required.

[1] The underscore character (_) indicates that a function or keyword is provided by Microsoft C, and the letters t and T denote a generic character. Other development systems provide similar capability but may use different names or keywords.

5.
Constant strings should be in one of three forms. Use these conventions for single characters as well. The first two forms are ANSI C; the third, the _T macro (equivalently, TEXT and _TEXT), is supplied with the Microsoft C compiler.



"This string uses 8-bit characters"

L"This string uses 16-bit characters"

_T ("This string uses generic characters")

6.
Include <tchar.h> after <windows.h> to get required definitions for text macros and generic C library functions.
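The steps above can be sketched as follows. This is a minimal illustration rather than code from the original text: the helper names StringBytes, BufferChars, and Greeting are hypothetical, and the #else branch is an assumption that supplies 8-bit stand-ins so the sketch also compiles where <windows.h> is not available.

```c
/* For the wide character build, define UNICODE and _UNICODE before these
   includes (step 2), typically on the compiler command line. */
#ifdef _WIN32
#include <windows.h>
#include <tchar.h>              /* after <windows.h> (step 6) */
#else
/* Assumption: 8-bit stand-ins so the sketch builds off Windows. */
#include <string.h>
typedef char TCHAR;
typedef const TCHAR *LPCTSTR;
#define _T(s) s
#define _tcslen strlen
#endif
#include <stddef.h>

/* Hypothetical helper: bytes occupied by a generic string, excluding the
   terminator. Byte-oriented calls such as ReadFile take sizes in bytes. */
size_t StringBytes (LPCTSTR s)
{
    return _tcslen (s) * sizeof (TCHAR);    /* steps 3 and 4 */
}

/* Hypothetical helper: how many generic characters fit in cb bytes. */
size_t BufferChars (size_t cb)
{
    return cb / sizeof (TCHAR);             /* step 3 */
}

/* Hypothetical helper: a generic string constant (steps 1 and 5). */
LPCTSTR Greeting (void)
{
    return _T ("Hello");
}
```

In the 8-bit build, StringBytes (_T ("abc")) is 3; in the Unicode build on Windows it is 6, because each TCHAR then occupies two bytes.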


Windows uses Unicode 16-bit characters (UTF-16 encoding) throughout, and NTFS file names and pathnames are represented internally in Unicode. If the UNICODE macro is defined, wide character strings are required by Windows calls; otherwise, 8-bit character strings are converted to wide characters. If the program is to run under Windows 9x, which is not a Unicode system, do not define the UNICODE and _UNICODE macros. Under NT and CE, the definition is optional unless the executable is to run under Windows 9x as well.


All future program examples will use TCHAR instead of the normal char for characters and character strings unless there is a clear reason to deal with individual 8-bit characters. Similarly, the type LPTSTR indicates a pointer to a generic string, and LPCTSTR indicates, in addition, a constant string. At times, this choice will add some clutter to the programs, but it is the only choice that allows the flexibility necessary to develop and test applications in either Unicode or 8-bit character form so that the program can be easily converted to Unicode at a later date. Furthermore, this choice is consistent with common, if not universal, industry practice.


It is worthwhile to examine the system include files to see how TCHAR and the system function interfaces are defined and how they depend on whether or not UNICODE and _UNICODE are defined. A typical entry is of the following form:



#ifdef UNICODE
#define TCHAR WCHAR
#else
#define TCHAR CHAR
#endif
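The same selection mechanism can be reproduced in a few lines of portable C. The sketch below is an illustration under stated assumptions, not a copy of the system headers: MY_UNICODE, MY_TCHAR, MY_TEXT, TcharSize, and LiteralBytes are hypothetical names standing in for UNICODE, TCHAR, and the text macros.

```c
#include <stddef.h>
#include <wchar.h>

/* Uncomment to select wide characters, as defining UNICODE would. */
/* #define MY_UNICODE */

#ifdef MY_UNICODE
typedef wchar_t MY_TCHAR;
#define MY_TEXT_(s) L ## s      /* paste the L prefix onto the literal */
#else
typedef char MY_TCHAR;
#define MY_TEXT_(s) s           /* leave the literal unchanged */
#endif

/* The extra macro level lets an argument that is itself a macro
   expand before the L prefix is pasted on. */
#define MY_TEXT(s) MY_TEXT_ (s)

/* Hypothetical helper: size in bytes of one generic character. */
size_t TcharSize (void)
{
    return sizeof (MY_TCHAR);
}

/* Hypothetical helper: size in bytes of a generic literal, terminator included. */
size_t LiteralBytes (void)
{
    return sizeof (MY_TEXT ("abc"));    /* four characters: a, b, c, and the terminator */
}
```

With MY_UNICODE undefined, both helpers report sizes in units of char; defining it switches every use of MY_TCHAR and MY_TEXT to wchar_t without touching the code that uses them.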


Alternative Generic String Processing Functions


String comparisons can use lstrcmp and lstrcmpi rather than the generic _tcscmp and _tcscmpi to account for the specific language and region, or locale, at run time and also to perform word rather than string comparisons.[2] String comparisons simply compare the numerical values of the characters, whereas word comparisons consider locale-specific word order. The two methods can give opposite results for string pairs such as coop/co-op and were/we're.

[2] Historically, the l prefix was used to indicate a long pointer to the character string parameters.


There is also a group of Windows functions for dealing with Unicode characters and strings. These functions handle locale characteristics transparently. Typical functions are CharUpper, which can operate on strings as well as individual characters, and IsCharAlphaNumeric. Other string functions include CompareString (which is locale-specific) and MultiByteToWideChar. Multibyte characters in Windows 3.1 and 9x extend the 8-bit character set to allow double bytes to represent character sets for languages of the Far East. The generic C library functions (_tprintf and the like) and the Windows functions (CharUpper and the like) will both appear in upcoming examples to demonstrate their use. Examples in later chapters will rely mostly on the generic C library.


The Generic Main Function


The C main function, with its argument list (argv []), should be replaced by the macro _tmain. The macro expands to either main or wmain depending on the _UNICODE definition. _tmain is defined in <tchar.h>, which must be included after <windows.h>. A typical main program heading, then, would look like this:



#include <windows.h>
#include <tchar.h>
int _tmain (int argc, LPTSTR argv [])
{
...
}


The Microsoft C _tmain function also supports a third parameter for environment strings. This nonstandard extension is also common in UNIX.


Function Definitions


A function such as CreateFile is defined through a preprocessor macro as CreateFileA when UNICODE is not defined and as CreateFileW when UNICODE is defined. The definitions also describe the string parameters as 8-bit or wide character strings. Consequently, compilers will report a source code error, such as an illegal parameter to CreateFile, as an error in the use of CreateFileA or CreateFileW.
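The macro selection described here can be imitated with ordinary C functions. The following is a sketch under assumptions, not the actual Windows declarations: NameLengthA, NameLengthW, NameLength, and ExampleLength are hypothetical names standing in for a pair such as CreateFileA and CreateFileW and the generic name that callers use.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical 8-bit and wide implementations of the same operation. */
size_t NameLengthA (const char *name)
{
    return strlen (name);
}

size_t NameLengthW (const wchar_t *name)
{
    return wcslen (name);
}

/* Callers write NameLength; the preprocessor picks the implementation. */
#ifdef UNICODE
#define NameLength NameLengthW
#else
#define NameLength NameLengthA
#endif

/* Hypothetical caller: the literal form must match the selected version. */
size_t ExampleLength (void)
{
#ifdef UNICODE
    return NameLength (L"data.txt");
#else
    return NameLength ("data.txt");
#endif
}
```

Because the compiler sees only the expanded name, passing a string of the wrong character type to NameLength produces a diagnostic that mentions NameLengthA or NameLengthW, which mirrors the behavior described for CreateFile.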








