24.4. String Sort Order
Oracle provides advanced linguistic sort capabilities that extend far beyond the basic A-Z sorting you get with an ORDER BY clause . The complexities found in international character sets do not lend themselves to simple alphabetic sort, or collation, rules. Chinese, for example, includes approximately 40,000 characters (although many are not used regularly). Not exactly something you can easily put into song like the ABCs! Also, not something that can be defined by simple sort rules.
String sort order is an obvious programming problem that is often overlooked in globalization until a product makes its way to the test team. Ordering the names of employees, cities of operation, or customers is much more complicated than "A comes before B." Consider the following factors:
Some European characters include accents that change the meaning of the base letter. The letter "a" is different from "ä." Which letter should come first in an ORDER BY? Concatenated characters are used in other languages. Spanish, for example, combines two "l"s to form a new character that is pronounced differently and that comes after the letter "l" in the Spanish alphabet. Using a basic sort from the English alphabet, the word "pollo" would be listed before the word "polo." In the Spanish alphabet, it would come afterwards. Each locale may have its own sort rules, so a multilingual application must be able to support different sort rules based on the text. Even regions that use the same alphabet may still have different sort rules.
Oracle provides three types of sorts: binary, monolingual, and multilingual.
The Unicode Consortium makes its collation algorithm public, so we can compare the output from our queries for these three types of sorts with the expected results shown at http://www.unicode.org/charts/collation/.
24.4.1. Binary Sort
The binary sort is based on the character's encoding scheme, and the values associated with each character. It is very fast, and is especially useful if you know that all of your data is stored in uppercase. The binary sort is most useful for ASCII characters, sorting the English alphabet, but even then you may find some undesired results. ASCII encoding, for example, orders uppercase letters before their lowercase representation.
The following example from the g11n sample schema shows the results of a binary sort of German cities:
SELECT city FROM store_location WHERE country <> 'JP' ORDER BY city;
The ordered list of results is as follows:
CITY ------------------------------------ Abdêra Asselfingen Astert Auufer Außernzell Aßlar Boßdorf Bösleben Bötersen Cremlingen Creuzburg Creußen Oberahr Zudar Zühlen Ängelholm ...lsen
Note the order of the cities in the list. Ängelholm is ordered after Zühlen. Character codes are sorted in ascending order, providing the A-Z ordering you see above. These anomalies result from the inclusion of characters outside the English alphabet, here being treated as special characters.
24.4.2. Monolingual Sort
Most European languages will benefit from monolingual sort capabilities within Oracle. Rather than using basic codes associated with the character's encoding scheme like the binary sort, two values are used to determine the relative position of a character in a monolingual sort. Each character has a major value, related to the base character, and a minor value, based on case and diacritic differences. If sort order can be determined by a difference in major value, the ordering is complete. Should there be a tie in major value, the minor value is used. This way, characters such as ö can be ordered relative to the character o accurately.
To see the impact this has on the ordering of these additional characters, let's return to the prior example and modify the session to use the German monolingual sort:
ALTER SESSION SET NLS_SORT = german;
Upon confirmation that the session has been altered, run the following query:
SELECT city FROM store_location WHERE country <> 'JP' ORDER BY city;
Notice that the order is different now that NLS_SORT is set to "german":
CITY ------------------------------------ Abdêra Ängelholm Aßlar Asselfingen Astert Außernzell Auufer Boßdorf Bösleben Bötersen Cremlingen Creußen Creuzburg Oberahr ...lsen Zudar Zühlen
This is much better! The treatment of non-English characters is now in line with the expected German order of characters. By the way, if you do not want to (or cannot) alter your session NLS settings, you can use the NLSSORT function and the NLS_SORT parameter as part of your query. The following function demonstrates the use of this parameter:
CREATE OR REPLACE FUNCTION city_order_by_func (v_order_by IN VARCHAR2) RETURN sys_refcursor IS v_city sys_refcursor; BEGIN OPEN v_city FOR SELECT city FROM store_location ORDER BY NLSSORT (city, 'NLS_SORT=' || v_order_by); RETURN v_city; END city_order_by_func; /
As seen above, the NLSSORT function and the NLS_SORT parameter provide quick ways to change the results of an ORDER BY clause. For this function, which is used in the remaining examples, the NLS_SORT parameter is taken as input. Table 24-5 lists some of the available NLS_SORT parameter values in Oracle Database 10g.
Table 24-5. Monolingual NLS_SORT parameter valuesarabic | xcatalan | japanese | arabic_match | german | polish | arabic_abj_sort | xgerman | punctuation | arabic_abj_match | german_din | xpunctuation | azerbaijani | xgerman_din | romanian | xazerbaijani | hungarian | russian | bengali | xhungarian | spanish | bulgarian | icelandic | xspanish | canadian french | indonesian | west_euorpean | catalan | italian | xwest_european |
The list of parameters in this table includes some values prefixed with an x. These extended sort parameters allow for special cases in a language. In our cities example, some names have the character ß. This sharp s in German can be treated as ss for the purposes of the sort. We tried a sort using NLS_SORT = german. Let's try xgerman to see the difference:
VARIABLE v_city_order REFCURSOR CALL city_order_by_func('xgerman') INTO :v_city_order; PRINT v_city_order
This displays the following:
CITY ------------------------------------ ... Abdêra Ängelholm Asselfingen Aßlar Astert Außernzell Auufer ...
Using xgerman rather than german, the word Aßlar drops to fourth in the list instead of third.
24.4.3. Multilingual Sort
Monolingual sort, as you might guess from the use of "mono" in its name, has a major drawback. It can operate on only one language at a time based on the NLS_SORT parameter setting. Oracle also provides multilingual sort capabilities that allow you to sort for multiple locales.
The multilingual sort , based on the ISO 14651 standard, supports more than 1.1 million characters in a single sort. Not only does Oracle's multilingual support cover characters defined as part of the Unicode 4.0 standard, but it can also support supplementary characters.
Where binary sorts are determined by character encoding scheme codes, and monolingual sorts are developed in two stages, multilingual sorts use a three-step approach to determine the order of characters:
The first level, or primary level, separates base characters. The secondary level distinguishes base characters from diacritics that modify the base characters. Finally, the tertiary level separates by case.
For Asian languages, characters are also differentiated by number of strokes, PinYin, or radicals.
NLSSORT and NLS_SORT are still used for multilingual sorts, but the parameters change. GENERIC_M works well for most Western languages, and provides the base for the remaining list of values. Table 24-6 lists the NLS_SORT parameter values available for multilingual sorts.
Table 24-6. Multilingual NLS_SORT parameter valuesgeneric_m | | | | canadian_m | japanese_m | schinese_pinyin_m | tchinese_radical_m | danish_m | korean_m | schinese_radical_m | tchinese_stroke_m | french_m | schinese_stroke_m | spanish_m | thai_m |
To demonstrate the multilingual sort functionality, we can modify the call to use the generic_m value:
VARIABLE v_city_order REFCURSOR CALL city_order_by_func('generic_m') INTO :v_city_order; PRINT v_city_order
This returns the following ordered list of cities:
CITY ------------------------------------ Abdêra Ängelholm Asselfingen Aßlar Astert .. Zudar Zühlen
|
No comments:
Post a Comment