Sunday, October 25, 2009

Section 24.4.  String Sort Order









24.4. String Sort Order




Oracle provides advanced linguistic sort capabilities that extend far beyond the basic A-Z sorting you get with an ORDER BY clause
. The complexities found in international character sets do not lend themselves to simple alphabetic sort, or collation, rules. Chinese, for example, includes approximately 40,000 characters (although many are not used regularly). Not exactly something you can easily put into song like the ABCs! Also, not something that can be defined by simple sort rules.


String sort order
is an obvious programming problem that is often overlooked in globalization until a product makes its way to the test team. Ordering the names of employees, cities of operation, or customers is much more complicated than "A comes before B." Consider the following factors:


  • Some European characters include accents that change the meaning of the base letter. The letter "a" is different from "ä." Which letter should come first in an ORDER BY?

  • Concatenated characters are used in other languages. Spanish, for example, combines two "l"s to form a new character that is pronounced differently and that comes after the letter "l" in the Spanish alphabet. Using a basic sort from the English alphabet, the word "pollo" would be listed before the word "polo." In the Spanish alphabet, it would come afterwards.

  • Each locale may have its own sort rules, so a multilingual application must be able to support different sort rules based on the text. Even regions that use the same alphabet may still have different sort rules.


Oracle provides three types of sorts: binary, monolingual, and multilingual.


The Unicode Consortium makes its collation algorithm
public, so we can compare the output from our queries for these three types of sorts with the expected results shown at http://www.unicode.org/charts/collation/.



24.4.1. Binary Sort


The binary sort

is based on the character's encoding scheme, and the values associated with each character. It is very fast, and is especially useful if you know that all of your data is stored in uppercase. The binary sort is most useful for ASCII characters, sorting the English alphabet, but even then you may find some undesired results. ASCII encoding, for example, orders uppercase letters before their lowercase representation.


The following example from the g11n sample schema shows the results of a binary sort of German cities:



SELECT city
FROM store_location
WHERE country <> 'JP'
ORDER BY city;



The ordered list of results is as follows:



CITY
------------------------------------
Abdêra
Asselfingen
Astert
Auufer
Außernzell
Aßlar
Boßdorf
Bösleben
Bötersen
Cremlingen
Creuzburg
Creußen
Oberahr
Zudar
Zühlen
Ängelholm
...lsen



Note the order of the cities in the list. Ängelholm is ordered after Zühlen. Character codes are sorted in ascending order, providing the A-Z ordering you see above. These anomalies result from the inclusion of characters outside the English alphabet, here being treated as special characters.




24.4.2. Monolingual Sort


Most European languages will benefit from monolingual sort


capabilities within Oracle. Rather than using basic codes associated with the character's encoding scheme like the binary sort, two values are used to determine the relative position of a character in a monolingual sort. Each character has a major value, related to the base character, and a minor value, based on case and diacritic differences. If sort order can be determined by a difference in major value, the ordering is complete. Should there be a tie in major value, the minor value is used. This way, characters such as ö can be ordered relative to the character o accurately.


To see the impact this has on the ordering of these additional characters, let's return to the prior example and modify the session to use the German monolingual sort:



ALTER SESSION SET NLS_SORT = german;



Upon confirmation that the session has been altered, run the following query:



SELECT city
FROM store_location
WHERE country <> 'JP'
ORDER BY city;



Notice that the order is different now that NLS_SORT is set to "german":



CITY
------------------------------------
Abdêra
Ängelholm
Aßlar
Asselfingen
Astert
Außernzell
Auufer
Boßdorf
Bösleben
Bötersen
Cremlingen
Creußen
Creuzburg
Oberahr
...lsen
Zudar
Zühlen



This is much better! The treatment of non-English characters is now in line with the expected German order of characters. By the way, if you do not want to (or cannot) alter your session NLS settings, you can use the NLSSORT function and the NLS_SORT parameter as part of your query. The following function demonstrates the use of this parameter:



CREATE OR REPLACE FUNCTION city_order_by_func (v_order_by IN VARCHAR2)
RETURN sys_refcursor
IS
v_city sys_refcursor;
BEGIN
OPEN v_city
FOR
SELECT city
FROM store_location
ORDER BY NLSSORT (city, 'NLS_SORT=' || v_order_by);
 
RETURN v_city;
END city_order_by_func;
/



As seen above, the NLSSORT function and the NLS_SORT parameter provide quick ways to change the results of an ORDER BY clause. For this function, which is used in the remaining examples, the NLS_SORT parameter is taken as input. Table 24-5 lists some of the available NLS_SORT parameter values in Oracle Database 10g.


Table 24-5. Monolingual NLS_SORT parameter values

arabic

xcatalan

japanese

arabic_match

german

polish

arabic_abj_sort

xgerman

punctuation

arabic_abj_match

german_din

xpunctuation

azerbaijani

xgerman_din

romanian

xazerbaijani

hungarian

russian

bengali

xhungarian

spanish

bulgarian

icelandic

xspanish

canadian french

indonesian

west_euorpean

catalan

italian

xwest_european



The list of parameters in this table includes some values prefixed with an x. These extended sort parameters allow for special cases in a language. In our cities example, some names have the character ß. This sharp s in German can be treated as ss for the purposes of the sort. We tried a sort using NLS_SORT = german. Let's try xgerman to see the difference:



VARIABLE v_city_order REFCURSOR
CALL city_order_by_func('xgerman') INTO :v_city_order;
PRINT v_city_order



This displays the following:



CITY
------------------------------------
...
Abdêra
Ängelholm
Asselfingen
Aßlar
Astert
Außernzell
Auufer
...



Using xgerman rather than german, the word Aßlar drops to fourth in the list instead of third.




24.4.3. Multilingual Sort



Monolingual sort, as you might guess from the use of "mono" in its name, has a major drawback. It can operate on only one language at a time based on the NLS_SORT parameter setting. Oracle also provides multilingual sort capabilities that allow you to sort for multiple locales.


The multilingual sort


, based on the ISO 14651 standard, supports more than 1.1 million characters in a single sort. Not only does Oracle's multilingual support cover characters defined as part of the Unicode 4.0 standard, but it can also support supplementary characters.


Where binary sorts are determined by character encoding scheme codes, and monolingual sorts are developed in two stages, multilingual sorts use a three-step approach to determine the order of characters:


  1. The first level, or primary level, separates base characters.

  2. The secondary level distinguishes base characters from diacritics that modify the base characters.

  3. Finally, the tertiary level separates by case.


For Asian languages, characters are also differentiated by number of strokes, PinYin, or radicals.


NLSSORT and NLS_SORT are still used for multilingual sorts, but the parameters change. GENERIC_M works well for most Western languages, and provides the base for the remaining list of values. Table 24-6 lists the NLS_SORT parameter values available for multilingual sorts.


Table 24-6. Multilingual NLS_SORT parameter values

generic_m

   

canadian_m

japanese_m

schinese_pinyin_m

tchinese_radical_m

danish_m

korean_m

schinese_radical_m

tchinese_stroke_m

french_m

schinese_stroke_m

spanish_m

thai_m



To demonstrate the multilingual
sort functionality, we can modify the call to use the generic_m value:



VARIABLE v_city_order REFCURSOR
CALL city_order_by_func('generic_m') INTO :v_city_order;
PRINT v_city_order



This returns the following ordered list of cities:



CITY
------------------------------------
Abdêra
Ängelholm
Asselfingen
Aßlar
Astert
..
Zudar
Zühlen

 

 

 

 











    No comments: