Collation
Q: My script does not sort right because the characters were assigned to Unicode code points in the wrong order. What can I do about that?
There is a misunderstanding here: Linguistically meaningful sorting is done not by comparing code point values (an approach which would fail even for English), but by assigning multi-level weights to characters or sequences of characters and then comparing those weights on each level. There are many algorithms and implementations for this; the standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary. [MS]
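As a rough illustration of the difference between code point order and multi-level comparison, here is a minimal Python sketch. The function toy_sort_key and its weights are invented for this example: it derives three levels (base letter, accent, case) from Unicode decomposition and is not the UCA or its default table.

```python
import unicodedata

def toy_sort_key(text):
    """Build a three-level key: base letters, then accents, then case.

    Toy stand-in for real collation weights; this is NOT the UCA or DUCET.
    """
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # An accent contributes only at the secondary level.
            secondary.append(ord(ch))
        else:
            primary.append(ch.lower())
            secondary.append(0)  # placeholder meaning "base letter, no accent yet"
            tertiary.append(1 if ch.isupper() else 0)
    return (primary, secondary, tertiary)

words = ["Zebra", "apple", "\u00c9mile", "eagle", "Apple"]

# Plain code point comparison: uppercase before lowercase, accented letters last.
by_code_point = sorted(words)

# Level-by-level comparison: letters first, accents and case only break ties.
by_levels = sorted(words, key=toy_sort_key)
```

With this input, code point order places "Zebra" before "apple" and pushes "Émile" to the end, while the level-based key yields apple, Apple, eagle, Émile, Zebra.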
Q: How should collations be made available?
Ideally, people should be able to specify a collation order for any set of data returned by a database query and sorted by a SQL 'ORDER BY' clause. Actual database implementations may differ in how they surface the choices of collations to users. Differing collations should also be specifiable for any comparison of strings (for example, s1 < s2), unless a strictly binary order comparison is intended. People should also be able to use collations for loose matching and string searching. For more information, see the Unicode Collation Algorithm.
Q: Where can I find out more information on implementations of collation?
For an implementation of the Unicode Collation Algorithm, see the Collator class reference in the documentation for the International Components for Unicode (ICU). The code is open source and available online.
ICU uses the tailoring rules for different locales supplied by CLDR. Those rules are all based on a root table, which itself is a tailoring of the Default Unicode Collation Element Table (DUCET) of the Unicode Collation Algorithm.
Q: How would the collation be specified, taking into account current implementations?
To specify a particular collation, clients normally supply a locale identifier. For example, specifying the locale identifier "de" indicates collation following general rules for the German language. The locale identifier may also include further specification of a collation variant, as in "de-u-co-phonebk", to indicate that the desired collation should be the one used for German phone books. See Collation Variants for details. The locale identifier may add other collation-related parameters, as in "de-u-kf-upper", which designates collation for German, but with uppercase ordered before lowercase. See CollationOptions for details.
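To make the structure of such identifiers concrete, the following Python sketch splits a locale identifier into its language part and the keywords that follow "-u-". The function name parse_collation_keywords is hypothetical, and the parsing is deliberately simplified: it assumes a well-formed identifier whose only extension is "-u-" and in which every two-letter key has at most one value subtag.

```python
def parse_collation_keywords(locale_id):
    """Split a locale identifier into (language part, {key: value}).

    Simplified sketch: assumes the only extension is "-u-" and that each
    two-letter key is followed by at most one value subtag.
    """
    subtags = locale_id.lower().split("-")
    if "u" not in subtags:
        return "-".join(subtags), {}
    u_index = subtags.index("u")
    base = "-".join(subtags[:u_index])
    keywords = {}
    key = None
    for subtag in subtags[u_index + 1:]:
        if len(subtag) == 2:
            # Two-letter subtags after "-u-" are keys such as "co" or "kf".
            key = subtag
            keywords[key] = ""
        elif key is not None:
            keywords[key] = subtag
    return base, keywords

phonebook = parse_collation_keywords("de-u-co-phonebk")
upper_first = parse_collation_keywords("de-u-kf-upper")
```

Here phonebook is ("de", {"co": "phonebk"}) and upper_first is ("de", {"kf": "upper"}). A real implementation would hand the identifier to its collation library rather than parse it by hand.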
Q: How can I modify collation results at runtime?
The locale parameters typically can also be specified at runtime via an API, as in ICU. The API may allow tailoring rules to be supplied at runtime. The API may support the merging of multiple locales plus further tailoring, for example, French + Arabic + tailoring.
Q: The Unicode Collation Algorithm is defined for a particular version of the Unicode Standard, but I am using characters from a later version of Unicode. What shall I do?
You can update to a later version of the Unicode Collation Algorithm, which is synchronized with that later version of the Unicode Standard. The UTC is committed to ensuring that the Unicode Collation Algorithm is updated in a timely manner, so that the repertoire of characters in the Default Unicode Collation Element Table stays in synch with the Unicode Standard. However, if you need to stay with a particular version of the Unicode Collation Algorithm for any reason, such as maintaining binary compatibility of generated key weights, note that the algorithm does assign a default sorting order to every valid code point, assigned or unassigned. Any characters that are not assigned in the repertoire for that version will be given derived, implicit weights in code point order after all of the assigned characters. See Implicit Weights for more details.
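The derivation of implicit weights can be sketched as follows. The formula and the base value 0xFBC0 for unassigned code points follow the implicit weight rules as described in UTS #10; the function name is invented here, and the sketch does not select the different bases used for Han ideographs and some other scripts. Check the version of the specification you implement before relying on it.

```python
def implicit_primary_weights(code_point, base=0xFBC0):
    """Derive the two primary weights for a code point with no table entry.

    base=0xFBC0 is the value for unassigned code points; choosing the
    other bases (for example, for Han ideographs) is not handled here.
    """
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return (aaaa, bbbb)

# Arbitrary sample code points, used only to show that the derived
# weight pairs preserve code point order.
samples = [0x0378, 0x7FFF, 0x8000, 0x10FFFF]
weights = [implicit_primary_weights(cp) for cp in samples]
```

Because the first weight grows with the high bits of the code point and the second with the low 15 bits, comparing the pairs in order reproduces code point order.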
Q: Is transitive consistency maintained by the UCA?
Yes: for any strings A, B, and C, if A < B and B < C, then A < C. However, implementers must be careful to accurately reproduce the results of the Unicode Collation Algorithm as they optimize their own algorithms. It is easy to perform careless optimizations, especially with incremental comparison algorithms, that break transitive consistency.
Another implementation detail to check is that the proper distinction is maintained between the bases to which accents apply. For example, the sequence <u-macron, u-diaeresis-macron> should compare as less than the sequence <u-macron-diaeresis, u-macron>. The secondary distinction, based on the weighting of the accents, must be correctly associated with the primary weights of each respective base letter.
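The accent example can be checked with a small Python sketch. The weights are invented (they are not DUCET values); the only property that matters is that every base letter carries its own secondary weight, lower than any accent weight, so that accents stay associated with their base.

```python
import unicodedata

BASE_SECONDARY = 0x20  # toy "no accent" weight carried by every base letter

def secondary_weights(text, keep_base_weights=True):
    """Return toy secondary weights for text, in decomposed (NFD) order."""
    weights = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            weights.append(0x100 + ord(ch))  # every accent outranks BASE_SECONDARY
        elif keep_base_weights:
            weights.append(BASE_SECONDARY)
    return weights

# <u-macron, u-diaeresis-macron> and <u-macron-diaeresis, u-macron>
first = "u\u0304" + "u\u0308\u0304"
second = "u\u0304\u0308" + "u\u0304"

correct = secondary_weights(first) < secondary_weights(second)

# A careless optimization that drops the base letters' secondary weights
# can no longer tell which accent belongs to which letter.
broken_equal = (secondary_weights(first, keep_base_weights=False)
                == secondary_weights(second, keep_base_weights=False))
```

With base weights kept, the first sequence compares as less than the second, as required. With them dropped, both sequences produce the same list of accent weights and wrongly compare as equal at this level.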
Q: Does JIS require tailorings?
The Default Unicode Collation Element Table uses the Unicode order for CJK ideographs (Kanji). This represents a radical-stroke ordering for the characters in JIS levels 1 and 2. If a different order is needed, such as an exact match to binary JIS order for these characters, that can be achieved with tailoring.
Q: How are readings handled for Kanji?
There is no algorithmic mapping from Kanji characters to the phonetic readings for those characters, because determination of Japanese readings requires too much linguistic context. The common practice for sorting in a Japanese database by reading is to store the reading in a separate field, and construct the sort keys from the readings.
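A minimal Python sketch of the separate-field practice follows. The record layout and sample names are illustrative only, and the key function simply uses the code point order of the hiragana as a placeholder; a real system would construct a collation sort key from the reading.

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str     # as written, in Kanji
    reading: str  # phonetic reading in hiragana, stored in its own field

people = [
    Person("山田", "やまだ"),
    Person("佐藤", "さとう"),
    Person("鈴木", "すずき"),
]

def reading_sort_key(person):
    # Placeholder: plain code point order of the reading. A real system
    # would build a collation sort key from the reading instead.
    return person.reading

sorted_people = sorted(people, key=reading_sort_key)
sorted_names = [p.name for p in sorted_people]
```

The Kanji field is never inspected by the sort; only the stored reading determines the order.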
Q: How are mixed Japanese and Chinese handled?
The Unicode Collation Algorithm specifies how collation works for a single context. In this respect, mixed Japanese and Chinese are no different from mixed Swedish and German, or any other languages that use the same characters. Generally, the customers using a particular collation will want text sorted uniformly, no matter what the source. Japanese customers, for example, would want all of the text sorted in the Japanese fashion. There are contexts where foreign words are called out separately and sorted in a separate group with different collation conventions. Such cases would require the source fields to be tagged with the type of desired collation (or tagged with a language, which is then used to look up an associated collation).
Q: Are the half-width katakana properly interleaved with the full-width?
Yes, the Default Unicode Collation Element Table properly interleaves half-width katakana, full-width katakana, and full-width hiragana. It also interleaves the voicing and semi-voicing marks correctly, whether they are precomposed or not.
Q: Can the katakana length mark be handled properly?
Yes, by using a combination of contraction and expansion, the length mark can be tailored to sort according to the vowel of the previous katakana character. For a description of the phenomenon involved and how to handle it, see Contextual Sensitivity.
CLDR and ICU support context-sensitive mappings that can handle length and iteration marks, and with better performance than contractions; the CLDR Japanese collation tailoring takes advantage of this mechanism.
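The idea that the length mark takes its value from the preceding kana can be illustrated with a preprocessing step in Python. This is a swapped-in simplification, not the contraction and expansion tailoring described above and not what CLDR does: it rewrites the text before sorting, and it discards the distinction between a length mark and a written vowel that a real tailoring would keep at a lower level. It assumes the vowel can be read from the last letter of the character's Unicode name.

```python
import unicodedata

LENGTH_MARK = "\u30FC"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK
VOWEL_KANA = {"A": "ア", "I": "イ", "U": "ウ", "E": "エ", "O": "オ"}

def resolve_length_marks(text):
    """Replace each length mark with the vowel kana of the preceding katakana."""
    out = []
    for ch in text:
        if ch == LENGTH_MARK and out:
            # Names look like "KATAKANA LETTER KA"; the last letter is the vowel.
            name = unicodedata.name(out[-1], "")
            vowel = name[-1:]
            if name.startswith("KATAKANA LETTER") and vowel in VOWEL_KANA:
                out.append(VOWEL_KANA[vowel])
                continue
        out.append(ch)  # not a resolvable length mark: keep unchanged
    return "".join(out)

resolved = resolve_length_marks("コーヒー")
```

Here resolved is "コオヒイ", so the word would sort next to strings with the corresponding written vowels. A length mark after a kana without a vowel, such as ン, is left unchanged.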
Q: How are names in a database sorted properly?
In international sorting, it will make a difference whether strings in one field are sorted first and strings in a second field are sorted subsequently, or whether a single sort is done considering both fields together. This is because international sorting uses multi-level comparison of differences in strings. Suppose that your database is sorted first by family name, then by given name. Since family names are sorted first, a secondary or tertiary difference in the family name will completely swamp a primary difference in the given name. So {field1=Casares, field2=Zelda} will sort before {field1=Cásares, field2=Albert}.
This is not the typically desired behavior. The database should be sorted by a constructed field which contains family name + <separator> + given name. Typical historical practice was to use a ',' as the separator. However, that does not work for collation sequences that ignore punctuation. A better option is to use U+FFFE as this separator. CLDR tailors this code point to sort before any other base character, for exactly this purpose, so that the record with {field1=Cásares, field2=Albert} sorts before the record with {field1=Casares, field2=Zelda}.
For more information on this topic, see Merging Sort Keys.
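The difference between the two approaches can be reproduced with a toy two-level key in Python. The weights are invented for illustration, and LOW is a hypothetical stand-in for the very low weight that U+FFFE receives in CLDR; this is a sketch of the principle, not of any library's sort key format.

```python
import unicodedata

def levels(text):
    """Toy primary (base letters) and secondary (accents) weights for text."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            secondary.append(ord(ch))
        else:
            primary.append(ord(ch.lower()))
            secondary.append(0)
    return primary, secondary

LOW = -1  # toy separator weight, below every other weight (the role of U+FFFE)

def field_by_field_key(family, given):
    # All levels of the family name are compared before the given name.
    return (levels(family), levels(given))

def merged_key(family, given):
    # Primary weights of both fields come first, then the secondary weights.
    fam_p, fam_s = levels(family)
    giv_p, giv_s = levels(given)
    return (fam_p + [LOW] + giv_p, fam_s + [LOW] + giv_s)

records = [("C\u00e1sares", "Albert"), ("Casares", "Zelda")]
separate = sorted(records, key=lambda r: field_by_field_key(*r))
merged = sorted(records, key=lambda r: merged_key(*r))
```

Sorting field by field puts Casares/Zelda first, because the accent difference in the family name is decided before the given name is looked at. With the merged key, the primary difference between Albert and Zelda is decided first, so Cásares/Albert comes first.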
Q: How can I use the Unicode Collation Algorithm for a stable sort?
A stable sort is one where records that compare as equal are kept in the same order in which they occurred in the input data. The easiest way to achieve this is to append an index number for each record to the sort key for that record. Whether that sort key comes from strings, other data, or a concatenation of sort keys, it will then produce a stable sort. Further information about stable sorts and related topics can be found in Deterministic Sorting.
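A short Python sketch of the index technique follows. It sorts through a heap, because a heap-based sort is not stable by itself, which makes the effect of the appended index visible. The function name stable_sorted is invented, and str.lower is only a placeholder for a real collation sort key.

```python
import heapq

def stable_sorted(records, sort_key):
    """Sort records by sort_key, breaking ties by original position."""
    # The index follows the sort key, so records with equal keys are
    # ordered by where they appeared in the input.
    heap = [(sort_key(rec), index, rec) for index, rec in enumerate(records)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

names = ["beta", "Alpha", "alpha", "ALPHA", "Beta"]
result = stable_sorted(names, sort_key=str.lower)
```

The three spellings of "alpha" have equal keys and come out in their input order (Alpha, alpha, ALPHA), followed by beta and Beta. Since each (key, index) pair is unique, the records themselves are never compared.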
Q: What are the differences between the UCA and ISO 14651?
Very broadly, the UCA includes the following features that are not part of ISO 14651. This is only a sketch; for details see the Unicode Collation Algorithm.
- a much more thorough introduction to multilingual sorting issues
- much more information about performance and implementation practices
- how to apply collation to searching and matching
- a variable weighting option allowing punctuation to make a difference at the first three levels ("Non-ignorable" option)
Q: What can you tell me about searching and sorting with Braille?
The individual Braille patterns are not tied to specific characters. A pattern that represents an "A" for English might represent a completely different letter or symbol or ideograph for another language. Therefore, search and sort engines cannot assume that the underlying meaning of any individual Braille pattern is fixed. It can and will vary by language, greatly affecting how searching and sorting rules are defined, and how strings that contain Braille patterns are interpreted. [SO]
Q: In my language, "ch" usually sorts like a separate letter. If I want a foreign word to sort without this happening, how do I do it?
You use U+034F COMBINING GRAPHEME JOINER (CGJ), as described in Characters and Combining Marks.
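The effect of the CGJ can be shown with a toy tailoring in Python in which "ch" is a contraction that sorts just after "h". The key function and its weights are invented for this sketch and handle only lowercase ASCII letters; what it demonstrates is that an intervening CGJ prevents the contraction from matching while adding no weight of its own.

```python
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def toy_key(word):
    """Toy primary key for a tailoring in which "ch" sorts just after "h"."""
    weights = []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "ch":
            weights.append(ord("h") * 2 + 1)  # contraction: one weight after "h"
            i += 2
        elif word[i] == CGJ:
            i += 1  # ignorable: contributes no weight
        else:
            weights.append(ord(word[i]) * 2)
            i += 1
    return weights

native = "chata"
foreign = "c" + CGJ + "hata"

native_order = sorted([native, "hora", "cesta"], key=toy_key)
foreign_order = sorted([foreign, "hora", "cesta"], key=toy_key)
```

The native word sorts after "hora" because "ch" is treated as a single letter. The version with the CGJ sorts under "c", between "cesta" and "hora", because "c" and "h" are weighted separately.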
Q: What policies constrain allowable changes to UCA between versions?
The UTC has established a number of policies which help to keep the UCA and its associated data table (DUCET) stable, even as the UCA is updated to stay in synch with additions to the Unicode Standard. First, there are policies which define how collation weights should be established for newly assigned characters and scripts. Those can be found in UCA Default Criteria for New Characters. There are also policies which limit the kinds of changes which can be made for characters already in the DUCET, and which define how potential updates should be specified and tracked. Those can be found in Change Management for the Unicode Collation Algorithm.