Understand the significance of Unicode normalization forms, their application in XML, JavaScript, and standards. Learn about canonical and compatibility equivalence, cautions, composition, conformance testing, and legacy encodings.
Unicode Normalization Mark Davis www.macchiato.com
Normalization • Uniqueness: two equivalent strings have precisely the same normalized form • Fast binary comparison, accurate digital signatures • Recommended for XML, JavaScript and other standards
Canonical Equivalence • Fundamental equivalence • Indistinguishable to users, when correctly rendered • Includes • Combining sequences: Ç (U+00C7) ≡ C + ◌̧ (U+0043 U+0327) • Hangul: 가 (U+AC00) ≡ ᄀ + ᅡ (U+1100 U+1161) • Singletons: Ω (U+2126) ≡ Ω (U+03A9)
Compatibility Equivalence • Formatting differences • Font variants (ℌ) • Breaking differences (-) • Cursive forms (ﻦ ﻨ ﻧ ﻥ) • Circled (⑪) • Width, size, rotated (カ ﹠ ︷) • Super/subscripts (₉ ⁹) • Squared characters (㌀) • Fractions (⅚) • Others (dž) • Examples: ｶ (U+FF76) ≈ カ (U+30AB) • ㎏ (U+338F) ≈ kg • ﬁ (U+FB01) ≈ fi
Normalization Requirement • Uniqueness: two equivalent strings will have precisely the same normalized form • If two strings x and y are canonical equivalents, then C(x) = C(y) and D(x) = D(y) • If two strings x and y are compatibility equivalents, then KC(x) = KC(y) and KD(x) = KD(y)
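The requirement can be spot-checked with Python's standard `unicodedata` module. This is a minimal sketch: the helper names `canonical_equal` and `compat_equal` and the sample strings are illustrative assumptions, not part of the slides.

```python
import unicodedata

def canonical_equal(x: str, y: str) -> bool:
    """True if x and y have the same C and D forms (NFC and NFD)."""
    return (unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)
            and unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y))

def compat_equal(x: str, y: str) -> bool:
    """True if x and y have the same KC and KD forms (NFKC and NFKD)."""
    return (unicodedata.normalize("NFKC", x) == unicodedata.normalize("NFKC", y)
            and unicodedata.normalize("NFKD", x) == unicodedata.normalize("NFKD", y))

# Canonical equivalents: precomposed vs. combining sequence, Hangul syllable vs. jamo.
print(canonical_equal("\u00c7", "C\u0327"))        # Ç vs. C + cedilla
print(canonical_equal("\uac00", "\u1100\u1161"))   # 가 vs. ᄀ + ᅡ

# Compatibility equivalents only: the fi ligature vs. the letters f, i.
print(canonical_equal("\ufb01", "fi"))
print(compat_equal("\ufb01", "fi"))
```

Canonical equivalents also compare equal under KC and KD, so `compat_equal` is the looser of the two tests.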
Affected Characters • None of the forms affect text with only ASCII characters (U+0000 to U+007F) • None of the forms generate compatibility characters that were not in the source text. • Both KD and KC replace compatibility characters. • Both D and C maintain compatibility characters.
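A short check of these properties, again assuming Python's `unicodedata` as the normalizer. The ligature U+FB01 is only an illustrative compatibility character.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Every ASCII character, U+0000 to U+007F, in one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))
ascii_unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text for form in FORMS
)

# A compatibility character: U+FB01 LATIN SMALL LIGATURE FI.
ligature = "\ufb01"
kept = [f for f in FORMS if unicodedata.normalize(f, ligature) == ligature]
replaced = [f for f in FORMS if unicodedata.normalize(f, ligature) != ligature]

print(ascii_unchanged)  # ASCII passes through all four forms untouched
print(kept)             # forms that maintain the ligature
print(replaced)         # forms that replace it with "fi"
```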
Cautions: Decomposition • Requires decomposition mappings from the Unicode Character Database • Those decomposition mappings must be applied recursively • The string must be put into canonical order • Either Canonical or Compatibility
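The three steps above can be sketched with the decomposition mappings that Python's `unicodedata` exposes. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and Hangul syllables are left untouched because their decomposition is algorithmic and not listed in the mapping data.

```python
import unicodedata

def decompose_char(ch: str, compat: bool = False) -> str:
    """Apply the decomposition mapping of ch recursively."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch                      # no mapping (includes Hangul here)
    parts = mapping.split()
    if parts[0].startswith("<"):       # tagged mapping, e.g. "<compat> 0066 0069"
        if not compat:
            return ch                  # canonical decomposition ignores these
        parts = parts[1:]
    return "".join(decompose_char(chr(int(p, 16)), compat) for p in parts)

def canonical_order(s: str) -> str:
    """Stable-sort each run of non-starters by canonical combining class."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def decompose(s: str, compat: bool = False) -> str:
    """Form D (compat=False) or Form KD (compat=True), Hangul excepted."""
    return canonical_order("".join(decompose_char(c, compat) for c in s))

# U+1E69 needs two rounds of mapping: ṩ -> ṣ + dot above -> s + dot below + dot above
print([f"{ord(c):04X}" for c in decompose("\u1e69")])
```

Python's `sorted` is stable, which matters here: marks with the same combining class must keep their relative order.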
Cautions: Composition • Decomposition required first! • Then canonical composition • Composition data: fixed at Unicode 3.0.0 • Some characters are excluded from composition • Form C and Form KC can still have combining characters! • Required for Indic, Arabic, Hebrew, &c.
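Two of these cautions can be seen in a small example, assuming Python's `unicodedata` as the normalizer; the sample strings are illustrative choices. The first shows why decomposition and reordering must come before composition, the second that Form C output can still contain combining characters.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def has_combining(s: str) -> bool:
    """True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in s)

# à (U+00E0) followed by dot below (U+0323): the precomposed à has to be
# decomposed and the marks reordered before a + dot below can compose.
composed = nfc("\u00e0\u0323")
print([f"{ord(c):04X}" for c in composed])   # ['1EA1', '0300']
print(has_combining(composed))               # True: the grave stays separate

# Hebrew bet with dagesh (U+FB31) is excluded from composition,
# so Form C leaves the base letter plus the combining dagesh.
print([f"{ord(c):04X}" for c in nfc("\ufb31")])   # ['05D1', '05BC']
```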
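Caution: Both C & D • No normalization form is closed under string concatenation. Example: • NFC: "…a◌̰" + "◌̀…" are each normalized • Concatenation "…a◌̰◌̀…" is not in NFC • Its NFC is "…à◌̰…" • NFD: "…a◌̀" + "◌̰…" are each normalized • Concatenation "…a◌̀◌̰…" is not in NFD • Its NFD is "…a◌̰◌̀…" • Exceptions easy to test for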
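The concatenation caution can be reproduced with Python's `unicodedata`. In this sketch the marks are U+0330 COMBINING TILDE BELOW (class 220) and U+0300 COMBINING GRAVE ACCENT (class 230); the helper names are assumptions. With this pair, NFC breaks when the mark below ends the first string, and NFD breaks when the mark above does.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    return unicodedata.normalize("NFC", s) == s

def is_nfd(s: str) -> bool:
    return unicodedata.normalize("NFD", s) == s

# NFC case: both pieces are normalized, the concatenation is not.
left, right = "a\u0330", "\u0300"
joined = left + right
print(is_nfc(left), is_nfc(right), is_nfc(joined))   # True True False
print([f"{ord(c):04X}" for c in unicodedata.normalize("NFC", joined)])
# ['00E0', '0330']: the grave composes with "a" across the tilde below

# NFD case: the marks arrive in the wrong canonical order.
left2, right2 = "a\u0300", "\u0330"
joined2 = left2 + right2
print(is_nfd(left2), is_nfd(right2), is_nfd(joined2))  # True True False
print([f"{ord(c):04X}" for c in unicodedata.normalize("NFD", joined2)])
# ['0061', '0330', '0300']
```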
Composition Process • Decompose (D or KD) • Combine unblocked characters with the previous starter, if possible*
Composition Exclusions • Script specifics: क + ◌̣ ⇏ क़ • Futures: G + ◌̣ ⇏ G̣ • Singletons*: Ω (U+03A9) ⇏ Ω (U+2126) • Non-starter sequences*: ◌̈ + ◌́ ⇏ ◌̈́
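Three of the four exclusion classes can be observed directly in Form C output. The sketch below assumes Python's `unicodedata`, and the code points are illustrative picks for each class. The "futures" class cannot be shown this way, since it concerns characters not yet encoded.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# (excluded character, what Form C produces instead)
EXAMPLES = {
    "script-specific": ("\u0958", "\u0915\u093c"),            # क़ -> क + nukta
    "singleton": ("\u2126", "\u03a9"),                        # ohm sign -> omega
    "non-starter decomposition": ("\u0344", "\u0308\u0301"),  # dialytika tonos
}

for label, (excluded, expected) in EXAMPLES.items():
    result = nfc(excluded)
    print(label, [f"{ord(c):04X}" for c in result], result == expected)

# The decomposed sequence is never recombined into the excluded character.
print(nfc("\u0915\u093c") == "\u0915\u093c")
```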
Legacy Encoding • Legacy text is ‘normalized’ if it maps 1:1 to normalized Unicode text • Legacy sets: • Prenormalized: e.g. ISO 8859-1 • Normalizable: e.g. ISO 2022 (ISO 5426/ISO 8859-1/…) • Unnormalizable: e.g. ISO 5426
Programming Identifiers • Closed under all Normalization Forms, if minor changes are incorporated • Modified syntax: • identifier := start ( start | extend )* • start := [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}] - irregulars - combining_like • extend := [{Mn}{Mc}{Nd}{Pc}{Cf}] - irregulars + combining_like + mid_dot • (Almost) closed under Case Mappings • see SpecialCasing.txt
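The modified syntax can be sketched as a category test. The slide does not list the members of `irregulars`, `combining_like`, or `mid_dot`, so the first two are empty hypothetical placeholders here, and `mid_dot` is assumed to be U+00B7 MIDDLE DOT.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc", "Cf"}

# Hypothetical placeholders: the actual sets are not given in the slides.
IRREGULARS: frozenset = frozenset()
COMBINING_LIKE: frozenset = frozenset()
MID_DOT = frozenset({"\u00b7"})   # assumption: U+00B7 MIDDLE DOT

def is_start(ch: str) -> bool:
    return (unicodedata.category(ch) in START_CATEGORIES
            and ch not in IRREGULARS
            and ch not in COMBINING_LIKE)

def is_extend(ch: str) -> bool:
    in_categories = (unicodedata.category(ch) in EXTEND_CATEGORIES
                     and ch not in IRREGULARS)
    return in_categories or ch in COMBINING_LIKE or ch in MID_DOT

def is_identifier(s: str) -> bool:
    """identifier := start ( start | extend )*"""
    return (bool(s) and is_start(s[0])
            and all(is_start(c) or is_extend(c) for c in s[1:]))

# The same identifier is accepted in both its composed and decomposed spelling.
print(is_identifier("caf\u00e9"), is_identifier("cafe\u0301"))
```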
Resources • Reference version on Unicode Site • Production Version • http://oss.software.ibm.com/icu • ICU: C/C++ and Java Versions • Open Source, with IBM Public License • Free commercial use and distribution: Not Viral! • Panel Later today • Other companies also providing: ask!
Definition: Starter • S is a starter if it has: • Canonical class of zero in the Unicode Character Database • Can start a composition • Examples • Starters (letters, spacing marks, some non-spacing marks): ‘a’, ‘ق’, ‘Θ’, ‘क’, ‘ी’, ‘◌ै’ • Non-starters (most non-spacing marks): ‘◌̀’, ‘◌̊’, ‘◌̽’, ‘◌̥’
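In Python, the canonical combining class is available as `unicodedata.combining`, so the definition reduces to a one-line test. The helper name `is_starter` is an assumption; the sample characters are the ones listed above.

```python
import unicodedata

def is_starter(ch: str) -> bool:
    """A starter has canonical combining class zero."""
    return unicodedata.combining(ch) == 0

STARTERS = ["a", "\u0642", "\u0398", "\u0915", "\u0940", "\u0948"]
NON_STARTERS = ["\u0300", "\u030a", "\u033d", "\u0325"]

for ch in STARTERS + NON_STARTERS:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          unicodedata.combining(ch), is_starter(ch))
```

The two Devanagari vowel signs show that general category alone does not decide the matter: U+0940 is a spacing mark (Mc) and U+0948 a non-spacing mark (Mn), yet both have class zero.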
Definition: Blocked • C is blocked from S if: • There is some character B between S and C, and either • B is a starter, or • B has the same canonical class as C • Examples • “ABC”: B blocks C from A • “A◌̀◌̊”: ◌̀ blocks ◌̊ from A • “A◌̥◌̊”: ◌̥ doesn’t block ◌̊ from A
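The definition as stated on this slide translates directly into a loop over the characters between S and C. The function name and index-based signature below are assumptions made for the sketch.

```python
import unicodedata

def is_blocked(text: str, s_index: int, c_index: int) -> bool:
    """True if text[c_index] is blocked from text[s_index].

    Follows the slide's wording: some B strictly between S and C is
    either a starter or has the same canonical class as C.
    """
    c_class = unicodedata.combining(text[c_index])
    for b in text[s_index + 1:c_index]:
        b_class = unicodedata.combining(b)
        if b_class == 0 or b_class == c_class:
            return True
    return False

print(is_blocked("ABC", 0, 2))             # True: B is a starter
print(is_blocked("A\u0300\u030a", 0, 2))   # True: grave and ring above share class 230
print(is_blocked("A\u0325\u030a", 0, 2))   # False: ring below is class 220
```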
Unicode Normalization • Introduction • Normalization forms • Design goals • Specification • Excluded characters • Versions • Legacy encodings • Applications
Characters and Encoding Forms • Columns: Abstract | Encoded | Serialized UTF-16BE | Serialized UTF-8 • Å | C5 | 00 C5 | C3 85 • Å | 212B | 21 2B | E2 84 AB • (private use) | F0000 | DB 80 DC 00 | F3 B0 80 80 • a + ◌̊ | 61 30A | 00 61 03 0A | 61 CC 8A
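The byte values on this slide can be regenerated with Python's built-in codecs. The helper name `serialized` is an assumption, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
def serialized(text: str) -> dict:
    """Code points and byte serializations of text, as uppercase hex."""
    return {
        "code points": " ".join(f"{ord(c):04X}" for c in text),
        "UTF-16BE": text.encode("utf-16-be").hex(" ").upper(),
        "UTF-8": text.encode("utf-8").hex(" ").upper(),
    }

for sample in ["\u00c5", "\u212b", "\U000F0000", "a\u030a"]:
    print(serialized(sample))
```

U+F0000 lies outside the Basic Multilingual Plane, which is why its UTF-16BE form is a surrogate pair (DB80 DC00) and its UTF-8 form takes four bytes.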