Unicode Normalization: Explanation & Importance

Unicode Normalization Mark Davis www.macchiato.com

Normalization • Uniqueness • two equivalent strings have precisely the same normalized form • Fast binary comparison, accurate digital signatures • Recommended for XML, JavaScript and other standards

Canonical Equivalence • Fundamental equivalence • Indistinguishable to users, when correctly rendered • Includes • Combining sequences • Hangul • Singletons • Examples: Ç ≡ C + ◌̧, 가 ≡ ᄀ + ᅡ, Ω (U+2126) ≡ Ω (U+03A9)

Compatibility Equivalence • Formatting differences • Font variants (ℌ) • Breaking differences (-) • Cursive forms (ﻦ ﻨ ﻧ ﻥ) • Circled (⑪) • Width, size, rotated (ｶ ﹠ ︷) • Super/subscripts (₉ ⁹) • Squared characters (㌀) • Fractions (⅚) • Others (ǆ) • Examples: ｶ ≈ カ, ㎏ ≈ kg, ﬁ ≈ fi

UTR #15: Unicode Normalization Forms

Normalization Requirement • Uniqueness: two equivalent strings will have precisely the same normalized form • If two strings x and y are canonical equivalents, then C(x) = C(y) and D(x) = D(y) • If two strings x and y are compatibility equivalents, then KC(x) = KC(y) and KD(x) = KD(y)
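The uniqueness requirement can be spot-checked with Python's standard `unicodedata` module. This is a minimal sketch rather than a conformance test: the helper name `same_under` and the sample strings are assumptions chosen for illustration.

```python
import unicodedata

def same_under(form: str, x: str, y: str) -> bool:
    """True if x and y have the same normalized form (hypothetical helper)."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

# Canonical equivalents: precomposed Ç versus C + COMBINING CEDILLA.
precomposed = "\u00C7"
combining = "C\u0327"
canonical_ok = (same_under("NFC", precomposed, combining)
                and same_under("NFD", precomposed, combining))

# Compatibility equivalents: the ﬁ ligature versus the letters f, i.
ligature = "\uFB01"
plain = "fi"
compat_ok = (same_under("NFKC", ligature, plain)
             and same_under("NFKD", ligature, plain))

# The ligature is NOT a canonical equivalent of "fi", so C keeps them apart.
canonical_only = same_under("NFC", ligature, plain)

print(canonical_ok, compat_ok, canonical_only)
```

The raw strings differ before normalization, which is why a binary comparison is only meaningful once both sides are in the same form.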

Affected Characters • None of the forms affect text with only ASCII characters (U+0000 to U+007F) • None of the forms generate compatibility characters that were not in the source text. • Both KD and KC replace compatibility characters. • Both D and C maintain compatibility characters.
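These properties are easy to observe with Python's `unicodedata` module. The sketch below assumes a small hand-picked sample (①, ﬁ, ²) and is only illustrative.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# All 128 ASCII characters in one string: no form should change it.
ascii_text = "".join(chr(c) for c in range(0x80))
ascii_unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text for form in FORMS
)

# Sample with compatibility characters: CIRCLED DIGIT ONE, the fi ligature,
# and SUPERSCRIPT TWO (sample chosen for this sketch).
sample = "\u2460 \uFB01 x\u00B2"
normalized = {form: unicodedata.normalize(form, sample) for form in FORMS}

for form, text in normalized.items():
    print(form, [f"U+{ord(c):04X}" for c in text])
```

In this sample, D and C leave the compatibility characters in place, while KD and KC replace them with their plain counterparts.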

Cautions: Decomposition • Requires decomposition mappings from the Unicode Character Database • Those decomposition mappings must be applied recursively • The string must be put into canonical order • Either Canonical or Compatibility
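The steps above (recursive mapping, then canonical ordering) might be sketched as follows. This is a simplified illustration built on Python's `unicodedata`, not the reference implementation: the function names are hypothetical, and Hangul syllables are deliberately left out because `unicodedata.decomposition` does not return their algorithmic decomposition.

```python
import unicodedata

def decompose_char(ch: str, compat: bool = False) -> str:
    """Apply the decomposition mapping recursively (Hangul not handled)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    parts = mapping.split()
    if parts[0].startswith("<"):      # tagged mapping = compatibility only
        if not compat:
            return ch
        parts = parts[1:]
    return "".join(decompose_char(chr(int(p, 16)), compat) for p in parts)

def canonical_order(s: str) -> str:
    """Stable-sort each run of non-starters by canonical combining class."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def decompose(s: str, compat: bool = False) -> str:
    """Sketch of Form D (compat=False) or Form KD (compat=True)."""
    return canonical_order("".join(decompose_char(c, compat) for c in s))

canonical_samples = ["\u1EA5", "a\u0300\u0330", "\u00C7", "\u212B"]
compat_samples = ["\uFB01", "\u01C4", "\u2460"]
print([decompose(s) == unicodedata.normalize("NFD", s) for s in canonical_samples])
print([decompose(s, True) == unicodedata.normalize("NFKD", s) for s in compat_samples])
```

The sort must be stable: marks with the same canonical class keep their relative order, since reordering them would change the text.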

Cautions: Composition • Decomposition required first! • Then canonical composition • Composition data: fixed at Unicode 3.0.0 • Some characters are excluded from composition • Form C and Form KC can still have combining characters! • Required for Indic, Arabic, Hebrew, &c.
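The point that Form C can still contain combining characters can be seen directly. A small sketch using Python's `unicodedata`; the helper name and the sample strings are assumptions for illustration.

```python
import unicodedata

def combining_left(s: str) -> list[str]:
    """Combining marks (non-zero canonical class) that remain after NFC."""
    nfc = unicodedata.normalize("NFC", s)
    return [f"U+{ord(c):04X}" for c in nfc if unicodedata.combining(c)]

# e + acute has a precomposed form, so nothing is left over.
latin_composed = combining_left("e\u0301")

# x + circumflex has no precomposed form; the mark stays.
latin_uncomposed = combining_left("x\u0302")

# Arabic beh + fatha: vowel marks have no precomposed forms.
arabic = combining_left("\u0628\u064E")

print(latin_composed, latin_uncomposed, arabic)
```

Code that assumes "NFC means one code point per user-perceived character" will therefore break on such text.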

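Caution: Both C & D • None of the normalization forms is closed under string concatenation. Example: • Both in NFC and NFD: "…a◌̰" + "◌̀…" • Concatenation, not in NFC: "…a◌̰◌̀…" • Its NFC: "…à◌̰…" • For NFD, swap the marks: "…a◌̀" + "◌̰…" concatenates to "…a◌̀◌̰…", whose NFD is "…a◌̰◌̀…" • Exceptions easy to test for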
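The concatenation caveat can be reproduced with Python's `unicodedata`. In this sketch the marks are U+0330 COMBINING TILDE BELOW (canonical class 220) and U+0300 COMBINING GRAVE ACCENT (class 230). The NFD case uses the marks in the opposite order, and `safe_concat` is a hypothetical helper showing the simplest possible repair.

```python
import unicodedata

def is_form(form: str, s: str) -> bool:
    return unicodedata.normalize(form, s) == s

def safe_concat(form: str, x: str, y: str) -> str:
    """Simplest repair: renormalize the joined string (hypothetical helper)."""
    return unicodedata.normalize(form, x + y)

# NFC case: both pieces are in NFC, the concatenation is not.
left_c, right_c = "a\u0330", "\u0300"
pieces_nfc = is_form("NFC", left_c) and is_form("NFC", right_c)
joined_nfc = is_form("NFC", left_c + right_c)
fixed_nfc = safe_concat("NFC", left_c, right_c)   # à + tilde below

# NFD case: both pieces are in NFD, the concatenation is out of order.
left_d, right_d = "a\u0300", "\u0330"
pieces_nfd = is_form("NFD", left_d) and is_form("NFD", right_d)
joined_nfd = is_form("NFD", left_d + right_d)
fixed_nfd = safe_concat("NFD", left_d, right_d)   # a + tilde below + grave

print(pieces_nfc, joined_nfc, pieces_nfd, joined_nfd)
```

Renormalizing the whole result is the blunt approach; a production library would typically re-examine only the characters around the join.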

Composition Process • Decompose (D or KD) • Combine unblocked characters with the previous starter, if possible*

Composition Exclusions • Script Specifics क + ◌̣ ⇏ क़ • Futures: G + ◌̣ ⇏ G̣ • Singletons* Ω ⇏ Ω • Non-starter sequences* ◌̈ + ◌́ ⇏ ◌̈́
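The four exclusion categories can be observed through NFC in Python's `unicodedata`. A short sketch; the variable names are illustrative and the code points were chosen to mirror the examples above.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def code_points(s: str) -> str:
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Script specific: KA + NUKTA is not recomposed to U+0958,
# and U+0958 itself is decomposed by NFC.
script_specific = nfc("\u0915\u093C")
excluded_precomposed = nfc("\u0958")

# Future characters: G + dot below has no precomposed form to compose to.
future = nfc("G\u0323")

# Singleton: OHM SIGN maps to GREEK CAPITAL OMEGA and never comes back.
singleton = nfc("\u2126")

# Non-starter sequence: diaeresis + acute is not recomposed to U+0344.
non_starter = nfc("\u0308\u0301")

# For contrast, an ordinary pair that does compose.
ordinary = nfc("C\u0327")

for label, value in [("script", script_specific), ("future", future),
                     ("singleton", singleton), ("non-starter", non_starter),
                     ("ordinary", ordinary)]:
    print(label, code_points(value))
```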

Legacy Encoding • Legacy text is ‘normalized’ if it maps 1:1 to normalized Unicode text • Legacy sets: • Prenormalized: e.g. ISO 8859-1 • Normalizable: e.g. ISO 2022 (ISO 5426/ISO 8859-1/…) • Unnormalizable: e.g. ISO 5426
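For the prenormalized case, a quick check is that every ISO 8859-1 byte decodes to text that normalization leaves unchanged. The sketch below takes NFC as the target form, which is an assumption: the slide does not name a form, and the same text is not in NFD.

```python
import unicodedata

# Decode all 256 ISO 8859-1 byte values as one string.
latin1_text = bytes(range(256)).decode("latin-1")

# Prenormalized with respect to NFC: normalization is a no-op.
latin1_is_nfc = unicodedata.normalize("NFC", latin1_text) == latin1_text

# Not true for NFD: accented letters such as À decompose.
latin1_is_nfd = unicodedata.normalize("NFD", latin1_text) == latin1_text

print(latin1_is_nfc, latin1_is_nfd)
```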

Programming Identifiers • Closed under all Normalization Forms, if minor changes incorporated • Modified syntax: • identifier := start ( start | extend )* • start := [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]- irregulars – combining_like • extend := [{Mn}{Mc}{Nd}{Pc}{Cf}]- irregulars + combining_like + mid_dot • (Almost) closed under Case Mappings • see SpecialCasing.txt

Resources • Reference version on Unicode Site • Production Version • http://oss.software.ibm.com/icu • ICU: C/C++ and Java Versions • Open Source, with IBM Public License • Free commercial use and distribution: Not Viral! • Panel Later today • Other companies also providing: ask!

Normalization • Uniqueness: two equivalent strings have precisely the same normalized form • Fast binary comparison, accurate digital signatures • Recommended for XML, JavaScript and other standards

Q & A

Backup Slides

Definition: Starter • S is a starter if it has a canonical class of zero in the Unicode Character Database • Can start a composition • Examples • Starters (spacing marks, some non-spacing marks): ‘a’, ‘ق’, ‘Θ’, ‘क’, ‘ी’, ‘◌ै’ • Non-starters (most non-spacing marks): ‘◌̀’, ‘◌̊’, ‘◌̽’, ‘◌̥’

Definition: Blocked • C is blocked from S • There is some character B between S and C, and either • B is a starter or • B has the same canonical class as C • Examples • “ABC” – B blocks C from A • “A◌̀◌̊” – ◌̀ blocks ◌̊ from A • “A◌̥◌̊” – ◌̥ doesn’t block ◌̊ from A
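The starter and blocked definitions translate almost directly into code. The sketch below follows the wording on these slides (same canonical class) and uses `unicodedata.combining` for the canonical class; the function names and index-based signature are assumptions for illustration.

```python
import unicodedata

def is_starter(ch: str) -> bool:
    """A starter has canonical combining class zero."""
    return unicodedata.combining(ch) == 0

def is_blocked(s: str, start: int, target: int) -> bool:
    """True if s[target] is blocked from s[start], with start < target."""
    target_class = unicodedata.combining(s[target])
    for between in s[start + 1:target]:
        between_class = unicodedata.combining(between)
        if between_class == 0 or between_class == target_class:
            return True
    return False

# "ABC": B is a starter, so it blocks C from A.
abc_blocked = is_blocked("ABC", 0, 2)

# A + grave + ring above: both marks have class 230, so grave blocks ring.
same_class_blocked = is_blocked("A\u0300\u030A", 0, 2)

# A + ring below (class 220) + ring above (class 230): not blocked.
different_class_blocked = is_blocked("A\u0325\u030A", 0, 2)

print(abc_blocked, same_class_blocked, different_class_blocked)
```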

Testing Conformance: Canonical

Unicode Normalization • Introduction • Normalization forms • Design goals • Specification • Excluded characters • Versions • Legacy encodings • Applications

Characters and Encoding Forms
Abstract | Encoded | Serialized UTF-16BE | Serialized UTF-8
Å | C5 | 00 C5 | C3 85
Å | 212B | 21 2B | E2 84 AB
Å | F0000 | DB 80 DC 00 | F3 B0 80 80
A + ° | 61 30A | 00 61 03 0A | 61 CC 8A
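The serialized byte values on this slide can be reproduced with Python's built-in codecs. A minimal sketch; `hex_bytes` is a hypothetical helper, and the four strings correspond to the encoded values C5, 212B, F0000 and 61 30A.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Serialize text and show the bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

encoded = {
    "C5": "\u00C5",          # LATIN CAPITAL LETTER A WITH RING ABOVE
    "212B": "\u212B",        # ANGSTROM SIGN
    "F0000": "\U000F0000",   # private-use code point outside the BMP
    "61 30A": "a\u030A",     # a + COMBINING RING ABOVE
}

table = {
    label: (hex_bytes(text, "utf-16-be"), hex_bytes(text, "utf-8"))
    for label, text in encoded.items()
}

for label, (utf16, utf8) in table.items():
    print(f"{label:7} | {utf16:11} | {utf8}")
```

The supplementary code point shows why the two serializations diverge in length: UTF-16 needs a surrogate pair (four bytes), and UTF-8 needs a four-byte sequence.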
