Unicode Transforms in ICU


Presentation Transcript


  1. Unicode Transforms in ICU. Mark Davis, Chief SW Globalization Architect, IBM

  2. What is ICU? • The Premier Unicode-Enablement Library • Open-Source: non-viral license • Full-featured, cross-platform • C, C++, Java APIs • Collation, Charset Conversion, Resources, Boundaries, Calendars, Transforms (case, norm., translit., …), Format/Parse (dates, times, msgs, nums., curr., …), Unicode strings/props • Unicode Conformant • http://oss.software.ibm.com/icu/ 21st International Unicode Conference

  3. ICU Transforms • Powerful, flexible mechanism • Uppercase, Lowercase, Titlecase, Full/Halfwidth • Normalization • Hex, Character Names • Script to Script conversion… • Supports Styled Text, not just Plain Text • Chaining, Filters, Buffering • Customizable

  4. Transform Examples • “Any-Uppercase” a → A • “Any-Hex/Java” a → \u0061 • “Greek-Latin” α → a

  5. Filters • “[aeiou] Latin-Greek” • “Latin” is the source • “[aeiou]” is a filter; it restricts the transform to English vowels only • “Greek” is the target • “[^\u0000-\u007E] Any-Hex” • “A δ is…” → “A \u03B4 is\u2026”
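
As a rough illustration of how a filter narrows a transform, the sketch below mimics “[^\u0000-\u007E] Any-Hex” in plain Python. It does not call ICU: `outside_ascii` and `any_hex_java` are hypothetical names, and the escape format (uppercase hex, surrogate pairs above U+FFFF) is an assumption modeled on the slide's output.

```python
def outside_ascii(ch):
    # Stand-in for the filter [^\u0000-\u007E]: accept characters above U+007E.
    return ord(ch) > 0x7E


def any_hex_java(text, char_filter=lambda ch: True):
    # Replace every character accepted by char_filter with a Java-style escape.
    out = []
    for ch in text:
        cp = ord(ch)
        if not char_filter(ch):
            out.append(ch)                      # rejected by the filter: untouched
        elif cp > 0xFFFF:
            cp -= 0x10000                       # supplementary: two surrogate escapes
            out.append("\\u%04X\\u%04X" % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)))
        else:
            out.append("\\u%04X" % cp)
    return "".join(out)


print(any_hex_java("A δ is…", outside_ascii))   # A \u03B4 is\u2026
print(any_hex_java("a"))                         # \u0061
```

Without a filter every character is escaped; with the filter only the non-ASCII characters are, which is the behavior the slide's example shows.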

  6. UnicodeSet Filters • Ranges [ABC a-z] • Union [[:Lu:] [:P:]] • Intersection [[:Lu:] & [\u0000-\u01FF]] • Set Difference [[:Lu:] - [\u0000-\u01FF]] • Complement [^aeiou] • Properties • Uppercase letters [:Lu:] • Punctuation [:P:] • Script [:Greek:] • Other Unicode properties in ICU 2.2
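
The set operations above map closely onto ordinary set algebra. The following sketch models them with Python sets and the standard `unicodedata` module rather than ICU's UnicodeSet; the helper `with_category` and the small universe `LIMIT` are assumptions made only to keep the example quick.

```python
import unicodedata

LIMIT = 0x0400  # assumption: restrict the universe to U+0000..U+03FF


def with_category(prefix):
    # Characters whose General_Category starts with prefix, e.g. "Lu" or "P".
    return {chr(cp) for cp in range(LIMIT)
            if unicodedata.category(chr(cp)).startswith(prefix)}


universe = {chr(cp) for cp in range(LIMIT)}
lu = with_category("Lu")                    # [:Lu:]
punct = with_category("P")                  # [:P:]
low = {chr(cp) for cp in range(0x0200)}     # [\u0000-\u01FF]

union = lu | punct                          # [[:Lu:] [:P:]]
intersection = lu & low                     # [[:Lu:] & [\u0000-\u01FF]]
difference = lu - low                       # [[:Lu:] - [\u0000-\u01FF]]
complement = universe - set("aeiou")        # [^aeiou]

print("A" in intersection, "Δ" in intersection)   # True False
print("Δ" in difference, "!" in union)            # True True
print("a" in complement, "b" in complement)       # False True
```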

  7. Example Filter • “[:Lu:] Latin-Katakana; Latin-Hiragana” • Converts all uppercase Latin characters to Katakana • Then converts all other Latin characters to Hiragana

  8. Chaining Transforms • “Hiragana-Latin; Any-Title” • たけだ, まさゆき → takeda, masayuki → Takeda, Masayuki • Any number of transforms in chain
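
Chaining amounts to feeding the output of one transform into the next. The sketch below shows the idea in plain Python under loud assumptions: `HIRAGANA_TO_LATIN` is a hypothetical table covering only the kana in the slide's example, and `str.title()` merely approximates what “Any-Title” does.

```python
# Hypothetical mini-table: only the kana that appear in the slide's example.
HIRAGANA_TO_LATIN = {"た": "ta", "け": "ke", "だ": "da",
                     "ま": "ma", "さ": "sa", "ゆ": "yu", "き": "ki"}


def hiragana_latin(text):
    # Map each known kana to Latin; pass everything else through unchanged.
    return "".join(HIRAGANA_TO_LATIN.get(ch, ch) for ch in text)


def any_title(text):
    # Rough stand-in for Any-Title.
    return text.title()


def chain(*transforms):
    # Build one transform that applies the given transforms left to right.
    def run(text):
        for transform in transforms:
            text = transform(text)
        return text
    return run


hiragana_to_title = chain(hiragana_latin, any_title)
print(hiragana_latin("たけだ, まさゆき"))      # takeda, masayuki
print(hiragana_to_title("たけだ, まさゆき"))   # Takeda, Masayuki
```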

  9. Filtering plus Chaining • “NFD; [:M:] Remove; NFC” • Decompose • Remove accents (Marks) • Recompose
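
The three steps of “NFD; [:M:] Remove; NFC” can be reproduced with Python's standard `unicodedata` module. This is a sketch of the same pipeline, not ICU itself; `strip_marks` is a hypothetical name, and [:M:] is approximated as any character whose General_Category starts with "M".

```python
import unicodedata


def strip_marks(text):
    # Step 1 (NFD): split accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2 ([:M:] Remove): drop characters whose category is Mn, Mc, or Me.
    kept = "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))
    # Step 3 (NFC): recompose whatever is left.
    return unicodedata.normalize("NFC", kept)


print(strip_marks("Roútsē, Ánna"))   # Routse, Anna
print(strip_marks("Ρούτση"))         # Ρουτση
```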

  10. Script ↔ Script Examples • 김, 국삼 ↔ Gim, Gugsam • 김, 명희 ↔ Gim, Myeonghyi • 정, 병호 ↔ Jeong, Byeongho • たけだ, まさゆき ↔ Takeda, Masayuki • ますだ, よしひこ ↔ Masuda, Yoshihiko • やまもと, のぼる ↔ Yamamoto, Noboru • Ρούτση, Άννα ↔ Roútsē, Ánna • Καλούδης, Χρήστος ↔ Kaloúdēs, Chrêstos • Θεοδωράτου, Ελένη ↔ Theodōrátou, Elénē

  11. Script ↔ Script Conversions • General conversions: Greek-Latin • Source-Target Reversible: φ → ph → φ • Not Target-Source Reversible: f → φ → ph • Variants • By Language: Greek-German • By Standard: Greek-Latin/UNGEGN • Can build your own

  12. Styled Text • Preserves individual styles on letters, where possible • απα → apa

  13. Buffering • When buffering, conversions are not performed if they may extend over boundaries (a trailing p is held back: p? ph? ps?) • Key → Result • a → α • p → αp • a → απα • p → απαp • h → απαφ
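
The key-by-key table can be reproduced with a small model of buffered input. This is a simplified sketch, not ICU's implementation: the rule table `RULES` and the class `KeyboardBuffer` are hypothetical, and the policy of stopping at the first possibly clipped match is an assumption based on the slide.

```python
# Hypothetical rule table, longest sources first: ph > φ ; ps > ψ ; p > π ; a > α ;
RULES = [("ph", "φ"), ("ps", "ψ"), ("p", "π"), ("a", "α")]


class KeyboardBuffer:
    def __init__(self, rules):
        self.rules = rules
        self.committed = ""   # text that has already been converted
        self.pending = ""     # typed keys still waiting for more input

    def type_key(self, key):
        self.pending += key
        start = 0
        while start < len(self.pending):
            rest = self.pending[start:]
            # Clipped match: rest is a proper prefix of some rule source,
            # so a later key could still complete a longer match. Wait.
            if any(len(src) > len(rest) and src.startswith(rest)
                   for src, _ in self.rules):
                break
            for src, tgt in self.rules:
                if rest.startswith(src):
                    self.committed += tgt
                    start += len(src)
                    break
            else:
                self.committed += rest[0]   # no rule applies: copy unchanged
                start += 1
        self.pending = self.pending[start:]
        return self.committed + self.pending


keyboard = KeyboardBuffer(RULES)
for key in "apaph":
    print(key, keyboard.type_key(key))
```

Typing a, p, a, p, h shows α, αp, απα, απαp, απαφ in turn: each p stays Latin until the next key settles whether it is π, φ, or ψ.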

  14. Custom Rules • Similar to Regular Expressions • Variables • Property matches • Contextual matches • Rearrangement: $1, $2… • Quantifiers: *, +, ?

  15. Differences from Regular Expressions • More powerful: Buffered/Keyboard, Styled Text, Ordered Rules, Cursor Backup • Less powerful: only greedy quantifiers; no backup, so no (X | Y); no “input-side back references”

  16. Example of Custom Rules • “UnixQuotes-RealQuotes” • \`\` > “ ; # two graves → left quote • \'\' > ” ; # two generic apostrophes → right quote • Example (SJ Mercury News online): ``expertise'' → “expertise”
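
To make the rule format concrete, here is a toy reader for the two rules above, written in plain Python. It handles only a tiny, assumed subset of the syntax (one `source > target ;` rule per line, `#` comments, backslash escapes) and is not ICU's rule parser; `parse_rules` and `apply_rules` are hypothetical names.

```python
import re

RULE_TEXT = r"""
\`\` > “ ;   # two graves: left double quote
\'\' > ” ;   # two apostrophes: right double quote
"""


def unescape(text):
    # Turn a backslash-escaped character into the literal character.
    return re.sub(r"\\(.)", r"\1", text)


def parse_rules(rule_text):
    rules = []
    for line in rule_text.splitlines():
        # Drop the comment, surrounding blanks, and the closing semicolon.
        line = line.split("#", 1)[0].strip().rstrip(";").strip()
        if not line:
            continue
        source, target = (part.strip() for part in line.split(">", 1))
        rules.append((unescape(source), unescape(target)))
    return rules


def apply_rules(text, rules):
    out, start = [], 0
    while start < len(text):
        for source, target in rules:
            if text.startswith(source, start):
                out.append(target)
                start += len(source)
                break
        else:
            out.append(text[start])   # no rule matches here: copy one character
            start += 1
    return "".join(out)


print(apply_rules("``expertise''", parse_rules(RULE_TEXT)))   # “expertise”
```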

  17. Rule Ordering • Find first rule that matches at start • If no match, or (isBuffered & clipped-Match): advance start by 1 • Else, if match: substitute text, then move start as specified • Continue until start reaches limit

  18. Rule Ordering Example • Transliterator rules: xy > c ; yx > d ; • Regular expressions: s/xy/c/g then s/yx/d/g • Input: xyx-yxy-xyx • Transliterator result: cx-dy-cx • Regular expression result: cx-yc-cx
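
The difference between the two results comes from how the rules are scheduled: the transliterator tries every rule at each position before moving on, while the regular expressions each sweep the whole string in turn. A minimal Python sketch of both strategies, with hypothetical function names and no buffering or cursor handling, reproduces the slide's outputs:

```python
import re

RULES = [("xy", "c"), ("yx", "d")]


def transliterate(text, rules):
    # At each position, apply the first rule that matches there, then move on.
    out, start = [], 0
    while start < len(text):
        for source, target in rules:
            if text.startswith(source, start):
                out.append(target)
                start += len(source)
                break
        else:
            out.append(text[start])   # no rule matches: advance start by 1
            start += 1
    return "".join(out)


def regex_passes(text, rules):
    # Run each substitution over the whole string, one rule after another.
    for source, target in rules:
        text = re.sub(re.escape(source), target, text)
    return text


print(transliterate("xyx-yxy-xyx", RULES))   # cx-dy-cx
print(regex_passes("xyx-yxy-xyx", RULES))    # cx-yc-cx
```

In the middle group yxy, the transliterator reaches yx first and emits d, whereas the first regular expression pass has already consumed the xy, leaving nothing for the second pass to match.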

  19. Context • Rules: • γ } [ Γ Κ Χ Ξ γ κ χ ξ ] > n ; • γ > g ; • Meaning: • Convert gamma into n if followed by Γ, Κ, Χ, Ξ, γ, κ, χ, or ξ • Otherwise into g
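
A following context is checked but not consumed, which is what lets the second gamma in γγ still be processed on its own. The sketch below models just these two rules in plain Python; the tuple layout of `RULES` and the name `apply_rules` are assumptions for illustration, not ICU structures.

```python
FOLLOWERS = set("ΓΚΧΞγκχξ")

# (source, set of allowed following characters or None, target), tried in order.
RULES = [("γ", FOLLOWERS, "n"),   # γ } [ΓΚΧΞγκχξ] > n ;
         ("γ", None, "g")]        # γ > g ;


def apply_rules(text, rules):
    out, start = [], 0
    while start < len(text):
        for source, after, target in rules:
            end = start + len(source)
            if not text.startswith(source, start):
                continue
            # The context must be present, but it is not replaced or skipped.
            if after is not None and not (end < len(text) and text[end] in after):
                continue
            out.append(target)
            start = end
            break
        else:
            out.append(text[start])
            start += 1
    return "".join(out)


print(apply_rules("γγα", RULES))   # ngα
print(apply_rules("γα", RULES))    # gα
```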

  20. Cursor Backup • Allows text to be revisited • Reduces rule-count • Example Rules • BY > ビ | ~Y ; • ~YO > ョ ; • Trace (| marks the cursor): |BYO → after rule 1: ビ|~YO → after rule 2: ビョ|
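
The trace can be replayed with a small model in which each rule carries a cursor offset into its own replacement. This is a plain-Python sketch under that assumption, not ICU code; the triple layout of `RULES` is hypothetical, and a rule with offset 0 that keeps matching would loop forever in this simplified version.

```python
# (source, replacement, cursor offset inside the replacement)
RULES = [("BY", "ビ~Y", 1),    # BY > ビ | ~Y ;  cursor lands before ~Y
         ("~YO", "ョ", 1)]     # ~YO > ョ ;      cursor lands after ョ


def transliterate(text, rules):
    buf, start, trace = text, 0, []
    while start < len(buf):
        for source, replacement, cursor in rules:
            if buf.startswith(source, start):
                buf = buf[:start] + replacement + buf[start + len(source):]
                start += cursor   # may stop short of the end of the replacement
                trace.append(buf[:start] + "|" + buf[start:])
                break
        else:
            start += 1
    return buf, trace


result, steps = transliterate("BYO", RULES)
print(result)   # ビョ
print(steps)    # ['ビ|~YO', 'ビョ|']
```

Because the first rule leaves ~Y behind the cursor, a single follow-up rule per vowel can finish the syllable, which is how cursor backup reduces the rule count.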

  21. Demonstration • Public Demo • http://oss.software.ibm.com/icu/demo • (local copy, samples)

  22. More Information • Base URL: http://oss.software.ibm.com/… • User Guide: /icu/userguide/ • C API: /icu/apiref/utrans_h.html • C++ API: /icu/apiref/ • Java API: /icu4j/doc/com/ibm/text/ • Latest version of these slides: http://www.macchiato.com

  23. ICU Transforms • Powerful, flexible mechanism • Uppercase, Lowercase, Titlecase, Full/Halfwidth • Normalization • Hex, Character Names • Script to Script conversion… • Supports Styled Text, not just plaintext • Chaining & Filters • Customizable

  24. Q & A

  25. Backup Slides • Not used in the presentation, except in response to questions

  26. Buffered Usage • No conversion for clipped match • Fill buffer (…t…t) • Transliterate (…τ…t) • May have left-overs (here the final t, which could begin th…) • Copy left-overs to start • Fill rest of buffer (th…) • Transliterate (θ…)

  27. Styled Text Handling • Transforms operate on Replaceable, an interface/abstract class defined by ICU • In ICU4C, UnicodeString is a Replaceable subclass (with no out-of-band data, so no styles) • ICU4J defines ReplaceableString, a Replaceable subclass, also with no styles • Clients must define their own Replaceable subclass that implements their styled text.
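
The idea of a client-defined styled text can be pictured with a small Python analogue. It is only a sketch: `StyledText` is a hypothetical class, not ICU's Replaceable interface, and the policy that new characters inherit the style of the first character they replace is an assumption chosen for simplicity.

```python
class StyledText:
    def __init__(self, text, styles):
        if len(text) != len(styles):
            raise ValueError("need exactly one style per character")
        self.chars, self.styles = list(text), list(styles)

    def replace(self, start, limit, replacement):
        # Assumed policy: new characters take the style of the first one replaced.
        style = self.styles[start] if start < limit else None
        self.chars[start:limit] = list(replacement)
        self.styles[start:limit] = [style] * len(replacement)

    def text(self):
        return "".join(self.chars)


GREEK_TO_LATIN = {"α": "a", "π": "p", "φ": "ph"}   # hypothetical mini-table


def greek_latin(styled):
    # Transliterate in place, going through replace() so styles stay attached.
    i = 0
    while i < len(styled.chars):
        target = GREEK_TO_LATIN.get(styled.chars[i])
        if target is None:
            i += 1
        else:
            styled.replace(i, i + 1, target)
            i += len(target)


doc = StyledText("απαφ", ["bold", "italic", "plain", "bold"])
greek_latin(doc)
print(doc.text())    # apaph
print(doc.styles)    # ['bold', 'italic', 'plain', 'bold', 'bold']
```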

  28. Transliteration Sources • Søren Binks • http://homepage.mac.com/sirbinks/translit.html • UNGEGN • http://www.eki.ee/wgrs/ • …

  29. API: Information • Like other ICU APIs, can get each of the available Transform IDs: • count = Transliterator::countAvailableIDs(); • myID = Transliterator::getAvailableID(n); • And get a localizable name for each: • Transliterator::getDisplayName(myID, france, nameForUser); • Note: these are C++ APIs; C and Java are also available.

  30. API: Creation • Use an ID to create: • myTrans = Transliterator::createInstance("Latin-Greek");

  31. API: Simple usage • Convert entire string • myTrans->transliterate(myString);

  32. More Control • Specify Context • Use with Styled Text • Position markers over the text abcdefghijklmnopqrstuvwxyz, in order: contextStart, start, limit, contextLimit
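
One way to read the four markers: only the text between start and limit may change, while rules may look at surrounding text as far as contextStart and contextLimit. The sketch below models that reading in plain Python for the gamma rules from the Context slide. `Position` and `transliterate_gamma` are hypothetical names; the field names simply mirror the slide, and this is not ICU's own position structure.

```python
from dataclasses import dataclass


@dataclass
class Position:
    context_start: int   # earliest index a rule may look back to (unused here)
    start: int           # first index that may be changed
    limit: int           # one past the last index that may be changed
    context_limit: int   # one past the last index a rule may look ahead to


FOLLOWERS = set("ΓΚΧΞγκχξ")


def transliterate_gamma(text, pos):
    # Apply "γ } [ΓΚΧΞγκχξ] > n ; γ > g ;" inside [start, limit) only.
    out = []
    for i in range(pos.start, pos.limit):
        ch = text[i]
        if ch != "γ":
            out.append(ch)
            continue
        # Look ahead only as far as the context limit allows.
        following = text[i + 1] if i + 1 < pos.context_limit else ""
        out.append("n" if following in FOLLOWERS else "g")
    return text[:pos.start] + "".join(out) + text[pos.limit:]


print(transliterate_gamma("aγκ", Position(0, 1, 2, 3)))   # anκ
print(transliterate_gamma("aγκ", Position(0, 1, 2, 2)))   # agκ
```

With the wider context the κ is visible and gamma becomes n; with the context cut off at the limit, the same gamma falls through to g.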
