ICU Character Conversion API



  1. ICU Character Conversion API
     Markus Scherer, ICU Team, IBM Cupertino
     First ICU Developer Workshop

  2. Codepages
     • Codepages, character sets, etc. are collections of coded characters
     • For text exchange they are byte-serialized: characters must be converted to and from a byte stream
     • One character may need 1..4 bytes
     • Some are stateful, with SI/SO/ESC…
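To make the "stateful" bullet concrete, here is a minimal sketch in plain C of a toy SI/SO stream, assuming the EBCDIC-style convention that SO (0x0E) switches to double-byte mode and SI (0x0F) switches back to single-byte mode. The function name `count_chars` is hypothetical and this is not ICU code; it only shows why a decoder has to carry state from one byte to the next.

```c
#include <stddef.h>

enum {
    TOY_SO = 0x0E,  /* shift out: switch to double-byte mode (assumed convention) */
    TOY_SI = 0x0F   /* shift in: switch back to single-byte mode */
};

/* Count the characters in a toy SI/SO stateful byte stream.
 * Returns -1 if the stream ends in the middle of a double-byte character. */
long count_chars(const unsigned char *s, size_t len)
{
    int double_byte = 0;   /* the decoder state carried between bytes */
    long count = 0;
    size_t i = 0;

    while (i < len) {
        if (s[i] == TOY_SO) {
            double_byte = 1;
            i++;
        } else if (s[i] == TOY_SI) {
            double_byte = 0;
            i++;
        } else if (double_byte) {
            if (i + 1 >= len) {
                return -1;          /* truncated double-byte character */
            }
            i += 2;                 /* two bytes form one character */
            count++;
        } else {
            i++;                    /* one byte forms one character */
            count++;
        }
    }
    return count;
}
```

The same byte value can therefore be a whole character or half of one, depending on the shift state that precedes it.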

  3. Unicode
     • Unicode is a coded character set
     • Its repertoire is a superset of most other codepages’ repertoires
     • Several encoding schemes for exchange
     • ICU internal: UTF-16 with platform endianness
     • Internal use is based on 16-bit units (UChars), not bytes (character encoding form)

  4. ICU Codepage Conversion
     • ICU has one conversion API with several implementations
     • Each is optimized for a type of encoding, transparent to the user
     • All conversions are between internal UTF-16 UChars and external codepage bytes
     • Non-Unicode codepages need mapping tables: .ucm sources -> .cnv binary

  5. ICU Conversion Capabilities
     • Support for several Unicode encodings: UTF-8, UTF-16 (either endianness)
     • Support for mapping tables for general encodings with 1..4 bytes per character
     • Support for mappings to/from surrogate pairs (Unicode above U+ffff)
     • Stateful: ISO-2022, EBCDIC MBCS
     • Lotus LMBCS
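The surrogate-pair bullet can be illustrated with the standard UTF-16 arithmetic. The sketch below is plain C, not an ICU function; the name `encode_utf16` is hypothetical, and `uint16_t` stands in for ICU's 16-bit UChar.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid Unicode scalar value. */
int encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;                               /* out of range, or a lone surrogate */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* up to U+ffff: one 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* lead surrogate: high 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* trail surrogate: low 10 bits */
    return 2;
}
```

For example, U+10000 becomes the pair D800 DC00, which is why a mapping table entry for a character above U+ffff has to produce or consume two 16-bit units.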

  6. General Limitations
     • ICU converters only map each code point from one encoding to a code point in another encoding
     • No reordering or other transformations for different character models: directionality, composing chars, localized digits, vowel reordering, …
     • For such transformations, additionally use BiDi/Transliteration/Shaping APIs

  7. ICU 1.6 Limitations
     • Missing Unicode encodings: UTF-32
     • SCSU is in a separate API, not a regular converter
     • ISO-2022: only the “JP” country variant

  8. Codepage Names and Aliases
     • Most codepages have several names
     • MIME, IANA: name lists for the Internet
     • IBM, MS: numeric names
     • Many OSes use their own names
     • ICU: internal name + aliases
     • See icu/data/convrtrs.txt
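The "internal name + aliases" idea can be sketched as a small lookup table in plain C. Everything here is an assumption for illustration: the table entries are a hand-picked toy list rather than the contents of convrtrs.txt, and the loose comparison (ignore case and the separators '-', '_' and space) is one plausible matching rule, not a statement of how ICU 1.6 compares names.

```c
#include <ctype.h>
#include <stddef.h>

/* Toy alias table: each row maps one name to a canonical name. */
typedef struct {
    const char *alias;
    const char *canonical;
} AliasEntry;

static const AliasEntry alias_table[] = {
    { "ISO-8859-1", "ISO-8859-1" },
    { "latin1",     "ISO-8859-1" },
    { "ibm-819",    "ISO-8859-1" },
    { "US-ASCII",   "US-ASCII"   },
    { "ascii",      "US-ASCII"   },
    { "ibm-367",    "US-ASCII"   },
};

/* Compare two names, ignoring case and the separators '-', '_' and ' '. */
static int loose_equal(const char *a, const char *b)
{
    for (;;) {
        while (*a == '-' || *a == '_' || *a == ' ') a++;
        while (*b == '-' || *b == '_' || *b == ' ') b++;
        if (*a == '\0' || *b == '\0') {
            return *a == *b;        /* equal only if both names ended */
        }
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
}

/* Return the canonical name for any known alias, or NULL if unknown. */
const char *canonical_name(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof alias_table / sizeof alias_table[0]; i++) {
        if (loose_equal(name, alias_table[i].alias)) {
            return alias_table[i].canonical;
        }
    }
    return NULL;
}
```

With this toy table, "Latin_1", "ibm819" and "iso8859-1" all resolve to the same canonical name.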

  9. ICU Conversion API
     • 3 main functions:
       • Streaming: ucnv_toUnicode(), ucnv_fromUnicode()
       • Forward character iteration: ucnv_getNextUChar()
       • Convenience functions for all-in-one conversion: ucnv_to/fromUChars(), etc.

  10. Buffer Management
      • Streaming functions modify the source & target pointers, and try to read the entire source & fill the target
      • Allow converting a stream in chunks with multiple calls; the converter object holds state
      • Target full: U_BUFFER_OVERFLOW_ERROR
      • Source empty: no error or U_TRUNCATED_CHAR_FOUND (at end of stream)
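The pointer-advancing contract described above can be sketched with a toy converter in plain C. This is not ICU code: `ToyConverter`, `ToyStatus` and `toy_to_units` are hypothetical names, and the "encoding" is simply big-endian byte pairs turned into 16-bit units. The statuses only mimic the roles of U_BUFFER_OVERFLOW_ERROR and U_TRUNCATED_CHAR_FOUND.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_OK, TOY_BUFFER_OVERFLOW, TOY_TRUNCATED } ToyStatus;

/* State kept between calls: a first byte whose partner has not arrived. */
typedef struct {
    int has_pending;
    unsigned char pending;
} ToyConverter;

/* Convert big-endian byte pairs to 16-bit units.
 * Advances *source and *target past everything consumed and produced.
 * flush is nonzero when this chunk is the end of the stream. */
ToyStatus toy_to_units(ToyConverter *cnv,
                       uint16_t **target, const uint16_t *target_limit,
                       const unsigned char **source,
                       const unsigned char *source_limit,
                       int flush)
{
    while (*source < source_limit) {
        if (!cnv->has_pending) {
            cnv->pending = *(*source)++;     /* consume first half, remember it */
            cnv->has_pending = 1;
            continue;
        }
        if (*target >= target_limit) {
            return TOY_BUFFER_OVERFLOW;      /* target full, source not finished */
        }
        *(*target)++ = (uint16_t)((cnv->pending << 8) | **source);
        (*source)++;
        cnv->has_pending = 0;
    }
    if (flush && cnv->has_pending) {
        cnv->has_pending = 0;
        return TOY_TRUNCATED;                /* stream ended inside a character */
    }
    return TOY_OK;                           /* source empty, no error */
}
```

Because the half-read character lives in the converter object, the caller may split the input at any byte boundary and simply call again with the next chunk or with a fresh target buffer.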

  11. C++ Conversion API
      • Wrapper around most of C API
      • Provides same basic streaming functions
      • Convenience functions for UnicodeString
      • More convenient from C++
      • No getNextUChar(), no custom callbacks

  12. Basic vs Convenience Function
      • Basic streaming functions allow
        • Conversion of arbitrarily large text with limited buffers
        • Offset mappings: corresponding source-target characters
      • Convenience functions are easier to use for single-buffer conversions, no offsets

  13. Callbacks for Exceptions
      • Callback functions are called when the source is malformed (illegal sequence) or does not encode a character (unassigned)
      • Several callbacks provided by ICU for stopping with error, replacing with substitution character (default), …
      • User-customizable: set user callback function and handle exceptions
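The callback idea can be sketched in plain C with a function pointer that the converter calls for each unit it cannot map. All names here (`ToyCallback`, `toy_cb_stop`, `toy_cb_substitute`, `toy_from_units`) are hypothetical and the signatures are not those of ICU's callback API; the target "codepage" is 7-bit ASCII, and 0x1A (the SUB control) is an assumed substitution byte for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* A callback returns 1 after writing a replacement byte, or 0 to stop. */
typedef int (*ToyCallback)(const void *context, uint16_t unit,
                           unsigned char *replacement);

/* Stop with an error on the first unassigned character. */
int toy_cb_stop(const void *context, uint16_t unit, unsigned char *replacement)
{
    (void)context; (void)unit; (void)replacement;
    return 0;
}

/* Replace each unassigned character with a substitution byte. */
int toy_cb_substitute(const void *context, uint16_t unit,
                      unsigned char *replacement)
{
    (void)context; (void)unit;
    *replacement = 0x1A;
    return 1;
}

/* Convert n 16-bit units to ASCII bytes; dst must have room for n bytes.
 * Returns the number of bytes written, or -1 if the callback stopped. */
long toy_from_units(const uint16_t *src, size_t n, unsigned char *dst,
                    ToyCallback callback, const void *context)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (src[i] < 0x80) {
            dst[i] = (unsigned char)src[i];          /* assigned in ASCII */
        } else if (!callback(context, src[i], &dst[i])) {
            return -1;                               /* callback chose to stop */
        }
    }
    return (long)n;
}
```

A user-defined callback with the same signature could, for instance, count the unassigned characters through `context` before substituting.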

  14. Fallback Mappings
      • Character sets often have different repertoires
      • Sometimes, if no precise mapping exists, a “good-enough” fallback mapping is ok
      • ucnv_setFallback() (default: no fallbacks)
      • For XML/HTML, for example, it is better to use an escape sequence like &#xx…x;
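As an illustration of the escape-sequence alternative, this small plain-C helper formats a code point as an XML/HTML hexadecimal numeric character reference. The name `format_ncr` is hypothetical and the helper is not part of ICU; in a real converter this kind of output would typically be produced from a callback for unassigned characters.

```c
#include <stdio.h>
#include <stdint.h>

/* Write a hexadecimal numeric character reference for cp into buf.
 * Returns what snprintf returns: the length of the full reference,
 * which is >= size if buf was too small. */
int format_ncr(uint32_t cp, char *buf, size_t size)
{
    return snprintf(buf, size, "&#x%lX;", (unsigned long)cp);
}
```

Unlike a fallback mapping, the reference is lossless: the receiving XML or HTML parser recovers exactly the original character.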

  15. Default Converter
      • ICU default converter: name of the ICU converter that matches the system codepage
      • Can be changed; do so as early as possible
      • A mismatch with the system is problematic, and a change may not affect default converter instances

  16. Invariant Characters
      • Special encoding: common subset of codepages of a family
      • ASCII vs. EBCDIC
      • About 84 characters have same encoding within each family
      • Limited use for internal and syntactic strings where this is ok
      • Fast: no converter object

  17. SCSU: Unicode Compression
      • Described in Unicode TR 6
      • Byte-based, stateful, compact
      • Can approximate text size of special codepage
      • IANA-registered charset
      • ICU: separate API, very similar to streaming conversion functions
      • But: reads/writes only complete sequences

  18. Buffering II
      • The general conversion API will consume the entire source if there is enough space in the target, or fill the entire target if there is enough source
      • This holds even if source/target characters are split
      • The SCSU API consumes and writes only whole units; it may leave the source non-empty and the target non-full

  19. Streaming Conversion Loop
      while(source available) {
          source buffer is empty, fill it
          do {
              to/fromUnicode();
              write contents of target
          } while(buffer overflow);
          if(failure other than buffer overflow) {
              report error
          }
      }
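The loop on this slide can be written out as compilable C around a toy converter. This is a sketch under stated assumptions, not ICU code: `demo_convert` stands in for to/fromUnicode() and just turns big-endian byte pairs into 16-bit units, the input "stream" and output "sink" are in-memory arrays, and all names and buffer sizes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { DEMO_OK, DEMO_OVERFLOW, DEMO_TRUNCATED } DemoStatus;
typedef struct { int has_pending; unsigned char pending; } DemoConverter;

/* Toy stand-in for a streaming converter: advances *src and *dst. */
static DemoStatus demo_convert(DemoConverter *cnv,
                               uint16_t **dst, const uint16_t *dst_limit,
                               const unsigned char **src,
                               const unsigned char *src_limit, int flush)
{
    while (*src < src_limit) {
        if (!cnv->has_pending) {
            cnv->pending = *(*src)++;
            cnv->has_pending = 1;
            continue;
        }
        if (*dst >= dst_limit) return DEMO_OVERFLOW;
        *(*dst)++ = (uint16_t)((cnv->pending << 8) | *(*src)++);
        cnv->has_pending = 0;
    }
    return (flush && cnv->has_pending) ? DEMO_TRUNCATED : DEMO_OK;
}

enum { SRC_CHUNK = 8, DST_CHUNK = 2 };   /* deliberately small buffers */

/* Convert in[0..in_len) into out; returns units written, or -1 on error. */
long convert_stream(const unsigned char *in, size_t in_len,
                    uint16_t *out, size_t out_cap)
{
    DemoConverter cnv = {0, 0};
    unsigned char src_buf[SRC_CHUNK];
    uint16_t dst_buf[DST_CHUNK];
    size_t in_pos = 0, out_len = 0;

    while (in_pos < in_len) {                       /* while(source available) */
        size_t left = in_len - in_pos;
        size_t n = left < SRC_CHUNK ? left : SRC_CHUNK;
        memcpy(src_buf, in + in_pos, n);            /* source buffer is empty, fill it */
        in_pos += n;

        const unsigned char *src = src_buf;
        int flush = (in_pos == in_len);             /* last chunk of the stream */
        DemoStatus status;
        do {
            uint16_t *dst = dst_buf;
            status = demo_convert(&cnv, &dst, dst_buf + DST_CHUNK,
                                  &src, src_buf + n, flush);
            size_t produced = (size_t)(dst - dst_buf);
            if (out_len + produced > out_cap) return -1;
            memcpy(out + out_len, dst_buf,          /* write contents of target */
                   produced * sizeof dst_buf[0]);
            out_len += produced;
        } while (status == DEMO_OVERFLOW);
        if (status != DEMO_OK) return -1;           /* failure other than overflow */
    }
    return (long)out_len;
}
```

The inner do/while exists because one source chunk may produce more output than one target buffer holds; the source buffer is refilled only after the converter has consumed all of it.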

  20. Streaming Loop with SCSU
      while(source available) {
          source buffer may not be empty, append to it
          [de]compress();
          write contents of target
          if(failure other than buffer overflow) {
              report error
          }
          move rest of current source to start of buffer
      }
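The "move rest of current source to start of buffer" step is the part that differs from the general loop, and it can be sketched in plain C. This is not SCSU and not ICU code: `whole_units_only` is a hypothetical stand-in for [de]compress() that converts only complete big-endian byte pairs and leaves a trailing odd byte unconsumed, which is enough to show the buffer handling. Buffer sizes are chosen so the target can always hold what one source buffer produces.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a whole-units-only API: converts complete 2-byte
 * sequences, advances *src, and leaves any incomplete tail unread.
 * Returns the number of 16-bit units written. */
static size_t whole_units_only(uint16_t *dst, size_t dst_cap,
                               const unsigned char **src,
                               const unsigned char *src_limit)
{
    size_t written = 0;
    while (src_limit - *src >= 2 && written < dst_cap) {
        dst[written++] = (uint16_t)(((*src)[0] << 8) | (*src)[1]);
        *src += 2;
    }
    return written;
}

enum { BUF_SIZE = 5, OUT_SIZE = 4 };

/* Convert in[0..in_len) into out; returns units written, or -1 on error. */
long convert_whole_units(const unsigned char *in, size_t in_len,
                         uint16_t *out, size_t out_cap)
{
    unsigned char buf[BUF_SIZE];
    uint16_t tmp[OUT_SIZE];
    size_t held = 0;                 /* unconsumed bytes at the start of buf */
    size_t in_pos = 0, out_len = 0;

    while (in_pos < in_len) {                        /* while(source available) */
        size_t room = BUF_SIZE - held;
        size_t left = in_len - in_pos;
        size_t n = left < room ? left : room;
        memcpy(buf + held, in + in_pos, n);          /* append to the source buffer */
        in_pos += n;
        held += n;

        const unsigned char *src = buf;
        size_t produced = whole_units_only(tmp, OUT_SIZE, &src, buf + held);
        if (out_len + produced > out_cap) return -1;
        memcpy(out + out_len, tmp, produced * sizeof tmp[0]);  /* write target */
        out_len += produced;

        held = (size_t)(buf + held - src);           /* bytes left unconsumed */
        memmove(buf, src, held);                     /* move rest to start of buffer */
    }
    return held == 0 ? (long)out_len : -1;           /* leftover bytes: truncated */
}
```

Here the caller, not the converter object, is responsible for keeping the partial sequence between calls.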

  21. API Changes for ICU 1.6
      • Streaming functions: at full target, they used to set U_INDEX_OUTOFBOUNDS_ERROR, which is still used for insufficient input to ucnv_getNextUChar()
      • Callback API changed: new function signatures, hiding internal structures (UConverter!), new helper functions

  22. Future Enhancements
      • UTF-32 (either endianness)
      • SCSU as regular converter
      • More country variants for ISO-2022
      • Collecting more precise mapping and alias tables
