
Globalisation & Computer Systems week 5


swain




Presentation Transcript


  1. Globalisation & Computer Systems week 5 1. Localisation presentations 2. Character representation and UNICODE • UNICODE design principles • UNICODE character semantics 3. Lab session • finish code page work • creating and browsing UNICODE characters

  2. Representation and UNICODE • What about Chinese? • Thousands of characters – 256 bit-patterns clearly not enough • Make the unit of encoding bigger… • 16 bits per character instead of 8, which gives 65,536 bit-patterns • UNICODE

  3. Representation and UNICODE • Reference: The Unicode Standard, Version 4.1.0. Online: http://www.unicode.org/unicode/uni2book/

  4. UNICODE – design principles • Principle 1: 16-bit character codes • For code pages, characters share 8-bit code points – which character a value means is determined by interpretation, i.e. by the code page in use • For UNICODE each character is assigned a unique code point • 65,536 code values available • Byte 1: 256 values × Byte 2: 256 values • 63,488 for character representation; remaining 2,048 reserved as surrogates for extended 32-bit codes • This gives 1,048,544 further code values to cover all languages
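The arithmetic on this slide can be checked with a short sketch. The surrogate ranges used here (U+D800 to U+DBFF for the first half of a pair, U+DC00 to U+DFFF for the second) are not stated on the slide; they are taken from the Unicode Standard, and the helper name `to_surrogates` is made up for illustration. The 1,024 × 1,024 pairs give 1,048,576 combinations; the slide's figure of 1,048,544 is 32 fewer, which would be consistent with leaving out the two noncharacter code points at the end of each of the 16 supplementary planes, although that reading is an assumption.

```python
# Sizes of the 16-bit code space and of the surrogate ranges.
CODE_VALUES_16BIT = 2 ** 16                  # 65,536 values in 16 bits
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1        # 1,024 first-half values
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1         # 1,024 second-half values
RESERVED = HIGH_SURROGATES + LOW_SURROGATES  # 2,048 reserved values

direct_code_values = CODE_VALUES_16BIT - RESERVED       # left for characters
surrogate_pairs = HIGH_SURROGATES * LOW_SURROGATES      # extended code points


def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Cross-check against Python's own UTF-16 encoder for U+1F600.
high, low = to_surrogates(0x1F600)
encoded = chr(0x1F600).encode("utf-16-be")
python_pair = (int.from_bytes(encoded[:2], "big"),
               int.from_bytes(encoded[2:], "big"))
```

Here `direct_code_values` and `surrogate_pairs` hold the two totals, and `(high, low)` should match `python_pair`.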

  5. UNICODE – design principles • Principle 2: allocation of code space • General scripts area: alphabetic • CJK Ideographs – 27484 ideographs • Hangul syllables – 11172 Korean Hangul syllables • 1st 128 code points for Latin • Punctuation symbols grouped together
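Two of the allocation facts above can be verified from Python's bundled character database. This is a sketch only: the Hangul Syllables range U+AC00 to U+D7A3 is taken from the Unicode Standard rather than from the slide, and the variable names are illustrative.

```python
import unicodedata

# Hangul Syllables block (range assumed from the Unicode Standard).
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
hangul_count = HANGUL_LAST - HANGUL_FIRST + 1

# Every code point in the range should carry a HANGUL SYLLABLE name.
all_hangul = all(
    unicodedata.name(chr(cp)).startswith("HANGUL SYLLABLE")
    for cp in range(HANGUL_FIRST, HANGUL_LAST + 1)
)

# The first 128 code points hold the Latin letters (plus digits,
# punctuation and controls). Unnamed control codes get "" as a default.
latin_letters = [
    chr(cp) for cp in range(128)
    if unicodedata.name(chr(cp), "").startswith("LATIN")
]
```

`hangul_count` reproduces the 11,172 figure on the slide, and `latin_letters` contains the 52 upper- and lowercase Latin letters.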

  6. UNICODE – design principles • Principle 3: efficiency • All characters have equal status, i.e. no escape characters • Characters of a common script grouped together as far as is possible • Common punctuation shared

  7. Design principles Principle 4: logical and display order • Logical order: how the code is ordered in memory: follows time sequence of input • …and ‘logically’ that is L-R • Dynamically composed characters: base character ordered ‘before’, i.e. to the left of, the modifying character
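A small sketch of logical order, using Python's `unicodedata` module. The Hebrew word and the S-plus-caron sequence are illustrative choices, not examples from the lecture. The string is stored in typing order regardless of how a renderer later lays it out on screen.

```python
import unicodedata

# Hebrew "shalom", written with escapes so this source has no
# right-to-left text. The first code point is the first letter typed,
# even though it is displayed at the right-hand end of the word.
word = "\u05E9\u05DC\u05D5\u05DD"
names_in_memory_order = [unicodedata.name(ch) for ch in word]

# A dynamically composed character: base letter first, modifier second.
composed = "S\u030C"
base, mark = composed[0], composed[1]
mark_name = unicodedata.name(mark)
```

Indexing `word[0]` always yields the first letter in reading order; display order is left to the rendering layer.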

  8. Design principles Principle 5: plain text and rich text • Unicode encodes unformatted plain text, where the rendering aim is legibility only • Formatting: extra data, which gives rich text • How to preserve the plain text requirements? • Have layers: plain text representing the characters, plus data on how they are formatted • Use mark-up languages: content + tags

  9. Design principles Principle 6: unification • Share characters where you can: • Mixed writing systems • Ideographs common to CJK • Punctuation

  10. Character semantics • Character name • Representative glyph • Properties
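As a sketch of how these semantics can be inspected in the lab, Python's standard `unicodedata` module exposes the character name and several properties. The helper `describe` and its dictionary keys are hypothetical names chosen for this example.

```python
import unicodedata


def describe(ch):
    """Return the code point, name and a few properties of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "glyph": ch,  # how it looks depends on the font in use
    }


info = describe("\u0160")

# lookup() goes the other way: from character name to character.
same_char = unicodedata.lookup("LATIN CAPITAL LETTER S WITH CARON")
```

The glyph shown for a character is only representative; the name and properties are what the standard fixes.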

  11. Property 1: Case • A letter in the alphabet has several variants • UPPERCASE variant • lowercase variant • Five scripts which have case: • Latin, Greek, Cyrillic, Armenian, archaic Georgian
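Case mapping for four of the scripts listed can be tried with Python's built-in string methods. The sample letters are illustrative; Georgian is left out because its case behaviour differs between versions of the Unicode data.

```python
# (uppercase, lowercase) sample letters, written as escapes.
case_pairs = {
    "Latin": ("A", "a"),
    "Greek": ("\u03A3", "\u03C3"),      # capital and small sigma
    "Cyrillic": ("\u0416", "\u0436"),   # capital and small zhe
    "Armenian": ("\u0531", "\u0561"),   # capital and small ayb
}

round_trips = {
    script: (upper.lower() == lower and lower.upper() == upper)
    for script, (upper, lower) in case_pairs.items()
}

# A character from a script without case is left unchanged.
ideograph = "\u4E94"
unchanged = ideograph.upper() == ideograph == ideograph.lower()
```

Every entry in `round_trips` should be true, and `unchanged` shows that case mapping is a no-op for a CJK ideograph.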

  12. Property 2: Decomposition • A character which is equivalent to a sequence of one or more other characters • Š = S + ˇ • U+0160 (Latin Extended-A) = U+0053 (Basic Latin) + U+030C (Combining Diacritical Marks)
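The decomposition of Š can be read directly from Python's `unicodedata` module, and normalization converts between the single-character and decomposed forms. A minimal sketch:

```python
import unicodedata

s_caron = "\u0160"

# The decomposition property is a string of hexadecimal code points.
mapping = unicodedata.decomposition(s_caron)

# NFD splits the character into base + combining mark;
# NFC puts the sequence back together.
decomposed = unicodedata.normalize("NFD", s_caron)
recomposed = unicodedata.normalize("NFC", decomposed)
```

Note that `s_caron == decomposed` is false as a plain string comparison, which is why text is usually normalized before comparing.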

  13. Property 3: Combining class • Base character • i.e. no special graphical combining behaviour when following another character • Combining character • Some characters have shape-change or position behaviour when combining with other characters • Non-spacing combining character • Does not take up space, e.g. diacritics • Spacing combining character • Takes up space as though a base character

  14. Property 3: Combining class • Sequence is a convention: • Base character + combining character • Symbol: dotted circle, representing the space of the base character, and combining character positioned relative to the circle • Stacking of diacritics follows the convention: • Move from the base character outwards
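The distinctions on the last two slides can be sketched with `unicodedata`. The helper `classify` is hypothetical and deliberately simplified: it treats everything that is not a non-spacing mark (category Mn) or spacing mark (category Mc) as a base character. The sample characters are illustrative.

```python
import unicodedata


def classify(ch):
    """Rough three-way split based on the general category (simplified)."""
    category = unicodedata.category(ch)
    if category == "Mn":
        return "non-spacing combining"
    if category == "Mc":
        return "spacing combining"
    return "base"


kinds = {
    "S": classify("S"),
    "caron": classify("\u030C"),           # COMBINING CARON
    "vowel sign": classify("\u093E"),      # DEVANAGARI VOWEL SIGN AA
}

# Canonical combining classes: 0 for a base, non-zero for most diacritics.
classes = [unicodedata.combining(ch) for ch in "S\u030C\u0323"]

# Marks on different sides of the base are put into a fixed order by
# normalization: the dot below (class 220) sorts before the acute (230).
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")
```

The sequence always starts with the base character, followed by its combining characters.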

  15. Property 4: Directionality • Two directionality types: • Left to Right • Right to Left (Arabic, Hebrew, Syriac, Thaana) • Logical sequence: Left to Right
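The directionality property can be queried with `unicodedata.bidirectional`. The character database uses finer-grained class codes than the two types on the slide: in this sketch, `L` is left-to-right, `R` is right-to-left, and `AL` is the right-to-left class used for Arabic letters. The sample letters are illustrative.

```python
import unicodedata

samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05D0",
    "Arabic alef": "\u0627",
}

bidi_classes = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}

# Group the class codes into the two directions named on the slide.
RIGHT_TO_LEFT = {"R", "AL"}
is_rtl = {label: cls in RIGHT_TO_LEFT for label, cls in bidi_classes.items()}
```

A renderer uses these classes to work out display order; the stored string stays in logical order.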

  16. Property 5: General Category • The full character space is partitioned into several major categories: • Letters • Punctuation • Symbols • Numbers • Examples of general category codes: • Lu: letter, uppercase; Ll: letter, lowercase • Nd: number, decimal digit; No: number, other
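The category codes listed above are what `unicodedata.category` returns. A short sketch with illustrative sample characters; the first letter of each code gives the major category.

```python
import unicodedata

samples = ["A", "a", "5", "\u00BD", "!", "+", "$"]  # \u00BD is one half

codes = {ch: unicodedata.category(ch) for ch in samples}

# The first letter of the code names the major category.
MAJOR = {"L": "Letter", "N": "Number", "P": "Punctuation", "S": "Symbol"}
major = {ch: MAJOR[code[0]] for ch, code in codes.items()}
```

For example, `codes["A"]` is `Lu` (letter, uppercase) and `codes["5"]` is `Nd` (number, decimal digit).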

  17. Property 6: Numeric value • For characters that represent numbers • Decimal digits • Fractions • Subscripts and superscripts • Currency numerators • Portion of the CJK ideographs: e.g. U+4E94
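Numeric values can be read with `unicodedata.numeric`, which returns a float. This sketch covers a decimal digit, a fraction, a superscript, a subscript and the ideograph U+4E94 from the slide; the other sample characters are illustrative.

```python
import unicodedata

samples = {
    "digit seven": "7",
    "one half": "\u00BD",
    "superscript two": "\u00B2",
    "subscript two": "\u2082",
    "CJK five": "\u4E94",
}

values = {label: unicodedata.numeric(ch) for label, ch in samples.items()}

# Characters with no numeric value raise ValueError unless a default
# is supplied as the second argument.
letter_value = unicodedata.numeric("A", None)
```

Only decimal digits also answer to `unicodedata.decimal`; fractions and ideographs have a numeric value but no decimal digit value.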

  18. Property 7: Mirrored property • For characters that have equivalent mirror image characters, e.g. ‘(’ • Important for directionality
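The mirrored property is available as `unicodedata.mirrored`, which returns 1 for characters that should be mirrored in right-to-left text and 0 otherwise. A brief sketch with illustrative characters:

```python
import unicodedata

samples = ["(", ")", "<", "[", "A", "5"]

mirrored = {ch: unicodedata.mirrored(ch) for ch in samples}

# Keep only the characters flagged as mirrored.
mirrored_chars = [ch for ch, flag in mirrored.items() if flag == 1]
```

In right-to-left text an opening parenthesis is drawn with the shape of ‘)’, so that it still encloses the text it precedes in logical order.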

  19. Character properties Summary • 1. Case • 2. Decomposition • 3. Combining class • 4. Directionality • 5. General category • 6. Numeric value • 7. Mirrored property
