Globalisation & Computer Systems week 5

1. Localisation presentations
2. Character representation and UNICODE
   • UNICODE design principles
   • UNICODE character semantics
3. Lab session
   • finish code page work
   • creating and browsing UNICODE characters

Representation and UNICODE
• What about Chinese?
• Thousands of characters: the 256 bit-patterns of an 8-bit byte are clearly not enough
• Make the character codes bigger…
• 16-bit codes give 65,536 bit-patterns
• UNICODE

Representation and UNICODE
• Reference: The Unicode Standard, Version 4.1.0.
• Online: http://www.unicode.org/unicode/uni2book/

UNICODE – design principles
• Principle 1: 16-bit character codes
• For code pages, characters share the 8-bit code points; which character a byte means is determined by interpretation (the code page in use)
• For UNICODE, each character is assigned a unique code point
• 65,536 code values available
• Byte 1: 256 values × Byte 2: 256 values
• 63,486 for character representation; 2,048 reserved as surrogates, used in pairs (2 × 16 bits) for extended codes; U+FFFE and U+FFFF are not characters
• Surrogate pairs give a further 1,048,544 code values to cover all languages
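The arithmetic behind the 16-bit budget and the paired extension codes can be sketched in a few lines. This is a minimal illustration, not part of the lecture material: the function and constant names are invented here, and the counts assume that the two noncharacters at the end of each plane (such as U+FFFE and U+FFFF) are excluded from the totals.

```python
# Surrogate arithmetic for 16-bit code values (names are illustrative only).
HIGH_START = 0xD800   # first of 1,024 high surrogates
LOW_START = 0xDC00    # first of 1,024 low surrogates

def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) pair of 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)

def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)

total_16_bit = 256 * 256                # 65,536 code values
reserved = 2 * 1024                     # 2,048 surrogate code values
single = total_16_bit - reserved - 2    # assumption: U+FFFE and U+FFFF excluded
paired = 1024 * 1024 - 2 * 16           # assumption: 2 noncharacters per plane excluded

high, low = to_surrogates(0x1D11E)      # MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))              # 0xd834 0xdd1e
print(single, paired)                   # 63486 1048544
```

Each half of a pair carries 10 bits of the offset, which is why 1,024 high values × 1,024 low values are enough for over a million extra code points.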

UNICODE – design principles
• Principle 2: allocation of code space
• General scripts area: alphabetic
• CJK Ideographs: 27,484 ideographs
• Hangul syllables: 11,172 Korean Hangul syllables
• First 128 code points for Latin (same values as ASCII)
• Punctuation symbols grouped together
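The block sizes on this slide can be checked from code point ranges. The ranges below are an assumption: they are the Hangul Syllables block and the two CJK ideograph ranges (Unified Ideographs and Extension A) whose sizes add up to the slide's figures; later Unicode versions extend the CJK ranges.

```python
import unicodedata

def block_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1

hangul = block_size(0xAC00, 0xD7A3)        # Hangul Syllables
cjk_unified = block_size(0x4E00, 0x9FA5)   # CJK Unified Ideographs (assumed range)
cjk_ext_a = block_size(0x3400, 0x4DB5)     # CJK Extension A (assumed range)

print(hangul)                    # 11172
print(cjk_unified + cjk_ext_a)   # 27484

# The first 128 code points keep their ASCII values.
print(all(chr(cp).encode("ascii")[0] == cp for cp in range(128)))  # True
print(unicodedata.name(chr(0x41)))  # LATIN CAPITAL LETTER A
```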

UNICODE – design principles
• Principle 3: efficiency
• All characters have equal status, i.e. no escape characters
• Characters of a common script are grouped together as far as possible
• Common punctuation is shared

Design principles
• Principle 4: logical and display order
• Logical order: how the codes are ordered in memory; it follows the time sequence of input
• …and 'logically' that sequence runs left to right (L-R), whatever the display direction
• Dynamically composed characters: the base character is ordered 'before', i.e. to the left of, the modifying character
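A small sketch of logical order, using Python strings as the "memory". The sample text (three Latin letters followed by three Hebrew letters, and S followed by a combining caron) is chosen here for illustration only.

```python
import unicodedata

# Typed in this order: a, b, c, then Hebrew alef, bet, gimel.
text = "abc" + "\u05D0\u05D1\u05D2"

# Indexing follows input order, even though the Hebrew run is displayed right to left.
for index, ch in enumerate(text):
    print(index, f"U+{ord(ch):04X}", unicodedata.name(ch))

# A dynamically composed character: the base comes first, then the modifier.
composed = "S" + "\u030C"   # S followed by COMBINING CARON
print([unicodedata.name(ch) for ch in composed])
```

The display engine, not the stored sequence, is responsible for drawing the Hebrew run right to left.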

Design principles
• Principle 5: plain text and rich text
• Unicode encodes unformatted plain text, where the only rendering aim is legibility
• Formatting is extra data, which gives rich text
• How to preserve plain text requirements?
   • Have layers: plain text representing the characters, plus data describing how they are formatted
   • Use mark-up languages: content + tags

Design principles
• Principle 6: unification
• Share characters where you can:
   • Mixed writing systems
   • Ideographs common to CJK
   • Punctuation

Character semantics
• Character name
• Representative glyph
• Properties
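Character names can be browsed from Python's standard `unicodedata` module, which may be handy in the lab. The sketch below only covers names; the representative glyph is a matter for the code charts and fonts, and the sample characters are chosen here for illustration.

```python
import unicodedata

for ch in ["A", "\u0160", "\u4E94"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Names are unique, so a name can be mapped back to its character.
s_caron = unicodedata.lookup("LATIN CAPITAL LETTER S WITH CARON")
print(f"U+{ord(s_caron):04X}")   # U+0160
```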

Property 1: Case
• A letter in the alphabet has several variants
   • UPPERCASE variant
   • lowercase variant
• Five scripts have case:
   • Latin, Greek, Cyrillic, Armenian, archaic Georgian
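A quick way to see the case property at work, assuming Python's built-in string methods and `unicodedata`. The sample letters are my own choice (one per script); Georgian is left out to keep the sketch short, and a Hebrew letter is added as an example of a script without case.

```python
import unicodedata

# One lowercase letter per cased script (sample letters are illustrative).
samples = {
    "Latin": "a",
    "Greek": "\u03B1",      # alpha
    "Cyrillic": "\u0434",   # de
    "Armenian": "\u0561",   # ayb
}

for script, lower in samples.items():
    upper = lower.upper()
    print(script, lower, upper,
          unicodedata.category(lower), unicodedata.category(upper))

# A script without case: upper() leaves the letter unchanged.
alef = "\u05D0"   # HEBREW LETTER ALEF
print(alef.upper() == alef, unicodedata.category(alef))
```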

Property 2: Decomposition
• A character which is equivalent to a sequence of one or more other characters
• Š = S + ˇ
• U+0160 (Latin Extended-A) = U+0053 (Basic Latin) + U+030C (Combining Diacritical Marks)
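The Š example can be reproduced with Python's `unicodedata` module. A minimal sketch: `decomposition` returns the mapping as a string of hexadecimal code points, and the normalisation forms NFD and NFC apply the mapping in each direction.

```python
import unicodedata

s_caron = "\u0160"   # LATIN CAPITAL LETTER S WITH CARON

print(unicodedata.decomposition(s_caron))   # 0053 030C

decomposed = unicodedata.normalize("NFD", s_caron)
print([f"U+{ord(ch):04X}" for ch in decomposed])   # ['U+0053', 'U+030C']

# Composing the sequence again gives back the single code point.
recomposed = unicodedata.normalize("NFC", "S\u030C")
print(recomposed == s_caron)   # True
```

The two spellings look the same on screen but compare unequal as raw strings, which is why normalising before comparison matters.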

Property 3: Combining class
• Base character
   • i.e. no special graphical combining behaviour when following another character
• Combining character
   • Some characters change shape or position when combining with other characters
• Non-spacing combining character
   • Does not take up space, e.g. diacritics
• Spacing combining character
   • Takes up space as though it were a base character

Property 3: Combining class
• Sequence is a convention:
   • Base character + combining character
• Symbol: dotted circle, representing the space of the base character, with the combining character positioned relative to the circle
• Stacking of diacritics follows the convention:
   • Move from the base character outwards
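The base, non-spacing and spacing distinction can be approximated from the general category codes Mn and Mc in Python's `unicodedata`. This is a simplified sketch: the helper name `combining_kind` is invented here, and everything that is not Mn or Mc is simply reported as a base character.

```python
import unicodedata

def combining_kind(ch):
    """Rough classification: Mn = non-spacing mark, Mc = spacing mark, else base."""
    category = unicodedata.category(ch)
    if category == "Mn":
        return "non-spacing combining"
    if category == "Mc":
        return "spacing combining"
    return "base"

for ch in ["S", "\u030C", "\u093E"]:   # S, COMBINING CARON, DEVANAGARI VOWEL SIGN AA
    print(f"U+{ord(ch):04X}", combining_kind(ch), unicodedata.combining(ch))

# Stacking: two marks placed above the base are applied from the base outwards,
# so the two orders below are different texts.
inner_circumflex = "a\u0302\u0301"   # circumflex nearest the base, acute on top
inner_acute = "a\u0301\u0302"        # acute nearest the base, circumflex on top
print(inner_circumflex == inner_acute)   # False
```

`unicodedata.combining` returns the numeric combining class; 0 means the character does not reorder, and marks that sit above the base share the class 230.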

Property 4: Directionality
• Two directionality types:
   • Left to Right
   • Right to Left (Arabic, Hebrew, Syriac, Thaana)
• Logical sequence: Left to Right
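Directionality can be read per character with `unicodedata.bidirectional`. Note that the property is finer-grained than the two directions on the slide: the sketch below shows "L" for left to right, "R" and "AL" for right-to-left letters, and "EN" for European digits. The sample characters are my own choice, one per script.

```python
import unicodedata

samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05D0",
    "Arabic alef": "\u0627",
    "Syriac alaph": "\u0710",
    "Thaana haa": "\u0780",
    "Digit one": "1",
}

for label, ch in samples.items():
    print(label, unicodedata.bidirectional(ch))

def is_right_to_left(ch):
    """True for characters whose class is R or AL (a simplification)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")
```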

Property 5: General Category
• The full character space is partitioned into several major categories:
   • Letters
   • Punctuation
   • Symbols
   • Numbers
• Examples of general category codes:
   • Lu: letter, uppercase; Ll: letter, lowercase
   • Nd: number, decimal digit; No: number, other
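The two-letter codes can be looked up with `unicodedata.category`; the first letter gives the major category. A minimal sketch, with a lookup table that covers only the four major categories named on the slide (the table and function name are illustrative).

```python
import unicodedata

# Only the major categories listed on the slide; others exist.
MAJOR = {"L": "Letter", "P": "Punctuation", "S": "Symbol", "N": "Number"}

def describe(ch):
    """Return (code, major category name) for one character."""
    code = unicodedata.category(ch)
    return code, MAJOR.get(code[0], "other")

for ch in ["A", "a", "5", "\u00BD", ",", "+"]:
    print(repr(ch), *describe(ch))
```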

Property 6: Numeric value
• For characters that represent numbers:
   • Decimal digits
   • Fractions
   • Subscripts and superscripts
   • Currency numerators
   • A portion of the CJK ideographs, e.g. U+4E94
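Numeric values can be read with `unicodedata.numeric`, which returns a float. The sample characters below are chosen for illustration and include the ideograph U+4E94 from the slide; the second argument is a default returned when a character has no numeric value.

```python
import unicodedata

samples = {
    "digit five": "5",
    "vulgar fraction one half": "\u00BD",
    "superscript two": "\u00B2",
    "subscript three": "\u2083",
    "CJK ideograph U+4E94": "\u4E94",
    "letter A": "A",
}

for label, ch in samples.items():
    print(label, unicodedata.numeric(ch, None))
```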

Property 7: Mirrored property
• For characters that have equivalent mirror-image characters, e.g. '(' and ')'
• Important for directionality
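The mirrored property is exposed as a 0/1 flag by `unicodedata.mirrored`. A short sketch with a few sample characters; the flag only says that a character should be mirrored in right-to-left text, not which glyph is its mirror image.

```python
import unicodedata

for ch in ["(", ")", "[", "<", "A", "!"]:
    print(repr(ch), unicodedata.mirrored(ch))

mirrored_ascii = [chr(cp) for cp in range(128) if unicodedata.mirrored(chr(cp))]
print(mirrored_ascii)
```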

Character properties: summary
1. Case
2. Decomposition
3. Combining class
4. Directionality
5. General category
6. Numeric value
7. Mirrored property
