This guide explores diverse scripts, multilingual data formats, IPA transcriptional notations, and modes of textual data representation, storage, display, and input. Learn about character sets, Unicode, fonts, and more. Enhance your understanding of text encoding challenges and best practices.
Some issues • Data types: • diverse scripts • multilingual data • IPA and other transcriptional notations • Modes: • representation (in some scheme) • storage (using some encoding) • display (in browser, word processor, etc.) • input (with various OS, keyboards, etc.) • Your issues and challenges: • data/problems to look at now? • Friday AM advice clinic
Representing data • Symbols • Encoding: character sets, Unicode • Fonts • Relationships (e.g. links) • Structures (e.g. hierarchies)
Representing textual data • Plain text • Lacks formatting information • Transfer between applications • Internal memory • Saved in files • Encodings • Unicode • Markup • XML • HTML
Plain text • What is it? • Try saving a document as plain text … in TextEdit …
Definitions • Background on digital data storage: • Bit: 0, 1 • Byte: 8 bits, e.g. 00101100 • Definitions from Yucca Korpela’s article: • Character repertoire: a set of characters (from one or more scripts) constituting the data that can be represented • Character code: a mapping that gives each character in a repertoire a distinct numeric identifier • Character encoding: a method of mapping sequences of character codes into sequences of bytes
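The three definitions above can be seen side by side in a minimal Python sketch. The sample character is an arbitrary choice for illustration; `ord` gives the character code, and `str.encode` applies a character encoding to turn that code into bytes.

```python
# One character, its character code, and two different encodings of it.
ch = "é"                                  # LATIN SMALL LETTER E WITH ACUTE
code = ord(ch)                            # character code, as an integer
latin1_bytes = ch.encode("iso-8859-1")    # encoded as one byte
utf8_bytes = ch.encode("utf-8")           # encoded as two bytes

print(code, hex(code))
print(latin1_bytes, utf8_bytes)

# The single ISO 8859-1 byte written out as 8 bits.
bits = format(latin1_bytes[0], "08b")
print(bits)
```

The same character code produces different byte sequences depending on the encoding, which is why a file's bytes cannot be interpreted without knowing which encoding was used.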
Character encodings: ISO 8859-1 • ISO 8859-1: uses 1 byte (8 bits) to encode characters for most of the Western European languages
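A one-byte encoding has a small repertoire. As a rough illustration (the helper name `latin1_or_none` and the sample strings are made up for this sketch), Python refuses to encode a character that ISO 8859-1 does not contain:

```python
def latin1_or_none(text):
    """Return the ISO 8859-1 bytes for text, or None if it cannot be encoded."""
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None

western = latin1_or_none("façade")   # every character is in the repertoire
outside = latin1_or_none("ŋ")        # U+014B is not in ISO 8859-1

print(western, len(western))         # one byte per character
print(outside)
```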
Unicode • International standard (ISO 10646) • Industry standard (Unicode Consortium) • Aims to encode all characters from all of the world’s scripts - over 1 million code points • Privileges character semantics, not glyphic representations • Multiple encoding methods • Referencing a character: U+nnnn (in hexadecimal, base 16) • Most characters in Basic Multilingual Plane (first 65,536 character positions)
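The U+nnnn notation and the Basic Multilingual Plane boundary can be checked with a short sketch. The function name `describe` is hypothetical; `unicodedata.name` is from the Python standard library, and the names it returns depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def describe(ch):
    """Return (U+ notation, Unicode name, whether the character is in the BMP)."""
    cp = ord(ch)
    notation = f"U+{cp:04X}"          # hexadecimal, at least four digits
    name = unicodedata.name(ch, "<unnamed>")
    in_bmp = cp <= 0xFFFF             # first 65,536 positions
    return notation, name, in_bmp

print(describe("A"))
print(describe("ɲ"))                  # an IPA letter, inside the BMP
print(describe(chr(0x10348)))         # a Gothic letter, outside the BMP
```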
Unicode encodings • UTF-32: maps each code to 4 bytes; inefficient, as most commonly used characters are in the BMP • UTF-16: maps each code to either one 2-byte unit or two (a surrogate pair): • efficient and widely used • Good for the BMP • UTF-8: maps each code to 1-4 bytes • Particularly compact for Western European languages • Most widely supported across various internet protocols
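To make the size trade-offs concrete, the following sketch counts the bytes each encoding needs for a few sample characters. The helper name `encoded_lengths` and the samples are chosen for illustration; the big-endian codec names are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_lengths(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}

samples = ["a", "é", "中", chr(0x10348)]   # ASCII, Latin-1 range, CJK, non-BMP
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Only the last sample, which lies outside the BMP, needs two 2-byte units in UTF-16; UTF-32 always uses 4 bytes.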
Character semantics vs. glyphs • No difference between the letter e set in different typefaces or styles (e.g. roman, italic, bold) • IPA letter [c], voiceless palatal plosive, but same as Roman c • No separate characters for cursive scripts, joined up handwriting
Character semantics vs. glyphs • Examples • U+0041 LATIN CAPITAL LETTER A • U+0410 CYRILLIC CAPITAL LETTER A • U+0391 GREEK CAPITAL LETTER ALPHA • IPA digraphs • “Never use a character just because it looks right.”
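The three capital letters listed above usually look identical on screen, but a quick check (a sketch using only the standard `unicodedata` module) shows they are distinct characters that do not compare equal:

```python
import unicodedata

# Three characters whose glyphs typically look the same.
lookalikes = ["\u0041", "\u0410", "\u0391"]

names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X}", name)

# A search for Latin A will not find Cyrillic A or Greek Alpha.
all_distinct = len(set(lookalikes)) == 3
print(all_distinct)
```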
Precomposed characters • Single code points for complex characters (a base character plus one or more diacritics) - treated as equivalent to the corresponding sequence of base character and combining diacritics • A relevant case study: "Challenges in Writing Bih" [http://test.elar.soas.ac.uk/node/5]
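The equivalence between a precomposed character and its decomposed sequence can be sketched with Python's `unicodedata.normalize`. The example character is an arbitrary choice; note that a plain string comparison treats the two forms as different until they are normalized.

```python
import unicodedata

precomposed = "\u00E9"      # é as a single code point
decomposed = "e\u0301"      # e followed by COMBINING ACUTE ACCENT

raw_equal = precomposed == decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(raw_equal, nfc_equal, nfd_equal)
print(len(precomposed), len(decomposed))   # 1 code point vs 2
```

Normalizing all text to one form (NFC or NFD) before searching or comparing is one way to avoid missing matches.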
Compatibility characters • Similar to their decompositions, but not equivalent; they include extra information (formatting, etc.)
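As an illustration of "similar but not equivalent", the sketch below uses two compatibility characters picked for this example: the ligature U+FB01 and superscript two U+00B2. Canonical normalization (NFC) leaves them alone, while compatibility normalization (NFKC) replaces them and discards the extra formatting information.

```python
import unicodedata

ligature = "\uFB01"        # LATIN SMALL LIGATURE FI
superscript = "\u00B2"     # SUPERSCRIPT TWO

nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_superscript = unicodedata.normalize("NFKC", superscript)

print(nfc_ligature == ligature)   # canonical form keeps the ligature
print(nfkc_ligature)              # compatibility form is plain "fi"
print(nfkc_superscript)           # the superscript formatting is lost
```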
Pre-composed and compatibility characters • Why do they exist, if counter to Unicode’s focus on character semantics over glyphic representation? • Compatibility with prior encodings • No such new characters will be accepted into Unicode
Things to watch out for • An example to illustrate the difference between: • Text rendering • Document encoding
Take away message • Just because characters aren’t rendered properly doesn’t mean that they aren’t there. • Just because characters are rendered properly doesn’t guarantee that they will stay that way. • Beware your platform’s default encoding (probably not Unicode).
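One common way for text to look wrong while the data is intact is for correctly stored bytes to be read with the wrong encoding. The sketch below simulates that with a made-up sample word; it is an illustration of the encoding side of the problem, not of font rendering. It also prints the platform's default encoding, which varies by system.

```python
import locale

original = "naïve"
stored = original.encode("utf-8")            # the bytes saved in a file

# Reading the same bytes as ISO 8859-1 garbles the display...
misread = stored.decode("iso-8859-1")
print(misread)

# ...but the bytes are unchanged, so the text can be recovered.
recovered = misread.encode("iso-8859-1").decode("utf-8")
print(recovered == original)

# The default used when no encoding is given explicitly (system dependent).
print(locale.getpreferredencoding(False))
```

Passing an explicit encoding whenever a file is opened or saved avoids relying on that default.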
Adding markup • Not only should the document be Unicode, but it must declare itself as Unicode.
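For XML, the declaration at the top of the file is where a document states its encoding. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; the element name `word` and its content are hypothetical, and the `xml_declaration` argument of `tostring` requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# A tiny document containing a non-ASCII character.
root = ET.Element("word")
root.text = "ŋa"

# Serialize as UTF-8 bytes and include the XML declaration naming the encoding.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(xml_bytes)

# A parser reads the declared encoding and restores the original text.
parsed_text = ET.fromstring(xml_bytes).text
print(parsed_text)
```

In HTML the equivalent declaration is a `<meta charset="utf-8">` element in the document head.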
Exercises • What's wrong with these Unicode words? • Character encoding exercises I • http://test.elar.soas.ac.uk/taxonomy/term/1