Encoding and fonts Edward Garrett Software Developer, ELAR

Some issues • Data types: • diverse scripts • multilingual data • IPA and other transcriptional notations • Modes: • representation (in some scheme) • storage (using some encoding) • display (in browser, word processor, etc.) • input (with various OS, keyboards, etc.) • Your issues and challenges: • data/problems to look at now? • Friday AM advice clinic

Representing data • Symbols • Encoding: character sets, Unicode • Fonts • Relationships (eg links) • Structures (eg hierarchies)

Representing textual data • Plain text • Lacks formatting information • Transfer between applications • Internal memory • Saved in files • Encodings • Unicode • Markup • XML • HTML

Plain text • What is it? • Try saving a document as plain text … in TextEdit …

Definitions • Background on digital data storage: • Bit: 0, 1 • Byte: 8 bits, e.g. 00101100 • Definitions from Yucca Korpela’s article: • Character repertoire: a set of characters (from one or more scripts) constituting the data that can be represented • Character code: a mapping that gives each character in a repertoire a distinct numeric identifier • Character encoding: a method of mapping sequences of character codes into sequences of bytes
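
The three definitions above can be tried out in a few lines of Python. This is only an illustrative sketch: the sample string and variable names are made up for the example, and Python's built-in `ord` and `str.encode` stand in for "character code" and "character encoding".

```python
# Character code: each character has a distinct numeric identifier.
text = "añ"
codes = [ord(ch) for ch in text]            # [97, 241]

# Character encoding: the same codes become different byte sequences
# depending on which encoding is chosen.
latin1_bytes = text.encode("iso-8859-1")    # b'a\xf1'      (2 bytes)
utf8_bytes = text.encode("utf-8")           # b'a\xc3\xb1'  (3 bytes)

# A byte is 8 bits: show the bits of the first byte ("a").
first_byte_bits = format(utf8_bytes[0], "08b")   # '01100001'

print(codes, latin1_bytes, utf8_bytes, first_byte_bits)
```

The character codes are the same in both cases; only the byte sequences differ.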

Character encodings: ISO 8859-1 • ISO 8859-1: uses 1 byte (8 bits) to encode characters for most of the Western European languages
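
As a rough illustration of the one-byte-per-character property, and of what happens to a character outside the repertoire, here is a small Python sketch. The sample words are assumptions chosen for the example.

```python
# ISO 8859-1 stores every character it covers in exactly one byte.
western = "café"
encoded = western.encode("iso-8859-1")
one_byte_each = len(encoded) == len(western)     # True

# Characters outside its repertoire (here, Cyrillic) cannot be encoded.
try:
    "Привет".encode("iso-8859-1")
    outside_repertoire = False
except UnicodeEncodeError:
    outside_repertoire = True

print(encoded, one_byte_each, outside_repertoire)
```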

Unicode • International standard (ISO 10646) • Industry standard (Unicode Consortium) • Aims to code all characters from all of the world’s scripts - over 1 million code points • Privileges character semantics, not glyphic representations • Multiple encoding methods • Referencing a character: U+nnnn (in hexadecimal, base 16) • Most commonly used characters are in the Basic Multilingual Plane (first 65,536 code points)
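
The U+nnnn notation and the BMP boundary can be checked with Python's standard `unicodedata` module. The helper names `describe` and `in_bmp` are hypothetical, introduced only for this sketch.

```python
import unicodedata

def describe(ch):
    """Return the 'U+nnnn NAME' reference for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def in_bmp(ch):
    """True if the character lies in the first 65,536 code points."""
    return ord(ch) <= 0xFFFF

latin_a = describe("A")        # 'U+0041 LATIN CAPITAL LETTER A'
eng = describe("ŋ")            # 'U+014B LATIN SMALL LETTER ENG'

# A code point beyond the BMP (a Gothic letter at U+10348).
beyond = chr(0x10348)

print(latin_a, eng, in_bmp("ŋ"), in_bmp(beyond))
```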

Unicode encodings • UTF-32: maps each code to 4 bytes; inefficient, as most commonly used characters are in the BMP • UTF-16: maps each code to either one 2-byte unit or two (a surrogate pair): • efficient and widely used • Good for the BMP • UTF-8: maps each code to 1-4 bytes • Particularly compact for Western European languages • Most widely supported across various internet protocols
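
To make the size differences concrete, the following sketch counts the bytes each encoding needs for a few sample characters. The helper `byte_lengths` and the choice of samples are assumptions for illustration; the big-endian codec names are used so that Python does not add a byte order mark to the count.

```python
def byte_lengths(ch):
    """Bytes needed for one character in each Unicode encoding."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}

ascii_e = byte_lengths("e")            # 1, 2, 4
e_acute = byte_lengths("é")            # 2, 2, 4
thai_ko_kai = byte_lengths("ก")        # 3, 2, 4  (U+0E01, in the BMP)
gothic = byte_lengths(chr(0x10348))    # 4, 4, 4  (outside the BMP)

for label, sizes in [("e", ascii_e), ("é", e_acute),
                     ("ก", thai_ko_kai), ("U+10348", gothic)]:
    print(label, sizes)
```

Only the character outside the BMP needs two 2-byte units in UTF-16, and UTF-32 always needs 4 bytes.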

Character semantics vs. glyphs • No difference between e, e, and e (the same letter in different typefaces or styles) • IPA letter [c], the voiceless palatal plosive, is the same character as Roman c • No separate characters for cursive scripts or joined-up handwriting

Character semantics vs. glyphs • Examples • U+0041 LATIN CAPITAL LETTER A • U+0410 CYRILLIC CAPITAL LETTER A • U+0391 GREEK CAPITAL LETTER ALPHA • IPA digraphs • “Never use a character just because it looks right.”
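
The three capital letters listed above usually look identical on screen, but a quick check shows they are distinct characters. This sketch uses only Python's standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

latin_a = "\u0041"
cyrillic_a = "\u0410"
greek_alpha = "\u0391"

# They may render identically, but they never compare equal.
all_different = len({latin_a, cyrillic_a, greek_alpha}) == 3   # True

names = [unicodedata.name(ch) for ch in (latin_a, cyrillic_a, greek_alpha)]
print(all_different, names)
```

A search for the Latin letter would therefore miss a word typed with the Cyrillic look-alike, which is one practical reason not to pick a character just because it looks right.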

Precomposed characters • Complex characters involving a base character and one or more diacritics - treated as equivalent to the same base character followed by combining diacritics • A relevant case study: "Challenges in Writing Bih" [http://test.elar.soas.ac.uk/node/5]
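
A short sketch of what this equivalence means in practice, assuming Python's standard `unicodedata.normalize`. The sample characters are chosen for illustration and are not taken from the case study.

```python
import unicodedata

precomposed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"      # e followed by COMBINING ACUTE ACCENT

# A naive comparison treats the two spellings as different strings.
naive_equal = precomposed == decomposed                                   # False

# Normalizing both to the same form makes the equivalence visible.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed       # True
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed       # True

# A base character with two diacritics: U+1EBF, e with circumflex and acute.
stacked = unicodedata.normalize("NFD", "\u1ebf")   # 'e' + U+0302 + U+0301

print(naive_equal, nfc_equal, nfd_equal, len(stacked))
```

Normalizing text to one form before searching, sorting, or comparing avoids mismatches between the two spellings.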

Compatibility characters • Similar to their decompositions, but not equivalent; they include extra information (formatting, etc.)
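
The distinction between "equivalent" and merely "similar" shows up in how the standard normalization forms treat such characters. This is a minimal sketch using Python's `unicodedata`; the ligature and superscript are example characters chosen here.

```python
import unicodedata

ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
superscript = "\u00b2"   # SUPERSCRIPT TWO

# Canonical normalization (NFC) leaves compatibility characters untouched.
nfc_ligature = unicodedata.normalize("NFC", ligature)      # still the ligature

# Compatibility normalization (NFKC) replaces them with plain characters,
# discarding the extra formatting information.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)        # 'fi'
nfkc_superscript = unicodedata.normalize("NFKC", superscript)  # '2'

print(nfc_ligature == ligature, nfkc_ligature, nfkc_superscript)
```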

Pre-composed and compatibility characters • Why do they exist, if counter to Unicode’s focus on character semantics over glyphic representation? • Compatibility with prior encodings • No such new characters will be accepted into Unicode

Things to watch out for • An example to illustrate the difference between: • Text rendering • Document encoding

Take away message • Just because characters aren’t rendered properly doesn’t mean that they aren’t there. • Just because characters are rendered properly doesn’t guarantee that they will stay that way. • Beware your platform’s default encoding (probably not Unicode).
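
The first point can be demonstrated with a round trip: text stored as UTF-8 but read with the wrong encoding looks garbled, yet the underlying bytes are intact and can be recovered. A minimal sketch, with the sample word assumed for illustration:

```python
original = "café"
stored = original.encode("utf-8")            # bytes as saved on disk

# Reading the bytes with the wrong encoding garbles the display...
garbled = stored.decode("iso-8859-1")        # 'cafÃ©'

# ...but nothing has been lost: reversing the wrong step restores the text.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled, recovered, recovered == original)
```

The damage only becomes permanent if the garbled text is edited or re-saved in a way that discards the original bytes.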

Adding markup • Not only should the document be Unicode, but it must declare itself as Unicode.
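
One way to see why the declaration matters is to feed the same UTF-8 bytes to an XML parser with a correct and an incorrect encoding declaration. This sketch assumes Python's standard `xml.etree.ElementTree`; the `<word>` element is a made-up example.

```python
import xml.etree.ElementTree as ET

body = "<word>café</word>".encode("utf-8")

# Declared as UTF-8 and stored as UTF-8: the parser reads it correctly.
correct = ET.fromstring(
    b'<?xml version="1.0" encoding="UTF-8"?>\n' + body)

# Same bytes, but the declaration claims ISO 8859-1: the parser trusts
# the declaration and misreads the two-byte sequence for "é".
mislabeled = ET.fromstring(
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n' + body)

print(correct.text, mislabeled.text)
```

HTML has the same requirement; there the declaration is typically a `<meta charset="utf-8">` element or an HTTP `Content-Type` header.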

Exercises • What's wrong with these Unicode words? • Character encoding exercises I • http://test.elar.soas.ac.uk/taxonomy/term/1

Your questions and issues
