Using Unicode for Linguistic Data

Using the Unicode Standard for Linguistic Data: Preliminary Guidelines. Deborah Anderson, Researcher, Dept. of Linguistics, UC Berkeley

Using Unicode for Linguistic Data Introduction: • E-MELD and its mission • What is the situation for character encoding? • The role of this presentation

Using Unicode for Linguistic Data • Background: What is Unicode? • Core Concepts • Practical Issues: How do I get Unicode to work? • Organization of the Unicode Standard • Finding the character you need • Other practical issues • Further recommendations

Using Unicode for Linguistic Data Background: What is Unicode? • Unicode is the international character encoding standard • It assigns a unique number to every character and this number stays the same “no matter what the platform, no matter what the program, no matter what the language”

Using Unicode for Linguistic Data Background: What is Unicode? Example: the Unicode character code for Latin capital letter A is: U+0041 Unicode format: U+xxxx (xxxx is in hex)
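
The U+xxxx notation can be reproduced with a few lines of Python. This is only a minimal sketch; the helper name `u_plus` is made up for illustration and is not part of the presentation.

```python
def u_plus(ch: str) -> str:
    """Return the U+xxxx notation (hex, at least four digits) for one character."""
    return f"U+{ord(ch):04X}"

print(u_plus("A"))   # U+0041
print(chr(0x0041))   # A  (the reverse direction: number to character)
```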

Using Unicode for Linguistic Data Background: What is Unicode? • Used for “plain text” representation (e.g., 0045 002D 004D 0045 004C 0044 = E-MELD) • Different from “rich text,” which is plain text with additional information (including formatting information, such as font size, styles, etc.)
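
As a small illustration of plain text being nothing more than a sequence of code points, the hex values on this slide can be turned back into the string they stand for. The function name `decode_code_points` is an assumption introduced here.

```python
def decode_code_points(hex_values: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(h, 16)) for h in hex_values.split())

text = decode_code_points("0045 002D 004D 0045 004C 0044")
print(text)  # E-MELD
```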

Using Unicode for Linguistic Data Background: What is Unicode? Example: Superscripts (a) Plain text: use Unicode characters, e.g., use U+02B0 for superscript “h” (b) Rich text: apply superscript style to a base character to get the superscript “h”, e.g., <sup>h</sup> (This can be done in MS Word by selecting the “superscript” formatting feature on the “font” menu.)

Using Unicode for Linguistic Data Background: What is Unicode? • Widely supported by computer companies and national bodies: many current fonts, keyboards, and software are based on Unicode • But… the process to get characters incorporated can be lengthy (2+ years), so there can be lag-time before they appear in fonts, etc.

Using Unicode for Linguistic Data Core Concepts: 1. Characters, not glyphs. Characters are “the smallest components of written language that have semantic value” (TUS, p. 13) Glyphs: the surface representation of abstract characters; what appears on the page or on your monitor

Using Unicode for Linguistic Data Core Concepts: 1. Characters, not glyphs. Example: Abstract character: a (Latin small letter a) → Unicode’s domain. Glyphs: a, a, a, a (the same letter in different typefaces) → the font’s domain

Using Unicode for Linguistic Data Core Concepts: 1. Characters, not glyphs. Don’t take glyphs in the Unicode Standard charts as definitive:

Using Unicode for Linguistic Data Core Concepts: 1. Characters, not glyphs. Characters aren’t necessarily the same as graphemes: the Spanish grapheme ch is represented in Unicode as two characters, c + h

Using Unicode for Linguistic Data Core Concepts: 1. Characters, not glyphs. There is not always a 1-1 relationship between a character and a glyph: (a) Arabic: one character can have different glyphs depending upon its position in a word (b) Devanagari: the glyph for ksha is made up of 3 characters: ka + virama + ssa (U+0915 U+094D U+0937)
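
The Devanagari case can be checked with Python's standard `unicodedata` module. The sketch below simply lists the code points behind the single conjunct glyph; note that the Unicode character name for the third letter is SSA.

```python
import unicodedata

ksha = "\u0915\u094D\u0937"  # displayed as one conjunct glyph by a capable font

names = [unicodedata.name(ch) for ch in ksha]
for ch, name in zip(ksha, names):
    print(f"U+{ord(ch):04X}", name)

print(len(ksha))  # 3 characters behind a single glyph
```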

Using Unicode for Linguistic Data Core Concepts: 2. No new precomposed forms or digraphs Example: +
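
One practical consequence can be sketched with normalization (the topic of Appendix 3). A base letter plus combining mark that already has a precomposed form collapses to one code point under NFC, while a combination with no precomposed form stays a two-character sequence. The two sample sequences are illustrative choices made here, not taken from the slide.

```python
import unicodedata

a_tilde = "a\u0303"         # a + COMBINING TILDE; precomposed U+00E3 exists
eps_tilde = "\u025B\u0303"  # open e + COMBINING TILDE; no precomposed form

nfc_a = unicodedata.normalize("NFC", a_tilde)
nfc_eps = unicodedata.normalize("NFC", eps_tilde)

print(len(nfc_a))    # 1
print(len(nfc_eps))  # 2
```

Both sequences are equally valid Unicode; the second is simply represented as base character + combining mark.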

Using Unicode for Linguistic Data Core Concepts: 3. No variants 4. No idiosyncratic characters

Using Unicode for Linguistic Data Core Concepts: 5. Unify, wherever possible. Greek letter beta is unified with IPA beta (voiced bilabial fricative)

Using Unicode for Linguistic Data Core Concepts: 5. Unify, wherever possible. Not everything that looks alike is unified: U+0283 LATIN SMALL LETTER ESH (voiceless post-alveolar fricative) is a different character from U+222B INTEGRAL (mathematical symbol)
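
The characters named on these two slides can be inspected with `unicodedata`. The sketch shows that the beta used for IPA is the Greek letter itself, while esh and the integral sign are separate characters with different general categories.

```python
import unicodedata

info = {
    cp: (unicodedata.name(chr(cp)), unicodedata.category(chr(cp)))
    for cp in (0x03B2, 0x0283, 0x222B)
}

for cp, (name, category) in info.items():
    print(f"U+{cp:04X}", name, category)
```

Here `Ll` means lowercase letter and `Sm` means math symbol.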

Using Unicode for Linguistic Data Practical Issues: Getting Unicode to Work

Using Unicode for Linguistic Data Practical Issues: Getting Unicode to Work • A recent operating system (Mac OS 9.2, X, Windows CE, NT, 2000, XP, GNU/Linux with glibc 2.2.2+) • A recent browser (IE, Safari, OmniWeb, Mozilla/Netscape) • A Unicode text editor (Word 2000, 2002, Unipad, Apple “TextEdit”) • An input mechanism (“insert symbol,” keyboard, Keyman)

Using Unicode for Linguistic Data Getting Unicode to Work • A Unicode-enabled font (Code2000, Lucida Sans Unicode, SIL’s Doulos, Gentium, Arial Unicode MS) Note: Be wary of “Unicode” fonts; they may only be partially Unicode-compliant.

Organization of the Unicode Standard

Organization of the Unicode Standard: Unicode Code Charts

Using Unicode for Linguistic Data Unicode Code Charts

Using Unicode for Linguistic Data Code Chart (Phonetic Extensions block)

Using Unicode for Linguistic Data Unicode Code Charts

Using Unicode for Linguistic Data Steps to using Unicode Finding the character you need 1. See if it is in Unicode: • Check the IPA blocks (etc.) on the Unicode website • Check Appendix 2 of the IPA Handbook or a Web version of the IPA symbols

Using Unicode for Linguistic Data Steps to using Unicode Finding the character you need Note: In looking through Unicode and using “insert Symbol”/font charts, be careful of “spoof buddies”:
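
One way to guard against look-alikes is to print the code point and character name rather than trusting the glyph. The pairs below are illustrative choices made for this sketch; they are not necessarily the pairs shown on the original slide.

```python
import unicodedata

# Hypothetical sample of visually similar pairs (everyday character, phonetic character)
LOOKALIKES = [
    ("g", "\u0261"),
    (":", "\u02D0"),
    ("'", "\u02C8"),
    ("\u00DF", "\u03B2"),
]

def describe(ch: str) -> str:
    """Return 'U+xxxx NAME' for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for plain, phonetic in LOOKALIKES:
    print(describe(plain), "vs.", describe(phonetic))
```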

Using Unicode for Linguistic Data Steps to using Unicode Finding the character you need 2. See if it is in the process of being proposed: • Check Unicode’s Proposed New Characters page • Ask on the Transcription email list • Ask on the Unicode email list • Verify the character you need is a true character, and not a variant

Using Unicode for Linguistic Data Steps to using Unicode If you find a character that is missing: Work with Peter Constable to get it proposed. A proposal is composed of: • the character’s name • a representative glyph • information on the character’s properties • a representative sample of the character in context • a short bibliography with references

Using Unicode for Linguistic Data Steps to using Unicode How can I use a character not yet in Unicode? • Use FontLab or work with a font foundry to create a font in the interim, using the Private Use Area (PUA); fully document PUA chars. • Use markup / entities • Use Scalable Vector Graphics. TEI is preparing guidelines, but nothing has yet been finalized.
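
To help with documenting PUA characters, a file can be scanned for them. This is a minimal sketch that relies on the general category `Co` (private use); the function name and the sample string are assumptions made for illustration.

```python
import unicodedata

def private_use_chars(text: str) -> list[str]:
    """Return the distinct Private Use Area characters in text, as U+xxxx strings."""
    found = {ch for ch in text if unicodedata.category(ch) == "Co"}
    return [f"U+{ord(ch):04X}" for ch in sorted(found)]

sample = "ka\uE000ta\uE000"  # hypothetical data using one PUA code point twice
print(private_use_chars(sample))  # ['U+E000']
```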

Using Unicode for Linguistic Data Steps to using Unicode For those languages without an orthography • Use Unicode characters if possible • Verify character properties are similar • Stay away from certain characters: • Presentation forms • Letterlike symbols • Number forms
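
A rough check for the character groups to avoid can be sketched by testing code points against the relevant block ranges. The block list below is a selection made for this sketch (it covers the Alphabetic and Arabic presentation-form blocks only), and the function name is hypothetical.

```python
# Block ranges from the Unicode code charts (inclusive)
DISCOURAGED_BLOCKS = {
    "Letterlike Symbols": (0x2100, 0x214F),
    "Number Forms": (0x2150, 0x218F),
    "Alphabetic Presentation Forms": (0xFB00, 0xFB4F),
    "Arabic Presentation Forms-A": (0xFB50, 0xFDFF),
    "Arabic Presentation Forms-B": (0xFE70, 0xFEFF),
}

def flag_discouraged(text: str) -> list[tuple[str, str]]:
    """Return (U+xxxx, block name) for each character in a discouraged block."""
    hits = []
    for ch in text:
        for block, (low, high) in DISCOURAGED_BLOCKS.items():
            if low <= ord(ch) <= high:
                hits.append((f"U+{ord(ch):04X}", block))
    return hits

# U+FB01 is the "fi" ligature, U+2160 is ROMAN NUMERAL ONE
print(flag_discouraged("\uFB01n \u2160 ok"))
```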

Using Unicode for Linguistic Data Steps to using Unicode How do I tell if my font is Unicode-compliant? • Set your font as the default for your browser, then look at a test page, such as Alan Wood’s IPA Extensions page. • Use font utilities to check the fonts on your system (see Alan Wood’s website)

Using Unicode for Linguistic Data Steps to using Unicode What about my data that is in a non-Unicode font? If possible, upgrade your documents to Unicode, converting to a Unicode font. • Use a converter • If the font you use isn’t included, create a converter and have it hosted on a publicly available website
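
At its simplest, a converter is a character-to-character mapping table. The sketch below assumes a made-up legacy font that displays phonetic symbols in the slots of ordinary ASCII characters; the mapping is entirely hypothetical and would have to be replaced by the documented layout of the actual font.

```python
# Hypothetical legacy font layout: ASCII slot -> intended Unicode character
LEGACY_TO_UNICODE = str.maketrans({
    "S": "\u0283",  # esh
    "N": "\u014B",  # eng
    "?": "\u0294",  # glottal stop
})

def convert(legacy_text: str) -> str:
    """Replace legacy-font code points with their Unicode equivalents."""
    return legacy_text.translate(LEGACY_TO_UNICODE)

print(convert("Sa?aN"))
```

A real conversion also has to deal with text that is not in the legacy font (for example English glosses), which a blanket mapping like this would corrupt.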

Using Unicode for Linguistic Data Steps to using Unicode Encoding Forms Different ways to represent a character’s code point (an integer, usually written in hex) as a series of bytes: • A series of one to four 8-bit values (UTF-8) • One or two 16-bit values (UTF-16) • A single 32-bit value (UTF-32)

Using Unicode for Linguistic Data Steps to using Unicode Encoding Forms • Reason for different forms: different implementation needs • Some tradeoffs for storage/processing • Suggestion: Use UTF-8 or UTF-16
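
The storage tradeoff between the encoding forms can be seen by encoding a few characters and counting bytes. This is a small sketch using Python's built-in codecs; the three sample characters are chosen here for illustration.

```python
def byte_sizes(ch: str) -> tuple[int, int, int]:
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),  # "-be" avoids counting a byte order mark
        len(ch.encode("utf-32-be")),
    )

for ch in ("A", "\u0283", "\U00010330"):
    print(f"U+{ord(ch):04X}", byte_sizes(ch))
# U+0041 (1, 2, 4)
# U+0283 (2, 2, 4)
# U+10330 (4, 4, 4)
```

The last character lies outside the Basic Multilingual Plane, so UTF-16 needs two 16-bit values for it.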

Using Unicode for Linguistic Data Steps to using Unicode Further recommendations • Groups of users (e.g., Athabaskanists) should publicly document Unicode values for the orthography and give font recommendations. • Provide feedback on missing characters to Peter Constable.
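
A simple way to produce such documentation is to generate a table of code points and character names from the list of characters in use. The two-character inventory below is a hypothetical example, not a description of any actual orthography.

```python
import unicodedata

def document_inventory(chars: str) -> list[str]:
    """Return one tab-separated line per character: code point, character, name."""
    return [
        f"U+{ord(ch):04X}\t{ch}\t{unicodedata.name(ch, '<unnamed>')}"
        for ch in chars
    ]

lines = document_inventory("\u026C\u0294")
for line in lines:
    print(line)
```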

Appendices 1: Linguistic Letters and Symbols in Unicode 2: Characters known to be missing 3: Normalization

end
