Indic Script Support on the Windows Platform

Cathy Wissink
Globalization Infrastructure and Font Technology
Windows International
Microsoft

Why implement Indic scripts (I)?
• Languages spoken by a large percentage of the world’s population:
  • Hindi: ~400 million speakers
  • Bengali: ~200 million speakers
• Other “smaller” Indic languages are spoken in numbers approximating many of the traditionally globalized or localized languages (1999):
  • Tamil: ~60 million speakers
  • Telugu: ~70 million speakers
  • Korean: ~75 million speakers
  • Italian: ~65 million speakers
Prague, Czech Republic (IUC 23)

Why implement Indic scripts (II)?
• New initiatives at the national and state level in India require local language support
• Much of the population is not literate in English
• Fast-growing IT market
  • Estimated 10.9% growth for the next 3 years* (*Semiconductor Business News, Jan 2003)

In India, English is not the complete story. Implement native language support as well! …but it’s not quite that simple…

Indic scripts = complex scripts
• Not meant in the sense of “well, it’s harder to implement than ASCII or English”
• Complex script: a writing system that requires additional processing between input and display
• Examples of additional processing:
  • Multiple diacritic positioning
  • Bidirectional text
  • Word breaking

What are the implications of complex scripts?
• Additional technologies for processing = additional development time
• The same applies to Arabic, Hebrew, Thai, Vietnamese, etc.
• Understanding the relationship between these additional technologies and display/input can be more difficult
• This can result in misunderstandings about implementation technologies, scheduling and guidelines

Code points vs. glyphs
• There is a systematic relationship between code points (in Unicode), input and display
• Understanding this relationship is crucial to understanding the implementation philosophy used by MS (and other vendors) for Unicode
• This relationship is especially important to understand when handling complex scripts

From the perspective of multilingual processing…
• A code point deals with semantic content
• A glyph deals with visual representation
• There is not always a 1:1 relationship between code points and glyphs!
• Even in alphabets, you can see this:
  • Diacritics needed to complete the repertoire
  • Multiple code points, single collation unit
  • Different code points, same glyph

Number of code points not always equal to number of “glyphs”
פּ = U+05E4 and U+05BC
É = U+0045 and U+0301
ў = U+0443 and U+0306
dzs = U+0064 and U+007A and U+0073
What a user thinks of as a “character” (a single collation element, a single keystroke, a single typographical unit) will often be more than one code point.
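The mismatch between user-perceived characters and code points can be checked directly. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `code_points` and the labels in `samples` are illustrative and not part of the original slides. It also shows that normalization (NFC) can shorten some sequences but not others, so the count never becomes reliably 1:1.

```python
import unicodedata

# The four examples from the slide, spelled out as explicit code points.
samples = {
    "Hebrew pe + dagesh": "\u05E4\u05BC",
    "Latin E + combining acute": "\u0045\u0301",
    "Cyrillic u + combining breve": "\u0443\u0306",
    "Hungarian dzs": "\u0064\u007A\u0073",
}

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

if __name__ == "__main__":
    for label, text in samples.items():
        composed = unicodedata.normalize("NFC", text)
        # NFC composes where a precomposed form is allowed; the Hebrew
        # sequence and the letter sequence "dzs" keep their length.
        print(label, code_points(text), "-> NFC length", len(composed))
```

In this sketch the Latin and Cyrillic pairs compose to a single code point under NFC, while the Hebrew pair and the Hungarian trigraph stay at two and three code points, which is why processing cannot assume one code point per "character".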

How does this apply to Indic?
• Consonantal code points carry an inherent vowel, so an additional code point (the virama) is used to suppress it
• Consonants and vowel signs are separate code points
• Syllables (Devanagari Ksha and Tamil Shri) are more than one code point
क़ = U+0915 and U+093C
ರಿ = U+0CB0 and U+0CBF
க்ஷ = U+0B95 and U+0BCD and U+0BB7
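The three Indic examples can be taken apart the same way. This is a small illustrative sketch (the function name `describe` is an assumption, not something from the slides) that lists each code point with its Unicode character name, which makes the consonant, nukta, vowel sign and virama roles visible.

```python
import unicodedata

# The three sequences shown on the slide, as explicit code points.
indic_samples = [
    "\u0915\u093C",        # Devanagari KA + NUKTA
    "\u0CB0\u0CBF",        # Kannada RA + vowel sign I
    "\u0B95\u0BCD\u0BB7",  # Tamil KA + virama + SSA
]

def describe(text):
    """Return (U+XXXX, character name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

if __name__ == "__main__":
    for sample in indic_samples:
        for code, name in describe(sample):
            print(code, name)
        print()
```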

An implementation challenge, but…
• The system does most of the work transparently for the user:
  • Input can circumvent this complexity
  • Rendering and display are handled “behind the scenes”
  • User does not need to worry about code point – glyph relationship
• MS was able to leverage previous development work for other complex scripts and optimize it for Indic
• Using Unicode means world-wide functionality for all languages on all versions

How does the average user deal with this complexity?
• Input:
  • On a keyboard, a single key can output up to four UTF-16 code units
  • Example: Tamil keyboard on Windows; the user simply types the key to get the complex output, if it is defined on the keyboard
  • With MSKLC, a keyboard can now be customized to include more of these kinds of code point combinations
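To make the four-unit limit concrete, here is a minimal sketch that models a keyboard layout as a table from keys to output strings and checks each entry against the limit. The key names and assignments in `layout` are hypothetical and are not taken from any shipping Windows layout or from MSKLC's file format; only the limit of four UTF-16 units per key comes from the slide.

```python
MAX_UNITS_PER_KEY = 4  # limit stated on the slide

def utf16_units(text):
    """Count the UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2

# Hypothetical key assignments, for illustration only.
layout = {
    "key_1": "\u0B95",                    # Tamil KA, one code point
    "key_2": "\u0B95\u0BCD\u0BB7",        # Tamil ksha, three code points
    "key_3": "\u0BB8\u0BCD\u0BB0\u0BC0",  # Tamil shri, four code points
}

def check_layout(table):
    """Map each key to True if its output fits within the per-key limit."""
    return {key: utf16_units(out) <= MAX_UNITS_PER_KEY for key, out in table.items()}

if __name__ == "__main__":
    for key, fits in check_layout(layout).items():
        print(key, utf16_units(layout[key]), "units", "ok" if fits else "too long")
```

Counting UTF-16 units rather than code points matters for characters outside the Basic Multilingual Plane, which take two units each.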

What about display and output?
• Code points are passed from the keyboard on to the shaping engine (and the font)
• The rendering engine breaks the string into syllables and makes adjustments where necessary
  • Rearranging for vowels that are left positioned, e.g., க + ை becomes கை
  • Rearranging for split vowels, e.g., க + ௌ becomes கௌ
• This information is then sent on to display
• This is all transparent to the user
  • No need for IME-like input
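The two rearrangements above can be imitated on code points with a toy function. This is only a sketch under simplifying assumptions: it handles a single Tamil syllable, it works on code points rather than font glyphs, and it is not how the Windows shaping engine is implemented. The names `TAMIL_PREBASE` and `visual_order` are illustrative. Split vowels are separated with canonical decomposition (NFD), which for U+0BCC yields U+0BC6 followed by U+0BD7.

```python
import unicodedata

# Tamil vowel signs E, EE and AI are drawn to the left of their consonant.
TAMIL_PREBASE = {"\u0BC6", "\u0BC7", "\u0BC8"}

def visual_order(syllable):
    """Return the code points of one Tamil syllable in left-to-right drawing order.

    Toy model: decompose split vowels, then move any left-positioned
    vowel part in front of the rest of the syllable.
    """
    decomposed = unicodedata.normalize("NFD", syllable)
    left = [ch for ch in decomposed if ch in TAMIL_PREBASE]
    rest = [ch for ch in decomposed if ch not in TAMIL_PREBASE]
    return left + rest

if __name__ == "__main__":
    for stored in ("\u0B95\u0BC8", "\u0B95\u0BCC"):  # kai, kau in logical order
        shown = visual_order(stored)
        print([f"U+{ord(c):04X}" for c in stored], "->", [f"U+{ord(c):04X}" for c in shown])
```

The stored text always stays in logical order (consonant first); only the drawing order changes, which is why the user never has to type the vowel sign before the consonant.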


What about collation?
• Many collation elements in a language require more than one code point:
  • Hindi: a consonant modified with candrabindu, anusvara, or visarga is a single sort element
  • Kannada: a consonant modified with anusvara has a distinct single sort weight
  • Tamil: a consonant modified by puLLi is a single sort element
• All this functionality is built in
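The idea that one sort element can span several code points can be sketched with a simple grouping pass. This is a rough illustration, not the Windows sorting implementation: it attaches every combining mark (Unicode general categories Mn and Mc) to the preceding base character, and it does not assign any language-specific weights. The function name `sort_elements` is an assumption made for this example.

```python
import unicodedata

def sort_elements(text):
    """Group text into units of one base character plus its trailing marks."""
    elements = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if elements and is_mark:
            elements[-1] += ch   # mark joins the preceding base character
        else:
            elements.append(ch)  # a new element starts here
    return elements

if __name__ == "__main__":
    # Devanagari KA + anusvara, then KHA: two elements from three code points.
    print(sort_elements("\u0915\u0902\u0916"))
    # Tamil KA + puLLi, then KA: two elements from three code points.
    print(sort_elements("\u0B95\u0BCD\u0B95"))
```

A real collation routine would then look up a weight for each grouped element according to the rules of the language in use.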

Collation in action

Technology behind the scenes
• All this functionality “just works”
• A user can work with Indic without having to understand:
  • Encodings or Unicode
  • The relationship between code points and glyphs
  • How fonts work
  • How rendering engines work
• This technology is available on all versions of the system: English, MUI, localized versions

How does someone enable this?
• Control Panel → Regional and Language Options
• Under “Standards and Formats”, set your language
• Click OK or Apply


Language settings for a user throughout the system
• Once Standards and Formats is set, a user has:
  • Keyboard installation
  • Shaping and display of script
  • Cultural data use throughout the system (date, time, formatting)
  • Sorting

Date/Time Formatting: Konkani

Date/Time Formatting: Punjabi

Currently available on XP:
• Gujarati
• Hindi*
• Kannada
• Konkani*
• Marathi*
• Punjabi
• Sanskrit*
• Tamil*
• Telugu
*Also supported on Windows 2000

Coming soon
• Windows localized into Hindi (the first localization for India!)
• Bengali- and Malayalam-enabled Windows
  • Keyboards
  • Shaping and fonts
  • Cultural data
  • Sorting
• Other languages and scripts in planning stages

References
• Unicode Technical Report #17: Character-Encoding Model, http://www.unicode.org/reports/tr17/
• MS Global Development website, http://www.microsoft.com/globaldev/
• MS Typography Development website, http://www.microsoft.com/typography/creators.htm
• Unicode Technical Note #1: Issues in Indic Collation, http://www.unicode.org/notes/tn1/
