Unicode & W3C Jataayu Software

Unicode & W3C Jataayu Software C. Kumar January 2007

Agenda • About Jataayu • Unicode & Encoding • W3C Specification for multi-lingual authoring • Multilingual WEB Address • Indian WEB Sites an Overview • W3C Activity

About Jataayu • Jataayu was formed with a clear focus on delivering solutions for wireless data services • Over 60% of the data traffic in Indian Mobile Networks for WAP, Mobile WEB and MMS is handled by Jataayu Products • Mobile Device Solution Division focuses on wireless data applications like WAP, MMS, SyncML, IMPS, Email, Web Browsing, Download • Active participants in OMA, W3C and MWI • Over 350 people strong with offices in UK, Singapore, Korea, Taiwan and the US; headquartered in India with a major development center in Bangalore

Localization - Internationalization • Localization (l10n) • Adaptation of the content to meet the language, cultural and other requirements of a specific target market • Internationalization (i18n) • Design & Development of the content that enables easy localization for target audiences that vary in culture, region or language. • Mission of W3C i18n Activity is to ensure the W3C’s formats and protocols are usable worldwide in all languages and in all writing systems.

Need for Unicode • Early character sets were based on 7 bits, which gave 2^7 (i.e. 128) possible characters • Adding the 8th bit gave a total of 256 possible characters, still not enough for all the European languages. • The code page mechanism helped a little by changing the upper cells (0xA0 to 0xFF), but was very complex. • Addressing the needs of other languages requires thousands of ideographic characters at a time.

Unicode & Encoding • Unicode, the universal character set, contains all the characters needed for writing the majority of living languages in use on computers. • Allows for simple display and storage of multilingual content • An encoding refers to the way that the characters (code points) of the character set are mapped to actual byte values. • Different encodings yield different byte sequences.
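
The last point can be illustrated with a minimal Python sketch. The sample letter (U+0905, Devanagari "अ") and the choice of big-endian codec variants are assumptions made for the illustration; they are not taken from the slides.

```python
# Encode one Devanagari letter (U+0905) with three Unicode encodings
# and show that the same code point yields three different byte sequences.
text = "\u0905"

encoded = {
    "utf-8": text.encode("utf-8"),
    "utf-16-be": text.encode("utf-16-be"),
    "utf-32-be": text.encode("utf-32-be"),
}

for name, data in encoded.items():
    # bytes.hex(" ") prints the bytes as space-separated hex pairs
    print(f"{name:10} {data.hex(' ')}")
```

The code point is the same in every case; only the byte-level representation changes.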

Unicode & Encoding • UTF-8 (Unicode Transformation Format) • Variable length 8-bit character encoding for Unicode • Able to represent any universal character in the Unicode Standard • Uses one to four bytes to encode a Unicode symbol • Only one byte is needed to encode the US-ASCII characters
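
As a rough illustration of the one-to-four-byte rule, the sketch below measures the UTF-8 length of four sample characters. The samples themselves are assumptions picked to hit each length.

```python
# Sample characters chosen (as an assumption) to need 1, 2, 3 and 4 bytes.
samples = {
    "A": "US-ASCII letter",
    "\u00e9": "Latin small letter e with acute",
    "\u0905": "Devanagari letter A",
    "\U0001F600": "character beyond U+FFFF",
}

# Map each character to the number of bytes its UTF-8 form occupies.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_lengths[ch]} byte(s)")
```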

Unicode & Encoding • UTF-16 (16-bit Unicode Transformation Format) • Variable length 16-bit character encoding for Unicode • Uses a two- or four-byte sequence to encode a Unicode symbol • Two bytes are required to encode a US-ASCII character • UCS-2 (2-byte Universal Character Set) • Fixed length encoding that always encodes a character into a single 16-bit value • It can only encode characters in the range 0x0000 to 0xFFFF
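
The difference between the two-byte and four-byte cases can be sketched as follows. The helper name `utf16_code_units` is hypothetical, and the sketch assumes its input is a single valid character (not a lone surrogate).

```python
def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units UTF-16 uses for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        # Fits in a single 16-bit unit; this is the only case UCS-2 covers.
        return [cp]
    # Characters above U+FFFF are split into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


for ch in ("A", "\u0905", "\U0001F600"):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(ch))
    print(f"U+{ord(ch):04X} -> {units}")
```

The result for each character should agree with Python's own `utf-16-be` codec, which is a convenient cross-check.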

Unicode & Encoding • UCS-4 / UTF-32 (32-bit Unicode Transformation Format) • Fixed length 32-bit character encoding for Unicode • Uses 4 bytes for every character, which makes it very space-inefficient • Little used in practice, with UTF-8 and UTF-16 being the normal ways of encoding Unicode text • http://www.unicode.org/

Unicode & Encoding • Devanagari (0x0900 – 0x097F) • Bengali (0x0980 – 0x09FF) • Tamil (0x0B80 – 0x0BFF) • Kannada (0x0C80 – 0x0CFF)
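
A small lookup over the ranges listed above shows how a program might tell which of these scripts a character belongs to. The names `INDIC_BLOCKS` and `block_of` are hypothetical, and only the four blocks from the slide are covered.

```python
from typing import Optional

# Block ranges exactly as listed on the slide (inclusive bounds).
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
    "Kannada": (0x0C80, 0x0CFF),
}


def block_of(ch: str) -> Optional[str]:
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, (start, end) in INDIC_BLOCKS.items():
        if start <= cp <= end:
            return name
    return None


for ch in ("\u0939", "\u09ac", "\u0ba4", "\u0c95", "A"):
    print(f"U+{ord(ch):04X}: {block_of(ch)}")
```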

Unicode & Encoding • An alternate way to represent a character is by using an escape value (for example, &#x05D0; for א). • Not all documents have to be encoded as Unicode • But documents can only contain characters defined by the Unicode Standard • Any encoding can be used as long as it is properly declared and it is a subset of Unicode • Unicode encoding also allows many more languages to be mixed on a single page
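
A numeric character reference such as &#x05D0; names the code point U+05D0 (א) using ASCII characters only. The sketch below produces and resolves such escapes with the Python standard library; the helper name `to_ascii_with_references` is hypothetical.

```python
import html


def to_ascii_with_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hebrew letter alef (U+05D0) written as an escape, then resolved back.
escaped = to_ascii_with_references("\u05d0")
restored = html.unescape("&#x05D0;")

print(escaped)
print(restored == "\u05d0")
```

Python emits the decimal form of the reference; the hexadecimal form resolves to the same character.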

Other Encoding formats … • Shift_JIS (SJIS), character encoding for the Japanese Language • Single-byte character encoding for the lower-ASCII characters (0x00 – 0x7F) • Double-byte character encoding, introduced by a lead byte above the ASCII range, for most Japanese characters • GB2312, character encoding for simplified Chinese characters
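
The single-byte versus double-byte behaviour can be observed with the codecs that ship with Python. The sample characters (hiragana "あ", U+3042, and the Chinese character "中", U+4E2D) are assumptions chosen for the illustration.

```python
# (character, codec) pairs: an ASCII letter and a kana in Shift_JIS,
# and a simplified Chinese character in GB2312.
samples = [("A", "shift_jis"), ("\u3042", "shift_jis"), ("\u4e2d", "gb2312")]

# Number of bytes each character occupies in the given legacy encoding.
byte_lengths = {(ch, enc): len(ch.encode(enc)) for ch, enc in samples}

for (ch, enc), length in byte_lengths.items():
    print(f"U+{ord(ch):04X} in {enc}: {ch.encode(enc).hex(' ')} ({length} byte(s))")
```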

W3C Specification - Encoding • W3C specification for multi-lingual authoring • The encoding of the document needs to be declared, so that the application that consumes it can interpret it. • Meta Tag • <meta http-equiv="Content-type" content="text/html;charset=UTF-8" /> • XML • <?xml version="1.0" encoding="UTF-8"?> • The Content-Type header returned from the WEB server should also contain the character encoding of the document • Content-Type: text/html; charset=utf-8
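
On the consuming side, an application might read the declared encoding from the Content-Type header before decoding the body. The following is a minimal sketch using the standard `email.message` module; the function name `charset_from_content_type` and the ISO-8859-1 fallback are assumptions, not something the slides prescribe.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter lowercased, or None if absent.
    return msg.get_content_charset() or default


charset = charset_from_content_type("text/html; charset=utf-8")
body = "\u0939\u093f\u0928\u094d\u0926\u0940".encode("utf-8")  # sample UTF-8 body
print(charset, body.decode(charset))
```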

W3C Specification - Language • The author needs to specify the language of the document (web page content) • A browser can choose the appropriate font using the lang attribute • A search engine can group or filter results based on the user’s linguistic preferences (using meta) • Translation tools use it to recognize the sections of text in a particular language

W3C Specification - Language • HTTP Content Language Header • Content-Language: hi • Language Attribute on html tag • <html lang="hi"> • <html xml:lang="hi"> • Content Language in meta tag • <meta http-equiv="Content-Language" content="hi" /> • Language attribute on embedded content • <div lang="en" xml:lang="en"> Some English Content </div>
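
A tool that consumes these declarations, such as a search indexer or translation aid, could collect the language attributes roughly as sketched below. The class name `LangCollector` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record (tag, language) for every element that declares a language."""

    def __init__(self):
        super().__init__()
        self.languages = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Prefer lang; fall back to xml:lang when only that is present.
        lang = attributes.get("lang") or attributes.get("xml:lang")
        if lang:
            self.languages.append((tag, lang))


page = (
    '<html lang="hi"><body>'
    '<div lang="en" xml:lang="en">Some English Content</div>'
    "</body></html>"
)
collector = LangCollector()
collector.feed(page)
print(collector.languages)
```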

What value to use for lang? • IANA (Internet Assigned Numbers Authority) • Provides a unique value for each language • It is available as the Subtag value in the new IANA Language Subtag Registry • http://www.iana.org/assignments/language-subtag-registry • Hindi – hi, Kannada – kn, Tamil – ta

Bi-directional text • Information beyond the language attribute is required to support right-to-left scripts (like Arabic, Hebrew, Urdu) • In HTML, the dir attribute is used to specify the direction of the text • The title says “<span dir="rtl">פעילות הבינאום, W3C</span>” in Hebrew.
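
Whether a piece of text needs dir="rtl" can be estimated from the Unicode bidirectional class of its characters. The sketch below uses a simple first-strong-character heuristic; the function name `base_direction` and the fallback to "ltr" are assumptions made for this illustration.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew, Arabic and similar letters
            return "rtl"
        if bidi_class == "L":           # Latin, Devanagari and similar letters
            return "ltr"
    return "ltr"  # no strong character found; assume left-to-right


hebrew_title = "\u05e4\u05e2\u05d9\u05dc\u05d5\u05ea, W3C"
print(base_direction(hebrew_title))
print(base_direction("W3C"))
```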

Multilingual WEB Address • A Web address is used to point to a resource on the WEB • Web addresses are typically expressed using URIs (Uniform Resource Identifiers) • URIs are restricted to a small number of characters (upper & lower case letters of the English alphabet, numerals and a few symbols). • Users’ expectations and use of the Internet have outgrown these restrictions. • There is a growing need to use characters from any language in WEB Addresses.

Multilingual WEB Address … • A Web address in your own language and alphabet is easier to create, memorize, interpret and relate to. (Ex: http://खोज.com) • Punycode is a way of representing Unicode code points using only ASCII characters. (Ex: http://xn--21bm4l.com)
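
The conversion in the example can be reproduced with Python's built-in `punycode` codec. This is a simplified sketch: the helper names `to_ascii_label` and `to_ascii_host` are hypothetical, and the normalization and validity checks that full IDNA processing performs are deliberately left out.

```python
def to_ascii_label(label: str) -> str:
    """Convert one host name label to its ASCII form."""
    if label.isascii():
        return label  # already ASCII, nothing to do
    # "xn--" marks the label as Punycode-encoded.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_host(host: str) -> str:
    """Convert every dot-separated label of a host name."""
    return ".".join(to_ascii_label(part) for part in host.split("."))


host = "\u0916\u094b\u091c.com"  # the Devanagari example host from the slide
print(to_ascii_host(host))
```

Decoding works the same way in reverse: strip the "xn--" prefix and decode the rest with the `punycode` codec.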

Indian Content an Overview • Most Indian Websites are not using Unicode • Content is generated within the ASCII range and displayed with proprietary fonts which map the ASCII character set to Indian language glyphs. • Visually it will be fine, but no other software will be able to interpret it • For each site, the user may need to download the proprietary fonts, which is not user friendly • Search engines will not be able to interpret the content as intended by the author, as it does not follow a standard encoding.


Unicode & W3C Importance • The WEB is also moving towards mobile • W3C Mobile Web Initiative (MWI) defines the best practices for Mobile Browsing • Required fonts cannot be installed at run-time on a mobile device as is done on the desktop • If Unicode characters are used, the required font may already be available within the device

Firefox • Firefox (http://www.getfirefox.com) • Provides extensive support for Unicode and related fonts • Provides Add-ons for typing Indian languages in web pages on Linux. (Such tools are already available to Windows XP users through the language packs) • https://addons.mozilla.org/firefox/5484/author/

W3C i18n activity • Core Working group • Enable universal access to the World Wide Web by providing adequate support to other W3C Working Groups • GEO (Guidelines, Education & Outreach) • Make the internationalization aspects of W3C technology better understood and more widely and consistently used • ITS (Internationalization Tag Set) • Develop a set of elements and attributes that can be used with new DTDs/Schemas to support the internationalization and localization of documents

Thanks kumarc@jataayusoft.com
