
Internationalization Using Locales

Internationalization Using Locales. Achim Ruopp. Agenda: working with multilingual data; language and locale identifiers; locale data; frameworks for locale support; ideas and discussion on how this could be used in compling. Not about character encoding: read Jeremy’s slides from last quarter.


Presentation Transcript


  1. Internationalization Using Locales • Achim Ruopp

  2. Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling

  3. Not about character encoding • Read Jeremy’s slides from last quarter • http://students.washington.edu/jgk/talks/char-enc/char-encodings.pdf • Use Unicode wherever possible

  4. Internationalization: More than Encoding Text • Where are the word breaks? • คลิกปุ่มเมาส์ขวา • Your balance is $1234.56... I think. • How do I sort these words in French? • cote (dimension) • côte (coast) • coté (with dimensions) • côté (side) • How do I uppercase this word in Turkish? • istiyorum → İstiyorum • How do I transcribe this text into Latin characters? • 인수문제를 → in'su'mun'je'reul'
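
The Turkish question on this slide can be made concrete with a small sketch. Python's built-in case mapping is locale-independent, so it turns "i" into "I"; Turkish pairs dotted i with İ and dotless ı with I. The helpers `turkish_upper` and `turkish_capitalize` below are hypothetical names written for this illustration, not part of any library, and a real project would more likely rely on a locale-aware framework.

```python
# Illustrative helpers (assumption: hand-written for this example, not a library API).
# Turkish treats dotted i/İ and dotless ı/I as two separate letter pairs.
_TURKISH_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish dotted/dotless i pairs."""
    # Map the two special letters first, then let the default mapping do the rest.
    return text.translate(_TURKISH_UPPER).upper()


def turkish_capitalize(text: str) -> str:
    """Uppercase only the first letter, following the Turkish rule."""
    return turkish_upper(text[:1]) + text[1:]


word = "istiyorum"
default_result = word.capitalize()         # locale-independent mapping
turkish_result = turkish_capitalize(word)  # Turkish mapping

print(default_result)  # Istiyorum
print(turkish_result)  # İstiyorum
```

The default result is wrong for Turkish readers, which is the slide's point: case mapping depends on the language, not only on the character set.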

  5. Cultural Conventions • What does this date stand for? • 3/8/2006 • What is the currency symbol for Hungary? • … linguistic characteristics of languages and cultural conventions – a locale
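
The date question can be checked directly. The sketch below parses the slide's string twice with Python's standard `datetime` module, once month-first (the usual United States convention) and once day-first (common in much of Europe). The variable names are chosen for this illustration only.

```python
from datetime import datetime

ambiguous = "3/8/2006"

# Month-first reading, as commonly used in the United States.
us_reading = datetime.strptime(ambiguous, "%m/%d/%Y").date()

# Day-first reading, as commonly used in much of Europe.
eu_reading = datetime.strptime(ambiguous, "%d/%m/%Y").date()

print(us_reading.isoformat())  # 2006-03-08
print(eu_reading.isoformat())  # 2006-08-03
```

The same string yields two dates five months apart, so the locale has to travel with the data.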

  6. Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling

  7. Internet Language Tags • Used today: RFC 3066 (successor to RFC 1766) • Generative: ISO 639-1/2 language tag[-ISO 3166 country tag] • e.g. fr, en-US, ale-CA • Registered with IANA • e.g. no-nyo, zh-Hant • Exceptions • x-… • Several problems • Dependency on ISO standards • No generative options for dialects etc. • RFC3066bis should solve this
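
As a rough illustration of the generative pattern on this slide, the sketch below splits a tag into a primary language subtag, an optional two-letter country subtag, and whatever follows. It is a simplified reading of the RFC 3066 syntax under stated assumptions: it does not consult the ISO code lists or the IANA registry, and `LanguageTag` and `parse_tag` are hypothetical names for this example.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Simplified tag syntax: an alphabetic primary subtag, then alphanumeric subtags,
# each 1 to 8 characters long and separated by hyphens.
_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


class LanguageTag(NamedTuple):
    primary: str               # ISO 639 code, or "x" / "i" for the special cases
    country: Optional[str]     # two-letter second subtag, read as an ISO 3166 code
    rest: Tuple[str, ...]      # remaining subtags, left uninterpreted
    private_use: bool          # True for x-... tags


def parse_tag(tag: str) -> LanguageTag:
    """Split a language tag into its parts without validating the codes."""
    if not _TAG.fullmatch(tag):
        raise ValueError(f"not a well-formed language tag: {tag!r}")
    primary, *subtags = tag.split("-")
    primary = primary.lower()
    country = None
    # Only a 2- or 3-letter primary subtag is an ISO 639 language code here.
    if len(primary) in (2, 3) and subtags and len(subtags[0]) == 2 and subtags[0].isalpha():
        country = subtags.pop(0).upper()
    return LanguageTag(primary, country, tuple(subtags), primary == "x")


for example in ("fr", "en-US", "ale-CA", "zh-Hant", "x-sil-swg"):
    print(example, parse_tag(example))
```

Tags such as zh-Hant fall through to `rest` because their second subtag is not a two-letter country code; under RFC 3066 they only have meaning through registration.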

  8. SIL Ethnologue • Cataloging all of the world’s 6,912 known living languages • http://www.ethnologue.com/ • Uses ISO/DIS 639-3 3-letter codes • E.g. Swabian dialect: x-sil-swg • Hope for consolidation with RFC3066 or successor once 639-3 becomes full standard • Not so well supported in programming frameworks

  9. Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling

  10. Types of Locale Data • Dates/time formats • Number/currency formats • Collation Specification • For sorting and comparison • Translated names for language, region, script, timezones, currencies,… • Script and characters used by a language • Measurement System • Paper sizes • …
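
To show what a collation specification changes, here is a minimal sketch of the accent ordering traditionally described for French dictionaries, applied to the four words from slide 4. Base letters are compared first; ties are broken by comparing accents from the end of the word backwards. This is a hand-written approximation for illustration, not the Unicode Collation Algorithm and not any framework's API; `french_key` is a hypothetical helper.

```python
import unicodedata


def _split(word):
    """Return (base letters, list of accent marks per letter) for a word."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch          # attach the mark to the preceding letter
        else:
            bases.append(ch.lower())
            accents.append("")
    return "".join(bases), accents


def french_key(word):
    """Sort key: base letters first, then accents compared from the last letter."""
    base, accents = _split(word)
    return (base, accents[::-1])


words = [unicodedata.normalize("NFC", w) for w in ["côté", "coté", "côte", "cote"]]

naive = sorted(words)                   # plain code point order
french = sorted(words, key=french_key)  # accent-backward order

print(naive)   # ['cote', 'coté', 'côte', 'côté']
print(french)  # ['cote', 'côte', 'coté', 'côté']
```

The two orders differ in the middle pair, which is why collation rules are stored as locale data instead of being derived from character codes.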

  11. Common Locale Data Repository • “The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format.” • http://www.unicode.org/cldr/

  12. Common Locale Data Repository • Collection/vetting process • Contributors add/modify data • Reviewed by committee • Accessible over the web • Locale Data Markup Language XML format • E.g. http://unicode.org/cldr/data/common/main/fr.xml
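
Because LDML is plain XML, locale data can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a small hand-written fragment in the general shape of an LDML file. The fragment is an assumption made for illustration and is not copied from CLDR; real files are much larger and may differ in detail. `language_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of LDML (assumption: not real CLDR content).
SAMPLE_LDML = """\
<ldml>
  <identity>
    <language type="fr"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="de">allemand</language>
      <language type="en">anglais</language>
    </languages>
  </localeDisplayNames>
</ldml>
"""


def language_names(ldml_text):
    """Return (locale language code, {language code: translated name})."""
    root = ET.fromstring(ldml_text)
    locale_id = root.find("identity/language").get("type")
    names = {
        element.get("type"): element.text
        for element in root.findall("localeDisplayNames/languages/language")
    }
    return locale_id, names


locale_id, names = language_names(SAMPLE_LDML)
print(locale_id)    # fr
print(names["de"])  # allemand
```

The same lookup pattern would apply to other sections of a locale file, such as translated region or currency names.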

  13. Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling

  14. Frameworks: POSIX Locale • Standard C/C++ library • LC_COLLATE – sorting/comparison • LC_CTYPE – behavior of character-handling • LC_MONETARY – monetary formatting • LC_NUMERIC – numeric formatting • LC_TIME – date/time formatting • Used in Un*x systems for command line functions too • Results can be platform-dependent • Stable, but feature set stuck in the 1980s
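
Python's standard `locale` module wraps the POSIX locale categories listed above, so it gives a quick way to see both the API and the platform dependence. The sketch below queries the LC_NUMERIC decimal point for a few locales. The names "de_DE.UTF-8" and "fr_FR.UTF-8" are assumptions about what a given system has installed; only the "C" locale is guaranteed, and `decimal_point_for` is a hypothetical helper.

```python
import locale


def decimal_point_for(locale_name):
    """Return the LC_NUMERIC decimal point, or None if the locale is unavailable."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.localeconv()["decimal_point"]
    except locale.Error:
        # The platform does not provide this locale.
        return None
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # restore the global state


c_point = decimal_point_for("C")

# Locale names other than "C" are platform-dependent (assumption: glibc-style names).
for name in ("C", "de_DE.UTF-8", "fr_FR.UTF-8"):
    print(name, decimal_point_for(name))
```

Note that `setlocale` changes process-wide state, which is one reason the POSIX model is awkward for programs that handle several locales at once.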

  15. Frameworks: ICU Library • IBM Open Source project • Developed originally for the Taligent OS project in the late 80s/early 90s • Java and C++ APIs • Extensive locale data and APIs to use it • http://www.icu-project.org/cgi-bin/locexp • Also includes localization support • Everybody (Mac OS X, Java, DB2, Mathworks …) is using it • But …

  16. Frameworks: Microsoft • Windows NLS API • Microsoft .NET Framework System.Globalization namespace • Similar set of data to ICU • Vetted by subsidiaries • APIs accessible from all MS programming languages • Localization support in different API

  17. Microsoft demos • Culture Explorer • Microsoft Transliteration Utility

  18. Extensibility • What if I don’t find the locale I need? • What if I need to modify some of the data? • ICU • Can create new locales • Microsoft • .NET Framework v2.0: custom cultures • Windows Vista: custom locales • LDML can be interchange format

  19. Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling

  20. Usages for Computational Linguistics • Up to the imagination • Transliteration use in MT • Named Entity Recognition • … • suggestions? • Most importantly: Do not reinvent the wheel! • Check if API or data you need is available • If possible write code in a language/locale-independent fashion

  21. References • RFC3066bis • http://www.inter-locale.com/ID/why-rfc3066bis.html • Ethnologue • http://www.ethnologue.com/ • Common Locale Data Repository • http://www.unicode.org/cldr/ • POSIX Locale • http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html • ICU • http://icu.sourceforge.net/ • Microsoft • http://www.microsoft.com/globaldev/ • UNGEGN Working Group on Romanization Systems • http://www.eki.ee/wgrs/
