An overview of language and locale identifiers, locale data, and frameworks for locale support, with ideas for applying them in computational linguistics. Covers the Common Locale Data Repository (CLDR) and frameworks such as POSIX locales, ICU, and Microsoft's NLS and .NET APIs.
Internationalization: Using Locales • Achim Ruopp
Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling
Not about character encoding • Read Jeremy’s slides from last quarter • http://students.washington.edu/jgk/talks/char-enc/char-encodings.pdf • Use Unicode wherever possible
Internationalization: More than Encoding Text • Where are the word breaks? • คลิกปุ่มเมาส์ขวา • Your balance is $1234.56... I think. • How do I sort these words in French? • cote (dimension) • côte (coast) • coté (with dimensions) • côté (side) • How do I uppercase this word in Turkish? • istiyorum → İstiyorum • How do I transcribe this text into Latin characters? • 인수문제를 → in'su'mun'je'reul'
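The French sorting question can be illustrated with a minimal sketch. It assumes the rule behind the slide's order is "compare base letters first, then compare accents starting from the end of the word"; `french_sort_key` is a hypothetical helper written for this example, not a library function. Real code would use a collation API (for example ICU's) rather than a hand-rolled key.

```python
import unicodedata

def french_sort_key(word):
    """Hypothetical key: base letters first, then accents compared from the END."""
    base_letters = []
    accents = []  # one tuple of combining marks per base letter
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            # Attach the combining mark to the preceding base letter.
            accents[-1] += (ord(ch),)
        else:
            base_letters.append(ch.lower())
            accents.append(())
    return ("".join(base_letters), tuple(reversed(accents)))

words = ["côté", "coté", "côte", "cote"]
print(sorted(words))                       # plain code point order
print(sorted(words, key=french_sort_key))  # order shown on the slide
```

Plain code point sorting puts coté before côte; the sketch's key puts côte first, matching the slide.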
Cultural Conventions • What does this date stand for? • 3/8/2006 • What is the currency symbol for Hungary? • … linguistic characteristics of languages and cultural conventions – a locale
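A short sketch of why the date question has no single answer: the same string parses to two different days depending on which convention is assumed. The two format strings are assumptions chosen for illustration (month-first as in the US, day-first as in much of Europe).

```python
from datetime import datetime

text = "3/8/2006"
month_first = datetime.strptime(text, "%m/%d/%Y")  # US-style reading
day_first = datetime.strptime(text, "%d/%m/%Y")    # day-first reading

print(month_first.date().isoformat())  # 2006-03-08
print(day_first.date().isoformat())    # 2006-08-03
```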
Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling
Internet Language Tags • Used today: RFC 3066 (successor to RFC 1766) • Generative: ISO 639-1/2 language code[-ISO 3166 country code] • e.g. fr, en-US, ale-CA • Registered with IANA • e.g. no-nyo, zh-Hant • Exceptions • x-… (private use) • Several problems • Dependency on ISO standards • No generative options for dialects etc. • RFC3066bis should solve this
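The tag structure above can be sketched as a small classifier. `describe_tag` is a hypothetical helper written for this example; it only looks at the shape of the first two subtags as RFC 3066 describes them and does not check whether a code actually exists in ISO 639, ISO 3166, or the IANA registry.

```python
import re

SUBTAG = re.compile(r"^[A-Za-z0-9]{1,8}$")

def describe_tag(tag):
    """Hypothetical helper: classify the first two subtags of an RFC 3066 tag."""
    subtags = tag.split("-")
    if not all(SUBTAG.match(s) for s in subtags) or not subtags[0].isalpha():
        raise ValueError(f"not a well-formed tag: {tag!r}")
    primary = subtags[0].lower()
    if primary == "x":
        kind = "private use"
    elif primary == "i":
        kind = "IANA-registered"
    elif len(primary) == 2:
        kind = "ISO 639-1 language"
    elif len(primary) == 3:
        kind = "ISO 639-2 language"
    else:
        kind = "other"
    second = None
    if len(subtags) > 1 and primary not in ("x", "i"):
        if len(subtags[1]) == 2 and subtags[1].isalpha():
            second = "ISO 3166 country"
        elif len(subtags[1]) >= 3:
            # Shape only: 3-8 character second subtags are open to IANA registration.
            second = "IANA registration"
    return kind, second

for tag in ["fr", "en-US", "ale-CA", "no-nyo", "zh-Hant", "x-sil-swg"]:
    print(tag, describe_tag(tag))
```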
SIL Ethnologue • Cataloging all of the world’s 6,912 known living languages • http://www.ethnologue.com/ • Uses ISO/DIS 639-3 3-letter codes • E.g. Swabian dialect: x-sil-swg • Hope for consolidation with RFC3066 or successor once 639-3 becomes full standard • Not so well supported in programming frameworks
Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling
Types of Locale Data • Dates/time formats • Number/currency formats • Collation Specification • For sorting and comparison • Translated names for language, region, script, timezones, currencies,… • Script and characters used by a language • Measurement System • Paper sizes • …
Common Locale Data Repository • “The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format.” • http://www.unicode.org/cldr/
Common Locale Data Repository • Collection/vetting process • Contributors add/modify data • Reviewed by committee • Accessible over the web • Locale Data Markup Language XML format • E.g. http://unicode.org/cldr/data/common/main/fr.xml
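Because LDML is plain XML, locale data can be read with any XML parser. The sketch below uses a hand-written, heavily abridged fragment in the style of LDML (an assumption for illustration, not copied from the CLDR file linked above) and pulls out the locale's language, two translated language names, and the decimal separator.

```python
import xml.etree.ElementTree as ET

# Hand-written, abridged fragment in the style of LDML (illustrative only).
LDML_SAMPLE = """
<ldml>
  <identity><language type="fr"/></identity>
  <localeDisplayNames>
    <languages>
      <language type="de">allemand</language>
      <language type="en">anglais</language>
    </languages>
  </localeDisplayNames>
  <numbers>
    <symbols>
      <decimal>,</decimal>
    </symbols>
  </numbers>
</ldml>
"""

root = ET.fromstring(LDML_SAMPLE.strip())

# Which locale does this file describe?
locale_language = root.find("identity/language").get("type")

# Translated language names, keyed by language code.
language_names = {
    el.get("type"): el.text
    for el in root.findall("localeDisplayNames/languages/language")
}

decimal_separator = root.findtext("numbers/symbols/decimal")

print(locale_language, language_names, decimal_separator)
```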
Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling
Frameworks: POSIX Locale • Standard C/C++ library • LC_COLLATE – sorting/comparison • LC_CTYPE – behavior of character handling • LC_MONETARY – monetary formatting • LC_NUMERIC – numeric formatting • LC_TIME – date/time formatting • Used in Un*x systems for command line functions too • Results can be platform-dependent • Stable, but feature set stuck in the 1980s
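A minimal sketch of the POSIX categories in use, through Python's `locale` module, which wraps the C library's `setlocale` and `localeconv`. `decimal_point_for` is a hypothetical helper written for this example. Which locale names are installed varies by system (the platform dependence noted above), so the helper returns `None` when a locale is unavailable; only the "C" locale is assumed to exist.

```python
import locale

def decimal_point_for(locale_name):
    """Return the LC_NUMERIC decimal point for a locale, or None if unavailable."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.localeconv()["decimal_point"]
    except locale.Error:
        return None
    finally:
        # Always restore the previous setting; setlocale is process-wide.
        locale.setlocale(locale.LC_NUMERIC, saved)

print(decimal_point_for("C"))
print(decimal_point_for("fr_FR.UTF-8"))  # result depends on installed locales
```

Because `setlocale` changes state for the whole process, the helper restores the previous value before returning.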
Frameworks: ICU Library • IBM Open Source project • Developed originally for the Taligent OS project in the late 80s/early 90s • Java and C++ APIs • Extensive locale data and APIs to use it • http://www.icu-project.org/cgi-bin/locexp • Also includes localization support • Everybody (Mac OS X, Java, DB2, Mathworks …) is using it • But …
Frameworks: Microsoft • Windows NLS API • Microsoft .NET Framework System.Globalization namespace • Similar set of data to ICU • Vetted by subsidiaries • APIs accessible from all MS programming languages • Localization support in different API
Microsoft demos • Culture Explorer • Microsoft Transliteration Utility
Extensibility • What if I don’t find the locale I need? • What if I need to modify some of the data? • ICU • Can create new locales • Microsoft • .NET Framework v2.0: custom cultures • Windows Vista: custom locales • LDML can be interchange format
Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling
Usages for Computational Linguistics • Up to the imagination • Transliteration use in MT • Named Entity Recognition • … • suggestions? • Most importantly: Do not reinvent the wheel! • Check if API or data you need is available • If possible write code in a language/locale-independent fashion
References • RFC3066bis • http://www.inter-locale.com/ID/why-rfc3066bis.html • Ethnologue • http://www.ethnologue.com/ • Common Locale Data Repository • http://www.unicode.org/cldr/ • POSIX Locale • http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html • ICU • http://icu.sourceforge.net/ • Microsoft • http://www.microsoft.com/globaldev/ • UNGEGN Working Group on Romanization Systems • http://www.eki.ee/wgrs/