An Experience of the Language Observatory Project
Yoshiki Mikami, Leader, Language Observatory Project, Japan Science & Technology Agency
Workshop on “Recent Experiences on Measuring Languages on the Cyberspace”, UNESCO, Paris, February 22, 2007
Outline
• Global Digital Divide
• Language Observatory: How It Functions
• Major Findings
  3.1 Survey Snapshots, Asia and Africa
  3.2 Technical Aspect of the Divide
  3.3 Social Aspect of the Divide
  3.4 Several Non-linguistic Aspects
• Future Agenda
  Regarding Measurement
  From Measurement to Empowerment
1. Global Digital Divide: Income, telephony
[Charts: 1999 and 2004]
Source: ITU Statistics
The Degree of Inequality: Telephony < Income < Internet
Gini coefficient: Telephony 0.51 < GDP 0.73 < Internet 0.91
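For readers unfamiliar with the measure, the Gini coefficient is 0 when a resource is spread evenly and approaches 1 when it is concentrated in very few hands. The sketch below is a minimal, unweighted version; the function name `gini` and the sample figures are illustrative assumptions, not ITU data, and the slide's own figures may have been weighted by population.

```python
def gini(values):
    """Unweighted Gini coefficient of a list of non-negative numbers."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form on sorted data:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i starting at 1
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-country values, only to show the two extremes.
evenly_spread = [10, 10, 10, 10]
concentrated = [0, 0, 0, 40]
print(gini(evenly_spread), gini(concentrated))
```

With only four entries the maximum possible value is (n - 1) / n = 0.75; with many countries the upper bound approaches 1, which is why a value such as 0.91 for Internet indicates a very uneven distribution.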
UNESCO Recommendation
Recommendation concerning the Promotion and Use of Multilingualism and Universal Access to Cyberspace, October 2003
[PREAMBLE]
• Noting that linguistic diversity in the global information networks and universal access to information in cyberspace are at the core of contemporary debates and can be a determining factor in the development of a knowledge-based society,
2. Language Observatory: How It Functions
http://gii.nagaokaut.ac.jp/gii/papers.php
Internet → Crawler [UbiCrawler] → pages → Language Identifier [LI]
Page source as seen by the crawler:
<HTML><HEAD> <TITLE>Language Observatory</TITLE> <META http-equiv=Content-Type content="text/html; charset=UTF-8"> </HEAD> <BODY> <A href="http://www.language-observatory.org"><IMG height=137 alt="logo" src="LO.files/logo.gif" width=155></A> <H2>About us</H2> <P>Astronomical observatory catches the light from stars, likewise.................
Analyses and outputs:
• Tag Analysis
• Content Analysis
• Analysis on Digital Language Divide
• Language Resources
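One piece of tag analysis can be illustrated with the page source on this slide: reading the encoding that a page declares about itself in its META tag. The sketch below uses only Python's standard `html.parser`; the class and function names are hypothetical, and nothing here is taken from the project's actual crawler or language identifier.

```python
from html.parser import HTMLParser


class CharsetSniffer(HTMLParser):
    """Records the first charset declared in a <meta> tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        a = dict(attrs)
        if a.get("charset"):
            self.charset = a["charset"].strip().lower()
            return
        if (a.get("http-equiv") or "").lower() == "content-type":
            # content looks like "text/html; charset=UTF-8"
            for part in (a.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.strip().lower() == "charset" and value:
                    self.charset = value.strip().lower()


def declared_charset(html):
    """Return the declared charset in lowercase, or None if absent."""
    sniffer = CharsetSniffer()
    sniffer.feed(html)
    return sniffer.charset


sample = ('<HTML><HEAD><TITLE>Language Observatory</TITLE>'
          '<META http-equiv=Content-Type content="text/html; charset=UTF-8">'
          '</HEAD><BODY><H2>About us</H2></BODY></HTML>')
print(declared_charset(sample))
```

A declared charset is only a hint. Pages in local proprietary encodings often declare nothing or declare a wrong value, which is one reason identification based on the text content itself is still needed.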
Unit of Identification = LSE (Language + Script + Encoding)
Pages may differ in language, in script, or in encoding.
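The LSE idea can be pictured as a three-part key: two pages count as the same unit only if language, script, and encoding all match. The following is a minimal sketch; the `LSE` type, the `count_by_lse` helper, and the sample labels are assumptions for illustration, since the project's own label scheme is not given in these slides.

```python
from collections import Counter
from typing import NamedTuple


class LSE(NamedTuple):
    """One identification unit: language, script, and encoding together."""
    language: str
    script: str
    encoding: str


def count_by_lse(identified_pages):
    """Count pages per (language, script, encoding) triple."""
    return Counter(identified_pages)


# Hypothetical identification results for five pages.
pages = [
    LSE("sr", "Cyrl", "utf-8"),
    LSE("sr", "Latn", "utf-8"),         # same language, different script
    LSE("hi", "Deva", "utf-8"),
    LSE("hi", "Deva", "font-specific"),  # same language and script, different encoding
    LSE("hi", "Deva", "utf-8"),
]
counts = count_by_lse(pages)
for lse, n in counts.most_common():
    print(lse.language, lse.script, lse.encoding, n)
```

Keeping the three parts separate lets the same counts be regrouped later, for example by language alone, without re-running identification.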
The First Workshop on the IMLD, 2004
UNESCO reported the launch of the project
http://portal.unesco.org/ci/en/ev.php-URL_ID=14480&URL_DO=DO_TOPIC&URL_SECTION=201.html
Expert Collaboration: Case of African Survey
June 26-28, 2006 at Bamako, Mali
ACALAN; Mali, Algeria, Burkina Faso, Ethiopia, Kenya, Malawi, Nigeria, Tunisia; CNRS, France
Researchers Network: Over 35 countries
Experts’ contribution is essential in the collection of text in local encodings, the selection of seed URLs, and the verification of LI results.
3.1 Survey Snapshot: Languages on the net, Asia (as of June 2006)
3.1 Survey Snapshot (cont.): Languages on the net, Africa (as of October 2006)
3.2 Technical Aspect: Localization Problem
“Language localization” has been the key obstacle to the use of new information technologies since the age of movable-type printing.
A Jesuit Friar’s letter, 1608: Six hundred versus 24
"Before I end this letter I wish to bring before Your Paternity's mind the fact that for many years I very strongly desired to see in this Province some books printed in the language and alphabet of the land, as there are in Malabar with great benefit for that Christian community. And this could not be achieved for two reasons; the first because it looked impossible to cast so many moulds amounting to six hundred, whilst as our twenty-four in Europe."
Doctrina Christam in Tamil, 1578
Source: Priolkar, The Printing Press in India, Bombay, 1958
Doctrina in Tagalog, 1593: The script was finally lost
Philippines postal stamp issued in 1995
“Doctrina Christiana”, bilingual version, printed in Tagalog in the Tagalog script / in Tagalog in the Latin script / in Spanish in the Latin script.
Encoding chaos leads to delay of localization (as of June 2006)
Note: Local proprietary encodings are shown in this table by the names of fonts (font families).
Unavailability of search engines: another problem
Google (as of June 2006)
Technical Aspect of the Digital Language Divide
Actors: local media, local IT firms, gov., users, global IT firms, int’l standard bodies
Factors:
• differentiation strategy to enclose customers
• encoding chaos
• delay in localization
• non-availability of search engines (SEs)
• lack of leadership in standardization
• lack of standard in typewriter keyboard
• less attention from IT vendors
• difficulty in access to standardization process
• various localization by overseas communities
3.3 Social Aspect: Languages in a multilingual society
Based on the EU’s “Common European Framework of Reference for Languages” (2004)
Language plays a different role in a multilingual society
Secondary-level domain and socio-economic domain:
• ac.xx: educational
• com.xx: occupational
• gov.xx: public
• others: personal
Language tiers: global languages, regional languages, official language(s), minority language(s)
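The mapping from secondary-level domain to socio-economic domain can be sketched as a small lookup on the host name. The function name and the example hosts below are hypothetical; only the ac/com/gov/others mapping comes from the slide. Countries that use other labels (for instance edu.xx rather than ac.xx) would need extra entries, which this sketch does not attempt.

```python
from urllib.parse import urlsplit

# Mapping taken from the slide; anything else counts as "personal".
DOMAIN_OF_LABEL = {
    "ac": "educational",
    "com": "occupational",
    "gov": "public",
}


def socio_economic_domain(url):
    """Classify a URL by the label just left of a two-letter country code."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    labels = host.split(".")
    # Need at least name + secondary label + ccTLD, e.g. example.ac.ir
    if len(labels) >= 3 and len(labels[-1]) == 2:
        return DOMAIN_OF_LABEL.get(labels[-2], "personal")
    return "personal"


for u in ("http://www.example.ac.ir/",
          "http://shop.example.com.tr/page",
          "http://www.example.gov.cy/",
          "http://blog.example.kz/"):
    print(u, socio_economic_domain(u))
```

Counting identified languages within each of these four classes per country is what allows the kind of secondary domain comparison shown for individual countries.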
Specialization of Language: Secondary domain analysis
Cyprus, Turkey, Kazakhstan, Iran
Social Aspect of the Digital Language Divide
Actors: overseas community, local business, global IT firms, e-business, local media, media, press, gov., users
Education levels: primary, secondary education, higher education
Factors:
• restricted social activities
• non-availability of SEs
• absence of mother language
• low literacy
3.4 Non-linguistic Aspects
a. Network and Server (as of December 2005)
• ○ rw: Rwanda
• △ ml: Mali
• □ mz: Mozambique
• White: servers installed in the country
• Colored: servers installed overseas
80% of servers under African domains are located outside the country. 60% of servers under Asian domains are also “offshore”.
Complaint against access: A letter from Namibia
"I am the web master of the XXXXXXX Database. We are being severely hit by your Language Observatory's web crawler - already 37000 page hits this month. In December 2005 you hit us 34000 times. We are on limited bandwidth, and this puts unacceptable strain on our server. I notice that you consider one HTTP request every 5 seconds 'polite' and 'modest'. This may be true in Japan, but not in Africa - our connections are very slow and very narrow. I would appreciate it if you could prevent your crawlers from visiting our URL again. In return, I will be happy to provide you directly with whatever statistics about our site you need for your research. Sincerely"
We carefully control data collection speed using a set of parameters, such as revisiting interval, depth, maximum pages per server, and a list of prohibited URLs.
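The four parameters named above can be pictured as a per-server gate that every candidate URL must pass before it is fetched. This is a hedged sketch, not the project's crawler configuration: all names and default values are hypothetical (only the 5-second figure echoes the letter), and "revisiting interval" is interpreted here as the minimum delay between two requests to the same server.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class CrawlPolicy:
    revisit_interval_s: float = 5.0     # assumed: min seconds between hits on one server
    max_depth: int = 8                  # hypothetical link depth limit
    max_pages_per_server: int = 1000    # hypothetical per-server page cap
    prohibited_prefixes: tuple = ()     # URL prefixes that must never be fetched


@dataclass
class ServerState:
    pages_fetched: int = 0
    last_fetch_time: float = float("-inf")


def may_fetch(url, depth, now, policy, states):
    """Return True only if every politeness rule allows fetching url now."""
    if any(url.startswith(p) for p in policy.prohibited_prefixes):
        return False
    if depth > policy.max_depth:
        return False
    host = urlsplit(url).hostname or ""
    state = states.setdefault(host, ServerState())
    if state.pages_fetched >= policy.max_pages_per_server:
        return False
    return now - state.last_fetch_time >= policy.revisit_interval_s


def record_fetch(url, now, states):
    """Update per-server counters after a page has actually been fetched."""
    host = urlsplit(url).hostname or ""
    state = states.setdefault(host, ServerState())
    state.pages_fetched += 1
    state.last_fetch_time = now
```

A complaint like the one above would be handled by adding the site's URL prefix to `prohibited_prefixes`; servers on slow links could also be given a longer interval or a lower page cap than the defaults.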
b. Domain Governance (as of December 2005)
[Chart: pages vs. population (1,000)]
Management of small islands’ domains is often re-delegated to overseas web-hosting operators, who tend to admit spam, porn, etc.
c. Access regulations by the government
Countries where only state-controlled TV stations are available show a higher percentage of links going to global news sites abroad.
4. Future Agenda
• Regarding Measurement
  • Improvement of accuracy and coverage
  • Multi-stakeholder Collaboration
  • Global Observatories Network
• From Measurement to Empowerment
  A Goals/Targets/Indicators system which helps and guides stakeholders in empowering languages
“Language Empowerment”: Mother language for creation
Actors: language community, OSS developers, IT firms, media, press, higher education, gov, users
Measures and goals:
• localization of application SW based on standard
• local language search engines
• language portal
• promotion of NLP: OCR, TTS, MT, e-dictionary, etc. (mother-language information processing technology: OCR, TTS, translation)
• mother language for creation
• creation of local contents (rich mother-language content)
• electronic delivery of public services
• mother language use in higher education
• literacy
Thanks for your attention
Jehan Rictus Square, Paris. Photo courtesy of Wunna Ko Ko, June 2005