Multilinguality and cross-language searching
Multilinguality and cross-language searching
Multilingual aspects in Indexing, Searching and Metadata (Resource Description)
Multilinguality in Indexing, Searching and Metadata
Multilingual aspects in Indexing, Searching and Metadata
• IETF Model of Multilingual support in Internet Applications
  • Electronic Mail
  • Interactive applications
• Charset and Language tagging
  • MIME types
  • XML Language and Charset tagging
  • DC language definition
• Metadata and RDF
  • DC.Language
• Existing solutions
  • TUSTEP
  • Search Engines and Subject Gateways
• Multilingual framework for the REIS Project
IETF Model of Multilingual support in Internet Applications
• Electronic Mail
  • Language
  • Character Encoding Scheme
  • Transfer Encoding Scheme
• Interactive applications
  • WWW: HTTP/HTML
    • http-equiv="Content-Type" Content="text/html; charset=euc-jp"
    • <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
  • XML/DOM
  • LDAP and X.500 (?)
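The charset tagging on this slide can be read back programmatically. The following is a minimal sketch, not part of the original presentation, that extracts the `charset` parameter from a `Content-Type` value and from an HTML `<META http-equiv=...>` tag using only the Python standard library. The helper names `charset_from_content_type` and `charset_from_meta` are illustrative assumptions.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lower-cased by the email package


class MetaCharsetParser(HTMLParser):
    """Remember the charset of the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "content-type":
            self.charset = charset_from_content_type(attr.get("content", ""))


def charset_from_meta(html):
    """Hypothetical helper: scan an HTML string for a META charset tag."""
    parser = MetaCharsetParser()
    parser.feed(html)
    return parser.charset


if __name__ == "__main__":
    tag = '<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">'
    print(charset_from_meta(tag))
```

A client would typically prefer the charset from the HTTP header and fall back to the META tag only when the header carries none; that ordering is a design choice of the caller, not something the sketch enforces.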
XML: Language and Charset tagging
• A character is the atomic unit of text
  • All ISO 10646 characters + TAB, CR, LF
  • The mechanism for encoding characters can vary from entity to entity
  • All XML processors must accept UTF-8 and UTF-16
• Character Encoding in Entities (XML 4.3.3)
  • EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
  • <?xml encoding='UTF-8'?>
  • <?xml encoding='EUC-JP'?>
  • Autodetection of Character Encoding
• Language identification (XML 2.12)
  • Tag for identification of languages
  • LanguageID ::= Langcode ('-' Subcode)*
  • Langcode ::= ISO639Code | IanaCode | UserCode
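As a rough illustration of the two mechanisms on this slide, the sketch below parses the same document once as UTF-8 and once as UTF-16, then reports the effective `xml:lang` of each element (the attribute is inherited by child elements). It also checks tags against the LanguageID grammar. The regular expression assumes the sub-productions of the first edition of XML 1.0 (a two-letter ISO 639 code, or an `i-`/`x-` prefixed code); the sample record and all function names are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# LanguageID ::= Langcode ('-' Subcode)*
# Langcode   ::= ISO639Code | IanaCode | UserCode   (assumed sub-productions)
LANGUAGE_ID = re.compile(
    r"(?:[A-Za-z]{2}|[iI]-[A-Za-z]+|[xX]-[A-Za-z]+)(?:-[A-Za-z]+)*"
)

# Invented sample record, used only for this illustration.
SAMPLE = (
    '<record xml:lang="en">'
    '<title xml:lang="es">La Mesa y Silla Roja</title>'
    "<note>n/a</note>"
    "</record>"
)


def is_language_id(tag):
    """True if the whole tag matches the LanguageID production."""
    return LANGUAGE_ID.fullmatch(tag) is not None


def parse_with_declared_encoding(text, encoding):
    """Prepend an XML declaration and parse the bytes in that encoding."""
    declaration = '<?xml version="1.0" encoding="%s"?>' % encoding
    return ET.fromstring((declaration + text).encode(encoding))


def effective_languages(elem, inherited=None):
    """Yield (tag, language) pairs; children inherit xml:lang from parents."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_languages(child, lang)


if __name__ == "__main__":
    for encoding in ("UTF-8", "UTF-16"):
        root = parse_with_declared_encoding(SAMPLE, encoding)
        print(encoding, list(effective_languages(root)))
```

Only UTF-8 and UTF-16 are exercised here because those are the encodings every XML processor must accept; support for others, such as EUC-JP, depends on the parser in use.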
Charset and Language tagging
• MIME types
  • text, image, audio, video
• Charset = Character Set + Character Encoding Scheme
• Transfer Encoding Scheme
  • base64
  • quoted-printable
• Language
  • RFC 1766
  • ISO 639-2
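The three layers named on this slide (charset, transfer encoding, language) map directly onto MIME headers. The sketch below builds a message with Python's standard `email` package to show where each tag ends up. The function name `build_tagged_message` and the sample text are assumptions made for the illustration.

```python
from email.message import EmailMessage


def build_tagged_message(text, language, charset, transfer_encoding):
    """Return a text/plain message carrying charset, CTE and language tags."""
    msg = EmailMessage()
    # set_content() writes Content-Type (with charset) and
    # Content-Transfer-Encoding, and clears earlier Content-* headers,
    # so Content-Language has to be added afterwards.
    msg.set_content(text, charset=charset, cte=transfer_encoding)
    msg["Content-Language"] = language
    return msg


if __name__ == "__main__":
    msg = build_tagged_message(
        "Catálogo: La Mesa y Silla Roja",
        language="es",
        charset="iso-8859-1",
        transfer_encoding="quoted-printable",
    )
    print(msg["Content-Type"])
    print(msg["Content-Transfer-Encoding"])
    print(msg["Content-Language"])
    print(msg.get_payload())
```

Passing `transfer_encoding="base64"` instead produces the other transfer encoding listed on the slide; the charset and language tags are unaffected.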
Language Definition in DC Metadata set
• <meta name="DC.language" scheme="rfc1766" content="es">
  • alternative scheme value: "ISO639-2"
• <meta name="DC.title" lang="es" content="La Mesa y Silla Roja">
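An indexer has to read these tags back out of the page. A minimal sketch, assuming the `DC.` name prefix shown on the slide and using only the standard library, could collect them as follows; the class name `DublinCoreParser` and the record layout are illustrative choices.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*"> tags as dictionaries."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.lower().startswith("dc."):
            self.records.append(
                {
                    "element": name[3:].lower(),  # e.g. "language", "title"
                    "scheme": attr.get("scheme"),
                    "lang": attr.get("lang"),
                    "content": attr.get("content"),
                }
            )


def dublin_core_records(html):
    """Hypothetical helper: return all DC meta records found in html."""
    parser = DublinCoreParser()
    parser.feed(html)
    return parser.records


if __name__ == "__main__":
    page = (
        '<meta name="DC.language" scheme="rfc1766" content="es">'
        '<meta name="DC.title" lang="es" content="La Mesa y Silla Roja">'
    )
    for record in dublin_core_records(page):
        print(record)
```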
Multilingual Subject Gateway
• Developing multilingual subject gateways (SOSIG as example)
  • SOSIG accepts resources in any language, evaluated for quality
  • Translations should be coherent and checked
  • Different language versions should be equally well maintained
• SOSIG Cataloguing rules
  • TITLE will be displayed in the first language
  • ALTERNATIVE TITLE in other languages
  • DESCRIPTION will mention the different languages in which the resource is available
  • URI of all language versions
  • Labeling URI language
• Library standards for multilingual provision
  • NISO Z39.53 Language codes
  • USMARC Language codes
Multilingual provision in popular Internet Search Engines
• AltaVista
  • Search in 25 languages
  • Documents indexed as is
  • Automatic translation - very simple and naive
• Other sites that have dedicated national sites
  • interface language
  • language resources
  • no special language policy
• Euroseek
• Excite
• Lycos
• Infoseek
New Developments in Subject Gateways, Indexing, Searching
• NRENs projects
• Subject gateways
• Commercial Search Engines
• Multilingual Text Retrieval and Processing
• TUSTEP system
NREN projects
• Social Science Information Gateway - http://sosig.esrc.bris.ac.uk/
• ROADS Project Software/Documentation Server - http://www.roads.lut.ac.uk/
• CHIP-Pilot (Clearing House for Internet Projects) - http://www.terena.nl/chip/
• IMesh - International Collaboration on Internet Subject Gateways - http://www.desire.org/html/subjectgateways/community/imesh/
• DFN Indexing and Searching projects - http://www.dfn.de/links/suchen.html
• X.500 Directory E-mail Addresses Search (AMBIX-D) - http://ambix.uni-tuebingen.de:8889
• TUSTEP Multilingual Textdata Processing and Fuzzy Searching - http://www.uni-tuebingen.de/zdv/tustep/tdv_eng.html
• IKEM Toolkit - http://bikit.rug.ac.be:80/ikem/
• DRUID Classification Tools, University of Twente - http://twentyone.tpd.tno.nl/druid/
Search Engines news
• CLEVER project at IBM Almaden Research Center - http://www.almaden.ibm.com/cs/k53/clever.html
• Cora Search Engine - http://www.cora.justresearch.com/about.html
• Google Search Engine - http://www.google.com/why_use.html
• Free AltaVista Search Intranet v2.3A Entry Level Software - http://www.altavista.software.digital.com/search/intranet/free_3k/index.asp
• Ultraseek Server for Linux Platforms - http://software.infoseek.com/products/ultraseek/linux/ultrareq.htm
TUSTEP: TUebingen System of Text Processing Programs
• 1. File structure
• 2. Multilingual capabilities
• 3. Internal data presentation
• 4. Database publishing/output data presentation
• 5. CGI
• 6. Sample implementation
  • http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
  • Try entries like Smith or Meier or...
  • http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery
TUSTEP: File structure
• TUSTEP can handle basically all kinds of (explicitly or implicitly) structured text files
• Special support for XML
• "Databases" (i.e. files with a repeated and regular structure) are only a special case of this
• Fuzzy search and other retrieval actions can then be used to access the data
TUSTEP: Multilingual capabilities
• TUSTEP supports the following scripts:
  - Latin
  - Cyrillic
  - Greek (classical and modern)
  - Hebrew (with support for Yiddish)
  - Arabic
  - Estrangelo
  - Coptic
  - Old Church Slavonic
• More:
  • Phonetics, Egyptian hieroglyphs
  • allows use of combining diacritics
  • Experimental: Indic scripts and Armenian
TUSTEP: Internal data presentation and transformation
• Internally, TUSTEP uses a script tagging system with transliteration into ASCII, which allows all data to be encoded in a human-readable and easily transmittable form
• TUSTEP has a module for importing from and exporting into the UCS (UTF-8 and UTF-16)
• Example: #r+Novij rafiqnij clovnik ykra^ins^bko%:^i movi#r-
• The transformation module allows use of other tagging systems and other transliteration schemes
TUSTEP: Database publishing
• TUSTEP's typesetting module
  • offers a high-quality, fast and easy way of publishing all or part of the database in paper (or PDF) form
TUSTEP: CGI
• Complete control over input and output forms
• Possibility to configure exactly the kind of search(es), e.g.
  • exact matches only
  • Soundex
  • "intelligent" fuzzy search
  • "brute" fuzzy search that tolerates a given number of differing letters
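To make the search modes concrete, here is a small sketch of three of them: exact match, Soundex, and a "brute" fuzzy match. It is not TUSTEP's implementation. It assumes the standard American Soundex rules and interprets "a number of different letters" as a bound on Levenshtein edit distance; the "intelligent" fuzzy search is not modelled because the slides do not describe it. The name list is invented test data built around the sample entries Smith and Meier.

```python
# American Soundex digit groups; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so repeats are coded again
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def search(query, names, mode="exact", max_diff=1):
    """Filter names by the chosen mode: 'exact', 'soundex' or 'brute'."""
    q = query.lower()
    if mode == "exact":
        return [n for n in names if n.lower() == q]
    if mode == "soundex":
        code = soundex(query)
        return [n for n in names if soundex(n) == code]
    if mode == "brute":
        return [n for n in names if edit_distance(q, n.lower()) <= max_diff]
    raise ValueError("unknown search mode: %r" % mode)


if __name__ == "__main__":
    names = ["Smith", "Smyth", "Schmidt", "Meier", "Meyer", "Maier"]
    for mode in ("exact", "soundex", "brute"):
        print(mode, search("Meier", names, mode))
```

Soundex groups names that sound alike in English regardless of length, while the edit-distance bound only tolerates a fixed number of character changes, so the two modes can return different result sets for the same query.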
Multilinguality framework of the project
• Multiple language indexing
  • multiple language documents/indexes
• Cross-language Searching
  • Multiple language indexes/documents
  • Automatic Query forwarding based on thesauri
  • Automatic translation
  • Multilingual information retrieval
  • Translation Request Protocol
• Language and Character Encoding tagging
  • XML as internal presentation of data
  • Using XML language and charset tagging
• Metadata
  • DC.Language definition
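One way to picture "automatic query forwarding based on thesauri" is the sketch below: a query term is looked up in a multilingual thesaurus, and its equivalents are forwarded to one index per language. Everything in it is an assumption made for illustration, including the thesaurus entries, the document identifiers, the one-index-per-language layout and the function names. The slides do not specify how the project implements this step.

```python
# Hypothetical multilingual thesaurus: concept -> term per language.
THESAURUS = {
    "table": {"en": "table", "es": "mesa", "de": "Tisch"},
    "chair": {"en": "chair", "es": "silla", "de": "Stuhl"},
}

# Hypothetical per-language inverted indexes: term -> document ids.
INDEXES = {
    "en": {"table": ["doc-en-1"], "chair": ["doc-en-2"]},
    "es": {"mesa": ["doc-es-1"], "silla": ["doc-es-1"]},
    "de": {"tisch": ["doc-de-1"]},
}


def translate_term(term, source_lang):
    """Return {language: equivalent term}; just the term itself if unknown."""
    needle = term.lower()
    for entry in THESAURUS.values():
        if entry.get(source_lang, "").lower() == needle:
            return dict(entry)
    return {source_lang: term}


def cross_language_search(term, source_lang):
    """Forward the translated term to each language index and merge hits."""
    hits = {}
    for lang, equivalent in translate_term(term, source_lang).items():
        documents = INDEXES.get(lang, {}).get(equivalent.lower(), [])
        if documents:
            hits[lang] = list(documents)
    return hits


if __name__ == "__main__":
    print(cross_language_search("mesa", "es"))
    print(cross_language_search("silla", "es"))
```

Grouping the hits by language keeps the language tag attached to each result, which fits the slide's emphasis on language tagging in the metadata.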