TTC project

TTC project: EC project and progress presentation, 28 May 2010
Terminology Extraction, Translation Tools and Comparable Corpora, 2010-2012
www.ttc-project.eu
ICT 2009.2.2: Language-Based Interaction
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 248005.

Introduction & Objective
• Introduction
  • Central role of linguistic resources for translation applications
  • Specialized languages
  • Multilingual terminologies
• Objective: producing multilingual terminologies from comparable corpora for translation applications
TTC presentation – 28/05/2010

Concepts 1/2
• Parallel corpora: an original text and its manual translation into one or more languages
  Restrictions (sparse data):
  • only available for some language pairs, one of which is usually English [French-English Hansards (Germann, 2001)]
  • only available for a few specific domains [Bible (Resnik et al., 1999), Europarl (Koehn, 2005), JRC-Acquis (Steinberger et al., 2006)]
• Comparable corpora
  • [EAGLES 1996] “A comparable corpus is one which selects similar texts in more than one language or variety.”
  • [Bowker, Pearson 2002, p. 93] “sets of texts in different languages, that are not translations of each other”

Concepts 2/2
• Monolingual terminology extraction: automating the extraction of terms from corpora in specialized domains
  • Single-word terms (SWTs)
  • Multi-word terms (MWTs)
• Alignment through lexical context analysis
  • [Grefenstette, 1994, p. 279] « First-order affinities describe what other words are likely to be found in the immediate vicinity of a given word »
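The notion of first-order affinities can be illustrated with a small co-occurrence count. The following is only a minimal sketch, not the TTC implementation: the function name `context_vectors`, the window size and the toy corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors

# Toy, already tokenized corpus (hypothetical data).
corpus = [
    ["wind", "turbine", "blade", "design"],
    ["offshore", "wind", "turbine", "installation"],
]
vectors = context_vectors(corpus, window=1)
print(vectors["turbine"].most_common())
```

With a window of one word, "turbine" is associated with "wind" twice and with "blade" and "installation" once each. Real systems usually replace raw counts with an association measure, which is left out here.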

Bilingual terminology mining chain (diagram)
• Web harvesting of documents, yielding source documents and target documents
• On each side: terminology extraction, then lexical context extraction
• Lexical alignment process, supported by a bilingual dictionary
• Terms to be translated and their candidate translations
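The alignment step of such a chain is commonly sketched as follows: the context vector of a source term is projected into the target language through the bilingual dictionary, then compared with the context vectors of target terms. This is a hedged illustration only; the function names, the cosine measure and the French-English toy data are assumptions, not details given in the presentation.

```python
import math
from collections import Counter

def translate_vector(vector, dictionary):
    """Project a source-language context vector into the target language."""
    translated = Counter()
    for word, weight in vector.items():
        # Words missing from the dictionary are dropped (a simplification);
        # every listed translation receives the full weight.
        for target_word in dictionary.get(word, []):
            translated[target_word] += weight
    return translated

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(source_vector, target_vectors, dictionary, top_n=20):
    """Return the top_n target terms, most similar first, as (term, score) pairs."""
    projected = translate_vector(source_vector, dictionary)
    scored = [(cosine(projected, vec), term) for term, vec in target_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(term, score) for score, term in scored[:top_n]]

# Hypothetical data: context of the French term "éolienne" and two English candidates.
source_vector = Counter({"vent": 3, "pale": 2, "énergie": 1})
dictionary = {"vent": ["wind"], "pale": ["blade"], "énergie": ["energy", "power"]}
target_vectors = {
    "turbine": Counter({"wind": 4, "blade": 2, "energy": 1}),
    "battery": Counter({"energy": 3, "charge": 2}),
}
ranked = rank_candidates(source_vector, target_vectors, dictionary)
print(ranked)
```

In this toy setting "turbine" ranks above "battery" because its context overlaps far more with the projected vector.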

Applications
• Machine translation tools (MT tools)
• Computer-assisted translation tools (CAT tools)
• Multilingual content management tools
• Terminology management tools

Objectives
• Compiling comparable corpora
• Candidate term extraction
• Defining and combining different strategies for term alignment
• Development of an open platform for use in MT and CAT tools
• Demonstrating on MT and CAT tools

Comparable corpora
• « Web as a corpus » approach: successful for general-language corpus compilation
• Objective: compilation of specialized-language corpora
• Methods: monolingual / interlingual comparability
• Outputs: topical web crawler (M24)

Term extraction
• Single-word terms (SWT) / multi-word terms (MWT)
• Statistical and symbolic approaches
• Objectives
  • evaluation of resources for term extraction (SWT/MWT performance)
  • variations of MWT
  • extraction of context data
• Outputs
  • sets of extraction tools / rule sets for variants (M24)
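As an illustration of the symbolic side of candidate term extraction, the sketch below collects adjective/noun sequences that end in a noun from POS-tagged, lemmatized input and counts them. The pattern, the tag names (`ADJ`, `NOUN`) and the sample sentences are assumptions chosen for the example; the project's actual rule sets are not described in the slides.

```python
from collections import Counter

def candidate_terms(tagged_sentences):
    """Count maximal ADJ/NOUN runs that end in a NOUN (a simplified term pattern)."""
    counts = Counter()
    for sentence in tagged_sentences:
        run = []
        # The sentinel token flushes a run that reaches the end of the sentence.
        for lemma, pos in list(sentence) + [("", "END")]:
            if pos in ("ADJ", "NOUN"):
                run.append((lemma, pos))
                continue
            # Trim trailing adjectives so that the candidate ends on a noun.
            while run and run[-1][1] != "NOUN":
                run.pop()
            if run:
                counts[" ".join(lem for lem, _ in run)] += 1
            run = []
    return counts

# Hypothetical lemmatized and POS-tagged sentences.
sentences = [
    [("renewable", "ADJ"), ("energy", "NOUN"), ("be", "VERB"), ("clean", "ADJ")],
    [("the", "DET"), ("wind", "NOUN"), ("turbine", "NOUN"),
     ("produce", "VERB"), ("renewable", "ADJ"), ("energy", "NOUN")],
]
counts = candidate_terms(sentences)
mwts = {term: n for term, n in counts.items() if " " in term}  # multi-word terms
swts = {term: n for term, n in counts.items() if " " not in term}  # single-word terms
print(mwts, swts)
```

A statistical filter (for example a frequency or termhood threshold) would normally follow this step; it is omitted to keep the sketch short.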

Term alignment
• Contextual analysis: 60% on TOP20 (correct translation among the top 20 candidates) for single-word terms
• Objectives
  • To improve the contextual analysis of SWTs
  • To reach a score for MWTs close to the score for SWTs
• Methods
  • Lexical / contextual / corpora strategies
• Outputs
  • Neo-classical MWT detection component (M18)
  • Compositional translation component (M24)
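A TOP20-style score can be computed as the share of reference terms whose correct translation appears among the first N ranked candidates. The sketch below assumes that reading of the figure; the function name `top_n_accuracy` and the sample data are hypothetical.

```python
def top_n_accuracy(ranked_candidates, reference, n=20):
    """Share of reference terms with at least one gold translation in the top n.

    ranked_candidates: dict mapping a source term to its ranked candidate list.
    reference: dict mapping a source term to a set of accepted translations.
    """
    if not reference:
        return 0.0
    hits = 0
    for term, gold in reference.items():
        if gold & set(ranked_candidates.get(term, [])[:n]):
            hits += 1
    return hits / len(reference)

# Hypothetical ranked output and reference list.
ranked_candidates = {
    "éolienne": ["turbine", "windmill"],
    "pile": ["stack", "heap", "battery"],
}
reference = {
    "éolienne": {"wind turbine", "turbine"},
    "pile": {"battery"},
}
print(top_n_accuracy(ranked_candidates, reference, n=2))  # 0.5
print(top_n_accuracy(ranked_candidates, reference, n=3))  # 1.0
```

Terms with no candidate list at all count as misses, which keeps the score comparable across systems with different coverage.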

Open Platform
• Several tool suites for aligned corpora (Itools, GIZA++)
• Objective
  • handling and exploiting comparable corpora
• Outputs
  • Terminology tool suite for comparable corpora
  • Open terminology management tool

Participants

Main impacts
• Better language coverage
  • 5 distinct language families
  • 7 targeted languages: Chinese, English, French, German, Latvian, Russian and Spanish
  • 12 language pairs: Zh-En, Zh-Fr, En-Fr, En-De, En-Lv, En-Ru, En-Es, Fr-De, Fr-Ru, Fr-Es, De-Es, Lv-Ru
• Expected resources
  • Domain-specific resources for renewable energy and computer science (focus on mobile technologies) in 7 languages
  • Comparable corpora, lemmatized and POS-tagged (M24)
  • Rule sets for recognizing term variants, for term inflection and morphological analysis (M24)
  • Bilingual aligned terminologies (M27)

Work planning
• Project duration: 36 months

WP1 – Requirements & Specifications (UN-LINA)
• Task 1.1 – requirements analysis (by Syllabs)
  • Online survey advertised among translator and localization communities; 139 answers received, mainly to/into EN
  • 74% use translation software (TRADOS leads, with 17% of respondents)
  • Reasons for not using MT: price, translation quality, not suitable for specific domains
  • Users' wishes for the collection of corpora
    • Search functions
    • Automatic updating (crawl)
    • Frequency lists
    • Annotation function
    • Collaborative tool: share with others
    • Formats: sentence-per-line plain text, TMX format
    • Terminology extraction tools

WP1 – Requirements & Specifications (UN-LINA)
• Task 1.2 – functional specifications (by UN-LINA)
  • Toward a data model
  • Integration level: a functional approach
  • Functional capabilities:
    • Linguistic analysis stage: raw token, lemma, POS, offset, ...
    • Extractor and aligner stages: the analysis must continue, taking into account the information required by the final output formats (e.g. for terms: TBX or a TMF-compliant format)
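The fields listed for the linguistic analysis stage (raw token, lemma, POS, offset) suggest a simple record per token. The sketch below is one possible shape for such a record, not the project's data model: the class name `Token`, the `annotate` helper and the toy lexicon are assumptions for illustration.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    raw: str     # surface form as it appears in the text
    lemma: str   # base form
    pos: str     # part-of-speech tag
    offset: int  # character offset of the token start in the source text

def annotate(text, lexicon):
    """Build Token records from a toy lexicon mapping word -> (lemma, pos)."""
    tokens = []
    for match in re.finditer(r"\w+", text):
        raw = match.group()
        # Unknown words fall back to their lowercased form and the tag "UNK".
        lemma, pos = lexicon.get(raw.lower(), (raw.lower(), "UNK"))
        tokens.append(Token(raw, lemma, pos, match.start()))
    return tokens

# Hypothetical lexicon standing in for a real lemmatizer and POS tagger.
lexicon = {"turbines": ("turbine", "NOUN"), "wind": ("wind", "NOUN")}
text = "Wind turbines spin"
tokens = annotate(text, lexicon)
for token in tokens:
    print(token)
```

Keeping the offset makes it possible to map extracted terms back to their position in the source document, whatever the final output format is.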

WP1 – Requirements & Specifications (UN-LINA)
• Task 1.3 – definition of data exchange format (by IMS)
  • The exchange format has to be as simple as possible for internal purposes
  • Output format will be TBX / TMX
  • Consultation with UN-LINA:
    • on data categories used in TTC tools
    • on UIMA-based processing formats
    • → First outline specification of processing formats
  • Consultation with users (Sogitec, Tilde):
    • on data categories used in input tools/resources
    • on data categories used in CAT/MT
    • → First outline specification of semantic interoperability under exchange
  • Mapping of requirements onto ISO formats
    • → First outline specification of exchange format

WP2: Corpora compilation (by UL-CTS)
• Use of a pre-existing corpus to evaluate comparability
• First version (not fully functional) of the crawler by September 2010
• Issues in multilingualism, for example gender management
• Corpus contents depend on the type of document (news, reports, blogs…) and also differ according to targeted audience, authority, content vs linkerie and region

WP8 Dissemination (by TILDE)
• T8.1: Website: www.ttc-project.eu, collaborative platform
• T8.2: Workshop in July
• T8.3: Poster and leaflet presented during LREC 2010 in Malta
• T8.4: IPR guidelines deliverable submitted to EC
• T8.5: Dissemination by every channel:
  • conferences (“Applied Linguistics in Science and Education”, 25-26 March 2010, Saint-Petersburg, Russia; LREC 2010, Malta; EAMT Conference 2010, Saint Raphaël; 14th EURALEX International Congress, 6-10 July 2010, Leeuwarden/Ljouwert, The Netherlands)
  • social media (LinkedIn, professional communities…)
