
Creating Translation Context with Disambiguation

Tadej Štajner – Jožef Stefan Institute, Yves Savourel – ENLASO Corporation. Localization World – London – June 2013.


Presentation Transcript


  1. Creating Translation Context with Disambiguation • Tadej Štajner – Jožef Stefan Institute • Yves Savourel – ENLASO Corporation • Localization World – London – June 2013

  2. Context: A Shortcoming • Traditionally, translation tools have been strong at code handling and the re-use of existing translations. • But they have been less good at providing context or linguistic resources for translators. • Things are improving and are bound to improve even more.

  3. New Factors • Component-based processing is becoming widespread (i.e. source text goes through several preparation steps: TM, MT, etc.) • Web services allow a single process to tap into many different resources; specialization becomes easier. • Now ITS 2.0 provides a common way to carry various kinds of information across tools and services.

  4. ITS: Internationalization Tag Set • A set of common internationalization- and localization-related features (called “data categories”) for XML… and now, with ITS 2.0, also for HTML5 • ITS 2.0 is being finalized at the W3C: http://www.w3.org/TR/its20/

  5. ITS and “Context” • ITS 2.0 offers several data categories that can help with contextual information: Localization Note, Terminology, Id Value, Domain and Text Analysis. • Quick glance at the first four, then in-depth look at Text Analysis.

  6. Localization Note Comments put in the source document and meant to be seen by the translators. <msg its:locNote="%s is for On or Off">Click the %s button</msg>

  7. Terminology Annotates a “term” in the content and, optionally, provides additional related information. <p>We need a new <span its-term="yes" its-term-info-ref="http://en.wikipedia.org/wiki/Motherboard">motherboard</span>.</p>

  8. Id Value Provides a way to associate unique IDs with parts of the content during translation. Can be useful for software text where IDs are often descriptive. <its:idValueRule selector="//msg" idValue="@name"/>...<msg name="FILENOTFOUND">Not found</msg>

  9. Domain Identifies the general topic area of the content to translate. Can be useful for selecting MT engines. <its:domainRule selector="/h:html" domainPointer="/h:html/h:head/h:meta[@name='keywords']/@content"/>...<meta name="keywords" content="automotive"/>

  10. Text Analysis • Annotates content with lexical or conceptual information. • Useful for many things: • Term suggestion • General context information • Suggestion of things not to translate • Automated transliteration of proper names • Etc.

  11. Text Analysis: An Example Enrycher is an example of a component that generates Text Analysis annotations and can be easily integrated with translation tools or localization processes.

  12. Motivation • Translating proper names … can be problematic for statistical MT systems

  13. Motivation (2) • There are specific rules for translating (or transliterating) proper names • Solution: figure out what is actually being mentioned and check whether a translated expression already exists for that entity

  14. Motivation (3) • Examples: personal names, product names, geographic names, chemical compounds, protein names • Names and phrases often appear in situations without sufficient context (UI labels, etc.)

  15. ITS 2.0 Text Analysis • Supports text analysis agents that enhance content by suggesting or identifying concepts, identified by IRIs. • A Text Analysis annotation attaches to a text fragment: • entity type • entity identifier • confidence
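
As an illustration only (the record and field names below are ours, not ITS 2.0 markup), the three pieces of information an annotation carries can be modeled as a small Python record:

    # A minimal sketch of the information a Text Analysis annotation carries.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TextAnalysisAnnotation:
        text: str                              # the annotated fragment, e.g. "London"
        ta_class_ref: Optional[str]            # entity type IRI, e.g. "http://schema.org/Place"
        ta_ident_ref: Optional[str]            # entity identifier IRI, e.g. a DBpedia resource
        ta_confidence: Optional[float] = None  # confidence in [0, 1], if the agent provides one

    example = TextAnalysisAnnotation(
        text="London",
        ta_class_ref="http://schema.org/Place",
        ta_ident_ref="http://dbpedia.org/resource/London",
        ta_confidence=0.95,  # hypothetical value for illustration
    )
    print(example)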

  16. Text Analysis in ITS 2.0 – what can it tell us? • Does a text fragment represent some entity? • London is lovely in the summer. • Out of 73 known entities named London, we mean a particular one: http://dbpedia.org/resource/London • … a particular type of entity? • London is a phrase, representing a location • … and with what confidence?

  17. ITS 2.0 Text Analysis <!DOCTYPE html> <div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher"> <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span> is the <span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">capital</span> of <span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom" its-ta-class-ref="http://schema.org/Place">United Kingdom</span>. </div>
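
A minimal sketch, using only the Python standard library, of how a consuming tool might read the its-ta-* attributes back out of annotated HTML like the snippet above; the reader class is illustrative, not part of any ITS 2.0 API:

    # Collect (text, ident-ref, class-ref) triples from annotated spans.
    from html.parser import HTMLParser

    class ITSTextAnalysisReader(HTMLParser):
        def __init__(self):
            super().__init__()
            self._current = None   # annotation currently being read
            self.annotations = []  # (text, ident_ref, class_ref) tuples

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "span" and "its-ta-ident-ref" in attrs:
                self._current = {"ident": attrs["its-ta-ident-ref"],
                                 "class": attrs.get("its-ta-class-ref"),
                                 "text": ""}

        def handle_data(self, data):
            if self._current is not None:
                self._current["text"] += data

        def handle_endtag(self, tag):
            if tag == "span" and self._current is not None:
                self.annotations.append((self._current["text"],
                                         self._current["ident"],
                                         self._current["class"]))
                self._current = None

    html_fragment = ('<span its-ta-ident-ref="http://dbpedia.org/resource/London" '
                     'its-ta-class-ref="http://schema.org/Place">London</span> is the capital '
                     'of <span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom" '
                     'its-ta-class-ref="http://schema.org/Place">United Kingdom</span>.')

    reader = ITSTextAnalysisReader()
    reader.feed(html_fragment)
    for text, ident, cls in reader.annotations:
        print(f"{text!r} -> {ident} ({cls})")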

  18. Producing these annotations • Manual annotation • Automated NLP Techniques • Named entity extraction & disambiguation • Word sense disambiguation

  19. Use cases • Informing a human agent (e.g. a translator) that a certain fragment of text is subject to specific translation rules: • proper names • officially regulated translations. • Informing a software agent (e.g. a CMS) about the conceptual type of a textual entity in order to enable special processing or indexing

  20. Named entity disambiguation [Diagram relating Document, Mention, Label and Entity]

  21. Named entity disambiguation – behind the scenes • A difficult problem: • A name can refer to many entities, and an entity can have many names • Which interpretation is correct? • Humans are pretty good at this • We have prior knowledge of the ‘usual’ meanings • We can glean the meaning from the context • Things that are related appear together

  22. Named entity disambiguation – behind the scenes (2) • Prior knowledge: what is the most frequent meaning of ‘London’? • Context: someone using the word ‘London’ in the context of ‘Canada’ is likely to be referring to the other London, in Ontario

  23. Named entity disambiguation – behind the scenes (3) • Relational similarity: things connected in the knowledge graph tend to appear together
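
For illustration, a toy Python sketch of combining two of these signals, a prior over candidate entities plus context-word overlap; the candidate list, weights and scoring are invented for the example and are not Enrycher's actual algorithm (relational similarity over the knowledge graph is omitted here):

    # Toy disambiguation: weighted mix of prior probability and context overlap.
    CANDIDATES = {
        "London": [
            # (entity IRI, prior probability, context words associated with the entity)
            ("http://dbpedia.org/resource/London", 0.90,
             {"uk", "united", "kingdom", "thames", "england"}),
            ("http://dbpedia.org/resource/London,_Ontario", 0.05,
             {"canada", "ontario"}),
        ],
    }

    def disambiguate(mention, context_words, prior_weight=0.5):
        """Score each candidate entity and return the best (score, IRI) pair."""
        scored = []
        for iri, prior, related in CANDIDATES.get(mention, []):
            overlap = len(related & context_words) / max(len(related), 1)
            score = prior_weight * prior + (1 - prior_weight) * overlap
            scored.append((score, iri))
        return max(scored) if scored else None

    print(disambiguate("London", {"summer", "lovely"}))           # no context evidence: the prior wins
    print(disambiguate("London", {"canada", "ontario", "trip"}))  # context pulls towards London, Ontario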

  24. Building blocks of Enrycher • Token-level analysis • Sentence splitting • Tokenization • Lemmatization • Part-of-speech tagging • Entity-level analysis • Named entity extraction • Co-reference resolution • Anaphora resolution • Named entity disambiguation • Document-level analysis • Sentiment analysis • Topic classification • Keyword extraction (not used here)

  25. Using Enrycher • An HTTP service endpoint: send HTML5 in, get enriched HTML5 + ITS 2.0 out • Multilingual: supports English and Slovene • See http://enrycher.ijs.si/mlw/, or try it from the command line: $ curl -d "<p>Welcome to London</p>" http://enrycher.ijs.si/mlw/en/entityIdent.html5its2 <p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span></p>
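
The same call can be made from Python; a small sketch using the requests library, assuming the endpoint shown above accepts the HTML fragment as the raw request body (the exact Content-Type the service expects is not shown on the slide):

    # Send a small HTML fragment to the Enrycher endpoint and print the enriched result.
    import requests  # pip install requests

    ENDPOINT = "http://enrycher.ijs.si/mlw/en/entityIdent.html5its2"

    response = requests.post(ENDPOINT,
                             data="<p>Welcome to London</p>".encode("utf-8"),
                             timeout=30)
    response.raise_for_status()
    print(response.text)  # HTML5 enriched with its-ta-* annotations, as in the curl example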

  26. Enrycher Integrated in Okapi • The Okapi Framework is an open-source and cross-platform set of components designed to help build localization processes. • One of its components is a client of the Enrycher services. • Text Analysis annotations can be applied to any document in a format supported by the Okapi filters.

  27. One example of usage of the Enrycher Web services [Pipeline diagram showing an Input File processed by an Extraction Step, the Enrycher Step (calling the Enrycher Server), a Term Extraction Step, other steps and a Trans-Kit Creation Step, producing a Translation Kit (XLIFF) and Terms]

  28. Enrycher Step • Converts batches of segments (in Okapi’s internal format) into HTML paragraphs and sends them to the Enrycher service. • Converts the annotated paragraphs back into Okapi’s internal format. • Subsequent steps can then use the Text Analysis metadata, e.g. for XLIFF output, OmegaT comments, etc.
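
A rough sketch of that round trip in Python, not Okapi's actual API; it assumes the service accepts several paragraphs per request and returns them in the same order:

    # Wrap segments in <p> elements, send the batch, and map results back by position.
    import html
    import re
    import requests  # pip install requests

    ENDPOINT = "http://enrycher.ijs.si/mlw/en/entityIdent.html5its2"  # from the previous slide

    def annotate_segments(segments):
        batch = "".join(f"<p>{html.escape(seg)}</p>" for seg in segments)
        resp = requests.post(ENDPOINT, data=batch.encode("utf-8"), timeout=60)
        resp.raise_for_status()
        # Split the response back into paragraphs; segment order is assumed preserved.
        annotated = re.findall(r"<p>(.*?)</p>", resp.text, flags=re.DOTALL)
        return list(zip(segments, annotated))

    for original, enriched in annotate_segments(["Welcome to London", "See you in Paris"]):
        print(original, "->", enriched)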

  29. Term Extraction Step • The Term Extraction Step offers various simple ways to guess terms in source content. • One of its methods is to re-use content annotated with the Text Analysis metadata to feed the list of term candidates.
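
As a sketch of that method, the annotated spans can be collected, de-duplicated and counted to form a term-candidate list; the regex-based extraction below is illustrative Python, not the step's actual implementation:

    # Turn Text Analysis annotations into a ranked list of term candidates.
    import re
    from collections import Counter

    SPAN = re.compile(r'<span[^>]*its-ta-ident-ref="([^"]+)"[^>]*>(.*?)</span>', re.DOTALL)

    def term_candidates(annotated_html):
        counts = Counter()
        idents = {}
        for ident, text in SPAN.findall(annotated_html):
            term = text.strip()
            counts[term] += 1
            idents[term] = ident
        # Most frequent annotated fragments first, each with its entity identifier.
        return [(term, n, idents[term]) for term, n in counts.most_common()]

    sample = ('<p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/London" '
              'its-ta-class-ref="http://schema.org/Place">London</span>. '
              '<span its-ta-ident-ref="http://dbpedia.org/resource/London" '
              'its-ta-class-ref="http://schema.org/Place">London</span> is lovely.</p>')

    for term, count, ident in term_candidates(sample):
        print(term, count, ident)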

  30. Demonstration…

  31. Questions? • Enrycher:http://enrycher.ijs.si/ • Okapi Framework:http://okapi.opentag.com/ • ITS 2.0:http://www.w3.org/TR/its20/
