
MT For Low-Density Languages


Presentation Transcript


  1. MT For Low-Density Languages Ryan Georgi Ling 575 – MT Seminar Winter 2007

  2. What is “Low Density”?

  3. What is “Low Density”? • In NLP, languages are usually chosen for: • Economic Value • Ease of development • Funding (NSA, anyone?)

  4. What is “Low Density”? • As a result, NLP work until recently has focused on a rather small set of languages. • e.g. English, German, French, Japanese, Chinese

  5. What is “Low Density”? • “Density” refers to the availability of resources (primarily digital) for a given language. • Parallel text • Treebanks • Dictionaries • Chunked, semantically tagged, or other annotation

  6. What is “Low Density”? • “Density” not necessarily linked to speaker population • Our favorite example, Inuktitut

  7. So, why study LDL?

  8. So, why study LDL? • Preserving endangered languages • Spreading benefits of NLP to other populations • (Tegic has T9 for Azerbaijani now) • Benefits of wide typological coverage for cross-linguistic research • (?)

  9. Problem of LDL?

  10. Problem of LDL? • “The fundamental problem for annotation of lower-density languages is that they are lower density” – Maxwell & Hughes • The easiest (and often best) NLP development is done with statistical methods • Training requires lots of resources • Resources require lots of money • A cost/benefit chicken-and-egg problem

  11. What are our options? • Create corpora by hand • Very time-consuming (= expensive) • Requires trained native speakers • Digitize printed resources • Time-consuming • May require trained native speakers • e.g. for orthographies without Unicode support

  12. What are our options? • Traditional requirements are going to be difficult to satisfy, no matter how we slice it. • We therefore need to: • Maximize the information extracted from the resources we can get • Reduce the requirements for building a system

  13. Maximizing Information with IGT

  14. Maximizing Information with IGT • Interlinear Glossed Text • Traditional form of transcription for linguistic field researchers and grammarians • Example:
Rhoddodd yr athro lyfr i'r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
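
To make the three tiers concrete, here is a minimal sketch (not from the slides; the class and field names are my own, purely for illustration) of how an IGT instance might be represented in code:

# Minimal sketch of an IGT record: three parallel tiers.
# Class and field names are illustrative, not from the paper or slides.
from dataclasses import dataclass

@dataclass
class IGTInstance:
    lang: list    # source-language tokens
    gloss: list   # gloss tokens, aligned 1:1 with the source tokens by position
    trans: list   # free-translation tokens (usually in a high-density language)

welsh = IGTInstance(
    lang=["Rhoddodd", "yr", "athro", "lyfr", "i'r", "bachgen", "ddoe"],
    gloss=["gave-3sg", "the", "teacher", "book", "to-the", "boy", "yesterday"],
    trans="The teacher gave a book to the boy yesterday".split(),
)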

  15. Benefits of IGT • As IGT is frequently used in fieldwork, it is often available for low-density languages • IGT provides information about syntax and morphology • The translation line is usually in a high-density language that we can use as a pivot language.

  16. Drawbacks of IGT • Data can be ‘abnormal’ in a number of ways • Usually quite short • May be used by grammarians to illustrate fringe usages • Often uses a purposely limited vocabulary • Still, in working with LDL it might be all we’ve got

  17. Utilizing IGT • First, a big nod to Fei (this is her paper!) • As we saw in HW#2, word alignment is hard. • IGT, however, often gets us halfway there!

  18. Utilizing IGT • Take the previous example:
Rhoddodd yr athro lyfr i'r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”

  19.–22. Utilizing IGT • (These slides repeat the example from slide 18, highlighting the correspondences between the gloss line and the translation one pair at a time: gave ↔ Rhoddodd, the teacher ↔ yr athro, book ↔ lyfr, to the ↔ i'r, boy ↔ bachgen, yesterday ↔ ddoe)

  23. Utilizing IGT • Take the previous example:
Rhoddodd yr athro lyfr i'r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
• The interlinear already aligns the source with the gloss • Often, the gloss uses words found in the translation already
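
A hedged sketch of that idea (my own illustration, not the procedure from Fei's paper): split each gloss entry into morphemes and link the corresponding source word to any translation word it matches exactly.

# Sketch: seed a word alignment from the gloss tier by exact string match.
# Illustration only; the actual method in the cited work is more involved.
def align_by_gloss(igt):
    """Return (source_index, translation_index) pairs wherever a gloss
    morpheme matches a translation word (case-insensitive)."""
    trans_lower = [t.lower() for t in igt.trans]
    links = []
    for i, gloss_entry in enumerate(igt.gloss):
        # gloss entries like "gave-3sg" or "to-the" bundle several morphemes
        for morpheme in gloss_entry.lower().split("-"):
            for j, trans_word in enumerate(trans_lower):
                if morpheme == trans_word:
                    links.append((i, j))
    return links

# With the Welsh example above, this links gave, the, teacher, book,
# to, boy and yesterday to their source positions.
print(align_by_gloss(welsh))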

  24. Utilizing IGT • Alignment isn’t always this easy…
xaraju mina lgurfati wa nah.nu nadxulu
xaraj-u: mina ?al-gurfat-i wa nah.nu na-dxulu
exited-3MPL from DEF-room-GEN and we 1PL-enter
'They left the room as we were entering it'
(Source: Modern Arabic: Structures, Functions, and Varieties; Clive Holes)

  25. Utilizing IGT • Alignment isn’t always this easy… (same Arabic example as slide 24) • We can get a little more by stemming…

  26. Utilizing IGT • Alignment isn’t always this easy… (same Arabic example as slide 24) • We can get a little more by stemming… • …but we’re going to need more.
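
As a rough illustration of the stemming point (again my own sketch, using NLTK's Porter stemmer purely as an example), matching stems instead of surface forms lets gloss "enter" line up with translation "entering":

# Sketch: the same gloss-matching idea, but comparing stems so that
# e.g. gloss "enter" (from "1PL-enter") matches translation "entering".
from nltk.stem import PorterStemmer  # any stemmer would do

_stem = PorterStemmer().stem

def align_by_gloss_stemmed(igt):
    trans_stems = [_stem(t.lower()) for t in igt.trans]
    links = []
    for i, gloss_entry in enumerate(igt.gloss):
        for morpheme in gloss_entry.lower().split("-"):
            for j, t_stem in enumerate(trans_stems):
                if _stem(morpheme) == t_stem:
                    links.append((i, j))
    return links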

  27. Utilizing IGT • Thankfully, with an English translation, we already have tools to get phrase and dependency structures that we can project • (Figure from Will & Fei’s NAACL 2007 paper)

  28. Utilizing IGT • (Same text as slide 27, with a second projection figure from Will & Fei’s NAACL 2007 paper)
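
The projection step can be sketched roughly as follows (my own simplification, not the algorithm from the NAACL 2007 paper): given a dependency parse of the English translation and the word alignment, attach each aligned source word to the source word that is aligned to its English word's head.

# Rough sketch of dependency projection across a word alignment.
# Simplification for illustration: one link per word, no handling of
# unaligned or many-to-one cases, which real systems must deal with.
def project_dependencies(alignment, eng_heads, n_src):
    """alignment: (src_idx, eng_idx) pairs
       eng_heads: eng_heads[j] = index of English word j's head, -1 for root
       returns src_heads, with None where nothing could be projected"""
    eng_for_src = dict(alignment)                   # source word -> English word
    src_for_eng = {e: s for s, e in alignment}      # English word -> source word
    src_heads = [None] * n_src
    for s, e in eng_for_src.items():
        h = eng_heads[e]
        if h == -1:
            src_heads[s] = -1                       # aligned to the English root
        elif h in src_for_eng:
            src_heads[s] = src_for_eng[h]           # head projects across the alignment
    return src_heads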

  29. Utilizing IGT • What can we get from this? • Automatically generated CFGs • Can infer word order from these CFGs • Can infer possible constituents • …suggestions? • From a small amount of data, this is a lot of information, but what about…
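
For instance (a sketch under the assumption that projection has already given us labelled constituent trees on the source side), reading off and counting productions is straightforward, and the counts directly expose word-order facts such as whether VP expands verb-first:

# Sketch: read CFG productions off a constituent tree and count them.
# Trees are (label, children) tuples; a preterminal's children is [word].
from collections import Counter

def productions(tree, counts=None):
    counts = Counter() if counts is None else counts
    label, children = tree
    if children and isinstance(children[0], tuple):   # internal node
        counts[(label, tuple(child[0] for child in children))] += 1
        for child in children:
            productions(child, counts)
    return counts

tree = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]),
              ("VP", [("V", ["fell"]), ("ADJ", ["asleep"])])])
print(productions(tree))   # counts for S -> NP VP, NP -> DT NN, VP -> V ADJ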

  30. Reducing Data Requirements with Prototyping

  31. Grammar Induction • So, we have a way to get production rules from a small amount of data. • Is this enough? • Probably not. • CFGs aren’t known for their robustness • How about using what we have as a bootstrap?

  32. Grammar Induction • Given unannotated text, we can derive PCFGs • Without annotation, though, we just have unlabelled trees • (Figure: an induced parse of “the dog fell asleep” with anonymous nonterminals ROOT, C2, X0, X1, Y2, Z3, N4 and rule probabilities p=0.02, p=0.45e-4, p=0.003, p=0.09, p=5.3e-2) • Such an unlabelled parse doesn’t give us S -> NP VP, though.

  33. Grammar Induction • Can we get labeled trees without annotated text? • Haghighi & Klein (2006) • Propose a way in which production rules can be passed to a PCFG induction algorithm as “prototypical” constituents • Think of these prototypes as a rubric that could be given to a human annotator • e.g. for English, NP -> DT NN

  34. Grammar Induction • Let’s take the possible constituent DT NN • We could tell our PCFG algorithm to apply this as a constituent everywhere it occurs • But what about DT NN NN (“the train station”)? • We would like to catch this as well
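
A toy version of the prototype idea (the particular sequences below are my own examples, not Haghighi & Klein's actual prototype list) makes the problem on this slide visible: exact matching recognizes DT NN but says nothing about DT NN NN.

# Toy prototype lookup: a few canonical POS sequences per phrase label.
# Exact matching alone misses near-miss spans such as DT NN NN.
PROTOTYPES = {
    "NP": [("DT", "NN"), ("DT", "JJ", "NN"), ("PRP",)],
    "VP": [("VBD", "DT", "NN"), ("VB",)],
}

def prototype_label(pos_span):
    """Return the phrase label whose prototypes contain this exact POS
    sequence, or None if no prototype matches."""
    span = tuple(pos_span)
    for label, sequences in PROTOTYPES.items():
        if span in sequences:
            return label
    return None

print(prototype_label(["DT", "NN"]))        # "NP"
print(prototype_label(["DT", "NN", "NN"]))  # None -- "the train station" is missed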

  35. Grammar Induction • H&K’s solution? • Distributional clustering • “a similarity measure between two items on the basis of their immediate left and right contexts” • …to be honest, I lose them in the math here. • Importantly, however, weighting the probability of a constituent with the right measure improves the unsupervised baseline from an f-measure of 35.3 to 62.2
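
The intuition behind that similarity measure can be sketched like this (my own illustration, not Haghighi & Klein's actual formulation): describe each candidate span by the distribution of words immediately to its left and right, and compare those distributions, e.g. with cosine similarity, so that spans whose contexts resemble a prototype's contexts can be given extra weight.

# Sketch: compare candidate spans by their immediate left/right contexts.
# Illustration of the idea only; H&K's actual measure differs.
import math
from collections import Counter

def context_counts(sentences, span):
    """Count (left word, right word) contexts of a token sequence."""
    span = tuple(span)
    counts = Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - len(span)):
            if tuple(padded[i:i + len(span)]) == span:
                counts[(padded[i - 1], padded[i + len(span)])] += 1
    return counts

def cosine(c1, c2):
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0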

  36. So… what now?

  37. Next Steps • By extracting production rules from a very small amount of data using IGT and using Haghighi & Klein’s unsupervised methods, it may be possible to bootstrap an effective language model from very little data!

  38. Next Steps • Possible applications: • Automatic generation of language resources • (While a system with the same goals would only compound error, automatically annotated data could be easier for a human to correct than to generate by hand) • Assist linguists in the field • (Better model performance could imply better grammar coverage) • …you tell me!
