Lost Language Decipherment Kovid Kapoor - 08005037 Aashimi Bhatia – 08D04008 Ravinder Singh – 08005018 Shaunak Chhaparia – 07005019
Outline • Examples of ancient languages which were lost • Motivation: Why should we bother about such languages? • The Manual Process of Decipherment • Motivation for a Computational Model • A Statistical Method for Decipherment • Conclusions
What is a "lost" language • A language is said to be “lost” when modern scholars cannot reconstruct text written in it. • Slightly different from a “dead” language – a language which people can translate to/from, but noone uses it anymore in everyday life. • Generally happens when one language gets replaced by another. • For eg, native American languages were replaced by English, Spanish etc.
Examples of Lost Languages • Egyptian Hieroglyphs • A formal writing system used by ancient Egyptians, consisting of logographic and alphabetic symbols. • Finally deciphered in the early 19th century, following the lucky discovery of the Rosetta Stone. • Ugaritic Language • Tablets with engravings found in the lost city of Ugarit, Syria. • Researchers recognized that it was related to Hebrew, and could identify some parallel words.
Examples of Lost Languages (cont.) • Indus Script • Written in and around present-day Pakistan around 2500 BC • Over 4000 samples of the text have been found. • Still not deciphered successfully! • What makes it difficult to decipher? http://en.wikipedia.org/wiki/File:Indus_seal_impression.jpg
Motivation for Decipherment of Lost Languages • Historical knowledge expansion • Very helpful in learning about the history of the place where the language was written. • Alternative sources of information: coins, drawings, buried tombs. • These sources are not as precise as reading the literature of the region, which gives a much clearer picture. • Learning about the past explains the present • A lot of the culture of a place is derived from ancient cultures. • Boosts our understanding of our own culture.
Motivation for Decipherment of Lost Languages (cont.) • From a linguistic point of view • We can figure out how certain languages developed over time. • The origin of some words can be explained.
The Manual Process • Similar to a cryptographic decryption process • Frequency-analysis-based techniques are used • First step: identify the writing system • Logographic, alphabetic or syllabaries? • Usually determined by the number of distinct symbols (a toy first-pass check is sketched below). • Identify if there is a closely related known language • Hope for finding bitexts: translations of a text in the lost language into a known language, like Latin, Hebrew, etc. http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
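As an illustration of the first step above, here is a minimal sketch (not from the slides) of counting distinct signs in a transcribed corpus to guess the writing system; the function name and thresholds are illustrative assumptions only.

```python
from collections import Counter

def guess_writing_system(sign_occurrences):
    """Guess the writing system from the number of distinct signs.

    sign_occurrences: an iterable with one entry per sign occurrence in the corpus.
    The thresholds below are rough, illustrative rules of thumb.
    """
    counts = Counter(sign_occurrences)
    n_distinct = len(counts)
    if n_distinct < 40:        # alphabet-sized inventories (Ugaritic has 30 signs)
        kind = "alphabetic"
    elif n_distinct < 100:     # syllabaries typically have a few dozen signs
        kind = "syllabary"
    else:                      # hundreds or thousands of signs
        kind = "logographic"
    return kind, counts.most_common(10)
```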
Examples of Manual Decipherment : Egyptian Hieroglyphs • Earliest attempt made by Horapollo in the 5th century. • However, the explanations were mostly wrong! • They proved to be an impediment to the process for 1000 years! • Arab historians were able to partly decipher the script in the 9th and 10th centuries. • Major breakthrough: discovery of the Rosetta Stone by Napoleon's troops. http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
Examples of Manual Decipherment : Egyptian Hieroglyphs • The stone carries a decree issued by the king in three scripts: hieroglyphic, demotic, and ancient Greek! • Finally deciphered in 1822 by Jean-François Champollion. • Note that even with the availability of a bitext, full decipherment took more than 20 years! http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Rosetta_Stone_BW.jpeg/200px-Rosetta_Stone_BW.jpeg
Examples of Manual Decipherment : Ugaritic • The inscribed words consisted of only 30 distinct symbols. • Very likely to be alphabetic. • The location where the tablets were found suggested that the language is closely related to the Semitic languages • Some words in Ugaritic had the same origin as words in Hebrew • For example, the Ugaritic word for king has the same root as the Hebrew word. http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
Examples of Manual Decipherment : Ugaritic (cont.) • Lucky discovery: Hans Bauer guessed that the writing on an axe that had been found was the word “axe”! • This led to the revision of some earlier hypotheses, and resulted in the decipherment of the entire script! http://knp.prs.heacademy.ac.uk/images/cuneiformrevealed/scripts/ugaritic.jpg
Conclusions on the Manual Process • A very time-consuming exercise; years, even centuries, were taken for successful decipherments. • Even when some basic information about the language has been learnt, such as the syntactic structure or a closely related language, a long time is still required to produce character and word mappings.
Need for a Computerised Model • Once some knowledge about the language has been learnt, is it possible to use a program to produce word mappings? • Can the knowledge of a closely related language be used to decipher a lost language? • If possible, this would save a lot of effort and time. • "Successful archaeological decipherment has turned out to require a synthesis of logic and intuition…that computers do not (and presumably cannot) possess." – Andrew Robinson
Recent attempts : A Statistical Model • Notice that manual efforts have some guiding principles • A common starting point is to compare letter and word frequencies with a known language • Morphological analysis plays a crucial role as well • Highly frequent morpheme correspondences can be particularly revealing. • The model tries to capture these letter- and word-level mappings and morpheme correspondences. http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Problem Formulation • We are given a corpus in the lost language, and a non-parallel corpus in a related language from the same family. • Our primary goals: • Finding the mapping between the alphabets of the lost and the known language. • Translating words in the lost language into corresponding cognates of the known language http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Problem Formulation • We make several assumptions in this model: • That the writing system is alphabetic in nature • Can be easily verified by counting the number of distinct symbols in the recovered texts. • That the corpus has been transcribed into an electronic format • Means that each character is uniquely identified. • About the morphology of the language: • Each word consists of a stem, prefix and suffix, where the latter two may be omitted • Holds true for a large variety of human languages
Problem Formulation • The morpheme and word inventories of the known language, along with their frequencies, are given. • In essence, the input consists of two parts (illustrative data structures are sketched below): • A list of unanalyzed words in the lost language • A morphologically analyzed lexicon in a known related language
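To make the two inputs concrete, here is an illustrative sketch of how they could be represented; the class and field names are assumptions made for exposition, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class AnalyzedWord:
    """A morphologically analyzed word of the known language (e.g. Hebrew)."""
    surface: str   # full word form
    prefix: str    # may be empty
    stem: str
    suffix: str    # may be empty
    stem_pos: str  # part of speech of the stem

# Input 1: a list of unanalyzed word types in the lost language.
lost_words = ["15234", "1525", "4352"]

# Input 2: a morphologically analyzed lexicon of the known related language.
known_lexicon = [
    AnalyzedWord("asked", "", "ask", "ed", "VERB"),
    AnalyzedWord("desks", "", "desk", "s", "NOUN"),
]
```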
Intuition : A toy example • Consider the following example, consisting of words in a lost language closely related to English, but written using numerals. • 15234 – asked • 1525 – asks • 4352 – desk • Notice the pair of endings, -34 and -5, with the same initial sequence 152- • Might correspond to -ed and -s respectively. • Thus, 3=e, 4=d and 5=s
Intuition : A toy example • Now, we can say that 435=des, and using our knowledge of English, we can suppose that this word is very likely to be desk. • As this example illustrates, we proceed by discovering both character- and morpheme-level mappings (a toy sketch of this process follows). • Another intuition the model should capture is the sparsity of the mapping. • The correct mapping will preserve phonetic relations between the two related languages • Each character in the unknown language will map to a small number of characters in the related language.
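A toy sketch of the suffix-matching reasoning above, purely for illustration; the actual model treats these correspondences probabilistically rather than deterministically.

```python
# Two lost words share the initial sequence 152- with different endings,
# mirroring the English pair asked/asks, so hypothesize -34 ~ -ed and -5 ~ -s.
lost_words = ["15234", "1525", "4352"]
suffix_pairs = {"34": "ed", "5": "s"}

# Aligning the hypothesized suffixes character by character gives 3=e, 4=d, 5=s.
char_map = {}
for lost_suffix, known_suffix in suffix_pairs.items():
    for lost_char, known_char in zip(lost_suffix, known_suffix):
        char_map[lost_char] = known_char

print(char_map)  # {'3': 'e', '4': 'd', '5': 's'}

# Partially decoding 4352 with the mappings learned so far yields "des?",
# which our knowledge of English suggests is "desk", so 2=k.
print("".join(char_map.get(c, "?") for c in "4352"))  # des?
```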
Model Structure • We assume that each morpheme in the lost language is probabilistically generated jointly with a latent counterpart in the known language • The challenge: each level of correspondence can completely describe the observed data, so using a mechanism based on one leaves no room for the other. • The solution: using a Dirichlet process to model the probabilities (explained below). http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Model Structure (cont…) • There are four basic layers in the generative process • Structural Sparsity • Character-edit Distribution • Morpheme-pair Distributions • Word Generation
Model Structure (cont…) Graphical overview of the Model http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Step 1 : Structural Sparsity • We need a control on the sparsity of the edit-operation probabilities, encoding the linguistic intuition that the character-level mapping should be sparse. • The set of edit operations includes character substitutions, insertions and deletions. We assign a variable λe corresponding to every edit operation e. • The set of character correspondences with the variable set to 1, { (u,h) : λ(u,h) = 1 }, conveys a set of phonetically valid correspondences. • We define a joint prior over these variables to encourage sparse character mappings.
Step 1 : Structural Sparsity (cont.) • This prior can be viewed as a distribution over binary matrices and is defined to encourage every row and column to sum to low integer values (typically 1) • For a given matrix, define a count c(u), which is the number of corresponding letters that u has in that matrix. Formally, c(u) = ∑h λ(u,h) • We now define a function fi = max(0, |{u : c(u) = i}| - bi). For any i other than 1, fi should be as low as possible. • Now the probability of this matrix is given by P(λ) = (1/Z) exp( ∑i wi fi )
Step 1 : Structural Sparsity (cont…) • Here Z is the normalization factor and w is the weight vector. • wi is either zero or negative, to ensure that the probability is high for a low value of f. • The values of bi and wi can be adjusted depending on the number of characters in the lost language and the related language.
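A minimal sketch of this sparsity prior, assuming the definitions of c(u), fi, bi and wi above; the normalization factor Z is omitted and the example values of b and w are made up.

```python
def sparsity_log_prior(lam, b, w):
    """Unnormalized log-probability  sum_i w_i * f_i  of a binary matrix lam.

    lam: dict mapping (u, h) -> 0/1, where u is a lost-language character and
         h is a known-language character.
    b, w: dicts keyed by i = number of correspondences per character;
          w[i] is zero or negative, b[i] is the allowed slack.
    """
    # c(u) = number of known-language characters that u corresponds to
    c = {}
    for (u, h), val in lam.items():
        c[u] = c.get(u, 0) + val
    log_p = 0.0
    for i in w:
        rows_with_i = sum(1 for u in c if c[u] == i)
        f_i = max(0, rows_with_i - b.get(i, 0))
        log_p += w[i] * f_i      # extra rows with i correspondences lower the score
    return log_p

# Example: tolerate up to 30 one-to-one rows, penalize any row with two mappings.
lam = {("A", "x"): 1, ("A", "y"): 1, ("B", "x"): 1}   # "A" maps to two characters
print(sparsity_log_prior(lam, b={1: 30, 2: 0}, w={1: 0.0, 2: -5.0}))  # -5.0
```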
Step 2 : Character-Edit Distribution • We now draw a base distribution G0 over character edit sequences. • The probability of a given edit sequence P(e) depends on the values of the indicator variables λe of the individual edit operations, and on a function of the number of insertions and deletions in the sequence, q(#ins(e), #del(e)). • The factor q is chosen according to the average word lengths of the lost language and the related language.
Step 2 : Character-Edit Distribution (cont.) Example: an average Ugaritic word is 2 letters longer than an average Hebrew word. Therefore, we set q so as to disallow any deletions and allow at most one insertion per sequence, with a probability of 0.4 • The part depending on the λe's makes the distribution spike at 0 if the indicator is 0 and leaves it unconstrained otherwise (a spike-and-slab prior); a rough sketch of this scoring follows.
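A rough sketch of how an edit sequence could be scored under this base distribution; the probability 0.6 for sequences with no insertion and the operation encoding are assumptions added for illustration.

```python
def q(num_ins, num_del):
    """Illustrative length factor: no deletions, at most one insertion (prob. 0.4).
    The 0.6 for zero insertions is an assumed complement, not from the slides."""
    if num_del > 0 or num_ins > 1:
        return 0.0
    return 0.4 if num_ins == 1 else 0.6

def edit_sequence_score(edits, lam):
    """Score an edit sequence under the base distribution sketched above.

    edits: list of operations, e.g. ("sub", "u_char", "h_char") or ("ins", "h_char").
    lam: dict mapping an edit operation to its 0/1 structural indicator;
         a single disallowed operation drives the whole score to zero (the spike).
    """
    if any(lam.get(e, 0) == 0 for e in edits):
        return 0.0
    num_ins = sum(1 for e in edits if e[0] == "ins")
    num_del = sum(1 for e in edits if e[0] == "del")
    return q(num_ins, num_del)
```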
Step 3 : Morpheme-Pair Distributions • The base distribution G0, along with a fixed parameter α, defines a Dirichlet process, which provides a probability distribution over morpheme-pair distributions. • The resulting distributions are likely to be skewed in favor of a few frequently occurring morpheme pairs, while remaining sensitive to the character-level probabilities of the base distribution. • Our model distinguishes between the 3 kinds of morphemes: prefixes, stems and suffixes. We therefore use different values of α for each.
Step 3 : Morpheme-Pair Distributions (cont.) • Also, since the suffix and prefix depend on the part of speech of the stem, we draw a single distribution Gstm for stems, but maintain separate distributions Gsuf|stm and Gpre|stm for each possible stem part of speech (a minimal sketch of drawing from such a Dirichlet process follows).
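One common way to realize draws from such a Dirichlet process is the Chinese restaurant process; the sketch below is a generic CRP sampler, not code from the paper, with α and the base draw left as placeholders.

```python
import random

def draw_morpheme_pair(counts, alpha, draw_from_g0):
    """Draw one morpheme pair from a Dirichlet process, CRP-style.

    counts: dict mapping a morpheme pair -> how often it has been drawn so far.
    alpha: the DP concentration parameter (a different value per morpheme type).
    draw_from_g0: function sampling a fresh pair from the character-level base G0.
    """
    total = sum(counts.values())
    r = random.uniform(0, total + alpha)
    for pair, n in counts.items():
        r -= n
        if r <= 0:
            return pair            # rich-get-richer: frequent pairs recur
    return draw_from_g0()          # otherwise a new pair from the base distribution
```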
Step 4 : Word Generation • Once the morpheme-pair distributions have been drawn, actual word pairs may now be generated. • Based on some prior, we first decide whether a word in the lost language has a cognate in the known language. • If it does, then a cognate word pair (u, h) is produced from the morpheme-pair distributions. • Otherwise, a lone word u is generated. • A rough sketch of this generative story is given below.
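A rough, simplified sketch of the generative story; p_cognate and the sampling helpers are placeholders, and the conditioning of prefix/suffix draws on the stem's part of speech is only noted in a comment.

```python
import random

def generate_word(p_cognate, draw_stem_pair, draw_prefix_pair, draw_suffix_pair,
                  draw_lone_word):
    """Generate one lost-language word, possibly paired with a known-language cognate."""
    if random.random() < p_cognate:
        u_pre, h_pre = draw_prefix_pair()   # in the full model, prefix/suffix draws
        u_stm, h_stm = draw_stem_pair()     # are conditioned on the stem's part of speech
        u_suf, h_suf = draw_suffix_pair()
        return u_pre + u_stm + u_suf, h_pre + h_stm + h_suf
    return draw_lone_word(), None           # no cognate in the known language
```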
Summarizing the Model • This model captures both character and lexical level correspondences, while utilizing morphological knowledge of the known language. • An additional feature of this multi-layered model structure is that each distribution over morpheme pairs is derived from the single character-level base distribution G0. • As a result, any character-level mappings learned from one correspondence will be propagated to other morpheme distributions. • Also, the character-level mappings obey sparsity constraints
Results of the process • Applied to the Ugaritic language • The undeciphered corpus contains 7,386 unique word types. • The Hebrew Bible was used as the known-language corpus; Hebrew is closely related to ancient Ugaritic. • Morphological and POS annotations are assumed to be available for the Hebrew lexicon. http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Results of the process • The method identifies Hebrew cognates for 2,155 words, covering almost one-third of the Ugaritic vocabulary. • The baseline method correctly maps 22 out of 30 characters to their Hebrew counterparts, and translates only 29% of all the cognates. • This method correctly translates 60.4% of all cognates. • This method yields correct mappings for 29 out of 30 characters.
Future Work • Even with character mappings, many words can be correctly translated only by examining their context. • The model currently does not take contextual information into account. http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Conclusions • We saw how language decipherment is an extremely complex task. • Years of effort are required for the successful decipherment of each lost language. • Success depends on the amount of corpus available in the unknown language. • But availability alone does not make it easy. • The statistical model has shown promise. • It can be developed further and used for more languages.
References • Wikipedia article on Decipherment of Hieroglyphs http://en.wikipedia.org/wiki/Decipherment_of_hieroglyphic_writing • Lost Languages: The Enigma of the World's Undeciphered Scripts, by Andrew Robinson (2009) http://entertainment.timesonline.co.uk/tol/arts_and_entertainment/books/non-fiction/article5859173.ece • A Statistical Model for Lost Language Decipherment, Benjamin Snyder, Regina Barzilay, and Kevin Knight, ACL (2010) (http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf)
References • A staff talk from Straight Dope Science Advisory Board – How come we can’t decipher the Indus Script? (2005) http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script • Wade Davis on Endangered Cultures (2008) http://www.ted.com/talks/wade_davis_on_endangered_cultures.html