
Machine Translation Breaking the Communication Barrier


Presentation Transcript


  1. Machine Translation: Breaking the Communication Barrier. By Dr. Gurpreet S. Josan, Punjabi University, Patiala

  2. Communication • Communication is the activity of conveying meaningful information. • Communication requires a sender, a message, and an intended recipient. • The communication process is complete once the receiver has understood the sender.

  3. Modes of Communication • Nonverbal communication: gesture, body language or posture, facial expression • Visual communication: signs, typography, drawing, colours, etc. • Oral communication: spoken verbal communication • Written communication: alphabets, symbols, grammar, etc.

  4. Barrier in Communication

  5. The Effect • Language is a barrier to information dissemination. • All the major sources of information and discoveries are in English. • We are unable to reach the masses in rural areas who do not know English. • You are a scientist who has just hit upon a revolutionary new idea. • How do you find out whether a scientist anywhere in the world has already filed a patent on a similar idea in their native language?

  6. Who Can Help? A translator, either manual or machine: • Manual: too slow, limited availability, costly, but accurate • Machine: fast, economical, but not accurate

  7. Issues • kJfmmfjmmmvvv nnnffn333 • Ujihealeeleeemnstervensicredur • Baboioicestnitze • Coovoel2^ ekk; ldsllklkdfvnnjfj? • Fgmflmllkmlfmkfrexnnn!

  8. Issues • Computers lack knowledge! • Computers “see” text in English the same way you have just seen the previous text! • People have no trouble understanding language: they have common sense knowledge, reasoning capacity and experience. • Computers have no common sense knowledge and no reasoning capacity.

  9. Issues • Which one is going to be more difficult for computers to deal with: grammar or lexicon? • Grammar (rules for putting words together into sentences): How many rules are there? 100, 1000, 10000, more … Do we have all the rules written down somewhere? • Lexicon (dictionary): How many words do we need to know? 1000, 10000, 100000 …

  10. Issues • “The dog ate my homework”: who did what to whom? • 1. Identify the parts of speech (POS): dog = noun; ate = verb; homework = noun. English POS tagging is about 95% accurate. Try to tag this text manually: “I can, can the can.” • 2. Identify collocations: mother-in-law, hot dog
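
  A rough illustration of automatic POS tagging (not on the original slide): the Python sketch below runs NLTK's off-the-shelf English tagger on the two example sentences. The use of NLTK, its default perceptron tagger and the resource downloads are assumptions made only for demonstration.

    # A minimal POS-tagging sketch using NLTK's default English tagger.
    import nltk

    nltk.download("punkt", quiet=True)                       # tokenizer model
    nltk.download("averaged_perceptron_tagger", quiet=True)  # default tagger

    for sentence in ["The dog ate my homework.", "I can, can the can."]:
        tokens = nltk.word_tokenize(sentence)
        print(nltk.pos_tag(tokens))
    # The three occurrences of "can" (modal verb, main verb, noun) show why
    # even a short sentence needs context, not just a word list, to tag.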

  11. Issues • Seemingly similar sentences may differ radically in meaning: • “The CEO was fired up about his new role.” • “The CEO was fired from his new role.” • Seemingly different sentences can have the same meaning: • “IBM’s PC division was acquired by Lenovo.” • “Lenovo bought the PC division of IBM.”

  12. Issues • Ambiguity • Structural ambiguity • I saw the man with the telescope • Word level ambiguity

  13. Ambiguity • Various meanings of the word ਵੱਟ in Punjabi • ਖੇਤ ਦੀ ਵੱਟ • ਢਿੱਡ ਵਿੱਚ ਵੱਟ • ਕੱਪੜੇ ਨੂੰ ਵੱਟ • ਮੱਥੇ ਤੇ ਵੱਟ • ਵੱਟੋ ਵੱਟ • ਰੱਸੀ ਨੂੰ ਵੱਟ • ਪੈਸੇ ਵੱਟ ਲੈਣੇ • ਵੱਟ ਕੇ ਚਪੇੜ ਮਾਰਨੀ • ਵੱਟ ਖਾ ਕੇ ਗਿਰਨਾ • ਪੱਗ ਦੇ ਵੱਟ

  14. Ambiguity • If more than one ambiguous word is present in a sentence, the number of potential interpretations of the sentence “explodes”: it is the product of the numbers of possible meanings of the words. • Consider the sentence ਸੁਖਬੀਰ ਨੇ ਵੱਟ ਪੱਟ ਦਿੱਤੀ। and assume that only ਵੱਟ {vaṭṭ} and ਪੱਟ {paṭṭ} are ambiguous in this sentence, and that they both have 4 senses. • This brings the number of possible interpretations to 16.

  15. Ambiguity • ਸੁਖਬੀਰ ਨੇ ਵੱਟ ਪੱਟ ਦਿੱਤੀ: ਸੁਖਬੀਰ (Sukhbir) ਨੇ (has) ਵੱਟ {twine / crease / footpath / sultriness} ਪੱਟ {thigh / destroy / door-leaf / silk} ਦਿੱਤੀ. • Imagine what happens if there are more senses to be taken into account or if the sentence gets longer.
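
  A small sketch (an addition, using the sense glosses listed above) of how the interpretation count multiplies:

    # Enumerating candidate readings of the ambiguous sentence: the count is
    # the product of the sense counts (4 x 4 = 16 here).
    from itertools import product

    vatt_senses = ["twine", "crease", "footpath", "sultriness"]   # ਵੱਟ
    patt_senses = ["thigh", "destroy", "door-leaf", "silk"]       # ਪੱਟ

    readings = list(product(vatt_senses, patt_senses))
    print(len(readings))                      # 16 possible interpretations
    for vatt, patt in readings[:3]:
        print("vatt =", vatt, "| patt =", patt)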

  16. Issues • Anaphora resolution: “The dog ate the bird and it died.” • Gender conversion • Idioms & phrases: ਚੋਰ ਦੀ ਦਾੜ੍ਹੀ ਵਿੱਚ ਤਿਣਕਾ। (literally, “a straw in the thief’s beard”)

  17. Issues • Named Entity Recognition: ਡਾ. ਬੂਟਾ ਸਿੰਘ as Dr. Plant Singh vs Dr. Buta Singh • Foreign words: ਕੋਕਾ ਕੋਲਾ as नथ कोयला vs कोका कोला • Spelling variation: ਮਾਇਕਰੋਸਾਫਟ, ਮਾਇਕ੍ਰੋਸਾਫਟ, ਮਇਕ੍ਰੋਸੋਫਟ, etc.

  18. Issues • Rhyming reduplication: ਰੋਟੀ-ਸ਼ੋਟੀ, ਪਾਣੀ-ਧਾਣੀ • Other issues: in Indian languages there is no single fixed font or encoding. For example, the word ਸ਼੍ਰੀ can be keyed with its constituent code points (ਸ, ਼, ੍, ਰ, ੀ) in several different orders that all render alike; similarly ਹ + ੈ + ੈ + ੈ renders as ਹੈ, and both ਸ + ੋ + ੌ and ਸ + ੌ + ੋ render as ਸੌ.
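
  To make the point concrete, here is a small sketch (an addition, assuming the standard Unicode code points for ਸ਼੍ਰੀ): two keying orders of the same visible word compare as different strings.

    # Two orderings of the marks in ਸ਼੍ਰੀ: SA, NUKTA, VIRAMA, RA, VOWEL SIGN II
    # versus SA, VIRAMA, RA, NUKTA, VOWEL SIGN II. Many renderers draw them
    # alike, but as code-point sequences they differ, which breaks matching.
    s1 = "\u0A38\u0A3C\u0A4D\u0A30\u0A40"
    s2 = "\u0A38\u0A4D\u0A30\u0A3C\u0A40"

    print(s1 == s2)                      # False
    print([hex(ord(c)) for c in s1])
    print([hex(ord(c)) for c in s2])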

  19. Three MT Approaches: Direct, Transfer, Interlingual. [Vauquois-style diagram: the source text undergoes morphological analysis to word structure (direct transfer to the target word structure), syntactic analysis to syntactic structure (syntactic transfer), or semantic analysis to semantic structure (semantic transfer), up to a language-independent interlingua reached by semantic composition and left by semantic decomposition; generation then proceeds down through the corresponding stages to the target text.]

  20. Direct MT: Pros and Cons • Pros • Fast • Simple • Inexpensive • Robust • No translation rules hidden in lexicon • Cons • Unreliable • Not powerful • Rule proliferation • Requires too much context • Major restructuring after lexical substitution

  21. Transfer MT: Pros and Cons • Pros • Don’t need to find language-neutral representation • Relatively fast • Cons • Large no. of transfer rules: Difficult to extend • Proliferation of language-specific rules in lexicon and syntax

  22. Interlingual MT: Pros and Cons • Pros • Portable • Lexical rules and structural transformations stated more simply on normalized representation • Explanatory Adequacy • Cons • Difficult to deal with terms. • Deciding what should be added is difficult. • What will be the universal knowledge format? How do we encode? • Must decompose and reassemble concepts

  23. Current Techniques • Corpus-based approaches • Statistics-based Machine Translation (SMT): • Every target-language string ζ is a possible translation of the source string ε. • Every such string is assigned a number, its probability. • We select the string with the maximum probability: ζ* = argmax over ζ of [Pr(ζ) Pr(ε | ζ)], where ε is the source-language string and ζ is the target-language string. • Estimating Pr(ζ), estimating Pr(ε | ζ), and finding the maximizing ζ are known as the language modeling problem, the translation modeling problem, and the search problem respectively.
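
  A toy illustration of this argmax (an addition; the candidate strings and probabilities below are invented placeholders, not real model output):

    # Noisy-channel selection: pick the candidate target string with the
    # highest Pr(target) * Pr(source | target), computed in log space.
    import math

    def best_translation(source, candidates, lm_prob, tm_prob):
        """argmax over candidates of log Pr(t) + log Pr(source | t)."""
        return max(candidates,
                   key=lambda t: math.log(lm_prob[t]) + math.log(tm_prob[(source, t)]))

    candidates = ["target A", "target B"]
    lm_prob = {"target A": 0.02, "target B": 0.05}                        # language model
    tm_prob = {("source", "target A"): 0.4, ("source", "target B"): 0.1}  # translation model

    print(best_translation("source", candidates, lm_prob, tm_prob))  # -> "target A"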

  24. Current Techniques • Corpus-based approaches • Example Based Machine Translation • Translation by Analogy. • System is given a set of sentences in the source language and their corresponding translations in the target language • System uses those examples to translate other, similar source-language sentences into the target language. • Hybrid methods • Combination of Rule Based and Statistical Methods

  25. Punjabi to Hindi Machine Translation • The Punjabi to Hindi machine translation system is a direct translation system based on various lexical resources and a rule-base. • The system is modular, with clear separation of data from process. • The central idea is to select words from the source language and do only the minimal analysis required, such as extracting the root word, the lexical category and contextual information, i.e. the tokens to the left and right of the current token.

  26. Punjabi to Hindi Machine Translation • The word sense disambiguation module is called for ambiguous words. • Target-language equivalents of the source tokens are found in the lexicon and substituted to produce the target text. • Rules are then applied to the output to make it appropriate for the target language.

  27. System Architecture: the normalized source text is pre-processed (tokenization, named entity recognition, repetitive-construct handling) and passed to the translation engine. Each token is looked up in the lexicon (root word and inflectional form databases). On a hit, if the token is marked ambiguous, the ambiguity resolution module consults the ambiguous word, bigram and trigram databases; if there is no hit, the token is transliterated. The result is appended to the output and the next token is retrieved. Post-processing (post-editing) then produces the target text.
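
  A schematic of this loop (an addition; the helper behaviour and data structures are illustrative placeholders, not the system's actual code):

    # Schematic translation loop: lexicon look-up, ambiguity check, fallback
    # transliteration, then post-processing rules.
    def transliterate(token):
        return "<translit:" + token + ">"        # stand-in for transliteration

    def resolve_ambiguity(token, prev_tokens, ambiguous_db, bigram_db):
        # Prefer a context-specific (bigram) sense, else the most frequent one.
        prev1 = prev_tokens[-1] if prev_tokens else None
        return bigram_db.get((prev1, token), ambiguous_db[token][0])

    def translate(tokens, lexicon, ambiguous_db, bigram_db, rules):
        output = []
        for i, token in enumerate(tokens):
            entry = lexicon.get(token)            # lexicon look-up
            if entry is None:
                target = transliterate(token)     # no hit: transliterate
            elif entry == "amb":                  # hit, but flagged ambiguous
                target = resolve_ambiguity(token, tokens[:i], ambiguous_db, bigram_db)
            else:
                target = entry                    # unambiguous hit
            output.append(target)                 # append and move to next token
        text = " ".join(output)
        for orgtext, reptxt in rules:             # post-processing rule-base
            text = text.replace(orgtext, reptxt)
        return text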

  28. Why a Direct System? • For a given language pair and text type, what kind of system is required is largely an empirical and practical question. • General requirements on MT systems, such as modularity, separation of data from processes, reusability of resources and modules, robustness, corpus-based derivation of data and so on, do not provide conclusive arguments for any one of the models. • The available resources are one of the key factors in deciding the approach.

  29. Why a Direct System? • In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis is less convincing. • Keeping in view the similarity between Punjabi and Hindi, a simpler, direct model is the obvious choice for the Punjabi to Hindi machine translation system.

  30. The Lexicon • The lexicon contains information about the primary component of a language, i.e. its words. • Most NLP applications use dictionaries: morphological analyzers use a lexicon of morphemes, tagging systems use probability data, parsers use lexical/semantic or co-occurrence information, and MT systems use translation memories and transfer dictionaries.

  31. The Lexicon • Root Table fields: PW (Text), gnp (Text), cat (Text), HW (Text). • The bilingual dictionary prepared by the LTRC department of IIIT Hyderabad in ISCII format, containing about 22,000 entries, was adopted and extended for our system and converted into Unicode format. • The entries were extended to about 33,000, covering almost all the root words of the Punjabi language.

  32. The Lexicon • A table of all the inflected forms of Punjabi root words, stored along with their roots. • The corresponding Hindi words are entered manually. • It comprises about 65,000 entries. • Inflectional Form Table fields: PW (Text), ROOT (Text), HW (Text), where ROOT is an entry from the Root Table.

  33. The Lexicon • For all the ambiguous words in the Root Table as well as the Inflectional Form Table, the entry for the target word contains the symbol “amb”. • It triggers the disambiguation process for the given word. • A table of ambiguous words is prepared for this purpose; it contains the most frequent meaning followed by all other possible meanings of a given word. • Ambiguous Word Table fields: PW (Text), s1 (Text), s2 (Text).

  34. The Lexicon • To help the disambiguation module, bigram and trigram tables are created. • They contain the contexts of ambiguous words along with their meaning in each context and its frequency, obtained from a corpus of about 30 lakh (3 million) words. • Trigram Table fields: PREV2 (Text), PREV1 (Text), PW (Text), HW (Text), COUNT (Number). • Bigram Table fields: PREV1 (Text), PW (Text), HW (Text), COUNT (Number).
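
  The field lists above can be read as the following schema (a sketch only; the use of SQLite here is an assumption made for illustration, with PW = Punjabi word, HW = Hindi word):

    # Lexicon tables rendered as an SQLite schema, following the field lists
    # on slides 31-34. SQLite is used here only for illustration.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE root       (PW TEXT, gnp TEXT, cat TEXT, HW TEXT);
    CREATE TABLE inflection (PW TEXT, ROOT TEXT, HW TEXT);   -- ROOT is a root.PW entry
    CREATE TABLE ambiguous  (PW TEXT, s1 TEXT, s2 TEXT);     -- s1 = most frequent sense
    CREATE TABLE bigram     (PREV1 TEXT, PW TEXT, HW TEXT, COUNT INTEGER);
    CREATE TABLE trigram    (PREV2 TEXT, PREV1 TEXT, PW TEXT, HW TEXT, COUNT INTEGER);
    """)
    # An HW value of "amb" in root/inflection marks a word that must go
    # through the word sense disambiguation module.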

  35. The Lexicon • The lexicon also contains a rule-base. • It contains all the rules for handling grammatical dissimilarities between the two languages during post-processing. • Replacement Table fields: orgtext (Text), reptxt (Text).
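
  A minimal sketch of applying such replacement rules (the rule strings below are invented placeholders, not entries from the real table):

    # Post-processing: apply every (orgtext, reptxt) rule to the draft output.
    replacement_rules = [
        ("ORGTEXT-1", "REPTXT-1"),     # placeholder rule pair
        ("ORGTEXT-2", "REPTXT-2"),
    ]

    def post_edit(text, rules=replacement_rules):
        """Apply each (orgtext, reptxt) replacement to the draft translation."""
        for orgtext, reptxt in rules:
            text = text.replace(orgtext, reptxt)
        return text

    print(post_edit("draft output containing ORGTEXT-1"))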

  36. Text Normalization • The text should be in a normalized form, i.e. there should be only one way to represent a syllable. • Having identical pieces of text represented by differing underlying byte sequences makes analysis of the text much more difficult. • For example, under the AnmolLipi font the Latin character ‘A’ appears as ‘ਅ’, whereas under the DrChatrikWeb font it appears as ‘ੳ’. • This causes a problem while scanning a text.

  37. Text Normalization • So the source text is normalized by converting it into Unicode format. • This gives a three-fold advantage: first, it reduces text-scanning complexity; second, it helps in internationalizing the system; third, it eases the transliteration task.
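
  A toy converter for such font-encoded text (an addition; only the ‘A’ to ਅ mapping comes from the slide, the other table entry and the function are illustrative assumptions):

    # Legacy Gurmukhi fonts store Latin code points that merely *render* as
    # Gurmukhi letters; mapping them to real Unicode code points makes the
    # text scannable.
    FONT_TO_UNICODE = {
        "A": "\u0A05",   # ਅ under AnmolLipi (from the slide)
        "s": "\u0A38",   # ਸ (assumed mapping, for illustration only)
    }

    def to_unicode(font_text, table=FONT_TO_UNICODE):
        return "".join(table.get(ch, ch) for ch in font_text)

    print(to_unicode("As"))   # -> ਅਸ under this toy table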

  38. Text Normalization • Spelling normalization • There is a chance that a word is present in the database but with a different spelling, e.g. ਪ੍ਰੀਖਿਆ / ਪਰੀਖਿਆ {prīkhia} [examination]. • Only one of the variants may appear in the database. • The purpose of spelling normalization is to find the missing variant. • The Soundex technique is used for spelling normalization.

  39. Soundex • In this technique, a number is assigned to each character of the alphabet, with similar-sounding letters getting the same number. • A code is then generated for each string. • All strings with the same code are treated as spelling variants of one another.

  40. Soundex [table of Gurmukhi characters and their assigned codes]

  41. Soundex • With this table, the code for both ਪ੍ਰੀਖਿਆ and ਪਰੀਖਿਆ comes out to be c31c37sc13sc4, enabling the system to detect the variant present in the database. • For example, if the database contains ਪ੍ਰੀਖਿਆ {prīkhia} as the Punjabi word, then the code c31c37sc13sc4 is stored against it. • If a user enters ਪਰੀਖਿਆ as input, which is not present in the database, its code is generated on the fly and checked in the database. • If the code appears in the database, the corresponding Punjabi word is selected as the spelling variant.
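
  A sketch of the idea (an addition; the character grouping below is a small invented fragment, and the real system's code format, e.g. c31c37sc13sc4, differs from this toy output):

    # Soundex-style spelling normalization: consonants map to sound-group
    # digits, vowel signs and other marks are ignored, so variant spellings
    # collapse to the same code. The grouping is an assumed fragment only.
    SOUND_GROUPS = {
        "ਪ": "1", "ਫ": "1", "ਬ": "1", "ਭ": "1",   # assumed labial group
        "ਕ": "2", "ਖ": "2", "ਗ": "2", "ਘ": "2",   # assumed velar group
        "ਰ": "3", "ੜ": "3",                        # assumed rhotic group
    }

    def soundex_code(word, groups=SOUND_GROUPS):
        return "".join(groups[ch] for ch in word if ch in groups)

    # ਪ੍ਰੀਖਿਆ and ਪਰੀਖਿਆ both reduce to "132" here, so the stored form is found.
    print(soundex_code("ਪ੍ਰੀਖਿਆ") == soundex_code("ਪਰੀਖਿਆ"))   # True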

  42. Word Sense Disambiguation • To resolve such ambiguity, we make use of the information contained in the context, similar to what humans do. • The module is a standalone word sense disambiguation component, capable of performing its work without any outside help. • To start with, all we have is a raw corpus of Punjabi text, so the statistical approach is the obvious choice for us.

  43. Word Sense Disambiguation • We use the words surrounding the ambiguous word to build a statistical language model. This model is then used to determine the meaning of that ambiguous word in new contexts. • The basic idea of statistical methodologies is that, given a sentence with ambiguous words, it is possible to determine the most likely sense for each word. • One such statistical model is the n-gram model.

  44. Word Sense Disambiguation • An n-gram is simply a sequence of n successive words, stored along with its count, i.e. its number of occurrences in the training data. • An n-gram of size 2 is a bigram and size 3 a trigram; larger sizes are usually just called n-grams, and an n-gram model corresponds to an (n − 1)-order Markov model. • N-grams are used as probability estimators: they estimate how likely a word is to follow a given point in a document. • What is the optimum value of n?
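
  A minimal sketch of building such counts (an addition; the sample tokens are English placeholders standing in for Punjabi corpus text):

    # Count all bigrams and trigrams in a tokenized corpus, with frequencies,
    # as needed for the bigram and trigram tables described earlier.
    from collections import Counter

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    tokens = "the dog ate the homework the dog slept".split()
    print(ngram_counts(tokens, 2).most_common(3))   # most frequent bigrams
    print(ngram_counts(tokens, 3).most_common(3))   # most frequent trigrams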

  45. Word Sense Disambiguation • Consider predicting the word "ਸੀ" in the three sentences: • (1) …….ਕਹਿ ਰਿਹਾ ਸੀ। • (2) ਕ੍ਰਾਂਤੀ, ਜੇਕਰ ਕੋਈ ਸੀ, …. • (3) ਉਹ ਜਮਾਤ ਦੇ ਹੁਸ਼ਿਆਰ ਵਿਦਿਆਰਥੀਆਂ ਵਿਚੋਂ ਇਕ ਸੀ। • In (1), the prediction can be done with a bigram (2-gram) language model (n = 2), but (2) requires n = 4 and (3) requires n > 9.

  46. Word Sense Disambiguation • The number of words considered at ±n positions is important. • Factors of concern: • The larger the value of n, the higher the probability of getting the correct word sense; in general, more training data will always improve the result. • On the other hand, most higher-order n-grams do not occur in the training data at all. This is the problem of data sparseness.

  47. Word Sense Disambiguation • As the training data size increases, the size of the model also increases, which can lead to models that are too large for practical use. The number of potential n-grams scales exponentially with n, so a large n requires a huge amount of memory and time. • Does the model get much better if we use a longer word history for the n-gram? • Do we have enough data to estimate the probabilities for the longer history?

  48. Word Sense Disambiguation • An experiment to find the optimum value of n for Punjabi was performed. • Different n-gram models were generated, with n ranging from ±1 to ±6. • It was observed that as the value of n increases, the ability to disambiguate a word decreases. • This is due to data sparseness.

  49. Word Sense Disambiguation • Another interesting observation is that instead of building and using higher-order n-gram models, we can improve the performance of the system tremendously by using the lower-order models jointly. • We use the trigram model first to disambiguate a word. If it fails, we back off to the bigram model; if that also fails, we use the unigram model. • With this technique only 7.96% of words are disambiguated incorrectly. • This approach is adopted for the word sense disambiguation module.
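
  A sketch of this back-off (an addition; the table contents below are invented placeholders for the bigram and trigram tables described earlier):

    # Trigram -> bigram -> unigram back-off for choosing the Hindi sense of
    # an ambiguous Punjabi word.
    def disambiguate(prev2, prev1, word, trigrams, bigrams, most_frequent):
        if (prev2, prev1, word) in trigrams:
            return trigrams[(prev2, prev1, word)]      # trigram context seen
        if (prev1, word) in bigrams:
            return bigrams[(prev1, word)]              # back off to bigram
        return most_frequent[word]                     # back off to unigram

    trigrams = {("w2", "w1", "amb"): "sense-from-trigram"}
    bigrams = {("w1", "amb"): "sense-from-bigram"}
    most_frequent = {"amb": "most-frequent-sense"}

    print(disambiguate("w2", "w1", "amb", trigrams, bigrams, most_frequent))
    print(disambiguate("x", "y", "amb", trigrams, bigrams, most_frequent))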

  50. Word Sense Disambiguation • Three models, viz. unigram, bigram and trigram models of the ambiguous words, capturing the words in the context of each ambiguous word, are created from a corpus of about 3 million words comprising different types of text such as essays, stories, editorials, news, novels, office letters, court orders, etc. • In order to reduce the size of the n-gram tables, we retain only those contexts which lead to a less frequent meaning of an ambiguous word.
