
CS712: Topics in Natural Language Processing (Lecture 1 – Introduction; Machine Translation)

CS712: Topics in Natural Language Processing (Lecture 1 – Introduction; Machine Translation). Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 10 Jan, 2013. Basic Facts. Faculty instructor: Dr. Pushpak Bhattacharyya (www.cse.iitb.ac.in/~pb). TAship: Piyush (piyushadd@cse)


Presentation Transcript


  1. CS712: Topics in Natural Language Processing (Lecture 1 – Introduction; Machine Translation) Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 10 Jan, 2013 cs712, intro

  2. Basic Facts • Faculty instructor: Dr. Pushpak Bhattacharyya (www.cse.iitb.ac.in/~pb) • TAship: Piyush (piyushadd@cse) • Course material • www.cse.iitb.ac.in/~pb/cs712-2014 • Moodle • Venue: SIC 305 • 1.5 hour lectures 2 times a week: Tue-3.30, Fri-3.30 (slot 11)

  3. Motivation for MT • MT: NLP Complete • NLP: AI complete • AI: CS complete • How will the world be different when the language barrier disappears? • Volume of text required to be translated currently exceeds translators’ capacity (demand > supply). • Solution: automation

  4. Course plan (1/4) • Introduction • MT Perspective • Vauquois Triangle • MT Paradigms • Indian language SMT • Comparable to Parallel Corpora • Word based Models • Word Alignment • EM based training • IBM Models

  5. Course plan (2/4) • Phrase Based SMT • Phrase Pair Extraction by Alignment Templates • Reordering Models • Discriminative SMT models • Overview of Moses • Decoding • Factor Based SMT • Motivation • Data Sparsity • Case Study for Indian languages

  6. Course plan (3/4) • Hybrid Approaches to SMT • Source Side reordering • Clause based constraints for reordering • Statistical Post-editing of rule based output • Syntax Based SMT • Synchronous Context Free Grammar • Hierarchical SMT • Parsing as Decoding

  7. Course plan (4/4) • MT Evaluation • Pros/Cons of automatic evaluation • BLEU evaluation metric • Quick glance at other metrics: NIST, METEOR, etc. • Concluding Remarks

  8. Introduction

  9. Set a perspective • MT today is data driven, but • When to use ML and when not to • “Do not learn, when you know” / “Do not learn, when you can give a rule” • What is difficult about MT and what is easy • Alternative approaches to MT (not based on ML) • What has preceded SMT • SMT from Indian language perspective • Foundation of SMT • Alignment

  10. Taxonomy of MT systems • MT Approaches • Knowledge Based; Rule Based MT: Interlingua Based, Transfer Based • Data driven; Machine Learning Based: Example Based MT (EBMT), Statistical MT

  11. MT Approaches • Levels at which SOURCE and TARGET can be related, from lowest to highest: words, phrases, syntax, semantics, interlingua

  12. MACHINE TRANSLATION TRINITY

  13. Why is MT difficult? Language divergence

  14. Why is MT difficult: Language Divergence • One of the main complexities of MT: Language Divergence • Languages have different ways of expressing meaning • Lexico-Semantic Divergence • Structural Divergence • Our work: English-IL language divergence, with illustrations from Hindi (Dave, Parikh, Bhattacharyya, Journal of MT, 2002)

  15. Languages differ in expressing thoughts: Agglutination • Finnish: “istahtaisinkohan” • English: “I wonder if I should sit down for a while” • Analysis: • ist + “sit”, verb stem • ahta + verb derivation morpheme, “to do something for a while” • isi + conditional affix • n + 1st person singular suffix • ko + question particle • han: a particle for things like reminder (with declaratives) or “softening” (with questions and imperatives)
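The morpheme-by-morpheme analysis above can be sketched in a few lines of Python. The morpheme inventory is copied from the slide; the fixed-order matcher is only an illustrative assumption and not a real Finnish morphological analyser.

```python
# Morpheme inventory taken from the slide, in the order the morphemes attach.
# The matcher below is a toy: it assumes exactly this order and no allomorphy.
MORPHEMES = [
    ("ist", "sit (verb stem)"),
    ("ahta", "do something for a while (verb derivation)"),
    ("isi", "conditional affix"),
    ("n", "1st person singular suffix"),
    ("ko", "question particle"),
    ("han", "reminder/softening particle"),
]


def segment(word):
    """Split `word` by consuming the slide's morphemes left to right."""
    pieces, rest = [], word
    for form, gloss in MORPHEMES:
        if not rest.startswith(form):
            raise ValueError(f"cannot match {form!r} at {rest!r}")
        pieces.append((form, gloss))
        rest = rest[len(form):]
    if rest:
        raise ValueError(f"unanalysed remainder: {rest!r}")
    return pieces


analysis = segment("istahtaisinkohan")
print(" + ".join(form for form, _ in analysis))
for form, gloss in analysis:
    print(f"{form:5s} {gloss}")
```

One Finnish word yields six meaning-bearing units, each of which surfaces as a separate word or phrase in the English rendering.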

  16. Language Divergence Theory: Lexico-Semantic Divergences (few examples) • Conflational divergence • F: vomir; E: to be sick • E: stab; H: chure se maaranaa (knife-with hit) • S: Utrymningsplan; E: escape plan • Categorial divergence • Change is in POS category: • The play is on_PREP (vs. The play is Sunday) • khel chal_rahaa_haai_VM (vs. khel ravivaar ko haai)

  17. Language Divergence Theory: Structural Divergences • SVO → SOV • E: Peter plays basketball • H: piitar basketball kheltaa haai • Head swapping divergence • E: Prime Minister of India • H: bhaarat ke pradhaan mantrii (India-of Prime Minister)

  18. Language Divergence Theory: Syntactic Divergences (few examples) • Constituent Order divergence • E: Singh, the PM of India, will address the nation today • H: bhaarat ke pradhaan mantrii, singh, … (India-of PM, Singh…) • Adjunction Divergence • E: She will visit here in the summer • H: vah yahaa garmii meM aayegii (she here summer-in will come) • Preposition-Stranding divergence • E: Who do you want to go with? • H: kisake saath aap jaanaa chaahate ho? (who with…)

  19. Vauquois Triangle

  20. Kinds of MT Systems (point of entry from source to the target text)

  21. Illustration of transfer SVO → SOV • Source (SVO): [S [NP [N John]] [VP [V eats] [NP [N bread]]]] • Transfer rule: VP → V NP becomes VP → NP V • Target (SOV): [S [NP [N John]] [VP [NP [N bread]] [V eats]]]
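A minimal sketch of the SVO to SOV structural transfer illustrated above, assuming parse trees are encoded as nested Python tuples. The encoding and the function names are hypothetical. Only the reordering step is shown; lexical transfer into the target language is left out.

```python
# Trees are nested tuples (label, child, ...); leaves are plain strings.
# This encoding and the helper names are assumptions made for illustration.

def label(node):
    """Return the category label of a subtree, or None for a leaf word."""
    return node[0] if isinstance(node, tuple) else None


def transfer_svo_to_sov(tree):
    """Recursively rewrite every VP -> V NP as VP -> NP V."""
    if isinstance(tree, str):
        return tree
    head, *children = tree
    children = [transfer_svo_to_sov(child) for child in children]
    if head == "VP" and [label(c) for c in children] == ["V", "NP"]:
        children.reverse()  # move the object NP in front of the verb
    return (head, *children)


def leaves(tree):
    """Read the words off a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


source = ("S",
          ("NP", ("N", "John")),
          ("VP", ("V", "eats"), ("NP", ("N", "bread"))))
target = transfer_svo_to_sov(source)

print(" ".join(leaves(source)))  # John eats bread
print(" ".join(leaves(target)))  # John bread eats
```

The rule operates on the syntactic category of each node rather than on the words, so a single rule covers every transitive verb phrase in the source.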

  22. Universality hypothesis

  23. Understanding the Analysis-Transfer-Generation over Vauquois triangle (1/4)

  24. Understanding the Analysis-Transfer-Generation over Vauquois triangle (2/4)

  25. Understanding the Analysis-Transfer-Generation over Vauquois triangle (3/4)

  26. Understanding the Analysis-Transfer-Generation over Vauquois triangle (4/4)

  27. More flexibility in Hindi generation

  28. Dependency tree of the Hindi sentence • H1.1: सरकार_ने चुनावो_के_बाद मुंबई में करों_के_माध्यम_से अपने राजस्व_को बढ़ाया (The government increased its revenue through taxes in Mumbai after the elections)

  29. Transfer over dependency tree

  30. Descending transfer • नृपायते सिंहासनासीनो वानरः • Behaves-like-king sitting-on-throne monkey • A monkey sitting on the throne (of a king) behaves like a king

  31. Ascending transfer: Finnish → English • istahtaisinkohan: “I wonder if I should sit down for a while” • ist + “sit”, verb stem • ahta + verb derivation morpheme, “to do something for a while” • isi + conditional affix • n + 1st person singular suffix • ko + question particle • han: a particle for things like reminder (with declaratives) or “softening” (with questions and imperatives)

  32. Interlingual representation: complete disambiguation • Washington voted Washington to power • Graph: vote (@past, <is-a> action) • agent: Washington (<is-a> place) • object: Washington (<is-a> person) • goal: power (<is-a> capability) • Other labels in the graph: @emphasis, <is-a> …
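One way to picture the disambiguated graph above is as typed nodes plus labelled relations. The sketch below is an assumption-laden illustration: the node ids, the `Node` class, and the exact assignment of <is-a> types to nodes are read off the sentence meaning rather than taken from the original figure, and the @emphasis attribute is omitted because its attachment point cannot be recovered from the transcript.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A concept node: surface word, semantic type, optional attributes."""
    word: str
    is_a: str
    attributes: list = field(default_factory=list)


# Hypothetical node ids. The two "Washington" nodes share a surface form
# but carry different semantic types, which is the point of the slide.
nodes = {
    "vote": Node("vote", "action", ["@past"]),
    "washington_1": Node("Washington", "place"),
    "washington_2": Node("Washington", "person"),
    "power": Node("power", "capability"),
}

# (head, relation, dependent) triples using the relation labels on the slide.
relations = [
    ("vote", "agent", "washington_1"),
    ("vote", "object", "washington_2"),
    ("vote", "goal", "power"),
]

for head, relation, dependent in relations:
    dep = nodes[dependent]
    print(f"{relation}({nodes[head].word}, {dep.word}: {dep.is_a})")
```

Once every node is typed in this way, the ambiguity of the surface string "Washington" no longer exists at the interlingua level.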

  33. Kinds of disambiguation needed for a complete and correct interlingua graph • N: Name • P: POS • A: Attachment • S: Sense • C: Co-reference • R: Semantic Role

  34. Issues to handle • Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed. • Issue: Noun or Verb

  35. Issues to handle • Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed. • Issue: John is the name of a PERSON

  36. Issues to handle • Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed. • Issue: Financial bank or River bank

  37. Issues to handle • Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed. • Issue: “it” → “bank”

  38. Issues to handle • Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed. • Issue: Pro drop (subject “I”)

  39. Typical NLP tools used • POS tagger • Stanford Named Entity Recognizer • Stanford Dependency Parser • XLE Dependency Parser • Lexical Resource • WordNet • Universal Word Dictionary (UW++)

  40. Simple Sentence Analyser: System Architecture • Figure; component labels: Stanford Dependency Parser, NER, XLE Parser, Clause Marker, Simplifier, Simple Enco. (five instances), Merger, WSD, Feature Generation, Attribute Generation, Relation Generation

  41. Target Sentence Generation from interlingua • Target Sentence Generation: • Lexical Transfer (Word/Phrase Translation) • Morphological Synthesis (Word form Generation) • Syntax Planning (Sequence)

  42. Generation Architecture • Deconversion = Transfer + Generation

  43. Transfer Based MT: Marathi-Hindi

  44. Indian Language to Indian Language Machine Translation (ILILMT) • Bidirectional Machine Translation System • Developed for nine Indian language pairs • Approach: • Transfer based • Modules developed using both rule based and statistical approaches

  45. Architecture of ILILMT System • Source Text • Analysis: Morphological Analyzer, POS Tagger, Chunker, Vibhakti Computation, Named Entity Recognizer, Word Sense Disambiguation • Transfer: Lexical Transfer • Generation: Agreement Feature, Intrachunk, Interchunk, Word Generator • Target Text

  46. M-H MT system: Evaluation • Subjective evaluation based on machine translation quality • Accuracy calculated based on score given by linguists • S5: number of sentences with score 5; S4: number of sentences with score 4; S3: number of sentences with score 3; N: total number of sentences • Accuracy = (formula in terms of S5, S4, S3 and N; not preserved in this transcript)

  47. Evaluation of Marathi to Hindi MT System • Module-wise evaluation • Evaluated on 500 web sentences • Table: module-wise precision and recall (not preserved in this transcript)

  48. Evaluation of Marathi to Hindi MT System (cont.) • Subjective evaluation on translation quality • Evaluated on 500 web sentences • Accuracy calculated based on score given according to the translation quality. • Accuracy: 65.32% • Result analysis: • Morph, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality. • Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy.

  49. SMT

  50. Czech-English data • [nesu] “I carry” • [ponese] “He will carry” • [nese] “He carries” • [nesou] “They carry” • [yedu] “I drive” • [plavou] “They swim”
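The data above invite the question of which part of each Czech word corresponds to which English word. As a rough illustration, the sketch below groups the Czech forms by an English word in their gloss and looks for the longest substring they share. This heuristic, the function names, and the tiny lemma table are assumptions made for this example; it is not the alignment method of the course, and with only six pairs it is reliable for only a few English words.

```python
# Word pairs copied from the slide (Czech forms in the slide's transcription).
PAIRS = [
    ("nesu", "I carry"),
    ("ponese", "He will carry"),
    ("nese", "He carries"),
    ("nesou", "They carry"),
    ("yedu", "I drive"),
    ("plavou", "They swim"),
]

# Assumption: a hand-made lemma table so that "carries" counts as "carry".
LEMMAS = {"carries": "carry"}


def czech_words_for(english_word):
    """Czech forms whose English gloss contains `english_word`."""
    target = english_word.lower()
    hits = []
    for czech, gloss in PAIRS:
        tokens = [LEMMAS.get(tok, tok) for tok in gloss.lower().split()]
        if target in tokens:
            hits.append(czech)
    return hits


def longest_common_substring(words):
    """Longest substring of the first word found in all the others.

    Ties are broken by taking the leftmost candidate in the first word.
    """
    if not words:
        return ""
    first, rest = words[0], words[1:]
    for length in range(len(first), 0, -1):
        for start in range(len(first) - length + 1):
            candidate = first[start:start + length]
            if all(candidate in word for word in rest):
                return candidate
    return ""


for english in ("carry", "They"):
    shared = longest_common_substring(czech_words_for(english))
    print(f"{english!r} co-occurs with Czech substring {shared!r}")
```

On this data the forms glossed with "carry" share "nes" and the forms glossed with "They" share "ou", which matches the intuition that a stem and a person ending can be read off from co-occurrence alone. A word such as "will", which appears only once, cannot be isolated this way.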
