1 / 20

Flow Network Models for Sub-Sentential Alignment

Flow Network Models for Sub-Sentential Alignment . Ying Zhang (Joy) Advisor: Ralf Brown Dec 18 th , 2001. Background. Sub-sentential alignment problem Given a bilingual parallel sentence pair, to find the correspondence between source words and target words

vince
Télécharger la présentation

Flow Network Models for Sub-Sentential Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Flow Network Models for Sub-Sentential Alignment Ying Zhang (Joy) Advisor: Ralf Brown Dec 18th, 2001

  2. Background • Sub-sentential alignment problem • Given a bilingual parallel sentence pair, to find the correspondence between source words and target words • One of the major issues in Data-Driven MT (SMT/EBMT) • Some approaches • IBM Model 1 • Smooth Injective Map Recognizer [I. Dan Melamed] Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  3. Flow Network • A well-studied area since 1950’s • Used widely in electrical engineering, computer science, social science and economic problems • Any system involving a binary relation can be represented by a network Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  4. Flow Network [Jensen] Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  5. Minimum Cost Flow • One of the basic problems in flow network theory • For a network G= (V,E), Total cost = • Minimum cost problem: given a network, assign the flow for each arc, so that the total cost of the network is minimum Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  6. Alignment as a Flow Network The “abstract concepts” are transformed through this network Flow>=1, if there is alignment between two words, Flow =0. O/W Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  7. Alignment as a Mini-cost Flow Network • Assume the alignment probabilities only lies in the possibilities of words across languages. (Alignments between other words do not have impact on this pair), then • For word pair (si,tj), assign the cost for the arc as -lnp(si,tj), then the mini-cost flow network corresponds to the maximum P(a,s,t) Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  8. Pure Mini-Cost Flow Algorithm • Primal simplex algorithm Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  9. Pure Mini-cost Flow Algorithm (Cont.) [Jensen] Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  10. Pure Mini-cost Flow Algorithm (Cont.) • Two phase algorithm [Jensen] Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  11. Our Model Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  12. Solve the Network – Phase 1 • Phase 1: to get the basis solution (solved on paper) BridgeArc Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  13. Solve the Network – Phase 2 • If not optimal: • Using bridgeArc to calculate the dual values of tgt nodes • If the minimal dual values of arcs in n_0 < 0 Then make this arc to be the new bridge arc; Insert the previous bridgeArc to n_0; • Else If all dual values in n_0 >0 If min(dual values of n_0) < max(dual values of n_1) make this n_1 arc to be the new bridge arc; insert the previous bridgeArc to n_1; Else optimal, stop Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  14. Update the Cost by E.M. • Each sentence pair is represented by a network • After the network is solved, update the counts and probability contribution of the word pairs in the solved network • Update the probability of word association between source and target language • Using Good-Turing smoothing • Does not work very well so far! –High frequency words Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  15. Model Training • Sentence level aligned bilingual corpus (hknews) • Stemmed by Porter stemmer • Building a seed statistical dictionary using Ralf’s method [Brown97] • Combined with the Chinese-English glossary to get a seed dictionary (coverage != 100%) • Solve the network and update the probability until the total cost of the corpus converge Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  16. Performance & Results • Fast (align 10,000 sentence pairs in 2 minutes*) • * words in sentence <=25 • * system ran on PC with 128 RAM • Using only the seed dictionary, tested on 30 sentence pairs: • Recall: 53%~61% • (because of the coverage of the seed dictionary) • Precision: 74%~76% • Expected to achieve a higher recall when E.M. works properly Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  17. Future Work • Currently, the model is no more powerful than IBM model1 • We are planning to integrate the monolingual co-occurrence probability and the bilingual association probability into the model Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  18. References • Éric Gaussier, “Flow Network Models for Word Alignment and Terminology Extraction from Bilingual Corpora”, Proceedings of the 36th Annual Meeting of the association for Computational Linguistics and the 17th International Conference on Computational Linguistics, COLING-ACL'98. Montreal, Canada • Paul A. Jensen and Jonathan F. Bard, “Operations Research Models and Methods”, http://www.me.utexas.edu/~jensen/ORMM/index.html • Ralf D. Brown, "Automated Dictionary Extraction for ``Knowledge-Free'' Example-Based Translation". In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, p. 111-118. Santa Fe, July 23-25, 1997. Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  19. Acknowledgement • Many thanks to Jian Zhang, Katharina Probst, and Alicia Tribble for their help ! Language Technologies Institute School of Computer Sciences Carnegie Mellon University

  20. Questions and Comments? Language Technologies Institute School of Computer Sciences Carnegie Mellon University

More Related