Multilinguality to the Rescue

Presentation Transcript


  1. Multilinguality to the Rescue Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU

  2. Multilinguality Using more than one language at a time Image source: https://buffy.eecs.berkeley.edu/PHP/resabs/images/2006//101268-1.png

  3. Multilinguality Why? बैंक (bank, the financial institution) Bank तट (bank, the shore) Cross-lingual Word Sense Disambiguation (Diab and Resnik, 2002) Images: http://www.realestategolfodulce.com/, http://thetrustadvisor.com/

  4. Multilinguality Why? Bilingual Word Clustering (Faruqui & Dyer, 2013)

  5. Multilinguality Why? Bilingual Word Clustering (Faruqui & Dyer, 2013)

  6. Multilinguality Using data from other languages • Direct: Assume foreign = original language • Indirect: Extract information from foreign language

  7. Direct Information Transfer Language 1 data Language 2 data NLP System Output

  8. Direct Information Transfer Why would it work? • Works for specific tasks like NER • Many NEs retain their “orthographic” form • Across languages that use the same “alphabet” • English, German, French, Spanish • Hindi, Marathi, Bihari • Especially proper nouns • Names of Locations • USA, London, New York, Pittsburgh • Names of People • Obama, William, Roger

  9. Direct Information Transfer • ... sagte Jimmy Wales dem Wall Street Journal in einem Interview in Hongkong. • Mads Refslund, executive chef at Acme, forages in the overgrown spaces and hidden markets of Hongkong for regional delicacies. • Les sacs de luxe, nouvelle monnaie d'échange à Hongkong. • Barack Obama hat 2012 mit dieser Strategie die Präsidentschaftswahlen gewonnen. • The Obama administration has poured billions of dollars into expanding the reach of the Internet. • Pour finir, en défendant les bonus et en tentant de faire dérailler les nouvelles règles prudentielles, ce démocrate s'est mis à dos Barack Obama.
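
The point of these example sentences, that names keep the same written form across German, English and French, can be checked with a minimal sketch. The language tags, the function name `shared_capitalized_tokens` and the "capitalized token" heuristic are assumptions made for illustration; the slides do not describe any such procedure.

```python
import re
from collections import defaultdict

# Sentences from the slide; the language tags are added here for illustration.
SENTENCES = [
    ("de", "... sagte Jimmy Wales dem Wall Street Journal in einem Interview in Hongkong."),
    ("en", "Mads Refslund, executive chef at Acme, forages in the overgrown spaces "
           "and hidden markets of Hongkong for regional delicacies."),
    ("fr", "Les sacs de luxe, nouvelle monnaie d'échange à Hongkong."),
    ("de", "Barack Obama hat 2012 mit dieser Strategie die Präsidentschaftswahlen gewonnen."),
    ("en", "The Obama administration has poured billions of dollars into "
           "expanding the reach of the Internet."),
    ("fr", "Pour finir, en défendant les bonus et en tentant de faire dérailler les "
           "nouvelles règles prudentielles, ce démocrate s'est mis à dos Barack Obama."),
]


def shared_capitalized_tokens(sentences, min_languages=3):
    """Return capitalized tokens that occur in at least `min_languages` languages."""
    languages_by_token = defaultdict(set)
    for lang, text in sentences:
        for token in re.findall(r"\w+", text):
            if token[0].isupper():
                languages_by_token[token].add(lang)
    return sorted(
        token
        for token, langs in languages_by_token.items()
        if len(langs) >= min_languages
    )


print(shared_capitalized_tokens(SENTENCES))  # ['Hongkong', 'Obama']
```

Capitalization is a crude stand-in for "named entity" (German capitalizes all nouns, and sentence-initial words are capitalized everywhere), which is why the sketch requires a token to appear in all three languages before reporting it.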

  10. Direct Information Transfer Semantic Generalization Deutschland (100) Ostdeutschland (5) Westdeutschland (0) LOC

  11. Direct Information Transfer Language 1 Training data Language 2 Word Clusters How? NER System Input NE-tagged Text
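
A rough sketch of how word clusters could feed an NER system as generalization features follows. It reads the numbers in parentheses on the previous slide as training-data frequencies, which is an interpretation rather than something the slide states. The cluster IDs, the `WORD_TO_CLUSTER` table and the `token_features` function are hypothetical; this is not the Stanford NER feature set.

```python
# Hypothetical cluster assignments; real IDs would come from a word
# clustering tool run on text pooled from several languages.
WORD_TO_CLUSTER = {
    "Deutschland": 17,
    "Ostdeutschland": 17,
    "Westdeutschland": 17,
    "Interview": 42,
}


def token_features(token, word_to_cluster):
    """Build a sparse feature dict for one token (illustrative features only)."""
    features = {
        "word=" + token.lower(): 1.0,
        "is_capitalized": 1.0 if token[:1].isupper() else 0.0,
    }
    cluster = word_to_cluster.get(token)
    if cluster is not None:
        # The cluster feature is shared by every word in the same cluster.
        features["cluster=%d" % cluster] = 1.0
    return features


frequent = token_features("Deutschland", WORD_TO_CLUSTER)
unseen = token_features("Westdeutschland", WORD_TO_CLUSTER)

# Features the rare word shares with the frequent one.
print(sorted(set(frequent) & set(unseen)))  # ['cluster=17', 'is_capitalized']
```

Under this reading, a word that never occurs in the NER training data still fires a cluster feature whose weight was learned from frequent words such as "Deutschland", so it can still be tagged LOC.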

  12. Evaluation • Tools • Stanford NER for training (Finkel and Manning, 2009) • Built-in functionality to use word clusters for generalization • Word clustering software (distributional + morphological) (Clark, 2003) • Data • NER training data • German, English: CoNLL 2003 • Dutch, Spanish: CoNLL 2002 • Generalization data • WMT-2012 news commentary: 200 million tokens • English, German, French, Spanish, Czech

  13. Results

  14. Results

  15. Results Improvement in F1 scores by NE type

  16. Quick Takeaways • Multilingual data can be put to use for monolingual benefits • The amount of help depends on how similar the two languages are “orthographically”

  17. Indirect Information Transfer Language 1 data Language 2 data + NLP System Output

  18. Vector Space Word Models Image: http://www.emeraldinsight.com

  19. Vector Space Models Image: http://d1avok0lzls2w.cloudfront.net/

  20. Vector Space Models Monolingual Word Vectors 1 + Monolingual Word Vectors 2 = Better Monolingual Word Vectors 1 ??

  21. Indirect Information Transfer [Diagram: two word-vector matrices, n × d1 and n × d2, are combined by Canonical Correlation Analysis into two n × k matrices]

  22. Canonical Correlation Analysis [Diagram: x (n × d1) * wx (d1 × k) gives an n × k matrix; y (n × d2) * wy (d2 × k) gives an n × k matrix]
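
For reference, the objective behind this slide can be written out as below. This is the standard textbook formulation of CCA, offered as a sketch rather than as the authors' exact setup. The symbols x, y, w_x, w_y, n, d1, d2 and k follow the slide; the covariance matrices, the direction vectors a and b, and the projected matrices x' and y' are notation introduced here, and the columns of x and y are assumed to be mean-centred.

```latex
\begin{align*}
x &\in \mathbb{R}^{n \times d_1}, \qquad y \in \mathbb{R}^{n \times d_2} \\
\Sigma_{xx} &= \tfrac{1}{n}\, x^\top x, \qquad
\Sigma_{yy} = \tfrac{1}{n}\, y^\top y, \qquad
\Sigma_{xy} = \tfrac{1}{n}\, x^\top y \\
(a_1, b_1) &= \operatorname*{arg\,max}_{a \in \mathbb{R}^{d_1},\; b \in \mathbb{R}^{d_2}}
\frac{a^\top \Sigma_{xy}\, b}
     {\sqrt{a^\top \Sigma_{xx}\, a}\;\sqrt{b^\top \Sigma_{yy}\, b}} \\
w_x &= [a_1, \dots, a_k] \in \mathbb{R}^{d_1 \times k}, \qquad
w_y = [b_1, \dots, b_k] \in \mathbb{R}^{d_2 \times k}, \qquad
k \le \min(d_1, d_2) \\
x' &= x\, w_x \in \mathbb{R}^{n \times k}, \qquad
y' = y\, w_y \in \mathbb{R}^{n \times k}
\end{align*}
```

Each later pair (a_i, b_i) maximizes the same ratio subject to its projections being uncorrelated with those of the earlier pairs, so the k columns come out ordered from most to least correlated.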

  23. Indirect Information Transfer Word Vectors in Language 2 Word Vectors in Language 1 Obtain 1-to-1 mapping using word alignments Word Vectors in Language 1 Word Vectors in Language 2 + Word Vectors in Language 1 Word Vectors in Language 2
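
One plausible way to obtain such a mapping from word alignments is to pick, for every word in language 1, the language 2 word it is aligned to most often. The slide only says that alignments are used, so this most-frequent heuristic, the function name `one_to_one_mapping` and the toy alignment pairs below are all assumptions.

```python
from collections import Counter, defaultdict


def one_to_one_mapping(aligned_pairs):
    """Map each language-1 word to its most frequently aligned language-2 word."""
    counts = defaultdict(Counter)
    for source_word, target_word in aligned_pairs:
        counts[source_word][target_word] += 1
    return {
        source_word: target_counts.most_common(1)[0][0]
        for source_word, target_counts in counts.items()
    }


# Toy alignment links (English, German), invented for illustration.
pairs = [
    ("bank", "Bank"),
    ("bank", "Bank"),
    ("bank", "Ufer"),
    ("house", "Haus"),
]
print(one_to_one_mapping(pairs))  # {'bank': 'Bank', 'house': 'Haus'}
```

Strictly speaking this yields one target word per source word; two source words can still share a target, so a stricter one-to-one constraint would need an extra filtering step. The rows of the two vector matrices are then paired up according to this mapping before CCA is applied.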

  24. Experiments • Task: Word Pair Reranking • Rank a list of word pairs according to semantic similarity • Datasets • WS-353: 353 word pairs • RG-65: 65 noun pairs • Truncation • Maybe the correlation introduces noise • Keep only the top k% of correlated dimensions
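
The evaluation described above can be sketched as follows, assuming that model similarity is the cosine between word vectors and that agreement with human ratings is measured by Spearman's rank correlation. Both are common choices but neither is stated on the slide. The vectors and ratings are toy values, not taken from WS-353 or RG-65, and `truncate` assumes the dimensions are already sorted from most to least correlated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


def truncate(vector, keep_fraction):
    """Keep the leading dimensions (assumed sorted, most correlated first)."""
    k = max(1, int(round(len(vector) * keep_fraction)))
    return vector[:k]


# Toy vectors and human ratings, invented for illustration.
VECTORS = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "tree": [0.0, 1.0],
    "forest": [0.2, 0.9],
}
RATED_PAIRS = [("car", "automobile", 9.0), ("tree", "forest", 7.5), ("car", "tree", 1.0)]

model_scores = [cosine(VECTORS[a], VECTORS[b]) for a, b, _ in RATED_PAIRS]
human_scores = [score for _, _, score in RATED_PAIRS]
rho = spearman(model_scores, human_scores)
print(round(rho, 6))
```

With these toy values the model ranks the three pairs in the same order as the human ratings, so the correlation is close to 1. To try the truncation idea, apply `truncate` to each vector before computing the cosines and compare the resulting correlations across different values of `keep_fraction`.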

  25. Evaluation • Tools • Word vectors: RNNLM Toolkit (Mikolov, 2009) • Word alignments: cdec (Dyer et al., 2013) • CCA: MATLAB Toolkit • Data • Word vector monolingual training data • WMT news commentary: 2011, 2012 • English, French, Spanish, German • Word alignment data • WMT news commentary: 2010, 2009, 2008, 2007, 2006 • {French, Spanish, German} - English

  26. Results

  27. Results

  28. Original English Vectors

  29. German Projected on English

  30. Conclusion • Word vector quality can be improved using multilingual data • At least for lexical semantic tasks • The amount of help provided by these languages depends on how similar they are to each other • A task like NER can use data from multiple languages in a simple framework

  31. Thank You!
