
NLP Research at Internet Age An Overview of NLP at Microsoft Research Asia

Ming Zhou, Manager of the Natural Language Group, Microsoft Research Asia.


Presentation Transcript


  1. NLP Research at Internet Age: An Overview of NLP at Microsoft Research Asia. Ming Zhou, Manager of the Natural Language Group, Microsoft Research Asia

  2. Trends of Internet Services
     • Ecosystems that work with third-party apps
       • Apple Apps, Facebook, Twitter, Baidu, Sina, QQ
     • Real-time content collection and search
       • Twitter, Facebook, Del.icio.us, NYT, YouTube
     • Mobile search
       • Contextual intent understanding
       • Towards decision making and action taking
     • Social power
       • Social tags ("like") for general search engines
       • Search engines in SNS
       • Social QA

  3. Impact and Challenge to NLP Research
     • Impact
       • Biggest database ever: connects data
       • Biggest social network: connects people
       • Harnessing collective intelligence
       • Contextual information processing: user, user's social network, location, time
       • Real-time information processing: collection, indexing, and operation without delay
     • Challenge
       • How can data, people, and contextual information be leveraged to achieve real-time information processing?

  4. Problems of Traditional NLP Approaches (NLP 1.0)
     • Deep in individual component technologies, which have reached their upper bounds
     • Little consideration of scenarios, user needs, and market needs
     • Serious data sparseness with human annotation
     • Evaluation bottleneck
     • Slow deployment
     • Lack of an effective framework for incorporating user feedback

  5. New Strategy of NLP (NLP 2.0)
     • Data collection from the web
     • Domain-specific and open IE
     • Contextual NLP
     • Optimize at the system level, not at the level of the individual component
     • Earlier deployment on the Internet
     • Make the best use of social factors

  6. Our Vision and Task
     Understand users and documents in any language, for any device and any application
     • Advanced NLP technologies
       • Word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization
       • Chinese, Japanese, English
     • Multi-language information access
       • Statistical machine translation
       • Multi-language search
     • Semantic computing
       • Sentiment analysis, event extraction, ontology learning
       • Understanding query intent and document
     • Contextual NLP

  7. MSRA NLP Research Overview (diagram; layout lost in extraction, labels listed in extraction order)
     • Research topics and resources: Translation evaluation, paraphrasing, Tran. know. acquisition, Vertical search, WEB mining for MT, Cross language IR, NLP enriched Indexing and search, SMT, MRD, Balanced corpus, Query-doc relevance, MRD, Bilingual corpus, Parsing lexicon, Tagged corpus, Translation lexicon, Bilingual tagged corpus, Text mining
     • Applications: Chinese IME, English writing wizard, News Search, Comparison Shopping, Japanese IME, Pocket translator, Twitter Search, Chatbot, Query speller, Couplet generation, Resume Routing, General web search
     • Component techs: Text analysis, Machine Translation, Information Extraction, Information Retrieval, Skeleton parser, Meta data extraction, Term extraction, Named entity identification, Annotation tool, POS tagging, Machine learning, SLM
     • Data: NLP (C, J, E), MT (C, J, E), IR and IE (C, J, E)

  8. Research Accomplishments
     • Awards
       • MSRA Best Research Team (2010)
       • Finalist of WSJ Asian Innovation Awards (2010)
       • MS ARD Best Project (Engkoo)
       • MSRA Best Innovation (1998-2008): IME and Chinese couplets
     • Academic impact
       • Best result in NIST 2008 SMT, CWMT 2008 and CWMT 2009
       • Best result in SIGHAN 2006 bakeoff on Chinese word segmentation
       • Best result in cross-language information retrieval in TREC-9, NTCIR-III
       • 40 ACL papers, 9 SIGIR papers, 17 Coling papers (2000-2010)
       • PC Chair, area chair of ACL
     • Collaboration with universities
       • HIT Joint Lab on NLP, Speech and Search; Tsinghua Joint Lab on Media and Network
       • 400 interns in 12 years
       • Summer schools since 2001
       • PhD supervisors at universities

  9. Summer School on Information Extraction (Harbin, June 2005)
     • Cheng Niu: Information extraction
     • Frank Seide: Speech information extraction and search
     • Hwee Tou Ng: Advanced topics of information extraction
     • Chin-Yew Lin: Information extraction for automatic summarization

  10. Projects Based on NLP 2.0
     • Engkoo: Web-based English learning service
       • Data mining from the web
     • Chinese couplets
       • Incorporate users' power into system evolution
     • Semantic analysis and search of micro-blogging
       • Move to SNS, mobile

  11. Engkoo: Parallel data mining from the web. Video: http://video.sina.com.cn/v/b/37417609-1286528122.html

  12. Rapidly Changing Language
     • Approximately 1.5 billion people speak English as a primary, secondary, or business language
     • China: the largest "English-speaking" country, with 250 million English learners and USD 60 billion in annual spending
     • Problem: a living language keeps gaining new words and new meanings
     • Key insight: with billions of translated web pages and sharable repositories of language data growing every day, the Internet holds the sum of human language knowledge

  13. www.engkoo.com
     Major features:
     • Endless Lexicon with Native Definitions
     • Human-Like TTS & Phonetic Search
     • State-of-the-Art Machine Translation (NIST OpenMT Winner)
     • Real-time Interactive Alignment
     Microsoft products: Bing, Office, MSN

  14. Massive Dictionary Mined from the Web

  15. Fresh and Diverse Examples

  16. Advanced Search with Sentence Analysis

  17. Sentences Classification

  18. Learn Contextual Usage with Word Alignment

  19. Learn Contextual Usage with Word Alignment

  20. Learn Contextual Usage with Word Alignment

  21. Hints for Easily Confused Words

  22. Knowledge Mining Pipeline
     Parallel sentence: He could hardly afford to waste that golden time. / 他无法浪费那样的好时光。
     Processing steps:
     • Tokenizing: he could hardly afford to waste that golden time. / 他 无法 浪费 那样 的 好 时光。
     • Skeleton parsing: (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time) / (Tsub~他~浪费) (ModAdv~无法~浪费) (Tobj~浪费~时光) (AdjAttrib~好~时光)
     • Alignment: he(他) could hardly afford to(无法) waste(浪费) that(那样的) golden(好) time(时光)
     Multi-level indexing:
     1. Single word: "he", "could", "hardly", "afford", etc.; "他", "无法", "浪费", etc.
     2. Single word with its POS: "he_Pron", "could_Verb", "hardly_Adv", etc.; "他_Pron", "无法_Adv", "浪费_Verb", etc.
     3. Collocation: "Tsub~he~afford", "Tobj~time~waste", etc.; "Tsub~他~浪费", "ModAdv~无法~浪费", etc.
     Linguistic knowledge:
     1. Idiomatic word usage: Verb~Noun (decline~offer), Verb~Adv (greatly~improve), Adj~Noun (arduous~task), Adv~Adj (extremely~bad)
     2. Paraphrasing: turn_on~light, switch_on~light; laborious~task, hard~task; deeply~moved, deeply~touched
     3. Collocation translations: 订~计划, make~plan; 订~旅馆, book~room; 订~杂志, subscribe to~magazine
     Diagram labels (layout lost in extraction): Web Mining, Mined Data, Linguistic Parsing, Parsed Data, Multi-level Indexing, Indexed Data, Knowledge Mining, Linguistic Knowledge, Machine Translation Model, Paraphrasing Model
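The three indexing levels on slide 22 (single word, word with POS, collocation) can be illustrated with a minimal Python sketch. The function names `index_keys` and `build_index` are hypothetical, the POS tags beyond the three shown on the slide are assumed for illustration, and the analysis is hard-coded where a real system would call a tokenizer, tagger, and skeleton parser.

```python
from collections import defaultdict

# Hand-written analysis of the slide's English sentence. Only he_Pron,
# could_Verb and hardly_Adv appear on the slide; the other tags are assumed.
tokens = ["he", "could", "hardly", "afford", "to", "waste", "that", "golden", "time"]
pos_tags = ["Pron", "Verb", "Adv", "Verb", "Prep", "Verb", "Det", "Adj", "Noun"]
triples = [
    ("Tsub", "he", "afford"),
    ("ModAdv", "hardly", "afford"),
    ("Tobj", "waste", "afford"),
    ("Tobj", "time", "waste"),
    ("AdjAttrib", "golden", "time"),
]

def index_keys(tokens, pos_tags, triples):
    """Return index keys at three levels: word, word_POS, and collocation."""
    keys = list(tokens)                                           # level 1
    keys += [f"{word}_{pos}" for word, pos in zip(tokens, pos_tags)]  # level 2
    keys += ["~".join(triple) for triple in triples]              # level 3
    return keys

def build_index(sentences):
    """Map every key to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sentence_id, (toks, tags, deps) in enumerate(sentences):
        for key in index_keys(toks, tags, deps):
            index[key].add(sentence_id)
    return index

index = build_index([(tokens, pos_tags, triples)])
```

With such an index, a query such as `index["Tobj~time~waste"]` would retrieve example sentences by collocation rather than by surface word alone.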

  23. Chinese Couplets: Incorporating users' power into system evolution

  24. Demo: Chinese Couplets (http://duilian.msra.cn). Video: http://video.sina.com.cn/v/b/10937201-1452530713.html

  25. FS and SS Share the Same Style: Repetition of pronunciations (音韵联)
     FS (first sentence): 风吹荞动桥未动
     SS (second sentence): 水使舟流洲不流
     Character pairs (FS / SS):
     • 风 (wind) / 水 (water)
     • 吹 (blow) / 使 (make)
     • 荞 (buckwheat) / 舟 (ship)
     • 动 (wave) / 流 (go)
     • 桥 (bridge) / 洲 (island)
     • 未 (not) / 不 (not)
     • 动 (wave) / 流 (go)
     In the FS, 荞 and 桥 are both pronounced qiao; in the SS, 舟 and 洲 are both pronounced zhou.

  26. FS and SS Share the Same Style: Decomposition of characters (拆字联)
     FS: 有子有女方称好
     SS: 缺鱼缺羊敢叫鲜
     Character pairs (FS / SS):
     • 有 (have) / 缺 (lack)
     • 子 (son) / 鱼 (fish)
     • 有 (have) / 缺 (lack)
     • 女 (daughter) / 羊 (mutton)
     • 方 (so) / 敢 (dare)
     • 称 (call) / 叫 (call)
     • 好 (good) / 鲜 (fresh)
     Decomposition: 好 = 女 + 子; 鲜 = 鱼 + 羊

  27. FS and SS Share the Same Style: Person name (人名联) and Palindrome (回文联)
     FS: 板桥造桥板
     SS: 东坡居坡东
     Pairs (FS / SS):
     • 板桥 (Banqiao) / 东坡 (Dongpo)
     • 造 (produce) / 居 (live)
     • 桥 (bridge) / 坡 (mountain)
     • 板 (board) / 东 (east)
     Notes:
     • Banqiao (板桥) and Dongpo (东坡) are famous men of letters
     • Each sentence reads the same from top to bottom as from bottom to top

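  28. SS Generation Process
     FS: 海 阔 凭 鱼 跃 (sea wide allow fish jump)
     Stages named on the slide: SMT decoding, linguistic filtering, reranking
     Candidate words and phrases shown: 山 (hill), 高 (high), 虫 (insect), 飞 (fly), 鸟 (bird), 舞 (dance), 任 (permit), 天 (sky), 深 (deep), 虎 (tiger), 鸣 (tweedle), 倚 (depend), 天高 (sky high), 山高 (hill high), 鸟飞 (bird fly), 虎啸 (tiger roar)
     Candidate lists (in extraction order; the assignment of lists to stages is lost):
     • 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山深任鸟飞, 天高任花香, 天高任鸟舞, 山高任花香, ……
     • 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山高靠虎啸, 山高任虎啸, 山深任鸟飞, 天高任花香, ……
     • 天高任鸟飞, 山高任鸟飞, 天高任鸟鸣, 天高任鸟舞, 山深任鸟飞, 山高任花香, 天高任花香, ……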

  29. SS Generation Approach
     • A multi-phase SMT approach
       • Phase 1: a phrase-based log-linear model
       • Phase 2: some linguistic filters
       • Phase 3: a Ranking SVM
     Pipeline: FS input → phrase-based log-linear model → N-best candidates → linguistic filters → Ranking SVM model → SS output
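The three phases on slide 29 can be sketched in Python under heavy assumptions. The slide does not say which linguistic filters are used, so the three checks in `passes_filters` (equal length, no character shared with the FS, matching repetition pattern) are illustrative guesses based on the couplet examples in the earlier slides. Phase 3 here is a hand-weighted scorer standing in for a trained Ranking SVM, and all feature values and weights are made up.

```python
import math

def loglinear_score(features, weights):
    """Phase 1 style score: weighted sum of log feature values."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

def repetition_pattern(sentence):
    """Index of each character's first occurrence, e.g. 'ABCB' -> (0, 1, 2, 1)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, i) for i, ch in enumerate(sentence))

def passes_filters(fs, ss):
    """Phase 2 (assumed filters): same length, no shared characters,
    and characters repeat in the same positions in both sentences."""
    return (
        len(fs) == len(ss)
        and not set(fs) & set(ss)
        and repetition_pattern(fs) == repetition_pattern(ss)
    )

def generate_ss(fs, candidates, weights, rerank_weights):
    """Run the three phases over (ss, features) pairs from a hypothetical decoder."""
    nbest = sorted(candidates, key=lambda c: loglinear_score(c[1], weights), reverse=True)
    kept = [c for c in nbest if passes_filters(fs, c[0])]
    # Phase 3 stand-in: a second weighted scorer, not a real Ranking SVM.
    reranked = sorted(kept, key=lambda c: loglinear_score(c[1], rerank_weights), reverse=True)
    return [ss for ss, _ in reranked]

# Made-up translation-model (tm) and language-model (lm) probabilities.
candidates = [
    ("天高任鸟飞", {"tm": 0.20, "lm": 0.30}),
    ("山高任鸟飞", {"tm": 0.25, "lm": 0.10}),
    ("天高任鱼飞", {"tm": 0.30, "lm": 0.20}),  # shares 鱼 with the FS, so it is filtered out
]
result = generate_ss(
    "海阔凭鱼跃", candidates,
    weights={"tm": 1.0, "lm": 1.0},
    rerank_weights={"tm": 0.5, "lm": 2.0},
)
```

Keeping the filters as a separate phase means hard constraints never have to be traded off against model scores, which may be one reason the slide lists them between decoding and reranking.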

  30. Great Examples
     • FS: 月落乌啼霜满天 / SS: 风吹雁过雨连宵
     • FS: 千江有水千江月 / SS: 万里无云万里星
     • FS: 秦淮河桨声灯影 / SS: 松花江水色月光
     • FS: 此木为柴山山出 (此+木=柴; 山+山=出) / SS: 白水作泉日日昌 (白+水=泉; 日+日=昌)

  31. User Log for Model Enhancement
     • Motivation
       • Training data is not adequate
       • The user log is large (60k/m), growing, and diverse
     • What we log
       • User inputs
       • User-finalized couplets
         • Second sentences selected from the candidates provided by our system
         • User-modified second sentences

  32. User Log Analysis
     • Data source: log from http://couplet.msra.cn
     • Time period: Aug. 31 to Oct. 9, 2006

  33. New Framework with Log Data (diagram; layout lost in extraction)
     • Input and output: first sentence input, second sentence output, user operation
     • Data sources: training data, log data
     • Models (several labels appear twice on the slide): translation model, language model, mutual information
     • Processing: source-channel model, N-best candidates, re-ranking
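Slide 33 names a source-channel model built from a translation model and a language model, followed by re-ranking with mutual information, but gives no formulas. The block below writes out the standard noisy-channel decomposition and one common pointwise form of mutual information. It is an assumption that the slide intends these standard forms; the symbols f_i and s_i, denoting the characters at position i of the FS and SS, are introduced here for illustration.

```latex
% Source-channel (noisy-channel) decoding: choose the second sentence
% that best explains the first sentence under the two models.
\[
\mathrm{SS}^{*}
  = \arg\max_{\mathrm{SS}} P(\mathrm{SS} \mid \mathrm{FS})
  = \arg\max_{\mathrm{SS}}
    \underbrace{P(\mathrm{FS} \mid \mathrm{SS})}_{\text{translation model}}
    \,
    \underbrace{P(\mathrm{SS})}_{\text{language model}}
\]

% Pointwise mutual information between aligned characters, one possible
% re-ranking feature (the slide does not specify how MI is applied).
\[
\mathrm{PMI}(f_i, s_i) = \log \frac{P(f_i, s_i)}{P(f_i)\,P(s_i)}
\]
```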

  34. Twitter Search: Moving to the social Internet and mobile

  35. Twitter Search Architecture (diagram; layout lost in extraction, labels listed in extraction order)
     • A collection of tweets: Tweets, Tweets Cluster, Statistical Relationship Learning, News & Images Link Extraction, Community Extraction, User Influence Measure, Multi-level Indexing, Hot tag and topic Extraction, Popular Tweet Extraction, Semantic Search, Noise Filtering, Top video, music, and artists Extraction
     • Individual tweet: Semantic Role Labeling, Sentiment Analysis, NE Recognition, Dependency Parsing, Co-reference, Sentence Boundary Detection, Text Normalization, Classification
     • Raw Data
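Three of the components named on slide 35 (text normalization, noise filtering, and hot tag extraction) can be chained in a minimal Python sketch. The slide does not describe how any of them work, so the rules below (URL stripping, squeezing repeated characters, a minimum token count, hashtag counting) are assumptions chosen for illustration, and all function names are hypothetical.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")

def normalize(tweet):
    """Assumed normalization: drop URLs, squeeze runs of 3+ identical
    characters down to 2, collapse whitespace, and lowercase."""
    text = URL_RE.sub("", tweet)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return " ".join(text.split()).lower()

def is_noise(text, min_tokens=3):
    """Assumed noise rule: treat very short tweets as noise."""
    return len(text.split()) < min_tokens

def hot_tags(tweets, top_n=3):
    """Count each hashtag once per non-noise tweet and return the most frequent."""
    counts = Counter()
    for tweet in tweets:
        text = normalize(tweet)
        if is_noise(text):
            continue
        counts.update(set(HASHTAG_RE.findall(text)))
    return counts.most_common(top_n)

# Invented sample tweets, used only to exercise the functions.
sample_tweets = [
    "Sooooo excited for #ACL2010",
    "#acl2010 papers look great",
    "lol",
    "Reading about #nlp and #ACL2010 today",
]
top = hot_tags(sample_tweets)
```

Normalizing before counting lets case variants such as #ACL2010 and #acl2010 fall into one bucket, which is why the per-tweet stages sit below the collection-level stages in the slide's layering.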

  36. Conclusion
     • Internet trends and their impact on NLP
     • NLP 2.0 strategy
       • Web data mining: Engkoo
       • Users' power: Couplets
       • SNS and mobile: Twitter search
