NLP Research at Internet Age An Overview of NLP at Microsoft Research Asia

Ming Zhou
Manager of Natural Language Group
Microsoft Research Asia

Trends of Internet Services
• Ecosystem for working with third-party apps
  • Apple Apps, Facebook, Twitter, Baidu, Sina, QQ
• Real-time content collection and search
  • Twitter, Facebook, Del.icio.us, NYT, YouTube
• Mobile search
  • Contextual intent understanding
  • Towards decision making and action taking
• Social power
  • Social tags (like) for general search engines
  • Search engines in SNS
  • Social QA

Impact and Challenge to NLP Research
• Impact
  • Biggest database ever: connects data
  • Biggest social network: connects people
  • Harnessing collective intelligence
  • Contextual information processing: user, user's social network, location, time
  • Real-time information processing: collection, indexing, operation without delay
• Challenge
  • How to leverage data, people, and contextual information to achieve real-time information processing?

Problems of Traditional NLP Approaches (NLP 1.0)
• Deep work on individual component technologies, which have reached their upper bounds
• Little consideration of scenarios, user needs, and market needs
• Serious data sparseness when relying on human annotation
• Evaluation bottleneck
• Slow deployment
• Lack of an effective framework for incorporating user feedback

New Strategy of NLP (NLP 2.0)
• Data collection from the web
• Domain-specific and open IE
• Contextual NLP
• Optimize at the system level rather than at the individual component level
• Earlier deployment on the Internet
• Make the best use of social factors

Our Vision and Task
Understand users and documents in any language, for any device and any application
• Advanced NLP technologies
  • Word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization
  • Chinese, Japanese, English
• Multi-language information access
  • Statistical machine translation
  • Multi-language search
• Semantic computing
  • Sentiment analysis, event extraction, ontology learning
  • Understanding query intent and documents
  • Contextual NLP

MSRA NLP Research Overview
[Diagram; labels listed in extraction order, grouped under the layer name they follow]
• Research topics and resources: Translation evaluation, paraphrasing, Tran. know. acquisition, Vertical search, WEB mining for MT, Cross language IR, NLP enriched Indexing and search, SMT, MRD, Balanced corpus, Query-doc relevance, MRD, Bilingual corpus, Parsing lexicon, Tagged corpus, Translation lexicon, Bilingual tagged corpus, Text mining
• Applications: Chinese IME, English writing wizard, News Search, Comparison Shopping, Japanese IME, Pocket translator, Twitter Search, Chatbot, Query speller, Couplet generation, Resume Routing, General web search
• Component techs: Text analysis, Machine Translation, Information Extraction, Information Retrieval, Skeleton parser, Meta data extraction, Term extraction, Named entity identification, Annotation tool, POS tagging, Machine learning, SLM
• Data: NLP (C, J, E), MT (C, J, E), IR and IE (C, J, E)

Research Accomplishment
• Awards
  • MSRA Best Research Team (2010)
  • Finalist of WSJ Asian Innovation Awards (2010)
  • MS ARD Best Project (Engkoo)
  • MSRA Best Innovation (1998-2008): IME and Chinese couplets
• Academic impact
  • Best result in NIST 2008 SMT, CWMT 2008 and CWMT 2009
  • Best result in SIGHAN 2006 bakeoff on Chinese word segmentation
  • Best result in cross-language information retrieval in TREC-9, NTCIR-III
  • 40 ACL papers, 9 SIGIR papers, 17 Coling papers (2000-2010)
  • PC Chair, area chair of ACL
• Collaboration with universities
  • HIT Joint Lab on NLP, Speech and Search; Tsinghua Joint Lab on Media and Network
  • 400 interns in 12 years
  • Summer schools since 2001
  • PhD supervisors at universities

Summer School on Information Extraction (Harbin, June 2005)
• Cheng Niu: Information extraction
• Frank Seide: Speech information extraction and search
• Hwee Tou Ng: Advanced topics of information extraction
• Chin-Yew Lin: Information extraction for automatic summarization

Projects Based on NLP 2.0
• Engkoo: Web-based English learning service
  • Data mining from the web
• Chinese couplets
  • Bring the power of users into system evolution
• Semantic analysis and search of micro-blogging
  • Move to SNS and mobile

Engkoo: Parallel data mining from the web
Video: http://video.sina.com.cn/v/b/37417609-1286528122.html

Rapidly Changing Language
• Approximately 1.5 billion people speak English as a primary, secondary or business language
• China: the largest “English-speaking” country, with 250 million English learners and USD 60 billion in annual expenses
• Problem: a living language, with new words and new meanings
Key insight: With billions of translated web pages and sharable repositories of language data growing every day, the Internet holds the sum of human language knowledge

www.engkoo.com
Major Features:
• Endless Lexicon with Native Definitions
• Human-Like TTS & Phonetic Search
• State-of-the-Art Machine Translation (NIST OpenMT Winner)
• Real-time Interactive Alignment
Microsoft Products:
• Bing
• Office
• MSN

Massive Dictionary Mined from the Web

Fresh and Diverse Examples

Advanced Search with Sentence Analysis

Sentence Classification

Learn Contextual Usage with Word Alignment

Hints for Easily Confused Words

Mined linguistic knowledge:
1. Idiomatic word usage
  • Verb~Noun (decline~offer)
  • Verb~Adv (greatly~improve)
  • Adj~Noun (arduous~task)
  • Adv~Adj (extremely~bad)
2. Paraphrasing
  • turn_on~light, switch_on~light
  • laborious~task, hard~task
  • deeply~moved, deeply~touched
3. Collocation translations
  • 订~计划, make~plan
  • 订~旅馆, book~room
  • 订~杂志, subscribe to~magazine

Knowledge Mining Pipeline
Parallel sentence: He could hardly afford to waste that golden time. 他无法浪费那样的好时光。
• Tokenizing: he could hardly afford to waste that golden time. 他无法浪费那样的好时光。
• Skeleton parsing: (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time) (Tsub~他~浪费) (ModAdv~无法~浪费) (Tobj~浪费~时光) (AdjAttrib~好~时光)
• Alignment: he(他) could hardly afford to(无法) waste(浪费) that(那样的) golden(好) time(时光)
Multi-level index entries:
1. Single word
  • “he”, “could”, “hardly”, “afford” etc.
  • “他”, “无法”, “浪费” etc.
2. Single word with its POS
  • “he_Pron”, “could_Verb”, “hardly_Adv” etc.
  • “他_Pron”, “无法_Adv”, “浪费_Verb” etc.
3. Collocation
  • “Tsub~he~afford”, “Tobj~time~waste” etc.
  • “Tsub~他~浪费”, “ModAdv~无法~浪费” etc.
[Diagram labels] Web Mining, Linguistic Parsing, Knowledge Mining, Multi-level Indexing; Mined Data, Parsed Data, Indexed Data, Linguistic Knowledge; Machine Translation Model, Paraphrasing Model
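
The three index levels on this slide (single word, word with POS, collocation) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names `index_keys` and `build_index`, the in-memory dictionary, and the POS tags not shown on the slide (for example `afford_Verb`) are hypothetical, not Engkoo's actual implementation.

```python
from collections import defaultdict

def index_keys(tokens, pos_tags, triples):
    """Build the three levels of index keys for one sentence.

    triples are (relation, dependent, head) tuples, written on the slide
    as relation~dependent~head, e.g. Tobj~time~waste.
    """
    words = [t.lower() for t in tokens]
    word_pos = [f"{w}_{p}" for w, p in zip(words, pos_tags)]
    collocations = [f"{rel}~{dep}~{head}" for rel, dep, head in triples]
    return {"word": words, "word_pos": word_pos, "collocation": collocations}

def build_index(sentences):
    """Map (level, key) to the set of sentence ids containing that key."""
    index = defaultdict(set)
    for sid, (tokens, pos_tags, triples) in enumerate(sentences):
        for level, keys in index_keys(tokens, pos_tags, triples).items():
            for key in keys:
                index[(level, key)].add(sid)
    return index

# The English side of the slide's parallel sentence.
# POS tags beyond he/could/hardly are assumed for illustration.
tokens = ["he", "could", "hardly", "afford", "to", "waste", "that", "golden", "time"]
pos_tags = ["Pron", "Verb", "Adv", "Verb", "Prep", "Verb", "Det", "Adj", "Noun"]
triples = [
    ("Tsub", "he", "afford"),
    ("ModAdv", "hardly", "afford"),
    ("Tobj", "waste", "afford"),
    ("Tobj", "time", "waste"),
    ("AdjAttrib", "golden", "time"),
]
index = build_index([(tokens, pos_tags, triples)])
```

A lookup such as `index[("collocation", "Tobj~time~waste")]` then returns the ids of sentences in which "time" is the object of "waste", which is the kind of query a collocation-level index supports.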

Chinese Couplets: Bringing the power of users into system evolution

Demo: Chinese Couplets (http://duilian.msra.cn)
Video: http://video.sina.com.cn/v/b/10937201-1452530713.html

FS (first sentence) and SS (second sentence) Share the Same Style
Repetition of pronunciations (音韵联)
FS: 风吹荞动桥未动
SS: 水使舟流洲不流
风 (wind) ---- 水 (water)
吹 (blow) ---- 使 (make)
荞 (buckwheat) ---- 舟 (ship)
动 (wave) ---- 流 (go)
桥 (bridge) ---- 洲 (island)
未 (not) ---- 不 (not)
动 (wave) ---- 流 (go)
荞 and 桥 are both pronounced qiáo; 舟 and 洲 are both pronounced zhōu.

FS and SS Share the Same Style
Decomposition of characters (拆字联)
FS: 有子有女方称好
SS: 缺鱼缺羊敢叫鲜
有 (have) ---- 缺 (lack)
子 (son) ---- 鱼 (fish)
有 (have) ---- 缺 (lack)
女 (daughter) ---- 羊 (mutton)
方 (so) ---- 敢 (dare)
称 (call) ---- 叫 (call)
好 (good) ---- 鲜 (fresh)
好 = 女 + 子; 鲜 = 鱼 + 羊

FS and SS Share the Same Style
Person name (人名联), Palindrome (回文联)
FS: 板桥造桥板
SS: 东坡居坡东
板桥 (Banqiao) ---- 东坡 (Dongpo)
造 (produce) ---- 居 (live)
桥 (bridge) ---- 坡 (mountain)
板 (board) ---- 东 (east)
• Banqiao (板桥) and Dongpo (东坡) are famous literary figures
• Each sentence reads the same from top to bottom as from bottom to top

SS Generation Process
FS: 海阔凭鱼跃 (Sea wide allow fish jump)
Steps shown: SMT decoding, Linguistic filtering, Reranking
Translation options shown: 山 hill, 高 high, 虫 insect, 飞 fly, 鸟 bird, 舞 dance, 任 permit, 天 sky, 深 deep, 虎 tiger, 鸣 tweedle, 倚 depend, 天高 sky high, 鸟飞 bird fly, 虎啸 tiger roar, 山高 hill high
Candidate lists shown (in extraction order):
• 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山深任鸟飞, 天高任花香, 天高任鸟舞, 山高任花香 ……
• 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山高靠虎啸, 山高任虎啸, 山深任鸟飞, 天高任花香 ……
• 天高任鸟飞, 山高任鸟飞, 天高任鸟鸣, 天高任鸟舞, 山深任鸟飞, 山高任花香, 天高任花香 ……

SS Generation Approach
• A multi-phase SMT approach
  • Phase 1: a phrase-based log-linear model
  • Phase 2: some linguistic filters
  • Phase 3: a Ranking SVM
[Diagram] FS input -> Phrase-based log-linear model -> N-best candidates -> Linguistic filters -> Ranking SVM model -> SS output
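
The three phases above can be sketched as follows. This is a minimal illustration, not the system's code: the N-best list is assumed to come from some decoder as `(ss, features)` pairs, the feature names `tm` and `lm` are hypothetical, and the two filters (equal length, matching repetition pattern) are assumed examples of linguistic filters rather than the slide's actual filter set. Phase 3 uses a plain linear scorer, which is the form a trained linear Ranking SVM takes at prediction time; no SVM training is shown.

```python
def loglinear_score(features, weights):
    """Phase 1: log-linear model score, sum_i lambda_i * h_i(FS, SS)."""
    return sum(weights[name] * value for name, value in features.items())

def repetition_pattern(sentence):
    """Replace each character by the position of its first occurrence."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, i) for i, ch in enumerate(sentence))

def passes_filters(fs, ss):
    """Phase 2: two illustrative filters (assumed, not the slide's exact list)."""
    return len(fs) == len(ss) and repetition_pattern(fs) == repetition_pattern(ss)

def generate_ss(fs, nbest, weights, rank_weights, rank_features):
    """Run the three phases over an N-best list of (ss, features) pairs."""
    # Phase 1: order decoder candidates by log-linear score.
    scored = sorted(nbest, key=lambda c: loglinear_score(c[1], weights), reverse=True)
    # Phase 2: drop candidates that violate the filters.
    kept = [ss for ss, _ in scored if passes_filters(fs, ss)]

    # Phase 3: rerank with a linear scorer w . phi(FS, SS).
    def rank_score(ss):
        return sum(w * x for w, x in zip(rank_weights, rank_features(fs, ss)))

    return sorted(kept, key=rank_score, reverse=True)
```

For the FS 千江有水千江月, the repetition filter keeps a candidate such as 万里无云万里星, whose first two characters recur in positions five and six exactly as in the FS, and rejects one such as 万里无云千里星.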

Great Examples
• FS: 月落乌啼霜满天
  SS: 风吹雁过雨连宵
• FS: 千江有水千江月
  SS: 万里无云万里星
• FS: 秦淮河桨声灯影
  SS: 松花江水色月光
• FS: 此木为柴山山出 (此+木=柴; 山+山=出)
  SS: 白水作泉日日昌 (白+水=泉; 日+日=昌)

User Log for Model Enhancement
• Motivation
  • Training data is not adequate
  • The user log is large (60k/m), growing, and diverse
• What we log
  • User inputs
  • User-finalized couplets
  • Second sentences selected from the candidates provided by our system
  • User-modified second sentences
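
As a rough illustration of how the logged items could become extra training pairs, the sketch below defines a hypothetical record type and extraction rule. The class name `CoupletLogEntry`, its fields, and the rule "prefer the user-modified SS over the selected one" are assumptions made for this example; the slides do not describe the actual log schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class CoupletLogEntry:
    first_sentence: str                 # user input (FS)
    candidates: List[str]               # SS candidates shown by the system
    selected_ss: Optional[str] = None   # SS the user picked from the candidates
    modified_ss: Optional[str] = None   # SS after the user edited it, if any

def training_pair(entry: CoupletLogEntry) -> Optional[Tuple[str, str]]:
    """Return an (FS, SS) pair, preferring the user-modified SS."""
    ss = entry.modified_ss or entry.selected_ss
    # Skip entries without a final SS or with mismatched lengths.
    if ss is None or len(ss) != len(entry.first_sentence):
        return None
    return (entry.first_sentence, ss)

def collect_pairs(log: List[CoupletLogEntry]) -> List[Tuple[str, str]]:
    """Collect all usable (FS, SS) pairs from a list of log entries."""
    pairs = [training_pair(entry) for entry in log]
    return [p for p in pairs if p is not None]
```

Pairs collected this way could be added to the couplet training data or used to estimate separate models from log data.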

User Log Analysis
• Data source: log from http://couplet.msra.cn
• Time period: Aug. 31 to Oct. 9, 2006

New Framework with Log Data
[Diagram] Main flow: First sentence input -> Source-channel model -> N-best candidates -> Re-ranking -> Second sentence output -> User operation
Other labels: Translation model (twice), Language model (twice), Mutual information (twice), Training data, Log data
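
For reference, the standard source-channel (noisy-channel) decomposition that the diagram's translation-model and language-model boxes suggest is written out below, together with the usual definition of pointwise mutual information. The symbols F (first sentence), S (second sentence), and the character variables x and y are notation chosen here; the slides do not give the formulas, nor do they say exactly how mutual information enters the re-ranking step.

```latex
% Source-channel model: choose the second sentence S for a first sentence F
\hat{S} = \arg\max_{S} P(S \mid F)
        = \arg\max_{S} \underbrace{P(F \mid S)}_{\text{translation model}}
                       \,\underbrace{P(S)}_{\text{language model}}

% Pointwise mutual information between two characters (or words) x and y
I(x; y) = \log \frac{P(x, y)}{P(x)\,P(y)}
```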

Twitter Search: Moving to the social Internet and mobile

[Diagram; labels listed in extraction order, grouped under the level name they follow]
• A collection of tweets: Tweets, Tweets Cluster, Statistical Relationship Learning, News & Images Link Extraction, Community Extraction, User Influence Measure, Multi-level Indexing, Hot tag and topic Extraction, Popular Tweet Extraction, Semantic Search, Noise Filtering, Top video, music, artists Extraction
• Individual tweet: Semantic Role Labeling, Sentiment Analysis, NE Recognition, Dependency Parsing, Co-reference, Sentence Boundary Detection, Text Normalization, Classification
• Raw Data

Conclusion
• Internet trends and their impact on NLP
• NLP 2.0 strategy
  • Web data mining: Engkoo
  • Users' power: Couplets
  • SNS and mobile: Twitter search
