Why should I care about Computational Linguistics & Language Processing?

Presentation Transcript


  1. Why should I care about Computational Linguistics & Language Processing? Hsiao-Wuen Hon 洪小文 Assistant Managing Director Microsoft Research Asia

  2. Should I care? • Medical school • “Golden rice bowl” (金饭碗, a secure job for life) • Electronics • Stock bonuses (配股) • Easy way to become a millionaire • Chip manufacturing • TSMC, UMC • Hardware • Acer, Quanta, Hon Hai (鸿海), BenQ, Inventec (英业达), MiTac • NLP? Speech? IR? HWR?

  3. It is actually a good choice • People go on to have good careers • Many applications • IR, HWR • Investment banks • Bioinformatics • ….. • With many smart people • Software Industry cares • Not overproducing students

  4. Industry Cares • People you might know • Academics • Pillars of A.I. • Well funded • Taiwan professors • Overseas professors • V. Zue, B.H. Juang, F. Jelinek, M. Liberman, N. Chomsky, Michael Collins, Fernando Pereira …

  5. Industry Cares • Industrial R&D Labs • Executives • Kai-Fu Lee (MS), Qi Lu (Yahoo), … • Microsoft • X.D. Huang, Hsiao-Wuen Hon (洪小文), Wei-Ying Ma (马维英), Eric Chang, Ming Zhou (周明), Eric Brill, Ken Church, … • Continues hiring • Google • Speech – Amit Singhal, Michael Riley, … • NL – Franz Och, Krishna Bharat, Dekang Lin, … • Aggressively hiring • Others…

  6. Industry Cares • Other applications • Renaissance Technologies • Hedge fund management – 4 billion in assets • Time-series prediction based on S&L (speech & language) technologies • a.k.a. the ex-IBM S&L group • P. Brown, R. Mercer, P. De Souza, L. Bahl, Della Pietra brothers, … • Startups • Nuance, SpeechWorks, InfoTalk, iPhrase, Lexicus, …

  7. Microsoft Cares • Bill Gates’ vision • PC on everyone’s desktop (’75) • Information at your fingertips (’90) • Seamless Computing (’03) • S&L technologies are the key • Billions of $ invested in S&L technologies • Full-size S&L product & research groups • Multi-lingual & multi-product • Continued hiring • Expanded investment due to search/Google

  8. Information Agent • “Do what I mean” • “Find what I want” • How to turn on Firewall in Windows? • Speech recognition • Signal to text • Natural language understanding • Syntax/semantics • Domain knowledge • Knowledge search • AI-Complete

  9. A Long Long Journey • Speech • Ubiquitous interface • Automatic Speech Recognition • Text-to-Speech • Natural Language • Spelling/grammar/style checking • IME • Machine translation • Information Retrieval & Mining

  10. Speech • SAPI 1.0 – 6.0 • Windows Sound System in ’92 • Platform for building speech apps in Windows • Accessibility support (Screen Reader) • Office Dictation • Chinese, English • Microsoft Speech Server • Telephony speech & multimodal platform • Other – Encarta, WinCE/Smartphone…

  11. Speech [Chart: Human Error Rate vs. Machine Error Rate and Log (Machine Error Rate); y-axis 0% to 30%, x-axis 1993 to 2011]

  12. MSRA Speech [Figure labels: Mulan, AT&T, Loquendo, Elan Speech] • TTS – multi-lingual natural TTS • ASR • Chinese LVCSR - dictation/telephony/embedded • Fundamental research • AIME: Audio Info. Management & Extraction • Audio/video file indexing/retrieval • Offline transcription/extraction/summarization • More in Eric’s keynote tomorrow • From the Lab to Ubiquity: Speech Technology's Road to Mainstream

  13. NLP Contributes to MS Products • IME (Chinese, Japanese, …) • Spelling/grammar checking • Spam filtering • English Writing Wizard (EWW) • Spoken language interface • IR and CLIR • Text mining • Machine translation • Search engine • QA (AskMSR) • SLM for Speech • Text analysis for TTS • …..

  14. NLP “Rainbow” [Diagram: an Analysis side rising from Source Text and a Generation side descending to Target Text, each labeled Morphology, Dictionary, Syntax, Logical Form, Discourse, with Understanding / Knowledge base at the top; other labels: Transfer, Word Breaking, Grammar Checking, Machine Translation]

  15. NLP at MSRA [Diagram] • Applications: Chinese IME, English writing wizard, Enterprise search, Japanese IME, Pocket translator, SQL Text Mining, Spelling check, Extended TM, Resume routing • NLP Research: Machine Translation, Information Extraction, Information Retrieval • Components (grouping unlabeled in the transcript): Metadata extraction, Skeleton parser, Term extraction, Named entity identification, Annotation tool, POS tagging, Machine learning, SLM • Linguistic Resources: Monolingual resources (C, J, E), Bilingual resources (C, E), Special purpose • Other labels: Translation evaluation, Paraphrasing, Tran. know. acquisition, Web retrieval, Shallow MT, Cross language IR, Indexing, EBMT & SMT, MRD, Balanced corpus, QMapping, Bilingual corpus, Parsing lexicon, Tagged corpus, Translation lexicon, Bilingual tagged corpus

  16. NLP at MSRA • TIME • Email Routing • Spam filtering • Resume routing • Support routing • EWW • Translation

  17. TIME System TIME Platform • Text Information Management & Extraction • Goal: extract information from text data • genres: email, newspaper, report, web pages • formats: Word document, PDF/PS, HTML/XML • languages: English, Chinese, Japanese, … • Applications: search, question answering, data mining, machine translation

  18. TIME Components • Linguistic processing → TIME linguistic platform • Text normalization: sentence splitting, tokenization, morphological analysis • Entity extraction: person name, company name, time expression, phrases • Relation learning: syntactic/semantic dependencies between entities • Information extraction • Document property extraction: title, author, key term, summary • Domain knowledge extraction: concept, concept relation, glossary, taxonomy, event • Cross-lingual information exchange • Translation at word, entity, term, skeleton, text levels • Reading, writing, cross language information retrieval
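
The text normalization step on this slide (sentence splitting, then tokenization) can be sketched as follows. This is a minimal illustration under simple assumptions, not the TIME platform's code: the function names `split_sentences` and `tokenize` are hypothetical, and the regular expressions only handle whitespace-delimited languages such as English.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


if __name__ == "__main__":
    for sent in split_sentences("He left early. She stayed!"):
        print(tokenize(sent))
```

A rule this simple also splits after abbreviations such as "Jr.", and it does not apply to Chinese or Japanese, where there are no spaces between words; that gap is why word breaking is treated as a separate problem elsewhere in the deck.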

  19. TIME Demo

  20. Multi-lingual linguistic unit processing • Word • Tokenization • Named entity recognition (NER) • POS • Sentence • Chunking (VP/NP) • Source-channel models:
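
The formula on this slide did not survive in the transcript. For orientation only, the generic textbook form of a source-channel (noisy-channel) model is given below; the symbols $W$ (hidden unit sequence, such as words or tags) and $O$ (observed input) are assumptions and may not match the notation or the exact model on the original slide.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O)
        \;=\; \arg\max_{W} \underbrace{P(W)}_{\text{source model}} \,
              \underbrace{P(O \mid W)}_{\text{channel model}}
```

The second equality follows from Bayes' rule, because $P(O)$ does not depend on $W$ and can be dropped from the maximization.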

  21. TIME (linguistic unit processing)

  22. Chinese Tokenization & NEI

  23. English Chunking and POS Tagging

  24. English Chunking and POS Tagging

  25. Skeleton Parser • Skeleton == <subject V object> • Input: He is succeeded by Ivan Allen Jr. • Output: [He]/Obj is succeeded by [Ivan Allen Jr.]/Sub • More robust & faster than a traditional parser • Adequate for most applications • Collocation checking, Spell checking, Grammar checking, QA, Search

  26. Skeleton Parser • Key Dependency Relations • A set of the most important relations (e.g. subject, object…) • Definition based on application • Our Target: A Robust & Fast Dependency Extractor • Does not rely on high-quality (hand-annotated) training data • Highly efficient on large-scale data (e.g. web data) • Potential Applications • Information Extraction, Q/A, TDT • Who (Subject-Verb), Whom (Verb-Object), What (Adj-Noun) • Machine translation • Skeleton translation • NL-based Information Retrieval • Cross-Language IR • Re-ranking by triple matching
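
To make the <subject V object> skeleton concrete, here is a small sketch that reads a triple off an already chunked sentence. It is a hand-written heuristic for illustration, not the trained extractor the slides describe: the `Chunk` type, the `skeleton` function, and the rule for detecting a passive ("be ... by") are all assumptions.

```python
from typing import NamedTuple, Optional

BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}


class Chunk(NamedTuple):
    label: str  # chunk type, e.g. "NP" or "VP"
    text: str


def skeleton(chunks: list[Chunk]) -> Optional[tuple[str, str, str]]:
    """Return (subject, verb, object) for the first NP-VP-NP pattern, if any."""
    for i in range(len(chunks) - 2):
        left, verb, right = chunks[i], chunks[i + 1], chunks[i + 2]
        if (left.label, verb.label, right.label) != ("NP", "VP", "NP"):
            continue
        words = verb.text.split()
        # Assumed passive pattern: a form of "be", a participle, then "by".
        is_passive = (
            len(words) >= 3
            and words[0].lower() in BE_FORMS
            and words[-1].lower() == "by"
        )
        if is_passive:
            # In a passive clause the surface subject is the logical object.
            return (right.text, " ".join(words[1:-1]), left.text)
        return (left.text, verb.text, right.text)
    return None


if __name__ == "__main__":
    sentence = [
        Chunk("NP", "He"),
        Chunk("VP", "is succeeded by"),
        Chunk("NP", "Ivan Allen Jr."),
    ]
    print(skeleton(sentence))
```

On the slide's example the sketch assigns "Ivan Allen Jr." as subject and "He" as object, matching the Sub/Obj labels in the slide's output.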

  27. Proposed approach [Flow diagram] • Training: Raw corpus → NLPWin Parser → Parsed corpus → Noise Filtering → Training Data → Training • Run time: Input Sentence → PoS Tagging → Chunking → Shallow Parser → Key Dependency Triples

  28. The proposed approach [Flow diagram] • Training: Raw corpus → NLPWin Parser → Parsed corpus → Noise Filtering → Training Data → Training • Run time: Input Sentence → PoS Tagging → Chunking → Shallow Parser → Key Dependency Triples

  29. The proposed approach [Flow diagram] • Training: Raw corpus → NLPWin Parser → Parsed corpus → Noise Filtering → Training Data → Training • Run time: Input Sentence → PoS Tagging → Chunking → Feature Extraction → Classification → Key Dependency Triples

  30. Skeleton Parser

  31. Skeleton Parser

  32. Term Extraction [Flow diagram] • Text → Candidate Generation → Terms → Ranking → Term List • Candidate Generation options: Boundary determination, BaseNP, Pattern filtering • Ranking options: Term frequency, TF-IDF, Entropy reduction, ER-IDF
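
The ranking stage with the TF-IDF option can be sketched as below. This assumes candidate generation has already produced a list of candidate terms per document; the function name `rank_terms` and the particular TF-IDF variant (relative frequency times the natural log of inverse document frequency) are illustrative choices, not necessarily the ones used at MSRA.

```python
import math
from collections import Counter


def rank_terms(docs: list[list[str]], target: int) -> list[tuple[str, float]]:
    """Rank the candidate terms of docs[target] by TF-IDF, highest first."""
    n_docs = len(docs)
    # Document frequency: in how many documents each candidate appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[target])
    total = len(docs[target])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    candidates = [
        ["skeleton parser", "skeleton parser", "search engine"],
        ["search engine", "spam filter"],
        ["search engine"],
    ]
    for term, score in rank_terms(candidates, target=0):
        print(f"{term}\t{score:.3f}")
```

A candidate that occurs in every document, like "search engine" here, gets a score of zero, which is the behavior that makes TF-IDF prefer document-specific terms over raw term frequency.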

  33. Term Extraction

  34. Term Extraction

  35. Text Mining Roadmap • [Diagram labels: Information Desk, Meta Data for SharePoint, SQL Text Mining, Text Miner] • Key technologies • Metadata extraction • Ranking algorithm • Multi-language support

  36. Information Desk • http://msra-nlc-tm1

  37. http://msra-nlc-tm1/

  38. Machine Translation Roadmap • Direction • Template based • Linguistic data acquisition from Web mining • [Diagram labels: TIME, Search Engine, Office, EWW, Mobility] • Key technologies • Skeleton parser • Collocation checker • Paraphrase • Knowledge acquisition • Adaptive to new language pairs

  39. EWW (English Writing Wizard) • Objectives • Make your English writing as good as a native speaker’s • Features • Idiomatic usages • Synonymous collocation • Collocation translations • Bilingual example sentences • Technology Highlights • Auto extraction of idiomatic usage • Auto extraction of synonymous collocation • Auto extraction of collocation translations • Example sentence retrieval • Example: Idiomatic Usage (input: question) • question (Noun) • Verb+question: raise ~, ask ~, resolve ~, pose ~ • Adj+question: unanswered ~, serious ~, big ~, real ~ • question (Verb) • question+Noun: ~ motive, ~ value, ~ truth, ~ boy • question+Adv: ~ intensely, ~ orally, ~ closely, ~ at_all • Adv+question: privately ~, cautiously ~, hardly ~ • Example: Synonymous Collocation • attain~dobj~level → achieve~dobj~level • attract~dobj~fan → draw~dobj~fan • take~dobj~reins → assume~dobj~reins | hold~dobj~reins • bad~Intnsifs~extremely, risky~Intnsifs~extremely • unusual~Intnsifs~quite → unusual~Intnsifs~rather • vision~Attrib~unusual, sight~Attrib~unusual • Improve~Mod~greatly → Improve~Mod~considerably • Example: Collocation Translation • 克服~困难 (“overcome” ~ “difficulty”): conquer difficulty, overcome difficulty, master~difficulty, overcome~adversity, surmount~difficulty
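
The `word~relation~word` notation in the synonymous collocation examples can be read mechanically. The sketch below parses such triples and groups first words that share the same relation and second word, which surfaces pairs like attain/achieve as candidates. The slides do not state how EWW extracts synonymous collocations, so the grouping rule and the names `parse_triple` and `group_by_context` are assumptions for illustration.

```python
from collections import defaultdict


def parse_triple(triple: str) -> tuple[str, str, str]:
    """Split 'attain~dobj~level' into ('attain', 'dobj', 'level')."""
    first, relation, second = triple.split("~")
    return first, relation, second


def group_by_context(triples: list[str]) -> dict[tuple[str, str], list[str]]:
    """Group first words that occur with the same (relation, second word)."""
    groups: dict[tuple[str, str], set[str]] = defaultdict(set)
    for triple in triples:
        first, relation, second = parse_triple(triple)
        groups[(relation, second)].add(first)
    # Keep only contexts shared by at least two distinct words.
    return {ctx: sorted(words) for ctx, words in groups.items() if len(words) > 1}


if __name__ == "__main__":
    observed = [
        "attain~dobj~level",
        "achieve~dobj~level",
        "attract~dobj~fan",
        "draw~dobj~fan",
        "take~dobj~reins",
    ]
    print(group_by_context(observed))
```

Sharing one context is only a weak signal; a real system would need further evidence (for example a thesaurus or translation data) before treating two collocations as synonymous.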

  40. Web Search & Mining • Internet + Data + Information -> Search, Mining, Sharing, & Intelligence • Lots of text • Text-based IR • Text Mining • Semantic/Structure Mining • Media Search • Surrounding text • Audio/video transcription • Make Billions of $ from trillions of words

  41. Information Retrieval • Text Processing • Tokenization • Normalization – stemming, … • Precision/Recall • Beyond 1st order statistics (TF-IDF) • N-gram for adaptive indexing • Better model of P(Doc|Query) • Classification vs. term frequency • Result Summarization • Query sensitive • U盘 (优盘, “USB flash drive”) vs. 大拇哥 (“thumb drive”) • Result clustering & classification
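
Precision and recall, listed on this slide, have standard definitions that a few lines of code make explicit. The function name `precision_recall` and the document IDs in the example are made up for illustration.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant (0.0 if undefined)."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
    print(f"precision={p:.2f} recall={r:.2f}")
```

In the example, two of the four retrieved documents are relevant and two of the three relevant documents are retrieved, so improving one measure without hurting the other is the practical goal behind the items that follow on the slide.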

  42. Search Long Result List • A user searches for information about “jaguar”, a Mac OS version • However, the relevant results are mixed with other pages • The user needs to go through a long list to find the desired information

  43. Clustering vs. Classification • [Screenshot: Clustering Results for “jaguar”] • [Screenshot: Classification Results for “jaguar”]
