
Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Multilingual and Crosslingual Information System. Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering National Cheng Kung University 2014/02/17. Contact Information. Room: 4261, Monday 09:10 - 12:00 AM Instructor: Prof. Wen-Hsiang Lu ( 盧文祥 ) Office: 4216



Presentation Transcript


  1. Multilingual and Crosslingual Information System Wen-Hsiang Lu (盧文祥) Department of Computer Science and Information Engineering National Cheng Kung University 2014/02/17

  2. Contact Information • Room: 4261, Monday 09:10 - 12:00 • Instructor: Prof. Wen-Hsiang Lu (盧文祥) • Office: 4216 • Office hours: Monday 12:10 - 2:10 PM • Phone: 62545 • Web page: http://myweb.ncku.edu.tw/~whlu/mis.htm • Email: whlu@mail.ncku.edu.tw • Teaching assistant: 王廷軒 • Email: playif@gmail.com

  3. Course Grading • Class participation/presentation: 30% • Tests: 25% • Project: 25% • Homework: 20%

  4. Source Textbooks • Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999. (全華科技圖書: 02-23717725) • Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2000. • James Allen, Natural Language Understanding, Benjamin/Cummings Publishing Co., 1995. • Gregory Grefenstette, Cross-Language Information Retrieval, Kluwer, 1998. • Jean Véronis, Parallel Text Processing: Alignment and Use of Translation Corpora, Kluwer, 2000.

  5. Other Useful Sources (1) • Reference Books • Charniak, E. Statistical Language Learning. • Cover, T. M., Thomas, J. A. Elements of Information Theory. • Jelinek, F. Statistical Methods for Speech Recognition. • Major Conferences • ACL (Association for Computational Linguistics) • COLING (International Conference on Computational Linguistics) • HLT (Human Language Technology Conference) • IJCNLP (International Joint Conference on Natural Language Processing) • Journals • Computational Linguistics • Natural Language Engineering • TALIP (ACM Transactions on Asian Language Information Processing) • TSLP (ACM Transactions on Speech and Language Processing)

  6. Other Useful Sources (2) • Resource URLs • http://www.aclclp.org.tw/res_other_c.php (Association for Computational Linguistics and Chinese Language Processing, 中華民國計算語言學學會) • http://nlp.stanford.edu/software/index.shtml (Stanford NLP Group) • http://www.phontron.com/nlptools.php (Graham Neubig) • Tools/Software • Online Dictionaries • WordNet: http://wordnet.princeton.edu/ • HowNet: http://www.keenage.com/html/c_index.html • The Academia Sinica Bilingual Ontological Wordnet (BOW): http://bow.sinica.edu.tw/

  7. CKIP (Chinese Knowledge and Information Processing group, Academia Sinica; 中研院詞庫小組) • Parser: http://140.109.19.112/main.exe?id=6833 • POS (part-of-speech) tagger: http://ckipsvr.iis.sinica.edu.tw/

  8. Eric Brill's POS Tagger • Website: http://cst.dk/online/pos_tagger/uk/ • Example output: This/DT is/VBZ a/DT book/NN ./.
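The word/TAG output shown on this slide is easy to post-process. The following is a minimal sketch, assuming the tagger returns whitespace-separated `word/TAG` tokens as in the example; the function name `parse_tagged` is hypothetical and not part of any tagger.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST '/', so a token like './.' or a word
        # that itself contains '/' is still separated from its tag correctly.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("This/DT is/VBZ a/DT book/NN ./.")
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Keeping the output as (word, tag) pairs makes it straightforward to compare the results of different taggers on the same sentence.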

  9. Stanford Parser • Website • http://nlp.stanford.edu/software/lex-parser.shtml • Tools • Online version • Stanford Parser version 1.5.1 • English & Chinese • http://josie.stanford.edu:8080/parser/

  10. Stanford Parser

  11. [Homework 1] • Use the CKIP POS (part-of-speech) tagger, Eric Brill's POS tagger, and the Stanford Parser to tag and parse at least three sentences.

  12. Course Topics • Probability and Information Theory • basics: definitions, formulas, examples. • Language Modeling • n-gram models, parameter estimation • smoothing (EM algorithm) • Some Linguistics • phonology, morphology, syntax, semantics, discourse • Words and the Lexicon • word classes, mutual information, lexicography.

  13. Course Topics (cont.) • Hidden Markov Models • background, algorithms, parameter estimation • Tagging: methods, algorithms, evaluation • tag sets, HMM tagging, transformation-based, feature-based • Grammars and Parsing: data, algorithms • statistical parsing: algorithms, parameterization, evaluation

  14. Course Topics (cont.) • Applications • Machine Translation (MT) • Automatic Speech Recognition (ASR) • Information Retrieval (IR) • Cross-Language Information Retrieval (CLIR) • Question Answering (QA) • Cross-Language Question Answering (CLQA) • Summarization • Information Extraction • …

  15. Course Introduction • Lecture 1: Introduction • Lecture 2: Mathematical Foundations • Lecture 3: Linguistics Essentials • Lecture 4: Corpus-based Work • Lecture 5: Collocations • Lecture 6: Statistical Inference: n-gram Models over Sparse Data • Lecture 7: Word Sense Disambiguation • Lecture 8: Statistical Alignment and Machine Translation • Lecture 9: Markov Models • Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval • Lecture 11: Statistical/Probabilistic Models for Word Alignment & CLIR • Lecture 12: Part-of-Speech Tagging • Lecture 13: Probabilistic Context Free Grammars • Lecture 14: Question Answering

  16. The Ultimate Research Goal in Natural Language Processing (NLP) • To develop an automated language understanding system • Why is this important? • Makes it easy for everyone to use language • Natural human interface for a variety of applications (e.g., database access, on-line tutor, robot control, etc.) • Language seems fundamental for developing an intelligent system • iPhone Siri • IBM's DeepQA project

  17. Natural Language is VERY Useful

  18. OCR Problems

  19. Aspects of Computational Linguistics • Description of the Language: universals, cross-linguistic research • Implementation of Computer Model: algorithms and data structures, formal models to represent knowledge, model of the reasoning process • Psycho-Linguistic Aspect: humans are an existence proof of the computability of language comprehension; psychological research can be used to justify a computer model; obtain human processing parameters

  20. NLP Issues • Why is NLP difficult? • Many “words”, many “phenomena”, many “rules” • OED (Oxford English Dictionary): 400k words; Finnish lexicon (of forms): ~2 × 10^7 • sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more! • irregularity (exceptions, exceptions to the exceptions, ...) • potato → potatoes (tomato, hero, ...); photo → photos; and even both mango → mangos or → mangoes • Adjective / Noun order: new book, electrical engineering, general regulations, flower garden, garden flower

  21. Difficulties in NLP (cont.) • Ambiguity • books: NOUN or VERB? • you need many books vs. she books her flights online • Thank you for not smoking, drinking, eating or playing radios without earphones. (MTA bus) • Thank you for not eating without earphones?? • Thank you for drinking?? … • Fred’s hat was blown off by the wind. He tried to catch it. • ...catch the wind or ...catch the hat?

  22. Rules or Statistics? • Preferences: • context clues: she books → books is a verb • rule: if an ambiguous word (verb/nonverb) is preceded by a matching personal pronoun → the word is a verb • pronoun reference: • she/he/it often refers to the most recent noun or pronoun (but there are certainly exceptions) • selectional restrictions: • catching a hat is better than catching the wind (but not always) • semantics: • We thank people for doing helpful things or not doing annoying things
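The context-clue rule on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the two word sets are hypothetical mini-lexicons, "matching" is read as third-person-singular agreement, and falling back to NOUN when the rule does not fire is an assumption rather than something the slides prescribe.

```python
# Hypothetical mini-lexicons for illustration only.
THIRD_PERSON_SINGULAR = {"he", "she", "it"}
AMBIGUOUS_S_FORMS = {"books"}  # plural noun or third-person-singular verb


def tag_ambiguous(tokens):
    """Return (token, tag) pairs; tag is None for words the rule does not cover."""
    tagged = []
    for i, token in enumerate(tokens):
        tag = None
        if token.lower() in AMBIGUOUS_S_FORMS:
            previous = tokens[i - 1].lower() if i > 0 else ""
            # Rule from the slide: a matching personal pronoun right before
            # the ambiguous word makes it a verb. NOUN is an assumed default.
            tag = "VERB" if previous in THIRD_PERSON_SINGULAR else "NOUN"
        tagged.append((token, tag))
    return tagged


print(tag_ambiguous("she books her flights online".split()))
print(tag_ambiguous("you need many books".split()))
```

A hand-written rule like this handles the two example sentences, but it has exceptions, which is the motivation for statistical preferences.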

  23. Solutions • Don’t guess if you know: • morphology (inflections) • lexicons (word information) • unambiguous names • perhaps some (really) fixed phrases • syntactic rules? • Use statistics (based on real-world data) for preferences (only?) • No doubt: but this is an important question!
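"Don't guess if you know" can be illustrated with the plural examples from slide 20: look the word up in a lexicon of known forms first, and only fall back to a general rule when the lexicon has no entry. This is a minimal sketch; `PLURAL_EXCEPTIONS` and `pluralize` are hypothetical names, and the default rule (append "s") is a deliberate simplification.

```python
# Known irregular plurals, taken from the examples on slide 20.
PLURAL_EXCEPTIONS = {
    "potato": ["potatoes"],
    "tomato": ["tomatoes"],
    "hero": ["heroes"],
    "mango": ["mangos", "mangoes"],  # both forms are accepted
}


def pluralize(noun):
    """Return the list of acceptable plural forms for a noun."""
    known = PLURAL_EXCEPTIONS.get(noun)
    if known is not None:
        return list(known)   # we know the answer: do not guess
    return [noun + "s"]      # simplified default rule, used only as a fallback


for noun in ["potato", "photo", "mango"]:
    print(noun, "->", pluralize(noun))
```

Returning a list rather than a single string keeps cases like mango, where more than one form is acceptable, from being forced into an arbitrary choice.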

  24. Types of Linguistic Knowledge • Acoustic/Phonetic Knowledge: How words are related to their sounds. (transliteration) • Ericsson <=> 易利信 • Morphological Knowledge: How words are constructed out of basic meaning units. • un + friend + ly → unfriendly • love + past tense → loved • object + oriented → object-oriented

  25. More Types of Linguistic Knowledge • Lexical Knowledge (or Dictionary): This should include information on parts of speech, features (e.g., number, case), typical usage, and word meaning. • Syntactic Knowledge: How words are put together to make legal sentences (or constituents of sentences).

  26. More Types of Linguistic Knowledge • Semantic Knowledge: Word meanings, how words combine into sentence meaning, • e.g., Fred tossed the ball. Semantic roles

  27. More Types of Linguistic Knowledge • Pragmatic Knowledge: How context affects the interpretation of a sentence. Examples: • Louise loves him. [Context 1:] Who loves Fred? [Context 2:] Louise has a cat. • What time is it? [Context 1:] Fred is fidgeting (坐立不安) and staring at his watch. [Context 2:] Louise has no watch.

  28. More Types of Linguistic Knowledge • World Knowledge: How other people's minds work, what a listener knows or believes, the etiquette (成規) of language. Examples: • Will you pass the salt? • I read an article about the war in the paper. • Fred saw the bird with his binoculars. • Tim was invited to Tom's birthday party. He went to the store to buy him a present.

  29. Multilingualism Issues in the Web Age • Language barrier • There are about 6,700 languages listed in the Ethnologue (http://www.ethnologue.com/) • Information overload • Scaling up of language resources • Webpages • News • Weblogs • Microblogs

  30. Multilingual Understanding??

  31. Multilingual Understanding??

  32. Multilingual Understanding??

  33. Real World Situation • Use a statistical model based on REAL WORLD DATA and care about the best sentence only • Imagine: • Each sentence W = { w1, w2, ..., wn } gets a probability P(W|X) in a context X • For every possible context X, sort all the imaginable sentences W according to P(W|X) • Ideal situation: [Figure: P(W) over sentences ordered from Wbest to Wworst; the best sentence is the most probable one in context X]

  34. Real World Situation • Unable to specify a set of grammatical sentences using fixed “categorical” rules • (disregarding the “grammaticality” issue) • [Figure: P(W) over sentences ordered from Wbest to Wworst; the best sentence is the most probable one in context X]
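The idea of giving each sentence W a probability and sorting candidates by it can be sketched with a toy bigram model. This is an illustration under stated assumptions, not the course's method: the context X is ignored (the code ranks by P(W) only), the two training sentences are the example sentences from slide 21, add-one smoothing is used for simplicity, and all function names are hypothetical.

```python
import math
from collections import Counter

# Toy training data: the two example sentences from slide 21.
CORPUS = [
    "you need many books",
    "she books her flights online",
]


def train_bigram(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log P(W) under the bigram model with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total


def rank(candidates, model):
    """Sort candidate sentences from most to least probable."""
    return sorted(candidates, key=lambda s: log_prob(s, model), reverse=True)


model = train_bigram(CORPUS)
candidates = ["online flights her books she", "she books her flights online"]
for sentence in rank(candidates, model):
    print(f"{log_prob(sentence, model):8.3f}  {sentence}")
```

The model never decides whether a sentence is grammatical; it only orders the candidates, so the scrambled word order ends up below the natural one simply because its bigrams were never observed.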
