Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Multilingual and Crosslingual Information System Wen-Hsiang Lu (盧文祥) Department of Computer Science and Information Engineering National Cheng Kung University 2014/02/17

Contact Information • Room: 4261, Monday 09:10 - 12:00 • Instructor: Prof. Wen-Hsiang Lu (盧文祥) • Office: 4216 • Office hours: Monday 12:10 - 2:10 PM • Phone: 62545 • Web page: http://myweb.ncku.edu.tw/~whlu/mis.htm • Email: whlu@mail.ncku.edu.tw • Teaching assistant: 王廷軒 • Email: playif@gmail.com

Course Grading • Class participation/presentation: 30% • Tests: 25% • Project: 25% • Homeworks: 20%

Source Textbooks • Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999. (全華科技圖書: 02-23717725) • Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2000. • James Allen, Natural Language Understanding, Benjamin/Cummings Publishing Co, 1995. • Gregory Grefenstette, Cross-Language Information Retrieval, Kluwer, 1998. • Jean Véronis, Parallel Text Processing: Alignment and Use of Translation Corpora, Kluwer, 2000.

Other Useful Sources (1) • Reference Books • Charniak, E. Statistical Language Learning. • Cover, T. M., Thomas, J. A. Elements of Information Theory. • Jelinek, F. Statistical Methods for Speech Recognition. • Major Conferences • ACL (Association for Computational Linguistics) • COLING (International Conference on Computational Linguistics) • HLT (Human Language Technology Conference) • IJCNLP (International Joint Conference on Natural Language Processing) • Journals • Computational Linguistics • Natural Language Engineering • TALIP (ACM Transactions on Asian Language Information Processing) • TSLP (ACM Transactions on Speech and Language Processing)

Other Useful Sources (2) • Resource URLs • http://www.aclclp.org.tw/res_other_c.php (Association for Computational Linguistics and Chinese Language Processing, ACLCLP) • http://nlp.stanford.edu/software/index.shtml (Stanford NLP Group) • http://www.phontron.com/nlptools.php (Graham Neubig) • Tools/Software • Online Dictionaries • WordNet: http://wordnet.princeton.edu/ • HowNet: http://www.keenage.com/html/c_index.html • The Academia Sinica Bilingual Ontological Wordnet (BOW): http://bow.sinica.edu.tw/

CKIP (Chinese Knowledge and Information Processing group, Academia Sinica) • Parser: http://140.109.19.112/main.exe?id=6833 • POS (part of speech) tagger: http://ckipsvr.iis.sinica.edu.tw/

Eric Brill's POS Tagger • Website: http://cst.dk/online/pos_tagger/uk/ • Example output: This/DT is/VBZ a/DT book/NN ./.
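The word/TAG output format shown on this slide is easy to post-process. The following is a minimal sketch, not part of the tagger itself: the helper name `parse_tagged` is hypothetical, and it assumes tokens are separated by whitespace and that the tag follows the last slash in each token.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that a token such as './.' or 'and/or/CC'
        # keeps everything before the final slash as the word.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    for word, tag in parse_tagged("This/DT is/VBZ a/DT book/NN ./."):
        print(f"{word}\t{tag}")
```

Having the output as (word, tag) pairs makes it straightforward to compare the tags assigned by different taggers to the same sentence.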

Stanford Parser • Website • http://nlp.stanford.edu/software/lex-parser.shtml • Tools • Online version • Stanford Parser version 1.5.1 • English & Chinese • http://josie.stanford.edu:8080/parser/

Stanford Parser

[Homework 1] • Use the CKIP POS (part of speech) tagger, Eric Brill's POS tagger, and the Stanford parser to tag and parse at least three sentences.

Course Topics • Probability and Information Theory • basics: definitions, formulas, examples. • Language Modeling • n-gram models, parameter estimation • smoothing (EM algorithm) • Some Linguistics • phonology, morphology, syntax, semantics, discourse • Words and the Lexicon • word classes, mutual information, lexicography.

Course Topics (cont.) • Hidden Markov Models • background, algorithms, parameter estimation • Tagging: methods, algorithms, evaluation • tag sets, HMM tagging, transformation-based, feature-based • Grammars and Parsing: data, algorithms • statistical parsing: algorithms, parameterization, evaluation

Course Topics (cont.) • Applications • Machine Translation (MT) • Automatic Speech Recognition (ASR) • Information Retrieval (IR) • Cross-Language Information Retrieval (CLIR) • Question Answering (QA) • Cross-Language Question Answering (CLQA) • Summarization • Information Extraction • …

Course Introduction • Lecture1: Introduction • Lecture2: Mathematical Foundations • Lecture3: Linguistics Essentials • Lecture4: Corpus-based Work • Lecture5: Collocations • Lecture6: Statistical Inference: n-gram Models over Sparse Data • Lecture7: Word Sense Disambiguation • Lecture8: Statistical Alignment and Machine Translation • Lecture9: Markov Models • Lecture10: Term Translation Extraction & Cross-Language Information Retrieval • Lecture11: Statistical/Probabilistic Models for Word Alignment & CLIR • Lecture12: Part-of-Speech Tagging • Lecture13: Probabilistic Context Free Grammars • Lecture14: Question Answering

The Ultimate Research Goal in Natural Language Processing (NLP) • To develop an automated language understanding system • Why is this important? • Language is easy for everyone to use • Natural human interface for a variety of applications (e.g., database access, on-line tutor, robot control, etc.) • Language seems fundamental for developing an intelligent system • iPhone Siri • IBM's DeepQA project

Natural Language is VERY Useful

OCR Problems

Aspects of Computational Linguistics • Description of the Language: universals, cross-linguistic research • Implementation of Computer Model: algorithms and data structures, formal models to represent knowledge, model of the reasoning process • Psycho-Linguistic Aspect: humans are an existence proof of the computability of language comprehension; psychological research can be used to justify a computer model; obtain human processing parameters

NLP Issues • Why is NLP difficult? • Many “words”, many “phenomena”, many “rules” • OED (Oxford English Dictionary): 400k words; Finnish lexicon (of forms): ~2 × 10^7 • sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more! • irregularity (exceptions, exceptions to the exceptions, ...) • potato → potatoes (tomato, hero, ...); photo → photos, and even both: mango → mangos or mango → mangoes • Adjective / Noun order: new book, electrical engineering, general regulations, flower garden, garden flower
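The pluralization examples on this slide can be read as a default rule plus a lexicon of exceptions. The sketch below illustrates that reading only; the table `PLURAL_EXCEPTIONS` and the function `plural_forms` are hypothetical names, the table contains just the nouns mentioned on the slide, and the default rule (append "s") is a deliberate simplification of English morphology.

```python
# Hypothetical exception lexicon built only from the slide's examples.
# Each noun maps to the list of plural forms the slide accepts.
PLURAL_EXCEPTIONS = {
    "potato": ["potatoes"],
    "tomato": ["tomatoes"],
    "hero": ["heroes"],
    "mango": ["mangos", "mangoes"],  # both forms are accepted
}


def plural_forms(noun):
    """Return accepted plural forms: look up the lexicon first, else apply the default rule."""
    if noun in PLURAL_EXCEPTIONS:
        return list(PLURAL_EXCEPTIONS[noun])
    # Simplified default rule (an assumption): regular nouns just take "s".
    return [noun + "s"]


if __name__ == "__main__":
    for noun in ["potato", "photo", "mango", "book"]:
        print(noun, "->", " / ".join(plural_forms(noun)))
```

Every new irregular noun requires another table entry, which is one way to see why the number of "rules" grows so quickly.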

Difficulties in NLP (cont.) • Ambiguity • books: NOUN or VERB? • you need many books vs. she books her flights online • Thank you for not smoking, drinking, eating or playing radios without earphones. (MTA bus) • Thank you for not eating without earphones?? • Thank you for drinking?? … • Fred’s hat was blown off by the wind. He tried to catch it. • ...catch the wind or ...catch the hat?

Rules or Statistics? • Preferences: • context clues: she books → books is a verb • rule: if an ambiguous word (verb/nonverb) is preceded by a matching personal pronoun → the word is a verb • pronoun reference: • she/he/it often refers to the most recent noun or pronoun (but there are certainly exceptions) • selectional restrictions: • catching a hat is better than catching the wind (but not always) • semantics: • We thank people for doing helpful things or not doing annoying things
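The context-clue rule above can be written down directly. This is a minimal sketch under stated assumptions: the names `AMBIGUOUS_VERB_NOUN`, `MATCHING_PRONOUNS` and `tag_ambiguous` are hypothetical, "matching" is taken to mean third-person singular agreement (she/he/it before a form ending in -s), and falling back to NOUN when the rule does not fire is an assumption made here, not something the slide states.

```python
# Hypothetical lexicon of words that may be a verb or a noun.
AMBIGUOUS_VERB_NOUN = {"books"}
# Assumed reading of "matching": pronouns that agree with a third-person singular verb form.
MATCHING_PRONOUNS = {"she", "he", "it"}


def tag_ambiguous(sentence):
    """Return (word, tag) pairs for the ambiguous words found in a sentence."""
    tokens = sentence.lower().split()
    decisions = []
    for i, word in enumerate(tokens):
        if word not in AMBIGUOUS_VERB_NOUN:
            continue
        previous = tokens[i - 1] if i > 0 else None
        # Rule from the slide: a matching personal pronoun right before the word makes it a verb.
        # The NOUN fallback is an assumption for this sketch.
        tag = "VERB" if previous in MATCHING_PRONOUNS else "NOUN"
        decisions.append((word, tag))
    return decisions


if __name__ == "__main__":
    print(tag_ambiguous("she books her flights online"))
    print(tag_ambiguous("you need many books"))
```

A hand-written rule like this covers the two example sentences but says nothing about cases it was not written for, which is the motivation for asking whether statistics should supply the preferences instead.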

Solutions • Don’t guess if you know: • morphology (inflections) • lexicons (word information) • unambiguous names • perhaps some (really) fixed phrases • syntactic rules? • Use statistics (based on real-world data) for preferences (only?) • No doubt: but this is an important question!

Types of Linguistic Knowledge • Acoustic/Phonetic Knowledge: How words are related to their sounds. (transliteration) • Ericsson <=> 易利信 • Morphological Knowledge: How words are constructed out of basic meaning units. • un + friend + ly → unfriendly • love + past tense → loved • object + oriented → object-oriented

More Types of Linguistic Knowledge • Lexical Knowledge (or Dictionary): This should include information on parts of speech, features (e.g., number, case), typical usage, and word meaning. • Syntactic Knowledge: How words are put together to make legal sentences (or constituents of sentences).

More Types of Linguistic Knowledge • Semantic Knowledge: Word meanings, how words combine into sentence meaning • e.g., Fred tossed the ball. (semantic roles)

More Types of Linguistic Knowledge • Pragmatic Knowledge: How context affects the interpretation of a sentence. Examples: • Louise loves him. [Context 1:] Who loves Fred? [Context 2:] Louise has a cat. • What time is it? [Context 1:] Fred is fidgeting (restless) and staring at his watch. [Context 2:] Louise has no watch.

More Types of Linguistic Knowledge • World Knowledge: How other people's minds work, what a listener knows or believes, the etiquette (conventions) of language. Examples: • Will you pass the salt? • I read an article about the war in the paper. • Fred saw the bird with his binoculars. • Tim was invited to Tom's birthday party. He went to the store to buy him a present.

Multilingualism Issues in the Web Age • Language barrier • There are about 6,700 languages listed in the Ethnologue (http://www.ethnologue.com/) • Information overload • Scaling up of language resources • Webpages • News • Weblogs • Microblogs

Multilingual Understanding??

Real World Situation • Use a statistical model based on REAL WORLD DATA and care about the best sentence only • Imagine: • Each sentence W = { w1, w2, ..., wn } gets a probability P(W|X) in a context X • For every possible context X, sort all the imaginable sentences W according to P(W|X) • Ideal situation: [Figure: P(W) over sentences sorted from Wbest to Wworst; the best sentence is the most probable in context X]
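The idea of sorting candidate sentences by probability can be sketched with a toy language model. Everything in this example is an assumption made for illustration: it ranks by P(W) and ignores the context X, it uses a bigram model with add-one smoothing as a simple stand-in (not the estimation method of the course), the corpus is three sentences, and the function names `train_bigram`, `log_prob` and `rank` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), {"</s>"}
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words[1:])
        context_counts.update(words[:-1])          # every word that has a successor
        bigram_counts.update(zip(words, words[1:]))
    return context_counts, bigram_counts, len(vocab)


def log_prob(sentence, model):
    """Log P(W) under the bigram model with add-one smoothing (an assumed choice)."""
    context_counts, bigram_counts, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log((bigram_counts[(prev, cur)] + 1)
                          / (context_counts[prev] + vocab_size))
    return total


def rank(candidates, model):
    """Sort candidate sentences from most to least probable (W_best first)."""
    return sorted(candidates, key=lambda s: log_prob(s, model), reverse=True)


if __name__ == "__main__":
    toy_corpus = [
        "she books her flights online",
        "you need many books",
        "she books a flight",
    ]
    model = train_bigram(toy_corpus)
    for s in rank(["online flights her books she", "she books her flights online"], model):
        print(f"{log_prob(s, model):8.3f}  {s}")
```

Log probabilities are used instead of raw products so that long sentences do not underflow; the ordering of the candidates is the same either way.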

Real World Situation • Unable to specify a set of grammatical sentences using fixed “categorical” rules • (disregarding the “grammaticality” issue) • [Figure: P(W) over sentences sorted from Wbest to Wworst; the best sentence is the most probable in context X]
