This session introduces the fundamentals of computational linguistics and its applications in processing natural language. It explores the difference between computational linguistics (CL) and natural language processing (NLP), detailing various applications such as automatic speech recognition, machine translation, and sentiment analysis. The session discusses basic and advanced NLP tasks, emphasizing the role of linguistic knowledge in developing effective systems. By understanding these concepts, learners can appreciate how computers can transform, classify, and sometimes truly understand human language.
Introduction to CL Session 1: 7/08/2011
What is computational linguistics? • Processing natural language text by computers • for practical applications • ... or linguistic research • Among practical applications • Sometimes the computer only needs to classify or transform the text • ... but sometimes it needs to “understand” • Ex: Watson: winner of ‘Jeopardy’ • CL vs. NLP (natural language processing)
NLP applications • Automatic speech recognition (ASR): speech → text • Machine translation (MT): L1 → L2 • Information retrieval (IR): Query + documents → a subset of the documents • Information extraction (IE): document → “database”
NLP applications (cont) • Question answering (QA): Question + documents → Answer • Summarization: documents → summary • Natural language generation (NLG): representation → text
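Of the applications listed so far, information retrieval is the simplest to sketch: given a query and a set of documents, return the subset that matches. The example below is a minimal illustration using Boolean AND matching on whitespace-separated words. The function name `retrieve` and the sample documents are hypothetical and not part of the session material; real IR systems rank documents rather than just filtering them.

```python
def retrieve(query, documents):
    """Return the documents that contain every query term (Boolean AND retrieval)."""
    query_terms = set(query.lower().split())
    # Keep a document only if all query terms appear among its words.
    return [doc for doc in documents if query_terms <= set(doc.lower().split())]


# Hypothetical toy collection, for illustration only.
documents = [
    "machine translation maps one language to another",
    "speech recognition maps speech to text",
    "a spam filter classifies email",
]

print(retrieve("speech text", documents))
```

Only the second document contains both "speech" and "text", so it is the only one returned.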
Other Applications • Call Center • Spam filter • Spell checker • Sentiment analysis: product reviews • Bio-NLP: processing clinical data • ….
Basic NLP tasks: Shallow processing • Tokenization: • He visited New York in 2003. • Morphological analysis: • visited → visit + -ed • Part-of-speech tagging • He/Pron visited/V New/?? York/N in/Prep 2003/CD • Named-entity tagging • He visited [LOCATION New York] in [YEAR 2003] • Chunking • [NP He] [V visited] [NP New York] in [NP 2003]
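The first two shallow-processing steps can be sketched in a few lines. This is a deliberately naive illustration, not how production tokenizers or morphological analyzers work: `tokenize` splits on a regular expression, and `analyze_morphology` (a hypothetical helper) only strips a regular "-ed" suffix, so it would mishandle irregular verbs and words such as "hundred".

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)


def analyze_morphology(word):
    """Naive suffix stripping: split a regular past-tense '-ed' off a word."""
    if word.endswith("ed") and len(word) > 4:
        return (word[:-2], "-ed")
    return (word, None)


print(tokenize("He visited New York in 2003."))
# ['He', 'visited', 'New', 'York', 'in', '2003', '.']
print(analyze_morphology("visited"))
# ('visit', '-ed')
```

Note that the tokenizer treats "New" and "York" as separate tokens; recognizing them as one unit is the job of the later named-entity tagging or chunking step.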
Basic NLP tasks: Deep processing • Parsing • (S (NP (PRON he)) (VP (V visited) ….) • Semantic analysis • Semantic tagging: [AGENT He] visited [DEST New York] …. • Meaning: visit (he, New-York) • Discourse • Co-reference: “He” refers to “John” • Discourse structure • Dialogue • Generation
Ambiguity • Phonological ambiguity: (ASR) • “too”, “two”, “to” • “ice cream” vs. “I scream” • “ta” in Mandarin: he, she, or it • Morphological ambiguity: (morphological analysis) • unlockable: [[un-lock]-able] vs. [un-[lock-able]] • Syntactic ambiguity: (parsing) • John saw a man with a telescope. • Time flies like an arrow.
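The syntactic ambiguity in "John saw a man with a telescope" can be made concrete by counting parse trees. The sketch below runs a CKY-style chart over a tiny hand-written grammar in Chomsky normal form. The grammar, lexicon, and function name `count_parses` are assumptions made for this illustration; they are not taken from the session.

```python
from collections import defaultdict

# Hypothetical toy grammar: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "John": ["NP"], "saw": ["V"], "a": ["Det"],
    "man": ["N"], "with": ["P"], "telescope": ["N"],
}


def count_parses(words, root="S"):
    """Count the distinct parse trees for `words` under the toy grammar."""
    n = len(words)
    # chart[(i, j)][label] = number of trees with that label spanning words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, []):
            chart[(i, i + 1)][tag] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    left_count = chart[(i, k)].get(left, 0)
                    right_count = chart[(k, j)].get(right, 0)
                    if left_count and right_count:
                        chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)].get(root, 0)


print(count_parses("John saw a man with a telescope".split()))  # 2
print(count_parses("John saw a man".split()))                   # 1
```

The two parses correspond to the two readings: the telescope is either the instrument of seeing (PP under VP) or something the man has (PP under NP).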
Ambiguity (cont) • Lexical ambiguity: (WSD) • Ex: “bank”, “saw”, “run” • Semantic ambiguity: (semantic representation) • Ex: every boy loves his mother • Ex: John and Mary bought a house • Discourse ambiguity: • Susan called Mary. She was sick. (coreference resolution) • It is pretty hot here. (intention resolution) • Machine translation: • “brother”, “cousin”, “uncle”, etc.
Ambiguity resolution • Rule-based or knowledge-based: • Parsing: • I saw a man with a hat • I saw a man with a telescope (in my hand) • WSD: • “bank” • MT: • “brother”, “cousin”, “uncle” • Statistical approach: • Require training data • Build a statistical model • Knowledge and rules can be incorporated into the model as features etc.
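As a rough illustration of the statistical approach applied to WSD, the sketch below trains a Naive Bayes classifier with add-one smoothing to choose a sense of "bank" from its context words. The training examples, the sense labels FINANCE and RIVER, and the function names are invented for this example; a real system would need far more annotated data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (context words, sense of "bank").
TRAINING = [
    (["deposit", "money", "account"], "FINANCE"),
    (["loan", "money", "interest"], "FINANCE"),
    (["river", "water", "fishing"], "RIVER"),
    (["river", "grass", "shore"], "RIVER"),
]


def train(examples):
    """Collect the counts needed for Naive Bayes: senses, words per sense, vocabulary."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for w in words:
            word_counts[sense][w] += 1
            vocabulary.add(w)
    return sense_counts, word_counts, vocabulary


def classify(context, model):
    """Return the sense with the highest log-probability given the context words."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior of the sense
        denom = sum(word_counts[sense].values()) + len(vocabulary)
        for w in context:
            if w in vocabulary:  # ignore words never seen in training
                # add-one (Laplace) smoothed likelihood of the word given the sense
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


model = train(TRAINING)
print(classify(["money", "deposit"], model))  # FINANCE
print(classify(["river", "water"], model))    # RIVER
```

Hand-written knowledge fits into the same model as extra features, for example a feature marking whether the previous word is "river".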
Major approaches to NLP • Rule-based approach • Statistical approach • Supervised learning • Semi-supervised learning • Unsupervised learning
Supervised learning algorithms • Hidden Markov Model (HMM) • Decision tree • Decision list • Naïve Bayes • Transformation-based Learning (TBL) • Maximum Entropy (MaxEnt) • Support Vector Machine (SVM) • Conditional Random Field (CRF) • …
Data • Raw text: • Monolingual: English/Chinese/Arabic Gigawords • Parallel data: UN data, EuroParl • Treebank: • Syntactic treebanks: a set of parse trees • Proposition Bank: • Discourse Treebank • Dictionaries • WordNet • FrameNet • …
[Figure: diagram relating Applications, tasks (Task1, Task2, …, Task_i), ML methods (ML1, ML2, …, ML_m), and data (D1, D2, …, D_n)]
The role of linguistic knowledge in NLP • Suppose an NLP system is language-independent. • Good or bad? • Good: it can be ported to many languages without any changes. • Bad: it cannot take advantage of properties of certain languages. • How to incorporate (linguistic) knowledge in statistical systems? • the design of models • as features • as filters • … • Building a treebank is an effective way to do so.
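One way to read "as features" is sketched below: linguistic intuitions (suffixes signal morphology, capitalization signals proper names, digits signal numbers) are encoded as feature values that a statistical tagger can weigh. The function `token_features` and its feature names are hypothetical and chosen for this illustration; the session does not prescribe a particular feature set.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "suffix2": word[-2:].lower(),        # e.g. "ed" hints at a past-tense verb
        "is_capitalized": word[0].isupper(),  # hints at a proper name
        "is_digit": word.isdigit(),           # hints at a number such as a year
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
    }


tokens = ["He", "visited", "New", "York", "in", "2003"]
print(token_features(tokens, 1))
```

Features like `suffix2` are language-specific, which is the trade-off the slide describes: they help for English but would need to be redesigned for a language with different morphology.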