1 / 44

Natural Language Processing

Natural Language Processing. Session in the section “Web search” of the course “Web Mining” at the Ecole Nationale Superieure des Télécommunications , in Paris/France, in fall 2010 by Fabian M. Suchanek. This document is available under a

csilla
Télécharger la présentation

Natural Language Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Natural Language Processing Session in the section “Web search” of the course “Web Mining” at the EcoleNationaleSuperieure des Télécommunications, in Paris/France, in fall 2010 by Fabian M. Suchanek This document is available under a Creative Commons Attribution Non-Commercial License

  2. Organisation • 2h class on Natural Language Processing • Small home-work given at the end, to be handed in for the next session (on paper or by email) • Web-site: http://suchanek.name Teaching • Class will be held in English, questions and answers in French are OK 

  3. Natural Language ... appears everywhere on the Internet Inbox: 3726 unread messages (250 billion mails/day; ≈ number of stars in our galaxy; 80% spam) (100 million blogs) me: Would you like to have dinner with me tonight? Cindy: no. (1 billion chat msg/day on Facebook; 1 billion = 10^9 = distance Chicago-Tokio in cm) (1 trillion Web sites; 1 trillion = 10^12 ≈ number of cells in the human body)

  4. Natural Language: Tasks • Automatic text summarization Let me first say how proud I am to be the president of this country. Yet, in the past years, our country... [1 hour speech follows]  Summary: Taxes will increase by 20%. • Machine translation librairie book store • Information Extraction Elvis Presley lives on the moon. lives(ElvisPresley, moon) • Natural language understanding Close the file! Clean up the kitchen! • Natural language generation Dear user, I have cleaned up the kitchen for you. • Text Correction My hardly loved mother-in law  My heartily loved mother-in-law • Question answering Where is Elvis?  On the moon

  5. Fields of Linguistics “I can resist anything but temptation” [Oscar Wilde] • /aikεnrəzistεniθiŋ.../ • (Phonology, the • study of pronunciation) • temptations / temptation • (Morphology, the • study of word constituents) Sentence • “I” = • (Semantics, the • study of meaning) Verbal phrase Noun phrase • (Syntax, the • study of grammar)

  6. Phonology • Spelling and pronunciation do not always coincide • The International Phonetic Alphabet (IPA) binds phonetic symbols to exact mouth positions and thus sounds • We can describe the sounds • of language precisely

  7. Phonology • The same spelling can be pronounced in different ways • (such words are called heteronyms) • The same pronunciation can be spelled in different ways (such words are called homophones) I read a book every day. / ... ri:d .../ I read a book yesterday. / ... rɛ:d .../ ...ough: ought, plough, cough, tough, though, through site / sight / cite • Conclusion: It is hard to wreck a nice beach • (= It is hard to recognize speech)

  8. Fields of Linguistics “I can resist anything but temptation” [Oscar Wilde] • /aikεnrəzistεniθiŋ.../ • (Phonology, the • study of pronunciation) ✔ • temptations / temptation • (Morphology, the • study of word constituents) Sentence • “I” = • (Semantics, the • study of meaning) Verbal phrase Noun phrase • (Syntax, the • study of grammar)

  9. Morphology: Morphemes • Words can consist of constituents that carry meaning (morphemes) prefix suffix affixes • un-break-able • “un” is a morpheme that indicates negation • “break” is the root morpheme • “able” is a morpheme that indicates being capable of something • Words can be assembled from other words houseboat = house + boat Druckerzeugnisse = Druck-Erzeugnisse ? = Drucker-Zeugnisse ? Donaudampfschifffahrtskapitaen = Donau + Dampf + Schiff + Fahrt + Kapitaen [captain of a steam ship ride on the Danube]

  10. Morphology: Non-Triviality • Morphemes do not always simply add up happy + ness ≠ happyness (but: happiness) dish + s ≠ dishs (but: dishes) un + correct ✗ in +correct = incorrect • Example: plural and singular boy + s -> boys (easy) city + s -> cities atlas -> atlas, bacterium -> bacteria, automaton -> automata alumnus -> alumni, axis -> axes, man -> men, mouse -> mice, person -> people, physics -> (no pl), mechanic -> mechanics

  11. Morphology: Stemming Stemming: The process of mapping different related words onto one single word form. bike, biking, bikes, racebike -> BIKE Stemming is useful to allow search engines to find related words: User: “biking” BIKE Stemming word does not appear  word appears  This Web page tells you everything about bikes. ... THIS WEB PAGE TELL YOU EVERYTHING ABOUT BIKE. ... Stemming

  12. Morphology: Stemming Stemming can be done at different levels of agressivity: • Just mapping plural forms to singular words  word Stemming is the process of mapping different related words onto one word form. Stemming is the process of mapping different related word onto one word form. Still not trivial: universities  university emus  emu, but genus  genus mechanics  mechanic (the guy) or mechanics (the science)

  13. Morphology: Stemming Stemming can be done at different levels of aggressivity: • Just mapping plural forms to singular • Reduction to the lemma, i.e., the non-inflected form mapping  map, stemming  stem, is  be, related  relate Stemming is the process of mapping different related words onto one word form. Stem be the process of map different relate word onto one word form. Still not trivial: interrupted, interrupts, interrupt  interrupt ran  run

  14. Morphology: Stemming Stemming can be done at different levels of agressivity: • Just mapping plural forms to singular • Reduction to the lemma, i.e., the non-inflected form • Reduction to the stem, i.e., the common core of all related words different  differ (because of “to differ”) Stemming is the process of mapping different related words onto one word form. Stem be the process of map differ relate word onto one word form. May be too strong: interrupt, rupture, disrupt rupt

  15. Morphology: Stemming Stemming can be done at different levels of agressivity: • Just mapping plural forms to singular • Reduction to the lemma, i.e., the non-inflected form • Reduction to the stem, i.e., the common core of all related words • Common stemming techniques: • rule-based (e.g., the Porter Stemmer for reduction to the stem) • dictionary-based • stochastic

  16. Fields of Linguistics “I can resist anything but temptation” [Oscar Wilde] • /aikεnrəzistεniθiŋ.../ • (Phonology, the • study of pronunciation) ✔ • temptations / temptation • (Morphology, the • study of word constituents) ✔ Sentence • “I” = • (Semantics, the • study of meaning) Verbal phrase Noun phrase • (Syntax, the • study of grammar)

  17. Semantics: Meaning of Words • A word can have multiple meanings (such a word is called a homonym) bow • A meaning can be expressed by multiple words (such words are called synonyms) author writer

  18. Semantics: Taxonomy of Words • A word is a hypernymof another word, if its meaning is more general that that of the other word The other word is called the hyponym. • This allows us to arrange words in a hierarchy • (called “taxonomy”) entity (This is not clean because it mixes up words and concepts) person object singer author weapon singer-songwriter bow “Every bow is a weapon.” “Every weapon is an object.”

  19. Semantics: WordNet • WordNet is a well-known semantic lexion, http://wordnet.princeton.edu A concept is identified by a “synset”, i.e., a set of synonymous words {entity} {person, human} {object, thing} {singer} {author, writer} {weapon} {decoration} {singer-songwriter} {bow} {bow, bow tie} Homonyms belong to multiple synsets.

  20. Semantics: WSD • Word Sense Disambiguation (WSD) is the process of finding the meaning of a word in a context They used a bow to hunt animals. • ? Words associated with “bow (weapon)”: { kill, hunt, Indian, prey } Words associated with “bow (bow tie)”: { suit, clothing, reception } e.g. from WordNet Words of the sentence: { they, used, to, hunt, animals } Overlap: 1/5 Overlap: 0/5 ✔ ✗

  21. Semantics: WSD • Word Sense Disambiguation (WSD) is the process of finding the meaning of a word in a context • Naive method: Comparison of contexts

  22. Fields of Linguistics “I can resist anything but temptation” [Oscar Wilde] • /aikεnrəzistεniθiŋ.../ • (Phonology, the • study of pronunciation) ✔ • temptations / temptation • (Morphology, the • study of word constituents) ✔ Sentence ✔ • “I” = • (Semantics, the • study of meaning) Verbal phrase Noun phrase • (Syntax, the • study of grammar)

  23. Syntax: Correct Sentences Bob stole the cat. ✔ Cat the Bob stole. ✗ Bob, who likes Alice, stole the cat. Bob, who likes Alice, who hates Carl, stole the cat. Bob, who likes Alice, who hates Carl, who owns the cat, stole the cat. • There are infinitively many correct sentences, • ...yet not all sentences are correct.

  24. Syntax: Grammars Bob stole the cat. ✔ Cat the Bob stole. ✗ Grammar: A formalism that decides whether a sentence is syntactically correct. Example: Bob eats Sentence -> Noun Verb Noun -> Bob Verb -> eats

  25. Syntax: Phrase Structure Grammars Non-terminal symbols: abstract phrase constituent names, such as “sentence”, “noun”, “verb” (in blue) Terminal symbols: words of the language, such as “Bob”, “eats”, “drinks” Given two disjoint sets of symbols, N and T, a (context-free) grammar is a relation between N and strings over N u T: G N x (N u T)* U Sentence -> Noun Verb Noun -> Bob Verb -> eats Production rules Rules always start from non-terminal symbols, never from terminal symbols (hence the name) N = {Sentence, Noun, Verb} • T = {Bob, eats} Lexicon

  26. Syntax: Using Grammars Sentence -> Noun Verb Noun -> Bob Verb -> stole N = {Sentence, Noun, Verb} • T = {Bob, eats} • All symbols are terminal • no more rule applicable • stop here • we have a sentence Sentence Apply rule • Sentence -> Noun Verb • Noun + Verb Sentence • Noun -> Bob Apply rule • Bob Verb Verb Noun Apply rule • Verb -> eats eats Bob • Bob eats = Rule derivation Parse tree

  27. Syntax: A More Complex Example 1. Sentence -> NounPhraseVerbPhrase 2. NounPhrase -> ProperNoun • 3. ProperNoun -> Bob • 4. VerbPhrase -> Verb NounPhrase • 5. Verb -> stole • 6. NounPhrase -> Determiner Noun • 7. Determiner -> the • 8. Noun -> cat Sentence VerbPhrase NounPhrase ProperNoun Verb NounPhrase Determiner Noun Bob stole cat the

  28. Syntax: A More Complex Example 1. Sentence -> NounPhraseVerbPhrase 2. NounPhrase -> ProperNoun • 3. ProperNoun -> Bob • 4. VerbPhrase -> Verb NounPhrase • 5. Verb -> stole • 6. NounPhrase -> Determiner Noun • 7. Determiner -> the • 8. Noun -> cat Sentence VerbPhrase NounPhrase Verb NounPhrase Determiner Noun the cat ProperNoun stole Bob

  29. Syntax: Recursive Structures 1. Sentence -> NounPhraseVerbPhrase 2. NounPhrase -> ProperNoun • 3. NounPhrase -> Determiner Noun • 4. NounPhrase -> NounPhrase Subordinator VerbPhrase • 5. VerbPhrase -> Verb NounPhrase Recursive rules: allow a circle in the derivation Sentence NounPhrase VerbPhrase NounPhrase Subordinator Verb VerbPhrase NounPhrase Verb NounPhrase Determiner Noun ProperNoun stole who ProperNoun cat the likes Bob Alice

  30. Syntax: Recursive Structures 1. Sentence -> NounPhraseVerbPhrase 2. NounPhrase -> ProperNoun • 3. NounPhrase -> Determiner Noun • 4. NounPhrase -> NounPhrase Subordinator VerbPhrase • 5. VerbPhrase -> Verb NounPhrase Sentence NounPhrase VerbPhrase NounPhrase Subordinator Verb VerbPhrase NounPhrase Verb NounPhrase Determiner Noun ProperNoun stole who cat Alice who hates ... the likes Bob Carl who owns...

  31. Syntax: Language Language of a grammar: The set of all sentences that can be derived from the start symbol by rule applications. • Bob stole • Bob stole the cat • Bob likes Alice • Bob stole Alice • Alice stole Bob who likes the cat • The cat likes Alice who stole Bob • Bob likes Alice who likes Alice who likes Alice... • ... The grammar is a finite description of an infinite set of sentences • The Bob stole likes. • Stole stole stole. • Bob cat Alice likes. • ...

  32. Syntax: What we cannot (yet) do What is difficult to do with context-free grammars: • agreement between words Bob kicks the dog. I kicks the dog. ✗ • sub-categorization frames Bob sleeps. Bob sleeps you. ✗ • meaningfulness Bob switches the computer off. Bob switches the cat off. ✗

  33. Syntax: Parsing Parsing is the process of, given a grammar and a sentence, finding the phrase structure tree. Sentence -> Noun Verb Noun -> Bob Verb -> stole N = {Sentence, Noun, Verb} • T = {Bob, eats} Sentence Verb Noun eats Bob

  34. Syntax: Ambiguity Sentence NounPhrase VerbPhrase NounPhrase Pronoun Noun Adjective Verb They relatives visiting were playing children = They were relatives who came to visit.

  35. Syntax: Ambiguity Sentence NounPhrase VerbPhrase NounPhrase AuxiliaryVerb Verb Pronoun Noun They relatives visiting were cooking dinner = They were on a visit to relatives.

  36. Syntax: Parsers Parsing is the process of, given a grammar and a sentence, finding the parse tree. There may be multiple parse trees for a given sentence (a phenomenon called syntactic ambiguity). Common techniques: • naive: Bottom up trial and error • more advanced: with dynamic programming, stochastic

  37. Syntax: Part Of Speech Sentence NounPhrase VerbPhrase VerbPhrase NounPhrase NounPhrase NounPhrase Noun Verb Subordinator ProperNoun Determiner ProperNoun Verb Bob who likes cat Alice the stole The part of speech (POS) of “Bob”

  38. Syntax: Part Of Speech The part of speech (POS) of a terminal symbol in a parse tree is the parent node of the terminal symbol. The POS of a word determines the role of the word in the sentence. Proper nouns, nouns, verbs, prepositions, adjectives, ... Noun Verb Subordinator ProperNoun Determiner ProperNoun Verb Bob who likes cat Alice the stole The part of speech (POS) of “Bob”

  39. Syntax: POS Tagging POS tagging is the process of, given a sentence and a (partial) grammar, determining the part of speech of each word. The train that Alice took was heading to Rome. The/Det train/Noun that/Subordinator Alice/ProperNountook/Verb was/AuxiliaryVerbheading/Verb to/Preposition Rome/ProperName.

  40. Syntax: POS Tagging POS tagging is the process of, given a sentence and a (partial) grammar, determining the part of speech of each word. POS tagging isnot trivial, because the same word can appear with different POS: • Some words belong to two word classes (“run” as a verb or noun) • Some word forms may be ambiguous: • Sound sounds sound sound. Common techniques: • rule-based • statistical • using dynamic programming

  41. Syntax: Word Classes A word class is a set of words sharing the same POS. E.g.: Nouns, verbs, subordinators, determiners,... An open word class is a word class that can be extended: • Proper nouns: Alice, Fabian, Elvis, ... • Nouns: computer, weekend, ... • Adjectives: self-reloading fridge, ... • Verbs: download, ... A closed word class is a word class that cannot be extended: • Pronouns: he, she, it, this, ... (≈ what can replace a noun) • Determiners: the, a, these, your, my, ... (≈ what goes before a noun) • Prepositions: in, with, on, ... (≈ what goes before determiner + noun) • Subordinators: who, whose, that, which, because, ... • (≈ what introduces a sub-ordinate sentence)

  42. Syntax: Stop words • Words of the closed word classes are often perceived • as contributing less to the meaning of a sentence. • Words closed word classes often perceived • contributing less meaning sentence. Therefore, these words are often ignored in web search. Words that are ignored in a Web search are called stop words. a, the, in, those, could, can, not, ... This may not always be reasonable “Vacation outside Europe” “VacationEurope”

  43. Fields of Linguistics “I can resist anything but temptation” [Oscar Wilde] • /aikεnrəzistεniθiŋ.../ • (Phonology, the • study of pronunciation) • temptations / temptation • (Morphology, the • study of word constituents) ✔ ✔ Sentence ✔ • “I” = • (Semantics, the • study of meaning) ✔ Verbal phrase Noun phrase • (Syntax, the • study of grammar)

  44. Homework • Phonology: Find two French words that sound the same, but are written differently (homophones) • Morphology: Find an example Web search query where stemming to the stem (most aggressive variant) is too aggressive. • Semantics: Make a taxonomy of at least 5 words in a thematic domain of your choice • Syntax: POS-tag the sentence The quick brown fox jumps over the lazy dog • Hand in by e-mail to f.m.suchanek@gmail.com or on paper in the next session

More Related