
Practical Natural Language Processing




Presentation Transcript


  1. Practical Natural Language Processing. Catherine Havasi, Luminoso / MIT Media Lab, havasi@luminoso.com. (Title slide, decorated with fragments of example text such as "Never saw", "Couldn't understand", "wet dog", "It was good, but there wasn't", and similar snippets.)

  2. There are notes! luminoso.com/blog

  3. Too much text?

  4. Wouldn’t it be cool if we could talk to a computer?

  5. This is hard. It takes a lot of knowledge to understand language

  6. I made her duck.

  7. I made her duck • I cooked waterfowl for her benefit (to eat) • I cooked waterfowl belonging to her • I created the (plaster?) duck she owns • I made sure she got her head down • I waved my magic wand and turned her into undifferentiated waterfowl

  8. Language is Recursive • You can build new concepts out of old ones indefinitely

  9. Language is Creative • It smelled terrible. • It was really stuffy. • Smells like an old house. • It was like it had been shut away for a long time. • Was like a wet dog. • Smelled really musty. • Reminds me of a dusty closet. • Really stale.

  10. A multi-lingual world

  11. Linguistics to the rescue?

  12. Linguistics to the rescue? --Randall Munroe, xkcd.org/114

  13. “Much Debate”

  14. We just want to get things done.

  15. So, what is state of the art?

  16. The NLP process • Take in a string of language • Where are the words? • What are the root forms of these words? • How do the words fit together? • Which words look important? • What decisions should we make based on these words?

  17. The NLP process (simplified) • Fake understanding

  18. The NLP process (simplified) • Fake understanding • Until you make understanding

  19. Example: Detecting bad words • You want to flag content with certain bad words in it • Don’t just match sequences of characters • That would lead to this classic mistake

  20. Many forms of fowl language • Suppose we want people to not say the word “duck”

  21. Many forms of fowl language “What the duck’s wrong with this” “It’s all ducked up” “Un-ducking-believable”

  22. Step 1: break text into tokens
  it  's  all  ducked  up
  un  ducking  believable

  23. Step 2: replace tokens with their root forms
  it → it, 's → is, all → all, ducked → duck, up → up
  un → un, ducking → duck, believable → believe

  24. In a few lines of Python:
  >>> import nltk
  >>> text = "It's all ducked up. Un-ducking-believable."
  >>> tokens = nltk.wordpunct_tokenize(text.lower())
  >>> tokens
  ['it', "'", 's', 'all', 'ducked', 'up', '.', 'un', '-', 'ducking', '-', 'believable', '.']
  >>> stemmer = nltk.stem.PorterStemmer()
  >>> [stemmer.stem(token) for token in tokens]
  ['it', "'", 's', 'all', 'duck', 'up', '.', 'un', '-', 'duck', '-', 'believ', '.']
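  With the stems in hand, the flagging decision itself is one more line. A minimal sketch continuing the session above (the BAD_STEMS set is an illustrative stand-in, not from the slides):
  >>> BAD_STEMS = {'duck'}
  >>> any(stemmer.stem(token) in BAD_STEMS for token in tokens)  # flag this text?
  True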

  25. Stemmers can spell things oddly • duck → duck • ducking → duck • believe → believ • believable → believ • happy → happi • happiness → happi

  26. Stemmers can mix up some words • sincere → sincer • sincerity → sincer • universe → univers • university → univers

  27. The NLP tool chain • Some source of text (a database, a labeled corpus, Web scraping, Twitter...) • Tokenizer: breaks text into word-like things • Stemmer: finds words with the same root • Tagger: identifies parts of speech • Chunker: identifies key phrases • Something that makes decisions based on these results
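  A rough sketch of how those stages line up in NLTK (a minimal example, assuming the tokenizer, tagger, and chunker model data have been fetched once with nltk.download(); the sentence is illustrative):
  >>> import nltk
  >>> sentence = "The ducks were swimming in Boston Harbor."
  >>> tokens = nltk.word_tokenize(sentence)                        # tokenizer: word-like things
  >>> stems = [nltk.stem.PorterStemmer().stem(t) for t in tokens]  # stemmer: root forms
  >>> tagged = nltk.pos_tag(tokens)                                # tagger: (word, part-of-speech) pairs
  >>> tree = nltk.ne_chunk(tagged)                                 # chunker: groups key phrases / named entities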

  28. Useful toolkits • NLTK (Python) • LingPipe (Java) • Stanford Core NLP (Java; many wrappers) • FreeLing (C++)

  29. The statistics of text • Often we want to understand the differences between different categories of text • Different genres • Different writers • Different forms of writing

  30. Collecting word counts • Start with a corpus of text • Brown corpus (1961) • British National Corpus (1993) • Google Books (2009, 2012)

  31. Collecting word counts
  >>> import nltk
  >>> from nltk.corpus import brown
  >>> from collections import Counter
  >>> counts = Counter(brown.words())
  >>> counts.most_common()[:20]
  [('the', 62713), (',', 58334), ('.', 49346), ('of', 36080),
   ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536),
   ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841),
   ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012),
   ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]

  32. Collecting word counts
  >>> for category in brown.categories():
  ...     frequency = Counter(brown.words(categories=category))
  ...     for word in frequency:
  ...         frequency[word] /= counts[word] + 100.
  ...     # format the results nicely
  ...     print("%20s -> %s" % (category,
  ...         ', '.join(word for word, prop
  ...                   in frequency.most_common()[:10])))

  33. Prominent words by category
  editorial -> Berlin, Khrushchev, East, editor, nuclear, West, Soviet, Podger, Kennedy, budget
  fiction -> Kate, Winston, Scotty, Rector, Hans, Watson, Alex, Eileen, doctor, !
  government -> fiscal, Rhode, Act, Government, shelter, States, tax, Island, property, shall
  hobbies -> feed, clay, Hanover, site, your, design, mold, Class, Junior, Juniors
  news -> Mrs., Monday, Mantle, yesterday, Dallas, Texas, Kennedy, Tuesday, jury, Palmer
  religion -> God, Christ, Him, Christian, Jesus, membership, faith, sin, Church, Catholic
  reviews -> music, musical, Sept., jazz, Keys, audience, singing, Newport, cholesterol
  science_fiction -> Ekstrohm, Helva, Hal, B'dikkat, Mercer, Ryan, Earth, ship, Mike, Hesperus

  34. Classifying text • We can take text that’s categorized and figure out its word frequencies • Wouldn’t it be more useful to look at word frequencies and figure out the category?

  35. Example: Spam filtering • Paul Graham's "A Plan for Spam" (2002), the essay that inspired SpamBayes • Remember what e-mail was like before 2002? • A simple classifier (Naive Bayes) changed everything

  36. Supervised classification • Distinguish things from other things based on examples

  37. Applications • Spam filtering • Detecting important e-mails • Topic detection • Language detection • Sentiment analysis

  38. Naive Bayes • We know the probability of various data given a category • Estimate the probability of the category given the data • Assume all features of the data are independent (that’s the naive part) • It’s simple • It’s fast • Sometimes it even works

  39. A quick Naive Bayes experiment • nltk.corpus.movie_reviews: movie reviews labeled as ‘pos’ or ‘neg’ • Define document_features(doc) to describe a document by the words it contains
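  A minimal version of that experiment (the bag-of-words document_features and the 1800/200 train/test split here are illustrative choices, not spelled out on the slide):
  >>> import nltk, random
  >>> from nltk.corpus import movie_reviews
  >>> def document_features(doc):
  ...     # describe a document by the words it contains
  ...     return {word: True for word in doc}
  >>> docs = [(document_features(movie_reviews.words(fid)), cat)
  ...         for cat in movie_reviews.categories()
  ...         for fid in movie_reviews.fileids(cat)]
  >>> random.shuffle(docs)
  >>> train, test = docs[:1800], docs[1800:]
  >>> classifier = nltk.NaiveBayesClassifier.train(train)
  >>> nltk.classify.accuracy(classifier, test)  # fraction of held-out reviews labeled correctly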

  40. Statistics beyond single words • Many interesting things about text are longer than one word • bigram: a sequence of two tokens • collocation: a bigram that seems to be more than the sum of its parts

  41. When is a bigram interesting?
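  One standard answer, and the one NLTK's collocation tools implement, is to score how much more often the two tokens occur together than chance would predict, for example with pointwise mutual information. A sketch on the Brown corpus (the frequency cutoff of 5 is an illustrative choice):
  >>> import nltk
  >>> from nltk.corpus import brown
  >>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
  >>> bigram_measures = BigramAssocMeasures()
  >>> finder = BigramCollocationFinder.from_words(brown.words())
  >>> finder.apply_freq_filter(5)            # ignore bigrams seen fewer than 5 times
  >>> finder.nbest(bigram_measures.pmi, 10)  # the ten bigrams with the highest PMI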

  42. Guess the text

  43. Guess the text >>> from nltk.book import text4 >>> text4.collocations() United States; fellow citizens; four years; years ago; Federal Government; General Government; American people; Vice President; Old World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; every citizen; Indian tribes; public debt; one another; foreign nations; political parties

  44. Guess the text >>> from nltk.book import text3 >>> text3.collocations() said unto; pray thee; thou shalt; thou hast; thy seed; years old; spake unto; thou art; LORD God; every living; God hath; begat sons; seven years; shalt thou; little ones; living creature; creeping thing; savoury meat; thirty years; every beast

  45. Guess the text >>> from nltk.book import text6 >>> text6.collocations() BLACK KNIGHT; HEAD KNIGHT; Holy Grail; FRENCH GUARD; Sir Robin; Run away; CARTOON CHARACTER; King Arthur; Iesu domine; Pie Iesu; DEAD PERSON; Round Table; OLD MAN; dramatic chord; dona eis; eis requiem; LEFT HEAD; FRENCH GUARDS; music stops; Sir Launcelot

  46. What about grammar? • Eh • Too hard

  47. What about word meanings? • “I liked the movie.” • “I enjoyed the film.” • These have a lot more in common than “I” and “the”.

  48. WordNet • A dictionary for computers • Contains links between definitions • Words form (roughly) a tree
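  A quick sketch of following those links through NLTK's WordNet interface (exact outputs vary a little across WordNet versions, so they are omitted):
  >>> from nltk.corpus import wordnet as wn
  >>> wn.synsets('duck')                   # every synset (sense) the word belongs to
  >>> wn.synset('duck.n.01').definition()  # the definition attached to that sense
  >>> wn.synset('duck.n.01').hypernyms()   # one step up the (rough) tree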

  49. Synset: good, right, ripe
  Definition: most suitable or right for a particular purpose
  Glosses: "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes"
