
Analyzing unstructured text with topic models

Analyzing unstructured text with topic models. Mark Steyvers Dep. of Cognitive Sciences & Dep. of Computer Science University of California, Irvine. collaborators: Padhraic Smyth, UC Irvine; Tom Griffiths UC Berkeley. Analyzing Unstructured Text. Pennsylvania Gazette (1728-1800)


Presentation Transcript


  1. Analyzing unstructured text with topic models Mark Steyvers Dep. of Cognitive Sciences & Dep. of Computer Science University of California, Irvine collaborators: Padhraic Smyth, UC Irvine; Tom Griffiths UC Berkeley

  2. Analyzing Unstructured Text
     • Pennsylvania Gazette (1728-1800): 80,000 articles
     • Enron: 250,000 emails
     • NYT: 330,000 articles
     • NSF/NIH: 100,000 grants
     • AOL queries: 20,000,000 queries, 650,000 users
     • Medline: 16 million articles

  3. Topic Models and Text Analysis
     • Can answer a number of questions:
       • What is in this corpus?
       • What is in this document, paragraph, or sentence?
       • What does this person/group of people write about?
       • What tags are appropriate for this document?
       • What are the topical trends over time?

  4. Topic Models
     • Automatic and unsupervised extraction of semantic themes from large text collections
     • Widely used model in machine learning and text mining
     • pLSI model: Hofmann (1999)
     • LDA model: Blei, Ng, and Jordan (2001, 2003)
     • LDA with Gibbs sampling: Griffiths and Steyvers (2003, 2004)

  5. Basic Assumptions
     • Each topic is a distribution over words
     • Each document is a mixture of topics
     • Each word in a document originates from a single topic

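  6. Model
     $P(\text{word} \mid \text{document}) = \sum_{\text{topic}} P(\text{word} \mid \text{topic})\, P(\text{topic} \mid \text{document})$
     • P(word | topic): a topic is a probability distribution over words
     • P(topic | document): topic weights for each document
     • Automatically learned from the text corpus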
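The mixture formula on this slide can be checked with a few lines of arithmetic. The sketch below uses a made-up two-topic, four-word model; the vocabulary, the probabilities, and the function name `word_distribution` are illustrative assumptions, not values from the presentation.

```python
# Hypothetical two-topic model over a four-word vocabulary (illustrative numbers).
vocab = ["money", "loan", "river", "stream"]

# P(word | topic): each row is one topic's distribution over the vocabulary.
p_word_given_topic = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0: finance words
    [0.0, 0.0, 0.5, 0.5],  # topic 1: river words
]

# P(topic | document): topic weights for a single document.
p_topic_given_doc = [0.6, 0.4]


def word_distribution(p_w_t, p_t_d):
    """P(word | document) = sum over topics of P(word | topic) * P(topic | document)."""
    n_words = len(p_w_t[0])
    return [
        sum(p_t_d[t] * p_w_t[t][w] for t in range(len(p_t_d)))
        for w in range(n_words)
    ]


p_word_given_doc = word_distribution(p_word_given_topic, p_topic_given_doc)
for word, p in zip(vocab, p_word_given_doc):
    print(f"{word}: {p:.2f}")
```

Because each topic row and the topic weights sum to one, the resulting word distribution for the document also sums to one.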

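  7. Toy Example
     Topics, topic weights, and documents with topic assignments (the digit after each word is the topic it was assigned to):
     • Document 1 (topic 1 weight 1.0): MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1 MONEY1 ....
     • Document 2 (topic 1 weight .6, topic 2 weight .4): RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1 MONEY1 RIVER2 MONEY1 BANK2 LOAN1 MONEY1 ....
     • Document 3 (topic 2 weight 1.0): RIVER2 BANK2 STREAM2 BANK2 RIVER2 BANK2 ....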
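The toy example can be read as a generative process: for every word, first pick a topic according to the document's topic weights, then pick a word from that topic. The following is a minimal sketch of that process. The uniform word probabilities inside each topic and the function name `generate_document` are assumptions made for illustration; the slide does not give the topic distributions.

```python
import random

# Assumed topic distributions (uniform over three words each); not from the slide.
topics = {
    1: {"MONEY": 1 / 3, "BANK": 1 / 3, "LOAN": 1 / 3},
    2: {"RIVER": 1 / 3, "BANK": 1 / 3, "STREAM": 1 / 3},
}


def generate_document(topic_weights, n_words, rng):
    """Sample (word, topic) pairs: choose a topic per word, then a word from that topic."""
    topic_ids = list(topic_weights)
    doc = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=[topic_weights[t] for t in topic_ids])[0]
        words = list(topics[z])
        w = rng.choices(words, weights=[topics[z][x] for x in words])[0]
        doc.append((w, z))
    return doc


rng = random.Random(0)
# Topic weights for the three toy documents: pure topic 1, mixed, pure topic 2.
for weights in ({1: 1.0, 2: 0.0}, {1: 0.6, 2: 0.4}, {1: 0.0, 2: 1.0}):
    doc = generate_document(weights, 12, rng)
    print(" ".join(f"{w}{z}" for w, z in doc))
```

Note that BANK appears in both topics, so the same word can carry either assignment depending on which topic generated it.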

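  8. Statistical Inference
     Topics, topic weights, and topic assignments are now all unknown (?) and must be inferred from the observed words:
     • Document 1 (topic weights ?): MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? ....
     • Document 2 (topic weights ?): RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? ....
     • Document 3 (topic weights ?): RIVER? BANK? STREAM? BANK? RIVER? BANK? ....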

  9. Statistical Inference
     • Exact inference is intractable
     • Markov chain Monte Carlo (MCMC) with Gibbs sampling
       • scalable to large document collections (e.g. all of Wikipedia)
       • parallelizable
     • Form of dimensionality reduction
     • Number of topics T = 50…2000
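To make the Gibbs sampling step concrete, here is a small sketch of a collapsed Gibbs sampler for LDA, in the spirit of the approach cited earlier (Griffiths and Steyvers). It is a toy, single-threaded illustration, not the authors' implementation: the hyperparameters `alpha` and `beta`, the iteration count, and all function and variable names are assumptions.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.5, beta=0.1, n_iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                              # tokens per topic
    assignments = []                                          # topic of every token

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = [rng.randrange(n_topics) for _ in doc]
        for w, t in zip(doc, z_d):
            update(d, w, t, +1)
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                update(d, w, assignments[d][i], -1)
                # P(topic k | all other assignments) is proportional to
                # (topic-word count + beta) / (topic total + V * beta) * (doc-topic count + alpha)
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = t
                update(d, w, t, +1)
    return assignments, doc_topic, topic_word


# The three toy documents, encoded as word ids.
vocab = ["MONEY", "LOAN", "BANK", "RIVER", "STREAM"]
docs = [
    [0, 2, 2, 1, 2, 0, 2, 0, 2, 1, 1, 2, 0],
    [3, 0, 2, 4, 2, 2, 0, 3, 0, 2, 1, 0],
    [3, 2, 4, 2, 3, 2],
]
assignments, doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=len(vocab))
for counts in doc_topic:
    print(counts)  # number of tokens assigned to each topic, per document
```

Only count tables are kept in memory, which is one reason this style of sampler scales to large collections; topic and document distributions can be estimated from the final counts.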

  10. Example Topics from New York Times
     • Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
     • Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
     • Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
     • Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP

  11. Learning multiple meanings of words
     • PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
     • PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
     • TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
     • JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
     • HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
     • STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

  12. Demographic Analysis of Search Queries

  13. AOL dataset
     • Dataset:
       - 20,000,000+ web queries
       - 650,000+ users
     • Users were given an “anonymous” user ID
     • No demographics in this dataset

  14. Example query log from user #2178
     ID | Query | Date/Time | URL clicked
     2178 | dog eats uncooked pasta | 2006-05-26 15:31:56 |
     2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.twodogpress.com
     2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.canismajor.com
     2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://kitchen.robbiehaf.com
     2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.dog-first-aid-101.com
     2178 | inducing dog vomiting | 2006-05-26 15:38:36 |
     2178 | walmart | 2006-05-12 12:39:52 | http://www.walmart.com
     2178 | sears | 2006-05-12 12:44:22 | http://www.sears.com
     2178 | target | 2006-05-12 17:05:36 | http://www.target.com
     2178 | babycenter.com | 2006-05-12 17:43:59 | http://www.babycenter.com
     2178 | google | 2006-05-16 10:54:39 | http://www.google.com
     2178 | fit pregnancy | 2006-05-16 15:34:23 |
     2178 | baby center | 2006-05-16 15:37:22 |
     2178 | yahoo.com | 2006-05-18 17:11:05 | http://www.yahoo.com
     2178 | applebee's carside | 2006-05-19 19:21:08 | http://www.applebees.com
     2178 | baby names | 2006-05-20 15:02:38 | http://www.babynames.com
     2178 | baby names | 2006-05-20 15:02:38 | http://www.babynamesworld.com
     2178 | baby names | 2006-05-20 15:02:38 | http://www.thinkbabynames.com
     2178 | mortgage calculator | 2006-05-24 14:39:05 | http://www.bankrate.com
     2178 | us zip codes | 2006-05-25 21:26:47 | http://www.usps.com
     2178 | us zip codes | 2006-05-25 21:26:47 | http://www.usps.com

  15. Another Query Database…
     • Not publicly available
     • Dataset:
       • 250,000+ users
       • 411,000+ queries
     • Age and gender of users are known:
       • Age brackets: 0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+

  16. Topic modeling of queries • Each user searches for a mixture of topics • Each topic is a probability distribution over query words
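Treating each user as a "document" means pooling that user's query words before fitting the topic model. The sketch below shows one way to do that. The row layout `(user_id, query, timestamp)`, the whitespace tokenization, and the de-duplication of rows that differ only in the clicked URL are all assumptions; the slides do not describe the preprocessing.

```python
from collections import defaultdict


def user_documents(rows):
    """Pool query words per user. `rows` holds (user_id, query, timestamp) tuples."""
    docs = defaultdict(list)
    seen = set()
    for user_id, query, timestamp in rows:
        key = (user_id, query, timestamp)
        if key in seen:  # assumed: one log row per clicked URL, so count the query once
            continue
        seen.add(key)
        docs[user_id].extend(query.lower().split())
    return dict(docs)


# A few rows in the shape of the example query log for user #2178.
rows = [
    (2178, "inducing dog vomiting", "2006-05-26 15:32:46"),
    (2178, "inducing dog vomiting", "2006-05-26 15:32:46"),
    (2178, "baby names", "2006-05-20 15:02:38"),
]
print(user_documents(rows))
```

The resulting word lists per user can then be fed to a topic model exactly like ordinary documents.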

  17. Four example topics (out of 200)
     Each topic is a probability distribution over words; the most likely words are listed first.
     • auto car parts cars used ford honda truck toyota
     • party store wedding birthday jewelry ideas cards cake gifts
     • webmd cymbalta xanax gout vicodin effexor prednisone lexapro ambien
     • hannah montana zac efron disney high school musical mileycyrus hilary duff

  18. User = mixture of topics
     [Figure: the four example topics from the previous slide, with example topic weights (80%, 20%, 100%) linking them to User #7654 and User #246]

  19. Topic Analysis
     • Find likely topics for each demographic bucket
     • Find likely demographics given topics
     • What’s on the mind of people in different age groups?
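Both directions of this analysis can be computed from per-user topic weights once each user's age bracket is known. The sketch below is one plausible way to do it, not the procedure used in the presentation: it averages topic weights within a bracket for P(topic | bracket) and normalizes the per-bracket topic mass for P(bracket | topic), weighting every user equally. The user records and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical inferred topic weights per user, with a known age bracket.
users = [
    {"age": "13-17", "topics": {"myspace": 0.7, "recipes": 0.3}},
    {"age": "13-17", "topics": {"myspace": 0.9, "recipes": 0.1}},
    {"age": "35-44", "topics": {"myspace": 0.2, "recipes": 0.8}},
]


def topic_given_bucket(users):
    """P(topic | bracket): average topic weight over the users in each bracket."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for u in users:
        counts[u["age"]] += 1
        for topic, weight in u["topics"].items():
            sums[u["age"]][topic] += weight
    return {b: {t: s / counts[b] for t, s in ts.items()} for b, ts in sums.items()}


def bucket_given_topic(users, topic):
    """P(bracket | topic): share of the topic's total weight contributed by each bracket."""
    mass = defaultdict(float)
    for u in users:
        mass[u["age"]] += u["topics"].get(topic, 0.0)
    total = sum(mass.values())
    return {b: m / total for b, m in mass.items()}


print(topic_given_bucket(users))
print(bucket_given_topic(users, "myspace"))
```

Because brackets differ in size, P(bracket | topic) reflects both how strongly a bracket favors the topic and how many users fall in that bracket.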

  20. “poems” topic

  21. “myspace” topic

  22. “sports” topic

  23. “MTV” topic

  24. “Clothing Stores” topic

  25. “Hairstyles” topic

  26. “recipes” topic

  27. Results
     • Topic models give quick summaries of demographic trends in query datasets
     • Other potential applications:
       • e.g. blogs, social networking sites, email, etc.
       • clinical data, e.g. therapy discussions

  28. Analyzing Emails: who writes on what topics?

  29. Enron email data
     • 250,000 emails
     • 5000 authors
     • 1999-2002

  30. Author-topic models
     • We can learn the association between authors of documents and topics
     • Assume each author works on a mixture of topics
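The assumption that each author works on a mixture of topics can be illustrated as a generative process. In the sketch below, each word of a document picks one of the document's authors uniformly at random, then a topic from that author's mixture, then a word from that topic. The uniform choice of author, the toy distributions, and all names are assumptions for illustration; the slide does not spell out these details.

```python
import random

# Hypothetical author-topic and topic-word distributions.
author_topics = {
    "author_a": {"energy": 1.0, "legal": 0.0},
    "author_b": {"energy": 0.3, "legal": 0.7},
}
topic_words = {
    "energy": {"power": 0.5, "gas": 0.5},
    "legal": {"contract": 0.5, "court": 0.5},
}


def generate_words(authors, n_words, rng):
    """Return (word, author, topic) triples for a document written by `authors`."""
    out = []
    for _ in range(n_words):
        a = rng.choice(authors)  # assumed: each word picks a co-author uniformly
        topics = list(author_topics[a])
        t = rng.choices(topics, weights=[author_topics[a][k] for k in topics])[0]
        words = list(topic_words[t])
        w = rng.choices(words, weights=[topic_words[t][x] for x in words])[0]
        out.append((w, a, t))
    return out


rng = random.Random(0)
for word, author, topic in generate_words(["author_a", "author_b"], 8, rng):
    print(word, author, topic)
```

Inference runs this process in reverse: given only the words and the author list of each document, it recovers which topics each author tends to write about.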

  31. ENRON Email: who writes on certain topics? ... But also over senders (authors) of email. Most likely authors listed at the top

  32. Enron email: two example topics (T=100)

  33. Detecting Papers on Unusual Topics for Authors
     • We can calculate perplexity (unusualness) for words in a document given an author
     Papers ranked by perplexity for M. Jordan:
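Perplexity here can be computed with its standard definition: the exponential of the negative average log-probability of the document's words under the author's model. The sketch below assumes the word probability is the usual topic mixture, P(word | author) = sum over topics of P(word | topic) P(topic | author); the toy numbers and the function name are illustrative, and the slides do not give the exact formula used.

```python
import math


def perplexity(word_ids, p_word_given_topic, p_topic_given_author):
    """exp(-(1/N) * sum of log P(word | author)) over the N words of a document."""
    log_lik = 0.0
    for w in word_ids:
        p = sum(
            p_t * p_word_given_topic[t][w]
            for t, p_t in enumerate(p_topic_given_author)
        )
        log_lik += math.log(p)
    return math.exp(-log_lik / len(word_ids))


# Toy model: topic 0 covers words 0-1, topic 1 covers words 2-3.
p_word_given_topic = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
author = [0.9, 0.1]  # this author writes mostly about topic 0

typical_paper = [0, 1, 0]   # words from the author's usual topic
unusual_paper = [2, 3, 2]   # words from the topic the author rarely uses

print(perplexity(typical_paper, p_word_given_topic, author))
print(perplexity(unusual_paper, p_word_given_topic, author))
```

The paper built from the author's rarely used topic gets the higher perplexity, which is what makes the score usable for ranking papers by unusualness.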

  34. Author Separation
     Can the model attribute words to authors correctly within a document?

  35. Application: Faculty Browser

  36. Faculty Browser
     • Automatically analyzes computer science papers by UC San Diego and UC Irvine researchers
     • Finds topically related researchers
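One simple way to find topically related researchers is to compare their topic-weight vectors. The sketch below uses cosine similarity, which is an assumption: the slides do not say which similarity measure the Faculty Browser uses. The researcher names and weights are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic-weight vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm


def most_similar(name, researchers, k=2):
    """Return the k researchers whose topic weights are closest to `name`'s."""
    scored = [
        (cosine(researchers[name], weights), other)
        for other, weights in researchers.items()
        if other != name
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


# Hypothetical topic weights over three topics.
researchers = {
    "researcher_a": [0.8, 0.1, 0.1],
    "researcher_b": [0.7, 0.2, 0.1],
    "researcher_c": [0.1, 0.1, 0.8],
}
print(most_similar("researcher_a", researchers, k=1))  # ['researcher_b']
```

Other measures over probability distributions, such as a symmetrized KL divergence, would fit the same interface.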

  37. One topic: most prolific researchers in this topic

  38. One researcher: topics this researcher is interested in; other researchers with similar topical interests

  39. Inferred network of researchers connected through topics

  40. Modeling Extensions

  41. Entity-topic modeling
     • 330,000 articles, 2000-2002
     • Who is mentioned in what context?

  42. Extracted Named Entities
     Example article text with extracted entities:
     Three investigations began Thursday into the securities and exchange_commission's choice of william_webster to head a new board overseeing the accounting profession. house and senate_democrats called for the resignations of both judge_webster and harvey_pitt, the commission's chairman. The white_house expressed support for judge_webster as well as for harvey_pitt, who was harshly criticized Thursday for failing to inform other commissioners before they approved the choice of judge_webster that he had led the audit committee of a company facing fraud accusations. “The president still has confidence in harvey_pitt,” said dan_bartlett, bush's communications director …
     • Used standard algorithms to extract named entities:
       • People
       • Places
       • Organizations

  43. Standard Topic Model with Entities

  44. Example of Extracted Entity-Topic Network

  45. Topic Trends
     [Figure: proportion of words assigned to a topic in each time slice, shown for three topics: Tour-de-France, Quarterly Earnings, Anthrax]
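The quantity plotted on this slide, the proportion of words assigned to a topic in each time slice, is a simple ratio of counts. The sketch below computes it from per-token topic assignments; the input layout `(time_slice, topic)` and the example data are assumptions for illustration.

```python
from collections import Counter


def topic_trend(assignments, topic):
    """Fraction of word tokens assigned to `topic` in each time slice.

    `assignments` is an iterable of (time_slice, topic_id) pairs, one per word token.
    """
    totals = Counter()
    hits = Counter()
    for time_slice, t in assignments:
        totals[time_slice] += 1
        if t == topic:
            hits[time_slice] += 1
    return {s: hits[s] / totals[s] for s in sorted(totals)}


# Hypothetical token-level assignments for two quarters.
tokens = [
    ("2001-Q3", "anthrax"), ("2001-Q3", "earnings"),
    ("2001-Q4", "anthrax"), ("2001-Q4", "anthrax"),
    ("2001-Q4", "earnings"), ("2001-Q4", "anthrax"),
]
print(topic_trend(tokens, "anthrax"))  # {'2001-Q3': 0.5, '2001-Q4': 0.75}
```

Normalizing by the total number of tokens in each slice keeps the trend comparable across periods with different amounts of text.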

  46. Learning Topic Hierarchies (example: Psych Review abstracts)
     [Figure: topic hierarchy; node words in extraction order] THE OF AND TO IN A IS A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC CONDITIONIN STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES

  47. Hidden Markov Topics Model
     • Syntactic dependencies → short-range dependencies
     • Semantic dependencies → long-range dependencies
     [Figure: graphical model with document topic weights θ, topic assignments z1 z2 z3 z4, words w1 w2 w3 w4, and syntactic states s1 s2 s3 s4]
     • Semantic state: generate words from topic model
     • Syntactic states: generate words from HMM
     (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
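The division of labor between the HMM and the topic model can be sketched as a generative process: an HMM walks through word classes, and whenever it enters the designated semantic state the word comes from the document's topics; otherwise the word comes from the current syntactic state. The states, transition probabilities, and word lists below are invented for illustration and are not taken from the presentation.

```python
import random

SEMANTIC = "SEM"  # the one HMM state that delegates to the topic model

# Hypothetical HMM transition probabilities between word classes.
transitions = {
    "DET": {"SEM": 1.0},
    "SEM": {"VERB": 0.5, "DET": 0.5},
    "VERB": {"DET": 1.0},
}
# Hypothetical emissions: syntactic states use class words, topics use content words.
class_words = {"DET": ["the", "a"], "VERB": ["is", "shows"]}
topic_words = {0: ["kernel", "svm"], 1: ["neuron", "synaptic"]}


def generate(theta, n_words, rng, start="DET"):
    """Generate words; `theta` maps topic id -> weight for this document."""
    state, out = start, []
    for _ in range(n_words):
        if state == SEMANTIC:
            # Long-range, document-level choice: draw a topic, then a content word.
            z = rng.choices(list(theta), weights=list(theta.values()))[0]
            out.append(rng.choice(topic_words[z]))
        else:
            # Short-range, syntactic choice: draw a word from the current class.
            out.append(rng.choice(class_words[state]))
        nxt = transitions[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return out


rng = random.Random(0)
print(" ".join(generate({0: 0.9, 1: 0.1}, 9, rng)))
```

Changing `theta` changes which content words appear, while the function-word skeleton produced by the HMM stays the same across documents.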

  48. NIPS Semantics
     • KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
     • NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
     • IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
     • EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
     • MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
     • DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
     • STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
     NIPS Syntax
     • IN WITH FOR ON FROM AT USING INTO OVER WITHIN
     • # * I X T N - C F P
     • IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
     • SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
     • HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
     • MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
     • USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
