Topic Models for Social Network Analysis and Bibliometrics

Andrew McCallum, Computer Science Department, University of Massachusetts Amherst. Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.


Presentation Transcript


  1. Topic Models for Social Network Analysis and Bibliometrics. Andrew McCallum, Computer Science Department, University of Massachusetts Amherst. Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.

  2. Goal: Mine actionable knowledge from unstructured text.

  3. From Text to Actionable Knowledge Spider Filter Data Mining IE Segment Classify Associate Cluster Discover patterns - entity types - links / relations - events Database Document collection Actionable knowledge Prediction Outlier detection Decision support

  4. Joint Inference Uncertainty Info Spider Filter Data Mining IE Segment Classify Associate Cluster Discover patterns - entity types - links / relations - events Database Document collection Actionable knowledge Emerging Patterns Prediction Outlier detection Decision support

  5. Discriminatively-trained undirected graphical models Conditional Random Fields [Lafferty, McCallum, Pereira] Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…] Complex Inference and Learning Just what we researchers like to sink our teeth into! Unified Model Spider Filter Data Mining IE Segment Classify Associate Cluster Discover patterns - entity types - links / relations - events Probabilistic Model Document collection Actionable knowledge Prediction Outlier detection Decision support

  6. Context Spider Filter Data Mining IE Segment Classify Associate Cluster Discover patterns - entity types - links / relations - events Database Document collection Joint inference among detailed steps Actionable knowledge Leveraging Text in Social Network Analysis Prediction Outlier detection Decision support

  7. Outline Social Network Analysis with Topic Models • Role Discovery (Author-Recipient-Topic Model, ART) • Group Discovery (Group-Topic Model, GT) • Enhanced Topic Models • Correlations among Topics (Pachinko Allocation, PAM) • Time Localized Topics (Topics-over-Time Model, TOT) • Markov Dependencies in Topics (Topical N-Grams Model, TNG) • Bibliometric Impact Measures enabled by Topics Multi-Conditional Mixtures

  8. Social Network in an Email Dataset

  9. Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]. Generative process: For each document: sample a distribution over topics, θ (example: 70% Iraq war, 30% US election). For each word in the document: sample a topic, z (example: Iraq war), then sample a word from the topic, w (example: “bombing”).
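The generative process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy topics and their word probabilities are invented (only "bombing" comes from the slide), the symmetric Dirichlet parameter `alpha` is an arbitrary choice, and the helper names `sample_dirichlet` and `generate_document` are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document with the LDA generative process.

    topic_word maps a topic name to a dict of word -> probability.
    Returns (theta, tokens), where tokens is a list of (topic, word) pairs.
    """
    topics = list(topic_word)
    # For each document: sample a distribution over topics, theta.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    tokens = []
    for _ in range(n_words):
        # For each word: sample a topic z from theta ...
        z = rng.choices(topics, weights=theta, k=1)[0]
        # ... then sample a word w from that topic's word distribution.
        words = list(topic_word[z])
        weights = [topic_word[z][w] for w in words]
        w = rng.choices(words, weights=weights, k=1)[0]
        tokens.append((z, w))
    return theta, tokens

# Toy topics, invented for illustration only.
TOY_TOPICS = {
    "iraq_war": {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},
    "us_election": {"vote": 0.5, "campaign": 0.3, "ballot": 0.2},
}

rng = random.Random(0)
theta, tokens = generate_document(TOY_TOPICS, alpha=1.0, n_words=10, rng=rng)
print([round(p, 2) for p in theta])
print(tokens)
```

In a real model the topic-word distributions are not given; they are what inference recovers from the observed words.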

  10. Example topics induced from a large collection of text [Tenenbaum et al] • JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE • SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES • BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY • FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED • STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL • MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE • DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN • WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER

  11. Example topics induced from a large collection of text [Tenenbaum et al] (same eight topic word lists as slide 10)

  12. From LDA to Author-Recipient-Topic (ART) [McCallum et al 2005]

  13. Inference and Estimation • Gibbs Sampling: • Easy to implement • Reasonably fast
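To give a feel for why Gibbs sampling counts as easy to implement, here is a minimal collapsed Gibbs sampler. Note the substitution: it is written for plain LDA, not for ART, whose sampler also conditions on authors and recipients. The function name `gibbs_lda`, the hyperparameter values and the toy corpus are assumptions made for illustration.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns (z, doc_topic, topic_word): topic assignments and count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]             # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                           # n(k)

    # Random initialization of the topic assignment of every token.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        z.append(zd)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional of z given all other assignments, with the
                # document-topic and topic-word multinomials integrated out.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                # Add the token back under its new assignment.
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word

# Toy corpus over a 4-word vocabulary (word ids 0..3), invented here.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, doc_topic, topic_word = gibbs_lda(toy_docs, n_topics=2, vocab_size=4)
print(doc_topic)
```

The whole sampler is a loop of "decrement counts, sample, increment counts", which is the sense in which such samplers are simple to write.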

  14. Enron Email Corpus • 250k email messages • 23k people Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT) From: debra.perlingiere@enron.com To: steve.hooser@enron.com Subject: Enron/TransAltaContract dated Jan 1, 2001 Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions. DP Debra Perlingiere Enron North America Corp. Legal Department 1400 Smith Street, EB 3885 Houston, Texas 77002 dperlin@enron.com

  15. Topics, and prominent senders / receivers discovered by ART Topic names, by hand

  16. Topics, and prominent senders / receivers discovered by ART Beck = “Chief Operations Officer” Dasovich = “Government Relations Executive” Shapiro = “Vice President of Regulatory Affairs” Steffes = “Vice President of Government Affairs”

  17. Comparing Role Discovery Traditional SNA ART Author-Topic connection strength (A,B) = distribution over recipients distribution over authored topics distribution over authored topics

  18. Comparing Role Discovery: Tracy Geaconne and Dan McCarty Traditional SNA ART Author-Topic Different roles Different roles Similar roles Geaconne = “Secretary” McCarty = “Vice President”

  19. Comparing Role Discovery: Lynn Blair and Kimberly Watson Traditional SNA ART Author-Topic Very similar Very different Different roles Blair = “Gas pipeline logistics” Watson = “Pipeline facilities planning”

  20. McCallum Email Corpus 2004 • January - October 2004 • 23k email messages • 825 people From: kate@cs.umass.edu Subject: NIPS and .... Date: June 14, 2004 2:27:41 PM EDT To: mccallum@cs.umass.edu There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt. Thanks, Kate

  21. Four most prominent topics in discussions with ____?

  22. Two most prominent topics in discussions with ____?

  23. Role-Author-Recipient-Topic Models

  24. Results with RART: People in “Role #3” in Academic Email • olc: lead Linux sysadmin • gauthier: sysadmin for CIIR group • irsystem: mailing list, CIIR sysadmins • system: mailing list for dept. sysadmins • allan: Prof., chair of “computing committee” • valerie: second Linux sysadmin • tech: mailing list for dept. hardware • steve: head of dept. I.T. support

  25. Roles for allan (James Allan) • Role #3 I.T. support • Role #2 Natural Language researcher Roles for pereira (Fernando Pereira) • Role #2 Natural Language researcher • Role #4 SRI CALO project participant • Role #6 Grant proposal writer • Role #10 Grant proposal coordinator • Role #8 Guests at McCallum’s house

  26. ART: Roles but not Groups Traditional SNA ART Author-Topic Not Not Block structured Enron TransWestern Division

  27. Outline Social Network Analysis with Topic Models • Role Discovery (Author-Recipient-Topic Model, ART) • Group Discovery (Group-Topic Model, GT) • Enhanced Topic Models • Correlations among Topics (Pachinko Allocation, PAM) • Time Localized Topics (Topics-over-Time Model, TOT) • Markov Dependencies in Topics (Topical N-Grams Model, TNG) • Bibliometric Impact Measures enabled by Topics Multi-Conditional Mixtures

  28. Groups and Topics • Input: • Observed relations between people • Attributes on those relations (text, or categorical) • Output: • Attributes clustered into “topics” • Groups of people, varying depending on topic

  29. Discovering Groups from Observed Set of Relations Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C) Admiration relations among six high school students.

  30. Adjacency Matrix Representing Relations Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
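The adjacency matrix for the twelve Acad relations listed on this slide can be built directly. The sketch below uses the students' initials as node names; the helper names `adjacency_matrix` and `reorder` are hypothetical. The second ordering (A, C, B, D, E, F) is not from the slide; it is chosen here because it appears to make the block structure in the data visible.

```python
STUDENTS = ["A", "B", "C", "D", "E", "F"]

# Acad(x, y) relations exactly as listed on the slide: x admires y.
ACAD = [
    ("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
    ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
    ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C"),
]

def adjacency_matrix(nodes, relations):
    """Return a 0/1 matrix m where m[i][j] == 1 iff nodes[i] relates to nodes[j]."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in relations:
        matrix[index[source]][index[target]] = 1
    return matrix

def reorder(matrix, nodes, new_order):
    """Permute rows and columns of the matrix to follow new_order."""
    idx = [nodes.index(name) for name in new_order]
    return [[matrix[i][j] for j in idx] for i in idx]

matrix = adjacency_matrix(STUDENTS, ACAD)
for name, row in zip(STUDENTS, matrix):
    print(name, row)

# Grouping students with identical rows side by side exposes the blocks.
grouped_order = ["A", "C", "B", "D", "E", "F"]
print()
for name, row in zip(grouped_order, reorder(matrix, STUDENTS, grouped_order)):
    print(name, row)
```

In the reordered matrix the ones fall into three 2×2 blocks: A and C admire B and D, B and D admire E and F, and E and F admire A and C.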

  31. Group Model: Partitioning Entities into Groups Stochastic Blockstructures for Relations [Nowicki, Snijders 2001] Beta Multinomial Dirichlet S: number of entities G: number of groups Binomial Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]
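A stochastic blockstructure model of this kind can be read generatively, and the following sketch samples from one. The mapping of the slide's labels to steps is an interpretation, not something the transcript states: group proportions from a Dirichlet, each entity's group from a multinomial, a relation probability for each ordered pair of groups from a Beta, and each relation as a single binomial (Bernoulli) draw. The function name and all hyperparameter values are hypothetical.

```python
import random

def sample_blockmodel(n_entities, n_groups, rng,
                      dirichlet_alpha=1.0, beta_a=0.5, beta_b=0.5):
    """Sample group assignments and a 0/1 relation matrix from a blockmodel.

    n_entities plays the role of S and n_groups the role of G on the slide.
    """
    # Group proportions ~ Dirichlet (via normalized Gamma draws).
    draws = [rng.gammavariate(dirichlet_alpha, 1.0) for _ in range(n_groups)]
    total = sum(draws)
    proportions = [d / total for d in draws]

    # Each entity's group ~ Multinomial(proportions).
    groups = rng.choices(range(n_groups), weights=proportions, k=n_entities)

    # Relation probability for each ordered pair of groups ~ Beta.
    link_prob = [
        [rng.betavariate(beta_a, beta_b) for _ in range(n_groups)]
        for _ in range(n_groups)
    ]

    # Each relation ~ Bernoulli with the probability of its group pair.
    relations = [[0] * n_entities for _ in range(n_entities)]
    for i in range(n_entities):
        for j in range(n_entities):
            if i != j and rng.random() < link_prob[groups[i]][groups[j]]:
                relations[i][j] = 1
    return groups, link_prob, relations

rng = random.Random(0)
groups, link_prob, relations = sample_blockmodel(6, 3, rng)
print(groups)
for row in relations:
    print(row)
```

Inference runs this process in reverse: given an observed relation matrix, it recovers group assignments under which relations within each pair of groups look alike.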

  32. Two Relations with Different Attributes Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C) Social Admiration: Soci(A, B) Soci(A, D) Soci(A, F) Soci(B, A) Soci(B, C) Soci(B, E) Soci(C, B) Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C) Soci(D, E) Soci(E, B) Soci(E, D) Soci(E, F) Soci(F, A) Soci(F, C) Soci(F, E)

  33. The Group-Topic Model: Discovering Groups and Topics Simultaneously [Wang, Mohanty, McCallum 2006] Beta Uniform Multinomial Dirichlet Dirichlet Binomial Multinomial

  34. Inference and Estimation • Gibbs Sampling: • Many r.v.s can be integrated out • Easy to implement • Reasonably fast We assume the relationship is symmetric.

  35. Dataset #1: U.S. Senate • 16 years of voting records in the US Senate (1989 – 2005) • a Senator may respond Yea or Nay to a resolution • 3423 resolutions with text attributes (index terms) • 191 Senators in total across 16 years S.543 Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes. Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2) Latest Major Action: 12/19/1991 Became Public Law No: 102-242. Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas and other 110 terms Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay……

  36. Topics Discovered (U.S. Senate) Mixture of Unigrams Group-Topic Model

  37. Groups Discovered (US Senate) Groups from topic Education + Domestic

  38. Senators Who Change Coalition the Most Dependent on Topic. E.g., Senator Shelby (D-AL) votes with the Republicans on Economic, with the Democrats on Education + Domestic, and with a small group of maverick Republicans on Social Security + Medicaid.

  39. Dataset #2: The UN General Assembly • Voting records of the UN General Assembly (1990 - 2003) • A country may choose to vote Yes, No or Abstain • 931 resolutions with text attributes (titles) • 192 countries in total • Also experiments later with resolutions from 1960-2003 Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions: In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

  40. Topics Discovered (UN) Mixture of Unigrams Group-Topic Model

  41. Groups Discovered (UN). The country list for each group is ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.

  42. Outline Social Network Analysis with Topic Models • Role Discovery (Author-Recipient-Topic Model, ART) • Group Discovery (Group-Topic Model, GT) • Enhanced Topic Models • Correlations among Topics (Pachinko Allocation, PAM) • Time Localized Topics (Topics-over-Time Model, TOT) • Markov Dependencies in Topics (Topical N-Grams Model, TNG) • Bibliometric Impact Measures enabled by Topics Multi-Conditional Mixtures

  43. Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003] • LDA 20, “images, motion, eyes”: visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location • LDA 100, “motion, some junk”: motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real [Plate diagram: variables α, θ, z, w, β, φ; plates N, n, T]

  44. Correlated Topic Model [Blei, Lafferty, 2005] Logistic normal; square matrix of pairwise correlations. [Plate diagram: variables z, w, β, φ; plates N, n, T]

  45. Pachinko Machine

  46. Pachinko Allocation Model [Li, McCallum, 2005] Thanks to Michael Jordan for suggesting the name. Given: directed acyclic graph (DAG); at each interior node: a Dirichlet over its children; and words at leaves. For each document: sample a multinomial from each Dirichlet. For each word in this document: starting from the root, sample a child from successive nodes, down to a leaf; generate the word at the leaf. Like a Polya tree, but DAG shaped, with arbitrary number of children. [Figure: model structure, not the graphical model; node rows 11 / 21, 22 / 31, 32, 33 / 41, 42, 43, 44, 45 / word1 to word8]

  47. Pachinko Allocation Model [Li, McCallum, 2005] DAG may have arbitrary structure • arbitrary depth • any number of children per node • sparse connectivity • edges may skip layers [Figure: model structure, not the graphical model; node rows 11 / 21, 22 / 31, 32, 33 / 41, 42, 43, 44, 45 / word1 to word8]

  48. Pachinko Allocation Model [Li, McCallum, 2005] • Upper interior nodes: distributions over distributions over topics... • Middle interior nodes: distributions over topics; mixtures, representing topic correlations • Lowest interior nodes: distributions over words (like “LDA topics”) Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet). [Figure: model structure, not the graphical model; node rows 11 / 21, 22 / 31, 32, 33 / 41, 42, 43, 44, 45 / word1 to word8]
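The per-document and per-word steps of Pachinko Allocation can be sketched as follows. The small DAG here is hypothetical: the transcript does not preserve the figure's edges, so the node names and connections are invented, keeping only the properties the slide states (interior nodes with a Dirichlet over their children, words at the leaves, and a DAG rather than a tree since `t2` has two parents). The symmetric `alpha` is also an assumption.

```python
import random

# Hypothetical DAG: interior node -> list of children. Leaves are words.
DAG = {
    "root": ["s1", "s2"],
    "s1": ["t1", "t2"],
    "s2": ["t2", "t3"],
    "t1": ["word1", "word2"],
    "t2": ["word2", "word3", "word4"],
    "t3": ["word4", "word5"],
}

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(dag, root, n_words, rng, alpha=1.0):
    """Generate one document; returns (words, paths taken through the DAG)."""
    # For each document: sample a multinomial from each interior node's Dirichlet.
    multinomials = {
        node: sample_dirichlet([alpha] * len(children), rng)
        for node, children in dag.items()
    }
    words, paths = [], []
    for _ in range(n_words):
        # For each word: start at the root and sample a child at each
        # successive interior node until a leaf is reached.
        node = root
        path = [node]
        while node in dag:
            node = rng.choices(dag[node], weights=multinomials[node], k=1)[0]
            path.append(node)
        # The leaf reached is the generated word.
        words.append(node)
        paths.append(path)
    return words, paths

rng = random.Random(0)
words, paths = generate_document(DAG, "root", n_words=8, rng=rng)
print(words)
print(paths[0])
```

Because the multinomials are resampled for every document, two documents can route through the same DAG with very different preferences.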
