
Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine

Entity-Based Data Mining from Spatio-Temporal Events and Text Sources Presentation at KD-D Program Review, Nov 18-19 2003. Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine {smyth, sharad}@ics.uci.edu www.datalab.uci.edu. Project Participants.


Presentation Transcript


  1. Entity-Based Data Mining from Spatio-Temporal Events and Text Sources. Presentation at KD-D Program Review, Nov 18-19 2003. Padhraic Smyth, Sharad Mehrotra, Information and Computer Science, University of California, Irvine. {smyth, sharad}@ics.uci.edu, www.datalab.uci.edu

  2. Project Participants • Principal Investigators: • Padhraic Smyth: Data mining • Sharad Mehrotra: Databases • Collaborators • Mark Steyvers: Text and Author Modeling • Postdoctoral Researchers • Michal Rosen-Zvi, Dmitri Kalashnikov • Staff Programmer • Amnon Meyers: Information Extraction • Students • PhD: Joshua O'Madadhain, Scott White, Yiming Ma, Dawit Seid • Undergraduates: Yan-Biao Boey, Momo Alhazzazi • Acknowledgements • Steve Lawrence for CiteSeer data

  3. Problem of Interest • Intelligence Analysis today • Massive volumes/streams of data • Text (newswire, reports, etc.) • Web data • Transactions/events • Central problems • Need flexible tools to support an analyst’s exploration of the data • Automatically focus an analyst’s attention on interesting parts of the data space • Need new theories/methods/tools…

  4. Entities and Events • Entities = Individuals, groups, communities, organizations, etc. • Events = Contacts, collaborations, meetings, products, etc. • Working hypothesis • A large component of intelligence work is centered on entities and events • Extracting entity information from text streams and transaction data • Predicting entity behavior • Detecting groups of related entities • Our broad goal • Develop next-generation data management, exploration, and analysis tools for entity-event data

  5. Nodes = Entities = Biotech-Related Organizations Edges = Events = Collaborations

  6. Red indicates nodes selected by the data analyst as important

  7. Algorithm determines blue nodes are important relative to red nodes (Oxford and Cambridge)

  8. Research Issues • Information extraction • Data management tools • Visualization techniques • Interactive ad hoc querying and mining • Statistical modeling of graph data • Query languages for graphs • Scalability to large graphs • ……

  9. Focus of Our Research • Text Sources • Information Extraction • Entity-Event Databases • Statistical Modeling and Data Mining • Visualization • Query Languages • User Modeling

  10. Major Themes in Our Work • Focus on data in the form of graphs • Nodes = entities, edges = events • Nodes and edges have attributes (e.g., temporal) • Year 1: entities = computer science researchers • Year 1: limited spatio-temporal aspects • Integration and coupling of • Statistical modeling and data mining • Visualization • Query languages and data management • Scalability • Methods should scale to millions of nodes and edges • User Interaction • Conditional “query-driven” analysis and mining • Contrast with offline global modeling
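The graph representation described in slide 10 (nodes = entities, edges = events, both carrying attributes such as time) can be illustrated with a minimal sketch. The class and field names below (`EntityEventGraph`, `Event`, `events_for`) are hypothetical and chosen for illustration only; this is not the JUNG API or the project's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Event:
    """An edge: one event linking two entities, with free-form attributes."""
    source: str
    target: str
    attrs: dict = field(default_factory=dict)


class EntityEventGraph:
    """Hypothetical minimal container for entity-event data (not JUNG)."""

    def __init__(self):
        self.entities = {}                    # entity id -> attribute dict
        self.events = []                      # all events, in insertion order
        self._by_entity = defaultdict(list)   # entity id -> events touching it

    def add_entity(self, entity_id, **attrs):
        # Create the entity if needed, then merge in any new attributes.
        self.entities.setdefault(entity_id, {}).update(attrs)

    def add_event(self, source, target, **attrs):
        event = Event(source, target, attrs)
        self.events.append(event)
        # A set avoids indexing a self-loop twice.
        for entity_id in {source, target}:
            self.add_entity(entity_id)
            self._by_entity[entity_id].append(event)
        return event

    def events_for(self, entity_id, year=None):
        """Events involving an entity, optionally restricted to one year."""
        found = self._by_entity.get(entity_id, [])
        if year is None:
            return list(found)
        return [e for e in found if e.attrs.get("year") == year]


# Illustrative data only: researchers as entities, co-authorships as events.
graph = EntityEventGraph()
graph.add_entity("author_a", kind="researcher")
graph.add_event("author_a", "author_b", kind="coauthorship", year=2002)
graph.add_event("author_a", "author_c", kind="coauthorship", year=2003)
recent = graph.events_for("author_a", year=2003)
```

Keeping a per-entity index of events is one simple way to support the conditional, query-driven access pattern the slide contrasts with offline global modeling.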

  11. Accomplishments • Infrastructure and Data Sets • Created testbed data sets, e.g., 100k entities, 400k events • Developed suite of text information extraction tools • Developed and released a general public-domain Java API for graph data analysis and visualization • Statistical Modeling and Data Mining • Developed new statistical technique for modeling entities based on authored text • Developed new class of scalable algorithms for interactive graph-based data mining

  12. Accomplishments • Graph-based Querying • Developed framework for general graph-based query language • New accurate and efficient algorithms for interactive similarity queries and query refinement on graphs • Software Tools • Netsight: Java-based graph visualization and analysis tool • Browser tool for exploring author-topic models • Interactive query refinement system • Prototype system for graph-based query language for interacting with heterogeneous graph data

  13. Publications in Year 1 • Data Mining on Graphs • S. White and P. Smyth, Algorithms for Discovering Relative Importance in Graphs, Proceedings of the Ninth International ACM SIGKDD Conference, August 2003. Extended version submitted to JICRD, June 2003. • J. O'Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) Framework, UCI-ICS Tech Report 03-17, October 2003; invited presentation, Stanford Workshop on Statistical Inference, Computing and Visualization for Graphs, August 2003. • P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, June 2003. • Statistical Author-Topic Models • T. Griffiths and M. Steyvers, Finding Scientific Topics, Proceedings of the National Academy of Sciences (in press). • M. Steyvers, M. Rosen-Zvi, T. Griffiths, and P. Smyth, Author Attribution with LDA, NIPS Workshop on Syntax, Semantics, and Statistics, December 2003. • Data Management and Graph Querying • Y. Ma, S. Mehrotra, and D. Seid, A Framework for Refining Similarity Queries Using Learning Techniques, UCI-ICS Tech Report 03-19, November 2003. Extended version submitted to EDBT 2004. • Y. Ma, D. Seid, and S. Mehrotra, Interactive Filtering of Data Streams by Refining Similarity Queries, UCI-ICS Tech Report 03-07, June 2003. • D. Seid, M. Ortega-Binderberger, Z. Chen, and S. Mehrotra, Evaluating Top-k Selection and Preference Queries on Multiple Indexed Attributes. Submitted to EDBT 2004. • D. Seid and S. Mehrotra, Complex Analytical Queries on Graphs and Hierarchies (in preparation). • L. Jin, C. Li, and S. Mehrotra, Efficient Record Linkage in Large Data Sets, 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), 26-28 March 2003, Kyoto, Japan.

  14. Data Sets

  15. Information Extraction

  16. Author Database Schema Note: “individual-centric” not “document-centric”

  17. Focus of Our Research • Text Sources • Information Extraction • Entity-Event Databases • Statistical Modeling and Data Mining • Visualization • Query Languages • User Modeling

  18. “9/11 Network”

  19. From graphs to Markov chains • Importance = recursive function of nodes pointing at you [Figure: weighted graph on nodes A, B, C, D with edge weights 2, 2, 3, 4]

  20. From graphs to Markov chains • Importance = recursive function of nodes pointing at you [Figure: the weighted graph on nodes A, B, C, D alongside the corresponding Markov chain, with transition probabilities on its edges]

  21. From graphs to Markov chains • Importance = recursive function of nodes pointing at you • Markov approach… • Notion of a “token” circulating around in Markov fashion • Important actors see the token more often • Importance = stationary probability of each node • PageRank: surfer randomly following links on the Web [Figure: the weighted graph on nodes A, B, C, D alongside the corresponding Markov chain, with transition probabilities on its edges]
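The step from a weighted graph to a Markov chain, and from the chain to importance scores, can be sketched as follows. This is a minimal illustration that assumes transition probabilities are obtained by normalizing each node's outgoing edge weights; the graph below is hypothetical (the edges of the slide's figure are not recoverable from the transcript), and the function names are illustrative.

```python
def transition_matrix(weights):
    """Row-normalize edge weights: P[u][v] = w(u, v) / sum of u's out-weights."""
    chain = {}
    for node, out_edges in weights.items():
        total = sum(out_edges.values())
        chain[node] = {nbr: w / total for nbr, w in out_edges.items()}
    return chain


def stationary_distribution(chain, iterations=200):
    """Power iteration: repeatedly push probability mass along the edges."""
    nodes = list(chain)
    prob = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            for v, p in chain[u].items():
                nxt[v] += prob[u] * p
        prob = nxt
    return prob


# Hypothetical weighted, directed graph (weights = number of events).
weights = {
    "A": {"B": 4},
    "B": {"C": 3, "D": 2},
    "C": {"A": 1},
    "D": {"A": 2, "B": 2},
}
chain = transition_matrix(weights)
importance = stationary_distribution(chain)
most_important = max(importance, key=importance.get)
```

In this toy graph every walk passes through B, so the circulating "token" visits B most often and B receives the largest stationary probability.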

  22. Relative importance of node V to A: Trade off [distance from A, structural importance of V]

  23. Add backlinks to A with probability b (e.g., 0.3)
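One way to read the backlink idea is as a random walk that, at every step, jumps back to the root node A with probability b and otherwise follows an outgoing edge. The sketch below iterates that update until it settles; the chain, the function name `pagerank_with_priors`, and the parameter name `back_prob` are assumptions made for illustration, not code from the project.

```python
def pagerank_with_priors(chain, roots, back_prob=0.3, iterations=100):
    """Stationary scores of a walk that returns to `roots` with prob. back_prob."""
    nodes = list(chain)
    # Prior: uniform over the root set, zero elsewhere.
    prior = {n: (1.0 / len(roots) if n in roots else 0.0) for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        # With probability back_prob the walk restarts at a root node...
        nxt = {n: back_prob * prior[n] for n in nodes}
        # ...otherwise it follows an outgoing edge of its current node.
        for u in nodes:
            for v, p in chain[u].items():
                nxt[v] += (1.0 - back_prob) * rank[u] * p
        rank = nxt
    return rank


# Hypothetical Markov chain: chain[u][v] = probability of moving from u to v.
chain = {
    "A": {"B": 1.0},
    "B": {"C": 0.6, "D": 0.4},
    "C": {"A": 1.0},
    "D": {"A": 0.5, "B": 0.5},
}
rank = pagerank_with_priors(chain, roots={"A"}, back_prob=0.3)
```

Larger values of b keep the walk closer to A, so the scores emphasize distance from A; smaller values let the global structure of the graph dominate.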

  24. Algorithms for Relative Importance (S. White and P. Smyth, ACM KDD 2003; also JICRD, submitted) • PageRank with Priors (PRankP) • Random walks that start from A and return to A periodically • Relative importance = stationary probability • Iterative algorithm (e.g., Haveliwala, 2002) • HITS with priors • Formulate HITS as Markov chain, same idea… • K-Step Markov • Use the transient probability distribution starting from A • Faster than stationary probability methods • Weighted Paths • Heuristic approximation to K-step Markov: even faster • All algorithms scale linearly in number of edges • Different constant factors
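The K-step Markov idea can be sketched as follows, under one plausible reading of "transient probability distribution": start all probability mass at A, propagate it for K steps, and score each node by the total mass it receives along the way. The chain and the function name `k_step_markov` are hypothetical, and the published algorithm may differ in its details.

```python
def k_step_markov(chain, root, k=3):
    """Score nodes by the probability mass reaching them in steps 1..k."""
    dist = {n: 0.0 for n in chain}
    dist[root] = 1.0                      # the walk starts at the root
    score = {n: 0.0 for n in chain}
    for _ in range(k):
        nxt = {n: 0.0 for n in chain}
        for u, mass in dist.items():
            for v, p in chain[u].items():
                nxt[v] += mass * p        # one step of the walk
        dist = nxt
        for n in chain:
            score[n] += dist[n]           # accumulate the transient distribution
    return score


# Hypothetical Markov chain: chain[u][v] = probability of moving from u to v.
chain = {
    "A": {"B": 1.0},
    "B": {"C": 0.6, "D": 0.4},
    "C": {"A": 1.0},
    "D": {"A": 0.5, "B": 0.5},
}
two_step = k_step_markov(chain, "A", k=2)
three_step = k_step_markov(chain, "A", k=3)
```

Each of the K steps touches every edge at most once, which is consistent with the slide's point that the method scales linearly in the number of edges and needs no convergence loop.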

  25. Computation Times for Ranking Algorithms (in seconds) PRankP and HITS converged in 20-30 iterations

  26. Computation Times for Ranking Algorithms (in seconds) PRankP and HITS converged in 20-30 iterations

  27. Weighted versus Unweighted Graphs

  28. Visualization and Analysis Software

  29. JUNG: Java Universal Network/Graph Framework • http://jung.sourceforge.net • 16,000 page visits, 800 downloads since August

  30. Demo of Netsight software

  31. Entity Models from Text Data

  32. Authors, Words • Can we model authors, given documents? (More generally, build statistical profiles of entities given sparse observed data.)

  33. Authors, Hidden Topics, Words • Model = Author-Topic distributions + Topic-Word distributions • Parameters learned via Bayesian learning

  34. Authors Hidden Topics Words

  35. Authors Hidden Topics Words

  36. Authors Hidden Topics Words

  37. Authors Hidden Topics Words

  38. Authors Hidden Topics Words

  39. Authors Hidden Topics Words

  40. Hidden Topics, Words • “Topic Model”: a document can be generated from multiple topics • Hofmann (SIGIR ’99); Blei, Ng, Jordan (JMLR, 2003)

  41. Authors, Hidden Topics, Words • Model = Author-Topic distributions + Topic-Word distributions • NOTE: documents can be composed of multiple topics
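The two sets of distributions can be made concrete with a small generative sketch: for each word, pick one of the document's authors, draw a topic from that author's topic distribution, then draw a word from that topic's word distribution. Choosing the author uniformly at random is an assumption made here, and all names, topics, and probabilities below are invented for illustration.

```python
import random


def generate_document(authors, author_topics, topic_words, length, rng):
    """Sample words via author -> hidden topic -> word."""
    words = []
    for _ in range(length):
        author = rng.choice(authors)                    # assumed uniform choice
        topics, t_probs = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=t_probs)[0]
        vocab, w_probs = zip(*topic_words[topic].items())
        words.append(rng.choices(vocab, weights=w_probs)[0])
    return words


# Hypothetical author-topic and topic-word distributions.
author_topics = {
    "a1": {"bayes": 0.8, "ir": 0.2},
    "a2": {"ir": 1.0},
}
topic_words = {
    "bayes": {"prior": 0.5, "posterior": 0.3, "inference": 0.2},
    "ir": {"retrieval": 0.5, "query": 0.3, "document": 0.2},
}

doc = generate_document(["a1", "a2"], author_topics, topic_words, 20,
                        random.Random(0))
solo = generate_document(["a2"], author_topics, topic_words, 50,
                         random.Random(1))
```

Because each word draws its own topic, a single document can mix several topics, which is the point of the note on the slide. Learning runs this process in reverse, inferring the two sets of distributions from observed documents.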

  42. Author Modeling Data Sets

  43. Topic Models from CiteSeer • WORDS: probabilistic, Bayesian, carlo, monte, distribution, inference, conditional, prior, mixture, Markov, posterior, belief… AUTHORS: N_Friedman, D_Heckerman, Z_Ghahramani, D_Koller, M_Jordan, R_Neal, A_Raftery, T_Lukasiewicz, J_Halpern… • WORDS: retrieval, text, document, information, content, indexing, relevance, collection, query, IR, feedback… AUTHORS: D_Oard, W_Croft, K_Jones, P_Schauble, E_Voorhees, A_Singhal, D_Hawking, J_Allan, A_Smeaton, M_Hearst…

  44. Topic Models from CiteSeer • WORDS: Web, user, world, wide, pages, www, site, internet, hypertext, hypermedia, content, links, page, navigation… AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D. Florescu, O. Etzioni, R. Studer, W. Hall, R. Fielding, J. Pitkow, M. Crovella… • WORDS: data, mining, attributes, discovery, association, large, knowledge, databases, dataset, interesting, frequent, discover, sets… AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B. Liu, H. Mannila, S. Brin, H. Liu, L. Holder, H. Toivonen…

  45. Author-Topic Models from CiteSeer • Author = A McCallum: - Topic 1: classification, training, generalization, decision, data… - Topic 2: learning, machine, examples, reinforcement, inductive… - Topic 3: retrieval, text, document, information, content… • Author = H Garcia-Molina: - Topic 1: query, index, data, join, processing, aggregate… - Topic 2: transaction, concurrency, copy, permission, distributed… - Topic 3: source, separation, paper, heterogeneous, merging… • Author = P Cohen: - Topic 1: agent, multi, coordination, autonomous, intelligent… - Topic 2: planning, action, goal, world, execution, situation… - Topic 3: human, interaction, people, cognitive, social, natural…

  46. Author-Topic Browser • Interesting scalability issues • CiteSeer model exceeds 1 Gbyte • Real-time query answering demands Gibbs sampling (not well suited to SQL!) • Solution • Coupling of Gibbs sampling and relational DB (it works!) [Architecture diagram: Java Query GUI, SQL Interface, Bayesian Sampling, MySQL DB (Original Text + Statistical Model)]
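The slide does not say how the sampler and the database were coupled, so the following is only a sketch of one possible arrangement: the large topic-word table stays in the database, a query pulls out just the rows for its own words, and a small Gibbs sampler runs in application code. It uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name `topic_word`, its columns, the function names, and all probabilities are hypothetical.

```python
import random
import sqlite3


def load_word_probs(conn, words):
    """Fetch P(word | topic) from the database, only for the query's words."""
    marks = ",".join("?" for _ in words)
    rows = conn.execute(
        f"SELECT topic, word, prob FROM topic_word WHERE word IN ({marks})",
        list(words),
    )
    probs = {}
    for topic, word, prob in rows:
        probs.setdefault(word, {})[topic] = prob
    return probs


def infer_topics(conn, words, n_topics, alpha=0.1, sweeps=200, seed=0):
    """Gibbs-sample topic assignments for the query words; return the topic mix."""
    rng = random.Random(seed)
    probs = load_word_probs(conn, set(words))
    assign = [rng.randrange(n_topics) for _ in words]
    counts = [0] * n_topics
    for z in assign:
        counts[z] += 1
    totals = [0.0] * n_topics
    for sweep in range(sweeps):
        for i, word in enumerate(words):
            counts[assign[i]] -= 1            # remove the word's current topic
            weights = [
                (counts[t] + alpha) * probs.get(word, {}).get(t, 1e-9)
                for t in range(n_topics)
            ]
            assign[i] = rng.choices(range(n_topics), weights=weights)[0]
            counts[assign[i]] += 1
        if sweep >= sweeps // 2:              # discard the first half as burn-in
            for t in range(n_topics):
                totals[t] += counts[t]
    norm = sum(totals)
    return [t / norm for t in totals]


# Tiny in-memory stand-in for the model database (illustrative values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topic_word (topic INTEGER, word TEXT, prob REAL)")
conn.executemany(
    "INSERT INTO topic_word VALUES (?, ?, ?)",
    [
        (0, "retrieval", 0.5), (0, "query", 0.4), (0, "mining", 0.1),
        (1, "mining", 0.6), (1, "association", 0.3), (1, "query", 0.1),
    ],
)
mix = infer_topics(conn, ["retrieval", "query", "retrieval"], n_topics=2)
```

The design point is that SQL handles what it is good at (selecting a small slice of a model too large to hold comfortably in memory), while the iterative, stateful sampling loop runs outside the database.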
