Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine

Entity-Based Data Mining from Spatio-Temporal Events and Text Sources Presentation at KD-D Program Review, Nov 18-19, 2003 Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine {smyth, sharad}@ics.uci.edu www.datalab.uci.edu

Project Participants • Principal Investigators: • Padhraic Smyth: Data mining • Sharad Mehrotra: Databases • Collaborators • Mark Steyvers: Text and Author Modeling • Postdoctoral Researchers • Michal Rosen-Zvi, Dmitri Kalashnikov • Staff Programmer • Amnon Meyers: Information Extraction • Students • PhD: Joshua O'Madadhain, Scott White, Yiming Ma, Dawit Seid • Undergraduates: Yan-Biao Boey, Momo Alhazzazi • Acknowledgements • Steve Lawrence for CiteSeer data

Problem of Interest • Intelligence Analysis today • Massive volumes/streams of data • Text (newswire, reports, etc) • Web data • Transactions/events • Central problems • Need flexible tools to support an analyst’s exploration of the data • Automatically focus an analyst’s attention on interesting parts of the data space • Need new theories/methods/tools….

Entities and Events • Entities = Individuals, groups, communities, organizations, etc • Events = Contacts, collaborations, meetings, products, etc • Working hypothesis • A large component of intelligence work is centered on entities and events • Extracting entity-information from text streams and transaction data • Predicting entity behavior • Detecting groups of related entities • Our broad goal • Develop next-generation data management, exploration, and analysis tools for entity-event data

Nodes = Entities = Biotech-Related Organizations • Edges = Events = Collaborations

Red indicates nodes selected by the data analyst as important

Algorithm determines blue nodes are important relative to red nodes (Oxford and Cambridge)

Research Issues • Information extraction • Data management tools • Visualization techniques • Interactive ad hoc querying and mining • Statistical modeling of graph data • Query languages for graphs • Scalability to large graphs • ……

Focus of Our Research • Text Sources • Information Extraction • Entity-Event Databases • Statistical Modeling and Data Mining • Visualization • Query Languages • User Modeling

Major Themes in Our Work • Focus on data in the form of graphs • Nodes = entities, edges = events • Nodes and edges have attributes (e.g., temporal) • Year 1: entities = computer science researchers • Year 1: limited spatio-temporal aspects • Integration and coupling of • Statistical modeling and data mining • Visualization • Query languages and data management • Scalability • Methods should scale to millions of nodes and edges • User Interaction • Conditional “query-driven” analysis and mining • Contrast with offline global modeling

Accomplishments • Infrastructure and Data Sets • Created testbed data sets, e.g., 100k entities, 400k events • Developed suite of text information extraction tools • Developed and released a general public-domain JAVA API for graph data analysis and visualization • Statistical Modeling and Data Mining • Developed new statistical technique for modeling entities based on authored text • Developed new class of scalable algorithms for interactive graph-based data mining

Accomplishments • Graph-based Querying • Developed framework for general graph-based query language • New accurate and efficient algorithms for interactive similarity queries and query refinement on graphs • Software Tools • Netsight: JAVA-based graph visualization and analysis tool • Browser tool for exploring author-topic models • Interactive query refinement system • Prototype system for graph-based query language for interacting with heterogeneous graph data

Publications in Year 1 • Data Mining on Graphs • S. White and P. Smyth, Algorithms for Discovering Relative Importance In Graphs, Proceedings of the Ninth International ACM SIGKDD Conference, August 2003. Extended version submitted to JICRD, June 2003. • J. O'Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) Framework, UCI-ICS Tech Report 03-17, October 2003; invited presentation, Stanford Workshop on Statistical Inference, Computing and Visualization for Graphs, August 2003. • Modeling the Internet and the Web: Probabilistic Methods and Algorithms, P. Baldi, P. Frasconi, and P. Smyth, Wiley, June 2003. • Statistical Author-Topic Models • T. Griffiths and M. Steyvers (in press). Finding Scientific Topics. Proceedings of the National Academy of Sciences. • M. Steyvers, M. Rosen-Zvi, T. Griffiths, P. Smyth, Author Attribution with LDA, NIPS workshop on Syntax, Semantics, and Statistics, December 2003. • Data Management and Graph Querying • Y. Ma, S. Mehrotra, D. Seid, A Framework for Refining Similarity Queries Using Learning Techniques, UCI-ICS Tech Report 03-19, Nov. 2003. Extended version submitted to EDBT 2004. • Y. Ma, D. Seid, S. Mehrotra, Interactive Filtering of Data Streams by Refining Similarity Queries, UCI-ICS Tech Report 03-07, June 2003. • D. Seid, M. Ortega-Binderberger, Z. Chen, and S. Mehrotra, Evaluating Top-k Selection and Preference Queries on Multiple Indexed Attributes. Submitted to EDBT 2004. • D. Seid and S. Mehrotra, Complex Analytical Queries on Graphs and Hierarchies (in preparation). • L. Jin, C. Li, S. Mehrotra, Efficient Record Linkage in Large Data Sets, 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), 26-28 March 2003, Kyoto, Japan.

Data Sets

Information Extraction

Author Database Schema Note: “individual-centric” not “document-centric”

Focus of Our Research • Text Sources • Information Extraction • Entity-Event Databases • Statistical Modeling and Data Mining • Visualization • Query Languages • User Modeling

“9/11 Network”

From graphs to Markov chains • Importance = recursive function of nodes pointing at you • Markov approach… • Notion of a “token” circulating around in Markov fashion • Important actors see the token more often • Importance = stationary probability of each node • PageRank: surfer randomly following links on the Web [Figure: a weighted graph on nodes A, B, C, D, and the corresponding Markov chain with transition probabilities on its edges]
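The "importance = stationary probability" idea can be illustrated with a minimal sketch. It assumes that edge weights are turned into transition probabilities by normalizing each node's outgoing weights, and it approximates the stationary distribution by repeatedly propagating a probability vector. The graph, its weights, and the function names are hypothetical; they are not the example drawn on the slide.

```python
def transition_probabilities(weights):
    """Normalize each node's outgoing edge weights so they sum to 1."""
    chain = {}
    for node, out_edges in weights.items():
        total = sum(out_edges.values())
        chain[node] = {nbr: w / total for nbr, w in out_edges.items()}
    return chain


def stationary_distribution(chain, iterations=200):
    """Approximate the stationary distribution by repeated propagation."""
    nodes = list(chain)
    prob = {n: 1.0 / len(nodes) for n in nodes}  # start from a uniform guess
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for src, out_edges in chain.items():
            for dst, p in out_edges.items():
                nxt[dst] += prob[src] * p  # the token moves src -> dst with probability p
        prob = nxt
    return prob


# Hypothetical weighted graph (made-up weights, not the slide's figure).
weights = {
    "A": {"B": 4, "C": 1},
    "B": {"C": 3, "D": 2},
    "C": {"A": 1},
    "D": {"A": 2, "B": 2},
}
chain = transition_probabilities(weights)
importance = stationary_distribution(chain)
```

In this toy chain the token visits D least often, so D receives the smallest importance score. The iteration only settles when the chain is irreducible and aperiodic, which this hand-picked example is.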

Relative importance of node V to A: Trade off [distance from A, structural importance of V]

Add backlinks to A with probability b (e.g., 0.3)

Algorithms for Relative Importance (S. White and P. Smyth, ACM KDD 2003; also JICRD, submitted) • PageRank with Priors (PRankP) • Random walks that start from A and return to A periodically • Relative importance = stationary probability • Iterative algorithm (e.g., Haveliwala, 2002) • HITS with priors • Formulate HITS as Markov chain, same idea…. • K-Step Markov • Use the transient probability distribution starting from A • Faster than stationary probability methods • Weighted Paths • Heuristic approximation to K-step Markov: even faster • All algorithms scale linearly in number of edges • Different constant factors
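Two of the ideas listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: PageRank with priors is written as a random walk that jumps back to the root node A with probability `back_prob`, and K-step Markov is read as accumulating the walk's distribution over its first K steps and normalizing. The toy chain and all function names are hypothetical.

```python
def step(chain, prob):
    """Propagate a probability distribution one step along the chain."""
    nxt = {n: 0.0 for n in chain}
    for src, out_edges in chain.items():
        for dst, p in out_edges.items():
            nxt[dst] += prob[src] * p
    return nxt


def pagerank_with_priors(chain, prior, back_prob=0.3, iterations=100):
    """Stationary distribution of a walk that returns to the prior with back_prob."""
    prob = dict(prior)
    for _ in range(iterations):
        walked = step(chain, prob)
        prob = {n: (1.0 - back_prob) * walked[n] + back_prob * prior[n]
                for n in chain}
    return prob


def k_step_markov(chain, prior, k=3):
    """Assumed reading: normalized sum of the distributions after steps 1..k."""
    prob = dict(prior)
    visits = {n: 0.0 for n in chain}
    for _ in range(k):
        prob = step(chain, prob)
        for n in chain:
            visits[n] += prob[n]
    total = sum(visits.values())
    return {n: v / total for n, v in visits.items()}


# Hypothetical chain; each node's outgoing probabilities sum to 1.
chain = {
    "A": {"B": 0.8, "C": 0.2},
    "B": {"C": 0.6, "D": 0.4},
    "C": {"A": 1.0},
    "D": {"A": 0.5, "B": 0.5},
}
prior = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}  # all prior mass on root node A

rank_pr = pagerank_with_priors(chain, prior, back_prob=0.3)
rank_k = k_step_markov(chain, prior, k=3)
```

Both functions touch each edge once per iteration, which matches the slide's point that the methods scale linearly in the number of edges. K-step Markov stops after a fixed K passes instead of iterating to convergence.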

Computation Times for Ranking Algorithms (in seconds) • PRankP and HITS converged in 20-30 iterations

Weighted versus Unweighted Graphs

Visualization and Analysis Software

JUNG: Java Universal Network/Graph Framework • http://jung.sourceforge.net • 16,000 page visits, 800 downloads since August

Demo of Netsight software

Entity Models from Text Data

Authors, Words • Can we model authors, given documents? (More generally, build statistical profiles of entities given sparse observed data.)

Authors, Hidden Topics, Words • Model = Author-Topic distributions + Topic-Word distributions • Parameters learned via Bayesian learning

Hidden Topics, Words • “Topic Model”: • A document can be generated from multiple topics • Hofmann (SIGIR ’99); Blei, Ng, and Jordan (JMLR, 2003)

Authors, Hidden Topics, Words • Model = Author-Topic distributions + Topic-Word distributions • NOTE: documents can be composed of multiple topics
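To make the two sets of distributions concrete, here is a minimal sketch of how a document could be generated from them. The slides only state that the model consists of author-topic and topic-word distributions; the per-word choice of an author uniformly at random from the document's authors is an assumption, and the authors, topics, vocabulary, and probabilities below are all hypothetical toy values.

```python
import random


def generate_document(authors, author_topics, topic_words, length, rng):
    """Sample (author, topic, word) triples, one per word position."""
    words = []
    for _ in range(length):
        # Assumption: each word is attributed to one of the document's
        # authors, picked uniformly at random.
        author = rng.choice(authors)
        # Draw a topic from that author's topic distribution.
        topics, topic_probs = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=topic_probs)[0]
        # Draw a word from that topic's word distribution.
        vocab, word_probs = zip(*topic_words[topic].items())
        word = rng.choices(vocab, weights=word_probs)[0]
        words.append((author, topic, word))
    return words


# Hypothetical toy parameters (not learned from CiteSeer).
author_topics = {
    "author_1": {"databases": 0.9, "learning": 0.1},
    "author_2": {"databases": 0.2, "learning": 0.8},
}
topic_words = {
    "databases": {"query": 0.5, "index": 0.3, "join": 0.2},
    "learning": {"training": 0.4, "classification": 0.4, "examples": 0.2},
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
document = generate_document(["author_1", "author_2"], author_topics,
                             topic_words, length=20, rng=rng)
```

Because the topic is redrawn for every word, a single document mixes several topics, which is the point of the NOTE on the slide. Learning runs in the opposite direction: given only the observed words and author lists, it infers the two sets of distributions.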

Author Modeling Data Sets

Topic Models from CiteSeer • WORDS: probabilistic, Bayesian, carlo, monte, distribution, inference, conditional, prior, mixture, Markov, posterior, belief… AUTHORS: N_Friedman, D_Heckerman, Z_Ghahramani, D_Koller, M_Jordan, R_Neal, A_Raftery, T_Lukasiewicz, J_Halpern… • WORDS: retrieval, text, document, information, content, indexing, relevance, collection, query, IR, feedback… AUTHORS: D. Oard, W_Croft, K_Jones, P_Schauble, E_Voorhees, A_Singhal, D_Hawking, J_Allan, A_Smeaton, M_Hearst…

Topic Models from CiteSeer • WORDS: Web, user, world, wide, pages, www, site, internet, hypertext, hypermedia, content, links, page, navigation… AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D. Florescu, O. Etzioni, R_Studer, W. Hall, R. Fielding, J. Pitkow, M. Crovella… • WORDS: data, mining, attributes, discovery, association, large, knowledge, databases, dataset, interesting, frequent, discover, sets… AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B. Liu, H. Mannila, S. Brin, H. Liu, L. Holder, H. Toivonen…

Author-Topic Models from CiteSeer • Author = A McCallum: - Topic 1: classification, training, generalization, decision, data… - Topic 2: learning, machine, examples, reinforcement, inductive… - Topic 3: retrieval, text, document, information, content… • Author = H Garcia-Molina: - Topic 1: query, index, data, join, processing, aggregate… - Topic 2: transaction, concurrency, copy, permission, distributed… - Topic 3: source, separation, paper, heterogeneous, merging… • Author = P Cohen: - Topic 1: agent, multi, coordination, autonomous, intelligent… - Topic 2: planning, action, goal, world, execution, situation… - Topic 3: human, interaction, people, cognitive, social, natural…

Author-Topic Browser • Interesting scalability issues • CiteSeer model exceeds 1 Gbyte • Real-time query answering demands Gibbs sampling (not well suited to SQL!) • Solution • Coupling of Gibbs sampling and relational DB (it works!) [Architecture diagram; components shown: JAVA Query GUI, SQL Interface, Bayesian Sampling, MySQL DB (Original Text + Statistical Model)]
