Joint Enhancement of Topic Modeling and Information Network Mining. Mid-Year PI Report focusing on I3.2. Heng Ji, City University of New York. NSCTA/INARC.
INARC Project Major Contributions
• I3.2-Subtask 1:
  • Disambiguate objects with rich semantic structures extracted from interconnected texts (ACL2011)
  • A new Collaborative Network Ranking Theory for Coreference Resolution (EMNLP2011-sub)
  • Markov Logic Networks and Learning-to-Rank to Enhance Open Domain Role Discovery (TAC2010, LNCS, SIGIR2011-sub, EMNLP2011-sub)
  • 16.4% improvement over state-of-the-art entity linking and 13%-22% improvement in link discovery
• I3.2-Subtask 2 (with H. Deng (UIUC) and J. Han (UIUC); focus of this talk):
  • Novel topic modeling: multi-typed objects are treated differently, taking into account their inherent textual information and the rich semantics of the heterogeneous information network (KDD2011-sub, IEEE Journal invited-sub)
  • Exploit the power of extended topic modeling for event network partitioning and refinement through active learning and topic-cluster-driven inferences (ACL2011-sub, IEEE Journal invited-sub)
  • Model the dynamics of information networks through a new temporal event network representation theory, evaluation metric and corresponding kernel methods (ACL2011-sub, EMNLP2011-sub)
• I3.2-Subtask 3 (with H. Deng (UIUC) and J. Han (UIUC)):
  • Self-Boosting Terrorism Network Search and Browsing (Springer Book Chapter, SIGIR2011-sub)
• *I3.1: Uncovering Hierarchical Relationships among Linked Objects (with C. Wang (UIUC) and J. Han (UIUC), KDD'11 sub, presented by J. Han)
• CUNY students and post-docs: Q. Li, X. Li, W. Lin, Z. Chen, S. Tamang, S. Anzaroot, J. Artiles
Mining and Modeling Interconnected Information Networks
• Text-rich heterogeneous information network
  • Textual documents (news, blogs, Twitter, papers, reports) are getting richer
  • Approximately 80% of all data in information networks is held in an unstructured format; thousands of "attack" events and hundreds of "arrest" events can be mined from one week's unstructured textual data
• Identify topics and events from documents using topic models
• Interconnect with users and other objects
• How do topics propagate from documents to objects?
A Starting Point: ‘Isolated’ Information Network
[Figure: example network attributes. Website: We are all Khaled Said; Residence: Tahrir (Feb 18th, 2001-present)]
Joint Enhancement of Topic Modeling and Heterogeneous Information Network Mining
[Figure labels: Topic model; Biased propagation]
• Fundamental Theory: InforNet construction and knowledge discovery capability can be mutually enhanced by network analysis on text and interconnected data
• Q1: How to discover latent topics and identify clusters of multi-typed objects simultaneously?
• A1: Probabilistic Topic Modeling with Biased Propagation to take advantage of inter-connectivity in InforNets
• Q2: How can text data and heterogeneous InforNet mutually enhance each other in topic modeling and other text mining tasks?
• A2: Incorporate topic clusters to partition and refine InforNets, yielding a new representation, evaluation metric and modeling theory
Preliminaries
• Maximize the log-likelihood of a collection of documents
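The formula on this slide did not survive extraction. Assuming the preliminaries refer to the standard PLSA objective (PLSA is the baseline named in the experiments), the log-likelihood would take roughly the following form. The symbols are assumptions rather than the slide's own notation: n(d, w) is the count of word w in document d, z_k is the k-th of K latent topics, D is the document collection and W the vocabulary.

```latex
\mathcal{L} \;=\; \sum_{d \in D} \sum_{w \in W} n(d, w)\, \log \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)
```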
Probabilistic Topic Models with Biased Propagation
Basic Idea (Biased Topic Propagation):
• Propagate the topic probabilities obtained by topic models from documents to other objects through the heterogeneous InforNet
• A simple, unbiased topic propagation does not make much sense
Intuition:
• InforNet provides valuable information
• Different objects have their own inherent information (e.g., D with rich text and U without explicit text)
• Treat documents with rich text and other objects without explicit text differently
• Topic(D): inherent text + connected U
• Topic(U): connected D
Biased Random Walk
• Basic criterion
  • The topic of an object without explicit text depends on the topics of the documents it connects to
  • E.g., the research topic of an author can be characterized by his/her published papers
  • The topic of a document is correlated with its objects to some extent, but should be principally determined by the inherent content of its text
• The topic distribution of an object is determined by the average topic distribution of its connected documents
• The topic distribution of a document combines its inherent topic distribution with the propagated topic distribution
• ξ: controls the balance between the inherent topic distribution and the propagated topic distribution
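The propagation rule described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `propagate_once`, the dictionary-based data layout and the value of `xi` are hypothetical, and the full model presumably interleaves this step with topic model estimation, which is omitted here.

```python
def propagate_once(doc_topics, doc_to_objs, xi):
    """One biased propagation step (illustrative sketch).

    doc_topics:  {doc_id: [P(z_1|d), ..., P(z_K|d)]} inherent topic distributions
    doc_to_objs: {doc_id: [object ids linked to the document]}
    xi:          weight in [0, 1] on the document's inherent distribution
    Returns (new_doc_topics, obj_topics).
    """
    k = len(next(iter(doc_topics.values())))

    # Invert the links so each object knows its connected documents.
    obj_docs = {}
    for d, objs in doc_to_objs.items():
        for u in objs:
            obj_docs.setdefault(u, []).append(d)

    # An object without text takes the average topic distribution of its documents.
    obj_topics = {
        u: [sum(doc_topics[d][z] for d in docs) / len(docs) for z in range(k)]
        for u, docs in obj_docs.items()
    }

    # A document keeps weight xi on its own text and (1 - xi) on the
    # average distribution propagated back from its connected objects.
    new_doc_topics = {}
    for d, theta in doc_topics.items():
        objs = doc_to_objs.get(d, [])
        if not objs:
            new_doc_topics[d] = list(theta)
            continue
        propagated = [sum(obj_topics[u][z] for u in objs) / len(objs) for z in range(k)]
        new_doc_topics[d] = [xi * theta[z] + (1 - xi) * propagated[z] for z in range(k)]
    return new_doc_topics, obj_topics


# Two papers on different topics sharing one author.
docs, objs = propagate_once(
    {"d1": [1.0, 0.0], "d2": [0.0, 1.0]},
    {"d1": ["author"], "d2": ["author"]},
    xi=0.8,
)
```

With a large `xi`, each document stays close to the topic implied by its own text, while the shared author ends up with an even mixture of the two topics.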
TMBP for InforNet Refinement
• Across a heterogeneous information network, a particular object can sometimes be an event trigger and sometimes not, and can represent different event types
• Within a cluster of topically-related documents, the distribution is much more convergent
• E.g., in the overall information network only 7% of the instances of “fire” indicate “End-Position” events, while all instances of “fire” within one topic cluster are “End-Position” events
• Topic modeling can enhance information network construction by grouping similar objects, event types and roles together
Bombing Threats Tracking and Dynamic Terrorism Networks Construction
• Most information obtained from text-rich InforNet construction so far is viewed as static, ignoring the temporal dimension of many links in the networks
• It is not enough to rely on information reporting time (publication years, blog post dates, news release time, narrative order, etc.) for open-domain real-world scenarios: only 3.71% correlation with gold standards
• Temporal information in individual documents can be sparse, incomplete and inaccurate; about 50% of events do not include explicit time arguments
Open-domain Progressive Information Network Analysis with TMBP
[Figure: example temporal information network. Entities: Iran Supreme National Security Council, Islamic Republic of Iran Broadcasting, Ali Larijani, Hassan Rowhani, Farideh Motahari, Tehran University. Link types: employee, spouse, school-attended. Links carry confidence scores (0.3 to 0.9) and time spans (2005-2007, 1989-2005, 1978-, 1982-1987)]
TMBP based Information Aggregation
• Toward deep analysis and global aggregation across information networks
  • Partition InforNet based on topic modeling
  • Within a topic cluster, we can recover temporal information by gleaning knowledge across networks and reach a global estimation of time boundaries
• Research Methods
  • Novel representation of complex temporal information
  • Meaningful comparison of approaches through InforNet-specific metrics
  • Design novel dependency path based kernel methods to capture long contexts
  • Global inference and aggregation over text-rich InforNet in order to reduce vagueness and over-constraining, resolve contradictions, and improve information quality
New Representation Theory and Evaluation Metric
[Figure labels: Vague model; Over-constraining model]
• 4-tuple representation
  • T1 = earliest possible start; T2 = latest possible start; T3 = earliest possible end; T4 = latest possible end
  • Can represent punctual start/end points (T1 = T2, T3 = T4)
  • Captures uncertainty when necessary (T1 < T2, T3 < T4)
  • Consistency restrictions: T1 <= T2, T3 <= T4, T1 <= T3, T2 <= T4
• A new quality-of-information metric based on formal constraints:
  • Detects cases of non-informative nodes and links in information networks
  • Allows independent parameterization of vagueness and over-constraining errors
  • Error penalization can be tuned to be more coarse-grained or fine-grained
  • ti: automatic output; gi: gold standard
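The 4-tuple and its consistency restrictions can be sketched as below. The metric formula itself is missing from the extracted slide, so `quality` is only one plausible form, an assumption and not the slide's definition: each of the four components scores c / (c + |ti - gi|), with a separate constant for errors that make the interval vaguer (`c_vag`) and for errors that over-constrain it (`c_cons`). The class and function names, and the use of plain numbers (e.g. years) for time points, are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalTuple:
    t1: float  # earliest possible start
    t2: float  # latest possible start
    t3: float  # earliest possible end
    t4: float  # latest possible end

    def is_consistent(self) -> bool:
        # Consistency restrictions as listed on the slide.
        return (self.t1 <= self.t2 and self.t3 <= self.t4
                and self.t1 <= self.t3 and self.t2 <= self.t4)


def quality(system, gold, c_vag=1.0, c_cons=1.0):
    """Assumed metric form: mean over the four components of c / (c + |ti - gi|).

    c_vag applies when the system bound is looser than gold (vagueness),
    c_cons when it is tighter than gold (over-constraining). Both must be > 0.
    Unbounded (infinite) values are not handled in this sketch.
    """
    sys_vals = (system.t1, system.t2, system.t3, system.t4)
    gold_vals = (gold.t1, gold.t2, gold.t3, gold.t4)
    total = 0.0
    for i, (t, g) in enumerate(zip(sys_vals, gold_vals)):
        is_lower_bound = i in (0, 2)  # T1 and T3 are lower bounds, T2 and T4 upper bounds
        vague = (t <= g) if is_lower_bound else (t >= g)
        c = c_vag if vague else c_cons
        total += c / (c + abs(t - g))
    return total / 4


gold = TemporalTuple(2005, 2005, 2007, 2007)
guess = TemporalTuple(2004, 2005, 2007, 2007)  # start bound one year too early
score = quality(guess, gold)
```

Raising `c_vag` relative to `c_cons` makes the score more tolerant of vague answers than of over-constrained ones, which is one way the two error types could be penalized independently.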
Dependency Paths based Kernel Method and Information Aggregation with CCMs
• Dependency paths based kernel method for local network prediction
• Maximize global network quality by aggregating temporal information across documents over the entire information network, using Constrained Conditional Models (CCMs) for optimization (collaboration with Dan Roth (UIUC))
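To make the idea of cross-document temporal aggregation concrete, here is a deliberately simple stand-in: a greedy intersection of 4-tuples that skips any mention contradicting what has been accumulated so far. It is not the CCM-based optimization named on the slide; the function names, the tuple layout (T1, T2, T3, T4) and the use of infinities for unknown bounds are assumptions made for illustration.

```python
import math


def consistent(t):
    """Check T1 <= T2, T3 <= T4, T1 <= T3, T2 <= T4 for a (T1, T2, T3, T4) tuple."""
    t1, t2, t3, t4 = t
    return t1 <= t2 and t3 <= t4 and t1 <= t3 and t2 <= t4


def aggregate(tuples):
    """Greedily intersect per-mention tuples into one global estimate.

    Lower bounds (T1, T3) are raised, upper bounds (T2, T4) are lowered.
    A mention is skipped if adding it would violate the consistency
    restrictions, so the result depends on the input order.
    """
    current = (-math.inf, math.inf, -math.inf, math.inf)
    for t in tuples:
        candidate = (
            max(current[0], t[0]),
            min(current[1], t[1]),
            max(current[2], t[2]),
            min(current[3], t[3]),
        )
        if consistent(candidate):
            current = candidate
    return current


INF = math.inf
mentions = [
    (-INF, 2006, 2006, INF),   # relation known to hold in 2006
    (2005, 2005, 2005, INF),   # relation started in 2005
    (-INF, 2007, 2007, 2007),  # relation ended in 2007
    (2010, 2010, 2010, INF),   # contradictory start date, skipped
]
estimate = aggregate(mentions)
```

Each mention on its own is vague, but together they narrow the estimate to a 2005 start and a 2007 end, which is the kind of gain the aggregation bullets describe.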
Topic Modeling Experiments Compared to State-of-the-art
• Data Collection
  • DBLP
  • NSF-Awards
• Metrics
  • Accuracy (AC)
  • Normalized mutual information (NMI)
• Results: 20%-40% improvement over Probabilistic Latent Semantic Analysis (PLSA)
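For readers unfamiliar with the second metric, the sketch below computes normalized mutual information between gold labels and predicted clusters using only the standard library. The slide does not say which normalization was used; this version divides by the geometric mean of the two entropies, which is one common choice and an assumption here.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information: MI / sqrt(H(true) * H(pred))."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the contingency counts.
    mi = sum(
        (nij / n) * math.log((nij * n) / (count_true[a] * count_pred[b]))
        for (a, b), nij in joint.items()
    )

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true, h_pred = entropy(count_true), entropy(count_pred)
    if h_true == 0 or h_pred == 0:
        # Degenerate case: a single cluster on one or both sides.
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # cluster ids differ, grouping matches
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])        # grouping carries no information
```

NMI ignores how clusters are numbered, so it can compare topic clusters against gold categories without first aligning cluster ids to labels.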
Topic Modeling based Active Learning for Event and Role Mining (Enhance Portability)
• Data: open-domain news with gold-standard information annotation
• Learning algorithm: combines pattern matching with Maximum Entropy based classification of triggers, arguments and roles
• Automatically select topically-related documents for event training data annotation
• Using topic modeling, with only 1/4 of the training data we can achieve performance comparable to passive learning
Topic Modeling based MLN Inference (Enhance Quality)
• Topic-cluster wide cross-document inference based on Markov Logic Networks (MLN) to enhance event and role mining
  • One trigger sense per topic cluster / one argument role per topic cluster
  • Remove events and roles with low local and cluster-wide confidence
  • Adjust event and role labeling to achieve cluster-wide consistency
• Results: reported as Precision (P), Recall (R) and F-Measure (F)
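The "one trigger sense per topic cluster" idea can be illustrated with a small confidence-weighted voting heuristic. This is a simplified stand-in and not Markov Logic Network inference: the function name, the dictionary keys (`cluster`, `trigger`, `event_type`, `conf`) and both thresholds are hypothetical choices made for the example.

```python
from collections import defaultdict


def enforce_one_sense_per_cluster(mentions, min_local=0.3, min_cluster=0.3):
    """Drop weakly supported mentions, then relabel the rest to the cluster majority.

    mentions: list of dicts with keys 'cluster', 'trigger', 'event_type', 'conf'.
    Returns a new list; the input dicts are not modified.
    """
    # Confidence-weighted votes for each event type, per (cluster, trigger) pair.
    votes = defaultdict(lambda: defaultdict(float))
    for m in mentions:
        votes[(m["cluster"], m["trigger"])][m["event_type"]] += m["conf"]

    result = []
    for m in mentions:
        v = votes[(m["cluster"], m["trigger"])]
        support = v[m["event_type"]] / sum(v.values())  # cluster-wide support for this label
        if m["conf"] < min_local and support < min_cluster:
            continue  # low local and low cluster-wide confidence: remove
        majority = max(v, key=v.get)
        result.append(dict(m, event_type=majority))  # enforce cluster-wide consistency
    return result


mentions = [
    {"cluster": "c1", "trigger": "fire", "event_type": "End-Position", "conf": 0.9},
    {"cluster": "c1", "trigger": "fire", "event_type": "End-Position", "conf": 0.8},
    {"cluster": "c1", "trigger": "fire", "event_type": "Attack", "conf": 0.5},
    {"cluster": "c1", "trigger": "fire", "event_type": "Attack", "conf": 0.1},
]
cleaned = enforce_one_sense_per_cluster(mentions)
```

In this toy cluster the low-confidence "Attack" reading of "fire" is dropped and the moderately confident one is relabeled, leaving three "End-Position" events.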
Progressive Temporal InforNet Mining Results
• Data
  • 1.3 million newswire documents and 0.4 million web blog/forum documents
• Overall Comparison with State-of-the-Art
• Impact of Information Aggregation
[Chart labels: No InforNet; Aggregation over 2 tuples; Aggregation over 10 tuples; Exploit InforNet]
Potential Army Impact and Technology Transition
• Enrich and enhance the quality of information gathered from daily events and trends, and detect terrorism or other potential threats, by exploring information networks that integrate unstructured text messages, blogs, tweets, news and reports
• Improved information quality has the potential to point soldiers and military data analysts to more relevant information, going beyond keyword-based Information Retrieval approaches
• Multi-facet object search can provide methods for finding groups of soldiers with certain expertise and for finding characteristics of enemies that may pose an imminent threat (an example: Web-scale Terrorism Network Search and Browsing)
• Developed methods to efficiently trace membership relations, attack/arrest/die activities and information clusters involving any specific entities
• Improve the quality of information by the interconnected network itself (self-boosting information networks)
Collaborations
• Within Task:
  • With J. Han on subtasks 2 and 3: >2 teleconferences every week, frequent teleconferences/emails among students/post-docs, submitted 2 joint research papers (1 SIGIR2011 submission and 1 ACL2011 submission), preparing 3 new joint research papers
  • With D. Roth: collaboration on Constrained Conditional Models (I1.1) for information aggregation, entity coreference resolution and event extraction
• Cross-Task:
  • With J. Han on I3.1: weekly teleconferences, regular emails, submitted 1 joint research paper to KDD2011
  • With T. Huang on I1.1: multi-media InforNet construction and utilization, published 2 joint research papers, submitted a joint NSF proposal
• Cross-Center:
  • With S. Parsons (SCNARC and T1.4): using text-rich information networks for trust prediction and dynamic social network analysis, co-advising a PhD student
Research Plans for Next Six Months
• Continue research conducted in the current I3.2 APP
  • Explore topic correlation and social correlation from neighbors for improving topic modeling (with Hongbo Deng, Jiawei Han and collaboration with SCNARC)
  • Introduce more constraints in cross-link inferences (with D. Roth)
  • Exploit new graph alignment algorithms for text mining (with X. Yan)
  • Exploit implicit links for InforNet analysis, such as the response structures in Twitter data
• Technology Transition: apply all of the successful approaches to military applications, e.g. conduct tight collaborations with ARL (e.g. Dr. Robert Cole) to make the terrorism network search engine deliverable; with ARL (Dr. Robert Winkler) on entity coreference resolution; with A. Leung on military data topic and event analysis
• Collaborations with researchers in other tasks and networks
  • I3.1 APP: continue collaborations with Jiawei Han (UIUC) to extend the work of uncovering hierarchical relationships to more general relation types, data genres and domains
  • Work with Thomas Huang (UIUC, I1.1) on cross-media transfer learning
  • Work with Jiawei Han (UIUC, E2.3) on evolution of information networks
  • Work with Simon Parsons (T1.4) on automatic social network analysis, and exploit logic reasoning to enhance entity disambiguation and information aggregation
A Research Path Ahead to 2012
• Next year's research, planned if funded:
  • Effective theories and methods for mining text-rich heterogeneous networks involving social and communication networks
  • Leverage topic modeling for improving expert finding (expertise ranking problem) on heterogeneous information networks
  • Continue to exploit network structures to enhance knowledge discovery and population
  • Multi-dimensional, hierarchical abstractive summarization based on information network analysis
• Explore collaborations with information fusion tasks in I1
• Explore collaborations with social network and trust projects on automatic social network construction and mining
• Application of effective theories and methods in military applications
Research Papers
I3.1
• (UIUC + CUNY) C. Wang, J. Han, X. Li, Q. Li, W. Lin, A. Lee, H. Li and H. Ji. 2011. Uncovering Hierarchical Relationships among Linked Objects: A Probabilistic Modeling Approach. Submitted to KDD2011.
I3.2 Accepted/Published:
• Z. Chen, S. Tamang, A. Lee, X. Li, W. Lin, J. Artiles, M. Snover, M. Passantino and H. Ji. CUNY-BLENDER TAC-KBP2010 Entity Linking and Slot Filling System Description. Proc. TAC2010.
• H. Li, X. Li, H. Ji and Y. Marton. Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation. Proc. PACLIC 2010.
• H. Ji, R. Grishman. Knowledge Base Population: Successful Approaches and Challenges. Proc. ACL-HLT2011.
• H. Ji, Adam Lee and Wen-Pin Lin. Information Network Construction and Alignment from Automatically Acquired Comparable Corpora. Invited book chapter for Building and Using Comparable Corpora. Springer.
• H. Ji, B. Favre, W. Lin, D. Gillick, D. Hakkani-Tur and R. Grishman. Open-domain Multi-document Summarization via Information Extraction: Challenges and Prospects. Invited book chapter for Multi-source, Multilingual Information Extraction and Summarisation. Springer.
I3.2 Submitted:
• (CUNY + UIUC) H. Ji and J. Han. 2011. Web-Scale Knowledge Discovery and Information Extraction. Invited Paper for IEEE Special Issue on Web-Scale Multimedia Processing and Applications.
• (CUNY + UIUC) H. Li, H. Ji, H. Deng and J. Han. 2011. Topically Related Data is Better Data: Topic Modeling for Event Extraction. ACL-HLT2011.
• (CUNY + UIUC) S. Anzaroot, J. Artiles, H. Ji, H. Deng and J. Han. 2011. Search and Browsing Self-Boosting Information Networks. SIGIR2011.
• J. Artiles, Q. Li, E. Amigo and H. Ji. 2011. Leveraging Cross-document Redundancy for Temporal Information Extraction. EMNLP2011.
• J. Artiles, E. Amigo, Q. Li and H. Ji. 2011. Evaluating Temporal Information Extraction. ACL-HLT2011.
• Z. Chen and H. Ji. 2011. Collaborative Ranking: A Case Study in Entity Linking. EMNLP2011.
• Q. Li, J. Artiles and H. Ji. 2011. Dependency Paths Kernel for Temporal Relation Classification. ACL-HLT2011.
• S. Tamang and H. Ji. 2011. Learning-to-Rank for Slot Filling System Combination and Assessment. EMNLP2011.
• Z. Chen, S. Tamang, A. Lee and H. Ji. 2011. A Toolkit for Knowledge Base Population. SIGIR2011.
• X. Li and H. Ji. 2011. Comment-guided Learning for Automatic Assessment. EMNLP2011.
Awards and Keynote Speech
• Heng Ji. CUNY Chancellor's "Salute to Scholar" Award, November 2010.
• Heng Ji. National Science Foundation Research Experiences for Undergraduates, March 2011.
• Heng Ji. Web-Scale Knowledge Discovery and Population from Unstructured Data. Keynote Speech, ACLCLP 2010 Information Retrieval Conference, December 2010.
• Heng Ji. Overview of the TAC2010 Knowledge Base Population Track. Keynote Speech at Web People Search (WePS-3) Conference, September 2010.
• Five students received university-wide awards
Brief Summary of My Team’s Other Research Work in I3.1 and I3.2
Leverage Semantic Information Network to Enhance Entity Coreference Resolution / Entity Identification
• Disambiguation
• Name Variant Clustering
• Apply graph-cutting based algorithms on semantic information networks
• 9.4% absolute improvement in micro-averaged accuracy
Micro and Macro Collaborative Networks Ranking for Entity and Event Coreference Resolution
• Previous methods only focused on the target node and one learning theory itself
• Propose a new collaborative network ranking theory which imitates human collaborative learning
  • Leverage inter-connections among collaborative entities in information networks
  • Automatic profiling for each node
  • Construct a collaborative network for each entity based on graph-based clustering
  • Rank multiple decisions from collaborative entities (micro) and algorithms (macro) based on global prediction
• 7% absolute improvement in micro-averaged accuracy
• On-going CUNY+UIUC work: using topic modeling for entity clustering
Markov Logic Networks and Learning-to-Rank to Enhance Open Domain Role Discovery
[Figure: a 9/11 suspect terrorist network shown alongside a terrorist information network. Node labels: Wail Al-Shehri, Waleed Al-Shehri, Abdul Rahman Al-Omari, Abdul Aziz Al-Omari, Mohamed Atta, Khamis Mushait, Boston, Al-Qaeda, Saudi Arabian Airlines. Link labels: residence, member, origin, sibling, pilot. Source labels: forum, twitter, web blog, news page]
• Discovered 26 roles for persons, 16 roles for organizations and 13 roles for locations
• Markov Logic Networks for cross-slot and cross-query reasoning based on InforNet and textual linkages, to resolve conflicts and predict missing links
  • Weight=15: [rule shown in figure]
  • Weight=100: [rule shown in figure]
• Maximum Entropy based learning-to-rank model to re-rank candidate answers
• 13%-22% absolute F-measure improvement
(CUNY) Chen et al. "CUNY-BLENDER TAC-KBP2010 Entity Linking and Slot Filling System Description". Proc. TAC2010 and Lecture Notes in Computer Science, 2010
Uncovering Hierarchical Relationships among Linked Objects
• Parent-child, manager-subordinate, organizational, initiator-follower
• DAG underlying tree
• Data: nodes, links, labeled trees
• Jointly learn the importance of features and rules (challenge: joint learning)
• Infer the tree structures of unlabeled data (challenge: model & feature design)
• Develop a general model & summarize typical features with uncertain importance
• Examples of features and rules
  • Local feature (singleton potential)
  • Dependency rule (pairwise potential)
• Test on two tasks
  • Uncover family tree structure
  • Uncover online discussion structure
(UIUC + CUNY) Chi Wang, Jiawei Han, Xiang Li, Qi Li, Wen-Pin Lin, Adam Lee, Hao Li, Heng Ji, "Uncovering Hierarchical Relationships among Linked Objects: A Probabilistic Modeling Approach", KDD'11 (sub)
Uncovering Hierarchical Relationships among Linked Objects
• Using a novel discriminative model, CRF-Hier, optimized for joint modeling of tree structure learning and reasoning
• 10%-12% higher performance than state-of-the-art
[Figure: example family tree. Node labels: Mohammed bin Awad bin Laden, Salem bin Laden, Bakr bin Laden, Abdullah Osama bin Laden, Osama bin Laden, Saad bin Laden, Omar Osama bin Laden]
(UIUC + CUNY) Chi Wang, Jiawei Han, Xiang Li, Qi Li, Wen-Pin Lin, Adam Lee, Hao Li, Heng Ji, "Uncovering Hierarchical Relationships among Linked Objects: A Probabilistic Modeling Approach", KDD'11 (sub)
Potential Transition Example: Terrorism Networks Search and Browsing Engine
• In many scenarios, a user may only know about a limited portion of the objects or link dimensions in an information network, and thus has difficulty creating informative queries
• For example, a military data analyst may have a list of well-known terrorism organizations without knowing the names of their individual members, but still wish to track the activities of these members
Multi-Facet Search in Self-Boosting Information Networks (Example: Terrorism Network Search and Browsing)
Demo Video: http://nlp.cs.qc.cuny.edu/terrorism.m4v
• Assist a military analyst in expert finding and in terrorist information search, gathering, control and analysis for any given query
• Entity-topic analyzer for self-expansion and self-boosting: terrorism organization, its members, the status of members (die, arrest,...) and the information networks associated with each member
(CUNY + UIUC) Sam Anzaroot, Javier Artiles, Heng Ji, Hongbo Deng and Jiawei Han. 2011. Search and Browsing Self-Boosting Information Networks. SIGIR2011 [SUB]