
Link Mining

Link Mining. Lise Getoor, University of Maryland, College Park. Joint work with Indrajit Bhattacharya, Qing Lu and Prithviraj Sen.


Presentation Transcript


  1. Link Mining • Lise Getoor, University of Maryland, College Park • Joint work with Indrajit Bhattacharya, Qing Lu and Prithviraj Sen • MRDM 2004 Workshop

  2. Roadmap • Intro to Link Mining • Link Mining Tasks • Link Mining Challenges • Some Current Projects • Link-based Classification • Link-based classification using a variety of link descriptions • Link-based classification using labeled and unlabeled data • Link-based Clustering • Entity detection • Group Detection • Conclusion

  3. Link Mining • Traditional machine learning and data mining approaches assume: • A random sample of homogeneous objects from a single relation • Real world data sets: • Multi-relational, heterogeneous and semi-structured • Link Mining • A newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming

  4. Linked Data • Heterogeneous, multi-relational data represented as a graph or network • Nodes are objects • May have different kinds of objects • Objects have attributes • Objects may have labels or classes • Edges are links • May have different kinds of links • Links may have attributes • Links may be directed and are not required to be binary
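The graph view of linked data described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names (`Node`, `Link`, `LinkedData`), their fields, and the toy paper/author objects are hypothetical and do not come from the slides. Links hold a tuple of members so that they are not restricted to two endpoints.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """An object in the graph; `kind` distinguishes object types."""
    node_id: str
    kind: str
    attributes: dict = field(default_factory=dict)
    label: Optional[str] = None  # class label, if known


@dataclass
class Link:
    """A link joining two or more objects (not restricted to binary)."""
    kind: str
    members: tuple  # node ids; order is meaningful when directed
    directed: bool = True
    attributes: dict = field(default_factory=dict)


@dataclass
class LinkedData:
    nodes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_link(self, link):
        # Every participant must already be a known object.
        missing = [m for m in link.members if m not in self.nodes]
        if missing:
            raise KeyError(f"unknown nodes: {missing}")
        self.links.append(link)

    def links_of(self, node_id):
        """All links in which the given object participates."""
        return [l for l in self.links if node_id in l.members]


# Hypothetical toy data: two papers and one author.
g = LinkedData()
g.add_node(Node("p1", "paper", {"title": "A"}, label="theory"))
g.add_node(Node("p2", "paper", {"title": "B"}))
g.add_node(Node("a1", "author", {"name": "X"}))
g.add_link(Link("cites", ("p1", "p2")))
g.add_link(Link("co-review", ("p1", "p2", "a1"), directed=False))
```

Storing members as a tuple rather than a (source, target) pair is one simple way to allow non-binary links; a production representation would likely index links by node for faster lookup.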

  5. Sample Domains • web data (web) • bibliographic data (cite) • epidemiological data (epi) • communication data (comm) • customer networks (cust) • collaborative filtering problems (cf) • trust networks (trust) • biological data (bio)

  6. Link Mining Tasks • Link-based Object Classification • Object Type Prediction • Link Type Prediction • Predicting Link Existence • Link Cardinality Estimation • Object Consolidation • Group Detection • Subgraph Discovery • Metadata Mining

  7. Link-based Object Classification • Predicting the category of an object based on its attributes and its links and attributes of linked objects • web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc. • cite: Predict the topic of a paper, based on word occurrence, citations, co-citations • epi: Predict disease type based on characteristics of the patients infected by the disease

  8. Object Class Prediction • Predicting the type of an object based on its attributes and its links and attributes of linked objects • comm: Predict whether a communication contact is by email, phone call or mail. • cite: Predict the venue type of a publication (conference, journal, workshop)

  9. Link Type Classification • Predicting the type or purpose of a link based on properties of the participating objects • web: predict advertising link or navigational link; predict an advisor-advisee relationship • epi: predicting whether contact is familial, co-worker or acquaintance

  10. Predicting Link Existence • Predicting whether a link exists between two objects • web: predict whether there will be a link between two pages • cite: predicting whether a paper will cite another paper • epi: predicting who a patient’s contacts are

  11. Link Cardinality Estimation I • Predicting the number of links to an object • web: predict the authoritativeness of a page based on the number of in-links; identifying hubs based on the number of out-links • cite: predicting the impact of a paper based on the number of citations • epi: predicting the number of people that will be infected based on the infectiousness of a disease.

  12. Link Cardinality Estimation II • Predicting the number of objects reached along a path from an object • Important for estimating the number of objects that will be returned by a query • web: predicting number of pages retrieved by crawling a site • cite: predicting the number of citations of a particular author in a specific journal

  13. Object Consolidation • Predicting when two objects are the same, based on their attributes and their links • aka: record linkage, duplicate elimination, identity uncertainty • web: predict when two sites are mirrors of each other. • cite: predicting when two citations are referring to the same paper. • epi: predicting when two disease strains are the same • bio: learning when two names refer to the same protein

  14. Group Detection • Predicting when a set of entities belong to the same group based on clustering both object attribute values and link structure • web – identifying communities • cite – identifying research communities

  15. Subgraph Identification • Find characteristic subgraphs • Focus of graph-based data mining (Cook & Holder, Inokuchi, Washio & Motoda, Kuramochi & Karypis, Yan & Han) • bio – protein structure discovery • comm – legitimate vs. illegitimate groups • chem – chemical substructure discovery

  16. Metadata Mining • Schema mapping, schema discovery, schema reformulation • cite – matching between two bibliographic sources • web – discovering schema from unstructured or semi-structured data • bio – mapping between two medical ontologies

  17. Link Mining Tasks • Link-based Object Classification • Object Type Prediction • Link Type Prediction • Predicting Link Existence • Link Cardinality Estimation • Object Consolidation • Group Detection • Subgraph Discovery • Metadata Mining

  18. Link Mining Challenges • Logical vs. Statistical dependencies • Feature construction • Instances vs. Classes • Collective Classification • Collective Consolidation • Effective Use of Labeled & Unlabeled Data • Link Prediction • Closed vs. Open World • Challenges common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, Stochastic Logic Programming, to name a few)

  19. Logical vs. Statistical Dependence • Coherently handling two types of dependence structures: • Link structure - the logical relationships between objects • Probabilistic dependence - statistical relationships between attributes • Challenge: statistical models that support rich logical relationships • Model search complicated by the fact that attributes can depend on arbitrarily linked attributes -- issue: how to search this huge space

  20. Model Search • [Diagram: nodes P1, P2, P3, I1 and A1, with a node marked ? to be predicted]

  21. Feature Construction • In many cases, objects are linked to a set of objects. To construct a single feature from this set of objects, we may either use: • Aggregation • Selection
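The two options above can be illustrated with a minimal sketch. The function names, the `citations` field, and the toy papers are hypothetical; the slides name aggregation and selection but do not prescribe a particular aggregate or selection rule. Here aggregation uses the mode, and selection picks the single object that maximizes a caller-supplied key.

```python
from collections import Counter


def aggregate_mode(values):
    """Aggregation: summarize the whole set by its most common value."""
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


def select_one(objects, key):
    """Selection: pick one representative object from the set."""
    if not objects:
        return None
    return max(objects, key=key)


# Hypothetical set of papers linked to the object being described.
cited = [
    {"id": "P1", "topic": "theory", "citations": 12},
    {"id": "P2", "topic": "AI", "citations": 40},
    {"id": "P3", "topic": "theory", "citations": 3},
]

# Feature via aggregation: the most frequent topic among linked papers.
agg = aggregate_mode([p["topic"] for p in cited])

# Feature via selection: the topic of the most-cited linked paper.
sel = select_one(cited, key=lambda p: p["citations"])["topic"]
```

On this toy set the two strategies disagree (the mode is "theory", the selected paper's topic is "AI"), which is the kind of difference the choice between them turns on.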

  22. Aggregation • [Diagram: nodes P1 to P9, I1, I2, A1 and A2; sets of linked nodes are summarized by their mode to predict a node marked ?]

  23. Selection • [Diagram: nodes P1, P2, P3, I1 and A1, with a node marked ? to be predicted]

  24. Individuals vs. Classes • Does the model refer • explicitly to individuals, or • to classes or generic categories of individuals? • On one hand, we’d like to be able to model that a connection to a particular individual may be highly predictive • On the other hand, we’d like our models to generalize to new situations, with different individuals

  25. Instance-based Dependencies • Papers that cite P3 are likely to be [category shown only in the slide graphic] • [Diagram: nodes P3, I1 and A1]

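  26. Class-based Dependencies • Papers that cite [category shown only in the slide graphic] are likely to be [category shown only in the slide graphic] • [Diagram: nodes I1 and A1, with nodes marked ?]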

  27. Collective classification • Using a link-based statistical model for classification • Inference using the learned model is complicated by the correlation between the object labels

  28. Collective consolidation • Using a link-based statistical model for object consolidation • Consolidation decisions should not be made independently

  29. Labeled & Unlabeled Data • In link-based domains, unlabeled data provide three sources of information: • Helps us infer object attribute distribution • Links between unlabeled data allow us to make use of attributes of linked objects • Links between labeled data and unlabeled data (training data and test data) help us make more accurate inferences

  30. Link Prior Probability • The prior probability of any particular link is typically extraordinarily low • For medium-sized data sets, we have had success with building explicit models of link existence • It may be more effective to model links at a higher level; this is required for large data sets!

  31. Closed World vs. Open World • The majority of SRL approaches make a closed world assumption, which assumes that we know all the potential entities in the domain • In many cases, this is unrealistic • Work by Milch, Marthi, Russell on BLOG

  32. Link Mining Summary • Link Mining Tasks: Link-based Object Classification; Object Type Prediction; Link Type Prediction; Predicting Link Existence; Link Cardinality Estimation; Object Consolidation; Group Detection; Subgraph Discovery; Metadata Mining • Link Mining Challenges: Logical vs. Statistical dependencies; Feature construction; Instances vs. Classes; Collective Classification; Collective Consolidation; Effective Use of Labeled & Unlabeled Data; Link Prediction; Closed vs. Open World

  33. Roadmap • Intro to Link Mining • Link Mining Tasks • Link Mining Challenges • Some Current Projects • Link-based Classification • work with Qing Lu and Prithviraj Sen • Link-based classification using a variety of link descriptions • Link-based classification using labeled and unlabeled data • Link-based Clustering • Entity detection • Group Detection • Conclusion

  34. Object Classification • Traditional Object Classification • Assume objects sampled from a single relation • Object Attributes (OA) • [Diagram: objects X1 to X5]

  35. Object Classification with Linked Data • Traditional Object Classification • Assume objects sampled from a single relation • Object Attributes (OA) • Linked Data • Links among objects • Represented as a graph • [Diagram: objects X1 to X5]

  36. Link-based Object Classification • Predicting the category of an object based on its attributes and its links and attributes of linked objects • Citation domain: Predict the topic of a paper, based on word occurrence, citations and co-citations

  37. Related Work: Link-based Classification • Hypertext Classification using Links • Class labels of linked objects • Soumen Chakrabarti (1998) • Oh, et al. (1999) • Unique document ID, Popescul et al. (2002) • Regularities, Yang et al. (2002) • Use of Unlabeled Data • Co-training, Blum and Mitchell (1998) • EM-algorithm, Nigam et al. (2000) • Systematic investigation of EM and Co-training, Ghani (2001) • TSVM, Joachims (1999)

  38. Our Approach • Link-based models • Integrate link features with object attributes using logistic regression • Investigate use of labeled and unlabeled data for link-based classification

  39. Features • Object Attributes • Notation: OA(X) • Link Descriptions • Notation: LD(X) • Statistics computed from linked objects • Computed separately for each of: • In-Links(X) • Out-Links(X) • Co-In(X) • Co-Out(X) • Three types of Link Descriptions: • Mode, Binary, Count
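The link sets and the three description types listed above can be sketched as follows. Two points are assumptions, since the slides do not define them: Co-In(X) is taken to be the other objects pointed to by X's in-linkers, and Co-Out(X) the other objects that point to X's out-link targets. Counts are over distinct neighbor objects. The function names, the edge list and the category labels are all hypothetical.

```python
from collections import Counter


def neighbor_sets(edges, x):
    """Return the four neighbor sets of x from directed (src, dst) edges."""
    in_links = {s for s, d in edges if d == x}
    out_links = {d for s, d in edges if s == x}
    # Assumed definitions (not given on the slide):
    # co_in  = other objects that x's in-linkers also point to
    # co_out = other objects that point to x's out-link targets
    co_in = {d for s, d in edges if s in in_links and d != x}
    co_out = {s for s, d in edges if d in out_links and s != x}
    return {"in": in_links, "out": out_links, "co_in": co_in, "co_out": co_out}


def link_description(neighbors, labels, categories):
    """Summarize the categories of a neighbor set as mode, binary and count."""
    counts = Counter(labels[n] for n in neighbors if n in labels)
    count_vec = tuple(counts.get(c, 0) for c in categories)
    binary_vec = tuple(int(v > 0) for v in count_vec)
    # Ties in the mode go to the category listed first.
    mode = max(categories, key=lambda c: counts.get(c, 0)) if counts else None
    return {"mode": mode, "binary": binary_vec, "count": count_vec}


# Hypothetical toy graph around an object "x".
categories = ("A", "B", "C")
edges = [("p1", "x"), ("p2", "x"), ("p3", "x"),
         ("x", "p4"), ("p1", "p5"), ("p6", "p4")]
labels = {"p1": "A", "p2": "A", "p3": "B", "p4": "C", "p5": "B", "p6": "A"}

sets = neighbor_sets(edges, "x")
ld = {name: link_description(members, labels, categories)
      for name, members in sets.items()}
```

The binary vector is just the count vector thresholded at zero, and the mode is its arg-max, so all three descriptions come from one pass over the neighbor categories.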

  40-46. Link Descriptions • [Figure built up over seven slides: an object X with its In-Links(X), Out-Links(X), CO(X) and CI(X) neighbor sets over a set of categories; each set is summarized by a mode, a binary vector and a count vector. Values shown: binary (1,1,1) with count (3,1,1); binary (1,1,0) with count (1,2,0); binary (1,1,0) with count (2,1,0); binary (1,0,0) with count (2,0,0). The mode values and the assignment of vectors to link sets are not recoverable from the transcript.]

  47. Predictive Model for Classification • A structured logistic regression • Compute P(c | OA(X)) and P(c | LD(X)) separately using separate logistic regression models, where OA(X) are the object attributes and LD(X) are the link features
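A minimal sketch of the two-model setup follows. The transcript does not state how the two posteriors are combined, so the `combine` step below (multiply and renormalize) is an assumption chosen for illustration, not the authors' formula. The weights, biases and feature vectors are hypothetical, and a real model would learn the weights from training data.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities, shifted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logistic_probs(weights, bias, features):
    """Multinomial logistic regression: one weight row per category."""
    scores = [b + sum(w * f for w, f in zip(row, features))
              for row, b in zip(weights, bias)]
    return softmax(scores)


def combine(p_oa, p_ld):
    """Assumed combination: multiply the two posteriors and renormalize."""
    joint = [a * b for a, b in zip(p_oa, p_ld)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-category example.
oa_features = [1.0, 0.0]        # e.g. word indicators for object X
ld_features = [2.0, 1.0, 0.0]   # e.g. a count link description for X

p_oa = logistic_probs([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], oa_features)
p_ld = logistic_probs([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]], [0.0, 0.0], ld_features)
p = combine(p_oa, p_ld)
predicted = p.index(max(p))
```

Keeping the attribute model and the link model separate lets each be trained on its own feature space; only the final combination step couples them.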

  48. Prediction • Step 1: Bootstrap using object attributes only • [Diagram: papers P1 to P5 and a category set]

  49. Prediction • Step 2: Iteratively update the category of each object, based on linked objects’ categories • [Diagram: papers P1 to P5]
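The two prediction steps can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' implementation: `attr_predict` and `link_predict` stand in for the learned models, the stopping rule (stop when no label changes, or after `max_iters` passes) is assumed, and the toy graph, initial guesses and majority-vote rule are hypothetical.

```python
from collections import Counter


def iterative_classification(nodes, neighbors, attr_predict, link_predict,
                             max_iters=10):
    # Step 1: bootstrap every object from its own attributes only.
    labels = {n: attr_predict(n) for n in nodes}
    # Step 2: revisit objects, re-predicting from current neighbor labels.
    for _ in range(max_iters):
        changed = False
        for n in nodes:
            neighbor_labels = [labels[m] for m in neighbors.get(n, ())]
            new_label = link_predict(n, neighbor_labels)
            if new_label != labels[n]:
                labels[n] = new_label  # later objects see this update
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy problem with five papers and two categories.
nodes = ["P1", "P2", "P3", "P4", "P5"]
neighbors = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P3", "P5"],
    "P5": ["P4"],
}
attr_guess = {"P1": "A", "P2": "A", "P3": "B", "P4": "B", "P5": "B"}


def majority_vote(node, neighbor_labels):
    """Stand-in link model: neighbor majority, own attributes break ties."""
    votes = Counter(neighbor_labels)
    votes[attr_guess[node]] += 0.5
    return votes.most_common(1)[0][0]


result = iterative_classification(nodes, neighbors, attr_guess.get, majority_vote)
```

In this toy run P3 starts as "B" from its attributes and flips to "A" once its neighbors' labels are taken into account. Labels are updated in place within a pass, so the visiting order can affect the outcome; a synchronous variant would compute all new labels from the previous pass before applying them.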

  50. Data Sets
