Presentation Transcript


  1. IJCAI 2003 Workshop on Learning Statistical Models from Relational Data: First-Order Probabilistic Models for Information Extraction. NIPS 15, 2003: Identity Uncertainty and Citation Matching. Advisor: Hsin-Hsi Chen. Reporter: Chi-Hsin Yu. Date: 2007.06.21

  2. Outline • Introduction • Related works • Models for the bibliography domain • Experiment on model A • Desiderata for a FOPL • Conclusions

  3. Introduction – Citation Matching Problem • Citation matching: the problem of deciding which citations correspond to the same publication • Difficulties • Different citation styles • An imperfect copy of the book’s title • Different ways to refer to an object (identity) • Ambiguity: in “Wauchope, K. Eucalyptus: Integrating Natural language Input with a Graphical User Interface”, is the author “Wauchope, K. Eucalyptus” or “Wauchope, K.”? • Tasks: parsing, disambiguation, and matching (see the token-overlap sketch below)
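
A minimal sketch of why exact string comparison is not enough for citation matching: the same publication rendered in two citation styles looks quite different as a raw string, yet after a simple normalization the token sets overlap almost completely, which is what overlap-based matchers exploit and what ambiguity (e.g., where the author field ends) can still break. The second citation string and the normalization rule below are illustrative, not from the paper.

    import re

    def normalize(citation):
        # Lowercase, keep alphabetic tokens only, drop very short tokens.
        tokens = re.findall(r"[a-z]+", citation.lower())
        return {t for t in tokens if len(t) > 2}

    c1 = ("Wauchope, K. Eucalyptus: Integrating Natural language Input "
          "with a Graphical User Interface")
    c2 = ("K. Wauchope. EUCALYPTUS: integrating natural-language input "
          "with a graphical user interface")

    print(c1 == c2)                                   # False: raw strings differ
    shared = normalize(c1) & normalize(c2)
    union = normalize(c1) | normalize(c2)
    print(f"shared tokens: {sorted(shared)}")
    print(f"Jaccard similarity: {len(shared) / len(union):.2f}")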

  4. Introduction – Citation Matching Problem: Examples • Journal of Artificial Intelligence Research, or Artificial Intelligence Journal?

  5. Introduction – First-Order Probabilistic Models

  6. Introduction – Result of Model B

  7. Related Works • Information extraction (IE): the Message Understanding Conferences [DARPA, 1998] • Bayesian modeling: finding stochastically repeated patterns (motifs) in DNA sequences [Xing et al., 2003]; robot localization [Anguelov et al., 2002] • FOPL/RPM (Relational Probability Model): A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.

  8. Models for the Bibliography Domain – Model A • [Pasula et al., 2003]

  9. Models for the Bibliography Domain – Model A (Cont.) • Suggests a declarative approach to identity uncertainty using a formal language • Algorithm steps: generate objects/instances, parse and fill attributes, inference (approximation via MCMC), cluster the identities (publications); see the sketch below
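
The slide only names the inference step, so here is a minimal, hypothetical sketch of MCMC over identity assignments: each citation is assigned to a publication cluster, a move proposes reassigning one citation, and the move is accepted with a Metropolis rule. The score function is a stand-in for the model's log-probability of the whole relational world, not the paper's actual likelihood.

    import math
    import random

    def score(assign, citations):
        # Stand-in log-score: reward clusters whose citations share tokens.
        clusters = {}
        for i, k in enumerate(assign):
            clusters.setdefault(k, []).append(set(citations[i].lower().split()))
        total = 0.0
        for members in clusters.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    total += len(members[a] & members[b])
        return total

    def mcmc_match(citations, iters=5000):
        assign = list(range(len(citations)))          # start: one cluster each
        current = score(assign, citations)
        for _ in range(iters):
            i = random.randrange(len(citations))
            old = assign[i]
            # Propose joining an existing cluster or opening a new one.
            assign[i] = random.choice(assign + [max(assign) + 1])
            proposed = score(assign, citations)
            # Metropolis acceptance, treating the score as a log-probability.
            if proposed >= current or random.random() < math.exp(proposed - current):
                current = proposed                    # accept the move
            else:
                assign[i] = old                       # reject and restore
        return assign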

  10. Models for the Bibliography Domain – Model A (Cont.) • Attributes with unconditional probability: learn several letter-based bigram models of first names, surnames, and title words, using the 2000 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations (see the bigram sketch below) • Attributes with conditional probability: noise channels for some attributes, i.e. the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames; their parameters are learned online using stochastic EM • Citation.parse: keeps track of the segmentation of Citation.text into an author segment, a title segment, and three filler segments (one before, one after, and one in between) • Citation.text: constrained by Citation.parse, Paper.pubType, …; these models were learned from the authors’ pre-segmented file
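
As a concrete illustration of the unconditional attribute priors, here is a tiny letter-bigram model with add-alpha smoothing. The training list, smoothing constant, and alphabet size are placeholders; the paper trains on the Census names, a BibTeX bibliography, and 500 hand-parsed citations.

    from collections import defaultdict

    def train_bigram(words):
        counts = defaultdict(lambda: defaultdict(int))
        for w in words:
            w = "^" + w.lower() + "$"                 # start / end markers
            for a, b in zip(w, w[1:]):
                counts[a][b] += 1
        return counts

    def bigram_prob(counts, word, alpha=1.0, alphabet=28):
        # Add-alpha smoothed probability of generating `word` letter by letter.
        p = 1.0
        w = "^" + word.lower() + "$"
        for a, b in zip(w, w[1:]):
            total = sum(counts[a].values())
            p *= (counts[a][b] + alpha) / (total + alpha * alphabet)
        return p

    surnames = train_bigram(["pasula", "marthi", "milch", "russell", "shpitser"])
    print(bigram_prob(surnames, "russel"))            # plausible letter sequence
    print(bigram_prob(surnames, "xqzkv"))             # implausible letter sequence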

  11. Models for the Bibliography Domain – Model B

  12. Models for the Bibliography Domain – Model B (Cont.) • Generating objects • The set of Author objects and the set of Collection objects are generated independently • The set of Publication objects is generated conditional on the Authors and Collections • CitationGroup objects are generated conditional on the Authors and Collections • Citation objects are generated from the CitationGroups (see the schematic sketch below)
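
A schematic sketch of Model B's generation order as summarized above: Authors and Collections first, then Publications conditioned on them, then CitationGroups, then Citations. All distributions here are crude placeholders, and as a simplification the citation groups are drawn directly over publications; the point is only the dependency ordering.

    import random

    def generate_world(n_authors=5, n_collections=2, n_pubs=6, n_groups=3):
        authors = [f"author_{i}" for i in range(n_authors)]
        collections = [f"collection_{i}" for i in range(n_collections)]
        # Publications are generated conditional on Authors and Collections.
        publications = [
            {"title": f"title_{i}",
             "authors": random.sample(authors, k=random.randint(1, 3)),
             "collection": random.choice(collections)}
            for i in range(n_pubs)
        ]
        # Citation groups (e.g. the reference list of a citing paper).
        groups = [random.sample(publications, k=random.randint(1, n_pubs))
                  for _ in range(n_groups)]
        # Citations are generated from the CitationGroups as noisy renderings.
        citations = [{"cites": pub["title"], "text": pub["title"].upper()}
                     for group in groups for pub in group]
        return authors, collections, publications, groups, citations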

  13. Models for the Bibliography Domain – Model B (Cont.) • Fill attributes • Author.Name is chosen from a mixture of a letter bigram distribution and a distribution over a set of commonly occurring names (see the mixture sketch below) • Publication.Title is generated from an n-gram model, conditioned on Publication.area • More specific relations and conditions hold between attributes
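
A short sketch of the Author.Name mixture described above: with some probability pick a commonly occurring name, otherwise generate a new name from a letter-bigram sampler. The mixture weight, name list, and transition-table format are illustrative assumptions, not the paper's values.

    import random

    COMMON_NAMES = ["smith", "johnson", "chen", "russell"]    # placeholder list

    def sample_bigram_name(transitions, max_len=12):
        # Walk a letter-bigram transition table from the start marker '^'.
        name, prev = "", "^"
        while len(name) < max_len:
            nxt = random.choice(transitions.get(prev, ["$"]))
            if nxt == "$":                                    # end-of-name marker
                break
            name += nxt
            prev = nxt
        return name or "x"

    def sample_author_name(transitions, common_weight=0.7):
        if random.random() < common_weight:
            return random.choice(COMMON_NAMES)
        return sample_bigram_name(transitions)

    toy_transitions = {"^": ["m", "p"], "m": ["a", "i"], "a": ["$"],
                       "i": ["l"], "l": ["$"], "p": ["a"]}
    print(sample_author_name(toy_transitions))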

  14. Experiment on Model A – Experiment Setting • Dataset: Citeseer’s hand-matched datasets; each contains several hundred citations of machine learning papers, half of them in clusters ranging in size from two to twenty-one citations • Citeseer’s phrase matching algorithm: a greedy agglomerative clustering method, based on a metric that measures the degree to which the words and phrases of any two citations overlap (see the sketch below)
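
A minimal sketch of a greedy agglomerative matcher of the kind described above: repeatedly merge the pair of clusters with the highest word-overlap score until no pair exceeds a threshold. The overlap metric and threshold are stand-ins for Citeseer's actual phrase-matching heuristics.

    def word_overlap(c1, c2):
        w1, w2 = set(c1.lower().split()), set(c2.lower().split())
        return len(w1 & w2) / max(1, min(len(w1), len(w2)))

    def cluster_similarity(a, b):
        # Average pairwise overlap between two clusters of citation strings.
        pairs = [(x, y) for x in a for y in b]
        return sum(word_overlap(x, y) for x, y in pairs) / len(pairs)

    def greedy_agglomerate(citations, threshold=0.5):
        clusters = [[c] for c in citations]           # start: one citation each
        while True:
            best, pair = threshold, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = cluster_similarity(clusters[i], clusters[j])
                    if s > best:
                        best, pair = s, (i, j)
            if pair is None:                          # no merge above threshold
                return clusters
            i, j = pair
            clusters[i].extend(clusters.pop(j))       # merge the best pair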

  15. Experiment on Model A – Experiment Result

  16. Desiderata for a FOPL • Contains • A probability distribution over possible worlds • The expressive power to model the relational structure of the world • An efficient inference algorithm • A learning procedure that allows priors over the parameters • Has the ability • To answer queries • To make inferences about the existence or nonexistence of objects having particular properties • To represent common types of compound objects • To represent probabilistic dependencies • To incorporate domain knowledge into the inference algorithms

  17. Conclusions • First-order probabilistic models are a useful, probably necessary, component of any system that extracts complex relational information from unstructured text data • Directions the authors plan to pursue in the future: defining a representation language that allows such models to be specified declaratively, and scaling up the inference procedure to handle large knowledge bases

  18. Thanks!!
