Explore the application of probabilistic models to classify web pages and documents based on their content and links, with examples from WebKB and CORA datasets. Learn about link uncertainty, reference uncertainty, modeling citation structures, and constraints between citing and cited papers.
Learning Probabilistic Models of Link Structure Getoor, Friedman, Koller, Taskar
Example Application: WebKB • Classify web page as course, student, professor, project, none using… • Words on the web page • Links from other web pages (and the class of those pages, recursively) • Words in the “anchor text” from the other page: <a href="url">anchor text</a> • Web pages obtained from Cornell, Texas, Washington, and Wisconsin
Example Application: CORA • Classify documents according to topic (7 levels) using… • words in the document • papers cited by the document • papers citing the document
Standard PRM • parents(Doc.class) = {MODE(Doc.citers.class), MODE(Doc.cited.class)} [Figure: Document objects, each with class and words attributes; a MODE aggregate over the citers and a MODE aggregate over the cited papers feed each document's class]
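The MODE aggregation on this slide can be sketched in a few lines of Python. This is only an illustration: the paper IDs, topics, and citation pairs below are hypothetical toy data, and `mode` and `parent_values` are names invented for this sketch, not part of the authors' system.

```python
from collections import Counter

def mode(values):
    """Return the most common value, or None if there are no values."""
    counts = Counter(values)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy data: paper -> class, and (citing, cited) pairs.
doc_class = {"p1": "Theory", "p2": "Theory", "p3": "Graphics", "p4": "Learning"}
cites = [("p1", "p4"), ("p2", "p4"), ("p3", "p4"), ("p4", "p1")]

def parent_values(doc):
    """Values of the two parents of doc.class:
    (MODE(doc.citers.class), MODE(doc.cited.class))."""
    citers = [doc_class[a] for a, b in cites if b == doc]
    cited = [doc_class[b] for a, b in cites if a == doc]
    return mode(citers), mode(cited)

print(parent_values("p4"))
```

Whatever the number of citers, the class variable sees only one aggregated value per direction, which is why individual links carry so little weight in this model.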
Problem: The Citation Structure is Fixed • The existence (or non-existence) of a link cannot serve as evidence • Individually-linked papers only influence the class through the MODE.
Possible Solution: Link Uncertainty • Model the existence of links as random variables • Create a Link instance for each pair of possibly-linked objects
Unrolled Network [Figure: Document objects (class, words) and Cites objects, each Cites object carrying an Exists variable]
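A minimal sketch of link uncertainty, assuming a made-up CPT in which Exists depends only on whether the citing and cited papers share a class. The probabilities `P_SAME` and `P_DIFF`, the toy papers, and the function names are all hypothetical. The point is that one Exists variable is created for every ordered pair of documents, so both present and absent links contribute to the likelihood.

```python
import math
from itertools import permutations

# Hypothetical toy data.
doc_class = {"p1": "Theory", "p2": "Theory", "p3": "Graphics"}
observed_links = {("p1", "p2")}  # (citing, cited)

# Hypothetical CPT: P(Exists = true | citing.class, cited.class).
P_SAME, P_DIFF = 0.10, 0.01

def p_exists(citing, cited):
    """Probability that a Cites link exists between two papers."""
    return P_SAME if doc_class[citing] == doc_class[cited] else P_DIFF

def link_log_likelihood():
    """Sum log P(Exists) over every ordered pair of distinct papers."""
    total = 0.0
    for citing, cited in permutations(doc_class, 2):
        p = p_exists(citing, cited)
        linked = (citing, cited) in observed_links
        total += math.log(p if linked else 1.0 - p)
    return total

# One Exists variable per ordered pair: n * (n - 1) of them.
num_exists_vars = len(list(permutations(doc_class, 2)))
print(num_exists_vars, link_log_likelihood())
```

The quadratic number of Exists variables is the price of letting the absence of a link count as evidence.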
Getoor’s Diagram • Entity classes (Paper) • Relation classes (Cites) • Technically, every instance has an Exists variable which is true for all Entity instances.
Semantics • P is the basic CPT • P* will be the equivalent unrolled CPT • Require that an object does not exist if any of the objects it points to do not exist
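The existence constraint can be read as a small wrapper around the basic CPT. The sketch below assumes that reading; `p_star_exists` and its arguments are hypothetical names chosen for illustration.

```python
def p_star_exists(base_p, referenced_exist):
    """Unrolled P*(Exists = true | parents).

    base_p: P(Exists = true | parents) from the basic CPT P.
    referenced_exist: one boolean per object this object points to.
    """
    # An object cannot exist if anything it points to does not exist.
    if not all(referenced_exist):
        return 0.0
    return base_p

# A Cites object points to a citing paper and a cited paper.
print(p_star_exists(0.1, [True, True]))   # both papers exist: use P
print(p_star_exists(0.1, [True, False]))  # cited paper missing: forced to 0
```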
Experimental Results • Cora and WebKB
A Second Approach: Reference Uncertainty • Treat reference attributes as random variables • Each reference attribute takes as value an object of the indicated class • Citation • Citing: reference attribute, value is a Paper • Cited: reference attribute, value is a Paper
Problems • How many citation objects exist? Consequently, how many reference random variables exist? • How do we represent P(Citation.cites | …)? Citation.cites could take on thousands of possible values. • Huge conditional probability table • Costly inference at run time
Solutions • Problem 1: How many citations? • Fix the number of Citation objects • This gives the “object skeleton”
Problem 2: Too many potential values for a reference attribute • Attach to each reference attribute a set of partition attributes • The reference attribute chooses a partition • A Paper is then chosen uniformly at random from the partition [Figure: Paper objects grouped into Theory, Learning, and Graphics partitions; a Citation object with Citing and Cited reference attributes]
Representing Constraints Between Citing and Cited Papers Parents(Cites.Cited) = {Cites.Citing.Topic}
Details • Each reference attribute has a selector attribute S that chooses the partition. [Figure: Paper objects grouped into Theory, Learning, and Graphics partitions; a Citation object with selectors Sciting and Scited for its Citing and Cited reference attributes]
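The two-stage choice (selector picks a partition, then a Paper is drawn uniformly from it) can be sketched as follows. The sketch assumes Topic is the only partition attribute and that the selector for Cited depends on Cites.Citing.Topic; the papers, the CPT numbers, and the function names are hypothetical. Self-citation is not excluded, for simplicity.

```python
import random
from collections import defaultdict

# Hypothetical toy data: paper -> topic (the partition attribute).
paper_topic = {
    "t1": "Theory", "t2": "Theory", "t3": "Theory",
    "l1": "Learning", "l2": "Learning",
    "g1": "Graphics",
}

# Group the papers into one partition per topic.
partitions = defaultdict(list)
for paper, topic in paper_topic.items():
    partitions[topic].append(paper)

# Hypothetical selector CPT: P(S_cited = topic | Citing.Topic).
selector_cpt = {
    "Theory":   {"Theory": 0.80, "Learning": 0.15, "Graphics": 0.05},
    "Learning": {"Theory": 0.20, "Learning": 0.70, "Graphics": 0.10},
    "Graphics": {"Theory": 0.05, "Learning": 0.15, "Graphics": 0.80},
}

def p_cited(cited, citing):
    """P(Cited = cited | citing): selector probability for the cited
    paper's partition, spread uniformly over that partition."""
    topic = paper_topic[cited]
    return selector_cpt[paper_topic[citing]][topic] / len(partitions[topic])

def sample_cited(citing, rng):
    """Sample a cited paper: pick a partition, then a paper within it."""
    dist = selector_cpt[paper_topic[citing]]
    topic = rng.choices(list(dist), weights=list(dist.values()))[0]
    return rng.choice(partitions[topic])

print(p_cited("t2", "t1"), sample_cited("t1", random.Random(0)))
```

The CPT here has one entry per (citing topic, partition) pair rather than one per candidate paper, which is what keeps the representation small when there are thousands of papers.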
Class-level Dependency Graph • Five types of edges • Type I: edges within a single object • Type II: edges between objects • Type III: edges from every reference attribute along any reference paths • Type IV: edges from every partition attribute to the selector attributes that use those partition attributes to choose a partition • Type V: edge from selector attributes to their corresponding reference attributes
Movie Theater Example • Type I: Genre → Popularity • Type II: Shows.Movie.Genre → Shows.Profit; Shows.Theater.Type → SMovie • Type III: Movie → Profit; Theater → SMovie • Type IV: Genre → SMovie • Type V: STheater → Theater; SMovie → Movie
Unrolled Graph? • The Unrolled Graph can have a huge number of edges • Is learning and inference really feasible?
Homework Exercise • Construct the dependency graph for the citation example • Construct an unrolled network for a reference uncertainty example