Probabilistic Models for Relational Data
Seminar Data Mining (SS 2005)
Prof. Dr. Thomas Hofmann, Dipl. Inform. Steffen Hartmann
Xin Dong
05.07.2005
History/Introduction
• from “flat” data to relational data
• plate models and probabilistic relational models (PRMs)
  - graphically quite different
  - similar in the probabilistic relationships they express
• probabilistic entity-relationship (PER) model
  - an extension of the ER model
  - enhances expressiveness
  - makes relationships first-class objects
  - makes relational data easy to model
• directed acyclic probabilistic entity-relationship (DAPER) model
  - makes the similarity between plate models and PRMs explicit, and is more expressive than either
  - supports restricted relationships, self relationships, and probabilistic relationships
The Basic Ideas --- ER Model
Entity-relationship (ER) model
• a commonly used abstract representation of database structure
• the first step in building a relational database
• encodes the features of the anticipated data and how they interrelate
• used to create a relational schema for the database, which in turn is used to build the database itself
• represents a database structure, not a particular database that contains data
The Basic Ideas --- ER Model
Definitions
• entity --- a thing or object that is or may be stored in a database
• relationship --- a specific interaction among entities
• attribute --- a variable describing some property of an entity or relationship
The Basic Ideas --- ER Model
Example 1
A university database maintains records on students and their IQs, courses and their difficulty, and the courses taken by students and the grades they receive.
Distinguish between:
• ER diagram and ER model
  - ER diagram --- the graph only
  - ER model --- ER diagram + the mechanism that defines attributes when the model is applied to a skeleton
• skeleton and instance for an ER model
  - skeleton --- collection of corresponding entity and relationship sets
  - instance --- skeleton + assignment of a value to every attribute
  - an instance of an ER model is an actual database
[Figure: university database example]
(a) ER model: entity classes Student (attribute class IQ) and Course (attribute class Diff); relationship class Takes (attribute class Grade).
(b) An example skeleton for the entity and relationship classes: entity sets Student = {john, mary} and Course = {cs107, stat10}; relationship set Takes = {(john, cs107), (mary, cs107), (mary, stat10)}.
(c) The attributes defined by the application of the ER model to the skeleton: john.IQ, mary.IQ, cs107.Diff, stat10.Diff, T(john, cs107).G, T(mary, cs107).G, T(mary, stat10).G.
Skeleton for a set of entity and relationship classes

  Student: john, mary
  Course:  cs107, stat10
  Takes (Student, Course): (john, cs107), (mary, cs107), (mary, stat10)

Instance for an ER model

  Student   IQ
  john      120
  mary      125

  Course    Diff
  cs107     A
  stat10    B

  Takes:
  Student   Course    Grade
  john      cs107     3.0
  mary      cs107     2.0
  mary      stat10    1.0
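The distinction between skeleton and instance can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the dictionary layout and the helper name `attributes` are assumptions, while the entity names and attribute values are the ones from the tables above.

```python
# ER model: attribute classes attached to each entity or relationship class.
attribute_classes = {"Student": ["IQ"], "Course": ["Diff"], "Takes": ["Grade"]}

# Skeleton: which entities and relationships exist (no attribute values yet).
skeleton = {
    "Student": ["john", "mary"],
    "Course": ["cs107", "stat10"],
    "Takes": [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
}


def attributes(skeleton, attribute_classes):
    """Apply the ER model to the skeleton: one attribute per member and attribute class."""
    return [
        (cls, member, attr)
        for cls, members in skeleton.items()
        for member in members
        for attr in attribute_classes[cls]
    ]


attrs = attributes(skeleton, attribute_classes)

# Instance: the skeleton plus a value for every attribute.
instance = {
    ("Student", "john", "IQ"): 120,
    ("Student", "mary", "IQ"): 125,
    ("Course", "cs107", "Diff"): "A",
    ("Course", "stat10", "Diff"): "B",
    ("Takes", ("john", "cs107"), "Grade"): 3.0,
    ("Takes", ("mary", "cs107"), "Grade"): 2.0,
    ("Takes", ("mary", "stat10"), "Grade"): 1.0,
}

# An instance must assign a value to exactly the attributes the skeleton defines.
is_instance = set(instance) == set(attrs)
```

Here `attrs` plays the role of panel (c) of the figure, and `is_instance` checks that the value assignment covers every attribute and nothing else.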
The Basic Ideas --- DAPER Model
Directed acyclic probabilistic entity-relationship (DAPER) model
• an ER model with directed (solid) arcs and local distribution classes
  - arc classes --- represent probabilistic dependencies among corresponding attributes
  - local distribution classes --- define local distributions for attributes
• DAPER diagram --- the graph
• DAPER model --- diagram + local distribution classes + the mechanism by which a DAPER model defines a directed acyclic graphical (DAG) model given a skeleton
The Basic Ideas --- DAPER Model
Example 2
In the university database (Example 1), a student’s grade in a course depends both on the student’s IQ and on the difficulty of the course.
• arc class
• constraint
• local distribution class
  - a specification from which local distributions for attributes corresponding to the attribute class can be constructed when a DAPER model is expanded to a DAG model
  - the local distribution class for Takes.Grade, p(Takes.Grade | Student.IQ, Course.Diff), is a specification from which the local distributions for Takes(s, c).Grade, for all students s and courses c, can be constructed
[Figure: DAPER model for the university database]
(a) DAPER model: arc classes into Takes.Grade from Student.IQ, with constraint Student[IQ] = Student[Grade], and from Course.Diff, with constraint Course[Diff] = Course[Grade].
(b) An example skeleton for the entity and relationship classes: Student = {john, mary}; Course = {cs107, stat10}; Takes = {(john, cs107), (mary, cs107), (mary, stat10)}.
(c) Directed acyclic graphical (DAG) model defined by application of the DAPER model to the ER skeleton, over the attributes john.IQ, mary.IQ, cs107.Diff, stat10.Diff, T(john, cs107).G, T(mary, cs107).G, T(mary, stat10).G.
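The expansion of a DAPER model to a DAG given a skeleton can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple encoding of arc classes and the function name `ground_graph` are hypothetical, and each constraint is written as a Python predicate over a parent entity and a child relationship.

```python
skeleton = {
    "Student": ["john", "mary"],
    "Course": ["cs107", "stat10"],
    "Takes": [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
}

# Arc classes: (parent class, parent attribute, child class, child attribute, constraint).
# A constraint takes a parent member and a child member and says whether to draw the arc.
arc_classes = [
    # Student[IQ] = Student[Grade]: the student behind IQ is the student in Takes.
    ("Student", "IQ", "Takes", "Grade", lambda s, t: s == t[0]),
    # Course[Diff] = Course[Grade]: the course behind Diff is the course in Takes.
    ("Course", "Diff", "Takes", "Grade", lambda c, t: c == t[1]),
]


def ground_graph(skeleton, arc_classes):
    """Draw one arc per pair of attributes that satisfies the arc class constraint."""
    arcs = set()
    for p_cls, p_attr, c_cls, c_attr, constraint in arc_classes:
        for parent in skeleton[p_cls]:
            for child in skeleton[c_cls]:
                if constraint(parent, child):
                    arcs.add(((parent, p_attr), (child, c_attr)))
    return arcs


arcs = ground_graph(skeleton, arc_classes)
```

Each `Takes(s, c).Grade` node ends up with exactly two parents, `s.IQ` and `c.Diff`, which is the parent set that the local distribution class p(Takes.Grade | Student.IQ, Course.Diff) needs.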
The Basic Ideas --- Plate Model
• developed as a language for compactly representing graphical models in which there are repeated measurements
• no formal definition of a plate model exists, so we provide one here; this definition enhances the expressivity of such models while retaining their essence
• plate models and DAPER models are equivalent
The invertible mapping from a DAPER model to a plate model
• entity class -> a large rectangle, called a plate
  - the plate is labeled with the entity-class name
  - plates are allowed to intersect or overlap
• a relationship class is drawn at the named intersection of the plates
• attribute classes of an entity class are drawn as ovals inside the rectangle corresponding to the entity, but outside any intersection
• attribute classes associated with a relationship class are drawn in the intersection corresponding to the relationship class
• arc classes and constraints are drawn just as they are in DAPER models
• in addition, local distribution classes are specified just as they are in DAPER models (not shown in the graph)
[Figure: Plate model depicting the structure of a university database. Plates Course (Diff) and Student (IQ) intersect at Takes (Grade); constraints Course[Diff] = Course[Grade] and Student[IQ] = Student[Grade].]
The Basic Ideas --- PRMs
Probabilistic relational models (PRMs)
• developed explicitly for representing relational data
• extend the relational model, another commonly used representation for the structure of a database
• directed PRMs are equivalent to DAPER models and plate models
The invertible mapping from a DAPER model to a directed PRM
• the ER-model component of the DAPER model is mapped to a relational model in a standard way
  - both entity and relationship classes are represented as tables
  - attribute classes for entity and relationship classes are represented as attributes (columns) in the corresponding tables of the relational model
• the probabilistic components of the DAPER model are mapped to those of the directed PRM
  - arc classes and constraints are drawn just as they are in the DAPER model
[Figure: PRM depicting the structure of a university database. Tables Course (Diff), Student (IQ), and Takes (Course, Student, Grade); constraints Course[Diff] = Course[Grade] and Student[IQ] = Student[Grade].]
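The relational half of this mapping can be illustrated with an in-memory SQLite schema. This is a sketch only: the key columns `name`, `student`, and `course` and their types are assumptions made for the example, while the tables and the attribute columns IQ, Diff, and Grade come from the university example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Entity classes become tables; attribute classes become columns.
    CREATE TABLE Student (name TEXT PRIMARY KEY, IQ INTEGER);
    CREATE TABLE Course  (name TEXT PRIMARY KEY, Diff TEXT);
    -- The relationship class also becomes a table that references the entities.
    CREATE TABLE Takes (
        student TEXT REFERENCES Student(name),
        course  TEXT REFERENCES Course(name),
        Grade   REAL,
        PRIMARY KEY (student, course)
    );
    """
)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
# PRAGMA table_info returns one row per column; index 1 is the column name.
takes_columns = [row[1] for row in conn.execute("PRAGMA table_info(Takes)")]
```

The probabilistic components (arc classes, constraints, local distribution classes) are not represented here; the sketch covers only the standard ER-to-relational step.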
Probabilistic Entity-Relationship Models
• Fundamentals
  - ground graph --- the structure of the DAG model created by expanding a DAPER model given a skeleton
  - drawing of arcs --- an important part of this expansion mechanism, through which important conditional independence relations can be expressed
Probabilistic Entity-Relationship Models
Example 3
A database contains diseases and symptoms for a given patient. Every disease is a potential cause of every symptom.
Example 4
Extending Example 3, suppose a physician has identified the possible causes of each symptom.
[Figure: DAPER models for diseases and symptoms]
(a) A DAPER model for a complete bipartite graph between symptoms and diseases: entity classes Disease and Symptom, each with attribute class Present.
(b) A ground graph (a DAG model structure) over d1.Present, d2.Present, d3.Present and s1.Present, s2.Present, s3.Present. The ground graph generated by applying this DAPER model to any given skeleton is a full bipartite graph.
(c) A DAPER model for an incomplete bipartite graph between symptoms and diseases: relationship class Causes, with constraint Causes(d, s).
(d) A possible skeleton: Causes = {(d1, s1), (d1, s2), (d1, s3), (d2, s2), (d3, s3)}.
(e) A DAG model resulting from the expansion of the DAPER model to the skeleton.
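The difference between Examples 3 and 4 comes down to the constraint on the arc class. A small sketch, assuming arcs run from disease to symptom and using the skeleton from panel (d); the function name `ground_arcs` is hypothetical:

```python
diseases = ["d1", "d2", "d3"]
symptoms = ["s1", "s2", "s3"]

# Skeleton for the relationship class Causes, as in panel (d).
causes = {("d1", "s1"), ("d1", "s2"), ("d1", "s3"), ("d2", "s2"), ("d3", "s3")}


def ground_arcs(diseases, symptoms, constraint):
    """Draw an arc d.Present -> s.Present for every pair that satisfies the constraint."""
    return [
        (f"{d}.Present", f"{s}.Present")
        for d in diseases
        for s in symptoms
        if constraint(d, s)
    ]


# Example 3: no restriction, so every disease is a parent of every symptom.
full = ground_arcs(diseases, symptoms, lambda d, s: True)

# Example 4: the constraint Causes(d, s) keeps only the pairs in the skeleton.
restricted = ground_arcs(diseases, symptoms, lambda d, s: (d, s) in causes)
```

With three diseases and three symptoms, `full` is the complete bipartite graph with nine arcs, while `restricted` contains only the five arcs listed in the Causes skeleton.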
Probabilistic Entity-Relationship Models
Example 5
Extending Example 3 in a different way, suppose the physician has identified both primary (major) and secondary (minor) causes of disease.
Example 6
Extending Example 3 in a different way, suppose that both diseases and symptoms have category labels, drawn from the same set of categories. The possible causes of a symptom are diseases that have at least one category in common with that symptom.
[Figure: DAPER models with more general constraints]
(a) A DAPER model (in Example 4), with constraint Causes(d, s).
(b) A DAPER model with a disjunctive constraint: 1°Causes(d, s) ∨ 2°Causes(d, s).
(c) A constraint containing the existence quantifier: relationship classes R1 and R2 connect Disease and Symptom to the entity class Category.
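Both kinds of constraint can be written as ordinary predicates. The sketch below is illustrative only: the skeleton data are invented, and the names `primary`, `secondary`, `disease_categories`, and `symptom_categories` are hypothetical stand-ins for the relationship sets 1°Causes, 2°Causes, and the two category relationships. The existential constraint is written directly from the wording of Example 6 (at least one category in common).

```python
diseases = ["d1", "d2", "d3"]
symptoms = ["s1", "s2"]

# Example 5: primary and secondary causes, combined by a disjunction.
primary = {("d1", "s1")}
secondary = {("d2", "s1"), ("d2", "s2")}


def disjunctive(d, s):
    """1°Causes(d, s) or 2°Causes(d, s)."""
    return (d, s) in primary or (d, s) in secondary


# Example 6: category labels for diseases and symptoms.
disease_categories = {"d1": {"cardiac"}, "d2": {"cardiac", "renal"}, "d3": {"dermal"}}
symptom_categories = {"s1": {"renal"}, "s2": {"cardiac"}}


def shares_category(d, s):
    """There exists a category c that both disease d and symptom s belong to."""
    return bool(disease_categories[d] & symptom_categories[s])


def ground_arcs(constraint):
    return [(d, s) for d in diseases for s in symptoms if constraint(d, s)]


arcs_disjunctive = ground_arcs(disjunctive)
arcs_existential = ground_arcs(shares_category)
```

The set intersection in `shares_category` is one way to evaluate an existence quantifier over categories; any category in common is enough to draw the arc.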
Probabilistic Entity-Relationship Models
• Restricted Relationships
  - a relationship class R in an ER (or PER) model is restricted when some skeletons for the entity and relationship classes of the ER model are prohibited
  - graphical notation has been developed for common restrictions
  - restrictions are an extremely useful tool for modeling with PER models
Probabilistic Entity-Relationship Models
Example 7
A binary outcome O is measured on patients in multiple hospitals. Each patient is treated in exactly one hospital. It is believed that outcomes in any given hospital h are i.i.d. given binomial parameter h.θ, and that these binomial parameters are themselves i.i.d. across hospitals given hyperparameters α.
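[Figure: hospital outcomes example]
(a) A DAPER model: entity classes Hospital and Patient, relationship class In.
(b) The ground graph obtained by applying the DAPER model to a skeleton containing m hospitals and n_i patients in hospital i, with hospital nodes h1.θ, ..., hm.θ and patient nodes p11.O, ..., p1n1.O, ..., pm1.O, ..., pmnm.O.
(c) A DAPER model equivalent to the one in (a).
Constraint labels in the diagrams: In(h, p) and h[ ]=h[ ].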
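The generative structure of Example 7 can be sketched by forward sampling. The slides only say that the parameters h.θ are i.i.d. given hyperparameters α; the Beta distribution used below, the second hyperparameter `beta`, and the function name `sample_outcomes` are assumptions made for this illustration.

```python
import random


def sample_outcomes(patients_per_hospital, alpha, beta, seed=0):
    """Sample h.θ per hospital, then a binary outcome O for each of its patients."""
    rng = random.Random(seed)
    data = {}
    for hospital, n_patients in patients_per_hospital.items():
        # Assumed hyper-prior: h.θ ~ Beta(alpha, beta), i.i.d. across hospitals.
        theta = rng.betavariate(alpha, beta)
        # Outcomes within one hospital are i.i.d. given that hospital's θ.
        outcomes = [int(rng.random() < theta) for _ in range(n_patients)]
        data[hospital] = (theta, outcomes)
    return data


# Each patient appears under exactly one hospital, mirroring the restricted
# relationship In: every patient is treated in exactly one hospital.
data = sample_outcomes({"h1": 3, "h2": 5}, alpha=2.0, beta=2.0, seed=0)
```

Storing patients under their hospital makes the many-to-one restriction structural: there is no way to place one patient in two hospitals in this representation.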
Probabilistic Entity-Relationship Models
• Self Relationships
  - self relationships are relationships that relate like entities (and perhaps other entities as well)
  - a self-relationship class is one that contains self relationships
Probabilistic Entity-Relationship Models
Example 9
In the university-database example (Example 2), a student’s grade in a course depends on whether an advisor of the student is a friend of a teacher of the course.
[Figure: self-relationship example]
(a) ER model
(b) DAPER model: F is a Full self-relationship class on Professor with attribute class Friend; constraint labels Teaches(p, c), Advises(pf, s), c[D]=c[G], and s[IQ]=s[G].
(c) DAPER model in which the Professor entity class has been copied. There are two instances of the Professor entity class, named “Professor (Teacher)” and “Professor (Advisor).” Note that copying allows us to annotate the role that each copy of the entity class plays in the self-relationship class. Models drawn with this copy convention are sometimes more transparent.
Other labels in the diagrams: Course (Diff), Student (IQ), Takes (Grade), Teaches, Advises, F(p, pf).
Annotation: an ordinary attribute θ corresponding to this uncertain distribution.
F has one attribute class, F.Friend, where the attribute F(p, pf).Friend is true if professor pf is a friend of professor p. Note that F has the Full constraint so that we can model whether any one professor is a friend of another. Also note that F(p1, p2).Friend may be true while F(p2, p1).Friend is false.
The constraint on the arc class from F.Friend to Takes.Grade is Teaches(p, c) ∧ Advises(pf, s). Thus, in any ground graph generated from this model, there is an arc from attribute F(p, pf).Friend to attribute Takes(s, c).Grade whenever a teacher of the course is p and an advisor of the student is pf, which is precisely the additional dependence described in the example.
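The conjunctive constraint can be checked on a toy skeleton. The professors, course, and student below are invented for illustration; only the constraint Teaches(p, c) ∧ Advises(pf, s) and the Full relationship class F come from the slides.

```python
from itertools import product

professors = ["p1", "p2"]
teaches = {("p1", "c1")}   # Teaches(p, c): p1 teaches course c1
advises = {("p2", "s1")}   # Advises(pf, s): p2 advises student s1
takes = [("s1", "c1")]     # Takes(s, c): s1 takes c1

arcs = [
    (f"F({p},{pf}).Friend", f"Takes({s},{c}).Grade")
    # F is Full, so every ordered pair of professors is a candidate parent.
    for p, pf in product(professors, repeat=2)
    for s, c in takes
    # Draw the arc only when p teaches the course and pf advises the student.
    if (p, c) in teaches and (pf, s) in advises
]
```

Of the four ordered professor pairs, only (p1, p2) satisfies both conjuncts, so the grade of s1 in c1 depends on F(p1,p2).Friend and on no other Friend attribute.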
Probabilistic Entity-Relationship Models
• Probabilistic Relationships
Example 12 (Relationship existence)
A database contains academic papers and citations for a subset of those papers. Using the citations we have, we model how the topics of two papers influence whether one paper cites the other.
Example 13
Modifying Example 12, we now know that the database was constructed such that it contains at most ten citations from the bibliography of any paper.
We are uncertain about the citations of papers whose citations have not been recorded. To model this uncertainty, we use a DAPER model in which Cites is a Full relationship class with attribute class Cites.Exists, where Cites(pcg, pcd).Exists is true when paper pcg cites paper pcd. In addition, to model how the topics of two papers influence this existence, we add the attribute class Paper.Topic and the arc classes shown in the figure.
[Figure: citation example]
(a) An ER model
(b) A DAPER model for the situation where citations are uncertain: entity class Paper, copied as Paper (Citing) and Paper (Cited), each with attribute class Topic; Full relationship class Cites with attribute class Exists and label Cites(pcg, pcd); constraints p[T]=pcg[E] and p[T]=pcd[E].
(c) A DAPER model for the situation where citations are limited to ten per paper: adds the attribute class <=10 with constraint pcg[E]=p[<=10].
With respect to Figure (b), we have added a binary attribute class Paper.<=10. The double oval associated with this attribute class indicates that this attribute expands to deterministic attributes in a ground graph. In particular, a ground graph attribute p.<=10 will have parents Cites(pcg, pcd).Exists (with pcg = p), for all pcd, and will be true exactly when ten or fewer of these parents are true. To encode the restriction, we set p.<=10 to true for every p when performing inference in the ground graph.
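The deterministic attribute is just a function of its parents. A minimal sketch, assuming the existence attributes are stored in a dictionary keyed by (citing, cited) pairs; the function name `at_most_ten` and the paper identifiers are hypothetical:

```python
def at_most_ten(citing, papers, exists):
    """Deterministic attribute p.<=10: true iff ten or fewer of its parents are true.

    The parents are Cites(citing, cited).Exists for every paper `cited`;
    pairs missing from `exists` are treated as false.
    """
    n_citations = sum(exists.get((citing, cited), False) for cited in papers)
    return n_citations <= 10


papers = [f"paper{i}" for i in range(13)]

# paper0 cites paper1 .. paper10: exactly ten citations.
exists = {("paper0", f"paper{i}"): True for i in range(1, 11)}
within_limit = at_most_ten("paper0", papers, exists)

# An eleventh citation violates the restriction.
exists[("paper0", "paper11")] = True
over_limit = at_most_ten("paper0", papers, exists)
```

Setting p.<=10 to true during inference then amounts to conditioning on this function returning true, which rules out any configuration of Exists attributes with more than ten citations from one paper.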
Summary
• ER model by example
• definitions for the DAPER model, plate model, and PRM
• DAPER models examined in detail
  - restricted relationships
  - self relationships
  - probabilistic relationships
Thank you very much! We thank David Heckerman, Christopher Meek, and Daphne Koller for this paper and useful comments.