PXML: A Probabilistic Semistructured Data Model and Algebra. Edward Hung, Lise Getoor, V.S. Subrahmanian University of Maryland, College Park ICDE, Bangalore, India, Mar 2003. Outline. Motivating example Semistructured data model PXML data model Semantics Algebra
Outline • Motivating example • Semistructured data model • PXML data model • Semantics • Algebra • Probabilistic point query • Related work
Motivating Example • Surveillance applications monitor a region of the battlefield • An image processing system identifies vehicles in convoys that appear in the region at different times • Convoys • timestamp • tanks, trucks, etc. • Uncertainty • Number of vehicles • Category and identity of a vehicle, e.g., is it a tank? A T-72?
Motivating Example • Semistructured data model • General hierarchical structure is known. • The schema is not fixed • Number of vehicles • Properties of vehicles • Our work: store uncertain information in probabilistic environments.
Semistructured Data Model • Example
PXML Data Model • Uncertainty • Existence of sub-objects • Number of sub-objects • Identity of the sub-objects
PXML Data Model (Cardinality) • Example of cardinality • card(convoy2, ts) = [1,1] • card(convoy2, truck) = [1,2] • Weak instance W = semistructured instance + card
PXML Data Model (Weak Instance) • Example of a weak instance W • card(S1, convoy) = [2,2] • card(convoy1, ts) = [1,1] • card(convoy1, truck) = [1,1] • card(convoy1, tank) = [1,1] • card(convoy2, ts) = [1,1] • card(convoy2, truck) = [1,2]
PXML Data Model • Example of an instance compatible with W • card(S1, convoy) = [2,2] • card(convoy1, ts) = [1,1] • card(convoy1, truck) = [1,1] • card(convoy1, tank) = [1,1] • card(convoy2, ts) = [1,1] • card(convoy2, truck) = [1,2]
D(W)= the set of all semistructured instances compatible with the weak instance W
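Membership in D(W) can be checked object by object: for each non-leaf object and label, the number of children present must fall inside the card interval, and no child may appear that W does not allow. The sketch below is a simplified illustration, not the authors' definition or code. The dictionary layout and the names `lch`, `card`, and `is_compatible` are assumptions made for this example, and only the convoy2 fragment of W is encoded.

```python
# Weak-instance fragment for convoy2 (object names as used on the slides).
# lch: (object, label) -> children that W allows under that label.
lch = {("convoy2", "ts"): {"ts2"},
       ("convoy2", "truck"): {"truck3", "truck4"}}
# card: (object, label) -> (min, max) number of children with that label.
card = {("convoy2", "ts"): (1, 1),
        ("convoy2", "truck"): (1, 2)}

def is_compatible(children, lch, card):
    """children maps each non-leaf object to the set of children it actually has."""
    for (obj, label), allowed in lch.items():
        if obj not in children:
            continue  # object absent from this instance: nothing to check
        lo, hi = card[(obj, label)]
        if not lo <= len(children[obj] & allowed) <= hi:
            return False
    # No child may appear that W does not allow for its parent.
    for obj, kids in children.items():
        allowed_all = set().union(*(s for (o, _), s in lch.items() if o == obj))
        if not kids <= allowed_all:
            return False
    return True

print(is_compatible({"convoy2": {"ts2", "truck3"}}, lch, card))
print(is_compatible({"convoy2": {"truck3", "truck4"}}, lch, card))
```

The second call fails the check because convoy2 has no ts child, which violates card(convoy2, ts) = [1,1].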
Potential child sets of convoy2 • card(convoy2, ts) = [1,1] • card(convoy2, truck) = [1,2] • PC(convoy2) = {{ts2, truck3, truck4}, {ts2, truck3}, {ts2, truck4}}
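The potential child sets follow mechanically from the cardinality intervals: pick an allowed number of children for each label, then take the union across labels. A minimal sketch, assuming a per-object dictionary encoding (the function name `potential_child_sets` and its arguments are illustrative, not from the paper):

```python
from itertools import combinations, product

def potential_child_sets(label_children, card):
    """label_children: label -> possible children; card: label -> (min, max)."""
    per_label = []
    for label, kids in sorted(label_children.items()):
        lo, hi = card[label]
        # Every subset of this label's children whose size lies in [lo, hi].
        choices = [frozenset(c)
                   for k in range(lo, hi + 1)
                   for c in combinations(sorted(kids), k)]
        per_label.append(choices)
    # One choice per label, unioned into a single child set.
    return {frozenset().union(*combo) for combo in product(*per_label)}

pc_convoy2 = potential_child_sets(
    {"ts": {"ts2"}, "truck": {"truck3", "truck4"}},
    {"ts": (1, 1), "truck": (1, 2)},
)
for child_set in sorted(pc_convoy2, key=sorted):
    print(sorted(child_set))
```

For convoy2 this yields the three sets listed on the slide.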
Object probability function (OPF) • card(convoy2, ts) = [1,1] • card(convoy2, truck) = [1,2] • The OPF for convoy2 w.r.t. W is a mapping wconvoy2: PC(convoy2) → [0,1] whose values sum to 1, e.g. • wconvoy2({ts2, truck3, truck4}) = 0.2 • wconvoy2({ts2, truck3}) = 0.5 • wconvoy2({ts2, truck4}) = 0.3
Semantics (Local Interpretation) • Interpretation • Local interpretation, p • a mapping from the set of non-leaf objects to OPFs • Example • p(convoy2) = wconvoy2
Semantics (Local Interpretation) • Here the OPF assigns a probability to each possible set of children. • Further independence assumptions can make the representation more compact • e.g., independence between trucks and tanks. • e.g., all trucks are indistinguishable.
Semantics (Global Interpretation) • So far, probabilities were assigned locally, to the possible child sets of each non-leaf object. • Now we assign a probability to each compatible instance globally.
Semantics (Global Interpretation) • Interpretation • Global interpretation, P • a mapping from D(W) (the set of semistructured instances compatible with W) to [0,1] such that the probabilities of all instances in D(W) sum to 1
Example global interpretation over the instances S1a to S1f in D(W) • P(S1a) = 0.12 • P(S1b) = 0.08 • P(S1c) = 0.2 • P(S1d) = 0.18 • P(S1e) = 0.12 • P(S1f) = 0.3
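One way to read these numbers is as products of local OPF values: an instance's probability is the product, over its non-leaf objects, of the probability of the child set each object actually has. The sketch below is an illustration under stated assumptions, not the authors' conversion operator. The convoy2 OPF is the one given earlier in the talk. The convoy1 OPF (a 0.4 / 0.6 split over two hypothetical child sets) is an assumption chosen because it reproduces the six values above. The code also assumes a tree-shaped instance, so objects with shared children are not handled.

```python
from fractions import Fraction as F
from itertools import product

opf = {
    # card(S1, convoy) = [2,2] forces both convoys to be present.
    "S1": {frozenset({"convoy1", "convoy2"}): F(1)},
    # Hypothetical OPF for convoy1 (child names and the 0.4 / 0.6 split are assumed).
    "convoy1": {frozenset({"ts1", "truck1", "tank1"}): F(2, 5),
                frozenset({"ts1", "truck1", "tank2"}): F(3, 5)},
    # OPF for convoy2 as given on the OPF slide.
    "convoy2": {frozenset({"ts2", "truck3", "truck4"}): F(1, 5),
                frozenset({"ts2", "truck3"}): F(1, 2),
                frozenset({"ts2", "truck4"}): F(3, 10)},
}

def enumerate_instances(obj):
    """Return (edges, probability) for every sub-instance rooted at obj."""
    if obj not in opf:                      # leaf object: one trivial sub-instance
        return [({}, F(1))]
    out = []
    for children, p in opf[obj].items():
        sub = [enumerate_instances(c) for c in sorted(children)]
        for combo in product(*sub):         # independent choices below each child
            edges, prob = {obj: children}, p
            for sub_edges, sub_p in combo:
                edges.update(sub_edges)
                prob *= sub_p
            out.append((edges, prob))
    return out

global_probs = sorted(p for _, p in enumerate_instances("S1"))
print([float(p) for p in global_probs])
```

Exact fractions are used so that the products and their sum can be compared without floating-point noise.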
Semantics (Local ↔ Global) • We have defined operators to convert between local and global interpretations. • Theorems (Reversibility) • The conversions from local to global interpretation and from global to local interpretation are correct. • The conversion between local and global interpretations is reversible.
Algebra • Operators • Projection • Selection • Cross-product • Path expression • o.l1.l2…ln, e.g., S1.convoy.truck
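A path expression o.l1.l2…ln starts at object o and follows edges labeled l1, l2, and so on, collecting every object reached. A minimal sketch on a plain (non-probabilistic) instance follows. The `edges` dictionary and the function `eval_path` are hypothetical names for this example, and the instance is made up, loosely following the convoy example.

```python
# Hypothetical semistructured instance: (parent, label) -> children.
edges = {
    ("S1", "convoy"): ["convoy1", "convoy2"],
    ("convoy1", "truck"): ["truck1"],
    ("convoy1", "tank"): ["tank1"],
    ("convoy2", "truck"): ["truck3", "truck4"],
}

def eval_path(edges, path):
    """Return the set of objects reached by a path such as 'S1.convoy.truck'."""
    root, *labels = path.split(".")
    frontier = {root}
    for label in labels:
        # Follow every edge with this label out of the current frontier.
        frontier = {child
                    for obj in frontier
                    for child in edges.get((obj, label), [])}
    return frontier

print(sorted(eval_path(edges, "S1.convoy.truck")))
```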
Algebra (Projection) • Ancestor projection • Descendant projection • Single projection
Algebra (Projection) • Ancestor projection on a semistructured instance
Algebra (Projection) • Ancestor projection, viewed globally
Algebra (Projection) Probabilistic Instance • Ancestor projection • Before • card(I2, convoy) = [1,1], card(convoy1, ts) = [1,1], card(convoy1, truck) = [1,1], card(convoy1, tank) = [1,1] • p(I2)({convoy1}) = 0.8 • p(convoy1)({ts1, truck1, tank1}) = 0 • p(convoy1)({ts1, truck1, tank2}) = 0.1 • p(convoy1)({ts1, truck2, tank1}) = 0.3 • p(convoy1)({ts1, truck2, tank2}) = 0.6 • After • card(I2, convoy) = [1,1], card(convoy1, truck) = [1,1] • After normalization, p(I2)({convoy1}) = 1 • Children of convoy1 before: CI2(convoy1) = {ts1, truck1, truck2, tank1, tank2} • Children of convoy1 after: CI2'(convoy1) = {truck1, truck2} • PC'(convoy1) = {{truck1}, {truck2}} • For {truck1}: p(convoy1)({truck1}) = 0 + 0.1 = 0.1 • For {truck2}: p(convoy1)({truck2}) = 0.3 + 0.6 = 0.9 • After normalization, p(convoy1)({truck1}) = 0.1 and p(convoy1)({truck2}) = 0.9
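The OPF update in this example amounts to marginalizing each child set onto the objects that survive the projection, then renormalizing. The sketch below reproduces the slide's arithmetic under two assumptions that are not stated in the source: child sets that lose all their members are dropped before normalization, and the remaining 0.2 of p(I2) sits on a second convoy (called `convoy2` here) that is not retained. The function name `project_opf` is likewise illustrative.

```python
from collections import defaultdict
from fractions import Fraction as F

def project_opf(opf, kept):
    """Marginalize an OPF onto the kept objects, then renormalize."""
    merged = defaultdict(F)                 # Fraction() is 0
    for child_set, p in opf.items():
        remaining = child_set & kept
        if remaining:                       # assumption: emptied child sets are dropped
            merged[remaining] += p
    total = sum(merged.values())
    return {c: p / total for c, p in merged.items()}

p_convoy1 = {
    frozenset({"ts1", "truck1", "tank1"}): F(0),
    frozenset({"ts1", "truck1", "tank2"}): F(1, 10),
    frozenset({"ts1", "truck2", "tank1"}): F(3, 10),
    frozenset({"ts1", "truck2", "tank2"}): F(3, 5),
}
# The convoy2 entry is hypothetical; the slide only gives p(I2)({convoy1}) = 0.8.
p_I2 = {frozenset({"convoy1"}): F(4, 5), frozenset({"convoy2"}): F(1, 5)}

kept = {"convoy1", "truck1", "truck2"}      # selected trucks plus their ancestor
new_convoy1 = project_opf(p_convoy1, kept)
new_I2 = project_opf(p_I2, kept)
print({tuple(sorted(c)): float(p) for c, p in new_convoy1.items()})
print({tuple(sorted(c)): float(p) for c, p in new_I2.items()})
```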
Ancestor Projection • Experiments • Running time is linear in the number of objects (selected objects and their ancestors) • Time to update the OPF entries of an object o is sub-quadratic in the number of OPF entries
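Algebra (Selection) • card(I7, convoy) = [1,2]; wI7({convoy1}) = 0.2, wI7({convoy2}) = 0.5, wI7({convoy1, convoy2}) = 0.3 • card(convoy1, tank) = [1,1]; wconvoy1({tank1}) = 0.3, wconvoy1({tank2}) = 0.7 • card(convoy2, tank) = [1,1]; wconvoy2({tank2}) = 0.4, wconvoy2({tank3}) = 0.6 • Probabilities of the instances in D(I7): 0.06, 0.14, 0.2, 0.3, 0.036, 0.054, 0.084, 0.126 • Total probability of the selected instances: 0.14 + 0.2 + 0.036 + 0.084 + 0.126 = 0.586 • After normalization, the selected instances have probabilities 0.14 / 0.586, 0.2 / 0.586, 0.036 / 0.586, 0.084 / 0.586, 0.126 / 0.586 • The instances with probabilities 0.06, 0.054, and 0.3 are not selected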
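The selection example can be replayed by enumerating D(I7), keeping the instances that satisfy the condition, and dividing by their total probability. The selection condition itself did not survive on the slide. The sketch assumes it is "tank2 is present", which reproduces the 0.586 total. The helper names `worlds` and `select` are illustrative only.

```python
from fractions import Fraction as F
from itertools import product

w_I7 = {("convoy1",): F(1, 5),
        ("convoy2",): F(1, 2),
        ("convoy1", "convoy2"): F(3, 10)}
w_tank = {"convoy1": {"tank1": F(3, 10), "tank2": F(7, 10)},
          "convoy2": {"tank2": F(2, 5), "tank3": F(3, 5)}}

def worlds():
    """Enumerate D(I7) as (convoys, tank chosen per convoy, probability)."""
    out = []
    for convoys, p in w_I7.items():
        for tanks in product(*(w_tank[c].items() for c in convoys)):
            prob = p
            for _, q in tanks:
                prob *= q
            out.append((convoys, tuple(t for t, _ in tanks), prob))
    return out

def select(instances, condition):
    """Keep the instances satisfying condition and renormalize them."""
    kept = [w for w in instances if condition(w)]
    total = sum(p for *_, p in kept)
    return total, [(c, t, p / total) for c, t, p in kept]

# Assumed condition: tank2 occurs somewhere in the instance.
total, selected = select(worlds(), lambda w: "tank2" in w[1])
print(float(total))
for convoys, tanks, p in selected:
    print(convoys, tanks, p)
```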
Algebra (Cross product (x)) • I4: card(I4, truck) = [1,1], p(I4)({truck1}) = 0.2, p(I4)({truck2}) = 0.8 • I5: card(I5, tank) = [1,1], p(I5)({tank1}) = 0.1, p(I5)({tank2}) = 0.9 • I6 = I4 x I5: card(I6, truck) = [1,1], card(I6, tank) = [1,1] • p(I6)({truck1, tank1}) = 0.2*0.1 = 0.02 • p(I6)({truck1, tank2}) = 0.2*0.9 = 0.18 • p(I6)({truck2, tank1}) = 0.8*0.1 = 0.08 • p(I6)({truck2, tank2}) = 0.8*0.9 = 0.72
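At the merged root, each child set of I6 is the union of one child set from I4 and one from I5, and the probabilities multiply because the two choices are treated as independent. A minimal sketch of just this root-level step (the function name `cross_root_opf` is hypothetical, and the code assumes the two roots have disjoint children, as in the example):

```python
from fractions import Fraction as F

def cross_root_opf(p1, p2):
    """OPF of the merged root: union the child sets, multiply the probabilities."""
    return {c1 | c2: q1 * q2
            for c1, q1 in p1.items()
            for c2, q2 in p2.items()}

p_I4 = {frozenset({"truck1"}): F(1, 5), frozenset({"truck2"}): F(4, 5)}
p_I5 = {frozenset({"tank1"}): F(1, 10), frozenset({"tank2"}): F(9, 10)}
p_I6 = cross_root_opf(p_I4, p_I5)

for child_set, p in sorted(p_I6.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(child_set), float(p))
```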
Probabilistic Point Query • card(I7, convoy) = [1,2]; wI7({convoy1}) = 0.2, wI7({convoy2}) = 0.5, wI7({convoy1, convoy2}) = 0.3 • card(convoy1, tank) = [1,1]; wconvoy1({tank1}) = 0.3, wconvoy1({tank2}) = 0.7 • card(convoy2, tank) = [1,1]; wconvoy2({tank2}) = 0.4, wconvoy2({tank3}) = 0.6 • Probabilities of the instances in D(I7): 0.06, 0.14, 0.2, 0.3, 0.036, 0.054, 0.084, 0.126 • Summing over the instances that satisfy the query: 0.14 + 0.2 + 0.036 + 0.084 + 0.126 = 0.586 • Computed locally, without enumerating D(I7): 0.2*0.7 + 0.5*0.4 + 0.3*(1-(1-0.7)*(1-0.4)) = 0.14 + 0.2 + 0.246 = 0.586
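The local computation conditions on which convoys are present: for each child set of I7, the target is missing only if every convoy in that set misses it. The sketch below follows the arithmetic on the slide. It assumes the query asks for the probability that tank2 appears under some convoy of I7, and the function name `prob_exists` is illustrative.

```python
from fractions import Fraction as F

w_I7 = {frozenset({"convoy1"}): F(1, 5),
        frozenset({"convoy2"}): F(1, 2),
        frozenset({"convoy1", "convoy2"}): F(3, 10)}
# Probability that each convoy has tank2 as its tank child (from the convoy OPFs).
p_tank2 = {"convoy1": F(7, 10), "convoy2": F(2, 5)}

def prob_exists(w_root, p_in_child):
    """P(the target is a child of at least one convoy present in the instance)."""
    total = F(0)
    for convoys, p in w_root.items():
        miss = F(1)
        for c in convoys:
            miss *= 1 - p_in_child[c]       # convoy c does not contain the target
        total += p * (1 - miss)
    return total

answer = prob_exists(w_I7, p_tank2)
print(float(answer))
```

This visits each OPF entry of I7 once, whereas the global computation enumerates all eight instances in D(I7).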
Related Work • A companion paper on the interval-probability version appears in ICDT 2003: • Semantics • Interpretations • Satisfaction • Consistency • Query and r-answer (objects satisfying the query with minimal probability no less than r)
Related Work • Semistructured Probabilistic Objects (SPOs) (Dekhtyar, Goldsmith, Hawkes, in SSDBM, 2001) • SPO: express contexts (not random variables) in a semistructured manner • PXML data model stores XML data AND probabilistic information.
Related Work • ProTDB (Nierman, Jagadish, in VLDB, 2002) • Independent probabilities assigned to each child (ProTDB) vs. arbitrary distributions over sets of children (PXML) • Tree-structured (ProTDB) vs. arbitrary acyclic (PXML) • Our model theory provides two formal semantics • We propose a set of algebraic operators and a probabilistic point query
Questions and Answers Thank you very much!
Future Work • System implementation • Query optimization
Summary • PXML data model • Semistructured instance • Weak instance (add cardinality) • Probabilistic instance (add OPF) • Semantics • Local and Global • Interpretation • Satisfaction
Related Work • ProTDB (Nierman, Jagadish, in VLDB, 2002) • Point probabilities VS interval probabilities • Independent probabilities assigned to each child VS arbitrary distributions over sets of children • Tree-structured VS arbitrary acyclic • Our model theory provides two formal semantics • Differences in their queries and our algebra and query.
Summary • PXML data model • Semistructured instance • Weak instance (add cardinality) • Probabilistic instance (add OPF) • Semantics • Local and Global • Interpretation • Satisfaction • Algebra • Projections, selection, cross product
Algebra (Projection) • Equivalence • e1 and e2 each denote a sequence of zero or more edges. Thus, I.e1.lm can match I.lm, I.l1.lm, I.l2.l3.lm, etc.
Algebra (Cross product) • Equivalence: the following are equivalent • (I1 x I2) x I3 • I1 x (I2 x I3) • (I1 x I3) x I2
Related Work • Bayesian net (Pearl, 1988) • random variables (probability of events) • ours: existence of children requires existence of parents