
Probabilistic Databases

Amol Deshpande, University of Maryland


Presentation Transcript


  1. Probabilistic Databases • Amol Deshpande, University of Maryland

  2. Overview • V.S. Subrahmanian: ProbView, PXML, Temporal Probabilistic Databases, Probabilistic Aggregates • Lise Getoor: Statistical Relational Learning, Probabilistic Relational Models, Entity Resolution • Amol: MauveDB (Statistical Modeling in Databases), Correlated tuples in probabilistic databases

  3. Overview of Today’s Presentation • Model-based Views/MauveDB [Amol] • Statistical Relational Learning [Lise] • Representing arbitrarily correlated data and processing queries over it [Prithviraj]

  4. Overview of Today’s Presentation • Model-based Views/MauveDB [Amol]: the goal is making it easy to continuously apply statistical models to streaming data; current focus is on designing declarative interfaces and on efficient maintenance algorithms, less on the “probabilistic databases” issues • Statistical Relational Learning [Lise] • Representing arbitrarily correlated data and processing queries over it [Prithviraj]

  5. Motivation • Unprecedented, and rapidly increasing, instrumentation of our everyday world • Huge data volumes generated continuously that must be processed in real time • Typically imprecise, unreliable and incomplete data • Measurement noise, low success rates, failures, etc. • [Figure labels: Wireless sensor networks; Distributed measurement networks (e.g. GPS); RFID; Industrial monitoring]

  6. Data Processing Step 1 • Process data using a statistical/probabilistic model • Regression and interpolation models: to eliminate spatial or temporal biases, handle missing data, make predictions • Filtering techniques (e.g. Kalman Filters), Bayesian Networks: to eliminate measurement noise, to infer hidden variables, etc. • [Figure labels: Temperature monitoring; GPS data; Kalman filters and regression/interpolation models]
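
As a rough illustration of the filtering step named on this slide, here is a minimal one-dimensional Kalman filter in Python. It is a sketch, not MauveDB code: the function name `kalman_1d`, the constant-state assumption, and the two variance parameters are all assumptions chosen for the example.

```python
from typing import List


def kalman_1d(readings: List[float], process_var: float = 1e-3,
              meas_var: float = 0.25) -> List[float]:
    """Smooth a scalar series with a constant-state Kalman filter.

    process_var and meas_var are illustrative noise variances (assumptions).
    """
    if not readings:
        return []
    estimate = readings[0]   # initial state: trust the first reading
    variance = meas_var      # initial uncertainty of that state
    smoothed = [estimate]
    for z in readings[1:]:
        # Predict: the state is assumed constant, so only uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + meas_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed


# Hypothetical noisy temperature readings.
noisy_temps = [20.0, 20.6, 19.5, 20.3, 19.8, 20.4]
print(kalman_1d(noisy_temps))
```

Each smoothed value is a weighted average of the previous estimate and the new reading, so the output stays within the range of the inputs while varying less than they do.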

  7. A Motivating Example • Inferring “transportation mode”/“activities” [Henry Kautz et al] • Using easily obtainable sensor data, e.g. GPS, RFID proximity data • Can do much if we can infer these automatically • Have access to noisy “GPS” data • Infer the transportation mode: walking, running, in a car, in a bus • [Figure labels: home; office]

  8. Motivating Example • Inferring “transportation mode”/“activities” [Henry Kautz et al] • Using easily obtainable sensor data, e.g. GPS, RFID proximity data • Can do much if we can infer these automatically • Preferred end result: clean path annotated with transportation mode • [Figure labels: home; office]

  9. Dynamic Bayesian Network • Use a “generative model” for describing how the observations were generated • Need conditional probability distributions, e.g. a distribution on (velocity, location) given the transportation mode • Prior knowledge or learned from data • [Figure: network at Time = t with nodes Mt (transportation mode: walking, running, car, bus), Xt (true velocity and location), Ot (observed location)]

  10. Dynamic Bayesian Network • Use a “generative model” for describing how the observations were generated • [Figure: network unrolled over Time = t and Time = t+1 with nodes Mt, Mt+1 (transportation mode: walking, running, car, bus), Xt, Xt+1 (true velocity and location), Ot, Ot+1 (observed location)]

  11. Dynamic Bayesian Network • Given a sequence of observations (Ot), find the most likely Mt’s that explain it • Or could provide a probability distribution on the possible Mt’s • [Figure: network unrolled over Time = t and Time = t+1 with nodes Mt, Mt+1 (transportation mode: walking, running, car, bus), Xt, Xt+1 (true velocity and location), Ot, Ot+1 (observed location)]
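
The "most likely Mt’s" task on this slide can be sketched with the Viterbi algorithm on a simplified model. The sketch below assumes a plain hidden Markov model rather than the full DBN: the continuous velocity/location state is dropped, observations are discretized into hypothetical speed buckets ("slow", "medium", "fast"), and every probability in `PRIOR`, `TRANS` and `EMIT` is a made-up illustrative value, not one learned from data.

```python
from typing import Dict, List

MODES = ["walking", "running", "car", "bus"]

# All numbers below are illustrative assumptions, not learned values.
PRIOR = {m: 0.25 for m in MODES}
STAY = 0.85  # assumed probability of keeping the same mode at the next step
TRANS = {a: {b: STAY if a == b else (1 - STAY) / 3 for b in MODES}
         for a in MODES}
EMIT = {
    "walking": {"slow": 0.80, "medium": 0.18, "fast": 0.02},
    "running": {"slow": 0.20, "medium": 0.70, "fast": 0.10},
    "car":     {"slow": 0.10, "medium": 0.20, "fast": 0.70},
    "bus":     {"slow": 0.25, "medium": 0.35, "fast": 0.40},
}


def viterbi(observations: List[str]) -> List[str]:
    """Return the most likely mode sequence for the observed speed buckets."""
    if not observations:
        return []
    # best[m] = probability of the best path so far that ends in mode m
    best = {m: PRIOR[m] * EMIT[m][observations[0]] for m in MODES}
    back: List[Dict[str, str]] = []
    for obs in observations[1:]:
        step, pointers = {}, {}
        for m in MODES:
            # Best predecessor mode for arriving in m at this time step.
            prev = max(MODES, key=lambda p: best[p] * TRANS[p][m])
            step[m] = best[prev] * TRANS[prev][m] * EMIT[m][obs]
            pointers[m] = prev
        best = step
        back.append(pointers)
    # Trace back from the most likely final mode.
    path = [max(MODES, key=lambda m: best[m])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


print(viterbi(["slow", "slow", "slow", "fast", "fast", "fast"]))
```

Returning a distribution over the possible modes instead, the alternative the slide mentions, would replace the `max` over predecessors with a sum (the forward algorithm) followed by normalization.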

  12. Statistical Modeling of Sensor Data • No support in database systems, so the database ends up being used as a backing store • With much replication of functionality • Very inefficient, not declarative… • How can we push statistical modeling inside a database system?

  13. Abstraction: Model-based Views • An abstraction analogous to traditional database views • Present the output of applying the model as a database view • That the user can query as with normal database views

  14. Example DBN View • User view of the data: smoothed locations, inferred variables • e.g. select count(*) group by mode sliding window 5 minutes • Application of the model/inference is pushed inside the database • Opens up many optimization opportunities, e.g. can do inference lazily when queried • [Figure labels: User; DBN view; Original noisy GPS data]
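
To make the "inference lazily when queried" idea concrete, here is a toy model-based view in Python. It is only a sketch under stated assumptions: the class `ModelBasedView`, its methods, and the `threshold_model` stand-in for DBN inference are hypothetical names invented for this example and are not MauveDB's interface.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Reading = Tuple[float, float]  # (timestamp in seconds, observed speed in m/s)


class ModelBasedView:
    """Toy model-based view: raw readings in, inferred modes out, lazily."""

    def __init__(self, infer: Callable[[List[Reading]], List[str]]):
        self._infer = infer
        self._raw: List[Reading] = []
        self._cache: Optional[List[str]] = None  # None means "stale"
        self.inference_runs = 0

    def insert(self, reading: Reading) -> None:
        self._raw.append(reading)
        self._cache = None  # invalidate only; no inference on insert

    def _modes(self) -> List[str]:
        if self._cache is None:
            self._cache = self._infer(self._raw)
            self.inference_runs += 1
        return self._cache

    def count_by_mode(self, now: float, window_s: float = 300.0) -> Dict[str, int]:
        """Roughly: select count(*) group by mode, over the last window_s seconds."""
        modes = self._modes()
        return dict(Counter(
            mode for (ts, _), mode in zip(self._raw, modes)
            if now - window_s < ts <= now
        ))


def threshold_model(readings: List[Reading]) -> List[str]:
    # Hypothetical stand-in for DBN inference: classify by speed alone.
    return ["walking" if speed < 2.5 else "car" for _, speed in readings]


view = ModelBasedView(threshold_model)
for ts, speed in [(0, 1.2), (60, 1.4), (200, 12.0), (400, 13.5)]:
    view.insert((ts, speed))
print(view.count_by_mode(now=400))
```

Four inserts trigger no inference at all; the model runs once, on the first query, and later queries reuse the cached result until new data arrives. A real system would presumably update inference incrementally rather than rerun it over all readings.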

  15. Correlations • Strong and complex correlations across tuples • Mutual exclusivity • Temporal correlations
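
Why mutual exclusivity matters can be seen by enumerating possible worlds for two candidate tuples at the same timestamp. The sketch below is illustrative only: the tuple names, their probabilities, and the helper functions are assumptions made for the example.

```python
from itertools import product

# Two candidate tuples for the same timestamp; probabilities are illustrative.
tuples = {"walking": 0.6, "car": 0.4}


def worlds_independent(probs):
    """Possible worlds if every tuple were present or absent independently."""
    names = list(probs)
    for present in product([True, False], repeat=len(names)):
        p = 1.0
        for name, here in zip(names, present):
            p *= probs[name] if here else 1.0 - probs[name]
        yield {n for n, here in zip(names, present) if here}, p


def worlds_exclusive(probs):
    """Possible worlds if at most one tuple can be present."""
    for name, p in probs.items():
        yield {name}, p
    leftover = 1.0 - sum(probs.values())
    if leftover > 1e-12:
        yield set(), leftover  # world where none of the tuples is present


def prob(worlds, predicate):
    """Total probability of the worlds that satisfy the predicate."""
    return sum(p for world, p in worlds if predicate(world))


def both_present(world):
    return {"walking", "car"} <= world


print(prob(worlds_independent(tuples), both_present))  # nonzero under independence
print(prob(worlds_exclusive(tuples), both_present))    # zero under mutual exclusivity
```

Treating the two tuples as independent assigns positive probability to a world in which the user is walking and in a car at the same instant, which is exactly the kind of error that tracking correlations across tuples avoids.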

  16. MauveDB: Status • Implemented in the Apache Derby open source Java database system • Support for regression- and interpolation-based views • Neither produces probabilistic data • SIGMOD 2006 (w/ Sam Madden) • Currently building support for views based on Dynamic Bayesian networks [Bhargav] • Kalman Filters, HMMs etc. • Initial focus on the user interfaces and efficient inference • Will generate probabilistic data; may not be able to do anything too sophisticated with it

  17. Research Challenges/Future Work • Generalizing to arbitrary models? • Develop APIs for adding arbitrary models • Try to minimize the work of the model developer • Probabilistic databases • Uncertain data with complex correlation patterns • Query processing, query optimization • View maintenance in presence of high-rate measurement streams

  18. Thanks!! • Mauve == Model-based User Views
