Join the Computer Science Department at Rutgers University for an engaging seminar series focused on Probabilistic Databases and other advanced data management topics. This seminar meets weekly on Thursdays from 1-2:30 PM in CoRE A. Each session features student presentations on selected topics, including overview and in-depth research papers. Participants will engage in discussions and explore practical applications and techniques in various areas, such as query optimization, data cleaning, and integration. Enhance your understanding of complex data management issues with expert insights and peer collaboration.
Readings in Data Management • Spring 2008 • Computer Science Department, Rutgers University
Seminar Information • Web page: http://www.cs.rutgers.edu/~amelie/courses/dbseminar.html • Meets Thursday 1-2:30pm in CoRE A
Organization • Weekly presentation on a DB topic (30 minutes) • We will select 2-3 topics to focus on over the course of the semester • For each topic • First week: overview paper (survey, influential work) • Subsequent weeks: more complex papers on the subject • Possibly a few external presentations such as: • Students preparing for DB conference talks or quals • Invited speakers • Discussion on the paper
Topics • First Topic: Probabilistic Databases • We will select the next topics from (non-exhaustive list): • Question answering • Web Search • Personal Information Spaces • Query Optimization • Data Cleaning • Data Integration • Data Mining • Query Processing Techniques • Adaptive, Automatic, Autonomic Systems • OLAP • Stream Aggregation • Storage, Indexing, and System Architecture • XML Processing • Preference functions • Spatial and High-Dimensional Data • Recovery • Privacy in DBMS • …
What I expect from you • 1-2 presentations over the course of the semester • First-year students will be given “overview” presentation assignments at the beginning of each topic • More senior students will present more research-focused papers • Number of presentations depends on the number of students in the seminar • Everyone should read the paper in advance and prepare 1-2 questions/discussion topics • Participation in discussion • There are no “stupid” questions! If you did not understand something, chances are others did not either
Presentations • I will select a list of papers to present for each topic • Start with an introductory paper • Then papers that go deeper into one or more aspects of the problem • You are welcome to suggest papers on the topic, as long as they are related (so that we can have more meaningful discussions) • Papers that I have overlooked • Papers on a different aspect of the topic that you would like to focus on
First topic: Probabilistic Databases • Uncertainty/Imprecision in data • Query Semantics • Probabilistic Data Representation Next few slides from Dan Suciu’s tutorial, more at
Databases Today are Deterministic • An item either is in the database or is not • A tuple either is in the query answer or is not • This applies to all varieties of data models: • Relational, E/R, NF2, hierarchical, XML, …
What is a Probabilistic Database? • “An item belongs to the database” is a probabilistic event • “A tuple is an answer to the query” is a probabilistic event • Can be extended to all data models
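To make the two probabilistic events above concrete, here is a minimal sketch in Python. It assumes a tuple-independent model (each tuple is present independently with its own probability), which is only one possible representation and is not specified on the slide. The table contents, the probabilities, and the names `TUPLES`, `possible_worlds`, `answer_probabilities`, and `cities` are all hypothetical. The sketch enumerates every possible world and sums the probabilities of the worlds in which a tuple appears in the query answer; this brute-force enumeration is exponential in the number of tuples and is meant for illustration only.

```python
from itertools import product

# Hypothetical tuple-independent table: each (name, city) row is present
# with the given probability, independently of the others (an assumption).
TUPLES = [
    (("alice", "NYC"), 0.9),
    (("bob", "NYC"), 0.5),
    (("carol", "LA"), 0.4),
]


def possible_worlds(tuples):
    """Yield (world, probability) for every subset of the uncertain tuples."""
    for mask in product([True, False], repeat=len(tuples)):
        world = []
        prob = 1.0
        for (row, p), present in zip(tuples, mask):
            if present:
                world.append(row)
                prob *= p
            else:
                prob *= 1.0 - p
        yield world, prob


def answer_probabilities(tuples, query):
    """Map each answer tuple to the total probability of the worlds containing it."""
    result = {}
    for world, prob in possible_worlds(tuples):
        for answer in set(query(world)):
            result[answer] = result.get(answer, 0.0) + prob
    return result


def cities(world):
    """Example query, roughly: SELECT DISTINCT city FROM world."""
    return {city for _, city in world}


if __name__ == "__main__":
    for answer, prob in sorted(answer_probabilities(TUPLES, cities).items()):
        print(answer, round(prob, 4))
```

Under these made-up numbers, "NYC" is an answer unless both NYC tuples are absent, so its probability is 1 - (0.1 × 0.5) = 0.95, while "LA" inherits the 0.4 of its single supporting tuple.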
Two Types of Probabilistic Data • Database is deterministic; query answers are probabilistic • Database is probabilistic; query answers are probabilistic
Long History • Probabilistic relational databases have been studied from the late 80’s until today: • Cavallo & Pittarelli: 1987 • Barbara, Garcia-Molina, Porter: 1992 • Lakshmanan, Leone, Ross & Subrahmanian: 1997 • Fuhr & Roellke: 1997 • Dalvi & Suciu: 2004 • Widom: 2005
So, Why Now? Application pull: • The need to manage imprecisions in data Technology push: • Advances in query processing techniques
Application Pull Need to manage imprecisions in data • Many types: non-matching data values, imprecise queries, inconsistent data, misaligned schemas, etc. The quest to manage imprecisions = major driving force in the database community • Ultimate cause for many research areas: data mining, semistructured data, schema matching, nearest neighbor
Technology Push Processing probabilistic data is fundamentally more complex than other data models • Some previous approaches sidestepped complexity There exists a rich collection of powerful, non-trivial techniques and results, some old, some very recent, that could lead to practical management techniques for probabilistic databases.
Suggested Papers to Discuss • Nilesh Dalvi, Dan Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB 2004. • Minos Garofalakis et al.: Probabilistic Data Management for Pervasive Computing: The Data Furnace Project. IEEE Data Eng. Bull. 29(1), 2006. • Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Jennifer Widom: An Introduction to ULDBs and the Trio System. IEEE Data Eng. Bull. 29(1), 2006. • Prithviraj Sen, Amol Deshpande: Representing and Querying Correlated Tuples in Probabilistic Databases. ICDE 2007.