
Data Quality Challenges in Community Systems


Presentation Transcript


  1. AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron Gao, Fei Chen, Yoonkyong Lee, Raghu Ramakrishnan, Jeff Naughton

  2. Numerous Web Communities • Academic domains • database researchers, bioinformaticians • Infotainment • movie fans, mountain climbers, fantasy football • Scientific data management • biomagnetic databank, E. coli community • Business • enterprise intranets, tech support groups, lawyers • CIA / homeland security • Intellipedia

  3. Much Effort to Build Community Portals • Initially taxonomy-based (e.g., Yahoo style) • But now many structured data portals • capture key entities and relationships of a community • No general solution yet on how to build such portals

  4. Cimple Project @ Wisconsin / Yahoo! Research • Develops such a general solution using extraction + integration + mass collaboration. [Architecture diagram: data sources (researcher homepages, conference pages, group pages, the DBworld mailing list, DBLP, Web pages) are crawled into text documents; extraction and integration build an ER graph (e.g., "Jim Gray" gives a talk at "SIGMOD-04"); the graph powers services such as keyword search, SQL querying, question answering, browsing, mining, alerts/monitoring, and news summaries; mass collaboration helps maintain the system and add more sources.]

  5. Prototype System: DBLife • Integrate data of the DB research community • 1,164 data sources, crawled daily • 11,000+ pages = 160+ MB / day

  6. Data Extraction

  7. Data Integration • e.g., Raghu Ramakrishnan: co-authors = A. Doan, Divesh Srivastava, ...

  8. Resulting ER Graph [Diagram: an ER graph in which the paper "Proactive Re-optimization" is connected by write edges to Pedro Bizarro, Shivnath Babu, and David DeWitt, with coauthor and advise edges among the researchers (including Jennifer Widom) and PC-member / PC-Chair edges to SIGMOD 2005.]

  9. Provide Services • DBLife system

  10. Mass Collaboration: Voting • A picture is removed if enough users vote "no".

  11. Mass Collaboration via Wiki

  12. Summary: Community Systems • Data integration systems + extraction + Web 2.0 • manage both data and users in a synergistic fashion • In sync with current trends • manage unstructured data (e.g., text, Web pages) • get more structure (IE, Semantic Web) • engage more people (Web 2.0) • best-effort data integration, dataspaces, pay-as-you-go • Numerous potential applications. But this raises many difficult data quality challenges.

  13. Rest of the Talk • Data quality challenges in 1. Source selection 2. Extraction and integration 3. Detecting problems and providing feedback 4. Mass collaboration • Conclusions & ways forward

  14. 1. Source Selection [Architecture diagram from slide 4, highlighting the "maintain and add more sources" component.]

  15. Current Solutions vs. Cimple • Current solutions • find all relevant data sources (e.g., using focused crawling, search engines) • maximize coverage • have a lot of noisy sources • Cimple • starts with a small set of high-quality "core" sources • incrementally adds more sources • only from "high-quality" places • or as suggested by users (mass collaboration)

  16. Start with a Small Set of "Core" Sources • Key observation: communities often follow an 80-20 rule • 20% of sources cover 80% of interesting activities • An initial portal over these 20% is often already quite useful • How to select these 20%? • select as many sources as possible • evaluate and select the most relevant ones

  17. Evaluate the Relevancy of Sources • Use PageRank + virtual links across entities + TF-IDF (see the sketch below). [Diagram: mentions such as "Gerhard Weikum" and "G. Weikum" on different pages create virtual links between their sources.] See [VLDB-07a]
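A minimal sketch, not the Cimple implementation, of the PageRank-plus-virtual-links idea: sources that mention the same entity get linked, so heavily interconnected community sources rank high. The example graph, the way virtual links are formed, and the damping factor are illustrative assumptions; the TF-IDF component is omitted here.

```python
from collections import defaultdict

def pagerank(edges, nodes, damping=0.85, iters=50):
    """Power-iteration PageRank over a directed edge list."""
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] if out[n] else list(nodes)  # dangling node: spread evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank

def virtual_links(mentions):
    """Link sources that mention the same entity, in both directions."""
    by_entity = defaultdict(set)
    for source, entity in mentions:
        by_entity[entity].add(source)
    return [(a, b) for srcs in by_entity.values() for a in srcs for b in srcs if a != b]

nodes = {"dblp", "dbworld", "weikum-homepage", "random-blog"}
hyperlinks = [("dbworld", "dblp"), ("weikum-homepage", "dblp")]
# "Gerhard Weikum" on his homepage and "G. Weikum" on DBLP create a virtual link
mentions = [("dblp", "Gerhard Weikum"), ("weikum-homepage", "Gerhard Weikum")]
scores = pagerank(hyperlinks + virtual_links(mentions), nodes)
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # dblp ranks highest
```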

  18. Add More Sources over Time • Key observation: most important sources will eventually be mentioned within the community • so monitor certain “community channels” to find them Message type: conf. ann. Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data Call for Participation Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007 http://mud.cs.utwente.nl ... • Also allow users to suggest new sources • e.g., the Silicon Valley Database Society
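A minimal sketch of monitoring a community channel such as DBworld for new sources: scan announcement messages for URLs not yet tracked and queue them as candidates. The message fields and the known-source set are assumptions for illustration.

```python
import re

URL_RE = re.compile(r"https?://[^\s<>\"']+")

def candidate_sources(messages, known_sources):
    """Collect URLs from community-channel messages that we don't track yet."""
    found = set()
    for msg in messages:
        for url in URL_RE.findall(msg["subject"] + " " + msg["body"]):
            url = url.rstrip(".,;)")          # strip trailing punctuation
            if url not in known_sources:
                found.add(url)
    return found

messages = [{
    "subject": "Call for Participation: VLDB Workshop on Management of Uncertain Data",
    "body": 'Workshop on "Management of Uncertain Data" with VLDB 2007 http://mud.cs.utwente.nl',
}]
print(candidate_sources(messages, known_sources={"http://dblp.uni-trier.de"}))
# {'http://mud.cs.utwente.nl'} -- a candidate to review (or to vet via mass collaboration)
```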

  19. Summary: Source Selection • Sharp contrast to current work • start with highly relevant sources • expand carefully • minimize “garbage in, garbage out” • Need a notion of source relevance • Need a way to compute this

  20. 2. Extraction and Integration [Architecture diagram from slide 4, highlighting the extraction and integration components.]

  21. Extracting Entity Mentions • Key idea: reasonable plan, then patch • Reasonable plan: • collect person names, e.g., David Smith • generate variations, e.g., D. Smith, Dr. Smith, etc. • find occurrences of these variations (see the sketch below) [Plan diagram: Union over sources s1 … sn, then ExtractMbyName.] Works well, but can't handle certain difficult spots.
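A minimal sketch of the "reasonable plan": expand each dictionary name into variations and scan the unioned pages for occurrences. The variation rules here are illustrative assumptions, not Cimple's actual rules.

```python
import re

def variations(full_name):
    """Generate simple name variations; real rules would be much richer."""
    first, last = full_name.split()[0], full_name.split()[-1]
    return {full_name, f"{first[0]}. {last}", f"Dr. {last}", f"{last}, {first[0]}."}

def extract_m_by_name(pages, dictionary):
    """ExtractMbyName over the union of pages s1 ... sn."""
    mentions = []
    for page_id, text in pages.items():
        for name in dictionary:
            for var in variations(name):
                for m in re.finditer(re.escape(var), text):
                    mentions.append((page_id, name, var, m.start()))
    return mentions

pages = {"s1": "Invited talk by D. Smith at SIGMOD-04."}
print(extract_m_by_name(pages, ["David Smith"]))
# [('s1', 'David Smith', 'D. Smith', 16)]
```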

  22. Handling Difficult Spots • Example • R. Miller, D. Smith, B. Jones • if "David Miller" is in the dictionary → will flag "Miller, D." as a person name • Solution: patch such spots with stricter plans (see the sketch below) [Plan diagram: Union over s1 … sn; FindPotentialNameLists routes name-list spots to ExtractMStrict, while ExtractMbyName handles the rest.]
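A minimal sketch of patching: detect likely name-list spots (comma-separated "Initial. Lastname" runs), apply a stricter extractor inside them, and let the relaxed extractor handle the remainder. The heuristic regexes and the toy relaxed/strict extractors are assumptions.

```python
import re

# Comma-separated "Initial. Lastname" runs suggest a reference-style name list
NAME_LIST_RE = re.compile(r"(?:[A-Z]\. [A-Z][a-z]+, )+[A-Z]\. [A-Z][a-z]+")

def extract_with_patches(text, relaxed, strict):
    """Apply the strict extractor inside name-list spans, the relaxed one elsewhere."""
    spans = [m.span() for m in NAME_LIST_RE.finditer(text)]
    mentions = []
    for start, end in spans:
        mentions += [("strict", m) for m in strict(text[start:end])]
    for start, end in reversed(spans):            # blank out patched spans
        text = text[:start] + " " * (end - start) + text[end:]
    mentions += [("relaxed", m) for m in relaxed(text)]
    return mentions

relaxed = lambda t: re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+|[A-Z]\. [A-Z][a-z]+", t)
strict = lambda t: re.findall(r"[A-Z]\. [A-Z][a-z]+", t)
print(extract_with_patches("R. Miller, D. Smith, B. Jones met David Miller.", relaxed, strict))
# Strict extraction keeps "Miller, D." from being misread as a variation of David Miller.
```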

  23. Matching Entity Mentions • Key idea: reasonable plan, then patch • Reasonable plan • mention names are the same (modulo some variation) → match • e.g., David Smith and D. Smith [Plan diagram: extract plans over s1 … sn feed into MatchMbyName, then Union.] Works well, but can't handle certain difficult spots.

  24. Handling Difficult Spots • Estimate the semantic ambiguity of data sources • use social networking techniques [see ICDE-07a] • Apply stricter matchers to more ambiguous sources (see the sketch below) [Plan diagram: MatchMStrict is applied to the extract plan over DBLP; MatchMbyName handles the extract plan over the remaining sources, Union \ {DBLP}.] Example of ambiguity in DBLP, where two different researchers share the name Chen Li: • 41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007. • 38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
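A minimal sketch of ambiguity-aware matching: mentions from low-ambiguity sources match on normalized name alone (MatchMbyName); if either source is estimated as ambiguous, the match must also share a co-author (MatchMStrict). The ambiguity scores, threshold, and co-author test are assumptions.

```python
def normalize(name):
    """Reduce a name to (first initial, last name) for loose comparison."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())

def matches(m1, m2, ambiguity, threshold=0.5):
    if normalize(m1["name"]) != normalize(m2["name"]):
        return False
    if max(ambiguity[m1["source"]], ambiguity[m2["source"]]) <= threshold:
        return True                                  # MatchMbyName suffices
    shared = set(m1["coauthors"]) & set(m2["coauthors"])
    return len(shared) > 0                           # MatchMStrict: need co-author evidence

ambiguity = {"homepage": 0.1, "DBLP": 0.9}   # e.g., from social-network analysis
m1 = {"name": "Chen Li", "source": "DBLP", "coauthors": ["Bin Wang", "Xiaochun Yang"]}
m2 = {"name": "C. Li", "source": "homepage", "coauthors": ["Xiaochun Yang"]}
print(matches(m1, m2, ambiguity))   # True: names agree and a co-author is shared
```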

  25. Going Beyond Sources: Difficult Data Spots Can Cover Any Portion of Data [Plan diagram: MatchMbyName over the extract plan for Union \ {s1 … sn}, MatchMStrict over the extract plan for DBLP, and an even stricter MatchMStrict2 over just the mentions that match "J. Han".]

  26. Summary: Extraction and Integration • Most current solutions • try to find a single good plan, applied to all of data • Cimple solution: reasonable plan, then patch • So the focus shifts to: • how to find a reasonable plan? • how to detect problematic data spots? • how to patch those? • Need a notion of semantic ambiguity • Different from the notion of source relevance

  27. 3. Detecting Problems and Providing Feedback [Architecture diagram from slide 4.]

  28. How to Detect Problems? • After extraction and matching, build services • e.g., superhomepages • Many such homepages contain minor problems, e.g., • X graduated in 19998 • X chairs SIGMOD-05 and VLDB-05 • X published 5 SIGMOD-03 papers • Intuitively, something is semantically incorrect • To fix this, let's build a Semantic Debugger (sketched below) • learns what is a normal profile for researcher, paper, etc. • alerts the builder to potentially buggy superhomepages • so feedback can be provided
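A minimal sketch of a Semantic Debugger: hand-written sanity rules stand in for a learned "normal researcher profile" and flag suspicious superhomepages for feedback. All rule thresholds and profile fields are assumptions.

```python
import datetime
from collections import Counter

def debug_profile(profile):
    """Return alerts for values outside a 'normal' researcher profile."""
    alerts = []
    this_year = datetime.date.today().year
    gy = profile.get("grad_year")
    if gy is not None and not (1950 <= gy <= this_year):
        alerts.append(f"implausible graduation year: {gy}")
    chairs = Counter(year for _, year in profile.get("chaired", []))
    for year, n in chairs.items():
        if n > 1:
            alerts.append(f"chairs {n} major conferences in {year}")
    papers = Counter(profile.get("papers", []))      # (venue, year) pairs
    for (venue, year), n in papers.items():
        if n >= 5:
            alerts.append(f"{n} {venue}-{year} papers looks too high")
    return alerts

profile = {
    "grad_year": 19998,
    "chaired": [("SIGMOD", 2005), ("VLDB", 2005)],
    "papers": [("SIGMOD", 2003)] * 5,
}
print(debug_profile(profile))   # three alerts, one per suspicious fact on the slide
```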

  29. What Types of Feedback? • Say that a certain data item Y is wrong • Provide the correct value for Y, e.g., Y = SIGMOD-06 • Add domain knowledge • e.g., no researcher has ever published 5 SIGMOD papers in a year • Add more data • e.g., X was advised by Z • e.g., here is the URL of another data source • Modify the underlying algorithm • e.g., pull out all data involving X; match using names and co-authors, not just names

  30. How to Make Providing Feedback Very Easy? • "Providing feedback" for the masses • in sync with current trends of empowering the masses • Crucial in the DBLife context • If feedback can be provided easily • can get more feedback • can leverage the mass of users • But this turned out to be very difficult

  31. How to Make Providing Feedback Very Easy? • Say that a certain data item Y is wrong; provide the correct value for Y (e.g., Y = SIGMOD-06) → provide form interfaces • Add domain knowledge, add more data → provide a Wiki interface • Modify the underlying algorithm → unsolved, though there is some recent interest in how to mass-customize software; critical in our experience, but unsolved • See our IEEE Data Engineering Bulletin paper on user-centric challenges, 2007

  32. What Feedback Would Make the Most Impact? • I have one hour spare time, would like to “teach” DBLife • what problems should I work on? • what feedback should I provide? • Need a Feedback Advisor • define a notion of system quality Q(s) • define questions q1, ..., qn that DBLife can ask users • for each qi, evaluate its expected improvement in Q(s) • pick question with highest expected quality improvement • Observations • a precise notion of system quality is now crucial • this notion should model the expected usage
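A minimal sketch of a Feedback Advisor under a toy quality model: system quality Q(s) is usage-weighted expected correctness, and answering a question about an item is assumed to make that item correct. Both modeling choices are assumptions; the slide's general recipe (evaluate expected improvement in Q per question, ask the best) is what the code instantiates.

```python
def quality(prob_correct, usage):
    """Q(s): usage-weighted expected correctness of the data items."""
    return sum(usage[i] * prob_correct[i] for i in prob_correct)

def expected_gain(item, prob_correct, usage):
    """Gain from asking about `item`, assuming the user's answer fixes it."""
    fixed = dict(prob_correct, **{item: 1.0})
    return quality(fixed, usage) - quality(prob_correct, usage)

def pick_question(prob_correct, usage):
    return max(prob_correct, key=lambda i: expected_gain(i, prob_correct, usage))

prob_correct = {"X.grad_year": 0.3, "Y.affiliation": 0.9}
usage = {"X.grad_year": 5.0, "Y.affiliation": 100.0}   # expected usage matters
print(pick_question(prob_correct, usage))
# Y.affiliation wins: fixing a 10% error risk on a heavily used item beats
# fixing a 70% risk on a rarely used one -- quality must model expected usage.
```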

  33. Summary: Detection and Feedback • How to detect problems? • Semantic Debugger • What types of feedback & how to easily provide them? • critical, largely unsolved • What feedback would make most impact? • crucial in large-scale systems • need a Feedback Advisor • need a precise notion of system quality

  34. 4. Mass Collaboration [Architecture diagram from slide 4, highlighting the mass collaboration ("maintenance and expansion") component.]

  35. Mass Collaboration: Voting • Can be applied to numerous problems

  36. Example: Matching • Hard for machine, but easy for human: • "Dell laptop X200 with mouse ..." • "Mouse for Dell laptop 200 series ..." • "Dell X200; mouse at reduced price ..."

  37. Challenges • How to detect and remove noisy users? • evaluate them using questions with known answers • How to combine user feedback? • # of yes votes vs. # of no votes See [ICDE-05a, ICDE-08a]
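A minimal sketch of both challenges on this slide: estimate each user's reliability on questions with known answers, then take a reliability-weighted yes/no vote that ignores detected-noisy users. The weighting scheme and cutoff are assumptions; see [ICDE-05a, ICDE-08a] for the actual treatment.

```python
def reliability(user_answers, gold):
    """Fraction of a user's gold-question answers that are correct."""
    scored = [(q, a) for q, a in user_answers.items() if q in gold]
    if not scored:
        return 0.5                          # unknown user: neutral weight
    return sum(gold[q] == a for q, a in scored) / len(scored)

def weighted_vote(votes, reliabilities, noise_cutoff=0.6):
    """Tally yes/no votes weighted by reliability; drop likely-noisy users."""
    tally = {"yes": 0.0, "no": 0.0}
    for user, vote in votes:
        r = reliabilities.get(user, 0.5)
        if r >= noise_cutoff:               # detected-noisy users are ignored
            tally[vote] += r
    return max(tally, key=tally.get)

gold = {"q1": "yes", "q2": "no"}            # questions with known answers
answers = {"alice": {"q1": "yes", "q2": "no"}, "bob": {"q1": "no", "q2": "yes"}}
rel = {u: reliability(a, gold) for u, a in answers.items()}
print(weighted_vote([("alice", "no"), ("bob", "yes")], rel))   # "no": bob is noisy
```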

  38. Mass Collaboration: Wiki • Community wikipedia • built by machine + human • backed by a structured database [Diagram: machine (M) and user (u1) edits produce successive wiki page versions W1, W2, W3, W3' with corresponding structured versions V1, V2, V3, V3', kept in sync with the global database G built over the data sources.]

  39. Mass Collaboration: Wiki • The machine generates each page from the database, embedding structured markup in the text, e.g.: <# person(id=1){name}=David J. DeWitt #> <# person(id=1){title}=Professor #> <strong>Interests:</strong> <# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #> which renders as "David J. DeWitt — Professor — Interests: Parallel Database". • Humans then edit the page (e.g., title becomes "John P. Morgridge Professor", organization "UW-Madison" since 1976, a new interest "Privacy"), and the machine folds those edits back into the markup and the database.
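A minimal sketch of reading the structured annotations visible on this slide (the <# path{attr}=value #> pattern) so human edits can be written back to the database. The exact grammar of Cimple's wiki markup is not specified here; this regex is an assumption.

```python
import re

MARKUP_RE = re.compile(r"<#\s*(?P<path>[^{#]+)\{(?P<attr>\w+)\}=(?P<value>.*?)\s*#>")

def parse_structured_edits(wiki_text):
    """Extract (object path, attribute, value) triples from annotated wiki text."""
    return [(m["path"].strip(), m["attr"], m["value"].strip())
            for m in MARKUP_RE.finditer(wiki_text)]

text = ("<# person(id=1){name}=David J. DeWitt #> "
        "<# person(id=1){title}=Professor #>"
        "<strong>Interests:</strong>"
        "<# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>")
for triple in parse_structured_edits(text):
    print(triple)   # e.g., ('person(id=1)', 'name', 'David J. DeWitt')
```

Diffing these triples before and after a human edit is one way the wiki layer could push changes back into the structured database.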

  40. Sample Data Quality Challenges • How to detect noisy users? • no clear solution yet • for now, limit editing to trusted editors • modify notion of system quality to account for this • How to combine feedback, handle inconsistent data? • user vs. user • user vs. machine • How to verify claimed ownership of data portions? • e.g., this superhomepage is about me • only I can edit it See [ICDE-08b]

  41. Summary: Mass Collaboration • What can users contribute? • How to evaluate user quality? • How to reconcile inconsistent data?

  42. Additional Challenges • Dealing with evolving data (e.g., matching) • Iterative code development • Lifelong quality improvement • Querying over inconsistent data • Managing provenance and uncertainty • Generating explanations • Undo

  43. Conclusions • Community systems: • data integration + IE + Web 2.0 • potentially very useful in numerous domains • Such systems raise myriad data quality challenges • subsume many current challenges • suggest new ones • Can provide a unifying context for us to make progress • building systems has been a key strength of our field • we need a community effort, as always • Search for "cimple wisc" for more details • Let us know if you want code/data
