This presentation by researchers from the University of Michigan explores the dual nature of spreadsheets as essential tools for data storage, sharing, and transformation, while also highlighting their inherent limitations. Despite their widespread use, spreadsheets often lead to poor data integrity and lack structured schemas, creating challenges for data analysts. Key discussions include methods for extracting data, hierarchy extraction, and the importance of manual repairs. The talk also details experimental results and proposes future work in enhancing spreadsheet management practices.
Managing Spreadsheets • Michael Cafarella, Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo • University of Michigan • New England Database Summit, February 1, 2013
Spreadsheets: The Good Parts • A “Swiss Army Knife” for data: storing, sharing, transforming • Sophisticated users who are not DBAs • Contain lots of data, found nowhere else • Everyone uses them; almost wholly ignored by DB community • Thanks, Jeremy!
Spreadsheets: The Awful Parts • Users toss in data, worry about schemas later (well, never) • Spreadsheets designed for humans, not query processors • No explicit schemas: • Poor data integrity (Zeeberg et al., 2004) • Integration very hard • Tumor suppressor gene Deleted In Esophageal Cancer 1 • aka, DEC1 • aka, (according to Excel) 01-DEC
A Data Tragedy • Spreadsheets build, then entomb, our best, most expensive, data • >400,000 just from ClueWeb09 • From gov’ts, WTO, many other sources • How many inside firewall? • Application vision: Ad-hoc integration & analysis for any dataset • Challenge: recover relations from any spreadsheet, w/little human effort
Closeup • One hierarchy error yields many bad tuples • Too many datasets to process manually • Desired tuple: (figure)
Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work
Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work
Extracting Tuples • Extract frame, attribute hierarchy trees • Map values to attributes; create tuples • Apply manual repairs, repeat • How many repairs for 100% accuracy? • Yields tuples, not relations • We won’t discuss: relation assembly
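The value-to-attribute mapping in the pipeline above can be sketched roughly as follows. This is an illustrative assumption, not the talk's actual code: the parent-link dictionary and the `path_to_root`/`make_tuples` names are invented for the example.

```python
# Minimal sketch (hypothetical names): given recovered parent links for the
# left attribute hierarchy, each data cell yields one flat relational tuple
# of (attribute path ..., column header, value).

def path_to_root(node, parent):
    """Walk parent links from an attribute up to the hierarchy root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return list(reversed(path))

def make_tuples(parent, data):
    """data: {(row_attr, col_header): value} -> flat relational tuples."""
    return [tuple(path_to_root(r, parent)) + (c, v)
            for (r, c), v in sorted(data.items())]

# Toy spreadsheet: "Male"/"Female" are children of the "Under 18" row.
parent = {"Male": "Under 18", "Female": "Under 18"}
data = {("Male", "2009"): 120, ("Female", "2009"): 118}
print(make_tuples(parent, data))
# -> [('Under 18', 'Female', '2009', 118), ('Under 18', 'Male', '2009', 120)]
```

This also shows why one wrong parent link corrupts every tuple under it: the path to the root is shared by all descendant values.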
1. Frame Detection • Key assumption: inputs are data frames • Locate metadata in top/left regions • Locate data in center block
1. Frame Detection • Key assumption: inputs are data frames • Locate metadata in top/left regions • Locate data in center block • ~72% of spreadsheets fit; others not relational • Each non-empty row labeled one of TITLE, HEADER, DATA, FOOTNOTE • Reconstruct regions from labels • Infer labels with linear-chain Conditional Random Field (Lafferty et al, 2001) • Layout features: has bold cell? Merged cell? • Text features: contains ‘table’, ‘total’? Indented text? Numeric cells? Year cells?
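A hedged sketch of the per-row features the slide lists (bold cells, keyword mentions, numeric fraction, indentation). A real linear-chain CRF (Lafferty et al., 2001) would consume vectors like these to label rows; the cell representation and function name here are assumptions for illustration only.

```python
# Hypothetical per-row feature extraction for the row-labeling CRF.
# cells: list of dicts like {"text": str, "bold": bool} -- an assumed
# representation, not the system's actual input format.

def row_features(cells):
    texts = [c["text"] for c in cells if c["text"]]
    numeric = [t for t in texts if t.replace(".", "", 1).isdigit()]
    return {
        "has_bold": any(c.get("bold") for c in cells),
        "mentions_table": any("table" in t.lower() for t in texts),
        "mentions_total": any("total" in t.lower() for t in texts),
        "frac_numeric": len(numeric) / len(texts) if texts else 0.0,
        "starts_indented": bool(cells) and cells[0]["text"].startswith("  "),
    }

# A bold row mentioning "Table" is strong evidence for the TITLE label.
print(row_features([{"text": "Table 12. Population", "bold": True}]))
```

The chain structure matters because row labels are sticky: a DATA row is far more likely to follow another DATA row than a FOOTNOTE.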
2. Hierarchy Extraction • One task for TOP, one for LEFT • Create boolean random var for each candidate parent relationship • Build conditional random field to obtain best variable assignment
2. Hierarchy Extraction • CRFs use potential functions to incorporate features • Node potentials represent single parent/child match • Share style? Near each other? WS-separated? • Edge potentials tie pairs of parent/child decisions • Share style pairs? Share text? Indented similarly? • Spreadsheet potentials ensure a legal tree • One-parent potential: -∞ weight for multiple parents • Directional potential: -∞ weight when parent edges go in opposite directions • Run Loopy Belief Propagation for node + edge; post-inference test and repair for spreadsheet • Real sheets yielded 1K-8K variables; inference <0.13 sec • Approach adapted from (Pimplikar and Sarawagi, 2012)
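The interplay of soft potentials and -∞ hard constraints can be sketched as a scoring function over a candidate parent assignment. This is a simplified assumption, not the talk's CRF: rows are reduced to integer indices, the directional constraint is collapsed to "parents sit above children", and `node_score`/`edge_score` stand in for the learned potentials.

```python
import math

# Illustrative scoring of one candidate assignment {child_row: parent_row}.
# Hard "spreadsheet potentials" from the slide return -inf for illegal
# trees; soft node/edge potentials are passed in as plain callables.

def assignment_score(assignment, node_score, edge_score):
    # one parent per child is implicit in the dict (one-parent potential);
    # simplified directional constraint: a parent's row precedes its child's
    for child, parent in assignment.items():
        if parent >= child:
            return -math.inf
    score = sum(node_score(c, p) for c, p in assignment.items())
    # edge potentials tie consecutive parent decisions together
    children = sorted(assignment)
    score += sum(edge_score(assignment[a], assignment[b])
                 for a, b in zip(children, children[1:]))
    return score
```

In the real system, inference searches over all assignments (via Loopy Belief Propagation) rather than scoring one; the -∞ terms simply make every illegal tree lose.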
3. Manual Repair • User reviews, repairs extraction • Goal: reduce user burden • Extractor makes repeated mistakes, either within a spreadsheet or across the corpus • Headache for user to repeat fixes • Our sol'n: after each repair, add repair potentials to CRF • Links user-repaired nodes to a set of similar nodes throughout the CRF • Incorporates info on node similarity • Edges are generated heuristically • After each repair, re-run inference
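The repair loop above can be sketched as follows, with one deliberate simplification: instead of adding soft repair potentials and re-running belief propagation, this sketch just pins heuristically similar nodes to the repaired label. `infer` and `similar` are hypothetical stand-ins for the CRF machinery and the heuristic edge generator.

```python
# Hedged sketch of the repair loop: infer, let the user fix one error,
# propagate the fix to similar nodes, re-run inference; count repairs
# needed to reach 100% accuracy.

def repair_loop(infer, similar, ground_truth, max_repairs=50):
    pinned = {}
    for n_repairs in range(max_repairs + 1):
        labels = infer(pinned)
        errors = [v for v in ground_truth if labels[v] != ground_truth[v]]
        if not errors:
            return n_repairs          # manual repairs needed for 100%
        fix = errors[0]               # user corrects one node
        pinned[fix] = ground_truth[fix]
        # simplification of "repair potentials": similar nodes inherit
        # the corrected label outright instead of being softly pulled
        for other in similar(fix):
            pinned.setdefault(other, ground_truth[fix])
    return max_repairs
```

The payoff measured later in the talk is exactly this loop's return value: how few repairs the propagation lets the user get away with.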
Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work
Experiments • General survey of spreadsheet use • Evaluate: • Standalone extraction accuracy • Manual repair effectiveness • Test sets: • SAUS: 1,322 files from 2010 Statistical Abstract of the United States • WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09
Spreadsheets in the Wild • Very common for Web-published gov’t data
Standalone Extraction • 100 random H-Sheets from SAUS, WEB • Three metrics • Pairs: parent/child pairs labeled correctly (F1) • Tuples: relational tuples labeled correctly (F1) • Sheets: % of sheets labeled 100% correctly • Two methods • Baseline uses just formatting, position • Hierarchy uses our approach
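The Pairs and Tuples metrics above are set-based F1 scores; a minimal sketch, assuming predictions and gold annotations are comparable tuples:

```python
# F1 over predicted vs. gold items (parent/child pairs or whole tuples).
# The set representation is an assumption for illustration.

def f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# One of two gold pairs recovered: precision 1.0, recall 0.5.
print(f1({("Under 18", "Male")},
         {("Under 18", "Male"), ("Under 18", "Female")}))
# -> 0.6666666666666666
```

The Sheets metric is stricter: a sheet counts only if every pair is correct, which is why it lags the pairwise F1 numbers.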
Manual Repair: Effectiveness • Gather 10 topic areas from SAUS, WEB • Expert provides ground-truth hierarchies • Extract; repeatedly repair and recompute
Manual Repair: Ordering • Good ordering: errors steadily decrease • Bad: extended periods of slow decrease
End-To-End Extraction • What is overall utility of our extractor? • Final metric: Correct tuples per manual repair
Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work
Demo Details • Ran SAUS corpus through extractor • Simple ad hoc integration analysis tool on top of extracted data • Early version of relation reconstruction • Early version of data ranking, join finding
Related Work • Spreadsheet as interface (Witkowski et al., 2003), (Liu et al., 2009) • Spreadsheet extraction • User-provided rules (Ahmad et al., 2003), (Hung et al., 2011) • No explicit user rules (Abraham and Erwig, 2007), (Cunha et al., 2009) • Ad hoc integration for found data (Cafarella et al., 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al., 2012) • Semi-automatic data programming • Wrangler (Guo et al., 2011)
Conclusions and Future Work • Spreadsheet extraction opens new datasets • Manual repair ensures accuracy, low user burden • Ongoing and Future Work • Relation assembly • Data relevance ranking • Join finding