This presentation by researchers from the University of Michigan explores the dual nature of spreadsheets as essential tools for data storage, sharing, and transformation, while also highlighting their inherent limitations. Despite their widespread use, spreadsheets often lead to poor data integrity and lack structured schemas, creating challenges for data analysts. Key discussions include methods for extracting data, hierarchy extraction, and the importance of manual repairs. The talk also details experimental results and proposes future work in enhancing spreadsheet management practices.
Managing Spreadsheets • Michael Cafarella, Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo • University of Michigan • New England Database Summit, February 1, 2013
Spreadsheets: The Good Parts • A “Swiss Army Knife” for data: storing, sharing, transforming • Sophisticated users who are not DBAs • Contain lots of data, found nowhere else • Everyone uses them; almost wholly ignored by DB community • Thanks, Jeremy!
Spreadsheets: The Awful Parts • Users toss in data, worry about schemas later (well, never) • Spreadsheets designed for humans, not query processors • No explicit schemas: • Poor data integrity (Zeeberg et al., 2004) • Integration very hard • Tumor suppressor gene Deleted In Esophageal Cancer 1 • aka, DEC1 • aka, (according to Excel) 01-DEC
A Data Tragedy • Spreadsheets build, then entomb, our best, most expensive, data • >400,000 just from ClueWeb09 • From gov’ts, WTO, many other sources • How many inside firewall? • Application vision: Ad-hoc integration & analysis for any dataset • Challenge: recover relations from any spreadsheet, w/little human effort
Closeup • One hierarchy error yields many bad tuples • Too many datasets to process manually • Desired tuple: (figure)
Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work
Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work
Extracting Tuples • Extract frame, attribute hierarchy trees • Map values to attributes; create tuples • Apply manual repairs, repeat • How many repairs for 100% accuracy? • Yields tuples, not relations • We won’t discuss: relation assembly
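The value-to-attribute mapping in the pipeline above can be sketched roughly as follows. This is an illustrative assumption, not the talk's actual code: the parent-link dictionary and the `path_to_root`/`make_tuples` names are invented for the example.

```python
# Minimal sketch (hypothetical names): given recovered parent links for the
# left attribute hierarchy, each data cell yields one flat relational tuple
# of (attribute path ..., column header, value).

def path_to_root(node, parent):
    """Walk parent links from an attribute up to the hierarchy root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return list(reversed(path))

def make_tuples(parent, data):
    """data: {(row_attr, col_header): value} -> flat relational tuples."""
    return [tuple(path_to_root(r, parent)) + (c, v)
            for (r, c), v in sorted(data.items())]

# Toy spreadsheet: "Male"/"Female" are children of the "Under 18" row.
parent = {"Male": "Under 18", "Female": "Under 18"}
data = {("Male", "2009"): 120, ("Female", "2009"): 118}
print(make_tuples(parent, data))
# -> [('Under 18', 'Female', '2009', 118), ('Under 18', 'Male', '2009', 120)]
```

This also shows why one wrong parent link corrupts every tuple under it: the path to the root is shared by all descendant values.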
1. Frame Detection • Key assumption: inputs are data frames • Locate metadata in top/left regions • Locate data in center block
1. Frame Detection • Key assumption: inputs are data frames • Locate metadata in top/left regions • Locate data in center block • ~72% of spreadsheets fit; others not relational • Each non-empty row labeled one of TITLE, HEADER, DATA, FOOTNOTE • Reconstruct regions from labels • Infer labels with linear-chain Conditional Random Field (Lafferty et al, 2001) • Layout features: has bold cell? Merged cell? • Text features: contains ‘table’, ‘total’? Indented text? Numeric cells? Year cells?
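A hedged sketch of the per-row features the slide lists (bold cells, keyword mentions, numeric fraction, indentation). A real linear-chain CRF (Lafferty et al., 2001) would consume vectors like these to label rows; the cell representation and function name here are assumptions for illustration only.

```python
# Hypothetical per-row feature extraction for the row-labeling CRF.
# cells: list of dicts like {"text": str, "bold": bool} -- an assumed
# representation, not the system's actual input format.

def row_features(cells):
    texts = [c["text"] for c in cells if c["text"]]
    numeric = [t for t in texts if t.replace(".", "", 1).isdigit()]
    return {
        "has_bold": any(c.get("bold") for c in cells),
        "mentions_table": any("table" in t.lower() for t in texts),
        "mentions_total": any("total" in t.lower() for t in texts),
        "frac_numeric": len(numeric) / len(texts) if texts else 0.0,
        "starts_indented": bool(cells) and cells[0]["text"].startswith("  "),
    }

# A bold row mentioning "Table" is strong evidence for the TITLE label.
print(row_features([{"text": "Table 12. Population", "bold": True}]))
```

The chain structure matters because row labels are sticky: a DATA row is far more likely to follow another DATA row than a FOOTNOTE.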
2. Hierarchy Extraction • One task for TOP, one for LEFT • Create boolean random var for each candidate parent relationship • Build conditional random field to obtain best variable assignment
2. Hierarchy Extraction • CRFs use potential functions to incorporate features • Node potentials represent single parent/child match • Share style? Near each other? WS-separated? • Edge potentials tie pairs of parent/child decisions • Share style pairs? Share text? Indented similarly? • Spreadsheet potentials ensure a legal tree • One-parent potential: -∞ weight for multiple parents • Directional potential: -∞ weight when parent edges go in opposite directions • Run Loopy Belief Propagation for node + edge; post-inference test and repair for spreadsheet • Real sheets yielded 1K-8K variables; inference <0.13 sec • Approach adapted from (Pimplikar and Sarawagi, 2012)
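The interplay of soft potentials and -∞ hard constraints can be sketched as a scoring function over a candidate parent assignment. This is a simplified assumption, not the talk's CRF: rows are reduced to integer indices, the directional constraint is collapsed to "parents sit above children", and `node_score`/`edge_score` stand in for the learned potentials.

```python
import math

# Illustrative scoring of one candidate assignment {child_row: parent_row}.
# Hard "spreadsheet potentials" from the slide return -inf for illegal
# trees; soft node/edge potentials are passed in as plain callables.

def assignment_score(assignment, node_score, edge_score):
    # one parent per child is implicit in the dict (one-parent potential);
    # simplified directional constraint: a parent's row precedes its child's
    for child, parent in assignment.items():
        if parent >= child:
            return -math.inf
    score = sum(node_score(c, p) for c, p in assignment.items())
    # edge potentials tie consecutive parent decisions together
    children = sorted(assignment)
    score += sum(edge_score(assignment[a], assignment[b])
                 for a, b in zip(children, children[1:]))
    return score
```

In the real system, inference searches over all assignments (via Loopy Belief Propagation) rather than scoring one; the -∞ terms simply make every illegal tree lose.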
3. Manual Repair • User reviews, repairs extraction • Goal: reduce user burden • Extractor makes repeated mistakes, either within a spreadsheet or across the corpus • Headache for user to repeat fixes • Our sol'n: after each repair, add repair potentials to CRF • Links user-repaired nodes to a set of similar nodes throughout the CRF • Incorporates info on node similarity • Edges are generated heuristically • After each repair, re-run inference
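The repair loop above can be sketched as follows, with one deliberate simplification: instead of adding soft repair potentials and re-running belief propagation, this sketch just pins heuristically similar nodes to the repaired label. `infer` and `similar` are hypothetical stand-ins for the CRF machinery and the heuristic edge generator.

```python
# Hedged sketch of the repair loop: infer, let the user fix one error,
# propagate the fix to similar nodes, re-run inference; count repairs
# needed to reach 100% accuracy.

def repair_loop(infer, similar, ground_truth, max_repairs=50):
    pinned = {}
    for n_repairs in range(max_repairs + 1):
        labels = infer(pinned)
        errors = [v for v in ground_truth if labels[v] != ground_truth[v]]
        if not errors:
            return n_repairs          # manual repairs needed for 100%
        fix = errors[0]               # user corrects one node
        pinned[fix] = ground_truth[fix]
        # simplification of "repair potentials": similar nodes inherit
        # the corrected label outright instead of being softly pulled
        for other in similar(fix):
            pinned.setdefault(other, ground_truth[fix])
    return max_repairs
```

The payoff measured later in the talk is exactly this loop's return value: how few repairs the propagation lets the user get away with.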
Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work
Experiments • General survey of spreadsheet use • Evaluate: • Standalone extraction accuracy • Manual repair effectiveness • Test sets: • SAUS: 1,322 files from 2010 Statistical Abstract of the United States • WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09
Spreadsheets in the Wild • Very common for Web-published gov’t data
Standalone Extraction • 100 random H-Sheets from SAUS, WEB • Three metrics • Pairs: parent/child pairs labeled correctly (F1) • Tuples: relational tuples labeled correctly (F1) • Sheets: % of sheets labeled 100% correctly • Two methods • Baseline uses just formatting, position • Hierarchy uses our approach
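The Pairs and Tuples metrics above are set-based F1 scores; a minimal sketch, assuming predictions and gold annotations are comparable tuples:

```python
# F1 over predicted vs. gold items (parent/child pairs or whole tuples).
# The set representation is an assumption for illustration.

def f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# One of two gold pairs recovered: precision 1.0, recall 0.5.
print(f1({("Under 18", "Male")},
         {("Under 18", "Male"), ("Under 18", "Female")}))
# -> 0.6666666666666666
```

The Sheets metric is stricter: a sheet counts only if every pair is correct, which is why it lags the pairwise F1 numbers.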
Manual Repair: Effectiveness • Gather 10 topic areas from SAUS, WEB • Expert provides ground-truth hierarchies • Extract; repeatedly repair and recompute
Manual Repair: Ordering • Good ordering: errors steadily decrease • Bad: extended periods of slow decrease
End-To-End Extraction • What is overall utility of our extractor? • Final metric: Correct tuples per manual repair
Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work
Demo Details • Ran SAUS corpus through extractor • Simple ad hoc integration analysis tool on top of extracted data • Early version of relation reconstruction • Early version of data ranking, join finding
Related Work • Spreadsheet as interface (Witkowski et al., 2003), (Liu et al., 2009) • Spreadsheet extraction • User-provided rules (Ahmad et al., 2003), (Hung et al., 2011) • No explicit user rules (Abraham and Erwig, 2007), (Cunha et al., 2009) • Ad hoc integration for found data (Cafarella et al., 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al., 2012) • Semi-automatic data programming • Wrangler (Guo et al., 2011)
Conclusions and Future Work • Spreadsheet extraction opens new datasets • Manual repair ensures accuracy, low user burden • Ongoing and Future Work • Relation assembly • Data relevance ranking • Join finding