This presentation discusses the challenges and advancements in extracting structured data from the web. Web pages contain structure that humans can easily discern, but that structure remains opaque to machines and search engines. We cover extraction techniques, including the WebTables system, which automatically extracts databases from web crawls, and large-scale entity extraction with Structurepedia; applications such as web data integration (Octopus) and structure-aware web search (Meez); and tools such as the Manimal MapReduce optimizer. Progress in each area reinforces the others, paving the way for a more data-aware web.
Managing The Structured Web Michael J. Cafarella University of Michigan Michigan CSE April 23, 2010
The Structured Web • Web pages contain structure that is obvious to humans, though not to machines • Search engines are largely blind to it • Databases need data that is perfectly structured
Different Approaches • Extraction Techniques • Tables: WebTables [WebDB’08, VLDB’08] • Large-scale entity extraction: Structurepedia [ongoing] • Applications • Web data integration: Octopus [VLDB’09] • Structure-aware Web search: Meez [ongoing] • Tools • MapReduce Optimizer: Manimal [ongoing] • Progress in one reinforces others
Different Approaches • Extraction Techniques • Tables: WebTables [WebDB’08, VLDB’08] • (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu) • Large-scale entity extraction: Structurepedia [ongoing] • Applications • Web data integration: Octopus [VLDB’09] • Structure-aware Web search: Meez [ongoing] • Tools • MapReduce Optimizer: Manimal [ongoing] • (w/ Chris Re)
WebTables • The WebTables system automatically extracts databases from a web crawl [WebDB’08, “Uncovering…”, Cafarella et al.] [VLDB’08, “WebTables: Exploring…”, Cafarella et al.] • An extracted relation is one table plus labeled columns • We estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases • [Figure: Raw crawled pages, Raw HTML Tables, Recovered Relations, Schema Statistics, Applications]
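To make "one table plus labeled columns" concrete, here is a minimal sketch in Python. The slides do not say how WebTables decides which HTML tables are good relations, so the header-row heuristic below is only an assumed stand-in, and the names `TableExtractor`, `to_relation`, and `extract_relations` are hypothetical. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows of (cell_tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._rows = None   # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # (tag, text parts) of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = (tag, [])

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            kind, parts = self._cell
            self._row.append((kind, "".join(parts).strip()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def to_relation(rows):
    """Assumed heuristic: an all-<th> first row labels the columns,
    and every data row must have the same width as the header."""
    if len(rows) < 2:
        return None
    header, body = rows[0], rows[1:]
    if not header or any(kind != "th" for kind, _ in header):
        return None
    if any(len(row) != len(header) for row in body):
        return None
    columns = [text for _, text in header]
    tuples = [tuple(text for _, text in row) for row in body]
    return columns, tuples


def extract_relations(html):
    """Return (columns, tuples) for each table that passes the heuristic."""
    parser = TableExtractor()
    parser.feed(html)
    relations = [to_relation(rows) for rows in parser.tables]
    return [rel for rel in relations if rel is not None]
```

A layout table with a single row or no header cells is rejected. That is the kind of filtering that reduces a very large set of raw HTML tables to a much smaller set of usable relations.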
Schema Statistics • Schema stats useful for computing attribute probabilities • p(“make”), p(“model”), p(“zipcode”) • p(“make” | “model”), p(“make” | “zipcode”) • Allows many applications • Schema “tab-complete” • Synonym discovery • Others • Progress in extraction technique enables new data applications
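The attribute probabilities above can be estimated by simple counting over a corpus of extracted schemas. The sketch below assumes each schema is a list of column labels and that each attribute counts once per schema. The function names and the scoring rule in `tab_complete` (sum of conditional probabilities) are illustrative assumptions, not the method from the WebTables papers.

```python
from collections import Counter
from itertools import permutations


def schema_stats(schemas):
    """Count attributes and ordered attribute pairs across schemas."""
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        attrs = set(schema)  # each attribute counts once per schema
        attr_counts.update(attrs)
        pair_counts.update(permutations(attrs, 2))
    return attr_counts, pair_counts, len(schemas)


def p(attr, stats):
    """p(attr): fraction of schemas that contain attr."""
    attr_counts, _, total = stats
    return attr_counts[attr] / total


def p_given(attr, given, stats):
    """p(attr | given): fraction of schemas with `given` that also have attr."""
    attr_counts, pair_counts, _ = stats
    if attr_counts[given] == 0:
        return 0.0
    return pair_counts[(attr, given)] / attr_counts[given]


def tab_complete(partial, stats, k=3):
    """Suggest up to k attributes that tend to co-occur with `partial`.

    Assumed scoring rule: sum of p(candidate | a) over a in partial.
    """
    attr_counts, _, _ = stats
    candidates = [a for a in attr_counts if a not in partial]

    def score(candidate):
        return sum(p_given(candidate, a, stats) for a in partial)

    # Sort by descending score, then name, so ties are deterministic.
    return sorted(candidates, key=lambda c: (-score(c), c))[:k]


# Toy corpus, invented for illustration only.
toy_schemas = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["name", "zipcode"],
    ["make", "model", "zipcode"],
]
toy_stats = schema_stats(toy_schemas)
```

On this toy corpus, `p_given("make", "model", toy_stats)` is higher than `p_given("make", "zipcode", toy_stats)`, which is the kind of signal schema "tab-complete" relies on.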
Manimal (ongoing) • MapReduce very popular for “big data” • Easy for non-database programmers • Parallelizable, but inefficient • RDBMSes challenging for “big data” • Programming and admin relatively difficult • When well-used, very efficient • Manimal is a hybrid MapReduce/RDBMS execution system • Static analysis to extract code semantics • e.g., if(score > 5)… corresponds to a database selection • Extractions enable RDBMS-style optimizations • Progress in extraction enables new data tools
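The static-analysis idea can be illustrated with a small sketch. This is not Manimal's analyzer: it uses Python's standard `ast` module as a stand-in, and the map function, the `emit` call, and `extract_selection` are hypothetical. It recognizes only the simplest case, a map function whose whole body is one guarded `if` with no `else`, and reports the guard as a candidate selection predicate.

```python
import ast

# Hypothetical user-written map function, kept as source text so it can be
# analyzed without being executed.
MAP_SOURCE = '''
def map_fn(key, record):
    if record.score > 5:
        emit(key, record.name)
'''


def extract_selection(source):
    """Return the guard expression as text if the first top-level function
    consists of a single `if` with no `else`; otherwise return None."""
    tree = ast.parse(source)
    if not tree.body or not isinstance(tree.body[0], ast.FunctionDef):
        return None
    func = tree.body[0]
    if len(func.body) != 1:
        return None
    stmt = func.body[0]
    if not isinstance(stmt, ast.If) or stmt.orelse:
        return None
    # Records failing the guard produce no output, so the guard acts like
    # a WHERE clause over the input.
    return ast.unparse(stmt.test)  # ast.unparse requires Python 3.9+


predicate = extract_selection(MAP_SOURCE)
```

Once the predicate is known, an execution system could in principle apply RDBMS-style optimizations, such as skipping records that cannot satisfy it, without the programmer changing the MapReduce code.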