Explore how search over statistical diagrams can be improved with metadata extraction and ranking techniques. Learn about our approach, experiments, and the future of diagram search.
Searching for Statistical Diagrams • Michael Cafarella, University of Michigan • Joint work with Shirley Zhe Chen and Eytan Adar • Brigham Young University, November 17, 2011
Statistical Diagrams • Everywhere in serious academic, governmental, and scientific documents • Our only peek into the data behind docs • Previously precious and rare; the Web gives us a flood • In a small Web crawl, found 319K diagrams in 153K academic papers • Google makes it easy to find docs and images; very hard to find diagrams
Outline • Intro • Related Work • Our Approach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets
Telephone System Manhole Drawing, from Arias, et al., Pattern Recognition Letters 16, 1995
Sample Architectural Drawing, from Ah-Soon and Tombre, Proc’ds of Fourth Int’l Conf on Document Analysis and Recognition, 1997.
Previous Work • Understanding diagrams isn’t new • Understanding a Web’s worth of diagrams is new • Need to search statistical diagrams in medicine, economics, biology, physics, etc • The (old) phone company can afford a system tailored for manhole diagrams; we can’t • Effective scaling with # of topics is central goal of domain-independent information extraction
Domain-Independent IE • General IE topic since early 1990s • Goal is to obtain structured information from unstructured raw documents • [Title, Price] from online bookstores • [Director, Film] from discussion boards • [Scientist, Birthday] from biographies • Traditional IE requires topic-specific code, features, data • Supervision costs grow with # of domains • Domain-independent IE does not
Related Work • Domain-independent extraction: • Text (Banko et al, IJCAI 2007; Shinyama+Sekine, HLT-NAACL 2006) • Tables (Cafarella et al., VLDB 2008) • Infoboxes (Wu and Weld, CIKM 2007) • Specific to diagrams, some DI: • Huang et al, “Associating text and graphics…”, ICDAR 2005 • Huang et al, “Model-based chart image recognition”, GREC 2003 • Kaiser et al, “Automatic extraction…”, AAAI 2008 • Liu et al, “Automated analysis…”, IJDAR 2009
Outline • Intro • Related Work • Our Approach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets
Our Approach • Typical Web search pipeline • Crawl Web for documents • Obtain and index text • Make index queryable • Our novel components • Diagram metadata extraction • Custom search ranker • Snippet generator
Metadata Extraction • Recover good (text, x, y) from PDFs • Apply simple role label: title, legend, etc • Does text start with capitalized word? • Ratio of text region height to width • % of words in region that are nouns • Group texts into “model diagram” candidates, throw away unlikely ones • E.g., must include something on x scale • Relabel text using geometric relationships • Distance, angle to diagram’s origin? • Leftmost in diagram? Under a caption?
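A minimal sketch of the first-pass role labeling described above, assuming a simple TextRegion record for each piece of text recovered from the PDF and an off-the-shelf classifier. The field names, feature set, and classifier choice are illustrative assumptions, not the talk's actual implementation:

```python
# Sketch: each text region pulled from a PDF becomes a small feature
# vector, and a generic classifier assigns a role such as "title",
# "legend", "axis-label", or "other".  Illustrative only.
from dataclasses import dataclass
from sklearn.tree import DecisionTreeClassifier

@dataclass
class TextRegion:
    text: str
    x: float          # position on the page
    y: float
    width: float
    height: float

def noun_fraction(text: str) -> float:
    # Placeholder for a real POS tagger: treat capitalized non-initial
    # words as a crude stand-in for nouns.
    words = text.split()
    if len(words) < 2:
        return 0.0
    return sum(w[0].isupper() for w in words[1:]) / (len(words) - 1)

def features(r: TextRegion) -> list[float]:
    return [
        1.0 if r.text[:1].isupper() else 0.0,   # starts with a capitalized word?
        r.height / max(r.width, 1e-6),          # ratio of region height to width
        noun_fraction(r.text),                  # % of words that look like nouns
    ]

# Possible role labels used in the hand-labeled training data (assumed set).
ROLES = ["title", "legend", "axis-label", "other"]

def train_role_labeler(regions, labels):
    """Train on hand-labeled regions; labels are strings drawn from ROLES."""
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit([features(r) for r in regions], labels)
    return clf
```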
Search Ranker • Tested four versions • Naïve – standard document relevance • Reference – caption and context only • Field – all fields, equal weighting • Weighted – all fields, trained weights
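To make the four rankers concrete, here is a toy sketch of field-weighted scoring. The field names, the per-field scorer, and the example weights are assumptions for illustration; in the talk the "Weighted" ranker's weights were trained, not hand-set:

```python
# Sketch of field-weighted ranking: each extracted metadata field of a
# diagram gets its own relevance score, and the ranker combines them.
FIELDS = ["title", "caption", "context", "axis_labels", "legend"]

def field_score(query: str, text: str) -> float:
    """Toy per-field relevance: fraction of query terms present in the field."""
    terms = query.lower().split()
    text = text.lower()
    return sum(t in text for t in terms) / max(len(terms), 1)

def rank(query, diagrams, weights):
    """diagrams: list of dicts mapping field name -> extracted text."""
    def score(d):
        return sum(weights.get(f, 0.0) * field_score(query, d.get(f, ""))
                   for f in FIELDS)
    return sorted(diagrams, key=score, reverse=True)

# "Field" ranker: equal weighting.  "Weighted" ranker: learned weights
# (the values below are illustrative, not the trained ones).
equal_weights   = {f: 1.0 for f in FIELDS}
learned_weights = {"title": 2.5, "caption": 2.0, "context": 1.0,
                   "axis_labels": 1.5, "legend": 1.0}
```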
Snippet Generation • Tested five versions: 1. Original-snippet 2. Small-snippet 3. Text-snippet – caption and context text only, no graphics at all 4. Integrated-snippet – caption and context text accompany graphics 5. Enhanced-snippet – apply metadata over graphic
Outline • Intro • Related Work • Our Approach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets
Experiments • Crawled Web for scientific papers • From ClueWeb09 • Any URL ending in .pdf from a .edu URL • 319K diagrams • Fed data to prototype search engine • Evaluated • Metadata extraction • Rank quality • Snippet effectiveness • All results compared against human judgments
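A small illustrative sketch of the corpus filter described above (URLs ending in .pdf served from .edu hosts); the exact filtering rules applied to ClueWeb09 may have differed:

```python
# Keep candidate paper URLs: path ends in ".pdf" and host is under .edu.
from urllib.parse import urlparse

def is_candidate_paper(url: str) -> bool:
    parsed = urlparse(url)
    return (parsed.path.lower().endswith(".pdf")
            and parsed.hostname is not None
            and parsed.hostname.lower().endswith(".edu"))

assert is_candidate_paper("http://www.example.edu/papers/diagrams.pdf")
assert not is_candidate_paper("http://example.com/report.pdf")
```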
2. Experiments – Ranking • Improves ranking by 52% over the naïve ranker
3. Experiments – Snippets • Improves snippet accuracy by 33% over the naïve solution
Clustering • [Figure: example clusters of similar diagrams, spanning 1978–2004, 2000–2004, and 2002–2006]
Other Applications • Working now • Keyword, axis search • Similar diagrams / similar papers • In future: • Improved academic paper search • “Show plots that support my hypothesis”
Outline • Intro • Related Work • Our Approach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets
Spreadsheets • SAUS (Statistical Abstract of the United States) has >1,300; we've downloaded 350K • Many tasks: search, facets, integration, etc.
Future Work • Has experiment X ever been run? • WY GDP vs coal production in 2002 • Preemptively compute good diagrams • Lots of (most?) structured data lives outside the DBMS • Spreadsheets • HTML tables • Log files, sensor readings, experiments, … • Structured search
Conclusions • Metadata extraction enables 52% better search ranking • Extraction-enhanced snippets allow users to choose 33% more accurately • We rely on open information extraction, but extracted data is not the main product • Can be successful even with imperfect extractors
WebTables [VLDB08, “WebTables…”, Cafarella, Halevy, Wang, Wu, Zhang] • In corpus of 14B raw HTML tables, ~154M are “good” databases • Largest corpus of databases & schemas we know of • The WebTables system: • Recovers good relations from crawl and enables search • Builds novel apps on the recovered data
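As an illustration of the "recover good relations" step, here is a sketch of the kind of filter that separates relational tables from layout tables. These particular heuristics are assumptions; the actual WebTables system relied on trained classifiers rather than fixed rules:

```python
# Sketch: most raw <table> elements are used for page layout, so a
# filter keeps only those that look like small relational databases.
def looks_relational(rows: list[list[str]]) -> bool:
    if len(rows) < 2 or not rows[0]:
        return False
    width = len(rows[0])
    # Every row should have the same number of cells ...
    if any(len(r) != width for r in rows):
        return False
    # ... with at least two columns and mostly non-empty cells.
    if width < 2:
        return False
    cells = [c for r in rows for c in r]
    return sum(bool(c.strip()) for c in cells) / len(cells) > 0.8
```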
WebTables Pipeline • Raw crawled pages → Raw HTML Tables → Recovered Relations → Inverted Index → Relation Search • Recovered relations also feed the Attribute Correlation Statistics Db: 2.6M distinct schemas, 5.4M attributes • The Unreasonable Effectiveness of Data [Halevy, Norvig, Pereira]
Synonym Discovery • Use schema statistics to automatically compute attribute synonyms • More complete than a thesaurus • Given input "context" attribute set C: • A = all attrs that appear with C • P = all (a,b) where a∈A, b∈A, a≠b • Remove all (a,b) from P where p(a,b) > 0 • For each remaining pair (a,b), compute a synonym score
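A sketch of the candidate-generation steps above, using raw schema co-occurrence in place of the ACSDb probabilities. "Appears with C" is read here as "appears in some schema containing all of C", and the final syn(a,b) scoring formula from the WebTables paper is not reproduced, so score_pair is left as a placeholder:

```python
# Sketch of synonym-candidate generation over a corpus of schemas
# (each schema is a set of attribute names).  Illustrative only.
from itertools import combinations

def candidate_synonyms(context: set[str], schemas: list[set[str]]):
    # A = all attributes that appear together with the context set C
    A = set()
    for s in schemas:
        if context <= s:
            A |= (s - context)
    # P = all unordered pairs (a, b) with a != b drawn from A
    P = set(combinations(sorted(A), 2))
    # Synonyms never appear together in a schema, so drop any pair that
    # co-occurs at least once, i.e. p(a, b) > 0.  (Counted over all
    # schemas here; a context-conditioned variant would restrict to
    # schemas containing C.)
    cooccur = {frozenset(p) for s in schemas
               for p in combinations(sorted(s), 2)}
    return [p for p in P if frozenset(p) not in cooccur]

def score_pair(a: str, b: str, schemas: list[set[str]]) -> float:
    # The ACSDb-based syn(a, b) formula from the paper goes here.
    raise NotImplementedError("plug in the syn(a, b) score")
```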