WebTables : Exploring the Power of Tables on the Web

Michael J. Cafarella, Alon Halevy, Zhe Daisy Wang, Eugene Wu, Yang Zhang • VLDB 2008 • 2009. 01. 08. • Summarized and presented by Babar Tareen, IDS Lab., Seoul National University

Introduction • Web is a corpus of unstructured data • Some structure is imposed by • Hierarchical URLs • Hyperlink Graph • Web pages generally contain • Text as paragraphs • Tabular data (Relations) • Text and tables have different characteristics • Tables have more structured data than raw text

Introduction (2) • Tables can give some hints about semantics • Headers • Tuples • Regular keyword query techniques are not very effective for tables

Motivation • Enable analysis and integration of data on the web • User demand for structured data • For 30 million queries users clicked on results containing tables • This paper focuses on two fundamental questions • What are effective methods for searching within large collections of tables? • Is there additional power that can be derived by analyzing a large corpus of tables?

WebTables - Data • The WebTables system considers HTML tables that are already surfaced and crawlable • Deep Web refers to content that is made available only by filling in HTML forms • Corpus • 14.1 billion raw HTML tables • 154 million distinct relational databases • Relational databases form 1.1% of raw HTML tables • 60% of data from non-deep-web sources • 40% of data from parameterized URLs

Extracting Relations • Most HTML tables are used for page layout • To separate relational from non-relational tables • Handwritten detectors • Statistically trained classifiers • Training & test data generated by two independent judges • Relational quality rated on a scale of 1-5 • Tables that received an average score of 4 or above were considered relational

Data Model

Attribute Correlation Statistics Database (ACSDb) • For each unique schema Rs, the ACSDb contains a frequency count • A = {(Rs1,C1), (Rs2,C2), (Rs3,C3) … } • If a schema appears multiple times under the same domain name, it is counted only once • ACSDb contains • 5.4M unique attribute names • 2.6M unique schemas • ACSDb is simple but can be used to compute probabilities • For example, the conditional probability of finding attribute ‘Address’ in a schema given attribute ‘Name’: P(address | name) = (count of schemas containing both address and name) / (count of schemas containing name)
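
The counting rule and the conditional probability above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `build_acsdb` and `cond_prob`, the toy tables, the `.example` domains, and the choice to treat a schema as an unordered, lower-cased set of attribute names are all assumptions made for the example.

```python
from collections import Counter

def build_acsdb(tables):
    """Build a toy ACSDb from (domain, attribute_names) pairs.

    Assumption: a schema is the unordered set of lower-cased attribute names.
    A schema seen several times under one domain is counted only once.
    """
    seen = set()
    acsdb = Counter()
    for domain, attrs in tables:
        schema = frozenset(a.strip().lower() for a in attrs)
        if (domain, schema) in seen:
            continue  # same schema, same domain: do not count again
        seen.add((domain, schema))
        acsdb[schema] += 1
    return acsdb

def cond_prob(acsdb, target, given):
    """P(target | given) from schema frequency counts."""
    denom = sum(c for s, c in acsdb.items() if given in s)
    if denom == 0:
        return 0.0
    numer = sum(c for s, c in acsdb.items() if given in s and target in s)
    return numer / denom

# Hypothetical toy corpus; domains and attributes are made up.
toy_tables = [
    ("a.example", ["Name", "Address"]),
    ("a.example", ["Name", "Address"]),  # duplicate under the same domain
    ("b.example", ["name", "address"]),
    ("c.example", ["name", "phone"]),
    ("d.example", ["make", "model"]),
]
acsdb = build_acsdb(toy_tables)
p_address_given_name = cond_prob(acsdb, "address", "name")
```

In this toy corpus the schema {name, address} is counted twice (once per domain) and {name, phone} once, so `p_address_given_name` is 2/3.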

ACSDb

Relation Search • The WebTables search engine allows users to rank relations by relevance • Query-appropriate visualizations can be created • Columns containing place names can be displayed on a map • Graphs can be generated from table data • Traditional structured operations can be applied over search results • Selection • Projection

Ranking • Keyword ranking for databases is a novel problem • Challenges • Relations do not exist in a domain-specific schema graph • Word frequencies apply ambiguously to tables (Ex: which table in the page is described by which frequent word) • Attribute labels are extremely important • Attributes provide good summaries of the subject matter • Tuples may have a key-like element that summarizes the row • Ranking Functions • naïveRank • filterRank • featureRank • schemaRank

Ranking Function (1) • Naïve Rank • It simply uses the top k search engine result pages to generate relations. • If there are no relations in the top k search results, naïveRank will emit no relations. • Roughly simulates a modern search engine user

Ranking Function (2) • Filter Rank • Similar to naïveRank • It will go as far down the search result pages as necessary to find k relations
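
The difference between the two search-engine-based rankers can be sketched as follows. This is an illustrative sketch under assumptions: `ranked_pages` stands in for a search engine's result list, where each entry is the list of relations already extracted from that page, and capping the output at k relations is a choice made here rather than something stated on the slides.

```python
def naive_rank(ranked_pages, k):
    """Look only at the top k result pages and emit whatever relations they hold."""
    relations = []
    for page_relations in ranked_pages[:k]:
        relations.extend(page_relations)
    return relations[:k]  # assumption: at most k relations are returned

def filter_rank(ranked_pages, k):
    """Walk down the result list as far as needed to collect k relations."""
    relations = []
    for page_relations in ranked_pages:
        relations.extend(page_relations)
        if len(relations) >= k:
            break
    return relations[:k]

# Hypothetical result list: most pages contain no relational table.
ranked_pages = [[], ["r1"], [], [], ["r2", "r3"], ["r4"]]

naive_result = naive_rank(ranked_pages, k=2)    # only pages 1-2 are inspected
filter_result = filter_rank(ranked_pages, k=2)  # keeps going until 2 are found
```

Here `naive_result` is `["r1"]`, because only the first two pages are inspected, while `filter_result` is `["r1", "r2"]`.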

Ranking Function (3) • Feature Rank • Does not rely on an existing search engine • Uses relation-specific features to score each extracted relation in the corpus • Sorts results by score • Different feature scores were combined using a linear regression estimator • Trained on a thousand (q, relation) pairs, each scored by two human judges

Ranking Function (4) • Schema Rank • Same as featureRank • Additionally uses an ACSDb-based schema coherence score • A coherent schema is one whose attributes are strongly related • Coherent: Make, Model • Less coherent: Make, Zipcode • PMI - Pointwise Mutual Information • Gives a sense of how strongly two items are related • The coherence score for a schema is the average of all possible attribute-pairwise PMI scores for the schema
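
The coherence score can be sketched with the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))), with probabilities estimated from ACSDb counts. The toy ACSDb, the helper names, the choice to give never-co-occurring pairs a PMI of negative infinity, and the score of 0.0 for single-attribute schemas are assumptions of this sketch.

```python
import math
from itertools import combinations

# Hypothetical toy ACSDb: schema -> frequency count (values are made up).
acsdb = {
    frozenset({"make", "model"}): 3,
    frozenset({"make", "model", "year"}): 2,
    frozenset({"make", "zipcode"}): 1,
    frozenset({"name", "zipcode"}): 4,
}
TOTAL = sum(acsdb.values())

def prob(*attrs):
    """Fraction of schema occurrences that contain all the given attributes."""
    wanted = set(attrs)
    return sum(c for s, c in acsdb.items() if wanted <= s) / TOTAL

def pmi(a, b):
    """Pointwise mutual information of two attributes."""
    p_ab = prob(a, b)
    if p_ab == 0:
        return float("-inf")  # assumption: pairs that never co-occur
    return math.log(p_ab / (prob(a) * prob(b)))

def coherence(schema):
    """Average PMI over all attribute pairs of the schema."""
    pairs = list(combinations(sorted(schema), 2))
    if not pairs:
        return 0.0  # assumption: a single-attribute schema has no pairs
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)

coherent = coherence({"make", "model"})
less_coherent = coherence({"make", "zipcode"})
```

With these counts, `coherent` is positive and `less_coherent` is negative, matching the intuition that Make and Model belong together more strongly than Make and Zipcode.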

Indexing • Traditional search engines use an inverted index • An inverted index cannot retrieve relational features • Inverted index • Term -> (docid, offset) • WebTables data exists in two dimensions • Term -> (docid, offset-X, offset-Y)
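
A two-dimensional posting list makes relational questions, such as "does this term appear in a header?" or "do these two terms share a column?", answerable from the index alone. The sketch below is illustrative: it assumes offset-X is the column number, offset-Y is the row number, and row 0 holds the attribute labels; the function names and the toy tables are hypothetical.

```python
from collections import defaultdict

def build_index(tables):
    """Map each term to postings of the form (table_id, offset_x, offset_y)."""
    index = defaultdict(list)
    for table_id, rows in tables.items():
        for y, row in enumerate(rows):
            for x, cell in enumerate(row):
                for term in cell.lower().split():
                    index[term].append((table_id, x, y))
    return index

def in_header(index, term):
    """Tables where the term occurs in the header row (assumed to be row 0)."""
    return sorted({t for t, x, y in index.get(term.lower(), []) if y == 0})

def same_column(index, header_term, value_term):
    """Tables where value_term appears in a column labeled by header_term."""
    header_cols = {(t, x) for t, x, y in index.get(header_term.lower(), []) if y == 0}
    return sorted({
        t for t, x, y in index.get(value_term.lower(), [])
        if y > 0 and (t, x) in header_cols
    })

# Hypothetical toy tables; the first row of each is the header.
tables = {
    "t1": [["make", "model"], ["Toyota", "Camry"], ["Honda", "Civic"]],
    "t2": [["owner", "make"], ["Kim", "Toyota"]],
    "t3": [["title"], ["how to make tea"]],
}
index = build_index(tables)
header_hits = in_header(index, "make")
column_hits = same_column(index, "make", "toyota")
```

A one-dimensional (docid, offset) posting could report that "make" occurs in all three tables, but not that in `t3` it is only a body term rather than an attribute label.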

ACSDb Application (1) • Schema Auto Complete • Designed to assist novice database designers when creating a relational schema • Schemas consisting of single relations • The user enters one or more domain-specific attributes and the auto-completer guesses the rest of the attributes
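
One plausible way to build such an auto-completer on top of ACSDb counts is a greedy loop: repeatedly add the attribute that is most probable given the attributes chosen so far, and stop when that probability falls below a threshold. The sketch below illustrates this idea only; the threshold value, the alphabetical tie-breaking, the function names, and the toy counts are assumptions, and the paper's exact procedure may differ.

```python
def cond_prob(acsdb, attr, given):
    """P(attr | all attributes in `given`) from schema frequency counts."""
    given = set(given)
    denom = sum(c for s, c in acsdb.items() if given <= s)
    if denom == 0:
        return 0.0
    numer = sum(c for s, c in acsdb.items() if given <= s and attr in s)
    return numer / denom

def autocomplete(acsdb, seed, threshold=0.5):
    """Greedily extend `seed` with the most probable next attribute."""
    schema = list(seed)
    vocabulary = set().union(*acsdb)  # every attribute seen in any schema
    while True:
        best, best_p = None, 0.0
        for attr in sorted(vocabulary - set(schema)):  # ties broken alphabetically
            p = cond_prob(acsdb, attr, schema)
            if p > best_p:
                best, best_p = attr, p
        if best is None or best_p < threshold:
            return schema
        schema.append(best)

# Hypothetical toy ACSDb: schema -> frequency count (values are made up).
acsdb = {
    frozenset({"name", "team", "ab", "h"}): 4,
    frozenset({"name", "team", "position"}): 2,
    frozenset({"name", "address", "phone"}): 5,
    frozenset({"make", "model"}): 3,
}
baseball = autocomplete(acsdb, ["ab"])
cars = autocomplete(acsdb, ["make"])
```

Starting from the single attribute `ab`, the loop recovers the rest of the toy baseball schema, and starting from `make` it suggests `model`.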

ACSDb Application (2) • Attribute Synonym-Finding • Automatically find synonyms between arbitrary attribute strings • Generates attribute pairs based on a set of context attributes • Assumptions • Synonymous attributes will never appear together in the same schema • Odds of synonymity are higher if p(a,b) = 0 despite a large value for p(a)p(b) • Two synonyms will appear in similar contexts
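
The three assumptions can be combined into a single score. The sketch below is one possible formulation, not necessarily the paper's function: it gives a score of zero to any pair that co-occurs, rewards a large p(a)p(b), and divides by a small constant plus the squared difference between the two attributes' co-occurrence profiles. The smoothing constant `eps`, the omission of the user-supplied context attributes, the helper names, and the toy counts are all assumptions.

```python
def prob(acsdb, attrs):
    """Fraction of schema occurrences that contain all attributes in `attrs`."""
    total = sum(acsdb.values())
    wanted = set(attrs)
    return sum(c for s, c in acsdb.items() if wanted <= s) / total

def synonym_score(acsdb, a, b, eps=0.01):
    """Higher scores suggest that a and b are more likely to be synonyms."""
    p_a, p_b = prob(acsdb, {a}), prob(acsdb, {b})
    if p_a == 0 or p_b == 0:
        return 0.0
    if prob(acsdb, {a, b}) > 0:
        return 0.0  # synonyms are assumed never to share a schema
    others = set().union(*acsdb) - {a, b}
    # Context distance: compare P(z | a) with P(z | b) for every other attribute z.
    distance = sum(
        (prob(acsdb, {z, a}) / p_a - prob(acsdb, {z, b}) / p_b) ** 2
        for z in others
    )
    return (p_a * p_b) / (eps + distance)

# Hypothetical toy ACSDb: schema -> frequency count (values are made up).
acsdb = {
    frozenset({"name", "phone", "address"}): 4,
    frozenset({"name", "telephone", "address"}): 3,
    frozenset({"make", "model", "price"}): 3,
}
good_pair = synonym_score(acsdb, "phone", "telephone")
bad_pair = synonym_score(acsdb, "phone", "price")
cooccurring = synonym_score(acsdb, "name", "phone")
```

In the toy data `phone` and `telephone` never co-occur yet share identical contexts, so `good_pair` is far larger than `bad_pair`, while `cooccurring` is zero.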

ACSDb Application (3) • Join Graph Traversal • Provides a useful way of navigating the huge graph of 2.6M schemas • Basic join graph • Contains a node ‘N’ for each unique schema • Undirected join link between any two schemas that share an attribute • Every schema that contains a ‘name’ field is linked to every other schema that contains ‘name’ • Cluster together similar schemas to minimize graph clutter • Schema: X,Y • Shared Attribute: D
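
The basic (unclustered) join graph described above can be sketched directly: one node per schema and an undirected edge, labeled with the shared attributes, between every pair of schemas that overlap. The function names and toy schemas are hypothetical, the clustering step is left out, and the all-pairs loop is only practical for small inputs.

```python
from itertools import combinations

def build_join_graph(schemas):
    """Return {(i, j): shared_attributes} for every pair of overlapping schemas."""
    edges = {}
    for (i, s), (j, t) in combinations(enumerate(schemas), 2):
        shared = s & t
        if shared:
            edges[(i, j)] = shared  # undirected edge, stored once with i < j
    return edges

def neighbors(edges, node, attr=None):
    """Schemas joinable with `node`, optionally only via a specific attribute."""
    result = []
    for (i, j), shared in edges.items():
        if node in (i, j) and (attr is None or attr in shared):
            result.append(j if i == node else i)
    return sorted(result)

# Hypothetical toy schemas; the list index serves as the node id.
schemas = [
    frozenset({"name", "address"}),
    frozenset({"name", "team"}),
    frozenset({"make", "model"}),
    frozenset({"team", "city"}),
]
edges = build_join_graph(schemas)
all_neighbors = neighbors(edges, 1)
name_neighbors = neighbors(edges, 1, attr="name")
```

Because a common attribute such as `name` links every schema that contains it to every other, the edge count grows quickly, which is why the slide mentions clustering similar schemas to reduce clutter.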

Exp. Results – Relation Ranking • Rank-ACSDb beats Naïve (which simulates search engine users) by 78-100% • All of the non-Naïve solutions improve as k (number of results) increases

Exp. Results – Schema Auto Complete • Test Scenario • 6 humans designed schemas using given attributes • The auto-complete tool got three tries • By the 3rd output, auto-complete was able to reproduce a large number of schemas • No test designer recognized ‘ab’ as an abbreviation for ‘at-bats’, a baseball term

Exp. Results – Synonym Finding • Synonym candidates are ranked by quality • An ideal ranking would present a stream of only correct synonyms, followed by only incorrect ones • A poor ranking will mix them together

Exp. Results – Join Graph Traversal

Conclusion • WebTables is the first large-scale attempt to extract relational information embedded in HTML tables • Relation Ranking • ACSDb uses • Schema auto complete • Attribute Synonym Finding • Join Graph Traversal • Adding a signal for source page quality, like PageRank, will improve overall quality

Discussion • Pros • Handling tables separately for search is a good idea • Cons • Most of the paper is focused on uses of ACSDb
