
Chapter 15: Data Integration on the Web


Presentation Transcript


  1. Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION ANHAI DOAN ALON HALEVY ZACHARY IVES

  2. Outline • Introduction, opportunities and challenges with Web data • The Deep Web • Vertical search • Surfacing the Deep Web • Creating topical portals • Lightweight data management on the Web • Discovery of data sets • Extracting data from Web pages • Combining multiple data sets • Re-using others’ work

  3. Broad Range of Data on the Web

  4. Key Characteristics • Scale and heterogeneity • Data is about everything! Overlapping sources, varying levels of quality. • Multiple formats (tables, lists, cards, etc.) • Data is laid out for visual appeal • Extracting the data is very tricky! • Semantics of the data are rarely specified and need to be inferred from text and other clues.

  5. Different Forms of Structured Data on the Web

  6. Tables: hundreds of millions of good ones

  7. Databases Behind Forms: the Deep/Invisible Web • Examples: used cars, store locations, recipes, patents, radio stations • Tens of millions of high-quality forms

  8. HTML Lists • Every list item is a row of a table, but figuring out the cell boundaries is very tricky.

  9. Structured data embedded more loosely in pages. Extraction is very tricky!

  10. What Can We Do with Structured Web Data? • Integrate: • Imagine integrating your data with any data on the Web! • Insights come when independently developed data sets come together • (of course, you can also get garbage that way, so you need to be careful). • Improve Web search • Find tables & lists when they’re relevant to queries • Answer fact-seeking queries with facts rather than links to Web pages. • Aggregate: answer “total GDP of the 10 largest countries” by putting together facts from multiple pages (see the sketch below)
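
A minimal sketch of that aggregation step, assuming the facts have already been extracted from two independent pages. All numbers, names, and the top-3 cutoff here are hypothetical:

```python
# Hypothetical fact sets extracted from two different Web pages.
populations = {"China": 1412, "India": 1408, "United States": 332}  # millions
gdps = {"China": 17700, "India": 3200, "United States": 23300}      # billions USD

# Join on the country name, rank by population, then aggregate GDP.
largest = sorted(populations, key=populations.get, reverse=True)[:3]
total_gdp = sum(gdps[c] for c in largest if c in gdps)
print(f"Total GDP of the {len(largest)} largest countries: ${total_gdp}B")
```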

  11. Bigger Vision: create an ecosystem of structured data on the Web, as a cycle: Discover via search → Extract from Web sources → Manage, analyze, visualize, integrate, create compelling stories → Publish back to the Web → (discover again)

  12. Outline • Introduction, opportunities and challenges with Web data • The Deep Web • Vertical search • Surfacing the Deep Web • Creating topical portals • Lightweight data management on the Web • Discovery of data sets • Extracting data from Web pages • Combining multiple data sets • Re-using others’ work

  13. What is the Deep Web? • Content hidden behind HTML forms, not accessible to search engines.

  14. The Deep Web • The collection of databases that are accessed by users entering values into HTML forms. • Search-engine crawlers cannot fill in the forms, so the content is invisible to search engines. • The work on the Deep Web illustrates many of the challenges of managing Web data.

  15. Two Approaches to the Deep Web • Build a vertical search engine: • Apply all the data integration techniques we’ve learned so far to a set of data sources such as job sites, airline reservations, etc. • This approach is applicable to domains that have thousands of form sites. • Surface the content: • Try to guess good queries to pose to the forms, and insert the resulting HTML pages into the Web index. • This approach covers the long tail of content on the Web.

  16. Approach #1: Vertical Search: Data Integration

  17. Vertical Search as Data Integration • Mediated schema: the properties of the domain that need to be exposed to the user • If you include too many attributes in the mediated schema, you may not be able to query them on many sources. • Source descriptions: relatively simple; sources are often distinguished by their geographical coverage. • Wrappers: • Parsing the answers out of the resulting HTML is the tricky part. • Alternate approach: don’t parse the answers; just show the user the returned Web pages. (A sketch of this architecture follows.)
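
A minimal sketch of this pipeline, under invented assumptions: the source names, form-field names, and attribute mappings below are all hypothetical, and the "wrapper" step is skipped in favor of returning raw result-page URLs, as the slide suggests:

```python
from urllib.parse import urlencode

# Hypothetical source descriptions for a job-search vertical: each source
# maps mediated-schema attributes to its own form-field names.
SOURCES = {
    "jobs-site-a": {"action": "http://a.example/search",
                    "mapping": {"title": "q", "state": "loc"}},
    "jobs-site-b": {"action": "http://b.example/find",
                    "mapping": {"title": "kw", "state": "st"}},
}

def rewrite_query(query: dict) -> list[str]:
    """Rewrite a query over the mediated schema into one form-submission
    URL per source, using each source's attribute mapping."""
    urls = []
    for src in SOURCES.values():
        params = {src["mapping"][attr]: value
                  for attr, value in query.items() if attr in src["mapping"]}
        urls.append(src["action"] + "?" + urlencode(params))
    return urls

# The "don't parse the answers" variant: hand these pages straight to the user.
print(rewrite_query({"title": "nurse", "state": "CA"}))
```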

  18. Deep Web: the Long Tail • Examples of long-tail form sites: Amish quilts, tree search, parking tickets in India, horses

  19. The Surfacing Approach • Crawl & Indexing time • Pre-compute interesting form submissions • Insert resulting pages into the Web Index • Query time: nothing! • Deep web URLs in the Index are like any other URL • Advantages • Reuse existing search engine infrastructure • Reduced load on target web sites – users click only on what they deem relevant. • Approach taken at Google for the long tail.

  20. Surfacing Challenges • Predicting the correct input combinations • Generating all possible URLs is wasteful and unnecessary • Cars.com has ~500K listings, but 250M possible queries • Predicting appropriate values for text inputs • Valid input values are required for retrieving data • E.g., ingredients in recipes.com and zip codes in borders.com • Don’t do anything bad! • Coverage of the crawl: don’t try to cover sites in their entirety; it’s not necessary. • Once you get part of the content, there will be links to the rest • It’s enough to have part of the content in the index to send the site relevant traffic.

  21. Form Processing 101
  <form action="http://www.borders.com/locator" method="GET">
    <select name="store"><option …/>…</select>
    …
    <input name="zip" type="text"/>
    <input name="search" type="submit" value="Go"/>
    <input name="site" type="hidden" value="homepage"/>
  </form>
  • GET and POST: the two types of HTML form submission • Only GETs can be surfaced • On submit, the URL is: http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
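
As a sketch of why only GET forms can be surfaced: a GET submission is just a URL, so a crawler can pre-compute it offline. The snippet below rebuilds the submission URL above from the form's fields (the zip value is just one candidate the crawler might try):

```python
from urllib.parse import urlencode

# Fields of the GET form above: select/hidden inputs keep their defaults,
# and the crawler fills the text input with a candidate value.
action = "http://www.borders.com/locator"
fields = {"store": "All", "city": "", "state": "",
          "zip": "94043", "within": "25", "search": "Go", "site": "homepage"}

url = action + "?" + urlencode(fields)
print(url)
# -> http://www.borders.com/locator?store=All&city=&state=&zip=94043&...
```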

  22. Predicting Input Combinations • Forms can have multiple inputs • Generating all possible URLs is wasteful… and unnecessary! • Goal: minimize URLs while maximizing retrieval! • Other considerations • Generated URLs must be good candidates for the index • Only need enough URLs to drive traffic • Only need enough URLs to seed the web crawler • Solution: discover only informative input combinations. Google's Deep-Web Crawl (VLDB 2008)

  23. Informative Form Fields • Result pages different ⇒ informative: http://jobs.shrm.org/search?state=All&kw=&type=All http://jobs.shrm.org/search?state=AL&kw=&type=All http://jobs.shrm.org/search?state=AK&kw=&type=All … http://jobs.shrm.org/search?state=WV&kw=&type=All • Result pages similar ⇒ uninformative: http://jobs.shrm.org/search?state=All&kw=&type=ALL http://jobs.shrm.org/search?state=All&kw=&type=ANY http://jobs.shrm.org/search?state=All&kw=&type=EXACT • Varying the state yields qualitatively different content, so state is an informative field.
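
A minimal sketch of the informativeness test, assuming the result pages have already been fetched. The system described in Google's Deep-Web Crawl (VLDB 2008) uses more robust, near-duplicate-tolerant page signatures; the hash and the threshold here are deliberate simplifications:

```python
import hashlib

def page_signature(html: str) -> str:
    # Crude content signature: hash the whitespace-normalized page text.
    return hashlib.md5(" ".join(html.split()).encode()).hexdigest()

def is_informative(result_pages: list[str], threshold: float = 0.5) -> bool:
    """An input is informative if varying its values produces sufficiently
    many distinct result pages."""
    distinct = {page_signature(p) for p in result_pages}
    return len(distinct) / max(len(result_pages), 1) >= threshold
```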

  24. Computing Informative Field Combinations • Informative field combinations can be computed bottom up: • Begin with single fields and find which ones are informative. • For every informative combination, try to add another field and check if the resulting combination is still informative. • In practice, we rarely need combinations of more than 3 fields.
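
A sketch of that bottom-up search. It assumes an `is_informative(field_combination)` predicate, like the one above, that probes the form with varying values for the given fields; that predicate and its cost are hidden here:

```python
def informative_combinations(fields, is_informative, max_size=3):
    """Bottom-up search for informative input combinations: start from
    single fields, then extend each informative combination by one field
    and keep it only if the result is still informative."""
    current = [frozenset([f]) for f in fields if is_informative(frozenset([f]))]
    result = list(current)
    for _ in range(max_size - 1):
        extended = []
        for combo in current:
            for f in fields:
                candidate = combo | {f}
                if (f not in combo and candidate not in extended
                        and is_informative(candidate)):
                    extended.append(candidate)
        result.extend(extended)
        current = extended
    return result
```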

  25. Challenge 2: Generic and Typed Text Boxes • Generic search boxes • Accept any keywords • Challenge: selecting the most appropriate values • Typed text boxes • Accept only values of a specific type, e.g., zip codes • Challenge: identifying the type of the input Google's Deep-Web Crawl (VLDB 2008)

  26. Example: www.wipo.int Google's Deep-Web Crawl (VLDB 2008)

  27. Input values for Generic Search • Iterative Probing for search boxes • Select an initial list of candidate keywords • Download pages based on current set of keywords • Extract more candidate keywords from result pages • Refine the current set of keywords • Repeat until no more new candidate keywords • Prune list of candidate keywords
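
A minimal sketch of the probing loop. `fetch_results(keyword)` and `extract_keywords(page)` are hypothetical helpers standing in for the form-submission and term-extraction machinery, and the round/size limits are made up:

```python
def iterative_probing(seed_keywords, fetch_results, extract_keywords,
                      max_rounds=5, max_keywords=500):
    """Iterative probing for a generic search box: probe with the current
    keywords, mine new candidates from the result pages, and repeat until
    no new candidates appear."""
    keywords = set(seed_keywords)
    for _ in range(max_rounds):
        new = set()
        for kw in keywords:
            for page in fetch_results(kw):
                new |= extract_keywords(page)
        new -= keywords
        if not new:
            break  # fixpoint: no new candidate keywords
        keywords |= new
    # Prune: keep only keywords that actually retrieve results (simplified).
    pruned = [kw for kw in keywords if fetch_results(kw)]
    return pruned[:max_keywords]
```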

  28. Example: www.wipo.int • Keywords discovered by iterative probing: Metalworking, Protein, Antibody, Pyrazole, Immobilizer, Vasoconstriction, Phosphinates, Nosepiece, Sandbridge, Viscosity, Carboxydiphenylsulphide, Ozonizer, …

  29. Outline • Introduction, opportunities and challenges with Web data • The Deep Web • Vertical search • Surfacing the Deep Web • Creating topical portals • Lightweight data management on the Web • Discovery of data sets • Extracting data from Web pages • Combining multiple data sets • Re-using others’ work

  30. Topical Portals • An integrated view of a topic: • E.g., all info about database researchers, or all info about coffee and its growing regions. • Topical portals find different aspects of the same objects on different sources • E.g., publications of a person may come from one source, while their job affiliations come from another • In contrast, vertical search integrates similar objects from multiple sources • E.g., job listings, apartments for rent, …

  31. Topical Portal: example Integrated Page for an Entity

  32. Building a Topical Portal • Approach #1: • Perform a focused crawl of the Web to find pages on the topic • Use word signatures as a method for determining the topic of a page. • Use information extraction techniques to get the data out of the pages. • Perform reference resolution and schema matching to create a cleaner set of data.
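
A minimal sketch of the word-signature topic test used during the focused crawl. The signature words and the hit threshold are invented for illustration; a real system would learn them from labeled on-topic pages:

```python
# Hypothetical word signature for a coffee portal's focused crawl.
COFFEE_SIGNATURE = {"coffee", "arabica", "robusta", "roast", "espresso",
                    "bean", "brew", "grower"}

def on_topic(page_text: str, signature=COFFEE_SIGNATURE, min_hits=3) -> bool:
    """Keep a crawled page only if enough signature words occur in it."""
    words = set(page_text.lower().split())
    return len(words & signature) >= min_hits
```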

  33. Creating a Topical Portal • Approach #2: • Start with a set of well known sites in the domain • Create an initial schema for the domain (the properties you’re interested in modeling) • Create extractors for pages on the known sites • Note: extractors will be more accurate because they were created for the sites themselves • Result: a good basis of entities and relationships to build on. • Extend the initial data set: • Follow references from the initial set of chosen pages • Use collaboration (of people in the community) to find additional data and to correct extractions.

  34. Outline • Introduction, opportunities and challenges with Web data • The Deep Web • Vertical search • Surfacing the Deep Web • Creating topical portals • Lightweight data management on the Web • Discovery of data sets • Extracting data from Web pages • Combining multiple data sets • Re-using others’ work

  35. Lightweight Combination of Web Data • With such a vast collection of data, we would like to enable easy data integration. • Imagine a school student combining her data about bird species with a country population table found on the Web • A journalist creating a news story with data about riots in the UK and needing to combine it with demographic data • … • Many data integration tasks are transient: the result will be used for a short period of time only • Hence, creating the integrated data must be easy. Creating a mediated schema and mappings is too tedious.

  36. Challenges to Data Integration on the Web • Discovering data on the Web (search engines are optimized for documents, not tables or lists) • Extracting the data from the Web pages into a form that can be processed • Combining multiple data sets • Unique opportunity on the Web: re-use the work of others!

  37. Not a great result!

  38. But the data does exist out there!

  39. Discovering Data on the Web • Search engines are optimized for documents • E.g., proximity of terms matters in ranking. In tables, the schema applies to all rows: “zambia” may be far from “population” in a document containing population data, but should be considered close. • No special attention is given to schema rows (if they can be detected) or to columns closer to the left of the table (which are often the “subject” of the table). • Tables with high-quality data look just like tables used for formatting. • Over 99% of the HTML tables on the Web are not high-quality data tables!
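
A minimal sketch of a heuristic filter separating data tables from layout tables. The specific rules and thresholds are illustrative guesses, not the trained classifier from the literature:

```python
def looks_like_data_table(rows: list[list[str]]) -> bool:
    """Crude filter: `rows` is the table as a list of rows of cell text."""
    if len(rows) < 3:
        return False                      # too small to be a data table
    widths = {len(r) for r in rows}
    if len(widths) != 1 or widths.pop() < 2:
        return False                      # ragged or single-column: layout
    ncols = len(rows[0])
    # Data-table cells tend to be short values, not paragraphs of text.
    avg_cell_len = sum(len(c) for r in rows for c in r) / (len(rows) * ncols)
    return avg_cell_len < 40
```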

  40. Challenges to Discovering the Semantics of Structured Data on the Web

  41. Semantics Embedded in Surrounding Text • The topic of the table is given in the surrounding text, and the token “2006” is crucial to understanding the data.

  42. No schema, but the table is perfectly understandable to people.

  43. Structured Data can be Plain Complicated!

  44. HTML Tables used for Formatting

  45. “Vertical” Tables: one tuple of a bigger table

  46. Can’t Use Domain Knowledge: Data is about Everything • Examples: tree search, Amish quilts, parking tickets in India, horses

  47. Search by Tweaking Traditional Document Search • Consider new cues in ranking: • Hits on the left column • Hits on the schema row (where there is one) • Number of rows and columns • Hits on the table body • Size of the table relative to the page • But we can do better: try to recover the underlying semantics of the data.
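
A sketch of how those cues might be combined into a table-ranking score. The weights and the `Table` fields below are invented for illustration, not a published ranking function:

```python
from dataclasses import dataclass

@dataclass
class Table:                 # hypothetical extracted-table record
    header: set[str]         # tokens in the schema row, if detected
    left_column: set[str]    # tokens in the leftmost ("subject") column
    body: set[str]           # tokens in the rest of the table
    num_rows: int
    page_fraction: float     # table size relative to the page, in [0, 1]

def table_score(query_terms: set[str], t: Table) -> float:
    """Weighted combination of the ranking cues listed above."""
    def hits(tokens: set[str]) -> int:
        return len(query_terms & tokens)
    return (3.0 * hits(t.header)          # schema hits matter most
            + 2.0 * hits(t.left_column)   # then the subject column
            + 1.0 * hits(t.body)
            + 0.1 * min(t.num_rows, 50)   # bigger tables, up to a point
            + 1.0 * t.page_fraction)
```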

  48. Recovering Table Semantics: cells of Web tables are mentioned in Web text • If we see patterns that pair a cell value with a class label (e.g., a phrase like “North American species such as the green ash”) enough times, we can infer that Green Ash is a North American species.

  49. Recovering Table Semantics: cells of Web tables are mentioned in Web text • If we infer that a large fraction of the values in the left column are North American tree species, we can infer that the table is about those tree species, even though this is never stated on the page!
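
A minimal sketch of this labeling idea using a single "C such as I" pattern. Real systems mine many patterns over a large corpus; the regex, the corpus format, and the simple voting rule here are all simplifications:

```python
import re
from collections import Counter

# Hearst-style pattern: "<class> such as <instance>", as in a sentence like
# "North American species such as the green ash".
SUCH_AS = re.compile(r"([a-z][a-z ]+?)\s+such as\s+(?:the\s+)?([a-z][a-z ]+)",
                     re.IGNORECASE)

def class_labels(text_corpus: list[str], instances: set[str]) -> Counter:
    """Count candidate class labels for a set of cell values (e.g., the
    lowercased left column of a table) from mentions in Web text."""
    votes = Counter()
    for sentence in text_corpus:
        for cls, inst in SUCH_AS.findall(sentence):
            if inst.strip().lower() in instances:
                votes[cls.strip().lower()] += 1
    return votes

# Usage sketch: if many left-column values vote for "north american species",
# attach that class label to the table, even if the page never says it.
```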

  50. Extracting Data from the Page • In the case of tables, it’s fairly easy • Main challenge: deciding whether there is a header row with attribute names • Lists are tricky: punctuation and formatting do not always provide the right cues for partitioning a list item into cells (one approach is sketched below). • Structured data in cards: in general, this is an information extraction problem.
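
One simple way to attack the list-splitting problem is a consistency vote across all items: prefer the delimiter under which the most items agree on a common cell count. A minimal sketch (the candidate delimiters are arbitrary, and real extractors also use HTML formatting cues):

```python
from collections import Counter

def split_list_items(items: list[str], delimiters=(",", ";", "|", " - ")):
    """Pick the delimiter that splits the most list items into the same
    number of cells, then split every item with it."""
    best, best_votes = None, 0
    for d in delimiters:
        counts = Counter(len(item.split(d)) for item in items)
        ncells, votes = counts.most_common(1)[0]
        if ncells > 1 and votes > best_votes:
            best, best_votes = d, votes
    return [item.split(best) for item in items] if best else [[i] for i in items]
```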
