Deep Web Integration: Querying Structured Data on the Deep Web

Deep Web Integration: Querying Structured Data on the Deep Web (Fangjiao Jiang)

Outline • Background • Access Deep Web • MetaQuerier • Metasearch engine vs. MetaQuerier • Related research groups • Conclusion • … • Some suggestions

Background Part 1

The previous Web: things are just on the surface

The current Web: Getting “deeper” • A great deal of data is hidden behind query forms

The problem of accessing data on the Deep Web • Deep = not accessible through traditional search engines

Why is it important? • More than 10 million distinct forms

Why is it important? • Up to 5,000 billion dynamic result pages

Why is it important? Google’s recent survey [CIDR 2007] • If there are 1 billion web pages, there are 25 million potential Deep Web sources

Challenge: How to enable effective access to the Deep Web? (Example: Cars.com)

Access the Deep Web Part 2

Three different approaches to querying Web databases • Warehouse-like approach • MetaQuerier • Surfacing the Deep Web: 1) Pre-compute appropriate queries over the forms 2) Insert the resulting pages into a web-search index [Figure labels: Repository; Web databases; Integrated query interface; Query Web databases]

(1) Warehouse-like approach [Figure labels: Web databases; PDF, DOC, PS; journal homepage, author homepage, conference homepage; Chinese Journal Full-text Database (中文期刊全文数据库); National Natural Science Foundation information database (国家自然基金信息库)]

(2) MetaQuerier • MetaQuerier is what we focus on. • Front-end: Query Execution (query Web databases) • Schema matching, result processing, query translation, source selection • Deep Web Repository • Query interfaces, query capabilities, subject domains, unified interfaces • Back-end: Semantics Discovery (find Web databases) • Database crawler, interface extraction, source clustering, interface integration

(3) Surfacing the Deep Web [VLDB’08] • Viewpoint: many domains and many languages; no human in the loop, no site-specific scripts • Main idea: predicting input values for text boxes; predicting input combinations • Google’s Deep-Web crawling system: affects more than 1000 queries per second; enables access to more than a million Deep-Web sites; spans 50+ languages and 100+ domains
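
A minimal sketch of the "input combinations" idea, not the method of the VLDB’08 system: assuming a GET form and a set of already-predicted candidate values per input, it enumerates the URLs that a crawler could fetch and hand to a web-search index. The function name `surfacing_urls`, the `max_inputs` limit, and the example form are hypothetical.

```python
from itertools import combinations, product
from urllib.parse import urlencode


def surfacing_urls(action_url, candidate_values, max_inputs=2):
    """Enumerate GET URLs for every combination of at most `max_inputs` inputs.

    candidate_values maps an input name to its list of candidate values
    (assumed to be predicted beforehand, e.g. select options or keywords).
    """
    urls = []
    names = sorted(candidate_values)
    for size in range(1, max_inputs + 1):
        # Pick which inputs to fill in, then try every value assignment.
        for chosen in combinations(names, size):
            value_lists = [candidate_values[name] for name in chosen]
            for values in product(*value_lists):
                query = urlencode(list(zip(chosen, values)))
                urls.append(f"{action_url}?{query}")
    return urls


# Hypothetical form with one select box and one text box.
example_form = {"make": ["honda", "ford"], "zip": ["60601"]}
example_urls = surfacing_urls("http://example.com/search", example_form)
```

Limiting `max_inputs` keeps the number of generated URLs from growing with the full cross product of all inputs, which is the practical concern behind choosing input combinations carefully.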

MetaQuerier Part 3

A Survey on Deep Web [SIGMOD 2006] • How many deep-Web sources are out there? • 307,000 sites, 450,000 DBs, 1,258,000 interfaces. • How structured is the Deep Web? • 348,000 (structured) : 102,000 (text) ≈ 3 : 1 • How do search engines cover them? • They covered 10% of the sources. • What’s the subject distribution of Web databases? • Across all areas • How complex are they? • “Amazon effect”

Reported the “Amazon effect”… Condition patterns converge even across domains! Attributes converge in a domain!

Technical Challenges • How to discover the query interface? • Which form is the query interface of a Web database? • How to understand a query interface? • Where is the first condition? What’s its attribute? • How to match query interfaces? • What does “author” on this source match on that one? • How to translate queries? • How to ask this query on that source?

Technical Challenges • How to extract the query results? • Based on visual information? • How to identify the same entity? • Especially large-scale entity identification. • How to annotate the query results? • How to specify the semantics of the data?
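
The interface-matching question can be illustrated with a small sketch. It is an assumption-laden toy, not the matching technique of any system named in these slides: labels are compared by token overlap (Jaccard similarity) after applying a hypothetical, hand-written synonym table `SYNONYMS`.

```python
import re

# Hypothetical synonym table; a real system would have to learn or curate this.
SYNONYMS = {"writer": "author", "cost": "price"}


def label_tokens(label):
    """Lower-case a form label, split it into tokens, and canonicalize synonyms."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    return frozenset(SYNONYMS.get(token, token) for token in tokens)


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_attributes(labels_a, labels_b, threshold=0.5):
    """Map each label of source A to its most similar label of source B."""
    matches = {}
    for label_a in labels_a:
        scored = [(jaccard(label_tokens(label_a), label_tokens(label_b)), label_b)
                  for label_b in labels_b]
        if not scored:
            continue
        best_score, best_label = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            matches[label_a] = best_label
    return matches


# Two hypothetical book-search interfaces.
example_matches = match_attributes(["Author", "Book Title", "ISBN"],
                                   ["Writer", "Title", "Price range"])
```

Here "Author" matches "Writer" through the synonym table, "Book Title" matches "Title" through shared tokens, and "ISBN" stays unmatched because nothing on the second interface overlaps with it.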

Metasearch Engine vs. MetaQuerier Part 4

Preliminary • Online data: Surface Web and Deep Web • Surface Web: Metasearch Engine (example: mamma.com) over Search Engine 1, Search Engine 2, …, Search Engine n • Deep Web: MetaQuerier (example: Addall.com) over Web database 1, Web database 2, …, Web database n

Search Engine vs. Web Database • Search engine: document search engine • Web database: database search engine • Key technology of search engines • Crawling the Web: re-crawl for changed and added pages • Indexing Web pages: index terms, stop words, stemming, inverted file structure: term → (p, w)
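
A minimal sketch of the indexing steps listed on this slide. Assumptions: `w` is taken to be the raw term frequency (the slide does not define it), the stop-word list is a tiny hypothetical one, and `crude_stem` is a toy suffix stripper rather than a real stemming algorithm.

```python
import re
from collections import Counter, defaultdict

# Tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "is", "to"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


def build_inverted_file(pages):
    """Map each term to a list of (p, w): page id and term frequency."""
    inverted = defaultdict(list)
    for page_id in sorted(pages):
        counts = Counter(index_terms(pages[page_id]))
        for term, weight in sorted(counts.items()):
            inverted[term].append((page_id, weight))
    return dict(inverted)


example_pages = {
    1: "Querying the deep web",
    2: "Deep web forms hide deep databases",
}
example_index = build_inverted_file(example_pages)
```

Looking up a query term then amounts to reading its posting list, for example `example_index["deep"]`, instead of scanning every page.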

Search Engine vs. Web Database • Search engine: document search engine • Web database: database search engine • Key technology of search engines • Ranking pages: similarity(query, page); linkage information (PageRank) • Result organization: matching score (descending); clustering/categorizing large result sets (e.g., “apple”) • Effective and efficient retrieval: recall-precision curve

Metasearch Engine vs. MetaQuerier • Online data: Surface Web and Deep Web • Surface Web: Metasearch Engine (example: mamma.com) over Search Engine 1, Search Engine 2, …, Search Engine n • Deep Web: MetaQuerier (example: Addall.com) over Web database 1, Web database 2, …, Web database n

Metasearch Engine vs. MetaQuerier • Metasearch engine: search engine selection, search result extraction, result merging • MetaQuerier: query interface integration, database selection, query translation, result extraction, entity identification and annotation
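
As an illustration of ranking by linkage information, here is a small power-iteration sketch of PageRank over a hypothetical three-page link graph. The damping factor 0.85 and the fixed iteration count are conventional assumptions, not values taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Compute PageRank by power iteration.

    links maps a page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: a -> b, a -> c, b -> c, c -> a.
example_ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this graph page `c` is linked from both other pages and ends up with the highest rank, while `b`, linked only from `a`, ends up lowest.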
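
The result-merging step of a metasearch engine can be sketched as follows. This uses one simple, commonly described strategy (min-max score normalization, then summing the normalized scores of results returned by several engines); it is an assumption for illustration, not the merging method of any engine named in the slides. The URLs and scores are made up.

```python
def min_max_normalize(results):
    """Rescale one engine's (url, score) list to the range [0, 1]."""
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    if high == low:
        return {url: 1.0 for url, _ in results}
    return {url: (score - low) / (high - low) for url, score in results}


def merge_results(result_lists):
    """Merge several engines' results by summing normalized scores per URL."""
    combined = {}
    for results in result_lists:
        if not results:
            continue
        for url, score in min_max_normalize(results).items():
            combined[url] = combined.get(url, 0.0) + score
    # Highest combined score first; ties broken by URL for determinism.
    return sorted(combined, key=lambda url: (-combined[url], url))


# Hypothetical outputs of two search engines with different score scales.
engine_1 = [("u1", 10.0), ("u2", 6.0), ("u3", 2.0)]
engine_2 = [("u2", 0.9), ("u4", 0.5), ("u1", 0.1)]
example_merged = merge_results([engine_1, engine_2])
```

Normalizing first matters because the engines report scores on different scales; without it, the engine with the larger raw scores would dominate the merged list.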

Main research groups Part 5

Main research groups • Weiyi Meng, Professor, Binghamton University (with Yiyao Lu, Eduard Dragut, Hai He): interface extraction, interface integration, query translation, result annotation • Kevin Chen-Chuan Chang, Associate Professor, University of Illinois at Urbana-Champaign (with Bin He, Zhen Zhang): interface extraction, interface integration, query translation

Main research groups • Others … • Jayant Madhavan, Google, Inc.: Google Base • Zaiqing Nie, Microsoft, Inc.: vertical search • Luis Gravano, Columbia University: top-k query • Panagiotis G. Ipeirotis, New York University: classification

Conclusion: Our work toward large-scale integration • Completed several key subtasks: • Deep Web data extraction [TKDE 2009, WEBDB 2006, WISE 2005, WAIM 2005] • Query translation [DASFAA 2009, DASFAA 2007, SKG 2008] • Deep Web survey [VLDB Workshop 2006, Chinese Journal of Computers 2007] • Schema matching [Chinese Journal of Computers 2008] • Database selection [Journal of Software 2008] • Moving forward to exciting system issues: • System integration for building an integration system • Web data integration in mobile environments

Some suggestions Part 6

Four years ago… • How to find a paper? Is Google enough? • What theories should we be familiar with first?

Find the papers … • Google • Google Scholar • DBLP Bibliography • C-DBLP • Libra Academic Search • ACM Digital Library • CiteSeer • Authors’ homepages • Send an email to the author

Find the papers … • Journals: TOIS, TODS, VLDB J., TKDE • Conferences/workshops: SIGMOD/WebDB, VLDB, ICDE, EDBT, WWW, SIGIR, CIKM/WIDM, WISE, DASFAA

Read the books … • Information Retrieval • Data Mining • Machine Learning • Statistics • Probability Theory …

Three years ago… • How to find a problem? • Which problem is significant?

Two years ago… • How to write a paper?

Helpful points… • Right subject • Well-defined problem • Clear contribution • Good structure and logical flow • Proper use of words • Pay attention to format, equations, references… • Ask others to read your paper • Record your own mistakes • Do not leave out important related work

Take some time to learn… • LaTeX • MATLAB or Gnuplot (for charts, if necessary)

Thanks for Your Attention (Q&A)
