Deep Web Integration: Querying Structured Data on the Deep Web



Presentation Transcript


  1. Deep Web Integration: Querying Structured Data on the Deep Web. Fangjiao Jiang

  2. Outline • Background • Access Deep Web • MetaQuerier • Metasearch engine vs. MetaQuerier • Related research groups • Conclusion • … • Some suggestions

  3. Background Part 1

  4. The previous Web: things are just on the surface

  5. The current Web: Getting “deeper” • A large amount of data is hidden behind query forms

  6. The problem of accessing data on the Deep Web • Deep = not accessible through traditional search engines

  7. Why is it important? • More than 10 million distinct forms

  8. Why is it important? • Up to 5,000 billion dynamic result pages

  9. Why is it important? Google’s recent survey [CIDR 2007] • If there are 1 billion web pages, there are 25 million potential Deep Web sources

  10. Challenge: How to enable effective access to the Deep Web? (Example: Cars.com)

  11. Access the Deep Web Part 2

  12. Three different approaches • Warehouse-like approach • MetaQuerier • Surfacing the Deep Web: 1) Pre-compute appropriate queries over the forms 2) Insert the resulting pages into a web-search index [Figure labels: Repository; Web Database, Web Database, Web Database, …; Integrated query interface; Query; Web databases]

  13. (1) Warehouse-like approach [Figure labels: Web Database (five shown), …; Journal Homepage; Author Homepage; Conf. Homepage; Chinese Journal Full-Text Database (中文期刊全文数据库); National Natural Science Foundation information database (国家自然基金信息库); ……; PDF, DOC, PS]

  14. (2) MetaQuerier • MetaQuerier is what we focus on. • Front-end: Query Execution (query Web databases): Source selection, Schema matching, Query translation, Result processing • Deep Web Repository: Query interfaces, Query capabilities, Subject domains, Unified interfaces • Back-end: Semantics Discovery (find Web databases on the Deep Web): Database crawler, Interface extraction, Source clustering, Interface integration

  15. (3) Surfacing the Deep Web [VLDB’08] • Viewpoint • Many domains and many languages • No human in the loop, no site-specific scripts • Main idea • predicting input values for text boxes • predicting input combinations • Google’s Deep-Web crawling system • Affects more than 1000 queries per second • Enables access to more than a million Deep-Web sites • Spans 50+ languages and 100+ domains
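
The surfacing idea (pre-compute queries over a form, then index the result pages) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Google's system: the form URL, the input names, and the candidate values are hypothetical, and the values are supplied by hand rather than predicted.

```python
from itertools import product
from urllib.parse import urlencode


def surfacing_urls(action_url, candidate_values, max_urls=100):
    """Enumerate GET URLs for combinations of candidate form-input values.

    candidate_values maps each input name to a list of values to try.
    max_urls caps the output, because the number of combinations grows
    multiplicatively with the number of inputs.
    """
    names = sorted(candidate_values)  # fixed order for reproducible URLs
    urls = []
    for combo in product(*(candidate_values[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action_url}?{query}")
        if len(urls) >= max_urls:
            break
    return urls


# Hypothetical used-car search form with two inputs (names are assumptions).
example_urls = surfacing_urls(
    "http://example.com/search",
    {"make": ["ford", "honda"], "zip": ["13902", "61801"]},
)
```

Each generated URL would then be fetched and its result page added to the search index; the cap on the number of URLs matters because a form with many inputs has far too many combinations to enumerate blindly.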

  16. MetaQuerier Part 3

  17. A Survey on Deep Web [SIGMOD 2006] • How many deep-Web sources are out there? • 307,000 sites, 450,000 DBs, 1,258,000 interfaces. • How structured is the Deep Web? • 348,000 (structured) : 102,000 (text) ≈ 3 : 1 • How do search engines cover them? • They covered 10% of the sources. • What’s the subject distribution of Web databases? • Across all areas • How complex are they? • The “Amazon effect”

  18. Reported the “Amazon effect”… Condition patterns converge even across domains! Attributes converge in a domain!

  19. Technical Challenges • How to discover the query interface? • Which form is the query interface of a Web database? • How to understand a query interface? • Where is the first condition? What’s its attribute? • How to match query interfaces? • What does “author” on this source match on that source? • How to translate queries? • How to ask this query on that source?
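
To make the matching and translation questions concrete, here is a minimal sketch. The source names, field names, and the mapping table are invented for illustration; in a real system the matches would have to be discovered from the interfaces rather than written by hand.

```python
# Hand-written mapping from unified attribute names to each source's own
# field names. All source and field names here are hypothetical.
SOURCE_SCHEMAS = {
    "bookstore_a": {"author": "writer", "title": "book_title"},
    "bookstore_b": {"author": "au", "title": "ti", "isbn": "isbn"},
}


def translate_query(unified_query, source):
    """Rewrite a query on the unified interface into one source's fields.

    Returns (translated, unsupported): the conditions the source can accept
    and the unified attributes it has no matching field for.
    """
    schema = SOURCE_SCHEMAS[source]
    translated, unsupported = {}, []
    for attribute, value in unified_query.items():
        if attribute in schema:
            translated[schema[attribute]] = value
        else:
            unsupported.append(attribute)
    return translated, unsupported


# Placeholder values, not real bibliographic data.
query = {"author": "Some Author", "isbn": "0123456789"}
on_a = translate_query(query, "bookstore_a")
on_b = translate_query(query, "bookstore_b")
```

Returning the unsupported attributes separately reflects that sources differ in query capability: a condition one source cannot express has to be dropped or applied later as a filter on the results.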

  20. Technical Challenges • How to extract the query results? • Based on visual information? • How to identify the same entity? • Especially large-scale entity identification. • How to annotate the query results? • How to specify the semantics of the data?

  21. Metasearch Engine vs. MetaQuerier Part 4
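
One simple way to picture entity identification is to compare normalized titles from different sources. The sketch below is an assumption-laden toy: the records are made up, token-set Jaccard similarity is just one possible measure, and the 0.6 threshold is arbitrary.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_entity(record_a, record_b, threshold=0.6):
    """Guess whether two result records describe the same entity by
    comparing the tokens of their titles. The threshold is an assumption."""
    similarity = jaccard(tokens(record_a["title"]), tokens(record_b["title"]))
    return similarity >= threshold


# Made-up result records from two hypothetical sources.
r1 = {"title": "Database Systems: The Complete Book", "source": "store_a"}
r2 = {"title": "database systems - the complete book", "source": "store_b"}
r3 = {"title": "Operating System Concepts", "source": "store_b"}
```

Comparing every pair of records this way costs quadratic time, which is why the large-scale case is called out as a challenge of its own.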

  22. Preliminary • Online data: Surface Web and Deep Web • Surface Web: Metasearch Engine (example: mamma.com) over Search Engine 1, Search Engine 2, ……, Search Engine n • Deep Web: MetaQuerier (example: Addall.com) over Web database 1, Web database 2, ……, Web database n

  23. Search Engine vs. Web Database • Search engine: document search engine • Web database: database search engine • Key technology: • Crawling the Web: re-crawl (changed, added) • Indexing Web pages: index terms, stop words, stemming, inverted file structure: term → (p, w)
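
The indexing steps listed on this slide can be illustrated with a small sketch. It reads (p, w) as (page, weight), which is an assumption, and uses raw term frequency as the weight; the stop-word list and the suffix-stripping "stemmer" are deliberately crude stand-ins.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is"}


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_inverted_index(pages):
    """Map each index term to a list of (page_id, weight) postings.

    The weight is the raw term frequency of the term in the page.
    """
    index = defaultdict(list)
    for page_id in sorted(pages):
        words = re.findall(r"[a-z]+", pages[page_id].lower())
        terms = [stem(w) for w in words if w not in STOP_WORDS]
        for term, count in sorted(Counter(terms).items()):
            index[term].append((page_id, count))
    return dict(index)


# Two made-up pages.
pages = {
    "p1": "Querying the deep web",
    "p2": "Deep web databases and web forms",
}
index = build_inverted_index(pages)
```

Looking up a query term in `index` returns only the pages that contain it, which is what lets a document search engine avoid scanning every page at query time.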

  24. Search Engine vs. Web Database • Search engine: document search engine • Web database: database search engine • Key technology: • Ranking pages: similarity (query, page), linkage information (PageRank) • Result organization: matching score (descending), clustering/categorizing • Large “apple” • Effective and efficient retrieval: recall-precision curve
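
A small sketch of ranking by matching score and of the points on a recall-precision curve follows. The scoring function (fraction of query terms present in the page) is a simplification chosen for illustration; it does not use linkage information such as PageRank, and the pages and relevance judgments are made up. It assumes a non-empty query and a non-empty relevant set.

```python
def rank_pages(query_terms, page_terms):
    """Order page ids by a simple matching score, highest first.

    The score is the fraction of query terms that appear in the page; ties
    are broken by page id so the order is deterministic.
    """
    query = set(query_terms)
    scores = {
        page_id: len(query & set(terms)) / len(query)
        for page_id, terms in page_terms.items()
    }
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall of the top-k results against a relevant set."""
    hits = sum(1 for page_id in ranking[:k] if page_id in relevant)
    return hits / k, hits / len(relevant)


# Made-up pages, already reduced to index terms.
page_terms = {
    "p1": ["apple", "pie", "recipe"],
    "p2": ["apple", "computer", "store"],
    "p3": ["banana", "bread"],
}
ranking = rank_pages(["apple", "computer"], page_terms)
curve = [precision_recall_at_k(ranking, {"p2"}, k) for k in (1, 2, 3)]
```

Plotting the (precision, recall) pairs in `curve` for increasing k gives the recall-precision curve used to judge retrieval effectiveness.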

  25. Metasearch Engine vs. MetaQuerier • Online data: Surface Web and Deep Web • Surface Web: Metasearch Engine (example: mamma.com) over Search Engine 1, Search Engine 2, ……, Search Engine n • Deep Web: MetaQuerier (example: Addall.com) over Web database 1, Web database 2, ……, Web database n

  26. Metasearch Engine vs. MetaQuerier • Metasearch engine: search engine selection, search result extraction, result merging • MetaQuerier: query interface integration, database selection, query translation, result extraction, entity identification and annotation
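
As an illustration of the result-merging step on the metasearch side, the sketch below combines ranked lists from several engines. Min-max normalization of each engine's scores is one possible merging strategy chosen here for simplicity, not something the slides prescribe; the URLs and scores are made up.

```python
def merge_results(result_lists):
    """Merge ranked results from several engines into one list.

    Each input list holds (url, score) pairs with engine-specific scores.
    Scores are min-max normalized per engine so they are comparable, and a
    URL returned by several engines keeps its best normalized score.
    """
    best = {}
    for results in result_lists:
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        for url, score in results:
            normalized = 1.0 if high == low else (score - low) / (high - low)
            best[url] = max(best.get(url, 0.0), normalized)
    # Highest score first; ties broken alphabetically by URL.
    return sorted(best, key=lambda url: (-best[url], url))


# Made-up results from two hypothetical engines with different score scales.
engine_1 = [("a.example", 10.0), ("b.example", 5.0), ("c.example", 0.0)]
engine_2 = [("b.example", 0.9), ("d.example", 0.3)]
merged = merge_results([engine_1, engine_2])
```

A MetaQuerier faces a harder version of this step, because structured records from different Web databases must first be matched as the same entity before they can be merged.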

  27. Main research groups Part 5

  28. Main research groups • Weiyi Meng, Professor, Binghamton University; with Yiyao Lu, Eduard Dragut, Hai He: interface extraction, interface integration, query translation, result annotation • Kevin Chen-Chuan Chang, Associate Professor, University of Illinois at Urbana-Champaign; with Bin He, Zhen Zhang: interface extraction, interface integration, query translation

  29. Main research groups • Others … • Jayant Madhavan, Google, Inc.: Google Base • Zaiqing Nie, Microsoft, Inc.: vertical search • Luis Gravano, Columbia University: top-k query • Panagiotis G. Ipeirotis, New York University: classification

  30. Conclusion: Our work toward large-scale integration • Completed several key subtasks: • Deep Web data extraction [TKDE 2009, WebDB 2006, WISE 2005, WAIM 2005] • Query translation [DASFAA 2009, DASFAA 2007, SKG 2008] • Deep Web survey [VLDB Workshop 2006, Chinese Journal of Computers 2007] • Schema matching [Chinese Journal of Computers 2008] • Database selection [Journal of Software 2008] • Moving forward to exciting system issues: • System integration for building an integration system • Web data integration in mobile environments

  31. Some suggestions Part 6

  32. Four years ago… • How to find a paper? Is Google enough? • What theories should we be familiar with first?

  33. Find the papers … • Google • Google Scholar • DBLP Bibliography • C-DBLP • Libra Academic Search • ACM Digital Library • CiteSeer • Authors’ homepages • Email the author

  34. Find the papers … • Journals: TOIS, TODS, VLDB J., TKDE • Conferences/Workshops: SIGMOD/WebDB, VLDB, ICDE, EDBT, WWW, SIGIR, CIKM/WIDM, WISE, DASFAA

  35. Read the books … • Information Retrieval • Data Mining • Machine Learning • Statistics • Probability theory …

  36. Three years ago… • How to find a problem? • Which problem is significant?

  37. Two years ago… • How to write a paper?

  38. Helpful points… • Right subject • Well-defined problem • Clear contribution • Good structure and logical flow • Proper use of words • Pay attention to format, equations, references… • Ask others to read your paper • Record your own mistakes • Do not leave out important related work

  39. Take some time to learn… • LaTeX • MATLAB or Gnuplot (for charts, if necessary)

  40. Thanks for Your Attention (Q&A)
