
Searching the Hidden Web

Searching the Hidden Web. Donghui Xu Spring 2011, COMS E6125 Prof. Gail Kaiser. Outline. What is the hidden Web Two approaches in searching the hidden Web Browsing Yahoo! like Web directory Crawling the hidden Web conclusion. The Surface Web. The surface Web reachable via hyperlinks.


Presentation Transcript


  1. Searching the Hidden Web • Donghui Xu • Spring 2011, COMS E6125 • Prof. Gail Kaiser

  2. Outline • What is the hidden Web • Two approaches to searching the hidden Web • Browsing a Yahoo!-like Web directory • Crawling the hidden Web • Conclusion

  3. The Surface Web • The surface Web • reachable via hyperlinks

  4. What is the Hidden Web • The hidden Web • no static hyperlink points to the page • accessed via a query interface • pages are generated dynamically based on the query submitted

  5. What is the Hidden Web

  6. Size of the Hidden Web • About 500 times larger than the surface Web • The surface Web: about 1 billion pages • The hidden Web: over 550 billion pages • The sixty largest deep Web sites are about 40 times larger than the surface Web • Figure: the deep Web vs. the surface Web (from Bergman)

  7. Quality of the Hidden Web • Figure: some of the largest hidden Web sites (from Bergman)

  8. Two Approaches to Access the Hidden Web • Browsing a Yahoo!-like Web directory • Crawling the hidden Web

  9. Browsing a Yahoo!-like Web Directory • Manually populate a Yahoo!-like directory • Classify collections of text databases into categories and subcategories

  10. Browsing a Yahoo!-like Web Directory • Pros • Intuitive • Easy to use • Cons • Labor intensive: the Yahoo! Directory contains 200, 0000 categories, and there are millions of databases searchable online • Accurate classification is not an easy task

  11. Crawling the Hidden Web • Main challenge in searching the hidden Web • How to automatically generate meaningful queries as input to a query interface • The query generation problem • assume that a Web site contains a set of pages, s • each query q_i issued returns a subset s_i of s • the task is to select a set of queries that returns the maximum number of unique pages in the database at minimum cost
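
The query generation problem above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the formulation from the cited papers: the toy `PAGES` collection, the word-match `search` function, and the flat `cost_per_query` are all hypothetical stand-ins for a real site and its query interface.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy site: page id -> text. Stands in for the page set s.
PAGES: Dict[int, str] = {
    1: "deep web crawling",
    2: "surface web hyperlinks",
    3: "query interface for a database",
    4: "deep database query",
}


def search(query: str) -> Set[int]:
    """Return s_i: the ids of pages whose text contains the query word."""
    return {pid for pid, text in PAGES.items() if query in text.split()}


def evaluate(queries: Iterable[str], cost_per_query: float = 1.0) -> Tuple[float, float]:
    """Return (fraction of unique pages retrieved, total cost) for a query set."""
    retrieved: Set[int] = set()
    cost = 0.0
    for q in queries:
        retrieved |= search(q)  # union of the s_i, so duplicates count once
        cost += cost_per_query  # assumed flat cost per issued query
    return len(retrieved) / len(PAGES), cost


if __name__ == "__main__":
    # Same cost, different coverage: overlapping result sets waste queries.
    print(evaluate(["web", "query"]))
    print(evaluate(["deep", "web"]))
```

Both query sets cost the same, but the second one returns page 1 twice, so it covers fewer unique pages. A real crawler cannot see the result sets in advance, which is why the choice of queries has to be guided by a selection policy.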

  12. Query Selection Algorithms • Random: select the query at random from a list of keywords (e.g. a random word from an English dictionary) • Generic frequency: select queries from a list of the most frequent keywords in a generic document corpus • Adaptive: select promising keywords from the documents downloaded by previously issued queries
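
The three policies above can be sketched as interchangeable functions driving one crawl loop. This is a minimal sketch, not the algorithm evaluated by Ntoulas et al.: the slide does not say how "promising" is measured, so the adaptive policy here assumes raw term frequency in the downloaded documents as a stand-in. `DOCS`, `search`, `crawl`, and the policy names are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical toy text database hidden behind a keyword query interface.
DOCS: Dict[int, str] = {
    1: "gene protein therapy",
    2: "gene protein expression",
    3: "protein folding structure",
    4: "therapy heart disease",
    5: "protein disease",
}

Policy = Callable[[List[str], List[str], Set[int], random.Random], Optional[str]]


def search(word: str) -> Set[int]:
    """Return ids of documents containing the query word."""
    return {d for d, text in DOCS.items() if word in text.split()}


def random_policy(words, issued, downloaded, rng) -> Optional[str]:
    """Random: pick any not-yet-issued word from a dictionary list."""
    candidates = [w for w in words if w not in issued]
    return rng.choice(candidates) if candidates else None


def generic_frequency_policy(words, issued, downloaded, rng) -> Optional[str]:
    """Generic frequency: walk a list pre-sorted by frequency in a generic corpus."""
    return next((w for w in words if w not in issued), None)


def adaptive_policy(words, issued, downloaded, rng) -> Optional[str]:
    """Adaptive (simplified): most frequent unissued term in downloaded docs."""
    counts = Counter(
        w for d in downloaded for w in DOCS[d].split() if w not in issued
    )
    if counts:
        # Highest count first; ties broken alphabetically for determinism.
        return min(counts, key=lambda w: (-counts[w], w))
    # Nothing downloaded yet: fall back to the supplied seed words.
    return generic_frequency_policy(words, issued, downloaded, rng)


def crawl(policy: Policy, words: List[str], budget: int, seed: int = 0) -> Tuple[List[str], Set[int]]:
    """Issue up to `budget` queries chosen by `policy`; return (queries, doc ids)."""
    rng = random.Random(seed)
    issued: List[str] = []
    downloaded: Set[int] = set()
    for _ in range(budget):
        query = policy(words, issued, downloaded, rng)
        if query is None:
            break
        issued.append(query)
        downloaded |= search(query)
    return issued, downloaded


if __name__ == "__main__":
    print(crawl(adaptive_policy, ["gene"], budget=3))
    print(crawl(generic_frequency_policy, ["the", "protein", "zebra"], budget=2))
```

Keeping every policy behind the same signature lets the crawl loop stay unchanged while policies are swapped, which mirrors how the comparison on the following slides is set up.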

  13. Evaluation of Query Selection Algorithms • Figure: comparison of policies for dmoz (modified from Ntoulas et al.)

  14. Evaluation of Query Selection Algorithms • Figure: comparison of policies for PubMed (modified from Ntoulas et al.)

  15. Conclusion • The surface Web is the tip of the iceberg • Beneath it is an even vaster hidden Web • Two main approaches to access the hidden Web • Yahoo!-like Web directory • Crawling the hidden Web • Much work needs to be done • Hidden Web search technology would enable us to connect different data sources and allow businesses to use data in new ways

  16. References • [1] Michael K. Bergman. "The Deep Web: Surfacing Hidden Value." The Journal of Electronic Publishing, August 2001. • [2] Alex Wright. "Exploring a 'Deep Web' That Google Can't Grasp." New York Times, February 3, 2009. • [3] S. Raghavan and H. Garcia-Molina. "Crawling the Hidden Web." In Proceedings of the International Conference on Very Large Databases (VLDB), 2001. • [4] Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. "Modeling and Managing Content Changes in Text Databases." ACM Transactions on Database Systems, 32(3), June 2007. • [5] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. • [6] Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho. "Downloading Textual Hidden Web Content by Keyword Queries." In Proceedings of the Joint Conference on Digital Libraries (JCDL), June 2005. • [7] J. P. Callan and M. E. Connell. "Query-based sampling of text databases." Information Systems, 97–130, 2001.

  17. Thanks!
