Downloading Textual Hidden-Web Content Through Keyword Queries

Presentation Transcript


  1. Downloading Textual Hidden-Web Content Through Keyword Queries
     Alexandros Ntoulas, Petros Zerfos, Junghoo Cho
     University of California Los Angeles, Computer Science Department
     {ntoulas, pzerfos, cho}@cs.ucla.edu
     JCDL, June 8th 2005

  2. Motivation • I would like to buy a used ’98 Ford Taurus • Technical specs? • Reviews? • Classifieds? • Vehicle history? Google?

  3. Why can’t we use a search engine? • Search engines today employ crawlers that find pages by following links • Many useful pages are available only after issuing queries (e.g. classifieds, USPTO, PubMed, LoC, …) • Search engines cannot reach such pages: there are no links to them (the Hidden Web) • In this talk: how can we download Hidden-Web content?

  4. Outline • Interacting with Hidden-Web sites • Algorithms for selecting queries for the Hidden-Web sites • Experimental evaluation of our algorithms

  5. Interacting with Hidden-Web pages (1) • The user issues a query (e.g. liver) through a query interface

  6. Interacting with Hidden-Web pages (2) • The user issues a query through a query interface • A result list page is presented to the user

  7. Interacting with Hidden-Web pages (3) • The user issues a query through a query interface • A result list is presented to the user • The user selects and views the “interesting” results

  8. Querying a Hidden-Web site
     Procedure
     while ( there are available resources ) do
       (1) select a query to send to the site
       (2) send query and acquire result list
       (3) download the pages
     done
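To make the loop concrete, here is a minimal Python sketch of the procedure above. The callables select_query, issue_query, and download are hypothetical placeholders for the site-specific plumbing; only the loop structure comes from the slide.

```python
def crawl_hidden_web_site(select_query, issue_query, download, budget):
    """Crawl a Hidden-Web site by repeatedly issuing keyword queries.

    select_query(downloaded, issued) -> next keyword to try     (step 1)
    issue_query(q)                   -> list of result page URLs (step 2)
    download(url)                    -> page content             (step 3)
    budget: how many queries we can afford to issue.
    """
    downloaded = {}   # url -> page content, grows as queries are issued
    issued = []       # queries sent so far
    while budget > 0:                          # "available resources"
        q = select_query(downloaded, issued)   # (1) select a query
        results = issue_query(q)               # (2) acquire result list
        issued.append(q)
        for url in results:                    # (3) download the pages
            if url not in downloaded:          # skip pages seen before
                downloaded[url] = download(url)
        budget -= 1
    return downloaded
```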

  9. How should we select the queries? (1) • S: set of pages in Web site (pages as points) • qi: set of pages returned if we issue query qi (queries as circles)

  10. How should we select the queries? (2) • Find the queries (circles) that cover the maximum number of pages (points) • Equivalent to the set-covering problem in graph theory (a greedy sketch follows below)
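If the result sets were known in advance, the standard greedy approximation for set cover would apply directly. A sketch under that assumption (unrealistic, as the next slide explains); query_results is a hypothetical precomputed map from each candidate query to the set of pages it returns.

```python
def greedy_query_cover(query_results, max_queries):
    """Greedy set cover: repeatedly pick the query whose result set
    adds the most not-yet-covered pages.

    query_results: dict mapping query -> set of page ids it returns.
    """
    covered, chosen = set(), []
    if not query_results:
        return chosen, covered
    for _ in range(max_queries):
        # query with the largest number of still-uncovered pages
        best = max(query_results,
                   key=lambda q: len(query_results[q] - covered))
        gain = query_results[best] - covered
        if not gain:          # no remaining query adds a new page
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered
```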

  11. Challenges during query selection • In practice we don’t know which pages will be returned by which queries (the qi are unknown) • Even if we did know the qi, the set-covering problem is NP-hard • We will present approximation algorithms for the query selection problem • We will assume single-keyword queries

  12. Outline • Interacting with Hidden-Web sites • Algorithms for selecting queries for the Hidden-Web sites • Experimental evaluation of our algorithms

  13. Some background (1) • Assumption: When we issue query qi to a Web site, all pages containing qi are returned • P(qi): fraction of pages from the site we get back after issuing qi • Example: q = liver • No. of docs in DB: 10,000 • No. of docs containing liver: 3,000 • P(liver) = 3,000 / 10,000 = 0.3

  14. Some background (2) • P(q1/\q2): fraction of pages containing both q1 and q2 (intersection of q1 and q2) • P(q1\/q2): fraction of pages containing either q1 or q2 (union of q1 and q2) • Cost and benefit: • How much benefit do we get out of a query? • How costly is it to issue a query?

  15. Cost function • The cost to issue a query and download the Hidden-Web pages: Cost(qi) = cq + cr·P(qi) + cd·P(qi) • (1) cq: cost for issuing a query • (2) cr·P(qi): cost for retrieving a result item times the no. of results • (3) cd·P(qi): cost for downloading a document times the no. of docs
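In Python the cost model is one line; the parameter values below are only an example (they match the PubMed experiment on slide 36), not part of the model itself.

```python
def query_cost(p_q, c_q=100, c_r=100, c_d=10_000):
    """Cost(q) = c_q + c_r * P(q) + c_d * P(q), where P(q) is the
    fraction of the site's pages that match q (example weights)."""
    return c_q + c_r * p_q + c_d * p_q

# Example: for the liver query of slide 13, P(liver) = 0.3, so
# Cost = 100 + 100 * 0.3 + 10_000 * 0.3 = 3130 cost units.
```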

  16. Problem formalization • Find the set of queries q1,…,qn which maximizes P(q1\/…\/qn) • Under the constraint: Cost(q1) + … + Cost(qn) ≤ t, where t is the available download resource budget

  17. Query selection algorithms • Random: Select a query randomly from a precompiled list (e.g. a dictionary) • Frequency-based: Select a query from a precompiled list based on frequency (e.g. a corpus previously downloaded from the Web) • Adaptive: Analyze previously downloaded pages to determine “promising” future queries

  18. Adaptive query selection • Assume we have issued q1,…,qi-1 • To find a promising query qi we need to estimate P(q1\/…\/qi-1\/qi) • P((q1\/…\/qi-1) \/ qi) = P(q1\/…\/qi-1) + P(qi) - P(q1\/…\/qi-1)·P(qi|q1\/…\/qi-1) • P(q1\/…\/qi-1) is known (by counting) since we have issued q1,…,qi-1 • P(qi|q1\/…\/qi-1) can be measured by counting occurrences of qi within the pages downloaded for q1,…,qi-1 (a sketch follows below) • What about P(qi)?
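The conditional term is the easy one: it is simply the fraction of the already-downloaded pages that contain the candidate keyword. A minimal sketch, assuming whitespace tokenization is good enough for matching.

```python
def p_cond(term, downloaded_pages):
    """Estimate P(qi | q1 \/ ... \/ qi-1): the fraction of pages we
    have already downloaded that contain `term`."""
    pages = list(downloaded_pages)
    if not pages:
        return 0.0
    term = term.lower()
    # crude whitespace tokenization; a real crawler would use the
    # same tokenizer as the site's search index
    hits = sum(1 for text in pages if term in text.lower().split())
    return hits / len(pages)
```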

  19. Estimating P(qi) • Independence estimator: P(qi) ≈ P(qi|q1\/…\/qi-1) • Zipf estimator [IG02]: rank queries based on frequency of occurrence and fit a power-law distribution; use the fitted distribution to estimate P(qi)
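A sketch of the power-law fit behind the Zipf estimator: a plain least-squares fit in log-log space, cruder than the full method of [IG02].

```python
import math

def fit_power_law(sorted_counts):
    """Fit freq(rank) ~ c * rank**(-gamma) to term counts sorted in
    descending order; returns (c, gamma). Needs at least two
    positive counts, or the fit is undefined."""
    xs = [math.log(rank) for rank in range(1, len(sorted_counts) + 1)]
    ys = [math.log(count) for count in sorted_counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - slope * mean_x), -slope

# A term's P(qi) is then read off the fitted curve at the rank where
# the term is expected to fall, normalized by the collection size.
```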

  20. Query selection algorithm
      foreach qi in [potential queries] do
        Pnew(qi) = P(q1\/…\/qi-1\/qi) - P(q1\/…\/qi-1)   (estimated as on slide 18)
      done
      return qi with maximum Efficiency(qi) = Pnew(qi) / Cost(qi)
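Putting slides 18 to 20 together, one adaptive step might look like the sketch below. estimate_p stands in for either estimator of slide 19, cost for the model of slide 15, and p_cond is the counting helper sketched after slide 18; all three are assumptions of this sketch rather than the paper's exact code.

```python
def select_next_query(candidates, downloaded_pages, p_union,
                      estimate_p, cost):
    """Return the candidate keyword with the best Pnew(q) / Cost(q).

    p_union: current P(q1 \/ ... \/ qi-1), known by counting.
    estimate_p(q): estimate of P(q) (independence or Zipf estimator).
    cost(q): estimated Cost(q) from the cost function of slide 15.
    """
    def efficiency(q):
        # Pnew(q) = P(q1 \/ ... \/ qi-1 \/ q) - P(q1 \/ ... \/ qi-1)
        #         = P(q) - P(q1 \/ ... \/ qi-1) * P(q | q1 \/ ... \/ qi-1)
        p_new = estimate_p(q) - p_union * p_cond(q, downloaded_pages)
        return p_new / cost(q)
    return max(candidates, key=efficiency)
```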

  21. Other practical issues • Efficient calculation of P(qi|q1\/…\/qi-1) • Selection of the initial query • Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results) • Please refer to our paper for the details

  22. Outline • Interacting with Hidden-Web sites • Algorithms for selecting queries for the Hidden-Web sites • Experimental evaluation of our algorithms

  23. Experimental evaluation • Applied our algorithms to 4 different sites

  24. Policies • Random-16K • Pick query randomly from 16,000 most popular terms • Random-1M • Pick query randomly from 1,000,000 most popular terms • Frequency-based • Pick query based on frequency of occurrence • Adaptive

  25. Coverage of policies • What fraction of the Web sites can we download by issuing queries? • Study P(q1\/…\/qi) as i increases

  26. Coverage of policies for PubMed • Adaptive gets ~80% with ~83 queries • Frequency-based needs 103 queries for the same coverage

  27. Coverage of policies for DMOZ (whole) • Adaptive outperforms others

  28. Coverage of policies for DMOZ (arts) • Adaptive performs best in topic-specific texts

  29. Other experiments • Impact of the initial query • Impact of the various parameters of the cost function • Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results) • Please refer to our paper for the details

  30. Related work • Issuing queries to databases • Acquire language model [CCD99] • Estimate fraction of the Web indexed [LG98] • Estimate relative size and overlap of indexes [BB98] • Build multi-keyword queries that can return a large number of documents [BF04] • Harvesting approaches/cooperative databases (OAI [LS01], DP9 [LMZN02])

  31. Conclusion • An adaptive algorithm for issuing queries to Hidden-Web sites • Our algorithm is highly efficient (downloaded >90% of a site with ~100 queries) • Allows users to tap into unexplored information on the Web • Allows the research community to download, mine, study, understand the Hidden-Web

  32. References • [IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002. • [CCD99] J. Callan, M.E. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999. • [LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998. • [BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998. • [BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces. SBBD 2004. • [LS01] C. Lagoze, H. Van de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001. • [LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9: An OAI Gateway Service for Web Crawlers. JCDL 2002.

  33. Thank you! Questions?

  34. Impact of the initial query • Does it matter what the first query is? • Crawled PubMed with queries: • data (1,344,999 results) • information (308,474 results) • return (29,707 results) • pubmed (695 results)

  35. Impact of the initial query • Algorithm converges regardless of initial query

  36. Incorporating the document download cost • Cost(qi) = cq + cr·P(qi) + cd·Pnew(qi) • Crawled PubMed with cq = 100, cr = 100, cd = 10,000

  37. Incorporating document download cost • Adaptive uses resources more efficiently • The document download cost is a significant portion of the total cost

  38. Can we get all the results back?

  39. Downloading from sites limiting the number of results (1) • The site returns a capped subset qi’ instead of the full result set qi • For qi+1 we need to estimate P(qi+1|q1\/…\/qi)

  40. Downloading from sites limiting the number of results (2) • Assuming qi’ is a random sample of qi
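A hedged sketch of how the random-sample assumption can be used: scale fractions measured on the capped sample qi’ up by the total match count the site reports. The helper below is illustrative, not the paper's exact estimator.

```python
def estimate_matches_in_full_set(sample_pages, term, reported_total):
    """Estimate how many pages in the full (uncapped) result set of qi
    contain `term`, treating the returned pages qi' as a random sample.

    sample_pages: texts of the <= k pages the site actually returned.
    reported_total: total number of matches the site reports for qi.
    """
    if not sample_pages:
        return 0.0
    term = term.lower()
    # fraction of the sample containing the term, scaled to the
    # reported size of the full result set
    frac = sum(1 for text in sample_pages
               if term in text.lower().split()) / len(sample_pages)
    return frac * reported_total
```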

  41. Impact of the limit of results • How does the limit on results affect our algorithms? • Crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000

  42. DMOZ with a result cap at 1,000 • Adaptive still outperforms frequency-based
