
Classifying and Searching "Hidden-Web" Text Databases


Presentation Transcript


  1. Classifying and Searching "Hidden-Web" Text Databases Panos Ipeirotis Computer Science Department Columbia University

  2. Motivation: “Surface” Web vs. “Hidden” Web • “Surface” Web: link structure; crawlable; documents indexed by search engines • “Hidden” Web: no link structure; documents “hidden” in databases; documents not indexed by search engines; need to query each collection individually

  3. Hidden-Web Databases: Examples • Search on the U.S. Patent and Trademark Office (USPTO) database (at http://patft.uspto.gov/netahtml/search-bool.html): [wireless network] → 25,749 matches • Search on Google restricted to the USPTO database site: [wireless network site:patft.uspto.gov] → 0 matches as of Feb 10th, 2004

  4. Interacting With Hidden-Web Databases • Browsing: Yahoo!-like directories, populated manually (InvisibleWeb.com, SearchEngineGuide.com) • Searching: Metasearchers

  5. Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • SDARTS

  6. Hierarchically Classifying the ACM Digital Library • [Figure: the ACM DL next to a topic hierarchy whose candidate categories are marked “?”]

  7. Text Database Classification: Definition • For a text database D and a category C: Coverage(D,C) = number of docs in D about C; Specificity(D,C) = fraction of docs in D about C • Assign a text database D to a category C if both hold: Coverage(D,C) is at least Tc, the coverage threshold (e.g., > 100 docs in C); Specificity(D,C) is at least Ts, the specificity threshold (e.g., > 40% of docs in C)
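
The two-threshold rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the dictionary layout, and the toy counts are assumptions, and specificity is computed against the sum of the per-category counts, which presumes every document falls in exactly one category.

```python
def classify(coverage, tc, ts):
    """Return the categories a database is assigned to.

    coverage: dict mapping category -> number of documents about it
              (hypothetical layout, not from the talk).
    tc: coverage threshold (absolute number of documents).
    ts: specificity threshold (fraction of the database's documents).
    """
    total = sum(coverage.values())
    assigned = []
    for category, count in coverage.items():
        specificity = count / total if total else 0.0
        # A category qualifies only if it clears both thresholds.
        if count >= tc and specificity >= ts:
            assigned.append(category)
    return assigned


# Toy topic distribution (made-up numbers).
example = {"Health": 4500, "Sports": 400, "Computers": 100}
print(classify(example, tc=100, ts=0.4))  # ['Health']
```

"Computers" clears the coverage threshold here but not the specificity threshold, which is why both conditions are checked together.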

  8. Brute-Force Classification “Strategy” • Extract all documents from database • Classify documents on topic (use state-of-the-art classifiers: SVMs, C4.5, RIPPER, …) • Classify database according to topic distribution • Problem: No direct access to full contents of Hidden-Web databases

  9. Classification: Goal & Challenges • Goal: Discover database topic distribution • Challenges: no direct access to full contents of Hidden-Web databases; only limited search interfaces available; should not overload databases • Key observation: Only queries “about” database topic(s) generate large number of matches

  10. Query-based Database Classification: Overview • Train document classifier • Extract queries from classifier • Adaptively issue queries to database • Identify topic distribution based on adjusted number of query matches • Classify database • [Pipeline figure: Train classifier → Extract queries (Sports: +nba +knicks; Health: +sars) → Query database (+sars: 1254 matches) → Identify topic distribution → Classify database]

  11. Training a Document Classifier • Get training set (set of pre-classified documents) • Select best features to characterize documents (Zipf’s law + information-theoretic feature selection) [Koller and Sahami 1996] • Train classifier (SVM, C4.5, RIPPER, …) • Output: A “black-box” model for classifying documents

  12. Extracting Query Probes • Transform classifier model into queries • Trivial for “rule-based” classifiers (RIPPER) • Easy for decision-tree classifiers (C4.5) for which rule generators exist (C4.5rules) • Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naïve-Bayes, …) • Example query for Sports: +nba +knicks [ACM TOIS 2003]

  13. Querying Database with Extracted Queries • Issue each query to database to obtain number of matches without retrieving any documents • Increase coverage of rule’s category accordingly (#Sports = #Sports + 706) [SIGMOD 2001, ACM TOIS 2003]
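
A minimal sketch of the probing step, assuming the database exposes some way to get a match count for a query. The callable `count_matches` is a hypothetical stand-in for that search interface, and the toy counts simply reuse the two numbers that appear on the slides.

```python
from collections import defaultdict


def probe(count_matches, probes):
    """Accumulate per-category match counts without retrieving documents.

    count_matches: callable(query) -> number of matches; a hypothetical
                   stand-in for a database's search interface.
    probes: list of (category, query) pairs extracted from the classifier.
    """
    coverage = defaultdict(int)
    for category, query in probes:
        # Mirrors "#Sports = #Sports + 706": add the match count to the
        # coverage estimate of the category the query was derived from.
        coverage[category] += count_matches(query)
    return dict(coverage)


# Toy stand-in for a real database (illustrative counts only).
toy_counts = {"+nba +knicks": 706, "+sars": 1254}
probes = [("Sports", "+nba +knicks"), ("Health", "+sars")]
estimated = probe(lambda q: toy_counts.get(q, 0), probes)
print(estimated)  # {'Sports': 706, 'Health': 1254}
```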

  14. Identifying Topic Distribution from Query Results • Document classifiers not perfect: rules for one category match documents from other categories • Querying not perfect: queries for same category might overlap; queries do not match all documents in a category • Hence query-based estimates of topic distribution are not perfect • Solution: Learn to adjust results of query probes

  15. Confusion Matrix Adjustment of Query Probe Results • [Figure: confusion matrix M (rows: assigned class; columns: correct class) × correct (but unknown) topic distribution = incorrect topic distribution derived from query probing; the three product rows shown are 800+500+0, 80+4250+2, and 20+750+48; the individual matrix entries did not survive extraction] • Example entry: 10% of “sport” documents match queries for “computers” • This “multiplication” can be inverted to get a better estimate of the real topic distribution from the probe results

  16. Confusion Matrix Adjustment of Query Probe Results • M usually diagonally dominant for “reasonable” document classifiers, hence invertible • Compensates for errors in query-based estimates of topic distribution • $\mathrm{Coverage}(D) \approx M^{-1} \cdot \mathrm{ECoverage}(D)$, where ECoverage(D) holds the probing results and Coverage(D) is the adjusted estimate of the topic distribution
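
The adjustment can be illustrated by solving M · Coverage = ECoverage for Coverage. In the sketch below the matrix entries are assumptions: they were chosen to be consistent with the row sums 800+500+0, 80+4250+2, 20+750+48 and with the remark that 10% of “sport” documents match queries for “computers”, but the transcript does not preserve the actual matrix. The category order (computers, sports, health) is also assumed, and the small Gauss-Jordan solver is just one way to apply the inverse.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Assumed confusion matrix: rows = assigned class, columns = correct class,
# in the order (computers, sports, health). Entry [0][1] = 0.10 encodes
# "10% of sport documents match queries for computers".
M = [
    [0.80, 0.10, 0.00],
    [0.08, 0.85, 0.04],
    [0.02, 0.15, 0.96],
]
# Probe-based estimates: 800+500+0, 80+4250+2, 20+750+48.
probe_counts = [1300, 4332, 818]

adjusted = solve(M, probe_counts)
print([round(x) for x in adjusted])  # [1000, 5000, 50]
```

Under these assumed entries, the adjusted estimate differs sharply from the raw probe counts for the third category (50 versus 818), which is the kind of error the adjustment is meant to compensate for.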

  17. Classification Algorithm (Again) • One-time process: train document classifier; extract queries from classifier • For every database: adaptively issue queries to database; identify topic distribution based on adjusted number of query matches; classify database

  18. Experimental Setup • 72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes) • 500,000 Usenet articles (April-May 2000); newsgroups assigned by hand to hierarchy nodes (e.g., comp.hardware, rec.music.classical, rec.photo.*) • RIPPER trained with 54,000 articles (1,000 articles per leaf); 27,000 articles to construct confusion matrix • 500 “Controlled” databases built using 419,000 newsgroup articles (to run detailed experiments) • 130 real Web databases picked from InvisibleWeb (first 5 under each topic)

  19. Experimental Results: Controlled Databases • Accuracy (using F-measure): above 80% for most <Tc, Ts> threshold combinations tried; degrades gracefully with hierarchy depth; confusion-matrix adjustment helps • Efficiency: relatively small number of queries (<500) needed for most <Tc, Ts> threshold combinations tried

  20. Experimental Results: Web Databases • Accuracy (using F-measure): ~70% for best <Tc, Ts> combination; learned thresholds that reproduce human classification; tested threshold choice using 3-fold cross validation • Efficiency: 120 queries per database on average needed for choice of thresholds, no documents retrieved; only small part of hierarchy “explored”; queries are short: 1.5 words on average, 4 words maximum (easily handled by most Web databases)

  21. Other Experiments • Effect of choice of document classifiers: RIPPER, C4.5, Naïve Bayes, SVM • Benefits of feature selection • Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models • Effect of query-overlap elimination step • Over crawlable databases: query-based classification orders of magnitude faster than “brute-force” crawling-based classification [ACM TOIS 2003, IEEE Data Engineering Bulletin 2002]

  22. Hidden-Web Database Classification: Summary • Handles autonomous Hidden-Web databases accurately and efficiently: ~70% F-measure; only 120 queries issued on average, with no documents retrieved • Handles large family of document classifiers (and can hence exploit future advances in machine learning)

  23. Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • SDARTS

  24. Interacting With Hidden-Web Databases • Browsing: Yahoo!-like directories • Searching: Metasearchers • In both cases, the content is not accessible through Google • [Figure: a query goes to a metasearcher, which forwards it to databases such as NYTimes Archives, PubMed, USPTO, Library of Congress, …]

  25. Metasearchers Provide Access to Distributed Databases • Database selection relies on simple content summaries: vocabulary, word frequencies • Example content summary for PubMed (11,868,552 documents): … aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 24,826; … • For the query [thrombopenia], the metasearcher compares the summaries: PubMed: thrombopenia 24,826; NYTimes Archives: thrombopenia 18; USPTO: thrombopenia 0

  26. Extracting Content Summaries from Autonomous Hidden-Web Databases [Callan & Connell 2001] • Step 1: Send random queries to databases • Step 2: Retrieve top matching documents • Step 3: If 300 documents have been retrieved, stop; else go to Step 1 • Content summary contains words in sample and document frequency of each word • Problems: random sampling retrieves non-representative documents; frequencies in summary “compressed” to sample size range; summaries from small samples are highly incomplete
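
The sampling loop above can be sketched as follows. This is a simplified illustration, not the exact procedure of [Callan & Connell 2001]: the `search` callable, the fixed word list used for random queries, the `top_k` cutoff, and the `max_queries` safety cap are all assumptions introduced for the example.

```python
import random


def sample_database(search, query_words, target=300, top_k=4,
                    max_queries=1000, rng=None):
    """Build a sample-based content summary from random one-word queries.

    search: callable(word) -> ranked list of (doc_id, text); a hypothetical
            stand-in for a database's search interface.
    query_words: word list to draw random queries from (assumed input).
    Returns (document-frequency dict, number of sampled documents).
    """
    rng = rng or random.Random(0)
    sample = {}  # doc_id -> text
    queries = 0
    while len(sample) < target and queries < max_queries:
        word = rng.choice(query_words)          # Step 1: random query
        queries += 1
        for doc_id, text in search(word)[:top_k]:  # Step 2: top matches
            sample.setdefault(doc_id, text)
        # Step 3: the loop condition stops once `target` docs are sampled.
    doc_freq = {}
    for text in sample.values():
        for w in set(text.split()):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return doc_freq, len(sample)


# Toy three-document "database" (made-up content).
docs = {1: "cancer heart", 2: "cancer aids", 3: "heart surgery"}


def toy_search(word):
    return [(i, t) for i, t in sorted(docs.items()) if word in t.split()]


summary, n_docs = sample_database(toy_search, ["cancer", "heart", "aids"],
                                  target=3, top_k=2)
print(n_docs, summary["cancer"], summary["heart"])
```

Note that the document frequencies in `summary` are bounded by the sample size, which is the “compression” problem named on the slide.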

  27. Extracting Representative Document Sample • Problem 1: Random sampling retrieves non-representative documents • Train a document classifier • Create queries from classifier • Adaptively issue queries to databases: retrieve top-k matching documents for each query; save #matches for each one-word query • Identify topic distribution based on adjusted number of query matches • Categorize the database • Generate content summary from document sample • Sampling retrieves documents only from “topically dense” areas of the database

  28. Sample Frequencies vs. Actual Frequencies • Problem 2: Frequencies in summary “compressed” to sample size range • PubMed (11,868,552 docs): … cancer 1,562,477; heart 691,360 … • After sampling, PubMed sample (300 documents): … cancer 45; heart 16 … • Key Observation: Query matches reveal frequency information

  29. Adjusting Document Frequencies • Zipf’s law empirically connects word frequency f and rank r: $f = A (r + B)^{c}$ • [Plot: frequency vs. rank] [VLDB 2002]

  30. Adjusting Document Frequencies • Zipf’s law empirically connects word frequency f and rank r: $f = A (r + B)^{c}$ • We know document frequency and rank r of the words in sample • [Plot: frequency in sample vs. rank in sample, with ranks 1, 12, 78, … marked] [VLDB 2002]

  31. Adjusting Document Frequencies • Zipf’s law empirically connects word frequency f and rank r: $f = A (r + B)^{c}$ • We know document frequency and rank r of the words in sample • We know real document frequency f of some words from one-word queries • [Plot: frequency in database vs. rank in sample, with ranks 1, 12, 78, … marked] [VLDB 2002]

  32. Adjusting Document Frequencies • Zipf’s law empirically connects word frequency f and rank r: $f = A (r + B)^{c}$ • We know document frequency and rank r of the words in sample • We know real document frequency f of some words from one-word queries • We use curve-fitting to estimate the absolute frequency of all words in sample • [Plot: estimated frequency in database vs. rank, with ranks 1, 12, 78, … marked] [VLDB 2002]
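
One way to carry out such a curve fit is sketched below. The fitting procedure is an assumption, not necessarily the one used in the VLDB 2002 work: for each candidate B on a small grid, it fits log f = log A + c·log(r + B) by ordinary least squares and keeps the B with the smallest residual. The function name, the grid, and the synthetic data points are all illustrative.

```python
import math


def fit_mandelbrot(points, b_grid=None):
    """Fit f = A * (r + B)**c to (rank, frequency) pairs.

    points: ranks in the sample paired with database frequencies known
            from one-word queries. Returns (A, B, c).
    """
    if b_grid is None:
        b_grid = [i * 0.5 for i in range(41)]  # candidate B values 0..20
    ys = [math.log(f) for _, f in points]
    n = len(points)
    best = None
    for b in b_grid:
        xs = [math.log(r + b) for r, _ in points]
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue
        # Least-squares slope and intercept in log-log space.
        c = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        log_a = mean_y - c * mean_x
        err = sum((log_a + c * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(log_a), b, c)
    if best is None:
        raise ValueError("need at least two distinct ranks")
    return best[1], best[2], best[3]


# Synthetic check: generate frequencies from known parameters, then refit.
true_a, true_b, true_c = 1e6, 2.0, -1.1
known = [(r, true_a * (r + true_b) ** true_c) for r in (1, 12, 78, 200)]
a, b, c = fit_mandelbrot(known)

# Estimated database frequency of the word ranked 40th in the sample.
estimate_rank_40 = a * (40 + b) ** c
print(b, round(c, 3), round(estimate_rank_40))
```

Once A, B, and c are fitted, every word in the sample gets an estimated database frequency from its sample rank alone, including words that were never issued as one-word queries.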

  33. Actual PubMed Content Summary • Extracted automatically • ~27,500 words in extracted content summary • Fewer than 200 queries sent • At most 4 documents retrieved per query • PubMed content summary: Number of Documents: 8,691,360 (Actual: 11,868,552); Category: Health, Diseases; … cancer 1,562,477; heart 581,506 (Actual: 691,360); aids 121,491; hepatitis 73,481 (Actual: 121,129); … basketball 907 (Actual: 1,063); cpu 598 • (heart, hepatitis, basketball not in 1-word probes)

  34. Sampling and Incomplete Content Summaries • Problem 3: Summaries from small samples are highly incomplete • Many words appear in “relatively few” documents (Zipf’s law) • Low-frequency words are often important • Small document samples miss many low-frequency words • [Plot: log(frequency) vs. rank for the 10% most frequent words in the PubMed database, with the sample size (300) marked; example: “aphasia” appears in ~9,000 docs (~0.1%)]

  35. Sample-based Content Summaries • Challenge: Improve content summary quality without increasing sample size • Main Idea: Database Classification Helps • Similar topics ↔ Similar content summaries • Extracted content summaries complement each other

  36. Databases with Similar Topics • CANCERLIT contains “metastasis”, not found during sampling • CancerBACUP contains “metastasis” • Databases under same category have similar vocabularies, and can complement each other

  37. Content Summaries for Categories • Databases under same category share similar vocabulary • Higher level category content summaries provide additional useful estimates • All estimates in category path are potentially useful

  38. Enhancing Summaries Using “Shrinkage” • Estimates from database content summaries can be unreliable • Category content summaries are more reliable (based on larger samples) but less specific to database • By combining estimates from category and database content summaries we get better estimates [SIGMOD 2004]

  39. Shrinkage-based Estimations • Adjust estimate for metastasis in D: $\lambda_1 \cdot 0.002 + \lambda_2 \cdot 0.05 + \lambda_3 \cdot 0.092 + \lambda_4 \cdot 0.000$ • Select the $\lambda_i$ weights to maximize the probability that the summary of D is from a database under all its parent categories • Avoids “sparse data” problem and decreases estimation risk
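
The weighted combination itself is easy to sketch. The four estimates below are the ones on the slide; the weights are made-up placeholders, since the slide only says they are selected to maximize a likelihood and that selection step is not shown here. Requiring the weights to sum to 1 is the usual convention for a shrinkage mixture and is treated as an assumption.

```python
def shrink(estimates, weights):
    """Combine per-level probability estimates into one shrunk estimate.

    estimates: estimates of a word's probability from each summary along
               the category path (values taken from the slide).
    weights: mixture weights lambda_i; hypothetical values, assumed to be
             non-negative and to sum to 1.
    """
    if len(estimates) != len(weights):
        raise ValueError("need one weight per estimate")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * p for w, p in zip(weights, estimates))


slide_estimates = [0.002, 0.05, 0.092, 0.000]
placeholder_weights = [0.1, 0.2, 0.3, 0.4]  # illustrative only
shrunk = shrink(slide_estimates, placeholder_weights)
print(round(shrunk, 4))  # 0.0378
```

Even though one of the summaries reports 0.000 for the word, the combined estimate is non-zero, which is how shrinkage sidesteps the “sparse data” problem.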

  40. Adaptive Application of Shrinkage • Database selection algorithms assign scores to databases for each query • When frequency estimates are uncertain, assigned score is uncertain… • …but sometimes confidence about assigned score is high • When confident about score, shrinkage unnecessary • [Two plots of probability vs. database score for a query, on a 0 to 1 axis: “Unreliable score estimate: use shrinkage” and “Reliable score estimate: shrinkage might hurt”]

  41. Extracting Content Summaries: Problems Solved • Problem 1: Random sampling may retrieve non-representative documents. Solution: Focus querying on “topically dense” areas of the database • Problem 2: Frequencies are “compressed” to the sample size range. Solution: Exploit number of matches for query and adjust estimates using curve fitting • Problem 3: Summaries based on small samples are highly incomplete. Solution: Exploit database classification and augment summaries using samples from topically similar databases

  42. Searching Algorithm • One-time process: classify databases and extract document samples; adjust frequencies in samples • For every query: for each database D, assign score to database D (using extracted content summary), examine uncertainty of score, and if uncertainty is high, apply shrinkage and give new score, else keep existing score; then query only the top-K scoring databases
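
The per-query part of this algorithm can be sketched as below. Everything concrete here is an assumption made for illustration: the summary layout, the product-style scoring function, and especially the uncertainty test (a query word missing from the sample-based summary), which is a crude stand-in for the score-distribution analysis described on the previous slides.

```python
def select_databases(query_words, summaries, shrunk_summaries,
                     score, is_uncertain, k):
    """Return the names of the top-k databases for a query."""
    scored = []
    for name, summary in summaries.items():
        s = score(query_words, summary)
        if is_uncertain(query_words, summary):
            # Uncertain score: rescore with the shrinkage-based summary.
            s = score(query_words, shrunk_summaries[name])
        scored.append((s, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def product_score(query_words, summary):
    """Expected number of docs matching all words, assuming independence."""
    s = summary["size"]
    for w in query_words:
        s *= summary["df"].get(w, 0) / summary["size"]
    return s


def missing_word(query_words, summary):
    """Hypothetical uncertainty test: some query word is absent."""
    return any(w not in summary["df"] for w in query_words)


# Toy summaries (made-up names and numbers).
summaries = {
    "db_a": {"size": 1000, "df": {"metastasis": 50}},
    "db_b": {"size": 1000, "df": {}},  # word missed by the small sample
    "db_c": {"size": 1000, "df": {"metastasis": 2}},
}
shrunk_summaries = {
    "db_a": summaries["db_a"],
    "db_b": {"size": 1000, "df": {"metastasis": 30}},
    "db_c": summaries["db_c"],
}
top = select_databases(["metastasis"], summaries, shrunk_summaries,
                       product_score, missing_word, k=2)
print(top)  # ['db_a', 'db_b']
```

Without the shrinkage step, `db_b` would score 0 and lose to `db_c`; with it, the word the sample missed is recovered from the topically similar summaries and `db_b` is selected.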

  43. Experimental Setup • Two standard testbeds from TREC (“Text Retrieval Conference”): 200 databases; 100 queries with associated human-assigned document relevance judgments • Two sets of experiments: • Content summary quality (metrics: precision, recall, Spearman correlation coefficient, KL-divergence) • Database selection accuracy (metric: fraction of relevant documents for queries in top-scored databases) [SIGMOD 2004]

  44. Experimental Results • Content summary quality: shrinkage improves quality of content summaries without increasing sample size; frequency estimation gives accurate (within ±20%) estimates of actual frequencies • Database selection accuracy: frequency estimation improves performance by 20%-30%; focused sampling improves performance by 40%-50%; adaptive application of shrinkage improves performance by up to 100% • Shrinkage is robust: improved performance consistently across many different configurations

  45. Other Experiments • Additional data set: 315 real Web databases • Choice of database selection algorithm (CORI, bGlOSS, Language Modeling) • Effect of stemming • Effect of stop-word elimination [SIGMOD 2004]

  46. Classification & Search: Overall Contributions • Support for browsing and searching Hidden-Web databases • No need for cooperation: works with autonomous Hidden-Web databases • Scalable: works with a large number of databases • Not restricted to “Hidden”-Web databases: works with any searchable text database • Classification and content summary extraction implemented and available for download at: http://sdarts.cs.columbia.edu

  47. Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • SDARTS: Protocol and Toolkit for Metasearching

  48. SDARTS: Protocol and Toolkit for Metasearching • [Figure: a query goes through SDARTS to local collections (unstructured text documents; XML documents, e.g., the DLI2 Corpus) and to Web databases (Harrison’s Online, British Medical Journal, PubMed)]

  49. SDARTS: Protocol and Toolkit for Metasearching • Accomplishments: • Combines the strength of existing Digital Library protocols (SDLIP, STARTS) • Enables indexing and wrapping of “local” collections of text and XML documents • Enables “declarative” wrapping of Hidden-Web databases, with no programming • Extracts content summary, topical focus, and technical level of each database • Interfaces with Open Archives Initiative, an emerging Digital Library interoperability protocol • Critical building block for search component of Columbia’s PERSIVAL project (5-year, $5M NSF Digital Libraries – Phase 2 project) • Open source, available at: http://sdarts.cs.columbia.edu (~1,000 downloads since Jan 2003) • Supervised and coordinated eight students during development [ACM+IEEE JCDL Conference 2001, 2002]

  50. Current Work: Updating Content Summaries • Databases are not static; their content changes. When should we refresh the content summary? • Examined 150 real Web databases over 52 weeks • Modeled changes using “survival analysis” techniques (Cox proportional hazards model) • Currently developing updating algorithms: contact database only when necessary; improve quality of summaries by exploiting history • Joint work with Junghoo Cho and Alex Ntoulas (UCLA)
