
Combining Classifiers to Identify Online Databases

Combining Classifiers to Identify Online Databases. Luciano Barbosa and Juliana Freire School of Computing University of Utah {lbarbosa,juliana}@cs.utah.edu. The Hidden Web. Web content hidden behind form interfaces Search for books, airfare tickets Not accessible from search engines


Presentation Transcript


  1. Combining Classifiers to Identify Online Databases Luciano Barbosa and Juliana Freire School of Computing University of Utah {lbarbosa,juliana}@cs.utah.edu

  2. The Hidden Web • Web content hidden behind form interfaces • Search for books, airfare tickets • Not accessible from search engines • Millions of online databases - Hsieh et al. SIGMOD 2006 • High-quality content How to leverage this information?

  3. Making the Hidden Web more Accessible: Current Approaches • Database directories (NAR database compilation - Galperin NAR 2007) • Web Integration Systems (Google Base; Chang et al. CIDR 2005) • Hidden-Web crawling (Raghavan & Garcia-Molina VLDB 2001; Barbosa & Freire SBBD 2004)

  4. The Hidden-Web Infrastructure Applications Database Directory Web Integration Systems Hidden Web Crawlers … Hidden-Web Infrastructure Form Repository Barbosa et al. ICDE2007 Form Location Form Clustering Form Identification

  5. Outline • Combining Classifiers to Identify Online Databases • An Adaptive Crawler for Locating Hidden-Web Entry Points

  6. Problem Definition Given a set F of Web forms automatically gathered by a focused crawler in an online database domain D, our goal is to select from F only the forms that are entry points to databases in D.

  7. Challenges • Locate online databases (later!) • Online databases are very sparsely distributed on the Web • Select only “relevant databases”, i.e., filter out non-searchable forms and forms not in the domain • There is great variation in the way Web forms are designed, even within a well-defined domain • High structural variability, heterogeneous vocabulary, vocabulary overlap across domains

  8. Form Variability • Searchable vs. non-searchable [figure: one searchable and one non-searchable example form]

  9. Form Variability • Different domains with similar content [figure: Hotel and Airfare example forms]

  10. Form Variability • Heterogeneity in same domain

  11. Solution Overview: Pruning the Search Space Web Searchable Forms Relevant Forms Pages in the domain Relevant forms Non-relevant forms

  12. HIerarchical Form Identification (HIFI) [diagram: Locating Forms: Focused Crawler (page textual content), Web pages → Forms; Identifying Relevant Forms: Generic Form Classifier (form structure) → Searchable forms; Domain-Specific Form Classifier (form textual content) → Relevant forms]

  13. HIFI: Phase I Locating Forms Identifying Relevant Forms Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Searchable forms Relevant forms Web pages Forms Form structure Form textual content Page textual content HIFI

  14. Looking at Form Structure • Searchable vs. non-searchable [figure: one searchable and one non-searchable example form]

  15. Looking at Form Structure • Searchable forms share a similar structure • Statistics about form components • Structural features are good indicators of whether forms are searchable or not

  16. Generic Form Classifier - GFC • 13 features • hidden tags; radios; file inputs; submit tags; image inputs; buttons; resets; password tags; textboxes; “search” inside form tags; items in selection lists; submission method (post or get); text sizes in textboxes
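The 13 structural features listed on this slide can be gathered with a small HTML parser. The following is a minimal sketch, not the authors' implementation: the class name `FormFeatureCounter`, the function `extract_form_features`, the feature keys, and the choice to sum textbox sizes are all assumptions made for illustration. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Maps an <input type="..."> value to a (hypothetical) feature name.
INPUT_KINDS = {
    "hidden": "hidden", "radio": "radio", "file": "file", "submit": "submit",
    "image": "image", "button": "button", "reset": "reset",
    "password": "password", "text": "textbox",
}


class FormFeatureCounter(HTMLParser):
    """Counts structural features of a single HTML form."""

    def __init__(self):
        super().__init__()
        self.features = {name: 0 for name in INPUT_KINDS.values()}
        self.features.update({
            "search_in_form_tag": 0,   # 1 if "search" appears in the <form> tag
            "select_items": 0,         # number of <option> elements
            "method_is_post": 0,       # 1 for post, 0 for get (the default)
            "textbox_size_total": 0,   # assumption: sizes are summed
        })

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        f = self.features
        if tag == "form":
            f["search_in_form_tag"] = int(
                "search" in " ".join(attrs.values()).lower())
            f["method_is_post"] = int(
                attrs.get("method", "get").lower() == "post")
        elif tag == "input":
            kind = attrs.get("type", "text").lower()
            key = INPUT_KINDS.get(kind)
            if key:
                f[key] += 1
            if kind == "text" and attrs.get("size", "").isdigit():
                f["textbox_size_total"] += int(attrs["size"])
        elif tag == "button":
            f["button"] += 1
        elif tag == "option":
            f["select_items"] += 1


def extract_form_features(form_html):
    """Returns a dict with the 13 structural feature values for one form."""
    parser = FormFeatureCounter()
    parser.feed(form_html)
    parser.close()
    return parser.features
```

The resulting dictionary can be turned into a fixed-order numeric vector and passed to any off-the-shelf learner; which learner the GFC actually uses is not stated on this slide.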

  17. Generic Form Classifier • Test error • GFC is domain independent • Previous classifiers for identifying searchable forms are domain dependent • Use the content inside tags

  18. HIFI: Phase II Locating Forms Identifying Relevant Forms Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Searchable forms Relevant forms Web pages Forms Form structure Form textual content Page textual content HIFI

  19. Looking at Form Content • Problem of focused crawler + GFC • Co-occurrence of different searchable forms in the domain

  20. Looking at the Form Content <form id="search_form" name="search" action="http://us.rd.yahoo.com/hotjobs/search/home/*" method="get"> Search for Jobs Across the Web Job Category <select tabindex="4" name="industry1" id="industry1"> <option value="FIN">Accounting/Finance</option> <option value="ADV">Advertising/Public Relations</option> <option value="ART">Arts/Entertainment/Publishing</option> <option value="BAM">Banking/Mortgage</option> </select> Keyword(s) <input name="keywords_all" id="keywords_all" type="text" value=""> (e.g. Job title, company, occupation) City & State or Zip <input tabindex="3" type="checkbox" align="left" name="metro_area" id="metro_area" value="1" checked /> Include surrounding cities <input type="hidden" name="country1" id="country1" value="USA"> </form>

  21. Domain-Specific Form Classifier–DSFC • Forms in a given domain contain a well-defined and restricted vocabulary [He et al., CIKM 2004] • Usage of the textual content that can be automatically extracted from forms • Remove the html tags • Vector of 500 most frequent stemmed words in training set • Weight in the vector: term frequency
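The representation described on this slide (strip tags, stem, keep the 500 most frequent training terms, weight by term frequency) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the tag stripping is a simple regular expression, and `crude_stem` is a deliberately naive stand-in for a real stemmer, since the slide does not say which stemmer was used.

```python
import re
from collections import Counter


def strip_tags(html):
    """Removes HTML tags, keeping only the text between them."""
    return re.sub(r"<[^>]*>", " ", html)


def crude_stem(word):
    """Naive suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def tokenize(html):
    """Lowercases the visible text and returns its stemmed words."""
    text = strip_tags(html).lower()
    return [crude_stem(word) for word in re.findall(r"[a-z]+", text)]


def build_vocabulary(training_forms, size=500):
    """Returns the `size` most frequent stemmed terms in the training forms."""
    counts = Counter()
    for html in training_forms:
        counts.update(tokenize(html))
    return [term for term, _ in counts.most_common(size)]


def tf_vector(html, vocabulary):
    """Represents one form as raw term frequencies over the vocabulary."""
    counts = Counter(tokenize(html))
    return [counts[term] for term in vocabulary]
```

Each form then becomes a vector of at most 500 counts, which is the input a text classifier such as an SVM would be trained on.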

  22. Classifier Creation • Test error in the 8 domains • Best results: SVM

  23. Hierarchical Classification • GFC • Coarse classification: high recall • Domain independent • DSFC • Smaller search space: high precision • Domain specific • Benefits • Simplify the search space • Allows the construction of simpler classifiers • Use appropriate learning techniques for each feature space • Deal with badly formed forms
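The two-level composition on this slide amounts to running the coarse, domain-independent classifier first and the domain-specific classifier only on what survives. A minimal sketch, assuming each classifier is exposed as a boolean function over a form (the names `hifi_filter`, `is_searchable`, and `is_in_domain` are hypothetical):

```python
from typing import Callable, Iterable, List

Form = str
Classifier = Callable[[Form], bool]


def hifi_filter(forms: Iterable[Form],
                is_searchable: Classifier,
                is_in_domain: Classifier) -> List[Form]:
    """Applies the generic classifier, then the domain-specific one.

    The second classifier only sees forms accepted by the first, which is
    what shrinks its search space.
    """
    searchable = [form for form in forms if is_searchable(form)]
    return [form for form in searchable if is_in_domain(form)]
```

With toy stand-ins such as `lambda f: "<select" in f` for the GFC and `lambda f: "job" in f.lower()` for the DSFC, only forms passing both tests are returned.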

  24. Experiments • Assess the quality of HIFI • In 8 representative domains--variation in form structure, vocabulary, size (details in paper) • Over different inputs • Effectiveness of monolithic classifier vs. HIFI

  25. Evaluation Metrics [figure: confusion matrix with cells True Positive, False Negative, False Positive, True Negative]
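The slide shows only the confusion-matrix cells. Assuming the talk uses the standard definitions of precision, recall, and F-measure built from those cells (the transcript does not spell them out), they can be computed as in this small sketch; the function name is hypothetical.

```python
def precision_recall_f_measure(true_pos, false_pos, false_neg):
    """Standard precision, recall, and their harmonic mean (F-measure)."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Here a true positive is a relevant form that the classifier keeps, a false positive is an irrelevant form that it keeps, and a false negative is a relevant form that it discards.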

  26. GFC Results • GFC removes a significant percentage of irrelevant forms • Misclassifies only a few relevant forms (high recall) [chart callout: Exceptions]

  27. HIFI Performance • HIFI = GFC + DSFC • High recall and precision

  28. HIFI vs. Monolithic Classifier • Configuration 1 • Content • Configuration 2 • Structure + content • Combining classifiers gives the best tradeoff between precision and recall [chart callouts: High recall; Low precision over non-searchable forms; More specific model]

  29. Sensitivity to Input Quality • Classification accuracy depends on the input quality • Input from two focused crawlers • BFC (Chakrabarti et al., WWW1999)--less focused • FFC (Barbosa & Freire, WebDB 2005)-- more focused

  30. Percentage of Relevant Forms

  31. Sensitivity to Input Quality • Results: F-Measure • HIFI is effective for ‘noisy’ input • HIFI performs better for the higher-quality input

  32. Related Work • Identifying searchable forms • Pre-query (Hess and Kushmerick IIWeb 2003; Cope et al. ADC 2003) • Domain-dependent; manual extraction of form attributes • Post-query (Bergholz and Chidlovskii WISE 2003) • Require forms to be automatically submitted • Hierarchical classifiers • Image classification (Heisele et al. Pattern Recognition 2003) • Part-of-speech tagging (Even-Zohar and Roth EMNLP 2001)

  33. Conclusion • Effective and automatic approach to identify forms in a domain • Partition the search space • Construction of simpler and more effective classifiers • Future directions • Handle simple search forms • Use semi-supervised learning to build the DSFC

  34. Outline • Combining Classifiers to Identify Online Databases • An Adaptive Crawler for Locating Hidden-Web Entry Points

  35. Problem Definition Given an online database domain, to automatically locate forms that serve as entry points to databases in this domain

  36. Challenge • Online databases are very sparsely distributed on the Web • A content-based focused crawler retrieves only 94 Movie search forms after crawling 100,000 pages • Requirements • Perform a broad search • Avoid visiting unproductive Web regions

  37. Our Approach • Focused crawler • Restricted to a topic • Delayed benefit • Identifies the neighborhood of the forms • Suitable for sparse domains • Online learning • Learning from experience • Adaptive aspect • Removes possible bias in crawler policy

  38. Outline • FFC (Barbosa and Freire, WebDB2005) • Components • Limitations • ACHE • Adaptive component • Automatic feature selection • Experimental Evaluation

  39. FFC • Focuses on a broad topic based on the page content - similar to topic-focused crawlers • Prioritizes links to follow based on hyperlink path patterns - similar to reinforcement-learning-based crawlers • Effective for locating searchable forms [diagram components: Crawler; Page Classifier; Searchable Form Classifier; Form Database; Link Classifier; Frontier Manager; flows: Page, Forms, Searchable Forms, Links, (Link, Relevance), Most relevant link]

  40. Page Classifier • Focus on a specific topic based on the page content Web Off-topic pages On-topic pages

  41. Link Classifier • Gives relevance to pages close to form pages • Patterns in the link neighborhood: anchor, URL, text in the proximity of the URL [diagram: on-topic pages; form page; link neighborhood at level 1; link neighborhood at level 2]

  42. Frontier Manager • Each non-visited link has the expected reward given by Link Classifier • Implements the crawler policy to maximize the expected reward
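One common way to realize "always follow the unvisited link with the highest expected reward" is a priority queue. The sketch below is an assumption about how such a frontier could look, not a description of the FFC code: the class name `Frontier`, its methods, and the greedy policy are illustrative. It relies on Python's standard `heapq` module, which is a min-heap, so rewards are negated.

```python
import heapq
import itertools


class Frontier:
    """Holds unvisited links ordered by the reward the link classifier assigned."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: earlier links first

    def add(self, url, expected_reward):
        """Queues a link once; later duplicates of the same URL are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-expected_reward, next(self._order), url))

    def next_link(self):
        """Returns the link with the highest expected reward, or None if empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url
```

A crawler loop would call `next_link()`, fetch the page, score its outgoing links with the link classifier, and `add()` them back.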

  43. FFC: Limitations • Requires substantial manual tuning • Features selected manually for the LC • Efficiency is highly dependent on training examples used to build the Link Classifier • Retrieves a large percentage of irrelevant forms

  44. ACHE: Overview [diagram components: Crawler; Page Classifier; Form Identification (Searchable Form Classifier, Domain-Specific Form Classifier); Form Database; Adaptive Link Learner; Automatic Feature Selection; Link Classifier; Frontier Manager; flows: Page, Forms, Searchable Forms, Relevant Forms, Form path, Links, (Link, Relevance), Most relevant link]

  45. Adaptive Crawler as a Learning Agent • Behavior generating element (BGE) • Maximize the expected reward (exploitation) • Problem generator (PG) • Suggesting actions that will lead to new experiences even if the benefit is not immediate (exploration) • Critic • Feedback on the success (or failure) of its actions • Online learning • Takes critic’s feedback into account to update the policy used by the BGE.

  46. ACHE as a Learning Agent [diagram: the ACHE components of slide 44 annotated with the learning-agent roles Critic, Online Learning Element, and BGE + PG]

  47. Adaptive Link Learner • Learns from the successful paths • Effectiveness depends on the accuracy of HIFI

  48. Automatic Feature Selection • Features of successful paths • anchor, URL, and text around links • Select the stemmed terms with the highest document frequency (DF) in each feature space • DF is comparable to information gain (IG) and Chi-square (Yang and Pedersen, 1997) • Aggressive feature selection • Naive Bayes gives better results with few features (Zheng et al., 2004)
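Selecting terms by document frequency within one feature space (for example, the anchor text of links on successful paths) can be sketched as below. The function name, the input shape (one list of already-stemmed terms per link), and the alphabetical tie-breaking are assumptions for illustration; the slide does not say how many terms are kept.

```python
from collections import Counter


def select_by_document_frequency(documents, k):
    """Returns the k terms that occur in the largest number of documents.

    `documents` is an iterable of term lists; a term counts once per document
    no matter how often it repeats there.
    """
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

The same function would be applied separately to the anchor, URL, and surrounding-text term lists, giving one small vocabulary per feature space.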

  49. Experiments • Evaluating • Effectiveness in retrieving relevant forms • Quality of the features automatically selected by AFS • Online learning in the crawling process • Database domains

  50. Experiments: Crawling strategies
