
Crawling the Hidden Web



Presentation Transcript


  1. Crawling the Hidden Web Authors: Sriram Raghavan, Hector Garcia-Molina Presented by: Jorge Zamora

  2. Outline • Hidden Web • Crawler Operation Model • HiWE – Hidden Web Exposer • LITE – Layout-based Information Extraction • Experimental Results • Relation to class lectures • Pros/Cons • Conclusion

  3. Hidden Web • PIW – Publicly Indexable Web • Deep Web • Estimated at 500 times the size of the PIW • Hidden-web crawler • Must parse, process, and interact with forms • Task-specific approach • Two steps • Resource discovery • Content extraction

  4. Hidden Crawler – Operation Model

  5. Hidden Crawler – Operation Model • Internal form representation F = ({E1, E2, …, En}, S, M) • Task-specific database • Formulates search queries • Matching function Match(({E1, …, En}, S, M), D) = {[E1 ← v1, …, En ← vn]} • Response analysis • Success and error pages, storage, tuning
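
The operation model above can be sketched in code. This is a minimal illustration only, not the paper's implementation; the class and function names are invented for the example:

```python
# Sketch of the internal form representation F = ({E1,...,En}, S, M)
# and a simplified Match function. All names here are illustrative.
from dataclasses import dataclass, field


@dataclass
class FormElement:
    name: str                                    # e.g. the HTML "name" attribute
    domain: list = field(default_factory=list)   # Dom(Ei): finite options, if any
    label: str = ""                              # Label(Ei): descriptive text near the element


@dataclass
class Form:
    elements: list    # {E1, ..., En}
    submit_info: dict # S: submission method, action URL, etc.
    meta: dict        # M: meta-information about the page/form


def match(form, database):
    """Return value assignments [E1 <- v1, ..., En <- vn] from a
    task-specific database, or [] when no assignment is possible."""
    assignment = {}
    for element in form.elements:
        values = database.get(element.label)   # task-specific lookup by label
        if not values:
            return []                          # cannot fill this element
        assignment[element.name] = values[0]   # pick one candidate value
    return [assignment]
```

A real crawler would enumerate many assignments per form; this sketch returns at most one to keep the shape of Match visible.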

  6. Hidden Crawler – Performance • Challenge • Wanted a metric not significantly dependent on D • Submission Efficiency • Ntotal = total number of forms the crawler submits • SEstrict = Nsuccess/Ntotal • Penalizes the crawler even when a submission is correct but yields no results • SElenient = Nvalid/Ntotal • Penalizes only semantically incorrect form submissions • Difficult to evaluate – must inspect every form submission
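
The two efficiency metrics are simple ratios; a small helper makes the difference concrete (the function name is illustrative):

```python
def submission_efficiency(n_success, n_valid, n_total):
    """SE_strict counts only submissions that yielded a result page;
    SE_lenient counts every semantically correct submission, even
    those that happened to return no results."""
    se_strict = n_success / n_total
    se_lenient = n_valid / n_total
    return se_strict, se_lenient
```

For example, a crawler that made 100 submissions, 90 of them semantically valid but only 80 returning results, scores 0.8 strictly and 0.9 leniently.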

  7. HiWE • Hidden Web Exposer • Prototype hidden-web crawler built at Stanford • Basic idea • Extract some kind of descriptive information, or label, for each element in the form • Uses a task-specific database which contains a finite set of categories with associated labels • A matching algorithm attempts to match form labels with database values to form value-assignment sets

  8. HiWE – Conceptual Parts

  9. HiWE – Form Representation • F = ({E1, E2, …, En}, S, M) • Dom(Ei) • Label(Ei)

  10. HiWE – Task-Specific Database • Organized as a finite set of concepts or categories • Each concept has one or more labels and associated values • Each row in the LVS table is of the form (L, V) • L is a label • V = {v1, …, vn} is a fuzzy set • Each vi represents a value • The fuzzy set V has an associated membership function MV • MV(vi) is the crawler's confidence in the assignment
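
A minimal sketch of an LVS table as described above, using a plain dict keyed by label; the labels, values, and weights are made-up examples:

```python
# Illustrative LVS table: each row maps a label L to a fuzzy value set V,
# where the membership function M_V(v) is the crawler's confidence in v.
lvs_table = {
    "company": {"IBM": 1.0, "Intel": 0.9, "AMD": 0.8},
    "state":   {"California": 1.0, "Texas": 1.0},
}


def membership(label, value):
    """M_V(v): the crawler's confidence that `value` belongs to the
    fuzzy value set associated with `label` (0.0 when unknown)."""
    return lvs_table.get(label, {}).get(value, 0.0)
```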

  11. HiWE – Matching Function • Label Matching • All labels are normalized • Conversion to a common case, stemming, stop-word removal • String matching • With minimum edit distances, word orderings • If the edit distance exceeds a threshold σ, the match is set to nil • Ranking Value Assignments • Fuzzy conjunction – ρfuz (minimum membership) • Average – ρavg • Probabilistic – ρprob
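
The label-matching step can be illustrated with standard normalization and Levenshtein edit distance; the stop-word list, the default threshold value, and the function names here are assumptions for the sketch, not the paper's exact procedure:

```python
# Sketch of label matching: normalize, then compare by edit distance,
# returning nil (None) when the distance exceeds a threshold sigma.
import re

STOP_WORDS = {"the", "of", "a", "an"}   # illustrative stop-word list


def normalize(label):
    """Lowercase, keep alphabetic words, drop stop words."""
    words = re.findall(r"[a-z]+", label.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def match_label(form_label, lvs_labels, sigma=3):
    """Return the closest LVS label, or None past the sigma threshold."""
    target = normalize(form_label)
    best = min(lvs_labels, key=lambda l: edit_distance(target, normalize(l)))
    return best if edit_distance(target, normalize(best)) <= sigma else None
```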

  12. HiWE – Populating the LVS Table • Explicit initialization • Built-in entries • Dates, times, names of months, days of the week • Wrapped data sources • Set of labels: new entries • Set of values: search for similar entries, expand existing ones • Crawling experience • Finite-domain elements • Values learned from one form can be used to fill out subsequent forms more efficiently

  13. HiWE – Computing Weights • Explicit initialization • Fixed, predefined weights (usually 1) representing maximum confidence in human-supplied values • External data sources or crawler activity • Positive boost for successful submissions • Negative boost for unsuccessful ones • Initial weights obtained from external data sources are computed by the wrapper

  14. HiWE – Computing Weights • Finite domain • Case 1 – Crawler extracts label, label match found • Unions the extracted values into the existing value set • Boosts the weights/confidence of the existing values • Case 2 – Crawler extracts label, label match = nil • A new row is added to the LVS table • Case 3 – Crawler cannot extract a label • Identify the value set that most closely resembles Dom(E) • Once located, add the values in Dom(E) to that value set
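
The three cases can be sketched as one update routine over an LVS table of `{label: {value: weight}}` rows; the boost constant, the overlap heuristic for Case 3, and the table structure are illustrative assumptions:

```python
# Sketch of the three weight-update cases for a finite-domain element E.
# lvs maps label -> {value: weight}; boost is an assumed increment.
def update_lvs(lvs, extracted_label, dom_values, boost=0.1):
    if extracted_label is not None and extracted_label in lvs:
        # Case 1: label match found -> union values, boost confidence
        row = lvs[extracted_label]
        for v in dom_values:
            row[v] = min(1.0, row.get(v, 0.0) + boost)
    elif extracted_label is not None:
        # Case 2: label extracted, but no match -> add a new row
        lvs[extracted_label] = {v: boost for v in dom_values}
    else:
        # Case 3: no label -> find the row whose values best overlap Dom(E),
        # then add the values in Dom(E) to that row's value set
        best = max(lvs, key=lambda l: len(set(lvs[l]) & set(dom_values)),
                   default=None)
        if best is not None:
            for v in dom_values:
                lvs[best].setdefault(v, boost)
    return lvs
```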

  15. HiWE – Explicit Configuration

  16. LITE • Layout-based Information Extraction • Used to automatically extract semantic information from search forms • In addition to text, uses the physical layout of the page to aid extraction • Semantic relationships are not always reflected in the HTML markup

  17. LITE – Usage in HiWE • Used in label extraction • Implemented by page pruning: isolate the elements that directly influence the layout of the form elements and labels

  18. LITE – Steps • Approximates the layout of the pruned page, discarding images, font styles, and style sheets • Identifies the pieces of text physically closest to the form element as candidates • Ranks each candidate, taking into account position, font size, font style, and number of words • Chooses the highest-ranked candidate as the label associated with the element
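
The ranking step might look like the following sketch; the specific features and scoring weights are invented for illustration and do not reproduce the paper's actual ranking function:

```python
# Illustrative candidate ranking for a form element's label, following
# the LITE steps above. Features and weights here are assumptions.
def rank_candidates(element_pos, candidates):
    """candidates: list of (text, (x, y) position, font_size) tuples.
    Returns the text of the highest-scoring candidate."""
    def score(c):
        text, pos, font_size = c
        # closer text scores higher (Manhattan distance as a stand-in)
        distance = abs(pos[0] - element_pos[0]) + abs(pos[1] - element_pos[1])
        # short, label-like text is preferred over long sentences
        word_penalty = len(text.split())
        return -distance + font_size - word_penalty

    return max(candidates, key=score)[0]
```

With a short label right above the element and a long sentence far below it, the nearby short text wins, which is the behavior the slide's heuristics aim for.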

  19. Experiment – Parameters • Task 1 shown: “News articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years”

  20. Results – Value Ranking • Executed three times with the same parameters and initialization values, but with a different ranking function each time • ρavg may be the better choice for maximum content extraction • ρfuz is the most efficient • ρprob submits the most forms but performs poorly

  21. Results – Form Size • [Chart: number of form submissions by form size; submission counts ranged from 1404 to 3735, with success rates between 78.9% and 90%]

  22. Results – Crawler Additions to LVS

  23. Results – LITE Label Extraction • Forms with 1 to 10 elements • Manually analyzed to derive the correct label for each element • Also ran other label-extraction heuristics • Purely textual analysis • Heuristics based on common ways forms are laid out • LITE achieved 93% accuracy vs. 72% and 83% for the other two heuristics

  24. Relation to Class Notes • Content-driven crawler • Different crawlers for different purposes • Uses similar crawler metrics • Crawling speed • Scalability • Page importance • Freshness • Data transfer • Pages stored after being crawled

  25. Cons • Freshness/recrawling isn't addressed • Task-specific, requires human configuration • No login-based crawling or cookie-jar implementation • Didn't discuss hidden fields or CAPTCHAs • Didn't run Task 1 results without LITE • Doesn't use the “name” attribute of form elements • Required vs. optional fields not distinguished • Wildcards, incomplete forms • Form element dependencies

  26. Pros • First reported hidden-web crawler • Does not run at query time • vs. shopping and travel sites that do • Gets better over time

  27. Conclusion / Thoughts • The hidden web is much bigger now • The hidden web is now reached with Google Analytics and Google Ads • We also now have AJAX-based forms. How do we deal with them?

  28. Thank You. Questions?
