Crawling the Hidden Web

Authors: Sriram Raghavan, Hector Garcia-Molina
VLDB 2001
Speaker: Karthik Shekar

Deep Web / Hidden Web
• Content hidden behind search forms and registration portals.
• Dynamically generated in response to a query.
• Size: ~550 times that of the publicly indexable Web (PIW), based on a study from 2000.
• Importance: high-quality content.

User form interaction

Crawler form interaction

Components of HiWE (Hidden Web Exposer)
• Internal form representation
• Task-specific database
• Matching function
• Response analysis

HiWE Architecture
• LVS (Label Value Set) table: the task-specific database.
• Form Analyzer, Form Processor, Response Analyzer: handle form processing and submission.
• Parser, Crawl Manager, URL List: parts of the basic PIW crawler.

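Internal Form Representation
$F = (\{E_1, \dots, E_n\}, S, M)$
• $\{E_1, \dots, E_n\}$: the form elements.
• S: submission information, e.g. the submission URL.
• M: meta information, e.g. the web site hosting the form, number of inlinks.
[In HiWE, M is empty: $M = \emptyset$.]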
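The triple above can be sketched as a small data structure. This is only an illustration: the class names, the field names (`label`, `domain`, `submission`, `meta`) and the example URL are assumptions made here, not identifiers from HiWE.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FormElement:
    # Descriptive text shown next to the element, if it could be extracted.
    label: Optional[str]
    # Finite domain (e.g. the options of a select list); None for free-text input.
    domain: Optional[List[str]]


@dataclass
class Form:
    elements: List[FormElement]                          # {E_1, ..., E_n}
    submission: Dict[str, str]                           # S
    meta: Dict[str, str] = field(default_factory=dict)   # M (left empty here)


# Hypothetical form with one free-text element and one finite-domain element.
form = Form(
    elements=[
        FormElement(label="Company Name", domain=None),
        FormElement(label="State", domain=["California", "Texas"]),
    ],
    submission={"url": "http://example.com/search", "method": "GET"},
)
```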

Label Value Set (LVS) Table
• Each row is a pair (L, V).
• V: a fuzzy (graded) set of values for the label L.
• $M_V$: membership function that assigns a weight in [0, 1] to each value $v_i$ in V.
• $M_V(v_i)$: the crawler's confidence that assigning $v_i$ to an element labeled L is semantically correct.

Label Value Set (LVS) Table
• Ways to populate the table:
  • Explicit initialization: data fed in at start-up.
  • Built-in entries: date, time, etc.
  • Wrapped data sources: values retrieved by querying other sources.
    • Type 1 query: given a set of labels, return a set of values.
    • Type 2 query: given a set of values, return other values belonging to the same value set.

Computing weights on each $v_i$
• Built-in and explicitly initialized values: weight = 1.
• For values the crawler picks up from a form element e with a finite domain dom(e):
  • label(e) is extracted and the LVS table has no entry for it: add a new row (label(e), dom(e)) with $M_{dom(e)}(x) = 1$ if $x \in dom(e)$, and 0 otherwise.
  • label(e) is extracted and the LVS table already has a row (label(e), V): replace it with $(label(e), V \cup dom(e))$, where $M_{V \cup dom(e)}(x) = \max(M_V(x), M_{dom(e)}(x))$.

Computing weights on each $v_i$
• label(e) could not be extracted:
  • For each row (L, V), compute the score $\frac{\sum_{x \in dom(e)} M_V(x)}{|dom(e)|}$.
  • Find the row with the maximum score, $(L_{max}, V_{max})$.
  • Replace that row with $(L_{max}, V_{max} \cup D')$, where D' is a fuzzy set built from dom(e) with $M_{D'}(x) = \text{max-score} \cdot M_{dom(e)}(x)$.
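The three update cases can be sketched as follows. This is a minimal illustration, assuming the LVS table is a plain dict from label to a dict of value weights; the function names `fuzzy_union` and `update_lvs` and the sample data are hypothetical, not taken from HiWE.

```python
def fuzzy_union(a, b):
    """Union of two fuzzy sets (dict: value -> weight), keeping the max weight."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}


def update_lvs(lvs, label, domain):
    """Fold a finite-domain element into the LVS table; return the row touched."""
    domain = set(domain)
    crisp = {x: 1.0 for x in domain}  # M_dom(e)(x) = 1 for x in dom(e)
    if label is not None:
        # Label extracted: create the row, or take the fuzzy union with it.
        lvs[label] = fuzzy_union(lvs.get(label, {}), crisp)
        return label
    if not lvs or not domain:
        return None
    # No label: score each row by the mean membership of dom(e) in that row.
    def score(row):
        return sum(row.get(x, 0.0) for x in domain) / len(domain)
    best = max(lvs, key=lambda name: score(lvs[name]))
    max_score = score(lvs[best])
    scaled = {x: max_score * w for x, w in crisp.items()}  # M_D'(x)
    lvs[best] = fuzzy_union(lvs[best], scaled)
    return best


# Hypothetical table and elements.
lvs = {"state": {"California": 1.0, "Texas": 0.8}}
update_lvs(lvs, "state", ["Texas", "Ohio"])         # existing row: fuzzy union
update_lvs(lvs, "company", ["IBM"])                 # unseen label: new row
matched = update_lvs(lvs, None, ["Ohio", "Utah"])   # no label: best-scoring row
```

In the last call the "state" row scores (1.0 + 0) / 2 = 0.5, so "Utah" enters that row with weight 0.5 while "Ohio" keeps its existing weight of 1.0.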

Label Matching
• Normalize all labels (case folding, stemming, stop-word removal).
• Compute the edit distance between the normalized labels.
• Word ordering has to be accounted for (e.g. "Company type" vs. "type of company").
• Block edit distance is used, which handles word reordering.
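A rough sketch of the matching steps is given below. It does not implement block edit distance: as a simpler stand-in, it sorts the normalized words and applies ordinary Levenshtein distance. The stop-word list, the plural-stripping rule used in place of a real stemmer, and all function names are assumptions for illustration.

```python
import re

# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"of", "the", "a", "an", "for", "to", "in"}


def normalize(label):
    """Case-fold, drop stop words, crudely stem, and sort the words."""
    words = re.findall(r"[a-z0-9]+", label.lower())
    words = [w for w in words if w not in STOP_WORDS]
    # Crude plural stripping stands in for a real stemmer.
    words = [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]
    # Sorting makes the result insensitive to word order.
    return " ".join(sorted(words))


def edit_distance(a, b):
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def label_distance(x, y):
    return edit_distance(normalize(x), normalize(y))
```

With this sketch, `label_distance("Company type", "type of company")` is 0, since both labels normalize to the same string.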

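Ranking value assignments
• Aggregation functions over the weights $M_{V_i}(v_i)$ of an assignment to n elements:
  • Fuzzy conjunction: $\rho_{fuz} = \min_{i=1..n} M_{V_i}(v_i)$
  • Average: $\rho_{avg} = \frac{1}{n} \sum_{i=1}^{n} M_{V_i}(v_i)$
  • Probabilistic: $\rho_{prob} = 1 - \prod_{i=1}^{n} (1 - M_{V_i}(v_i))$, treating $M_{V_i}(v_i)$ as the likelihood that the assignment is useful.
• $\rho_{fuz} \le \rho_{avg} \le \rho_{prob}$: from most conservative to most aggressive.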
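The three aggregation functions are easy to compare on a small example. The sketch below assumes the weights of one value assignment are given as a plain list; the function names and the sample weights are illustrative only.

```python
import math


def rho_fuz(weights):
    """Fuzzy conjunction: the weakest individual weight."""
    return min(weights)


def rho_avg(weights):
    """Arithmetic mean of the weights."""
    return sum(weights) / len(weights)


def rho_prob(weights):
    """Probability that at least one individual assignment is useful."""
    return 1.0 - math.prod(1.0 - w for w in weights)


# Hypothetical weights M_Vi(vi) for a two-element form.
weights = [0.8, 0.5]
ranks = (rho_fuz(weights), rho_avg(weights), rho_prob(weights))
```

For these weights the ranks come out to roughly 0.5, 0.65 and 0.9, so an assignment rejected under the fuzzy rank may still pass a threshold under the probabilistic one.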

LITE
• Layout-based Information Extraction Technique.
• Based on the physical layout of the page.
• Reason: semantic information is not always reflected in the HTML markup.

LITE & form analysis
• Pruning.
• Identify the text pieces closest to the form element as candidates.
• Rank the candidates.
• Choose the highest-ranked candidate as the label.
• Post-process the chosen label.

+/-
• Simple simulation of user interaction with the form
• Learning-based operational model
• Task/application-specific crawler
• Efficient label extraction method
• Re-use of existing modules
• Coverage is a challenge
• Execution time would depend on the look up…
