
Crawling the Hidden Web

Crawling the Hidden Web. Authors: Sriram Raghavan, Hector Garcia-Molina. VLDB 2001. Speaker: Karthik Shekar. Deep Web / Hidden Web. Content hidden behind search forms and registration portals. Dynamically generated based on a query. Size: ~550 times that of the PIW (based on a study in 2000).


Presentation Transcript


  1. Crawling the Hidden Web
     Authors: Sriram Raghavan, Hector Garcia-Molina
     VLDB 2001
     Speaker: Karthik Shekar

  2. Deep Web / Hidden Web
     • Content hidden behind search forms and registration portals.
     • Dynamically generated in response to a query.
     • Size: ~550 times that of the publicly indexable Web (PIW), based on a study from 2000.
     • Importance: quality content.

  3. User form interaction

  4. Crawler form interaction
     Components of HiWE (Hidden Web Exposer):
     • Internal form representation
     • Task-specific database
     • Matching function
     • Response analysis

  5. HiWE Architecture
     • LVS (Label Value Set) table – the task-specific database.
     • Form Analyzer, Form Processor, Response Analyzer – handle form processing and submission.
     • Parser, Crawl Manager, URL List – parts of the basic PIW crawler.

  6. Internal Form Representation
     $F = (\{\text{Elements}\}, S, M)$
     • S – submission information, e.g. the submission URL
     • M – meta-information, e.g. the Web site hosting the form, number of inlinks [in HiWE it is Ф]
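
The triple on this slide maps naturally onto a small record type. The following is a minimal sketch, not HiWE's actual data structure: the class names, field names, the use of an empty `domain` to stand for a free-text input, and the example URL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FormElement:
    """One form element e, with its extracted label and domain dom(e)."""
    label: str                      # "" when no label could be extracted
    domain: frozenset = frozenset() # assumption: empty set = free-text input


@dataclass
class Form:
    """F = ({Elements}, S, M) as described on the slide."""
    elements: list                              # {Elements}
    submission: dict                            # S, e.g. the submission URL
    meta: dict = field(default_factory=dict)    # M, e.g. hosting site, inlinks


# Hypothetical example form with one select box and one text box.
form = Form(
    elements=[
        FormElement("Company type", frozenset({"Public", "Private"})),
        FormElement("Company name"),
    ],
    submission={"url": "http://example.com/search", "method": "GET"},
)
```

Leaving `meta` empty by default keeps the sketch usable when no meta-information is collected.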

  7. Label – Value Set Table
     • Each row is a pair (L, V).
     • V – fuzzy/graded set of values for the label L.
     • $M_V$ – membership function; assigns a weight to each $v_i$ in V.
     • $M_V(v_i)$ – the crawler's confidence that assigning $v_i$ to an element with label L is semantically correct.

  8. Label – Value Set Table
     Ways to populate the table:
     • Explicit initialization
       • Feeding in the data at start-up
     • Built-in entries
       • Date, time, etc.
     • Wrapped data sources
       • Retrieve data from other sources by querying
       • Type 1 query: return a set of values for a given set of labels
       • Type 2 query: for a set of values, return other values belonging to the same set

  9. Computing weights on each $v_i$
     • Built-in and explicit values: weight = 1
     • For values that the crawler picks up:
       • label(e) is extracted and there is no entry in the LVS table: a new row (label(e), dom(e)) is added, with $M_{dom(e)}(x) = 1$ if $x \in dom(e)$ and $0$ otherwise.
       • label(e) is extracted and there is an entry (label(e), V) in the LVS table: the entry is modified to (label(e), $V \cup dom(e)$), with $M_{V \cup dom(e)}(x) = \max(M_V(x), M_{dom(e)}(x))$.
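
The two labeled cases on this slide can be sketched in a few lines. This is an illustrative sketch only: the LVS table is assumed to be a plain dictionary mapping each label to a `{value: weight}` dictionary, and the function name `add_domain` is hypothetical.

```python
def add_domain(lvs, label, domain):
    """Merge dom(e) into the LVS row for an extracted label.

    lvs is assumed to be {label: {value: weight}}. Every x in dom(e)
    has M_dom(e)(x) = 1, so the fuzzy union by max gives it weight 1;
    values already in the row but not in dom(e) keep their old weight,
    since max(M_V(x), 0) = M_V(x).
    """
    row = lvs.setdefault(label, {})   # no entry yet: start a new row
    for x in domain:
        row[x] = max(row.get(x, 0.0), 1.0)
    return lvs
```

For example, merging `{"CA", "NY"}` into a row `{"CA": 0.6, "TX": 0.4}` raises "CA" to 1, adds "NY" with weight 1, and leaves "TX" at 0.4.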

  10. Computing weights on each $v_i$
     • label(e) could not be retrieved:
       • For each row (L, V), calculate the score $\dfrac{\sum_{x \in dom(e)} M_V(x)}{|dom(e)|}$.
       • Find the row with the maximum score, $(L_{max}, V_{max})$.
       • Replace that row with $(L_{max}, V_{max} \cup D')$, where $D'$ is a new fuzzy set built from dom(e) such that $M_{D'}(x) = \text{max-score} \cdot M_{dom(e)}(x)$.
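
The unlabeled case can be sketched as follows. The dictionary layout `{label: {value: weight}}` and the function name are assumptions, and the union with D' is taken by maximum membership, which is the usual fuzzy-set union but is not spelled out on this slide.

```python
def absorb_unlabeled_domain(lvs, domain):
    """Attach dom(e) to the best-matching LVS row when e has no label.

    Returns (L_max, max_score), or None if there is nothing to match.
    """
    if not lvs or not domain:
        return None

    def score(row):
        # Average membership of dom(e)'s values in this row.
        return sum(row.get(x, 0.0) for x in domain) / len(domain)

    best_label = max(lvs, key=lambda label: score(lvs[label]))
    max_score = score(lvs[best_label])

    row = lvs[best_label]
    for x in domain:
        # M_D'(x) = max_score * M_dom(e)(x) = max_score for x in dom(e);
        # union by max (assumption) never lowers an existing weight.
        row[x] = max(row.get(x, 0.0), max_score)
    return best_label, max_score
```

With rows `state = {CA: 1, NY: 1}` and `city = {Paris: 1}`, a domain `{CA, NY, TX, WA}` scores 0.5 against `state` and 0 against `city`, so "TX" and "WA" join the `state` row with weight 0.5.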

  11. Label Matching
     • Normalization of all labels (case folding, stemming, stop-word removal)
     • Computing edit distance
     • Word ordering matters (e.g. "Company type" vs. "type of company")
     • Block edit distance is used
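
A rough sketch of the normalize-then-compare idea follows. It is not the block edit distance named on the slide: as a simpler stand-in, it sorts the normalized words so that word order stops mattering and then applies plain character-level Levenshtein distance. The stop-word list and the plural-stripping "stemmer" are crude placeholders chosen for illustration.

```python
import re

# Tiny illustrative stop-word list (assumption, not from the slides).
STOP_WORDS = {"of", "the", "a", "an", "for", "in", "to"}


def normalize(label):
    """Case-fold, drop stop words, crudely stem, and sort the words."""
    words = [w for w in re.findall(r"[a-z0-9]+", label.lower())
             if w not in STOP_WORDS]
    # Placeholder for real stemming: strip a trailing plural "s".
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return " ".join(sorted(words))


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def label_distance(a, b):
    """Distance between two labels after normalization."""
    return edit_distance(normalize(a), normalize(b))
```

Under this sketch, "Company type" and "Type of company" normalize to the same string and therefore have distance 0.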

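  12. Ranking value assignment
     • Aggregation functions:
       • Fuzzy conjunction: $\rho_{fuz} = \min_{i=1..n} M_{V_i}(v_i)$
       • Average: $\rho_{avg} = \frac{1}{n} \sum_{i=1}^{n} M_{V_i}(v_i)$
       • Probabilistic: $\rho_{prob} = 1 - \prod_{i=1}^{n} \left(1 - M_{V_i}(v_i)\right)$, where $M_{V_i}(v_i)$ is read as the likelihood that the assignment is useful
     • $\rho_{fuz} \le \rho_{avg} \le \rho_{prob}$: the functions become more aggressive from left to right.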
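
The three aggregation functions (minimum, mean, and one minus the product of complements) are short enough to write out directly. In this sketch the input is assumed to be the list of weights $M_{V_i}(v_i)$ for one candidate value assignment; the function names are illustrative, and `math.prod` requires Python 3.8 or later.

```python
import math


def rho_fuz(weights):
    """Fuzzy conjunction: the weakest individual assignment decides."""
    return min(weights)


def rho_avg(weights):
    """Average of the individual weights."""
    return sum(weights) / len(weights)


def rho_prob(weights):
    """Probabilistic: 1 minus the product of (1 - weight)."""
    return 1 - math.prod(1 - w for w in weights)


weights = [0.5, 1.0]
ranks = (rho_fuz(weights), rho_avg(weights), rho_prob(weights))
```

For the weights 0.5 and 1.0 this gives 0.5, 0.75, and 1.0 respectively, which illustrates why the fuzzy conjunction is the most conservative choice and the probabilistic function the most aggressive.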

  13. LITE
     • Layout-based Information Extraction Technique
     • Based on the physical layout of the page
     • Reason: semantic information is not always reflected in the HTML markup

  14. LITE & form analysis
     • Pruning
     • Identify the text pieces closest to the form element; these are the candidates
     • Rank the candidates
     • Choose the highest-ranked candidate as the label
     • Perform post-processing
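
The candidate, rank, choose, and post-process steps can be illustrated with a toy sketch. It assumes that layout positions are already available as (x, y) coordinates for each text piece, ranks by Euclidean distance alone, and post-processes by trimming whitespace and trailing ":" or "*" characters. None of these choices is taken from the slides, and the pruning step is not modeled. `math.dist` requires Python 3.8 or later.

```python
import math


def pick_label(element_pos, texts, k=3):
    """Pick a label for a form element from nearby text pieces.

    element_pos: (x, y) of the form element (hypothetical layout data).
    texts: list of (string, (x, y)) pairs for visible text pieces.
    k: how many of the closest pieces are kept as candidates.
    """
    # Candidates: the k text pieces closest to the element.
    ranked = sorted(texts, key=lambda t: math.dist(element_pos, t[1]))
    candidates = ranked[:k]
    if not candidates:
        return ""
    # Ranking here is by distance only, so the first candidate wins.
    best_text = candidates[0][0]
    # Post-processing: trim whitespace and trailing ":" / "*" markers.
    return best_text.strip().rstrip(":*").strip()


label = pick_label(
    (100, 50),
    [("Company name:", (40, 50)),
     ("Search our database", (100, 300)),
     ("State", (40, 80))],
)
```

Here the text piece at (40, 50) is nearest to the element, so the sketch returns "Company name" after stripping the colon.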

  15. +/-
     • Simple simulation of the user's interaction with the form
     • Learning-based operational model
     • Task/application-specific crawler
     • Efficient label extraction method
     • Re-use of existing modules
     • Coverage is a challenge
     • Execution time would depend on the look up…
