
Exploiting Inter-Class Rules for Focused Crawling




Presentation Transcript


  1. Exploiting Inter-Class Rules for Focused Crawling İsmail Sengör Altıngövde Bilkent University Ankara, Turkey

  2. Our Research: The Big Picture • Goal: Metadata-based modeling and querying of web resources • Stages: • Semi-automated metadata extraction from web resources (focused crawling fits here!) • Extending SQL to support ranking and text-based operations in an integrated manner • Developing query processing algorithms • Prototyping a digital library application for CS resources

  3. Overview • Motivation • Background & related work • Inter-class rules for focused crawling • Preliminary results

  4. Motivation • Crawlers, a.k.a. bots, spiders, robots • Goal: Fetching all the pages on the Web, to allow succeeding useful tasks (e.g., indexing) • "All pages" means roughly 4 billion pages today (according to Google) • Requires enormous hardware and network resources • Consider the growth rate & refresh rate of the Web • What about hidden-Web and dynamic content?

  5. Motivation • Certain applications do need such powerful (and expensive) crawlers • e.g., a general purpose search engine • And some others don't... • e.g., a portal on computer science papers, or people's homepages...

  6. Motivation • Let’s relax the problem space: • “Focus” on a restricted target space of Web pages • that may be of some “type” (e.g., homepages) • that may be of some “topic” (CS, quantum physics) • The “focused” crawling effort would • use much less resources, • be more timely, • be more qualified for indexing & searching purposes

  7. Motivation • Goal: Design and implement a focused Web crawler that would • gather only pages on a particular "topic" (or class) • use inter-class relationships while choosing the next page to download • Once we have this, we can do many interesting things on top of the crawled pages • I plan to be around for a few more years!!!

  8. Background: A typical crawler • Starts from a set of "seed pages" • Follows all hyperlinks it encounters, to eventually traverse the entire Web • Applies breadth-first search (BFS) • Runs endlessly in cycles • to revisit modified pages • to access unseen content

  9. Our simple BFS crawler
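
A minimal sketch of the BFS crawl loop from the two slides above, assuming only the Python standard library; the fetching and link-extraction details are simplified stand-ins, not the actual implementation of our crawler.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def bfs_crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: a FIFO frontier yields the BFS visit order."""
    frontier = deque(seed_urls)          # FIFO queue of URLs to visit
    seen = set(seed_urls)                # URL-seen test (see a later slide)
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()         # BFS: take the oldest queued URL
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip unreachable pages
        fetched.append(url)
        # naive link extraction; a real crawler would use a proper HTML parser
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched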

  10. Crawling issues... • Multi-threading • Use separate and dedicated threads for DNS resolution and actual page downloading • Cache and prefetch DNS resolutions • Content-seen test • Avoid duplicate content, e.g., mirrors • Link extraction and normalization • Canonical URLs
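
To illustrate the link extraction and normalization step, here is one possible way to canonicalize URLs (resolve relative links, lower-case the scheme and host, drop fragments and default ports); the exact normalization rules below are an assumption, not necessarily those used by our crawler.

from urllib.parse import urljoin, urlparse, urlunparse

def canonicalize(base_url, href):
    """Resolve a link against its page and reduce it to a canonical URL."""
    absolute = urljoin(base_url, href)               # resolve relative links
    parts = urlparse(absolute)
    host = (parts.hostname or "").lower()
    port = parts.port
    # drop default ports and the fragment, lower-case scheme and host
    netloc = host if port in (None, 80, 443) else f"{host}:{port}"
    path = parts.path or "/"
    return urlunparse((parts.scheme.lower(), netloc, path,
                       parts.params, parts.query, ""))

# e.g. canonicalize("http://Example.COM/docs/", "intro.html#top")
#      -> "http://example.com/docs/intro.html"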

  11. More issues... • URL-seen test • Avoid being trapped in a cycle! • Hash visited URLs with the MD5 algorithm and store them in a database • 2-level hashing to exploit spatio-temporal locality • Load balancing among hosts: Be polite! • Robot exclusion protocol • Meta tags
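
A sketch of the URL-seen test: each visited URL is hashed with MD5 and the digest is stored. A plain sqlite table stands in here for the actual database and the 2-level hashing scheme mentioned above.

import hashlib
import sqlite3

class UrlSeenTest:
    """Remember visited URLs by their MD5 digest to avoid re-crawling cycles."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (digest TEXT PRIMARY KEY)")

    def add_if_new(self, url):
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        try:
            self.db.execute("INSERT INTO seen VALUES (?)", (digest,))
            self.db.commit()
            return True          # URL not seen before: crawl it
        except sqlite3.IntegrityError:
            return False         # duplicate digest: already visited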

  12. Even more issues?! • Our crawler is simple, since issues like • Refreshing crawled web pages • Performance monitoring • Hidden-Web content are left out... • And some of the implemented features can still be improved • A "busy queue" for the politeness policy!

  13. Background: Focused crawling "A focused crawler seeks and acquires [...] pages on a specific set of topics representing a relatively narrow segment of the Web." (Soumen Chakrabarti) • The underlying paradigm is Best-First Search instead of Breadth-First Search

  14. Breadth vs. Best First Search
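
In practice, the difference between the two strategies comes down to the data structure behind the crawl frontier; a small comparison, with illustrative URLs and scores (Python's heapq is a min-heap, so scores are negated to get max-priority behaviour):

import heapq
from collections import deque

# Breadth-first: the frontier is a FIFO queue; order of discovery = order of visit.
bfs_frontier = deque()
bfs_frontier.append("http://example.com/a")
next_bfs_url = bfs_frontier.popleft()

# Best-first: the frontier is a priority queue keyed on a relevance score.
best_frontier = []
heapq.heappush(best_frontier, (-0.80, "http://example.com/b"))   # score 0.80
heapq.heappush(best_frontier, (-0.10, "http://example.com/c"))   # score 0.10
neg_score, next_best_url = heapq.heappop(best_frontier)          # the 0.80 page comes out first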

  15. Two fundamental questions • Q1: How to decide whether a downloaded page is on-topic, or not? • Q2: How to choose the next page to visit?

  16. Early algorithms • FISHSEARCH: Query driven • A1: Pages that match a query • A2: The neighborhood of the pages above • SHARKSEARCH: • Uses TF-IDF & the cosine measure from IR to determine page relevance • Cho et al. • Reorder the crawl frontier based on a "page importance" score (PageRank, in-links, etc.)
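
For the Shark-Search style relevance test, page relevance can be computed as the cosine similarity between TF-IDF vectors of the query and the page; a sketch using scikit-learn, which is an assumption here rather than the tool used by those systems.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def page_relevance(query, page_text):
    """Cosine similarity between the TF-IDF vectors of a query and a page."""
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([query, page_text])   # row 0: query, row 1: page
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

# e.g. page_relevance("focused crawling text classification", page_html_text)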

  17. Chakrabarti’s crawler • Chakrabarti’s focused crawler • A1: Determines the page relevance using a text classifier • A2: Adds URLs to a max-priority queue with their parent page’s score and visits them in descending order! • What is original is using a text classifier!

  18. The baseline crawler • A simplified implementation of Chakrabarti’s crawler • It is used to present & evaluate our rule based strategy • Just two minor changes in our crawler architecture, and done!!!

  19. Our baseline crawler

  20. The baseline crawler • An essential component is the text classifier • A Naive Bayes classifier called Rainbow • Training the classifier • Data: Use a topic taxonomy (The Open Directory, Yahoo!) • Better than modeling a negative class

  21. Baseline crawler: Page relevance • Testing the classifier • The user determines the focus topics • The crawler calls the classifier and obtains a score for each downloaded page • The classifier returns a sorted list of classes and scores (A 80%, B 10%, C 7%, D 1%, ...) • The classifier determines the page relevance!
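
A sketch of how the baseline crawler could consult its classifier: the downloaded page is scored against all classes and a sorted list of (class, score) pairs comes back. A scikit-learn Naive Bayes model and toy training data stand in for Rainbow and the taxonomy here; Rainbow's actual interface differs.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training documents standing in for taxonomy classes (labels are illustrative).
train_docs = [
    "course syllabus lecture homework exam",        # CH: course homepage
    "research interests publications cv students",  # PH: personal homepage
    "faculty department admissions programs",       # DH: department homepage
]
train_labels = ["CH", "PH", "DH"]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_docs), train_labels)

def class_scores(page_text):
    """Return (class, probability) pairs sorted by descending score."""
    probs = classifier.predict_proba(vectorizer.transform([page_text]))[0]
    return sorted(zip(classifier.classes_, probs), key=lambda cp: cp[1], reverse=True)

# e.g. class_scores("lecture notes and homework for the data mining course")
#      -> roughly [("CH", 0.8...), ("DH", ...), ("PH", ...)]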

  22. Baseline crawler: Visit order • The radius-1 hypothesis: If page u is an on-topic example and u links to v, then the probability that v is on-topic is higher than the probability that a randomly chosen Web page is on-topic.

  23. Baseline crawler: Visit order • Hard-focus crawling: • If a downloaded page is off-topic, stop following hyperlinks from that page. • Assume the target is class B • and for page P the classifier gives: A 80%, B 10%, C 7%, D 1%, ... • Do not follow P's links at all!

  24. Baseline crawler: Visit order • Soft-focus crawling: • obtains a page's relevance score (a score on the page's relevance to the target topic) • assigns this score to every URL extracted from this particular page, and adds them to the priority queue • Example: A 80%, B 10%, C 7%, D 1%, ... • Insert P's links with score 0.10 into the PQ
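
A sketch of the soft-focus ordering: the parent page's score for the target class is attached to every URL extracted from it and pushed into the priority queue. Names and scores below are illustrative.

import heapq

frontier = []   # priority queue of (negated score, URL); heapq is a min-heap

def soft_focus_enqueue(class_scores, target_class, extracted_urls):
    """Assign the parent page's target-class score to all of its out-links."""
    relevance = class_scores.get(target_class, 0.0)      # e.g. B -> 0.10
    for url in extracted_urls:
        heapq.heappush(frontier, (-relevance, url))

# Page P was classified as {A: 0.80, B: 0.10, C: 0.07, D: 0.01}; the target is B,
# so every link found on P enters the queue with score 0.10.
soft_focus_enqueue({"A": 0.80, "B": 0.10, "C": 0.07, "D": 0.01}, "B",
                   ["http://example.com/x", "http://example.com/y"])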

  25. Rule-based crawler: Motivation • Two important observations: • Pages not only refer to pages from the same class, but also to pages from other classes. • e.g., from "bicycle" pages to "first aid" pages • Relying only on the radius-1 hypothesis is not enough!

  26. Rule-based crawler: Motivation • The baseline crawler cannot support tunneling • "University homepages" link to "CS pages", which link to "researcher homepages", which further link to "CS papers" • Determining the score only w.r.t. the similarity to the target class is not enough!

  27. Our solution • Extract rules that statistically capture linkage relationships among the classes (topics) and guide the crawler accordingly • Intuitively, we determine relationships like "pages in class A refer to pages in class B with probability X": A → B (X)

  28. Our solution • When the crawler seeks class B and the page P at hand is of class A, • consider all paths from A to B • compute an overall score S • add links from P to the PQ with this score S • Basically, we revise the radius-1 hypothesis with class linkage probabilities.

  29. How to obtain rules?

  30. An example scenario • Assume our taxonomy has 4 classes: • department homepages (DH) • course homepages (CH) • personal homepages (PH) • sports pages (SP) • First, obtain the train-0 set • Next, for each class, assume 10 pages pointed to by the pages in the train-0 set are fetched

  31. An example scenario • The distribution of links to classes • Inter-class rules for the above distribution
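
One plausible way to turn such a link distribution into inter-class rules: for each source class, the fraction of its out-links that land in each target class becomes the rule probability. The counts below are illustrative only (chosen to be consistent with the rules quoted on the tunneling slide), not the actual experimental figures.

from collections import defaultdict

# link_counts[source][target] = number of links from pages of `source`
# to pages of `target`, counted over the training pages.
link_counts = {
    "DH": {"CH": 8, "PH": 1, "DH": 1},   # illustrative counts only
    "CH": {"CH": 5, "PH": 4, "DH": 1},
    "PH": {"PH": 6, "CH": 2, "SP": 2},
}

def derive_rules(link_counts):
    """Turn raw link counts into rules 'source -> target (probability)'."""
    rules = defaultdict(dict)
    for source, targets in link_counts.items():
        total = sum(targets.values())
        for target, count in targets.items():
            rules[source][target] = count / total
    return rules

# derive_rules(link_counts)["DH"]["CH"] == 0.8, i.e. the rule DH -> CH (0.8)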

  32. Seed and target classes are both from the class PH.

  33. Seed and target classes are both from the class PH.

  34. Rule-based crawler • The rule-based approach successfully uses class linkage information • to revise the radius-1 hypothesis • to reach an immediate reward

  35. Rule-based crawler: Tunneling • The rule-based approach also supports tunneling by a simple application of transitivity. • Consider URL #2 (of class DH) • A direct rule is: DH → PH (0.1) • An indirect rule is: from DH → CH (0.8) and CH → PH (0.4), obtain DH → PH (0.8 * 0.4 = 0.32) • And thus DH → PH (0.1 + 0.32 = 0.42)

  36. Rule-based crawler: Tunneling • Observe that • in effect, the rule-based crawler becomes aware of a path DH → CH → PH, although it has only been trained with paths of length 1. • The rule-based crawler can successfully imitate tunneling.

  37. Rule-based score computation • Chain the rules up to some predefined MAX-DEPTH (e.g., 2 or 3) • Merge the paths with the function SUM • If there are no rules whatsoever, fall back to the soft-focus score • Note that • the rule db can be represented as a graph, and • for a given target class, all cycle-free paths (except the self loop of T) can be computed (e.g., by a modified BFS)
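
A sketch of this computation: enumerate cycle-free rule paths from the page's class to the target class up to MAX-DEPTH hops, multiply the probabilities along each path, and merge the paths with SUM. If no path exists, the caller would fall back to the soft-focus score. Variable names are illustrative.

MAX_DEPTH = 3   # chain rules up to this many hops

def rule_score(rules, source, target, depth=MAX_DEPTH, visited=None):
    """Sum, over all cycle-free rule paths source -> ... -> target of length
    <= depth, the product of the rule probabilities along each path."""
    if depth == 0:
        return 0.0
    visited = (visited or set()) | {source}
    score = 0.0
    for nxt, prob in rules.get(source, {}).items():
        if nxt == target:
            score += prob                          # direct rule
        elif nxt not in visited:
            score += prob * rule_score(rules, nxt, target, depth - 1, visited)
    return score

# With the rules DH -> PH (0.1), DH -> CH (0.8), CH -> PH (0.4):
rules = {"DH": {"PH": 0.1, "CH": 0.8}, "CH": {"PH": 0.4}}
print(rule_score(rules, "DH", "PH"))   # 0.1 + 0.8 * 0.4 = 0.42, as on slide 35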

  38. Rule-based score computation

  39. Preliminary results: Set-up • DMOZ taxonomy • leaves with more than 150 URLs • 1282 classes (topics) • Train-0 set: 120K pages • Train-1 set: 40K pages pointed to by 266 interrelated classes (all about science) • Target topics are also from these 266 classes

  40. Preliminary results: Set-up • Harvest ratio: the average relevance of all pages acquired by the crawler to the target topic
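
In symbols, one way to write this definition (assuming rel(p_i, T) denotes the classifier's relevance score of crawled page p_i for the target topic T, and N is the number of pages crawled so far):

\text{harvest ratio} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rel}(p_i, T)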

  41. Preliminary results • Seeds are from DMOZ and Yahoo! • The harvest rate improves from 3% to 38% • Coverage also differs

  42. Harvest Rate

  43. Future Work • More sophisticated rule discovery techniques (e.g., the topic citation matrix of Chakrabarti et al.) • On-line refinement of the rule database • Using the entire taxonomy rather than only the leaves

  44. Acknowledgments • We gratefully thank Ö. Rauf Atay for the implementation.

  45. References • I. S. Altıngövde, Ö. Ulusoy, "Exploiting Inter-Class Rules for Focused Crawling," IEEE Intelligent Systems Magazine, to appear. • S. Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data," Morgan Kaufmann Publishers, 352 pages, 2003. • S. Chakrabarti, M. H. van den Berg, and B. E. Dom, "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery," in Proc. of the 8th International WWW Conference (WWW8), 1999.

  46. Any questions???
