
Accelerated Focused Crawling Through Online Relevance Feedback


Presentation Transcript


  1. Accelerated Focused Crawling Through Online Relevance Feedback Soumen Chakrabarti, IIT Bombay; Kunal Punera, IIT Bombay; Mallela Subramanyam, UT Austin

  2. First-generation focused crawling
  • Crawl regions of the Web pertaining to a specific topic c*, avoiding irrelevant topics
  • Guess the relevance of an unseen node v based on the relevance of u (u → v), as evaluated by the topic classifier
  • If Pr(c*|u) is large enough, then enqueue all outlinks v of u with priority Pr(c*|u)
  [Diagram: a baseline learner, trained on the Dmoz topic taxonomy with class models consisting of term statistics, classifies each newly fetched page u; the crawler picks the best URL from a frontier URL priority queue seeded with seed URLs, fetches the page, stores it in the crawl database, and submits it for classification]
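Below is a minimal sketch, in Python, of the first-generation loop on this slide. The names fetch, extract_outlinks, and the classifier.prob(topic, page) interface are hypothetical stand-ins, not the authors' API; the point is only that every outlink of u inherits u's own score Pr(c*|u) as its frontier priority.

```python
import heapq

def baseline_crawl(seed_urls, fetch, extract_outlinks, classifier, c_star,
                   threshold=0.5, max_pages=10000):
    """First-generation focused crawl: outlinks of page u inherit u's own
    relevance Pr(c*|u) as their priority in the frontier."""
    # heapq is a min-heap, so priorities are stored negated to pop the best URL first
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    crawl_db = {}                                   # url -> Pr(c*|u)

    while frontier and len(crawl_db) < max_pages:
        _, url = heapq.heappop(frontier)            # "pick best"
        if url in crawl_db:
            continue
        page = fetch(url)                           # caller-supplied page fetcher (stand-in)
        p_star = classifier.prob(c_star, page)      # baseline learner's Pr(c*|u) (stand-in)
        crawl_db[url] = p_star
        if p_star >= threshold:                     # "if Pr(c*|u) is large enough..."
            for v in extract_outlinks(page):        # every outlink gets the same priority
                heapq.heappush(frontier, (-p_star, v))
    return crawl_db
```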

  3. Baseline crawling results • 20 topics from http://dmoz.org • Half to two-thirds of pages fetched are irrelevant • “Every click on a link is a leap of faith” • Humans leap better than focused crawlers • Adequate clues in text + DOM to leap better

  4. How to seek out distant goals? • Manually collect paths (context graphs) leading to relevant goals • Use a classifier to predict link-distance to a goal from page text • Prioritize link expansion by estimated distance to relevant goals • No discrimination among different links on a page
  [Diagram: a goal page surrounded by layers of pages at link distance 1, 2, 3, ... from it; the distance classifier feeds priorities to the crawler]
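For contrast, here is a sketch of the context-graph approach this slide describes. The predict_distance, fetch, and extract_outlinks callables are hypothetical stand-ins; the sketch only illustrates that all outlinks of a page are enqueued with that page's single distance estimate, which is the "no discrimination among different links" limitation.

```python
import heapq

def context_graph_crawl(seed_urls, fetch, extract_outlinks, predict_distance,
                        max_pages=10000):
    """Context-graph crawling: prioritize link expansion by the estimated
    link distance from the source page to a relevant goal."""
    frontier = [(0, url) for url in seed_urls]      # (estimated distance, url); smaller pops first
    heapq.heapify(frontier)
    visited = set()

    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                           # caller-supplied fetcher (stand-in)
        d_hat = predict_distance(page)              # classifier's estimate of links-to-goal
        for v in extract_outlinks(page):            # every link on the page shares d_hat
            heapq.heappush(frontier, (d_hat, v))
    return visited
```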

  5. The apprentice+critic strategy
  • The baseline learner (critic), built on the Dmoz topic taxonomy with class models consisting of term statistics, classifies each newly fetched page u and reports Pr(c|u) for all classes c
  • If Pr(c*|u) is large enough, each outlink is submitted as an instance (u,v) to the apprentice learner, which is trained online with good/bad (+/-) labels derived from the critic
  • The apprentice assigns a more accurate priority, an estimate of Pr(c*|v), to node v before it enters the frontier URL priority queue
  [Diagram: the crawler picks the best URL from the frontier, fetches page u into the crawl database, and submits it for classification; approved pages feed (u,v) instances to the apprentice]
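A sketch of the apprentice+critic loop, under the same hypothetical interfaces as the earlier sketch; here extract_outlinks is assumed to yield (target URL, per-link features) pairs, and the apprentice exposes assumed train_online / score methods. The key differences from the baseline loop: the critic's verdict on a fetched page becomes an online training label for the link (u,v) that led to it, and each outlink gets its own apprentice-assigned priority instead of inheriting Pr(c*|u).

```python
import heapq, itertools

def apprentice_critic_crawl(seed_urls, fetch, extract_outlinks, critic, apprentice,
                            c_star, threshold=0.5, max_pages=10000):
    """Apprentice+critic crawl: the slow, accurate critic labels fetched pages;
    the fast apprentice learns online to rank individual links (u,v)."""
    tie = itertools.count()   # tiebreaker so the heap never compares feature payloads
    # frontier entries: (-priority, tiebreak, url, features of the link that proposed this url)
    frontier = [(-1.0, next(tie), url, None) for url in seed_urls]
    heapq.heapify(frontier)
    crawl_db = {}

    while frontier and len(crawl_db) < max_pages:
        _, _, url, link_feats = heapq.heappop(frontier)
        if url in crawl_db:
            continue
        page = fetch(url)
        p_star = critic.prob(c_star, page)            # critic's verdict Pr(c*|v) on the new page
        crawl_db[url] = p_star
        if link_feats is not None:
            apprentice.train_online(link_feats, p_star)   # online feedback for the (u,v) instance
        if p_star >= threshold:                       # only expand pages the critic approves
            for v, feats in extract_outlinks(page):   # per-link DOM/text context around each HREF
                priority = apprentice.score(feats)    # apprentice's estimate of Pr(c*|v)
                heapq.heappush(frontier, (-priority, next(tie), v, feats))
    return crawl_db
```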

  6. Design rationale • Baseline classifier specifies what to crawl • Could be a user-defined black box • Usually depends on large amounts of training data, relatively slow to train (hours) • Apprentice classifier learns how to locate pages approved by the baseline classifier • Uses local regularities in HTML, site structure • Less training data, faster training (minutes) • Guards against high fan-out (~10) • No need to manually collect paths to goals

  7. Apprentice feature design
  • HREF source page u represented by its DOM tree
  • Leaf nodes marked with offsets w.r.t. the HREF
  • Many usable representations for a term at offset d
  • A ⟨t,d⟩ tuple, e.g., ⟨"download", -2⟩
  • ⟨t,p,d⟩ where p is the path length from t to d through the LCA (least common ancestor)
  [Diagram: a DOM tree with ul, li, tt, a (HREF), em, and font nodes; TEXT leaves carry offsets @-2, @-1, @0, @0, @1, @2, @3 relative to the HREF]
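A sketch of the ⟨t,d⟩ encoding, simplified: on the slide the offsets count DOM text leaves around the HREF, while this toy version just tokenizes flat text before, inside, and after the anchor. All names here are illustrative, not the authors' code.

```python
import re

def href_offset_features(text_before, anchor_text, text_after, dmax=5):
    """Pair each nearby term t with its offset d relative to the HREF:
    anchor tokens get d = 0, preceding text d < 0, following text d > 0."""
    tokenize = lambda s: re.findall(r"[a-z0-9]+", s.lower())
    before, anchor, after = tokenize(text_before), tokenize(anchor_text), tokenize(text_after)

    features = []
    for d, t in enumerate(reversed(before[-dmax:]), start=1):
        features.append((t, -d))                   # e.g. ("download", -2)
    for t in anchor:
        features.append((t, 0))                    # all anchor tokens share offset 0
    for d, t in enumerate(after[:dmax], start=1):
        features.append((t, d))
    return features

# href_offset_features("click here to download the", "full paper", "in PDF format")
# -> [('the', -1), ('download', -2), ('to', -3), ..., ('full', 0), ('paper', 0), ('in', 1), ...]
```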

  8. Offsets of good ⟨t,d⟩ features • Plot information gain at each d, averaged over terms t • Max at d=0, falls off on both sides, but… • Clear saw-tooth pattern for most topics: why? • <li><a…><li><a…>… • <a…><br><a…><br>… • Topic-independent authoring idioms, best handled by the apprentice
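The slide's plot can be reproduced with a routine like the following sketch: for each offset d, compute the information gain of every ⟨t,d⟩ feature against the good/bad link label and average over the terms t seen at that offset. The input format (a list of (feature set, boolean label) pairs) is an assumption for illustration.

```python
from collections import defaultdict
from math import log2

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) split, in bits."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            h -= (n / total) * log2(n / total)
    return h

def info_gain_by_offset(instances):
    """instances: list of (features, label) where features is a set of (term, offset)
    pairs and label is True if the link led to a page the baseline approved.
    Returns {offset d: information gain averaged over terms t seen at d}."""
    n = len(instances)
    n_pos = sum(1 for _, y in instances if y)
    base_h = entropy(n_pos, n - n_pos)

    present = defaultdict(lambda: [0, 0])          # (t,d) -> [neg with feature, pos with feature]
    for feats, y in instances:
        for key in set(feats):
            present[key][int(y)] += 1

    gains = defaultdict(list)
    for (t, d), (neg_in, pos_in) in present.items():
        n_in = pos_in + neg_in
        pos_out, neg_out = n_pos - pos_in, (n - n_pos) - neg_in
        cond_h = (n_in / n) * entropy(pos_in, neg_in) \
               + ((n - n_in) / n) * entropy(pos_out, neg_out)
        gains[d].append(base_h - cond_h)

    return {d: sum(g) / len(g) for d, g in sorted(gains.items())}
```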

  9. Apprentice learner design • Instance ⟨(u,v), Pr(c*|v)⟩ represented by • ⟨t,d⟩ features for d up to some dmax • HREF source topics {⟨c, Pr(c|u)⟩ for all classes c} • Co-citation: w1, w2 siblings, w1 good ⇒ w2 good • Learning algorithm • Want to map features to a score in [0,1] • Discretize [0,1] into ranges, each a label ℓ • Class label ℓ has an associated value q(ℓ) • Use a naïve Bayes classifier to find Pr(ℓ|(u,v)) • Score = Σℓ q(ℓ)·Pr(ℓ|(u,v))
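A compact sketch of the learner this slide outlines, assuming a plain naive Bayes estimator with Laplace smoothing over binary ⟨t,d⟩ features (the estimator details are my choice, not necessarily the authors' exact one). The relevance target Pr(c*|v) is discretized into equal-width bins, q(ℓ) is taken as the bin midpoint, and the link score is the expectation Σℓ q(ℓ)·Pr(ℓ|(u,v)).

```python
from collections import defaultdict
from math import log, exp

class ApprenticeNB:
    """Naive Bayes apprentice over <t,d> features with a discretized relevance target."""

    def __init__(self, bins=4):
        self.bins = bins
        self.total = 0
        self.label_counts = defaultdict(int)                       # label -> #instances
        self.feat_counts = defaultdict(lambda: defaultdict(int))   # label -> feature -> count

    def _label(self, relevance):
        # map a target in [0,1] to one of `bins` equal-width bins
        return min(int(relevance * self.bins), self.bins - 1)

    def train_online(self, features, relevance):
        l = self._label(relevance)
        self.total += 1
        self.label_counts[l] += 1
        for f in set(features):
            self.feat_counts[l][f] += 1

    def score(self, features):
        if self.total == 0:
            return 0.5                                  # uninformed guess before any training
        log_post = {}
        for l in range(self.bins):
            n_l = self.label_counts[l]
            lp = log((n_l + 1) / (self.total + self.bins))            # smoothed prior Pr(l)
            for f in set(features):
                lp += log((self.feat_counts[l][f] + 1) / (n_l + 2))   # smoothed Pr(f|l)
            log_post[l] = lp
        m = max(log_post.values())
        post = {l: exp(v - m) for l, v in log_post.items()}           # stable normalization
        z = sum(post.values())
        # expected relevance: sum over labels of the bin midpoint q(l) times Pr(l | features)
        return sum(((l + 0.5) / self.bins) * p / z for l, p in post.items())
```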

  10. Apprentice accuracy • Better elimination of useless outlinks with increasing dmax • Good accuracy with dmax < 6 • Using DOM offset info improves accuracy • Small accuracy gains from • LCA distance in ⟨t,p,d⟩ • Source page topics • Co-citation features

  11. Offline apprentice trials • Run the baseline, train the apprentice • Start a new crawler at the same URLs • Let it fetch any page it schedules • A "recall" variant limits it to pages visited by the baseline crawler • Baseline loss > recall loss > apprentice loss • Small URL overlap between the crawls

  12. Online guidance from apprentice • Run baseline • Train apprentice • Re-evaluate frontier • Apprentice not as optimistic as baseline • Many URLs downgraded • Continue crawling with apprentice guidance • Immediate reduction in loss rate
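The re-evaluation step can be pictured as re-scoring every pending frontier entry with the trained apprentice, as in this small sketch (the entry layout and apprentice.score follow the hypothetical interfaces of the earlier sketches, not the authors' code):

```python
import heapq

def reprioritize_frontier(frontier, apprentice):
    """Re-score all pending (neg_priority, tiebreak, url, link_features) entries with the
    apprentice; its scores are usually lower (less optimistic) than the baseline's, so
    many URLs are downgraded before crawling continues under apprentice guidance."""
    rescored = [(-apprentice.score(feats or []), tie, url, feats)
                for _, tie, url, feats in frontier]
    heapq.heapify(rescored)
    return rescored
```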

  13. Summary • New generation of focused crawler • Discriminates between links, learning online • Apprentice easy and fast to train online • Accurate with small dmax around 4 to 6 • DOM-derived features better than text • Effective division of labor ('what' vs. 'how') • Loss rate reduced by 30% to 90% • Apprentice better at guessing relevance of unvisited nodes than baseline crawler • Benefits visible after 100 to 1000 page fetches

  14. Ongoing work • Extending to larger radius and deeper DOM + site structure • Public domain C++ software • Crawler • Asynchronous DNS, simple callback model • Can saturate dedicated 4Mbps with a Pentium2 • HTML cleaner • Simple, customizable, table-driven patch logic • Robust to bad HTML, no crash or memory leak • HTML to DOM converter • Extensible DOM node class
