FOCUSED CRAWLING
Context World Wide Web growth. Inktomi crawler: hundreds of Sun Sparc workstations with 75GB RAM and 1TB disk; over 10M pages crawled. Still only 30-40% of the Web crawled. Long refreshes (weeks up to a month). Low-precision results for crafty queries. Burden of indexing millions of pages. Inefficient location of relevant topic-specific resources when using keyword queries. 1
Why Focused? Better to cover a single galaxy than the whole universe. Work done on a relatively narrow segment of the Web. Respectable coverage at a rapid rate (due to the narrowness of the segment of interest). Small investment in hardware. Low network resource usage. 2
Core Elements Focused crawler = example-driven automatic porthole generator. Guided by a classifier and a distiller. The classifier recognizes relevance from examples embedded in a topic taxonomy. The distiller identifies topical vantage points on the Web. Based on a canonical topic taxonomy with examples. 3
Operation Synopsis Taxonomy creation. Example collection. Taxonomy selection and refinement. Interactive exploration. Training. Resource discovery. Distillation. Feedback. 4
Taxonomy Creation Pre-training classifier with: Canonical taxonomy, Corresponding examples. 5
Example Collection Collect URLs of interest (e.g., while browsing). Import collected URLs. 6
Taxonomy Selection and Refinement Propose the most common classes where examples fit best. Mark classes as GOOD. Refine taxonomy, i.e.: refine categories and/or move documents from one category to another. Major changes take a few hours to integrate for 260,000 Yahoo! documents. Smaller changes (moving documents) are interactive. 7
Interactive Exploration Propose URLs found in small neighbourhood of examples. Examine and include some of these examples. 8
Training Integrate refinements into statistical class model (classifier-specific action). 9
Distillation Identify relevant hubs by running (intermittently and/or concurrently) a topic distillation algorithm. Raise visit priorities of hubs and immediate neighbours. 10
Feedback Report most popular sites and resources. Mark results as useful/useless. Send feedback to classifier and distiller. 11
Snapshot 12
Some definitions... G = directed hypertext graph. C = tree-shaped hierarchical topic directory. D(c) = examples associated with topic node c ∈ C. C* = subset of topics marked good, representing the user's interest. Remark: a good topic is not an ancestor of another good topic. p = web page; R_C*(p) = relevance of p w.r.t. C*, a measure that must be furnished to the system. R_root(p) = 1; R_c0(p) = ∑_i R_ci(p), where {c_i} are the children of c0. 13
Crawler in terms of Graph Start by visiting all pages ∈ D(C*). Inspect V = set of visited pages. Choose an unvisited page from the crawl frontier. GOAL: visit as many relevant pages and as few irrelevant pages as possible, i.e.: find V ⊇ D(C*), with V reachable from D(C*), such that ∑_{v ∈ V} R(v) / |V| -> max. Goal attainable because relevant pages tend to cite other relevant pages. 14
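The crawl loop described above can be sketched as a best-first search over a toy graph. This is only an illustration under stated assumptions: the dictionaries `graph` and `relevance`, the function name `focused_crawl`, and the rule that an unvisited link inherits the relevance of the page citing it are all hypothetical stand-ins, not the crawler's actual data structures.

```python
import heapq

def focused_crawl(graph, relevance, seeds, budget):
    """Best-first crawl: always expand the frontier page with the highest priority.

    graph: dict page -> list of out-linked pages (toy stand-in for G)
    relevance: dict page -> R(p) in [0, 1] (toy stand-in for the classifier)
    seeds: start pages, playing the role of D(C*)
    budget: maximum number of pages to visit
    """
    visited = []
    seen = set(seeds)
    # heapq is a min-heap, so priorities are negated; seeds get top priority.
    frontier = [(-1.0, page) for page in seeds]
    heapq.heapify(frontier)
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                # Assumption: a link inherits the relevance of the citing page.
                heapq.heappush(frontier, (-relevance[page], target))
    # Objective from the slide: sum of R(v) over V, divided by |V|.
    harvest = sum(relevance[p] for p in visited) / len(visited) if visited else 0.0
    return visited, harvest

graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
relevance = {"seed": 1.0, "a": 0.9, "b": 0.1, "c": 0.8, "d": 0.0}
visited, harvest = focused_crawl(graph, relevance, ["seed"], budget=4)
print(visited, harvest)
```

With a budget of four pages, the link cited by the relevant page `a` is fetched before the link cited by the irrelevant page `b`, so `d` is never visited.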
Classification • Definitions: • good(c) = c is marked as good. • For d = document and r = root: • P(r|d) = 1; • P(c|d) = P(parent(c)|d) * P(c|d,parent(c)); • P(c|d,parent(c)) = P(c|parent(c)) * P(d|c) / ∑_i P(ci|parent(c)) * P(d|ci), where the ci range over c and its siblings; • P(d|c) depends on the document generation model; • P(c|parent(c)) = prior distribution of documents. • Steps for document generation: • Pick leaf node c* using the defined probabilities. • Class c* has a die with as many faces as there are unique tokens ∈ U. • Face t turns up with probability θ(c*,t). • Length n(d) is chosen arbitrarily by the generator. • Roll the die n(d) times and write the token corresponding to each face. • If token t occurs n(d,t) times => P(d|c*) = (n(d)! / ∏_t n(d,t)!) * ∏_t θ(c*,t)^n(d,t). 15
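A minimal sketch of the chain rule above, assuming a two-class toy taxonomy. Rolling the die n(d) times gives a multinomial likelihood, P(d|c) proportional to the product over tokens of θ(c,t)^n(d,t); the multinomial coefficient is identical for every class and cancels when normalizing among siblings, so the code omits it. The names `log_likelihood`, `class_posteriors`, `children`, `prior`, and `theta`, and all numeric values, are hypothetical. The code also assumes every θ(c,t) is nonzero (i.e., some smoothing has been applied).

```python
import math
from collections import Counter

def log_likelihood(tokens, theta_c):
    """log P(d|c) without the multinomial coefficient (same for every class)."""
    counts = Counter(tokens)
    return sum(n * math.log(theta_c[t]) for t, n in counts.items())

def class_posteriors(tokens, children, prior, theta, root="root"):
    """P(c|d) for every node via P(c|d) = P(parent(c)|d) * P(c|d, parent(c))."""
    posterior = {root: 1.0}  # P(root|d) = 1
    stack = [root]
    while stack:
        parent = stack.pop()
        kids = children.get(parent, [])
        if not kids:
            continue
        # Bayes' rule among siblings: prior P(c|parent) times likelihood P(d|c).
        logs = [math.log(prior[c]) + log_likelihood(tokens, theta[c]) for c in kids]
        top = max(logs)  # subtract the maximum for numerical stability
        weights = [math.exp(x - top) for x in logs]
        total = sum(weights)
        for c, w in zip(kids, weights):
            posterior[c] = posterior[parent] * w / total
            stack.append(c)
    return posterior

children = {"root": ["cycling", "gardening"]}
prior = {"cycling": 0.5, "gardening": 0.5}
theta = {
    "cycling": {"bike": 0.8, "soil": 0.2},
    "gardening": {"bike": 0.2, "soil": 0.8},
}
posterior = class_posteriors(["bike", "bike", "soil"], children, prior, theta)
print(posterior)
```

For this toy document the likelihoods are 0.8 * 0.8 * 0.2 = 0.128 for cycling and 0.2 * 0.2 * 0.8 = 0.032 for gardening, so the posterior for cycling is 0.128 / 0.16 = 0.8.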
Remarks on Classification Documents seen as bags of words, without order information or inter-term correlation. During crawling the task is the reverse of generation. Two types of focus possible with the classifier: Hard focus: find c* with the highest probability; if there exists an ancestor of c* such that good(ancestor) => allow future visits of links in d; else prune at d. Soft focus: page relevance R(d) = ∑_{good(c)} P(c|d); assume priority of neighbour(d) = R(d); if multiple paths lead to a page => take the maximum relevance; when a neighbour is visited => update its score. 16
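The two focus rules can be contrasted in a short sketch. The posteriors are assumed to come from a classifier and are hard-coded here; the function names, the `parent` map, and the choice to treat c* as its own ancestor are assumptions made for illustration only.

```python
def soft_relevance(posterior, good):
    """Soft focus: R(d) = sum of P(c|d) over the topics marked good."""
    return sum(posterior.get(c, 0.0) for c in good)

def hard_focus_allows(posterior, parent, leaves, good):
    """Hard focus: follow d's links only if the best leaf c* has a good ancestor.

    Assumption: c* counts as its own ancestor, so a good leaf also passes.
    """
    node = max(leaves, key=lambda c: posterior[c])  # c* = most probable leaf
    while node is not None:
        if node in good:
            return True
        node = parent.get(node)  # returns None once we step above the root
    return False

parent = {"cycling": "recreation", "gardening": "recreation", "recreation": "root"}
leaves = ["cycling", "gardening"]
good = {"cycling"}

doc_on_topic = {"root": 1.0, "recreation": 1.0, "cycling": 0.75, "gardening": 0.25}
doc_off_topic = {"root": 1.0, "recreation": 1.0, "cycling": 0.25, "gardening": 0.75}

for doc in (doc_on_topic, doc_off_topic):
    print(hard_focus_allows(doc, parent, leaves, good), soft_relevance(doc, good))
```

The off-topic page is pruned outright under hard focus, while soft focus still assigns its out-links a low but nonzero priority of 0.25.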
Distillation Goal: identify hubs. Borrowed idea: each node v of the Web graph has two scores, a(v) (authority) and h(v) (hub), with h(u) = ∑_{(u,v) ∈ E} a(v) (1) and a(v) = ∑_{(u,v) ∈ E} h(u) (2), where E = edge set (adjacency matrix). Enhancements: non-unit edge weights; forward and backward weight matrices E_F and E_B; E_F[u,v] = R(v) prevents leakage of prestige from relevant hubs to irrelevant authorities; E_B[u,v] = R(u) prevents a relevant authority from reflecting prestige onto irrelevant hubs; ρ = relevance threshold for including authorities in the graph. Steps: Construct edge set E, only for pages on different sites, with forward and backward edge weights. Apply (1) and (2), always restricting authorities using ρ. 17
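A rough sketch of the weighted iteration, assuming the forward weight is applied when hubs confer authority and the backward weight when authorities reflect score onto hubs. The function name `distill`, the default values of `rho` and `iterations`, the per-round normalization, and the toy edge list are all assumptions for illustration.

```python
def distill(edges, relevance, rho=0.5, iterations=50):
    """Relevance-weighted hub/authority iteration over cross-site edges (u, v)."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 0.0 for n in nodes}
    for _ in range(iterations):
        # a(v) = sum over (u, v) of EF[u, v] * h(u), with EF[u, v] = R(v).
        # Pages with R(v) <= rho are not admitted as authorities.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            if relevance[v] > rho:
                auth[v] += relevance[v] * hub[u]
        # h(u) = sum over (u, v) of EB[u, v] * a(v), with EB[u, v] = R(u).
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += relevance[u] * auth[v]
        # Normalize each score vector so the values stay bounded.
        for scores in (auth, hub):
            total = sum(scores.values()) or 1.0
            for n in scores:
                scores[n] /= total
    return hub, auth

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "junk")]
relevance = {"h1": 0.9, "h2": 0.4, "a1": 0.8, "a2": 0.7, "junk": 0.1}
hub, auth = distill(edges, relevance)
print(sorted(hub.items(), key=lambda kv: -kv[1])[:2])
```

In this toy graph `h1` outranks `h2` as a hub because it is itself more relevant and points only to relevant authorities, while `junk` never receives an authority score.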
Integration with the Crawler One watchdog thread: inspect new work from the crawl frontier (stored on disk); pass new work to worker threads (using shared memory buffers). Many worker threads: save details of newly explored pages in per-worker disk structures; invoke the classifier for each new page. Stop workers, collect and integrate results into the central pool (priority queue). Soft crawling -> URLs ordered by: (# page-fetches ascending, R descending). Hard crawling -> surviving URLs ordered by: # page-fetches ascending. Populate link graph. Periodically stop the crawler and execute the distiller => revisit obtained hubs + visit unvisited pages pointed to by hubs. 18
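The two frontier orderings can be expressed as sort keys. The `FrontierEntry` record and its field names are hypothetical; the real system keeps the frontier on disk in a priority queue, which this in-memory sketch does not attempt to model.

```python
from dataclasses import dataclass

@dataclass
class FrontierEntry:
    url: str
    fetches: int      # number of times this page has already been fetched
    relevance: float  # R inherited from the most relevant page linking here

def soft_order(entries):
    """Soft crawling: # page-fetches ascending, then R descending."""
    return sorted(entries, key=lambda e: (e.fetches, -e.relevance))

def hard_order(entries):
    """Hard crawling: surviving URLs by # page-fetches ascending only."""
    return sorted(entries, key=lambda e: e.fetches)

frontier = [
    FrontierEntry("http://example.org/a", fetches=1, relevance=0.9),
    FrontierEntry("http://example.org/b", fetches=0, relevance=0.2),
    FrontierEntry("http://example.org/c", fetches=0, relevance=0.7),
]
print([e.url[-1] for e in soft_order(frontier)])
print([e.url[-1] for e in hard_order(frontier)])
```

Never-fetched pages come first in both modes; only soft crawling then breaks ties by relevance, since under hard crawling irrelevant paths have already been pruned.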
Integration 19
Evaluation Performance parameters: Precision (relevance); Quality of resource discovery. Synopsis: Experimental setup; Harvesting rate of relevant pages; Acquisition robustness; Resource discovery robustness; Good resources remoteness; Effect of distillation on crawling. 20
Experimental Setup Crawler = C++ application. Operating through a firewall. Crawler run with relatively few threads. Up to 12 example web pages used per category. 6,000 URLs returned per hour. 20 topics (gardening, mutual funds, cycling, etc.). 21
Harvesting Rate of Relevant Pages Goal: high relevant-page acquisition rate. Low harvest rate -> time spent merely on eliminating irrelevant pages => better to use an ordinary crawl instead. 3 crawls done from the same sample set containing a few dozen relevant URLs. Unfocused: all out-links registered for exploration; no use of R, except for measurement => little slowdown. Soft: probably more robust than hard crawling, BUT needs more skill against unwanted topic diffusion. Problem: distinguishing between noisy and systematic drops in relevance. Hard; 22
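One plausible way to track harvest rate during a crawl is a moving average of R over the most recently fetched pages. The slides do not say how the rate was computed, so the windowing scheme, the function name `harvest_rate`, and the sample values below are assumptions.

```python
def harvest_rate(relevances, window=100):
    """Moving average of R over the most recent `window` fetched pages."""
    rates = []
    running = 0.0
    for i, r in enumerate(relevances):
        running += r
        if i >= window:
            # Drop the page that just fell out of the window.
            running -= relevances[i - window]
        rates.append(running / min(i + 1, window))
    return rates

# R values in fetch order for a hypothetical four-page crawl.
rates = harvest_rate([1.0, 0.0, 1.0, 1.0], window=2)
print(rates)
```

A short window reacts quickly but is noisy; a long window smooths noise but delays detection of a systematic drop, which is the trade-off named on the slide.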
Acquisition Robustness • Goal: maintain a proper acquisition rate without being too sensitive to the start set. • Tests: • 2 disjoint subsets, each with 30% of the starting URLs, randomly chosen. • For each subset launch a focused crawler. • Robustness checked by measuring URL overlap. • Generous visits to new IP addresses, along with a normal increase in overlapping IP addresses. 24
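The overlap test can be sketched by comparing the two crawls after equal numbers of fetches. The function name `overlap_over_time`, the `step` parameter, and the toy URL lists are hypothetical; the slides do not specify how the overlap was sampled.

```python
def overlap_over_time(crawl_a, crawl_b, step):
    """Count URLs common to both crawls after every `step` fetches.

    crawl_a, crawl_b: lists of URLs in the order each crawler fetched them.
    Returns a list of (pages_fetched, common_url_count) pairs.
    """
    points = []
    limit = min(len(crawl_a), len(crawl_b))
    for n in range(step, limit + 1, step):
        common = set(crawl_a[:n]) & set(crawl_b[:n])
        points.append((n, len(common)))
    return points

# Two crawls started from disjoint seeds (s1 vs. s2) that converge later.
crawl_a = ["s1", "x", "y", "z"]
crawl_b = ["s2", "y", "x", "w"]
points = overlap_over_time(crawl_a, crawl_b, step=2)
print(points)
```

A count that grows as the crawls proceed suggests that both crawlers are converging on the same region of the Web despite their disjoint start sets.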
URL Overlap 25
Resource Discovery Robustness 2 sets of crawlers launched from different random samples. Popularity/quality algorithm run with 50 iterations. Server overlap measured. Result: most popular sites identified by both sets of crawlers, although different sample sets were used. 27
Good Resources Remoteness Any real exploration done? Non-trivial work done by the focused crawler, i.e., pursuing certain paths while pruning others. Large number of servers found 10 or more links away from the starting set. Millions of pages lie within a distance of 10 links. 28
Effect of Distillation on Crawling A relevant page may be abandoned due to misclassification (e.g., the page has many images, or the classifier makes mistakes). Distiller reveals top hubs => new unvisited URLs. 30
Conclusion Strengths: steady collection of relevant resources; robustness to different starting conditions; localization of good resources; immunity to noise; learning specialization from examples; filtering done at data-acquisition level rather than as post-processing; crawling done to greater depths due to frontier crawling. Still to go: At what specificity can a focused crawl be sustained, i.e., how do harvest rates depend on topics? Sociology of citations between topics => insights on how the Web evolves. ... 31