The Structure of Broad Topics on the Web

Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera (Lab. for Intelligent Internet Research, IIT Bombay) and David M. Pennock (NEC Research Institute).


Presentation Transcript


  1. The Structure of Broad Topics on the Web. Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera (Lab. for Intelligent Internet Research, IIT Bombay); David M. Pennock (NEC Research Institute)

  2. Graph structure of the Web • Over two billion nodes, 20 billion links • Power-law degree distribution: Pr(degree = k) ∝ 1/k^2.1 • Looks like a “bow-tie” at large scale [Figure: bow-tie diagram labeled IN, strongly connected core (SCC), OUT; caption “This is the Web”]

  3. The need for content-based models • Why does a radius-1 expansion help in topic distillation? • Why does topic-specific focused crawling work? • Why is a global PageRank useful for specific queries? [Figure labels: query, search engine, root set; crawler, classifier, check frontier topic, prune if irrelevant; uniform jump, walk to out-neighbor]

  4. The need for content-based models • How are different topics linked to each other? • Applications: crawling, classification, clustering • Are URL collections representative of Web topic populations? • Web directories: Dmoz, Yahoo! • TREC Web track [Figure caption: “This is the Web with topics!”]

  5. How to characterize “topics” • Web directories: a natural choice • Start with http://dmoz.org • Keep pruning until all leaf topics have enough (>300) samples • Approx. 120k sample URLs • Flatten to approx. 482 topics • Train text classifier (Rainbow) • Characterize a new document d as a vector of probabilities p_d = (Pr(c|d) for all topics c) [Figure labels: test doc, classifier]

  6. Critique and defense • Cannot capture fine-grained or emerging topics • Emerging topics most often specialize existing broad topics, which rarely change • Classifier may be inaccurate • Adequate if much better than random guessing • Can compensate for errors using held-out validation data • Results depend on one Web directory • Can repeat with many others and compare

  7. Background topic distribution • What fraction of Web pages are about Health? • Sampling via random walk • PageRank walk (Henzinger et al.) • Undirected regular walk (Bar-Yossef et al.) • Make graph undirected (link:…) • Add self-loops so that all nodes have the same degree • Sample with large stride • Collect topic histograms
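A minimal sketch of the undirected regular walk listed on this slide, assuming the graph fits in memory as an adjacency map. The function names (`make_undirected`, `regular_walk_samples`), the toy edge list, and the stride value are illustrative and not from the talk. Self-loops are not stored; they are simulated by staying in place with probability 1 - deg(u)/D, where D is the maximum degree.

```python
import random

def make_undirected(edges):
    """Build an undirected adjacency map from directed (u, v) link pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {u: sorted(neighbors) for u, neighbors in adj.items()}

def regular_walk_samples(adj, start, num_samples, stride, rng):
    """Random walk on the graph made regular by implicit self-loops.

    Every node is padded to degree D = max degree, so the walk follows a real
    edge with probability deg(u)/D and otherwise stays put. One node is
    recorded every `stride` steps.
    """
    max_degree = max(len(neighbors) for neighbors in adj.values())
    node, steps, samples = start, 0, []
    while len(samples) < num_samples:
        neighbors = adj[node]
        if rng.random() < len(neighbors) / max_degree:
            node = rng.choice(neighbors)
        steps += 1
        if steps % stride == 0:
            samples.append(node)
    return samples

# Hypothetical toy link graph.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
adj = make_undirected(edges)
samples = regular_walk_samples(adj, "a", num_samples=50, stride=10,
                               rng=random.Random(0))
```

On a connected graph the padded walk has a uniform stationary distribution, which is why classifying the sampled pages gives an estimate of the background topic histogram.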

  8. Convergence • Start from pairs of diverse topics • Two random walks, sample from each walk • Measure distance between topic distributions • L1 distance |p1 − p2| = Σ_c |p1(c) − p2(c)|, which lies in [0, 2] • Falls below 0.05 to 0.2 within 300 to 400 physical pages
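The L1 distance used to compare two topic distributions can be sketched as follows. Representing a distribution as a dict from topic name to probability is an assumption made for illustration, as are the example values.

```python
def l1_distance(p1, p2):
    """Sum over topics c of |p1(c) - p2(c)|; missing topics count as 0."""
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)

# Identical distributions are at distance 0; disjoint ones reach the maximum, 2.
identical = l1_distance({"Health": 0.5, "Sports": 0.5},
                        {"Health": 0.5, "Sports": 0.5})
disjoint = l1_distance({"Health": 1.0}, {"Sports": 1.0})
close = l1_distance({"Health": 0.6, "Sports": 0.4},
                    {"Health": 0.5, "Sports": 0.5})
```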

  9. Biases in topic directories • Use Dmoz to train a classifier • Sample the Web • Classify samples • Diff Dmoz topic distribution from Web sample topic distribution • Report maximum deviation in fractions • NOTE: Not exactly Dmoz

  10. Topic-specific degree distribution • Preferential attachment: connect to v with probability proportional to the current degree of v, regardless of topic • More realistic: u has a topic, and links to v with related topics • Unclear if power-law should still hold • Holds for large degree [Figure labels: intra-topic linkage, inter-topic linkage]

  11. Random forward walk without jumps • Sampling walk is designed to mix topics well • How about walking forward without jumping? • Start from a page u0 on a specific topic • Sample many forward random walks (u0, u1, …, ui, …) • Compare (Pr(c|ui) for all c) with (Pr(c|u0) for all c) and with the background distribution
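A small sketch of this experiment, assuming out-links are held in a dict and that a precomputed dict `topic_of` stands in for the classifier output Pr(c|u). The names `forward_walk` and `mean_drift_by_hop` and the three-page toy chain are hypothetical; only the comparison with the start page is shown.

```python
import random

def l1_distance(p, q):
    """L1 distance between two topic distributions stored as dicts."""
    topics = set(p) | set(q)
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in topics)

def forward_walk(out_links, start, length, rng):
    """Follow out-links uniformly at random, with no jumps; stop at dead ends."""
    path, node = [start], start
    for _ in range(length):
        successors = out_links.get(node, [])
        if not successors:
            break
        node = rng.choice(successors)
        path.append(node)
    return path

def mean_drift_by_hop(out_links, topic_of, start, length, num_walks, rng):
    """Average L1 distance from the start page's topic vector at each hop."""
    totals = [0.0] * (length + 1)
    counts = [0] * (length + 1)
    for _ in range(num_walks):
        for hop, page in enumerate(forward_walk(out_links, start, length, rng)):
            totals[hop] += l1_distance(topic_of[page], topic_of[start])
            counts[hop] += 1
    # Hops that no walk reached are left out of the result.
    return [t / c for t, c in zip(totals, counts) if c > 0]

# Hypothetical chain u0 -> u1 -> u2 with made-up classifier outputs.
out_links = {"u0": ["u1"], "u1": ["u2"], "u2": []}
topic_of = {
    "u0": {"Health": 1.0},
    "u1": {"Health": 0.8, "Sports": 0.2},
    "u2": {"Sports": 1.0},
}
drift = mean_drift_by_hop(out_links, topic_of, "u0", length=5, num_walks=10,
                          rng=random.Random(0))
```

Replacing `topic_of[start]` with a background distribution would give the second comparison mentioned on the slide.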

  12. Observations and implications • Forward walks wander away from starting topic slowly • But do not converge to the background distribution • Global PageRank ok also for topic-specific queries • Jump parameter d = 0.1 to 0.2 • Topic drift not too bad within path length of 5 to 10 • Prestige conferred mostly by same-topic neighbors • Also explains why focused crawling works [Figure labels: with probability d, jump to a random node; with probability (1 − d), jump to an out-neighbor uniformly at random; jump; high-prestige node]
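The walk in the figure (jump to a random node with probability d, otherwise follow a uniformly random out-link) can be sketched as a PageRank power iteration. This is a generic textbook version, not the authors' code: the default d = 0.15 is simply a value inside the 0.1 to 0.2 range on the slide, the treatment of pages without out-links (spreading their mass uniformly) is an assumption, and the toy graph is made up.

```python
def pagerank(out_links, d=0.15, iterations=50):
    """Power iteration for PageRank with jump probability d."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node receives the uniform jump mass d / n.
        nxt = {u: d / n for u in nodes}
        for u in nodes:
            successors = out_links.get(u, [])
            if successors:
                share = (1 - d) * rank[u] / len(successors)
                for v in successors:
                    nxt[v] += share
            else:
                # Assumption: a page with no out-links jumps uniformly.
                share = (1 - d) * rank[u] / n
                for v in nodes:
                    nxt[v] += share
        rank = nxt
    return rank

# Hypothetical toy graph: both "a" and "b" link to "hub".
rank = pagerank({"a": ["hub"], "b": ["hub"], "hub": ["a"]})
top = max(rank, key=rank.get)
```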

  13. Citation matrix • Given a page is about topic i, how likely is it to link to topic j? • Matrix C[i,j] = probability that page about topic i links to page about topic j • Soft counting: for each link from page u to page v, C[i,j] += Pr(i|u)Pr(j|v) • Applications • Classifying Web pages into topics • Focused crawling for topic-specific pages • Finding relations between topics in a directory [Figure: link from page u to page v]
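A minimal sketch of the soft-counting rule, assuming classifier outputs are available as a dict `topic_probs` mapping each page to its topic probabilities. The function name, the toy pages, and the final row normalization (used here to turn accumulated counts into per-row probabilities) are illustrative assumptions, not details from the talk.

```python
def soft_citation_matrix(edges, topic_probs, topics):
    """Accumulate C[i][j] += Pr(i|u) * Pr(j|v) over links (u, v), then row-normalize."""
    counts = {i: {j: 0.0 for j in topics} for i in topics}
    for u, v in edges:
        for i in topics:
            p_i_u = topic_probs[u].get(i, 0.0)
            if p_i_u == 0.0:
                continue
            for j in topics:
                counts[i][j] += p_i_u * topic_probs[v].get(j, 0.0)
    matrix = {}
    for i in topics:
        total = sum(counts[i].values())
        # Rows with no observed links stay at zero.
        matrix[i] = {j: (counts[i][j] / total if total > 0 else 0.0)
                     for j in topics}
    return matrix

# Hypothetical pages and classifier outputs.
topics = ["Arts", "Shopping"]
topic_probs = {
    "p1": {"Arts": 1.0},
    "p2": {"Arts": 0.5, "Shopping": 0.5},
}
C = soft_citation_matrix([("p1", "p2")], topic_probs, topics)
```

Soft counting spreads each link over all topic pairs in proportion to the classifier's confidence, so a single uncertain page does not force a hard assignment to one cell.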

  14. True topic From topic To topic  Guessed topic  Citation, confusion, correction From topic Classifier’s confusion on held-out documents can be used to correct confusion matrix ArtsBusinessComputersGamesHealthHomeRecreationReferenceScienceShoppingSocietySports To topic 

  15. Fine-grained views of citation • Clear block-structure derived from coarse-grain topics • Strong diagonal blocks reflect tightly-knit topic communities • Prominent off-diagonal entries (/Arts/Music to /Shopping/Music) raise design issues for taxonomy editors and maintainers

  16. Concluding remarks • A model for content-based communities • New characterization and measurement of topical locality on the Web • How to set the PageRank jump parameter? • Topical stability of topic distillation • Better crawling and classification • A tool for Web directory maintenance • Fair sampling and representation of topics • Block-structure and off-diagonals • Taxonomy inversion
