Enhanced topic distillation using text, markup tags, and hyperlinks

Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
www.cse.iitb.ac.in/~soumen

Topic distillation
• Given a query or some example URLs
• Collect a relevant subgraph (community) of the Web
• Bipartite reinforcement between hubs and authorities
• Prototypes:
  • HITS, Clever, SALSA
  • Bharat and Henzinger
[Figure labels: Keyword query; Search engine; Root set; Expanded set; Base set]
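
As a point of reference for the "bipartite reinforcement" bullet, here is a minimal sketch of a plain HITS-style iteration. The function name `hits`, the edge-list input, and the fixed iteration count are assumptions made for illustration; it does not reproduce any of the listed prototypes exactly.

```python
def hits(edges, iterations=20):
    """Plain HITS-style iteration on a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Rescale both score vectors to unit L2 norm.
        for scores in (auth, hub):
            norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

Calling `hits([("h1", "a1"), ("h2", "a1"), ("h1", "a2")])` ranks `a1` above `a2` as an authority and `h1` above `h2` as a hub.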

Two issues
• How to collect the base set
  • Radius-1 expansion is arbitrary
  • Content relevance must play a role
• How to spread prestige along links
  • Instability of HITS (Borodin, Lempel, Zheng)
  • Stability of PageRank (Zheng)
  • Stochastic variants of HITS (Lempel)
• Need better recall collecting base graph
• Need accurate ‘boundaries’ around it

Challenges and limitations
• Topic distillation results deteriorating
  • Web authoring style in flux since 1996
  • Complex pages, templates, cloaks
  • File or page boundary less meaningful
  • “Clique attacks”: rampant multi-host ‘nepotism’ via rings, ads, banner exchanges
• Models too simplistic
  • Hub and authority symmetry is illusory
  • Coarse-grain hub model ‘leaks’ authority
  • Ad-hoc linear segmentation not content-aware

Clique attacks!
[Figure annotations: “Irrelevant links form pseudo-community”; “Relevant regions that lead to inclusion of page in base set”]

Benign drift and generalization
[Figure annotations: “This section specializes on ‘Shakespeare’”; “Remaining sections generalize and/or drift”]

A fine-grained hypertext model
[Figure: Document Object Model (DOM) tree of the page below. html branches into head and body; body holds a table whose rows contain a nested table (irrelevant subtree, links to art.qaz.com and ski.qaz.com) and a ul list (relevant subtree, links to www.fromages.com and www.teddingtoncheese.co.uk). The “frontier of differentiation” is marked on the tree.]

    <html>…<body>…
    <table …>
      <tr><td>
        <table …>
          <tr><td><a href="http://art.qaz.com">art</a></td></tr>
          <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
        </table>
      </td></tr>
      <tr><td>
        <ul>
          <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
          <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
          …
        </ul>…
      </td></tr>
    </table>…
    </body></html>

Preliminary approaches
• Apply HITS to fine-grained base graph
  • Blocked reinforcement
• Model DOM trees as resistance or flow networks
  • Ad-hoc decay factors
• Apply B&H outlier elimination to every DOM node
  • Hot absorbs cold, includes drift-enhancing links
[Figure labels: Cold; Hot; “Warm enough to figure as one hub”; node values 3, 9, 6, 12, 3]

Generative model for hub text
• Global hub text distribution Φ_0 relevant to given query
• Authors use internal DOM nodes to hierarchically specialize Φ_0 into Φ_I
• At a certain frontier, local models are ‘frozen’ and text generated
[Figure labels: Global term distribution Φ_0; Progressive ‘distortion’; Model frontier; Φ_I; Other pages]

Examples using the binary model
• Binary model:
  • Code length for document d
  • Cost for specializing a term distribution
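
The formulas for this slide are not present in the transcript. One common reading of a binary term model is sketched below, and it should be taken as an assumption rather than the authors' exact definitions. Each term occurs in a document independently with its own probability, the code length of a document is its negative log-likelihood in bits, and the cost of specializing one distribution into another is their KL divergence. The names `code_length` and `specialization_cost` are hypothetical, and all probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def code_length(doc_terms, phi):
    """Bits needed to encode a document (a set of terms) under a binary
    term model phi, which maps term -> probability of occurring."""
    bits = 0.0
    for term, p in phi.items():
        # A present term costs -log2(p); an absent one costs -log2(1 - p).
        bits -= math.log2(p if term in doc_terms else 1.0 - p)
    return bits

def specialization_cost(phi_parent, phi_child):
    """KL(parent || child) in bits, summed over independent binary terms.
    The direction of the divergence is an assumption of this sketch."""
    cost = 0.0
    for term, p in phi_parent.items():
        q = phi_child[term]
        cost += p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))
    return cost
```

With `phi = {"cheese": 0.5, "ski": 0.5}`, every document costs exactly one bit per vocabulary term, and specializing `phi` into itself costs nothing.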

Discovering the frontier
• Use Φ_u to directly generate text snippets in the subtree rooted at u
• Expand to children v and use different params Φ_v for each tree
• Greedily pick better local choice
• Cumulative distortion cost = KL(Φ_0 ‖ Φ_u) + … + KL(Φ_u ‖ Φ_v)
[Figure labels: Reference distribution Φ_0; u; v; D_v]
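
The greedy choice between freezing the model at a node and expanding to its children can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' implementation. Text snippets are sets of terms stored only at leaves, per-node models are Laplace-smoothed binary term models, and a node is expanded when the children's data cost plus their distortion from the parent is lower than the parent's own data cost. The `Node` class and all function names are hypothetical.

```python
import math

class Node:
    def __init__(self, name, children=(), snippets=()):
        self.name = name
        self.children = list(children)
        self.own = [set(s) for s in snippets]  # text held at a leaf

    def snippets(self):
        # All snippets in the subtree rooted at this node.
        out = list(self.own)
        for child in self.children:
            out.extend(child.snippets())
        return out

def estimate(snips, vocab):
    # Laplace-smoothed probability that a term occurs in a snippet.
    n = len(snips)
    return {t: (sum(t in s for s in snips) + 1) / (n + 2) for t in vocab}

def data_cost(snips, phi):
    # Code length (bits) of the snippets under binary model phi.
    return -sum(math.log2(p if t in s else 1 - p)
                for s in snips for t, p in phi.items())

def kl(p, q):
    return sum(p[t] * math.log2(p[t] / q[t]) +
               (1 - p[t]) * math.log2((1 - p[t]) / (1 - q[t])) for t in p)

def frontier(u, vocab):
    """Names of the nodes at which the local model is frozen."""
    if not u.children:
        return [u.name]
    phi_u = estimate(u.snippets(), vocab)
    keep = data_cost(u.snippets(), phi_u)
    expand = 0.0
    for v in u.children:
        phi_v = estimate(v.snippets(), vocab)
        expand += kl(phi_u, phi_v) + data_cost(v.snippets(), phi_v)
    if keep <= expand:          # greedy local choice: stop at u
        return [u.name]
    return [name for v in u.children for name in frontier(v, vocab)]
```

On a node whose two children talk about different things (one only "cheese", one only "ski"), expansion is cheaper and the frontier drops to the children. When both children carry the same vocabulary, the frontier stays at the parent.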

Exploiting co-citation in our model
• Initial values of leaf hub scores = target auth scores
• Segment tree using hub scores
• Frontier microhubs accumulate scores
• Aggregate hub scores are copied back to leaves
• Non-linear transform, unlike HITS
[Figure: four numbered panels on one hub DOM tree pointing to ‘known’ authorities. Leaf scores shown: 0.10, 0.20, 0.01, 0.06, 0.05, 0.13; after aggregation: 0.10, 0.20, 0.12, 0.12, 0.12, 0.13. Annotation on the low-scoring leaves: “Have reason to believe these could be good too”.]
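
The numbers on this slide suggest that the leaves under one frontier microhub (0.01, 0.06, 0.05) are summed to 0.12 and that this sum is written back to each of them, while leaves outside the microhub keep their scores. A minimal sketch of that reading follows. The function name and the index-list representation of microhubs are assumptions.

```python
def smooth_hub_scores(leaf_scores, microhubs):
    """leaf_scores: hub scores, one per leaf link of a hub page.
    microhubs: list of lists of leaf indices, one inner list per
    frontier microhub. Leaves in no microhub keep their own score."""
    smoothed = list(leaf_scores)
    for leaves in microhubs:
        # The frontier node accumulates the scores of its leaves ...
        total = sum(leaf_scores[i] for i in leaves)
        # ... and the aggregate is copied back to every one of them.
        for i in leaves:
            smoothed[i] = total
    return smoothed
```

For example, `smooth_hub_scores([0.10, 0.20, 0.01, 0.06, 0.05, 0.13], [[2, 3, 4]])` leaves the first, second, and last scores unchanged and sets the three middle leaves to their common sum (0.12 up to floating-point rounding).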

Complete algorithm
• Collect root set and base set
• Pre-segment using text and mark relevant micro-hubs to be pruned
• Set authority scores to 1 for root-set pages only
• Iterate
  • Transfer from authority to hub leaves
  • Re-segment hub DOM trees using link + text
  • Smooth and redistribute hub scores
  • Transfer from hub leaves to authority roots
• Report top authority and ‘hot’ microhubs
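
The iteration can be sketched end to end as below, as a simplified illustration only. The real re-segmentation step works on DOM trees with link and text evidence. Here it is replaced by a caller-supplied `segment` function (hypothetical) that returns disjoint groups of leaf indices. The function name `distill`, the data layout, and the sum-to-one normalization are likewise assumptions.

```python
def distill(hubs, root_set, segment, iterations=20):
    """hubs: {hub_url: [target_url for each leaf link]}.
    segment(hub_url, leaf_scores) -> list of lists of leaf indices,
    a stand-in for DOM re-segmentation; groups must be disjoint."""
    # Only root-set pages start with authority 1; all others start at 0.
    auth = {t: (1.0 if t in root_set else 0.0)
            for links in hubs.values() for t in links}
    for _ in range(iterations):
        new_auth = dict.fromkeys(auth, 0.0)
        for hub_url, links in hubs.items():
            # Transfer from authorities to hub leaves.
            leaf = [auth[t] for t in links]
            # Re-segment, then smooth and redistribute hub scores.
            for group in segment(hub_url, leaf):
                total = sum(leaf[i] for i in group)
                for i in group:
                    leaf[i] = total
            # Transfer from hub leaves back to authorities.
            for target, score in zip(links, leaf):
                new_auth[target] += score
        norm = sum(new_auth.values()) or 1.0
        auth = {t: s / norm for t, s in new_auth.items()}
    return auth
```

With one hub linking to `a`, `b`, and `c`, root set `{"a"}`, and a segmentation that groups the first two links, `b` picks up authority through co-citation with `a` while `c` stays at zero.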

Experimental setup
• Large data sets
  • 28 queries from Clever, >20 topics from Dmoz
  • Collect 2000…10000 pages per query/topic
  • Several million DOM nodes and fine links
• Find top authorities using various algos
• Measurements + anecdotes
  • For ad-hoc query, measure cosine similarity of authorities with root-set centroid in vector space
  • Compare HITS, DOM, DOM+Text
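
The cosine-similarity measurement can be sketched as follows. Raw term-frequency vectors and whitespace tokenization are assumptions made to keep the example short. The slides do not say which term weighting was used, and all function names are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    # Raw term frequencies over lowercased whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}

def mean_similarity(authority_texts, root_texts):
    """Average cosine similarity of authorities to the root-set centroid."""
    c = centroid([tf_vector(t) for t in root_texts])
    sims = [cosine(tf_vector(t), c) for t in authority_texts]
    return sum(sims) / len(sims)
```

An authority that shares vocabulary with the root set scores above zero, and one with no terms in common scores exactly zero. A drop in the average over the top authorities is the sign of drift this measurement looks for.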

Avoiding topic drift via micro-hubs
[Plot panels: “Query: cycling” (no danger of topic drift); “Query: affirmative action” (topic drift from software sites)]

Empirical convergence
• Convergence for all queries within 20 iterations
• Faster convergence for drift-free graphs, slower for graphs that posed a danger of topic drift
• Very important not to set all auth scores > 0

Results for the Clever benchmark
• Take top 40 auths
• Find average cosine similarity to root set centroid
• Similarity: HITS < DOM+Text < DOM
  • DOM alone cannot prune well enough: most top auths from root set
  • HITS drifts often

Dmoz experiments and results
• 223 topics from http://dmoz.org
• Sample root set URLs from a class c
• Top authorities not in root set submitted to Rainbow classifier
• Σ_d Pr(c | d) is the expected number of relevant documents
• DOM+Text best
[Figure labels: DMoz; Music; Sample; Root set; Expanded set; Top authority; Train; Rainbow classifier; Test]

Anecdotes
• “amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
• New algorithm reduces drift
• Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
• Mixed hubs and clique attacks rampant

Application: surfing like humans
Focused crawling:
    Train a topic classifier
    Initialize priority queue to a few sample URLs about a topic
    Assume they have relevance = 1
    Repeat
        Fetch page most relevant to topic
        Estimate relevance R using classifier
        Guess that all outlinks have relevance R
        Add outlinks to priority queue
• Problem: average out-degree is too high (~10)
• Discovering irrelevance after 10X more work
• Can we use DOM and text to bias the ‘walk’?
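
The baseline focused-crawling loop on this slide maps directly onto a priority queue. The sketch below assumes the caller supplies `fetch` and `relevance` callables (hypothetical interfaces standing in for a real fetcher and a trained topic classifier). The `budget` parameter is also an addition for illustration.

```python
import heapq

def focused_crawl(seeds, fetch, relevance, budget=100):
    """fetch(url) -> (text, outlinks); relevance(text) -> float in [0, 1].
    Both callables are hypothetical and supplied by the caller."""
    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-1.0, url) for url in seeds]  # seeds assumed relevance 1
    heapq.heapify(queue)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < budget:
        _, url = heapq.heappop(queue)       # most promising page first
        text, outlinks = fetch(url)
        r = relevance(text)                 # estimate relevance R
        visited.append((url, r))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Guess that every outlink inherits the relevance R.
                heapq.heappush(queue, (-r, link))
    return visited
```

Because every outlink inherits the same guess R, an irrelevant page is recognized only after it has been fetched. That is the cost the slide's question about using DOM and text to bias the walk aims to reduce.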

Preliminary results
[Figure labels: Standard focused crawler; Relevance R1; Features collected from source page DOM; Meta-learner; Relevance R2; Feedback; Promising and unpromising ‘clicks’]

Summary
• Hypertext shows complex idioms, missed by coarse-grained graph model
• Enhanced fine-grained distillation
  • Identifies content-bearing ‘hot’ micro-hubs
  • Disaggregates hub scores
  • Reduces topic drift caused by mixed hubs and pseudo-communities
• Application: online reinforcement learning
• Need probabilistic combination of evidence from text and links
