Web Spam Detection with Anti-Trust Rank Vijay Krishnan Rashmi Raj Computer Science Department Stanford University
The World Wide Web • Huge • Distributed content creation, linking (no coordination) • Structured databases, unstructured text, semi-structured data. • Content includes truth, lies, obsolete information, contradictions, …
PageRank • Intuition: “a page is important if important pages link to it.” • In high-falutin’ terms: importance = the principal eigenvector of the stochastic matrix of the Web. (A few fixups needed.)
PageRank • Web graph encoded by matrix M • N × N matrix (N = number of web pages) • M_ij = 1/|O(j)| iff there is a link from j to i • M_ij = 0 otherwise • O(j) = set of pages node j links to • Define matrix A as follows • A_ij = βM_ij + (1-β)/N, where 0<β<1 • 1-β is the “tax” discussed in prior lecture • Page rank r is the principal eigenvector of A • Ar = r
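The definition Ar = r can be approximated by repeatedly applying A to a starting vector (power iteration). The following is a minimal sketch, not the authors' code: the function name, the three-page toy web, β = 0.85 and the iteration count are all illustrative assumptions, and every page is assumed to have at least one out-link.

```python
def pagerank(out_links, beta=0.85, iterations=100):
    """Power iteration for r = A r, with A_ij = beta * M_ij + (1 - beta) / N.

    out_links maps each page to the list of pages it links to.
    Assumption: every page has at least one out-link (no dead ends).
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # The (1 - beta) / N term: every page gets an equal share of the "tax".
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            # beta * M_ij * r_j, with M_ij = 1 / |O(j)| for each link j -> i.
            share = beta * rank[j] / len(out_links[j])
            for i in out_links[j]:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Because no page is a dead end, the ranks keep summing to 1 at every step, so the result can be read directly as a probability distribution over pages.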
Many Random Walkers Model • Imagine a large number M of independent, identical random walkers (M ≫ N) • At any point in time, let M(p) be the number of random walkers at page p • The page rank of p is the fraction of random walkers that are expected to be at page p i.e., E[M(p)]/M.
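The walker model can be tried out with a small Monte Carlo simulation. This is a sketch under stated assumptions: the slide does not spell out how a walker moves, so the code assumes each walker follows a random out-link with probability β and otherwise jumps to a uniformly random page (the "tax" step). The function name, the toy web, the walker count and the fixed seed are illustrative.

```python
import random


def simulate_walkers(out_links, num_walkers=5000, steps=30, beta=0.85, seed=0):
    """Estimate E[M(p)] / M by moving num_walkers independent random walkers."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    # Start every walker on a uniformly random page.
    positions = [rng.choice(pages) for _ in range(num_walkers)]
    for _ in range(steps):
        for w, page in enumerate(positions):
            if rng.random() < beta:
                # Follow one of the current page's out-links at random.
                positions[w] = rng.choice(out_links[page])
            else:
                # Assumed teleport step: jump to any page at random.
                positions[w] = rng.choice(pages)
    counts = {p: 0 for p in pages}
    for page in positions:
        counts[page] += 1
    return {p: counts[p] / num_walkers for p in pages}


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(toy_web)
print(fractions)
```

With more walkers the observed fractions fluctuate less around the expected values E[M(p)]/M.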
Economic Considerations • Search has become the default gateway to the web • Very high premium to appear on the first page of search results • e.g., e-commerce sites • advertising-driven sites
What is Web Spam? • Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, incommensurate with page’s real value • Spam = web pages that are the result of spamming • This is a very broad definition • SEO industry might disagree! • SEO = search engine optimization • Approximately 10-15% of web pages are spam
Types of Spamming Techniques • Term spamming • Manipulating the text of web pages in order to appear relevant to queries • Link spamming • Creating link structures that boost page rank or hubs and authorities scores
Link Spam • Three kinds of web pages from a spammer’s point of view • Inaccessible pages • Accessible pages • e.g., web log comments pages • spammer can post links to his pages • Own pages • Completely controlled by spammer • May span multiple domain names
Link Spam Detection • Open research area • One approach: TrustRank
Trust Rank • Basic principle: approximate isolation • It is rare for a “good” page to point to a “bad” (spam) page • Sample a set of “seed pages” from the web. • Set trust of each trusted page (a seed page judged to be good) to 1 • Propagate trust through links • Each page gets a trust value between 0 and 1 • Use a threshold value and mark all pages below the trust threshold as spam
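The steps above can be sketched in a few lines. The slide does not say how trust is propagated, so this sketch assumes a PageRank-style iteration whose teleport step returns only to the good seed pages, with scores rescaled at the end so the largest is 1. The function names, the toy graph, β = 0.85 and the threshold 0.1 are illustrative assumptions, not values from the slides.

```python
def trust_rank(out_links, good_seeds, beta=0.85, iterations=50):
    """Propagate trust forward along links from known-good seed pages."""
    pages = sorted(out_links)
    seed_share = 1.0 / len(good_seeds)
    # Teleport vector: trust is re-injected only at the seed pages.
    bias = {p: (seed_share if p in good_seeds else 0.0) for p in pages}
    trust = dict(bias)
    for _ in range(iterations):
        new_trust = {p: (1.0 - beta) * bias[p] for p in pages}
        for j in pages:
            if not out_links[j]:
                continue  # a page without out-links passes no trust on
            share = beta * trust[j] / len(out_links[j])
            for i in out_links[j]:
                new_trust[i] += share
        trust = new_trust
    top = max(trust.values())
    # Rescale (an assumption of this sketch) so every score lies in [0, 1].
    return {p: trust[p] / top for p in pages}


def flag_low_trust(trust, threshold):
    """Mark every page whose trust falls below the threshold."""
    return {p for p, t in trust.items() if t < threshold}


# Hypothetical web: good pages never link to the two spam pages.
toy_web = {
    "good": ["news"],
    "news": ["good", "blog"],
    "blog": ["news"],
    "spam1": ["spam2", "good"],
    "spam2": ["spam1"],
}
trust = trust_rank(toy_web, good_seeds={"good"})
print(sorted(flag_low_trust(trust, threshold=0.1)))
```

In this toy graph no trusted page links to the spam pages, so they receive no trust at all. On a real web graph the isolation is only approximate, and the threshold decides how much leaked trust is tolerated.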
Anti-Trust Approach • Broadly based on the same “approximate isolation principle” • This principle also implies that the pages pointing to spam pages are very likely to be spam pages themselves. • Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages. • A page can be classified as a spam page if it has Anti-Trust Rank value more than a chosen threshold value.
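A minimal sketch of the reverse propagation, assuming the same PageRank-style iteration as TrustRank but run on the reversed graph and seeded with known spam pages. The function names, the toy graph, β = 0.85 and the threshold 0.05 are illustrative assumptions, and every link target is assumed to appear as a key of `out_links`.

```python
def anti_trust_rank(out_links, spam_seeds, beta=0.85, iterations=50):
    """Propagate anti-trust backward: from a page to the pages linking to it."""
    pages = sorted(out_links)
    # Reverse the graph: in_links[i] lists the pages j that have a link j -> i.
    in_links = {p: [] for p in pages}
    for j in pages:
        for i in out_links[j]:
            in_links[i].append(j)
    seed_share = 1.0 / len(spam_seeds)
    # Teleport vector: anti-trust is re-injected only at the spam seeds.
    bias = {p: (seed_share if p in spam_seeds else 0.0) for p in pages}
    score = dict(bias)
    for _ in range(iterations):
        new_score = {p: (1.0 - beta) * bias[p] for p in pages}
        for i in pages:
            if not in_links[i]:
                continue  # nobody links here, so nothing to pass back
            share = beta * score[i] / len(in_links[i])
            for j in in_links[i]:
                new_score[j] += share
        score = new_score
    return score


def flag_high_anti_trust(score, threshold):
    """Classify pages whose Anti-Trust Rank exceeds the threshold as spam."""
    return {p for p, s in score.items() if s > threshold}


# Hypothetical web: two farm pages point at a spam shop; good pages do not.
toy_web = {
    "good": ["news"],
    "news": ["good"],
    "farm1": ["farm2", "shop"],
    "farm2": ["shop"],
    "shop": ["farm1", "good"],
}
scores = anti_trust_rank(toy_web, spam_seeds={"shop"})
print(sorted(flag_high_anti_trust(scores, threshold=0.05)))
```

The spam shop links to a good page, but that does not hurt the good page: anti-trust only flows to pages that point at spam, not to pages that spam points at.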
Seed Set selection • Seed spam set chosen from pages with high page rank. • Nearly 100% of URLs containing certain terms such as {viagra, gambling, hardporn} as substrings are spam. Use these for evaluation. • Some seed pages were also chosen by an oracle (human expert).
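The substring heuristic used for evaluation is easy to sketch. The three terms are the ones quoted on the slide; the function name, the case-insensitive matching and the example URLs are assumptions made for illustration.

```python
SPAM_TERMS = ("viagra", "gambling", "hardporn")  # terms quoted on the slide


def looks_like_spam(url, terms=SPAM_TERMS):
    """Heuristic label: True if any term occurs as a substring of the URL."""
    lowered = url.lower()  # assumption: matching ignores letter case
    return any(term in lowered for term in terms)


# Hypothetical URLs, used only to exercise the function.
for url in ("http://cheap-viagra.example.com/", "http://www.example.org/research"):
    print(url, looks_like_spam(url))
```

A label of this kind is cheap to compute over a whole crawl, which makes it usable as an approximate ground truth when no human judgments are available.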
Results • Overall percentage of “spam” pages = 0.28%. • Ratio of the average page rank of “spam” pages to the overall average page rank = 2.6. • % of “spam” pages in: • top 1000 Anti-Trust Rank pages = 25.3% • bottom 1000 Trust Rank pages = 0.68% • Ratio of average page ranks of spam pages returned by ATR vs. TR is roughly 6.
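Figures such as "% of spam pages in the top 1000" are percentages of labeled spam among the k highest-scoring (or lowest-scoring) pages. A minimal sketch of that measurement follows; the function name, the `lowest` flag and the four-page toy data are hypothetical and are not the dataset behind the numbers above.

```python
def spam_percentage(scores, is_spam, k, lowest=False):
    """Percentage of labeled spam among the k highest-scoring pages.

    With lowest=True the k lowest-scoring pages are used instead, as one
    would for the bottom of a trust ranking.
    """
    ranked = sorted(scores, key=scores.get, reverse=not lowest)[:k]
    return 100.0 * sum(1 for page in ranked if is_spam(page)) / len(ranked)


# Hypothetical scores and labels for four pages.
toy_scores = {"p1": 0.9, "p2": 0.5, "p3": 0.4, "p4": 0.1}
labeled_spam = {"p1", "p3"}
print(spam_percentage(toy_scores, labeled_spam.__contains__, k=2))
```

The same function covers both rows of the comparison: top k by Anti-Trust Rank score, and bottom k by TrustRank score with `lowest=True`.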
References • The PageRank citation ranking: Bringing order to the web. L. Page, S. Brin, R. Motwani and T. Winograd. Technical Report, Stanford University, 1998. • Combating Web Spam with TrustRank. Zoltan Gyongyi, Hector Garcia-Molina and Jan Pedersen. In VLDB 2004. • Topic-sensitive PageRank. Taher Haveliwala. In WWW 2002. • The WebGraph dataset. Online at: http://webgraph-data.dsi.unimi.it/