Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari. Ph.D. Defense, Sept 25, 2007. THESIS STATEMENT. It is possible to develop an effective, efficient and adaptive system to detect spam blogs. CONTRIBUTIONS. a principled study of the characteristics of the problem,
Detecting Spam Blogs: An Adaptive Online Approach. Pranam Kolari. Ph.D. Defense, Sept 25, 2007
THESIS STATEMENT It is possible to develop an effective, efficient and adaptive system to detect spam blogs.
CONTRIBUTIONS • a principled study of the characteristics of the problem, • a well-motivated feature discovery effort, • a cost-sensitive, real-time filtering implementation, and • an ensemble-driven classifier co-evolution.
OUTLINE • Introduction • Characterization • Feature Discovery • Cost-aware pipeline • Adaptive Classifiers • Evaluation • Conclusions • Future Directions
WHAT IS SPAM? • “Unsolicited usually commercial e-mail sent to a large number of addresses” – Merriam-Webster Online • As the Internet has supported new applications, many other forms have become common, requiring a much broader definition: capturing user attention unjustifiably in Internet-enabled applications (e-mail, Web, social media, etc.)
SPAM TAXONOMY • INTERNET SPAM: DIRECT, INDIRECT • [Forms] Bookmark Spam, E-Mail Spam, Comment Spam, IM Spam (SPIM), Spam Blogs (Splogs), Social Network Spam, General Web Spam • [Mechanisms] Spamdexing, Social Media Spam
SPAMDEXING [Diagram labels] • Affiliate Programs • Context Ads • (i) arbitrage, ads/affiliate links • (ii) in-links • Spam pages, Spam Blogs [DOORWAY] • JavaScript Redirect • Spammer owned domains • Affiliate Program Buyers • spamdex (iii) • Spam pages, Spam Blogs, Spam Comments, Guestbook Spam, Wiki Spam • SERP • Search Engines
SPAM BLOG • Advertisements in Profitable Contexts • Auto-generated and/or Plagiarized Content • Link Farms to promote other spam pages
CHARACTERIZATION • WordNet defines characterize as “to describe or portray the characters or the qualities or peculiarities” • Our efforts • Define and Scope the Problem • Field Study • Principled Empirical Analysis • Publicize and solicit feedback
SCOPE [Diagram labels: Update Pings, Ping Stream, Fetch Content; steps numbered 1 to 3] • Splog filtering between steps 2 and 3 (pre-indexing), used by the blog harvester
BLOGS & SPAMDEXING • Bias of search engines toward blogs • through quick indexing (ping servers) • and higher relevance (temporal) • Availability of third-party blogging platforms • providing service for free • supporting programmatic content injection • enjoying high authority and trust (e.g. blogspot) • enabling obfuscation (doorways) to search engines and DMCA notices
SPLOGS BY NUMBERS • 75% of update pings (eBiquity 2006) • 20% of indexed Blogosphere (Umbria 2006) • 56% of update pings (eBiquity 2007) • 56% of all active blogs are splogs! (2007)
SPLOG DETECTION PROBLEM • Given a blog, is it authentic or spam? • Explore evidence space • Contents of the Blogs (Local Attributes) • Evidence from Neighbors (Global Attributes) • P(splog(x) | O(x)), P(splog(x) | L(x))
EXISTING CONTEXTS [Table comparing E-MAIL, BLOGS and WEB; cell entries listed per row in extraction order] • NATURE: [plots with axes time/posts and time] • WHO USES IT? Web Search Engines; Blog Search Engines; Blog Hosting Services (Ping Servers); Users; E-mail Service Provider; Search Engines; Page Hosting Services (e.g. Tripod) • CONSTRAINTS: Fast Detection; Low Overhead; Online; Batch Detection; Mostly Offline; Fast Detection; Low Overhead • ATTACKS: Scripts, Doorways; Temporal Deception; Image Spam; Character Salad; Scripts, Doorways
RELATED WORK – WEB SPAM • Local Content (Drost et al., 2005) • using TF-IDF word features, specialized features, etc. • Statistical Properties (Fetterly et al., 2004) • using page updates, identical pages through page-stitching • TrustRank (Gyöngyi et al., 2004) • as an extension to PageRank • Splog Detection (Salvetti et al., Lin et al.)
MACHINE LEARNING CLASSIFICATION • Documents as vectors in a feature space (f1, f2, f3, ..., fm) • Feature Space • Discovery • Representation • Selection • Classification Techniques • Support Vector Machines (Discriminative) • Naïve Bayes Classifier (Generative) • Tools (LIBSVM, Weka)
MACHINE LEARNING EVALUATION • Precision (P) • a measure of correctness of classified documents • Recall (R) • a measure of completeness of classified documents • F-1 = 2*P*R/(P+R) • ROC AUC* – Area Under the Curve • a measure of discriminatory power * Presented in Thesis Document
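As a minimal sketch of the measures above, the following computes precision, recall and F-1 from labels. It assumes that "splog" is the positive class and that labels are booleans; the function name `precision_recall_f1` is illustrative and not from the thesis.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F-1), treating True as the 'splog' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # splogs flagged as splogs
    fp = sum(1 for t, p in pairs if not t and p)    # authentic blogs flagged as splogs
    fn = sum(1 for t, p in pairs if t and not p)    # splogs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other conventions exist.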
DATASETS • SPLOG-2005 • Sampled Summer 2005 at Technorati • Labeled samples of 700 blogs and 700 splogs • Only Blog-homepages • SPLOG-2006 • Sampled Oct 2006 at Weblogs.com • Labeled samples of 750 blogs and 750 splogs • Blog-homepages + feeds
EXPERIMENTAL SETUP • Binary feature encoding • Top 50K selected using frequency count • SVMs • Default parameters • Linear Kernel • No stemming or stop word elimination • Naïve Bayes • Ten fold cross-validation
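A hedged sketch of binary feature encoding with frequency-based selection follows. The slide does not say whether "frequency count" means document frequency or raw term frequency; this sketch assumes document frequency. The helper names and the parameter `k` (50K in the setup above) are illustrative.

```python
from collections import Counter

def select_top_features(token_lists, k):
    """Pick the k tokens that occur in the most documents (assumed criterion)."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))  # count each token once per document
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:k]]

def binary_encode(tokens, vocabulary):
    """Encode a document as 0/1 presence flags, one per vocabulary entry."""
    present = set(tokens)
    return [1 if feature in present else 0 for feature in vocabulary]
```

The resulting 0/1 vectors are the kind of input a linear SVM or Naïve Bayes classifier would be trained on.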
URL • 3-, 4- and 5-charactergrams from the URL • Captures profitable contexts • Highly effective at ping streams • Supports an extremely low cost classifier • [Figure panels: 2005, 2006]
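Extracting 3-, 4- and 5-charactergrams can be sketched as below. Lowercasing and returning a set (presence only, matching binary encoding) are assumptions of this sketch, and `char_ngrams` is an illustrative name.

```python
def char_ngrams(text, sizes=(3, 4, 5)):
    """Return the set of character n-grams of the given sizes found in text."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in sizes
        for i in range(len(text) - n + 1)  # empty range when text is shorter than n
    }
```

For a hypothetical URL such as `cheap-loans.example.com`, the result contains grams like `loan` and `cheap`, which is how a URL-only classifier could pick up profitable contexts without fetching any content.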
WORDS • Words (Text) on a Blog • Previously effective in topic classification • Captures profitable advertising contexts • Interesting Authentic Genre Observed • [Figure panels: 2005, 2006]
WORDGRAMS • Word-2-grams: 2 adjacent words • Shallow NLP technique to tackle word salad • Word salad less common in web spam (TFIDF) • Number of word-x-gram features grows exponentially with x • [Figure panels: 2005, 2006]
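A minimal sketch of word-2-gram extraction, assuming whitespace tokenization and lowercasing (neither is specified on the slide); `word_ngrams` is an illustrative name.

```python
def word_ngrams(text, n=2):
    """Return the list of n adjacent-word sequences, joined by a space."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

With a vocabulary of V distinct words there are up to V**x possible word-x-grams, which is why the feature space grows so quickly with x.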
CHARACTERGRAMS • 3-, 4- and 5-charactergrams from blog content • Can capture character salad (e.g. p1lls) • Feature selection important • [Figure panels: 2005, 2006]
OUTLINKS • Out-links tokenized on non-alphabetic characters • Similar to URL n-grams, likely more robust • Novel feature space • [Figure panels: 2005, 2006]
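Tokenizing an out-link on non-alphabetic characters can be sketched with a regular expression. Lowercasing the tokens and restricting "alphabetic" to ASCII letters are assumptions of this sketch; `tokenize_outlink` is an illustrative name.

```python
import re

def tokenize_outlink(url):
    """Split a URL on runs of non-alphabetic characters and lowercase the pieces."""
    return [token.lower() for token in re.split(r"[^A-Za-z]+", url) if token]
```

Unlike fixed-length charactergrams, these tokens are whole alphabetic runs, so digits and punctuation inserted between words only change where the splits fall.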
ANCHORS • Anchor text tokenized into words • Subsumed by words, but obfuscation difficult • Capture personalization of publishing template • Novel feature space • [Figure panels: 2005, 2006]
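One way to collect anchor-text words is with the standard-library HTML parser, as sketched below. The class and function names are illustrative, and whitespace tokenization with lowercasing is an assumption.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect the words that appear inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <a> elements are currently open
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.words.extend(data.lower().split())

def anchor_words(html):
    """Return the lowercased words of all anchor text in an HTML string."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.words
```

Text outside anchors is ignored, so this feature space is a subset of the words feature space, as the slide notes.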
Splog software?! “Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…” “Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!” “Holy Grail Of Advertising...” “Easily Dominate Any Market, Any Search Engine, Any Keyword.” $197
HTMLTAGS • Use HTML Tags – stylistic information • Capture signatures of splog software • Fully language independent • Novel feature space • [Figure panels: 2005, 2006]
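A hedged sketch of turning a page into HTML-tag features with the standard-library parser. Using only tag names (no attributes) and presence rather than counts are assumptions of this sketch; the names are illustrative.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the name of every start tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing tags such as <br/>
        # also reach this method through the default handle_startendtag.
        self.tags.append(tag)

def html_tag_features(html):
    """Return the set of tag names used in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return set(collector.tags)
```

Because no page text is used, the same extraction applies unchanged to blogs in any language.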
FEED BASED DETECTION • Limitations using only home-pages • No knowledge of blog lifetime • Classifiers less effective in early lifecycle • Benefits of using feeds • Most recent posts, lifetime, metadata • Capture correlations across posts • Limitations of using only feeds • Lose out on signatures in the publishing template
FEED ITEM DISTRIBUTION • Plot number of items in feeds (SPLOG-2006) • Authentic blogs show a normal distribution • Splogs – many with just one post • Knowledge of classifier effectiveness vs. lifetime
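Counting items per feed, the quantity plotted above, can be sketched as follows. The sketch assumes RSS 2.0 feeds with `<item>` elements under `<channel>`; Atom feeds use namespaced `<entry>` elements and would need a different path. Function names are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_feed_items(rss_xml):
    """Return the number of <item> elements in an RSS 2.0 document string."""
    root = ET.fromstring(rss_xml)
    return len(root.findall("./channel/item"))

def item_count_histogram(feeds):
    """Map each item count to the number of feeds that have that many items."""
    return Counter(count_feed_items(feed) for feed in feeds)
```

Building one histogram for labeled blogs and another for labeled splogs gives the two distributions being compared.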
FEED BASED DETECTION • Disjoint feature spaces – Words, Tags • Trained and Tested with n (x-axis) posts • Publishing template signatures important • Tags much more effective – early lifecycle
RELATED CLASSIFIERS • Blog Identification • Competency requirement for blog harvesters • F-1 measure of 98% • Relational Features • Less Effective (High P, Low R) • Short-lived blogs, lifetime dependent • Knowledge of Web-graph • Derived Features • Less Effective
FEATURE SPACE OBSERVATIONS • Cost based classifier bucketing • Known Feature Spaces • Words continue to be effective • Word-grams against obfuscation • Novel Feature Spaces • Out-links, Anchors capture useful signals • HTML Tags very effective, even early lifecycle • Feature Space Exploration • Tags, JavaScript, Feed Classification
META-PING SYSTEM • Regular Expression Filtering (March 2005) • List of Authentic Blogs (August 2005) • Blog Home-page Classifier (December 2005) • URL Classifier (October 2006) • Feed Classifier (May 2007) • Cost-Aware Pipeline Implementation (Jan 2007)
META-PING SYSTEM [Diagram labels] • On the Ping Stream: PRE-INDEXING SPING FILTER, LANGUAGE IDENTIFIER, BLOG IDENTIFIER • Increasing Cost: REGULAR EXPRESSIONS, BLACKLISTS, WHITELISTS, URL FILTERS, HOMEPAGE FILTERS, FEED FILTERS • Other labels: PING LOG, IP BLACKLISTS, AUTHENTIC BLOGS
META-PING SYSTEM • Static Design • Project specific thresholds • Classifiers in pipeline • Based on accrued domain knowledge • Dynamic Possibilities • Classifier Thresholds • Classifier use • Queuing analysis and Precision/Recall requirements
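The idea of a cost-aware pipeline with per-classifier thresholds can be illustrated with an early-exit cascade. This is a sketch under stated assumptions, not the META-PING implementation: each stage returns a splog score in [0, 1], stages are ordered cheapest first, and a blog moves to a costlier stage only when the current score falls between that stage's thresholds. The scorers, threshold values and field names are all hypothetical.

```python
def run_pipeline(blog, stages):
    """Run stages cheapest-first; stop at the first confident decision.

    stages: list of (name, scorer, low, high) tuples.
    Returns (label, name_of_last_stage_run).
    """
    score, name = 0.5, None
    for name, scorer, low, high in stages:
        score = scorer(blog)
        if score >= high:
            return "splog", name
        if score <= low:
            return "authentic", name
    # No stage was confident: fall back on the last (most expensive) score.
    return ("splog" if score >= 0.5 else "authentic"), name

# Toy stand-ins for real classifiers (hypothetical rules).
def url_score(blog):
    return 0.95 if "cheap" in blog["url"] else 0.5

def homepage_score(blog):
    return 0.9 if blog["text"].count("buy") >= 3 else 0.1

STAGES = [
    ("url", url_score, 0.1, 0.9),            # cheap: needs only the ping URL
    ("homepage", homepage_score, 0.2, 0.8),  # costlier: needs fetched content
]
```

Tightening a stage's thresholds sends more blogs to the costlier stages, which is the precision/recall versus throughput trade-off that the dynamic possibilities above refer to.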