Detecting Spammers on Twitter

Detecting Spammers on Twitter FabricioBenevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida Universidade Federal de Minas Gerais Belo Horizonte, Brazil ACSAC 2010 A Presentation at Advanced Defense Lab

Outline • Introduction • Background • Dataset and Labeled Collection • Identifying User Attributes • Detecting Spammers • Related Work • Conclusion Advanced Defense Lab

Introduction • Twitter has recently emerged as a popular social system. • With a simple interface where only 140 character messages can be posted. • These services open opportunities for new forms of spam Advanced Defense Lab

Introduction • 4-step approach • Crawled a near-complete dataset from Twitter. • Created a labeled collection with users “manually” classified as spammers and non-spammers. • Conducted a study about the characteristics of tweet content and user behavior. • Used supervised machine learning methodto identify spammers. Advanced Defense Lab

Background • Relationship links are directional. • A re-tweeted message usually starts with “RT@username”. • Twitter users usually use hashtags (#) to identify certain topics. • Trending Topics • #musicmonday Advanced Defense Lab

Background • A URL to a website containing advertisements completely unrelated to a hashtag on the tweet • Re-tweets in which legitimate links are changed to illegitimate ones. Advanced Defense Lab

Dataset and Labeled Collection • We asked Twitter to allow us to collect such data and they white-listed 58 servers located at the MPI-SWS. • Twitter assigns each user a numeric ID which uniquely identifies the user’s profile. • We launched our crawler in August 2009 to collect all user IDs ranging from 0 to 80 million. • In total • 54,981,152 used accounts • 1,963,263,821 social links • 1,755,925,520 tweets Advanced Defense Lab

Building a labeled collection • Three desired properties that need to be considered to create such collection of users labeled as spammers and non-spammers. • The collection needs to have a significant number spammers and non-spammers. • The labeled collection needs to include spammers who are aggressive in their strategies and mostly affect the system. • The users are chosen randomly and not based on their characteristics. Advanced Defense Lab

Building a labeled collection • Three trending topics • The Michael Jackson’s death • Susan Boyle’s emergence • The hashtag “#musicmonday” Advanced Defense Lab

Building a labeled collection • We developed a website to help volunteers to manually label users as spammers or non-spammers based on their tweets containing #keywords related to the trending topics. • In total, 8,207 users were labeled, including 355 spammers and 7,852 non-spammers. • We select only 710 of the legitimate users to include in our collection. Advanced Defense Lab

Indentifying User Attributes • Content Attributes • the maximum, minimum, average, and median of the following metrics: • number of hashtags per number of words on each tweet • number of URLs per words • number of words of each tweet • number of characters of each tweet • number of URLs on each tweet • number of hashtags on each tweet • number of numeric characters that appear on the text • number of users mentioned on each tweet • number of times the tweet has been re-tweeted • the fraction of tweets with at least one word from a popular list of spam words • the fraction of tweets that are reply messages • the fraction of tweets of the user containing URLs 39 Advanced Defense Lab

Identifying User Attributes • Total 1065 users. • 39% of the spammers posted all their tweets containing spam words, whereas non-spammers typically do not post more than 4% of their tweets containing spam word. Advanced Defense Lab

Indentifying User Attributes • User Behavior Attributes • the maximum, minimum, average, and median of the following metrics: • the time between tweets • number of tweets posted per day • number of tweets posted per week • number of followers • number of followees • fraction of followers per followees • number of tweets • age of the user account • number of times the user was mentioned • number of times the user was replied to • number of times the user replied someone • number of followees of the user’s followers • number tweets receveid from followees • existence of spam words on the user’s screename 23 Advanced Defense Lab

Identifying User Attributes • (a) Spammers have a high ratio of followers per follwees. • (b) Spammers usually have new accounts probably because they are constantly being blocked by other users and reported to Twitter. • (c) non-spammers receive a much large amount of tweets from their followees in comparison with spammers. Advanced Defense Lab

Detecting Spammers • SVM-light • 5-fold cross-validation. • In each test, the original sample is partitioned into 5 sub-samples, out of which four are used as training data, and the remaining one is used for testing. Advanced Defense Lab

Detecting Spammers Advanced Defense Lab

Detecting Spammers • X2 Advanced Defense Lab

Detecting Spams • Consider the following attributes for each tweet: • number of words from a list of spam words • number of hashtags per words • number of URLs per words • number of words • number of numeric characters on the text • number of characters that are numbers • number of URLs • number of hashtags • number of mentions • number of times the tweet has been replied Advanced Defense Lab

Detecting Spams Advanced Defense Lab

Related Work • Spam has been observed in various applications, including e-mail, web search engines, blogs, videos, and opinions. • RE: • Each user specifies a list of users who they are willing to receive content from. Advanced Defense Lab

Conclusions • Crawled the Twitter site to obtain more than 54 million user profiles. • Investigate different tradeoffs for our classification approach and the impact of different attributes sets. Advanced Defense Lab

Detecting Spammers on Twitter

Detecting Spammers on Twitter

Presentation Transcript

Follow DHEOdisha on Twitter

Detecting Spammers on Social Networks

Detecting Social Spam Campaigns on Twitter

Detecting Spammers on Social Networks

Follow @ residerocks on Twitter

Detecting Spammers on Social Networks

Towards Detecting Influenza Epidemics by Analyzing Twitter Massages

Journalist Roles on Twitter

Not on Twitter?

Participate on Twitter

DETECTING SPAMMERS ON SOCIAL NETWORKS

Twitter Games: How Successful Spammers Pick Targets

Detecting Influenza Outbreaks by Analyzing Twitter Messages

Detecting Spammers with SNARE: Spatio -temporal Network-level Automatic Reputation Engine

Getting started on --- Twitter

Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter

Detecting Product Review Spammers using Rating Behaviors

Follow us on twitter

Getting Started on Twitter

Frank SanPietro on Twitter

Tweens On Twitter

Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine