This paper presents an in-depth analysis of SMS spam, highlighting its rapid growth and the unique challenges it poses compared to e-mail spam. The authors, from the AT&T Security Research Center, note that over 90% of Internet e-mail is spam, while SMS spam is growing by more than 500% per year. Key findings cover spammer behavior, the account types used, response ratios, and message timing. The analysis uses three distinct data sets to identify patterns in spam activity, evaluate the effectiveness of current detection measures, and discuss implications for network resource management and user safety.
Crime Scene Investigation: SMS Spam Data Analysis • Roger Piqueras Jover, AT&T Security Research Center, New York, NY, roger.jover@att.com • Ilona Murynets, AT&T Security Research Center, New York, NY, ilona@att.com • IMC’12, November 14–16, 2012, Boston, Massachusetts, USA.
Spam is the commonly adopted name for unwanted messages sent in bulk to a large number of recipients. E-mail spam • 90% of the daily e-mail sent over the Internet is spam • multiple solutions detect and block it • only a small amount of spam reaches inboxes SMS spam?
SMS spam • spammers connect aircards and cell phones to a PC • yearly growth of more than 500% • anti-abuse messaging filters are deployed • content-based algorithms (designed for e-mail) work less efficiently on SMS Why? • SMS text is full of acronyms, pruned spellings and emoticons • spammers swap the SIM when an account is shut down
SMS spam • consumes network resources that would otherwise serve legitimate services • the user pays on a per-received-message basis • exposes smartphone users to viruses • enables fraudulent messaging activities such as phishing, identity theft and fraud This paper: • the analysis is used for an SMS spam detection engine
Outline • three data sets for analysis • Data analysis • Account information • Messaging Abuse • Response ratio • Message timing and time series • The Scene of the Crime • Location & targets • Mobility • Hardware choice • Voice and IP traffic
Three data sets: SMS, cell, M2M • from a tier-1 cellular operator • Call Detail Records (CDRs) of 9000 SMS spammers and 17000 legitimate accounts (cell and M2M) • Mobile Originated (MO): the transmitting party • Mobile Terminated (MT): the receiver • spammers were identified and disconnected from the network • SMS: prepaid • cell: postpaid • M2M: TAC (Type Allocation Code)
Notes • In all the figures throughout the paper, legitimate cellphone users, M2M systems and spammers (SMS) are represented in green, blue and red, respectively.
Account information • almost all spammers (99.64%) use pre-paid accounts with unlimited messaging plans • SIM cards are constantly switched to circumvent detection schemes • once an account is canceled, the spammer discards the SIM and works with a new one • the average account age is 7 to 11 days (a legitimate user's account is several months to a couple of years old)
Messaging Abuse • Spammers generate a large load of messages • Spammers not only send but also receive more messages than legitimate customers do • received messages include opt-out requests and replies from tricked recipients
Messaging Abuse • Actual spam messages often attempt to trick the recipient into replying. Although only a small percentage of users reply, the large number of accounts targeted in a spam campaign results in many responses.
Messaging Abuse • legitimate accounts have a small set of recipients (7 on average) • spammers hit a couple of thousand victims • legitimate users send multiple messages to a small set of destinations • spammers send one message to each victim
Response ratio • For legitimate users, messages are sent sequentially in response to a previous message, so the response ratio is close to 1 • For spammers, the number of MT SMSs is very small in proportion to the number of transmitted messages, so the response ratio is close to 0
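The slide does not spell out the formula. A minimal sketch, assuming the response ratio is the number of received (MT) messages divided by the number of sent (MO) messages for an account, could look like this. The record layout (sender, receiver pairs) and all account names are hypothetical, not taken from the paper's CDR format.

```python
def response_ratio(records, account):
    """Return MT/MO for one account, or None if it never sent a message.

    records: iterable of (sender, receiver) pairs, one per SMS.
    This layout is an assumption made for illustration only.
    """
    sent = sum(1 for sender, _ in records if sender == account)          # MO count
    received = sum(1 for _, receiver in records if receiver == account)  # MT count
    if sent == 0:
        return None
    return received / sent


# Hypothetical toy data: a conversational user and a spammer.
records = [
    ("alice", "bob"), ("bob", "alice"),
    ("alice", "bob"), ("bob", "alice"),
    ("alice", "carol"), ("carol", "alice"),
]
records += [("spam", f"victim{i}") for i in range(5)]  # one message per victim
records += [("victim0", "spam")]                       # a single reply

alice_ratio = response_ratio(records, "alice")  # conversational: close to 1
spam_ratio = response_ratio(records, "spam")    # mostly one-way: close to 0
```

In this toy data the conversational account lands at 1 and the spammer well below it, which matches the qualitative picture on the slide.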
Message timing and time series • Inter-SMS intervals for spammers are short and less random: low entropy • Intervals between legitimate messages are longer and more random: higher entropy • Messaging activities of certain M2M devices are prescheduled.
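One way to quantify "less random" is the Shannon entropy of the inter-SMS intervals. The sketch below is an assumption about how such a feature might be computed, not the paper's method: intervals are grouped into fixed-width bins (the 10-second `bin_seconds` default is a hypothetical choice) and the entropy of the bin frequencies is returned in bits.

```python
import math
from collections import Counter


def interval_entropy(timestamps, bin_seconds=10):
    """Shannon entropy (bits) of binned gaps between consecutive messages.

    timestamps: send times in seconds; bin_seconds is an illustrative choice.
    """
    ts = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not intervals:
        return 0.0
    bins = Counter(int(gap // bin_seconds) for gap in intervals)
    total = len(intervals)
    # sum of p * log2(1/p) over the observed bins
    return sum((count / total) * math.log2(total / count) for count in bins.values())


# Hypothetical examples: a machine-paced sender and an irregular human sender.
paced = [0, 5, 10, 15, 20]          # every gap falls in the same bin
irregular = [0, 40, 45, 400, 2000]  # every gap falls in a different bin

paced_entropy = interval_entropy(paced)
irregular_entropy = interval_entropy(irregular)
```

The machine-paced sequence gives zero entropy because every gap is identical, while the irregular one spreads over four bins. The bin width strongly affects the result, so it would need tuning on real data.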
Location & targets • California: Sacramento, Orange County and Los Angeles • New York/New Jersey/Long Island • Miami Beach • Illinois, Michigan • North Carolina and Texas.
Location & targets • Recipients of legitimate messages are concentrated in the local area (i.e. the area around the subscriber’s home or areas where the subscriber works, used to live or where friends and relatives reside). • Spam recipients are distributed uniformly over the US population.
Location & targets • Spammers are characterized by messaging a large number of area codes, always greater than those of cell-phone users and M2M.
Location & targets • Low entropy (legitimate cell): the user repeatedly contacts the same area codes. • High entropy (SMS spammers): messages go to a more random set of area codes. • Network-enabled appliances (M2M): messages go to a predefined set of cell phones, so the entropy is the lowest.
Location & targets • SMS spammers show a linear relation • Both M2M systems and cell-phone users cluster around the bottom-left area of the graph. • Some M2M systems send up to 20000 messages to a single destination?
Location & targets • Cellphone users have a low destinations-to-messages ratio and message a small set of area codes. • The great majority of spammers exhibit the opposite behavior. • Spammers in the bottom-right corner target very specific geographical regions: their ratio is one destination per message, but the number of targeted area codes is limited.
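The two axes discussed here can be sketched as simple per-account features. This is an illustrative sketch only: it assumes each destination is a 10-digit North American number given as a string, with the first three digits taken as the area code. The function name and the sample numbers are hypothetical.

```python
def destination_profile(destinations):
    """Summarize one account's sent messages.

    destinations: one 10-digit number (string) per sent message.
    Returns None when the account sent nothing.
    """
    n_messages = len(destinations)
    if n_messages == 0:
        return None
    unique = set(destinations)
    area_codes = {number[:3] for number in unique}  # assumed area-code prefix
    return {
        "messages": n_messages,
        "dest_per_message": len(unique) / n_messages,
        "area_codes": len(area_codes),
    }


# Hypothetical accounts: a user texting two local contacts repeatedly,
# and a sender that messages each number exactly once across regions.
local_user = ["2125550101"] * 4 + ["2125550102"] * 4
wide_sender = ["2125550101", "3105550102", "3055550103", "9165550104"]

local_profile = destination_profile(local_user)
wide_profile = destination_profile(wide_sender)
```

A destinations-to-messages ratio near 1 combined with few area codes would correspond to the bottom-right group described above, while a ratio near 1 with many area codes matches the majority of spammers.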
Hardware choice • 1. USB Modem/Aircard A1 • 2. Feature mobile-phone M1 • 3. Feature mobile-phone M2 • 4. USB Modem/Aircard A2 • 5. USB Modem/Aircard A3
STOPPING THE CRIME • An advanced SMS spam detection algorithm is proposed based on an ensemble of decision trees • Over 40 specific features are extracted from messaging patterns and processed through a combination of decision trees
CONCLUSIONS • pre-paid accounts with an average age of 7 to 11 days • large number of messages sent to a wide set of targets (a large number is also received) • five different models of hardware • large number of phone calls of very short duration • main geographical sources in the US: Sacramento, Los Angeles-Orange County and Miami Beach • certain networked appliances have messaging behavior close to that of a spammer.