
How to beat an Adaptive Spam Filter



Presentation Transcript


  1. How to beat an Adaptive Spam Filter • John Graham-Cumming • Creator and Maintainer of POPFile • Research Director, Sophos’s Anti-Spam Task Force

  2. Token Space [Figure: token space diagram; only the label “neither” survives]

  3. “Red Coat” Spams • Obfuscated spam is trivial to spot and filter • No need to even read the text, the obfuscations are enough • No real email contains the word Viagra written &#86;<font size=0>&nbsp;</font>&#105;<font size=0>&nbsp;</font>&#97;<font size=0>&nbsp;</font>&#103;<font size=0>&nbsp;</font>&#114;<font size=0>&nbsp;</font>&#97; • “Field Guide to Spam” highlights spammer obfuscations: Invisible Ink, Camouflage, Hypertextus Interruptus... • www.sophos.com/spaminfo
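
The point that the obfuscation itself gives the spam away can be illustrated with a minimal Python sketch. It is not how POPFile or any Sophos product works; the function names and the two signals counted (zero-size font elements and numeric character references) are assumptions chosen for illustration.

```python
import html
import re

# Zero-size <font> elements are one of the "Invisible Ink" style tricks:
# their content is in the HTML source but not visible to the reader.
ZERO_SIZE_FONT = re.compile(
    r"<font\s+size=[\"']?0[\"']?\s*>.*?</font>", re.IGNORECASE | re.DOTALL
)
# Numeric character references such as &#86; (the letter V).
NUMERIC_ENTITY = re.compile(r"&#\d+;")


def deobfuscate(fragment):
    """Drop invisible zero-size font runs, then decode character references."""
    visible = ZERO_SIZE_FONT.sub("", fragment)
    return html.unescape(visible)


def obfuscation_signals(fragment):
    """Count the obfuscation constructs; these counts are usable as features."""
    return {
        "zero_size_fonts": len(ZERO_SIZE_FONT.findall(fragment)),
        "numeric_entities": len(NUMERIC_ENTITY.findall(fragment)),
    }


# The obfuscated word from the slide.
sample = (
    "&#86;<font size=0>&nbsp;</font>&#105;<font size=0>&nbsp;</font>"
    "&#97;<font size=0>&nbsp;</font>&#103;<font size=0>&nbsp;</font>"
    "&#114;<font size=0>&nbsp;</font>&#97;"
)

print(deobfuscate(sample))
print(obfuscation_signals(sample))
```

A filter does not need the decoded word at all: a six-letter word carrying five invisible font runs and six character references is already a strong signal.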

  4. POPFile's working great for me... but not 100% • November 3, 2003 through December 22, 2003 • Total mails received: 52,931 • Total spams: 35,928 (68%!) • Total spams missed: 125 • So POPFile ~99.7% accurate • 1 in 254 spams gets through... why?
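
The percentages on this slide can be rechecked from the three counts it gives. The sketch below uses only those counts; the variable names are illustrative.

```python
# Counts taken from the slide (November 3 to December 22, 2003).
total_mail = 52_931
total_spam = 35_928
missed_spam = 125

spam_share = total_spam / total_mail   # fraction of all mail that was spam
miss_rate = missed_spam / total_spam   # fraction of spam that got through
catch_rate = 1 - miss_rate             # fraction of spam that was caught
one_in = total_spam / missed_spam      # "1 in N spams gets through"

print(f"spam share: {spam_share:.1%}")
print(f"catch rate: {catch_rate:.2%}")
print(f"missed: about 1 in {one_in:.0f}")
```

On these counts the spam share rounds to 68% and the catch rate to 99.7%, matching the slide. The miss rate works out to about 1 in 287 spams, so the slide's “1 in 254” presumably rests on figures not shown here.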

  5. Taxonomy of filter busting spams • 52%: “picospams” • 13%: RTF • 9%: Challenge/Response • 9%: NDR • 4%: Totally blank • 13%: Other • Multiple copies of an offer for an “Incredible Spam Filter” • A message in Hebrew

  6. RTF • Microsoft email clients sniff Rich Text Format (actually, they sniff a lot of different formats) • Example: the header declares plain text, but the body is RTF:
     Content-Type: text/plain
     {\rtf1\ansi\ansicpg1252\uc1 \deff0\deflang1046\deflangfe1046{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;} {\f16\froman\fcharset129\fprq2{\*\panose 02030600000101010101}Batang{\*\falt??};}{\f28\froman\fcharset0\fprq2{\*\panose 02040602050305030304}Book Antiqua;}{\f29\froman\fcharset129\fprq2{\*\panose 00000000000000000000}@Batang;}{\f40\froman\fcharset238\fprq2 Times New Roman CE;}
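
A filter can check for the same mismatch the mail client sniffs. The following is a minimal sketch using Python's standard `email` package; the function name `declared_plain_but_rtf` is hypothetical, and a real filter would also need to handle multipart messages and transfer encodings.

```python
from email import message_from_string


def declared_plain_but_rtf(raw_message):
    """Return True if a single-part message says text/plain but starts with RTF."""
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        return False  # simplification: only single-part messages are checked
    body = msg.get_payload()
    # Every RTF document begins with the control word "{\rtf".
    return msg.get_content_type() == "text/plain" and body.lstrip().startswith("{\\rtf")


rtf_spam = "Content-Type: text/plain\n\n{\\rtf1\\ansi\\ansicpg1252 Buy now}\n"
plain_mail = "Content-Type: text/plain\n\nSee you at the meeting.\n"

print(declared_plain_but_rtf(rtf_spam))
print(declared_plain_but_rtf(plain_mail))
```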

  7. Challenge/Response • Received a number of “fake challenges” • Challenges directed me to a spammer's web site • This is how spammers can kill C/R • Personal note: I don't “do” C/R. If I mail you and you challenge me I hit delete, because, as Dan Quinlan put it: “C/R is the ultimate email diss. By using it you are saying, 'my time is more important than yours.'”

  8. Non-deliverable Response • As well as faking C/R messages, spammers fake NDRs • The NDR has the “original email” (actually a spam) as an attachment • Spammers can even get NDRs generated for them by badly configured mail servers • Send spam to known wrong address on a mail server with a forged from address • Mail server sends NDR to the forged from attaching the spam

  9. picospams • Spam containing either: • As few tokens as possible, e.g. robin: http://www.xg187.com • Only HTML tokens, e.g. <a href=http://www.spammersite.com/><img src=http://www.spammersite.com/img></a> • Picospams got through because they: • Hadn't been seen before • Contained “good” headers • Had “word salad” • (Thanks to Robin Keir for the tiny “robin:” mail.)

  10. “Good Headers” • The combination of two things leads to the ham tokens outweighing the spam • picospam text • Relaying the message through a good server • Suitable good servers are: • Mail relays like acm.org, alum.mit.edu • SourceForge.net • Mailing lists
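
Why a handful of ham-looking header tokens can outweigh a picospam body is easiest to see with numbers. The sketch below sums per-token log-odds in the style of a naive Bayes filter. It is not POPFile's actual scoring formula, and the token names and probabilities are invented for illustration.

```python
import math


def log_odds(tokens, spam_prob, default=0.5):
    """Sum log(P(spam)/P(ham)) over tokens; a positive total means 'spam'."""
    total = 0.0
    for token in tokens:
        p = spam_prob.get(token, default)  # unseen tokens are treated as neutral
        total += math.log(p / (1.0 - p))
    return total


# Made-up per-token spam probabilities (assumptions, not measured values).
spam_prob = {
    "url:spammersite.com": 0.9,   # the only body token in the picospam
    "header:acm.org": 0.1,        # relay that usually carries ham
    "header:list-id": 0.1,        # mailing-list header that usually means ham
}

body_tokens = ["url:spammersite.com"]
header_tokens = ["header:acm.org", "header:list-id"]

print(log_odds(body_tokens, spam_prob))                  # body alone
print(log_odds(body_tokens + header_tokens, spam_prob))  # body plus "good" headers
```

With so few body tokens there is little spam evidence to add up, so two trusted header tokens are enough to push the total below zero in this toy setup.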

  11. “Word Salad” • Spam stuffed with randomly selected words, e.g.: <a href="http://www.2004hosting.net/cable/"><img border="0" src="http://www.2004hosting.net/fiter1.jpg"></a> deliverance banister haploid sin beachcomb case stub doublet bread confucius buckaroo questionnaire tech issuance diagnose anglican finance pirouette u.s.a agree faculty nomenclature sheik insinuate pack dutchmen inhibition dubious patriotic aluminate • Sometimes the words are hidden using Invisible Ink, Camouflage, MIME is Money or other tricks • (The term “word salad” was coined by Cindy Harris in a POPFile forum.)

  12. “Word Salad” Experiment • Took a real picospam (HTML style) that had previously been caught by POPFile: Subject: cialis is now ready <DIV align=center><FONT face="arial black" size=2>Save over 70% on</FONT></DIV><CENTER><FONT face="arial black" size=2>USA approved meds</B></FONT><BR></CENTER><center><a href="http://cfcliihhp.646fgfg5.com/v95/index.php?id=v95">Come visit us</a> • Added 100s of words from /usr/share/dict/words • Scored for spam vs. ham against my POPFile installation
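
The padding step of this experiment could be reproduced along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the scoring step against POPFile is left out, and the dictionary path is the one named on the slide, which may not exist on every system.

```python
import random


def load_vocabulary(path="/usr/share/dict/words"):
    """Read one word per line from a dictionary file, skipping blank lines."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return [line.strip() for line in handle if line.strip()]


def add_word_salad(message, vocabulary, n_words, seed=None):
    """Append n_words distinct, randomly chosen words to the message body."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    salad = rng.sample(vocabulary, n_words)
    return message + "\n" + " ".join(salad)


# Tiny stand-in vocabulary so the example runs without the dictionary file.
demo_vocabulary = ["deliverance", "banister", "haploid", "doublet", "pirouette"]
padded = add_word_salad("Subject: cialis is now ready", demo_vocabulary, 3, seed=1)
print(padded)
```

Each padded variant would then be scored by the filter under test, and the number of variants classified as ham counted for each salad size.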

  13. “Word Salad” Results [Chart: number of spams (per 10,000) that got through, against number of words added]

  14. “Word Salad” Ineffective • Best result: 0.04% get through, if you • Send each person 10,000 copies of each spam • AND make each spam 3x bigger than before • Ineffective because randomly chosen words are likely to be: • First in neither, • then in spammy, • finally in hammy! • Because spammers send so much spam!
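
The three token categories named on this slide can be made concrete with a small sketch. The classification rule (compare raw counts in the spam and ham corpora) is a deliberate simplification, not POPFile's method, and the counts are invented.

```python
from collections import Counter


def bucket(token, spam_counts, ham_counts):
    """Place a token in 'neither', 'spammy' or 'hammy' from its corpus counts."""
    spam_n = spam_counts.get(token, 0)
    ham_n = ham_counts.get(token, 0)
    if spam_n == 0 and ham_n == 0:
        return "neither"
    return "hammy" if ham_n > spam_n else "spammy"


def tally_salad(words, spam_counts, ham_counts):
    """Count how many salad words land in each bucket."""
    return Counter(bucket(word, spam_counts, ham_counts) for word in words)


# Made-up training counts (assumptions for illustration only).
spam_counts = {"deliverance": 4, "haploid": 2, "finance": 30}
ham_counts = {"finance": 3, "faculty": 12}

salad = ["deliverance", "banister", "haploid", "finance", "faculty", "sheik"]
tally = tally_salad(salad, spam_counts, ham_counts)
print(tally)
```

Only the words that land in the "hammy" bucket help the spam get through; in this toy example that is one word out of six.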

  15. Word Salad [Figure: token space diagram; only the label “neither” (twice) survives]

  16. Word Salad Variants • Got similar results using words pulled from • News stories via news.google.com • Articles from wikipedia.org • Back to basics… • A filter busting spam needs to: • HAVE FEW tokens that look like spam • HAVE MORE tokens that look like my ham • How do you find my hammy tokens?

  17. Bayes vs. Bayes • If adaptive filters are so smart, perhaps they can beat adaptive filters? • Experiment: • Take a trained spam filter (“Good” POPFile) • And an untrained spam filter (“Evil” POPFile) • Take a spam that got through “Good” • Send copies of the spam with 5 random words appended • Train “Evil” depending on whether each copy gets through “Good” or not

  18. B vs B [Figure: only the labels “berkshire”, “marriott” and “wireless” survive]

  19. How to get feedback • When sending each message include a unique web bug • Creates an effective feedback loop • Spammer can use web bug to train their POPFile installation • Bad news... this works: • Tested against my POPFile installation • Sent 10,000 emails containing 5 random words from /usr/share/dict/words • Found my kryptonite

  20. Kryptonite Words • accommodations, arrangements, berkshire, category, channel, checking, comment, currency, endless, entitled, flying, hills, independent, invoice, logging, marriott, occupancy, officer, operated, quantity, redeeming, rent, shared, silicon, touch, wireless • Adding just one of these words turns the spam into a ham!
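
The Bayes vs. Bayes loop from the preceding slides can be simulated in a few lines. This is a toy sketch, not the experiment as actually run: the “Good” filter is replaced by a stand-in function with a single secret ham word, the “Evil” side is reduced to pass/block counters instead of a second POPFile, and all names are hypothetical.

```python
import random
from collections import Counter


def run_probe_experiment(vocabulary, good_filter_passes, n_messages,
                         words_per_message=5, seed=0):
    """Send probes with random words; record which words were in passing ones."""
    rng = random.Random(seed)
    passed, blocked = Counter(), Counter()
    for _ in range(n_messages):
        words = rng.sample(vocabulary, words_per_message)
        # The feedback channel (e.g. a web bug) reveals only pass or block.
        if good_filter_passes(words):
            passed.update(words)
        else:
            blocked.update(words)
    return passed, blocked


def kryptonite_candidates(passed, blocked, min_seen=3):
    """Rank words by the fraction of their probes that got through."""
    scores = {}
    for word in passed:
        seen = passed[word] + blocked[word]
        if seen >= min_seen:
            scores[word] = passed[word] / seen
    return sorted(scores, key=scores.get, reverse=True)


# Stand-in for the "Good" filter: any probe containing "berkshire" passes.
def toy_good_filter(words):
    return "berkshire" in words


vocabulary = [f"word{i}" for i in range(49)] + ["berkshire"]
passed, blocked = run_probe_experiment(vocabulary, toy_good_filter, n_messages=500)
print(kryptonite_candidates(passed, blocked)[:3])
```

Because each probe carries several words, harmless words that happened to share a probe with a kryptonite word also pick up some passes; ranking by pass fraction over many probes is what separates them.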

  21. Is B vs B practical? • Took 10,000 messages to one email address to train evil POPFile • But what about 10 messages to 1,000 mail addresses? • Say send 10 copies of a spam to everyone at company.com: might find company.com specific kryptonite

  22. Defense against the dark arts • Absolutely NO feedback to spammers • No rendering HTML • No bouncing • No SMTP server errors • No selective challenge/response • No NDRs • Mailing List/Mail Forwards • Do spam filtering on inbound messages • Integrate header analysis with adaptive filtering
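
One practical piece of “no feedback” is never fetching remote images, since a remote image with a unique URL is exactly what a web bug is. The sketch below lists remote image references in an HTML body using Python's standard `html.parser`. The class and function names are hypothetical, and a mail client would additionally have to block other remote resources.

```python
from html.parser import HTMLParser


class RemoteImageFinder(HTMLParser):
    """Collect the src of every <img> that points at an http(s) URL."""

    def __init__(self):
        super().__init__()
        self.remote_images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            if src.lower().startswith(("http://", "https://")):
                self.remote_images.append(src)


def find_remote_images(html_body):
    """Return the remote image URLs that rendering this body would fetch."""
    finder = RemoteImageFinder()
    finder.feed(html_body)
    finder.close()
    return finder.remote_images


# Picospam-style body (attribute quotes added) plus an embedded cid: image,
# which is carried inside the message and causes no network request.
sample = (
    '<a href="http://www.spammersite.com/">'
    '<img src="http://www.spammersite.com/img"></a>'
    '<img src="cid:logo">'
)
remote = find_remote_images(sample)
print(remote)
```

A client that refuses to load anything in this list, or renders mail as text only, gives the sender no signal about which messages were delivered and opened.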

  23. Conclusions • Current spam is “easy” for adaptive filters to detect • As spammers react to adaptive filtering spam will get harder to recognize • Feedback mechanisms present a risk to the effectiveness of adaptive filtering • Adaptive filters will need merging with “traditional” anti-spam techniques like DNSBL
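
As an example of the “traditional” side of that merge, a DNSBL lookup is an ordinary DNS query on the sending IP address with its octets reversed. The sketch below builds the query name and, optionally, resolves it. The zone `dnsbl.example.org` is a placeholder, not a real list, and the function names are hypothetical.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets, then the list zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on a bad address
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the list has an A record for this address (needs network)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False  # no record: the address is not listed


# 192.0.2.99 is a documentation address; the zone is a placeholder.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

The listed/not-listed result could then be fed to the adaptive filter as one more token alongside the header and body tokens, rather than used as a hard block.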

  24. Thank you. All questions will now be answered via telepathy :-)
