Automated Worm Fingerprinting

Automated Worm Fingerprinting • Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage • Department of Computer Science and Engineering, University of California, San Diego • Presented at: Operating Systems Design and Implementation (OSDI) 2004 • Ramanarayanan Ramani (Ram) • Support for this work was provided by NIST Grant 60NANB1D0118 and NSF Grant 0137102.

Overview • Why Automated Systems • Detecting Worms • Characterize Worms • Worm Containment • Worm Behavior • Identify Worm Signatures • Earlybird System Design • Statistics • Conclusion

Why Automated Systems • Identify worm – Manually characterize signature – Update antivirus & network filters • Code Red took about 14 hours to infect its vulnerable population • Slammer took about 10 minutes – no time to identify a signature manually • Need automatic worm signature identification to secure networks

Detecting Worms • Network Telescopes: monitor requests to a large unused, yet routable, address space • Can identify random-scan worms • Cannot identify hit-list or email worms • Cannot characterize the signature

Detecting Worms • Using Honeypots • Do not allow any malicious incoming traffic • Unwanted outgoing traffic may be due to a worm: identify the malicious code generating it • Use the malicious code to identify a signature • Takes a long time & requires manual signature identification

Detecting Worms • Host-based behavioral detection • Analyze patterns of system calls • (e.g.) a received packet being routed back out for sending • Identify suspicious activity • Expensive to manage • Needs to be deployed on every system separately

Characterize Worm • Characterization is the process of analyzing and identifying a new worm or exploit • Create a priori vulnerability signatures • Can only be applied to vulnerabilities that are already well-known and well-characterized manually

Characterize Worm • First automated system: used by IBM for viruses • Allows the virus to infect “decoy” programs • Identifies invariant strings in infected objects to characterize viruses • Assumes the presence of a known instance of a virus and a controlled environment to monitor

Characterize Worm • Honeycomb system of Kreibich and Crowcroft • Host-based intrusion detection system • Automatically generates signatures by looking for longest common subsequences among sets of strings found in message exchanges. • Very slow

Characterize Worm • Kim and Karp's Autograph system • Autograph also uses network-level data to infer worm signatures • Employ Rabin fingerprints to index counters of content substrings • Use white-lists to set aside well known false positives • Has extensive support for distributed deployments • Relies on a pre filtering step that identifies flows with suspicious scanning activity

Worm Containment • Mechanism used to slow or stop the spread of an active worm • Host quarantine • String-matching • Connection throttling

Worm Behavior • Behave quite differently from the popular client-server and peer-to-peer applications • Have some common behavior patterns across worms – useful to identify and characterize them

Worm Behavior • Content invariance • Some or all of the worm program is invariant across every copy • Some worms make use of limited polymorphism - encrypting each worm instance independently and/or randomizing filler text • But still some portion is invariant

Worm Behavior • Content prevalence • Worms are designed foremost to spread - the invariant portion of a worm's content will appear frequently on the network as it spreads or attempts to spread

Worm Behavior • Address dispersion • Packets containing a live worm will tend to reflect a variety of different source and destination addresses • This range increases when there is a major outbreak

Identify Worm Signatures

ProcessTraffic(payload, srcIP, dstIP)
    prevalence[payload]++
    Insert(srcIP, dispersion[payload].sources)
    Insert(dstIP, dispersion[payload].dests)
    if (prevalence[payload] > PrevalenceTh
        and size(dispersion[payload].sources) > SrcDispTh
        and size(dispersion[payload].dests) > DstDispTh)
        if (payload in knownSignatures)
            return
        endif
        Insert(payload, knownSignatures)
        NewSignatureAlarm(payload)
    endif
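The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the Earlybird implementation: it uses exact dictionaries and sets where the real system uses compact approximations, and the threshold values and the boolean return value are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for this illustration.
PREVALENCE_TH = 3
SRC_DISP_TH = 2
DST_DISP_TH = 2

prevalence = defaultdict(int)   # payload -> number of times seen
sources = defaultdict(set)      # payload -> distinct source addresses
dests = defaultdict(set)        # payload -> distinct destination addresses
known_signatures = set()


def process_traffic(payload, src_ip, dst_ip):
    """Return True when this packet causes a new signature alarm."""
    prevalence[payload] += 1
    sources[payload].add(src_ip)
    dests[payload].add(dst_ip)
    if (prevalence[payload] > PREVALENCE_TH
            and len(sources[payload]) > SRC_DISP_TH
            and len(dests[payload]) > DST_DISP_TH):
        if payload in known_signatures:
            return False            # already reported
        known_signatures.add(payload)
        return True                 # stands in for NewSignatureAlarm(payload)
    return False


# The same payload seen between many different hosts trips the alarm once.
alarms = [process_traffic(b"worm-body", f"10.0.0.{i}", f"10.0.1.{i}")
          for i in range(5)]
print(alarms)
```

Repeated content between a single pair of hosts, such as a bulk transfer, never raises an alarm here because its address dispersion stays at one.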

Identify Worm Signature • This method is called Content Sifting • Implemented naively, it does not scale: • Too much data to handle on high-speed networks • Too many substrings to store • Too much time to process each packet

Earlybird System Design • Scan network & process packets • Identify repeating substrings along with list of the source & destination • If repetition is over threshold, set substring to be signature & ask network security system to block packets with respective signature

Estimate Content prevalence • Find the packet payloads that appear at least x times among the N packets sent during a given interval • Use multi-stage filters with conservative update to dramatically reduce the memory footprint of the problem • Append the destination port and protocol to the content before hashing • Detect repeating strings of a small fixed length B • Compute a variant of Rabin fingerprints for all possible substrings of that length • Each packet with a payload of s bytes has s - B + 1 substrings of length B, so the number of memory references per packet is very high
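A multi-stage filter with conservative update can be sketched as below. This is an illustrative sketch under stated assumptions: the class name, the stage and counter sizes, the use of salted BLAKE2b as the per-stage hash, and the `content_key` helper are all choices made for the example, not details taken from the slides.

```python
import hashlib


def content_key(payload, dst_port, protocol):
    """Append destination port and protocol to the content before hashing."""
    return payload + dst_port.to_bytes(2, "big") + bytes([protocol])


class MultistageFilter:
    def __init__(self, stages=4, counters_per_stage=1024):
        self.stages = stages
        self.size = counters_per_stage
        self.counters = [[0] * counters_per_stage for _ in range(stages)]

    def _slots(self, key):
        # One hash per stage, derived by salting BLAKE2b with the stage number.
        slots = []
        for stage in range(self.stages):
            digest = hashlib.blake2b(key, digest_size=8,
                                     salt=bytes([stage])).digest()
            slots.append((stage, int.from_bytes(digest, "big") % self.size))
        return slots

    def increment(self, key):
        """Count one occurrence of key and return its estimated count."""
        slots = self._slots(key)
        smallest = min(self.counters[s][i] for s, i in slots)
        # Conservative update: raise only the counters equal to the minimum,
        # so counters shared with other keys are inflated as little as possible.
        for s, i in slots:
            if self.counters[s][i] == smallest:
                self.counters[s][i] += 1
        return smallest + 1


msf = MultistageFilter()
key = content_key(b"repeated content", 80, 6)
counts = [msf.increment(key) for _ in range(5)]
print(counts)
```

The estimate is the minimum over the stages, so it can overcount when keys collide in every stage but never undercounts, and memory stays fixed no matter how many distinct strings are seen.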

Estimating address dispersion • Address dispersion is critical for avoiding false positives • Count the distinct source IP addresses and destination IP addresses associated with each piece of content suspected of being generated by a worm • Use approximate counting of distinct addresses using bitmaps • Direct bitmaps: 32 bits; hash each address to one bit and set that bit • For a threshold of 30 distinct addresses – about 20 bits set • Limited ability to estimate the actual value of each counter
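The direct bitmap and the "30 addresses, about 20 bits" figure can be checked with a short sketch. The hash function (truncated SHA-256) and the function names are assumptions for illustration. The estimator b ln(b/z), where z is the number of bits still zero, is the same one that appears in the ComputeEstimate pseudocode later in the deck.

```python
import hashlib
import math

B = 32  # bitmap size in bits, as on the slide


def bit_for(address):
    """Hash an address string to one of the B bit positions."""
    digest = hashlib.sha256(address.encode()).digest()
    return int.from_bytes(digest[:4], "big") % B


def update(bitmap, address):
    """Return the bitmap with the address's bit set (bitmap is a plain int)."""
    return bitmap | (1 << bit_for(address))


def estimate(bitmap):
    """Estimate the number of distinct addresses as b * ln(b / zero_bits)."""
    zeros = B - bin(bitmap).count("1")
    if zeros == 0:
        return float("inf")  # saturated bitmap: no usable estimate
    return B * math.log(B / zeros)


def expected_bits_set(n):
    """Expected number of set bits after n distinct addresses."""
    return B * (1 - (1 - 1 / B) ** n)


print(round(expected_bits_set(30), 2))    # close to 20 bits for 30 addresses
print(round(estimate((1 << 20) - 1), 2))  # 20 bits set maps back to roughly 30
```

Once most bits are set the estimate becomes very coarse, which is the limitation the last bullet refers to.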

Estimating address dispersion • Earlybird technique – Scaled Bitmaps • Accurately estimates address dispersion using five times less memory • Sub-sampling the range of the hash space • (e.g.) To count up to 64 sources using 32 bits, one might hash sources into a space from 0 to 63 yet only set bits for values that hash between 0 and 31 - ignoring half of the sources • We track a continuously increasing count by simply increasing this scaling factor whenever the bitmap is filled

Estimating address dispersion • Once the bitmap is scaled to a new configuration, the addresses that were active throughout the previous configuration are lost and adjusting for this bias directly can lead to double counting • So we use multiple bitmaps to store history – here we use 3

Estimating address dispersion

Estimating address dispersion

UpdateBitmap(IP)
    code = Hash(IP)
    level = CountLeadingZeroes(code)
    bitcode = FirstBits(code << (level+1))
    if (level >= base and level < base+numbmps)
        SetBit(bitcode, bitmaps[level-base])
        if (level == base and CountBitsSet(bitmaps[0]) == max)
            NextConfiguration()
        endif
    endif

ComputeEstimate(bitmaps, base)
    numIPs = 0
    for i = 0 to numbmps-1
        numIPs = numIPs + b ln(b / CountBitsNotSet(bitmaps[i]))
    endfor
    correction = 2 (2^base - 1) / (2^numbmps - 1) · b ln(b / (b - max))
    return numIPs · 2^base / (1 - 2^(-numbmps)) + correction
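A runnable sketch of UpdateBitmap and ComputeEstimate follows. Several details are assumptions, since the slides do not define them: the hash is truncated SHA-256, `max_set` is set to 24 of 32 bits, the range check is written as `base <= level`, and NextConfiguration is assumed to increment `base`, drop the lowest-level bitmap and append an empty one. The trigger uses `>=` rather than `==` so that a bitmap that is already past the limit when it becomes the lowest level still advances.

```python
import hashlib
import math


class ScaledBitmap:
    def __init__(self, bits=32, numbmps=3, max_set=24):
        self.b = bits                            # bits per bitmap (power of two)
        self.index_bits = bits.bit_length() - 1  # log2(bits)
        self.numbmps = numbmps
        self.max = max_set                       # assumed fill limit for bitmaps[0]
        self.base = 0
        self.bitmaps = [0] * numbmps             # each bitmap stored as an int

    @staticmethod
    def _hash(ip):
        return int.from_bytes(hashlib.sha256(ip.encode()).digest()[:4], "big")

    def update(self, ip):
        code = self._hash(ip)
        level = 32 - code.bit_length()           # leading zeroes of a 32-bit word
        shifted = (code << (level + 1)) & 0xFFFFFFFF
        bitcode = shifted >> (32 - self.index_bits)
        if self.base <= level < self.base + self.numbmps:
            self.bitmaps[level - self.base] |= 1 << bitcode
            if level == self.base and bin(self.bitmaps[0]).count("1") >= self.max:
                self._next_configuration()

    def _next_configuration(self):
        # Assumed behaviour: halve the sampled hash range and keep the history
        # already collected for the higher levels.
        self.base += 1
        self.bitmaps = self.bitmaps[1:] + [0]

    def estimate(self):
        num_ips = 0.0
        for bitmap in self.bitmaps:
            zeros = max(self.b - bin(bitmap).count("1"), 1)  # avoid log of infinity
            num_ips += self.b * math.log(self.b / zeros)
        correction = (2 * (2 ** self.base - 1) / (2 ** self.numbmps - 1)
                      * self.b * math.log(self.b / (self.b - self.max)))
        return num_ips * 2 ** self.base / (1 - 2 ** -self.numbmps) + correction


counter = ScaledBitmap()
for n in range(1000):
    counter.update(f"10.{n // 256}.{n % 256}.1")
print(counter.base, round(counter.estimate()))
```

With only 96 bits of state the result is a rough estimate of the 1,000 distinct addresses rather than an exact count. The point of the sketch is that the count keeps growing after a single 32-bit bitmap would have saturated.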

CPU scaling • Processing each packet payload as a single string is easy • But when applying Rabin fingerprints, processing every substring of length B can overload the CPU under high traffic load • A packet with 1,000 bytes of payload and B = 40 requires computing about 960 Rabin fingerprints (s - B + 1 = 961) • To reduce processing time, sample which substrings are processed • Randomly sampling substrings could cause us to miss a large fraction of the occurrences of each substring
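The per-packet cost can be made concrete with a rolling hash. The sketch below uses a simple polynomial rolling hash as a stand-in for Rabin fingerprints (the base and modulus are arbitrary choices for the example). Each new window is derived from the previous one in constant time, yet a 1,000-byte payload with B = 40 still yields s - B + 1 = 961 fingerprints, in line with the roughly 960 quoted above.

```python
BASE = 257
MOD = (1 << 61) - 1  # a large prime modulus, chosen for the illustration


def direct_fingerprint(window):
    """Fingerprint of one window computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def rolling_fingerprints(payload, b):
    """Fingerprints of every length-b substring, updated incrementally."""
    if len(payload) < b:
        return []
    top = pow(BASE, b - 1, MOD)          # weight of the byte leaving the window
    h = direct_fingerprint(payload[:b])
    out = [h]
    for i in range(b, len(payload)):
        # Remove the oldest byte, shift, and append the newest byte.
        h = ((h - payload[i - b] * top) * BASE + payload[i]) % MOD
        out.append(h)
    return out


payload = bytes((i * 37 + 11) % 256 for i in range(1000))
fps = rolling_fingerprints(payload, 40)
print(len(fps))
```

Even at constant cost per window, every fingerprint still triggers table lookups, which is why the next slide samples fingerprints instead of processing all of them.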

CPU Scaling • Instead use value sampling: select only those substrings whose fingerprint matches a certain pattern, e.g. the last six bits are 0 • The probability of tracking a worm grows with the length x of its signature • For a signature of 100 bytes it is 55%, for 200 bytes it increases to 92%, and for 400 bytes to 99.64%
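Value sampling can be sketched as follows. The fingerprint here is truncated BLAKE2b rather than a Rabin fingerprint, and the function names are hypothetical. The property that matters is that the sampling decision depends only on the substring's content, so the same worm substrings are selected in every packet that carries them.

```python
import hashlib

SAMPLE_BITS = 6   # keep fingerprints whose last six bits are 0 (1 in 64 on average)
B = 40            # substring length


def fingerprint(window):
    """Stand-in fingerprint: first 8 bytes of BLAKE2b as an integer."""
    return int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")


def is_sampled(fp):
    return fp & ((1 << SAMPLE_BITS) - 1) == 0


def sampled_offsets(payload, b=B):
    """Offsets of the length-b substrings selected by value sampling."""
    return [i for i in range(len(payload) - b + 1)
            if is_sampled(fingerprint(payload[i:i + b]))]


worm = bytes((i * 37 + 11) % 256 for i in range(400))
packet_a = b"A" * 10 + worm   # same worm body behind different headers
packet_b = b"B" * 25 + worm

in_a = {i - 10 for i in sampled_offsets(packet_a) if i >= 10}
in_b = {i - 25 for i in sampled_offsets(packet_b) if i >= 25}
print(in_a == in_b)
```

Random sampling would pick different substrings in each packet and split the counts. Value sampling either always tracks a given substring or never does, so a longer invariant region gives more chances that at least one substring is tracked.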

Complete System

Program Loop

ProcessPacket()
    InitializeIncrementalHash(payload, payloadLength, dstPort)
    while (currentHash = GetNextHash())
        if (currentADEntry = ADEntryMap.Find(currentHash))
            UpdateADEntry(currentADEntry, srcIP, dstIP, packetTime)
            if ((currentADEntry.srcCount > SrcDispTh)
                and (currentADEntry.dstCount > DstDispTh))
                ReportAnomalousADEntry(currentADEntry, packet)
            endif
        else
            if (MsfIncrement(currentHash) > PrevalenceTh)
                newADEntry = InitializeADEntry(srcIP, dstIP, packetTime)
                ADEntryMap.Insert(currentHash, newADEntry)
            endif
        endif
    endwhile
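The two-stage structure of this loop can be sketched in Python. To stay short, the sketch substitutes exact data structures for the real ones: a `Counter` stands in for the multi-stage filter, Python sets stand in for the scaled bitmaps, and raw substrings stand in for sampled fingerprints. The thresholds and the 8-byte substring length are hypothetical.

```python
from collections import Counter

# Hypothetical parameters for the illustration.
PREVALENCE_TH = 3
SRC_DISP_TH = 2
DST_DISP_TH = 2
B = 8

prevalence = Counter()   # stand-in for the multi-stage filter
ad_entries = {}          # key -> address dispersion entry


def process_packet(payload, src_ip, dst_ip, dst_port, packet_time=0.0):
    """Return the keys reported as anomalous for this packet."""
    reported = []
    for i in range(len(payload) - B + 1):
        key = (payload[i:i + B], dst_port)
        entry = ad_entries.get(key)
        if entry is not None:
            # Content is already prevalent: track who sends and receives it.
            entry["srcs"].add(src_ip)
            entry["dsts"].add(dst_ip)
            entry["last_update"] = packet_time
            if (len(entry["srcs"]) > SRC_DISP_TH
                    and len(entry["dsts"]) > DST_DISP_TH):
                reported.append(key)
        else:
            # Not yet prevalent: only the cheap prevalence counter is touched.
            prevalence[key] += 1
            if prevalence[key] > PREVALENCE_TH:
                ad_entries[key] = {"srcs": {src_ip}, "dsts": {dst_ip},
                                   "last_update": packet_time}
    return reported


reports = [process_packet(b"WORMBODY", f"10.0.0.{i}", f"10.0.1.{i}", 80)
           for i in range(7)]
print([bool(r) for r in reports])
```

The design point is that the expensive per-content address tracking is only allocated for the small fraction of content that has already passed the prevalence threshold.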

Statistics • Implementation is written in C • The aggregator also uses a MySQL database to log all events • Uses the popular RRDtool library for graphical reporting • PHP scripting for administrative control

Content prevalence threshold • Using a 60 second measurement interval and a whole packet CRC, over 97 percent of all signatures repeat two or fewer times and 94.5 percent are only observed once • Using a finer grained content hash or a longer measurement interval increases these numbers even further

Address dispersion threshold • After 10 minutes there are over 1000 signatures with a low dispersion threshold of 2 • Using a threshold of 30, there are only 5 or 6 prevalent strings meeting the dispersion criteria

Garbage Collection • When the timeout is set to 100 seconds, then almost 60 percent of all signatures are garbage collected before a subsequent update • Using a timeout of 1000 seconds, this number is reduced to roughly 20 percent of signatures
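Timeout-based garbage collection of this kind can be sketched as below. The entry layout (a dict with a `last_update` timestamp) and the function name are assumptions for the example. The slides only state the timeout values and the fraction of signatures collected.

```python
def collect_garbage(ad_entries, now, timeout):
    """Drop entries not updated within `timeout` seconds; return how many."""
    stale = [key for key, entry in ad_entries.items()
             if now - entry["last_update"] > timeout]
    for key in stale:
        del ad_entries[key]
    return len(stale)


entries = {
    "sig-a": {"last_update": 0.0},     # idle for 1000 s
    "sig-b": {"last_update": 950.0},   # idle for 50 s
}
removed = collect_garbage(entries, now=1000.0, timeout=100.0)
print(removed, list(entries))
```

A short timeout frees memory quickly but discards signatures that would have been updated again. A longer timeout keeps more history at the cost of a larger table, which is the trade-off behind the 60 percent and 20 percent figures above.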

Positives • Automatic Detection, Characterization & Containment • Low processor time consumed • Low memory consumption • Identify new worms and produce signatures – even E-Mail worms

Problems • Can’t identify worms with very little or no invariant content • Attackers can use compression such as zip to confuse Earlybird • Worms spreading over encrypted channels (IPSec, SSL, VPN) can’t be detected because the payload is hidden • Attackers can attempt to evade the monitor through traditional IDS evasion techniques, like IP spoofing • Stealthy worms are difficult to identify • An attacker can deliberately spread similar packets so that the worm defense itself blocks a legitimate service

Suggestions • Uncompress packets & inspect the original contents • Place the system at the firewall so it can inspect traffic for secure protocols • Use triggering data across time scales (in paper) or maintain a history of slowly repeating data • Verify on the infected systems that the flagged content really is a worm

Questions
