Automated Worm Fingerprinting

Automated Worm Fingerprinting Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage Presenter: Yi Qiao

Outline • Introduction • Background • Worm Behavior and Worm Signatures • Practical Content Sifting • Implementation and Evaluation • Limitations • Conclusions

Introduction • Internet worms • Small programs that exploit software vulnerabilities in popular network services, seize control of program execution, and send copies of themselves to other susceptible hosts • Bigger threat and damage • Software homogeneity, Internet’s unrestricted communication model • Increased speed, virulence and sophistication of new generations of worms and viruses • Different mechanisms and consequences • Little advance in worm detection, characterization and containment • Detection: intrusion detection + administrator legwork • Manual characterization of worm signatures • Contain infections through anti-virus software and network filtering • Inefficient, expensive, and slow • Hours or days to complete • Effective worm containment can require a reaction time of sixty seconds!

Introduction • What this work has done • Two observations • Some portions of the content in existing worms are invariant • The spreading dynamics of a worm are atypical of Internet applications • Content sifting to identify new worms and their precise signatures • A prototype system based on the content sifting approach, EarlyBird, for real-time worm detection and containment

Background • Empirical analyses of the CodeRed worm outbreak • The operational repair rate averaged under 2 percent per day • Fully automated intervention is necessary to manage outbreaks • Analysis of the Slammer outbreak • The entire Internet address space was scanned in under 10 minutes • The need for fast and automated reactions • Different granularity of containment mechanisms – signature-based vs. IP-address-based • Signature-based methods can be an order of magnitude more effective • Halt all spreading once a signature is identified • Signatures must be generated quickly to offer effective containment • Slammer may require signature generation in under 5 minutes or even 60 seconds

Existing Techniques • Worm Detection • Scan detection • A worm can be highly unusual in the number, frequency and distribution of addresses it scans • Network telescopes – passively monitor large ranges of unused yet routable address space • Not suited to worms that spread non-randomly (e.g., email viruses, worms spread via IM or P2P communications) • IP-based detection – less responsive • Honeypots • Monitored idle hosts with unpatched vulnerabilities, used to isolate and analyze a worm • Honeypots have to be infected first, and manual analysis is slow • Host-based behavioral detection • Dynamic analysis of system call patterns for anomalous activity • Expensive to manage and deploy • Hard to infer a large-scale outbreak

Existing Techniques • Characterization – the process of analyzing and identifying a new worm or exploit • A priori vulnerability signatures • Known exploitable vulnerabilities in deployed software • Can be deployed before new worm outbreaks • Relies on well-known vulnerabilities • Automation for signature extraction by Kephart and Arnold • Identify invariant code strings through decoy program infection • Assumes a controlled environment and a known instance of the virus • Kim and Karp’s Autograph system • Uses network-level data to infer worm signatures • Differences from this work • A prefiltering step that identifies flows with suspicious scanning activity • Cannot detect email-borne worms, UDP-based worms, or worms spread through P2P • Extensive support and active coordination between multiple sensors • Offline system only evaluated through traces

Existing Techniques • Containment – mechanism to slow or stop the spread of an active worm • Host quarantine • The act of preventing an infected host from communicating with other hosts • String-matching containment • Match network traffic against particular strings and drop associated packets • Approach used in this work • Connection throttling • Proactively limit the rate of all outgoing connections • Slows but does not stop the spread of a worm
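The string-matching containment bullet above can be illustrated with a minimal sketch. The signature bytes, the sample packets and the function name `should_drop` are all hypothetical, chosen only for illustration; a real deployment would match in the forwarding path rather than in Python.

```python
def should_drop(payload, signatures):
    """Return True if any known signature occurs anywhere in the payload."""
    return any(sig in payload for sig in signatures)

# Hypothetical signature and packets, for illustration only.
signatures = [b"\x90\x90EXPLOIT"]
packets = [b"GET /index.html", b"junk\x90\x90EXPLOITjunk"]

# Forward only the packets that match no signature.
forwarded = [p for p in packets if not should_drop(p, signatures)]
```

Only the first packet is forwarded here, because the second contains the signature as a substring.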

Worm Behavior and Signatures • Content invariance • Some or all of the worm program is invariant across every copy • Some worms have limited polymorphism, but key portions are still invariant • Content prevalence • The invariant portion of a worm’s content will appear frequently on the network as it spreads or attempts to spread • Address dispersion • The number of distinct hosts infected by a worm grows over time, and the distribution of infected addresses will be far more uniform than in typical traffic

Worm Behavior and Signatures • Worms must • Generate significant traffic to spread • Traffic contains common substrings • Directed between a variety of different sources and destinations • Content sifting • Sifting out network content which is not prevalent or not widely dispersed, leaving only the worm-like content • Prevalence table – catch packet strings that are seen often • Address lists – strings coming from enough sources and going to enough destinations • Substrings left after sifting can be used as signatures to filter out worms
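The sifting idea above can be sketched naively with exact dictionaries and sets. This is only an illustration of the two tests (prevalence, then dispersion); it uses whole payloads as the content key and unbounded memory, and the function name `sift`, the addresses and the small thresholds are assumptions made for the example.

```python
from collections import defaultdict

def sift(packets, prevalence_threshold, dispersion_threshold):
    """Return content seen often AND between many distinct sources/destinations."""
    counts = defaultdict(int)          # prevalence table
    sources = defaultdict(set)         # address lists per content string
    destinations = defaultdict(set)
    for src, dst, content in packets:
        counts[content] += 1
        sources[content].add(src)
        destinations[content].add(dst)
    return {
        content
        for content, n in counts.items()
        if n >= prevalence_threshold
        and len(sources[content]) >= dispersion_threshold
        and len(destinations[content]) >= dispersion_threshold
    }

# Hypothetical traffic: worm-like content, a repeated one-to-one transfer, a one-off.
packets = (
    [(f"10.0.0.{i}", f"10.0.1.{i}", b"WORM") for i in range(1, 4)]
    + [("10.0.0.9", "10.0.1.9", b"BACKUP")] * 4
    + [("10.0.0.4", "10.0.1.4", b"hello")]
)
suspects = sift(packets, prevalence_threshold=3, dispersion_threshold=3)
```

The repeated transfer is prevalent but not dispersed, and the one-off is neither, so only the worm-like content survives the sift.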

Content Sifting

Practical Content Sifting • Scale to high-speed links • Estimating content prevalence • A table indexed by payload can use up all memory in no time • 1 GByte table exhausted in 10 seconds on a 1 Gbps link • Index the table using a fixed-size hash of the packet payload • Multistage filters with conservative update • Multiple hash tables • Hash content using a different hash function for each table, and increment the corresponding counter in each table • Record the content string if all hashed counters are above a certain threshold • Dramatically reduces memory requirement
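A minimal sketch of a multistage filter with conservative update follows, under stated assumptions: salted BLAKE2b from the Python standard library stands in for the independent hash functions, the counters are 8-bit and saturate at 255, and the class name, bin count and threshold are illustrative rather than taken from EarlyBird.

```python
import hashlib

class MultistageFilter:
    """Multistage filter with conservative update (illustrative sketch)."""

    def __init__(self, stages=4, bins=1024, threshold=3):
        self.bins = bins
        self.threshold = threshold
        # One array of 8-bit counters per stage.
        self.counters = [bytearray(bins) for _ in range(stages)]

    def _index(self, stage, content):
        # A different salt per stage stands in for independent hash functions.
        digest = hashlib.blake2b(content, digest_size=8,
                                 salt=stage.to_bytes(2, "big")).digest()
        return int.from_bytes(digest, "big") % self.bins

    def add(self, content):
        """Count one occurrence; return True once every stage reaches the threshold."""
        slots = [(c, self._index(s, content)) for s, c in enumerate(self.counters)]
        # Conservative update: raise only the counters below the new minimum.
        new_min = min(min(c[i] for c, i in slots) + 1, 255)
        for c, i in slots:
            if c[i] < new_min:
                c[i] = new_min
        return new_min >= self.threshold

prevalence = MultistageFilter()
hits = [prevalence.add(b"repeated worm bytes") for _ in range(3)]
once = prevalence.add(b"benign one-off")
```

Conservative update leaves counters that are already above the new minimum untouched, which limits how much unrelated strings sharing a bin can inflate each other's counts.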

Practical Content Sifting • Estimating content prevalence • Append the destination port and protocol to the content before hashing • Effectively excludes large amounts of prevalent content not generated by worms (potential false positives) • Invariant content could be a string much smaller than a single packet, and can occur at different offsets • Detect repeating strings with a small fixed length β • A variant of Rabin fingerprints is used to fingerprint all possible substrings of a certain length
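The fixed-length substring fingerprinting above can be sketched with a rolling hash. Note the substitution: this uses a simple Rabin-Karp style polynomial hash modulo a prime, not a true Rabin fingerprint over GF(2) and not the variant the slides refer to. The constants `BASE` and `MOD`, the function names, and the choice to form the table key as a tuple with protocol and destination port are assumptions for illustration.

```python
BASE = 257
MOD = (1 << 61) - 1  # a large prime modulus (assumed, for illustration)

def rolling_fingerprints(payload, beta):
    """Yield (offset, fingerprint) for every beta-byte window of payload."""
    if len(payload) < beta:
        return
    top = pow(BASE, beta - 1, MOD)   # weight of the byte leaving the window
    fp = 0
    for b in payload[:beta]:
        fp = (fp * BASE + b) % MOD
    yield 0, fp
    for i in range(beta, len(payload)):
        # Remove the oldest byte, shift, and add the newest byte.
        fp = ((fp - payload[i - beta] * top) * BASE + payload[i]) % MOD
        yield i - beta + 1, fp

def content_key(fingerprint, protocol, dst_port):
    """Combine a fingerprint with protocol and destination port."""
    return (fingerprint, protocol, dst_port)

beta = 40
invariant = bytes(range(beta))                   # hypothetical invariant content
p1 = b"xx" + invariant + b"tail"
p2 = b"a longer prefix " + invariant
fps1 = {fp for _, fp in rolling_fingerprints(p1, beta)}
fps2 = {fp for _, fp in rolling_fingerprints(p2, beta)}
(_, invariant_fp), = rolling_fingerprints(invariant, beta)
```

The same 40-byte string yields the same fingerprint at either offset, so `invariant_fp` appears in both `fps1` and `fps2`, while the key differs when the destination port differs.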

Practical Content Sifting • Estimating address dispersion • Count the distinct source and destination IP addresses associated with each suspected content string • Critical for avoiding false positives among the prevalent content strings • Efficient solutions needed due to the large number of suspected content strings • Scaled bitmap • Accurately estimates address dispersion using a small amount of memory • Hash each content source or destination to a bitmap • Subsample the range of the hash space • Allows the storage of the bitmap to remain constant across an enormous range of counts

Practical Content Sifting • Estimating address dispersion • Recycle the bitmap covering the largest fraction of the hash space when it is filled up • Clear and map it to the largest uncovered portion of the hash space, which is half of the portion covered by the rightmost bitmap
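The following is a deliberately simplified sketch of the idea behind bitmap-based distinct counting, not the scaled bitmap itself: it uses a single fixed bitmap that covers a sampled fraction of the hash space and a linear-counting estimate, and it omits the recycling of bitmaps described above. The class name `SampledBitmap`, the bitmap size, the sampling shift and the use of SHA-256 are all assumptions for illustration.

```python
import hashlib
import math

class SampledBitmap:
    """Estimate distinct addresses with one bitmap over a sampled hash range."""

    def __init__(self, bits=256, sample_shift=2):
        self.bits = bits
        self.sample_shift = sample_shift      # covers 1 / 2**sample_shift of the hash space
        self.bitmap = [False] * bits

    def add(self, address):
        h = int.from_bytes(hashlib.sha256(address.encode()).digest()[:8], "big")
        # Keep only hashes whose top sample_shift bits are zero (subsampling).
        if h >> (64 - self.sample_shift) != 0:
            return
        # The low bits pick the bitmap position, independently of the sampling bits.
        self.bitmap[h % self.bits] = True

    def estimate(self):
        zeros = max(self.bitmap.count(False), 1)   # avoid log of infinity when full
        sampled = self.bits * math.log(self.bits / zeros)
        return sampled * (2 ** self.sample_shift)  # scale up for the unsampled range

addresses = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
bm = SampledBitmap()
for a in addresses:
    bm.add(a)
first = bm.estimate()
for a in addresses:          # duplicates set the same bits again
    bm.add(a)
second = bm.estimate()
```

Repeated addresses do not change the estimate, and the memory used stays at 256 bits regardless of how many addresses are seen; the estimate is approximate, with error that depends on the bitmap size and the sampling fraction.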

Practical Content Sifting • CPU scaling • Fingerprinting payload strings requires significant CPU processing • Large number of substrings in each packet payload • Can overload the CPU during high traffic load • If β=40, a 1000-byte packet requires processing 960 Rabin fingerprints • Traffic surges make the problem even worse • Solution • Dynamic sampling of substrings • Value sampling – only choose substrings for which the fingerprint matches a certain pattern • Assuming a sample fraction of f, a worm substring length of β and a worm signature length of x, the miss probability is approximately e^(-f(x-β+1)) • Sample value f – tradeoff between processing overhead and probability of missing a worm • x ≥ 400 for all current worms – when f=1/64, the probability of false negatives is at most 0.36%
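As a worked check of the sampling tradeoff, the sketch below assumes the miss probability is approximated by e^(-f(x - beta + 1)), where beta is the substring length; this form reproduces the 0.36% figure quoted above. The value-sampling rule shown (keep fingerprints whose low 6 bits are zero, giving f = 1/64) is one possible pattern chosen for illustration, and both function names are hypothetical.

```python
import math

def miss_probability(f, beta, x):
    """Probability that none of the x - beta + 1 substrings of a signature is sampled."""
    return math.exp(-f * (x - beta + 1))

def is_sampled(fingerprint, low_bits=6):
    """Value sampling: keep fingerprints whose low bits are all zero (f = 1 / 2**low_bits)."""
    return fingerprint & ((1 << low_bits) - 1) == 0

p_miss = miss_probability(f=1 / 64, beta=40, x=400)
print(f"miss probability: {p_miss:.4%}")
```

Because the sampling decision depends only on the fingerprint value, the same substring is either always sampled or never sampled, wherever it appears.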

Practical Content Sifting • Summary • Content prevalence table • A high-pass filter for frequent content • Four independent hash functions – 4 counter arrays updated using conservative update optimization • Address dispersion table • Typically fewer values – only those strings exceeding the prevalence threshold • Both tables need to be cleared regularly • 60 seconds for content prevalence table, hours for address dispersion table • Modest memory requirement, no deployment restrictions, can be implemented in either hardware or software

Practical Content Sifting

Implementation and Evaluation • System design • EarlyBird system built and run at UCSD campus for over eight months • Two major components • Sensors • Sifts through traffic on configurable address space zones and reports anomalous signatures • Aggregators • Coordinates updates from sensors, coalesces related signatures and activates blocking services, administrative reporting and control • Automatically generates and deploys precise content-based signatures to automatically block outbreaks

Implementation and Environment • EarlyBird sensor on a 1.6 GHz AMD Opteron 242 1U server configured with a standard Linux 2.6 kernel • Single-threaded application executes at user level • 5,000 lines of code • Sifts over 1 TB of traffic per day and keeps up with over 200 Mbps of continuous traffic • Sampling probability of 1/64 • Monitors all inbound and outbound traffic • The router manages traffic to/from 5000 hosts

Parameter tuning • Content prevalence threshold • Use a value of 3 (on a 60-second measurement interval) • Over 97% of all signatures repeat two or fewer times and 94.5% are only observed once • An enormous number of content strings are removed from consideration by the content prevalence test

Parameter tuning • Address dispersion threshold • As the dispersion threshold increases, the number of strings detected decreases dramatically • With a threshold of 30, only 5 or 6 prevalent strings meet the dispersion criteria – these are either worms or strings that can be post-filtered by a whitelist • Tradeoff between detection speed and false positives

Parameter tuning • Garbage collection • The elapsed time before an entry in the address dispersion table is garbage collected • With a timeout value of 100 seconds, 60 percent of all signatures are garbage collected before a subsequent update occurs, preventing the signature from meeting the dispersion threshold and being reported • With a timeout of 1000 seconds, the percentage drops to 20% • A timeout of several hours is chosen since the dispersion table is small
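The effect of the timeout can be sketched with a toy dispersion table. This uses exact sets rather than bitmaps, and the class name, entry layout and timestamps are assumptions for illustration only.

```python
class DispersionTable:
    """Toy address dispersion table whose entries expire after a timeout."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.entries = {}   # signature -> {"sources", "dests", "last_update"}

    def update(self, signature, src, dst, now):
        entry = self.entries.setdefault(
            signature, {"sources": set(), "dests": set(), "last_update": now})
        entry["sources"].add(src)
        entry["dests"].add(dst)
        entry["last_update"] = now

    def collect_garbage(self, now):
        """Remove entries not updated within the timeout; return how many were removed."""
        stale = [s for s, e in self.entries.items()
                 if now - e["last_update"] > self.timeout]
        for s in stale:
            del self.entries[s]
        return len(stale)

def sources_after_gap(timeout):
    # Two sightings of the same signature, 200 seconds apart, with a GC pass in between.
    table = DispersionTable(timeout)
    table.update("sig", "10.0.0.1", "10.0.1.1", now=0)
    table.collect_garbage(now=150)
    table.update("sig", "10.0.0.2", "10.0.1.2", now=200)
    return len(table.entries["sig"]["sources"])

short_timeout = sources_after_gap(100)
long_timeout = sources_after_gap(1000)
```

With the 100-second timeout the first sighting is discarded before the second arrives, so the count restarts at one source; with the 1000-second timeout both sources accumulate toward the dispersion threshold.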

Performance • Processing time • Count elapsed CPU cycles for each component • Most significant operations • Initial Rabin fingerprint, accessing the multistage filter and creating a new address dispersion table entry • Considering the 1/64 sampling rate, the effective per byte processing time is 0.042 microseconds • Can sustain a 200Mbps load
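The link between the per-byte cost and the sustainable load is simple arithmetic, shown below as a quick check. It assumes a single core spending all of its time on sifting, with no other per-packet overhead.

```python
PER_BYTE_MICROSECONDS = 0.042          # effective per-byte cost quoted above

bytes_per_second = 1e6 / PER_BYTE_MICROSECONDS
megabits_per_second = bytes_per_second * 8 / 1e6

print(f"{megabits_per_second:.0f} Mbps")
```

This gives roughly 190 Mbps, in line with the approximately 200 Mbps load quoted above.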

Performance • Memory consumption • Major memory hog – the content prevalence table • 4 stage filters, each stage 524,288 bins, each bin 8 bits – a total of 2 MB of memory • Other memory usage • The address dispersion table • Between 5K and 25K entries of 28 bytes each – under 1 MB of memory • Total memory consumption of EarlyBird • 4 MB • Can be further reduced by using a higher prevalence threshold • On-chip implementation is potentially possible
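The memory figures can be checked with a few lines of arithmetic. The sketch assumes that "K" means 1,000 entries and that MB means 2^20 bytes; the variable names are illustrative.

```python
STAGES = 4
BINS_PER_STAGE = 524_288
BYTES_PER_BIN = 1                      # 8-bit counters

prevalence_bytes = STAGES * BINS_PER_STAGE * BYTES_PER_BIN

ENTRY_BYTES = 28
dispersion_min_bytes = 5_000 * ENTRY_BYTES
dispersion_max_bytes = 25_000 * ENTRY_BYTES

MB = 2 ** 20
print(prevalence_bytes / MB, dispersion_max_bytes / MB)
```

The prevalence table comes to exactly 2 MB, and even the largest dispersion table stays around 0.67 MB, under the 1 MB bound quoted above.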

Trace-based verification • False positives • The prevalence of different signatures over time that meet the dispersion threshold of 10 • Two most active signatures – the Slammer and Opaserv worms • A pervasive string on TCP port 445 and the Blaster worm • Others • Likely worms • Distributed scans and some particular protocol structures • Two principal sources of false positives • Common protocol headers – can be easily whitelisted • Unsolicited bulk email (SPAM) – harder to whitelist, yet its interdiction is far more benign • One source of false positives that defies easy analysis • Many-to-many download profile of BitTorrent

Trace-based verification • False negatives • Impossible to quantitatively demonstrate the absence of false negatives • Every worm outbreak reported on public mailing lists was detected by EarlyBird • No false negatives when compared with the snort-signature mailing list

Performance • Inter-packet signatures • An attacker can evade detection by splitting an invariant string into pieces one byte smaller than β • The content sifting algorithm can be extended to detect such simple evasions at the cost of per-flow state management • Live experiences of EarlyBird • EarlyBird detected signatures for variants of CodeRed, the MyDoom mail worm and the recent Sasser and Kibvu.B worms • Sasser and Kibvu.B signatures were reported long before the public reports of the worms’ spread
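The splitting evasion and the per-flow remedy can be illustrated with a small sketch. It compares raw substrings instead of fingerprints for clarity, and the helper name `windows`, the 40-byte invariant and the packet split are hypothetical.

```python
def windows(data, beta):
    """Return the set of all beta-byte substrings of data (empty if data is shorter)."""
    return {data[i:i + beta] for i in range(len(data) - beta + 1)}

beta = 40
invariant = bytes(range(beta))                    # hypothetical invariant worm content

# The attacker sends the invariant string in pieces shorter than beta.
pieces = [invariant[:beta - 1], invariant[beta - 1:]]

# Per-packet sifting never sees a full beta-byte window of the invariant string.
seen_per_packet = any(invariant in windows(p, beta) for p in pieces)

# Keeping per-flow state and sifting the reassembled bytes recovers it.
seen_per_flow = invariant in windows(b"".join(pieces), beta)
```

Per-packet inspection misses the string while inspection of the reassembled flow finds it, which is why the remedy costs per-flow state.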

Limitations and extensions • Variant content • Worms with little or no invariant content • Instruction sequence mutation, semantically equivalent but textually distinct code • More complex analysis for content sifting is needed • Compression • Common code sequence reuse – leads to many false positives • Vulnerabilities in popular implementations of encrypted session protocols such as SSH can be exploited by worms • These problems cannot be handled by current techniques • Network evasion • Evade monitoring through traditional IDS evasion techniques

Limitations and extensions • Extensions • Sensitivity study of parameters and “autotune” capacity for EarlyBird’s content sifting parameters in different environments • Handle slow worms • Maintaining triggering data across multiple time scales • Hybrid system combined with host-based intrusion detection or honeypots • Containment • Rate-limit first before final traffic block • Tradeoff between detection speed and false positives • Malicious worm detection trigger • Denial-of-service on legitimate traffic carrying a specific string • Coordination • Share a given signature across deployment at different sites • Related issues of trust, validation and policy

Conclusions • An approach for real-time detection of unknown worms and automated extraction of unique content signatures • Content sifting algorithm efficiently analyzes network traffic for prevalent and widely dispersed content strings • Moderate memory and computational requirements • EarlyBird is able to detect and extract signatures of all contemporary worms and of new worms • Underlying methodology can also be used to detect other activity • Bulk email (SPAM), peer-to-peer system activity • Feasibility of sophisticated, widespread network security • Signature learning at gigabit speeds is viable
