
Automated Worm Fingerprinting

Automated Worm Fingerprinting. Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage Presenter: Yi Qiao. Outline. Introduction Background Worm Behavior and Worm Signatures Practical Content Sifting Implementation and Evaluation Limitations Conclusions. Introduction.


Presentation Transcript


  1. Automated Worm Fingerprinting Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage Presenter: Yi Qiao

  2. Outline • Introduction • Background • Worm Behavior and Worm Signatures • Practical Content Sifting • Implementation and Evaluation • Limitations • Conclusions

  3. Introduction • Internet worms • Small programs that exploit software vulnerabilities in popular network services, seize control of program execution, and send copies of themselves to other susceptible hosts • Bigger threat and damage • Software homogeneity, Internet’s unrestricted communication model • Increased speed, virulence and sophistication of new generations of worms and viruses • Different mechanisms and consequences • Little advance in worm detection, characterization and containment • Detection: intrusion detection + administrator legwork • Manual characterization of worm signatures • Contain infections through anti-virus software and network filtering • Inefficient, expensive, and slow • Hours or days to complete • Effective worm containment can require a reaction time of sixty seconds!

  4. Introduction • What this work has done • Two observations • Some portions of the content in existing worms are invariant • The spreading dynamics of a worm are atypical of Internet applications • Content sifting to identify new worms and their precise signatures • A prototype system based on the content sifting approach, EarlyBird, for real-time worm detection and containment

  5. Background • Empirical analyses of the CodeRed worm outbreak • The operational repair rate averaged under 2 percent per day • Fully automated intervention is necessary to manage outbreaks • Analysis of the Slammer outbreak • The entire Internet address space was scanned in under 10 minutes • The need for fast and automated reactions • Different granularity of containment mechanisms – signature-based vs. IP-address-based • Signature-based methods can be an order of magnitude more effective • Halt all spreading once a signature is identified • Signatures must be generated quickly to offer effective containment • Slammer may require signature generation in under 5 minutes or even 60 seconds

  6. Existing Techniques • Worm Detection • Scan detection • A worm can be highly unusual in the number, frequency and distribution of addresses it scans • Network telescopes – passively monitor large ranges of unused yet routable address space • Not suited for worms that do not spread randomly (e.g., email viruses, worms spreading via IM or P2P communications) • IP-based detection – less responsive • Honeypots • Monitored idle hosts with unpatched vulnerabilities, used to isolate and analyze a worm • Honeypots have to be infected first + slow manual analysis • Host-based behavioral detection • Dynamic analysis of system call patterns for anomalous activity • Expensive to manage and deploy • Hard to infer a large-scale outbreak

  7. Existing Techniques • Characterization – the process of analyzing and identifying a new worm or exploit • A priori vulnerability signatures • Known exploitable vulnerabilities in deployed software • Can be deployed before new worm outbreaks • Relies on well-known vulnerabilities • Automation for signature extraction by Kephart and Arnold • Identify invariant code strings through decoy program infection • Assumes a controlled environment and a known instance of the virus • Kim and Karp’s Autograph system • Uses network-level data to infer worm signatures • Differences from this work • A prefiltering step that identifies flows with suspicious scanning activity • Cannot detect email-borne worms, UDP-based worms, or worms spreading through P2P • Extensive support and active coordination between multiple sensors • Offline system only evaluated through traces

  8. Existing Techniques • Containment – mechanism to slow or stop the spread of an active worm • Host quarantine • The act of preventing an infected host from communicating with other hosts • String-matching containment • Match network traffic against particular strings and drop associated packets • Approach used in this work • Connection throttling • Proactively limit the rate of all outgoing connections • Slows but does not stop the spread of a worm

  9. Worm Behavior and Signatures • Content invariance • Some or all of the worm program is invariant across every copy • Some worms have limited polymorphism, but key portions are still invariant • Content prevalence • The invariant portion of a worm’s content will appear frequently on the network as it spreads or attempts to spread • Address dispersion • The number of distinct hosts infected by a worm grows over time, and the distribution of infected addresses will be far more uniform than in typical traffic

  10. Worm Behavior and Signatures • Worms must • Generate significant traffic to spread • Traffic contains common substrings • Directed between a variety of different sources and destinations • Content sifting • Sifting out network content which is not prevalent or not widely dispersed, leaving only the worm-like content • Prevalence table – catch packet strings that are seen often • Address lists – strings coming from enough sources and going to enough destinations • Substrings left after sifting can be used as signatures to filter out worms
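
The two tests on slide 10 can be sketched naively as follows. This is only an illustration under stated assumptions: the function name `sift`, the `(src, dst, payload)` tuple format, and the threshold values are hypothetical, and exact counts and sets are used where the real system uses compact approximations.

```python
from collections import defaultdict

def sift(packets, prevalence_threshold=3, dispersion_threshold=2):
    """Return payloads that are both prevalent and widely dispersed.

    packets: iterable of (src, dst, payload) tuples (hypothetical format).
    Thresholds are illustrative values, not tuned ones.
    """
    counts = defaultdict(int)          # prevalence table: payload -> times seen
    sources = defaultdict(set)         # address lists: payload -> distinct sources
    destinations = defaultdict(set)    # address lists: payload -> distinct destinations
    for src, dst, payload in packets:
        counts[payload] += 1
        sources[payload].add(src)
        destinations[payload].add(dst)
    # Whatever survives both tests is the worm-like content.
    return {
        p for p in counts
        if counts[p] >= prevalence_threshold
        and len(sources[p]) >= dispersion_threshold
        and len(destinations[p]) >= dispersion_threshold
    }
```

A payload repeated many times between the same pair of hosts passes the prevalence test but fails the dispersion test, so it is sifted out.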

  11. Content Sifting

  12. Practical Content Sifting • Scale to high-speed links • Estimating content prevalence • A table indexed by payload can use up all memory in no time • A 1 GByte table is exhausted in 10 seconds on a 1 Gbps link • Index the table using a fixed-size hash of the packet payload • Multi-stage filters with conservative update • Multiple hash tables • Hash content using a different hash function for each hash table, and increment the corresponding counter in each table • Record the content string if all hashed counters are above a certain threshold • Dramatically reduces memory requirement
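
A minimal sketch of a multistage filter with conservative update, as outlined on slide 12. The class name, the use of salted BLAKE2b as the per-stage hash, and the default sizes are assumptions made for illustration; the slides do not say which hash functions EarlyBird uses.

```python
import hashlib

class MultistageFilter:
    """Approximate per-key counters kept in several small hash-indexed arrays."""

    def __init__(self, stages=4, bins=1024, threshold=3):
        self.stages = stages
        self.bins = bins
        self.threshold = threshold
        self.counters = [[0] * bins for _ in range(stages)]

    def _indexes(self, key: bytes):
        # One hash per stage, derived by salting BLAKE2b with the stage number
        # (an assumption; any set of independent hash functions would do).
        for stage in range(self.stages):
            digest = hashlib.blake2b(key, digest_size=8, salt=bytes([stage])).digest()
            yield stage, int.from_bytes(digest, "big") % self.bins

    def add(self, key: bytes) -> bool:
        """Count one occurrence; return True once every stage reaches the threshold."""
        idx = list(self._indexes(key))
        lowest = min(self.counters[s][i] for s, i in idx)
        # Conservative update: raise only the counters that hold the minimum,
        # which limits the overcounting caused by hash collisions.
        for s, i in idx:
            if self.counters[s][i] == lowest:
                self.counters[s][i] = lowest + 1
        return lowest + 1 >= self.threshold
```

A key is reported only when its counters in all stages have reached the threshold, so an unrelated key has to collide with heavy keys in every stage to be reported by mistake.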

  13. Practical Content Sifting • Estimating content prevalence • Append the destination port and protocol to the content before hashing • Effectively excludes large amounts of prevalent content not generated by worms (potential false positives) • Invariant content could be a string much smaller than a single packet, and could occur at different offsets • Detect repeating strings with a small fixed length β • A variant of Rabin fingerprints is used to fingerprint all possible substrings of a certain length
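
The sliding-window idea on slide 13 can be sketched with a rolling hash. Note the substitution: this uses a simple polynomial rolling hash modulo a prime as a stand-in, not a true Rabin fingerprint and not the variant the authors use. The function name and the `base` and `mod` parameters are hypothetical.

```python
def window_fingerprints(payload: bytes, beta: int = 40,
                        base: int = 257, mod: int = (1 << 61) - 1):
    """Yield (offset, fingerprint) for every beta-byte substring of payload."""
    if len(payload) < beta:
        return
    high = pow(base, beta - 1, mod)  # weight of the byte that leaves the window
    fp = 0
    for byte in payload[:beta]:
        fp = (fp * base + byte) % mod
    yield 0, fp
    for i in range(beta, len(payload)):
        # Drop the oldest byte, shift the window, and append the new byte.
        fp = ((fp - payload[i - beta] * high) * base + payload[i]) % mod
        yield i - beta + 1, fp
```

Each step costs a constant amount of work regardless of β, and identical substrings produce identical fingerprints at any offset, which is what lets an invariant string be recognized wherever it appears in a packet.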

  14. Practical Content Sifting • Estimating address dispersion • Count the distinct source and destination IP addresses associated with each suspected content string • Critical for avoiding false positives among the prevalent content strings • Efficient solutions needed due to the large number of suspected content strings • Scaled bitmap • Accurately estimates address dispersion using a small amount of memory • Hash each content source or destination to a bitmap • Subsample the range of the hash space • Allows the storage of the bitmap to remain constant across an enormous range of counts

  15. Practical Content Sifting • Estimating address dispersion • Recycle the bitmap covering the largest fraction of the hash space when it is filled up • Clear and map it to the largest uncovered portion of the hash space, which is half of the portion covered by the rightmost bitmap
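
To make the bitmap idea concrete, here is a sketch of a plain direct bitmap with a linear-counting estimate. This is a deliberately simpler stand-in and not the scaled bitmap from slides 14 and 15: it has no subsampling or recycling, so it saturates once most bits are set. The class name, bitmap size, and hash choice are assumptions.

```python
import hashlib
import math

class DirectBitmap:
    """Estimate the number of distinct addresses using a fixed number of bits."""

    def __init__(self, bits=64):
        self.bits = bits
        self.bitmap = 0

    def add(self, address: str) -> None:
        digest = hashlib.blake2b(address.encode(), digest_size=8).digest()
        # Repeated addresses always set the same bit, so duplicates are free.
        self.bitmap |= 1 << (int.from_bytes(digest, "big") % self.bits)

    def estimate(self) -> float:
        zeros = self.bits - bin(self.bitmap).count("1")
        if zeros == 0:
            # Saturated: the scaled bitmap avoids this by subsampling the hash space.
            return float("inf")
        # Linear counting: n is approximately -b * ln(fraction of zero bits).
        return -self.bits * math.log(zeros / self.bits)
```

The saturation case is the limitation the scaled bitmap is designed to remove while keeping storage constant.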

  16. Practical Content Sifting • CPU scaling • Processing payload strings requires significant CPU time • Large number of substrings in each packet payload • Can overload the CPU during high traffic load • If β=40, a 1000-byte packet requires processing about 960 Rabin fingerprints • Traffic surges make the problem even worse • Solution • Dynamic sampling of substrings • Value sampling – only choose substrings for which the fingerprint matches a certain pattern • Assume a sample fraction of f, a worm substring length of β and a worm signature length of x; the miss probability is e^(-f(x-β+1)) • Sample value f – tradeoff between processing overhead and probability of missing a worm • x ≥ 400 for all current worms – when f=1/64, the probability of false negatives is at most 0.36%
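
The numbers on slide 16 can be checked with a short calculation. The sketch assumes the miss probability has the form e^(-f(x-β+1)), which reproduces the quoted 0.36% for f=1/64, x=400 and β=40. The sampling rule shown (keep fingerprints whose low 6 bits are zero) is one hypothetical pattern that yields a 1/64 fraction; the slides do not specify the pattern used.

```python
import math

def miss_probability(f: float, x: int, beta: int) -> float:
    """Chance that none of the x - beta + 1 substrings of a signature is sampled."""
    return math.exp(-f * (x - beta + 1))

def is_sampled(fingerprint: int, zero_bits: int = 6) -> bool:
    """Value sampling: keep a fingerprint only if its low bits are all zero.

    With zero_bits=6 this keeps 1 in 64 values (hypothetical pattern).
    """
    return fingerprint & ((1 << zero_bits) - 1) == 0

p = miss_probability(1 / 64, 400, 40)
print(f"miss probability: {p:.4%}")
```

Because the decision depends only on the fingerprint value, the same substring is either always sampled or never sampled, so repeated worm content keeps hitting the same table entries.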

  17. Practical Content Sifting • Summary • Content prevalence table • A high-pass filter for frequent content • Four independent hash functions – 4 counter arrays updated using conservative update optimization • Address dispersion table • Typically fewer values – only those strings exceeding the prevalence threshold • Both tables need to be cleared regularly • 60 seconds for content prevalence table, hours for address dispersion table • Modest memory requirement, no deployment restrictions, can be implemented in either hardware or software

  18. Practical Content Sifting

  19. Implementation and Evaluation • System design • EarlyBird system built and run at UCSD campus for over eight months • Two major components • Sensors • Sifts through traffic on configurable address space zones and reports anomalous signatures • Aggregators • Coordinates updates from sensors, coalesces related signatures and activates blocking services, administrative reporting and control • Automatically generates and deploys precise content-based signatures to automatically block outbreaks

  20. Implementation and Environment • EarlyBird sensor on a 1.6 GHz AMD Opteron 242 1U server configured with a standard Linux 2.6 kernel • Single-threaded application that executes at user level • 5,000 lines of code • Sifts over 1 TB of traffic per day and keeps up with over 200 Mbps of continuous traffic • Sampling probability of 1/64 • Monitors all inbound and outbound traffic • The router manages traffic to/from 5000 hosts

  21. Parameter tuning • Content prevalence threshold • Use a value of 3 (on a 60-second measurement interval) • Over 97% of all signatures repeat two or fewer times and 94.5% are only observed once • An enormous number of content strings are removed from consideration by the content prevalence test

  22. Parameter tuning • Address dispersion threshold • As the dispersion threshold increases, the number of strings detected decreases dramatically • With a threshold of 30, only 5 or 6 prevalent strings meet the dispersion criteria – these are either worms or strings that can be post-filtered by a whitelist • Tradeoff between detection speed and false positives

  23. Parameter tuning • Garbage collection • The elapsed time before an entry in the address dispersion table is garbage collected • With a timeout value of 100 seconds, 60 percent of all signatures are garbage collected before a subsequent update occurs, preventing the signature from meeting the dispersion threshold and being reported • With a timeout of 1000 seconds, the percentage drops to 20% • A timeout of several hours is chosen since the dispersion table is small

  24. Performance • Processing time • Count elapsed CPU cycles for each component • Most significant operations • Initial Rabin fingerprint, accessing the multistage filter and creating a new address dispersion table entry • Considering the 1/64 sampling rate, the effective per byte processing time is 0.042 microseconds • Can sustain a 200Mbps load
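
As a quick consistency check on slide 24, the per-byte cost can be converted to a link rate. The helper name is hypothetical and the calculation assumes the CPU is the only bottleneck.

```python
def sustainable_mbps(microseconds_per_byte: float) -> float:
    """Link rate (in Mbps) that a given per-byte processing cost can keep up with."""
    bytes_per_second = 1e6 / microseconds_per_byte
    return bytes_per_second * 8 / 1e6

print(f"{sustainable_mbps(0.042):.1f} Mbps")
```

At 0.042 microseconds per byte this comes to roughly 190 Mbps, in line with the stated figure of about 200 Mbps.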

  25. Performance • Memory consumption • Major memory hog – the content prevalence table • 4 stage filters, each stage 524,288 bins, each bin 8 bits – a total of 2 MB of memory • Other memory usage • The address dispersion table • Between 5K and 25K entries of 28 bytes each – under 1 MB of memory • Total memory consumption of EarlyBird • 4 MB • Can be further reduced by using a higher prevalence threshold • On-chip implementation is potentially possible
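
The memory figures on slide 25 follow from simple arithmetic, sketched here with hypothetical helper names and the sizes quoted on the slide.

```python
def prevalence_table_bytes(stages: int = 4, bins: int = 524_288, bits_per_bin: int = 8) -> int:
    """Size of the multistage content prevalence table in bytes."""
    return stages * bins * bits_per_bin // 8

def dispersion_table_bytes(entries: int, bytes_per_entry: int = 28) -> int:
    """Size of the address dispersion table in bytes."""
    return entries * bytes_per_entry

print(prevalence_table_bytes() / 2**20, "MiB for the prevalence table")
print(dispersion_table_bytes(25_000) / 1e6, "MB for 25K dispersion entries")
```

The prevalence table comes to exactly 2 MiB, and even 25K dispersion entries take about 0.7 MB, consistent with the "under 1 MB" figure.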

  26. Trace-based verification • False positives • The prevalence over time of different signatures that meet the dispersion threshold of 10 • Two most active signatures – the Slammer and Opaserv worms • A pervasive string on TCP port 445 and the Blaster worm • Others • Likely worms • Distributed scans and some particular protocol structures • Two principal sources of false positives • Common protocol headers – can be easily whitelisted • Unsolicited bulk email (SPAM) – harder to whitelist, yet its interdiction is far more benign • One source of false positives that defies easy analysis • Many-to-many download profile of BitTorrent

  27. Trace-based verification • False negatives • Impossible to quantitatively demonstrate the absence of false negatives • Every worm outbreak reported on public mailing lists was detected by EarlyBird • No false negatives when compared with the snort-signature mailing list

  28. Performance • Inter-packet signatures • An attacker can evade detection by splitting an invariant string into pieces one byte smaller than β • The content sifting algorithm can be extended to detect such simple evasions at the cost of per-flow state management • Live experiences of EarlyBird • EarlyBird detected signatures for variants of CodeRed, the MyDoom mail worm and the more recent Sasser and Kibvu.B worms • Sasser and Kibvu.B signatures were reported long before the public reports of the worms’ spread

  29. Limitations and extensions • Variant content • Worms with little or no invariant content • Instruction sequence mutation, semantically equivalent but textually distinct code • More complex analysis for content sifting is needed • Compression • Common code sequence reuse – lead to lots of false positives • Vulnerabilities in popular implementations of encrypted session protocols such as SSH can be exploited by worms • Problems cannot be handled by current techniques • Network evasion • Evade monitoring through traditional IDS evasion techniques

  30. Limitations and extensions • Extensions • Sensitivity study of parameters and “autotune” capability for EarlyBird’s content sifting parameters in different environments • Handle slow worms • Maintaining triggering data across multiple time scales • Hybrid system combined with host-based intrusion detection or honeypots • Containment • Rate-limit first before final traffic block • Tradeoff between detection speed and false positives • Malicious worm detection trigger • Denial-of-service on legitimate traffic carrying a specific string • Coordination • Share a given signature across deployments at different sites • Related issues of trust, validation and policy

  31. Conclusions • An approach for real-time detection of unknown worms and automated extraction of unique content signatures • Content sifting algorithm efficiently analyses network traffic for prevalent and widely dispersed content strings • Moderate memory and computational requirements • EarlyBird is able to detect and extract signatures for all contemporary worms as well as new worms • Underlying methodology can be used to detect other kinds of traffic • Bulk email (SPAM), peer-to-peer system activity • Feasibility of sophisticated wide-spread network security • Signature learning at Gigabit speeds is viable
