360 likes | 482 Vues
This paper presents FiG, an innovative framework for automatic fingerprint generation, aimed at accurately identifying software versions and operating systems. By applying machine learning techniques, we develop an automated approach to create fingerprints that distinguish various implementations of operating systems and services, effectively addressing the limitations of traditional manual methods. Our contribution includes a novel query exploration process and classification functions that improve efficiency and accuracy in determining system vulnerabilities. Experimental results demonstrate the feasibility and effectiveness of our approach.
E N D
FiG: Automatic Fingerprint Generation Shobha Venkataraman Joint work with Juan Caballero, Pongsin Poosankam, Min Gyung Kang, Dawn Song & Avrim Blum Carnegie Mellon University
Fingerprinting Used to identify: • versions of software on hosts • operating systems of hosts • hosts running versions with vulnerabilities Linux Solaris Windows XP SP2 Network administrator Windows XP SP1
Queries Host Responses Output: what OS? (e.g. Linux) The Fingerprinting Process Fingerprint: • set of queries sent to host + • classification function analyzing queries & responses Well-known fingerprinting tools: nmap, fpdns Fingerprinting Tool
What classification function? Fingerprinting Tool What queries? Finding Fingerprints • How do fingerprinting tools get fingerprints? • Existing approach: • Manual identification • Incomplete, time-consuming • Difficult to keep up-to-date Need automatic, accurate fingerprint generation!
Our Contribution: FiG In particular: • Use machine learning to automatically generate fingerprints • Automatically generate accurate fingerprints: • Distinguishing OS • Distinguishing implementations of DNS servers • Finding new fingerprints Demonstrate automatic fingerprint generation is possible
Outline • Fingerprint Generation Problem • Overview of Approach • Automatic Fingerprint Generation • Experimental Results • Conclusion
Fingerprints Fingerprint Generation Problem Solaris Goal: find fingerprints, i.e. • Useful queries • Classification function that distinguishes implementations Windows XP Linux Fingerprint Generator Fingerprinting Tool
Outline • Fingerprint Generation Problem • Overview of Approach • Automatic Fingerprint Generation • Experimental Results • Conclusion
Candidate Queries Query Exploration Fingerprints FiG: Overview of Approach • Query exploration: Generate candidate queries • Learning: Automatically find fingerprints Learning Fingerprinting Tool FiG: Automatic Fingerprint Generation
Candidate Queries Query Exploration Fingerprints FiG: Overview of Approach Learning Fingerprinting Tool FiG: Automatic Fingerprint Generation
Query Exploration • Goal: generate candidate queries • query: specially crafted packet sent to host • Infeasible to generate all possible queries • All queries = all possible byte combinations of packet header • e.g., 40 bytes of TCP & IP header => 2^320 queries! • Instead, use protocol semantics to design queries
Query Exploration • Queries: packets with unusual values in fields of header Explore unusual values for fields independently • Explore fields with rich semantics exhaustively i.e., all possible values e.g., TCP flags • Explore other fields selectively i.e., some valid, invalid values e.g., tcp checksum, tcp src port
Candidate Queries Query Exploration Fingerprints Testing Phase: test accuracy of fingerprints Training Phase: learn potential fingerprints Data Collection FiG: Overview of Approach Learning Fingerprinting Tool
Training Phase Testing Phase Data Collection Data Collection 1. Send candidate queries to hosts 2. Collect responses from hosts 3. Split into training & testing data Training Data Candidate Queries And Responses Data Collection Testing Data
Testing Phase Data Collection Training Phase Training Phase Goal: learn potential fingerprints from data Intuition: different implementations differ in bytes of responses Learn which bytes of responses distinguish between implementations!
Testing Phase Data Collection What we’re learning Training Phase Outline: • Features • Classification functions • Combining into fingerprints <queries, responses>Solaris 1. Extract features <queries, responses> Linux Data Collection <queries, responses> Windows 2. Combine features to distinguish implementations Training Data
a b c d e f g h j k i k a b c d e f g h j i 3 0 7 4 9 6 Features • Analyze only bytes of response • Use both value & position of individual bytes in response • Capture this idea with position-substring Response byte sequence 2 10 1 8 5 Some example position-substrings
position-substrings of response to query q Classification Functions Analyze each query & each implementation separately e.g. for query q, for Linux implementation YES (comes from Linux) Classification function NO (does not come from Linux) Two classes of functions: • Conjunctions • Decision lists
00 00 16 d0 00 04 16 d0 Conjunctions • Capture identical behaviour across all hosts • require position-substrings distinctive to Linux to appear in responses from ALL Linux hosts if (response[4-5]==0x0000 && response[34-35]==0x16d0) then Linux else NotLinux Linux NotLinux Positions 34-35 Positions 4-5
ff ff 40 e8 Decision Lists • Need more expressivity than conjunctions • Capture multiple types of behaviour within implementation • allow many sets of position-substrings, each distinctive to implementation (e.g. Windows) if (response[34-35] == 0xffff) then Windows else if (response[34-35] == 0x40e8) then Windows else NotWindows Windows Windows Positions 34-35
Testing Phase Data Collection What we’re learning Training Phase Outline: • Features • Classification functions • Combining into fingerprints <queries, responses>Solaris 1. Extract features <queries, responses> Linux Data Collection <queries, responses> Windows 2. Combine features to distinguish implementations
Binary-fingerprints • Binary-fingerprint for implementation (e.g., Linux) is: • single query + • classification function: e.g., conjunction or decision list • = boolean: e.g. Linux, or Not Linux? • Binary-fingerprint separates ONE implementation • Learning (so far) finds binary-fingerprints • Conjunctions/decision lists of position-substrings (e.g. Linux or Not Linux? Windows or NotWindows?)
Multi-class Fingerprint • Combine binary-fingerprints for multiple implementations • Multi-class fingerprint is: • single query + • classification functions e.g. conjunctions, decision lists • = implementation, e.g. Linux, Windows, Solaris, unknown? Linux or Not Linux? Linux? Windows? Solaris? unknown? Windows or Not Windows? Solaris or Not Solaris? Multi-class fingerprint (for query q) Binary-fingerprints for query q
Training Phase Summary • Analyze responses to all queries, one at a time • Use position-substrings of bytes in response • Generate binary-fingerprints & multi-class fingerprints • Send these to testing phase
Training Phase Data Collection Testing Phase Testing Phase Binary & Multi-class Fingerprints Which fingerprints are accurate? Fingerprints Testing Data Fingerprinting Tool
Outline • Fingerprint Generation Problem • Overview of Approach • Automatic Fingerprint Generation • Query Exploration Phase • Learning Phase • Experimental Results • Experimental Setup & Data • Fingerprinting Results: Binary & Multi-class Fingerprints • Examples of New Fingerprints • Conclusion
Experiment Setup & Data • OS fingerprint generation: • 3 OS: 77 Windows, 29 Linux, 22 Solaris hosts • 305 different queries • DNS fingerprint generation: • 5 DNS server implementations: 10 BIND8, 12 BIND9, 11 Windows Server 2003, 10 MyDNS, 11 TinyDNS hosts • 96 different queries
Multi-class Fingerprints One-query fingerprint distinguishing ALL implementations simultaneously • OS: 66 queries with multi-class fingerprints • DNS: 19 queries with multi-class fingerprints • All these are decision lists! • No multi-class fingerprints with conjunctions found • Decision list has greater discriminatory power
All Fingerprints: OS One-query fingerprint distinguishing ONE implementation from rest Binary-fingerprints • Lots more binary-fingerprints! • Find conjunctions & decision lists in binary-fingerprints • Again, more fingerprints with more expressive decision lists • Similar results for DNS
Examples of New Fingerprints • Invalid value in data offset field: • Windows & Solaris hosts respond when value < 5 • Linux hosts do not respond • RST+ACK packets in responses: • Linux & Solaris hosts set TCP Ack # to 0 • Windows hosts set TCP Ack # to Ack # of query
Examples of New Fingerprints • Behaviour on ECN & CWR bits • Linux & Windows ignore ECN & CWR bits in queries • Solaris do not ignore them (sometimes) • Behaviour of QdCount field on invalid queries (DNS fingerprinting) • Some servers copy the field value, others don’t
Conclusion • Automatic fingerprint generation is possible • Use machine learning to identify fingerprints • Generate fingerprints automatically for 2 applications: • Distinguish OS • Distinguish implementations of DNS servers • Find multi-class fingerprints using decision lists • Discover new fingerprints for fingerprinting tools
Thank You! Questions? shobha@cs.cmu.edu
Binary-fingerprints: DNS One-query fingerprint distinguishing ONE implementation from rest • Similar results for DNS binary-fingerprints • More fingerprints with more expressive decision list • No binary-fingerprints with conjunctions for BIND8 & BIND9
Related Work • Active fingerprinting: • Comer & Lin ’94: Probing to find differences in TCP • Padhye & Floyd ’01: compliance testing & protocol violations • Passive Fingerprinting • Paxson ’97: TCP implementation with traffic traces • Beverly ’04, Lippman et al ’03: classify OS • Franklin et al ’06: wireless device driver fingerprinting • Tools: • OS fingerprinting: Nmap, queso, Xprobe, Snacktime • Passive fingerprinting: p0f, siphon • Defeating OS fingerprinting: • Smart et al ’00: TCP Fingerprint scrubber • Tools: Morph, IPPersonality