Automatic for the people: Reducing inadvertent leaks by personal machines

Landon Cox, Duke University

Inadvertent leaks • Usability and privacy: A Study of Kazaa ... • Good and Krekelberg, CHI, 2003 • In 12 hours, found 150 inboxes on Kazaa • Observed people downloading dummy inbox • Problem hasn’t gone away

Stories from 2009

Technical solution? • Servers: Asbestos, HiStar, Flume • Languages: Jif, Laminar, Resin • Desktop: PrivacyScope, TightLip [Diagram: processes reach files, the network, and IPC through a reference monitor; the policy can come from the user, admin, developer, or automation]

Automatic policy specification • State of the art: pattern matching • Look for strings that look like SSNs, CCs, etc. • find_SSNs, Firefly, SENF, Spider, etc. • A bit brittle and error-prone • High false positive/negative rates • Let’s take a different approach

Key observations 1) Personal machines often cache sensitive data 2) Servers force clients to access files using crypto 3) Crypto is a general technique, used across administrative domains and applications

RedFlag overview • Identifies processes that store decrypted data • Unobtrusive (requires no user input) • Compatible with legacy applications • Compatible with existing Internet protocols • High-level insights • Stop trying to figure out what sensitive data looks like • Use heuristics of how sensitive data is handled

Caveats • We cannot stop all inadvertent leaks • Stop large, important class of leaks • Trust and threat model • Uncompromised host • No IP spoofing or DNS hijacking • Correct, trusted reference monitor (take your pick) • Buggy/absent access-control policies

RedFlag system overview Monitor sockets Compose rules Inspect process

Monitoring sockets • Goal • Try to identify incoming encrypted data • Only at application level (e.g., SSL) • Easy for most widely used apps • Look at remote port (e.g., 443 or 993) • Not always sufficient • Non-standard ports: Skype, Groove, Groupwise • XMPP sends SSL, non-SSL data to same port (5222/TCP)
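A minimal sketch of the port-based first pass described above. The port tables and the names `Verdict` and `classify_remote_port` are illustrative assumptions, not RedFlag's actual code; only ports 443, 993, and 5222 come from the slide, and treating port 80 as plaintext is an assumption made for the example.

```python
from enum import Enum


class Verdict(Enum):
    ENCRYPTED = "encrypted"   # well-known SSL/TLS port
    PLAINTEXT = "plaintext"   # assumed cleartext port
    AMBIGUOUS = "ambiguous"   # needs a further check (e.g., entropy)


# Ports taken from the slide (HTTPS, IMAPS).
ENCRYPTED_PORTS = frozenset({443, 993})
# Assumption for illustration only: plain HTTP is treated as cleartext.
PLAINTEXT_PORTS = frozenset({80})


def classify_remote_port(port: int) -> Verdict:
    """Classify a connection by its remote port alone."""
    if port in ENCRYPTED_PORTS:
        return Verdict.ENCRYPTED
    if port in PLAINTEXT_PORTS:
        return Verdict.PLAINTEXT
    # Non-standard or mixed-use ports (e.g., XMPP on 5222) cannot be
    # decided from the port number, so they are handed to a later stage.
    return Verdict.AMBIGUOUS
```

Anything that comes back as `AMBIGUOUS` would then go through the entropy check and, if needed, the inspection phase.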

Information entropy • Compute entropy score for ambiguous ports • Negligible performance overhead • If score above threshold (~7.9 bits/byte), invoke inspection process • Can induce false positives • Compressed data sent in the clear (e.g., MP3s) • On-the-fly compression schemes (e.g., HTTP Content-Encoding: gzip) • Luckily, doesn’t need to be 100% accurate • Really just a performance optimization to save work • Only used as a first-pass filter • Correct any mistakes in inspection phase
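The entropy score can be illustrated with a byte-level Shannon entropy estimate. This is a sketch under assumptions: the slide gives only the approximate threshold (~7.9 bits/byte), so the function names and the choice to score a whole buffer at once are hypothetical.

```python
import math
from collections import Counter

# Threshold from the slide; everything else here is an assumption.
ENTROPY_THRESHOLD_BITS = 7.9


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_encrypted(data: bytes) -> bool:
    """First-pass filter: high-entropy data is passed on to inspection."""
    return entropy_bits_per_byte(data) > ENTROPY_THRESHOLD_BITS
```

Short buffers cannot reach the threshold even when they are random, because a sample of a few hundred bytes cannot cover all 256 byte values evenly, so a practical monitor would likely score a reasonably large window. Compressed cleartext also scores high, which is why the result is only a hint that the inspection phase can overrule.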

Inspect process • Goals of inspection • Infer when file write depends on network read • Determine whether file write is decrypted data • Use taint-tracking • Too slow to perform in critical path of desktop apps • Perform asynchronously via deterministic replay • Fork if network monitor flags process (port or entropy) • Log libc calls in original, use log in replay process • Attach taint-tracker to replayed process (e.g., PIN) • Perform analysis on a free core in the background
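The log-then-replay idea can be shown with a toy model. This is not how Jockey or RedFlag work internally (they interpose on libc calls of a real process); `Recorder`, `Replayer`, and `handle_download` are hypothetical names used only to show why a logged run can be re-executed later, off the critical path, with the same inputs.

```python
class Recorder:
    """Runs external calls for real and logs their results."""

    def __init__(self):
        self.log = []

    def call(self, name, fn, *args):
        result = fn(*args)
        self.log.append((name, result))
        return result


class Replayer:
    """Answers external calls from a log instead of running them."""

    def __init__(self, log):
        self._entries = iter(log)

    def call(self, name, fn=None, *args):
        logged_name, result = next(self._entries)
        if logged_name != name:
            raise RuntimeError(f"replay diverged: expected {logged_name}, got {name}")
        return result


def handle_download(io, recv):
    """Stand-in for application logic that depends on network reads."""
    return io.call("recv", recv) + io.call("recv", recv)


# Original run: real inputs are consumed and logged.
chunks = iter([b"ab", b"cd"])
recorder = Recorder()
original = handle_download(recorder, lambda: next(chunks))

# Replay run: no network access, same result; a slow analysis such as
# taint tracking could be attached here without delaying the original.
replayed = handle_download(Replayer(recorder.log), None)
```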

Taint tracking • Implement with PIN • Rewrite instructions to propagate taint • Record taint in shadow memory • Key questions • What are the taint sources? • What info to send to the policy composer?

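[Figure: a process address space holding network data (<!DOCTYPE html PUBLIC ...) next to shadow memory holding one-byte taint labels (e.g., 0 0 0 0 0 1); a label is tied to a source description such as “/tmp/attach.pdf, 74.125.45.83:443”] • Fine when there is no ambiguity about the source • But what about ambiguous ports?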
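As a rough illustration of shadow memory, the sketch below keeps a label per application address and a table of source descriptions. It is a toy model in Python, not the PIN tool: real taint tracking propagates labels at the instruction level, and the class and method names here are assumptions.

```python
class ShadowMemory:
    """Maps application addresses to small taint labels (toy model)."""

    def __init__(self):
        self.sources = [None]   # index 0 means "untainted"
        self.labels = {}        # address -> index into self.sources

    def register_source(self, description: str) -> int:
        """Return the label index for a source, adding it if new."""
        if description in self.sources:
            return self.sources.index(description)
        self.sources.append(description)
        return len(self.sources) - 1

    def taint(self, addr: int, length: int, source_id: int) -> None:
        """Mark bytes written by a network read as tainted."""
        for a in range(addr, addr + length):
            self.labels[a] = source_id

    def copy(self, dst: int, src: int, length: int) -> None:
        """Propagate labels for a memory copy (snapshot handles overlap)."""
        snapshot = [self.labels.get(src + i, 0) for i in range(length)]
        for i, label in enumerate(snapshot):
            if label:
                self.labels[dst + i] = label
            else:
                self.labels.pop(dst + i, None)

    def source_of(self, addr: int):
        """Source description for an address, or None if untainted."""
        return self.sources[self.labels.get(addr, 0)]
```

When a tainted buffer reaches a file write, the label's source description is the kind of information that can be handed to the policy composer.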

Ambiguous ports • Search process memory for AES s-boxes • S-boxes are set by algorithm designer • S-boxes are unlikely to appear randomly • (also look for well-known transformations)
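A sketch of the s-box search. It builds the standard AES S-box from its definition (multiplicative inverse in GF(2^8) followed by the affine transform) and scans a byte image for a contiguous copy. The function names are hypothetical, and the contiguous-byte search is a simplification: implementations that store the table in another layout (for example, widened lookup tables) would need additional patterns, which is presumably what the "well-known transformations" bullet refers to.

```python
from typing import List


def _gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def _rotl8(x: int, shift: int) -> int:
    """Rotate an 8-bit value left."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF


def build_aes_sbox() -> bytes:
    """Generate the 256-byte AES S-box from its mathematical definition."""
    sbox = bytearray(256)
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):   # x^254 is the inverse of x in GF(2^8)
                inv = _gf_mul(inv, x)
        sbox[x] = (inv ^ _rotl8(inv, 1) ^ _rotl8(inv, 2)
                   ^ _rotl8(inv, 3) ^ _rotl8(inv, 4) ^ 0x63)
    return bytes(sbox)


AES_SBOX = build_aes_sbox()


def find_aes_sbox(image: bytes) -> List[int]:
    """Return every offset at which the S-box appears contiguously."""
    offsets = []
    start = image.find(AES_SBOX)
    while start != -1:
        offsets.append(start)
        start = image.find(AES_SBOX, start + 1)
    return offsets
```

Because the table is a fixed 256-byte permutation chosen by the algorithm's designers, an accidental match in unrelated data is extremely unlikely, which is what makes it a usable fingerprint.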

Ambiguous ports • If we find s-boxes in a library data section • Assume image is a crypto library • Vast majority of crypto libraries include AES implementation • Instrument lib to set “crypto bit” of inbound taint labels • If crypto bit == 1, network data was “routed” through crypto lib • If crypto bit == 0, assume network data was not decrypted • Also use s-boxes as taint source • Data derived from s-boxes has “AES bit” set • Can use to gauge strength of crypto algorithm [Figure: taint label (byte) made up of an ID index (0 0 0 0 0 1), an AES bit (1), and a crypto bit (1)]
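The one-byte label with an ID index, an AES bit, and a crypto bit can be sketched as a small bit field. The exact bit positions are an assumption (the slide shows the fields but not their encoding): here the crypto bit is bit 0, the AES bit is bit 1, and the remaining six bits hold the ID index.

```python
# Assumed layout: [ID index: 6 bits][AES bit][crypto bit]
CRYPTO_BIT = 0x01
AES_BIT = 0x02
ID_SHIFT = 2
ID_MAX = 0x3F


def make_label(source_id: int, aes: bool = False, crypto: bool = False) -> int:
    """Pack a source index and the two flags into one byte."""
    if not 0 <= source_id <= ID_MAX:
        raise ValueError("source_id must fit in 6 bits")
    label = source_id << ID_SHIFT
    if aes:
        label |= AES_BIT
    if crypto:
        label |= CRYPTO_BIT
    return label


def label_source_id(label: int) -> int:
    """Index into the table of network source descriptions."""
    return (label >> ID_SHIFT) & ID_MAX


def label_has_aes(label: int) -> bool:
    """True if the data was derived from an AES s-box."""
    return bool(label & AES_BIT)


def label_has_crypto(label: int) -> bool:
    """True if the data was routed through a crypto library."""
    return bool(label & CRYPTO_BIT)
```

With six bits for the index, this layout can distinguish at most 63 sources per process plus the untainted value 0; whether RedFlag uses the same split is not stated in the slides.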

Compose rules • Taint-tracking gives three pieces of info • Description of network source • If data was routed through crypto library • If data was derived from AES s-box • Can use this to compose policies

Compose rules • Same source • Allow sensitive files to be copied back to their source • Raise alert otherwise • Generalize hostnames (e.g., *.google.com) • Obfuscation vs. confidentiality • Many P2P clients use crypto to obfuscate • Aren’t trying to protect data so use weak algorithms • (e.g., BitTorrent and LimeWire explicitly do not support AES) • If ambiguous port + no AES, then ignore file
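The rules above can be put together in a small decision function. This is a sketch under assumptions: `FileTaint`, `generalize`, and `decide` are hypothetical names, hostnames are generalized naively to their last two labels (which is wrong for suffixes such as co.uk), and data from well-known encrypted ports is assumed to be sensitive without further checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileTaint:
    source_host: str       # where the file's data came from
    ambiguous_port: bool   # True if the port alone did not imply crypto
    crypto_bit: bool       # data was routed through a crypto library
    aes_bit: bool          # data was derived from an AES s-box


def generalize(hostname: str) -> str:
    """Naive generalization, e.g. mail.google.com -> *.google.com."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return hostname.lower()
    return "*." + ".".join(labels[-2:])


def matches(pattern: str, hostname: str) -> bool:
    """Match a host against a pattern produced by generalize()."""
    hostname = hostname.lower().rstrip(".")
    if pattern.startswith("*."):
        base = pattern[2:]
        return hostname == base or hostname.endswith("." + base)
    return hostname == pattern


def is_sensitive(taint: FileTaint) -> bool:
    """Ambiguous port without crypto or AES means obfuscation: ignore."""
    if taint.ambiguous_port:
        return taint.crypto_bit and taint.aes_bit
    return True


def decide(taint: FileTaint, dest_host: str) -> str:
    """Return "ignore", "allow", or "alert" for sending a file to dest_host."""
    if not is_sensitive(taint):
        return "ignore"
    if matches(generalize(taint.source_host), dest_host):
        return "allow"   # copying back to the (generalized) source
    return "alert"
```

For example, a file downloaded from mail.google.com over port 443 could be uploaded to docs.google.com without an alert, while sending it to an unrelated host would raise one.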

RedFlag implementation • Runs on Ubuntu 8.10 • Modified Jockey for logging/replay • Supports multi-threaded programs • User-level thread library • PIN tool for tainting • Based on sequential taint tracker from Speck • Modified to allow tainting during replay • Implemented s-box search, crypto and AES bits in taint label

Evaluation • Accuracy • How well can RedFlag identify crypto libraries using s-boxes? • How well does RedFlag categorize sensitive files? • Performance • Will asynchronous taint-tracking fall behind?

Identifying crypto libraries • Looked at 10 Ubuntu programs • Email: checkgmail, thunderbird • IM: pidgin • P2P: Azureus, LimeWire, Skype, Transmission • Web: Firefox, Opera, wget • Successfully identified crypto libs in all • Including custom implementations, plugins (flash player) • Interesting case: Opera folds crypto into executable

Categorizing sensitive files • Non-sensitive files • Used Firefox • Loaded 30 most popular websites (Alexa) • RedFlag produced no false positives/negatives • Sensitive files • Downloaded 17 representative sensitive docs • Firefox, thunderbird, pidgin

Categorizing sensitive files

Taint-tracking performance

Conclusions • RedFlag automates policy specification • Heuristic-based approach • Monitor process behavior, not file content • Sensitive files usually downloaded using crypto • Deal with ambiguous ports using entropy scores, AES s-boxes • Evaluation highlights • Automatically identified crypto libraries • Correctly categorized files in 45/47 scenarios • No false positives, three false negatives • Sufficient idle time in long-running process

Thanks! I’m happy to take questions.
