Characterizing and Reasoning about Security Vulnerabilities

Characterizing and Reasoning about Security Vulnerabilities Shuo Chen Center for Reliable and High-Performance Computing Coordinated Science Laboratory University of Illinois at Urbana-Champaign Preliminary Examination, May 4th, 2004 Committee Chair: Prof. Ravishankar K. Iyer Committee: Prof. Vikram Adve Prof. Jose Meseguer Prof. David Nicol

Significance of Software Implementation Errors • Bugtraq: 70% of security vulnerabilities due to implementation errors.

What I Have Done • Analyzed CERT and Bugtraq reports and the corresponding application source code. • Developed a new FSM representation to decompose each security vulnerability to a series of elementary activities (primitive FSMs), each indicating a simple predicate. • The FSM analysis showed • Many vulnerabilities ( 66%) due to pointer taintedness: user input value used as a pointer value (which should be transparent to users). • A significant portion of vulnerabilities ( 33.6%) due to errors in library functions or incorrect invocations of library functions • The FSM modeling led to a formal reasoning approach to examine pointer taintedness in applications.

Formal Analysis of Pointer Taintedness • Pointer Taintedness: a pointer value, including a return address, is derived directly or indirectly from user input. (formally defined using equational logic) • Provides a unifying perspective for reasoning about a significant number of security vulnerabilities. • The notion of pointer taintedness enables: • Static analysis: reasoning about the possibility of pointer taintedness by source code analysis; • Runtime checking: inserting assertions in object code to check pointer taintedness at runtime; • Hardware architecture-based support to detect pointer taintedness. • Current focus: extraction of security specifications of library functions based on pointer taintedness semantics.

Publications of My Research • Papers: • J. Xu, S. Chen, Z. Kalbarczyk, R. K. Iyer. "An Experimental Study of Security Vulnerabilities Caused by Errors". DSN 2001. • S. Chen, J. Xu, R. K. Iyer, K. Whisnant. "Modeling and Analyzing the Security Threat of Firewall Data Corruption Caused by Instruction Transient Errors". DSN 2002. • S. Chen, Z. Kalbarczyk, J. Xu, R. K. Iyer. "A Data-Driven Finite State Machine Model for Analyzing Security Vulnerabilities". DSN 2003. • S. Chen, K. Pattabiraman, Z. Kalbarczyk, R. K. Iyer, “Formal Reasoning of Various Categories of Widely Exploited Security Vulnerabilities Using Pointer Taintedness Semantics”, IFIP Information Security Conference, 2004. • Security Vulnerability Report • S. Chen and J. Xu, “Bugtraq ID 6255: NULL HTTPD Heap Corruption Vulnerability”, the Bugtraq List.

A Finite State Machine Approach for Analyzing Security Vulnerabilities

Overview of the Study • An analysis of security vulnerability databases (CERT and Bugtraq) • Examination of security vulnerabilities at the application source-code level • A security vulnerability usually consists of a series of vulnerabilities in multiple elementary activities. Each can be represented by a primitive FSM, indicating a simple predicate. • Provide formalism in reasoning and describing security vulnerabilities. • Usefulness of the formalism: discovery of the HTTP daemon heap overflow vulnerability.

Observation from Data Analysis Same vulnerabilities can be classified in different categories. Why? Because of the existence of multiple elementary activities.

SPEC Check State Reject State Accept State Primitive FSM • We use Primitive FSM (pFSM) to depict an elementary activity, which specifies a predicate (SPEC) that should be guaranteed in order to ensure security. IMPL_REJECT SPEC_REJECT IMPL_ACCEPT SPEC_ACCEPT

NULL HTTPD Heap Corruption Vulnerabilities (Bugtraq #5774, #6255) Op 1: Read user input from a socket into a heap buffer contentLen<0 Size(PostData)<length(input) pFSM1 get (contentLen, input) pFSM2 Copy input from the socket Calloc PostData[1024+contentLen] contentLen>=0 length(input) <= Size(PostData) Heap structure corrupted * Op 2: Free the buffer B->fd=AB->bk=C B->fd and B->bk changed Calloc is called pFSM3 When buf is freed, execute B->fd->bk = B->bk B->fd and B->bk unchanged A function pointer corrupted * Op 3: Manipulate the function pointer - Load pFree to the memory during program initialization pFree changed- pFSM4 pFree unchanged - Execute pFree when function free is called Attacker’s malicious code is executed

Op 1: Read User Data from a Socket to a Heap Buffer 0: Get contentLen //Negative ?? 1: PostData = calloc(contentLen +1024, sizeof(char));x=0; rc=0; 2: pPostData= PostData; 3: do { 4: rc=recv(sock, pPostData, 1024, 0); 5: if (rc==-1) { 6: closeconnect(sid,1); 7: return; 8: } 9: pPostData+=rc; 10: x+=rc; 11: }while ((rc==1024)|| (x<contentLen)); contentLen<0 pFSM1 get (contentLen, input) contentLen is an integer, input: string to be read from a socket contentLen>=0 Calloc PostData[1024+contentLen] ? length(input)>Size(PostData) pFSM2 Copy input from the socket to PostData by recv() call length(input) <= Size(PostData)

Sendmail Debugging Function Signed Integer Overflow (Bugtraq #3163) Operation 1: Write integer i to tTvect[x] ? ( integer represented by str_x)> 231 x > 100 pFSM1 get text strings str_x and str_i x < 0 or x > 100 ( integer represented by str_x) 231 convert str_i and str_x to integer i and x x  100 pFSM2 0  x  100 tTvect[x]=i Function pointer is tainted * Operation 2: Manipulate the function pointer ? addr_setuid changed Load the function pointer pFSM3 addr_setuid unchanged Execute code referred by addr_setuid Execute malicious code

Modeled Vulnerabilities • Signed Integer Overflow • Heap Corruption • Stack Overflow • Format String Vulnerabilities • File Race Conditions • Some Input Validation Vulnerabilities

Formal Reasoning of Security Vulnerabilities by Pointer Taintedness Semantics

Pointer Taintedness Caused Vulnerabilities • Format string vulnerability • Taint an argument pointer of functions such as printf, fprintf, sprintf and syslog. • Stack smashing • Taint a return address. • Heap corruption • Taint the free-chunk doubly-linked list of the heap. • Glibc globbing vulnerabilities • User input resides in a location that is used as a pointer by the parent function of glob().

fmt: format string pointer ap: argument pointer fmt: format string pointer ap: argument pointer Example of Format String Vulnerability Vulnerable code: recv(buf); printf(buf); /* should be printf(“%s”,buf) */ \xdd \xcc \xbb \xaa %d %d %d %n High … %n %d %d %d 0xaabbccdd Stack growth Low In vfprintf(), if (fmt points to “%n”) then **ap = (character count) *ap is a tainted value.

Taintedness Semantics (Memory Model) • Astore represents a snapshot of the memory state at a point in the program execution. • For each memory location, we can evaluate two properties: content and taintedness (true/false). • Operations on memory locations: • The fetch operation Ftch(S,A)gives the content of the memory address A in store S • The location-taintedness operation LocT(S,A) gives the taintedness of the location A in store S • Operations on expressions: • The evaluation operation Eval(S,E)evaluates expression E in store S • The expression-taintedness operation ExpT(S,E) computes the taintedness of expression E in store S

Axioms of Eval and ExpT operations Eval(S, I) = I // I is an integer constant Eval(S, ^ E1) = Ftch(S, Eval(S,E1)) Eval(S, E1 + E2) = Eval(S, E1) + Eval(S, E2) Eval(S, E1 - E2) = Eval(S, E1) - Eval(S, E2) … … ExpT (S, I) = false ExpT(S, ^ E1) = LocT(S,Eval(S,E1)) ExpT(S,E1 + E2) = ExpT(S,E1) or ExpT((S,E2) ExpT(S,E1 - E2) = ExpT(S,E1) or ExpT((S,E2) … … E.g., is the expression (^100)–2 tainted? ExpT(S, (^100)–2) = ExpT(S, (^100)) or ExpT(S, 2) = LocT(S,100) or false = LocT(S,100) Note: ^ is the dereference operator, ^100 gives the content in the location 100

Semantics of Language L • Extend the semantics proposed by Goguen and Malcolm • The following operations (arithmetic/logic) are defined: • +, -, *, /, %, !, &&, ||, !=, ==, …… • The following instructions are defined: • mov [Exp1] <- Exp2 • branch (Condition) Label • call FuncName(Exp1,Exp2,…) • Axioms defining mov instruction semantics • Specify the effects of applying mov instruction on a store • Allow taintedness to propagate from Exp2 to [Exp1]. • Axioms defining the semantics of recv (similarly, scanf, recvfrom) • Specify the memory locations tainted by the recv call.

Extracting Function Specifications by Theorem Prover Automatically translated to Language L C source code of a library function Code in language L Theorem generation Critical instruction – indirect writes For each mov [^ E1] <- E2, generate theorems: a) E1 should not be tainted b) The mov instruction should not taint any location outside the buffer pointed by E1 ITP theorem prover A set of sufficient conditions that imply the validity of the theorems. They are the security specifications of the analyzed function.

Example: strcpy() char * strcpy (char * dst, char * src) { char * res; 0: res =dst; while (*src!=0) { 1: *dst=*src; dst++; src++; } 2: *dst=0; return res; } 0: mov [res] <- ^ dst lbl(#while#6) branch (^ ^ src is 0) #ex#while#6 1: mov [^ dst] <- ^ ^ src mov [dst] <- (^ dst) + 1 mov [src] <- (^ src) + 1 branch true #while#6 lbl(#ex#while#6) 2: mov [^ dst] <- 0 mov [ret] <- ^ res Translate to Language L Theorem generation a) Suppose S1 is the store before Line L1, then LocT(S1,dst) = false b) If S0 is the store before Line L0, and S2 is the store after Line L1, then I < Eval(S0, ^dst) or Eval(S0, ^dst+dstsize)  I => LocT(S2,I) = LocT(S0, I) c) Suppose S3 is the store before Line L2, then LocT(S3,dst) = false Theorem prover

Specifications Suggested by Theorem Prover • Suppose when function strcpy() is called, the size of destination buffer (dst) is dstsize, the length of user input string (src) is srclen • Specifications that are extracted by the theorem proving approach • srclen <= dstsize • The buffers src and dst do not overlap in such a way that the buffer dst covers the NULL-terminator of the src string. • The buffers dst and src do not cover the function frame of strcpy. • Initially, dst is not tainted Documented in Linux man page Not documented

Example Scenario Are the extracted specifications possible to be violated in application code? index Destination buffer should not cover the function frame of strcpy. char input[240]; void foo( ) { int offset; char buf[200]; scanf(“%s”, input ); offset = 200 – strlen( input ); strcpy( buf + offset , input ); } High buf buf+offset foo buf Return Addr. Stack growth Frame Pointer src strcpy dst res Low

Other Examples • A simplied version of printf() • 55 lines of C code • Four security specifications are extracted, including one indicating format string vulnerability • Function free() of a heap management system • 36 lines of C code • Seven security specifications are extracted, including several specifications indicating heap corruption vulnerabilities. • Socket read functions of Apache HTTPD and NULL HTTPD • The Apache function is proved to be free of pointer taintedness. • Two (known) vulnerabilities are exposed in the theorem proving process.

Summary • FSM representation: decompose each vulnerability to multiple simple predicates (with real vulnerability examples) • A common characteristic of many predicates: their violations result in pointer taintedness • Defined a memory model to reason about pointer taintedness • Developed a theorem proving approach to extract security specifications from library functions

Future Directions • Develop a VCGen (verification condition generator) to facilitate theorem proving. (in progress) • Apply the pointer taintedness analysis to a substantial number of commonly used library functions to extract their security specifications. • Compiler techniques for inserting “guarding code” to check unproved properties at runtime. • Explore the possibility of building the taintedness notion into virtual machines. • Architecture supports for pointer taintedness detection. A module working with RSE (Reliability and Security Engine).

Backup Slides

Format String Vulnerability int vfprintf (FILE *s, const char *format, va_list ap) { char * p; … … *(int *) va_arg (ap, void *) = count; … … } int printf (const char *format, ...) { … … count = vfprintf (stdout, format, arg); … … } int i,j; int main() { char buf[100]; *(unsigned int *)buf=&i; *(buf+4)=0; strcat(buf,"%d%d%d12345%n"); printf(buf); } buf = \x78 \x99 \x04 \x08 %d %d %d ‘1’ ‘2’ ‘3’ ‘4’ ‘5’ %n the addr of i

Elementary Activity 1 of Sendmail Vulnerability Elementary Activity 1: get user input Get strings str_x and str_i, convert them to integers x and i a ? (integer represented by str_x) > 231 pFSM Get str_x and str_i 1 (integer represented by str_x)  231 Convert str_x and str_i to integers x and i

Elementary Activity 2 of Sendmail Vulnerability Elementary Activity 2: assign debug level x >100 x<0 or x>100 Convert str_x and str_i to integers x and i pFSM2 x 100 0x 100 tTvect[x]=i A function pointer (psetuid) is corrupted

Elementary Activity 3 of Sendmail Vulnerability Elementary Activity 3: manipulation of function pointer psetuid A function pointer (psetuid) is corrupted ? psetuid is changed Load psetuid to the memory pFSM 3 starting sendmail program Execute the code referred by psetuid psetuid is unchanged Execute maliciouscode

Appropriateness of Dereference • A data value x is appropriate to be dereferenced if and only if one of the following condition is true, assuming Y,Z are integer constants: • x is &foo (foo is a program variable) • x is malloc(Y) • If there exist values a, b and c that are appropriate to dereference, (recursive definition) and x = a + b – c + Z • Theorems to prove for indirect write mov [^E1] <- E2 • E1 should be appropriate to dereference • If E2 is not appropriate to dereference, then [^E1] should not be appropriate to dereference.

About Equational Logic A logic defined by equations. Equations are used to rewrite symbolic terms (by replacing the term on the left of the equation with the term on the right of the term). Emphasize on its executability. Define the natural number (NAT): Operators: 0 : a constant of NAT s_ : NAT -> NAT (successor operator) _+_ : NAT NAT -> NAT (addition operator) Equations: 0 + N = N (s M) + N = M + (s N) Example: (s s s 0) + (s s 0) = (s s 0) + (s s s 0) = (s 0) + (s s s s 0) = 0 + (s s s s s 0) = s s s s s 0 Intuitively, this represents “3 + 2 = 5”

Semantics of mov and recv • Axioms of mov instruction Ftch((S ; mov [E1] <- E2),X) = Eval(S,E2) if (Eval(S,E1) is X) . Ftch((S ; mov [E1] <- E2),X) = Ftch(S,X) if not (Eval(S,E1) is X) . LocT((S ; mov [E1] <- E2),X) = ExpT(S,E2) if (Eval(S,E1) is X) . LocT((S ; mov [E1] <- E2),X) = LocT(S,X) if not (Eval(S,E1) is X) . • Semantics of recv (similarly, scanf, recvfrom) • LocT(S ; call recv (sock , buf , len, flag), A) = true if Eval(S,buf) <= A and A < Eval(S, buf + len) . • LocT(S ; call recv (sock , buf , len, flag), A) = LocT(S, A) otherwise .

Related Work • Security Modeling • Sheyner and Wing: Attack graphs • Ortalo and Deswarte: Markov models • Static code analysis • Buffer overflow detection: Wagner, many others • Format string detection: CQUAL, SPLINT • Assembly code verification: Proof-Carrying Code • Generic (annotation based): SPLINT, Eau Claire • Taintedness analysis • Perl runtime • CQUAL and SPLINT: taintedness of program variables. • A symbol gets tainted only if an explicit C statement passes a tainted value to it by assignment, argument passing or function return. No underlying memory model. • Not sufficient to detect real pointer taintedness vulnerabilities.

Position My Work Application Code Library Functions Existing static analysis tools Security Specs My work e.g., src_len < dst_size (strcpy) src and dst do not overlap (strcpy) Do not free a stack buffer Do not double free a buffer First argument of printf cannot come from user … …

Presentation Outline • A Brief Description of FSM Approach of Modeling and Analyzing Security Vulnerabilities • Real Examples of Pointer Taintedness • Definition of Pointer Taintedness in Equational Logic • Extraction of Function Specifications by Theorem Proving • Summary and Future Directions

Extraction of Security Specs of Library Functions using Pointer Taintedness • A formal approach to reason about potential vulnerabilities in library source code. • Reasoning based on a hypothetical memory model: a boolean property taintedness associated with each memory location. • The semantics of pointer taintedness defined in equational logic. • A theorem prover employed to extract security specifications of library functions. • Security specifications extracted by the analysis: • expose different classes of known security vulnerabilities, such as format string, heap corruption and buffer overflow vulnerabilities; • indicate function invocation scenarios that may expose new vulnerabilities.

Observations from Data Analysis (cont.) • Exploiting a vulnerability involves multiple vulnerable operationson several objects. • Exploits must pass through multiple elementaryactivities, each providing an opportunity for performing a security check. • For each elementary activity, the vulnerability data and corresponding code inspections allow us to define a predicate, which if violated, will result in a security vulnerability.

Characterizing and Reasoning about Security Vulnerabilities