Static Code Analysis

Static Code Analysis, 梁广泰 (Liang Guangtai), 2011-05-25

Outline • Motivation • Static program analysis (concepts + examples) • Program defect analysis (research work)

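Motivation • Characteristics of cloud platforms • Applications are deployed directly on cloud servers, which creates security risks • An application can directly manipulate and damage the server's file system • A security vulnerability in an application can give attackers a way in • Resources are shared and dynamically allocated • A single poorly performing application takes resources away from other applications • One solution: • Run static code analysis on an application before it is deployed: • Does it make prohibited calls? (illegal file access) • Does it contain inefficient code? (heavy String concatenation without StringBuilder) • Does it contain security vulnerabilities? (SQL injection, cross-site attacks, denial of service) • Does it contain malicious code such as viruses? • …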

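Static Code Analysis • Definition: • Static program analysis, or static analysis for short, is the technique of analyzing a program without executing it. • Compared with: • Dynamic program analysis: requires actually executing the program • Program comprehension: the term "static analysis" usually refers to analysis performed by automated tools, whereas manual analysis is usually called program comprehension • Uses: • Program translation/compilation (compilers), program optimization and refactoring, software defect detection, etc. • Process: • In most cases the input to static analysis is source code or intermediate code (such as Java bytecode); only rarely is object code used. The analysis results are output in a specific form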

Static Code Analysis • Basic Blocks • Control Flow Graph • Dataflow Analysis • Live Variable Analysis • Reaching Definition Analysis • Lattice Theory

Basic Blocks • A basic block is a maximal sequence of consecutive three-address instructions with the following properties: • The flow of control can only enter the basic block through the first instruction. • Control leaves the block without halting or branching, except possibly at the last instruction. • Basic blocks become the nodes of a flow graph, with edges indicating which blocks can follow which.

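Basic Block Example
  (1) i = 1
  (2) j = 1
  (3) t1 = 10 * i
  (4) t2 = t1 + j
  (5) t3 = 8 * t2
  (6) t4 = t3 - 88
  (7) a[t4] = 0.0
  (8) j = j + 1
  (9) if j <= 10 goto (3)
  (10) i = i + 1
  (11) if i <= 10 goto (2)
  (12) i = 1
  (13) t5 = i - 1
  (14) t6 = 88 * t5
  (15) a[t6] = 1.0
  (16) i = i + 1
  (17) if i <= 10 goto (13)
Leaders • (1), (2), (3), (10), (12), (13)
Basic Blocks • A: (1) • B: (2) • C: (3)-(9) • D: (10)-(11) • E: (12) • F: (13)-(17)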
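The partition into leaders and basic blocks can be sketched in a few lines of Python. This is only an illustration, not the slides' own implementation. It assumes the usual textbook leader rules (the first instruction, every jump target, and every instruction that follows a jump) and that jumps are written as `goto (N)`. The names `CODE`, `jump_target`, `find_leaders` and `basic_blocks` are hypothetical.

```python
import re

# Three-address code of the example; list index 0 holds instruction (1).
CODE = [
    "i = 1", "j = 1", "t1 = 10 * i", "t2 = t1 + j", "t3 = 8 * t2",
    "t4 = t3 - 88", "a[t4] = 0.0", "j = j + 1", "if j <= 10 goto (3)",
    "i = i + 1", "if i <= 10 goto (2)", "i = 1", "t5 = i - 1",
    "t6 = 88 * t5", "a[t6] = 1.0", "i = i + 1", "if i <= 10 goto (13)",
]

def jump_target(instr):
    """Return the 1-based target of a 'goto (N)' instruction, or None."""
    match = re.search(r"goto \((\d+)\)", instr)
    return int(match.group(1)) if match else None

def find_leaders(code):
    """Apply the three leader rules (assumed here, standard textbook form)."""
    leaders = {1}                          # rule 1: the first instruction
    for number, instr in enumerate(code, start=1):
        target = jump_target(instr)
        if target is not None:
            leaders.add(target)            # rule 2: the target of a jump
            if number < len(code):
                leaders.add(number + 1)    # rule 3: the instruction after a jump
    return sorted(leaders)

def basic_blocks(code):
    """Return (first, last) instruction numbers of each basic block."""
    leaders = find_leaders(code)
    bounds = leaders[1:] + [len(code) + 1]
    return [(start, end - 1) for start, end in zip(leaders, bounds)]

print(find_leaders(CODE))
print(basic_blocks(CODE))
```

Each block runs from one leader up to the instruction just before the next leader, which is why the blocks can be read off directly from the sorted leader list.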

Control-Flow Graphs • Control-flow graph: • Node: an instruction or sequence of instructions (a basic block) • Two instructions i, j are in the same basic block iff execution of i guarantees execution of j • Directed edge: potential flow of control • Distinguished Entry and Exit nodes • First & last instruction in program

Control-Flow Edges • Basic blocks = nodes • Edges: • Add a directed edge from B1 to B2 if: • There is a branch from the last statement of B1 to the first statement of B2 (B2 is a leader), or • B2 immediately follows B1 in program order and B1 does not end with an unconditional branch (goto) • Definition of predecessor and successor: if there is an edge from B1 to B2, then • B1 is a predecessor of B2 • B2 is a successor of B1
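The two edge rules can be tried out on the basic blocks A to F of the earlier 17-instruction example. The sketch below is a hedged illustration: the `BLOCKS` table, the `build_edges` helper and the textual `goto (N)` format are assumptions made for the example, and only the last instruction of each block is inspected.

```python
import re

# Hypothetical block table for the earlier example:
# name -> (first instruction number, last instruction number, last instruction).
BLOCKS = {
    "A": (1, 1, "i = 1"),
    "B": (2, 2, "j = 1"),
    "C": (3, 9, "if j <= 10 goto (3)"),
    "D": (10, 11, "if i <= 10 goto (2)"),
    "E": (12, 12, "i = 1"),
    "F": (13, 17, "if i <= 10 goto (13)"),
}

def build_edges(blocks):
    """Return the CFG edges as (predecessor, successor) pairs."""
    order = sorted(blocks, key=lambda name: blocks[name][0])
    by_leader = {blocks[name][0]: name for name in order}
    edges = [("Entry", order[0])]
    for pos, name in enumerate(order):
        last = blocks[name][2]
        jump = re.search(r"goto \((\d+)\)", last)
        if jump:
            # Rule 1: branch from the last statement to a leader.
            edges.append((name, by_leader[int(jump.group(1))]))
        if not last.startswith("goto"):
            # Rule 2: fall through unless the block ends with an unconditional goto.
            following = order[pos + 1] if pos + 1 < len(order) else "Exit"
            edges.append((name, following))
    return edges

edges = build_edges(BLOCKS)
successors = {n: [b for a, b in edges if a == n] for n in ["Entry", *BLOCKS]}
print(successors)
```

Predecessors can be derived from the same edge list by swapping the roles of the two pair components.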

CFG Example

Dataflow Analysis • Compile-Time Reasoning About • Run-Time Values of Variables or Expressions • At Different Program Points • Which assignment statements produced value of variable at this point? • Which variables contain values that are no longer used after this program point? • What is the range of possible values of variable at this program point? • ……

Program Points • One program point before each node • One program point after each node • Join point – point with multiple predecessors • Split point – point with multiple successors

Live Variable Analysis • A variable v is live at point p if • v is used along some path starting at p, and • there is no definition of v along that path before the use. • When is a variable v dead at point p? • There is no use of v on any path from p to the exit node, or • all paths from p redefine v before using v.

What Use is Liveness Information? • Register allocation. • If a variable is dead, can reassign its register • Dead code elimination. • Eliminate assignments to variables not read later. • But must not eliminate last assignment to variable (such as instance variable) visible outside CFG. • Can eliminate other dead assignments. • Handle by making all externally visible variables live on exit from CFG

Conceptual Idea of Analysis • start from exit and go backwards in CFG • Compute liveness information from end to beginning of basic blocks

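Liveness Example • Assume a, b, c are visible outside the method • So they are live on exit • Assume x, y, z, t are not visible • Represent liveness using a bit vector • The order is a b c x y z t
  Block 1: a = x+y; t = a; c = a+x; branch on x == 0 • live at start 0101110 • live at end 1100111
  Block 2: b = t+z; • live at start 1000111 • live at end 1100100
  Block 3: c = y+1; • live at start 1100100 • live at end (method exit) 1110000
  The vectors are consistent with edges from Block 1 to Blocks 2 and 3, and from Block 2 to Block 3.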

Formalizing Analysis • Each basic block has • IN - set of variables live at start of block • OUT - set of variables live at end of block • USE - set of variables with upwards exposed uses in block (use prior to definition) • DEF - set of variables defined in block prior to use • USE[x = z; x = x+1;] = { z } (x not in USE) • DEF[x = z; x = x+1; y = 1;] = {x, y} • Compiler scans each basic block to derive USE and DEF sets
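As a rough illustration of how a compiler might scan a block to derive USE and DEF, here is a small Python sketch. It assumes the block consists only of simple `lhs = rhs` assignments and that any identifier on the right-hand side is a variable read. The function name `use_def` is hypothetical.

```python
import re

def use_def(statements):
    """Compute (USE, DEF) for a straight-line block of 'lhs = rhs' assignments."""
    use, defined = set(), set()
    for stmt in statements:
        lhs, rhs = (side.strip() for side in stmt.split("=", 1))
        for var in re.findall(r"[A-Za-z_]\w*", rhs):
            if var not in defined:
                use.add(var)       # read before any definition in this block
        if lhs not in use:
            defined.add(lhs)       # written before any use in this block
    return use, defined

print(use_def(["x = z", "x = x+1"]))
print(use_def(["x = z", "x = x+1", "y = 1"]))
```

The right-hand side is processed before the left-hand side, so a statement such as `x = x+1` at the start of a block counts as an upward-exposed use of x.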

Algorithm
  for all nodes n in N - { Exit }
      IN[n] = emptyset;
  OUT[Exit] = emptyset;
  IN[Exit] = USE[Exit];
  Changed = N - { Exit };
  while (Changed != emptyset)
      choose a node n in Changed;
      Changed = Changed - { n };
      OUT[n] = emptyset;
      for all nodes s in successors(n)
          OUT[n] = OUT[n] U IN[s];
      IN[n] = USE[n] U (OUT[n] - DEF[n]);
      if (IN[n] changed)
          for all nodes p in predecessors(n)
              Changed = Changed U { p };
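A minimal Python sketch of this worklist algorithm follows, run on the three-block liveness example (a = x+y; t = a; c = a+x; then b = t+z; then c = y+1). Several things are assumptions for illustration: the block names B1 to B3, the CFG edges in `SUCC`, and the choice to model "externally visible variables are live on exit" by seeding IN[Exit] with {a, b, c}.

```python
def liveness(succ, use, defs, live_on_exit):
    """Backward worklist liveness: IN[n] = USE[n] | (OUT[n] - DEF[n])."""
    nodes = list(succ)
    pred = {n: [] for n in nodes}
    for n in nodes:
        for s in succ[n]:
            pred[s].append(n)
    live_in = {n: set() for n in nodes}
    live_out = {n: set() for n in nodes}
    live_in["Exit"] = set(live_on_exit)       # assumption: plays the role of USE[Exit]
    changed = [n for n in nodes if n != "Exit"]
    while changed:
        n = changed.pop()
        # OUT[n] is the union of IN[s] over all successors s of n.
        live_out[n] = set().union(*(live_in[s] for s in succ[n]))
        new_in = use[n] | (live_out[n] - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            for p in pred[n]:
                if p not in changed:
                    changed.append(p)         # predecessors must be revisited
    return live_in, live_out

# Hypothetical encoding of the example CFG.
SUCC = {"B1": ["B2", "B3"], "B2": ["B3"], "B3": ["Exit"], "Exit": []}
USE = {"B1": {"x", "y"}, "B2": {"t", "z"}, "B3": {"y"}}
DEF = {"B1": {"a", "t", "c"}, "B2": {"b"}, "B3": {"c"}}

live_in, live_out = liveness(SUCC, USE, DEF, {"a", "b", "c"})
for block in ("B1", "B2", "B3"):
    print(block, sorted(live_in[block]), sorted(live_out[block]))
```

In the order a b c x y z t, the set computed for the start of B1 is {b, x, y, z}, which corresponds to the bit vector 0101110.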

Reaching Definitions • Concept of definition and use • a = x+y is a definition of a and a use of x and y • A definition reaches a use if the value written by the definition may be read by the use

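Reaching Definitions (example CFG)
  s = 0; a = 4; i = 0; branch on k == 0
  b = 1; (one branch)
  b = 2; (other branch)
  i < n (loop test, reached from both branches and from the loop body)
  s = s + a*b; i = i + 1; (loop body, returns to the loop test)
  return s (loop exit)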

Reaching Definitions and Constant Propagation • Is a use of a variable a constant? • Check all reaching definitions • If all assign variable to same constant • Then use is in fact a constant • Can replace variable with constant

Is a Constant in s = s+a*b?
  s = 0; a = 4; i = 0; branch on k == 0
  b = 1; (one branch)
  b = 2; (other branch)
  i < n (loop test)
  s = s + a*b; i = i + 1; (loop body)
  return s (loop exit)
Yes! On all reaching definitions a = 4

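Constant Propagation Transform
  s = 0; a = 4; i = 0; branch on k == 0
  b = 1; (one branch)
  b = 2; (other branch)
  i < n (loop test)
  s = s + 4*b; i = i + 1; (loop body, a replaced by 4)
  return s (loop exit)
On all reaching definitions a = 4, so the use of a can be replaced with the constant 4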
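The constant check can be sketched as follows. This is a simplified illustration, not the slides' algorithm: the table `DEFS`, the helper `constant_at_use`, and the restriction to integer literals are assumptions. The sketch also takes as given that all seven definitions of the example reach the statement s = s + a*b.

```python
# Definitions of the example, keyed by number: (variable, right-hand side).
DEFS = {
    1: ("s", "0"),
    2: ("a", "4"),
    3: ("i", "0"),
    4: ("b", "1"),
    5: ("b", "2"),
    6: ("s", "s + a*b"),
    7: ("i", "i + 1"),
}

def constant_at_use(var, reaching, defs=DEFS):
    """Return the constant `var` must hold at a use, or None if there is none.

    `reaching` is the set of definition numbers that reach the use.
    """
    values = set()
    for number in reaching:
        name, rhs = defs[number]
        if name != var:
            continue
        if not rhs.lstrip("-").isdigit():
            return None        # a reaching definition is not an integer literal
        values.add(int(rhs))
    # Constant only if every reaching definition assigns the same literal.
    return values.pop() if len(values) == 1 else None

reaching_at_loop_body = {1, 2, 3, 4, 5, 6, 7}
print(constant_at_use("a", reaching_at_loop_body))
print(constant_at_use("b", reaching_at_loop_body))
```

Here a qualifies because its only reaching definition is a = 4, while b does not because both b = 1 and b = 2 reach the use.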

Computing Reaching Definitions • Compute with sets of definitions • Represent sets using bit vectors • Each definition has a position in the bit vector • At each basic block, compute • definitions that reach the start of the block • definitions that reach the end of the block • Do the computation by simulating execution of the program until a fixed point is reached

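Reaching Definitions Example (bit positions 1 2 3 4 5 6 7 correspond to definitions 1 to 7)
  1: s = 0; 2: a = 4; 3: i = 0; branch on k == 0 • reaching start 0000000 • reaching end 1110000
  4: b = 1; • reaching start 1110000 • reaching end 1111000
  5: b = 2; • reaching start 1110000 • reaching end 1110100
  i < n • reaching start 1111111 • reaching end 1111111 (both are 1111100 on the first pass, before definitions 6 and 7 arrive from the loop body)
  6: s = s + a*b; 7: i = i + 1; • reaching start 1111111 • reaching end 0101111
  return s • reaching start 1111111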

Formalizing Reaching Definitions • Each basic block has • IN - set of definitions that reach beginning of block • OUT - set of definitions that reach end of block • GEN - set of definitions generated in block • KILL - set of definitions killed in block • GEN[s = s + a*b; i = i + 1;] = 0000011 • KILL[s = s + a*b; i = i + 1;] = 1010000 • Compiler scans each basic block to derive GEN and KILL sets
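The GEN and KILL sets plug into the usual forward equations, IN[b] = union of OUT[p] over predecessors p, and OUT[b] = GEN[b] U (IN[b] - KILL[b]). The Python sketch below iterates them to a fixed point on the running example. The block names B1 to B6, the `PRED` table, and the use of Python sets in place of machine bit vectors are assumptions for illustration.

```python
def reaching_definitions(pred, gen, kill):
    """Iterate OUT[b] = GEN[b] | (IN[b] - KILL[b]) until nothing changes."""
    rd_in = {b: set() for b in pred}
    rd_out = {b: set() for b in pred}
    changed = True
    while changed:
        changed = False
        for b in pred:
            rd_in[b] = set().union(*(rd_out[p] for p in pred[b]))
            new_out = gen[b] | (rd_in[b] - kill[b])
            if new_out != rd_out[b]:
                rd_out[b] = new_out
                changed = True
    return rd_in, rd_out

def bits(defs, width=7):
    """Render a set of definition numbers as a bit string, definition 1 first."""
    return "".join("1" if d in defs else "0" for d in range(1, width + 1))

# Hypothetical encoding of the example: B1 = definitions 1-3, B2 = b = 1,
# B3 = b = 2, B4 = loop test i < n, B5 = definitions 6-7, B6 = return s.
PRED = {"B1": [], "B2": ["B1"], "B3": ["B1"],
        "B4": ["B2", "B3", "B5"], "B5": ["B4"], "B6": ["B4"]}
GEN = {"B1": {1, 2, 3}, "B2": {4}, "B3": {5}, "B4": set(), "B5": {6, 7}, "B6": set()}
KILL = {"B1": {6, 7}, "B2": {5}, "B3": {4}, "B4": set(), "B5": {1, 3}, "B6": set()}

rd_in, rd_out = reaching_definitions(PRED, GEN, KILL)
for b in PRED:
    print(b, bits(rd_in[b]), bits(rd_out[b]))
```

A round-robin sweep is used here for brevity; a worklist, as in the liveness algorithm, would reach the same fixed point while revisiting fewer blocks.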

Example

Forwards vs. backwards • A forwards analysis is one that for each program point computes information about the past behavior. • Examples of this are available expressions and reaching definitions. • Calculation: predecessors of CFG nodes. • A backwards analysis is one that for each program point computes information about the future behavior. • Examples of this are liveness and very busy expressions. • Calculation: successors of CFG nodes.

May vs. Must • A may analysis is one that describes information that may possibly be true and, thus, computes an upper approximation. • Examples of this are liveness and reaching definitions. • Calculation: union operator. • A must analysis is one that describes information that must definitely be true and, thus, computes a lower approximation. • Examples of this are available expressions and very busy expressions. • Calculation: intersection operator.

Basic Idea • Information about program represented using values from algebraic structure called lattice • Analysis produces lattice value for each program point • Two flavors of analysis • Forward dataflow analysis • Backward dataflow analysis

Partial Orders • Set P • Partial order ≤ such that ∀x,y,z ∈ P • x ≤ x (reflexive) • x ≤ y and y ≤ x implies x = y (antisymmetric) • x ≤ y and y ≤ z implies x ≤ z (transitive) • Can use partial order to define • Upper and lower bounds • Least upper bound • Greatest lower bound

Upper Bounds • If S ⊆ P then • x ∈ P is an upper bound of S if ∀y ∈ S. y ≤ x • x ∈ P is the least upper bound of S if • x is an upper bound of S, and • x ≤ y for all upper bounds y of S • ∨ - join, least upper bound (lub), supremum, sup • ∨S is the least upper bound of S • x ∨ y is the least upper bound of {x,y}

Lower Bounds • If S ⊆ P then • x ∈ P is a lower bound of S if ∀y ∈ S. x ≤ y • x ∈ P is the greatest lower bound of S if • x is a lower bound of S, and • y ≤ x for all lower bounds y of S • ∧ - meet, greatest lower bound (glb), infimum, inf • ∧S is the greatest lower bound of S • x ∧ y is the greatest lower bound of {x,y}

Covering • x < y if x ≤ y and x ≠ y • x is covered by y (y covers x) if • x < y, and • x ≤ z < y implies x = z • Conceptually, y covers x if there are no elements between x and y

Example • P = {000, 001, 010, 011, 100, 101, 110, 111} (standard Boolean lattice, also called hypercube) • x ≤ y if (x bitwise and y) = x • Hasse Diagram • If y covers x • Line from y to x • y above x in diagram
  Top row: 111
  Second row: 011 110 101
  Third row: 010 001 100
  Bottom row: 000
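The ordering, the covering relation and the Hasse diagram of this Boolean lattice are small enough to enumerate directly. The Python sketch below is an illustration only. The helper names (`leq`, `covers`, `join`, `meet`) are hypothetical, and it relies on the fact that in this lattice join is bitwise or and meet is bitwise and.

```python
from itertools import product

# The eight elements 000 .. 111 as bit strings.
P = ["".join(bits) for bits in product("01", repeat=3)]

def leq(x, y):
    """x <= y iff (x bitwise-and y) == x."""
    return int(x, 2) & int(y, 2) == int(x, 2)

def covers(y, x):
    """True if y covers x: x < y and no element lies strictly between them."""
    if x == y or not leq(x, y):
        return False
    return not any(z not in (x, y) and leq(x, z) and leq(z, y) for z in P)

def join(x, y):
    """Least upper bound: bitwise or."""
    return format(int(x, 2) | int(y, 2), "03b")

def meet(x, y):
    """Greatest lower bound: bitwise and."""
    return format(int(x, 2) & int(y, 2), "03b")

# One Hasse-diagram line per covering pair (upper element first).
hasse_edges = sorted((y, x) for y in P for x in P if covers(y, x))
print(len(hasse_edges), hasse_edges)
print(join("001", "010"), meet("011", "110"))
```

The covering pairs are exactly the pairs that differ in a single bit, which is why the diagram has the shape of a cube.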

Lattices • If x ∧ y and x ∨ y exist for all x,y ∈ P, then P is a lattice. • If ∧S and ∨S exist for all S ⊆ P, then P is a complete lattice. • All finite lattices are complete • Example of a lattice that is not complete • Integers I • For any x, y ∈ I, x ∨ y = max(x,y), x ∧ y = min(x,y) • But ∨I and ∧I do not exist • I ∪ {+∞, −∞} is a complete lattice

Lattice Examples • Lattices • Non-lattices

Semi-Lattice • Only one of the two binary operations (meet or join) is required to exist • Meet-semilattice: x ∧ y exists for all x,y ∈ P • Join-semilattice: x ∨ y exists for all x,y ∈ P

Monotonic Function & Fixed Point • Let L be a lattice. A function f : L → L is monotonic if ∀x, y ∈ L : x ≤ y ⇒ f(x) ≤ f(y) • Let A be a set, f : A → A a function, a ∈ A. If f(a) = a, then a is called a fixed point of f on A

Existence of Fixed Points • The height of a lattice is defined to be the length of the longest path from ⊥ to ⊤ • In a complete lattice L with finite height, every monotonic function f : L → L has a unique least fixed point: fix(f) = ∨ { f^i(⊥) | i ≥ 0 }

Knaster-Tarski Fixed Point Theorem • Suppose (L, ≤) is a complete lattice and f: L → L is a monotonic function. • Then the least fixed point m of f can be defined as m = ∧ { x ∈ L | f(x) ≤ x }

Calculating Fixed Point • The time complexity of computing a fixed point depends on three factors: • The height of the lattice, since this provides a bound for the number of iterations i; • The cost of computing f; • The cost of testing equality. • The computation of a fixed point can be illustrated as a walk up the lattice starting at ⊥: ⊥ ≤ f(⊥) ≤ f(f(⊥)) ≤ … until two consecutive values are equal
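The walk up the lattice can be written as a short loop. The following Python sketch is only an illustration: `least_fixed_point` and the example function `f` are hypothetical, the lattice is assumed to be the powerset of {1, 2, 3, 4} ordered by inclusion (height 4, bottom element the empty set), and termination relies on f being monotone on a lattice of finite height.

```python
def least_fixed_point(f, bottom):
    """Iterate bottom, f(bottom), f(f(bottom)), ... until the value stops changing.

    Returns the fixed point and the number of strictly increasing steps taken.
    """
    x, steps = bottom, 0
    while True:
        nxt = f(x)
        if nxt == x:               # equality test: one of the three cost factors
            return x, steps
        x, steps = nxt, steps + 1

def f(s):
    """Hypothetical monotone function: add 1, and n + 1 for each n < 4 in s."""
    return frozenset(s | {1} | {n + 1 for n in s if n < 4})

fixed, steps = least_fixed_point(f, frozenset())
print(sorted(fixed), steps)
```

In this example the number of increasing steps equals the height of the lattice, which is the worst case the height bound allows.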

Application to Dataflow Analysis • Dataflow information will be lattice values • Transfer functions operate on lattice values • Solution algorithm will generate an increasing sequence of values at each program point • Ascending chain condition will ensure termination • Will use ∨ to combine values at control-flow join points

Transfer Functions • Transfer function f: P → P for each node in control flow graph • f models effect of the node on the program information

Transfer Functions • Each dataflow analysis problem has a set F of transfer functions f: P → P • Identity function i ∈ F • F must be closed under composition: ∀f,g ∈ F, the function h = λx.f(g(x)) ∈ F • Each f ∈ F must be monotone: x ≤ y implies f(x) ≤ f(y) • Sometimes all f ∈ F are distributive: f(x ∨ y) = f(x) ∨ f(y) • Distributivity implies monotonicity
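These properties can be checked by brute force on a small lattice. The Python sketch below assumes the powerset of {1, 2, 3} ordered by inclusion, with union as join, and transfer functions of the gen/kill form f(x) = gen U (x - kill) used by analyses such as reaching definitions. All names (`powerset`, `make_transfer`, `is_monotone`, `is_distributive`) are hypothetical.

```python
from itertools import combinations

UNIVERSE = (1, 2, 3)

def powerset(items):
    """All subsets of `items` as frozensets: the lattice elements."""
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def make_transfer(gen, kill):
    """Gen/kill transfer function f(x) = gen | (x - kill)."""
    return lambda x: frozenset(gen) | (x - frozenset(kill))

def is_monotone(f, elements):
    """x <= y implies f(x) <= f(y), with <= being set inclusion."""
    return all(f(x) <= f(y) for x in elements for y in elements if x <= y)

def is_distributive(f, elements):
    """f(x join y) == f(x) join f(y), with join being set union."""
    return all(f(x | y) == f(x) | f(y) for x in elements for y in elements)

L = powerset(UNIVERSE)
identity = make_transfer(gen=set(), kill=set())
f = make_transfer(gen={3}, kill={1})
g = make_transfer(gen={1}, kill={2})
h = lambda x: f(g(x))              # composition of two transfer functions

for name, fn in [("identity", identity), ("f", f), ("g", g), ("h", h)]:
    print(name, is_monotone(fn, L), is_distributive(fn, L))
```

The composition h is again of gen/kill form, which is one way to see why a family of such functions is closed under composition.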
