Understanding Software Reverse Engineering: Grading, Techniques, and Applications
This lecture delves into Software Reverse Engineering, focusing on grading metrics, including class participation, homework, and project assessments. We explore various static and dynamic analysis techniques, their applications, and importance in software maintenance and malware comprehension. Learn how reverse engineering reduces costs, enhances software reuse, and facilitates documentation and design discovery. Case studies in static analysis, including UML sequence diagrams, demonstrate control flow analysis. Understand the critical trade-offs in precision and size in software analysis.
Lecture 16 Software Reverse Engineering
Grading • Algorithm for deciding your final grades: • Final score: 10% class participation + 40% homework + 50% project • Rank the list: around 50% of A (subject to change), the rest will be B • Project grading: • Signup (2%) + Proposal (10%) + Mid-point check (8%): 20% • Overall score: 80% • Presentation (10%) • Documents (30%) • Quality of your (part of) work (40%) -- your score • 1 person group: you can do less work, but the quality should be good
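The weighting above can be written out as arithmetic. This is only a sketch: the weights come from the slide, while the function names, the 0-100 scale, and the idea of passing each project component as a fraction earned are assumptions made for illustration.

```python
# Weights are taken from the slide; names and scales are illustrative assumptions.
PROJECT_WEIGHTS = {
    "signup": 2, "proposal": 10, "midpoint": 8,          # checkpoints: 20%
    "presentation": 10, "documents": 30, "quality": 40,  # overall score: 80%
}

def project_score(fractions):
    """fractions maps each component to the fraction earned (0.0 to 1.0)."""
    return sum(PROJECT_WEIGHTS[name] * fractions[name] for name in PROJECT_WEIGHTS)

def final_score(participation, homework, project):
    """All three inputs are assumed to be on a 0-100 scale."""
    return 0.10 * participation + 0.40 * homework + 0.50 * project
```

The final ranking step (roughly the top half receiving an A) depends on the whole class list and is not modeled here.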
A Roadmap for Today • Software reverse engineering: an introduction • Static approaches: a case study • Dynamic approaches: a case study • Reverse engineering tools
What is Software Reverse Engineering • Determining structure or behavior of software by building static or dynamic models • The process of analyzing a subject system to create representations of the system at a higher level of abstraction [Chikofsky90] • Goals: • Understand malware: security • Understand legacy code: software maintenance • Input: source code or binary code • Output: invariants, architecture, API rules, … • The code is not changed
Why Reverse Engineering? • Software maintenance is "modification of a software to correct faults, to improve performance or to adapt to a changed environment" (ANSI/IEEE Std 729). • Software maintenance accounts for 50%~90% of total costs in the software life cycle. • Reverse engineering is part of the maintenance process and can facilitate it: through reverse engineering, cost can be reduced and value can be added.
Applications of Reverse Engineering • Program comprehension, visualization • Software reuse • Document • Design discovery • Software verification • Modify software • Change of the environment • Redesign the software
Software Reverse Engineering Overall Approaches • Two general techniques: static and dynamic analysis • Static analysis: search source code • Dynamic analysis: running programs with given input, and collect and analyze runtime information • Two steps: • Collect info • Compilation phases (from source code) • Profilers, logs, debuggers • Abstract info and build models • Mining understandable, high-level models
Static models • Based on code structure, dependency, architecture • Example models: • Class diagrams • Design patterns • Dependency graphs at the levels of components, functions and variables • Contracts • Aspects
Static Approaches • Static: process only the source code; the program is not executed • Advantages: • No executables required • No input needed • Types of static analyses (some of them done in compilers) • Control and data flow analysis • Type checking: types and the set of operations associated with each type • Dependency analysis • Slicing and dicing (different ways to partition the software)
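As a rough illustration of dependency analysis done statically, the sketch below extracts "function calls function" dependencies from source text without running it. It uses Python's standard `ast` module; the sample program in `SOURCE` and the helper name `call_dependencies` are hypothetical, and only direct calls by simple name are recognized.

```python
import ast

# Hypothetical program under analysis; it is parsed, never executed.
SOURCE = '''
def read_input():
    return parse(load())

def load():
    return "42"

def parse(text):
    return int(text)
'''

def call_dependencies(source):
    """Map each top-level function to the names it calls, without executing it."""
    tree = ast.parse(source)
    deps = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for inner in ast.walk(node):
                # Only calls of the form name(...) are recorded in this sketch.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
            deps[node.name] = sorted(callees)
    return deps

print(call_dependencies(SOURCE))
```

No executable and no input are needed, which matches the advantages listed above; the price is that calls through attributes or variables are missed by this simple matcher.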
Case study: Static Control-Flow Analysis for Reverse Engineering of UML Sequence Diagrams • Existing work: UML class diagram and UML sequence diagram • Tools: Together ControlCenter by Borland and Eclipse UML by Omondo • UML sequence diagram: • Software understanding • used for testing - interactions among collaborating objects
A Graph Representation for a Program • Control flow graph (CFG): <N, E> • N: a set of statements in a program • E: edges that represent control transfer between two statements • Example code: bar(); s = (char*)malloc(80); x[10] = '\0'; if (strlen(t) < 8) strcpy(s,t); else strcat(x,t); • CFG nodes: 1: bar() 2: s = (char*)malloc(80) 3: x[10] = '\0' 4: strlen(t) < 8 5: strcpy(s,t) (yes branch of 4) 6: strcat(x,t) (no branch of 4)
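The <N, E> pair for the example can be written down directly as data. The sketch below assumes the node numbering on the slide (1 to 6, with 5 on the "yes" branch and 6 on the "no" branch); the dictionary layout and the helper names `successors` and `paths_from` are illustrative choices.

```python
# N: statements, numbered as on the slide.
STATEMENTS = {
    1: "bar()",
    2: "s = (char*)malloc(80)",
    3: "x[10] = '\\0'",
    4: "strlen(t) < 8",
    5: "strcpy(s, t)",
    6: "strcat(x, t)",
}

# E: control transfers, labelled where the transfer depends on a condition.
EDGES = [(1, 2, None), (2, 3, None), (3, 4, None), (4, 5, "yes"), (4, 6, "no")]

def successors(node):
    """Return (target, label) pairs for every edge leaving node."""
    return [(dst, label) for src, dst, label in EDGES if src == node]

def paths_from(node):
    """Enumerate every path from node to a node with no successors."""
    nxt = successors(node)
    if not nxt:
        return [[node]]
    return [[node] + rest for dst, _ in nxt for rest in paths_from(dst)]
```

Calling `paths_from(1)` yields the two paths through the branch at node 4. The recursion terminates only because this particular graph has no loops.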
Case study: Static Control-Flow Analysis for Reverse Engineering of UML Sequence Diagrams • Two challenges: • How should a CFG be mapped to UML? • Is UML 2.0 expressive enough to specify the discovered control flows? • Sequence diagram: objects and message exchange • Four types of control primitives: • Opt • Loop • Alt • Break
Control Flow Analysis • Control flow analysis • Find the branch nodes (alt/opt edges) • Find the merge points • Find the header of each loop • Find all of the loop exit edges • Exceptional flow is not considered, because in Java any program point can potentially throw an asynchronous exception
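The four steps above can be sketched on a small graph. This is not the algorithm from the case study; it is a textbook-style approximation that assumes a reducible CFG: branch nodes have more than one successor, merge points have more than one predecessor, loop headers are targets of DFS back edges, and exit edges leave the natural loop of a back edge. The example graph and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical CFG: 1 -> 2 (loop header) -> 3 -> {4, 5} -> 6 -> 2, and 2 -> 7 (exit).
EDGES = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 6), (5, 6), (6, 2), (2, 7)]

def build(edges):
    succ, pred = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)
    return succ, pred

def back_edges(succ, entry):
    """Edges whose target is still on the DFS stack, i.e. loop-closing edges."""
    found, on_stack, visited = [], set(), set()
    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in succ[node]:
            if nxt in on_stack:
                found.append((node, nxt))
            elif nxt not in visited:
                dfs(nxt)
        on_stack.discard(node)
    dfs(entry)
    return found

def natural_loop(pred, tail, header):
    """Nodes that reach tail without passing through header, plus header."""
    body = {header, tail}
    work = [tail] if tail != header else []
    while work:
        node = work.pop()
        for p in pred[node]:
            if p not in body:
                body.add(p)
                work.append(p)
    return body

def analyze(edges, entry):
    succ, pred = build(edges)
    nodes = {n for e in edges for n in e}
    branches = sorted(n for n in nodes if len(succ[n]) > 1)
    merges = sorted(n for n in nodes if len(pred[n]) > 1)
    loops = {}
    for tail, header in back_edges(succ, entry):
        body = natural_loop(pred, tail, header)
        loops[header] = sorted((s, d) for s in body for d in succ[s] if d not in body)
    return branches, merges, loops
```

On this graph, node 3 with merge point 6 is the kind of region that could map to an alt fragment, and header 2 with exit edge (2, 7) to a loop fragment; that mapping is a reading of the slide, not part of the code.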
Design Decisions for Analysis • Trade-offs between precision and size of the sequence diagram: • Mapping with replication: fully precise
Design Decisions for Analysis • Mapping without replication: not precise
Applications of Dependency Graphs • Security check • Guidance for refactoring • Regression Testing
Summary for Static Techniques • Advantages and disadvantages of static approaches: • No executable and no input needed • Potentially imprecise: e.g., infeasible paths • References: • Case study: Static control-flow analysis for reverse engineering of UML sequence diagrams • Dependency: Combining slicing and constraint solving for validation of measurement software
Abstracting the dynamic model • Finding behavior patterns, repeating sequences of events • E.g. socket protocol, secure API sequences • Using static abstractions • E.g. representing interactions between high-level software elements in sequence diagrams • Dynamic information is combined with the high-level static model
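One simple way to look for repeating sequences of events is to count fixed-length windows over a recorded trace. The sketch below does only that; the trace contents and the function name `repeated_sequences` are made up for illustration, and real protocol mining typically needs more than fixed-length windows.

```python
from collections import Counter

# Hypothetical event trace recorded from a running program.
TRACE = ["socket", "connect", "send", "recv", "close",
         "socket", "connect", "send", "recv", "close",
         "socket", "connect", "recv", "close"]

def repeated_sequences(trace, length, min_count=2):
    """Count every contiguous window of `length` events and keep the repeats."""
    windows = Counter(tuple(trace[i:i + length])
                      for i in range(len(trace) - length + 1))
    return {seq: n for seq, n in windows.items() if n >= min_count}
```

For this trace, `repeated_sequences(TRACE, 2)` reports that `socket` is always followed by `connect`, which is the kind of observation a socket-protocol pattern could be built from.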
Dynamic models • Finding out the run-time behaviour of software • debugger, profiler, source code instrumentation • Visualisation: • scenarios (sequence diagrams) • State diagrams • (hierarchical) graphs
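Source code instrumentation can be as small as wrapping functions so that each call and return is logged. The following sketch records such events in order, which is the raw material for a scenario or sequence diagram. The decorator name `traced`, the global `EVENTS` list, and the three sample functions are all hypothetical.

```python
import functools

EVENTS = []  # the recorded scenario: (kind, function name) in execution order

def traced(func):
    """Source-level instrumentation: log entry and exit of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        EVENTS.append(("call", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            EVENTS.append(("return", func.__name__))
    return wrapper

@traced
def checkout(cart):
    return charge(total(cart))

@traced
def total(cart):
    return sum(cart)

@traced
def charge(amount):
    return f"charged {amount}"

result = checkout([5, 7])
```

After the run, `EVENTS` holds the nesting of calls exactly as it happened for this one input; a different input could produce a different scenario.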
Other Information can be Found Using Dynamic Approaches • Object creation and related dependencies • Dynamic binding, polymorphism • Method calls (virtual calls and function pointers) • Looking for dead code/reachability analysis • Memory management • Performance and related problems • Concurrency
Case Study: dynamic analysis to find program invariants • Program invariant: a property that holds at a certain point or points of a program • Dynamic invariant detection: runs a program, observes the values that the program computes, and reports the properties that were true over the observed executions • Types of invariants • Constant • Non-zero • Range: a < x < b • Linear: y = ax + b • Ordering: a ≤ b • …
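To make the "Linear: y = ax+b" template concrete, the sketch below derives a and b from two observed samples and then checks them against all the others. It assumes integer observations and is not Daikon's actual procedure; the function name `fit_linear` is hypothetical.

```python
from fractions import Fraction

def fit_linear(samples):
    """Return (a, b) if y == a*x + b holds for all observed (x, y) pairs, else None."""
    points = sorted(set(samples))
    if len(points) < 2 or points[0][0] == points[-1][0]:
        return None  # need two distinct x values to fix a line
    (x0, y0), (x1, y1) = points[0], points[-1]
    a = Fraction(y1 - y0, x1 - x0)  # exact slope, no floating-point rounding
    b = y0 - a * x0
    if all(y == a * x + b for x, y in points):
        return a, b
    return None
```

A result such as `fit_linear([(0, 1), (1, 3), (2, 5)])` only says that y = 2x + 1 held over the observed executions; a single contradicting sample makes the function return `None`.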
Case Study: dynamic analysis to find program invariants • Use of the invariants: • Generate test inputs, predict incompatibilities of component integrations, repairing inconsistent data structures, check correctness • Reference: http://pag.csail.mit.edu/daikon
A stack example
Fields:
Object[] theArray; // Array that contains the stack elements.
int topOfStack; // Index of top element. -1 if stack is empty
Methods:
void push(Object x) // Insert x
void pop() // Remove most recently inserted item
Object top() // Return most recently inserted item
Object topAndPop() // Remove and return most recently inserted item
boolean isEmpty() // Return true if empty; else false
boolean isFull() // Return true if full; else false
void makeEmpty() // Remove all items
Steps to Run Daikon to Infer Invariants for Stack • Create a simple test class: StackArTester • Daikon instruments the code and analyzes the resulting execution traces • It outputs procedure pre- and post-conditions, and also object invariants that hold at every public method entry and exit
Daikon Output for the Stack Example
Object invariants for StackAr:
this.theArray != null
this.theArray.getClass() == java.lang.Object[].class
this.topOfStack >= -1
this.topOfStack <= this.theArray.length - 1
this.theArray[0..this.topOfStack] elements != null
this.theArray[this.topOfStack+1..] elements == null
Pre-conditions for the StackAr constructor:
capacity >= 0
Post-conditions for the StackAr constructor:
orig(capacity) == this.theArray.length
this.topOfStack == -1
this.theArray[] elements == null
Post-conditions for the isFull method:
this.theArray == orig(this.theArray)
this.theArray[] == orig(this.theArray[])
this.topOfStack == orig(this.topOfStack)
(return == false) <==> (this.topOfStack < this.theArray.length - 1)
(return == true) <==> (this.topOfStack == this.theArray.length - 1)
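The object invariants above can be read as executable checks on a concrete stack state. The sketch below is a Python approximation of the Java StackAr class, written for illustration only: the method bodies, the snake_case names, and the helper `object_invariants_hold` are assumptions, not the original source or anything Daikon generates.

```python
class StackAr:
    """Array-backed stack mirroring the fields on the slide (Python sketch)."""

    def __init__(self, capacity):
        self.the_array = [None] * capacity
        self.top_of_stack = -1

    def is_empty(self):
        return self.top_of_stack == -1

    def is_full(self):
        return self.top_of_stack == len(self.the_array) - 1

    def push(self, x):
        if self.is_full():
            raise OverflowError("stack is full")
        self.top_of_stack += 1
        self.the_array[self.top_of_stack] = x

    def top_and_pop(self):
        if self.is_empty():
            return None
        item = self.the_array[self.top_of_stack]
        self.the_array[self.top_of_stack] = None
        self.top_of_stack -= 1
        return item

def object_invariants_hold(s):
    """Check the reported object invariants against one concrete state."""
    top = s.top_of_stack
    return (s.the_array is not None
            and -1 <= top <= len(s.the_array) - 1
            and all(e is not None for e in s.the_array[:top + 1])
            and all(e is None for e in s.the_array[top + 1:]))
```

Note that pushing `None` would break the "elements != null" check. That invariant is reported because the test runs never did so, which illustrates that dynamically inferred invariants describe the observed executions rather than every possible one.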
Daikon Internal design • Grammar of variables: global, input, parameters, return • Grammar of predicates: (75 templates) • conditional predicate • supplied template • Program points: entry and exit
Daikon Internal Structures • Instrumenters (language dependent) • Inference engine (generate-and-check algorithm) • Test a set of parameters against traces • Assume all invariants are possible, then exclude the ones that are contradicted by the observed values • Optimizations • Equal variables • Dynamically constant variables • Suppress weaker invariants • Variable hierarchy
Summary for Dynamic Techniques • Need a good set of test cases • Scalability challenges • Precise techniques
Reverse engineering for OO software • Dynamic behavior may be hard to detect from a static model (creating and deleting objects, garbage collection, dynamic binding, …) -> this emphasises dynamic modelling • Pure object languages support encapsulation (classes, packages, …) -> helps in static reverse engineering -> increases usability of metrics • OO paradigm supports the use of design patterns -> reusability applications (pattern recognition)
Tools • Rigi (University of Victoria, Canada) • http://www.rigi.csc.uvic.ca/ • a research prototype that represents an open and public domain reverse engineering tool • user programmable • analysis for: C, C++, COBOL, PL/AS, LaTeX • SNIFF+ (TakeFive Software) • a software development environment that also provides reverse engineering capabilities
Tools • McCabe’s Visual Reengineering Toolset and Visual Quality Toolset • various views • software metrics (complexity and structuredness) • shown as specific colors on the views • Logiscope (CS Verilog) • reverse eng, code testing, static and dynamic testing, metrics • analysis for: C, C++, Java, ADA • ESW (Viasoft Inc.) • forward and reverse engineering (maintenance), metrics, testing
Tools • Refine (Reasoning Systems Inc.) • an open and programmable tool that works in the Refinery environment • tools for generating source code parsing and conversion tools • features for analyzing and re-engineering code • analysis for: Ada, C, Cobol • Imagix4D (Imagix Corp.) • http://www.powersoftware.com/english/im/index.html • a closed tool that provides a large set of built-in functionalities • several views (also 3D) • analysis for: C/C++
Tools • CodeCrawler • a reverse engineering tool that combines metrics and graphs to visualize OO systems
Tools for OO languages • Produce a class diagram from code • Rational Rose (Rational Software Corp.) • Paradigm Plus (Computer Associates International) • OEW (Innovative Software GmbH) • Graphical Designer (Advanced Software Technologies Inc.) • Domain Objects (Domain Objects Inc.) • COOL:Jex (Sterling Software Inc.) • Fujaba (Paderborn University) • ...
Wild & Crazy Ideas • How good does software need to be? • As a consumer, would you feel comfortable taking an airplane with a failure rate of: A: 0 B: 0.000001 and below C: 0.01 and below D: 0.001 and below • What about stock software, mobile phones, …