CS590 Z Matching Program Versions Xiangyu Zhang

Problem Statement • Suppose a program P’ is created by modifying P. Determine the difference between P and P’. For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the artifact in P that corresponds to c’. • Static mapping • Non-trivial • Name comparison? • What if • Clone analysis, comparison checking

Motivations • Validate compiler transformations • Facilitate regression testing • Reverse obfuscation • Information propagation • Debugging • Code plagiarism detection • Information Assurance

Approaches • Static Approaches • Entity name based • String based (MOSS) • AST based (DECKARD) • CFG based (JDIFF) • PDG based (PDIFF) • Binary based (BMAT) • Log based (editor plugin, comparison checking) • Dynamic Approaches (not today)

Static Approaches • Entity name matching • Model a function/field as a tuple • Coarse-grained matching • String matching • Diff (CVS, Subversion) • Longest common subsequence (LCS) • Available operations are addition and deletion • Matched pairs cannot cross one another • Programs are far more complicated than strings • Copy, paste, move • CP-Miner (scales to clone detection in the Linux kernel) • Frequent subsequence mining
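
The LCS bullets above can be illustrated with a minimal Python sketch of a line-based diff. It is not the implementation used by CVS or Subversion; the function names `lcs_table` and `diff` are hypothetical. It shows the two properties listed on the slide: only additions and deletions are available, and matched pairs never cross, so a moved line is reported as one deletion plus one addition.

```python
def lcs_table(a, b):
    """t[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = 1 + t[i + 1][j + 1]
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t


def diff(a, b):
    """Return an edit script of (' ', line), ('-', line), ('+', line) entries."""
    t = lcs_table(a, b)
    i = j = 0
    script = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            script.append((" ", a[i]))  # matched pair, kept in order
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            script.append(("-", a[i]))  # deletion from the old version
            i += 1
        else:
            script.append(("+", b[j]))  # addition in the new version
            j += 1
    script += [("-", line) for line in a[i:]]
    script += [("+", line) for line in b[j:]]
    return script


old = ["a = 1", "b = 2", "print(a)"]
new = ["b = 2", "a = 1", "print(a)"]
for op, line in diff(old, new):
    print(op, line)
```

In this example the line `a = 1` has only moved, yet the script contains it once with `-` and once with `+`, which is the "copy, paste, move" weakness noted on the slide.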

MOSS • Code plagiarism detection • It also handles other digital content • Challenges • White space (variable name) • Noise (“the”, “int i”) • Order scrambling (paragraph reorders) • Problem statement • Given a set of documents, identify substring matches that satisfy two properties: • If there is a substring match at least as long as the guarantee threshold t, then this match is detected; • Do not detect any match shorter than the noise threshold k.

MOSS • k-gram • A contiguous substring of length k
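
As a small illustration of the k-gram definition, the following Python sketch lists every contiguous substring of length k. The `normalize` step (dropping white space and lower-casing) is an assumption about preprocessing, motivated by the white-space challenge on the earlier MOSS slide; both function names are hypothetical.

```python
def normalize(text):
    """Assumed preprocessing: remove all white space and lower-case the text."""
    return "".join(text.split()).lower()


def kgrams(text, k):
    """All contiguous substrings of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


print(kgrams("abcde", 3))  # ['abc', 'bcd', 'cde']
print(kgrams(normalize("A do run run"), 5))
```

A text of length n yields n-k+1 k-grams, and a text shorter than k yields none.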

MOSS • Incremental hashing • Hashing strings of length k is expensive for large k. • “rolling” hash function • The (i+1)th k-gram hash = F (the ith k-gram hash, …)
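
One way to realize the function F above is a polynomial (Karp-Rabin style) hash, where the next k-gram hash is obtained from the previous one by removing the outgoing character and appending the incoming one. The sketch below is an illustration under that assumption; the constants `BASE` and `MOD` and the function names are arbitrary choices, not taken from the slides.

```python
BASE = 257            # assumed base, any value larger than the alphabet works
MOD = (1 << 61) - 1   # assumed modulus (a large prime)


def direct_hash(s):
    """Hash a string from scratch in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, k):
    """Hashes of all k-grams of text (k >= 1), each derived from the previous one in O(1)."""
    if len(text) < k:
        return []
    h = direct_hash(text[:k])
    hashes = [h]
    top = pow(BASE, k - 1, MOD)  # weight of the character that leaves the window
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * top) % MOD  # drop the oldest character
        h = (h * BASE + ord(text[i])) % MOD     # shift and add the new character
        hashes.append(h)
    return hashes


text = "abracadabra"
assert rolling_hashes(text, 4) == [direct_hash(text[i:i + 4]) for i in range(len(text) - 3)]
```

The final assertion checks that the incremental values agree with hashing each k-gram independently.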

MOSS • Fingerprint selection • A subset of hash values • Our goals: find all matching substrings of length at least t; ignore matches shorter than k • Keep one of every t hash values • Keep hash values that are 0 mod p
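
The two selection strategies on this slide can be compared with a short Python sketch. The function names and the sample hash values are hypothetical. The example inserts one hash at the front of the sequence: selection by position then picks entirely different values, while selection by value (0 mod p) still picks the same ones, although it gives no bound on the gap between two selected hashes.

```python
def every_nth(hashes, n):
    """Keep the hash at positions 0, n, 2n, ... (selection by position)."""
    return [(i, h) for i, h in enumerate(hashes) if i % n == 0]


def zero_mod_p(hashes, p):
    """Keep every hash divisible by p (selection by value)."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]


hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88]
shifted = [5] + hashes  # the same content after one insertion at the front

print(every_nth(hashes, 4), every_nth(shifted, 4))
print(zero_mod_p(hashes, 4), zero_mod_p(shifted, 4))
```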

MOSS • Winnowing • Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen • Use a sliding window of size w = t-k+1 • In each window select the minimum hash value, breaking ties by selecting the rightmost occurrence.
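
The window rule above can be sketched in a few lines of Python. This is a straightforward O(n·w) illustration rather than an optimized implementation; the function name `winnow` and the sample hash sequence are assumptions made for the example.

```python
def winnow(hashes, w):
    """Select (position, hash) fingerprints: the minimum of each window of size w,
    taking the rightmost occurrence on ties and recording each position once."""
    selected = []
    last = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum inside this window.
        pos = start + (w - 1 - window[::-1].index(smallest))
        if pos != last:  # selected positions never move left, so this avoids repeats
            selected.append((pos, smallest))
            last = pos
    return selected


hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(hashes, 4))
# [(3, 17), (6, 17), (8, 8), (11, 39), (15, 17)]
```

Because every window contributes its minimum, any run of w consecutive hashes contains at least one fingerprint, which is what yields the guarantee for matches of length at least t. A sequence shorter than w produces no window and therefore no fingerprint in this sketch.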

MOSS • Algorithm • Build an index mapping fingerprints to locations for all documents. • Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document. • Sort the resulting triples (d, d1, fx), (d, d2, fy) by the first two elements, so that matches between the same pair of documents are adjacent. • Matches between documents are rank-ordered by size (number of fingerprints)
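
The steps above can be sketched as follows. This is a simplified illustration, not the MOSS implementation: fingerprints are assumed to be given as `(position, hash)` pairs per document, the names `build_index` and `rank_pairs` are hypothetical, and grouping by document pair in a dictionary stands in for sorting the triples by their first two elements.

```python
from collections import defaultdict


def build_index(fingerprints_by_doc):
    """Map each fingerprint hash to the (document, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc, fingerprints in fingerprints_by_doc.items():
        for pos, h in fingerprints:
            index[h].append((doc, pos))
    return index


def rank_pairs(fingerprints_by_doc):
    """Look up every document's fingerprints and rank document pairs by shared count."""
    index = build_index(fingerprints_by_doc)
    shared = defaultdict(set)
    for doc, fingerprints in fingerprints_by_doc.items():
        for _, h in fingerprints:
            for other, _ in index[h]:
                if other != doc:
                    shared[tuple(sorted((doc, other)))].add(h)
    ranked = [(pair, len(hs)) for pair, hs in shared.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


docs = {
    "a": [(0, 1), (3, 2), (5, 3)],
    "b": [(1, 2), (4, 3)],
    "c": [(0, 3), (2, 9)],
}
print(rank_pairs(docs))
# [(('a', 'b'), 2), (('a', 'c'), 1), (('b', 'c'), 1)]
```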

MOSS • Advantages • Guaranteed to detect any substring match of length at least t • Limitations • Minor edits defeat MOSS. • x= a*b + c vs. z= c + a*b • Insertion, deletion
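
The limitation can be made concrete with the slide's own pair of statements. The sketch below, with the hypothetical helper `shared_kgrams` and an assumed white-space normalization, counts the k-grams the two statements have in common: with k = 4 there are none, so no fingerprint can match even though the statements compute the same value.

```python
def kgrams(text, k):
    """Set of all contiguous substrings of length k."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def shared_kgrams(a, b, k):
    """k-grams common to both texts after removing white space (assumed normalization)."""
    a = "".join(a.split())
    b = "".join(b.split())
    return kgrams(a, k) & kgrams(b, k)


print(shared_kgrams("x= a*b + c", "z= c + a*b", 3))  # {'a*b'}
print(shared_kgrams("x= a*b + c", "z= c + a*b", 4))  # set()
```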

AST based matching • [YANG, 1991, Software Practice and Experience] • Given two functions, build the ASTs • Match the roots • If the roots match, apply LCS to align their subtrees • Continue recursively on the aligned subtrees • Fragile
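
The recursive scheme above can be sketched on a toy tree representation. This is an illustration in the spirit of the slide, not Yang's published algorithm verbatim: trees are assumed to be `(label, children)` tuples, and `match` is a hypothetical name. It returns the number of matched node pairs, using an LCS-style table to align the children of two matched roots.

```python
def match(t1, t2):
    """Number of node pairs matched between two trees given as (label, children)."""
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:
        return 0  # roots differ: nothing below them is matched
    m, n = len(kids1), len(kids2)
    # best[i][j] = best alignment score of the first i and first j children.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1] + match(kids1[i - 1], kids2[j - 1]),
            )
    return 1 + best[m][n]


old = ("if", [("cond", []), ("assign", [("x", []), ("1", [])])])
new = ("if", [("cond", []), ("call", []), ("assign", [("x", []), ("2", [])])])

print(match(old, new))                # 4
print(match(("while", old[1]), new))  # 0
```

The second call shows the fragility noted on the slide: once the root labels differ, none of the otherwise identical subtrees is matched.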

DECKARD (ICSE 2007)

DECKARD • Advantages • Scalability • Insensitive to minor structural changes such as reordering, insertion, deletion • Limitations • Structural similarity only • Insertion that incurs structure change.
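
To give a feel for why a structure-only approach tolerates reordering, here is a heavily simplified Python sketch. DECKARD is generally described as summarizing subtrees by vectors of node-kind counts and grouping similar vectors; that description, the use of Python's `ast` module, and the names `characteristic_vector` and `l1_distance` are assumptions made for this illustration and do not come from the slides.

```python
import ast
from collections import Counter


def characteristic_vector(source):
    """Count how often each AST node kind occurs in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def l1_distance(u, v):
    """Sum of absolute differences between two count vectors."""
    return sum(abs(u[key] - v[key]) for key in set(u) | set(v))


a = characteristic_vector("x = a * b + c")
b = characteristic_vector("z = c + a * b")
c = characteristic_vector("x = f(a)")

print(l1_distance(a, b))  # 0
print(l1_distance(a, c) > 0)
```

The reordered statement has the same node counts and therefore distance 0, which matches the "structural similarity only" limitation: names and operand order play no role in this sketch.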

CFG matching • Hammock graph (JDIFF, ASE 2004) • Match classes by names • Match fields by types • Match methods by signatures • Match instructions in methods by hammock graphs • A hammock is a single-entry, single-exit subgraph of a CFG.
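
The hammock definition can be checked mechanically. The sketch below assumes one common formalization: every edge entering the node set from outside targets the same entry node, and every edge leaving the set targets the same exit node. The CFG encoding as a dictionary of successor lists and the name `is_hammock` are hypothetical, and a region containing the program's start node (which has no incoming edge) would need special handling.

```python
def is_hammock(cfg, nodes):
    """True if `nodes` is entered through exactly one node and left towards exactly one node."""
    nodes = set(nodes)
    entries = {
        dst
        for src, succs in cfg.items()
        if src not in nodes
        for dst in succs
        if dst in nodes
    }
    exits = {dst for src in nodes for dst in cfg.get(src, []) if dst not in nodes}
    return len(entries) == 1 and len(exits) == 1


# CFG of an if-then-else: "c" is the condition, "j" the join point.
cfg = {
    "entry": ["c"],
    "c": ["t", "f"],
    "t": ["j"],
    "f": ["j"],
    "j": ["exit"],
    "exit": [],
}

print(is_hammock(cfg, {"c", "t", "f", "j"}))  # True
print(is_hammock(cfg, {"t", "j"}))            # False
```

The second set is rejected because control can enter it at two different nodes, `t` (from `c`) and `j` (from `f`).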

CFG matching • Pros • Orthogonal • Can be combined with other matching techniques • Simple • Cons • Coarse grained matching only • Not good at clone detection • In case of code transformation

Semantic Based Matching • Using PDG (SAS’01)

Semantic Based

Semantic Based • Pros • Non-contiguous, intertwined, reordered • Insensitive to code transformations. • Cons • Scalability • Points-to analysis • Starting from a matching pair seems to be a problem

Wrap Up • For clone detection • Maybe structural / text similarity is a good idea • For whole program matching / method matching with code transformations • Semantic based is more appropriate • Scalability • PDG < CFG | AST < STRING < NAME
