Longest common subsequence

Longest common subsequence
• Definition 1: Given a sequence X = x_1 x_2 ... x_m, another sequence Z = z_1 z_2 ... z_k is a subsequence of X if there exists a strictly increasing sequence i_1, i_2, ..., i_k of indices of X such that for all j = 1, 2, ..., k, we have x_{i_j} = z_j.
• Example 1: If X = abcdefg, then Z = abdg is a subsequence of X (take the indices 1, 2, 4, 7).
chapter25

• Definition 2: Given two sequences X and Y, a sequence Z is a common subsequence of X and Y if Z is a subsequence of both X and Y.
• Example 2: X = abcdefg and Y = aaadgfd. Z = adf is a common subsequence of X and Y.
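Definitions 1 and 2 can be checked mechanically. The following is a minimal sketch; the function names `is_subsequence` and `is_common_subsequence` are illustrative and do not come from the slides.

```python
def is_subsequence(z, x):
    """Return True if z is a subsequence of x (Definition 1)."""
    it = iter(x)
    # `ch in it` advances the iterator until it finds ch, so the matched
    # positions in x are strictly increasing, as the definition requires.
    return all(ch in it for ch in z)


def is_common_subsequence(z, x, y):
    """Return True if z is a subsequence of both x and y (Definition 2)."""
    return is_subsequence(z, x) and is_subsequence(z, y)


print(is_subsequence("abdg", "abcdefg"))                   # Example 1
print(is_common_subsequence("adf", "abcdefg", "aaadgfd"))  # Example 2
```

The scan is greedy: matching each letter of Z at its earliest possible position in X never rules out a later match, so one left-to-right pass suffices.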

• Definition 3: A longest common subsequence (LCS) of X and Y is a common subsequence of X and Y of maximum length. (The length of a sequence is the number of letters in the sequence.)
• A longest common subsequence may not be unique.
• Example: X = abcd and Y = acbd. Both acd and abd are LCSs.

Longest common subsequence problem
• Input: Two sequences X = x_1 x_2 ... x_m and Y = y_1 y_2 ... y_n.
• Output: a longest common subsequence of X and Y.
• A brute-force approach: Suppose that m ≤ n. Try every subsequence of X (there are 2^m of them), test whether it is also a subsequence of Y, and select the longest one that passes. Each test takes O(n) time, so the total is O(n 2^m).
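The brute-force approach can be sketched as follows. This is only an illustration of the exponential method, not something to use on long inputs; the name `lcs_brute_force` is hypothetical. Trying candidates longest-first lets the search stop at the first hit.

```python
from itertools import combinations


def is_subsequence(z, x):
    """Return True if z is a subsequence of x."""
    it = iter(x)
    return all(ch in it for ch in z)


def lcs_brute_force(x, y):
    """Enumerate subsequences of the shorter sequence, longest first."""
    if len(x) > len(y):
        x, y = y, x  # make x the shorter one, so that m <= n
    for k in range(len(x), -1, -1):
        # Every choice of k increasing indices gives one subsequence of x.
        for idx in combinations(range(len(x)), k):
            cand = "".join(x[i] for i in idx)
            if is_subsequence(cand, y):
                return cand
    return ""


print(lcs_brute_force("abcd", "acbd"))
```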

LCS: Applications
• Comparing two versions of the source code of the same program.
• The Unix command diff, which compares text files.

Characterizing a longest common subsequence
• Theorem (Optimal substructure of an LCS)
• Let X = x_1 x_2 ... x_m and Y = y_1 y_2 ... y_n be two sequences, and let Z = z_1 z_2 ... z_k be any LCS of X and Y.
• 1. If x_m = y_n, then z_k = x_m = y_n and Z[1..k-1] is an LCS of X[1..m-1] and Y[1..n-1].
• 2. If x_m ≠ y_n, then z_k ≠ x_m implies that Z is an LCS of X[1..m-1] and Y.
• 3. If x_m ≠ y_n, then z_k ≠ y_n implies that Z is an LCS of X and Y[1..n-1].

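The recursive equation
• Let c[i,j] be the length of an LCS of X[1..i] and Y[1..j].
• c[i,j] can be computed as follows:

$$
c[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
c[i-1,j-1] + 1 & \text{if } i,j > 0 \text{ and } x_i = y_j,\\
\max\{c[i,j-1],\ c[i-1,j]\} & \text{if } i,j > 0 \text{ and } x_i \ne y_j.
\end{cases}
$$

Computing the length of an LCS
• There are only nm values c[i,j] with i,j > 0, so we can compute them in a specific order (row by row) in which every entry is filled in after the entries it depends on.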
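The recursive equation translates almost line by line into a memoized function. This is a sketch under the assumption that the sequences are Python strings (0-indexed, so x_i is `x[i - 1]`); the name `lcs_length` is illustrative.

```python
from functools import lru_cache


def lcs_length(x, y):
    """Length of an LCS of x and y, computed top-down from the recurrence."""

    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:          # case x_i = y_j
            return c(i - 1, j - 1) + 1
        return max(c(i, j - 1), c(i - 1, j))  # case x_i != y_j

    return c(len(x), len(y))


print(lcs_length("abcd", "acbd"))
```

Memoization ensures each pair (i, j) is evaluated once, which is the same O(nm) bound as the table-filling order. Recursion depth grows with m + n, so a bottom-up table is the safer choice for long inputs.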

The algorithm to compute an LCS

for i = 1 to m do
    c[i,0] = 0;
for j = 0 to n do
    c[0,j] = 0;
for i = 1 to m do
    for j = 1 to n do
    {
        if x[i] == y[j] then
            c[i,j] = c[i-1,j-1] + 1;
            b[i,j] = 1;
        else if c[i-1,j] >= c[i,j-1] then
            c[i,j] = c[i-1,j];
            b[i,j] = 2;
        else
            c[i,j] = c[i,j-1];
            b[i,j] = 3;
    }

• b[i,j] stores the direction from which c[i,j] was obtained: 1 = diagonal (from c[i-1,j-1]), 2 = up (from c[i-1,j]), 3 = left (from c[i,j-1]).
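A direct transcription of the table-filling algorithm into Python might look like the sketch below. It assumes string inputs and uses row 0 and column 0 of each table for the empty prefixes; the name `lcs_tables` is illustrative.

```python
def lcs_tables(x, y):
    """Fill the length table c and the direction table b for x and y."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: an empty prefix has an empty LCS.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = 1  # diagonal
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = 2  # up
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = 3  # left
    return c, b


c, b = lcs_tables("BDCABA", "ABCBDAB")
print(c[6][7])  # length of an LCS of the two full sequences
```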

Example 1: X = BDCABA and Y = ABCBDAB.

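Constructing an LCS (back-tracking)
• We can find an LCS using the b[i,j]'s.
• We start at b[m,n] and track back until we reach some cell b[0,j] or b[i,0].
• The algorithm to construct an LCS:
  1. i = m;
  2. j = n;
  3. if i == 0 or j == 0 then exit;
  4. if b[i,j] == 1 then { print x_i; i = i-1; j = j-1; }
  5. else if b[i,j] == 2 then i = i-1;
  6. else j = j-1;
  7. Goto Step 3.
• The letters are printed from last to first, so the output must be reversed to obtain the LCS.
• The time complexity: the back-tracking itself takes O(n+m) steps; together with filling the tables, the total is O(nm).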
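The back-tracking steps can be sketched in Python as below. To keep the example self-contained it refills the tables first; the function name `lcs` is illustrative, and the tie-breaking rule (prefer "up" on ties) follows the table-filling algorithm above, so a different rule may return a different but equally long LCS.

```python
def lcs(x, y):
    """Return one LCS of x and y by filling c and b, then back-tracking."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, 1
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], 2
            else:
                c[i][j], b[i][j] = c[i][j - 1], 3

    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if b[i][j] == 1:       # diagonal: x_i belongs to the LCS
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == 2:     # up
            i -= 1
        else:                  # left
            j -= 1
    # Letters were collected from last to first.
    return "".join(reversed(out))


print(lcs("BDCABA", "ABCBDAB"))
```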

Shortest common supersequence
• Definition: Let X and Y be two sequences. A sequence Z is a common supersequence of X and Y if both X and Y are subsequences of Z.
• Shortest common supersequence problem:
  Input: Two sequences X and Y.
  Output: a shortest common supersequence of X and Y.
• Example: X = abc and Y = abb. Both abbc and abcb are shortest common supersequences of X and Y.

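Recursive Equation:
• Let c[i,j] be the length of a shortest common supersequence (SCS) of X[1..i] and Y[1..j].
• c[i,j] can be computed as follows:

$$
c[i,j] =
\begin{cases}
j & \text{if } i = 0,\\
i & \text{if } j = 0,\\
c[i-1,j-1] + 1 & \text{if } i,j > 0 \text{ and } x_i = y_j,\\
\min\{c[i,j-1] + 1,\ c[i-1,j] + 1\} & \text{if } i,j > 0 \text{ and } x_i \ne y_j.
\end{cases}
$$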
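The SCS recurrence can be evaluated bottom-up and then traced back, much like the LCS tables. The sketch below is one possible arrangement, not the slides' own code: it back-tracks by comparing neighboring c values instead of storing a direction table, and the name `scs` is illustrative.

```python
def scs(x, y):
    """Return one shortest common supersequence of x and y."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        c[i][0] = i  # the SCS of X[1..i] and the empty sequence is X[1..i]
    for j in range(n + 1):
        c[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = min(c[i][j - 1], c[i - 1][j]) + 1

    out = []  # collected from last letter to first
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])   # shared letter, emitted once
            i, j = i - 1, j - 1
        elif c[i - 1][j] <= c[i][j - 1]:
            out.append(x[i - 1])   # c[i][j] came from c[i-1][j] + 1
            i -= 1
        else:
            out.append(y[j - 1])   # c[i][j] came from c[i][j-1] + 1
            j -= 1
    # One sequence is exhausted; the rest of the other goes in front.
    out.extend(reversed(x[:i]))
    out.extend(reversed(y[:j]))
    return "".join(reversed(out))


print(scs("abc", "abb"))
```

For X = abc and Y = abb the result has length 4, matching the example on the previous slide; which of the two shortest supersequences is returned depends on the tie-breaking rule.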


Assignment 3: (Due week 13, Monday at 7:30 pm)
Question 1: Write a program to compute the SCS of two sequences. Use s1=abcdabbcabddabcd and s2=abbcabbdacbdadbc as the test input. Back-tracking is required, i.e., the program MUST output the shortest common supersequence, not just the length of the SCS.
Question 2: Write a program to calculate the maximum degree of a node in an undirected graph. (1) Use an adjacency matrix to store the graph; (2) use an adjacency list to store the graph; (3) give the time complexity of the two programs. Which one is better? Why? You can use the graph in slide 22 as the test input.

Part-H1 Graphs
[Figure: flight network on the airports ORD, SFO, LAX, and DFW, with edges labeled by mileage]

Graphs (§ 12.1)
• A graph is a pair (V, E), where
  • V is a set of nodes, called vertices
  • E is a collection of pairs of vertices, called edges
  • Vertices and edges are positions and store elements
• Example:
  • A vertex represents an airport and stores the three-letter airport code
  • An edge represents a flight route between two airports and stores the mileage of the route
[Figure: flight network on the airports PVD, ORD, SFO, LGA, HNL, LAX, DFW, and MIA, with edges labeled by mileage]

Edge Types
• Directed edge
  • ordered pair of vertices (u,v)
  • first vertex u is the origin
  • second vertex v is the destination
  • e.g., a flight
• Undirected edge
  • unordered pair of vertices (u,v)
  • e.g., a flight route
• Directed graph
  • all the edges are directed
  • e.g., flight network
• Undirected graph
  • all the edges are undirected
  • e.g., route network
[Figure: a directed edge from ORD to PVD labeled "flight AA 1206", and an undirected edge between ORD and PVD labeled "849 miles"]

Terminology
• End vertices (or endpoints) of an edge
  • U and V are the endpoints of a
• Edges incident on a vertex
  • a, d, and b are incident on V
• Adjacent vertices
  • U and V are adjacent
• Degree of a vertex
  • X has degree 5
• Parallel edges
  • h and i are parallel edges
• Self-loop
  • j is a self-loop
[Figure: example graph with vertices U, V, W, X, Y, Z and edges a to j]

Terminology (cont.)
• Path
  • sequence of alternating vertices and edges
  • begins with a vertex
  • ends with a vertex
  • each edge is preceded and followed by its endpoints
• Simple path
  • path such that all its vertices and edges are distinct
• Examples
  • P1 = (V,b,X,h,Z) is a simple path
  • P2 = (U,c,W,e,X,g,Y,f,W,d,V) is a path that is not simple
[Figure: the example graph with paths P1 and P2 highlighted]
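The path definitions can be tested on P1 and P2. In the sketch below the edge endpoints are inferred from the terminology and example slides (for instance, "U and V are the endpoints of a" and the sequence P2); the self-loop j is left out because the transcript does not say which vertex it sits on. The names `EDGES`, `is_path`, and `is_simple_path` are illustrative.

```python
# Endpoints inferred from the slides' statements and example paths.
EDGES = {
    "a": ("U", "V"), "b": ("V", "X"), "c": ("U", "W"),
    "d": ("V", "W"), "e": ("W", "X"), "f": ("W", "Y"),
    "g": ("X", "Y"), "h": ("X", "Z"), "i": ("X", "Z"),
}


def is_path(seq):
    """Check that seq alternates vertex, edge, ..., vertex and that each
    edge is preceded and followed by its two endpoints."""
    if len(seq) % 2 == 0:
        return False  # must begin and end with a vertex
    for k in range(1, len(seq), 2):
        edge = seq[k]
        if edge not in EDGES:
            return False
        if {seq[k - 1], seq[k + 1]} != set(EDGES[edge]):
            return False
    return True


def is_simple_path(seq):
    """A path is simple if all its vertices and edges are distinct.
    Vertex and edge names never collide here, so one set suffices."""
    return is_path(seq) and len(set(seq)) == len(seq)


P1 = ["V", "b", "X", "h", "Z"]
P2 = ["U", "c", "W", "e", "X", "g", "Y", "f", "W", "d", "V"]
print(is_simple_path(P1), is_path(P2), is_simple_path(P2))
```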

Terminology (cont.)
• Cycle
  • circular sequence of alternating vertices and edges
  • each edge is preceded and followed by its endpoints
• Simple cycle
  • cycle such that all its vertices and edges are distinct
• Examples
  • C1 = (V,b,X,g,Y,f,W,c,U,a,V) is a simple cycle
  • C2 = (U,c,W,e,X,g,Y,f,W,d,V,a,U) is a cycle that is not simple
[Figure: the example graph with cycles C1 and C2 highlighted]

Adjacency List Structure
• Incidence sequence for each vertex
  • sequence of references to edge objects of incident edges
• Edge objects
  • references to associated positions in incidence sequences of end vertices

Adjacency Matrix Structure
• Augmented vertex objects
  • Integer key (index) associated with vertex
• 2D-array adjacency array
  • Reference to edge object for adjacent vertices
  • "Infinity" for nonadjacent vertices
• An unweighted graph stores 0 for no edge and 1 for an edge
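To make the two representations concrete, here is a simplified sketch for an unweighted, undirected graph without self-loops or parallel edges. It stores neighbor indices directly in place of the edge objects described on the slides, and the four-vertex edge list is made up for illustration.

```python
def build_adjacency_list(n, edges):
    """adj[v] lists the neighbors of v; total space is O(n + m)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def build_adjacency_matrix(n, edges):
    """mat[u][v] is 1 if u and v are adjacent, else 0; space is O(n^2)."""
    mat = [[0] * n for _ in range(n)]
    for u, v in edges:
        mat[u][v] = 1
        mat[v][u] = 1
    return mat


def degree_from_list(adj, v):
    return len(adj[v])   # the list length is the degree


def degree_from_matrix(mat, v):
    return sum(mat[v])   # must scan a whole row of n entries


# Hypothetical graph: vertices 0..3, four edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = build_adjacency_list(4, edges)
mat = build_adjacency_matrix(4, edges)
print(degree_from_list(adj, 2), degree_from_matrix(mat, 2))
```

The matrix answers "are u and v adjacent?" in O(1) time, while the list makes it cheap to enumerate the edges incident on a vertex, which is the operation DFS relies on.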

Part-H2 Depth-First Search
[Figure: example graph on the vertices A, B, C, D, E]

Depth-First Search (§ 12.3.1)
• Depth-first search (DFS) is a general technique for traversing a graph
• A DFS traversal of a graph G
  • Visits all the vertices and edges of G
  • Determines whether G is connected
  • Computes the connected components of G
  • Computes a spanning forest of G
• DFS on a graph with n vertices and m edges takes O(n + m) time
• DFS can be further extended to solve other graph problems
  • Find and report a path between two given vertices
  • Find a cycle in the graph
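As an illustration of two of these uses, the sketch below computes connected components with a recursive DFS and derives connectivity from the component count. The graph is a made-up example given as a dictionary of neighbor lists, and the function name is illustrative.

```python
def connected_components(adj):
    """adj maps each vertex to a list of its neighbors (undirected graph).
    Returns a list of components, each a list of vertices in DFS order."""
    seen = set()

    def dfs(v, comp):
        seen.add(v)
        comp.append(v)
        for w in adj[v]:
            if w not in seen:
                dfs(w, comp)

    components = []
    for s in adj:
        if s not in seen:      # s starts a new component
            comp = []
            dfs(s, comp)
            components.append(comp)
    return components


# Hypothetical graph: A-B, B-C, D-E, and an isolated vertex F.
adj = {
    "A": ["B"], "B": ["A", "C"], "C": ["B"],
    "D": ["E"], "E": ["D"], "F": [],
}
comps = connected_components(adj)
print(comps)
print("connected" if len(comps) == 1 else "not connected")
```

The recursion can exceed Python's default recursion limit on very deep graphs; an explicit stack avoids that at the cost of slightly more bookkeeping.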

DFS Algorithm
• The algorithm uses a mechanism for setting and getting "labels" of vertices and edges

Algorithm DFS(G, v)
  Input: graph G and a start vertex v of G
  Output: labeling of the edges of G in the connected component of v as discovery edges and back edges
  setLabel(v, VISITED)
  for all e ∈ G.incidentEdges(v)
    if getLabel(e) = UNEXPLORED
      w ← opposite(v, e)
      if getLabel(w) = UNEXPLORED
        setLabel(e, DISCOVERY)
        DFS(G, w)
      else
        setLabel(e, BACK)

Algorithm DFS(G)
  Input: graph G
  Output: labeling of the edges of G as discovery edges and back edges
  for all u ∈ G.vertices()
    setLabel(u, UNEXPLORED)
  for all e ∈ G.edges()
    setLabel(e, UNEXPLORED)
  for all v ∈ G.vertices()
    if getLabel(v) = UNEXPLORED
      DFS(G, v)
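The pseudocode can be sketched in Python with dictionaries standing in for the label mechanism. The graph representation (a vertex list plus a dictionary from edge names to endpoint pairs) and the five-vertex example are assumptions made for illustration; they are not taken from the slides' lost figure.

```python
def dfs_label_edges(vertices, edges):
    """edges maps an edge name to its endpoint pair (u, v).
    Returns a dict mapping each edge name to 'DISCOVERY' or 'BACK'."""
    incident = {v: [] for v in vertices}
    for e, (u, v) in edges.items():
        incident[u].append(e)
        incident[v].append(e)

    vlabel = {v: "UNEXPLORED" for v in vertices}
    elabel = {e: "UNEXPLORED" for e in edges}

    def opposite(v, e):
        a, b = edges[e]
        return b if v == a else a

    def dfs(v):
        vlabel[v] = "VISITED"
        for e in incident[v]:
            if elabel[e] == "UNEXPLORED":
                w = opposite(v, e)
                if vlabel[w] == "UNEXPLORED":
                    elabel[e] = "DISCOVERY"
                    dfs(w)
                else:
                    elabel[e] = "BACK"

    for v in vertices:
        if vlabel[v] == "UNEXPLORED":
            dfs(v)
    return elabel


# Hypothetical example graph on five vertices.
vertices = ["A", "B", "C", "D", "E"]
edges = {
    "AB": ("A", "B"), "AD": ("A", "D"), "AE": ("A", "E"),
    "BC": ("B", "C"), "CD": ("C", "D"), "CE": ("C", "E"),
}
labels = dfs_label_edges(vertices, edges)
print(labels)
```

In a connected graph the discovery edges form a spanning tree, so exactly n - 1 of them appear; every remaining edge is a back edge.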

Example
[Figure: DFS on a graph with vertices A, B, C, D, E. Legend: unexplored vertex, visited vertex, unexplored edge, discovery edge, back edge]

Example (cont.)
[Figure: later stages of the DFS on the graph with vertices A, B, C, D, E]

DFS and Maze Traversal
• The DFS algorithm is similar to a classic strategy for exploring a maze
  • We mark each intersection, corner and dead end (vertex) visited
  • We mark each corridor (edge) traversed
  • We keep track of the path back to the entrance (start vertex) by means of a rope (recursion stack)

Analysis of DFS
• Setting/getting a vertex/edge label takes O(1) time
• Each vertex is labeled twice
  • once as UNEXPLORED
  • once as VISITED
• Each edge is labeled twice
  • once as UNEXPLORED
  • once as DISCOVERY or BACK
• Method incidentEdges is called once for each vertex
• DFS runs in O(n + m) time provided the graph is represented by the adjacency list structure
  • Recall that Σ_v deg(v) = 2m
