1 / 28

EE3J2 Data Mining Lecture 14: Sequence Analysis & Dynamic Programming Martin Russell

EE3J2 Data Mining Lecture 14: Sequence Analysis & Dynamic Programming Martin Russell. Objectives. To understand the principal of Dynamic Programming (DP) To understand how DP can be used to compute the distance between two sequences of data To understand what is meant by, and how to compute:

cargan
Télécharger la présentation

EE3J2 Data Mining Lecture 14: Sequence Analysis & Dynamic Programming Martin Russell

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EE3J2 Data MiningLecture 14: Sequence Analysis & Dynamic ProgrammingMartin Russell EE3J2 Data Mining

  2. Objectives • To understand the principal of Dynamic Programming (DP) • To understand how DP can be used to compute the distance between two sequences of data • To understand what is meant by, and how to compute: • An alignment path • The DP recurrence equation • The distance matrix • The accumulated distance matrix • The optimal path EE3J2 Data Mining

  3. Sequences • Sequential data common in real applications: • DNA analysis in bioinformatics and forensic science • Sequences of the letters A, G, C and T • Signature recognition biometrics • Words and text • Spelling and grammar checkers, author verification,.. • Speech, music and audio • Speech/speaker recognition, speech coding and synthesis • Electronic music • Radar signature recognition… EE3J2 Data Mining

  4. Retrieving sequential data • Large corpora of sequential data • E.g: DNA databases • Individual sequences may not be amenable to human interpretation • Need for automated sequential data retrieval • Need to ‘mine’ sequential data • Fundamental requirement is a measure of distance between two sequences EE3J2 Data Mining

  5. Basic definitions • In a typical sequence analysis application we have a basic alphabet consisting of N symbols • Examples: • In conventional text A is the set of letters plus punctuation plus ‘white space’ • In bioinformatics, scientists us {A, B, D, F} to describe DNA sequences EE3J2 Data Mining

  6. Sequences of continuous variables • In some important applications the elements of a discrete sequence are taken from a continuous vector space, rather than a finite set • Sequences of continuous variables can be dealt with in two ways: • Directly • Clustering can be used to represent the space as a set of K centroids, and each vector can be replaced by its closest centroid (Vector Quantisation) EE3J2 Data Mining

  7. Distance between sequences (1) • Consider two sequences of symbols from the alphabet A = {A, B, C, D} • How similar are the sequences: • S1 = ABCD • S2 = ABD • Intuitively S2 is obtained from S1 by deleting C • An alternative explanation is that S2 is obtained from S1 by substituting D for C and then deleting D EE3J2 Data Mining

  8. Distance between sequences (2) • Alternatively S2 was obtained from S1 be deleting ABCD and inserting ABC • … • Why is the first explanation intuitively ‘correct’ and the last one intuitively ‘wrong’? • Because we favour the simplest explanation which involves the minimum number of insertions, deletions and substitutions • Or do we? EE3J2 Data Mining

  9. Distance between sequences (3) • Consider: • S1 = AABC • S2 = SABC • S3 = PABC • S4 = ASCB • If we know that these sequences were typed then S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard • Similarly S4 is close to S2 because letter-swapping (SA AS etc) is a common typographical error EE3J2 Data Mining

  10. A B B C A B C Alignments • The relationship between two sequences can be expressed as an alignment between their elements • E.G: Insertion EE3J2 Data Mining

  11. substitution A X C A B C Alignment: deletion and substitution deletion A C A B C N.B: All edits described relative to vertical string EE3J2 Data Mining

  12. General alignment path A C X C C D A B C D Which alignment is best? EE3J2 Data Mining

  13. The Distance Matrix • Suppose that d(A,B) denotes the distance between the alphabet symbols A and B • Examples: • d(A,B) = 0 if A = B, otherwise d(A,B) = 1 • In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B • In bioinformatics d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B • For continuous valued sequences d could be Euclidean distance EE3J2 Data Mining

  14. Notation • Suppose we have an alphabet: • A distance matrix for A is an N N matrix where is the distance between the mthand nthalphabet symbols EE3J2 Data Mining

  15. The Accumulated Distance • Consider two sequences: • S1 = ABCD • S2 = ACXCCD • For any alignment path p between S1 and S2 we define the accumulated distance between S1 and S2, denoted by ADp(S1,S2), to be the sum over all the nodes of p of the corresponding distances between elements of S1 and S2. EE3J2 Data Mining

  16. A C X C C D A B C D Path p Accumulated distance along p EE3J2 Data Mining

  17. A C X C C D A B C D Accumulated distance (continued) KDEL KINS KINS KINS EE3J2 Data Mining

  18. Optimal path and DP distance • The optimal path is the path with minimum accumulated distance • Formally the optimal path is where: • The DP distance, or accumulated distanceAD(S1,S2) between S1 and S2 is given by: EE3J2 Data Mining

  19. *If KDEL and KINSare not defined you should assume that they are zero Calculating the optimal path • Given • the distance matrix D, • the insertion penalty KINS , and • the deletion penalty KDEL* • How can we compute the optimal path between two (potentially long) sequences S1 and S2? • In a real application the lengths of S1 and S2 might be large EE3J2 Data Mining

  20. A X Y B Dynamic Programming (DP) • The optimal path is calculated using Dynamic Programming (DP) • DP is based on the principle of optimality • Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of: • (i) the best path from X to A plus the cost of going from A to Y • (ii) the best path from X to B plus the cost of going from B to Y EE3J2 Data Mining

  21. A C X C C D A B C D DP – step 1 • Step 1: draw the trellis of all possible paths EE3J2 Data Mining

  22. d(A,A) ad(2,1)=ad(1,1)+d(2,1) DP – forward pass – initialisation (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D EE3J2 Data Mining

  23. (i-1,j) (i-1,j-1) (i,j-1) ad(i,j) • ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j) • It is calculated using the principle of optimality • In addition, the forward path matrix records the local optimal path at each point EE3J2 Data Mining

  24. ad(1,2) = ad(1,1)+d(A,C) DP – forward pass – continued (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D EE3J2 Data Mining

  25. DP – forward pass – continued (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D EE3J2 Data Mining

  26. DP – forward pass – continued (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D EE3J2 Data Mining

  27. DP – forward pass – continued (Forward) accumulated distance matrix A C X C C D (Forward) path matrix A B C D Optimal path obtained by back-tracking through the forward path matrix, starting at the bottom right-hand corner EE3J2 Data Mining

  28. Summary • Introduction to sequence analysis • Definitions of: • distance, • forward accumulated distance, • forward path matrix, • optimal path • Dynamic programming • How to computed the accumulated distance using DP • How to recover the optimal path EE3J2 Data Mining

More Related