Species Identification through DNA String Analysis

Mark Vorster
Supervisor: Prof Philip Machanick

Outline: Research Overview, Bioinformatics, String Matching, Discussion, Questions

Research Overview

Goal
• Aid bioinformaticians in research by providing a tool that can identify similar DNA sequences, in a timely manner, in order to infer homogeneity.

Reasons for problems
• Large data sets
• Days of processing
• No existing specific tools

Bioinformatics

"Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."
Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta

"The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."
Oxford English Dictionary

History of Bioinformatics and Genetics
• 1953 - Watson, Crick, Wilkins and Franklin.
• Discrete abstraction
  Adenine – Thymine
  Guanine – Cytosine

[Figure: DNA double helix, labelled: one helical turn = 3.4 nm, sugar-phosphate backbone, base, hydrogen bonds. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]

Sequence Analysis and Sequence Alignment
• Sequence Alignment
• Global Alignment is expensive
• Assumption: Sequences are already Globally Aligned
• Phylogenetic inference

Alignment Differences (relative to TGAGCACCT)
• Insertion     TGACGCACCT
• Deletion      TGA_CACCT
• Replacement   TGATCACCT

FASTA File Format
• Leading ‘>’
• Sequence Identifier
• Description or comment
• A number of lines of genetic code
• Other Symbols

>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
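The record layout on the FASTA slide can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` is hypothetical, it assumes the identifier is the first whitespace-delimited token after '>', and it does not interpret the "other symbols" the slide mentions.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record.

    Assumption: the identifier is the first token after '>' and the
    rest of the header line is the description or comment.
    """
    identifier = None
    description = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # one line of genetic code
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# The two records shown on the slide.
EXAMPLE = """>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
"""

records = list(parse_fasta(EXAMPLE.splitlines()))
for name, comment, sequence in records:
    print(name, len(sequence))
```

The sequence lines of each record are joined into one string, which is the form the string matching step needs.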

Approximate String Matching Algorithm
• Nested loops are inefficient
• Dynamic Programming
  • Takes into account all previous information
  • Improved to O(n^2), where n is the number of bases in the shorter sequence
• Goal: Find the closest match between two strings, or the minimum number of differences

Approximate String Matching Algorithm

D[i][j] is the minimum of:
• MatchCost = D[i-1][j-1], if p_i = t_j
• ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
• InsertCost = D[i-1][j] + 1
• DeleteCost = D[i][j-1] + 1

Boundary values:
• D[0][j] = 0 and D[i][0] = i
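The recurrence above can be sketched in Python as follows. The function names `approximate_match_table` and `best_match` are hypothetical, and the code assumes row 0 and column 0 stand for the empty prefixes of p and t. With D[0][j] = 0 a match may begin at any position of t, so the smallest value in the last row gives the minimum number of differences.

```python
def approximate_match_table(pattern, text):
    """Fill the table D for pattern p and text t using the slide's costs."""
    m, n = len(pattern), len(text)
    # D[0][j] = 0: a match may start anywhere in the text.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # D[i][0] = i: i pattern symbols matched against nothing.
    for i in range(1, m + 1):
        D[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                diagonal = D[i - 1][j - 1]      # MatchCost
            else:
                diagonal = D[i - 1][j - 1] + 1  # ReviseCost
            insert_cost = D[i - 1][j] + 1       # InsertCost
            delete_cost = D[i][j - 1] + 1       # DeleteCost
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    return D


def best_match(pattern, text):
    """Return (minimum differences, end position in text) of the best match."""
    last_row = approximate_match_table(pattern, text)[len(pattern)]
    distance = min(last_row)
    return distance, last_row.index(distance)


print(best_match("ACG", "TTACGTT"))          # exact occurrence ending at position 5
print(best_match("TGAGCACCT", "TGATCACCT"))  # one replacement
```

The table has (m + 1)(n + 1) cells and each is filled in constant time, which is where the quadratic cost comes from.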

Approximate String Matching Algorithm

[Figure: cell D[i][j], in row p_i and column t_j, with its neighbours D[i-1][j-1], D[i-1][j] and D[i][j-1]]

Worked example (p_i ≠ t_j):
• MatchCost = N/A
• ReviseCost = 3
• InsertCost = 2
• DeleteCost = 4
• -> Min = 2

Approximate String Matching Algorithm
• Changes
  • D[i][0] = i, if p_i = t_0
  • D[i][0] = i + 1, if p_i ≠ t_0
  • D[0][j] = j, if p_0 = t_j
  • D[0][j] = j + 1, if p_0 ≠ t_j
• Additional stop case for mismatch
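One possible reading of these changes is sketched below. Row 0 and column 0 now correspond to the first symbols p_0 and t_0, so the table is anchored at the start of both sequences. The slide does not spell out the stop case; the `max_differences` cutoff here is an assumption, as is the function name `aligned_distance`.

```python
def aligned_distance(p, t, max_differences=None):
    """Differences between p and t using the changed boundary values.

    Assumption: the 'stop case' is modelled as an optional cutoff.
    Row minima never decrease, so once a whole row exceeds the cutoff
    the final value must exceed it too and None is returned early.
    """
    if not p or not t:
        raise ValueError("both sequences must be non-empty")
    m, n = len(p), len(t)
    # Row 0: D[0][j] = j if p_0 = t_j, else j + 1.
    previous = [j if p[0] == t[j] else j + 1 for j in range(n)]
    if max_differences is not None and min(previous) > max_differences:
        return None
    for i in range(1, m):
        # Column 0: D[i][0] = i if p_i = t_0, else i + 1.
        current = [i if p[i] == t[0] else i + 1]
        for j in range(1, n):
            diagonal = previous[j - 1] + (0 if p[i] == t[j] else 1)
            insert_cost = previous[j] + 1
            delete_cost = current[j - 1] + 1
            current.append(min(diagonal, insert_cost, delete_cost))
        if max_differences is not None and min(current) > max_differences:
            return None  # hypothetical stop case: too many mismatches
        previous = current
    return previous[-1]


print(aligned_distance("TGAGCACCT", "TGATCACCT"))        # one replacement
print(aligned_distance("AAAA", "TTTT", max_differences=1))
```

Only two rows are kept in memory, which is enough when just the final count of differences is needed.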

Discussion
• Grouping Algorithm
• Scale of the problem
  • 400 – 800 bases per sequence
  • Tens of thousands of sequences
• Assumptions:
  • Sequences Globally Aligned
  • Sequences Begin at the Same Place

[Figure: Example Grouping]

Results
• Comparisons for n sequences = n(n-1)/2, i.e. O(n^2), where n is the number of sequences.
• ~1600 comparisons per second.
• 10000 sequences: ~8.6 hours (down from 10 days).
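The run time quoted above follows from the pair count and the measured rate, as this small check shows. The function names are hypothetical, and the default rate of 1600 comparisons per second is the figure from the slide.

```python
def pairwise_comparisons(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def estimated_hours(n, comparisons_per_second=1600.0):
    """Estimated time to compare every pair, in hours."""
    seconds = pairwise_comparisons(n) / comparisons_per_second
    return seconds / 3600.0


print(pairwise_comparisons(10000))       # 49995000 comparisons
print(round(estimated_hours(10000), 2))  # 8.68 hours
```

For 10000 sequences this gives just under 50 million comparisons and a little under 8.7 hours, in line with the reported figure.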

Questions?
