Species Identification through DNA String Analysis

Mark Vorster. Supervisor: Prof Philip Machanick.


Presentation Transcript


  1. Mark Vorster. Supervisor: Prof Philip Machanick. Species Identification through DNA String Analysis

  2. Research Overview
     Outline: Research Overview, Bioinformatics, String Matching, Discussion, Questions
     Goal
     • Aid bioinformaticians in their research by providing a tool that identifies similar DNA sequences in a timely manner, so that homogeneity can be inferred.
     Reasons for the problem
     • Large data sets
     • Days of processing
     • No existing tools specific to this task

  3. Bioinformatics
     • "Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data." (Biomedical Information Science and Technology Initiatives Definition Committee, Dr Huerta)
     • "The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics." (Oxford English Dictionary)

  4. History of Bioinformatics and Genetics
     • 1953: Watson, Crick, Wilkins and Franklin.
     • Discrete abstraction: Adenine – Thymine, Guanine – Cytosine
     [Figure: DNA double helix, labelled with the sugar-phosphate backbone, bases, hydrogen bonds, and one helical turn = 3.4 nm. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]

  5. Sequence Analysis and Sequence Alignment
     • Sequence alignment
     • Global alignment is expensive
     • Assumption: sequences are already globally aligned
     Alignment differences, relative to TGAGCACCT:
     • Insertion: TGACGCACCT
     • Deletion: TGA_CACCT
     • Replacement: TGATCACCT
     • Phylogenetic inference

  6. FASTA File Format
     • Leading '>'
     • Sequence identifier
     • Description or comment
     • A number of lines of genetic code
     • Other symbols
     Example:
     >SequenceName description or comment
     CCGGAATACCTAGGAC
     GCCTTCATCCCCCGCC
     GGTCTGTGATGTCCCA
     ATGGACCGGA
     >NextSequence description or comment
     ACGCCTGATTACCTGC
     TAGTCGGGATGATAAC
     CAAGAATTTGTGTCTG
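
The FASTA layout listed on this slide can be read with a few lines of Python. This is a minimal sketch, not the tool described in the talk: the function name `parse_fasta` is hypothetical, and it assumes the identifier is the first whitespace-delimited word after '>' and that everything up to the next '>' line belongs to the same sequence.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its concatenated sequence lines."""
    sequences: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: identifier first, then an optional description.
            header = line[1:].split(maxsplit=1)
            name = header[0] if header else ""
            sequences[name] = ""
        elif name is not None:
            # Sequence data may span several lines; join them.
            sequences[name] += line
    return sequences


example = [
    ">SequenceName description or comment",
    "CCGGAATACCTAGGAC",
    "ATGGACCGGA",
    ">NextSequence description or comment",
    "ACGCCTGATTACCTGC",
]
records = parse_fasta(example)
print(sorted(records))
```

The description after the identifier is discarded here; a fuller reader would keep it and would also validate the sequence symbols.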

  7. Approximate String Matching Algorithm
     • Nested loops are inefficient
     • Dynamic programming
     • Takes all previous information into account
     • Improved to O(n^2), where n is the number of bases in the shorter sequence
     • Goal: find the closest match between two strings, or the minimum number of differences

  8. Approximate String Matching Algorithm
     D[i][j] is the minimum of:
     • MatchCost = D[i-1][j-1], if p_i = t_j
     • ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
     • InsertCost = D[i-1][j] + 1
     • DeleteCost = D[i][j-1] + 1
     Boundary conditions: D[0][j] = 0 and D[i][0] = i
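
The recurrence on this slide can be sketched directly in Python. The function name `approx_match_cost` is hypothetical, and the sketch assumes a table with one extra row and column for the empty prefixes, so that `pattern[i-1]` plays the role of p_i and `text[j-1]` the role of t_j. Because D[0][j] = 0, a match may begin anywhere in the text, and the smallest value in the last row is the minimum number of differences.

```python
def approx_match_cost(pattern: str, text: str) -> int:
    """Fewest differences between pattern and any substring of text."""
    m, n = len(pattern), len(text)
    # D[0][j] = 0: a match may start at any position in the text.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # D[i][0] = i: matching i pattern bases against nothing costs i.
    for i in range(1, m + 1):
        D[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                diagonal = D[i - 1][j - 1]       # MatchCost
            else:
                diagonal = D[i - 1][j - 1] + 1   # ReviseCost
            insert_cost = D[i - 1][j] + 1        # InsertCost
            delete_cost = D[i][j - 1] + 1        # DeleteCost
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    # The best match may end at any text position.
    return min(D[m])


print(approx_match_cost("TGATCACCT", "AAATGAGCACCTAAA"))
```

The full table is kept for readability; only the previous row is needed to compute the next one, which reduces memory to O(n).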

  9. Approximate String Matching Algorithm

  10. Approximate String Matching Algorithm
      [Figure: the cell for pattern base p_i (row i) and text base t_j (column j), with its neighbours D[i-1][j-1], D[i-1][j] and D[i][j-1]]
      • MatchCost = D[i-1][j-1], if p_i = t_j
      • ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
      • InsertCost = D[i-1][j] + 1
      • DeleteCost = D[i][j-1] + 1
      Worked example for one cell:
      • MatchCost = N/A
      • ReviseCost = 3
      • InsertCost = 2
      • DeleteCost = 4
      • Min = 2

  11. Approximate String Matching Algorithm

  12. Approximate String Matching Algorithm
      Changes:
      • D[i][0] = i, if p_i = t_0
      • D[i][0] = i + 1, if p_i ≠ t_0
      • D[0][j] = j, if p_0 = t_j
      • D[0][j] = j + 1, if p_0 ≠ t_j
      • Additional stop case for mismatch
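
One way to read these changes is that both sequences are now anchored at their first base, so leading gaps are charged rather than free. The sketch below illustrates that idea with the textbook Levenshtein table, which is a swapped-in formulation and not the slide's exact boundary conditions: it uses an extra empty-prefix row and column (D[i][0] = i, D[0][j] = j) and is therefore indexed differently from the slide. The name `anchored_distance` and the `max_diff` cutoff are hypothetical; the cutoff is only one possible interpretation of the "additional stop case".

```python
from typing import Optional


def anchored_distance(p: str, t: str,
                      max_diff: Optional[int] = None) -> Optional[int]:
    """Edit distance between p and t, both starting at the same place.

    Returns None when the distance is known to exceed max_diff.
    """
    previous = list(range(len(t) + 1))  # D[0][j] = j
    for i in range(1, len(p) + 1):
        current = [i]  # D[i][0] = i
        for j in range(1, len(t) + 1):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            current.append(min(previous[j - 1] + cost,  # match / revise
                               previous[j] + 1,         # insert
                               current[j - 1] + 1))     # delete
        # Row minima never decrease, so once every cell in a row exceeds
        # the cutoff the final distance must exceed it as well.
        if max_diff is not None and min(current) > max_diff:
            return None
        previous = current
    distance = previous[-1]
    if max_diff is not None and distance > max_diff:
        return None
    return distance


print(anchored_distance("TGAGCACCT", "TGACGCACCT"))
```

Stopping early matters at this scale: most pairs of sequences from different groups can be rejected after a few rows instead of filling the whole table.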

  13. Approximate String Matching Algorithm

  14. Discussion
      • Grouping algorithm
      • Scale of the problem
        • 400 – 800 bases per sequence
        • Tens of thousands of sequences
      • Assumptions:
        • Sequences are globally aligned
        • Sequences begin at the same place

  15. Example Grouping

  16. Results
      • Comparisons for n sequences = n(n-1)/2, which is O(n^2), where n is the number of sequences
      • ~1600 comparisons per second
      • 10000 sequences: ~8.6 hours (down from 10 days)
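
The arithmetic behind these figures can be checked with a short calculation. The function names are hypothetical, and the rate of 1600 comparisons per second is taken from the slide as a constant, which assumes every comparison costs about the same.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def estimated_hours(n: int, comparisons_per_second: float = 1600.0) -> float:
    """Estimated run time in hours at a fixed comparison rate."""
    return pairwise_comparisons(n) / comparisons_per_second / 3600


print(pairwise_comparisons(10000))        # 49995000
print(round(estimated_hours(10000), 2))   # 8.68
```

For 10000 sequences this gives 49,995,000 comparisons and roughly 8.7 hours, in line with the figure quoted on the slide.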

  17. Questions?
