
Sequencing, Sequence Alignment & Software

Sequencing, Sequence Alignment & Software. Lushan Wang, Shandong University. Objectives. Understand how DNA sequence data is collected and prepared. Be aware of the importance of sequence searching and sequence alignment in biology and medicine.





Presentation Transcript


  1. Sequencing, Sequence Alignment & Software Lushan Wang, Shandong University

  2. Objectives • Understand how DNA sequence data is collected and prepared • Be aware of the importance of sequence searching and sequence alignment in biology and medicine • Be familiar with the different algorithms and scoring schemes used in sequence searching and sequence alignment

  3. 30,000

  4. Shotgun Sequencing • Isolate Chromosome • Shear DNA into Fragments • Clone into Seq. Vectors • Sequence

  5. Principles of DNA Sequencing • Primer • DNA fragment • Vector pBR322 (Amp, Tet, Ori) • Denature with heat to produce ssDNA • Klenow + ddNTP + dNTP + primers

  6. The Secret to Sanger Sequencing

  7. Principles of DNA Sequencing • Template (3’ to 5’): G C A T G C, read from a 5’ primer • Four reactions, each containing dATP, dCTP, dGTP and dTTP plus one chain terminator: ddCTP, ddTTP, ddATP or ddGTP • Terminated products: GddC and GCATGddC (ddCTP); GCAddT (ddTTP); GCddA (ddATP); ddG and GCATddG (ddGTP)

  8. Principles of DNA Sequencing • Gel lanes: G, T, C, A • Fragments separate by length, from short to long • Reading the bands from short to long gives G C A T G C
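The chain-termination idea on the two slides above can be sketched in a few lines of Python. This is only an illustration, not a model of the chemistry: the function names `sanger_fragments` and `read_gel` are made up for this example, and the code simply records at which lengths each ddNTP would stop the chain for the slide's sequence GCATGC, then reads the "bands" back from shortest to longest.

```python
def sanger_fragments(read):
    """Map each base to the lengths of the chains terminated by its ddNTP."""
    lengths = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(read, start=1):
        # A chain stopped by dd<base> at this position has length `position`.
        lengths[base].append(position)
    return lengths


def read_gel(lengths):
    """Recover the sequence by walking the bands from shortest to longest."""
    bands = sorted((n, base) for base, ns in lengths.items() for n in ns)
    return "".join(base for _, base in bands)


fragments = sanger_fragments("GCATGC")
sequence = read_gel(fragments)
print(fragments)
print(sequence)
```

The ddC lane holds fragments of length 2 and 6, matching the slide's GddC and GCATGddC products.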

  9. Capillary Electrophoresis Separation by Electro-osmotic Flow

  10. Multiplexed CE with Fluorescent detection ABI 377, 3700 96x700 bases

  11. Shotgun Sequencing • Sequence Chromatogram • Send to Computer • Assembled Sequence

  12. Shotgun Sequencing • Very efficient process for small-scale (~10 kb) sequencing (preferred method) • First applied to whole genome sequencing in 1995 (H. influenzae) • Now standard for all prokaryotic genome sequencing projects • Successfully applied to D. melanogaster • Moderately successful for H. sapiens
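The "Send to Computer" and "Assembled Sequence" steps amount to merging overlapping reads. The sketch below is a toy greedy merger, assuming error-free reads and exact suffix/prefix overlaps; it is not the method used by real assemblers, and the reads and the names `overlap`, `greedy_assemble` and `min_len` are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads


# Hypothetical reads, not taken from the slides.
contigs = greedy_assemble(["CTGAAGTT", "ATGCGTAC", "GTACCTGA"])
print(contigs)
```

Greedy merging can go wrong on repeats, which is one reason whole-genome shotgun assembly is harder for large genomes than for small ones.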

  13. The Finished Product GATTACAGATTACAGATTACAGATTACAGATTACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA TTACAGATTACAGATTACAGATTACAGATTACAGAT TACAGATTAGAGATTACAGATTACAGATTACAGATT ACAGATTACAGATTACAGATTACAGATTACAGATTA CAGATTACAGATTACAGATTACAGATTACAGATTAC AGATTACAGATTACAGATTACAGATTACAGATTACA GATTACAGATTACAGATTACAGATTACAGATTACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA TTACAGATTACAGATTACAGATTACAGATTACAGAT

  14. Sequencing Successes • T7 bacteriophage, completed in 1983: 39,937 bp, 59 coded proteins • Escherichia coli, completed in 1997: 4,639,221 bp, 4293 ORFs • Saccharomyces cerevisiae, completed in 1996: 12,069,252 bp, 5800 genes

  15. Sequencing Successes • Caenorhabditis elegans, completed in 1998: 95,078,296 bp, 19,099 genes • Drosophila melanogaster, completed in 2000: 116,117,226 bp, 13,601 genes • Homo sapiens, completed in 2003: 3,201,762,515 bp, 31,780 genes

  16. Genomes to Date • 8 vertebrates (human, mouse, rat, fugu, zebrafish) • 3 plants (Arabidopsis, rice, poplar) • 2 insects (fruit fly, mosquito) • 2 nematodes (C. elegans, C. briggsae) • 1 sea squirt • 4 parasites (Plasmodium, Guillardia) • 4 fungi (S. cerevisiae, S. pombe) • 200+ bacteria and archaebacteria • 2000+ viruses

  17. So what do we do with all this sequence data?

  18. Sequence Alignment

  19. Alignments tell us about... • Function or activity of a new gene/protein • Structure or shape of a new protein • Location or preferred location of a protein • Stability of a gene or protein • Origin of a gene or protein • Origin or phylogeny of an organelle • Origin or phylogeny of an organism

  20. Factoid: Sequence comparisons lie at the heart of all bioinformatics

  21. Similarity versus Homology • Similarity refers to the likeness or % identity between 2 sequences • Similarity means sharing a statistically significant number of bases or amino acids • Similarity does not imply homology • Homology refers to shared ancestry • Two sequences are homologous if they are derived from a common ancestral sequence • Homology usually implies similarity

  22. Similarity versus Homology • Similarity can be quantified • It is correct to say that two sequences are X% identical • It is correct to say that two sequences have a similarity score of Z • It is generally incorrect to say that two sequences are X% similar

  23. Similarity versus Homology • Homology cannot be quantified • If two sequences have a high % identity it is OK to say they are homologous • It is incorrect to say two sequences have a homology score of Z • It is incorrect to say two sequences are X% homologous

  24. Homologues & All That • Homologue (or Homolog) • Protein/gene that shares a common ancestor and which has good sequence and/or structure similarity to another (general term) • Paralogue (or Paralog) • A homologue which arose through gene duplication in the same species/chromosome • Orthologue (or Ortholog) • A homologue which arose through speciation (found in different species)

  25. Sequence Complexity MCDEFGHIKLAN…. High Complexity ACTGTCACTGAT…. Mid Complexity NNNNTTTTTNNN…. Low Complexity Translate those DNA sequences!!!

  26. Assessing Sequence Similarity • Two Character Strings: THESTORYOFGENESIS / THISBOOKONGENETICS • Character Comparison: THESTORYOFGENESI-S / THISBOOKONGENETICS (11 matching positions marked *) • Context Comparison: THE STORY OF GENESIS / THIS BOOK ON GENETICS
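The character comparison on this slide can be turned into a % identity figure, the quantity the earlier slides say it is correct to report. The sketch below counts matching columns in the slide's gapped strings; the helper name `percent_identity` is made up, and dividing by the full alignment length (gap columns included) is one convention among several, chosen here as an assumption.

```python
def percent_identity(a, b):
    """Return (matches, % identity) for two aligned strings of equal length.

    Assumption: gap columns ("-") count as mismatches and the denominator
    is the full alignment length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches, 100.0 * matches / len(a)


matches, identity = percent_identity("THESTORYOFGENESI-S", "THISBOOKONGENETICS")
print(matches, round(identity, 1))
```

The two strings share 11 of 18 columns, the same 11 positions the slide marks with asterisks.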

  27. Assessing Sequence Similarity: is this alignment significant? • Rbn KETAAAKFERQHMD / Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT • Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA / Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN • Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY / Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR • Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV / Lsz NRCKGTDVQA WIRGCRL

  28. Is This Alignment Significant?

  29. Some Simple Rules • If two sequences are > 100 residues and > 25% identical, they are likely related • If two sequences are 15-25% identical they may be related, but more tests are needed • If two sequences are < 15% identical they are probably not related • If you need more than 1 gap for every 20 residues the alignment is suspicious
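These rules of thumb are easy to encode as a quick screening function. The sketch below is one possible reading, not an established tool: the name `rule_of_thumb` and its arguments are hypothetical, and cases the slide does not cover (for example > 25% identity over 100 residues or fewer) are assumed to fall into the "more tests are needed" category.

```python
def rule_of_thumb(length, identity_pct, gaps=0):
    """Apply the slide's simple rules to an alignment summary.

    length: number of aligned residues; identity_pct: % identity;
    gaps: number of gaps opened in the alignment.
    """
    notes = []
    if gaps * 20 > length:
        notes.append("suspicious: more than 1 gap per 20 residues")
    if identity_pct < 15:
        verdict = "probably not related"
    elif identity_pct > 25 and length > 100:
        verdict = "likely related"
    else:
        # Assumption: anything the slide leaves open is treated as uncertain.
        verdict = "may be related, more tests needed"
    return verdict, notes


print(rule_of_thumb(150, 30))
print(rule_of_thumb(150, 20))
print(rule_of_thumb(100, 30, gaps=6))
```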

  30. Twilight Zone Doolittle’s Rules of Thumb

  31. Sequence Alignment - Methods • Dot Plots • Dynamic Programming • Heuristic (Fast) Local Alignment • Multiple Sequence Alignment • Contig Assembly

  32. Dot Plots

  33. Dot Plots • “Invented” in 1970 by Gibbs & McIntyre • Good for quick graphical overview • Simplest method for sequence comparison • Inter-sequence comparison • Intra-sequence comparison • Identifies internal repeats • Identifies domains or “modules”

  34. Dot Plot Algorithm • Take two sequences (A & B), write sequence A out as a row (length=m) and sequence B as a column (length =n) • Create a table or “matrix” of “m” columns and “n” rows • Compare each letter of sequence A with every letter in sequence B. If there’s a match mark it with a dot, if not, leave blank

  35. Dot Plot Algorithm • Example: sequence A C D E F G H G plotted against itself (A C D E F G H G)
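The dot plot algorithm described above fits in a few lines. The sketch below builds the matrix for the slide's example, ACDEFGHG against itself; the names `dot_plot` and `render` are made up, and the text rendering uses "*" for a dot and "." for a blank cell purely so the grid stays readable in plain text.

```python
def dot_plot(seq_a, seq_b):
    """Return n rows (one per letter of seq_b) by m columns (one per letter of seq_a).

    A cell is True where the two letters match.
    """
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text: '*' marks a match, '.' stands for a blank cell."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


plot = dot_plot("ACDEFGHG", "ACDEFGHG")
print(render("ACDEFGHG", "ACDEFGHG"))
```

The main diagonal is fully dotted because the sequence is compared with itself, and the two off-diagonal dots come from the repeated G.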

  36. Dot Plots & Internal Repeats

  37. Dot Plots • Most commercial programs offer pretty good dot plot programs including: • GCG/Omiga/DS gene (Accelrys Inc.) • PepTool (BioTools Inc.) • LaserGene (DNAStar) • Popular freeware package is Dotter www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html • Dotlet http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html • JDotter http://athena.bioc.uvic.ca/sars/jdotter/main.php

  38. Dynamic Programming • Columns: G E N E T I C S; rows: G E N E S I S • First matrix, by row: G: 10 0 0 0 0 0 0 0; E: 0 10 0 10 0 0 0 0; N: 0 0 10 0 0 0 0 0; E: 0 0 0 10 0 0 0 0; S: 0 0 0 0 0 0 0 10; I: 0 0 0 0 0 10 0 0; S: 0 0 0 0 0 0 0 10 • Second matrix, by row: G: 60 40 30 20 20 0 10 0; E: 40 50 30 30 20 0 10 0; N: 30 30 40 20 20 0 10 0; E: 20 20 20 30 20 10 10 0; S: 20 20 20 20 20 0 10 10; I: 10 10 10 10 10 20 10 0; S: 0 0 0 0 0 0 0 10

  39. Dynamic Programming • Developed by Needleman & Wunsch (1970) • Refined by Smith & Waterman (1981) • Ideal for quantitative assessment • Guaranteed to be mathematically optimal • Slow O(N²) algorithm • Performed in 2 stages • Prepare a scoring matrix using recursive function • Scan matrix diagonally using traceback protocol

  40. Identity Scoring Matrix (Sij)

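  41. The dynamic programming algorithm • Seq1 residues x_{i-1}, x_i run along one axis and Seq2 residues y_{j-1}, y_j along the other • Cell F(i,j) is reached from F(i-1,j-1) by adding s(x_i, y_j), or from F(i-1,j) or F(i,j-1) by subtracting the gap penalty d • $F(i,j) = \max\{F(i-1,j-1) + s(x_i, y_j),\ F(i-1,j) - d,\ F(i,j-1) - d\}$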
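The three moves on this slide (diagonal with s(xi, yj), or a step from F(i-1,j) or F(i,j-1) costing d) can be sketched as a small global alignment routine in the Needleman & Wunsch style. This is a minimal illustration under stated assumptions: a linear gap penalty `d`, simple `match`/`mismatch` scores in place of a full scoring matrix, and a traceback that returns just one of possibly several optimal alignments. The function and parameter names are invented for this example.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, d=1):
    """Global alignment score plus one optimal alignment (linear gap penalty d)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i  # leading gaps in y
    for j in range(1, m + 1):
        F[0][j] = -d * j  # leading gaps in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback from the bottom-right corner to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("AATVD", "AVVD")
print(score)
print(top)
print(bottom)
```

Filling the (n+1) x (m+1) table is what makes both time and memory grow as N x M.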

  42. The Recursive Function $$S_{ij} = s_{ij} + \max\left\{ S_{i-1,j-1},\ \max_{2 \le x \le i}\left(S_{i-x,j-1} + w_{x-1}\right),\ \max_{2 \le y \le j}\left(S_{i-1,j-y} + w_{y-1}\right) \right\}$$ where w = gap penalty and S = alignment score

  43. A Simple Example... • Columns: A A T V D; rows: A V V D • Matrix after the first two rows: A: 1 1 0 0 0; V: 0 1 1 2 1 • Completed matrix: A: 1 1 0 0 0; V: 0 1 1 2 1; V: 0 1 1 2 2; D: 0 1 1 1 3 • Alignments from the traceback: A A T V D over A V - V D; A A T V D over A - V V D; A A T V D over A V V D
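The matrix in this example can be reproduced with a direct implementation of the recursive function from slide 42. The sketch below makes several assumptions that the slides do not state outright: identity scoring (1 for a match, 0 otherwise), a single gap penalty `w` applied regardless of gap length and set to 0 here, and cells with no predecessor starting from 0. The names `score_matrix` and `identity` are made up.

```python
def identity(a, b):
    """Identity scoring: 1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0


def score_matrix(rows, cols, s=identity, w=0):
    """Fill S[i][j] = s(i, j) + best predecessor score.

    Predecessors are the diagonal cell, any earlier cell in the previous row,
    or any earlier cell in the previous column (the last two add penalty w).
    """
    S = [[0] * len(cols) for _ in rows]
    for i, a in enumerate(rows):
        for j, b in enumerate(cols):
            candidates = [0]  # assumption: cells without a predecessor start at 0
            if i > 0 and j > 0:
                candidates.append(S[i - 1][j - 1])
                candidates += [S[i - 1][k] + w for k in range(j - 1)]
                candidates += [S[k][j - 1] + w for k in range(i - 1)]
            S[i][j] = s(a, b) + max(candidates)
    return S


S = score_matrix("AVVD", "AATVD")
for residue, row in zip("AVVD", S):
    print(residue, row)
```

The bottom-right value, 3, is the number of matched residues in each of the alignments the slide shows.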

  44. Could We Do Better? • Key to the performance of Dynamic Programming is the scoring function • Dynamic Programming always gives the mathematically correct answer • Dynamic Programming does not always give the biologically correct answer • The weakest link -- The Scoring Matrix

  45. Scoring Matrices • An empirical model of evolution, biology and chemistry all wrapped up in a 20 X 20 table of integers • Structurally or chemically similar residues should ideally have high diagonal or off-diagonal numbers • Structurally or chemically dissimilar residues should ideally have low diagonal or off-diagonal numbers

  46. A Better Matrix - PAM250

  47. Using PAM250... • Scores (lower triangle, columns A T V D): A: 2; T: 1 3; V: 0 0 4; D: 0 0 -2 4 • Gap Penalty = -1 • Columns: A A T V D; rows: A V V D • Matrix after the first two rows: A: 2 1 0 -1 -1; V: -1 2 1 5 -1 • Completed matrix: A: 2 1 0 -1 -1; V: -1 2 1 5 -1; V: -1 1 2 5 3; D: -1 1 1 0 9 • Alignment from the traceback: A A T V D over A V - V D

  48. PAM Matrices • Developed by M.O. Dayhoff (1978) • PAM = Point Accepted Mutation • Matrix assembled by looking at patterns of substitutions in closely related proteins • 1 PAM corresponds to 1 amino acid change per 100 residues • 1 PAM = 1% divergence or 1 million years in evolutionary history

  49. Dynamic Programming • Great for doing pairwise global alignments • Produces a quantitative alignment “score” • Problems if one tries to do alignments with very large sequences (memory requirement grows as N² or as N x M) • Serious problems if one tries to align one sequence against a database (tens of hours) • Need an alternative…..

  50. Fast Local Alignment Methods ACDEAGHNKLM... KKDEFGHPKLM... SCDEFCHLKLM... MCDEFGHNKLV... ACDEFGHIKLM... QCDEFGHAKLM... AQQQFGHIKLPI... WCDEFGHLKLM... SMDEFAHVKLM... ACDEFGFKKLM...
