Alignment, Part I

Vasileios Hatzivassiloglou, University of Texas at Dallas


Presentation Transcript


  1. Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas

  2. Other databases • NCBI BLAST • Basic Local Alignment Search Tool • Multiple programs for sequence searching and comparisons • Gene Expression Omnibus (GEO) • maintained by NCBI • contains output of gene expression experiments

  3. Links • GenBank (http://www.ncbi.nlm.nih.gov/GenBank/) • ExPASy (http://www.expasy.org/) • SwissProt (http://www.expasy.org/sprot/) • GO (http://www.geneontology.org/) • PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez) • MeSH browser (http://www.nlm.nih.gov/mesh/MBrowser.html) • NCBI Blast (http://blast.ncbi.nlm.nih.gov/Blast.cgi) • NCBI GEO (http://www.ncbi.nlm.nih.gov/projects/geo/) • Human Protein Atlas (http://www.proteinatlas.org/)

  4. Assignment • Search the above databases for information on a gene/protein of your choice • Briefly report your findings (90 seconds) next Tuesday, September 30 • Examples: interleukin-N (e.g., 3), elastase, thrombin, creatine kinase, myosin-N (e.g., 2)

  5. Sequences • Sequences of symbols central to bioinformatics • DNA • RNA • proteins • Fixed alphabet (size 4 for DNA/RNA, 20 for proteins)

  6. Sequence similarity • Important for many biological problems • Examples • Similar primary structure in proteins implies similar form and function • Similar sequences in genes / proteins imply homologues across organisms • Similar short sequences lead to motif finding • Similarities between gene regions can be used for phylogenetic classification

  7. How to measure similarity • Given two sequences S and T, we look into ways to derive T from S using elementary operations • Substitution (change a letter) • Deletion (remove a letter) • Insertion (add a letter) • The process is reversible (S→T and T→S) • There are many ways to derive T from S, some obviously more efficient than others
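The three elementary operations can be sketched as small string functions. This is only an illustration: the function names, the 0-based indexing, and the example string are assumptions, not part of the slides.

```python
def substitute(s, i, ch):
    """Replace the character at 0-based index i with ch."""
    return s[:i] + ch + s[i + 1:]


def delete(s, i):
    """Remove the character at 0-based index i."""
    return s[:i] + s[i + 1:]


def insert(s, i, ch):
    """Insert ch so that it ends up at 0-based index i."""
    return s[:i] + ch + s[i:]


# Derive T from S with one substitution and one insertion ...
s = "GCAT"
t = insert(substitute(s, 1, "G"), 4, "A")   # GCAT -> GGAT -> GGATA

# ... and go back from T to S with the inverse operations:
# a deletion undoes the insertion, a substitution undoes the substitution.
back = substitute(delete(t, 4), 1, "C")     # GGATA -> GGAT -> GCAT
```

Each operation has an inverse (insertion and deletion undo each other, and a substitution is undone by another substitution), which is why the process is reversible.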

  8. Edit distance • Each elementary operation is assigned a cost • Overall cost is the sum of the costs for each operation taken (linear model) • The edit distance between two strings is the minimum total cost among all possible sequences of operations that transform S into T
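As a hedged illustration of this definition, the sketch below computes the edit distance with the standard dynamic-programming recurrence (often called the Wagner-Fischer algorithm). The slides up to this point do not describe this algorithm; it is swapped in here only to make the definition concrete. The default costs of 1 per operation and the parameter names are assumptions.

```python
def edit_distance(s, t, sub_cost=1, indel_cost=1):
    """Minimum total cost of substitutions, insertions and deletions
    that transform s into t (linear cost model)."""
    # prev[j] = cost of transforming the current prefix of s into t[:j];
    # initially the prefix of s is empty, so only insertions are needed.
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel_cost]  # s[:i] -> empty string: i deletions
        for j in range(1, len(t) + 1):
            same = s[i - 1] == t[j - 1]
            curr.append(min(
                prev[j - 1] + (0 if same else sub_cost),  # keep or substitute
                prev[j] + indel_cost,                     # delete s[i-1]
                curr[j - 1] + indel_cost,                 # insert t[j-1]
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3 with unit costs
```

Changing `sub_cost` or `indel_cost` changes which sequence of operations is cheapest, which is the sense in which the cost assignment defines the distance.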

  9. Alignment • An equivalent way to measure edit distance is to align the two sequences • An alignment extends the sequences S and T into S′ and T′ using the same alphabet plus “-” (the space character), and matches S′[i] with T′[i]

  10. Definitions • A string is a finite sequence of characters from a finite alphabet Σ • The length of a string S, denoted |S|, is the number of characters it contains (can be 0) • S[i] is the i-th character of S • A subsequence of a string S is the string formed by omitting a number of characters from S (order of characters does not change)
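The subsequence definition can be checked mechanically. The following is a minimal sketch; the function name `is_subsequence` is an assumption chosen for illustration.

```python
def is_subsequence(sub, s):
    """True if sub can be formed by omitting characters from s
    without changing the order of the remaining characters."""
    remaining = iter(s)
    # `ch in remaining` consumes the iterator up to and including the first
    # occurrence of ch, so later characters must be found further right.
    return all(ch in remaining for ch in sub)


print(is_subsequence("GAT", "GCGCAT"))  # True: G...A.T in order
print(is_subsequence("TAG", "GCGCAT"))  # False: no A or G after the T
print(len(""))                          # 0: the empty string has length 0
```

Note that a subsequence need not be contiguous, and the empty string is a subsequence of every string.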

  11. Defining alignment formally • An alignment is the mapping of two strings S and T over alphabet Σ into strings S′ and T′ where • The alphabet of S′ and T′ is Σ plus “-” • S is a subsequence of S′. All characters in S′ not in this subsequence must be “-”. • T is a subsequence of T′. All characters in T′ not in this subsequence must be “-”. • |S′| = |T′| • There is no i for which S′[i] = T′[i] = “-”
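The conditions of the formal definition translate almost directly into a validity check. This is a sketch under the assumption that “-” is not itself a letter of Σ; the function name `is_valid_alignment` is hypothetical.

```python
def is_valid_alignment(s, t, s_prime, t_prime, gap="-"):
    """Check the conditions of the formal definition of an alignment."""
    if gap in s or gap in t:
        raise ValueError("the gap symbol must not occur in the original strings")
    return (
        # S is a subsequence of S' and every other character of S' is a gap:
        # equivalent to "removing the gaps from S' gives back S".
        s_prime.replace(gap, "") == s
        and t_prime.replace(gap, "") == t
        # |S'| = |T'|
        and len(s_prime) == len(t_prime)
        # no position where both strings have a gap
        and all(not (a == gap and b == gap) for a, b in zip(s_prime, t_prime))
    )


print(is_valid_alignment("GC", "GAC", "G-C", "GAC"))  # True
print(is_valid_alignment("GC", "GC", "G-C", "G-C"))   # False: gap over gap
```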

  12. Example alignment
      Sequences:
      • GCGCATGGATTGAGCGA
      • TGCGCCATTGATGACCA
      A possible alignment:
        -GCGC-ATGGATTGAGCGA
        TGCGCCATTGAT-GACC-A

  13. Alignment operations
        -GCGC-ATGGATTGAGCGA
        TGCGCCATTGAT-GACC-A
      Three elements: • Perfect matches • Mismatches • Insertions & deletions (indels)
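To make the three kinds of columns visible, the sketch below labels each position of an alignment. The marker characters ("|" for a match, "x" for a mismatch, "." for an indel) and the function name are arbitrary choices made for this illustration.

```python
def annotate(s_prime, t_prime, gap="-"):
    """Return one marker per alignment column:
    '|' perfect match, 'x' mismatch, '.' insertion or deletion."""
    if len(s_prime) != len(t_prime):
        raise ValueError("aligned strings must have equal length")
    marks = []
    for a, b in zip(s_prime, t_prime):
        if a == gap or b == gap:
            marks.append(".")
        elif a == b:
            marks.append("|")
        else:
            marks.append("x")
    return "".join(marks)


top = "-GCGC-ATGGATTGAGCGA"
bottom = "TGCGCCATTGAT-GACC-A"
print(top)
print(annotate(top, bottom))
print(bottom)
```

Printing the marker line between the two rows gives a quick visual check of where the sequences agree.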

  14. Alignments are not unique • For example, compare:
        -GCGC-ATGGATTGAGCGA
        TGCGCCATTGAT-GACC-A
      to
        ------GCGCATGGATTGAGCGA
        TGCGCC----ATTGATGACCA--

  15. Measuring alignment quality • For each position i in the alignment, calculate the scoring function σ(S′[i], T′[i]) • The scoring function depends only on the symbols S′[i] and T′[i], not on the position i • A very simple scoring function might be • σ(x, x) = +1 for x a letter • σ(x, y) = −2 for x, y different letters • σ(x, -) = σ(-, x) = −1 for an indel
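The very simple scoring function from this slide can be written as follows. The name `sigma` and the keyword parameters are assumptions; the default values are the ones given on the slide.

```python
GAP = "-"


def sigma(x, y, match=1, mismatch=-2, indel=-1):
    """Score one alignment column; depends only on the two symbols."""
    if x == GAP and y == GAP:
        # Excluded by the definition of an alignment.
        raise ValueError("a column with two gaps is not allowed")
    if x == GAP or y == GAP:
        return indel
    return match if x == y else mismatch


print(sigma("A", "A"))  # 1
print(sigma("A", "G"))  # -2
print(sigma("A", "-"))  # -1
```

Because `sigma` takes only the two symbols as arguments, it cannot depend on the position in the alignment, matching the second bullet of the slide.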

  16. Overall alignment score • Defined as the sum of the applicable values of the scoring function • As with our definition of edit distance, this is a linear model

  17. Scoring functions • Usually based on how similar the two symbols are • Derived from confusion probabilities • In biology, chemically similar amino-acids have lower penalties for substitution • In speech recognition, “p”→ “b” costs less than “p”→ “r” • Cost of indels depends on application

  18. Comparing alignments
        -GCGC-ATGGATTGAGCGA
        TGCGCCATTGAT-GACC-A
      4 indels, 13 matches, 2 mismatches; score: +5
        ------GCGCATGGATTGAGCGA
        TGCGCC----ATTGATGACCA--
      12 indels, 5 matches, 6 mismatches; score: −19
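The counts and scores on this slide can be reproduced with a short script. It assumes the simple scoring function of slide 15 (+1 match, −2 mismatch, −1 indel) and sums the column scores as in the linear model of slide 16; the function name `summarize` is hypothetical.

```python
GAP = "-"


def summarize(s_prime, t_prime, match=1, mismatch=-2, indel=-1):
    """Count matches, mismatches and indels, and return (counts, score)."""
    if len(s_prime) != len(t_prime):
        raise ValueError("aligned strings must have equal length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(s_prime, t_prime):
        if a == GAP or b == GAP:
            counts["indel"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    # Linear model: the overall score is the sum of the column scores.
    score = (counts["match"] * match
             + counts["mismatch"] * mismatch
             + counts["indel"] * indel)
    return counts, score


first = summarize("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = summarize("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(first)   # 13 matches, 2 mismatches, 4 indels, score 5
print(second)  # 5 matches, 6 mismatches, 12 indels, score -19
```

For the first alignment this is 13·(+1) + 2·(−2) + 4·(−1) = 5, and for the second 5·(+1) + 6·(−2) + 12·(−1) = −19.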

  19. Optimal alignment • An alignment which maximizes the overall alignment score is called optimal • Often, there is more than one optimal alignment for two strings • depends on sophistication of scoring function • The optimal alignment score can be used as a similarity value

  20. Finding the optimal alignment • Simple algorithm: Construct all possible alignments, score them, and pick the best • How many alignments are there for two strings of length n and m?
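A sketch of the simple algorithm, together with a way to count the alignments, is given below. It relies on the formal definition from slide 11: every column is either (letter, letter), (letter, “-”) or (“-”, letter), so the number of alignments A(n, m) satisfies A(n, m) = A(n−1, m) + A(n, m−1) + A(n−1, m−1) with A(n, 0) = A(0, m) = 1 (these values are known as Delannoy numbers). The function names and the scoring defaults (taken from slide 15) are assumptions, and the brute-force search is only practical for very short strings.

```python
GAP = "-"


def count_alignments(n, m):
    """Number of alignments of two strings of lengths n and m."""
    row = [1] * (m + 1)              # A(0, j) = 1
    for _ in range(n):
        new = [1]                    # A(i, 0) = 1
        for j in range(1, m + 1):
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[m]


def all_alignments(s, t):
    """Yield every valid (s_prime, t_prime) pair, column by column."""
    if not s and not t:
        yield "", ""
        return
    if s and t:                      # letter over letter
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:                            # letter over gap
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, GAP + b
    if t:                            # gap over letter
        for a, b in all_alignments(s, t[1:]):
            yield GAP + a, t[0] + b


def score(s_prime, t_prime, match=1, mismatch=-2, indel=-1):
    """Sum of column scores under the simple scoring function."""
    total = 0
    for a, b in zip(s_prime, t_prime):
        if a == GAP or b == GAP:
            total += indel
        else:
            total += match if a == b else mismatch
    return total


def best_alignment(s, t):
    """Construct all alignments, score them, and pick the best."""
    return max(all_alignments(s, t), key=lambda pair: score(*pair))


print(count_alignments(2, 2))                 # 13
print(len(list(all_alignments("GC", "GA"))))  # 13, agrees with the count
print(best_alignment("GCAT", "GAT"))
```

The count grows very quickly with n and m, which is why enumerating all alignments does not scale beyond toy inputs.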
