
Searching Biological Sequence Databases

Using BLAST to Search Sequence Databases. Cédric Notredame. Outline: Evolution and Sequence Similarity; The Inside of BLAST; Using BLAST; Adapting BLAST to Your Needs; Searching Protein Domains with BLAST.


Presentation Transcript


  1. Using BLAST to Search Sequence Databases (Searching Biological Sequence Databases). Cédric Notredame

  2. Outline: Evolution and Sequence Similarity; The Inside of BLAST; Using BLAST; Adapting BLAST to Your Needs; Searching Protein Domains with BLAST; Digging Genomes

  3. Two Minutes of the Evolutionary Clock…

  4. An Alignment is a STORY. Starting sequence: ADKPKRPLSAYMLWLN. Mutations + Selection give: ADKPKRPKPRLSAYMLWLN and ADKPRRPLS-YMLWLN

  5. An Alignment is a STORY. Starting sequence: ADKPKRPLSAYMLWLN. Mutations + Selection give: ADKPKRPKPRLSAYMLWLN and ADKPRRPLS-YMLWLN. Resulting alignment, showing a Deletion, an Insertion and a Mutation: ADKPRRP---LS-YMLWLN / ADKPKRPKPRLSAYMLWLN

  6. How Do Sequences Evolve? In a structure, each Amino Acid plays a Special Role. In the core, SIZE MATTERS. On the surface, CHARGE MATTERS. [Figure: OmpR, C-terminal domain, with charged surface residues marked + and -]

  7. Why Does It Make Sense To Align Sequences? Same Sequence, Same Origin, Same Function, Same 3D Fold

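  8. How Can We Compare Sequences? The Twilight Zone. Similar Sequence: Similar Structure. Different Sequence: Structure ???? [Plot: % sequence identity against alignment length; at about 30% identity over a length of 100 the fold is the same (Same 3D Fold); below that lies the Twilight Zone]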

  9. Different molecular clocks for different proteins--another prediction

  10. A few Basic Definitions

  11. A few Definitions. Query: your sequence. Subject: the database against which you search. Heuristic: an algorithm that does not guarantee the optimal solution.

  12. Other Important Definitions. Identity: proportion of IDENTICAL residues between two sequences. Depends on the alignment. Unit: the % id. Similarity: proportion of SIMILAR residues. Two residues are similar if their substitution cost is higher than 0. Depends on the matrix. Unit: the % similarity. Homology: sequences SIMILAR enough are sometimes HOMOLOGOUS. HOMOLOGY means COMMON ANCESTOR. Unit: Yes or No! DIFFERENT sequences can also be homologous.
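The identity and similarity definitions above can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned, counts over all alignment columns (one of several possible denominators), and uses a toy scoring function invented for this example in place of a real substitution matrix.

```python
# Toy substitution scores, assumed for illustration only (not a real matrix).
# K/R is treated as a conservative substitution with a positive score.
TOY_POSITIVE_PAIRS = {("K", "R"): 2}


def toy_score(a, b):
    """Return a toy substitution score: positive means 'similar'."""
    if a == b:
        return 4
    return TOY_POSITIVE_PAIRS.get((a, b), TOY_POSITIVE_PAIRS.get((b, a), -1))


def percent_identity_and_similarity(aligned_a, aligned_b):
    """Compute (% identity, % similarity) over all alignment columns.

    Columns containing a gap ('-') count in the denominator but never
    count as identical or similar.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = 0
    similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        if a == b:
            identical += 1
        if toy_score(a, b) > 0:
            similar += 1
    columns = len(aligned_a)
    return 100.0 * identical / columns, 100.0 * similar / columns


# Two of the sequences from the "An Alignment is a STORY" slide.
identity, similarity = percent_identity_and_similarity(
    "ADKPKRPLSAYMLWLN", "ADKPRRPLS-YMLWLN"
)
print(f"{identity:.2f}% id, {similarity:.2f}% similarity")
```

Changing the scoring function changes the similarity but not the identity, which is the sense in which similarity "depends on the matrix".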

  13. More Important Definitions. Hit: a sequence that matches your sequence and is reported by BLAST. E-Value (expectation value): how many times would you expect to find a hit by chance only? Depends on the alignment. Depends on the matrix. Depends on the database. Sensitive to low complexity regions. Unit: must be lower than 0.0001 to mean something.
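As a small illustration of the 0.0001 rule of thumb above, the sketch below keeps only the hits whose E-value is under that threshold. The `Hit` record and the hit identifiers are hypothetical and are not a BLAST output format.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject_id: str  # hypothetical identifier of the database sequence
    e_value: float   # E-value as reported for the hit


def significant_hits(hits, max_e_value=1e-4):
    """Return the hits with E-value below max_e_value, best (lowest) first."""
    kept = [hit for hit in hits if hit.e_value < max_e_value]
    return sorted(kept, key=lambda hit: hit.e_value)


# Made-up hit list for the example.
hits = [Hit("seqA", 1e-20), Hit("seqB", 0.01), Hit("seqC", 3.0), Hit("seqD", 1e-100)]
kept = significant_hits(hits)
for hit in kept:
    print(hit.subject_id, hit.e_value)
```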

  14. A Good Hit Is Something You Would Not Expect by Chance

  15. What is BLAST ?

  16. BLAST: Basic Local Alignment Search Tool. BLAST is a program designed for RAPIDLY comparing your sequence with every sequence in a database and REPORTING the most SIMILAR sequences.

  17. Database Search: 1-Query, 2-Comparison Engine (LOCAL Alignment), 3-Database, 4-Statistical Evaluation (E-Value). PROBLEM: LOCAL ALIGNMENT (SW) TOO SLOW
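To see why exact local alignment is slow, here is a minimal sketch of the Smith-Waterman score computation. It assumes a simple match/mismatch scheme and a linear gap penalty (values chosen for the example) in place of a substitution matrix. The nested loops visit every cell of a query-by-subject table, and that work has to be repeated for each database sequence.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses dynamic programming over a len(a) x len(b) table, keeping only
    two rows in memory. Scores are floored at 0, which is what makes the
    alignment local.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution, # align a[i-1] with b[j-1]
                previous[j] + gap,              # gap in b
                current[j - 1] + gap,           # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ADKPKRPLSAYMLWLN", "ADKPRRPLSYMLWLN"))
```

The cost grows with the product of the two sequence lengths, which is the step BLAST avoids for most of the database.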

  18. Database Search [Figure: hits found for query Q by Smith-Waterman (SW) and by BLAST, with their scores and E-values]

  19. BLAST: Basic Local Alignment Search Tool. BLAST is a Heuristic Smith and Waterman. BLAST = 3 STEPS. 1-Decide who will be compared. This is where BLAST SAVES TIME. This is where it LOSES HITS. Most BLAST parameters refer to this step.

  20. BLAST: Basic Local Alignment Search Tool. BLAST is a Heuristic Smith and Waterman. BLAST = 3 STEPS. 1-Decide who will be compared. 2-Check the most promising Hits. 3-Compute the E-value of the most interesting Hits.

  21. Heuristic Algorithms: A Bit of History. Smith and Waterman: exact local dynamic programming, 1981. FASTA: Lipman and Pearson, 1985; looks for similar words (k-tup) on the same diagonal; compares the sequences one by one. BLAST: Altschul et al., 1990; the most widely cited tool in Biology. www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html

  22. The Inside of BLAST

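  23. Inside BLAST Step 1: finding the worthy words. Query word: REL. List of all the 3AA words that can be found in the database: AAA, AAC, AAD, ..., YYY. Each word is scored against the query and falls either above T (score > T) or below T (score < T); RSL and LKP are shown as examples. Words with a score > T: ACT, RSL, TVF, ...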
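Step 1 can be sketched in a few lines of Python. The scoring function below is a toy stand-in (identical residues, residues in the same group, everything else) invented for this example; real BLAST uses a substitution matrix. The threshold value and the function names are likewise assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy residue groups, assumed for illustration only (not a real matrix).
GROUPS = ["DE", "KRH", "ILVM", "FWY", "ST", "NQ"]


def toy_score(a, b):
    """Score a residue pair: 5 if identical, 2 if in the same group, else -1."""
    if a == b:
        return 5
    if any(a in group and b in group for group in GROUPS):
        return 2
    return -1


def word_score(word_a, word_b):
    """Sum the per-position scores of two words of equal length."""
    return sum(toy_score(a, b) for a, b in zip(word_a, word_b))


def neighborhood_words(query, word_size=3, threshold=8):
    """Map each query position to the set of words scoring > threshold there."""
    all_words = ["".join(p) for p in product(AMINO_ACIDS, repeat=word_size)]
    neighborhood = {}
    for position in range(len(query) - word_size + 1):
        query_word = query[position:position + word_size]
        neighborhood[position] = {
            word for word in all_words if word_score(query_word, word) > threshold
        }
    return neighborhood


words = neighborhood_words("REL")[0]
print(len(words), "RSL" in words, "AAA" in words)
```

Raising the threshold shrinks the word list, which speeds up the search at the cost of missing more distant matches.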

  24. Inside BLAST Step 2: Eliminate the database sequences that do not contain any interesting word. Take the list of «interesting» words (score > T): ACT, RSL, TVF, ... and look for them in the sequences within the database. Sequences containing interesting words are the Hits.
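A minimal sketch of Step 2, assuming the interesting words are held in a Python set and the database is a small dictionary of made-up sequences. Real implementations use precomputed indexes, not a linear scan like this one.

```python
def find_word_hits(words, database, word_size=3):
    """Return {name: [(position, word), ...]} for sequences containing
    at least one interesting word; other sequences are dropped."""
    hits = {}
    for name, sequence in database.items():
        found = []
        for position in range(len(sequence) - word_size + 1):
            word = sequence[position:position + word_size]
            if word in words:
                found.append((position, word))
        if found:
            hits[name] = found
    return hits


# Words taken from the slide; the database sequences are invented.
interesting_words = {"ACT", "RSL", "TVF"}
database = {
    "seq1": "MKACTGGRSLPP",
    "seq2": "GGGGGGGG",
    "seq3": "PPTVFAA",
}
word_hits = find_word_hits(interesting_words, database)
print(word_hits)
```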

  25. Inside BLAST: the end. Step 3: Extension of the Hits. 2 "Hits" on the same diagonal distant by less than X are extended by limited Dynamic Programming. [Figure: Query against Database sequence, showing the two hits and the extension]
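The "two hits on the same diagonal" test of Step 3 can be sketched as follows. A word hit is represented as a (query position, subject position) pair, and its diagonal is the difference between the two. The parameter `max_distance` plays the role of X on the slide; its default value, and the requirement that the two hits do not overlap, are assumptions of this sketch. The extension by dynamic programming itself is not shown.

```python
from collections import defaultdict


def two_hit_seeds(word_hits, max_distance=40, word_size=3):
    """Find pairs of neighbouring word hits lying on the same diagonal.

    word_hits: iterable of (query_pos, subject_pos) pairs.
    Returns a list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for query_pos, subject_pos in word_hits:
        by_diagonal[subject_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in sorted(by_diagonal.items()):
        positions.sort()
        # Compare each hit with the next one on the same diagonal.
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            if word_size <= distance < max_distance:
                seeds.append((diagonal, first, second))
    return seeds


# Made-up word hits: three on diagonal 10, one on diagonal 47.
example_hits = [(0, 10), (8, 18), (3, 50), (60, 70)]
seeds = two_hit_seeds(example_hits)
print(seeds)
```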

  26. The Statistics in BLAST

  27. BLAST Statistics: Raw Score. Evaluation of the score. Raw score: sum of the substitution scores and gap penalties. Not very informative.

  28. BLAST Statistics: P-Values. Derived statistics. P-value: probability of finding an alignment with such a score by chance. The lower, the better.

  29. BLAST Statistics: P-Values. Just as the sum of a large number of independent identically distributed (i.i.d.) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution. [Figure: extreme value distribution (Gumbel) compared with the normal distribution]

  30. BLAST Statistics: P-Values. P-Value: probability that a random alignment obtains a score greater than or equal to X. K must be calibrated with the database composition. Lambda is calibrated with the matrix being used.
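The formula on this slide did not survive in the transcript. The sketch below uses the standard Karlin-Altschul form, P(S >= x) = 1 - exp(-K m n e^(-lambda x)), where m is the query length and n the database length. The numerical values of K and lambda in the example are illustrative assumptions, not values taken from the slides.

```python
import math


def p_value(score, query_len, db_len, lam, k):
    """Probability that a random alignment scores at least `score`.

    Uses P = 1 - exp(-E) with E = K * m * n * exp(-lambda * score).
    """
    expected = k * query_len * db_len * math.exp(-lam * score)
    # -expm1(-x) computes 1 - exp(-x) accurately for very small x.
    return -math.expm1(-expected)


# Illustrative parameters (assumptions for the example).
for raw_score in (40, 80, 120):
    print(raw_score, p_value(raw_score, query_len=250, db_len=1e8, lam=0.267, k=0.041))
```

Because the score sits in an exponent, a modest gain in score lowers the P-value by several orders of magnitude.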

  31. BLAST Statistics: E-Values. Derived statistics. E-value: number of alignments expected by chance. The lower, the better: <0.00001. For values lower than 0.0001, E-Value ~ P-Value. The E-Values are easier to compare than P-Values.
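The statement that E-Value ~ P-Value for small values can be checked numerically. The sketch assumes the usual relation P = 1 - exp(-E), which follows from treating the number of chance hits as Poisson distributed; the slides do not spell this relation out.

```python
import math


def p_from_e(e_value):
    """Convert an E-value to a P-value, assuming P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


for e_value in (10.0, 1.0, 1e-2, 1e-4, 1e-10):
    p = p_from_e(e_value)
    print(f"E = {e_value:g}  P = {p:.6g}  relative difference = {(e_value - p) / e_value:.3g}")
```

For large E the two diverge: a P-value can never exceed 1, while an E-value can, which is one reason E-values are easier to compare.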

  32. BLAST Statistics: Bit-Score. Bit score: evaluates the amount of information in the alignment. Makes it possible to compare alignments.
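A short sketch of how a bit score is usually derived from a raw score, assuming the standard normalisation S' = (lambda S - ln K) / ln 2, so that E = m n 2^(-S'). The formulas are not shown in the transcript, and the parameter values are illustrative assumptions.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw score using the scoring-system parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, query_len, db_len):
    """E-value from a bit score: only the search space size is needed."""
    return query_len * db_len * 2.0 ** (-bits)


def e_value_from_raw(raw_score, query_len, db_len, lam, k):
    """E-value from a raw score: needs lambda and K as well."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


bits = bit_score(100, lam=0.267, k=0.041)
print(bits, e_value_from_bits(bits, 250, 1e8), e_value_from_raw(100, 250, 1e8, 0.267, 0.041))
```

The two E-values agree, which shows why bit scores can be compared across searches run with different matrices: lambda and K are already folded in.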

  33. BLAST Statistics: Booby Trap! The E-Value depends on N, the Database size. If N increases, some Hits can be lost
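The booby trap can be made concrete with a small sketch. It assumes the E-value is directly proportional to the database size for a fixed alignment score (as in E = K m n e^(-lambda S)) and ignores finer corrections. The database sizes and the starting E-value are made up for the example.

```python
def rescale_e_value(e_value, old_db_size, new_db_size):
    """Rescale an E-value for the same alignment in a database of another size."""
    return e_value * new_db_size / old_db_size


THRESHOLD = 1e-4  # the rule-of-thumb cutoff used earlier in the slides

small_db_e = 1e-5  # hypothetical E-value against a small database
large_db_e = rescale_e_value(small_db_e, old_db_size=3e6, new_db_size=3e9)

print(small_db_e < THRESHOLD)  # hit kept in the small database
print(large_db_e < THRESHOLD)  # same alignment, lost in the large database
```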

  34. P31383 vs YEAST; P31383 vs UniProt

  35. The Many Flavors of BLAST

  36. Database Against Database: « Farm-Blast » Genome 1 Genome 2 Ideal for finding Orthologues

  37. The Classics: 1 Sequence vs a Sequence Db

  38. The Many Flavors of BLAST
      Program    Query                                  Database
      blastp     protein                                protein
      blastn     nucleotide                             nucleotide
      blastx     nucleotide (translated into protein)   protein
      tblastn    protein                                nucleotide (translated into protein)
      tblastx    nucleotide (translated into protein)   nucleotide (translated into protein)

  39. The Many Flavors of BLAST
      Program      Query       Database
      Psi-blast    protein     protein
      RPS-blast    protein     Domain
      DART-blast   protein     protein
      mega-blast   Large DNA   DNA
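The classic program table can be turned into a small lookup. This is a sketch: the function name and the "protein"/"nucleotide" labels are choices made for the example, while the program names are the standard BLAST flavors listed on the slide.

```python
# Classic BLAST flavors keyed by (query type, database type).
BLAST_PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],  # tblastx translates both sides
    ("nucleotide", "protein"): ["blastx"],                # query is translated
    ("protein", "nucleotide"): ["tblastn"],               # database is translated
}


def choose_programs(query_type, database_type):
    """Return the classic BLAST programs usable for this query/database pair."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unknown combination: {query_type!r} query, {database_type!r} database"
        ) from None


print(choose_programs("protein", "nucleotide"))
print(choose_programs("nucleotide", "nucleotide"))
```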

  40. If your Sequence is a Protein

  41. If your Sequence is made of DNA

  42. BLASTing with DNA: Asking the right question.

  43. Keeping an Eye on the Public Servers.

  44. Using BLAST: The Basic Way

  45. Database Search. Result = Prediction: Protein X IS or IS NOT homologous to the QUERY.

  46. Submitting your Query

  47. Understanding the BLAST Output: Graphic Display, Hit List, Alignments
