
GROUP MEMBERS: MUHAMMAD KHAIRULANWAR IZZAT BIN HUSSIN AC100076


genica

Presentation Transcript


  1. GROUP MEMBERS: MUHAMMAD KHAIRULANWAR IZZAT BIN HUSSIN AC100076 MURNIYANTI BINTI MALIK AC100078 NG SHEE TING AC100079 SCHEE XIN LIN AC100086 AW MEI YEE AC100062

  2. INTRODUCTION @Ng Shee Ting

  3. INTRODUCTION(cont..) @Ng Shee Ting

  4. INTRODUCTION(cont..) @Ng Shee Ting

  5. PURPOSES @Ng Shee Ting

  6. @Ng Shee Ting

  7. WHAT IS PSSM?? @Ng Shee Ting

  8. PSSM CONT.. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. *Note: A profile is a table of observed frequencies of amino acids (or nucleotides) at each position in a multiple alignment. @Ng Shee Ting

  9. PSSM CONT.. • The PSI-BLAST PSSM is derived from local alignments. • Only positions present in the query sequence are used. • If the query has L positions (length L), the PSSM also has L positions, giving a 20 × L matrix (one row per amino acid). @Ng Shee Ting

  10. Basic concept of the calculation (this example counts nucleotides) @Izzat

  11. CALCULATION cont… Row Column (Positions) @Izzat

  12. CALCULATION cont… Refer back to Table A. Shading indicates the fraction of occurrences of that base at that position: red (1.0), orange (0.8), yellow (0.6). @Izzat

  13. CALCULATION cont… (c) The background frequencies used to calculate the scores are A = T = 0.32; C = G = 0.18. Table 1D was calculated with the default scoring system used by the Gibbs Sampler. @Izzat

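  14. CALCULATION cont… • In the example shown in Table 1D, the score for an adenine in position one is calculated as: • Score(position 1, A) = [3 + √5 × 0.32] / [5 + √5] = 0.51, where 3 is the number of sequences with A at position 1, 5 is the number of sequences, √5 is the pseudocount weight and 0.32 is the background frequency of A. @Izzat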

  15. CALCULATION cont… • Score(position 1, A) = [3 + 0.1 × 0.32] / [5 + 0.1] = 0.59 (c) The background frequencies used to calculate the scores are A = T = 0.32; C = G = 0.18. Table 1E used the default scoring system of MEME. @Izzat
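The two scores on slides 14 and 15 share one form and differ only in the pseudocount weight (√5 on slide 14, 0.1 on slide 15). A minimal sketch, assuming that shared form; the function name `weighted_frequency` is made up for this illustration:

```python
import math

def weighted_frequency(count, n_seqs, background, weight):
    """(count + weight * background) / (n_seqs + weight)."""
    return (count + weight * background) / (n_seqs + weight)

# Values from the slides: A counted 3 times in 5 sequences, background 0.32.
gibbs_style = weighted_frequency(3, 5, 0.32, math.sqrt(5))  # weight = sqrt(N)
meme_style = weighted_frequency(3, 5, 0.32, 0.1)            # weight = 0.1

print(round(gibbs_style, 2), round(meme_style, 2))  # 0.51 0.59
```

A larger weight pulls the score further toward the background frequency, which is why the √5 version gives the lower value.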

  16. CALCULATION cont… (d) Each element of the table is equal to the negative log10 of the corresponding element of Table 1E (that is, -log10). @Izzat
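As an illustration of the conversion, the sketch below applies the negative log10 to the value 0.59 worked out for adenine at position one. The helper name is an assumption for this example only.

```python
import math

def neg_log10(probability):
    """Negative base-10 logarithm of a probability-like score."""
    return -math.log10(probability)

converted = neg_log10(0.59)
print(round(converted, 3))
```

Smaller probabilities map to larger positive values, so in the converted table a low number marks a well-represented base.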

  17. EXAMPLE 20 × L matrix [Figure: example matrix with columns running from position 1 to position 15 (L positions); annotation: Y appears twice in this position] @Izzat

  18. PSSM CALCULATION Column 1: frequency(A, 1) = 0 / 5 = 0; frequency(G, 1) = 5 / 5 = 1; ... Column 2: frequency(A, 2) = 0 / 5 = 0; frequency(H, 2) = 5 / 5 = 1; ... Column 15: frequency(A, 15) = 2 / 5 = 0.4; frequency(C, 15) = 1 / 5 = 0.2; ... Some frequencies are equal to 0 because the multiple alignment contains only a small number of sequences. A frequency of 0 would exclude the amino acid concerned from that position altogether. @Izzat
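The per-column frequencies can be sketched as a simple count over one alignment column. The column string below is a hypothetical stand-in for column 1 (G in all five sequences), since the alignment itself is only shown as an image.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(column):
    """Observed frequency of each amino acid in one alignment column."""
    counts = Counter(column)
    n_seqs = len(column)
    return {aa: counts.get(aa, 0) / n_seqs for aa in AMINO_ACIDS}

column_1 = "GGGGG"  # hypothetical column: G in all five sequences
freqs = column_frequencies(column_1)
print(freqs["G"], freqs["A"])  # 1.0 0.0
```

Every amino acid that never appears in the column gets exactly 0, which is the problem the pseudo-count addresses.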

  19. CONT.. One way around this is to add a small value to every observed count. This small added value for unobserved residues is called a "pseudo-count". In the previous example, with a pseudo-count of 1: Column 1: f'(A, 1) = (0 + 1) / (5 + 20) = 0.04; f'(G, 1) = (5 + 1) / (5 + 20) = 0.24; ... Column 2: f'(A, 2) = (0 + 1) / (5 + 20) = 0.04; f'(H, 2) = (5 + 1) / (5 + 20) = 0.24; ... Column 15: f'(A, 15) = (2 + 1) / (5 + 20) = 0.12; f'(C, 15) = (1 + 1) / (5 + 20) = 0.08; ... @Izzat
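A short sketch of the corrected frequency, assuming a pseudo-count of 1 for each of the 20 amino acids as in the example above (the function name is illustrative):

```python
def corrected_frequency(count, n_seqs, pseudocount=1, alphabet_size=20):
    """f' = (count + pseudocount) / (n_seqs + pseudocount * alphabet_size)."""
    return (count + pseudocount) / (n_seqs + pseudocount * alphabet_size)

# Counts 0, 1, 2 and 5 out of 5 sequences, as used on the slide.
for count in (0, 1, 2, 5):
    print(count, corrected_frequency(count, 5))

# A column with one residue seen 5 times and 19 residues never seen.
column_total = corrected_frequency(5, 5) + 19 * corrected_frequency(0, 5)
```

The denominator grows by 20 because one pseudo-count is added for every possible amino acid, so the corrected frequencies in a column still sum to 1.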

  20. Table of fully calculated f' values

  21. PSSM CONT.. The frequency of each amino acid observed at each position is compared with the frequency at which that amino acid is expected in a random sequence. It is assumed that every amino acid occurs with the same frequency in a random sequence. Score_ij = log(f'_ij / q_i), where: - Score_ij is the score for residue i at position j; - f'_ij is the relative frequency of residue i at position j, corrected by the pseudo-count; - q_i is the relative frequency expected for residue i in a random sequence. @Izzat
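The log-odds score can be sketched as below. Two assumptions are made here: the base of the logarithm is not stated on this slide, so base 10 is used as the default because slide 24 inverts the formula with 10^Score; and q_i is set to 1/20, following the statement that all amino acids are equally frequent in a random sequence.

```python
import math

def log_odds(f_corrected, q, base=10):
    """Score_ij = log(f'_ij / q_i) in the given base (base 10 is an assumption)."""
    return math.log(f_corrected / q, base)

q_uniform = 1 / 20  # assumed: every amino acid equally likely at random

over = log_odds(0.24, q_uniform)   # residue seen in all 5 sequences
under = log_odds(0.04, q_uniform)  # residue never seen (pseudo-count only)
print(round(over, 3), round(under, 3))
```

Whatever the base, a residue that is more frequent than expected scores above zero and one that is rarer than expected scores below zero.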

  22. Fully calculated PSSM scores (Score_ij) @Izzat

  23. Exercise: The fully calculated scores and f' values are given in the tables above. From them you can calculate q_i using the formula Score_ij = log(f'_ij / q_i). @Izzat

  24. Solution • Rearranging the formula gives q_i = f'_ij / 10^(Score_ij). - For any value of -0.2 in the table, q_i = 0.0634 - For any value of 2.3 in the table, q_i = 1.203 × 10^(-3) - For any value of 0.7 in the table, q_i = 0.015 - For any value of 1.3 in the table, q_i = 6.014 × 10^(-3) @Izzat
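The rearranged formula can be checked numerically. The pairing of each score with an f' value below is an assumption inferred from the pseudo-count example (0.04, 0.08, 0.12 and 0.24 for counts of 0, 1, 2 and 5), since the tables themselves are not reproduced in the transcript.

```python
def background_from_score(f_corrected, score):
    """q_i = f'_ij / 10**Score_ij, the formula rearranged for q_i."""
    return f_corrected / 10 ** score

# (score, assumed f') pairs; the pairing is inferred, not given on the slide.
pairs = [(-0.2, 0.04), (2.3, 0.24), (0.7, 0.08), (1.3, 0.12)]
results = {score: background_from_score(f, score) for score, f in pairs}
for score, q in results.items():
    print(score, f"{q:.4g}")
```

Under that pairing the scores -0.2, 2.3 and 1.3 reproduce the values listed on the slide, and 0.7 gives roughly 0.016, close to the listed 0.015.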

  25. Why PSSM? The PSSM is a matrix used for biological sequence data, and its main role in a PSI-BLAST search is to increase the sensitivity of the results. The PSSM (profile) is used to search the database again for new matches, and it is updated in each subsequent iteration with the newly detected sequences. This iterative search strategy results in increased sensitivity. @Izzat

  26. E Value? • An abbreviated term for "Expected Value" or "Expectation Value". • A parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. • The E value applies to the longest run of matches in an alignment of length L. @Schee Xin Lin

  27. E Value cont.. The E value decreases exponentially with the score (S) that is assigned to a match between two sequences. Shorter sequences have a high probability of occurring in the database purely by chance. @Schee Xin Lin

  28. E Value cont For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0" the more "significant" the match is. @Schee Xin Lin

  29. EQUATION E = K·m·n·e^(-λS) • This is the equation for calculating the E value. • m: the length of the query sequence • n: the length of the database sequence • S: the score • K and λ are constants that depend on the scoring system. @Schee Xin Lin

  30. Example of calculation • λ = 0.219 • K = 0.082 • S = 103 • m = 100 • n = 2 × 10^8 • λS = 0.219 × 103 = 22.6 • e^(-λS) = 1.6 × 10^(-10) • E = K·m·n·e^(-λS) = 0.082 × 100 × 2 × 10^8 × 1.6 × 10^(-10) = 0.2624 @Schee Xin Lin
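The worked example can be reproduced in a few lines. This is a sketch of the E-value equation from slide 29 with the constants listed above; the function and argument names are chosen for this illustration.

```python
import math

def expected_hits(k, m, n, lam, score):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)

e_value = expected_hits(k=0.082, m=100, n=2e8, lam=0.219, score=103)
print(round(e_value, 3))
```

Without rounding the intermediate steps the result is about 0.262; the slide's 0.2624 comes from rounding e^(-λS) to 1.6 × 10^(-10) before multiplying.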

  31. In a typical current database search, a protein of length 250 might be compared to a protein database of 50 000 000 total residues. @Schee Xin Lin

  32. • Doubling the length of either sequence will double the expected number of HSPs. • Doubling the score S will reduce the expected number of HSPs exponentially (the higher the score, the lower the expected number of HSPs). • Thus, we anticipate that E is proportional to mn and to e^(-λS). @Schee Xin Lin
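Both proportionalities can be checked numerically. The sketch below reuses the constants from the worked example on slide 30; the 10-point score increase is an arbitrary value picked for illustration.

```python
import math

def expected_hits(k, m, n, lam, score):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)

K, LAM, M, N, S = 0.082, 0.219, 100, 2e8, 103

base = expected_hits(K, M, N, LAM, S)
ratio_double_m = expected_hits(K, 2 * M, N, LAM, S) / base  # linear in m
ratio_double_n = expected_hits(K, M, 2 * N, LAM, S) / base  # linear in n
ratio_plus_10 = expected_hits(K, M, N, LAM, S + 10) / base  # exponential in S

print(ratio_double_m, ratio_double_n, ratio_plus_10)
```

Doubling m or n doubles E, while adding 10 to S multiplies E by e^(-10λ), about 0.11 for this λ.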

  33. [Figures: relationship between E and mn; relationship between E and e^(-λS)] @Schee Xin Lin

  34. HOW DOES PSI-BLAST WORK? @Aw Mei Yee

  35. PSI-BLAST FLOW CHART [Figure: flow chart with steps 1 to 4] @Aw Mei Yee

  36. PRINCIPLES PSI-BLAST principle: 1. A standard BLAST search is performed against a database using a substitution matrix (e.g. BLOSUM62). 2. A PSSM is constructed automatically from a multiple alignment of the highest-scoring hits of the initial BLAST search. Highly conserved positions receive high scores and weakly conserved positions receive low scores. @Aw Mei Yee

  37. PRINCIPLES cont.. 3. The PSSM replaces the initial matrix (e.g. BLOSUM62) in a second BLAST search. 4. Steps 2 and 3 can be repeated, with the newly found sequences included to build a new PSSM. 5. PSI-BLAST is said to have converged when no new sequences are included in the last cycle. @Aw Mei Yee
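The five principles amount to a loop that stops when a round adds no new sequences. The sketch below shows only that control flow: the three callables are hypothetical stand-ins supplied by the caller, not a real BLAST interface, and the toy versions simply reveal one extra made-up sequence per round.

```python
def iterate_until_converged(initial_search, build_pssm, profile_search,
                            max_iterations=10):
    """Rebuild the PSSM and search again until no new sequences appear."""
    included = set(initial_search())           # 1: standard BLAST search
    for iteration in range(1, max_iterations + 1):
        pssm = build_pssm(included)            # 2: PSSM from current hits
        hits = set(profile_search(pssm))       # 3: search with the PSSM
        new_hits = hits - included
        if not new_hits:                       # 5: nothing new, converged
            return included, iteration
        included |= new_hits                   # 4: include and repeat
    return included, max_iterations

# Toy stand-ins (hypothetical data): each round reaches one more layer.
LAYERS = [{"seqA", "seqB"}, {"seqC"}, {"seqD"}]

def toy_initial_search():
    return LAYERS[0]

def toy_build_pssm(included):
    return frozenset(included)  # the toy "profile" is just the member set

def toy_profile_search(pssm):
    found = set(pssm)
    for layer in LAYERS:
        found |= layer
        if not layer <= pssm:   # stop at the first layer not yet in the profile
            break
    return found

members, rounds = iterate_until_converged(
    toy_initial_search, toy_build_pssm, toy_profile_search)
print(sorted(members), rounds)
```

The `max_iterations` cap is a design choice for the sketch, so that a search which keeps finding new sequences still terminates.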

  38. @Aw Mei Yee Sequence in FASTA format Example of FASTA format: >gi|18892811|gb|AAL80910.1| transposase [Pyrococcus furiosus DSM 3638] MVVLSFQRKILIIKSEIYPIVSKHYPKNTRREVISLYDLITFAILAHLHFNGVYKHAYRVLIEEMKLFPK IRYNKLTERLNRHEKLLLLAQEELFKKHAREYVRILDSKPIQTKELARKNRKDKEGSSEVISEKPAVGFV PSKKKFYYGYKLTCYSDGNLLALLSVDPANKHDVSVVREKFWVIVEEFSGCFLFLDKGYVSRGLEEEFLR FGVVYTPVKRGNQISNLEEKKFYKYLSDFRRRIETLFSKFSEFLLRPSRSVSLRGLAVRILGAILAVNLD RLYNFTGGGN
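A FASTA record is a header line starting with ">" followed by one or more sequence lines. The minimal parser below is a sketch written for this illustration; the example text keeps the header from the slide but only the first and last sequence lines, to stay short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Shortened example: header plus first and last sequence lines only.
example = """>gi|18892811|gb|AAL80910.1| transposase [Pyrococcus furiosus DSM 3638]
MVVLSFQRKILIIKSEIYPIVSKHYPKNTRREVISLYDLITFAILAHLHFNGVYKHAYRVLIEEMKLFPK
RLYNFTGGGN
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header.split("|")[3], len(sequence))
```

Sequence lines are joined without separators, so line wrapping in the file does not affect the parsed sequence.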

  39. Peptide Sequence Databases: try choosing refseq @Aw Mei Yee

  40. Choose PSI BLAST @Aw Mei Yee

  41. PSI-BLAST USES TWO E-VALUE THRESHOLDS: • the threshold E-value for the initial BLAST search. • the inclusion E-value for accepting sequences into the PSSM construction (default is 0.005). @Aw Mei Yee
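The two thresholds act as two successive filters on the hit list. In the sketch below the hit names and E-values are invented, the reporting threshold of 10 is an assumed value (the slide gives no number for it), and treating the cut-offs as "less than or equal" is also an assumption.

```python
def split_hits(hits, report_threshold=10.0, inclusion_threshold=0.005):
    """Return (reported, included) hit names for one search round."""
    reported = [name for name, e in hits if e <= report_threshold]
    included = [name for name, e in hits if e <= inclusion_threshold]
    return reported, included

# Hypothetical hits as (name, E-value) pairs.
hits = [("hit1", 1e-30), ("hit2", 0.003), ("hit3", 0.2), ("hit4", 50.0)]

reported, included = split_hits(hits)
print(reported)  # ['hit1', 'hit2', 'hit3']
print(included)  # ['hit1', 'hit2']
```

Only the hits passing the stricter inclusion threshold contribute to the next PSSM, which limits the risk of building the profile from chance matches.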

  42. Try setting the threshold to 0.0001. The threshold (cut-off) can be changed as desired for the next iteration. Lastly, click BLAST to start the search. @Aw Mei Yee

  43. OUTPUT @Aw Mei Yee

  44. FIRST ITERATION Click Go for 2nd iteration @Aw Mei Yee

  45. SECOND ITERATION Click Go for 3rd iteration @Aw Mei Yee

  46. THIRD ITERATION Click Go for 4th iteration @Aw Mei Yee

  47. FOURTH ITERATION @Aw Mei Yee

  48. From the second iteration onward, PSI-BLAST E values are not directly comparable to those calculated by BLAST. • This is because BLAST scores the query against each database sequence using a substitution matrix (e.g. BLOSUM62) that contains a fixed value for each amino acid pair, whereas PSI-BLAST scores with a position-specific matrix (the PSSM). @Aw Mei Yee

  49. Sequences carried over from the previous iteration; newly found sequences that are homologous in the new iteration @Aw Mei Yee

  50. SAMPLE ALIGNMENT (HIT TABLE) In the line between the query and the database sequence, identical matches are shown by the residue letter and conservative substitutions by a "+" symbol. Gaps are introduced with a "-" symbol. The hit sequence is presented in the Sbjct: line, and the query sequence in the Query: line. @Aw Mei Yee
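In standard BLAST pairwise output, the middle line repeats the residue letter for an identical match and shows "+" for a conservative substitution. The sketch below tallies those symbols for a short alignment; the three strings are invented for illustration and are not taken from the search shown in the slides.

```python
def summarize_alignment(query, midline, subject):
    """Count identities, conservative substitutions and gap columns."""
    identities = conservative = gaps = 0
    for q, mid, s in zip(query, midline, subject):
        if q == "-" or s == "-":
            gaps += 1                  # gap in either sequence
        elif mid == "+":
            conservative += 1          # similar but not identical residues
        elif mid == q == s:
            identities += 1            # midline repeats the matching letter
    return identities, conservative, gaps

# Hypothetical alignment, one character per column.
query   = "MKVLA-GT"
midline = "MK+L  GT"
subject = "MKILSDGT"

summary = summarize_alignment(query, midline, subject)
print(summary)  # (5, 1, 1)
```

Columns where the midline is blank and neither sequence has a gap are plain mismatches and are not counted in any of the three totals.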
