
Part I : SEQUENCE COMPARISON



Presentation Transcript


  1. Part I : SEQUENCE COMPARISON PAIRWISE ALIGNMENT Manisha Brahmachary designed by Manisha, NUS

  2. OUTLINE • What is sequence comparison? • Ways to do sequence comparison • Dot plot • BLAST • FASTA

  3. What is sequence alignment or sequence comparison? • Given two sequences of letters and a scoring scheme for evaluating matching letters, find the best pairing of the letters of one sequence with the letters of the other. • Example: align these two sentences:
     THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
     THIS IS A SHORT SENTENCE
     Two possible alignments (# marks a gap):
     THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
     THIS IS A#######SHORT###SENTENCE############## (path 1)
     or
     THIS IS A SHORT#########SENTENCE############## (path 2)

  4. Aligning biological sequences • DNA (4-letter alphabet):
     TTGACAC
     TTTACAC
     • Proteins (20-letter alphabet):
     RKVA--GMAKPNM
     RKIAVAAASKPAV

  5. Why do sequence alignment? • Finding novel genes in silico • Phylogenetic/evolutionary analysis • Finding a structural template for modelling • Functional prediction

  6. Types of Sequence Comparison • Pairwise alignment: comparison of two sequences • Multiple alignment: comparison of more than two sequences

  7. CONCEPTS IN SEQUENCE COMPARISON • IDENTITY • Percentage identity is the share of aligned positions at which both sequences carry the same residue (nucleotide or amino acid), expressed as a percentage.

  8. Example alignment:
     Query:   RCICTRGFCRCLCRR
              || | || ||| | |
     Subject: RCLCRRGVCRCICTR
     • Exact matches (shown by |): 10 identical residues • Percentage identity: 10 identical matches / 15 aligned residues × 100 = 66.7% identity
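The percentage-identity arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `percent_identity` is made up for illustration, and the two sequences are the slide's query and subject with the stray spaces removed, on the assumption that those spaces are extraction artifacts.

```python
def percent_identity(seq_a, seq_b):
    """Count identical positions in two equal-length aligned sequences.

    Returns (number of identical positions, percentage identity).
    Gap characters ('-') never count as identical.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return matches, 100.0 * matches / len(seq_a)


# Sequences from the slide (spaces removed; assumed to be artifacts).
query = "RCICTRGFCRCLCRR"
subject = "RCLCRRGVCRCICTR"
matches, identity = percent_identity(query, subject)
print(matches, round(identity, 1))
```

For the slide's pair this counts 10 identical positions out of 15, which is where the figure of roughly 67% comes from.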

  9. Mismatches between the query and the subject (marked with *):
     Query:   RCICTRGFCRCLCRR
                * *  *   * *
     Subject: RCLCRRGVCRCICTR

  10. Gaps:
      Query:   RCICT-RGFCRCLC---RR
      Subject: RCLCRRGVCRCICTAR
      • Where the characters differ, the alignment can sometimes be improved by inserting gaps (-). • Gaps carry penalties: • Inserting the first gap (GAP OPENING): high penalty, e.g. -2 (subtract 2) • Inserting consecutive gaps (GAP EXTENSION): lower penalty, e.g. -1 (subtract 1 for each consecutive gap) • The more gaps, the lower the score of the alignment.
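The gap-opening and gap-extension rule can be sketched as a small function. The name `gap_penalty` is hypothetical; the default values -2 and -1 are the example penalties quoted on the slide, and the sketch assumes that the first position of a gap run pays the opening penalty and every following position in the same run pays the extension penalty.

```python
def gap_penalty(aligned_seq, gap_open=-2, gap_extend=-1):
    """Sum the gap penalties for one row of an alignment.

    The first '-' of each run costs gap_open; every further
    consecutive '-' in the same run costs gap_extend.
    """
    total = 0
    in_gap = False
    for ch in aligned_seq:
        if ch == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
    return total


# Query row from the slide: one gap of length 1 and one of length 3.
print(gap_penalty("RCICT-RGFCRCLC---RR"))
```

With these example values the single gap costs 2 and the run of three costs 2 + 1 + 1 = 4, so the row loses 6 points in total.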

  11. Substitution:
      RCICT-RGFCRCLC---RR
      RCLCRRGVCRCICTAR-
      • A substitution scores less than an identical match. • E.g. +1 per substitution

  12. Substitution: replacing a residue with another of similar physicochemical properties. • Amino acid categories:
      Acids and amides: Asp (D), Glu (E), Asn (N), Gln (Q)
      Basic: His (H), Lys (K), Arg (R)
      Aromatic: Phe (F), Tyr (Y), Trp (W)
      Hydrophilic: Ala (A), Cys (C), Gly (G), Pro (P), Ser (S), Thr (T)
      Hydrophobic: Ile (I), Leu (L), Met (M), Val (V)

  13. Similarity • Similarity = identical matches + substitutions • E.g. (10 identical matches + 2 substitutions) / 15 aligned residues × 100 = 80% similarity
      RCICT-RGFCRCLC---RR
      RCLCRRGVCRCICTAR
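The 80% figure can be reproduced by treating two different residues as a substitution when they fall in the same category of the slide 12 table. The sketch below makes two assumptions: the names `GROUPS` and `percent_similarity` are invented for illustration, and the calculation is run on the ungapped 15-residue pair from slide 8, since that is the pair for which the counts of 10 identities and 2 substitutions work out.

```python
# Amino acid categories as listed on slide 12.
GROUPS = {
    "acids_and_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def percent_similarity(seq_a, seq_b):
    """Return (identical, similar, percent similarity) for an alignment.

    'similar' counts positions whose residues differ but share a category.
    Gap characters belong to no category and never count.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(seq_a, seq_b):
        if a == b and a in GROUP_OF:
            identical += 1
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            similar += 1
    return identical, similar, 100.0 * (identical + similar) / len(seq_a)


print(percent_similarity("RCICTRGFCRCLCRR", "RCLCRRGVCRCICTR"))
```

Here the two I/L pairs are both hydrophobic and count as substitutions, while T/R, F/V and R/T cross category boundaries and do not.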

  14. For DNA, only identity and gaps are applicable:
      ACTCGGCCCCGCGCTCACTGC
      ACTCGGAC--GCGCTCAGTGC

  15. Similarity vs. Homology • Homology: two sequences are homologous when they descend from a common ancestor. • Homology is inferred from similarity. • If two sequences are sufficiently similar, they are presumed to be homologous. • Rule of thumb: at least 30% identity over 400 bp for DNA sequences, or over 125 amino acids for proteins.

  16. Scoring matrices used in sequence comparison • What is a scoring matrix? • Scoring matrices are used when we compare sequences with one another. • A scoring matrix gives a measure of how readily one residue can be substituted by another.

  17. Scoring Matrices • For amino acids, each amino acid is compared to every other and a score is given to each pair. • High score if they are the same residue (e.g. cysteine compared to cysteine) • Low score if they are very different (e.g. tryptophan compared to cysteine)

  18. Scoring matrices for DNA • DNA sequence: 4 characters only (A, T, G, C) • Unitary matrix used for scoring: a scoring system in which only identical characters receive a positive score.
         A C G T
      A  1 0 0 0
      C  0 1 0 0
      G  0 0 1 0
      T  0 0 0 1
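As a rough illustration of how a scoring matrix is used, the unitary DNA matrix can be stored as a lookup table and summed over the columns of an ungapped alignment. The names `UNITARY` and `score_ungapped` are hypothetical; the example pair is the DNA pair from slide 4.

```python
BASES = "ACGT"

# Unitary matrix: 1 on the diagonal (identical bases), 0 elsewhere.
UNITARY = {(x, y): int(x == y) for x in BASES for y in BASES}


def score_ungapped(seq_a, seq_b, matrix=UNITARY):
    """Sum the matrix score of every column of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


# DNA pair from slide 4: one mismatch (G/T) in seven columns.
print(score_ungapped("TTGACAC", "TTTACAC"))
```

Swapping `UNITARY` for a protein matrix such as a BLOSUM or PAM table would leave the function unchanged; only the lookup values differ.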

  19. SCORING SCHEMES FOR PROTEIN SEQUENCE ALIGNMENTS • Scoring matrices used are PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix). • BLOSUM45 to BLOSUM90: from MORE DIVERGENT to LESS DIVERGENT sequences • PAM30 to PAM250: from LESS DIVERGENT to MORE DIVERGENT sequences • NOTE: Many different matrices are in use, and each gives different values to pairs of amino acids. Depending on how distantly related your sequences are, you may want to choose a different matrix for your comparison.

  20. Scoring Matrices • Notes:
      MORE DIVERGENT:  BLOSUM45   PAM250
                       BLOSUM62   PAM160
      LESS DIVERGENT:  BLOSUM90   PAM100

  21. Ways to do Pairwise Alignment • Dot plot (simplest method) • Statistical computation based: • Local alignment, e.g. BLAST, FASTA • Global alignment, e.g. CLUSTAL

  22. What are Dot Plots? • A program for graphical sequence comparison, used to find out: – Are the two sequences similar? – Are there repeat regions in your sequence?

  23. STEPS IN DOT PLOT • Take the two sequences to be compared: • Sequence A: MEHRKPGTGQ • Sequence B: MEHRKPGTGQ • Place sequence A on the x-axis (row sequence) and sequence B on the y-axis (column sequence).

  24. Plot a dot every time there is a match between an element of the row sequence and an element of the column sequence. • Do you see a diagonal line extending? • If yes, the two sequences match along that stretch.
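The plotting rule on these two slides amounts to filling a grid with a mark wherever the two letters agree. The following is a minimal text-mode sketch; `dot_matrix` and `render` are illustrative names, and the sequence is the one given on slide 23.

```python
def dot_matrix(seq_x, seq_y):
    """Grid of booleans: cell [row][col] is True when seq_y[row] == seq_x[col]."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the dot plot as text: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Sequence A and sequence B from slide 23 (identical sequences).
print(render("MEHRKPGTGQ", "MEHRKPGTGQ"))
```

Because the two sequences are identical, every cell on the main diagonal is marked; the two extra off-diagonal marks come from the letter G, which occurs twice.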

  25. Patterns in Dot Plot • When two sequences are "identical" • Sequence (on both axes): GGTCCTTGGCTGAAAGACCCCA

  26. Application of Dot Plot • Using self-comparison to find repeats • Sequence used (on both axes): human Alu sequence CATCTCAAAAACAACAACAAAAAAAAAAAAAAAAGAAAAAAAA • Omit the main diagonal. • Clusters of diagonal lines show repeats in the sequence.
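One way to read repeats off a self-comparison programmatically is to walk every diagonal except the main one and report runs of consecutive matches. This is a sketch under stated assumptions: the function name `diagonal_repeats`, the `min_len` threshold and the toy input string are all invented for illustration and do not come from the slides.

```python
def diagonal_repeats(seq, min_len=3):
    """Find off-diagonal runs in a self dot plot.

    Returns a list of (start_i, start_j, length) with start_j > start_i,
    meaning seq[start_i:start_i+length] == seq[start_j:start_j+length].
    The main diagonal (offset 0) is skipped, as the slide suggests.
    """
    n = len(seq)
    hits = []
    for offset in range(1, n):           # offset = j - i along one diagonal
        run = 0
        for i in range(n - offset):
            if seq[i] == seq[i + offset]:
                run += 1
                continue
            if run >= min_len:
                hits.append((i - run, i - run + offset, run))
            run = 0
        if run >= min_len:               # run that reaches the end of the diagonal
            end = n - offset
            hits.append((end - run, end - run + offset, run))
    return hits


# Toy example: "ABC" occurs at positions 0 and 5.
print(diagonal_repeats("ABCXXABC"))
```

Only the upper triangle is scanned because a self dot plot is symmetric about the main diagonal. On a low-complexity sequence such as the Alu fragment above, many overlapping runs would be reported, which corresponds to the clusters of diagonal lines described on the slide.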

  27. Notes: What are repeats? • Repeats are stretches of residues that occur more than once in a sequence. • Importance of repeats: • In proteins: regulatory regions, binding sites • In DNA: present in transposons and chromosomal mutational hotspots; many genetic diseases are associated with repeats, e.g. Huntington's disease.

  28. Patterns in Dot Plot • When two sequences are similar: a broken diagonal; the interrupted regions show where the sequences mismatch. • Sequences compared:
      GREGYPADSKGCKITCFLTAAGYCNTECTLKKGSSGYCAWPACYCYG
      MKGMILFISCLLLIDIVVGGKEGYLMDHEGCKLSCFIRPSGYCGRECTLKKGS

  29. Patterns in Dot Plot • Two different but related sequences: broken diagonals and clusters of dots parallel to the central diagonal. The distance between the lines shows the number of insertions needed to obtain the alignment. • Sequences compared:
      GREGYPADSKGCKITCFLTAAGYCNTECTLKKGSSGYCAWPA
      ARDGYPVDEKGCKLSCLINDKWCNSACHSRGGKYGYCYTGGL

  30. Two models of alignment: local and global alignments • Global alignment: • Looks for similarity across the full extent of the sequences • Site: http://www2.igh.cnrs.fr/bin/align-guess.cgi

  31. GLOBAL Alignment • The two sequences are matched across their whole length.

  32. Local alignment • Looks for regions of similarity in parts of the sequences only • Software: BLAST, FASTA

  33. Local Alignment • Example of a local alignment between two sequences using the lalign program (http://www.ch.embnet.org/software/LALIGN_form.html). • Notice that the alignment is shown only for those regions that have strong identity or strong similarity.

  34. Why two different models? • Global alignment: • Suited to sequences with a high degree of homology • Good for modelling • Local alignment: • Finds localised similarity (conserved regions of structural or functional importance, repeats, domains)
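The difference between the two models can be seen in a small dynamic-programming sketch. The slides do not name an algorithm, so this is an assumption: the code uses the textbook recurrences commonly known as Needleman-Wunsch (global) and Smith-Waterman (local), with a simple linear gap penalty rather than the opening/extension scheme of slide 10. The function name `alignment_score`, the scoring values and the toy sequences are all illustrative.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global score, both sequences aligned end to end.
    local=True:  local score, best-scoring pair of substrings
                 (cell scores are never allowed to drop below 0).
    """
    cols = len(b) + 1
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best if local else prev[-1]


# Two toy sequences that share only the core "ACGT".
x, y = "XXXACGTXXX", "YYACGTYY"
print(alignment_score(x, y, local=True), alignment_score(x, y, local=False))
```

The local score rewards the shared core and ignores the unrelated flanks, while the global score is dragged down by having to align them, which is the motivation given on the slide for keeping both models.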

  35. FASTA • FASTA stands for FAST-All (often read as "fast alignment"), by Pearson and Lipman. • A heuristic method: it first locates short word matches, then refines the best-scoring regions with dynamic programming. • Websites available: • http://www.ebi.ac.uk/fasta33/ • http://www.dna.affrc.go.jp/htdocs/Blast/fasta.html

  36. What is BLAST? • Basic Local Alignment Search Tool (BLAST) • A method for pairwise alignment. • Used to search a database (of nucleotide or protein sequences) for sequences homologous to a given query sequence. • Like FASTA, a heuristic method based on short word matches • Generally faster than FASTA in generating output. • Site for doing BLAST: • http://www.ncbi.nlm.nih.gov

  37. How to go about doing BLAST • Query: protein sequence from the SARS virus: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYEDLLIRKSNHSFLVQAGNVQLRVIGHSMQNCLLRLKVDTSNPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNHTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGKFYGPFVDRQTAQAAGTDTTITLNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCAALKELLQNGMNGRT ILGSTILEDEFTPFDVVRQCSGVTFQ


  39. BLAST output for a protein query sequence from a SARS virus • Score (bits) is derived from the scores assigned letter by letter during alignment, based on the substitution matrix. • The higher the score, the lower the E value.

  40. E value: the number of alignments with an equal or better score that one expects to get as hits purely by chance. • The lower the E value, the fewer the chance hits. • An E value of zero, or very close to zero, indicates a very good hit (a highly homologous sequence). An E value cannot be negative. • Some BLAST programs report a related statistic, P(N), instead.
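The inverse relation between bit score and E value can be made concrete with the standard Karlin-Altschul relation, E = m × n × 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The slides do not give this formula, so treat the sketch as supplementary: the function name `expect_value` and the example numbers are invented, and real BLAST programs use effective lengths with edge corrections, so their reported values differ somewhat.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S') with raw (not effective) lengths,
    which is a simplification of what BLAST actually reports.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: 300-residue query against 10 million residues.
for bits in (20, 40, 60):
    print(bits, expect_value(bits, 300, 10_000_000))
```

Each extra bit of score halves the E value, and a larger database raises it, which is why the same alignment can look less significant when the search space grows.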

  41. BLAST OUTPUT • Each alignment in the output gives the identity and the similarity.

  42. BLAST • BLAST query schemes: • Amino acid sequence, against which database? • blastp (protein sequence db) • tblastn (translated nucleotide sequence db) • DNA sequence, against which database? • blastn (nucleotide db) • blastx (protein sequence db) • tblastx (translated nucleotide sequence db)

  43. Workflow for an unknown gene: • DNA sequencing gives an unknown gene (cDNA): CTAACATGCTTAGGATAATGGCCTCTCTTGTTCTTGCTCGCAAACATAACACTTGCTGTAACTTATCACA • Either BLAST the DNA sequence directly, or translate it into 6 frames, choose the appropriate frame and BLAST the amino acid sequence: NMLRIMASLVLARKHNTCCNLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNIC • BLAST results: choose the best hit using the lowest E value and the highest % identity. • If the hit has a high % identity and a low E value, the function and family of the gene are found. • CLUSTAL: use multiple sequences to find conserved regions, domains and phylogenetic relations (which family of genes is closest to your target gene/protein).
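The "translate into 6 frames" step can be sketched as follows. This assumes the standard genetic code and an input containing only A, C, G and T; the names `CODON_TABLE`, `translate` and `six_frames` are illustrative, and the DNA string is the fragment shown on the slide.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]   # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


dna = "CTAACATGCTTAGGATAATGGCCTCTCTTGTTCTTGCTCGCAAACATAACACTTGCTGTAACTTATCACA"
for name, protein in six_frames(dna).items():
    print(name, protein)
```

For the slide's fragment, frame +3 gives NMLRIMASLVLARKHNTCCNLS, which matches the start of the amino acid sequence shown on the slide; choosing the appropriate frame usually means picking the one that is not interrupted by stop codons.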

  44. SUMMARY • Today we looked at methods to compare two sequences: • Dot plots (simplest, graphical view) • Different patterns of dot plots • Local alignment • Global alignment • Difference between these two models • FASTA • BLAST • Other types of BLAST
