
Bioinformatics

Ayesha M. Khan, Spring 2013.





Presentation Transcript


  1. Bioinformatics Ayesha M. Khan Spring 2013

  2. What’s in a secondary database? • Multiple alignments often contain conserved motifs that reflect shared structural or functional characteristics of the constituent sequences. • Such conserved motifs may be used to build characteristic signatures that aid family and/or functional diagnosis of newly determined sequences. Lec-7

  3. Conservation patterns: functional cues • E.g. the amino acids that are consistently found at enzyme active sites, or the nucleotides that are associated with transcription factor binding sites. (Figure: ATP/GTP-binding proteins.)

  4. Conservation patterns: functional cues (contd.) (Figure: the GAL4 binding sequence.)

  5. So what exactly is a pattern? • A pattern describes a motif using a qualitative consensus sequence. • Early patterns were reported as consensus sequences: composite sequences consisting of the most common residue occurring at each position of an alignment. • A later approach stores the pattern as a regular expression. A regular expression is much more flexible than a consensus sequence because more than one residue can be stored at each position. Many patterns can be described as regular expressions.

  6. Patterns • A pattern uses a regular expression (reducing the sequence data to a consensus). • Mismatches are not tolerated. • E.g., [GA]-[IMFAT]-H-[LIVF]-H-{S}-x-[GP]-[SDG]-x-[STAGDE] • Each position in the pattern is separated by a hyphen. • x can match any residue. • [ ] indicate ambiguous positions: any of the listed residues is allowed. • { } indicate residues that are not allowed at this position. • ( ) surround repeated residues, e.g. A(3) means AAA.
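The pattern syntax above translates mechanically into an ordinary regular expression. A minimal sketch of such a converter (the helper name `prosite_to_regex` is illustrative, not a standard library function):

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression.

    Hyphens separate positions, x matches any residue, [..] lists
    allowed residues, {..} lists forbidden residues, and (n) repeats
    the preceding element n times.
    """
    regex = []
    for element in pattern.split("-"):
        # Pull off a repeat count such as A(3) -> core "A", count "3"
        m = re.match(r"^(.+?)(?:\((\d+)\))?$", element)
        core, count = m.group(1), m.group(2)
        if core == "x":
            part = "."                      # any residue
        elif core.startswith("["):
            part = core                     # allowed residues, same syntax
        elif core.startswith("{"):
            part = "[^" + core[1:-1] + "]"  # forbidden residues
        else:
            part = core                     # a literal residue
        if count:
            part += "{" + count + "}"
        regex.append(part)
    return "".join(regex)

pattern = "[GA]-[IMFAT]-H-[LIVF]-H-{S}-x-[GP]-[SDG]-x-[STAGDE]"
print(prosite_to_regex(pattern))
# [GA][IMFAT]H[LIVF]H[^S].[GP][SDG].[STAGDE]
```

Because mismatches are not tolerated, a sequence either matches the resulting expression or it does not; there is no score.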

  7. “Rules” • “Rules” are patterns that are much shorter, more generic, and not associated with specific protein families. • They may denote sugar attachment sites, phosphorylation or hydroxylation sites, etc. • N-glycosylation site: N-{P}-[ST]-{P} • Protein kinase C phosphorylation site: [ST]-x-[RK] • Realistically, such short motifs can only suggest that a certain type of functional site might exist in a sequence; this must be verified by experiment.
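The two rules above can be applied directly with ordinary regular expressions ({P} becomes [^P], x becomes "."). A small sketch (the function name `scan_rules` and the test peptide are illustrative only):

```python
import re

# The two rules from the slide, written as plain regular expressions
RULES = {
    "N-glycosylation site": "N[^P][ST][^P]",
    "PKC phosphorylation site": "[ST].[RK]",
}

def scan_rules(sequence):
    """Report every (possibly overlapping) rule match in a sequence."""
    hits = []
    for name, regex in RULES.items():
        # A lookahead lets overlapping occurrences be found
        for m in re.finditer(f"(?=({regex}))", sequence):
            hits.append((name, m.start() + 1, m.group(1)))  # 1-based position
    return hits

for name, pos, match in scan_rules("MNKSATNLSERP"):
    print(f"{name} at position {pos}: {match}")
```

As the slide warns, a hit only says a site *might* exist; short motifs occur frequently by chance.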

  8. Consensus sequences • The consensus sequence method is the simplest way to build a model from a multiple sequence alignment. • The consensus is built using the following rules: • Majority wins: the most frequent residue in a column becomes the consensus residue. • Skip columns with too much variation (e.g. mark them with a wildcard).
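The two rules can be sketched in a few lines. This is a minimal illustration (the 0.5 threshold and the "x" wildcard are assumptions, not values given on the slide):

```python
from collections import Counter

def consensus(alignment, threshold=0.5):
    """Build a consensus sequence from equal-length aligned sequences.

    Majority wins: take the most frequent residue in each column.
    Skip too much variation: if the winner's frequency is below
    `threshold`, emit the wildcard 'x' instead.
    """
    out = []
    for column in zip(*alignment):  # iterate column by column
        residue, count = Counter(column).most_common(1)[0]
        out.append(residue if count / len(column) >= threshold else "x")
    return "".join(out)

alignment = ["GAATTC",
             "GAATTC",
             "GACTTC",
             "GTATAC"]
print(consensus(alignment))  # GAATTC
```

With a stricter threshold the variable columns drop out, which is why the method suits highly conserved signatures such as restriction sites.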

  9. Consensus sequences (contd.) Advantages: • Very fast and easy to implement. Limitations: • The model carries no information about the variation within columns. • Very dependent on the training set. • No scoring; only a binary result (YES/NO). When to use it: • Useful for finding highly conserved signatures, for example enzyme restriction sites in DNA.

  10. In cases of extreme sequence divergence: • The following approaches can be used to identify distantly related members of a family of protein (or DNA) sequences: • Position-specific scoring matrix (PSSM) • Profile • Hidden Markov model • These methods provide a statistical framework in which the probability of observing particular residues or nucleotides at specific positions is tested. Thus, information on all the members of the multiple alignment is retained.

  11. Sequence Profiles • A sequence profile is a position-specific scoring matrix (PSSM) that gives a quantitative description of a sequence motif. • Unlike deterministic patterns, profiles assign a score to a query sequence and are widely used for database searching. • A simple PSSM has as many columns as there are positions in the alignment, and either 4 rows (one for each DNA nucleotide) or 20 rows (one for each amino acid).

  12. PSSM • The score for the jth nucleotide at position k is the log-odds ratio M(k,j) = log2( p(k,j) / p(j) ), where: • p(k,j) is the probability of nucleotide j at position k • p(j) is the “background” probability of nucleotide j

  13. Computing a PSSM • C(k,j): number of nucleotides of type j at position k • Z: total number of aligned sequences • p(j): background probability of nucleotide j • p(k,j) = C(k,j) / Z: probability of nucleotide j at position k • M(k,j) = log2( p(k,j) / p(j) )

  14–16. Computing a PSSM… (worked numerical example shown in slide figures, not captured in the transcript)
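The computation on the preceding slides can be sketched end to end: count nucleotides per column, convert counts to probabilities, and take log2-odds against the background. The pseudocount, the uniform background of 0.25, and the example alignment are assumptions added here so the sketch runs; they are not from the slides:

```python
import math
from collections import Counter

NUCS = "ACGT"

def pssm(sequences, background=None, pseudocount=0.5):
    """Compute M(k,j) = log2(p(k,j) / p(j)) for each position k.

    p(k,j) = (C(k,j) + pseudocount) / (Z + 4 * pseudocount); the small
    pseudocount avoids log(0) for nucleotides absent from a column.
    """
    if background is None:
        background = {j: 0.25 for j in NUCS}  # assumed uniform background
    Z = len(sequences)
    matrix = []
    for column in zip(*sequences):            # one column per position k
        counts = Counter(column)
        row = {j: math.log2(((counts[j] + pseudocount) / (Z + 4 * pseudocount))
                            / background[j])
               for j in NUCS}
        matrix.append(row)
    return matrix

def score(matrix, seq):
    """Score a query by summing the position-specific log-odds."""
    return sum(matrix[k][j] for k, j in enumerate(seq))

# Hypothetical alignment of promoter-like sites, for illustration only
seqs = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
M = pssm(seqs)
print("match:", round(score(M, "TATAAT"), 2))   # consensus-like site scores high
print("non-match:", round(score(M, "GGGGGG"), 2))
```

A positive total score indicates the query resembles the motif more than the background; this is the quantitative behaviour that distinguishes profiles from deterministic patterns.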

  17. PSI-BLAST: Position-Specific Iterated BLAST • Many proteins in a database are too distantly related to a query to be detected using standard BLAST. • In many other cases, matches are detected but are so distant that the inference of homology is unclear. • Enter the more sensitive PSI-BLAST.

  18. PSI-BLAST scheme (figure not captured in the transcript)

  19. PSI-BLAST… • The search process is continued iteratively, typically about 5 times, and at each step a new PSSM is built from the accumulated hits. • The search can be stopped at any point, typically when few new results are returned or no new sensible results are found.

  20. PSI-BLAST errors • Unrelated hits: how to avoid them? • Perform multi-domain splitting of your query sequence. • Inspect each PSI-BLAST iteration, removing suspicious hits. • Lower the Expect level (E-value) threshold.

  21. Markov model • A Markov chain describes a series of events or states. • There is a certain probability of moving from one state to the next, known as the transition probability. • The probability of moving to a future state depends only on the current state, not on previous states. • In a Markov model, all states are observable.

  22. Hidden Markov model • A hidden Markov model may consist of observable states and unobservable or “hidden” states. • The hidden states also affect the outcome of the observed states. • In a sequence alignment, for example, a gap is an unobserved state that influences the probability of the next nucleotide. • In DNA there are four symbols: G, A, T and C (20 in proteins). The probability associated with emitting each symbol in a given state is the emission probability.

  23. Markov model: example (figure showing emission and transition probabilities, not captured in the transcript) • This particular Markov model has a probability of 0.80 × 0.40 × 0.32 ≈ 0.102 of generating the sequence AG. • The model shows that the sequence AT has the highest probability of occurring. • Where do these numbers come from? A Markov model has to be “trained” with examples.
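The product on the slide can be reproduced with a tiny scorer. Because the figure is lost, the roles of the three factors are an assumption here: start→A = 0.80, transition A→G = 0.40, and G→end = 0.32.

```python
def chain_probability(seq, start, trans, end):
    """Probability that a first-order Markov chain with explicit
    start and end states generates `seq`: start prob x transition
    probs x end prob."""
    p = start[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= trans[(a, b)]
    return p * end[seq[-1]]

# Assumed decomposition of the slide's 0.80 x 0.40 x 0.32
start = {"A": 0.80}
trans = {("A", "G"): 0.40}
end = {"G": 0.32}
print(round(chain_probability("AG", start, trans, end), 4))  # 0.1024
```

The slide rounds 0.1024 to 0.102; the point is that a sequence's probability is just the product of the probabilities along its path through the model.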

  24. Hidden Markov model… • The frequencies of occurrence of nucleotides in a multiple sequence alignment are used to calculate the emission and transition probabilities of each symbol at each state. • The trained HMM is then used to test how well a new sequence fits the model. • A state can be a match/mismatch (a mismatch is a low-probability match; observable), an insertion (hidden), or a deletion (hidden).

  25. Markov models (contd.) Example: a general Markov chain modeling DNA. Note that any sequence can be traced through the model by passing from one state to the next via transitions. A Markov chain is defined by: • A finite set of states, S1, S2, S3 … SN • A set of transition probabilities, aij • An initial state probability distribution, πi

  26. Markov chain example: x = {a, b} • We observe the following sequence: abaaababbaa • Transition probabilities and initial state probabilities are estimated from the observed sequence (values shown in the slide figure, not captured in the transcript).
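Training such a chain is just counting: each transition probability is the number of times a pair occurs divided by the number of times the first symbol occurs as a source. A sketch using the slide's observed sequence (the function name is illustrative; initial-state probabilities would analogously come from symbol frequencies):

```python
from collections import Counter

def train_markov_chain(observed):
    """Estimate transition probabilities P(b|a) from one observed
    sequence by counting adjacent symbol pairs."""
    pair_counts = Counter(zip(observed, observed[1:]))
    from_counts = Counter(observed[:-1])  # how often each symbol is a source
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}

obs = "abaaababbaa"
for (a, b), p in sorted(train_markov_chain(obs).items()):
    print(f"P({b}|{a}) = {p:.2f}")
# P(a|a) = 0.50
# P(b|a) = 0.50
# P(a|b) = 0.75
# P(b|b) = 0.25
```

So from state a the chain is equally likely to stay or leave, while from state b it returns to a three times out of four.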

  27. Markov models (contd.) Typical questions we can ask with Markov chains: • What is the probability of being in a particular state at a particular time? (By “time” here, read “position in the query sequence”.) • What is the probability of seeing a particular sequence of states (i.e., the score for a particular query sequence given the model)?

  28. Markov chains: positional dependencies The connectivity or topology of a Markov chain can easily be designed to capture dependencies and variable-length motifs.

  29. Markov chains: insertions and deletions (figure not captured in the transcript)

  30. Markov chains: boundary detection • Given a sequence, we wish to label each symbol according to its class (e.g. transmembrane vs. extracellular/cytosolic regions). • How is this possible?

  31. Markov chains: boundary detection (contd.) • Given a training set of labeled sequences, we can begin by modeling each amino acid as hydrophobic (H) or hydrophilic (L), i.e. reduce the dimensionality of the 20 amino acids to two classes. • A peptide sequence can then be represented as a sequence of Hs and Ls, e.g. HHHLLHLHHLHL...

  32. Markov chains: boundary detection (contd.) • A simpler question: is a given sequence a transmembrane sequence? • A Markov chain for recognizing transmembrane sequences (figure not captured in the transcript). • Question: is the sequence HHLHH a transmembrane sequence? • P(HHLHH) = 0.6 × 0.7 × 0.7 × 0.3 × 0.7 × 0.7 ≈ 0.043
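The calculation above can be checked with a small scorer. The slide's figure is lost, so the parameters below are assumptions chosen to reproduce its six factors: begin→H = 0.6, H→H = 0.7, H→L = 0.3, L→H = 0.7, and an end transition H→end = 0.7 (the remaining values are likewise hypothetical):

```python
def p_sequence(seq, begin, trans, end):
    """Probability of a labeled H/L sequence under a Markov chain
    with explicit begin and end states."""
    p = begin[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= trans[(a, b)]
    return p * end[seq[-1]]

# Assumed parameters reproducing the slide's 0.6 x 0.7 x 0.7 x 0.3 x 0.7 x 0.7
begin = {"H": 0.6, "L": 0.4}
trans = {("H", "H"): 0.7, ("H", "L"): 0.3,
         ("L", "H"): 0.7, ("L", "L"): 0.3}
end = {"H": 0.7, "L": 0.3}
print(round(p_sequence("HHLHH", begin, trans, end), 3))  # 0.043
```

On its own this number answers nothing; in practice it would be compared with the probability of the same sequence under a non-transmembrane (background) model.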
