In protein sequence analysis, we explore how to determine if an orphan sequence belongs to a known family based on various domain representations, including patterns, alignments, and HMMs. Examining examples such as Innexins and their role in gap junctions, we investigate strategies for identifying additional family members and creating accurate representations. Additionally, we discuss the importance of conserved residues in determining protein functions and the utilization of regular expressions for matching and searching protein sequences in databases.
CSE182-L6 Protein sequence analysis CSE182
Possible domain queries • Case 1: • You have a collection of sequences that belong to a family (contain a functional domain). • Given an ‘orphan’ sequence, does it belong to the family? • There are different solutions depending upon the representation of the domain (patterns/alignments/HMMs/profiles). • Case 2: • You have an orphan sequence from an uncharacterized family. Can you identify other members of the family, and create a representation of them? (Harder problem.)
EX: Innexins • The Macagno lab is studying Gap junction proteins, Innexins (invertebrate analogs of connexins) in Hirudo • Innexins have been found in C. elegans, and Drosophila. • In C. elegans, 25 members of this family have been found, and partially categorized.
Innexins in Hirudo • When certain Innexins are knocked out, they cause serious defects in cells in the ganglia. • The EST database (partial gene sequences) contains a number of putative Innexins, discovered via BLAST. • Project: • Q: Can you confirm that these are Innexins? Can you find more members? (this lecture) • Q: Can you characterize them with respect to known innexins in C. elegans and Drosophila? • Q: Can you use your method for other families of interest, such as Netrins and their receptors?
Not all features (residues) are important [Figure: two illustrative images, labeled "Skin patterns" and "Facial Features"]
Protein sequence motifs • Premise: • The sequence of a protein gives clues about its structure and function. • Not all residues are equally important in determining function. • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. • The key residues can be identified from structural information, if we have it, or from conserved residues in an alignment of the family.
Representation of domains/families. • We will consider a number of representations that describe key residues, characteristic of a family • Patterns (regular expressions) • Alignments • Profiles • HMMs • Start with the following: • A collection of sequences with the same function. • Region/residues known to be significant for maintaining structure and function. • Develop a pattern of conserved residues around the residues of interest • Iterate for appropriate sensitivity and specificity
From alignment to patterns
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Pattern: ATH-[DE]
• Search a database with the resulting pattern • Refine pattern to eliminate false positives • Iterate
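The step from alignment to pattern can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the three aligned sequences are the ones on the slide, while the helper names `column_pattern` and `search_database`, the chosen column range, and the tiny `database` are assumptions made up for the example.

```python
import re

# The three aligned sequences from the slide (equal length, no gaps).
alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]

def column_pattern(seqs, start, end):
    """Build a regex for alignment columns start..end-1 (0-based)."""
    parts = []
    for i in range(start, end):
        residues = sorted({s[i] for s in seqs})
        # A fully conserved column becomes a literal; otherwise a character class.
        parts.append(residues[0] if len(residues) == 1
                     else "[" + "".join(residues) + "]")
    return "".join(parts)

def search_database(database, pattern):
    """Return the names of database sequences containing a match."""
    regex = re.compile(pattern)
    return [name for name, seq in database.items() if regex.search(seq)]

# Columns 5..8 hold the conserved A, T, H and the D/E position.
pattern = column_pattern(alignment, 5, 9)

# Hypothetical database, invented for illustration only.
database = {"seq1": "MKATHDLL", "seq2": "GGATHQ", "seq3": "QQATHEAA"}
hits = search_database(database, pattern)
print(pattern, hits)
```

In practice the hits would be inspected for false positives and the pattern widened or narrowed before searching again, which is the "refine and iterate" loop on the slide.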
Regular Expression Patterns • Zinc Finger motif • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H • 2 conserved C, and 2 conserved H • How can we search a database using these motifs? • The motif is described using a regular expression. What is a regular expression?
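One practical way to search with such a motif is to translate it into the regular-expression syntax of a programming language. The sketch below assumes the usual PROSITE conventions (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats) and covers only those elements; the function name `prosite_to_regex` and the test sequence are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    pieces = []
    for element in pattern.split("-"):
        # Split e.g. "x(2,4)" into the residue part "x" and the optional repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":                  # x: any residue
            core = "."
        elif core.startswith("{"):       # {ABC}: any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        pieces.append(core)
    return "".join(pieces)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
zinc_regex = prosite_to_regex(ZINC_FINGER)

# Made-up sequence that contains one instance of the motif after "MK".
sequence = "MK" + "CAAC" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(zinc_regex, sequence)
print(zinc_regex, match.start(), match.end())
```

Python's `re` engine matches by backtracking rather than by the automaton simulation developed in the following slides, so this is a convenience for experimenting, not a model of the algorithm.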
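Regular Expressions • Concise representation of a set of strings over an alphabet Σ. • Described by a string over Σ and the symbols ε, +, ·, *, ( and ). • R is a r.e. if and only if one of the following holds: • R = {ε} • R = {σ} for some σ ∈ Σ • R = R1 + R2, R = R1 · R2, or R = R1*, where R1 and R2 are r.e.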
Regular Expression • Q: Let Σ = {A,C,E} • Is (A+C)*EEC* a regular expression? • Is *(A+C) regular? • Q: When is a string s in a regular expression? • R = (A+C)*EEC* • Is CEEC in R? • AEC? • ACEE?
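These membership questions can be checked mechanically. The sketch below uses Python's `re` module, where the slide's union `+` is written `|`; the helper names `in_language` and `is_well_formed` are made up for the example.

```python
import re

# (A+C)*EEC* in the slide's notation; '+' (union) becomes '|' in Python.
R = re.compile(r"(A|C)*EEC*")

def in_language(s):
    """True if the whole string s is in the set of strings described by R."""
    return R.fullmatch(s) is not None

def is_well_formed(expr):
    """True if Python accepts expr as a regular expression."""
    try:
        re.compile(expr)
        return True
    except re.error:
        return False

for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_language(s))

# '*' must follow something it can repeat, so '*(A|C)' is rejected.
print(is_well_formed(r"(A|C)*EEC*"), is_well_formed(r"*(A|C)"))
```

CEEC and ACEE are in R, while AEC is not, because the expression requires two consecutive E's.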
Regular Expression & Automata • Every R.E. can be expressed by an automaton (a directed graph) with the following properties: • The automaton has a start and end node • Each edge is labeled with a symbol from Σ, or ε • Suppose R is described by automaton A • s ∈ R if and only if there is a path from start to end in A, labeled with s.
Examples: Regular Expression & Automata • R = (A+C)*EEC* [Figure: automaton for R, with edges labeled A, C, E, E and C between the start and end nodes] • Is CEEC in R? • AEC? • ACEE? • ACE?
Constructing automata from R.E. • R = {ε} • R = {σ}, σ ∈ Σ • R = R1 + R2 • R = R1 · R2 • R = R1*
Regular Expression Matching • Given a database D, and a regular expression R, is a substring of D in R? • Is there a string D[l..c] that is accepted by the automaton of R? • Simpler Q: Is D[1..c] accepted by the automaton of R?
Alg. for matching R.E. • If D[1..c] is accepted by the automaton R_A of R, then there is a path labeled D[1]…D[c] that goes from START to END in R_A.
Alg. for matching R.E. • If D[1..c] is accepted by the automaton R_A of R, then: • There is a path labeled D[1]…D[c] that goes from START to END in R_A. • Equivalently, there is a path labeled D[1]..D[c-1] from START to some node u, and a path labeled D[c] from u to END.
D.P. to match regular expression • Define: • A[u,σ] = automaton node reached from u after reading σ • Eps(u): set of all nodes reachable from node u using epsilon transitions • N[c] = subset of nodes reachable from the START node after reading D[1..c] • Q: when is v ∈ N[c]?
D.P. to match regular expression • Q: when is v ∈ N[c]? • A: If for some u ∈ N[c-1], w = A[u,D[c]], and v ∈ {w} ∪ Eps(w)
Algorithm
The final step • We have answered the question: • Is D[1..c] accepted by R? • Yes, if END ∈ N[c] • We need to answer: • Is D[l..c] (for some l, and some c) accepted by R?
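The recurrence for N[c] and the final step can be sketched as follows. The algorithm slide itself did not survive in this transcript, so this is one plausible reading of it, not the lecturer's code. The automaton for (A+C)*EEC* is built by hand; the node numbering, the tables `STEP` and `EPS`, and the function names are assumptions. Unlike the slide's A[u,σ], `STEP` maps to a set of nodes so that nondeterministic automata also work. Substring matching is handled by re-adding START before every character, so a match may begin at any l.

```python
START, END = 0, 3

# Hand-built automaton for (A+C)*EEC* (hypothetical numbering).
# STEP[u][symbol] is the set of nodes reached from u after reading symbol.
STEP = {
    0: {"A": {0}, "C": {0}, "E": {1}},
    1: {"E": {2}},
    2: {"C": {2}},
    3: {},
}
# EPS[u] is the set of nodes reachable from u by one epsilon edge.
EPS = {2: {3}}

def eps_closure(nodes):
    """Return nodes plus everything reachable from them via epsilon edges."""
    closure, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in EPS.get(u, ()):
            if v not in closure:
                closure.add(v)
                stack.append(v)
    return closure

def advance(current, symbol):
    """Compute N[c] from N[c-1]: follow symbol edges, then epsilon edges."""
    nxt = set()
    for u in current:
        nxt |= STEP.get(u, {}).get(symbol, set())
    return eps_closure(nxt)

def accepts(D):
    """Is the whole string D accepted, i.e. is END in N[len(D)]?"""
    current = eps_closure({START})
    for symbol in D:
        current = advance(current, symbol)
    return END in current

def match_ends(D):
    """All 1-based positions c such that some substring D[l..c] is accepted."""
    ends, current = [], set()
    for c, symbol in enumerate(D, start=1):
        current |= eps_closure({START})   # a match may start at position c
        current = advance(current, symbol)
        if END in current:
            ends.append(c)
    return ends

print(accepts("CEEC"), accepts("AEC"), match_ends("AAEEC"))
```

Each character is processed once against a node set of bounded size, so the scan takes time proportional to the database length times the automaton size.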
Representation 2: Profiles • Profiles versus regular expressions • Regular expressions are intolerant of an occasional mismatch. • The Union operation (I+V+L) does not quantify the relative importance of I, V, L. It could be that V occurs in 80% of the family members. • Profiles capture some of these ideas.
Profiles • Start with an alignment of strings of length m, over an alphabet A. • Build an |A| × m matrix F = (f_ki). • Each entry f_ki represents the frequency of symbol k in position i. [Figure: example alignment with column frequencies such as 0.71, 0.14, 0.28, 0.14]
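Building the frequency matrix is a column-by-column count. The sketch below reuses the three-sequence alignment from the "From alignment to patterns" slide, since the alignment behind the figure is not reproduced here. It stores the profile as a list of per-column dictionaries instead of a dense |A| × m matrix, which is an implementation choice of this example; `build_profile` is a made-up name.

```python
from collections import Counter

def build_profile(alignment):
    """Return profile, where profile[i][k] is the frequency of symbol k in column i."""
    m = len(alignment[0])
    if any(len(s) != m for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    profile = []
    for i in range(m):
        counts = Counter(s[i] for s in alignment)
        profile.append({k: count / n for k, count in counts.items()})
    return profile

alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
profile = build_profile(alignment)
print(profile[5])   # fully conserved column: only A
print(profile[8])   # D in two sequences, E in one
```

Symbols missing from a column simply have no entry, which is equivalent to frequency 0; real profiles usually add pseudocounts so that unseen residues are not ruled out entirely.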
Scoring matrices • Given a sequence s, does it belong to the family described by a profile? • We align the sequence to the profile, and score it. • Let S(i,j) be the score of aligning position i of the profile to residue s_j. • The score of an alignment is the sum of column scores.
Scoring Profiles [Figure: column i of the profile, with frequency f_ki for each symbol k, scored against sequence s using a scoring matrix]
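The formula on this slide is not recoverable from the transcript, so the sketch below assumes one common choice: S(i,j) is the frequency-weighted average of a substitution score, S(i,j) = Σ_k f_ki · M[k, s_j]. The substitution function `toy_sub_score` (+1 for identity, -1 otherwise), the three-column profile, and the function names are all hypothetical. The alignment is ungapped: the profile is slid along the sequence and the best window is reported.

```python
def toy_sub_score(a, b):
    """Hypothetical substitution score M[a, b]: +1 for identity, -1 otherwise."""
    return 1.0 if a == b else -1.0

def column_score(column_freqs, residue, sub_score=toy_sub_score):
    """S(i, j): average substitution score of residue against a profile column."""
    return sum(f * sub_score(k, residue) for k, f in column_freqs.items())

def best_ungapped_score(profile, seq, sub_score=toy_sub_score):
    """Return (score, offset) of the best ungapped placement, or None if seq is too short."""
    m = len(profile)
    best = None
    for offset in range(len(seq) - m + 1):
        total = sum(column_score(profile[i], seq[offset + i], sub_score)
                    for i in range(m))
        if best is None or total > best[0]:
            best = (total, offset)
    return best

# Made-up profile with three columns: A, T, then D (75%) or E (25%).
profile = [{"A": 1.0}, {"T": 1.0}, {"D": 0.75, "E": 0.25}]
print(best_ungapped_score(profile, "GGATDG"))
```

A log-odds score against background frequencies is another frequent choice; either way the alignment score is the sum of the column scores, as stated on the previous slide.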
Domain analysis via profiles • Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences. • What if the sequence matches some other sequences weakly (using BLAST), but does not match any Profile?
Psi-BLAST idea • Iterate: • Find homologs using BLAST on query • Discard very similar homologs • Align, make a profile, search with profile. • Why is this more sensitive?
Psi-BLAST speed • Two time-consuming steps: • Multiple alignment of homologs • Searching with profiles • Does the keyword search idea work? • Pigeonhole principle again: • If a profile of length m must score >= T, then some sub-profile of length l must score >= lT/m. • Generate all l-mers that score at least lT/m. • Search using an automaton. • Multiple alignment: • Use ungapped multiple alignments only.
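The keyword-generation step can be sketched as a brute-force enumeration. Everything concrete here is an assumption for illustration: the four-letter alphabet, the integer score table, the threshold T = 6, and the function names. For each window of l consecutive profile columns, the code lists the l-mers whose score reaches lT/m; those words would then be loaded into a keyword automaton and used to scan the database.

```python
from itertools import product

ALPHABET = "ACDE"   # toy alphabet; a real search would use all 20 amino acids

def column(favored, default=-1):
    """Score column: listed residues get their score, all others the default."""
    return {ch: favored.get(ch, default) for ch in ALPHABET}

def high_scoring_lmers(score_table, l, T):
    """Map each window start to the l-mers scoring at least l*T/m in that window."""
    m = len(score_table)
    threshold = l * T / m
    seeds = {}
    for start in range(m - l + 1):
        window = score_table[start:start + l]
        seeds[start] = [
            "".join(word)
            for word in product(ALPHABET, repeat=l)
            if sum(col[ch] for col, ch in zip(window, word)) >= threshold
        ]
    return seeds

# Hypothetical profile of length m = 4, given as per-column residue scores.
score_table = [column({"A": 2}), column({"C": 2}),
               column({"D": 2, "E": 1}), column({"E": 2})]
seeds = high_scoring_lmers(score_table, l=2, T=6)
print(seeds)
```

Enumerating all |Σ|^l words is only feasible for small l, which is one reason the word length is kept short in keyword-based searches.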
Representation 3: HMMs • Question: • Your ‘friend’ likes to gamble. • He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar. • Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin. • Can you say what fraction of the time he uses the loaded coin?
Representation 3: HMMs • Building good profiles relies upon good alignments. • Difficult if there are gaps in the alignment. • Psi-BLAST/BLOCKS etc. work with gapless alignments. • An HMM representation of Profiles helps put the alignment construction/membership query in a uniform framework.
The generative model • Think of each column in the alignment as generating a distribution. • For each column, build a node that outputs a residue with the appropriate distribution (for example, Pr[F]=0.71, Pr[Y]=0.14).
A simple Profile HMM • Connect nodes for each column into a chain. This chain generates random sequences. • What is the probability of generating FKVVGQVILD? • In this representation • Prob[New sequence S belongs to a family] = Prob[HMM generates sequence S] • What is the difference with Profiles?
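For a gap-free chain, the probability of generating a sequence is the product of the per-column emission probabilities. The alignment behind FKVVGQVILD is not reproduced in this transcript, so the sketch below uses a made-up three-column chain; only Pr[F]=0.71 and Pr[Y]=0.14 in the first column come from the slide, and the remaining numbers and the function names are assumptions.

```python
import math
import random

# Hypothetical chain: one emission distribution per alignment column.
CHAIN = [
    {"F": 0.71, "Y": 0.14, "L": 0.15},
    {"K": 0.5, "R": 0.5},
    {"V": 1.0},
]

def sequence_probability(chain, seq):
    """Pr[chain generates seq]: product of per-column emission probabilities."""
    if len(seq) != len(chain):
        return 0.0          # a gap-free chain only emits sequences of its own length
    return math.prod(col.get(ch, 0.0) for col, ch in zip(chain, seq))

def generate(chain, rng=random):
    """Sample one random sequence by emitting a residue from each node in turn."""
    return "".join(rng.choices(list(col), weights=list(col.values()))[0]
                   for col in chain)

print(sequence_probability(CHAIN, "FKV"))   # 0.71 * 0.5 * 1.0
print(sequence_probability(CHAIN, "AKV"))   # A never occurs in column 1
print(generate(CHAIN))
```

Because every node emits exactly one residue, this chain cannot explain sequences with insertions or deletions.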
Profile HMMs can handle gaps • The match states are the same as on the previous page. • Insertion and deletion states help introduce gaps. • A sequence may be generated using different paths.
Example
A L - L
A I V L
A I - L
• What is the probability that ALIL is part of the family? • Note that multiple paths can generate this sequence: • M1I1M2M3 • M1M2I2M3 • In order to compute the probabilities, we must assign probabilities of transition between states.
Profile HMMs • Directed Automaton M with nodes and edges. • Nodes emit symbols according to ‘emission probabilities’ • Transition from node to node is guided by ‘transition probabilities’ • Joint probability of seeing a sequence S, and path P • Pr[S,P|M] = Pr[S|P,M] Pr[P|M] • Pr[ALIL AND M1I1M2M3] = Pr[ALIL| M1I1M2M3,M] Pr[M1I1M2M3|M] • Pr[ALIL | M] = ?
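The last question has a standard answer: Pr[S | M] is the sum of Pr[S, P | M] over all paths P that emit S. The sketch below works this out for ALIL on a toy model. The state names follow the slides, but every number is hypothetical: the transition probabilities are invented, match emissions are read off the small example alignment (AL-L, AIVL, AI-L), insert states emit uniformly over 20 amino acids, and delete states are left out to keep the example short.

```python
from functools import lru_cache

# Hypothetical transition probabilities (no delete states in this toy model).
TRANSITIONS = {
    "START": {"M1": 1.0},
    "M1": {"M2": 0.8, "I1": 0.2},
    "I1": {"I1": 0.1, "M2": 0.9},
    "M2": {"M3": 0.7, "I2": 0.3},
    "I2": {"I2": 0.1, "M3": 0.9},
    "M3": {"END": 1.0},
}
# Match-state emissions taken from the alignment columns A / L,I,I / L.
MATCH_EMISSIONS = {"M1": {"A": 1.0}, "M2": {"L": 1 / 3, "I": 2 / 3}, "M3": {"L": 1.0}}

def emission(state, residue):
    """Emission probability; insert states are assumed uniform over 20 residues."""
    if state.startswith("I"):
        return 1 / 20
    return MATCH_EMISSIONS[state].get(residue, 0.0)

def joint_probability(seq, path):
    """Pr[S, P | M] for a path given as a list of emitting states."""
    if len(seq) != len(path):
        return 0.0
    prob, prev = 1.0, "START"
    for state, residue in zip(path, seq):
        prob *= TRANSITIONS.get(prev, {}).get(state, 0.0) * emission(state, residue)
        prev = state
    return prob * TRANSITIONS.get(prev, {}).get("END", 0.0)

def total_probability(seq):
    """Pr[S | M]: sum over all paths, computed recursively with memoization."""
    @lru_cache(maxsize=None)
    def rest(state, pos):
        # Probability of emitting seq[pos:] and then reaching END from state.
        total = 0.0
        for nxt, t in TRANSITIONS.get(state, {}).items():
            if nxt == "END":
                if pos == len(seq):
                    total += t
            elif pos < len(seq):
                total += t * emission(nxt, seq[pos]) * rest(nxt, pos + 1)
        return total
    return rest("START", 0)

p1 = joint_probability("ALIL", ["M1", "I1", "M2", "M3"])
p2 = joint_probability("ALIL", ["M1", "M2", "I2", "M3"])
print(p1, p2, total_probability("ALIL"))
```

With these numbers the two paths listed on the earlier slide are the only ones that emit ALIL, so the total equals their sum. The memoized recursion visits each (state, position) pair once, which is the idea behind the forward algorithm.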
Protein structure basics
Side chains determine amino-acid type • The residues may have different properties. • Aspartic acid (D), and Glutamic Acid (E) are acidic residues
Various constraints determine 3D structure • Constraints • Structural constraints due to physicochemical properties • Constraints due to bond angles • H-bond formation • Surprisingly, a few conformations are seen over and over again.
Alpha-helix • 3.6 residues per turn • H-bonds between residue i and residue i+4 stabilize the structure. • First proposed by Linus Pauling
Beta-sheet • Each strand by itself has 2 residues per turn, and is not stable. • Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel. • Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.
Domains • The basic structures (helix, strand, loop) combine to form complex 3D structures. • Certain combinations are popular: there are many sequences, but only a few folds.
3D structure • Predicting tertiary structure is an important problem in Bioinformatics. • Premise: Clues to structure can be found in the sequence. • While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals. • The PDB database is a compendium of structures.
Searching structure databases • Threading, and other 3D alignments can be used to align structures. • Database filtering is possible through geometric hashing.
Trivia Quiz • What research won the Nobel prize in Chemistry in 2004? • In 2002?
Nobel Citation 2002
Nobel Citation, 2002
Mass Spectrometry