Protein Sequence Analysis: Identifying and Characterizing Domain Families

CSE182-L6 Protein sequence analysis CSE182

Possible domain queries • Case 1: • You have a collection of sequences that belong to a family (contain a functional domain). • Given an ‘orphan’ sequence, does it belong to the family? • There are different solutions depending upon the representation of the domain (patterns/alignments/HMM/profiles) • Case 2: • You have an orphan sequence from an uncharacterized family. Can you identify other members of the family, and create a representation of them? (Harder problem.)

EX: Innexins • The Macagno lab is studying gap junction proteins, innexins (invertebrate analogs of connexins), in Hirudo • Innexins have been found in C. elegans and Drosophila. • In C. elegans, 25 members of this family have been found and partially categorized.

Innexins in Hirudo • Knocking out certain innexins causes serious defects in cells in the ganglia. • The EST database (partial gene sequences) contains a number of putative innexins, discovered via BLAST. • Project: • Q: Can you confirm that these are innexins? Can you find more members? (this lecture) • Q: Can you characterize them w.r.t. known innexins in C. elegans and Drosophila? • Q: Can you use your method for other families of interest, such as netrins and their receptors?

Not all features (residues) are important • [Figure: skin patterns; facial features]

Protein sequence motifs • Premise: • The sequence of a protein gives clues about its structure and function. • Not all residues are equally important in determining function. • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. • The key residues can be identified from structural information, or from conserved residues in an alignment of the family.

Representation of domains/families. • We will consider a number of representations that describe key residues, characteristic of a family • Patterns (regular expressions) • Alignments • Profiles • HMMs • Start with the following: • A collection of sequences with the same function. • Region/residues known to be significant for maintaining structure and function. • Develop a pattern of conserved residues around the residues of interest • Iterate for appropriate sensitivity and specificity

From alignment to patterns • Aligned family members: ALRDFATHDDF, SMTAEATHDSI, ECDQAATHEAS (a * on the slide marks a column of interest) • Derived pattern: ATH-[DE] • Search a database with the resulting pattern • Refine pattern to eliminate false positives • Iterate
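The search step can be sketched with Python's `re` module. This is only an illustration: the pattern ATH-[DE] and the three family sequences come from the slide, while the `decoy` entry and the function name `find_pattern_hits` are made up for the example.

```python
import re

# PROSITE-style pattern ATH-[DE] written as a Python regular expression.
PATTERN = re.compile(r"ATH[DE]")

def find_pattern_hits(database):
    """Return (name, 0-based start, matched text) for every occurrence."""
    hits = []
    for name, seq in database.items():
        for m in PATTERN.finditer(seq):
            hits.append((name, m.start(), m.group()))
    return hits

# The three aligned family members from the slide plus one made-up non-member.
database = {
    "seq1": "ALRDFATHDDF",
    "seq2": "SMTAEATHDSI",
    "seq3": "ECDQAATHEAS",
    "decoy": "ALRDFATHKDF",  # hypothetical sequence, not from the slide
}
hits = find_pattern_hits(database)
```

Against a real database, any hit outside the known family would be a candidate false positive and a reason to refine the pattern.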

Regular Expression Patterns • Zinc Finger motif • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H • 2 conserved C, and 2 conserved H • How can we search a database using these motifs? • The motif is described using a regular expression. What is a regular expression?
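One way to search with such a motif is to translate it into the regular-expression syntax of a programming language. The sketch below is a minimal translator that covers only the notation appearing in the zinc finger pattern above (single residues, bracketed sets, `x` wildcards, and repeat counts); the function name `prosite_to_regex` and the test sequence are assumptions for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string.

    Handles only the syntax used on the slide: single residues,
    [..] residue sets, x wildcards, and (n) / (n,m) repeat counts.
    """
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        core = "." if core == "x" else core  # x means "any residue"
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
zinc_regex = prosite_to_regex(ZINC_FINGER)

# A made-up sequence that satisfies the motif, flanked by unrelated residues.
candidate = "GG" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
found = re.search(zinc_regex, candidate) is not None
```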

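Regular Expressions • Concise representation of a set of strings over alphabet Σ. • Described by a string over Σ together with the symbols ε, +, ·, *, and parentheses • R is a r.e. if and only if it has one of these forms: R = {ε}; R = {σ} for some σ ∈ Σ; R = R1 + R2; R = R1 · R2; R = R1*, where R1 and R2 are themselves r.e.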

Regular Expression • Q: Let Σ = {A, C, E} • Is (A+C)*EEC* a regular expression? • Is *(A+C) regular? • Q: When is a string s in a regular expression? • R = (A+C)*EEC* • Is CEEC in R? • AEC? • ACEE?
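The membership questions can be checked mechanically. In the sketch below the slide's union operator `+` is written as `|`, which is how Python's `re` module spells alternation; `in_language` is an illustrative helper name, and `fullmatch` is used because membership concerns the whole string rather than a substring.

```python
import re

# The slide's R = (A+C)*EEC*, with '+' (union) written as '|' in Python.
R = re.compile(r"(A|C)*EEC*")

def in_language(s):
    """True if the whole string s belongs to the set described by R."""
    return R.fullmatch(s) is not None

answers = {s: in_language(s) for s in ["CEEC", "AEC", "ACEE"]}
```

AEC is rejected because every string in R must contain two consecutive E's.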

Regular Expression & Automata • Every R.E. can be expressed by an automaton (a directed graph) with the following properties: • The automaton has a start and end node • Each edge is labeled with a symbol from Σ, or ε • Suppose R is described by automaton A • s ∈ R if and only if there is a path from start to end in A, labeled with s.

Examples: Regular Expression & Automata • R = (A+C)*EEC* • [Figure: automaton from start to end with edges labeled A, C, E, E, C] • Is CEEC in R? • AEC? • ACEE? • ACE?

Constructing automata from R.E. • R = {ε} • R = {σ}, σ ∈ Σ • R = R1 + R2 • R = R1 · R2 • R = R1* • [Figures: the automaton for each case]

Regular Expression Matching • Given a database D, and a regular expression R, is a substring of D in R? • Is there a string D[l..c] that is accepted by the automaton of R? • Simpler Q: Is D[1..c] accepted by the automaton of R?

Alg. for matching R.E. • If D[1..c] is accepted by the automaton RA, then: • There is a path labeled D[1]…D[c] that goes from START to END in RA • [Figure: path with edges D[1], D[2], …, D[c]]

Alg. for matching R.E. • If D[1..c] is accepted by the automaton RA, then: • There is a path labeled D[1]…D[c] that goes from START to END in RA • Equivalently, there is a path labeled D[1]..D[c-1] from START to some node u, and a path labeled D[c] from u to END • [Figure: path D[1] .. D[c-1] into node u, then edge D[c]]

D.P. to match regular expression • Define: • A[u,σ] = automaton node reached from u after reading σ • Eps(u): set of all nodes reachable from node u using epsilon transitions. • N[c] = subset of nodes reachable from START node after reading D[1..c] • Q: when is v ∈ N[c]? • [Figures: edge labeled σ from u to v; node u with its set Eps(u)]

D.P. to match regular expression • Q: when is v ∈ N[c]? • A: If for some u ∈ N[c-1], w = A[u,D[c]], • v ∈ {w} ∪ Eps(w)

Algorithm
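The slide's own pseudocode is not preserved in this transcript, so the following is a sketch of the recurrence N[c] = ⋃ over u ∈ N[c-1] of ({w} ∪ Eps(w)) with w = A[u, D[c]]. The automaton is hand-built for R = (A+C)*EEC*; the node names, the `EDGES` dictionary layout, and the use of sets for A[u,σ] (to allow nondeterminism) are all assumptions of this example.

```python
EPS = ""  # label used here for epsilon edges (a representation choice)

# Hand-built automaton for R = (A+C)*EEC*; node names are illustrative.
EDGES = {
    ("START", "A"): {"START"},
    ("START", "C"): {"START"},
    ("START", "E"): {"q1"},
    ("q1", "E"): {"q2"},
    ("q2", "C"): {"q2"},
    ("q2", EPS): {"END"},
}

def eps_closure(nodes):
    """Return nodes plus everything reachable from them via epsilon edges."""
    closure, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in EDGES.get((u, EPS), ()):
            if v not in closure:
                closure.add(v)
                stack.append(v)
    return closure

def accepts_prefix(d):
    """True if the whole string d (that is, D[1..c] with c = len(d)) is accepted."""
    n = eps_closure({"START"})      # N[0]
    for symbol in d:                # compute N[c] from N[c-1]
        step = set()
        for u in n:
            step |= EDGES.get((u, symbol), set())
        n = eps_closure(step)
    return "END" in n

results = {s: accepts_prefix(s) for s in ["CEEC", "AEC", "ACEE", "ACE"]}
```

Each step touches every node in N[c-1] once, so the running time is proportional to the length of D times the size of the automaton.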

The final step • We have answered the question: • Is D[1..c] accepted by R? • Yes, if END ∈ N[c] • We need to answer: • Is D[l..c] (for some l, and some c) accepted by R?
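One standard way to handle an arbitrary start position l is to put START back into the node set before reading each character, so a match may begin anywhere. The slide does not say which method the lecture used, so treat this as an assumed technique. The sketch uses a small automaton for the earlier pattern ATH-[DE] with no epsilon edges; node names and the function name are illustrative.

```python
# Automaton for the earlier pattern ATH-[DE]; node names are illustrative.
EDGES = {
    ("START", "A"): {"a"},
    ("a", "T"): {"t"},
    ("t", "H"): {"h"},
    ("h", "D"): {"END"},
    ("h", "E"): {"END"},
}

def match_end_positions(d):
    """Return every 1-based c such that some substring D[l..c] is accepted."""
    ends = []
    n = set()
    for c, symbol in enumerate(d, start=1):
        n.add("START")  # allow a match to begin at position c
        n = {v for u in n for v in EDGES.get((u, symbol), ())}
        if "END" in n:
            ends.append(c)
    return ends

ends = match_end_positions("ALRDFATHDDF")
```

An automaton with epsilon edges would additionally need the Eps closure applied after each step.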

Representation 2: Profiles • Profiles versus regular expressions • Regular expressions are intolerant to an occasional mis-match. • The Union operation (I+V+L) does not quantify the relative importance of I,V,L. It could be that V occurs in 80% of the family members. • Profiles capture some of these ideas.

Profiles • Start with an alignment of strings of length m, over an alphabet A, • Build an |A| × m matrix F = (fki) • Each entry fki represents the frequency of symbol k in position i • [Figure: example column frequencies 0.71, 0.14, 0.28, 0.14]
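A minimal sketch of building F from an ungapped alignment follows. The slide's own example alignment is not preserved, so the three sequences from the earlier pattern slide are reused. Storing each column as a dictionary of the symbols that actually occur (rather than a dense |A| × m matrix) is a choice made here for brevity.

```python
from collections import Counter

def build_profile(alignment):
    """Return a list of per-column dicts mapping symbol k to frequency f_ki."""
    m = len(alignment[0])
    if any(len(row) != m for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(m):
        counts = Counter(row[i] for row in alignment)
        profile.append({k: n / len(alignment) for k, n in counts.items()})
    return profile

alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
profile = build_profile(alignment)
```

Here column 5 (0-based) is fully conserved as A, while column 8 holds D with frequency 2/3 and E with frequency 1/3, which matches the [DE] position of the pattern.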

Scoring matrices • Given a sequence s, does it belong to the family described by a profile? • We align the sequence to the profile, and score it • Let S(i,j) be the score of aligning position i of the profile to residue sj • The score of an alignment is the sum of column scores. • [Figure: profile column i aligned to residue sj of s]

Scoring Profiles • [Figure: scoring matrix; labels i, k, fki, s]
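The slide's scoring formula is not recoverable from the transcript. One common choice, assumed here, is a log-odds score S(i,j) = log(f'(sj, i) / p(sj)), where f' is the column frequency smoothed with a pseudocount and p is a background frequency. The uniform background of 1/20, the pseudocount value, and the 4-column profile are all illustrative assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_scores(column_freqs, pseudocount=0.01, background=1 / 20):
    """Log-odds score for each residue in one profile column."""
    total = 1.0 + pseudocount * len(AMINO_ACIDS)
    return {
        a: math.log((column_freqs.get(a, 0.0) + pseudocount) / total / background)
        for a in AMINO_ACIDS
    }

def best_window(profile, s):
    """Slide the profile along s without gaps; return (best score, start)."""
    scores = [column_scores(col) for col in profile]
    m = len(profile)
    best = (-math.inf, -1)
    for start in range(len(s) - m + 1):
        total = sum(scores[i][s[start + i]] for i in range(m))
        best = max(best, (total, start))
    return best

# Hypothetical 4-column profile for the ATH-[DE] motif.
profile = [{"A": 1.0}, {"T": 1.0}, {"H": 1.0}, {"D": 2 / 3, "E": 1 / 3}]
score, start = best_window(profile, "SMTAEATHDSI")
```

The pseudocount keeps unseen residues from scoring minus infinity, so a sequence with an occasional mismatch can still score well overall. The sketch assumes s contains only the 20 standard residue letters.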

Domain analysis via profiles • Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences. • What if the sequence matches some other sequences weakly (using BLAST), but does not match any Profile?

Psi-BLAST idea • Iterate: • Find homologs using BLAST on query • Discard very similar homologs • Align, make a profile, search with profile. • Why is this more sensitive? • [Figure: Seq, Db]

Psi-BLAST speed • Two time consuming steps: • Multiple alignment of homologs • Searching with profiles • Does the keyword search idea work? • Multiple alignment: • Use ungapped multiple alignments only • Searching with profiles, pigeonhole principle again: • If a profile of length m must score >= T • Then some sub-profile of length l must score >= lT/m • Generate all l-mers that score at least lT/m • Search using an automaton
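The l-mer generation step can be sketched by brute force. The function below lists, for each length-l window of a profile, every word scoring at least lT/m; these words would then be loaded into a keyword automaton for the database scan. The score values, the 3-letter alphabet, and the function name are made up for illustration, and exhaustive enumeration is only practical for small l.

```python
from itertools import product

def high_scoring_lmers(score_columns, l, threshold):
    """For each profile window of length l, list l-mers scoring >= l*T/m.

    score_columns[i][a] is the score of residue a in profile column i.
    Brute-force enumeration; only sensible for small l and alphabets.
    """
    m = len(score_columns)
    cutoff = l * threshold / m
    alphabet = sorted(score_columns[0])
    seeds = {}
    for start in range(m - l + 1):
        window = score_columns[start:start + l]
        seeds[start] = [
            "".join(word)
            for word in product(alphabet, repeat=l)
            if sum(col[a] for col, a in zip(window, word)) >= cutoff
        ]
    return seeds

# Toy scores over a 3-letter alphabet (made-up numbers for illustration).
score_columns = [
    {"A": 2, "T": -1, "H": -1},
    {"A": -1, "T": 2, "H": -1},
    {"A": -1, "T": -1, "H": 2},
    {"A": 1, "T": 0, "H": -1},
]
seeds = high_scoring_lmers(score_columns, l=2, threshold=6)
```

With m = 4, l = 2 and T = 6 the cutoff is 3, and only one word per window survives, which is what makes the keyword search fast.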

Representation 3: HMMs • Question: • your ‘friend’ likes to gamble. • He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar. • Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin. • Can you say what fraction of the times he loads the coin?

Representation 3: HMMs • Building good profiles relies upon good alignments. • Difficult if there are gaps in the alignment. • Psi-BLAST/BLOCKS etc. work with gapless alignments. • An HMM representation of Profiles helps put the alignment construction/membership query in a uniform framework.

The generative model • Think of each column in the alignment as generating a distribution. • For each column, build a node that outputs a residue with the appropriate distribution • [Figure: example node with Pr[F]=0.71, Pr[Y]=0.14]

A simple Profile HMM • Connect nodes for each column into a chain. The chain generates random sequences. • What is the probability of generating FKVVGQVILD? • In this representation • Prob[New sequence S belongs to a family] = Prob[HMM generates sequence S] • What is the difference with Profiles?
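For a gapless chain, the probability of generating a sequence is the product of the per-column emission probabilities. The slide's alignment (the one that yields FKVVGQVILD) is not preserved, so the sketch below uses a made-up 4-column chain for the ATH-[DE] motif; `chain_probability` is an illustrative name.

```python
import math

def chain_probability(emissions, s):
    """Pr[chain HMM generates s]: product of per-column emission probabilities."""
    if len(s) != len(emissions):
        return 0.0  # a gapless chain only emits sequences of length m
    return math.prod(col.get(residue, 0.0) for col, residue in zip(emissions, s))

# Made-up emission distributions, one dict per column of the chain.
emissions = [
    {"A": 1.0},
    {"T": 1.0},
    {"H": 1.0},
    {"D": 2 / 3, "E": 1 / 3},
]
p_athd = chain_probability(emissions, "ATHD")
p_athk = chain_probability(emissions, "ATHK")
```

Without pseudocounts, a single unseen residue (K in the last column) drives the probability to zero, which is one reason practical profiles smooth their frequencies.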

Profile HMMs can handle gaps • The match states are the same as on the previous page. • Insertion and deletion states help introduce gaps. • A sequence may be generated using different paths.

Example • Alignment (three rows): A L - L / A I V L / A I - L • What is the probability that ALIL is part of the family? • Note that multiple paths can generate this sequence. • M1I1M2M3 • M1M2I2M3 • In order to compute the probabilities, we must assign probabilities of transition between states

Profile HMMs • Directed automaton M with nodes and edges. • Nodes emit symbols according to ‘emission probabilities’ • Transition from node to node is guided by ‘transition probabilities’ • Joint probability of seeing a sequence S, and path P • Pr[S,P|M] = Pr[S|P,M] Pr[P|M] • Pr[ALIL AND M1I1M2M3 | M] = Pr[ALIL | M1I1M2M3,M] Pr[M1I1M2M3 | M] • Pr[ALIL | M] = ?
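Pr[S | M] is the sum of Pr[S,P | M] over all paths P, and it can be computed without enumerating paths by the forward dynamic program. The sketch below makes several assumptions: the transition and emission probabilities are invented, delete states are omitted to keep it short, and insert states emit uniformly over a 4-letter alphabet.

```python
# Made-up transition probabilities between states (no delete states).
TRANS = {
    ("BEGIN", "M1"): 1.0,
    ("M1", "M2"): 0.8, ("M1", "I1"): 0.2,
    ("I1", "I1"): 0.1, ("I1", "M2"): 0.9,
    ("M2", "M3"): 0.7, ("M2", "I2"): 0.3,
    ("I2", "I2"): 0.1, ("I2", "M3"): 0.9,
    ("M3", "END"): 1.0,
}
UNIFORM = {"A": 0.25, "L": 0.25, "I": 0.25, "V": 0.25}
# Made-up emission probabilities for each emitting state.
EMIT = {
    "M1": {"A": 1.0},
    "I1": UNIFORM,
    "M2": {"I": 0.6, "L": 0.4},
    "I2": UNIFORM,
    "M3": {"L": 1.0},
}

def forward_probability(s, trans, emit):
    """Pr[s | M] for a non-empty s: sum over all state paths, by DP."""
    states = list(emit)
    # f[k] = probability of emitting the prefix read so far and ending in state k
    f = {k: trans.get(("BEGIN", k), 0.0) * emit[k].get(s[0], 0.0) for k in states}
    for symbol in s[1:]:
        f = {
            k: emit[k].get(symbol, 0.0)
            * sum(f[j] * trans.get((j, k), 0.0) for j in states)
            for k in states
        }
    return sum(f[k] * trans.get((k, "END"), 0.0) for k in states)

p_alil = forward_probability("ALIL", TRANS, EMIT)
```

With these numbers the path M1I1M2M3 contributes 0.126 × 0.15 = 0.0189 and M1M2I2M3 contributes 0.216 × 0.1 = 0.0216, so the forward sum is 0.0405.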

Protein structure basics

Side chains determine amino-acid type • The residues may have different properties. • Aspartic acid (D), and Glutamic Acid (E) are acidic residues

Bond angles form structural constraints

Various constraints determine 3d structure • Constraints • Structural constraints due to physicochemical properties • Constraints due to bond angles • H-bond formation • Surprisingly, a few conformations are seen over and over again.

Alpha-helix • 3.6 residues per turn • H-bonds between residue i and residue i+4 stabilize the structure. • First proposed by Linus Pauling and co-workers

Beta-sheet • Each strand by itself has 2 residues per turn, and is not stable. • Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel. • Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.

Domains • The basic structures (helix, strand, loop) combine to form complex 3D structures. • Certain combinations are popular. Many sequences, but only a few folds

3D structure • Predicting tertiary structure is an important problem in Bioinformatics. • Premise: Clues to structure can be found in the sequence. • While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals. • The PDB database is a compendium of structures

Searching structure databases • Threading, and other 3d Alignments can be used to align structures. • Database filtering is possible through geometric hashing.

Trivia Quiz • What research won the Nobel prize in Chemistry in 2004? • In 2002?

How are Proteins Sequenced? Mass Spec 101:

Nobel Citation 2002

Nobel Citation, 2002

Mass Spectrometry
