
Bioinformatics The Prediction of Life

Bioinformatics The Prediction of Life. Tony C Smith Department of Computer Science University of Waikato tcs@cs.waikato.ac.nz. Bioinformatics. Computation with biological data Data: genes, proteins, microarrays, mass spectra, written documents, populations of organisms …


Presentation Transcript


  1. Bioinformatics: The Prediction of Life. Tony C Smith, Department of Computer Science, University of Waikato, tcs@cs.waikato.ac.nz

  2. Bioinformatics Computation with biological data Data: genes, proteins, microarrays, mass spectra, written documents, populations of organisms … Goal: knowledge discovery Bioinformatics Tony C Smith

  3. The essence is prediction … My dog is very littl_ ?
     • We know that letters do not occur in English at random; not all letters are equally common (e.g. ‘e’ is more common than ‘x’)
     • We know that context changes the probability of a letter (e.g. what’s the most likely letter after the sequence “I eat Weet-Bi_”?)
     • Prediction is important in many applications (e.g. encryption, compression, communication, graphics, simulation … and bioinformatics!)
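The idea that context changes the probability of the next letter can be sketched with a tiny context model. This is only an illustration, not the method used in the talk: the training sentence, the context length of three characters and the function names `train` and `predict_next` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(text, order=3):
    """Count which character follows each context of `order` characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model


def predict_next(model, prefix, order=3):
    """Return the most frequent character seen after the last `order` characters."""
    counts = model.get(prefix[-order:])
    if not counts:
        return None  # this context never occurred in the training text
    return counts.most_common(1)[0][0]


# Toy training text (an assumption; any English sample would do).
model = train("the little dog ate a little bit")
print(predict_next(model, "My dog is very littl"))  # prints "e"
```

A real predictor would be trained on far more text and would usually report a probability for every candidate letter rather than a single guess.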

  4. Prediction in bioinformatics
     • Predicting the location of genes in DNA
     • Predicting the function of proteins
     • Predicting diseases from molecular samples
     • Predicting population dynamics
     • Anything that involves “making a judgment”, typically expressible as a yes/no decision about some sample datum

  5. Representation: W e e t – B i x = 0101011101100101011001010111010000101101 … To the computer, everything is binary!
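The bit string on this slide can be checked directly. The sketch below assumes plain 8-bit ASCII codes (the slide does not name an encoding) and uses an ASCII hyphen in place of the dash; the helper names `to_bits` and `from_bits` are made up for the example.

```python
def to_bits(text):
    """Concatenate the 8-bit binary code of every ASCII character."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


def from_bits(bits):
    """Decode a string of 0s and 1s, eight bits per character."""
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")


slide_bits = "0101011101100101011001010111010000101101"
print(from_bits(slide_bits))  # prints "Weet-"
```

The 40 bits shown cover only the first five characters; the rest of the word continues past the ellipsis.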

  6. 0101011101100101011001010111010000101101
     0101101100100111111011010011010000101101
     A A C G T C A T T C G A T G A T T C G A
     Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, such as a genetic sequence.
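Because DNA has only four letters, each nucleotide fits in two bits, so the 20 bases on this slide need 40 bits. The mapping below (A=00, C=01, G=10, T=11) is an assumption chosen for illustration; the slide does not say which code, if any, its second bit string follows.

```python
# Hypothetical 2-bit code for the four nucleotides.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}


def encode_dna(seq):
    """Pack a nucleotide string into a string of bits, two per base."""
    return "".join(CODE[base] for base in seq.upper())


def decode_dna(bits):
    """Recover the nucleotide string from its 2-bit encoding."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


slide_seq = "AACGTCATTCGATGATTCGA"
encoded = encode_dna(slide_seq)
print(len(encoded))                      # prints 40
print(decode_dna(encoded) == slide_seq)  # prints True
```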

  7. A genetic prediction problem ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagc

  8. A genetic prediction problem (the slide 7 sequence, continued and repeated until it fills the slide)

  9. A genetic prediction problem (the same sequence again, repeated at still greater length)

  10. A genetic prediction problem
      • A gene encodes a protein
      • It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism

  11. A genetic prediction problem (diagram labels: transcription factor, RNA, untranslated region, encoding region)

  12. A genetic prediction problem (diagram label: untranslated region)

  13. A genetic prediction problem ttgcaatcggcgctacgcttcaaaatttattatattcccggc (untranslated region)

  14. A genetic prediction problem ttgcaatcggcgctacgcttcaaaatttattatattcccggc
      What transcription factors bind to this gene?
      Where is the transcription factor binding site?

  15. A genetic prediction problem ttgcaatcggcgctacgcttcaaaatttattatattcccggc
      Clues: A binding site is often a short general pattern, e.g. CCGATNATCGG (N stands for any nucleotide)
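A pattern such as CCGATNATCGG can be searched for with a regular expression once N is read as "any nucleotide". This is a minimal sketch rather than a real binding-site finder: the function name `find_sites` and the second test sequence are invented for the example, and only the N wildcard is handled.

```python
import re

# Each pattern letter becomes a regex fragment; N matches any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}


def find_sites(pattern, sequence):
    """Return the start index of every (possibly overlapping) match."""
    body = "".join(IUPAC[ch] for ch in pattern.upper())
    regex = re.compile("(?=(" + body + "))")  # lookahead keeps overlaps
    return [m.start() for m in regex.finditer(sequence.upper())]


slide_seq = "ttgcaatcggcgctacgcttcaaaatttattatattcccggc"
print(find_sites("CCGATNATCGG", slide_seq))          # prints []
print(find_sites("CCGATNATCGG", "aaCCGATTATCGGtt"))  # prints [2]
```

The short untranslated region on the slide contains no exact copy of the example pattern, which is one reason real searches tolerate mismatches and score candidate sites statistically.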

  16. A genetic prediction problem ttgcaatcggcgctacgcttcaaaatttattatattcccggc
      Clues: The patterns are often reverse complements, e.g. CCGATNATCGG / GGCTANTAGCC (the second string is the complementary strand; read in reverse it spells CCGATNATCGG again)
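The reverse-complement property is easy to verify in code. The sketch below uses the usual base pairing (A with T, C with G) and leaves the wildcard N unchanged; the function names are assumptions made for the example.

```python
# Translation table for base pairing; N (any base) pairs with N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def complement(seq):
    """Return the base-by-base complement of a nucleotide string."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]


site = "CCGATNATCGG"
print(complement(site))          # prints "GGCTANTAGCC"
print(reverse_complement(site))  # prints "CCGATNATCGG"
```

Because this pattern equals its own reverse complement, the same site appears on both strands of the DNA.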

  17. A genetic prediction problem ttgcaatcggcgctacgcttcaaaatttattatattcccggc Clues: Where there is one binding site, often there is another nearby.

  18. A genetic prediction problem: All of these properties are the kinds of things that computer science has developed algorithms and data structures to identify quickly and efficiently, so this is exactly the kind of problem computer scientists should be able to solve.

  19. Proteomics
      Three consecutive nucleotides in the coding region form a ‘codon’, i.e. encode an amino acid. A string of amino acids makes a protein.
      3 nucleotides, 4 possibilities for each, so 4^3 = 64 possible codons.
      But there are only 20 amino acids!

  20. Proteomics
      There is quite a bit of redundancy in codons.
      Glycine: GGA, GGC, GGG, GGT
      Tyrosine: TAT, TAC
      Methionine: ATG
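The codon arithmetic and the redundancy can both be reproduced from the standard genetic code. The sketch below writes that table in its usual compact form (bases in T, C, A, G order, `*` marking a stop codon); the variable names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group the codons by the amino acid they encode.
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

print(len(CODON_TABLE))                        # prints 64
print(len([aa for aa in codons_for if aa != "*"]))  # prints 20
print(sorted(codons_for["G"]))  # glycine: ['GGA', 'GGC', 'GGG', 'GGT']
print(sorted(codons_for["Y"]))  # tyrosine: ['TAC', 'TAT']
print(codons_for["M"])          # methionine: ['ATG']
```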

  21. Amino Acid (diagram labels: R group, amino group, carboxyl group)

  22. Amino Acid (diagrams: tyrosine, glycine)

  23. Primary structure: MSALVSTTPSLLAGVRNVDB …..

  24. Tertiary Structure

  25. Secondary Structure

  26. Signal peptide
      • A relatively short sequence of amino acid residues at the N-terminus of the nascent protein, typically 15-50 residues, e.g. MAGPRPSPWARLLLAALISVSLSGTLARCKKAPVSKKCETCVGQAALTGL …
      • Cleaved off as the protein passes through the membrane (operates like a pass key)
      • Knowing the signal peptide helps determine protein function in the organism

  27. How do we do it? See any patterns? ttgcaatcggcgctacgcttcaaaatttattatattcccggc… (the sequence continues until it fills the slide)

  28. Local biases in residues around the cleavage site. Sequence regularities can be exploited by statistical and pattern-based models.

  29. Proteomic prediction
      Language:
      • letters combine to form words
      • words combine to form phrases
      • phrases combine to form sentences
      • sentences combine to form paragraphs (and ultimately Harry Potter books)
      Proteins:
      • amino acids combine to form peptides
      • peptides combine to form secondary motifs (e.g. α-helices and β-sheets)
      • motifs combine to make proteins
      • proteins combine to make toenails (and ultimately people)

  30. Approach
      • The problem is stated as two-class: an amino acid is either the first residue of the mature protein or it is not
      • Each residue is described by a single document, which includes as many electrochemical, structural or contextual facts as are available (or desirable)

  31. Properties of amino acids

  32. Residue as a document. E.g. Cysteine (Cys, C): aliphatic [yes], aromatic [no], hydrophobic [yes], charge [-], polarized [yes], small [no], number of nitrogen atoms [1], contains sulphur [yes], has a carbon ring [no], ionized [yes], valence [2], cbeta [no], covalent [yes], h-bond [yes], etc. (whatever else the experimenter wants to include)

  33. Sample document
      PRNUM:1. AANUM:21.
      AMINO[-8]:L. ALIPH[-8]:-. AROMA[-8]:-. CBETA[-8]:-. CHARG[-8]:-. COVAL[-8]:-. HBOND[-8]:-. HPHOB[-8]:+. IONIZ[-8]:-. NITRO[-8]:1. POLAR[-8]:-. POSNG[-8]:0. SMALL[-8]:-. SULPH[-8]:-. TEENY[-8]:-. CRING[-8]:-. VALEN[-8]:2.
      AMINO[-7]:L. ALIPH[-7]:-. AROMA[-7]:-. CBETA[-7]:-. CHARG[-7]:-. COVAL[-7]:-. HBOND[-7]:-. HPHOB[-7]:+. IONIZ[-7]:-. NITRO[-7]:1. POLAR[-7]:-. POSNG[-7]:0. SMALL[-7]:-. SULPH[-7]:-. TEENY[-7]:-. CRING[-7]:-. VALEN[-7]:2.
      AMINO[-6]:F. ALIPH[-6]:+. AROMA[-6]:+. CBETA[-6]:-. CHARG[-6]:-. COVAL[-6]:-. HBOND[-6]:-. HPHOB[-6]:+. IONIZ[-6]:-. NITRO[-6]:1. POLAR[-6]:-. POSNG[-6]:0. SMALL[-6]:-. SULPH[-6]:-. TEENY[-6]:-. CRING[-6]:+. VALEN[-6]:2.
      AMINO[-5]:A. ALIPH[-5]:-. AROMA[-5]:-. CBETA[-5]:-. CHARG[-5]:-. COVAL[-5]:-. HBOND[-5]:-. HPHOB[-5]:-. IONIZ[-5]:-. NITRO[-5]:1. POLAR[-5]:-. POSNG[-5]:0. SMALL[-5]:+. SULPH[-5]:-. TEENY[-5]:+. CRING[-5]:-. VALEN[-5]:2.
      AMINO[-4]:T. ALIPH[-4]:+. AROMA[-4]:-. CBETA[-4]:+. CHARG[-4]:-. COVAL[-4]:-. HBOND[-4]:+. HPHOB[-4]:-. IONIZ[-4]:-. NITRO[-4]:1. POLAR[-4]:+. POSNG[-4]:0. SMALL[-4]:+. SULPH[-4]:-. TEENY[-4]:-. CRING[-4]:-. VALEN[-4]:2.
      AMINO[-3]:C. ALIPH[-3]:+. AROMA[-3]:-. CBETA[-3]:-. CHARG[-3]:-. COVAL[-3]:+. HBOND[-3]:+. HPHOB[-3]:+. IONIZ[-3]:+. NITRO[-3]:1. POLAR[-3]:+. POSNG[-3]:-. SMALL[-3]:-. SULPH[-3]:+. TEENY[-3]:-. CRING[-3]:-. VALEN[-3]:2.
      AMINO[-2]:I. ALIPH[-2]:-. AROMA[-2]:-. CBETA[-2]:+. CHARG[-2]:-. COVAL[-2]:-. HBOND[-2]:-. HPHOB[-2]:+. IONIZ[-2]:-. NITRO[-2]:1. POLAR[-2]:-. POSNG[-2]:0. SMALL[-2]:-. SULPH[-2]:-. TEENY[-2]:-. CRING[-2]:-. VALEN[-2]:2.
      AMINO[-1]:A. ALIPH[-1]:-. AROMA[-1]:-. CBETA[-1]:-. CHARG[-1]:-. COVAL[-1]:-. HBOND[-1]:-. HPHOB[-1]:-. IONIZ[-1]:-. NITRO[-1]:1. POLAR[-1]:-. POSNG[-1]:0. SMALL[-1]:+. SULPH[-1]:-. TEENY[-1]:+. CRING[-1]:-. VALEN[-1]:2.
      AMINO[0]:R. ALIPH[0]:+. AROMA[0]:-. CBETA[0]:-. CHARG[0]:+. COVAL[0]:-. HBOND[0]:+. HPHOB[0]:-. IONIZ[0]:+. NITRO[0]:4. POLAR[0]:+. POSNG[0]:+. SMALL[0]:-. SULPH[0]:-. TEENY[0]:-. CRING[0]:-. VALEN[0]:3.
      AMINO[1]:H. ALIPH[1]:+. AROMA[1]:+. CBETA[1]:-. CHARG[1]:+. COVAL[1]:-. HBOND[1]:+. HPHOB[1]:-. IONIZ[1]:+. NITRO[1]:3. POLAR[1]:+. POSNG[1]:+. SMALL[1]:-. SULPH[1]:-. TEENY[1]:-. CRING[1]:+. VALEN[1]:3.
      AMINO[2]:Q. ALIPH[2]:+. AROMA[2]:-. CBETA[2]:-. CHARG[2]:-. COVAL[2]:-. HBOND[2]:+. HPHOB[2]:-. IONIZ[2]:-. NITRO[2]:2. POLAR[2]:+. POSNG[2]:0. SMALL[2]:-. SULPH[2]:-. TEENY[2]:-. CRING[2]:-. VALEN[2]:2.
      AMINO[3]:Q. ALIPH[3]:+. AROMA[3]:-. CBETA[3]:-. CHARG[3]:-. COVAL[3]:-. HBOND[3]:+. HPHOB[3]:-. IONIZ[3]:-. NITRO[3]:2. POLAR[3]:+. POSNG[3]:0. SMALL[3]:-. SULPH[3]:-. TEENY[3]:-. CRING[3]:-. VALEN[3]:2.
      AMINO[4]:R. ALIPH[4]:+. AROMA[4]:-. CBETA[4]:-. CHARG[4]:+. COVAL[4]:-. HBOND[4]:+. HPHOB[4]:-. IONIZ[4]:+. NITRO[4]:4. POLAR[4]:+. POSNG[4]:+. SMALL[4]:-. SULPH[4]:-. TEENY[4]:-. CRING[4]:-. VALEN[4]:3.
      AMINO[5]:Q. ALIPH[5]:+. AROMA[5]:-. CBETA[5]:-. CHARG[5]:-. COVAL[5]:-. HBOND[5]:+. HPHOB[5]:-. IONIZ[5]:-. NITRO[5]:2. POLAR[5]:+. POSNG[5]:0. SMALL[5]:-. SULPH[5]:-. TEENY[5]:-. CRING[5]:-. VALEN[5]:2.
      AMINO[6]:Q. ALIPH[6]:+. AROMA[6]:-. CBETA[6]:-. CHARG[6]:-. COVAL[6]:-. HBOND[6]:+. HPHOB[6]:-. IONIZ[6]:-. NITRO[6]:2. POLAR[6]:+. POSNG[6]:0. SMALL[6]:-. SULPH[6]:-. TEENY[6]:-. CRING[6]:-. VALEN[6]:2.
      AMINO[7]:Q. ALIPH[7]:+. AROMA[7]:-. CBETA[7]:-. CHARG[7]:-. COVAL[7]:-. HBOND[7]:+. HPHOB[7]:-. IONIZ[7]:-. NITRO[7]:2. POLAR[7]:+. POSNG[7]:0. SMALL[7]:-. SULPH[7]:-. TEENY[7]:-. CRING[7]:-. VALEN[7]:2.
      AMINO[8]:Q. ALIPH[8]:+. AROMA[8]:-. CBETA[8]:-. CHARG[8]:-. COVAL[8]:-. HBOND[8]:+. HPHOB[8]:-. IONIZ[8]:-. NITRO[8]:2. POLAR[8]:+. POSNG[8]:0. SMALL[8]:-. SULPH[8]:-. TEENY[8]:-. CRING[8]:-. VALEN[8]:2.
      MULT3:7. MULT5:4. MULT7:3. MULT9:2. 2GRAM:IA. GRAM2:HQ. 3GRAM:CIA. GRAM3:HQQ.
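Part of such a document can be generated mechanically from a window of residues. The sketch below is a guess at the layout, inferred only from the sample above: it emits the AMINO fields for offsets -8 to +8 and the n-gram fields on either side of position 0, and it omits the property fields (ALIPH, HPHOB and so on), PRNUM/AANUM and the MULT fields. The function name `residue_document` is hypothetical.

```python
def residue_document(seq, pos, half_window=8):
    """Describe the residue at `pos` by its neighbours, one field per token."""
    fields = []
    # One AMINO field per position in the window around the candidate residue.
    for offset in range(-half_window, half_window + 1):
        i = pos + offset
        if 0 <= i < len(seq):
            fields.append(f"AMINO[{offset}]:{seq[i]}.")
    # n-grams ending just before position 0 and starting just after it.
    for n in (2, 3):
        fields.append(f"{n}GRAM:{seq[max(0, pos - n):pos]}.")
        fields.append(f"GRAM{n}:{seq[pos + 1:pos + 1 + n]}.")
    return " ".join(fields)


# The 17 residues at offsets -8..8 in the sample document.
window = "LLFATCIARHQQRQQQQ"
doc = residue_document(window, 8)
print(doc.split()[-4:])  # prints ['2GRAM:IA.', 'GRAM2:HQ.', '3GRAM:CIA.', 'GRAM3:HQQ.']
```

Treating each residue as a bag of such tokens lets ordinary text-classification tools be applied to the cleavage-site problem.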

  34. Artificial Intelligence • Computers do things only human brains can otherwise do (diagram labels: expert, expert)

  35. Artificial Intelligence • Computers do things only human brains can otherwise do (diagram labels: expert system, expert)

  36. Artificial Intelligence • Computers do things only human brains can otherwise do (diagram labels: expert system, learning system)

  37. Machine learning
      What is machine learning?
      • creating computer programs that get better with experience
      • learn how to make expert judgments
      • discover previously hidden, potentially useful information (data mining)
      How does it work?
      • user provides learning system with examples of concept to be learned
      • induction algorithm infers a characteristic model of the examples
      • model is used to predict whether or not future novel instances are also examples, and it does this very consistently and very, very quickly!
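The three steps (examples in, model inferred, novel instances classified) can be sketched with a small naive Bayes classifier over token documents. Naive Bayes is chosen here only because it is short; the talk does not say which induction algorithm was used, and the four training examples are invented toy data.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """Infer per-class token counts from (tokens, label) examples."""
    class_counts, token_counts, vocab = Counter(), {}, set()
    for tokens, label in examples:
        class_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict(model, tokens):
    """Return the label with the highest smoothed log-probability."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            # Add-one smoothing so unseen tokens do not zero the score.
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy examples: True means "first residue of the mature protein".
examples = [
    (["AMINO[-1]:A", "SMALL[-1]:+", "AMINO[0]:R"], True),
    (["AMINO[-1]:A", "SMALL[-1]:+", "AMINO[0]:Q"], True),
    (["AMINO[-1]:L", "SMALL[-1]:-", "AMINO[0]:L"], False),
    (["AMINO[-1]:F", "SMALL[-1]:-", "AMINO[0]:A"], False),
]
model = train_naive_bayes(examples)
print(predict(model, ["AMINO[-1]:A", "SMALL[-1]:+", "AMINO[0]:H"]))  # prints True
```

With real data the examples would be thousands of residue documents, and the model would be evaluated on held-out proteins before being trusted.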

  38. Bioinformatics
      • Biologists know proteins, computer scientists know machine learning
      • Together, they can find hidden and potentially useful information about genes and proteins
      • Biotechnology is a multi-billion dollar industry
      • Biotechnology is one of the best funded areas of scientific research
      • There is a shortage of people educated in bioinformatics

  39. The University of Waikato
      • Waikato University is ranked first in the country in computer science and in molecular, cellular, and whole-organism biology
      • centre of the universe for machine learning

  40. The University of Waikato: If you’re interested in getting involved in bioinformatics, or indeed any other area along the leading edge of computer science and/or biology, then … Waikato wants You!
