1. Intelligent Curation Using Ontologies K.Wolstencroft
 
2. Introduction Developing an automated system for extracting and classifying proteins from newly sequenced genomes 
Background
Architecture
Advantages 
3. Motivation Genome sequencing techniques have greatly improved
More whole genomes are being sequenced quickly - lots of data being generated
Without analysis and classification, sequences are simply a series of letters
Therefore, data analysis is now the rate-limiting step
 
4. Why Classify? Classification and curation of a genome is the first step in understanding the processes and functions happening in an organism
Classification enables comparative genomic studies - what is already known in other organisms
The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology 
5. Background: DNA to Proteins Genome sequencing produces DNA sequences
DNA: blueprint of an organism
DNA encodes complex molecules, mostly proteins
Proteins are the functional molecules of a cell
6. Proteins Complex molecules constructed from sequences of amino acids
20 different amino acids with different chemical properties 
7. Proteins Primary Structure Amino acid sequences can be represented as a series of single letters
	>1A5Y:_ PROTEIN TYROSINE PHOSPHATASE 1B
	MEMEKEFEQIDKSGSWAAIYQDIRHEASDFPCRVAKLPKNKNRNRYRDVSPFDHSRIKLHQEDNDYINASLIKMEEAQRSYILTQGPLPNTCGHFWEMVWEQKSRGVVMLNRVMEKGSLKCAQYWPQKEEKEMIFEDTNLKLTLISEDIKSYYTVRQLELENLTTQETREILHFHYTTWPDFGVPESPASFLNFLFKVRESGSLSPEHGPVVVHXSAGIGRSGTFCLADTCLLLMDKRKDPSSVDIKKVLLDMRKFRMGLIATAEQLRFSYLAVIEGAKFIMGDSSVQDQWKELSHEDLEPPPEHIPPPPRPPKRILEPHNGKCREFFPN 
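A minimal Python sketch of reading a record in this single-letter (FASTA-style) format. The function name `parse_fasta` is hypothetical, and the example uses only the first 60 residues of the sequence above, wrapped over two lines for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Illustrative input: first 60 residues of the 1A5Y sequence shown above.
example = """>1A5Y:_ PROTEIN TYROSINE PHOSPHATASE 1B
MEMEKEFEQIDKSGSWAAIYQDIRHEASDF
PCRVAKLPKNKNRNRYRDVSPFDHSRIKLH
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header, len(sequence))
```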
8. Proteins Tertiary Structure Sequence determines structure
9. Searching for Features The relationship between amino acid sequence and eventual protein structure means that we can search for distinct structural (and functional) domains within the sequence
Domains could be several amino acids long, or could span most of the protein
 
10. Example A search of the linear sequence of protein tyrosine phosphatase type K identified 9 functional domains
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHV
SAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNP
GTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYI
AIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV.. 
11. Human Expert Annotation Bioinformaticians use a series of tools to identify functional domains
Similarity searching, domain/motif identification
Tools include BLAST and InterPro
Tools simply show presence of domains
Use expert knowledge to classify proteins according to domain arrangements
Presence, order, and number of each domain are all important
Can an ontology be used to capture this knowledge to the standard of a human annotator?
   
12. Protein Family Classification Proteins are divided into broad functional classes: protein families
Diagnostic domains/motifs often signify family membership
Initial Study focuses on the protein phosphatase family 
13. The Protein Phosphatases Large superfamily of proteins involved in the removal of phosphate groups from molecules
Important proteins in almost all cellular processes
Involved in diseases such as diabetes and cancer
Human phosphatases are well characterised
14. Characterisation allows classification Diagnostic phosphatase domains/motifs are sufficient for membership of the protein phosphatase superfamily
Other motifs determine a protein's place within the family
This human expert knowledge can be captured and incorporated into the model if the domain organisations are represented in a formal OWL DL ontology
15. Protein Functional Domains 
16. Determining Class Definitions R2A 
Contains 2 protein tyrosine phosphatase domains
Contains 1 transmembrane domain
Contains 4 fibronectin domains
Contains 1 immunoglobulin domain
Contains 1 MAM domain
Contains 1 cadherin-like domain
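As a rough illustration of how such a class definition can be checked mechanically, the sketch below encodes the R2A domain counts as a plain Python mapping and compares a protein's domains against it. The short domain labels and the function `matches_definition` are assumptions made for this example, not identifiers from the ontology, and domain order is ignored here.

```python
from collections import Counter

# Domain counts taken from the R2A slide. The labels are illustrative
# names chosen for this sketch, not InterPro or ontology identifiers.
R2A_DEFINITION = {
    "protein_tyrosine_phosphatase": 2,
    "transmembrane": 1,
    "fibronectin": 4,
    "immunoglobulin": 1,
    "MAM": 1,
    "cadherin_like": 1,
}


def matches_definition(domains, definition):
    """True if the protein's domain counts equal the definition exactly."""
    return Counter(domains) == Counter(definition)


# A hypothetical protein, given as a list of detected domains.
# The order of the list is arbitrary; only the counts are compared.
protein = (
    ["MAM", "immunoglobulin", "cadherin_like", "transmembrane"]
    + ["fibronectin"] * 4
    + ["protein_tyrosine_phosphatase"] * 2
)

print(matches_definition(protein, R2A_DEFINITION))
```

A protein with only one phosphatase domain would fail this check, which is the kind of distinction the number of each domain is meant to capture.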
 
17. Protégé OWL Modelling
18. Requirements Extract phosphatase sequences from rest of protein sequences from a whole genome
Identify the domains present in each
Compare these sequences to the formal ontology descriptions
Classify each protein instance to a place in the hierarchy
 
19. Technology 
 
20. myGrid Workflow Extract sequences from whole genome
Perform simple filtering with patmatdb
Run InterProScan to determine domain architecture
Transform the InterProScan results into abstract OWL instance descriptions
21. myGrid Workflow 
 
22. InterProScan Results
23. Conversion to abstract OWL format restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000340> cardinality(1))
restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR001763> cardinality(1))
restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000387> cardinality(1))
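One way this conversion step could look, as a sketch only: given the InterPro identifiers found in one protein, count each identifier and emit one restriction line per domain in the format shown above. The function `to_restrictions` is hypothetical; the actual workflow step is not shown in the slides.

```python
from collections import Counter

# Namespace as it appears in the restriction lines above.
BASE_IRI = "http://www.owl-ontologies.com/unnamed.owl#"


def to_restrictions(interpro_ids):
    """Build one cardinality restriction per distinct InterPro domain."""
    counts = Counter(interpro_ids)  # keeps first-seen order of the ids
    return [
        f"restriction(<{BASE_IRI}containsDomain{ipr}> cardinality({n}))"
        for ipr, n in counts.items()
    ]


# Domain hits for one protein (identifiers taken from the lines above).
hits = ["IPR000340", "IPR001763", "IPR000387"]
for line in to_restrictions(hits):
    print(line)
```

A domain that occurs twice in the input list would be emitted once with `cardinality(2)`.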
 
24. Instance Store Instance Store enables reasoning over individuals
Can support much higher numbers of individuals 
OWL ontology is loaded into the instance store
A DL reasoner (Racer) is used to compare individuals to the OWL ontology definitions
25. Instance Store  
26. Example Instances  Protein Individual
	Dual Specificity Phosphatase DUSE
restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000340> cardinality(1))
restriction(<http://www.owl-ontologies.com/unnamed.owl#containsDomainIPR000387> cardinality(1))
 Ontology Definition of Dual Specificity Phosphatase
containsDomain IPR000340
Necessary and Sufficient for class membership
Also inherits
containsDomain IPR000387 from Parent Class PTP
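The matching in this example can be approximated in plain Python. This is a simplified stand-in for what the DL reasoner does, not the Instance Store or Racer API: it treats each class as a set of required domains, adds the requirements inherited from parent classes, and ignores cardinalities. The `CLASSES` table and the function names are assumptions for illustration.

```python
# Each class lists its own required domains and its parent class.
# Only the two classes named on the slide are included.
CLASSES = {
    "PTP": {"parent": None, "requires": {"IPR000387"}},
    "DualSpecificityPhosphatase": {"parent": "PTP", "requires": {"IPR000340"}},
}


def full_requirements(name):
    """Collect a class's own requirements plus those of its ancestors."""
    reqs = set()
    while name is not None:
        reqs |= CLASSES[name]["requires"]
        name = CLASSES[name]["parent"]
    return reqs


def classify(domains):
    """Return all matching classes, most specific (most requirements) first."""
    domains = set(domains)
    matches = [c for c in CLASSES if full_requirements(c) <= domains]
    return sorted(matches, key=lambda c: len(full_requirements(c)), reverse=True)


# The DUSE individual from the slide contains both domains.
duse = {"IPR000340", "IPR000387"}
print(classify(duse))
```

An individual with only IPR000387 would be placed in PTP alone, and one with neither domain would match no class.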
 
27. So Far... Human phosphatases have been classified using the system
The ontology classification performed as well as expert classification
The ontology system refined the classification
	- DUSC contains a zinc finger domain
	  Characterised and conserved, but not in the classification
	- DUSA contains a disintegrin domain
	  Previously uncharacterised, evolutionarily conserved
 
28. Aspergillus fumigatus Phosphatase complement very different from human
>100 in human, <50 in A. fumigatus
Whole subfamilies missing
Different fungi-specific phosphorylation pathways?
No requirement for tissue-specific variations?
Novel serine/threonine phosphatase with homeobox
	Conserved in Aspergillus and closely related species, but not in any others
29. Conclusions Using ontology allows automated classification to reach the standard of human expert annotation
Reasoning capabilities allow interpretation of domain organisation
Produces interesting biological questions
Allows fast, efficient comparative genomics studies
System currently describes protein phosphatases, but could be expanded to other protein families
30. Acknowledgements Group: myGrid
PhD Supervisors: Andy Brass, Robert Stevens
Phosphatase Biologist: Lydia Tabernero
Ontogrid and NIBHI