The Challenge of Predicting Gene Function

The Challenge of Predicting Gene Function • Ross D. King • Department of Computer Science • University of Wales, Aberystwyth

Gene Function Prediction • The most important revelation from the sequenced genomes is that the functions of typically only between 60-70% of the predicted genes are known with any confidence. • The new science of functional genomics is dedicated to determining the function of the genes of unassigned function, and to further detailing the function of genes with purported function

Data Mining Prediction • We have developed a method for predicting the functional class of gene products based on ILP/Relational data mining. • The idea is to learn a reliable predictive function on the examples of genes with products of known function. • Then apply this function to genes where the functional class is unknown. • We call this approach: Data Mining Prediction (DMP).

Predicting Gene Function in Yeast We will demonstrate our approach using ORFs in yeast (Saccharomyces cerevisiae). • Using the MIPS functional classification scheme • For those ORFs whose function is currently unknown • Using 5 types of data: • Sequence statistics • Homology (sequence similarity) • Predicted Secondary Structure • Expression (microarray) • Phenotype

We want to map from sequence to function class

Classification Schemes 1
1,0,0,0 "METABOLISM"
2,0,0,0 "ENERGY"
3,0,0,0 "CELL CYCLE AND DNA PROCESSING"
4,0,0,0 "TRANSCRIPTION"
5,0,0,0 "PROTEIN SYNTHESIS"
6,0,0,0 "PROTEIN FATE (folding, modification, destination)"
8,0,0,0 "CELLULAR TRANSPORT AND TRANSPORT MECHANISMS"
10,0,0,0 "CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM"
11,0,0,0 "CELL RESCUE, DEFENSE AND VIRULENCE"
13,0,0,0 "REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT"
14,0,0,0 "CELL FATE"
29,0,0,0 "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS"
30,0,0,0 "CONTROL OF CELLULAR ORGANIZATION"
40,0,0,0 "SUBCELLULAR LOCALISATION"
62,0,0,0 "PROTEIN ACTIVITY REGULATION"
63,0,0,0 "PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT"
67,0,0,0 "TRANSPORT FACILITATION"
98,0,0,0 "CLASSIFICATION NOT YET CLEAR-CUT"
99,0,0,0 "UNCLASSIFIED PROTEINS"
MIPS/GeneOntology

Classification Schemes 2
Hierarchy of classes
1,0,0,0 "METABOLISM"
  1,1,0,0 "amino acid metabolism"
  1,2,0,0 "nitrogen and sulfur metabolism"
  1,3,0,0 "nucleotide metabolism"
  1,4,0,0 "phosphate metabolism"
  1,5,0,0 "C-compound and carbohydrate metabolism"
  1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
  1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
  1,20,0,0 "secondary metabolism"

Classification Schemes 3
Hierarchy of classes
1,0,0,0 "METABOLISM"
  1,1,0,0 "amino acid metabolism"
    1,1,1,0 "amino acid biosynthesis"
    1,1,4,0 "regulation of amino acid metabolism"
    1,1,7,0 "amino acid transport"
    1,1,10,0 "amino acid degradation (catabolism)"
    1,1,99,0 "other amino acid metabolism activities"
  1,2,0,0 "nitrogen and sulfur metabolism"
  1,3,0,0 "nucleotide metabolism"
  1,4,0,0 "phosphate metabolism"
  1,5,0,0 "C-compound and carbohydrate metabolism"
  1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
  1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
  1,20,0,0 "secondary metabolism"
... and ORFs may have multiple functions too!
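The four-part class codes on these slides encode the hierarchy directly: trailing zeros mark the unused deeper levels. The following is a minimal sketch of how such codes could be handled, assuming an ORF that belongs to a class also belongs to each of its ancestor classes. The function names are hypothetical and are not taken from the DMP software.

```python
def parse_class(code):
    """Parse a class code such as '1,1,4,0' into a tuple of ints."""
    return tuple(int(part) for part in code.split(","))


def level(cls):
    """Depth in the hierarchy, counted as the number of non-zero components."""
    return sum(1 for part in cls if part != 0)


def ancestors(cls):
    """Proper ancestors of a class, from the top level downwards."""
    width = len(cls)
    return [cls[:d] + (0,) * (width - d) for d in range(1, level(cls))]


def expand_labels(codes):
    """Label set of an ORF, closed under the 'ancestor' relation (assumption)."""
    labels = set()
    for code in codes:
        cls = parse_class(code)
        labels.add(cls)
        labels.update(ancestors(cls))
    return labels


# An ORF with two functions: one at level 3 and one at level 1.
orf_labels = expand_labels(["1,1,4,0", "4,0,0,0"])
```

Expanding the labels this way yields one multi-label learning problem per level of the hierarchy, which matches the slide's remark that ORFs may have multiple functions.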

Sequence Data
field            description                                   type
aa_rat_X         % of amino acid X in the protein              real
seq_len          length of the protein sequence                int
aa_rat_pair_X_Y  % of the amino acids X and Y consecutively    real
mol_wt           molecular weight of the protein               int
theo_pI          theoretical pI (isoelectric point)            real
atomic_comp_X    atomic composition of X (C,H,N,O,S)           real
aliphatic_index  aliphatic index                               real
hydro            grand average of hydropathy                   real
strand           the DNA strand                                'w' or 'c'
position         the number of exons (no. of start positions)  int
cai              codon adaptation index                        real
motifs           number of PROSITE motifs                      int
tmSpans          number of transmembrane spans                 int
chromosome       chromosome number                             1..16,mit
478 attributes in total
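The simplest of these attributes can be computed directly from the amino acid sequence. The sketch below covers only seq_len, aa_rat_X and aa_rat_pair_X_Y. It assumes that the pair percentage is taken over the number of adjacent positions (length minus one), which the slide does not specify, and the function name is hypothetical.

```python
from collections import Counter


def sequence_features(seq):
    """Compute seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence."""
    seq = seq.upper()
    n = len(seq)
    features = {"seq_len": n}

    # Percentage of each single amino acid in the protein.
    for aa, count in Counter(seq).items():
        features[f"aa_rat_{aa}"] = 100.0 * count / n

    # Percentage of each consecutive amino acid pair (assumed denominator: n - 1).
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    for pair, count in pairs.items():
        features[f"aa_rat_pair_{pair[0]}_{pair[1]}"] = 100.0 * count / (n - 1)

    return features


# Start of the YAL001C sequence fragment shown on the homology slide.
features = sequence_features("mvltiypdelvqivsdk")
```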

Homology Data
YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....
PSI-BLAST search against the NRDB sequence database:
gene    organism          score
tfc     baker's yeast     0.0
sfc3    fission yeast     1.0e-18
wsv442  white spot virus  2.1
cg9463  fruit fly         2.9
f1l3    Arabidopsis       3.0
We look up the associated information from SwissProt, e.g. sfc3: keyword(membrane) length(358) dbref(prosite) dbref(embl)

Predicted Secondary Structure Data mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk... cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb... We record length and relative positions of the secondary structure elements. This is relational data.
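Recording the length and relative position of each element amounts to run-length encoding the prediction string and then pairing up adjacent elements. The sketch below shows one way to do this. The tuple layout and function names are assumptions made for illustration.

```python
from itertools import groupby


def structure_elements(prediction):
    """Run-length encode a prediction string into (type, start, length) tuples."""
    elements = []
    start = 0
    for kind, run in groupby(prediction):
        length = len(list(run))
        elements.append((kind, start, length))
        start += length
    return elements


def neighbours(elements):
    """Pairs of adjacent elements, i.e. the relational 'neighbour' facts."""
    return list(zip(elements, elements[1:]))


# a = alpha helix, b = beta strand, c = coil
elements = structure_elements("ccaaab")
pairs = neighbours(elements)
```

Because the number of elements differs from protein to protein, the result does not fit a fixed-width attribute table, which is why the slide calls this relational data.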

Expression Data
• Microarray experiments to measure expression changes in yeast under a variety of conditions, including cell cycle, heat shock, diauxic shift.
• Short time series data, numerical-valued
         a0     a7     a14    a21
YBR166C   0.33  -0.17   0.04  -0.07
YOR357C  -0.64  -0.38  -0.32  -0.29
YLR292C  -0.23   0.19  -0.36   0.14
YGL112C  -0.69  -0.89  -0.74  -0.56
...
Spellman et al (1998), Roth et al (1998), DeRisi et al (1997), Eisen et al (1998), Gasch et al (2000, 2001), Chu et al (1998)

Phenotype Data
• Data from knockout gene growth experiments
• Many missing data
• 69 attributes x 1461 ORFs of known function
• 991 genes of unknown function
• Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)
                   deleted ORF
growth medium      YAL001C  YAL019W  YAL021C  YAL029C
calcofluor white   w        n        n        n
sorbitol           n        s        n        w
benomyl            n        w        n        w
H2O2               w        w        n        r
...
s = sensitive (less growth), w = wild-type (no observable effect), r = resistant (more growth), n = no data

What are the Machine Learning Issues? • Large volume of data • Missing data • Accurate results required • Intelligible results required • Class hierarchy • Multiple labels • Relational data

Relational vs Propositional
Propositional: single table, fixed number of columns/attributes
orf      time0  time7  time14
yal001c  0.34   0.52   0.48
yal002w  0.76   0.82   0.89
yal003w  0.77   0.46   0.78
yal004c  0.38   0.50   0.49
Relational: multiple tables, multiple values
orf      SwissProtID  e-val
yal001c  p03415       2e-4
yal001c  p08640       8e-58
yal002w  p32583       6e-52
yal002w  p08775       3e-42
SwissProtID  keyword
p03415       apoptosis
p03415       repeat
p03415       zinc
p08640       membrane

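Data Mining Prediction (DMP)
[Figure: DMP workflow. PolyFARM is run on the entire database. The database is split into data for rule creation (2/3) and test data (1/3); the data for rule creation is split into training data (2/3) and validation data (1/3). Rule generation with C4.5 on the training data yields all rules, the best rules are selected on the validation data, and rule accuracy is measured on the test data to give the results.]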
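The nested 2/3 : 1/3 splits of the DMP workflow can be sketched as follows. This is a minimal illustration that assumes a simple random partition of the ORFs. The slides do not say how the partition was drawn, and the function names are hypothetical.

```python
import random


def split(items, fraction, rng):
    """Shuffle a copy of items and cut it into two parts at the given fraction."""
    items = list(items)
    rng.shuffle(items)
    cut = round(len(items) * fraction)
    return items[:cut], items[cut:]


def dmp_partition(orfs, seed=0):
    """Return (training, validation, test) sets of ORFs."""
    rng = random.Random(seed)
    # 2/3 of the database is used to create rules, 1/3 is held out for testing.
    rule_creation, test = split(orfs, 2 / 3, rng)
    # The rule-creation data is split again: 2/3 training, 1/3 validation.
    training, validation = split(rule_creation, 2 / 3, rng)
    return training, validation, test


training, validation, test = dmp_partition([f"orf{i}" for i in range(90)])
```

Keeping the test data out of both rule generation and rule selection is what allows the accuracy measured on it to serve as an estimate for the ORFs of unknown function.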

Warmr • Warmr is an ILP algorithm developed by Dehaspe et al. • It is an ILP version of the well-known Apriori data mining algorithm. • It is designed to find frequent patterns in a Datalog database.
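For readers unfamiliar with Apriori, the sketch below shows the propositional, level-wise search that Warmr generalises. It is not Warmr itself: Warmr searches over first-order queries, whereas this example uses plain item sets, and the item names are invented for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for all item sets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


# Each "transaction" is the set of properties observed for one ORF (hypothetical names).
orfs = [
    {"hom_close", "short"},
    {"hom_close", "short", "ecoli"},
    {"hom_close"},
    {"short", "ecoli"},
]
patterns = frequent_itemsets(orfs, min_support=0.5)
```

The pruning step relies on the fact that a pattern cannot be frequent unless all of its more general sub-patterns are frequent. The same monotonicity argument carries over to the first-order setting.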

PolyFARM
• First-order association rule mining
• Finding all frequent first-order patterns in the data
• Distributed on a Beowulf cluster
• 47,034 homology patterns, f > 5%
• 19,628 structure patterns, f > 2%
• [Clare & King PADL 2003]
Example homology pattern: hom(SPID, close) ^ sq_len(SPID, short) ^ classification(SPID, ecoli)
  A close homology to a short protein in E. coli
Example structure pattern: struc(Pos1, a) ^ neighbour(Pos1, Pos2, c) ^ neighbour(Pos2, Pos3, a) ^ coil_dist(high)
  Contains alpha-coil-alpha with a high overall coil distribution

Propositionalisation
Transforming relational data into boolean attributes
         patt1  patt2  patt3  patt4  ...  patt47034
YAL001C  0      1      0      0      ...  1
YAL002W  0      1      1      0      ...  1
YAL003W  1      0      0      1      ...  0
YAL004W  1      1      0      0      ...  1
YAL005C  0      0      0      0      ...  1
...
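Propositionalisation can be illustrated with the small homology and keyword tables from the "Relational vs Propositional" slide. In the sketch below each pattern is a boolean test over an ORF's related rows, so every ORF ends up with one fixed-length row of 0/1 values. The three patterns and their thresholds are invented for illustration and are not patterns found by PolyFARM.

```python
# Relational tables, copied from the "Relational vs Propositional" slide.
homology = [
    ("yal001c", "p03415", 2e-4),
    ("yal001c", "p08640", 8e-58),
    ("yal002w", "p32583", 6e-52),
    ("yal002w", "p08775", 3e-42),
]
keywords = {
    ("p03415", "apoptosis"),
    ("p03415", "repeat"),
    ("p03415", "zinc"),
    ("p08640", "membrane"),
}


def homologs(orf):
    """All (SwissProtID, e-value) rows for one ORF; the number of rows varies."""
    return [(sp, e) for o, sp, e in homology if o == orf]


# Hypothetical patterns: each one is an existential query over the tables.
patterns = {
    "strong_hit": lambda orf: any(e < 1e-50 for _, e in homologs(orf)),
    "hit_to_membrane_protein": lambda orf: any(
        (sp, "membrane") in keywords for sp, _ in homologs(orf)
    ),
    "hit_to_zinc_protein": lambda orf: any(
        (sp, "zinc") in keywords for sp, _ in homologs(orf)
    ),
}


def propositionalise(orfs, patterns):
    """One boolean column per pattern, one row per ORF."""
    return {orf: [int(test(orf)) for test in patterns.values()] for orf in orfs}


table = propositionalise(["yal001c", "yal002w"], patterns)
```

The resulting table has a fixed number of columns regardless of how many homologs each ORF has, which is what lets a propositional learner such as C4.5 consume it.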

Dichotomic Search 1 • As an alternative to the WARMR data-mining approach, we developed a frequent pattern finding method based on dichotomic search. • This approach uses domain-specific logics as intermediates between propositional logic and predicate logic.

Dichotomic Search 2 • Most existing algorithms traverse the search space in either a top-down or a bottom-up fashion. We propose a new approach based on dichotomic search, which explores the search space in both directions, allowing larger steps. • Dichotomic search combines completeness (w.r.t. concepts), non-redundancy, and flexibility. • Ferre, S. & King, R.D. (2005). Fundamenta Informaticae

C4.5
Open source decision tree algorithm
• propositional learning • commonly used • produces interpretable rules • reliable • fast • accurate
Made modifications for: • multiple labels • hierarchical labels [Clare & King Bioinformatics 2002]
[Figure: example decision tree with tests on aa_ratio_pair_p_y (> 0.232 / <= 0.232), strand (w / c) and aa_rat_a (> 6.4 / <= 6.4), and leaves labelled metabolism, transcription, transport and cell fate.]
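The slide does not say how the C4.5 splitting criterion was changed for multiple labels. One possible extension, assumed here purely for illustration, is to sum over all classes the entropy of "has this class" versus "does not have this class":

```python
import math


def multilabel_entropy(label_sets, classes):
    """Sum over classes of the binary entropy of class membership.

    label_sets: one set of class labels per example (an example may have several).
    This formula is an assumption for illustration, not taken from the slides.
    """
    n = len(label_sets)
    total = 0.0
    for c in classes:
        p = sum(1 for labels in label_sets if c in labels) / n
        for x in (p, 1.0 - p):
            if x > 0:
                total -= x * math.log2(x)
    return total


examples = [{"metabolism"}, {"metabolism", "transport"}]
h = multilabel_entropy(examples, ["metabolism", "transport"])
```

With a single label per example and two classes, each class contributes the same binary entropy, so this measure is twice the ordinary entropy. A tree learner could use it in place of ordinary entropy when computing information gain.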

Results • Many rules from each data type • Rules at each level of hierarchy • Some classes are much easier to predict than others (for example "protein synthesis" at 71-93%, "energy" at 20-47%) • Good levels of accuracy on held out test data • Many predictions for ORFs of unknown function (some function at some level is predicted for 96% of the ORFs of unknown function) • Some rules explainable by biology -> scientific knowledge discovery Clare & King (2003) Bioinformatics suppl. 2., 42-49

Accuracy Table

Expression Data Rule
If in the micro-array experiment (sorbitol incubation) the ORF expression is > -0.25
and in the micro-array experiment (nitrogen depletion) the ORF expression is <= -1.29
and in the micro-array experiment (YPD stationary phase) the ORF expression is > -1.06
then the function of this ORF is "pheromone response, mating type determination, sex-specific proteins"
Accuracy on training data: 11/12 (92%)
Accuracy on the test data: 3/4 (75%)
21 predictions made
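Because the rule is a conjunction of threshold tests, it translates directly into code. In the sketch below the thresholds are those on the slide, while the dictionary keys and function names are hypothetical.

```python
def pheromone_rule(expr):
    """True if the ORF's expression values satisfy all three conditions of the rule."""
    return (
        expr["sorbitol_incubation"] > -0.25
        and expr["nitrogen_depletion"] <= -1.29
        and expr["ypd_stationary_phase"] > -1.06
    )


def accuracy(correct, fired):
    """Fraction of the ORFs matched by the rule that really have the predicted class."""
    return correct / fired


# A made-up ORF that satisfies the rule.
example = {
    "sorbitol_incubation": 0.10,
    "nitrogen_depletion": -1.50,
    "ypd_stationary_phase": -0.20,
}
matches = pheromone_rule(example)

training_accuracy = accuracy(11, 12)
test_accuracy = accuracy(3, 4)
```

Accuracy here is measured only over the ORFs for which the rule fires, so a rule can be accurate while covering very few ORFs. That is why the slide reports the number of predictions separately.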

Structure Rule
If
  true: coil (of length 3) followed by alpha (10 <= length < 14) and
  true: coil (of length 1 or 2) followed by alpha (10 <= length < 14) and
  true: coil (of length 3) followed by alpha (3 <= length < 6) and
  false: coil followed by beta followed by coil (c-b-c) and
  false: coil (6 <= length < 10) followed by alpha (of length 1 or 2)
then the function of this ORF is "mitochondrial transport"
• 80% accurate on test data
• Most matching ORFs belong to the Mitochondrial Carrier Family
• These have 6 long transmembrane alpha-helices of about 20-30 amino acids
• Why do we notice alpha-helices of length 10-14?

Alignment YJL133W -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------ 251 YKR052C -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------ 241 YIL006W ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------ 310 YBR104W ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------ 271 YGR096W ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------ 250 YJR095W -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------ 246 YKL120W -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------ 261 YLR348C -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------ 239 YMR166C ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT 300 YDL198C ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------ 242 YGR257C ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN---------- 302 YDL119C FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------ 255 YJL133W -SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF 310 YKR052C -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF 300 YIL006W -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR 364 YBR104W -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF 325 YGR096W FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY 310 YJR095W ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH 303 YKL120W ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL 312 YLR348C ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH 289 YMR166C HPHVTNGRPAALSNSISLSLRTVYQSEGVLGFFSGVGPRFVWTSVQSSIMLLLYQMTLRG 360 YDL198C ---FDNPESG------LRIVKNTLKNEGVTAFFKGLTPKLLTTGPKLVFSFALAQSLIPR 293 YGR257C ---NSDPKGGNRSRNMFKFLETIWRTEGLAALYTGLAARVIKIRPSCAIMISSYEISKKV 359 YDL119C ----SKFTNS------FNTFTSIVKNENVLKLFSGLSMRLARKAFSAGIAWGIYEELVKR 305

Alignment YJL133W -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 251 YKR052C -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 241 YIL006W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 310 YBR104W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 271 YGR096W ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------ 250 YJR095W -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------ 246 YKL120W -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 261 YLR348C -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 239 YMR166C ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc 300 YDL198C ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------ 242 YGR257C ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc---------- 302 YDL119C ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------ 255 YJL133W -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa 310 YKR052C -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 300 YIL006W -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa 364 YBR104W -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 325 YGR096W cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 310 YJR095W ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 303 YKL120W ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 312 YLR348C ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 289 YMR166C cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 360 YDL198C ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa 293 YGR257C ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 359 YDL119C ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa 305

Homology Rule
If the ORF is not weakly homologous to a protein in klebsiella
and is strongly homologous to a protein in desulfurococcales
and is strongly homologous to a short protein in cyprinidae
then the function of this ORF is "Protein fate (folding, modification, destination)"
• This rule is 100% accurate on test data
• Almost all matching ORFs are from the 20S proteasome subunit for degradation of proteins
• These subunits exist in archaea and eukaryotes, but only in one specific branch of bacteria (actinomycetes).

Application of DMP to Bacterial Genomes • Successful for both M. tuberculosis and E. coli. • Of the ORFs with no assigned function >40% were predicted to have a function at one or more levels of the class hierarchy. • It was found that many of the predictive rules were more general than possible using sequence homology. References King et al. (2000) KDD 2000 King et al. (2000) Yeast (Comparative and Functional Genomics) King et al. (2001) Bioinformatics

Example Rule (level 2 E. coli)
If the ORF is not predicted to have a b-strand of length ≥ 3 and a homologous protein from class Chytridiomycetes was found
then its functional class is "Cell processes, Transport/binding proteins"
12/13 (86%) correct on the test set; the probability of this result occurring by chance is estimated at 4 x 10^-7.
24 ORFs of unknown function are predicted by the rule.
16 of these ORFs now have a putative or confirmed function: 93.8% accurate predictions

Experimental Confirmation • The original bacterial ORF predictions were made over three years ago. • In the intervening time many more ORFs have been sequenced, making traditional homology-based prediction methods more accurate and sensitive, and the functions of some ORFs have been determined by wet biology. • The E. coli genome has been re-annotated by Monica Riley’s group.

“Wet” Biology Confirmation • A number of predictions have been confirmed or falsified by new “wet” experimental data. • This new data is biased towards hard classes. Despite this, the results are still good: • Level 2: 23 predictions - 47.8% accuracy • Level 3: 23 predictions - 43.4% accuracy • This is very much better than random, as there are many classes.

Confirmation of “Wet” Predictions

Extension to Arabidopsis Genome • Collaborative project with the Institute of Grassland and Environmental Research and the University of Nottingham. • Large increase in data: 6,000 (yeast) -> 25,000 ORFs. • Large amount of micro-array data from the Nottingham Arabidopsis Stock Centre. • The increase in data (hundreds of megabytes) is a challenge to our machine learning algorithms. Clare, A., Karwath, A., Ougham, H. and King, R.D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136.

Results • Accuracy comparable to yeast and bacteria • A large fraction of the genes of currently unknown function are predicted. • Some rules could be interpreted in terms of known biology. Clare, A., Karwath, A., Ougham, H. and King, R.D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136.

Gibberellin Biosynthesis Prediction • Gibberellin is an important plant hormone. • Chosen because of interesting phenotypes – often extreme size. • Insertion of a promoter to overproduce gene product. • Result • 2 days earlier flowering • Average leaf number and weight increased at 21 days. • This phenotype is consistent with prediction.

Leaf number increases more rapidly in the mutant (yellow bars) than in wildtype Landsberg erecta (blue bars)

Paclobutrazol (P), an inhibitor of gibberellin biosynthesis, abolishes the difference between mutant (M) and wildtype (L). C = control.

Availability All predictions available at http://www.genepredictions.org All rules and data available at http://www.aber.ac.uk/compsci/Research/bio/dss/

ILP 2005 Challenge 1 • Yeast function prediction data used as a community challenge: http://www.protein-logic.com/ • The intention of the challenge was to provide a real-world data set to test how far we have progressed in the field of ILP and multi-relational data mining. The questions we wanted to answer were: Are the tools up to the job? Do they scale? Do they handle noisy, sparse and complex data?

ILP 2005 Challenge 2
• A. J. Knobbe, E. K. Y. Ho, R. Malik: ILP Challenge 2005: The Safarii MRDM environment.
• C. Perlich: Approaching the ILP 2005 challenge: Class-Conditional Bayesian Propositionalization for Genetic Classification.
• J. Struyf, C. Vens, T. Croonenborghs, S. Dzeroski, H. Blockeel: Applying Predictive Clustering Trees to the Inductive Logic Programming 2005 Challenge Data.
• F. Riguzzi: A Simple Approach to a Multi-Label Classification Problem.

Propositional Approach • Zafer Barutcuoglu, Robert E. Schapire and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics (in press) • Hierarchy of SVMs. • Uses a Bayesian net to combine predictions.

Conclusions • Data mining and machine learning are powerful tools for functional genomics. • The DMP method can be successfully applied to different genomes (bacterial, yeast, Arabidopsis) to predict gene functional class. • Micro-array data is a useful component in DMP. • Biological insight can be extracted from DMP rules. • The structure of gene prediction problems makes them an exciting test bed for machine learning methods.

Acknowledgements • Amanda Clare Aberystwyth • Andreas Karwath Freiburg (Aberystwyth) • Luc Dehaspe PharmaDM • Helen Ougham IGER BBSRC

The Need for Logic to Represent Scientific Knowledge • Logic is the best understood way to represent knowledge. • Traditional statistics, machine learning, and data mining are based on propositional logic. • For some problems we require a richer description language, i.e. first-order predicate calculus. • Using logic programming (predicate calculus) we can incorporate deduction, abduction, and induction.
