In-depth investigation and analysis of chemical genome-wide fitness data. Predict gene function

Systematic analysis of genome-wide fitness data in yeast reveals novel gene function and drug action. M. Hillenmeyer (Stanford), E. Ericson (Toronto), R. Davis (Stanford), C. Nislow (Toronto), D. Koller (Stanford) and G. Giaever (Toronto) Published in Genome Biology 2010 Presented By: Yaron Margalit

In-depth investigation and analysis of chemical genome-wide fitness data. • Predict gene function • Predict protein-drug interactions • Make new observations and/or extend previous ones with the new data.

Outline • Brief introduction • Large-scale genome-wide Dataset • Co-fitness • Motivation and Definition • Implementation • Results • Co-inhibition • Motivation and Definition • Implementation • Results • Predict drug-target interactions • Motivation • Model • Results • Summary

Brief Introduction - Reminder • Deletion mutants sensitive to a particular drug should be synthetically lethal with the drug target. [Figure: synthetic lethal interactions vs. synthetic chemical interactions, with alive/dead outcomes for each combination of deletion and drug]

CGI (C for chemical) vs. GI [Figure: GI (genes vs. library genes) compared with CGI (chemicals vs. library genes)]

CGI notes • Points to keep in mind when working with CGI: • Inactivation of the target protein’s function by the compound is not complete • Multi-drug resistance genes: some mutants are hypersensitive to many drugs of different types (many are promiscuous) • Side effects: a compound may inactivate other proteins, not only its intended target

Hillenmeyer et al. Science 2008

Chemical genomics • Studies the relationship between small molecules and genes. • Small molecules: • Drugs – FDA approved • Chemical probes – well characterized • New compounds – unknown biological activity

Saccharomyces cerevisiae (the “beer yeast”) • “Beer yeast” has ~6000 genes • ~1000 genes are essential • The dataset includes large diploid deletion collections: • ~6000 heterozygous gene deletion strains (+/-) • ~5000 homozygous deletion strains (-/-) • Only ~5000 because about 1000 genes are essential (genes the cell cannot live without, regardless of the conditions it grows in)

Data source • The deletion collections were used to study how cell growth rate (fitness) responds to conditions (small compounds and environmental stressors): • 726 conditions for the heterozygous deletion strains • 418 conditions for the homozygous deletion strains • A homozygous or heterozygous gene deletion in combination with a drug (or other treatment) can cause a growth fitness defect (reduction) • Measured relative to a no-drug control

co-fitness • Definition: the co-fitness value is the similarity of two genes’ fitness scores across experiments • Intuition: • Gene-drug interaction: compute a fitness defect score by comparing a gene’s intensity in a specific treatment to the same gene’s intensity in the control (no-drug) • From this, derive a gene-gene relationship: calculate the correlation (similarity) between two genes (i.e. “how sensitive are the two genes to similar drugs”) • co-fitness was calculated separately for the heterozygous and homozygous datasets

co-fitness – the similarity of two genes • How to calculate the fitness defect (reduction) for a gene-drug interaction: • Z-score • P-value • Log ratio • Log P-value • Example of such a score, the log ratio: $\mathrm{LR}_{i,t} = \log_2\left(\mu_{c,i} / \mu_{t,i}\right)$ Where: - $\mu_{c,i}$: mean intensity of replicate i across multiple control conditions (controls) - $\mu_{t,i}$: intensity of replicate i under treatment t (cases)
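
The log-ratio score on this slide can be sketched in a few lines. This is a minimal illustration, assuming the score is the log2 ratio of mean control intensity to treatment intensity; the function name `log_ratio`, the sign convention (positive means a fitness defect) and the intensity values are assumptions for illustration, not taken from the paper.

```python
import math

def log_ratio(control_intensities, treatment_intensity):
    """Log2 ratio of the mean control intensity to the treatment intensity.

    Assumed convention: a positive score means the strain is less abundant
    under treatment than in the no-drug controls (a fitness defect).
    """
    mean_control = sum(control_intensities) / len(control_intensities)
    return math.log2(mean_control / treatment_intensity)

# Hypothetical intensities for one deletion strain.
controls = [1000.0, 1200.0, 800.0]   # no-drug measurements, mean = 1000
treated = 250.0                      # same strain under the drug
score = log_ratio(controls, treated)
print(score)
```

A strain that is four times less abundant under treatment gets a score of 2, and a strain unaffected by the drug gets a score of 0.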

co-fitness – the similarity of two genes • Calculate the gene-gene relationship from the defect scores. • Example of a co-fitness distance metric, Euclidean distance: $d(x, y) = \sqrt{\sum_{i,t} \left(x_{i,t} - y_{i,t}\right)^2}$ Where: - $x_{i,t}$: defect score of gene x under treatment t, replicate i - $y_{i,t}$: defect score of gene y under treatment t, replicate i

co-fitness – the similarity of two genes • Goal: quantify the degree to which co-fitness can predict gene function, and compare its performance to other similarity types (datasets) • Several similarity measures were tested: • Pearson correlation • Spearman rank correlation • Euclidean distance • Bicluster co-occurrence count • Bicluster Pearson correlation
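
To make the first three measures concrete, here is a minimal sketch that computes them on two fitness-defect profiles. The profiles and function names are hypothetical, the rank function assumes no tied values, and the bicluster-based measures are not covered.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n of the values (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    """Euclidean distance (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical defect scores of two genes across four treatments.
gene_x = [0.1, 2.0, 0.3, 4.0]
gene_y = [0.2, 2.5, 0.1, 9.0]
print(pearson(gene_x, gene_y), spearman(gene_x, gene_y), euclidean(gene_x, gene_y))
```

Note that the correlations grow with similarity while the Euclidean distance shrinks, so the two kinds of measure have to be ranked in opposite directions when compared.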

co-fitness – picking best distance metric

co-fitness – the similarity of two genes • So far: Pearson correlation was tested and found to exhibit the best performance for co-fitness • Next: use co-fitness and evaluate how well it predicts gene function

co-fitness predicts reference network • Evaluate co-fitness predictions against expert-curated reference interactions (the “reference network”), used as the gold-standard comparison dataset. • Each dataset was compared to the reference network: • The reference network was divided into 32 GO slim biological process sub-networks • Each gene pair was assigned to a sub-network if both genes were annotated to that process
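
One simple way to score a dataset against a reference network is the precision of its top-ranked gene pairs. The sketch below is an assumption about how such a comparison could look, not the evaluation procedure used in the paper; `precision_at_k`, the gene names and the scores are all hypothetical.

```python
def precision_at_k(scored_pairs, reference_pairs, k):
    """Fraction of the k highest-scoring gene pairs that appear in the reference network."""
    top = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]
    hits = sum(1 for pair, _ in top if frozenset(pair) in reference_pairs)
    return hits / k

# Hypothetical co-fitness scores for gene pairs.
scores = [
    (("A", "B"), 0.9),
    (("A", "C"), 0.7),
    (("C", "D"), 0.6),
    (("B", "C"), 0.2),
]
# Hypothetical reference network; frozensets make pairs order-independent.
reference = {frozenset(("A", "B")), frozenset(("C", "D"))}

print(precision_at_k(scores, reference, 2))
print(precision_at_k(scores, reference, 3))
```

Repeating the calculation per sub-network would only require restricting `scored_pairs` and `reference_pairs` to gene pairs annotated to that process.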

co-fitness predicts reference network

co-fitness more results • Essential genes were co-fit with other essential genes more frequently: • 40% of essential genes were co-fit with essential genes, compared to 23% for non-essential genes. • Co-complexed genes (genes whose products belong to the same protein complex) showed increased co-fitness with other members of the complex.

co-fitness more results

co-fitness application example • Find nonessential proteins that might be essential for optimal growth under specific conditions. • The idea comes from a previous study reporting that proteins that are essential in rich medium (one type of condition) tend to cluster into complexes (i.e. essential complexes). • Application: • Define a complex as essential if 80% of its members are essential. • Scan all co-fitness values and search for significantly essential complexes.

co-fitness application example • Create synthetic data for each condition: • Generate 10,000 random permutations: reassign genes to complexes (while maintaining complex sizes) • A protein complex is essential if at least 80% of its genes have a significant (P < 0.01 cutoff) fitness defect. • A condition is identified as having significantly more essential complexes if an essential complex observed in it was not essential in any of the 10,000 permutations.
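
The permutation procedure above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the pool of genes being shuffled (all genes that belong to any complex), the function names and the toy data are assumptions.

```python
import random

def essential_complexes(complexes, defective_genes, threshold=0.8):
    """Names of complexes in which at least `threshold` of the members show a defect."""
    essential = set()
    for name, members in complexes.items():
        fraction = sum(1 for g in members if g in defective_genes) / len(members)
        if fraction >= threshold:
            essential.add(name)
    return essential

def significant_complexes(complexes, defective_genes, n_perm=10_000, seed=0):
    """Observed essential complexes that are never essential under random reassignment."""
    rng = random.Random(seed)
    observed = essential_complexes(complexes, defective_genes)
    pool = [g for members in complexes.values() for g in members]
    seen_in_null = set()
    for _ in range(n_perm):
        shuffled = pool[:]
        rng.shuffle(shuffled)
        permuted, start = {}, 0
        for name, members in complexes.items():
            # Reassign genes while keeping each complex at its original size.
            permuted[name] = shuffled[start:start + len(members)]
            start += len(members)
        seen_in_null |= essential_complexes(permuted, defective_genes)
    return observed - seen_in_null

# Hypothetical complexes and genes with a significant defect in one condition.
complexes = {
    "c1": ["g1", "g2", "g3", "g4", "g5"],
    "c2": ["g6", "g7", "g8", "g9", "g10"],
}
defective = {"g1", "g2", "g3", "g4"}
print(essential_complexes(complexes, defective))
print(significant_complexes(complexes, defective, n_perm=1000))
```

With only ten genes the null distribution is tiny, so this toy input mainly demonstrates the mechanics; the per-gene P < 0.01 cutoff is assumed to have been applied already when building `defective`.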

Co-inhibition • Definition: the co-inhibition value is the correlation between drug1 and drug2, i.e. the extent to which they inhibit similar genes. • Intuition (similar to co-fitness): • Gene-drug interaction: compute a fitness defect score by comparing a gene’s intensity in a specific treatment to the same gene’s intensity in the control (no-drug) • From this, derive a drug-drug relationship: calculate the correlation (similarity) between two drugs (i.e. “how much do the two drugs inhibit similar genes”) • co-inhibition was calculated separately for the heterozygous and homozygous datasets

Co-inhibition • Claim suggested by small-scale datasets: drug pairs with a high co-inhibition value tend to share chemical structure and mechanism of action in the cell • Goal: use co-inhibition to predict mechanism of action and thereby identify drug targets or toxicities • Next steps: • Calculate co-inhibition (1) • Define chemical structural similarity (2) • Define chemical therapeutic (action) use (3) • Verify the claim (that 1, 2 and 3 largely agree)

Calculate co-inhibition (1) • How to calculate the fitness defect (reduction) for a gene-drug interaction (same as for co-fitness): • Z-score • P-value • Log ratio • Log P-value • Example of such a score, the log ratio: $\mathrm{LR}_{i,t} = \log_2\left(\mu_{c,i} / \mu_{t,i}\right)$ Where: - $\mu_{c,i}$: mean intensity of replicate i across multiple control conditions (controls) - $\mu_{t,i}$: intensity of replicate i under treatment t (cases)

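Calculate co-inhibition (1) • Calculate the drug-drug relationship. • The co-inhibition metric used was Pearson correlation: $r(x, y) = \frac{\sum_{i,g} (x_{i,g} - \bar{x})(y_{i,g} - \bar{y})}{\sqrt{\sum_{i,g} (x_{i,g} - \bar{x})^2}\,\sqrt{\sum_{i,g} (y_{i,g} - \bar{y})^2}}$ Where: - $x_{i,g}$: defect score of drug x with gene g, replicate i - $y_{i,g}$: defect score of drug y with gene g, replicate i - $\bar{x}$, $\bar{y}$: means of the defect scores of drug x and drug y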
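
Co-inhibition uses the same defect-score table as co-fitness, but correlates drug profiles (one value per gene) instead of gene profiles. A minimal sketch, with a hypothetical table layout and made-up scores:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical defect scores: outer keys are deletion strains, inner keys are drugs.
defect = {
    "gene1": {"drugA": 3.0, "drugB": 2.8, "drugC": 0.1},
    "gene2": {"drugA": 0.2, "drugB": 0.1, "drugC": 2.7},
    "gene3": {"drugA": 2.5, "drugB": 2.9, "drugC": 0.2},
    "gene4": {"drugA": 0.1, "drugB": 0.3, "drugC": 3.1},
}

def drug_profile(table, drug):
    """Defect scores of one drug across all genes, in a fixed gene order."""
    return [table[gene][drug] for gene in sorted(table)]

def co_inhibition(table, drug1, drug2):
    """Pearson correlation between the gene-wise profiles of two drugs."""
    return pearson(drug_profile(table, drug1), drug_profile(table, drug2))

print(co_inhibition(defect, "drugA", "drugB"))  # similar profiles, strongly positive
print(co_inhibition(defect, "drugA", "drugC"))  # opposite profiles, negative
```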

Define chemical structural similarity (2) • Model each chemical as a set of substructure motifs • Construct a substructure vector for each chemical (covering all possible substructures, in our case 554 types) and set a value between 0 and 1 for each substructure, indicating whether the chemical contains it. • Calculate the structural similarity between two drugs with a distance metric.

Define chemical structural similarity (2) • Construct a substructure vector for each chemical (covering all possible substructures, in our case 554 types) and set a value between 0 and 1 for each substructure, indicating whether the chemical contains it. • We will show 3 different ways to do that

chemical structural similarity – substructure vectors • First way: binary indicator • A simple binary vector whose value is 1 if the compound contains the substructure and 0 otherwise.
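
A minimal sketch of the binary indicator vector. The four-motif catalogue is a hypothetical stand-in for the 554 substructure types, and detecting which motifs a compound contains is assumed to have been done already.

```python
def binary_vector(compound_motifs, all_motifs):
    """1 where the compound contains the motif, 0 otherwise, in catalogue order."""
    return [1 if motif in compound_motifs else 0 for motif in all_motifs]

# Hypothetical motif catalogue and compound.
MOTIFS = ["benzene", "hydroxyl", "amine", "carboxyl"]
vec = binary_vector({"benzene", "amine"}, MOTIFS)
print(vec)
```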

chemical structural similarity – substructure vectors • Second way: IDF • Convert the binary indicator to an inverse document frequency (IDF). The IDF score for substructure motif i (regardless of the chemical): $\mathrm{IDF}_i = \log\left(C / C_i\right)$ Where: C – number of compounds; $C_i$ – number of compounds that contain motif i • Set the value to 0 if the compound does not contain the substructure, and to the motif’s IDF score (> 0) otherwise.
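
The IDF weighting can be sketched as below. The natural logarithm is an assumption (the slide does not state the base), and the compound library is hypothetical.

```python
import math

def idf_scores(compounds):
    """Map each motif to log(C / C_i): C compounds in total, C_i containing the motif."""
    total = len(compounds)
    counts = {}
    for motifs in compounds.values():
        for motif in motifs:
            counts[motif] = counts.get(motif, 0) + 1
    return {motif: math.log(total / count) for motif, count in counts.items()}

def idf_vector(compound_motifs, all_motifs, idf):
    """IDF score where the compound contains the motif, 0 otherwise."""
    return [idf[m] if m in compound_motifs else 0.0 for m in all_motifs]

# Hypothetical compound library: name -> set of motifs it contains.
library = {
    "c1": {"benzene", "amine"},
    "c2": {"benzene"},
    "c3": {"benzene", "hydroxyl"},
    "c4": {"benzene", "amine"},
}
idf = idf_scores(library)
print(idf_vector(library["c1"], ["benzene", "hydroxyl", "amine"], idf))
```

Rare motifs get the largest weights. A motif present in every compound (here `benzene`) gets an IDF of exactly 0, so in that edge case a contained motif is indistinguishable from an absent one.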

chemical structural similarity – substructure vectors • Third way: Binary-IDF • Convert the binary indicator to an inverse document frequency (IDF). • Convert back to binary using a threshold on the IDF value (set 1 if IDF > threshold X, otherwise 0)
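
A minimal sketch of the Binary-IDF conversion, assuming the IDF scores have already been computed. The motif names and IDF values are hypothetical, and the threshold of 2.5 is used only as an example value.

```python
def binary_idf_vector(compound_motifs, all_motifs, idf, threshold):
    """1 only for motifs the compound contains whose IDF exceeds the threshold."""
    return [
        1 if motif in compound_motifs and idf.get(motif, 0.0) > threshold else 0
        for motif in all_motifs
    ]

# Hypothetical precomputed IDF scores: common motifs score low, rare ones high.
idf = {"benzene": 0.3, "hydroxyl": 1.2, "azide": 3.1}
motifs = ["benzene", "hydroxyl", "azide"]
vec = binary_idf_vector({"benzene", "azide"}, motifs, idf, threshold=2.5)
print(vec)
```

The effect is to drop common substructures from the comparison entirely, so that only rare motifs contribute to similarity.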


Calculate chemical structural similarity (2) • For the binary data (first and third ways) the following distance metrics were tested: • Tanimoto (Jaccard) coefficient • Hamming distance • Dice coefficient • For the IDF data (second way) the following were tested: • Cosine distance • Pearson correlation • Spearman correlation • Euclidean distance • Kendall’s Tau • City-block distance
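
The three binary measures are simple to state in code. This sketch uses their standard definitions on two hypothetical substructure vectors; returning 0.0 for two all-zero vectors is a convention chosen here, not something the source specifies.

```python
def tanimoto(a, b):
    """Shared motifs divided by motifs present in either compound."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def dice(a, b):
    """Twice the shared motifs divided by the total motif count of both compounds."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * both / total if total else 0.0

def hamming(a, b):
    """Number of positions where the two vectors differ (a distance, not a similarity)."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Hypothetical binary substructure vectors for two compounds.
drug1 = [1, 1, 0, 1, 0]
drug2 = [1, 0, 0, 1, 1]
print(tanimoto(drug1, drug2), dice(drug1, drug2), hamming(drug1, drug2))
```

Tanimoto and Dice ignore positions where both vectors are 0, whereas Hamming distance counts only mismatches, which matters when most of the 554 positions are empty.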

Calculate chemical structural similarity (2) • The strongest relationship was obtained using Binary-IDF (threshold > 2.5) • The distance metric was the Tanimoto (Jaccard) coefficient • This suggests that structural similarity should be defined by less common substructures.

Define chemical therapeutic (action) use (3) • Use known annotations: • Define a pair of compounds as co-therapeutic if they share an annotation at level 3 of the WHO ATC (Anatomical Therapeutic Chemical) hierarchy, a classification of drug uses.
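
In the ATC system, the first four characters of a code (letter, two digits, letter) identify its level-3 class, so the co-therapeutic test reduces to a prefix comparison. The sketch below assumes each compound comes with a list of full ATC codes; the function names are hypothetical and the example codes are for illustration only.

```python
def atc_level3(code):
    """Level-3 class of an ATC code: its first four characters, e.g. 'A10B'."""
    return code[:4]

def co_therapeutic(codes_a, codes_b):
    """True if the two compounds share at least one ATC level-3 class."""
    classes_a = {atc_level3(code) for code in codes_a}
    classes_b = {atc_level3(code) for code in codes_b}
    return bool(classes_a & classes_b)

# Illustrative codes; a compound may carry several ATC annotations.
print(co_therapeutic(["A10BA02"], ["A10BB01"]))             # both in class A10B
print(co_therapeutic(["A10BA02"], ["B01AC06", "N02BA01"]))  # no shared class
```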

co-inhibition - is it really true? • Counted pairs of compounds that have: • Positive co-inhibition (correlation > 0) • A shared therapeutic class • Measurable structural similarity • From this count: • 70% did not share structural similarity (Tanimoto similarity < 0.2)

co-inhibition – results • Limited correlation between co-inhibition and similar chemical structure.

co-inhibition – results • Significant relationship between shared ATC therapeutic class and co-inhibition

co-inhibition – results • Some observed differences between shared structure and shared therapeutic class
