Sam Danziger Institute For Genomics and Bioinformatics Department of Biomedical Engineering

Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants
Sam Danziger, Institute For Genomics and Bioinformatics, Department of Biomedical Engineering, University of California, Irvine (www.SamDanziger.com)
Jue Zeng, Department of Medicine
Rainer Brachmann, Department of Medicine
Richard Lathrop, Department of Computer Science
University of California, Irvine

Outline
• Overview: Computer Guided Discovery
• Problem: Cancer and p53
• Results: Best Active Learning
• Next: Future Experiments

Computer Guided Discovery of “Active” Mutant Proteins
• Starting Point: A biomedically important protein with some known mutants.
• Problem: Find novel mutant proteins with an “Active” phenotype.
• Naive Solution: Make and test all other possible mutants in the wet lab.
[Figure labels: Known Mutants; Other Possible Mutants]

Why Use Computers?
How Many Mutants are There? Assuming up to 5 mutations in 200 residues: ~10^11
Known Mutants: ~10^2
Spiral Galaxy M101 (http://hubblesite.org/): ~10^9 stars.

A Better Solution: Active Learning
Pick the best unknown mutants to know.
• Known (Training Set): Example 1, Example 2, Example 3, …, Example N
• Unknown: Example N+1, Example N+2, Example N+3, Example N+4, …, Example M
• Loop: Train the Classifier → Choose an Example to Label → Add the New Example to the Training Set → Train the Classifier
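The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the one-dimensional integer "mutants", the threshold classifier `train_threshold`, the selection rule `choose_nearest`, and the `oracle` standing in for the wet lab are all hypothetical.

```python
def active_learning(known, unknown, train, choose, label, rounds):
    """Generic pool-based active learning loop (illustrative sketch)."""
    known = dict(known)        # example -> known label (the training set)
    unknown = list(unknown)    # pool of unlabeled examples
    for _ in range(rounds):
        if not unknown:
            break
        model = train(known)             # train the classifier
        pick = choose(model, unknown)    # choose an example to label
        unknown.remove(pick)
        known[pick] = label(pick)        # add the new example to the training set
    return train(known), known


def train_threshold(known):
    """Toy classifier: midpoint between the largest inactive and smallest active."""
    actives = [x for x, is_active in known.items() if is_active]
    inactives = [x for x, is_active in known.items() if not is_active]
    return (max(inactives) + min(actives)) / 2.0


def choose_nearest(threshold, unknown):
    """Toy selection rule: the unlabeled example closest to the threshold."""
    return min(unknown, key=lambda x: abs(x - threshold))


def oracle(x):
    """Hypothetical stand-in for the wet-lab experiment."""
    return x >= 62


known = {0: False, 100: True}
unknown = [10, 20, 30, 40, 50, 60, 70, 80, 90]
final_threshold, labelled = active_learning(
    known, unknown, train_threshold, choose_nearest, oracle, rounds=3
)
print(final_threshold, labelled)
```

Only three of the nine unlabeled examples are sent to the oracle, yet the learned threshold already sits between the true inactive and active examples in the pool.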

An Example of Active Learning: Minimum Marginal Hyperplane
Should unknown Mutant 1 or Mutant 2 be added to the training set?
[Figure: a hyperplane separating the INACTIVE region (Known Inactive) from the ACTIVE region (Known Active), with Unknown Mutant 1 and Unknown Mutant 2 plotted]
Select Mutant 2
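As a rough sketch of this selection rule, assuming (as in the margin-based methods from the literature) that it picks the unlabeled example closest to the classifier's separating hyperplane: the weights `w`, offset `b`, and the two feature vectors below are invented for illustration and are not taken from the slide.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def minimum_marginal_hyperplane(w, b, unknown):
    """Return the name of the unlabeled example nearest the hyperplane."""
    return min(unknown, key=lambda name: distance_to_hyperplane(w, b, unknown[name]))


# Hypothetical linear classifier: x + y = 4 separates inactive from active.
w, b = (1.0, 1.0), -4.0
unknown = {"Mutant 1": (0.5, 1.0), "Mutant 2": (2.5, 2.0)}

selected = minimum_marginal_hyperplane(w, b, unknown)
print(selected)
```

With these made-up coordinates Mutant 2 lies nearer the boundary, so it is the one whose label the classifier is least sure about.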

Another Example: Maximum Curiosity
Should Mutant 1 or Mutant 2 be added to the training set?
Change in correlation coefficient reported by the cross-validator, compared with the Training Set alone:
• Training Set + Mutant 1 (Active): .0411
• Training Set + Mutant 1 (Inactive): -.6014
• Training Set + Mutant 2 (Active): .0309
• Training Set + Mutant 2 (Inactive): .0276
Select Mutant 1
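The selection can be reproduced from the four numbers on this slide. The slide does not say how the "Active" and "Inactive" assumptions are combined for each mutant; the sketch below assumes the larger of the two changes is used as the mutant's score, which is consistent with the slide's choice of Mutant 1 (averaging the two would instead favor Mutant 2).

```python
# Change in correlation coefficient, copied from the slide.
changes = {
    ("Mutant 1", "Active"): 0.0411,
    ("Mutant 1", "Inactive"): -0.6014,
    ("Mutant 2", "Active"): 0.0309,
    ("Mutant 2", "Inactive"): 0.0276,
}


def curiosity_scores(changes):
    """Score each mutant by its best change over the assumed labels (an assumption)."""
    scores = {}
    for (mutant, _assumed_label), delta in changes.items():
        scores[mutant] = max(delta, scores.get(mutant, float("-inf")))
    return scores


scores = curiosity_scores(changes)
selected = max(scores, key=scores.get)
print(scores, selected)
```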

A Third Example: Entropic Tradeoff
[Figure: legend Known Active, Known Inactive, Unclassified, Selected Unclassified; regions labeled INACTIVE and ACTIVE; several points marked OK]

Which is the Best Active Learning Method?
TYPE I: Select mutants that most improve the classifier if correctly predicted.
• Maximum Curiosity
• Composite Classifier
• Improved Composite Classifier
TYPE II: Select mutants that most improve the classifier.
• Additive Curiosity
• Additive Bayesian Surprise
TYPE III: Common methods taken from the literature.
• Minimum Marginal Hyperplane
• Maximum Entropy
TYPE IV: Variations on methods from the literature.
• Maximum Marginal Hyperplane
• Minimum Entropy
• Entropic Tradeoff
TYPE C: Controls
• Non-iterated Prediction
• Predict All Inactive
• Random (30 trials)

The Problem: p53 and Cancer
p53 mutations occur in ~50% of human cancers
• Tumor Suppressor Protein.
• Receives upstream signals indicating cellular stress.
• Acts as a transcription factor in the cancer suppression pathway.
[Figure: p53 core domain bound to DNA. Image generated with UCSF Chimera.]
Cho, Y., Gorina, S., Jeffrey, P.D., Pavletich, N.P. Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science, v. 265, pp. 346-355, 1994.

The p53 Cancer Pathway David W. Meek: http://www.dundee.ac.uk/biomedres/meek.htm

The Concept of “Cancer Rescue”: Second-site Suppressor Mutations
[Figure: p53 domain map from N to C terminus: Transactivation (1-42), Core domain for DNA binding (102-292), Tetramerization (324-355); labeled positions 175, 245, 248, 249, 273, 282, and 235+240]
Cancer mutation prevalence data from the IARC p53 database: http://www-p53.iarc.fr/

Immediate Goal, Intermediate Goal, Ultimate Goal
• Find novel p53 Cancer Rescue Mutants.
• Advance medical practice by revealing p53 mutant functional properties across p53’s mutation sequence space.
• Inactive p53 Cancer Mutant + Engineered Small Molecule Drug = Functionally Active Rescued p53

Evaluating Cancer Rescue Mutants in the Wet Lab
• INACTIVE: Yeast containing an inactive p53 cancer mutant will not grow.
• ACTIVE: Yeast containing an active p53 cancer rescue mutant will grow.
Baroni, T.E., Wang, T., Qian, H., Dearth, L.R., Truong, L.N., Zeng, J., Denes, A.E., Chen, S.W. and Brachmann, R.K. (2004) A global suppressor motif for p53 cancer mutants. Proc Natl Acad Sci U S A, 101, 4930-5.

In Vitro Phenotype

In a Nutshell
[Figure: Knowledge, Model, Experiment cycle around Cancer Rescue Mutants]
• Build classifiers of putative p53 cancer rescue mutants.
• Use Active Learning to select the p53 mutants that will be the most informative.
• Test the predictions in-vitro.
• Find all p53 cancer rescue mutants.

The Active Learning Tradeoff: How Fast Does It Learn?

The Active Learning Tradeoff: How Accurate On The Chosen?

The Tradeoff
[Figure: Entropic Tradeoff, Maximum Curiosity, and Minimum Marginal Hyperplane plotted on the axes “How Fast Does It Learn?” and “How Accurate on the Chosen?”]
• Geometric Distance?
• Area? Length * Width
• Sum? Length + Width
Solution: Average Score of All Three Metrics
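One way to read "Average Score of All Three Metrics" is sketched below. It assumes that "Length" and "Width" are a method's two axis values, that the geometric distance is measured from the origin, and that the three raw metrics are averaged directly; the slide does not say whether scores or ranks are averaged. The method names and values are hypothetical.

```python
import math


def combined_score(speed, accuracy):
    """Average of geometric distance, area, and sum for one method."""
    distance = math.hypot(speed, accuracy)   # geometric distance from the origin
    area = speed * accuracy                  # Length * Width
    total = speed + accuracy                 # Length + Width
    return (distance + area + total) / 3.0


# Hypothetical (how fast it learns, how accurate on the chosen) values.
methods = {
    "method_a": (0.9, 0.2),
    "method_b": (0.6, 0.7),
}

ranking = sorted(methods, key=lambda name: combined_score(*methods[name]), reverse=True)
print(ranking)
```

Averaging three metrics keeps any single way of collapsing the two axes from deciding the ranking on its own.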

The Overall Best

How Fast Does It Learn? The Three Previous Examples

How Accurate On The Chosen? The Three Previous Examples

Why Does Random Do So Well?
Very Few Examples
Tong, S. and D. Koller (2002). "Support vector machine active learning with applications to text classification." The Journal of Machine Learning Research 2: 45-66.

Exploring New p53 Regions
• Each new p53 region potentially introduces new rescue mechanisms.
• New pools of mutants restart the Active Learning problem.
[Figure: p53 Core Domain from N to C terminus, with labeled positions 175, 245, 248, 273, 282 and regions 113-124 and 281-289]

Most Interesting or Most Interesting Active?
Which Finds More Active Cancer Rescue Mutants?
• Select The Most Interesting
• Select The Most Interesting Active
[Figure: starting from the Known Mutants, each strategy is shown over Iteration 1, Iteration 2, and Iteration 3]

Conclusion
[Figure: Knowledge, Theory, Experiment cycle: Find Cancer Rescue Mutants]

Acknowledgments
• Baldi Lab: Pierre Baldi, Jonathan Chen, Hiroto Saigo, S. Joshua Swamidass
• Lathrop Lab: Richard Lathrop, Gabe Moothart, Ying Wang
• Leuke Lab
• Luo Lab: Ray Luo, Qiang Lu
• Brachmann Lab: Rainer Brachmann, Jue Zeng
Funding: National Institutes of Health (p53: CA112560), UCI Office of Research and Graduate Studies, UCI Institute for Genomics and Bioinformatics (BIT: LM007443), US Department of Energy (DOE)

Questions?
[Figure: Knowledge, Theory, Experiment cycle: Find Cancer Rescue Mutants]

Most Interesting Region
• Scan the p53 core domain to find the most interesting region.

Create All Single Point Mutations in a Region in-vitro? CODA*: Assemble p53 using thermodynamically optimized oligonucleotides. Allow all possible mutations within a region. Assemble mutated region with cancer mutants to look for rescue mutants. *http://www.codagenomics.com/

Knowledge Representation: Homology Modeling
Modeling done using Amber™ with zinc ion characteristics tuned by Dr. Qiang Lu working in Dr. Ray Luo’s lab.
1. Take a wild type crystal structure of the protein in question.
2. Substitute one or more amino acids to mutate the protein.
3. Apply simulated physical laws to determine an energy function.
4. Minimize the energy of the new mutant protein.

Knowledge Representation: Features
Simulated Structure -> String of Numbers
• 1d: Sequence Mutation Features
• s1d: Sequence Similarity Features
• 2d: Surface Map Features
• 3d: Atomic Position Features
• 4d: “Time Dependent” Stability Information

What is Machine Learning?
Training: Set the parameters (W) with n features.
Testing: Use the parameters (W) to predict unclassified examples.
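To make the training and testing split concrete, here is a minimal sketch using a perceptron as a stand-in classifier. The slide does not name a specific learner, so the perceptron, the two-feature examples, and the labels are all illustrative assumptions.

```python
def predict(w, x):
    """Testing: use the parameters W (n weights plus a bias) to classify x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1] > 0


def train(examples, labels, epochs=20):
    """Training: set the parameters W from examples with n features each."""
    n = len(examples[0])
    w = [0.0] * (n + 1)                    # n feature weights, then the bias
    for _ in range(epochs):
        for x, is_active in zip(examples, labels):
            if predict(w, x) != is_active:  # update only on mistakes
                sign = 1.0 if is_active else -1.0
                for i in range(n):
                    w[i] += sign * x[i]
                w[n] += sign
    return w


# Hypothetical feature vectors; only the last one is "Active".
examples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [False, False, False, True]

w = train(examples, labels)
predictions = [predict(w, x) for x in examples]
print(w, predictions)
```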

Modeling: How To Use It
• Biology: Make a protein and test it in-vitro. PRO: Real. CON: Slow.
• Computer Generated Structure: Predict a protein structure in-silico. PRO: Fast. CON: Inaccurate, what does it tell us?
• Machine Learning: Use Homology Modeling to guide biological research.

Maximum Curiosity
[Figure: Knowledge, Model, Experiment cycle]
Find the Mutants that Most Improve the Training Set
1. Start with a training set of examples with known classes and an unclassed test set.
2. Choose a mutant from the test set that has not been considered yet.
3. Assume the chosen mutant is “Active” or “Inactive”.
4. Cross-validate the training set with the chosen mutant and record the correlation coefficient.
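The procedure above can be sketched end to end on toy data. Everything specific here is an assumption made for illustration: mutants are single numbers, the classifier is a 1-nearest-neighbour rule, cross-validation is leave-one-out, and the correlation coefficient is computed as the Matthews correlation. The slides do not specify any of these.

```python
import math


def correlation(pairs):
    """Matthews correlation between (predicted, true) boolean label pairs."""
    tp = sum(1 for p, t in pairs if p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cross_validate(data):
    """Leave-one-out cross-validation with a 1-nearest-neighbour stand-in."""
    pairs = []
    for i, (x, true_label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        nearest = min(rest, key=lambda item: abs(item[0] - x))
        pairs.append((nearest[1], true_label))
    return correlation(pairs)


def maximum_curiosity(training, candidates):
    """Return the candidate (and gain) that most improves the cross-validated score."""
    baseline = cross_validate(training)
    best, best_gain = None, float("-inf")
    for x in candidates:                    # each mutant not yet considered
        for assumed in (True, False):       # assume "Active" or "Inactive"
            gain = cross_validate(training + [(x, assumed)]) - baseline
            if gain > best_gain:
                best, best_gain = x, gain
    return best, best_gain


# Hypothetical training set (feature, is_active) and unclassed test set.
training = [(1.0, False), (2.0, False), (5.0, True), (9.0, True)]
candidates = [6.0, 1.5]

best, best_gain = maximum_curiosity(training, candidates)
print(best, round(best_gain, 4))
```

In this toy run the candidate at 6.0 wins because, if it is active, it supplies the neighbour that the isolated active example at 5.0 was missing.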


Primary Collaborators
• Dr. Richard Lathrop, School of Information and Computer Science
• Jue Zeng, School of Medicine
• Dr. Rainer Brachmann, School of Medicine
