Machine Learning Challenges in Location Proteomics

Machine Learning Challenges in Location Proteomics Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated Learning and Discovery Carnegie Mellon University

Protein characteristics relevant to systems approach • sequence • structure • expression level • activity • partners • location

Subcellular locations from major protein databases • Giantin • Entrez: /note="a new 376kD Golgi complex outher membrane protein" • SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE. • GPP130 • Entrez: /note="GPP130; type II Golgi membrane protein” • SwissProt: nothing

More questions than answers • We learned that Giantin and GPP130 are both Golgi proteins, but do we know: • What part (i.e., cis, medial, trans) of the Golgi complex they each are found in? • If they have the same subcellular distribution? • If they also are found in other compartments?

Vocabulary is part of the problem • Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns • Efforts to create restricted vocabularies (e.g., Gene Ontology consortium) for location have been made

SWALL entries for giantin and gpp130 ID GIAN_HUMAN STANDARD; PRT; 3259 AA. AC Q14789; Q14398; GN GOLGB1. DR GO; GO:0000139; C:Golgi membrane; TAS. DR GO; GO:0005795; C:Golgi stack; TAS. DR GO; GO:0016021; C:integral to membrane; TAS. DR GO; GO:0007030; P:Golgi organization and biogenesis; TAS. ID O00461 PRELIMINARY; PRT; 696 AA. AC O00461; GN GPP130. DR GO; GO:0005810; C:endocytotic transport vesicle; TAS. DR GO; GO:0005801; C:Golgi cis-face; TAS. DR GO; GO:0005796; C:Golgi lumen; TAS. DR GO; GO:0016021; C:integral to membrane; TAS.

Words are not enough • Still don’t know how similar the locations patterns of these proteins are • Restricted vocabularies do not provide the necessary complexity and specificity

Needed: Systematic Approach • Need to advance past “cartoon” view of subcellular location • Need systematic, quantitative approach to protein location • Need new methods for accurately and objectively determining the subcellular location pattern of all proteins • Distinct from drug screening by low-resolution microscopy

First Decision Point • Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since • different cells have different shapes, sizes, orientations • organelles within cells are not found in fixed locations • Therefore, use feature-based methods rather than (pixel) model-based methods

Input Images • Created 2D image database for HeLa cells • Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules • Included classes that are similar to each other

Example 2D Images of HeLa

Features: SLF • Developed sets of Subcellular Location Features (SLF) containing features of different types • Motivated in part by descriptions used by biologists (e.g., punctate, perinuclear) • First type of features derived from morphological image processing - finding objects by automated thresholding

Features: Morphological • Number of fluorescent objects per cell • Variance of the object sizes • Ratio of the largest object to the smallest • Average distance of objects to the ‘center of fluorescence’ • Average “roundness” of objects

Features: Haralick texture • Give information on correlations in intensity between adjacent pixels to answer questions like • is the pattern more like a checkerboard or alternating stripes? • is the pattern highly organized (ordered) or more scattered (disordered)?

Example: Difference detected by texture feature “entropy”

Features: Zernike moment • Measure degree to which pattern matches a particular Zernike polynomial • Give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern

Examples of Zernike Polynomials Z(2,0) Z(4,4) Z(10,6)

Subcellular Location Features: 2D • Morphological features • Haralick texture features • Zernike moment features • Geometric features • Edge features

2D Classification Results Overall accuracy = 92% (95% for major patterns)

Human Classification Results Overall accuracy = 83% (92% for major patterns)

Computer vs. Human

Extending to 3D: Labeling approach • Total protein labeled with Cy5 reactive dye • DNA labeled with PI • Specific Proteins labeled with primary Ab + Alexa488 conjugated secondary Ab

3D Image Set Nuclear ER Giantin gpp130 Lysosomal Mitoch. Nucleolar Actin Endosomal Tubulin

New features to measure “z” asymmetry • 2D features treated x and y equivalently • For 3D images, while it makes sense to treat x and y equivalently (cells don’t have a “left” and “right”, z should be treated differently (“top” and “bottom” are not the same) • We designed features to separate distance measures into x-y component and z component

Classification Results for 3D images Overall accuracy = 97%

How to do even better • Biologists interpreting images of protein localization typically view many cells before reaching a conclusion • Can simulate this by classifying sets of cells from the same microscope slide

Predicted Class DNA ER Gia Gpp Lyso Mito Nucl Actin Endo Tub DNA 100 0 0 0 0 0 0 0 0 0 ER 0 99 0 0 0 0 0 0 0 0 Gia 0 0 100 0 0 0 0 0 0 0 Gpp 0 0 0 99 0 0 0 0 0 0 True Class Lyso 0 0 0 0 100 0 0 0 0 0 Mito 0 0 0 0 0 100 0 0 0 0 Nucle 0 0 0 0 0 0 100 0 0 0 Actin 0 0 0 0 0 0 0 100 0 0 Endo 0 0 0 0 0 0 0 0 100 0 Tub 0 0 0 0 0 0 0 0 0 99 Classification of Sets of 3D Images Set size 9, Overall accuracy = 99.7%

First Conclusion • Description of subcellular locations for systems biology should be implemented using a data-driven approach rather than a knowledge-capture approach, but…

Entity extraction proteins, cells, drugs, experimental conditions, … Scope Annotated Scopes Caption Caption understanding ImagePtr Figure [Cohen et al, 2003] Label Matching alignment between caption entities and panels Label finding Panel labels Panel splitting image type, image scale, subcellular pattern analysis… ] [Murphy et al, 2001] Panels Annotated Panels Panel classification, Micrograph analysis [Murphy et al, 2001] Subcellular Location Image Finder • (Have automated system for finding images in on-line journal articles that match a particular pattern - enables connection between new images and previously published results)

Image Similarity • Classification power of features implies that they capture essential characteristics of protein patterns • Can be used to measure similarity between patterns

Clustering by Image Similarity • Ability to measure similarity of protein patterns allows us for the first time to create a systematic, objective, framework for describing subcellular locations • Ideal for database references • One way is by creating a Subcellular Location Tree • Illustration: Build hierarchical dendrogram

Subcellular Location Tree for 10 classes in HeLa cells

Do this for all proteins: Location Proteomics • Can use CD-tagging (developed by Dr. Jonathan Jarvik) to randomly tag many proteins: Infect population of cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random gene in each cell • Isolate separate clones, each of which produces express one tagged protein • Use RT-PCR to identify tagged gene in each clone • Collect images of many cells for each clone using fluorescence microscopy

Example images of CD-tagged clones Glut1 gene (type 1 glucose transporter) Tmpo gene (thymopoietin  tuba1 gene (-tubulin) Cald gene (caldesmon 1) Ncl gene (nucleolin) Rps11 gene (ribosomal protein S11) Hmga1 gene (high mobility group AT-hook 1) Col1a2 gene (procollagen type I 2) Atp5a1 gene (ATP synthase isoform 1)

Proof of principle • Cluster 46 clones expressing different tagged proteins based on their subcellular location patterns

Feature selection • Use Stepwise Discriminant Analysis to rank features based on their ability to distinguish proteins • Use increasing numbers of features to train neural network classifiers and evaluate classification accuracy over all 46 clones • Best performance obtained with 10 features

Tree building • Therefore use these 10 features with z-scored Euclidean distance function to build SLT • Find optimal number of clusters using k-means clustering and AIC • Find consensus hierarchical trees by randomly dividing the images for each protein in half and keeping branches conserved between both halves (repeat for 50 random divisions)

Consensus Subcellular Location Tree

Examples from major clusters

Significance • Proteins clustered by location analogous to proteins clustered by sequence (e.g., PFAM) • Can subdivide clusters by observing response to drugs, oncogenes, etc. • These represent protein location states • Base knowledge required for modeling • Can be used to filter protein interactions

From patterns to causes • Machine learning approaches have been previously used to find localization motifs in protein sequences, but the set of locations used was limited to major organelles • High-resolution subcellular location trees can be used to discover (recursively) new motifs that determine location of each group • Can include post-translational modifications

More Conclusions • Organized data collection approach is required to capture high-resolution information on the subcellular location of all proteins • Prohibitive combinatorial complexity make colocalization approach infeasible, so major effort should focus on one protein at a time

Center for Bioimage Informatics • $2.75 M CMU funding from NSF ITR • Joint with UCSB and collaborators at Berkeley and MIT • R. Murphy (CALD/Biomed.Eng./Biol.Sci.) • Jelena Kovacevic (Biomedical Engineering) • Tom Mitchell (CALD) • Christos Faloutsos (CALD)

Former students Michael Boland, Mia Markey, William Dirks, Gregory Porreca, Edward Roques, Meel Velliste Current grad students Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu, Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua Funding NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco Settlement Fund Collaborators/Consultants Simon Watkins, David Cassasent, Tom Mitchell, Christos Faloutsos, Jon Jarvik, Peter Berget Acknowledgments

Machine Learning Challenges in Location Proteomics

Machine Learning Challenges in Location Proteomics

Presentation Transcript

Topics in Machine Learning

Machine Learning in Bioinformatics

Machine Learning

Machine Learning

MACHINE LEARNING

Machine Learning in DryadLINQ

Machine learning in IDS

Submodularity in Machine Learning

Machine Learning Challenges in Location Proteomics

LOCATION AND OTHER CHALLENGES

Challenges in Location-Aware Computing

Machine Learning in realtime

Machine Learning in GATE

Machine Learning Challenges Comp Bio 02-750

Experiments in Machine Learning

Evaluation in Machine Learning

Machine Learning in Football

Machine learning Courses | Machine Learning Training

LOCATION AND OTHER CHALLENGES

Experiments in Machine Learning

Machine learning in IDS

Machine Learning Projects | Machine Learning Applications | Machine Learning Training | Simplilearn