1 / 44

Machine Learning Challenges in Location Proteomics

Machine Learning Challenges in Location Proteomics. Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated Learning and Discovery Carnegie Mellon University. Protein characteristics relevant to systems approach. sequence structure

Télécharger la présentation

Machine Learning Challenges in Location Proteomics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Machine Learning Challenges in Location Proteomics Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated Learning and Discovery Carnegie Mellon University

  2. Protein characteristics relevant to systems approach • sequence • structure • expression level • activity • partners • location

  3. Subcellular locations from major protein databases • Giantin • Entrez: /note="a new 376kD Golgi complex outher membrane protein" • SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE. • GPP130 • Entrez: /note="GPP130; type II Golgi membrane protein” • SwissProt: nothing

  4. More questions than answers • We learned that Giantin and GPP130 are both Golgi proteins, but do we know: • What part (i.e., cis, medial, trans) of the Golgi complex they each are found in? • If they have the same subcellular distribution? • If they also are found in other compartments?

  5. Vocabulary is part of the problem • Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns • Efforts to create restricted vocabularies (e.g., Gene Ontology consortium) for location have been made

  6. SWALL entries for giantin and gpp130 ID GIAN_HUMAN STANDARD; PRT; 3259 AA. AC Q14789; Q14398; GN GOLGB1. DR GO; GO:0000139; C:Golgi membrane; TAS. DR GO; GO:0005795; C:Golgi stack; TAS. DR GO; GO:0016021; C:integral to membrane; TAS. DR GO; GO:0007030; P:Golgi organization and biogenesis; TAS. ID O00461 PRELIMINARY; PRT; 696 AA. AC O00461; GN GPP130. DR GO; GO:0005810; C:endocytotic transport vesicle; TAS. DR GO; GO:0005801; C:Golgi cis-face; TAS. DR GO; GO:0005796; C:Golgi lumen; TAS. DR GO; GO:0016021; C:integral to membrane; TAS.

  7. Words are not enough • Still don’t know how similar the locations patterns of these proteins are • Restricted vocabularies do not provide the necessary complexity and specificity

  8. Needed: Systematic Approach • Need to advance past “cartoon” view of subcellular location • Need systematic, quantitative approach to protein location • Need new methods for accurately and objectively determining the subcellular location pattern of all proteins • Distinct from drug screening by low-resolution microscopy

  9. First Decision Point • Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since • different cells have different shapes, sizes, orientations • organelles within cells are not found in fixed locations • Therefore, use feature-based methods rather than (pixel) model-based methods

  10. Input Images • Created 2D image database for HeLa cells • Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules • Included classes that are similar to each other

  11. Example 2D Images of HeLa

  12. Features: SLF • Developed sets of Subcellular Location Features (SLF) containing features of different types • Motivated in part by descriptions used by biologists (e.g., punctate, perinuclear) • First type of features derived from morphological image processing - finding objects by automated thresholding

  13. Features: Morphological • Number of fluorescent objects per cell • Variance of the object sizes • Ratio of the largest object to the smallest • Average distance of objects to the ‘center of fluorescence’ • Average “roundness” of objects

  14. Features: Haralick texture • Give information on correlations in intensity between adjacent pixels to answer questions like • is the pattern more like a checkerboard or alternating stripes? • is the pattern highly organized (ordered) or more scattered (disordered)?

  15. Example: Difference detected by texture feature “entropy”

  16. Features: Zernike moment • Measure degree to which pattern matches a particular Zernike polynomial • Give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern

  17. Examples of Zernike Polynomials Z(2,0) Z(4,4) Z(10,6)

  18. Subcellular Location Features: 2D • Morphological features • Haralick texture features • Zernike moment features • Geometric features • Edge features

  19. 2D Classification Results Overall accuracy = 92% (95% for major patterns)

  20. Human Classification Results Overall accuracy = 83% (92% for major patterns)

  21. Computer vs. Human

  22. Extending to 3D: Labeling approach • Total protein labeled with Cy5 reactive dye • DNA labeled with PI • Specific Proteins labeled with primary Ab + Alexa488 conjugated secondary Ab

  23. 3D Image Set Nuclear ER Giantin gpp130 Lysosomal Mitoch. Nucleolar Actin Endosomal Tubulin

  24. New features to measure “z” asymmetry • 2D features treated x and y equivalently • For 3D images, while it makes sense to treat x and y equivalently (cells don’t have a “left” and “right”, z should be treated differently (“top” and “bottom” are not the same) • We designed features to separate distance measures into x-y component and z component

  25. Classification Results for 3D images Overall accuracy = 97%

  26. How to do even better • Biologists interpreting images of protein localization typically view many cells before reaching a conclusion • Can simulate this by classifying sets of cells from the same microscope slide

  27. Predicted Class DNA ER Gia Gpp Lyso Mito Nucl Actin Endo Tub DNA 100 0 0 0 0 0 0 0 0 0 ER 0 99 0 0 0 0 0 0 0 0 Gia 0 0 100 0 0 0 0 0 0 0 Gpp 0 0 0 99 0 0 0 0 0 0 True Class Lyso 0 0 0 0 100 0 0 0 0 0 Mito 0 0 0 0 0 100 0 0 0 0 Nucle 0 0 0 0 0 0 100 0 0 0 Actin 0 0 0 0 0 0 0 100 0 0 Endo 0 0 0 0 0 0 0 0 100 0 Tub 0 0 0 0 0 0 0 0 0 99 Classification of Sets of 3D Images Set size 9, Overall accuracy = 99.7%

  28. First Conclusion • Description of subcellular locations for systems biology should be implemented using a data-driven approach rather than a knowledge-capture approach, but…

  29. Entity extraction proteins, cells, drugs, experimental conditions, … Scope Annotated Scopes Caption Caption understanding ImagePtr Figure [Cohen et al, 2003] Label Matching alignment between caption entities and panels Label finding Panel labels Panel splitting image type, image scale, subcellular pattern analysis… ] [Murphy et al, 2001] Panels Annotated Panels Panel classification, Micrograph analysis [Murphy et al, 2001] Subcellular Location Image Finder • (Have automated system for finding images in on-line journal articles that match a particular pattern - enables connection between new images and previously published results)

  30. Image Similarity • Classification power of features implies that they capture essential characteristics of protein patterns • Can be used to measure similarity between patterns

  31. Clustering by Image Similarity • Ability to measure similarity of protein patterns allows us for the first time to create a systematic, objective, framework for describing subcellular locations • Ideal for database references • One way is by creating a Subcellular Location Tree • Illustration: Build hierarchical dendrogram

  32. Subcellular Location Tree for 10 classes in HeLa cells

  33. Do this for all proteins: Location Proteomics • Can use CD-tagging (developed by Dr. Jonathan Jarvik) to randomly tag many proteins: Infect population of cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random gene in each cell • Isolate separate clones, each of which produces express one tagged protein • Use RT-PCR to identify tagged gene in each clone • Collect images of many cells for each clone using fluorescence microscopy

  34. Example images of CD-tagged clones Glut1 gene (type 1 glucose transporter) Tmpo gene (thymopoietin  tuba1 gene (-tubulin) Cald gene (caldesmon 1) Ncl gene (nucleolin) Rps11 gene (ribosomal protein S11) Hmga1 gene (high mobility group AT-hook 1) Col1a2 gene (procollagen type I 2) Atp5a1 gene (ATP synthase isoform 1)

  35. Proof of principle • Cluster 46 clones expressing different tagged proteins based on their subcellular location patterns

  36. Feature selection • Use Stepwise Discriminant Analysis to rank features based on their ability to distinguish proteins • Use increasing numbers of features to train neural network classifiers and evaluate classification accuracy over all 46 clones • Best performance obtained with 10 features

  37. Tree building • Therefore use these 10 features with z-scored Euclidean distance function to build SLT • Find optimal number of clusters using k-means clustering and AIC • Find consensus hierarchical trees by randomly dividing the images for each protein in half and keeping branches conserved between both halves (repeat for 50 random divisions)

  38. Consensus Subcellular Location Tree

  39. Examples from major clusters

  40. Significance • Proteins clustered by location analogous to proteins clustered by sequence (e.g., PFAM) • Can subdivide clusters by observing response to drugs, oncogenes, etc. • These represent protein location states • Base knowledge required for modeling • Can be used to filter protein interactions

  41. From patterns to causes • Machine learning approaches have been previously used to find localization motifs in protein sequences, but the set of locations used was limited to major organelles • High-resolution subcellular location trees can be used to discover (recursively) new motifs that determine location of each group • Can include post-translational modifications

  42. More Conclusions • Organized data collection approach is required to capture high-resolution information on the subcellular location of all proteins • Prohibitive combinatorial complexity make colocalization approach infeasible, so major effort should focus on one protein at a time

  43. Center for Bioimage Informatics • $2.75 M CMU funding from NSF ITR • Joint with UCSB and collaborators at Berkeley and MIT • R. Murphy (CALD/Biomed.Eng./Biol.Sci.) • Jelena Kovacevic (Biomedical Engineering) • Tom Mitchell (CALD) • Christos Faloutsos (CALD)

  44. Former students Michael Boland, Mia Markey, William Dirks, Gregory Porreca, Edward Roques, Meel Velliste Current grad students Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu, Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua Funding NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco Settlement Fund Collaborators/Consultants Simon Watkins, David Cassasent, Tom Mitchell, Christos Faloutsos, Jon Jarvik, Peter Berget Acknowledgments

More Related