Next-Generation Bioinformatics Systems

Jelena Kovačević, Center for Bioimage Informatics, Department of Biomedical Engineering, Carnegie Mellon University

Acknowledgments Current PhD students PhD students Funding Amina Chebira Tad Merryman Gowri Srinivasa Doru Cristian Balcan Elvira Garcia Osuna Pablo Hennings Yeomans Jason Thornton Collaborators Undergrads Vijaykumar Bhagavatula Geoff Gordon José Moura Bob Murphy Markus Püschel Marios Savvides Lionel Coulot Woon Ho Jung Heather Kirshner

Goal • Imaging in systems biology • Use informatics to acquire, store, manipulate and share large bioimaging databases • Leads to automated, efficient and robust processing • Needs a host of sophisticated tools from many areas [Diagram blocks: Application area, Acquisition, Knowledge Extraction, Computation]

Application Areas • Bioimaging • Current focus in biology: mapping out the protein landscape • Fluorescence microscopy used to gather data on subcellular events ► • Biometrics • Biosensing for providing security • to the financial industry • at US borders • Use a person’s biometric characteristic to identify/verify ►

Acquisition • Issues • z-stacks and time series resolution • Context-dependent • Slow-changing process needs to be acquired with coarser resolution • Changes need to be detected and reacted to • Efficiency of acquisition • Acquire only where and when needed → adaptivity • Sample question • How can we efficiently acquire fluorescence microscopy images? ►

Knowledge Extraction • Sample questions • How can we automatically and efficiently classify proteins based on images of their subcellular locations? ► • How can we identify/verify a person’s identity based on his/her biometric characteristic? ► • Toolbox needed to solve the problem • Signal processing/data mining • Multiresolution tools allow for adaptive and efficient processing ►

Computation • The problem: fast numerical software • Hard to write fast code • Best code is platform-dependent • Code becomes obsolete as fast as it is written [Figure labels: reasonable implementation; vendor library or SPIRAL generated; 10x]

The Solution • Automatic generation and optimization of numerical software • Tuning of implementation and algorithm • A new breed of intelligent SW design tools • SPIRAL: a prototype for the domain of DSP algorithms ► SPIRAL: Code Generation for DSP Algorithms, www.spiral.net [Diagram: DSP transform (user specified) → Formula generator → fast algorithm as SPL formula → Formula translator → C/Fortran program → platform-adapted code; a search engine controls the formula generator and the formula translator, driven by runtime on the given platform]

Bioimaging • Acquisition • How can we efficiently acquire fluorescence microscopy images? ► • Knowledge extraction • How can we automatically and efficiently classify proteins based on images of their subcellular locations? ► • Computation • Automatic code generation and optimization ►

Motivation • Current focus in biological sciences • System-wide research “omics” • Human genome project • Next frontier • Proteomics • Subcellular location one of major components • Grand challenge • Develop an intelligent next-generation bioimaging system capable of fast, robust and accurate classification of proteins based on images of their subcellular locations

MR Acquisition of Fluorescence Microscopy Images • Problem • Why acquire in areas of low fluorescence? • Acquire only when and where needed • Measure of success • Problem dependent • Here: strive to maintain the achieved classification accuracy • Efficient acquisition leads to • Faster acquisition • Possibility of increasing acquisition resolution • Possible increase in classification accuracy due to increased resolution [Image label: ER]

MR Acquisition of Fluorescence Microscopy Images • Approach • Develop algorithm on an acquired data set at maximum resolution • Implement a microscope’s scanning protocol • Algorithm: mimic the “Battleship” strategy • Acquire around the hits [Image labels: 2D, 3D]

Algorithm: Details [Flowchart: Initialize probe locations → Probe → Intensity > T? If yes, add probe locations. → Probe locations left? If yes, probe again; if no, stop. Figure labels: 2l, N, M]
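The flowchart can be read as a flood-fill style probing loop. Below is a minimal Python sketch under stated assumptions, not the authors' implementation: it simulates acquisition on an already acquired full-resolution 2D image (as the approach slide suggests), starts from a coarse grid with a hypothetical `spacing`, and adds the eight neighboring pixels as new probe locations whenever a probe exceeds the threshold. The function name, the 8-neighborhood and the parameter values are illustrative assumptions.

```python
import numpy as np

def adaptive_acquire(image, spacing, threshold):
    """Probe a coarse grid, then keep probing around every hit.

    Returns the acquired image (zeros where nothing was probed) and a
    boolean mask of probed locations.
    """
    rows, cols = image.shape
    acquired = np.zeros_like(image)
    probed = np.zeros(image.shape, dtype=bool)

    # "Initialize probe locations": a coarse regular grid (assumed layout).
    pending = [(r, c) for r in range(0, rows, spacing)
                      for c in range(0, cols, spacing)]

    while pending:                          # "Probe locations left?"
        r, c = pending.pop()
        if probed[r, c]:
            continue
        probed[r, c] = True
        acquired[r, c] = image[r, c]        # "Probe"
        if image[r, c] > threshold:         # "Intensity > T?"
            # "Add probe locations": the 8 neighbors of the hit (assumption).
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and not probed[rr, cc]:
                        pending.append((rr, cc))
    return acquired, probed

if __name__ == "__main__":
    demo = np.zeros((64, 64))
    demo[10:18, 20:28] = 1.0                # one bright synthetic structure
    _, mask = adaptive_acquire(demo, spacing=4, threshold=0.5)
    print("fraction of samples acquired:", mask.mean())
```

The fraction of probed samples is the quantity the later slides trade off against distortion and classification accuracy. A structure that falls entirely between grid points is missed, so the grid spacing bounds the smallest object the scheme is guaranteed to find.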

Trade-Offs • What will we lose? • Scanning simplicity • What will we gain? • Faster acquisition process • Time is proportional to the savings in samples • Need to take into account the time to operate scanning unit • Higher resolution in 3D • The laser intensity can be reduced • Reduces photobleaching • Some sources indicated linear relationship, some other

Results in 3D [Figure: mitochondrial data set, compression versus distortion; MSE against fraction of samples kept (percent / 100) for the MR sampling algorithm and the trivial approach. Panels: approximation and difference image for the MR algorithm (9.81:1) and the trivial approach (9:1)]

Results in 2D [Figure: accuracy [%] versus compression ratio]

Current and Future Work • Implementation issues • Can one operate galvo-mirrors fast enough to capitalize on the gain? • Algorithmic issues • Add knowledge from classification (feedback) • Build models http://www.olympusconfocal.com/theory/confocalintro.html

Funding and References • Funding • NSF-0331657, “Next-Generation Bio-Molecular Imaging and Information Discovery,” NSF, $2,500,000, 10/03-9/08. Co-PI. • Journal papers • T.E. Merryman and J. Kovačević, “An adaptive multirate algorithm for acquisition of fluorescence microscopy data sets," IEEE Trans. Image Proc., special issue on Molecular and Cellular Bioimaging, September 2005. • Conference papers • T.E. Merryman, J. Kovačević, E.G. Osuna and R.F. Murphy, "Adaptive multirate data acquisition of 3D cell images," Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., Philadelphia, PA, March 2005.

MR Classification of Proteins • Why MR? • Introduction of simple MR features produced a statistically significant jump in accuracy • Introduce adaptivity with little computational cost [Diagram blocks: Segmentation, Classification, Knowledge Extraction. Image caption: This is tubulin]

Data Sets • 3D HeLa ► • 2D HeLa ► • 3T3 ► Huang & Murphy, Journal of Biomedical Optics 9(5), 893–912, 2004

3D HeLa Data Set • Cells from Henrietta Lacks (d. 1951, cervical cancer) • Confocal scanning laser microscope (100x) • DNA stain (PI), all-protein stain (Cy5 reactive dye) and fluorescent antibody for a specific protein • 50-58 sets per class • 14-24 2D slices per set • Resolution 0.049 x 0.049 x 0.2 μm • Covers all major subcellular structures ► Huang & Murphy, Journal of Biomedical Optics 9(5), 893–912, 2004

3D HeLa Data Set • Covers all major subcellular structures ► • Golgi apparatus (giantin, gpp130) • Cytoskeleton (actin, tubulin) • Endoplasmic reticulum membrane (ER) • Lysosomes (LAMP2) • Endosomes (transferrin receptor) • Nucleus (nucleolin) • Mitochondria outer membrane http://www.biologymad.com/

2D HeLa Data Set • Cells from Henrietta Lacks (d. 1951, cervical cancer) • Widefield with nearest-neighbor deconvolution (100x) • DNA stain and fluorescent antibody for a specific protein • 78-98 sets per class • Resolution 0.23 x 0.23 μm Boland & Murphy, Bioinformatics 17(12), 1213-1223, 2001 [Image labels: DNA, Mitochondria, Giantin, Actin, Tubulin, Gpp130, ER, LAMP2, Nucleolin, Tfr]

Classification: Previous System • Preprocessing • Manual shifting • Manual rotation • Feature computation • Subcellular Location Features (SLF) • Drawn from many different feature categories • Texture, morphological, Gabor and wavelet • Gabor and wavelet features improved accuracy significantly (from 88% to 92%) • Classification • Combination of classifiers [Diagram: Input image → Preprocessing → Feature extraction → Classification → Class]

MR Classification of Proteins • What do we need? • Want to keep MR (based on results with Gabor and wavelet features) • Avoid manual processing • Rotation invariance • Shift invariance • Adaptivity [Links: Points to Frames ►, MD frames, Wavelet/frame packets ►]

Does Adaptivity Help? • Would like to use wavelet packets ► • Do not have an obvious cost measure • Line of work • Find out if adaptivity helps • If it does, find a cost function to use with wavelet packets • Frame packets • Challenge: Same class, different story [Image label: Tubulin]

Training Phase • Number of classes C • Number of training images/class N [Block diagram: Clustering images, Full wavelet tree, Feature extraction, K-means clustering, Gaussian models, Training image, Gaussian modeling, Weight computation, Weights, Voting]

Full Wavelet Tree Decomposition • Grow a full tree ► • Depth L levels • Total number of subbands S [Pipeline so far: Clustering images, Full wavelet tree]
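To make "grow a full tree" concrete, here is a small sketch that splits every node of the tree, not only the lowpass branch, using a 2D Haar filter pair written directly in NumPy. The Haar choice follows the results slide; the function names are hypothetical, and whether S counts every node of the tree (as here) or only the leaves is an assumption the slides do not settle.

```python
import numpy as np

def haar_split(x):
    """One level of orthonormal 2D Haar analysis: LL, LH, HL, HH at half size."""
    a = (x[0::2, :] + x[1::2, :]) / np.sqrt(2)   # lowpass down the columns
    d = (x[0::2, :] - x[1::2, :]) / np.sqrt(2)   # highpass down the columns
    ll = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2)
    lh = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2)
    hl = (d[:, 0::2] + d[:, 1::2]) / np.sqrt(2)
    hh = (d[:, 0::2] - d[:, 1::2]) / np.sqrt(2)
    return [ll, lh, hl, hh]

def full_wavelet_tree(image, depth):
    """Return all nodes of the full tree, level by level, root first.

    Image sides must be divisible by 2**depth.
    """
    levels = [[np.asarray(image, dtype=float)]]
    for _ in range(depth):
        # Split every node of the previous level into four children.
        levels.append([child for node in levels[-1] for child in haar_split(node)])
    return [node for level in levels for node in level]
```

With this counting, a tree of depth L has 1 + 4 + ... + 4^L nodes, for example 21 for L = 2 and 85 for L = 3.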

Feature Extraction • Use Haralick texture features ► • One feature vector per subband s • Indexed by class c, training image n, subband s [Pipeline so far: Clustering images, Full wavelet tree, Feature extraction]
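As an illustration of what a per-subband texture feature vector can look like, the sketch below builds a gray-level co-occurrence matrix and computes four Haralick-style statistics from it. The number of gray levels, the single horizontal offset and the choice of these four statistics are assumptions made for brevity; the slides do not say which Haralick features or offsets were used.

```python
import numpy as np

def glcm(img, levels=8, offset=(0, 1)):
    """Symmetric, normalized co-occurrence matrix for one non-negative offset."""
    img = np.asarray(img, dtype=float)
    lo, hi = img.min(), img.max()
    if hi == lo:
        q = np.zeros(img.shape, dtype=int)          # constant image: one level
    else:
        q = np.minimum(((img - lo) / (hi - lo) * levels).astype(int), levels - 1)
    dr, dc = offset
    a = q[:q.shape[0] - dr, :q.shape[1] - dc]        # reference pixels
    b = q[dr:, dc:]                                  # their offset partners
    m = np.zeros((levels, levels))
    np.add.at(m, (a.ravel(), b.ravel()), 1)          # count co-occurrences
    m = m + m.T                                      # make it symmetric
    return m / m.sum()

def haralick_subset(p):
    """Angular second moment, contrast, inverse difference moment, entropy."""
    i, j = np.indices(p.shape)
    nz = p[p > 0]
    return np.array([
        np.sum(p ** 2),
        np.sum(p * (i - j) ** 2),
        np.sum(p / (1.0 + (i - j) ** 2)),
        -np.sum(nz * np.log(nz)),
    ])
```

Applying `haralick_subset(glcm(subband))` to each subband gives one feature vector per subband, matching the indexing by class c, image n and subband s on the slide.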

K-Means Clustering • Clustering in a fixed subband • Max K clusters/class [Figure labels: feature vector for image I from class c and subband s (X); cluster mean; clustering images of class c] [Pipeline so far: Clustering images, Full wavelet tree, Feature extraction, K-means clustering]

Gaussian Modeling • Model each cluster with a Gaussian pdf • Probability the training image belongs to class i • Output: single probability vector [Pipeline so far: Clustering images, Full wavelet tree, Feature extraction, K-means clustering, Gaussian modeling, Training image]
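The clustering and Gaussian modeling steps, for one fixed subband, can be sketched as follows. This is an illustration under assumptions rather than the method on the slides: the Gaussians use diagonal covariances with a small variance floor, a class is scored by its best-matching cluster, and the scores are normalized across classes to form the probability vector. All function names are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means; returns cluster means and the final label of each point."""
    rng = np.random.default_rng(seed)
    k = min(k, len(x))
    means = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
        means = np.array([x[labels == j].mean(0) if np.any(labels == j) else means[j]
                          for j in range(k)])
    labels = np.argmin(((x[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
    return means, labels

def fit_class_model(x, k):
    """Cluster one class's feature vectors; keep (mean, variance) per cluster."""
    means, labels = kmeans(x, k)
    model = []
    for j in range(len(means)):
        pts = x[labels == j]
        if len(pts) > 0:
            model.append((pts.mean(0), pts.var(0) + 1e-6))   # variance floor
    return model

def log_gauss(v, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def probability_vector(v, class_models):
    """One entry per class, normalized to sum to 1 (assumed scoring rule)."""
    scores = np.array([max(log_gauss(v, m, s) for m, s in model)
                       for model in class_models])
    p = np.exp(scores - scores.max())    # subtract the max for numerical stability
    return p / p.sum()
```

Running `probability_vector` once per subband yields the S vectors of length C that the following slides arrange in probability space.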

From Feature Space to Probability Space [Figure labels: Class 1 to Class C; Subband 1 to Subband S; Image 1 to Image N from Class 1, through Image 1 to Image N from Class C]

Weight Computation: Initialization • Decision for vector t_{c,n,s} [Figure labels: Class 1 to Class C; Subband 1 to Subband S; Image 1 to Image N from Class 1, through Image 1 to Image N from Class C] [Pipeline so far: Clustering images, Full wavelet tree, Feature extraction, K-means clustering, Gaussian modeling, Training image, Weight computation]

Weight Computation: Initialization • Initial weight for subband s: probability of correct decision [Figure labels: Class 1 to Class C; Subband 1 to Subband S; Image 1 to Image N from Class 1, through Image 1 to Image N from Class C; each per-subband decision marked correct or incorrect]
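Read literally, the initial weight of a subband is the fraction of training images that this subband alone classifies correctly. A short sketch, assuming the per-subband probability vectors are stacked in an array `P` of shape (images, S, C) and that each subband's decision is the class with the largest probability; the array layout and the function name are assumptions.

```python
import numpy as np

def initial_weights(P, labels):
    """Fraction of correct single-subband decisions, one weight per subband.

    P has shape (num_images, S, C); labels has shape (num_images,).
    """
    votes = np.argmax(P, axis=2)                       # decision of each subband
    return (votes == np.asarray(labels)[:, None]).mean(axis=0)
```

For example, a subband that is right on both of two training images gets weight 1.0, and one that is right on only one of them gets 0.5.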

Weight Computation • Compute probability vector for each image [Figure labels: Class 1 to Class C; Subband 1 to Subband S; Image 1 to Image N from Class 1, through Image 1 to Image N from Class C]

Weight Adjustment • Make a decision • Decision correct • Do nothing, take next image • Decision incorrect • Adjust the weights, take next image • Make repeated runs through all the images • Does the algorithm converge? [Block diagram: Clustering images, Full wavelet tree, Feature extraction, K-means clustering, Gaussian models, Training image, Gaussian modeling, Weight computation, Weights, Voting]
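The decide-then-adjust loop resembles a perceptron-style update. The slides do not state the adjustment rule, so the one below is purely an assumption for illustration: on a wrong overall decision, subbands that voted for the true class gain a fixed `step` and the others lose it, clipped at zero. The overall decision is taken as the largest entry of the weighted sum of per-subband probability vectors, which is also an assumed reading of "voting".

```python
import numpy as np

def decide(p_image, w):
    """Weighted vote: p_image has shape (S, C), w has shape (S,)."""
    return int(np.argmax(w @ p_image))

def train_weights(P, labels, w0, epochs=10, step=0.1):
    """Run through all images `epochs` times, adjusting only after mistakes."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(epochs):
        for p_image, y in zip(P, labels):
            if decide(p_image, w) == y:
                continue                                # correct: do nothing
            correct = np.argmax(p_image, axis=1) == y   # subbands that were right
            w[correct] += step                          # hypothetical update rule
            w[~correct] = np.maximum(w[~correct] - step, 0.0)
    return w
```

The same `decide` function covers the testing phase: compute the per-subband probability vectors of the test image, combine them with the trained weights, and take the largest entry.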

Testing Phase • Compute probabilities for each subband • Compute the overall probability vector • Make the decision [Block diagram: Testing image, Full wavelet tree, Feature extraction, Gaussian models, Probability space, Weights, Voting, Class label]

Results • C = 10 classes • N = 45 training images • T = 5 testing images • 10-fold cross validation • Training phase • 44 clustering images • 45-fold cross validation • L = 2,3 levels of Haar wavelet decomposition • K = 10 max number of clusters per class
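The numbers on this slide fit a nested split: 45 training plus 5 testing images gives 50 images per class, and 10 folds of 5 test images cover all 50. Reading "44 clustering images, 45-fold cross validation" as a leave-one-out loop inside the training set is an assumption. A small sketch of both splits for one class, with a hypothetical helper name:

```python
import numpy as np

def k_fold_indices(n_items, n_folds):
    """Yield (train_idx, held_out_idx) pairs over contiguous folds."""
    idx = np.arange(n_items)
    for fold in np.array_split(idx, n_folds):
        yield np.setdiff1d(idx, fold), fold

# Outer loop: 10-fold over the 50 images of one class (45 train, 5 test).
outer = list(k_fold_indices(50, 10))

# Inner loop: 45-fold over the training images (44 clustering, 1 held out).
inner = list(k_fold_indices(45, 45))
```

In practice the images would be shuffled before splitting; contiguous folds are used here only to keep the sketch deterministic.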

Results

Results: Accuracy vs Number of Epochs

Classification Enhancement

Weight Adjustment: 2nd Try • Keep the previous best weight • Can do no worse than previous system

Principal Component Analysis • Using eigenspace representations for Haralick texture features • Texture classification (TC) • Decomposition better than no decomposition (with or without PCA) • There is information in the subbands • TC + PCA • Improves accuracy (with or without decomposition) • Dimensionality reduction (DR) • Increases accuracy slightly without much complexity
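For reference, an eigenspace representation of feature vectors can be obtained with a few lines of NumPy. This is a generic PCA sketch via the singular value decomposition, not the specific pipeline behind the slide; the function names and the number of retained components are assumptions.

```python
import numpy as np

def pca_fit(x, n_components):
    """Return the mean, the top principal axes, and their explained-variance ratios."""
    mean = x.mean(axis=0)
    _, s, vt = np.linalg.svd(x - mean, full_matrices=False)
    var = s ** 2 / (len(x) - 1)                 # variance along each axis
    return mean, vt[:n_components], var[:n_components] / var.sum()

def pca_transform(x, mean, components):
    """Project feature vectors onto the retained principal axes."""
    return (x - mean) @ components.T
```

The axes should be fitted on training feature vectors only and then applied unchanged to test vectors, so that the reduced features do not leak test information.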

Effect of Translation Variance • No translation • accuracy(MR frames) > accuracy(MR) • Translation • MR drops • MR frames stable
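A one-dimensional toy example shows why a critically sampled transform is sensitive to translation while a redundant one need not be. The sketch compares the highpass energy of a decimated Haar step with that of an undecimated, circular Haar step, used here as a stand-in for a frame; the slides do not specify which frames were used, so this choice is an assumption.

```python
import numpy as np

def haar_decimated(x):
    """Critically sampled Haar: lowpass and highpass at half the length."""
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def haar_undecimated(x):
    """Redundant (undecimated, circular) Haar: both outputs keep full length."""
    nxt = np.roll(x, -1)
    return (x + nxt) / 2, (x - nxt) / 2

def energy(v):
    return float(np.sum(v ** 2))

x = np.array([0, 0, 1, 1, 0, 0, 0, 0], dtype=float)
shifted = np.roll(x, 1)                      # the same signal, moved by one sample

dec_energies = (energy(haar_decimated(x)[1]), energy(haar_decimated(shifted)[1]))
und_energies = (energy(haar_undecimated(x)[1]), energy(haar_undecimated(shifted)[1]))
```

In the decimated case the edge pair either lines up with the sampling grid or straddles it, so the highpass energy changes with the shift; in the undecimated case it stays the same. Texture features computed per subband inherit this behavior.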

Conclusions and Future Directions • Adaptivity definitely helps! • Accuracy stable with the increased # of epochs • Investigate the algorithm for convergence • K-means clustering introduces randomness • There is no notion of global, local minima • Reducing K reduces randomness • Weighting • Should be done for each class separately • Would lead to WP trees • Find cost function • Construct frame packets

References • Conference papers • G. Srinivasa, A. Chebira, T. Merryman and J. Kovačević, “Adaptive multiresolution texture features for protein image classification”, Proc. BMES Annual Fall Meeting, Baltimore, MD, September 2005. • K. Williams, T. Merryman and J. Kovačević, “A Wavelet Subband Enhancement to Classification”, Proc. Annual Biomed. Res. Conf. for Minority Students, Atlanta, GA, November 2005. Submitted. • A. Mintos, G. Srinivasa, A. Chebira and J. Kovačević, “Combining Wavelet Features with PCA for Classification of Protein Images”, Proc. Annual Biomed. Res. Conf. for Minority Students, Atlanta, GA, November 2005. Submitted. • T. Merryman, K. Williams and J. Kovačević, “A multiresolution enhancement to generic classifiers of subcellular protein location images”, Proc. IEEE Intl. Symp. Biomed. Imaging, Arlington, VA, April 2006. In preparation. • G. Srinivasa, T. Merryman, A. Chebira, A. Mintos and J. Kovačević, “Adaptive multiresolution techniques for subcellular protein location image classification”, Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., Toulouse, France, May 2006. Invited paper. In preparation.

Automatic Code Generation • Work in progress

Biometrics • Acquisition • NIST database • Knowledge extraction • How can we identify/verify a person’s identity based on his/her biometric characteristic? ► • Computation • Automatic code generation and optimization ►

Motivation • Security to the financial industry ► • 89,000 cases of identity theft in 2000 • Losses incurred by Visa/MasterCard $68.2 million • Security at US borders • Multimodal biometric systems • Grand challenge • Develop an intelligent next-generation biometric system capable of fast, robust and accurate identification and verification of human biometric characteristics.

Challenges • Variable conditions • Different lighting, indoors/outdoors, different poses, … • Small training sets • Uncooperative biometrics (access to only one picture of a suspected criminal) • Huge databases • Computation becomes an issue • Database sizes: up to hundreds of thousands
