ACS National Meeting March 26-30, 2006 Atlanta, GA. Outline. What is independent component analysis (ICA)? Lombardo blood-brain barrier QSAR model - wavelet descriptors - K-PLS ICA transformed descriptors - sigma-tuned kernels - feature selection with sigma-tuning
ACS-Atlanta ICA Data Cleansing ACS National Meeting March 26-30, 2006 Atlanta, GA
Outline
• What is independent component analysis (ICA)?
• Lombardo blood-brain barrier QSAR model
  - wavelet descriptors
  - K-PLS
• ICA transformed descriptors
  - sigma-tuned kernels
  - feature selection with sigma-tuning
• Data cleansing with ICA
  - pseudo-inverse ICA transform
What is Independent Component Analysis?
• ICA: nonlinear extension of PCA
  - used to separate source signals from linear mixtures (blind source separation)
  - feature extraction: reveals hidden factors (latent variables) that are explanatory
• Main difference between PCA and ICA
  - PCA requires uncorrelatedness; ICA requires mutual independence
• Mutual independence is a stronger condition than uncorrelatedness
  - PCA: second-order statistics provide decorrelation
  - ICA: higher-order statistics yield features that are as mutually independent as possible
• Higher-order statistics:
  - ICA can be considered a nonlinear PCA
  - useful when dealing with non-Gaussian distributions
• We used Erkki Oja’s FastICA algorithm and Aapo Hyvärinen’s definition
  - modified to fit into the PLS NIPALS algorithm
  - implemented inverse ICA with dropped components using a pseudo-inverse
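The claim that mutual independence is stronger than uncorrelatedness can be illustrated with a small numerical sketch. The variables and sample size below are illustrative assumptions, not data from the talk: y is computed directly from x, so the two are fully dependent, yet their linear (second-order) correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2  # fully determined by x, so clearly not independent of it

# Second-order statistic: the linear correlation is (nearly) zero.
corr_xy = np.corrcoef(x, y)[0, 1]

# A higher-order statistic (correlating x**2 with y) exposes the dependence.
corr_x2y = np.corrcoef(x ** 2, y)[0, 1]

print(f"corr(x, y)   = {corr_xy:+.3f}")
print(f"corr(x^2, y) = {corr_x2y:+.3f}")
```

A PCA-style decorrelation would treat x and y as already "separated", whereas a method that looks at higher-order statistics would not.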
ICA for blind source separation
[Figure: Original Images (S); Mixtures of Images (X); Images Recovered with ICA (S_hat)]
• ICA solves for S without knowing the mixing matrix A (X = AS)
• S consists of independent latent variables
• We can use S as new latent variables
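Blind source separation can be sketched on synthetic data. The code below is a minimal symmetric fixed-point ICA with a tanh nonlinearity written in NumPy. It is not the authors' modified FastICA/NIPALS implementation, and the two one-dimensional sources and the mixing matrix A are made-up stand-ins for the images on the slide.

```python
import numpy as np

def whiten(X):
    """Center X (n_samples x n_signals) and transform it to unit covariance."""
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvec / np.sqrt(eigval)

def fast_ica(X, n_iter=200, seed=0):
    """Symmetric fixed-point ICA (tanh nonlinearity); returns estimated sources."""
    Z = whiten(X)
    n, q = Z.shape
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.normal(size=(q, q)))   # random orthogonal start
    for _ in range(n_iter):
        Y = Z @ W.T                                # current source estimates
        G = np.tanh(Y)
        # Row i: E[z g(w_i^T z)] - E[g'(w_i^T z)] w_i
        W_new = (G.T @ Z) / n - np.diag((1.0 - G ** 2).mean(axis=0)) @ W
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W, computed via SVD
        U, _, Vt = np.linalg.svd(W_new)
        W = U @ Vt
    return Z @ W.T

rng = np.random.default_rng(1)
n = 5000
S = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])  # independent, non-Gaussian
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # mixing matrix, never shown to fast_ica
X = S @ A.T                              # observed mixtures
S_hat = fast_ica(X)

# Absolute correlation between each true source (rows) and each estimate (columns).
match = np.abs(np.corrcoef(S.T, S_hat.T)[:2, 2:])
print(np.round(match, 2))
```

ICA recovers sources only up to order, sign, and scale, which is why the comparison uses absolute correlations rather than direct differences.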
Encoding Structure: Descriptors
[Diagram: Molecular Structures (e.g., AAACCTCATAGGAAGCATACCAGGAATTACATCA…) → Descriptors (structural, physicochemical, topological, geometrical) → Model → Activity]
Electron Density-Derived TAE-Wavelet Descriptors
[Figure: histograms of PIP (Local Ionization Potential); compressed distribution]
• Surface properties are encoded on the 0.002 e/au³ surface (Breneman, C.M. and Rhem, M. [1997] J. Comp. Chem., Vol. 18 (2), p. 182-197)
• Histograms or wavelet encoding of surface properties give RECON/TAE property descriptors
• 10x16 wavelet descriptors
Wavelet Representations of Molecular Surface Properties
Molecular Surfaces: Property Distributions Encoded as Wavelet Coefficients
Activity Modeling
Descriptors (structural, physicochemical, topological, geometrical) + Activity = Model
Activity: bioactivity, ADME/Tox evaluation, hERG channel effects, P-450 isozyme inhibition, BBB, etc.
[Diagram: Molecular Structures → Descriptors → Model → Activity]
Machine Learning Methods and Statistical Learning: Kernel PLS Regression
“If your experiment needs statistics, you ought to have done a better experiment” - Ernest Rutherford
“But what if you haven’t done the experiment yet?” - Curt Breneman
Kernel PLS (K-PLS)
• Direct Kernel PLS is PLS with the kernel transform as a pre-processing step
  - K-PLS is a “better” nonlinear PLS
  - PLS is a “better” Principal Component Analysis (PCA) for regression
• K-PLS gives almost identical (but more stable) results compared with SVMs
  - easy to tune (5 latent variables)
  - unlike SVMs, there is no patent on K-PLS
• K-PLS transforms data from a descriptor space to a t-score space
[Figure: descriptor space (d1, d2, d3) mapped to t-score space (t1, t2) and response y]
K-PLS as a Direct Kernel Method
Linear Model:
  - PCA model
  - PLS model
  - Ridge Regression
  - Self-Organizing Map
  - ...
What is in a Kernel?
• A kernel can be considered as a (nonlinear) data transformation
  - Many different choices for the kernel are possible
  - The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
• The RBF or Gaussian kernel is a symmetric matrix
  - Entries reflect nonlinear similarities amongst data descriptions
  - As defined by: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
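A minimal sketch of building the Gaussian kernel matrix, assuming the standard single-width form (the exact expression used on the slide is an assumption here). The function name `gaussian_kernel` and the three sample points are illustrative only.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows x_i of X."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dist = np.maximum(sq_dist, 0.0)   # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X, sigma=1.0)
print(np.round(K, 3))
```

The matrix is symmetric with ones on the diagonal, and entries fall toward zero as two data points move apart, which is the sense in which they "reflect nonlinear similarities".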
Modeling Protocol
[Diagram labels: DATASET; Training set; Test set; Bootstrap sample k; Training; Validation; KPLS Model; Tuning / Prediction; Predictive Bag Model; Prediction]
Note: All models are validated by out-of-bag error assessment
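The bootstrap and out-of-bag idea in this protocol can be sketched as follows. This is only an illustration: ordinary least squares stands in for the K-PLS model, and the function name `out_of_bag_rmse`, the number of bags, and the synthetic data are all assumptions.

```python
import numpy as np

def out_of_bag_rmse(X, y, n_bags=50, seed=0):
    """Bag least-squares models; score each sample only with models that never saw it."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pred_sum = np.zeros(n)
    pred_count = np.zeros(n)
    Xb = np.column_stack([X, np.ones(n)])          # add an intercept column
    for _ in range(n_bags):
        in_bag = rng.integers(0, n, size=n)        # bootstrap sample k (with replacement)
        oob = np.setdiff1d(np.arange(n), in_bag)   # samples left out of this bag
        coef, *_ = np.linalg.lstsq(Xb[in_bag], y[in_bag], rcond=None)
        pred_sum[oob] += Xb[oob] @ coef            # accumulate out-of-bag predictions
        pred_count[oob] += 1
    seen = pred_count > 0                          # samples that were out-of-bag at least once
    oob_pred = pred_sum[seen] / pred_count[seen]
    return np.sqrt(np.mean((oob_pred - y[seen]) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)
rmse = out_of_bag_rmse(X, y)
print(f"out-of-bag RMSE: {rmse:.3f}")
```

Because every prediction comes from models that did not train on that sample, the out-of-bag error serves as a validation estimate without setting aside a separate validation set.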
[Figure: 28 sigmas (ICA -26)]
[Figure: 28 sigmas]
ICA Data Cleansing Procedure
• Calculate ICA components
• Make filter by identifying & removing noise ICAs
• Do inverse with noise ICAs removed, using a rectangular pseudo-inverse
[Diagram: Original Data + Noise → ICA components → Filter to Remove Noise → Cleansed data]
Data Cleansing with Independent Component Analysis
[Diagram: Original Data #1, #2, #3 → ICA#1, ICA#2, ICA#3 (one component marked "Not used") → Cleansed Data #1, #2, #3]
[Figure: Selected 28 ICAs: 98 percentile]
Blood-Brain Barrier Model: WCD vs ICA-cleansed WCD descriptors
• Lombardo blood-brain barrier data (62 data points with 106/160 wavelet descriptors)
• Retained 17/28 ICA components and cleansed data with inverse ICA transform
• Used K-PLS model with 5 latent variables and Gaussian kernel
Blood-Brain Barrier dataset: Lombardo, F.; Blake, J.; Curatolo, W. J. Med. Chem. 39, no. 24 (1996): 4750-4755
Application #2: Use Clean Independent Components as New Descriptors
[Diagram: Original Descriptors #1, #2, #3 → ICA#1: Not used; ICA#2: New descriptor #1; ICA#3: New descriptor #2]
Blood-Brain Barrier Model: WCD vs ICA descriptors
[Figure panels: Wavelet descriptors; ICA Transform of Wavelet descriptors]
• Lombardo blood-brain barrier data (62 data points with 106/160 wavelet descriptors)
• Used 17/28 ICA components as new descriptors
• Used K-PLS model with 5 latent variables and multiple sigma-tuned Gaussian kernel
Blood-Brain Barrier dataset: Lombardo, F.; Blake, J.; Curatolo, W. J. Med. Chem. 39, no. 24 (1996): 4750-4755
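The slides do not spell out the form of the "multiple sigma-tuned" kernel. One common reading, assumed here, is a Gaussian kernel with a separate width sigma_k for each descriptor, so that a descriptor whose tuned sigma grows very large stops influencing the kernel and is effectively deselected. The function name `multi_sigma_kernel` and the random data are hypothetical.

```python
import numpy as np

def multi_sigma_kernel(X, sigmas):
    """Gaussian kernel with one width per descriptor (column of X).

    K[i, j] = exp(-0.5 * sum_k ((x_ik - x_jk) / sigma_k)^2)
    """
    Xs = X / np.asarray(sigmas, dtype=float)   # a large sigma_k shrinks descriptor k's influence
    sq_norms = (Xs ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Xs @ Xs.T
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K_all = multi_sigma_kernel(X, [1.0, 1.0, 1.0])
K_drop = multi_sigma_kernel(X, [1.0, 1.0, 1e6])    # third descriptor effectively switched off
K_two = multi_sigma_kernel(X[:, :2], [1.0, 1.0])   # kernel built from the first two descriptors only
print(np.abs(K_drop - K_two).max())
```

Under this assumption, inspecting the tuned sigmas doubles as feature selection: descriptors with very large widths contribute almost nothing to the similarity between compounds.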
Summary
• We have:
  - Introduced the ICA transform (as an algorithm related to PCA and PLS)
  - Illustrated sigma-tuning for the Gaussian kernel
  - Shown ICA as a descriptor transform on the Lombardo data
  - Introduced the pseudo-inverse ICA transform as a data cleansing operation
• Future developments
  - ICA cleansing for removing specific chemical effects
  - Model interpretation through ICA target testing
ACKNOWLEDGMENTS
• Current and Former members of the DDASSL group
• Breneman Research Group (RPI Chemistry)
• N. Sukumar
• M. Sundling
• Min Li
• Long Han
• Jed Zaretski
• Theresa Hepburn
• Mike Krein
• Steve Mulick
• Shiina Akasaka
• Hongmei Zhang
• C. Whitehead (Pfizer Global Research)
• L. Shen (BNPI)
• L. Lockwood (Syracuse Research Corporation)
• M. Song (Synta Pharmaceuticals)
• D. Zhuang (Simulations Plus)
• W. Katt (Yale University chemistry graduate program)
• Q. Luo (J & J)
• Embrechts Research Group (RPI DSES)
• Tropsha Research Group (UNC Chapel Hill)
• Bennett Research Group (RPI Mathematics)
• Jinbo Bi
• Collaborators:
• Lawrence Research Group (NYS Wadsworth Labs)
• Inna Vitol
• Cramer Research Group (RPI Chemical Engineering)
• Funding
• NIH (GM047372-07)
• NIH (1P20HG003899-01)
• NSF (BES-0214183, BES-0079436, IIS-9979860)
• GE Corporate R&D Center
• Millennium Pharmaceuticals
• Concurrent Pharmaceuticals
• Pfizer Pharmaceuticals
• ICAGEN Pharmaceuticals
• Eastman Kodak Company
• Chemical Computing Group (CCG)
Reserve Slides
Inverse ICA transform: Theory with general mixture coefficient matrix
(M is in MetaNeural format)
• M is the mixed signal matrix (e.g., n data points, 3 “noisy signals” or descriptors)
• W is the mixture coefficient matrix (q = # ICAs)
• S contains the pure base signal matrix, or the “true latent variables”
• Procedure:
  - calculate W and S by applying ICA to a data file with 3 noisy descriptors (in M)
  - determine traces to zero out from S by inspection
  - calculate the pseudo-inverse WT(W*WT)^-1, zero out appropriate rows & calculate “filtered” M
  - to clean test data, first multiply by W to bring it into the S domain
  - convert the S domain signal back to the M domain with the cleaned pseudo-inverse
• Determine number of PCAs with nonzero eigenvalues h (h <= 3)
• Generally q = h, but set “noise” components to 0 in entire weight complex
• Filter mode: bring new test data for filtering first into the S domain with the old weight matrix
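The procedure above can be sketched in NumPy under a few stated assumptions: W is taken as a q x m matrix whose rows map a data row into the S domain (s = W m), the pseudo-inverse is the right pseudo-inverse W^T (W W^T)^-1, and W is assumed known rather than estimated by ICA so that the result can be checked exactly. The function name `ica_cleanse`, the mixing matrix A, and the choice of noise component are illustrative.

```python
import numpy as np

def ica_cleanse(M, W, noise_components):
    """Project M into the S domain, zero the noise components, and map back."""
    S = M @ W.T                               # each row of M is transformed by s = W m
    W_pinv = W.T @ np.linalg.inv(W @ W.T)     # right pseudo-inverse, W^T (W W^T)^-1
    S_clean = S.copy()
    S_clean[:, noise_components] = 0.0        # drop the components judged to be noise
    return S_clean @ W_pinv.T                 # back to the M (descriptor) domain

rng = np.random.default_rng(0)
n = 200
S_true = rng.laplace(size=(n, 3))             # three independent base signals
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.1, 1.0]])               # mixing matrix (made up for the example)
M = S_true @ A.T                              # three "noisy" descriptors
W = np.linalg.inv(A)                          # stand-in for the ICA result

M_clean = ica_cleanse(M, W, noise_components=[2])
expected = S_true[:, :2] @ A[:, :2].T         # mixture of the two retained signals only
print(np.abs(M_clean - expected).max())
```

New test data would be filtered the same way by passing it through `ica_cleanse` with the W obtained from the training data, which corresponds to the "filter mode" bullet.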
ICA (Independent Component Analysis) Filtering: The Concept
[Diagram: Original Descriptors + Noise → ICA Filter to Remove Noise → Cleansed Descriptors]
Blood-Brain Barrier Model: WCD vs ICA selected descriptors
[Figure panels: Wavelet descriptors; ICA Transform of Wavelet descriptors]
• Lombardo blood-brain barrier data (62 data points with 106/160 wavelet descriptors)
• Used 17/28 ICA components as new descriptors
• Used K-PLS model with 5 latent variables and multiple sigma-tuned Gaussian kernel
Blood-Brain Barrier dataset: Lombardo, F.; Blake, J.; Curatolo, W. J. Med. Chem. 39, no. 24 (1996): 4750-4755
Illustration of Wavelet Image Compression
[Figure: Original image 514 KB; High Resolution JPEG 26 KB; Medium Resolution JPEG 11 KB; Low Resolution JPEG 6.6 KB]
• Photographic images are routinely compressed using Discrete Cosine Transforms: Joint Photographic Experts Group (JPEG)
• In this example, nearly 90-fold data compression is achieved by retaining only the highest amplitude coefficients.
• We use Discrete Wavelet Transforms (DWT) to encode quantum chemical information content
[Figure: ICA cleansed data (109/160 features), 17/28 ICA comps]
[Figure: Original data (109/160 features)]
Linear and Nonlinear Principal Components: Replace X_nm by T_nh
• PCA: Create a reduced feature set from original attributes
• PCAs: Are orthogonal projections in directions of largest variance
• PCA calculations can be done with Svante Wold’s NIPALS algorithm
  - elegant and efficient algorithm
  - hidden gem of an algorithm (not well known at all)
• PCAs can also be calculated with specialized neural networks (Erkki Oja)
• Related methods: Partial Least Squares (PLS), Independent Component Analysis (ICA)
• Other reduced feature sets via wavelet and Fourier transforms, …
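A minimal NIPALS sketch for PCA, replacing an n x m data matrix X by an n x h score matrix T. This is a textbook-style version written for illustration, not the modified NIPALS used by the authors; the function name, tolerances, and synthetic data are assumptions.

```python
import numpy as np

def nipals_pca(X, n_components, n_iter=500, tol=1e-12):
    """Return scores T (n x h) and loadings P (m x h) of column-centered X."""
    E = X - X.mean(axis=0)                     # residual matrix, deflated after each component
    T, P = [], []
    for _ in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()   # start from the highest-variance column
        for _ in range(n_iter):
            p = E.T @ t / (t @ t)              # loadings: regress the columns of E on t
            p /= np.linalg.norm(p)
            t_new = E @ p                      # scores: project the rows of E onto p
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        E = E - np.outer(t, p)                 # deflation: remove the extracted component
        T.append(t)
        P.append(p)
    return np.column_stack(T), np.column_stack(P)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
T, P = nipals_pca(X, n_components=2)

# Cross-check against the SVD: score-vector lengths should equal the singular values.
svals = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.linalg.norm(T, axis=0), svals[:2])
```

NIPALS extracts one component at a time, so it never has to form or decompose the full covariance matrix when only the first few components are needed.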
Electron Density-Derived Molecular Properties: Wavelet Coefficient Descriptors (WCD)
Wavelet Decomposition:
• Creates a set of coefficients that represent a waveform.
• Small coefficients may be omitted to compress data.
Wavelet Surface Property Density Reconstruction:
• 16 coefficients from S7 and D7 portions of the WCD vector represent surface property densities with >95% accuracy.
• Increased property accuracy of descriptors provides more predictive models.
• 1024 raw wavelet coefficients capture PIP distribution on molecular surface.
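The idea of omitting small wavelet coefficients can be sketched with a Haar transform. The slides do not name the wavelet family behind the S7/D7 coefficients, so Haar is purely an assumption chosen for brevity, and the 1024-point two-bump curve is a made-up stand-in for a surface property distribution.

```python
import numpy as np

def haar_forward(x):
    """Orthonormal Haar DWT of a length-2^k signal; coarsest coefficients come first."""
    out = np.asarray(x, dtype=float).copy()
    n = len(out)
    while n > 1:
        a = (out[0:n:2] + out[1:n:2]) / np.sqrt(2.0)   # smooth part
        d = (out[0:n:2] - out[1:n:2]) / np.sqrt(2.0)   # detail part
        out[: n // 2], out[n // 2 : n] = a, d
        n //= 2
    return out

def haar_inverse(c):
    """Invert haar_forward."""
    out = np.asarray(c, dtype=float).copy()
    n = 1
    while n < len(out):
        a, d = out[:n].copy(), out[n : 2 * n].copy()
        out[0 : 2 * n : 2] = (a + d) / np.sqrt(2.0)
        out[1 : 2 * n : 2] = (a - d) / np.sqrt(2.0)
        n *= 2
    return out

def compress(x, keep):
    """Zero all but the `keep` (>= 1) largest-magnitude wavelet coefficients, then reconstruct."""
    c = haar_forward(x)
    c[np.argsort(np.abs(c))[:-keep]] = 0.0
    return haar_inverse(c)

t = np.linspace(0.0, 1.0, 1024)
x = np.exp(-((t - 0.3) / 0.1) ** 2) + 0.5 * np.exp(-((t - 0.7) / 0.05) ** 2)

def rel_err(approx):
    return np.linalg.norm(approx - x) / np.linalg.norm(x)

err16, err64 = rel_err(compress(x, 16)), rel_err(compress(x, 64))
print(f"relative error with 16 coefficients: {err16:.3f}, with 64: {err64:.3f}")
```

A smoother wavelet family would typically need fewer coefficients than Haar for the same accuracy on a smooth curve, so the error levels printed here should not be compared with the >95% figure on the slide.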
[Figure: 28 sigmas, 200 its]
[Figure: 28 sigmas, 500 its]
[Figure: 28 sigmas, 17 selected]
Application #2: Use Independent Components as New Descriptors