
Big Challenges with Big Data in Life Sciences


Presentation Transcript


  1. Big Challenges with Big Data in Life Sciences Shankar Subramaniam UC San Diego

  2. The Digital Human

  3. A Super-Moore’s Law Adapted from Lincoln Stein 2012

  4. The Phenotypic Readout

  5. Data to Networks to Biology

  6. NETWORK RECONSTRUCTION • Data-driven network reconstruction of biological systems • Derive relationships between input/output data • Represent the relationships as a network (Figure: experiments/measurements feed an inverse problem of data-driven network reconstruction)

  7. Network Reconstructions: Reverse Engineering of Biological Networks • Reverse engineering of biological networks: • Structural identification: ascertain the network structure or topology • Identification of dynamics: determine the interaction details • Main approaches: • Statistical methods • Simulation methods • Optimization methods • Regression techniques • Clustering

  8. Network Reconstruction of Dynamic Biological Systems: Doubly Penalized LASSO Behrang Asadi*, Mano R. Maurya*, Daniel Tartakovsky, Shankar Subramaniam Department of Bioengineering, University of California, San Diego • NSF grants (STC-0939370, DBI-0641037 and DBI-0835541) • NIH grant 5 R33 HL087375-02 • * Equal effort

  9. APPLICATION • Phosphoprotein signaling and cytokine measurements in RAW 264.7 macrophage cells.

  10. MOTIVATION FOR THE NOVEL METHOD • Various methods: • Regression-based approaches (least squares) with statistical significance testing of coefficients • Dimensionality reduction to handle correlation: PCR and PLS • Optimization/shrinkage (penalty)-based approach: LASSO • Partial-correlation and probabilistic model/Bayesian-based approaches • Different methods have distinct advantages/disadvantages • Can we benefit by combining the methods? • Compensate for the disadvantages • A novel method: Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO) • Incorporates both statistical significance testing and shrinkage

  11. LINEAR REGRESSION • Goal: build a linear-relationship-based model $y = Xb + e$ • X: input data (m samples by n inputs), zero mean, unit standard deviation • y: output data (m samples by 1 output column), zero mean • b: model coefficients; translates into the edges in the network • e: normal random noise with zero mean • Ordinary Least Squares solution: $\hat{b} = (X^\top X)^{-1} X^\top y$ • Formulation for dynamic systems:
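A minimal numerical sketch of this regression setup, assuming synthetic standardized data (all names and sizes here are illustrative, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 22                    # m samples, n inputs (22, as in the PP/cytokine data)
X = rng.standard_normal((m, n))   # zero-mean, unit-variance inputs
b_true = rng.standard_normal(n)
y = X @ b_true + 0.1 * rng.standard_normal(m)   # y = X b + e

# Ordinary Least Squares solution: b_hat = (X'X)^{-1} X'y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
```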

  12. STATISTICAL SIGNIFICANCE TESTING • Most coefficients are non-zero, a mathematical artifact • Perform statistical significance testing • Compute the standard deviation on the coefficients • Ratio: $t_i = \hat{b}_i / \mathrm{sd}(\hat{b}_i)$ • Coefficient is significant (different from zero) if $|t_i|$ exceeds the critical t-value for the chosen confidence level • Edges in the network graph represent the coefficients. * Krämer, Nicole, and Masashi Sugiyama. "The degrees of freedom of partial least squares regression." Journal of the American Statistical Association 106.494 (2011): 697-705.
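Continuing the sketch above, a standard OLS t-test on the coefficients (the PLS-specific degrees-of-freedom correction from the Krämer and Sugiyama reference is not reproduced here):

```python
from scipy import stats

residuals = y - X @ b_hat
dof = m - n                                    # degrees of freedom for plain OLS
sigma2 = residuals @ residuals / dof           # noise variance estimate
se_b = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))   # sd of each coefficient

t_ratio = b_hat / se_b
significant = np.abs(t_ratio) > stats.t.ppf(0.975, dof)    # two-sided test at 95%
b_pruned = np.where(significant, b_hat, 0.0)   # insignificant coefficient -> no edge
```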

  13. CORRELATED INPUTS: PLS • Partial least squares finds directions in the X space that explain the maximum-variance directions in the Y space • PLS regression is used when the number of observations per variable is low and/or collinearity exists among X values • Requires an iterative algorithm: NIPALS, SIMPLS, etc. • Statistical significance testing is iterative * Wold, H. (1975). Soft modelling by latent variables: the non-linear iterative partial least squares approach. In Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, J. Gani, ed., Academic Press, London.
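A small illustration of PLS regression under collinearity using scikit-learn's PLSRegression (which is fitted with NIPALS); the data are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 22))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(50)   # two nearly collinear inputs
y = X[:, 0] - 2.0 * X[:, 5] + 0.1 * rng.standard_normal(50)

pls = PLSRegression(n_components=5)   # latent directions fitted iteratively (NIPALS)
pls.fit(X, y)
coef = pls.coef_.reshape(-1)          # coefficients mapped back to the input space
```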

  14. LASSO • Shrinkage version of Ordinary Least Squares, subject to an L-1 penalty constraint (the sum of the absolute values of the coefficients must be less than a threshold). The LASSO estimator is then defined as: $\hat{b}^{\text{lasso}} = \arg\min_b \|y - Xb\|_2^2$ (cost function) subject to $\sum_j |b_j| \le t \sum_j |\hat{b}^{0}_j|$ (L-1 constraint) • where $\hat{b}^{0}$ represents the full least-squares estimates • 0 < t < 1 causes the shrinkage * Tibshirani, R.: 'Regression shrinkage and selection via the Lasso', J. Roy. Stat. Soc. B Met., 1996, 58, (1), pp. 267-288
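A hedged sketch with scikit-learn's Lasso. Note that scikit-learn solves the equivalent Lagrangian form, min (1/2m)||y - Xb||² + α Σ|b_j|, rather than the constrained form with threshold t shown above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 22))
b_true = np.zeros(22)
b_true[:7] = rng.standard_normal(7)              # sparse truth: most edges absent
y = X @ b_true + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.05).fit(X, y)              # L-1 penalty shrinks coefficients to 0
print("recovered edges:", np.flatnonzero(lasso.coef_))
```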

  15. Noise and Missing Data • More systematic comparison needed with respect to: • Noise: Level, Type • Size (dimension) • Level of missing data • Collinearity or dependency among input channels • Missing data • Nonlinearity between inputs/outputs and nonlinear dependency • Time-series inputs(/outputs) and dynamic structure

  16. METHODS • Linear Matrix Inequalities (LMI)* • Converts a nonlinear optimization problem into a linear optimization problem. • Congruence transformation: • Pre-existing knowledge of the system (e.g. ) can be added in the form of LMI constraints: • Threshold the coefficients: * [Cosentino, C., et al., IET Systems Biology, 2007. 1(3): p. 164-173]

  17. METRICS • Metrics for comparing the methods • Reconstruction from 80% of the dataset, with 20% held out for validation • RMSE on the test set, and the number and identity of the significant predictors, as the basic metrics to evaluate the performance of each method • Fractional error in estimating the parameters • Sensitivity, specificity, G, accuracy (TP: true positive; FP: false positive; TN: true negative; FN: false negative) • When generating the synthetic data, parameters smaller than 10% of the standard deviation of all parameter values were set to 0
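These edge-classification metrics can be computed directly from the true and recovered coefficient vectors; a minimal illustrative helper (not the authors' code):

```python
import numpy as np

def edge_metrics(b_true, b_est, tol=1e-8):
    """Sensitivity, specificity, G, and accuracy of recovered network edges."""
    true_edge = np.abs(b_true) > tol
    est_edge = np.abs(b_est) > tol
    tp = np.sum(true_edge & est_edge)    # true positives
    tn = np.sum(~true_edge & ~est_edge)  # true negatives
    fp = np.sum(~true_edge & est_edge)   # false positives
    fn = np.sum(true_edge & ~est_edge)   # false negatives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    g = np.sqrt(sensitivity * specificity)       # geometric mean of the two
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, g, accuracy
```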

  18. RESULTS: DATA SETS • Two data sets for benchmarking • First set: experimental data measured on macrophage cells (phosphoprotein (PP) vs. cytokine)* • Second set: synthetic data generated in Matlab. We build the model using 80% of the data set (the training set) and use the remaining 20% to validate the model (the test set). * [Pradervand, S., M.R. Maurya, and S. Subramaniam, Genome Biology, 2006. 7(2): p. R11]

  19. RESULTS: PP-Cytokine Data Set • Schematic representation of Phosphoprotein (PP) vs Cytokine • Signals were transmitted through 22 recorded signaling proteins and other pathways (unmeasured pathways). • Only measured pathways contributed to the analysis Schematic graphs from: [Pradervand, S., M.R. Maurya, and S. Subramaniam, Genome Biology, 2006. 7(2): p. R11].

  20. PP-CYTOKINE DATASET Measurements of phosphoproteins in response to LPS Courtesy: AfCS

  21. Measurements of cytokines in response to LPS ~ 250 such datasets

  22. RESULTS: COMPARISON • Comparison on synthetic noisy data • The methods are applied to synthetic data with 22 inputs and 1 output. About one third of the true input coefficients are set to zero to test whether the methods identify them as insignificant. • Effect of noise level: four outputs with 5, 10, 20 and 40% noise levels, respectively, are generated from the noise-free (true) output. • Effect of noise type: three outputs with white, t-distributed, and uniform noise types, respectively, are generated from the noise-free (true) output.

  23. RESULTS: COMPARISON • Variability between realizations of data with white noise • PCR, LASSO, and LMI are used to identify significant predictors for 1000 input-output pairs. • Histograms of the coefficients of the three significant predictors common to the three methods: mean and standard deviation of the coefficients computed with PCR, LASSO, and LMI.

  24. RESULTS: COMPARISON • Comparison of the outcomes of the different methods on the real data • The methods identified both common and distinct sets of predictors for each output. Graphical comparison of PCR, LASSO, and LMI in detecting significant predictors for output IL-6 in the PP/cytokine experimental dataset • Only the PCR method detects the true input cAMP • Zone I provides validation: it highlights the predictors common to all methods

  25. RESULTS: SUMMARY • Comparison with respect to different noise types: • LASSO is the most robust method across noise types. • Missing data (RMSE): • LASSO shows less deviation and is more robust. • Collinearity: • PCR shows less deviation against noise level, and better accuracy and G with increasing noise level.

  26. A COMPARISON (Asadi, et al., 2012)

  27. DPLASSO Doubly Penalized Least Absolute Shrinkage and Selection Operator

  28. OUR APPROACH: DPLASSO (Flowchart: model → PLS with statistical significance testing → LASSO → reconstructed network)

  29. DPLASSO WORK FLOW • Our approach: DPLASSO includes two parameter selection layers: • Layer 1 (supervisory layer): • Partial Least Squares (PLS) • Statistical significance testing • Layer 2 (lower layer): • LASSO with extra weights on less informative model parameters derived in layer 1 • Retain significant predictors and set the remaining small coefficients to zero
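A minimal sketch of this two-layer idea. Layer 1 here scores coefficient significance by bootstrapping PLS fits (a stand-in for the PLS significance test cited earlier), and layer 2 applies the extra LASSO weights by rescaling columns; the function name, thresholds, and defaults are all illustrative assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Lasso

def dplasso(X, y, n_components=5, n_boot=200, w_small=10.0, alpha=0.05, seed=0):
    """Two-layer sketch: PLS significance screening, then weighted LASSO."""
    rng = np.random.default_rng(seed)
    m, n = X.shape

    # Layer 1 (supervisory): bootstrap PLS to score each coefficient
    boot = np.empty((n_boot, n))
    for k in range(n_boot):
        idx = rng.integers(0, m, m)
        boot[k] = PLSRegression(n_components).fit(X[idx], y[idx]).coef_.reshape(-1)
    t_ratio = np.abs(boot.mean(0)) / boot.std(0)
    weights = np.where(t_ratio > 2.0, 1.0, w_small)  # heavier penalty if insignificant

    # Layer 2: weighted LASSO via column rescaling (penalizes w_j * |b_j|)
    lasso = Lasso(alpha=alpha).fit(X / weights, y)
    return lasso.coef_ / weights
```

Dividing column j by w_j before an ordinary LASSO fit is equivalent to penalizing w_j·|b_j|, which is one standard way to put extra weight on less informative model parameters.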

  30. DPLASSO: EXTENDED VERSION • Smooth weights: • Layer 1: continuous significance score η (versus binary) • Mapping function (logistic significance score) with a tuning parameter • Layer 2: continuous weight vector (versus fuzzy weight vector)

  31. APPLICATIONS • Synthetic (random) networks: Datasets generated in Matlab • Biological dataset: Saccharomyces cerevisiae - cell cycle model

  32. SYNTHETIC (RANDOM) NETWORKS Datasets generated in Matlab using: • A linear dynamic system • Dominant poles/eigenvalues (λ) in the range [-2, 0] • Lyapunov stable • Informal definition from Wikipedia: if all solutions of the dynamical system that start out near an equilibrium point xe stay near xe forever, then the system is Lyapunov stable. • Zero-input/excited-state release condition • 5% measurement (white) noise.
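A data-generation sketch under these conditions, assuming a simple eigenvalue shift to enforce Lyapunov stability (rather than the exact pole placement in [-2, 0]) and forward-Euler simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10                                    # network size (number of states)
A = rng.standard_normal((n, n)) / np.sqrt(n)
# Shift all eigenvalues into the left half-plane so the system is stable
A -= (np.max(np.real(np.linalg.eigvals(A))) + 0.1) * np.eye(n)

dt, steps = 0.1, 200
x = np.empty((steps, n))
x[0] = rng.standard_normal(n)             # excited initial state, zero input
for k in range(steps - 1):
    x[k + 1] = x[k] + dt * (A @ x[k])     # forward-Euler integration of dx/dt = Ax

x_noisy = x + 0.05 * x.std(0) * rng.standard_normal(x.shape)  # 5% white noise
```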

  33. METRICS • Two groups of metrics to evaluate the performance of DPLASSO: • Sensitivity, specificity, G (geometric mean of sensitivity and specificity), and accuracy (TP: true positive; FP: false positive; TN: true negative; FN: false negative) • The root-mean-squared error (RMSE) of prediction

  34. TUNING • Tuning the shrinkage parameter for DPLASSO: the shrinkage parameter at the LASSO level (threshold t) is chosen via k-fold cross-validation (k = 10) on the associated dataset. • Rule of thumb after cross-validation: the optimal value of the tuning parameter is roughly equal to the network connectivity (e.g., about 0.65 for a network with 65% connectivity). (Figure: validation error versus selection threshold t for DPLASSO on the synthetic data set)
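A cross-validation sketch with scikit-learn's LassoCV. Note it tunes the Lagrangian penalty α over 10 folds, whereas the slide's t parameterizes the constrained form; the 65%-connectivity network below is a synthetic illustration:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 20))
# ~65% of coefficients nonzero, mimicking a network with 65% connectivity
b_true = np.where(rng.random(20) < 0.65, rng.standard_normal(20), 0.0)
y = X @ b_true + 0.05 * rng.standard_normal(200)

model = LassoCV(cv=10).fit(X, y)     # 10-fold CV over a grid of shrinkage values
print("selected alpha:", model.alpha_)
```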

  35. PERFORMANCE COMPARISON: ACCURACY • PLS shows better performance • DPLASSO provides a good compromise between LASSO and PLS in terms of accuracy for different network densities

  36. PERFORMANCE COMPARISON: SENSITIVITY • LASSO shows better performance • DPLASSO provides a good compromise between LASSO and PLS in terms of sensitivity for different network densities

  37. PERFORMANCE COMPARISON: SPECIFICITY • DPLASSO provides a good compromise between LASSO and PLS in terms of specificity for different network densities.

  38. PERFORMANCE COMPARISON: NETWORK SIZE (Panels: network sizes 10, 20 and 50, i.e. 100, 400 and 2500 potential connections; y-axis: accuracy) • DPLASSO provides a good compromise between LASSO and PLS in terms of accuracy for different network sizes • DPLASSO provides a good compromise between LASSO and PLS in terms of sensitivity (not shown) for different network sizes

  39. ROC CURVE vs. DYNAMICS AND WEIGHTINGS • DPLASSO exhibits better performance for networks with slow dynamics. • The parameter γ in DPLASSO can be adjusted to improve performance for networks with fast dynamics.

  40. YEAST CELL DIVISION Dataset generated via a well-known nonlinear model of the cell division cycle of fission yeast. The model is dynamic, with 9 state variables. * Novak, Bela, et al. "Mathematical model of the cell division cycle of fission yeast." Chaos: An Interdisciplinary Journal of Nonlinear Science 11.1 (2001): 277-286.

  41. CELL DIVISION CYCLE (Figures: the true network for the cell division cycle, and the networks reconstructed by LASSO, PLS, and DPLASSO; one true edge is missing in the DPLASSO reconstruction)

  42. RECONSTRUCTION PERFORMANCE • Case Study I: 10 Monte Carlo simulations, size 20, averaged over different γ, λ, network densities, and Monte Carlo sample datasets • Case Study II: cell division cycle, averaged over γ values

  43. CONCLUSION • Novel method, Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO), to reconstruct dynamic biological networks • Based on the integration of significance testing of coefficients and optimization • Smoothing function to trade off between PLS and LASSO • Simulation results on synthetic datasets: • DPLASSO provides a good compromise between PLS and LASSO in terms of accuracy and sensitivity for • Different network densities • Different network sizes • For the biological dataset: • DPLASSO is best in terms of sensitivity • DPLASSO is a good compromise between LASSO and PLS in terms of accuracy, specificity and lift

  44. Information Theory Methods Farzaneh Farangmehr

  45. Mutual Information • Mutual information is a metric of how much information one variable provides for predicting the behavior of another variable. • The higher the mutual information, the more similar the two profiles. • For two discrete random variables X = {x1,..,xn} and Y = {y1,...,ym}: $I(X;Y) = \sum_{i}\sum_{j} p(x_i, y_j)\,\log \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}$, where p(xi, yj) is the joint probability of xi and yj, and p(xi) and p(yj) are the marginal probabilities of xi and yj.
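A direct implementation of this definition from a joint probability table (an illustrative helper, using the base-2 logarithm so the result is in bits):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) from a joint probability table p_xy[i, j] = p(x_i, y_j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x_i)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y_j)
    nz = p_xy > 0                            # 0 * log 0 = 0 by convention
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz]))

p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
print(mutual_information(p))   # ~0.278 bits: X and Y are strongly dependent
```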

  46. Information-Theoretical Approach: Shannon Theory • Hartley's conceptual framework relates the information of a random variable to its probability. • Shannon defined the "entropy" H of a random variable X, given a random sample, in terms of its probability distribution: $H(X) = -\sum_i p(x_i)\,\log p(x_i)$ • Entropy is a good measure of randomness or uncertainty. • Shannon defines "mutual information" as the amount of information about a random variable X that can be obtained by observing another random variable Y: $I(X;Y) = H(X) - H(X \mid Y)$
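The entropy-based definition here agrees with the joint-probability form on the previous slide; expanding the conditional entropy gives:

```latex
I(X;Y) = H(X) - H(X \mid Y)
       = \sum_{i,j} p(x_i, y_j)\,\log \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}
```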

  47. Mutual Information Networks X = {x1, ..., xi}, Y = {y1, ..., yj} • The ultimate goal is to find the best model that maps X → Y • The general definition: Y = f(X) + U. In linear cases: Y = [A]X + U, where [A] is a matrix that defines the linear dependency of outputs on inputs • Information theory maps inputs to outputs (for both linear and non-linear models) by using the mutual information

  48. Mutual Information Networks • The framework of network reconstruction using information theory has two stages: 1. mutual information measurements; 2. the selection of a proper threshold. • Mutual information networks rely on the mutual information matrix (MIM), a square matrix whose elements (MIMij = I(Xi; Yj)) are the mutual information between Xi and Yj. • Choosing a proper threshold is a non-trivial problem. The usual approach is to permute the expression measurements many times, recompute the distribution of mutual information for each permutation, average the distributions, and take the largest mutual information value in the averaged permuted distribution as the threshold.
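A sketch of that permutation procedure; `mi` is a hypothetical user-supplied estimator returning the pairwise mutual information matrix of its argument (not part of the slides):

```python
import numpy as np

def mi_threshold(expr, mi, n_perm=100, seed=0):
    """Average the permuted MI matrices and return their largest value.

    expr: (samples x genes) expression matrix; mi: pairwise MI estimator.
    """
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[1]
    mim_avg = np.zeros((n_genes, n_genes))
    for _ in range(n_perm):
        perm = rng.permuted(expr, axis=0)   # shuffle each gene's samples independently
        mim_avg += mi(perm) / n_perm        # running average of permuted MIMs
    np.fill_diagonal(mim_avg, 0.0)          # ignore self-information
    return mim_avg.max()                    # threshold for the real MIM
```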

  49. Mutual Information Networks: Data Processing Inequality (DPI) • The DPI for biological networks states that if genes g1 and g3 interact only through a third gene, g2, then: $I(g_1; g_3) \le \min\left[\, I(g_1; g_2),\ I(g_2; g_3) \,\right]$ • Checking against the DPI may identify gene pairs that are not directly dependent even if their mutual information I(g1; g3) is above the threshold.

  50. ARACNe Algorithm • ARACNE stands for "Algorithm for the Reconstruction of Accurate Cellular NEtworks" [25]. • ARACNE identifies candidate interactions by estimating the pairwise mutual information of gene expression profiles, I(gi, gj), and filtering the MIs with an appropriate threshold, I0, computed for a specific p-value, p0. In the second step, ARACNE removes the vast majority of indirect connections using the Data Processing Inequality (DPI). (ARACNE flowchart from Califano and coworkers)
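A sketch of the two steps (thresholding, then DPI pruning) on a precomputed mutual information matrix; this is a naive O(n³) illustration under stated assumptions, not the published implementation:

```python
import numpy as np

def dpi_prune(mim, i0, eps=0.0):
    """Threshold the MI matrix at I0, then remove DPI-violating edges."""
    n = mim.shape[0]
    adj = np.where(mim >= i0, mim, 0.0)      # step 1: keep MIs above threshold I0
    keep = adj > 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k == i or k == j:
                    continue
                # DPI: an edge that is the weakest in a triangle through k
                # is treated as indirect and removed (eps adds tolerance)
                if 0 < adj[i, j] < min(adj[i, k], adj[k, j]) - eps:
                    keep[i, j] = keep[j, i] = False
    return np.where(keep, adj, 0.0)
```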
