Dimensionality Reduction in the Analysis of Human Genetics Data

Dimensionality Reduction in the Analysis of Human Genetics Data Petros Drineas Rensselaer Polytechnic Institute Computer Science Department To access my web page: drineas

Human genetic history Much of the biological and evolutionary history of our species is written in our DNA sequences. Population genetics can help translate that historical message. The genetic variation among humans is a small portion of the human genome.All humans are almost than 99.9% identical.

Our objective Fact: Dimensionality Reduction techniques (such as Principal Components Analysis – PCA) separate different populations and result to “plots” that correlate well with geography.

Our objective • Fact: • Dimensionality Reduction techniques (such as Principal Components Analysis – PCA) separate different populations and result to “plots” that correlate well with geography. • Our goal: • Based on this observation, we seek unsupervised, efficient algorithms for the selection of a small set of genetic markers that can be used to • capture population structure, and • predict individual ancestry.

The math behind… • To this end, we employ matrix algorithms and matrix decompositions such as • the Singular Value Decomposition (SVD), and • the CX decomposition. We provide novel, unsupervised algorithms for selecting ancestry informative markers.

Overview • Background • The Singular Value Decomposition (SVD) • The CX decomposition • Removing redundant markers (Column Subset Selection Problem) • Selecting Ancestry Informative Markers • A worldwide set of populations • Admixed European-American populations • The POPulation REference Sample (POPRES)

Single Nucleotide Polymorphisms (SNPs) Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals. They are known locations at the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T). SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals There are ~10 million SNPs in the human genome, so this matrix could have ~10 million columns.

Our data as a matrix SNPs Individuals … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … SNPs Individuals 0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1 1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 0 -1 -1 1 example: ΑΑ=1 ΑG = 0 GG = -1

Forging population variation & structure • Genetic diversity and population (sub)structure is caused by… • Mutation • Mutations are changes to the base pair sequence of the DNA. • Natural selection • Genotypes that correspond to favorable traits and are heritable become more common in successive generations of a population of reproducing organisms. •  Mutations increase genetic diversity. • Under natural selection, beneficial mutations increase in frequency, and vice versa.

Forging population variation & structure • Genetic drift • Sampling effects on evolution: • Example: say that the RAF of a SNP in a small population is p. • The offspring generation would (in expectation) have a RAF of p as well for the same SNP. • In reality, it will have a RAF of p’ (a drifted frequency) … • Gene flow • Transfer of alleles between populations (immigration) • Non-random mating • Reduces interaction between (sub)populations. • Other demographic events

Dimensionality reduction SNPs Individuals 0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1 1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 0 -1 -1 1 • Mutation/natural selection/population history, etc. result to significant structure in the data. • This structure can be extracted via dimensionality reduction techniques. • In most cases, unsupervised techniques suffice! • In most cases, linear dimensionality reduction techniques (PCA) suffice!

Why study population structure? • Mapping causative genes for common complex disorders (e.g. diabetes, heart conditions, obesity, etc.) • History of human populations • Genealogy • Forensics • Conservation genetics

Population stratification Definition: Correlation between subpopulations in the case/control samples and the phenotype under investigation. Effects: Confounds the study, typically leading to false positive correlations. Solution: The problem can be addressed either by careful sample collection or by statistical post-processing of the results (Price et al (2006) Nat Genet).

Population stratification (cont’d) Cases Population 1 Population 2 AC CC AA Example of the confounding effects of population stratification in an association study. Controls Marchini et al (2004) Nat Genet

Recall our objective… • Develop unsupervised, efficient algorithms for the selection of a small set of SNPs that can be used to • capture population structure, and • predict individual ancestry. • Why? cost efficiency. Let’s discuss (briefly) prior work …

Inferring population structure Oceania Africa Europe Central Asia East Asia Middle East America 377 STRPs, Rosenberg et al (2004) Science • Examples of available algorithms/software packages: • STRUCTURE Pritchard et al (2000) Genetics • FRAPPE Li et al (2008) Science

Selecting ancestry informative markers Existing methods (Fst, Informativeness, δ) Rosenberg et al (2003) Am J Hum Genet • Allele frequency based. • Require prior knowledge of individual ancestry (supervised). • Such knowledge may not be available. (e.g., populations of complex ancestry, large multi-centered studies of anonymous samples, etc.) • Unsupervised “feature selection” techniques are often preferable because they tend to not overfit the data.

The Singular Value Decomposition (SVD) Let A be a matrix with m rows (one for each subject) and n columns (one for each SNP). Matrix rows: points (vectors) in a Euclidean space, e.g., given 2 objects (x & d), each described with respect to two features, we get a 2-by-2 matrix. Two objects are “close” if the angle between their corresponding vectors is small.

2nd (right) singular vector 1st (right) singular vector: direction of maximal variance, 2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector. 1st (right) singular vector SVD, intuition Let the blue circles represent m data points in a 2-D Euclidean space. Then, the SVD of the m-by-2 matrix of the data will return …

2nd (right) singular vector 1st (right) singular vector Singular values 2 1: measures how much of the data variance is explained by the first singular vector. 2: measures how much of the data variance is explained by the second singular vector. 1

0 0 Let 1¸2¸ … ¸ be the entries of . Exact computation of the SVD takes O(min{mn2 , m2n}) time. The top k left/right singular vectors/values can be computed faster using Lanczos/Arnoldi methods. SVD: formal definition : rank of A U (V): orthogonal matrix containing the left (right) singular vectors of A. S: diagonal matrix containing the singular values of A.

Rank-k approximations via the SVD  A = U VT features sig. significant noise noise = significant noise objects

Rank-k approximations (Ak) • Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A. • k: diagonal matrix containing the top k singular values of A.

PCA and SVD Principal Components Analysis (PCA) essentially amounts to the computation of the Singular Value Decomposition (SVD) of a covariance matrix. SVD is the algorithmic tool behind MultiDimensional Scaling (MDS) and Factor Analysis.

The data… European Americans South Altaians - - Spanish African Americans Japanese Chinese Puerto Rico Nahua Mende Mbuti Mala Burunge Quechua Africa Europe E Asia America 274 individuals from 12 populations genotyped on ~10,000 SNPs Shriver et al (2005) Human Genomics

A B America Africa Asia Europe Paschou et al (2007) PLoS Genetics

A B America Africa Asia Europe Not altogether satisfactory: the singular vectors are linear combinations of all SNPs, and – of course – can not be assayed! Can we find actual SNPs that capture the information in the singular vectors? (E.g., spanning the same subspace …) Paschou et al (2007) PLoS Genetics

SVD decomposes a matrix as… The SVD has strong optimality properties. Top k left singular vectors • X = UkTA = k VkT • The columns of Uk are linear combinations of up to all columns of A.

Carefully chosen X CX decomposition Goal: make (some norm) of A-CX small. c columns of A Why? If A is an subject-SNP matrix, then selecting representative columns is equivalent to selecting representative SNPs to capture the same structure as the top eigenSNPs. We want c as small as possible!

Carefully chosen X CX decomposition Goal: make (some norm) of A-CX small. c columns of A Theory: for any matrix A, we can find C such that is almost equal to the norm of A-Ak with c ≈ k.

CX decomposition c columns of A Easy to prove that optimal X = C+A. (C+ is the Moore-Penrose pseudoinverse of C.) Thus, the challenging part is to find good columns (SNPs) of A to include in C. From a mathematical perspective, this is a hard combinatorial problem.

A theoremDrineas et al (2008) SIAM J Mat Anal Appl Given an m-by-n matrix A, there exists an algorithm that picks, in expectation, at most O( k log k / 2 ) columns of A runs in O(mn2) time, and with probability at least 1-10-20

The CX algorithm Input: m-by-n matrix A, target rank k, number of columns c Output: C, the matrix consisting of the selected columns • CX algorithm • Compute probabilities pj summing to 1 • For each j = 1,2,…,n, pick the j-th column of A with probability min{1,cpj} • Let C be the matrix consisting of the chosen columns • (C has – in expectation – at most c columns)

Subspace sampling (Frobenius norm) Vk: orthogonal matrix containing the top k right singular vectors of A. S k: diagonal matrix containing the top k singular values of A. Remark: The rows of VkT are orthonormal vectors, but its columns (VkT)(i)are not.

Subspace sampling (Frobenius norm) Vk: orthogonal matrix containing the top k right singular vectors of A. S k: diagonal matrix containing the top k singular values of A. Remark: The rows of VkT are orthonormal vectors, but its columns (VkT)(i)are not. Subspace sampling in O(mn2) time Leverage scores (many references in the statistics community) Normalization s.t. the pj sum up to 1

Deterministic variant of CX Paschou et al (2007) PLoS Genetics Mahoney and Drineas (2009) PNAS Input: m-by-n matrix A, integer k, and c (number of SNPs to pick) Output: the selected SNPs • CX algorithm • Compute the scores pj • Pick the columns (SNPs) corresponding to the top c scores.

Deterministic variant of CX (cont’d) Paschou et al (2007) PLoS Genetics Mahoney and Drineas (2009) PNAS Input: m-by-n matrix A, integer k, and c (number of SNPs to pick) Output: the selected SNPs: we will call them PCA Informative Markers or PCAIMs • CX algorithm • Compute the scores pj • Pick the columns (SNPs) corresponding to the top c scores. In order to estimate k for SNP data, we developed a permutation-based test to determine whether a certain principal component is significant or not. (A similar test was presented in Patterson et al (2006) PLoS Genetics)

Worldwide data European Americans South Altaians - - Spanish African Americans Japanese Chinese Puerto Rico Nahua Mende Mbuti Mala Burunge Quechua Africa Europe E Asia America 274 individuals, 12 populations, ~10,000 SNPs using the Affymetrix array Shriver et al (2005) Human Genomics

Afr Eur Asi Ame Selecting PCA-correlated SNPs for individual assignment to four continents (Africa, Europe, Asia, America) Africa Europe Asia America * top 30 PCA-correlated SNPs PCA-scores SNPs by chromosomal order Paschou et al (2007) PLoS Genetics

Correlation coefficient between true and predicted membership of an individual to a particular geographic continent. (Use a subset of SNPs, cluster the individuals using k-means.) Paschou et al (2007) PLoS Genetics

Cross-validation on HapMap data Paschou et al (2007) PLoS Genetics

A B • 9 indigenous populations from four different continents (Africa, Europe, Asia, Americas) • All SNPs and 10 principal components: perfect clustering! • 50 PCAIMs SNPs, almost perfect clustering.

Africa East Asia Oceania America South Central Asia Middle East Europe The Human Genome Diversity Panel Appox. 1000 individuals – 650K Illumina Array Li et al (2008) Science

Highest scoring genes in HGDP dataset * Barreiro et al (2008) Nat Genet, Sabeti et al (2007) Nature, The International HapMap Consortium (2007) Nature

A problem with the CX decomposition Input: m-by-n matrix A, integer k, and c (number of SNPs to pick) Output: the selected PCA Informative Markers or PCAIMs • CX algorithm • Compute the scores pj • Pick the columns (SNPs) corresponding to the top c scores. Problem: Highly correlated SNPs (a.k.a., SNPs that are in LD) get similar – high – scores, and thus the deterministic variant would select redundant SNPs. How do we remove this redundancy?

Column Subset Selection Problem (CSSP) Definition: Given an m-by-c matrix A, find k columns of A forming an m-by-k matrix C that are maximally uncorrelated (a.k.a., select maximally uncorrelated SNPs).

Column Subset Selection Problem (CSSP) Definition: Given an m-by-c matrix A, find k columns of A forming an m-by-k matrix C that are maximally uncorrelated (a.k.a., select maximally uncorrelated SNPs). Metric of correlation: A common formulation is to select a set of SNPs that span a parallel-piped of maximal volume. This formulation is NP-hard (i.e., intractable even for small values of m, c, and k).

Column Subset Selection Problem (CSSP) Definition: Given an m-by-c matrix A, find k columns of A forming an m-by-k matrix C that are maximally uncorrelated (a.k.a., select maximally uncorrelated SNPs). Metric of correlation: A common formulation is to select a set of SNPs that span a parallel-piped of maximal volume. This formulation is NP-hard (i.e., intractable even for small values of m, c, and k). Significant prior work: The CSSP has been studied in the Numerical Linear Algebra community, and many provably accurate approximation algorithms exist.

Prior work on the CSSP: 1965 – 2000Boutsidis, Mahoney, and Drineas (2009) SODA, under review in Num Math

The greedy QR algorithm We use a standard greedy approach (the Rank-Revealing QR factorization). The algorithm performs k iterations: In the first iteration, the top PCAIM is picked; In the second iteration, a PCAIM is picked that is as uncorrelated to with the previously selected PCAIM as possible; In the third iteration the chosen PCAIM has to be as uncorrelated as possible with the first two previously selected PCAIMs; And so on… Efficient implementations are available, and run in minutes for typical values of m, c, and k. Paschou et al (2008) PLoS Genetics

Dimensionality Reduction in the Analysis of Human Genetics Data

Dimensionality Reduction in the Analysis of Human Genetics Data

Presentation Transcript

Principal Component Analysis (Dimensionality Reduction)

Dimensionality reduction

Dimensionality Reduction

Dimensionality reduction

Dimensionality Reduction on Hyperspectral Data for Solids Analysis

Dimensionality reduction

Statistical analysis of array data: Dimensionality reduction, Clustering

Dimensionality reduction

Dimensionality reduction

Dimensionality Reduction

Dimensionality Reduction

Dimensionality Reduction

Dimensionality reduction

Dimensionality reduction

Dimensionality Reduction

Dimensionality reduction

Statistical analysis of array data: Dimensionality reduction, Clustering

Dimensionality reduction