This presentation discusses the application of Principal Component Analysis (PCA) to visualizing tree topology, particularly for tree-structured data that do not behave like ordinary PCA projections. Key challenges include enforcing non-negative branch lengths and handling large flat spots in the data. Alternative approaches are explored, including branch length representations, tree pruning, non-negative matrix factorization, and Bayesian factor models. The findings emphasize the need for careful interpretation of negative values in PC modes of variation and the tendency of trees toward bursts of nearby branches, offering insights into more effective PCA strategies for trees.
Principal Component Analysis of Tree Topology • Yongdai Kim, Seoul National University • 2011. 6. 5 • Presented by J. S. Marron, SAMSI
Dyck Path Challenges • Data trees not like PC projections • Branch Lengths ≥ 0 • Big flat spots
Brain Data: Mean − 2σ1 PC1 • Careful about values < 0
Interpretation: Directions Leave the Positive Orthant • [figure]
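The point of these two slides can be checked numerically. Below is a minimal sketch with toy data (not the brain data): every branch-length vector is non-negative, yet the "mean − 2·sd" extreme of the PC1 mode of variation has a negative entry, i.e. the PC direction leaves the positive orthant.

```python
import numpy as np

# Toy branch-length data: all entries >= 0, with strong variation at node 0.
X = np.column_stack([np.linspace(0, 4, 50),   # node 0: varies a lot
                     np.full(50, 1.0),        # other nodes: nearly flat
                     np.full(50, 0.5)])

mean = X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
v1, sd1 = evecs[:, -1], np.sqrt(evals[-1])     # PC1 direction and score sd
v1 = v1 * np.sign(v1[np.argmax(np.abs(v1))])   # fix the eigenvector sign

# The mean - 2*sd extreme of the PC1 mode of variation goes negative,
# even though every data vector lies in the positive orthant.
extreme = mean - 2 * sd1 * v1
print(extreme.min() < 0)   # True: a "branch length" has gone negative
```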
Visualize Trees Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches
Dyck Path Challenges • Data trees not like PC projections • Branch Lengths ≥ 0 • Big flat spots • Alternate Approaches: • Branch Length Representation • Tree Pruning • Non-negative Matrix Factorization • Bayesian Factor Model
Dyck Path Challenges • Data trees not like PC projections • Branch Lengths ≥ 0 • Big flat spots • Alternate Approaches: • Branch Length Representation • Tree Pruning • Non-negative Matrix Factorization • Bayesian Factor Model Discussed by Dan
Dyck Path Challenges • Data trees not like PC projections • Branch Lengths ≥ 0 • Big flat spots • Alternate Approaches: • Branch Length Representation • Tree Pruning • Non-negative Matrix Factorization • Bayesian Factor Model Discussed by Lingsong
Non-negative Matrix Factorization • Ideas: • Linearly Approximate Data (as in PCA) • But Stay in the Positive Orthant
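The slide gives only the idea, not an algorithm; a minimal sketch using the classical Lee-Seung multiplicative updates (not necessarily the NMF variant used in the talk) shows how the factorization stays in the positive orthant by construction:

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9, seed=0):
    """Factor X ~= W @ H with W, H >= 0 (Lee-Seung multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, p)) + eps
    for _ in range(n_iter):
        # Multiplying by non-negative ratios keeps every entry >= 0.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Non-negative toy data standing in for branch-length vectors: every
# approximation W @ H stays in the positive orthant, unlike PC projections.
X = np.abs(np.random.default_rng(1).normal(size=(20, 8)))
W, H = nmf(X, r=3)
print("relative error:", np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```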
Dyck Path Challenges • Data trees not like PC projections • Branch Lengths ≥ 0 • Big flat spots • Alternate Approaches: • Branch Length Representation • Tree Pruning • Non-negative Matrix Factorization • Bayesian Factor Model
Contents • Introduction • Proposed Method • Bayesian Factor Model • PCA • Estimation of Projected Trees
Introduction • Given data trees t1, …, tn, let x1, …, xn be the corresponding branch length vectors. • Dimension p = # nodes in the support (union) tree. • For tree ti, define the tree topology vector yi, a p-dimensional binary vector where yij = 1 if node j appears in ti and yij = 0 otherwise. • Goal: a PCA method for y1, …, yn
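These definitions can be illustrated with a tiny made-up example (three trees, a four-node support tree; the rule "node present ⇔ branch length > 0" is an assumption consistent with the slide):

```python
import numpy as np

# Branch-length vectors x_i (rows), indexed by the p = 4 nodes of the
# support (union) tree; zeros are the "flat spots" where a node is absent.
X = np.array([[0.0, 1.2, 0.0, 0.7],
              [0.5, 0.9, 0.0, 0.0],
              [0.0, 0.0, 0.3, 0.4]])

# Tree topology vectors y_i: binary indicators of node presence.
Y = (X > 0).astype(int)
print(Y)
# [[0 1 0 1]
#  [1 1 0 0]
#  [0 0 1 1]]
```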
Goal of Bayes Factor Model Model Large Flat Spots as yi = 0
Proposed Method • Gaussian Latent Variable Model • Est. Corr. Matrix: Bayes Factor Model • PCA on Est’ed Correlation Matrix • Interpret in Tree Space
Proposed Method • Gaussian Latent Variable Model
Proposed Method • Estimation of the correlation matrix • Estimate the correlation matrix of the latent Gaussian variables by a Bayesian factor model
Proposed Method 3. PCA with the estimated correlation matrix • Apply PCA to the estimated correlation matrix
Proposed Method • Estimation of projected tree • Define projected trees on PCA directions • Estimate the projected trees by MCMC algorithm
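The four steps above can be sketched end to end. Everything below is a toy stand-in: a hypothetical one-factor correlation matrix R for the latent Gaussians, thresholding at 0, and the sample correlation of the binary data as a crude surrogate for the Bayesian factor-model estimate (step 4, the MCMC estimation of projected trees, is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 6, 500

# Step 1 (sketch): latent Gaussian vectors z_i ~ N(0, R), with a
# hypothetical one-factor correlation structure R (not the paper's).
lam = np.full(p, 0.8)
R = np.outer(lam, lam) + np.diag(1 - lam**2)
Z = rng.multivariate_normal(np.zeros(p), R, size=n)

# Thresholding gives binary topology vectors with flat spots y_ij = 0.
Y = (Z > 0).astype(int)

# Step 2 (stand-in): the talk estimates R with a Bayesian factor model;
# here we use the sample correlation of Y as a crude surrogate.
R_hat = np.corrcoef(Y, rowvar=False)

# Step 3: PCA = eigendecomposition of the estimated correlation matrix.
evals, evecs = np.linalg.eigh(R_hat)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
print("proportion of variance, PC1:", evals[0] / evals.sum())
```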
Bayesian Factor Model • Model • Priors • MCMC algorithm • Convergence diagnostic
Bayesian Factor Model • Model
Bayesian Factor Model • Prior • This prior was proposed by Ghosh and Dunson (2009)
Bayesian Factor Model • MCMC algorithm (Gibbs sampler) • Steps 1–5: generate each block of parameters in turn from its full conditional distribution
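The full-conditional formulas from the original slides did not survive extraction. As a generic illustration of the Gibbs structure, here is a sampler for a toy one-factor Gaussian model with conjugate normal priors and known residual variance; this is not the paper's binary-data model or the Ghosh-Dunson (2009) parameter-expanded prior:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
sigma2 = 0.5                       # residual variance, held fixed for brevity

# Simulate data from x_i = lam * eta_i + eps_i (one-factor toy model).
lam_true = np.array([1.0, 0.8, 0.6, 0.9, 0.7])
eta_true = rng.normal(size=n)
X = np.outer(eta_true, lam_true) + rng.normal(scale=np.sqrt(sigma2), size=(n, p))

lam = np.ones(p)                   # initialize loadings
for it in range(500):
    # Step A: eta_i | lam, X ~ N(m_i, v), from a N(0, 1) prior on eta_i.
    v = 1.0 / (1.0 + lam @ lam / sigma2)
    m = (X @ lam / sigma2) * v
    eta = m + np.sqrt(v) * rng.normal(size=n)

    # Step B: lam_j | eta, X ~ N(m_j, v_l), from a N(0, 1) prior on lam_j.
    v_l = 1.0 / (1.0 + eta @ eta / sigma2)
    m_l = (X.T @ eta / sigma2) * v_l
    lam = m_l + np.sqrt(v_l) * rng.normal(size=p)

lam *= np.sign(lam[0])             # fix the sign ambiguity of the loadings
print(np.round(lam, 1))
```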
Bayesian Factor Model • Convergence diagnostics • 100,000 MCMC iterations after 10,000 burn-in iterations • 1,000 posterior samples, keeping one draw every 100 iterations • Trace plots, ACFs (autocorrelation functions), and histograms of three selected parameters (25%, 50%, 75% quantiles) and one further selected parameter
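The ACF diagnostic on the slide is easy to compute by hand. A minimal sketch, using an AR(1) series as a stand-in for an MCMC chain (high lag-1 autocorrelation is what motivates thinning, e.g. keeping every 100th draw):

```python
import numpy as np

def acf(chain, max_lag=20):
    """Sample autocorrelation function of a 1-D MCMC chain."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# An AR(1) series mimics the autocorrelation of raw MCMC output.
rng = np.random.default_rng(0)
z = np.empty(5000)
z[0] = 0.0
for t in range(1, 5000):
    z[t] = 0.9 * z[t - 1] + rng.normal()

r = acf(z)
print(round(r[0], 2), round(r[1], 2))   # lag-0 is 1 by construction
```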
PCA • Scree plot
Visualizing Modes of Variation • Hard to Interpret • Scaling Issues? • Promising and Intuitive • Work in Progress … • Future goals: • Improved Notion of PCA • Tune Bayes Approach for Better Interpretation • Integrate with Non-Negative Matrix Factorization • …