Probabilistic Sparse Matrix Factorization

Delbert Dueck, Quaid Morris, Brendan Frey (Probabilistic & Statistical Inference Group); Tim Hughes (Banting and Best Department of Medical Research).


Presentation Transcript


  1. Probabilistic Sparse Matrix Factorization • Delbert Dueck, Quaid Morris, Brendan Frey (Probabilistic & Statistical Inference Group) • Tim Hughes (Banting and Best Department of Medical Research)

  2. Objective • Patterns in gene expression array data can help explain gene regulation and predict the function of as-yet-uncharacterized genes • Objective: to develop a method of probabilistic sparse matrix factorization (PSMF) and apply it to gene expression data to learn the hidden structure underlying the data

  3. Biological Background • Genes encode basic information about an organism • They tend to be highly expressed in tissues related to their functional role • Mouse gene expression data is from Zhang, Morris, et al. (2004) • Gene expression is influenced by the presence of transcription factors (TFs) • Co-expressed genes are likely activated by the same TFs • The activity of each gene can be explained by the activities of a small number of transcription factors

  4. Gene Expression Array Dataset • Entire data set: a G×T matrix X (G=22709 genes, T=55 tissues) • Expression vector for gene XM_133866.1: x_g (g=10056), a row vector of length T=55 with scalar expression values x_gt • Tissues labelled in the figure: bladder (t=3), colon (t=9), hindbrain (t=22), large intestine (t=25), lymph node (t=28), midbrain (t=31), pancreas (t=34), small intestine (t=41), spleen (t=44), stomach (t=45) • Figure panels: G=22709 genes × T=55 tissues, and 100 genes × T=55 tissues

  5. Sparse Matrix Factorization • Gene expression data model: each gene's expression profile (x_g) is a linear combination (weighted by y_gc, c ∈ s_g) of a small number (r_g ≤ N) of C possible transcription factor profiles (z_c, c ∈ s_g)
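
The data model on this slide can be illustrated by generating a small synthetic data set. The following is a minimal sketch, not the authors' code: the function name `sample_psmf_data`, the toy sizes, the uniform choice of r_g between 1 and N, and the Gaussian draws for the weights are all assumptions made purely for illustration (the model itself assumes nothing about the weights).

```python
import random

def sample_psmf_data(G=6, T=5, C=4, N=2, noise_sd=0.1, seed=0):
    """Draw toy data in which each row of X mixes at most N of C factor profiles."""
    rng = random.Random(seed)
    # Z holds C hidden factor profiles, each a vector of length T.
    Z = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(C)]
    Y = [[0.0] * C for _ in range(G)]   # sparse weight matrix, G x C
    S = []                              # S[g] lists the factors used by gene g
    for g in range(G):
        r_g = rng.randint(1, N)            # number of factors for gene g (assumed uniform)
        s_g = rng.sample(range(C), r_g)    # distinct factor indices for gene g
        for c in s_g:
            Y[g][c] = rng.gauss(0.0, 1.0)  # illustrative weights only
        S.append(s_g)
    # x_g = sum over c in s_g of y_gc * z_c, plus Gaussian noise.
    X = [[sum(Y[g][c] * Z[c][t] for c in S[g]) + rng.gauss(0.0, noise_sd)
          for t in range(T)]
         for g in range(G)]
    return X, Y, Z, S
```

Calling `sample_psmf_data(noise_sd=0.0)` returns a matrix X that equals the product of the sparse Y and Z exactly, which makes the sparsity pattern easy to inspect.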

  6. Sparse Matrix Factorization Matrix format: (entire dataset)
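
The slide's equation was not captured in the transcript. Under the per-gene model stated on the previous slide, the matrix form would plausibly read as follows; this is a reconstruction, and the noise term is written generically as an assumption.

```latex
% Per-gene form: row g of X mixes the factor profiles indexed by s_g.
x_g = \sum_{c \in s_g} y_{gc}\, z_c + \text{noise}

% Matrix form for the entire data set.
X = Y Z + \text{noise},
\qquad X \in \mathbb{R}^{G \times T},\quad
Y \in \mathbb{R}^{G \times C},\quad
Z \in \mathbb{R}^{C \times T}
```

Here row g of Y has only r_g non-zero entries, located at the columns listed in s_g, which is what makes the factorization sparse.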

  7. Probabilistic Sparse Matrix Factorization • To express as a distribution, assume … • varying levels of Gaussian noise in the data • nothing about transcription factor weights • normally-distributed transcription factor profiles • uniformly-distributed factor assignments • multinomially-distributed factor counts

  8. Probabilistic Sparse Matrix Factorization • To express as a distribution, assume … • varying levels of Gaussian noise in the data • nothing about transcription factor weights • normally-distributed transcription factor profiles • uniformly-distributed factor assignments • multinomially-distributed factor counts • Multiply these together to get the joint distribution
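
The individual distributions were shown as equations that did not survive in the transcript. One way to write the five assumptions and their product is sketched below. The symbols \psi_g (a per-gene noise level, one reading of "varying levels") and \nu (the factor-count probabilities) are assumed names, and the unit-variance prior on the factor profiles is likewise an assumption.

```latex
% Gaussian noise in the data, with a noise level that varies by gene (assumed).
P(X \mid Y, Z, S, r, \Psi) = \prod_{g=1}^{G} \prod_{t=1}^{T}
  \mathcal{N}\!\Big(x_{gt};\ \sum_{c \in s_g} y_{gc}\, z_{ct},\ \psi_g^2\Big)

% Nothing assumed about the weights: a flat prior.
P(Y) \propto 1

% Normally-distributed factor profiles (unit variance assumed).
P(Z) = \prod_{c=1}^{C} \prod_{t=1}^{T} \mathcal{N}(z_{ct};\ 0,\ 1)

% Uniformly-distributed factor assignments over the C factors.
P(S \mid r) = \prod_{g=1}^{G} \prod_{n=1}^{r_g} \frac{1}{C}

% Multinomially-distributed factor counts.
P(r) = \prod_{g=1}^{G} \nu_{r_g}

% Joint distribution: the product of the terms above.
P(X, Y, Z, S, r \mid \Psi) =
  P(X \mid Y, Z, S, r, \Psi)\, P(Y)\, P(Z)\, P(S \mid r)\, P(r)
```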

  9. Factorized Variational Inference • Exact inference is intractable with P(∙)

  10. Factorized Variational Inference • Exact inference is intractable with P(∙) • Approximate it by a simpler distribution, Q(∙), and perform inference on that
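
To make the idea of a factorized Q(·) concrete, here is a toy example on a two-variable discrete distribution rather than on the PSMF model itself. It fits Q(a, b) = q1(a) q2(b) to a joint table P(a, b) by coordinate updates that minimize KL(Q || P). The function names and the toy setting are assumptions for illustration, not the authors' algorithm.

```python
import math

def _softmax(logits):
    """Exponentiate and normalize, shifting by the maximum for stability."""
    m = max(logits)
    w = [math.exp(v - m) for v in logits]
    total = sum(w)
    return [v / total for v in w]

def kl_to_joint(q1, q2, P):
    """KL divergence from the factorized Q(a, b) = q1(a) q2(b) to the joint P."""
    return sum(q1[a] * q2[b] *
               (math.log(q1[a]) + math.log(q2[b]) - math.log(P[a][b]))
               for a in range(len(q1)) for b in range(len(q2))
               if q1[a] > 0 and q2[b] > 0)

def mean_field(P, iters=50):
    """Fit q1, q2 to a strictly positive joint table P by coordinate updates."""
    A, B = len(P), len(P[0])
    q1 = [1.0 / A] * A
    q2 = [1.0 / B] * B
    history = []
    for _ in range(iters):
        # q1(a) is proportional to exp(E_{q2}[log P(a, b)]).
        q1 = _softmax([sum(q2[b] * math.log(P[a][b]) for b in range(B))
                       for a in range(A)])
        # q2(b) is proportional to exp(E_{q1}[log P(a, b)]).
        q2 = _softmax([sum(q1[a] * math.log(P[a][b]) for a in range(A))
                       for b in range(B)])
        history.append(kl_to_joint(q1, q2, P))
    return q1, q2, history
```

Each update is the exact minimizer of the KL divergence with the other factor held fixed, so the recorded divergence never increases from one sweep to the next. When P is itself a product of two marginals, the divergence reaches zero.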

  11. Visualization • Probabilistic sparse matrix factorization with C=50 possible factors, N=3 factors per gene (max), and P(r_g)=[.55 .27 .18] • Genes are sorted by primary transcription factor (s_g1)

  12. Results – p-value histograms • Genes can be partitioned into "primary classes" (i.e. same s_g1 value), "secondary classes", etc. • Compare classes with annotated gene ontology (GO-BP) categories for statistical significance
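
The slide does not say which significance test was used. A common choice for comparing a gene class with an annotated category is the hypergeometric tail probability, sketched below under that assumption. The function name and argument names are hypothetical.

```python
from math import comb

def enrichment_p_value(n_genes, n_category, n_class, n_overlap):
    """Probability of seeing at least n_overlap category genes in a class
    of n_class genes drawn without replacement from n_genes genes, of which
    n_category carry the annotation (hypergeometric upper tail)."""
    upper = min(n_category, n_class)
    tail = sum(comb(n_category, k) * comb(n_genes - n_category, n_class - k)
               for k in range(n_overlap, upper + 1))
    return tail / comb(n_genes, n_class)
```

For example, `enrichment_p_value(10, 5, 5, 5)` asks how likely it is that a class of 5 genes out of 10 consists entirely of the 5 annotated genes; only one of the `comb(10, 5)` possible classes does.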

  13. Results – mean log10 p-values

  14. Results – count of significant p-values

  15. Future Directions – different Q(·) • Iterated conditional modes (point estimates)
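
As a rough illustration of what point estimates by iterated conditional modes could look like, the sketch below handles a heavily simplified case: one factor per gene, equal noise for every gene, and no prior on the factor profiles, so each conditional mode reduces to a least-squares update. It updates each gene's assignment and weight together as a block. All names are hypothetical, and this is not the authors' algorithm.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_error(X, Z, s, y):
    """Total squared error of approximating each row x_g by y_g * z_{s_g}."""
    return sum((x_t - y[g] * z_t) ** 2
               for g, x in enumerate(X)
               for x_t, z_t in zip(x, Z[s[g]]))

def icm_one_factor(X, C, iters=20, seed=0):
    """Alternate point-estimate updates of (s_g, y_g) and of each profile z_c."""
    rng = random.Random(seed)
    G, T = len(X), len(X[0])
    Z = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(C)]
    s = [0] * G      # factor assigned to each gene
    y = [0.0] * G    # weight of that factor for each gene
    errors = []
    for _ in range(iters):
        # Best (factor, weight) pair for each gene with the profiles fixed.
        for g, x in enumerate(X):
            best = None
            for c in range(C):
                zz = dot(Z[c], Z[c])
                w = dot(x, Z[c]) / zz if zz > 0 else 0.0
                err = sum((a - w * b) ** 2 for a, b in zip(x, Z[c]))
                if best is None or err < best[0]:
                    best = (err, c, w)
            _, s[g], y[g] = best
        # Best profile for each factor with assignments and weights fixed.
        for c in range(C):
            members = [g for g in range(G) if s[g] == c]
            denom = sum(y[g] ** 2 for g in members)
            if denom > 0:
                Z[c] = [sum(y[g] * X[g][t] for g in members) / denom
                        for t in range(T)]
        errors.append(sq_error(X, Z, s, y))
    return Z, s, y, errors
```

Because each step minimizes the squared error over one block of variables with the rest held fixed, the values in `errors` do not increase across iterations.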

  16. Summary • Introduced probabilistic sparse matrix factorization (PSMF), in which each row of the data matrix is a linear combination of a "small" number of hidden factors selected from a larger set • Described a variational inference algorithm for fitting the PSMF model • Evaluated the model on a gene function prediction task
