
Bioinformatics for Stem Cell Lecture 2

Bioinformatics for Stem Cell Lecture 2. Debashis Sahoo, PhD. Outline. Lecture 1 Recap Multivariate analysis Microarray data analysis Boolean analysis Sequencing data analysis. Multivariate Analysis. Identify Markers of Human Colon Cancer and Normal Colon. Piero Dalerba. Tomer Kalisky.


Presentation Transcript


  1. Bioinformatics for Stem Cell Lecture 2 Debashis Sahoo, PhD

  2. Outline • Lecture 1 Recap • Multivariate analysis • Microarray data analysis • Boolean analysis • Sequencing data analysis

  3. Multivariate Analysis

  4. Identify Markers of Human Colon Cancer and Normal Colon Piero Dalerba Tomer Kalisky

  5. Single Cell Analysis of Normal Human Colon Epithelium

  6. Hierarchical Clustering

  7. Hierarchical Clustering • Cluster 3.0 • http://bonsai.hgc.jp/~mdehoon/software/cluster/ • Distance metric • Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson’s correlation • Linkage • Single, complete, average, median, centroid
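
To make the distance metrics and linkage rules above concrete, here is a minimal standard-library sketch. It is not Cluster 3.0's implementation; the function names are hypothetical, only some of the listed metrics are covered, and "Pearson distance" is taken as 1 minus the correlation coefficient, which is one common convention.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two expression profiles.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    # Largest absolute coordinate difference (Chebyshev distance).
    return max(abs(a - b) for a, b in zip(x, y))

def pearson_distance(x, y):
    # Assumed convention: distance = 1 - Pearson correlation.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def linkage_distance(cluster_a, cluster_b, metric, method="average"):
    # Distance between two clusters, derived from all pairwise distances.
    pairwise = [metric(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unsupported linkage method: " + method)
```

Single linkage merges on the closest pair, complete linkage on the farthest pair, and average linkage on the mean of all pairs, so the same data can give different trees depending on the choice.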

  8. Multivariate Analysis - PCA • Principal Component Analysis • X = data matrix • V = loading matrix • U = scores matrix

  9. Fundamentals of PCA • Reduces dimensions of the data • PCA uses orthogonal linear transformation • First principal component has the largest possible variance. • Exploratory tool to uncover unknown trends in the data
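
The idea that the first principal component is the direction of largest variance can be sketched with power iteration on the covariance matrix. This is an illustrative standard-library sketch, not the method used in the lecture; the function name and the fixed iteration count are assumptions. In the notation of slide 8, the returned loading vector corresponds to one column of V and the returned scores to one column of U.

```python
import math

def first_principal_component(rows, iterations=200):
    # rows: list of samples, each a list of d feature values.
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumption: the start vector is not orthogonal to the leading direction.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Variance captured along v, and the projection (score) of each sample.
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance, scores
```

For points lying exactly on the line y = 2x, the loading vector comes out proportional to (1, 2) and the first component carries all of the variance.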

  10. PCA Analysis

  11. High-throughput data analysis

  12. Microarray analysis

  13. Microarray • Spotted vs. in situ • Two channel vs. one channel • Probe vs. probeset vs. gene

  14. Quantile Normalization • Sort the values within each array (#1, #2, #3) • Average the sorted values across arrays at each rank to obtain SortedAvg • Val(Probe_i) = SortedAvg[Rank(Probe_i)]
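
The rule Val(Probe_i) = SortedAvg[Rank(Probe_i)] can be sketched as follows. This is a minimal standard-library illustration; the function name is hypothetical, and ties are broken by probe order, which is a simplification (other implementations may average tied ranks).

```python
def quantile_normalize(arrays):
    # arrays: list of arrays, each a list of probe values in the same probe order.
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    sorted_avg = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n_probes)]
    normalized = []
    for a in arrays:
        # order[rank] is the probe index holding that rank in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = sorted_avg[rank]
        normalized.append(out)
    return normalized
```

After normalization every array contains exactly the same set of values, so the arrays share one distribution while each probe keeps its rank within its own array.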

  15. Invariant Set Normalization [Figure panels: Before Normalization, Invariant set, After Normalization]

  16. Good to Check the Image

  17. SAM Two-Class Unpaired • 1. Assign experiments to two groups, e.g., in an expression matrix of Genes 1-6 by Experiments 1-6, assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B. • 2. Question: Is the mean expression level of a gene in group A significantly different from the mean expression level in group B?

  18. SAM Two-Class Unpaired • Permutation tests • i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene. • ii) Rank the genes in ascending order of their d-values. • iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene. [Figure: Gene 1 under the original grouping and under a randomized grouping]

  19. SAM Two-Class Unpaired • iv) Rank the permuted d-values of the genes in ascending order. • v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene. • vi) Plot the observed d-values vs. the expected d-values.
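
Steps i) to vi) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the SAM software: the d-value is taken as a t-like statistic (mean of B minus mean of A, over a pooled standard error plus a constant s0), the default s0 = 0.1 is arbitrary, and all function and parameter names are hypothetical.

```python
import math
import random

def d_value(values, a_idx, b_idx, s0=0.1):
    # t-like statistic for one gene; s0 is an assumed small stabilizing constant.
    a = [values[i] for i in a_idx]
    b = [values[i] for i in b_idx]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)

def sam_observed_vs_expected(matrix, a_idx, b_idx, n_perm=100, s0=0.1, seed=0):
    # matrix: one row of expression values per gene, one column per experiment.
    rng = random.Random(seed)
    # Steps i) and ii): observed d-values, ranked in ascending order.
    observed = sorted(d_value(row, a_idx, b_idx, s0) for row in matrix)
    columns = list(a_idx) + list(b_idx)
    totals = [0.0] * len(matrix)
    for _ in range(n_perm):
        # Step iii): reshuffle group labels, keeping the group sizes.
        shuffled = columns[:]
        rng.shuffle(shuffled)
        perm_a, perm_b = shuffled[:len(a_idx)], shuffled[len(a_idx):]
        # Step iv): rank the permuted d-values.
        perm_d = sorted(d_value(row, perm_a, perm_b, s0) for row in matrix)
        for rank, d in enumerate(perm_d):
            totals[rank] += d
    # Step v): the expected d-value at each rank is the average over permutations.
    expected = [t / n_perm for t in totals]
    # Step vi): plot observed against expected, rank by rank.
    return observed, expected
```

For the grouping on slide 17 (Experiments 1, 2 and 5 in group A; 3, 4 and 6 in group B), the call would use zero-based column indices `a_idx=[0, 1, 4]` and `b_idx=[2, 3, 5]`.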

  20. SAM Two-Class Unpaired • The more a gene deviates from the “observed d = expected d” line, the more likely it is to be significant. • Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene), whose observed d-value exceeds the expected d-value by at least delta, is considered significant. • Significant positive genes: mean expression of group B > mean expression of group A. • Significant negative genes: mean expression of group A > mean expression of group B.

  21. GenePattern http://genepattern.broadinstitute.org/

  22. AutoSOME http://jimcooperlab.mcdb.ucsb.edu/autosome/ Aaron Newman Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117

  23. Gene Set Analysis • Compute enrichment of your gene set in pathways and networks • Examples: Cell Cycle, Transcription factor, TGF-beta Signaling Pathway, Wnt-signaling Pathway, Protein-protein interaction network • Tools: GSEA, DAVID, ToppFun, MSigDB, and STRING

  24. Boolean Analysis

  25. Boolean Implication • Analyze pairs of genes. • Analyze the four different quadrants. • Identify sparse quadrants. • Record the Boolean relationships. • If ACPP high, then GABRB1 low • If GABRB1 high, then ACPP low • [Figure: scatter plot of ACPP and GABRB1 expression across 45,000 Affymetrix microarrays] [Sahoo et al. Genome Biology 08]

  26. Threshold Calculation • A threshold is determined for each gene. • The arrays are sorted by gene expression. • StepMiner is used to determine the threshold. • [Figure: CDH expression across sorted arrays, with High, Intermediate, and Low regions around the threshold] [Sahoo et al. 07]
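
The threshold step can be illustrated by fitting a single step function to the sorted expression values. This sketch is in the spirit of StepMiner but is not the published implementation: placing the threshold at the midpoint of the two fitted means, and the `margin` that defines the intermediate band, are assumptions made for illustration, as are the function names.

```python
def step_threshold(values):
    # Fit a one-step function to the sorted values by least squares.
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    left_sum = left_sq = 0.0
    best_sse = None
    threshold = None
    for k in range(1, n):  # first k values form the low level
        left_sum += xs[k - 1]
        left_sq += xs[k - 1] ** 2
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared errors around the two level means.
        sse = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        if best_sse is None or sse < best_sse:
            best_sse = sse
            # Assumption: threshold is the midpoint of the two fitted means.
            threshold = (left_sum / k + right_sum / (n - k)) / 2
    return threshold

def classify(value, threshold, margin=0.5):
    # margin is a hypothetical half-width of the intermediate band.
    if value < threshold - margin:
        return "low"
    if value > threshold + margin:
        return "high"
    return "intermediate"
```

Arrays classified as intermediate would typically be left out when counting the four quadrants for a gene pair, so that noisy calls near the threshold do not blur a sparse quadrant.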

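  27. BooleanNet Statistics • Contingency table for genes A and B: a00 (A low, B low), a01 (A low, B high), a10 (A high, B low), a11 (A high, B high) • nAlow = (a00 + a01), nBlow = (a00 + a10) • total = a00 + a01 + a10 + a11, observed = a00 • expected = (nAlow / total * nBlow / total) * total • $\text{statistic} = \dfrac{\text{expected} - \text{observed}}{\sqrt{\text{expected}}}$ • $\text{error rate} = \dfrac{1}{2}\left(\dfrac{a_{00}}{a_{00} + a_{01}} + \dfrac{a_{00}}{a_{00} + a_{10}}\right)$ • Boolean Implication = (statistic > 3, error rate < 0.1) [Sahoo et al. Genome Biology 08]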
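
The sparsity test for the low-low quadrant can be written directly from the definitions on this slide. The function name is hypothetical and the cutoffs default to the values stated on the slide (statistic > 3, error rate < 0.1); a sparse low-low quadrant corresponds to the implication "A low, then B high".

```python
import math

def low_low_quadrant_test(a00, a01, a10, a11, stat_cutoff=3.0, error_cutoff=0.1):
    # a00: A low, B low; a01: A low, B high; a10: A high, B low; a11: A high, B high.
    total = a00 + a01 + a10 + a11
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    if total == 0 or n_a_low == 0 or n_b_low == 0:
        raise ValueError("quadrant counts do not allow the test")
    observed = a00
    # Count expected in the low-low quadrant if A and B were independent.
    expected = (n_a_low / total * n_b_low / total) * total
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    is_sparse = statistic > stat_cutoff and error_rate < error_cutoff
    return statistic, error_rate, is_sparse
```

The same test can be applied to any of the other three quadrants by relabeling which counts play the roles of a00, a01 and a10.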

  28. Six Boolean Implications [Sahoo et al. Genome Biology 08]

  29. MiDReG Algorithm MiDReG = (Mining Developmentally Regulated Genes) [Sahoo et al. PNAS 2010]

  30. MiDReG Algorithm MiDReG = (Mining Developmentally Regulated Genes) [Sahoo et al. PNAS 2010]

  31. MiDReG Algorithm MiDReG = (Mining Developmentally Regulated Genes) [Sahoo et al. PNAS 2010]

  32. B Cell Genes KIT CD19 Boolean Implications [Sahoo et al. PNAS 2010]

  33. Jun Seita http://gexc.stanford.edu [Seita, Sahoo et al. PLoS ONE, 2012]

  34. Sequencing data analysis

  35. Sequencing Data Format
      FASTA:
        >SEQUENCE_1
        MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
        LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
        IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
        MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
        >SEQUENCE_2
        SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
        ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
      FASTQ:
        @HWI-EAS209:5:58:5894:21141#ATCACG/1
        TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
        +HWI-EAS209:5:58:5894:21141#ATCACG/1
        efcfffffcfeefffcffffffddf`feed]`]_Ba
      FASTQ quality encodings (code - platform: encoding, score range):
        S - Sanger: Phred+33, (0, 40)
        X - Solexa: Solexa+64, (-5, 40)
        I - Illumina 1.3+: Phred+64, (0, 40)
        J - Illumina 1.5+: Phred+64, (3, 40)
        L - Illumina 1.8+: Phred+33, (0, 41)
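
Decoding the quality line of a FASTQ record amounts to subtracting the encoding offset from each character's ASCII code. The sketch below uses the example read from this slide and assumes it is Phred+64 encoded, since characters such as 'e' (ASCII 101) would decode to 68 under Phred+33, outside every range listed above. Function names are hypothetical.

```python
def parse_fastq_record(lines):
    # A FASTQ record is four lines: @name, sequence, +[name], quality.
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines[:4])
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality

def decode_quality(quality, offset):
    # offset is 33 for Phred+33 encodings and 64 for Phred+64 encodings.
    return [ord(ch) - offset for ch in quality]

record = [
    "@HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT",
    "+HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "efcfffffcfeefffcffffffddf`feed]`]_Ba",
]
name, sequence, quality = parse_fastq_record(record)
scores = decode_quality(quality, offset=64)  # assumed Phred+64
```

Here the first base decodes to a score of 37, while the 'B' near the end of the read decodes to 2, the lowest score in this record.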

  36. Mapping

  37. Mapping Software • Long reads • BLAST, HMMER, SSEARCH • Short reads • BLAT • Bowtie, BWA, Partek, SOAP, Tophat, Olego, BarraCUDA

  38. Visualizations

  39. Visualizations • UCSC Genome Browser • GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2 • Integrative Genomics Viewer (IGV)

  40. Quantification • Peak calling • QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT • Expression quantification • Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ • SNP calling • samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH

  41. Peak Discovery [Pepke et al. Nature Methods 2009]

  42. Transcript Quantification RPKM, FPKM [Pepke et al. Nature Methods 2009]
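
RPKM stands for reads per kilobase of transcript per million mapped reads; FPKM is the same quantity computed on fragments (read pairs) instead of individual reads. A minimal sketch of the arithmetic, with hypothetical function and parameter names:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    # reads / (transcript length in kb) / (mapped reads in millions)
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (total_mapped_reads * transcript_length_bp)

def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    # Same normalization, counting fragments rather than reads.
    return rpkm(fragment_count, transcript_length_bp, total_mapped_fragments)
```

For example, 1,000 reads on a 2,000 bp transcript in a library of 10 million mapped reads gives 1000 / 2 kb / 10 M = 50 RPKM.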

  43. SNP Calling

  44. Typical RNA-seq Workflow [Trapnell et al. Nature Biotech 2010]

  45. [Trapnell et al. Nature Biotech 2010]
