
Computational Immunology

Computational Immunology. Steven H. Kleinstein, Department of Pathology, Yale University School of Medicine. Introductory! You can still register until April 28, 2008. Outline: covers three broad topics, focusing on “new” computational methods, including promoter analysis / cis-regulatory analysis.


Presentation Transcript


  1. Computational Immunology. Steven H. Kleinstein, Department of Pathology, Yale University School of Medicine.

  2. Introductory! You can still register until April 28, 2008

  3. Outline. Covers three broad topics, focusing on “new” computational methods:
     • Promoter analysis / cis-regulatory analysis
       • Over-representation
       • Gene Set Enrichment Analysis
     • Multiple hypothesis testing
       • Bonferroni
       • False discovery rate
     • Dynamic modeling
       • Labeling models
       • Viral dynamics

  4. Promoter Analysis. Hands-on mini course on May 1, 2008 at 1 PM. Sridhar Hannenhalli, Penn Center for Bioinformatics, Department of Genetics, University of Pennsylvania. To register: http://tsb.mssm.edu/cgi-bin/g/reg/InSilico/reg.cgi

  5. Identifying regulators of TLR responses. Temporal activation of macrophages by the TLR4 agonist bacterial lipopolysaccharide (LPS). K-means clustering defined 11 groups of genes comprising regulated ‘waves’ of transcription. Hypothesis: clustered genes are co-regulated and share cis-regulatory elements.

  6. Can we identify TFs driving B cell differentiation? Implicate TFs by analyzing the behavior of their target genes. If the genes targeted by a particular transcription factor are differentially expressed, then that transcription factor is likely to play a role. Target genes are identified by the presence of binding sites. [Figure: expression matrix of genes by experiment (B cell subsets: naive, GC, memory).]

  7. DNA Sequence Motifs for TF Binding Sites. Short, recurring patterns in DNA with presumed biological function (Nature Biotechnology 24, 423-425 (2006)). [Figure: a collection of binding sites (ROX1), its consensus sequence, and its frequency matrix.] For prediction of new sites, need to account for conservation.

  8. Measuring Conservation in the Binding Site. Information content measures conservation at each site i, based on entropy (Shannon information): $I_i = 2 + \sum_{b} f_{b,i} \log_2 f_{b,i}$ bits, where $f_{b,i}$ is the frequency of base b at position i. For example, the aligned sites ATG, ATC, AAT, AAA have information content 2, 1, and 0 bits at positions 1, 2, and 3. Total information content is related to the probability of finding the motif in ‘random’ DNA sequence. Can be corrected for background frequencies (biased GC).
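
The per-position calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming information content is defined as 2 bits minus the Shannon entropy of each column, with a uniform background (no GC correction); the function name `information_content` is hypothetical. It uses the four example sites from the slide.

```python
import math

def information_content(sites, alphabet="ACGT"):
    """Per-position information content (bits) of aligned, equal-length sites.

    Assumption: uniform background, so the maximum is log2(4) = 2 bits.
    """
    n_sites = len(sites)
    max_bits = math.log2(len(alphabet))
    content = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        entropy = 0.0
        for base in alphabet:
            f = column.count(base) / n_sites  # frequency of base b at position i
            if f > 0:                          # convention: 0 * log(0) = 0
                entropy -= f * math.log2(f)
        content.append(max_bits - entropy)
    return content

sites = ["ATG", "ATC", "AAT", "AAA"]  # example sites from the slide
bits = information_content(sites)
print(bits)  # [2.0, 1.0, 0.0]
```

Position 1 is fully conserved (2 bits), position 2 splits evenly between two bases (1 bit), and position 3 is uniform over all four bases (0 bits).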

  9. Sequence Logos. Visual expression of frequency and information content. Total information content is related to the probability of finding the motif in ‘random’ DNA sequence. http://weblogo.berkeley.edu/

  10. The TRANSFAC Database. Eukaryotic transcription factors and their genomic binding sites. TRANSFAC has a public version (older) and a commercial version (more features); other free alternatives exist. The current version contains 834 matrices (601 vertebrate).

  11. The TRANSFAC Database: MATCH score. Eukaryotic transcription factors and their genomic binding sites. The MATCH score weights each position by an information vector (higher for conserved positions) and uses the frequency of nucleotide b_i occurring at position i of the matrix (B ∈ {A, T, G, C}). [Figure: example sequence C C C T G A C G T C A A C G.] Assumes positions are independent.

  12. Identifying putative TF binding sites. Search by scanning the promoter region (MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2: e36). The score threshold can be determined by looking at “random” DNA.
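
Scanning a promoter with a frequency matrix can be sketched as follows. This is an illustration only: the scoring is modeled loosely on an information-weighted, min/max-normalized matrix similarity in the style of MATCH, the binding sites are hypothetical toy data (not TRANSFAC entries), and the threshold of 0.9 is arbitrary. Sequences are assumed to contain only uppercase A, C, G, T.

```python
import math

BASES = "ACGT"

def frequency_matrix(sites):
    """Column-wise base frequencies from aligned, equal-length sites."""
    return [{b: sum(s[i] == b for s in sites) / len(sites) for b in BASES}
            for i in range(len(sites[0]))]

def information_vector(freqs):
    """I(i) = sum_b f * ln(4 f); larger for more conserved positions."""
    return [sum(f * math.log(4 * f) for f in col.values() if f > 0)
            for col in freqs]

def similarity(window, freqs, info):
    """Information-weighted score, rescaled to [0, 1] between worst and best site.

    Assumes at least one position is conserved, so that best > worst.
    """
    current = sum(info[i] * freqs[i][b] for i, b in enumerate(window))
    worst = sum(info[i] * min(freqs[i].values()) for i in range(len(freqs)))
    best = sum(info[i] * max(freqs[i].values()) for i in range(len(freqs)))
    return (current - worst) / (best - worst)

def scan(sequence, freqs, threshold):
    """Slide a window along the sequence and report windows scoring >= threshold."""
    info = information_vector(freqs)
    width = len(freqs)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = similarity(window, freqs, info)
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical toy binding sites, for illustration only.
toy_sites = ["TGACGTCA", "TGACGTAA", "TGACGCAA", "TGATGTCA"]
matrix = frequency_matrix(toy_sites)
hits = scan("CCCTGACGTCAACG", matrix, threshold=0.9)
for start, window, score in hits:
    print(start, window, round(score, 3))
```

In practice the threshold would be calibrated on “random” DNA, as the slide notes, rather than fixed by hand. Like the slide's scoring scheme, this sketch treats positions as independent.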

  13. Identifying TF Target Genes. Look 2 kb upstream and downstream of the transcription start site (TSS): (1) extract genomic sequence (±2 kb around the TSS); (2) identify conserved regions (human/mouse/rat/dog); (3) scan conserved regions for potential binding sites. The result is a matrix linking transcription factors and potential target genes, giving ‘gene sets’ of target genes for each transcription factor.

  14. Gene Sets of Transcription Factor Targets. Molecular Signatures Database at Broad Institute (http://www.broad.mit.edu/gsea/msigdb). Example gene set V$NRSF_01 (Neuron Restrictive Silencing Factor): genes with promoter regions [-2kb, 2kb] around the transcription start site containing the motif TTCAGCACCACGGACAGMGCC, which matches the annotation for REST (RE1-silencing transcription factor). Members: ATP6V0A1 RPIP8 POU4F3 FLJ42486 L1CAM SLC17A6 TRIM9 MAPK11 DDX25 SNAP25 DRD3 FGF12 COL5A3 SYT4 BDNF POMC GABRB3 TMEM22 GRM1 HES1 MGAT5B TCF1 PCSK2 FLJ44674 VIP FLJ38377 ZNF335 GABRG2 LHX3 DNER CHKA NEFH ZNF579 CHAT SCAMP5 CDKN2B SST OGDHL KCNH4 SEZ6 GLRA1 HTR1A RPH3A PRG3 NPPB FGD2 RNF13 SYT6 CHGA SLC12A5 ELAVL3 KCNH8 GDAP1L1 HCN1 DRD2 HCN3 PAQR4 CALB1 BARHL1 SCN3B CRYBA2 TNRC4 VGF RASGRF1 NEF3 OMG KCNIP2 CDK5R1 ATP2B2 HTR5A PHYHIPL SARM1 GHSR INA PTPRN DBC1 CSPG3 CHRNB2 GRIN1 STMN2 POU4F2 APBB1 GLRA3. Gene sets can also be defined manually.

  15. Are ATF3 targets over-represented in Cluster 6? Temporal activation of macrophages by the TLR4 agonist bacterial lipopolysaccharide (LPS). Which transcription factors are driving the dynamics of each cluster?

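  16. Over-Representation Analysis. If you draw n marbles at random, what is the probability that k of them are red? Significance is given by the hypergeometric distribution. With N total marbles, of which K are red and N−K are black, the probability of drawing k red and n−k black marbles in a pick of n is $P(X = k) = \binom{K}{k}\binom{N-K}{n-k} / \binom{N}{n}$.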

  17. Over-Representation Analysis. Is the set of TF targets over-represented among differentially expressed genes? Significance by hypergeometric distribution, with all genes (N), genes with a TFBS (K), genes without a TFBS (N−K), and differentially expressed genes (n), of which k have the TFBS and n−k do not. Must choose a threshold to define “differential expression”.

  18. Over-Representation Analysis. Assume 17 genes in the cluster, 5 with a binding site. Significance by hypergeometric distribution: all genes N = 1000, genes with TFBS K = 100 (without TFBS: 1000−100), genes in cluster n = 17, of which k = 5 have the TFBS (17−5 do not). Must choose a threshold to define “differential expression”.
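
The worked example above (N = 1000, K = 100, n = 17, k = 5) can be evaluated directly. The sketch below assumes the usual one-sided over-representation test, i.e. the P value is the probability of seeing k or more genes with the binding site; the function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k of n drawn genes carry the TFBS, when K of all N genes do."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def over_representation_p(k, N, K, n):
    """One-sided P value: probability of k or more TFBS genes in the draw."""
    return sum(hypergeom_pmf(j, N, K, n) for j in range(k, min(K, n) + 1))

N, K, n, k = 1000, 100, 17, 5  # values from the slide
p_value = over_representation_p(k, N, K, n)
print(f"expected TFBS genes in cluster: {n * K / N:.1f}")
print(f"P(X >= {k}) = {p_value:.4f}")
```

With 10% of all genes carrying the site, only about 1.7 of the 17 cluster genes would be expected to carry it by chance, so observing 5 gives a small P value.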

  19. Gene Set Enrichment Analysis (GSEA). Are TF targets enriched among the most differentially expressed genes? Genes are ranked by a differential expression measure (signal-to-noise), and a running sum (KS-like statistic) is computed down the ranked list (Subramanian et al., PNAS, 2005). Does not require a threshold for differential expression.

  20. Gene Set Enrichment Analysis (GSEA). What is the distribution of the enrichment score (ES) under the null hypothesis? Generate random permutations of the data, calculate ES for each, and build the distribution of ES values for “random” data. The P value is the fraction of “random” data with a higher ES. Permute class labels or genes to estimate the null distribution.
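
The running sum and permutation test can be sketched as below. This is a simplified, unweighted (KS-like) version, not the weighted statistic of the published GSEA software: each hit adds 1/(number of hits) and each miss subtracts 1/(number of misses). The null distribution here comes from permuting gene membership; permuting class labels would require the raw expression data. Gene names are hypothetical, and the gene set is assumed to be neither empty nor the whole list.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of an unweighted running sum."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Fraction of random gene sets (same size) with ES >= the observed ES."""
    rng = random.Random(seed)
    genes = list(ranked_genes)
    size = sum(g in gene_set for g in genes)
    observed = enrichment_score(genes, gene_set)
    count = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(genes, size))
        if enrichment_score(genes, random_set) >= observed:
            count += 1
    return observed, count / n_perm

# Hypothetical ranked list: g0 is the most differentially expressed gene.
ranked = [f"g{i}" for i in range(100)]
targets = {"g0", "g1", "g2", "g3", "g5"}
es, p_val = permutation_p_value(ranked, targets)
print(f"ES = {es:.3f}, permutation P = {p_val:.3f}")
```

Using `>=` rather than `>` when counting permutations is a slightly conservative choice.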

  21. GSEA Example: SHM Targeting. Are particular motifs over-represented among mutated genes? E2A binding sites are enriched among AID-targeted genes.

  22. Other Applications of Gene Set Enrichment Analysis. Molecular Signatures Database at Broad Institute. Gene sets can also be defined manually.

  23. Other Applications of Gene Set Enrichment Analysis. Molecular Signatures Database at Broad Institute. Gene sets from the pathway databases:
      • BioCarta: http://www.biocarta.com
      • Signaling pathway database: http://www.grt.kyushu-u.ac.jp/spad/menu.html
      • Signaling gateway: http://www.signaling-gateway.org/
      • Signal transduction knowledge environment: http://stke.sciencemag.org/
      • Human protein reference database: http://www.hprd.org/
      • GenMAPP: http://www.genmapp.org/
      • KEGG: http://www.genome.jp/kegg/
      • Gene ontology: http://www.geneontology.org
      • Sigma-Aldrich pathways: http://www.sigmaaldrich.com
      • Gene arrays, BioScience Corp: http://www.superarray.com/
      • Human cancer genome anatomy consortium: http://cgap.nci.nih.gov/
      • NetAffx: http://www.affymetrix.com/index.affx
      Gene sets can also be defined manually.

  24. Multiple Testing

  25. P value cutoff (α) controls type I error. Type I error (false positive): the error of rejecting a null hypothesis when it is actually true. If the probability of rejecting a single hypothesis by mistake is no more than α = 5%, then out of 100 tests, 5 are expected to be significant even if there are no differences. P values are not adequate when the number of tests is large.

  26. Family-wise error rate (FWER). Pr[FP ≥ 1]: the probability of rejecting at least one hypothesis by mistake is no more than α. Bonferroni correction: require P < α/m, where m is the number of tests performed. So if α = 0.05 and m = 1000 tests, then we require P < 0.00005. Too conservative if many significant features are expected (e.g., microarray).

  27. False discovery rate (FDR). Expected proportion of false positive results among rejected hypotheses. So if FDR = 0.05 and m = 1000 tests, then we expect 5% of significant results to be false positives: if there are 100 significant results, expect 5 to be false positives. The q value for a particular feature is the expected proportion of false positives incurred when calling that feature significant.

  28. Comparison of Methods. Threshold P values when 50 tests are performed with α = 0.05. The FDR procedure is conservative: it controls FDR no matter how many of the m tests are true null cases (m0).
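
The thresholds being compared can be tabulated with a short sketch. It assumes the three procedures named elsewhere in the deck (Bonferroni, Holm's sequential Bonferroni, and Benjamini and Hochberg), applied to P values sorted in increasing order; rank i refers to the i-th smallest P value. The function names are hypothetical.

```python
def bonferroni_thresholds(m, alpha=0.05):
    """Same cutoff alpha/m for every test."""
    return [alpha / m for _ in range(m)]

def holm_thresholds(m, alpha=0.05):
    """Step-down cutoffs alpha/(m - i + 1) for the i-th smallest P value."""
    return [alpha / (m - i + 1) for i in range(1, m + 1)]

def bh_thresholds(m, alpha=0.05):
    """Step-up cutoffs i*alpha/m for the i-th smallest P value."""
    return [i * alpha / m for i in range(1, m + 1)]

m, alpha = 50, 0.05  # values from the slide
bonf = bonferroni_thresholds(m, alpha)
holm = holm_thresholds(m, alpha)
bh = bh_thresholds(m, alpha)

print("rank  Bonferroni  Holm      B&H")
for rank in (1, 2, 10, 25, 50):
    i = rank - 1
    print(f"{rank:>4}  {bonf[i]:.5f}     {holm[i]:.5f}   {bh[i]:.5f}")
```

All three agree for the smallest P value (0.05/50 = 0.001), but the Benjamini and Hochberg cutoff grows linearly with rank, which is why it rejects more hypotheses when many features are truly different. Holm stops at the first P value that fails its cutoff (step-down), whereas Benjamini and Hochberg rejects everything up to the largest rank that passes (step-up).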

  29. Benjamini & Hochberg FDR is Conservative. Controls FDR no matter how many of the m tests are true null cases (m0). Actually controls FDR at (m0/m)·α. So if α = 0.05 and the null hypothesis is always true (m0/m = 1.0), then we control at 0.05; but if the null hypothesis is really false in 20% of tests (m0/m = 0.8), we control at 0.04. Could improve by estimating the proportion of true null cases (m0/m).

  30. Estimating the False Discovery Rate (FDR). Estimating the proportion of true null cases (Storey and Tibshirani, PNAS 2003). P values have a uniform distribution under the null hypothesis, so the fraction of ‘null’ P values can be estimated from the flat part of the density histogram. [Figure: density of P values versus P value.]
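
The idea of reading the null fraction off the flat part of the histogram can be sketched as follows. This uses a single fixed cutoff λ = 0.5 as a simplifying assumption; the published method smooths the estimate over a range of λ values. Function names and the toy P values are hypothetical.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate m0/m from the density of P values above lam.

    Null P values are uniform, so about (1 - lam) * m0 of them exceed lam,
    while few truly significant features have P values that large.
    """
    m = len(p_values)
    n_above = sum(p > lam for p in p_values)
    return min(1.0, n_above / (m * (1.0 - lam)))

def q_values(p_values, lam=0.5):
    """q value per feature: pi0 * m * p / rank, made monotone in p."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest P value
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return q

# Toy data: 800 evenly spread "null" P values plus 200 very small ones.
null_p = [(i + 0.5) / 800 for i in range(800)]
alt_p = [0.0001] * 200
p_all = null_p + alt_p
pi0 = estimate_pi0(p_all)
q_all = q_values(p_all)
print(f"estimated m0/m = {pi0:.2f}")
```

Multiplying by the estimated m0/m is what removes the conservativeness of the Benjamini and Hochberg procedure when a sizeable fraction of features is truly different.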

  31. Multiple Testing Correction. P values are not adequate when the number of tests is large.
      • Family-wise error rate (FWER) = Pr[FP ≥ 1]: the probability of rejecting at least one hypothesis by mistake is no more than α
        • Bonferroni
        • Sequential Bonferroni (Holm’s step-down)
      • False discovery rate (FDR) = E[FP/(FP+TP)] = E[False Positives / Significant]: expected proportion of false positive results among rejected hypotheses
        • Benjamini and Hochberg
        • Storey and Tibshirani
      Control of FWER is only suitable if the penalty of making even one type I error is severe.

  32. BrdU Labeling Models

  33. BrdU (Bromodeoxyuridine). Thymidine analog incorporated into the DNA of dividing cells during S phase (image source: science.csustan.edu/confocal/Images/SCE/index.SCE.htm). Flow cytometry is used to quantify labeled B cells. How to estimate the proliferation rate?

  34. BrdU labeling of CD4+ and CD8+ T lymphocytes in an SIV-infected and an uninfected macaque. Data are from Mohri et al., Science (1998). Is there a difference in cell turnover?

  35. Model of BrdU Labeling. Start with a basic model of cell population dynamics: the rate of change in the B cell population (B). Often we can assume the population is in steady state (i.e., constant).
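
A minimal sketch of such a population model, assuming a constant inflow s, a per-capita proliferation rate p, and a per-capita death rate d (the parameter names used in the simulated experiment later in the deck). The exact form is an assumption, not recovered from the slide:

```latex
\frac{dB}{dt} = s + pB - dB
```

With B expressed as a fraction of its steady-state size (B = 1), setting dB/dt = 0 gives d = p + s, which matches the steady-state condition stated for the synthetic data.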

  36. Model of BrdU Labeling. Many experiments stop administering label after some time. We can express these as sets of ordinary differential equations.

  37. Model of BrdU Labeling. Split the B cell population into labeled (BL) and unlabeled (BU) subsets. Solve or simulate these equations over time.

  38. Model of BrdU Labeling. Many experiments stop administering label after some time. The labeling curve (BL, BU) reflects both proliferation AND death.

  39. Model of BrdU De-Labeling. Stop administering label after some time (te). From the labeled (BL) and unlabeled (BU) curves we can estimate proliferation AND death.
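
A labeling and de-labeling experiment of this kind can be simulated with a simple forward-Euler loop. The equations below are one common formulation and are assumptions, not the slide's own: while BrdU is present, newly arriving cells and both daughters of any dividing cell are labeled; after withdrawal at te, new cells and daughters of unlabeled cells are unlabeled, and label dilution is ignored. The function name `simulate_brdu` and the 168-hour labeling period are hypothetical; the rates s and p are those given for the synthetic data later in the deck.

```python
def simulate_brdu(s, p, d, t_end_label, t_total, dt=0.1):
    """Return (times, labeled fractions) for label-on followed by label-off.

    Units: hours. The population starts unlabeled with total size 1.
    """
    bu, bl = 1.0, 0.0
    times, labeled_fraction = [0.0], [0.0]
    for step in range(int(round(t_total / dt))):
        t = step * dt
        if t < t_end_label:
            # Labeling: unlabeled cells are lost to division and death;
            # inflow and all daughter cells enter the labeled pool.
            d_bu = -(p + d) * bu
            d_bl = s + 2 * p * bu + (p - d) * bl
        else:
            # De-labeling: inflow is unlabeled; each pool renews itself.
            d_bu = s + (p - d) * bu
            d_bl = (p - d) * bl
        bu += dt * d_bu
        bl += dt * d_bl
        times.append((step + 1) * dt)
        labeled_fraction.append(bl / (bu + bl))
    return times, labeled_fraction

s, p = 0.003, 0.01   # per hour
d = p + s            # steady-state condition
times, frac = simulate_brdu(s, p, d, t_end_label=168.0, t_total=400.0)
peak = max(frac)
print(f"peak labeled fraction = {peak:.3f}, final = {frac[-1]:.3f}")
```

Under these assumptions the up-slope is governed by p + d and the down-slope by d − p, which is why the two phases together allow both rates to be estimated.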

  40. Interaction of Computation & Experiment. Compare simulation and experiment using a least-squares objective. A continuous cycle of modeling and experimentation: experimental observations, computational model, fit model to data (least-squares objective function), bootstrapping confidence intervals, model predictions, new experiments. [Diagram: unlabeled (BU) and labeled (BL) compartments with source s, proliferation p, and death d.]

  41. Simulated Experiment. Demonstrate the full cycle of fitting a model to data to estimate parameters. Parameters used to create synthetic data: s = 0.003 per hour, p = 0.01 per hour, d = p + s (to achieve steady state). Random noise is added to each data point. [Figure: synthetic labeling curve, with the time of BrdU withdrawal marked.] How can we estimate the underlying rates?

  42. Fitting the Model to Experimental Data. Compare simulation and experiment using a least-squares objective function, which sums the squared differences between observed and predicted values. Find the parameters that minimize this error. Many options for how to optimize the fit.
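
The fitting step can be sketched end to end on synthetic data. Everything model-specific here is an assumption: the closed-form labeled fraction rises as 1 − exp(−(2p + s)t) during labeling and decays at rate s after withdrawal (which follows from one common labeling model with d = p + s), the 168-hour labeling period and sampling times are hypothetical, and the optimizer is a crude grid search with refinement standing in for a proper routine such as the lsqnonlin fit mentioned in the deck.

```python
import math
import random

T_END_LABEL = 168.0  # hours of labeling before withdrawal (hypothetical)

def labeled_fraction(t, s, p, t_end=T_END_LABEL):
    """Assumed closed-form labeling curve with d = p + s."""
    up_rate = 2 * p + s  # equals p + d
    if t <= t_end:
        return 1.0 - math.exp(-up_rate * t)
    peak = 1.0 - math.exp(-up_rate * t_end)
    return peak * math.exp(-s * (t - t_end))

def sum_squares(params, times, observed):
    """Least-squares objective: sum of squared (observed - predicted)."""
    s, p = params
    return sum((y - labeled_fraction(t, s, p)) ** 2
               for t, y in zip(times, observed))

def fit(times, observed, s_range=(0.0, 0.02), p_range=(0.0, 0.05),
        rounds=6, grid=20):
    """Grid search, repeatedly zooming in around the best point found."""
    (lo_s, hi_s), (lo_p, hi_p) = s_range, p_range
    best_s, best_p, best_err = None, None, float("inf")
    for _ in range(rounds):
        for i in range(grid + 1):
            for j in range(grid + 1):
                s = lo_s + (hi_s - lo_s) * i / grid
                p = lo_p + (hi_p - lo_p) * j / grid
                err = sum_squares((s, p), times, observed)
                if err < best_err:
                    best_s, best_p, best_err = s, p, err
        half_s, half_p = (hi_s - lo_s) / 10, (hi_p - lo_p) / 10
        lo_s, hi_s = max(0.0, best_s - half_s), best_s + half_s
        lo_p, hi_p = max(0.0, best_p - half_p), best_p + half_p
    return best_s, best_p, best_err

# Synthetic data with the deck's rates plus Gaussian noise (sd is hypothetical).
rng = random.Random(42)
sample_times = [24.0 * i for i in range(15)]
noisy = [labeled_fraction(t, 0.003, 0.01) + rng.gauss(0.0, 0.02)
         for t in sample_times]
s_hat, p_hat, rss = fit(sample_times, noisy)
print(f"s = {s_hat:.4f} per hour, p = {p_hat:.4f} per hour, RSS = {rss:.4f}")
```

A grid search is slow but makes the local-versus-global issue visible: the first coarse pass covers the whole parameter range, and later passes only refine around the best region.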

  43. Local and Global Optimization. Local optimization techniques find the optimal fit around a given starting point. Global optimization attempts to avoid local minima. [Figure: error in fit versus parameter value, marking a local and the global minimum.]

  44. Optimal Parameter Estimates. Least-squares fit using lsqnonlin in MATLAB. Plot local curvature to check minimization. Parameter estimates: s = 0.0017 per hour, p = 0.0099 per hour. Is inflow necessary to fit the data? Can we use a simpler model?

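  45. Is inflow (s) significant? Compare two nested models, (1) and (2), with and without the inflow s (both include proliferation p and death d). The test statistic is the ratio of the reduction in RSS per extra parameter to a measure of ‘noise’ in the model. Inflow is important to explain the observations.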
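
The ratio described above has the form of the standard F statistic for comparing nested least-squares models. As a sketch, assuming that is the test intended, with n data points, a reduced model with k_r parameters, and a full model with k_f > k_r parameters (all symbols here are assumptions):

```latex
F = \frac{\left(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}}\right) / \left(k_f - k_r\right)}
         {\mathrm{RSS}_{\text{full}} / \left(n - k_f\right)}
```

The numerator is the reduction in RSS per extra parameter and the denominator estimates the noise variance from the full model. Under the null hypothesis that the extra parameter is unnecessary, F follows an F distribution with (k_f − k_r, n − k_f) degrees of freedom.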

  46. Bootstrapping Parameter Confidence Intervals
      • Fit model to data to obtain parameter estimates
      • Draw a bootstrap sample of the residuals r_i
      • Create a bootstrap sample of observations by adding a randomly sampled residual to the predicted value of each observation
      • Estimate parameters for the bootstrap sample
      Repeat 1000x. Bootstrapping observations is also possible; the two are asymptotically equivalent.

  47. Bootstrapping Parameter Confidence Intervals: Percentile Method. The interval contains 95% of the estimates. Calculate the parameter for each bootstrap sample and select α (e.g., 0.05): LCL = 100·(α/2)th percentile, UCL = 100·(1−α/2)th percentile. Parameter estimates for synthetic data: s = 0.0017 [0.0009, 0.0030], p = 0.0099 [0.0095, 0.0100]. May not have correct coverage when the sampling distribution is skewed.
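
The residual bootstrap and percentile interval can be sketched on a deliberately small problem. To keep the example short it fits a hypothetical one-parameter curve y = 1 − exp(−kt) rather than the full labeling model, uses a golden-section search (which assumes the error has a single minimum in the search interval), and generates its own synthetic data; all names and numbers are illustrative.

```python
import math
import random

def model(t, k):
    return 1.0 - math.exp(-k * t)

def rss(k, times, obs):
    return sum((y - model(t, k)) ** 2 for t, y in zip(times, obs))

def fit_rate(times, obs, lo=0.0, hi=0.2, iters=40):
    """Golden-section search for the k that minimizes the RSS."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if rss(c, times, obs) < rss(d, times, obs):
            b = d
        else:
            a = c
    return (a + b) / 2

def bootstrap_ci(times, obs, n_boot=1000, alpha=0.05, seed=1):
    """Percentile confidence interval from bootstrapped residuals."""
    rng = random.Random(seed)
    k_hat = fit_rate(times, obs)                        # 1. fit model to data
    predicted = [model(t, k_hat) for t in times]
    residuals = [y - yhat for y, yhat in zip(obs, predicted)]
    estimates = []
    for _ in range(n_boot):
        # 2-3. resample residuals and add them to the predicted values
        sample = [yhat + rng.choice(residuals) for yhat in predicted]
        estimates.append(fit_rate(times, sample))       # 4. re-estimate
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return k_hat, lower, upper

# Hypothetical synthetic data: true k = 0.02 per hour, Gaussian noise.
rng = random.Random(7)
times = [12.0 * i for i in range(1, 16)]
observed = [model(t, 0.02) + rng.gauss(0.0, 0.03) for t in times]
k_hat, lcl, ucl = bootstrap_ci(times, observed)
print(f"k = {k_hat:.4f} [{lcl:.4f}, {ucl:.4f}]")
```

The interval is read directly from the sorted bootstrap estimates, so no distributional assumption is needed, though coverage can be off when the bootstrap distribution is skewed.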

  48. Viral Dynamics

  49. Hepatitis C Viral Dynamics and Interferon-α Therapy. Modeling 23 patients during 14 days of therapy (daily doses). A short delay is followed by a biphasic decline in viral load.

  50. Model of Hepatitis C Viral Dynamics. Includes virus (V) along with target (T) and infected (I) cells. Before therapy, virus load is approximately constant.
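
A minimal sketch of a model with these three compartments, assuming the standard target-cell-limited form commonly used for hepatitis C under interferon therapy (e.g., Neumann et al., Science, 1998). All symbols are assumptions: s and d are the production and death rates of target cells (unrelated to the BrdU parameters of the same names), β is the infection rate, δ the death rate of infected cells, p the virion production rate, c the virion clearance rate, and η and ε the drug efficacies in blocking infection and virion production.

```latex
\begin{aligned}
\frac{dT}{dt} &= s - dT - (1-\eta)\,\beta V T \\
\frac{dI}{dt} &= (1-\eta)\,\beta V T - \delta I \\
\frac{dV}{dt} &= (1-\varepsilon)\,p I - c V
\end{aligned}
```

Before therapy (η = ε = 0) the system sits at steady state, consistent with an approximately constant viral load. In a model of this form, a large ε produces a fast first-phase decline set mainly by c, followed by a slower second phase set mainly by δ.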
