Rediscovering the Transcriptome - an array expedition

Joern Toedling Berlin Meeting – Feb 2006

Transcription Regulation of transcription is not yet completely understood

Overview • Tiling Microarrays: fresh, unbiased view on transcription (yeast transcriptome) • Patterns of Transcription (gene clusters) • Heart Transcription Factors (outlook)

Tiling Microarrays: Design • Whole genome represented by probes on array (conventional microarrays: only CDS) [Figure label: Probes]

Tiling Arrays: Possibilities Unbiased view on transcription, since there is no focus on or restriction to known genes • refine annotated transcripts • confirm predicted genes • discover novel transcripts

Tiling Arrays: Challenges • Probe Specificity (repetitive regions) • Probes with largely varying physical characteristics due to varying base composition • Segmentation: Defining transcript borders

Segmentation Reliable discovery of segment (transcript) borders: no standard way

Segmentation II Two obvious options: 1.) Smoothing (e.g. compute mean within running window) and thresholding: simple, but estimates of change points will be biased and depend on expression level [Figure label: discovered “transcript”]
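To illustrate why option 1 gives biased change points, here is a minimal Python sketch of running-mean smoothing followed by thresholding. The function names `running_mean` and `call_segments`, the window size and the thresholds are illustrative assumptions, not part of the talk or of any package. The toy signal has a "transcript" at probes 4 to 7; the called borders move outward or inward depending on how the threshold relates to the expression level.

```python
def running_mean(values, window):
    """Centered running mean; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def call_segments(values, window, threshold):
    """Return (start, end) index pairs (end exclusive) where the
    smoothed signal lies above the threshold."""
    above = [v > threshold for v in running_mean(values, window)]
    segments, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                      # segment opens here
        elif not flag and start is not None:
            segments.append((start, i))    # segment closes here
            start = None
    if start is not None:
        segments.append((start, len(values)))
    return segments


# Toy signal (assumed data): true transcript covers probes 4..7.
signal = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0]
low = call_segments(signal, window=3, threshold=1.0)   # borders too wide
high = call_segments(signal, window=3, threshold=3.0)  # borders too narrow
print(low, high)
```

With the low threshold the called segment extends one probe beyond the true borders on each side; with the high threshold it is one probe too short on each side.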

Segmentation III 2.) Hidden Markov Models (HMM): complicated, since our “states” come from a continuum. Our solution: Fit a piecewise constant function (mean of expression levels in segment) [Figure label: change point]

Segmentation Algorithm Minimize the sum of squared deviations of the probe intensities from their segment means, where t1,…, tS: change points; S: number of segments; J: number of replicate arrays. We implemented a dynamic programming algorithm that finds the segment change-points in linear time. We also extended it to compute confidence intervals for the change points.
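The idea of fitting a piecewise constant function by dynamic programming can be sketched as follows. This is a minimal illustration under stated assumptions: the objective is taken to be the residual sum of squares around each segment mean, pooled over the J replicates, and the function name `segment_least_squares` is hypothetical. The sketch runs in time quadratic in the number of probes and does not compute confidence intervals, so it is not the implementation described on the slide.

```python
def segment_least_squares(y, n_segments):
    """y: list of probes along the chromosome, each a list of J replicate
    intensities. Returns (segment start indices, residual sum of squares)."""
    n = len(y)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and the number of probes")

    # Prefix sums so that the cost of any stretch of probes is O(1).
    s1, s2, cnt = [0.0], [0.0], [0]
    for row in y:
        s1.append(s1[-1] + sum(row))
        s2.append(s2[-1] + sum(v * v for v in row))
        cnt.append(cnt[-1] + len(row))

    def cost(i, j):
        """Residual sum of squares of probes i..j-1 around their common mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (cnt[j] - cnt[i])

    inf = float("inf")
    # best[s][j]: minimal cost of splitting probes 0..j-1 into s segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for s in range(1, n_segments + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):      # last segment covers probes i..j-1
                c = best[s - 1][i] + cost(i, j)
                if c < best[s][j]:
                    best[s][j] = c
                    back[s][j] = i

    # Trace back the start index of each segment.
    starts, j = [], n
    for s in range(n_segments, 0, -1):
        j = back[s][j]
        starts.append(j)
    return starts[::-1], best[n_segments][n]


# Toy data (assumed): 8 probes, J = 2 replicates, three expression levels.
example = [[1.0, 1.2], [0.9, 1.1], [1.0, 1.0],
           [5.0, 5.2], [4.9, 5.1], [5.0, 5.0],
           [2.0, 2.1], [1.9, 2.0]]
change_points, rss = segment_least_squares(example, 3)
print(change_points)
```

On the toy data the segments start at probes 0, 3 and 6, which are the positions where the expression level changes.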

Samples S. cerevisiae S96 strain, 3 samples each from: • Poly-adenylated RNA • Total RNA • DNA (used to estimate unspecific binding to each probe -> normalization of expression levels)

Tiling Array: Results

Functional non-coding RNA • Conservation (PhastCons, UCSC Genome Browser) generally lower than for known genes, but • A number of short novel RNAs are highly conserved across 6 other Saccharomyces species

Excursion: PhastCons • HMM for sequence conservation • based on multiple alignment • different base transition probabilities for • conserved nucleotides • non-conserved nucleotides • gives posterior probability for a nucleotide being conserved

PhastCons in UCSC Genome Browser Example: Mouse DPF3

Software R functions in Bioconductor package tilingArray, main features: • Segmentation including confidence levels • Plot probe expression level along the chromosome plus genomic features

Summary: Yeast Tiling Array Whole-genome tiling arrays • allow refinement of current genomic annotation, such as precise UTR mapping • indicate lots of novel transcripts beyond current annotation Novel transcripts • mostly non-coding • many anti-sense to known genes • partly conserved across other yeast species

Overview • Tiling Microarrays: fresh, unbiased view on transcriptions • Patterns of Transcription (Gene Clusters) • Heart Transcription Factors (outlook)

Patterns of Transcription • Genes co-expressed or silenced as a union across tissues are often clustered next to each other in certain genomic regions. • no large-scale analysis of co-expression for all pairs of genomic neighbors yet • Collaboration with group of S. Sperling, MPI for Molecular Genetics, Berlin

Data Sets • FANTOM3 transcription data from RIKEN Consortium, Japan: transcription of all known M. musculus genes in 13 tissues measured by polyA-RNA sequencing, cap-detection and other methods • Novartis gene expression atlas: microarray data for 79 tissues of H. sapiens

Data Sets II For each transcript: measured whether it is expressed in a specific tissue or not (1: yes; 0: no) -> Binary matrix with rows = transcripts and columns = tissues
                    cerebellum  heart  liver  lung  macrophage
ENSMUSG00000048040  1           0      1      1     0
ENSMUSG00000048355  0           0      1      1     1
ENSMUSG00000042750  1           1      1      1     0
ENSMUSG00000057000  0           0      1      1     1
ENSMUSG00000047844  0           0      1      0     0
ENSMUSG00000051579  1           0      1      1     1
ENSMUSG00000054034  1           0      1      0     0
ENSMUSG00000050071  1           0      1      0     0
ENSMUSG00000047291  0           0      0      0     1
ENSMUSG00000042712  1           1      1      1     1
ENSMUSG00000046432  1           0      1      1     0

Pair Coexpression Consider coexpression of each gene and its next adjacent gene
Example I:
        Tissue1  T2  T3  T4
Gene A  1        0   1   1
Gene B  0        1   1   1
Either A or B expressed in 4/4 tissues -> expression of pair = 1.0
Both A and B expressed in 2/4 tissues -> coexpression of pair = 0.5
Example II:
        Tissue1  T2  T3  T4
Gene C  1        0   0   1
Gene D  0        0   0   1
pair expression = 2/4 = 0.5
pair coexpression = 1/4 = 0.25
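The two pair measures can be written down in a few lines of Python. This is a minimal sketch; the function name `pair_measures` is an illustrative assumption. It takes the binary expression rows of two adjacent genes and returns the pair's expression (fraction of tissues in which at least one gene is expressed) and coexpression (fraction in which both are expressed).

```python
def pair_measures(gene_a, gene_b):
    """Return (expression, coexpression) of a gene pair, each as a
    fraction of all tissues. Inputs are equal-length lists of 0/1."""
    n_tissues = len(gene_a)
    either = sum(1 for a, b in zip(gene_a, gene_b) if a or b)
    both = sum(1 for a, b in zip(gene_a, gene_b) if a and b)
    return either / n_tissues, both / n_tissues


# Examples I and II from the slide.
example_1 = pair_measures([1, 0, 1, 1], [0, 1, 1, 1])
example_2 = pair_measures([1, 0, 0, 1], [0, 0, 0, 1])
print(example_1, example_2)
```

Applied to the slide's data, the sketch reproduces the values 1.0 / 0.5 for genes A and B and 0.5 / 0.25 for genes C and D.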

Permutation Permute Gene Order:
True Data:
        Tissue1  T2  T3  T4
Gene A  1        0   1   1
Gene B  0        1   1   1
Gene C  0        1   0   1
Pair 1 (A,B) expression = 4/4 = 1.0; coexpression = 2/4 = 0.5
Pair 2 (B,C) expression = 3/4 = 0.75; coexpression = 2/4 = 0.5
Permuted Data:
        Tissue1  T2  T3  T4
Gene A  1        0   1   1
Gene C  0        1   0   1
Gene B  0        1   1   1
Pair 1 (A,C) expression = 4/4 = 1.0; coexpression = 1/4 = 0.25
Pair 2 (C,B) expression = 3/4 = 0.75; coexpression = 2/4 = 0.5
Simulate distribution under Null Hypothesis “Genes would show tissue-specific expression but independent of their genomically neighboring genes.”
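A minimal Python sketch of the gene-order permutation: whole rows are shuffled, so every gene keeps its own tissue profile and only the neighborhood relation is randomized. The helper names (`neighbor_measures`, `permute_gene_order`), the seed and the number of permutations are illustrative assumptions.

```python
import random


def pair_measures(gene_a, gene_b):
    """(expression, coexpression) of a pair as fractions of all tissues."""
    n_tissues = len(gene_a)
    either = sum(1 for a, b in zip(gene_a, gene_b) if a or b)
    both = sum(1 for a, b in zip(gene_a, gene_b) if a and b)
    return either / n_tissues, both / n_tissues


def neighbor_measures(rows):
    """Pair measures for each gene and its next neighbor in the given order."""
    return [pair_measures(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]


def permute_gene_order(rows, rng):
    """Shuffle whole rows: per-gene tissue profiles stay intact."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


# Genes A, B, C from the slide.
true_rows = [[1, 0, 1, 1], [0, 1, 1, 1], [0, 1, 0, 1]]
observed = neighbor_measures(true_rows)

# The specific permutation shown on the slide: order A, C, B.
slide_permutation = neighbor_measures([true_rows[0], true_rows[2], true_rows[1]])

# A sample from the null distribution (seed chosen arbitrarily).
rng = random.Random(0)
null_sample = [neighbor_measures(permute_gene_order(true_rows, rng))
               for _ in range(100)]
print(observed, slide_permutation)
```

The observed and permuted values match the numbers on the slide; `null_sample` is the simulated distribution against which observed values would be compared.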

Different Null Hypothesis “With transcriptomes of a given size, a gene pair's expression and coexpression across tissues would be independent of the two genes being adjacent to each other in each tissue.” Permute every column of the binary matrix to simulate null distribution:
True Data:
        Tissue1  T2  T3  T4
Gene A  1        0   1   1
Gene B  0        1   1   1
Gene C  0        1   0   1
Pair 1 (A,B) expression = 4/4 = 1.0; coexpression = 2/4 = 0.5
Pair 2 (B,C) expression = 3/4 = 0.75; coexpression = 2/4 = 0.5
Permuted Data:
        Tissue1  T2  T3  T4
Gene A  0        0   1   1
Gene B  1        1   0   1
Gene C  0        1   1   1
Pair 1 (A,B) expression = 4/4 = 1.0; coexpression = 1/4 = 0.25
Pair 2 (B,C) expression = 4/4 = 1.0; coexpression = 2/4 = 0.5

Different Null Hypothesis II “With transcriptomes of a given size, a gene pair's expression and coexpression across tissues would be independent of the two genes being adjacent to each other in each tissue.” Permute every column of binary matrix to simulate null distribution. But: Ignores the tissue-specific expression of each individual gene! --> Unreasonable null hypothesis
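Why the column-wise permutation is an unreasonable null can be checked directly: it preserves the number of genes expressed per tissue (the transcriptome size), but not the number of tissues in which each gene is expressed. The following Python sketch uses the three-gene example from the previous slide; the function names are illustrative assumptions.

```python
import random


def permute_columns(rows, rng):
    """Shuffle every tissue column independently of the others."""
    n_genes, n_tissues = len(rows), len(rows[0])
    columns = [[row[t] for row in rows] for t in range(n_tissues)]
    for column in columns:
        rng.shuffle(column)
    return [[columns[t][g] for t in range(n_tissues)] for g in range(n_genes)]


def tissues_per_gene(rows):
    """Row sums: in how many tissues each gene is expressed."""
    return [sum(row) for row in rows]


def genes_per_tissue(rows):
    """Column sums: how many genes are expressed in each tissue."""
    return [sum(column) for column in zip(*rows)]


true_rows = [[1, 0, 1, 1], [0, 1, 1, 1], [0, 1, 0, 1]]       # genes A, B, C
slide_permuted = [[0, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 1]]  # from the slide

print(genes_per_tissue(true_rows), genes_per_tissue(slide_permuted))
print(tissues_per_gene(true_rows), tissues_per_gene(slide_permuted))

random_permuted = permute_columns(true_rows, random.Random(0))
```

In the slide's example the transcriptome sizes are unchanged, while gene A drops from 3 to 2 tissues and gene C rises from 2 to 3: the tissue specificity of the individual genes is not respected.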

Hamming Distance Combine expression and coexpression of pair into one score: HD = (expression – coexpression) / expression
        Tissue1  T2  T3  T4
Gene A  1        0   1   1
Gene B  0        1   1   1
Gene C  0        1   0   1
Pair 1 (A,B) expression = 4/4, coexpression = 2/4 -> HD = (4/4 – 2/4) / (4/4) = 0.5
Pair 2 (B,C) expression = 3/4, coexpression = 2/4 -> HD = (3/4 – 2/4) / (3/4) = 0.33

Problems with HD
Example I:
        Tissue1  T2  T3  T4  T5  T6  T7  T8
Gene A  1        0   1   1   1   1   1   1
Gene B  0        1   0   0   1   1   1   1
pair expression = 1.0, pair coexpression = 0.5 -> HD of pair = (1.0 – 0.5) / 1.0 = 0.5
Example II:
        Tissue1  T2  T3  T4  T5  T6  T7  T8
Gene C  1        0   0   1   0   0   0   0
Gene D  0        0   0   1   0   0   0   0
pair expression = 2/8 = 0.25; pair coexpression = 1/8 = 0.125 -> HD of pair = (0.25 – 0.125) / 0.25 = 0.5
Pairs (A,B) and (C,D) get the same score, but one pair is coexpressed in 4 tissues, while the other pair is coexpressed in only 1 tissue -> HD score is too coarse a measure for pair coexpression
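The HD score and its limitation can be reproduced with a short Python sketch. The function names `hd_score` and `coexpressed_tissues` are illustrative assumptions. Because numerator and denominator share the number of tissues, the score is computed from tissue counts directly.

```python
def hd_score(gene_a, gene_b):
    """HD = (expression - coexpression) / expression, computed from
    tissue counts. Returns None if neither gene is expressed anywhere."""
    either = sum(1 for a, b in zip(gene_a, gene_b) if a or b)
    both = sum(1 for a, b in zip(gene_a, gene_b) if a and b)
    if either == 0:
        return None  # score undefined for a completely silent pair
    return (either - both) / either


def coexpressed_tissues(gene_a, gene_b):
    """Number of tissues in which both genes are expressed."""
    return sum(1 for a, b in zip(gene_a, gene_b) if a and b)


gene_a = [1, 0, 1, 1, 1, 1, 1, 1]
gene_b = [0, 1, 0, 0, 1, 1, 1, 1]
gene_c = [1, 0, 0, 1, 0, 0, 0, 0]
gene_d = [0, 0, 0, 1, 0, 0, 0, 0]

print(hd_score(gene_a, gene_b), coexpressed_tissues(gene_a, gene_b))
print(hd_score(gene_c, gene_d), coexpressed_tissues(gene_c, gene_d))
```

Both pairs score 0.5 although they are coexpressed in 4 tissues and 1 tissue respectively, which motivates keeping expression and coexpression as two separate dimensions.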

2D-Measure on FANTOM3 total: 39592 pairs

FANTOM3 vs. Random • FOR each bin (= combination of expression and coexpression measure) and FOR each row-order permutation: • count whether this combination of expression/coexpression measure appears more often in that permutation than in non-permuted data • -> derive empirical p-value • p(measure) = # {measure appears more often} / # permutations
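The empirical p-value procedure above can be sketched in Python as follows. This is a minimal illustration under assumptions: a bin is taken to be an exact (expression, coexpression) value pair, only bins that occur in the non-permuted data are evaluated, and the function names, toy matrix, seed and number of permutations are all hypothetical.

```python
import random
from collections import Counter


def pair_measures(gene_a, gene_b):
    """(expression, coexpression) of a pair as fractions of all tissues."""
    n_tissues = len(gene_a)
    either = sum(1 for a, b in zip(gene_a, gene_b) if a or b)
    both = sum(1 for a, b in zip(gene_a, gene_b) if a and b)
    return either / n_tissues, both / n_tissues


def bin_counts(rows):
    """Number of neighbor pairs falling into each (expression, coexpression) bin."""
    return Counter(pair_measures(rows[i], rows[i + 1])
                   for i in range(len(rows) - 1))


def empirical_p_values(rows, n_permutations, seed=0):
    """For each observed bin: fraction of row-order permutations in which
    the bin holds more pairs than in the non-permuted data."""
    rng = random.Random(seed)
    observed = bin_counts(rows)
    exceed = Counter()
    for _ in range(n_permutations):
        shuffled = list(rows)
        rng.shuffle(shuffled)            # permute gene order, keep profiles
        permuted = bin_counts(shuffled)
        for b in observed:
            if permuted[b] > observed[b]:
                exceed[b] += 1
    return {b: exceed[b] / n_permutations for b in observed}


# Toy binary matrix (assumed data): 6 genes x 4 tissues, in genomic order.
toy_rows = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 0, 0],
            [0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
p_values = empirical_p_values(toy_rows, n_permutations=200)
for b, p in sorted(p_values.items()):
    print(b, p)
```

A small p-value for a bin with high coexpression means that the permuted gene orders rarely produce as many pairs in that bin as the real genomic order does.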

Comparing FANTOM Coexpression with Permuted Data Define: red = highly coexpressed gene pairs; blue = uncorrelated gene pairs. The data show significantly higher coexpression of genomic neighbors than expected by chance

Biological Reasons for strong co-expression of genomic neighbors? • Possibilities: • Coexpressed neighbors from gene duplication / pseudogenes: • --> expected to share common domains • Neighbors coexpressed because of common role in biological processes • --> expected to share common Gene Ontology (GO) annotation • Neighbors are coexpressed due to action of transcription factor (TF) • --> expected to share common TF binding sites • Domain and GO annotation obtained by querying the Ensembl database. • Define: Neighbors have 'similar' annotation if they share at least 50% of the annotation of the partner with more annotation. • Finding: Highly coexpressed neighbor pairs do not share more GO annotation, domains, or TF binding sites than genomic neighbors in general do.

Further Biological Reasons [Figure labels: H.C.P. (highly coexpressed pairs); All genes] …but highly coexpressed clusters (one or more pairs in a row) span shorter genomic regions than weakly coexpressed clusters

Cluster Decay Clusters dissolve across tissues in 2 ways: a.) directed decay: from one of the ends b.) undirected decay: from the middle of the clusters, but both ends

Cluster Decay II Count Cases of a.) directed decay: from one of the ends or b.) undirected decay: from the middle of the clusters, but both ends for I. Clusters of highly coexpressed genes II. Clusters of uncorrelated genes Counts on FANTOM data for all clusters of size 3 and 4: Fisher Test on independence of Cluster Decay and Cluster Type: odds-ratio = 4.93 (3.63-6.78) p < 2*10^(-16)

Coexpression Summary • More highly coexpressed pairs than expected by chance • H.C.P. do not share more transcription factor binding sites or domain or GO annotation than genomic neighbors in general • Hypothesis: cotranscription due to higher aspects of transcription regulation, such as chromatin unwinding

Overview • Tiling Microarrays: fresh, unbiased view on transcriptions • Patterns of Transcription (Gene Clusters) • Heart Transcription Factors (outlook)

IP Heart Repair • Hope: Replace dead myocardial cells by new ones derived from stem cells or recruit other cells. • But: Genetic basis of cardiomyocyte formation poorly understood • From 2006: Integrated EU Project to investigate. • Together with lab of S. Sperling, MPI for Molecular Genetics, Berlin: • Deduce role of certain transcription factors in cardiomyocyte development

High-throughput Approach • Gene expression arrays allow us to • Identify target genes whose expression changes after TF mutation/knock-down • Observe genes differentially expressed during developmental stages • Chromatin immuno-precipitation (ChIP) arrays (Nimblegen) allow us to • Discover TF binding sites, derive novel motifs and TF target genes [Figure labels: DNA; ChIP probes; Expression probes]

The Questions how to detect DNA-protein binding sites, histone modification (ChIP-chip) and differential transcriptional regulation (siRNA+array) with optimal and controlled (comparable) False Positive and False Negative rates?  what are the cis-regulatory motifs and motif combinations?  how to optimally combine - binding events (TF-DNA, POLII-DNA) - differential expression () to identify the direct regulatory interactions  how to model the logic of the regulatory network (combinatorial regulation; indirect regulation)

Graphs and networks Graph := set of nodes and set of edges. Nodes: objects of interest (TFs, target genes) Edges: relationships between them (e.g. binding; differential expression; physical interaction) Nodes can have types and attributes, edges can have types, weights, direction.
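The graph definition above maps naturally onto a small data structure. The following Python sketch is one possible representation, not the project's actual code; the class names, edge types and the node names `TF_A`, `gene_X` and `gene_Y` are hypothetical. It shows typed nodes, typed and weighted directed edges, and a query that combines two kinds of evidence.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    kind: str            # e.g. "TF" or "target"


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str            # e.g. "binding" or "differential_expression"
    weight: float = 1.0
    directed: bool = True


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def add_edge(self, source, target, kind, weight=1.0, directed=True):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append(Edge(source, target, kind, weight, directed))

    def targets_of(self, source, kind):
        """Sorted names of nodes reached from `source` by edges of one type."""
        return sorted({e.target for e in self.edges
                       if e.source == source and e.kind == kind})


# Hypothetical example: one TF, two candidate target genes.
g = Graph()
g.add_node("TF_A", "TF")
g.add_node("gene_X", "target")
g.add_node("gene_Y", "target")
g.add_edge("TF_A", "gene_X", "binding", weight=0.9)
g.add_edge("TF_A", "gene_X", "differential_expression", weight=0.7)
g.add_edge("TF_A", "gene_Y", "differential_expression", weight=0.8)

# Candidate direct targets: supported by both binding and expression change.
bound = set(g.targets_of("TF_A", "binding"))
changed = set(g.targets_of("TF_A", "differential_expression"))
direct_candidates = sorted(bound & changed)
print(direct_candidates)
```

In this toy network only `gene_X` has both a binding edge and a differential-expression edge; `gene_Y` changes expression without binding evidence, so its regulation may be indirect.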

Probabilistic Modeling Need to distinguish between the true, underlying network and the actual results of a measurement (experiment): 1. False positive edges 2. False negative edges (were tested, were not found, but are there in nature) 3. Untested edges (were not tested, are not in your data, but are there in nature) 4. Hidden nodes (unknown TFs/target genes). Use penalized likelihood and/or Bayesian modeling to obtain optimal estimates of the underlying network given the experimental data (ChIP-chip and siRNA/array)

Current Work • Remapping of probes to newest mouse genome assembly mm6 ... done • Normalization of probe levels ... • Summary of probes into probe sets that represent genes ... • Find differentially expressed genes ... March 1st
