Walk-thru of CAGE exercise • Also at http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/ • …together with updated slides • And linked from web page
Interlude: a logistics problem • The largest cDNA project so far made 102,000 cDNAs • If you publish, you need to be able to ship these to the people asking for them • This would take >50kg of dry ice! Expensive and a logistics nightmare since you need to keep track of the 102,000 tubes • How can we transfer DNA?
RNA-seq • With a high-throughput tag sequencer, we can also do the brute force approach – fragment all mRNAs in a cell and sequence the pieces (or part of the pieces) • This is commonly referred to as RNA-seq
Compared to SAGE, CAGE • Sequence the whole mRNA – not just the end or the start • Can give connectivity, so that we know which exons are used, and which isoforms • Is actually bad at capturing 5’ and 3’ edges, due to statistical issues (white board demo)
Typical protocol • Isolate mRNA • Break up mRNAs • Make cDNAs of RNA fragments • Add adapters, amplify and sequence
We sequence 25-35 bp reads…randomly selected from each side of the fragment
Mapping tags • Challenge: What do we get (pros and cons) if we map the tags • a) To the genome • b) To the transcriptome (like all RefSeq transcripts)
Genome: unbiased – we could hit any transcript. Hard to hit spliced tags, and possibly mRNAs that get modified… • Transcriptome: We hit annotated genes, and splice sites are not a problem. On the other hand, we cannot find new things
Going from tags to wigs • Showing all tags as blocks in the browser is possible, but dumb – because there are potentially thousands in the window of interest, and we go blind • Easy way to summarize is to make nucleotide histograms – whiteboard demo
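The nucleotide histogram idea can be sketched in a few lines of Python. This is only an illustration: the function name, the 0-based half-open coordinates, and the three example tags are assumptions made here, not part of the exercise data.

```python
from collections import Counter

def nucleotide_histogram(tags):
    """Count how many tags cover each genomic position.

    tags: iterable of (start, end) pairs on one chromosome,
          0-based with end exclusive (an assumed convention).
    Returns a dict mapping position -> number of covering tags.
    """
    depth = Counter()
    for start, end in tags:
        for pos in range(start, end):
            depth[pos] += 1
    return dict(depth)

# Three hypothetical, partly overlapping tags
tags = [(100, 104), (102, 106), (103, 105)]
hist = nucleotide_histogram(tags)
for pos in sorted(hist):
    print(pos, hist[pos])
```

Position 103 is covered by all three tags and gets a height of 3, while the outer positions get 1. A wig track shows this kind of per-position value instead of one block per tag.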
Looking at RNA-seq data • In the tag_analysis web directory, there is a wig file, mm9_brain.wig, showing tags from an RNA-seq experiment on mouse brains. Upload this to the browser and look at the two genes below – are they expressed, and how much? • Kcnc3 • Hoxa5
Thought challenge: from tags to expression • We have a wig file showing where all the tags match on the genome • We have the UCSC annotation for all known genes • We want something like a microarray, saying • Gene X has an expression of Y • How can we do this? (2 minutes with your sideman)
“Naïve solution” • For each gene, count the tags that overlap it • Gene X has 45 tags • Gene Y has 4578 tags • Etc Problems with this?
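A minimal sketch of the naïve count, assuming tags and genes are given as (start, end) intervals on the same chromosome and strand, 0-based with end exclusive. The function name and the example coordinates are hypothetical.

```python
def count_overlapping_tags(gene_start, gene_end, tags):
    """Count tags that share at least one base with the gene interval."""
    return sum(
        1 for start, end in tags
        if start < gene_end and end > gene_start
    )

# Hypothetical gene and tags (chromosome and strand are ignored here)
gene_x = (100, 500)
tags = [(90, 120), (150, 180), (480, 510), (600, 630)]
n_tags = count_overlapping_tags(gene_x[0], gene_x[1], tags)
print(n_tags)  # 3: the last tag lies entirely outside the gene
```

Tags that only partly overlap the gene are counted in full here. Whether to count them is a design choice the slide leaves open.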
Length of transcripts will have an effect! • A long transcript gives more tags when broken up, and can be captured more easily • So, the number of tags from a transcript depends on • Actual expression (number of RNA molecules) • Length of the RNAs
Normalizing for length – not that hard • For each gene, count the tags that overlap it, and divide by gene length • Gene X has 45/(length of x) tags • Gene Y has 4578/(length of y) tags • Etc • What if we want to compare two experiments?
We also need to normalize for sample size, just as in SAGE, CAGE and ESTs • Recap: TPM is a normalization that rescales the tag count to what we would get if we had exactly one million tags • …so, 10^6 * (#tags in my gene)/(total tags)
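The TPM formula above, written out as a small Python function. The tag counts 45 and 4578 are the ones used on the earlier slides. The library size of two million tags is a made-up number for illustration.

```python
def tags_per_million(gene_tags, total_tags):
    """Rescale a tag count to a library of exactly one million tags."""
    return 1_000_000 * gene_tags / total_tags

total_tags = 2_000_000  # hypothetical library size
tpm_x = tags_per_million(45, total_tags)
tpm_y = tags_per_million(4578, total_tags)
print(tpm_x, tpm_y)  # 22.5 2289.0
```

With TPM values, the same gene can be compared between two experiments even if one was sequenced more deeply than the other.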
Combining the two • Normalize by gene length AND sample size • Gene X has an expression of Z TPMs/N • Where N is the RNA length
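Both normalizations can be combined in one function, sketched below. The RNA lengths and the library size are hypothetical values chosen for the example, and the function name is an assumption.

```python
def length_normalized_tpm(gene_tags, total_tags, rna_length):
    """TPM divided by RNA length (in bases): Z TPMs / N."""
    tpm = 1_000_000 * gene_tags / total_tags
    return tpm / rna_length

total_tags = 2_000_000  # hypothetical library size
# Hypothetical RNA lengths: gene X is 1500 bases, gene Y is 3000 bases
expr_x = length_normalized_tpm(45, total_tags, 1500)
expr_y = length_normalized_tpm(4578, total_tags, 3000)
print(round(expr_x, 4), round(expr_y, 4))
```

Gene Y is twice as long as gene X in this example, so its raw tag count overstates its expression relative to gene X by a factor of two. Dividing by length removes that effect.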
Summary of tag technologies • ESTs: old, expensive, long tags. Biased to 5’ and 3’ of genes. Can be used for exploration • SAGE: 3’ end tags. Only gene expression, no functional data. Limited for exploration • CAGE/5’SAGE: 5’ end tags. Promoter expression and location. Can be used for exploration • RNA-seq: “Random” tags over the whole mRNA. Expression and location – can be used for both expression and exploration