Gene expression analysis and transcriptomics Daniel Hurley

What are we going to talk about? • Understanding the core principles and ‘root hypothesis’ of transcriptomics • Choosing between different technologies • How to design an experiment • How to make sense of the data

Core principles: Transcriptomics • Transcriptomics is the study of the nature and abundance of transcribed elements in a population of cells or a tissue • ‘Transcribed elements’ are: mRNA, tRNA, rRNA • But also ncRNAs (non-coding RNAs): miRNA, siRNA, piRNA, snoRNA • And many more being discovered

Core principles: Root hypothesis • Summarising in one statement: The central dogma suggests that the abundance of transcribed elements affects cell behaviour and tissue function. Therefore, we hypothesise that comparing the abundance of transcribed elements between different conditions can tell us something about cell behaviour and tissue function in those conditions. • For mRNA specifically: The central dogma suggests that the abundance of mRNA affects protein activity. Therefore, we hypothesise that comparing the abundance of mRNA between different conditions can tell us something about protein activity in those conditions.

But

Core principles: the ‘omics’ part • This isn’t very ‘omic’ yet The central dogma suggests that the abundance of mRNA affects protein activity. Therefore, we hypothesise that comparing the abundance of mRNA between different conditions can tell us something about protein activity in those conditions. So the ‘omics’ part is about large-scale measurements, and exploratory hypotheses

Core principles: what can you do with it? Some answers: • Ask questions about relationships between specific genes • Identify potential drug targets • Learn the transcriptomic ‘signature’ of a condition • Make functional hypotheses about an uncharacterised gene • Classify conditions according to their ‘signature’ (e.g. disease)

Technology: different types of data ‘Transcriptomic’ data can be gathered using a number of methods: • Realtime RT-PCR – still regarded as ‘gold standard’ by many. Ubiquitous, but labour-intensive, and not really ‘transcriptome-scale’ • Microarrays – revolutionary in the 1990s, driving an explosion in bioinformatics. Use has plateaued but still common • RNAseq – generating count data from high-throughput sequencing of the transcriptome. Perceived as the method of the future (next slides) Pretty much everything here also applies to quantitative proteomics and to some extent metabolomics, although we will not discuss them in depth

Technology: Microarrays vs. RNAseq (1) It’s easy to think that: • Microarrays = old • RNAseq = the new hotness • BUT it’s not that simple

Technology: Microarrays vs. RNAseq (2)
Microarrays | RNAseq
Only detect transcripts for which there are probes on the array | Measure every assembled transcript
Generally detect only one type of RNA (e.g. mRNA OR miRNA) | Measure mRNA and ncRNA and everything else
Generally do not detect alternative splicing | Detect alternative splicing
Costs a lot | Also costs a lot (but getting rapidly cheaper)
Need to be made specific to a species or condition (e.g. human, mouse, tobacco) | Similar experimental protocol for every sample type
Dynamic range said to be less than RNAseq | Dynamic range said to be greater than microarrays
Mature technology: known, reliable ways to analyse data. No arguments | New technology: no-one is really sure how to analyse the data. Lots of arguments

Technology: Microarrays vs. RNAseq (3) • So when should you use one or the other for a gene expression experiment? • Availability: If you have a well-characterised and popular organism (human, E. coli, mouse, rat, fruit fly, various plant species) for which a commercial microarray exists, it’s an option. Otherwise it’s RNAseq • Breadth: If you think alternative splicing or ncRNA is important in your biological process, then RNAseq might be a better choice • Cost: Per-sample, microarrays are lower-cost than RNAseq for human work. For organisms with a smaller transcriptome, the difference is less clear • Complexity: If you don’t want to spend a lot of time (= money) on difficult normalisation and bioinformatics decisions, microarrays may be a better choice. RNAseq bioinformatics is still very new (~20 competing R packages doing the same job!) • Futureproofing: If you really want to compare this data with future data, RNAseq is likely to be around longer. On the other hand, there is a huge amount of published microarray data to which you may be able to compare results.
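
To give a flavour of the normalisation decisions mentioned under 'Complexity', here is a minimal sketch of one simple option, counts per million (CPM), which rescales each sample's raw counts by its sequencing depth. The gene names and counts are made up for illustration, and CPM is only one of several competing approaches, not a recommendation from these slides.

```python
def counts_per_million(counts):
    """Scale one sample's raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: 1_000_000 * n / total for gene, n in counts.items()}


# Hypothetical raw counts for the same three transcripts in two samples.
# Sample B was sequenced twice as deeply as sample A.
sample_a = {"geneA": 500, "geneB": 1500, "geneC": 8000}
sample_b = {"geneA": 1000, "geneB": 3000, "geneC": 16000}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

for gene in sample_a:
    print(gene, cpm_a[gene], cpm_b[gene])
```

The raw counts differ two-fold between the samples, but after scaling the two samples agree, so the apparent difference was sequencing depth rather than biology.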

Design: how do you do it? • Most importantly, you need a clear and coherent design document. This is important because the cost of repeating experiments is high, and because the data can be bewildering • Ask yourself: ‘what is the experimental question I am asking?’ Good examples: • Which transcripts could be differentially expressed between control and treated samples across all replicates, correcting for variance between replicates? • Which transcripts could be differentially expressed between each of the 3 pairwise combinations of tissues across all patients, correcting for inter-patient variance? • Are there transcripts which represent the tissue type? That is, transcripts which are more similar across patients than we see between different samples from the same patient? • A good experimental design document: • Sets out the experimental question • Defines the conditions that will be compared • Defines the types of comparisons that will be done between conditions (e.g. pairwise comparisons looking for differences, or a before-and-after ‘paired’ analysis)
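
As a rough sketch of what 'defining the conditions and comparisons' can look like in practice, the snippet below writes down a hypothetical three-tissue, four-patient design and enumerates the pairwise comparisons. The tissue and patient labels are invented for illustration.

```python
from itertools import combinations

# Hypothetical design: three tissues sampled from each of four patients.
tissues = ["tumour", "adjacent", "blood"]
patients = ["P1", "P2", "P3", "P4"]

# Every unordered pair of tissues is one pairwise comparison.
comparisons = list(combinations(tissues, 2))

# One sample per patient-tissue combination. Keeping the patient label with
# each sample is what lets a later analysis correct for inter-patient variance.
samples = [(patient, tissue) for patient in patients for tissue in tissues]

print(len(samples), "samples")
for first, second in comparisons:
    print("compare", first, "vs", second)
```

Writing the design down like this before the experiment makes the number of samples and the number of questions explicit.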

Design: variance and replication • How many replicates is ‘enough’? • The short answer is ‘it depends’ • On your estimate of effect strength • On the signal-to-noise ratio of the detector • On the amount of variance within conditions • Three observations is the minimum to define a distribution • Choose your replication strategy to capture the variance that interests you… • …and correct for the variance that doesn’t
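
The 'it depends' answer can be made concrete with a textbook sample-size approximation. The sketch below assumes a single two-sided comparison of two group means, normally distributed values and equal variance in both conditions. It ignores multiple testing, so for a transcriptome-scale experiment it is an optimistic lower bound rather than a recommendation. The function name and the example numbers are illustrative.

```python
from math import ceil
from statistics import NormalDist


def replicates_per_condition(effect, sd, alpha=0.05, power=0.8):
    """Approximate replicates per condition needed to detect a difference
    in means of size `effect`, given within-condition standard deviation `sd`.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) * sd / effect)^2.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Same effect size (e.g. a one-unit difference in log2 expression),
# with low and high within-condition variance.
low_variance = replicates_per_condition(effect=1.0, sd=0.5)
high_variance = replicates_per_condition(effect=1.0, sd=1.0)
print(low_variance, high_variance)
```

Doubling the within-condition standard deviation roughly quadruples the replication needed, which is the point of the bullets above: effect strength and variance drive the answer.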

What’s a ‘condition’? • What do I mean when I say a ‘condition’ in an experimental sense? • I mean any state of interest in which we can observe a cell population, tissue or organism. • Examples: • Patient A with melanoma • Patient B with melanoma • MTX sensitive cancer cell line • MTX resistant cancer cell line • HeLa cells 24h post-transfection with siRNA against BRCA1 • HeLa cells 48h post-transfection with siRNA against BRCA1 • Knockout mouse without Gene X • Wild-type mouse with Gene X

Design: variance and p-values • The more variance within a condition… • …the less convincing a result (= less evidence against the null hypothesis) • P-values capture this intuition in a numeric and rankable form • Precise interpretation of a p-value is complex • But it’s uncontroversial (I think!) to say that it’s a proxy measure of the weight of evidence against a null hypothesis • Multiple hypothesis testing problem: the more tests we run, the more likely we are to see what looks like an interesting result due to chance alone • Can correct for this using false discovery rate assessment and control
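
To put rough numbers on the multiple testing problem, the sketch below assumes every null hypothesis is true and all tests are independent (a simplification, since real transcripts are correlated). The function names are illustrative.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more p-values below alpha when nothing is really going on."""
    return 1 - (1 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Average number of p-values below alpha when nothing is really going on."""
    return n_tests * alpha


for n_tests in (1, 20, 20000):
    print(
        n_tests,
        round(prob_at_least_one_false_positive(n_tests), 4),
        expected_false_positives(n_tests),
    )
```

With tens of thousands of transcripts tested at once, some 'significant' results are all but guaranteed by chance alone, which is why the uncorrected p-value cannot be used as-is.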

Data: handling the output You’ve done an experiment, and you get a big bunch of data files. Then what? Analysis approaches: • Simple fold-change; not recommended – why? • limma (R package) is the benchmark for microarrays • A raft of packages for RNAseq: edgeR and DESeq are the most common.
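
One common answer to the 'why?' above, in line with the earlier slide on variance, is that fold-change ignores within-condition variance. The sketch below uses invented expression values and a plain Welch t-statistic purely as an illustration. It is not how limma, edgeR or DESeq model the data.

```python
from math import log2, sqrt
from statistics import mean, variance


def log2_fold_change(control, treated):
    """log2 of the ratio of mean expression, treated over control."""
    return log2(mean(treated) / mean(control))


def welch_t(control, treated):
    """Welch's t-statistic: difference in means scaled by its standard error."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical expression values for two transcripts, three replicates each.
tight_control, tight_treated = [98, 100, 102], [198, 200, 202]
noisy_control, noisy_treated = [40, 100, 160], [80, 200, 320]

print(log2_fold_change(tight_control, tight_treated), welch_t(tight_control, tight_treated))
print(log2_fold_change(noisy_control, noisy_treated), welch_t(noisy_control, noisy_treated))
```

Both transcripts show the same two-fold change, but only the first has replicates that agree with each other. Ranking by fold-change alone would treat them as equally convincing.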

Data: what do the results look like? (1) Output from a typical differential expression transcriptomic experiment might look something like this: [Figure: results table with column groups labelled ‘Transcript information’, ‘Model parameters’ and ‘Hypothesis strength data’] Note the sorting, colour-coding and annotation

Data: when do you believe the results? • SkepticalHippo says “multiple hypothesis testing is very important” • For realtime RT-PCR, a t-test works fine, as does a chi-square test or Fisher’s Exact Test • The limma package for microarrays incorporates more sophisticated approaches for modelling difference and adjusting for multiple hypothesis testing • The Bonferroni correction is often too conservative • Benjamini-Hochberg FDR is a pragmatic approach for exploratory bioinformatics • Again, there are various ways of doing this in RNAseq data, but no single clear approach or piece of software
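
The difference between the two corrections named above can be seen on a small example. This is a minimal sketch of the standard Bonferroni and Benjamini-Hochberg adjustments applied to made-up p-values. In practice you would use the adjustment built into your analysis package rather than hand-rolled code.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n_tests = len(p_values)
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    adjusted = [0.0] * n_tests
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as the raw p-values decrease.
    for rank in range(n_tests, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n_tests / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for eight transcripts.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

bonferroni_hits = sum(p < 0.05 for p in bonferroni(p_values))
bh_hits = sum(q < 0.05 for q in benjamini_hochberg(p_values))
print(bonferroni_hits, bh_hits)
```

On these values Bonferroni keeps fewer transcripts than Benjamini-Hochberg at the same 0.05 threshold, which illustrates the 'too conservative' point for exploratory work.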

Data: what do the results look like? (2) If we zoom in on an individual transcript, it might look like this: But not everything is differentially expressed!

Data: what do the results look like? (3) We can get high-altitude views of the data by using: Each tool represents the data in a different way, and all tell us something important.

Data: ranking heatmaps

Core principles: what can you do with it? Some answers: • Ask questions about relationships between specific genes • Identify potential drug targets • Learn the transcriptomic ‘signature’ of a condition • Make functional hypotheses about an uncharacterised gene • Classify conditions according to their ‘signature’ (e.g. disease)

Summary: what to do • Spend time clearly defining your experimental question in transcriptomic terms • Get advice on technology, experimental design, and research outputs • Choose replication and conditions which capture the variance that interests you, and correct for the variance which doesn’t • Be conservative about the number of different questions you ask at once; consider pilot and follow-up experiments • Keep returning to your data. Actively look for as many ways as possible to visualise similarity and difference within the data

Fin Any questions?
