Using Affymetrix GeneChips to discover “alternative” biology. Dr Andrew Harrison, Mathematical Sciences & Biological Sciences, University of Essex, harry@essex.ac.uk. Outline: The importance of Alternative Biology. Using GeneChips to explore the exotic transcriptome. Discovering systematic biases in GeneChip data.
Dr Andrew Harrison (Physics); Dr William Langdon (Physics and Computer Science); Dr Olivia Sanchez (Computer Science & Bioinformatics); Dr Maria Stalteri (Organic Chemistry & Bioinformatics); Joanna Rowsell (Mathematics); Zain-Ul-Abdin Khurho (Mathematics); Rahim Khokhar (Mathematics); Professor Graham Upton (Statistics); Jose Arteaga-Salas (Statistics); Dr Abdel Salhi (Computer Science & Mathematics); Abdelhak Kheniche (Pharmacology & Mathematics)
Pre-mRNA undergoes lots of processing. Messenger RNA has to get from the nucleus to the cytoplasm. This process needs to be tightly regulated.
An mRNA consists of a 5’ UnTranslated Region (UTR), a coding region and a 3’ UTR. mRNAs can be regulated via their UTRs.
Richard Roberts and Phil Sharp: mRNA can hybridize to the DNA from which it originated, yet there are chunks of DNA in a gene that don’t map to mRNA!
Alternative Splicing results in different permutations of exons from the same gene. Sometimes introns are also included. Each of these permutations results in a different protein. Alternative Splicing affects > 50% of our genes. It is the most obvious way in which 25,000 genes in the Human Genome can produce ~100,000 proteins.
Splicing is regulated via: sequence motifs; splicing factors; dynamic changes in RNA structure; rate of transcription.
Polyadenylation: modification of the transcript 3’ end. Polyadenylation occurs during and immediately after transcription. Transcript layout: 5’ UTR, coding region, 3’ UTR, poly(A) tail of 150-250 As. Sequence elements: the polyadenylation signal (PAS) AAUAAA, cleavage 10-30 nucleotides downstream of it, and a GU-rich or U-rich sequence. Protein factors: Cleavage and Polyadenylation Specificity Factor (CPSF), Cleavage Stimulation Factor (CstF) and Polyadenylation Polymerase (PAP).
Alternative Polyadenylation, Tian et al. (2000): single poly(A) site; alternative poly(A) sites in the 3’-most exon; alternative poly(A) sites in different exons.
~50% of human genes produce transcripts with different endings. Each transcript may have a unique combination of motifs within its 3’ UTR. These will act to regulate the transcript during its lifetime.
Mutations affecting how RNA isoforms are created may be responsible for up to 50% of genetic diseases!
Gene A has isoforms 1, 2 and 3; Gene B has isoforms 1, 2 and 3. Are there groups of genes which produce correlated combinations of isoforms, e.g. A1-B2, A2-B1, A3-B3? We expect there will be. Furthermore, we believe that co-regulating isoform choice will be a key process within the systems biology of higher eukaryotes.
We are developing informatics tools to aid the analysis of Affymetrix chips (GeneChips, Exon arrays). Probe cells of an Affymetrix Gene chip contain millions of 25mer oligonucleotide probes.
Affymetrix microarrays. Sequence (5’ to 3’): GTGGGAATTGGGTCAGAAGGACTGTGGCTAGG. Perfect match (PM) probe: GGAATTGGGTCAGAAGGACTGTGGC. Mismatch (MM) probe: GGAATTGGGTCACAAGGACTGTGGC. Perfect match and mismatch probe cells form probe pairs, which are scattered on the chip.
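The relationship between the perfect match and mismatch sequences above can be sketched in a few lines of Python. The sketch assumes the usual GeneChip design, in which the mismatch probe is the perfect match probe with its middle (13th) base replaced by its complement; the function name `mismatch_probe` is hypothetical.

```python
# Watson-Crick complement of each DNA base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm: str) -> str:
    """Return the MM probe for a 25-mer PM probe.

    Assumption: the MM probe differs from the PM probe only at the
    middle (13th) base, which is replaced by its complement.
    """
    if len(pm) != 25:
        raise ValueError("expected a 25-mer probe")
    middle = 12  # zero-based index of the 13th base
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]


pm = "GGAATTGGGTCAGAAGGACTGTGGC"  # perfect match probe from the slide
mm = mismatch_probe(pm)
print(mm)
```

Applied to the perfect match probe on the slide, this reproduces the mismatch probe shown there.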
Probes are grown through photolithography. The density of initiation sites for photolithographic probe synthesis is ~5×10¹³ molecules/cm². Each photolithographic step has a yield of ~0.92-0.94, so the fraction of full-length 25-mer probes lies between 0.92²⁵ (~10%) and 0.94²⁵ (~20%). This gives a full-length probe density of 5-10 × 10¹² cm⁻². Thus there will be ~3 nm between adjacent full-length probes (c.f. the diameter of DNA, ~2 nm).
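The arithmetic behind these estimates can be checked with a short calculation. This is a minimal sketch: the site density, step yields and probe length are taken from the slide, while the square-lattice spacing estimate and the function names are assumptions made for illustration.

```python
import math

SITE_DENSITY = 5e13   # initiation sites per cm^2 (from the slide)
PROBE_LENGTH = 25     # synthesis steps needed for a full-length probe


def full_length_fraction(step_yield: float, length: int = PROBE_LENGTH) -> float:
    """Fraction of probes that survive every synthesis step."""
    return step_yield ** length


def mean_spacing_nm(density_per_cm2: float) -> float:
    """Spacing between neighbours, assuming a square lattice (1 cm = 1e7 nm)."""
    return 1e7 / math.sqrt(density_per_cm2)


for step_yield in (0.92, 0.94):
    fraction = full_length_fraction(step_yield)
    density = SITE_DENSITY * fraction
    print(
        f"yield {step_yield}: full-length fraction {fraction:.2f}, "
        f"density {density:.1e} cm^-2, spacing {mean_spacing_nm(density):.1f} nm"
    )
```

The computed fractions are slightly above the rounded 10% and 20% quoted on the slide, and the spacing comes out at a few nanometres, comparable to the diameter of DNA.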
Fragmentation of RNA to a mean length of ~100 bases. Hybridization. Labelling with a fluorescent marker (on the Us). Remove partial hybrids by washing in a solution with a reduced salt content (phosphate backbones of nucleic acids have negative charge). Detect fluorescence.
Affymetrix probe set: each probe cell (aka feature) holds one probe; a probe pair consists of a Perfect Match (PM) and a Mismatch (MM) probe. The probes are not physically adjacent on the chip. The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene (Harrison, Johnston and Orengo, 2007, BMC Bioinformatics, 8: 195).
1-9 are different chips. dChip, RMA and GCRMA ‘model’ the systematic hybridisation patterns when calibrating an expression measure.
Once chips have gone through a calibration process, changes in gene expression between conditions or over time can be observed. The change in expression between two conditions for all the genes on an array can be viewed on an MA plot, where m = log2(Fold Change) and a = log2(Average Intensity).
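The two MA plot coordinates can be computed per gene as in the sketch below. It assumes the common convention in which a is the mean of the two log2 intensities (the log2 of their geometric mean); the slide only says log2(Average Intensity), so that choice, and the function name `ma_values`, are assumptions.

```python
import math


def ma_values(x: float, y: float) -> tuple:
    """Return (m, a) for one gene with calibrated intensities x and y (both > 0).

    m is the log2 fold change of y relative to x.
    a is the mean of the two log2 intensities (assumed convention).
    """
    m = math.log2(y) - math.log2(x)
    a = 0.5 * (math.log2(x) + math.log2(y))
    return m, a


# Illustrative intensities only: a four-fold increase between conditions.
m, a = ma_values(100.0, 400.0)
print(m, a)
```

Plotting m against a for every gene on the array gives the MA plot; genes with no change in expression scatter around m = 0.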
Some genes are represented by multiple probe-sets (e.g. Probe-set A and Probe-set B). If they are measuring the same gene, the signals should be up and down regulated together! Is that always true? No (Stalteri and Harrison, 2007, BMC Bioinformatics, 8:13).
Probes map to different exons. Because of alternative splicing, some of the exons may be upregulated whereas others may be downregulated.
Probes map to different sides of a polyadenylation signal. Because of alternative polyadenylation, some of the probes may be upregulated whereas others may be downregulated.
Alternative splicing and alternative polyadenylation affect how GeneChips should be analysed, but there is little work to date on how best to incorporate these effects. Both these biological processes will leave their mark in the data! Because GeneChips perform parallel observations of the whole genome, they may offer the possibility of finding examples of groups of coordinated splicing/polyA decisions. If we can find co-regulated events then we can explore mechanisms. There is little understanding of how groups of genes make coordinated decisions in their choice of splices, or polyadenylation status.
Mapping probes to genetic products. Inputs: Ensembl 48 (exons, genes and transcript information, using the BioMart query tool); microarray data (probes information). Processing: megaBLAST (sequence alignment of probes to genetic products); Perl programs, SQL queries, Linux scripts. Storage: text files repository (sequence files, sequence mappings); local database (MySQL db, Linux OS). Output.
Adjacent genes (Gene 1, Gene 2) are considered as independent units in higher eukaryotes. Surveys of transcripts now indicate that adjacent genes sometimes produce tandem chimeras: they contain RNA from both genes. We believe that a better understanding of this process will help to shed light on the regulation of polyadenylation sites used by isoforms. Why is the first site ignored in some situations?
[Figure: exon structure of MASK (Exons 33, 34), BP3 (Exons A, B, C) and the MASK-BP3 transcript (Exon 33, Exon 0, Exon B, Exon C), with probes 1-12.] An intergenic exon in MASK-BP3: Exon 0 is located in between MASK and BP3 and is only observed in the MASK-BP3 transcript.
[Figure: exon structure of SSF1 (Exons 10, 11, 12) and P2Y11 (Exons 1, 2), with probes 1-13 of probe-set 46597_at.]
Lots of “antisense” RNA is produced from the opposite strand to that of overlapping genes.
[Figure: mouse chromosome 14, + strand, 5' to 3', showing the 3' end of NM_009502 (Vcl), the 3' end of NM_018829 (Ap3m1), the 1416375_at consensus and target sequences, and the EST BI664885.] 1416375_at maps beyond the 3’ end of Ap3m1 and antisense to Vcl.
[Figure: mouse chromosome 3, + strand, 5' to 3', showing the 5' end of NM_027016 (Tloc1), AK016981 (4933429H19Rik), the consensus and target sequences of 1432634_at and 1432635_a_at, and the EST AV282337.] An example of a sense-antisense probeset pair (1432634_at and 1432635_a_at), capable of detecting expression upstream of the 5’ end of Tloc1.
Where the probes of one probe-set can end up mapping: Exon 1; Exon 2; the Exon 3 & 4 junction; upstream of a polyA signal; downstream of the polyA signal; binds to the gene of interest and to 50 other places in the transcriptome; binds beyond the 3’ end of the gene and is associated with cancer; doesn’t map to the gene of interest.
The standard textbook for molecular biology does not mention: alternative polyadenylation; chimeric transcripts; antisense transcription. Surveys of Affymetrix GeneChips contain a wealth of untapped information about the transcriptome. We can also use the surveys to identify, and in some cases correct, systematic errors associated with GeneChips.
We can compare a single CEL file (raw data) against the typical chip produced from averaging GEO. This enables us to spot regions in which all the probes have higher/lower intensities than expected. We can use our survey to identify, and in some cases correct, image defects on GeneChips (Arteaga-Salas et al., Briefings in Bioinformatics, 2008).
We are studying the correlations in expression across >6,000 GeneChips (HGU-133A), sampling RNA from many tissues and phenotypes.
We wish to understand the factors which affect the correlations between probes. In the example, 1, 2 & 3 are identical to themselves (consensus is 100%, colour white); 2 and 3 share the most in common (light grey); 1 and 2 share the least in common (dark grey).
[Figure: probe-probe correlation matrices, with mean intensity.] Probes 1-11 all map to the same exon. Each number is the correlation × 10. This is a different probe-set mapping to the same exon: there seems to be one outlier.
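One simple way to flag such an outlier is to correlate each probe's intensities across chips with those of every other probe in the same exon, and look for a probe whose typical correlation is low. The sketch below is an illustration under assumptions, not the analysis used for the slides: the function names, the use of the median, the 0.5 threshold and the toy intensities are all made up for the example.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists of intensities."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def typical_correlation(probes):
    """Median correlation of each probe with the other probes in the exon.

    probes maps a probe name to its intensities across the same chips.
    """
    result = {}
    for name, values in probes.items():
        others = [pearson(values, v) for other, v in probes.items() if other != name]
        result[name] = statistics.median(others)
    return result


def find_outliers(probes, threshold=0.5):
    """Names of probes whose typical correlation falls below the threshold."""
    return [name for name, r in typical_correlation(probes).items() if r < threshold]


# Toy intensities across five chips (invented for illustration).
toy_exon = {
    "probe1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "probe2": [1.1, 2.1, 2.9, 4.2, 5.1],
    "probe3": [0.9, 1.8, 3.2, 3.9, 4.8],
    "probe4": [3.0, 1.0, 4.0, 1.0, 3.0],
}
print(find_outliers(toy_exon))
```

The median is used rather than the mean so that a single poorly behaved probe does not drag down the scores of the well-behaved probes around it.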
The outliers correlate well with thousands of probes, taken from many different probesets. Correlation: Red 1; Yellow 0.75; Green 0.5; Blue 0
There is little sequence similarity between the probes, and they are from probe-sets picking up different biology, yet they are correlated! Example probes: TCCTGGACTGAGAAAGGGGGTTCCT; GAGACACACTGTACGTGGGGACCAC; GGTAGACTGGGGGTCATTTGCTTCC. Virtually all of the probes in the group have runs of Guanines within their 25 bases.
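Finding the longest run of Guanines in a probe is straightforward to automate. The following sketch uses the three example probes from the slide; the function name `longest_g_run` is hypothetical.

```python
from itertools import groupby


def longest_g_run(probe: str) -> int:
    """Length of the longest run of contiguous Gs in a probe sequence."""
    return max(
        (len(list(run)) for base, run in groupby(probe.upper()) if base == "G"),
        default=0,
    )


example_probes = [
    "TCCTGGACTGAGAAAGGGGGTTCCT",
    "GAGACACACTGTACGTGGGGACCAC",
    "GGTAGACTGGGGGTCATTTGCTTCC",
]

for probe in example_probes:
    print(probe, longest_g_run(probe))
```

Each of the three example probes contains a run of at least four Gs.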
Comparing probes with runs of Gs.
Number of contiguous Gs    Mean correlation
3                          0.14
4                          0.42
5                          0.49
6                          0.62
7                          0.75
We are only looking at a small fraction of the entire probe, yet it is dominating the effects across all experiments.
Probe + Target ⇌ Duplex, with forward rate constant kf (hybridization) and reverse rate constant kr (dissociation). ΔG = −RT ln K, where K is the equilibrium constant, R is the gas constant and T is the temperature. All spontaneous physical and chemical changes take place in the direction of a decrease in free energy, ΔG < 0.
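The relation between free energy and the equilibrium constant can be inverted to give K = exp(−ΔG / RT), which shows how strongly binding responds to small changes in ΔG. The sketch below is illustrative only: the temperature of 318.15 K (45 °C) and the ΔG values are assumed numbers, not taken from the slides, and the function name is hypothetical.

```python
import math

R = 8.314462618  # gas constant in J / (mol K)


def equilibrium_constant(delta_g: float, temperature: float) -> float:
    """K from delta G (in J/mol) and temperature (in K): K = exp(-dG / RT)."""
    return math.exp(-delta_g / (R * temperature))


T = 318.15  # assumed temperature in kelvin, for illustration

# Illustrative free energies in J/mol: more negative means more stable duplex.
for delta_g in (0.0, -10_000.0, -20_000.0):
    print(delta_g, equilibrium_constant(delta_g, T))
```

Because K depends exponentially on ΔG, any effect that shifts the free energy of hybridization, such as a change in the local probe density, can change the amount of bound target substantially.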
Phosphates on chains of nucleic acids have a negative charge. There is a Coulomb blockage of hybridization on microarrays (Vainrub and Pettitt 2002). The environment caused by probe-probe interactions acts to modify the hybridization of RNA (Hagan and Chakraborty 2004, Journal of Chemical Physics). The strength of binding depends upon probe density: ΔG = −RT ln K.
A tetrad of Guanines can bind to each other through Hoogsteen Hydrogen bonds with the help of a central cation. G-quadruplexes are prevalent in telomeres (single stranded DNA at the end of chromosomes). G-quadruplexes are thermally stable. G-quadruplexes take a range of topologies.
Adjacent probes within a cell on a GeneChip have the same sequence – a run of Guanines will result in closely packed DNA with just the right properties to form quadruplexes.
Parallel G-quadruplexes have a left-handed helical twist. We suggest 4 probes can efficiently form a “Maypole”. Outside the corset of the “G-spot”, the probes have little affinity for bases of the same sequence and the phosphate backbones will repel each other. Inside the G-spot the bases are on the inside and cannot bind target.
ΔG = −RT ln K. Probes that are not bound in G-quadruplexes will have a reduced probe density in the immediate environment of the runs of Guanines. This will result in very effective nucleation, and binding, with respect to hybridization to the rest of the probe. The binding will efficiently occur in the G-spot. Any RNA molecule with a run of Cs will hybridize. Thus, there will be enhanced correlations between all the probes that are able to form G-quadruplexes.
All probes within an exon should be closely correlated. By looking at probes that don’t correlate with the rest of the probes in the exon, we can identify systematic effects related to hybridization biophysics. We have picked up a signal which we associate with probe-probe interactions. Parallel probes with a run of guanines are ripe for forming G-quadruplexes. Probe-probe interactions act to modify the immediate environment of hybridization. Reducing the density of probes over a short length of sequence will increase the stability of hybridization in this region. Many different transcripts containing relatively short sequences (<<25 bases) will be able to efficiently hybridize to some probes.