Gene Expression Arrays (Haverford College, Fall 2001)

Gene Expression Arrays (Haverford College, Fall 2001) Elisabetta Manduchi manduchi@pcbi.upenn.edu phone: 215-573-4408

Caveats • In these lectures we will focus in particular on the array technology as it pertains to the study of gene expression. There are many more applications of this technology (genotyping, determining identity-by-descent, identifying protein binding sites, etc.) • All the experiments discussed are aimed at capturing information about mRNA abundance. • There is no strict linear relationship between genes and the “proteome” of a cell, as there might be modifications of the proteins that are not apparent from the DNA sequence (post-translational modifications).

Outline • Different kinds of high-throughput gene expression experiments. • Questions these data can help to address. • Image analysis and data preprocessing issues. • Differential expression, class discovery, class prediction. • Gene expression data management.

Different kinds of experiments • Qualitative • differential display • Quantitative • Sequencing based • sequencing of cDNA libraries • SAGE • Array based • filter arrays • two-channel microarrays • short oligonucleotide arrays

cDNA libraries • Represent expressed sequences in a biological sample. • Most mRNA molecules contain 3’ polyA tails. • Use poly-T oligomers to prime the synthesis of cDNA strands by reverse transcriptase. • RNA-DNA duplexes converted to double stranded DNA molecules (ribonuclease H, DNA polymerase I, DNA ligase).

cDNA libraries (cont.) • Ribonuclease H degrades the RNA template strand (short RNA fragments produced during degradation serve as primers for DNA synthesis). • DNA polymerase I catalyzes the synthesis of the second DNA strand and replaces RNA primers with DNA strands. • DNA ligase seals the remaining single-strand breaks in the double stranded DNA molecules. • The double-stranded cDNA is inserted into cloning vectors. • Expressed-sequence tags (ESTs) can then be derived.

Generalities • Array technology exploits complementary base-pairing. The steps are: • Prepare the array. • Prepare the mRNA source. • Hybridize probe to target. • Scan image. • Quantify image.

Filter Arrays See paper by Zhao et al. (1995). • cDNA clones are spotted on the array (possibly multiple spots for the same clone). • mRNA source is processed: • polyA RNA is purified from tissues • cDNA is generated with reverse transcription and radioactively labeled. • Probe is hybridized to target.

Filter Arrays: limitations • Cross-hybridization (sequences with high sequence identity, Alu repeats, etc.). • Hard to distinguish the transcripts generated by alternative splicing. • Distortion. • Several sources of bias and noise: • variation in spot size, shape, and concentration • variation in PCR reaction efficiency • variation in labeled nucleotide incorporation.

Two-channel Microarrays See paper by Schena et al. (1995). • Preparing the array: • cDNA clones amplified and deposited into individual wells of a plate (possibly multiple depositions for the same clone) • samples from the plate printed onto a glass microscope slide • the array is processed by chemical and heat treatment to attach the DNA sequences to the glass surface and denature them.

Building the chip [Figure: the Ngai Lab arrayer, UC Berkeley, and its print-tip head.] (Slide kindly provided by T. Speed)

[Figure: pins collect cDNA from the wells of a well plate, which contains the cDNA probes (cDNA clones), and print them onto a glass slide, giving an array of bound cDNA probes. Each print-tip produces one print-tip group; in this case, 4x4 blocks = 16 print-tip groups.] (Modified slide from one kindly provided by T. Speed)

Two-channel Microarrays (cont.) • Preparing the mRNA sources: two samples are analyzed simultaneously. For each of them: • polyA mRNA is prepared and reverse transcribed with incorporation of a fluorescent label (usually Cy3 [green] for one sample and Cy5 [red] for the other) * • A variety of labeling methods are currently available (e.g. direct labeling, indirect labeling, dendrimers) • the RNA is then degraded. * In the Schena et al. paper fluorescein and lissamine are used.

Two-channel Microarrays (cont.) • Hybridization: the labeled cDNAs are competitively hybridized to the array. • Scanning: utilizes a laser fluorescent scanning procedure (sequential excitation of the fluorophores). Emitted light is split according to wavelength and detected. • Quantifying: signals are then quantified separately, and the ratio of the two channels for each spot is also reported.

A two-channel microarray experiment Figure from: David J. Duggan et al. (1999) Expression profiling using cDNA microarrays. Nature Genetics 21: 10-14

Two-channel Microarrays: limitations • A large number of cDNA or PCR products must be prepared, purified, quantified, catalogued, and spotted onto a solid support. • If the cDNAs are derived from a cDNA library, low abundance cDNAs are unlikely to be spotted and the library must be normalized to reduce the redundant spotting of cDNAs from highly expressed genes. • Cross-hybridization. • Alternative splicing hard to detect.

Short Oligonucleotide Arrays See paper by Lockhart et al. (1996). • Preparing the array • covalently attached oligonucleotides chemically synthesized directly on a solid substrate • for each mRNA being monitored, a collection (probe set) of probe pairs (16 to 20) is synthesized on the array • each probe pair consists of two probe cells: one containing (millions of) copies of a given 25-mer that is a perfect match (PM) to a subsequence of the mRNA in question and the other containing copies of a companion mismatch (MM) 25-mer that has a single base difference in a central position.

Short Oligonucleotide Arrays (cont.) • Preparing the mRNA source • polyA RNA is converted to cDNA • cDNA is transcribed in vitro in the presence of labeled (biotin or fluorescein) ribonucleotides, giving rise to labeled RNA • RNA is then fragmented with heat (average fragment size of 50 to 100 bases).

Short Oligonucleotide Arrays (cont.) • Hybridization occurs in a flow-cell. A brief washing step follows to remove un-hybridized RNA. • After scanning, the (Affymetrix) quantification for a given probe set (representing an mRNA) consists of: • an intensity for each cell is computed (3rd quartile of pixels distribution in that cell, after excluding bordering pixels) • background values are computed (after dividing the array into sectors) and subtracted from cell intensities • the number of probe pairs where PM signal >> MM signal and PM signal << MM signal is computed as well as the average of the log of the PM/MM ratios for each probe set • a presence/absence call is made on each probe set • ave(PM-MM) is calculated for each probe set and assigned as the intensity of the corresponding mRNA • See also work of Li and Wong (2000, 2001) for other approaches to quantification of these arrays and automatic detection of artifacts
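
As a rough illustration of the last quantification steps, the sketch below computes ave(PM-MM), the average log PM/MM ratio, and the counts of probe pairs with PM well above or well below MM for one probe set. It is not the Affymetrix implementation: the function name, the log base (10), and the `min_diff` threshold standing in for ">>" and "<<" are assumptions made for the example, and the inputs are taken to be cell intensities that have already been background-subtracted.

```python
import math


def summarize_probe_set(pm, mm, min_diff=50.0):
    """Summarize one probe set from paired PM and MM cell intensities.

    min_diff is a hypothetical threshold used here to decide when
    PM >> MM or PM << MM; the real criterion is not given in the slides.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")

    diffs = [p - m for p, m in zip(pm, mm)]
    avg_diff = sum(diffs) / len(diffs)  # ave(PM - MM)

    # Count probe pairs where one cell clearly dominates the other.
    n_positive = sum(1 for d in diffs if d > min_diff)
    n_negative = sum(1 for d in diffs if d < -min_diff)

    # Average log ratio, using only pairs where both intensities are positive.
    logs = [math.log10(p / m) for p, m in zip(pm, mm) if p > 0 and m > 0]
    avg_log_ratio = sum(logs) / len(logs) if logs else float("nan")

    return {
        "avg_diff": avg_diff,
        "avg_log_ratio": avg_log_ratio,
        "n_positive": n_positive,
        "n_negative": n_negative,
    }


summary = summarize_probe_set([120.0, 200.0, 90.0, 300.0], [100.0] * 4)
print(summary)
```

A presence/absence call would combine these quantities through a decision rule, which is not reproduced here.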

Short Oligonucleotide Arrays: limitations • Criticized by some as using too short sequences as probes • people are now exploring ways of spotting 70-mers on glass microarrays (Long Oligonucleotide Arrays) • Currently, measures for the individual pixels and cells are not made available by Affymetrix, only summary measures for each probe set • More expensive than other arrays

Questions/Issues • Questions: • identification of genes which are expressed in a given biological sample • identification of genes which are differentially expressed between two samples • Issues: • background calculations • quality control and data cleanup • replication • normalization (within and between slides) • transformations

Expression profiles Given a collection of n gene expression experiments, each involving k genes, we get a k × n data matrix (one row per gene, one column per experiment). For each experiment (sample) we have an expression profile (or molecular fingerprint) of length k over the genes. For each gene we have an expression profile (of length n) over the experiments.
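
To make the two kinds of profile concrete, here is a small sketch with k = 3 genes and n = 4 experiments. The gene names, sample names, and values are made up for illustration, and storing genes as rows is just the convention assumed here.

```python
# Rows are genes (k = 3), columns are experiments (n = 4).
# All names and numbers are invented for the example.
genes = ["geneA", "geneB", "geneC"]
samples = ["s1", "s2", "s3", "s4"]
matrix = [
    [2.0, 2.1, 0.5, 0.4],  # geneA across the 4 experiments
    [1.0, 1.1, 1.0, 0.9],  # geneB
    [0.3, 0.2, 1.8, 2.0],  # geneC
]


def gene_profile(matrix, i):
    """Profile of gene i over the experiments (length n): row i."""
    return list(matrix[i])


def sample_profile(matrix, j):
    """Profile (molecular fingerprint) of experiment j over the genes (length k): column j."""
    return [row[j] for row in matrix]


print(genes[0], gene_profile(matrix, 0))
print(samples[2], sample_profile(matrix, 2))
```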

Questions: class discovery • Group (i) the samples or (ii) the genes by similarity of their profiles (unsupervised clustering). • Motivation • (i) determine a molecular classification of samples (e.g. subtypes of tumors which are morphologically indistinguishable) • (ii) determine groups of genes which are co-expressed and possibly co-regulated.
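
One common way to group profiles by similarity is agglomerative clustering on a correlation-based distance. The sketch below is only one possible choice, not a method prescribed in these lectures: the distance (1 minus Pearson correlation), the average-linkage rule, and the function names are all assumptions made for the example. The same code applies to gene profiles or sample profiles.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (fails on constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Distance between two profiles is 1 - Pearson correlation; distance
    between two clusters is the average over all cross-cluster pairs.
    Returns clusters as sorted lists of profile indices.
    """
    clusters = [[i] for i in range(len(profiles))]

    def dist(i, j):
        return 1.0 - pearson(profiles[i], profiles[j])

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist(i, j) for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.0, 1.5],
]
print(average_linkage(profiles, 2))
```

This brute-force version recomputes all distances at every merge, which is acceptable for a handful of profiles but not for thousands of genes.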

Questions: class prediction • Given known classes of samples, build a prediction model, based on their molecular fingerprints, to be used to classify novel samples. • Given expression profiles for a set of genes with known function, form groups and assign other genes to these groups (supervised clustering).
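
A very simple prediction model of this kind is a nearest-centroid classifier: average the fingerprints of each known class, then assign a novel sample to the class whose average is closest. This is a sketch of one possible model, chosen here for brevity and not taken from the lectures; the function names, the Euclidean distance, and the toy data are assumptions.

```python
import math


def class_centroids(profiles, labels):
    """Average the profiles of each class; returns {label: centroid}."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], profile)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids, profile):
    """Assign a new profile to the class with the nearest centroid (Euclidean)."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], profile))


# Toy training set: two genes measured in four samples of known class.
train = [[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2]]
labels = ["A", "A", "B", "B"]
centroids = class_centroids(train, labels)
print(predict(centroids, [1.5, 0.5]))
print(predict(centroids, [0.3, 1.5]))
```

In practice the model would be built on a subset of informative genes and its error rate estimated on samples not used for training.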

Questions: gene networks. More issues • Reverse engineering: infer gene networks from gene expression profiles (e.g. using time series). • This is a hard problem to tackle and requires lots of data. Work on this issue is at a more preliminary stage. For an overview see D’haeseleer et al. (2000). • More work can be found in the literature relative to the preceding questions. We will look at some of the methods developed for these.

Image analysis (filter and two-channel arrays) • Gridding: in order to extract spot intensities it is necessary to accurately identify the location of each of the spots. • Segmentation: it is necessary to identify, within each such location, which pixels correspond to probe hybridized to target. • Intensity extraction: after detecting location, size, and shape of each spot, one needs to calculate the signal (foreground) and the background intensities as well as quality measures at each spot.

Image analysis (cont.) [Figures: gridding, segmentation, intensity extraction.] Figures from http://www.nhgri.nih.gov/DIR/Microarray/image_analysis.html

Image analysis (cont.) • Various public and commercial software packages are available for image analysis. They use different algorithms for the 3 steps involved and require/allow different degrees of manual intervention. Moreover, packages differ in how many quality measures they output. • For the segmentation step, the following possibilities might be available: • fixed circle • adaptive • histogram • For intensity extraction there are also various possibilities: • Foreground: sum, mean, median, mode, etc. of pixel intensities • Background: none, global, local, morphological opening
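
The following sketch combines one choice from each list: fixed-circle segmentation, median foreground, and a local median background. It does not reproduce any particular package; the function name, the `gap` ring left out between spot and background, and the toy image patch are assumptions for illustration.

```python
from statistics import median


def extract_spot(patch, center, radius, gap=1):
    """Fixed-circle segmentation with median foreground and local background.

    patch  : 2-D list of pixel intensities around one spot
    center : (row, col) of the spot center, as found by gridding
    radius : radius of the fixed circle, in pixels
    gap    : hypothetical ring of pixels ignored between spot and background
    """
    cy, cx = center
    foreground_pixels, background_pixels = [], []
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            d2 = (y - cy) ** 2 + (x - cx) ** 2
            if d2 <= radius ** 2:
                foreground_pixels.append(value)
            elif d2 > (radius + gap) ** 2:
                background_pixels.append(value)
    foreground = median(foreground_pixels)
    background = median(background_pixels)
    return foreground, background, foreground - background


# Toy 7x7 patch: background level 10, a small bright spot of level 100.
patch = [[10] * 7 for _ in range(7)]
for y, x in [(3, 3), (2, 3), (4, 3), (3, 2), (3, 4)]:
    patch[y][x] = 100
print(extract_spot(patch, center=(3, 3), radius=1))
```

Using medians rather than means makes both estimates less sensitive to a few very bright pixels, such as dust specks.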

Local background [Figure: regions used for local background estimation by GenePix, QuantArray, and ScanAlyze.] (Slide kindly provided by T. Speed)

Morphological non-linear filter on background pixel signal (Spot software) Measures overall baseline background level. (Slide kindly provided by T. Speed)

Data cleansing: why? • There are artifacts, e.g. specks of dust, scratches, etc. • There are multiple light sources: background, target, target hybridized with sample, array surface. • The quality of the image analysis for certain spots might be poor. • Some of the quality measures output by the image analysis software can be used for clean-up. • Software packages also differ in the amount and type of quality measures provided. • Recently an SVM approach has been proposed to flag data: Davison T., “Using Support Vector Machines for the Classification of Data Quality in Microarray Experiments”, poster at ASI course, S. Miniato, Italy, October 2001. • For short oligonucleotide arrays, see the work of Li and Wong (2000, 2001) regarding artifact detection.

Quality measures • Spot • One channel, R or G • Signal/noise ratio • Variation in pixel intensities • Identification of “bad spots” (no signal), etc. • Two channels, R/G • Circularity, etc. • Array • Percentage of spots with no signal • Distribution of spot signal area, etc. (Slide kindly provided by T. Speed)

Normalization: why? • Within-slide • When comparing the red and green channels in a two-channel microarray experiment, one needs to calibrate the channels against each other because of the different labeling efficiencies and scanning properties of the dyes, as well as experimental variability (coming from separate reverse transcription and labeling). • There might be other systematic sources of variation in the measured intensities within an array, related to print-tip group differences or other spatial effects (e.g. due to the placement of the cover slip). • Between slides • When comparing different array experiments (from any array platform), one needs to put the data on an equal footing, again removing systematic sources of variation.

Normalization: methods • Multiply all values for an array by the same scaling factor obtained from a given set of spots on the array, e.g. • 1/(total intensity) or 1/(average intensity) or 1/(median intensity) • 1/(mean or median ratio): 2-channel microarrays • 1/(slope of some linear fit) • T. Speed’s group (see Yang et al., 2000) proposes various approaches for normalizing (log) ratios (R/G) in 2-channel microarrays, including: • intensity-dependent normalization: the scaling factor depends on the overall intensity of the spot, not just on the array • intensity-and-print-tip-dependent normalization: the scaling factor also depends on the print-tip group • scale normalization (within and between slides) • Li and Wong (2001) propose an approach in the same spirit for short oligonucleotide arrays
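
The simplest of these methods, a single scaling factor per array, can be sketched as follows. The function name and the `target` parameter are assumptions; the `summary` argument selects which of the listed factors is used (median by default, or for example `sum` for total intensity).

```python
from statistics import mean, median


def scale_normalize(values, summary=median, target=1.0):
    """Multiply every value on an array by target / summary(values).

    values  : intensities (one-channel arrays) or R/G ratios (2-channel arrays)
              taken from the set of spots chosen for normalization
    summary : function giving the array-wide summary, e.g. median, mean, sum
    target  : value the summary should take after scaling (an assumption)
    """
    factor = target / summary(values)
    return [v * factor for v in values]


print(scale_normalize([2.0, 4.0, 8.0]))                # median scaling
print(scale_normalize([2.0, 4.0, 8.0], summary=mean))  # average-intensity scaling
print(scale_normalize([2.0, 4.0, 8.0], summary=sum))   # total-intensity scaling
```

Here the factor is computed from, and applied to, the same list; when only a subset of spots (e.g. controls) is used to derive the factor, it would be computed on the subset and then applied to all spots.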

MA plots [Figure panels: $\log_2 R$ vs. $\log_2 G$, and M vs. A.] $M = \log_2 R - \log_2 G$, $A = (\log_2 R + \log_2 G)/2$ (Slide kindly provided by T. Speed)
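
The M and A values follow directly from the definitions above. In this sketch the function name is an assumption, and the red and green intensities are assumed to be positive (e.g. after background correction and removal of flagged spots).

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red and green spot intensities.

    M = log2(R) - log2(G) is the log ratio;
    A = (log2(R) + log2(G)) / 2 is the average log intensity.
    """
    m_values, a_values = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m_values.append(log_r - log_g)
        a_values.append((log_r + log_g) / 2)
    return m_values, a_values


m, a = ma_values([8.0, 4.0], [2.0, 4.0])
print(m, a)
```

Plotting M against A amounts to rotating the log R vs. log G scatter plot by 45 degrees, which makes intensity-dependent dye effects easier to see.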

Normalization - lowess Assumption: Changes roughly symmetric at all intensities or few genes change. (Slide kindly provided by T. Speed)
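
The idea of intensity-dependent normalization is to estimate the trend of M as a function of A and subtract it. The sketch below does not implement lowess; it swaps in a much cruder running median over neighbouring spots (ordered by A) to show the principle. The function name and the `window` size are assumptions.

```python
from statistics import median


def running_median_normalize(m, a, window=5):
    """Subtract from each M the median M of the spots with similar A.

    A crude stand-in for a lowess fit: spots are ordered by A and the
    trend at each spot is the median M over a window of neighbours.
    """
    order = sorted(range(len(a)), key=lambda i: a[i])
    half = window // 2
    normalized = [0.0] * len(m)
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        trend = median(m[j] for j in order[lo:hi])
        normalized[i] = m[i] - trend
    return normalized


# Toy data: a dye bias that grows with intensity, plus one changed gene.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m = [0.1, 0.1, 0.2, 0.2, 2.0, 0.4, 0.5, 0.5]
print(running_median_normalize(m, a, window=3))
```

As with lowess, this relies on the assumption stated on the slide: if many genes changed in the same direction at some intensity, the local trend would absorb real signal.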

Normalization - print-tip-group Assumption: For every print-tip-group, changes roughly symmetric at all intensities or few genes change. (Slide kindly provided by T. Speed)
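
The grouping step can be sketched as below. To keep the example short, each print-tip group is only centered on its own median M; the method described fits an intensity-dependent curve within each group instead. The function name and the toy data are assumptions.

```python
from collections import defaultdict
from statistics import median


def print_tip_center(m, groups):
    """Subtract from each M the median M of its print-tip group.

    m      : log ratios, one per spot
    groups : print-tip group identifier of each spot (same length as m)

    Simplification: location only, with no dependence on intensity A.
    """
    by_group = defaultdict(list)
    for value, group in zip(m, groups):
        by_group[group].append(value)
    offsets = {group: median(values) for group, values in by_group.items()}
    return [value - offsets[group] for value, group in zip(m, groups)]


m = [1.0, 1.2, 1.4, -0.5, -0.3, -0.1]
groups = [1, 1, 1, 2, 2, 2]
print(print_tip_center(m, groups))
```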

MA plot - after print-tip-group normalization (Slide kindly provided by T. Speed)

Scale normalization [Figure panels: before print-tip-group normalization; after scaled print-tip-group normalization.]

Which genes to use for normalization • All genes on the array. • Constantly expressed genes (housekeeping). • Controls • Spiked controls • Genomic DNA or Microarray Sample Pool (MSP) titration series • Rank invariant set. Every normalization method relies on the samples and arrays at hand satisfying certain assumptions. Thus, to judge which normalization is most appropriate for a given dataset, it is important to ascertain which of the necessary assumptions are satisfied.

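Experimental Design Issues Question: Which genes are (relatively) up/down regulated between sample type A and sample type B? • Need replicates (experimental and biological) to assess variability within sample type. • In the case of 2-channel microarray experiments, possible experimental designs are: • direct comparison: A and B are hybridized together on the same array (n arrays, possibly with further arrays where the dyes are swapped) • reference design: A and B are each hybridized against a common reference C (n arrays of A vs. C and n arrays of B vs. C)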

Replicates • Multiple spots on the same array (experimental variability, summary values can then be derived for each gene tag). • Multiple arrays utilizing the same sample type. Two possibilities: • the very same sample is hybridized multiple times (experimental variability) • different individuals of the same type are utilized (experimental and biological variability)

Differential Expression: methods • Claverie (1999), overview paper and method for SAGE • Single-slide methods (2-channel microarrays) • Chen et al. (1997) • Newton et al. (1999) • … • Methods involving replicates • Filter arrays, short oligonucleotide arrays, 2-channel arrays with reference design: • Dudoit et al. (2000) (see T. Speed’s group reference) • PaGE (CBIL, Penn Center for Bioinformatics) • SAM (Tusher et al., 2001) • … • 2-channel arrays with direct comparison design: • Lönnstedt and Speed (see T. Speed’s group reference) • Kerr and Churchill (2000), ANOVA

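2.1 Chen et al. method • Relies on the following assumptions: • there is a constant coefficient of variation c for the entire gene set, i.e. for every gene i the standard deviation of the intensity is c times its mean ($\sigma_i = c\,\mu_i$) • under the null hypothesis of equal means, $R_i$ and $G_i$ are i.i.d. normal random variables.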

2.1 Chen et al. method (cont.) • Under the null hypothesis of equal means, the density function of $T_i = R_i/G_i$ is computed. • Under the assumptions made, this turns out to be independent of i, but dependent on the (unknown) c. • c is estimated from the data via maximum likelihood (e.g. using housekeeping genes). • Plugging this value into the density formula, p-values can be computed for each ratio.
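
The logic of the method can be checked numerically. The sketch below does not use the closed-form density of Chen et al.; it swaps in a Monte Carlo simulation of the null distribution of the ratio for a given c, and reads p-values off the simulated ratios. The function names, the value c = 0.2, and the two-sided fold-change definition of "extreme" are assumptions for the example, and c is supplied directly here instead of being estimated by maximum likelihood.

```python
import random


def simulate_null_ratios(c, n_draws=100_000, mu=1000.0, seed=0):
    """Simulate T = R/G with R, G i.i.d. normal, mean mu, s.d. c * mu.

    mu is arbitrary: with a constant coefficient of variation the
    distribution of the ratio does not depend on it.
    """
    rng = random.Random(seed)
    ratios = []
    while len(ratios) < n_draws:
        r = rng.gauss(mu, c * mu)
        g = rng.gauss(mu, c * mu)
        if r > 0 and g > 0:  # discard the rare non-positive intensities
            ratios.append(r / g)
    ratios.sort()
    return ratios


def empirical_p_value(null_ratios, t):
    """Fraction of null ratios at least as far from 1 as t, as a fold change."""
    fold = max(t, 1.0 / t)
    extreme = sum(1 for x in null_ratios if max(x, 1.0 / x) >= fold)
    return extreme / len(null_ratios)


null = simulate_null_ratios(c=0.2, n_draws=20_000)
for ratio in (1.1, 2.0, 3.0):
    print(ratio, empirical_p_value(null, ratio))
```

Rerunning the simulation with a different `mu` and the same seed gives essentially the same ratios, which illustrates why the null density is the same for every gene once c is fixed.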
