EE150a – Genomic Signal and Information Processing
On DNA Microarrays Technology
October 12, 2004
Recall the information flow in cells
• Replication of DNA
  • {A,C,G,T} to {A,C,G,T}
• Transcription of DNA to mRNA
  • {A,C,G,T} to {A,C,G,U}
• Translation of mRNA to proteins
  • {A,C,G,U} to {20 amino acids}
• Interrupt the information flow and measure gene expression levels!
Figure: http://www-stat.stanford.edu/~susan/courses/s166/central.gif
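The alphabet mappings above can be illustrated with a minimal sketch. It assumes the input is the coding (sense) strand, so transcription reduces to replacing T with U, and it uses a deliberately partial codon table. The function names `transcribe` and `translate` are hypothetical and not part of the lecture.

```python
# Partial codon table, for illustration only; the real genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "UAA": "Stop",
}


def transcribe(coding_strand):
    """Map the DNA alphabet {A,C,G,T} to the mRNA alphabet {A,C,G,U}.

    Assumption: the input is the coding (sense) strand, so the mRNA has the
    same sequence with T replaced by U.
    """
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time and look up each codon."""
    amino_acids = []
    for start in range(0, len(mrna) - 2, 3):
        # "?" marks a codon that is missing from the partial table above.
        residue = CODON_TABLE.get(mrna[start:start + 3], "?")
        if residue == "Stop":
            break
        amino_acids.append(residue)
    return amino_acids


mrna = transcribe("ATGTTTGGCTAA")
protein = translate(mrna)
```

A microarray measures the intermediate product of this chain: the abundance of mRNA, not the protein.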
Gene Microarrays
• A medium for matching known and unknown sequences of nucleotides based on hybridization (base-pairing: A-T, C-G)
• Applications
  • identification of a sequence (gene or gene mutation)
  • determination of expression level (abundance) of genes
  • verification of computationally determined genes
• Enables massively parallel gene expression studies
• Two types of molecules take part in the experiments:
  • probes, orderly arranged on an array
  • targets, the unknown samples to be detected
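As a rough illustration of matching by base-pairing, the sketch below looks up which probes on a toy array have their perfect complement inside a target sequence. The array layout (a dictionary keyed by row and column), the sequences, and the function names are all hypothetical; real hybridization is a thermodynamic process, not an exact string match.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence):
    """Return the strand that pairs perfectly with `sequence` (A-T, C-G)."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


def find_binding_probes(probes, target):
    """Return array positions whose probe has a perfect complement in the target.

    probes: dict mapping (row, column) -> probe sequence (hypothetical layout).
    """
    target = target.upper()
    return [
        position
        for position, probe in sorted(probes.items())
        if reverse_complement(probe) in target
    ]


probes = {(0, 0): "ATGC", (0, 1): "GGTA", (1, 0): "TTTT"}
hits = find_binding_probes(probes, "CCGCATCC")
```

Here only the probe at position (0, 0) matches, because its reverse complement GCAT occurs in the target.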
Microarray Technologies
• Oligonucleotide arrays (Affymetrix GeneChips)
  • probes are photo-etched on a chip (20-80 nucleotides)
  • dye-labeled mRNA is hybridized to the chip
  • laser scanning is used to detect gene expression levels (i.e., amount of mRNA)
• cDNA arrays
  • complementary DNA (cDNA) sequences "spotted" on arrays (500-1000 nucleotides)
  • dye-labeled mRNA is hybridized to the chip (2 types!)
  • laser scanning is used to detect gene expression levels
• There are various hybrids of the two technologies above
Oligonucleotide arrays Source: Affymetrix website
GeneChip Architecture Source: Affymetrix website
Hybridization Source: Affymetrix website
Laser Scanning Source: Affymetrix website
Sample Image Source: The Paterson Institute for Cancer Research
Competing Microarray Technologies
• So far considered oligonucleotide arrays:
  • automated, on-chip design
  • light dispersion may cause problems
  • short probes, 20-80 nucleotides
• cDNA microarrays are another technology:
  • longer probes obtained via PCR (polymerase chain reaction)
  • [sidenote: what is optimal length?]
  • probes grown in a lab, robot printing
  • two types of targets: control and test
cDNA Microarrays http://pcf1.chembio.ntnu.no/~bka/images/MicroArrays.jpg
Some Design Issues
• Photo-etching based design: unwanted light exposure
  • border minimization
  • the probes are 20-80 nucleotides long
• Hybridization: binding of a target to its perfect complement
  • However, when a probe differs from the target's perfect complement by a small number of bases, it may still bind
  • This non-specific binding (cross-hybridization) is a source of measurement noise
• In special cases (e.g., arrays for gene detection), the designer has a lot of control over the landscape of the probes on the array
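One crude way to reason about cross-hybridization at design time is to count how many bases separate a target from a probe's perfect complement. The sketch below does this for equal-length sequences. The threshold `max_mismatches` and the three labels are assumptions made for illustration; the slides do not specify how many mismatches still permit binding.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def perfect_complement(probe):
    """Return the target sequence that would bind the probe perfectly."""
    return "".join(COMPLEMENT[base] for base in reversed(probe.upper()))


def count_mismatches(probe, target):
    """Count positions where the target differs from the probe's perfect complement."""
    expected = perfect_complement(probe)
    if len(expected) != len(target):
        raise ValueError("probe and target must have the same length")
    return sum(1 for a, b in zip(expected, target.upper()) if a != b)


def classify_binding(probe, target, max_mismatches=2):
    """Label a probe-target pair; max_mismatches is a hypothetical threshold."""
    mismatches = count_mismatches(probe, target)
    if mismatches == 0:
        return "specific"
    if mismatches <= max_mismatches:
        return "cross-hybridization candidate"
    return "unlikely to bind"


probe = "ATGCATGC"
labels = [classify_binding(probe, t) for t in ("GCATGCAT", "GCATGCAA", "TTTTTTTT")]
```

A designer could run such a check over all probe pairs to flag probes that are too similar to each other.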
Dealing with Measurement Noise
• Recent models of microarray noise
  • measurements reveal signal-dependent noise (i.e., shot noise) as the major component
  • additional Gaussian-like noise due to sample preparation, image scanning, etc.
• Image processing assumes image background noise
  • attempts to subtract it
  • sets up thresholds
• Lack of models of processes on microarrays
Probabilistic DNA Microarray Model
• Consider an $m \times m$ DNA microarray, with $m^2$ unique types of nucleotide probes
• A total of $N$ molecules of $n$ different types of cDNA targets, with concentrations $c_1, \dots, c_n$, is applied to the microarray
• Measurement is taken after the system has reached chemical equilibrium
• Our goal: from the scanned image, estimate the concentrations
DNA Microarray Model Cont'd
• Each target may hybridize to only one type of probe
• There are $k$ non-specific bindings
• Model diffusion of unbound molecules by random walk; distribution of unbound molecules uniform on the array
  • justified by reported experimental results
• Assume known probabilities of hybridization and cross-hybridization
  • Theoretically: from melting temperature
  • Experimentally: measurements (e.g., from control target samples)
Markov Chain Model
Modeling transitions between the possible states of a target:
• one specific binding state
• $k = 2$ non-specific bindings
• $p_n = 1 - k p_c - p_h$ is the probability that an unbound molecule remains free
Measurement is taken after the system has reached chemical equilibrium, so we need to find the steady state
Markov Chain Model Cont'd
Let $\pi_i = [\pi_{i,1}\ \pi_{i,2}\ \dots\ \pi_{i,k+2}]^T$ be a vector whose components are the numbers of type-$i$ targets in each of the $k+2$ states of the Markov chain
• $\pi_{i,1}$ is the number of hybridized molecules
• $\pi_{i,j}$, $2 < j \le k+2$, is the number of cross-hybridized molecules
Note that $\sum_{j=1}^{k+2} \pi_{i,j} = c_i$ for every $i$.
Stationary State of the Markov Chain
• In equilibrium, we want to find $\pi_i$ such that [equation not preserved], where the transition matrix $P_i$ is given by [equation not preserved]
• Clearly, in the stationary state we have [equation not preserved]
• Finally, the ratio $\pi_i / c_i$ gives the stationary state probabilities
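To make the steady-state idea concrete, here is a minimal sketch that builds one possible transition matrix and finds its stationary distribution by repeated multiplication. The matrix structure is an assumption, not the one from the slides: a free molecule hybridizes with probability `p_h`, cross-hybridizes to each of `k` sites with probability `p_c`, and stays free with probability $p_n = 1 - k p_c - p_h$; a bound molecule is released with probability `p_r`. The numeric values are hypothetical.

```python
def build_transition_matrix(p_h, p_c, p_r, k):
    """Row-stochastic matrix over k+2 states (assumed structure).

    State 0: hybridized, state 1: free, states 2..k+1: cross-hybridized.
    """
    p_n = 1.0 - k * p_c - p_h  # probability that a free molecule stays free
    if p_n < 0.0:
        raise ValueError("p_h + k * p_c must not exceed 1")
    size = k + 2
    matrix = [[0.0] * size for _ in range(size)]
    # Transitions out of the free state.
    matrix[1][0] = p_h
    matrix[1][1] = p_n
    for state in range(2, size):
        matrix[1][state] = p_c
    # Bound states either stay bound or release the molecule.
    for state in [0] + list(range(2, size)):
        matrix[state][state] = 1.0 - p_r
        matrix[state][1] = p_r
    return matrix


def stationary_distribution(matrix, iterations=5000):
    """Power iteration: propagate a distribution until it stops changing."""
    size = len(matrix)
    dist = [0.0] * size
    dist[1] = 1.0  # start with every molecule free
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]
    return dist


P = build_transition_matrix(p_h=0.5, p_c=0.1, p_r=0.02, k=2)
stationary = stationary_distribution(P)
c_i = 100000  # hypothetical number of type-i target molecules
expected_counts = [c_i * p for p in stationary]
```

Under these assumptions the balance between binding and release gives a closed form to compare against: the hybridized fraction is $p_h / (p_r + p_h + k p_c)$, which is 25/36 for the values used here.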
Linear Microarray Model
• Let matrix $Q$ collect the previously obtained probabilities
• The microarray measurement model can be written as [equation not preserved]
• Vector $w$ describes inherent fluctuations in the measured signal due to hybridization (shot noise)
  • Binding of the $j$-type target to the $i$-type probe is a Bernoulli random variable with variance $q_{i,j}(1 - q_{i,j})$
  • hence the variance of $w_i$ is given by [equation not preserved]
• Vector $v$ is comprised of iid Gaussian entries
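The sketch below simulates one plausible reading of this model, namely $s = Qc + w + v$, where $s$ is the measured signal. That form is an assumption inferred from the surrounding text, since the equation itself is not preserved. Likewise, the shot-noise variance $\sum_j c_j q_{i,j}(1 - q_{i,j})$ is assumed to follow from summing $c_j$ independent Bernoulli bindings, and the shot noise is drawn from a Gaussian approximation. The matrix `Q`, the concentrations, and `sigma_v` are hypothetical values.

```python
import random


def expected_signal(Q, c):
    """Mean of the measurement: (Qc)_i = sum_j q_ij * c_j."""
    return [sum(q * cj for q, cj in zip(row, c)) for row in Q]


def shot_noise_variance(Q, c):
    """Assumed variance of w_i: sum_j c_j * q_ij * (1 - q_ij)."""
    return [sum(cj * q * (1.0 - q) for q, cj in zip(row, c)) for row in Q]


def simulate_measurement(Q, c, sigma_v, rng):
    """Draw one measurement s = Qc + w + v under the assumptions above."""
    means = expected_signal(Q, c)
    variances = shot_noise_variance(Q, c)
    signal = []
    for mean, variance in zip(means, variances):
        w = rng.gauss(0.0, variance ** 0.5)  # Gaussian approximation of shot noise
        v = rng.gauss(0.0, sigma_v)          # additive iid Gaussian noise
        signal.append(mean + w + v)
    return signal


Q = [[0.8, 0.1], [0.1, 0.8]]  # hypothetical binding probabilities
c = [100.0, 200.0]            # hypothetical target counts
s = simulate_measurement(Q, c, sigma_v=1.0, rng=random.Random(0))
```

Note that the shot-noise variance grows with the concentrations, which is what makes this noise signal-dependent.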
Detection of Gene Expression Levels
• A simple estimate is obtained via the pseudo-inverse, [equation not preserved]
• Maximize a posteriori probability $p(s|c)$, which is equivalent to [equation not preserved], where the matrix is given by [equation not preserved]
• Optimization above readily simplifies to [equation not preserved]
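The simple pseudo-inverse estimate can be sketched as a least-squares solve. This assumes the linear model $s \approx Qc$ and a matrix $Q$ with full column rank, in which case applying the pseudo-inverse is equivalent to solving the normal equations $Q^T Q \hat{c} = Q^T s$. The helper names and the small example are hypothetical, and the probabilistic estimator from the slide is not reproduced here.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or badly conditioned")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def pseudo_inverse_estimate(Q, s):
    """Least-squares estimate of c from s, via the normal equations."""
    n = len(Q[0])
    QtQ = [[sum(row[i] * row[j] for row in Q) for j in range(n)] for i in range(n)]
    Qts = [sum(row[i] * si for row, si in zip(Q, s)) for i in range(n)]
    return solve_linear_system(QtQ, Qts)


Q = [[0.8, 0.1], [0.1, 0.8], [0.05, 0.05]]  # 3 probes, 2 target types (hypothetical)
c_true = [100.0, 200.0]
s = [sum(q * cj for q, cj in zip(row, c_true)) for row in Q]  # noiseless measurement
c_hat = pseudo_inverse_estimate(Q, s)
```

Forming $Q^T Q$ squares the condition number of the problem, so for large arrays a QR or SVD based solver would be the safer choice.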
Simulation Results
• Consider an $8 \times 8$ array ($m = 8$)
• Apply $n = 6$ types of targets
• Concentrations: [1e5 2e5 2e5 2e5 1e5 2e5] ($N$ = 1e6)
• Assume the following probabilities:
  • hybridization: 0.8
  • cross-hybridization: 0.1
  • release: 0.02
• Let $k = 3$ (number of non-specific bindings)
• Free molecules perform random walk on the array
Some Comments
• Adopt mean-square error (MSE) as the measure of performance
• As expected, we observe significant improvement in MSE over the raw measurements
• Things to do:
  • investigate how to incorporate control sample measurements
  • the technique needs modification for very large microarrays (matrix inversion may be unstable)
  • Experimental verification!
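For reference, the mean-square error used as the performance measure can be computed as in the sketch below. The vectors are made-up numbers chosen only to show the calculation; they are not the simulation results from the lecture.

```python
def mean_square_error(estimate, truth):
    """Average squared difference between estimated and true concentrations."""
    if len(estimate) != len(truth):
        raise ValueError("vectors must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


# Hypothetical values for illustration only.
c_true = [100.0, 200.0]
raw_measurement = [100.0, 170.0]
estimate = [101.0, 198.0]

mse_raw = mean_square_error(raw_measurement, c_true)
mse_estimate = mean_square_error(estimate, c_true)
```

Comparing `mse_raw` with `mse_estimate` is the kind of comparison the slide refers to when it reports an improvement over raw measurements.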
Why is this Estimation Problem Important?
• Microarrays measure expression levels of thousands of genes simultaneously
• Assume that we are taking samples at different times during a biological process
• Cluster data in the expression level space
  • relatedness in biological function often implies similarity in expression behavior (and vice versa)
  • similar expression behavior indicates co-expression
• Clustering of expression level data heavily depends on the measurements
  • better estimation may lead to different functionality conclusions
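Similarity in expression behavior is often scored before clustering. The sketch below uses the Pearson correlation between time profiles as one possible similarity measure. That choice, the gene names, and the profile values are all assumptions made for illustration; the slides do not name a specific measure or clustering method.

```python
def pearson_correlation(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


# Hypothetical expression levels sampled at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

similarity_ab = pearson_correlation(profiles["geneA"], profiles["geneB"])
similarity_ac = pearson_correlation(profiles["geneA"], profiles["geneC"])
```

Genes A and B rise together and would be grouped as co-expressed, while gene C moves in the opposite direction. Errors in the estimated expression levels change these scores, which is why estimation quality affects the clustering.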
Summary
• Microarray technologies are becoming of great importance for medicine and biology
  • understanding how the cell functions, effects on organism
  • towards diagnostics, personalized medicine
• Plenty of interesting problems
  • combinatorial design techniques
  • statistical analysis of the data
  • signal processing / estimation