Effective Multi-Marker Screening for Microarray Data Analysis

Part II – with interactions of genes in mind
Min-Te Chao, 2002/10/28

So far, all methods are one-gene-at-a-time.
• At first these methods are simple and intuitive; then they become complicated.
• E.g., Efron has to use a tricky logistic regression to estimate the prior density, which is not too easy.

The general problem with microarray data is that, although the setup resembles regression, the “design matrix” is never of full column rank.

In the setup Y = X\beta + error, X is n by p, with n<100 and p>1000. I have seen a case with n=7 but p>6000.
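A small numerical check can make the rank problem concrete. The sketch below uses simulated Gaussian numbers as a stand-in for expression data (an assumption, not real microarray measurements) with the n=7, p=6000 sizes mentioned above, and shows that the rank of X cannot exceed n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 7, 6000                      # sizes from the example in the text
X = rng.normal(size=(n, p))         # simulated stand-in for an n-by-p data matrix

rank = np.linalg.matrix_rank(X)
# rank(X) <= min(n, p) = n, so the p-by-p matrix X'X is singular
# and the usual least-squares estimate of \beta is not unique.
print("rank of X:", rank, "number of parameters:", p)
```

Since the rank is at most n, at most n linear combinations of the p betas can be estimated from the data without further assumptions.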

Let us say there is a way to “do the statistical problem” (say, with traditional methods) with a smaller p, say p=p_1=3 or 30, depending on the value of n we have.
• Let us assume a model with the first p_1 parameters only (the other betas are all 0, say).

With our traditional method, we may find the likelihood function, with n observations and p_1 parameters.
• We go through the textbook method to do inference about the selected p_1 parameters.
• We obtain an estimator of the p_1-dim parameter (together with an sd or p-value).

Repeat the procedure B times, each time with a “simple random sample without replacement of size p_1” from the p genes in the problem.

In this way we change a problem that is unsolvable (in our classical statistical sense) into B problems, all of which can be done with traditional methods.
• It is very time-consuming, but sometimes it works.
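The repeated-subproblem idea can be sketched as follows. This is only an illustration under stated assumptions: ordinary least squares without an intercept stands in for the "traditional method", the data are simulated, and the function name `random_subset_fits` is hypothetical.

```python
import numpy as np

def random_subset_fits(X, y, p1, B, seed=0):
    """Fit B small least-squares problems, each on p1 randomly chosen columns."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    fits = []
    for _ in range(B):
        # simple random sample without replacement of size p1 from the p genes
        idx = rng.choice(p, size=p1, replace=False)
        Xs = X[:, idx]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        sigma2 = resid @ resid / (n - p1)            # residual variance estimate
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
        fits.append((idx, beta, se))                 # estimator together with its sd
    return fits

# Simulated example (assumption): only gene 0 has a nonzero coefficient.
rng = np.random.default_rng(1)
n, p = 50, 1000
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)
fits = random_subset_fits(X, y, p1=3, B=200)
```

Each of the B fits is a full-rank problem because p_1 is much smaller than n, which is what makes the textbook inference available.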

Lo, Shaw-Hwa and Tian Zheng (2002). Backward haplotype transmission association algorithm – a fast multi-marker screening method. To appear: Human Heredity.

Instead of genes, they use markers.
• p markers, n patients
• For each patient, we have data from the father and mother
• So we have n sets of parent-child data.

The problem is to identify which are the disease-causing markers

They pick out r markers at a time, r<<p.
• A statistic T(r) is constructed, which measures the “amount of information” in an n-patient, r-marker subproblem.
• Markers in this subproblem are deleted one by one, the least important one first, until all markers left are important.

This gives us group 1 of important markers.
• We do the same thing for another subset of r markers and get group 2 of important markers, and so on.
• Do it B times, with B pretty large, say 5000.
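The loop described above (random subset, backward deletion, repeat B times, count how often each marker is returned) might be sketched like this. The statistic here is NOT the T(r) of Lo and Zheng: the R^2 of a least-squares fit is used as a hypothetical stand-in, and the stopping threshold `min_drop`, the function names, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def info_stat(X, y, cols):
    """Hypothetical stand-in for T(r): R^2 of a least-squares fit on `cols`."""
    if len(cols) == 0:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def backward_screen(X, y, cols, min_drop=0.02):
    """Delete the least important marker first; stop once removing any
    remaining marker would lower the statistic by more than min_drop."""
    cols = list(cols)
    while cols:
        full = info_stat(X, y, cols)
        drops = [full - info_stat(X, y, cols[:i] + cols[i + 1:])
                 for i in range(len(cols))]
        i_min = int(np.argmin(drops))
        if drops[i_min] > min_drop:
            break                      # every marker left is "important"
        del cols[i_min]
    return cols

def return_frequencies(X, y, r, B, seed=0):
    """Count how often each marker survives the backward screening."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    counts = np.zeros(p, dtype=int)
    for _ in range(B):
        subset = rng.choice(p, size=r, replace=False)
        for j in backward_screen(X, y, subset):
            counts[j] += 1
    return counts

# Simulated example (assumption): binary markers, markers 0 and 1 matter.
rng = np.random.default_rng(2)
n, p = 80, 40
X = rng.integers(0, 2, size=(n, p)).astype(float)
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n)
counts = return_frequencies(X, y, r=5, B=400)
```

In the actual method, the likelihood and the statistic are built from the parent-child transmission data; only the outer structure of the loop is meant to carry over from this sketch.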

Combine all markers together; those with the highest frequencies are selected.
• More specifically, markers whose returning frequencies exceed the 3rd quartile plus 1.8 times the IQR are selected (about 3.1 sd from the mean).
• About 10^{-3} type I error.
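The "3.1 sd" and "10^{-3}" figures can be checked under the assumption that the returning frequencies of unimportant markers are roughly normal. The sketch below does that arithmetic and also gives a hypothetical helper, `select_markers`, that applies the quartile rule to a vector of frequencies.

```python
from statistics import NormalDist
import numpy as np

# Where does Q3 + 1.8 * IQR fall for a standard normal distribution?
z = NormalDist()
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
cutoff_sd = q3 + 1.8 * (q3 - q1)      # cutoff in sd units above the mean
tail = 1.0 - z.cdf(cutoff_sd)         # one-sided tail probability beyond it

def select_markers(freq, k=1.8):
    """Return indices whose frequency exceeds Q3 + k * IQR (hypothetical helper)."""
    freq = np.asarray(freq, dtype=float)
    lo, hi = np.percentile(freq, [25, 75])
    return np.flatnonzero(freq > hi + k * (hi - lo))

print(round(cutoff_sd, 2), tail)
print(select_markers([1, 1, 1, 1, 1, 1, 1, 50]))
```

The normal quartiles are at about ±0.674 sd, so the IQR is about 1.349 sd and the cutoff is about 0.674 + 1.8 × 1.349 ≈ 3.1 sd, with an upper tail probability close to 10^{-3}.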

The difficult part of the problem is to formulate a likelihood function for the r selected markers.
• The next problem is to derive a test statistic, together with its properties. But these are problem-specific…

It is the generality of the setup that is important.
• Because it considers r markers at a time, the likelihood function is with respect to the r selected markers. If there is any interaction between 2 or 3 markers, this process has the potential to pick it up.

This is not possible with any of the one-gene-at-a-time processes.
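A toy example may help show why one-at-a-time screening can miss an interaction. The data below are simulated under an assumed pure-interaction model (the outcome is the exclusive-or of two binary markers), and the sample size is deliberately large so that sampling noise does not blur the picture; none of this comes from real marker data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x0 = rng.integers(0, 2, size=n)
x1 = rng.integers(0, 2, size=n)
y = (x0 ^ x1).astype(float)           # outcome depends only on the pair jointly

# One marker at a time: the marginal correlation with y is near zero.
marginal_corr = abs(np.corrcoef(x0, y)[0, 1])

# Two markers together, with their product term: the fit is essentially exact,
# because y = x0 + x1 - 2 * x0 * x1.
A = np.column_stack([np.ones(n), x0, x1, x0 * x1])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
joint_r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

print(marginal_corr, joint_r2)
```

Each marker looks uninformative on its own, yet the pair explains the outcome completely, which is the kind of signal a procedure that examines r markers jointly has a chance to detect.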

All known methods, data mining or not, for analysis of microarray-type data are ad hoc and rather primitive.
• The amount of theory is limited.
• These methods tend eventually to become statistical in nature, because an assessment of risk is still a very important factor in scientific work.

Subject-matter relevancy is the key.
• Other keys: good data; other scientists; effective computation; don’t wait.
