Gene Expression meets Gene Ontology: A novel statistical method for Microarray analysis

Vasanth Singan. Advisors: Dr. John Colbourne & Dr. Haixu Tang

OUTLINE • Introduction • Background • Challenge • Previous Work • Methodology • Results • Future Work

INTRODUCTION • Gene expression profiling is providing breakthroughs in medical and fundamental biology research. • Many statistical approaches have been developed to analyze microarray results and identify genes that are regulated under experimental conditions. • Most of the statistical approaches do not consider the existing biological knowledge. We explore the possibility of using existing knowledge to improve the analysis.

BACKGROUND Microarray: 1. cDNA or Spotted Array 2. High-Density Oligonucleotide Array

ovo, otu, Sxl: Drosophila Microarray Experiment • The Drosophila gene ovo (shavenbaby) is required in the germline for sex determination and for female-specific germline viability and differentiation. • OVO regulates its own transcription and the transcription of the gene otu. • OVO-B is a transcriptional activator and is sufficient for female fertility. • OVO-A is a transcriptional repressor which, when misexpressed, results in dominant-negative female sterility.

Drosophila Microarray Experiment • Goal: to identify additional genes in the germline pathway by probing for both direct and indirect targets of ovo using microarrays. • The microarray analysis searched for genes differentially expressed in dissected ovaries from ovo mutants compared to wild type. • Microarrays are printed with ~15k spots; PCR primers designed by Incyte Genomics amplify 93% of genes in annotation version 1.0 and 75% in version 3.1.

Significance Analysis of Microarrays (SAM) • SAM computes a statistic d_i for each gene i, measuring the strength of the relationship between gene expression and the response variable (here, mutant versus wild type). • It uses repeated permutations of the data to determine significance. • SAM produces a list of genes ranked by this statistic. Problem: most statistical analyses treat each gene independently of the others. In reality genes are co-regulated, and there are many cases where individual genes do not meet the statistical cut-off yet may be significant when their expression profiles are assessed as a group.
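The SAM statistic described above can be sketched as follows. This is a minimal illustration of the two-class unpaired form, d_i = (mean difference) / (s_i + s0), and is not the SAM software itself. The function names, the fixed value of s0 (SAM chooses it from the data), and the number of permutations are assumptions made for the example.

```python
import numpy as np

def sam_statistic(x, y, s0=0.1):
    """Relative difference d_i per gene for two groups.

    x, y: arrays of shape (genes, samples) for the two conditions.
    s0 is a small positive constant that damps genes with tiny variance;
    a fixed value is assumed here for illustration.
    """
    n1, n2 = x.shape[1], y.shape[1]
    diff = y.mean(axis=1) - x.mean(axis=1)
    # Pooled sum of squared deviations within each group.
    ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    ss += ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s = np.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return diff / (s + s0)

def expected_order_statistics(x, y, n_perm=200, s0=0.1, seed=0):
    """Average sorted d-values over random relabelings of the samples."""
    rng = np.random.default_rng(seed)
    data = np.hstack([x, y])
    n1 = x.shape[1]
    null = np.empty((n_perm, data.shape[0]))
    for b in range(n_perm):
        cols = rng.permutation(data.shape[1])  # shuffle sample labels
        d = sam_statistic(data[:, cols[:n1]], data[:, cols[n1:]], s0)
        null[b] = np.sort(d)
    return null.mean(axis=0)

def rank_genes(d):
    """Gene indices ordered from most to least extreme |d_i|."""
    return np.argsort(-np.abs(d))
```

Genes whose observed d_i lies far from the permutation-based expectation are the ones SAM calls significant; the ranked list from `rank_genes` plays the role of the SAM output used as input to the method in this talk.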

Example of SAM output

CHALLENGE How can existing knowledge about gene relations be integrated to improve tests of significance in microarray analysis?

Previous Work 1. Sung Geun Lee, Jung Uk Hur, and Yang Seok Kim. A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics 20: 381-388 (2004); Advance Access published January 22, 2004. 2. Barry R. Zeeberg et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biology (2003). 3. Sung Geun Lee, Wan Seon Lee, and Yang Seok Kim. GOODIES: GO Based Data Mining Tool for Characteristic Attribute Interpretation on a Group of Biological Entities. Genome Informatics 14: 675-676 (2003). 4. Boris Adryan and Reinhard Schuh. Gene ontology-based clustering of gene expression data. Bioinformatics, Advance Access published April 29, 2004. 5. Peter N. Robinson, Andreas Wollstein, Ulrike Böhme, and Brad Beattie. Ontologizing gene-expression microarray data: characterizing clusters with Gene Ontology. Bioinformatics, Advance Access published February 5, 2004.

Gene Ontology (GO) • Structured, controlled vocabularies (ontologies) • Organized as a DAG (Directed Acyclic Graph): Node A is linked to Node B by an Is_a / Part_of relation • Example nodes (simplified IDs): GO:01 Biological Process; GO:02 Development; GO:03 Behavior; GO:04 Cell differentiation; GO:05 Locomotory behavior; GO:06 Reproductive behavior

Annotation • GO:01: genes a, b, c, d, k, l • GO:02: genes d, k, l • GO:03: genes a, b, c, d • GO:04: gene k • GO:05: gene d • GO:06: genes a, c
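The annotation example above is consistent with genes being inherited by every ancestor of the node they are annotated to. A minimal sketch of that propagation follows. The parent links and the split into direct annotations are assumptions chosen so that the output reproduces the gene sets on the slide; they are not taken from the real Gene Ontology.

```python
from collections import defaultdict

# Assumed child -> parents links for the toy DAG (hypothetical edges).
PARENTS = {
    "GO:01": [],
    "GO:02": ["GO:01"],
    "GO:03": ["GO:01"],
    "GO:04": ["GO:02"],
    "GO:05": ["GO:03"],
    "GO:06": ["GO:03"],
}

# Assumed direct annotations (before propagation).
DIRECT = {
    "GO:02": {"d", "l"},
    "GO:03": {"b"},
    "GO:04": {"k"},
    "GO:05": {"d"},
    "GO:06": {"a", "c"},
}

def ancestors(node, parents):
    """All nodes reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

def propagate(direct, parents):
    """Give each node its own genes plus those of all its descendants."""
    full = defaultdict(set)
    for node, genes in direct.items():
        full[node] |= genes
        for anc in ancestors(node, parents):
            full[anc] |= genes
    return dict(full)
```

With these inputs, `propagate(DIRECT, PARENTS)` gives GO:01 the genes a, b, c, d, k, l, matching the slide. Because the structure is a DAG rather than a tree, a node with several parents passes its genes up every path.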

Drosophila Microarray Experiment

Methodology Inputs: • Ranked list of genes from SAM (Gene 01, Gene 02, ..., Gene n) • Gene Ontology DAG nodes (Node 01, Node 02, ..., Node m)

Iterative Refinement • Input: ranked list of genes from SAM • Task I: compute significance of GO nodes • Task II: compute significance of genes • Repeat Tasks I and II for N iterations • Output: ranked list of genes and nodes

Methodology - I (Log-Likelihood) • Task I: for each node N, find the log-likelihood and probability of it being differentially expressed. • Task II: for each gene i, find the posterior probability of it being differentially expressed.

Gene Significance: Original vs. Scrambled

Inferences from Methodology - I • The test against scrambled input shows only marginal significance. • The distribution of gene probabilities within a node is not significantly different from that of the scrambled data set. • Noise is high in lowly expressed genes. • Nodes with too few or too many genes are affected by their relatively low proportion of significant genes.

Example of SAM output

Methodology - II (Rank-Based Permutation Test) • Task I: for each node N, find the E-value based on the average rank of its genes. • Task II: for each gene i, find the posterior probability based on the E-values of its nodes.
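Task I of the rank-based approach could look roughly like the sketch below. The slides do not define the E-value, so this assumes one common convention: a permutation p-value for the node's average rank (compared against random gene sets of the same size), multiplied by the number of nodes tested. The function name and parameters are hypothetical, and Task II is not covered.

```python
import numpy as np

def node_evalues(ranked_genes, node_genes, n_perm=1000, seed=0):
    """E-value per GO node from the average rank of its genes.

    ranked_genes: gene IDs, most significant first (e.g. from SAM).
    node_genes:   dict mapping node ID -> iterable of gene IDs.
    Assumption: E-value = permutation p-value * number of nodes tested.
    """
    rng = np.random.default_rng(seed)
    rank = {g: r for r, g in enumerate(ranked_genes, start=1)}
    all_ranks = np.arange(1, len(ranked_genes) + 1)

    # Keep only genes present in the ranked list; drop empty nodes.
    tested = {}
    for node, genes in node_genes.items():
        ranks = [rank[g] for g in genes if g in rank]
        if ranks:
            tested[node] = ranks

    evalues = {}
    for node, ranks in tested.items():
        observed = np.mean(ranks)
        # Null distribution: mean rank of random gene sets of equal size.
        null = np.array([
            rng.choice(all_ranks, size=len(ranks), replace=False).mean()
            for _ in range(n_perm)
        ])
        # Small mean rank = enriched near the top of the list.
        p = (np.sum(null <= observed) + 1) / (n_perm + 1)
        evalues[node] = float(p * len(tested))
    return evalues
```

Sampling sets of the same size as the node addresses one of the problems noted for Methodology I, since very small and very large nodes are each compared against a null of matching size.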

Drosophila Microarray Experiment: ranked list of genes and ranked list of nodes

RESULTS 1. Functional categories (GO nodes) that are enriched with genes which are up-regulated / down-regulated. 2. A ranked list of genes with associated scores representing how significantly these genes are up-regulated / down-regulated.

FUTURE WORK • Cut-off value for genes without GO annotations • Jackknife analysis • Analysis of additional data sets

ACKNOWLEDGEMENTS Dr. John Colbourne (CGB) Dr. Haixu Tang Center for Genomics and Bioinformatics Genome Informatics Laboratory

QUESTIONS?
