GOSt, a Gene Ontology mining tool • Jüri Reimand

Overview • Introduction, bioinformatics • Gene Ontology (GO) • GOSt, a Gene Ontology mining tool • Statistics and thresholds • Ordered gene lists • Extending GO

Introduction • Bioinformatics • Analysis of experimental data • Genes encode proteins • Proteins: building blocks of living organisms • Gene expression: protein production from genetic code • Microarray experiments measure gene expression • Thousands of genes simultaneously • Expression levels over time • Different biological conditions • Comparison of healthy and diseased cells [Figure: expression measured over time; genes with similar profiles are clustered]

Introduction • Biological experiments give large amounts of data • Groups of similar genes: • top “most active” genes • similar expression profiles over time • Many genes have some available annotations • Previous knowledge from databases • How to describe the group as a whole? • What are the common features? • Which features are significantly overrepresented? [Figure: a gene group labelled with example annotations “steroid metabolism”, “biosynthesis”, “iron ion binding”]

Gene Ontology (GO) • GO - Directed Acyclic Graph (DAG) • Vertices: terms • Edges: relations between general and specific terms • Hierarchically structured vocabulary • 3 DAGs: processes, components, functions • Annotations to vocabulary terms • Association between a gene g and a property t (GO term t) • Based on biological discoveries • Genes of many genomes are annotated to GO • Annotation sets : for a fixed organism • All genes associated with GO term t

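GO example • Graph fragment with some terms related to organ development • Vocabulary is general to living organisms • Gene annotations are organism-specific • True Path Rule: hierarchical annotations (a gene annotated to a term is also annotated to all of that term's ancestors) [Figure: example genes ENSG00000163217 and ENSG00000161202 annotated in the graph fragment]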
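The True Path Rule can be illustrated with a small sketch. This is not GOSt code; the term names, gene names and the function `propagate_annotations` are hypothetical, chosen only to show how direct annotations spread upward through the DAG to form annotation sets.

```python
from collections import defaultdict

def propagate_annotations(parents, direct):
    """Apply the True Path Rule on a toy vocabulary.

    parents: dict mapping a term to its more general (parent) terms
    direct:  dict mapping a gene to the terms it is directly annotated to
    Returns a dict mapping each term to its annotation set (set of genes).
    """
    def ancestors(term, seen):
        # Collect the term itself plus every term reachable via parent edges.
        for p in parents.get(term, ()):
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    annotation_sets = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for t in ancestors(term, {term}):
                annotation_sets[t].add(gene)
    return dict(annotation_sets)

# Hypothetical toy vocabulary and annotations (not real GO data)
parents = {
    "heart development": ["organ development"],
    "lung development": ["organ development"],
    "organ development": ["development"],
}
direct = {"geneA": {"heart development"}, "geneB": {"lung development"}}
sets = propagate_annotations(parents, direct)
```

In this toy case both genes end up in the annotation sets of "organ development" and "development", while the specific terms keep one gene each.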

GOSt – Gene Ontology Statistics • GO annotations to groups of genes • Statistical significance of results • Thresholds for distinguishing significant results • Analysing ordered lists of genes • Visualisation methods, WWW interface • Command line toolset for large-scale analysis

GOSt example [Screenshot of a query result: 45 mouse genes, 338 GO; result columns: evidence codes, genes, GO terms, p-value]

Annotations to gene groups • Result: term t matches query Q [Figure: query genes Gq and genes Gt annotated to a GO term, e.g. heart development]

Statistical significance • Is the intersection Q∩T significant? • Fisher's one-tailed test • Cumulative hypergeometric probability • Probability of getting the observed number of genes, or more, in the intersection Q∩T • P ( pick k white balls out of K white and N-K black balls ) • Multiple testing • Every query results in a number of p-values • Matching GO terms are not independent • Increased rate of false positive matches • Which p-values are significant?
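A minimal sketch of the cumulative hypergeometric probability follows. The symbols k, K and N are taken from the urn description above; the query size n and the function name `hypergeom_tail` are assumptions added for illustration and are not part of GOSt.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """One-tailed p-value: probability of k or more white balls.

    N: total number of genes (balls)
    K: genes annotated to the term, |T| (white balls)
    n: query size, |Q| (balls drawn without replacement); assumed symbol
    k: observed intersection size, |Q∩T|
    """
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Toy numbers: 10 genes, 4 annotated to the term, query of 3 genes, 2 matches
p = hypergeom_tail(2, 3, 4, 10)
```

With these toy numbers the sum is (6·6 + 4·1) / 120, so `p` is one third.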

Experimental thresholds • Simulation experiment • Fix some gene query size k • Repeat 1000 times: • Generate synthetic query Q with k elements: a random subset of the organism's genes • Observe best p-value p for query Q • Store p-value p in the set P • Choose p', the 50th smallest p-value from P • Threshold p': top 5% of p-values for random queries of size k • Calculate for query lengths k = [1,1000] • Compare with standard multiple testing corrections • Bonferroni (1936), Benjamini-Hochberg (1995)
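The simulation steps above could be sketched roughly as follows. The gene names, the three toy annotation sets, the fixed random seed and the function names are all hypothetical; the real experiment uses an organism's full set of GO annotation sets and 1000 repeats per query size.

```python
import random
from math import comb

def tail(k, n, K, N):
    # Cumulative hypergeometric probability of k or more matches.
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

def best_p_value(query, annotation_sets, N):
    # Smallest p-value over all annotation sets that intersect the query.
    q = set(query)
    best = 1.0
    for T in annotation_sets:
        k = len(q & T)
        if k:
            best = min(best, tail(k, len(q), len(T), N))
    return best

def simulated_threshold(genes, annotation_sets, k, repeats=1000, rank=50, seed=0):
    """Return the rank-th smallest best p-value over random queries of size k.

    With repeats=1000 and rank=50 this is the top-5% cut-off from the slide.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    N = len(genes)
    P = sorted(best_p_value(rng.sample(genes, k), annotation_sets, N)
               for _ in range(repeats))
    return P[rank - 1]

# Hypothetical toy genome of 200 genes with three annotation sets
genes = [f"g{i}" for i in range(200)]
annotation_sets = [set(genes[:20]), set(genes[10:60]), set(genes[50:55])]
threshold = simulated_threshold(genes, annotation_sets, k=10, repeats=200, rank=10)
```

Repeating the call for each query size k gives one threshold per query length, which can then be compared with Bonferroni or Benjamini-Hochberg cut-offs.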

Analytical thresholds • Analytical approach to simulated thresholds • Fix gene query size k • Observe all sizes and frequencies of GO annotation sets T • Presume events with different T are independent • Observe possible p-values p with a query of k elements • Always correct p by constant c=0.97 (set dependencies!) • Find the threshold p' that gives p ≈ 0.95 • Repeat for query lengths k = [1,1000]

Significance thresholds

Ordered lists of genes • Gene groups may be ordered • Interesting gene and few most similar genes • Top “most active” genes • Increasing distance from cluster centre • Top of the list, but how many? • Compare list with GO term • Which portion gives best p-value? • Peak significance of ordered query

GOSt algorithms • Unordered query • Intersections with all annotation sets T • Exhaustive algorithm for ordered queries: • intersections with all Qi and annotation sets T • Approximate algorithm for ordered queries: • for every annotation set T, view only list portions that give local p-value extremes • local best p : list ends with matching gene • local worst p : list ends just before matching gene
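A rough sketch of the ordered-query idea for a single annotation set T is given below. It is not the GOSt implementation: the function names and toy data are hypothetical, and the second function inspects only the "local best p" positions (prefixes ending with a matching gene) rather than reproducing the approximate algorithm in full.

```python
from math import comb

def tail(k, n, K, N):
    # Cumulative hypergeometric probability of k or more matches.
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

def peak_exhaustive(ordered, T, N):
    """Evaluate every prefix Q_i of the ordered list; return (best p, best i)."""
    best_p, best_i, k = 1.0, 0, 0
    for i, gene in enumerate(ordered, start=1):
        if gene in T:
            k += 1
        p = tail(k, i, len(T), N)
        if p < best_p:
            best_p, best_i = p, i
    return best_p, best_i

def peak_match_ends(ordered, T, N):
    """Evaluate only prefixes that end with a gene annotated to T."""
    best_p, best_i, k = 1.0, 0, 0
    for i, gene in enumerate(ordered, start=1):
        if gene in T:
            k += 1
            p = tail(k, i, len(T), N)
            if p < best_p:
                best_p, best_i = p, i
    return best_p, best_i

# Hypothetical toy data: 100 genes in total, 4 annotated to T
T = {"g3", "g5", "g7", "g40"}
ordered = ["g3", "g9", "g5", "g7", "g11", "g12", "g40", "g13"]
exhaustive = peak_exhaustive(ordered, T, 100)
shortcut = peak_match_ends(ordered, T, 100)
```

On this toy input both functions report the same peak, because appending a non-matching gene to a prefix never lowers the one-tailed p-value; the second function simply computes fewer p-values.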

Example: Ordered list analysis • List of genes, and matches for “Biosynthesis of steroids” • Peak significance at ordered list of 28 genes [Plot: p-value against query length]

Ordered list query [Screenshot of a query result; columns: evidence codes, genes, GO categories, p-value]

Algorithm speed comparison [Chart: 24 sec vs. 2.8 sec]

GOSt features • Command line interface (C/C++ and Perl) • Graphical web interface: http://bioinf.ebc.ee/GOST • SWOG (Graphics language, Jaanus Hansen 2005) • Data for multiple organisms • yeast, chicken, cow, mouse, rat, human... • Wrappers for parallel applications (GRID, MPI) • Pipelines for gene expression data analysis

Extending GO ( i ) • Pathway: a network of interacting genes and proteins • metabolism pathways, disease pathways, .. • Include pathway data to GO vocabulary • KEGG Pathway database • pathways as vocabulary terms • related genes as annotations to terms • KEGG terms independent of GO vocabulary [Figure: root KEGG:00000 (KEGG pathways) placed beside the three GO roots GO:0003674 molecular_function, GO:0005575 cellular_component, GO:0008150 biological_process]

KEGG:05010 - Alzheimer's disease
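As a sketch of how pathway terms can sit beside GO terms, the snippet below merges two term-to-genes mappings into one vocabulary. The term identifiers come from the slides; the gene names and the function `combine_vocabularies` are hypothetical, and this is an illustration rather than GOSt's actual data model.

```python
def combine_vocabularies(*vocabularies):
    """Merge several {term id: set of genes} mappings into one vocabulary.

    Prefixed identifiers (GO:..., KEGG:...) keep the sources apart, so the
    same significance test can run over every term regardless of origin.
    """
    combined = {}
    for vocab in vocabularies:
        for term_id, genes in vocab.items():
            combined.setdefault(term_id, set()).update(genes)
    return combined

# Hypothetical annotation sets (gene names are placeholders)
go = {"GO:0008150": {"geneA", "geneB", "geneC"}}
kegg = {"KEGG:05010": {"geneB", "geneD"}}
vocabulary = combine_vocabularies(go, kegg)
```

A query is then intersected with every annotation set in `vocabulary`, so KEGG pathways are tested exactly like GO terms.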

Extending GO ( ii ) • Gene expression started by transcription factors (TF) • TFs bind to certain patterns in DNA • Transcription Factor Binding Sites (TFBS) • Often found in regions close to the gene (1 kbp) • Include TFBS data from TRANSFAC • Patterns (putative TFBS) as vocabulary terms • annotations to genes near patterns [Figure: a transcription factor bound to a TF binding site in the DNA sequence near a gene]

TRANSFAC motifs • Motifs added in a hierarchy according to PWM score • 5 levels: near_threshold ... near_MAX_score • Depth in hierarchy corresponds to level • Example terms: TF:M00431_0 TTTSGCGS:0, TF:M00431_1 TTTSGCGS:1, TF:M00431_2 TTTSGCGS:2, TF:M00431_3 TTTSGCGS:3, TF:M00431_4 TTTSGCGS:4 • TF:M00328_2 NCNNTNNTGCRTGANNNN:2, TF:M00328_3 NCNNTNNTGCRTGANNNN:3, TF:M00328_4 NCNNTNNTGCRTGANNNN:4 • Work in progress: Hedi Peterson [Figure: root TF:M00000 (TRANSFAC motifs) placed beside KEGG:00000 (KEGG pathways) and the three GO roots GO:0003674 molecular_function, GO:0005575 cellular_component, GO:0008150 biological_process]

Summary • We investigated means for finding GO annotations to groups of genes, and statistical methods for determining significance of results. • We combined GO vocabulary with various types of biological data, such as KEGG pathways and TRANSFAC regulatory elements. • We proposed analytical thresholds for distinguishing significant results from structured and partly dependent GO annotations, and verified thresholds with simulation experiments. • We proposed a novel concept of analyzing GO annotations for ordered lists of genes, and implemented fast algorithms for the purpose. • The practical result of our work is GOSt, a GO mining tool. Command line interface is suitable for large-scale automatic analysis, while graphical web interface enables highly visualized and interactive analysis.

Sneak preview • GO analysis of hierarchical clustering tree • Cluster genes according to expression similarity and “wrap up” nodes that show no significant annotations in GO • Work in progress • Meelis Kull • Darja Krushevskaja

Acknowledgments • Jaak Vilo • BIIT group • Hedi Peterson • Raivo Kolde • Meelis Kull • Konstantin Tretjakov • Jaanus Hansen • Pavlos Pavlidis • Priit Adler • Asko Tiidumaa • Ilja Livenson • Darja Krushevskaja • FunGenES Consortium
