Bioinformatics: Analysis and Clustering Techniques for Gene Expression Study

Expression analysis 2 Introduction to Bioinformatics morten@binf.ku.dk

Program • Jeppe Vinther • Array quality • Finding significantly expressed genes • Spreadsheet exercise • dChip exercise • Overrepresented gene sets • dChip exercise • Web exercise (DAVID) • Clustering • Distance measure exercise • Clustering in dChip exercise

Array quality • Open the CEL-image for MCF7-AV_b_A • Look for artefacts • Also check the others

Finding significant genes • Often a combination of • P-value from t-statistics • High variability requires more replicates • Fold change • Demonstrate in dChip • You do it! • Take a look at the resulting spreadsheet
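As a rough illustration of combining a t-statistic with fold change, the sketch below computes Welch's t-statistic and the log2 fold change for one gene. The expression values and both cut-offs are hypothetical, and this is not necessarily how dChip does it. Turning the t-statistic into a P-value needs the t-distribution, for example via scipy.stats.ttest_ind, which is left out to keep the sketch dependency-free.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two groups of replicate measurements."""
    var_a = statistics.variance(a) / len(a)  # sample variance / n
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)


def log2_fold_change(a, b):
    """log2 ratio of mean expression in group a over group b."""
    return math.log2(statistics.mean(a) / statistics.mean(b))


# Hypothetical expression values (linear scale), three replicates each.
treated = [820.0, 790.0, 850.0]
control = [400.0, 420.0, 410.0]

t_stat = welch_t(treated, control)
lfc = log2_fold_change(treated, control)

# Hypothetical cut-offs: real thresholds depend on the experiment.
is_candidate = abs(t_stat) > 4.0 and abs(lfc) >= 1.0
print(round(t_stat, 2), round(lfc, 2), is_candidate)
```

Noisier replicates inflate the variance terms and shrink the t-statistic, which is why high variability requires more replicates.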

Putting genes into classes • What can we do with our list of genes? • [Figure: our genes compared against classes drawn from all genes: angiogenesis, on Y chromosome, tyrosine kinases, targeted to mitochondria, skeletal development, glycolysis, DNA replication, upregulated in brainstem]

Gene ontology • Effort to categorize gene products using a controlled vocabulary • Three organising principles (example: cytochrome c) • Molecular function (oxidoreductase activity) • Biological process (oxidative phosphorylation, induction of cell death) • Cellular component (mitochondrial matrix, mitochondrial inner membrane)

Organisation of GO • Example: Interleukin-12 • Directed acyclic graph • Note the GOIDs • Tools for finding overrepresented GO terms in a set of genes • dChip • EASE • DAVID • …many more

Other classification schemes • GO • Pathways – the KEGG database • Protein domains (from PFAM) • Chromosomal location

Overrepresentation exercises • ”classify genes” in dChip • Find overrepresented annotation in upregulated genes. Instructions in the handouts • DAVID • Do the same here
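One common way to score overrepresentation is a hypergeometric (one-sided Fisher) test. The sketch below is an assumption about the general idea, not the exact statistic used by dChip, EASE or DAVID, and all gene counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest."""
    total = comb(N, n)
    upper = min(K, n)  # cannot observe more annotated genes than this
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical numbers: 10000 genes on the array, 100 annotated with the
# term, 200 upregulated genes, 8 of which carry the term.
p = hypergeom_upper_tail(N=10000, K=100, n=200, k=8)
print(p)
```

With these numbers only about 2 annotated genes are expected by chance, so observing 8 gives a small P-value. When many terms are tested, the P-values need correction for multiple testing.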

Clustering

Why cluster? • To find genes that behave similarly • Perhaps they have a common regulator? • To find samples that are similar • E.g. to discover subtypes of disease samples

Have you seen these? • Ring a bell? • 1 row = 1 expression vector • Similar rows are grouped or clustered • Experiments can also be clustered

Agglomerative clustering • [Figure: tree built bottom-up from items a, b, c, d, e over steps 0 to 4] • Step 1: a and b merge into a,b • Step 2: d and e merge into d,e • Step 3: c joins d,e to form c,d,e • Step 4: a,b and c,d,e merge into a,b,c,d,e • … and the tree is constructed
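The merging procedure can be sketched in a few lines. This is a minimal illustration assuming single linkage (distance between clusters = distance between their closest members) and hypothetical one-dimensional positions chosen so that the merge order matches the slide; dChip's own linkage rule may differ.

```python
from itertools import combinations


def single_linkage(c1, c2, points):
    """Distance between two clusters = distance of their closest members."""
    return min(abs(points[i] - points[j]) for i in c1 for j in c2)


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [(name,) for name in sorted(points)]  # start: one item each
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        clusters.remove(c1)
        clusters.remove(c2)
        merged = tuple(sorted(c1 + c2))
        clusters.append(merged)
        merges.append(merged)
    return merges


# Hypothetical 1-D positions for the five items on the slide.
points = {"a": 0.0, "b": 1.0, "c": 4.0, "d": 6.0, "e": 7.5}
merges = agglomerate(points)
for step, cluster in enumerate(merges, start=1):
    print(step, ",".join(cluster))
```

Recording the distance at which each merge happens would give the branch heights of the tree.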

Expression vectors • Each gene can be represented as a point in space • Dimension of the space = the number of different experiments

Requirement for hierarchical clustering • A distance matrix! • Ring a bell from phylogeny?

Distance measures • Euclidean metrics • Non-Euclidean metrics • Semimetric distances

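Euclidean metric • Right triangle with legs a and b and hypotenuse c between the points (x1, y1) and (x2, y2): $a^2 + b^2 = c^2$, so the distance is $c = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ • Generalised to n dimensions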

Requirements for a metric • Non-negative • Symmetric • Distance to self is zero • Triangle inequality
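Written out for a distance function d and arbitrary points x, y, z (the symbols are notation introduced here, not taken from the slides), the four requirements read:

```latex
\begin{align}
d(x, y) &\ge 0                    && \text{(non-negative)} \\
d(x, y) &= d(y, x)                && \text{(symmetric)} \\
d(x, x) &= 0                      && \text{(distance to self is zero)} \\
d(x, z) &\le d(x, y) + d(y, z)    && \text{(triangle inequality)}
\end{align}
```

A semimetric satisfies the first three conditions but is not required to satisfy the triangle inequality.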

Non-Euclidean metrics • Manhattan metric
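For comparison, here is a small sketch of the Euclidean and Manhattan distances between two expression vectors with one value per experiment. The vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical expression vectors for two genes across four experiments.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 1.0, 4.0]

print(euclidean(gene_x, gene_y), manhattan(gene_x, gene_y))
```

The Manhattan distance is never smaller than the Euclidean distance, and it is less dominated by a single large difference because differences are not squared.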

Semimetric distance: correlation • Similarity is inversely related to distance • Distance = 1 - similarity measure
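A minimal sketch of a correlation-based distance, assuming the Pearson correlation coefficient as the similarity measure (other correlation measures could be used instead). The vectors are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation; undefined (division by zero) for constant vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def correlation_distance(u, v):
    """1 - similarity: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson(u, v)


# Hypothetical profiles: same shape at different levels, and a mirrored shape.
low = [1.0, 2.0, 3.0, 4.0]
high = [10.0, 20.0, 30.0, 40.0]
opposite = [4.0, 3.0, 2.0, 1.0]

print(correlation_distance(low, high), correlation_distance(low, opposite))
```

Unlike the Euclidean distance, this measure treats genes with the same profile shape as close even when their absolute expression levels differ greatly.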

Clustering of high dimensional data • Unsupervised learning of patterns in the data • Hierarchical clustering • K-means clustering • Self-organising maps

Mini exercise • Calculate different distance measures in a spreadsheet

Mini exercise • Try hierarchical clustering in dChip • Do points 11 and 12 in the handouts • Try using different distance measures • Try exporting branches of the tree (Clustering->export branch) and do functional classification of those • Walkthrough afterwards

Other ways of grouping data points • Hierarchical clustering => builds a tree • K-means => partitions points into k groups • Self-organising maps (a.k.a. Kohonen maps) • demo
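To make the K-means idea concrete, here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm). The data points and starting centroids are hypothetical; real implementations normally pick random starting centroids and run several restarts.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            groups[nearest].append(p)
        # An empty group keeps its old centroid.
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, groups


# Hypothetical 2-D points forming two obvious groups, with k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
start = [(0.0, 0.0), (10.0, 10.0)]

centroids, groups = kmeans(points, start)
print(centroids)
print(groups)
```

Unlike hierarchical clustering, k must be chosen up front and the result can depend on the starting centroids.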

In the clinic

Clinical goals • Improve the diagnostic categorization • Identify useful predictive markers for outcome and therapeutic response • Identify points for intervention: • critical pathways • drug targets • Supervised learning

Supervised learning

Machine learning • Training set: positive examples (ovarian cancers) and negative examples (not ovarian cancers) • The trained ”machine” classifies an unknown sample: ”I think this is an ovarian cancer! (confidence is xxx)” • Methods: neural networks, linear discriminant analysis, K-nearest neighbours, support vector machines, …

A typical (easy) sample set I

A typical (easy) sample set II Easy to distinguish by one measurement per individual.

A harder sample set I We can tell apples from oranges. But can we distinguish different kinds of apples?

kNN, K=4 • Of the 4 nearest neighbours of the unknown sample ”?”: • 3 are green • 1 is red • So we conclude that ”?” is green
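The voting rule can be sketched as follows. The coordinates are hypothetical and were chosen so that, as on the slide, three of the four nearest neighbours are green and one is red.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training samples closest to the query."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    # On a tied vote, most_common returns the label encountered first.
    return votes.most_common(1)[0][0]


# Hypothetical labelled samples: (measurements, class).
train = [
    ((1.0, 1.0), "green"),
    ((1.2, 0.8), "green"),
    ((0.9, 1.3), "green"),
    ((1.6, 1.4), "red"),
    ((4.0, 4.0), "red"),
    ((4.2, 3.8), "red"),
]
unknown = (1.1, 1.1)

prediction = knn_predict(train, unknown, k=4)
print(prediction)
```

An odd K avoids tied votes when there are only two classes.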

Performance of machine learning • How correctly does it predict known examples? • Beware of overtraining • Assess performance on data not used for training (test set, cross-validation) • [Figure: error on training set compared with error on test set]
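The gap between training error and held-out error can be shown with a small sketch. It assumes a 1-nearest-neighbour classifier (chosen only for brevity) and leave-one-out cross-validation on hypothetical data containing one atypical negative sample.

```python
import math


def nearest_neighbour_label(train, query):
    """Label of the single closest training sample (1-NN)."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def training_error(data):
    """Error when each sample is predicted with itself in the training set."""
    wrong = sum(nearest_neighbour_label(data, x) != y for x, y in data)
    return wrong / len(data)


def leave_one_out_error(data):
    """Error when each sample is predicted from all the other samples."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out sample i
        wrong += nearest_neighbour_label(rest, x) != y
    return wrong / len(data)


# Hypothetical samples; the "neg" at (1.1, 0.9) sits among the positives.
data = [
    ((1.0, 1.0), "pos"), ((1.5, 1.2), "pos"), ((0.8, 1.6), "pos"),
    ((1.1, 0.9), "neg"),
    ((4.0, 4.0), "neg"), ((4.3, 3.9), "neg"), ((3.9, 4.4), "neg"),
]

train_err = training_error(data)
loo_err = leave_one_out_error(data)
print(train_err, loo_err)
```

The training error is zero because every sample is its own nearest neighbour, while the held-out error is clearly higher. That difference is what overtraining looks like.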

Microarray summary • Very powerful technology – measure all genes • Noise issues. Lots of data → more possibilities for wrong data • Results are not the ”truth” but hypotheses for testing • Statistical significance != biological significance • A change in analysis will change the results • Important to try different things and use judgement • Test your hypothesis using different approaches – the more different the better • You have only scratched the surface – so when faced with problems, seek assistance

Other uses of microarrays • DNA targets • Copy number analysis • SNP detection • Tiling arrays • Whole genome for transcript mapping • Promoter regions for chromatin immunoprecipitation
