e-BioScience: A new approach to deal with the complexity of genomics research & e-BioLab: A Bioinformatics Problem Solving Environment
Han Rauwerda, Wim de Leeuw, Timo Breit
MicroArray Department (MAD) & Integrative Bioinformatics Unit (IBU), Faculty of Science (FNWI), University of Amsterdam (UvA)

The omics revolution recruits ever more soldiers GENOMICS MAD But, omics, we have a problem

What’s up with omics (mechanistic studies)? • Theory: huge promises, big expectations • Reality: large error margins, expensive experimentation, loads of data, major bottlenecks, small results • There is no such thing as a quick fix in science!

Outline • e-Science & e-BioScience in omics research • Some thoughts on visualization • the e-BioLab in practice • Tuberculosis experiment: data quality control

How did it all happen? (Source: HUGO; legend: molecular bio-techniques, discoveries) • 1859: “Origin of Species”, Charles Darwin • 1865: Mendel's model of inheritance • 1869: DNA isolation • 1953: DNA double helix, Watson & Crick • 1955: discovery of the number of chromosomes in man • 1959: 1st chromosome aberration shown: Down syndrome • 1966: genetic code “cracked” • 1972: recombinant DNA technology • 1977: sequencing technology; introns discovered • 1981: transgenic mouse • 1983: PCR technology; 1st disease gene mapped: Huntington’s disease • 1995: total microbial sequence unraveled: H. influenzae; micro-array technology • 1996: cloning of sheep Dolly; total yeast DNA sequence unraveled • 2003: total human DNA sequence unraveled • high-throughput MS technique • Next: biomarker DNA chips, personalized medicine, genetic passport? cloning of humans?

Today: (Gen)-omics technologies from gene to function • Gene (DNA; < 2% of total DNA; ~25,000 genes, ~100,000 alt. spl., ~1,000,000 var.): whole-genome sequence projects • Gene expression by RNA synthesis, mRNA (AAAAAAAAA): genome-wide micro-array analysis • mRNA translation by protein synthesis, Protein (NH2 ... COOH): high-throughput MS analysis, “high-throughput” protein analysis • Protein function (function-1, function-2, ... function-n): prediction by bioinformatics, proof by laboratory research • [Figure labels: cell, nucleus]

How did life change for a biologist? • RNA analysis by Northern blot: 1-15 genes • RNA analysis by micro-array: 1,000-40,000 genes • [Figure: analyzed genes plotted against samples A-T of cellular experiments]

What is Where in omics? • Biology: cell (DNA, RNA, protein, metabolite) • Biotechnology: experiments (genomics, transcriptomics, proteomics, metabolomics) • Bioinformatics: data storage, data handling, data preprocessing, data analysis, data integration, data interpretation • Biologist: results, knowledge • Informatics: ICT infrastructure • Integrative biology or systems biology

e-BioScience: How does a biologist survive? [Figure: the gene-to-function diagram (Gene/DNA, gene expression by RNA synthesis, mRNA, mRNA translation by protein synthesis, Protein, function-1, function-2, ... function-n) repeated 20 times; labels: cell, nucleus]

e-BioScience in omics • Biology: cell (DNA, RNA, protein, metabolite) • Biotechnology: experiments (genomics, transcriptomics, proteomics, metabolomics) • Bioinformatics: data storage, data handling, data preprocessing, data analysis, data integration, data interpretation • Biologist: results, knowledge • Informatics: ICT infrastructure • e-BioScience • Integrative biology or systems biology

Who’s doing What and Why are they doing That? • Life Sciences (research problems & data): identify challenging biological problems and bring in biological knowledge • e-BioScience (e-BioLab & PSEs, support): research and development of e-science methodologies, tools, and infrastructure; focus on information management, data analyses and modeling; focus on re-usability and genericity • Enabling Sciences (methods & ICT infrastructure): share their knowledge and generic methodologies for information management, data analyses and modeling

The e-BioScience approach • e-BioScience is an APPROACH rather than a science in itself. • Definition: The e-BioScience approach offers a multidisciplinary strategy for life sciences research questions, with an emphasis on design for (omics) experimentation, data preprocessing-integration-interpretation and knowledge representation. • So yes: An e-BioScience approach could (and should) eventually address many, if not all, of the previously mentioned issues. • As example: The MAD view on an e-BioScience approach

What is the role of e-Science? • Mind you: e-Science is an (INFORMATICS) SCIENCE in itself. • Definition: The term e-Science (enhanced-Science) is used to describe the research area that deals with high performance computation that involves immense data sets and is carried out in highly distributed network environments, i.e. grid computing. It also includes technologies that enable distributed collaboration, such as the Access Grid. • So: An e-BioScience approach depends heavily on the methods, tools and infrastructure that are developed by and in collaboration with e-Science. • As example: Virtual Lab for e-Science project (www.VL-e.nl)

The complexity of omics research • [Diagram labels: hypothesis generation, experiment design, wet-lab experiment, in-silico experiment, enhancing knowledge model, publication process, hypotheses, results] • Life sciences domain: e.g. domain knowledge, domain information, domain data • Hypothesis: Infection by Mycobacterium tuberculosis modulates immune response by producing suppressor carbohydrates (SC) • Experiment: Microarray experiment with blood treated with and without SC

A transcriptomics example of an e-BioScience flow • Stages: Experiment design, Data generation, Feature extraction, Quality control slides, Data preprocessing, Data validation, Data analysis, Publication • Array design: platform choice, probe design, layout design • Experiment design: biological, technical • Spotting data: array QC, layout, probe re-annotation, lab info • Hybridization data: sample QC, labeling QC, raw image data, lab info • extracted data, lab info, hybridization info • transformed data, normalized data • Model choice: method, contrast, p-value, fold change, gene lists • Methods: machine learning (e.g. SOMs, Bayesian networks), statistical (e.g. GSEA, Global test), literature mining, data mining, mapping to knowledge models • figures, upload to AE/GEO

The concept of a Bioinformatics Problem Solving Environment • [Diagram labels: hypothesis generation, experiment design, wet-lab experiment, in-silico experiment, enhancing knowledge model, publication process, hypotheses, results] • Life sciences domain: e.g. domain knowledge, domain information, domain data • e-BioScience: e.g. semantic modeling, visualization • Generic virtual laboratory: e.g. analysis methods, information management, semantic modeling, adaptive inf. disclosure • ICT infrastructure, Grid layer: e.g. security (AAA) • Together these form the Bioinformatics problem solving environment • Rauwerda et al: The Promise of a virtual lab. Drug Discov Today. 2006 Mar;11(5-6):228-36.

Connecting the biologists to the BI-PSE • Bioinformatics Problem Solving Environment: Methods, Tools, Workflows, Grid • [Diagram labels: Biologists (?), e-BioScientist (!), e-BioOperator, Enabling Scientists; screen: basic model of problem area, small integration experiments + integration methods, readily accessible data + models, data mining, vague results, easy visualization]

Basic set-up of the e-BioLab

Technical scheme of the e-BioLab

Why should we want to visualize quantitative data? 2 definitions: • Tom DeFanti (1987): Visualization is a method of computing. It transforms the symbolic into the geometric, enabling researchers to observe their simulations and computations. Visualization offers a method for seeing the unseen. It enriches the process of scientific discovery and fosters profound and unexpected insights. • J. Foley (1994): A useful definition of visualization might be the binding (or mapping) of data to representations that can be perceived. Visualize quantitative data to: • Describe, explore and summarize (multivariate) data • Discover trends, assess the role of (co-)variates, reason on data • Communicate the information that is in the data

John Snow’s cholera map of September 1854 Dots are deaths by cholera, crosses are water pumps

A 10th century display of planetary orbits Illustration of the inclinations of the planetary orbits as a function of time. The next example of plotted time series appears some 800 years later.

Graphical train schedule Paris-Lyon J. Marey, 1880 Superimposed in red: the TGV

… representations that can be perceived … • Lie factor = (size of effect shown in graphic) / (size of effect in data)

Example: fuel economy • Lie factor comparing 1985 vs 1978 • real effect: (27.5 - 18) / 18 ≈ 53% • effect in display: (5.3 - 0.6) / 0.6 ≈ 783% • LF = 783% / 53% ≈ 14.8 • Time scale is not linear • Time scale and fuel scale confounded
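The lie factor arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative and not part of the presentation, and the numbers are the ones quoted on the slide.

```python
def relative_effect(first, second):
    """Relative change from first to second, as a fraction of first."""
    return (second - first) / first


def lie_factor(shown_first, shown_second, data_first, data_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    shown = relative_effect(shown_first, shown_second)
    actual = relative_effect(data_first, data_second)
    return shown / actual


# Fuel economy example: the data go from 18 to 27.5,
# the drawn sizes in the graphic go from 0.6 to 5.3.
real = relative_effect(18, 27.5)
display = relative_effect(0.6, 5.3)
factor = lie_factor(0.6, 5.3, 18, 27.5)
print(f"real effect: {real:.0%}, effect in display: {display:.0%}, lie factor: {factor:.1f}")
```

A lie factor of 1 means the graphic shows the effect at its true size; here the display exaggerates the change by a factor of roughly 15.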

Example: the incredible shrinking doctor • Use of surfaces to represent numbers • The exaggeration is much larger than the actual increase • Perception of areas varies from person to person (for circles: perceived area ~ (actual area)^x with x = 0.8 ± 0.3) • R help pages:
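Why surfaces distort one-dimensional numbers can be illustrated with the power law quoted on this slide. The sketch below assumes (this is an assumption, not something stated on the slide) that a symbol is scaled in both width and height in proportion to the value, so its area grows with the square of the value; the function names are illustrative.

```python
def drawn_area_ratio(value_ratio):
    """Area ratio of a symbol whose width and height both scale with the value."""
    return value_ratio ** 2


def perceived_area_ratio(area_ratio, exponent=0.8):
    """Perceived area ratio under the power law: perceived ~ (actual area) ** exponent."""
    return area_ratio ** exponent


# A value that doubles is drawn with four times the area.
area = drawn_area_ratio(2.0)

# The slide quotes x = 0.8 +/- 0.3, so different viewers perceive different sizes.
for x in (0.5, 0.8, 1.1):
    print(f"exponent {x}: perceived ratio {perceived_area_ratio(area, x):.2f}")
```

Only a viewer with an exponent of 0.5 would read the doubled value correctly from the area; everyone else sees something else, which is one reason to avoid areas for one-dimensional data.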

A horror cabinet of visualizations NASA O-ring failures: Would you trust this extrapolation? Age structure of college enrollment - Redundancy - Confusing use of perspective

Example: Napoleon’s Russian Campaign 1812-1813

An example: Florence Nightingale • Crimean War 1853, half a million dead • Nightingale invented the polar diagram • Here the number of deaths is represented by an area

Conclusions on visualization • Aim for a lie factor of 1 ( = no lying) • Don’t use surfaces or volumes to represent one dimensional data • Don’t meddle with scales • Decorate your visualizations very sparsely • Do not extend your visualization to areas where there is no data.

break!

Microarrays • A DNA microarray is a multiplex technology used in molecular biology and in medicine. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles of a specific DNA sequence. This can be a short section of a gene or other DNA element that is used as a probe to hybridize a cDNA or cRNA sample (called target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by fluorescence-based detection of fluorophore-labeled targets to determine relative abundance of nucleic acid sequences in the target.

The tuberculosis experiment • Hypothesis • Infection by Mycobacterium tuberculosis modulates immune response by producing suppressor carbohydrates (SC) • Experiment: • Microarray experiment with blood treated with and without SC • Experiment Design: • paired dye swap with 6 individuals (12 arrays)
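The paired dye-swap design can be written out as a small table of arrays. This is a hedged sketch: the slide only states "paired dye swap with 6 individuals (12 arrays)", so the dictionary layout, the labels "with SC" / "without SC", and which orientation comes first are assumptions made for illustration.

```python
def paired_dye_swap_design(n_individuals=6):
    """Two arrays per individual: one dye orientation and its swap."""
    arrays = []
    for individual in range(1, n_individuals + 1):
        # First array: untreated sample in Cy3, SC-treated sample in Cy5.
        arrays.append({"individual": individual, "Cy3": "without SC", "Cy5": "with SC"})
        # Second array: the same pair of samples with the dyes swapped.
        arrays.append({"individual": individual, "Cy3": "with SC", "Cy5": "without SC"})
    return arrays


design = paired_dye_swap_design()
for array_number, array in enumerate(design, start=1):
    print(array_number, array)
```

Because every individual contributes both dye orientations, each treatment is measured equally often in Cy3 and in Cy5, so a dye-specific bias does not line up with the treatment effect.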

Quality control of micro array data • Assessment of the technical quality of a micro-array experiment • Can we identify local effects on the slide? ARRAYVIEW • Are the measured Cy3 and Cy5 intensities comparable? BARPLOT/RIPLOT • Are median intensity values and distributions of data comparable between slides? BOXPLOT • How similar are replicates? BOXPLOT/PCA • Can we see differences between experimental groups? PCA • How many genes on the array are on average expressed? Is this number comparable between replicates? EXPRESSED/RIPLOT • Must we do anything to correct identified problems? • We use the HybQC tool to answer these questions.

A transcriptomics example of an e-BioScience flow (revisited) • Stages: Experiment design, Data generation, Feature extraction, Quality control slides, Data preprocessing, Data validation, Data analysis, Publication

MA-PSE: QC & validation using workflows on Grid • Workflow options: Taverna / Moteur, Kepler, VLAM • Workflow steps: QC, validation, normalization

Ratio intensity plots (RI plots or MA plots) • Ratio intensity plots are just scatterplots rotated 45 degrees • X-axis: intensity, log(Cy3) + log(Cy5); y-axis: ratio, log(Cy3) - log(Cy5) • Easier to assess: with base-2 logarithms, a 4-fold change lies above the line y = 2 or below the line y = -2
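The coordinates of a ratio intensity plot follow directly from the axis definitions on this slide. The sketch below is illustrative only: it assumes base-2 logarithms (so that y = ±2 corresponds to a 4-fold change), the spot intensities are made-up values, and the function names are not from the HybQC tool.

```python
import math


def ri_point(cy3, cy5):
    """Return (intensity, ratio) for one spot: log sum and log difference."""
    log_cy3 = math.log2(cy3)
    log_cy5 = math.log2(cy5)
    return log_cy3 + log_cy5, log_cy3 - log_cy5


def is_four_fold(ratio, threshold=2.0):
    """True when the log2 ratio is at or beyond the lines y = 2 and y = -2."""
    return abs(ratio) >= threshold


# Hypothetical (Cy3, Cy5) intensities for three spots.
spots = [(4096, 1024), (1024, 1024), (512, 4096)]
points = [ri_point(cy3, cy5) for cy3, cy5 in spots]
flags = [is_four_fold(ratio) for _, ratio in points]

for (cy3, cy5), (intensity, ratio), flag in zip(spots, points, flags):
    print(f"Cy3={cy3} Cy5={cy5} intensity={intensity:.1f} ratio={ratio:.1f} 4-fold={flag}")
```

Taking the sum and the difference of the two log intensities is what turns the ordinary Cy3 versus Cy5 scatterplot by 45 degrees (up to a scale factor), so that unchanged genes lie on the horizontal line y = 0 instead of on the diagonal.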

Ratio intensity plots • We look at a few thousand points. Where are most points? • At higher intensities, are more genes upregulated than downregulated? • Observation: an intensity- and dye-dependent bias • With increasing intensity, Cy5 seems to be more responsive than Cy3

Quality control of raw extracted data (3) • Are median intensity values and distributions of data comparable between slides? • Use box and whisker plots to visualize this: • M: median; Q1, Q3: first and third quartiles (the box) • whiskers: the smallest and largest values within r × (Q3 - Q1) of the box (r = 1.5) • dots: outliers
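The quantities drawn in a box and whisker plot can be computed as follows. This is a minimal sketch using Python's standard library; the intensity values are invented, and quartile conventions differ between tools, so the exact Q1 and Q3 may deviate slightly from what a given plotting package reports.

```python
import statistics


def box_whisker_summary(values, r=1.5):
    """Median, quartiles, whisker ends and outliers for one slide."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - r * iqr
    high_fence = q3 + r * iqr
    # Whiskers end at the most extreme data values still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical intensities from one slide, with one extreme value.
summary = box_whisker_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

Computing one such summary per slide and comparing medians and box heights side by side is the numerical counterpart of the visual check the slide asks for.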

Box & Whiskers plots in a normal distribution

Quality control of raw extracted data (4) How similar are replicates? Can we see differences between experimental groups? How different are replicates from each other? How different are time points from each other? • Use Principal Component Analysis to get an estimate. • PCA also can answer questions like: • How much variability exists between slides? • Are there any unexpected groups in your set of arrays? • How complex is your data, how much variance is explained by the first few principal components?

PCA in a nutshell (1) • PCA projects a high dimensional space onto a lower dimensional space. • 1st axis captures most variance, 2nd axis (orthogonal to the 1st axis) captures next most variance, etc. • Practically hard (and not necessary) to attach meaning to the axes. • Step 1: set up a variance-covariance matrix. Example: reduce an n genes × 5 time points data matrix to a square 5 × 5 matrix (one row and one column per time point). • [Figure labels: 5 timepoints, n genes]
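Step 1 can be sketched in plain Python. The example assumes rows are genes and columns are time points (arrays), as in the slide's example; the expression values and the function name are made up for illustration.

```python
def covariance_matrix(rows):
    """Variance-covariance matrix between the columns of a genes x time points table."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    # Center every column on its mean.
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in rows]
    # Sample covariance (divide by n - 1) for every pair of columns.
    return [
        [sum(c[a] * c[b] for c in centered) / (n_rows - 1) for b in range(n_cols)]
        for a in range(n_cols)
    ]


# Hypothetical expression values: 4 genes (rows) x 5 time points (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 2.5, 2.0, 3.5, 4.0],
    [0.5, 1.0, 4.0, 4.5, 6.0],
    [3.0, 3.0, 3.5, 3.0, 2.5],
]
cov = covariance_matrix(expression)
print(len(cov), "x", len(cov[0]))
```

However many genes there are, the result stays a 5 × 5 symmetric matrix, which is what makes the eigenvalue calculation in the next step cheap.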

PCA in a nutshell (2) • Step 2: calculate the eigenvalues and eigenvectors of the covariance matrix. The eigenvector with the highest eigenvalue is the 1st principal component. • An example of a result (arrays, representing time points, visualized in two dimensions) [Figure: arrays 1-5 plotted on axes PCA 1 and PCA 2] • Proportion of variance: PCA1: 0.623, PCA2: 0.314, PCA3: 0.033, PCA4: 0.023, PCA5: 0.007 • Cumulative variance: PCA1: 0.623, PCA2: 0.937, PCA3: 0.970, PCA4: 0.993, PCA5: 1 • How would you interpret this PCA?
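Step 2 and the variance table can be illustrated with a small sketch. Power iteration is used here only because it fits in a few lines of plain Python; it is a swapped-in technique for finding the largest eigenvalue and is not claimed to be what the presenters used. The 2 × 2 matrix is a made-up example, while the proportions of variance are the ones shown on the slide.

```python
import math
from itertools import accumulate


def dominant_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    mv = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * m for v, m in zip(vec, mv))
    return eigenvalue, vec


# Hypothetical 2 x 2 covariance matrix.
eigenvalue, first_component = dominant_eigenpair([[3.0, 1.0], [1.0, 2.0]])
print(f"largest eigenvalue: {eigenvalue:.3f}")

# Proportions of variance from the slide, turned into cumulative variance.
proportions = [0.623, 0.314, 0.033, 0.023, 0.007]
cumulative = [round(c, 3) for c in accumulate(proportions)]
print(cumulative)
```

Each proportion is an eigenvalue divided by the sum of all eigenvalues. In the slide's example the first two components together account for about 94% of the variance, so a two-dimensional plot loses little information.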

Future work e-BioLab • 2009: finish generalization of the entire micro-array workflow • griddify the workflow • griddify data access using V-browser (~ finished) • enhance interaction with the tiled display using a Wii controller • enhance interaction with the display using transparency • enhance interactivity on the tiled display • making selections • propagate a selection in one visualization to another

e-BioLab VL-e NBIC-BioAssist Han Rauwerda Wim de Leeuw Timo Breit SARA Bouwhuis & de Kler UvA Hertzberger & de Laat NIKHEF Linde & van Rijn

End of presentation

Noise and heat management e-BioLab • display flat against the wall • display computers in separate room

Initial observations • The e-BioLab is cheap to set up (equipment = ~75,000 euro). • The technique to use the lab equipment is still very much in development. • The Bioinformatics Problem Solving Environment is also still in development. • So far the focus has been on the Experimental Support Environment. • Mainly for experimental data analysis using in silico experimentation. • It will take an effort to set up and connect an Interactive and Creative environment. • The tiled display is extremely useful. • It provides good overviews while remaining sharp close by. • It takes an effort to set up a meeting in the lab. • The lab has to be close to the biologists. • It should be accessible on an ad hoc basis. • Biologists and bioinformaticians love it! • Several initiatives in the Netherlands to copy the e-BioLab prototype • Serious interest from the Dutch Systems Biology (SysBioNL)

Potential future of e-BioLabs • [Map labels: e-BioLabs; Nijmegen, Wageningen, Amsterdam, Rotterdam; BigGrid computer clusters]

How could I survive as an “omics” biologist? • Biological research domain: biological problem, biological phenomena, biological knowledge, Wet-Lab, biological solutions • e-BioScience core domain: e-BioLab, model, problem-driven hypothesis, data-driven hypothesis, experiment design, omics data, data analysis & integrative in-silico experiments, visualization, interpretation • Enabling science domain: VL-e, analysis methods, Dry-Lab, ICT infrastructure
