e-BioLab A Bioinformatics Problem Solving Environment

e-BioLabA Bioinformatics Problem Solving Environment Han Rauwerda Wim de Leeuw Timo Breit MicroArray Department (MAD) & Integrative Bioinformatics Unit (IBU) Faculty of Science (FNWI) University of Amsterdam (UvA)

Omics, we have a problem GENOMICS MAD

DNA ~25.000 genes ~100.000 alt spl ~1000.000 var. Today: (Gen)–omics technologies from gene to function cell nucleus Gene Whole-genome sequence projects < 2% total DNA Gene expression by RNA synthesis Genome-wide micro-array analysis mRNA AAAAAAAAA mRNA translation by protein synthesis High-throughput MS analysis NH2 “High-throughput” protein-analysis Protein COOH Protein function: -prediction by bioinformatics -proof by laboratory research function-1 function-n function-2

RNA analysis by micro-array: 1.000-40.000 genes A B C D E F G H I J K L M N O P Q R S T How did life change for a biologist? RNA analysis by Northern blot: 1-15 genes A B C D E F G H I J K L M N O P Q R S T Analyzed genes Samples of cellular experiments

Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene Gene DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA e-BioScience Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis Gene expression by RNA synthesis mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA mRNA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis mRNA translation by protein synthesis NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 NH2 e-BioScience COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH COOH function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-1 function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-n function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 function-2 How does a biologist survive? cell nucleus Protein

Basic model of problem area screen Small integration experiments + integration methods Readily accessible data + models data mining Vague results Easy visualization ! ? e-BioOperator Enabling Scientists Biologists ? e-BioScientist ! E-BioLab: meeting room for multidisciplinary teams Bioinformatics Problem Solving Environment Methods Tools Workflows Grid

Basic set-up of the e-BioLab

Technical scheme of the e-BioLab

Why should we want to visualize quantitative data? 2 definitions: • Tom DeFanti (1987): Visualization is a method of computing. It transforms the symbolic into the geometric, enabling researchers to observe their simulations and computations. Visualization offers a method for seeing the unseen. It enriches the process of scientific discovery and fosters profound and unexpected insights. • J. Foley (1994): A useful definition of visualization might be the binding (or mapping) of data to representations that can be perceived. Visualize quantitative data to: • Describe, explore and summarize (multi variate data) • Discover trends, assess role of (co-)variates, reasoning on data • Communicate the information that is in the data • Next slides: The Visual Display of Quantitative Information, Edward R. Tufte.

John Snow’s cholera map of September 1854 Dots are deaths by cholera, crosses are water pumps

A 10th century display of planetary orbits Illustration of the inclinations of the planetary orbits as a function of time. The next example of plotted time series appears some 800 years later.

Graphical train schedule Paris-Lyon J. Marey, 1880 Superimposed in red: the TGV

… representations that can be perceived …. size of effect shown in graphic Lie factor = size of effect in data

Example: fuel economy • - Lie factor comparing 1985 vs 1978 • real effect: (27.5 - 18)/18 ~ 53% • effect in display: (5.3 - 0.6) / 0.6 ~ 783% • LF = 14.8 • Time scale is not linear • Time scale and fuel scale confounded

Example: the incredible shrinking doctor Use of surfaces to represent numbers • Exaggeration is a much larger than actual increase • Perception of areas varies from person to person (for circles: perceived area ~ (actual area)x with x = 0.8 ± 0.3 • R help pages:

A horror cabinet of visualizations NASA O-ring failures: Would you trust this extrapolation? Age structure of college enrollment - Redundancy - Confusing use of perspective

Example: Napoleon’s Russian Campain 1812-1813

An example: Florence Nightingale Crimean War 1853, half a million dead Nightingale invented the polar diagram Here the number of deaths are represented by an area

Conclusions on visualization • Aim for a lie factor of 1 ( = no lying) • Don’t use surfaces or volumes to represent one dimensional data • Don’t meddle with scales • Decorate your visualizations very sparsely • Do not extend your visualization to areas where there is no data.

end

Microarrays A DNA microarray is a multiplex technology used in molecular biology and in medicine. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles of a specific DNA sequence. This can be a short section of a gene or other DNA element that are used as probes to hybridize a cDNA or cRNA sample (called target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by fluorescence-based detection of fluorophore-labeled targets to determine relative abundance of nucleic acid sequences in the target.

The tuberculosis experiment • Hypothesis • Infection by Mycobacterium tuberculosis modulates immune response by producing suppressor carbohydrates (SC) • Experiment: • Microarray experiment with blood treated with and without SC • Experiment Design: • paired dye swap with 6 individuals (12 arrays)

Quality control of micro array data • Assessment of the technical quality of a micro-array experiment • Can we identify local effects on the slide? ARRAYVIEW • Are the measured Cy3 and Cy5 intensities comparable? BARPLOT/RIPLOT • Are median intensity values and distributions of data comparable between slides? BOXPLOT • How similar are replicates? BOXPLOT/PCA • Can we see differences between experimental groups? PCA • How many genes on the array are on average expressed? Is this number comparable between replicates? EXPRESSED/RIPLOT • Must we do anything to correct identified problems? • We use the HybQC tool to answer these questions.

Array design • platform choice • probe design • layout design • Experiment design • biological • technical • Spotting data • array QC • layout • probe re-annotation • lab Info • Hybridization data • sample QC • labeling QC • raw image data • lab Info • extracted data • lab info • Hybridization info • transformed data • normalized data • Model Choice • method • contrast • p-value • fold change • gene lists • Methods • machine learning • - e.g. SOMs, Bayesian Networks • statistical • - e.g. GSEA, Global test • literature mining • data mining • mapping to knowledge models • figures • upload to AE/GEO A transcriptomics example of an e-BioScience flow Experiment design Data generation Feature extraction Quality control slides Data preprocessing Data Validation Data analysis Publication

MA-PSE QC & validation using workflows on Grid Workflow option: Taverna / Moteur Kepler VLAM QC validation normalization

Ratio intensity plots (Ri plots or Ma plot) • Ratio intensity plots are just scatterplots rotated 45 degrees • X-axis: intensity: log(Cy3)+log(Cy5), y-axis ratio log(Cy3)-log(Cy5) • Easier to assess: 4 fold change above the line y=2 and below line y=-2

Ratio intensity plots We look at a few thousand points. Where are most points? • Are at higher intensities more genes upregulated than down regulated??? • Observation: an intensity and dye dependent bias • with increasing intensity Cy5 seemes to be more responsive than Cy3

Quality control of raw extracted data (3) • Are median intensity values and distributions of data comparable between slides? • Use box and whisker plots to to visualize this: • M ~ median, Q1,Q3 quartile distances whiskers: respective largest and smallest value r*Q1-Q3 (r=1.5) from box dots: outliers

Box & Whiskers plots in a normal distribution

Quality control of raw extracted data (4) How similar are replicates? Can we see differences between experimental groups? How different are replicates from each other? How different are time points from each other? • Use Principal Component Analysis to get an estimate. • PCA also can answer questions like: • How much variability exists between slides? • Are there any unexpected groups in your set of arrays? • How complex is your data, how much variance is explained by the first few principal components?

5 timepoints n genes PCA in a nutshell (1) • PCA projects a high dimensional space onto a lower dimensional space. • 1st axis captures most variance, 2nd axis – orthogonal to 1st axis - captures next most variance etc. • Practically hard (and not necessary) to attach meaning to the axes. • Step 1 – set up a variance-covariance matrix example: reduce a n genes * n time points to a square matrix with length time points (5)

1 2 PCA 2 3 4 5 PCA 1 PCA in a nutshell (2) • Step 2: calculate the eigenvalues and eigenvectors of the covariance matrix. The eigenvector with the highest eigenvalue is the 1st principal component. • An example of a result:(arrays, representing timepoints visualized in two dimensions) Proportion of variance:PCA1: 0.623PCA2: 0.314PCA3: 0.033PCA4: 0.023PCA5: 0.007 Cumulative variance:PCA1: 0.623PCA2: 0.937PCA3: 0.970PCA4: 0.993PCA5: 1 How would you interpret this PCA?

Future work e-BioLab • 2009: finish generalization of the entire micro-array workflow • griddify the workflow • griddify data access using V-browser ( ~ finished) • enhance interaction with the tiled display using wii • enhance interaction with the display using transparancy • enhance interactivity on the tiled display • making selections • propagate a selection in one visualization to another

e-BioLab VL-e NBIC-BioAssist Han Rauwerda Wim de Leeuw Timo Breit SARA Bouwhuis & de Kler UvA Hertzberger & de Laat NIKHEF Linde & van Rijn

End of presentation

q q k l f a g h o e i m l n p h h b c s r t n m d j g k k Noise and heat management e-BioLab display flat against the wall display computers in separate room

Initial observations • The e-BioLab is cheap to set-up (equipment = ~75.000 Euro). • The technique to use the lab equipment is still very much in development. • The Bioinformatics Problem Solving Environment is also still in development. • So far the focus has been on the Experimental Support Environment. • Mainly for experimental data analysis using in silico experimentation. • It will take an effort to set-up and connect a Interactive and Creative environment. • The tiled display is extremely useful. • It provides good overviews while remaining sharp close by. • It takes an effort to set-up meeting in the lab. • The lab has to be close to the biologists. • It should be accessible on a ad-hoc basis. • Biologist and bioinformaticians love it! • Several initiatives in the Netherlands to copy the e-BioLab prototype • Serious interest from the Dutch Systems Biology (SysBioNL)

Potential future of e-BioLabs e-BioLabs Nijmegen Wageningen BigGrid Computer Clusters Amsterdam Rotterdam

Model Problem- driven hypothesis Experiment design Data Analysis & Integrative in-silico experiments biological problem Omics Data Experiment design VL-e Visualization Biological phenomena Interpretation Biological solutions How could I survive as a “omics” biologist? Biological research domain e-BioScience core domain Enabling science domain Analysis methods Dry-Lab Biological knowledge ICT infrastructure Wet-Lab Data- driven hypothesis e-BioLab

Cellular level Space Time Omics ‘design for experimentation’ Biological replicates! DNA or, RNA or, protein, etc Cell lines, tissues (heterogeneous) Hours, days, (few time points)

Bioinformagician The magic of bioinformaticians Crappy Data “Bioinformaticians should use statistics as a drunken man uses lamp-posts' for support rather than illumination."– adapted fromAndrew Lang (1844-1912)

Cellular level Space Time Address ‘design for experimentation’ Well designed biological use cases DNA, RNA, protein, etc GENOMICS (within) each cell (within) minutes

Omics after the hype, now what? • Do some serious expectation management • It will take quite a financial and effort investment • There are (still) a lot of technical issues to be resolved • It will take time before you have your first “real” results • You will have to collaborate seriously to get there • Start training (young) people • Organize a truly multidisciplinary curriculum • Bring back science philosophy in biology education • Teach biologists how to communicate with other science domains • Link omics to ongoing life sciences research • Start from biological topics/problems • Build on available knowledge (models) • Create funding for expensive experimentation • Make it challenging, but feasible for PhD students

Omics after the hype, now what?? • Reorganize the way research is performed • Start working with multidisciplinary teams in projects • Bring bioinformaticians into biology research groups as technicians • Remove technical issues • Get a dedicated Problem Solving Environment (~toolkit) • Acquire access to computational infrastructure (clusters, Grid) • Organize technical and bioinformatics support (MAD) • Attain access to a creative and collaborative environment (e-BioLab) • Get some serious results on all levels of research • Focus on answering biological topics/problems • Aim at feasible (small) results • EXPERIMENT DESIGN, EXPERIMENT DESIGN, EXPERIMENT DESIGN • Find the right biological, technological and bioinformatics partners • Have some luck EXPERIMENT DESIGN, EXPERIMENT DESIGN, EXPERIMENT DESIGN

So life after the omics revolution… It is the COMBINATION of technical potentials with REAL-LIFE biological cases in a PROPER conceptual context with a RELIABLE ICT infrastructure and a MATURE e-BioScience approach that will ADVANCE biology to an e-science and will (eventually) lead to REVOLUTIONARY results So, publish or perish becomes: Collaborate or perish!

Never a dull moment, while spotting the future! MAD staff Timo Breit Jenny Batson Group leader Administrator MAD Micro-Array Department Wet Lab Dry Lab Microarray technology Bioinformatics Floyd Wittink Jurgo Verkooijen Wim Ensink Manager Technician Technician Martijs Jonker Oskar Bruning Maarten Iterson Lennart Post Vacancy Bioinformatician Bioinformatician Bioinformatician Bioinformatician Bioinformatician • Tasks • Experiment execution • Data Analysis • - Service & Support • - Consultancy • - Collaborations SUPPORT Mark de Jong Marino Marinkovic vacancy Mol. biologist R&D technician R&D technician Han Rauwerda Tessa Pronk Wim de Leeuw Marcia Alves de Inda Christiaan Henkel Jerzy Dyczkowski Bioinformatician Biologist Sci. programmer Mathematician Biologist Bioinformatician • Tasks • QC improvements • Infrastructure • Research • Evaluation • - Implementation • - Training & education R&D

How does MAD supports an e-BioScience approach? • Technically we: • started a multidisciplinary group. • implemented several microarray technology platforms. • set-up a microarray service provider. • created a microarray analysis pipeline. • set-up a bioinformatics support centre. • set-up an e-BioLab. • Towards biology domains we: • started collaborations with a multitude of biological groups. • communicate honestly about the needed investments and slow results. • start from existing knowledge with the biological domain. • try to convince researchers to commit to and invest in an e-BioScience approach (started strategic collaborations with organizations medical microbiology UMC, RIVM). • help with experimental design. • execute microarray experiments. • support the bioinformatics data analysis. • help getting relevant biological results (and update the biological model). • Hide the technological/(bio)informatics complexity from biologists!!!!!

Data analyzing experiments & data integrating experiments data data Wet-Lab Experiments Dry-Lab Data Analysis In-Silico Experiments Core Activities Bioinformatics & technical support Communication Management Data management ICT infrastructure Core Activities Bioinformatics Communication Management Data management ICT infrastructure design results design Experiment Design Result Interpretation Experiment Design hypothesis hypothesis Evolving Biological Model How could it relate to my biological research? Data generating experiments

e-BioLab A Bioinformatics Problem Solving Environment

e-BioLab A Bioinformatics Problem Solving Environment

Presentation Transcript

Problem Solving

Problem Solving

A Problem Solving Framework

Problem solving

Problem Solving

Problem Solving

Problem Solving

Problem Solving

Problem solving

PROBLEM-SOLVING

PROBLEM SOLVING

Problem Solving

Problem Solving

Problem Solving

Problem Solving

Solving a Word Problem

Solving a Stoichiometry Problem

Solving a Stoichiometry Problem

The JigCell Problem Solving Environment (PSE)

Solving a rate problem

A Problem Solving “Webquest”

A Problem Solving Framework