geWorkbench Hands-On Training

geWorkbench Hands-On Training Session Date: Session Length: Target Audience: Trainer: Developer Subject Matter Expert:

geWorkbench geWorkbench is being developed at the Joint Centers for Systems Biology, Columbia University. This work is supported by the NCI caBIG and the NIH NCBC programs.

Session Details: This training is designed for a user who is new to geWorkbench. Target Audience: Researchers and students interested in the analysis of microarray gene expression experiments. Attendees are expected to have basic computer and biological knowledge. Note: this is not a complete introduction to all geWorkbench components. The primary goal is to describe the features developed for caBIG during Year 1 of the project and the context in which they are used.

Session Details: Overview of the Training Environment • These slides are suitable for use in: • Classroom Training • Centra – Online Classroom • Web-based Delivery

Session Details: Hardware and Software • geWorkbench requires the Sun Java JRE 1.5 environment to be installed on your local machine. • geWorkbench requires significant memory. At least 1 GB is recommended, especially if larger datasets will be read in or hierarchical clustering will be run. • Windows, Linux and Mac/PowerPC versions of geWorkbench are available. • See www.geworkbench.org for full details.

Session Details: Session Goals • By the end of the training session participants should: • Have a basic understanding of the purpose and aims of geWorkbench. • Be able to set program preferences and load microarray data from local and remote sources. • Understand how data files are organized into Projects, and how subsets of data can be formed and used. • Be able to use filtering and normalization components to prepare data. • Be able to analyze and view data using a number of new components.

Session Details: Outline of lessons • Introduction • Tutorial Data • Part 1 – Data management • Lesson 1: Basics of the graphical interface • Lesson 2: Setting Preferences • Lesson 3: Projects and Data Files • Lesson 4: Working with Data Subsets • Lesson 5: Working with Remote Sources • Part 2 – Data manipulation • Lesson 6: Normalization • Lesson 7: Filtering • Lesson 8: Experiment Annotations • Part 3 – Analysis and display • Lesson 9: The Scatter Plot component • Lesson 10: Expression Value Distribution • Lesson 11: Reverse Engineering • Lesson 12: Gene Annotation and Pathway Viewing • Lesson 13: Hierarchical Clustering Analysis • Lesson 14: ANOVA • Part 4 – Workflow execution • Lesson 15: caSCRIPT Editor

Introduction

Introduction: Overview • This section describes in general terms the capabilities of geWorkbench in the following areas: • Microarray analysis. • Sequence analysis. • Access to remote data and services. • A complete description of geWorkbench and online tutorials are available at www.geworkbench.org.

Introduction: Overview geWorkbench – a platform for tool and data integration. geWorkbench is an open-source bioinformatics platform that provides an extensive collection of tools for the management, analysis, visualization and annotation of biomedical data. geWorkbench has been designed with a plug-in framework, so new techniques can be added as they are developed and implemented. geWorkbench aims to let different tools work together easily: for example, microarray analysis can produce a list of interesting genes, whose coding or upstream sequences can then be retrieved and used in BLAST, pattern discovery, or transcription factor binding motif searches.

Introduction: Microarray data geWorkbench supports many kinds of operations on microarray data: • Obtaining data from local or remote data sources • Filtering and normalization • Basic statistical analysis • Clustering (Hierarchical, SOM) • Gene Ontology analysis • Reverse Engineering • Visualization using many common tools: • Scatter Plot • Volcano Plot • Expression Profiles • Expression Value Distribution • Color Mosaic • Dendrogram

Introduction: Sequence data geWorkbench also provides capabilities for working with sequence data: • BLAST • Pattern Discovery • Transcription Factor Mapping • Syntenic Region Analysis

Introduction: External data services • There are many biomedical data sources and computational services available through the internet. geWorkbench strives to make remote data and services directly available on the desktop, integrated with its own local tools. • External sources provide expression data, sequences and annotation: • Microarray gene expression repositories (caArray) • Gene annotation web pages (via CGAP) • DNA sequence retrieval (UC Santa Cruz) • Pathway diagrams (BioCarta via caBIO database at NCI)

Introduction: External computational services geWorkbench also provides a gateway to several computational services, including some hosted on Columbia servers and clusters. • BLAST – search for sequences similar to a query sequence. • Access is provided both to a Columbia server and the NCBI BLAST service. • Pattern Discovery – find repeated patterns in a group of sequences. • Synteny – compare regions of one chromosome against another. • Through the caGRID project, additional remote services are being added: • Hierarchical clustering – tree-like grouping by expression similarity. • SOM (Self-Organizing Maps) – divide expression profiles into a limited number of bins. • ARACNE – regulatory network reverse engineering.

Tutorial Data

Tutorial Data: Overview • In this section we describe the downloadable tutorial data files. This is primarily a reference section. Other files are included in the data directory of the program itself. • The data can be downloaded from http://wiki.c2b2.columbia.edu/workbench/index.php/Download • There are several file types: • Microarray • Affymetrix MAS5/GCOS format files – a single file per array, as produced by Affymetrix software. • The geWorkbench data matrix format, which merges all expression data from a set of experiments into a single file. By default it uses the ending ".exp". • Genepix two-color array experiments (in base download). • Sequence • DNA and protein sequence files in FASTA format.

Tutorial Data: Data files All data sets used in the tutorials are available from the download area of the geWorkbench website (http://wiki.c2b2.columbia.edu/workbench/index.php/Download). The file "tutorial_data.zip" contains the following files: • cardiogenomics.med.harvard.edu/ – Contains 10 individual MAS5/GCOS format data files. • webmatrix_quantile_log2_dev1.2_mv0.exp – A geWorkbench "exp" format matrix file containing filtered, normalized data. This data originally derives from the file "webmatrix2.exp". • NM_024426-Wilms.fasta – A GenBank nucleotide sequence file. • NP_077744-Wilms.fasta – A GenBank protein sequence file. • H1H5_HistoneDB_NHGRI.fasta – Contains H1 and H5 histone sequences from the NHGRI. • cluster_tree_total_pearsons_84_markers.csv – Contains a list of genes derived from hierarchical clustering. • 64of84ClusterPearsonsSeqs.fasta – Contains upstream DNA sequences derived from a subset of the above genes.

Tutorial Data: About the Cardiogenomics Microarray Dataset The example MAS5 format data files were obtained from the following site at Harvard University: http://cardiogenomics.med.harvard.edu/project-detail?project_id=229 A number of MAS5 format data files are available there. The specific project is the "Belgium Dataset of Aortic Stenosis, Congestive Cardiomyopathy and Normal LV Function", and the data is downloadable from: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/download_Hs-belgium.html An abstract describing the study is also available at: http://cardiogenomics.med.harvard.edu/groups/proj2/pages/Hs-belgium_home.html

Tutorial Data: Generation of example microarray dataset Generation of the "webmatrix2_quantile_log2_dev1.2_mv0.exp" dataset. The file "webmatrix2.exp", available in the Download area, contains results from 100 Affymetrix HG-U95Av2 chips hybridized with B-cell samples from numerous disease states. 12,600 probes are represented. For use in these tutorials we normalized and filtered the data. The steps on the next page are just an example of how filtering and normalization can be used; each dataset should be handled according to the type of analysis being undertaken and its goals.

Tutorial Data: Generation of example microarray dataset • The dataset was created through the following steps: • Normalization: Quantile normalization. • Normalization: Log2 transformation. • Filtering: Deviation filter with Deviation bound of 1.2. • Filtering: Missing values filter with maximum number of missing arrays of 0. • The result of performing these steps is available as the file "webmatrix2_quantile_log2_dev1.2_mv0.exp", found in the tutorial data file "tutorial_data.zip".
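The four preparation steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not geWorkbench's implementation: the function names are invented here, ties in quantile normalization are broken by row order, the "deviation" is taken to be the population standard deviation of a marker across arrays, and the toy matrix is made up and has no missing values.

```python
import math
import statistics

def quantile_normalize(matrix):
    """Give every array (column) the same value distribution.

    matrix[i][j] is the value of marker i on array j. Assumes no missing
    values; ties are broken by row order (a simplification).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all arrays.
    rank_means = [sum(col[k] for col in sorted_cols) / n_cols
                  for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result

def log2_transform(matrix):
    """Replace each (positive) value by its base-2 logarithm."""
    return [[math.log2(v) for v in row] for row in matrix]

def deviation_filter(matrix, bound):
    """Keep markers whose standard deviation across arrays is >= bound.

    Assumption: population standard deviation; the tool may define
    "deviation" differently.
    """
    return [row for row in matrix if statistics.pstdev(row) >= bound]

def missing_values_filter(matrix, max_missing):
    """Keep markers with at most max_missing missing (None) values."""
    return [row for row in matrix
            if sum(v is None for v in row) <= max_missing]

# Made-up toy data: 3 markers (rows) measured on 3 arrays (columns).
raw = [
    [100.0, 120.0, 80.0],
    [400.0, 4000.0, 40.0],
    [1600.0, 1500.0, 1700.0],
]
normalized = log2_transform(quantile_normalize(raw))
kept = missing_values_filter(deviation_filter(normalized, bound=1.2),
                             max_missing=0)
print(len(kept), "of", len(raw), "markers kept")
```

With a maximum of 0 missing arrays, the last filter drops any marker that has a missing value on even one array.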

Part 1: Data Management

Part 1: Data Management: Objectives • The objective of Part 1 is to learn the basic operation of geWorkbench. This includes understanding the layout of the graphical interface, which has four main functional regions, and setting user preferences. The loading of local and remote data files will be demonstrated. Perhaps most important is understanding how geWorkbench allows data to be divided into subsets, both for setting up analyses and for using their results. • After completing Part 1, you should be able to: • Load microarray data into geWorkbench from local and remote sources, and set display preferences. • Understand how the data can be organized into projects and manipulated using sets.

Part 1: Data Management: Lesson outline • Lesson 1: Basics of the graphical interface • Lesson 2: Setting Preferences • Lesson 3: Projects and Data Files • Lesson 4: Working with Data Subsets • Lesson 5: Working with Remote Sources

Lesson 1: Basics of the graphical interface

Lesson 1: Basics of the graphical interface: The four areas of the GUI The graphical user interface for geWorkbench is divided into four major sections: 1. Data management: Workspace and Projects (upper left). 2. Marker and Array/Phenotype set selection and management (lower left). 3. Visualization tools (upper right). 4. Analytical tools (lower right). Areas 2, 3 and 4 are defined for convenience. The actual placement of a given component into any of these three areas is controlled by a configuration file and can be customized as desired.

Lesson 1: Basics of the graphical interface: Menu bar and data management area Menu bar • The GUI provides a menu bar at top with a standard choice of commands. • Many commands that are available in the menu bar are also available by right-clicking on data objects. Data management area (area 1) • Working with geWorkbench involves creating a project within the top-level Workspace. • Opened data files and the results of analysis are stored within a Project. • Multiple projects can be used within a workspace to organize data. • A workspace and all the projects and data within it can be saved and later reloaded.

Lesson 1: Basics of the graphical interface: Set selection area Set selection and management (area 2) • geWorkbench allows sets of markers (gene probes) and of arrays/phenotypes to be defined and used. This allows the application to: • Analyze only a desired subset of the data. • Return lists of genes from one module which can then be used in another module, e.g. a list of genes returned by a t-test of differential expression can then be further investigated through sequence retrieval and analysis.

Lesson 1: Basics of the graphical interface: Visualization and analysis areas Visualization and Analysis tools (areas 3 and 4) • To simplify the display area, only the visualization and analysis components relevant to the type of dataset currently selected in the Project Folders area (area 1) are displayed. • Thus choosing a microarray dataset will result in a different set of tabs being displayed as compared with those seen when a nucleotide sequence file is selected. • When a new data file is loaded, or an analysis produces a new data set, not only is it added to the Project area (area 1), but an appropriate viewer in the Visualization area (area 3) is automatically selected. • A selection of visualization and analysis tools will be demonstrated in the following sections.

Lesson 2: Setting Preferences

Lesson 2: Setting Preferences: Modifying settings Preferences • The Preferences selection in the Tools menu allows users to specify how certain aspects of the system will behave. • Once the preferences are set, they persist between application sessions. Modifying Settings • From the main menu, click on Tools > Preferences.

Lesson 2: Setting Preferences: Modifying settings • Text Editor: The editor selected will be used to open and inspect data sets loaded in a project. Notepad is the default setting. • Visualization: The color scheme to be applied to color mosaic images. • Absolute: (default) Values are scaled against the largest absolute value found in the dataset, with positive values red and negative values green. • Relative: Each marker is mean-variance normalized across all arrays. A red-blue color scheme is used, with red showing positive and blue negative values. • Genepix Value Computation: Specifies how to compute the value displayed for a Genepix array. The default setting is Option 1: (Mean F635 - Mean B635) / (Mean F532 - Mean B532).
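As a rough illustration of the default Genepix option, the ratio of background-subtracted channel means can be written as a small Python function. The function name and the handling of a zero denominator are assumptions made for this sketch; the slides do not say how geWorkbench treats that case.

```python
def genepix_option1(mean_f635, mean_b635, mean_f532, mean_b532):
    """(Mean F635 - Mean B635) / (Mean F532 - Mean B532).

    F = foreground, B = background, for the 635 nm and 532 nm channels.
    Returns None when the denominator is zero (an assumption of this
    sketch, treating the spot as missing).
    """
    denominator = mean_f532 - mean_b532
    if denominator == 0:
        return None
    return (mean_f635 - mean_b635) / denominator

# Made-up spot intensities: (1200 - 200) / (700 - 200)
print(genepix_option1(1200, 200, 700, 200))
```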

Lesson 2: Setting Preferences: Notes • The relative display performs its own transformation on the data just for purposes of visualization. The underlying data is not changed. • The relative selection for the Microarray Viewer preference will give odd-looking results if only a small number of arrays are loaded (e.g. 2). This is because with only two values, each point will be at a color extreme – either blue or red. • Changing the Microarray Viewer relative/absolute preference will not take effect until the next time a data set is loaded.
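The two-array effect noted above follows directly from mean-variance normalization, as this small Python sketch shows. It assumes the normalization is (value - mean) / standard deviation per marker, using the population standard deviation; the exact formula geWorkbench uses is not given in the slides.

```python
import statistics

def mean_variance_normalize(values):
    """Scale one marker's values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant marker has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# With only two arrays, any two distinct values land on the same two
# extremes, no matter how large or small the real difference is.
print(mean_variance_normalize([5.0, 9.0]))
print(mean_variance_normalize([100.0, 101.0]))
```

Both calls give one value at the negative extreme and one at the positive extreme, which is why every cell appears fully blue or fully red.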

Lesson 3: Projects and Data Files

Lesson 3: Projects and Data Files: File types • geWorkbench supports a number of data file formats, including: • For Microarrays: • Affymetrix MAS5/GCOS text files. • Affymetrix File Matrix – this is the native file type created by geWorkbench, and contains a data matrix from any number of experiments merged together. • RMA Express File – RMA Express is a sophisticated tool for combining data from multiple Affymetrix chips. It is not a part of geWorkbench. • Genepix Files – created by a popular analysis program for two-color arrays. • For Sequence: • FASTA Files – DNA or protein sequence files in FASTA format. • Pattern Files – created by the Pattern Discovery component.

Lesson 3: Projects and Data Files: Opening a file • In this example, we will load 10 individual Affymetrix MAS5 format files, merging them into a single dataset. 1. Create a Project. All data must belong to a project. Right-click on the Workspace entry in the Project Folders window at upper left to create a new project. 2. Next, right-click on the new Project entry and select Open Files.

Lesson 3: Projects and Data Files: Loading and merging data 3. Select file type Affymetrix MAS5/GCOS as shown. 4. Make sure to check the Merge files checkbox. 5. Select 10 MAS5 format text files from the tutorial data directory. 6. Click Open. The chip type HG_U95Av2 is recognized...
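Conceptually, merging turns several single-array files into one marker-by-array matrix. The Python sketch below shows that idea only: it does not parse the MAS5/GCOS file format, and the array names, probe identifiers and values are made up for illustration.

```python
def merge_arrays(arrays):
    """Merge per-array measurements into one matrix.

    arrays maps an array name to a {marker id: value} dict, standing in
    for the contents of one single-array file. Returns the list of array
    names (column order) and a {marker id: [value per array]} dict.
    A marker absent from an array gets None in that column.
    """
    names = list(arrays)
    markers = sorted(set().union(*(a.keys() for a in arrays.values())))
    matrix = {m: [arrays[n].get(m) for n in names] for m in markers}
    return names, matrix

# Hypothetical stand-ins for two loaded files.
loaded = {
    "JB-ccmp_example": {"probe_1": 812.4, "probe_2": 95.0},
    "JB-n_example": {"probe_1": 640.1, "probe_2": 101.7},
}
names, matrix = merge_arrays(loaded)
print(names)
print(matrix["probe_1"])
```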

Lesson 3: Projects and Data Files: Viewing data The merged dataset is listed in the Project folder. The data is displayed, in single array format, in the Microarray Viewer. Note that we have increased the intensity slider to maximum here.

Lesson 3: Projects and Data Files: Renaming and saving a merged dataset • The merged dataset can be given a shorter name. • Right-click on the merged dataset and select Rename. • Enter a new dataset name, e.g. merged_cardio. • The dataset can also be saved to disk for later reuse. • Right-click on the merged dataset and select Save. • Enter a filename.

Lesson 4: Working with Subsets of Data

Lesson 4: Working with Subsets of Data: Background • geWorkbench makes extensive use of sets of markers (genes) or arrays. • Sets can be defined by the user, or may be created as a result of an analysis. • Sets of arrays can be used to distinguish between different experimental states, for example as part of a statistical analysis. • The t-test requires two states to be defined for comparison. • Sets of markers are returned from various analysis routines. For example, the t-test returns a list of markers showing significant differential expression, and after hierarchical clustering, the markers in a subtree of the resulting dendrogram can be saved. • geWorkbench supports groupings of sets. Each such group can contain different sets of markers or arrays.

Lesson 4: Working with Subsets of Data: Overview In this tutorial you will learn: • How to create a set of arrays. • How to mark a set of arrays as "Active". • How to classify a set of arrays, e.g. as "case" vs. "control". • How arrays can be grouped in different ways with descriptive tags.

Lesson 4: Working with Subsets of Data: Preparation The first example here will use the same data files read in and merged in the previous lesson (Projects and Data Files). The second example will use the tutorial file webmatrix2_quantile_log2_dev1.2_mv0.exp.

Lesson 4: Working with Subsets of Data: Assigning arrays to sets We will leave the arrays in the default group; however, you can create a new group by clicking the New button in the Array/Phenotype Sets component at the lower left of the application (arrow labeled New). First, we will select and label the arrays that contain samples from the congestive cardiomyopathy disease state: 1. In the Arrays/Phenotypes component, select the six arrays beginning with JB-ccmp, which represent the samples from the congestive cardiomyopathy disease state. 2. Right-click and select Add to Set.

Lesson 4: Working with Subsets of Data: Assigning arrays to sets 3. Enter "CCMP" in the input box and click OK. 4. Next, similarly label the arrays beginning with JB-n as "Normal". The Array/Phenotype Sets component will now show the two sets added:

Lesson 4: Working with Subsets of Data: Activating sets The boxes next to the set name can be checked to indicate that a set of arrays is "Active". Various analysis and visualization components can be set to only use/display activated arrays or markers. Note: if no Marker sets are explicitly activated, then all Markers are implicitly active. The same applies to Arrays.
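The activation rule described above (nothing checked means everything is active) can be sketched in Python as follows. The function and the array names are hypothetical and only illustrate the rule; they are not part of geWorkbench.

```python
def active_arrays(all_arrays, sets, active_set_names):
    """Return the arrays an analysis would use.

    sets maps a set name to its member arrays. If no set is activated,
    every array is implicitly active.
    """
    if not active_set_names:
        return list(all_arrays)
    chosen = set()
    for name in active_set_names:
        chosen.update(sets[name])
    # Preserve the original array order.
    return [a for a in all_arrays if a in chosen]

# Hypothetical array names grouped into the two sets from the tutorial.
arrays = ["ccmp_1", "ccmp_2", "n_1"]
sets = {"CCMP": ["ccmp_1", "ccmp_2"], "Normal": ["n_1"]}
print(active_arrays(arrays, sets, []))          # nothing checked
print(active_arrays(arrays, sets, ["Normal"]))  # only "Normal" checked
```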

Lesson 4: Working with Subsets of Data: Classifying a set For statistical tests such as the t-test, Case and Control groups can be specified. 1. Left-click on the thumbtack icon in front of the phenotype name. 2. Select Case to specify the disease arrays as the "Case". The remaining "Normal" arrays are by default considered Control.

Lesson 4: Working with Subsets of Data: Classifying a set 3. A red thumbtack indicates an array set has been marked as "Case".
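To show why a Case/Control split is needed, here is a minimal Python sketch of a two-sample t statistic for one marker. It uses Welch's unequal-variance form as an assumption; the slides do not say which t-test variant geWorkbench applies, and the array names and values are made up.

```python
import math
import statistics

def split_by_class(row, array_names, case_arrays):
    """Split one marker's values into Case and Control.

    Arrays not marked as Case are treated as Control, mirroring the
    default described above.
    """
    case = [v for v, n in zip(row, array_names) if n in case_arrays]
    control = [v for v, n in zip(row, array_names) if n not in case_arrays]
    return case, control

def welch_t(case_values, control_values):
    """Welch's t statistic (sample variances, unequal-variance form)."""
    n1, n2 = len(case_values), len(control_values)
    v1 = statistics.variance(case_values)
    v2 = statistics.variance(control_values)
    diff = statistics.fmean(case_values) - statistics.fmean(control_values)
    return diff / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical data: three Case and three Control arrays for one marker.
array_names = ["c1", "c2", "c3", "n1", "n2", "n3"]
row = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
case, control = split_by_class(row, array_names, {"c1", "c2", "c3"})
print(welch_t(case, control))
```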

Lesson 4: Working with Subsets of Data: Using multiple array groups • Different groups of sets can be made, both for Markers and for Arrays. They may differ in membership or in how members are named (e.g. amount of detail). • Here we show how several different groupings are defined in the example data file "webmatrix2_quantile_log2_dev1_mv0.exp". • After loading this file into geWorkbench as type "Affymetrix File Matrix", four groups can be seen in the Arrays/Phenotypes group pulldown menu at right.

Lesson 4: Working with Subsets of Data: Using multiple array groups If we choose the group called "Class", the sets of arrays at right are displayed:

Lesson 4: Working with Subsets of Data: Using multiple array groups If instead we choose the group "Cell Line", a different grouping of the same arrays is seen:
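One way to picture groups of sets is as a nested mapping in which each group partitions the same arrays differently. The Python sketch below is only a mental model: the group names "Class" and "Cell Line" come from the tutorial, while the set names and array names are invented.

```python
# group name -> set name -> member arrays (set and array names are made up)
groups = {
    "Class": {"tumor": ["array1", "array2"], "normal": ["array3"]},
    "Cell Line": {"line A": ["array1", "array3"], "line B": ["array2"]},
}

def sets_containing(groups, array_name):
    """Return {group name: set name} for each group that holds the array."""
    found = {}
    for group_name, sets in groups.items():
        for set_name, members in sets.items():
            if array_name in members:
                found[group_name] = set_name
    return found

# The same array carries a different label in each group.
print(sets_containing(groups, "array1"))
```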
