PROTERAN: ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF PROTEIN FOLDING TRAJECTORY
The need for Bioinformatics • Bioinformatics: application of computational techniques to the management and analysis of biological information • Clustering techniques alone are not enough; a good visual representation of the results is also needed
Agenda • Microarrays • Review of existing clustering and visualization techniques on gene expression data • The need for a customized visualization tool for use by Dr. Laxmi Parida & Dr. Ruhong Zhou of the computational biology group at the IBM Watson Research Center for visual analysis of protein characteristics • Introduce our new technique that makes use of an animated terrain, implemented in the program called PROTERAN
Function of Genes & Proteins • Through the proteins they encode, genes orchestrate the mysteries of life • Protein functions vary widely, from mechanical support to transportation to regulation
Still a lot of work ahead • Traditional methods of discovering their functions were done on a gene-by-gene basis, thus throughput was low. • Believed that many genes work together; this is not exhibited in a one-by-one fashion.
Microarrays • Solve the throughput problem • Allow scientists to see genes on a genomic level
Expression Matrix
          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C511/C311      C512/C312      ...   C51M/C31M
Gene 2    C521/C321      C522/C322      ...   C52M/C32M
...       ...            ...            ...   ...
Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM
Clustering • Clustering: act of grouping similar objects together • Applied to gene expression data in order to find the function of unknown genes • Many different clustering techniques exist in the literature; representative techniques are discussed next
Determining similarity between two genes • Each gene is one row of the expression matrix (one value per experiment) • Choose a distance measure to compare two genes, e.g. Euclidean distance
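As a minimal sketch of the Euclidean distance mentioned above: each gene is treated as a list of expression values, one per experiment. The function name and the sample values are illustrative assumptions, not taken from the slides.

```python
import math


def euclidean_distance(gene_a, gene_b):
    """Euclidean distance between two expression profiles of equal length."""
    if len(gene_a) != len(gene_b):
        raise ValueError("both genes need one value per experiment")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gene_a, gene_b)))


# Hypothetical expression ratios for two genes across three experiments.
gene_1 = [1.0, 2.0, 0.5]
gene_2 = [1.0, 0.0, 0.5]
print(euclidean_distance(gene_1, gene_2))
```

A smaller value means the two genes behave more alike across the experiments; other measures (for example correlation-based ones) can be swapped in without changing the clustering steps that follow.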
Hierarchical Clustering • Create a distance matrix of all genes in relation to each other • Find the two closest genes • Merge these two genes and recompute the distance matrix • Repeat steps 2-3 until only one cluster is left
Dendrogram • Binary tree with a distinguished root, which has all the data items at the leaves • Re-orders the expression matrix to place similar genes beside each other
Example Agglomerative Hierarchical Clustering
Step 1 (four genes):
        A    B    C    D
A       0    1    6    8
B            0    5    7
C                 0    2
D                      0
Step 2 (A and B merged):
        (A,B)  C    D
(A,B)   0      5    7
C              0    2
D                   0
Step 3 (C and D merged):
        (A,B)  (C,D)
(A,B)   0      5
(C,D)          0
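The example can be reproduced with a short sketch. The merged distances on the slide (5 and 7 for the merged (A,B) cluster) are consistent with single linkage, where the distance between two clusters is the smallest distance between any of their members, so that rule is assumed here. The function name and data layout are illustrative.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Agglomerative clustering; dist maps frozenset({a, b}) to a distance."""
    clusters = [(label,) for label in labels]
    merges = []

    def cluster_distance(pair):
        # Single linkage: closest pair of members decides the distance.
        first, second = pair
        return min(dist[frozenset((a, b))] for a in first for b in second)

    while len(clusters) > 1:
        first, second = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((first, second, cluster_distance((first, second))))
        clusters = [c for c in clusters if c not in (first, second)]
        clusters.append(first + second)
    return merges


# Distances from the four-gene example.
example = {
    frozenset("AB"): 1, frozenset("AC"): 6, frozenset("AD"): 8,
    frozenset("BC"): 5, frozenset("BD"): 7, frozenset("CD"): 2,
}
merges = single_linkage("ABCD", example)
for first, second, distance in merges:
    print(first, second, distance)
```

The merge order (A with B at distance 1, C with D at distance 2, then the two pairs at distance 5) is exactly the information a dendrogram draws as branch heights.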
Advantages • Familiar to biologists • Few parameters to specify
Disadvantages • Requires fast CPUs and large amounts of memory • Does not identify important clusters • Only represents hierarchically organized data • Does not scale up
Disadvantages (cont.) • A dendrogram can always be drawn in 2^(n-1) equivalent ways (where n = number of elements), because each of its n-1 internal nodes can be flipped
Self Organizing Maps (SOMs) • User picks number of clusters called nodes • Nodes randomly mapped to M-dimensional space (M = # of experiments) • Node values are adjusted by random vectors picked from original data • After node values settle vectors are clustered to closest node
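The four steps can be sketched in a few lines. This is a simplified illustration, not the configuration used in any particular SOM package: the nodes are arranged on a one-dimensional chain, the learning rate decays linearly, and only the winning node and its direct chain neighbours are moved. All names and parameter values are assumptions.

```python
import random


def closest_node(nodes, vector):
    """Index of the node nearest to vector (squared Euclidean distance)."""
    return min(
        range(len(nodes)),
        key=lambda i: sum((n - v) ** 2 for n, v in zip(nodes[i], vector)),
    )


def train_som(data, n_nodes, steps=500, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    dims = len(data[0])
    low = [min(v[d] for v in data) for d in range(dims)]
    high = [max(v[d] for v in data) for d in range(dims)]
    # Nodes start at random positions in the M-dimensional data range.
    nodes = [[rng.uniform(low[d], high[d]) for d in range(dims)]
             for _ in range(n_nodes)]
    for step in range(steps):
        rate = learning_rate * (1 - step / steps)
        vector = rng.choice(data)  # random vector from the original data
        winner = closest_node(nodes, vector)
        for i, node in enumerate(nodes):
            gap = abs(i - winner)
            influence = 1.0 if gap == 0 else 0.5 if gap == 1 else 0.0
            for d in range(dims):
                node[d] += rate * influence * (vector[d] - node[d])
    return nodes


def assign_clusters(data, nodes):
    """After training, each vector is clustered to its closest node."""
    return [closest_node(nodes, v) for v in data]


# Hypothetical expression profiles (three experiments per gene).
profiles = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.2, 0.1],
            [5.0, 5.1, 4.9], [4.9, 5.0, 5.2]]
trained = train_som(profiles, n_nodes=3)
assignments = assign_clusters(profiles, trained)
print(assignments)
```

The number of nodes, the learning rate, and the neighbourhood shape all have to be chosen by the user, which is where the "many parameters" drawback of SOMs comes from.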
Visualization • Dendrogram • Error Bar Representation
Visualization • U-Matrix
Advantages • User has partial control over structure • Fuzzy Clusters • Variety of visual techniques applicable
Disadvantages • Knowledge of number of clusters beforehand • Many parameters to specify
Principal Component Analysis (PCA) • Mathematical technique that can be used to reduce the number of dimensions of data
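To make the idea of dimension reduction concrete, here is a minimal sketch that finds only the first principal component, using power iteration on the covariance matrix. Power iteration is chosen to keep the example dependency-free; it is an assumption of this sketch, and a full PCA would normally use an eigendecomposition or SVD from a numerical library. The starting vector of all ones fails in the rare case where it is orthogonal to the first component.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (means, unit vector) for the direction of largest variance."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    vector = [1.0] * dims
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dims))
                   for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            break  # no variance left along the current vector
        vector = [x / norm for x in product]
    return means, vector


def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((r[d] - means[d]) * component[d] for d in range(len(means)))
            for r in rows]


# Hypothetical two-dimensional data lying on the line y = 2x.
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means, component = first_principal_component(points)
reduced = project(points, means, component)
print(component, reduced)
```

Here two dimensions collapse to one without losing information, because all of the variance lies along a single direction.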
Advantages • No parameters required • 3D Visualization
Disadvantages • Little control over structure • Running time of O(N^3) • Not applicable when input is a distance matrix
Biclustering • Clustering of both rows and columns simultaneously
Reaction Coordinates • Folding determines the function of a protein • All-atom recreation of a protein is unrealistic • Reaction coordinates are used to describe protein structure: fraction of native contacts; radius of gyration; RMSD from the native structure; number of beta-strand hydrogen bonds; number of alpha-helix turns; hydrophobic core radius of gyration; principal components
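Two of these reaction coordinates are simple enough to sketch directly from atom coordinates. The versions below are simplified assumptions: the radius of gyration is unweighted (every atom counts equally rather than by mass), and the RMSD is computed without first superimposing the two structures, which a real analysis would normally do. Function names are illustrative.

```python
import math


def centroid(points):
    """Geometric centre of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def radius_of_gyration(points):
    """Root mean square distance of the atoms from their centroid."""
    center = centroid(points)
    return math.sqrt(
        sum(math.dist(p, center) ** 2 for p in points) / len(points)
    )


def rmsd(structure, native):
    """Root mean square deviation between matching atoms, no alignment."""
    if len(structure) != len(native):
        raise ValueError("structures need the same number of atoms")
    return math.sqrt(
        sum(math.dist(p, q) ** 2 for p, q in zip(structure, native))
        / len(native)
    )


# Hypothetical two-atom structures.
native = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(radius_of_gyration(native), radius_of_gyration(frame), rmsd(frame, native))
```

Evaluating such functions on every frame of a folding trajectory turns each frame into a short row of numbers, which is what makes the data resemble a microarray expression matrix.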
Protein States • While folding, a protein goes through certain states • The raw data is similar to microarray data. • Dr. Parida and Dr. Zhou have developed their own techniques and clustered β-Hairpin data.
Reaction Coordinates used on the β-Hairpin • Number of native β-strand hydrogen bonds • Radius of gyration of the hydrophobic core residues • Radius of gyration of the entire protein • Fraction of native contacts • Principal component 1 • Principal component 2 • Root mean square deviation (RMSD) from the native structure
Patterned Cluster RED = Number of columns in pattern. (Also defined as the Pattern Type) WHITE = Column Number PURPLE = Column Value YELLOW = Number of occurrences GREEN = Occurrences
The need for Visual Analysis of Patterned Cluster Data • β-Hairpin file is approx. 500 MB • Difficult to study the textual representation and get a global view • Very difficult to see the interaction of all patterned clusters in relation to each other • Also very difficult to remember all patterned clusters and their occurrence in time
Visual Requirements • Global View • Navigation & Focus • Relative growth • Details of characteristics on demand
Need for Customized Tool • All of the existing visualization techniques on microarrays had one or more drawbacks • None were able to provide a visual for depicting relative growth of clusters.
Terrain Metaphor • Has been shown to be a useful technique in searching a corpus of documents • Very recently the idea has been applied to gene expression with high density clusters representing mountains
Using a Landscape Metaphor to solve our requirements • Each mountain represents a patterned cluster • Mountain growth represents evolution of patterned cluster • Clicking on mountains returns details of patterned cluster
Mapping of Patterned Cluster data into Terrain Geometry • Pattern Type: Number of columns in a patterned cluster • Column Combination: Unique number that identifies a combination of columns
Column Combinations • Number of column combinations = c! / ((c – t)! × t!) • c = number of characteristics • t = pattern type (number of columns in the pattern)
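As a worked example of the formula, the sketch below counts the column combinations for every pattern type, assuming c = 7 to match the seven reaction coordinates listed for the β-hairpin. `math.comb` computes c! / ((c – t)! × t!) directly.

```python
import math

CHARACTERISTICS = 7  # assumption: the seven β-hairpin reaction coordinates

# Number of column combinations for each pattern type t = 1..7.
counts = {t: math.comb(CHARACTERISTICS, t)
          for t in range(1, CHARACTERISTICS + 1)}
for pattern_type, count in counts.items():
    print(pattern_type, count)

total = sum(counts.values())
print("total", total)
```

Under that assumption there are 7 six-column combinations and a single seven-column combination, out of 127 combinations overall, so the large pattern types are few enough to be given distinct, fixed positions in a layout.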
Layout • We first thought of using an automated layout technique. • However, one of Dr. Zhou’s requirements was that the same pattern cluster should appear in the same position for consistent interpretation. • Another was that larger pattern types (6 and 7 column) must be very distinguishably placed. • Hence it was decided to use a manual layout design described next.
Top Patterned Clusters Visualized • Final requirement by Dr. Parida and Dr. Zhou is that only the top 10 largest patterned clusters of each column combination should be visualized
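One way to apply this filter is sketched below. The record layout, a tuple of cluster id, column combination, and size, is hypothetical, and "largest" is assumed to mean the highest number of occurrences; the slides do not specify either.

```python
import heapq
from collections import defaultdict


def top_clusters(clusters, limit=10):
    """Keep the `limit` largest clusters within each column combination.

    clusters: iterable of (cluster_id, column_combination, size) tuples.
    """
    by_combination = defaultdict(list)
    for cluster_id, combination, size in clusters:
        by_combination[combination].append((size, cluster_id))
    return {
        combination: [cid for _, cid in heapq.nlargest(limit, members)]
        for combination, members in by_combination.items()
    }


# Hypothetical clusters: three in column combination 3, one in combination 5.
example = [("p1", 3, 40), ("p2", 3, 75), ("p3", 3, 12), ("p4", 5, 8)]
selected = top_clusters(example, limit=2)
print(selected)
```

Filtering per column combination, rather than globally, keeps small combinations visible even when a few combinations contain most of the large clusters.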
Animated Terrain Evolution • Time proceeds from 0 to the maximum number of experiments • At each time unit, all patterned clusters are checked • If a patterned cluster has an occurrence at that time, its mountain's height is increased
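The animation loop can be sketched as follows. This is an illustration of the three steps above, not PROTERAN's actual implementation: the input layout (a mapping from cluster id to the set of time units at which it occurs), the fixed growth step, and the function name are all assumptions.

```python
def evolve_heights(occurrences, max_time, growth=1.0):
    """Return one {cluster_id: height} snapshot per time unit.

    occurrences: dict mapping cluster id to a set of occurrence times.
    """
    heights = {cluster_id: 0.0 for cluster_id in occurrences}
    frames = []
    for t in range(max_time + 1):
        # Check every patterned cluster at this time unit.
        for cluster_id, times in occurrences.items():
            if t in times:
                heights[cluster_id] += growth  # the mountain grows
        frames.append(dict(heights))  # copy, so each frame is independent
    return frames


# Hypothetical occurrence times for two patterned clusters.
example = {"P1": {0, 2}, "P2": {1}}
frames = evolve_heights(example, max_time=2)
for t, frame in enumerate(frames):
    print(t, frame)
```

Rendering the frames in sequence shows the relative growth of the clusters: a mountain that rises early and then stalls corresponds to a state the protein passes through and leaves.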