Presentation Transcript

  1. PROTERAN: ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF PROTEIN FOLDING TRAJECTORY

  2. The need for Bioinformatics • Bioinformatics: Application of computational techniques to the management and analysis of biological information. • Clustering techniques applied to the data are not enough on their own; a good visual representation is also needed.

  3. Agenda • Microarrays • Review of existing clustering and visualization techniques on gene expression data • The need for a customized visualization tool for use by Dr. Laxmi Parida & Dr. Ruhong Zhou of the computational biology group at the IBM Watson Research Center for visual analysis of protein characteristics • Introduce our new technique that makes use of an animated terrain, implemented in the program called PROTERAN

  4. Function of Genes & Proteins • Through the proteins they encode, genes orchestrate the mysteries of life. • Protein functions vary widely, from mechanical support to transportation to regulation.

  5. Still a lot of work ahead • Traditional methods discovered gene functions on a gene-by-gene basis, so throughput was low. • Many genes are believed to work together, which cannot be observed when genes are studied one at a time.

  6. Microarrays • Solve the throughput problem • Allow scientists to see genes on a genomic level

  7. Expression Matrix

               Experiment 1   Experiment 2   ...   Experiment M
     Gene 1    C511/C311      C512/C312      ...   C51M/C31M
     Gene 2    C521/C321      C522/C322      ...   C52M/C32M
     ...       ...            ...            ...   ...
     Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM

  8. Clustering & Visualization Techniques Review

  9. Clustering • Clustering: Act of grouping similar objects together • Applied to gene expression in order to find the function of unknown genes • Many different clustering techniques exist in the literature. Representative techniques are discussed next.

  10. Determining similarity between two genes • Choose a distance measure to compare genes, e.g. Euclidean distance

                Experiment 1   Experiment 2   ...   Experiment M
      Gene 1    C511/C311      C512/C312      ...   C51M/C31M
      Gene 2    C521/C321      C522/C322      ...   C52M/C32M
      ...       ...            ...            ...   ...
      Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM
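
As a minimal sketch of the comparison named on this slide, the Euclidean distance between two rows of the expression matrix can be computed as below. The function name and the sample ratios are hypothetical and not taken from the presentation.

```python
import math

def euclidean_distance(gene_a, gene_b):
    """Euclidean distance between two expression profiles of equal length."""
    if len(gene_a) != len(gene_b):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gene_a, gene_b)))

# Hypothetical expression ratios for two genes across three experiments.
gene_1 = [1.0, 2.0, 4.0]
gene_2 = [1.0, 5.0, 8.0]
print(euclidean_distance(gene_1, gene_2))  # 5.0
```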

  11. Hierarchical Clustering • Create a distance matrix of all genes in relation to each other • Find the two closest genes (or clusters) • Merge these two and recompute the distance matrix • Repeat steps 2-3 until only one cluster is left
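
The steps above can be sketched as a naive agglomerative loop. The slide does not say how the distance between merged clusters is computed, so single linkage (distance between the closest members) is assumed here. The function name, the data layout and the toy distances are illustrative only.

```python
from itertools import combinations

def agglomerate(labels, dist):
    """Naive single-linkage agglomerative clustering.

    labels: list of item names.
    dist:   dict mapping frozenset({a, b}) -> distance between items a and b.
    Returns the merges in order as (cluster_1, cluster_2, distance) tuples.
    """
    clusters = [(name,) for name in labels]
    merges = []

    def linkage(c1, c2):
        # Single linkage (an assumption): distance between the closest members.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Recompute all pairwise cluster distances and pick the closest pair.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((c1, c2, linkage(c1, c2)))
        # Merge the pair; distances are recomputed on the next pass.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Illustrative pairwise distances between four items.
pairwise = {
    frozenset("AB"): 1, frozenset("AC"): 6, frozenset("AD"): 8,
    frozenset("BC"): 5, frozenset("BD"): 7, frozenset("CD"): 2,
}
for left, right, d in agglomerate(list("ABCD"), pairwise):
    print(left, right, d)
```

Recomputing every distance on each pass keeps the sketch short but is slow for large inputs; practical implementations update the distance matrix incrementally.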

  12. Dendrogram • Binary tree with a distinguished root, which has all the data items at the leaves • Re-orders the expression matrix to place similar genes beside each other

  13. Example Agglomerative Hierarchical Clustering

      Initial distance matrix:
                A    B    C    D
           A    0    1    6    8
           B         0    5    7
           C              0    2
           D                   0

      After merging A and B:
                  (A,B)   C    D
           (A,B)    0     5    7
           C              0    2
           D                   0

      After merging C and D:
                  (A,B)  (C,D)
           (A,B)    0      5
           (C,D)           0

  14. Advantages • Familiar to biologists • Few parameters to specify

  15. Disadvantages • Requires fast CPUs and large amounts of memory • Does not identify important clusters • Only represents hierarchically organized data • Does not scale up well

  16. Disadvantages cont. • A dendrogram always offers 2^(n-1) representations (where n = number of elements)
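
One way to see where a count of 2^(n-1) comes from: a binary tree with n leaves has n-1 internal nodes, and the two children of each can be swapped without changing the clustering. The sketch below enumerates the resulting leaf orders for a hypothetical four-leaf tree. The nested-tuple tree encoding is an assumption made for illustration.

```python
def leaf_orderings(tree):
    """All left-to-right leaf orders obtainable by flipping internal nodes.

    A tree is either a leaf label (str) or a (left, right) tuple.
    """
    if isinstance(tree, str):
        return [[tree]]
    left, right = tree
    orders = []
    for left_order in leaf_orderings(left):
        for right_order in leaf_orderings(right):
            orders.append(left_order + right_order)  # children as given
            orders.append(right_order + left_order)  # children flipped
    return orders

tree = (("A", "B"), ("C", "D"))
orders = leaf_orderings(tree)
print(len(orders))  # 8, i.e. 2 ** (4 - 1)
```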

  17. Self Organizing Maps (SOMs) • The user picks the number of clusters, called nodes • Nodes are randomly mapped into M-dimensional space (M = number of experiments) • Node values are adjusted using vectors picked at random from the original data • After the node values settle, each vector is assigned to its closest node
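
A heavily simplified sketch of the procedure on this slide follows. It assumes a one-dimensional chain of nodes, a linearly decaying learning rate, and a fixed neighbourhood in which the winner's immediate neighbours move half as far. None of these choices come from the presentation, and the sample profiles are made up.

```python
import random

def closest_node(nodes, vector):
    """Index of the node with the smallest squared Euclidean distance."""
    return min(range(len(nodes)),
               key=lambda i: sum((n - v) ** 2 for n, v in zip(nodes[i], vector)))

def train_som(data, n_nodes, iterations=500, rate=0.5, seed=0):
    """Train a tiny 1-D self-organizing map and return the node vectors."""
    rng = random.Random(seed)
    dims = len(data[0])
    lo = [min(v[d] for v in data) for d in range(dims)]
    hi = [max(v[d] for v in data) for d in range(dims)]
    # Nodes start at random positions inside the data's bounding box.
    nodes = [[rng.uniform(lo[d], hi[d]) for d in range(dims)]
             for _ in range(n_nodes)]
    for step in range(iterations):
        sample = rng.choice(data)               # random vector from the data
        alpha = rate * (1 - step / iterations)  # learning rate decays linearly
        winner = closest_node(nodes, sample)
        # Move the winner toward the sample, and its chain neighbours half as far.
        for idx, weight in ((winner, 1.0), (winner - 1, 0.5), (winner + 1, 0.5)):
            if 0 <= idx < n_nodes:
                nodes[idx] = [n + alpha * weight * (s - n)
                              for n, s in zip(nodes[idx], sample)]
    return nodes

# Hypothetical expression profiles forming two well-separated groups.
profiles = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
nodes = train_som(profiles, n_nodes=2)
assignment = [closest_node(nodes, p) for p in profiles]
print(assignment)
```

The final assignment step corresponds to the last bullet: once training stops, each profile is labelled with the index of its nearest node.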

  18. Visualization • Dendrogram • Error Bar Representation

  19. Visualization • U-Matrix

  20. Advantages • User has partial control over structure • Fuzzy Clusters • Variety of visual techniques applicable

  21. Disadvantages • Number of clusters must be known beforehand • Many parameters to specify

  22. Principal Component Analysis (PCA) • Mathematical technique that can be used to reduce the number of dimensions of the data
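
To illustrate the dimension reduction, the sketch below finds the first principal component of a small data set by power iteration on the sample covariance matrix, then projects a row onto it. Power iteration is one of several ways to obtain the component and is used here only to keep the example dependency-free. The data points and function names are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of greatest variance)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix (dims x dims).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vec = [1.0] * dims  # starting guess; must not be orthogonal to the answer
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec

def project(row, means, component):
    """Coordinate of a row along the given component (1-D reduction)."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))

# Hypothetical 2-D points lying exactly on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, pc1 = first_principal_component(points)
print(pc1)
print(project(points[0], means, pc1))
```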

  23. Visualization

  24. Advantages • No parameters required • 3D Visualization

  25. Disadvantages • Little control over structure • Running time of O(N^3) • Not applicable when input is a distance matrix

  26. Biclustering • Clustering of both rows and columns simultaneously

  27. Available Software

  28. Protein Folding

  29. Reaction Coordinates • Folding determines the function of a protein • An all-atom recreation of the protein is unrealistic • Reaction coordinates are used to describe protein structure • Fraction of Native Contacts • Radius of Gyration • RMSD from the native structure • Number of beta-strand Hydrogen Bonds • Number of alpha helix turns • Hydrophobic core radius of gyration • Principal Components
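
Two of the listed reaction coordinates have compact textbook definitions, sketched below. The sketch assumes equal atomic masses for the radius of gyration and assumes the two conformations are already superimposed for the RMSD. The coordinates are made up, and the presentation does not say how its own values were computed.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of atoms from their centroid (equal masses)."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    total = sum((c[k] - center[k]) ** 2 for c in coords for k in range(3))
    return math.sqrt(total / n)

def rmsd(coords, native):
    """RMSD between two conformations, assumed already superimposed."""
    n = len(coords)
    total = sum((a[k] - b[k]) ** 2 for a, b in zip(coords, native) for k in range(3))
    return math.sqrt(total / n)

# Hypothetical two-atom conformations.
frame = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
native = [(1.0, 0.0, 2.0), (-1.0, 0.0, 2.0)]
print(radius_of_gyration(frame))  # 1.0
print(rmsd(frame, native))        # 2.0
```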

  30. Protein States • While folding, a protein goes through certain states • The raw data is similar to microarray data. • Dr. Parida and Dr. Zhou have developed their own techniques and clustered β-Hairpin data.

  31. Reaction Coordinates used on the β-Hairpin • Number of Native β-strand hydrogen bonds • Radius of gyration of the hydrophobic core residues • Radius of gyration of entire protein • Fraction of native contacts • Principal component 1 • Principal component 2 • Root mean square deviation (RMSD) from the native structure.

  32. Raw Data

  33. Patterned Cluster RED = Number of columns in pattern. (Also defined as the Pattern Type) WHITE = Column Number PURPLE = Column Value YELLOW = Number of occurrences GREEN = Occurrences

  34. Sample Patterned Cluster File

  35. The need for Visual Analysis of Patterned Cluster Data • The β-Hairpin file is approximately 500 MB • Difficult to study the textual representation and get a global view • Very difficult to see how all patterned clusters interact with each other • Also very difficult to remember all patterned clusters and when they occur in time

  36. Visual Requirements • Global View • Navigation & Focus • Relative growth • Details of characteristics on demand

  37. Need for Customized Tool • All of the existing visualization techniques for microarrays had one or more drawbacks • None could depict the relative growth of clusters.

  38. Terrain Metaphor • Has been shown to be a useful technique in searching a corpus of documents • Very recently the idea has been applied to gene expression with high density clusters representing mountains

  39. Using a Landscape Metaphor to solve our requirements • Each mountain represents a patterned cluster • Mountain growth represents evolution of patterned cluster • Clicking on mountains returns details of patterned cluster

  40. PROTERAN

  41. Mapping of Patterned Cluster Data into Terrain Geometry

  42. Mapping of Patterned Cluster data into Terrain Geometry • Pattern Type: Number of columns in a patterned cluster • Column Combination: Unique number that identifies a combination of columns

  43. Column Combinations • Number of column combinations = c! / ((c – t)! * t!) • c = number of characteristics • t = pattern number
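
As a worked instance of this formula, the sketch below counts the column combinations for each pattern number. It assumes c = 7, matching the seven reaction coordinates listed for the β-Hairpin. `math.comb` computes the same binomial coefficient.

```python
import math

c = 7  # assumed number of characteristics (the seven β-Hairpin coordinates)

# Combinations per pattern number t: c! / ((c - t)! * t!)
counts = {t: math.comb(c, t) for t in range(1, c + 1)}
for t, n in counts.items():
    print(t, n)

print(sum(counts.values()))  # 127 column combinations in total
```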

  44. Layout • We first thought of using an automated layout technique. • However, one of Dr. Zhou’s requirements was that the same patterned cluster should always appear in the same position for consistent interpretation. • Another was that the larger pattern types (6 and 7 columns) must be placed so that they are clearly distinguishable. • Hence it was decided to use a manual layout design, described next.

  45. Layout

  46. Top Patterned Clusters Visualized • A final requirement from Dr. Parida and Dr. Zhou is that only the 10 largest patterned clusters of each column combination should be visualized
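
A minimal sketch of this filtering step follows. It assumes that "largest" means the highest number of occurrences and that each cluster is available as a (column combination id, cluster id, occurrence count) tuple. Both are assumptions, since the transcript does not show the data format.

```python
import heapq
from collections import defaultdict

def top_clusters(clusters, k=10):
    """Keep the k largest clusters per column combination.

    clusters: iterable of (combination_id, cluster_id, n_occurrences) tuples
              (a hypothetical layout, not the actual file format).
    Returns {combination_id: [cluster_id, ...]} ordered largest first.
    """
    by_combo = defaultdict(list)
    for combo, cluster_id, size in clusters:
        by_combo[combo].append((size, cluster_id))
    return {combo: [cid for _, cid in heapq.nlargest(k, members)]
            for combo, members in by_combo.items()}

# Made-up clusters, filtered with k=2 to keep the example small.
sample = [(3, "a", 40), (3, "b", 75), (3, "c", 12), (5, "d", 9)]
print(top_clusters(sample, k=2))  # {3: ['b', 'a'], 5: ['d']}
```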

  47. PROTERAN LAYOUT

  48. Animated Terrain Evolution • Time proceeds from 0 to the maximum number of experiments • At each time unit, all patterned clusters are checked • If a patterned cluster occurs at that time, its mountain’s height is increased
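
The update rule on this slide can be sketched as a generator that yields one set of mountain heights per time unit. The data layout, the function name and the fixed growth increment are hypothetical. How PROTERAN actually scales heights is not stated in the transcript.

```python
def terrain_frames(occurrences, n_steps, growth=1.0):
    """Yield (time, heights) for each time unit of the animation.

    occurrences: dict cluster_id -> set of time units at which it occurs
                 (a hypothetical layout).
    growth:      height added per occurrence (an assumed constant).
    """
    heights = {cid: 0.0 for cid in occurrences}
    for t in range(n_steps):
        # Check every patterned cluster at this time unit.
        for cid, times in occurrences.items():
            if t in times:
                heights[cid] += growth
        yield t, dict(heights)  # copy so earlier frames stay unchanged

# Made-up occurrences for two patterned clusters over three time units.
occ = {"p1": {0, 2}, "p2": {1}}
frames = list(terrain_frames(occ, 3))
for t, h in frames:
    print(t, h)
```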

  49. Mountains of PROTERAN

  50. Results & Extensions