Lecture 3 Data Types in computational biology/Systems biology Useful websites

Lecture 3
• Data types in computational biology/systems biology
• Useful websites
• Handling multivariate data: concept and types of metrics, distances, etc.
• K-means clustering

What is systems biology?
Each lab/group has its own definition of systems biology. This is because systems biology requires understanding and integrating different levels of OMICS information using knowledge from different branches of science, and individual labs/groups work on different areas.
Theoretical target: understanding life as a system.
Practical targets: serving humanity by developing new-generation medical tests, drugs, foods, fuels, materials, sensors, logic gates, ...
Understanding life, or even a cell, as a system is complicated and requires comprehensive analysis of different data types and/or sub-systems. Mostly, individual groups or people work on different sub-systems.

Data Types in computational biology/Systems biology
Some of the currently (partially) available and useful data types:
• Genome sequences
• Binding motifs in DNA sequences or cis-regulatory regions
• Codon usage
• Gene expression levels for global gene sets/microRNAs
• Protein sequences
• Protein structures
• Protein domains
• Protein-protein interactions
• Binding relations between proteins and DNA
• Regulatory relations between genes
• Metabolic pathways
• Metabolite profiles
• Species-metabolite relations
• Plant usage in traditional medicines
Usually, experiments are conducted in wet labs to generate such data. In dry labs like ours, we analyze these data to extract targeted information using different algorithms, statistics, etc.

Sequence data (genome/protein sequence)
>gi|15223276|ref|NP_171609.1| ANAC001 (Arabidopsis NAC domain containing protein 1); transcription factor [Arabidopsis thaliana]
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRE
NNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDL
LPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQ
QFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLV
DERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPL
HNYKAQEQPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFIS
VISWIILVG
Sequence matching/alignments: usually BLAST, a heuristic search tool whose alignment step builds on dynamic programming, is used to determine how well two or more sequences match each other.
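To make the dynamic-programming idea behind sequence alignment concrete, here is a minimal sketch of a global alignment score in the style of Needleman-Wunsch. It is not BLAST itself, and the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (Needleman-Wunsch style).

    The scoring scheme is an illustrative assumption, not taken from the lecture.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))  # three matches and one gap
```

Each table cell depends only on its three neighbours, which is why the whole table can be filled row by row.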

CODONS

CODON USAGE

Multivariate data (gene expression data/metabolite profiles)
Many types of clustering algorithms are applicable to multivariate data, e.g. hierarchical clustering, K-means, SOM, etc. Multivariate data can also be modeled using a multivariate probability distribution function.

Binary relational data (protein-protein interactions, regulatory relations between genes, metabolic pathways) form networks. Clustering is usually used to extract information from networks. Multivariate data and sequence data can also be easily converted to networks, after which network clustering can be applied.
Example interaction pairs:
AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH
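As a small illustration of how binary relations become a network, the sketch below builds an adjacency list from the Atp protein names listed above. Reading those names as consecutive pairs is an assumption, and the helper name `build_adjacency` is hypothetical.

```python
from collections import defaultdict

# Protein names from the slide, read as consecutive interaction pairs (assumption)
edges = [
    ("AtpB", "AtpA"),
    ("AtpG", "AtpE"),
    ("AtpA", "AtpH"),
    ("AtpB", "AtpH"),
    ("AtpG", "AtpH"),
    ("AtpE", "AtpH"),
]


def build_adjacency(edge_list):
    """Turn an undirected edge list into a dict mapping node -> set of neighbours."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


adjacency = build_adjacency(edges)
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
print(degree)
```

The node degrees already hint at structure: the most connected protein is a natural candidate for the core of a cluster.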

Useful Websites

Some websites where we can find different types of data and links to other databases:
• www.geneontology.org
• www.genome.ad.jp/kegg
• www.ncbi.nlm.nih.gov
• www.ebi.ac.uk/databases
• http://www.ebi.ac.uk/uniprot/
• http://www.yeastgenome.org/
• http://mips.helmholtz-muenchen.de/proj/ppi/
• http://www.ebi.ac.uk/trembl
• http://dip.doe-mbi.ucla.edu/dip/Main.cgi
• www.ensembl.org

Source: Knowledge-Based Bioinformatics: From Analysis to Interpretation Gil Alterovitz, Marco Ramoni (Editors)

NETWORK TOOLS Source: Knowledge-Based Bioinformatics: From Analysis to Interpretation Gil Alterovitz, Marco Ramoni (Editors)

Handling multivariate data: concept and types of metrics
[Figures: multivariate data example; multivariate data format]

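Distances, metrics, dissimilarities and similarities are related concepts.
A metric d is a function that satisfies the following properties:
(i) non-negativity: d(a, b) ≥ 0
(ii) symmetry: d(a, b) = d(b, a)
(iii) identification mark: d(a, a) = 0
(iv) definiteness: d(a, b) = 0 if and only if a = b
(v) triangle inequality: d(a, b) + d(b, c) ≥ d(a, c)
A function that satisfies only conditions (i)-(iii) is referred to as a distance.
Source: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit (Editors)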
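The difference between a metric and a mere distance can be checked numerically on a few points. The sketch below tests the usual metric properties (non-negativity, symmetry, d(a, a) = 0, definiteness, triangle inequality) on a small sample. The sample points and the helper name `violated_properties` are assumptions for illustration, and passing on a sample is of course not a proof.

```python
from itertools import product


def violated_properties(d, points, tol=1e-12):
    """Return the names of the metric properties that d violates on the given points."""
    failed = set()
    for a, b in product(points, repeat=2):
        if d(a, b) < -tol:
            failed.add("non-negativity")
        if abs(d(a, b) - d(b, a)) > tol:
            failed.add("symmetry")
        if a == b and abs(d(a, b)) > tol:
            failed.add("d(a, a) = 0")
        if a != b and abs(d(a, b)) <= tol:
            failed.add("definiteness")
    for a, b, c in product(points, repeat=3):
        if d(a, b) + d(b, c) < d(a, c) - tol:
            failed.add("triangle inequality")
    return failed


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical sample points
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
print(violated_properties(euclidean, sample))
print(violated_properties(squared_euclidean, sample))
```

Squared Euclidean distance is a common example of a dissimilarity that is symmetric and non-negative but fails the triangle inequality, so it is a distance in the sense above without being a metric.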

These measures consider the expression measurements as points in some metric space. Example: let X = (4, 6, 8) and Y = (5, 3, 9).
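For the example vectors X and Y, two common point-based measures can be computed directly. This is a minimal sketch; the choice of Euclidean and Manhattan distance as the two illustrated measures is an assumption, since the slide's own formulas are not reproduced in the text.

```python
import math

X = (4, 6, 8)
Y = (5, 3, 9)


def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


print(euclidean_distance(X, Y))  # sqrt(1 + 9 + 1) = sqrt(11), about 3.317
print(manhattan_distance(X, Y))  # 1 + 3 + 1 = 5
```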

Widely used metrics for finding similarity: correlation
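Correlation can be turned into a dissimilarity by subtracting it from 1. The sketch below computes the Pearson correlation for the vectors X = (4, 6, 8) and Y = (5, 3, 9) used earlier. Choosing Pearson correlation and the form 1 - r is an assumption, as the text does not say which variant is meant.

```python
import math

X = (4, 6, 8)
Y = (5, 3, 9)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


r = pearson_correlation(X, Y)
correlation_distance = 1 - r  # 0 for perfectly correlated profiles, 2 for anti-correlated
print(r, correlation_distance)
```

Unlike Euclidean distance, this measure ignores the overall level and scale of the two profiles and compares only their shape.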

Statistical distance between points
The statistical distance (Mahalanobis distance) between two vectors can be calculated if the variance-covariance matrix is known or estimated. In the example plot, the Euclidean distance between points Q and P is larger than that between Q and the origin O, yet P and Q appear to belong to the same cluster while Q and O do not.
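The effect described here can be reproduced with a small two-dimensional sketch. The covariance matrix and the coordinates of P and Q below are hypothetical values chosen to mimic an elongated cluster; they are not taken from the slide.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form with the inverse matrix (1/det) * [[d, -b], [-c, a]]
    quadratic = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(quadratic)


# Hypothetical setup: variables strongly positively correlated
cov = [[1.0, 0.9], [0.9, 1.0]]
O = (0.0, 0.0)
Q = (1.0, -1.0)
P = (3.0, 1.0)

print(math.dist(Q, P), math.dist(Q, O))                        # Euclidean: Q is closer to O
print(mahalanobis_2d(Q, P, cov), mahalanobis_2d(Q, O, cov))    # statistical: Q is closer to P
```

Because the step from Q to P runs along the direction of high variance, the statistical distance discounts it, while the step from Q to O cuts across the narrow direction and is penalized.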

Distances between distributions
Different from the previous approach (i.e. considering expression measurements as points in some metric space), the data for each feature can be considered as an independent sample from a population. The data therefore reflect the underlying population, and we need to measure similarities between two densities/distributions.
• Kullback-Leibler information (KLI) measures how much the shape of one distribution resembles the other.
• Mutual information (MI) is large when the joint distribution is quite different from the product of the marginals.
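For discrete distributions, both quantities reduce to short sums. The sketch below uses the standard textbook definitions with natural logarithms; the example probability tables are hypothetical, and the code assumes q is positive wherever p is.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler information sum(p_i * log(p_i / q_i)) for discrete p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """Mutual information of a joint probability table given as a list of rows."""
    px = [sum(row) for row in joint]          # marginal of the row variable
    py = [sum(col) for col in zip(*joint)]    # marginal of the column variable
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


# Hypothetical examples
independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals the product of the marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]         # the two variables always agree

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
print(mutual_information(independent), mutual_information(dependent))
```

Mutual information is the Kullback-Leibler information between the joint distribution and the product of its marginals, which is why it vanishes for the independent table.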

K-means clustering

Source: “Clustering Challenges in Biological Networks” edited by S. Butenko et al.

Source: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/

Initial value of centroids: suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1).
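Starting from these two centroids, the K-means iteration alternates between assigning each object to its nearest centroid and recomputing the centroids. The sketch below is a minimal version of that loop. The coordinates of medicines C and D are not given in the text above and are assumed here, so the numbers are illustrative only.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means with user-supplied initial centroids; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object changed cluster, so the procedure has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# A and B match c1 and c2 above; C and D are assumed coordinates
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
labels, centroids = kmeans(list(medicines.values()), [(1, 1), (2, 1)])
print(dict(zip(medicines, labels)), centroids)
```

With these assumed points, B starts in the second cluster but moves to the first after one centroid update, which shows why the assignment step has to be repeated until nothing changes.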

Hierarchical clustering

Hierarchical Clustering
Data is not always available as binary relations, as in the case of protein-protein interactions, where we can directly apply network clustering algorithms:
AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH
In many cases, for example in microarray gene expression analysis, the data are of multivariate type.
Source: An Introduction to Bioinformatics Algorithms by Jones & Pevzner

Hierarchical Clustering
We can convert multivariate data into networks and then apply network clustering algorithms, which we will discuss in a later class. If the dimension of the multivariate data is 3 or less, we can cluster them by plotting directly.
Source: An Introduction to Bioinformatics Algorithms by Jones & Pevzner

Hierarchical Clustering
Some data reveal good cluster structure when plotted, but some data do not.
[Figure: data plotted in 2 dimensions]
However, when the dimension is more than 3, we can apply hierarchical clustering to multivariate data. In hierarchical clustering the data are not partitioned into particular clusters in a single step. Instead, a series of partitions takes place.

Hierarchical Clustering
Hierarchical clustering is a technique that organizes elements into a tree. A tree is a connected graph that has no cycle. A tree with n nodes has exactly n-1 edges.
[Figures: a graph; a tree]
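The tree property can be tested mechanically: a graph on n nodes is a tree when it is connected and has n-1 edges. The sketch below checks both conditions; the function name `is_tree` and the example graphs are hypothetical.

```python
def is_tree(nodes, edges):
    """True if the undirected graph (nodes, edges) is connected and has n-1 edges."""
    nodes = list(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Depth-first search from an arbitrary node to test connectivity
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)


print(is_tree("ABCD", [("A", "B"), ("B", "C"), ("B", "D")]))               # a tree
print(is_tree("ABCD", [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]))   # contains a cycle
```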

Hierarchical Clustering
• Hierarchical clustering is subdivided into 2 types:
• agglomerative methods, which proceed by a series of fusions of the n objects into groups,
• and divisive methods, which separate the n objects successively into finer groupings.
• Agglomerative techniques are more commonly used.
The data can be viewed at any level between a single cluster containing all objects and n clusters each containing a single object.

Hierarchical Clustering Distance measurements Euclidean distance between g1 and g2

Hierarchical Clustering
Instead of Euclidean distance, correlation can also be used as a distance measurement. For biological analysis involving genes and proteins, nucleotide and/or amino acid sequence similarity can also be used as the distance between objects.
Source: An Introduction to Bioinformatics Algorithms by Jones & Pevzner

Hierarchical Clustering
• An agglomerative hierarchical clustering procedure produces a series of partitions of the data, Pn, Pn-1, ..., P1. The first, Pn, consists of n single-object 'clusters'; the last, P1, consists of a single group containing all n cases.
• At each stage the method joins together the two clusters which are closest together (most similar). (At the first stage, of course, this amounts to joining together the two objects that are closest together, since at the initial stage each cluster has one object.)
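The series of partitions Pn, ..., P1 can be generated with a short loop. The sketch below is a naive agglomerative procedure in which the distance between two clusters is taken as the smallest Euclidean distance between their members; that choice, the function name and the example points are assumptions made for illustration.

```python
import math
from itertools import combinations


def agglomerate(points):
    """Return the partitions Pn, ..., P1 as lists of clusters of point indices."""
    clusters = [[i] for i in range(len(points))]
    partitions = [[list(c) for c in clusters]]  # Pn: every object is its own cluster
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Assumed cluster distance: smallest pairwise distance between members
            d = min(
                math.dist(points[i], points[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        partitions.append([sorted(c) for c in clusters])
    return partitions


# Hypothetical 2-D objects
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
partitions = agglomerate(points)
for partition in partitions:
    print(len(partition), partition)
```

Each pass merges exactly one pair, so n objects always yield n partitions. This naive search is slow for large n and is meant only to mirror the description above.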

Hierarchical Clustering
Differences between methods arise because of the different ways of defining distance (or similarity) between clusters.
Source: An Introduction to Bioinformatics Algorithms by Jones & Pevzner

Hierarchical Clustering
How can we measure distances between clusters?
Single linkage clustering: the distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Min { d(i,j) : object i is in cluster A and object j is in cluster B }

Hierarchical Clustering
Complete linkage clustering: the distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Max { d(i,j) : object i is in cluster A and object j is in cluster B }

Hierarchical Clustering
Average linkage clustering: the distance between two clusters A and B, D(A,B), is computed as
D(A,B) = TAB / (NA * NB)
where TAB is the sum of all pairwise distances between objects of cluster A and objects of cluster B, and NA and NB are the sizes of clusters A and B respectively. There are NA * NB such pairs (edges) in total.

Hierarchical Clustering
Average group linkage clustering: the distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Average { d(i,j) : observations i and j are in cluster t, the cluster formed by merging clusters A and B }
There are n(n-1)/2 such pairs (edges) in total, where n = NA + NB is the size of cluster t.
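The four cluster-to-cluster distances (single, complete, average and average group linkage) differ only in which pairwise distances they combine, which a side-by-side sketch makes easy to see. Euclidean distance is assumed for d(i,j), and the two example clusters are hypothetical.

```python
import math
from itertools import combinations


def single_linkage(A, B):
    """Smallest distance between an object of A and an object of B."""
    return min(math.dist(a, b) for a in A for b in B)


def complete_linkage(A, B):
    """Largest distance between an object of A and an object of B."""
    return max(math.dist(a, b) for a in A for b in B)


def average_linkage(A, B):
    """Mean over the NA * NB between-cluster pairs."""
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))


def average_group_linkage(A, B):
    """Mean over all n(n-1)/2 pairs inside the merged cluster t."""
    merged = list(A) + list(B)
    pairs = list(combinations(merged, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


# Hypothetical clusters
A = [(0, 0), (0, 1)]
B = [(3, 0), (3, 1)]
for linkage in (single_linkage, complete_linkage, average_linkage, average_group_linkage):
    print(linkage.__name__, linkage(A, B))
```

Average group linkage also counts the within-cluster pairs, so for compact clusters it comes out smaller than average linkage computed on the same data.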

Hierarchical Clustering Alizadeh et al. Nature 403: 503-511 (2000).

Classifying bacteria based on 16s rRNA sequences.
