
Machine Learning and its Applications in Bioinformatics


Presentation Transcript


  1. Machine Learning and its Applications in Bioinformatics Yen-Jen Oyang Dept. of Computer Science and Information Engineering

  2. Observations and Challenges in the Information Age • A huge volume of information has been and is being digitized and stored in computers. • Owing to its sheer volume, effective exploitation of this digitized information is beyond human capability without the aid of intelligent computer software.

  3. An Example of Supervised Machine Learning (Data Classification) • Given the data set shown on the next slide, can we figure out a set of rules that predicts the classes of objects?

  4. Data Set

  5. Distribution of the Data Set [Figure: a scatter plot of the "O" and "X" samples in the plane; the two classes occupy distinct regions]

  6. Rule Based on Observation

  7. Rule Generated by a Kernel Density Estimation Based Algorithm Let $\hat f_{O}(v)$ and $\hat f_{X}(v)$ be the density functions estimated from the "O" and "X" training samples, respectively. If $\hat f_{O}(v) \ge \hat f_{X}(v)$, then prediction = "O". Otherwise prediction = "X".

  8. Identifying Boundary of Different Classes of Objects

  9. Boundary Identified

  10. Problem Definition ofData Classification • In a data classification problem, each object is described by a set of attribute values and each object belongs to one of the predefined classes. • The goal is to derive a set of rules that predicts which class a new object should belong to, based on a given set of training samples. Data classification is also called supervised learning.

  11. The Vector Space Model • In the vector space model, each object is described by a number of numerical attributes/features. • For example, the outlook of a man is described by his height, weight, and age. • It is typical that the objects are described by a large number of attributes/features.

  12. Transformation of Categorical Attributes into Numerical Attributes • Represent the attribute values of an object in binary table form, as exemplified below:

  13. Assign an appropriate weight to each column. • Treat the weighted vector of each row as the feature vector of the corresponding object.
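
The sketch below illustrates this transformation in Python. The attribute name, its values, and the column weights are hypothetical, since the slides' original table is not preserved in this transcript.

```python
# A minimal sketch of the transformation on slides 12-13. The attribute
# name, its values, and the weights below are hypothetical; the slides'
# original table is not preserved in this transcript.

CATEGORIES = {"outlook": ["sunny", "cloudy", "rainy"]}

def one_hot(attribute, value, weights=None):
    """Encode one categorical value as a binary (optionally weighted) vector."""
    values = CATEGORIES[attribute]
    vec = [1.0 if value == v else 0.0 for v in values]
    if weights is not None:                    # one weight per column
        vec = [w * x for w, x in zip(weights, vec)]
    return vec

print(one_hot("outlook", "cloudy"))                   # [0.0, 1.0, 0.0]
print(one_hot("outlook", "cloudy", [0.5, 2.0, 1.0]))  # [0.0, 2.0, 0.0]
```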

  14. Transformation of the Similarity/Dissimilarity Matrix Model • In this model, a matrix records the similarity/dissimilarity scores between every pair of objects.

  15. We may select P2, P5, and P6 as representatives and use the reciprocals of an object's similarity scores to these representatives to describe the object. • For example, the feature vectors of P1 and P2 are <1/53, 1/35, 1/180> and <0, 1/816, 1/606>, respectively.
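
A sketch of this construction, assuming a pairwise score table in which only the values quoted above (53, 35, 180, 816, 606) come from the slide:

```python
# Representative-based feature construction from slide 15. Only the
# scores quoted on the slide are used; any other entries would be
# hypothetical.

REPRESENTATIVES = ["P2", "P5", "P6"]

# SCORES[a][b]: similarity score between objects a and b
SCORES = {
    "P1": {"P2": 53, "P5": 35, "P6": 180},
    "P2": {"P2": 0, "P5": 816, "P6": 606},
}

def feature_vector(obj):
    """Describe obj by reciprocals of its scores to the representatives;
    an object's entry for itself is taken to be 0, as on the slide."""
    return [0.0 if SCORES[obj][r] == 0 else 1.0 / SCORES[obj][r]
            for r in REPRESENTATIVES]

print(feature_vector("P1"))  # [1/53, 1/35, 1/180]
print(feature_vector("P2"))  # [0.0, 1/816, 1/606]
```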

  16. Applications of Data Classification in Bioinformatics • In microarray data analysis, data classification is employed to predict the class of a new sample based on existing samples with known classes. • Data classification has also been widely employed in the prediction of protein family, protein fold, and protein secondary structure.

  17. For example, in the Leukemia data set, there are 72 samples and 7129 genes. • 25 Acute Myeloid Leukemia (AML) samples. • 38 B-cell Acute Lymphoblastic Leukemia (ALL) samples. • 9 T-cell Acute Lymphoblastic Leukemia (ALL) samples.

  18. Model of Microarray Data Sets • A microarray data set is modeled as an m × n matrix in which each row Sample_i records the expression levels of genes Gene_1, Gene_2, …, Gene_n for the i-th sample.

  19. Alternative Data Classification Algorithms • Decision tree (C4.5 and C5.0); • Instance-based learning (KNN); • Naïve Bayesian classifier; • RBF network; • Support vector machine (SVM); • Kernel Density Estimation (KDE) based classifier.

  20. Instance-Based Learning • In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, …, vm) and assign the new instance to the class that has the most instances among those k samples. • Classifiers that adopt instance-based learning are commonly called KNN classifiers.
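
A minimal Python sketch of such a KNN classifier; the training samples and labels below are made up for illustration:

```python
# A minimal KNN classifier matching the description on slide 20.
import math
from collections import Counter

def knn_predict(samples, labels, query, k=3):
    """Return the majority class among the k training samples nearest to query."""
    dists = sorted(
        (math.dist(s, query), y) for s, y in zip(samples, labels)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

samples = [(1, 1), (2, 1), (8, 9), (9, 8), (9, 9)]
labels  = ["O", "O", "X", "X", "X"]
print(knn_predict(samples, labels, query=(8, 8), k=3))  # "X"
```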

  21. Example of the KNN Classifiers • If a 1NN classifier is employed, then the test instance is predicted to be "X". • If a 3NN classifier is employed, then the test instance is predicted to be "O".

  22. Decision Function of the KNN Classifier • Assume that there are two classes of samples, positive and negative. • The decision function of a KNN classifier is then $f(v) = \operatorname{sign}\bigl(\sum_{s_i \in N_k(v)} y_i\bigr)$, where $N_k(v)$ is the set of the $k$ training samples nearest to $v$, and $y_i = +1$ for positive samples and $-1$ for negative samples.

  23. Extension of the KNN Classifier • We may extend the KNN classifier by weighting the contribution of each neighbor with a term related to its distance to the query vector, for example $f(v) = \operatorname{sign}\bigl(\sum_{s_i \in N_k(v)} \frac{y_i}{\lVert v - s_i \rVert^{2}}\bigr)$.
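
A sketch of the weighted decision function, assuming the common inverse-squared-distance weighting; the slide does not pin down the exact weighting term:

```python
# Distance-weighted extension of KNN (slide 23), using 1/d^2 weights as
# one choice of "a term related to its distance".
import math

def weighted_knn_decision(samples, labels, query, k=3):
    """Positive samples vote +1/d^2, negative samples vote -1/d^2;
    the sign of the sum is the prediction."""
    nearest = sorted(
        (math.dist(s, query), y) for s, y in zip(samples, labels)
    )[:k]
    score = sum((1 if y == "+" else -1) / (d * d + 1e-12) for d, y in nearest)
    return "+" if score > 0 else "-"
```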

  24. A RBF Network Based Classifier with Gaussian Kernels • It is typical that all kernel functions are radial basis functions of the same form. • With the popular Gaussian function, the decision function takes the form $f(v) = \sum_{i} w_i \exp\bigl(-\frac{\lVert v - \mu_i \rVert^{2}}{2\sigma_i^{2}}\bigr)$, where the $\mu_i$ are the kernel centers.
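
A sketch evaluating a decision function of this form; the centers, weights, and widths are assumed to come from some training procedure not shown here:

```python
# Gaussian RBF decision function in the form given on slide 24.
import math

def rbf_decision(v, centers, weights, sigmas):
    """f(v) = sum_i w_i * exp(-||v - mu_i||^2 / (2 sigma_i^2))."""
    return sum(
        w * math.exp(-math.dist(v, mu) ** 2 / (2.0 * s * s))
        for mu, w, s in zip(centers, weights, sigmas)
    )
```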

  25. The Common Structure of the RBF Network Based Data Classifier [Figure: network diagram mapping the input vector v through the kernel units to the output]

  26. Regularization of a RBF Network Based Classifier • The conventional approaches proceed by either employing a constant σ for all kernel functions or employing a heuristic mechanism to set each σi individually, e.g. a multiple of the average distance among samples, and attempt to minimize $\sum_{j}\bigl(f(s_j) - y_j\bigr)^{2} + \gamma\lVert f\rVert^{2}$, where $(s_j, y_j)$ is a learning sample.

  27. The regularization term $\gamma\lVert f\rVert^{2}$ is included to avoid overfitting, and γ is to be set through cross validation.

  28. Decision Function of a SVM • The class of a new sample located at v in the vector space is predicted by the rule $f(v) = \operatorname{sign}\bigl(\sum_{i} \alpha_i y_i K(s_i, v) + b\bigr)$, where the $s_i$ are the support vectors, $K$ is the kernel function, and the coefficients $\alpha_i$ and $b$ are determined during training.
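
A sketch of evaluating this decision rule, assuming a trained SVM's support vectors, coefficients, labels, and bias are given; the Gaussian kernel is one common choice:

```python
# Evaluating the standard kernelized SVM decision rule from slide 28.
import math

def gaussian_kernel(u, v, sigma=1.0):
    return math.exp(-math.dist(u, v) ** 2 / (2.0 * sigma * sigma))

def svm_predict(v, support_vectors, labels, alphas, b, kernel=gaussian_kernel):
    """sign( sum_i alpha_i * y_i * K(s_i, v) + b )"""
    score = sum(a * y * kernel(s, v)
                for s, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1
```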

  29. The Kernel Density Estimation (KDE) Based Classifier • The KDE based learning algorithm constructs one approximate probability density function $\hat f_c$ for each class $c$ of objects. • A new object located at v is assigned to the class with the largest likelihood: prediction $= \arg\max_{c} \hat f_c(v)$.
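
A sketch of a KDE based classifier along these lines: one Gaussian kernel density estimate per class, with prediction by the larger estimated density. The shared bandwidth σ is a simplification; the proposed classifier (slide 33) sets the width per kernel.

```python
# One Gaussian KDE per class; classify by the larger estimated density.
import math

def gaussian_kde(samples, v, sigma=1.0):
    """Average of Gaussian kernels centered at the class samples."""
    d = len(v)
    norm = (2.0 * math.pi) ** (d / 2.0) * sigma ** d
    return sum(math.exp(-math.dist(s, v) ** 2 / (2.0 * sigma * sigma))
               for s in samples) / (len(samples) * norm)

def kde_predict(class_samples, v, sigma=1.0):
    """class_samples: dict mapping class label -> list of sample vectors."""
    return max(class_samples,
               key=lambda c: gaussian_kde(class_samples[c], v, sigma))

train = {"O": [(1, 1), (2, 2), (1, 2)], "X": [(8, 8), (9, 8), (8, 9)]}
print(kde_predict(train, (2, 1)))  # "O"
```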

  30. Identifying Boundary of Different Classes of Objects

  31. Boundary Identified

  32. Problem Definition of Kernel Density Estimation • Given a set of samples $\{s_1, s_2, \dots, s_n\}$ randomly taken from a probability distribution $f$, we want to find a set of symmetric kernel functions $\kappa_i$ and the corresponding weights $w_i$ such that $\hat f(v) = \sum_{i} w_i \kappa_i(v) \approx f(v)$.

  33. The Proposed KDE Based Classifier • We determined to employ the Gaussian function and set the width of each Gaussian function to a multiple of the average distance among neighboring samples: $\sigma_i = \beta \bar d_i$, where $\bar d_i$ is the average distance between sample $s_i$ and its neighboring samples and β is a parameter of the algorithm.

  34. The average distance $\bar d_i$ among neighboring samples can be estimated from the distances between $s_i$ and its nearest neighbors in the training set.

  35. Accuracy of Different Classification Algorithms

  36. Comparison of Execution Time(in seconds)

  37. Parameter Setting through Cross Validation • When carrying out data classification, we normally need to set one or more parameters associated with the data classification algorithm. • For example, we need to set the value of k for the KNN classifier. • The typical approach is to conduct cross validation to find the optimal value.

  38. In the cross validation process, we set the parameters of the classifier to a particular combination of values that we are interested in and then evaluate how good the combination is under one of the following schemes. • With the leave-one-out cross validation scheme, we attempt to predict the class of each sample using the remaining samples as the training data set.

  39. With 10-fold cross validation, we evenly divide the training data set into 10 subsets. Each time, we test the prediction accuracy on one of the 10 subsets, using the other 9 subsets as the training set.
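
A sketch of selecting k for a KNN classifier by 10-fold cross validation, reusing the knn_predict function sketched after slide 20; the candidate values of k are up to the user:

```python
# Choose k by 10-fold cross validation (slides 37-39).
def cross_validate_k(samples, labels, candidate_ks, folds=10):
    n = len(samples)
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = 0
        for f in range(folds):
            test_idx = set(range(f, n, folds))   # every folds-th sample
            train = [(s, y) for i, (s, y) in enumerate(zip(samples, labels))
                     if i not in test_idx]
            tr_s, tr_y = [s for s, _ in train], [y for _, y in train]
            for i in test_idx:
                if knn_predict(tr_s, tr_y, samples[i], k) == labels[i]:
                    correct += 1
        acc = correct / n
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```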

  40. Overfitting • Overfitting occurs when we construct a classifier based on an insufficient number of samples. • As a result, the classifier may work well on the training dataset but fail to deliver acceptable accuracy in the real world.

  41. For example, if we toss a fair coin two times, there is a 50% chance that both tosses will show the same side. • Therefore, if we draw a conclusion about how fair the coin is from just two tosses, we may end up overfitting the dataset. • Overfitting is a serious problem in analyzing high-dimensional datasets, e.g. microarray datasets.
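
The 50% figure (1/4 for two heads plus 1/4 for two tails) can be checked with a quick simulation:

```python
# Probability that two tosses of a fair coin show the same side.
import random

trials = 100_000
same = 0
for _ in range(trials):
    t1 = random.random() < 0.5   # first toss: heads?
    t2 = random.random() < 0.5   # second toss: heads?
    if t1 == t2:
        same += 1
print(same / trials)  # close to 0.5
```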

  42. Alternative Similarity Functions • Let $\langle v_{r,1}, v_{r,2}, \dots, v_{r,n}\rangle$ and $\langle v_{t,1}, v_{t,2}, \dots, v_{t,n}\rangle$ be the gene expression vectors, i.e. the feature vectors, of samples $S_r$ and $S_t$, respectively. Then the following alternative similarity functions can be employed: • Euclidean distance: $d(S_r, S_t) = \sqrt{\sum_{i=1}^{n} (v_{r,i} - v_{t,i})^{2}}$

  43. • Cosine: $\cos(S_r, S_t) = \frac{\sum_{i=1}^{n} v_{r,i}\, v_{t,i}}{\sqrt{\sum_{i=1}^{n} v_{r,i}^{2}}\sqrt{\sum_{i=1}^{n} v_{t,i}^{2}}}$ • Correlation coefficient: $\rho(S_r, S_t) = \frac{\sum_{i=1}^{n}(v_{r,i} - \bar v_r)(v_{t,i} - \bar v_t)}{\sqrt{\sum_{i=1}^{n}(v_{r,i} - \bar v_r)^{2}}\sqrt{\sum_{i=1}^{n}(v_{t,i} - \bar v_t)^{2}}}$, where $\bar v_r$ and $\bar v_t$ are the means of the two vectors.
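
Direct Python implementations of the three functions, assuming two equal-length expression vectors:

```python
# The three similarity/dissimilarity functions from slides 42-43.
import math

def euclidean(vr, vt):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vr, vt)))

def cosine(vr, vt):
    dot = sum(a * b for a, b in zip(vr, vt))
    return dot / (math.sqrt(sum(a * a for a in vr)) *
                  math.sqrt(sum(b * b for b in vt)))

def correlation(vr, vt):
    mr = sum(vr) / len(vr)
    mt = sum(vt) / len(vt)
    cov = sum((a - mr) * (b - mt) for a, b in zip(vr, vt))
    return cov / (math.sqrt(sum((a - mr) ** 2 for a in vr)) *
                  math.sqrt(sum((b - mt) ** 2 for b in vt)))
```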

  44. Importance of Feature Selection • Inclusion of features that are not correlated with the classification decision may make the problem even more complicated. • For example, in the data set shown on the following page, inclusion of the feature corresponding to the y-axis causes incorrect prediction of the test instance if a 3NN classifier is employed.

  45. [Figure: the samples plotted in the x-y plane, with the two classes separated by the vertical line x = 10] It is apparent that the "o"s and "x"s are separated by x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.
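
A hypothetical data set (not the one from the slides) that reproduces this effect: with only the x attribute, the 3 nearest neighbors of the query are all "o", while adding the irrelevant y attribute makes them all "x". knn_predict is the classifier sketched after slide 20.

```python
# An irrelevant feature flipping a 3NN prediction (slides 44-45).
samples_2d = [(8, 0), (8, 10), (8, 20), (11, 4.8), (11, 5.0), (11, 5.2)]
labels     = ["o",    "o",     "o",     "x",       "x",       "x"]
query_2d   = (9, 5)

# Keep only the x attribute, which actually separates the classes.
samples_1d = [(x,) for x, _ in samples_2d]
query_1d   = (9,)

print(knn_predict(samples_1d, labels, query_1d, k=3))  # "o" (correct)
print(knn_predict(samples_2d, labels, query_2d, k=3))  # "x" (misled by y)
```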

  46. Linearly Separable and Non-Linearly Separable • Some datasets are linearly separable. • However, many datasets are non-linearly separable.
