Supervised Machine Learning Algorithms
Taxonomy of Machine Learning Methods • The main idea of machine learning (ML) • To use computers to learn from massive amounts of data • For tedious or unstructured data, machines can often make better and less biased decisions than a human learner • ML forms the core of artificial intelligence (AI) • Especially in the era of big data • Need to write a computer program based on a model algorithm • By learning from given data objects, one can reveal the categorical class or experience affiliation of future data to be tested • This essentially defines ML as an operational term
Taxonomy of Machine Learning Methods (cont.) • To implement an ML task • Need to explore or construct computer algorithms to learn from data • Make predictions on data based on their specific features, similarities, or correlations • ML algorithms are operated by building a decision-making model from sample data inputs • Defines the relationship between features and labels • A feature is an input variable for the algorithm • A label is an output variable for the algorithm • The outputs are data-driven predictions or decisions • One can handle the ML process subjectively • By finding the best fit to solve the decision problem based on the characteristics in data sets
Classification by Learning Paradigms • ML algorithms can be built with different styles in order to model a problem • The style is dictated by the interaction with the data environment • Expressed as the input to the model • The data interaction style decides the learning models that an ML algorithm can produce • The user must understand the roles of the input data and the model’s construction process • The goal is to select the ML model that can solve the problem with the best prediction result • ML sometimes overlaps with the goal of data mining
Classification by Learning Paradigms (cont.) • Three classes of ML algorithms based on different learning styles • Supervised, unsupervised, and semi-supervised • All three ML methods are viable in real-life applications • The style hinges on how training data is used in the learning process
Classification by Learning Paradigms (cont.) • Supervised learning • The input data is called training data and has a known label or result • A model is constructed through training by using the training dataset • Improved by receiving feedback on its predictions • The learning process continues until the model achieves a desired level of accuracy on the training data • Future incoming data without known labels is then tested on the model with an acceptable level of accuracy • Unsupervised learning • None of the input data is labeled with a known result
Classification by Learning Paradigms (cont.) • A model is generated by exploring the hidden structures present in the input data • To extract general rules, go through a mathematical process to reduce redundancy, or organize data by similarity testing • Semi-supervised learning • The input data is a mixture of labeled and unlabeled examples • The model must learn the structures to organize the data in order to make predictions possible • Under different assumptions on how to model the unlabeled data
Supervised Machine Learning Algorithms • In a supervised ML system • The computer learns from a training data set of {input, output} pairs • The input comes from sample data given in a certain format • e.g., The credit reports of borrowers • The output may be discrete • e.g., yes or no to a loan application • The output can also be continuous • e.g., The probability distribution that the loan can be paid off in a timely manner • The goal is to work out a reliable ML model • Can map or produce the correct outputs from new inputs that were unseen before
Supervised Machine Learning Algorithms (cont.) • Four families of supervised ML algorithms • Regression, decision trees, Bayesian networks, and support vector machines • The ML system acts like a finely tuned predictor function g(x) • The learning system is built with a sophisticated algorithm to optimize this function • e.g., Given an input data x in a credit report of a borrower, the bank will make a loan decision based on the predicted outcome • The learning process is iteratively refined using an error criterion to make better predictions • Minimizes the error between predicted value and actual experience in input data
Supervised Machine Learning Algorithms (cont.) • The iterative trial-and-error process • Suggested for machine learning algorithms to train a model, as sketched below
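A minimal sketch of this iterative refinement, assuming a one-feature linear predictor g(x) = w·x + b trained by gradient descent on a squared-error criterion; the toy data, learning rate, and iteration count are illustrative choices, not taken from the slides.

```python
import numpy as np

# Toy {input, output} training pairs: x might encode a borrower feature, y the target value.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.2, 0.4, 0.6, 0.8])

w, b = 0.0, 0.0   # initial guess for the predictor g(x) = w*x + b
lr = 0.05         # learning rate (illustrative)

for step in range(2000):               # iterative trial-and-error refinement
    pred = w * x + b                    # predictions of the current model
    err = pred - y                      # error between prediction and actual label
    w -= lr * 2 * np.mean(err * x)      # gradient step that reduces the squared error
    b -= lr * 2 * np.mean(err)

print(w, b)  # converges toward w ≈ 0.2, b ≈ 0.0 on this toy data
```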
Regression Analysis • The outputs of regression are continuous rather than discrete • Finds the causal relationship between the input and output variables • Applies mathematical statistics to establish the dependent and independent variables in learning • The independent variables are the inputs of the regression process, aka the predictors • The dependent variable is the output of the process • Essentially performs a sequence of parametric or nonparametric estimations • Care must be taken in making predictions • Presumed causality may lead to illusions or false relationships that mislead the users
Regression Analysis (cont.) • The estimation function can be determined • By experience using a priori knowledge or visual observation of the data • The regression method can be applied to classify data by predicting the category tag of data • Regression analysis determines the quantitative relation in a learning process • How the value of the dependent variable changes • When any independent variable varies while the other independent variables are left unchanged • Regression analysis estimates the average value of the dependent variable when the independent variables are fixed
Regression Analysis (cont.) • The estimated value is a function of the independent variables known as the regression function • Can be described by a probability distribution • Most regression methods are parametric in nature • Need to calculate the undetermined coefficients of the function by using some error criteria • With a finite dimension in the analysis space • Nonparametric regression may be infinite-dimensional • Accuracy or performance depends on the quality of the dataset used • Related to the data generation process and the underlying assumptions made
Regression Analysis (cont.) • Regression offers estimation of continuous response variables • As opposed to the discrete decision values used in classification that demand higher accuracy • In the formulation of a regression process • The unknown parameters are often denoted as β • May appear as a scalar or a vector • The independent variables are denoted by a vector X and the dependent variable as Y • When multiple dimensions are involved, these parameters are vectors in form • A regression model establishes the approximated relation between X, β, and Y: Y ≈ f(X, β)
Regression Analysis (cont.) • The function f(X, β) is approximated by the expected value E(Y|X) • The regression function f is based on the knowledge of the relationship between a continuous variable Y and vector X • If no such knowledge is available, an approximated handy form is chosen for f • Measuring the Height after Tossing a Small Ball in the Air • Measure its height of ascent h at various time instants t • The relationship is modeled as h = β1t + β2t² + ε • β1 determines the initial velocity of the ball
Regression Analysis (cont.) • β2 is proportional to standard gravity • ε is due to measurement errors • Linear regression is used to estimate the values of β1 and β2 from the measured data • This model is nonlinear with respect to the time variable t • But it is linear with respect to parameters β1 and β2 • Consider k components in the vector of unknown parameters β • Three models to relate the inputs to the outputs • Depending on the relative magnitude between the number N of observed data points of the form (X, Y) and the dimension k of the sample space
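A short sketch of fitting this ball-toss model by linear least squares; the time stamps and the noisy heights below are simulated for illustration, with β2 expected to come out near −g/2 ≈ −4.9.

```python
import numpy as np

g = 9.81
t = np.linspace(0.1, 1.0, 10)                                  # measurement instants (illustrative)
rng = np.random.default_rng(0)
h = 5.0 * t - 0.5 * g * t**2 + rng.normal(0, 0.02, t.size)     # simulated noisy height measurements

# Design matrix: h = β1*t + β2*t² is nonlinear in t but linear in the parameters β1, β2.
A = np.column_stack([t, t**2])
beta, *_ = np.linalg.lstsq(A, h, rcond=None)

print(beta)  # β1 ≈ 5.0 (initial velocity), β2 ≈ -4.9 (proportional to standard gravity, ≈ -g/2)
```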
Regression Analysis (cont.) • When N < k, most classical regression analysis methods cannot be applied • The defining equation is underdetermined • There is not enough data to recover the unknown parameters β • When N = k and the function f is linear • The equation Y = f(X, β) can be solved exactly without approximation • There are N equations to solve the N components in β • The solution is unique as long as the X components are linearly independent • If f is nonlinear, many solutions may exist, or no solution at all
Regression Analysis (cont.) • In general, consider the situation with N > k data points • There is enough information in the data to estimate a unique value for β under this overdetermined situation • The measurement errors εi follow a normal distribution • There exists an excess of information contained in the (N − k) extra measurements • Known as the degrees of freedom of the regression • Regression with a Necessary Set of Independent Measurements • Need the necessary number of independent data to perform the regression analysis of continuous data measurements
Regression Analysis (cont.) • Consider a regression model with four unknown parameters, 𝛽0, 𝛽1, 𝛽2 and 𝛽3 • An experimenter performs 10 measurements • All at exactly the same value of the independent variable vector X = (X1, X2, X3, X4) • Regression analysis fails to give a unique set of estimated values for the four unknown parameters • There is not enough information to perform the prediction • One can only estimate the average value and the standard deviation of the dependent variable Y • Measuring at two different values of X • Only gives enough data for a regression with two unknowns, but not for three or more unknowns • Only if measurements are performed at four (or more) different values of the independent variable vector X
Regression Analysis (cont.) • Regression analysis will provide a unique set of estimates for the four unknown parameters in β • Basic assumptions on regression analysis under various error conditions • The sample is representative of the data space involved • The error is a random variable with a mean of zero conditioned over the input variables • The independent variables are measured with no error • The predictors are linearly independent • The errors are uncorrelated • The variance of error is a constant across observations
Linear Regression • Regression analysis includes linear regression and nonlinear regression • Unitary linear regression analysis • Only one independent variable and one dependent variable are included in the analysis • The approximate representation of the relation between the two can be made with a straight line • Multivariate linear regression analysis • Two or more independent variables are included in the regression analysis • Linear relation between the dependent variable and independent variables • The model of a linear regression is y = f(X)
Linear Regression (cont.) • X = (x1, x2,⋯, xn) with n ≥ 1 is a multidimensional vector and y is a scalar variable • f(X) is a linear predictor function used to estimate the unknown parameters from data • Linear regression is applied mainly in two areas • An approximation process for prediction, forecasting, or error reduction • Predictive linear regression models are fitted to an observed data set of y and X values • The fitted model makes a prediction of the value of y for a future unknown input vector X • To quantify the strength of the relationship between output y and each input component Xj
Linear Regression (cont.) • Assess which Xj is irrelevant to y and which subsets of the Xj contain redundant information about y • Major steps in linear regression are sketched below
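One way to sketch these major steps is with scikit-learn (a library choice made for this illustration, not named in the slides): prepare the observed (X, y) samples, fit the linear model, check the fitting degree, then predict for unseen inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Step 1: prepare observed samples (X, y); the values here are illustrative.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Step 2: fit the linear predictor f(X) to the training data.
model = LinearRegression().fit(X, y)

# Step 3: check the fitting degree with the coefficient of determination R².
print(r2_score(y, model.predict(X)))

# Step 4: predict y for a future, previously unseen input.
print(model.predict([[6.0]]))
```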
Unitary Linear Regression • Crickets chirp more frequently on hotter days than on cooler days
Unitary Linear Regression (cont.) • Consider a set of data points in a 2D sample space (x1, y1), (x2, y2), ..., (xn, yn) • Mapped into a scatter diagram • If they can be covered approximately by a straight line: y = ax + b + ε • x is an input variable, y is an output variable in the real number range, a and b are coefficients • ε is a random error that follows a normal distribution with mean E(ε) = 0 and variance Var(ε) = σ² • Need to work out the expectation by using a linear regression expression: y = ax + b • The main task is to conduct estimations for coefficients a and b via observation on n groups of input samples
Unitary Linear Regression (cont.) • Fit linear regression models with a least squares approach • The approximation is shown by a straight line • Passing amid the middle or center of all data points in the data space • The residual error (loss) of a unitary model is εi = yi − (axi + b) for each sample point (xi, yi)
Unitary Linear Regression (cont.) • The convex objective function is given by Q(a, b) = Σi (yi − axi − b)² • To minimize the sum of squares, need to calculate the partial derivatives of Q with respect to a and b, and make them zero • Solving the resulting two equations gives â = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² and b̂ = ȳ − âx̄ • x̄ and ȳ are the mean values for the input variable and the dependent variable, respectively
Unitary Linear Regression (cont.) • After working out the specific expression for the model • Need to know the fitting degree to the dataset • Whether the expression can express the relation between the two variables and can be used in actual predictions • To figure out the estimated value of the dependent variable with ŷi = âxi + b̂ • For each sample in the training data set • The fitting degree is measured by the coefficient of determination R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²
Unitary Linear Regression (cont.) • The closer the coefficient of determination R² is to 1, the better the fitting degree is • The further R² is away from 1, the worse the fitting degree is • Linear regression can also be used for classification • It is only used in a binary classification problem • To decide between the two classes • For multivariate linear regression, this method is also applied to classify a dataset
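A minimal NumPy sketch of the closed-form estimates â and b̂ and of R², assuming illustrative sample points rather than any dataset from the slides.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Least-squares estimates obtained from the zeroed partial derivatives of Q(a, b).
a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - a * x_bar

# Coefficient of determination R²: the closer to 1, the better the fitting degree.
y_hat = a * x + b
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y_bar) ** 2)

print(a, b, r2)
```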
Unitary Linear Regression (cont.) • Healthcare Data Analysis • Obesity is reflected by the body weight index • Obese people are more likely to have high blood pressure or diabetes • Predict the relationship between obesity and high blood pressure • The dataset covers the body weight index and blood pressure of some people at a hospital in Wuhan
Unitary Linear Regression (cont.) • Conduct a preliminary judgment on what the blood pressure of a person with a body weight index of 24 would be • A prediction model with two variables • The unitary linear regression may be considered • Determine the distribution of the data points • Scatter diagram for body weight index versus blood pressure
Unitary Linear Regression (cont.) • All data points are almost on or below the straight line • Being linearly distributed • The data space is modeled by a unitary linear regression process • By the least squares method • We get a = 1.32 and b = 96.58 • Therefore we have y = 1.32x + 96.58 • A significance test is needed to verify whether the model fits well with the current data • A prediction is made through calculation • The mean residual and coefficient of determination of the model are: average error 1.17 and R² = 0.90
Unitary Linear Regression (cont.) • The mean residual is much less than the mean value 125.6 of blood pressure • The coefficient of determination is close to 1 • This regression equation is significant • Can fit the dataset well • Predictions may be conducted for unknown data on this basis • Given the body weight index, the value of blood pressure of a person may be determined with the model • Substitute 24 for x • The value of blood pressure of that person is y = 1.32 × 24 + 96.58 ≈ 128
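Plugging the reported coefficients a = 1.32 and b = 96.58 into the fitted line reproduces this prediction; the snippet below is just the slide's arithmetic in code.

```python
a, b = 1.32, 96.58   # coefficients reported for the weight-index / blood-pressure fit

def predict_blood_pressure(weight_index):
    """Fitted unitary linear regression model y = a*x + b."""
    return a * weight_index + b

print(predict_blood_pressure(24))  # 128.26, i.e., about 128
```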
Multiple Linear Regression • When solving actual problems • We often encounter many variables • e.g., The scores of a student may be influenced by factors like earnestness in class, preparation before class, and review after class • e.g., The health of a man is not only influenced by the environment, but also related to dietary habits • The model of unitary linear regression is not suited to many such conditions • Improve it with a model of multivariate linear regression analysis • Consider the case of m input variables • The output is expressed as a linear combination of the input variables: y = 𝛽0 + 𝛽1x1 + ⋯ + 𝛽mxm + ε
Multiple Linear Regression (cont.) • 𝛽0, 𝛽1,⋯, 𝛽m, 𝜎2 are unknown parameters • ε complies with a normal distribution • The mean value is 0 and the variance is equal to 𝜎2 • By working out the expectation for this structure, get the multivariate linear regression equation E(y) = 𝛽0 + 𝛽1x1 + ⋯ + 𝛽mxm • With y substituted for E(y) • Its matrix form is given as E(y) = X𝛽 • X = [1, x1,⋯, xm], 𝛽 = [𝛽0, 𝛽1,⋯, 𝛽m]T • Our goal is to compute the coefficients by minimizing the objective function
Multiple Linear Regression (cont.) • Defined over n sample data points: Q(𝛽) = Σi (yi − Xi𝛽)², where Xi = [1, xi1,⋯, xim] is the i-th sample • To minimize Q, need to make the partial derivative of Q with respect to each βi zero • The multiple linear regression equation is then ŷ = 𝛽0 + 𝛽1x1 + ⋯ + 𝛽mxm with the least-squares solution 𝛽 = (XᵀX)⁻¹Xᵀy in matrix form
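A sketch of this least-squares estimation in NumPy, solving the normal equations (XᵀX)β = Xᵀy over an illustrative design matrix X = [1, x1, x2] that is not taken from the slides.

```python
import numpy as np

# Illustrative samples with m = 2 input variables.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, 15.0])

# Prepend the constant column so that β = [β0, β1, β2]ᵀ and E(y) = Xβ.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Minimizing Q(β) = Σ (yi − Xiβ)² leads to the normal equations (XᵀX)β = Xᵀy.
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(beta)      # estimated coefficients [β0, β1, β2]
print(X @ beta)  # fitted values ŷ for the training samples
```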
Multiple Linear Regression (cont.) • Multivariate regression is an expansion and extension of unitary regression • Identical in nature • The range of applications is different • Unitary regression has limited applications • Multivariate regression is applicable to many real-life problems • Estimate the Density of Pollutant Nitric Oxide in a Spotted Location • Estimation of the density of nitric oxide (NO) gas, an air pollutant, in an urban location • Vehicles discharge NO gas during their movement
Multiple Linear Regression (cont.) • Creates a pollution problem proven harmful to human health • The NO density is attributed to four input variables • Vehicle traffic, temperature, air humidity, and wind velocity • 16 data points are collected at various observed spotted locations in the city • Apply the multiple linear regression method to estimate the NO density • The testing spotted location is measured with a data vector of {1436, 28.0, 68, 2.00} for the four features {x1, x2, x3, x4}, respectively • Xn = [1, xn1, xn2, xn3, xn4]T and the weight vector W = [b, β1, β2, β3, β4]T for n = 1, 2, …, 16
Multiple Linear Regression (cont.) • e.g., For the first row of training data, [1300, 20, 80, 0.45, 0.066], X1 = [1, 1300, 20, 80, 0.45]T, which gives the output value y1 = 0.066 • Need to compute W = [b, β1, β2, β3, β4]T and minimize the mean square error • The 16 × 5 matrix X is directly obtained from the sample data table • y = [0.066, 0.005, …, 0.039]T is the given column vector of data labels
Multiple Linear Regression (cont.) • To make the prediction on the testing sample vector x = [1, 1436, 28.0, 68, 2.00]T • By substituting the weight vector obtained • The final answer is {β1 = 0.029, β2 = 0.015, β3 = 0.002, β4 = −0.029, b = 0.070} • The NO gas density is predicted as ŷ = 0.065, or 6.5%
Logistic Regression Method • Many problems require a probability estimate as output • Logistic regression is an extremely efficient mechanism for calculating probabilities • Commonly used in fields like data mining, automatic diagnosis for diseases, and economic predictions • The logistic model may be used to solve problems of binary classification • In solving a classification problem • The inputs are divided into two or more classes • The learner must produce a model that assigns unseen inputs to one or more of these classes • Typically tackled in a supervised way
Logistic Regression Method (cont.) • Spam filtering is a good example of classification • The inputs are e-mails, blogs, or document files • The output classes are spam and non-spam • For logistic regression classification • The principle is to conduct classification of sample data with a logistic function σ(z) = 1 / (1 + exp(−z)) • Maps the logistic regression output to probabilities • Known as a sigmoid function • The input domain of the sigmoid function is (-∞, +∞) and the range is (0, 1) • Can regard the sigmoid output as a probability for the sample data
Logistic Regression Method (cont.) • The function curve is sensitive near z = 0 • And not sensitive if z ≫ 0 or z ≪ 0
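A small sketch of the sigmoid showing that its outputs stay in (0, 1) and that the curve changes quickly near z = 0 but saturates for large |z|; the sample z values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Maps any real z into the probability range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(z, sigmoid(z))   # ≈ 0.0025, 0.12, 0.5, 0.88, 0.9975
```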
Logistic Regression Method (cont.) • The basic idea for logistic regression • Sample data may be concentrated at both ends of the sigmoid's output range by the use of an intermediate feature z of the sample • Can be divided into two classes • Consider vector X = (x1,⋯, xm) with m independent input variables • Each dimension of X stands for one attribute (feature) of the sample data (training data) • Multiple features of the sample data are combined into one feature z • Figure out the probability of the z feature with the designated data
Logistic Regression Method (cont.) • And apply the sigmoid function to act on that feature • Obtain the expression for the logistic regression: y′ = 1 / (1 + exp(−(β0 + β1x1 + ⋯ + βmxm))) • During the combining of multiple features into one feature
Logistic Regression Method (cont.) • Make use of the linear function z = β0 + β1x1 + ⋯ + βmxm • The coefficients of the linear function, i.e., the feature weights of the sample data, need to be determined • Maximum likelihood estimation (MLE) is adopted to transform this into an optimization problem • Attempts to find the parameter values that maximize the likelihood function, given the observations • The coefficients are determined through the optimization method • The loss function is Log Loss: Σ(x,y)∈D −y log(y′) − (1 − y) log(1 − y′) • D is the data set containing many labeled examples, i.e., (x, y) pairs
Logistic Regression Method (cont.) • y is the label in a labeled example and its value must be either 0 or 1 • y′ is the predicted value, somewhere between 0 and 1, given the set of features in x • Minimizing this negative logarithm of the likelihood function yields a maximum likelihood estimate • Logistic regression returns a probability • To map a regression value to a binary category, one must define a classification or decision threshold • Thresholds are problem-dependent • It is tempting to assume the threshold should always be 0.5, but its value must be tuned • Part of choosing a threshold is assessing how much one will suffer for making a mistake
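A minimal sketch of the Log Loss computation over labeled (x, y) pairs, assuming the y′ probabilities have already been produced by the sigmoid; the arrays are illustrative.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])            # labels y, each either 0 or 1
y_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # predicted probabilities y'

# Log Loss: the negative log-likelihood accumulated over the labeled data set D
# (shown here averaged per example).
log_loss = np.mean(-y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred))
print(log_loss)
```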
Logistic Regression Method (cont.) • General steps for logistic regression • Accuracy is one metric for evaluating classification models • The fraction of predictions the model gets right
Logistic Regression Method (cont.) • Four possible statuses for binary classification • TP (True Positive) refers to an outcome where the model correctly predicts the positive class • TN (True Negative) means an outcome where the model correctly predicts the negative class • FP (False Positive) is an outcome where the model incorrectly predicts the positive class • FN (False Negative) is an outcome where the model incorrectly predicts the negative class
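A short sketch that counts the four statuses and derives accuracy from them, assuming an illustrative decision threshold of 0.5 on made-up predicted probabilities.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.8, 0.3, 0.6, 0.2, 0.1, 0.7, 0.9, 0.4])

y_pred = (y_prob >= 0.5).astype(int)         # illustrative decision threshold of 0.5

tp = np.sum((y_pred == 1) & (y_true == 1))   # correctly predicted positives
tn = np.sum((y_pred == 0) & (y_true == 0))   # correctly predicted negatives
fp = np.sum((y_pred == 1) & (y_true == 0))   # incorrectly predicted positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # incorrectly predicted negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of predictions the model gets right
print(tp, tn, fp, fn, accuracy)              # 3 3 1 1 0.75
```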