
What Semantic Web researchers need to know about Machine Learning?


Presentation Transcript


  1. What Semantic Web researchers need to know about Machine Learning? http://analytics.ijs.si/events/Tutorial-MachineLearningForSemanticWeb-ISWC2007-Busan-Nov2007/ Marko Grobelnik, Blaž Fortuna, Dunja Mladenić

  2. How does Semantic Web compare to Machine Learning? (1/2) • Semantic Web is traditionally a non-analytic area of research… • …it includes a lot of logic, some language technologies, some standardization, etc. • Generally, manual encoding of knowledge is still the most appreciated approach… • …and the main paradigm is top-down. • Top-down approaches enable accuracy when designing knowledge bases, but they can cost a lot and are mostly not scalable

  3. How does Semantic Web compare to Machine Learning? (2/2) • Machine Learning, on the other hand, is almost exclusively analytic & bottom-up… • …its analytic techniques have the main goal of extracting knowledge from the data… • …with little or no human involvement. • The cost per encoded piece of knowledge is lower and the achievable scale is high, … • …but the techniques are limited to what is computable in a reasonable time.

  4. What are the stitching points between SW and ML? • …SW can contribute to ML and vice versa • Machine Learning for Semantic Web • …decrease the cost of extracting knowledge and building ontologies, speed up reasoning, improve scalability • Semantic Web for Machine Learning • …Semantic Web techniques can provide new insights into the data (e.g., via reasoning) which are unobservable by statistical means

  5. Part 1: Understanding Machine Learning in simple terms • Sub-areas of machine learning • What are the constituents of a machine learning algorithm? • What kinds of problems can one try to solve with ML methods? • Why is ML easy and why is it hard?

  6. Sub-areas of machine learning • Data Mining • Main events: ACM KDD, IEEE ICDM, SIAM SDM • Books: Witten & Frank, 1999 • AI-style Machine Learning • Main events: ICML, ECML/PKDD • Books: Mitchell, 1997 • Statistical Machine Learning • Main events: NIPS, UAI • Books: Bishop 2007, Duda & Hart & Stork 2000 • Theoretical Machine Learning • Main events: COLT, ALT • Books: Vapnik 1999

  7. Which areas are contributing? [Diagram: Machine Learning draws on Data Mining (analysis of databases), AI-style ML (learning with different representations), COLT (theory of learning), and Statistics (statistical data analysis).]

  8. What is “Learning” in Machine Learning? • Herbert Simon: “Learning is any process by which a system improves performance from experience.” • Improve on task T, with respect to performance metric P, based on experience E: • T: Recognizing hand-written words • P: Percentage of words correctly classified • E: Database of human-labeled images of handwritten words • T: Driving on four-lane highways using vision sensors • P: Average distance traveled before a human-judged error • E: A sequence of images and steering commands recorded while observing a human driver • T: Categorize email messages as spam or legitimate • P: Percentage of email messages correctly classified • E: Database of emails, some with human-given labels

  9. What are the constituents of a machine learning algorithm? • A typical ML algorithm includes: • Input data in some form (most often vectors) • Input parameters (settings) • Fitting the data into a model using some search algorithm • Outputting the model in the form y = f(x) • Using the model for classification of unseen data • …the key elements are: • Language of the model – determines the complexity of the model • …a popular language for many algorithms is the linear model (used in SVM, Perceptron, Bayes, ...): y = a x1 + b x2 + c x3 + … • Search algorithm – determines the quality of the result • …most often it is some kind of local optimization
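A minimal sketch of these constituents in Python (not from the tutorial; the toy vectors, the learning-rate parameter, and the perceptron-style search are illustrative assumptions):

```python
# Constituents from the slide: input vectors, an input parameter, a search
# algorithm that fits a linear model y = f(x), and use of f on unseen data.
# All data below is invented for illustration.

X = [(1.0, 2.0), (2.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]  # input vectors
y = [+1, +1, -1, -1]                                      # class labels

learning_rate = 0.1   # input parameter (setting)
w, b = [0.0, 0.0], 0.0

# search algorithm: perceptron updates, a simple local optimization
for _ in range(100):
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        if yi * score <= 0:   # misclassified -> adjust the linear model
            w = [wj + learning_rate * yi * xj for wj, xj in zip(w, xi)]
            b += learning_rate * yi

# the output model y = f(x), applied to an unseen example
def f(x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

print(f((1.5, 1.5)))   # predicted class for a new vector
```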

  10. What kinds of problems can one try to solve with ML methods? • Generally, machine learning is about finding patterns and regularities in the data… • …the result is always a model, which can be understood as a summary of the data used for learning it • The resulting model can be used for: • Explaining existing phenomena in the data • Predicting future situations

  11. Why is ML easy and why is it hard? • ML is easy: • …because it seems that God is not playing dice, and Nature usually behaves nicely and is not chaotic • With ML techniques we try to discover the rules by which Nature generated the observed data. • Simple algorithmic approaches can bring a lot of success in understanding the observed data. • ML is hard: • …because we don't know the language God used when programming the Universe • There are many different ways to represent similar concepts… • …we need to deal with scale, noise, dynamics, etc., and it is hard to find the right way to model the observed data

  12. Part 2: Main approaches in Machine Learning • Supervised learning • Semi-supervised learning • Unsupervised learning

  13. When to apply each of the approaches? • …let's take an example from medicine: • we have patients who have symptoms (inputs) and a diagnosis (output) • Supervised learning (classification) • …given symptoms and the corresponding diagnoses for many patients, the goal is to find rules which map/predict symptoms to a diagnosis for an unseen patient • Semi-supervised learning (transduction, active learning) • …given symptoms for many patients but diagnoses for only a few of them, leverage the labeled ones to find the most probable diagnoses for all the patients • Unsupervised learning (clustering, decompositions) • …given only symptoms for many patients, find groups of similar patients (e.g., with possibly similar diagnoses)

  14. Supervised learning Assign an object to a given finite set of classes: • Medical diagnosis • …assign a diagnosis to a patient • Credit card applications or transactions • …assign a credit score to an applicant • Fraud detection in e-commerce • …decide whether an event in a business process is fraud or not • Financial investments • …decide whether to buy, sell, or hold on a stock exchange • Spam filtering of e-mails • …decide whether an email is spam or a regular email • Recommending articles in a newspaper • …decide whether an article fits the user profile • Semantic/linguistic annotation • …assign a semantic or linguistic annotation to a word or phrase

  15. Recommending News Articles [Diagram: labeled articles (training examples) → Machine Learning → Classifier; a new article (test example) is fed to the Classifier, which outputs the predicted article class (interesting or not).]

  16. Supervised learning • Given: a set of labeled examples represented by feature vectors • The goal: build a model approximating the target function which automatically assigns the right label to a new unlabeled example • Feature values: • discrete (e.g., eyes_color ∈ {brown, blue, green}) • continuous (e.g., age ∈ [0..200]) • ordered (e.g., size ∈ {small, medium, large}) • Values of the target function – the labels – can be: • discrete (classification) or continuous (regression) • mutually exclusive (e.g., medical diagnosis) or not (e.g., a document can be about both arts and computer science) • related by some predefined structure (a taxonomy of document categories, e.g., DMoz or Medline) • The target function can be represented in different ways (stored examples, symbolic, numerical, graphical, …) and modeled using different algorithms

  17. Illustrative example Recommending a cartoon for a 5-year-old boy • Main characters ∈ {vehicle, human, animal} • Duration ∈ [5..90]
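One plausible way to encode these two features as a vector for a learning algorithm (the one-hot and scaling choices below are assumptions added here, not part of the tutorial):

```python
# discrete feature -> one-hot vector; continuous feature -> scaled to [0,1]
CHARACTERS = ["vehicle", "human", "animal"]
MAX_DURATION = 90.0   # upper bound taken from the slide's [5..90] range

def encode(main_characters, duration):
    one_hot = [1.0 if c == main_characters else 0.0 for c in CHARACTERS]
    return one_hot + [duration / MAX_DURATION]

# a labeled training example: feature vector plus discrete class label
print(encode("animal", 5), "interesting")   # [0.0, 0.0, 1.0, 0.055...]
```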

  18. Target function There is a trade-off between the expressiveness of a representation and the ease of learning • The more expressive a representation, the better it will be at approximating an arbitrary function; however, more examples will be needed to learn an accurate function • Illustrative example • Values of the target function: discrete labels (classification), mutually exclusive • Interesting movie: [Table: the eight labeled training cartoons (main characters, duration → interesting or not).]

  19. Possible data visualization [Plot: the examples arranged by the features vehicles = yes/no and human = yes/no.] A possible model for "not interesting": (vehicles = no) and (human = yes)

  20. Generalization • The model must generalize the data to correctly classify yet unseen examples (ones which don't appear in the training data) • A lookup table of training examples is a consistent model that does not generalize • an example that was not in the training data cannot be classified • Occam's razor: finding a simple model helps ensure generalization

  21. Ockham (Occam)'s Razor • William of Ockham (1295-1349) was a Franciscan friar who applied the criterion to theology: • "Entities should not be multiplied beyond necessity" (classical version, but not an actual quote) • "The supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." (Einstein) • Applying the razor requires a precise definition of simplicity • It assumes that nature itself is simple

  22. Algorithms for learning classification models • Storing examples: Nearest Neighbour • Symbolic: Decision trees; rules in propositional or first-order logic • Numerical: Perceptron algorithm; Winnow algorithm; Support Vector Machines; Logistic Regression • Probabilistic graphical models: Naive Bayesian classifier; Hidden Markov Models

  23. Nearest neighbor • Stores the training examples without generating any generalization • Simple, but requires efficient storage • Classifies by comparing the example to the stored training examples and estimating the class based on the classes of the most similar ones • The similarity function is crucial • Also known as: Instance-based, Case-based, Exemplar-based, Memory-based, or Lazy Learning

  24. Similarity/Distance • For continuous features, use Euclidean distance • For discrete features, assume the distance between two values is 0 if they are the same and 1 if they are different (e.g., Hamming distance for bit vectors) • To compensate for differences in units across features, scale all continuous values to the interval [0,1]
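A small sketch of such a mixed distance and its use for nearest-neighbour classification (the feature layout and the two stored examples are invented for illustration):

```python
import math

def distance(a, b, discrete):
    """Euclidean over continuous features (pre-scaled to [0,1]) plus a
    0/1 Hamming-style term for each discrete feature."""
    total = 0.0
    for i, (ai, bi) in enumerate(zip(a, b)):
        if i in discrete:
            total += 0.0 if ai == bi else 1.0
        else:
            total += (ai - bi) ** 2
    return math.sqrt(total)

# classification: take the class of the most similar stored example
train = [(("vehicle", 0.1), "interesting"), (("human", 0.9), "not")]
test = ("animal", 0.2)
nearest = min(train, key=lambda ex: distance(test, ex[0], discrete={0}))
print(nearest[1])   # -> "interesting"
```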

  25. Nearest neighbor

  26. Nearest neighbor - example [Figure: the model is the set of stored training examples, shown with their feature values.]

  27. Nearest neighbor - example [Figure: feature-by-feature distances from the test example to the eight stored examples: 1+0+1+1=3, 1+1+1+1=4, 1+1+0+1=3, 1+1+0+0=2, 1+1+1+1=4, 1+0+1+1=3, 1+0+1+1=3, 1+0+1+1=3; the nearest example is the one at distance 2.]

  28. Decision trees • Recursively split the set of examples until (almost) all examples in a subset have the same class value • start with the set of all examples • find the feature giving the best split • split the examples according to the feature's values • repeat for each subset of examples • Classification by "pushing" an example from the root to a leaf and assigning the class value of that leaf

  29. Decision tree - example • Candidate splits at the root: characters ∈ {vehicle, animal, human}, or duration binned as ≤10, >10 and ≤30, >30 and <90, ≥90 • Class entropy at the root (5 interesting vs. 3 not): 0.63*0.68 + 0.38*1.41 = 0.95 • Splitting on characters yields pure subsets (3 vehicle, 2 animal, 3 human): InfGain = 0.95 - (0+0+0) = 0.95 • Splitting on duration leaves one impure bin of 3 examples (classes 1 vs. 2, entropy 0.33*1.58 + 0.67*0.58 = 0.91): InfGain = 0.95 - (0 + 0 + 0.38*0.91 + 0) = 0.61
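The entropy and information-gain numbers above can be checked in a few lines (a verification sketch written for this transcript; the subset counts are read off the slide):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

H_root = entropy([5, 3])              # 5 interesting vs. 3 not -> ~0.95

# split on characters: every subset is pure, so nothing remains
gain_characters = H_root - 0.0        # = 0.95

# split on duration: only one bin (3 of the 8 examples, classes 1 vs. 2)
# is impure, weighted by its share of the examples
gain_duration = H_root - (3 / 8) * entropy([1, 2])   # ~0.61

print(round(gain_characters, 2), round(gain_duration, 2))   # 0.95 0.61
```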

  30. Decision tree - example [Tree after the split: the root tests characters; vehicle → interesting (3 examples), animal → interesting (2), human → not interesting (3).]

  31. Decision tree model

  32. If-then Rules Generating rules by adding conditions: • ∧ (AND) – restricts the rule (fewer examples match) • ∨ (OR) – generalizes the rule (more examples match) • Maximize the quality of each rule (e.g., matching examples are of the same class) while aiming to describe all the examples with the set of rules

  33. If-then Rules [Tree from the previous slide.] Converting a tree to rules: • If (Main characters = vehicles) then interesting • If (Main characters = human) then uninteresting • If (Main characters = animals) then interesting • …

  34. Support Vector Machine Learns a hyperplane in a higher-dimensional space that separates the training data with the highest margin • The mapping of the original feature space into the higher-dimensional space is implicit, via a so-called kernel function (e.g., linear, polynomial, …) • Regarded as state-of-the-art in text document classification [Figure: positive and negative classes separated by a hyperplane; the margin is the distance to the closest examples.]
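A minimal sketch of the idea using scikit-learn's SVC (the library and the tiny dataset are additions for illustration, not part of the tutorial); the kernel argument selects the implicit mapping mentioned above:

```python
from sklearn.svm import SVC

# invented 2-D toy data: two negative and two positive examples
X = [[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]

clf = SVC(kernel="linear")        # maximum-margin separating hyperplane
clf.fit(X, y)

print(clf.predict([[0.1, 0.2], [0.8, 0.9]]))   # -> [0 1]
print(clf.support_vectors_)       # the examples that define the margin
```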

  35. Linear Model

  36. Naïve Bayes Determine the class of example e_k as the class c_i maximizing P(c_i) P(e_k|c_i): • P(c_i) – estimated from the data using frequency: (no. of examples with class c_i) / (no. of all examples) • P(e_k|c_i) – too many possibilities to estimate directly (all combinations of feature values)… • …so assume the features x_j are independent given the class: P(e_k|c_i) = P(x_1|c_i) * P(x_2|c_i) * …

  37. Naïve Bayes - example Model (class icons from the slide replaced by "interesting" / "not"): • P(interesting) = 5/8, P(not) = 3/8 • P(vehicle|interesting) = 0.6, P(animal|interesting) = 0.4, P(human|interesting) = 0 • P(vehicle|not) = 0, P(animal|not) = 0, P(human|not) = 1 • P(5|interesting) = 0.2, P(10|interesting) = 0.2, P(60|interesting) = 0.2, P(90|interesting) = 0.4 • P(20|not) = 0.33, P(30|not) = 0.33, P(90|not) = 0.33 • all remaining duration probabilities are 0 • Classifying a new example (animal, duration 5): • P(interesting|e) ∝ 5/8*(0.4*0.2) = 0.05 • P(not|e) ∝ 3/8*(0*0) = 0
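These estimates can be recomputed from raw counts. The eight training examples below are reconstructed to be consistent with the slide's probabilities, so treat them as illustrative rather than the tutorial's actual data:

```python
from collections import Counter

data = [("vehicle", 10, "+"), ("vehicle", 60, "+"), ("vehicle", 90, "+"),
        ("animal", 5, "+"), ("animal", 90, "+"),
        ("human", 20, "-"), ("human", 30, "-"), ("human", 90, "-")]

prior = Counter(cls for _, _, cls in data)   # 5 "+" vs. 3 "-"

def p(index, value, cls):
    """MLE estimate of P(value | cls) from frequencies."""
    in_class = [ex for ex in data if ex[2] == cls]
    return sum(ex[index] == value for ex in in_class) / len(in_class)

def score(cls, characters, duration):
    # P(c) * P(characters|c) * P(duration|c), features assumed independent
    return prior[cls] / len(data) * p(0, characters, cls) * p(1, duration, cls)

print(score("+", "animal", 5))   # 5/8 * 0.4 * 0.2 = 0.05
print(score("-", "animal", 5))   # 3/8 * 0 * 0 = 0
```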

  38. Generative Probabilistic Models • Assume a simple (usually unrealistic) probabilistic method by which the data was generated • Each class value has a different parameterized generative model that characterizes it • Training: use the data for each category to estimate the parameters of the generative model for that category • Maximum Likelihood Estimation (MLE): set the parameters so as to maximize the probability that the model produced the given training data • If M_λ denotes a model with parameter values λ and D_k is the training data for the k-th class, find the model parameters for class k (λ_k) that maximize the likelihood of D_k: λ_k = argmax_λ P(D_k | M_λ) • Testing: use Bayesian analysis to determine the category model that most likely generated a specific test instance

  39. Semi-supervised learning Similar to supervised learning, except that • we have examples and only some of them are labeled • we may have a human available for a limited time to provide labels for examples • …this corresponds to the situation where all the patients in the database have symptoms, but only a few have a diagnosis • …and occasionally we have doctors available for a limited time to answer questions about the patients

  40. Using unlabeled data (Nigam et al., 2000) • Given: a small number of labeled examples and a large pool of unlabeled examples, no human available • e.g., classifying news articles as interesting or not interesting • Approach description (EM + Naive Bayes): • train a classifier with the labeled documents only • assign probabilistically-weighted class labels to the unlabeled documents • train a new classifier using all the documents • iterate until the classifier remains unchanged
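A rough sketch of this loop with scikit-learn's MultinomialNB on invented word-count vectors; note that Nigam et al. use probabilistically-weighted labels, which this sketch approximates by weighting hard labels with their predicted probabilities:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_lab = np.array([[2, 0], [0, 3]])            # tiny invented count vectors
y_lab = np.array([0, 1])
X_unl = np.array([[1, 0], [0, 1], [3, 1]])

clf = MultinomialNB().fit(X_lab, y_lab)       # 1. labeled documents only
for _ in range(10):                           # 4. iterate to convergence
    proba = clf.predict_proba(X_unl)          # 2. estimate class labels
    y_unl = proba.argmax(axis=1)
    weights = np.concatenate([np.ones(len(y_lab)), proba.max(axis=1)])
    clf = MultinomialNB().fit(                # 3. retrain on all documents
        np.vstack([X_lab, X_unl]),
        np.concatenate([y_lab, y_unl]),
        sample_weight=weights)

print(clf.predict(X_unl))
```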

  41. Using Unlabeled Data with Expectation-Maximization (EM) • Initialize: learn a Naive Bayes classifier from the labeled documents only • E-step: estimate the labels of the unlabeled documents • M-step: use all documents to rebuild the classifier • Iterating the E- and M-steps guarantees convergence to locally maximum-a-posteriori parameters

  42. Co-training (Blum & Mitchell, 1998) Theory behind co-training • It is possible to learn from unlabeled examples • The value of the unlabeled data depends on: • how (conditionally) independent the two representations of the same data are • the more independent, the better • the number of redundant inputs (features) • the expected error decreases exponentially with this number • Disagreement on the unlabeled data predicts the true error • Gives better performance on labeling unlabeled data than the EM approach
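A simplified sketch of the co-training loop on invented two-view data (GaussianNB, the data generator, and the one-example-per-round schedule are assumptions; Blum & Mitchell also grow the labeled pool class-proportionally):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 5)), rng.normal(size=(100, 5))  # two views
y = (X1[:, 0] + X2[:, 0] > 0).astype(int)    # labels, mostly hidden below
labeled, unlabeled = list(range(10)), list(range(10, 100))

c1, c2 = GaussianNB(), GaussianNB()
for _ in range(5):
    c1.fit(X1[labeled], y[labeled])
    c2.fit(X2[labeled], y[labeled])
    # each classifier labels its most confident unlabeled example
    for clf, X in ((c1, X1), (c2, X2)):
        proba = clf.predict_proba(X[unlabeled])
        best = int(np.argmax(proba.max(axis=1)))
        idx = unlabeled.pop(best)
        y[idx] = proba[best].argmax()        # trust the confident prediction
        labeled.append(idx)

print(len(labeled), "labeled examples after co-training")
```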

  43. Bootstrap Learning to Classify Web Pages • Given: a set of documents, few labeled and many unlabeled, where each document is described by two independent sets of features (e.g., the document text + the anchor text of hyperlinks to the document) [Diagram: a Page Classifier trained on the document text and a Link Classifier trained on the anchor text of hyperlinks pointing to the document.]

  44. Active Learning

  45. Active Learning • We use these methods whenever hand-labeled data are rare or expensive to obtain • Interactive method: requests labels only for "interesting" objects • Much less human work is needed for the same result, compared to labeling arbitrary examples [Diagrams: a passive student receives data & labels from the teacher; an active student queries the teacher, which answers with labels. Plot of performance vs. number of questions: the active student asking smart questions learns faster than the passive student asking random questions.]

  46. Approaches to Active Learning • Uncertainty sampling (efficient) • select the example closest to the decision hyperplane (or the one with classification probability closest to P=0.5) [Tong & Koller 2000] • Maximum margin ratio change • select the example with the largest predicted impact on the margin size if selected [Tong & Koller 2000] • Monte Carlo Estimation of Error Reduction • select the example that reinforces our current beliefs [Roy & McCallum 2001] • Random sampling as a baseline • Experimental evaluation (using the F1-measure) of the four listed approaches on three categories from the Reuters-2000 dataset • averaged over 10 random samples of 5,000 training (out of 500k) and 10k testing (out of 300k) examples • two of the methods are rather time-consuming, so we ran them only up to the first 50 unlabeled examples included • the experiments show that active learning is especially useful for unbalanced data
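A sketch of the first approach, uncertainty sampling, on invented data: at each step the learner queries the unlabeled example whose predicted probability is closest to P=0.5, i.e., the one closest to the decision hyperplane (logistic regression and the synthetic oracle stand in for the SVM and the human labeler):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_pool = rng.normal(size=(200, 2))             # unlabeled pool
X_lab = [[-2.0, 0.0], [2.0, 0.0]]              # one seed example per class
y_lab = [0, 1]

clf = LogisticRegression().fit(X_lab, y_lab)
for _ in range(10):                            # ten labeling requests
    proba = clf.predict_proba(X_pool)[:, 1]
    query = int(np.argmin(np.abs(proba - 0.5)))   # most uncertain example
    label = int(X_pool[query, 0] > 0)          # oracle replaces the human
    X_lab.append(X_pool[query].tolist())
    y_lab.append(label)
    X_pool = np.delete(X_pool, query, axis=0)
    clf = LogisticRegression().fit(X_lab, y_lab)

print(clf.coef_, clf.intercept_)
```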

  47. [Figure: F1 results for a category with a very unbalanced class distribution (2.7% positive examples); uncertainty sampling seems to outperform MarginRatio.]

  48. Illustration of Active Learning • start with one labeled example from each class (red and blue) • select one example for labeling (green circle) • request its label and re-generate the model using the extended labeled data • Illustration of a linear SVM model using: • arbitrary selection of unlabeled examples (random) • active learning selecting the most uncertain examples (those closest to the decision hyperplane)

  49. Uncertainty sampling of unlabeled example
