
CSC 558 – Data Analytics II, Prep for assignment 1 – Instance-based (lazy) machine learning


Presentation Transcript


  1. CSC 558 – Data Analytics II, Prep for assignment 1 – Instance-based (lazy) machine learning. January 2018

  2. Why instance-based learning?
  • Assignment 1 is modeled after the data sonification sonic-survey & machine listener research of 2015-2017.
  • We sonified data sets by turning instances into sound for classification against 1 of 3 reference sounds.
  • Human subjects listened to 3 reference files as the training set, then classified each instance as 1 of the 3.
  • Humans cannot learn more than a few training sounds quickly.
  • A training set of size 3 is inadequate for many machine learning algorithms. Instance-based approaches approximated human listener performance.
  • See W. Malke's thesis & unpublished 2017 paper on the course page.
  • See the instance-based paper links on the course page.

  3. Properties of instance-based learning, from the 1991 paper on the course page:
  1. they are computationally expensive classifiers, since they save all training instances,
  2. they are intolerant of attribute noise,
  3. they are intolerant of irrelevant attributes,
  4. they are sensitive to the choice of the algorithm's similarity function,
  5. there is no natural way to work with nominal-valued attributes or missing attributes, and
  6. they provide little usable information regarding the structure of the data.

  4. Instance-based approaches in Weka
  • Their papers are linked on our course page.
  • IBk: k-nearest-neighbors classifier.
  • Comes with a variety of distance functions for comparing attribute values.
  • KStar: K* is an instance-based classifier; the class of a test instance is based on the classes of the training instances most similar to it, as determined by some similarity function.
  • K* uses information entropy as its distance measure.
  • LWL: locally weighted learning. Uses an instance-based algorithm to assign instance weights, which are then used by a specified WeightedInstancesHandler.
  • Can do classification (e.g., using naive Bayes) or regression (e.g., using linear regression).

  5. Instance-based learning (Weka ch. 4.7)
  • In instance-based learning, the distance function defines what is learned.
  • Most instance-based schemes use Euclidean distance between two instances a(1) and a(2) with k attributes (formulas reconstructed below).
  • Note that taking the square root is not required when comparing distances.
  • Other popular metric: the city-block (Manhattan) metric, which adds the differences without squaring them.
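The distance formulas on this slide were images in the original deck; the standard forms from the Weka text are:

```latex
% Euclidean distance between instances a^(1) and a^(2) with k attributes
d\bigl(a^{(1)}, a^{(2)}\bigr) = \sqrt{\bigl(a_1^{(1)} - a_1^{(2)}\bigr)^2 + \cdots + \bigl(a_k^{(1)} - a_k^{(2)}\bigr)^2}

% City-block (Manhattan) metric: sum of absolute differences, no squaring
d\bigl(a^{(1)}, a^{(2)}\bigr) = \bigl|a_1^{(1)} - a_1^{(2)}\bigr| + \cdots + \bigl|a_k^{(1)} - a_k^{(2)}\bigr|
```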

  6. Normalization and other issues (Weka ch. 4.7)
  • Different attributes are measured on different scales, so they need to be normalized, e.g., to the range [0,1] (formula reconstructed below), where vi is the actual value of attribute i.
  • Nominal attributes: distance is assumed to be either 0 (values are the same) or 1 (values are different).
  • Common policy for missing values: assumed to be maximally distant (given normalized attributes).
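The normalization formula (an image in the original slide) is the usual min-max scaling:

```latex
% Min-max normalization of attribute i to the range [0,1];
% v_i is the actual value, and min/max are taken over the training instances
a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}
```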

  7. Finding nearest neighbors efficiently (Weka ch. 4.7)
  Simplest way of finding the nearest neighbour: a linear scan of the data.
  Classification then takes time proportional to the product of the number of instances in the training and test sets.
  Nearest-neighbor search can be done more efficiently using appropriate data structures.
  We will discuss two methods that represent the training data in a tree structure: kD-trees and ball trees.
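A minimal linear-scan 1-NN classifier in plain Python, just to make the cost proportional to |train| per query concrete (illustrative only, not Weka's IBk implementation):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_classify(train, query):
    """Linear scan: compare the query against every training instance.

    `train` is a list of (attribute_vector, class_label) pairs.
    Runtime is O(|train|) per query, hence O(|train| * |test|) overall.
    """
    best_label, best_dist = None, float("inf")
    for attrs, label in train:
        d = euclidean(attrs, query)
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label
```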

  8. kD(imensional)-tree example (Weka ch. 4.7)

  9. Using kD-trees: example query ball

  10. More on kD-trees
  Complexity depends on the depth of the tree, which is the logarithm of the number of nodes for a balanced tree.
  The amount of backtracking required depends on the quality of the tree ("square" vs. "skinny" nodes).
  How to build a good tree? We need to find a good split point and split direction.
  Possible split direction: the direction with the greatest variance.
  Possible split point: the median value along that direction.
  Using the value closest to the mean (rather than the median) can be better if the data is skewed.
  This split selection strategy can be applied recursively, just as in decision tree learning.
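A sketch of that construction strategy in plain Python (greatest-variance split direction, median split point; illustrative only, not Weka's implementation):

```python
import statistics

class KDNode:
    def __init__(self, point, split_dim, left, right):
        self.point = point          # the instance stored at this node
        self.split_dim = split_dim  # attribute used to split at this node
        self.left = left
        self.right = right

def build_kd_tree(points):
    """Recursively build a kD-tree from a list of numeric tuples.

    Split direction: the attribute with the greatest variance.
    Split point: the instance whose value is the median along that attribute.
    """
    if not points:
        return None
    dims = len(points[0])
    split_dim = max(range(dims),
                    key=lambda d: statistics.pvariance(p[d] for p in points))
    points = sorted(points, key=lambda p: p[split_dim])
    mid = len(points) // 2
    return KDNode(point=points[mid],
                  split_dim=split_dim,
                  left=build_kd_tree(points[:mid]),
                  right=build_kd_tree(points[mid + 1:]))
```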

  11. Building trees incrementally
  A big advantage of instance-based learning: the classifier can be updated incrementally. Just add the new training instance!
  Can we do the same with kD-trees? A heuristic strategy:
  Find the leaf node containing the new instance.
  Place the instance into the leaf if the leaf is empty.
  Otherwise, split the leaf along its longest dimension (to preserve squareness).
  The tree should be re-built occasionally (e.g., if its depth grows to twice the optimum depth for the given number of instances).
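A compact sketch of that insertion heuristic, using bucket leaves (the `LEAF_CAPACITY` constant and node layout are assumptions for illustration, not Weka's code):

```python
LEAF_CAPACITY = 1   # split a leaf as soon as it holds more than this many points

class Node:
    def __init__(self, points):
        self.points = points       # bucket of instances (leaves only)
        self.split_dim = None      # attribute this node splits on (internal nodes)
        self.split_val = None
        self.left = self.right = None

def insert(root, p):
    """Incrementally add instance p, splitting overfull leaves along their longest dimension."""
    node = root
    while node.split_dim is not None:   # descend to the leaf whose region holds p
        node = node.left if p[node.split_dim] <= node.split_val else node.right
    node.points.append(p)
    if len(node.points) > LEAF_CAPACITY:
        # split along the dimension with the largest spread, to keep regions "square"
        spans = [max(q[d] for q in node.points) - min(q[d] for q in node.points)
                 for d in range(len(p))]
        d = spans.index(max(spans))
        node.points.sort(key=lambda q: q[d])
        mid = len(node.points) // 2
        node.split_dim, node.split_val = d, node.points[mid - 1][d]
        node.left, node.right = Node(node.points[:mid]), Node(node.points[mid:])
        node.points = None
```

As the slide notes, occasional full rebuilds keep the tree balanced when many such insertions skew it.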

  12. Ball trees
  A potential problem with kD-trees: corners in high-dimensional space may mean the query ball intersects with many regions.
  Observation: there is no need to make sure that regions do not overlap, so they do not need to be hyperrectangles.
  We can use balls (hyperspheres) instead of hyperrectangles.
  A ball tree organizes the data into a tree of k-dimensional hyperspheres.
  Motivation: balls may allow for a better fit to the data and thus more efficient search.
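Weka's IBk can be configured with kD-tree or ball-tree neighbour search; purely as an outside illustration (scikit-learn is an assumption here, not part of the course tooling), both index types answer the same exact queries:

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 8))        # 1000 instances, 8 numeric attributes

ball = BallTree(X)               # hypersphere-based index
kd = KDTree(X)                   # hyperrectangle-based index

query = X[:1]
dist_b, ind_b = ball.query(query, k=3)   # 3 nearest neighbors via ball tree
dist_k, ind_k = kd.query(query, k=3)     # same query via kD-tree
assert (ind_b == ind_k).all()            # both exact indexes return the same neighbors
```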

  13. Ball tree example

  14. Using ball trees
  Nearest-neighbor search is done using the same backtracking strategy as in kD-trees.
  A ball can be ruled out during the search if the distance from the target to the ball's center exceeds the ball's radius plus the radius of the query ball.
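Stated as an inequality, with q the target, c_B and r_B the ball's center and radius, and r_q the radius of the current query ball (the distance to the best neighbor found so far):

```latex
d(q, c_B) > r_B + r_q
\quad\Longrightarrow\quad
\text{ball } B \text{ cannot contain a closer neighbor and can be pruned}
```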

  15. Building ball trees
  Ball trees are built top-down, applying the same recursive strategy as in kD-trees.
  We do not have to continue until leaf balls contain just two points: we can enforce a minimum occupancy (this can also be done for efficiency in kD-trees).
  The basic problem is splitting a ball into two. A simple (linear-time) split selection strategy:
  Choose the point farthest from the ball's center.
  Choose a second point, farthest from the first one.
  Assign each point to the nearer of these two points.
  Compute cluster centers and radii based on the two subsets to get the two successor balls.
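A plain-Python sketch of that linear-time split (illustrative only, not Weka's code; assumes the points are not all identical):

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def split_ball(points):
    """Split one ball into two successor balls, as described above.

    1. Pick the point farthest from the ball's center.
    2. Pick the point farthest from that first point.
    3. Assign every point to whichever of the two it is closer to.
    4. The centroid and maximum distance within each subset give the
       center and radius of the two successor balls.
    """
    center = centroid(points)
    p1 = max(points, key=lambda p: distance(p, center))
    p2 = max(points, key=lambda p: distance(p, p1))
    left, right = [], []
    for p in points:
        (left if distance(p, p1) <= distance(p, p2) else right).append(p)
    balls = []
    for subset in (left, right):
        c = centroid(subset)
        r = max(distance(p, c) for p in subset)
        balls.append((c, r, subset))
    return balls
```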

  16. Discussion of nearest-neighbor learning
  • Often very accurate.
  • Assumes all attributes are equally important. Remedy: attribute selection, attribute weights, or attribute scaling.
  • Possible remedies against noisy instances: take a majority vote over the k nearest neighbors, or remove noisy instances from the dataset (difficult!).
  • Statisticians have used k-NN since the early 1950s. If n → ∞ and k/n → 0, the classification error approaches the minimum achievable.
  • kD-trees can become inefficient when the number of attributes is too large.
  • Ball trees may help; they are instances of metric trees.
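The majority-vote remedy extends the earlier 1-NN sketch directly (again plain Python, not Weka's IBk):

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """k-NN with a majority vote over the k nearest training instances.

    `train` is a list of (attribute_vector, class_label) pairs. Voting over
    k > 1 neighbors makes the prediction less sensitive to noisy instances.
    """
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda inst: dist(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```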

  18. What problems in assignment 1?
  • Classify the nominal generator of an audio waveform: sine, triangle, square, sawtooth, or pulse (10% duty cycle).
  • Estimate the numeric frequency of the waveform's fundamental sine wave.
  • I have generated the five listed waveforms at 1000 Hz, with a signal strength of 0.9 of maximum amplitude and 0% white-noise amplitude, as the reference .wav-file training instances.
  • The test dataset contains 10,000 instances (2,000 of each type) with fundamental frequency in the range [100, 2000] Hz, signal strength in [0.5, 0.75], and noise strength in [0.1, 0.25].
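A hedged sketch of how such instances could be synthesized with numpy/scipy (the function name, sample rate, and uniform noise model are assumptions for illustration; the actual assignment data comes from the instructor's generator):

```python
import numpy as np
from scipy import signal

def make_waveform(kind, freq_hz, amplitude, noise_amp, sr=44100, seconds=1.0):
    """Generate one labeled audio instance similar to those described above.

    `kind` is one of 'sine', 'triangle', 'square', 'sawtooth', 'pulse';
    'pulse' is modeled as a rectangular wave with a 10% duty cycle.
    """
    t = np.linspace(0.0, seconds, int(sr * seconds), endpoint=False)
    phase = 2 * np.pi * freq_hz * t
    generators = {
        "sine":     np.sin(phase),
        "triangle": signal.sawtooth(phase, width=0.5),   # symmetric ramp
        "square":   signal.square(phase),
        "sawtooth": signal.sawtooth(phase),
        "pulse":    signal.square(phase, duty=0.1),      # 10% duty cycle
    }
    wave = amplitude * generators[kind]
    noise = noise_amp * np.random.uniform(-1.0, 1.0, size=t.shape)  # additive noise
    return wave + noise

# e.g., a 1000 Hz reference square wave at 0.9 amplitude with no noise:
ref = make_waveform("square", 1000, 0.9, 0.0)
```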

  19. Why this audio problem set?
  • It derives from our machine listener research.
  • I have some expertise in audio signal processing. Data science needs human expertise; algorithms are not enough.
  • These waveforms are similar to biomedical waveforms and other cyclic sensor-based measurements.
  • They will work well with time-series analysis.
