
CS 590M Fall 2001: Security Issues in Data Mining


Presentation Transcript


  1. CS 590M Fall 2001: Security Issues in Data Mining
     Lecture 3: Classification

  2. What is Classification?
     • Problem: assign items to pre-defined classes
     • Sample Y = Y1 … Yn
     • Set of classes X = {X1, …, Xk}
     • Given Y, choose the class Xi that contains Y
     • How do we know how to do this?
     • Training data: a set of items for which the proper Xi is already known (see the sketch below)
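A minimal Python sketch of this setup; the two-feature samples, class names, and data values are all illustrative assumptions, not from the lecture. The later sketches reuse this `training_data`.

```python
# Each training item pairs a sample Y = (Y1, Y2) with its known class Xi.
# All values and class names here are made up for illustration.
training_data = [
    ((1.0, 2.0), "X1"),
    ((1.5, 1.8), "X1"),
    ((5.0, 8.0), "X2"),
    ((6.0, 9.0), "X2"),
]

def classify(sample):
    """Given a new sample Y, choose the class Xi it belongs to.
    Each approach below (Naive Bayes, KNN, ...) is one way to fill this in."""
    raise NotImplementedError
```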

  3. Issues
     • Classification accuracy
       • False positives, false negatives (a counting sketch follows this slide)
       • No clear “best” metric
     • Computation cost
       • Training
       • Classification
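To make the false-positive/false-negative bullets concrete, here is a minimal sketch that counts both for a binary task; the "positive"/"negative" label names are an assumption for illustration.

```python
def error_counts(true_labels, predicted_labels):
    """Count false positives and false negatives for binary labels."""
    fp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == "negative" and p == "positive")
    fn = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == "positive" and p == "negative")
    return fp, fn

# Example: one false positive and one false negative.
truth = ["positive", "negative", "negative", "positive"]
preds = ["positive", "positive", "negative", "negative"]
print(error_counts(truth, preds))  # (1, 1)
```

Which of the two counts matters more depends on the application, which is why no single metric is “best”.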

  4. Approaches
     • Naïve Bayes
     • K-Nearest Neighbor
     • Decision rules / decision trees
     • Neural networks

  5. Naïve Bayes: History
     • Bayes classifier: from probability theory
     • Idea: the a-posteriori probability of the class given all inputs is the best possible classifier
     • Problem: doesn’t generalize
     • Solution: Bayesian belief network
     [Example network diagram: Y1 → Y2, Y1 → Y3; Y2 and Y3 → Y4]
     P(Y1, Y2, Y3, Y4) = P(Y4|Y2,Y3) · P(Y2|Y1) · P(Y3|Y1) · P(Y1)
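A small sketch of evaluating this factorization for the example network above; every conditional probability table below is a made-up number, chosen only to show how the product is computed.

```python
# Binary variables (0/1). Each table gives P(child = 1 | parents).
p_y1 = 0.6                                     # P(Y1=1)
p_y2_given_y1 = {0: 0.2, 1: 0.7}               # P(Y2=1 | Y1)
p_y3_given_y1 = {0: 0.4, 1: 0.5}               # P(Y3=1 | Y1)
p_y4_given_y2_y3 = {(0, 0): 0.1, (0, 1): 0.3,
                    (1, 0): 0.6, (1, 1): 0.9}  # P(Y4=1 | Y2, Y3)

def bernoulli(p, value):
    """P(variable = value) for a binary variable with P(var = 1) = p."""
    return p if value == 1 else 1 - p

def joint(y1, y2, y3, y4):
    """P(Y1,Y2,Y3,Y4) = P(Y4|Y2,Y3) * P(Y2|Y1) * P(Y3|Y1) * P(Y1)."""
    return (bernoulli(p_y4_given_y2_y3[(y2, y3)], y4)
            * bernoulli(p_y2_given_y1[y1], y2)
            * bernoulli(p_y3_given_y1[y1], y3)
            * bernoulli(p_y1, y1))

print(joint(1, 1, 0, 0))  # P(Y1=1, Y2=1, Y3=0, Y4=0) = 0.084
```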

  6. Problems with Bayesian Belief Networks
     • What should the network structure be?
       • There is some work on how to learn the structure
       • Getting it wrong results in over-specificity
     • What are the probabilities?
       • Learning techniques exist here
     • Computational cost to learn the network

  7. Naïve Bayes
     • Two-layer Bayes network
     • No need to learn structure
     • Assumes the inputs are independent given the class
     • Learn the probabilities that work best on the training data (a counting sketch follows this slide)
     [Two-layer network diagram: class node X with children Y1, Y2, …, Yn]
     P(X|Y1, …, Yn) ∝ P(X) · Πi P(Yi|X)
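A minimal counting sketch of this two-layer model for categorical features; the weather-style toy data and the add-one (Laplace) smoothing are illustrative assumptions, not something stated on the slide.

```python
from collections import Counter, defaultdict

# Categorical toy data, made up for illustration.
nb_training_data = [
    (("sunny", "hot"), "X1"),
    (("sunny", "mild"), "X1"),
    (("rainy", "mild"), "X2"),
    (("rainy", "cool"), "X2"),
]

def train_naive_bayes(data):
    """Estimate P(X) and P(Yi|X) by counting over the training data."""
    class_counts = Counter(label for _, label in data)
    feature_counts = defaultdict(Counter)  # (class, position) -> value counts
    for features, label in data:
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict(sample, class_counts, feature_counts):
    """Pick the class maximizing P(X) * product_i P(Yi|X)."""
    total = sum(class_counts.values())
    best_class, best_score = None, 0.0
    for label, count in class_counts.items():
        score = count / total  # P(X)
        for i, value in enumerate(sample):
            seen = feature_counts[(label, i)]
            # Add-one smoothing so unseen values don't zero the product.
            score *= (seen[value] + 1) / (count + len(seen) + 1)
        if score > best_score:
            best_class, best_score = label, score
    return best_class

cc, fc = train_naive_bayes(nb_training_data)
print(predict(("sunny", "mild"), cc, fc))  # "X1"
```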

  8. K-Nearest Neighbor
     • Idea: choose the “closest” training item
       • The class of the test item is the class of the closest training item
       • Need to define a distance measure
     • What if this is a bad match?
       • Find the K closest items
       • Use the most common class among those K (a brute-force sketch follows this slide)
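A brute-force sketch of this procedure, reusing the toy `training_data` from the first sketch; Euclidean distance is an illustrative choice here, since the slide leaves the distance measure open.

```python
import math
from collections import Counter

def knn_classify(sample, data, k=3):
    """Return the most common class among the k nearest training items."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Linear scan over all training items (slide 10 returns to this cost).
    neighbors = sorted(data, key=lambda item: distance(sample, item[0]))
    top_k_labels = [label for _, label in neighbors[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

print(knn_classify((1.2, 1.9), training_data, k=3))  # "X1"
```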

  9. KNN: Advantages
     • As the training-set size n → ∞ and K → ∞ (with K/n → 0), the result approaches the optimal classifier
     • View it as “the best probability estimate over all samples”: this is Bayes’ theorem
     • Training is simple
       • Just put the training set into a data structure

  10. KNN: Problems
     • With small K, only captures convex classes
     • High dimensionality: an item may be “nearest” in irrelevant attributes
     • Query time: search all the training data
       • Algorithms exist to make this faster (an indexing sketch follows this slide)
       • But brute-force KNN is good enough to be the “standard” for comparison
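As one instance of the “make this faster” bullet, a spatial index such as a k-d tree can replace the linear scan at query time; using SciPy’s `cKDTree` is my choice of tool here, not the lecture’s, and the data is the toy `training_data` from earlier.

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([features for features, _ in training_data])
labels = [label for _, label in training_data]

tree = cKDTree(points)                     # built once, at training time
dists, idxs = tree.query((1.2, 1.9), k=3)  # the 3 nearest training items
nearest_labels = [labels[i] for i in idxs] # vote over these as before

# Caveat echoing the bullet above: tree speedups degrade as the
# dimensionality of the feature space grows.
```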

  11. Classification and Security
     • Ideas on how to use classifiers to improve security
       • Intrusion detection
       • ?
     • Potential risks
       • Identifying private information based on similarity with the training data
