

1. Data Mining, CSCI 307, Spring 2019. Lecture 4: Input, Concepts, Instances, and Attributes

2. Terminology
• Components of the input:
  • Concept: the thing to be learned
  • Concept description: the output of the learning scheme
    • Aim: an intelligible and operational concept description
  • Instances (AKA tuples): the individual, independent examples of a concept
    • Note: more complicated forms of input are possible
  • Attributes: features that measure aspects of an instance
    • Note: we will focus on nominal and numeric ones

3. What's a Concept?
• Concept: thing to be learned
• Concept description: output of the learning scheme
• Styles of learning:
  • Classification learning: predicting a discrete class
  • Association learning: detecting associations between features
  • Clustering: grouping similar instances into clusters
  • Numeric prediction: predicting a numeric quantity

4. Classification Learning
• Example problems: weather data, contact lenses, irises, labor negotiations
• Classification learning is supervised, because the scheme is provided with the actual outcome for each training example, so success can be judged
• The outcome is called the class of the example
• Success is measured on fresh data for which the class labels are known (test data)
• In practice, success is often measured subjectively
• Here we look at examples that each belong to a single class; multi-label classification scenarios also exist
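
A minimal sketch of this supervised setup, assuming scikit-learn is available (the library, the iris data, and the decision tree are convenient stand-ins, not prescribed by the slides): train on examples with known outcomes, then measure success on held-out test data.

```python
# Supervised classification sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # instances and their known classes

# Hold out fresh data whose class labels are known, so success can be judged.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scheme = DecisionTreeClassifier(random_state=0)
scheme.fit(X_train, y_train)        # the scheme is given the actual outcomes (supervised)

print("accuracy on test data:", accuracy_score(y_test, scheme.predict(X_test)))
```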

5. Numeric Prediction
• A variant of classification learning in which the "class" is numeric (also called "regression")
• Learning is supervised: the scheme is provided with the target value
• Success is measured on test data
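
As a hedged illustration of numeric prediction, again assuming scikit-learn; the diabetes dataset and linear regression are just convenient choices, not part of the slides.

```python
# Numeric prediction (regression) sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)   # here the "class" is a numeric target value
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # supervised: target values are provided
print("mean absolute error on test data:", mean_absolute_error(y_test, model.predict(X_test)))
```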

6. Association Learning
• Can be applied if no class is specified and any kind of structure is considered "interesting"
• Difference from classification learning:
  • Can predict any attribute's value, not just the class, and more than one attribute's value at a time
  • Hence, far more association rules than classification rules
  • Thus, constraints are necessary:
    • Minimum coverage (e.g. 80%)
    • Minimum accuracy (e.g. 95%)
• Only use with non-numeric attributes
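
A small sketch, in plain Python with hypothetical toy instances, of how minimum-coverage and minimum-accuracy thresholds might filter a candidate association rule. The rule_stats helper and the exact definitions used here (coverage = fraction of instances the rule applies to, accuracy = fraction of those it predicts correctly) are one common formulation, stated as an assumption rather than the textbook's code.

```python
# Checking an association rule's coverage and accuracy on toy, invented instances.
instances = [
    {"outlook": "sunny",    "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "sunny",    "humidity": "high",   "windy": "true",  "play": "no"},
    {"outlook": "overcast", "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "humidity": "normal", "windy": "true",  "play": "no"},
]

def rule_stats(instances, antecedent, consequent):
    """Return (coverage, accuracy) of the rule: antecedent ==> consequent."""
    covered = [i for i in instances if all(i[a] == v for a, v in antecedent.items())]
    correct = [i for i in covered if all(i[c] == v for c, v in consequent.items())]
    coverage = len(covered) / len(instances)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Rule: humidity = high ==> play = no  (association rules may predict any attribute)
cov, acc = rule_stats(instances, {"humidity": "high"}, {"play": "no"})
print(f"coverage = {cov:.0%}, accuracy = {acc:.0%}")

# Keep the rule only if it clears the minimum thresholds (the slide's 80% / 95%).
keep = cov >= 0.80 and acc >= 0.95
```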

7. Clustering
• Finding groups of items that are similar
• Clustering is unsupervised: the class of an example is not known
• Success is often measured subjectively
• Iris example: if no class is given, the 150 instances are likely to fall into natural clusters that (hopefully) correspond to the three iris types
  • The challenge is to assign new instances to these clusters
• The clustering results might be used in a second scheme to find rules for assigning new instances
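
A minimal clustering sketch, assuming scikit-learn; k-means is one possible scheme (the slide does not prescribe an algorithm) for grouping the 150 iris instances into three clusters without the class labels and then assigning a new instance.

```python
# Unsupervised clustering sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)        # the class labels are deliberately ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])               # cluster assignments of the first ten instances

# Challenge from the slide: assign a new instance to one of the discovered clusters.
new_instance = [[5.1, 3.5, 1.4, 0.2]]
print(kmeans.predict(new_instance))
```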

8. What's in an Example?
• Instance: a specific type of example
  • The thing to be classified, associated, or clustered
  • An individual, independent example of the target concept
  • Characterized by a predetermined set of attributes
• Input to the learning scheme: a set of instances/tuples (a dataset)
  • Represented as a single relation / flat file
  • Rather restricted form of input: no relationships between objects
  • Most common form in practical data mining

9. Creating a Flat File
[Figure: a family tree. Peter (M) = Peggy (F) have children Steven (M), Graham (M), and Pam (F); Grace (F) = Ray (M) have children Ian (M), Pippa (F), and Brian (M); Pam = Ian have children Anna (F) and Nikki (F).]

10. Family Tree Represented as a Table

  Name     Gender   Parent1   Parent2
  Peter    Male     ?         ?
  Peggy    Female   ?         ?
  Grace    Female   ?         ?
  Ray      Male     ?         ?
  Steven   Male     Peter     Peggy
  Graham   Male     Peter     Peggy
  Pam      Female   Peter     Peggy
  Ian      Male     Grace     Ray
  Pippa    Female   Grace     Ray
  Brian    Male     Grace     Ray
  Anna     Female   Pam       Ian
  Nikki    Female   Pam       Ian

11. The "sister-of" Relation
• [Two tables, not reproduced here, represent sisterhood in slightly different ways: one covers all 144 pairs of people, the other lists only the positive pairs.]
• Only positive examples are defined explicitly
• Closed-world assumption: any pair not listed is assumed not to be in the sister-of relation
• The closed-world assumption does not always match the real world

12. A Full Representation in One Table
• Flattening (aka denormalizing): collapse the two previous tables into one, transforming the original relations into instance form
• Rule: if second person's gender == female and first person's parents == second person's parents, then sister-of = yes
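
A sketch in plain Python of this flattening step, applying the slide's rule to the family-tree table above; the dictionary layout and the sister_of helper are illustrative choices, not the textbook's code.

```python
# Flattening sketch: derive sister-of instances from the flat family-tree table.
people = {                       # name -> (gender, parent1, parent2)
    "Peter":  ("Male",   None,    None),
    "Peggy":  ("Female", None,    None),
    "Grace":  ("Female", None,    None),
    "Ray":    ("Male",   None,    None),
    "Steven": ("Male",   "Peter", "Peggy"),
    "Graham": ("Male",   "Peter", "Peggy"),
    "Pam":    ("Female", "Peter", "Peggy"),
    "Ian":    ("Male",   "Grace", "Ray"),
    "Pippa":  ("Female", "Grace", "Ray"),
    "Brian":  ("Male",   "Grace", "Ray"),
    "Anna":   ("Female", "Pam",   "Ian"),
    "Nikki":  ("Female", "Pam",   "Ian"),
}

def sister_of(first, second):
    g2, p1_2, p2_2 = people[second]
    _,  p1_1, p2_1 = people[first]
    # Rule from the slide: second person is female and both share the same parents.
    return (first != second and g2 == "Female"
            and p1_1 is not None and (p1_1, p2_1) == (p1_2, p2_2))

pairs = [(a, b) for a in people for b in people if sister_of(a, b)]
print(pairs)   # e.g. ("Steven", "Pam"), ("Anna", "Nikki"), ...
```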

13. Generating a Flat File
• The process of flattening is called "denormalization": several relations are joined together to make one
• Possible with any finite set of finite relations
• Problematic: relationships without a pre-specified number of objects
  • Example: the concept of a nuclear family
• Denormalization may produce spurious regularities that merely reflect the structure of the database
  • Example: "supplier" predicts "supplier address"
  • Customers buy products; flattening the database gives each instance the form (customer, product, supplier, supplier address). A supermarket manager might care about the combinations of products each customer purchases, but not about the "discovery" of the supplier's address.
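
A small, hypothetical sketch of the supplier example: after the join, every flattened instance carries the supplier's address, so "supplier predicts supplier address" shows up as a regularity of the database structure rather than an interesting discovery. All names and data below are invented.

```python
# Denormalization sketch: join purchases with the supplier relation.
purchases = [("alice", "milk", "DairyCo"), ("bob", "milk", "DairyCo"), ("alice", "bread", "BakeInc")]
suppliers = {"DairyCo": "12 Farm Rd", "BakeInc": "9 Oven St"}   # supplier -> address

flat = [(cust, prod, sup, suppliers[sup]) for cust, prod, sup in purchases]
for row in flat:
    print(row)

# Every row with supplier "DairyCo" also has address "12 Farm Rd": a mining scheme may
# "discover" that supplier predicts supplier address, a spurious regularity of the join,
# not a fact about what customers buy.
```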

  14. The “ancestor-of” Relation

15. Recursion
• Infinite relations require recursion:
  • If person1 is a parent of person2, then person1 is an ancestor of person2
  • If person1 is a parent of person2 and person2 is an ancestor of person3, then person1 is an ancestor of person3
• This definition works no matter how distantly two people are related
• These general relations are beyond the scope of our textbook and this class
• Appropriate techniques are known as "inductive logic programming"
  • e.g. Quinlan's First Order Inductive Learner (FOIL), a rule-based learning algorithm
• Problems: (a) these techniques do not handle noise well, and (b) computational complexity, i.e. large datasets are slow
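
A plain-Python sketch of the recursive ancestor-of definition from the slide; the parent_of dictionary reuses part of the family tree purely as illustration.

```python
# Recursive ancestor-of sketch over a small parent relation.
parent_of = {                 # child -> (parent1, parent2)
    "Steven": ("Peter", "Peggy"),
    "Pam":    ("Peter", "Peggy"),
    "Anna":   ("Pam", "Ian"),
    "Nikki":  ("Pam", "Ian"),
}

def is_ancestor(person1, person3):
    """person1 is an ancestor of person3 if person1 is a parent of person3,
    or person1 is an ancestor of one of person3's parents."""
    parents = parent_of.get(person3, ())
    return person1 in parents or any(is_ancestor(person1, p) for p in parents)

print(is_ancestor("Peter", "Anna"))    # True: Peter -> Pam -> Anna
print(is_ancestor("Peter", "Steven"))  # True: direct parent
print(is_ancestor("Anna", "Peter"))    # False
```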

16. Multi-instance Concepts
• Each individual example comprises a set of instances
• The same attributes describe all the instances
• One or more instances within an example may be responsible for its classification
• The goal of learning is still to produce a concept description
• There are important real-world applications
  • e.g. a drug molecule that can take different shapes is represented as the set of those shapes, which together predict positive or negative binding activity; the entire set is classified as either positive or negative
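
A sketch, with invented data, of the multi-instance setting; the "bag is positive if any instance is positive" test is one common multi-instance assumption, used here only to make the idea concrete, and the per-instance test is a hypothetical stand-in for a learned classifier.

```python
# Multi-instance sketch: each example is a bag of instances sharing the same attributes,
# and the whole bag receives a single classification.
bags = {
    "molecule_A": [(0.2, 1.1), (0.9, 0.3), (0.5, 0.8)],   # each tuple: one shape's features
    "molecule_B": [(0.1, 0.2), (0.15, 0.1)],
}

def instance_is_positive(features):
    # Hypothetical per-instance test standing in for a learned classifier.
    return features[0] > 0.8

def classify_bag(instances):
    # One common assumption: the bag is positive if any instance in it is positive.
    return "positive" if any(instance_is_positive(x) for x in instances) else "negative"

for name, instances in bags.items():
    print(name, classify_bag(instances))   # molecule_A positive, molecule_B negative
```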

17. What's in an Attribute?
• Each instance is described by a fixed, predefined set of features, its "attributes"
• But: the number of attributes may vary in practice
  • Possible solution: an "irrelevant value" flag
• Related problem: the existence of an attribute may depend on the value of another
• Possible attribute types ("levels of measurement"):
  • Statisticians often use nominal, ordinal, interval, and ratio
  • Nominal aka categorical; numeric aka continuous

18. Nominal Quantities
• Values are distinct symbols
• The values themselves serve only as labels or names
  • "Nominal" comes from the Latin word for name
• Example: attribute outlook from the weather data
  • Values: sunny, overcast, and rainy
• No relation is implied among nominal values (no ordering or distance measure)
• Only equality tests can be performed

19. Ordinal Quantities
• Impose an order on values
• But: no distance between values is defined
• Example: attribute temperature in the weather data
  • Values: hot > mild > cool
• Note: addition and subtraction don't make sense
• Example rule: temperature < hot ==> play = yes
• The distinction between nominal and ordinal is not always clear by observation (e.g. attribute outlook: is overcast between sunny and rainy?)
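
A small sketch of treating temperature as ordinal: the integer ranks below exist only so that comparisons like temperature < hot can be tested, and the particular encoding is an illustrative assumption; doing arithmetic on the ranks would be meaningless.

```python
# Ordinal attribute sketch: ordered values, no meaningful distances.
temperature_order = {"cool": 0, "mild": 1, "hot": 2}

def rule_play(temperature):
    # Example rule from the slide: temperature < hot ==> play = yes
    if temperature_order[temperature] < temperature_order["hot"]:
        return "yes"
    return None   # the rule simply does not fire; another rule would have to decide

print(rule_play("mild"))   # yes
print(rule_play("hot"))    # None (rule does not apply)
```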

20. Interval Quantities
• Interval quantities are not only ordered but measured in fixed and equal units
• Example 1: attribute temperature expressed in degrees Fahrenheit
• Example 2: attribute year
• Difference of two values (of the same attribute) makes sense
• Sum or product doesn't make sense
• The zero point is not defined!

21. Ratio Quantities
• Ratio quantities are ones for which the measurement scheme defines a zero point
• Example: attribute distance
  • The distance between an object and itself is zero
• Ratio quantities are treated as real numbers
  • All mathematical operations are allowed
• But: is there an "inherently" defined zero point?
  • The answer depends on scientific knowledge
  • e.g. Daniel Fahrenheit knew no lower limit to temperature, but today the scale is based on absolute zero
  • e.g. measuring time from the culturally defined zero at A.D. 0 is not a ratio quantity, but years since the Big Bang is

22. Attribute Types Used in Practice
• Most schemes accommodate just two levels of measurement: nominal and ordinal
• Nominal attributes are also called categorical, enumerated, or discrete
  • But alas, enumerated and discrete imply order
  • Special case: dichotomy (boolean attribute)
• Ordinal attributes are called numeric, or continuous
  • But alas, continuous implies mathematical continuity

23. Metadata ("data about the data")
• Metadata is information about the data that encodes background knowledge
• It can be used to restrict the search space
• Examples:
  • Dimensional considerations (i.e. restrict the search to expressions or comparisons that are dimensionally correct)
  • Circular orderings might affect the types of tests, e.g. degrees on a compass; a day attribute might use next day, previous day, next weekday, etc.
  • Partial orderings
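
A tiny sketch of why circular-ordering metadata matters, using invented compass bearings: the plain numeric difference overstates how far apart 350 and 10 degrees are, whereas a test that knows the scale wraps at 360 treats them as close.

```python
# Circular-ordering sketch: compass bearings wrap around at 360 degrees.
def circular_difference(a, b, period=360):
    d = abs(a - b) % period
    return min(d, period - d)

print(abs(350 - 10))                  # 340 -- plain numeric difference looks large
print(circular_difference(350, 10))   # 20  -- with the circular metadata they are close
```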
