Discover changes in classification patterns from old to new data, allowing users to identify the important changes that affect accuracy. Our algorithm traces corresponding rules and estimates error rates to determine significance.
Mining Changes of Classification by Correspondence Tracing
Ke Wang and Senqiang Zhou, Simon Fraser University, {wangk, szhoua}@cs.sfu.ca
Wai Chee Ada Fu and Jeffrey Xu Yu, The Chinese University of Hong Kong, adafu@cs.cuhk.edu.hk, yu@se.cuhk.edu.hk
The Problem
• Mining changes of classification patterns as the data changes
• What we have: the old classifier and the new data
• What we want: the changes of classification characteristics in the new data
• Example: "members with a large family shop frequently" vs. "members with a large family shop infrequently"
Targets
• State the changes explicitly; simply comparing the old and new classifiers does not work
• Distinguish the important changes, otherwise users could be overwhelmed
• Focus on the changes that cause a change in classification accuracy
Related Work
• [GGR99]: transforms two classifiers into a common specialization; the changes are measured by the amount of work required for this transformation
• Human users hardly measure changes this way
• Does not address the primary goal: accuracy
Related Work (Cont.)
• [LHHX00]: extracts changes by requiring the new classifier to be similar to the old one
• Uses the same splitting attributes or the same splits in the decision tree construction
• This puts a severe restriction on mining important changes
Our Approach: Basic Idea
• For each old rule o, trace the corresponding new rules in the new classifier through the examples they both classify
• Example: n1 and n2 are corresponding rules of o
[Figure: old rule o and new rules n1, n2 covering overlapping examples in the new data set]
The Algorithm
• Input: the old classifier and the new data
• Output: the changes of classification patterns
• Step 1: Build the new classifier
• Step 2: Identify the corresponding new rules for each old rule
• Step 3: Present the changes
Identify the Corresponding Rules
• Given an old rule o:
• Collect the examples classified by o
• For each such example, identify the new rule n that classifies it
• Characteristic change: o → <n1, n2, …, nk>
[Figure: old rule o and its corresponding new rules n1 and n2]
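A minimal sketch of this tracing step (Step 2), assuming rules are objects exposing a classifies(example) predicate; the representation and names are illustrative, not the paper's API:

```python
from collections import defaultdict

def trace_correspondences(old_rules, new_rules, new_data):
    """Map each old rule o to the set of new rules that classify
    the same examples o classifies in the new data."""
    correspondences = defaultdict(set)
    for example in new_data:
        # Find the rule in each classifier that fires on this example.
        o = next((r for r in old_rules if r.classifies(example)), None)
        n = next((r for r in new_rules if r.classifies(example)), None)
        if o is not None and n is not None:
            correspondences[o].add(n)  # builds o -> {n1, n2, ..., nk}
    return correspondences
```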
Any Important Changes?
• Given o → <n1, n2, …, nk>, which changes are more important?
• Hint: users are usually interested in the changes that cause a change of classification accuracy
• Basic idea: measure the importance of changes based on the estimated accuracy of rules on future data
Pessimistic Error Estimation
• Consider an old rule o that classifies N examples in the new data set, E of them wrongly
• Observed error rate: E/N
• What about the error rate in the whole population?
• Given a confidence level CF, the upper bound on the error rate is U_CF(N, E)
• Details in [Q93]
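A sketch of the bound, assuming the C4.5 convention [Q93] that U_CF(N, E) is the binomial upper confidence limit: the largest error rate p at which observing at most E errors among N examples still has probability CF. It is solved here by bisection; the function names are mine:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def pessimistic_error(n, e, cf=0.25):
    """U_CF(n, e): the error rate p solving P(X <= e) = cf, via bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid  # seeing <= e errors is still too likely; p can grow
        else:
            hi = mid
    return (lo + hi) / 2

print(round(pessimistic_error(7, 4), 2))  # ~0.75, as in the example below
print(round(pessimistic_error(5, 0), 2))  # ~0.24
```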
Estimating Quantitative Change
• Consider a rule pair <o, ni>, where o has statistics (No, Eo) and ni has (Nn, En)
• Step 1: Estimate the error rates of both rules
• Step 2: Calculate the decrease of error rate
• Step 3: Calculate the increase of accuracy: this is the quantitative change
An Example
Consider o → <n1, n2>:
• o: A4=1 → C3 (N=7, E=4)
• n1: A3=1, A4=1 → C3 (N=5, E=0), 3 examples classified by o
• n2: A3=2, A4=1 → C2 (N=6, E=0), 4 examples classified by o
Assume the new data set has 18 examples and CF = 25%. Consider the quantitative change of <o, n1>:
• The estimated error rate of o: U_CF(7, 4) = 75%
• The estimated error rate of n1: U_CF(5, 0) = 24%
• The decrease of error rate: 75% - 24% = 51%
• The increase of accuracy for (o, n1): (3/18) × 51% = 8.5%
• The increase of accuracy for (o, <n1, n2>): 8.5% + 12% = 20.5%
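A sketch reproducing this arithmetic with the pessimistic_error helper defined above; the per-pair weight (shared examples divided by data set size) follows the slide's calculation:

```python
def accuracy_increase(shared, data_size, err_old, err_new):
    """Quantitative change of one pair <o, n_i>: coverage times error drop."""
    return (shared / data_size) * (err_old - err_new)

err_o  = pessimistic_error(7, 4)   # ~0.75
err_n1 = pessimistic_error(5, 0)   # ~0.24
err_n2 = pessimistic_error(6, 0)   # ~0.21

d1 = accuracy_increase(3, 18, err_o, err_n1)  # ~0.085 (8.5%)
d2 = accuracy_increase(4, 18, err_o, err_n2)  # ~0.120 (12%)
print(round(d1 + d2, 2))  # ~0.20, matching the 20.5% total up to rounding
```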
Types of Changes
• Global changes: both the characteristic change and the quantitative change are large
• Alternative changes: the characteristic change is large but the quantitative change is small
Types of Changes (Cont.)
• Target changes: similar characteristics but different classes (targets)
• Interval changes: shift of boundary points due to the emergence of new cutting points
Experiments
• Two data sets:
• German Credit Data from the UCI repository [MM96]
• IPUMS Census Data [IPUMS]
• Goal: to verify whether our algorithm can find the important changes it is "supposed" to find
Methodologies
• For the German Credit data: plant some changes into the original data and check whether the proposed method finds them
• For the IPUMS census data: apply the proposed method to find changes across years or across sub-populations
• Classifiers are built using the C4.5 algorithm [Q93]
Summary of the German Data
• Data description:
• 2 classes: bad and good
• 20 attributes, 13 of them categorical
• 1000 examples, 700 of them "good"
• Changes planted: a target change, an interval change, etc.
Planting a Target Change
Old rule: Personal-status = single-male, Foreign = no → Credit = good (N=23, E=0)
Consider the examples classified by the old rule. Change planted: if Liable-people = 1, change the class from good to bad. 12 examples changed.
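A sketch of this planting step, assuming examples are dicts keyed by attribute name (field names mirror the slide; loading the dataset is omitted):

```python
def plant_target_change(examples):
    """Flip the class of the old rule's examples that have Liable-people = 1."""
    changed = 0
    for ex in examples:
        if (ex["Personal-status"] == "single-male" and ex["Foreign"] == "no"
                and ex["Liable-people"] == 1 and ex["Credit"] == "good"):
            ex["Credit"] = "bad"
            changed += 1
    return changed  # 12 on the slide's data
```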
The Changes Found
Old rule: Personal-status = single-male, Foreign = no → Credit = good
New rules:
• Personal-status = single-male, Liable-people ≤ 1, Foreign = no → Credit = bad (quantitative change = 0.48%)
• Liable-people > 1, Foreign = no → Credit = good (quantitative change = 0.54%)
Planting an Interval Change
Old rule: Status = 0DM, Duration > 11, Foreign = yes → Credit = bad (N=164, E=47)
Consider the examples classified by the old rule. Change planted: increase the Duration value by 6 (months) for each example classified by the old rule. 164 examples changed.
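The corresponding sketch for the interval change, under the same dict-of-attributes assumption as above:

```python
def plant_interval_change(examples):
    """Shift Duration by 6 months for every example the old rule classifies."""
    changed = 0
    for ex in examples:
        if (ex["Status"] == "0DM" and ex["Duration"] > 11
                and ex["Foreign"] == "yes"):
            ex["Duration"] += 6
            changed += 1
    return changed  # 164 on the slide's data
```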
The Change Found
Old rule: Status = 0DM, Duration > 11, Foreign = yes → Credit = bad
New rule: Status = 0DM, Duration > 16, Foreign = yes → Credit = bad (quantitative change = 1.20%)
Summary of the IPUMS Data
• The class attribute is "vetstat"
• 3 data sets: 1970, 1980, and 1990
• Each data set contains examples for several races
• The proposed method is applied to find changes across years or across sub-populations
Interesting Changes Found
A. 1970-black vs. 1990-black:
• Old: 35 < age ≤ 54 → Vetstat = yes
• New: 40 < age ≤ 72, sex = male → Vetstat = yes (quantitative change = 1.20%)
B. 1990-black vs. 1990-chinese:
• Old: 40 < age ≤ 72, sex = male → Vetstat = yes
• New: bplg = china, incss ≤ 5748 → Vetstat = no (quantitative change = 4.56%)
Conclusion • Extracting changes from potentially very dissimilar old and new classifiers by correspondence tracing • Ranking the importance of changes • Presenting the changes • Experiments on real-life data sets
References
• [GGR99] V. Ganti, J. Gehrke, and R. Ramakrishnan. A framework for measuring changes in data characteristics. In PODS, 1999.
• [IPUMS] http://www.ipums.umn.edu/
• [LHHX00] B. Liu, W. Hsu, H.S. Han, and Y. Xia. Mining changes for real-life applications. In DaWaK, 2000.
References (Cont.)
• [MM96] C.J. Merz and P. Murphy. UCI repository of machine learning databases, 1996.
• [Q93] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.