
Active Mining of Data Streams



Presentation Transcript


  1. Active Mining of Data Streams Wei Fan, Yi-an Huang, Haixun Wang and Philip S. Yu Proc. SIAM International Conference on Data Mining, 2004 Speaker: Pei-Min Chou Date: 2005/01/14

  2. Introduction • In most real-world problems, labelled data for a stream is rarely immediately available • Existing approaches simply refresh models periodically • We propose the new concept of demand-driven active data mining

  3. Method • Step 1: Detect potential changes in the data stream ("guess") • Step 2: If the guessed loss or error rate is higher than the tolerable maximum, choose a small number of data records and investigate their true class labels • Step 3: If the statistically estimated loss is higher than the tolerable maximum, reconstruct the old model
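The three-step loop above can be condensed into a single decision routine. This is a minimal sketch, not the paper's implementation; `active_mining_decision`, its arguments, and the returned action strings are illustrative:

```python
def active_mining_decision(guessed_loss, sample_losses, tolerable_max):
    """One cycle of the demand-driven loop.

    guessed_loss: loss "guessed" from changes in leaf statistics (Step 1).
    sample_losses: true losses of the small investigated sample (Step 2).
    """
    # Step 1: if the guessed loss is within tolerance, do nothing.
    if guessed_loss <= tolerable_max:
        return "keep"
    # Step 2: estimate the true loss from the investigated records.
    estimated = sum(sample_losses) / len(sample_losses)
    # Step 3: reconstruct only if the estimate exceeds the tolerable maximum.
    return "reconstruct" if estimated > tolerable_max else "keep"
```

Note that the costly label investigation of Step 2 happens only when the cheap Step 1 check fails; that is the demand-driven aspect.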

  4. Definition (1) • Dc: the complete data set • D: the training set • S: the data stream • dt: the decision tree constructed from D • Tolerable maximum: its exact value is defined by each application

  5. Definition (2) • nl: the number of instances classified by leaf l • N: the size of the data stream • Statistic at leaf l: p(l) = nl / N • The statistics sum to one: Σ p(l) = 1
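The leaf statistic p(l) = nl / N can be computed directly from the stream's leaf assignments. A minimal sketch; the function name and input format are assumptions:

```python
from collections import Counter

def leaf_statistics(leaf_assignments):
    """p(l) = n_l / N: the fraction of records classified by each leaf."""
    N = len(leaf_assignments)
    return {leaf: n / N for leaf, n in Counter(leaf_assignments).items()}

stats = leaf_statistics(["C1", "C2", "C2", "C5", "C5", "C5", "C6"])
# By construction, the values of `stats` sum to 1.
```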

  6. Example • [Figure: D, the training set, shown alongside Dc, the complete data set]

  7. Example: decision tree • [Figure: a decision tree whose internal tests are "Bank is ICE", "Local is A", "Bank is IBE", "Local is B", and "Price is 100", with leaves C1–C6. C1 holds Billy and Tom, C2 Mary, C3 Ella, C4 John, and C6 Paul and Amy. The training-set leaf statistics pD(l) shown are 1/7, 1/7, 1/7, 0, and 2/7; the highlighted leaf has pD(l) = 2/7]

  8. Observable Statistics (1) • pS(l): the statistic at leaf l computed on S • pD(l): the statistic at leaf l computed on D • PS: the change of the leaf statistics on the data stream • A large PS means that a significant change has occurred
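The slides do not show the exact formula for PS, so as an assumption this sketch aggregates the per-leaf changes with an L1 distance, Σ |pS(l) − pD(l)|; the paper's exact aggregation may differ:

```python
def leaf_statistic_change(p_S, p_D):
    """L1 distance between stream and training leaf statistics (assumed form)."""
    leaves = set(p_S) | set(p_D)  # a leaf may be empty in one of the two sets
    return sum(abs(p_S.get(l, 0.0) - p_D.get(l, 0.0)) for l in leaves)
```

Identical statistics give 0; completely disjoint leaf usage gives the maximum value, 2.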

  9. Example (2) • [Figure: the same decision tree applied to a new data stream S. C2 now holds Erin and Hebe, C4 Boss and Sam, and C5 JoJo. The stream leaf statistics pS(l) shown are 0, 2/5, 0, 2/5, and 1/5; the highlighted leaf has pS(l) = 0]

  10. Observable Statistics (2) • La: the validation loss • Le: the sum of the expected loss at every leaf • LS: the potential change in loss due to changes in the data stream • Difference from PS: LS takes the loss function into account
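Under 0-1 loss, the expected loss contributed by a leaf is its share of the stream times one minus the confidence of its predicted class, which matches the Le(C2) = (1 − 0.7) × 30% = 9% computation in the worked example. A sketch assuming 0-1 loss; the names are illustrative:

```python
def expected_loss(leaf_stats, leaf_confidence):
    """L_e = sum over leaves of p(l) * (1 - confidence(l)) under 0-1 loss."""
    return sum(p * (1.0 - leaf_confidence[l]) for l, p in leaf_stats.items())
```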

  11. Example (3) • [Figure: the same tree on stream S. Leaf C2 (Hebe, Erin) now receives 30% of the stream and predicts its majority class with confidence 0.7; C4 holds Boss and Sam, and C5 holds JoJo. Expected loss at C2: Le(C2) = (1 − 0.7) × 30% = 9%]

  12. Loss Estimation • Triggered when either of the two statistics above exceeds the tolerable maximum • Investigate the true class labels of a selected number of examples • Let the loss of each example be {l1, l2, l3, ..., ln} • Average loss: Σ li / n • Standard error: s / √n, where s is the sample standard deviation • Investigation cost: obtaining true labels is not free
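The sampling estimate above amounts to a sample mean with its standard error. A minimal sketch, assuming the standard error is computed as s / √n from the sample standard deviation (the function name is illustrative):

```python
import math

def estimate_loss(sample_losses):
    """Average loss and standard error of the investigated sample {l1, ..., ln}."""
    n = len(sample_losses)
    mean = sum(sample_losses) / n
    var = sum((x - mean) ** 2 for x in sample_losses) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)
```

An approximate 95% range for the true loss is then mean ± 1.96 × standard error, which is what gets compared against the tolerable maximum.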

  13. Experiment (1) • The changing statistic is a good indicator of change

  14. Experiment(2)

  15. Experiment(3)

  16. Experiment(4)

  17. Experiment: Results • The two statistics are very well correlated with the amount of change • The statistically estimated loss range is very close to the true value

  18. Conclusion • Estimates the error without knowing the true class labels • Uses a statistical sampling method to estimate the range of the true loss • Reconstructs the model whenever the estimated loss is higher than the tolerable maximum
