
Active Mining of Data Streams



Presentation Transcript


  1. Active Mining of Data Streams Wei Fan, Yi-an Huang, Haixun Wang and Philip S. Yu Proc. SIAM International Conference on Data Mining, 2004 Speaker: Pei-Min Chou Date: 2005/01/14

  2. Introduction • In most real-world problems, labelled data for a stream is rarely immediately available • Existing approaches simply refresh models periodically • We propose the new concept of demand-driven active data mining

  3. Method • Step 1: Detect potential changes in the data stream ("guess") • Step 2: If the guessed loss or error rate is higher than the tolerable maximum, choose a small number of data records and investigate their true class labels • Step 3: If the statistically estimated loss is higher than the tolerable maximum, reconstruct the old model
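The three-step loop above can be condensed into a single decision routine. This is a minimal sketch, not the paper's implementation; `active_mining_decision`, its arguments, and the returned action strings are illustrative:

```python
def active_mining_decision(guessed_loss, sample_losses, tolerable_max):
    """One cycle of the demand-driven loop.

    guessed_loss: loss "guessed" from changes in leaf statistics (Step 1).
    sample_losses: true losses of the small investigated sample (Step 2).
    """
    # Step 1: if the guessed loss is within tolerance, do nothing.
    if guessed_loss <= tolerable_max:
        return "keep"
    # Step 2: estimate the true loss from the investigated records.
    estimated = sum(sample_losses) / len(sample_losses)
    # Step 3: reconstruct only if the estimate exceeds the tolerable maximum.
    return "reconstruct" if estimated > tolerable_max else "keep"
```

Note that the costly label investigation of Step 2 happens only when the cheap Step 1 check fails; that is the demand-driven aspect.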

  4. Definition (1) • Dc: the complete data set • D: the training set • S: the data stream • dt: the decision tree constructed from D • Tolerable maximum: its exact value is defined by each application

  5. Definition (2) • nl: the number of instances classified by leaf l • N: the size of the data stream • Statistic at leaf l: p(l) = nl / N • The statistics sum to one: Σ p(l) = 1
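The leaf statistic p(l) = nl / N can be computed directly from the stream's leaf assignments. A minimal sketch; the function name and input format are assumptions:

```python
from collections import Counter

def leaf_statistics(leaf_assignments):
    """p(l) = n_l / N: the fraction of records classified by each leaf."""
    N = len(leaf_assignments)
    return {leaf: n / N for leaf, n in Counter(leaf_assignments).items()}

stats = leaf_statistics(["C1", "C2", "C2", "C5", "C5", "C5", "C6"])
# By construction, the values of `stats` sum to 1.
```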

  6. Example • [Figure: D, the training set, shown alongside Dc, the complete data set]

  7. Example: decision tree • [Figure: a decision tree whose internal tests are "Bank is ICE", "Local is A", "Bank is IBE", "Local is B", and "Price is 100", with leaves C1–C6. C1 holds Billy and Tom, C2 Mary, C3 Ella, C4 John, and C6 Paul and Amy. The training-set leaf statistics pD(l) shown are 1/7, 1/7, 1/7, 0, and 2/7; the highlighted leaf has pD(l) = 2/7]

  8. Observable Statistics (1) • pS(l): the statistic at leaf l computed on S • pD(l): the statistic at leaf l computed on D • PS: the change of the leaf statistics on the data stream • A large PS means that a significant change has occurred
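The slides do not show the exact formula for PS, so as an assumption this sketch aggregates the per-leaf changes with an L1 distance, Σ |pS(l) − pD(l)|; the paper's exact aggregation may differ:

```python
def leaf_statistic_change(p_S, p_D):
    """L1 distance between stream and training leaf statistics (assumed form)."""
    leaves = set(p_S) | set(p_D)  # a leaf may be empty in one of the two sets
    return sum(abs(p_S.get(l, 0.0) - p_D.get(l, 0.0)) for l in leaves)
```

Identical statistics give 0; completely disjoint leaf usage gives the maximum value, 2.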

  9. Example (2) • [Figure: the same decision tree applied to a new data stream S. C2 now holds Erin and Hebe, C4 Boss and Sam, and C5 JoJo. The stream leaf statistics pS(l) shown are 0, 2/5, 0, 2/5, and 1/5; the highlighted leaf has pS(l) = 0]

  10. Observable Statistics (2) • La: the validation loss • Le: the sum of the expected loss at every leaf • LS: the potential change in loss due to changes in the data stream • Difference from PS: LS takes the loss function into account
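Under 0-1 loss, the expected loss contributed by a leaf is its share of the stream times one minus the confidence of its predicted class, which matches the Le(C2) = (1 − 0.7) × 30% = 9% computation in the worked example. A sketch assuming 0-1 loss; the names are illustrative:

```python
def expected_loss(leaf_stats, leaf_confidence):
    """L_e = sum over leaves of p(l) * (1 - confidence(l)) under 0-1 loss."""
    return sum(p * (1.0 - leaf_confidence[l]) for l, p in leaf_stats.items())
```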

  11. Example (3) • [Figure: the same tree on stream S. Leaf C2 (Hebe, Erin) now receives 30% of the stream and predicts its majority class with confidence 0.7; C4 holds Boss and Sam, and C5 holds JoJo. Expected loss at C2: Le(C2) = (1 − 0.7) × 30% = 9%]

  12. Loss Estimation • Triggered when either of the two statistics above exceeds the tolerable maximum • Investigate the true class labels of a selected number of examples • Let the loss of each example be {l1, l2, l3, ..., ln} • Average loss: Σ li / n • Standard error: s / √n, where s is the sample standard deviation • Investigation cost: obtaining true labels is not free
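The sampling estimate above amounts to a sample mean with its standard error. A minimal sketch, assuming the standard error is computed as s / √n from the sample standard deviation (the function name is illustrative):

```python
import math

def estimate_loss(sample_losses):
    """Average loss and standard error of the investigated sample {l1, ..., ln}."""
    n = len(sample_losses)
    mean = sum(sample_losses) / n
    var = sum((x - mean) ** 2 for x in sample_losses) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)
```

An approximate 95% range for the true loss is then mean ± 1.96 × standard error, which is what gets compared against the tolerable maximum.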

  13. Experiment (1) • The changing statistic is a good indicator of change

  14. Experiment(2)

  15. Experiment(3)

  16. Experiment(4)

  17. Experiment: Results • The two statistics are very well correlated with the amount of change • The statistically estimated loss range is very close to the true value

  18. Conclusion • Estimates the error without knowing the true class labels • Uses a statistical sampling method to estimate the range of the true loss • Reconstructs the model whenever the estimated loss is higher than the tolerable maximum
