This chapter explores strategies for learning from massive datasets. It considers whether the data can be held in main memory (where the Naïve Bayes method is notably economical), the need for incremental learning schemes, and the importance of linear or near-linear training time. It discusses options when the dataset is too large, including training on a smaller subset, the law of diminishing returns, and the danger of overfitting, as well as the role of parallelization in scaling up learning algorithms. The chapter also covers incorporating domain knowledge through metadata, text and web mining, and adversarial situations such as junk-email filtering.
Learning from Massive Datasets
• Can the data be held in main memory? The Naïve Bayes method needs only compact per-class counts
• Some learning schemes are incremental; some are not (a minimal sketch of an incremental learner follows this list)
• What about the time it takes to build the model? It should be linear, or near linear, in the size of the data
• What to do when the dataset is too large?
• Use a small subset of the data for training; the law of diminishing returns applies
• Some schemes do better with more data, but there is also a danger of overfitting
• Parallelization is another way: develop parallelized versions of learning schemes
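To make the incremental and memory points concrete, here is a minimal sketch (not from the original slides) of a Naïve Bayes learner that folds in one example at a time. The class name `IncrementalNaiveBayes` and its methods are illustrative assumptions, not an established API; only per-class counts are stored, so memory stays bounded and each update is linear in the number of features.

```python
import math
from collections import defaultdict

class IncrementalNaiveBayes:
    """Multinomial Naive Bayes that learns one example at a time.

    Only per-class counts are kept, so memory stays small even when
    the dataset itself cannot be held in main memory, and each update
    runs in time linear in the number of features.
    """

    def __init__(self):
        self.class_counts = defaultdict(int)                         # examples per class
        self.feature_counts = defaultdict(lambda: defaultdict(int))  # class -> feature -> count
        self.total = 0                                               # examples seen so far

    def update(self, features, label):
        """Incremental step: fold one labelled example into the counts."""
        self.class_counts[label] += 1
        self.total += 1
        for f in features:
            self.feature_counts[label][f] += 1

    def predict(self, features):
        """Return the most probable class under Laplace-smoothed counts."""
        vocab = {f for counts in self.feature_counts.values() for f in counts}
        best, best_score = None, float("-inf")
        for label, n in self.class_counts.items():
            score = math.log(n / self.total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(vocab)
            for f in features:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best
```

Because update() never revisits earlier examples, such a learner can stream a dataset from disk in a single pass, which is exactly why Naïve Bayes is the slide's example of a scheme that copes with data too large for main memory.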
Incorporating Domain Knowledge
• Metadata: data about data; it may be semantic, causal, or functional
• Text and web mining
• Adversarial situations: junk email filtering, for example (see the sketch below)
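As a hedged illustration of the adversarial point, a junk-email filter could reuse the hypothetical incremental learner sketched above: as spammers adapt their wording, newly labelled messages are simply folded in with further update() calls, with no retraining from scratch. The messages and tokenization below are invented for illustration.

```python
# Hypothetical junk-email filter built on the IncrementalNaiveBayes sketch.
nb = IncrementalNaiveBayes()
nb.update("cheap pills buy now".split(), "spam")
nb.update("meeting agenda attached".split(), "ham")
nb.update("win a free prize now".split(), "spam")

print(nb.predict("buy cheap prize".split()))         # -> "spam"
print(nb.predict("agenda for the meeting".split()))  # -> "ham"
```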