Apache Mahout

Apache Mahout, by Qiaodi Zhuang and Xijing Zhang. What is Mahout? Mahout is a scalable machine learning library from Apache. It uses the MapReduce paradigm, which in combination with Hadoop can serve as an inexpensive way to solve machine learning problems.

Presentation Transcript


  1. Apache Mahout • Qiaodi Zhuang, Xijing Zhang

  2. What is Mahout? • Mahout is a scalable machine learning library from Apache. • It uses the MapReduce paradigm, which in combination with Hadoop can serve as an inexpensive way to solve machine learning problems. • [1] Anil, Robin, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning, 2011.

  3. Problem & Challenge • Many datasets are now far too large for a single machine and cannot fit into main memory • [2] http://www.orzota.com/apache-mahout-and-machine-learning/

  4. Mahout’s Algorithms: • Clustering: K-means, Fuzzy K-means • Classification: SVM, Random Forests • Recommender • Pattern Mining • Regression

  5. K-means Algorithm: • Input: a database D of m records r1, ..., rm, and a desired number of clusters k • Output: a set of k clusters that locally minimizes the squared-error criterion • Begin • Randomly choose k records as the centroids of the k clusters; • repeat: assign each record ri to the cluster whose centroid (mean) is nearest among the k clusters; recalculate the centroid (mean) of each cluster from the records assigned to it; • until no change; • End;
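
The steps on this slide can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not Mahout's code. The function names (`kmeans`, `squared_distance`, `mean`), the `max_iterations` cap, the fixed `seed`, and the choice to keep the old centroid when a cluster ends up empty are all assumptions added for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(records):
    """Component-wise mean of a non-empty list of records."""
    n = len(records)
    return tuple(sum(column) / n for column in zip(*records))


def kmeans(records, k, max_iterations=100, seed=0):
    """Cluster `records` (a list of numeric tuples) into k groups.

    Returns (centroids, assignments), where assignments[i] is the
    index of the cluster that records[i] belongs to.
    """
    rng = random.Random(seed)
    # Randomly choose k records as the initial centroids.
    centroids = [tuple(r) for r in rng.sample(records, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assign each record to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(r, centroids[c]))
            for r in records
        ]
        # "until no change": stop once the assignments are stable.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recalculate each centroid as the mean of its assigned records.
        for c in range(k):
            members = [r for r, a in zip(records, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean(members)
    return centroids, assignments
```

Calling `kmeans(records, 2)` on a list of 2-D tuples returns the two centroids and one cluster index per record. Because the starting centroids are random, the result is a local optimum and may differ between seeds on harder data.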

  6. K-means Clustering in Mahout • [3] R. M. Esteves et al., K-means Clustering in the Cloud -- A Mahout Test, IEEE Advanced Information Networking and Applications, 2011.
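
The slide's own content is not preserved in this transcript, and Mahout's actual implementation is not reproduced here. As a rough illustration only, one k-means iteration can be expressed as a map step and a reduce step. The sketch below simulates both in a single Python process. The names `map_record`, `reduce_cluster`, and `kmeans_iteration` are hypothetical and do not correspond to Mahout or Hadoop APIs.

```python
from collections import defaultdict


def map_record(record, centroids):
    """Map step: emit (index of nearest centroid, (record, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(record, centroids[c])),
    )
    return nearest, (record, 1)


def reduce_cluster(values):
    """Reduce step: average all records emitted for one centroid index."""
    total = None
    count = 0
    for record, n in values:
        if total is None:
            total = tuple(record)
        else:
            total = tuple(a + b for a, b in zip(total, record))
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(records, centroids):
    """One iteration: map every record, group by key, reduce each group."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        key, value = map_record(record, centroids)
        groups[key].append(value)
    # Centroids that received no records are carried over unchanged.
    return [
        reduce_cluster(groups[c]) if c in groups else tuple(centroids[c])
        for c in range(len(centroids))
    ]
```

In a real cluster the map calls would run in parallel over file splits and the grouping would be done by the framework. A driver would call the iteration repeatedly until the centroids stop moving.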

  7. Evaluation • The dataset is from the 1999 KDD Cup. It has 4,940,000 records, with 41 attributes and 1 label (converted to numerical). A 1.1 GB dataset was used. This file was randomly segmented into smaller files. • [3] R. M. Esteves et al., K-means Clustering in the Cloud -- A Mahout Test, IEEE Advanced Information Networking and Applications, 2011.

  8. [3] R. M. Esteves et al., K-means Clustering in the Cloud -- A Mahout Test, IEEE Advanced Information Networking and Applications, 2011.

  9. Future • Classification: decision trees such as J48 and ID3 • Clustering: DBSCAN and Cobweb clustering techniques • Association Rules: Apriori

  10. References: • [1] Anil, Robin, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning, 2011. • [2] http://www.orzota.com/apache-mahout-and-machine-learning/ • [3] R. M. Esteves et al., K-means Clustering in the Cloud -- A Mahout Test, IEEE Advanced Information Networking and Applications, 2011. • [4] https://mahout.apache.org/ • [5] http://www.ibm.com/developerworks/java/library/j-mahout/

  11. Questions?

  12. Thank you!
