Introducing Apache Mahout: Scalable Machine Learning for All! Grant Ingersoll
Agenda
- What is Machine Learning?
  - Definitions
  - Types
  - Applications
- Mahout
  - What?
  - Why?
  - How?
  - Who?
What is Machine Learning?
- NOT! Or?
- Images: http://en.wikipedia.org/wiki/Image:Hal-9000.jpg and http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg
How about? Google News
Or? Amazon.com
Definition
- “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Introduction to Machine Learning, E. Alpaydin)
- A subset of Artificial Intelligence
- Draws on many other fields: computer science, biology, math, psychology, etc.
Characterizations
- Lots of Data
- Identifiable Features in that Data
- Too big/costly for people to handle
- People can still help
Types
- Supervised: using labeled training data, create a function that predicts the output for unseen inputs
- Unsupervised: using unlabeled data, create a function that predicts output (for example, which group an input belongs to)
- Semi-Supervised: uses both labeled and unlabeled data
Classification/Categorization
- Spam Filtering
- Named Entity Recognition
- Phrase Identification
- Sentiment Analysis
- Classification into a Taxonomy
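To make the spam-filtering case concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is a generic illustration in plain Python, not Mahout's implementation; the function names and the four-message training set are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and words per label from (label, text) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return {"doc_counts": doc_counts,
            "word_counts": word_counts,
            "vocabulary": vocabulary}


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, float("-inf")
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Start from the log prior of the label.
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, for illustration only.
TRAINING = [
    ("spam", "win cash now"),
    ("spam", "cheap pills win big"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]
model = train_naive_bayes(TRAINING)
```

Calling `classify(model, "win cheap cash")` picks the label whose word statistics best explain the message. A real filter would need far more training data and better tokenization.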
Clustering
- Find Natural Groupings
  - Documents
  - Search Results
  - People
  - Genetic traits in groups
  - Many, many more uses
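As a rough illustration of finding natural groupings, the sketch below runs a basic k-means loop on 2-D points. It is a single-machine toy, not Mahout's clustering code. Seeding the centroids with the first k points is an assumption made to keep the example deterministic, and the sample points are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    # Assumption: naive seeding with the first k points (deterministic, not robust).
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two visually obvious groups.
POINTS = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(POINTS, k=2)
```

With this data the points near (1, 1) end up in one cluster and the points near (9, 9) in the other.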
Collaborative Filtering
- Recommend people and products
- User-User: User likes X, you might too
- Item-Item: People who bought X also bought Y
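The item-item idea ("people who bought X also bought Y") can be sketched with simple co-occurrence counts. This is a generic illustration, not the Taste API; the function names and the purchase data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    co = defaultdict(Counter)
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co


def recommend(baskets, user, top_n=3):
    """Rank items the user lacks by co-occurrence with items the user has."""
    co = build_cooccurrence(baskets)
    owned = set(baskets[user])
    scores = Counter()
    for item in owned:
        for other, count in co[item].items():
            if other not in owned:
                scores[other] += count
    # Sort by descending score, then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked][:top_n]


# Hypothetical purchase histories, for illustration only.
PURCHASES = {
    "alice": ["book", "lamp"],
    "bob": ["book", "lamp", "desk"],
    "carol": ["book", "desk"],
    "dave": ["book"],
    "erin": ["book", "lamp"],
}
```

Here `recommend(PURCHASES, "dave")` ranks "lamp" ahead of "desk", because "lamp" shares a basket with "book" three times and "desk" only twice. Production recommenders normally weight by ratings or similarity measures rather than raw counts.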
Info. Retrieval
- Learning Ranking Functions
- Learning Spelling Corrections
- User Click Analysis and Tracking
Other
- Image Analysis
- Robotics
- Games
- Higher level natural language processing
- Many, many others
What is Apache Mahout?
- A Mahout is an elephant trainer/driver/keeper, hence…
- Hadoop (and other distributed techniques) + Machine Learning = Mahout
What?
- Hadoop brings:
  - Map/Reduce API
  - HDFS
  - In other words, scalability and fault-tolerance
- Thus, Mahout’s Goal is: Scalable Machine Learning with the Apache License
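For readers new to the Map/Reduce model, the sketch below simulates its three stages (map, shuffle, reduce) in a single Python process using word count. It does not use the Hadoop API and offers none of Hadoop's distribution or fault tolerance; all function names here are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Chain the three stages in-process (a stand-in for a real cluster job)."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    """Sum the partial counts for one word."""
    return sum(counts)


# Hypothetical input lines, for illustration only.
LINES = ["Mahout rides Hadoop", "Hadoop scales", "mahout learns"]
result = run_job(LINES, word_count_mapper, word_count_reducer)
```

Because mappers see one record at a time and reducers see one key at a time, the same two functions could in principle run on many machines in parallel, which is where the scalability comes from.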
Why Mahout?
- Many Open Source ML libraries either:
  - Lack Community
  - Lack Documentation and Examples
  - Lack Scalability
  - Lack the Apache License ;-)
  - Or are research-oriented
- Personal: Learn more ML
- Intelligent Apps are the Present and Future
- See the Hadoop talks tomorrow and Friday!
- Goal: Overcome gaps the Apache Way!
Current Status
- Close to initial release
- Focused on examples, docs, bug fixes
- What’s in it:
  - Simple Matrix/Vector library
  - Taste Collaborative Filtering
  - Clustering: Canopy, K-Means, Fuzzy K-Means, Mean-shift
  - Classifiers: Naïve Bayes, Complementary Naïve Bayes
  - Evolutionary: integration with Watchmaker for fitness functions
How? Examples
- Taste
- Clustering
- Classification
- Evolutionary
Taste: Movie Recommendations
- Given ratings by users of movies, recommend other movies
- http://lucene.apache.org/mahout/taste.html#demo
Clustering: Synthetic Control Data
- http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series
- Each clustering implementation has an example Job that can be run from <MAHOUT_HOME>/examples
  - o.a.mahout.clustering.syntheticcontrol.*
- Outputs clusters…
Classification: NB and CNB Examples
- 20 Newsgroups: http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
- Wikipedia: http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
Evolutionary
- Traveling Salesman: http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
- Class Discovery: http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
What’s Next?
- Release 0.1!
- Shared Amazon Images (others?)
- More Examples
- Winnow/Perceptron (MAHOUT-85)
- HBase and HAMA support
- Normalize I/O format for data
- Solr Integration (SOLR-769)
- Other Algorithms: SVM, Linear Regression, etc.
When, Where, Who
- When? Now! Mahout is growing
- Who? You! We want Java programmers who:
  - Are comfortable with math
  - Like to work on large, hard problems
- Where?
  - http://lucene.apache.org/mahout
  - http://cwiki.apache.org/MAHOUT
  - mahout-{user|dev}@lucene.apache.org
Resources
- “Programming Collective Intelligence” by Toby Segaran
- “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
- Hadoop - http://hadoop.apache.org
- http://mloss.org/software/