scikit-learn Machine Learning in Python Vandana Bachani http://infolab.tamu.edu Spring 2012
Outline • What is scikit-learn? • How can it be useful to the lab? • There are other packages too! • Features • Usage • Conclusion http://infolab.tamu.edu
What is scikit-learn? scikit-learn is a Python module integrating classic machine learning algorithms into the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). • A comprehensive package covering most machine learning needs • Fast: the core algorithms are built on optimized numpy/scipy code • Accuracy? Given the right data and features, the results are dependable Ref: http://scikit-learn.org/stable/
How can it be useful to the lab? • Our day-to-day tasks: • Regression/Prediction • Text Classification • Text Feature Extraction • Text Feature Selection • Using Chi-Square and other metrics • Cross-Validation • K-Fold • Clustering (K-Means, etc.) • Maybe in the future: • Image Classification All in one package! http://infolab.tamu.edu
There are other packages too! http://infolab.tamu.edu
Features Linear Models • Regression (Predicting Continuous Values) Example: Prices of houses (Boston house dataset) • Linear, Ridge, Lasso (for sparse coefficients, useful in the field of compressed sensing), LARS (for very high-dimensional data), Bayesian • Classification • Logistic Regression, Stochastic Gradient Descent http://infolab.tamu.edu
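A minimal regression sketch (the toy data and alpha values here are made up for illustration):

>>> from sklearn import linear_model
>>> X = [[0, 0], [1, 1], [2, 2]]
>>> y = [0, 1, 2]
>>> ridge = linear_model.Ridge(alpha=0.5)   # L2-regularized least squares
>>> ridge.fit(X, y)
>>> lasso = linear_model.Lasso(alpha=0.1)   # L1 penalty -> sparse coefficients
>>> lasso.fit(X, y)
>>> lasso.coef_                             # many entries driven to exactly zero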
Features Support Vector Machines • Classification • SVC (one-vs-one), LinearSVC (one-vs-rest) • Regression • SVR • Density estimation & outlier detection via the one-class SVM (unsupervised learning) http://infolab.tamu.edu
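A hedged SVC/SVR sketch on toy data (the predict output follows the scikit-learn docs example for this data):

>>> from sklearn import svm
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = svm.SVC(kernel='linear')    # one-vs-one for multiclass problems
>>> clf.fit(X, y)
>>> clf.predict([[2., 2.]])
array([1])
>>> reg = svm.SVR()                   # same fit/predict interface for regression
>>> reg.fit(X, [0.0, 1.0])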
Features Unsupervised Learning • Clustering • K-Means, Mean Shift, Spectral Clustering • Ward (hierarchical, constructs a tree) • Manifold Learning • Dimensionality Reduction (for visualization, etc.) • Novelty and Outlier Detection • Uses the one-class SVM http://infolab.tamu.edu
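A minimal K-Means sketch (the points are made up; the parameter is n_clusters in current releases, while the 2012-era API called it k):

>>> from sklearn.cluster import KMeans
>>> X = [[1, 0], [1, 2], [10, 0], [10, 2]]
>>> km = KMeans(n_clusters=2)     # 'k=2' in older scikit-learn versions
>>> km.fit(X)
>>> km.labels_                    # cluster index assigned to each sample
>>> km.cluster_centers_           # coordinates of the two centroids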
Features Miscellaneous • Nearest Neighbors • Unsupervised, Classification • Decision Trees • Classification, Regression • Gaussian Processes • Regression • Metrics • metrics.roc_curve(y_true, y_score) • metrics.precision_recall_fscore_support(...) • joblib and pickle (model persistence) http://infolab.tamu.edu
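The metrics calls listed above take true labels plus scores or predictions; a small sketch with made-up values, ending with pickle-based persistence of a fitted model:

>>> from sklearn import metrics
>>> y_true = [0, 0, 1, 1]
>>> y_score = [0.1, 0.4, 0.35, 0.8]            # e.g. decision_function output
>>> fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
>>> metrics.precision_recall_fscore_support(y_true, [0, 1, 0, 1])
>>> import pickle
>>> from sklearn.svm import SVC
>>> clf = SVC().fit([[0., 0.], [1., 1.]], [0, 1])
>>> s = pickle.dumps(clf)                      # joblib.dump is preferable for large arrays
>>> clf2 = pickle.loads(s)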
Features • Cross-Validation • cross_validation.KFold(n, k[, indices]) • Datasets • Built-in loaders, e.g. datasets.load_iris() • Feature Extraction • Text • feature_extraction.text.WordNGramAnalyzer([...]) • feature_extraction.text.CharNGramAnalyzer([...]) • Image • feature_extraction.image.extract_patches_2d(...) • Feature Selection • feature_selection.chi2(X, y) • feature_selection.SelectKBest(score_func[, k]) http://infolab.tamu.edu
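A minimal chi-square feature selection sketch on a made-up term-count matrix (chi2 requires non-negative features, e.g. word counts):

>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X = [[1, 0, 3], [0, 2, 1], [4, 0, 0], [0, 3, 2]]    # counts per document
>>> y = [0, 1, 0, 1]
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 highest-scoring terms
>>> X_new.shape
(4, 2)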
Usage • Linear Regression >>> from sklearn import linear_model >>> clf = linear_model.LinearRegression() >>> clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2]) LinearRegression(copy_X=True, fit_intercept=True, normalize=False) • Classification >>> from sklearn.linear_model import SGDClassifier >>> X = [[0., 0.], [1., 1.]] >>> y = [0, 1] >>> clf = SGDClassifier(loss="hinge", penalty="l2") >>> clf.fit(X, y) SGDClassifier(alpha=0.0001, class_weight=None, eta0=0.0, fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0, shuffle=False, verbose=0) http://infolab.tamu.edu
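Once fitted, the same estimator object makes predictions; for the toy SGD classifier above, the scikit-learn docs show:

>>> clf.predict([[2., 2.]])
array([1])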
Usage • SVC & Cross-Validation >>> from sklearn import datasets >>> from sklearn import svm >>> from sklearn import cross_validation >>> iris = datasets.load_iris() >>> clf = svm.SVC(kernel='linear') >>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5) >>> scores array([ 1. ..., 0.96..., 0.9 ..., 0.96..., 1. ...]) http://infolab.tamu.edu
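The per-fold scores come back as a numpy array, so summarizing them is one call (the value here is just the approximate mean of the array shown above):

>>> scores.mean()
0.9666...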
Sample Code

from time import time
from sklearn.feature_extraction.text import Vectorizer  # 2012-era API; later renamed TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics

# trainData / testData are assumed to be loaded earlier (e.g. a datasets
# bunch object) with .data, .target, and .target_names attributes
data_train, data_test = trainData.data, testData.data
y_train, y_test = trainData.target, testData.target
categories = trainData.target_names

print "Extracting features from the training dataset"
# a specific analyzer can be passed to the vectorizer;
# by default WordNGramAnalyzer is used
t0 = time()
vectorizer = Vectorizer()
X_train = vectorizer.fit_transform(data_train)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape

print "Extracting features from the test dataset"
t0 = time()
X_test = vectorizer.transform(data_test)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_test.shape

def classify(clf, X_train, y_train, X_test, y_test):
    # fit on the training split, then report timings and per-class metrics
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print "train time: %0.3fs" % train_time
    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print "test time: %0.3fs" % test_time
    print "classification report:"
    print metrics.classification_report(y_test, pred, target_names=categories)

penalty = "l2"  # LinearSVC can be tried with L1 or L2 penalties
print "LinearSVC"
linearSVC = LinearSVC(loss='l2', penalty=penalty, C=1000, dual=False, tol=1e-3)
classify(linearSVC, X_train, y_train, X_test, y_test)

print "SGDClassifier"
sgdClf = SGDClassifier(alpha=.0001, n_iter=50, penalty=penalty)
classify(sgdClf, X_train, y_train, X_test, y_test)

print "NaiveBayes - Bernoulli"  # BernoulliNB; use MultinomialNB for the multinomial variant
bernoulliNBClf = BernoulliNB(alpha=.01)
classify(bernoulliNBClf, X_train, y_train, X_test, y_test)

http://infolab.tamu.edu
Sample Results

SGDClassifier
train time: 1.505s
test time: 0.023s
classification report:
             precision    recall  f1-score   support

 TECHNOLOGY       0.75      0.99      0.85      3918
     IDIOMS       0.94      0.66      0.78      5205
  POLITICAL       0.88      0.99      0.93      4268
      MUSIC       0.90      0.74      0.81       872
      GAMES       0.97      0.95      0.96       457
     SPORTS       0.87      0.98      0.92       443
     MOVIES       0.97      0.90      0.93      1092
  CELEBRITY       0.73      0.46      0.56        24

avg / total       0.88      0.86      0.86     16279

http://infolab.tamu.edu
Conclusion • If you are a Python person, scikit-learn looks like a solid library • NLTK + scikit-learn should make an excellent pair for our lab • Good documentation wins! http://infolab.tamu.edu
Thanks Email: vandana_bvj_tamu@tamu.edu http://infolab.tamu.edu