An Introduction to Machine Learning with Perl
An Introduction to Machine Learning with Perl February 3, 2003 O’Reilly Bioinformatics Conference Ken Williams ken@mathforum.org
Tutorial Overview • What is Machine Learning? (20’) • Why use Perl for ML? (15’) • Some theory (20’) • Some tools (30’) • Decision trees (20’) • SVMs (15’) • Categorization (40’)
References & Sources • Machine Learning, Tom Mitchell. McGraw-Hill, 414 pp, 1997 • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp, 1999 • Perl-AI list (perl-ai@perl.org)
What Is Machine Learning? • A subfield of Artificial Intelligence (but without the baggage) • Usually concerns some particular task, not the building of a sentient robot • Concerns the design of systems that improve (or at least change) as they acquire knowledge or experience
Typical ML Tasks • Clustering • Categorization • Recognition • Filtering • Game playing • Autonomous performance
Typical ML Tasks • Recognition [image labels: Vincent Van Gogh, Michael Stipe, Mohammed Ali, Ken Williams, Burl Ives, Winston Churchill, Grover Cleveland]
Typical ML Tasks • Recognition [example phrases: "Little red corvette", "The kids are all right", "The rain in Spain", "Bort bort bort"]
Typical ML Buzzwords • Data Mining • Knowledge Management (KM) • Information Retrieval (IR) • Expert Systems • Topic detection and tracking
Who does ML? • Two main groups: research and industry • These groups do listen to each other, at least somewhat • Not many reusable ML/KM components, outside of a few commercial systems • KM is seen as a key component of big business strategy, hence lots of KM consultants • ML is an extremely active research area with relatively low “cost of entry”
When is ML useful? • When you have lots of data • When you can’t hire enough people, or when people are too slow • When you can afford to be wrong sometimes • When you need to find patterns • When you have nothing to lose
An aside on your presenter • Academic background in math & music (not computer science or even statistics) • Several years as a Perl consultant • Two years as a math teacher • Currently studying document categorization at The University of Sydney • In other words, a typical ML student
Why use Perl for ML? • CPAN - the viral solution™ • Perl has rapid reusability • Perl is widely deployed • Perl code can be written quickly • Embeds both ways • Human-oriented development • Leaves your options open
But what about all the data? • ML techniques tend to use lots of data in complicated ways • Perl is great at data in general, but tends to gobble memory or forgo strict checking • Two fine solutions exist: • Be as careful in Perl as you are in C (Params::Validate, Tie::SecureHash, etc.) • Use PDL or Inline (more on these later)
Interfaces vs. Implementations • In ML applications, we need both data integrity and the ability to “play with it” • Perl wrappers around C/C++ structures/objects are a nice balance • Keeps high-level interfaces in Perl, low-level implementations in C/C++ • Can be prototyped in pure Perl, with C/C++ parts added later
Some ML Theory and Terminology • ML concerns learning a target function from a set of examples • The target function is often called a hypothesis • Example: with a neural network, a trained network is a hypothesis • The set of all possible target functions is called the hypothesis space • The training process can be considered a search through the hypothesis space
Some ML Theory and Terminology • Each ML technique will • probably exclude some hypotheses • prefer some hypotheses over others • A technique’s exclusion & preference rules are called its inductive bias • If it ain’t biased, it ain’t learnin’ • No bias = rote learning • Bias = generalization • Example: kids learning multiplication (understanding vs. memorization)
Some ML Theory and Terminology • Ideally, an ML technique will • not exclude the “right” hypothesis, i.e. the hypothesis space will include the target hypothesis • prefer the target hypothesis over others • Measuring the degree to which these criteria are satisfied is important and sometimes complicated
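To make "training as a search through the hypothesis space" concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the toy data, the grid of thresholds, and the helper names `training_errors` and `best_threshold`. The hypothesis space is the family of rules "predict 1 if x >= t". Restricting the search to that family excludes other hypotheses, and breaking ties in favour of the smallest threshold prefers some hypotheses over others.

```perl
use strict;
use warnings;

# Hypothetical toy data: [feature value, label], where label 1 = positive.
my @train = ([1.0, 0], [2.0, 0], [3.0, 0], [6.0, 1], [7.0, 1], [9.0, 1]);

# Hypothesis space: every rule "predict 1 if x >= t" for t on a fixed grid.
my @thresholds = map { $_ / 2 } 0 .. 20;    # 0, 0.5, 1, ..., 10

# Count the training examples that a given threshold misclassifies.
sub training_errors {
    my ($t, $examples) = @_;
    my $errors = 0;
    for my $ex (@$examples) {
        my ($x, $label) = @$ex;
        my $predicted = $x >= $t ? 1 : 0;
        $errors++ if $predicted != $label;
    }
    return $errors;
}

# "Training": exhaustively search the hypothesis space for the threshold
# with the fewest training errors. Ties go to the first (smallest) one.
sub best_threshold {
    my ($candidates, $examples) = @_;
    my ($best, $best_errors);
    for my $t (@$candidates) {
        my $e = training_errors($t, $examples);
        if (!defined $best_errors or $e < $best_errors) {
            ($best, $best_errors) = ($t, $e);
        }
    }
    return ($best, $best_errors);
}

my ($t, $errors) = best_threshold(\@thresholds, \@train);
print "Chosen threshold: $t (training errors: $errors)\n";
```

Real learners rarely search exhaustively, but the structure is the same: a space of candidates, a way to score them, and a rule for choosing among them.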
Evaluating Hypotheses • We often want to know how good a hypothesis is: 1) To know how it performs in the real world 2) To improve the learning technique or tune its parameters 3) To let a learner automatically improve the hypothesis • Usually evaluate on test data • Test data must be kept separate from training data • Test data used for purpose 3) is usually called validation or held-out data • Training, validation, and test data should not contaminate each other
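One simple way to keep the three data sets from contaminating each other is to shuffle once and cut the list into disjoint pieces. The sketch below assumes a hypothetical helper `split_examples` and an arbitrary 60/20/20 split. It uses only the core List::Util module.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Split a list of examples into disjoint training, validation, and test
# sets. Fractions are of the whole list; the remainder becomes test data.
sub split_examples {
    my ($examples, $train_frac, $valid_frac) = @_;
    my @shuffled = shuffle @$examples;
    my $n_train  = int($train_frac * @shuffled);
    my $n_valid  = int($valid_frac * @shuffled);
    my @train = splice(@shuffled, 0, $n_train);   # removes them from @shuffled
    my @valid = splice(@shuffled, 0, $n_valid);
    my @test  = @shuffled;                        # whatever is left
    return (\@train, \@valid, \@test);
}

# Hypothetical document IDs standing in for real examples.
my @examples = map { "doc$_" } 1 .. 10;
my ($train, $valid, $test) = split_examples(\@examples, 0.6, 0.2);
printf "train=%d valid=%d test=%d\n",
    scalar @$train, scalar @$valid, scalar @$test;
```

Because each example is spliced out of the shuffled list exactly once, no example can land in two sets.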
Evaluating Hypotheses • Some standard statistical measures are useful • Error rate, accuracy, precision, recall, F1 • Calculated using contingency tables
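Evaluating Hypotheses • In the contingency table: a = assigned and correct, b = assigned but incorrect, c = correct but not assigned, d = neither assigned nor correct • Error = (b+c)/(a+b+c+d) • Accuracy = (a+d)/(a+b+c+d) • Precision = p = a/(a+b) • Recall = r = a/(a+c) • F1 = 2pr/(p+r) • Precision is easy to maximize by assigning nothing • Recall is easy to maximize by assigning everything • F1 combines precision and recall equally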
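The formulas above translate directly into a few lines of Perl. This is a minimal sketch: the function name `contingency_scores` and the example counts are made up, and scoring an undefined ratio (nothing assigned, or nothing correct) as 0 is a convention chosen here, not something the formulas dictate. The arguments are named `$tp`, `$fp`, `$fn`, `$tn` rather than a, b, c, d because `$a` and `$b` are special to Perl's `sort`.

```perl
use strict;
use warnings;

# Arguments correspond to the contingency-table cells:
#   $tp = a (assigned and correct)    $fp = b (assigned but incorrect)
#   $fn = c (correct, not assigned)   $tn = d (neither)
sub contingency_scores {
    my ($tp, $fp, $fn, $tn) = @_;
    my $total = $tp + $fp + $fn + $tn;
    die "No examples to score\n" unless $total;

    # Assumed convention: an undefined ratio counts as 0.
    my $precision = ($tp + $fp) ? $tp / ($tp + $fp) : 0;
    my $recall    = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
    my $f1 = ($precision + $recall)
        ? 2 * $precision * $recall / ($precision + $recall)
        : 0;

    my %scores = (
        error     => ($fp + $fn) / $total,
        accuracy  => ($tp + $tn) / $total,
        precision => $precision,
        recall    => $recall,
        f1        => $f1,
    );
    return \%scores;
}

# Hypothetical counts: 1000 decisions, 40 of them correct assignments.
my $scores = contingency_scores(40, 10, 20, 930);
printf "%-9s %.3f\n", $_, $scores->{$_}
    for qw(error accuracy precision recall f1);
```

With these counts accuracy is 0.97 even though a third of the correct assignments were missed, which is why precision, recall, and F1 are usually reported alongside it.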
Evaluating Hypotheses • Example (from categorization): Precision = 0.851, Recall = 0.711, F1 = 0.775 • Note that precision is higher than recall, which indicates a cautious categorizer • These scores depend on the task, so scores can’t be compared across tasks • Often useful to score each category separately, then average the scores (macro-averaging)
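The difference between macro-averaging (score each category, then average) and micro-averaging (pool the counts, then score once) is easiest to see with one large and one small category. The sketch below uses invented counts and hypothetical helper names. It relies on the identity F1 = 2a/(2a+b+c), which follows from substituting p = a/(a+b) and r = a/(a+c) into 2pr/(p+r).

```perl
use strict;
use warnings;

# F1 from raw counts: tp = a, fp = b, fn = c in the contingency table.
# Algebraically equal to 2pr/(p+r); returns 0 when there is nothing to score.
sub f1 {
    my ($tp, $fp, $fn) = @_;
    my $denom = 2 * $tp + $fp + $fn;
    return $denom ? 2 * $tp / $denom : 0;
}

# Hypothetical per-category counts: one common category, one rare one.
my %counts = (
    common => { tp => 90, fp => 10, fn => 10 },
    rare   => { tp => 1,  fp => 1,  fn => 9  },
);

# Macro-average: score each category separately, then average the scores.
sub macro_f1 {
    my ($counts) = @_;
    my @scores = map { f1(@{$_}{qw(tp fp fn)}) } values %$counts;
    my $sum = 0;
    $sum += $_ for @scores;
    return $sum / @scores;
}

# Micro-average: pool the counts over all categories, then score once.
sub micro_f1 {
    my ($counts) = @_;
    my %total = (tp => 0, fp => 0, fn => 0);
    for my $c (values %$counts) {
        $total{$_} += $c->{$_} for qw(tp fp fn);
    }
    return f1(@total{qw(tp fp fn)});
}

printf "macro F1 = %.3f, micro F1 = %.3f\n",
    macro_f1(\%counts), micro_f1(\%counts);
```

Macro-averaging weights every category equally, so the poorly handled rare category drags the score down. Micro-averaging weights every decision equally, so the common category dominates.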
Evaluating Hypotheses • The Statistics::Contingency module (on CPAN) helps calculate these figures:

    use Statistics::Contingency;
    my $s = new Statistics::Contingency;
    while (...) {
        ... Do some categorization ...
        $s->add_result($assigned, $correct);
    }
    print "Micro F1: ", $s->micro_F1, "\n";
    print $s->stats_table;

Output:

    Micro F1: 0.774803607797498
    +-------------------------------------------------+
    |    miR    miP   miF1    maR    maP   maF1    Err|
    |  0.243  0.843  0.275  0.711  0.851  0.775  0.006|
    +-------------------------------------------------+
Useful Perl Data-Munging Tools • Storable - cheap persistence and cloning • PDL - helps performance and design • Inline::C - tight loops and interfaces
Storable • One of many persistence classes for Perl data (Data::Dumper, YAML, Data::Denter) • Allows saving structures to disk:

    store($x, $filename);
    $x = retrieve($filename);

• Allows cloning of structures:

    $y = dclone($x);

• Not terribly interesting, but handy
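A slightly fuller sketch of the three Storable calls, using a made-up "model" hash and a temporary file from the core File::Temp module. The data and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Storable qw(store retrieve dclone);
use File::Temp qw(tempfile);

# A hypothetical nested structure, e.g. the state of a trained categorizer.
my %model = (
    vocabulary => [qw(gene protein sequence)],
    weights    => { gene => 0.7, protein => 0.2, sequence => 0.1 },
);

# Persistence: write the structure to disk and read it back.
my ($fh, $filename) = tempfile(UNLINK => 1);
close $fh;
store(\%model, $filename);
my $restored = retrieve($filename);

# Cloning: dclone makes a deep copy, so editing the copy
# leaves the original untouched.
my $copy = dclone(\%model);
$copy->{weights}{gene} = 0;

print "Restored weight: $restored->{weights}{gene}\n";
print "Original weight after editing the clone: $model{weights}{gene}\n";
```

Both `store` and `dclone` take references, which is why the hash is passed as `\%model`.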
PDL • Perl Data Language • On CPAN, of course (PDL-2.3.4.tar.gz) • Turns Perl into a data-processing language similar to Matlab • Native C/Fortran numerical handling • Compact multi-dimensional arrays • Still Perl at highest level
PDL demo • PDL experimentation shell:

    ken% perldl
    perldl> demo pdl

Extending PDL • PDL has an extension language, PDL::PP • Lets you write C extensions to PDL • Handles many gory details (data types, loop indexes, “threading”)
Extending PDL • Example: $n = $pdl->sum_elements;

    # Usage:
    $pdl = PDL->random(7);
    print "PDL: $pdl\n";
    $sum = $pdl->sum_elements;
    print "Sum: $sum\n";

    # Output:
    PDL: [0.513 0.175 0.308 0.534 0.947 0.171 0.702]
    Sum: [3.35]
Extending PDL

    pp_def('sum_elements',
        Pars => 'a(n); [o]b();',
        Code => <<'EOF',
    double tmp;
    tmp = 0;
    loop(n) %{
        tmp += $a();
    %}
    $b() = tmp;
    EOF
    );
Extending PDL • The same definition with $GENERIC() in place of the hard-coded double:

    pp_def('sum_elements',
        Pars => 'a(n); [o]b();',
        Code => <<'EOF',
    $GENERIC() tmp;
    tmp = ($GENERIC()) 0;
    loop(n) %{
        tmp += $a();
    %}
    $b() = tmp;
    EOF
    );
Inline::C • Allows very easy embedding of C code in Perl modules • Also Inline::Java, Inline::Python, Inline::CPP, Inline::ASM, Inline::Tcl • Considered much easier than XS or SWIG • Developers are very enthusiastic and helpful
Inline::C basic syntax • A complete Perl script using Inline (taken from the Inline docs):

    #!/usr/bin/perl
    greet();
    use Inline C => q{
        void greet() { printf("Hello, world\n"); }
    }
Inline::C for writing functions • Find the next prime number at or above $x

    #!/usr/bin/perl
    foreach (-2.7, 29, 30.33, 100_000) {
        print "$_: ", next_prime($_), "\n";
    }
    . . .

Inline::C for writing functions

    use Inline C => q{
    int next_prime(double in) {
        // Implements a Sieve of Eratosthenes
        int *is_prime;
        int i, j;
        int first_candidate = ceil(in);
        if (in < 2.0) return 2;
        is_prime = malloc(2 * first_candidate * sizeof(int));
        for (i = 0; i < 2*first_candidate; i++)
            is_prime[i] = 1;
    . . .

Inline::C for writing functions

        for (i = 2; i < 2*first_candidate; i++) {
            if (!is_prime[i]) continue;
            if (i >= first_candidate) { free(is_prime); return i; }
            for (j = i; j < 2*first_candidate; j += i)
                is_prime[j] = 0;
        }
        return 0; // Should never get here
    }
    }
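A pure-Perl prototype of the same sieve is a handy way to check the C version's answers before relying on it. The following is a sketch written for this purpose, not code from the talk. It mirrors the C logic, including the upper bound of twice the starting value, which is safe because there is always a prime between n and 2n.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Pure-Perl counterpart of the C next_prime(): the smallest prime >= $in.
sub next_prime {
    my ($in) = @_;
    return 2 if $in < 2;

    my $first_candidate = ceil($in);
    my $limit = 2 * $first_candidate;   # a prime always exists below this
    my @is_prime = (1) x $limit;

    for my $i (2 .. $limit - 1) {
        next unless $is_prime[$i];      # crossed off by a smaller prime
        return $i if $i >= $first_candidate;
        # Cross off every multiple of the prime $i.
        for (my $j = 2 * $i; $j < $limit; $j += $i) {
            $is_prime[$j] = 0;
        }
    }
    return;    # not reached for $in >= 2
}

print "$_: ", next_prime($_), "\n" for (-2.7, 29, 30.33, 100_000);
```

If the Perl prototype turns out to be fast enough for the job, the C version may not be needed at all.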
Inline::C for wrapping libraries • We’ll create a wrapper for ‘libbow’, an IR package • Contains an implementation of the Porter word-stemming algorithm (e.g., the stem of 'trying' is 'try')

    # A Perlish interface:
    $stem = stem_porter($word);

    # A C-like interface:
    stem_porter_inplace($word);
Inline::C for wrapping libraries

    package Bow::Inline;
    use strict;
    use Exporter;
    use vars qw($VERSION @ISA @EXPORT_OK);
    BEGIN { $VERSION = '0.01'; }
    @ISA = qw(Exporter);
    @EXPORT_OK = qw(stem_porter stem_porter_inplace);
    . . .

Inline::C for wrapping libraries

    use Inline (C       => 'DATA',
                VERSION => $VERSION,
                NAME    => __PACKAGE__,
                LIBS    => '-L/tmp/bow/lib -lbow',
                INC     => '-I/tmp/bow/include',
                CCFLAGS => '-no-cpp-precomp',
               );
    1;
    __DATA__
    __C__
    . . .

Inline::C for wrapping libraries

    // libbow includes bow_stem_porter()
    #include "bow/libbow.h"

    // The bare-bones C interface exposed
    int stem_porter_inplace(SV* word) {
        int retval;
        char* ptr = SvPV_nolen(word);
        retval = bow_stem_porter(ptr);
        SvCUR_set(word, strlen(ptr));
        return retval;
    }
    . . .
Inline::C for wrapping libraries

    // A Perlish interface
    char* stem_porter (char* word) {
        // A NULL char* return value arrives in Perl as undef
        if (!bow_stem_porter(word)) return NULL;
        return word;
    }

    // Don't know what the hell these are for in libbow,
    // but it needs them.
    const char *argp_program_version = "foo 1.0";
    const char *program_invocation_short_name = "foofy";
When to use speed tools • A word of caution - don’t use C or PDL before you need to • Plain Perl is great for most tasks and usually pretty fast • Remember - external libraries (like libbow, pari-gp) both solve problems and create headaches
Decision Trees • Conceptually simple • Fast evaluation • Scrutable structures • Can be learned from training data • Can be difficult to build • Can “overfit” training data • Usually prefer simpler, i.e. smaller trees
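"Fast evaluation" and "scrutable structures" are both visible in a hand-built tree stored as nested hashes. The sketch below is illustrative only. The attributes and labels are made up, loosely following the well-known weather example in Mitchell's textbook, and `classify` is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;

# A hand-built tree: internal nodes test one attribute, leaves hold a label.
my $tree = {
    attribute => 'outlook',
    children  => {
        sunny => {
            attribute => 'humidity',
            children  => {
                high   => { label => 'no' },
                normal => { label => 'yes' },
            },
        },
        overcast => { label => 'yes' },
        rain     => {
            attribute => 'wind',
            children  => {
                strong => { label => 'no' },
                weak   => { label => 'yes' },
            },
        },
    },
};

# Walk from the root to a leaf: one hash lookup per level of the tree.
# Returns undef if the instance has a value the tree has never seen.
sub classify {
    my ($node, $instance) = @_;
    while (!exists $node->{label}) {
        my $value = $instance->{ $node->{attribute} };
        return undef
            unless defined $value && exists $node->{children}{$value};
        $node = $node->{children}{$value};
    }
    return $node->{label};
}

my %today = (outlook => 'sunny', humidity => 'high', wind => 'weak');
print "Decision: ", classify($tree, \%today), "\n";
```

The tree can be read top to bottom as a set of rules, and classifying an instance costs at most one lookup per level. Building a good tree from training data is the hard part.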
Decision Trees • Sample training data: