An Introduction to Machine Learning with Perl
February 3, 2003 - O’Reilly Bioinformatics Conference
Ken Williams ken@mathforum.org
Tutorial Overview • What is Machine Learning? (20’) • Why use Perl for ML? (15’) • Some theory (20’) • Some tools (30’) • Decision trees (20’) • SVMs (15’) • Categorization (40’)
References & Sources • Machine Learning, Tom Mitchell. McGraw-Hill, 414 pp, 1997 • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp, 1999 • Perl-AI list (perl-ai@perl.org)
What Is Machine Learning? • A subfield of Artificial Intelligence (but without the baggage) • Usually concerns some particular task, not the building of a sentient robot • Concerns the design of systems that improve (or at least change) as they acquire knowledge or experience
Typical ML Tasks • Clustering • Categorization • Recognition • Filtering • Game playing • Autonomous performance
Typical ML Tasks • Recognition (faces): Vincent Van Gogh, Michael Stipe, Muhammad Ali, Ken Williams, Burl Ives, Winston Churchill, Grover Cleveland
Typical ML Tasks • Recognition (speech): “Little red corvette”, “The kids are all right”, “The rain in Spain”, “Bort bort bort”
Typical ML Buzzwords • Data Mining • Knowledge Management (KM) • Information Retrieval (IR) • Expert Systems • Topic detection and tracking
Who does ML? • Two main groups: research and industry • These groups do listen to each other, at least somewhat • Not many reusable ML/KM components exist, outside of a few commercial systems • KM is seen as a key component of big business strategy - lots of KM consultants • ML is an extremely active research area with relatively low “cost of entry”
When is ML useful? • When you have lots of data • When you can’t hire enough people, or when people are too slow • When you can afford to be wrong sometimes • When you need to find patterns • When you have nothing to lose
An aside on your presenter • Academic background in math & music (not computer science or even statistics) • Several years as a Perl consultant • Two years as a math teacher • Currently studying document categorization at The University of Sydney • In other words, a typical ML student
Why use Perl for ML? • CPAN - the viral solution™ • Perl has rapid reusability • Perl is widely deployed • Perl code can be written quickly • Embeds both ways • Human-oriented development • Leaves your options open
But what about all the data? • ML techniques tend to use lots of data in complicated ways • Perl is great at data in general, but tends to gobble memory or forgo strict checking • Two fine solutions exist: • Be as careful in Perl as you are in C (Params::Validate, Tie::SecureHash, etc.) - a sketch follows • Use PDL or Inline (more on these later)
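For illustration, here is a minimal sketch of the first approach, using Params::Validate for C-style argument checking. The train() routine and its parameters are invented for this example:

use strict;
use Params::Validate qw(:all);

# A hypothetical training routine with strictly-checked arguments
sub train {
  my %args = validate(@_, {
    examples  => { type => ARRAYREF },               # required
    max_depth => { type => SCALAR, default => 10 },  # optional
  });
  # ... train a model on @{ $args{examples} } ...
  return scalar @{ $args{examples} };
}

# Dies with a useful message if 'examples' is missing or mistyped
train(examples => [ [1, 'yes'], [2, 'no'] ]);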
Interfaces vs. Implementations • In ML applications, we need both data integrity and the ability to “play with it” • Perl wrappers around C/C++ structures/objects are a nice balance • Keeps high-level interfaces in Perl, low-level implementations in C/C++ • Can be prototyped in pure Perl, with C/C++ parts added later
Some ML Theory and Terminology • ML concerns learning a target function from a set of examples • The target function is often called a hypothesis • Example: with a neural network, a trained network is a hypothesis • The set of all possible target functions is called the hypothesis space • The training process can be considered a search through the hypothesis space (a toy sketch follows)
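As a toy illustration of that search, the sketch below enumerates a made-up hypothesis space of one-dimensional threshold rules and keeps the hypothesis with the fewest training errors; data and hypothesis family are invented:

use strict;

# Toy training set: [feature value, label]
my @examples = ([1, 0], [2, 0], [3, 1], [4, 1], [5, 1]);

# Hypothesis space: "predict 1 iff x >= t" for integer thresholds t
my ($best_t, $best_errors) = (undef, ~0);
for my $t (0 .. 6) {
  my $errors = 0;
  for my $ex (@examples) {
    my ($x, $label) = @$ex;
    my $prediction = ($x >= $t) ? 1 : 0;
    $errors++ if $prediction != $label;
  }
  ($best_t, $best_errors) = ($t, $errors) if $errors < $best_errors;
}
print "Best hypothesis: x >= $best_t ($best_errors training errors)\n";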
Some ML Theory and Terminology • Each ML technique will • probably exclude some hypotheses • prefer some hypotheses over others • A technique’s exclusion & preference rules are called its inductive bias • If it ain’t biased, it ain’t learnin’ • No bias = rote learning • Bias = generalization • Example: kids learning multiplication (understanding vs. memorization)
Some ML Theory and Terminology • Ideally, an ML technique will • not exclude the “right” hypothesis, i.e. the hypothesis space will include the target hypothesis • prefer the target hypothesis over others • Measuring the degree to which these criteria are satisfied is important and sometimes complicated
Evaluating Hypotheses • We often want to know how good a hypothesis is: 1) to know how it performs in the real world; 2) to improve the learning technique or tune parameters; 3) so a learner can automatically improve the hypothesis • Usually evaluate on test data • Test data must be kept separate from training data • Test data used for purpose 3) is usually called validation or held-out data • Training, validation, and test data should not contaminate each other (a splitting sketch follows)
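A minimal sketch of keeping the three sets separate (the 70/15/15 split and the example data are arbitrary choices for illustration):

use strict;
use List::Util qw(shuffle);

my @examples = map { "example_$_" } 1 .. 100;
my @shuffled = shuffle @examples;   # randomize before splitting

# 70% training, 15% validation (held-out), 15% test
my @train      = @shuffled[0 .. 69];
my @validation = @shuffled[70 .. 84];
my @test       = @shuffled[85 .. 99];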
Evaluating Hypotheses • Some standard statistical measures are useful • Error rate, accuracy, precision, recall, F1 • Calculated using contingency tables
Evaluating Hypotheses • All figures come from a 2×2 contingency table of assigned vs. true category membership:

                Truly YES   Truly NO
Assigned YES        a           b
Assigned NO         c           d

• Error = (b+c)/(a+b+c+d) • Accuracy = (a+d)/(a+b+c+d) • Precision = p = a/(a+b) • Recall = r = a/(a+c) • F1 = 2pr/(p+r) • Precision is easy to maximize by assigning nothing • Recall is easy to maximize by assigning everything • F1 combines precision and recall equally
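In plain Perl, the same measures fall straight out of the four cell counts (the counts below are invented for illustration):

use strict;

# Hypothetical contingency counts:
# a = assigned & correct, b = assigned & incorrect,
# c = missed, d = correctly left unassigned
my ($a, $b, $c, $d) = (40, 10, 15, 935);

my $error     = ($b + $c) / ($a + $b + $c + $d);
my $accuracy  = ($a + $d) / ($a + $b + $c + $d);
my $precision = $a / ($a + $b);
my $recall    = $a / ($a + $c);
my $f1        = 2 * $precision * $recall / ($precision + $recall);

printf "P=%.3f R=%.3f F1=%.3f\n", $precision, $recall, $f1;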
Evaluating Hypotheses • Example (from categorization): Precision = 0.851, Recall = 0.711, F1 = 0.775 • Note that precision is higher than recall - indicates a cautious categorizer • These scores depend on the task - can’t compare scores across tasks • Often useful to compare categories separately, then average (macro-averaging)
Evaluating Hypotheses • The Statistics::Contingency module (on CPAN) helps calculate these figures:

use Statistics::Contingency;
my $s = new Statistics::Contingency;
while (...) {
  # ... do some categorization ...
  $s->add_result($assigned, $correct);
}
print "Micro F1: ", $s->micro_F1, "\n";
print $s->stats_table;

Output:

Micro F1: 0.774803607797498
+-------------------------------------------------+
|   miR   miP  miF1   maR   maP  maF1   Err       |
| 0.243 0.843 0.275 0.711 0.851 0.775 0.006       |
+-------------------------------------------------+
Useful Perl Data-Munging Tools • Storable - cheap persistence and cloning • PDL - helps performance and design • Inline::C - tight loops and interfaces
Storable • One of many persistence classes for Perl data (Data::Dumper, YAML, Data::Denter) • Allows saving structures to disk:

use Storable qw(store retrieve dclone);
store($x, $filename);
$x = retrieve($filename);

• Allows cloning of structures:

$y = dclone($x);

• Not terribly interesting, but handy
PDL • Perl Data Language • On CPAN, of course (PDL-2.3.4.tar.gz) • Turns Perl into a data-processing language similar to Matlab • Native C/Fortran numerical handling • Compact multi-dimensional arrays • Still Perl at highest level
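A small taste of PDL’s vectorized style (output shown in comments; exact floating-point formatting may vary):

use PDL;

my $x = sequence(5);   # [0 1 2 3 4]
my $y = $x * $x + 1;   # elementwise math runs in C: [1 2 5 10 17]
print $y, "\n";
print $y->sum, "\n";   # 35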
PDL demo • PDL experimentation shell:

ken% perldl
perldl> demo pdl
Extending PDL • PDL has an extension language, PDL::PP • Lets you write C extensions to PDL • Handles many gory details (data types, loop indexes, “threading”)
Extending PDL • Example: $sum = $pdl->sum_elements;

# Usage:
$pdl = PDL->random(7);
print "PDL: $pdl\n";
$sum = $pdl->sum_elements;
print "Sum: $sum\n";

# Output:
PDL: [0.513 0.175 0.308 0.534 0.947 0.171 0.702]
Sum: [3.35]
Extending PDL

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
double tmp;
tmp = 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);
Extending PDL • Type-generic version - $GENERIC() expands to the current data type:

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
$GENERIC() tmp;
tmp = ($GENERIC()) 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);
Inline::C • Allows very easy embedding of C code in Perl modules • Also Inline::Java, Inline::Python, Inline::CPP, Inline::ASM, Inline::Tcl • Considered much easier than XS or SWIG • Developers are very enthusiastic and helpful
Inline::C basic syntax • A complete Perl script using Inline (taken from the Inline docs) • Calling greet() before the use statement works because use Inline runs at compile time:

#!/usr/bin/perl
greet();

use Inline C => q{
  void greet() {
    printf("Hello, world\n");
  }
};
Inline::C for writing functions • Find the next prime number greater than $x:

#!/usr/bin/perl
foreach (-2.7, 29, 30.33, 100_000) {
  print "$_: ", next_prime($_), "\n";
}
. . .
Inline::C for writing functions

use Inline C => q{
  #include <math.h>
  #include <stdlib.h>

  int next_prime(double in) {
    // Implements a Sieve of Eratosthenes
    int *is_prime;
    int i, j;
    int candidate = floor(in) + 1; // smallest integer strictly greater than in
    if (in < 2.0) return 2;
    is_prime = malloc(2 * candidate * sizeof(int));
    for (i = 0; i < 2*candidate; i++)
      is_prime[i] = 1;
. . .
Inline::C for writing functions

    for (i = 2; i < 2*candidate; i++) {
      if (!is_prime[i]) continue;
      if (i >= candidate) {
        free(is_prime);
        return i;
      }
      for (j = i; j < 2*candidate; j += i)
        is_prime[j] = 0;
    }
    return 0; // Should never get here
  }
};
Inline::C for wrapping libraries • We’ll create a wrapper for ‘libbow’, an IR package • Contains an implementation of the Porter word-stemming algorithm (e.g., the stem of 'trying' is 'try')

# A Perlish interface:
$stem = stem_porter($word);

# A C-like interface:
stem_porter_inplace($word);
Inline::C for wrapping libraries

package Bow::Inline;
use strict;
use Exporter;
use vars qw($VERSION @ISA @EXPORT_OK);
BEGIN { $VERSION = '0.01'; }
@ISA = qw(Exporter);
@EXPORT_OK = qw(stem_porter stem_porter_inplace);
. . .
Inline::C for wrapping libraries

use Inline (C => 'DATA',
            VERSION => $VERSION,
            NAME => __PACKAGE__,
            LIBS => '-L/tmp/bow/lib -lbow',
            INC => '-I/tmp/bow/include',
            CCFLAGS => '-no-cpp-precomp',
           );
1;

__DATA__
__C__
. . .
Inline::C for wrapping libraries

// libbow includes bow_stem_porter()
#include "bow/libbow.h"

// The bare-bones C interface exposed
int stem_porter_inplace(SV* word) {
  int retval;
  char* ptr = SvPV_nolen(word);
  retval = bow_stem_porter(ptr);
  SvCUR_set(word, strlen(ptr));
  return retval;
}
. . .
Inline::C for wrapping libraries

// A Perlish interface. Returning undef on failure means the
// return type must be SV*, not char*:
SV* stem_porter (char* word) {
  if (!bow_stem_porter(word))
    return &PL_sv_undef;
  return newSVpv(word, 0);
}

// Don't know what the hell these are for in libbow,
// but it needs them.
const char *argp_program_version = "foo 1.0";
const char *program_invocation_short_name = "foofy";
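Assuming the module above compiles against an installed libbow, usage would look something like this (output comments are expectations, not verified results):

use Bow::Inline qw(stem_porter stem_porter_inplace);

# Perlish interface: returns the stem (or undef on failure)
print stem_porter("trying"), "\n";   # prints "try"

# C-like interface: modifies the argument in place
my $word = "stemming";
stem_porter_inplace($word);
print "$word\n";                     # prints "stem"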
When to use speed tools • A word of caution - don’t use C or PDL before you need to • Plain Perl is great for most tasks and usually pretty fast • Remember - external libraries (like libbow, pari-gp) both solve problems and create headaches
Decision Trees • Conceptually simple • Fast evaluation • Scrutable structures • Can be learned from training data • Can be difficult to build • Can “overfit” training data • Usually prefer simpler, i.e. smaller trees
Decision Trees • Sample training data:
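A hedged sketch of how such training data could be fed to a learner, using the CPAN module AI::DecisionTree - the weather attributes here are invented stand-ins for the slide’s sample table:

use AI::DecisionTree;

my $dtree = new AI::DecisionTree;

# Hypothetical training instances: attribute => value pairs plus an outcome
$dtree->add_instance(attributes => { outlook => 'sunny',    wind => 'strong' },
                     result     => 'no');
$dtree->add_instance(attributes => { outlook => 'sunny',    wind => 'weak' },
                     result     => 'no');
$dtree->add_instance(attributes => { outlook => 'overcast', wind => 'weak' },
                     result     => 'yes');
$dtree->add_instance(attributes => { outlook => 'rain',     wind => 'weak' },
                     result     => 'yes');

$dtree->train;

# Evaluate the learned tree on a new instance
print $dtree->get_result(attributes => { outlook => 'overcast', wind => 'strong' }), "\n";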