
Learning From String Sequences



1. Learning From String Sequences
David Lindsay
Supervisors: Zhiyuan Luo, Alex Gammerman and Volodya Vovk

2. Overview
• Background:
  • Pattern Recognition of String Data
  • Traditional Approaches: String-to-Word-Vector (SWV) & Implementation Issues
  • Learning with K-Nearest Neighbours (K-NN)
• The Universal Similarity Metric (USM) as an alternative to SWV
• Kolmogorov Complexity Estimation using Compression Algorithms
• Experiments using the USM-K-NN Learner

3. Lots of string data!!!
• Examples of data:
  • Text data – email messages, news articles, web pages
  • Biological data – DNA, proteins
• Main problems presented:
  • Highly symbolic representation
  • Complex underlying syntax
  • Variable length (could be 100 "characters" long, or 100,000!)

4. Pattern recognition using strings — for example, a Spam Email Dataset. Each object is an email message; each label is the type of message (SPAM, LEGITIMATE, or "?" for an unclassified test object). The training set to "learn" from:

[SPAM]
Date: Sun, 06 Jun 2004 11:33:34 +0000
From: Hattie Cherry <hattiecherryxe@emayl.de>
To: davidl@cs.rhul.ac.uk
Subject: chea-p sof:tware odjhogbdl
Looking for cheap high-quality software? We might have just what you need. Windows XP Professional 2002 ............. $50 Adobe Photoshop 7.0 ..................... $60 Microsoft Office XP Professional 2002 ... $60 Corel Draw Graphics Suite 11 ............ $60 and lots more...

[SPAM]
Date: Sat, 05 Jun 2004 02:38:24 -0600
From: Carroll Owen <vglfmnblxetqb@qa3zs3.com>
To: davidl@cs.rhul.ac.uk
Cc: dave@cs.rhul.ac.uk, damien@cs.rhul.ac.uk, daniel@cs.rhul.ac.uk, davidh@cs.rhul.ac.uk, dieter@cs.rhul.ac.uk, cjm@cs.rhul.ac.uk
Subject: SLEEP_AIDS PAIN_KILLERS ANXIETY_RELEIF and more c4qbw91807 IF: m5sjq22840
Get Va1ium, Vioxx, Ambien, Paxil, Nexium, Xanax, Phentermine, and other popular meds.. FREE overnight FedEx... Cheaper than your local pharmacy.. Our licensed doctors fill out prescriptions online.. http://www.google.com.mens5ra.com/tp/default.asp?id=cal

[LEGITIMATE]
Date: Tue, 8 Jun 2004 09:51:15 +0100 (BST)
From: Steve Schneider <steve@cs.rhul.ac.uk>
To: CompSci research postgrads <research-pgs@cs.rhul.ac.uk>
Cc: CompSci academic staff <academic@cs.rhul.ac.uk>
Subject: postgrad colloquium
Dear all, Just a reminder to those who have not already provided your titles and abstracts for your postgraduate colloquium talks, that these are due by the end of today. Steve
Professor Steve Schneider, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK. Tel: +44 1784 443431, Fax: +44 1784 439786, S.Schneider@cs.rhul.ac.uk

[? — test object: what is the true label?]
Date: Wed, 2 Jun 2004 15:47:12 +0100 (BST)
From: Alex Gammerman <alex@cs.rhul.ac.uk>
To: CompSci staff <staff@cs.rhul.ac.uk>
Cc: postgraduates@cs.rhul.ac.uk
Subject: Congratulations to David Surkov
Many congratulations to David who successfully defended his PhD thesis yesterday. Well done David! Alex

Goal of pattern recognition = find the "best" label for each new test object.

5. Traditional Approach: String-to-Word-Vector (SWV). (1) Break the string down into a fixed number of words; (2) use the word frequencies as features of the string. For example, the "postgrad colloquium" email above maps to the features:

Word        Freq
dear        1
research    3
microsoft   0
staff       2
postgrad    4
computer    1
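As a concrete illustration, a minimal sketch of the SWV mapping in Python (the vocabulary and tokenisation below are illustrative choices, not the exact setup used in the talk):

    from collections import Counter
    import re

    # Illustrative fixed vocabulary; a real system would choose this carefully.
    VOCABULARY = ["dear", "research", "microsoft", "staff", "postgrad", "computer"]

    def swv_features(text):
        """Map a string to a fixed-length vector of word frequencies."""
        words = re.findall(r"[a-z]+", text.lower())  # crude tokenisation
        counts = Counter(words)
        return [counts[w] for w in VOCABULARY]

    email = "Dear all, just a reminder about the postgrad colloquium..."
    print(swv_features(email))  # -> [1, 0, 0, 0, 1, 0]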

6. Implementation of SWV
• Pick "stop" words: words that occur too often to be useful for classification, e.g. and, it, are, of, then, etc.
• Lemmatise: group similar words, e.g. postgrad → postgraduate, compsci → computer science, etc.
• Choose which and how many words to use as features.
• Lots of domain knowledge must be incorporated!
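A sketch of these two preprocessing steps; the stop-word list and lemma map are hypothetical examples, not the ones used in the experiments:

    # Hypothetical stop-word list and lemma map for illustration.
    STOP_WORDS = {"and", "it", "are", "of", "then", "the", "a"}
    LEMMAS = {"postgrad": "postgraduate", "compsci": "computer science"}

    def preprocess(words):
        """Drop stop words, then map each remaining word to its lemma."""
        return [LEMMAS.get(w, w) for w in words if w not in STOP_WORDS]

    print(preprocess(["the", "postgrad", "staff", "and", "compsci"]))
    # -> ['postgraduate', 'staff', 'computer science']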

7. K-Nearest Neighbours (K-NN)
[Figure: scatter plot of freq. of "banana" (x-axis, 1–10) against freq. of "green" (y-axis, 1–10), with training emails marked "normal" or "spam" and an unclassified test point marked "?"]
• Find the K closest training examples.
• Choose the majority class label observed.
• Easy estimation of probabilities using label frequencies.
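A minimal K-NN sketch over SWV-style feature vectors; the Euclidean distance and the toy data are assumptions for illustration:

    import math
    from collections import Counter

    def knn_predict(train, test_x, k=3):
        """train: list of (vector, label) pairs. Returns (majority label, probabilities)."""
        neighbours = sorted(train, key=lambda vl: math.dist(vl[0], test_x))[:k]
        votes = Counter(label for _, label in neighbours)
        probs = {label: n / k for label, n in votes.items()}  # label frequencies
        return votes.most_common(1)[0][0], probs

    # Toy data: (freq of "banana", freq of "green") per email.
    train = [((1, 9), "spam"), ((2, 8), "spam"), ((8, 2), "normal"), ((9, 1), "normal")]
    print(knn_predict(train, (7, 3)))  # -> ('normal', {'normal': 0.66..., 'spam': 0.33...})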

8. Universal Similarity Metric (USM), Li et al. (2003)
• Based on the non-computable notion of Kolmogorov complexity.
• Proven universality: it recognises all effective similarities between strings.
• Essentially a normalised information distance, so it copes with variable-length strings.
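For reference, the metric as defined by Li et al., together with the compression-based approximation commonly used in practice (K is Kolmogorov complexity, x* a shortest program for x, and C(s) the compressed size of s):

    d(x,y) = \frac{\max\{K(x \mid y^{*}),\, K(y \mid x^{*})\}}{\max\{K(x),\, K(y)\}}
    \qquad\text{approximated by}\qquad
    NCD(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}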

9. Kolmogorov Complexity Estimations
[Figure: a TM program fed to a UTM, which writes the string to its output tape; a lossless compression algorithm plays the role of the shortest program.]
• String x = "dear john, how are you doing…"
• Definition: K(x) = the length of the shortest UTM program that writes string x to the output tape.
• Approximation: K(x) ≈ the size of the string compressed by a lossless compression algorithm, e.g. x* = "$%?;@><£"**"?".
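A sketch of this approximation in Python using bz2 (the choice of compressor is illustrative; any lossless compressor can play this role):

    import bz2

    def K(x: bytes) -> int:
        """Estimate Kolmogorov complexity K(x) as the compressed size of x."""
        return len(bz2.compress(x))

    x = b"dear john, how are you doing... " * 10
    print(len(x), K(x))  # a repetitive string compresses well, so the K estimate is small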

10. Experiments
• Experiments used well-tested real-life data: the Spam Email dataset and the Protein Localisation dataset.
• Implemented a K-NN learner with the USM distance and tested it on the data.
• Compared with other methods that used the SWV approach (and variants).
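Combining the two ingredients, a toy version of a USM-K-NN learner (a sketch only: the real experiments used the actual datasets, and the choice of compressor matters):

    import bz2
    from collections import Counter

    def ncd(x: bytes, y: bytes) -> float:
        """Compression-based approximation of the USM distance."""
        c = lambda s: len(bz2.compress(s))
        return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

    def usm_knn(train, test_x: bytes, k: int = 3) -> str:
        """train: list of (string, label) pairs. Classify test_x by majority vote."""
        neighbours = sorted(train, key=lambda sl: ncd(sl[0], test_x))[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]

    # Usage, with made-up training strings:
    # usm_knn([(b"cheap meds free fedex", "SPAM"), (b"colloquium talk titles", "LEGITIMATE")], b"new email text")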

11. Spam Results

Algorithm               Recall (%)   Precision (%)
USM-1-NN                92.5         99.1
USM-10-NN               95.01        98.7
USM-20-NN               95.43        99.14
USM-30-NN               94.8         98.49
Naive Bayes             82.35        99.02
TiMBL K-NN (Trad 1-NN)  85.27        95.92
MS Outlook patterns     53.01        87.93

12. Protein Results

Algorithm      Cyto (%)  Extra (%)  Mito (%)  Nucl (%)  Overall (%)
USM-1-NN       76.6      73.8       50.5      84.8      76.5
USM-10-NN      83.9      79.1       53.6      76.4      75.9
USM-20-NN      83.6      79.1       53.6      76.5      75.8
USM-30-NN      85.9      88.0       53.9      70.7      74.0
Naive Bayes    76.0      73.8       49.8      82.7      75.3
Kohonen SOM    72.1      71.4       43.3      77.4      70.6
Neural Net     55.0      75.0       61.0      72.0      66.0
Markov Model   78.1      62.2       69.2      74.1      73.0
SVM            76.9      80.0       56.7      87.4      79.4

13. Reliable probability forecasts
[Figure: empirical reliability curves on the protein data.]

Algorithm    Error (%)  Square Loss  Log Loss
Naïve Bayes  24.7       0.375        2.686
USM-30-NN    26.0       0.323        0.972
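The slide does not spell out the loss functions; for reference, the standard definitions (the base of the logarithm on the slide is an assumption here), where \hat{p}(y' \mid x) is the forecast probability for label y' and y is the true label:

    \text{square loss} = \sum_{y'} \bigl(\hat{p}(y' \mid x) - \mathbf{1}[y' = y]\bigr)^{2},
    \qquad
    \text{log loss} = -\log_{2} \hat{p}(y \mid x)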

14. Summary
• The USM distance is natural and successful for use in K-NN learners.
• USM K-NN learners:
  • give competitive classification accuracy
  • provide reliable probability forecasts
  • need less pre-processing of data
• Provides a new focus when designing learners → find a compression algorithm for the data.
• However, the USM approach is very slow and memory intensive!

15. Current and future work
• Parallels with cognitive science → cognition = compression.
• Try lossy compression, and alternative compression algorithms.
• Try multimedia data:
  • MP3 for music
  • DivX for video
  • JPEG for images

  16. Questions ?????
