
Learning From String Sequences



1. Learning From String Sequences
David Lindsay
Supervisors: Zhiyuan Luo, Alex Gammerman and Volodya Vovk

2. Overview
• Background:
  • Pattern Recognition of String Data
  • Traditional Approaches: String-to-Word-Vector (SWV) & Implementation Issues
  • Learning with K-Nearest Neighbours (K-NN)
• The Universal Similarity Metric (USM) as an alternative to SWV
• Kolmogorov Complexity Estimation using Compression Algorithms
• Experiments using the USM-K-NN Learner

3. Lots of string data!!!
• Examples of data:
  • Text data – email messages, news articles, web pages
  • Biological data – DNA, proteins
• Main problems presented:
  • Highly symbolic representation
  • Complex underlying syntax
  • Variable length (could be 100 "characters" long, or 100,000!)

4. Pattern recognition using strings — for example, a Spam Email Dataset. Each object is an email message; each label is the type of message (SPAM, LEGITIMATE, or "?" for an unclassified test object). The training set to "learn" from:

[SPAM]
Date: Sun, 06 Jun 2004 11:33:34 +0000
From: Hattie Cherry <hattiecherryxe@emayl.de>
To: davidl@cs.rhul.ac.uk
Subject: chea-p sof:tware odjhogbdl
Looking for cheap high-quality software? We might have just what you need. Windows XP Professional 2002 ............. $50 Adobe Photoshop 7.0 ..................... $60 Microsoft Office XP Professional 2002 ... $60 Corel Draw Graphics Suite 11 ............ $60 and lots more...

[SPAM]
Date: Sat, 05 Jun 2004 02:38:24 -0600
From: Carroll Owen <vglfmnblxetqb@qa3zs3.com>
To: davidl@cs.rhul.ac.uk
Cc: dave@cs.rhul.ac.uk, damien@cs.rhul.ac.uk, daniel@cs.rhul.ac.uk, davidh@cs.rhul.ac.uk, dieter@cs.rhul.ac.uk, cjm@cs.rhul.ac.uk
Subject: SLEEP_AIDS PAIN_KILLERS ANXIETY_RELEIF and more c4qbw91807 IF: m5sjq22840
Get Va1ium, Vioxx, Ambien, Paxil, Nexium, Xanax, Phentermine, and other popular meds.. FREE overnight FedEx... Cheaper than your local pharmacy.. Our licensed doctors fill out prescriptions online.. http://www.google.com.mens5ra.com/tp/default.asp?id=cal

[LEGITIMATE]
Date: Tue, 8 Jun 2004 09:51:15 +0100 (BST)
From: Steve Schneider <steve@cs.rhul.ac.uk>
To: CompSci research postgrads <research-pgs@cs.rhul.ac.uk>
Cc: CompSci academic staff <academic@cs.rhul.ac.uk>
Subject: postgrad colloquium
Dear all, Just a reminder to those who have not already provided your titles and abstracts for your postgraduate colloquium talks, that these are due by the end of today. Steve
Professor Steve Schneider, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK. Tel: +44 1784 443431, Fax: +44 1784 439786, S.Schneider@cs.rhul.ac.uk

[? — test object: what is the true label?]
Date: Wed, 2 Jun 2004 15:47:12 +0100 (BST)
From: Alex Gammerman <alex@cs.rhul.ac.uk>
To: CompSci staff <staff@cs.rhul.ac.uk>
Cc: postgraduates@cs.rhul.ac.uk
Subject: Congratulations to David Surkov
Many congratulations to David who successfully defended his PhD thesis yesterday. Well done David! Alex

Goal of pattern recognition = find the "best" label for each new test object.

5. Traditional Approach: String-to-Word-Vector (SWV). (1) Break the string down into a fixed number of words; (2) use the word frequencies as features of the string. For example, the "postgrad colloquium" email above maps to the features:

Word        Freq
dear        1
research    3
microsoft   0
staff       2
postgrad    4
computer    1
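As a concrete illustration, a minimal sketch of the SWV mapping in Python (the vocabulary and tokenisation below are illustrative choices, not the exact setup used in the talk):

    from collections import Counter
    import re

    # Illustrative fixed vocabulary; a real system would choose this carefully.
    VOCABULARY = ["dear", "research", "microsoft", "staff", "postgrad", "computer"]

    def swv_features(text):
        """Map a string to a fixed-length vector of word frequencies."""
        words = re.findall(r"[a-z]+", text.lower())  # crude tokenisation
        counts = Counter(words)
        return [counts[w] for w in VOCABULARY]

    email = "Dear all, just a reminder about the postgrad colloquium..."
    print(swv_features(email))  # -> [1, 0, 0, 0, 1, 0]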

6. Implementation of SWV
• Pick "stop" words: words that occur too often to be useful for classification, e.g. and, it, are, of, then, etc.
• Lemmatise: group similar words, e.g. postgrad → postgraduate, compsci → computer science, etc.
• Choose which and how many words to use as features.
• Lots of domain knowledge must be incorporated!
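A sketch of these two preprocessing steps; the stop-word list and lemma map are hypothetical examples, not the ones used in the experiments:

    # Hypothetical stop-word list and lemma map for illustration.
    STOP_WORDS = {"and", "it", "are", "of", "then", "the", "a"}
    LEMMAS = {"postgrad": "postgraduate", "compsci": "computer science"}

    def preprocess(words):
        """Drop stop words, then map each remaining word to its lemma."""
        return [LEMMAS.get(w, w) for w in words if w not in STOP_WORDS]

    print(preprocess(["the", "postgrad", "staff", "and", "compsci"]))
    # -> ['postgraduate', 'staff', 'computer science']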

7. K-Nearest Neighbours (K-NN)
[Figure: scatter plot of freq. of "banana" (x-axis, 1–10) against freq. of "green" (y-axis, 1–10), with training emails marked "normal" or "spam" and an unclassified test point marked "?"]
• Find the K closest training examples.
• Choose the majority class label observed.
• Easy estimation of probabilities using label frequencies.
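A minimal K-NN sketch over SWV-style feature vectors; the Euclidean distance and the toy data are assumptions for illustration:

    import math
    from collections import Counter

    def knn_predict(train, test_x, k=3):
        """train: list of (vector, label) pairs. Returns (majority label, probabilities)."""
        neighbours = sorted(train, key=lambda vl: math.dist(vl[0], test_x))[:k]
        votes = Counter(label for _, label in neighbours)
        probs = {label: n / k for label, n in votes.items()}  # label frequencies
        return votes.most_common(1)[0][0], probs

    # Toy data: (freq of "banana", freq of "green") per email.
    train = [((1, 9), "spam"), ((2, 8), "spam"), ((8, 2), "normal"), ((9, 1), "normal")]
    print(knn_predict(train, (7, 3)))  # -> ('normal', {'normal': 0.66..., 'spam': 0.33...})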

8. Universal Similarity Metric (USM), Li et al. (2003)
• Based on the non-computable notion of Kolmogorov complexity.
• Proven universality: it recognises all effective similarities between strings.
• Essentially a normalised information distance, so it copes with variable-length strings.
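For reference, the metric as defined by Li et al., together with the compression-based approximation commonly used in practice (K is Kolmogorov complexity, x* a shortest program for x, and C(s) the compressed size of s):

    d(x,y) = \frac{\max\{K(x \mid y^{*}),\, K(y \mid x^{*})\}}{\max\{K(x),\, K(y)\}}
    \qquad\text{approximated by}\qquad
    NCD(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}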

9. Kolmogorov Complexity Estimations
[Figure: a TM program fed to a UTM, which writes the string to its output tape; a lossless compression algorithm plays the role of the shortest program.]
• String x = "dear john, how are you doing…"
• Definition: K(x) = the length of the shortest UTM program that writes string x to the output tape.
• Approximation: K(x) ≈ the size of the string compressed by a lossless compression algorithm, e.g. x* = "$%?;@><£"**"?".
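A sketch of this approximation in Python using bz2 (the choice of compressor is illustrative; any lossless compressor can play this role):

    import bz2

    def K(x: bytes) -> int:
        """Estimate Kolmogorov complexity K(x) as the compressed size of x."""
        return len(bz2.compress(x))

    x = b"dear john, how are you doing... " * 10
    print(len(x), K(x))  # a repetitive string compresses well, so the K estimate is small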

10. Experiments
• Experiments used well-tested real-life data: the Spam Email dataset and the Protein Localisation dataset.
• Implemented a K-NN learner with the USM distance and tested it on the data.
• Compared with other methods that used the SWV approach (and variants).
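Combining the two ingredients, a toy version of a USM-K-NN learner (a sketch only: the real experiments used the actual datasets, and the choice of compressor matters):

    import bz2
    from collections import Counter

    def ncd(x: bytes, y: bytes) -> float:
        """Compression-based approximation of the USM distance."""
        c = lambda s: len(bz2.compress(s))
        return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

    def usm_knn(train, test_x: bytes, k: int = 3) -> str:
        """train: list of (string, label) pairs. Classify test_x by majority vote."""
        neighbours = sorted(train, key=lambda sl: ncd(sl[0], test_x))[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]

    # Usage, with made-up training strings:
    # usm_knn([(b"cheap meds free fedex", "SPAM"), (b"colloquium talk titles", "LEGITIMATE")], b"new email text")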

11. Spam Results

Algorithm               Recall (%)   Precision (%)
USM-1-NN                92.5         99.1
USM-10-NN               95.01        98.7
USM-20-NN               95.43        99.14
USM-30-NN               94.8         98.49
Naive Bayes             82.35        99.02
TiMBL K-NN (Trad 1-NN)  85.27        95.92
MS Outlook patterns     53.01        87.93

12. Protein Results

Algorithm      Cyto (%)  Extra (%)  Mito (%)  Nucl (%)  Overall (%)
USM-1-NN       76.6      73.8       50.5      84.8      76.5
USM-10-NN      83.9      79.1       53.6      76.4      75.9
USM-20-NN      83.6      79.1       53.6      76.5      75.8
USM-30-NN      85.9      88.0       53.9      70.7      74.0
Naive Bayes    76.0      73.8       49.8      82.7      75.3
Kohonen SOM    72.1      71.4       43.3      77.4      70.6
Neural Net     55.0      75.0       61.0      72.0      66.0
Markov Model   78.1      62.2       69.2      74.1      73.0
SVM            76.9      80.0       56.7      87.4      79.4

13. Reliable probability forecasts
[Figure: empirical reliability curves on the protein data.]

Algorithm    Error (%)  Square Loss  Log Loss
Naïve Bayes  24.7       0.375        2.686
USM-30-NN    26.0       0.323        0.972
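The slide does not spell out the loss functions; for reference, the standard definitions (the base of the logarithm on the slide is an assumption here), where \hat{p}(y' \mid x) is the forecast probability for label y' and y is the true label:

    \text{square loss} = \sum_{y'} \bigl(\hat{p}(y' \mid x) - \mathbf{1}[y' = y]\bigr)^{2},
    \qquad
    \text{log loss} = -\log_{2} \hat{p}(y \mid x)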

14. Summary
• The USM distance is natural and successful for use in K-NN learners.
• USM K-NN learners:
  • give competitive classification accuracy
  • provide reliable probability forecasts
  • need less pre-processing of data
• Provides a new focus when designing learners → find a compression algorithm for the data.
• However, the USM approach is very slow and memory intensive!

15. Current and future work
• Parallels with cognitive science → cognition = compression.
• Try lossy compression, and alternative compression algorithms.
• Try multimedia data:
  • MP3 for music
  • DivX for video
  • JPEG for images

  16. Questions ?????
