
Extracting Key-Substring-Group Features for Text Classification


Presentation Transcript


  1. Extracting Key-Substring-Group Features for Text Classification • KDD 2006 • Dell Zhang and Wee Sun Lee

  2. The Context • Text Classification via Machine Learning (ML) [diagram: labeled (L) Training Documents → Learning → Classifier; Classifier → Predicting → unlabeled (U) Test Documents]

  3. Text Data • “To be, or not to be …” [diagram: the raw string to_be_or_not_to_be… broken into substrings such as to, be, or, not]

  4. Some Applications • Non-Topical Text Classification • Text Genre Classification • Paper? Poem? Prose? • Text Authorship Classification • Washington? Adams? Jefferson? How to exploit sub-word/super-word information?

  5. Some Applications • Asian-Language Text Classification How to avoid the problem of word-segmentation?

  6. Some Applications • Spam Filtering (Pampapathi et al., 2006) How to handle non-alphabetical characters etc.?

  7. Some Applications • Desktop Text Classification How to deal with different types of files?

  8. Learning Algorithms • Generative • Naïve Bayes, Rocchio, … • Discriminative • Support Vector Machine (SVM) , AdaBoost, … For word-based text classification, discriminative methods are often superior to generative methods. How about string-based text classification?

  9. String-Based Text Classification • Generative • Markov Chain Models (char-level) • fixed order: n-gram, … • variable order: PST, PPM, … • Discriminative • SVM with string kernel (= taking all substrings as features implicitly through the “kernel trick”) • limitations: (1) the ridge problem; (2) feature redundancy; (3) difficulty of applying feature selection/weighting and advanced kernels.

  10. The Problem [2×2 grid: {generative, discriminative} × {word-based, string-based}; the discriminative string-based cell is marked “?”]

  11. The Difficulty • The number of substrings: O(n²) • d1: to_be • d2: not_to_be • 5 + 9 = 14 characters → 15 + 45 = 60 substrings
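The count on this slide is easy to verify: a string of length n has n(n+1)/2 substrings, one per (start, end) position pair. A quick Python check on the slide's toy strings:

```python
def num_substrings(s):
    # one substring per (start, end) position pair: n * (n + 1) / 2
    n = len(s)
    return n * (n + 1) // 2

d1, d2 = "to_be", "not_to_be"
print(num_substrings(d1), num_substrings(d2))  # 15 45
print(num_substrings(d1) + num_substrings(d2))  # 60
```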

  12. Our Idea • The substrings can be partitioned into statistical-equivalence groups • d1: to_be • d2: not_to_be • e.g. {ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be} and {to, to_, to_b, to_be} • ……
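The grouping can be checked by brute force: here two substrings are treated as statistically equivalent when they occur at exactly the same (document, position) pairs across the corpus — an illustrative stand-in, in plain Python, for the suffix-tree view on the next slide:

```python
docs = {"d1": "to_be", "d2": "not_to_be"}

def occurrences(sub):
    # every (document, start-position) at which sub occurs
    return frozenset((name, i)
                     for name, doc in docs.items()
                     for i in range(len(doc))
                     if doc.startswith(sub, i))

# members of one equivalence group share the same occurrence set
assert occurrences("ot") == occurrences("ot_to_be")
assert occurrences("to") == occurrences("to_be")
# "be" occurs at different positions, so it falls into a different group
assert occurrences("to") != occurrences("be")
```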

  13. Suffix Tree • a suffix tree node = a substring group [figure: generalized suffix tree of d1 = to_be and d2 = not_to_be; leaves are labeled with the documents (d1/d2) containing each suffix]

  14. Substring-Groups • The substrings in an equivalence group have identical distributions over the corpus, so each such substring-group can be taken as a whole as a single feature for a statistical machine learning algorithm to use in text classification.

  15. Substring-Groups • The number of substring-groups: O(n) • n trivial substring-groups • leaf nodes • frequency = 1 • of little use for learning • at most n−1 non-trivial substring-groups • internal (non-root) nodes • frequency > 1 • to be selected as features

  16. Key-Substring-Groups • Select the key (salient) substring-groups by: • -l the minimum frequency: freq(SG_v) ≥ l • -h the maximum frequency: freq(SG_v) ≤ h • -b the minimum number of branches: children_num(v) ≥ b • -p the maximum parent-child conditional probability: freq(SG_v) / freq(SG_p(v)) ≤ p • -q the maximum suffix-link conditional probability: freq(SG_v) / freq(SG_s(v)) ≤ q
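The five switches amount to one predicate per suffix-tree node. A hypothetical sketch (the field names freq, children, parent_freq, and suffix_freq are made up for illustration; the defaults echo the Reuters parameters reported later in the talk):

```python
def is_key(node, l=80, h=8000, b=8, p=0.8, q=0.8):
    """Keep a substring-group only if it passes all five thresholds."""
    return (l <= node["freq"] <= h                        # -l / -h
            and node["children"] >= b                     # -b
            and node["freq"] / node["parent_freq"] <= p   # -p
            and node["freq"] / node["suffix_freq"] <= q)  # -q

# a frequent, branchy node whose parent/suffix-link groups are much larger
print(is_key({"freq": 100, "children": 10,
              "parent_freq": 1000, "suffix_freq": 500}))  # True
# too rare: fails the -l threshold
print(is_key({"freq": 5, "children": 10,
              "parent_freq": 1000, "suffix_freq": 500}))  # False
```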

  17. Suffix Link • the suffix link of a node v representing “c1 c2 … ck” points to the node s(v) representing “c2 … ck” • v → s(v) • following suffix links repeatedly from any node eventually reaches the root

  18. Feature Extraction Algorithm • Input • a set of documents • the parameters • Output • the key-substring-groups for each document • Time Complexity: O(n) • Trick • make use of suffix links to traverse the tree

  19. Feature Extraction Algorithm

construct the (generalized) suffix tree T using Ukkonen’s algorithm;
count frequencies recursively;
select features recursively;
accumulate features recursively;
for each document d {
    match d to T and get to the node v;
    while v is not the root {
        output the features associated with v;
        move v to the next node via the suffix link of v;
    }
}
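The same pipeline can be mocked up by brute force in Python — enumerate equivalence groups by occurrence set, keep the non-trivial ones (frequency > 1), and read off each document's features. This sketch is O(n²) and is meant only to make the steps concrete; the algorithm above does the same work in O(n) with the suffix tree and suffix links:

```python
from collections import defaultdict

docs = {"d1": "to_be", "d2": "not_to_be"}

def occurrences(sub):
    # every (document, start-position) at which sub occurs
    return frozenset((name, i) for name, doc in docs.items()
                     for i in range(len(doc)) if doc.startswith(sub, i))

# partition all substrings into statistical-equivalence groups
groups = defaultdict(set)
for doc in docs.values():
    for i in range(len(doc)):
        for j in range(i + 1, len(doc) + 1):
            groups[occurrences(doc[i:j])].add(doc[i:j])

# non-trivial groups (frequency > 1) become features, named by shortest member
features = {min(group, key=len): occ
            for occ, group in groups.items() if len(occ) > 1}

def extract(name):
    # the set of features present in document `name`
    return sorted(f for f, occ in features.items()
                  if any(d == name for d, _ in occ))

print(extract("d1"))
```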

  20. Experiments • Parameter Tuning • the number of features • the cross-validation performance • Feature Weighting • TF×IDF (with l2 normalization) • Learning Algorithm • LibSVM • linear kernel
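The TF×IDF step can be sketched in a few lines of Python; the slide does not spell out which idf variant is used, so this assumes the plain tf · log(N/df) form with l2 normalization:

```python
import math

def tfidf_l2(doc_term_counts):
    """doc_term_counts: one {term: raw count} dict per document."""
    n = len(doc_term_counts)
    df = {}  # document frequency of each term
    for counts in doc_term_counts:
        for t in counts:
            df[t] = df.get(t, 0) + 1
    vectors = []
    for counts in doc_term_counts:
        w = {t: c * math.log(n / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0  # l2 norm
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors

# a term appearing in every document gets idf = 0; the rest are l2-normalized
vecs = tfidf_l2([{"acq": 2, "earn": 1}, {"earn": 3}])
```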

  21. English Text Topic Classification • Dataset • Reuters-21578 Top10 (ApteMod) • The home ground of word-based text classification • Classes • (1) earn; (2) acq; (3) money-fx; (4) grain; (5) crude; (6) trade; (7) interest; (8) ship; (9) wheat; (10) corn. • Parameters • -l 80 -h 8000 -b 8 -p 0.8 -q 0.8 • Features • 9×10¹³ → 6,055 (extracted in < 30 seconds)

  22. English Text Topic Classification The distribution of substring-groups ~ Zipf’s law (power law)

  23. English Text Topic Classification The performance of linear kernel SVM with key-substring-group features on the Reuters-21578 top10 dataset.

  24. English Text Topic Classification Comparing the experimental results of our proposed approach and some representative existing approaches.

  25. English Text Topic Classification The influence of feature extraction parameters on the number of features and the text classification performance.

  26. Chinese Text Topic Classification • Dataset • TREC-5 People’s Daily News • Classes • (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics. • Parameters • -l 20 -h 8000 -b 8 -p 0.8 -q 0.8

  27. Chinese Text Topic Classification • Performance (miF) • SVM + word segmentation: 82.0% • (He et al., 2000; He et al., 2003) • char-level n-gram language model: 86.7% • (Peng et al. 2004) • SVM with key-substring-group features: 87.3%

  28. Greek Text Authorship Classification • Dataset • (Stamatatos et al., 2000) • Classes • (1) S. Alaxiotis; (2) G. Babiniotis; (3) G. Dertilis; (4) C. Kiosse; (5) A. Liakos; (6) D. Maronitis; (7) M. Ploritis; (8) T. Tasios; (9) K. Tsoukalas; (10) G. Vokos.

  29. Greek Text Authorship Classification • Performance (accuracy) • deep natural language processing: 72% • (Stamatatos et al., 2000) • char-level n-gram language model: 90% • (Peng et al. 2004) • SVM with key-substring-group features: 92%

  30. Greek Text Genre Classification • Dataset • (Stamatatos et al., 2000) • Classes • (1) press editorial; (2) press reportage; (3) academic prose; (4) official documents; (5) literature; (6) recipes; (7) curriculum vitae; (8) interviews; (9) planned speeches; (10) broadcast news.

  31. Greek Text Genre Classification • Performance (accuracy) • deep natural language processing: 82% • (Stamatatos et al., 2000) • char-level n-gram language model: 86% • (Peng et al. 2004) • SVM with key-substring-group features: 94%

  32. Conclusion • We propose • the concept of key-substring-group features and • a linear-time (suffix-tree-based) algorithm to extract them • We show that • our method works well for some text classification tasks • Future work: clustering etc.? gene/protein sequence data?

  33. ?
