
Text Document Categorization by Term Association

Text Document Categorization by Term Association. Maria-Luiza Antonie and Osmar R. Zaiane, University of Alberta, Canada. 2002 IEEE International Conference on Data Mining (ICDM ’02). Presentation by Yu-Kai Lin.


Presentation Transcript


  1. Text Document Categorization by Term Association • Maria-Luiza Antonie and Osmar R. Zaiane, University of Alberta, Canada • 2002 IEEE International Conference on Data Mining (ICDM ’02) • Presentation by Yu-Kai Lin

  2. Outline • Introduction • Related work • Building an Associative Text Classifier • Experimental Results • Conclusion

  3. Introduction • Text categorization is a necessity because of the very large number of text documents that we have to deal with daily. • A text categorization system can be used to index documents in support of information retrieval tasks, as well as to classify e-mails, memos or web pages in a Yahoo-like manner.

  4. Introduction (cont.) • The data classification process: • (a) Learning: Training data are analyzed by a classification algorithm, and the learned model is expressed in the form of classification rules. (Figure 1) • (b) Classification: Test data are used to estimate the accuracy of the classification rules, which can then be applied to new data. (Figure 2)

  5. Figure 1 (learning) • Diagram labels: Training data; Classification algorithm; Classification rules • Example rule: IF age = “31…40” AND income = high THEN credit_rating = excellent

  6. Figure 2 (classification) • Diagram labels: Classification rules; Training data; New data • Example: new data (John, 31…40, high); credit rating? excellent
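The two figures can be tied together with a small sketch: the rule from Figure 1 is applied to the new record from Figure 2. The rule encoding (a dictionary of conditions plus a label) and the function name `classify` are illustrative assumptions, not something the slides specify.

```python
# Hypothetical rule encoding: (conditions, predicted label), mirroring Figure 1.
RULES = [
    ({"age": "31…40", "income": "high"}, "excellent"),
]

def classify(record, rules, default=None):
    """Return the label of the first rule whose conditions all hold for the record."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default

# The new record from Figure 2.
john = {"name": "John", "age": "31…40", "income": "high"}
print(classify(john, RULES))  # excellent
```

A record that matches no rule falls through to `default`, which is one simple way to handle uncovered cases.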

  7. Related Work • Text classifier • Association Rule Mining

  8. Related Work (cont.) • Text classifier • Naïve Bayesian classifier (chapter 7.4) • ID3 (decision tree, chapter 7.3) • C4.5 (chapter 7.6) • k-NN (chapter 7.7.1) • Neural Networks • Support Vector Machines (SVM)

  9. Related Work (cont.) • Association Rule Mining • Association Rules Generation • Associative classifiers

  10. Related Work (cont.) • Association Rules Generation • Rule form: “X => Y” • support s • confidence c • strong rules: rules whose support and confidence are both greater than given thresholds
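A minimal sketch of these measures, assuming the usual definitions: support is the fraction of transactions that contain an itemset, and the confidence of X => Y is support(X ∪ Y) / support(X). The function names and the toy transactions are illustrative, and the thresholds are treated as inclusive, which is also an assumption.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X ∪ Y) / support(X); 0.0 when X never occurs."""
    s_x = support(antecedent, transactions)
    if s_x == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / s_x

def is_strong(antecedent, consequent, transactions, min_support, min_confidence):
    """A rule is strong when both measures reach their thresholds (inclusive here)."""
    s = support(set(antecedent) | set(consequent), transactions)
    c = confidence(antecedent, consequent, transactions)
    return s >= min_support and c >= min_confidence

# Toy transactions, invented for illustration.
T = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(support({"bread", "milk"}, T))       # 0.5
print(is_strong({"bread"}, {"milk"}, T, 0.5, 0.6))  # True
```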

  11. Related Work (cont.) • Associative classifiers • The learning method is association rule mining • Discover strong patterns that are associated with the class labels • New objects are categorized by these patterns (the classifier)

  12. Building an Association Text Classifier • Diagram labels: Training Set; Testing Set; Preprocessing Phase; Association Rule Mining; Model Validation; Associative Classifier

  13. Building an Association Text Classifier (cont.) • Data collection Preprocessing • Association Rules Generation • Pruning the Set of Association Rules • Prediction of Classes Associated with New Documents

  14. Building an Association Text Classifier (cont.) • Data collection Preprocessing • Weed out uninteresting words • stopwording • stemming • Transform documents into transactions • categories set C = {c1, c2, … , cm} • term set T = {t1, t2, … , tn} • document Di = {cc1, cc2, … , ccm, tt1, tt2, … , ttn}
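The preprocessing step can be sketched as follows. The stopword list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), and the `cat:` / `term:` item prefixes are all assumptions made for illustration; the slides do not say which stopword list or stemmer was used.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}

def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def to_transaction(text, categories):
    """Return one transaction: category items plus stemmed, stopword-free term items."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = {crude_stem(tok) for tok in tokens if tok not in STOPWORDS}
    # Prefix items so that category names and terms cannot collide.
    return {f"cat:{c}" for c in categories} | {f"term:{t}" for t in terms}

doc = to_transaction("The prices of wheat are rising", ["grain"])
print(sorted(doc))  # ['cat:grain', 'term:pric', 'term:ris', 'term:wheat']
```

Each document becomes a set of items, so a standard association rule miner can treat the collection as a transaction database.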

  15. Building an Association Text Classifier (cont.) • Association Rules Generation • Apriori • Advantage • Performance studies show its efficiency and scalability • Drawback when applied to our transactions • Generates a large number of association rules • Most of them are irrelevant for classification
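For reference, a compact sketch of Apriori's frequent-itemset search, under the usual level-wise formulation: candidates of size k are built from frequent itemsets of size k-1, and any candidate with an infrequent subset is discarded before counting. This is a teaching version with no optimizations; the function name and toy data are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those that reach the threshold.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

freq = apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 0.5)
print(len(freq))  # 6
```

On document transactions the number of frequent itemsets grows quickly with vocabulary size, which illustrates the drawback noted on the slide.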

  16. ARC-BC • Association Rule-based Categorizer By Category algorithm • Apriori-based • Interested only in strong rules that indicate a category label (T => ci) • Prune the rules that are of no use for categorization
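A rough sketch of the by-category idea: each category's documents are mined separately, and only rules of the form T => ci are produced. Several details here are assumptions rather than the authors' procedure: term sets are enumerated by brute force up to `max_len` instead of with Apriori, support is measured inside the category, and confidence is taken as the share of all documents containing T that carry the category.

```python
from itertools import combinations

def rules_for_category(docs, category, min_support, max_len=2):
    """docs: list of (term_set, category_set).
    Returns [(terms, category, support, confidence)]."""
    in_cat = [terms for terms, cats in docs if category in cats]
    if not in_cat:
        return []
    vocab = sorted(set().union(*in_cat))
    rules = []
    for size in range(1, max_len + 1):
        for combo in combinations(vocab, size):
            termset = frozenset(combo)
            # Support is measured inside the category's own documents.
            supp = sum(termset <= t for t in in_cat) / len(in_cat)
            if supp == 0 or supp < min_support:
                continue
            # Confidence (assumption): share of all documents containing the
            # term set that carry this category.
            covered = [cats for terms, cats in docs if termset <= terms]
            conf = sum(category in cats for cats in covered) / len(covered)
            rules.append((termset, category, supp, conf))
    return rules

def build_classifier(docs, min_support, max_len=2):
    """Concatenate the per-category rule sets into one classifier."""
    categories = sorted({c for _, cats in docs for c in cats})
    return [r for c in categories
            for r in rules_for_category(docs, c, min_support, max_len)]

# Toy corpus, invented for illustration.
docs = [
    ({"wheat", "price"}, {"grain"}),
    ({"wheat", "export"}, {"grain"}),
    ({"oil", "price"}, {"crude"}),
]
print(rules_for_category(docs, "grain", 0.6))
```

Mining per category means a small category is not drowned out by a large one, since its support threshold is relative to its own documents.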

  17. ARC-BC Algorithm

  18. ARC-BC Algorithm

  19. ARC-BC • Diagram labels: category 1, association rules for category 1; category i, association rules for category i; category n, association rules for category n; classifier • The classifier puts the new documents in the correct class

  20. Examples of association rules composing the classifier

  21. Building an Association Text Classifier (cont.) • Pruning the Set of Association Rules • The number of rules generated in the association rule mining phase can be very large • Noisy information misleads the classification process • A large rule set makes classification time longer • Pruning method • Eliminate the specific rules and keep only those that are more general and have high confidence • Prune unnecessary rules by database coverage
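The two pruning ideas can be sketched as below. The slide does not give the exact procedure, so the details are assumptions: rules are ranked by confidence, then support, then antecedent length; a rule is dropped when a higher-ranked rule for the same category has a proper subset of its terms; and database coverage keeps a rule only if it correctly covers a training document that no earlier kept rule covered.

```python
def prune_rules(rules, docs):
    """rules: [(terms, category, support, confidence)];
    docs: [(term_set, category_set)]. Returns the kept rules in rank order."""
    # Rank by confidence, then support, then shorter (more general) antecedent.
    ranked = sorted(rules, key=lambda r: (-r[3], -r[2], len(r[0])))

    # Step 1: drop a rule when an already kept rule for the same category
    # is more general (its terms are a proper subset).
    general = []
    for terms, cat, supp, conf in ranked:
        if not any(c == cat and t < terms for t, c, _, _ in general):
            general.append((terms, cat, supp, conf))

    # Step 2: database coverage over the training documents.
    remaining = list(range(len(docs)))
    kept = []
    for rule in general:
        terms, cat = rule[0], rule[1]
        hits = [i for i in remaining if terms <= docs[i][0] and cat in docs[i][1]]
        if hits:
            kept.append(rule)
            remaining = [i for i in remaining if i not in hits]
    return kept

# Toy rules and training documents, invented for illustration.
train = [
    ({"wheat", "price"}, {"grain"}),
    ({"wheat", "export"}, {"grain"}),
    ({"oil", "price"}, {"crude"}),
]
candidate_rules = [
    (frozenset({"wheat"}), "grain", 1.0, 1.0),
    (frozenset({"wheat", "price"}), "grain", 0.5, 1.0),
    (frozenset({"oil"}), "crude", 1.0, 1.0),
    (frozenset({"oil", "price"}), "crude", 1.0, 1.0),
    (frozenset({"price"}), "crude", 1.0, 0.5),
]
pruned = prune_rules(candidate_rules, train)
print([sorted(r[0]) for r in pruned])  # [['wheat'], ['oil']]
```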

  22. Building an Association Text Classifier (cont.) • Pruning the Set of Association Rules definition

  23. Pruning the Set of Association Rules Algorithm

  24. Building an Association Text Classifier (cont.) • Prediction of Classes Associated with New Documents • Algorithm
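The transcript does not reproduce the prediction algorithm, so the following is only one plausible scheme, stated as an assumption: every rule whose terms all occur in the new document votes for its category with its confidence, and the categories whose total reaches a fraction `dominance` of the best total are returned. The `dominance` parameter and the rule tuple layout are hypothetical.

```python
from collections import defaultdict

def predict(doc_terms, rules, dominance=1.0):
    """rules: [(terms, category, support, confidence)].
    Return categories ordered by summed confidence of the matching rules."""
    scores = defaultdict(float)
    for terms, cat, _supp, conf in rules:
        if terms <= doc_terms:  # the rule fires when all its terms are present
            scores[cat] += conf
    if not scores:
        return []
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [cat for cat, score in ranked if score >= dominance * best]

# Toy rule set, invented for illustration.
toy_rules = [
    (frozenset({"wheat"}), "grain", 1.0, 1.0),
    (frozenset({"price"}), "crude", 1.0, 0.5),
    (frozenset({"oil"}), "crude", 1.0, 1.0),
]
print(predict({"wheat", "price"}, toy_rules))                 # ['grain']
print(predict({"wheat", "price"}, toy_rules, dominance=0.5))  # ['grain', 'crude']
```

Lowering `dominance` lets a document receive more than one category, which suits collections where documents carry several labels.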

  25. Experimental results • 9,603 training documents and 3,299 testing documents

  26. Conclusion • The effectiveness of the associative classifier is comparable to that of most well-known text classifiers • Training time is relatively fast • The generated rules are understandable and can easily be updated manually • When retraining with a new document, only the categories concerned are adjusted, and the rules can be updated incrementally
