
Presenter Tae Jin Kim Byungil Jeong



Presentation Transcript


  1. Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory • Presenters: Tae Jin Kim, Byungil Jeong

  2. Introduction • Text categorization • Goal: classify documents into a fixed number of predefined categories • Studied in numerous works • However, good test collections for text categorization are far less abundant • Focus: how to generate labeled datasets • Main contributions • Present a methodology for automatically acquiring labeled datasets for text categorization experiments • Establish a connection between similarity metrics for document sets and classification accuracy • Make a large collection of text categorization datasets publicly available • http://techtc.cs.technion.ac.il

  3. Outline of dataset generation • [Diagram] The user specifies a distance • The system computes the distance between pairs of categories in a hierarchical directory • It picks a pair of categories with the specified distance • Output: text files (labeled datasets)

  4. Dataset generation • Generated datasets • contain two categories and are single-labeled • that is, every document belongs to exactly one category • are based on the Open Directory Project (http://dmoz.org) • Two kinds of parameters • those characterizing the dataset (= pair of categories): the distance between the categories under a chosen metric • those characterizing the individual categories • [Assumption] Hierarchical directory • the directory is organized as a tree in which each node is labeled with a category • each category has • a collection of documents • a text description • each document may optionally have • a short annotation

  5. Metrics • Metrics quantify the conceptual distance between a pair of categories • The larger the distance, the easier it is to induce a classifier that separates the categories • A small distance means the pair is very difficult to classify • Metrics • Maximum Achievable Accuracy • Edge-counting graph metric • WordNet-based textual metric

  6. MAA metric • MAA: Maximum Achievable Accuracy • Use the maximum accuracy among a set of classifiers: MAA(c1, c2) = max over algorithms a in C of Accuracy_a(c1, c2) • c1, c2 are the pair of categories comprising a dataset • C is a set of classification algorithms • Nothing seems simpler than defining the hardness of a dataset by actual classification accuracy • Problem • grossly inefficient • too computationally intensive to be practical
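The MAA definition on this slide can be sketched in a few lines. This is only an illustration, not the presenters' implementation: the `maa` function, the classifier names, and the accuracy figures are all hypothetical, and each "classifier" is reduced to a callable that returns an accuracy for a category pair.

```python
from typing import Callable, Dict

# Hypothetical type: an evaluator takes two category names and returns the
# accuracy (0.0 to 1.0) that one classification algorithm achieves on them.
Evaluator = Callable[[str, str], float]


def maa(c1: str, c2: str, classifiers: Dict[str, Evaluator]) -> float:
    """Maximum Achievable Accuracy: the best accuracy over the set C."""
    if not classifiers:
        raise ValueError("the classifier set C must not be empty")
    return max(evaluate(c1, c2) for evaluate in classifiers.values())


# Made-up accuracies standing in for real training and evaluation runs.
toy_classifiers: Dict[str, Evaluator] = {
    "svm": lambda c1, c2: 0.91,
    "c4.5": lambda c1, c2: 0.84,
    "knn": lambda c1, c2: 0.79,
}

score = maa("Top/Arts/Music", "Top/Science/Physics", toy_classifiers)
```

In practice every evaluator would train and cross-validate a classifier on the documents of both categories, which is why the slide calls MAA too computationally intensive to use as the generation-time metric.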

  7. Edge-counting graph metric • Defines the distance between a pair of categories as the length of the shortest path connecting them in the hierarchy • Conjecture • the closer two categories are in the underlying graph, the closer they are in meaning
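A minimal sketch of the edge-counting metric, assuming the hierarchy is given as a parent-to-children mapping. The toy category names and the helper names `build_undirected` and `graph_distance` are illustrative, not taken from the presentation.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical toy hierarchy: each category maps to its child categories.
TREE: Dict[str, List[str]] = {
    "Top": ["Top/Arts", "Top/Science"],
    "Top/Arts": ["Top/Arts/Music", "Top/Arts/Movies"],
    "Top/Science": ["Top/Science/Physics"],
}


def build_undirected(tree: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Turn the parent->children mapping into an undirected adjacency map."""
    graph: Dict[str, Set[str]] = {}
    for parent, children in tree.items():
        graph.setdefault(parent, set())
        for child in children:
            graph[parent].add(child)
            graph.setdefault(child, set()).add(parent)
    return graph


def graph_distance(graph: Dict[str, Set[str]], start: str, goal: str) -> int:
    """Number of edges on the shortest path, found by breadth-first search."""
    if start not in graph or goal not in graph:
        raise KeyError("unknown category")
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    raise ValueError("categories are not connected")


GRAPH = build_undirected(TREE)
siblings = graph_distance(GRAPH, "Top/Arts/Music", "Top/Arts/Movies")
cousins = graph_distance(GRAPH, "Top/Arts/Music", "Top/Science/Physics")
```

Here the sibling categories are two edges apart and the categories in different top-level branches are four edges apart, matching the conjecture that closer nodes are closer in meaning.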

  8. WordNet-based textual metric (1) • WordNet • an electronic lexical database for computational linguistics, text analysis and many related areas • The similarity metric for individual words • P. Resnik. Semantic similarity in a taxonomy, 1999 • [Figure: WordNet taxonomy]

  9. WordNet-based textual metric (2) • The full text of the documents in each category is not used for computing the distance • Preprocessing • for each category c1, c2, …, cn, take the union of its title and description with the titles and descriptions of its subcategories and links • this yields descriptions D1, D2, …, Dn, each an unordered bag of words • The metric is defined over the entire textual descriptions of the categories, symmetrically

  10. WordNet-based textual metric (3) • The asymmetric distance between a pair of such descriptions is defined as the average distance from the words of the first description to the second description • The distance between a word and a bag of words is the shortest distance between that word and any word in the bag • The distance between two words is the maximum possible similarity score minus Resnik's similarity score, which transforms the similarity metric into a measure of distance
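The three definitions on this slide compose as in the sketch below. Several things here are assumptions rather than details from the presentation: the similarity table is made-up data standing in for Resnik's WordNet-based score, `MAX_SCORE` is an arbitrary bound, identical words are given distance zero for simplicity, and the symmetric version is obtained by averaging the two asymmetric distances.

```python
from typing import Dict, FrozenSet, List

MAX_SCORE = 10.0  # assumed upper bound on the similarity score

# Hypothetical similarity scores; a real system would query WordNet.
TOY_SIMILARITY: Dict[FrozenSet[str], float] = {
    frozenset({"guitar", "piano"}): 8.0,
    frozenset({"guitar", "quark"}): 1.0,
    frozenset({"guitar", "atom"}): 2.0,
    frozenset({"piano", "quark"}): 1.0,
    frozenset({"piano", "atom"}): 3.0,
    frozenset({"quark", "atom"}): 9.0,
}


def word_distance(w1: str, w2: str) -> float:
    """Maximum possible score minus the similarity of the two words."""
    if w1 == w2:
        return 0.0  # simplification: identical words are maximally similar
    return MAX_SCORE - TOY_SIMILARITY.get(frozenset({w1, w2}), 0.0)


def word_to_bag_distance(word: str, bag: List[str]) -> float:
    """Shortest distance between the word and any word in the bag."""
    return min(word_distance(word, other) for other in bag)


def asymmetric_distance(d1: List[str], d2: List[str]) -> float:
    """Average distance from the words of d1 to the bag d2."""
    return sum(word_to_bag_distance(w, d2) for w in d1) / len(d1)


def symmetric_distance(d1: List[str], d2: List[str]) -> float:
    """Assumed symmetrization: the mean of both asymmetric distances."""
    return (asymmetric_distance(d1, d2) + asymmetric_distance(d2, d1)) / 2


music = ["guitar", "piano"]
physics = ["quark", "atom"]
forward = asymmetric_distance(music, physics)
backward = asymmetric_distance(physics, music)
both = symmetric_distance(music, physics)
```

With this toy data the forward and backward distances differ, which is why a symmetric combination is needed before the value can serve as a distance between two categories.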

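  11. WordNet-based textual metric (4) • [Figure: the bags of words D1 and D2, each with words w1, w2, …, wn, illustrating the asymmetric distance between the two descriptions in each direction, e.g. distance(D2, D1)]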

  12. Properties of individual categories • The following parameters can be configured • cardinality • the desired number of documents it should contain • coherence • document = web site • a number of pages downloaded from each Web site and concatenated into a single document • topic & language

  13. Finding appropriate pairs of categories • Graph distance: count the number of edges • Text distance • Cache the text distances of all pairs of categories considered so far • Search procedure • 1. Check whether the cache already contains an appropriate pair • 2. Otherwise, randomly sample from the cache a pair of categories whose distance is closest to the specified distance • 3. Run a hill-climbing search in the hierarchy graph, starting from that pair • 4. If the search fails, repeat steps 2-3 • The search never examines the actual documents
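The search procedure on this slide might look roughly like the following sketch. It is a simplified reading of the four steps, not the presenters' algorithm: `find_pair`, its parameters, the restart limit, the rule of never restarting from the same pair twice, and the toy hierarchy with its path-based distance are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

Pair = Tuple[str, str]


def find_pair(target: float, cache: Dict[Pair, float],
              distance: Callable[[str, str], float],
              neighbours: Callable[[str], Iterable[str]],
              tolerance: float = 0.0, max_restarts: int = 10,
              rng: Optional[random.Random] = None) -> Optional[Pair]:
    rng = rng or random.Random()
    tried: Set[Pair] = set()

    def error(pair: Pair) -> float:
        if pair not in cache:  # compute each distance once, then reuse it
            cache[pair] = distance(*pair)
        return abs(cache[pair] - target)

    # Step 1: an appropriate pair may already be in the cache.
    for pair, dist in cache.items():
        if abs(dist - target) <= tolerance:
            return pair
    for _ in range(max_restarts):  # Step 4: repeat steps 2-3 on failure
        untried = [p for p in cache if p not in tried]
        if not untried:
            return None
        # Step 2: sample at random among the cached pairs closest to target.
        best = min(error(p) for p in untried)
        current = rng.choice(sorted(p for p in untried if error(p) == best))
        tried.add(current)
        # Step 3: hill climbing, moving one category to a graph neighbour.
        while True:
            a, b = current
            candidates = [(n, b) for n in neighbours(a) if n != b]
            candidates += [(a, n) for n in neighbours(b) if n != a]
            if not candidates:
                break
            nxt = min(candidates, key=error)
            if error(nxt) >= error(current):
                break  # local optimum reached, so this start has failed
            current = nxt
            if error(current) <= tolerance:
                return current
    return None


# Hypothetical toy hierarchy; distances use only category paths, never documents.
CHILDREN = {"Top": ["Top/Arts", "Top/Science"],
            "Top/Arts": ["Top/Arts/Music"],
            "Top/Science": ["Top/Science/Physics"]}


def toy_neighbours(cat: str) -> Iterable[str]:
    parent = [cat.rsplit("/", 1)[0]] if "/" in cat else []
    return parent + CHILDREN.get(cat, [])


def toy_distance(a: str, b: str) -> float:
    pa, pb = a.split("/"), b.split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return float(len(pa) + len(pb) - 2 * common)


found = find_pair(4.0, {("Top/Arts", "Top/Science"): 2.0},
                  toy_distance, toy_neighbours, rng=random.Random(0))
```

The cache doubles as the memo table for the hill climber, so every pair's distance is computed at most once across restarts.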

  14. Generating text datasets • Collect the documents not only in a category but also in its sub-categories • Create a document for each target Web site • Start from the URL listed in the directory • Crawl the Web pages of the target site in breadth-first (BFS) order • Download a predefined number of Web pages • Noise in Web pages • small texts in menus, textual advertisements, unrelated images, text rendered in the background color
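The per-site document construction on this slide can be sketched as a bounded breadth-first crawl. To keep the example runnable without network access, pages come from a hypothetical in-memory table; `crawl_site`, the `fetch` callback, and the example URLs are assumptions, and a real crawler would also need HTML parsing, error handling, and politeness delays.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse

# A fetcher returns the page text and the list of URLs the page links to.
Fetch = Callable[[str], Tuple[str, List[str]]]


def crawl_site(start_url: str, fetch: Fetch, max_pages: int) -> str:
    """Visit pages of one site in BFS order and concatenate their text."""
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    texts: List[str] = []
    while queue and len(texts) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        texts.append(text)
        for link in links:
            # Stay inside the target site; external links are not pursued.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return "\n".join(texts)


# Hypothetical site used in place of real HTTP requests.
PAGES: Dict[str, Tuple[str, List[str]]] = {
    "http://example.org/": ("home", ["http://example.org/a",
                                     "http://example.org/b",
                                     "http://other.example.com/"]),
    "http://example.org/a": ("page a", ["http://example.org/c"]),
    "http://example.org/b": ("page b", []),
    "http://example.org/c": ("page c", []),
}


def toy_fetch(url: str) -> Tuple[str, List[str]]:
    return PAGES[url]


document = crawl_site("http://example.org/", toy_fetch, max_pages=3)
```

Breadth-first order favors pages close to the listed URL, which are more likely to describe the site's main topic than deeply nested pages.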

  15. Filtering data on the Web • Pre-processing: eliminate certain categories, e.g. the “Top/World” subtree of the ODP • Online filtering: do not follow external links • Post-processing • Download more Web pages than the user specified • Weak filtering: discard Web pages that contain HTTP error messages or only a few words • Strong filtering: eliminate unrelated pages, e.g. legal notices • Compute the text distance between the root page and each sub-page • Eliminate outliers: sub-pages whose distance from the root page is more than one standard deviation above the average
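The strong-filtering rule on this slide reduces to a simple threshold on the page distances. The sketch below assumes the distances from the root page have already been computed; the page names and values are made up, and the use of the population standard deviation (`pstdev`) is an assumption, since the slide does not say which estimator is used.

```python
from statistics import mean, pstdev
from typing import Dict, List


def filter_outliers(distances: Dict[str, float]) -> List[str]:
    """Keep pages whose distance from the root page is at most
    one standard deviation above the average distance."""
    values = list(distances.values())
    if len(values) < 2:
        return list(distances)  # nothing to compare against
    threshold = mean(values) + pstdev(values)
    return [page for page, dist in distances.items() if dist <= threshold]


# Hypothetical text distances between the root page and each sub-page.
toy_distances = {"about": 2.0, "products": 3.0, "news": 2.5, "legal": 9.0}
kept = filter_outliers(toy_distances)
```

With these values only the "legal" page lies above the threshold and is dropped, mirroring the slide's example of removing legal notices.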

  16. Empirical evaluation • Generated 300 datasets of varying difficulty from the ODP, using graph distance or text distance • The ODP covers over 4 million sites in over 540,000 categories • Each dataset: a pair of categories • Each category: 100-200 documents • Each document: a concatenation of 5 sub-documents (Web pages) • Text categorization • Support vector machines (SVMlight implementation), decision trees (C4.5), and k-nearest neighbor • Accuracy measured under a 10-fold cross-validation scheme

  17. Evaluation on distance metrics • High correlation between text categorization accuracy and distance metrics => good control over the difficulty • 0.533 for the graph metric, 0.834 for the text metric by Pearson’s linear correlation coefficient • 0.614 between the graph metric and the text metric
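For reference, Pearson's linear correlation coefficient used on this slide can be computed as below. The distance and accuracy values are made up purely to exercise the function and are not the presentation's data; `pearson` is an illustrative helper name.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's linear correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-dataset values: category distance and classifier accuracy.
toy_distance_values = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_accuracies = [0.62, 0.70, 0.75, 0.88, 0.93]
r = pearson(toy_distance_values, toy_accuracies)
```

A coefficient near 1 means accuracy rises almost linearly with the distance, which is the sense in which a metric gives control over dataset difficulty.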

  18. Distance metrics and MAA • High correlation between distance metrics and MAA => good predictors of dataset difficulty • 0.550 for the graph metric, 0.790 for the text metric

  19. Versatility of dataset generation • The ODP contains enough category pairs of adequate size at different distances • Text metric case: 3500 pairs were sampled from 13,000 mid-size categories (having 100-3000 links)

  20. Conclusion and Discussion • ACCIO can automatically acquire labeled datasets with user-definable properties for text categorization from a hierarchical directory of documents • Experimental results show that the proposed metrics are good predictors of dataset difficulty • Drawbacks of graph metrics • Correlation increases with the depth of tree nodes • Unreliable values for extremely long hierarchy paths
