Automated Generation of Text Categorization Datasets Using a Hierarchical Directory

Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory
Presenters: Tae Jin Kim, Byungil Jeong

Introduction
• Text categorization
  • Goal: classify documents into a fixed number of predefined categories
  • Studied in numerous works
  • But good test collections for text categorization are far less abundant
• Focus: how to generate labeled datasets
• Main contributions
  • Present a methodology for automatically acquiring labeled datasets for text categorization experiments
  • Establish a connection between similarity metrics for document sets and classification accuracy
  • Make a large collection of text categorization datasets publicly available
    • http://techtc.cs.technion.ac.il

Outline of dataset generation
• The user specifies a distance
• Compute the distance between pairs of categories in a hierarchical directory
• Pick a pair of categories with the specified distance
• Output: text file (labeled dataset)

Dataset generation
• Generated datasets
  • contain two categories and are single-labeled
    • that is, every document belongs to exactly one category
  • are based on the Open Directory Project (http://dmoz.org)
• Two kinds of parameters
  • those characterizing the dataset (= pair of categories): distance according to a metric
  • those characterizing the individual categories
• [Assumption] Hierarchical directory
  • the directory is organized as a tree in which each node is labeled with a category
  • each category has
    • a collection of documents
    • a text description
  • each document has
    • a short annotation (optional)

Metrics
• Metrics quantify the conceptual distance between a pair of categories
  • The larger the distance, the easier it is to induce a classifier that separates the categories
  • Small distance = very difficult to classify
• Metrics
  • Maximum Achievable Accuracy
  • Edge-counting graph metric
  • WordNet-based textual metric

MAA metric
• MAA
  • Maximum Achievable Accuracy
  • Use the maximum accuracy among a set of classifiers
    • MAA(c1, c2) = max_{alg ∈ C} Accuracy_alg(c1, c2)
    • c1, c2 are the pair of categories comprising a dataset
    • C is a set of classification algorithms
• Nothing seems simpler than defining the hardness of a dataset by actual classification accuracy
• Problem
  • Grossly inefficient
  • Too computationally intensive to be practical
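The MAA definition can be sketched in a few lines. This is only an illustration under stated assumptions: each classification algorithm is represented by a hypothetical callable that trains and evaluates on the dataset and returns an accuracy in [0, 1]. The name `maa` and the toy scorers are not from the slides.

```python
from typing import Callable, Iterable, Tuple

Dataset = Tuple[str, str]               # a pair of category names (c1, c2)
Evaluator = Callable[[Dataset], float]  # hypothetical: returns accuracy in [0, 1]


def maa(dataset: Dataset, classifiers: Iterable[Evaluator]) -> float:
    """Maximum accuracy achieved on `dataset` by any algorithm in the set C."""
    scores = [evaluate(dataset) for evaluate in classifiers]
    if not scores:
        raise ValueError("need at least one classifier")
    return max(scores)


# Toy stand-ins for expensive train-and-evaluate runs (made-up numbers).
toy_classifiers = [lambda d: 0.80, lambda d: 0.90, lambda d: 0.85]
print(maa(("Top/Arts/Music", "Top/Science/Physics"), toy_classifiers))
```

Every call to an evaluator stands for a full training and cross-validation run, which is why the slide calls this metric too expensive to use for selecting category pairs.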

Edge-counting graph metric
• Measures the distance between a pair of categories by the length of the shortest path connecting them in the hierarchy
• Conjecture
  • The closer two categories are in the underlying graph, the closer they are in meaning
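A minimal sketch of edge counting in a tree, assuming categories are encoded as slash-separated paths from the root (for example "Top/Arts/Music"). The encoding and the function name `graph_distance` are assumptions made for illustration. In a tree the shortest path between two nodes runs through their lowest common ancestor, so it is enough to count the path components that the two categories do not share.

```python
def graph_distance(path_a: str, path_b: str) -> int:
    """Number of edges on the tree path between two categories.

    Assumes each category is a slash-separated path from the root.
    """
    a = path_a.split("/")
    b = path_b.split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Edges up from a to the common ancestor plus edges down to b.
    return (len(a) - common) + (len(b) - common)


print(graph_distance("Top/Arts/Music", "Top/Arts/Movies"))              # siblings
print(graph_distance("Top/Arts/Music", "Top/Science/Physics/Optics"))  # distant
```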

WordNet-based textual metric (1)
• WordNet
  • An electronic lexical database for computational linguistics, text analysis, and many related areas
• The similarity metric for individual words
  • P. Resnik. Semantic similarity in a taxonomy, 1999
[Figure: WordNet taxonomy]

WordNet-based textual metric (2)
• The full text of the documents in each category is not used for computing the distance
• Preprocessing
  • c1, c2, …, cn: categories
  • For each category, take the union of its title and description and the titles and descriptions of its subcategories and links
  • D1, D2, …, Dn: the resulting descriptions, each an unordered bag of words
• The metric is defined over the entire textual descriptions of categories and is symmetric

WordNet-based textual metric (3)
• The asymmetric distance between a pair of such descriptions is defined as the average distance from the words of the first description to the second description
• The distance between a word and a bag of words is the shortest distance between this word and any word in the bag
• The distance between two words is the maximum possible score minus Resnik’s similarity, which transforms the similarity metric into a measure of distance
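The three definitions above compose directly, as the following sketch shows. It assumes the word similarity is supplied as a function `sim(w1, w2)` together with its maximum possible score. A tiny hand-made similarity table stands in for Resnik's WordNet-based measure, so the numbers are purely illustrative. The slides do not say how the two asymmetric directions are combined into the symmetric metric, so that step is left out.

```python
from typing import Callable, Sequence

WordSim = Callable[[str, str], float]


def word_distance(w1: str, w2: str, sim: WordSim, max_score: float) -> float:
    """Turn a similarity into a distance: maximum score minus similarity."""
    return max_score - sim(w1, w2)


def word_to_bag(word: str, bag: Sequence[str], sim: WordSim, max_score: float) -> float:
    """Shortest distance from one word to any word in the bag."""
    return min(word_distance(word, other, sim, max_score) for other in bag)


def asymmetric_distance(d1: Sequence[str], d2: Sequence[str],
                        sim: WordSim, max_score: float) -> float:
    """Average distance from the words of d1 to the bag d2."""
    return sum(word_to_bag(w, d2, sim, max_score) for w in d1) / len(d1)


# Hypothetical toy similarity, NOT Resnik's measure: identical words get the
# maximum score, listed pairs get a made-up score, everything else gets 0.
MAX_SCORE = 10.0
TOY_TABLE = {("cat", "dog"): 6.0}


def toy_sim(w1: str, w2: str) -> float:
    if w1 == w2:
        return MAX_SCORE
    return TOY_TABLE.get((w1, w2), TOY_TABLE.get((w2, w1), 0.0))


print(asymmetric_distance(["cat", "car"], ["dog"], toy_sim, MAX_SCORE))
print(asymmetric_distance(["dog"], ["cat", "car"], toy_sim, MAX_SCORE))
```

The two printed values differ, which illustrates why the per-direction distance is called asymmetric.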

WordNet-based textual metric (4)
[Figure: word-to-word matching between the bags D1 and D2 (words w1, w2, …, wn), shown in each direction, including distance(D2, D1)]

Properties of individual categories
• The following parameters can be configured
  • cardinality
    • the desired number of documents the category should contain
  • coherence
    • document = Web site
    • a number of pages are downloaded from each Web site and concatenated into a single document
  • topic and language

Finding appropriate pairs of categories
• Graph distance: count the number of edges
• Text distance
  • Cache the text distances of all pairs of categories considered so far
  • Search procedure
    1. Check the cache for an appropriate pair
    2. Otherwise, randomly sample from the cache a pair of categories whose distance is closest to the specified distance
    3. Run a hill-climbing search in the hierarchy graph starting from that pair
    4. If it fails, repeat steps 2-3
• Actual documents are never examined
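The hill-climbing step (step 3) might look roughly like the sketch below. It is an illustration under several assumptions, not the authors' procedure. A move replaces one category of the pair with its parent or one of its children. The toy `TREE`, the `neighbors` helper, and the use of edge counting as the distance are all hypothetical stand-ins, and the cache of steps 1-2 is omitted. In the described system the distance would be the cached text distance.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]

# Hypothetical toy hierarchy: parent path -> child paths.
TREE: Dict[str, List[str]] = {
    "Top": ["Top/Arts", "Top/Science"],
    "Top/Arts": ["Top/Arts/Music", "Top/Arts/Movies"],
    "Top/Science": ["Top/Science/Physics"],
}


def neighbors(category: str) -> List[str]:
    """Parent (if any) followed by the children of a category."""
    parent = [category.rsplit("/", 1)[0]] if "/" in category else []
    return parent + TREE.get(category, [])


def path_distance(a: str, b: str) -> int:
    """Edge-counting distance, used here as a stand-in distance function."""
    pa, pb = a.split("/"), b.split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return len(pa) + len(pb) - 2 * common


def hill_climb(start: Pair, target: float,
               distance: Callable[[str, str], float],
               tolerance: float = 0.0, max_steps: int = 100) -> Optional[Pair]:
    """Greedily move one category at a time toward the target distance.

    Returns None at a local optimum so the caller can resample a start pair.
    """
    current = start
    error = abs(distance(*current) - target)
    for _ in range(max_steps):
        if error <= tolerance:
            return current
        a, b = current
        candidates = [(n, b) for n in neighbors(a)] + [(a, n) for n in neighbors(b)]
        candidates = [(x, y) for x, y in candidates if x != y]
        if not candidates:
            return None
        best = min(candidates, key=lambda p: abs(distance(*p) - target))
        best_error = abs(distance(*best) - target)
        if best_error >= error:
            return None  # no neighbor improves on the current pair
        current, error = best, best_error
    return current if error <= tolerance else None


print(hill_climb(("Top/Arts", "Top/Science"), 4, path_distance))
print(hill_climb(("Top/Arts/Music", "Top/Arts/Movies"), 4, path_distance))
```

The second call gets stuck at a local optimum and returns None, which corresponds to the "if it fails, repeat" branch of the procedure.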

Generating text datasets
• Collect the documents not only in a category but also in its subcategories
• Create a document for a target Web site
  • Start from the URL listed in the directory
  • Crawl Web pages in the target site in BFS order
  • Download a predefined number of Web pages
• Noise in Web pages
  • Small text on menus, textual advertisements, unrelated images, text rendered in the background color
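The breadth-first crawl of a single site can be sketched without any network access by passing in a function that returns the links found on a page. The function name `crawl_bfs`, the `get_links` callback, and the in-memory `LINKS` table are hypothetical. A real crawler would fetch and parse pages instead. Links that lead to another host are skipped so the crawl stays inside the target site.

```python
from collections import deque
from typing import Callable, Dict, List
from urllib.parse import urlparse


def crawl_bfs(start_url: str, get_links: Callable[[str], List[str]],
              max_pages: int) -> List[str]:
    """Return up to `max_pages` URLs of one site in breadth-first order."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            # Stay within the target site and never enqueue a page twice.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical link structure standing in for real pages.
LINKS: Dict[str, List[str]] = {
    "http://example.org/": ["http://example.org/a",
                            "http://other.example/x",
                            "http://example.org/b"],
    "http://example.org/a": ["http://example.org/c"],
}

print(crawl_bfs("http://example.org/", lambda u: LINKS.get(u, []), 3))
```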

Filtering data on the Web
• Pre-processing: eliminate certain categories, e.g. the “Top/World” subtree of the ODP
• Online filtering: do not pursue external links
• Post-processing
  • Download more Web pages than the user specified
  • Weak filtering: discard Web pages that contain HTTP error messages or only a few words
  • Strong filtering: eliminate unrelated pages, e.g. legal notices
    • Compute the text distance between the root page and each sub-page
    • Eliminate outliers: sub-pages whose distance from the root page is more than one standard deviation above the average
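The outlier rule used in strong filtering reduces to a threshold at the mean plus one standard deviation. The sketch below assumes the text distances from the root page have already been computed and are given as a mapping from page name to distance. The function name and the choice of the population standard deviation are assumptions, since the slides do not say which estimator is used.

```python
from statistics import mean, pstdev
from typing import Dict, List


def filter_outliers(distances: Dict[str, float]) -> List[str]:
    """Keep pages whose distance from the root page is at most mean + 1 std dev."""
    values = list(distances.values())
    # Assumption: population standard deviation (pstdev), not the sample one.
    threshold = mean(values) + pstdev(values)
    return [page for page, d in distances.items() if d <= threshold]


# Made-up distances: the legal-notice page is far from the root page.
toy = {"about": 1.0, "products": 1.0, "news": 1.0, "legal": 5.0}
print(filter_outliers(toy))
```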

Empirical evaluation
• Generated 300 datasets of varying difficulty from the ODP using graph distance or text distance
  • The ODP covers over 4 million sites in over 540,000 categories
• A dataset: a pair of categories
  • Each category: 100-200 documents
  • Each document: a concatenation of 5 sub-documents (Web pages)
• Text categorization
  • Support vector machines (SVM light implementation), decision trees (C4.5), and k-nearest neighbor
  • Accuracy measured under 10-fold cross-validation

Evaluation on distance metrics
• High correlation between text categorization accuracy and the distance metrics => good control over dataset difficulty
  • Pearson’s linear correlation coefficient: 0.533 for the graph metric, 0.834 for the text metric
• Correlation between the graph metric and the text metric: 0.614
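For reference, Pearson's linear correlation coefficient between per-dataset distances and accuracies can be computed as in the sketch below. The input lists are made-up toy values used only to exercise the function. They are not the experimental data behind the reported coefficients.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's linear correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(vx * vy)


# Toy values only: one distance and one accuracy per hypothetical dataset.
toy_distances = [1.0, 2.0, 3.0, 4.0]
toy_accuracies = [0.60, 0.72, 0.79, 0.93]
print(pearson(toy_distances, toy_accuracies))
```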

Distance metrics and MAA
• High correlation between the distance metrics and MAA => the metrics are good predictors of dataset difficulty
  • 0.550 for the graph metric, 0.790 for the text metric

Versatility of dataset generation
• There exist enough category pairs of adequate size at different distances in the ODP
• Text metric case: 3,500 pairs sampled from 13,000 mid-size categories (having 100-3,000 links)

Conclusion and Discussion
• ACCIO can automatically acquire labeled datasets with user-definable properties for text categorization from a hierarchical directory of documents
• Experimental results show that the proposed metrics are good predictors of dataset difficulty
• Drawbacks of the graph metric
  • Correlation increases with the depth of the tree nodes
  • Unreliable values for extremely long hierarchy paths
