
Large Scale Multi-Label Classification


Presentation Transcript


  1. Large Scale Multi-Label Classification via MetaLabeler. Lei Tang (Arizona State University), Suju Rajan and Vijay K. Narayanan (Yahoo! Data Mining & Research)

  2. Large Scale Multi-Label Classification • Huge number of instances and categories • Common for online content: query categorization, web page classification, social bookmark/tag recommendation, video annotation/organization

  3. Challenges • Multi-Class: thousands of categories • Multi-Label: each instance has more than one label • Large Scale: huge number of instances and categories • Our query categorization problem: 1.5M queries, 7K categories • Yahoo! Directory: 792K docs, 246K categories (Liu et al. 2005) • Most existing multi-label methods do not scale • Structural SVM, mixture model, collective inference, maximum-entropy model, etc. • The simplest One-vs-Rest SVM is still widely used

  4. One-vs-Rest SVM [Diagram: one binary SVM per category (SVM1-SVM4 for C1-C4); at prediction time each SVM scores the instance, ranking the true categories (C3, C4) above the rest (C1, C2)]

  5. One-vs-Rest SVM • Pros: • Simple, fast, scalable • Each label is trained independently, so it is easy to parallelize • Cons: • Highly skewed class distribution (few +, many -) • Biased prediction scores • Still outputs a reasonably good ranking (Rifkin and Klautau 2004) • e.g. 4 categories C1, C2, C3, C4 • True labels for x1: C1, C3 • Prediction scores: {s1, s3} > {s2, s4} • But how many of the top-ranked labels should be predicted?
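A minimal sketch of the One-vs-Rest setup above. scikit-learn is an assumed stand-in (the talk only says an existing SVM package is used), and the toy data and variable names are illustrative:

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)                          # toy features
    Y = [[0, 2], [1], [2, 3], [0]] * 25             # toy multi-label targets
    Y_bin = MultiLabelBinarizer().fit_transform(Y)  # one binary column per label

    # One binary SVM per label; each is trained independently (parallelizable).
    ovr = OneVsRestClassifier(LinearSVC()).fit(X, Y_bin)

    # The raw scores are biased by the skewed +/- split, but their ranking is
    # still reasonably good (Rifkin and Klautau 2004).
    scores = ovr.decision_function(X[:1])           # shape (1, n_labels)
    ranking = np.argsort(-scores[0])                # labels sorted by confidence

How many of the top-ranked labels to keep is exactly the question the Meta Model below is built to answer.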

  6. MetaLabeler Algorithm • Obtain a ranking of class membership for each instance • Any generic ranking algorithm can be applied • Here: One-vs-Rest SVM • Build a Meta Model to predict the number of top classes to keep • Construct the Meta Label • Construct the Meta Feature • Build the Meta Model

  7. Meta Model – Training • Example taxonomy under Clothing: Women Clothing, Children Clothing, Leather Clothing, Formal Wear, Fashion • Q1 = affordable cocktail dress; Labels: Formal Wear, Women Clothing • Q2 = cotton children jeans; Labels: Children Clothing • Q3 = leather fashion in 1990s; Labels: Fashion, Women Clothing, Leather Clothing • Meta data (query: #labels): Q1: 2, Q2: 1, Q3: 3 • Train One-vs-Rest SVMs on the original labels and a regression Meta-Model on the meta data • How to handle predictions like 2.5 labels?
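The training step on this slide fits in a few lines: the meta label is simply each instance's label count, regressed on meta features. This continues the toy objects from the previous sketch; LinearSVR is an assumption, since the talk does not name the regressor:

    from sklearn.svm import LinearSVR

    # Meta labels: the number of true labels per instance
    # (as in Q1 -> 2, Q2 -> 1, Q3 -> 3 above).
    meta_y = np.array([len(labels) for labels in Y])

    # Content-based meta features: the raw data itself (see the next slide).
    meta_model = LinearSVR().fit(X, meta_y)

    # Regression can output fractional counts such as 2.5; rounding to the
    # nearest integer, with a floor of one label, is one simple answer to
    # the question on the slide.
    k = max(1, int(round(float(meta_model.predict(X[:1])[0]))))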

  8. Meta Feature Construction • Content-based • Use the raw data • The raw data contains all the information • Score-based • Use the prediction scores • The meta model may learn the bias in the scores • Rank-based • Use the sorted prediction scores
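In terms of the toy objects above, the three variants would look like this (a sketch, not the authors' exact feature pipeline):

    content_feats = X                              # content-based: raw data
    score_feats = ovr.decision_function(X)         # score-based: SVM scores
    rank_feats = -np.sort(-score_feats, axis=1)    # rank-based: per-row sorted scores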

  9. MetaLabeler Prediction • Given one instance: • Obtain the rankings for all labels; • Use the meta model to predict the number of labels • Pick the top-ranking labels • MetaLabeler • Easy to implement • Use existing SVM package/software directly • Can be combined with a hierarchical structure easily • Simply build a Meta Model at each internal node
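Putting the pieces together, per-instance prediction might look as follows, reusing the hypothetical ovr and meta_model objects from the earlier sketches:

    def metalabeler_predict(x):
        # Step 1: rank all labels by One-vs-Rest SVM score.
        scores = ovr.decision_function(x.reshape(1, -1))[0]
        # Step 2: the meta model predicts how many labels to keep.
        k = max(1, int(round(float(meta_model.predict(x.reshape(1, -1))[0]))))
        # Step 3: return the top-k ranked labels.
        return np.argsort(-scores)[:k]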

  10. Baseline Methods • Existing thresholding methods (Yang 2001) • Rank-based Cut (RCut) • Output a fixed number of top-ranking labels for each instance • Proportion-based Cut (PCut) • For each label, choose a proportion of test instances as positive • Not applicable for online prediction • Score-based Cut (SCut, a.k.a. threshold tuning) • For each label, determine a threshold based on cross-validation • Tends to overfit and is not very stable • MetaLabeler: a local RCut method • Customizes the number of labels for each instance
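For contrast with MetaLabeler's per-instance cut, RCut applies the same fixed k to every instance; a sketch with the same toy objects:

    def rcut_predict(X_test, k=2):
        # Fixed global cut: every instance gets exactly k labels.
        scores = ovr.decision_function(X_test)
        return np.argsort(-scores, axis=1)[:, :k]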

  11. Publicly Available Benchmark Data • Yahoo! Web Page Classification • 11 data sets, each constructed from a top-level category • 2nd-level topics are the categories • 16-32k instances, 6-15k features, 14-23 categories • 1.2-1.6 labels per instance, maximum 17 labels • Each label has at least 100 instances • RCV1 • A large-scale text corpus • 101 categories, 3.2 labels per instance • For evaluation purposes, 3000 instances for training, 3000 for testing • Highly skewed distribution (some labels have only 3-4 instances)

  12. MetaLabeler of Different Meta Features • Which type of meta feature is more predictive? • Content-based MetaLabeler outperforms other meta features

  13. Performance Comparison • MetaLabeler tends to outperform other methods

  14. Bias with MetaLabeler • The distribution of the number of labels is imbalanced • Most instances have a small number of labels • A small portion of instances have many more labels • The imbalanced distribution leads to bias in MetaLabeler • It prefers to predict fewer labels • It predicts many labels only with strong confidence

  15. Scalability Study • Threshold tuning requires cross-validation, otherwise it overfits • MetaLabeler simply adds some meta labels and learns One-vs-Rest SVMs

  16. Scalability Study (cont'd) • Threshold tuning: cost increases linearly with the number of categories in the data • E.g. 6000 categories -> 6000 thresholds to be tuned • MetaLabeler: upper bounded by the maximum number of labels for any one instance • E.g. 6000 categories, but one instance has at most 15 labels • Just need to learn 15 additional binary SVMs • The Meta Model is "independent" of the number of categories

  17. Application to Large Scale Query Categorization • Query categorization problem: • 1.5 million unique queries: 1M for training, 0.5M for testing • 120k features • An 8-level taxonomy of 6433 categories • Multiple labels, e.g. "0% interest credit card no transfer fee": • Financial Services/Credit, Loans and Debt/Credit/Credit Card/Balance Transfer • Financial Services/Credit, Loans and Debt/Credit/Credit Card/Low Interest Card • Financial Services/Credit, Loans and Debt/Credit/Credit Card/Low-No-fee Card • 1.23 labels on average, at most 26 labels

  18. Flat Model • Flat model: does not leverage the hierarchical structure • Threshold tuning on the training data alone takes 40 hours, while MetaLabeler takes 2 hours

  19. Hierarchical Model - Training [Diagram: a taxonomy tree rooted at Root, processed node by node] • Step 1: Generate training data • Step 2: Roll up labels of deeper descendants to the node's children • Step 3: Create an "Other" category • Step 4: Train One-vs-Rest SVMs over the children plus "Other", yielding new training data at each node

  20. Hierarchical Model - Prediction [Diagram: query q descends the taxonomy from the root through predicted child nodes] • Predict using the SVMs trained at the root level, then descend into the predicted children • Stop if reaching a leaf node or the "Other" category (sketched below)
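The top-down walk can be sketched as a recursion. The Node structure and field names below are purely hypothetical, with a per-node meta model (slide 9) choosing how many children to follow:

    def predict_hierarchical(node, x, path=()):
        # Returns the set of predicted category paths for instance x.
        if node.is_leaf:
            return {path}
        scores = node.svm.decision_function(x.reshape(1, -1))[0]
        k = max(1, int(round(float(node.meta_model.predict(x.reshape(1, -1))[0]))))
        out = set()
        for i in np.argsort(-scores)[:k]:
            child = node.children[int(i)]
            if child.name == "Other":
                out.add(path)      # the instance stops at this internal node
            else:
                out |= predict_hierarchical(child, x, path + (child.name,))
        return out or {path}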

  21. Hierarchical Model + MetaLabeler • Precision decreases by 1-2%, but recall improves by 10% at deeper levels

  22. Features in MetaLabeler

  23. Conclusions & Future Work • MetaLabeler is promising for large-scale multi-label classification • Core idea: learn a meta model to predict the number of labels • Simple, efficient, and scalable • Uses existing SVM software directly • Easy for practical deployment • Future work • How to optimize MetaLabeler for a desired performance level? • E.g. > 95% precision • Application to social networking related tasks

  24. Questions?

  25. References • Liu, T., Yang, Y., Wan, H., Zeng, H., Chen, Z., and Ma, W. 2005. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explor. Newsl. 7, 1 (Jun. 2005), 36-43. • Rifkin, R. and Klautau, A. 2004. In Defense of One-Vs-All Classification. J. Mach. Learn. Res. 5 (Dec. 2004), 101-141. • Yang, Y. 2001. A study of thresholding strategies for text categorization. In Proceedings of the 24th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (New Orleans, Louisiana, United States). SIGIR '01. ACM, New York, NY, 137-145.
