Generative and Discriminative Models in Text Classification

Generative and Discriminative Models in Text Classification David D. Lewis Independent Consultant Chicago, IL, USA Dave@DavidDLewis.com www.DavidDLewis.com Workshop on Challenges in Information Retrieval and Language Modeling, U Mass, CIIR, Amherst, MA, 11 Sept 2002

Text Classification • Given a document, decide which of several classes it belongs to: • TREC filtering • TDT tracking task • Text categorization! • Automated indexing, content filtering, alerting,... • More LM papers here than any other IR problem • Others: parts of IE, author identification,...

Lang. Models are Generative • Model predicts probability document d will be generated by a source c • e.g. Unigram language model: • Parameters, i.e. P(w|c)’s, are fit to optimally predict generation of d
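
For reference, a unigram language model in its standard textbook form can be sketched as follows. The notation is an assumption of this sketch, not taken from the slide: V is the vocabulary, n(w,d) is the number of times word w occurs in document d, and n(w,c) is the count of w over all training documents from source c. The document-length term and the multinomial coefficient are omitted, since they do not depend on the parameters P(w|c).

```latex
\[
P(d \mid c) = \prod_{w \in V} P(w \mid c)^{\,n(w,d)},
\qquad
\hat{P}(w \mid c) = \frac{n(w,c)}{\sum_{w' \in V} n(w',c)}
\]
```

The right-hand estimate is the unsmoothed maximum likelihood fit, i.e. the parameter values that best predict generation of the training documents.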

Classify Text w/ Gen. Model • One source model for each class c • Choose class c with largest value of: • For 2 classes, unigram P(d|c), we have: • aka Naive Bayes (NB), Robertson/KSJ
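
The classification rule and its two-class form can be written out in the usual way. This is the standard Bayes-rule derivation under the unigram assumption, given as a sketch; the class labels c_1, c_2 and the count notation n(w,d) are assumptions rather than the slide's own symbols.

```latex
\[
\hat{c} = \arg\max_{c} \; P(c)\, P(d \mid c)
\]
\[
\log \frac{P(c_1 \mid d)}{P(c_2 \mid d)}
  = \log \frac{P(c_1)}{P(c_2)}
  + \sum_{w \in V} n(w,d)\, \log \frac{P(w \mid c_1)}{P(w \mid c_2)}
\]
```

The second line shows that the two-class log odds are a linear function of the term counts, with one weight per word plus a constant.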

The Discriminative Alternative • Directly model probability of generating class conditional on words: P(c|w) • Logistic regression: • Tune parameters to optimize conditional likelihood (class probability predictions)
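
A common way to write two-class logistic regression over term counts is shown below, as a sketch. The symbols are assumptions: beta_0 is an intercept, beta_w is the weight for word w, n(w,d) is the count of w in d, and (d_i, c_i) are the labeled training documents.

```latex
\[
\log \frac{P(c \mid d)}{1 - P(c \mid d)} = \beta_0 + \sum_{w \in V} \beta_w \, n(w,d)
\]
\[
(\hat{\beta}_0, \hat{\beta})
  = \arg\max_{\beta_0,\, \beta} \; \sum_{i} \log P(c_i \mid d_i ;\, \beta_0, \beta)
\]
```

The objective is the conditional log-likelihood of the class labels given the documents; the words themselves are never predicted.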

LR & NB: Same Parameters!
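
The shared parameterization can be illustrated with a small sketch: fitting a two-class unigram NB model and then reading off an intercept and one weight per word gives exactly the linear log-odds form used by logistic regression. The function names, the toy documents, and the add-alpha smoothing are illustrative assumptions, not part of the talk.

```python
import math
from collections import Counter

def fit_unigram(docs, vocab, alpha=1.0):
    """Add-alpha smoothed estimates of P(w|c) from tokenized documents."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def nb_linear_params(pos_docs, neg_docs, alpha=1.0):
    """Express a two-class NB model as (intercept, per-word weights)."""
    vocab = sorted({w for d in pos_docs + neg_docs for w in d})
    p_pos = fit_unigram(pos_docs, vocab, alpha)
    p_neg = fit_unigram(neg_docs, vocab, alpha)
    # Intercept is the log prior odds; weights are per-word log likelihood ratios.
    intercept = math.log(len(pos_docs) / len(neg_docs))
    weights = {w: math.log(p_pos[w] / p_neg[w]) for w in vocab}
    return intercept, weights

def log_odds(doc, intercept, weights):
    """Linear score: intercept plus one weight per token occurrence.

    Words outside the training vocabulary are ignored (an assumption).
    """
    return intercept + sum(weights.get(w, 0.0) for w in doc)

# Hypothetical toy data for illustration only.
pos = [["cheap", "pills", "cheap"], ["buy", "pills"]]
neg = [["meeting", "notes"], ["buy", "lunch", "notes"]]
intercept, weights = nb_linear_params(pos, neg)
score = log_odds(["cheap", "buy", "notes"], intercept, weights)
```

A positive score favors the first class and a negative score the second. Logistic regression would use the same intercept-plus-weights form but choose the values differently.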

Observations • LR & NB have same parameterization for 2- or k-class, binary or raw TF weighting • LR outperforms NB in text categorization and batch filtering studies • NB optimizes parameters to predict words, LR optimizes to predict class
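
The difference in objectives can be sketched in code: logistic regression keeps the linear log-odds form but adjusts the intercept and weights to raise the conditional log-likelihood of the labels. The version below uses plain batch gradient ascent with no regularization, which is an assumption made for brevity and not necessarily what the cited studies used; the function names and toy data are likewise hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(counts, intercept, weights):
    """Linear log odds for a document given as a {word: count} dict."""
    return intercept + sum(weights.get(w, 0.0) * n for w, n in counts.items())

def cond_log_lik(data, intercept, weights):
    """Sum over documents of log P(label | document)."""
    total = 0.0
    for counts, y in data:
        p = sigmoid(score(counts, intercept, weights))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def fit_lr(data, intercept=0.0, weights=None, lr=0.1, steps=200):
    """Batch gradient ascent on the conditional log-likelihood."""
    weights = dict(weights or {})
    for _ in range(steps):
        g0, g = 0.0, {}
        for counts, y in data:
            err = y - sigmoid(score(counts, intercept, weights))
            g0 += err
            for w, n in counts.items():
                g[w] = g.get(w, 0.0) + err * n
        intercept += lr * g0
        for w, gw in g.items():
            weights[w] = weights.get(w, 0.0) + lr * gw
    return intercept, weights

# Hypothetical toy data: ({word: count}, label); the last two rows
# share features but differ in label, so the classes are not separable.
data = [
    ({"cheap": 2, "pills": 1}, 1),
    ({"buy": 1, "pills": 1}, 1),
    ({"meeting": 1, "notes": 1}, 0),
    ({"buy": 1, "lunch": 1, "notes": 1}, 0),
    ({"buy": 1}, 1),
    ({"buy": 1}, 0),
]
b, w = fit_lr(data)
```

Passing NB-derived values as the starting `intercept` and `weights` would show how far the class-prediction objective moves them from the word-prediction fit.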

False Hopes for LM? • Leveraging unlabeled data (e.g. EM)? • Initial results show only small impact (same story as syntactic class tagging) • Non-unigram models • More accurately predict the wrong thing? • Cross-lingual TC • Any more than MT followed by TC?

True LM Hopes 1: Small Data? • Number of training examples to reach maximum effectiveness (Ng & Jordan ’01): • NB: O(log # features) • LR: O(# features) • LR and NB not compared yet (?) in the low-data case (TREC adaptive, TDT tracking) • Priors/smoothing likely to prove critical
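
As one illustration of what priors and smoothing look like for the two models, the standard forms are sketched below. These are textbook choices assumed for illustration, not ones the slide specifies: a symmetric Dirichlet prior with parameter alpha on each class's word distribution, whose posterior mean is the add-alpha estimate, and a zero-mean Gaussian prior with variance sigma^2 on the logistic regression weights, which yields an L2 penalty.

```latex
\[
\hat{P}(w \mid c) = \frac{n(w,c) + \alpha}{\sum_{w' \in V} n(w',c) + \alpha\,|V|}
\]
\[
\hat{\beta} = \arg\max_{\beta} \;
  \sum_{i} \log P(c_i \mid d_i ;\, \beta)
  \;-\; \frac{1}{2\sigma^2} \sum_{w \in V} \beta_w^2
\]
```

With few training examples most counts n(w,c) are zero, so the choice of alpha or sigma^2 largely determines the fitted parameters.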

True LM Hopes 2: Facets? • MeSH category assignments: Anti-Inflammatory Agents, Non-Steroidal/*therapeutic use; Tumor Necrosis Factor/antagonists & inhibitors/immunology • Most combinations have zero training data • Berger & Lafferty MT approach?

Non-LM TC Challenges? • Integration of prior knowledge • Choosing documents to label (TREC adaptive, active learning, sampling) • Combining text and nontext predictors • Knowing how well a classifier will/can do • Evolving category systems, switching vocabularies
