

  1. Recall Systems: Efficient Learning and Use of Category Indices Omid Madani With Wiley Greiner, David Kempe, and Mohammad Salavatipour

  2. Overview • Problems and motivation • Proposal: recall systems • Experiments • Related work and conclusions

  3. Massive Learning • Lots of ... • Instances (millions, unbounded..) • Dimensions (1000s and beyond) • Categories (1000s and beyond) • Two questions: • How to categorize quickly? • How to efficiently learn an efficient categorizer?

  4. Yahoo! Page Topics (Y! Directory) [Figure: a fragment of the Yahoo! directory hierarchy; top-level topics such as Arts&Humanities, Business&Economy, and Recreation&Sports branch into subcategories such as Sports, Photography, History, Contests, Amateur, Magazines, Education, College, and Basketball] • Over 100,000 categories in the Yahoo! directory • Given a page, quickly categorize… • Larger still for vision, text prediction, ... (millions of categories and beyond)

  5. Efficiency • Two phases (unless truly online): • Learning • Classification time/deployment • Resource requirements: • Memory • Time • Sample efficiency

  6. Idea • Cues in the input may quickly narrow down the possibilities => “index” the categories • Like a search engine, but here we learn a good index • Goal: the index reduces the set of possible classes; classifiers are then applied for precise classification

  7. Summary Findings • Very fast: • Train time: learned in minutes on thousands of instances/categories • 10s of online classifiers trained on each instance (not 1000s) • Index doesn’t hurt classifier accuracy!

  8. Recognition System [Figure: instance x → Recall System → reduced set of candidate categories → Classifier Application → categories for x]

  9. The Problem: Tripartite Graph [Figure: a tripartite graph over features f1–f5, instances x1–x7, and categories c1–c4; edges link each instance to its features and to its categories]

  10. Output: An Index [Figure: a bipartite graph between features f1–f5 and concepts c1–c5; the set of edges E is the “COVER”]

  11. Using the Index • Given an instance x, retrieve the candidate set of concepts R(x) = { c : (f, c) is in the cover E for some feature f present in x } • A concept is retrieved when a disjunction of its features is satisfied
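A minimal sketch of this retrieval step (not from the original slides; the dict-based cover representation and the names cover and retrieve are assumptions):

```python
# Sketch: the cover is stored as an inverted index mapping each
# feature to the concepts it has an edge to; retrieval is a union.
from collections import defaultdict

cover = defaultdict(set)        # cover[f] = {c : (f, c) is in E}
cover["wheat"].add("grain")     # toy edges for illustration only
cover["shr"].add("earnings")

def retrieve(x):
    """Candidate set R(x) = {c : (f, c) in E for some f in x}."""
    candidates = set()
    for f in x:                 # x = set of features present in the instance
        candidates |= cover[f]
    return candidates

print(retrieve({"wheat", "price"}))   # -> {'grain'}
```

Retrieval touches only the features present in x, so its cost is independent of the total number of categories.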

  12. Terminology • False positive: The retrieved concept shouldn’t have been retrieved (irrelevant) • False negative: The concept should have been retrieved, but was not (missed)

  13. Learning to Index • Let’s learn the cover (the edges) • Online and mistake-driven • A mistake means: • A false-negative concept, or • Too many false positives

  14. The Indexer Algorithm • For each concept c, keep a sparse vector Vc, initially 0 • Begin with an empty cover • On each instance x: • Retrieve the candidate concepts • Update Vc for each false-negative c (promotion) • If fp-count > tolerance, update Vc for each false-positive c (demotion) • Update the index accordingly • Update the classifiers
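A minimal sketch of this loop (the full pseudocode appears on slides 17–18; the dict-of-dicts representation and all names here are assumptions, and the update subroutine is sketched after slide 16 below):

```python
# Sketch of the Indexer's mistake-driven loop (illustrative, not the
# authors' code). V[c][f] is the weight of feature f for concept c;
# `index` is the cover, kept consistent with V by `update` (below).
from collections import defaultdict

THETA = 0.1        # inclusion threshold (slides 15 and 33)
TOLERANCE = 100    # allowed false positives per instance (slide 33)

V = defaultdict(dict)       # sparse weight vectors, initially empty (0)
index = defaultdict(set)    # index[f] = {c : V[c][f] >= THETA}

def process(x, true_concepts):
    """One online step; x is the set of features of the instance."""
    candidates = {c for f in x for c in index[f]}   # retrieval
    for c in true_concepts - candidates:            # false negatives
        update(c, x, promote=True)                  # promotion
    fp = candidates - true_concepts                 # false positives
    if len(fp) > TOLERANCE:                         # tolerance exceeded
        for c in fp:
            update(c, x, promote=False)             # demotion
    # the per-concept classifiers would also be trained here
```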

  15. Use Feature Weights • For each concept c, keep a sparse vector Vc, initially 0 • An (i, j)-edge exists in the cover iff the weight of feature fi in Vcj is at least the inclusion threshold θ

  16. Updating the Vectors • Increase/decrease the feature weights in Vc that appear in x by the learning rate • In promotion, if a feature is not present in Vc, initialize it to 1 or 1/df • In demotion, ignore 0-weight features • Max-normalize the weights (optional) • Update the index • Takes O(|x| + |Vc|) time per instance
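A sketch of the corresponding update subroutine (cf. slide 18), continuing the loop above; the multiplicative update form and the normalization details are assumptions based on these bullets:

```python
RATE = 1.2   # learning rate (slide 33); multiplicative update assumed

def update(c, x, promote):
    """Promote/demote concept c on instance x and refresh its edges."""
    vc = V[c]
    for f in x:
        if promote:
            if f not in vc:
                vc[f] = 1.0          # or 1/df(f), per slide 16
            else:
                vc[f] *= RATE        # raise weights of features in x
        elif f in vc:                # demotion: absent (0) features ignored
            vc[f] /= RATE
    if vc:
        m = max(vc.values())         # optional max-normalization
        for f in vc:
            vc[f] /= m
    for f in list(vc):               # keep the index consistent with Vc
        if vc[f] >= THETA:
            index[f].add(c)
        else:
            index[f].discard(c)
    # total work is O(|x| + |Vc|) per instance, as the slide states
```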

  17. The Indexer Algorithm

  18. The Update Subroutine

  19. Analysis • Consider a distribution X over instances • A given cover E then induces: • A false-positive rate: fp-rate(E) = Ex~X [fp-count on x] • A false-negative rate, fn-rate(E), defined analogously over the fn-counts

  20. Analysis • If fp-rate(E) <= fp and fn-rate(E) <= fn, we say the cover is an (fp, fn)-cover • Is there an algorithm that converges efficiently to an (fp, fn)-cover? • We can show this for the max-norm algorithm, given that a (0,0)-cover exists and the tolerance is set to 0

  21. Convergence of Max-Norm • The max-norm algorithm converges to a (0,0)-cover, given that one exists and the tolerance is set to 0 • The max-norm algorithm makes O(KL) mistakes for a concept with K pure features, where L is the average instance length

  22. Pure Features • A pure feature f for c: if f occurs, the instance belongs to c • A “pure” feature never gets “punished” for its concept • It takes O(L) mistakes to drive the other, irrelevant features out of the index

  23. Complexity Results • Deciding existence of an (fp, fn)-cover is NP-hard (when fp > 0; fn can remain 0) • Approximation is also NP-hard! • Why, then, is the method successful in practice?!

  24. Variations • Some alternatives: • Use of weights for ranking • Other update policies • Additive updates • Use of other norms, or no norm • Batch versus online • …

  25. Recognition System [Figure repeated from slide 8: instance x → Recall System → reduced set of candidate categories → Classifier Application → categories for x]

  26. The Classifiers • (Possibly) Binary classifiers: • One for each concept • For learning the classifiers: • Online learning algorithms

  27. Learners Used • Need online algorithms • Experimented with: • Perceptron • Winnow • Committees of these (voted perceptrons, etc.)
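For concreteness, a minimal sketch of one such online learner, the standard mistake-driven perceptron (not the authors' exact variant; voted/committee versions combine several of these):

```python
# Standard online perceptron for one binary concept (illustrative).
from collections import defaultdict

class Perceptron:
    def __init__(self):
        self.w = defaultdict(float)   # sparse weight vector
        self.b = 0.0                  # bias term

    def predict(self, x):             # x: dict of feature -> value
        s = self.b + sum(self.w[f] * v for f, v in x.items())
        return 1 if s > 0 else -1

    def learn(self, x, y):            # y in {+1, -1}; update on mistakes
        if self.predict(x) != y:
            for f, v in x.items():
                self.w[f] += y * v
            self.b += y
```

With the index as a filter, each instance is shown only to the classifiers of the retrieved candidate concepts, which is why tens rather than thousands of classifiers are trained per instance (slide 7).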

  28. Experiments

  29. Questions • Small tolerance (10s, 100s) enough? • Convergence? Overhead (speed & memory)? • Overall performance? (together with classifier training and testing)

  30. Size Statistics • 3 large text categorization corpora: • The big new Reuters corpus (Rose et al.) • An ads dataset (internal) • ODP = Open Directory Project (web pages and their categories)

  31. Domain statistics

  32. Domains

  33. Experimental Setup • Split the data into 70% train and 30% test • The same split is used for all experiments • Algorithm parameters: • Tolerance = 100 • Learning rate = 1.2 • Inclusion threshold = 0.1 • Hardware: 2.4 GHz machine with 64 GB RAM

  34. Performance (Indexer Alone)

  35. Reuters With Classifiers [Results figure; all three domains, but a subset of the classes]

  36. Indexer’s Performance [Plots: fp-rate and fn-rate at pass i, for Reuters, Ads, and ODP]

  37. Indexer’s Timings [Table; m = minutes, h = hours]

  38. Performance With Classifiers I [Table: Reuters; “No” = index not used, “Yes” = index used]

  39. With Classifiers II [Plots: F1 score (harmonic mean of precision and recall) at pass i; Reuters, 50 sample categories; Ads, 76 sample categories; ODP, 108 sample categories]
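For reference, the F1 score plotted above combines precision P and recall R as their harmonic mean:

```latex
F_1 = \frac{2PR}{P + R}
```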

  40. Error Plot [Plot: total, false-negative, and false-positive error counts]

  41. W and fp-rate Convergence [Plot: both versus the number of instances]

  42. Fn-rate vs. Tolerance [Plot: fn-rate versus tolerance]

  43. Fp-rate vs. Tolerance [Plot: fp-rate versus tolerance]

  44. Index Size Statistics After 20 Passes

  45. High Out-degree Features • In Reuters: • “woodmark” (out-degree 10), indexing categories including: • Wooden Furniture • Measuring/Precision Instruments • Electronic Active Components • … • “prft” (64) • “shr” (59)

  46. Related Work • Fast classification candidates: • hierarchical learning, trees (kd, metric, ball, vp, cover, ..), • inverted indices (search engines!) • Fast learning candidates: • Nearest neighbors • Naïve Bayes • Generative models • Hierarchical learning • Feature selection/reduction

  47. Related • Fast visual categorization in biological systems (e.g., Thorpe et al.) • Psychology of concepts (e.g., Murphy ’02) • Associative memory, speed-up learning, blackboard systems, models of aspects of mind/brain

  48. Summary • Problem: efficiently learn and classify when categories abound • Proposal: the recall system, an index that serves as a filter • The filter is learned efficiently: we quickly learned a quick system!

  49. Current/Future • Evaluation on other domains • Language modeling, prediction • Vision .. • Extend the techniques • Ranking (easier than labeling; we got very promising results) • Learn “staged” versions • Concept discovery • Understand better: • Why do such efficient algorithms work? • Why should good covers exist? What tolerance? • Strengthen the convergence analysis

  50. Acknowledgements • Thanks to Thomas Pierce for helping us with the Nutch engine • The Y!R ML group (DeCoste and Keerthi) for discussions
