
Presentation Transcript


  1. INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID, Lecture # 43: Classification

  2. ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the following sources • “Introduction to Information Retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze • “Managing Gigabytes” by Ian H. Witten, Alistair Moffat, and Timothy C. Bell • “Modern Information Retrieval” by Ricardo Baeza-Yates and Berthier Ribeiro-Neto • “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla

  3. Outline • Rocchio classification • K Nearest Neighbors • Nearest-Neighbor Learning • kNN decision boundaries • Bias vs. variance

  4. Rocchio classification

  5. Classification Using Vector Spaces • In vector space classification, training set corresponds to a labeled set of points (equivalently, vectors) • Premise 1: Documents in the same class form a contiguous region of space • Premise 2: Documents from different classes don’t overlap (much) • Learning a classifier: build surfaces to delineate classes in the space
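
A minimal sketch of this representation in Python, assuming scikit-learn is available; the toy documents and labels below are invented for illustration only:

from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy training set: each document carries a class label.
train_docs = [
    "the senate passed the budget bill",        # Government
    "the experiment measured particle decay",   # Science
    "the gallery opened a new exhibition",      # Arts
]
train_labels = ["Government", "Science", "Arts"]

# Map each document to a tf-idf weighted vector; each row of X is one
# labeled point in the vector space.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)
print(X.shape)   # (number of documents, vocabulary size)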

  6. Documents in a Vector Space [figure: training documents plotted in the vector space, grouped into Government, Science, and Arts regions]

  7. Test Document of what class? [figure: an unlabeled test document plotted among the Government, Science, and Arts regions]

  8. Test Document = Government [figure: the test document falls inside the Government region, with decision boundaries drawn between Government, Science, and Arts] Is this similarity hypothesis true in general? Our focus: how to find good separators (decision boundaries)

  9. Rocchio Algorithm • Relevance feedback methods can be adapted for text categorization • As noted before, relevance feedback can be viewed as 2-class classification • Relevant vs. non-relevant documents • Use standard tf-idf weighted vectors to represent text documents • For the training documents in each category, compute a prototype vector by summing the vectors of the training documents in that category • Prototype: centroid of the members of the class • Assign test documents to the category with the closest prototype vector, based on cosine similarity (see the sketch below)
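
A minimal sketch of the algorithm with plain NumPy, assuming X is a dense matrix whose rows are tf-idf document vectors; the function names here are my own, not the lecture's:

import numpy as np

def train_rocchio(X, labels):
    """One prototype per class: the length-normalized centroid of its members."""
    prototypes = {}
    for c in set(labels):
        members = X[[i for i, y in enumerate(labels) if y == c]]
        centroid = members.mean(axis=0)
        prototypes[c] = centroid / np.linalg.norm(centroid)
    return prototypes

def classify_rocchio(prototypes, x):
    """Assign x to the class whose prototype is closest by cosine similarity."""
    x = x / np.linalg.norm(x)
    return max(prototypes, key=lambda c: float(prototypes[c] @ x))

Because both the prototypes and the test vector are length-normalized, the dot product above equals cosine similarity.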

  10. Rocchio Algorithm • There may be many more red vectors along the y-axis, and they will drift the red centroid towards the y-axis. As a result, some awkward red documents may end up closer to the blue centroid than to their own.

  11. Rocchio classification • Little used outside text classification • It has been used quite effectively for text classification • But in general worse than Naïve Bayes • Again, cheap to train and to apply to test documents

  12. Rocchio classification • Rocchio forms a simple representative for each class: the centroid/prototype • Classification: nearest prototype/centroid • It does not guarantee that classifications are consistent with the given training data

  13. K Nearest Neighbors

  14. k Nearest Neighbor Classification • kNN = k Nearest Neighbor • To classify a document d: • Define k-neighborhood as the k nearest neighbors of d • Pick the majority class label in the k-neighborhood
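
A minimal NumPy sketch of this procedure, assuming the training matrix and the test vector are already length-normalized (so a dot product is cosine similarity); the names are illustrative assumptions:

from collections import Counter
import numpy as np

def knn_classify(X_train, labels, x, k=3):
    """Majority vote among the k training documents most similar to x."""
    sims = X_train @ x                  # cosine similarity to every training doc
    neighbors = np.argsort(-sims)[:k]   # indices of the k nearest neighbors
    votes = Counter(labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]   # majority class label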

  15. Example: k = 6 (6NN). P(science | test document)? [figure: the six nearest neighbors of the test document, drawn from the Government, Science, and Arts regions]

  16. Nearest-Neighbor Learning • Learning: just store the labeled training examples D • Testing instance x (under 1NN): • Compute similarity between x and all examples in D. • Assign x the category of the most similar example in D. • Does not compute anything beyond storing the examples • Also called: • Case-based learning (remembering every single example of each class) • Memory-based learning (memorizing every instance of training set) • Lazy learning • Rationale of kNN: contiguity hypothesis (docs which are near to a given input doc will decide its class)

  17. Nearest-Neighbor

  18. k Nearest Neighbor • Using only the closest example (1NN) is subject to errors due to: • A single atypical example. • Noise (i.e., an error) in the category label of a single training example. • More robust: find the k nearest examples and return the majority category among them • k is typically odd to avoid ties; 3 and 5 are the most common choices • Alternatively, assign a weight (relevance) to each neighbor's vote when deciding (see the sketch below).
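
One common way to realize the weighted vote in the last bullet is to let each neighbor vote with its similarity instead of a plain count; this particular scheme is an assumption on my part, not the lecture's prescription:

from collections import defaultdict
import numpy as np

def weighted_knn_classify(X_train, labels, x, k=3):
    """Each of the k nearest neighbors votes with its cosine similarity."""
    sims = X_train @ x
    neighbors = np.argsort(-sims)[:k]
    scores = defaultdict(float)
    for i in neighbors:
        scores[labels[i]] += float(sims[i])   # similarity-weighted vote
    return max(scores, key=scores.get)

Weighting also softens the need for an odd k, since exact vote ties become unlikely.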

  19. kNN decision boundaries • Boundaries are in principle arbitrary surfaces, but usually polyhedra [figure: piecewise-linear kNN decision boundaries between the Government, Science, and Arts regions] • kNN gives locally defined decision boundaries between classes: far-away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)

  20. 3 Nearest Neighbor vs. Rocchio • Nearest Neighbor tends to handle polymorphic categories (classes made up of several disjoint clusters) better than Rocchio/NB: a single Rocchio centroid can fall between the clusters, while kNN follows each cluster locally.

  21. kNN: Discussion • No feature selection necessary • No training necessary • Scales well with large number of classes • Don’t need to train n classifiers for n classes • May be expensive at test time • In most cases it’s more accurate than NB or Rocchio

  22. Evaluating Categorization • Evaluation must be done on test data that are independent of the training data • Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data) • Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)
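
A minimal sketch of such an evaluation with scikit-learn, assuming it is available; the toy corpus is invented and the fold count is chosen only to fit it:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Invented toy corpus, two documents per class.
docs = ["tax reform vote", "budget bill passed",
        "quantum physics lab", "particle decay observed",
        "museum art show", "opera premiere night"]
labels = ["Gov", "Gov", "Sci", "Sci", "Arts", "Arts"]

X = TfidfVectorizer().fit_transform(docs)
clf = KNeighborsClassifier(n_neighbors=1, metric="cosine")  # 1NN classifier

# Two train/test splits; each test fold stays unseen during training.
scores = cross_val_score(clf, X, labels, cv=2)
print(scores.mean())   # average accuracy over the folds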

  23. Bias vs. variance: Choosing the correct model capacity
