
Classifier Evaluation




  1. Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas

  2. Hash tables • A hash table or associative array efficiently implements a function with a very large domain but relatively few recorded values • Example: map names to phone numbers • Although there are many possible names, only a few will be stored in a particular phone book

  3. Implementing hash tables • A hash table works by using a hash function to translate the input (keys) to a small range of buckets • For example, h(n) = n mod k, where k is the size of the hash table • Collisions can occur when different keys are mapped to the same bucket, and must be resolved • Many programming languages directly support hash tables
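
As an illustrative sketch (not from the slides), a minimal chained hash table in Python, using a bucket index of the slide's form h(n) = n mod k and resolving collisions by keeping a list per bucket:

```python
class ChainedHashTable:
    """Minimal hash table with chaining; bucket index is hash(key) mod k."""

    def __init__(self, k=101):
        self.k = k                          # number of buckets
        self.buckets = [[] for _ in range(k)]

    def _index(self, key):
        return hash(key) % self.k           # h(key) = hash(key) mod k

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing, _) in enumerate(bucket):
            if existing == key:             # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))         # new key (or collision): chain it

    def get(self, key):
        for existing, value in self.buckets[self._index(key)]:
            if existing == key:
                return value
        raise KeyError(key)


phone_book = ChainedHashTable()
phone_book.put("Alice", "555-0100")
print(phone_book.get("Alice"))              # 555-0100
```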

  4. Example hash table

  5. FASTA after step 1

  6. FASTA – Step 2 • Group together hot spots on the same diagonal. This creates a partial alignment with matches and mismatches (no indels). • Keep the 10 best diagonal runs • If a hot spot matches at position i in S and position j in T, it will be on the (i-j)th diagonal • Sort hot spots by i-j to group them
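
A minimal sketch of this grouping step, assuming hot spots are given as (i, j) start positions and a diagonal run is scored simply by its hot-spot count (a simplification; FASTA also weighs match lengths and the gaps between them):

```python
from collections import defaultdict

def best_diagonal_runs(hot_spots, n_best=10):
    """Group hot spots by diagonal i - j and keep the n_best diagonals.

    hot_spots: iterable of (i, j) positions, i in S and j in T.
    Returns (diagonal, [hot spots on it]) pairs, best first.
    """
    by_diagonal = defaultdict(list)
    for i, j in hot_spots:
        by_diagonal[i - j].append((i, j))   # same i - j => same diagonal
    # Simplified score: number of hot spots on the diagonal.
    ranked = sorted(by_diagonal.items(), key=lambda d: len(d[1]), reverse=True)
    return ranked[:n_best]
```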

  7. FASTA – Step 3 • Rescore exact matches using realistic substitution scores (from a matrix such as PAM250 for proteins) • Trim and extend hot spots according to the substitution scores, allowing “good” mismatches
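
A minimal sketch of rescoring and trimming, assuming subst maps residue pairs to scores (e.g., PAM250 entries); trimming to the best-scoring contiguous sub-segment is done Kadane-style:

```python
def rescore_run(S, T, i, j, length, subst):
    """Score S[i:i+length] against T[j:j+length] with a substitution matrix,
    then trim to the best-scoring contiguous sub-segment."""
    scores = [subst[(S[i + k], T[j + k])] for k in range(length)]
    best = total = 0
    start = best_start = best_end = 0
    for k, s in enumerate(scores):
        total += s
        if total <= 0:                      # a prefix this bad is worth dropping
            total, start = 0, k + 1
        elif total > best:
            best, best_start, best_end = total, start, k
    return best, best_start, best_end       # score and trimmed extent
```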

  8. The PAM matrices • From observing closely related proteins in evolution, we can estimate the likelihood that one amino acid mutates to another • Normalize these probabilities by PAM (Percent Accepted Mutations per 100 amino acids) • The PAM0 matrix is the identity matrix • The PAM1 matrix diverges slightly from the identity matrix

  9. Calculating PAM matrices • If we have PAM1, then PAM_N = (PAM1)^N • A Markov chain of N independent mutation steps • The PAM250 matrix has been found empirically most useful • At this evolutionary distance, 80% of amino acids are changed • Change varies according to class (from only 45% to 94%) • Some amino acids are no longer good matches with themselves
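
A numerical sketch of this matrix power; the 3×3 pam1 below is a made-up stand-in, not real PAM1 data (each row holds mutation probabilities and sums to 1):

```python
import numpy as np

# Toy 3-"amino-acid" stand-in for PAM1 (illustrative values, not real data).
pam1 = np.array([
    [0.990, 0.005, 0.005],
    [0.004, 0.990, 0.006],
    [0.006, 0.004, 0.990],
])

# PAM_N = (PAM1)^N: N steps of the mutation Markov chain.
pam250 = np.linalg.matrix_power(pam1, 250)
print(pam250.round(3))

# Expected fraction of residues changed after N steps
# (1 minus the average diagonal entry, assuming uniform composition).
print(1 - pam250.diagonal().mean())
```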

  10. FASTA after Steps 2 and 3

  11. FASTA – Step 4 • Starting from the best diagonal run, look at nearby diagonal runs and incorporate non-overlapping hot spots • This extends the partial alignment with some insertions and deletions • We look only a limited distance from the best diagonal run
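
A minimal sketch of this step, assuming runs are (diagonal, start_i, end_i, score) tuples; the greedy combination and fixed diagonal distance are simplifications of FASTA's actual chaining, which also charges joining penalties:

```python
def combine_runs(runs, max_diag_dist=8):
    """Greedily extend the best diagonal run with nearby, non-overlapping runs.

    runs: list of (diag, start_i, end_i, score); diag = i - j.
    Returns the chosen runs and their total score (gap costs omitted).
    """
    runs = sorted(runs, key=lambda r: r[3], reverse=True)
    best = runs[0]
    chosen = [best]
    for run in runs[1:]:
        near = abs(run[0] - best[0]) <= max_diag_dist   # limited distance
        overlaps = any(not (run[2] < c[1] or run[1] > c[2]) for c in chosen)
        if near and not overlaps:
            chosen.append(run)
    return chosen, sum(r[3] for r in chosen)
```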

  12. FASTA after Step 4

  13. FASTA – Step 5 • Run the full dynamic programming alignment algorithm in a band around the extended best diagonal run • Only consider matches within w positions on either side of the extended best diagonal run • Typically w = 16, so the band contains about 32n cells, and 32n ≪ n², the cost of unrestricted dynamic programming
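
A minimal sketch of banded dynamic programming, restricting a simple Needleman-Wunsch recurrence to cells with |(i - j) - d| <= w around a center diagonal d; the match/mismatch/gap scores are assumed parameters, not FASTA's:

```python
def banded_align(S, T, d=0, w=16, match=1, mismatch=-1, gap=-2):
    """Alignment scores computed only within w of diagonal d = i - j.

    Cells outside the band are treated as unreachable. For brevity this
    stores the full matrix; a real implementation keeps only the band.
    """
    NEG = float("-inf")
    n, m = len(S), len(T)
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0 or abs((i - j) - d) > w:
                continue                    # outside the band: skip
            best = NEG
            if i > 0 and j > 0 and dp[i-1][j-1] > NEG:
                s = match if S[i-1] == T[j-1] else mismatch
                best = max(best, dp[i-1][j-1] + s)
            if i > 0 and dp[i-1][j] > NEG:
                best = max(best, dp[i-1][j] + gap)   # gap in T
            if j > 0 and dp[i][j-1] > NEG:
                best = max(best, dp[i][j-1] + gap)   # gap in S
            dp[i][j] = best
    return dp[n][m]

print(banded_align("ACDE", "ACDA"))         # 1: three matches, one mismatch
```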

  14. FASTA final step

  15. BLAST • Basic Local Alignment Search Tool • Uses words as FASTA does, but allows approximate matches of words to create high-scoring pairs (HSPs) • Usually uses longer words (k = 3 for proteins, k = 11 for DNA) • HSPs on the same diagonal are combined and extended • Reports local alignments based on one HSP or a combination of two close HSPs • Variations allow gaps and pattern search
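
A minimal sketch of the word-neighborhood idea behind approximate word matching, assuming a toy alphabet and substitution scores (real BLAST scores k = 3 protein words with BLOSUM or PAM entries):

```python
from itertools import product

def neighborhood(word, subst, alphabet, threshold):
    """All words whose substitution score against `word` is >= threshold."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        score = sum(subst[(a, b)] for a, b in zip(word, cand))
        if score >= threshold:
            hits.append("".join(cand))
    return hits

# Toy scoring: +2 for identity, -1 otherwise (stand-in for a real matrix).
alphabet = "ACDE"
subst = {(a, b): (2 if a == b else -1) for a in alphabet for b in alphabet}
print(neighborhood("ACD", subst, alphabet, threshold=3))  # allows 1 mismatch
```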

  16. Alignment as classification • Alignment can be viewed as • A function that produces similarity values between any two strings • These similarity values can then be used to inform classifiers and clustering programs • A binary classifier: Any two strings are classified as related/similar or not • Requires the use of a threshold • The threshold can be fixed or depend on the context and application
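
A minimal sketch of the thresholded binary classifier; similarity stands for any alignment-derived score, and the threshold is an assumed parameter:

```python
def related(s, t, similarity, threshold):
    """Classify two strings as related (True) or not (False)
    by comparing their alignment similarity to a threshold."""
    return similarity(s, t) >= threshold

# Example with a trivial similarity stand-in (count of matching positions):
sim = lambda s, t: sum(a == b for a, b in zip(s, t))
print(related("ACDE", "ACDA", sim, threshold=3))    # True: 3 matches
```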

  17. Measuring performance • Done on a test set separate from the training set (the examples with known labels) • We need to know (but not make available to the classifier) the class labels in the test set, in order to evaluate the classifier’s performance • Both sets must be representative of the problem instances – not always the case

  18. Contingency tables • Given an n-way classifier and a set with both classifier-assigned labels and correct, known labels, we construct an n×n contingency table counting all combinations of true and assigned classes

  19. 2×2 Contingency Table • Binary classification in this example:

                        truly positive    truly negative
    assigned positive         a                 b
    assigned negative         c                 d
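
A minimal sketch of counting the four cells from paired labels, following the slides' conventions (a = true positives, b = false positives, c = false negatives, d = true negatives):

```python
def contingency_2x2(true_labels, assigned_labels, positive=1):
    """Count a, b, c, d for a binary classifier."""
    a = b = c = d = 0
    for truth, guess in zip(true_labels, assigned_labels):
        if guess == positive:
            if truth == positive:
                a += 1          # true positive
            else:
                b += 1          # false positive
        else:
            if truth == positive:
                c += 1          # false negative
            else:
                d += 1          # true negative
    return a, b, c, d

print(contingency_2x2([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))    # (2, 1, 1, 1)
```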

  20. Two types of error • Usually one class is associated with “success” or “detection” • False positives: reporting the sought-after class when it is not the correct one (b in the contingency table) • False negatives: failing to report the sought-after class when it is the correct one (c in the contingency table)

  21. Performance measures • Accuracy: how often is the classification correct? • A = (a+d)/N, where N is the size of the scored set (N = a+b+c+d) • Problem: if the a priori probability of one class is much higher, we usually score better by always predicting that class, which is not a very meaningful classifier • E.g., in a disease detection test where the disease is rare

  22. Accounting for rare classes • Assign a cost to each error and measure the expected error • Normalize for fixed N to make results comparable across experiments • Measure separate error rates • Precision P = a/(a+b) • Recall (or sensitivity) R = a/(a+c) • Specificity = d/(d+b)
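
A minimal sketch computing these measures from the four counts (names follow the slides; the counts could come from the contingency_2x2 helper sketched under slide 19):

```python
def metrics(a, b, c, d):
    """Accuracy, precision, recall/sensitivity, specificity from a 2x2 table."""
    n = a + b + c + d
    return {
        "accuracy":    (a + d) / n,     # A = (a + d) / N
        "precision":   a / (a + b),     # P = a / (a + b)
        "recall":      a / (a + c),     # R = a / (a + c), i.e. sensitivity
        "specificity": d / (d + b),
    }

a, b, c, d = 2, 1, 1, 1                 # counts from a 2x2 contingency table
print(metrics(a, b, c, d))
```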
