Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas
Hash tables • A hash table or associative array efficiently implements a function with a very large domain but relatively few recorded values • Example: Map names to phone numbers • Although there are many possible names, only a few will be stored in a particular phone book
Implementing hash tables • A hash table works by using a hash function to map the input keys to a small range of buckets • For example, h(n) = n mod k, where k is the size of the hash table • Collisions occur when different keys are mapped to the same bucket, and must be resolved • Many programming languages directly support hash tables
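As a rough illustration of the slide above, here is a minimal sketch of a hash table that uses h(n) = n mod k and resolves collisions by chaining (keeping a list per bucket). Chaining is only one possible resolution strategy and is an assumption here, as are the class name `ChainedHashTable` and the sample phone numbers.

```python
class ChainedHashTable:
    """Toy hash table: h(n) = n mod k, collisions resolved by chaining."""

    def __init__(self, k=8):
        self.k = k                              # number of buckets
        self.buckets = [[] for _ in range(k)]   # one list (chain) per bucket

    def _bucket(self, key):
        # hash() turns any hashable key into an integer n; then n mod k
        return self.buckets[hash(key) % self.k]

    def put(self, key, value):
        bucket = self._bucket(key)
        for idx, (existing_key, _) in enumerate(bucket):
            if existing_key == key:             # key already stored: overwrite
                bucket[idx] = (key, value)
                return
        bucket.append((key, value))             # new key: extend the chain

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


table = ChainedHashTable(k=4)
table.put(3, "555-0101")
table.put(7, "555-0199")   # 3 mod 4 == 7 mod 4, so both keys share one bucket
```

Both keys remain retrievable even though they collide, because each lookup scans only the short chain in one bucket.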
FASTA – Step 2 • Group together hot spots on the same diagonal. This creates a partial alignment with matches and mismatches (no indels). • Keep the 10 best diagonal runs • If a hot spot matches at position i in S and position j in T, it will be on the (i-j)th diagonal • Sort hot spots by i-j to group them
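The grouping step can be sketched as follows, assuming hot spots are already available as (i, j) position pairs. Ranking diagonals by their number of hot spots is a simplifying assumption made for this example; the function names are hypothetical.

```python
from collections import defaultdict


def group_by_diagonal(hot_spots):
    """Group (i, j) hot spots by their diagonal index i - j."""
    diagonals = defaultdict(list)
    # Sorting by i - j (then by i) puts hot spots of one diagonal side by side
    for i, j in sorted(hot_spots, key=lambda p: (p[0] - p[1], p[0])):
        diagonals[i - j].append((i, j))
    return dict(diagonals)


def best_diagonals(diagonals, keep=10):
    """Keep the `keep` diagonals with the most hot spots (assumed scoring)."""
    ranked = sorted(diagonals.items(), key=lambda kv: len(kv[1]), reverse=True)
    return ranked[:keep]


hot_spots = [(0, 2), (3, 1), (1, 3), (4, 2), (2, 4)]
diagonals = group_by_diagonal(hot_spots)
top = best_diagonals(diagonals, keep=1)
```

Here three hot spots fall on diagonal -2 and two on diagonal 2, so diagonal -2 is the best run.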
FASTA – Step 3 • Rescore exact matches using realistic substitution penalties (from a set such as PAM250 for proteins) • Trim and extend hot spots according to substitution penalties, allowing “good” mismatches
The PAM matrices • From observing closely related proteins in evolution, we can estimate the likelihood that one amino acid mutates to another • Normalize these probabilities by PAM (Point Accepted Mutation; 1 PAM corresponds to one accepted mutation per 100 amino acids) • The PAM0 matrix is the identity matrix • The PAM1 matrix diverges slightly from the identity matrix
Calculating PAM matrices • If we have PAM1, then PAMN = (PAM1)^N • A Markov chain of independent mutations • The PAM250 matrix has been found empirically most useful • At this evolutionary distance, 80% of amino acids are changed • Change varies according to class (from only 45% to 94%) • Some amino acids are no longer good matches with themselves
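The relation PAMN = (PAM1)^N can be illustrated with a small sketch. The 2×2 matrix below is a made-up toy over a hypothetical two-letter alphabet, not real PAM data, and the row-stochastic convention (each row sums to 1) is an assumption of this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    # Power 0 gives the identity matrix, matching PAM0
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy "PAM1": each letter stays unchanged with high probability (made-up values)
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]
toy_pam250 = mat_pow(toy_pam1, 250)
```

After 250 steps the toy matrix is far from the identity, which mirrors the point that at large evolutionary distances a residue may no longer be its own best match.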
FASTA – Step 4 • Starting from the best diagonal run, look at nearby diagonal runs and incorporate non-overlapping hot spots • This extends the partial alignment with some insertions and deletions • We only look a limited distance from the best diagonal run
FASTA – Step 5 • Run the full dynamic programming alignment algorithm in a band around the extended best diagonal run • Only consider matches within w positions on either side of the extended best diagonal run • Typically, w is 16, so the band covers about 32n cells, and 32n ≪ n^2
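A banded dynamic program along these lines might look like the sketch below. It computes a local alignment score while filling only cells within w of a chosen diagonal. The scoring values (match 1, mismatch -1, linear gap -2) and the function name are assumptions for illustration; FASTA itself uses substitution matrices and its own gap penalties.

```python
def banded_local_score(s, t, diagonal, w, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(i - j) - diagonal| <= w."""
    best = 0
    prev = {}                                  # scores of row i - 1, keyed by j
    for i in range(1, len(s) + 1):
        cur = {}
        lo = max(1, i - diagonal - w)          # leftmost column inside the band
        hi = min(len(t), i - diagonal + w)     # rightmost column inside the band
        for j in range(lo, hi + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Cells outside the band count as 0, i.e. a fresh local start
            diag = prev.get(j - 1, 0) + sub
            up = prev.get(j, 0) + gap
            left = cur.get(j - 1, 0) + gap
            cur[j] = max(0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best


# t is s with one letter deleted, so the best alignment needs a single gap
wide = banded_local_score("ACGTACGT", "ACGACGT", diagonal=0, w=1)
narrow = banded_local_score("ACGTACGT", "ACGACGT", diagonal=0, w=0)
```

Each row touches at most 2w + 1 cells, so the work is roughly (2w + 1)·n rather than n^2. With w = 0 the band is a single diagonal and the gapped alignment cannot be found, so `narrow` is lower than `wide`.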
BLAST • Basic Local Alignment Search Tool • Like FASTA, uses words, but allows for approximate matches of words to create high scoring pairs (HSPs) • Usually longer words than FASTA (k=3 for proteins, 11 for DNA) • HSPs are combined on the same diagonal and extended • Reports local alignments based on one HSP or a combination of two close HSPs • Variations allow gaps and pattern search
Alignment as classification • Alignment can be viewed as • A function that produces similarity values between any two strings • These similarity values can then be used to inform classifiers and clustering programs • A binary classifier: Any two strings are classified as related/similar or not • Requires the use of a threshold • The threshold can be fixed or depend on the context and application
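To make the thresholding idea concrete, here is a minimal sketch. It uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in similarity in [0, 1]; this is not an alignment score, and a real system would presumably plug in one. The default threshold of 0.8 is arbitrary.

```python
from difflib import SequenceMatcher


def similarity(s, t):
    # Stand-in similarity in [0, 1]; NOT an alignment score (assumption)
    return SequenceMatcher(None, s, t).ratio()


def is_related(s, t, threshold=0.8):
    """Binary classifier: related if the similarity reaches the threshold."""
    return similarity(s, t) >= threshold


same_at_08 = is_related("GATTACA", "GATTTCA", threshold=0.8)
same_at_09 = is_related("GATTACA", "GATTTCA", threshold=0.9)
```

The same pair of strings is classified differently under the two thresholds, which is why the choice of threshold depends on context and application.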
Measuring performance • Done on a test set separate from the training set (the examples with known labels) • We need to know (but not make available to the classifier) the class labels in the test set, in order to evaluate the classifier’s performance • Both sets must be representative of the problem instances – not always the case
Contingency tables • Given an n-way classifier and a set with both labels assigned by the classifier and correct, known labels, we construct an n×n contingency table counting all combinations of true/assigned classes
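Counting the combinations can be sketched as below. The orientation (rows are assigned classes, columns are true classes) is an assumption of this example, as are the function name and the sample labels.

```python
from collections import Counter


def contingency_table(true_labels, assigned_labels, classes):
    """n x n table: rows = assigned class, columns = true class (assumed layout)."""
    counts = Counter(zip(assigned_labels, true_labels))
    return [[counts[(assigned, true)] for true in classes]
            for assigned in classes]


true_labels = ["pos", "pos", "neg", "neg", "neg", "pos"]
assigned_labels = ["pos", "neg", "neg", "pos", "neg", "pos"]
table_2x2 = contingency_table(true_labels, assigned_labels, ["pos", "neg"])
```

For these six examples the table is [[2, 1], [1, 2]]: two correct positives, one false positive, one false negative, and two correct negatives.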
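2×2 Contingency Table • Binary classification in this example • Cell labels, as used on the following slides: a = assigned positive, truly positive; b = assigned positive, truly negative; c = assigned negative, truly positive; d = assigned negative, truly negative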
Two types of error • Usually one class is associated with “success” or “detection” • False positives: Report that the sought-after class is the correct one when it is not (b in the contingency table) • False negatives: Fail to report the sought-after class even when it is the correct one (c in the contingency table)
Performance measures • Accuracy: How often is the classification correct? • A = (a+d)/N, where N is the size of the scored set (N=a+b+c+d) • Problem: If the a priori probability of one class is much higher, we are usually better off just predicting that class, which is not a very meaningful classifier • E.g., in a disease detection test
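A short sketch of the accuracy formula shows the problem with skewed classes. The counts are made up: a hypothetical test set of 1000 people, 10 of whom have the disease, scored against a "classifier" that always predicts the majority class (healthy).

```python
def accuracy(a, b, c, d):
    """A = (a + d) / N with N = a + b + c + d."""
    return (a + d) / (a + b + c + d)


# Always predicting "healthy": no positives reported at all,
# so a = 0, b = 0, c = 10 missed cases, d = 990 correct negatives
baseline_accuracy = accuracy(a=0, b=0, c=10, d=990)
```

The baseline scores 99% accuracy while detecting none of the diseased cases.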
Accounting for rare classes • Assign a cost to each error and measure the expected error • Normalize for fixed N to make results comparable across experiments • Measure separate error rates • Precision P = a/(a+b) • Recall (or sensitivity) R = a/(a+c) • Specificity = d/(d+b)
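These measures can be computed directly from the four cell counts, as in the sketch below. Returning `None` when a denominator is zero is a choice made for this example, and the counts and error costs are made up.

```python
def safe_ratio(numerator, denominator):
    # Undefined measures are reported as None rather than raising (assumption)
    return numerator / denominator if denominator else None


def precision(a, b, c, d):
    return safe_ratio(a, a + b)


def recall(a, b, c, d):
    return safe_ratio(a, a + c)


def specificity(a, b, c, d):
    return safe_ratio(d, d + b)


def expected_cost(a, b, c, d, cost_fp=1.0, cost_fn=1.0):
    """Average cost per example, normalized by N = a + b + c + d."""
    return (cost_fp * b + cost_fn * c) / (a + b + c + d)


cells = dict(a=8, b=20, c=2, d=970)   # hypothetical rare-class test set
p = precision(**cells)
r = recall(**cells)
s = specificity(**cells)
cost = expected_cost(**cells, cost_fp=1.0, cost_fn=10.0)
```

On these counts accuracy would be 97.8%, yet precision is under 30%, which the separate measures make visible.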