Performance measures

Performance measures for matching
• The following counts are typically measured to assess matching performance:
  • TP: true positives, i.e. the number of correct matches
  • FN: false negatives, i.e. matches that were not correctly detected
  • FP: false positives, i.e. proposed matches that are incorrect
  • TN: true negatives, i.e. non-matches that were correctly rejected
• Based on these counts, any particular matching strategy at a particular threshold can be rated by the following measures:
  • True Positive Rate (TPR), also referred to as True Acceptance Rate (TAR) = TP / (TP+FN) = TP / P
  • False Positive Rate (FPR), also referred to as False Acceptance Rate (FAR) = FP / (FP+TN) = FP / N
• TAR @ 0.001 FAR, i.e. the true acceptance rate at the threshold for which FAR = 0.001, is a typical performance index used in benchmarks. Ideally, the true positive rate will be close to 1 and the false positive rate close to 0.
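As a minimal sketch of the two rates defined above, the following Python snippet computes TPR and FPR from the four counts. The function name `match_rates` and the example counts are hypothetical, chosen only for illustration.

```python
def match_rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from the four confusion counts."""
    positives = tp + fn  # P: all true matches
    negatives = fp + tn  # N: all true non-matches
    # Guard against empty classes; returning 0.0 there is a convention of this sketch.
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# Hypothetical counts, not taken from any benchmark.
tpr, fpr = match_rates(tp=18, fn=2, fp=4, tn=76)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

With these counts, 18 of the 20 true matches are detected and 4 of the 80 non-matches are wrongly accepted.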

ROC curves
• As we vary the matching threshold at which TPR and FPR are obtained, we derive a set of points in the TPR-FPR space, which are collectively known as the receiver operating characteristic (ROC) curve.
• The ROC curve plots the true positive rate against the false positive rate for a particular combination of feature extraction and matching algorithms. The area under the ROC curve (AUC) is often used as a scalar measure of performance.
• As the threshold θ on the matching distance is increased, more candidate matches are accepted, so the number of true positives and the number of false positives both increase. The closer the curve lies to the upper left corner, the better its performance.
• The ROC curve can also be used to calculate the Mean Average Precision, which is the average precision as you vary the threshold to select the best results.
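The threshold sweep described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: a candidate pair is accepted when its distance is at most the threshold, all distances are distinct, and the data contain at least one match and one non-match. The names `roc_points` and `auc` and the sample data are hypothetical.

```python
def roc_points(distances, is_match):
    """Sweep a distance threshold and return the (FPR, TPR) points of the ROC curve."""
    p = sum(is_match)          # number of true matches
    n = len(is_match) - p      # number of true non-matches
    points = [(0.0, 0.0)]      # threshold below every distance: nothing accepted
    tp = fp = 0
    # Raising the threshold accepts pairs one at a time, in order of distance.
    for _, match in sorted(zip(distances, is_match)):
        if match:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical distances and ground-truth labels for six candidate pairs.
distances = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
is_match = [True, True, False, True, False, False]
curve = roc_points(distances, is_match)
print(curve)
print(f"AUC = {auc(curve):.3f}")
```

Both coordinates never decrease along the sweep, which is why a looser threshold buys more true positives only at the cost of more false positives.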

Performance measure for retrieval sets
• The definition of performance measures for retrieval sets stems from information retrieval.
• The case of document selection is distinguished from the case in which the position in the retrieval set is considered (document ranking).
[Figure: document selection versus document ranking. Selection: a binary classifier f(d,q) ∈ {0, 1} returns a set R'(q) that approximates the true relevant set R(q). Ranking: a real-valued score f(d,q) orders the documents, e.g. 0.98 d1 +, 0.95 d2 +, 0.83 d3 -, 0.80 d4 +, 0.76 d5 -, 0.56 d6 -, 0.34 d7 -, 0.21 d8 +, 0.21 d9 -, where + / - mark relevant / non-relevant documents.]
• With selection, the classifier is unlikely to be accurate:
  • "Over-constrained" query (terms are too specific) → no relevant documents found
  • "Under-constrained" query (terms are too general) → over-delivery
• Even if the classifier is accurate, not all relevant documents are equally relevant.
• Ranking allows the user to control the boundary according to their preferences.

Performance measures for unranked retrieval sets
[Figure: the document collection partitioned by retrieved / not retrieved and relevant / irrelevant: retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, not retrieved & irrelevant.]
• The two most frequent and basic measures for unranked retrieval sets are Precision and Recall. They are first defined for the simple case where the information retrieval system returns a set of documents for a query.
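For the set-based case, a minimal sketch follows. It assumes the standard information-retrieval definitions, precision = |retrieved ∩ relevant| / |retrieved| and recall = |retrieved ∩ relevant| / |relevant|. The function name and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved & relevant
    # Returning 0.0 for an empty set is a convention of this sketch.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four documents returned, four documents relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d4", "d8"}
p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Here three of the four returned documents are relevant and three of the four relevant documents are returned, so both measures come out at 0.75.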

[Figure: three retrieved sets shown against the relevant set: very high precision with very low recall; high recall but low precision; high precision and high recall.]
• The advantage of having two numbers is that, in many circumstances, one is more important than the other:
  • Web surfers would like every result on the first page to be relevant (i.e. high precision).
  • Professional searchers are more concerned with high recall and will tolerate low precision.

F-Measure: a single measure that takes into account both recall and precision. It is the harmonic mean of precision (P) and recall (R):
  F = 2PR / (P + R) = 2 / (1/R + 1/P)
• Compared to the arithmetic mean, both precision and recall must be high for the harmonic mean to be high.

E-Measure (parameterized F-Measure): a variant of the F-measure that trades off precision versus recall. It allows the emphasis to be weighted towards precision or towards recall:
  E = (1 + β²) PR / (β² P + R) = (1 + β²) / (β²/R + 1/P)
• The value of β controls the trade-off:
  • β = 1: equally weights precision and recall (E = F).
  • β > 1: weights recall more.
  • β < 1: weights precision more.
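The effect of the harmonic mean and of β can be checked numerically. The sketch below assumes the usual forms F = 2PR / (P + R) and E = (1 + β²)PR / (β²P + R). The function names and the sample precision/recall values are hypothetical.

```python
def e_measure(precision, recall, beta=1.0):
    """Parameterized F-measure; beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0  # convention of this sketch when the measure is undefined
    return (1.0 + b2) * precision * recall / denominator


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (E-measure with beta = 1)."""
    return e_measure(precision, recall, beta=1.0)


# Hypothetical system with high precision but low recall.
P, R = 0.9, 0.1
print(f"arithmetic mean = {(P + R) / 2:.3f}")
print(f"F               = {f_measure(P, R):.3f}")
print(f"E (beta = 2)    = {e_measure(P, R, beta=2.0):.3f}")
print(f"E (beta = 0.5)  = {e_measure(P, R, beta=0.5):.3f}")
```

The arithmetic mean of 0.9 and 0.1 is 0.5, while F is only 0.18: one low value is enough to pull the harmonic mean down. Because recall is the weak side here, weighting recall more (β = 2) lowers the score further, and weighting precision more (β = 0.5) raises it.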

Performance measures for ranked retrieval sets
• In a ranking context, appropriate sets of retrieved documents are given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a Precision-Recall curve.
• The Precision-Recall curve plots the trade-off between relevant and non-relevant items retrieved.
[Figure: precision (vertical axis, 0 to 1) versus recall (horizontal axis, 0 to 1). High precision, low recall: many relevant documents are returned but many other useful ones are missed. High recall, low precision: most relevant documents are returned, but also many non-relevant ones. Upper right corner: the ideal case.]
Slide content from J. Ghosh

Computing Precision-Recall points
• Precision-Recall plots are built as follows:
  • For each query, produce the ranked list of retrieved documents. Setting different thresholds on this ranked list results in different sets of retrieved documents, and therefore in different recall/precision values.
  • Mark each document in the ranked list that is relevant.
  • Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
Slide content from J. Ghosh

Example 1
• Let the total number of relevant documents be 6. Check each new recall point:
  R = 1/6 = 0.167; P = 1/1 = 1
  R = 2/6 = 0.333; P = 2/2 = 1
  R = 3/6 = 0.5; P = 3/4 = 0.75
  R = 4/6 = 0.667; P = 4/6 = 0.667
  R = 5/6 = 0.833; P = 5/13 = 0.38
• One relevant document is never retrieved, so the list does not reach 100% recall.
Slide from J. Ghosh

Example 2
• Let the total number of relevant documents be 6. Check each new recall point:
  R = 1/6 = 0.167; P = 1/1 = 1
  R = 2/6 = 0.333; P = 2/3 = 0.667
  R = 3/6 = 0.5; P = 3/5 = 0.6
  R = 4/6 = 0.667; P = 4/8 = 0.5
  R = 5/6 = 0.833; P = 5/9 = 0.556
  R = 6/6 = 1.0; P = 6/14 = 0.429
Slide from J. Ghosh
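The recall/precision pairs of both examples can be reproduced with a short sketch. The ranked lists themselves are not shown in the text, so the relevant ranks used below are inferred from the precision denominators (1, 2, 4, 6, 13 for Example 1 and 1, 3, 5, 8, 9, 14 for Example 2), and the list length of 14 is an assumption. The function names are hypothetical.

```python
def precision_recall_points(relevance, total_relevant):
    """Return a (recall, precision) pair for each rank that holds a relevant document."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


def flags(relevant_ranks, length):
    """Build a relevance list (True/False per rank) from 1-based relevant ranks."""
    return [rank in relevant_ranks for rank in range(1, length + 1)]


# Relevant ranks inferred from the examples; list length 14 is assumed.
example1 = precision_recall_points(flags({1, 2, 4, 6, 13}, 14), total_relevant=6)
example2 = precision_recall_points(flags({1, 3, 5, 8, 9, 14}, 14), total_relevant=6)

for name, points in (("Example 1", example1), ("Example 2", example2)):
    print(name, [(round(r, 3), round(p, 3)) for r, p in points])
```

Example 1 yields only five points because the sixth relevant document never appears in the ranked list.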

Interpolated Precision-Recall curves
• Precision-recall curves have a distinctive saw-tooth shape:
  • if the (k+1)th document retrieved is non-relevant, then recall is the same as for the top k documents, but precision drops;
  • if it is relevant, then both precision and recall increase, and the curve jags up and to the right.
• Interpolated precision is often useful to remove these jiggles. The interpolated precision at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r:
  p_int(r) = max_{r′ ≥ r} p(r′)
[Figure: interpolated precision at recall level r.]
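A minimal sketch of the interpolation rule, applied to the recall/precision values of Example 2. Returning 0.0 when no measured recall level reaches r is a convention assumed here, not something stated in the text. The function name is hypothetical.

```python
def interpolated_precision(points, r):
    """Highest precision over all measured (recall, precision) points with recall >= r."""
    candidates = [precision for recall, precision in points if recall >= r]
    return max(candidates) if candidates else 0.0  # assumed convention when recall r is never reached


# (recall, precision) pairs of Example 2.
points = [(1 / 6, 1 / 1), (2 / 6, 2 / 3), (3 / 6, 3 / 5),
          (4 / 6, 4 / 8), (5 / 6, 5 / 9), (6 / 6, 6 / 14)]

for r in (0.0, 0.5, 0.6, 1.0):
    print(f"p_int({r}) = {interpolated_precision(points, r):.3f}")
```

At r = 0.6 the raw precision at the next measured recall level (4/6) is 0.5, but the interpolated value is 0.556, because precision rises again at recall 5/6. This is how interpolation flattens the saw-tooth.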

• In order to obtain reliable performance measures, performance is averaged over a large set of queries:
  • Compute the average precision at each standard recall level across all queries.
  • Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.
[Figure: two precision versus recall plots, with precision values 1, 1, 0.75, 0.667, 0.38 and 1, 0.67, 0.6, 0.5, 0.556, 0.429.]
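Averaging over queries can be sketched as below. The text does not say which recall levels are the standard ones; this sketch assumes the common choice of 11 levels (0.0, 0.1, ..., 1.0) and uses interpolated precision at each level, with 0.0 when a query never reaches that recall. The two queries reuse the relevant ranks inferred from Examples 1 and 2. All function names are hypothetical.

```python
def pr_points(relevant_ranks, total_relevant):
    """(recall, precision) at each rank that holds a relevant document."""
    return [(k / total_relevant, k / rank)
            for k, rank in enumerate(sorted(relevant_ranks), start=1)]


def interpolated_precision(points, r):
    """Highest precision at any measured recall >= r (0.0 if none, by assumption)."""
    candidates = [p for recall, p in points if recall >= r]
    return max(candidates) if candidates else 0.0


def average_curve(per_query_points, levels=None):
    """Average the interpolated precision over queries at each recall level."""
    if levels is None:
        levels = [i / 10 for i in range(11)]  # assumed 11 standard recall levels
    n = len(per_query_points)
    return [(r, sum(interpolated_precision(pts, r) for pts in per_query_points) / n)
            for r in levels]


# Two queries, with relevant ranks inferred from Examples 1 and 2.
queries = [pr_points([1, 2, 4, 6, 13], 6), pr_points([1, 3, 5, 8, 9, 14], 6)]
curve = average_curve(queries)
for recall, precision in curve:
    print(f"recall {recall:.1f}: average precision {precision:.3f}")
```

Fixing the recall levels in advance is what makes the per-query curves comparable, since each query otherwise produces points at its own recall values.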

Comparing performance of two or more systems
• When the performance of two or more systems is compared, the curve closest to the upper right-hand corner of the graph indicates the best performance.
[Figure: precision versus recall curves for several systems; the curve closest to the upper right-hand corner is labelled "This system has the best performance".]
Slide from J. Ghosh

Other performance measures for ranked retrieval sets
• Average Precision (AP) is a typical performance measure used for ranked sets. It is defined as the average of the precision scores after each relevant item (true positive, TP) in the scope S. Given a scope S = 7 and a ranked list (gain vector) G = [1,1,0,1,1,0,0,1,1,0,1,0,0,..], where 1/0 indicate the gains associated with relevant/non-relevant items, respectively:
  AP = (1/1 + 2/2 + 3/4 + 4/5) / 4 = 0.8875
• Mean Average Precision (MAP): the average of the average precision values over a set of queries.
• Average Dynamic Precision (ADP) is also used. It is defined as the average of the precisions obtained with increasing scope S, with 1 ≤ S ≤ #relevant items. For the same G:
  ADP = (1 + 1 + 0.667 + 0.75 + 0.80 + 0.667 + 0.571) / 7 = 0.779
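The AP and ADP figures above can be checked with a short sketch. It follows the definitions as stated in the text: AP averages the precision at each relevant item within the scope, and ADP averages the precision at every scope from 1 up to a maximum scope, taken here as 7. The function names are hypothetical, and the gain vector is truncated at the 13 entries shown.

```python
def average_precision(gains, scope):
    """Average of the precision values at each relevant item within the first `scope` ranks."""
    hits = 0
    precisions = []
    for rank, gain in enumerate(gains[:scope], start=1):
        if gain:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def average_dynamic_precision(gains, max_scope):
    """Average of precision-at-S for S = 1..max_scope (assumes len(gains) >= max_scope)."""
    hits = 0
    total = 0.0
    for rank, gain in enumerate(gains[:max_scope], start=1):
        if gain:
            hits += 1
        total += hits / rank
    return total / max_scope


def mean_average_precision(gain_vectors, scope):
    """Mean of the AP values over a set of queries."""
    return sum(average_precision(g, scope) for g in gain_vectors) / len(gain_vectors)


G = [1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # gain vector from the text, truncated
ap = average_precision(G, scope=7)
adp = average_dynamic_precision(G, max_scope=7)
print(f"AP = {ap:.4f}, ADP = {adp:.3f}")
```

Note that AP skips the ranks holding non-relevant items, while ADP includes the precision at every rank, which is why the two values differ on the same list.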

• Other measures for ranked retrieval sets usually employed in benchmarks are the mean values of:
  • Recognition Rate: the total number of queries for which a relevant item is in the 2nd position of the ranked list, divided by the number of items in the dataset.
  • 1st tier and 2nd tier: the average number of relevant items retrieved in the first n and 2n positions of the ranked list, respectively (n = 7 is typically used in benchmarks).
  • Cumulated Gain (CG) at a particular rank position p: CG_p = Σ_{i=1}^{p} rel_i, where rel_i is the graded relevance of the result at position i (rank 5 is typically used).
  • Discounted Cumulated Gain (DCG) at a particular rank position p: highly relevant documents appearing lower in a search result list are penalized by reducing the graded relevance value logarithmically in proportion to the position of the result.
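A minimal sketch of CG and DCG follows. The text does not preserve the DCG formula, and more than one variant is in use, so the discount below, DCG_p = Σ_{i=1}^{p} rel_i / log2(i + 1), is an assumption of this sketch. The function names and the graded relevance values are hypothetical.

```python
import math


def cumulated_gain(rel, p):
    """Sum of the graded relevance values of the first p results."""
    return sum(rel[:p])


def discounted_cumulated_gain(rel, p):
    """Graded relevance discounted by log2(position + 1); this discount form is assumed."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(rel[:p], start=1))


# Hypothetical graded relevance of a ranked list (0 = not relevant, 3 = highly relevant).
rel = [3, 2, 3, 0, 1, 2]
cg = cumulated_gain(rel, p=5)
dcg = discounted_cumulated_gain(rel, p=5)
print(f"CG@5 = {cg}, DCG@5 = {dcg:.3f}")

# Moving the highly relevant items down the list leaves CG unchanged but lowers DCG.
worse_order = [0, 1, 2, 3, 3, 2]
print(f"CG@5 = {cumulated_gain(worse_order, 5)}, "
      f"DCG@5 = {discounted_cumulated_gain(worse_order, 5):.3f}")
```

CG ignores where in the top p a relevant result appears, whereas DCG rewards rankings that place highly relevant results first.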
