
Adjudicator Agreement and System Rankings for Person Name Search




  1. Adjudicator Agreement and System Rankings for Person Name Search Mark Arehart, Chris Wolf, Keith Miller The MITRE Corporation {marehart, cwolf, keith}@mitre.org

  2. Summary
  - Matching multicultural name variants is knowledge intensive
  - Building a ground truth dataset requires tedious adjudication
  - Guidelines are not comprehensive; adjudicators often disagree
  - Previous evaluations: multiple adjudication with voting
  - Results of this study: agreement is high, and multiple adjudication is not needed
  - “Nearly” the same payoff for much less effort

  3. Dataset
  - Watchlist: ~71K records from deceased persons lists, mixed cultures
    - 1.1K variants for 404 base names; average 2.8 variants per base record
  - Queries: 700 total (404 base names + 296 randomly selected from the watchlist)
    - Subset of 100 randomly selected for this study

  4. Method
  - Adjudication pools built as in TREC: pool the output of 13 algorithms
  - Four judges adjudicated the complete pools (1,712 pairs, excluding exact matches)
  - Compare system rankings under different versions of the ground truth
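
  As a rough illustration of the pooling step, here is a minimal Python sketch: the pool is simply the union of the candidate pairs proposed by all systems, minus exact matches, which need no human judgment. The `system_results` structure, system names, and example pairs are invented for illustration, not taken from the paper.

```python
# TREC-style adjudication pool: union of the candidate pairs proposed
# by every system, with exact string matches dropped.
def build_pool(system_results):
    pool = set()
    for pairs in system_results.values():
        pool.update(pairs)
    # Exact matches need no adjudication, so exclude them.
    return {(query, cand) for (query, cand) in pool if query != cand}

# Hypothetical output of two of the matching algorithms.
system_results = {
    "soundex":   {("Abd al-Rahman", "Abdurrahman"), ("Smith", "Smith")},
    "edit_dist": {("Abd al-Rahman", "Abd al Rahman"), ("Smith", "Smyth")},
}
print(sorted(build_pool(system_results)))
```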

  5. Adjudicator Agreement Measures
  For each pair of judges, tabulate their decisions: a = both say match, b and c = exactly one says match, d = both say no match. Then:
  - overlap = a / (a + b + c)
  - p+ = 2a / (2a + b + c)
  - p- = 2d / (2d + b + c)
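
  A minimal sketch of the three measures, computed from the two-judge contingency table defined above; the counts below are invented for illustration (they sum to the 1,712 adjudicated pairs).

```python
# Agreement measures from slide 5, from a 2x2 table of two judges' calls:
# a = both "match", b/c = exactly one "match", d = both "no match".
def agreement_measures(a, b, c, d):
    overlap = a / (a + b + c)          # Jaccard-style overlap of positive calls
    p_pos = 2 * a / (2 * a + b + c)    # positive-specific agreement (p+)
    p_neg = 2 * d / (2 * d + b + c)    # negative-specific agreement (p-)
    return overlap, p_pos, p_neg

print(agreement_measures(a=100, b=40, c=35, d=1537))
```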

  6. Adjudicator Agreement
  - Lowest pairwise agreement: judges A~B, kappa = 0.57
  - Highest pairwise agreement: judges C~D, kappa = 0.78
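
  Assuming the kappa reported here is Cohen's kappa for two raters (the transcript does not say), it can be computed from the same 2x2 counts; the numbers are again invented and happen to give ~0.70, inside the reported 0.57-0.78 range.

```python
# Cohen's kappa from the same 2x2 counts (a, b, c, d) as above.
def cohen_kappa(a, b, c, d):
    n = a + b + c + d
    p_obs = (a + d) / n                     # observed agreement
    p_yes = ((a + b) / n) * ((a + c) / n)   # chance "match"/"match"
    p_no = ((c + d) / n) * ((b + d) / n)    # chance "no match"/"no match"
    p_exp = p_yes + p_no
    return (p_obs - p_exp) / (1 - p_exp)

print(round(cohen_kappa(a=100, b=40, c=35, d=1537), 2))  # ~0.70
```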

  7. So far…
  - Test watchlist and query list
  - Results from 13 algorithms
  - Adjudications by 4 volunteers
  - Ways of compiling alternate ground truth sets
  Still need…

  8. Comparing System Rankings
  Two complete rankings of systems A–E (e.g. A, B, C, D, E vs. A, C, B, E, D): how similar are they?
  - Kendall’s tau
  - Spearman’s rank correlation
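
  A quick sketch of both correlations on the two example rankings from this slide, using SciPy (an assumption; the transcript does not say what software was used).

```python
# Rank correlations for the two example rankings on this slide.
from scipy.stats import kendalltau, spearmanr

ranking_1 = ["A", "B", "C", "D", "E"]
ranking_2 = ["A", "C", "B", "E", "D"]

# Position of each system in each ranking.
ranks_1 = [ranking_1.index(s) for s in "ABCDE"]
ranks_2 = [ranking_2.index(s) for s in "ABCDE"]

tau, _ = kendalltau(ranks_1, ranks_2)
rho, _ = spearmanr(ranks_1, ranks_2)
print(tau, rho)  # 0.6 and 0.8: two adjacent swaps
```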

  9. Significance Testing
  - Not all differences are significant (duh)
  - F1-measure: harmonic mean of precision & recall
  - Not a proportion or a mean of independent observations, so not amenable to traditional significance tests (like other IR measures, e.g. MAP)
  - Bootstrap resampling: sample with replacement from the data and compute the difference for many trials, producing a distribution of differences
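
  A minimal sketch of the bootstrap procedure, assuming per-pair outcomes are stored as (gold, system_x, system_y) tuples with 1 = match; the data and the 95% interval check are illustrative, not the paper's exact protocol.

```python
# Bootstrap significance test for the F1 difference between two systems.
import random

def f1(pairs, col):
    tp = sum(1 for g, *s in pairs if g == 1 and s[col] == 1)
    fp = sum(1 for g, *s in pairs if g == 0 and s[col] == 1)
    fn = sum(1 for g, *s in pairs if g == 1 and s[col] == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def bootstrap_diff(pairs, trials=1000, seed=0):
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        sample = rng.choices(pairs, k=len(pairs))  # resample with replacement
        diffs.append(f1(sample, 0) - f1(sample, 1))
    diffs.sort()
    # Difference is significant at 0.05 if 0 lies outside this 95% interval.
    return diffs[int(0.025 * trials)], diffs[int(0.975 * trials)]

# Fabricated judged pairs, repeated to give a reasonable sample size.
judged = [(1, 1, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0), (0, 0, 0)] * 40
print(bootstrap_diff(judged))
```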

  10. Incomplete Ranking
  Not all differences are significant → partial ordering
  [Figure: two partial orderings of systems A–E with ties where differences are not significant, e.g. A > {B = C} > {D = E} vs. B > A > C > {D = E}]

  11. Evaluation Statements
  Ordering 1 (A > {B = C} > {D = E}): A>B, A>C, A>D, A>E, B=C, B>D, B>E, C>D, C>E, D=E
  Ordering 2 (B > A > C > {D = E}): A<B, A>C, A>D, A>E, B>C, B>D, B>E, C>D, C>E, D=E

  12. Similarity
  - n systems → n(n-1)/2 evaluation statements
  - Reversal rate: proportion of strictly reversed relations; here only A>B vs. A<B, so 1 of 10 = 10%
  - Total disagreement: any differing relation, including tie vs. non-tie (A>B vs. A<B, B=C vs. B>C), so 2 of 10 = 20%
  - Sensitivity: proportion of relations with a significant difference; 80% for ordering 1, 90% for ordering 2
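
  These three quantities are easy to compute directly; a sketch using the ten relations from slide 11.

```python
# Reversal rate, total disagreement, and sensitivity for the two
# statement sets from slide 11.
set_1 = {"AB": ">", "AC": ">", "AD": ">", "AE": ">", "BC": "=",
         "BD": ">", "BE": ">", "CD": ">", "CE": ">", "DE": "="}
set_2 = {"AB": "<", "AC": ">", "AD": ">", "AE": ">", "BC": ">",
         "BD": ">", "BE": ">", "CD": ">", "CE": ">", "DE": "="}

n_rel = len(set_1)
reversals = sum({set_1[p], set_2[p]} == {">", "<"} for p in set_1)
disagreements = sum(set_1[p] != set_2[p] for p in set_1)
print(reversals / n_rel)      # 0.1 -> 10% reversal rate
print(disagreements / n_rel)  # 0.2 -> 20% total disagreement
for s in (set_1, set_2):
    print(sum(r != "=" for r in s.values()) / n_rel)  # 0.8, then 0.9
```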

  13. Comparisons With Baseline
  - No reversals except with the intersection GT (one algorithm affected)
  - [Chart: agreement with consensus for each GT version, highest and lowest marked; the lowest is notably low]

  14. GT Comparisons

  15. Comparison With Random
  - 1000 GT versions created by randomly selecting a judge
  - Consensus sensitivity = 74.4%; average random sensitivity = 72.9% (significant difference at 0.05)
  - Average disagreement with consensus = 7.3%: 5% disagreement expected (actually more), with the 2.3% remainder (actually less) attributable to the GT method
  - No reversals in any of the 1000 sets
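
  A sketch of how the 1000 random GT versions might be generated, assuming "randomly selecting a judge" means keeping one randomly chosen judge's label for each pooled pair; the data structure, pairs, and labels are invented.

```python
# One way to build the 1000 random GT versions: for every pooled pair,
# keep the label of one randomly chosen judge.
import random

def random_gt(judgments, seed):
    rng = random.Random(seed)
    return {pair: rng.choice(labels) for pair, labels in judgments.items()}

# Hypothetical per-pair labels from the four judges (1 = match).
judgments = {
    ("Abd al-Rahman", "Abdurrahman"): [1, 1, 1, 0],
    ("Smith", "Smyth"): [1, 0, 1, 1],
}
versions = [random_gt(judgments, seed) for seed in range(1000)]
print(versions[0])
```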

  16. Conclusion
  - Multiple adjudicators judging everything → expensive
  - Single adjudicator → variability in sensitivity
  - Multiple adjudicators randomly dividing the pool: slightly less sensitivity, no reversals of results, much less labor
  - Individual differences wash out, approximating the consensus
  - Practically the same result for less effort
