
Indexing Consistency Across Multiple Indexers/Taggers


Presentation Transcript


  1. Indexing Consistency Across Multiple Indexers/Taggers Dietmar Wolfram, Hope A. Olson & Raina Bloom University of Wisconsin-Milwaukee SIS Research Forum October 14, 2008

  2. Indexing Consistency • Indexing key to retrieval • Consistency deemed essential for effective retrieval • People (including indexers) interpret content differently • Typically agreement on core topics, but with wide dispersion • Same is true for newer tagging environments

  3. Previous Research • Long history of consistency research with small number of indexers • Olson & Wolfram (2006) mapped distribution of multiple indexers’ terms (n=33) • Diverged into two paths: • Co-occurrences and syntagmatic relationships (JDoc forthcoming) • Measuring consistency

  4. Measuring consistency • Medelyan & Witten (2006) have proposed a measure based on the cosine measure of similarity used in IR, but uses controlled vocabularies • Wolfram & Olson (2007) introduced vector-based Inter-indexer Consistency Density (ICD) of multiple indexers (n=64)

  5. 2006 Study: Distribution of Assigned Term Frequency • Strong inverse relationship; not necessarily Zipfian • Similar relationship seen for co-occurrence of assigned terms • 89% of co-occurring term pairs appear only once • [Chart: Distribution of Student-assigned Terms]

  6. 2007 Study: Vector Technique • Tested on student indexing data • t-test outcome (assuming unequal variances): t = 0.7288, p = 0.471 (α = .05)
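A comparison of this kind corresponds to a Welch's t-test (a two-sample t-test that does not assume equal variances). The sketch below uses SciPy; the array names and values are illustrative only, not the study's data.

```python
import numpy as np
from scipy import stats

# Illustrative density values for two groups (made-up numbers,
# not the study's data).
group_a = np.array([0.42, 0.38, 0.51, 0.47, 0.40])
group_b = np.array([0.45, 0.36, 0.49, 0.44, 0.41])

# Welch's t-test: equal_var=False matches the slide's
# "assuming unequal variances" note.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.4f}, p = {p_value:.3f}")  # compare p against alpha = .05
```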

  7. Study Purpose • To apply the indexing consistency method developed by Wolfram & Olson (2007) to a large data set • To determine whether vocabulary usage in an emerging area differs significantly from that in established areas, as measured by inter-tagger consistency across documents in different fields

  8. Measures of Inter-Indexer Consistency • Usually only permit comparisons of two indexers (or a few more) • Hooper (1965): H(I1, I2) = C / (A + B − C) • Rolling (1981): R(I1, I2) = 2C / (A + B) • Where A and B are the sizes of I1's and I2's term sets, and C is the number of terms they have in common
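Both traditional measures are straightforward to compute from two indexers' term sets. A minimal sketch, with invented term sets for illustration:

```python
def hooper(terms_1: set, terms_2: set) -> float:
    """Hooper (1965): C / (A + B - C), where A and B are the set sizes
    and C is the number of terms the two indexers share."""
    c = len(terms_1 & terms_2)
    return c / (len(terms_1) + len(terms_2) - c)

def rolling(terms_1: set, terms_2: set) -> float:
    """Rolling (1981): 2C / (A + B)."""
    c = len(terms_1 & terms_2)
    return 2 * c / (len(terms_1) + len(terms_2))

# Toy example: two indexers' term sets for the same document.
i1 = {"indexing", "consistency", "retrieval"}
i2 = {"indexing", "tagging", "retrieval", "folksonomy"}
print(hooper(i1, i2))   # 2 / (3 + 4 - 2) = 0.4
print(rolling(i1, i2))  # 2*2 / (3 + 4) ≈ 0.571
```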

  9. Simplifying the Process by Applying IR Modelling • See the CAIS 07 presentation (Wolfram & Olson) for an introduction to the concept • Indexing is central to IR theory and models (e.g., the vector space model) • Usually, the document is the focal point • The same principles can be applied to indexers & taggers, who then become the focal point

  10. Defining an Indexer/Tagger Space • Traditional vector space model: documents d1, d2, …, dm represented as term vectors • The same approach can be applied to a multiple-indexer environment, where documents = indexers
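A minimal sketch of this swap, assuming binary term weights: each tagger, rather than each document, becomes a row vector over the tag vocabulary. The tagger names and tags below are invented for illustration.

```python
import numpy as np

# Toy tag assignments: each tagger's tags for one document.
taggers = {
    "I1": ["indexing", "retrieval", "consistency"],
    "I2": ["indexing", "tagging", "retrieval"],
    "I3": ["tagging", "folksonomy"],
}

# Build the vocabulary (the dimensions of the space) and one vector per
# tagger, exactly as documents are represented in the vector space model.
vocab = sorted({t for tags in taggers.values() for t in tags})
matrix = np.array([[1.0 if term in tags else 0.0 for term in vocab]
                   for tags in taggers.values()])
print(vocab)
print(matrix)  # rows = taggers, columns = tags (binary weights assumed)
```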

  11. Indexing/Tagging Space: Calculating Distances [Figure: taggers I1, I2, I3 plotted in the space, with distances Dist(I1,C), Dist(I2,C), Dist(I3,C) measured from each tagger to the centroid C]

  12. Document Space vs. Indexer / Tagger Space Characteristics • Characteristic of overall space measured as a density using inter-document/tagger distances • Document space • Low density space => easier to distinguish documents => better for retrieval • Indexer/Tagger space • The opposite is desirable • High density space => more similarity & higher consistency

  13. Calculating Inter-Indexer (Tagger) Consistency Density [Formula not reproduced in the transcript], where m is the number of indexers/taggers
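The formula itself did not survive the transcript, so the sketch below shows only one plausible reading: distances are measured from each of the m tagger vectors to their centroid, and the density increases as those distances shrink. The reciprocal-of-mean-distance definition is an assumption for illustration, not the published ICD formula.

```python
import numpy as np

def centroid_distances(matrix: np.ndarray) -> np.ndarray:
    """Euclidean distance from each tagger vector (row) to the centroid."""
    centroid = matrix.mean(axis=0)
    return np.linalg.norm(matrix - centroid, axis=1)

def consistency_density(matrix: np.ndarray) -> float:
    """Illustrative density: reciprocal of the mean tagger-to-centroid
    distance, so tighter clusters (more consistent tagging) score higher.
    Assumption only -- the published ICD may normalize differently."""
    m = matrix.shape[0]  # m = number of indexers/taggers
    mean_dist = centroid_distances(matrix).sum() / m
    return 1.0 / mean_dist if mean_dist > 0 else float("inf")

# Reusing the toy tagger-by-term matrix from the earlier sketch:
matrix = np.array([[1., 1., 1., 0., 0.],
                   [1., 0., 1., 1., 0.],
                   [0., 0., 0., 1., 1.]])
print(centroid_distances(matrix))
print(consistency_density(matrix))
```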

  14. Applying the ICD Measure to a Large Dataset • Used tagging data available from CiteULike (www.citeulike.org) • 800,000 tagged documents • 29,000 taggers • Identified scholarly documents that have been tagged by a large number of taggers, which served as the basis for the comparison • Viable documents were categorized into 3 topical areas • Average ICDs for groups of documents compared across the topical areas

  15. Data Characteristics • Less than .03% of articles have been tagged by at least 10 people • ~ 2/3rds of highly tagged documents represent spam (e.g., links to commercial websites) • 78 viable articles tagged by at least 20 taggers were identified • 3 subject areas were identified • Science • Social Science • Social Software

  16. Potential Challenge for Comparing Outcomes • Densities are influenced by distances in the tagger space • Distances are influenced by the dimensionality of the space • Dimensionality is influenced by number of unique tags and taggers • Therefore, outcomes should first be checked for significant correlations

  17. Relationship between Taggers and Density Outcomes • Pearson's r = 0.033, p = .782 • Therefore, dimensionality is not an issue for the comparison
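Such a check amounts to a Pearson correlation between the number of taggers per document and that document's density outcome. A minimal SciPy sketch with made-up values:

```python
import numpy as np
from scipy import stats

# Illustrative values only: taggers per document vs. density outcome.
n_taggers = np.array([22, 25, 31, 40, 28, 35, 20, 27])
densities = np.array([0.41, 0.39, 0.44, 0.40, 0.43, 0.38, 0.42, 0.40])

r, p = stats.pearsonr(n_taggers, densities)
print(f"Pearson's r = {r:.3f}, p = {p:.3f}")
# A small, non-significant r (as reported on the slide) indicates that
# dimensionality effects are not driving the density comparisons.
```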

  18. Analyzing the Data • Outliers were removed • 74 documents remained • 1-way ANOVA (parametric & non-parametric equivalent) used to compare average densities across the three topic areas
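Both tests are available in SciPy. A minimal sketch with made-up per-document densities for the three topic areas:

```python
from scipy import stats

# Hypothetical per-document densities grouped by topic area
# (illustrative values only, not the study's data).
science         = [0.41, 0.39, 0.44, 0.38, 0.42]
social_science  = [0.40, 0.43, 0.37, 0.41, 0.39]
social_software = [0.42, 0.40, 0.38, 0.44, 0.41]

# Parametric one-way ANOVA across the three topic areas...
f_stat, p_anova = stats.f_oneway(science, social_science, social_software)
# ...and its non-parametric equivalent, the Kruskal-Wallis test.
h_stat, p_kw = stats.kruskal(science, social_science, social_software)

print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_kw:.3f}")
```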

  19. Descriptive Statistics

  20. ANOVA Outcome [Tables not reproduced: parametric ANOVA and non-parametric Kruskal-Wallis test results]

  21. Discussion • No significant differences in average density outcomes • Therefore, no significant difference in vocabulary usage • Could it be a reflection of the tagger population? • Investigated topic areas are closely related, so differences might not be apparent • Limited to what is being tagged => most items related to social software and allied science & social science areas

  22. Research Limitations • Only takes distances into account, not semantics or contexts • Different sets of terms with similar tagging specificity and exhaustivity patterns will result in similar densities • Method can be computationally more challenging than traditional approaches • But with more taggers this is to be expected

  23. Research Limitations • Findings are only as good as the data • Spam is common • Tagger motivation and intentions for contributing documents and tags may differ

  24. Conclusion • ICD method is viable and usable even on larger datasets • Vocabulary consistency does not appear to be significantly different across the three broad topic areas • Future research will examine further applications of the regularities found
