

  1. CROWDSOURCING Massimo Poesio Part 4: Dealing with crowdsourced data

  2. THE DATA • The result of crowdsourcing in whatever form is a mass of often inconsistent judgments • Need techniques for identifying reliable annotations and reliable annotators • In the Phrase Detectives context, to discriminate between genuine ambiguity and disagreements due to error

  3. THE ANDROCLES EXAMPLE

  4. SOME APPROACHES • Majority voting • But: it ignores the substantial differences in behavior between annotators • Alternatives: • Removing bad annotators, e.g., using clustering • Weighting annotators
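
A minimal sketch of majority voting in Python over hypothetical (item, annotator, label) triples; the item and label names are made up for illustration. It also shows the limitation noted above: every annotator counts the same, however unreliable.

    # Majority voting over crowdsourced labels (hypothetical data).
    # Every annotator gets equal weight, which is exactly the limitation
    # that annotator-weighting schemes try to address.
    from collections import Counter, defaultdict

    annotations = [
        ("item1", "ann1", "A"), ("item1", "ann2", "A"), ("item1", "ann3", "B"),
        ("item2", "ann1", "B"), ("item2", "ann2", "B"), ("item2", "ann3", "B"),
    ]

    labels_by_item = defaultdict(list)
    for item, annotator, label in annotations:
        labels_by_item[item].append(label)

    majority = {item: Counter(labels).most_common(1)[0][0]
                for item, labels in labels_by_item.items()}
    print(majority)  # {'item1': 'A', 'item2': 'B'}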

  5. SNOW ET AL

  6. SNOW ET AL: WEIGHTING ANNOTATORS

  7. LATENT MODELS OF ANNOTATION QUALITY • The problem of reaching a conclusion on the basis of judgments by separate experts that may often be in disagreement is a longstanding one in epidemiology • A number of techniques have been developed, including • Dawid and Skene 1979 (also used by Passonneau & Carpenter) • Latent Annotation model (Uebersax 1994) • Raykar et al 2010 • More recently, Carpenter (2008) has developed an explicit hierarchical Bayesian model

  8. DAWID AND SKENE 1979 • The model consists of a likelihood for • (1) annotations (labels from annotators) • (2) categories (true labels) for items given • (3) annotator accuracies and biases • (4) prevalence of labels • Frequentist estimation of 2–4 given 1 • Optional regularization of the estimates (for 3 and 4)

  9. A GENERATIVE MODEL OF THE ANNOTATION TASK • What all of these models do is to provide an EXPLICIT PROBABILISTIC MODEL of the observations in terms of annotators, labels, and items

  10. THE DATA • K: number of possible labels • J: number of annotators • I: number of items • N: total number of annotations of the I items produced by the J annotators • y_{i,j}: label produced for item i by coder j

  11. THE DATA: BY ITEM

  12. THE DATA: BY ANNOTATIONS

  13. THE ANNOTATION TABLE
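
Slides 11–13 presumably showed these views of the data as tables; a minimal Python sketch of the same idea, with hypothetical items, annotators, and labels, might look like this:

    # Long-format annotation table: one row per annotation y_{i,j}
    # (items, annotators, and labels are made up for illustration).
    import pandas as pd

    annotation_table = pd.DataFrame({
        "item":      [1, 1, 1, 2, 2, 3],
        "annotator": [1, 2, 3, 1, 3, 2],
        "label":     ["k1", "k1", "k2", "k2", "k2", "k1"],
    })

    # The data by item: the multiset of labels each item received.
    print(annotation_table.groupby("item")["label"].apply(list))
    # The data by annotator: the labels each coder produced.
    print(annotation_table.groupby("annotator")["label"].apply(list))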

  14. A GENERATIVE MODEL OF THE ANNOTATION TASK • The probabilistic model specifies the probability of a particular label on the basis of PARAMETERS specifying the behavior of the annotators, the prevalence of the labels, etc • In Bayesian models, these parameters are specified in terms of PROBABILITY DISTRIBUTIONS

  15. THE PARAMETERS OF THE MODEL • z_i: the ACTUAL category of item i • Θ_{j,k,k’}: ANNOTATOR RESPONSE • the probability that annotator j labels an item as k’ when it belongs to category k • π_k: PREVALENCE • The probability that an item belongs to category k

  16. DISTRIBUTIONS • Each of the parameters is characterized in terms of a PROBABILITY DISTRIBUTION • When we have some prior information about the data, these distributions can be used to encode it • E.g., annotators may all be equally good, or there may be a skew • Otherwise, just use defaults

  17. DISTRIBUTIONS • Prevalence of labels (PRIOR) • π ~ Dir(α) • Annotator j’s response to item of category k (PRIOR) • Θ_{j,k} ~ Dir(β_k) • True category of item i (LIKELIHOOD): • z_i ~ Categorical(π) • Label from j for item i (LIKELIHOOD): • y_{i,j} ~ Categorical(Θ_{j,z_i})
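
A minimal forward simulation of this generative model in Python/NumPy; the sizes (I = 100 items, J = 5 annotators, K = 3 labels) and the hyperparameters α and β_k are assumptions made only for illustration.

    # Forward simulation of the generative model on the slide above.
    import numpy as np

    rng = np.random.default_rng(0)
    I, J, K = 100, 5, 3
    alpha = np.ones(K)                        # hyperparameter of the prevalence prior
    beta = np.ones((K, K)) + 5 * np.eye(K)    # beta_k: favours roughly accurate annotators

    pi = rng.dirichlet(alpha)                            # pi ~ Dir(alpha)
    theta = np.array([[rng.dirichlet(beta[k])            # theta_{j,k} ~ Dir(beta_k)
                       for k in range(K)] for _ in range(J)])
    z = rng.choice(K, size=I, p=pi)                      # z_i ~ Categorical(pi)
    y = np.array([[rng.choice(K, p=theta[j, z[i]])       # y_{i,j} ~ Categorical(theta_{j,z_i})
                   for j in range(J)] for i in range(I)])
    print(y.shape)                                       # (I, J): one label per item and annotator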

  18. TYPES OF ANNOTATORS: SPAMMY (RESPONSE TO ALL ITEMS THE SAME)

  19. TYPES OF ANNOTATORS: BIASED (HAS SKEW IN RESPONSE – COMMON IN LOW PREVALENCE DATA)

  20. QUICK INTRO TO DIRICHLET • The Dirichlet is often seen in Bayesian models (e.g., Latent Dirichlet Allocation, LDA) because it is a CONJUGATE PRIOR of the MULTINOMIAL distribution

  21. BINOMIAL AND MULTINOMIAL

  22. CONJUGATE PRIOR • In Bayesian inference the objective is to compute a POSTERIOR on the basis of a LIKELIHOOD and a PRIOR • A CONJUGATE PRIOR of a distribution D is a distribution such that, if it is used as the prior, the posterior has the same form • E.g., ‘Dirichlet is a conjugate prior of the multinomial’ means that if the likelihood is a multinomial and the prior is a Dirichlet, then the posterior is also a Dirichlet.
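
A small numeric illustration of this conjugacy, with made-up counts: a Dir(α) prior over three labels combined with multinomial counts n yields a Dir(α + n) posterior.

    # Dirichlet-multinomial conjugacy (illustrative numbers).
    import numpy as np

    alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior over 3 labels
    counts = np.array([12, 3, 5])       # observed label counts (multinomial likelihood)

    posterior_alpha = alpha + counts    # the posterior is again Dirichlet: Dir(alpha + counts)
    posterior_mean = posterior_alpha / posterior_alpha.sum()
    print(posterior_alpha)              # [13.  4.  6.]
    print(posterior_mean)               # approx. [0.565 0.174 0.261]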

  23. DIRICHLET DISTRIBUTION

  24. CATEGORICAL • The categorical distribution is the generalization to K outcomes of the Bernoulli distribution, which specifies the probability of a given outcome of a single binary trial • E.g., the probability of getting a head on one coin toss • Cf. the BINOMIAL distribution, which specifies the probability of getting N heads
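
A tiny numeric contrast of the two, with an assumed fair coin (SciPy is used only for the binomial pmf):

    # Bernoulli/categorical: probability of one particular outcome of a single trial.
    # Binomial: probability of a count of successes over several trials.
    from scipy.stats import binom

    p_heads = 0.5
    print(p_heads)                     # P(a single toss comes up heads) = 0.5
    print(binom.pmf(3, n=10, p=0.5))   # P(exactly 3 heads in 10 tosses) ≈ 0.117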

  25. A GRAPHICAL VIEW OF THE MODEL

  26. THE PROBABILISTIC MODEL OF A GIVEN LABEL

  27. AN EXAMPLE

  28. PROBABILISTIC INFERENCE • Probabilistic inference techniques are used to INFER the parameters of the model from the observed data • Often: Expectation Maximization (EM) • The EM implementation in R used by Carpenter & Passonneau to estimate the parameters is available from • https://github.com/bob-carpenter/anno
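
A compact, illustrative EM sketch in Python/NumPy in the spirit of Dawid and Skene, using the notation from the slides above; it assumes a complete annotation table (every item labelled by every annotator) and is not the R implementation linked from the slide.

    # Illustrative Dawid & Skene-style EM (not the cited R package).
    # y: I x J array of labels in 0..K-1.
    import numpy as np

    def dawid_skene_em(y, K, n_iter=50, smooth=0.01):
        I, J = y.shape
        # Initialise the per-item class posteriors q[i, k] from majority voting.
        q = np.zeros((I, K))
        for i in range(I):
            for j in range(J):
                q[i, y[i, j]] += 1
        q /= q.sum(axis=1, keepdims=True)

        for _ in range(n_iter):
            # M-step: prevalence pi_k and annotator response matrices theta[j, k, k'].
            pi = q.sum(axis=0) + smooth
            pi /= pi.sum()
            theta = np.full((J, K, K), smooth)     # small pseudo-count = optional regularization
            for j in range(J):
                for i in range(I):
                    theta[j, :, y[i, j]] += q[i]
                theta[j] /= theta[j].sum(axis=1, keepdims=True)
            # E-step: posterior over true classes given the current parameters.
            log_q = np.tile(np.log(pi), (I, 1))
            for i in range(I):
                for j in range(J):
                    log_q[i] += np.log(theta[j, :, y[i, j]])
            q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
            q /= q.sum(axis=1, keepdims=True)
        return q, pi, theta

Run on the y simulated in the earlier sketch, q.argmax(axis=1) gives estimates of the true labels z, and theta gives the estimated per-annotator response matrices.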

  29. APPLICATION TO WORD SENSE DISTRIBUTION (CARPENTER & PASSONNEAU, 2013, 2014) • Carpenter and Passonneau used the Dawid and Skene model to compare manual annotators with Turkers on the word sense annotation of the MASC corpus

  30. THE MASC corpus • Manually annotated subcorpus (MASC) • 500K word subset of Open American National Corpus (OANC) • Multiple genres: technical manuals, poetry, news, dialogue, etc. • 16 types of annotation (not all manual) • part of speech, phrases, word sense, named entity, ... • 100 item word-sense corpus • balanced by genre and part-of-speech (noun, verb, adjective)

  31. MASC WORDSENSE • 100 words balanced between adjs, nouns, & verbs • 1000 sentences for each word • Annotated using WordNet senses for these words • ~ 1M tokens

  32. MASC Wordsense: annotation using trained annotators • pre-training on 50 items • independent labeling of 1000 items • 100 items labeled by 3 or 4 annotators • agreement on these 100 items reported • only a single round of annotation, most items singly annotated

  33. Annotation using trained annotators • College students from Vassar, Barnard, Columbia • 2–3 years of work on the project • General training plus per-word training • Supervised by • Becky Passonneau • Nancy Ide (maintainer of MASC) • Christiane Fellbaum (maintainer of WordNet)

  34. Annotation using crowdsourcing • 45 randomly selected words balanced across nouns, verbs, and adjectives were reannotated using crowdsourcing • 1000 instances per word • 25+ annotators per instance • high number of annotators to • estimate difficulty • reject independence of labels

  35. Differences from trained situation • Annotators not trained • Not told to look at WordNet • Each HIT: • 10 sentences for the same word • WordNet senses listed under the word

  36. METHODS • Passonneau & Carpenter used their model to • Evaluate prevalence of labels in different ways • Evaluate annotator response

  37. PREVALENCE ESTIMATION

  38. ASSESSMENT OF QUALITY

  39. ANNOTATOR RESPONSE

  40. AGREEMENT RATES

  41. OTHER MODELS • Raykar et al, 2010 • Carpenter, 2008

  42. RAYKAR ET AL 2010 • Simultaneously ESTIMATES THE GROUND TRUTH from noisy labels, produces an ASSESSMENT OF THE ANNOTATORS, and LEARNS A CLASSIFIER • Based on logistic regression • Bayesian (includes priors on the annotators)

  43. ANNOTATORS • Annotator j is characterized by her/his • SENSITIVITY: the ability to recognize positive cases • α_j = P(y_j=1|y=1) • SPECIFICITY: the ability to recognize negative cases • β_j = P(y_j=0|y=0)
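
A small worked example of the two quantities on hypothetical binary labels; in Raykar et al the true labels are of course unknown and estimated jointly with the annotator parameters, so gold labels are assumed here purely to illustrate the definitions.

    # Sensitivity and specificity of one annotator (hypothetical data).
    import numpy as np

    y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # assumed gold labels
    y_j    = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])   # annotator j's labels

    sensitivity = np.mean(y_j[y_true == 1] == 1)   # alpha_j = P(y_j=1 | y=1) = 3/4
    specificity = np.mean(y_j[y_true == 0] == 0)   # beta_j  = P(y_j=0 | y=0) = 5/6
    print(sensitivity, specificity)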

  44. RAYKAR ET AL Raykar et al propose a version of the EM algorithm that can be used to estimate P(O|θ) as well as the sensitivity and specificity of each annotator. Carpenter developed a fully Bayesian version of the approach based on gradient descent

  45. CARPENTER

  46. DISAGREEMENT IN INTERPRETATION

  47. AMBIGUITY: REFERENT
15.12 M: we’re gonna take the engine E3
15.13 : and shove it over to Corning
15.14 : hook [it] up to [the tanker car]
15.15 : _and_
15.16 : send it back to Elmira
(from the TRAINS-91 dialogues collected at the University of Rochester)

  48. AMBIGUITY: REFERENT About 160 workers at a factory that made paper for the Kent filters were exposed to asbestos in the 1950s. Areas of the factory were particularly dusty where the crocidolite was used. Workers dumped large burlap sacks of the imported material into a huge bin, poured in cotton and acetate fibers and mechanically mixed the dry fibers in a process used to make filters. Workers described "clouds of blue dust" that hung over parts of the factory, even though exhaust fans ventilated the area.

  49. AMBIGUITY: EXPLETIVES 'I beg your pardon!' said the Mouse, frowning, but very politely: 'Did you speak?' 'Not I!' said the Lory hastily. 'I thought you did,' said the Mouse. '--I proceed. "Edwin and Morcar, the earls of Mercia and Northumbria, declared for him: and even Stigand, the patriotic archbishop of Canterbury, found it advisable--"' 'Found WHAT?' said the Duck. 'Found IT,' the Mouse replied rather crossly: 'of course you know what "it" means.'

  50. OTHER DATA: WORDSENSE DISAMBIGUATION (Passonneau et al 2010) And our ideas of what constitutes a FAIR wage or a FAIR return on capital are historically contingent … {sense1, sense1, sense1, sense2, sense2, sense2} … the federal government … is wrangling for its FAIR share of the dividend … {sense1, sense1, sense2, sense2, sense8, sense8}
