
Data Annotation for Classification



Presentation Transcript


  1. Data Annotation for Classification

  2. Prediction • Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables) • Which students are off-task? • Which students will fail the class?

  3. Classification • Develop a model which can infer a categorical predicted variable from some combination of other aspects of the data • Which students will fail the class? • Is the student currently gaming the system? • Which type of gaming the system is occurring?

  4. We will… • We will go into detail on classification methods tomorrow

  5. In order to use prediction methods • We need to know what we’re trying to predict • And we need to have some labels of it in real data

  6. For example… • If we want to predict whether a student using educational software is off-task, or gaming the system, or bored, or frustrated, or going to fail the class… • We need to first collect some data • And within that data, we need to be able to identify which students are off-task (or the construct of interest), and ideally when

  7. So we need to label some data • We need to obtain outside knowledge to determine what the value is for the construct of interest

  8. In some cases • We can get a gold-standard label • For instance, if we want to know if a student passed a class, we just go ask their instructor

  9. But for behavioral constructs… • There’s no one to ask • We can’t ask the student (self-presentation) • There’s no gold-standard metric • So we use data labeling methods or observation methods • (e.g. quantitative field observations, video coding) • To collect bronze-standard labels • Not perfect, but good enough

  10. One such labeling method • Text replay coding

  11. Text replays • Pretty-prints of student interaction behavior from the logs

  12. Examples

  13. Sampling • You can set up any sampling schema you want, if you have enough log data • 5-action sequences • 20-second sequences • Every behavior on a specific skill, but other skills omitted

  14. Sampling • Equal number of observations per lesson • Equal number of observations per student • Observations that machine learning software needs help to categorize (“biased sampling”)
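
As a concrete illustration of one such schema, here is a minimal sketch that cuts each student's log into non-overlapping 5-action clips and draws an equal number of clips per student. The file name and column names ("tutor_log.csv", "student", "timestamp", "action") are assumptions for illustration, not the format of any particular tutor's logs.

```python
import random
import pandas as pd

# Hypothetical log format: one row per student action,
# with columns "student", "timestamp", and "action".
log = pd.read_csv("tutor_log.csv").sort_values(["student", "timestamp"])

CLIP_LEN = 5            # actions per text replay clip
CLIPS_PER_STUDENT = 10  # equal number of observations per student

clips = []
for student, actions in log.groupby("student"):
    # Cut this student's log into non-overlapping 5-action sequences
    chunks = [actions.iloc[i:i + CLIP_LEN]
              for i in range(0, len(actions) - CLIP_LEN + 1, CLIP_LEN)]
    # Keep an equal number per student; skip students with too little data
    if len(chunks) >= CLIPS_PER_STUDENT:
        clips.extend(random.sample(chunks, CLIPS_PER_STUDENT))

print(f"Sampled {len(clips)} clips for coding")
```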

  15. Major Advantages • Both video and field observations hold some risk of observer effects • Text replays are based on logs that were collected completely unobtrusively

  16. Major Advantages • Blazing fast to conduct • 8 to 40 seconds per observation

  17. Notes • Decent inter-rater reliability is possible (Baker, Corbett, & Wagner, 2006; Baker, Mitrovic, & Mathews, 2010; Sao Pedro et al., 2010; Montalvo et al., 2010) • Agree with other measures of constructs (Baker, Corbett, & Wagner, 2006) • Can be used to train machine-learned detectors (Baker & de Carvalho, 2008; Baker, Mitrovic, & Mathews, 2010; Sao Pedro et al., 2010)

  18. Major Limitations • Limited range of constructs you can code • Gaming the System – yes • Collaboration in online chat – yes (Prata et al., 2008) • Frustration, Boredom – sometimes • Off-Task Behavior outside of software – no • Collaborative Behavior outside of software – no

  19. Major Limitations • Lower precision (because of the lower bandwidth of observation)

  20. Hands-on exercise

  21. Find a partner • Could be your project team-mate, but doesn’t have to be • You will do this exercise with them

  22. Get a copy of the text replay software • On your flash drive • Or at http://www.joazeirodebaker.net/algebra-obspackage-LSRM.zip

  23. Skim the instructions • At Instructions-LSRM.docx

  24. Log into text replay software • Using exploratory login • Try to figure out what the student’s behavior means, with your partner • Do this for ~5 minutes

  25. Now pick a category you want to code • With your partner

  26. Now code data • According to your coding scheme • (is-category versus is-not-category) • Separate from your partner • For 20 minutes

  27. Now put your data together • Using the observations-NAME files you obtained • Make a table (in Excel?) showing each observation's code from you and from your partner

  28. Now… • We can compute your inter-rater reliability… (also called agreement)

  29. Agreement/ Accuracy • The easiest measure of inter-rater reliability is agreement, also called accuracy • Agreement = (# of agreements) / (total number of codes)
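
As a sketch, agreement/accuracy can be computed directly from the two coders' label lists. The labels below are hypothetical, just to show the arithmetic.

```python
def agreement(codes_a, codes_b):
    """Agreement/accuracy: # of agreements / total number of codes."""
    assert len(codes_a) == len(codes_b)
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

# Hypothetical example: two coders labeling 10 clips
# ("G" = is-category, "N" = is-not-category)
coder1 = ["G", "G", "N", "N", "G", "N", "N", "G", "N", "N"]
coder2 = ["G", "N", "N", "N", "G", "N", "N", "G", "N", "G"]
print(agreement(coder1, coder2))  # 0.8
```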

  30. Agreement/ Accuracy • There is general agreement across fields that agreement/accuracy is not a good metric • What are some drawbacks of agreement/accuracy?

  31. Agreement/ Accuracy • Let’s say that Tasha and Uniqua agreed on the classification of 9200 time sequences, out of 10000 time sequences • For a coding scheme with two codes • 92% accuracy • Good, right?

  32. Non-even assignment to categories • Percent Agreement does poorly when there is non-even assignment to categories • Which is almost always the case • Imagine an extreme case • Uniqua (correctly) picks category A 92% of the time • Tasha always picks category A • Agreement/accuracy of 92% • But essentially no information

  33. An alternate metric • Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)
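
A minimal sketch of the formula, applied to the Tasha/Uniqua case above (how the expected agreement is computed is covered on the "Expected agreement" slide below):

```python
def kappa(agreement, expected_agreement):
    """Kappa: (Agreement - Expected Agreement) / (1 - Expected Agreement)."""
    return (agreement - expected_agreement) / (1 - expected_agreement)

# Tasha/Uniqua example from the earlier slide:
# Uniqua picks category A 92% of the time, Tasha picks A 100% of the time,
# so chance agreement is 0.92 * 1.00 + 0.08 * 0.00 = 0.92 -- the same as
# their observed agreement, and Kappa is 0: no information beyond chance.
print(kappa(0.92, 0.92))  # 0.0
```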

  34. Kappa • Expected agreement is computed from a contingency table of the two coders’ labels (one row and one column per category) [table shown on slide]

  35. Kappa • Expected agreement is computed from that same contingency table • Note that Kappa can be calculated for any number of categories (but only 2 raters)

  36. Cohen’s (1960) Kappa • The formula above, for 2 raters (and any number of categories) • Fleiss’s (1971) Kappa, which is more complex, can be used for 3+ raters • I have an Excel spreadsheet which calculates multi-category Kappa, which I would be happy to share with you

  37. Expected agreement • Look at the proportion of labels each coder gave to each category • To find the agreement on category A that could be expected by chance, multiply coder 1’s proportion of category A labels by coder 2’s proportion of category A labels • Do the same thing for category B • Add these two values together • This is your expected agreement
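
A sketch of that computation for a two-category scheme, using each coder's label proportions (reusing the hypothetical 10-clip labels from the agreement sketch above):

```python
from collections import Counter

def expected_agreement(codes_a, codes_b):
    """Chance agreement: sum over categories of the product of the
    two coders' label proportions for that category."""
    n = len(codes_a)
    props_a = {cat: count / n for cat, count in Counter(codes_a).items()}
    props_b = {cat: count / n for cat, count in Counter(codes_b).items()}
    categories = set(props_a) | set(props_b)
    return sum(props_a.get(cat, 0) * props_b.get(cat, 0) for cat in categories)

# Hypothetical labels from the earlier sketch
coder1 = ["G", "G", "N", "N", "G", "N", "N", "G", "N", "N"]
coder2 = ["G", "N", "N", "N", "G", "N", "N", "G", "N", "G"]
pe = expected_agreement(coder1, coder2)   # 0.4*0.4 + 0.6*0.6 = 0.52
print(pe)
print((0.8 - pe) / (1 - pe))              # Kappa ~= 0.58
```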

  38. Example • [Slide shows a table of Tyrone’s and Pablo’s on-task/off-task codes]

  39. Example • What is the percent agreement?

  40. Example • What is the percent agreement? • 80%

  41. Example • What is Tyrone’s expected frequency for on-task?

  42. Example • What is Tyrone’s expected frequency for on-task? • 75%

  43. Example • What is Pablo’s expected frequency for on-task?

  44. Example • What is Pablo’s expected frequency for on-task? • 65%
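
Putting the pieces together for this example, using only the numbers stated on these slides (80% agreement, Tyrone 75% on-task, Pablo 65% on-task), the remaining arithmetic would be:

```python
# Marginal (expected) frequencies from the slides
tyrone_on, tyrone_off = 0.75, 0.25
pablo_on, pablo_off = 0.65, 0.35

expected = tyrone_on * pablo_on + tyrone_off * pablo_off  # 0.4875 + 0.0875
observed = 0.80

kappa = (observed - expected) / (1 - expected)
print(expected)          # 0.575
print(round(kappa, 2))   # ~0.53
```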
