

Parameters Driving Effectiveness of Automated Essay Scoring with LSA. 9th CAA, July 6th 2005, Loughborough. Fridolin Wild, Christina Stahl, Gerald Stermsek, Gustaf Neumann, Department of Information Systems and New Media, Vienna University of Economics and Business Administration.




Presentation Transcript


  1. Parameters Driving Effectiveness of Automated Essay Scoring with LSA. 9th CAA, July 6th 2005, Loughborough. Fridolin Wild, Christina Stahl, Gerald Stermsek, Gustaf Neumann, Department of Information Systems and New Media, Vienna University of Economics and Business Administration

  2. Agenda • TEL @ WUW • Essay Scoring with LSA • Latent Semantic Analysis (LSA) • Parameters Driving Effectiveness • Experiment Results • Summary & Future Work

  3. Technology Enhanced Learning @ WUW

  4. TEL @ WUW • Learn@WU • > 19,000 users • > 27,000 resources • Research Driven Development • (EducaNext.org) • (HCD-online.com)

  5. Electronic Assessment @ WUW • The Situation • No Entrance Limitations in Austria • High Drop-out Rates • Varying Number of Freshmen (by 1,000) • Space Problems • Highly Scalable Courses (with Large-Scale Assessments) • Concentrate Resources on Higher Semesters • Currently: many multiple-choice tests (for practice; scanner for exams) • Feeding Answers • Tempting: learning answers by heart instead of critical thinking (negative effects found) • Future: Free-Text Assessment • Increase Quality of Feedback (formative, no autograding!) • With Latent Semantic Analysis (LSA)

  6. Essay Scoring with Latent Semantic Analysis

  7. Essay Scoring with LSA

  8. Software: The R Package ‘lsa’ • Currently in Version 0.4 • available upon request • public domain • Can be integrated on DB-Level • into PostgreSQL • ‘Essay Scoring by Stored Procedures’ • Easy to use (students!) • Wrapper Module for .LRN (Diploma Thesis)

  9. Latent Semantic Analysis

  10. Input (Docs): Convert to Document-Term Matrix M
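The first step above can be sketched as follows. This is a minimal Python illustration with an invented three-document toy corpus; the talk itself uses the R 'lsa' package (its textmatrix() function) on a German glossary corpus.

```python
# Hypothetical mini-corpus standing in for the real textbase.
docs = ["user interface design",
        "interface design survey",
        "graph minors survey"]

# Document-term matrix M: one row per term, one column per document,
# each cell holding the raw term frequency.
vocab = sorted({w for d in docs for w in d.split()})
M = [[d.split().count(t) for d in docs] for t in vocab]

for t, row in zip(vocab, M):
    print(t, row)
```

Pre-processing (stemming, stop-word removal, weighting) would be applied to the tokens before this matrix is built.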

  11. Singular Value Decomposition: M = TSD’

  12. “Latent Semantics” • Assumption: documents have a semantic structure • Structure is obscured by word usage (noise, synonyms, homographs, …) • Therefore: map doc-term-matrix using conceptual indices derived statistically (truncated SVD): M2 = TS2D’
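The truncated SVD described above can be sketched in a few lines of numpy; the term-by-document matrix here is a toy stand-in for the real textbase, and k = 2 is an arbitrary illustration of the dimensionality choice discussed later.

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents).
M = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 0.],
              [0., 0., 1.]])

# Full SVD: M = T S D'
T, s, Dt = np.linalg.svd(M, full_matrices=False)

# Truncate to k factors and reconstruct the reduced matrix M2 = T S2 D'
k = 2
M2 = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# M2 is the best rank-k least-squares approximation of M
print(np.round(M2, 2))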

  13. Truncated SVD

  14. Reconstructed, Reduced Matrix (example document m4: “Graph minors: A survey”)

  15. doc2doc similarities • Unreduced: based on M = TSD’, Pearson correlation over document vectors • Reduced: based on M2 = TS2D’, Pearson correlation over document vectors
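The reduced-case similarity computation can be sketched like this; the matrix values are invented, and in practice the columns would come from the truncated SVD reconstruction.

```python
import numpy as np

# Toy reduced matrix M2; document vectors are its columns.
M2 = np.array([[0.9, 1.1, 0.1],
               [0.2, 0.8, 1.0],
               [0.8, 0.2, -0.1],
               [0.1, 0.1, 0.9]])

# Pearson correlation between every pair of document (column) vectors.
sims = np.corrcoef(M2.T)
print(np.round(sims, 2))
```

An essay's score is then derived from its correlations with the scored training essays, using the best hit or the mean of the best solutions as discussed below.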

  16. SVD-Updating: Folding-In • SVD Factor Stability • SVD calculates factors over a textbase • Different texts – different factors • Challenge: avoid unwanted factor changes (e.g. bad essays) • Solution: folding-in of essays instead of recalculating • SVD is computationally expensive • 14 seconds (300 docs textbase) • 10 minutes (3500 docs textbase) • … and rising!

  17. Folding-In in Detail • (1) Convert the original vector to “Dk” format • (2) Convert the “Dk”-format vector to “Mk” format (cf. Berry et al., 1995)

  18. Parameters Driving Effectiveness

  19. Parameters • 4 x 12 x 7 x 2 x 3 = 2016 Combinations (pre-processing x weighting x dimensionality x method x measure)

  20. Pre-Processing • Stemming • Porter Stemmer (snowball.tartarus.org) • ‘move’, ‘moving’, ‘moves’ => ‘move’ • in German even more important (more inflections) • Stop Word Elimination • 373 Stop Words in German • Stemming plus Stop Word Elimination • Unprocessed (‘raw’) Terms
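Stop-word elimination can be sketched as a simple token filter; the stop-word set here is a tiny illustrative subset, not the 373-entry German list used in the experiment.

```python
# Minimal stop-word filtering sketch with a handful of German function words.
stopwords = {"der", "die", "das", "und", "ist"}

tokens = "die Reichweite und die Kosten".lower().split()
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # → ['reichweite', 'kosten']
```

Stemming would be applied in the same pipeline position, before the document-term matrix is built.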

  21. Term Weighting Schemes • weight_ij = lw(tf_ij) ∙ gw(t_i) • Local Weights (LW) • None (‘raw’ tf) • Binary Term Frequency • Logarithmised Term Frequency (log) • Global Weights (GW) • None (‘raw’ tf) • Normalisation • Inverse Document Frequency (IDF) • 1 + Entropy • 12 Combinations (3 local x 4 global)
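The best-performing combination reported later (logtf locally, IDF globally) can be sketched as below. The term-frequency values are invented, and the exact IDF and log variants differ between implementations, so this is one plausible reading, not the R 'lsa' package's exact formula.

```python
import numpy as np

# Toy raw term-frequency matrix (terms x documents).
tf = np.array([[2., 0., 1.],
               [1., 1., 0.],
               [0., 3., 1.]])

# Local weight: logarithmised term frequency, lw = log(tf + 1).
lw = np.log(tf + 1)

# Global weight: inverse document frequency, gw_i = log(n / df_i),
# one value per term, applied to every cell of that term's row.
n = tf.shape[1]
df = (tf > 0).sum(axis=1)
gw = np.log(n / df)

# weight_ij = lw(tf_ij) * gw_i
weight = lw * gw[:, None]
print(np.round(weight, 3))
```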

  22. SVD-Dimensionality • Percentage of Cumulated Values • Shares of 50%, 40%, 30% • Share of Values = Number of Docs • Absolute Fraction of k • 1/50 and 1/30 • Fixed Number k (‘magic 10’)
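The "share" criterion above can be sketched as follows: keep the smallest k whose singular values accumulate a given share of the total. The singular values here are invented, and the helper name dims_by_share is hypothetical.

```python
import numpy as np

# Toy singular values in descending order, as returned by an SVD.
s = np.array([5.0, 3.0, 2.0, 1.5, 1.0, 0.5])

def dims_by_share(s, share=0.5):
    """Smallest k whose cumulated singular values reach the given share."""
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, share) + 1)

print(dims_by_share(s, 0.5))  # → 2  (5 + 3 = 8 of 13 total ≥ 50%)
```

The fixed-k alternatives ("magic 10", 1/50th or 1/30th of the documents) need no computation over the spectrum at all.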

  23. Similarity Measures & Methods • Pearson Correlation • (Cosine Correlation) • Spearman’s Rho • Best Hit of Best Solutions • Mean of Best Solutions • pics: http://davidmlane.com/hyperstat/A62891.html

  24. Assessing Effectiveness • Compare Machine Scores with Human Scores • Human-to-Human Correlation • Usually around .6 (literature, own experiments) • Increased by familiarity between assessors, tighter assessment schemes, … • Scores vary even more strongly with decreasing subject familiarity (.8 at high familiarity, worst test: −.07)

  25. Experiment Settings • Test Collection • 43 Students’ Essays in German • Scored by Human Assessor from 0 to 5 points (ratio scaled) • Average essay length: 56.4 words • Training Collection • 3 ‘golden essays’ • Plus 302 documents from a marketing glossary • Average glossary entries: 56.1 words

  26. Experiment Results

  27. Overall • Of the 2016 correlations: • 48 significant at p < 0.001 • 459 at p < 0.01 • 885 at p < 0.05 • 1235 at p < 0.1 • Rest (781): not significant

  28. Pre-Processing • Best: Stop Word Filtering (Ø .31) • Stemming / Stemming & Stopping worsen results (by .06 and .03) • Raw: .26 • Best 50: • 21 x stopping • 14 x raw • 12 x stemming • 3 x stemming & stopping • Sorted Spearman Correlation of all Experiments

  29. Term Weighting • Global Weights: • IDF overall best (.36 with logtf) • Normalisation worsens (.15 - .17) • 1+Entropy: nearly no effect • Local Weights: • hardly any effect • raw and logtf squeeze curve • Best 50: • 20 x bintf • 19 x logtf • 11 x raw • 26 x IDF • 13 x raw • 6 x normalisation • 5 x 1+entropy

  30. Dimensionality • ‘Share’ Scores Best • 50%: .29 (40% and 30%: .28) • Curve favours 30% • Rest: .22 to .24 • Negative Correlations at Share 30%: Normalisation as GW • Best 50: • 13 x 1/50th • 10 x share 50% • 8 x 1/30th • 8 x magic ten • 5 x share 40% • 3 x share 30% • 3 x ndocs

  31. Correlation Measures • Cosine & Pearson slightly better in the curve (on average, Spearman is better) • Best 50: • 21 x Spearman • 15 x Cosine • 14 x Pearson

  32. Correlation Method • Mean Correlation with Best Essays is slightly better • Best 50: • 31 x Mean • 19 x Maximum

  33. Summary and Future Work

  34. Summary • Effectiveness can be tuned in advance • Recommendation (not a guarantee): • Use Stop Word Filtering • Use IDF as global weight and any local • Use Spearman’s Rho • Use Average Correlation to Best Essays • However: other combinations still can be successful! • Optimisations are not independent

  35. Future Work • A Model of Influencing Factors • Stability across Changing Corpora / Contexts • Different Text Assessment Methods • Similarity Measurement Method • Doc versus Query (=Aspect) • Way of Corpus Splitting • Bag-of-Words: Documents vs. Sentence, Paragraph, N-Grams • Summaries or Controlled Vocabulary • Norm-referenced vs. criterion-referenced (NRT, CRT) • ‘Definitory’ vs. ‘case-based’ QAs (I, E) • Compare with other Scoring Methods

  36. Thanks for your attention! Get these slides at www.educanext.org

  37. The Original Question • Question: “Choose the plan with which you could best achieve the communication goal, and justify your choice in keywords! (5 points)” • Best Essay (5P): “The choice depends on whether I want to achieve breadth of impact (reach) or depth of impact (OTS), e.g. for image improvement. In my opinion, Plan 2 should be chosen here, since the OTS values are almost equal (22 for Plan 1 vs. 20 for Plan 2 = no big difference), yet the costs per 1,000 users and per 1,000 contacts are cheaper with Plan 2 (users: Plan 2 cheaper by €101; contacts: Plan 2 cheaper by €3)” • Essay (2P): “Plan 2: because with minimal additional costs for the placements I get more placements and a considerably larger reach, both overall and within the target group”
