

Parameters Driving Effectiveness of Automated Essay Scoring with LSA. 9th CAA, July 6th 2005, Loughborough. Fridolin Wild, Christina Stahl, Gerald Stermsek, Gustaf Neumann, Department of Information Systems and New Media, Vienna University of Economics and Business Administration.




Presentation Transcript


  1. Parameters Driving Effectiveness of Automated Essay Scoring with LSA. 9th CAA, July 6th 2005, Loughborough. Fridolin Wild, Christina Stahl, Gerald Stermsek, Gustaf Neumann, Department of Information Systems and New Media, Vienna University of Economics and Business Administration

  2. Agenda • TEL @ WUW • Essay Scoring with LSA • Latent Semantic Analysis (LSA) • Parameters Driving Effectiveness • Experiment Results • Summary & Future Work

  3. Technology Enhanced Learning @ WUW

  4. TEL @ WUW • Learn@WU • > 19,000 users • > 27,000 resources • Research Driven Development • (EducaNext.org) • (HCD-online.com)

  5. Electronic Assessment @ WUW • The Situation • No Entrance Limitations in Austria • High Drop-out Rates • Varying Number of Freshmen (by 1,000) • Space Problems • Highly Scalable Courses (with Large-Scale Assessments) • Concentrate Resources on Higher Semesters • Currently: many multiple-choice tests (for practice; scanner for exams) • Feeding Answers • Tempting: learning answers by heart instead of critical thinking (negative effects found) • Future: Free-Text Assessment • Increase Quality of Feedback (formative, no autograding!) • With Latent Semantic Analysis (LSA)

  6. Essay Scoring with Latent Semantic Analysis

  7. Essay Scoring with LSA

  8. Software: The R Package ‘lsa’ • Currently in Version 0.4 • available upon request • public domain • Can be integrated on DB-Level • into PostgreSQL • ‘Essay Scoring by Stored Procedures’ • Easy to use (students!) • Wrapper Module for .LRN (Diploma Thesis)

  9. Latent Semantic Analysis

  10. Input (Docs): Convert to Document-Term Matrix M
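The first step above can be sketched as follows. This is a minimal Python illustration with an invented three-document toy corpus; the talk itself uses the R 'lsa' package (its textmatrix() function) on a German glossary corpus.

```python
# Hypothetical mini-corpus standing in for the real textbase.
docs = ["user interface design",
        "interface design survey",
        "graph minors survey"]

# Document-term matrix M: one row per term, one column per document,
# each cell holding the raw term frequency.
vocab = sorted({w for d in docs for w in d.split()})
M = [[d.split().count(t) for d in docs] for t in vocab]

for t, row in zip(vocab, M):
    print(t, row)
```

Pre-processing (stemming, stop-word removal, weighting) would be applied to the tokens before this matrix is built.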

  11. Singular Value Decomposition: M = TSD’

  12. “Latent Semantics” • Assumption: documents have a semantic structure • Structure is obscured by word usage (noise, synonyms, homographs, …) • Therefore: map doc-term-matrix using conceptual indices derived statistically (truncated SVD): M2 = TS2D’
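The truncated SVD described above can be sketched in a few lines of numpy; the term-by-document matrix here is a toy stand-in for the real textbase, and k = 2 is an arbitrary illustration of the dimensionality choice discussed later.

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents).
M = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 0.],
              [0., 0., 1.]])

# Full SVD: M = T S D'
T, s, Dt = np.linalg.svd(M, full_matrices=False)

# Truncate to k factors and reconstruct the reduced matrix M2 = T S2 D'
k = 2
M2 = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# M2 is the best rank-k least-squares approximation of M
print(np.round(M2, 2))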

  13. Truncated SVD

  14. Reconstructed, Reduced Matrix (example document m4: “Graph minors: A survey”)

  15. doc2doc similarities • Unreduced: based on M = TSD’, Pearson correlation over document vectors • Reduced: based on M2 = TS2D’, Pearson correlation over document vectors
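The reduced-case similarity computation can be sketched like this; the matrix values are invented, and in practice the columns would come from the truncated SVD reconstruction.

```python
import numpy as np

# Toy reduced matrix M2; document vectors are its columns.
M2 = np.array([[0.9, 1.1, 0.1],
               [0.2, 0.8, 1.0],
               [0.8, 0.2, -0.1],
               [0.1, 0.1, 0.9]])

# Pearson correlation between every pair of document (column) vectors.
sims = np.corrcoef(M2.T)
print(np.round(sims, 2))
```

An essay's score is then derived from its correlations with the scored training essays, using the best hit or the mean of the best solutions as discussed below.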

  16. SVD-Updating: Folding-In • SVD Factor Stability • SVD calculates factors over a textbase • Different texts – different factors • Challenge: avoid unwanted factor changes (e.g. bad essays) • Solution: folding-in of essays instead of recalculating • SVD is computationally expensive • 14 seconds (300 docs textbase) • 10 minutes (3500 docs textbase) • … and rising!

  17. Folding-In in Detail • (1) Convert the original vector to “Dk” format • (2) Convert the “Dk”-format vector to “Mk” format (cf. Berry et al., 1995)

  18. Parameters Driving Effectiveness

  19. Parameters • 4 x 12 x 7 x 2 x 3 = 2016 Combinations (pre-processing x weighting x dimensionality x method x measure)

  20. Pre-Processing • Stemming • Porter Stemmer (snowball.tartarus.org) • ‘move’, ‘moving’, ‘moves’ => ‘move’ • in German even more important (more inflections) • Stop Word Elimination • 373 Stop Words in German • Stemming plus Stop Word Elimination • Unprocessed (‘raw’) Terms
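Stop-word elimination can be sketched as a simple token filter; the stop-word set here is a tiny illustrative subset, not the 373-entry German list used in the experiment.

```python
# Minimal stop-word filtering sketch with a handful of German function words.
stopwords = {"der", "die", "das", "und", "ist"}

tokens = "die Reichweite und die Kosten".lower().split()
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # → ['reichweite', 'kosten']
```

Stemming would be applied in the same pipeline position, before the document-term matrix is built.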

  21. Term Weighting Schemes • weight_ij = lw(tf_ij) ∙ gw(t_i) • Local Weights (LW) • None (‘raw’ tf) • Binary Term Frequency • Logarithmised Term Frequency (log) • Global Weights (GW) • None (‘raw’ tf) • Normalisation • Inverse Document Frequency (IDF) • 1 + Entropy • 12 Combinations (3 local x 4 global)
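The best-performing combination reported later (logtf locally, IDF globally) can be sketched as below. The term-frequency values are invented, and the exact IDF and log variants differ between implementations, so this is one plausible reading, not the R 'lsa' package's exact formula.

```python
import numpy as np

# Toy raw term-frequency matrix (terms x documents).
tf = np.array([[2., 0., 1.],
               [1., 1., 0.],
               [0., 3., 1.]])

# Local weight: logarithmised term frequency, lw = log(tf + 1).
lw = np.log(tf + 1)

# Global weight: inverse document frequency, gw_i = log(n / df_i),
# one value per term, applied to every cell of that term's row.
n = tf.shape[1]
df = (tf > 0).sum(axis=1)
gw = np.log(n / df)

# weight_ij = lw(tf_ij) * gw_i
weight = lw * gw[:, None]
print(np.round(weight, 3))
```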

  22. SVD-Dimensionality • Percentage of Cumulated Values • Shares of 50%, 40%, 30% • Share of Values = Number of Docs • Absolute Fraction of k • 1/50 and 1/30 • Fixed Number k (‘magic 10’)
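The "share" criterion above can be sketched as follows: keep the smallest k whose singular values accumulate a given share of the total. The singular values here are invented, and the helper name dims_by_share is hypothetical.

```python
import numpy as np

# Toy singular values in descending order, as returned by an SVD.
s = np.array([5.0, 3.0, 2.0, 1.5, 1.0, 0.5])

def dims_by_share(s, share=0.5):
    """Smallest k whose cumulated singular values reach the given share."""
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, share) + 1)

print(dims_by_share(s, 0.5))  # → 2  (5 + 3 = 8 of 13 total ≥ 50%)
```

The fixed-k alternatives ("magic 10", 1/50th or 1/30th of the documents) need no computation over the spectrum at all.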

  23. Similarity Measures & Methods • Pearson Correlation • (Cosine Correlation) • Spearman’s Rho • Best Hit of Best Solutions • Mean of Best Solutions • pics: http://davidmlane.com/hyperstat/A62891.html

  24. Assessing Effectiveness • Compare Machine Scores with Human Scores • Human-to-Human Correlation • Usually around .6 (literature, own experiments) • Increased by familiarity between assessors, tighter assessment schemes, … • Scores vary even more strongly with decreasing subject familiarity (.8 at high familiarity, worst test: −.07)

  25. Experiment Settings • Test Collection • 43 Students’ Essays in German • Scored by Human Assessor from 0 to 5 points (ratio scaled) • Average essay length: 56.4 words • Training Collection • 3 ‘golden essays’ • Plus 302 documents from a marketing glossary • Average glossary entries: 56.1 words

  26. Experiment Results

  27. Overall • Of the 2016 correlations: • 48 significant at p < 0.001 • 459 at p < 0.01 • 885 at p < 0.05 • 1235 at p < 0.1 • Rest (781): not significant

  28. Pre-Processing • Best: Stop Word Filtering (Ø .31) • Stemming / Stemming & Stopping worsen results (by .06 and .03) • Raw: .26 • Best 50: • 21 x stopping • 14 x raw • 12 x stemming • 3 x stemming & stopping • Sorted Spearman Correlation of all Experiments

  29. Term Weighting • Global Weights: • IDF overall best (.36 with logtf) • Normalisation worsens (.15 - .17) • 1+Entropy: nearly no effect • Local Weights: • hardly any effect • raw and logtf squeeze curve • Best 50: • 20 x bintf • 19 x logtf • 11 x raw • 26 x IDF • 13 x raw • 6 x normalisation • 5 x 1+entropy

  30. Dimensionality • ‘Share’ Scores Best • 50%: .29 (40% and 30%: .28) • Curve favours 30% • Rest: .22 to .24 • Negative Correlations at Share 30%: Normalisation as GW • Best 50: • 13 x 1/50th • 10 x share 50% • 8 x 1/30th • 8 x magic ten • 5 x share 40% • 3 x share 30% • 3 x ndocs

  31. Correlation Measures • Cosine & Pearson slightly better in the curve (on average, Spearman is better) • Best 50: • 21 x Spearman • 15 x Cosine • 14 x Pearson

  32. Correlation Method • Mean Correlation with Best Essays is slightly better • Best 50: • 31 x Mean • 19 x Maximum

  33. Summary and Future Work

  34. Summary • Effectiveness can be tuned in advance • Recommendation (not a guarantee): • Use Stop Word Filtering • Use IDF as global weight and any local • Use Spearman’s Rho • Use Average Correlation to Best Essays • However: other combinations still can be successful! • Optimisations are not independent

  35. Future Work • A Model of Influencing Factors • Stability across Changing Corpora / Contexts • Different Text Assessment Methods • Similarity Measurement Method • Doc versus Query (=Aspect) • Way of Corpus Splitting • Bag-of-Words: Documents vs. Sentence, Paragraph, N-Grams • Summaries or Controlled Vocabulary • Norm-referenced vs. criterion-referenced (NRT, CRT) • ‘Definitory’ vs. ‘case-based’ QAs (I, E) • Compare with other Scoring Methods

  36. Thanks for your attention! Get these slides at www.educanext.org

  37. The Original Question • Question: “Choose the plan with which you could best achieve the communication goal, and justify your choice in keywords! (5 points)” • Best Essay (5P): “The choice depends on whether I want to achieve breadth of impact (reach) or depth of impact (OTS), e.g. for image improvement. In my opinion, Plan 2 should be chosen here, since the OTS values are almost equal (22 for Plan 1 vs. 20 for Plan 2 = no big difference), yet the costs per 1,000 users and per 1,000 contacts are cheaper with Plan 2 (users: Plan 2 cheaper by €101; contacts: Plan 2 cheaper by €3)” • Essay (2P): “Plan 2: because with minimal additional costs for the placements I get more placements and a considerably larger reach, both overall and within the target group”
