
Combining reliability and validity in assessment: Can auditing create the Golden Goose in assessment quality?


Presentation Transcript


  1. Combining reliability and validity in assessment: Can auditing create the Golden Goose in assessment quality? Institute of Education Doctoral School Summer Conference 2011 Mr. Niall Douglas BSc MA MBS MCollT

  2. Contents • Reliability and Validity in HE assessment • Consequences of assessment unreliability on plagiarism • My proposed research • Current market situation and likely future demands in HE assessment (if we have time) • Ethical consequences

  3. Reliability and Validity in HE assessment • While I was reading for my second undergraduate degree at St. Andrews (2004-2008), one of the most common complaints I heard from students was about inconsistency in grading • In other words, grades varied a LOT in qualitative assessment • Most students also took issue with the fact that quantitative assessment is graded up to 100%, whereas qualitative assessment is only graded up to 75-80% • This was seen, quite correctly in my opinion, as a hidden subsidy of quantitative study

  4. Reliability and Validity in HE assessment • It turns out that St. Andrews undergraduates are hardly alone in taking issue with this: Yorke (2000) found that this “unfairness” features prominently in student dissatisfaction surveys • So why does UK Higher Education persist in these marking inconsistencies, i.e. unreliability in assessment? [1]: Yorke, M. (2000) 'The quality of the student experience: what can institutions learn from data relating to non-completion?', Quality in Higher Education, vol. 6, pp. 61-75.

  5. Reliability and Validity in HE assessment • One area in which UK HE diverges from US HE is in choosing validity over reliability in assessment • Validity of assessment is measured by how well the assessment serves the intended learning outcomes • For example, one might choose a suboptimal form of assessment (in terms of reliability, marking overheads or even the prevention of cheating) if that form has superior effects on student learning outcomes • In particular, UK HE tries to allow for more creativity and originality in student writing, something that can only be assessed subjectively, even if grades consequently vary a lot, even with the same marker and the same student

  6. Reliability and Validity in HE assessment • This is why UK HE marks qualitative assessments up to just 75-80%: it has been repeatedly found that grades in qualitative assessment have a variance of up to ±20% (Hartog, Rhodes and Burt, 1936; Cox, 1967; Branthwaite, Trueman and Berrisford, 1981; Elton and Johnston, 2002) • Marking to 75-80% heavily compresses that variance into a ±12.5-15% range; 20% of that range is ±2.5-3%, hence the historical UK HE choice of 10% bands between 1H, 2H1, 2H2 etc. • Of course, this is not widely realised by students, nor even by staff! But it is done as a “least worst” outcome, despite penalising qualitative study • If you want to know a LOT more on this, see: Elton, L. and Johnston, B. (2002) 'Assessment in universities: a critical review of research'. Hartog, S.P., Rhodes, E.C. and Burt, S.C. (1936) The marks of examiners: being a comparison of marks allotted to examination scripts by independent examiners and boards of examiners, together with a section on a viva voce examination, Macmillan. Cox, R. (1967) 'Examinations and higher education: a survey of the literature', Higher Education Quarterly, vol. 21, pp. 292-340. Branthwaite, A., Trueman, M. and Berrisford, T. (1981) 'Unreliability of Marking: further evidence and a possible explanation', Educational Review, vol. 33, pp. 41-46.
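To make the band arithmetic concrete, here it is restated as a minimal Python sketch; the ±12.5-15% compressed range is quoted from the slide rather than re-derived, since the exact compression depends on how markers actually use the scale.

```python
# The slide's own arithmetic, restated (the compressed range is quoted
# from the slide above, not re-derived here).
raw_variance = 20.0  # +/- points observed when marking on a 0-100 scale
for ceiling, compressed in ((75, 12.5), (80, 15.0)):
    print(f"ceiling {ceiling}%: variance of ±{compressed} points "
          f"(vs ±{raw_variance} raw); 20% of that is ±{0.2 * compressed:.1f}")
# ±2.5 to ±3.0 points sits comfortably inside a 10% classification band.
```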

  7. Consequences of assessment unreliability on plagiarism • Of course, because this choice by UK HE enables creativity and originality, it has the consequence of generating the highest grade volatility for precisely the most creative and original students! • Therefore, while this preference for validity is admirable, does it have unintended consequences for the incentives to plagiarise? • Here are some facts about plagiarism ...

  8. Consequences of assessment unreliability on plagiarism • Firstly, cheating detection software is extremely easy to fool. In fact, there are sites on the internet which will pre-check your work for you to make sure you won’t get caught • Secondly, in most universities there is a vibrant market of Masters students being paid by undergraduates to write essays for them; these, being original work, don’t get detected • As an example, the going rate in University College Cork was about €100 per thousand lines, with a refund if the student didn’t get a 2H2 or higher

  9. Consequences of assessment unreliability on plagiarism • A study by de Lambert et al. (2002) found that 80% of New Zealand HE students plagiarise [5] • In my own experience at Hull, St. Andrews and University College Cork, it is even more common than that. I would actually say it is ubiquitous, as in I have never met a recent student who hasn’t! • Additionally, for some subjects like Computer Science, the problem is over-detection of plagiarism, because coursework answers tend to be very similar to one another • Being falsely accused of plagiarism when you did not in fact plagiarise is, I can assure you, a highly unpleasant experience! [5]: de Lambert, K., Ellen, N. and Taylor, L. (2002) 'Prevalence of academic dishonesty in tertiary institutions: The New Zealand story', New Zealand Journal of Applied Business Research, vol. 1.

  10. Consequences of assessment unreliability on plagiarism • From my own experience, I reckon that the top students selectively plagiarise whereas the bottom students randomly plagiarise • I have a feeling that, if true, this pattern might be an excellent proxy for measuring a student's understanding of study itself • This would support the hypothesis that the most creative and original students deliberately plagiarise not out of laziness or incapability, but rather as an expert method for reducing assessment volatility • And surely such adaptability is a valuable capability in a student?

  11. Consequences of assessment unreliability on plagiarism • This raises a very interesting thought ... • Instead of looking at all plagiarism as inherently negative, perhaps the patterns in plagiarism might offer a superior indicator of the ability of the student? • Taking it a step further, perhaps the overall patterns of gestating an assessed work might be an even better indicator? • In other words, instead of assessing final works as standalone items, we also assess how those works were generated

  12. My proposed research • My research intends to develop a software tool which records a person’s activities as they gestate an assessed work • This record is then formed into an audit trail which can be embedded into the Microsoft Office document submitted for assessment, as proof of originality of the work • Note that the student can delete items from the trail before submission; however, we retain what they delete (see the sketch below)
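As a rough illustration of that soft-delete design, here is a minimal sketch of what an audit-trail record might look like; every name and field below is hypothetical, chosen only to show how "deleted" items can be hidden from the submitted trail yet retained for analysis.

```python
# Hypothetical sketch of an audit trail with soft deletion: items a
# student "deletes" are only flagged, never removed (names/fields are
# assumptions, not the actual tool's design).
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEvent:
    timestamp: datetime   # when the activity happened
    kind: str             # e.g. "edit", "search", "paste", "cite"
    detail: str           # free-form description of the activity
    deleted: bool = False # soft-delete flag settable by the student

@dataclass
class AuditTrail:
    events: list[AuditEvent] = field(default_factory=list)

    def record(self, kind: str, detail: str) -> None:
        self.events.append(AuditEvent(datetime.now(timezone.utc), kind, detail))

    def delete(self, index: int) -> None:
        # The student "deletes" an item, but it is retained server-side.
        self.events[index].deleted = True

    def submitted_view(self) -> list[AuditEvent]:
        # What actually gets embedded in the submitted document.
        return [e for e in self.events if not e.deleted]
```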

  13. My proposed research • These audit trails are accumulated onto an internet cloud over time, along with ALL material seen by each student during the gestation of their works • This allows large-scale statistical analyses to be performed looking for correlations between awarded grade and (for example) how a student goes about writing an essay: what they search for, where they look and in what sequence, and of course what they delete from their submitted audit trails • You can also run statistical analyses of many other things, such as typical patterns in word usage or the impact of citations on students • This lets you do a number of very cool things ...
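A minimal sketch of the kind of correlation analysis meant here, using an invented process feature (number of distinct sources consulted) and invented grades; a real analysis would of course run over the whole cloud of audit trails.

```python
# Illustrative only: correlate one made-up process feature with grades.
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

sources_consulted = [3, 7, 12, 5, 9, 15]      # per-student process feature
awarded_grade     = [52, 58, 68, 55, 64, 71]  # per-student final mark
print(f"r = {pearson(sources_consulted, awarded_grade):.2f}")
```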

  14. My proposed research • For example, students can now compare how they write an essay to how other people write theirs • This lets you turn summative assessment into formative assessment, which is good! • Universities can compare their grading with the average grading of their past (or of other universities) to discover if there are any anomalies in their current processes. In particular, the statistics can flag those students in a cohort most likely to have the highest variance in grade • This lets the assessing department gauge marking risk and deploy double or even external marking to those students most likely to need it (see the sketch below). This is a vast improvement over the random double marking currently employed • It also provides an independent external check of the assessor which considerably exceeds the QAA’s present gold standard, because it checks each and every assessment rather than just a random sample
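Here is a minimal sketch of the flagging idea, assuming some upstream model has already produced a predicted grade variance per student (the figures below are invented):

```python
# Targeted double marking: given a predicted grade variance per student
# (however estimated), spend the double-marking budget on the scripts
# most at risk instead of choosing them at random. Hypothetical data.
predicted_variance = {
    "student_a": 2.1, "student_b": 7.8, "student_c": 3.4,
    "student_d": 9.2, "student_e": 1.7, "student_f": 6.5,
}
budget = 2  # scripts the department can afford to double-mark

flagged = sorted(predicted_variance, key=predicted_variance.get,
                 reverse=True)[:budget]
print("double-mark:", flagged)  # -> ['student_d', 'student_b']
```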

  15. My proposed research • And there are obvious major benefits for the detection and elimination of plagiarism, by increasing the effort needed to plagiarise and decreasing the rewards for doing so • There are some further benefits: • The software can also analyse what you are writing while you write it and suggest further reading on the basis of previously written essays stored in the cloud (a sketch of the idea follows below) • It’s great for reflective practice, because I am not entirely sure how I gestate my own academic outputs. Watching myself write these slides, I would wonder what it looks like from outside my head! • It could also be a boon to distance assessment, the importance of which is only just beginning ...
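A toy sketch of the further-reading suggestion: score stored essays by word overlap with the draft in progress and surface the closest match. A real system would use something far better than raw overlap; everything here is illustrative.

```python
# Suggest reading behind the stored essay most similar to the draft.
from collections import Counter

def similarity(a: str, b: str) -> float:
    # Dice coefficient over word counts: 1.0 = identical, 0.0 = disjoint.
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum((wa & wb).values())
    total = sum(wa.values()) + sum(wb.values())
    return 2 * shared / total if total else 0.0

draft = "reliability and validity in higher education assessment"
stored_essays = {
    "essay_1": "marking reliability in higher education",
    "essay_2": "the history of the printing press",
}
best = max(stored_essays, key=lambda k: similarity(draft, stored_essays[k]))
print("suggest reading behind:", best)  # -> essay_1
```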

  16. My proposed research • On the advantages for distance (actually de-campused) education: • Most people have a statistically fingerprintable “style” to their research and writing. This lets you ensure that a submitted work came from the same person as before (see the sketch below) • Moreover, that style slowly evolves over time. If a person used my software from the very beginning of their academic career, you’d get a graph of how their style evolves over time • This lets you gauge teaching practice quality, as well as compare students when they enter the workplace, i.e. as a recruitment selection tool • The possibilities only begin there ... I don’t have time to mention more here, but even if it only partially works there are a multitude of uses for this data
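A crude sketch of the stylometric fingerprinting idea: characterise a text by its relative frequencies of common function words, then compare fingerprints by distance. Real stylometry is far more sophisticated; this only shows the shape of the idea.

```python
# Fingerprint a text by function-word frequencies; a large distance
# between a student's corpus and a new submission warrants a closer look.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it"]

def fingerprint(text: str) -> list[float]:
    words = text.lower().split()
    n = len(words) or 1
    return [words.count(w) / n for w in FUNCTION_WORDS]

def distance(fp1: list[float], fp2: list[float]) -> float:
    # Euclidean distance between two fingerprints.
    return sum((a - b) ** 2 for a, b in zip(fp1, fp2)) ** 0.5

known = fingerprint("text of the student's earlier submissions goes here")
new = fingerprint("text of the newly submitted work goes here")
print(f"style distance: {distance(known, new):.4f}")
```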

  17. Current market situation and likely future demands in HE assessment • This research is not just about doing cool things; it is also about solving a major pressing need for UK Higher Education • UK HE has already adapted to a substantial per-student funding drop • For example, expenditure per student in 1984/85 was £8,600 in 2009 pounds. In 2008/09, it was just £7,389! • That’s a drop of 14% per student! • UK HE has already substantially improved its productivity ...

  18. Current market situation and likely future demands in HE assessment • From an economic perspective, UK HE has experienced tremendous productivity improvements • Total Factor Productivity (TFP) analysis of UK HE finds a 51.5% improvement in TFP between 1980/81 and 1992/93 [3] • As a comparison, UK HE storms past Chinese HE in both TFP and TFP growth [4]; in China, TFP growth was actually negative from 1998 to 2002 [3]: Flegg, A.T. et al. (2003) 'Measuring the Efficiency and Productivity of British Universities: An Application of DEA and the Malmquist Approach' [4]: Ng, Y.C. and Li, S. (2009) 'Efficiency and productivity growth in Chinese universities during the post-reform period'

  19. Current market situation and likely future demands in HE assessment • UK HE faces some large structural shifts in its primary market. According to the British Council (2004), 'Vision 2020: Forecasting international student mobility - a UK perspective': • Global transnational HE demand is forecast to rise from 2.1m places to 5.8m by 2020 (a 176% increase) • UK HE currently holds about 13% of the global transnational HE market. While this is projected to rise only to 15% by 2020, it still means a trebling of international students for UK HE, or a 34% rise in total students attending UK HE once projected domestic growth is included (the arithmetic is checked below)
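A quick check of this forecast arithmetic, using the figures as quoted from the British Council report:

```python
# Verify the slide's market arithmetic.
demand_2004, demand_2020 = 2.1, 5.8  # million transnational places
share_2004, share_2020 = 0.13, 0.15  # UK HE market share

growth = (demand_2020 / demand_2004 - 1) * 100
uk_2004 = demand_2004 * share_2004   # ~0.27m international students
uk_2020 = demand_2020 * share_2020   # ~0.87m international students
print(f"global demand growth: {growth:.0f}%")         # -> 176%
print(f"UK intl students: {uk_2020 / uk_2004:.1f}x")  # -> 3.2x, a trebling
```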

  20. Current market situation and likely future demands in HE assessment • If you look at the costs of providing a HE course, they break down as follows:
    • Fixed costs:
      • Planning and developing course design to fit other courses and quality standards
      • Developing course materials
    • Running costs:
      • Providing infrastructure, e.g. buildings, computers, libraries
      • Marketing, advertising and non-per-student administration
      • Teaching
    • Per-student costs:
      • Tutorials
      • Assessment
      • Per-student administration

  21. Current market situation and likely future demands in HE assessment • As you gain students (especially distance students): • Fixed costs per student drop • Running costs per student drop • Per-student costs do NOT drop • (Economic and accounting) productivity is strongly correlated with economies of scale • Therefore gaining further economies of scale inevitably comes down to reducing per-student costs • And that implies reducing the cost of assessing students ... (a toy cost model follows below)
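A toy cost model of this argument (all figures invented): fixed and running costs amortise over more students while the per-student component does not, so it alone sets the floor on cost per student.

```python
# Invented figures, purely to show the shape of the scale economics.
FIXED = 500_000.0        # course design, materials
RUNNING = 1_000_000.0    # buildings, marketing, teaching
PER_STUDENT = 400.0      # tutorials, assessment, per-student admin

for n in (500, 2_000, 10_000):
    cost = (FIXED + RUNNING) / n + PER_STUDENT
    print(f"{n:>6} students: £{cost:,.0f} per student")
# The amortised share shrinks towards zero as n grows, so cost per
# student is bounded below by the per-student component alone.
```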

  22. Current market situation and likely future demands in HE assessment • Given that UK HE has already plucked the “low hanging fruit” in productivity gains, how is it supposed to improve still further with a 34% increase in student numbers by 2020? • Especially given that new low-cost HE providers are becoming increasingly cost-competitive on the global stage by using much cheaper labour to reduce tutorial and per-student admin costs? • In short: either UK HE innovates or it will be pushed out of its global market

  23. Current market situation and likely future demands in HE assessment • Lastly, the Higher Education Achievement Report (HEAR) initiative wants to phase out overall degree classifications in favour of academic transcripts and even more detailed per-topic assessments • My worry is that this may penalise the more creative and original students even further, by showing up the volatility in their component grades • And that’s where my research comes in! If it works, it could be a great tool for simultaneously reducing assessment costs whilst improving its quality and being more meritocratic. A win-win-win!

  24. Ethical consequences • Just because it appears to be a win-win-win does not necessarily mean that there isn’t a big potential for problems ... • After all, for every “good” use of such personally intimate data there are many “bad” uses • For example, it certainly would be a gold mine for marketing • Or one might even use this data to identify early those most likely to become subversive, engage in criminal behaviour, or fail to repay loans

  25. Ethical consequences • Some of these problems can be inhibited by deliberately designing the data storage in a form which makes it hard to regress such indicators from the data • However, many of these problems also affect all social networking users in general • We comply with the EU Data Protection Directive, so we are currently legal and in line with best current practice • And while we can identify individuals, anyone we give access to our data for research purposes cannot (without the individual user’s explicit permission)

  26. Ethical consequences • In the end, it is probably too early to say much more without knowing what will work and what won’t, i.e. we have to build it and see what is possible first • We are, however, very mindful of these issues. Thank you for listening! Questions?
