Potentials and Challenges of Teacher Involvement in Rating Scale Design for High-Stakes Exams
Part I, Chapter 4. Zahra Farajnezhad
Franz Holzknecht, Benjamin Kremmel, Carmen Konzett, Kathrin Eberharter, and Carol Spöttl
Abstract

Although teachers are sometimes portrayed as unreliable raters because of their emotional involvement and proximity to students or test-takers, it can be argued that they have more expertise and experience in rating test-takers' performances than most test developers. Therefore, it seems only logical to include them in the development of rating scales. This applies to both scenarios in which teachers are only responsible for preparing students for high-stakes exams and scenarios where teachers are responsible for test preparation as well as the rating of the test performances. Involving teachers in rating scale design can offer test developers access to a wealth of rating experience and thereby increase the validity of the scale. This chapter will outline the potentials and challenges of involving secondary school teachers in the design of rating instruments for a large-scale national high-stakes exam. Two case studies on teacher involvement in scale development will be presented (writing and speaking). The chapter will compare the two projects, highlighting what was found useful by the involved teachers. It will do so by analyzing teacher statements from retrospective questionnaires (N = 23) about their experience of being involved in one or both of these projects. The chapter will conclude with insights into the importance of teacher involvement in this stage of the test development cycle, and will highlight the usefulness of combining top-down and bottom-up scale development procedures.
4.1 Rating Scale Development

At first sight, there is great variation in the way in which rating scales are developed, because of the varied contexts and tests for which they are used. The factors involved in choosing a scale development methodology include practical aspects such as financial and temporal resources, aspects relating to the test such as the test construct, its purpose and stakes, and aspects pertaining to the nature of the scale such as the potential users and the desired form in which scores should be reported (Weigle 2002). However, all of these individual approaches fall into two different categories: (a) a top-down approach starting from a conceptual basis (relying for instance on experts' intuitions, on language acquisition theories, or even on an already existing rating scale to formulate new descriptors), and (b) a bottom-up approach using evidence from sample test performances to describe typical features of the language of these performances, from which descriptors are then generated (cf. e.g. Fulcher 2003; Green 2014; Turner 2012; Upshur and Turner 1995; Weigle 2002 for more detailed descriptions of different scale development methods).
Fulcher et al. (2011) label these two categories (1) "measurement-driven" and (2) "performance data-based" approaches. "Measurement-driven" approaches include intuitive methods (i.e. using experts' beliefs about language testing and language acquisition based on their experience, knowledge and intuition) and the "scaling-descriptors" method (i.e. placing existing descriptors from different sources on a new scale using expert judgments and Rasch analysis). "Performance data-based" methods, in contrast, use test performances as a starting point for scale and descriptor development. There are various ways in which test performances can be analyzed, following more or less tightly guided protocols and resulting in different types of scales besides regular analytic scales.
While Fulcher et al. (2011) advocate a bottom-up approach, generating descriptors from detailed discourse analysis of test performances, Knoch (2011) suggests a top-down approach, basing the criteria of the rating scale on a taxonomy of the main points of existing theories of language proficiency. All authors seem to agree that once a basic rating scale has been established, the process becomes iterative: the initial decisions taken and the descriptors formulated must be validated in a process involving the repeated application of the scale to test performances, while simultaneously analyzing rating behavior and adapting the scale.
4.1.1 Using the CEFR as a Basis for Rating Scales

One top-down approach to scale development involves basing a new rating scale for an individual test on the existing illustrative scales contained in the Common European Framework of Reference for Languages (CEFR) (Council of Europe 2001). The CEFR fulfils a twofold purpose: first, it serves as the conceptual framework to inform the test construct and hence the rating scale, and second, it provides concrete descriptors for a number of aspects of language proficiency. These descriptors are already scaled into levels and can be used as a starting point for formulating new descriptors for a specific rating scale.
4.1.2 Teachers as Raters and Teachers as Scale Developers

Studies usually provide little information on the individuals who actually carry out the scale development in practice, i.e. choosing criteria, formulating descriptors and placing them on levels. Often there is a reference to 'experts' having been involved, but it is usually not made clear what their area of expertise was, e.g. test development, scale development, applied linguistics, language teaching, language testing or other areas. Galaczi et al.'s (2011) report of a rating scale development process for Cambridge ESOL is one of the few instances in which the qualities of the 'experts' are described in more detail: "They were chosen for their knowledge of the Cambridge English tests and the needs of examiners and test takers, as well as for their experience as speaking examiners and for their knowledge of the relevant area of applied linguistics" (2011, p. 222). Deygers et al. (2013) describe the involvement of "subject specialists" or "domain experts" in the development of a rating scale for a test of Dutch as a foreign language. These were specialists in "Dutch for Academic Purposes" and "professionals employed within the academic target domain" (Deygers et al. 2013, pp. 274–275); specifically, they were language tutors, academic staff in language and subject teaching, and researchers.
Upshur and Turner (1995, 1999) described a project in which a test of speaking was produced for a Canadian school board, with teachers involved in designing both the tasks and a rating scale based on the analysis of performances, working in small groups together with the researchers. Alderson et al. (2000) also worked with teachers at all levels, from the conceptualization of test specifications to item writing and scale development; the teachers received initial training and were supported throughout by language testing experts. Harsch and Martin (2012) reported on a project to develop large-scale assessment instruments for assessing the EFL proficiency of secondary school students across Germany as part of educational monitoring. In that project, teachers from German secondary schools designed tasks and a rating scale with the support of language testing experts. These teachers also participated in the first few rounds of small-scale trialing of the rating scale and the ensuing revision, leading to a first version of a workable rating scale. Deygers and Van Gorp (2015) report using "novice raters" in the validation process with a view to checking validity and making any additional necessary adaptations to the scale.
Turner (2000) appears to be the only research study addressing the impact of teacher involvement on scale design and on scale users' perception of the scale. Her study focused on five EFL teachers who participated in designing a rating scale for written performances based on an analysis of performance samples, guided by two testing experts. She analyzed the audio-recorded discussions that went on during the scale-design sessions and concluded from these analyses that the fact that teachers were involved in scale development "potentially affected the way the final scale came out". There is thus still a lack of research on the impact of teacher involvement in rating scale design. It is not described in detail who the teachers were, what exact role they played in the scale development procedure, or how their involvement was set up. None of the studies investigated what effect teachers' involvement in scale design had on the teachers themselves or on their language assessment literacy (LAL). The current study addresses these issues. It does so by describing in detail the procedure of teacher involvement in two rating scale development projects (writing and speaking) for a national high-stakes exam, and by analyzing teacher statements from retrospective questionnaires about their experience of being involved in these projects.
4.2 Processes of Scale Development

4.2.1 Writing

4.2.1.1 Project Background

The rating scales described here were part of a larger project to design a CEFR-based standardized national school-leaving exam for Austrian secondary schools, which included the assessment of listening, reading, writing and lexico-grammatical ability for the modern foreign languages English, French, Spanish and Italian (Spöttl and Green 2009; Spöttl et al. 2016). The assessment of writing skills involved several activities:
• developing test specifications for the writing part of the exam
• training item writers to develop writing tasks
• defining text types to be tested (e.g. essay, report, article etc.)
• developing writing tasks (in English, French, Italian and Spanish) to elicit B2 or B1 performances
• designing a rating scale in English to assess B2 performances (target language English)
• designing a rating scale in German to assess B1 performances (target languages French, Italian and Spanish)
The teachers received their training and worked as item writers alongside their teaching jobs. One of the criteria for selecting teachers as item writers was their ability to use English as a working language. Twenty-five teachers worked in four groups, organized around the target languages of the writing tasks: English, French, Italian and Spanish.
In accordance with the Austrian national curriculum, the writing part of the Standardisierte Reife- und Diplomprüfung (SRDP) exam was going to target two different proficiency levels: CEFR B2 for English, as this is generally the first foreign language taught in the Austrian school system, and CEFR B1 for the three Romance languages, as their teaching usually starts at least two years later and students are therefore not expected to arrive at the same proficiency level as in English. As a consequence, separate test specifications were developed for the two levels, differing mainly in the description of the target competence to be achieved, but also in details such as the number of words to be produced by the students in completing the writing tasks. The task format remained the same across the four languages: a short written prompt specifying the communicative situation and the text type to be produced, and a rubric containing a list of three instructions about what content the test takers had to include in their response. An example task is shown in Fig. 4.1.
Fig. 4.1 Example writing task for English

You have decided to take part in the competition. In your essay argue for or against couch surfing. You should:
• give reasons for joining or not joining the community
• discuss the effects of Couchsurfing on tourism
• evaluate its influence on personal development
Write around 350 words. Give your essay a title.
* = sleeping on the couch or extra bed of a friend when travelling, especially to save money
4.2.1.2 Procedure of Scale Development

Two scales were to be produced, one in English for the scoring of performances responding to B2 English tasks and one in German for the scoring of performances responding to B1 tasks in French, Italian or Spanish. The test developers decided to follow a scale development approach that would combine elements of top-down and bottom-up methods. In terms of the top-down element, the CEFR was to provide the basis for both scales. The CEFR descriptors provided not only the basic outline of the content of the scales but also informed and influenced their nature: "ability" rating scales (e.g. Bachman and Palmer 2010) with descriptions of what test takers can do at each band of the scale. The format of the scales was to be analytic: since exams in Austria are not corrected centrally, individual class teachers would be grading test takers' performances for the final exam, and the test developers felt that teachers needed maximum support in the form of a broad spectrum of descriptors.
The test developers divided the scales into 10 bands, with 5 bands including descriptors (bands 2, 4, 6, 8 and 10) and 5 "in-between" bands without descriptors (bands 1, 3, 5, 7 and 9), for two reasons.
• First, from their experience in teacher training, the test developers felt that the Hungarian scale did not offer enough range in the descriptors to discriminate sufficiently between the wide range of performances within one B level (be it B1 or B2).
• Second, 10 bands would make the integration of the scales' scores into the general class grading system easier for teachers.
It was decided that band 6 would be the pass mark. This decision was informed by the legal situation, which stipulates that a pass must reflect that the candidate has demonstrated mastery of the curriculum (Bundesministerium für Unterricht und Kunst 1974); in practice this means that the majority of Austrian teachers set the pass mark at 60% for their classroom assessments.
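To make the arithmetic behind this concrete, the following Python sketch shows one possible way of turning four criterion band scores on such a 10-band scale into a percentage and a pass/fail decision at band 6 (roughly 60%). The equal-weight averaging rule, the rounding, and the function name are illustrative assumptions, not the official SRDP score conversion.

```python
# Illustrative only: aggregating analytic band scores on a 10-band scale
# with the pass mark at band 6 (~60%). The equal-weight average is an
# assumption made for this sketch, not the documented procedure.

CRITERIA = [
    "task achievement",
    "coherence and cohesion",
    "lexical and structural range",
    "lexical and structural accuracy",
]
PASS_BAND = 6  # band 6 corresponds to the customary 60% classroom pass mark

def overall_result(band_scores: dict) -> tuple:
    """Average the four criterion bands (0-10) and compare to the pass mark."""
    if set(band_scores) != set(CRITERIA):
        raise ValueError("expected exactly one band score per criterion")
    mean_band = sum(band_scores.values()) / len(band_scores)
    percent = round(mean_band / 10 * 100, 1)
    return percent, mean_band >= PASS_BAND

# Example: a borderline performance averaging exactly band 6
print(overall_result({
    "task achievement": 6,
    "coherence and cohesion": 7,
    "lexical and structural range": 5,
    "lexical and structural accuracy": 6,
}))  # -> (60.0, True)
```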
The test developers who trained and supported the teachers presented raw draft versions of the rating scales to the group of teachers, divided into four criteria: task achievement, coherence and cohesion, lexical and structural range, and lexical and structural accuracy. The criteria coherence and cohesion, lexical and structural range, and lexical and structural accuracy formed a central part of the CEFR-linking process, assuring that exam performances would be assessed at the legally required level (B2 for the first foreign language and B1 for the second foreign languages). The teachers discussed the choice of the four criteria in plenary: first, task achievement, which, as a more task-specific and less generic criterion, was conceptually less strongly linked to the CEFR and needed more original input from the group; second, the separation of linguistic range and linguistic accuracy, which deviates from the CEFR, where linguistic competence is divided into lexical and grammatical competence; third, the two language-based criteria, which adopt the terminology "lexical" and "structural" instead of the CEFR's "vocabulary" and "grammar"; and fourth, the combination of "coherence" and "cohesion" in one criterion.
The teachers were divided into smaller language-specific groups (three to four teachers per group) and carried out three tasks while going through the performances elicited by each writing task. First, they divided the performances into three piles: good pass, minimal pass, and fail. They took notes and discussed how the elicited performances reflected on the writing task and whether the writing task needed revising. This process was called "script selection". Second, they rated each performance with the draft scale, ticking the descriptors they used and commenting on their usefulness, appropriateness, comprehensibility, practicality etc. for rating the performances at hand. This process was called "scale formulation". Third, they identified benchmark performances for a good pass, a minimal pass, and a fail, which could go forward to the benchmarking sessions.
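As a hypothetical illustration of the kind of information such a round generates, the sketch below records one script's pile, the descriptors ticked per criterion, and free-text comments. All field names, identifiers and example values are invented for illustration; the project itself worked from the teachers' written notes, not from any particular data format.

```python
# Hypothetical record of one script from a script-selection / scale-trialling round.
# Field names and example values are invented; they only illustrate what was captured.

from dataclasses import dataclass, field

@dataclass
class ScriptRecord:
    script_id: str                                          # anonymised script identifier
    pile: str                                               # "good pass", "minimal pass" or "fail"
    ticked_descriptors: dict = field(default_factory=dict)  # criterion -> descriptors applied
    comments: list = field(default_factory=list)            # notes on descriptor usefulness, task problems, etc.
    benchmark_candidate: bool = False                       # nominated for the benchmarking sessions?

example = ScriptRecord(
    script_id="EN-B2-essay-017",
    pile="minimal pass",
    ticked_descriptors={"task achievement": ["addresses all three content points, some only superficially"]},
    comments=["band 6 descriptor for structural range felt too vague for this script"],
    benchmark_candidate=True,
)
print(example.pile, example.benchmark_candidate)
```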
The test developers then collected the teachers' notes and used them to revise and flesh out the draft scales, while the teachers produced more writing tasks. After the next trial, the group of teachers met again with the test developers and repeated the process of reading and discussing performances, dividing them into piles of good pass, minimal pass and fail, and rating them with the new draft of the scale. Some descriptors were moved across bands and/or criteria several times until a decision could be taken as to which band and criterion they best fit, while other descriptors were removed or reformulated. The approach to scale development was thus a combination of the two approaches: (a) top-down, working with the existing descriptors from the CEFR (as suggested by Knoch 2011), and (b) bottom-up, i.e. a data-based procedure (as suggested by Fulcher et al. 2011; Fulcher 2003).
4.2.1.3 Teacher Questionnaire

To investigate teachers' perceptions of being involved in the development of rating scales, an online questionnaire was distributed among all teachers who were involved in the project. The questionnaire included a section about biodata, followed by nine specific questions about the teachers' perceptions of working as scale developers. In the last section, the teachers were asked if and how their general LAL benefited from working on the scales. Twenty-five teachers worked as scale developers on the writing scales and the final version of the questionnaire was sent to all of them. Fifteen teachers took part in the survey, resulting in a return rate of 60%. Two of the respondents were male and 13 were female. At the start of the project, the majority of the sample had been teaching for more than 20 years (67%), three of them (20%) had been teaching for more than 10 years, and two (13%) had 0–5 years of teaching experience.
As can be seen in Fig. 4.2, their work as scale developers helped the great majority of teachers in all aspects listed. The teachers found their involvement most useful for explaining the scale to their students and for using it themselves. It also helped them explain the scale to their colleagues and gain a better understanding of the construct. All teachers felt that the process was instructional.
For Fig. 4.3, teachers were asked whether they thought the rating scales benefitted from their involvement in the development and, if so, why. Fourteen out of 15 teachers indicated that the rating scales profited from the teachers' input: 71% felt that through teacher involvement the scales became more user-friendly, 50% that they became more comprehensible, 43% that the acceptance of the scales by class teachers was enhanced, and 36% that through their involvement the scales were better tailored to the Austrian context.
As Fig. 4.4 shows, all the teachers thought that, through working with real test taker performances, scale use was trained during the development. A further advantage, indicated by 79% of the sample, was that benchmark performances were identified in parallel to developing the scale. 71% felt that the scales' user-friendliness could be checked immediately, 57% that the scales were improved by formulating additional descriptors not included in the CEFR, and 50% that relevant descriptors could be identified more easily. A smaller number of teachers felt that their writing tasks improved through using learner performances during scale development (36%) and that it made the scale development process more varied (7%). In terms of disadvantages, 69% indicated that reading through all of the performances was exhausting, 31% that the process was lengthened, and 23% that it was organizationally challenging.
Table 4.1 relates to the teachers' LAL gains. The procedure had been split into two distinct tasks: choosing, adapting and formulating descriptors (labelled "scale formulation") and identifying and analyzing performances for developing the scales (labelled "script selection"). Teachers were asked which LAL areas benefitted from their work on each of the two tasks. They could choose from a list of 24 different LAL areas. The seven areas most frequently chosen by the teachers for both tasks are displayed in the table. It can be seen that they learned more from the script selection tasks than from the scale formulation tasks. However, both tasks helped teachers gain a better understanding of rating productive performances, writing tasks, validity, reliability, and selecting tasks for classroom use. Interestingly, more than half of the teachers learned about classroom assessment during their work on the descriptors (42% in the script selection tasks).
4.2.2 Speaking

4.2.2.1 Project Background

The procedure for developing the speaking scales followed a protocol similar to that of the writing scales. As the speaking exam is not part of the compulsory standardized test, its development was commissioned and funded by a different body, and the project was scheduled to be completed by a smaller group of developers within a shorter period of time. The group was chosen to represent five different regions of Austria and five modern languages, since the speaking scale was also intended to be used by teachers of Russian. Three of the teachers had been involved in the development of the writing scale. Four of the teachers were tasked with developing the scale at B2 in English, while the rest of the group developed a scale in German targeting B1. After those two scales had been completed and translated into German and English, the two groups were merged at the end of the process to develop a scale for A2. In addition, the groups also had to develop holistic versions of the different analytic scales, to be used by the interlocutor during the exam.
4.2.2.2 Procedure of Scale Development

The specific nature of spoken exam performances meant it would have been inappropriate to simply adopt the categories of the already existing and published scale for written performances. In line with the suggested criteria in the CEFR Manual (Council of Europe 2009, p. 184), it was agreed that the criterion organization and layout would hardly be applicable in speaking exams, while descriptors of fluency and interaction were very likely to be useful in describing the proficiency levels of test takers. After an extensive discussion among the group of teachers, moderated by the test developers, the decision was made to take over the key formal aspects of the writing scale, i.e. the number of criteria (4) and the number of bands (10), as teachers had already been familiarized with this rating frame. The group of teachers decided that teachers would not want to see the traditional linguistic competences of structural and lexical knowledge represented in only one criterion. Given that task achievement was felt to be an indispensable rating criterion to avoid negative washback and off-topic or rote-learned performances, the decision was made to collapse the categories fluency and interaction into one criterion.
In the first group meetings, the CEFR descriptors were systematically weeded out based on observations of real student performances in the respective foreign languages, leaving a reduced set of relevant descriptors to work with. Intermittently, plenary sessions with all members of both groups were held to discuss general issues such as the number of descriptors that was feasible or the concordance between the two scales (B1 and B2), both linguistically and in terms of the construct. Once the analytic scales were finalized, the holistic scales were produced. To that end, a number of performances for each language were rated with the finished analytic scales, and the teachers kept track of which descriptors in each of the four criteria and bands they applied most often. Based on the frequencies of descriptor use, the holistic scales were developed for the three CEFR levels (A2, B1 and B2).
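The frequency-based step can be pictured with a short sketch: tally how often each analytic descriptor was ticked per criterion and band across the trial ratings, then keep the most frequently applied ones as candidates for the holistic scale. The data layout, the example descriptor wordings and the "top-n" selection rule are assumptions made for illustration, not the procedure as actually documented.

```python
# Illustrative tally of descriptor use across trial ratings. The input format,
# example descriptor texts and the top-n cut-off are assumptions for this sketch.

from collections import Counter

def most_used_descriptors(ratings, top_n=2):
    """ratings: iterable of (band, criterion, descriptor) tuples from trial ratings."""
    counts = {}  # (band, criterion) -> Counter of descriptor texts
    for band, criterion, descriptor in ratings:
        counts.setdefault((band, criterion), Counter())[descriptor] += 1
    # Keep the descriptors applied most often for each band/criterion pair
    return {key: [d for d, _ in counter.most_common(top_n)]
            for key, counter in counts.items()}

trial_ratings = [
    (6, "fluency and interaction", "keeps going comprehensibly despite some pausing"),
    (6, "fluency and interaction", "keeps going comprehensibly despite some pausing"),
    (6, "fluency and interaction", "initiates and maintains simple exchanges"),
]
print(most_used_descriptors(trial_ratings))
# {(6, 'fluency and interaction'): ['keeps going comprehensibly despite some pausing',
#                                   'initiates and maintains simple exchanges']}
```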
Although the procedure of developing the speaking scales was the same as for the writing scales (combining a top-down and bottom-up approach), a number of factors made the process more difficult. Due to the nature of speaking, the iterative process of collecting and rating performances in order to remove, adapt or add descriptors was technically more challenging and more time-consuming. Consequently, the development of the speaking scales relied on a smaller number of performances and the performance sample was thus less representative in terms of the general Austrian student population. In addition, it was more difficult to find additional performance-based descriptors, especially for band 10, as not enough highly proficient test-takers could be recruited.
4.2.2.3 Teacher Questionnaire

The teachers who had worked on the speaking scales responded to the same questionnaire as the teachers who had worked on the writing scales. Ten teachers took part in the speaking scale development, out of which eight responded to the survey (all female). The great majority of respondents (N = 7) had more than 20 years of experience as teachers, and one respondent had been teaching for more than 10 years. Three respondents were teachers of English, one was a teacher of French, two were teachers of Italian, and one was a teacher of Spanish.
As Fig. 4.5 shows, working as scale developers helped the teachers in all of the aspects listed. The teachers in the speaking project felt even more positively about the different statements than the teachers in the writing project. All of the teachers felt that working on the scales aided them in explaining the scale to their students and colleagues, that it improved their own confidence in using the scale, and that it helped them design better speaking tasks. The great majority fully agreed that it helped them to understand the construct and that it was motivating. More than 60% fully agreed that the scale development process facilitated their understanding of the CEFR, while the rest of the sample at least partially agreed with that statement.
Fig. 4.6 concerns whether the scales benefited from the teachers' participation and, if they did, possible reasons for it. Seven out of eight respondents thought that the scales profited from the teachers' involvement. Of these, 86% felt that the scales became more user-friendly and were better tailored to the Austrian context, 43% that the scales became more comprehensible, and 29% that the scales' acceptance by class teachers was enhanced.
Fig. 4.7 presents the teachers' opinions about the use of actual test taker performances for scale development. 86% indicated that scale use could be practiced during the development and that the scales' user-friendliness was enhanced, 71% that using test taker performances helped choose the most relevant descriptors, 57% that additional descriptors not included in the CEFR could be identified, 43% that their speaking tasks were improved, and 29% that it made the process more varied. In terms of disadvantages, more than half of the sample indicated that the process was lengthened, 43% that it was organizationally challenging, and 14% that listening to the performances was exhausting.
Table 4.2 concerns whether the teachers' LAL improved through their work in scale development. Although the questionnaire again split the teachers' scale development tasks into scale formulation and performance selection activities, in the speaking project the procedure itself had not been split up. The table displays the LAL areas (out of a total of 24) in which more than half of the teachers gained knowledge through their work on the speaking scales. The results show that the majority of teachers learned about the principles of reliability, validity and practicality, with the latter being rated higher overall by teachers involved in speaking scale design than by teachers in the writing project. Another similarity between the speaking and the writing project concerns learning effects in the area of classroom assessment.
4.3 Discussion and Conclusion

The majority of teachers in both projects indicated that through their involvement in scale development they gained a better understanding of the concepts of validity, reliability and practicality, and improved their knowledge in areas such as classroom assessment, task design, and the rating of productive performances. In the writing project, LAL gains stemmed more from the teachers' work with student performances (a process labelled "script selection") than from the actual scale formulation itself. In addition, teachers in both projects found the scale development process motivating and instructional, and it helped them explain the scale to their colleagues. Most importantly, their involvement in scale design clearly helped teachers to explain the scale to their students, which seems indispensable both for scenarios in which teachers are only responsible for preparing students for high-stakes exams and for scenarios where teachers are responsible for test preparation as well as the rating of the test performances.
The results show that not only can teachers learn from working as scale developers, but rating scales seem to benefit as well, at least from the teachers' point of view. A large number of teachers thought that through their involvement the scales became more user-friendly, more comprehensible, and better tailored to the Austrian population. The findings thus confirm Turner's (2000) conclusion that teacher involvement in scale design can have an effect on the final scale. The results could also suggest that involving teachers has the potential to promote positive attitudes towards high-stakes exams. In both projects, working with the CEFR and student performances simultaneously made the scale development challenging and lengthened the process. The teachers in the writing project perceived the procedure as exhausting, as they had to read through hundreds of student performances and rate a selected sample with the draft versions of the scales. This was less of a problem for the teachers in the speaking project, because they worked with a much smaller number of performances, and listening to video-taped speaking exams might have been less taxing than reading hand-written scripts.
The majority of teachers in both projects indicated that using student performances helped them identify the most relevant descriptors and informed the creation and wording of additional descriptors not included in the CEFR. Teachers in both projects also indicated that working with student performances enabled them to practice using the scale already during its development. These findings suggest that using student performances in scale development makes efficient use of resources, as the outcomes of such a procedure are not only empirically derived rating scales, but also trained raters and benchmark performances. The study thus indicates that teacher involvement in scale development can be beneficial for teachers and students as well as for the resulting rating scales, and similar high-stakes language testing projects in other international contexts should therefore also consider involving teachers, even though such a procedure requires considerable amounts of time and financial resources. In addition, the workshops need to be planned carefully and sufficient time needs to be allotted for both plenary and group discussions. Some of the processes could be translated to an online working environment; however, there are many issues of test security that would need to be considered.
References

Alderson, J. C., Nagy, E., & Öveges, E. (Eds.). (2000). English language education in Hungary. Part II: Examining Hungarian learners' achievements in English. Budapest: The British Council.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford University Press.
Bundesministerium für Unterricht und Kunst. (1974). Leistungsbeurteilung in Pflichtschulen sowie mittleren und höheren Schulen. Retrieved September 30, 2016, from https://www.ris.bka.gv.at/GeltendeFassung.wxe?Abfrage=Bundesnormen&Gesetzesnummer=10009375
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
Council of Europe. (2009). Relating language examinations to the Common European framework of reference for languages: Learning, teaching, assessment (CEFR): A manual. Strasbourg: Language Policy Division.
Deygers, B., & Van Gorp, K. (2015). Determining the scoring validity of a co-constructed CEFR-based rating scale. Language Testing, 32, 521–541.
Deygers, B., Van Gorp, K., & Joos, S. (2013). Rating scale design: A comparative study of two analytic rating scales in a task-based test. In E. Galaczi & C. J. Weir (Eds.), Exploring language frameworks: Proceedings from the ALTE Kraków conference, July 2011 (pp. 273–289). Cambridge: Cambridge University Press.
Fulcher, G. (2003). Testing second language speaking. Harlow: Pearson Longman.
Fulcher, G. (2012). Assessment literacy for the language classroom. Language Assessment Quarterly, 9, 113–132.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28, 5–29.
Galaczi, E., French, A., Hubbard, C., & Green, A. (2011). Developing assessment scales for large-scale speaking tests: A multiple-method approach. Assessment in Education: Principles, Policy and Practice, 18, 217–237.
Green, A. (2014). Exploring language assessment and testing. London: Routledge.
Harsch, C., & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach. Assessing Writing, 17, 228–250.
Hudson, T. (2005). Trends in assessment scales and criterion-referenced language assessment. Annual Review of Applied Linguistics, 25, 205–227.
Knoch, U. (2011). Rating scales for diagnostic assessment of writing: What should they look like and where should the criteria come from? Assessing Writing, 16, 81–96.
Konzett, C. (2011). Every word counts. Fine-tuning the language of assessment scales: A field report. Paper presented at IATEFL TEASIG 2011: 'Standards and standardizing in high and low stakes exams: Assessment from classroom to Matura', Innsbruck.
Spöttl, C., & Green, R. (2009). Going national, standardised and live in Austria: Challenges and tensions. Paper presented at the 6th Annual EALTA Conference, Turku.
Spöttl, C., Kremmel, B., Holzknecht, F., & Alderson, J. C. (2016). Evaluating the achievements and challenges in reforming a national language exam: The reform team's perspective. Papers in Language Testing and Assessment, 5, 1–22.
Tankó, G. (2005). Into Europe: The writing handbook. Budapest: Teleki Lazlo Foundation and The British Council Hungary.
Turner, C. E. (2000). Listening to the voices of rating scale developers: Identifying salient features for second language performance assessment. The Canadian Modern Language Review, 56, 555–584.
Turner, C. E. (2012). Rating scales for language tests. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics. Oxford: Wiley-Blackwell.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49, 3–12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language speaking ability: Test method and learner discourse. Language Testing, 16, 82–111.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
Thank You for Your Attention. By: Zahra Farajnezhad