The purpose of the Speak Assessment is to measure a candidate’s ability to speak in natural contexts. As such, it is important to ascertain whether scores on the assessment correspond with native speaker intuitions about the proficiency of the speaker. These judgments are important for demonstrating the external validity of the exam.
Native speaker judgments are often used in linguistic research. Prior research has found, on the one hand, that there is a relationship between the ratings of untrained and professional raters (Rogalski et al., 2020; Pinget et al., 2014; Mattran, 1977) and, on the other hand, that training and experience improve inter-rater reliability (Isaacs & Thomson, 2013; Kuiken & Vedder, 2014; Kang et al., 2019; Rogalski et al., 2020). Accordingly, it was predicted that judgments of second language English proficiency by an untrained native speaker would be related to grades assigned by trained raters, but that this relationship would not be as strong as the relationship among the trained raters.
The research question was: To what extent does the Speak Assessment capture native speaker judgments of proficiency?
A large pre-employment screening company in Israel interviewed 55 candidates and recorded the interviewer’s judgments of the candidates’ language skills. The interviewer was a psychologist who is a fluently bilingual speaker of Hebrew and English, with English-speaking parents. The participants were all fluent Hebrew speakers, some of whom spoke Russian as a first language. The interviewer rated the candidates on a scale of 1 to 6 for fluency, pronunciation, grammar, and vocabulary. The interviewer did not use rubrics or any external set of criteria. After the interview was complete, the candidates took the Speak Assessment. The results of the two sets of ratings were compared.
There was a correlation of ρ = 0.76 between the scores of the interviewer and the ratings of the Speak Assessment. However, agreement for fluency was significantly lower than agreement for any of the other individual measures. Reasons for this are discussed below. Accordingly, the fluency scores were excluded, which yielded a correlation of ρ = 0.79. By comparison, the interrater reliability of mean scores among the trained raters was ρ = 0.97.
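The comparison described above can be illustrated with a minimal sketch. It assumes that ρ denotes Spearman’s rank correlation and that each candidate’s overall score is the mean of the per-criterion ratings; the source states neither explicitly. The function names and the toy ratings are hypothetical placeholders, not the study data.

```python
from statistics import mean

CRITERIA = ["fluency", "pronunciation", "grammar", "vocabulary"]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_score(ratings, criteria):
    """Mean rating per candidate over the chosen criteria."""
    return [mean(r[c] for c in criteria) for r in ratings]


# Hypothetical toy ratings on a 1-6 scale (NOT the study data).
# Order of values: fluency, pronunciation, grammar, vocabulary.
interviewer = [dict(zip(CRITERIA, row)) for row in
               [(5, 4, 4, 5), (3, 3, 2, 3), (6, 5, 5, 6),
                (4, 2, 3, 3), (5, 5, 4, 4), (2, 2, 2, 1)]]
assessment = [dict(zip(CRITERIA, row)) for row in
              [(3, 4, 4, 5), (4, 3, 3, 3), (5, 6, 5, 6),
               (2, 3, 3, 2), (4, 5, 5, 4), (3, 1, 2, 2)]]

without_fluency = [c for c in CRITERIA if c != "fluency"]

rho_all = spearman_rho(mean_score(interviewer, CRITERIA),
                       mean_score(assessment, CRITERIA))
rho_no_fluency = spearman_rho(mean_score(interviewer, without_fluency),
                              mean_score(assessment, without_fluency))

print(f"rho (all criteria):     {rho_all:.2f}")
print(f"rho (fluency excluded): {rho_no_fluency:.2f}")
```

With real data, the two lists would hold one entry per candidate (55 in this study). The same `spearman_rho` function could then be applied to pairs of trained raters to obtain a comparable interrater figure.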
Multiple factors have been explored in relation to what constitutes different levels of linguistic proficiency. These constructs are characterized in the Can-Do statements of the CEFR (Common European Framework of Reference for Languages), and the Council of Europe’s manual for aligning tests to the CEFR includes requirements for benchmarking and training of raters as prerequisites for asserting alignment to the CEFR. Research has shown bias against people with differing accents and disproportionate weight being assigned to accentedness in the evaluation of speech. In this case, the rater shared a linguistic background with the test-takers, which likely mitigated this bias and also provided a familiarity with the linguistic patterns of this group that may not be shared by other native speakers of English who would need to interact with these candidates in professional contexts after they are hired. This familiarity is the most likely cause of the differences in fluency ratings between the interviewer and the trained raters. As such, the interviewer provided a useful measure of the behavior of hiring managers, but this rating may not be generalizable to more linguistically diverse contexts.
The results of this research supported the prediction that the Speak Assessment has a relationship with native speaker judgments, but that these judgments are not as reliable as those of trained raters.
Hoekje, B., & Linnell, K. (1994). “Authenticity” in Language Testing: Evaluating Spoken Language Tests for International Teaching Assistants. TESOL Quarterly, 28(1), 103.
Kang, O., Rubin, D., & Kermad, A. (2019). The Effect of Training and Rater Differences on Oral Proficiency Assessment. Language Testing, 36(4), 481–504.
Mattran, K. J. (1977). Native Speaker Reactions to Speakers of ESL: Implications for Adult Basic Education Oral English Proficiency Testing. TESOL Quarterly, 11(4), 407.
Pinget, A.-F., Bosker, H. R., Quené, H., & de Jong, N. H. (2014). Native Speakers’ Perceptions of Fluency and Accent in L2 Speech. Language Testing, 31(3), 349–365.
Rogalski, Y., Key-DeLyria, S. E., Mucci, S., Wilson, J. P., & Altmann, L. J. P. (2020). The relationship between trained ratings and untrained listeners’ judgments of global coherence in extended monologues. Aphasiology, 34(2), 214–234. doi:10.1080/02687038.2019.1643002
Saito, K., Trofimovich, P., & Isaacs, T. (2016). Second language speech production: Investigating linguistic correlates of comprehensibility and accentedness for learners at different ability levels. Applied Psycholinguistics, 37(2), 217–240. doi:10.1017/S0142716414000502