Speak - Quick. Accurate. Innovative.

External Validation Study - Higher Education

Background

The Speak assessment is a computer-based test of spoken English. One of the primary uses of the exam is for higher education admissions or placement into English language programs. As such, it is necessary to evaluate how well the exam provides information for making decisions about the English levels of candidates for higher education. The ability of an exam to measure what it purports to measure is known as validity, and there are many different methods for establishing it. This paper examines the construct validity of the Speak assessment for use in higher education English programs.

There are two main methods of making admissions and placement decisions: standardized testing and in-house testing. Standardized testing has the advantage of providing scores on a recognized scale and of being comparable with other, similar programs. External, standardized exams are also more cost effective for the program, because students generally pay for these exams themselves. In-house exams have the advantage of being customizable and tailored to the needs of the program. They also benefit students by shifting the cost of testing onto the program, rather than requiring often costly external exams (Ling et al., 2014).

Personal interviews are often a part of English placement and admissions tests, such as the IELTS, and are also a frequent component of in-house tests. As such, interviews are a logical means of testing the validity of a standardized exam for making placement decisions.

The question this study explored was how well the scores of the Speak Assessment correspond to the scores given by face-to-face interviewers for placement in English programs.

Method

Fifty-one prospective students at a teachers’ college in Israel volunteered to participate in the study. Each student was interviewed by two faculty members, who scored the student’s speech using CEFR-aligned rubrics. Most, but not all, of the faculty members had experience using the CEFR. Although each student was interviewed by two different faculty members, the pair of interviewers was not the same for every student.

After the interviews, the students took the Speak Assessment of speaking and listening. The college compiled the exam and interview scores and provided them to Speak with identifying information removed.

Results

The agreement of the Speak scores with the college interviewers’ scores is reported below, using both Light’s Kappa and intraclass correlation coefficient (ICC) statistics. Overall, the agreement between the Speak CEFR scores and each college rater’s scores (ICC agreement of 0.923 and 0.931) is higher than the agreement between the two college raters themselves (0.882).
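For two raters, Light’s Kappa (the mean of all pairwise Cohen’s Kappas) reduces to ordinary Cohen’s Kappa. A minimal Python sketch of the computation, on made-up CEFR ratings rather than the study’s data:

```python
from collections import Counter

# Made-up CEFR levels (0 = A1 ... 5 = C2) for two raters; illustration only,
# not the study's data.
rater_a = [2, 3, 3, 4, 1, 2, 5, 3, 2, 4]
rater_b = [2, 3, 4, 4, 1, 2, 5, 2, 2, 4]

def cohen_kappa(x, y):
    """Unweighted Cohen's kappa: exact agreement corrected for chance."""
    n = len(x)
    p_obs = sum(a == b for a, b in zip(x, y)) / n        # observed agreement
    cx, cy = Counter(x), Counter(y)
    p_exp = sum(cx[c] * cy[c] for c in cx) / n ** 2      # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

print(round(cohen_kappa(rater_a, rater_b), 3))
```

With more than two raters, Light’s Kappa would average this statistic over every rater pair.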

Cohen’s Weighted Kappa scores are also reported; these take into account the size of the difference between scores. By this measure, the agreement of the Speak Assessment with each of the raters (0.826 and 0.825) is stronger than the agreement between the raters themselves (0.713).
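The “size of the difference” is captured by a disagreement weight. A sketch of linearly weighted Cohen’s Kappa (the “equal” weighting used in the tables below), again on made-up CEFR ratings rather than the study’s data:

```python
from collections import Counter

# Made-up CEFR levels (0 = A1 ... 5 = C2); illustration only, not the study's data.
rater_a = [2, 3, 3, 4, 1, 2, 5, 3, 2, 4]
rater_b = [2, 3, 4, 4, 1, 2, 5, 2, 2, 4]

def weighted_kappa(x, y, k=6):
    """Cohen's kappa with linear weights: a one-level disagreement costs
    less than a two-level disagreement."""
    n = len(x)
    obs = Counter(zip(x, y))                       # observed joint counts
    cx, cy = Counter(x), Counter(y)                # marginal counts
    w = lambda i, j: abs(i - j) / (k - 1)          # linear disagreement weight
    d_obs = sum(w(i, j) * obs[(i, j)] / n
                for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * cx[i] * cy[j] / n ** 2
                for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp

print(round(weighted_kappa(rater_a, rater_b), 3))
```

Because near-misses are penalized only partially, the weighted statistic is typically higher than the unweighted one when most disagreements are a single CEFR level.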

Conclusions cannot be drawn about the skill of any individual rater because, as noted above, the interviewer pairings varied across students. Overall, of the 51 exams scored, the Speak score agreed exactly with both raters in 27 cases (53%) and with exactly one of the raters in 22 cases (43%). In total, the Speak CEFR score matched at least one interviewer’s score in 96% of the cases.
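The agreement tally above is a simple count over cases. A sketch of the bookkeeping, using made-up scores rather than the study’s data:

```python
# Tallying exact agreement between a machine score and two human raters,
# on made-up scores; illustration only, not the study's data.
machine = [2, 3, 4, 2, 5, 3]
rater1  = [2, 3, 3, 2, 5, 4]
rater2  = [2, 4, 4, 1, 5, 4]

# Cases where the machine matches both raters exactly.
both = sum(m == a == b for m, a, b in zip(machine, rater1, rater2))
# Cases where the machine matches exactly one of the two raters.
one = sum((m == a) != (m == b) for m, a, b in zip(machine, rater1, rater2))
# Fraction of cases with at least one exact match.
at_least_one = (both + one) / len(machine)

print(both, one, round(at_least_one, 3))
```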

Speak CEFR Score and 1st College Rater

Interrater Reliability (Light's Kappa)

                 n    raters   statistic   z      p
Light's Kappa    51   2        0.681       9.67   0.00

Interrater Reliability (Cohen's Weighted Kappa)

Method Cohen's Kappa for 2 Raters (Weights: equal)
Subjects 51
Raters 2
Agreement % 74.5
Kappa 0.826
z 9.13
p-value <.001

Intraclass correlation coefficient

        Subjects   Raters   Subject variance   Rater variance   Residual variance   Consistency   Agreement
Value   51         2        1.87               0.0251           0.132               0.934         0.923
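As a sanity check, the Consistency and Agreement ICCs in the table above can be recovered (to rounding) from the reported variance components: consistency excludes rater variance from the denominator, while agreement includes it.

```python
# Variance components reported above (Speak vs. 1st college rater).
subject_var, rater_var, residual_var = 1.87, 0.0251, 0.132

# Consistency ICC: systematic rater differences do not count as error.
consistency = subject_var / (subject_var + residual_var)
# Agreement ICC: rater variance is included in the denominator.
agreement = subject_var / (subject_var + rater_var + residual_var)

print(round(consistency, 3), round(agreement, 3))
```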

Speak CEFR Score and 2nd College Rater

Interrater Reliability (Light's Kappa)

                 n    raters   statistic   z      p
Light's Kappa    51   2        0.657       9.41   0.00

Interrater Reliability (Cohen's Weighted Kappa)

Method Cohen's Kappa for 2 Raters (Weights: equal)
Subjects 51
Raters 2
Agreement % 72.5
Kappa 0.825
z 9.16
p-value <.001

Intraclass correlation coefficient

        Subjects   Raters   Subject variance   Rater variance   Residual variance   Consistency   Agreement
Value   51         2        1.86               0.00431          0.133               0.933         0.931

1st and 2nd College Raters

Interrater Reliability (Light's Kappa)

                 n    raters   statistic   z      p
Light's Kappa    51   2        0.444       6.57   4.89e-11

Interrater Reliability (Cohen's Weighted Kappa)

Method Cohen's Kappa for 2 Raters (Weights: equal)
Subjects 51
Raters 2
Agreement % 54.9
Kappa 0.713
z 7.93
p-value <.001

Intraclass correlation coefficient

        Subjects   Raters   Subject variance   Rater variance   Residual variance   Consistency   Agreement
Value   51         2        1.90               0.00196          0.253               0.883         0.882

Conclusions

The Speak Assessment agreed more closely with each interviewer’s assessment of the students’ English proficiency than the interviewers did with each other. The high rate of agreement between the Speak Assessment and at least one human interviewer (96% of cases) indicates that the Speak Assessment is a valid means of assessing students’ English proficiency level for use in a college program.

References

Carlsen, C. H. (2018). The Adequacy of the B2 Level as University Entrance Requirement. Language Assessment Quarterly, 15(1), 75–89. doi: 10.1080/15434303.2017.1405962


Chapelle, C. A., & Voss, E. (2014). Evaluation of language tests through validation research. In A. Kunnan (Ed.), The companion to language assessment. New York, NY: Wiley.


Falissard, B. (2012). psy: Various procedures used in psychometry [R package].


Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2019). irr: Various Coefficients of Interrater Reliability and Agreement [R package].


Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th ed.). New York, NY: American Council on Education and Praeger.


Ling, G., Wolf, M. K., Cho, Y., & Wang, Y. (2014). English-as-a-Second-Language programs for matriculated students in the United States: An exploratory survey and some issues. Princeton, NJ: Educational Testing Service. doi:10.1002/ets2.12010


North, B. (2007, February 6). Common European Framework of Reference for Languages (CEFR). Retrieved from https://www.coe.int/en/web/common-european-framework-reference-languages/documents


Seol, H. (2020). seolmatrix: Correlations suite for jamovi [jamovi module]. Retrieved from https://github.com/hyunsooseol/seolmatrix/.


The jamovi project (2020). jamovi. (Version 1.2) [Computer Software]. Retrieved from https://www.jamovi.org.
