Book Review – Building a Validity Argument for the Test of English as a Foreign Language


May 2009 Volume 5 Issue 1
Article 7.


Book Review

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language. New York: Routledge. 370 pp. $41.30 (paperback). ISBN-10: 0-8058-5456-8.
Reviewed by Seyed Vahid Aryadoust


Seyed Vahid Aryadoust is a PhD candidate in applied linguistics at the National Institute of Education, Nanyang Technological University, Singapore. His areas of interest are language testing, psychometrics, and validation.

The most important property of a measurement tool is the validity of the uses and interpretations of its scores. Test developers attempt to establish validity using a variety of techniques. Conventionally, validity has been divided into content, criterion, predictive, and construct classes. The newer argument-based approach to validity, however, centers on the uses and interpretations of test scores. This approach was introduced to the field by Kane (1992, 2001, 2002, 2004, 2006) (see also Mislevy, Steinberg, & Almond, 2003; Kane, Crooks, & Cohen, 1999; Mislevy, 2003; Koenig & Bachman, 2004; Bachman, 2005).
The book Building a Validity Argument for the Test of English as a Foreign Language, which reports on the first large-scale application of argument-based validation (ABV), is an attempt to validate a high-stakes test of English for academic purposes. The book is, in fact, an elaboration of a paper presented in 2004 that drew on a novel validation framework. Figure 9.1 (p. 321), in essence an expansion of Mislevy, Steinberg, and Almond’s (2003) assessment argument structure, serves as a map showing where the process started and where it ended. This map also uses the “three-bridge argument structure” introduced by Kane, Crooks, and Cohen (1999, p. 9).
The book is divided into nine chapters, chapter eight being the lengthiest and chapter one the shortest, and has three appendices and an index. The work is well organized, and the transitions between chapters are smooth, especially since the authors frequently refer readers to other chapters for a clearer picture of the validation process. Since the ultimate objective of the book is building a strong validity argument for the Test of English as a Foreign Language (TOEFL), the book states that all stages in ABV are crucial to the final decision-making about the test. The first three chapters focus primarily on the theoretical grounds of the argument; as readers move through the book, they find increasingly practical steps taken to answer different research questions.
A feature that distinguishes this book from other books on testing is its scope. For those interested in validation and test development, it provides a unique reference that sheds light on the new validation approach used in developing the new TOEFL. Both the validation method and the test itself are new to the field, so readers will most probably find the ideas encouraging and worth employing in other contexts, with similar or different tests. A host of tables and figures illuminate the long process of developing and validating the high-stakes Internet-Based TOEFL (TOEFL iBT).
Chapter one (pp. 1-25), written by the editors, Chapelle, Enright, and Jamieson, is entitled Test Score Interpretation and Use. It distinguishes between “competency-centered” and “task-centered” as the main “conceptual frameworks emerged for test design and score interpretation” (p. 3), drawn mainly from Messick (1994). The authors find the competency-centered perspective suited to the aims of the new TOEFL and claim to have adopted it throughout the argument construction stages (although this is not always observed). Chapter one is further intended to illuminate the structure of interpretive arguments. Thus, drawing on Toulmin’s (1958, 2003) model of informal reasoning, the authors present an example of a speaking test in Figure 1.3 (p. 7) to explain grounds, rebuttals, backings, and warrants. To complete the grounds of the argument, they draw on Mislevy et al. (2003) and propose that grounds can be divided into the properties of students and of test tasks, which also bridges the gap between the two test development perspectives mentioned above. Observations (grounds) are connected to conclusions (claims) through three inferences: evaluation, generalization, and extrapolation. Generalizing this structure to the TOEFL, the authors provide a figure (1.7) that entails six stages (p. 19). This extensive and very useful explanation provides examples of warrants, assumptions, backings, and inferences in the TOEFL interpretive argument. Overall, “By articulating assumptions underlying the warrants, the interpretive argument serves as a guide for the collection of the backing needed for support, and a preliminary argument validity can be drafted” (p. 23).
Chapter two, The Evaluation of the TOEFL (pp. 27-54), written by Carol A. Taylor and Paul Angelis, is a historical account of the TOEFL from 1961 to the present day. The chapter establishes its scope by asking four questions: “Did similar issues exist in 1961 when the first version of the TOEFL was conceived? How was the construct of language proficiency defined for the first TOEFL? What type of test tasks did the test consist of? How was validation approached?” (p. 27). Answering these questions broadens readers’ horizons: they learn how different language theories, such as that proposed by Carroll (1961), shaped the structure of the TOEFL. The first TOEFL is summarized in Table 2.1 (p. 29), and the test content and validation issues of this phase of test construction are mentioned in brief. The chapter then moves to the second and third revision programs of the test and arrives at the “Goal for the New TOEFL” presented on page 43. It concludes by pointing to the “advisory meeting” and the studies on the computer-based version of the test by experts and researchers.
Chapter three (pp. 55-96), Frameworks for the New TOEFL, is written by Joan M. Jamieson, Daniel Eignor, William Grabe, and Anthony John Kunnan. This chapter extends the previous one, taking up where chapter two leaves off, i.e., the 1990s. In this episode of the TOEFL history, “the challenge was to articulate a framework incorporating communicative competence to underlie score interpretation” (p. 55). As mentioned above, readers find cross-references to earlier and later chapters throughout the book, which makes the book uniform; this chapter, too, refers to the previous chapter to elaborate on score interpretation. All assumptions and inferences, which start from domain description (to define the domain in the new TOEFL, Jamieson, Jones, Kirsch, Mosenthal, and Taylor (2000) suggested considering situation, text material, and test rubric) and end in extrapolation, are raised and tabulated clearly. Frameworks are also proposed for the four language skills assessed in the new TOEFL: listening, reading, speaking, and writing. The use of Item Response Theory to handle polytomous data and the rationale behind the integrated tasks, which are unprecedented in the history of EAP testing, are two more issues explicated in this chapter. For readers who would like to learn about the TOEFL 2000 Framework, this chapter is the most informative section of the book.
Chapter four (pp. 97-143), written by Mary Enright et al., is entitled Prototyping New Assessment Tasks and starts with a summary of the “assumptions underlying warrant” in the TOEFL project at the stages of domain description, evaluation, and explanation. Following this short introduction, readers are presented with the general abilities in listening and reading along with the test/task types designed to measure these skills. For each skill, investigations relevant to its evaluation and explanation are discussed by touching on the main results of several studies; for instance, three studies are cited to support the explanation of the listening section of the test. This chapter is in fact a summary of the first prototyping phase of the new TOEFL. Prototyping does not stop here, however, and is pursued further in the subsequent prototyping stage, described in chapter five.
Chapter five (pp. 145-186), written by Mary Enright, Brent Bridgeman, Daniel Eignor, Yong-Won Lee, and Donald Powers, is entitled Prototyping Measures of Listening, Reading, Speaking, and Writing. The second prototyping phase took one year, from 2000 to 2001: “The research on this phase was motivated primarily by the need to support the generalization inference” (p. 145). Readers will find a report on the use of the test blueprint constructed in the first prototyping phase and on its expansion. The chapter sheds further light on the composition of the four skills tested in the TOEFL, e.g., the “appraisals of speaking samples by English language specialists” (p. 158) and the scoring criteria, and provides a summary accounting for the generalizability of the findings for each skill. Another element in this chapter is the viability of reporting a composite score, i.e., students’ total score on the test. Readers will benefit from the comprehensive Tables 5.13 and 5.14 (pp. 182-184), which provide a helpful summary of the results and findings of this phase of research.
Chapter six (pp. 187-225), Prototyping a New Test, by Kristen Huff, Donald Powers, Robert Kantor, Pamela Mollaun, Susan Nissan, and Mary Schedl, reports on a “field study of the proposed new test” (p. 187). Page 188 summarizes the questions addressed in this phase, which bear on all the assumptions underlying the inferences and warrants in “the TOEFL Interpretive Argument.” The chapter presents a field study with 3,000 participants using two forms of the new test as “concrete examples” of it. To me, one of the most interesting sections of this chapter is the report of the demographics and background characteristics of the field-study participants. Generalization and reliability of the results are inspected again, alongside a factor analysis of the prototype of the new TOEFL. A study of strategy use on the TOEFL reading section by Cohen and Upton (2006) is also briefly discussed to make a strong case for the new TOEFL structure.
Chapter seven (pp. 227-258), entitled Finalizing the Test Blueprint and authored by Mari Pearlman, starts off with a brief summary of what happened from 2002 onward, highlighting the fact that evidence-centered design (ECD) was employed, though not fully, to design the tasks in the new TOEFL. The remainder of the chapter provides a wealth of tables supporting this claim directly or indirectly. Accordingly, task shells along with examples are presented in tables, and the final blueprint of the four skills is laid out. Summaries of the listening, reading, speaking, and writing sections are finalized and presented clearly. The chapter further provides information on the iBT (Internet-Based Testing) system of test delivery, and online scoring networks are also briefly discussed.
Lin Wang, Daniel Eignor, and Mary Enright have written chapter eight, A Final Analysis (pp. 259-318), the lengthiest chapter in the book. It starts with a set of guiding questions spanning the evaluation phase through the utilization of the test scores. Each assumption underlying the inferences and warrants is examined carefully. The issue of computer familiarity is also considered in the report. Once more, the reliability and generalizability of the test are discussed, underscoring the importance of these issues in high-stakes testing. The “comparability of parallel speaking and writing task types” is also discussed, and a summary of this field study is then provided. The next section discusses all outcomes relevant to the explanation stage of the argument, with figures and tables presenting the relevant findings. The following section discusses the evidence relevant to the utilization of the test scores and interpretations: the scores from the Computer-Based TOEFL and the iBT TOEFL are linked, and a study of TOEFL preparation courses in Eastern Europe is mentioned. Score and scale descriptors for the four skills are also described.
Chapter nine (pp. 319-352), by Carol Chapelle, is entitled The TOEFL Validity Argument. It refers readers back to the first chapter for the definitions required to understand this section. The distinctive property of this chapter is its many figures. The main TOEFL interpretive argument is depicted in the first figure (p. 321), which briefly recapitulates the grounds (main and intermediate), inferences, and the conclusion, or claim, of the argument. Readers will also find the assumptions and the conclusion drawn from the domain definition inference. As a reader of this book, I agree with Chapelle’s closing remarks that “for the audience of applied linguists and measurement experts, the narrative of the TOEFL validity argument presented through this volume should demonstrate the validity of TOEFL score interpretation as an indicator of academic English language proficiency and score use for admissions decisions at English-medium universities” (p. 350). The book includes three appendices: Appendix A, 1995 Working Assumptions that Underlie an Initial TOEFL® 2000 Design Framework; Appendix B, Summary of 1995 Research Recommendations; and Appendix C, Timeline of TOEFL® Origins and the New TOEFL Project: Key Efforts and Decisions. The book closes with an index to its contents.
The content of the book is very cohesive, and readers will find answers to the questions that arise in their minds about the results. In some sections, however, the book falls short of providing a clear picture of the data analysis. In some cases, deviations from the competency-centered perspective are observed, even though the editors claim that the book’s approach is competency-centered. Nevertheless, these deviations have strengthened the validity argument; indeed, it is arguably more useful to merge task-centered and competency-centered approaches in validation (Buck, 2001).
Overall, the exhaustive review and clear explanation of a lengthy line of research reported in Building a Validity Argument for the Test of English as a Foreign Language is a valuable attempt to put into practice the new theoretical framework of argument-based validation and to document the alterations and treatments over the vicissitudes of at least 20 years, the offspring of which is the current iBT TOEFL. That the book (in fact, the projects reported in it) does not overlook the fact that the TOEFL is an international test of English for academic purposes is a plus that distinguishes the volume, and eventually the TOEFL, from other tests of English for academic purposes. This line of research navigates through the different phases of the TOEFL studies and finally concludes that the uses and interpretations of the test scores are sufficiently valid. The simplicity of the language used throughout the chapters also makes this book a good choice not only for researchers but also for other readers who wish to learn about validation.

References

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.
Buck, G. (2001). Assessing listening. Cambridge, UK: Cambridge University Press.
Cohen, A., & Upton, T. (2006). Strategies in responding to the new TOEFL reading tasks (TOEFL Monograph No. 33). Princeton, NJ: Educational Testing Service.
Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000 framework: A working paper (TOEFL Monograph No. 16). Princeton, NJ: Educational Testing Service.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342.
Kane, M. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21(1), 31–41.
Kane, M. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2, 135–170.
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.
Koenig, J. A., & Bachman, L. F. (Eds.). (2004). Keeping score for all: The effects of inclusion and accommodation policies on large-scale educational assessment. Washington, DC: National Research Council.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessment. Educational Researcher, 23(2), 13-23.
Mislevy, R. J. (2003). Argument substance and argument structure in educational assessment. Law, Probability and Risk, 2(4), 237–258.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62.
Toulmin, S. E. (1958). The uses of argument. Cambridge, UK: Cambridge University Press.
Toulmin, S. E. (2003). The uses of argument (Updated ed.). Cambridge, UK: Cambridge University Press.

