The National Assessment of Educational Progress (NAEP) began in 1969 and “was part of the same general trend toward large-scale data gathering” (Shepard, 2008, p. 27). However, researchers and policymakers designed NAEP for program evaluation as opposed to individual student performance evaluation.
The need to gather and use data about individual students gave rise to minimum competency testing in the United States. The movement spread quickly, and by 1980 “all states had a minimum competency testing program or a state testing program of some kind” (Shepard, 2008, p. 31). But this effort, too, ran aground because of the time and resources large-scale competency tests required.
The next wave of school reform was the “excellence movement” spawned by the high-visibility report A Nation at Risk (National Commission on Excellence in Education, 1983). The report cited low standards and a watered-down curriculum as reasons for the lackluster performance of U.S. schools. It also faulted the minimum competency movement, noting that a focus on minimum requirements distracted educators from the more noble and appropriate goal of maximizing students’ competencies.
Fueled by these criticisms, researchers and policymakers focused on the identification of rigorous and challenging standards for all students in the core subject areas. Standards work in mathematics set the tone for the reform:
Leading the way, the National Council of Teachers of Mathematics report on Curriculum and Evaluation Standards for School Mathematics (1989) expanded the purview of elementary school mathematics to include geometry and spatial sense, measurement, statistics and probability, and patterns and relationships, and at the same time emphasized problem solving, communication, mathematical reasoning, and mathematical connections rather than computation and rote activities. (Shepard, 2008, p. 35)
By the early 1990s, virtually every major academic subject area had sample standards for K–12 education.
Shepard (2008) notes that standards-based reform, begun in the 1990s, “is the most enduring of test-based accountability reforms” (p. 37). However, she also cautions that the version of this reform enacted in No Child Left Behind (NCLB) “contradicts core principles of the standards movement,” mostly because the assessments associated with NCLB placed too little focus on the application and use of knowledge reflected in the standards researchers had developed (Shepard, 2008, p. 37). Moreover, the accountability system that accompanied NCLB centered on rewards and punishments.
The new century brought an emphasis on testing tightly focused on standards. In 2009, the National Governors Association Center for Best Practices (NGA) and the Council of Chief State School Officers (CCSSO) partnered in “a state-led process that [drew] evidence and [led] to development and adoption of a common core of state standards … in English language arts and mathematics for grades K–12” (as cited in Rothman, 2011, p. 62). This effort, referred to as the Common Core State Standards (CCSS), resulted in the establishment of two state consortia tasked with designing new assessments aligned to the standards. One consortium was the Partnership for Assessment of Readiness for College and Careers (PARCC); the other was the Smarter Balanced Assessment Consortium (SBAC):
Each consortium planned to offer several different kinds of assessments aligned to the CCSS, including year-end summative assessments, interim or benchmark assessments (used throughout the school year), and resources that teachers could use for formative assessment in the classroom. In addition to being computer-administered, these new assessments would include performance tasks, which require students to demonstrate a skill or procedure or create a product. (Marzano, Yanoski, Hoegh, & Simms, 2013, p. 7)
These efforts are still under way, although the assessments are now less widely used than they were at their inception.
Next, I discuss abuses of large-scale assessments that occurred in the first half of the 20th century (Houts, 1977). To illustrate the nature and extent of these abuses, consider the first practical intelligence test, which Alfred Binet developed in 1905. It was grounded in the theory that intelligence was not a fixed entity; rather, educators could remediate low intelligence if they identified it. As Leon J. Kamin (1977) notes, Binet’s book on the nature and use of his test includes a chapter, “The Training of Intelligence,” in which he outlines educational interventions for those who scored low on his test. There was clearly an implied focus on helping low-performing students. It wasn’t until Lewis M. Terman’s (1916) Americanized version, the Stanford-Binet test, that the concept of IQ solidified as a fixed entity with little or no chance of improvement. Consequently, educators would use the IQ test to identify students with low intelligence so they could monitor and deal with them accordingly. Terman (1916) notes:
In the near future intelligence tests will bring tens of thousands of these high-grade defectives under the surveillance and protection of society. This will ultimately result in curtailing the reproduction of feeble-mindedness and in the elimination of an enormous amount of crime, pauperism, and industrial inefficiency. It is hardly necessary to emphasize that the high-grade cases, of the type now so frequently overlooked, are precisely the ones whose guardianship it is most important for the State to assume. (pp. 6–7)
The perspective that Lewis Terman articulated became widespread in the United States and led to the Army Alpha test, developed by Arthur Otis, one of Terman’s students. According to Kamin (1977), performance scores for 125,000 draftees were analyzed and published in 1921 by the National Academy of Sciences in a report titled Memoirs of the National Academy of Sciences: Psychological Examining in the United States Army (Yerkes, 1921). The report contains the chapter “Relation of Intelligence Ratings to Nativity,” which focuses on an analysis of about twelve thousand draftees who reported that they were born outside of the United States. Each draftee was assigned a letter grade from A to E, and the distribution of these letter grades was analyzed for each country of birth. The report notes:
The range of differences between the countries is a very wide one …. In general, the Scandinavian and English speaking countries stand high in the list, while the Slavic and Latin countries stand low … the countries tend to fall into two groups: Canada, Great Britain, the Scandinavian and Teutonic countries … [as opposed to] the Latin and Slavic countries. (Yerkes, 1921, p. 699)
Clearly, perspectives on intelligence have changed dramatically, and large-scale assessments have come a long way in their use of test scores since the early part of the 20th century. Yet even now, the mere mention of the terms large-scale assessment or standardized assessment prompts criticisms to which assessment experts must respond (see Phelps, 2009).
The Place of Classroom Assessment
An obvious question is, What is the rightful place of CA? Discussions regarding current uses of CA typically emphasize its inherent value and the advantages it provides over large-scale assessments. For example, McMillan (2013b) notes:
It is more than mere measurement or quantification of student performance. CA connects learning targets to effective assessment practices teachers use in their classrooms to monitor and improve student learning. When CA is integrated with and related to learning, motivation, and curriculum it both educates students and improves their learning. (p. 4)
Bruce Randel and Tedra Clark