Making Classroom Assessments Reliable and Valid. Robert J. Marzano
Observed score = true score (at the time of assessment) + error
What this equation adds to the basic equation from traditional assessment is that the true score for a particular student on a particular test is anchored to a particular time. A student's true score, then, changes from assessment to assessment. Time is now a factor in any analysis of the reliability of CAs, and there is no need to assume that students have not changed from assessment to assessment.
As we administer more CAs to a student on the same topic, we have more evidence about the student’s increasing true score. Additionally, we can track the student’s growth over time. Finally, using this time-based approach, the pattern of scores for an individual student can be analyzed mathematically to compile the best estimates of the student’s true scores on each of the tests in the set. Consider figure I.2.
Figure I.2: Linear trend for five scores over time from an individual student.
Note that there are five bars and a line cutting across those bars. The five vertical bars represent the individual student’s observed scores on five assessments administered on one topic over a given period of time (let’s say a nine-week grading period).
Normally, an average of these five scores is computed to represent the student's final score for the grading period. In this case, the average of the five scores is 78. This doesn't seem to reflect the student's learning, however, because three of the observed scores were higher than this average. Alternatively, the first four scores might be thought of as formative practice only. In this case, the last score of 84 is considered the summative score, and it would be the only one reported. But if we consider this single final assessment in isolation, we must also consider the error associated with it. As shown in table I.2, even if the assessment had a reliability coefficient of 0.85, we would have to add and subtract six points to be more certain of the student's true score. That range of scores within the 95 percent confidence interval would be 78 to 90.
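The six-point margin comes from the standard error of measurement, which combines the test's reliability with the spread of scores. As a minimal sketch, the following Python snippet computes the 95 percent confidence interval this way; the standard deviation of 8 is an assumed value (the text does not report it) chosen so the margin works out to roughly the six points described here.

```python
import math

def confidence_interval(observed, reliability, sd, z=1.96):
    """95 percent confidence interval around an observed score.

    Uses the standard error of measurement: SEM = SD * sqrt(1 - reliability),
    then adds and subtracts z * SEM from the observed score.
    """
    sem = sd * math.sqrt(1 - reliability)
    margin = z * sem
    return round(observed - margin), round(observed + margin)

# Observed summative score of 84 with a reliability coefficient of 0.85.
# The score standard deviation of 8 is an assumption for illustration.
low, high = confidence_interval(84, 0.85, sd=8)
print(low, high)  # 78 90
```

Note how a higher reliability coefficient shrinks the interval: with a perfectly reliable test (reliability of 1.0), the margin collapses to zero and the observed score equals the true score.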
Using the new paradigm for CAs and the new time-based equation, estimates of the true score on each assessment can be made. This is what the line cutting through the five bars represents. The student’s observed score on the first test was 71, but the estimated true score was 72. The second observed score was 75, as was the estimated true score, and so on.
We consider how this line and others are computed in depth in chapter 4 (page 83), but here the point is that analyzing sets of scores for the same student on the same topic over time allows us to make estimations of the student’s true scores as opposed to using the observed scores only. When we report a final summative score for the student, we can do so with much more assuredness. In this case, the observed final score of 84 is the same as the predicted score, but now we have the evidence of the previous four assessments to support the precision of that summative score.
This approach also allows us to see how much a student has learned. In this case, the student's first score was 71, and his last score was 84, for a gain of thirteen points. Finally, chapter 3 (page 59) presents ways to estimate students' true scores across a set of assessments without relying on complex mathematical calculations. I address the issue of measuring student growth in chapters 3 and 4. This book also presents formulas that allow educators to program readily available tools like Excel to perform all calculations.
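The trend line in figure I.2 can be computed with ordinary least squares, the same calculation Excel's SLOPE and INTERCEPT functions perform. The sketch below assumes the five observed scores are 71, 75, 79, 81, and 84; the middle values are illustrative guesses consistent with the figures quoted in this discussion (first score 71, last score 84, mean of 78).

```python
def linear_trend(scores):
    """Fit an ordinary least-squares line through (1, s1), ..., (n, sn)
    and return the predicted (estimated true) score at each assessment."""
    n = len(scores)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]

# Assumed observed scores; only 71, 84, and the mean of 78 are stated
# in the text, so the middle three values are illustrative.
observed = [71, 75, 79, 81, 84]
estimated = [round(y) for y in linear_trend(observed)]
print(estimated)  # [72, 75, 78, 81, 84]
```

With these inputs the rounded estimates match the values reported for figure I.2: an estimated true score of 72 against the first observed score of 71, 75 against the second, and a predicted final score of 84.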
The Large-Scale Assessment Paradigm for Validity
The general definition for the validity of an assessment is that it measures what it is designed to measure. For large-scale assessments, this tends to create a problem from the outset since most large-scale assessments are designed to measure entire subject areas for a particular grade level. For example, a state test in English language arts (ELA) at the eighth-grade level is designed to measure all the content taught at that level. A quick analysis of the content in eighth-grade ELA demonstrates the problem.
According to Robert J. Marzano, David C. Yanoski, Jan K. Hoegh, and Julia A. Simms (2013), there are seventy-three eighth-grade topics for ELA in the CCSS. Researchers and educators refer to these as elements. Each of these elements contains multiple embedded topics, which means that a large-scale assessment must have multiple sections to be considered a valid measure of those topics.
Of course, sampling techniques would allow large-scale test designers to address a smaller subset of the seventy-three elements. However, validity is still a concern. To cover even a representative sample of the important content would require a test that is too long to be of practical use. As an example, assume that a test was designed to measure thirty-five (about half) of the seventy-three ELA elements for grade 8. Even if each element had only five items, the test would still contain 175 items, far too many to administer in practice.
The New CA Paradigm for Validity
Relative to validity, CAs have an advantage over large-scale assessments in that they can and should be focused on a single topic (technically referred to as a single dimension). In fact, making assessments highly focused in terms of the content they address is a long-standing recommendation from the assessment community to increase validity (see Kane, 2011; Reckase, 1995). This makes intuitive sense. Since CAs will generally focus on one topic or dimension over a relatively short period, teachers can more easily ensure that they have acceptable levels of validity. Indeed, recall from the previous discussion that some measurement experts contend that CAs have such high levels of validity that we should not be concerned about their seemingly poor reliability.
The aspect of CA validity that is more difficult to address is that all tests within a set must measure precisely the same topic and contain items at the same levels of difficulty. This requirement is obvious if one examines the scores depicted in figure I.2. If these scores are to truly depict a given student's increase in his or her true score for the topic being measured, then educators must design the tests to be as identical as possible. If, for example, the fourth test in figure I.2 is much more difficult than the third test, a given student's observed score on that fourth test will be lower than the score on the third test even though the student's true score has increased (the student has learned relative to the topic of the tests).
Sets of tests designed to be close to one another in the topic measured and the levels of difficulty of the items are referred to as parallel tests. In more technical terms, parallel tests measure the same topic and have the same types of items both in format and difficulty levels. I address how to design parallel tests in depth in chapters 2 and 3 (pages 39 and 59, respectively). Briefly, though, the more specific teachers are regarding the content students are to master and the various levels of difficulty, the easier it is for them to design parallel tests. To do this, a teacher designing a test must describe in adequate detail not only the content that demonstrates proficiency for a specific standard