Making Classroom Assessments Reliable and Valid. Robert J. Marzano
To understand this pattern, one must imagine that the second administration happened right after the initial administration, but somehow students forgot how they answered the items the first time. In fact, it’s best to imagine that students forgot they took the test in the first place. Although this is impossible in real life, it is a basic theoretical underpinning of the traditional concept of reliability—the pattern of scores that would occur across students over multiple replications of the same assessment. Lee J. Cronbach and Richard J. Shavelson (2004) explain this unusual assumption in the following way:
If, hypothetically, we could apply the instrument twice and on the second occasion have the person unchanged and without memory of his first experience, then the consistency of the two identical measurements would indicate the uncertainty due to measurement error. (p. 394)
If a test is reliable, one would expect students to get close to the same scores on the second administration of the test as they did on the first. As depicted in Second Administration (A), this is basically the case. Even though only two students received exactly the same score, all scores in the second administration were very close to their counterparts in the first.
If a test is unreliable, however, one would expect students to receive scores on the second administration that are substantially different from those they received on the first. This is depicted in the column Second Administration (B). Notice that on this hypothetical second administration, students' scores vary greatly from their scores on the first.
Table I.1 demonstrates, at a conceptual level, the general process of determining reliability from a traditional perspective. If the pattern of variation in scores among students is the same from one administration of a test to another, then the test is deemed reliable. If the pattern of variation changes from administration to administration, the test is not considered reliable. Of course, administrations of the same test to the same students without students remembering their previous answers don't occur in real life. Consequently, measurement experts (called psychometricians) have developed formulas that provide reliability estimates from a single administration of a test. I discuss this in chapter 3 (page 59).
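One widely used single-administration estimate is coefficient alpha (Cronbach's alpha). The sketch below shows that computation for a hypothetical students-by-items score matrix; the data are illustrative only, and chapter 3 treats such formulas in detail.

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha for a students-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_var = scores.var(axis=0, ddof=1).sum()      # sum of the item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of students' total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical scores for six students on four items (1 = correct, 0 = incorrect)
items = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(coefficient_alpha(items), 2))  # about 0.66 for these illustrative data
```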
Next, we consider the equation for a single score, as well as the reliability coefficient.
The Equation for a Single Score
While the large-scale paradigm considers reliability from the perspective of a pattern of scores for groups of students across multiple test administrations, it is also based on the assumption that scores for individual students contain some amount of error. Error may be due to careless mistakes on the part of students, on the part of those administering and scoring the test, or both. Such error is referred to as random measurement error, and it is an anticipated part of any assessment (Frisbie, 1988). Random error can either increase the score a student receives (referred to as the observed score) or decrease the score a student receives. To represent this, the conceptual equation for an individual score within the traditional paradigm is:
Observed score = true score + error score
The true score is the score a test taker would receive if there were no random errors from the test or the test taker. In effect, the equation implies that when anyone receives a score on any type of assessment, there is no guarantee that the score the test taker receives (that is, the observed score) is the true score. The true score might be slightly or greatly higher or lower than the observed score.
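To make the equation concrete, the following sketch simulates five replications of the same assessment for one student. The true score of 75 and the error standard deviation of 4 are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 75.0                          # hypothetical error-free score
error_sd = 4.0                             # assumed spread of random measurement error
error = rng.normal(0.0, error_sd, size=5)  # random error score for each replication
observed = true_score + error              # observed score = true score + error score

print(np.round(observed, 1))  # each observed score drifts above or below 75
```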
The Reliability Coefficient
The reliability of an assessment from the traditional perspective is commonly expressed as an index of reliability—also referred to as the reliability coefficient (Kelley, 1942). Such a coefficient ranges from 0.00 to 1.00, with 1.00 meaning there is no random error operating in an assessment and 0.00 indicating that the test scores consist entirely of random error. While there are no published tests with a reliability of 1.00 (simply because it's impossible to construct such a test), there are also none published with a reliability even remotely close to 0.00. Indeed, David A. Frisbie (1988) notes that most published tests have reliabilities of about 0.90, but most teacher-designed tests have much lower reliabilities of about 0.50. Others have reported higher reliabilities for teacher-designed assessments (for example, Kinyua & Okunya, 2014). Leonard S. Feldt and Robert L. Brennan (1993) add a cautionary note to the practice of judging an assessment from its reliability coefficient:
Although all such standards are arbitrary, most users believe, with considerable support from textbook authors, that instruments with coefficients lower than 0.70 are not well suited to individual student evaluations. Although one may quarrel with any standard of this sort, many knowledgeable test users adjust their level of confidence in measurement data as a hazy function of the magnitude of the reliability coefficient. (p. 106)
As discussed earlier, the reliability coefficient tells us how much a set of scores for the same students would differ from administration to administration, but it tells us very little about the scores for individual students. The only way to examine the precision of individual scores is to calculate a confidence interval around the observed scores. Confidence intervals are described in detail in technical note I.1 (page 110), but conceptually they can be illustrated rather easily. To do so, table I.2 depicts the 95 percent confidence interval around an observed score of seventy-five out of one hundred points for tests with reliabilities ranging from 0.55 to 0.85.
Table I.2: Ninety-Five Percent Confidence Intervals for Observed Score of 75
Note: The standard deviation of this test was 8.33 and the upper and lower limits have been rounded.
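Conceptually, these intervals follow from the standard error of measurement. The sketch below is a minimal illustration assuming the conventional formula SEM = SD × √(1 − reliability) and normally distributed error (technical note I.1 gives the full treatment); it reproduces the intervals for reliabilities of 0.85 and 0.55 using the test's standard deviation of 8.33.

```python
import math

def confidence_interval_95(observed, sd, reliability):
    """95 percent confidence interval around an observed score."""
    sem = sd * math.sqrt(1 - reliability)  # standard error of measurement
    margin = 1.96 * sem                    # 1.96 SEMs cover 95 percent of a normal distribution
    return observed - margin, observed + margin

# Reproduce two rows of table I.2 (observed score of 75, standard deviation of 8.33)
for r in (0.85, 0.55):
    low, high = confidence_interval_95(75, 8.33, r)
    print(f"reliability {r:.2f}: {round(low)} to {round(high)}")
# reliability 0.85: 69 to 81
# reliability 0.55: 64 to 86
```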
Table I.2 depicts a rather disappointing situation. Even when a test has a reliability of 0.85, an observed score of 75 has a 95 percent confidence interval of 69 to 81. When the reliability is as low as 0.55, that confidence interval stretches from 64 to 86. From this perspective, CAs appear almost useless in that they have so much random error associated with them. Fortunately, there is another perspective on reliability that can render CAs more precise and, therefore, more useful.
The New CA Paradigm for Reliability
As long as the reliabilities of CAs are determined using coefficients of reliability based on formulas that examine the difference in patterns of scores between students, there is little chance that teachers will be able to demonstrate the precision of their assessments for individual students. These traditional formulas typically require a great many items and a great many examinees to yield meaningful results, yet classroom teachers usually have relatively few items on their tests, which are administered to relatively few students.
This problem is solved, however, if we consider CAs in sets administered over time. The perspective of reliability calculated from sets of assessments administered over time has been in the literature for decades (see Rogosa, Brandt, & Zimowski, 1982; Willett, 1985, 1988). Specifically, a central tenet of this book is that one should examine the reliability of CAs from the perspective of groups of assessments on the same topic administered over time (as opposed to a single assessment at one point in time). To illustrate, consider the following five scores, each from a separate assessment on the same topic, administered to a specific student over time (such as a grading period): 71, 75, 81, 79, 84.
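As one illustration of examining such a set rather than a single score, the sketch below fits a simple least-squares trend line to the five scores. Treating the trend as an estimate of the student's growth is one way a set like this might be summarized; it is offered here as an assumption-laden sketch, not as the specific procedure developed later in the book.

```python
import numpy as np

scores = np.array([71, 75, 81, 79, 84], dtype=float)  # five assessments on one topic
occasions = np.arange(1, len(scores) + 1)             # order of administration

# Least-squares line: predicted score = intercept + slope * occasion
slope, intercept = np.polyfit(occasions, scores, deg=1)
trend = intercept + slope * occasions

print(f"slope per assessment: {slope:.1f}")   # 3.0 points of growth per assessment
print("trend estimates:", np.round(trend, 1)) # 72, 75, 78, 81, 84
```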