The New Art and Science of Classroom Assessment. Robert J. Marzano


can actually be more precise than external assessments when it comes to examining the performance of individual students.

      This chapter outlines the facts supporting our position. In the remaining chapters, we fill in the details about how educators can design and use classroom assessments to fulfill their considerable promise.

      It is important to remember that all three types of assessment we depict in figure I.1 have important roles in the overall process of assessing students. To be clear, we are not arguing that educators should discontinue or discount year-end and interim assessments in favor of classroom assessments. We are asserting that of the three types of assessment, classroom assessments should be the most important source of information regarding the status and growth of individual students.

      We begin by discussing the precision of externally designed assessments.

      Externally designed assessments, like year-end and interim assessments, typically follow the tenets of classical test theory (CTT), which dates back at least to the early 1900s (see Thorndike, 1904). At its core, CTT proposes that all assessments contain a certain degree of error, as the following equation shows.

      Observed Score = True Score + Error Score

      This equation indicates that the score a test taker receives (the observed score) on any type of assessment comprises two components: a true component and an error component. The true component (the true score) is what a test taker would receive under ideal conditions, meaning the test is perfectly designed and the situation in which students take the test is optimal. The error component (the error score) represents factors that can artificially inflate or deflate the observed score. For example, the test taker might guess correctly on a number of items, which would artificially inflate the observed score, or might misinterpret a few items for which he or she actually knows the correct answers, which would artificially deflate the observed score.
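The equation can be illustrated with a small simulation. In the sketch below, the true score, the error distribution, and the number of simulated administrations are all invented for demonstration; CTT does not prescribe particular values, only that error is random and centered on zero:

```python
import random

random.seed(1)

true_score = 70           # score under ideal conditions (hypothetical value)
n_administrations = 10000

# Error is modeled as random noise centered on zero: sometimes it inflates
# the observed score (lucky guesses), sometimes it deflates it (misread
# items). A normal distribution is a common modeling choice, not a rule.
observed = [true_score + random.gauss(0, 5) for _ in range(n_administrations)]

# Because error averages out to zero, the mean observed score across many
# administrations approaches the true score, even though any single
# observed score may sit well above or below it.
mean_observed = sum(observed) / len(observed)
print(f"mean observed score over {n_administrations} administrations: "
      f"{mean_observed:.1f}")
```

Any single entry in `observed` is what a real test reports; the true score is never directly visible.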

      Most externally designed assessments have reliabilities of about 0.85 or higher. Unfortunately, even with a relatively high reliability, the information a test provides about individuals has a great deal of error in it, as figure I.2 shows.

      Note: The standard deviation of this test was 15, and the upper and lower limits have been rounded.

      Figure I.2 depicts the degree of precision of individual students’ scores across five levels of reliability: 0.45, 0.55, 0.65, 0.75, and 0.85. These levels represent the range of reliabilities one can expect for assessments students will see in K–12 classrooms. At the low end are assessments with reliabilities of 0.45. These might be hastily designed assessments that teachers create. At the high end are externally designed assessments with reliabilities of 0.85 or even higher. The second column represents the observed score, which is 70 in all situations. The third and fourth columns represent the lower limit and upper limit of a band of scores into which we can be 95 percent sure that the true score falls. The range represents the size of the 95 percent confidence interval.

      The pattern of scores in figure I.2 indicates that as reliability goes down, one has less and less confidence in the accuracy of the observed score for an individual student. For example, if the reliability of an assessment is 0.85, we can be 95 percent sure that the student’s true score is somewhere between eleven points lower than the observed score and eleven points higher than the observed score, for a range of twenty-two points. However, if the reliability of an assessment is 0.55, we can be 95 percent sure that the true score is somewhere between twenty points lower than the observed score and twenty points higher than the observed score, for a range of forty points.
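These bands follow from the standard error of measurement (SEM). Under CTT, SEM = SD × √(1 − reliability), and the 95 percent confidence interval spans roughly ±1.96 SEM around the observed score. Using the standard deviation of 15 reported in figure I.2's note, a quick check reproduces the figure's rounded limits:

```python
import math

def ci_half_width(sd, reliability, z=1.96):
    """Half-width of the 95% confidence interval around an observed
    score under classical test theory."""
    sem = sd * math.sqrt(1 - reliability)  # standard error of measurement
    return z * sem

sd = 15  # standard deviation reported in figure I.2's note
for r in (0.45, 0.55, 0.65, 0.75, 0.85):
    half = ci_half_width(sd, r)
    print(f"reliability {r:.2f}: observed 70, "
          f"true score in 70 \u00b1 {half:.0f} points (95% confidence)")
```

At reliability 0.85 the half-width rounds to eleven points, and at 0.55 it rounds to twenty, matching the ranges the text describes.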

      These facts have massive implications for how we design and interpret assessments. Consider the practice of using one test to determine if a student is competent in a specific topic. If the test has a reliability of 0.85, an individual student’s true score could be eleven points higher or lower than the observed score. If the test has a reliability of 0.55, an individual student’s true score could be twenty points higher or lower than the observed score. Making the situation worse, in both cases we are only 95 percent sure the true score is within the identified lower and upper limits. We cannot overstate the importance of this point. All too often and in the name of summative assessment, teachers use a single test to determine if a student is proficient in a specific topic. If a student’s observed score is equal to or greater than a set cut score, teachers consider the student to be proficient. If a student’s score is below the set cut score, even by a single point, teachers consider the student not to be proficient.
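A more defensible reading of a single score would at least report the band around it before making a proficiency call. The sketch below is a hypothetical illustration (the cut score, observed score, and reliability are invented for demonstration, not taken from the book):

```python
import math

def proficiency_call(observed, cut, sd, reliability, z=1.96):
    """Classify a score against a cut score, flagging decisions the
    95% confidence interval cannot actually support."""
    sem = sd * math.sqrt(1 - reliability)
    lower, upper = observed - z * sem, observed + z * sem
    if lower >= cut:
        return "proficient"
    if upper < cut:
        return "not proficient"
    return "uncertain"  # the cut score falls inside the interval

# With a cut score of 70 and reliability 0.85 (SD = 15), an observed 72
# looks "proficient" at face value, but the band around it (roughly
# 61 to 83) straddles the cut, so one test cannot settle the question.
print(proficiency_call(72, 70, 15, 0.85))
```

Only scores whose entire interval clears (or misses) the cut score support a confident call; scores near the cut, even a point above or below it, do not.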

      Examining figure I.2 commonly prompts the question, Why are assessments so imprecise regarding the scores for individual students even if they have relatively high reliabilities? The answer to this question is simple. Test makers designed and developed CTT with the purpose of scoring groups of students as opposed to scoring individual students. Reliability coefficients, then, tell us how similar or different groups of scores would be if students retook a test. They cannot tell us about the variation in scores for individuals. Lee J. Cronbach (the creator of coefficient alpha, one of the most popular reliability indices) and his colleague Richard J. Shavelson (2004) strongly emphasize this point when they refer to reliability coefficients as “crude devices” (p. 394) that really don’t tell us much about individual test takers.
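The group-versus-individual distinction can be simulated: give each student a fixed true score, add independent random error to each "administration," and correlate the two sets of scores. All numbers below are invented for illustration (the variances are chosen so that true-score variance is about 85 percent of total variance):

```python
import random

random.seed(42)

n_students = 1000
sd_true, sd_error = 13.9, 5.8  # chosen so reliability is roughly 0.85

true_scores = [random.gauss(70, sd_true) for _ in range(n_students)]
first = [t + random.gauss(0, sd_error) for t in true_scores]
second = [t + random.gauss(0, sd_error) for t in true_scores]

def pearson(xs, ys):
    """Pearson correlation, the classical test-retest reliability index."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Group-level agreement between the two administrations is high...
print(f"test-retest correlation: {pearson(first, second):.2f}")
# ...yet an individual student's two scores can still differ by many points.
max_swing = max(abs(a - b) for a, b in zip(first, second))
print(f"largest individual score change: {max_swing:.0f} points")
```

The correlation summarizes how the group's rank ordering holds up across administrations; it says nothing about how far any one student's two scores sit apart, which is Cronbach and Shavelson's (2004) point.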

      To illustrate what reliability coefficients tell us, consider figure I.3.

       Source: Marzano, 2018, p. 62.

      Figure I.3 illustrates precisely what a traditional reliability coefficient means. The first column, Initial Administration, reports the scores of ten students on a specific test. The second column, Second Administration (A), represents the scores from the same students after they have taken the test again. But before students took the test the second time, they forgot that they had taken it the first time, so the items appear new to them. While this cannot occur in real life and seems
