Making Classroom Assessments Reliable and Valid. Robert J. Marzano

in the classroom instruction and learning” (p. 161). Susan M. Brookhart (2013) explains that CAs can be a strong motivational tool when used appropriately. M. Christina Schneider, Karla L. Egan, and Marc W. Julian (2013) identify CA as one of three components of a comprehensive assessment system. Figure I.1 depicts the relationship among these three systems.

      Figure I.1: The three systems of assessment.

      As depicted in figure I.1, CAs are the first line of data about students. They provide ongoing evidence about students’ current status on specific topics derived from standards. Additionally, as figure I.1 depicts, CAs should be the most frequently used form of assessment.

      Next are interim assessments. Schneider and colleagues (2013) describe them as follows: “Interim assessments (sometimes referred to as benchmark assessments) are standardized, periodic assessments of students throughout a school year or subject course” (p. 58).

      Year-end assessments are the least frequent type of assessments employed in schools. Schneider and colleagues (2013) describe them in the following way:

      States administer year-end assessments to gauge how well schools and districts are performing with respect to the state standards. These tests are broad in scope because test content is cumulative and sampled across the state-level content standards to support inferences regarding how much a student can do in relation to all of the state standards. Simply stated, these are summative tests. The term year-end assessment can be a misnomer because these assessments are sometimes administered toward the end of a school year (usually in March or April) and sometimes during the first semester of the school year. (p. 59)

      While CAs have a prominent place in discussions about comprehensive assessments, they have continually exhibited weaknesses that limit their use or, at least, the confidence in their interpretation. For example, Cynthia Campbell (2013) notes that “research investigating evaluation practices of classroom teachers has consistently reported concerns about the adequacy of their assessment knowledge and skill” (p. 71). Campbell (2013) lists a variety of concerns about teachers’ design and use of CAs, including the following.

      ■ Teachers have little or no preparation for designing and using classroom assessments.

      ■ Teachers’ grading practices are idiosyncratic and erratic.

      ■ Teachers have erroneous beliefs about effective assessment.

      ■ Teachers make little use of the variety of assessment practices available.

      ■ Teachers don’t spend adequate time preparing and vetting classroom assessments.

      ■ Teachers’ evaluative judgments are generally imprecise.

      Clearly, CAs are important, and researchers widely acknowledge their potential role in the overall assessment scheme. But there are many issues that must be addressed before CAs can assume their rightful role in the education process.

      Almost all problems associated with CAs find their ultimate source in the concepts of reliability and validity. Reliability is generally described as the accuracy of a measurement. Validity is generally thought of as the extent to which an assessment measures what it purports to measure.

      Reliability and validity are related in a variety of ways (discussed in depth in subsequent chapters). Even on the surface, though, it makes intuitive sense that validity is probably the first order of business when designing an assessment; if a test doesn’t measure what it is supposed to measure, it is of little use. However, even if a test is designed with great attention to its validity, poor reliability can render that validity moot.

      An assessment’s validity can be limited or mediated by its reliability (Bonner, 2013; Parkes, 2013). For example, imagine you were trying to develop an instrument that measures weight. This is a pretty straightforward construct, in that weight is defined as the amount of gravitational pull on an object or the force on an object due to gravity. With this clear goal in mind, you create your own version of a scale, but unfortunately, it gives different measurements each time an object is placed on it. You put an object on it, and it indicates that the object weighs one pound. You take it off and put it on again, and it reads one and a half pounds. The third time, it reads three-quarters of a pound, and so on. Even though the measurement device was focused on weight, the score derived from the measurement process is so inaccurate (imprecise or unreliable) that it cannot be a true measure of weight. Hence, your scale cannot produce valid measures of weight even though you designed it for that specific purpose. Its reliability has limited its validity. This is probably the reason that reliability seems to receive the majority of the attention in discussions of CA. If a test is not reliable, its validity is negated.
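
      To ground the arithmetic of this example, here is a minimal sketch (my illustration, not the book’s) that models the scale’s readings under the classical assumption that each observed reading equals the true weight plus random error:

```python
import statistics

# A sketch of the unreliable scale: classical test theory models each
# reading as observed = true + error.
true_weight = 1.0            # the object's actual weight, in pounds
readings = [1.0, 1.5, 0.75]  # the three readings described in the text

errors = [round(r - true_weight, 2) for r in readings]
print("Readings (lb):", readings)
print("Errors around the true weight (lb):", errors)
print(f"Mean reading: {statistics.mean(readings):.2f} lb")
print(f"Spread of readings (stdev): {statistics.stdev(readings):.2f} lb")
# The spread (about 0.38 lb on a one-pound object) is so large relative
# to the quantity being measured that no single reading can stand in
# for the true weight: the scale's unreliability undermines its validity.
```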

      For CAs to take their rightful place in the assessment triad depicted in figure I.1, they must be both valid and reliable. This is not a new or shocking idea; however, reliability and validity for CAs must be thought of differently from how they are for large-scale assessments.

      Large-scale assessments are so different from CAs in structure and function that the paradigms for validity and reliability developed for large-scale assessments do not apply well to CAs. Some argue that, because of these differences, CAs should be held to a different standard altogether. For example, Jay Parkes (2013) notes, “There have also been those who argue that CAs … have such strong validity that we should tolerate low reliability” (p. 113).

      While I believe this is a defensible perspective, in this book I take the position that we should not simply ignore psychometric concepts related to validity and reliability. Rather, we should hold CAs accountable to high standards for both validity and reliability, but educators should reconceptualize those standards, and the psychometric constructs on which they are based, to fit the unique environment of the classroom. I also believe that technical advances in CA have been hindered by unquestioned adherence to the measurement paradigms developed for large-scale assessments.

      Even though validity is the first order of business when designing an assessment, I begin with a discussion of reliability because of the emphasis it receives in the literature on CAs. At its core, reliability refers to the accuracy of a measurement, where accuracy refers to how much or how little error exists in an individual score from an assessment. In practice, though, large-scale assessments represent reliability in terms of scores for groups of students as opposed to individual students. (For ease of discussion, I will use the terms large-scale and traditional as synonyms throughout the text.) As we shall see in chapter 4 (page 83), the conceptual formula for reliability in the large-scale assessment paradigm is based on differences in scores across multiple administrations of a test. Consider table I.1 to illustrate the traditional concept of reliability.
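
      As a preview of that conceptual formula, it can be sketched in its standard classical test theory form (the notation here is the conventional one and may differ from chapter 4’s):

```latex
% Classical test theory: an observed score X decomposes into a true
% score T and random error E.
X = T + E
% Reliability is the proportion of observed-score variance that is
% true-score variance; scores that shift across administrations inflate
% \sigma_E^2 and drive reliability down.
r_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
        = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

      Under this view, a test is reliable to the degree that differences among observed scores reflect differences among true scores rather than error.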

      The column Initial Administration reports the scores of ten students for the first administration of a specific test. (For ease of discussion, the scores are listed in rank order.) The next column, Second Administration (A), reports the same students’ scores on a second administration of the test; together, this column and the first represent a pattern of scores that indicates relatively high reliability for the test in question.
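
      In practice, the pattern of relatively high reliability that table I.1 illustrates is often summarized as a correlation between the two administrations. The sketch below uses hypothetical scores (not the values from table I.1, which this excerpt does not reproduce):

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical scores for ten students, listed in rank order as in
# table I.1 -- NOT the book's actual values.
initial  = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
second_a = [93, 91, 84, 82, 74, 71, 66, 58, 56, 49]

# When students keep nearly the same rank order across administrations,
# the correlation approaches 1.0 -- the pattern of relatively high
# reliability the text describes.
r = correlation(initial, second_a)
print(f"Test-retest reliability estimate: r = {r:.2f}")
```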
