Introduction to Abnormal Child and Adolescent Psychology. Robert Weis
The BASC-3 can be completed by parents, teachers, or older children and adolescents to obtain an overall estimate of behavior problems and adaptive functioning.
Many tests of personality and social–emotional functioning yield T scores with a mean of 50 and standard deviation of 10.
What Makes a Good Psychological Test?
There are many kinds of psychological tests, but not all tests are created equal. The accuracy of a clinician’s diagnosis and recommendations for treatment depends on the quality of the tests the clinician selects and the manner in which they are used. In this section, we will discuss the three most important features of evidence-based testing: (1) standardization, (2) reliability, and (3) validity.
Standardization
Most tests used in clinical settings follow some sort of standardization; that is, they are administered, scored, and interpreted in the same way for all examinees. For example, all 7-year-old children who take the WISC–V are administered the same test items. Items are presented in the same way to all children according to specific rules described in the test manual. These rules include where participants must sit, how instructions must be presented, how much time is allowed, and what sort of help (if any) examiners can provide. Children’s answers are scored in the same way, using specific guidelines presented in the manual (Wechsler et al., 2014).
Standardized test administration and scoring allow clinicians to compare one child’s test scores with the performance of his or her peers. Two children who answer the same number of items correctly on an intelligence test are believed to have comparable levels of cognitive functioning only if they were administered the test in a standardized fashion. If one child was given extra time, additional help, or greater encouragement by the examiner, comparisons would be inappropriate.
Most standardized tests, like the WISC–V, are norm-referenced. Norm-referenced tests allow clinicians to quantify the degree to which a specific child is similar to other youths of the same age, grade, and/or gender. These tests are called norm-referenced because the child is compared to a normative sample of children, a large group of youths whose demographics reflect a larger population, such as all children in the United States or children with ADHD. Examples of norm-referenced tests include intelligence tests, personality tests, and behavior rating scales (Achenbach, 2015).
Children’s scores on norm-referenced tests are compared to the performance of other children to make these scores more meaningful. Imagine that a 9-year-old girl correctly answers 45 questions on the WISC–V. A clinician would record her “raw score” as 45. However, a raw score of 45 alone does not allow the clinician to determine whether the girl is intellectually gifted, average, or delayed. To interpret her raw score, the clinician needs to compare it to the raw scores of children in the normative sample, that is, other children who have already completed the WISC–V. If the mean raw score for 9-year-olds in the normative sample was 45 and the girl’s raw score was 45, the clinician might conclude that the girl’s cognitive functioning is within the average range. However, if the mean raw score for 9-year-old children was 30 and the girl’s raw score was 45, the clinician might conclude that the girl has above-average cognitive functioning.
The results of norm-referenced testing, therefore, depend greatly on the comparison of the individual child with the normative sample. At a minimum, comparisons are made based on children’s age. For example, on measures of intelligence, 9-year-old children must be compared to other 9-year-old children, not to 6-year-old children or to 12-year-old children. On other psychological tests, especially tests of behavior and social–emotional functioning, comparisons are made based on age and gender. For example, boys tend to show more symptoms of hyperactivity than do girls. Consequently, when a clinician obtains parents’ ratings of hyperactivity for a 9-year-old boy, he compares these ratings to the ratings for other 9-year-old boys in the normative sample (Achenbach, 2015).
Usually, clinicians want to quantify the degree to which children score above or below the mean for the normative sample. To quantify children’s deviation from the mean, clinicians transform the child’s raw test score to a standard score. A standard score is simply a raw score that has been changed to a different scale with a designated mean and standard deviation. For example, scores on intelligence tests have a mean of 100 and a standard deviation of 15. A child with an FSIQ of 100 would fall squarely within the average range compared to other children his age, whereas a child with an FSIQ of 115, one standard deviation above the mean, would be considered above average.
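The arithmetic behind a standard score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name to_standard_score is hypothetical, the normative standard deviation of 15 raw-score points is invented for the example (the text gives only the normative mean of 30), and published tests typically rely on norm tables in the manual rather than a simple linear formula.

```python
def to_standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Rescale a raw score to a standard score by way of its z-score."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    # z-score: how many normative standard deviations the raw score
    # lies above (positive) or below (negative) the normative mean
    z = (raw - norm_mean) / norm_sd
    # Place the z-score on the target scale
    return scale_mean + scale_sd * z


# Hypothetical norms for 9-year-olds: mean raw score 30, SD 15 (SD is assumed)
iq_style = to_standard_score(45, norm_mean=30, norm_sd=15)
# Same raw score on a T-score scale (mean 50, SD 10)
t_style = to_standard_score(45, norm_mean=30, norm_sd=15,
                            scale_mean=50, scale_sd=10)

print(iq_style)  # 115.0
print(t_style)   # 60.0
```

Under these assumed norms, the girl's raw score of 45 sits one standard deviation above the mean, which corresponds to 115 on an intelligence-test scale and 60 on a T-score scale.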
Reliability
Reliability refers to the consistency of a psychological test. Reliable tests yield consistent scores over time and across administrations. Although there are many types of reliability, the three most common are test–retest reliability, inter-rater reliability, and internal consistency (Hogan & Tsushima, 2018).
Test–retest reliability refers to the consistency of test scores over time. Imagine that you purchase a Fitbit to help you get into shape. You wear the Fitbit each morning while walking to your first class. If the number of steps estimated by the Fitbit is approximately the same each day, we would say that the Fitbit shows high test–retest reliability. The device yields consistent scores across repeated administrations. Psychological tests should also have high test–retest reliability. A child who earns an FSIQ of 110 should earn a similar FSIQ score several months later.
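One common way to put a number on test–retest reliability is to correlate the scores from two administrations of the same test. The sketch below computes a Pearson correlation by hand; the choice of Pearson's r is an assumption (the text does not name a statistic), and the FSIQ scores for five children are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary within each administration")
    return sxy / sqrt(sxx * syy)


# Hypothetical FSIQ scores for the same five children, tested months apart
time_1 = [110, 95, 102, 88, 120]
time_2 = [108, 97, 100, 90, 118]

retest_r = pearson_r(time_1, time_2)
print(round(retest_r, 2))
```

A value close to 1.0 means that children kept roughly the same rank order and spacing across the two administrations. Note that a correlation can in principle be negative, although reliability coefficients are usually reported between 0 and 1.0.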
Inter-rater reliability refers to the consistency of test scores across two or more raters or observers. Imagine that you are affluent enough to own a Fitbit and a Garmin to measure your daily activity, one on each wrist. If the number of steps were similar for each device, we would say that the devices showed excellent inter-rater reliability; they agree with each other. Similarly, psychological tests should show high inter-rater reliability. For example, on portions of the WISC–V, psychologists assign points based on the thoroughness of children’s answers. If a child defines an elephant as an animal, she might earn 1 point, whereas if she defines it as an animal with four legs, a trunk, and large ears, she might earn 2 points. Different psychologists should assign the same points for the same response, showing high inter-rater reliability.
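Agreement between two raters can be quantified in several ways. The sketch below uses Cohen's kappa, which corrects the raw proportion of agreement for the agreement expected by chance. Kappa is an assumed choice here (the text does not name a statistic), and the point values assigned by the two psychologists are invented.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same responses."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: share of responses given identical points
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's category proportions
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical points (0, 1, or 2) assigned to eight answers by two psychologists
psychologist_a = [2, 1, 0, 2, 1, 1, 0, 2]
psychologist_b = [2, 1, 0, 2, 1, 2, 0, 2]

kappa = cohen_kappa(psychologist_a, psychologist_b)
print(round(kappa, 2))  # 0.81
```

In this invented example the two psychologists disagree on only one of eight answers, so the raw agreement is 0.875 and kappa is about 0.81 after the chance correction.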
Internal consistency refers to the degree to which test items yield consistent scores. Imagine that you want to obtain an estimate of your physical activity using your Fitbit. You decide to measure activity in three ways: (1) using the Fitbit’s step count, (2) using GPS data, and (3) by manually recording your activity. If you exercise a lot that day, all three scores should be high, because they all measure the same construct (i.e., activity). On the other hand, if you are sedentary that day, all three scores should be low. Such data would indicate good internal consistency; items measuring the same construct should yield consistent results.
Psychological tests should also have high internal consistency. For example, the WISC–V verbal comprehension tests show very high internal consistency. Children with excellent verbal skills tend to answer most test items correctly, whereas children with lower verbal skills tend to struggle on these items. High internal consistency suggests that items on the verbal comprehension index measure the same construct (i.e., verbal comprehension) and not other constructs such as the child’s visual–spatial skills or memory.
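Internal consistency is often summarized with Cronbach's alpha, which compares the variance of individual items with the variance of examinees' total scores. The sketch below is a minimal illustration: the use of alpha is an assumption (the text does not name a statistic), and the item scores for five children on four items are invented.

```python
from statistics import variance


def cronbach_alpha(score_rows):
    """Cronbach's alpha; each row holds one examinee's item scores."""
    if len(score_rows) < 2:
        raise ValueError("need at least two examinees")
    k = len(score_rows[0])
    if k < 2 or any(len(row) != k for row in score_rows):
        raise ValueError("each examinee needs the same 2 or more items")
    # Transpose so that each tuple holds all examinees' scores on one item
    item_columns = list(zip(*score_rows))
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in score_rows])
    if total_variance == 0:
        raise ValueError("total scores must vary across examinees")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical scores (0 to 2 points) for five children on four verbal items
scores = [
    [2, 2, 2, 1],
    [1, 1, 2, 1],
    [0, 1, 0, 0],
    [2, 1, 2, 2],
    [0, 0, 1, 0],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # 0.9
```

Here children who score high on one item tend to score high on the others, so the total scores vary much more than the items do individually and alpha comes out near 0.90.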
Reliability can be quantified using a coefficient ranging from 0 to 1.0. A reliability coefficient of 1.0 indicates perfect consistency. What constitutes “acceptable” reliability varies depending on the type of reliability and construct the test is measuring. For example, tests that assess traits that are believed