The new book, Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies, demonstrates the danger of making public policy decisions without consulting the evidence. Its author, Howard Wainer, examines the use of student standardized test scores to evaluate teachers and other “ideas that only sound good if you say them fast.”

In this chapter, co-authored by Peter Baldwin, Wainer tells the true story of a third-grade teacher who was suspended because her class did better on a statewide exam than a superficial analysis of the scores indicated was likely. Thankfully, the teacher was exonerated when a subsequent, more careful analysis of the data was performed.

All of the facts of the case described here are accurate; the names of the people and places have been modified to preserve confidentiality.  

Excerpted from UNEDUCATED GUESSES by Howard Wainer. Published by Princeton University Press. Copyright © 2011 by Princeton University Press. Reprinted by permission.

Since the passage of “No Child Left Behind,” the role of standardized test scores in contemporary education has grown. Students have been tested, and teachers, schools, districts and states have been judged by the outcome. The pressure to get high scores has increased, and concern about cheating has increased apace. School officials desperately want high scores, but simultaneously do not want even the appearance of any malfeasance. Thus it was not entirely a surprise that, after 16 of the 25 students in Jenny Jones’ third-grade class at the Squire Allworthy Elementary School obtained perfect scores on the state’s math test, a preliminary investigation took place. The principal’s concern arose because only about 2 percent of all third graders in the state had perfect scores, so this result, as welcome as it was, seemed so unlikely that it was bound to attract unwanted attention.

The investigation began with a discussion between the principal and Ms. Jones, in which Ms. Jones explained that she had followed all of the rules specified by Ms. Blifil, a teacher designated by the principal as the administrator of the exam. One of the rules that Ms. Blifil sent to teachers giving the exam was “point and look”: the proctor was instructed to stroll around the class during the administration of the exam and, if she saw a student doing something incorrectly, to point to that item and look at the student. Ms. Jones did not proctor the exam herself, for she was occupied administering it individually to a student in special education, but she instructed an aide on how to do it.

It was later discovered that “point and look” was clearly forbidden by the state’s rules for test administration.

Having seen Ms. Jones’ class’s results, the school’s investigators believed they had found their cause. To cement this conclusion the school administration brought in an expert, a Dr. Thwackum, to confirm that the obtained test results were indeed too good to be true. Dr. Thwackum was a young Ph.D. who had studied measurement in graduate school. He was asked whether the results obtained on the math exam were sufficiently unlikely to justify concluding that students had been inappropriately aided by the exam proctor. Dr. Thwackum billed the district for 90 minutes of his time and concluded that such a result was indeed so unlikely that a fair test administration was not plausible.

Ms. Jones was disciplined: she was suspended without pay for a month, given an “unsatisfactory” rating, and denied her pay increment for the year, and her service for the year in question did not count toward her seniority.

Through her union she filed an appeal. The union hired a statistician to look into the matter more carefully.

No harm, no foul

“No harm, no foul” and the presumption of innocence help constitute the foundations upon which the great institutions of basketball and justice, respectively, rest. The adjudication of Ms. Jones’ appeal relied heavily on both principles. Ms. Jones admitted to instructing the exam proctor about the “point and look” rule—her culpability on this point is not in dispute. However, she was not disciplined for incorrectly instructing the proctor; rather, she was disciplined for the alleged effect of the proctor’s behavior. This distinction places a much greater burden on the school, because it must show that the “point and look” intervention not only took place but that it had an effect. That is, the school must demonstrate that Ms. Jones’ rule violation (foul) unfairly improved her students’ scores (harm)—failure to show such an effect exculpates her.

Further, here the burden of proof lies on the accuser, not the accused. That is, Ms. Jones is presumed innocent unless the school can prove otherwise. Moreover, the scarcity of high-quality, student-level data made proving that the effect could not be zero much more challenging than proving that it could have been.

The subsequent investigation had two parts. The first was to show that it is believable that the cause (the “point and look” rule) had no effect; the second, and larger, part was to estimate the size of the alleged effect (the improvement in performance attributable to this cause) or—as we shall discuss—to show that the size of the effect could have been zero.

The cause

Because the principle of “no harm, no foul” makes the effect—not the cause—of primary concern, our interest in the cause is limited to showing that it is plausible that the aide’s behavior had no effect on children’s scores. This is not difficult to imagine if it is recalled that Ms. Jones’ third-grade class was populated with eight-year-olds. Getting them to sit still and concentrate on a lengthy exam is roughly akin to herding parakeets. In addition, there were more than 10 forms of the test, handed out at random to the class, so the aide, looking down at any specific exam, was unlikely to be seeing the same question that she had seen previously. What kind of event, then, was likely to have led to her “pointing and looking”? Probably pointing to a question that had been inadvertently omitted, or perhaps suggesting that a student take the pencil out of his nose, or stop hitting the student just in front of him with it.

The developers of the test surely had such activities in mind when they wrote the test manual, for on Page 8 of the instructions they said,

“Remind students who finish early to check their work in that section of the test for completeness and accuracy and to attempt to answer every question.”

So, indeed, it is quite plausible that the aide’s activities were benign and had little or no effect on test scores. Given this, both the prosecution and the defense sought to measure the alleged effect: Dr. Thwackum sought to show that the effect could not have been zero, while we tried to show that it could have been. Of course, this was an observational study, not an experiment. We do not know the counterfactual result: how the students would have done had the aide behaved differently. And so the analysis must use external information, untestable assumptions, and the other tools of observational studies.

The effect

First, we must try to understand how Dr. Thwackum so quickly arrived at his decisive conclusion. Had he been more naïve, he might have computed the probability of 16 out of 25 perfect scores based on the statewide result of 2 percent perfect scores. But he knew that this would have assumed that all students were equally likely to get a perfect score. He must have known that this was an incorrect assumption and that he needed some information that would allow him to distinguish among students. It is well established that all valid tests of mental ability are related to one another. For example, the verbal and mathematical portions of the SAT correlate 0.7 with each other, about the same as the correlation between the heights of parents and children.
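For concreteness, here is a minimal sketch (in Python, assuming SciPy is available) of the naïve calculation described above, the one that wrongly treats every student as having the same 2 percent chance of a perfect score:

```python
from scipy.stats import binom

# Naive model the text warns against: every third grader is assumed to have
# the same 2 percent chance of a perfect score, independently of the others.
p_perfect = 0.02   # statewide rate of perfect scores
n_students = 25    # students in the class
k_perfect = 16     # perfect scores observed

# P(16 or more perfect scores out of 25) under this binomial model
p_value = binom.sf(k_perfect - 1, n_students, p_perfect)
print(f"P(>= {k_perfect} perfect of {n_students}) = {p_value:.3g}")
```

The probability comes out vanishingly small, but only because the model’s equal-chance assumption is wrong; that is precisely why Dr. Thwackum looked for information that could distinguish among students.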

Happily, Dr. Thwackum had the results of a reading test used by the state (with the unlikely name of DIBELS) for the same students who were in Ms. Jones’ class. Using an independent sample, he built a prediction model with the DIBELS score as the independent variable and the math score as the dependent variable. Applied to the students in Ms. Jones’ class, this model predicted not a single perfect score on the math test. As far as Dr. Thwackum was concerned, that sealed the case against the unfortunate Ms. Jones.

But it turned out that DIBELS has little going for it. It is a very short and narrowly focused test that uses one-minute reading passages scored in a rudimentary fashion. It can be scored reliably, but it supports little generality of inference. An influential study of its validity concluded that “… DIBELS mispredicts reading performance on other assessments much of the time, and at best is a measure of who reads quickly without regard to whether the reader comprehends what is read.” And, more to the point of this investigation, when Dr. Thwackum used DIBELS to predict math scores, the correlation between the two was approximately 0.4. Thus it wasn’t surprising that the predicted math scores were shrunken toward the overall average score and no perfect scores were portended.
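Why a correlation of 0.4 makes perfect predictions impossible follows from the regression formula itself: in standardized units, the least-squares prediction is the predictor’s z-score multiplied by the correlation. A short sketch (the math-score mean and standard deviation below are invented for illustration; only r = 0.4 comes from the text):

```python
# Regression to the mean: in standardized units the least-squares
# prediction is z_math_hat = r * z_dibels, so predictions are always
# shrunken toward the average by the factor r.
r = 0.4                     # DIBELS-math correlation reported in the text
mean_math, sd_math = 50, 5  # hypothetical raw-score mean and SD
perfect = 61                # a perfect raw score on the math test

for z_dibels in (1.0, 2.0, 3.0):
    z_hat = r * z_dibels                      # shrunken predicted z-score
    predicted = mean_math + z_hat * sd_math   # back to the raw scale
    print(f"DIBELS z = {z_dibels:+.1f} -> predicted math = {predicted:.1f} "
          f"(perfect = {perfect})")
```

Even a student three standard deviations above the mean on DIBELS is predicted to score only 1.2 standard deviations above the mean in math, far short of a perfect score.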

Is there a more informative way to study this? First we must reconcile ourselves to the reality that additional data are limited. What were available were the reading and math scaled scores for all the students in the school district containing Squire Allworthy School, along with some state and county-wide summary data. Failure to accommodate the limitations of these data may produce a desirable result, but not one that bears scrutiny (as Dr. Thwackum discovered). We can begin by noting that the third-grade math test has no top; hence it discriminates badly at the high end. This is akin to measuring height with a ruler that only extends to six feet. It works perfectly well for a large portion of the population, but it groups together all individuals whose heights exceed its upper limit.

This result is obvious in the frequency distribution of raw scores in Figure 1. Had the test been constructed to discriminate among more proficient students the right side of the distribution would resemble the left side, yielding the familiar symmetric bell-shaped curve.
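A toy simulation (Python; all distribution parameters are invented and do not match the state’s actual data) shows how a ceiling turns a symmetric bell curve into a distribution with a spike at the maximum, much like the shape in Figure 1:

```python
import numpy as np

# Simulate symmetric 'true' proficiency scores, then cap them at the
# test's 61-point maximum, as the real scoring does.
rng = np.random.default_rng(2)
true_scores = rng.normal(loc=55, scale=5, size=100_000)  # invented parameters
observed = np.clip(np.round(true_scores), 0, 61).astype(int)

# Everyone whose 'true' score exceeds the maximum is recorded as 61,
# so the top score collects a disproportionate share of examinees.
for score in range(57, 62):
    print(f"raw score {score}: {np.mean(observed == score):.1%}")
```

The left side of the simulated histogram keeps its bell shape, while the truncated right tail piles up at 61, the same asymmetry described above.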

Although only 2 percent of all third graders statewide had perfect scores of 61 on the test, about 22 percent had scores of 58 or greater. There is only a very small difference in performance between perfect and very good. In fact, more than half the children (54 percent) taking this test scored 53 or greater. Thus the test does not distinguish very well among the best students: few would conclude that a third grader who got 58 right out of 61 on a math test was demonstrably worse than one who got 61 right. And no one knowledgeable about the psychometric variability of tests would claim that such a score difference was reliable enough to replicate in repeated testing.

But Figure 1 is for the entire state. What about this particular county? What about Squire Allworthy Elementary School? To answer this we can compare the county, school-level, and classroom-level results with the state using the summary data available to us. One such summary for Reading—which is not in dispute—is shown in Table 1.

Table 1: 3rd-grade Reading performance levels, 2006

From this we see that the county does much better than the state as a whole, the district does better than the county, and Squire Allworthy does better still, with fully 93 percent of its students classified as Advanced or Proficient. Ms. Jones’ students performed best of all, with every examinee scoring at the Proficient or Advanced level in Reading. Clearly, Ms. Jones’ students are in elite company, performing at a very high level compared to other third graders in the state. Let us add that, within the district, Reading and Math scores correlated positively and reasonably strongly.

If we build a model using Reading to predict Math, we find the regression effect makes it impossible to predict a perfect Math score—even from a perfect Reading score. In part, this is due to the ceiling effect we discussed previously (the prediction model fails to account for the artificially homogeneous scores at the upper end of the scale) and in part it’s due to the modest relationship between Reading and Math scores. Thus, when Dr. Thwackum used this approach, with an even poorer predictor, DIBELS scores, it came as no surprise that Ms. Jones’ students’ performance seemed implausible.

Observations

Dr. Thwackum’s strategy was not suitable for these data because it under-predicts high performers. An even simpler approach, which circumvents this problem, is to look only at examinees with perfect Math scores. In addition to Ms. Jones’ 16 students with perfect scores, there were 23 non-Jones students with perfect Math scores within the Squire Allworthy School’s district. We can compare Reading scores for Ms. Jones’ students with those for the non-Jones students. There are three possible outcomes here: (i) Ms. Jones’ students do better than non-Jones students, (ii) all students perform the same, or (iii) Ms. Jones’ students do worse than non-Jones students.

We observed that there was a modestly strong positive relationship between Reading and Math proficiency. For this subset of perfect Math performers there is no variability among Math scores, so no relationship between Reading and Math can be observed. Nevertheless, the failure to observe any relationship can plausibly be attributed to the ceiling effect we discussed rather than to a true absence of a relationship between Reading and Math proficiency. So we could reasonably suggest that if Ms. Jones’ perfect performers were unfairly aided, their Reading scores should be lower, on average, than those of the non-Jones perfect performers. That is, if option (iii) (Ms. Jones’ students do worse than non-Jones students) were shown to be true, it could be used as evidence in support of the school district’s case against Ms. Jones. However, if Ms. Jones’ students do as well as or better than non-Jones students—options (i) and (ii) above—or if we lack the power to identify any differences, we must conclude that the effect could have been zero, at least based on this analysis.

Typically, if we wanted to compare Reading score means for Ms. Jones’ 16 students and the district’s 23 students with perfect Math scores, we would make some distributional assumptions and perform a statistical test. However, even that is not necessary here, because Ms. Jones’ students’ Reading mean is higher than the district students’ mean: the district’s 23 students with perfect Math scores had a mean Reading score of 1549, whereas the 16 students in Ms. Jones’s class with perfect Math scores had a mean Reading score of 1650.
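For readers who want to see what such a test would look like, here is a sketch in Python (assuming SciPy 1.6 or later). The individual scores are invented; only the group sizes and the means, 1650 versus 1549, come from the text:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical individual Reading scores: only the group sizes and means
# are reported in the text, so these arrays are invented for illustration.
rng = np.random.default_rng(0)
jones = rng.normal(loc=1650, scale=80, size=16)      # Ms. Jones' 16 students
district = rng.normal(loc=1549, scale=80, size=23)   # 23 other perfect scorers

# Welch's two-sample t-test. alternative="less" asks whether the Jones
# mean is LOWER, the direction the district's case would require.
t_stat, p_value = ttest_ind(jones, district, equal_var=False,
                            alternative="less")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3f}")
```

Because the observed difference runs in the opposite direction from the one the accusation requires, such a test could never support the district’s case.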

The box plots in Figure 2 compare the distributions of Reading scores for the two groups of students with perfect Math scores. The plot follows common convention: the box contains the middle 50 percent of the data, and the cross line represents the median. The dotted vertical lines depict the upper and lower quarters of the data, and the two small circles represent two unusual points.
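A plot in the style of Figure 2 can be sketched as follows (Python with Matplotlib; the scores are the same invented values as above, matching only the reported group means):

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented Reading scores for the two groups of perfect Math scorers,
# chosen to match only the reported group means (1650 vs. 1549).
rng = np.random.default_rng(0)
jones = rng.normal(loc=1650, scale=80, size=16)
non_jones = rng.normal(loc=1549, scale=80, size=23)

fig, ax = plt.subplots()
ax.boxplot([jones, non_jones])   # boxes, medians, whiskers, outlier circles
ax.set_xticks([1, 2])
ax.set_xticklabels(["Ms. Jones", "Non-Jones"])
ax.set_ylabel("Reading scaled score")
ax.set_title("Reading scores of students with perfect Math scores")
plt.show()
```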

We can see that Ms. Jones’s students’ Reading scores are not below the reference group’s (as would be expected had her intervention produced the alleged effect). On the contrary: her 16 perfect Math students did noticeably better on Reading than the non-Jones students who earned perfect Math scores. This suggests that Ms. Jones’s students’ perfect Math scores are not inconsistent with their Reading scores. And, in fact, if the math test had a higher top, her students would be expected to do better still.

Thus, based on the data we have available, we cannot reject the hypothesis that Ms. Jones’s students came by their scores honestly, for there is no empirical evidence that Ms. Jones’s intervention produced an effect of any kind: her students’ scores appear no less legitimate than those of students from other classes in the district.

The outcome

An arbitration hearing was held to decide the outcome of Ms. Jones’ appeal. Faced with these results, the representatives of the district decided to settle a few minutes in advance of the hearing. Ms. Jones’ pay and seniority were restored, and her “unsatisfactory” rating was replaced with one better suited to an exemplary teacher whose students perform at the highest levels.

Ms. Blifil was relieved of further responsibility in the oversight of standardized exams and the practice of “point and look” was retired.

The statisticians who did these analyses were rewarded with Ms. Jones’ gratitude, the knowledge that they helped rectify a grievous wrong, and a modest honorarium.

Hopefully, Dr. Thwackum learned that a little ignorance is a dangerous thing.

Howard Wainer is a distinguished research scientist at the National Board of Medical Examiners and adjunct professor of statistics at the Wharton School of the University of Pennsylvania. For 21 years, he was a principal research scientist at Educational Testing Service. Wainer is the recipient of numerous awards and honors: he is a Fellow of the American Statistical Association and the American Educational Research Association. He was given a Career Achievement Award for Contributions to Educational Measurement by the National Council on Measurement in Education in 2007.