It sounds so simple. If tests measure what a student learned, why can’t they also be used to measure what a teacher taught? Wouldn’t this data make it easy to distinguish good teachers from bad ones? And can’t this information be used in teacher retention, tenure, salary, and other decisions?
The use of student performance data to judge the quality of teachers looks like a magic bullet to some education policymakers. Convinced that poor teachers are the primary cause of low student achievement, they argue that “value-added models” (VAMs) can make it easier for administrators to identify and fire underperformers. But will it?
In January, the Educational Testing Service (ETS) hosted a symposium to answer some of these questions. Thirteen organizations, including NJEA, sponsored “Standardized Tests and Teacher Accountability.” The event showcased noted researchers and statisticians who have studied VAM extensively. The purpose of the symposium was to inject legitimate, current research into a debate that, to this point, has featured more rhetoric than reason.
Former New York Times national education columnist Richard Rothstein delivered the opening keynote, “An Overemphasis on Teachers.” Rothstein currently works as a research associate at the Economic Policy Institute. His address was followed by a panel discussion moderated by Howard Wainer, distinguished research scientist, National Board of Medical Examiners. The panelists were:
- Henry Braun, Boisi professor of education and public policy, Boston College
- Sean P. Corcoran, assistant professor, New York University, Steinhardt School of Education
- Arthur E. Wise, president emeritus, National Council for Accreditation of Teacher Education
Laura Goe, a research scientist in the Performance Research Group at ETS, was the final speaker. She spoke on “Evaluating Teacher Effectiveness: Where Do We Go From Here?”
All of these experts addressed what effective teacher evaluation systems should look like. They explained VAM’s weaknesses and the areas in which it holds some promise. Free from political agenda, the researchers provided an honest assessment of the state of teacher assessment. A few of their points seemed obvious; others may surprise you, and still others may confuse you.
Major themes of the symposium proceedings follow.
1. Current standardized tests simply aren’t good enough for use in value-added scenarios.
There is no need to rehash the many and varied weaknesses of standardized tests here. But to ask testing companies to design tests that also determine the quality of a student’s schooling is an especially tall order.
Statisticians explain that, in order to use tests to evaluate teachers, the distance between student test scores must have the same meaning at every point on the score scale. That means a 10-point gain from 10 to 20 must have the same meaning as a 10-point gain from 80 to 90. This may be reasonable when comparing the scores of the students of two sixth-grade teachers, but it becomes problematic in a comparison between a sixth-grade and an eighth-grade math teacher. Imagine the difficulties when comparing the 10-point gain of a student in a French class to a 10-point gain in physics.
As Howard Wainer, who spent much of his career working at ETS, explained, that’s like trying to determine if Mozart was a better musician than Babe Ruth was a hitter.
“Value added models ask more from current tests than they can provide,” Wainer acknowledged, “and this comes from someone whose life has been spent in testing and who has a deep and abiding love for what tests can do.”
“There is a lot that testing companies could do to address this issue,” added Henry Braun. “But it will take time and cost money.”
2. Most teachers don’t work in subject areas or grades where the students are tested.
If you teach kindergarten, physical education, or French, for example, this comes as no surprise to you. Research indicates that about 69 percent of teachers can’t currently be accurately assessed with test score-based models. Laura Goe argued that, although the percentage varies from state to state, in most places the number is probably closer to 80 percent.
“No problem,” say VAM proponents. All that has to be done is to create and administer tests in every subject and at every grade level. But is this really where we want to spend our education dollars? Isn’t the creation of tests for the purpose of evaluating teachers a case of the tail wagging the dog?
“It’s hard to imagine how you could use VAM anytime soon in areas that aren’t tested,” said Arthur Wise. “Are we ready to pay for [the ramping up of testing programs]? What will we have to give up?”
3. The use of VAM to evaluate teachers will further narrow the curriculum.
This seems obvious, especially since it has already happened under No Child Left Behind. If a teacher’s worth is determined by a student’s score on a test, it follows that non-tested subjects will be abandoned in favor of those that have consequences.
Richard Rothstein pointed to a September 2010 study that highlights the tragedy of this phenomenon. In “Fine Motor Skills and Early Comprehension of the World: Two New School Readiness Indicators,” David Grissmer, principal scientist at the Center for the Advanced Study of Teaching and Learning at the University of Virginia, concluded that general knowledge is a much stronger overall predictor of later math, reading, and science scores than early math and reading scores. In other words, we would serve our students better by telling them stories and taking them on field trips.
“Instead we’ve adopted a strategy that focuses on drilling basic skills and narrows the curriculum,” Rothstein noted.
Wise stated his belief that a focus on test scores will affect the psyches of teachers who are caught between doing what’s best for children and the reality of how they will be judged.
“What does it do to the nature of the profession and potential satisfaction?” he asked.
4. Value-added models may produce a number, but that number must still be interpreted and used by human beings.
As Rothstein explained, even if value-added measures are only used for 50 percent of a teacher’s evaluation, it will likely become 100 percent of the trigger for retention and pay decisions made by a supervisor. In other words, if every other aspect of a teacher’s evaluation is satisfactory, but the value-added score is low, those other measures will be deemed unimportant or incorrect. Conversely, if a teacher’s value-added score is high, the administrator will likely ignore other concerns about a teacher’s performance.
In the end, regardless of the system of teacher evaluation that’s in place, it’s up to managers to do their jobs and implement that system fairly.
“Human beings love numbers and they love rankings. But will people put too much emphasis on a number?” questioned Sean Corcoran, author of “Can Teachers be Evaluated by their Students’ Test Scores? Should They Be? The Use of Value-Added Measures of Teacher Effectiveness in Policy and Practice,” a report of the Annenberg Institute for School Reform.
5. VAM scores have little value beyond a local context.
“We can’t take VAM scores at face value—we can’t take naked scores and print them in the newspaper,” declared Braun, who authored the policy paper “Using Student Progress to Evaluate Teachers: A Primer on Value-Added Models” while he worked at ETS.
But that’s just what happened in Los Angeles and may occur in New York.
All of the experts agreed that VAM could provide useful information, but mostly at the school level where principals can interpret data and use professional judgment.
While a supervisor will know that a teacher is in a classroom with 30 special needs students, someone in the community who sees a VAM number does not. A principal can use value-added data to make decisions regarding resources or professional development that are needed.
“Given the temper of the times, the public is not likely to be satisfied by that,” Wise worried. “Within a classroom this information will make sense, but will that satisfy the cry for accountability for a school, a district, or a state?”
“Outsiders would not know how the value-added info contributed to the evaluation of the teacher,” added Corcoran. “But the cry for accountability is much bigger than that.”
6. Far more research is needed to determine which models show the greatest promise.
When bad weather approaches, have you ever noticed that meteorologists tell us about conflicting computer models—one that has the storm coming up the coast and the other out to sea? These varied conclusions are the result of the input and weighting of different sets of data. Why doesn’t the meteorologist just pick one model and stick with it? Because sometimes one computer gets it right while sometimes the other provides a more accurate prediction.
The same applies to value-added models. As Braun explained, there are a number of different models that try to level the playing field. “When we apply different models to the same data, we get different answers.” In other words, one model may determine that you are an effective teacher but the other one may not.
Rothstein referenced a recent study connected to the Gates Foundation that tried to suggest that teachers who do well under one value-added model tend to score well on other models.
“It’s a very weak correlation,” Rothstein explained. “If you fire the quintile of teachers whose value-added scores for teaching basic skills were the lowest, you would also remove 30 percent of the teachers who scored above average in the teaching of reasoning and critical thinking.”
7. VAM just may make it harder to get rid of bad teachers.
Most experts agree that three to five years of data are needed before VAM results can have a reasonable degree of reliability. This reality may generate the unintended consequence of making it harder to fire clearly underperforming teachers after one or two years in the classroom.
“Once we report that it takes three years to have good data, why wouldn’t every teacher insist upon it?” asked Wise. “Rather than being a progressive force, [VAM] can, in insidious ways, become a regressive force.”
8. The obsession with VAM will undermine much needed educational and societal reforms.
The education reform debate is fraught with falsehoods, including the notion that teachers are the most important factor in a student’s achievement. Not only is that untrue (research has long shown that the effects of home life are far more important), but teachers may not even be the most important in-school determinant of student success. By focusing solely on teacher quality, policymakers are ignoring other, more important reforms. These include strengthening the curriculum, encouraging teacher collaboration, and improving instructional leadership, just to name a few.
In addition, schools generally reflect the communities they serve. We cannot ignore the effect of societal problems such as poverty on education.
“The single biggest threat to student achievement is the national economic situation,” argued Rothstein. “Educators can’t do anything about it, except to be more vocal and explain how the economic problems of parents affect their children.”
9. Even the magic of statistics can’t make up for missing data.
Sure, everyone understands that student mobility and absenteeism result in missing data. The solutions seem equally simple—we can omit students whose data are incomplete or impute scores based on the average of similar students. Unfortunately, the first solution can seriously alter results while the second one encourages a gaming of the system.
“The problem of missing data is even more pernicious,” added Braun. “It seems fair that professionals be held responsible for only those students who were there for 120 days. But [VAM] doesn’t take into account the fact that responsible professionals will pay attention to the students who are there. In schools with highly mobile populations, a great deal of a teacher’s efforts will be spent on students who aren’t there when the test is given.”
10. We can only know for sure the impact of a teacher on a student’s performance if we can determine how that student would have fared with another teacher. And we’ll never know that.
“We would not believe the causal link if we said that John grew three inches because he was in Ms. Jones’ class,” noted Wainer. “Under what conditions would we believe John gained 10 points because he was in Ms. Jones’ class?”
In statisticians’ lingo, this means we need a counterfactual (would John have gained 10 points in Mrs. Smith’s class?) to determine the unique impact of a teacher.
Current value-added models have not “fully accounted for the causal inference and bias that naturally occur if we haven’t fully accounted for other factors,” Corcoran noted. “I’m pessimistic whether we can overcome these biases.”
Another problem is the inability of tests, and therefore VAM, to separate teacher effects from school effects. In other words, it’s very difficult to determine what role a teacher played in a child’s learning as opposed to other in-school factors such as easily accessible and up-to-date supplemental resources, outstanding school leadership, a great library, or a full-time math coach.
11. VAM flies in the face of the move toward teacher collaboration.
At a time when the trend is toward teacher collaboration, how can we determine the degree to which a teacher’s efforts (or a student’s scores for that matter) were affected by other educators?
Wise provided a real-time scenario. “The U.S. Department of Education recently released its technology plan. It calls for teams of connected educators to replace solo practitioners….How does that vision intersect with this vision?
“VAM is not a forward-looking innovation but a backward-looking one,” Wise continued. “It will actually retard the state of practice, not advance it. Because for VAM to work, we must maintain the classrooms of today, with one teacher and 25 children. Any variation from that model, such as team teaching, departmentalization in elementary school, or a mixture of teachers and technology providing instruction, will affect data collection. For VAM to work we’d need to preserve a 19th-century model of schooling.”
There is widespread agreement that a better method is needed for identifying poor teachers and either helping them improve or getting them out of the classroom. So why not just give VAM a try? After all, some of the technical issues, such as missing data and causal inference, could be mitigated over time and with additional research.
“I’m somewhat but not very sympathetic to the argument that VAM is simply better than what we have,” concluded Corcoran. “The best it can currently do is identify those teachers who are systematically very high or very low performing after multiple years of observation. It makes me ask—didn’t we already know who these teachers are? Is this really news to us?”
It’s not accurate to simply compare VAM to the status quo. To be fair, VAM must be considered within the context of its many limitations.
“Solving [VAM’s] problems ranges from difficult to perhaps impossible,” Wainer added. “And until they are resolved the uses for VAM must be sharply limited.”
Lisa A. Galley is an NJEA associate director of communications and editor of the NJEA Review.
Video highlights from “Standardized Tests and Teacher Accountability: The Research” and links to presenter papers on VAM and teacher evaluation are available at njspotlight.com/ets_symposium.