Should we shoot for the moon, or aim for improvement?


A few weeks ago, I read a post that is still sticking in my craw.

Shooting Bottle Rockets at the Moon: Overcoming the Legacy of Incremental Education Reform

The author, Thomas Kane, argues that we need to stop tinkering and institute more drastic reforms in order to catch up to the highest-performing countries. He has written and researched extensively on teacher evaluation systems, so his voice is an important and informed one.

But I disagree with nearly everything he said.

I found only one area of agreement:  “In education…we do not pause long enough to consult the evidence on expected effect sizes and assemble a list of reforms that could plausibly succeed in achieving our ambitious goals.” Most of us can probably agree that education reformers do not pay enough attention to the relevant evidence, and I think this lack of attention extends beyond the expected effect sizes into things such as the limitations of the evidence base, the generalizability of the findings, and the extent of contradictory evidence.

But the parts that are sticking in my craw are pretty much everything else.

I’ll start with the title and the underlying premise. Kane argues that we have a “legacy of incremental education reform” that needs to be “overcome.” As a classroom teacher for nine years and a researcher and policy analyst for seven more, I nearly choked on reading that headline and I still can’t get over it. What legacy is he talking about? When I was young, education reform lurched from one end of the pendulum to the other, from whole language to phonics, from new math to back to basics, with lots of debate and ideological rancor in between. As I got older, education reform shifted to lofty platitudes with few specifics. Remember Goals 2000? That was the plan that basically said we’re going to fix everything about education by the year 2000. That was followed by No Child Left Behind, then Race to the Top. Do those sound incremental? Currently, our high-profile reform efforts center on the Common Core. Overall, I think the Common Core is a positive step to move toward deeper conceptual development, but it’s hard to argue that overhauling the standards, curricula, and assessments of 44 states within the space of a couple of years is “incremental” change. It seems much closer to the truth to suggest that we are rushing things just a bit and might want to consider “pausing” enough to move at a more incremental pace in order to implement the Common Core more carefully.

Turns out that Kane is talking about reforms such as “better professional development for teachers, higher teacher salaries, incrementally smaller class sizes, better facilities, stronger broad-band connections for schools, etc.” But those are hardly our highest-profile education reforms.

Second, Kane states that those incremental reforms he mentions aren’t big enough to get us where we need to go, and he then proposes a set of four elements that “could provide the needed thrust.” Two problems with this:

A) It’s not at all clear that those “not-big-enough” reforms that Kane disparages are actually incremental. They are mostly things that we haven’t even tried to do on a large scale. And it’s not even certain that we know how to do them on a large scale. Can anyone point me to an example of a large district or state that has actually implemented “better professional development?”  Does anyone know how to implement better professional development on a large scale? What about $130,000 teacher salaries? If the state of North Dakota, flush with money from natural gas drilling, implemented $150,000 teacher salaries across the board, would anyone call that an “incremental” change? The class size example might even be more pertinent because we had good evidence that lowering class sizes in Tennessee worked, but when that evidence was applied to the Californian context where the policy resulted in hiring many thousands of rookie teachers, the inexperience of all the new teachers appears to have wiped out any benefits that might have been created by the lower class sizes. I don’t think we are in a position to quibble over effect sizes because we are still largely in the dark about the effects themselves.

B) The four “elements” of reform with sufficient “magnitude” and “thrust” are themselves incremental improvements at best. Kane argues that the best of them might produce .045 standard deviations of improvement per year. I don’t buy the evidence for his argument on some of his preferred reforms, but even if he is right, it’s hard to describe .045 standard deviations of improvement as anything more than an “incremental” improvement. Especially when we consider that in the one large random experiment that we have, class size reduction in Tennessee Kindergartens resulted in about 0.2 standard deviations of improvement after one year. For those skimming, that means class size reduction produced an improvement over 4 times larger than the largest of Kane’s preferred reforms. It is important for me to note that this result faded out somewhat over time, additional years of small class sizes did not add to this effect, and these results have not been replicated in other studies. But then, those other studies are widely considered to be less reliable. So the weight of the evidence might suggest that we should drastically lower class sizes in all U.S. Kindergartens.

On to my third big gripe with Mr. Kane. The reforms he proposes are not any bigger in size, nor any better in terms of their evidence base. They’re just more controversial. And that seems to be his true subtext. We can’t be so namby-pamby in education. We gots to start hurting people’s feelings and firing teachers if we want to compete with the Chinese. But of course, the policies he is proposing are controversial for many good reasons, not the least of which is that we really have no idea of what the unintended consequences would be if we were to, say, take his suggestion and not retain (or “fire”) the bottom 25 percent of teachers on value-added measures at tenure time (usually after two to three years of teaching). We have a difficult time recruiting top-notch students into teaching as it is. Will anyone with a modicum of understanding of statistics choose to enter a job knowing that they may be fired after two years based on a measure that has so much noise that a teacher who is rated at the 43rd percentile has a margin of error that ranges from the 15th percentile to the 71st (Corcoran, 2010)?

Another problem with his premise: why should we accept that incremental reform is somehow less than some grandiose promised moon shot? A large part of our ongoing crisis in American education stems from our propensity to lurch from one silver bullet solution to the next without enough focus to actually make any solution workable. Teachers know this. A continual gripe from teachers is how the district has abandoned last year’s pet reform in favor of a new approach that teachers are expected to quickly master with little to no support, all the while knowing that this new approach is almost sure to be forsaken within mere months. Instead, the international evidence suggests that countries such as Japan have achieved long-term, ongoing growth by providing a structure in which teachers work together to create incremental improvements in instruction. These incremental improvements add up to real learning. It’s hard to see what other types of improvements could really be possible in a field as complex as human learning. So I take back my third big gripe with Mr. Kane. Sort of. It’s ok that he didn’t find any bigger-than-incremental reforms to promote. There aren’t any. But it’s not ok for him to pretend that he has found some giant-sized solutions when he really hasn’t.

And, yes, I’ve got more gripes. Such as, why is closing the gap with China a “necessary goal”? If the Chinese are truly improving their education system (which is, by the way, highly debatable, since there is a lot of evidence that the highly touted results in Shanghai come from only testing a small slice of the best students, but anyway), if China is improving its educational system, we should celebrate that fact and rejoice in the hope that poverty, hunger, and human misery will be substantially reduced. The same is true for the improvement in any country. The growth of other countries is much more likely to lift all of humanity than it is to prove a threat.

We do, however, face real threats. Huge ones. Here are three that spring to mind: Climate change, income inequality, and the rapid pace of technological change that is projected to eliminate 50 percent of all current jobs within a generation. What are we doing to prepare for those threats? Are any of them likely to be met by increasing our PISA scores? Or do we need to begin to focus our educational reform efforts more broadly? Perhaps we should be developing involved citizens who are able to think critically and resolve political disagreements amicably. Or stretching children’s creativity and ability to adapt to new situations?

If those types of reform goals were met, they might very well bring along with them improved PISA scores and a closing of the gap with China. But they might not. And if we were able to end global warming, reduce income inequality, and find new jobs for all of our children, why would we care?

– Kevin

Zombie Satire and the Ministry of Value-Added


George Orwell and Jonathan Swift appear to have come back from the dead and are now running the editorial board of the LA Times.

May 14’s Editorial announces, with the courage required of any modest proposal, that the LA Times has doubts about using value-added models based on student test scores to evaluate teachers.

The editorial notes, “This isn’t the first study to cast doubt on what has become a linchpin educational policy of the Obama administration but there’s an interesting element that lends its findings extra weight: It was funded by the Bill & Melinda Gates Foundation, a well-known supporter of using test scores in teacher evaluations.”

And Orwell et al. conclude by pointing to a key problem with these types of evaluations: “The problem is that, under pressure from the U.S. Department of Education, states have been rushing to set up rubrics for judging teachers based, to a significant degree, on rigid use of test scores.”

Interesting note about the Gates Foundation’s involvement, to be sure. And you’ll get no argument from me about the problems with the U.S. Department of Education’s pressure to rush test-score based evaluations into practice.

But the most interesting element is left unstated. In Orwellian style, the past four years of LA Times reports and editorials have been dismissed without a whisper of mention or a hint of mea culpa. Readers unfamiliar with recent history might not realize that, after the Department of Education and the Gates Foundation, the LA Times has been the most prominent advocate for using test scores in teacher evaluations.

Not only did the Times commission their own value-added study of LA’s teachers. Not only did the Times declare “value-added analysis offers the closest thing available to an objective assessment of teachers.” Not only did the Times disdain the judgments of principals and parents, and the results of periodic assessments when “seven years of [standardized] student test scores suggest otherwise.” Not only did the Times’ reporters present themselves as capable classroom evaluators, equipped by their value-added evaluations to find and write about evidence that supported the conclusions that had already been objectively determined by the statistical model. Not only did the Times release teachers’ scores to the public, creating enormous pressure on districts and states to rush to develop their own test-score based evaluations. Not only was the Times’ release of test scores praised by the Secretary of Education in the Administration’s “first…support for a public airing of information about teacher performance,” thus suggesting that the Times may have even been responsible for pressuring the U.S. Department of Education to hurry up with this “objective”, “no-brainer” idea of evaluating teachers based on their students’ test scores.

But the Times also deliberately misrepresented alternative views on value-added, (in)famously claiming that Derek Briggs’ study “confirms many Los Angeles Times findings on teacher effectiveness” in spite of the fact that the abstract of that study vehemently declared “the research on which the Los Angeles Times relied for its teacher effectiveness reporting was demonstrably inadequate to support the published rankings,” and had “serious weaknesses making the effectiveness ratings invalid and unreliable.”

All of this unstated backstory is really interesting. But it’s never mentioned.

As far as Orwell’s ongoing role on the editorial board, I’m not sure if the LA Times lacks the workforce that would be required to actually rewrite all of its previous advocacy of value-added, or if they are simply counting on our collective laziness and amnesia to help us ignore this humiliating about-face.

But I have to admit, I remain a fan of Orwell’s earlier work, when the lies were more out in the open, in-your-face, and honest:

On the sixth day of Hate Week, after the processions, the speeches, the shouting, the singing, the banners, the posters…when the great orgasm was quivering to its climax and the general hatred of Eurasia had boiled up… — at just this moment it had been announced that Oceania was not after all at war with Eurasia. Oceania was at war with Eastasia. Eurasia was an ally.

There was, of course, no admission that any change had taken place. Merely it became known, with extreme suddenness and everywhere at once, that Eastasia and not Eurasia was the enemy.

What We Measure


I listened to Dr. Susan Embretson speak at UCLA last Friday.  Lots of what she said went over my head, which might be because she’s just a lot smarter than me – a situation I’m fortunate enough to be in pretty much every day of the week at UCLA and at home. But one thing I caught was that she’s working on new ways of generating test items automatically.

Her work is making it easier and cheaper to develop reliable new tests. Efficiency is always good. And research in the area of item response theory and computer adaptive testing is leading to exciting developments in diagnostically measuring student progress and providing each student with more appropriate and quicker feedback. Exciting may seem too strong a word for this extremely complex and math-heavy research, but you can check out the School of One to see what this type of individualized education could look like.

But, on the other hand, for most teachers and students, the effects so far seem to be more testing, and thinking about more testing made me think about the bigger picture of educational research. I began to wonder, not for the first time, whether a large part of this progress and this important work is rather beside the point as far as students and teachers are concerned.

Maybe we need to step back and think about our schools from a variety of perspectives. What are the problems that other people might wish we were working on?

From students’ point of view, one of the biggest problems is the lack of jobs. Huge numbers of black and brown teenagers are unemployed. But, partly due to our research emphasis on measurement and our policy emphasis on accountability, vocational programs have been cut or curtailed sharply.

From parents’ perspective, one of the biggest problems has got to be our children’s health.  The growing epidemic of obesity and diabetes seems highly relevant, yet the testing/accountability mania has led schools to cut PE, health, and recess.

From the planet’s perspective, one of the biggest problems is the epidemic of waste caused by humans.  Yet every day our schools teach our children to throw away huge amounts of food because lunch and recess times are too short and play time is too precious to be spent eating.

We are measuring carefully, but the things we’re measuring are not necessarily the things we care about. Undoubtedly, researchers are concerned about and probably even working to end unemployment, obesity, and global warming. But if we truly care about these and other issues, we need to start thinking a little less about how to measure and a little more about what we measure.


The Muddy Language of Teacher Evaluation


“We need high-quality teacher evaluation systems to make sure that every student is taught by a quality teacher.” 

[Note: For those of you who prefer your commentary in cartoon form, click here.]

We see statements like this everywhere in education reform, and they’re worded in such a way that it’s hard to disagree with them.  Do you want to make sure every student has a quality teacher?  Of course, who doesn’t?  To do this, you need a way to gauge teacher quality, right?  Sure.

But there’s something else going on here.  Something that gets blurred by the vague language of the statement.  Something that has to do with the mechanism behind “making sure that…”

When we evaluate something (our dietary habits, for example, or a car we’re considering buying, or a teacher’s practice), it’s helpful to figure out whether we’re seeking information in order to improve the thing (“formative evaluation”) or decide whether to keep or reject the thing (“summative evaluation”).

Presumably, when we evaluate our eating habits, we have no intention of stopping eating; rather, we want to see what’s going on, what’s working, what’s not working, what we can change.  That is, we’re carrying out a “formative evaluation.”  But when we go car shopping, we’re looking to make a decision, a definitive, thumbs-up-or-thumbs-down decision; we’re engaged in “summative evaluation.”

What about when we evaluate teachers?  Clearly, we need to evaluate for both of these purposes.  Teachers need information about their practice so that they can reflect on it and improve.  And administrators need information about the quality of work their employees are engaged in, in order to make personnel decisions.

Formative teacher evaluation makes everyone feel warm and fuzzy inside.  We picture dedicated, reflective teachers, studying their craft, honing their skills, becoming the best that they can possibly be and raising up their students in the process.  And summative teacher evaluation is, honestly, a little awkward.  Because, let’s face it, it amounts to firing people.

The clever double-speak that we sensed in the opening quote (which nearly everyone, from every corner of the reform debate, is patently guilty of) is this: there is no acknowledgement of the difference between formative and summative teacher evaluation.  All too often, ambiguous language masks the very real question: How exactly are we going to make sure that every student has a quality teacher?

By improving our existing teaching force?  Or by “firing our way to the top,” as Mr. Duncan so delightfully put it?  The solution is, of course, that we need both formative and summative teacher evaluation.  But unless we recognize that both exist, unless we get behind the blurry language and articulate this distinction, we risk derailing potentially productive policy conversations and descending into sound-bite-ridden shouting matches.  That’s a reform strategy that, I think we can agree, goes absolutely nowhere.