Estimating and reporting teacher effectiveness: Newspaper researchers do things that academic researchers never could

Alex Tabarrok reports on an analysis from the Los Angeles Times of teacher performance (as measured by so-called value-added analysis, which basically compares teachers based on their students’ average test scores at the end of the year, after controlling for pre-test scores).

It’s well known that some teachers are much better than others, but, as Alex points out, what’s striking about the L.A. Times study is that they are publishing the estimates for individual teachers. For example, this:

[L.A. Times graphic: value-added estimates for two individual teachers]

Nice graphics, too.

To me, this illustrates one of the big advantages of research in a non-academic environment. If you’re writing an article for the L.A. Times, you can do what you want (within the limits of the law). If you’re doing the same research study at a university, there are a million restrictions. For example, from an official document, “The primary purpose of an Institutional Review Board (IRB) is to protect the rights and welfare of human subjects participating in research…” I’m not saying this is wrong, but if you’re in a non-academic and non-clinical environment, you can just go out and collect data, interview people, whatever.

P.S. Too bad they’re not fitting multilevel models. I’d think that would help a lot with estimates for individual teachers.
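Just to sketch what I mean (this is a minimal illustration with made-up column names, not the Times's actual data or model), a multilevel fit with partial pooling of the teacher effects could look something like this in Python:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per student, with made-up column names
df = pd.read_csv("student_scores.csv")  # post_score, pre_score, teacher_id

# Mixed-effects model: end-of-year score regressed on the pre-test,
# with a random intercept for each teacher.  The teacher effects are
# partially pooled toward the overall mean, which stabilizes the
# estimates for teachers with only a handful of students.
fit = smf.mixedlm("post_score ~ pre_score", data=df,
                  groups=df["teacher_id"]).fit()

print(fit.summary())
print(fit.random_effects)  # partially pooled teacher effects, one per teacher

The point of the partial pooling is that a teacher with only a few students gets pulled toward the average rather than getting an extreme estimate from a small sample.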

P.P.S. Alex also links to a study that recommends “to hire a lot of teachers on probation and then fire 80% after two years.” That sounds a bit ridiculous to me: who would take such a job, knowing there’s an 80% chance of getting fired??

Or maybe there’s some way they’re thinking of implementing this so it would work: For example, if you want to be a teacher, you could take two years off from your current job with some sort of guarantee that, if you’re not deemed to be in that top 20% of teachers, you get your old job back after two years. That might work, maybe.

23 thoughts on "Estimating and reporting teacher effectiveness: Newspaper researchers do things that academic researchers never could"

  1. If the pay is good, I think many people would take a job like this. Or you could make the raise after two years also quite large.

    There are many sales jobs where they determine your ability during an introductory period and only keep people above a threshold.

  2. If I'm understanding that article and the associated white paper, the rankings are based on estimated coefficients from a linear model for each teacher.

    It's interesting and cool stuff.

    [rant]

    But what the heck is up with this getting several glowing mentions by stats-knowledgeable people around the web without a single mention of the fact that there appears to be no explanation of the errors for the coefficients? I mean, a simple dot plot with error bars would do it, right?

    Isn't explaining variation and uncertainty one of the central freakin points of statistics?

    Wouldn't a "good" graphic depict not only the estimated effect of a teacher but also the ERROR, too? Wouldn't a good newspaper article convey some sense of the precision of the estimates?

    [/rant]

    Sorry.

    On a more serious note, I'm curious how the estimated effects for each teacher might change over time. My quick perusal of the articles seemed to suggest that they didn't look at this (though I could be wrong).
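    To make that concrete, here is a minimal sketch of the kind of dot plot I have in mind, with completely made-up estimates and standard errors (not the Times's numbers):

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up value-added estimates and standard errors for a few teachers
    teachers = ["Teacher A", "Teacher B", "Teacher C", "Teacher D"]
    est = np.array([0.31, 0.12, -0.05, -0.28])
    se = np.array([0.10, 0.08, 0.12, 0.09])

    # Dot plot of the estimates with +/- 2 standard-error bars, sorted by estimate
    order = np.argsort(est)
    y = np.arange(len(teachers))
    plt.errorbar(est[order], y, xerr=2 * se[order], fmt="o")
    plt.yticks(y, [teachers[i] for i in order])
    plt.axvline(0, linestyle="--")
    plt.xlabel("Estimated teacher effect (+/- 2 SE)")
    plt.show()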

  3. http://www4.gsb.columbia.edu/rt/null?&exclusive=f

    Andrew,

    You really ought to take a look at this yourself (see the link above). It's just a PowerPoint and perhaps I've misunderstood their methodology, but having built corporate attrition and recruitment models myself, there are things here that make me nervous as hell.

    They seem to:

    assume the turnover rate isn't affected when you change the distribution of time-on-the-job;

    start out with a low estimate for recruiting cost, then dismiss the possibility that this program might greatly increase the per capita cost;

    ignore nested and censored data (and, as I've mentioned on OE, nested data is a big concern here).

    And as for your point about the difficulty of massively ramping up hiring while telling potential applicants that they have an 80% chance of having to find a new career in a year, the Powerpoint offers these two counterarguments:

    1. There is at least one case of a highly unrepresentative school district ramping up hiring under completely different circumstances;

    2. We could pay more money.

    Joseph and I have more on this over at Observational Epidemiology.

    Mark

  4. "who would take such a job, knowing there's an 80% chance of getting fired??"

    The assumption is that teachers know something about their own quality, so you know (roughly) where you're going to end up. So better quality teachers self-select, and worse quality teachers select away.

    "with some sort of guarantee that, if you're not deemed to be in that top 20% of teachers, that you get to have your old job back after two years"

    Not sure this would work: if you don't make it, your old employer now knows something about your quality that it might not have known before.

  5. OneEyedMan:

    I can't imagine that school districts would have the funds or the willingness to pay the kind of salary that would offset an 80% two-year firing rate.

    Jme:

    I'd be more interested in year-to-year variation than in standard errors for individual-year estimates. Also, I very much like the presentation of estimates within the context of raw data such as in the L.A. Times graphic above.

    Mark:

    Thanks for the link. I'll take a look.

    Matt:

    Sure, I can see that the threat of being fired would deter some of the worst teachers from even trying. Still, 80% . . . that's a lot.

  6. This has caused a pretty big uproar in the ed policy community, where value-added measures of teacher effectiveness have been controversial for a while. I think most people who work on this would say that these measures are really only 'good' at identifying teachers in the tails of the distribution and really not great for all those in the middle, and no serious researcher would suggest that the scores be reported or used in isolation.

    I had the same thought about how a researcher would never be able to do this. (I know several academics who work with this sort of teacher data in other states; since the data is collected by the districts, I think academics often use it without having to go through their own IRBs, but the confidentiality agreements they have to sign with the districts are extreme.) I posed the question on an ed policy blog, and the guy who did the analysis responded that the Times got the data through a public records request. When he worked on it, the data only had scrambled student and teacher identifiers, and then the Times put the teacher names back in when he was done. It made me wonder what would happen if researchers got data through similar public records requests…

  7. Thanks.

    The paper certainly holds up better than the presentation but (based on a quick read) they don't fully address any of the points that concern me and some they don't even acknowledge.

    The paper was also troubling in other ways. They rely heavily on elementary school math scores in their analysis even though the applicability of these findings to HS teaching is questionable and, more importantly, according to these same scores, elementary school is the area where we're doing well internationally.

    Though it's a minor point, their discussion of Teach for America plays a bit fast and loose, ignoring some conflicting research and emphasising the fact that TfA applicants haven't taken education courses but skipping over the fact that they take specialized training that frequently covers the exact same material.

    I've got more on this here:

    http://observationalepidemiology.blogspot.com/201

    http://observationalepidemiology.blogspot.com/201

  8. Seems like "fire 80% of teachers" is a great example of liquidation costs rising at an increasing rate (something Taleb made me think about). I'm not talking about the financial costs of the paperwork, etc., needed to fire 800 out of 1000 teachers (say).

    There are more important reputational costs. If you fire 1-5 teachers in a year, no one notices. Fire 800 teachers, and that attracts a lot more negative attention than 800 times the negative attention caused by firing 1 teacher.

    Those who then have an anxiety attack about the school district include: parents with kids there, parents considering putting their kids there, child-less voters or prospective residents, all teachers and administration or those considering teaching there, perhaps the kids themselves, etc.

    With 800 fired in a stroke, they all smell that "something just ain't right" and pull out of that district — parents send their kids to private schools, voters change districts, prospective teachers stay far away and decide to start their teaching career elsewhere, and so on.

    If the idea of convexity of reputational costs seems dubious, consider more obvious cases. If a school district is found to have 1 teacher who slept with a student, it suffers some cost. What if 100 such teachers had been discovered? The cost there is *way* more than 100 times the first cost. Or if the teacher(s) had been taking bribes for good grades, had tampered with grades just to make themselves look good, etc.

  9. If I had a 5th grade kid in the American education system then I would rather he were in John Smith's class than Miguel Aguilar's. John Smith looks like he has a healthier respect for the part standardised tests play in the education of children aged 10 than Miguel Aguilar does.

    Re: "to hire a lot of teachers on probation and then fire 80% after two years."

    And what about the 80% of kids who got taught by the "inferior" teachers? Don't they deserve teachers worth hiring for more than two years? If you think a teacher isn't worthy enough to teach then don't let them teach at all.

    The scores of JS and MA depend so much on the types of children that each has in their class. Throw an ESOL child and a special-needs child into a class and a top teacher can give a "poor" performance (which still may be better than any other teacher could have done under the circumstances).

    Kids aren't put into classes randomly – a great teacher might get assigned the toughest kids precisely because they are great.

  10. I can't exactly follow the white paper, but I have a question. I would have the initial hypothesis, before doing the study, that teachers whose students were mostly below average at pre-test would see increases at post-test, and teachers whose students were mostly above average at pre-test would see decreases at post-test. Stands to reason that there would be floor and ceiling effects. Does this analysis control for that? It's the case that the two examples from the news story fit that profile.

  11. Speaking of firing four out of five teachers, I've often thought how good the LA Lakers would be if they fired all their players besides Kobe Bryant and Pau Gasol and replaced them with guys who are just as good as Kobe and Pau. They'd be epic!

  12. I read the technical paper by the LA Times's RAND Corporation economist, but it was mostly about what didn't predict changes in test scores (i.e., most things that are generalizable).

    I didn't see much evidence that value-added scores would have huge validity for average teachers. One obvious challenge would be to see if the Times's economist could use 2003-2009 changes in test scores to predict accurately who will be the top performing teachers in the 2010 test scores that have just been released.

  13. Josef:

    I don't know the details of what they did here, but the idea is to control for pre-test using a regression model, not just take a difference score. Difference scores are indeed subject to the problem you discuss, but regression estimates should be ok (a small simulation at the end of this comment illustrates the difference). We discuss this in chapter 9 of ARM.

    Steve:

    I expect that, for the two extreme teachers shown in the L.A. Times article, the difference between them is statistically significant. But, yes, it's a lot easier to distinguish between one of the best teachers and one of the worst, than to distinguish between two randomly-selected teachers. Or to distinguish between two good teachers or two poor teachers. In one of Jonah Rockoff's papers, he and his colleagues did perform a validation such as you suggest, and what they did was to check whether teachers in the top and bottom quarters of performance were likely to remain there one year later.
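    Josef's worry is easy to see in a quick simulation. This is purely illustrative (made-up noise levels and no teacher effects at all): the naive gain score makes low pre-test students look like they improved, while the regression adjustment does not.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Students with a latent ability; pre- and post-tests are noisy measures
    ability = rng.normal(0, 1, n)
    pre = ability + rng.normal(0, 1, n)
    post = ability + rng.normal(0, 1, n)   # no teacher effect at all

    low = pre < -1   # students who did poorly on the pre-test

    # Naive gain score: the low group appears to "improve" (regression to the mean)
    print("mean gain, low pre-test group:", (post - pre)[low].mean())

    # Regression adjustment: residuals from regressing post on pre show no
    # systematic advantage for the low pre-test group
    slope = np.cov(pre, post, ddof=0)[0, 1] / np.var(pre)
    intercept = post.mean() - slope * pre.mean()
    resid = post - (intercept + slope * pre)
    print("mean residual, low pre-test group:", resid[low].mean())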

  14. Andrew:

    The statistic doesn't report on good teachers; it reports on who can best raise the (average?) test scores of their class relative to other teachers.

    There are a number of ways that teachers can raise test scores and some of those involve some really bad teaching practices.

    If all teachers were identical and perfect, I suspect you would still get a very similar pattern to the Times piece, because kids aren't randomly assigned to teachers and some kids' lives are very chaotic and transient, e.g., a kid switches schools for four months to live with Dad.

  15. Most of the statistical methodology described in the report by Richard Buddin of RAND that provides the basis for the LA Times article is straight-forward and used widely. The details of FGLS or the Bayesian methods used to correct for measurement error (not identified in the white paper) are not terribly important here. There is nothing special about so-called Value-Added Measurement (VAM). It’s just a context specific brand name for using student test scores after controlling for other antecedents.

    The Buddin white paper does a fair job of describing the study and its broad results, but as Steve Sailer notes in the comments here and at Marginal Revolution, it does little to bolster the claim that the value-added estimates are useful for evaluation of individual instructors or schools. The LA Times articles, however, strike me as completely irresponsible in their representation of the study and its limitations. Richard Buddin’s white paper suggests that he is not likely the source of the problem. Perhaps the LAT reporters lack the capacity to really understand what it is that they are reporting on or perhaps they’re trying to “sex up” the story; probably some of both.

    It will be interesting to see what form the LAT release of the individual teacher results takes. Will they publish bare point estimates of VAM? Likely. It would be far more responsible of them to report the unique 95%-confidence interval for each teacher. Though, I imagine that they might protest that doing so would be confusing to the average reader. Of course, that confusion is probably warranted in this case.

    I have some issues with the method. First, the model is a simple linear additive model (the general shape of the specification is sketched at the end of this comment). A student gain from the 45th to the 55th percentile is treated as equivalent to that from the 89th to the 99th. Also, the linear model is used though the dependent variable is clearly limited. I would expect the model to perform poorly towards the extremes. Can anyone here provide an informed opinion as to how much this might be expected to influence the validity of the estimates for individual teachers or schools?

    As is often the case, the method relies on the assumption that the student-year error terms are exogenous. Given that the lagged test score is treated as a sufficient statistic for all prior inputs, I would expect this assumption to be violated. The use of robust standard errors only helps with the tests of significance or computation of confidence intervals. However, I question whether the assignment of students to teachers is sufficiently random for this not to impact the individual VAM estimates. If a teacher inherits most of each year’s incoming class from an especially (in)competent teacher in the lower grade, a situation that is likely to be persistent, would we not still expect the VAM estimate to be biased?

    Finally, a quibble with Buddin’s presentation more than the method: there is no presentation of the restricted model (ex-teacher VAM) to compare with the full model. I do note, comparing Table 4 with Table 8 (the teacher and school VAM are estimated independently), that although the coefficient estimate variance of the school effects is in Buddin’s words “quite small” while that of the teachers is “large,” the R-sq of the two models differs by less than 0.01 for ELA and 0.001 for Math. This makes me suspicious that neither the teacher effects nor the school effects add much to the model. It’s not legit to guess from comparisons of coefficient estimates and standard errors, but my guess would be that the lagged test scores are doing all of the heavy lifting in these models. My guess is that the Cohen’s f-squared for the individual teacher VAM is vanishingly small. In keeping with the Bayesian bent of this forum, by how much would one rationally revise one’s prior estimate of an individual teacher’s performance on the basis of this VAM estimate?

    The McCaffrey, Sass, Lockwood, and Mihaly (2009) paper referenced by Buddin, “The Intertemporal Variability of Teacher Effect Estimates,” Education Finance and Policy 4(4), goes a long way towards addressing the concerns of others here regarding the typical magnitude of standard errors, variability and forecasting accuracy of such models. Those authors estimate that restricting the grant of tenure only to teachers with VAM estimates in the top three quintiles would be expected to improve test scores by about 0.04 standard deviations.

    There is an extensive literature from both compensation and learning that details why measures such as these are likely to be more harmful than helpful in this context, but this is a statistics blog and this comment is too long already.
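    For readers who want the general shape of the model being discussed, a generic value-added specification (not necessarily Buddin's exact one) can be written as

    y_{it} = \lambda y_{i,t-1} + x_{it}' \beta + \theta_{j(i,t)} + \epsilon_{it}

    where y_{it} is student i's test score in year t, y_{i,t-1} is the lagged score, x_{it} are student and classroom covariates, \theta_{j(i,t)} is the effect of the teacher who taught student i in year t (the VAM estimate), and \epsilon_{it} is the student-year error assumed exogenous.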

  16. Megan: Two things.

    1. I agree that teaching-to-the-test is not, and should not be, the only thing. Nonetheless, I'd guess that the teachers who do best with students' test scores are good teachers in other ways. (Conversely, when I think of the times that I've taught poorly, I think these are classes where my students learned very little.)

    2. From my discussions with education researchers, my impression is that elementary-school students are assigned nearly randomly to teachers. So I don't think that's such a bad approximation. If teachers truly did not vary so much in teaching ability, I don't think we'd be seeing the persistent patterns from year to year that we see.

    That said, if these tests become high-stakes for teachers, I can see that the new incentives could create big problems.

    David: Thanks for the long and thoughtful comment. Maybe I'll post it as its own blog entry; I don't really know how many readers ever get this deep into the comment section.

  17. Re: Megan's point

    "I agree that teaching-to-the-test is not, and should not be, the only thing. Nonetheless, I'd guess that the teachers who do best with students' test scores are good teachers in other ways."

    Putting aside math for the moment (where teaching to the test is less risky), your assumption is only valid if you hold constant factors like material covered, methods of instruction and grading.

    I taught in a high school where the principal (who was highly ambitious and not all that ethical) put great pressure on teachers to raise test scores. One particularly spineless history teacher spent about a month doing nothing but drilling facts that were likely to be on the test. No discussions. No writing assignments. No additional reading. No attempt to put the material into any kind of meaningful context.

    His students always did well.

    All of the really good history teachers I've known have routinely done things that have lowered student test scores. Going all the way back to my high school teacher, a short-tempered football coach who actually thought all that excellence and character-building stuff should apply not only to his players but to all students. He assigned books like The Jungle. He demanded thoughtful discussions. He spent considerable time explaining the greater significance of our lessons.

    None of these things appreciably raised his students' test scores.

  18. Dear Dr. Huelsbeck:

    Thanks so much.

    You write:

    "Those authors estimate that restricting the grant of tenure only to teachers with VAM estimates in the top three quintiles would be expected to improve test scores by about 0.04 standard deviations."

    So, according to the McCaffrey paper, firing the bottom 40% of teachers after a few years on the job would boost average student test scores by 0.04 standard deviations. That would be from the 50th percentile to what’s now the 51.6th percentile? (A quick check of that arithmetic is at the end of this comment.)

    I've advocated value-added analysis of teachers and schools since the mid-1990s. Nonetheless, I suspect we're going to take years to work the kinks out of overall rating systems.

    By way of analogy, Bill James kicked off the modern era of baseball statistics analysis around 1975. But he stuck to doing smaller scale analyses and avoided trying to build one giant overall model for rating players. In contrast, other analysts such as Pete Palmer rushed into building overall ranking systems, such as the one in his 1984 book, but they tended to generate curious results, such as the greatness of Roy Smalley Jr. James held off until 1999 before unveiling his win share model for overall rankings.
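    As a quick check of the percentile arithmetic above (assuming roughly normal scores):

    from scipy.stats import norm

    # A 0.04 standard deviation shift moves a median student to roughly
    # the 51.6th percentile of the original distribution
    print(norm.cdf(0.04))   # ~0.516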

  19. Relating to the issues about IRBs: Dick Buddin is a RAND economist and works at RAND. However, this was not RAND work (it's reported ambiguously or wrongly in a lot of places); from the start it was explicitly done separately from RAND. RAND does have an IRB, which is approved by bodies like the NIH and functions just like a university IRB.

    This project was carried out outside RAND, so the IRB wasn't involved. I wonder how that would work in a university. Could one say that one's work was done outside the university and therefore it was not subject to IRB approval? I guess it was done under the auspices of the LA Times, and I imagine they have some sort of ethical guidelines. (Although maybe they're not formalized.)

    Jeremy

    Disclaimer: I work at RAND.
