Friday, July 24, 2015

How to cluster standard errors in 15 minutes

A brief note on clustered standard errors, so we can get back to business asap.

This blog post follows section 8.2 of Mostly Harmless Econometrics (abbreviated MHE). I highly recommend reading that section of MHE for more details.

Let’s start with an econometric model with a group-specific component in the error:

yig = β0 + β1xig + vg + ηig

Before moving on, let me make sure that I know what I mean by this.

First, I mean that the conditional expectation of yig is linear: E[yig|xig]= β0 + β1xig. Thus, xig is exogenous–like randomly assigned–got it?

Second, I mean that all other factors that determine yig are decomposed into two parts: first, an “uptick” that a person has because she is in group g (that’s vg); second, all other idiosyncratic factors (those are in ηig). Call eig = vg + ηig, because why not?
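To make the decomposition concrete, here is a minimal simulation sketch of the model in Python (all the numbers, and the person-by-person treatment assignment, are made up for illustration):

```python
# A sketch of the model above: y_ig = b0 + b1*x_ig + v_g + eta_ig.
import numpy as np

rng = np.random.default_rng(0)
G, n_per_group = 50, 30                    # hypothetical: 50 groups of 30 people
b0, b1 = 1.0, 2.0                          # hypothetical coefficients

g = np.repeat(np.arange(G), n_per_group)   # group id of each observation
v = rng.normal(0.0, 1.0, G)[g]             # the group "uptick" v_g, shared within a group
eta = rng.normal(0.0, 1.0, g.size)         # idiosyncratic part eta_ig
x = rng.binomial(1, 0.5, g.size)           # one possible treatment: assigned person by person
y = b0 + b1 * x + v + eta                  # composite error e_ig = v_g + eta_ig
```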

Running example

The running example in MHE is simple: i is students, g is classrooms, and xig is class size. Then xig doesn’t vary across i within g, so we can just write xig as xg.

Here’s an example for the more general case: a large school district randomly assigns categorical grants for 9th grade math tutoring to schools, and the schools then individually randomly assign the tutoring to students within their walls. Then xig indicates whether student i in school g gets tutoring, and yig is her 9th grade math test score. Treatment varies within schools. I will use this example.

Roughly speaking, because assignment of tutoring is not completely random, that is, we didn’t just “shake up the whole district” and randomly assign students to tutoring or not, inference about the average effect of tutoring on math test scores cannot possibly be as precise as it would be under that fully student-level randomization. It could be that, by chance, a few really good schools got the grants! (Imagine if only 2 schools got these grants: then we could be in some serious trouble!) Of course, this only matters insofar as there are “good schools” and “bad schools”: if students are completely randomly assigned to schools, then we’re still OK. But in general, because there are many fewer schools than students, and part of our randomization was at the school level, we have to “pretend” we have less data, which in practice means we have to blow up our standard errors. This is all so that, e.g., our 95% CIs do what they’re intended to do.

Clustering in practice

Here’s the general practical procedure to use for error clustering, with intuitive justification. It is really simple.

  • First, if you’re thinking about clustering and the number of groups you have is large, just try clustering. Although there are many ways to “try clustering,” it is easiest to just apply Stata’s -cluster- option to your model. (Random effects (RE) complicate matters, and in footnote 223 of MHE the authors say that we generally won’t see much gain from using RE. But I faintly remember seeing some significant increases in precision in some regressions I ran, so keep your eyes peeled and let me know if that ever happens to you!)

  • If your standard errors go up when you cluster, you have to cluster; if they don’t, you don’t. It is that simple (a code sketch of this comparison follows the list). Why? Because when the number of groups is large, the clustered standard errors are a consistent (though presumably less precise) estimate of the true standard errors, even under the assumption that “you don’t have to cluster.” In particular, when you “don’t have to cluster,” clustering won’t affect your standard errors. What is “a large number of groups”? Maybe >10? MHE discusses this a bit, but there is no clear answer.

  • If the number of groups you have is very small, worry. Angrist and Pischke say that the clustered standard errors are usually lower than they should be in this case. I tried this with a few groups in simulations (not reported), and yes, the standard errors seem to go way down. Clustered standard errors should not come out smaller than the truth!

    • But note, when we say “small number of groups,” I think we’re really saying < 10. This is a very small number of groups, and it starts making me nervous that maybe at this point we started with the wrong model. I muse about this a little at the end of this blog post.
  • To sum up, if the number of groups is large (>10?), don’t think–just cluster™.
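Here’s the “try clustering and compare” step as a rough Python sketch using statsmodels instead of Stata’s -cluster- option; the data-generating process (school-level treatment probabilities, variances, group counts) is invented purely for illustration:

```python
# "Just try clustering": simulate data where treatment probability varies by
# school (rho_x > 0) and there is a school effect (rho_e > 0), then compare
# conventional and cluster-robust standard errors on the slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
G, n_per_group = 50, 30
g = np.repeat(np.arange(G), n_per_group)

p_g = rng.choice([0.1, 0.9], size=G)       # school-level grants: high or low tutoring probability
x = rng.binomial(1, p_g[g])                # student-level treatment, correlated within school
v = rng.normal(0.0, 1.0, G)[g]             # school-specific error component
y = 1.0 + 2.0 * x + v + rng.normal(0.0, 1.0, g.size)

X = sm.add_constant(x)
conventional = sm.OLS(y, X).fit()
clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": g})

print("conventional SE:", conventional.bse[1])
print("clustered SE:   ", clustered.bse[1])   # noticeably larger in this setup
```

If you rerun this with person-by-person assignment (replace p_g[g] with a constant 0.5), the two standard errors should come out roughly equal: the “you didn’t have to cluster” case.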

The theory behind clustering

OK, enough hand-waving. The Moulton factor (page 311 of MHE) tells the complete story, so let’s check it out. It says that when there is that group-specific error component vg, the true sampling variance of the slope coefficient β̂ is equal to the conventional sampling variance blown up by the Moulton factor:

(true σ²(β̂)) = (1 + [V(ng)/n̄ + n̄ − 1] ρx ρe) × (conventional σ²(β̂))

MHE talks about this formula briefly, but in my opinion this equation displays the heart of the clustering problem so it deserves our full attention.

In this formula:

  • ρx is the intraclass correlation of xig
  • ρe is the intraclass correlation of eig = vg + ηig
  • n̄ is the average group size
  • V(ng) is the variance of group size

Note that [V(ng)/n̄ + n̄ − 1] is positive as long as group size is not always 1 (if every group had exactly one member, V(ng) = 0 and n̄ = 1, so the term would be zero). So whenever both intraclass correlations are also positive, the whole Moulton factor is greater than 1.
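If you want to plug numbers into the formula, here is a tiny sketch; the group sizes and intraclass correlations below are hypothetical:

```python
# The Moulton factor from the formula above (inputs below are hypothetical).
import numpy as np

def moulton_factor(group_sizes, rho_x, rho_e):
    n_g = np.asarray(group_sizes, dtype=float)
    n_bar = n_g.mean()
    return 1.0 + (n_g.var() / n_bar + n_bar - 1.0) * rho_x * rho_e

# 50 schools of 30 students, treatment fixed within school (rho_x = 1),
# 10% of the error variance at the school level (rho_e = 0.1):
factor = moulton_factor([30] * 50, rho_x=1.0, rho_e=0.1)
print(factor, np.sqrt(factor))   # variance inflated about 3.9x, so SEs roughly double
```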

What are the intraclass correlations? Well, ρx is the correlation between xig and xjg (within g) averaged across g. So it measures “how much treatment depends on the group you are in.” And ρe is the proportion of the variance of the error that is in vg, which means essentially the proportion of all factors determining yig besides xig that are correlated within group. If you didn’t get that the first time, keep reading.
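If you want a back-of-the-envelope estimate of these intraclass correlations from your data, a one-way ANOVA decomposition does the trick; this sketch assumes equal group sizes and is only meant to build intuition, not to replace clustering:

```python
# A rough one-way-ANOVA estimate of an intraclass correlation,
# assuming equal group sizes (a back-of-the-envelope check only).
import numpy as np

def icc(values, groups):
    values, groups = np.asarray(values, dtype=float), np.asarray(groups)
    labels = np.unique(groups)
    n = len(values) // len(labels)                 # assumes every group has the same size
    means = np.array([values[groups == l].mean() for l in labels])
    msb = n * means.var(ddof=1)                    # between-group mean square
    msw = np.mean([values[groups == l].var(ddof=1) for l in labels])
    return (msb - msw) / (msb + (n - 1) * msw)

# icc(x, g) is a rough rho_x; icc(residuals, g) on OLS residuals is a rough rho_e.
```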

The formula reveals the two cases when you don’t have to cluster:

Case 1: ρx = 0. This occurs if, for instance, xig is a treatment that is randomly assigned to students regardless of the school g they attend. In this case, the fact that I got treated doesn’t tell you that my schoolmates were more likely to be treated, so the problem’s gone and we’re back to an iid sample!

If ρx is nonzero, treatment probability depends on which school you attend, though, because we assume xig is exogenous, it depends on it in a “completely random way.” For example, you might treat students with 90% probability in one school but only 10% probability in another, because of differential grant assignment at the school level. But the schools that got these different probabilities were completely randomly picked! Thus there is no correlation between xig and vg, although there is correlation between xig and xjg for a given g. For instance, if my friend and I attend the same school and you know that I was tutored in 9th grade, you will guess that my friend also got tutored, because you’ll think it likely that our common school got a grant. Hence ρx > 0.

Case 2: ρe = 0. If vg “doesn’t exist,” then clustering won’t change the standard error, either. Sure, maybe treatment probability depends on the school you attend in a random way, as discussed in Case 1. But if school doesn’t affect outcomes at all, then this nonrandomness doesn’t matter! In this case, there aren’t any “good schools” or “bad schools.” To give an extreme example: treatment always depends on “the coin flip” you used to randomize (call this Coing), which varies at the “group level” where the group is defined as g = 1 for the tutored and g = 0 for the untutored. However, Coing does not affect test scores at all! Hence in this case vg = 0, which is the same thing as ρe = 0. We don’t need to cluster.
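Here is Case 2 in a quick simulation sketch (numbers invented): treatment probability still varies by school, but there is no school component in the error, and clustering barely moves the standard error:

```python
# Case 2 in simulation (made-up numbers): treatment probability varies by
# school, but there is no school component in the error (rho_e = 0).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
G, n_per_group = 50, 30
g = np.repeat(np.arange(G), n_per_group)

p_g = rng.choice([0.1, 0.9], size=G)               # school-level grants, so rho_x > 0
x = rng.binomial(1, p_g[g])
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, g.size)   # no v_g term: schools don't matter for outcomes

X = sm.add_constant(x)
model = sm.OLS(y, X)
print(model.fit().bse[1])                                             # conventional SE
print(model.fit(cov_type="cluster", cov_kwds={"groups": g}).bse[1])   # clustered SE, about the same
```

The symmetric check for Case 1 (assign tutoring person by person while keeping the school effect vg) behaves the same way: the two standard errors roughly agree.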

My musings about a small number of groups

When there is a small number of groups, I’m still confused. For instance, suppose I want to estimate a difference in means: say I want to ask whether New York or California has a higher propensity to vote among people in their teens. In this case there are “two groups,” the two states, but it is complete nonsense to cluster standard errors at the state level. Also consider the standard two-state differences-in-differences estimator. In that case, if I impose the strong form of common counterfactual trends, namely that the counterfactual trends of the two states would be identical without the policy change, then I don’t have to cluster. If I don’t impose this, I cannot infer anything from a two-state differences-in-differences anyway, and my “inability” to consistently estimate clustered standard errors is just another, more roundabout way of telling me this.

I might be willing, however, to impose the common counterfactual trends assumption and use a two-state diff-in-diff to identify, say, the effects of a particular policy change on one state’s outcome variables. Realizing that practically every policy change is a complex bundle, it may be worth obtaining suggestive evidence that a particular policy change “worked” even if we can’t be completely confident the strong common counterfactual trends assumption is satisfied. I’m still musing over this. My issue is reconciling several facts: we can get precise estimates of the difference between two population means by drawing two simple random samples; a diff-in-diff with only two states identifies anything only under the strong common counterfactual trends assumption; a diff-in-diff with many states can identify something under a weaker counterfactual trends assumption; and with “enough” states we should be able to “test” the strong common counterfactual trends assumption by error clustering. But I don’t know whether “enough” means “3 or more,” or what. That is a topic for another day.
