Friday, September 11, 2015

When do dropouts drop out?

By which I mean: in which grades, and in which months of the school year, are high school students most likely to drop out?

On this subject, the most interesting question is "do dropouts more often leave high school over summer break, or do they more often leave sometime in the middle of the year instead?"

The second most interesting question is "in which grade do students drop out?"

I tried two approaches to answer this question. One uses two small datasets, but is subject to numerous caveats. The other uses a big dataset, the CPS, but I found it a poor approach. I discuss these in turn.

Small data: NELS:88 & ELS:2002

The small data approach uses two NCES student surveys, which I refer to here by their acronyms NELS (from the early 1990s), and ELS (from the early 2000s).

In these surveys, dropouts are asked (in 12th grade) "when did you last attend school?" The NELS and ELS provide enough information to pin this down to grade and month.

Running the tabulations, this is what I found.

Results from the NELS

Results from the ELS

Discussion

From the NELS it looks like 11th grade is the time to drop out, although this could be driven by attrition--students who drop out and actually take the dropout survey in the NELS in 12th grade might be more likely to have dropped out recently; otherwise we would lose them from the sample as well. The ELS shows similar patterns.

As for summertime dropout: if it were a big deal, then we should see a substantial spike in May, June, or July. We see May spikes, but not June or July spikes. This suggests that summertime dropout might be less common than within-school-year dropout.

Caveats

As already mentioned, the biggest caveat with this approach is sample attrition.

For the NELS,

  • 1,445 are known dropouts (11.90% of the full sample)
  • of whom 1,033 (72%) have a known time of last attendance

For the ELS,

  • 845 are known dropouts (5.22% of the full sample)
  • of whom 513 (61%) have a known time of last attendance

The second biggest caveat is small sample size. For instance, if I were to estimate the proportion last attending in June of 9th grade using the NELS:88, the 95% confidence interval would be about 0.5% to 2.1% using the usual formulas.
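By "the usual formulas" I mean the normal-approximation (Wald) interval for a proportion. A minimal sketch, using the 1,033 NELS dropouts with a known time of last attendance; the point estimate p = 0.013 is assumed for illustration, not a number from the tabulations:

```python
import math

def wald_ci(p, n, z=1.96):
    """95% normal-approximation (Wald) confidence interval for a proportion."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# n = 1,033 NELS dropouts with a known time of last attendance;
# p = 0.013 is an illustrative point estimate only.
lo, hi = wald_ci(0.013, 1033)
print(f"{lo:.1%} to {hi:.1%}")  # roughly 0.6% to 2.0%
```

With n this small, any single grade-month cell carries a wide interval, which is why the month-by-month patterns above should be read loosely.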

Big data: Current Population Survey

The big data approach exploits the longitudinal Current Population Survey (CPS). The approach takes a bit of explaining, but I'll be brief.

Be forewarned that the results of this approach are not reasonable at all. But at this point I can't figure out the flaw in my method.

Anyway, here we go. The CPS surveys a bunch of Americans and asks those ages 16 to 24 what grade they're in and whether they are attending high school. Being on summer vacation is considered not attending high school (other vacations, such as Christmas break, are still considered attending). Why is that? I'm not sure, but apparently the timing of summer break is more important to the Bureau of Labor Statistics than high school dropout rates.

Critical for the current analysis, the CPS surveys the same people repeatedly over time, in the following manner. Say I'm chosen to be part of the CPS today (in September). I will then answer a series of questions today. But they'll also come back in October and ask me another series of questions; and also in November, and in December. Then they will leave me alone (they will come back a year later, too, but I don't use that in the analysis to follow).

So if I were to drop out of high school sometime between my current survey (in September) and the next one (in October), then I'd report that I'm in school now, but next month I'd say I'm not. This is how I calculate the month-to-month event drop out rate, for pairs of months.

This longitudinal nature of the CPS used to be very difficult to exploit, but just last year a group of researchers at the Minnesota Population Center (MPC) provided an easy-to-use linking variable in the IPUMS CPS. They discuss the use of these linking variables in their impressive paper, "Making full use of the longitudinal design of the Current Population Survey: Methods for linking records across 16 months." It suffices to say here that they've made linking people longitudinally in the CPS easy enough for even me to do.

So given the survey design,

  • I can only calculate the event dropout rate for months that are within 4 months of each other. That is, I can calculate the proportion dropping out between September and December, but not September and January.
  • But then you ask: why wouldn't you just calculate one-month drop out rates--that should be enough, no?
    • The reason I need to calculate 4-month dropout rates is that otherwise I cannot study summertime dropout--which is the most interesting question. With 4-month-ahead dropout rates, I can study dropout from June to September--i.e., I can ask, "among the students in school in June, what proportion were NOT in school in September?" This gives an estimate of summertime dropout that is reasonable even though the variable measuring high school enrollment cannot distinguish between dropout and summer vacation.
    • 4-month dropout rates also provide a sanity check on the method--a check that is absurdly violated, and is the reason why I don't currently take the results from this section seriously.
  • I only calculate dropout rates by month for 9th through 11th graders, and the statistics are heavily weighted toward 11th graders. The attendance question does not distinguish between graduation and dropout, so I avoid that problem entirely by excluding 12th graders. Moreover, the survey only asks those 16 and older whether they are attending high school, and a 16-year-old is typically in 10th or 11th grade (usually 11th).
    • If you're familiar with the CPS data, you might reason: the survey asks about highest level of education, so couldn't I just code a student as graduating high school by looking at their month-to-month change in highest level of education? The answer is no! As discussed in more detail by the creators of the CPS linking variables, highest level of education is only updated upon entrance to the sample or in particular months. According to their paper (page 140): "In the case of educational attainment, for example, respondents are only asked questions about that concept in February, July, October, or in MIS1 and MIS5; data in all other months are carried forward from earlier surveys. These sorts of measurement complications will be documented in the new IPUMS-CPS data dissemination website, but it will be important for researchers to understand them going forward."
    • This might make you worried that something similar is going on with the variable that I do use--school attendance. But that would make dropout look even rarer, when in fact, as you will see, the amount of dropout estimated through this method looks to be way too high!

Results from the CPS

The results from this approach are reported in the following table. The rows of the table are the start months; the columns are the end months.

To calculate the event dropout rate for each cell, I:

  • Restrict attention to people who were surveyed in both months
  • Calculate S, the number of people surveyed who were in high school in both months
  • Calculate D, the number of people surveyed who were in high school in the first month but NOT in school in the second month
  • Calculate dropout rate = D/(D+S).
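The four steps above can be sketched in Python. This is a toy version, not the actual CPS processing: the record layout is assumed (an "id" field standing in for the IPUMS longitudinal person identifier, a "month", and an "in_hs" attendance flag), and cross-year month pairs (e.g., October to January) would need a year field as well:

```python
# Sketch of the per-cell calculation. Each record is a dict with keys
# "id" (a longitudinal person identifier, in the spirit of the IPUMS-CPS
# linking variable), "month" (1-12), and "in_hs" (True if attending
# high school that month). Cross-year pairs would also need a year key.
def event_dropout_rate(records, start, end):
    """D / (D + S): among people in high school in `start` who are also
    surveyed in `end`, the share no longer in school in `end`."""
    start_status = {r["id"]: r["in_hs"] for r in records if r["month"] == start}
    end_status = {r["id"]: r["in_hs"] for r in records if r["month"] == end}
    S = D = 0
    for pid, in_hs_start in start_status.items():
        if pid not in end_status or not in_hs_start:
            continue  # not surveyed in both months, or not in school at start
        if end_status[pid]:
            S += 1  # in school in both months
        else:
            D += 1  # in school in the first month but not the second
    return D / (D + S)
```

People surveyed in only one of the two months simply drop out of the denominator, which mirrors the "restrict attention to people surveyed in both months" step.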

Dropout by pairs of months

Data source: Current Population Survey (CPS), 2000-2013 all months; age range 16-24; obtained from cps.ipums.org

     jan  feb  mar  apr  may  jun  jul  aug  sep  oct  nov  dec
jan    .  3.0  3.6  3.8    .    .    .    .    .    .    .    .
feb    .    .  3.6  3.7  4.5    .    .    .    .    .    .    .
mar    .    .    .  3.6  4.6 41.8    .    .    .    .    .    .
apr    .    .    .    .  4.1 42.5 56.3    .    .    .    .    .
may    .    .    .    .    . 41.9 58.0 38.0    .    .    .    .
jun    .    .    .    .    .    . 45.0 34.3  5.2    .    .    .
jul    .    .    .    .    .    .    . 21.7  5.3  4.4    .    .
aug    .    .    .    .    .    .    .    .  4.6  3.8  4.9    .
sep    .    .    .    .    .    .    .    .    .  3.3  4.0  3.7
oct  3.9    .    .    .    .    .    .    .    .    .  3.6  4.0
nov  4.0  3.7    .    .    .    .    .    .    .    .    .  3.5
dec  3.6  3.5  4.1    .    .    .    .    .    .    .    .    .

Table reads: about 3% of students who are surveyed and expected to be in high school in both January (row of the table) and February (column of the table) are in school in January but NOT in February. That is, the event dropout rate between January and February is about 3%.

Number of students in each month pair

Data source: Current Population Survey (CPS), 2000-2013 all months; age range 16-24; obtained from cps.ipums.org

       jan    feb    mar    apr    may    jun    jul    aug    sep    oct    nov    dec
jan      .  20157  12813   6211      .      .      .      .      .      .      .      .
feb      .      .  21640  13897   6513      .      .      .      .      .      .      .
mar      .      .      .  22419  14261   6705      .      .      .      .      .      .
apr      .      .      .      .  23065  14563   4210      .      .      .      .      .
may      .      .      .      .      .  22491   9220   4464      .      .      .      .
jun      .      .      .      .      .      .   9085   5652   2620      .      .      .
jul      .      .      .      .      .      .      .   7554   4705   2090      .      .
aug      .      .      .      .      .      .      .      .  11851   6893   3301      .
sep      .      .      .      .      .      .      .      .      .  16959  10807   5060
oct   5480      .      .      .      .      .      .      .      .      .  18909  12057
nov  11994   5212      .      .      .      .      .      .      .      .      .  19799
dec  19013  11126   5222      .      .      .      .      .      .      .      .      .

Table reads: in the CPS, 20,157 students were either in school in both January (row of table) and February (column of table), hence counted as attendees, or in school in January but NOT in February, hence counted as dropouts.

This doesn't add up

Standard errors can be calculated as sqrt(p(1-p)/n), and they can get large--but standard errors aren't the most obvious problem with this table. The statistics fail a simple adding-up condition: if there is a 3% dropout rate from Jan to Feb, 3.6% from Feb to Mar, and 3.6% from Mar to Apr, then the dropout rate from Jan to Apr should be pretty close to 10%, but in this data it is only 3.8%! That the table so blatantly violates the simplest adding-up condition makes me very wary of this approach to determining monthly event dropout rates using the CPS.
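The adding-up condition follows from compounding monthly survival probabilities: chaining the one-month rates should roughly reproduce the three-month rate. A quick check with the rates from the table:

```python
# One-month event dropout rates from the table above.
jan_feb, feb_mar, mar_apr = 0.030, 0.036, 0.036

# The implied Jan-to-Apr rate is one minus the product of the
# monthly "survival" (non-dropout) probabilities.
implied_jan_apr = 1 - (1 - jan_feb) * (1 - feb_mar) * (1 - mar_apr)
print(f"implied Jan-to-Apr rate: {implied_jan_apr:.1%}")  # 9.9%
# ...versus the 3.8% actually observed in the Jan/Apr cell of the table.
```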

But if we're willing to take the table seriously anyway, it looks like summertime dropouts are in the minority. The low event dropout rates from June or July to September suggest this. Depending on how you run the numbers, this approach suggests that summertime dropout probably composes somewhere between 15% and 40% of total dropout.

I hope to find better ways to estimate these important stylized facts of dropout, either using the CPS or other data sources.