Friday, September 11, 2015

When do dropouts drop out?

By which I mean--in which grades and during which months of the school year are high school students more likely to drop out?

On this subject, the most interesting question is "do dropouts more often leave high school over summer break, or do they more often leave sometime in the middle of the year instead?"

The second most interesting question is "in which grade do students drop out?"

I tried two approaches to answer this question. One uses two small datasets, but is subject to numerous caveats. The other uses a big dataset, the CPS, but I found it a poor approach. I discuss these in turn.

Small data: NELS:88 & ELS:2002

The small data approach uses two NCES student surveys, which I refer to here by their acronyms NELS (from the early 1990s), and ELS (from the early 2000s).

In these surveys, dropouts are asked (in 12th grade) "when did you last attend school?" The NELS and ELS provide enough information to pin this down to grade and month.

Running the tabulations, I found the following.

Results from the NELS

Results from the ELS

Discussion

From the NELS it looks like 11th grade is the most common time to drop out, although this could be driven by attrition--students who drop out and actually take the NELS dropout survey in 12th grade might be more likely to have dropped out recently; otherwise we would lose them from the sample as well. The ELS shows similar patterns.

As for summer dropout, if it were a big deal we would expect a substantial spike in May, June, or July. We see May spikes, but not June or July spikes. This suggests that summertime dropout might be less common than within-school-year dropout.

Caveats

As already mentioned, the biggest caveat with this approach is sample attrition.

For the NELS,

  • 1,445 are known dropouts (11.90% of the full sample)
  • of which 1,033 (72%) have a known time of last attendance

For the ELS,

  • 845 are known dropouts (5.22% of the full sample)
  • of which 513 (61%) have a known time of last attendance

The second biggest caveat is small sample size. For instance, if I were to estimate the proportion last attending in June of 9th grade using the NELS:88, the 95% confidence interval would be about 0.5% to 2.1% using the usual formulas.
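
For concreteness, here's a minimal sketch of that interval calculation in Python, using the usual normal-approximation formula. The numerator count is made up for illustration; only the denominator (the 1,033 dropouts with a known time of last attendance) comes from the tabulation above.

```python
# Normal-approximation 95% CI for a proportion.
# The count k below is hypothetical, for illustration only.
import math

n = 1033          # dropouts with a known time of last attendance (from above)
k = 13            # hypothetical number last attending in June of 9th grade
p = k / n

se = math.sqrt(p * (1 - p) / n)
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"estimate = {p:.2%}, 95% CI = ({lo:.2%}, {hi:.2%})")
```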

Big data: Current Population Survey

The big data approach exploits the longitudinal Current Population Survey (CPS). The approach takes a bit of explaining, but I'll be brief.

Be forewarned that the results of this approach are not reasonable at all. But at this point I can't figure out the flaw in my method.

Anyway, here we go. The CPS surveys a bunch of Americans and asks those ages 16 to 24 what grade they're in and whether they are attending high school. Being on summer vacation is considered not attending high school (other vacations such as Christmas break are still counted as attending). Why is that? I'm not sure, but apparently the timing of summer break matters more to the Bureau of Labor Statistics than high school dropout rates do.

Critical for the current analysis, the CPS surveys the same people repeatedly over time, in the following manner. Say I'm chosen to be part of the CPS today (in September). I will then answer a series of questions today. But they'll also come back in October and ask me another series of questions; and also in November, and in December. Then they will leave me alone (they will come back a year later, too, but I don't use that in the analysis to follow).

So if I were to drop out of high school sometime between my current survey (in September) and the next one (in October), then I'd report that I'm in school now, but next month I'd say I'm not. This is how I calculate the month-to-month event drop out rate, for pairs of months.

This longitudinal nature of the CPS used to be very difficult to exploit, but just last year a group of researchers at the Minnesota Population Center (MPC) provided an easy-to-use linking variable in the IPUMS CPS. They discuss the use of these linking variables in their impressive paper, "Making full use of the longitudinal design of the Current Population Survey: Methods for linking records across 16 months." It suffices to say here that they've made linking people longitudinally in the CPS easy enough for even me to do.
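
For anyone who wants to try this, here is a rough sketch of the linking step, assuming an IPUMS-CPS extract that includes the person linking key (CPSIDP) along with YEAR and MONTH; the file name is made up, and the remaining columns are whatever you request in the extract.

```python
# Sketch: pair each person's record in one month with their record
# `lag` months later, using the IPUMS-CPS person linking key (CPSIDP).
import pandas as pd

cps = pd.read_csv("cps_extract.csv")   # hypothetical IPUMS-CPS extract
cps["date"] = pd.to_datetime(
    cps[["YEAR", "MONTH"]].rename(columns=str.lower).assign(day=1)
)

def link_months(df, lag):
    """Merge each person's record with their own record `lag` months later."""
    later = df.copy()
    later["date"] = later["date"] - pd.DateOffset(months=lag)
    return df.merge(later, on=["CPSIDP", "date"], suffixes=("_t0", "_t1"))

pairs_1mo = link_months(cps, 1)   # e.g. September matched to October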

So given the survey design,

  • I can only calculate the event dropout rate for months that are within 4 months of each other. That is, I can calculate the proportion dropping out between September and December, but not September and January.
  • But then you ask: why wouldn't you just calculate one-month drop out rates--that should be enough, no?
    • The reason that I need to calculate 4-month dropout rates is that otherwise I cannot study summertime dropout--which is the most interesting question. With 4-month-ahead dropout rates, I can study dropout from June to September--i.e., I can ask, "among the students in school in June, what proportion were NOT in school in September?" This gives an estimate of summertime dropout that is reasonable even though the variable measuring high school enrollment cannot distinguish between dropout and summer vacation.
    • 4-month dropout rates also provide a sanity check on the method--a check that is absurdly violated, which is why I don't currently take the results from this section seriously.
  • I only calculate dropout rates by month for 9th through 11th graders, and the statistics are heavily weighted toward 11th graders. The attendance question does not distinguish between graduation and dropout, so I avoid this problem entirely by excluding 12th graders. Moreover, the survey only asks people age 16 or older whether they are attending high school, and 16-year-olds are typically in 10th or 11th grade (usually 11th).
    • If you're familiar with the CPS data, you might reason: the survey asks about highest level of education, so couldn't I just code a student as graduating high school by looking at their month-to-month change in highest level of education? The answer is no! As discussed in more detail by the creators of the CPS linking variable, highest level of education is only updated upon entrance to the sample or in particular months. According to their paper (page 140): "In the case of educational attainment, for example, respondents are only asked questions about that concept in February, July, October, or in MIS1 and MIS5; data in all other months are carried forward from earlier surveys. These sorts of measurement complications will be documented in the new IPUMS-CPS data dissemination website, but it will be important for researchers to understand them going forward."
    • This might make you worry that something similar is going on with the variable that I do use--school attendance. But if anything, that should make dropout look even rarer, when in fact, as you will see, the amount of dropout estimated through this method looks to be way too high!

Results from the CPS

The results from this approach are reported in the following table. The rows of the table are the start months; the columns are the end months.

To calculate the event dropout rate for each cell, I:

  • Restrict attention to people who were surveyed in both months
  • Calculate S, the number of people surveyed who were in high school in both months
  • Calculate D, the number of people surveyed who were in high school in the first month but NOT in school in the second month
  • Calculate dropout rate = D/(D+S).
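
Here is a minimal sketch of that calculation for a single month pair, assuming the months have already been linked person-to-person as sketched earlier; the file name and the in_hs indicator columns are placeholders for whatever attendance variable is in the extract.

```python
# Sketch of the D/(D+S) event dropout rate for one month pair.
# in_hs_t0 / in_hs_t1 are placeholder indicators (1 = attending high school).
import pandas as pd

pairs = pd.read_csv("linked_month_pairs.csv")   # hypothetical linked file

def event_dropout_rate(pairs, start_month, end_month):
    sub = pairs[(pairs.MONTH_t0 == start_month) & (pairs.MONTH_t1 == end_month)]
    S = ((sub.in_hs_t0 == 1) & (sub.in_hs_t1 == 1)).sum()   # in school both months
    D = ((sub.in_hs_t0 == 1) & (sub.in_hs_t1 == 0)).sum()   # in school, then not
    return D / (D + S)

print(event_dropout_rate(pairs, start_month=1, end_month=2))  # Jan -> Feb
```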

Dropout by pairs of months

Data source: Current Population Survey (CPS), 2000-2013 all months; age range 16-24; obtained from cps.ipums.org

     jan   feb   mar   apr   may   jun   jul   aug   sep   oct   nov   dec
jan    .   3.0   3.6   3.8     .     .     .     .     .     .     .     .
feb    .     .   3.6   3.7   4.5     .     .     .     .     .     .     .
mar    .     .     .   3.6   4.6  41.8     .     .     .     .     .     .
apr    .     .     .     .   4.1  42.5  56.3     .     .     .     .     .
may    .     .     .     .     .  41.9  58.0  38.0     .     .     .     .
jun    .     .     .     .     .     .  45.0  34.3   5.2     .     .     .
jul    .     .     .     .     .     .     .  21.7   5.3   4.4     .     .
aug    .     .     .     .     .     .     .     .   4.6   3.8   4.9     .
sep    .     .     .     .     .     .     .     .     .   3.3   4.0   3.7
oct  3.9     .     .     .     .     .     .     .     .     .   3.6   4.0
nov  4.0   3.7     .     .     .     .     .     .     .     .     .   3.5
dec  3.6   3.5   4.1     .     .     .     .     .     .     .     .     .

Table reads: about 3% of students who are surveyed and are expected to be in high school in both January (row of the table) and February (column of the table) are in school in January but NOT in February. That is, the event drop out rate between January and February is about 3%.

Number of students in each month pair

Data source: Current Population Survey (CPS), 2000-2013 all months; age range 16-24; obtained from cps.ipums.org

      jan    feb    mar    apr    may    jun    jul    aug    sep    oct    nov    dec
jan     .  20157  12813   6211      .      .      .      .      .      .      .      .
feb     .      .  21640  13897   6513      .      .      .      .      .      .      .
mar     .      .      .  22419  14261   6705      .      .      .      .      .      .
apr     .      .      .      .  23065  14563   4210      .      .      .      .      .
may     .      .      .      .      .  22491   9220   4464      .      .      .      .
jun     .      .      .      .      .      .   9085   5652   2620      .      .      .
jul     .      .      .      .      .      .      .   7554   4705   2090      .      .
aug     .      .      .      .      .      .      .      .  11851   6893   3301      .
sep     .      .      .      .      .      .      .      .      .  16959  10807   5060
oct  5480      .      .      .      .      .      .      .      .      .  18909  12057
nov 11994   5212      .      .      .      .      .      .      .      .      .  19799
dec 19013  11126   5222      .      .      .      .      .      .      .      .      .

Table reads: in the CPS, 20157 students were either in school in both January (row of table) and February (column of table), and hence counted as school attendees, or in school in January (row) but NOT in February (column), and hence counted as dropouts.

This doesn't add up

Standard errors can be calculated as sqrt(p(1-p)/n), and they can get large--but standard errors aren't really the most obvious problem with this table. The statistics fail a simple adding-up condition: if there is a 3% dropout rate from Jan to Feb, 3.6% from Feb to Mar, and 3.6% from Mar to Apr, then the dropout rate from Jan to Apr should be pretty close to 10%, but in this data it is only 3.8%! The fact that this table so blatantly violates the simplest adding-up condition makes me very wary of this approach to determining monthly event dropout rates using the CPS.
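
The arithmetic behind that adding-up check, for the Jan-to-Apr example:

```python
# Chaining the three one-month rates from the table should roughly
# reproduce the three-month Jan-to-Apr rate, but it doesn't.
jan_feb, feb_mar, mar_apr = 0.030, 0.036, 0.036   # from the table above
implied_jan_apr = 1 - (1 - jan_feb) * (1 - feb_mar) * (1 - mar_apr)
print(f"implied Jan->Apr rate: {implied_jan_apr:.1%}")   # about 9.9%
print("observed Jan->Apr rate: 3.8%")                    # from the table
```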

But if we're willing to take the table seriously anyway, it looks like summertime dropouts are in the minority; the low event dropout rates from June or July to September suggest this. Depending on how you run the numbers, this approach suggests that summertime dropout makes up somewhere between 15% and 40% of total dropout.

I hope to find better ways to estimate these important stylized facts of dropout, either using the CPS or other data sources.

Friday, August 21, 2015

Are there a lot of veterans in IT?

About a week ago a friend remarked to me that a lot of the IT workers he knew were military veterans. We wondered first whether this was true--that is, if you're talking with an IT guy, is there a good chance you're talking with a military vet--and we also wondered what might cause this.

Easy question #1 can be answered with the March Current Population Survey (CPS). The March CPS is an annual survey of about 60,000 households, which, according to the data I'm looking at from IPUMS, leads to about 200,000 total respondents each year.

Using this data, I calculate, for each occupation category, the proportion of respondents who report themselves as military veterans. Overall, approximately 7.83% of the respondents used in these calculations are veterans.

These proportions are presented in the following collapsible bar chart. There are 23 occupation categories, each containing particular occupations (there are over 500 different occupations in these data). Click on an occupation category to toggle its subcategories.

To answer my friend's question, let's see... where would "an IT guy" be? Probably what we had in mind is in the "Installation, Maintenance, and Repair" category, maybe "Electronic ... equipment installers and repairers". Those occupations do hover around 20% veteran, suggesting that our thoughts might be somewhat accurate.

Poking around some more does suggest that being a military veteran changes occupational choices later in life in ways that are really quite intuitive given the kinds of human capital obtained while in service. This is interesting in light of media attention surrounding the purported lack of employability of military veterans, e.g. Fortune, CNN, Citizen-Times. Maybe some veterans are able to earn higher wages by exploiting their military-earned human capital.

Methods and data sources

The data was obtained from IPUMS. In this analysis, I restrict to the years 2003 to 2010, over which the Census occupation codes remained constant. I also restrict attention to people asked about their veteran status (which requires the person to be over 15 or 17, depending on the year). About 32% of the remaining sample does not have an occupation category, almost always because they are not in the labor force (unemployed new workers also lack an occupation category, but experienced unemployed workers do have one specified).
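
For reference, the calculation itself is just a grouped mean. A minimal unweighted sketch, with placeholder column names standing in for the IPUMS March CPS (ASEC) variables in the extract (a proper version would use the person weights):

```python
# Share of respondents reporting veteran status, by occupation category.
import pandas as pd

asec = pd.read_csv("asec_2003_2010.csv")        # hypothetical IPUMS extract
asec = asec[asec.occ_category.notna()]          # drop those with no occupation
asec = asec[asec.vet_status.isin([0, 1])]       # keep those asked about veteran status

vet_share = (asec.groupby("occ_category")["vet_status"]
                 .mean()
                 .sort_values(ascending=False))
print(vet_share.head(10))
```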

Friday, August 14, 2015

Hypothesis: in pro-voucher states, minority students are more likely to attend private schools

The evidence looks to be in favor of this hypothesis. In states that the Heritage Foundation says are highly pro-voucher, the proportion of private school students who are minorities is about 3-4 percentage points larger than expected. The difference is significant at the p=0.05 level.

In the average state, the proportion of private school students who are minority is about 12%, while the average state is about 28% minority across all K12 schools. So, 3-4 percentage points is a meaningful difference.

Thursday, August 13, 2015

Mapping voucher programs in the United States

In an earlier post I discussed the work of Saporito and Sohoni, who used the Common Core of Data to find that more white children attend private schools than black or Hispanic children, and moreover, when there is a private school nearby, the neighborhood public school has fewer-than-expected white students.

An insightful comment by Robert Kahn suggested that these differences might be smaller in places where there are voucher programs targeting the poor, funding private school attendance for these children.

Embarrassingly, I didn't know much about the current state of voucher programs in the US. So here we go.

Gerrymandered school attendance zones? The recent work of Richards and Stroub

Thinking about the possible determinants of the shape of school attendance zones, Richards and Stroub ask: are they drawn to increase or decrease racial segregation?

They call this "attendance zone gerrymandering."

Their findings are presented in two recent papers: Richards (2014) and Richards and Stroub (2015). Roughly, the answer is yes, attendance zones do seem to be gerrymandered, possibly to increase racial segregation.

Wednesday, August 5, 2015

Is switching schools bad? According to Burkam et al., the ECLS-K says: not really.

Imagine it’s October, and there’s a struggling first grader just getting to know his classmates, the teacher, which bus number is his, etc. Then his unemployed father gets a well-paid job in another state. They pack up and leave in a week.

Research tends to show (though this is not as consistent as you might think!) that when students switch schools, grades go down.

Monday, August 3, 2015

The relationship between tax rates and property values for NYS school districts

It is often said by sensible people that poor school districts impose higher school tax rates in order to obtain the same local revenues per pupil as rich districts. Let's see if this is true.

Thursday, July 30, 2015

Are school district boundaries the "dividing lines" in New York State?

The segregation of poverty in New York schools and districts

Are school district boundaries the dividing lines between the poor and nonpoor in New York State? Let’s check by using the NCES Common Core of Data.

Friday, July 24, 2015

How to cluster standard errors in 15 minutes


A brief note on clustered standard errors, so we can get back to business asap.

This blog post follows section 8.2 of Mostly Harmless Econometrics (abbreviated MHE). I highly recommend reading that section of MHE for more details.
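
If you just want the punchline in code, here is a minimal sketch of cluster-robust standard errors using statsmodels; the data file and variable names are made up, and clustering at the school level is purely for illustration.

```python
# OLS with cluster-robust standard errors (clusters = schools).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")    # hypothetical student-level data

fit = smf.ols("test_score ~ treatment + female", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]}
)
print(fit.summary())                # standard errors are clustered by school
```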

Wednesday, July 22, 2015

Studying Segregation using the CCD: the work of Saporito and Sohoni

  • The Common Core of Data is good for studying school segregation.
  • In large city school districts, there are relatively more whites in private schools than blacks or Hispanics.
    • So more government support for school choice might increase urban school segregation.
  • On the other hand, public magnet schools do seem to reduce segregation, contrary to conventional wisdom.

Saturday, July 18, 2015

The Common Core of Data: an exploratory introduction


 

Attrition in American high schools

The NCES' Common Core of Data (CCD) is a cornerstone of ed analysis. Among other things (see section 3), it contains school-level data on enrollments by grade.

We can use the CCD to, for example, understand where high school dropout is more or less serious. Across the continental US, dropout propensities look like this:

Using the 2009-10 through 2012-13 CCD: US | Northeast | Midwest | South | West
Using the 1999-00 through 2002-03 CCD: US | Northeast | Midwest | South | West
Using the 1989-90 through 1992-93 CCD: US | Northeast | Midwest | South | West
Key:
  • attrition greater than 18.5%
  • attrition between 6.7% and 18.5%
  • attrition less than 6.7%
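
For the curious, here is one rough way such an attrition measure could be computed from CCD grade-level enrollments--comparing a 9th-grade cohort to 12th-grade enrollment three years later. This is only an illustrative definition with placeholder column names; the maps above may use a different formula.

```python
# Sketch: cohort attrition = 1 - (12th-grade enrollment three years later
# / 9th-grade enrollment), computed school by school.
import pandas as pd

ccd = pd.read_csv("ccd_enrollments.csv")   # hypothetical: school_id, year, g9, g12

g9 = ccd[["school_id", "year", "g9"]].copy()
g12 = ccd[["school_id", "year", "g12"]].copy()
g12["year"] = g12["year"] - 3              # align 12th-grade counts with the 9th-grade cohort year

cohort = g9.merge(g12, on=["school_id", "year"])
cohort["attrition"] = 1 - cohort["g12"] / cohort["g9"]
print(cohort[["school_id", "year", "attrition"]].head())
```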