The Data Buddha: The Common Core of Data: an exploratory introduction

The CCD: an exploratory introduction

Attrition in American high schools

The NCES Common Core of Data (CCD) is a cornerstone of education analysis. Among other things (see section 3), it contains school-level data on enrollments by grade.

We can use the CCD to, for example, understand where high school dropout is more or less serious. Across the continental US, dropout propensities look like this:

Using the 2009-10 through 2012-13 CCD: US | Northeast | Midwest | South | West

Using the 1999-00 through 2002-03 CCD: US | Northeast | Midwest | South | West

Using the 1989-90 through 1992-93 CCD: US | Northeast | Midwest | South | West

Key
● red: attrition greater than 18.5%
● orange: attrition between 6.7% and 18.5%
● green: attrition less than 6.7%

In this picture, each dot is a high school. Red dots are high schools with a serious dropout problem, orange dots are high schools with a moderate dropout problem, and green dots are high schools without a significant dropout problem. Notice the clustering of red in urban areas. Click the links above the image to focus in on Census regions.

My approach to creating that map is simple. If you observe a high school that, year after year, has about 100 9th graders but only about 75 10th graders, something is probably amiss. This pattern is called "attrition"; insofar as schools do not experience attrition, they are said to have "holding power" (a term used by, for instance, the IDRA).

There are many ways to calculate attrition, but the simplest approach is the one taken here. For each school with nonzero 9th through 12th grade enrollment, I calculate the ratio of 11th and 12th grade enrollment to 9th and 10th grade enrollment. I then take 1 minus that. Formally:

attrition rate in school k = 

        students in 11th and 12th grade in school k
    1 - -------------------------------------------
        students in 9th and 10th grade in school k

This quantity will be 0 when there is no attrition, and will be larger the greater the attrition. It will of course have some serious outliers: some schools will be closing gradually (and thus have very small 9th and 10th grade classes, which produces large negative values), and some schools will be opening gradually (and thus have very small 11th and 12th grade classes, which produces values near 1). For this reason, it is best to focus on the quantiles of this variable, not the averages.
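To make the formula and the outlier problem concrete, here is a minimal Python sketch (not the Stata code used for this post). The five schools and their enrollments are made up for illustration, and the function name `attrition_rate` is my own.

```python
from statistics import mean, median

def attrition_rate(g9, g10, g11, g12):
    """Return 1 - (11th + 12th) / (9th + 10th), or None if any grade is empty."""
    if min(g9, g10, g11, g12) <= 0:
        return None  # mirrors the "nonzero 9th through 12th grade enrollment" filter
    return 1 - (g11 + g12) / (g9 + g10)

# Hypothetical enrollments (9th, 10th, 11th, 12th); not CCD data.
schools = {
    "A": (100, 100, 100, 100),  # no attrition
    "B": (100, 90, 80, 72),
    "C": (200, 180, 150, 130),
    "D": (10, 12, 90, 95),      # closing gradually: tiny lower grades
    "E": (120, 110, 5, 0),      # opening gradually: no 12th grade yet, so dropped
}

rates = {}
for name, grades in schools.items():
    rate = attrition_rate(*grades)
    if rate is not None:
        rates[name] = rate

print("median:", round(median(rates.values()), 3))
print("mean:  ", round(mean(rates.values()), 3))
```

In this toy example the single closing school drags the mean far below zero while the median stays at a plausible value, which is the reason for preferring quantiles.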

The median attrition rate of all high schools in the US (unweighted) over the academic years 2009-10 through 2012-13 is about 6.7%. That is, at the median school, about 6.7% of students seen in 9th and 10th grade are not in 11th and 12th grade. This statistic is comparable in magnitude to the national status dropout rate, which is approximately 7%. The 75th percentile of attrition is 18.5%.
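The map colors follow from these two cutoffs. A small Python sketch of the classification, assuming the 2009-10 through 2012-13 median and 75th percentile serve as thresholds; how a school exactly on a boundary is colored is my assumption, and `dot_color` is a hypothetical name.

```python
MEDIAN_CUTOFF = 0.067  # median attrition rate, 2009-10 through 2012-13
P75_CUTOFF = 0.185     # 75th percentile of attrition

def dot_color(rate):
    """Map a school's attrition rate to the color of its dot on the map."""
    if rate > P75_CUTOFF:
        return "red"      # serious dropout problem
    if rate >= MEDIAN_CUTOFF:
        return "orange"   # moderate dropout problem
    return "green"        # no significant dropout problem

for rate in (0.25, 0.10, 0.02):
    print(rate, dot_color(rate))
```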

Paging through the maps, you will find that the attrition rate has declined nationwide, although according to these data, Iowa and northern Missouri stand out as places where this rate has always been low. (You will also find that there seem to be fewer schools in the around-2000 maps; the reason for this is a mystery to me.) Also, schools in upstate New York and the Northeast have generally experienced a reduction in the attrition rate; schools in Ohio have not.

The mobility criticism

Note that these attrition rates are not subject to the main criticism of the IDRA measure: that states losing population will appear to have more attrition than states that are not. The IDRA measure compares this year's 9th grade with the 12th grade three years hence. If the future 12th grade is half the size of the current 9th grade, then there was 50% attrition. If there is out-migration, however, the 12th grade three years hence will be smaller simply because of demographic change. IDRA claims to use statistical methods to correct for this. The method I've presented here does not face this criticism, since it compares this year's upper grades (11th and 12th) with this year's lower grades (9th and 10th). Unless, for instance, families with students in the lower grades are significantly less likely to move away than families with students in the upper grades, this measure is not subject to migration bias.
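A toy simulation illustrates the argument. This Python sketch assumes nobody drops out and that 10% of families leave each year, uniformly across grades and entering cohorts; the numbers and function names are invented for illustration. For simplicity it compares only 9th and 12th grade.

```python
def enrollment(grade, year, base=1000.0, stay=0.9):
    """Enrollment with zero dropout and uniform out-migration (assumed model)."""
    entry_year = year - (grade - 9)          # year this cohort was in 9th grade
    entry_size = base * stay ** entry_year   # entering classes shrink as families leave
    return entry_size * stay ** (grade - 9)  # the cohort keeps shrinking after entry

# Cohort-style comparison: this year's 9th grade vs. the 12th grade three years later.
cohort_measure = 1 - enrollment(12, 3) / enrollment(9, 0)

# Same-year comparison: 12th grade vs. 9th grade in the same year.
same_year_measure = 1 - enrollment(12, 3) / enrollment(9, 3)

print(round(cohort_measure, 3))     # apparent attrition caused purely by migration
print(round(same_year_measure, 3))  # no apparent attrition
```

Under these assumptions the cohort comparison reports about 27% attrition even though no student dropped out, while the same-year comparison reports none. If migration rates differed by grade, the same-year measure would be biased too.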

Measuring dropout

There are at least four reasonable ways to measure dropout, and possibly up to six ways. Attrition, however, is an easy and somewhat robust way. I may discuss this further in the future.

Resilience to poverty by urbanicity: the case of attrition

Is rural child poverty more or less common than urban poverty? And how do towns fare when it comes to child poverty?

Using the CCD, we can calculate the proportion of students receiving free lunch in each locale type. A student who receives free government-subsidized lunch comes from a relatively poor family. Roughly (according to the law), it means a family of four living on at most $30,000 a year, or a family of two living on at most $20,000 a year. See the USDA eligibility guidelines for details. The precise empirical relationship between household income and this variable is interesting, given how commonly free lunch status is used to measure poverty in education research. Hereafter, I call the free lunch rate the "poverty rate." Using the 2012-13 CCD, I calculate this rate by urbanicity:

. mean povertyRate [aw=member2012], over(urbanicity_code)

Mean estimation                   Number of obs   =     14,740

        rural: urbanicity_code = rural
       suburb: urbanicity_code = suburb
         town: urbanicity_code = town
        urban: urbanicity_code = urban

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
povertyRate  |
       rural |   .3386269   .0028852      .3329715    .3442823
      suburb |   .3146449   .0033042      .3081682    .3211215
        town |   .3948882   .0037699      .3874987    .4022777
       urban |   .4634704   .0040932      .4554473    .4714936
--------------------------------------------------------------

Table reads: 33.9% of public school students attending rural 
    schools obtained free school lunch.

According to this table, a student from a rural school has about a 34% chance of being poor; a student from an urban school has about a 46% chance of being poor. These statistics surprised me, since I had thought rural schools had higher poverty rates than town schools.
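The table above weights each school's poverty rate by its enrollment, so the means describe students rather than schools. A minimal Python sketch of that calculation, using made-up schools (the rows and the function name are assumptions, not CCD values):

```python
from collections import defaultdict

def weighted_rate_by_group(rows):
    """Enrollment-weighted mean rate per locale: sum(w * x) / sum(w)."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for locale, enrollment, rate in rows:
        numerator[locale] += enrollment * rate
        denominator[locale] += enrollment
    return {locale: numerator[locale] / denominator[locale] for locale in denominator}

# Hypothetical (locale, enrollment, free-lunch rate) rows.
rows = [
    ("rural", 200, 0.50),
    ("rural", 800, 0.25),
    ("urban", 1000, 0.40),
    ("urban", 3000, 0.50),
]

for locale, rate in weighted_rate_by_group(rows).items():
    print(locale, round(rate, 3))
```

In the toy data the unweighted rural mean would be 0.375, but the larger school pulls the weighted mean down to 0.30, which is why the weighting matters for a statement about a student's chance of being poor.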

I hypothesized that rural and town schools would be more resilient to poverty than urban schools. From a cursory glance at New York State data, it seemed that rural and town schools are able to graduate their students in spite of poverty. Moreover, although town schools have poverty rates approaching those of urban schools, we do not regularly hear of a "dropout epidemic" among them. This might be because they enroll only a bit more than one-third as many students as urban schools (output demonstrating this, calculated using the CCD, is included below). Even so, I think it is worth checking whether rural and town schools tend to be more resilient to poverty.

. su rural suburb town urban [aw=member2012]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       rural |  14,954    12271788    .1711153   .3766223          0          1
    suburban |  14,954    12271788    .4147667   .4926982          0          1
        town |  14,954    12271788    .1149203   .3189364          0          1
       urban |  14,954    12271788    .2991976   .4579219          0          1

Table reads: 17% of public school students were enrolled in rural schools in
    2012-13, according to the CCD.

A natural way to test this hypothesis is a regression. Is the relationship between attrition and poverty different for different urbanicities? The relevant model is:

E[mean attrition rate|poverty rate, urbanicity] 
    = b0 + b1*povertyRate + b2*povertyRate*isRural
    + b3*povertyRate*isUrban + b4*povertyRate*isTown
    + b5*isRural + b6*isUrban + b7*isTown
    
where 
mean attrition rate is the average attrition rate for the school 
    over 2009-10 through 2012-13;
povertyRate is the proportion of students receiving free lunch;
isRural, isUrban, and isTown are dummy variables for the school's
    locale, with suburban schools as the omitted category.

Calling:

mean attrition rate = meanattrit

povertyRate*isRural = povRural, etc.

The results from fitting this model are as follows.



. reg meanattrit pov* rural urban town, robust

Linear regression                               Number of obs     =     10,394
                                                F(7, 10386)       =     130.34
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1042
                                                Root MSE          =     .15394

------------------------------------------------------------------------------
             |               Robust
  meanattrit |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 povertyRate |   .2213194   .0144947    15.27   0.000     .1929069    .2497319
    povRural |   -.115291   .0234441    -4.92   0.000     -.161246   -.0693361
    povUrban |  -.0930615   .0221047    -4.21   0.000     -.136391   -.0497321
     povTown |  -.1077527   .0274651    -3.92   0.000    -.1615896   -.0539158
       rural |   .0253524   .0093154     2.72   0.007     .0070925    .0436123
       urban |   .0871346    .011379     7.66   0.000     .0648296    .1094397
        town |   .0114077   .0121088     0.94   0.346     -.012328    .0351433
       _cons |   .0760162   .0054014    14.07   0.000     .0654285    .0866039
------------------------------------------------------------------------------

. test povRural=povUrban=povTown

 ( 1)  povRural - povUrban = 0
 ( 2)  povRural - povTown = 0

       F(  2, 10386) =    0.42
            Prob > F =    0.6601

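The data do not support my hypothesis, but they nevertheless reveal an interesting pattern. The poverty slopes for rural, urban, and town schools are statistically indistinguishable from one another (F = 0.42, p = 0.66), and each is significantly lower than the slope for suburban schools, the omitted category. In other words, attrition at suburban schools rises faster with the poverty rate than at schools in any other locale, so suburban schools look less resilient to poverty. I don't have a good explanation for this.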
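Because the model uses interactions, the slope of attrition on poverty for each locale is the povertyRate coefficient plus that locale's interaction coefficient. The Python sketch below does that arithmetic with the coefficients from the regression output; the helper names `slope` and `predicted` are my own.

```python
# Coefficients copied from the regression output above.
coef = {
    "povertyRate": 0.2213194,
    "povRural": -0.115291,
    "povUrban": -0.0930615,
    "povTown": -0.1077527,
    "rural": 0.0253524,
    "urban": 0.0871346,
    "town": 0.0114077,
    "_cons": 0.0760162,
}
INTERACTION = {"rural": "povRural", "urban": "povUrban", "town": "povTown"}

def slope(locale):
    """Change in attrition per unit change in poverty rate for a locale."""
    extra = coef[INTERACTION[locale]] if locale in INTERACTION else 0.0
    return coef["povertyRate"] + extra  # suburban is the omitted category

def predicted(locale, poverty_rate):
    """Fitted mean attrition rate for a locale at a given poverty rate."""
    intercept = coef["_cons"] + coef.get(locale, 0.0)
    return intercept + slope(locale) * poverty_rate

for locale in ("suburban", "rural", "urban", "town"):
    print(locale, round(slope(locale), 3))
```

The implied slopes are roughly 0.221 for suburban schools against 0.106 (rural), 0.128 (urban), and 0.114 (town), so a 10 percentage point rise in the poverty rate is associated with about 2.2 points more attrition in suburban schools and a little over 1 point elsewhere.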

Digging deeper, I produce a plot of means by urbanicity below.

Graph reads: urban schools with a free lunch rate of around 10% have about a 
25% attrition rate on average (yellow dot).

The plot shows a kind of hockey-stick shape in these empirical attrition rates: relatively low-poverty urban, town, and rural high schools tend to have elevated average attrition rates. Suburban schools do not exhibit this pattern, which explains why the slope of attrition on poverty is steeper for suburban schools than for the other school categories.

Getting and using the Common Core of Data

The CCD has basic school-, school district-, and state-level education data. It is composed of two parts: the nonfiscal CCD that contains enrollment and employment data, and the fiscal CCD that contains school district level data on fiscal variables.

Presently, the CCD is compiled by the US Department of Education from submissions by state education departments. As I understand it, the CCD surveys were historically sent directly to school principals.

CCD raw data files -- not recommended

Raw data files for the CCD can be found at the NCES CCD website; however, I would not recommend using this website if you plan on using the nonfiscal CCD. If you are reading this, you are likely a researcher in education or social science, and we typically use Stata or R. It is a nontrivial matter to read in the data and obtain a school-level panel in Stata or R using the code provided on the website, unless you plan to use only the data from 2007-08 onward, in which case the comma-delimited flat files are a piece of cake.

ELSI

Fortunately, NCES provides an alternative, the ELSI table generator, for accessing the nonfiscal CCD. This web program provides access to the full CCD. It is capable of providing up to 75 columns of data at a time, which is sufficient to obtain a few variables over all the CCD's years. Moreover, you can -merge- multiple ELSI-generated datasets.
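Combining several ELSI extracts amounts to a one-to-one join on the school ID. For readers working outside Stata, here is a minimal Python sketch of that idea; the column names (`school_id`, `g9_2012`, `free_lunch_2012`) and the values are hypothetical, and the sketch assumes each ID appears at most once per extract.

```python
def merge_on_id(left, right, key="school_id"):
    """Join two lists of row dicts on `key`, keeping IDs found in either list."""
    merged = {row[key]: dict(row) for row in left}
    for row in right:
        # Start a new record if the ID was missing from `left`, then add columns.
        merged.setdefault(row[key], {key: row[key]}).update(row)
    return [merged[school_id] for school_id in sorted(merged)]

# Two hypothetical extracts with different variables for overlapping schools.
enrollment = [
    {"school_id": "0001", "g9_2012": 100},
    {"school_id": "0002", "g9_2012": 250},
]
lunch = [
    {"school_id": "0002", "free_lunch_2012": 90},
    {"school_id": "0003", "free_lunch_2012": 40},
]

for row in merge_on_id(enrollment, lunch):
    print(row)
```

Schools missing from one extract keep only the columns they have, so it is worth checking how many IDs matched before analyzing the merged data.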

Quirks of ELSI

- I found that clicking on the tab headings (the ones with the > symbols) is the best way to navigate through this interface; in particular, using the button to navigate to the "Select Filters" section often caused the interface to freeze.
- You will want to remember to select the School and LEA IDs, as they are not selected by default.
- The “free lunch eligible students” number is available under enrollments > students in special programs.
- Once you get the data, a big chunk of your work to analyze it in Stata is to replace all of the invalid characters in the names ELSI provides so that you can rename the variables appropriately for analysis. (I had hoped I might be able to side-step this because there is a glossary, but the glossary doesn’t help much. It amounts to a codebook, so it could be worth looking at, although I think most of the variable names in the CCD are self-explanatory.) It is quite the headache to clean this data; fortunately for you, I’ve included my Stata do file at the end of this blog post, so you can explore my work, and in particular, you should be able to get ELSI CCD data quickly by downloading other series and modifying my code.
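The renaming step can be sketched in a few lines. The Python example below is not my do file; it only illustrates turning a column label into a legal Stata variable name (letters, digits, and underscores, not starting with a digit, at most 32 characters). The sample labels are hypothetical stand-ins for ELSI headers.

```python
import re

def stata_name(label, max_len=32):
    """Convert an arbitrary column label into a legal Stata variable name."""
    # Collapse every run of characters Stata does not allow into one underscore.
    name = re.sub(r"[^0-9A-Za-z_]+", "_", label).strip("_").lower()
    if not name or name[0].isdigit():
        name = "v" + name  # Stata names cannot start with a digit
    return name[:max_len]  # Stata names are limited to 32 characters

# Hypothetical labels in the style of an ELSI export.
for label in ("School ID - NCES Assigned", "2012-13 Total", "Free Lunch Eligible [Public School]"):
    print(stata_name(label))
```

Truncating to 32 characters can make two long labels collide, for example the same series in different years, so in practice you would shorten the stem or check the results for duplicates.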

More details on what the CCD contains

Nonfiscal CCD
- Consists of school level or school district level data on non-fiscal variables.
- Includes enrollments by grade, race/ethnicity, free/reduced-price lunch status; FTE teachers (school-level); and guidance counselors, instructional aides, and administrators (only at the district-level).
Fiscal CCD
- Consists of school district level data on fiscal variables.
- Includes school district revenues by source (including the famous local-state-federal division, but also how much revenue is from property taxes) and expenditures by various categories, including teacher pay, administration, transportation, and capital outlays (such as construction).
- Available from the CCD data download website. Not available (as far as I know) in the ELSI table generator.
- I plan to discuss this data in a later post.

Appendix: notes on maps & analysis

Data sources

- The CCD data was obtained from the ELSI interface. See section 3.
- To obtain the Census shapefiles, I consulted a Census website. The 2014 version of the shapefiles could not be used with shp2dta in Stata, so I used the 2010 Census shapefiles.
- To create maps by Census region, I used an Excel file containing state FIPS code compilations from the Census.

Stata do files & instructions

dropouts.do is the cleanest of my do files for this project. It takes the zip files directly from the ELSI website and turns them into a large Stata dataset. If you replicate this, you will have to replace the ELSI_csv_export_6357269214450906354461.csv (etc.) names with the names of your own downloads.

mapmaking.do and mapmaking_older.do are poorly organized do files that I used to put datasets together and save them in Stata 12 format so that I could use them on a larger computer.

createmaps.do, a file I used on a larger computer, uses the shapefile exports to construct a set of region-specific map files that are later used to create the dot maps you saw at the beginning of this page.

createmaps10.do, createmaps00.do, and createmaps90.do generate the maps using the output dta files from createmaps.do.

povertyattritanalysis.do uses a dataset obtained from mapmaking_older.do to run the analysis of attrition rates on poverty by urbanicity presented in section 2.

The Data Buddha

Saturday, July 18, 2015
