American crime data are famously spotty. Even national trends can change substantially as the FBI revises its estimates, which are based on numbers voluntarily submitted by police departments. But crime estimates at the county level, compiled by the ICPSR NACJD [1] based on the same underlying data, are known by crime nerds to be particularly awful, to the point that leading crime-data expert Jacob Kaplan says not to use them. The biggest problem is that missing data are “imputed” using substandard methods that have changed over time.
A couple of months ago, however, I dug into the situation to help out a colleague and discovered a few nuances. I figured I’d share them here for kicks. I wouldn’t say the data are great, but with appropriate caution they might work for analyses where one must compare crime trends with other variables collected at the county level.
For one thing, a 2022 study tried a spiffier method for imputing the data and concluded the new model “produces results that are comparable to the methods NACJD uses when preparing the county level UCR data.”
The authors do “recommend applying multiple imputation to the UCR rather than relying on the county level data published by NACJD” because this accounts “for the uncertainty in imputed values, so that correct standard errors can be obtained in subsequent analyses.” They or another team might consider publishing a full imputed dataset for everyone to use. But at any rate, the actual estimates seem similar either way, and one can use the methods described in the paper to create an improved dataset if desired.
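If you want a feel for the mechanics, here is a minimal sketch of the multiple-imputation-then-pool pattern the authors are recommending, not their exact model: impute several times, run the analysis on each completed dataset, and combine the results with Rubin’s rules so the standard errors reflect imputation uncertainty. Everything here, from the column names to the choice of scikit-learn’s IterativeImputer as the imputation engine, is an assumption for illustration.

```python
# Sketch of multiple imputation with Rubin's-rules pooling (not the
# paper's exact model). Column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import statsmodels.api as sm

def pooled_estimate(df: pd.DataFrame, y_col: str, x_col: str, m: int = 20):
    cols = list(df.columns)
    betas, variances = [], []
    for i in range(m):
        # sample_posterior=True draws imputations rather than plugging in
        # point predictions, which is what makes the between-imputation
        # variance meaningful
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
        fit = sm.OLS(completed[y_col], sm.add_constant(completed[x_col])).fit()
        betas.append(fit.params[x_col])
        variances.append(fit.bse[x_col] ** 2)
    qbar = np.mean(betas)                   # pooled point estimate
    within = np.mean(variances)             # average within-imputation variance
    between = np.var(betas, ddof=1)         # between-imputation variance
    total = within + (1 + 1 / m) * between  # Rubin's rules total variance
    return qbar, np.sqrt(total)             # estimate and corrected std. error
```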
But if we’re still not comfortable with imputation, here is another key detail about the data:
[For 1977 to 1993,] the data for any ORI [i.e., agency] reporting 12 months were used for county aggregation as submitted. Data for an ORI reporting six to 11 months were increased by a weight of [12/months reported]. Data for any ORI reporting less than 6 months were deleted from the county total, and the population served by that ORI was deleted from the county population total to help control for differential data quality across counties.
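In code, that rule reads roughly like the sketch below, assuming an agency-level table with made-up column names for months reported, crime counts, population served, and county FIPS code:

```python
# Sketch of the 1977-1993 county aggregation rule from the codebook.
# Column names (months_reported, crimes, population, fips) are hypothetical.
import pandas as pd

def aggregate_to_counties(agencies: pd.DataFrame) -> pd.DataFrame:
    df = agencies.copy()
    # Agencies reporting fewer than 6 months are dropped entirely,
    # along with the population they serve.
    df = df[df["months_reported"] >= 6]
    # Agencies reporting 6-11 months are scaled up by 12 / months;
    # full-year reporters pass through unchanged (weight = 1).
    weight = 12 / df["months_reported"]
    df["crimes"] = df["crimes"] * weight
    return df.groupby("fips", as_index=False)[["crimes", "population"]].sum()
```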
When I first read this in a codebook published in the 2000s, I wasn’t sure it was even accurate, because the codebook from the earliest dataset (1977-1983) doesn’t mention it, at least not very explicitly. [2] Nonetheless, when I compared the 1980 populations in the UCR with the county populations that come with CDC mortality data, I found that while they generally matched, in a fair number of cases the UCR numbers were lower—as you’d expect if some were intentionally reduced. This is clearest on a log scale since counties have such hugely varying populations:
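Here is roughly how that comparison can be reproduced, assuming two county-level tables keyed by FIPS code (the population column names are made up):

```python
# Sketch of the population cross-check, plotted on a log scale.
# Assumes DataFrames with a shared "fips" key; column names hypothetical.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_population_match(ucr: pd.DataFrame, cdc: pd.DataFrame) -> None:
    merged = ucr.merge(cdc, on="fips", suffixes=("_ucr", "_cdc"))
    plt.scatter(np.log10(merged["pop_cdc"]), np.log10(merged["pop_ucr"]), s=5)
    plt.axline((0, 0), slope=1, color="gray")  # points below the line: UCR lower
    plt.xlabel("log10 CDC county population")
    plt.ylabel("log10 UCR county population")
    plt.show()
```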
The later years of the UCR data are more explicit about this: they include a variable for the population covered by the agencies that actually reported crime, as well as a “coverage indicator” tracking how much of the county’s data are imputed (it runs from 0, entirely imputed, to 100, fully reported).
To be clear, this is probably not a great way to do things. Kaplan also says the data appear to conflate the “number of months reported” with the last month reported, so an agency that only reported for December would be thought to have reported all year.
But crucially, this system does allow you to strain out counties known to have bad coverage: Compare the populations in the UCR data with another source and yank out the mismatches, and/or rely on the coverage indicator in the later years. I found that for most years in the ’80s and ’90s, somewhere around 15% of counties have populations that differ from the CDC’s by more than a tenth.
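Concretely, the straining step might look like this sketch, continuing with the made-up column names from above. The 10-log-point threshold matches the “more than a tenth” rule, and the commented line shows the later-years alternative of filtering on the coverage indicator:

```python
# Sketch of the filtering step, with hypothetical column names.
import numpy as np
import pandas as pd

def drop_bad_coverage(merged: pd.DataFrame, threshold: float = 0.10) -> pd.DataFrame:
    # Flag counties whose UCR and CDC populations differ by more than
    # `threshold` log points (~10%), then drop them.
    log_ratio = np.log(merged["pop_ucr"] / merged["pop_cdc"])
    merged = merged.assign(mismatch=log_ratio.abs() > threshold)
    # For later years one could instead (or also) filter on the coverage
    # indicator, e.g. merged[merged["coverage_indicator"] >= 90].
    return merged[~merged["mismatch"]]
```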
If the imputation methods aren’t as bad as previously thought and you can strain out (at least a lot of) the heavily imputed counties, then maybe, just maybe, these data would work in a pinch, or at least can be used to double-check analyses done with, for example, the CDC’s homicide tallies, which have their own issues. [3]
To wrap up, some minor annoyances I noticed when exploring these questions (and see the update below for a concrete illustration of how removing the population mismatches improves the data):
The 1993 data are currently not available. Per a response to my question from ICPSR, “The 1993 county-level UCR data were made unavailable due to problems that were uncovered with the data. NACJD is currently working on re-releasing these data, but we do not yet have an estimated release date.”
Note that, hilariously, a crime count of zero might be an actual count of zero, but it might instead mean “no estimate”: the zero is only a missing-data code when the coverage indicator is also zero. I have no idea why they would use a value that can be real as a missing-data code that you need to cross-reference with another variable. (A sketch of the recoding appears after this list.)
New York City comprises five counties, and its homicides are just divided up among them by population, basically assuming that Manhattan and the Bronx have the same murder rate. The counts also include the 9/11 deaths. (Relatedly, if you crosswalk the Supplementary Homicide Report data to counties instead of relying on the UCR county dataset, all NYPD homicides end up attributed to Manhattan!)
Counties are pretty consistent over time, but you’ll want to look at the changes if you’re combining a lot of years, and decide which changes you can live with and which counties you need to drop or combine (one way to handle this is sketched after this list). This and this are helpful sources. Dade County becomes Miami-Dade and gets a new FIPS code; Broomfield County, Colo., is created out of a few other counties; Connecticut recently changed its whole system, etc. A bunch of Alaska boroughs with like five people in them keep getting their borders shifted or whatever.
It can be helpful to compare the UCR’s murder numbers with the CDC’s homicide counts when available. I find that there are some big discrepancies in the counts, but that eliminating the observations with large population mismatches as described above, as well as dropping New York, greatly controls the problem. Aggregating New York’s five counties might be another option.
The variable names change a lot in different years and the data are in ASCII format, so whatever you do, make sure you have fun!
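Here is the zero-recoding sketch promised above; murders and coverage_indicator are made-up column names:

```python
# Sketch of recoding ambiguous zeros: a zero crime count with a zero
# coverage indicator means "no estimate", not a real zero.
import numpy as np
import pandas as pd

def recode_missing_zeros(df: pd.DataFrame) -> pd.DataFrame:
    no_estimate = (df["murders"] == 0) & (df["coverage_indicator"] == 0)
    # Replace only the ambiguous zeros with NaN; real zeros survive.
    return df.assign(murders=df["murders"].mask(no_estimate, np.nan))
```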
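And here is the county-harmonization sketch: remap changed FIPS codes to one canonical code, collapse New York’s five counties into a single unit (which also sidesteps the population-split problem above), and re-aggregate. The remap shown covers only the Dade example; a real crosswalk would need the full set of changes you decide to live with.

```python
# Sketch of harmonizing county codes across years before combining.
# The remap is illustrative, not a complete crosswalk.
import pandas as pd

FIPS_REMAP = {
    "12025": "12086",  # Dade County -> Miami-Dade (new FIPS code)
}
# Bronx, Kings, New York, Queens, Richmond counties
NYC_FIPS = {"36005", "36047", "36061", "36081", "36085"}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    fips = df["fips"].replace(FIPS_REMAP)
    fips = fips.mask(fips.isin(NYC_FIPS), "NYC")  # treat NYC as one unit
    return (df.assign(fips=fips)
              .groupby(["fips", "year"], as_index=False)
              .sum(numeric_only=True))
```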
UPDATE: Here’s a good illustration of how trimming out the population mismatches improves the data. Using 1980 to 2003 (the years I happen to have handy), I plotted the CDC county homicide counts against the UCR county murder counts. Note that the CDC data are complete for most of the 1980s, but counts below 10 are censored after that. The dots are colored by the “mismatch” variable, which is true if the populations in the two datasets differ by more than 10 log points, as well as for the five counties of New York. (The distinct cluster of observations where murders run much higher in the UCR than in the CDC tally is Queens.) This variable nicely flags the observations where the death counts are pretty far off, though it’s hard to be precise at the extreme low end, where a lot of the CDC data are censored and a lot of the count differences are in the single digits.
A notable exception is St. Louis City (which is a county, but a different county from St. Louis County) in 1984. You can see it as a red dot toward the far left of the graph, hanging out above the rest of the non-mismatched data. UCR has only 11 murders on a population of 444,896, versus 131 and 430,019 for the CDC. Interestingly, the SHR data have 129 murders for that year, 11 of which were in December, so I wonder if that’s an example of the months-reported issue Kaplan flagged. Perhaps St. Louis submitted full data to the SHR but only December for the main UCR dataset, and the folks doing the county UCR assumed it had close to full data because the final month was reported?
Franklin County in Ohio, home to Columbus, is another odd one, responsible for the red dots falling below the rest of the pack on the left side of the chart. The UCR tends to record 100 murders or so, but the CDC data run lower and fluctuate like crazy, in two years (1980 and 1995) hitting as low as 13! Usually I assume the CDC is closer to the truth, but in this case the UCR numbers seem a lot more sensible.
[1] If you really need to know what all that stands for, Google it.
[2] It says agencies reporting fewer than six months were “excluded from the aggregation,” but it doesn’t, as far as I can tell, specifically note that the populations were changed in addition to the crime counts. I haven’t checked all the other pre-1994 codebooks.
[3] Starting in the late 1980s, the system refuses to provide death counts under 10, for privacy reasons. I think publishing at least some full county-by-year death counts, for key causes of death, would be a boon to public-health research. Homicides are not private in any reasonable sense of the word to begin with.