
Grabbing the NHAMCS emergency room data in python

source link: https://andrewpwheeler.com/2024/03/30/grabbing-the-nhamcs-emergency-room-data-in-python/


So Katelyn Jetelina on her blog, The Local Epidemiologist, had a blog post (with Heidi Moseson) on how several papers examining mifepristone-related emergency room (ER) visits were retracted. (I highly recommend Katelyn's blog; I really enjoy the mix of virology and empirical data discussions you can't get from other outlets.)

This reminded me that examining the CDC's NHAMCS (National Hospital Ambulatory Medical Care Survey) data was on my todo list. This is a sample of ER visit data collated by the CDC. I previously putzed with this data to illustrate predictive models for wait times, and I was interested in examining gun violence survival rates in this dataset.

I had the idea of checking out gun violence in this data after seeing Jeff Brantingham's paper showing gunshot survival rates in California have been decreasing, and ditto for Chicago via work by Jens Ludwig and Jacob Miller. It is not clear though whether this is a national pattern; Jeff Asher does not think so, for example. So I figured the NHAMCS would be a good way to check national survival rates, and maybe see if any metro areas were diverging over time.

Long story short, the NHAMCS data is waaay too small a sample to look at rare outcomes like gun violence. (So replicating the bad studies Katelyn mentions in her blog is probably not worth the effort; those outcomes will be similarly rare.) But the code is concise enough to share in a quick blog post for others if interested. Looking at the data the other day, I realized you can download SPSS/SAS/Stata files instead of the fixed width files from the CDC website. This is easier than my prior post, as you can read those formats into python directly without having to code all of the variable fields from the fixed width file.
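For reference, pandas has a reader for each of those formats (the file names below are just placeholders, not actual NHAMCS files):

import pandas as pd

# pandas can parse each of the survey formats directly
df_spss = pd.read_spss('ed2021.sav')        # SPSS, requires pyreadstat
df_stata = pd.read_stata('ed2021.dta')      # Stata
df_sas = pd.read_sas('ed2021.sas7bdat')     # SAS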

So for some upfront, the main library you need is pandas (with pyreadstat installed as well). The rest is just stuff that comes with python's standard library. The NHAMCS files are zipped SPSS files, so a bit more painful to download but not that much of an issue. (Unfortunately you cannot just read them in memory, like I did with Excel/csv here; I have to save the file to disk and then read it back.)

import pandas as pd
import zipfile
from io import BytesIO
import requests
from os import path, remove

# This downloads the zipped SPSS file and reads it into a dataframe
def get_spss(url,save_loc='.',convert_cat=False):
    ext = url[-3:]
    res = requests.get(url)
    if ext == 'zip':
        # grab the .sav file out of the zip archive
        zf = zipfile.ZipFile(BytesIO(res.content))
        spssf = zf.filelist[0].filename
        zz = zf.open(spssf)
        zs = zz.read()
    else:
        # not zipped, just the raw file bytes
        zs = res.content
        spssf = path.basename(url)
    # pandas needs a file on disk, so save it temporarily
    sl = path.join(save_loc,spssf)
    with open(sl, "wb") as sav:
        sav.write(zs)
    df = pd.read_spss(sl,convert_categoricals=convert_cat)
    remove(sl)
    return df

Now that we have our setup, we can just read in each year. Note the 2005 file name, it only has one 's' in the sps part!

# creating urls
base_url = 'https://ftp.cdc.gov/pub/health_statistics/nchs/dataset_documentation/NHAMCS/spss/'
files = ['ed02-spss.zip',
         'ed03-spss.zip',
         'ed04-spss.zip',
         'ed05-sps.zip',
         'ed06-spss.zip',
         'ed07-spss.zip',
         'ed08-spss.zip',
         'ed09-spss.zip',
         'ed2010-spss.zip',
         'ed2011-spss.zip',
         'ed2012-spss.zip',
         'ed2013-spss.zip',
         'ed2014-spss.zip',
         'ed2015-spss.zip',
         'ed2016-spss.zip',
         'ed2017-spss.zip',
         'ed2018-spss.zip',
         'ed2019-spss.zip',
         'ed2020-spss.zip',
         'ed2021-spss.zip']
urls = [base_url + f for f in files]

def get_data():
    # download each year and uppercase the column names
    res_data = []
    for u in urls:
        res_data.append(get_spss(u))
    for r in res_data:
        r.columns = [v.upper() for v in list(r)]
    # collect the union of variables across all years
    vars = []
    for d in res_data:
        vars += list(d)
    vars = list(set(vars))
    vars.sort()
    vars = pd.DataFrame(vars,columns=['V'])
    # flag which variables are available in which year
    for i,d in enumerate(res_data):
        year = i + 2002  # first file in the list is 2002
        vars[str(year)] = 1*vars['V'].isin(list(d))
    return res_data, vars

rd, va = get_data()
all_data = pd.concat(rd,axis=0,ignore_index=True)

Note that the same directory with the zipped up sav files also has .sps files, so you can see how the numeric variables are encoded. Or pass the argument convert_cat=True to the get_spss function and it will turn the coded values into strings based on the labels.
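For example, reading a single year both ways (SEX here is just a variable I would expect in the file, double check the codebook):

# read 2021 once with codes and once with labels applied
ed21_num = get_spss(base_url + 'ed2021-spss.zip')
ed21_lab = get_spss(base_url + 'ed2021-spss.zip', convert_cat=True)

# compare the coded vs labeled values, SEX is an assumed variable name
print(ed21_num['SEX'].value_counts())
print(ed21_lab['SEX'].value_counts())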

You can check out which variables are available for which years via the va dataframe. They are quite consistent though. The bigger pain is that for older years, we have ICD9 codes, and for more recent years we have ICD10. So it takes a bit of work to normalize between the two (for ICD10, just looking at the first 3 is ok, for ICD9 you need to look at all 5 though). It is similar to NIBRS crime data, a single event can have different codes associated with it, so you need to look across all of them to identify whether any of the codes are associated with gun assaults.
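For example, to confirm the cause of injury fields used below show up in every year's file:

# 1 means the variable is present in that year's file
va[va['V'].isin(['CAUSE1','CAUSE2','CAUSE3'])]

# or list any variables that are missing from at least one year
yr_cols = va.columns[1:]
va[va[yr_cols].sum(axis=1) < len(yr_cols)]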

# Assaulting Gun Violence for ICD9/ICD10
# ICD9, https://www.aapc.com/codes/icd9-codes-range/165/
# ICD9, https://www.cdc.gov/nchs/injury/ice/amsterdam1998/amsterdam1998_guncodes.htm
# ICD10, https://www.icd10data.com/ICD10CM/Codes/V00-Y99/X92-Y09
gv = {'Handgun': ['X93','9650'],
      'Longgun': ['X94','9651','9652','9653'],
      'Othergun': ['X95','9654']}

any_gtype = gv['Handgun'] + gv['Longgun'] + gv['Othergun']
gv['Anygun'] = any_gtype

fields = ['CAUSE1','CAUSE2','CAUSE3']

all_data['Handgun'] = 0
all_data['Longgun'] = 0
all_data['Othergun'] = 0
all_data['Anygun'] = 0


# flag if either the full code (ICD9) or the first 3 characters (ICD10) match
for f in fields:
    for gt, gl in gv.items():
        all_data[gt] = all_data[gt] + 1*all_data[f].isin(gl) + 1*all_data[f].str[:3].isin(gl)

# cap at 1 so multiple matching cause fields do not double count a visit
for gt in gv.keys():
    all_data[gt] = all_data[gt].clip(0,1)

There are between 10k and 40k rows in each year, but overall there are very few observations of assaultive gun violence. So even with over 500k rows across the 19 years, there are fewer than 200 incidents of people going to the ER because of a gun assault.
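To see the yearly sample sizes directly:

# number of survey rows per year
all_data['YEAR'].value_counts().sort_index()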

# Not very many, only a handful each
all_data[gv.keys()].sum(axis=0)

# This produces
# Handgun      20
# Longgun      11
# Othergun    139
# Anygun      170

These are far too few in number to estimate changes in survival rates over time. So the Brantingham or Ludwig analyses that look at larger registers of healthcare claims, or work using reported crime incident data, are likely to be much more reasonable ways to estimate those trends. If you do a groupby per year this becomes even more stark:

# Per year it is quite tiny
all_data.groupby('YEAR')[list(gv.keys())].sum()

#         Handgun  Longgun  Othergun  Anygun
# YEAR
# 2002        1        0        12      13
# 2003        5        2         4      11
# 2004        1        3        12      16
# 2005        2        1         7      10
# 2007        1        0        14      15
# 2008        2        2        10      14
# 2009        0        0        12      12
# 2010        1        0        10      11
# 2011        2        1         9      12
# 2012        1        0         6       7
# 2013        0        0         0       0
# 2014        1        0         2       3
# 2015        0        0         6       6
# 2016        0        0         5       5
# 2017        0        0         0       0
# 2018        0        1         4       5
# 2019        0        0         6       6
# 2020        0        0         9       9
# 2021        1        0         4       5

While using weights you can get national level estimates, the standard errors are based on the observed number of cases. Which in retrospect I should have realized: gun violence is pretty rare, so rates in the 1 to 100 per 100,000 range would be the usual. If anything these counts are maybe a tinge higher than I would have guessed (likely due to how the CDC does the sampling).
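If you do want the weighted national counts, it is just a weighted sum. A minimal sketch, assuming the patient visit weight column is named PATWT (verify against the .sps codebook):

# weighted yearly estimates, PATWT is my assumption for the visit weight field
wgt_counts = all_data[list(gv.keys())].multiply(all_data['PATWT'], axis=0)
wgt_counts['YEAR'] = all_data['YEAR']
wgt_counts.groupby('YEAR').sum()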

To be able to do the analysis of survival rates I want, the sample sizes here would need to be 100 times larger than they are. That would require something more akin to NIBRS, with hospitals reporting directly, instead of having the CDC do boots on the ground samples. Which is feasible of course (no harder for medical providers to do this than for police departments); see SPARCS with New York data for example.

But perhaps others can find this data useful. It may be easier to analyze events that occur at rates closer to 1 in 100. The data has quite a few variables, like readmission due to other events, public/private insurance, different drugs, and then of course all the stuff that is recorded via ICD10 codes (which covers behavioral as well as health events). It is probably not a large enough sample to analyze other criminal justice related health care incidents either, but those do add up to big victim costs to the state that are easy to quantify, as the Medicaid population is a large chunk of that.
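If you want to repurpose the flagging logic above for a different (and hopefully more common) set of events, it is easy to wrap into a small helper; the fields and code lists are whatever you care to pass in:

# flag rows where any of the given fields match any of the given ICD codes
# matches either the full code or the first three characters (ICD10 style)
def flag_codes(df, fields, codes):
    flag = pd.Series(0, index=df.index)
    for f in fields:
        flag += 1*df[f].isin(codes) + 1*df[f].str[:3].isin(codes)
    return flag.clip(0, 1)

# e.g. this reproduces the Anygun flag from above
all_data['Anygun2'] = flag_codes(all_data, fields, any_gtype)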

