Survey duplicates and other stuff


So for various updates. First, I have started an AltAc newsletter, see the first email here. It will be examples of jobs, highlights of criminologists who are in the private sector, and random beginner tech advice (this week was getting started in NLP using simpletransformers; next week I will give a bit of SQL). Email me if you want to be added to the list (open to whomever).

Second, over on CRIME De-Coder I have a post on more common examples of NLP in crime analysis: named entity recognition, semantic similarity, and supervised learning. I have been on an NLP kick lately – everyone thinks genAI (e.g. ChatGPT) is all the rage, but most applications don't really need genAI; people just think genAI is cool.

Third, Andrew Gelman blogged about the Hogan/Kaplan back and forth. I do like Gelman’s hot takes, even if he is not super in-the-know on current flavors of different micro-econ identification strategies.

Fourth, I have pushed Gio's and my whitepaper on using every door direct mail and web based push surveys + MRP to CrimRxiv: Using Every Door Direct Mail Web Push Surveys and Multi-level modelling with Post Stratification to estimate Perceptions of Police at Small Geographies (Circo & Wheeler, 2023). This is our solution to measuring spatially varying attitudes towards police on a reasonable budget (the NIJ community perceptions challenge). Check it out and get in touch if you want to deploy something like that in your jurisdiction.

For a bit of background, we (me and Gio) intentionally did not submit to the other categories in the competition. I don't think you can reasonably measure micro place community attitudes using any of the other methods in the competition (with the exception of boots on the ground survey takers, which is cost prohibitive and a non-starter for most cities). So we could be like 'uSiNG TwiTTeR aNd SenTimeNT aNalYSis tO mEasUre HoW mUCh PeoPLe hAtE pOLIcE', but this is bad both from a measurement perspective (sentiment analyses are not good, even ignoring selection biases in those public social media sources) and from the perspective of tying the results to a particular spatial area. The difficulty of tying responses to a small spatial area also makes me very hesitant to suggest purely web based adverts to generate survey responses.

Analyzing Survey Duplicates

The last, and biggest, thing I wanted to share. Jake Day and Jon Brauer's blog, Reluctant Criminologists, has a series of posts on analyzing near survey duplicates (also with contributions by Maja Kotlaja). Apparently a group with SurveyMonkey/Princeton recommends that matches over 85% are cause for concern – so if you take two survey responses and they agree on over 85% of their answers, the SurveyMonkey advice treats that as symptomatic of fraud.
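To make the rule concrete, the match rate is just the share of items on which two responses give identical answers. A toy illustration (my own, not SurveyMonkey's code):

```python
import numpy as np

# two hypothetical 10-item survey responses
a = np.array([5, 4, 5, 3, 5, 4, 4, 5, 3, 5])
b = np.array([5, 4, 5, 3, 5, 4, 4, 5, 2, 5])  # differs on one item

match_rate = np.mean(a == b)  # share of identical answers
print(f"{match_rate:.0%}")    # 90%, over the 85% cutoff
```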

The theory is that a malicious actor (such as a survey taker not wanting to do the work) takes an existing response, duplicates it, and, to be sneaky, changes a few answers to make it look less suspicious (this is not about making up responses out of whole cloth).

The 85% rule is bad advice and will result in chasing noise. Many real world criminology surveys will have more random matches than that, so people will falsely assume legitimate surveys are fraudulent. Reluctant Criminologists do exploratory data analysis on one of their own surveys and explain why they don't think the responses with over 85% matches are likely to be fraudulent. (And sign up for their blog and check out their other work while you go read their post.)

So I give Python code showing how I would analyze survey duplicates, taking into account the nature of the baseline data: Survey Forensics: Identifying Near Duplicate Responses. I show in simulations how you can get more than 85% duplicate matches, even in random data, depending on the marginal distribution of the survey responses.
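As a rough illustration of that point (a minimal sketch, not the exact code from my post), here is a simulation where responses are fully independent but the Likert marginals are skewed toward agreeable answers:

```python
import numpy as np

rng = np.random.default_rng(10)
n_resp, n_items = 1000, 20
# heavily skewed Likert marginals: most people answer 4 or 5
probs = [0.01, 0.02, 0.07, 0.30, 0.60]
resp = rng.choice(5, size=(n_resp, n_items), p=probs) + 1

# pairwise match rates among fully independent (non-fraudulent) responses
over85, max_match = 0, 0.0
for i in range(n_resp - 1):
    rates = (resp[i + 1:] == resp[i]).mean(axis=1)
    over85 += int((rates > 0.85).sum())
    max_match = max(max_match, float(rates.max()))

print(f"pairs over 85% match: {over85}, max match: {max_match:.0%}")
# even with zero fraud, skewed marginals produce pairs above the cutoff
```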

I also show how to use statistics to identify outliers using false discovery rate corrections, and then cluster like responses together for easier analysis of the identified problem responses. Using data in Raleigh, I show that several groups of outlying near duplicates are people just giving run responses (all 5s, all 4s, or all missing). But I do identify 3 pairs that are somewhat suspicious.
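The general workflow looks roughly like the following sketch (my own simplification using scipy/statsmodels, not the exact code from the post – in particular the binomial null and the 0.15 clustering cut-off are assumptions for illustration):

```python
import numpy as np
from scipy.stats import binom
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from statsmodels.stats.multitest import multipletests

def flag_near_duplicates(resp, alpha=0.05):
    """Flag suspicious near-duplicate pairs with an FDR correction.

    resp: (n_respondents, n_items) integer array of answers.
    Simplification: each pair's match count is tested against a
    binomial null, with the chance-agreement rate estimated from
    the observed item marginals.
    """
    n, k = resp.shape
    # chance two independent respondents agree on one item, averaged
    # over items (sum of squared category proportions per item)
    p_item = [np.sum((np.unique(resp[:, j], return_counts=True)[1] / n) ** 2)
              for j in range(k)]
    p0 = float(np.mean(p_item))

    pairs, pvals, rates = [], [], []
    for i in range(n - 1):
        m = (resp[i + 1:] == resp[i]).sum(axis=1)  # matches vs later rows
        for off, mij in enumerate(m):
            pairs.append((i, i + 1 + off))
            rates.append(mij / k)
            pvals.append(binom.sf(mij - 1, k, p0))  # P(matches >= mij)

    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return [(pairs[t], rates[t]) for t in range(len(pairs)) if reject[t]]

def cluster_responses(resp, cut=0.15):
    """Group similar responses so flagged sets are easy to review."""
    dist = pdist(resp, metric="hamming")  # share of items that differ
    labels = fcluster(linkage(dist, "average"), t=cut, criterion="distance")
    return labels  # same label = near-duplicate group
```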

While this particular type of fraud is not so much a problem in web based surveys, it is not totally irrelevant – you can have people retake web based push surveys by accident. Urban talks about this in terms of analyzing IP addresses. With cell phones and WiFi, though, some duplicates could slip through those IP checks, so near-duplicate analysis still has a role even for web surveys. And for shared WiFi (universities, or people in the same home/apt taking the survey), IP addresses won't necessarily discriminate between different respondents.
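A reasonable first pass for web data (my own sketch, not Urban's procedure; the column names are hypothetical) is simply to group responses by IP address and flag repeats for manual review:

```python
import pandas as pd

# hypothetical response log with one row per submission
df = pd.DataFrame({
    "resp_id": [1, 2, 3, 4, 5],
    "ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"],
})

# flag IPs with multiple submissions; given shared WiFi these are
# candidates for review, not automatic evidence of fraud
df["repeat_ip"] = df.groupby("ip")["resp_id"].transform("size") > 1
print(df[df["repeat_ip"]])
```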
