source link: https://andrewpwheeler.com/2023/07/17/too-relaxed-naive-bayes-does-not-improve-recidivism-forecasting-in-the-nij-challenge/

Too relaxed? Naive Bayes does not improve recidivism forecasting in the NIJ challenge

So the paper Improving Recidivism Forecasting With a Relaxed Naïve Bayes Classifier (Lee et al., 2023), recently published in Crime & Delinquency, has incorrect results. Note I am not sandbagging the authors; I reviewed this paper for JQC and the Journal of Criminal Justice, so I have already given them this same feedback (multiple times!). The authors, however, did not correct their results; they just journal shopped and published the wrong findings.

I have replication code here to review. (Note I initially made a mistake in my replication: I reversed the conditional and calculated p(y|x) instead of p(x|y) by accident (see this older code I shared in my prior reviews), but I was still correct in my assertion that Lee's results were wrong.)
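To make the distinction concrete, here is a minimal sketch (toy data, not the paper's code or my replication code) of the two quantities: the class-conditional p(x|y) that naive Bayes actually needs, versus the p(y|x) I computed by accident at first.

```python
import pandas as pd

# toy data: one categorical predictor x and a binary outcome y (recidivate or not)
dat = pd.DataFrame({
    "x": ["A", "A", "B", "B", "B", "C", "C", "A"],
    "y": [1,   0,   1,   1,   0,   0,   0,   1],
})

# naive Bayes uses the class-conditional p(x|y): within each outcome class,
# how often does each category of x occur
p_x_given_y = dat.groupby("y")["x"].value_counts(normalize=True)

# what I originally computed by mistake, p(y|x): within each category of x,
# how often is the outcome equal to 1
p_y_given_x = dat.groupby("x")["y"].mean()

print(p_x_given_y)
print(p_y_given_x)
```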

So the main thing that made me go to this effort: the authors report unbelievable results. They report Brier scores for females (Round 1) of 0.104 and for males of 0.159 – these scores blow the competition out of the water. The leaderboard was 0.15 for females and 0.19 for males. Note how I don't list those to the third decimal – the differences between teams were so small you needed to go down that far to separate them. Lee also reports unbelievably low Brier scores for the alternative logit and random forest models – on their face the results are simply not believable.
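For scale, the Brier score is just the mean squared difference between the predicted probabilities and the observed 0/1 outcomes. A quick sketch (made-up numbers, not the NIJ base rates): a calibrated constant prediction at a 30% base rate comes out around 0.3*0.7 = 0.21 on average, which gives a sense of how far below that the leaderboard scores already were.

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return np.mean((p_pred - y_true) ** 2)

# toy example: a constant 0.3 prediction against a simulated 30% base rate
y = np.random.binomial(1, 0.3, size=1000)
print(brier_score(y, np.full(1000, 0.3)))  # around 0.21
```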

If the authors really believe their results, it kind of sucks for them that they did not participate in the NIJ challenge; they would have won more than $150,000! But I am pretty sure they are miscalculating their Brier scores somewhere. My replication code shows them in the same ballpark as everyone else, but they would not have made the leaderboard. Here are my estimates of what their Brier scores should be (the Brier column in the two tables below):

[Image: two tables of replicated results for the female and male models, with my estimated Brier scores in the Brier column]

Folks can go and look at their paper and their set of spreadsheets in the supplemental material – I have posted a little over 50 lines of (non-comment) Python code that replicates their regression model coefficients and shows their Brier scores are wrong. (And consequently any points Lee et al. (2023) make about fairness are wrong as well.)
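The basic pattern really is only a few lines. Here is a hedged sketch of what that replication looks like (hypothetical file and column names, not the actual NIJ variables or my exact script in the linked repo): fit the logit, compare coefficients to the paper, then score the holdout predictions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical file and column names, stand-ins for the NIJ challenge data
train = pd.read_csv("nij_train.csv")
test = pd.read_csv("nij_test.csv")

# fit the logit model and check the coefficients against those reported in the paper
mod = smf.logit("recid_year1 ~ C(age_cat) + prior_arrests + C(gang_affiliated)",
                data=train).fit()
print(mod.params)

# out-of-sample predicted probabilities and the resulting Brier score
p = mod.predict(test)
brier = ((p - test["recid_year1"]) ** 2).mean()
print(round(brier, 4))
```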

NIJ probably released papers from the challenge at some point, but if you want to see other folks' discussion, there is Circo & Wheeler (2022) (for mine and Gio's results for team MCHawks), and Mohler & Porter (2021) for team PASDA.

I may put it in the slate sometime to discuss naive Bayes (and other categorical encoding schemes). It is not a bad idea for data with many categories, but for this NIJ data there just isn't that much signal to squeeze out. So any future work is unlikely to dramatically improve upon the competition results (it is difficult to overfit this data). Again, given my analysis here, I am pretty sure a valid data analysis (no peeking) will at best "beat" the competition results in the third decimal place (if it can improve at all).
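As a concrete example of what I mean by treating naive Bayes as a categorical encoding scheme: each category of a high-cardinality feature can be mapped to a smoothed log ratio of its class-conditional frequencies, which is the per-feature contribution a naive Bayes model sums up. A minimal sketch (made-up data, not the NIJ variables):

```python
import numpy as np
import pandas as pd

def nb_log_ratio_encode(x, y, alpha=1.0):
    """Encode a categorical feature as log[p(x|y=1) / p(x|y=0)],
    with add-alpha smoothing so rare categories do not blow up."""
    df = pd.DataFrame({"x": x, "y": y})
    counts = pd.crosstab(df["x"], df["y"])  # rows: categories, columns: 0/1
    p1 = (counts[1] + alpha) / (counts[1].sum() + alpha * len(counts))
    p0 = (counts[0] + alpha) / (counts[0].sum() + alpha * len(counts))
    return np.log(p1 / p0)

# toy high-cardinality feature (think offense type) against a binary outcome
rng = np.random.default_rng(0)
x = rng.integers(0, 50, size=5000)    # 50 categories
y = rng.binomial(1, 0.3, size=5000)   # outcome unrelated to x here
print(nb_log_ratio_encode(x, y).head())
```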

Now, part of the authors' argument is that this method (relaxed naive Bayes) results in simpler interpretations. Typically people interpret "simple" models in terms of the end results, e.g. having a simple checklist of integer weights. The more I deal with predictive models though, the more I think this is maybe misguided. You could also interpret "simple" in terms of the code used to derive the weights (and to evaluate the final metrics). This matters when auditing code that others have written, as you will ultimately take that code and apply it to your own data.

I think this "simpler to estimate the same results" notion is probably more important for scientists and outside groups wanting to verify the integrity of any particular machine learning model than "simple end-result weights". Otherwise scientists can just make up results and say "my method is better." Which is simpler, I suppose, but misses the boat a bit in terms of why we want simple models to begin with.


