Ask HN: Should I publish my research code?
source link: https://news.ycombinator.com/item?id=29934192
308 points by jarenmf | 277 comments

I'm looking for advice on whether I should publish my research code. The paper itself is enough to reproduce all the results. However, the implementation can easily take two months of work to get it right.
In my field many scientists tend to not publish the code nor the data. They would mostly write a note that code and data are available upon request.
I can see the pros of publishing the code as it's obviously better for open science and it makes the manuscript more solid and easier for anyone trying to replicate the work.
But on the other hand it's substantially more work to clean and organize the code for publishing, it will increase the surface for nitpicking and criticism (e.g. coding style, etc). Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.
Matt Might has a solution for this that I love: Don't clean & organize! Release it under the CRAPL, making explicit what everyone understands, viz.:
"Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"
What do you know, it turns out the professional software developers I work with are actually scientists and academics!!
It's not about computers, and "you don't put 'science' in your name if you're a real science".
He prefers the name informatics.
Having flashbacks to when a close friend was getting an MS in Political Science, and spent the first semester in a class devoted to whether or not political science is a science.
This also led to strange naming: now we have data science as well, which is called "Data videnskab", just a literal translation of the English term.
: https://dl.acm.org/doi/10.1145/365719.366510 (sadly behind a wall)
I believe they are referring to what the degree currently known as Computer Science should be called.
The average software developer doesn't even know much math.
Right now, "software engineer" basically means "has a computer, -perhaps- knows a little bit about what goes on under the hood".
There isn't a standard license that shows that someone is proficient in security, or accessibility, or even in how computer hardware or networking work at a basic level.
So all we're doing is diluting the term "engineer" until it doesn't mean anything.
The only thing the term "software engineer" practically means is: they have a computer. It's meaningless, just a vanity title meant to sound better than "developer".
I am not a lawyer, just a hardcore open-source advocate in a former life.
It is explicitly the point of the license that the code is not for those purposes, because it's shitty code that should not be reused in any real code base.
More often than not, they are more than willing to help.
Indeed, there are also things like "By reading this sentence, You have agreed to the terms and conditions of this License." That can't hold up in court! How can I know in advance what the rest of the conditions say before agreeing to them?
Then again, I am not a lawyer either.
Certainly don't clean it up unless you're going to repeat the experiment with the cleaned up code.
And, yeah. I've found some significant mistakes in research code -- authors have always been grateful that I saved them from the public embarrassment.
> I'm not a lawyer, so I doubt I've written the CRAPL in such a way that it would hold up well in court.
Please do release your code, but please use a standard open source license. As for which one, look to what your peers use.
Supplying bad code is a lot more valuable than supplying no code.
Also in my experience, reviewers won't actually review your code, even though they like it a lot when you supply it.
The only thing I would add is a description of the build environment and an example of how to use it.
4) You recognize that any request for support for the Program will be discarded with extreme prejudice.
I think that should be a "may" rather than a "will." If I find out someone is using my obscure academic code, and they ask for help, I'd be pretty pumped to help them (on easy requests at least).
> 4) You recognize that any request for support for the Program will be discarded with extreme prejudice.
There is no way I'd even make a request for support.
If only to help people who simply can't read academic papers because it's not a language their brain is wired to parse, but who can read code and will understand your ideas via that medium.
[EDIT]: to go further, I have - more than once - run research code under a debugger (or printf-instrumented it, whichever) to finally be able to get an actual intuitive grasp of the idea presented in the paper.
Once you stare at the actual variables while the code runs, my experience is it speaks to you in a way no equation ever will.
Using this license would actually make me suspect that your results aren't even valid and I don't trust many experiments that don't release source code.
Why does he think that but presumably not the same about the paper itself and the “equations”, plots, etc. contained within?
It’s really not that hard to write pretty good code for prototypes. In fact, I can only assume that he and other professors never allowed or encouraged “proof of concept” code to be submitted as course homework or projects.
Moreover, you are not being paid for writing reasonable programs; you're paid for doing science. Nobody would submit "prototype" papers, because papers are the currency of academic work. A lot of time is spent polishing a paper before submission, but doing that for code is generally not appreciated because nobody will see it on your CV.
If code is a large part of your scientific work, then it's just as important as someone who does optics keeping their optics table organized and designed well. If one is embarrassed by that, then too bad. Embarrassment is how we learn.
Lastly, you're describing problems with the academic world as if they are excuses. They're reasons but most people know the academic world is not perfect, especially with what is prioritized and essentially gamified.
If you want to set expectations, this can simply be done in a README. Putting this in a license makes no sense. Copyright licenses grant exceptions to copyright law. If you're adding something else to it, you're muddying the water, not making it better.
Use standard and well understood licenses e.g. GPL for code and CC for documentation. The world does not need more license fragmentation.
Moreover, sure, lots of the license is text that isn't common in legal documents, but there's no rule that says legal text can't be quirky, funny or superfluous. It's just most practical to keep it dry.
You are permitted to use the Program to validate scientific claims submitted for peer review, under the condition that You keep modifications to the Program confidential until those claims have been published.
In this particular case, however, there's very little risk of actual law suits happening. There is some, but the real goal of the license is not to protect anyone's ass in court (except for the obvious "no warranty" part at the end), but to clearly communicate intent. Don't forget that this is something GPL and MIT also do besides their obvious "will likely stand up in court" qualities. In fact I think that communicating intent is the key goal of GPL and MIT, and also the key goal of CRAPL.
From this perspective, IMO the only problem in this license is:
> By reading this sentence, You have agreed to the terms and conditions of this License.
This line makes me sad because it makes a mockery of what's otherwise a pretty decent piece of communication. Obviously nobody can agree to anything just by reading a sentence in it. It should say that by using the source code in any way, you must agree to the license.
Again, this is not how a license works. You can express your intents, ideas and desires in a README file and in many other ways.
The license is nothing more than a contract that provides rights to the recipient under certain conditions. Standing up in court is its real power and only purpose.
That's why we should prefer licenses that have stood up in court and were written by lawyers rather than by developers or scientists.
Incidentally, I've seen people violate the Java "no realtime" clause.
Plus, the usual "no warranty" is strong enough to protect the authors anyways.
You must be fun at parties :)
Benefits: people who want to reproduce your analysis can use exactly the right software, and people who want to build on your work can find the latest in your repo. Either way, they know how to cite your work correctly.
In practice drive-by nitpicking over coding style is not that common, particularly in (some) science fields where the other coders are all other scientists who don’t have strong views on it. Nitpicks can be easily ignored anyway.
BTW should you choose to publish, the Turing Way has a section on software licenses written for researchers: https://the-turing-way.netlify.app/reproducible-research/lic...
No, this is almost never the case. It should be. But it cannot really be. There are always more details in the code than in the paper.
Note that even the code itself might not be enough to reproduce the results. Many other things can matter, like the environment, software or library versions, the hardware, etc. Ideally you should also publish log files with all such information so people could try to use at least the same software and library versions.
And random seeds. Make sure this part is at least deterministic by specifying the seed explicitly (and make sure you have that in your log as well).
Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).
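For what it's worth, a minimal sketch of what pinning and logging the seed (together with the versions mentioned above) could look like in Python; the seed value, file name, and libraries here are illustrative, not anything prescribed in the thread:

```python
# Illustrative sketch only: pin the random seed and record it, together with
# library versions, so a run can at least be retried under the same setup.
import json
import platform
import random
import sys

import numpy as np

SEED = 12345
random.seed(SEED)
np.random.seed(SEED)
# If you use a deep learning framework, seed it as well, e.g. with PyTorch:
#   torch.manual_seed(SEED); torch.use_deterministic_algorithms(True)

run_info = {
    "seed": SEED,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
}
with open("run_info.json", "w") as f:
    json.dump(run_info, f, indent=2)
```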
> In my field many scientists tend to not publish the code nor the data.
This is bad. But it should not be a reason for you to follow the same practice.
> clean and organize the code for publishing
This does not make sense. You should publish exactly the code as you used it. Not a restructured or cleaned up version. It should not be changed in any way. Otherwise you would also need to redo all your experiments to verify it is still doing the same.
Ok, if you did that as well, then ok. But this extra effort is really not needed. Sure it is nicer for others, but your hacky and crappy code is still infinitely better than no code at all.
> it will increase the surface for nitpicking and criticism
If there is no code at all, this is a much bigger criticism.
> publishing the code will be removing the competitive advantage
This is a strange take. Science is not about competing against other scientists. Science is about working together with other scientists to advance the state of the art. You should do everything to accelerate the process of advancement, not try to slow it down. If such behavior is common in your field of work, I would seriously consider changing fields.
Ideally, if your code has a random component (MCMC, bootstrapping, etc), your results should hold up across many random seeds and runs. I don’t care about reproducing the exact same figure you had, I want to reproduce your conclusions.
In a sense, when a laboratory experiment gets reproduced, you start off with a different “random state” (equipment, environment, experimenter - all these introduce random variance). We still expect the conclusions to reproduce. We should expect the same from “computational studies”.
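As a rough sketch of that expectation in code, assuming a Python workflow, where run_experiment is a placeholder for whatever analysis a paper actually performs:

```python
# Illustrative sketch: a conclusion should hold across seeds, not just for the
# single seed used to make the paper's figures. run_experiment is a placeholder.
import numpy as np

def run_experiment(seed: int) -> float:
    rng = np.random.default_rng(seed)
    # ... the real analysis would go here; return the metric of interest
    return float(rng.normal(loc=0.8, scale=0.05))

scores = [run_experiment(seed) for seed in range(10)]
print(f"metric over 10 seeds: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```

If the conclusion only survives for one particular seed, it probably wasn't much of a conclusion.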
> Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).
Publishing the weights of a trained model allows verification (and reuse) of results even before going to the effort of reproducing it. This is especially useful when training the model is prohibitively expensive.
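As a hedged sketch of what that verification could look like with PyTorch (the architecture, weight file name, and test data below are placeholders, not anything from the thread):

```python
# Illustrative sketch: evaluate published weights without retraining.
# The architecture, weight file name, and test data are all placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 2))   # stand-in for the paper's architecture
state = torch.load("published_weights.pt", map_location="cpu")  # hypothetical artifact
model.load_state_dict(state)
model.eval()

x = torch.randn(100, 20)                  # stand-in for the published test split
y = torch.randint(0, 2, (100,))
with torch.no_grad():
    pred = model(x).argmax(dim=1)
print(f"accuracy: {(pred == y).float().mean().item():.4f}")
```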
* You increase the impact of your work and as a consequence also might get more citations.
* It's the right thing to do for open and reproducible research.
* You can get feedback and improve the method.
* You are still the expert on your own code. It's unlikely that someone will pick it up, implement an idea you also had, and publish it before you.
* I never got comments like "you could organize the code better" and don't think researchers would tend to do this.
* Via the code you can get connected to groups you haven't worked with yet.
* It's great for your CV. Companies love applicants with open-source code.
Everybody here talks about how publishing code helps (or even makes possible) reproducibility, but this is not true, on the contrary, it hinders it. Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper. This is trivial and good for nothing. Reproduction is other researchers independently reimplementing the code using only the published theory, and getting the same results. If the author publishes the code, no one will bother with this, and this is bad for science.
 Rougier et al., "Sustainable computational science: the ReScience initiative" https://arxiv.org/abs/1707.04393
 Plesser, "Reproducibility vs. Replicability: A Brief History of a Confused Terminology" https://doi.org/10.3389/fninf.2017.00076
Second, publishing code helps make invisible decisions visible in a far better manner than the paper text does. Try as we might to imagine that every single researcher degree of freedom is baked into the text, it isn't and it never has been.
Third, errors do occur. They occur when running author code (stochasticity in models being inadequately pinned down, software version pinning, operating system -- I had a replication error stemming from round-towards-even behaviour implementation varying across platforms). If you have access to the code, then it's far easier to determine the source of the error. If the authors made a mistake cleaning data, having their code makes it easier to reproduce their results using their exact decisions net of the mistake you fix.
Most papers don't get replicated or reproduced. Making code available makes it more likely that, at a minimum, grad students will mechanically reproduce the result and try to play around with design parameters. That's a win.
Source: Extensive personal work in open and transparent science, including in replication; have written software to make researcher choices more transparent and help enable easier replication; published a meta-analysis of several dozen papers that used both reproducing author results from author code, producing author results with code reimplementation, and producing variant results -- each step was needed to ensure we were doing things right; a close friend of mine branched off into doing replications professionally for World Bank and associated entities and so we swap war stories; always make replication code available for my own work.
Why trust results if you can't see the methodology in detail and apply the approach to your own data? I once knew somebody who built a fuzz tester for a compilers project, got ahold of a previous project's compiler code that won best paper awards, and discovered a bunch of implementation bugs that completely invalidated the results.
Why is the peer review process limited to a handful of people who probably don't have access to the code and data? If your work is on Github, anybody can come along and peer review it, probably in much more detail. And as a researcher, you don't get just one chance to respond to their feedback -- you can actually start a dialogue, which other people are free to join in.
As long as a project's README makes any sort of quality / maintenance expectations clear upfront, why not publish your code?
This is my experience, too, and in my opinion this is exactly what has to change for really reproducible research, not ready-to-run software supplied by the author.
There are many good arguments in support of publishing code, but reproducibility is not one of them, that's all I'm saying.
And to the original author's credit, when I sent him a draft of my paper and code, he loved how such a simple approach outperformed his. I always felt that was the spirit of collaboration in science. If he hadn't supplied his code, I really would never have known how they performed unless I also fully implemented the other solution -- which really wasn't the point of the research at all.
I agree with this statement, however I think you may have a misunderstanding about reproducing results. It's not that you can reproduce their graphs from their dataset, but rather seeing whether their code reproduces the results on your (new) dataset.
Another way to think of it is that the research paper's Methodology section describes how to set up a laboratory environment to replicate results. By extension, the laboratory for coding research IS the code. Thus, by releasing the code along with your paper, you are effectively saying "here is a direct copy of my laboratory for you to conduct your replication in".
> Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper.
Sure, to some extent. But the code does provide a baseline, a sanity check. People who are trying to reproduce results should (as I describe above) go through both the paper and the code with a fine tooth comb. The provided code should be considered a place to start. I'll often re-implement the algorithm in a different language using the paper and the provided code as a guide.
While I strongly support sharing the code, I am not sure if this is a great reason to do so. Companies are made up of many individuals, and while some might appreciate what it takes to open source code, other individuals might judge the code without full context and think it is sloppy. My suggestion is that you fully explain the context before sharing code with companies.
I think this is the most important reason to do it. Research code is not meant to be perfect as another op said, but it can be instrumental in helping others, including non-academics, understand your research.
I think the sooner it's released the better (assuming you've published and you're not needing to protect any IP.) There's some great advice here: https://the-turing-way.netlify.app/reproducible-research/rep...
My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.
You'll open yourself up for comments. They may be positive or negative. You'll only know how it pans out afterwards.
Is the code something that you'll want to improve on for further research? If so, publish it on GitHub. It opens the way for others to contribute and improve the code. Be sure to include a short readme saying that you welcome PRs for code cleanup, etc. That way you can turn comments criticizing your code into a request for collaboration. It really separates helpful people from drive-by commenters.
Worth mentioning specifically: If you make a git (et al) repository public, make sure there are no passwords or secret keys in the history of the repository either. Cleaning a repository history can be tricky, so if this is an issue, best to just publish a snapshot of the latest code (or make 100% sure you've invalidated all the credentials).
For my 2 cents I'd prefer to see sloppy code vs no code.
If you did something wrong, you did it wrong. Hopefully someone would put in a PR to fix it
Also personal data of any human subjects.
On the other hand, if your goal is only to advance your own career and you want to inhibit others from operating in this space any more than necessary to publish (diminish your “competitive advantage”) then I guess you wouldn’t want to publish.
All the groundbreaking papers in deep learning in the last decade had code published. So if you're aiming for thousands of citations, you need code.
I am in this field and I would say less than 10% of the top papers have code published by the author, and those are most of the time another 0.1% improvement on ImageNet. All the libraries that you generally use are likely to be recreated by others in this field. A lot of the most interesting work's code never comes out, like AlphaZero/MuZero, GPT-3, etc.
Personally, it is a pet peeve I have about my field. But there is no incentive for a new researcher to publish code as it decreases barriers to entry. As much as it's nice to say that researching in academia is about progressing science, as a researcher, you are your own startup trying to make it (i.e., get tenure).
I personally look at any paper without code with great suspicion. The reviewers certainly did not try to reproduce your results, and I have no guarantee that a paper without code has enough information for me to reproduce.
I always go for the papers with code provided.
I've always published my research code. Thanks to that, one of the tools I wrote during my PhD has been re-used by other researchers and we ended up writing a paper together! In my field it was quite a nice achievement to have a published paper without my advisor as a co-author even before my PhD defense (and it most likely counted a lot for me to get a tenured position shortly after).
The tool in question was finja, an automatic verifier/prover in OCaml for counter-measures against fault-injection attacks on asymmetric cryptosystems: https://pablo.rauzy.name/sensi/finja.html
My two most recently published papers also come with published code released as Python package:
- SeseLab, which is a software platform for teaching physical attacks (the paper and the accompanying lab sheets are in French, sorry): https://pypi.org/project/seselab/
- THC (trustable homomorphic computation), which is a generic implementation of the modular extension scheme, a simple arithmetical idea allowing to verify the integrity of a delegated computation, including over homomorphically encrypted data: https://pypi.org/project/thc/
Anyone who programs publicly (via streaming, blogging, open source) opens themselves up for criticism, and 90% of the time the criticism is extremely helpful (and the more brutally honest, the better).
I recall an Economist magazine author made their code public, and the top comments on here were about how awful the formatting was. The criticism wasn't unwarranted, and although harsh, would have helped the author improve. What wasn't stated in the comments is that by publishing their code, the author already placed themselves ahead of 95% of people in their position who wouldn't have had the courage to do so. In the long run, the author will get a lot better and much more confident (since they are at least more aware of any weaknesses).
I'd weigh up the benefits of constructive (and possibly a little unconstructive) criticism and the resulting steepening of your trajectory against whatever downsides you expect from giving away some of your competitive advantage.
I've published 100,000s of lines of code from my research over 20 years, and I think I've had exactly one useful comment from someone who wasn't a close collaborator I would have been sharing code with anyway.
I still believe research code should be shared, but don't do it because you will get useful feedback.
This seems to depend on a paper getting a modest amount of media traction. That seems to set off the group of people who want to complain about code online.
ps always use an auto formatter/linter. I can't believe we ever used to live without them. So much time used to be wasted re-wrapping lines manually and we'd still get it wrong.
Citation needed. I have rarely seen valuable feedback from random visitors from the internet.
I wish it wasn't viewed as a competition in the first place.
No, it isn't.
Reproducing the results means that you provide the code that you used so that people can reproduce it just by running "make" (or something similar). If you do not publish the code and the input data, your research is not reproducible and it should not be accepted in a modern, decent world.
It doesn't matter that your code is ugly. Nobody is going to look at it anyway. They are only going to call it. If the code is able to produce the results of the paper with the same input data, that's enough. If the code is not able to at least do that, this means that even you are not able to reproduce your own results. In that case, you shouldn't publish the paper yet.
* I have never had someone come back to criticize my code style. And if they do, so what? I'll block them and not think about it again. I don't need to get my feathers ruffled over this.
* Similarly, if someone's trying to replicate my results, and they fail, it's on them to contact me for help. After that it's on me to choose how much effort to put into helping them. But if they don't contact me, or if they don't put in a good faith effort to replicate the results, that's their problem. If they try to publish a failure to replicate without having done that, it's no more valid science than publishing bad science in the first place.
Overall, I think most people who stress about publishing code do so because they haven't done it before. I've personally only ever had good consequences from having done so (people picking up the code who would never have done anything with it if it weren't already open source).
In your position, I would only release code which is not too hard to reproduce anyway or which only provides negligible competitive advantage for you. I mainly have "normal" papers in mind (experiments or data analysis) - if the main contribution is, for example, an algorithm which you want people to use, then you should publish an implementation, obviously.
Research based on or involving code/models/algorithms should always be accompanied by a code drop. Nobody expects the code to be of good quality.
Everything else is not reproducible - and against the scientific codex (IMO).
I read so many papers that claim incredible results, and wonder how they implemented their models in this particular simulator (close to impossible with only what is out there), only to find that there is just nothing to be found, anywhere. No repo, no models, no patch. NIL.
Sending an E-Mail? No response.
Further, anyone could just claim anything this way. Why bother doing any real work?
What if there is a small error in the code?
Wouldn't it be better to know that? In a scientific sense, searching for "the truth"?
A strong result isn’t just the final number, it’s also the process how you arrived there.
In a very real sense, unless a paper has a result so compelling I can't ignore it, I'll pass it by if there's no published source code (even an obvious prototype!). I'm not alone in that in my line of work. Industry folks might also be more willing to accept prototype code than academic folks, I dunno.
Worth considering, I guess, if you're interested in your work crossing the academic/industry boundary smoothly.
If the paper is enough to reproduce the results AND cleaning up the code can/is tedious, then adding the "code and data are available upon request" note seems both fair and justified.
That way, whoever wants the code can still ask for it and it does not lay an unnecessary burden on the author.
BUT, I have definitely encountered the situation where I read a paper, then looked at the associated code, and found that the exciting result was entirely because of a bug. The reputation, "This investigator is someone who does shoddy, error-prone work" is probably the worst possible one.
I hypothesize that you will see some combination of three effects: (1) you will get lots of downloads (which means people are using your code, good work!), perhaps with lots of follow-up emails and perhaps not depending on what the code does; (2) you will get lots of emails from random nutjobs looking to pick holes in your work, and you will waste your time answering them; (3) you will get almost completely ignored.
Whatever the outcome, I think a lot of people would be interested in hearing about what you learn.
From Heil et al. (https://www.nature.com/articles/s41592-021-01256-7):
> Documenting implementation details without making data, models and code publicly available and usable by other scientists does little to help future scientists attempting the same analyses and less to uncover biases. Authors can only report on biases they already know about, and without the data, models and code, other scientists will be unable to discover issues post hoc.
Even better would be to containerize all software dependencies and orchestrate the analysis with a workflow manager. The authors of the above paper refer to that as "gold standard reproducibility"
Reproducibility -- I dunno. A re-implementation seems better for reproducibility. The paper is the specific set of claims, not the code. If there are built-in assumptions in your code (or even subtle bugs that somehow make it 'work' better), then someone who "reproduces" by just running your code will also have these assumptions.
Coding time -- are you sure? Professional coders are pretty good. If you have, for example, taken the true academic path and written your code in FORTRAN, there's every chance that a professional could bang out a proof of concept in Python or C++ in like a week (really depends on the type of code -- Eigen and NumPy save you from a whole layer of tedium that BLAS and LAPACK 'helpfully' provide). Really good pseudocode might be more useful than your actual code.
Another note -- personally I treat my code as essentially the IP of my advisor. (He eventually open sources most things anyway.) But do check on the IP situation if you want to open source it yourself. If you are working as a research assistant, some or all of your code may belong to your university. They probably don't care, but it is better to have the conversation before angering them.
Hear hear! OP, if you go this route, treat your implementation as a practice run, and write out exactly how it works in pseudocode.
My 2 cents:
I think that hiring a (good) professional for a rework/reimplementation would be productive, but it would certainly run the risk of exposing errors in your work. If that's desirable or not depends on your timeline to publish, I guess.
At the time there were two widely used software packages for phylogenetic inference, PAUP*  and MrBayes . The source code for MrBayes was available, and although at the time I had some pretty strong criticisms of the code structure, it was immensely valuable to my research, and I remain very grateful to its author for sharing the code. In contrast the PAUP* source was not available, and I struggled immensely to replicate some of its algorithms. As a case in point, I needed to compute the natural log of the gamma function with similar precision, but there was no documentation for how PAUP* did this. I eventually discovered that the PAUP* author had shared some of the low-level code with another project. Based on comments in that code I pulled the original references from the 60s literature and solved these problems that had plagued me for months in a matter of days. Now, from what I could see in that shared PAUP* code, I suspect that the PAUP* code is of very high quality. But the author significantly reduced his scientific impact by keeping the source to himself.
1. It gives your work more visibility. If there is an easy git clone route to reproducing your work, it offers a low effort starting point for people to build upon your work, which means they are more likely to use it. Plus you get free citations from anyone who touches it.
2. There is no reason that people should be hoarding code in academia, and the only reason people do it now is a sort of prisoner's dilemma problem (the first person to publish their code had to start from scratch, so they feel possessive and let it die when they graduate). Every researcher who releases their code chips away at the problem and pushes the community to be more open with their code, which is intrinsically more efficient.
3. If you get lucky and the community adopts your code, it will be viewed very positively by any future career advancement committee as being 'the guy who wrote X'.
4. When I started in academia I based my codebase on an existing publicly available code, which saved me a huge amount of time in my work. I built upon it (not expanding the base code, but using it as a module to integrate experimental measurements to the simulations tools I wrote from scratch) in my PhD and when I graduated I handed a virtualbox image with the whole mess (yay free code--wouldn't have been possible with nonfree code) off to my successors, people in new groups, etc which ended up being the base of an entire new research group at a different university. Every once in a while I get an email asking for help, and get a notification saying that someone cited the code.
Now in the vast majority of cases you will only get a couple of people looking at your code (my experience so far), but still I think it's worth it. The question is, clean up the code or not. Ideally you would, because it increases the chance of someone using it by a lot. On the other hand with the realities of academic work, this is largely underappreciated.
So I recommend to find a balance, clean up enough so it is reasonably straight forward to run the code. Write a good readme that points to the paper and gives the appropriate citation.
You're supposed to welcome criticism and 'nitpicking' as a scientist.
I'll add: I think that we need to change the mindset in academia about code. If code was involved in producing the results in the paper that code should be considered part of the paper and (at least) as important as the text of the paper. (Same for data)
This was kind of a change for my advisor, who was definitely less interested in that aspect of research. I think this is an issue in academia and needs to change.
Also, ultimately if someone wants to copy and publish your work as their own it will be relatively easy to show that and the community as a whole will recognize it.
Also, for me it felt good when another student/researcher was aided by my work.
You don't need to clean it up or make the code presentable. Everyone knows it's research grade code. Most important part is that you have the code in a state that you can reuse in the future for another publication.
I've been saved multiple times by being able to easily go back to decade old work and reproduce plots.
Thus badges can become a kinda excuse for not fixing stuff by default.
someone might clean it up for you, too
As the other comment said, if you care about "advancing the science", and won't mind stuff like the above happening, then go for it. In my experience, it is not worth it.
This has been very much my experience.
Now both of the researchers have to be cited, but only one of them did the discovery work.
Unlikely. Following the algorithm from scratch may produce "similar" results, but not "reproduce", bugs and all. The only thing that can do that is your code.
Plus, when you set out to reproduce a paper from only the algorithmic description, it's typically not until you're 2 or 3 weeks into coding that you realise the original authors made many assumptions in their code that were not explicitly stated in the paper.
> However, the implementation can easily take two months of work to get it right.
An even more important reason why you should release your code.
> In my field many scientists tend to not publish the code nor the data.
A regrettable state of affairs indeed.
> They would mostly write a note that code and data are available upon request.
I have personally come across many cases where this promise could no longer be honoured by the time of the request. Publish the code.
> I can see the pros of publishing the code as it's obviously better for open science and it makes the manuscript more solid and easier for anyone trying to replicate the work.
It is also increasingly a requirement of funding bodies.
> But on the other hand it's substantially more work to clean and organize the code for publishing
Then don't. Release it under the CRAPL, stating as much. It is still better than nothing.
> it will increase the surface for nitpicking and criticism (e.g. coding style, etc).
If you were an entrepreneur hoping to peddle snake oil and not get found out, then I would see your point. But you're a scientist, you're supposed to welcome such criticism and opportunities for improvement. If anything, you might even get collaborations / more publications on the basis of improving on that code.
> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.
I would sincerely not feel very comfortable calling such people "scientists".
You have limited time. I'd prioritize that time on what you think others will find useful.
Don't worry about ugly code. There are research codes with 1k+ stars on GitHub that are ugly. They have so many stars because people find them useful.
You absolutely don't have to publish your code, or anything else for that matter. Don't let the drive for impact on the community force you into working on something you're not interested in.
Congrats on your publication.
It's better than nothing, it also is the only way for others to reproduce your results. I am surprised you were not asked to do that by whatever journal you chose to publish your results.
>many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage
LOL, what!? What is this crap about "competitive advantage"? Are you privately funded? Then it's fine. If you're funded by public (i.e. government) money, you are (at least ethically) obliged to share your work with everybody.
You raise a good tangential point:
Releasing a data set can be just as useful as releasing code, and every bit as necessary to reproducing results.
Moreover, reproducing a well-curated dataset can be just as prohibitive in terms of time and expense.
How many papers have reused datasets such as ImageNet, Celeb-A, etc. in recent years?
By all means, release your datasets even if you don't release your code.
CS scientific journals should make the bar much higher in that regard: no code? no publish. unless really good excuse.
I left the academic world a few years ago, but several of the analysis codes/models I published (either as stand-alone tools or artifacts published alongside a journal article) still regularly get used... if anything, there's probably a larger user base for one of my models today than there ever has been, and it's leading to a long-tail of publications where my initial work is either cited or I'm offered co-authorship when I have time to offer hands-on support for improving the model/code and offering my insight as a domain expert.
If you can take the time to clean up some code or author a lightweight package, that's amazing! But it's a bang-for-your-buck type thing. If you ever aspire to leave academia, it's undoubtedly worth spending some time to clean up the code, add documentation, add some unit tests, etc - great artifacts to use in supporting a hiring process if you move into a technical role somewhere in industry. But is far from necessary.
But it is more honest. Whatever you think about the effort required to do this, there's value in honesty.
Here is an example of my own scientific work:
- paper 
- preprint 
- GitHub 
It certainly wasn't easy to get all of this done. But doing this can also be a guide for others. They get to see exactly what you've done so that they don't waste months on the exact implementation. They can see where maybe you've made some mistakes to avoid them. They can see so much of the implicit knowledge that is left out of your paper and learn from it. Your code isn't going to be perfect, but what paper is, either?
Everyone will be a critic, anyway, so make it easy to pick up criticism of the stuff you feel the least confident in and do better next time. You won't get better if no one sees your code.
Personally, I would. Open source is a form of peer review, and if you're wanting to stand by your paper as peer-reviewable then I believe the code should be included in that. Generally speaking, I feel more researchers need to open up their code to peer review because generally speaking, research code tends to not have the same robustness against mistakes (through coding convention as well as tests) as professional software development. I shudder to think how many papers have flawed results that no one realises and are just accepted, because no one can spare the effort rebuilding the code from scratch and without any prior reference in order to verify said results.
I don't think you need to clean it up. You're not competing for a coding elegance competition, but rather allowing someone to find bugs if they exist and point it out, just as they would peer reviewing your paper.
More cynically, spaghetti code probably helps as a defense against people ripping off your code, so if you're worried about your competitive advantage then not cleaning it up is a form of security through obscurity :)
Separate from that, is there fairly new chatter in your field about reproducible science, publishing code and data, etc.? If so, what's the current thinking there about how valuable this is to collective science, and how that should affect the sometimes unfortunate conflicts of interest between career and science?
The obvious answer for science is: publish. The goal of science should be to make it easy for others to reproduce your work. Not to make it theoretically possible, but hard, because of the "competitive advantage".
The right thing to do would be to publish and next time you review another paper that does not publish code use that as a reason to reject it. The whole "code and data upon request" is obvious bullshit, there have been studies on it and often enough it ends up with "well, we don't have that code/data any more", "why do you need that? we won't help you plan to publish something we don't like" etc. pp.
The fact that your code is a mess means that it might be buggy; if other people can see your code, someone might find a bug in it. As you said, this is a good thing for open science, and makes your work easier to reproduce.
As an outsider looking in, many academic fields seem to have a reproducibility crisis. Many psychological studies, for example, cannot be reproduced yet they continue to be cited.
I personally feel like every academic paper should be reproducible. I should be able to email you the study and you should get the same results. Obviously clinical trials may vary (and thus the importance of statistical significance) but the real problem is data and models. If I, as someone reading your study, don't have your data, how can it possibly be reproduced? If I gather my own data will I get completely different results? If I'm solely relying on what details you give, how do I know you haven't made a fatal assumption or even just buggy code with your model?
I personally feel like a condition of all Federal funding should be that the data and any code should be made freely available.
So I support the idea of releasing it and that releasing something messy is better than releasing nothing but I can't speak to your individual circumstances.
If you want real protection of course you can always try to get a patent, but then I've got you because 90% of the people I have this conversation with are worried about people stealing their idea but don't think it is patent-worthy.
A similar analogue exists in startups: ideas are really a dime a dozen. Execution is what matters. There are millions of great startup ideas floating around -- I bet almost anyone could come up with at least a few that are viable -- but actually having the follow-through and dedication to execute that idea, that is what is challenging. I can't tell you how many people I've had calls with where the exchange is basically "I want your thoughts on this amazing idea but you have to sign an NDA first". 90% of the time these people aren't willing to go all-in on their idea and stake their career on it (hence them seeking second opinions), so it makes no sense for them to worry about me "stealing" their half-baked, unrealized idea. I say to them "would you take $3M in interest-free debt to develop this idea right now" and they say "no!" to which I say "then why should I sign an NDA?"
There is value in scrutinizing the code - not w.r.t. coding styles or standards but to discover bugs in the implementation, which are very common. Scientists are only human, and scientific software is less often checked by a second pair of eyes. There is also value in trying to replicate a study from scratch with a fresh implementation only from the details in the paper. Many conferences, for instance the European Conference on Information Retrieval (ECIR), Europe's largest scientific search technology conference, has a replication track only for replication papers, and these are often the most interesting/insightful papers. It occasionally happens that a result is not caused by what the authors think, but is merely an artifact of the implementation code. A very famous MIT researcher (not naming him or her here on purpose) fell into this trap in their Ph.D. thesis, but it can happen to anyone, really. Scientific results become objective knowledge as others solidify the body of knowledge by carrying out replications and arriving at the same results.
Whatever your decision about past code, going forward, if you plan to release all future research code, you will likely write better code in the first place, as you will constantly be aware that people will be looking at it, and that can only be a good thing.
(FWIW, I'm a professor at an R1 university. I give this advice to all of my Ph.D. students and strongly, strongly encourage them to put their code out there on our github.)
This is unfortunate. In one of my articles I linked to my GitHub repo where I had implemented the algorithm in C. One of my reviewers complained that I had used C instead of C++. Probably advisable to not publish code before peer review.
Computer engineering on novel systems is a bit harder, but a /complete/ spec of the system (enough for someone to precisely rebuild it) should be published in that case. Remote access on request to the prototype would be better.
> But on the other hand it's substantially more work to clean and organize the code for publishing
(B) Don't spend time cleaning code for publishing. Spend your time writing more papers.
> it will increase the surface for nitpicking and criticism (e.g. coding style, etc).
(C) Don't worry about this.
> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.
(D) If you do B, it will also reduce your worries about this. I am half joking.
There have been times when I've had to abandon incorporating an idea presented in a research paper because the paper doesn't have enough information for me to implement it in code. I could've made a lot of progress with some proof of concept code, even if it wasn't clean.
If it's uncommon to release code then I'd doubt anyone in the peer review will review it.
As an example, I've found a paper that promises a method to do the very thing I want to accomplish. It's not too dense but it skips a few crucial steps, and I've been working on coding the method for a year now (on and off, of course, but still for a long time). If the code was available it probably wouldn't take as long. The paper didn't mention that the code was available upon request, but it was implemented in a piece of software. I found it eventually, but it was a version just before the feature I'm after was added. I tracked down the author and they were a great sport about cold emails but didn't have the source any more.
So yes, please publish the code. You don't have to clean it up. It worked for the paper — it's good enough. Even the most terrible code is immeasurably better than no code.
You might be doing a young student a solid :D And don’t worry about cleaning it up!
If you use GitHub you could even disable Issues and have a note saying you don’t accept pull requests (in case you’re worried about support burden).
GitHub is not a viable solution; Microsoft can not be trusted to keep these important cultural artifacts safe and accessible in the near future.
If it is too much work to refactor the code for publishing, you can also just publish pseudocode.
I don't think anyone will nitpick or criticize coding style or things like that unless it is particularly egregious (ie naming variables something vulgar etc). The point of research papers is to communicate new and valuable findings. If people in this conference or journal are nitpicking things like that, you may want to find a different place to submit your work.
I don't know what your field is, but in Computer Science I can't say I have ever known people to consider their code a competitive advantage. The only time they might shy from releasing code is when they think they can commercialize it or something.
For the journal I edit, authors are required to include the code and data with the submission. The code and data are available along with the paper if it's published. We do replication audits of some papers to make sure you can take the materials they've included and reproduce every result in the paper. If not, the conditional acceptance changes to rejection. I've had cases where reviewers found errors in the code, so I rejected the paper.
On the argument that it's a competitive advantage: what does that mean? You should be able to claim results but not show where they came from? That's not science.
Keep in mind that this is a "source available" requirement, not an open source requirement. It is a matter of transparency. You have to let others see exactly what you did.
"Scientific communication relies on evidence that cannot be entirely included in publications, but the rise of computational science has added a new layer of inaccessibility. Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail."
Then answer any criticism about it by asking for a PR.
To preempt code style complaints find a code formatter for your language and run everything through that first.
Refer to the repository in your paper, but don't put a link. Create a little bit of friction to get to the repo to discourage the casual readers who don't really need the code from popping over too easily.
Ideally, it would be nice if the code has a professional-level quality to it, but I think everyone involved in evaluating research understands that it is at best a prototype. Proper software engineering is expensive, and it is not the role of research to do this. The process, as it was explained to me: university research pushes the state of the art, industrial research labs are slightly behind this and looking to transfer into practical uses (along with this some government agencies are interested in tech transfer) and finally software engineering takes these ideas and turns them into actual products. You aren't making a product, so it is OK for the code not to be perfect (also, from experience, 'professional' industry code is not always that great either). The main point is that someone has some chance of reproducing your results.
The exception to this is if you are making a product, where the definition of product is a tool for further research. Examples might be tools for symbolic execution or formal verification, in which case it might be worth some time to make the experience of using it good for that benefit, to reduce friction so that people try and want to use your tool.
Artefact evaluation is rapidly becoming something people are encouraged to do and helps enormously in verifying results, but the point is usually to try to reproduce the results of the paper to back up the science, not to start an argument over coding style. I would hope that artefact evaluation processes make this clear and ensures that evaluations of artefacts focus on reproducibility. For outside comments that might arise, I suggest you publish the work as open source and respond to any criticisms with a fairly standard line: yes this is research quality code and we would like to have time to improve it. If you would like to submit a patch/pull request we would welcome any help.
Science that is not reproducible is not science.
If you can, publish something high-level. Matlab or Python or Julia is fine. C or Java, not so much, because the build environment will not be available any longer after a few years. Actually, if you can, publish several translations.
And don't forget to publish your data sets as well. And your data augmentation or whatever. Everything you need to reproduce your results.
And for the love of Knuth, DO NOT OPTIMIZE YOUR CODE. Dumb code is good code in science. You would not believe what kinds of havoc some algorithms wreaked on my systems in the name of optimization. Optimizations that made a ten-year-old algorithm run in two nanoseconds instead of four (vastly exaggerated). Optimizations that obfuscated otherwise perfectly reasonable algorithms.
The goal is reproducibility.
It is better for science, it will be better for you and it will be better for people who want to play with your code.
Publishing is a form of advertising what you did, and helping others reproduce it makes it go viral and is a testament to how much they care. It can only help your career.
You’ll definitely get people who nitpick the code. This won’t hurt and it may even help in its own way.
So, yes. Please publish the code, it will make the rest of the paper stronger.
I suggest publishing the code as is on something such as Github, Gitlab, etc. I suspect you have ideas on how you can improve the code, perhaps there's even a way of improving your research methodology by doing so, enabling new insights with further research. If you did a follow up experiment with improved analysis enabled by your improved code, then that's another paper, and another (more cleaned up) version of the code to push to the repository.
The above is all supposition though, as I don't know your field. If deep learning then the above seems more likely. If your field is geology, then improvements in the software might not enable better insights.
Some papers link to the code instead of including it. Maybe I'm just unlucky, but this usually leads to dead links (but that's a different topic altogether).
What I would not expect from people is code that would necessarily run in your environment. For example, in many cases, the paths are going to be hard-coded, for a variety of reasons. It might be ideal to write code that will just work, in a reproducible environment, but that often takes more work than people are willing to commit to, given all the other things they have to do.
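A small sketch of a cheap middle ground, assuming nothing about the original code beyond it reading input files from some directory (all names here are illustrative):

```python
# Illustrative sketch: take the data directory from the command line instead of
# hard-coding an absolute path, so others can point the script at their own copy.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=Path, default=Path("data"),
                    help="directory containing the input files")
args = parser.parse_args()

for csv_file in sorted(args.data_dir.glob("*.csv")):
    print(f"would process {csv_file}")
```

Even that much is optional; a hard-coded path that the README explains is still far better than no code.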
Finally, cleaning up your code for presentation is a final opportunity for you to discover any mistakes before you publish and then later have an embarrassing public retraction.
You could add a disclaimer that the code was worked on until it provided a satisfactory result, and no further, and is not intended for (any) use. You might even add that, except for outright, actual errors that affect the result of the research, comments are discouraged.
I often publish very bad code, terrible terrible spaghetti, it's not how I write code at my job, because at my job, I'm paid to produce not only working and correct code, but also code that is maintainable and understandable and follows certain practices.
However, my hobby is not writing corporate code, but writing code that gets done what I want to get done, nothing more, and sometimes less. It might even have actual bugs in it that I can plainly see and don't care about because they don't affect my uses.
If people can't tell the difference, I don't care, not my problem. If a future employer can't tell the difference, I won't work with them.
Like others have said, research code isn't meant to be production quality code so I wouldn't worry about "quality" in that way.
The main points are that I made only a minimal attempt to organize it, and I made the state of the code clear in the README. I don't recall anyone complaining about the code or even mentioning it during review. (Though to be fair, I also don't recall whether I published the code before or after the paper was accepted.)
Looking at things from the other side, I am at least an order of magnitude more likely to read, use the work/methods from, and therefore cite a paper that comes with code.
You can embed this in the PDF, e.g. see section A.1 for how.
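(For anyone without access to that reference: one way to do this, not necessarily the one described in section A.1, is to attach the source file to the PDF programmatically, e.g. with the pypdf library; file names below are hypothetical.)

    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("paper.pdf")
    writer = PdfWriter()
    for page in reader.pages:          # copy the existing pages unchanged
        writer.add_page(page)

    with open("analysis.py", "rb") as f:
        writer.add_attachment("analysis.py", f.read())   # embed the code as a file attachment

    with open("paper_with_code.pdf", "wb") as out:
        writer.write(out)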
Similarly, I’ve found papers that don’t include their complete data set in the paper, and had to try to reverse engineer it from images and so on. It is really frustrating when papers are incomplete.
I'd say a grad student owes nobody anything until they finish, because they're bearing the greatest risk of losing priority, and the openness of science is being used against them. Nothing is lost by waiting until the degree is in the bag before sharing. Then clean it up and use it as part of your portfolio. Or append it to your thesis. Advancing science after you've secured your career is a fair compromise.
I love open source and open science, but also look back on my own graduate studies, and I chose a topic that was protected by virtue of a large capital investment plus domain knowledge that was not represented by code. Also, my thesis predates widespread use of the Internet. ;-)
Can you provide a source, or example of this? What does the Amazon of academia look like?
My field, physics, much harder. Building my experiment required a bunch of expensive equipment (maybe half a million in today's dollars), gear that I built myself, the technique of operating it, and so forth.
My career, much harder. I work in business. You learn about my ideas when a patent comes out. ;-)
Look at my other comment for more explanation: if you are working under a less-known advisor, or at a less-known university, there is a high chance that this will happen if your work is good.
That was for the first occurrence. For the second one, we just did not bother, because it hurts my advisor's reputation as well. It is not in the interest of the journal to admit a mistake once they have made it -- they will fight you about it and try to keep their reputation/image up.
Nowadays margins are large enough, and publishing costs nothing or next to nothing, and you probably don't have any other use for your code, so what would be the advantage for you in not publishing it?
What kind of competitive advantage does it give you? (What many scientists think may be less relevant than what you think about this "competitive advantage" specifically, in your specific case/field.)
About "cleaning it", why?
I mean, if it works as-is (but is "ugly"), it still works; what if in the process of "cleaning" it you manage to introduce a bug of some kind?
Unless you plan to re-test it after the cleaning, I guess it would be better not to clean it at all.
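On the re-testing point: if you do decide to tidy the code up before publishing, a cheap safeguard is a regression check that runs both the original and the cleaned-up version on the same inputs and insists the outputs match. A rough sketch in Python, assuming NumPy and with hypothetical module and function names:

    import numpy as np

    # Hypothetical modules: the untouched research code and the tidied version.
    from original_messy import run_experiment as run_old
    from cleaned_up import run_experiment as run_new

    rng = np.random.default_rng(0)          # fixed seed so the check is repeatable
    inputs = rng.normal(size=(100, 10))     # stand-in for your real data

    old_out = run_old(inputs)
    new_out = run_new(inputs)

    # If this fails, the "cleanup" changed the results and needs a second look.
    assert np.allclose(old_out, new_out), "cleanup changed the results!"
    print("cleaned code reproduces the original output")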
For every paper introducing the revolutionary Algorithm X, there are a bunch of follow-up papers like "Algorithm X applied to self-driving cars", "Algorithm X applied to smartphones", "Algorithm X with some tweaks that provide marginal improvements", "Algorithm X but using consumer-grade hardware" and so on.
If every other lab has to spend several months to replicate your first paper, you and your colleagues can spam out the follow-up papers before anyone else can catch up. This makes your publication count go up.
Other means for achieving similar effects include delaying the publication of your code, or releasing undocumented spaghetti-code with missing dependencies and entirely comprised of one-character variable names.
Of course, this stuff comes at a cost: making it harder for people to use your work makes them less likely to use your work. So it might be better for your citation count to release the code - and in any case, who goes into research hoping their ideas will be ignored?
This is probably wrong, depending on the field. At least in machine learning, the papers that get cited the most are those that other people can easily pick up and work on. They become the basis for future work, get cited as baselines more often, etc. Publishing research ML code is a competitive advantage.
What is useful is if you can produce code people can build on and do their own cool stuff with -- then they will cite you. However, getting something to a state where it is tested for all reasonable inputs, has some basic docs, etc. is a hard undertaking.
https://github.com/minion/minion (C++ constraint solver)
https://github.com/stacs-cp/demystify (Python puzzle solver)
https://github.com/peal/vole (Rust group theory solver)
In practice Minion is generally used as a backend to Conjure ( https://conjure.readthedocs.io/en/latest/ ), which provides a much nicer input language.
In the field I follow the most (Computer Graphics/Rendering) I think there is a big problem with reproducibility as well, and to be honest, I think some of the major players have little interest in making this significantly better: they can take advantage of the visibility of a flashy render/FPS counter shown at an event while still building a "moat" between themselves and others who want to adopt the same methods.
Which is in the end partly an answer to your question: your paper could clearly describe all the elements needed to implement a method correctly, but by providing a sample implementation you allow others to "stand" on your shoulders, as they say, instead of having to climb there first and then proceed. You needn't worry too much about the state of your codebase as long as you make clear, via the README/documentation/license, that it's still in the "proof of concept" phase.
One reasonable observation I have heard is that in some fields, during peer review, some reviewers seem to like to nitpick on the code rather than the paper, sometimes in subtle ways. Because of that, I think it can be (unfortunately) OK to release the code after acceptance or publication. But apart from this, I see only advantages.
If the answer to the above is no, and it will mostly cost you time and effort, then don't publish.
If the answer to the above is yes, then consider the return on investment for publishing your code. If you earn more reputation/money/whatever by publishing than you expend doing the work of publishing, then publish; if not, then don't.
That "competitive advantage" is just holding everyone back, slowing progress. This is particularly annoying to hear coming from "research" which I thought was supposed to be advancing the state of the art for the benefit of society. That's ostensibly the reason for publishing papers right, to disseminate knowledge? Or is it really just to increase ones ego and get paid?
Not saying you should publish code, just that deliberately keeping secrets in your field seems to go against what I thought you were doing.
It should take a couple of hours. The code works? You know how to reproduce what you did, right? It doesn't need to be perfect. Doesn't even need to pass code review. It just needs to work.
> many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.
Well depends on the field I guess, but you also want recognition and impact. What is the point of publishing a result no one uses?
If the purpose is to push human knowledge forward, then it seems backwards not to publish everything.
Personally, I've found it difficult in my various careers to date when I've been put in positions where the actions that serve my immediate interests are in any way in conflict with my underlying principles or overarching goals. It's demotivating and deflating.
If I were in your position, I would publish everything and let myself feel pride in what I did. Even if we're all just insignificant specks in the grand scheme of things, pursuing a greater purpose can help make it feel like something matters.
Every researcher thinks this, and it's always wrong. If you care about scientific progress, publish the code and data.
Besides, available code should cause more people to look at your work and ultimately cite it.
If you published a paper that uses information from the code then yes you absolutely must publish your code. Otherwise you're contributing to the decline of science via the opaqueness of papers and irreproducibility problem.
While I appreciate this is true, it’s also quite sad. Science shouldn’t be a competitive sport to increase a couple metrics like publications and citations such that useful parts of replicating and extending studies aren’t shared. :(
One of my most cited papers is a relatively uninteresting one we wrote for a conference competition. But we have code so it is easy to compare your alternative approach to us. That means citations.
So it can work for your benefit as well.
Do it. There's no good reason not to.
The thing is that I was required to provide a way to reproduce, so obfuscated and/or uncommented code was not a problem. I provided clean code anyway.
For me, it shows the authors are confident yet also open to critique. Which is a wonderful thing.
Secondly, I usually need the code to really understand the paper.
Published terrible code is far better than unpublished code.
So identify what's most critical or novel about your work and publish that.
Put a huge note in the README that this is research code and only licensed for non-commercial use.
Put a note on your personal homepage that you're available to hire as a research consultant for $1000 per day.
Companies who like your research will put 1+1 together. A friend of mine got hired straight out of university at a very competitive salary with this approach.
If people want great code that runs easily and is easy to read, that's engineering work, built off the back of novel implementations.
If people want novel implementations that are likely rough around the edges and require a bit of finagling to run, leave that to the scientists.
To me it at least sends a signal that people are hiding stuff. That's not good. It has made me distrust some papers in the past; I tried to reach out, with no success.
Agree with other comments on CRAPL, but you should release it.
- People who use your work will cite you.
- You may get collaborators.
- It's an easy-to-get-to backup.
- For non-academic jobs, it's part of your resume.
Worst case scenario, it will end up in a star-less github repo that nobody reads.
If someone has comments about style, ask them to improve it for you.
Worry about maintaining things only after someone asks for maintenance; the vast majority of code is never read again.