Ask HN: Should I publish my research code?

 5 months ago
source link: https://news.ycombinator.com/item?id=29934192

308 points by jarenmf 6 hours ago | 277 comments

I'm looking for advice on whether I should publish my research code. The paper itself is enough to reproduce all the results. However, the implementation can easily take two months of work to get right.

In my field many scientists tend to not publish the code nor the data. They would mostly write a note that code and data are available upon request.

I can see the pros of publishing the code as it's obviously better for open science and it makes the manuscript more solid and easier for anyone trying to replicate the work.

But on the other hand it's substantially more work to clean and organize the code for publishing, it will increase the surface for nitpicking and criticism (e.g. coding style, etc). Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

> it's substantially more work to clean and organize the code for publishing, it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

Matt Might has a solution for this that I love: Don't clean & organize! Release it under the CRAPL[0], making explicit what everyone understands, viz.:

"Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"

[0] https://matt.might.net/articles/crapl/

> "Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"

What do you know, it turns out the professional software developers I work with are actually scientists and academics!!

They don't call it "Computer Science" for nothing ;)
Notoriously, Philip Wadler says that computer science has two problems: computer and science.

It's not about computers, and "You don't put science on your name if you're a real science."

He prefers the name informatics.

source: https://youtube.com/watch?v=IOiZatlZtGU

> You don't put science on your name if you're a real science

Having flashbacks to when a close friend was getting an MS in Political Science, and spent the first semester in a class devoted to whether or not political science is a science.

Maybe a little OT, but, I'd rather it be called "computing science." Computers are just the tool. I believe it was Dijkstra who famously objected to it being called "computer science," because they don't call astronomy "telescope science," or something to that effect.
Peter Naur agreed with that, which is why it is called "Datalogi" in Denmark and Sweden, though his English term "Datalogy" never really caught on. He proposed it in a letter to the editor in 1966, and it has pretty much been used here ever since, as Naur founded the first Datalogy university faculty a few years later.

This has also led to some strange naming now that we have data science as well, which is called "Data videnskab", just a literal translation of the English term.

[0]: https://dl.acm.org/doi/10.1145/365719.366510 (sadly behind a wall)

Informatics may be the closest English analogue. Incidentally, that is also what computer science is called in German (Informatik).
I never thought about this, but it's right. However, to get more nitpicky, most of the uses of "Comput[er|ing] Science" should be replaced with "Computer Engineering" anyway. If you are building something for a company, you are doing engineering, not science research.
All my job titles have been of the form Software Engineer plus a modifier.

I believe they are referring to what the degree currently known as Computer Science should be called.

The average developer isn't often doing "engineering". Until we have actual standards and a certification process, "engineer[ing]" doesn't mean anything.

The average software developer doesn't even know much math.

Right now, "software engineer" basically means "has a computer, -perhaps- knows a little bit about what goes on under the hood".

I'm not talking about the "average developer", I'm talking about college graduates having a "Computer Science" degree but in practice being "Computer engineers".
College degrees aren't standardized and most of the time don't really mean anything. Ask some TAs for computer science courses about how confident they are in undergrads' ability to code.

There isn't a standard license that shows that someone is proficient in security, or accessibility, or even in how computer hardware or networking work at a basic level.

So all we're doing is diluting the term "engineer", so as to not mean anything.

The only thing the term "software engineer" practically means is: they have a computer. It's meaningless, just a vanity title meant to sound better than "developer".

Be careful there. If you start calling what you're doing 'engineering', people will want you to know statics, dynamics, and thermodynamics.
Please don't use this license. Copy the language from the preamble and put it in your README if you'd like, but the permissions granted are so severely restricted as to make this license essentially useless for anything besides "validation of scientific claims." It's not an open-source license - if someone wished to include code from CRAPL in their MIT-licensed program, the license does not grant them the permission to do so. Nor does it grant permission for "general use" - the software might not be legally "suitable for any purpose," but it should be the user's choice to decide if they want to use the software for something that isn't validation of scientific claims.

I am not a lawyer, just a hardcore open-source advocate in a former life.

From the post: "The CRAPL says nothing about copyright ownership or permission to commercialize. You'll have to attach another license if you want to classically open source your software."

It is explicitly the point of the license that the code is not for those purposes, because it's shitty code that should not be reused in any real code base.

I‘m a proponent of MIT and BSD style licenses normally, but this calls for something like AGPL: Allow other researchers and engineers to improve upon your code and build amazing things with it. If someone wants to use your work to earn money, let them understand and reimplement the algorithms and concepts, that’s fine too.
While I wholeheartedly agree with you, I would seriously question anyone trying to reuse research code in production without completely reimplementing it from scratch.
In the research industry, it is well established for anyone wanting to publish or utilize / include another's research in their own, to contact the source author and receive explicit permission to do so.

More often than not, they are more than willing to help.

> the permissions granted are so severely restricted as to make this license essentially useless

Indeed, also there're things like "By reading this sentence, You have agreed to the terms and conditions of this License.". That can't hold up in court! How can I know in advance what the rest of the conditions say before agreeing to them?

Then again, I am not a lawyer either.

I'd like to piggyback and say that increasing the surface for nitpicking and criticism is exactly why OP should release his code. It improves the world's ability to map data to the phenomenon being observed. It becomes auditable.

Certainly don't clean it up unless you're going to repeat the experiment with the cleaned up code.

Agreed on both points! As somebody who bridges research and production code, I can typically clean code faster than I can read & understand it. It really helps to have a known-good (or known-"good") example so that I can verify that my cleanup isn't making undesired changes.

And, yeah. I've found some significant mistakes in research code -- authors have always been grateful that I saved them from the public embarrassment.
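The "known-good example" check described above can be sketched roughly like this: run the untouched original and the cleaned-up version on the same inputs and compare the outputs. The two functions here (`original_mean`, `cleaned_mean`) are hypothetical stand-ins for illustration, not anyone's actual research code, and the tolerances are arbitrary assumptions.

```python
import math
import random


def original_mean(values):
    # Hypothetical "as published" code: works, but is clumsy.
    total = 0.0
    n = 0
    for v in values:
        total = total + v
        n = n + 1
    return total / n


def cleaned_mean(values):
    # Hypothetical cleaned-up rewrite of the same computation.
    return math.fsum(values) / len(values)


def outputs_match(f, g, cases, rel_tol=1e-9, abs_tol=1e-9):
    """Return True if f and g agree on every input case within tolerance."""
    return all(
        math.isclose(f(c), g(c), rel_tol=rel_tol, abs_tol=abs_tol)
        for c in cases
    )


rng = random.Random(0)  # fixed seed so the check itself is repeatable
cases = [[rng.uniform(-1e3, 1e3) for _ in range(50)] for _ in range(100)]
print(outputs_match(original_mean, cleaned_mean, cases))
```

A tolerance is used rather than exact equality because a cleanup that reorders floating-point operations can legitimately change the last few bits.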

The CRAPL is a "crayon license" — a license drafted by an amateur that is unclear and so will prevent downstream use by anyone for whom legal rigor of licensing is important.


> I'm not a lawyer, so I doubt I've written the CRAPL in such a way that it would hold up well in court.

Please do release your code, but please use a standard open source license. As for which one, look to what your peers use.

I do this with my code and can highly recommend it.

Supplying bad code is a lot more valuable than supplying no code.

Also in my experience, reviewers won't actually review your code, even though they like it a lot when you supply it.

I'm not a scientist so maybe I don't get it, but it seems like code could be analogized to a chemist's laboratory (roughly). If a chemist published pictures of their lab setup in their paper, and it turned out that they were using dirty glassware, paper cups instead of beakers, duct taping joints, etc etc, wouldn't that cast some doubt on their results? They would be getting "nitpicked" but would it be unfair? Maybe their result would be reproducible regardless of their sloppy practices, but until then I would be more skeptical than I would be of a clean and orderly lab.
Yes, if you feel you have to make it "release ready" then you'll never publish it. I'm pretty sure a good majority of the code is never released because the original author is ashamed of it, but they shouldn't be. Everybody is in the same boat.

The only thing I would add is a description of the build environment and an example of how to use it.

I like it so far, other than

4) You recognize that any request for support for the Program will be discarded with extreme prejudice.

I think that should be a "may" rather than a "will." If I find out someone is using my obscure academic code, and they ask for help, I'd be pretty pumped to help them (on easy requests at least).

The point of the license is to set your expectations as low as possible. Then, when you actually /do/ get support, you'll be ecstatic rather than non-plussed.
When phrased like this,

> 4) You recognize that any request for support for the Program will be discarded with extreme prejudice.

There is no way I'd even make a request for support.

Exactly. If nothing else, a request for support has a chance of being an indication that there's somebody else in the field that cares about some aspect of the problem. I might not act on it, but it is good to have some other human-generated signal that says "look over there."
A thousand times this. A working demonstrator of a useful idea that is honest about its limits is so valuable. Moreover, most commercial code is garbage! :)
Please do absolutely publish your code.

If only to help people who simply can't read academic papers because it's not a language their brain is wired to parse, but who can read code and will understand your ideas via that medium.

[EDIT]: to go further, I have - more than once - run research code under a debugger (or printf-instrumented it, whichever) to finally be able to get an actual intuitive grasp of the idea presented in the paper.

Once you stare at the actual variables while the code runs, my experience is it speaks to you in a way no equation ever will.

I think this could be done much better by putting a very restrictive license like GPLv3 / AGPL and then in the README putting in that I don't support this project at all and ignore everything associated with wherever you are hosting it.

Using this license would actually make me suspect that your results aren't even valid and I don't trust many experiments that don't release source code.

Anecdotally most of the research papers I see and have worked on publish their code but don't really clean it up. Even papers by big companies like Microsoft Research. Still significantly better than not publishing the code at all.
> Academic code is about "proof of concept."

Why does he think that but presumably not the same about the paper itself and the “equations”, plots, etc. contained within?

It’s really not that hard to write pretty good code for prototypes. In fact, I can only assume that he and other professors never allowed or encouraged “proof of concept” code to be submitted as course homework or projects.

I think you don't understand the realities of work in scientific/academic organisations. Unless you work in computer science, you likely never received any formal education in programming except for some numerical methods course in C/MATLAB/Fortran during your studies (which often focuses more on the numerical methods than on the programming). So pretty much everyone just learned by doing.

Moreover, you are not being paid for writing reasonable programs; you're paid for doing science. Nobody would submit "prototype" papers, because papers are the currency of academic work. A lot of time is spent polishing a paper before submission, but doing that for code is generally not appreciated because nobody will see it on your CV.

I understand it fine. Like I said, it’s really not that hard to write pretty good code for prototypes. I'm not saying the code needs to be architected for scale or performance or whatever else needless expectation. I don't have a formal education in programming or computer science and write clean code just fine, as do some other non-computer science people I've worked with in advanced R&D contexts. And then some (many?) don't. It's not really about educational background, it's more about just being organized. Even when someone is "just" doing science, a lot of times, the code explicitly defines what's going on and has major impacts on the results. (Not to mention that plenty of people with computer science backgrounds write horrible code.)

If code is a large part of your scientific work, then it's just as important as someone who does optics keeping their optics table organized and designed well. If one is embarrassed by that, then too bad. Embarrassment is how we learn.

Lastly, you're describing problems with the academic world as if they are excuses. They're reasons but most people know the academic world is not perfect, especially with what is prioritized and essentially gamified.

MIT and BSD are established and well accepted licenses, literally named after the academic institutions where they originated. Licenses are legal documents, part of what makes them "explicit what everyone understands" is their legally recognized establishment.

If you want to set expectations, this can simply be done in a README. Putting this in a license makes no sense. Copyright licenses grant exceptions to copyright law. If you're adding something else to it, you're muddying the water, not making it better.

This is not what licenses are for!! They are not statements about the quality of your work or anything similar.

Use standard and well understood licenses e.g. GPL for code and CC for documentation. The world does not need more license fragmentation.

This has explicit usage limitations that matter in science land, which is very much the kind of thing that belongs in a license.
   You are permitted to use the Program to validate scientific claims
   submitted for peer review, under the condition that You keep
   modifications to the Program confidential until those claims have
   been published.

Moreover, sure, lots of the license is text that isn't common in legal documents, but there's no rule that says legal text can't be quirky, funny or superfluous. It's just most practical to keep it dry.

In this particular case, however, there's very little risk of actual law suits happening. There is some, but the real goal of the license is not to protect anyone's ass in court (except for the obvious "no warranty" part at the end), but to clearly communicate intent. Don't forget that this is something GPL and MIT also do besides their obvious "will likely stand up in court" qualities. In fact I think that communicating intent is the key goal of GPL and MIT, and also the key goal of CRAPL.

From this perspective, IMO the only problem in this license is

    By reading this sentence, You have agreed to the terms and
    conditions of this License.
This line makes me sad because it makes a mockery of what's otherwise a pretty decent piece of communication. Obviously nobody can agree to anything just by reading a sentence in it. It should say that by using the source code in any way, you must agree to the license.
By existing, you have agreed to the terms and conditions.
> clearly communicate intent

Again, this is not how a license works. You can express your intents, ideas and desires in a README file and in many other ways.

The license is nothing more than a contract that provides rights to the recipient under certain conditions. Standing up in court is its real power and only purpose.

That's why we should prefer licenses that stood up in court and have been written by lawyers rather than developers or scientists.

I agree. There's a lot of confusion surrounding even the most established ones, so there's no need to further muddy the situation with newer licenses. In my opinion a "fun" license, with its untested legal ambiguity, restricts usage more than a well established license with a similar level of freedoms.
For instance, the Java license explicitly forbids the use in/for real-time critical systems, and such limitations are good to stress in a license so that they may reach legal force, also to protect the author(s).

Incidentally, I've seen people violate the Java "no realtime" clause.

It used to. OpenJDK has been licensed under GPLv2 with the Classpath Exception, which allows this, for years. If you are not running an OpenJDK build, it depends on your vendor's license.
And it makes the license non-opensource.

Plus, the usual "no warranty" is strong enough to protect the authors anyways.

> This is not what licenses are for!!

You must be fun at parties :)

Hi, I’m a Research Software Engineer (a person who makes software that helps academics/researchers) at a university in the UK. My recommendation is that not only do you publish the code, you mint a DOI (digital object identifier, Zenodo is usually the go to place for that) for the specific version that was used in your paper and you associate them. And you include a citation file (GitHub supports them now: https://docs.github.com/en/repositories/managing-your-reposi...) in your software repo.

Benefits: people who want to reproduce your analysis can use exactly the right software, and people who want to build on your work can find the latest in your repo. Either group knows how to cite your work correctly.
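For the citation file mentioned above, here is a minimal sketch that renders a `CITATION.cff` in the Citation File Format (version 1.2.0). The helper name `render_citation_cff` is made up for this example, and the title, author, version, and DOI in the usage line are placeholders, not real values. It does no escaping, so it assumes the inputs contain no double quotes.

```python
def render_citation_cff(title, authors, version, doi, date_released):
    """Render a minimal CITATION.cff (Citation File Format 1.2.0) as text.

    authors is a list of (family_name, given_name) tuples.
    No YAML escaping is done: inputs must not contain double quotes.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        f'doi: "{doi}"',
        f'date-released: "{date_released}"',
        "authors:",
    ]
    for family, given in authors:
        lines.append(f'  - family-names: "{family}"')
        lines.append(f'    given-names: "{given}"')
    return "\n".join(lines) + "\n"


# Placeholder values only; substitute the DOI minted for your release.
print(render_citation_cff(
    title="Example research code",
    authors=[("Doe", "Jane")],
    version="1.0.0",
    doi="10.5281/zenodo.0000000",
    date_released="2022-01-14",
))
```

The output would be saved as `CITATION.cff` in the root of the repository, with the `version` and `doi` fields pointing at the release used in the paper.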

In practice drive-by nitpicking over coding style is not that common, particularly in (some) science fields where the other coders are all other scientists who don’t have strong views on it. Nitpicks can be easily ignored anyway.

BTW should you choose to publish, the Turing Way has a section on software licenses written for researchers: https://the-turing-way.netlify.app/reproducible-research/lic...

And I would suppose it will drive more citations. Which is a plus!
As a physician who writes pretty awful code, I like this comment.
> The paper itself is enough to reproduce all the results.

No, this is almost never the case. It should be. But it cannot really be. There are always more details in the code than in the paper.

Note that even the code itself might not be enough to reproduce the results. Many other things can matter, like the environment, software or library versions, the hardware, etc. Ideally you should also publish log files with all such information so people could try to use at least the same software and library versions.

And random seeds. Make sure this part is at least deterministic by specifying the seed explicitly (and make sure you have that in your log as well).
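As a rough sketch of the logging suggested above, the following collects the interpreter version, platform, a few package versions, and the seed into one record that can be saved next to the results. The function name `describe_run`, the package list, and the seed value are assumptions for illustration; adapt them to whatever your code actually depends on.

```python
import json
import platform
import random
import sys
from importlib import metadata


def describe_run(seed, packages=()):
    """Collect a small record of the environment and the seed used."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
        "seed": seed,
    }


seed = 12345  # arbitrary example value
rng = random.Random(seed)  # seed the generator explicitly, not globally
record = describe_run(seed, packages=["numpy", "scipy"])
print(json.dumps(record, indent=2))
```

Writing this record to a file at the start of every run is cheap, and it answers most "which version did you use?" questions later.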

Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).

> In my field many scientists tend to not publish the code nor the data.

This is bad. But this should not be a reason that you follow this practice.

> clean and organize the code for publishing

This does not make sense. You should publish exactly the code as you used it. Not a restructured or cleaned up version. It should not be changed in any way. Otherwise you would also need to redo all your experiments to verify it is still doing the same.

Ok, if you did that as well, then ok. But this extra effort is really not needed. Sure it is nicer for others, but your hacky and crappy code is still infinitely better than no code at all.

> it will increase the surface for nitpicking and criticism

If there is no code at all, this is a much bigger criticism.

> publishing the code will be removing the competitive advantage

This is a strange take. Science is not about competing against other scientists. Science is about working together with other scientists to advance the state of the art. You should do everything to accelerate the process of advancement, not try to slow it down. If such behavior is common in your field of work, I would seriously consider changing fields.

I agree with almost all of this, however I believe that publishing random seeds is dangerous in its own way.

Ideally, if your code has a random component (MCMC, bootstrapping, etc), your results should hold up across many random seeds and runs. I don’t care about reproducing the exact same figure you had, I want to reproduce your conclusions.

In a sense, when a laboratory experiment gets reproduced, you start off with a different “random state” (equipment, environment, experimenter - all these introduce random variance). We still expect the conclusions to reproduce. We should expect the same from “computational studies”.
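One way to check that a conclusion holds across seeds, as argued above, is to rerun the stochastic part under several seeds and report the spread rather than a single number. This is only a sketch: `estimate_pi` is a toy stand-in for a real experiment, and the number of seeds and samples are arbitrary choices.

```python
import random
import statistics


def estimate_pi(seed, n_samples=20_000):
    """Toy stochastic experiment: Monte Carlo estimate of pi."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples


def across_seeds(experiment, seeds):
    """Run the experiment once per seed; return (mean, sample std dev)."""
    results = [experiment(s) for s in seeds]
    return statistics.mean(results), statistics.stdev(results)


mean, spread = across_seeds(estimate_pi, range(20))
print(f"estimate = {mean:.3f} +/- {spread:.3f} over 20 seeds")
```

If the claim in the paper only survives for one particular seed, the spread will make that obvious.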

> And random seeds. Make sure this part is at least deterministic by specifying the seed explicitly (and make sure you have that in your log as well).

> Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).

Publishing the weights of a trained model allows verification (and reuse) of results even before going to the effort of reproducing it. This is especially useful when training the model is prohibitively expensive.
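If you do publish weights, one small addition (my suggestion, not something the comment above prescribes) is to publish a checksum alongside them so others can confirm they downloaded exactly the file you evaluated. A minimal sketch using the standard library; the file name in the usage comment is hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large weight files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (hypothetical file name):
# print(sha256_of_file("model_weights.bin"))
```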

In my view and personal experience, the pros outweigh the cons:

* You increase the impact of your work and as a consequence also might get more citations.

* It's the right thing to do for open and reproducible research.

* You can get feedback and improve the method.

* You are still the expert on your own code. That someone picks it up, implements an idea that you also had and publishes before you is unlikely.

* I never got comments like "you could organize the code better" and don't think researchers would tend to do this.

* Via the code you can get connected to groups you haven't worked with yet.

* It's great for your CV. Companies love applicants with open-source code.

> It's the right thing to do for open and reproducible research.

Everybody here talks about how publishing code helps (or even makes possible) reproducibility, but this is not true; on the contrary, it hinders it. Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper. This is trivial and good for nothing. Reproduction is other researchers independently reimplementing the code using only the published theory, and getting the same results. If the author publishes the code, no one will bother with this, and this is bad for science.

This is a common misconception, but you are actually talking about "replicability" which is "writing and then running new software based on the description of a computational model or method provided in the original publication, and obtaining results that are similar enough" [1]. Reproducibility instead refers to running the same code on the same data to get the same results [2].

[1] Rougier et al., "Sustainable computational science: the ReScience initiative" https://arxiv.org/abs/1707.04393

[2] Plesser, "Reproducibility vs. Replicability: A Brief History of a Confused Terminology" https://doi.org/10.3389/fninf.2017.00076

First, this is overtly not true. Reproducibility refers to all forms: that the paper figures can be built from code and don't have errors, that a reimplementation of new code on the same data produces the same results, and that gathering new data (e.g. by conducting the same experiment again if possible, in other words replication) produces comparable results.

Second, publishing code helps make invisible decisions visible in a far better manner than the paper text does. Try as we might to imagine that every single researcher degree of freedom is baked into the text, it isn't and it never has been.

Third, errors do occur. They occur when running author code (stochasticity in models being inadequately pinned down, software version pinning, operating system -- I had a replication error stemming from round-towards-even behaviour implementation varying across platforms). If you have access to the code, then it's far easier to determine the source of the error. If the authors made a mistake cleaning data, having their code makes it easier to reproduce their results using their exact decisions net of the mistake you fix.
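To illustrate the kind of rounding discrepancy mentioned above (this is a generic example, not the commenter's actual case): Python's built-in `round` resolves ties to the nearest even integer, while many other tools round halves up. Making the tie rule explicit with the `decimal` module removes the ambiguity. The helper name `round_half` is made up for this sketch.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half(value, mode):
    """Round a decimal string to an integer using an explicit tie rule."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=mode))


# Columns: input, half-even, half-up, Python's built-in round()
for v in ["0.5", "1.5", "2.5"]:
    print(
        v,
        round_half(v, ROUND_HALF_EVEN),
        round_half(v, ROUND_HALF_UP),
        round(float(v)),
    )
```

For 0.5 and 2.5 the two rules disagree (0 vs 1, and 2 vs 3), which is enough to shift a binned or thresholded result when the same analysis is ported between environments.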

Most papers don't get replicated or reproduced. Making code available makes it more likely that, at a minimum, grad students will mechanically reproduce the result and try to play around with design parameters. That's a win.

Source: Extensive personal work in open and transparent science, including in replication; have written software to make researcher choices more transparent and help enable easier replication; published a meta-analysis of several dozen papers that used both reproducing author results from author code, producing author results with code reimplementation, and producing variant results -- each step was needed to ensure we were doing things right; a close friend of mine branched off into doing replications professionally for World Bank and associated entities and so we swap war stories; always make replication code available for my own work.

In my experience, the published paper is super vague on the approach, and implementing it without further references is really hard. I'm not necessarily arguing that papers should get longer and more detailed to counter this; expressing the details that matter in code seems like a more natural way to communicate anyway.

Why trust results if you can't see the methodology in detail and apply the approach to your own data? I once knew somebody who built a fuzz tester for a compilers project, got ahold of a previous project's compiler code that won best paper awards, and discovered a bunch of implementation bugs that completely invalidated the results.

Why is the peer review process limited to a handful of people who probably don't have access to the code and data? If your work is on Github, anybody can come along and peer review it, probably in much more detail. And as a researcher, you don't get just one chance to respond to their feedback -- you can actually start a dialogue, which other people are free to join in.

As long as a project's README makes any sort of quality / maintenance expectations clear upfront, why not publish your code?

> In my experience, the published paper is super vague on the approach, and implementing it without further references is really hard.

This is my experience, too, and in my opinion this is exactly what has to change for really reproducible research, not ready to run software supplied by the author.

There are many good arguments in support of publishing code, but reproducibility is not one of them, that's all I'm saying.

And just like OP said, it generally takes a couple of months to go from paper to working code. I've implemented a few papers as code as a side-gig for a while, and I wouldn't mind having a baseline from the authors to check and see if I'm following the paper correctly!
I disagree for another reason. Having access to the code allows easy comparison. I did some research in grad school on a computational method, and there was a known best implementation in the literature. I reached out to the author and he kindly supplied me with the source code of his work. I wasn't trying to replicate his results, but rather I wanted to compare his results to my implementation's results in a variety of scenarios to see if I had improved over the other method.

And to the original author's credit, when I sent him a draft of my paper and code, he loved how such a simple approach outperformed his. I always felt that was the spirit of collaboration in science. If he hadn't supplied his code, I really would never have known how they performed unless I also fully implemented the other solution -- which really wasn't the point of the research at all.

> Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper. This is trivial and good for nothing.

I agree with this statement; however, I think you may have a misunderstanding about reproducing results. It's not about whether you can reproduce their graphs from their dataset, but rather about seeing whether their code reproduces on your (new) dataset.

Another way to think of it is that the research paper's Methodology section describes how to set up a laboratory environment to replicate results. By extension, the laboratory for coding research IS the code. Thus, by releasing the code along with your paper, you are effectively stating "here is a direct copy of my laboratory for you to conduct your replication in".

I guess things are a spectrum. I've worked on research projects where understanding and developing the algorithm is the research. There isn't really an "input data set" other than a handful of parameters that are really nothing more than scale factors, and the output is how well the algorithm performs. So "setting up the laboratory" by cloning the code and building it is...fine, but a reimplementation of the algorithm with "the same" results (modulo floating point, language features, etc. etc.) aligns much better with reproducibility.
Often the text in a paper that describes some algorithm will not be completely clear or complete. Providing the code fills in those blanks. I've taken a paper with code and gone through the code line by line comparing it with what was described in the paper. The code often ends up clarifying ambiguities. In some cases there's an outright disagreement between the paper and the code - that's where the feedback comes in, ask the author about these disagreements, it will help them clarify their approach (both the text and the code).

> Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper.

Sure, to some extent. But the code does provide a baseline, a sanity check. People who are trying to reproduce results should (as I describe above) go through both the paper and the code with a fine tooth comb. The provided code should be considered a place to start. I'll often re-implement the algorithm in a different language using the paper and the provided code as a guide.

What are your thoughts on including pseudocode within the paper directly? It seems to clear up some of the ambiguities while adding brevity since it doesn't provide the somewhat frivolous details of implementation. I think it also limits some of the potential stylistic critiques.
It's not a bad idea to include pseudocode, but my pet peeve is that there's really no standard for pseudocode. I tend to write pseudocode that looks like python. I did that in an internal paper for a group I worked in a few years back and the manager of the group made me change it to look more like C (complete with for loops as: for(i=0; i<x; i++) which made no sense to me).
Oh, haha, then disregard my intuition that it would help avoid stylistic critiques :-)
"Reproducing results" in a scientific context doesn't mean taking the original author's data and going through the same process. It usually means taking different data and going through the same process. Having code on hand to put new data through the same process makes that a lot easier.
> It's great for your CV. Companies love applicants with open-source code.

While I strongly support sharing the code, I am not sure if this is a great reason to do so. Companies are made up of many individuals, and while some might appreciate what it takes to open source code, other individuals might judge the code without full context and think it is sloppy. My suggestion is that you fully explain the context before sharing code with companies.

"It's the right thing to do for open and reproducible research."

I think this is the most important reason to do it. Research code is not meant to be perfect as another op said, but it can be instrumental in helping others, including non-academics, understand your research.

I think the sooner it's released the better (assuming you've published and you're not needing to protect any IP.) There's some great advice here: https://the-turing-way.netlify.app/reproducible-research/rep...

What a great question. You've come to the right community.

My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

You'll open yourself up for comments. They may be positive or negative. You'll only know how it pans out afterwards.

Is the code something that you'll want to improve on for further research? If so, publish it on GitHub. It opens the way for others to contribute and improve the code. Be sure to include a short readme saying that you welcome PRs for code cleanup, etc. That way you can turn comments criticizing your code into a request for collaboration. It'll really separate helpful people from drive-by commenters.

> My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

Worth mentioning specifically: If you make a git (et al) repository public, make sure there are no passwords or secret keys in the history of the repository either. Cleaning a repository history can be tricky, so if this is an issue, best to just publish a snapshot of the latest code (or make 100% sure you've invalidated all the credentials).
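
A rough sketch of what checking the history could look like in Python, assuming `git` is on your PATH. The patterns and function names here are illustrative assumptions, not a complete rule set; dedicated secret scanners cover far more cases, and a clean result from this sketch proves nothing.

```python
import re
import subprocess

# Illustrative patterns only; real scanners use much larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "credential_assignment": re.compile(
        r"(?i)(?:password|passwd|secret|api[_-]?key|token)\s*[:=]\s*['\"]?[^\s'\"]{6,}"
    ),
}

def find_secret_candidates(text):
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits

def scan_git_history(repo_path="."):
    """Scan every patch on every branch, not just the current working tree."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--all"],
        capture_output=True, text=True, errors="replace", check=True,
    ).stdout
    return find_secret_candidates(log)
```

Anything it flags still needs a human look, and any credential that ever reached the history should be treated as compromised and invalidated.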

The brute force way around this is to remove the .git folder and re init the git repo.

For my 2 cents I'd prefer to see sloppy code vs no code.

If you did something wrong, you did it wrong. Hopefully someone would put in a PR to fix it

> My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

Also personal data of any human subjects.

Maybe do it and see what happens. If something bad then don’t do it again…
Is your goal to help advance the science and our general knowledge? Publish the code. You don’t even need to clean it up. Just publish. Don’t worry about coding style nitpicks. Having the code and data available actually protects you from claims of fabrication or unseen errors in hidden parts of your research.

On the other hand, if your goal is only to advance your own career and you want to inhibit others from operating in this space any more than necessary to publish (diminish your “competitive advantage”) then I guess you wouldn’t want to publish.

Not sure if posting the paper only is even the best move. I personally never work with papers with no code published. Just not worth the effort to reproduce them, when I can use the second SOTA for nearly no performance penalty and much less effort.

All the groundbreaking papers in deep learning in the last decade had code published. So if you're aiming for thousands of citations, you need code.

> All the groundbreaking papers in deep learning in the last decade had code published. So if you're aiming for thousands of citations, you need code.

I am in this field and I would say fewer than 10% of the top papers have code published by the author, and those are most of the time another 0.1% improvement on ImageNet. All the libraries that you generally use are likely to be recreated by others in this field. The code for a lot of the most interesting work never comes out, e.g. AlphaZero/MuZero, GPT-3, etc.

This is very domain specific. OP said it is not the norm to publish code in his field. I have a PhD and in my field it is the same. So much so that I can't think of any paper in my field that has code published. Therefore, a paper with no code would not be at a disadvantage.

Personally, it is a pet peeve I have about my field. But there is no incentive for a new researcher to publish code as it decreases barriers to entry. As much as it's nice to say that researching in academia is about progressing science, as a researcher, you are your own startup trying to make it (i.e., get tenure).

Can confirm.

I personally look at any paper without code with great suspicion. The reviewers certainly did not try to reproduce your results, and I have no guarantee that a paper without code has enough information for me to reproduce.

I always go for the papers with code provided.

As a reviewer I have reproduced results with my own independent plasma simulation code. And I have had a reviewer write in a report about my paper "result X seemed strange, but I ran it with my code and it does it too. I don't know why, but the result is valid". In my opinion that is even better than just rerunning the same code.
Agreed, reproducibility helps a lot, and it is very easy to get details wrong when reimplementing. Having the source code is a big plus.
Yes you should! And not only for ethical reasons (actually reproducible research, publicly financed work, etc), even if those are good enough by themselves.

I've always published my research code. Thanks to that, one of the tools I wrote during my PhD has been re-used by other researchers and we ended up writing a paper together! In my field it was quite a nice achievement to have a published paper without my advisor as a co-author even before my PhD defense (and it most likely counted a lot for me to get a tenured position shortly after).

The tool in question was finja, an automatic verifier/prover in OCaml for counter-measures against fault-injection attacks on asymmetric cryptosystems: https://pablo.rauzy.name/sensi/finja.html

My two most recently published papers also come with published code released as Python packages:

- SeseLab, which is a software platform for teaching physical attacks (the paper and the accompanying lab sheets are in French, sorry): https://pypi.org/project/seselab/

- THC (trustable homomorphic computation), which is a generic implementation of the modular extension scheme, a simple arithmetical idea allowing to verify the integrity of a delegated computation, including over homomorphically encrypted data: https://pypi.org/project/thc/

I agree. I published code that I used for my dissertation (more than 30 years ago). I think it led to thousands of citations.
> it will increase the surface for nitpicking and criticism

Anyone who programs publicly (via streaming, blogging, open source) opens themselves up for criticism, and 90% of the time the criticism is extremely helpful (and the more brutally honest, the better).

I recall an Economist magazine author made their code public, and the top comments on here were about how awful the formatting was. The criticism wasn't unwarranted, and although harsh, would have helped the author improve. What wasn't stated in the comments is that by publishing their code, the author already placed themselves ahead of 95% of people in their position who wouldn't have had the courage to do so. In the long run, the author will get a lot better and much more confident (since they are at least more aware of any weaknesses).

I'd weigh up the benefits of constructive (and possibly a little unconstructive) criticism and the resulting steepening of your trajectory against whatever downsides you expect from giving away some of your competitive advantage.

Do you really mean 90% of the criticism is extremely helpful? Or did you mean 90% was useless.

I've published 100,000s of lines of code from my research over 20 years, and I think I've had exactly one useful comment from someone who wasn't a close collaborator I would have been sharing code with anyway.

I still believe research code should be shared, but don't do it because you will get useful feedback.

I've had the same experience, though I've only been publishing for 5 years so far. I still try to publish everything openly, but I do not expect any responses anymore. For none of my papers did the reviewers appear to have even looked at the Jupyter Notebooks I attached as HTML. The papers are cited, some more, some less, but there is no reaction to the source code. I still don't regret publishing it.
Interesting. Are the unhelpful comments coming from academics or random peanut gallery folks?
Peanut gallery, in my experience. The number of people who I've never met before who decide to complain about hardcoded file paths or run a linter and tell me my paper must therefore be garbage is frustratingly high.

This seems to depend on a paper getting a modest amount of media traction. That seems to set off the group of people who want to complain about code online.

This. Feedback (less loaded term than "criticism") is something you should want. You can obviously ignore tabs vs spaces types of comments but if your code takes 2 months to get right then it probably still has bugs in it after 2 months and it would be a win if others started finding them for you. Also, if the style is really that bad then it could be obscuring bugs that would otherwise be easy to spot (missing braces, etc), and you might find bugs while fixing it up.

ps always use an auto formatter/linter. I can't believe we ever used to live without them. So much time used to be wasted re-wrapping lines manually and we'd still get it wrong.

> 90% of the time the criticism is extremely helpful

Citation needed. I have rarely seen valuable feedback from random visitors from the internet.

> as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

I wish it wasn't viewed as a competition in the first place.

To be frank, nobody cares about your code. I’d be shocked and flattered if anybody read any of the code I wrote during my PhD. Publish the code in its current state and move on. If people take the time to actually read and nit pick your code you’ll have succeeded.
Personally, as someone who's had to dig through academic code in the field of bioinformatics, I do appreciate code being attached to the paper, regardless of the paper's level of detail or the code's quality (or lack thereof). I don't think many researchers expect high quality code unless you're releasing a library explicitly for general use and expect contributors. That said, a brief README with at least instructions to execute is a rare but welcome addition in my personal opinion.
> The paper itself is enough to reproduce all the results.

No, it isn't.

Reproducing the results means that you provide the code that you used so that people can reproduce it just by running "make" (or something similar). If you do not publish the code and the input data, your research is not reproducible and it should not be accepted in a modern, decent world.

It doesn't matter that your code is ugly. Nobody is going to look at it anyway. They are only going to call it. If the code is able to produce the results of the paper with the same input data, that's enough. If the code is not able to at least do that, this means that even you are not able to reproduce your own results. In that case, you shouldn't publish the paper yet.
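
For what it's worth, the "just run one command" idea can be sketched in Python too. Everything here is a hypothetical placeholder: the Monte Carlo estimate stands in for the real analysis, and the seed value and function names are assumptions. The only point is a single entry point that fixes the seed and prints a fingerprint of the results.

```python
import hashlib
import json
import random

SEED = 12345  # assumption: all randomness in the analysis flows from this seed

def run_experiment(seed=SEED, n_samples=1000):
    """Placeholder analysis: a Monte Carlo estimate of pi."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return {"seed": seed, "n_samples": n_samples, "pi_estimate": 4.0 * inside / n_samples}

def fingerprint(results):
    """Hash a canonical JSON encoding so two runs can be compared at a glance."""
    encoded = json.dumps(results, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

if __name__ == "__main__":
    results = run_experiment()
    print(json.dumps(results, sort_keys=True))
    print(fingerprint(results))
```

If two runs on the same machine print different fingerprints, that is exactly the "even you can't reproduce it" situation described above.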

Agreed! Even if your code is just a slight modification of some other well known tutorial, in order to reimplement the code means that I, as a scientist, have to reverse engineer your codebase based on your text - which may not be as straightforward as expected. Remember that some researchers are not native English speakers, so there's the added complexity of translating your words into a readable format.
Worth noting that this is a modernization of scientific tradition, which predates code. I did publish my code, but the bulk of my work was a series of manual steps. That was 30 years ago. The closest thing to a replication involved changes to my design. The science world at large is still coming up to speed on this.
Publishing replication data and code increases the impact and citation rate of published work. For a literature review, see https://osf.io/kgnva/wiki/Open%20Science%20Literature/#Open_...
While I publish in a field where making source code available is much more common, let me just make a couple of points:

* I have never had someone come back to criticize my code style. And if they do, so what? I'll block them and not think about it again. I don't need to get my feathers ruffled over this.

* Similarly, if someone's trying to replicate my results, and they fail, it's on them to contact me for help. After that it's on me to choose how much effort to put into helping them. But if they don't contact me, or if they don't put in a good faith effort to replicate the results, that's their problem. If they try to publish a failure to replicate without having done that, it's no more valid science than publishing bad science in the first place.

Overall, I think most people who stress about publishing code do so because they haven't done it before. I've personally only ever had good consequences from having done so (people picking up the code who would never have done anything with it if it weren't already open source).

You absolutely should. Papers should always have reproducible code, otherwise there is no practical usage for the community. Crap is better than nothing.
It would be nice if everybody would publish code for their papers. But in a field where most people don't do it, releasing your code will probably not be beneficial for you due to the loss of the competitive advantage. I know that for people with a CS background this sounds weird, but it is the reality in academia.

In your position, I would only release code which is not too hard to reproduce anyway or which only provides a negligible competitive advantage for you. I mainly have "normal" papers in mind (experiments or data analysis) - if the main contribution is, for example, an algorithm which you want people to use, then you should obviously publish an implementation.

This whole mindset is so shockingly wrong from an academic perspective.

Research based on or involving code/models/algorithms should always be accompanied by a code drop. Nobody expects the code to be of good quality.

Everything else is not reproducible - and against the scientific codex (IMO).

I read so many papers that claim incredible results, and wondering how they implemented their models in this particular simulator (close to impossible with only what is out there), only to find that there is just nothing to be found, anywhere. No repo, no models, no patch. NIL.

Sending an E-Mail? No response.

Further, anyone could just claim anything this way. Why bother doing any real work?

What if there is a small error in the code?

Wouldn't it be better to know that? In a scientific sense, searching for "the truth"?

Just do a super-minimal cleaning and upload to Zenodo or similar, then stick the DOI for the code and input/output files in your paper somewhere. 99% certain your reviewers will not bother to look at your code. 10 years from now someone new looking into the same topic gets a leg up. Don't feel obligated to update, clarify, or even think about the code ever again. If you want to build a community or something, then by all means go for GitHub, but providing code along with your paper should be something automatic and quick, not an unwanted burden.
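
One cheap thing that could go along with such an upload is a checksum manifest of the input/output files, so that someone 10 years from now can tell whether they have the same files. A minimal sketch, where the function names and the manifest layout are my own assumptions and not anything an archive requires:

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify_manifest(root, manifest):
    """Return the listed paths that are missing or whose contents changed."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)
```

Dump the manifest to a text file next to the code before uploading; an empty list from `verify_manifest` later means the listed files are unchanged.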
I personally do not trust research that does not have reasonably polished publicly available code behind it.

A strong result isn’t just the final number, it’s also the process how you arrived there.

This is a question near to my heart. I'm not an academic but a practicing systems software engineer. A good chunk of my work is sourcing interesting academic ideas and trying to turn them into practically useful software. Papers that don't release their source code are often not as reproducible as the authors think. Perhaps there's a bug that the results depend on, perhaps the implementation is very specific to the context the software runs in, perhaps the paper gives _most_ of what you'd need to re-implement but the fine details are missing. I've seen them all.

In a very real sense unless the paper has a result that's so compelling I can't ignore it if there's no published source code -- even if it's an obvious prototype! -- I'll pass it by. I'm not alone in that in my line of work. Industry folks might also be more willing to accept prototype code than academic folks, I dunno.

Worth considering, I guess, if you're interested in your work crossing the academic/industry boundary smoothly.

This one depends on both the field you are in as well as your own academic philosophy OP.

If the paper is enough to reproduce the results AND cleaning up the code can/is tedious, then adding the "code and data are available upon request" note seems both fair and justified.

That way, whoever wants the code can still ask for it and it does not lay an unnecessary burden on the author.

As someone making a career in academia, I recognize both pros and cons here, but I think that the pros far outweigh the cons. Essentially, I think the question is one of identity - do you want your reputation to be "This investigator is the kind of person whose code is always available"? I know that as I evaluate job applications, funding proposals, or papers, I weigh this reputation highly, and consider the opposite "This investigator is the kind of person that hesitates to share their code" to be a big red flag.

BUT, I have definitely encountered the situation where I read a paper, then looked at the associated code, and found that the exciting result was entirely because of a bug. The reputation, "This investigator is someone who does shoddy, error-prone work" is probably the worst possible one.

There's a lot of advice here, but very little data to support any of it. Since you're a scientist, why not take an experimental approach to answering this question? Publish your code, for one (selected) paper. Monitor (a) the download log, and (b) the emails you get related to your code.

I hypothesize that you will see some combination of three effects: (1) you will get lots of downloads (which means people are using your code, good work!), perhaps with lots of follow-up emails and perhaps not depending on what the code does; (2) you will get lots of emails from random nutjobs looking to pick holes in your work, and you will waste your time answering them; (3) you will get almost completely ignored.

Whatever the outcome, I think a lot of people would be interested in hearing about what you learn.

Releasing the code is the very least you should do to make your analysis reproducible. I would be surprised if it was possible to exactly reproduce the results from the paper alone.

From Heil et al. (https://www.nature.com/articles/s41592-021-01256-7):

> Documenting implementation details without making data, models and code publicly available and usable by other scientists does little to help future scientists attempting the same analyses and less to uncover biases. Authors can only report on biases they already know about, and without the data, models and code, other scientists will be unable to discover issues post hoc.

Even better would be to containerize all software dependencies and orchestrate the analysis with a workflow manager. The authors of the above paper refer to that as "gold standard reproducibility"
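
Short of a full container plus workflow manager, a much lighter first step is to record the environment the results were produced in. The sketch below is not what the paper calls gold standard reproducibility, just an assumption about a cheap stopgap; the function name and the manifest layout are hypothetical.

```python
import json
import platform
from importlib import metadata

def environment_snapshot():
    """Collect the interpreter version, OS description, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }

if __name__ == "__main__":
    # Print the snapshot so it can be redirected into a file kept next to the results.
    print(json.dumps(environment_snapshot(), indent=2))
```

A snapshot like this does not rebuild the environment for anyone, but it tells a future reader which versions to aim for when they try.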

Yes you should. Just publishing as it is would be enough. Everybody understands that academic code is pretty experimental and nobody would judge it if it is pretty or not. The reason why you should publish it is to gain trust. Back when I was doing my PhD I found several instances of papers that had results that were nearly impossible to reproduce to the point that I sometimes believed they were just fakes. I am pretty sure in most of the cases that is not the case but....
Having a polished public implementation can lead to a massive increase in the number of citations a paper receives, if it is really a useful system. Some of my papers I think would have received far fewer citations if I had not released the code. Of course, if it is a really niche area with only a handful of researchers, this may not be true.
Publishing code could be nice, if for example your code has a commercial application and a company wants to use it, a reference implementation might be nice.

Reproducibility -- I dunno. A re-implementation seems better for reproducibility. The paper is the specific set of claims, not the code. If there are built-in assumptions in your code (or even subtle bugs that somehow make it 'work' better), then someone who "reproduces" by just running your code will also have these assumptions.

Coding time -- are you sure? Professional coders are pretty good. If you have, for example, taken the true academic path and written your code in FORTRAN, there's every chance that a professional could bang out a proof of concept in Python or C++ in like a week (really depends on the type of code -- EIGEN and NUMPY save you from a whole layer of tedium that BLAS and LAPACK 'helpfully' provide). Really good pseudocode might be more useful than your actual code

Another note -- personally I treat my code as essentially the IP of my advisor. (He eventually open sources most things anyway). But do check on the IP situation if you want to open source it yourself. If you are working as a research assistant, some or all of your code may belong to your university. They probably don't care, but it is better to have the conversation before angering them.

> Really good pseudocode might be more useful than your actual code

Hear hear! OP, if you go this route, treat your implementation as a practice run, and write out exactly how it works in pseudocode.

My 2 cents:

I think that hiring a (good) professional for a rework/reimplementation would be productive, but it would certainly run the risk of exposing errors in your work. If that's desirable or not depends on your timeline to publish, I guess.

Been there, done that. I published my doctoral research code [1] so that others could inspect, verify, replicate, extend, etc. YMMV, but the feedback I received from other researchers ranged from neutral to surprisingly positive (e.g. people using it in ways that pleasantly surprised me). But let me expand on my own experiences while developing that software, trying to figure out how to replicate the then-current state of the art.

At the time there were two widely used software packages for phylogenetic inference, PAUP* [2] and MrBayes [3]. The source code for MrBayes was available, and although at the time I had some pretty strong criticisms of the code structure, it was immensely valuable to my research, and I remain very grateful to its author for sharing the code. In contrast the PAUP* source was not available, and I struggled immensely to replicate some of its algorithms. As a case in point, I needed to compute the natural log of the gamma function with similar precision, but there was no documentation for how PAUP* did this. I eventually discovered that the PAUP* author had shared some of the low-level code with another project. Based on comments in that code I pulled the original references from the 60s literature and solved these problems that had plagued me for months in a matter of days. Now, from what I could see in that shared PAUP* code, I suspect that the PAUP* code is of very high quality. But the author significantly reduced his scientific impact by keeping the source to himself.

[1]: https://github.com/canonware/crux

[2]: https://paup.phylosolutions.com/

[3]: http://nbisweden.github.io/MrBayes/
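
To give a flavour of the kind of low-level routine involved, here is a textbook way to compute the natural log of the gamma function for positive arguments: shift the argument upward with the recurrence, then apply a truncated Stirling series. To be clear, this is an assumption chosen for illustration, not the method PAUP* uses (I haven't said which algorithm that was), and in Python you would normally just call `math.lgamma`.

```python
import math

HALF_LOG_TWO_PI = 0.5 * math.log(2.0 * math.pi)

def log_gamma(x, shift_threshold=10.0):
    """Natural log of Gamma(x) for x > 0 via recurrence plus a Stirling series."""
    if x <= 0.0:
        raise ValueError("this sketch only handles x > 0")
    # ln Gamma(x) = ln Gamma(x + 1) - ln(x): move x up to where the series is accurate.
    correction = 0.0
    while x < shift_threshold:
        correction -= math.log(x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # Stirling series terms: 1/(12x) - 1/(360x^3) + 1/(1260x^5) - 1/(1680x^7)
    series = inv * (1.0 / 12.0 - inv2 * (1.0 / 360.0 - inv2 * (1.0 / 1260.0 - inv2 / 1680.0)))
    return (x - 0.5) * math.log(x) - x + HALF_LOG_TWO_PI + series + correction

# Compare against the standard library for a few arguments.
for value in (0.5, 2.5, 123.4):
    print(value, log_gamma(value), math.lgamma(value))
```

The shift threshold trades a few extra logarithms for a smaller truncation error; where to put it is exactly the sort of detail that never makes it into a paper.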

Does your publication venue have an artifact review committee? That would be a good way to share your code and (redacted or anonymized) data. I'm in security/privacy research, and our venues recently started doing this. They serve as a quality check, labeling your artifacts from merely "submitted" to "results reproduced."



There are about 100 comments saying the same thing already, but I would highly suggest publishing the code:

1. It gives your work more visibility. If there is an easy git clone route to reproducing your work, it offers a low-effort starting point for people to build upon your work, which means they are more likely to use it. Plus you get free citations from anyone who touches it.

2. There is no reason that people should be hoarding code in academia, and the only reason people do it now is a sort of prisoners dilemma problem (first person to publish their code had to start from scratch, so they feel possessive and let it die when they graduate). Every researcher who releases their code chips away at the problem and pushes the community to be more open with their code which is intrinsically more efficient.

3. If you get lucky and the community adopts your code, it will be viewed very positively by potential future career advancement committees: you become 'the guy who wrote _x'.

4. When I started in academia I based my codebase on an existing publicly available code, which saved me a huge amount of time in my work. I built upon it (not expanding the base code, but using it as a module to integrate experimental measurements to the simulations tools I wrote from scratch) in my PhD and when I graduated I handed a virtualbox image with the whole mess (yay free code--wouldn't have been possible with nonfree code) off to my successors, people in new groups, etc which ended up being the base of an entire new research group at a different university. Every once in a while I get an email asking for help, and get a notification saying that someone cited the code.

As a fellow scientist I would say go for it. I know people who had a vast amount of citations (>2000) for a paper accompanying a code/program release that they made at an opportune moment (they released a code for designing/analysing photonic crystals just when the field was taking off).

Now in the vast majority of cases you will only get a couple of people looking at your code (my experience so far), but still I think it's worth it. The question is, clean up the code or not. Ideally you would, because it increases the chance of someone using it by a lot. On the other hand with the realities of academic work, this is largely underappreciated.

So I recommend to find a balance, clean up enough so it is reasonably straight forward to run the code. Write a good readme that points to the paper and gives the appropriate citation.

> it will increase the surface for nitpicking and criticism

You're supposed to welcome criticism and 'nitpicking' as a scientist.

Not from untrained randos on the internet. Signal noise ratio and prior of “not a nutjob” have to be high enough to offset the cost of lost focus.
That’s a bit of a dismissive straw man. The quote was explicitly referring to nitpicking of things unrelated to the research. You intentionally snipped the very next three words “e.g. code style”. Contrary to your implication, it is not a scientist’s job to welcome any and all nitpicking and criticism, which is why there are professional moderated platforms for relevant science critique, as opposed to criticism.
Emphatically YES. Put the code on GitHub. It doesn't have to be perfect. Especially if it will take two months for someone to "get it right" from the paper. I've been involved in projects where we were trying to reproduce results from some paper both with code and without. The description of an algorithm in a paper can sometimes be unclear; often reading the code makes the description in the paper much clearer. In the cases where no code is provided it's that much harder to reproduce the results. You want to make it as easy as possible for others to reproduce your results - give them your code - put it in a GitHub repo. If they spot discrepancies between the code and what's described in the paper, then all the better - you can use that feedback to improve both.

I'll add: I think that we need to change the mindset in academia about code. If code was involved in producing the results in the paper that code should be considered part of the paper and (at least) as important as the text of the paper. (Same for data)

I always published all of my code/papers/source for my publications. I never made anything "revolutionary" but I still felt it was important to produce reproducible research, even if relatively insignificant.

This was kind of a change for my advisor, who was definitely less interested in that aspect of research. I think this is an issue in academia and needs to change.

Also, ultimately if someone wants to copy and publish your work as their own it will be relatively easy to show that and the community as a whole will recognize it.

Also, for me it felt good when another student/researcher was aided by my work.


You don't need to clean it up or make the code presentable. Everyone knows it's research grade code. Most important part is that you have the code in a state that you can reuse in the future for another publication.

I've been saved multiple times by being able to easily go back to decade old work and reproduce plots.

Many conferences are starting to adopt a badge system and will evaluate your artifact. And this is becoming more and more popular, and I know many researchers that will keep these badges in mind when reading the evaluation in the paper. For example here is the artifact evaluation that was done at SOSP 2021 https://sysartifacts.github.io/sosp2021/results.html.
These badges are kinda controversial. The message they send out is "we give you an extra goodie if you do proper science, because we don't expect that to be the default".

Thus badges can become a kinda excuse for not fixing stuff by default.

publish; bad code is far better than no code

someone might clean it up for you, too

Two times I have published my research code - both times I have found many other papers/projects plagiarized my work without giving me any credit. This happens way more than you would think, especially if you are working under less known advisor, and at less known university.

As the other comment said, if you care about "advancing the science", and won't mind stuff like the above happening, then go for it. In my experience, it is not worth it.

> Two times I have published my research code - both times I have found many other papers/projects plagiarized my work without giving me any credit. This happens way more than you would think, especially if you are working under less known advisor, and at less known university.

This has been very much my experience.

I wonder how often it is the case that code isn't considered an academic product per se, and so free to use. May have to make it very explicit.
Publish your code only after you have made the journal publications / conference papers. I have witnessed a researcher getting robbed of his work when another researcher took his almost complete code from github and submitted faster to a journal for publication.

Now both of the researchers have to be cited, but only one of them did the discovery work.

> The paper itself is enough to reproduce all the results.

Unlikely. Following the algorithm from scratch may produce "similar" results, but not "reproduce", bugs and all. The only thing that can do that is your code.

Plus, when you set out to reproduce a paper from only the algorithmic description, it's typically not until you're 2 or 3 weeks into coding that you realise the original paper made many assumptions in the code that were not explicitly stated in the paper.

> However, the implementation can easily take two months of work to get it right.

An even more important reason why you should release your code.

> In my field many scientists tend to not publish the code nor the data.

A regrettable state of affairs indeed.

> They would mostly write a note that code and data are available upon request.

I have personally come across many cases where this promise could no longer be honoured by the time of the request. Publish the code.

> I can see the pros of publishing the code as it's obviously better for open science and it makes the manuscript more solid and easier for anyone trying to replicate the work.

It is also increasingly a requirement of funding bodies.

> But on the other hand it's substantially more work to clean and organize the code for publishing

Then don't. Release it under the CRAPL, stating as much. It is still better than nothing.

> it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

If you were an entrepreneur hoping to peddle snake oil and not get found out, then I would see your point. But you're a scientist, you're supposed to welcome such criticism and opportunities for improvement. If anything, you might even get collaborations / more publications on the basis of improving on that code.

> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

I would sincerely not feel very comfortable calling such people "scientists".

So it really depends upon what you want out of your research career. Part of being a successful researcher is making an impact on the community. This involves producing works that the community finds useful. I've always looked at making my code available as another avenue to help increase the impact of my work. In my case, many more people have used my public codes than have ever read my papers.

You have limited time. I'd prioritize that time on what you think others will find useful.

Don't worry about ugly code. There are research codes with 1k+ stars on GitHub that are ugly. They have so many stars because people find them useful.

You absolutely don't have to publish your code, or anything else for that matter. Don't let the drive for impact on the community force you into working on something you're not interested in.

Congrats on your publication.

>But on the other hand it's substantially more work to clean and organize the code for publishing

It's better than nothing, it also is the only way for others to reproduce your results. I am surprised you were not asked to do that by whatever journal you chose to publish your results.

>many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage

LOL, what!? What is this crap about "competitive advantage"? Are you privately funded? Then it's fine. If you're funded by public (i.e. government) money, you are (at least ethically) obliged to share your work with everybody.

Here's a good example. Fisher's iris flower data was released with his work in a 1936 paper. It was used as an example of his discriminant analysis. This data set has been used over and over to show examples of cluster analysis and segmentation. Many statistics teachers use it in their curriculum. You never know where the research could lead to growth and development in a field.


> Here's a good example. Fisher's iris flower data was released with his work in a 1936 paper. It was used as an example of his discriminant analysis. This data set has been used over and over to show examples of cluster analysis and segmentation. Many statistics teachers use it in their curriculum. You never know where the research could lead to growth and development in a field.

You raise a good tangential point:

Releasing a data set can be just as useful as releasing code, and every bit as necessary to reproducing results.

Moreover, reproducing a well-curated dataset can be just as prohibitive in terms of time and expense.

How many papers have reused datasets such as ImageNet, Celeb-A, etc. in recent years?

By all means, release your datasets even if you don't release your code.

It's sort of funny when the pro list includes "better for science" and there's still a need for a con list. There should be a scientific equivalent to the Hippocratic oath; a lot of us laypeople imagine that scientists default to "good for science" and "ease and possibility of replicability."

CS scientific journals should make the bar much higher in that regard: no code? no publish. unless really good excuse.

I mean this is just unrealistic. There are plenty of valid reasons not to want to publish code. For example, you want to commercialize the product. As long as you are given a description of the system you can always go and reimplement it yourself if you are not being lazy.
Unambiguously, yes. If possible, release it using some sort of open source license, and grab a DOI for the initial and any subsequent release of the code - you can use Zenodo or some other tool for this.

I left the academic world a few years ago, but several of the analysis codes/models I published (either as stand-alone tools or artifacts published alongside a journal article) still regularly get used... if anything, there's probably a larger user base for one of my models today than there ever has been, and it's leading to a long-tail of publications where my initial work is either cited or I'm offered co-authorship when I have time to offer hands-on support for improving the model/code and offering my insight as a domain expert.

If you can take the time to clean up some code or author a lightweight package, that's amazing! But it's a bang-for-your-buck type thing. If you ever aspire to leave academia, it's undoubtedly worth spending some time to clean up the code, add documentation, add some unit tests, etc - great artifacts to use in supporting a hiring process if you move into a technical role somewhere in industry. But it is far from necessary.

You're right, it is substantially more work to clean and organize the code for publishing. Being open about your work does make the attack surface much larger and more likely to be nitpicked, criticized, have an error found, etc.

But it is more honest. Whatever you think about the effort required to do this, there's value in honesty.

Here is an example of my own scientific work:

- paper [0]

- preprint [1]

- GitHub [2]

It certainly wasn't easy to get all of this done. But doing this can also be a guide for others. They get to see exactly what you've done so that they don't waste months on the exact implementation. They can see where maybe you've made some mistakes to avoid them. They can see so much of the implicit knowledge that is left out of your paper and learn from it. Your code isn't going to be perfect, but what paper is, either?

Everyone will be a critic, anyway, so make it easy to pick up criticism of the stuff you feel the least confident in and do better next time. You won't get better if no one sees your code.

[0]: https://cancerres.aacrjournals.org/content/81/23/5833

[1]: https://www.biorxiv.org/content/10.1101/2021.01.05.425333v2

[2]: https://github.com/LupienLab/3d-reorganization-prostate-canc...

Disclaimer: Not an academic, and whilst my undergrad thesis included code, it was so broken that when others saw it I had nothing to lose except my pride.

Personally, I would. Open source is a form of peer review, and if you're wanting to stand by your paper as peer-reviewable then I believe the code should be included in that. Generally speaking, I feel more researchers need to open up their code to peer review because research code tends to not have the same robustness against mistakes (through coding convention as well as tests) as professional software development. I shudder to think how many papers have flawed results that no one realises and are just accepted, because no one can spare the effort rebuilding the code from scratch and without any prior reference in order to verify said results.

I don't think you need to clean it up. You're not competing for a coding elegance competition, but rather allowing someone to find bugs if they exist and point it out, just as they would peer reviewing your paper.

More cynically, spaghetti code probably helps as a defense against people ripping off your code, so if you're worried about your competitive advantage then not cleaning it up is a form of security through obscurity :)

Can you ask scientists who are very experienced in your field and successful in the career track that you want to be?

Separate from that, is there fairly new chatter in your field about reproducible science, publishing code and data, etc.? If so, what's the current thinking there about how valuable this is to collective science, and how that should affect the sometimes unfortunate conflicts of interest between career and science?

In some sense the way you phrase your question shows how broken incentives in science are.

The obvious answer for science is: publish. The goal of science should be to make it easy for others to reproduce your work. Not to make it theoretically possible, but hard, because of the "competitive advantage".

The right thing to do would be to publish and next time you review another paper that does not publish code use that as a reason to reject it. The whole "code and data upon request" is obvious bullshit, there have been studies on it and often enough it ends up with "well, we don't have that code/data any more", "why do you need that? we won't help you plan to publish something we don't like" etc. pp.

Consider that "cleaning and organizing" your code means that it is no longer the code that actually produced the results in your paper!

The fact that your code is a mess means that it might be buggy; if other people can see your code, someone might find a bug in it. As you said, this is a good thing for open science, and makes your work easier to reproduce.

Disclaimer: I'm not an academic. I cannot possibly speak to the possible benefits and implications of this from an academic point of view. Like there might only be downside to doing this. I don't know and don't pretend to know.

As an outsider looking in, many academic fields seem to have a reproducibility crisis. Many psychological studies, for example, cannot be reproduced yet they continue to be cited.

I personally feel like every academic paper should be reproducible. I should be able to email you the study and you should get the same results. Obviously clinical trials may vary (and thus the importance of statistical significance) but the real problem is data and models. If I, as someone reading your study, don't have your data, how can it possibly be reproduced? If I gather my own data will I get completely different results? If I'm solely relying on what details you give, how do I know you haven't made a fatal assumption or even just buggy code with your model?

I personally feel like a condition of all Federal funding should be that the data and any code should be made freely available.

So I support the idea of releasing it and that releasing something messy is better than releasing nothing but I can't speak to your individual circumstances.

Though you don't mention this particular issue, it often comes up, and as someone who used to work as a DoD research scientist, I will say this: I think academics are largely under the impression that they should be worried about people "taking their idea" and building something amazing with it without compensating you in some way. In reality, it is vanishingly rare that a published paper gets used for anything, by anyone, and it is even rarer by an additional order of magnitude that someone successfully tries to use something without consulting the author and/or trying to bring them along. You are the expert on the thing you have made, so if someone sees massive potential in it, they will likely bring you along. Publishing some quick and dirty research code that is able to reproduce the results of the paper can only help you in the long run.

If you want real protection of course you can always try to get a patent, but then I've got you because 90% of the people I have this conversation with are worried about people stealing their idea but don't think it is patent-worthy.

A similar analogue exists in startups: ideas are really a dime a dozen. Execution is what matters. There are millions of great startup ideas floating around -- I bet almost anyone could come up with at least a few that are viable -- but actually having the follow-through and dedication to execute that idea, that is what is challenging. I can't tell you how many people I've had calls with where the exchange is basically "I want your thoughts on this amazing idea but you have to sign an NDA first". 90% of the time these people aren't willing to go all-in on their idea and stake their career on it (hence them seeking second opinions), so it makes no sense for them to worry about me "stealing" their half-baked, unrealized idea. I say to them "would you take $3M in interest-free debt to develop this idea right now" and they say "no!" to which I say "then why should I sign an NDA?"

Be prepared for a metric crapton of crushing silence when you release your code. But do release your code.
I published some of my academic code, like a tool for simulating superconducting circuits [1] or a tool to manage lab instruments for quantum computing (or other) experiments [2]. It's super niche but both tools have found users in other labs that even keep developing them (at least for [2]). And it's nice to look at your code after 10 years and realize how much you've grown as a programmer :)

[1]: https://github.com/adewes/superconductor [2]: https://github.com/adewes/pyview https://github.com/adewes/python-qubit-setup

Publishing the code is great, as some of the questions a reader of your paper may have can only be answered by looking into the source code, as no paper has enough space to talk about all implementation details in a real-life complex system.

There is value in scrutinizing the code - not w.r.t. coding styles or standards but to discover bugs in the implementation, which are very common. Scientists are only human, and scientific software is less often checked by a second pair of eyes. There is also value in trying to replicate a study from scratch with a fresh implementation only from the details in the paper. Many conferences, for instance the European Conference on Information Retrieval (ECIR), Europe's largest scientific search technology conference, have a replication track only for replication papers, and these are often the most interesting/insightful papers. It occasionally happens that a result is not caused by what the authors think, but is merely an artifact of the implementation code. A very famous MIT researcher (not naming him or her here on purpose) fell into this trap in their Ph.D. thesis, but it can happen to anyone, really. Scientific results become objective knowledge as others solidify the body of knowledge by carrying out replications and arriving at the same results.

Whatever your decision about past code, going forward, if you plan to release all future research code, you will likely write better code in the first place, as you will constantly be aware that people will be looking at it, and that can only be a good thing.

You've gotten a ton of feedback already, but: Please do! Don't try to make it perfect. Just publish it. As the saying goes: "The perfect is the enemy of the good." Release the code you used to get YOUR results. You can always improve it later if it turns out people end up really interested in it.

(FWIW, I'm a professor at an R1 university. I give this advice to all of my Ph.D. students and strongly, strongly encourage them to put their code out there on our github.)

> it will increase the surface for nitpicking and criticism

This is unfortunately true. In one of my articles I linked to my GitHub repo where I had implemented the algorithm in C. One of my reviewers complained that I had used C instead of C++. Probably advisable to not publish code before peer review.

I assume computer science results without published artifacts to be fake. When it's so easy to publish and run code, if the researcher can't even do that then I assume the code does not work and thus the results must be fabricated. If your work has trade secrets or something, research publication with peer review is the wrong way to distribute your results.

Computer engineering on novel systems is a bit harder, but a /complete/ spec of the system (enough for someone to precisely rebuild it) should be published in that case. Remote access on request to the prototype would be better.

(A) Publish your code as is, so the code is the actual code used in the paper.

> But on the other hand it's substantially more work to clean and organize the code for publishing

(B) Don't spend time cleaning code for publishing. Spend your time writing more papers.

> it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

(C) Don't worry about this.

> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

(D) If you do B, it will also reduce your worries about this. I am half joking.

My take on this is that some code is 10X better than no code.

There have been times when I've had to abandon incorporating an idea presented in a research paper because the paper doesn't have enough information for me to implement it in code. I could've made a lot of progress with some proof of concept code, even if it wasn't clean.

The economic incentive of science is for your work to be replicated and cited. Not publishing the code and data means your work is harder to reuse for subsequent studies and will hurt citations.

If it's uncommon to release code then I'd doubt anyone in the peer review will review it.

As someone who has had to reproduce others' research results: it is much much better to release the unclean, unorganized code that actually produced your results than it is to release nothing. Even if it doesn't run (e.g. it depends on a hardware system that the user won't have access to) it's still better for people to be able to read your code and understand some tricky part that isn't fully explained in your paper.
I am not a researcher in the sense that I'm not publishing papers but I'm a consumer of research. Every day I can find the source code for the paper is a great day. Even if it's some language I don't use I still have something to go off of. Often it's easier to read some code to understand the method than to read the paper itself. I'm used to reading code. I do it almost every day and I'm relatively proficient at it. I'm not very good at untangling academic language or having to read 30 years' worth of papers to get all the assumptions made in a paper.

As an example, I've found a paper that promises a method to do the very thing I want to accomplish. It's not too dense but it skips a few crucial moments and I've been working on coding the method for a year now (on and off, of course, but still for a long time). If the code was available it probably wouldn't take as long. The paper didn't mention that the code was available upon request but it was implemented in a piece of software. I've found it eventually but it was a version just before the feature I'm after was added. I tracked the author and they were a great sport about cold emails but didn't have the source any more.

So yes, please publish the code. You don't have to clean it up. It worked for the paper — it's good enough. Even the most terrible code is immeasurably better than no code.

In undergrad I was so grateful whenever a CS paper came with code. It helped me learn and comprehend so much better and I always wanted to thank the person who did it (sometimes I even did if their email was there :)).

You might be doing a young student a solid :D And don’t worry about cleaning it up!

If you use GitHub you could even disable Issues and have a note saying you don’t accept pull requests (in case you’re worried about support burden).

OP, you shouldn't worry about the state of your code. There could be criticism, but I don't think there's anything that's public and not criticized. A horrible thing that's open source is much better than something that's not. The only real thing to consider here is the type of the license, and weighing the competitive advantage you're talking about. With the license, sites like this[0] can help.

[0] https://choosealicense.com/

Why not? It's a loss to humanity's progress if all researchers make it difficult to find the code and data.
I would agree with many others here who say publish it. In some fields there is an additional question of where to host it, lest your paper's impact outlast the lifetime of your current GitHub repo or whatever. There are good solutions out there. Assuming you are at a university it's worth having a chat with a librarian.
Yes, the university library is where I would try to find a publishing solution that could remain after I am dead and all my online accounts expire.

GitHub is not a viable solution; Microsoft can not be trusted to keep these important cultural artifacts safe and accessible in the near future.

In my experience, typically researchers only publish the relevant and/or core algorithms of their research. If you would like, you can always publish the code to Github (if it isn't already), and reference it in your paper.

If it is too much work to refactor the code for publishing, you can also just publish pseudocode.

I don't think anyone will nitpick or criticize coding style or things like that unless it is particularly egregious (ie naming variables something vulgar etc). The point of research papers is to communicate new and valuable findings. If people in this conference or journal are nitpicking things like that, you may want to find a different place to submit your work.

I don't know what your field is, but in Computer Science I can't say I have ever known people to consider their code a competitive advantage. The only time they might shy from releasing code is when they think they can commercialize it or something.

Let me say that if you do decide to release it it's not just scientists and academics who can stand to benefit. Chances are your paper is less approachable to those outside academia and your code would be easier to understand for an engineer. I would honestly encourage all researchers to publish their code on that basis. You don't have to clean it up or write any scripts to help build it. Just attach what you have and I second the idea to use the CRAPL license!
One argument against publishing code is that maybe there's an error (or more) in your code, which validates your possibly mistaken theory, and forcing an independent reimplementation by others would uncover this problem.
That's a good point I'd not considered. I suppose it's ultimately a risk scientists/researchers take to not lead humanity down a broken path; however, I'm of the firm belief the truth will "out" and transparency almost always trumps opaqueness.
The fact that there may be an error is an argument in favor of publishing the code.
Publishing the code can increase the number of people tinkering with it, and possibly debugging it, it's true. But people just going with it, starting to use it without looking into the details, and blindly trusting the author (i.e. being lazy) sounds pretty realistic, too.
The trend is in the direction of requiring open code and data. There's been a big movement in that direction in economics, and most fields will likely also move that way, so it's more a question of whether you should do it now or in the future.

For the journal I edit, authors are required to include the code and data with the submission. The code and data are available along with the paper if it's published. We do replication audits of some papers to make sure you can take the materials they've included and reproduce every result in the paper. If not, the conditional acceptance changes to rejection. I've had cases where reviewers found errors in the code, so I rejected the paper.

On the argument that it's a competitive advantage: what does that mean? You should be able to claim results but not show where they came from? That's not science.

Keep in mind that this is a "source available" requirement, not an open source requirement. It is a matter of transparency. You have to let others see exactly what you did.

My co-authors and I argued here https://www.nature.com/articles/nature10836?proof=t%2Btarget... for open computer code in science.

"Scientific communication relies on evidence that cannot be entirely included in publications, but the rise of computational science has added a new layer of inaccessibility. Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail."

Publish it on GitHub (or GitLab or your code hosting service of choice).

Then answer any criticism about it by asking for a PR.

To preempt code style complaints find a code formatter for your language and run everything through that first.

Refer to the repository in your paper, but don't put a link. Create a little bit of friction to get to the repo to discourage the casual readers who don't really need the code from popping over too easily.

I think you probably already have the answers you were seeking (yes, ideally you should) but I'd like to add some points:

Ideally, it would be nice if the code has a professional-level quality to it, but I think everyone involved in evaluating research understands that it is at best a prototype. Proper software engineering is expensive, and it is not the role of research to do this. The process, as it was explained to me: university research pushes the state of the art, industrial research labs are slightly behind this and looking to transfer into practical uses (along with this some government agencies are interested in tech transfer) and finally software engineering takes these ideas and turns them into actual products. You aren't making a product, so it is OK for the code not to be perfect (also, from experience, 'professional' industry code is not always that great either). The main point is that someone has some chance of reproducing your results.

The exception to this is if you are making a product, where the definition of product is a tool for further research. Examples might be tools for symbolic execution or formal verification, in which case it might be worth some time to make the experience of using it good for that benefit, to reduce friction so that people try and want to use your tool.

Artefact evaluation is rapidly becoming something people are encouraged to do and helps enormously in verifying results, but the point is usually to try to reproduce the results of the paper to back up the science, not to start an argument over coding style. I would hope that artefact evaluation processes make this clear and ensure that evaluations of artefacts focus on reproducibility. For outside comments that might arise, I suggest you publish the work as open source and respond to any criticisms with a fairly standard line: yes this is research quality code and we would like to have time to improve it. If you would like to submit a patch/pull request we would welcome any help.

Source Code or it Didn't Happen.

Science that is not reproducible is not science.

If you can, publish something high-level. Matlab or Python or Julia is fine. C or Java, not so much, because the build environment will not be available any longer after a few years. Actually, if you can, publish several translations.

And don't forget to publish your data sets as well. And your data augmentation or whatever. Everything you need to reproduce your results.

And for the love of Knuth, DO NOT OPTIMIZE YOUR CODE. Dumb code is good code in science. You would not believe what kinds of havoc some algorithms wreaked on my systems in the name of optimization. Optimizations that made a ten-year-old algorithm run in two nanoseconds instead of four (vastly exaggerated). Optimizations that obfuscated otherwise perfectly reasonable algorithms.

The goal is reproducibility.

There are a lot of difficult questions posed here on HN but this is not one of them: unequivocally, you should publish the code.

It is better for science, it will be better for you and it will be better for people who want to play with your code.

Publishing is a form of advertising what you did, and helping others reproduce it makes it go viral and is a testament to how much they care. It can only help your career.

You’ll definitely get people who nitpick the code. This won’t hurt and it may even help in its own way.

Unless you are publishing a software methods paper, you don’t have to worry about cleaning the code or making it portable. In my field, publishing code (and data) is a requirement and has been for years. That doesn’t mean that the code needs to be pretty ( it usually isn’t), it just needs to support the paper.

So, yes. Please publish the code, it will make the rest of the paper stronger.

At the end of the day the impact and perceived quality of your research correlates to how peer reviewed it is, and how reproducible it is. Everything necessary to reproduce your research should be published, including the code. However, if you publish cleaned up versions of your code, that isn't the code you used to do your research.
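One way to show that the published code is the code behind the paper is to record a checksum of every source file at the time the results were produced. Here is a minimal sketch in Python, assuming the scripts live in one directory; the function names `sha256_of_file` and `build_manifest` are made up for illustration and are not from any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, pattern="*.py"):
    """Map each matching file under `root` (relative path) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob(pattern))
        if p.is_file()
    }
```

Saving the resulting dictionary next to the results (for example as JSON) lets a reader check later that the released files are byte-for-byte the ones that were run. A tagged commit in version control serves the same purpose if you already use one.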

I suggest publishing the code as is on something such as GitHub, GitLab, etc. I suspect you have ideas on how you can improve the code, perhaps there's even a way of improving your research methodology by doing so, enabling new insights with further research. If you did a follow up experiment with improved analysis enabled by your improved code, then that's another paper, and another (more cleaned up) version of the code to push to the repository.

The above is all supposition though, as I don't know your field. If deep learning then the above seems more likely. If your field is geology, then improvements in the software might not enable better insights.

I'm currently a student getting a master's in computer science. In my experience, having the code available in research papers is rare, but useful. Many times I find myself struggling to understand how something can be implemented or, when presented with choices, choosing one when reading research papers. When the paper has the code published I am able to follow it better.

Some papers link to the code instead of including it. Maybe I'm just unlucky, but this usually leads to dead links (but that's a different topic altogether).

Besides all the pros already mentioned, there is a high chance other researchers will use your code and cite you. If they have to compare their results with some previous state of the art, it will be the one with available code. The whole thing about “the paper is enough to reproduce” never happens, ever.
The code should be published, and knowing this, researchers will hopefully try to avoid certain commonly harmful practices. One of these is re-using the same script to run slightly different models by editing some of the hard-coded parameters. I've myself found more than one mistake in someone else's reported results due to this sort of thing. But identifying it was quite a bit of trouble because the record of what was run was erased when they moved on to the next model.
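One way to keep a record of what was run, instead of overwriting hard-coded parameters, is to write each run's settings to its own file. This is only a minimal sketch: the function name `record_run`, the `runs` directory, and the parameter names are hypothetical, not something from the thread.

```python
import hashlib
import json
import time
from pathlib import Path


def record_run(params: dict, log_dir: str = "runs") -> Path:
    """Write the parameters of one model run to its own JSON file.

    Hypothetical helper: the file name is derived from a hash of the
    parameters, so editing a parameter produces a new record instead of
    overwriting the old one.
    """
    payload = json.dumps(params, sort_keys=True)
    run_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    record = {
        "run_id": run_id,
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
    }
    out_path = out_dir / f"run_{run_id}.json"
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out_path


if __name__ == "__main__":
    # Illustrative parameters only.
    print(record_run({"model": "baseline", "learning_rate": 0.01}))
```

Calling it once per model variant leaves one small file per configuration, which makes it easier to check later which settings produced a reported number.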

What I would not expect from people is code that would necessarily run in your environment. For example, in many cases, the paths are going to be hard-coded, for a variety of reasons. It might be ideal to write code that will just work, in a reproducible environment, but that often takes more work than people are willing to commit to, given all the other things they have to do.
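If you do want to soften the hard-coded path problem without much effort, one cheap option is to read the data location from an environment variable and fall back to a default. This is a sketch under assumptions: the variable name `PAPER_DATA_DIR` and the function `data_dir` are made up for illustration.

```python
import os
from pathlib import Path


def data_dir(default: str = "data") -> Path:
    """Return the data directory for the analysis scripts.

    Uses the (hypothetical) PAPER_DATA_DIR environment variable when it is
    set, otherwise falls back to `default`.
    """
    return Path(os.environ.get("PAPER_DATA_DIR", default)).expanduser()


if __name__ == "__main__":
    # "measurements.csv" is an illustrative file name.
    print(data_dir() / "measurements.csv")
```

A reader can then point the scripts at their own copy of the data without editing the code.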

Finally, cleaning up your code for presentation is a final opportunity for you to discover any mistakes before you publish and then later have an embarrassing public retraction.

I'd only clean it of stuff like passwords and such, and add a header that the code is provided as-is.
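A rough way to look for leftover credentials before publishing is to scan the source tree for lines that look like assignments to password-like names. This is a heuristic sketch, not a substitute for a dedicated secret scanner; the pattern, the file suffixes, and the function name `find_possible_secrets` are assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Heuristic only: matches things like `password = "..."` or `api_key: ...`.
SECRET_PATTERN = re.compile(
    r"(password|passwd|secret|api[_-]?key|token)\s*[:=]\s*\S+",
    re.IGNORECASE,
)


def find_possible_secrets(root: str, suffixes=(".py", ".sh", ".cfg", ".ini", ".txt")):
    """Return (path, line number, line) for every suspicious line under `root`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SECRET_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in find_possible_secrets("."):
        print(f"{path}:{lineno}: {line}")
```

Note that this only checks the current files. Anything committed earlier stays in the version control history, so it is worth checking that too before making a repository public.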

You could add a disclaimer that the code was worked on until it provided a satisfactory result, and no further, and is not intended for (any) use. You might even add that, except for outright, actual errors that affect the result of the research, comments are discouraged.

I often publish very bad code, terrible terrible spaghetti, it's not how I write code at my job, because at my job, I'm paid to produce not only working and correct code, but also code that is maintainable and understandable and follows certain practices.

However, my hobby is not writing corporate code, but writing code that gets done what I want to get done, nothing more, and sometimes less. It might even have actual bugs in it that I can plainly see and don't care about because they don't affect my uses.

If people can't tell the difference, I don't care, not my problem. If a future employer can't tell the difference, I won't work with them.

I absolutely LOVE research that has code released with it. Just because then I can quickly explore the code and play around + tinker with it.

Like others have said, research code isn't meant to be production quality code so I wouldn't worry about "quality" in that way.

FWIW, this is how I've released the crappy barely-working "academic quality" code for a paper in the past:


The main points are that I made only a minimal attempt to organize it, and I made the state of the code clear in the README. I don't recall anyone complaining about the code or even mentioning it during review. (Though to be fair, I also don't recall whether I published the code before or after the paper was accepted.)

Looking at things from the other side, I am at least an order of magnitude more likely to read, use the work/methods from, and therefore cite a paper that comes with code.

I highly recommend adding it. It doesn't have to be exposed, but is super useful for anyone who will want to reproduce or build upon your work later.

You can embed this in the PDF, e.g. see section A.1 [1] for how.

[1]: https://raw.githubusercontent.com/motiejus/wm/main/mj-msc-fu...

After finding a mistake in a paper, having to fix it, and then publishing my code, I’ve found other people contact me for the fix rather than the author of the paper. I would recommend publishing the code rather than assuming your paper is bug free and complete.

Similarly, I’ve found papers that don’t include their complete data set in the paper, and had to try to reverse engineer it from images and so on. It is really frustrating when papers are incomplete.

Depends on the climate of the field you're in, and where you're at in your career. There are fields where entire research groups routinely harvest preliminary ideas from graduate student publications, and then finish them and rush to publication before the student realizes what's happened.

I'd say, grad student owes nobody anything until they finish, because they're bearing the greatest risk of losing priority, and the openness of science is being used against them. Nothing lost by waiting until they have their degree in the bag before sharing. Then clean it up and use it as part of your portfolio. Or append it to your thesis. Advancing science after you've secured your career is a fair compromise.

I love open source and open science, but also look back on my own graduate studies, and I chose a topic that was protected by virtue of a large capital investment plus domain knowledge that was not represented by code. Also, my thesis predates widespread use of the Internet. ;-)

> There are fields where entire research groups routinely harvest preliminary ideas from graduate student publications, and then finish them and rush to publication before the student realizes what's happened.

Can you provide a source, or example of this? What does the Amazon of academia look like?

Biology, and synthetic chemistry. Unfortunately all anecdotal. I live near a major research university, have lots of friends who are involved at all levels, and relatives who are even closer to it. It tends to be in areas that require minimal capital investment to pivot into a new study. Also, the student pursuing the original idea is hampered by their own emerging skills. "My student's thesis just got scooped" is something that every professor has experienced or knows about.

My field, physics, much harder. Building my experiment required a bunch of expensive equipment (maybe half a million in today's dollars), gear that I built myself, the technique of operating it, and so forth.

My career, much harder. I work in business. You learn about my ideas when a patent comes out. ;-)

I can attest to this. I myself am victim of this. My undergraduate thesis was plagiarized by two other papers. Code was 80% the same, they just added some trivial things. No citing of my work at all.

Look at my other comment for more explanation - if you are working under a less-known advisor, or at a less-known university, there is a high chance that this will happen if your work is good.

Write to the journal they published in and call them out.
This kind of thing is best done after the thesis is in the bag. A student is racing against the clock. Grad study has many kinds of hard failures. At the most extreme, your advisor could up and die. The focus has to be on finishing. That's how you get out.
Matter of fact, I did, even with the help of my advisor. The journal did not take any action (it is a Q1 open access journal); they came up with all kinds of mental gymnastics for why it was not copied (which boiled down to it NOT being 100% the same).

That was for the first occurrence. For the 2nd one, we just did not bother because it hurts my advisor's reputation as well. It is not in the interest of the journal to admit the mistake once they made it -- they will fight you about it and try to keep their reputation/image up.

As always I may be wrong, but the (admittedly very few) times I find an article/paper based or revolving around code that is interesting/useful for some purposes I read the "code is available on request" (or similar) as the (in-) famous Fermat's Last theorem note: Hanc marginis exiguitas non caperet.

Nowadays margins are large enough and cost nothing or next to nothing, and you don't probably have any other use of your code, so what would be the advantage for you in not publishing it?

What kind of competitive advantage does it give to you? (what many scientists think might not be as relevant as what you think about this "competitive advantage" specifically in your case/field)

About "cleaning it", why?

I mean, if as-is it works (but it is "ugly") it still works, what if in the process of "cleaning it" you manage to introduce a bug of some kind?

Unless you plan to also re-test it after the cleaning, I guess it would be better to not clean it at all.

> What kind of competitive advantage does it give to you?

For every paper introducing the revolutionary Algorithm X, there are a bunch of follow-up papers like "Algorithm X applied to self-driving cars", "Algorithm X applied to smartphones", "Algorithm X with some tweaks that provide marginal improvements", "Algorithm X but using consumer-grade hardware" and so on.

If every other lab has to spend several months to replicate your first paper, you and your colleagues can spam out the follow-up papers before anyone else can catch up. This makes your publication count go up.

Other means for achieving similar effects include delaying the publication of your code, or releasing undocumented spaghetti-code with missing dependencies and entirely comprised of one-character variable names.

Of course, this stuff comes at a cost: Making it harder for people to use your work makes them less likely to use your work. So it might be better for your citation count to release the code - and in any case, who goes into research hoping their ideas will be ignored?

NDA requires you to share data several months after reporting it. But in many cases, data collection has not even completed by then. Theoretically someone could scoop you by analyzing your data before all data collection is completed (e.g. N=100, instead of N=120). I'd think that would be career suicide if it were found out, but the risk of it happening doesn't exactly provide much of an incentive to make it any easier on them.
> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

This is probably wrong, depending on the field. At least in machine learning, the papers that get cited the most are those that other people can easily pick up and work on. They become the basis for future work, get cited as baselines more often, etc. Publishing research ML code is a competitive advantage.

I've posted a huge amount of academic code (I've linked to a small number at the end). I think you should, but it won't help advance your career immediately. However, I still think it's better for science.

What is useful is if you can produce code people can build on and do their own cool stuff with -- then they will cite you. However, getting something to a state where it is tested for all reasonable inputs, has some basic docs, etc. is a hard undertaking.

https://github.com/minion/minion (C++ constraint solver)

https://github.com/stacs-cp/demystify (Python puzzle solver)

https://github.com/peal/vole (Rust group theory solver)

Thanks, agreed. Small note: it is not clear what Minion is doing, from just visiting the github repo. Perhaps add "C++ constraint solver" in the github description, but it is still unclear: it could be a rigid body constraint solver for games? Maybe add a link to a paper?
Yes, I should practice making things more accessible :)

In practice Minion is generally used as a backend to Conjure ( https://conjure.readthedocs.io/en/latest/ ), which provides a much nicer input language.

Thanks, I was not familiar with Conjure and general Constraint Programming. I haven't seen it in real-time applications for games or robotics (usually highly optimized domain specific constraint solvers are used there, for rigid body, fluid sim, cloth, deformables etc)
First, you should be proud of yourself for striving to do "the right thing".

In the field I follow the most (Computer Graphics/Rendering) I think there is a big problem with reproducibility as well, and to be honest, I think some of the major players actually have little interest in making this significantly better, since they can take advantage of the visibility of a flashy render/fps counter shown at an event while still building a "moat" between them and others that want to adopt the same methods.

Which is in the end partly an answer to your question: your paper could clearly describe all the elements needed to implement a method correctly, but by providing a sample implementation you allow others to "stand" on your shoulders, as they say, instead of having to climb there first and then proceed. You can not worry too much about the state of your codebase by making clear via README/documentation/license that it's still in "proof of concept" phase.

One reasonable observation I have heard is that in some fields, during peer review, some reviewers seem to like to nitpick on the code rather than the paper, sometimes in subtle ways. Because of that, I think it can be (unfortunately) OK to release the code after acceptance or publication. But apart from this, I see only advantages.

I'm a scientist too, in computational chemistry. To me, releasing the research code that accompanies a paper is an imperative. Increasingly, journals or individual (peer) reviewers demand it. It's essential for reproducibility. I consider the work that goes into making the code releasable simply part of the job.
If you are for Open science (https://en.wikipedia.org/wiki/Open_science_data), go ahead and publish it. Would you ever publish the code on some Git platform? If you would, this would be the equivalent. A lot of researchers don't want to give their data to the public, but by locking up their data they are just making it harder for others to confirm or improve their findings. I guess sometimes there are legal issues behind that, and sometimes it is pure ego.
What benefit would you receive if you publish your code? Will that give you some privilege or earn you more money and/or more reputation?

If the answer to the above is no, and it will mostly cost you time and effort, then don't publish.

If the answer to the above is yes, then consider the return on investment for publishing your code. If you earn more reputation/money/whatever by publishing than you expend on the work of publishing, then publish; if not, then don't.

>> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

That "competitive advantage" is just holding everyone back, slowing progress. This is particularly annoying to hear coming from "research" which I thought was supposed to be advancing the state of the art for the benefit of society. That's ostensibly the reason for publishing papers right, to disseminate knowledge? Or is it really just to increase one's ego and get paid?

Not saying you should publish code, just that deliberately keeping secrets in your field seems to go against what I thought you were doing.

Agreed, perhaps there is some competition in citations for follow up work, releasing source code makes it easier to get 'scooped' on your future plans section? (not saying that optimizing for citations is a good thing)
> it's substantially more work to clean and organize the code for publishing, it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

It should take a couple of hours. The code works? You know how to reproduce what you did, right? It doesn't need to be perfect. It doesn't even need to pass code review. It should just work.

> many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

Well depends on the field I guess, but you also want recognition and impact. What is the point of publishing a result no one uses?

What is the purpose of doing research?

If the purpose is to push human knowledge forward, then it seems backwards not to publish everything.

Personally, I've found it difficult in my various careers to date when I've been put in positions where the actions that serve my immediate interests are in any way in conflict with my underlying principles or overarching goals. It's demotivating and deflating.

If I were in your position, I would publish everything and let myself feel pride in what I did. Even if we're all just insignificant specks in the grand scheme of things, pursuing a greater purpose can help make it feel like something matters.

Yes, you should absolutely publish it. I wrote a paper about modelling radio wave propagation through the ionosphere, and all my code for it is on my GitHub. The reason you should is simple: you are providing proof that your numbers aren't just made up.
If your field is not embracing open source yet, you should go for it ASAP. I believe in the end the field will recognize the benefits and move towards that and the sooner you are the larger impact you will make.
> The paper itself is enough to reproduce all the results.

Every researcher thinks this, and it's always wrong. If you care about scientific progress, publish the code and data.

Besides, available code should cause more people to look at your work and ultimately cite it.

Your code should have exactly the same license and distribution as your paper. Anyone who tells you different is simply wrong.

If you published a paper that uses information from the code then yes you absolutely must publish your code. Otherwise you're contributing to the decline of science via the opaqueness of papers and irreproducibility problem.

Publish the code or attach it as a listing, so that in 10, 15 years someone who finds your paper can find the code, too. When everything is "hot" and "live" it can be easy to reach out and get something, but when you're digging through papers and code that have been abandoned for decades, it's nice to find source.
“Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.”

While I appreciate this is true, it’s also quite sad. Science shouldn’t be a competitive sport to increase a couple metrics like publications and citations such that useful parts of replicating and extending studies aren’t shared. :(

> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

One of my most cited papers is a relatively uninteresting one we wrote for a conference competition. But we have code so it is easy to compare your alternative approach to us. That means citations.

So it can work for your benefit as well.

Yes, absolutely, and don't worry about how the code looks. So long as someone else can download your code and run it without issue, you're good to go. I've worked on multiple computational neuroscience papers and pushed to have the code published alongside each paper in every case. Not once has it come back to bite us, and if anything, it seems to get us significantly more citations.

Do it. There's no good reason not to.

Genome Research made me publish the code used for the data analysis, requiring a zip of the repo for archiving.

The thing is that I was required to provide a way to reproduce, so code obfuscated and/or uncommented were not a problem. I provided clean code anyway.

Many journals now require relevant code to be published. Those journals that don't are likely to be lower impact journals, but also are probably moving towards requiring the relevant code to be published. The reviewers are likely to complain about the code not being available, so you can defeat one review hurdle by publishing it. It's generally better for science if you publish it.
My 2 cents: you should publish the code as you used it in your research, so that it's possible to review your code. If there is a bug in your code, that could impact your results, and that problem would be much harder to find/reproduce without your source code.
Given the enormous amount of papers that come out, I personally tend to read papers that come with code (and data) first.

For me, it shows the authors are confident yet also open to critique. Which is a wonderful thing.

Secondly, I usually need the code to really understand the paper.

If Public money paid for you developing it, make the source Public under a liberal Open Source License.
Yes, you should publish it. Don't bother cleaning it up if you don't feel like it. No one will judge you for the code quality.

Published terrible code is far better than unpublished code.

Yes do it! I did it and you'll not receive criticism, that's just anxiety talking. Be clear about what the code is, why it exists, and that you may not maintain it as it is just a proof implementation. Most good humans understand that all researchers are not the-one-and-perfect coder. The bad ones are too busy arguing with others to even notice.
Was the research funded with public money? If so, then the public interest would be a reason to publish the research code.
In the past I've chosen to publish key algorithms. Publishing your entire code base can become a substantial demand for your support. As an open source project supported by one person, that can be very demanding.

So identify what's most critical or novel about your work and publish that.

Publish it.

Put a huge note in the readme that this is research code and only licensed for non commercial use.

Put a note on your personal homepage that you're available to hire as a research consultant for $1000 per day.

Companies who like your research will put 1+1 together. A friend of mine got hired straight out of university at a very competitive salary with this approach.

My advice is to leave industry standard formatting and style arguments to engineers.

If people want great code that runs easily and is easy to read, that's engineering work, built off the back of novel implementations.

If people want novel implementations that are likely rough around the corners and require a bit of finagling to run, leave that to the scientists.

If you publish your paper with code, you'll get more citations I would assume. When I look at research papers, one of the first things I look at is code and/or data availability. It would be even better if it's easy to run though and that's definitely not always the case.
Publish it, if it's interesting people will clean and improve it. It's the beauty of open source.
Publishing the code does have some selfish benefits too: better chance of people building on your research (and citing it).
It is at least as likely that they'll take your code, integrate it into their own research, and never mention you at any point in the process. So you have to be OK with that.
Yes, please, the state of affairs currently is that it's impossible to get code, data, and pretty much anything besides the actual paper.

To me, at least, that sends a signal of people hiding stuff. That's not good. It made me distrust some papers in the past. I tried to reach out with no success.

You'll probably get more references if you have code, which will probably help your research career.
An excellent paper on this issue here: https://aclanthology.org/J08-3010.pdf

Agree with other comments on CRAPL, but you should release it.

Honestly, put your code out, and version control it. Benefits:

- People who use your work will cite you.

- You may get collaborators.

- It's an easy-to-get-to backup

- For non-academic jobs, it's part of your resume

Release the code as-is. It's alright if it's not clean and organized, research code is usually crappy code (no offense given).

Worst case scenario, it will end up in a star-less github repo that nobody reads.

Publish the code. At worst no one will look at it, at best you will draw more attention to your work and maybe get some good tips.
Yes, you should publish. Mainly because it will give you a sense of accomplishing something, also, nobody cares really :). Especially about old code, if they do that's even better.
I don't know your field, but personally when I read a paper the code makes things 100x clearer and resolves my questions. Are you afraid people will use your code?
I don't think you have to do any work at all on it if you don't want to. Just release it and let people fork it, let it grow.
How do you even know it works if you haven't been able to create a working implementation as the author?
Publish the code.

If someone has comments about style ask them to improve it for you.

Worry about maintaining things after someone asks for maintenance, the vast majority of code is never read again.
