Publish your computer code: it is good enough

by Kelle on October 15, 2010

Publish your computer code: it is good enough | Nature News
A compelling call for cultural change regarding releasing the scripts used to arrive at published results. What would be the best tool for us (astronomers) to use as a repository for static code? Ideally, the tool should allow the code, no matter how unpolished and undocumented it might be, to be tagged with relevant keywords and associated with an ADS entry. Like VizieR, but for code.

That the code is a little raw is one of the main reasons scientists give for not sharing it with others. Yet, software in all trades is written to be good enough for the job intended. So if your code is good enough to do the job, then it is good enough to release — and releasing it will help your research and your field…
But the most important change must come in the attitude of scientists. If you are still hesitant about releasing your code, then ask yourself this question: does it perform the algorithm you describe in your paper? If it does, your audience will accept it, and maybe feel happier with its own efforts to write programs.

Thoughts?
(via @augustmuench.)

17 comments

1 Steve October 15, 2010 at 6:57 pm

I’m of two opinions:

1) All code should be published. It is part of the scientific process and should be tested and reviewed.

2) I’m scared witless about anyone misusing my code.

I think everyone has written code that ‘only’ works in very specific circumstances. That code isn’t likely to work if someone else were to blindly use it on their data. It takes a significant amount of work to produce ‘commercial’ code which is well tested (and documented) and able to return reliable results over a range of situations. However, I’m not sure where the balance is, or how to properly credit scientists for releasing code.

2 Adam Ginsburg October 16, 2010 at 5:58 am

I release all of my codes that can be applied to more than one task; those that are specific to one project and are essentially scripts I leave hidden because they’re too ugly and obfuscated for anyone else to use. Releasing code is definitely something most scientists should do in general, just as most data published in papers should also be released in .fits formats along with the paper.

However, publishing my code has led to a lot of feedback. This has definitely been positive in that it alerted me to errors in my code and improved its quality, but it takes a lot of time to debug very small things. When your value is measured by your publication rate, that can amount to “lost time”, even though it may be valuable to yourself and many others in the same and related fields.

Misuse of code… yeah, that’s scary. Not sure what to do about it, but you can only take so much responsibility for what others do with open-source materials.

3 Kelle October 16, 2010 at 12:40 pm

I think the idea is not to publish the code so that other people could just download and use it, but rather to allow people to take a look at what you did and maybe steal a few parts of it. This is a whole different idea than the “useful scripts” that we’re already sharing with each other.

Christopher on FB said,

I have tried to keep my code published and maintained on my wiki, but the difficulty is that as I have moved job to job, the url in the article is increasingly out of date.

Addressing this problem is one of our primary motivations for starting the AstroBetter wiki. Feel free to start your own wiki page with your own bits of whatever. We’ll figure out a way to organize it eventually….probably through the use of tags.

But that raises the larger question of having a proper code repository. Is there really not an already available service that we could use? All of the ones I’m aware of are targeted at development and version control…which I think is overkill for the task at hand.

Also, John Gizis mentioned on FB,

The Spitzer Users Committee, which I’m on, recommended that Spitzer release all their source code at the end of the mission and I understand they will do it. Note that it will not compile, but it would document what was done.

4 anon October 16, 2010 at 3:03 pm

“I think the idea is not to publish the code so that other people could just download and use it, but rather to allow people to take a look at what you did and maybe steal a few parts of it.”

I’m not against this idea, but I’m not sure how well this will work in practice. It’s very difficult to steal parts of code, which are generally written for a specific use in a specific environment. I once asked someone who coded part of the SDSS pipeline if I could use their implementation of an algorithm, and they told me that they themselves would start from scratch if they were trying to implement the algorithm in a different environment. And it would be incredibly time-consuming to check that someone’s code “perform[s] the algorithm [they] describe in [their] paper”, even for fairly simple programs – as we all know from trying to debug our own code.

5 Eric G. Barron October 17, 2010 at 9:00 am

Good topic! I’m glad there are others who would like to see the release of code as well. I’ve been ranting about how many scientists do not do so (or do not release their data in any format besides a table or small plot in a PDF) for a while…just ask my wife (also an astronomer) how much I talk about this. 😀

The overwhelming reason for releasing code should be that it is part of the scientific process. Do not be concerned with people misusing the code; their bad judgement is not your responsibility. If someone misuses your code, then it is that person who is at fault for misusing your code. We should not fall into the trap of making excuses against releasing code. It does not matter if you think your code is too ugly or obfuscated (which leads to the question of why is it so ugly and obfuscated in the first place) or that you think it would take others too long to examine your code or that you don’t think there is anything in the code that others can use (let’s try not to use the word steal). That the code was part of the process by which you arrived at your results should override any excuses against release.

Regarding a proper code repository…what is wrong with a repository that includes version control? Am I the only one who goes through multiple revisions when writing code? I use my own version control (specifically Mercurial) for everything I write…even small scripts (I group all of those in one small script repo). The history of the code can often tell you a lot about the current state of the code; it is important to preserve that history. I don’t think a service with version control is overkill. However, if others really don’t want version control, and want code to be easily associated with entries in ADS…then maybe it is time to update ADS to handle code. 🙂

6 John October 17, 2010 at 4:59 pm

I assume the only objection to using version control is that it adds to the (perceived, at least) complexity: I can’t imagine that anybody thinks it’s bad so much as just unnecessary. Personally, I tend to agree with Eric that version control is pretty much essential for any serious project, and, if the overhead is currently seen as too great, that just means we need simpler systems. Mind you, an ADS which handles code does sound like an awesome idea…

On the larger point, I find it hard to see the downside of code release. If you don’t believe your code is robust enough to stand other people looking at it, why should you — or, more importantly, anybody else — believe the results it gave you in the first place?

7 Ben October 18, 2010 at 4:01 am

People interested in these questions might be interested in this white paper that I led and submitted to the Astro2010 decadal survey: http://arxiv.org/abs/0903.3971

Some of the main issues that we raised are that there are structural barriers to releasing code and getting credit for one’s efforts. The lack of a repository that we can all agree on using and the problem of link-rot on personal pages are instances of this. I personally don’t think misuse of code will be a common issue. If someone produces a result it’s their responsibility to check it for reasonableness, no matter what tools they are using.

Unfortunately, while the decadal survey report noted that astronomy’s software infrastructure is crumbling due to excessive reliance on aging software packages that sometimes aren’t practical to extend or maintain, they didn’t recommend anything to actually *DO* about it, much less recommend funding doing something about it. Other than perhaps vainly hope that large projects (LSST?) are going to produce code that the rest of us can use. (If LSST produces something that can fully replace IRAF, I’ll eat a SparcStation.)

I think this is still at least partly an effect of generational issues in astronomy that privilege hardware over software. We sat around for over a year arguing about which facilities to spend hundreds of millions of dollars on, and we can’t summon the vision to suggest spending a fraction of this on community software development, or even a commitment that funded initiatives and instruments need to produce publicly available pipelines as part of the cost of the project.

8 Bubak October 18, 2010 at 4:47 am

Please release not just code, but also verification tests. The problem is not writing the code, but verifying that it works properly. The way you tested your code can be more valuable than the code itself.
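
A minimal sketch of what shipping a verification test next to a piece of analysis code could look like in Python. The function `angular_separation_deg` and its test are hypothetical examples chosen for illustration, not code from anyone in this thread; the test checks only a few cases with known answers.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions.

    All inputs are in degrees. Uses the haversine formula.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push sqrt(a) outside asin's domain.
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, a))))


def test_angular_separation():
    """Verification test: cases whose answers are known without the code."""
    # A point is at zero distance from itself.
    assert angular_separation_deg(10.0, 20.0, 10.0, 20.0) == 0.0
    # The celestial equator and the pole are 90 degrees apart.
    assert math.isclose(angular_separation_deg(0.0, 0.0, 0.0, 90.0), 90.0)
    # Opposite points on the equator are 180 degrees apart.
    assert math.isclose(angular_separation_deg(0.0, 0.0, 180.0, 0.0), 180.0)
    # The separation does not depend on the order of the two points.
    assert math.isclose(angular_separation_deg(10.0, 20.0, 30.0, 40.0),
                        angular_separation_deg(30.0, 40.0, 10.0, 20.0))


if __name__ == "__main__":
    test_angular_separation()
    print("all verification tests passed")
```

Released together, the test tells a reader which cases the author actually checked, and it can be rerun after any modification.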

9 Steve October 18, 2010 at 11:03 am

In the case of astrophysical simulations, there are infrequent attempts to benchmark and compare codes that implement different algorithms for computing the same physics. This has often led to a large amount of collaboration and algorithm & code improvement, a better community understanding of what the codes do, and additional transparency into the engineering of these codes.

Sadly, these are the wheat; the chaff is all too common. Authors are often free to simply state usage of a package, perhaps with relatively undisclosed modifications, but without peer review. There are multiple levels to this as well. We’ve all hunted through reams of references for an algorithm, only to find some key piece buried in an undiscoverable document (proprietary tech note, a doctoral dissertation, etc.). If an algorithm is verbally well described, it is fairly rare to have even a pseudocode published; if a pseudocode is available, what tricks, if any, were used to make the code efficient, and to what effect on the end science? Worse, too many papers rely on a bunch of ad hoc scripts for processing data, both observational and simulated, which never are vetted appropriately.

The only solution (and an unpopular one!) is the institution of a scientific code repository, maintained by the same journals publishing the verbal content. These repositories would require paper authors to have code developed under adherence to software engineering principles, and submitted under peer review.

10 Gus October 18, 2010 at 11:14 am

you could also publish code in your paper (below are two examples). a problem with this is that there is no “search” framework for someone trying to directly find code in these journals.

to find these I had to search like this: “I know Reid/McLaughlin published papers with tar files of source code, so find all their recent papers and page through them until I find the right one…”

Reid, M. et al. 2009, “Trigonometric Parallaxes Of Massive Star-Forming Regions. VI. Galactic Structure, Fundamental Parameters, And Noncircular Motions” DOI ADS

McLaughlin, D. et al. 2006, “Hubble Space Telescope Proper Motions and Stellar Dynamics in the Core of the Globular Cluster 47 Tucanae” DOI ADS

11 Gus October 18, 2010 at 11:31 am

now to be less constructive and a bit more controversial (err, hypothetical), consider that another comment made on astrobetter’s FB post suggested that refereeing a paper with code in it would be difficult or laborious.

my personal experience is that having to blindly trust authors’ hidden hacking is one of the most difficult parts of refereeing as it exists today. rarely, if ever, is the litany of software black boxes used even listed in a paper, let alone their infinite parameter lists quoted or tabulated.

i would +1 steve’s suggestion except that the barriers suggested (such as adherence to software engineering principles) are much too high for scientific computing. Look, I’ll quote Greg Wilson of Software Carpentry: “We ignore the fact that for the overwhelming majority of our fellow [scientists] computing is a price you have to pay to do something else that you actually care about” ITConversations (@47:50).

if a solution exists then it might look like a research framework that helps us capture all the “artifacts” of our scientific computing and make it easier to share our code instead of creating a long set of standards to not adhere to and that effectively prevent our results from ever appearing.

12 Tom October 18, 2010 at 4:55 pm

A quick note about the ease of sharing with version control – if anyone is interested in sharing short one-file scripts, gist (part of github) is very nice. Here is an example. You can edit the script in the browser, comment on it, and version control is automatic, without ever having to know a single git command. There’s always the issue of long-term reliability, but on the timescale of a few years, I think this is a good way of sharing code.

Sharing code with a paper can be done via an associated tar file. If a researcher uses for example a snippet of code from a version controlled repository (including gist), then she/he could include a snapshot of the version used in a tar file with the paper. Then that takes care of long-term association with the paper.
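
As a rough sketch of the tar-file approach, assuming Python's standard `tarfile` and `hashlib` modules: the helper below bundles a list of script files into a gzipped tar archive and adds a checksum manifest, so a reader can confirm that the files are the ones used for the paper. The function name `snapshot_code` and the manifest file name `MANIFEST.sha256` are made up for this illustration and are not any journal's convention.

```python
import hashlib
import io
import tarfile
from pathlib import Path


def snapshot_code(paths, archive_path):
    """Bundle the given files into a .tar.gz with a SHA-256 manifest.

    Files are stored under their base names only, so two inputs with the
    same file name would collide; this sketch does not handle that case.
    Returns the manifest lines that were written into the archive.
    """
    manifest_lines = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in map(Path, paths):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest_lines.append(f"{digest}  {path.name}")
            tar.add(str(path), arcname=path.name)
        # Write the manifest into the archive from memory.
        data = ("\n".join(manifest_lines) + "\n").encode("utf-8")
        info = tarfile.TarInfo(name="MANIFEST.sha256")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return manifest_lines
```

A call such as `snapshot_code(["reduce.py", "plot_fig3.py"], "paper_code.tar.gz")` (file names hypothetical) would produce a single archive that could be submitted alongside the manuscript.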

13 Gus October 18, 2010 at 11:38 pm

this just came across my friendfeed today — although this is a brand spanking new endeavor, don’t let the biomed fool you; the people behind this are pretty awesome.

“Open Research Computation publishes peer reviewed papers that describe the development, capacities, and uses of software designed for use by researchers in any field.”

14 Tom October 21, 2010 at 2:51 pm

@Gus: this looks really interesting. If this journal gets indexed in ADS, I would actually consider publishing in that journal. In the long term, this could mean that scientists might finally get recognition for developing quality software (if recognition = # of publications).

15 Ben October 21, 2010 at 6:50 pm

I don’t want to discourage the development of new journals, but want to point out that existing astronomy journals already accept papers describing astronomy software. PASP is one journal where instrument and software papers are totally appropriate. For example: DAOphot, http://adsabs.harvard.edu/abs/1987PASP…99..191S ; SExtractor, http://adsabs.harvard.edu/abs/1996A%26AS..117..393B ; aXe, http://adsabs.harvard.edu/abs/2009PASP..121…59K . And the first two have a ton of citations.

The problems I see are: (1) culturally, astronomers don’t take software seriously enough to give you credit unless you are a mega-valuable programmer like the author of DAOphot; (2) what to do with programs that are not as large or as fully developed as a DAOphot or SExtractor, but still might be of use to somebody. We need a reward structure that encourages people to make their software public and get credit for it, even when they don’t have time, inclination, or funding to write a 20 page paper describing it.

16 Tom October 21, 2010 at 7:18 pm

One thing that will be interesting to see is whether the journal Gus pointed to will encourage cross-disciplinary software development (as opposed to e.g. PASP). Granted, there is a lot of software that needs to be astronomy-specific, but there’s also a lot of software we are re-inventing that already exists in other fields (clustering algorithms are an example).

17 Bruce Berriman October 22, 2010 at 12:45 am

Very interesting article. I wrote a post (more like an essay!) in response on my blog Astronomy Computing Today, visit http://t.co/HNEnLOi. My bottom line: I think it is more important to preserve data products than all the code, and there is much more need to release, say, codes for new analysis techniques than all the code that an astronomer used in the research.
