Jake Vanderplas is the Director of Research in the Physical Sciences at the University of Washington’s eScience Institute. He is a maintainer and/or frequent contributor to many open source Python projects, including scikit-learn, scipy, matplotlib, and others. He occasionally blogs about Python, data visualization, open science, and related topics at Pythonic Perambulations. You may find his (BSD-licensed) code on Github, or follow him on Twitter at @jakevdp.
At the 223rd meeting of the American Astronomical Society in January, there was an excellent standing-room-only session on Astrophysics Code Sharing in which the question of open source licensing came up. Code licensing is one of those issues that many people never have a reason to think about, but it turns out to be a very important piece of making your code available to others. The goal of this post is to provide a quick guide to the subject for folks who don’t want to spend too much time researching it, and to offer some concrete advice to make sure your public scientific code is as useful and influential as it can possibly be.
Disclaimer: I have no formal legal training, and this post is not intended as legal advice. For a more thorough treatment of copyright law and licenses as they pertain to code and software, please see the list of references at the end of the article.
What Is a License?
For most people, the mention of a software license conjures up images of the paragraphs of legalese you must (pretend to) read before installing a proprietary software package. But licenses come in a variety of forms: generally, a software license is a legally-binding agreement which governs the use and redistribution of software. Licenses range from proprietary (think Microsoft Windows) to Open Source (think Linux), with many variations and gradations in between. Here we’ll focus primarily on Free and Open Source Software (FOSS) Licenses, as they are generally the most relevant to those who write and share scientific code.
Who Should Think About Licensing?
As we’ll discuss below, everybody who writes code should license it, and this is especially true in science. Increasingly, researchers in astronomy and related fields spend much of their energy producing scientific code. Because this code is an absolutely vital component of the reproducibility of scientific results, an important part of the scientific process is to make this code available to others. If you have shared or plan to share your scientific code—whether that’s on your own web page, on a blog, within supplemental websites of journals, or on a hosting service like Github or Bitbucket—this article is for you. Generally when scientists make their code public, they do so because they want it to be free to use and as useful as possible for as many people as possible. They want others to not only use it, but also extend it, fix bugs, incorporate it into their own research code, and thereby make it even more useful to more people. This article is written primarily with these scientific researchers in mind.
Summary: Three Pieces of Advice
This post will cover a lot of ground, but if you only take three pieces of information away from the article, let them be these:
- Always license your code. Unlicensed code is closed code, so any open license is better than none (but see #2).
- Always use a GPL-compatible license. GPL-compatible licenses ensure broad compatibility for your code, and include GPL, new BSD, MIT, and others (but see #3).
- Always use a permissive, BSD-style license. A permissive license such as new BSD or MIT is preferable to a copyleft license such as GPL or LGPL.
This list progresses from widely accepted to increasingly more contentious. I can’t claim my advice is sound for every possible situation. But for scientific researchers sharing their own code, I stand behind the recommendations. Below I’ll do my best to convince you why.
1. Always license your code.
Unlicensed code is closed code, so any open license is better than none (but see section #2 below).
Scientists generally share code that they hope others will use. And many of us operate under the assumption that posting it online effectively makes it available for anybody to use, modify, and incorporate into their projects. Somewhat surprisingly, it turns out that this is not the case in our copyright-driven world. The legalities of copyrights, licenses, etc. are difficult to distill, but a nice summary is given in a 2010 article by Arto Bendiken. It includes some examples of situations where copyright law may be at odds with naive intuition:
…if you stumble across some code with no attached licensing information, copyright laws would have you treat it as ‘all privileges retained’, even if its author in fact was just trying to make it available with no strings attached.
This is the first important thing to realize about licensing: adding a suitable license generally increases, not decreases, the openness of your code. Furthermore, adding a license lets you be explicit about your intent for the code. Do you want to be acknowledged as the author whenever your code is used or modified? Are you OK with other libraries incorporating, enhancing, and releasing versions of your code? Are you OK with for-profit companies incorporating your code into their private code-bases? Choosing a license is the established way to inform potential users of your preferences regarding these and other questions.
How to License Your Code
Below we’ll discuss suggestions of which license to use. First, though, let’s talk about the actual mechanics of licensing your code. Licenses generally contain a few paragraphs of standard text in which you insert the date, your name, your organization, etc. (you can find texts of several common licenses at the Open Source Initiative). For smaller projects, you can include the full license text in a comment at the top of each source file. Because this can be a bit tedious, other projects choose to have a single LICENSE.txt or COPYING.txt file in the source directory, and perhaps a small notice mentioning the license type at the top of each individual file in the codebase (see e.g. Numpy and SciPy for examples of this approach).
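As a rough illustration of the "single license file plus a short notice in each file" approach, here is a minimal Python sketch that lists source files lacking a header notice and prepends one where needed. The notice wording, the LICENSE.txt file name, and the helper names files_missing_notice and add_notice are all assumptions made for this example; they are not taken from NumPy, SciPy, or any other project.

```python
from pathlib import Path

# Hypothetical short notice; the wording and the LICENSE.txt file name are
# illustrative assumptions, not a required or standard form.
NOTICE = "# Licensed under a 3-clause BSD style license - see LICENSE.txt"


def files_missing_notice(source_dir, notice=NOTICE, pattern="*.py"):
    """Return the sorted list of files under source_dir that lack the notice."""
    missing = []
    for path in sorted(Path(source_dir).rglob(pattern)):
        # Only inspect the first few lines, where a header notice would live.
        head = path.read_text(encoding="utf-8").splitlines()[:5]
        if notice not in head:
            missing.append(path)
    return missing


def add_notice(path, notice=NOTICE):
    """Prepend the notice to a file, keeping a shebang line first if present."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    if insert_at == 1 and not lines[0].endswith("\n"):
        # Make sure the shebang line is terminated before adding the notice.
        lines[0] += "\n"
    lines.insert(insert_at, notice + "\n")
    path.write_text("".join(lines), encoding="utf-8")
```

Calling files_missing_notice("src") on a hypothetical src directory would return the files that still need a header, and add_notice could then be applied to each of them. The full license text itself would still live in the single license file at the top of the source tree.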
In summary, you should, at the very least, choose an open license and include it with your code. It will make your intentions clear and allow others the freedom to use and adapt your code without worrying about an implicit copyright. In the next section we’ll go into a bit more depth on which license you should choose, and why.
2. Always use a GPL-compatible license.
GPL-compatible licenses ensure broad compatibility for your code, and include GPL, LGPL, new BSD, MIT, and others (but see #3, below).
There is a huge breadth of available licenses to use for your code. Wikipedia has a comprehensive comparison of free and open source software (FOSS) licenses; you can also find useful information on these licenses on the GNU website. For the most part, any of these licenses is sufficient, though there are some important differences between them. Perhaps the best-known license is the GNU General Public License (GPL), which guarantees the freedom of users to use, copy, and modify code. Code released under a GPL-compatible license is code that can be incorporated into another GPL-licensed codebase without modification to the license. GPL-compatible licenses include GPL, LGPL, new BSD, MIT, and many others.
If the goal of publishing scientific code is to ensure that it is as useful as possible to as many people as possible, GPL-compatibility is a must in today’s world. GPL compatibility is so important that many non-GPL-compatible packages end up going to great lengths to retroactively change their license, a difficult process that involves gaining consensus of every person who has ever contributed to the project. For a list of examples, as well as a more complete discussion of the virtues of GPL compatibility, see this rather dense 2002 essay by David Wheeler, Make Your Open Source Software GPL-Compatible. Or Else.
I should clarify here that while I recommend a GPL-compatible license, I am not necessarily recommending the use of a GPL license itself (see the next section).
You can find some more details on GPL compatibility here: http://producingoss.com/en/license-compatibility.html
3. Always use a permissive, BSD/MIT-style license
A permissive license such as BSD or MIT is preferable to a copyleft license such as GPL or LGPL.
Everything above this point in the article is, for the most part, non-controversial in the open source community. Here we’ll go into something that’s likely to spark more debate, and it has to do with some subtle distinctions between “permissive” licenses (which are often called “BSD-style”) and “copyleft” licenses (which are often called “GPL-style”).
The difference between the two is rooted in this: broadly speaking, a copyleft license requires derivative works to preserve the same license as original works. In the case of GPL-licensed code, what this means is that any codebase that incorporates any GPL code must also have a GPL license, and therefore be free for others to use. This is the sense in which the GPL and other copyleft licenses are sometimes called “viral” or “sticky” licenses.
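The one-way nature of this rule can be sketched as a small lookup. The following is a deliberately simplified model written for this article: the set names and the function can_incorporate are hypothetical, and the sketch ignores license versions and the LGPL's separate rules for linking. It only encodes the broad statement above: permissive code can go anywhere, while copyleft code requires the host project to carry the same license.

```python
# Simplified, illustrative classification; real licenses have versions and
# clauses (e.g. the LGPL linking rules) that this sketch deliberately ignores.
PERMISSIVE = {"new BSD", "MIT"}
COPYLEFT = {"GPL"}


def can_incorporate(project_license, code_license):
    """Return True if code under code_license can be folded into a project
    under project_license without the project changing its own license."""
    for name in (project_license, code_license):
        if name not in PERMISSIVE | COPYLEFT:
            raise ValueError(f"license not covered by this sketch: {name}")
    if code_license in PERMISSIVE:
        # Permissive code can go into permissive and copyleft projects alike.
        return True
    # Copyleft code requires the combined work to carry the same license.
    return project_license == code_license


for project, code in [("GPL", "new BSD"), ("new BSD", "MIT"), ("new BSD", "GPL")]:
    print(f"{code} code into a {project} project: {can_incorporate(project, code)}")
```

In this model the first two combinations are allowed and the third is not: a new BSD project cannot absorb GPL code and remain new BSD, which is the asymmetry the rest of this section turns on.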
The motivation for such a restriction seems reasonable enough: you work hard on your code and want it to be free for others to use. There is some injustice in the notion that someone would take the result of your hard work, make a superficial modification, and thereafter treat your code as their own intellectual property. This sense of justice (along with connected concerns related to software patenting issues) seems to be a central piece of the argument in favor of GPL-style copyleft licenses.
The case for BSD in big projects
Despite these concerns, there are a growing number of scientific software developers who push back against the idea of copyleft and instead advocate the use of permissive, BSD/MIT-style licenses. Perhaps the best articulation of the reason for this was in a 2004 forum post by the late John Hunter (creator of the Matplotlib visualization library). I’d highly recommend that you read the whole thing.
To summarize Hunter’s reasoning: the two most important predictors of success for a software project are the number of users and the number of contributors. Because of the restrictions and subtle legal issues involved with GPL licenses, many for-profit companies will not touch GPL-licensed code, even if they are happy to contribute their changes back to the community. A BSD license, on the other hand, removes these restrictions: Hunter mentions several specific examples of vital industry partnership in the case of matplotlib. He argues that in general, a good BSD-licensed project will, by virtue of opening itself to the contribution of private companies, greatly grow its two greatest assets: its user-base and its developer-base.
There is a tradeoff to choosing a BSD license, of course. It means that you’re effectively putting off-limits the incorporation of existing GPL-licensed functionality in your package. Hunter mentions some specific challenges this has posed for matplotlib; I’ve seen similar challenges within packages I’ve worked on, namely SciPy and Scikit-learn. Nevertheless, Hunter asserts that the net advantage comes from inviting the participation and contribution of industry partners to up the user-base and developer-base of the project. This is why core scientific projects like NumPy, SciPy, Matplotlib, IPython, Pandas, and many others have opted for BSD-style, permissive licenses.
The case for BSD in smaller projects
At this point, you might be wondering why this is relevant to you. After all, you are probably not developing the next IPython or Matplotlib, but simply releasing a small piece of research code that you hope others will find useful. Still, I would argue that BSD is the best option, because it will lower the barrier for your code to be as useful as possible to as many people as possible. This usefulness leads to the citations, recognition, and collaboration that are the primary currency of the academic researcher. We should recognize that these desirable outcomes are not protected by the text of a license, but by established norms of the scientific community. For this reason, a scientific researcher should choose a license which offers the lowest barrier to the reproduction and extension of research results.
The world of scientific software is one in which many of the core packages have BSD-style licenses. In this world, a GPL-style license is an unnecessary liability, as it contains barriers to use and leads to a one-way process of collaboration. A BSD-style license, on the other hand, enables two-way collaboration, in which developers of other projects can easily utilize, patch, improve, and cite your code to your mutual advantage.
In the astronomy world, many projects have followed this reasoning and opted for BSD-style licensing. AstroPy, astroML, emcee, and the yt project are well-known examples. The yt project is particularly interesting, because it started with a GPL license and, based on much of the above reasoning, undertook the fairly significant challenge of later switching to BSD-style. You can read an account of the reasoning and process on the yt blog. The benefits of a BSD-style license are strong enough that it was worth a significant effort to attain them: do your future self a favor and choose a BSD-style license from the start!
Going In-Depth: More Resources
Much has been written on this topic, and I’d encourage you to learn more if you’re interested. Following is a listing of several resources which discuss some of the above issues in more depth:
- Producing Open Source Software: a free online book with a lot of great information and advice surrounding the production and maintenance of open source software, including licensing and related issues.
- Understanding Open Source and Free Software Licensing: a full-length book discussing the finer points of open source licenses and copyright law.
- A Quick Guide to Software Licensing for the Scientist-Programmer: An in-depth article on software licensing from a computational biology journal.
- What makes computational open source software libraries successful? [pdf]: A general discussion of what makes scientific programming libraries successful.
- Why we should be using BSD: The case for BSD-style licensing as made by the late John Hunter, creator of the matplotlib visualization library.
- Open Source Licenses: The Open Source Initiative’s discussion of free and open licenses
With any discussion of licensing there are bound to be differing opinions. Where some prefer BSD to GPL, others make the case for GPL over BSD. Still others argue the superiority of other permissive licenses, such as Apache or Creative Commons. The right license choice often comes down to a matter of taste, and often depends on the (sometimes implicit) goals and priorities of the authors. Here are some questions to think about, which you may wish to discuss in the comment thread:
- What core priorities drive advocacy of one license style over another?
- What licensing practices are standard in your own community? What values do these reflect? How do these help or hinder progress?
- How many in your community realize that unlicensed code defaults to non-open code?
I would strongly disagree about choosing the BSD license and much prefer a copyleft license such as the GPL. If I’m publicly releasing code, I’m not writing it so that it can be taken by a for-profit company and essentially close-sourced. The GPL is philosophically much closer to academic freedom, where any modified versions distributed, also have the source distributed. The GPL also leaves you in a much stronger position for negotiation, should you ever want to sell the closed-source rights to your code for money. However, the LGPL is also a good choice if you want to make a library, don’t mind what others do with that library, but want to ensure any modified versions remain free.
If I’m publicly releasing code, I hope it’s useful enough that a library like scipy, scikit-learn, or matplotlib will want to incorporate it. In that case, GPL is a barrier to its usefulness. Remember that in academia, it’s not the text of a license that protects your ideas, but the norms of the academic community.
Jake: that’s all very well for code that is purely academic and not interesting to commercial users. If your code is at all commercially viable, the GPL is a much better prospect. If your code is useful, people will use it regardless of the choice of free license. People even use black boxes like sm and galfit.
Matplotlib, Numpy, scipy, pandas, scikit-learn, IPython and the like are undeniably commercially viable, yet their authors have unanimously chosen the BSD license. I think that’s much more than a simple coincidence. Did you read John Hunter’s full post on the subject? He makes the case for this much more strongly than I have.
Hi Jeremy, I agree. I was really quite torn about re-licensing yt, as I am personally an enormous fan of the GPL and the copyleft. I am, in fact, still rather torn about this for the reasons you wrote here — I am committed to ensuring that the code remains free, and that future versions of it are inspectable at the source level. BSD licensing does not enforce that freedom, but we as a community do — yt itself, even if a nefarious corporation takes it and improves it and sells it, will be free software, respecting all four freedoms.
Part of my reasoning which didn’t make it into the blogpost was about concerns of sustainability; at the time we relicensed, my academic future was somewhat uncertain, and I wanted to ensure that people who had come to depend on the code would benefit from an active development community. The social pressure to BSD licensing from the scientific python community was such that I saw re-licensing as the best way forward. While this is counter to the idea that copyleft *enforces* freedom of the code, it is aligned with the growing of the contributor base as the restrictions are lessened.
I guess my main point is, I think you’re right, and I think the issue is a subtle one, but I *also* think that in the end the choice that we made as the yt community was the right one for us at the time. But that doesn’t keep me from being a little sad that the GPL is viewed as such a pariah, as in essence I firmly believe that the de facto standards of FLOSS in science will be GPL-like, with an active curation and preservation of the four freedoms, even if BSD in actual license.
@Jeremy: Compare Octave to the Python ecosystem Jake blogged about. Octave is an utter POS, and I attribute this to their hardcore GPL stance. It’s utterly undeniable how much for-profit companies have contributed to the Python scientific computing ecosystem.
Jeremy, you cannot use GPL code within a BSD project. For example, if you have a library under GPL, then you can’t use this library as a module for any BSD/MIT project.
Alexander: LGPL code is fine to use in a BSD project and remains protected from being lifted.
Daniel: I don’t think it’s the license (though I don’t follow Octave), as Linux shows that it’s certainly possible for commercial companies to thrive in a GPL ecosystem. I’ve had commercial companies contributing towards my GPL code with no licensing issues. If a company is scared off by the GPL, it’s likely they need better lawyers. The code being GPL protects their contributions from being swallowed up by competitors in rival closed source apps.
Jeremy – regarding LGPL you are partly correct. Yes, a BSD project can link to an LGPL library without a problem. But as soon as it wants to use just a single part of that library, and bundle it with the BSD code rather than require their users to manage a complicated dependency graph, that no longer works. LGPL is not a solution to the many problems of GPL mentioned in this article and the ones it links to.
Jake: splitting up libraries and copying parts is considered bad practice by many in the free software / open source community (e.g. Fedora, Debian). This is the bundling library issue and gives rise to many security issues. It’s better to have a single library if possible rather than many forks.
Jeremy – perhaps we’re talking about different situations.
This is an example of what I’m thinking of: the SciPy distribution contains several fortran packages bundled with the code. For sparse eigenvalue problems, for example, it bundles ARPACK because there are enough different system ARPACK versions out there that not bundling it causes problems for many users.
There’s another sparse eigensolver library called PROPACK which for many years was on an LGPL license: thus the source could not be bundled, and for this reason SciPy did not utilize it. That’s one example of a specific situation I’ve seen in which the GPL/LGPL was an unnecessary barrier to use for scientific code.
This is a follow up to Jake’s comment at March 14, 2014 at 10:09 am.
Sorry for continuing this discussion after about 2 years. I just wanted to know exactly what part of GNU LGPL restricted the bundling?
In the arguments below, I might have understood GNU LGPL incorrectly, so please correct me if I am wrong.
1. You are already releasing all the source code so you don’t have any issue with that part of the GNU LGPL license which requires the source code to be released. That is intended for projects that don’t want to release the source code. For them, they have to release the source code of the GNU LGPL parts under GNU LGPL.
2. “Conveying Modified Versions” (or bundling PROPACK into scipy in the example above) is permitted under the GNU LGPL, v3, section 2. You just have to keep the licence of the library (and any change that you make within it) to be GNU LGPL. In the example above, you also have to mention that the sparse eigenvalue problem solvers (which would hypothetically bundle PROPACK) are licenced under GNU LGPL. Since you will probably be making very small changes in the library to bundle it, you will not be “restricting” too much of your work with GNU LGPL and you can bundle it into your BSD-licensed work and benefit from it at the expense of a short notice in your copyright statement.
So if I have understood GNU LGPL correctly, you can easily bundle a GNU LGPL library into a BSD-licenced program without much effort. But then again, I am not a lawyer and don’t have much experience with the different software licenses yet, so I would be grateful if you could let me know if I have incorrectly understood the GNU LGPL.
Thanks very much for this article. I wonder if you have any advice for someone who is in agreement with both arguments. On one side, I’d not be too happy if my code were made closed and profits were generated on it, for a private company selling expensive software. On the other hand, I *want* most people to use my code, and if it could be incorporated into projects that I use and greatly respect (such as sklearn), that would make me very happy; I’d be pretty annoyed at myself if they couldn’t use my code because of a licensing issue.
Neither is likely to happen anytime soon. My code isn’t good enough for private companies to use it. For open projects, GitHub makes it easy to contribute more directly, without having to worry about licensing. Still, thinking into the future… Do you know of any instances where open code was profited from? Is there perhaps a permissive license that restricts for-profit use?
Thanks again, this was very approachable.
Quentin – great questions. Others may have ideas as well, but I’ll give my two cents:
I have never heard of code being stolen through having too permissive a license. There are a lot of good candidates for this: for example, the Pandas project is BSD licensed and is widely used in the financial industry, but the contributors I’ve talked to don’t seem to worry about the project being commandeered and privatized. On the contrary, one of the reasons the project has been so successful is due to the user-base and developer-base coming from private companies.
For the average scientist, I think your worry of posing a barrier to upstream adoption within a package like scikit-learn is much more real than the worry that a company will privatize your code and not give you the credit. That’s the main reason I recommend BSD-style licenses.
For those interested, I have written a blog post about how I selected the BSD license for the Montage mosaic engine:
http://astrocompute.wordpress.com/2014/01/17/licensing-your-code-gpl-bsd-and-edvard-munchs-the-scream/
Also I have made some other posts on licensing:
A quick guide to licensing at http://astrocompute.wordpress.com/2014/02/25/a-quick-guide-to-software-licensing-for-the-scientist-programmer/
and a repost of material from the ASCL web page:
http://astrocompute.wordpress.com/2014/03/05/resources-for-licensing-your-code/
Remember that a commercial operation can’t “take” your BSD licensed code in the taking-your-candy sense. They can incorporate it into their own code base, sell a product based on it, and not contribute improvements back to the community, *but they can’t make the original code private*. The original code is out there for free public use as long as someone somewhere is willing to keep a website alive that hosts it. I know everyone who’s thought hard about the BSD vs GPL debate understands this, but it needs to be emphasized for the blog-reading audience.
In the scientific world far more damage has been done by restrictive licensing, and copyrighted code that people use without understanding the consequences, and then discover they can’t release their code because it incorporated some restricted subroutine. The Numerical Recipes license is particularly horrible.
Perhaps the oddest license I have come across is the one for CAMB, which requires that
“Any publication using results of the code must be submitted to arXiv at the same time as, or before, submitting to a journal. arXiv must be updated with a version equivalent to that accepted by the journal on journal acceptance.”
It’s the “same time as, or before” part that I find too restrictive. It’s a good thing there’s an alternative: CLASS.
I must disagree strongly with your endorsement of permissive licenses and your arguments for them. What is nowhere here discussed is the enormous amount of corporate lobbying which has influenced the issue. Google, with the exception of work with the Linux kernel, a “too big to avoid” project, has actively sought to expunge the GPL from other efforts of theirs: Android contains only the kernel as GPL software. The fragmented Android ecosystem is one of many examples where this choice is hostile to the end user.
The attraction of proprietary code for corporate mining into proprietary projects is already discussed above, but a second feature is that the GPL extends a patent grant to technologies within (implicit in the v2 license, explicit in the v3). This is perhaps the chief “fear” of companies. But the risk to astronomers for not protecting themselves is even greater.
Both factors play a role in the appropriation of scientific code as profit opportunities. The claim above that code does not go private is not backed up by some significant and bitter history: the “pixon” deconvolution method is one originally developed as (not widely distributed) permissive code and then privatized and walled off through patent claims. Some of the chief software used for ab-initio chemistry orbital work in the 90s was privatized by a consortium run initially by chemists which then was taken over by investors: they also claimed patent exclusivity at the time, but I haven’t followed that area in almost two decades.
Yes, the issue of walled communities is important to address, but both the GPL and BSD ecosystems have their failure mechanisms. The failure mechanism of BSD code is that the chief useful development goes proprietary. The failure mechanism of GPL code is that combined GPL+BSD code (which *is* permissible, contra the above) must be released under the GPL license, and that some companies won’t touch it.
It is far harder to recover from the BSD failure mechanism than the GPL. The GPL mechanism means that some corporate collaborators are lost. But on the other hand, it goes both ways: I won’t contribute extensive code to a project which is not protected by the GPL or at least the LGPL. Many others in the free/libre software community have similar ethics. One must recall that the percentage of GPL/LGPL family code within the Debian family of Linux distributions is, according to the widest study done with open methodology, 93%. (Blackduck’s claim of 58% is not explained nor are its methods and tools published).
Remember that there is lobbying and public relations — with extensive funds behind them — to influence something so potentially lucrative as the licenses which academics publish under. And those forces are hostile to the GPL. That alone should suggest that one give some thought as to why.
There is also one other point: the terms of the licenses which we use are not always dictated by us: work for hire, which governs many grant-driven projects, attaches its own strings. University policy is increasingly scouring potential for patentability of academic work done therein. The implications for the openness of research, and for reproducibility, are dire.
Relevant thread on the Astronomers FB group from May 2015: https://www.facebook.com/groups/123898011017097/permalink/862996433773914/