How to license research artifacts?
In research, we generally start by gathering tons of data and running some analysis on it. We then write a draft about what we found, submit it to a conference or journal, and, after it is accepted, hopefully the world will change based on our findings.
In this ideal scenario, the most important outcome is the written paper, which describes the methods, analyses, and results. The belief is that, if everything is properly described, anyone could redo the experiments and arrive at the same findings.
However, in recent years, researchers seem to agree that sharing just the written paper is not enough; whenever possible, researchers should also share everything they created during the research (the research artifacts). By sharing research artifacts, researchers strengthen the whole research chain. For instance:
- Other researchers interested in your work don’t have to start from scratch and spend weeks, months, or even years to find what you already found. Science would hardly progress if we had to reinvent the wheel over and over again.
- Newcomers could easily start exploring your research field. By way of comparison, in my research field, software engineering, there is a sub-field called mining software repositories (MSR). Doing MSR research, say, 10 years ago was much more challenging than it is today. Actually, it has become so easy to mine repositories that even students with little to no programming experience can start mining something in a matter of hours.
- Reproduction becomes easier if we have the research artifacts on our hands. Yes, we can try our best to recreate the very same scenario, but we may end up missing some variables here and there (which may not even have been described in the research paper). As noted in some other studies, small changes in the experiment could have non-trivial impacts on the findings.
- We could find (and fix) research bugs more easily and quickly. Only with the original research artifacts is it possible to find potential bugs in the data or analysis. If we face a bug when trying to recreate the research artifact based on what was written in the paper, should we attribute that bug to ourselves or to the original research?
Given this scenario, it may not be surprising that many researchers strongly advocate for open science. The open science initiative encourages researchers to make research artifacts more accessible to everyone. It fosters reproduction, speeds up science, and helps strengthen your research field, among many other benefits.
The question is: how should we share these artifacts?
When you write a document, create data, make figures, write code, etc., you are essentially producing creative, original work.
By default, any original work that we create is protected by copyright law, which prohibits unauthorized use of the work. That is, others must ask permission to use it. Copyright protection is automatic; it does not depend on government registration or any formal declaration.
“The goal of copyright laws is to protect the rights of the author. If you want someone to use your original work, you should permit this person to use it. Similarly, if you want to sell your original work, the copyright laws ensure that the commercial exploitation of this work is also subject to its creator’s authorization.”
In other words: when you create and share your research artifacts, they are all protected by copyright law and, thus, other researchers are not authorized to use them (at least not without clear permission from the copyright owner, that is, the researcher who created the artifact). More concretely: if you read a research paper that points to some artifacts online, you can only use these artifacts if you ask the paper’s authors for permission to do so.
At this point, an astute reader might think that copyright law takes a toll on the open science initiative in particular and on scientific progress in general. Imagine how (counter-)productive it would be if we had to ask permission from the creators of software programs, such as R (and its myriad packages), before even touching them.
But in reality, we don’t really ask permission to use R (and its packages). Do we?
Why? Are we bad people violating copyright law?
The reason we don’t need to ask permission to use R is that it has a license that grants users certain permissions. With a license, users don’t have to ask permission. In fact, a license states exactly what users can do with your artifacts. As a consequence, users are bound by the terms of the license.
The use of any original work, including software, is subject to the conditions established by its author. The license is the vehicle that describes these conditions.
“A license is a mechanism authors use to allow others to use their work. A license can be seen as a contract in which the right to use the original work are granted to the licensees. The license has the stipulations (that is, the terms and conditions) in which the licensee must follow to use the original work.”
There are many different kinds of licenses, even for similar purposes. The license that covers R is a software license. Moreover, since R is an open source project, it is governed by an open source license (GNU GPL v2, in particular). Similarly, if you are a Windows user, you have to agree to the terms of a proprietary license in order to use it.
Now that we understand what copyright is and what a license is, let’s get back to our research artifacts. If you happen to write research code that is available on the internet, this code is automatically protected by copyright law (and, as such, no one can use it without your consent). If you want others to use your code, you should include a software license with it (an open source software license, most likely). However, research artifacts are not only about code. We also produce a lot of data, content, figures, etc. Does it make sense to use software licenses to license figures? No, it does not.
Generally speaking, artifacts like documentation, figures, icons, music, etc., that you create as part of your research should also be shared with a license. To make things easier, next I provide a couple of examples of licenses that could be used to license these different kinds of artifacts.
- Source code files: there is a plethora of open source licenses available out there. The most common ones are the MIT license, the BSD license, the Apache v2 license, and the GPL family of licenses.
- Data files: examples of data files include spreadsheets, .csv files, .rda files (for R programming), etc. Licenses for data files include OFL 1.0, PDDL, CDLA-Permissive-1.0, ODC-BY, and some of the Creative Commons family of licenses.
- Document files: for ordinary documents, it is more common to use one of the Creative Commons family of licenses. Since some licenses under the Creative Commons umbrella have some issues with the GPL, the Free Software Foundation recommends the use of the GFDL (the GNU Free Documentation License). The GFDL is widely used in GNU manuals. However, use the GFDL with care, since it is not permitted as the only acceptable license in some circumstances (e.g., if the creative work is not software-related, like ordinary paintings).
- Image files: similar to document files, as far as I can tell, Creative Commons is also, by far, the most used license family for image files. Indeed, image-hosting websites such as Flickr have their own section dedicated to materials under Creative Commons.
- Font files: perhaps the most used font license is the OFL (the same license mentioned for data files, but in its newer version, 1.1).
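To make the list above a bit more concrete, here is a minimal Python sketch that turns it into a lookup table. Take it as an illustration only: the function name `suggest_licenses` and the grouping of file extensions into code, data, document, image, and font are my own assumptions, and the license names are simply the ones listed above, not legal advice.

```python
from pathlib import Path

# License options named in the list above, keyed by artifact kind.
LICENSES_BY_KIND = {
    "code": ["MIT", "BSD", "Apache v2", "GPL family"],
    "data": ["OFL 1.0", "PDDL", "CDLA-Permissive-1.0", "ODC-BY",
             "Creative Commons family"],
    "document": ["Creative Commons family", "GFDL"],
    "image": ["Creative Commons family"],
    "font": ["OFL 1.1"],
}

# Which extension belongs to which kind is an illustrative assumption.
KIND_BY_EXTENSION = {
    ".py": "code", ".r": "code", ".java": "code",
    ".csv": "data", ".rda": "data", ".xlsx": "data",
    ".md": "document", ".pdf": "document", ".txt": "document",
    ".png": "image", ".jpg": "image", ".svg": "image",
    ".ttf": "font", ".otf": "font",
}


def suggest_licenses(filename: str) -> list[str]:
    """Return candidate licenses for a file, or an empty list if unknown."""
    extension = Path(filename).suffix.lower()
    kind = KIND_BY_EXTENSION.get(extension)
    return list(LICENSES_BY_KIND.get(kind, []))


if __name__ == "__main__":
    for name in ["analysis.R", "results.csv", "figure1.png", "notes.xyz"]:
        print(name, "->", suggest_licenses(name) or "no suggestion")
```

An unknown extension yields an empty list on purpose: in that case the choice is better made by a person than by a table.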
Note that the majority of the licenses mentioned here deal with digital research artifacts. If you want to, say, share a printed version of your document, some licenses may not apply (e.g., the GFDL requires that a copy of the license goes along with the content).
Also note that the licenses mentioned in this blog post are licenses that encourage the use and reuse of your research artifacts. However, there are also licenses that encourage the commercial exploitation of your work (like the one in the Windows example).
I have not mentioned it so far, but if you do not want to understand all these details and just want to dedicate your work to the public domain, without bothering about copyright, restrictions, and licensing, the Public Domain Mark 1.0 and the CC0 are perhaps the best choices for non-code artifacts. For source code artifacts, perhaps the MIT license is the closest equivalent among the licenses mentioned above, although it still asks users to keep the copyright notice.
In summary, the key lesson of this blog post is: every research artifact that is shared online must have a license associated with it; without a license, the original creator keeps all the rights to her work, limiting reuse.
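One way to act on this lesson is to check, before publishing, that each artifact folder ships with a license file. The Python sketch below is a rough illustration under a few assumptions of mine: artifacts are organized as one sub-folder per artifact, the helper names `has_license_file` and `find_unlicensed` are hypothetical, and the accepted file names (LICENSE, LICENSE.txt, LICENSE.md, COPYING) are common conventions rather than a standard.

```python
from pathlib import Path

# Common license file names; this set is an assumption, not a standard.
LICENSE_FILENAMES = {"license", "license.txt", "license.md", "copying"}


def has_license_file(directory: Path) -> bool:
    """True if the directory directly contains a license-like file."""
    return any(
        entry.is_file() and entry.name.lower() in LICENSE_FILENAMES
        for entry in directory.iterdir()
    )


def find_unlicensed(root: Path) -> list[str]:
    """Names of the immediate sub-folders of root that lack a license file."""
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and not has_license_file(child)
    )


if __name__ == "__main__":
    # Checks the current folder; point it at your own artifact folder instead.
    for name in find_unlicensed(Path(".")):
        print(f"{name}: no license file found")
```

The check only confirms that a license file exists; it says nothing about whether the chosen license suits that kind of artifact.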
If you want to know more about open source licensing and some hidden problems behind it, consider buying a copy of my ebook: Open Source Licensing 101.