A call for open access to all data used in AJ and ApJ articles

I don’t fully understand it, but I know the Astronomical Journal (AJ) and Astrophysical Journal (ApJ) are different from many other journals: they are run by the American Astronomical Society (AAS) and not by a for-profit publisher. That means that the AAS Council and the members (the people actually producing and reading the science) have a lot of control over how the journals are run. In a recent President’s Column, the AAS President, David Helfand, proposed a radical, yet obvious, idea for propelling our field into the realm of data sharing and open access: require all journal articles to be accompanied by the data on which the conclusions are based.

We are a data-rich—and data-driven—field [and] I am advocating [that authors provide] a link in articles to the data that underlies a paper’s conclusions…In my view, the time has come—and the technological resources are available—to make the conclusion of every ApJ or AJ article fully reproducible by publishing the data that underlie that conclusion. It would be an important step toward enhancing and sharing our scientific understanding of the universe.

While David makes the case for reproducibility in his column, I think it’s much bigger than that: it’s also about enabling other investigators to use the data for other projects. For example, the data could be incorporated into a bigger sample, used as one epoch in a variability study, or mined for a feature that the original authors were not interested in. By making all of our fully reduced data available for the rest of the community to use, we greatly increase the return on investment. Especially as “small” telescopes are being closed and our access to resources is declining, we need to be getting the most bang for our buck out of each photon and CPU cycle.

In my opinion, the greatest obstacle to implementing this idea is the infrastructure necessary for making proper data archival and sharing a reality. We need a tool that makes both data ingestion and data discovery easy and intuitive. It needs to accommodate ground-based, space-based, and model data. It should somehow be linked to the major wide-sky surveys, SIMBAD, VizieR, and ADS. There need to be guidelines for quality flagging, raw vs. reduced data, appropriate citation, and so many other things. This is a huge project but absolutely achievable with current technology. We just need to figure out how to get it done. (I actually thought the Virtual Observatory was going to do this, but in my current understanding, building the data archive was never actually part of the project.)

As we talked about last week, MAST is providing a huge step in this direction by accepting user-contributed data, but only data from “a MAST-supported mission (e.g. HST, FUSE, GALEX, IUE, EUVE etc.), or ground-based observations closely related to a MAST mission.” Ground-based observations related to Spitzer, WISE, 2MASS, etc. are accepted by the Infrared Science Archive (IRSA), but it looks like they focus mainly on large “Legacy” projects and not your average Jane’s dataset. And as far as I know, there’s no supported repository for hosting reduced data products from wholly ground-based programs.

Regardless of the existence of an archive that will take your data, there is still a more fundamental problem which needs to be addressed. Until there is real incentive or enforced requirements for making data available, tarballs of nonsense will continue to be emailed and underutilized datasets will continue to languish on our hard drives or hidden on personal websites. Sure, NSF and NASA require Data Management Plans, but there is no enforcement of any of the promises made in those plans.

For those of you interested in doing the right thing, Gus Muench and the AAS Employment Committee are planning on providing a data archival workshop at the upcoming AAS 223 Winter meeting in DC.

What does a fully functional data archival and sharing tool look like to you? Is it many websites like we have now that just need to be linked together, or one central portal? Do you think published data should be accessible? How else do you think data publication and archival could be encouraged in our community? How high a priority do you think this should be?

I close with this. It’s scary how familiar this conversation is, especially the bit in Act 3 about field names.

18 comments
  • TMB Jul 10, 2013 @ 9:43

    I’ve used CDS before, although there’s a bit of a chicken-and-egg problem with linking it to a non-A&A article before the volume+page are known (which would be required in order for the link in the article to work…).

    The one philosophical issue is protecting graduate theses.

    • August Muench Jul 10, 2013 @ 11:16

      This CDS problem only means that there is a delay in publishing the data, right? You could add it tomorrow, a month later, years later (they accept all those).

      Note to self: figure out what CDS is doing with the new AAS journals that have article ids instead of page numbers.

    • TMB Jul 11, 2013 @ 9:35

      Yes, you can always add your data to CDS later. What you can’t easily do is say in the paper itself “the data are available on CDS as table Blah/Blah/Blah”.

    • August Muench Jul 11, 2013 @ 10:09

      Good points, TMB. I think that ADS serves as a very successful reaggregator of data and papers. The cases I am thinking of include those where telescope archives (MAST, Chandra) have digital librarians who go through the corpus of papers, identify papers with HST, etc. data, and provide data links to ADS to co-index with the original papers (note that I have found some examples where this feedback actually happens during the publication process as well). It is also true that CDS harvests data sets and provides them (under a new, related bibcode), which ADS then merges with the original paper behind the scenes.

      The primary problem with this entire process (as we just described) is that it depends upon a post-publication curation process performed by third-party experts to alleviate, or just work around, the burden that should fall on the paper’s authors and original data analysts. I think that is a problem because no one may ever have the same knowledge of the data again; authors have to be involved more deeply in sharing their data products.
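
      To make the CDS end of that pipeline concrete, here is a minimal sketch of how a reader could pull a harvested table back down by its catalog designation, using the astroquery Python package (the designation “J/ApJ/000/00” below is a placeholder, not a real table):

          from astroquery.vizier import Vizier

          Vizier.ROW_LIMIT = -1                          # no cap on returned rows
          tables = Vizier.get_catalogs("J/ApJ/000/00")   # placeholder CDS designation
          if len(tables):                                # TableList of astropy Tables
              first = tables[0]
              print(first.colnames[:5], len(first), "rows")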

  • James Schombert Jul 10, 2013 @ 12:18

    After jumping through referee and copyeditor hoops for months, my motivation is extremely low to jump through even more hoops to make my data available through ApJ/AJ. Since the release of papers on astro-ph precedes the publication date by months, it is more efficient to release my data on my own website, referenced in the papers. This also allows me to post the scripts used to generate the figures and final datasets as a form of validity test. Given that AJ is still requesting PostScript files for figures, I have very little confidence in their ability to maintain a data center.

  • Omar Laurino Jul 10, 2013 @ 13:47

    I am not sure I understand your statement about the Virtual Observatory (VO), so I apologize if my comment below is off scope.

    VO does provide an infrastructure for what I think you are asking for. It does not provide a single implementation, but it allows different implementations to be interoperable via shared standards and protocols. Moreover, there are already projects in place that can archive data and expose it according to the aforementioned standards and protocols.

    Resources (e.g. actual services) are registered in a distributed network of Resource Registries that can harvest each other.

    The VAO Data Discovery Tool is an example of a single portal that gives access to a great deal of resources taking advantage of the standards, but that’s only one example.

    There are tools that allow “people” to publish their data according to the standards, and we are trying to make those tools more and more usable and intuitive. Some institutions even allow you to send them your data, and they will implement, register, and expose the corresponding services through VO protocols. See for example this page maintained by the International Virtual Observatory Alliance (IVOA).

    Some science applications, e.g. the VAO SED Building and Analysis tool (Iris… full disclosure: I am one of Iris’ developers) can understand data provided in VO formats (likely from VO services) and perform operations on it in a more intuitive and domain-specific way. Such applications can be interoperable and exchange messages using VO Interoperability.

    Whether the VO is practically heading in the right direction is certainly up for discussion, but I think the main goal is indeed the kind of infrastructure you are depicting, in a distributed fashion.

    Some of us in the IVOA community are trying to build the kind of middleware that accommodates more and more use cases, helping “people” ranging from the individual astronomer to big data centers to publish their data in a standard way. So, we are certainly interested in feedback, suggestions, and critiques from the astronomical community.
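
    As a minimal sketch of what that discovery-and-query workflow looks like from the Python side, using the pyvo package (the keyword, the choice of the first matching service, and the position below are purely illustrative):

        import pyvo as vo

        # Find registered cone-search services whose metadata mention 2MASS
        # (the keyword is just an example).
        services = vo.regsearch(servicetype="conesearch", keywords=["2MASS"])

        # Query the first matching service for sources within 0.05 deg of an
        # arbitrary example position (RA, Dec in degrees).
        result = services[0].search(pos=(187.70, 12.39), radius=0.05)

        print(len(result), "rows; first columns:", result.fieldnames[:5])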

    • August Muench Jul 11, 2013 @ 10:00

      Omar’s is a great summary of the current state of the VO successes (from toolkits to desktop science applications). Nevertheless, I am 100% certain that the point was that the “VO”, as an ensemble of variably funded, distributed institutions, rarely implements its own data archives.

      This is a common misconception that is worth clearing up as often as necessary. Perhaps these distributed institutions should have created archives, but in the cases where they have spun up disks, those resources have had to be turned off because the parent “VO” institution is inevitably defunded.

  • Cédric Jul 10, 2013 @ 17:24

    A few years ago we had something like this with the humble infrastructure of a wiki; of course it was not enough, but the idea was there: http://wikimbad.obs.ujf-grenoble.fr/Home.html

  • Nick Nelson Jul 10, 2013 @ 21:40

    This is a nice idea, but it’s simply not feasible for all papers. I do large numerical simulations that routinely produce hundreds of terabytes of data. There are only a few supercomputing centers with the resources to store that kind of data, and I have yet to see anyone try to make data on that scale publicly available. It would be possible (and very reasonable) to archive the data displayed in plots and tables, but publishing the raw simulation data is not.

    • August Muench Jul 11, 2013 @ 9:54

      How much of that raw simulation data is reproducible with the code and parameters (to within the numerical accuracy of the supercomputer)?

      The same argument is often used by people who create large value-added tables that are simply algorithmic manipulations of existing large databases. The selection and SQL algorithms are entirely reproducible, rendering the “processed” data not so important. What do you think?
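
      As a sketch of what “archive the selection rather than the derived table” could look like, using the pyvo package against a hypothetical TAP endpoint with a hypothetical ADQL query (neither is real):

          import pyvo as vo

          # Hypothetical TAP endpoint hosting the parent database.
          tap = vo.dal.TAPService("http://example.org/tap")

          # The selection that produced the value-added table, archived verbatim
          # alongside the paper (hypothetical schema and cut).
          adql = """
              SELECT ra, dec, mag_g
              FROM survey.sources
              WHERE mag_g < 18
          """

          result = tap.search(adql)   # synchronous query
          print(len(result), "rows re-derived from the archived selection")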

    • Mark Krumholz Jul 12, 2013 @ 2:38

      I agree with Nick that archiving simulation data is impractical (unless someone wants to spend a lot of money to pay for a huge amount of spinning disk to live permanently at various national supercomputer centers). While archiving the source code and parameters as August suggests is better than nothing, it’s important to realize that this is by no means equivalent to archiving the data, and in practical terms it really does not provide reproducibility. First of all there’s a question of computational resources. If a simulation requires 10 million CPU-hours to run (not at all uncommon for large-scale computations), then reproducing that from just the source code costs another 10 million CPU-hours, which in dollar terms is something upwards of $1M, not to mention the fact that no review committee is going to give you 10 million CPU hours just to reproduce someone else’s simulation. Think of saving the source code and the parameters as akin to recording the exact instrument settings and pointings for a telescope observation, not as analogous to archiving the data. Sure, if you record exactly how you set up the telescope, someone could go redo the observations to check your work, but in practice that’s rarely if ever going to happen.

      There’s also a more subtle technical issue. For the vast majority of modern parallel codes, you are not guaranteed that the results will be identical to machine precision when a code is run on different numbers of processors, for example, or is compiled with different compiler options. If the system being simulated is chaotic (and most of the interesting ones are), then this is going to make the results diverge after a while. Hopefully the results have the same statistical properties as the original simulation, but there’s no guarantee at all that they will look visually the same. This means that if you really want full reproducibility to machine precision, you need not just the code, but also access to the same platform, compilers, etc.

      In short, making the full outputs of large-scale simulations widely available is still a hard technical problem, and no amount of policy-making is going to change that. A long-term commitment to save simulation data and make it widely accessible would really involve a significant investment of resources. While it would certainly be nice, it would definitely not be near the top of my priority list for what the astronomy community should do with the small amount of money it gets.
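
      A tiny illustration of the machine-precision point above: floating-point addition is not associative, so merely changing the reduction order (as running on a different number of processors would) can change the last bits of a result. A minimal Python/NumPy sketch:

          import numpy as np

          rng = np.random.default_rng(42)
          x = rng.standard_normal(1_000_000).astype(np.float32)

          serial = np.sum(x)                                       # one reduction order
          chunked = sum(np.sum(c) for c in np.array_split(x, 8))   # an "8 ranks" order

          # The two sums typically differ in the last bits, though both are valid.
          print(serial, chunked, serial == chunked)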

    • TMB Jul 12, 2013 @ 10:30

      Mark makes excellent points. Something that we certainly can (and should) make available when possible, though, is derived data. For example, for me that might include halo catalogs, merger trees, and things of that nature; for other simulations there are not always obvious individual objects that can be catalogued, but there are clearly properties of the simulation that have been measured in order to come to a scientific conclusion, and those can be made available.

  • August Muench Jul 12, 2013 @ 11:03

    I agree, TMB. Mark’s points are extremely useful for understanding the challenges of reproducibility with large-scale computational data. It almost seems to me like reproducibility scales inversely with data set size. Perhaps that is some kind of useful axiom.

    I also found it interesting that he drew a parallel between naturally stochastic events (observing the sky) and computational experiments. I wish Matt Turk would weigh in here. I think I’ll ping him.

  • Anonymous Jul 16, 2013 @ 8:56

    Another major problem with publishing all data used in a paper is follow-up studies not yet completed. For example, if I am planning a follow-up spectroscopic observing run for photometrically identified targets, why should I have to publish the positions, magnitudes, etc., of my identified targets prior to getting the follow-up data I want? I am all for “moving science forward” with open access, but the reality is that others with easier/faster access to telescope time will make those observations first. While in theory “everyone benefits” from getting the observations, I certainly do not benefit from getting scooped or having my science program done by someone else.

    • August Muench Jul 16, 2013 @ 11:21

      Hi Anonymous: A lot of people share this concern about the effects of data sharing on collaborations, follow-up research, and competition. But I am a little bit confused by your problem. You want to publish a paper of photometrically identified candidates but not list those candidates in the paper? I don’t know a referee today who would let that pass.

      My deeper problem with those expressing concern over “not being done with their data” is that “done” is so frequently arbitrarily defined. Today, spectroscopic follow-up; then mid-IR follow-up, then radial velocities, then time series; all the while producing a stack of unreproducible results in the name of avoiding the possibility of being scooped. This seems less about science and more about culture.

      Sure, I know many examples of those who are exactly in Anonymous’s situation and have had unscrupulous competitors clean out their published target lists. Does anyone have counter examples where sharing has led to new collaborations/telescope access?

  • Anonymous Jul 17, 2013 @ 9:34

    Hi Gus – I agree completely about the ambiguity in “not being done with their data”. One can always find follow-up projects to do. And, yes, it is ultimately about the culture in which science is being done. If people were less competitive, if careers depended less on paper counts/citations, if people were more open about collaborating, if it were less about egos and more about science…

    To respond briefly to your comment about referees, I think this happens more than we realize. One example: people publish color-magnitude diagrams all the time for various targets (e.g. open clusters). It is certainly not the norm that one publishes a table with RA/DEC, magnitudes, colors, etc. I agree with you that it would not pass if we’re talking about a paper on a single target (say a high z galaxy or exoplanet host star), but larger-scale data sets often do not have associated data tables.

    • August Muench Jul 17, 2013 @ 12:40

      Good point, Anonymous, about what happens when all the data behind a rich CMD/C-C diagram get opened up for immediate reuse. Authors might subselect a list of sources to highlight (and tabulate) in a paper, but limit what they call out from the full list until they have follow-up.

      Of course, a C-C diagram calling out late type dwarfs or young stars probably contains lots of potentially interesting (or not) AGN… 🙂
