Options for Saving in Python

by Jessica Lu on July 29, 2013

Question: 

What is the best way to store data that is easily readable and writable in Python?

Answer:

There isn’t a single correct answer as it will largely depend on how large your data set is, how fast you have to read/write it, and whether it needs to be readable by other applications or languages. Below I have compiled a few suggestions and example code snippets.

astropy

For most tabular data, this is probably your best bet. Astropy aims to be the “do-everything” Python package for astronomy. Its astropy.table subpackage reads data in many different formats (ASCII, FITS, HDF5, SQL, etc.) and writes it out in just as many. As a general rule of thumb, FITS (or another binary file format) is a better option than ASCII for saving tabular data: binary formats are relatively compact and much quicker to read. (Just trust us on this one; the full justification is worthy of its own post.) The columns are numpy arrays, so they are easy to work with. astropy.table should replace all uses of atpy (see below).

from astropy import table
import astropy.units as u
import numpy as np
 
# Create table from scratch
ra = np.random.random(5)
t = table.Table()
t.add_column(table.Column(name='ra', data=ra, unit=u.degree))
 
# Write out to file (overwrite=True replaces an existing file)
t.write('myfile.fits', overwrite=True)  # also supports HDF5, ASCII, etc.
 
# Read in from file
t = table.Table.read('myfile.fits')

numpy

If your data is all numbers and not too big, you can save it to a numpy (*.npy) file; a 2D image stored as a numpy array is a typical example. This method also supports record arrays (arrays with column names). One caution: future computer upgrades that change the default size of a Python “float” might require special handling when you read in old numpy save files.

import numpy as np
 
# Create array (here a 1000 x 1000 x 5 cube of random numbers)
ra = np.random.random(size=(1000, 1000, 5))
 
# Save to file
filename = 'myfile.npy'
np.save(filename, ra)
 
# Read from file
ra = np.load(filename)
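
The record-array case and the caution about float sizes can be sketched together. This is a minimal sketch, not the only way to do it: the column names, values, and temporary file path are made up for illustration. It spells out the dtype of each column (for example '<f8', a little-endian 64-bit float) rather than relying on the platform default. The .npy header records the dtype, so the loaded array comes back with the same column names and types.

```python
import os
import tempfile

import numpy as np

# Record (structured) array with named columns and explicit dtypes.
# The column names and values are hypothetical, chosen for illustration.
stars = np.zeros(3, dtype=[('name', 'U10'), ('ra', '<f8'), ('dec', '<f8')])
stars['name'] = ['a', 'b', 'c']
stars['ra'] = [10.5, 20.25, 30.0]
stars['dec'] = [-5.0, 0.0, 5.5]

# Save to a temporary location (any writable path would do)
path = os.path.join(tempfile.mkdtemp(), 'stars.npy')
np.save(path, stars)

# Read back: column names and dtypes are restored from the file header
loaded = np.load(path)
print(loaded.dtype.names)
print(loaded['ra'].dtype)
```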

pickle and cPickle

If you have something more complicated than a table or a numpy array, then you probably want to pickle it. An excellent pickle tutorial for intermediate users (e.g. pickle the “right way”) is presented on Doug Hellmann’s blog.

import pickle  # or import cPickle as pickle
 
# Create dictionary, list, etc.
favorite_color = { "lion": "yellow", "kitty": "red" }
 
# Write to file (binary mode; the with-block closes the file for us)
with open('myfile.pickle', 'wb') as f_myfile:
    pickle.dump(favorite_color, f_myfile)
 
# Read from file
with open('myfile.pickle', 'rb') as f_myfile:
    favorite_color = pickle.load(f_myfile)  # variables come out in the order you put them in
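
The remark that variables come out in the order you put them in matters once a file holds more than one object. A minimal sketch, assuming two made-up variables and a temporary file path: each pickle.dump call appends one object to the file, and each pickle.load call returns the next one in the same order.

```python
import os
import pickle
import tempfile

# Two unrelated objects (hypothetical example data)
favorite_color = {"lion": "yellow", "kitty": "red"}
magnitudes = [12.1, 13.4, 15.0]

path = os.path.join(tempfile.mkdtemp(), 'multi.pickle')

# Dump the objects one after another into the same file
with open(path, 'wb') as f:
    pickle.dump(favorite_color, f)
    pickle.dump(magnitudes, f)

# Load them back in the same order they were written
with open(path, 'rb') as f:
    first = pickle.load(f)
    second = pickle.load(f)
```

A third pickle.load on the same file would raise EOFError, since only two objects were written.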

atpy

This used to be my preferred package, but it has been merged into astropy as astropy.table (described above) and is no longer supported. I have included an example here for completeness. We wrote up an earlier AstroBetter blog post with a quick tutorial on atpy.

import atpy
import numpy as np
 
# Create table from scratch
ra = np.random.random(5)
t = atpy.Table()
t.add_column('ra', ra, unit='deg')
 
# Write table to file
t.write('myfile.fits')  # also supports HDF5, ASCII, etc.
 
# Read table from file
t = atpy.Table('myfile.fits')

Which method do you prefer and why? If you’re not sure which one to use, ask in the comments!

6 comments

1 John July 29, 2013 at 6:41 am

Jessica refers to this tangentially, but I think a really key question you should ask yourself before diving in is: what are you actually saving data for? This should inform your choice of method at least as much as any other consideration.

This applies most starkly to the idea of “pickling” your data. Pickling is really a means of serializing an instance of a class such that it can later be reconstituted. Which is fine, insofar as it goes: often, your data will be nicely represented by some particular object in your code, and you can just dump that to disk. Job done.

But: what you’ve stored is very specific to the particular implementation of your code and the environment it’s running in. That’s great, since it means you can just load the pickled object tomorrow and pick up where you left off. But it’s not a good archival format: in years to come, you’ll need to be able to replicate the environment you used to pickle the data in order to be able to reinterpret it. Worse, if you want to distribute the data, everybody will need to be able to replicate your setup! For archival purposes, therefore, you’re better off storing the data as divorced from any implementation issues as possible. Store it in a well-defined, well-structured form that is independent of the details of the particular system you happen to be running today (but which, of course, is easy for you to work with): something like HDF5 or FITS should be fine.

2 Nathan Goldbaum July 29, 2013 at 2:52 pm

I just want to add to the really excellent comment above that the h5py and pytables libraries offer extremely intuitive Python interfaces to the HDF5 library, making it almost trivial to save ND datasets to disk.

3 Chris Beaumont July 29, 2013 at 8:28 am

For small (<100K items) data, I'm increasingly using JSON. It's a common format for web data, more flexible than "rectangular" tables, human readable, and parses unambiguously (no more messing with options like delimiter / comment field / skip rows!). Python has a json module to handle IO, and the functions behave just like the pickle module.
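
As a minimal sketch of how closely the json functions mirror pickle's (the dictionary and temporary file path are made up for illustration): json.dump and json.load take an object and a file handle just as pickle.dump and pickle.load do, except that the file is opened in text mode rather than binary mode.

```python
import json
import os
import tempfile

# Hypothetical example data; JSON object keys must be strings
favorite_color = {"lion": "yellow", "kitty": "red"}

path = os.path.join(tempfile.mkdtemp(), 'myfile.json')

# Write to file (text mode, unlike pickle's binary mode)
with open(path, 'w') as f:
    json.dump(favorite_color, f, indent=2)

# Read from file
with open(path) as f:
    restored = json.load(f)
```

One difference worth knowing: JSON only has a handful of types, so a Python tuple comes back as a list and arbitrary class instances cannot be dumped without extra code.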

4 Gregory P. Smith July 29, 2013 at 2:57 pm

I strongly advise against using pickle as a format. Ever. It ties your data to Python and tries to do too much, such as storing arbitrary objects and code directly. It is hard to maintain over the long term, which is something astronomy data often needs.

For a future-proof binary format with a defined structure and cross-language support, consider Google’s Protocol Buffers (or Facebook’s Thrift, which is pretty much the same thing; created before Google open sourced protobufs).

Otherwise JSON is universally supported. It lacks any structure definition so you’ll need to do that yourself.

Neither of these will be optimal for arrays of related values, where you may want delta encoding or other similar tricks to shrink the data size. Anything like that, however, means that reading the data back involves more specialized code in every language. You can do that on top of any of these formats.
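
To make the delta-encoding idea concrete, here is a minimal sketch layered on top of JSON. The helper names delta_encode and delta_decode, the sample timestamps, and the use of zlib for compression are all assumptions made for illustration, not part of any of the libraries mentioned above. The idea is to store the first value followed by the differences between neighbours, which are small and repetitive for regularly spaced data and therefore compress well.

```python
import json
import zlib


def delta_encode(values):
    """Return the first value followed by successive differences."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


# Hypothetical data: 1000 integer timestamps spaced 30 seconds apart
timestamps = [1375000000 + 30 * i for i in range(1000)]
encoded = delta_encode(timestamps)

# Compare compressed sizes of the raw and delta-encoded JSON text
raw_size = len(zlib.compress(json.dumps(timestamps).encode('utf-8')))
delta_size = len(zlib.compress(json.dumps(encoded).encode('utf-8')))
print(raw_size, delta_size)
```

The cost is exactly the one described above: anyone reading the file, in any language, needs the matching decode step.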

5 Lisa August 8, 2013 at 3:39 pm

I have been having great success working with npy arrays. For bigger data sets, I use h5py to write into HDF5 format. Both are easy to work with within Python.

6 Error September 23, 2013 at 12:30 pm

There is an error in your pickle code. See the following link for a correction:

http://stackoverflow.com/questions/18963949/error-pickling-in-python-io-unsupportedoperation-read
