How to make pickle faster?

When I used Python's pickle module in a project to save an object, loading and saving quickly became a bottleneck in my code as the object grew bigger and bigger during development.

import pickle
import numpy as np

a = np.ones((1000, 1000), dtype=np.float64)
file = open('file.dat', 'wb')
pickle.dump(a, file)
file.close()
# Time : 2.5 s to dump
# Size : 28 MB

file = open('file.dat', 'rb')
b = pickle.load(file)
file.close()
# Time : 30 ms to load
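
The times in the comments will of course vary from machine to machine; as a minimal sketch, this is one way such a number can be measured with the time module:

import time

start = time.time()
file = open('file.dat', 'wb')
pickle.dump(a, file)
file.close()
# Elapsed wall-clock time of the dump, in seconds
print('dump took %.3f s' % (time.time() - start))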

I found three little tricks to speed things up a bit:

1. Use a higher protocol version

By default, pickle uses the lowest protocol for serializing objects and writing them to file. As a result, files are written in ASCII mode, which is slow and produces large files. To write in binary mode and use the highest protocol version available, we just have to set the protocol parameter to -1.

file = open('file.dat', 'wb')
pickle.dump(a, file, protocol=-1)
file.close()
# Time : 24 ms to dump
# Size : 7.7 MB

file = open('file.dat', 'rb')
b = pickle.load(file)
file.close()
# Time : 10 ms to load
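
Note that protocol=-1 simply means "use the highest protocol available"; if you prefer to be explicit, the same value is exposed as pickle.HIGHEST_PROTOCOL:

file = open('file.dat', 'wb')
# Equivalent to protocol=-1: explicitly request the highest (binary) protocol
pickle.dump(a, file, protocol=pickle.HIGHEST_PROTOCOL)
file.close()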

2. Use cPickle instead of pickle

A simple tweak to make it a little faster is to use cPickle instead of pickle. cPickle is exactly the same module as pickle - same functions, same parameters, same operations - but it is written in C, which gives us an extra bit of speed.

import cPickle
file = open('file.dat', 'wb')
cPickle.dump(a, file, protocol=-1)
file.close()
# Time : 18 ms to dump
# Size : 7.7 MB

file = open('file.dat', 'rb')
b = cPickle.load(file)
file.close()
# Time : 10 ms to load

Here, the speed gain is completely negligible, but in certain cases, such as a deep hierarchy of objects nested inside objects, the gain can be substantial.
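
cPickle only exists in Python 2; in Python 3 the pickle module uses the C implementation automatically. If your code has to run on both, a common pattern is a guarded import like this sketch:

try:
    import cPickle as pickle   # Python 2: use the fast C implementation
except ImportError:
    import pickle              # Python 3: pickle already uses C under the hood

file = open('file.dat', 'wb')
pickle.dump(a, file, protocol=-1)
file.close()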

3. Disable the garbage collector

Finally, if we have a lot of objects with references to other objects, the garbage collector can slow down the process; disabling it can improve performance.

import cPickle
import gc

gc.disable()

file = open('file.dat', 'wb')
cPickle.dump(a, file, protocol=-1)
file.close()
# Time : 18 ms to dump
# Size : 7.7 MB

file = open('file.dat', 'rb')
b = cPickle.load(file)
file.close()

gc.enable()

Here, we see no improvement because our example is too simple: a single numpy array does not create the many intermediate objects that would normally give the garbage collector work to do.
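
One caveat worth adding (a suggestion of mine, not something pickle requires): if the dump raises an exception while the collector is disabled, the gc.enable() call is never reached. Wrapping the work in try/finally guarantees the collector is switched back on:

import cPickle
import gc

gc.disable()
try:
    file = open('file.dat', 'wb')
    cPickle.dump(a, file, protocol=-1)
    file.close()
finally:
    # Re-enable the garbage collector even if dump() raised an exception
    gc.enable()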

If pickle/cPickle is still too slow after that, you can try other libraries like marshal or JSON, but you will lose the big advantage of being able to serialize any object with an arbitrary structure without effort.
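
For example, marshal and json only handle built-in types (neither would accept the numpy array used above without converting it first), but for plain lists and dicts they look like this sketch (the file names are just placeholders):

import json
import marshal

data = {'weights': [1.0, 2.0, 3.0], 'label': 'example'}  # built-in types only

file = open('file_marshal.dat', 'wb')
marshal.dump(data, file)   # fast, but limited to built-in types and not version-stable
file.close()

file = open('file.json', 'w')
json.dump(data, file)      # human-readable, but no arbitrary Python objects
file.close()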