When I used Python's pickle module in a project to save an object quickly, loading and saving became a bottleneck in my code as the object grew bigger and bigger during development.
import pickle
import numpy as np
a = np.ones((1000, 1000), dtype=np.float64)
file = open('file.dat', 'wb')
pickle.dump(a, file)
file.close()
# Time : 2.5 s to dump
# Size : 28 MB
file = open('file.dat', 'rb')
b = pickle.load(file)
file.close()
# Time : 30 ms to load
I found three little tricks to speed this up a bit:
- 1. Use a higher protocol version
- 2. Use cPickle instead of pickle
- 3. Disable the garbage collector
1. Use a higher protocol version
By default, pickle uses the lowest protocol (version 0) for serializing objects, so files are written in a human-readable ASCII format, which is slow and produces large files. To write in binary mode with the highest protocol version available, we just have to pass protocol=-1.
file = open('file.dat', 'wb')
pickle.dump(a, file, protocol=-1)
file.close()
# Time : 24 ms to dump
# Size : 7.7 MB
file = open('file.dat', 'rb')
b = pickle.load(file)
file.close()
# Time : 10 ms to load
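Incidentally, protocol=-1 is just shorthand for the highest protocol the running interpreter supports; a minimal sketch using the explicit constant, which some find more readable:

```python
import pickle

data = {'weights': list(range(10))}

# protocol=-1 and pickle.HIGHEST_PROTOCOL select the same binary protocol
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
print(restored == data)  # True
```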
2. Use cPickle instead of pickle
A simple tweak to make it a little bit faster is to use cPickle instead of pickle. cPickle exposes exactly the same interface as pickle - same functions, same parameters, same operations - but it is written in C, which gives us that extra bit of speed. (Note that cPickle only exists in Python 2; in Python 3 it was merged into the standard pickle module as its C accelerator.)
import cPickle
file = open('file.dat', 'wb')
cPickle.dump(a, file, protocol=-1)
file.close()
# Time : 18 ms to dump
# Size : 7.7 MB
file = open('file.dat', 'rb')
b = cPickle.load(file)
file.close()
# Time : 10 ms to load
Here, the speed gain is negligible, but in certain cases - a deep hierarchy with objects nested in objects in objects, say - the gain can be substantial.
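For code that has to run on both Python 2 and Python 3, a common pattern is a guarded import; a small sketch:

```python
# cPickle exists only in Python 2; on Python 3 the plain pickle module
# is already backed by the C implementation (_pickle) under the hood.
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3: C-accelerated by default

payload = pickle.dumps([1, 2, 3], protocol=-1)
print(pickle.loads(payload))  # [1, 2, 3]
```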
3. Disable the garbage collector
Finally, if we have a lot of objects with references to other objects, the garbage collector's cycle detection can slow down the process; disabling it while dumping and loading can improve performance.
import cPickle
import gc
gc.disable()
file = open('file.dat', 'wb')
cPickle.dump(a, file, protocol=-1)
file.close()
# Time : 18 ms to dump
# Size : 7.7 MB
file = open('file.dat', 'rb')
b = cPickle.load(file)
file.close()
gc.enable()
Here, we see no improvement because our example is too simple: a single NumPy array gives the garbage collector almost nothing to track.
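A workload where this trick is more likely to matter is loading many small, interlinked Python objects. The sketch below uses a list of dicts as a stand-in for such a structure, and wraps the disabled-collector window in try/finally so the collector is always re-enabled, even if loading raises:

```python
import gc
import pickle
import time

# Many small container objects - the kind of load where the garbage
# collector's bookkeeping can get in the way during unpickling.
data = [{'id': i, 'tags': [str(i), str(i + 1)]} for i in range(100000)]
blob = pickle.dumps(data, protocol=-1)

gc.disable()
try:
    start = time.perf_counter()
    restored = pickle.loads(blob)
    elapsed = time.perf_counter() - start
finally:
    gc.enable()  # always re-enable, even if loading fails

print('loaded %d records in %.3f s' % (len(restored), elapsed))
```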
If pickle/cPickle is still too slow after that, you can try other libraries like marshal or json, but you will lose pickle's big advantage: being able to serialize any object with an arbitrary structure without effort.
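To make that trade-off concrete, here is a small sketch with a hypothetical Point class: pickle round-trips the instance out of the box, while json refuses it unless you write a custom encoder:

```python
import json
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)

# pickle handles arbitrary objects without any extra code...
q = pickle.loads(pickle.dumps(p, protocol=-1))
print(q.x, q.y)  # 1 2

# ...while json only knows the basic types and rejects custom classes
try:
    json.dumps(p)
except TypeError as e:
    print('json failed:', e)
```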