Advanced data handling¶
We have already seen how to use batch learning in the Tutorial. The use
of itertools.repeat
to handle simple batch learning seemed rather over
expressive; it makes more sense if we consider more advanced use cases.
Mini batches¶
Consider the case where we do not want to calculcate the full gradient in each
iteration but estimate it given a mini batch. The contract of climin is that
each item of the args
iterator will be used in a single iteration and thus
for one parameter update. The way to enable mini batch learning is thus to
provide an args
argument which is an infinite list of mini batches.
Let’s revisit our example of logistic regression. Here, we created the args
iterator using itertools.repeat
on the same array again and again:
import itertools
args = itertools.repeat(([X, Z], {}))
What we want to do now is to have an infinite stream of slices of X
and
Z
. How do we access the n’th batch of X
and Z
? We offer you a
convenience function that will give you random (with or without replacement)
slices from a container:
batch_size = 100
args = ((i, {}) for i in climin.util.iter_minibatches([train_set_x, train_set_y], batch_size, [0, 0]))
The last argument, [0, 0]
gives the axes along which to slice [X, Z]
.
In some cases, samples might be aligned along axis 0
for the input, but
along axis 1
in the target data.
External memory¶
What is nice about climin.util.iter_minibatches
is that it needs only slices
as a requirement for its arguments. We therefore only need to pass it a data
structure which reads data from disk as soon as it is needed and disposes of it
as soon as it is not any more.
HDF5 and its python package h5py are a perfect match for this. We have managed to use 6+ GB sized image data sets on GPUs with less than 2 GB of RAM with this simple recipe:
import climin.util
import gnumpy
import h5py
f = h5py.File('data.h5')
ds = f['inpts']
args = climin.util.iter_minibatches([ds], 100, [0])
args = (gnumpy.garray(i) for i in args)
# ...
This is in general not restricted by the size of the data set; it just show that going beyond the GPU RAM limit is achieved very naturally in climin.
Further usages¶
This architecture can be exploited in many different ways. E.g., a stream over a network can be directly used. A single pass over a file without keeping more than necessary is another option.