What if I run the collate_fn outside the DataLoader? - machine-learning

The traditional way of applying some arbitrary collate_fn foo() in PyTorch code is:
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,  # just for example
    collate_fn=foo,
    **other_kwargs,
)
for batch in dataloader:
    # incoming batch is already collated
    do_stuff(batch)
But what if (for whatever reason) I wanted to do it like this:
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,  # just for example
    collate_fn=lambda samples: samples,  # identity collate, so the raw list of samples is returned
    **other_kwargs,
)
for batch in dataloader:
    # incoming batch is not yet collated
    # this lets me do additional pre-collation stuff like
    # batch = do_stuff_precollate(batch)
    collated_batch = foo(batch)  # finally we collate, outside of the dataloader
    do_stuff(collated_batch)
Is there any reason why the latter is a big no-no? Or why the former is particularly advantageous? I found a blog post that even suggests that, for HF tokenisation, the latter is faster.
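For concreteness, here is a minimal sketch of what a collate_fn such as foo might look like for variable-length token sequences (hypothetical; the post does not show foo's implementation). It pads each batch to its longest sequence and stacks the labels:
import torch
from torch.nn.utils.rnn import pad_sequence

def foo(batch):
    # batch is a list of (token_ids, label) samples from the dataset
    token_ids = [torch.as_tensor(ids) for ids, _ in batch]
    labels = torch.as_tensor([label for _, label in batch])
    padded = pad_sequence(token_ids, batch_first=True, padding_value=0)  # pad to longest in batch
    return padded, labels
Wherever foo runs, it receives the same list of samples; the main practical difference is that inside the DataLoader it executes in the worker processes (when num_workers > 0), whereas in the loop body it runs in the main process.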

Related

Cuda streams synchronization issue with cupy and tensorRT

I am working with TensorRT and cupy. The following code does not wait for the CUDA calls to be executed if I set cp.cuda.Stream(non_blocking=True), while it works perfectly with non_blocking=False.
Why shouldn't it work with non_blocking=True? I checked the input data and it is fine, but the code ends up with my model returning random detections (random data), meaning that there are some synchronization issues.
# Select stream
stream.use()
# Copy cupy array to the buffer
input_images = cp.array(batch_input_image)
cp.copyto(cuda_inputs[0], input_images)
# Run inference.
context.execute_async(bindings=bindings, stream_handle=stream.ptr, batch_size=len(batch_input_image))
# Copy results from the buffer
output_images = cuda_outputs[0].copy()
# Split results into batch
list_output = cp.split(output_images, indices_or_sections=len(batch_input_image), axis=0)
# Squeeze output arrays to remove axis of length one
list_output = [cp.squeeze(array) for array in list_output]
# Synchronize the stream
stream.synchronize()
After receiving some support from Nvidia, I can confirm this was not a cupy issue. It seems to be a problem with the C++ code of the TensorRT model, as discussed here: github.com/cupy/cupy/issues/6104.

Does dissecting a Pytorch model lower memory usage?

Suppose I have a Pytorch autoencoder model defined as:
class ae(torch.nn.Module):
    def __init__(self, z_dim, n_channel=3, size_=8):
        super(ae, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()

    def forward(self, x):
        z = self.encoder(x)
        x_reconstructed = self.decoder(z)
        return z, x_reconstructed
Now, instead of defining a specific ae model and loading it, I can use the Encoder and Decoder code directly in my code. I know the total number of parameters wouldn't change, but here's my question: since these two models are now separated, is it possible that the code can run with lower RAM/GPU memory? Does separating them mean they do not need to be loaded into memory at once?
(Note that the autoencoder is just an example; my question is really about any model that consists of several sub-modules.)
is it possible that the code can run with lower RAM/GPU memory?
The way you created it right now, no, it isn't. If you instantiate it and move it to the device, something along these lines:
encoder = ...
decoder = ...
autoencoder = ae(encoder, decoder).to("cuda")
It will take, in total, encoder + decoder GPU memory when moved to the device, and both parts will be loaded into memory at once.
But, instead, you could do this:
inputs = ...
inputs = inputs.to("cuda")
encoder = ...
encoder.to("cuda")
output = encoder(inputs)
encoder.to("cpu") # Free GPU memory
decoder = ...
decoder.to("cuda") # Uses less in total
result = decoder(output)
You could wrap this idea in a model (or a function), but you would still have to wait for parts of the network to be copied to the GPU, so performance will be worse (while peak GPU memory usage is lower).
Depending on where you instantiate the models, the RAM footprint could also be lower (Python automatically destroys objects that go out of function scope). Let's look at this option (no need to cast back to cpu, as the object is garbage collected automatically, as mentioned above):
def encode(inputs):
    encoder = ...
    encoder.to("cuda")
    results = encoder(inputs)
    return results

def decode(inputs):
    decoder = ...
    decoder.to("cuda")
    return decoder(inputs)

outputs = encode(inputs)
result = decode(outputs)
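One caveat worth adding (an assumption beyond the original answer): PyTorch's caching allocator keeps freed blocks reserved for the process, so letting the encoder fall out of scope does not by itself hand the memory back to the driver. A minimal sketch of making the release explicit, with make_encoder standing in for however you build the model:
import gc
import torch

def encode(make_encoder, inputs):
    # make_encoder is any callable that builds the encoder module on the CPU
    encoder = make_encoder().to("cuda")
    with torch.no_grad():  # skip autograd bookkeeping during inference
        results = encoder(inputs)
    del encoder                # drop the last reference to the weights
    gc.collect()
    torch.cuda.empty_cache()   # return cached blocks to the driver
    return results
The same pattern applies to the decode step.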

Batch results of intermediate dask computation

I have a large (10s of GB) CSV file that I want to load into dask, and for each row, perform some computation. I also want to write the results of the manipulated CSV into BigQuery, but it'd be better to batch network requests to BigQuery in groups of say, 10,000 rows each, so I don't incur network overhead per row.
I've been looking at dask delayed and see that you can create an arbitrary computation graph, but I'm not sure if this is the right approach: how do I collect and fire off intermediate computations based on some group size (or perhaps time elapsed). Can someone provide a simple example on that? Say for simplicity we have these functions:
def change_row(r):
    # Takes 10ms
    r = some_computation(r)
    return r

def send_to_bigquery(rows):
    # Ideally, in large-ish groups, say 10,000 rows at a time
    make_network_request(rows)

# And here's how I'd use it
import dask.dataframe as dd
df = dd.read_csv('my_large_dataset.csv')  # 20 GB
# run change_row(r) for each r in df
# run send_to_bigquery(rows) for each appropriately sized group of change_row(r) results
Thanks!
The easiest thing you can do is provide a blocksize parameter to read_csv, which will get you approximately the right number of rows per block. You may need to measure some of your data, or experiment, to get this right.
The rest of your task works the same way as any other "do this generic thing to blocks of a dataframe" problem: the map_partitions method (see the docs).
def alter_and_send(df):
    rows = [change_row(r) for _, r in df.iterrows()]  # iterrows yields (index, row) pairs
    send_to_bigquery(rows)
    return df

df.map_partitions(alter_and_send).compute()
Basically, you are running the function on each piece of the logical dask dataframe, where each piece is a real pandas dataframe.
You may actually want map, apply or other dataframe methods within the function.
This is one way to do it - you don't really need the "output" of the map (the function is called for its side effect of sending rows to BigQuery), and you could have used to_delayed() instead; see the sketch below.
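For completeness, a minimal sketch of the to_delayed() alternative mentioned above, reusing the change_row and send_to_bigquery helpers from the question (the blocksize value is just an illustrative guess):
import dask
import dask.dataframe as dd

df = dd.read_csv('my_large_dataset.csv', blocksize='64MB')  # tune so a block holds roughly 10,000 rows

@dask.delayed
def alter_and_send(pdf):
    # pdf is a plain pandas DataFrame holding one partition
    rows = [change_row(r) for _, r in pdf.iterrows()]
    send_to_bigquery(rows)

# One delayed task per partition; compute() runs them all.
tasks = [alter_and_send(part) for part in df.to_delayed()]
dask.compute(*tasks)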

large numpy matrix as dataflow side input

I'm trying to write a Dataflow pipeline in Python that requires a large numpy matrix as a side input. The matrix is saved in cloud storage. Ideally, each Dataflow worker would load the matrix directly from cloud storage.
My understanding is that if I say matrix = np.load(LOCAL_PATH_TO_MATRIX), and then
p | "computation" >> beam.Map(computation, matrix)
the matrix gets shipped from my laptop to each Dataflow worker.
How could I instead direct each worker to load the matrix directly from cloud storage? Is there a beam source for "binary blob"?
Your approach is correct.
What Dataflow does, in this case, is handle the NumPy matrix as a side input. This means that it's uploaded once from your machine to the service, and the Dataflow service will send it to each worker.
Given that the matrix is large, this will make your workers use I/O to receive it from the service, and carry the burden of keeping the whole matrix in memory, but it should work.
If you want to avoid computing/loading the matrix on your machine, you can upload your matrix to GCS as a text file, have each worker read it in, and obtain the matrix from it. You can do something like this:
matrix_file = 'gs://mybucket/my/matrix'
p | beam.ParDo(ComputationDoFn(matrix_file))
And your DoFn could be something like:
class ComputationDoFn(beam.DoFn):
    def __init__(self, matrix_file):
        self._matrix_file = matrix_file
        self._matrix = None

    def start_bundle(self):
        # We check because one DoFn instance may be reused
        # for different bundles.
        if self._matrix is None:
            self.load_matrix(self._matrix_file)

    def process(self, element):
        # Now process the element
        ...

    def load_matrix(self, matrix_file):
        # Load the file from GCS using the GCS API
        ...
I hope this makes sense. I can flesh out the functions if you feel like you need some more help.
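As a concrete (though hypothetical) sketch of the load_matrix step, assuming the matrix is stored on GCS as a .npy file and using Beam's FileSystems API to read it on the worker:
import io
import numpy as np
from apache_beam.io.filesystems import FileSystems

def load_matrix(self, matrix_file):  # drop-in for the method in ComputationDoFn above
    # Read the whole object from GCS and deserialize it with NumPy.
    with FileSystems.open(matrix_file) as f:
        data = f.read()
    self._matrix = np.load(io.BytesIO(data))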

save binarizer together with sklearn model

I'm trying to build a service that has 2 components. In component 1, I train a machine learning model using sklearn by creating a Pipeline. This model gets serialized using joblib.dump (really numpy_pickle.dump). Component 2 runs in the cloud, loads the model trained by (1), and uses it to label text that it gets as input.
I'm running into an issue where, during training (component 1) I need to first binarize my data since it is text data, which means that the model is trained on binarized input and then makes predictions using the mapping created by the binarizer. I need to get this mapping back when (2) makes predictions based on the model so that I can output the actual text labels.
I tried adding the binarizer to the pipeline like this, thinking that the model would then have the mapping itself:
p = Pipeline([
    ('binarizer', MultiLabelBinarizer()),
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])
But I get the following error:
model = p.fit(training_features, training_tags)
*** TypeError: fit_transform() takes 2 positional arguments but 3 were given
My goal is to make sure the binarizer and model are tied together so that the consumer knows how to decode the model's output.
What are some existing paradigms for doing this? Should I be serializing the binarizer together with the model in some other object that I create? Is there some other way of passing the binarizer to Pipeline so that I don't have to do that, and would I be able to get the mappings back from the model if I did that?
Your intuition that you should add the MultiLabelBinarizer to the pipeline was the right way to approach this problem. It would have worked, except that MultiLabelBinarizer.fit_transform does not have the fit_transform(self, X, y=None) signature that is now standard for sklearn estimators. Instead, it has a unique fit_transform(self, y) signature that I had never noticed before. As a result of this difference, when you call fit on the pipeline, it tries to pass training_tags as a third positional argument to a function that only accepts two, which doesn't work.
The solution to this problem is tricky. The cleanest way I can think of to work around it is to create your own MultiLabelBinarizer subclass that overrides fit_transform and ignores the extra y argument. Try something like the following.
from sklearn.preprocessing import MultiLabelBinarizer

class MyMLB(MultiLabelBinarizer):
    def fit_transform(self, X, y=None):
        # Ignore y and defer to MultiLabelBinarizer's own fit_transform
        return super(MyMLB, self).fit_transform(X)
Try adding this to your pipeline in place of the MultiLabelBinarizer and see what happens. If you're able to fit() the pipeline, the last problem that you'll have is that your new MyMLB class has to be importable on any system that will de-pickle your now trained, pickled pipeline object. The easiest way to do this is to put MyMLB into its own module and place a copy on the remote machine that will be de-pickling and executing the model. That should fix it.
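To illustrate that last point, a minimal sketch of the consumer side, assuming (hypothetically) that MyMLB lives in a module named my_mlb.py and the pipeline was pickled to trained_pipeline.pkl:
import pickle

# my_mlb.py (containing the MyMLB class) must be importable on this machine;
# pickle resolves the class by module path when loading the object.
import my_mlb  # noqa: F401  -- fails fast here if the module is missing

with open('trained_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)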
I misunderstood how the MultiLabelBinarizer worked. It is a transformer of outputs, not of inputs. Not only does this explain the alternative fit_transform() method signature for that class, but it also makes it fundamentally incompatible with inclusion in a single classification pipeline, which is limited to transforming inputs and making predictions of outputs. However, all is not lost!
Based on your question, you're already comfortable with serializing your model to disk as [some form of] a .pkl file. You should be able to also serialize a trained MultiLabelBinarizer, and then unpack it and use it to decode the outputs from your pipeline. I know you're using joblib, but I'll write up this sample code as if you're using pickle. I believe the idea will still apply.
import pickle

X = <training_data>
y = <training_labels>

# Binarize the class labels for multi-label classification.
mlb = MultiLabelBinarizer()
multilabel_y = mlb.fit_transform(y)

p = Pipeline([
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])

# Use the multilabel classes to fit the pipeline.
p.fit(X, multilabel_y)

# Serialize both the pipeline and binarizer to disk.
with open('my_sklearn_objects.pkl', 'wb') as f:
    pickle.dump((mlb, p), f)
Then, after shipping the .pkl file to the remote box...
# Hydrate the serialized objects.
with open('my_sklearn_objects.pkl', 'rb') as f:
    mlb, p = pickle.load(f)

X = <input data>  # Get your input data from somewhere.

# Predict the classes using the pipeline.
mlb_predictions = p.predict(X)

# Turn those classes into labels using the binarizer.
classes = mlb.inverse_transform(mlb_predictions)

# Do something with the predicted classes.
<...>
Is this the paradigm for doing this? As far as I know, yes. Not only that, but if you want to keep them together (which is a good idea, I think), you can serialize them as a tuple, as I did in the example above, so they stay in a single file. No need to serialize a custom object or anything like that.
Model serialization via pickle et al. is the sklearn-approved way to save estimators between runs and move them between computers. I've used this process successfully many times before, including in production systems.
