Does dissecting a Pytorch model lower memory usage?

Does dissecting a Pytorch model lower memory usage? - machine-learning

Suppose I have a Pytorch autoencoder model defined as:
class ae(torch.nn.Module):
def __init__(self, z_dim, n_channel=3, size_=8):
super(ae, self).__init__()
self.encoder = Encoder()
self.decoder = Decoder()
def forward(self, x):
z = self.encoder(x)
x_reconstructed = self.decoder(z)
return z, x_reconstructe
Now instead of defining an specific ae model and loading it, I can use the Encoder and Decoder code directly in my code. I know the number of total parameters wouldn't change but here's my question: since these two models are now separated, is it possible that the code can run on lower ram/gpu-memory? Does separating them means they do not need to be loaded into memory at once?
(Note that autoencoder is just an example, My question is really about any models that consists of several sub-modules).

is it possible that the code can run on lower ram/gpu-memory?
The way you created it right now no, it isn't. If you instantiate it and move to device, something along those lines:
encoder = ...
decoder = ...
autoencoder = ae(encoder, decoder).to("cuda")
It will take, in total, decoder + encoder GPU memory when moved to the device and will be loaded to memory at once.
But, instead, you could do this:
inputs = ...
inputs = inputs.to("cuda")
encoder = ...
encoder.to("cuda")
output = encoder(inputs)
encoder.to("cpu") # Free GPU memory
decoder = ...
decoder.to("cuda") # Uses less in total
result = decoder(output)
You could wrap this idea in model (or function), still one would have to wait for parts of the network to be copied to GPU and your performance will be inferior (but GPU memory will be smaller).
Depending on where you instantiate the models RAM memory footprint could also be lower (Python will automatically destroy object in function scope), let's look at this option (no need for casting to cpu as the object will be automatically garbage collected as mentioned above):
def encode(inputs):
encoder = ...
encoder.to("cuda")
results = encoder(inputs)
return results
def decode(inputs):
decoder = ...
decoder.to("cuda")
return decoder(inputs)
outputs = encode(inputs)
result = decode(outputs)

Related

PyDrake CollisionFilterManager Not Applying Filter

I have a system that consists of a robotic manipulator and an object. I want to evaluate signed distances between all collision geometries in the system while excluding collisions between the fingertips of the end-effector and the convex geometries that make up the collision geometry of the object. However, when I use the CollisionFilterManager to try to apply the relevant exclusions, my code still computes signed distances between fingertips and object geometries when calling ComputeSignedDistancePairClosestPoints() much later on downstream.
I have a container class that has the plant and its SceneGraph as attributes. When initializing this class, I try to filter the collisions. Here are the relevant parts of the code:
class SimplifiedClass:
def __init__(self, ...):
# initializing plant, contexts, and query object port
self.diagram = function_generating_diagram_with_plant(...)
self.plant = self.diagram.GetSubsystemByName("plant")
self.scene_graph = self.diagram.GetSubsystemByName("scene_graph")
_diag_context = self.diagram.CreateDefaultContext()
self.plant_context = self.plant.GetMyMutableContextFromRoot(_diag_context)
self.sg_context = self.scene_graph.GetMyMutableContextFromRoot(_diag_context)
self.qo_port = self.scene_graph.get_query_output_port()
# applying filters
cfm = self.scene_graph.collision_filter_manager()
inspector = self.query_object.inspector()
fingertip_geoms = []
obj_collision_geoms = []
gids = inspector.GetAllGeometryIds()
for g in gids:
name = inspector.GetName(g)
# if name.endswith("ds_collision") or name.endswith("collision_1"):
if name.endswith("tip_collision_1") and "algr" in name:
fingertip_geoms.append(g)
elif name.startswith("obj_collision"):
obj_collision_geoms.append(g)
ftip_set = GeometrySet(fingertip_geoms)
obj_set = GeometrySet(obj_collision_geoms)
cfm.Apply(
CollisionFilterDeclaration()
.ExcludeBetween(ftip_set, obj_set)
.ExcludeWithin(obj_set)
)
#property
def query_object(self):
return self.qo_port.Eval(self.sg_context)
def function_that_computes_signed_distances(self, ...):
# this function calls ComputeSignedDistancePairClosestPoints(), which
# computes signed distances even between filtered geometry pairs
I've confirmed that the correct geometry IDs are being retrieved during initialization (at least the names are correct), but if I do a simple test where I see how many SignedDistancePairs are returned from calling ComputeSignedDistancePairClosestPoints() before and after the attempt to apply the filter, the same number of signed distance pairs are returned, which implies the filter had no effect even immediately after declaring it. I also confirm the geometries that should be filtered are not by examining the names associated with signed distance pairs during my downstream function call.
Is there an obvious bug in my code? If not, where else could the bug be located besides here?

The problem is an age-old problem: model vs context.
In short, SceneGraph stores an interior model so you can construct as you go. When you create a context a copy of that model is placed in the context. That copy is independent. If you continue to modify SceneGraph's model, you'll only observe a change in future contexts you allocate.
In your code above, you've already allocated a context. You acquire a collision filter manager using cfm = self.scene_graph.collision_filter_manager(). This is the SceneGraph model version. You want the other one where you get the manager from the context: self.scene_graph.collision_filter_manager(self.sg_context) as documented here.
Alternatively, you can modify the collision filters before you allocate a context. Or throw out the old context and reallocate. All are viable choices.

What if I run the collate_fn outside the DataLoader?

The traditional way of applying some arbitrary collate_fn foo() in torch code is
dataloader = torch.data.DataLoader(
dataset,
batch_size=64, # just for example
collate_fn=foo,
**other_kwargs
)
for batch in dataloader:
# incoming batch is already collated
do_stuff(batch)
But what if (for whatever reason), I wanted to do it like this:
dataloader = torch.data.DataLoader(
dataset,
batch_size=64, # just for example
**other_kwargs
)
for batch in dataloader:
# incoming batch is not yet collated
# this let's me do additional pre-collation stuff like
# batch = do_stuff_precollate(batch)
collated_batch = foo(batch) # finally we collate, outside of the dataloader
do_stuff(collated_batch)
Is there any reason why the latter is a big nono? Or why the former is particularly advantageous? I found a blogpost that even suggests that for HF tokenisation, the latter is faster

How does one dynamically add new parameters to optimizers in Pytorch?

I was going through this post in the pytorch forum, and I also wanted to do this. The original post removes and adds layers but I think my situation is not that different. I also want to add layers or more filters or word embeddings. My main motivation is that the AI agent does not know the whole vocabulary/dictionary in advance because its large. I prefer strongly (for the moment) to not do character by character RNNs.
So what will happen for me is when the agent starts a forward pass it might find new words it has never seen and will need to add them to the embedding table (or perhaps add new filters before it starts the forward pass).
So what I want to make sure is:
embeddings are added correctly (at the right time, when a new computation graph is made) so that they are updatable by the optimizer
no issues with stored info of past parameters e.g. if its using some sort of momentum
How does one do this? Any sample code that works?

Just to add an answer to the title of your question: "How does one dynamically add new parameters to optimizers in Pytorch?"
You can append params at any time to the optimizer:
import torch
import torch.optim as optim
model = torch.nn.Linear(2, 2)
# Initialize optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001, momentum=0.9)
extra_params = torch.randn(2, 2)
optimizer.param_groups.append({'params': extra_params })
#then you can print your `extra_params`
print("extra params", extra_params)
print("optimizer params", optimizer.param_groups)

That is a tricky question, as I would argue that the answer is "depends", in particular on how you want to deal with the optimizer.
Let's start with your specific problem - an embedding. In particular, you are asking on how to add embeddings to allow for a larger vocabulary dynamically. My first advice is, that if you have a good sense of an upper boundary of your vocabulary size, make the embedding large enough to cope with it from the beginning, as this is more efficient, and as you will need the memory eventually anyway. But this is not what you asked. So - to dynamically change your embedding, you'll need to overwrite your old one with a new one, and inform your optimizer of the change. You can simply do that whenever you run into an exception with your old embedding, in a try ... except block. This should roughly follow this idea:
# from within whichever module owns the embedding
# remember the already trained weights
old_embedding_weights = self.embedding.weight.data
# create a new embedding of the new size
self.embedding = nn.Embedding(new_vocab_size, embedding_dim)
# initialize the values for the new embedding. this does random, but you might want to use something like GloVe
new_weights = torch.randn(new_vocab_size, embedding_dim)
# as your old values may have been updated, you want to retrieve these updates values
new_weights[:old_vocab_size] = old_embedding_weights
self.embedding.weights.data.copy_(new_weights)
However, you should not do this for every single new word you receive, as this copying takes time (and a whole lot of memory, as the embedding exists twice for a short time - if you're nearly out memory, just make your embedding large enough from the start). So instead increase the size dynamically by a couple of hundred slots at a time.
Additionally, this first step already raises some questions:
How does my respective nn.Module know about the new embedding parameter?
The __setattr__ method of nn.Module takes care of that (see here)
Second, why don't I simply change my parameter? That's already pointing towards some of the problems of changing the optimizer: pytorch internally keeps references by object ID. This means that if you change your object, all these references will point towards a potentially incompatible object, as its properties have changed. So we should simply create a new parameter instead.
What about other nn.Parameters or nn.Modules that are not embeddings? These you treat the same. You basically just instantiate them, and attach them to their parent module. The __setattr__ method will take care of the rest. So you can do so completely dyncamically ...
Except, of course, the optimizer. The optimizer is the only other thing that "knows" about your parameters except for your main model-module. So you need to let the optimizer know of any change.
And this is tricky, if you want to be sophisticated about it, and very easy if you don't care about keeping the optimizer state. However, even if you want to be sophisticated about it, there is a very good reason why you probably should not do this anyways. More about that below.
Anyways, if you don't care, a simple
# simply overwrite your old optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
will do. If you care, however, you want to transfer your old state, you can do so the same way that you can store, and later load parameters and optimizer states from disk: using the .state_dict() and .load_state_dict() methods. This, however, does only work with a twist:
# extract the state dict from your old optimizer
old_state_dict = optimizer.state_dict()
# create a new optimizer
optimizer = optim.SGD(model.parameters())
new_state_dict = optimizer.state_dict()
# the old state dict will have references to the old parameters, in state_dict['param_groups'][xyz]['params'] and in state_dict['state']
# you now need to find the parameter mismatches between the old and new statedicts
# if your optimizer has multiple param groups, you need to loop over them, too (I use xyz as a placeholder here. mostly, you'll only have 1 anyways, so just replace xyz with 0
new_pars = [p for p in new_state_dict['param_groups'][xyz]['params'] if not p in old_state_dict['param_groups'][xyz]['params']]
old_pars = [p for p in old_state_dict['param_groups'][xyz]['params'] if not p in new_state_dict['param_groups'][xyz]['params']]
# then you remove all the outdated ones from the state dict
for pid in old_pars:
old_state_dict['state'].pop(pid)
# and add a new state for each new parameter to the state:
for pid in new_pars:
old_state_dict['param_groups'][xyz]['params'].append(pid)
old_state_dict['state'][pid] = { ... } # your new state def here, depending on your optimizer
However, here's the reason why you should probably never update your optimizer like this, but should instead re-initialize from scratch, and just accept the loss of state information: When you change your computation graph, you change forward and backward computation for all parameters along your computation path (if you do not have a branching architecture, this path will be your entire graph). This more specifically means, that the input to your functions (=layer/nn.Module) will be different if you change some function (=layer/nn.Module) applied earlier, and the gradients will change if you change some function (=layer/nn.Module) applied later. That in turn invalidates the entire state of your optimizer. So if you keep your optimizer's state around, it will be a state computed for a different computation graph, and will probably end up in catastrophic behavior on part of your optimizer, if you try to apply it to a new computation graph. (I've been there ...)
So - to sum it up: I'd really recommend to try to keep it simple, and to only change a parameter as conservatively as possible, and not to touch the optimizer.

If you want to customize initial params:
from itertools import chain
l1 = nn.Linear(3,3)
l2 = nn.Linear(2,3)
optimizer = optim.SGD(chain(l1.parameters(), l2.parameters()), lr=0.01, momentum=0.9)
The key is that the first param of constructor receives iterator.

Computation of two Dask arrays freezing at the end

I have generated two random dask arrays of length 450,000,000 that I want to divide by each other. When I go to compute them, the calculation always freezes at the end.
I have an 8 core 32GB instance running to run the code.
I have tried the code below and some modification that I've tried is not persisting the data in x or y.
x = da.random.random(450000000, chunks=(10000,))
x = client.persist(x)
z1 = dd.from_array(x)
y = da.random.random(450000000, chunks=(10000,))
y = client.persist(y)
z2 = dd.from_array(y)
flux_ratio_sq = z1.div(z2)
flux_ratio_sq.compute()
Actual results I am getting is that the persist holds the x and y in memory (total of 8GB of memory) which is expected and then compute adds more to memory. Some errors I'm getting are below.
A lot of these errors:
distributed.core - INFO - Event loop was unresponsive in Scheduler for
3.74s. This is often caused by long-running GIL-holding functions
or moving large chunks of data. This can cause timeouts and instability.
tornado.application - ERROR - Exception in callback <bound method
BokehTornado._keep_alive of <bokeh.server.tornado.BokehTornado
object at 0x7fb48562a4a8>>
raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed
I want the final result to be in a dask Series so I can merge it with my existing data.

I'll try to expand my comment here. Fist: given than numpy performs better than pandas (DataFrame or Series) it is better to do your calculation using numpy and then append the result to a DataFrame if is needed. With Dask it is exactly the same. Second following the documentation you should persist only in case you need to call the same dataframe several times.
So for your specific problem what you could do is
import dask.array as da
N = int(4.5e7)
x = da.random.random(N, chunks=(10000,))
y = da.random.random(N, chunks=(10000,))
flux_ratio_sq = da.divide(x, y).compute()
Addendum: with a dask.dataframe you could use to_parquet() instead of compute() and have your results stored to file. In embarrassingly parallel problems like this one the impact on RAM is less than using compute(). It will be interesting to know if something similar could apply in case of dask.array

Example of tf.Estimator with model parallel execution

I am currently experimenting with distributed tensorflow.
I am using the tf.estimator.Estimator class (custom model function) together with tf.contrib.learn.Experiment and managed it to get a working data parallel execution.
However, I would now like to try model parallel execution. I was not able to find any example for that, except Implementation of model parallelism in tensorflow.
But I am not sure how to implement this using tf.estimators (e.g. how to deal with the input functions?).
Does anybody have any experience with it or can provide a working example?

First up, you should stop using tf.contrib.learn.Estimator in favor of tf.estimator.Estimator, because contrib is an experimental module, and classes that have graduated to the core API (such es Estimator) automatically get deprecated.
Now, back to your main question, you can create a distributed model and pass it via model_fn parameter of tf.estimator.Estimator.__init__.
def my_model(features, labels, mode):
net = features[X_FEATURE]
with tf.device('/device:GPU:1'):
for units in [10, 20, 10]:
net = tf.layers.dense(net, units=units, activation=tf.nn.relu)
net = tf.layers.dropout(net, rate=0.1)
with tf.device('/device:GPU:2'):
logits = tf.layers.dense(net, 3, activation=None)
onehot_labels = tf.one_hot(labels, 3, 1, 0)
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels,
logits=logits)
optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
[...]
classifier = tf.estimator.Estimator(model_fn=my_model)
The model above defines 6 layers with /device:GPU:1 placement and 3 other layers with /device:GPU:2 placement. The return value of my_model function should be an EstimatorSpec instance. A complete working example can be found in tensorflow examples.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Does dissecting a Pytorch model lower memory usage? - machine-learning

Related

PyDrake CollisionFilterManager Not Applying Filter

What if I run the collate_fn outside the DataLoader?

How does one dynamically add new parameters to optimizers in Pytorch?

Computation of two Dask arrays freezing at the end

Example of tf.Estimator with model parallel execution

Categories

Resources