I have a LeafSystem (controller) with two output ports, each of which depends on the solution to the same MathematicalProgram. My initial idea was to solve the program and store the solution as discrete state that the output port callbacks can access and copy appropriately.
My interpretation of the documentation (https://drake.mit.edu/doxygen_cxx/group__discrete__systems.html), and what I observe when implementing this, however, is that the output callbacks use the discrete state from before the PerStepDiscreteUpdateEvent.
Now for my questions -
Is this behavior that I've described above consistent with how the Simulator handles update events or am I missing something there?
Is there a way to update the discrete state before the output calculation and have the updated state be used in the output?
Is there a different design that would be more appropriate here?
The simple solution to your problem is a cache entry.
Declare a cache entry whose calculator does your mathematical program work and stores the results in the cache. When each output port is evaluated, both "Eval" the cache entry and draw whatever data they need from the stored result. Then, no matter which port is evaluated first, the second one always benefits from the pre-computation.
You can look at the cache entry notes for more detail.
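For illustration, here is a minimal pydrake sketch of that pattern (the class name, port sizes, and the placeholder solve are assumptions, not your actual program):

import numpy as np
from pydrake.common.value import Value
from pydrake.systems.framework import LeafSystem, ValueProducer

class MpController(LeafSystem):
    def __init__(self):
        super().__init__()
        # One cache entry owns the expensive solve; both ports Eval() it,
        # so the program is solved at most once per changed context.
        self._solution_cache = self.DeclareCacheEntry(
            description="mathematical program solution",
            value_producer=ValueProducer(
                allocate=lambda: Value[object](np.zeros(4)),
                calc=self._solve_program))
        self.DeclareVectorOutputPort("port0", 2, self._calc_port0)
        self.DeclareVectorOutputPort("port1", 2, self._calc_port1)

    def _solve_program(self, context, abstract_value):
        # Placeholder: solve your MathematicalProgram here and store
        # the result in the cache entry's value.
        abstract_value.set_value(np.zeros(4))

    def _calc_port0(self, context, output):
        solution = self._solution_cache.Eval(context)
        output.SetFromVector(solution[:2])

    def _calc_port1(self, context, output):
        solution = self._solution_cache.Eval(context)
        output.SetFromVector(solution[2:])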
I've created a diagram with an LQR controller, a MultibodyPlant, a SceneGraph, and a PlanarSceneGraphVisualizer.
While trying to run this simulation, I set random initial conditions using context.SetDiscreteState(randInitState). However, with this, I get the following error:
RuntimeError: Context::SetDiscreteState(): expected exactly 1 discrete state group but there were 2 groups. Use the other signature if you have multiple groups.
And indeed, when I check the number of groups using context.num_discrete_state_groups(), it returns 2. So I then have to specify the group index when setting the state, using context.SetDiscreteState(0, randInitState). This works, but I don't know exactly why. I understand that I have to select the correct group to set the state for, but what exactly is a group here? In the cartpole example given here, the context was set using context.SetContinuousState(UprightState() + 0.1 * np.random.randn(4,)) without specifying any group(s).
Are groups only relevant for discrete systems? The Context documentation talks about groups but doesn't define them.
Is there a place to find the definition of what a group is when setting up a Drake simulation with multiple systems inside a diagram, and is there a way to check the group index of a system?
We would typically recommend that you use a workflow that sets the context using a subsystem interface. E.g.
plant_context = plant.GetMyMutableContextFromRoot(context)
plant_context.SetContinuousState(...)
Figuring out the discrete index of a state group for a DiagramContext might be possible, but it's certainly not typical.
You might find it helpful to print the context. In pydrake, you can actually just call print(context), and you will see the different elements and where they are coming from.
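Putting those two suggestions together, a rough sketch (assuming diagram and plant were built with a DiagramBuilder, as in the cartpole example) might look like:

context = diagram.CreateDefaultContext()
print(context)  # lists each subsystem's state groups and where they come from
plant_context = plant.GetMyMutableContextFromRoot(context)
plant_context.SetContinuousState(UprightState() + 0.1 * np.random.randn(4,))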
I was going through this post in the PyTorch forum, and I also wanted to do this. The original post removes and adds layers, but I think my situation is not that different. I also want to add layers, or more filters, or word embeddings. My main motivation is that the AI agent does not know the whole vocabulary/dictionary in advance because it's large. I strongly prefer (for the moment) not to do character-by-character RNNs.
So what will happen for me is that when the agent starts a forward pass, it might find new words it has never seen and will need to add them to the embedding table (or perhaps add new filters before it starts the forward pass).
So what I want to make sure is:
embeddings are added correctly (at the right time, when a new computation graph is made) so that they are updatable by the optimizer
no issues with stored info of past parameters, e.g. if it's using some sort of momentum
How does one do this? Any sample code that works?
Just to add an answer to the title of your question: "How does one dynamically add new parameters to optimizers in Pytorch?"
You can add params to an optimizer at any time using add_param_group:
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 2)
# Initialize the optimizer (note: Adam takes no `momentum` argument)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# New parameters should be nn.Parameters (leaf tensors with requires_grad=True)
extra_params = nn.Parameter(torch.randn(2, 2))
# add_param_group fills in the group's hyperparameters (lr, betas, ...)
# from the optimizer's defaults; appending a bare dict to
# optimizer.param_groups would skip that bookkeeping
optimizer.add_param_group({'params': extra_params})

# then you can print your `extra_params`
print("extra params", extra_params)
print("optimizer params", optimizer.param_groups)
That is a tricky question, as I would argue that the answer is "it depends", in particular on how you want to deal with the optimizer.
Let's start with your specific problem: an embedding. In particular, you are asking how to dynamically add embeddings to allow for a larger vocabulary. My first advice is: if you have a good sense of an upper bound on your vocabulary size, make the embedding large enough to cope with it from the beginning, as this is more efficient, and you will need the memory eventually anyway. But this is not what you asked.
So, to dynamically grow your embedding, you'll need to overwrite your old one with a new one and inform your optimizer of the change. You can do that whenever you run into an exception with your old embedding, in a try ... except block. This should roughly follow this idea:
# from within whichever module owns the embedding
# remember the already trained weights
old_embedding_weights = self.embedding.weight.data
old_vocab_size = old_embedding_weights.shape[0]
# create a new embedding of the new size
self.embedding = nn.Embedding(new_vocab_size, embedding_dim)
# initialize the values for the new embedding. this uses random values,
# but you might want to use something like GloVe
new_weights = torch.randn(new_vocab_size, embedding_dim)
# as your old values may have been updated during training, copy them
# back into the first rows
new_weights[:old_vocab_size] = old_embedding_weights
self.embedding.weight.data.copy_(new_weights)
However, you should not do this for every single new word you receive, as this copying takes time (and a whole lot of memory, since the embedding briefly exists twice; if you're nearly out of memory, just make your embedding large enough from the start). Instead, increase the size in chunks of a couple of hundred slots at a time.
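Something along these lines (a minimal sketch; the chunk size and the helper name are hypothetical):

import torch
import torch.nn as nn

GROWTH = 256  # hypothetical chunk size

def grow_embedding(embedding, needed_vocab_size):
    # Returns an embedding large enough for needed_vocab_size, grown in
    # GROWTH-sized chunks, with the already-trained rows preserved.
    old_vocab_size, dim = embedding.weight.shape
    if needed_vocab_size <= old_vocab_size:
        return embedding  # still fits, nothing to do
    n_chunks = -(-(needed_vocab_size - old_vocab_size) // GROWTH)  # ceiling division
    new_embedding = nn.Embedding(old_vocab_size + n_chunks * GROWTH, dim)
    with torch.no_grad():
        new_embedding.weight[:old_vocab_size] = embedding.weight
    return new_embedding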
Additionally, this first step already raises some questions:
How does my respective nn.Module know about the new embedding parameter?
The __setattr__ method of nn.Module takes care of that (see here)
Second, why don't I simply change my parameter in place? That's already pointing towards some of the problems of changing the optimizer: PyTorch internally keeps references by object ID. This means that if you change your object, all these references will point towards a potentially incompatible object, as its properties have changed. So we should simply create a new parameter instead.
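You can see that reference-holding directly (a tiny illustrative snippet):

import torch
import torch.nn as nn

p = nn.Parameter(torch.zeros(2))
opt = torch.optim.SGD([p], lr=0.1)
# the param group stores a reference to the object p itself, so replacing
# p with a brand-new tensor elsewhere won't be visible to the optimizer
print(opt.param_groups[0]['params'][0] is p)  # True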
What about other nn.Parameters or nn.Modules that are not embeddings? You treat these the same. You basically just instantiate them and attach them to their parent module. The __setattr__ method will take care of the rest, so you can do this completely dynamically ...
Except, of course, for the optimizer. The optimizer is the only other thing that "knows" about your parameters besides your main model module, so you need to let it know of any change.
And this is tricky if you want to be sophisticated about it, and very easy if you don't care about keeping the optimizer's state. However, even if you want to be sophisticated about it, there is a very good reason why you probably should not do it anyway. More about that below.
Anyways, if you don't care, a simple
# simply overwrite your old optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
will do. If you do care, however, and want to transfer your old state, you can do so the same way you store and later load parameters and optimizer states from disk: using the .state_dict() and .load_state_dict() methods. This only works with a twist, though:
# extract the state dict from your old optimizer
old_state_dict = optimizer.state_dict()
# create a new optimizer (SGD requires an lr argument)
optimizer = optim.SGD(model.parameters(), lr=0.001)
new_state_dict = optimizer.state_dict()
# the old state dict will have references to the old parameters, in
# state_dict['param_groups'][xyz]['params'] and in state_dict['state']
# you now need to find the parameter mismatches between the old and new state dicts
# if your optimizer has multiple param groups, you need to loop over them, too
# (I use xyz as a placeholder here; mostly you'll only have 1 anyway, so just
# replace xyz with 0)
new_pars = [p for p in new_state_dict['param_groups'][xyz]['params']
            if p not in old_state_dict['param_groups'][xyz]['params']]
old_pars = [p for p in old_state_dict['param_groups'][xyz]['params']
            if p not in new_state_dict['param_groups'][xyz]['params']]
# then you remove all the outdated ones from the state dict
for pid in old_pars:
    old_state_dict['state'].pop(pid)
# and add a new state for each new parameter to the state:
for pid in new_pars:
    old_state_dict['param_groups'][xyz]['params'].append(pid)
    old_state_dict['state'][pid] = { ... } # your new state def here, depending on your optimizer
# finally, load the merged state back into the new optimizer
optimizer.load_state_dict(old_state_dict)
However, here's the reason why you should probably never update your optimizer like this, but should instead re-initialize from scratch and just accept the loss of state information: when you change your computation graph, you change the forward and backward computation for all parameters along your computation path (if you do not have a branching architecture, this path will be your entire graph). More specifically, this means that the input to your functions (= layer/nn.Module) will be different if you change some function applied earlier, and the gradients will change if you change some function applied later. That in turn invalidates the entire state of your optimizer. So if you keep your optimizer's state around, it will be a state computed for a different computation graph, and will probably lead to catastrophic behavior on the part of your optimizer if you try to apply it to a new computation graph. (I've been there ...)
So, to sum it up: I'd really recommend keeping it simple, changing parameters only as conservatively as possible, and not touching the optimizer.
If you want to customize the initial params:
from itertools import chain
import torch.nn as nn
import torch.optim as optim

l1 = nn.Linear(3, 3)
l2 = nn.Linear(2, 3)
optimizer = optim.SGD(chain(l1.parameters(), l2.parameters()), lr=0.01, momentum=0.9)
The key is that the optimizer constructor's first argument accepts any iterable of parameters.
I have an SPSS Modeler stream which is used, and manually updated, every week to generate a certain dataset. The raw data for this stream is also refreshed on a weekly basis.
In part of this stream, there is a chunk of nodes that has to be modified and updated manually every week; the sequence of this part is: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' roles, I drew an image of them, shown below.
Because the original raw data changes on a weekly basis, the range of the Unit value above always varies: sometimes it is more than 6 (maybe 100), other times less than 6 (maybe 3). That is why, until now, somebody has had to modify and update that chunk of nodes every week. *The Unit value has a certain upper limit (300 for now)
However, we are now aiming to run this stream automatically, without any human operations on it, so we need to customize this part to work perfectly and automatically. Please help; your efforts will be much appreciated, thanks!
To automate this, I suggest trying Global nodes combined with CLEM scripts inside the execution (default) script. I have a stream that calculates the first date and the last date, and those variables are used to rename files at the end of execution. I think you could use something similar, as explained here:
1) Create Derive nodes to bring in the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
#GLOBAL_MAX(variable created in (2)) (this only records the number of variables; step 2 created a table with only one value, so GLOBAL_MAX will simply return the number of variables)
4) The script inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You can now use the variable count information through the already created 'count_variable'.
It's not easy to explain just by typing, but I hope this has been helpful.
I think there is a better, simpler, and more effective (yet risky, due to the node's requirements on its input data) solution to your problem. It is called the Transpose node, and it does exactly that: pivots your table. It is only available from version 18.1 on, though. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/
Let's say I have an unbounded PCollection of sentences keyed by userid, and I want a constantly updated value for whether the user is annoying. We can calculate whether a user is annoying by passing all of the sentences they've ever said into the function isAnnoying(). Forever.
I set the window to global with a trigger of afterElement(1) and accumulatingFiredPanes(), do a GroupByKey, then have a ParDo that emits (userid, isAnnoying).
That works forever and keeps accumulating the state for each user, etc. Except it turns out that the vast majority of the time a new sentence does not change whether a user isAnnoying, so most of the times the window fires and emits a (userid, isAnnoying) tuple it's a redundant update and the IO was unnecessary. How do I catch these duplicate updates and drop them, while still getting an update every time a sentence comes in that does change the isAnnoying value?
Today there is no way to directly express "output only when the combined result has changed".
One approach that you may be able to apply to reduce data volume, depending on your pipeline: use .discardingFiredPanes() and then follow the GroupByKey with an immediate filter that drops any zero values, where "zero" means the identity element of your CombineFn. I'm using the fact that the associativity requirements of Combine mean you must be able to independently calculate the incremental "annoying-ness" of a sentence without reference to the history.
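Purely as an illustration, in the Python SDK that could look roughly like this (sentences, annoyance_score, and the zero test are assumptions standing in for your pipeline and CombineFn):

import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, Repeatedly
from apache_beam.transforms.window import GlobalWindows

def annoyance_score(sentence):
    # placeholder for an incremental, history-free "annoying-ness" measure
    return sentence.count('!')

filtered = (
    sentences  # assumed: a PCollection of (userid, sentence)
    | beam.MapTuple(lambda uid, s: (uid, annoyance_score(s)))
    | beam.WindowInto(GlobalWindows(),
                      trigger=Repeatedly(AfterCount(1)),
                      accumulation_mode=AccumulationMode.DISCARDING)
    | beam.CombinePerKey(sum)
    | beam.Filter(lambda kv: kv[1] != 0))  # drop panes at the identity ("zero")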
When BEAM-23 (cross-bundle mutable per-key-and-window state for ParDo) is implemented, you will be able to manually maintain the state and implement this sort of "only send output when the result changes" logic yourself.
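A rough sketch of what that manual state-keeping could look like, using the Python SDK's per-key state API (isAnnoying here is a stand-in for yours):

import apache_beam as beam
from apache_beam.coders import BooleanCoder, StrUtf8Coder
from apache_beam.transforms.userstate import BagStateSpec, ReadModifyWriteStateSpec

def isAnnoying(sentences):
    # stand-in for the isAnnoying() from the question
    return any('!' in s for s in sentences)

class EmitOnChange(beam.DoFn):
    # all sentences seen so far for this key
    SENTENCES = BagStateSpec('sentences', StrUtf8Coder())
    # the last isAnnoying value emitted for this key
    LAST = ReadModifyWriteStateSpec('last', BooleanCoder())

    def process(self, element,
                sentences=beam.DoFn.StateParam(SENTENCES),
                last=beam.DoFn.StateParam(LAST)):
        userid, sentence = element
        sentences.add(sentence)
        annoying = isAnnoying(sentences.read())
        if annoying != last.read():  # only emit when the value changes
            last.write(annoying)
            yield userid, annoying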
However, I think this scenario likely deserves explicit consideration in the model. It blends the concepts embodied today by triggers and the accumulation mode.