torch.matmul doesn't seem to have an nn.Module wrapper to allow the standard forward hook registration by name. In this case, the matrix multiply happens in the middle of a forward() function. I suppose the intermediate result can be returned by forward() in addition to the final result, such as return x, mm_res. But what's a good way to collect these additional outputs?
What are the options for offloading torch.matmul outputs? TIA.
If your primary complaint is that torch.matmul doesn't have a Module wrapper, how about just making one:
class Matmul(nn.Module):
    def forward(self, *args):
        return torch.matmul(*args)
Now you can register a forward hook on a Matmul instance:
class Network(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.matmul = Matmul()
        self.matmul.register_forward_hook(...)
    def forward(self, x):
        y = ...
        z = self.matmul(x, y)
        ...
That being said, do not overlook the warning (in red) in the docs: the global variant, register_module_forward_hook, should only be used for debugging purposes.
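To make the wrapper approach concrete, here is a minimal sketch of collecting the matmul outputs by name. The Network layout, the captured dictionary and the save_output helper are made up for illustration; only torch.matmul, named_modules and register_forward_hook come from PyTorch itself. The hook stores a detached CPU copy of each result, which is one possible way to offload it.

```python
import torch
from torch import nn


class Matmul(nn.Module):
    """Thin wrapper so torch.matmul shows up as a named submodule."""

    def forward(self, a, b):
        return torch.matmul(a, b)


class Network(nn.Module):
    # Hypothetical network: a linear layer followed by a matmul inside forward().
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.matmul = Matmul()

    def forward(self, x):
        y = self.proj(x)                          # (batch, dim)
        z = self.matmul(x, y.transpose(-1, -2))   # (batch, batch)
        return z.sum(dim=-1)


captured = {}  # submodule name -> offloaded matmul output


def save_output(name):
    def hook(module, inputs, output):
        # Detach and copy to CPU so the stored tensor does not keep the graph alive.
        captured[name] = output.detach().cpu()
    return hook


net = Network(dim=4)
# Register one hook per Matmul instance, keyed by its name in the module tree.
handles = [
    module.register_forward_hook(save_output(name))
    for name, module in net.named_modules()
    if isinstance(module, Matmul)
]

out = net(torch.randn(3, 4))
print({name: tuple(t.shape) for name, t in captured.items()})

for h in handles:
    h.remove()  # unregister once the outputs have been collected
```

Because the hooks are attached per instance, forward() keeps returning only the final result and the intermediate products end up in captured.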
I am trying to create a physics-informed neural network (PINN) in JAX. I want to differentiate the defined model (neural network) by the input (x). If I set model to jax.grad(params), I get an error.
If I set model to jax.grad(model), I don't get an error, but I don't know if I am able to differentiate the model of the neural network by x.
class MLP(fnn.Module):
    @fnn.compact
    def __call__(self, x):
        x = fnn.Dense(128)(x)
        x = fnn.relu(x)
        x = fnn.Dense(256)(x)
        x = fnn.relu(x)
        x = fnn.Dense(10)(x)
        return x
model = MLP()
params = model.init(jax.random.PRNGKey(0), jnp.ones([1]))['params']
tx = optax.adam(0.001)
state = TrainState.create(apply_fn=model.apply, params=params, tx=tx)
You can differentiate a model in JAX by (1) defining a function that you want to differentiate, (2) transforming it with jax.grad, jax.jacrev, jax.jacfwd, etc. as appropriate for your application, and (3) passing data to the transformed function.
It's not entirely clear from your question what operation you're hoping to differentiate, but here is an example of computing a forward-mode jacobian of the training state creation with respect to the params:
def f(params):
    return TrainState.create(apply_fn=model.apply, params=params, tx=tx)
result = jax.jacfwd(f)(params)
If that doesn't help, I'd suggest editing your question to make clear what operation you're interested in differentiating.
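If the goal is the derivative of the network output with respect to the input x (as is typical for a PINN), the three steps above might look like the following sketch. It uses a small hand-written MLP instead of the Flax model so that it is self-contained; mlp, params and the layer sizes are assumptions for illustration. With the Flax model from the question, the function to differentiate would presumably wrap model.apply({'params': params}, x) instead, and jax.jacfwd or jax.jacrev would replace jax.grad because that model returns 10 outputs rather than a scalar.

```python
import jax
import jax.numpy as jnp


def mlp(params, x):
    """Scalar input x -> scalar output u(x); a stand-in for the real model."""
    h = jnp.tanh(params["w1"] * x + params["b1"])   # hidden layer, shape (8,)
    return jnp.dot(params["w2"], h) + params["b2"]


k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = {
    "w1": jax.random.normal(k1, (8,)),
    "b1": jnp.zeros(8),
    "w2": jax.random.normal(k2, (8,)),
    "b2": jnp.array(0.0),
}

# (1) function to differentiate: mlp; (2) transform it, selecting x via argnums=1.
du_dx = jax.grad(mlp, argnums=1)
d2u_dx2 = jax.grad(du_dx, argnums=1)   # second derivative, often needed in PINN residuals

# (3) pass data; vmap maps the scalar derivative over a batch of collocation points.
xs = jnp.linspace(0.0, 1.0, 5)
du_dx_batch = jax.vmap(du_dx, in_axes=(None, 0))(params, xs)
print(du_dx_batch.shape)
```

Here argnums=1 is what selects x rather than params as the variable of differentiation.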
I am working on a simple text generation problem with LSTMs. To make the preprocessing more compact and reproducible, I decided to implement everything in sklearn fashion, using custom sklearn transformers, and the KerasClassifier from scikeras to wrap the neural network definition in a sklearn-type estimator.
It almost works, but I can't figure out how to pass information from within a custom transformer on to the KerasClassifier estimator. More precisely, the function that creates the neural network needs the number of outputs as an argument, but this depends on the number of words in the fitted vocabulary, which is currently encapsulated in the ModelEncoder class.
(Note that to get the current logic to work, I had to slightly modify the default sklearn Pipeline class, as it wouldn't allow modifying and returning both X and y. In other words, the default sklearn Pipeline only allows feature transformations, not target transformations. Modifying the Pipeline class was explained in this StackOverflow post.)
Example data:
train_data = ['o by no means honest ventidius i gave it freely ever and theres none can truly say he gives if our betters play at that game we must not dare to imitate them faults that are rich are fair',
 'but was not this nigh shore',
 'impairing henry strengthening misproud york the common people swarm like summer flies and whither fly the gnats but to the sun',
 'what while you were there',
 'chill pick your teeth zir come no matter vor your foins',
 'thanks dear isabel', 'come prick me bullcalf till he roar again',
 'go some of you knock at the abbeygate and bid the lady abbess come to me',
 'an twere not as good deed as drink to break the pate on thee i am a very villain',
 'beaufort it is thy sovereign speaks to thee',
 'but say lucetta now we are alone wouldst thou then counsel me to fall in love',
 'for being a bawd for being a bawd',
 'all blest secrets all you unpublishd virtues of the earth spring with my tears',
 'what likelihood', 'o find him']
Full code:
# Modify the sklearn Pipeline class to allow it to return tuples and hence enable both X and y modifications. (Current default implementation in sklearn only allows
# feature transformations, i.e. transformations on X, but not on y.)
class Pipeline(pipeline.Pipeline):
    def _fit(self, X, y=None, **fit_params_steps):
        self.steps = list(self.steps)
        self._validate_steps()
        memory = check_memory(self.memory)
        fit_transform_one_cached = memory.cache(pipeline._fit_transform_one)
        for (step_idx, name, transformer) in self._iter(
            with_final=False, filter_passthrough=False
        ):
            if transformer is None or transformer == "passthrough":
                with _print_elapsed_time("Pipeline", self._log_message(step_idx)):
                    continue
            try:
                # joblib >= 0.12
                mem = memory.location
            except AttributeError:
                mem = memory.cachedir
            finally:
                cloned_transformer = clone(transformer) if mem else transformer
            X, fitted_transformer = fit_transform_one_cached(
                cloned_transformer,
                X,
                y,
                None,
                message_clsname="Pipeline",
                message=self._log_message(step_idx),
                **fit_params_steps[name],
            )
            if isinstance(X, tuple):  # unpack X if it is a tuple (X, y)
                X, y = X
            self.steps[step_idx] = (name, fitted_transformer)
        return X, y
    def fit(self, X, y=None, **fit_params):
        fit_params_steps = self._check_fit_params(**fit_params)
        Xt = self._fit(X, y, **fit_params_steps)
        if isinstance(Xt, tuple):  # unpack Xt if it is a tuple (X, y)
            Xt, y = Xt
        with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
            if self._final_estimator != "passthrough":
                fit_params_last_step = fit_params_steps[self.steps[-1][0]]
                self._final_estimator.fit(Xt, y, **fit_params_last_step)
        return self
class ModelTokenizer(TransformerMixin, BaseEstimator):
    def __init__(self, max_len=100):
        super().__init__()
        self.max_len = max_len
    def fit(self, X=None, y=None):
        return self
    def transform(self, X, y=None):
        X_flattened = " ".join(X).split()
        sequences = list()
        for i in range(self.max_len+1, len(X_flattened)):
            seq = X_flattened[i-self.max_len-1:i]
            sequences.append(seq)
        return sequences
class ModelEncoder(TransformerMixin, BaseEstimator):
    def __init__(self):
        super().__init__()
        self.tokenizer = Tokenizer()
    def fit(self, X=None, y=None):
        self.tokenizer.fit_on_texts(X)
        return self
    def transform(self, X, y=None):
        encoded_sequences = np.array(self.tokenizer.texts_to_sequences(X))
        return (encoded_sequences[:,:-1], encoded_sequences[:,-1])
def create_nn(input_shape=(100,1), output_shape=None):
    model = Sequential()
    model.add(LSTM(64, input_shape=input_shape, return_sequences=True))
    model.add(Dropout(0.3))
    model.add(Flatten())
    model.add(Dense(20, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(output_shape, activation='softmax'))
    metrics_list = [tf.keras.metrics.BinaryAccuracy(name='accuracy')]
    model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = metrics_list)
    return model
pipe = Pipeline([
    ('tokenizer', ModelTokenizer()),
    ('encoder', ModelEncoder()),
    ('model', KerasClassifier(build_fn=create_nn, epochs=10, output_shape=vocab_size)),
])
# Question: how to pass 'vocab_size'?
Imports:
import numpy as np
import tensorflow as tf
from sklearn import pipeline
from sklearn.base import clone
from sklearn.utils import _print_elapsed_time
from sklearn.utils.validation import check_memory
from sklearn.base import BaseEstimator, TransformerMixin
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense
from keras.preprocessing.text import Tokenizer
from scikeras.wrappers import KerasClassifier
KerasClassifier has its own internal transformer (see here, it is used to provide one-hot encoding and such) which has an API to pass metadata to the model (see here, that's how arguments such as n_outputs_ are passed into the model building function). Could you override that to pass this extra metadata to the model? It's stepping a bit outside of the Scikit-Learn API, but as you've noted the Scikit-Learn API doesn't have this functionality built in. If you want to propagate that information from a Transformer in your pipeline into SciKeras you could encode it into a feature and then use the above-mentioned hooks along with a custom encoder to remove that feature and convert it into metadata that can be passed into the model, but now you'd be really pushing the Scikit-Learn API.
So my goal is basically to implement global top-k subsampling. Gradient sparsification is quite simple, and I have already done it by building on the stateful clients example, but now I would like to use encoders as you have recommended here at page 28. Additionally, I would like to average only the non-zero gradients: say we have 10 clients but only 4 have non-zero gradients at a given position in a communication round, then I would like to divide the sum of these gradients by 4, not 10. I am hoping to achieve this by summing the gradients in the numerator and the masks (1s and 0s) in the denominator. Also, moving forward I will add randomness to the gradient selection, so it is imperative that I create those masks concurrently with the gradient selection. The code I have right now is
import tensorflow as tf
from tensorflow_model_optimization.python.core.internal import tensor_encoding as te
@te.core.tf_style_adaptive_encoding_stage
class GrandienrSparsificationEncodingStage(te.core.AdaptiveEncodingStageInterface):
    """An example custom implementation of an `EncodingStageInterface`.
    Note: This is likely not what one would want to use in practice. Rather, this
    serves as an illustration of how a custom compression algorithm can be
    provided to `tff`.
    This encoding stage is expected to be run in an iterative manner, and
    alternatively zeroes out values corresponding to odd and even indices. Given
    the determinism of the non-zero indices selection, the encoded structure does
    not need to be represented as a sparse vector, but only the non-zero values
    are necessary. In the decode method, the state (i.e., params derived from the
    state) is used to reconstruct the corresponding indices.
    Thus, this example encoding stage can realize representation saving of 2x.
    """
    ENCODED_VALUES_KEY = 'stateful_topk_values'
    INDICES_KEY = 'indices'
    SHAPES_KEY = 'shapes'
    ERROR_COMPENSATION_KEY = 'error_compensation'
    def encode(self, x, encode_params):
        shapes_list = [tf.shape(y) for y in x]
        flattened = tf.nest.map_structure(lambda y: tf.reshape(y, [-1]), x)
        gradients = tf.concat(flattened, axis=0)
        error_compensation = encode_params[self.ERROR_COMPENSATION_KEY]
        gradients_and_error_compensation = tf.math.add(gradients, error_compensation)
        percentage = tf.constant(0.1, dtype=tf.float32)
        k_float = tf.multiply(percentage, tf.cast(tf.size(gradients_and_error_compensation), tf.float32))
        k_int = tf.cast(tf.math.round(k_float), dtype=tf.int32)
        values, indices = tf.math.top_k(tf.math.abs(gradients_and_error_compensation), k = k_int, sorted = False)
        indices = tf.expand_dims(indices, 1)
        sparse_gradients_and_error_compensation = tf.scatter_nd(indices, values, tf.shape(gradients_and_error_compensation))
        new_error_compensation = tf.math.subtract(gradients_and_error_compensation, sparse_gradients_and_error_compensation)
        state_update_tensors = {self.ERROR_COMPENSATION_KEY: new_error_compensation}
        encoded_x = {self.ENCODED_VALUES_KEY: values,
                     self.INDICES_KEY: indices,
                     self.SHAPES_KEY: shapes_list}
        return encoded_x, state_update_tensors
    def decode(self,
               encoded_tensors,
               decode_params,
               num_summands=None,
               shape=None):
        del num_summands, decode_params, shape  # Unused.
        flat_shape = tf.math.reduce_sum([tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]])
        sizes_list = [tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]]
        scatter_tensor = tf.scatter_nd(
            indices=encoded_tensors[self.INDICES_KEY],
            updates=encoded_tensors[self.ENCODED_VALUES_KEY],
            shape=[flat_shape])
        nonzero_locations = tf.nest.map_structure(lambda x: tf.cast(tf.where(tf.math.greater(x, 0), 1, 0), tf.float32), scatter_tensor)
        reshaped_tensor = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
                           zip(tf.split(scatter_tensor, sizes_list), encoded_tensors[self.SHAPES_KEY])]
        reshaped_nonzero = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
                            zip(tf.split(nonzero_locations, sizes_list), encoded_tensors[self.SHAPES_KEY])]
        return reshaped_tensor, reshaped_nonzero
    def initial_state(self):
        return {self.ERROR_COMPENSATION_KEY: tf.constant(0, dtype=tf.float32)}
    def update_state(self, state, state_update_tensors):
        return {self.ERROR_COMPENSATION_KEY: state_update_tensors[self.ERROR_COMPENSATION_KEY]}
    def get_params(self, state):
        encode_params = {self.ERROR_COMPENSATION_KEY: state[self.ERROR_COMPENSATION_KEY]}
        decode_params = {}
        return encode_params, decode_params
    @property
    def name(self):
        return 'gradient_sparsification_encoding_stage'
    @property
    def compressible_tensors_keys(self):
        return False
    @property
    def commutes_with_sum(self):
        return False
    @property
    def decode_needs_input_shape(self):
        return False
    @property
    def state_update_aggregation_modes(self):
        return {}
I have run some simple tests manually following the steps you outlined here at page 45. It works but I have some questions/problems.
1. When I use a list of tensors of the same shape (e.g., two 2x25 tensors) as the input x of encode, it works without any issues, but when I try a list of tensors of different shapes (2x20 and 6x10), it gives an error saying
InvalidArgumentError: Shapes of all inputs must match: values[0].shape = [2,20] != values[1].shape = [6,10] [Op:Pack] name: packed
How can I resolve this issue? As I said, I want to use global top-k, so it is essential that I encode all of the trainable model weights at once. Take the CNN model used here: all the tensors have different shapes.
2. How can I do the averaging I described at the beginning? For example, here you have done
mean_factory = tff.aggregators.MeanFactory(
    tff.aggregators.EncodedSumFactory(mean_encoder_fn), # numerator
    tff.aggregators.EncodedSumFactory(mean_encoder_fn), # denominator
)
Is there a way to repeat this with one output of decode going to the numerator and the other going to the denominator? How can I handle dividing 0 by 0? TensorFlow has a divide_no_nan function; can I use it somehow, or do I need to add eps to each?
3. How is partitioning handled when I use encoders? Does each client get a unique encoder holding a unique state for it? As you have discussed here at page 6, client states are used in cross-silo settings, yet what happens if the client ordering changes?
4. Here you have recommended using the stateful clients example. Can you explain this a bit further? I mean, in run_one_round, where exactly do the encoders go, and how are they used/combined with the client update and aggregation?
5. I have some additional information, such as sparsity, that I want to pass to encode. What is the suggested method for doing that?
Here are some answers, hope it helps:
1. If you want to treat all of the aggregated structure just as a single tensor, use concat_factory as the outermost aggregator. That will concatenate the entire structure to a rank-1 Tensor at clients, and then unpack it back to the original structure at the end. Example use: tff.aggregators.concat_factory(tff.aggregators.MeanFactory(...))
Note the encoding stage objects are meant to work with a single tensor, so what you describe with identical tensors probably works only accidentally.
2. There are two options.
a. Modify the client training code such that the weights being passed to the weighted aggregator are already what you want them to be (zero/one mask). In the stateful clients example you link, that would be here. You will then get what you need by default (by summing the numerator).
b. Modify UnweightedMeanFactory to do exactly the variant of averaging you describe and use that. A starting point would be modifying this.
3. (and 4.) I think that is what you would need to implement. The same way existing client states are initialized in the example here, you would need to extend it to contain the aggregator states, and make sure those are sampled together with the clients, as done here. Then, to integrate the aggregators in the example, you would need to replace this hard-coded tff.federated_mean. An example of such integration is in the implementation of tff.learning.build_federated_averaging_process, primarily here.
5. I am not sure what the question is. Perhaps get the previous working (seems like a prerequisite to me), and then clarify and ask in a new post?
I am using LuaTorch normalization layer that currently normalizes the input tensor. I am adding it as a part of the network like this self:add( nn.Normalize(2) ). Now I want to normalize only a part of the input tensor. I am not sure how to specify only a part of the tensor in the following lines.
self:add( nn.View(-1, op_neurons) )
self:add( nn.Normalize(2) ) <--- how to normalize only a part of the input tensor
self:add( nn.View(-1,no_of_objects,op_neurons) )
I think a clean way to do this is to derive your own class from nn.Normalize. Just create a file like PartialNormalize.lua and proceed like this (it is easy but a bit time-consuming to develop, so I'm mostly giving you pseudo-code):
local PartialNormalize, parent = torch.class('nn.PartialNormalize', 'nn.Normalize')
-- Now you basically need to override the functions __init, updateOutput and updateGradInput from the parent class (I don't think there is a need to override other functions, but you should check).
-- You can find the code for nn.Normalize in <your_install_path>/install/share/lua/5.1/nn/Normalize.lua
-- The interval [first_index, last_index] determines which part of your input vector you want to be normalized.
function PartialNormalize:__init(p, eps, first_index, last_index)
   parent.__init(self, p, eps)
   self.first_index = first_index
   self.last_index = last_index
end
function PartialNormalize:updateOutput(input)
   -- In the parent class, this just returns the normalized input.
   -- Modify this function so that it returns the normalized part from self.first_index to self.last_index and passes the other elements through unchanged.
end
function PartialNormalize:updateGradInput(input, gradOutput)
   -- Make the appropriate modifications to the gradient function: the gradient for elements from self.first_index to self.last_index is computed just as in the parent class,
   -- while the other elements pass through unchanged, so their gradInput is simply the corresponding gradOutput.
end
-- I don't think other functions from the parent class need overriding, but make sure just in case.
Hope this helps.
There are containers for independent processing of input parts. Using concat and narrow you could construct your partial normalization.
require"torch"
nn=require"nn"
local NX,NY = 2,6 --sizes of inputs
local Y1, Y2 = 1, 4 --normalize only data between these constraints
local DIMENSION_INDEX=2--dimension on which you want to split your input (NY here)
local input=torch.randn(NX,NY)--example input
--network construction
local normalize_part = nn.Sequential()
normalize_part:add(nn.Narrow(DIMENSION_INDEX,Y1,Y2))
normalize_part:add(nn.Normalize(2))
local dont_change_part=nn.Sequential()
dont_change_part:add(nn.Narrow(DIMENSION_INDEX,Y2+1,NY-Y2))
local partial_normalization=nn.Concat(DIMENSION_INDEX)
partial_normalization:add(normalize_part)
partial_normalization:add(dont_change_part)
--partial_normalization is ready for use:
print(input)
print( partial_normalization:forward(input))
--can be used as a block in a greater network
local main_net=nn.Sequential()
main_net:add(partial_normalization)
Also, I'd like to note that nn.Normalize is not equivalent to (x - mean(x)) / std(x), which is also called normalization.
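The arithmetic of partial normalization, and the difference between L2 normalization and standardization noted above, can be illustrated with a small plain-Python sketch. This is not Torch code: the helper names, the 0-based inclusive index convention and the eps value are assumptions made for the example.

```python
import math


def l2_normalize(v, eps=1e-10):
    """Scale v to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def partial_normalize(row, first, last):
    """L2-normalize row[first..last] (0-based, inclusive); pass the rest through."""
    head, part, tail = row[:first], row[first:last + 1], row[last + 1:]
    return head + l2_normalize(part) + tail


def standardize(v):
    """(x - mean) / std: the other operation commonly called normalization."""
    mean = sum(v) / len(v)
    std = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))
    return [(x - mean) / std for x in v]


row = [3.0, 4.0, 0.0, 0.0, 10.0, 20.0]
print(partial_normalize(row, 0, 3))  # first four entries rescaled, last two untouched
print(standardize(row[:4]))          # zero mean and unit variance: a different result
```

The slicing in partial_normalize plays the role of the two nn.Narrow branches, and the list concatenation plays the role of nn.Concat.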
I wanted to update the parameters of a model manually with pytorch. I made a super simple standard sequential model (full code here) but whenever I try to train my model it does not train unless I create the actual variables explicitly (code for model variables explicitly). So with the sequential model the code looks as follow:
mdl_sgd = torch.nn.Sequential( torch.nn.Linear(D_sgd,1,bias=False) )
...
for i in range(nb_iter):
    # Forward pass: compute predicted Y using operations on Variables
    batch_xs, batch_ys = get_batch2(X,Y,M,dtype) # [M, D], [M, 1]
    ## FORWARD PASS
    y_pred = mdl_sgd.forward(X)
    ## LOSS
    loss = (1/N)*(y_pred - batch_ys).pow(2).sum()
    ## Manually zero the gradients after updating weights
    mdl_sgd.zero_grad()
    ## BACKWARD PASS
    loss.backward() # Use autograd to compute the backward pass. Now w will have gradients
    ## SGD update
    for W in mdl_sgd.parameters():
        #print(W.grad.data)
        W.data = W.data - eta*W.grad.data
When I train it, it seems that nothing happens. I've tried many things to make this work, like wrapping it in a class and explicitly setting requires_grad=True, or changing where I zero out the gradients, but nothing seems to work. What I really want is to be able to apply the update rule myself (not with torch.optim). Not sure if that's the reason it doesn't work, but the following does work for some reason:
X = poly_kernel_matrix(x_true,Degree_mdl) # maps to the feature space of the model
X = Variable(torch.FloatTensor(X).type(dtype), requires_grad=False)
Y = Variable(torch.FloatTensor(Y).type(dtype), requires_grad=False)
w_init=torch.randn(D_sgd,1).type(dtype)
W = Variable( w_init, requires_grad=True)
...
for i in range(nb_iter):
    # Forward pass: compute predicted Y using operations on Variables
    batch_xs, batch_ys = get_batch2(X,Y,M,dtype) # [M, D], [M, 1]
    ## FORWARD PASS
    #y_pred = mdl_sgd.forward(X)
    y_pred = batch_xs.mm(W)
    ## LOSS
    loss = (1/N)*(y_pred - batch_ys).pow(2).sum()
    ## BACKWARD PASS
    loss.backward() # Use autograd to compute the backward pass. Now w will have gradients
    ## SGD update
    W.data = W.data - eta*W.grad.data
    ## Manually zero the gradients after updating weights
    #mdl_sgd.zero_grad()
    W.grad.data.zero_()
the reason I know this is because the plot of the regression lines look sensible:
while when I use the torch.nn.Sequential I get:
I am sure it's a really newbie question, but I am not sure why I can't update the parameters. Does someone know why? I want to be able to update the parameters manually (however I want), and in this case I decided to use SGD to see if I could even update the parameters.
Note I also tried subclassing modules and registering params but it didn't work either. This is the class I built:
class regression_NN(torch.nn.Module):
    def __init__(self,w_init):
        """
        """
        super(regression_NN, self).__init__()
        # mdl
        #self.W = Variable(w_init, requires_grad=True)
        #self.W = torch.nn.Parameter( Variable(w_init, requires_grad=True) )
        #self.W = torch.nn.Parameter( w_init )
        self.W = torch.nn.Parameter( w_init,requires_grad=True )
        #self.mod_list = torch.nn.ModuleList([self.W])
    def forward(self, x):
        """
        """
        y_pred = x.mm(self.W)
        return y_pred
All code is:
https://github.com/brando90/simple_regression
I'm relatively new to pytorch, so I might have many bad practices. You can correct them if you want, but I'm mostly concerned that my parameters are not updating even when I try to explicitly register them in a class that inherits from torch.nn.Module.
I also linked to the question from the pytorch official forum: https://discuss.pytorch.org/t/how-does-one-make-sure-that-the-parameters-are-update-manually-in-pytorch-using-modules/6076
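For reference, a manual SGD step on the parameters of a torch.nn.Sequential model can be written as in the sketch below. The synthetic data, batch size and learning rate are assumptions standing in for the X, Y, get_batch2 and eta of the question, and this is not claimed to be the fix for the repository linked above. One visible difference is that the forward pass here runs on batch_xs, the same batch the loss is computed against, whereas the Sequential loop in the question calls forward on the full X.

```python
import torch

torch.manual_seed(0)

# Synthetic, noise-free regression data (hypothetical stand-in for X and Y).
N, D = 64, 3
X = torch.randn(N, D)
w_true = torch.tensor([[1.0], [-2.0], [0.5]])
Y = X @ w_true

model = torch.nn.Sequential(torch.nn.Linear(D, 1, bias=False))
eta = 0.1  # learning rate, chosen for this toy problem

for step in range(200):
    idx = torch.randint(0, N, (16,))       # random mini-batch indices
    batch_xs, batch_ys = X[idx], Y[idx]

    y_pred = model(batch_xs)                # forward on the same batch as the targets
    loss = (y_pred - batch_ys).pow(2).mean()

    model.zero_grad()                       # clear gradients from the previous step
    loss.backward()

    # Manual SGD update; no_grad keeps the update itself out of the autograd graph.
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad

print(model[0].weight.detach())             # approaches w_true transposed
```

Any other update rule can be substituted inside the no_grad block, since the parameters are modified in place and stay registered with the module.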