Can caffe reshape layer do transpose

Can caffe reshape layer do transpose - machine-learning

Caffe have reshape layer implemented, but say I want to first reshape a blob of (1, n, k, p) to (1, a, b, k, p), where n= a*b and then transpose it to shape (1, b, a, k, p), how to implement this operation, I know I can write a seperate python layer and do all this with numpy.reshape and numpy.transpose, but that would be not efficient, is it?

transpose and reshape are two fundamentally different operations:
While reshape only changes the shape of a blob, it does not affect its internal structure (and thus can be execute very efficiently). On the other hand, transpose re-arrange the blob's data.
Let's look at a simple example.
Suppose you have a 2x2 blob with values
[[0, 1], [2, 3]]
In memory the values are stored in a 1D contiguous way (row-major):
[0, 1, 2, 3]
If you reshape the blob to 4x1
[[0], [1], [2], [3]]
The underlying arrangement of the elements in memory is not changed.
However, if you transpose the blob to get
[[0, 2], [1, 3]]
The underlying arrangement is also changed to
[0, 2, 1, 3]
Therefore, you cannot use "Reshape" layer to transpose a blob.
Caffe SSD branch (by Weilu) has a "Permute" layer which is equivalent to transpose.
A note about performance:
While reshape only changes the blob's header (O(1) runtime and space), transpose needs to re-arrange elements in memory thus taking O(n) time and space.
To make things worse, if you use numpy.transpose to perform the task it means you transpose in CPU (host memory) thus adding two sync operations between CPU and GPU memory (sync GPU->CPU, transpose in CPU, sync CPU->GPU).
So, if you have no alternative but to transpose (aka "Permute") make sure you have a GPU implementation.

Related

Drake flattens by channel

Why does Drake flatten along the channel dimension instead of along the batch dimension?
For a [n, c] vector, numpy/torch/tensorflow have a flatten operation and its inverse, a reshape operation
This flatten operation flattens along the batch dimension "n", whereas Drake flattens along "c"
Is there some built in function to get the flattening along the batch dimension?
For example, I want to access:
prog.initial_guess()
But I want it to be flattened along the batch dimension, according to the original matrix-shaped decision variables

Drake is written in C++ with a python binding. In C++ we use Eigen as our linear algebra library. Eigen::Matrix is default to column major, namely it stores data in the order of mat[0, 0], mat[1, 0], mat[2, 0], ..., mat[0, 1], mat[1, 1], .... This is different from how numpy/pytorch stores matrix, which is default to a row major.
Specifically for your question about getting the initial guess of a matrix variable, you can call the function
initial_guess = prog.GetInitialGuess(variable_matrix)
it will return a numpy matrix of float, where initial_guess[i, j] is the initial guess value of the variable variable_matrix[i, j], so you don't need to worry about matrix flattening/reshaping.

How do I match samples with their predictions when doing inference with PyTorch's DistributedSampler?

I have trained a torch model for NLP tasks and would like to perform some inference using a multi GPU machine (in this case with two GPUs).
Inside the processing code, I use this
dataset = TensorDataset(encoded_dict['input_ids'], encoded_dict['attention_mask'])
sampler = DistributedSampler(
dataset, num_replicas=args.nodes * args.gpus, rank=args.node_rank * args.gpus + gpu_number, shuffle=False
)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
For those familiar with NLP, encoded_dict is the output from the tokenizer.batch_encode_plus function where the tokenizer is an instance of transformers.BertTokenizer.
The issue I’m having is that when I call the code through the torch.multiprocessing.spawn function, each GPU is doing predictions (i.e. inference) on a subset of the full dataset, and saving the predictions separately; for example, if I have a dataset with 1000 samples to predict, each GPU is predicting 500 of them. As a result, I have no way of knowing which samples out of the 1000 were predicted by which GPU, as their order is not preserved, therefore the model predictions are meaningless as I cannot trace each of them back to their input sample.
I have tried to save the dataloader instance (as a pickle) together with the predictions and then extracting the input_ids by using dataloader.dataset.tensors, however this requires a tokeniser decoding step which I rather avoid, as the tokenizer will have slightly changed the text (for example double whitespaces would be removed, words with dashes will have been split and so on).
What is the cleanest way to save the input text samples together with their predictions when doing inference in distributed mode, or alternatively to keep track of which prediction refers to which sample?

As I understand it, basically your dataset returns for an index idx [data,label] during training and [data] during inference. The issue with this is that the idx is not preserved by the dataloader object, so there is no way to obtain the idx values for the minibatch after the fact.
One way to handle this issue is to define a very simple custom dataset object that also returns [data,id] instead of only data during inference. Probably the easiest way to do this is to make the dataset return a dictionary object with keys id and data. The dictionary return type is convenient because Pytorch collates (converts data structures to batches) this type automatically, otherwise you'd have to define a custom collate_fn and pass it to the dataloader object, which is itself not very hard but is an extra step.
In any case, here's I would define a new dataset object as follows which should be almost a one-to-one substitute for your current dataset (I believe):
def TensorDictDataset(torch.data.Dataset):
def __init__(self,ids,attention_mask):
self.ids = ids
self.mask = attention_mask
def __len__(self):
return len(self.ids)
def __getitem(self,idx):
datum = {
"mask": self.mask[idx],
"id":ids[idx]
}
return datum
The only change you'll then have to make is that rather than returning mask your dataset will now return dict{"mask":mask,"id":id} so you'll have to parse that appropriately.

thanks for your answer. I have done further debugging and found another solution and wanted to post it.
Your solution is quite elegant (there was one minor misunderstanding, in that the predictions contain only the predicted labels and not the data contrary to what you understood, but this doesn't affect your answer anyway). Mask is NLP is also something else, and instead of having the mask tokens together with predictions I would like to have the untokenized text string. This is not so easy to achieve because the splitting of the data into different GPUs happens AFTER the tokenisation, however I believe that with a slight adaptation to your answer it could work.
However, I’ve done some further debugging and I’ve noticed that the data are not actually randomly split across GPUs as I thought. If I set shuffle=False in the DistributedSampler then this happens:
in the case of two GPUs, GPU 0 and GPU 1, all the samples with even index (starting from 0) will be passed to GPU 0, and all those with odd index will be passed to GPU 1.
So for example, if you have 10 samples, whose indices are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], then samples 0, 2, 4, 6, 8 will go to GPU 0 and samples 1, 3, 5, 7, 9 will to go GPU 1. Therefore this allows me to map the predictions back to the original text string samples by just using this ordering. Not sure if this is the best solution, as keeping the original text string next to its prediction would be ideal, but at least it works.
N.B. Special case: As the two GPUs must be passed the SAME number of inputs, if the number of inputs is an odd number, for example we have 9 samples with indices [0, 1, 2, 3, 4, 5, 6, 7, 8], then GPU 0 will be passed samples 0, 2, 4, 6, 8 and GPU 1 will be passed samples 1, 3, 5, 7, 0 (in this exact order). In other words, the first sample with index 0 is repeated at the very end of the dataset to make sure each GPU has the same number of samples, in which case we can then write some codes which drops the last prediction from GPU 1 as it is redundant.

How to understand SpatialDropout1D and when to use it?

Occasionally I see some models are using SpatialDropout1D instead of Dropout. For example, in the Part of speech tagging neural network, they use:
model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE,
input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2)) ##This
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))
According to Keras' documentation, it says:
This version performs the same function as Dropout, however it drops
entire 1D feature maps instead of individual elements.
However, I am unable to understand the meaning of entrie 1D feature. More specifically, I am unable to visualize SpatialDropout1D in the same model explained in quora.
Can someone explain this concept by using the same model as in quora?
Also, under what situation we will use SpatialDropout1D instead of Dropout?

To make it simple, I would first note that so-called feature maps (1D, 2D, etc.) is our regular channels. Let's look at examples:
Dropout(): Let's define 2D input: [[1, 1, 1], [2, 2, 2]]. Dropout will consider every element independently, and may result in something like [[1, 0, 1], [0, 2, 2]]
SpatialDropout1D(): In this case result will look like [[1, 0, 1], [2, 0, 2]]. Notice that 2nd element was zeroed along all channels.

The noise shape
In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor is [2, 2, 2], each of 8 elements can be zeroed out depending on random coin flip (with certain "heads" probability); in total, there will be 8 independent coin flips and any number of values may become zero, from 0 to 8.
Sometimes there is a need to do more than that. For example, one may need to drop the whole slice along 0 axis. The noise_shape in this case is [1, 2, 2] and the dropout involves only 4 independent random coin flips. The first component will either be kept together or be dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8. It cannot be 1 or 5.
Another way to view this is to imagine that input tensor is in fact [2, 2], but each value is double-precision (or multi-precision). Instead of dropping the bytes in the middle, the layer drops the full multi-byte value.
Why is it useful?
The example above is just for illustration and isn't common in real applications. More realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, each batch and channel component will be kept independently, but each row and column will be kept or not kept together. In other words, the whole [l, m] feature map will be either kept or dropped.
You may want to do this to account for adjacent pixels correlation, especially in the early convolutional layers. Effectively, you want to prevent co-adaptation of pixels with its neighbors across the feature maps, and make them learn as if no other feature maps exist. This is exactly what SpatialDropout2D is doing: it promotes independence between feature maps.
The SpatialDropout1D is very similar: given shape(x) = [k, l, m] it uses noise_shape = [k, 1, m] and drops entire 1-D feature maps.
Reference: Efficient Object Localization Using Convolutional Networks
by Jonathan Tompson at al.

LSTM-RNN : How to shape multivariate Inputs

Hi everybody I am struggeling with the tensorflow RNN implementation:
The problem:
I want to train an LSTM implentation of an RNN to detect malicious connections in the KDD99 dataset. Its a dataset with 41 features and (after some preprocessing) a label vector of the size 5.
[
[x1, x2, x3, .....x40, x41],
...
[x1, x2, x3, .....x40, x41]
]
[
[0, 1, 0, 0, 0],
...
[0, 0, 1, 0, 0]
]
As a basic architurecture I would like to implement the following:
cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell=cell, output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell(cells=[cell] * 3, state_is_tuple=True)
My question is: In order to feed it to the model, how would i need to reshape the input features?
Would I not just have to reshape the input features, but to build sliding window sequences?
What I mean by that:
Assuming a sequence length of ten, the first suqence would contains data point 0 - 9, the second one contains data points 1 - 10, 2 - 11 and so on.
Thanks!

I do not know the dataset but I think that you problem is the following: you have a very long sequence and you want to know how to shape this sequence in order to provide this to the network.
The 'tf.contrib.rnn.static_rnn' has the following signature:
tf.contrib.rnn.static_rnn(cell, inputs, initial_state=None, dtype=None, sequence_length=None, scope=None)
where
inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size], or a nested tuple of such elements.
So the inputs need to be shaped into lists, where each element of the list is the element of the input sequence at each time step.
The length of this list depend on your problem and/or on computational issues.
In Natural Language Processing, for example, the length of this list can be the maximum sentence length of your document, where shorter sentences are padded to that length. As in this case, in many domains the length of the sequence is driven by the problem
However, you can have no such evidences in your problem or still having a long sequence. Long sequences are very heavy from a computational point of view. The BPTT algorithm, used to optimize this models, "unfolds" the recurrent network in a very deep feedforward network with shared parameters and back propagates over it. In this cases, it is still convenient to "cut" the sequence to a fixed length.
And here we arrive at your question, given this fixed length, let us say 10, how do I shape my input?
Usually, what is done is to cut the dataset in non overlapping windows (in your example, we will have 1-9, 10-19, 20-29, etc. What happens here is that the network only looks a the last 10 elements of the sequence each time it updates the weights with BPTT.
However, since the sequence has been arbitrarily cut, it is likely that predictions need to exploit evidences that are far back in the sequence, outside the current window. To do this, we initialize the initial state of the RNN at window i with the final state of the window i-1 using the parameter:
initial_state: (optional) An initial state for the RNN.
Finally, I give you two sources to go into more details:
RNN Tutorial This is the official tutorial of tensorflow. It is applied to the task of Language Modeling. At a certain point of the code, you will see that the final state is fed to the network from one run to the following one, in order to implement what said above.
feed_dict = {}
for i, (c, h) in enumerate(model.initial_state):
feed_dict[c] = state[i].c
feed_dict[h] = state[i].h
DevSummit 2017 This is a video of a talk during the Tensorflow DevSummit 2017 where, in the first section (Reading and Batching Sequence Data), it is explained how and using which functions you should shape your sequence inputs.
Hope this helps :)

Multi Label classification with Sklearn

I have tried using the OneVsRest with Logistic Regression from Sklearn, but it gives empty labels for some samples (i.e. doesn't predict any out), even though I do not have any unlabelled training data.
Any idea what might be causing this or how to fix this?
clf = OneVsRestClassifier(LogisticRegression(multi_class='ovr',max_iter=1000,solver='lbfgs'))
clf.fit(X,Y)
self.classifier=clf
self.classifier.predict(test_data)

Whenever you are performing MultiLabel classification, according to the OneVsRestClassifier the targets need to be "a sequence of sequences of labels".
Moreover, depending on how you encode this labels you may get the following warning: "DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation."
So, neat way to encode your labels:
from sklearn import preprocessing
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1,2), (1,2),(4,)])
# this means sample one belongs to classes {1,2} and so on.
# Take into account the format if only one class is needed, (4,) not (4)
so Y turns out to be:
array([[1, 1, 0],
[1, 1, 0],
[1, 1, 0],
[0, 0, 1]])

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart