How to understand SpatialDropout1D and when to use it? - machine-learning

Occasionally I see some models are using SpatialDropout1D instead of Dropout. For example, in the Part of speech tagging neural network, they use:
model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2))  ## This
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))
According to Keras' documentation, it says:
This version performs the same function as Dropout, however it drops
entire 1D feature maps instead of individual elements.
However, I am unable to understand what an entire 1D feature map means. More specifically, I am unable to visualize SpatialDropout1D in the same model explained in the Quora post.
Can someone explain this concept using the same model as in the Quora post?
Also, in what situations should we use SpatialDropout1D instead of Dropout?

To keep it simple, I would first note that the so-called feature maps (1D, 2D, etc.) are just our regular channels. Let's look at examples:
Dropout(): Let's define a 2D input: [[1, 1, 1], [2, 2, 2]]. Dropout considers every element independently and may result in something like [[1, 0, 1], [0, 2, 2]].
SpatialDropout1D(): In this case the result will look like [[1, 0, 1], [2, 0, 2]]. Notice that the 2nd channel was zeroed across all timesteps.
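To make the masks visible, here is a minimal sketch (assuming TensorFlow 2.x / Keras) that applies both layers to a small (batch, timesteps, channels) tensor of ones; note that surviving values are rescaled by 1/(1 - rate), so kept entries print as 2.0 rather than 1.0:
import tensorflow as tf

x = tf.ones((1, 2, 3))  # (batch=1, timesteps=2, channels=3)

# Plain Dropout: each of the 6 values is kept or zeroed independently.
print(tf.keras.layers.Dropout(0.5)(x, training=True))

# SpatialDropout1D: entire channels are zeroed, so a dropped channel is zero
# at every timestep (a whole column of zeros within the sample).
print(tf.keras.layers.SpatialDropout1D(0.5)(x, training=True))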

The noise shape
In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor has shape [2, 2, 2], each of its 8 elements can be zeroed out depending on a random coin flip (with a certain "heads" probability); in total, there are 8 independent coin flips, and any number of values may become zero, from 0 to 8.
Sometimes we need more than that. For example, one may need to drop values jointly along axis 0. The noise_shape in this case is [1, 2, 2], and the dropout involves only 4 independent coin flips: the two elements that share a position in the last two axes are either kept together or dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8; it cannot be 1 or 5.
Another way to view this is to imagine that the input tensor is really of shape [2, 2], but each value is double-precision (or multi-precision). Instead of dropping some bytes in the middle of a value, the layer drops the full multi-byte value.
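As a sketch of this (assuming TensorFlow 2.x, whose Dropout layer exposes a noise_shape argument):
import tensorflow as tf

x = tf.ones((2, 2, 2))  # 8 elements in total

# noise_shape=(1, 2, 2): only 4 coin flips; the mask is broadcast along
# axis 0, so the two elements sharing a (row, column) position are kept or
# dropped together, and the number of zeros is always even.
layer = tf.keras.layers.Dropout(0.5, noise_shape=(1, 2, 2))
print(layer(x, training=True))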
Why is it useful?
The example above is for illustration only and isn't common in real applications. A more realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, each batch and channel component is treated independently, but each row and column is kept or dropped together; in other words, the whole [l, m] feature map is either kept or dropped.
You may want to do this to account for the correlation between adjacent pixels, especially in the early convolutional layers. Effectively, you want to prevent co-adaptation of pixels with their neighbours across the feature maps, and make them learn as if no other feature maps existed. This is exactly what SpatialDropout2D does: it promotes independence between feature maps.
SpatialDropout1D is very similar: given shape(x) = [k, l, m] it uses noise_shape = [k, 1, m] and drops entire 1-D feature maps (channels).
Reference: Efficient Object Localization Using Convolutional Networks
by Jonathan Tompson et al.

Related

How do I match samples with their predictions when doing inference with PyTorch's DistributedSampler?

I have trained a torch model for NLP tasks and would like to perform some inference using a multi-GPU machine (in this case with two GPUs).
Inside the processing code, I use this:
dataset = TensorDataset(encoded_dict['input_ids'], encoded_dict['attention_mask'])
sampler = DistributedSampler(
    dataset, num_replicas=args.nodes * args.gpus, rank=args.node_rank * args.gpus + gpu_number, shuffle=False
)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
For those familiar with NLP, encoded_dict is the output from the tokenizer.batch_encode_plus function where the tokenizer is an instance of transformers.BertTokenizer.
The issue I’m having is that when I call the code through the torch.multiprocessing.spawn function, each GPU is doing predictions (i.e. inference) on a subset of the full dataset, and saving the predictions separately; for example, if I have a dataset with 1000 samples to predict, each GPU is predicting 500 of them. As a result, I have no way of knowing which samples out of the 1000 were predicted by which GPU, as their order is not preserved, therefore the model predictions are meaningless as I cannot trace each of them back to their input sample.
I have tried to save the dataloader instance (as a pickle) together with the predictions and then extract the input_ids using dataloader.dataset.tensors; however, this requires a tokeniser decoding step, which I'd rather avoid, as the tokenizer will have slightly changed the text (for example, double whitespaces would be removed, words with dashes will have been split, and so on).
What is the cleanest way to save the input text samples together with their predictions when doing inference in distributed mode, or alternatively to keep track of which prediction refers to which sample?
As I understand it, your dataset returns [data, label] for an index idx during training and [data] during inference. The issue is that idx is not preserved by the dataloader object, so there is no way to recover the idx values for a minibatch after the fact.
One way to handle this issue is to define a very simple custom dataset object that returns [data, id] instead of only data during inference. Probably the easiest way to do this is to make the dataset return a dictionary object with keys id and data. The dictionary return type is convenient because PyTorch collates (converts data structures to batches) this type automatically; otherwise you'd have to define a custom collate_fn and pass it to the dataloader object, which is itself not very hard but is an extra step.
In any case, here's how I would define a new dataset object; it should be an almost one-to-one substitute for your current dataset (I believe):
from torch.utils.data import Dataset

class TensorDictDataset(Dataset):
    def __init__(self, ids, attention_mask):
        self.ids = ids
        self.mask = attention_mask

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        # Return the input ids alongside the mask so that each prediction
        # can be traced back to its input sample.
        return {
            "mask": self.mask[idx],
            "id": self.ids[idx],
        }
The only change you'll then have to make is that rather than returning mask, your dataset now returns a dict {"mask": mask, "id": id}, so you'll have to parse that appropriately.
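Here's a short usage sketch of the class above (the tensors are hypothetical stand-ins for your tokenizer output):
import torch
from torch.utils.data import DataLoader

input_ids = torch.randint(0, 30000, (8, 16))          # stand-in for encoded_dict['input_ids']
attention_mask = torch.ones(8, 16, dtype=torch.long)  # stand-in for encoded_dict['attention_mask']

loader = DataLoader(TensorDictDataset(input_ids, attention_mask), batch_size=4)
for batch in loader:
    # The default collate_fn batches dictionaries automatically, so each
    # batch is a dict of stacked tensors.
    masks, ids = batch["mask"], batch["id"]
    print(masks.shape, ids.shape)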
Thanks for your answer. I have done further debugging, found another solution, and wanted to post it.
Your solution is quite elegant (there was one minor misunderstanding, in that the predictions contain only the predicted labels and not the data, contrary to what you understood, but this doesn't affect your answer anyway). A mask in NLP is also something else; instead of having the mask tokens together with the predictions, I would like to have the untokenized text string. This is not so easy to achieve, because the splitting of the data across GPUs happens AFTER the tokenisation; however, I believe that with a slight adaptation of your answer it could work.
However, I’ve done some further debugging and I’ve noticed that the data are not actually randomly split across GPUs as I thought. If I set shuffle=False in the DistributedSampler then this happens:
in the case of two GPUs, GPU 0 and GPU 1, all the samples with even index (starting from 0) will be passed to GPU 0, and all those with odd index will be passed to GPU 1.
So for example, if you have 10 samples, whose indices are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], then samples 0, 2, 4, 6, 8 will go to GPU 0 and samples 1, 3, 5, 7, 9 will go to GPU 1. This allows me to map the predictions back to the original text samples just by using this ordering. I am not sure it is the best solution, as keeping the original text string next to its prediction would be ideal, but at least it works.
N.B. Special case: as the two GPUs must be passed the SAME number of inputs, if the number of inputs is odd, for example 9 samples with indices [0, 1, 2, 3, 4, 5, 6, 7, 8], then GPU 0 will be passed samples 0, 2, 4, 6, 8 and GPU 1 will be passed samples 1, 3, 5, 7, 0 (in this exact order). In other words, the first sample (index 0) is repeated at the very end of the dataset to make sure each GPU has the same number of samples, in which case we can write some code that drops the last prediction from GPU 1, as it is redundant.
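For what it's worth, this interleaving can be reproduced on a single machine, because DistributedSampler accepts explicit num_replicas and rank arguments without an initialized process group; a small sketch:
import torch
from torch.utils.data import TensorDataset, DistributedSampler

dataset = TensorDataset(torch.arange(10))
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    print(rank, list(sampler))
# prints: 0 [0, 2, 4, 6, 8] and 1 [1, 3, 5, 7, 9]
# With 9 samples instead of 10, rank 1 would receive [1, 3, 5, 7, 0]: index 0
# is repeated so that both replicas see the same number of samples.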

what to do after binning numerical feature?

I want to know what to do after binning. For example, one of the features is age, so my data is [11, 12, 35, 26].
Then I apply binning with size of 10:
bin, name
[0, 10) --> 1
[10, 20) --> 2
[20, 30) --> 3
[30, 40) --> 4
Then my data becomes [2, 2, 4, 3]. Now assume I want to feed this data into a linear regression model. Should I treat [2, 2, 4, 3] as a numerical feature? Or should I treat it as a categorical feature, i.e. do one-hot encoding first and then feed it to the model?
If you are building a linear model, then one-hot encoding of those bins might be the better option, so that if there is any linear relationship with the target, the one-hot encoding (OHE) will preserve it.
If you are building tree-based models, like random forests, then you could use [2, 2, 4, 3] as a numerical feature, because these models are non-linear.
If you are building a regression model and do not want to expand the feature space with OHE, you could treat the bins as a categorical variable and encode it using mean/target encoding, or with ordinal digits ordered by the target mean per bin.
More details about the last 2 procedures in this article.
Disclaimer: I wrote the article.
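For illustration, here is a minimal sketch (assuming pandas) of the first two options applied to the ages from the question:
import pandas as pd

ages = pd.DataFrame({"age": [11, 12, 35, 26]})
# Reproduce the binning from the question: [0, 10) -> 1, [10, 20) -> 2, ...
ages["age_bin"] = pd.cut(ages["age"], bins=[0, 10, 20, 30, 40],
                         right=False, labels=[1, 2, 3, 4])

# Option 1: one-hot encode the bins (the linear-model-friendly choice).
X_linear = pd.get_dummies(ages["age_bin"], prefix="age_bin")

# Option 2: keep the ordinal bin codes [2, 2, 4, 3] (often fine for tree models).
X_tree = ages[["age_bin"]].astype(int)

print(X_linear)
print(X_tree)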

Can artificial neural networks work with mathematical sets?

I know that using neural networks for anything text-related is difficult as they have problems with non-numerical input data.
But I'm not sure about mathematical sets. And sets of sets.
Like [0, 1, 2] and [3, 4, 5] or [[0, 1], [2, 3]] and [[4, 5], [6, 7]]
It should be possible to compute distances between these by computing the distances between all corresponding elements, right? I can't really find any information on that and don't want to start using neural networks without being sure.
(Googling anything with 'set' just isn't promising, because all you get as a result is the term 'data set'...)
EDIT:
First: The assignment specifically asks for a neural network, so I can't use k-means or any other clustering methods.
So the original question wasn't really addressing the actual problem. I don't have to think of a distance metric, but of a way to feed the sets into the activation function, and for that, of how to map them to a single value. But regarding the distance metric, I'm actually not really sure at what point of the neural network I need it... I guess that's a basic comprehension problem.
I will just write down some thoughts now.
The thing that confuses me is the standardization of categories. Given three categories 'red', 'green' and 'blue', you can map them to the numbers 1 to 3, but that would mean 'red' is farther from 'blue' than 'green' is, which is not the case. So the categories are encoded as (1, 0, 0), (0, 1, 0) and (0, 0, 1), which gives them all the same pairwise distance.
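A quick numerical check of that claim (just a NumPy sketch):
import numpy as np

labels = {"red": 1, "green": 2, "blue": 3}
one_hot = {"red": np.array([1, 0, 0]),
           "green": np.array([0, 1, 0]),
           "blue": np.array([0, 0, 1])}

# Integer labels: red is twice as far from blue as it is from green.
print(abs(labels["red"] - labels["blue"]), abs(labels["red"] - labels["green"]))

# One-hot vectors: every pair of colours sits at the same distance, sqrt(2).
print(np.linalg.norm(one_hot["red"] - one_hot["blue"]),
      np.linalg.norm(one_hot["red"] - one_hot["green"]))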
So it must be possible to feed these into the activation function somehow. I could imagine they are interpreted as binary numbers, so that (1,0,0) = 100 = 4, (0,1,0) = 010 = 2 and (0,0,1) = 001 = 1. That would be a distinct mapping. But the numbers 1 to 3 are distinct too, so, as mentioned above, the distance metric must become necessary at some point.
So the problem is still how to map a set to a single value. I can do that right before I feed it into the function, so I don't have to choose a mapping that also maintains a logical distance between the sets, because when it comes to applying the distance metric I can still apply it to the original sets rather than to the mapped values. Is that correct? Or am I still missing something?
Neural nets, in general, have no such problem. Image recognition and language translation are well within their domains. What you do need are the metrics and manipulations to relate your inputs to the ground truth in a well-ordered fashion -- which your distance metric will do quite nicely.
Go right ahead and build your neural network. Supply it with the appropriate distance function, and let it train away. Do make sure to put in some tracking instrumentation (e.g. print statements) to trace the operation for a few iterations before you turn it entirely loose.

Why does the model fail to learn this game of filling up integers

My question: Why does my model fail to learn to play this game of just producing an array of unique elements from 1 to 5 from a partially filled array?
===
I am trying to train a model to perform this task:
Given a fixed array of 5 elements consisting of at most ONE of each element from (1, 2, 3, 4, 5) and ONE OR MORE (0), replace the 0s with appropriate values so that the final array has exactly ONE of each (1, 2, 3, 4, 5).
So, here is how it should be played:
[1, 2, 3, 4, 0] => [1, 2, 3, 4, 5]
[4, 3, 0, 5, 1] => [4, 3, 2, 5, 1]
[0, 3, 5, 4, 0] => [2, 3, 5, 4, 1] OR [1, 3, 5, 4, 2]
...
This is not a complicated game (in the human sense), but I want to see if a model can identify the rules (replace the 0s with values from 1 to 5, so that the final array has exactly one of each element from (1, 2, 3, 4, 5)).
The way I did this is:
Generate N configurations with elements of [1, 2, 3, 4, 5] as answers, and randomly replace some of the elements with 0s.
For instance, one training example is [(0, 3, 5, 4, 0), (2, 3, 5, 4, 1)].
There can be multiple same input mapping to different output, i.e. [(0, 3, 5, 4, 0), (2, 3, 5, 4, 1)] and [(0, 3, 5, 4, 0), (1, 3, 5, 4, 2)] can be both present as two separate training instances.
Split the training data into 10 folds, shuffled, and train using a RandomForestClassifier from scikit-learn.
A correct output is defined as the final configuration array has exactly ONE element from (1, 2, 3, 4, 5). So (2, 4, 4, 5, 1) is not valid.
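A sketch of this setup (the masking probabilities and forest size are illustrative assumptions):
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_example():
    # A valid answer is a permutation of (1..5); the input masks a random
    # non-empty subset of its positions with 0s.
    answer = list(np.random.permutation([1, 2, 3, 4, 5]))
    masked = answer.copy()
    for i in random.sample(range(5), k=random.randint(1, 4)):
        masked[i] = 0
    return masked, answer

X, y = zip(*(make_example() for _ in range(10000)))
# scikit-learn treats the 5-column target as a multi-output classification task.
model = RandomForestClassifier(n_estimators=100).fit(X, y)

def is_valid(pred):
    # A correct output uses each of 1..5 exactly once.
    return sorted(pred) == [1, 2, 3, 4, 5]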
===
Surprisingly, using 1000, 10000, 50000, and even 100000 training examples still results in the model only getting ~70% of the test cases right - meaning the model did not learn how to play the game with increasing training examples.
One thing I was thinking is that RandomForestClassifier is just not meant for this type of problem, called structured machine learning, where the output is not a single category or a real-valued output, but a vector of outputs.
More questions:
Why does the model fail to learn this game?
Is this the right way to model this problem?
Is the data not enough to learn this task? But increasing the data from 1000 to 100000 does not seem to help at all.
Thank you!
lejlot's answer is excellent, but I thought I'd add a bit of intuition as to why random forest fails in this case.
You have to keep in mind that Machine Learning isn't some magic way to impart intelligence to computers; it's simply a way of fitting a particular model to your data and using that model to make generalizations. As the old adage goes, "all models are wrong, but some are useful". You've hit on a case where the model is wrong as usual, but also happens to be useless!
The output space: Random forests at their core are basically a clever and generalizable way of mapping inputs to outputs. Your output space has 5^5 = 3125 possible unique outputs, and only 5! = 120 of these are valid (i.e. outputs with one of each number). The only way for a random forest to know whether an output is valid is if it has seen it: so in order to work correctly, your training set will have to include examples with all of those 120 outputs.
The input space: when a random forest encounters an input it has seen before, it will map that directly to an output that it has seen before. But what if it encounters an input it has not seen? For example, what if you ask for the answer to [0, 2, 3, 4, 1] and this is not in the training set? In terms of Euclidean distance (a useful way to think about how things are grouped) the closest result will probably be something like [0, 2, 3, 4, 0], which might map to [1, 2, 3, 4, 5], which is wrong. Thus we see that in order for random forests to work correctly, your training set will have to have all possible inputs. Some quick combinatorics show that your training set will have to be of size at least 5!*32 = 3840, with no duplicates.
The forest itself: even if you have a complete input space, the random forest does not consist of a simple dictionary mapping of inputs to outputs. Depending on the parameters of the model, the mapping is typically from groups of nearby results to a single answer, so that, for example, {[1, 2, 3, 4, 5], [1, 0, 3, 4, 5], [0, 1, 3, 4, 5]...} will all map to [1, 2, 3, 4, 5]. This sort of generalization is useful in most cases, but is not useful for your particular problem. The only way for the random forest to work in your case would be to push the max_depth and min_samples parameters to their extreme values, so that the forest is essentially a one-to-one mapping of inputs to their correct outputs: in other words your classifier would be just an extremely complicated way of building a dictionary.
To summarize: Machine Learning is just a model applied to data, which is useful in certain cases. In your case, the model is not all that useful: in order for Random Forests to work on your problem, you'd need to over-fit a comprehensive set of inputs and outputs. At that point, you might as well just construct a dictionary and call it a day.
I assume that this is just a mind exercise and not an actual problem, because obviously a set-based solution will be better than any ML technique for such a task.
In short: because classifiers/regressors are not meant for combinatorial optimization. Your problem has extremely strong constraints: only a very small number of outputs are "correct" and "observable", and you are looking for a property of the output, not its value. This is not a setting for classification or regression.
What can you do?
In such a constrained scenario, you have to give your method knowledge about what is going on. Show it a state space. This is rather a case for simple state-space AI, not ML as such; rather for metaheuristic optimization, like hill climbing, simulated annealing, GAs, etc.
Look at things like General Game Playing; this is somewhat similar, but the important difference is that you provide a set of rules.
Look at things like Neural Turing Machines; these are sequential methods that try to learn how to manipulate data rather than to classify or regress.
In general, this is a very common misconception when one starts learning machine learning. Not every problem is suitable for "just applying" known ML techniques. Most of the problems "out there" require considerable input from the researcher to exploit the strength of ML.

How to evaluate predictions from incomplete data, where not all data is incomplete

I am using Non-negative Matrix Factorization and Non-negative Least Squares for predictions, and I want to evaluate how good the predictions are depending on the amount of data given. For example the original Data was
original = [1, 1, 0, 1, 1, 0]
And now I want to see how good I can reconstruct the original data when the given data is incomplete:
incomplete1 = [1, 1, 0, 1, 0, 0],
incomplete2 = [1, 1, 0, 0, 0, 0],
incomplete3 = [1, 0, 0, 0, 0, 0]
And I want to do this for every example in a big dataset. Now the problem is that the original data varies in the amount of positive data: in the original above there are 4, but for other examples in the dataset it could be more or less. Let's say I make an evaluation round with 4 positives given, but half of my dataset only has 4 positives, while the other half has 5, 6 or 7. Should I exclude the half with 4 positives, because they have no data missing, which makes the "prediction" much better? On the other hand, I would change the training set if I excluded data. What can I do? Or shouldn't I evaluate with 4 at all in this case?
EDIT:
Basically I want to see how well I can reconstruct the input matrix. For simplicity, say the "original" stands for a user who watched 4 movies. Then I want to know how well I can predict each user based on just 1 movie that the user actually watched. I get a prediction for lots of movies. Then I plot a ROC and a Precision-Recall curve (using the top-k of the prediction). I will repeat all of this with n movies that the users actually watched, and I will get a ROC curve in my plot for every n. When I get to the point where I use e.g. 4 movies that the user actually watched to predict all the movies he watched, but he only watched those 4, the results get too good.
The reason I am doing this is to see how many "watched movies" my system needs to make reasonable predictions. If it only returned good results once 3 movies have already been watched, it would not be very useful in my application.
I think it's first important to be clear what you are trying to measure, and what your input is.
Are you really measuring ability to reconstruct the input matrix? In collaborative filtering, the input matrix itself is, by nature, very incomplete. The whole job of the recommender is to fill in some blanks. If it perfectly reconstructed the input, it would give no answers. Usually, your evaluation metric is something quite different from this when using NNMF for collaborative filtering.
FWIW I am commercializing exactly this -- CF based on matrix factorization -- as Myrrix. It is based on my work in Mahout. You can read the docs about some rudimentary support for tests like Area under curve (AUC) in the product already.
Is "original" here an example of one row, perhaps for one user, in your input matrix? When you talk about half, and excluding, what training/test split are you referring to? splitting each user, or taking a subset across users? Because you seem to be talking about measuring reconstruction error, but that doesn't require excluding anything. You just multiply your matrix factors back together and see how close they are to the input. "Close" means low L2 / Frobenius norm.
But for conventional recommender tests (like AUC or precision-recall), which are something else entirely, you would split your data into test/training either by time (recent data is the test data) or by value (the most-preferred or most-associated items are the test data). If I understand the 0s to be missing elements of the input matrix, then they are not really "data". You wouldn't ever have a situation where the test data were all 0s, because they're not input to begin with. The question is which 1s are for training and which 1s are for testing.
