PyTorch: Why is batch the second dimension in the default LSTM?

In the PyTorch LSTM documentation it is written:
batch_first – If True, then the input and output tensors are provided
as (batch, seq, feature). Default: False
I'm wondering why they chose the batch dimension to be the second one by default rather than the first one. For me, it is easier to imagine my data as [batch, seq, feature] than as [seq, batch, feature]; the first is intuitive and the second is counterintuitive.
I'm asking here to find out whether there is any reason behind this, and to get some understanding of it.

As far as I know, there is no heavily justified answer. Nowadays it differs from other frameworks such as Keras, where, as you say, the shape is more intuitive, but the default is kept mainly for compatibility with older versions: changing a default parameter that modifies the dimensions of a tensor would probably break half of the models out there if their maintainers updated to newer PyTorch versions.
Probably the idea, in the beginning, was to put the temporal dimension first to simplify iterating over time, so you can just do
for t, out_t in enumerate(my_tensor)
instead of less readable indexing such as my_tensor[:, t] inside a loop over range(seq_len).
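As a hedged illustration of that convenience (the shapes here are made up for the example):

import torch

seq_len, batch, feat = 5, 3, 10
my_tensor = torch.randn(seq_len, batch, feat)   # default (seq, batch, feature) layout

# Time-major layout: iterating over time is just iterating over dimension 0.
for t, out_t in enumerate(my_tensor):
    print(t, out_t.shape)                       # each out_t is (batch, feature)

# A batch-first layout needs explicit indexing over the time dimension instead:
batch_first_tensor = my_tensor.transpose(0, 1)  # (batch, seq, feature)
for t in range(batch_first_tensor.size(1)):
    out_t = batch_first_tensor[:, t]            # (batch, feature)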

In an answer to another question it is written:
There is an argument in favor of not using batch_first, which states
that the underlying API provided by Nvidia CUDA runs considerably
faster with batch as the second dimension.
But I don't know if this argument is true or not.
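For reference, here is a minimal sketch of both layouts with nn.LSTM (the sizes are arbitrary); I can't vouch for the speed claim, which would depend on the cuDNN path and the hardware:

import torch
import torch.nn as nn

x = torch.randn(32, 50, 100)                     # (batch, seq, feature)

# Default layout: the LSTM expects (seq, batch, feature).
lstm = nn.LSTM(input_size=100, hidden_size=64)
out, _ = lstm(x.transpose(0, 1))                 # out: (seq, batch, hidden)

# With batch_first=True the batch dimension stays first throughout.
lstm_bf = nn.LSTM(input_size=100, hidden_size=64, batch_first=True)
out_bf, _ = lstm_bf(x)                           # out_bf: (batch, seq, hidden)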

Related

bootstrap parameter in sklearn random forest

Could someone explain the intuition of the parameter "bootstrap" for the random forest model?
When looking at the scikit-learn page https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html:
bootstrap bool, default=True
Whether bootstrap samples are used when building trees. If False, the
whole dataset is used to build each tree.
I am even more confused because I thought that random forest was already a technique using bootstrap, so why is there this parameter to define?
Roughly speaking, bootstrap sampling is just sampling with replacement, which naturally leads to some samples of the original dataset being left out, while others are present more than once.
I thought that random forest was already a technique using bootstrap
You are right in that the original RF algorithm as suggested by Breiman indeed incorporates bootstrap sampling by default (this is actually an inheritance from bagging, which is used in RF).
Nevertheless, implementations like the scikit-learn one, understandably prefer to leave available the option not to use bootstrap sampling (i.e. sampling with replacement), and use the whole dataset instead; from the docs:
The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
The situation is similar in the standard R implementation, where the respective parameter is called replace and is likewise set to TRUE by default.
So, nothing really strange here, beyond the (generally desirable) design choice of leaving room and flexibility for the practitioner to choose whether to use bootstrap sampling or not. In the early days of RF, bootstrap sampling offered the extra possibility of calculating the out-of-bag (OOB) error without using cross-validation, an idea that (I think...) eventually fell out of favor and "freed" practitioners to try leaving out the bootstrap sampling option, if this leads to better performance.
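As a small illustration of this option in scikit-learn (the dataset and forest sizes are arbitrary); note that oob_score only makes sense when bootstrap=True:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Classic RF: bootstrap sampling, with the OOB error as a free validation estimate.
rf_bootstrap = RandomForestClassifier(n_estimators=200, bootstrap=True, oob_score=True,
                                      random_state=0).fit(X, y)
print("OOB score:", rf_bootstrap.oob_score_)

# Same forest, but each tree now sees the whole dataset (only the random feature
# subsampling at each split remains).
rf_full = RandomForestClassifier(n_estimators=200, bootstrap=False, random_state=0).fit(X, y)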
You may also find parts of my answer in Why is Random Forest with a single tree much better than a Decision Tree classifier? useful.

Why does the gated activation function (used in Wavenet) work better than a ReLU?

I have recently been reading the WaveNet and PixelCNN papers, and in both of them they mention that using gated activation functions works better than a ReLU. But in neither case do they offer an explanation as to why that is.
I have asked on other platforms (like on r/machinelearning) but I have not gotten any replies so far. Might it be that they just tried (by chance) this replacement and it turned out to yield favorable results?
Function for reference:
y = tanh(W_{k,f} ∗ x) ⊙ σ(W_{k,g} ∗ x)
Element-wise multiplication between the sigmoid and tanh of the convolution.
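For concreteness, here is a minimal PyTorch sketch of such a gated unit; the 1D convolutions, channel count and kernel size are illustrative assumptions, not the actual WaveNet code:

import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    # y = tanh(conv_f(x)) * sigmoid(conv_g(x)), element-wise
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.conv_f = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)  # filter branch
        self.conv_g = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)  # gate branch

    def forward(self, x):
        return torch.tanh(self.conv_f(x)) * torch.sigmoid(self.conv_g(x))

x = torch.randn(8, 16, 100)          # (batch, channels, time)
out = GatedActivation(16)(x)         # gated output, slightly fewer time steps (no padding)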
I did some digging and talked some more with a friend, who pointed me towards the paper by Dauphin et al., "Language Modeling with Gated Convolutional Networks". It offers a good explanation of this topic in section 3:
LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep.
In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers.
In other words, they adopted the concept of gates and applied it to sequential convolutional layers, to control what type of information is let through, and apparently this works better than using a ReLU.
Edit: But WHY it works better, I still don't know; if anyone could give me an even remotely intuitive answer I would be grateful. I looked around a bit more, and apparently we are still basing our judgement on trial and error.
I believe it's because it's highly non-linear near zero, unlike ReLU. With their new activation function [tanh(W1 ∗ x) ∗ sigmoid(W2 ∗ x)] you get a function that has some interesting bends in the [-1, 1] range.
Don't forget that this isn't operating on the feature space, but on a matrix multiplication of the feature space, so it's not just "bigger feature values do this, smaller feature values do that" but rather it operates on the outputs of a linear transform of the feature space.
Basically it chooses regions to highlight, regions to ignore, and does so flexibly (and non-linearly) thanks to the activation.
https://www.desmos.com/calculator/owmzbnonlh , see "c" function.
This allows the model to separate the data in the gated attention space.
That's my understanding of it but it is still pretty alchemical to me as well.

Keras: Difference between Kernel and Activity regularizers

I have noticed that weight_regularizer is no longer available in Keras and that, in its place, there are activity and kernel regularizers.
I would like to know:
What are the main differences between kernel and activity regularizers?
Could I use activity_regularizer in place of weight_regularizer?
The activity regularizer works as a function of the output of the net, and is mostly used to regularize hidden units, while weight_regularizer, as the name says, works on the weights (e.g. making them decay). Basically you can express the regularization loss as a function of the output (activity_regularizer) or of the weights (weight_regularizer).
The new kernel_regularizer replaces weight_regularizer - although it's not very clear from the documentation.
From the definition of kernel_regularizer:
kernel_regularizer: Regularizer function applied to
the kernel weights matrix
(see regularizer).
And activity_regularizer:
activity_regularizer: Regularizer function applied to
the output of the layer (its "activation").
(see regularizer).
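As a minimal, hedged illustration (using tf.keras here; the layer sizes and regularization factors are arbitrary), both can simply be passed when building a layer:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,),
                 kernel_regularizer=regularizers.l2(1e-4),      # penalizes large weights
                 activity_regularizer=regularizers.l1(1e-5)),   # penalizes large layer outputs
    layers.Dense(1),
])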
Important Edit: Note that there is a bug in activity_regularizer that was only fixed in version 2.1.4 of Keras (at least with the TensorFlow backend). Indeed, in older versions, the activity regularizer function is applied to the input of the layer, instead of being applied to the output (the actual activations of the layer, as intended). So beware: if you are using an older version of Keras (before 2.1.4), activity regularization probably does not work as intended.
You can see the commit on GitHub: François Chollet provided a fix to the activity regularizer, which was then included in Keras 2.1.4.
This answer is a bit late, but may be useful for future readers.
So, necessity is the mother of invention as they say. I only understood it when I needed it.
The above answer doesn't really state the difference, because both of them end up affecting the weights, so what's the difference between penalizing the weights themselves and penalizing the output of the layer?
Here is the answer: I encountered a case where the weights of the net are small and nice, ranging between -0.3 and +0.3.
So I really can't penalize them; there is nothing wrong with them. A kernel regularizer is useless. However, the output of the layer is HUGE, in the hundreds.
Keep in mind that the input to the layer is also small, always less than one. But those small values interact with the weights in such a way that they produce those massive outputs. Here I realized that what I need is an activity regularizer, rather than a kernel regularizer. With this, I'm penalizing the layer for those large outputs; I don't care if the weights themselves are small, I just want to deter the layer from reaching such a state, because it saturates my sigmoid activation and causes tons of other trouble like vanishing gradients and stagnation.

Machine learning kernels (how to check if the data is linearly separable in high-dimensional space using a given kernel)

How can I test/check whether a given kernel (for example RBF or polynomial) really does separate my data?
I would like to know if there is a method (not plotting the data of course) which can allow me to check if a given data set (labeled with two classes) can be separated in high dimensional space?
In short - no, there is no general way. However, for some kernels you can easily say that... everything is separable. This property, proved in many forms (among others by Schoenberg), says for example that if your kernel is of the form K(x,y) = f(||x-y||^2) and f is:
infinitely differentiable
completely monotonic (which more or less means that the derivatives alternate in sign: the first is negative, the next positive, the next negative, ...)
positive
then it will always be able to separate every binary-labeled, consistent dataset (one in which no two identical points carry different labels). Actually it says even more - you can interpolate exactly, meaning that even if it is a regression problem you will get zero error. So in particular multi-class and multi-label problems will also be linearly solvable (there exists a linear/multi-linear model which gives you a correct interpolation).
However, if the above properties do not hold, it does not mean that your data cannot be perfectly separated. This is only a one-way implication.
In particular, this class of kernels includes the RBF kernel, so it will always be able to separate any training set (which is why it overfits so easily!).
So what about the other direction? Here you first have to fix the hyperparameters of the kernel, and then you can answer it through optimization: solve the hard-margin SVM problem (C = inf), and it will find a solution iff the data is separable.
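A rough practical check along those lines, approximating the hard margin with a very large C (the kernel parameters and dataset are placeholders):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Hard-margin-like SVM with a fixed RBF kernel: C is set very large so that
# margin violations are (practically) not allowed.
clf = SVC(kernel="rbf", gamma=1.0, C=1e10).fit(X, y)

# If the data is separable in the induced feature space, the training accuracy
# should be (numerically close to) 1.0.
print("training accuracy:", clf.score(X, y))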

Most appropriate normalization / transformation method for skewed features?

I am trying to pre-process biological data to train a neural network and despite an extensive search and repetitive presentation of the various normalization methods I am none the wiser as to which method should be used when. In particular I have a number of input variables which are positively skewed and have been trying to establish whether there is a normalisation method that is most appropriate.
I was also worried about whether the nature of these inputs would affect the performance of the network and as such have experimented with data transformations (log transformation in particular). However, some inputs have many zeros but may also take small decimal values, and seem to be highly affected by a log(x + 1) transform (or log(x + c) for any c from 1 down to 0.0000001, for that matter), with the resulting distribution failing to approach normality (it either remains skewed or becomes bimodal with a sharp peak at the minimum value).
Is any of this relevant to neural networks? I.e. should I be using specific feature transformation / normalization methods to account for the skewed data, or should I just ignore it, pick a normalization method, and push ahead?
Any advice on the matter would be greatly appreciated!
Thanks!
As the features in your input vector are of a different nature, you should use a different normalization algorithm for each feature. The network should be fed uniformly scaled data on every input for better performance.
As you wrote that some data is skewed, I suppose you can run some algorithm to "normalize" it. If applying a logarithm does not work, perhaps other functions and methods, such as rank transforms, can be tried out.
If the small decimal values occur entirely in a specific feature, then just normalize that feature separately, so that its values are transformed into your working range: either [0, 1] or [-1, +1], I suppose.
If some inputs have many zeros, consider removing them from the main neural network and creating an additional neural network which will operate on the vectors with non-zeroed features. Alternatively, you may try to run Principal Component Analysis (for example, via an autoassociative memory network with structure N-M-N, M < N) to reduce the input space dimension and so eliminate the zeroed components (they will actually be taken into account in the new combined inputs somehow). By the way, the new M inputs will be automatically normalized. Then you can pass the new vectors to your actual worker neural network.
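A hedged sketch of such per-feature treatment (the choice of transform per column is purely illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) ** 2            # toy positively skewed data

X_proc = np.column_stack([
    MinMaxScaler().fit_transform(X[:, [0]]),                       # plain rescaling to [0, 1]
    np.log1p(X[:, 1]).reshape(-1, 1),                              # log(x + 1) transform
    QuantileTransformer(n_quantiles=100,
                        output_distribution="normal").fit_transform(X[:, [2]]),  # rank-based
])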
This is an interesting question. Normalization is meant to keep features' values in one scale to facilitate the optimization process.
I would suggest the following:
1- Check whether you need to normalize your data. If, for example, the means of the variables or features are within the same scale of values, you may proceed without normalization. MSVMpack uses such a normalization check condition in their SVM implementation. If, however, you do need to normalize, you are still advised to also run the models on the data without normalization, as a baseline.
2- If you know the actual maximum or minimum values of a feature, use them to normalize the feature. I think this kind of normalization would preserve the skewness in the values.
3- Try decimal scaling normalization with other features if applicable.
Finally, you are still advised to apply different normalization techniques and compare the MSE for every technique, including the z-score, which may affect the skewness of your data.
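A hedged sketch of such a comparison (the model, the scalers and the scoring are placeholders; swap in your own network and metric):

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

for scaler in [MinMaxScaler(), StandardScaler(),
               QuantileTransformer(n_quantiles=100, output_distribution="normal")]:
    model = make_pipeline(scaler, MLPRegressor(hidden_layer_sizes=(32,),
                                               max_iter=2000, random_state=0))
    mse = -cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()
    print(type(scaler).__name__, round(mse, 2))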
I hope that I have answered your question and given some support.
