In a paper I recently read, "CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning", the authors specify that the 3D volume output by a CNN layer must be reduced to a two-dimensional sequence before entering the LSTM layer. Why is that? What's wrong with using the three-dimensional format?

The standard LSTM neural network assumes input of the following size:
[batch size] × [sequence length] × [feature dim]
The LSTM first multiplies each vector of size [feature dim] by a matrix and then combines the results in a fancy way. What's important here is that there is one vector for each example (the batch dimension) and each timestep (the sequence length dimension). In a sense, this vector is first transformed by one or more matrix multiplications (possibly involving pointwise non-linearities, which don't change the shape, so I don't mention them) into a hidden state update, which is also a vector, and the updated hidden state vector is then used to produce the output (also a vector).
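As a rough illustration of what "reducing the volume to a sequence" can look like, here is a minimal NumPy sketch. The axis layout (batch, channels, frequency, time) and all sizes are assumptions made for the example, not taken from the CountNet paper.

```python
import numpy as np

# Hypothetical CNN output laid out as (batch, channels, frequency bins, time steps).
batch, channels, freq, time = 2, 4, 5, 7
cnn_out = np.arange(batch * channels * freq * time, dtype=np.float32)
cnn_out = cnn_out.reshape(batch, channels, freq, time)

# Move time onto the sequence axis, then merge channels and frequency
# into a single feature vector per time step.
seq = cnn_out.transpose(0, 3, 1, 2).reshape(batch, time, channels * freq)

print(seq.shape)  # [batch size] x [sequence length] x [feature dim]
```

Each time step now carries one flat vector, which is the shape the LSTM expects.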
As you can see, the LSTM is designed to operate on vectors. You could design a Matrix-LSTM, an LSTM counterpart that assumes any or all of the following are matrices: the input, the hidden state, the output. That would require you to replace the matrix-vector multiplications that process the input (or the state) with a generalized linear operation that can turn any matrix into any other, which would be given by a rank-4 tensor, I believe. However, that would be equivalent to reshaping the input matrix into a vector, reshaping the rank-4 tensor into a matrix, doing a matrix-vector product, and then reshaping the output back into a matrix, so it makes little sense to devise such Matrix-LSTMs instead of just reshaping your inputs.
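The equivalence claimed above can be checked numerically. This is only a sketch with arbitrary, made-up shapes (a 3x4 input mapped to a 2x5 output); the names X, T and W are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))        # matrix-valued input
T = rng.normal(size=(2, 5, 3, 4))  # rank-4 tensor mapping a 3x4 matrix to a 2x5 matrix

# General linear map between matrices: Y[i, j] = sum over k, l of T[i, j, k, l] * X[k, l]
Y_tensor = np.einsum("ijkl,kl->ij", T, X)

# Same computation after reshaping: tensor -> matrix, input -> vector, output -> matrix.
W = T.reshape(2 * 5, 3 * 4)
Y_matrix = (W @ X.reshape(-1)).reshape(2, 5)

same = np.allclose(Y_tensor, Y_matrix)
print(same)
```

Because both paths compute the same numbers, the matrix-valued formulation adds nothing over flattening unless the operator is constrained further.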
That said, it might still make sense to design a generalized LSTM that takes something other than a vector as input if you know something about the input structure that suggests a more specific linear operator than a general rank-4 tensor. For example, images are known to have local structure (nearby pixels are more related than those far apart), hence using convolutions is more "reasonable" than reshaping images into vectors and then performing a general matrix multiplication. In a similar fashion, you could replace all the matrix-vector multiplications in the LSTM with convolutions, which would allow for image-like inputs, states and outputs.


Different branches do not contribute equally in a deep neural network

Assume we have a neural network that takes two inputs. The first input is the location and size of an object, and the second is an image of the object. The location and size go through an MLP that maps the 4-dimensional input to a 512-dimensional vector, and the image goes through ResNet34, which gives us a 512-dimensional vector describing the appearance of the object. The position vector and the appearance vector are then summed to obtain a single vector, which goes through the rest of the network.
After training the network, I got poor accuracy. I analyzed what happens in the network and realized that the position vector is not treated the same way as the appearance vector: the appearance branch has more weight in the calculations.
I want my appearance and position features to have the same impact. How can I achieve this?
Instead of summing your image vector with the vector from the additional features, I would suggest concatenating them, so you end up with a 1024-dimensional vector. The layers that come after this concatenation can then learn the relative impact of the features through your loss function.
This should allow the model to rely more heavily on whatever features result in the lowest loss.
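One way to see why concatenation is the more flexible choice: a linear layer applied to the concatenated vector can reproduce the sum exactly, but it can also learn any other weighting of the two branches. The sketch below uses random stand-in vectors and hand-picked weights purely for illustration; in a real network the weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
pos = rng.normal(size=d)  # stand-in for the position branch output
app = rng.normal(size=d)  # stand-in for the appearance branch output

fused = np.concatenate([pos, app])  # shape (1024,)

# A linear layer with weights [I | I] reproduces the plain sum.
W_sum = np.hstack([np.eye(d), np.eye(d)])  # shape (512, 1024)
sum_recovered = np.allclose(W_sum @ fused, pos + app)

# Other weights rescale the branches, e.g. tripling the position branch
# (hypothetical values, chosen by hand here).
W_scaled = np.hstack([3.0 * np.eye(d), np.eye(d)])
scaled = W_scaled @ fused

print(fused.shape, sum_recovered)
```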

long feature vector size in Neural Network

I am trying to design a neural network where I want the feature vector size to equal the input vector size. In essence, I have an image (my input) and I wish to perform a regression task on each of the pixels (i.e., my output is a prediction of how I should act on each pixel).
However, my experience with ML (I'm a newbie) seems to show that the size of the output vector is usually small compared to the input vector size. Is there a reason why I must design my network in a similar manner? Are there any pitfalls in having an output feature vector as long as the input vector?
You can safely make the output of the network as big as the input. Look, for example, at U-Net for semantic segmentation: there is one output for each pixel, which represents the category (class) of that pixel.

Data normalization Convolutional Autoencoders

I am a little bit confused about how to normalize/standardize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my training images consist of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are two options for pre-processing the images:
- normalization
- standardization (z-score)
When normalizing using the MinMax approach (scaling between 0 and 1) the network works fine, but my question here is:
- When using the min/max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standardization, the loss decreases for the first two epochs; after that it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used in a denoising application. You can use the 5% and 95% quantiles instead, or use z-score (for which ready-made implementations are more common).
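A small sketch of the quantile-based alternative, on made-up data with a single extreme outlier. The helper name quantile_scale and the decision to clip values outside the quantile range are assumptions for this example; it also assumes the two quantiles differ.

```python
import numpy as np

def quantile_scale(img, low_q=5.0, high_q=95.0):
    """Scale so the low/high percentiles map to 0 and 1, clipping the tails."""
    lo, hi = np.percentile(img, [low_q, high_q])
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

# Synthetic "image": 100 well-behaved values plus one extreme outlier.
img_noisy = np.linspace(0.0, 1.0, 101)
img_noisy[-1] = 1000.0

# Plain min-max: the outlier squashes every other pixel towards 0.
minmax = (img_noisy - img_noisy.min()) / (img_noisy.max() - img_noisy.min())

# Quantile-based scaling keeps the bulk of the values spread over [0, 1].
robust = quantile_scale(img_noisy)

print(minmax[:-1].max(), robust[50])
```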
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note, however, that this remark on activation only concerns the output layer; any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.
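To make the rationale concrete, here is a toy example with hand-picked numbers (not from any trained network): the hidden ReLU activations are non-negative, yet a linear output unit with negative weights produces a negative value, while a sigmoid output unit cannot.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden layer activations after ReLU are never negative.
hidden = relu(np.array([-1.0, 2.0, 0.5]))

# Output weights chosen by hand for illustration; two of them are negative.
w_out = np.array([0.3, -1.0, -2.0])
pre_activation = hidden @ w_out

linear_out = pre_activation            # identity ("linear") output: can be negative
sigmoid_out = sigmoid(pre_activation)  # sigmoid output: always inside (0, 1)

print(linear_out, sigmoid_out)
```

A z-scored target such as -1.5 is therefore reachable with the identity output but not with the sigmoid one.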

Why does the Gaussian radial basis function map the examples into an infinite-dimensional space?

I've just read through the Wikipedia page about SVMs, and this line caught my eye:
"If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimensions." http://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_classification
In my understanding, if I apply a Gaussian kernel in an SVM, the resulting feature space will be m-dimensional (where m is the number of training samples), since you choose your landmarks to be your training examples and measure the "similarity" between a specific example and all the examples with the Gaussian kernel. As a consequence, for a single example you'll have as many similarity values as training examples. These are going to be the new feature vectors, which are going to be m-dimensional vectors, not infinite-dimensional ones.
Could somebody explain to me what I am missing?
The dual formulation of the linear SVM depends only on the scalar products of all training vectors. A scalar product essentially measures the similarity of two vectors. We can then generalize it by replacing it with any other "well-behaved" similarity measure (it should be positive-definite, which is needed to preserve convexity and is what makes Mercer's theorem applicable). RBF is just one of them.
If you take a look at the formula here you'll see that RBF is basically a scalar product in a certain infinite-dimensional space.
Thus RBF is kind of a union of polynomial kernels of all possible degrees.
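One common way to see the "all polynomial degrees" picture, assuming a one-dimensional input and unit bandwidth (both assumptions made for this sketch), is to expand exp(xy) as a Taylor series. Each term contributes one coordinate of a feature map, and truncating the series gives a finite-dimensional approximation that improves as more terms are kept.

```python
import math
import numpy as np

def rbf(x, y):
    """Gaussian RBF kernel on scalars with unit bandwidth."""
    return math.exp(-0.5 * (x - y) ** 2)

def truncated_features(x, n_terms):
    """First n_terms coordinates of the feature map: exp(-x^2/2) * x^n / sqrt(n!)."""
    n = np.arange(n_terms)
    fact = np.array([math.factorial(k) for k in range(n_terms)], dtype=float)
    return math.exp(-0.5 * x * x) * x ** n / np.sqrt(fact)

x, y = 0.7, -0.4
exact = rbf(x, y)

# Dot products of truncated feature vectors approach the kernel value.
approx_3 = truncated_features(x, 3) @ truncated_features(y, 3)
approx_15 = truncated_features(x, 15) @ truncated_features(y, 15)

print(exact, approx_3, approx_15)
```

The exact kernel value is only reached in the limit of infinitely many coordinates, which is where the infinite-dimensional space comes from.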
The other answers are correct but don't really tell the right story here. Importantly, you are correct. If you have m distinct training points, then the Gaussian radial basis kernel makes the SVM operate in an m-dimensional space. We say that the radial basis kernel maps to a space of infinite dimension because you can make m as large as you want and the space it operates in keeps growing without bound.
However, other kernels, like the polynomial kernel, do not have this property of the dimensionality scaling with the number of training samples. For example, if you have 1000 2D training samples and you use a polynomial kernel of <x,y>^2, then the SVM will operate in a 3-dimensional space, not a 1000-dimensional space.
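The contrast can be checked on a small example. The sketch below uses nine 2D points on a grid and an RBF width of 2.0, both arbitrary choices; the explicit feature map phi_poly2 is the standard one for the <x,y>^2 kernel in two dimensions. The rank of the kernel (Gram) matrix is used here as a proxy for the dimension of the space the data occupies.

```python
import numpy as np

# Nine distinct 2D points on a 3x3 integer grid.
pts = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)

def phi_poly2(p):
    """Explicit 3-dimensional feature map for the kernel <x, y>^2 in 2D."""
    x1, x2 = p
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

# Polynomial kernel matrix, and the same matrix from explicit features.
K_poly = (pts @ pts.T) ** 2
Phi = np.array([phi_poly2(p) for p in pts])
explicit_matches = np.allclose(K_poly, Phi @ Phi.T)

# Gaussian RBF kernel matrix on the same points.
sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-2.0 * sq_dists)

rank_poly = np.linalg.matrix_rank(K_poly, tol=1e-8)
rank_rbf = np.linalg.matrix_rank(K_rbf, tol=1e-8)

print(rank_poly, rank_rbf)
```

Adding more distinct points would keep raising the RBF rank, while the polynomial rank stays capped at 3.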
The short answer is that this business about infinite dimensional spaces is only part of the theoretical justification, and of no practical importance. You never actually touch an infinite-dimensional space in any sense. It's part of the proof that the radial basis function works.
Basically, SVMs are proved to work the way they do by relying on properties of dot products over vector spaces. You can't just swap in the radial basis function and expect that it necessarily works. To prove that it does, however, you show that the radial basis function actually behaves like a dot product over a different vector space, so it's as if we're doing regular SVMs in a transformed space, which works. It happens that infinite dimensionality is OK, and that the radial basis function does correspond to a dot product in such a space. So you can say SVMs still work when you use this particular kernel.

Feeding HOG into SVM: the HOG has 9 bins, but the SVM takes in a 1D matrix

In OpenCV, there is a CvSVM class which takes in a matrix of samples to train the SVM. The matrix is 2D, with the samples in the rows.
I created my own method to generate a histogram of oriented gradients (HOG) from a video feed. To do this, I created a 9-channel matrix to store the HOG, where each channel corresponds to an orientation bin. So in the end I have a 40x30 matrix of type CV_32FC(9).
I also made a visualisation for the HOG, and it's working.
I don't see how I'm supposed to feed this matrix into the OpenCV SVM, because if I flatten it, I don't see how the SVM is supposed to learn a 9D hyperplane from 1D input data.
The SVM always takes in a single row of data per feature vector. The dimensionality of the feature vector is thus the length of the row. If you're dealing with 2D data, then there are 2 items per feature vector. An example of 2D data is on this webpage:
Code of an equivalent demo in OpenCV: http://sites.google.com/site/btabibian/labbook/svmusingopencv
The point is that even though you're thinking of the histogram as 2D with 9-bin cells, the feature vector is in fact the flattened version of this. So it's correct to flatten it out into a long feature vector. The result for me was a feature vector of length 2304 (16x16x9) and I get 100% prediction accuracy on a small test set (i.e. it's probably slightly less than 100% but it's working exceptionally well).
The reason this works is that the SVM learns one weight per item of the feature vector. So it doesn't have anything to do with the problem's dimension: the hyperplane is always in the same dimension as the feature vector. Another way of looking at it is to forget about the hyperplane and just view it as a bunch of weights, one for each item in the feature vector. It multiplies each item by its weight, sums the products and outputs the result.
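The flattening and the "one weight per item" view can be sketched with NumPy alone, without the OpenCV SVM class. The array layout (30 rows, 40 columns, 9 bins) is one possible reading of the 40x30 CV_32FC(9) matrix from the question, and the weights and bias are random placeholders rather than a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HOG for one frame: rows x cols x orientation bins.
hog = rng.random((30, 40, 9)).astype(np.float32)

# Flatten into a single row, the per-sample layout an SVM trainer expects.
sample = hog.reshape(1, -1)  # shape (1, 10800)

# A linear decision function has one weight per item of the feature vector.
w = rng.normal(size=sample.shape[1])  # placeholder weights, not trained
b = 0.1                               # placeholder bias
score = float(sample[0] @ w + b)
label = 1 if score >= 0 else -1

print(sample.shape, label)
```

The hyperplane here lives in a 10800-dimensional space, regardless of the fact that the histogram had 9 bins per cell.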
