long feature vector size in Neural Network

I am trying to design a neural network whose output feature vector has the same size as the input vector. In essence, I have an image (my input) and I wish to perform a regression task on each of the pixels (i.e., my output is a prediction of how I should act on each pixel).
However, my experience with ML (I'm a newbie) suggests that the output vector is usually small compared to the input vector. Is there a reason why I must design my network in a similar manner? Are there any pitfalls in having an output feature vector as long as the input vector?

You can safely make the output of the network as big as the input. Look, for example, at U-Net for semantic segmentation: there is one output for each pixel, which represents the category (class) of that pixel.
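As a minimal sketch of a layer whose output has one value per input pixel, the following applies a single 3x3 filter with zero padding, so the output keeps the input's height and width. The image size, the random kernel (standing in for learned weights), and the function name `conv3x3_same` are assumptions for illustration; this is not U-Net itself, only the size-preserving building block such networks rely on.

```python
import numpy as np

def conv3x3_same(image, kernel):
    """Apply a 3x3 filter (cross-correlation) with zero padding.

    The padding adds one pixel of zeros on every side, so the output
    has exactly the same height and width as the input.
    """
    h, w = image.shape
    padded = np.pad(image, 1)  # default mode is constant zeros
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # hypothetical single-channel input
kernel = rng.standard_normal((3, 3))  # stands in for learned weights
prediction = conv3x3_same(image, kernel)
print(prediction.shape)  # (28, 28): one regression value per pixel
```

Stacking several such layers (with non-linearities in between) still yields one prediction per pixel, which is what a per-pixel regression needs.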


Different branches do not contribute equally in a deep neural network

Assume we have a neural network that takes two inputs. The first input is the location and size of an object, and the second is an image of the object. The location and size go through an MLP that maps the 4-dimensional input to a 512-dimensional vector, and the image goes through ResNet34, which gives us a 512-dimensional vector that describes the appearance of the object. The position vector and the appearance vector are then summed to obtain a single vector, which goes through the rest of the network.
After training the network, I got poor accuracy. I analyzed what happens in the network and realized that the position vector is not treated the same way as the appearance vector: the appearance branch carries more weight in the calculations.
I want my appearance and position features to have the same impact. How can I achieve this?
Instead of summing your image vector with the vector from the additional features, I would suggest concatenating them, so you end up with a 1024-dimensional vector. Then the layers that come after this concatenation can determine the relative impact of the features through your loss function.
This should allow the model to rely more heavily on whatever features result in the lowest loss.
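One way to see why concatenation is more flexible is sketched below with NumPy: a linear layer applied to the sum of the two vectors is the same as a linear layer applied to the concatenation with both weight blocks forced to be equal. The random vectors stand in for the MLP and ResNet34 outputs, and the 8-output layer `W` is a hypothetical next layer, not part of the original network.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512
position = rng.standard_normal(dim)    # stands in for the MLP output
appearance = rng.standard_normal(dim)  # stands in for the ResNet34 output

# Fusion by summation: the next layer sees one 512-d vector.
W = rng.standard_normal((8, dim))      # hypothetical next linear layer
out_sum = W @ (position + appearance)

# Fusion by concatenation: the next layer sees a 1024-d vector.
fused = np.concatenate([position, appearance])
W_tied = np.concatenate([W, W], axis=1)  # same weights for both halves
out_concat_tied = W_tied @ fused

# Summation equals concatenation with the two weight blocks tied together.
same = np.allclose(out_sum, out_concat_tied)
print(fused.shape, same)
```

With concatenation the two weight blocks no longer have to be equal, so training can scale the position block and the appearance block independently.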

Why can't you use a 3D volume input for LSTM?

I recently read the paper CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning. It specifies that the 3D volume output from a CNN layer must be reduced to a 2-dimensional sequence before entering the LSTM layer. Why is that? What's wrong with using the 3-dimensional format?
The standard LSTM neural network assumes input of the following size:
[batch size] × [sequence length] × [feature dim]
The LSTM first multiplies each vector of size [feature dim] by a matrix, and then combines the results in a fancy way. What's important here is that there is one vector per example (the batch dimension) and per timestep (the sequence length dimension). In a sense, this vector is first transformed by matrix multiplications (possibly involving some pointwise non-linearities, which don't change the shape, so I don't mention them) into a hidden state update, which is also a vector, and the updated hidden state vector is then used to produce the output (also a vector).
As you can see, the LSTM is designed to operate on vectors. You could design a Matrix-LSTM, an LSTM counterpart that assumes any or all of the input, the hidden state, and the output are matrices. That would require replacing the matrix-vector multiplications that process the input (or the state) with a generalized linear operation able to turn any matrix into any other, which would be given by a rank-4 tensor, I believe. However, that would be equivalent to reshaping the input matrix into a vector, reshaping the rank-4 tensor into a matrix, computing a matrix-vector product, and then reshaping the output back into a matrix, so it makes little sense to devise such Matrix-LSTMs instead of just reshaping your inputs.
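The equivalence claimed above can be checked numerically. The sketch below, with arbitrary small shapes chosen only for illustration, applies a rank-4 tensor to a matrix directly and then does the same computation through reshapes and an ordinary matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4   # input matrix shape (arbitrary, for illustration)
p, q = 2, 5   # output matrix shape (arbitrary, for illustration)
X = rng.standard_normal((m, n))
T = rng.standard_normal((p, q, m, n))  # rank-4 tensor: general linear map

# General linear map from matrices to matrices:
# Y[i, j] = sum over k, l of T[i, j, k, l] * X[k, l]
Y_tensor = np.einsum("ijkl,kl->ij", T, X)

# Same computation via reshapes and a plain matrix-vector product.
Y_matrix = (T.reshape(p * q, m * n) @ X.reshape(m * n)).reshape(p, q)

equivalent = np.allclose(Y_tensor, Y_matrix)
print(equivalent)
```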
That said, it might still make sense to design a generalized LSTM that takes something other than a vector as input if you know something about the input structure that suggests a more specific linear operator than a general rank-4 tensor. For example, images are known to have local structure (nearby pixels are more related than those far apart), hence using convolutions is more "reasonable" than reshaping images to vectors and then performing a general matrix multiplication. In a similar fashion, you could replace all the matrix-vector multiplications in the LSTM with convolutions, which would allow for image-like inputs, states, and outputs.
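For the plain reshaping route, here is a minimal sketch of turning a CNN output volume into the [batch size] × [sequence length] × [feature dim] layout a standard LSTM expects. The axis order (batch, channels, time, frequency) and all sizes are assumptions for illustration and are not taken from the CountNet paper.

```python
import numpy as np

# Hypothetical sizes; per example the CNN output is a 3D volume.
batch, channels, time_steps, freq_bins = 2, 64, 50, 10
rng = np.random.default_rng(0)
cnn_out = rng.standard_normal((batch, channels, time_steps, freq_bins))

# Move the time axis next to the batch axis, then merge channels and
# frequency bins into a single feature axis.
seq = cnn_out.transpose(0, 2, 1, 3).reshape(
    batch, time_steps, channels * freq_bins
)
print(seq.shape)  # (2, 50, 640)
```

Each timestep now carries one feature vector of length channels * freq_bins, which is the per-timestep vector the LSTM operates on.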

Understanding Faster R-CNN

I'm trying to understand Fast(er) R-CNN, and these are the questions I'm looking for answers to:
1.- To train a Fast R-CNN model, do we have to give bounding box information in the training phase?
2.- If you have to give bounding box information, then what is the role of the ROI layer?
3.- Can we take a pre-trained model that was only trained for classification, not object detection, and use it for Fast(er) R-CNN?
Your answers:
1.- Yes.
2.- The ROI layer is used to produce a fixed-size vector from variable-sized regions of the image. This is performed by max-pooling, but instead of using the typical n by n cells, the region is divided into n by n non-overlapping bins (which vary in size) and the maximum value in each bin is output. The ROI layer also does the job of projecting the bounding box from input space to feature space.
3.- Yes, and that is how Faster R-CNN is normally used: the backbone is initialized from a network pretrained for classification (typically on ImageNet) rather than trained from scratch. This might be a bit hidden in the paper, but the authors do mention that they use features from a pretrained network (VGG, ResNet, Inception, etc.).
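The pooling described in answer 2 can be sketched in a few lines of NumPy. This is a simplified illustration on a single-channel feature map, assuming the region is already given in feature-map coordinates (the projection from input space is not shown) and is at least n cells high and wide; the function name `roi_max_pool` and the integer bin-edge rule are choices made here, not the exact procedure of any particular implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, n):
    """Max-pool one region of a 2D feature map into an n x n grid.

    roi = (top, left, bottom, right) in feature-map coordinates,
    end-exclusive. Assumes the region is at least n x n cells.
    """
    top, left, bottom, right = roi
    region = feature_map[top:bottom, left:right]
    h, w = region.shape
    # Integer bin edges: n non-overlapping bins whose sizes may differ.
    row_edges = [(i * h) // n for i in range(n + 1)]
    col_edges = [(j * w) // n for j in range(n + 1)]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = region[row_edges[i]:row_edges[i + 1],
                               col_edges[j]:col_edges[j + 1]].max()
    return out

feature_map = np.arange(48).reshape(6, 8)  # toy 6 x 8 feature map
pooled_a = roi_max_pool(feature_map, (0, 0, 5, 7), n=2)  # 5 x 7 region
pooled_b = roi_max_pool(feature_map, (1, 2, 4, 6), n=2)  # 3 x 4 region
print(pooled_a.shape, pooled_b.shape)  # both (2, 2)
```

Regions of different sizes end up with the same output shape, which is what lets the layers after the ROI layer have a fixed input size.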

In a feedforward neural network, am I able to put in a feature input of "don't care"?

I've created a feedforward neural network using DL4J in Java.
Hypothetically and to keep things simple, assume this neural network is a binary classifier of squares and circles.
The input, a feature vector, would be composed of, say, 5 different variables: area, height, width, number_of_corners, and number_of_edges.
Now so far, my binary classifier can tell the two shapes apart quite well as I'm giving it a complete feature vector.
My question: is it possible to input only maybe 2 or 3 of these features? Or even 1? I understand results will be less accurate while doing so, I just need to be able to do so.
If it is possible, how?
How would I do it for a neural network with 213 different features in the input vector?
Let's assume, for example, that you know the area, height, and width features (so you don't know the number_of_corners and number_of_edges features).
If you know that a shape can have, say, a maximum of 10 corners and 10 edges, you could input 10 feature vectors with the same area, height and width but where each vector has a different value for the number_of_corners and number_of_edges features. Then you can just average over the 10 outputs of the network and round to the nearest integer (so that you still get a binary value).
Similarly, if you only know the area feature you could average over the outputs of the network given several random combinations of input values, where the only fixed value is the area and all the others vary. (I.e. the area feature is the same for each vector but every other feature has a random value.)
This may be a "trick" but I think that the average will converge to a value as you increase the number of (almost-)random vectors.
My solution would not be a good choice if you have a lot of features. In that case you could try using a Deep Belief Network or an autoencoder to infer the values of the other features given a small number of them. For example, a DBN can "reconstruct" a noisy input (if you train it enough, of course); you could then give the reconstructed input vector to your feed-forward network.
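The averaging trick for the squares-and-circles example can be sketched as follows. The `classify` function is a hypothetical stand-in for the trained network (it is not DL4J code), and pairing number_of_corners with number_of_edges as the same value k in each of the 10 vectors is an assumption made to keep the example to 10 inputs.

```python
from statistics import mean

def classify(area, height, width, number_of_corners, number_of_edges):
    """Hypothetical stand-in for the trained network: returns P(square)."""
    fill_ratio = area / (height * width)  # 1.0 for a square, ~0.785 for a circle
    shape_score = 1.0 if fill_ratio > 0.9 else 0.0
    corner_score = 1.0 if (number_of_corners, number_of_edges) == (4, 4) else 0.0
    return 0.5 * shape_score + 0.5 * corner_score

def predict_with_unknown_corners(area, height, width, max_value=10):
    """Average the outputs over candidate values of the unknown features."""
    outputs = [classify(area, height, width, k, k)
               for k in range(1, max_value + 1)]
    return round(mean(outputs))  # back to a binary label: 1 = square

square_label = predict_with_unknown_corners(area=4.0, height=2.0, width=2.0)
circle_label = predict_with_unknown_corners(area=3.14, height=2.0, width=2.0)
print(square_label, circle_label)
```

The known features stay fixed in every call and only the unknown ones vary, so the averaged output reflects what the network can infer from the known features alone.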

How to enable a Convolutional NN to take variable size input?

So, I've seen that many of the first CNN examples in machine learning use the MNIST dataset. Each image there is 28x28, so we know the shape of the input beforehand. How would this be done for variable-size input, say when some images are 56x56 and some are 28x28?
I'm looking for a language- and framework-agnostic answer if possible, or preferably one in TensorFlow terms.
In some cases, resizing the images appropriately (for example, keeping the aspect ratio) will be sufficient. But this can introduce distortion, and if that is harmful, another solution is to use Spatial Pyramid Pooling (SPP). The problem with different image sizes is that they produce layers of different sizes. For example, taking the features of the n-th layer of some network, you can end up with a feature map of size 128*fw*fh, where fw and fh vary depending on the size of the input example. What SPP does to alleviate this problem is turn this variable-size feature map into a fixed-length vector of features. It operates on different scales, dividing the feature map into equal patches and performing max-pooling on them. I think this paper does a great job at explaining it. An example application can be seen here.
As a quick explanation, imagine you have a feature map of size k*fw*fh. You can consider it as k maps, each split into a 2 by 2 grid of blocks, where each block is of size fw/2*fh/2. Performing max-pooling on each of those blocks separately gives you a vector of size 4 per map, and therefore you can coarsely describe the k*fw*fh map as a fixed-size vector of k*4 features.
Now, call this fixed-size vector w and set it aside. This time, consider the k*fw*fh feature map as k feature planes, each split into a 4 by 4 grid of blocks, and again perform max-pooling separately on each block. This gives you a more fine-grained representation, a vector v of length k*16.
Concatenating the two vectors, u=[v;w], gives you a fixed-size representation. This is exactly what a 2-scale SPP does (of course, you can change the number and sizes of the divisions).
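The 2-scale pooling described above can be sketched with NumPy as follows. The helper names `grid_max_pool` and `spp_two_scale`, the integer bin-edge rule, and the example sizes (k=128 with 7x7 and 14x14 feature maps) are assumptions for illustration; the sketch also assumes each feature map is at least 4 cells high and wide.

```python
import numpy as np

def grid_max_pool(feature_map, n):
    """Max-pool a (k, fh, fw) feature map over an n x n grid of blocks.

    Returns a vector of length k * n * n, whatever fh and fw are
    (as long as both are at least n).
    """
    k, fh, fw = feature_map.shape
    rows = [(i * fh) // n for i in range(n + 1)]
    cols = [(j * fw) // n for j in range(n + 1)]
    pooled = [
        feature_map[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1]].max(axis=(1, 2))
        for i in range(n) for j in range(n)
    ]
    return np.stack(pooled, axis=1).reshape(-1)

def spp_two_scale(feature_map):
    w = grid_max_pool(feature_map, 2)  # coarse scale, length k * 4
    v = grid_max_pool(feature_map, 4)  # fine scale, length k * 16
    return np.concatenate([v, w])      # u = [v; w], length k * 20

rng = np.random.default_rng(0)
u_small = spp_two_scale(rng.random((128, 7, 7)))    # e.g. from a smaller image
u_large = spp_two_scale(rng.random((128, 14, 14)))  # e.g. from a larger image
print(u_small.shape, u_large.shape)  # both (2560,)
```

Both feature maps produce a vector of the same length, so the same fully connected classifier can follow regardless of the input image size.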
Hope this helps.
When you use a CNN for a classification task, your network has two parts:
Feature generator. This part generates a feature map of size WF x HF with CF channels from an image of size WI x HI with CI channels. The relation between the image size and the feature map size depends on the structure of your NN (for example, on the number of pooling layers and their strides).
Classifier. This part classifies vectors with WF*HF*CF components into classes.
You can feed images of different sizes into the feature generator and get feature maps of different sizes. But the classifier can only be trained on vectors of a fixed length. Therefore, you normally train your network for one fixed image size. If you have images of a different size, you resize them to the input size of the network, or crop part of the image.
Another way is described in the article
K. He, X. Zhang, S. Ren, J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," arXiv:1406.4729, 2014
The authors propose spatial pyramid pooling, which solves the problem of feeding images of different sizes into a CNN. But I am not sure whether a spatial pyramid pooling layer exists in TensorFlow.
