Are convolutional layers necessary for Deep Q Networks? - machine-learning

I'm currently trying to build a deep Q network to play the classic Snake game. I designed the game in such a way that the state space is confined to a 20 x 20 matrix, with 1's representing a square occupied by a body, 2 representing a square occupied by the head, and 5 representing a square occupied by food. Given the fact that the space is relatively small, would it be feasible to have the network input be a 400 dimensional vector instead of a raw image?

You could try a 400-dimensional vector and look at the performance of the agent(learning). If it doesn't improve, then you should try a CNN. But in my opinion, a 400-dimensional vector should work.
Also, make sure that you normalise the state inputs to [0,1] or [-1,1] before feeding it to the neural network.

Related

Different branches do not contribute equally in a deep neural network

Assume we have a neural network that gets two inputs. the first input is location and size of a object and the second one is an image of the object. the location and size go through an MLP that map 4 dimensional input to 512 dimensional vector and the image go through ResNet34 which gives us a 512 dimensional vector that describes appearance of the object. After obtaining them position vector and appearance vector are summed to obtain a singular vector. Then the vector goes through the rest of the network.
After training the network, I gained a bad accuracy. I analyzed what happens in the network, and I realized that position vector is not treated similarly as appearance vector and appearance branch has more weight in calculations.
I want my appearance and position features have the same impact. How should I achieve this?
Instead of summing your image vector with the vector from the additional features, I would suggest concatenating them - so you end up with a 1024 dimensional vector. Then the layers that come after this concatenation can determine the relative impact of the features through your loss function.
This should allow the model to rely more heavily on whatever features result in the lowest loss.

In a feedforward neural network, am I able to put in a feature input of "don't care"?

I've created a feedforward neural network using DL4J in Java.
Hypothetically and to keep things simple, assume this neural network is a binary classifier of squares and circles.
The input, a feature vector, would be composed of say... 5 different variables:
[number_of_corners,
number_of_edges,
area,
height,
width]
Now so far, my binary classifier can tell the two shapes apart quite well as I'm giving it a complete feature vector.
My question: is it possible to input only maybe 2 or 3 of these features? Or even 1? I understand results will be less accurate while doing so, I just need to be able to do so.
If it is possible, how?
How would I do it for a neural network with 213 different features in the input vector?
Let's assume, for example, that you know the area, height, and width features (so you don't know the number_of_corners and number_of_edges features).
If you know that a shape can have, say, a maximum of 10 corners and 10 edges, you could input 10 feature vectors with the same area, height and width but where each vector has a different value for the number_of_corners and number_of_edges features. Then you can just average over the 10 outputs of the network and round to the nearest integer (so that you still get a binary value).
Similarly, if you only know the area feature you could average over the outputs of the network given several random combinations of input values, where the only fixed value is the area and all the others vary. (I.e. the area feature is the same for each vector but every other feature has a random value.)
This may be a "trick" but I think that the average will converge to a value as you increase the number of (almost-)random vectors.
Edit
My solution would not be a good choice if you have a lot of features. In this case you could try to use maybe a Deep Belief Network or some autoencoder to infer the values of the other features given a small number of them. For example, a DBN can "reconstruct" a noisy output (if you train it enough, of course); you could then try to give the reconstructed input vector to your feed-forward network.

spectral clustering eigenvectors and eigenvalues

What do the eigenvalues and eigenvectors in spectral clustering physically mean. I see that if λ_0 = λ_1 = 0 then we will have 2 connected components. But, what does λ_2,...,λ_k tell us. I don't understand the algebraic connectivity by multiplicity.
Can we draw any conclusions about the tightness of the graph or in comparison to two graphs?
The smaller the eigenvalue, the less connected. 0 just means "disconnected".
Consider this a value of what share of edges you need to cut to produce separate components. The cut is orthogonal to the eigenvector - there is supposedly some threshold t, such that nodes below t should go into one component, above t to the other.
That depends somewhat on the algorithm. For several of the spectral algorithms, the eigenstuff can be easily run through Principal Component Analysis to reduce the display dimensionality for human consumption. Power iteration clustering vectors are more difficult to interpret.
As Mr.Roboto already noted, the eigenvector is normal to the division brane (a plane after a Gaussian kernel transformation). Spectral clustering methods are generally not sensitive to density (is that what you mean by "tightness"?) per se -- they find data gaps. For instance, it doesn't matter whether you have 50 or 500 nodes within a unit sphere forming your first cluster; the game changer is whether there's clear space (a nice gap) instead of a thin trail of "bread crumb" points (a sequence of tiny gaps) leading to another cluster.

Why Gaussian radial basis function maps the examples into an infinite-dimensional space?

I've just run through the Wikipedia page about SVMs, and this line caught my eyes:
"If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimensions." http://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_classification
In my understanding, if I apply Gaussian kernel in SVM, the resulting feature space will be m-dimensional (where m is the number of training samples), as you choose your landmarks to be your training examples, and you're measuring the "similarity" between a specific example and all the examples with the Gaussian kernel. As a consequence, for a single example you'll have as many similarity values as training examples. These are going to be the new feature vectors which are going to m-dimensional vectors, and not infinite dimensionals.
Could somebody explain to me what do I miss?
Thanks,
Daniel
The dual formulation of the linear SVM depends only on scalar products of all training vectors. Scalar product essentially measures similarity of two vectors. We can then generalize it by replacing with any other "well-behaved" (it should be positive-definite, it's needed to preserve convexity, as well as enables Mercer's theorem) similarity measure. And RBF is just one of them.
If you take a look at the formula here you'll see that RBF is basically a scalar product in a certain infinitely dimensional space
Thus RBF is kind of a union of polynomial kernels of all possible degrees.
The other answers are correct but don't really tell the right story here. Importantly, you are correct. If you have m distinct training points then the gaussian radial basis kernel makes the SVM operate in an m dimensional space. We say that the radial basis kernel maps to a space of infinite dimension because you can make m as large as you want and the space it operates in keeps growing without bound.
However, other kernels, like the polynomial kernel do not have this property of the dimensionality scaling with the number of training samples. For example, if you have 1000 2D training samples and you use a polynomial kernel of <x,y>^2 then the SVM will operate in a 3 dimensional space, not a 1000 dimensional space.
The short answer is that this business about infinite dimensional spaces is only part of the theoretical justification, and of no practical importance. You never actually touch an infinite-dimensional space in any sense. It's part of the proof that the radial basis function works.
Basically, SVMs are proved to work the way they do by relying on properties of dot products over vector spaces. You can't just swap in the radial basis function and expect it necessarily works. To prove that it does, however, you show that the radial basis function is actually like a dot product over a different vector space, and it's as if we're doing regular SVMs in a transformed space, which works. And it happens that infinite dimensioal-ness is OK, and that the radial basis function does correspond to a dot product in such a space. So you can say SVMs still work when you use this particular kernel.

Interpreting a Self Organizing Map

I have been doing reading about Self Organizing Maps, and I understand the Algorithm(I think), however something still eludes me.
How do you interpret the trained network?
How would you then actually use it for say, a classification task(once you have done the clustering with your training data)?
All of the material I seem to find(printed and digital) focuses on the training of the Algorithm. I believe I may be missing something crucial.
Regards
SOMs are mainly a dimensionality reduction algorithm, not a classification tool. They are used for the dimensionality reduction just like PCA and similar methods (as once trained, you can check which neuron is activated by your input and use this neuron's position as the value), the only actual difference is their ability to preserve a given topology of output representation.
So what is SOM actually producing is a mapping from your input space X to the reduced space Y (the most common is a 2d lattice, making Y a 2 dimensional space). To perform actual classification you should transform your data through this mapping, and run some other, classificational model (SVM, Neural Network, Decision Tree, etc.).
In other words - SOMs are used for finding other representation of the data. Representation, which is easy for further analyzis by humans (as it is mostly 2dimensional and can be plotted), and very easy for any further classification models. This is a great method of visualizing highly dimensional data, analyzing "what is going on", how are some classes grouped geometricaly, etc.. But they should not be confused with other neural models like artificial neural networks or even growing neural gas (which is a very similar concept, yet giving a direct data clustering) as they serve a different purpose.
Of course one can use SOMs directly for the classification, but this is a modification of the original idea, which requires other data representation, and in general, it does not work that well as using some other classifier on top of it.
EDIT
There are at least few ways of visualizing the trained SOM:
one can render the SOM's neurons as points in the input space, with edges connecting the topologicaly close ones (this is possible only if the input space has small number of dimensions, like 2-3)
display data classes on the SOM's topology - if your data is labeled with some numbers {1,..k}, we can bind some k colors to them, for binary case let us consider blue and red. Next, for each data point we calculate its corresponding neuron in the SOM and add this label's color to the neuron. Once all data have been processed, we plot the SOM's neurons, each with its original position in the topology, with the color being some agregate (eg. mean) of colors assigned to it. This approach, if we use some simple topology like 2d grid, gives us a nice low-dimensional representation of data. In the following image, subimages from the third one to the end are the results of such visualization, where red color means label 1("yes" answer) andbluemeans label2` ("no" answer)
onc can also visualize the inter-neuron distances by calculating how far away are each connected neurons and plotting it on the SOM's map (second subimage in the above visualization)
one can cluster the neuron's positions with some clustering algorithm (like K-means) and visualize the clusters ids as colors (first subimage)

Resources