I want to generate a state field (e.g. a temperature field in the x-y plane) in the form of an m x n image, based on a given input (a vector or a scalar).
To keep it simple, here is a toy example:
if the input is 0, the output should be a black cat; for input = 0.53, the output is a brown cat; ...; for input = 3, it's a black dog, and so on.
It can be thought of as the inverse of classification, but I am not sure about that.
Based on my search so far, I think this is an image generation problem where GANs or Autoencoders can be used on a labeled dataset (I have images for different input vectors).
My questions:
Is this the right way to solve this problem? If so, can you recommend good examples?
If this is not the correct way to solve it, would you please share your opinion on how to solve it?
Autoencoders may be suitable. You train one by giving it the same image as both input and target output. During training you can record the latent vectors produced between the encoder and decoder modules. Afterwards you can feed these vectors to the decoder part of the autoencoder to generate an image.
Training:
Encoder --> [latent vectors] --> Decoder
Prediction:
[latent vector] --> Decoder --> output
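The encode / record / decode flow above can be sketched without a deep-learning framework. This is only a minimal linear stand-in, not a trained neural autoencoder: for a linear autoencoder with squared-error loss the best encoder and decoder span the top principal directions, so the sketch obtains them with an SVD. The dataset, sizes, and function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 50 "images" of 8x8 pixels that lie on a
# 4-dimensional subspace, so a 4-dimensional latent code can describe them.
n_images, height, width, latent_dim = 50, 8, 8, 4
basis = rng.normal(size=(latent_dim, height * width))
images = rng.normal(size=(n_images, latent_dim)) @ basis

# "Training": the optimal linear encoder/decoder use the top principal
# directions of the centred data, computed here with an SVD.
mean = images.mean(axis=0)
_, _, vt = np.linalg.svd(images - mean, full_matrices=False)
components = vt[:latent_dim]  # shape (latent_dim, height * width)

def encode(x):
    """Map flattened images to latent vectors."""
    return (x - mean) @ components.T

def decode(z):
    """Map latent vectors back to flattened images."""
    return z @ components + mean

# Record the latent vectors of the training images.
latents = encode(images)
reconstructed = decode(latents)

# "Prediction": feed any latent vector to the decoder alone, e.g. the
# midpoint between two recorded latent vectors.
new_latent = 0.5 * (latents[0] + latents[1])
new_image = decode(new_latent).reshape(height, width)
```

A nonlinear autoencoder follows the same three steps, with the two functions replaced by trained encoder and decoder networks.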
This is just an idea, I have not tried it before. Hope it helps :D
This is a typical problem for GANs; conditional GANs in particular can be used to solve it. Check the following link for more info:
https://phillipi.github.io/pix2pix/
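To make the "conditional" part concrete, here is a rough sketch of how a conditional generator receives its condition: the input value is concatenated with a noise vector before the forward pass. The weights are random and untrained, the layer sizes are arbitrary, and none of this is taken from pix2pix itself; it only illustrates the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_dim, cond_dim, hidden, m, n = 16, 1, 32, 8, 8  # arbitrary sizes

# Untrained, randomly initialised weights (illustration only).
W1 = rng.normal(scale=0.1, size=(noise_dim + cond_dim, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, m * n))

def generator(noise, condition):
    """Map (noise, condition) pairs to m x n images with values in [-1, 1]."""
    x = np.concatenate([noise, condition], axis=1)  # condition joins the input
    h = np.maximum(x @ W1, 0.0)                     # ReLU hidden layer
    return np.tanh(h @ W2).reshape(-1, m, n)

# The example inputs from the question: 0, 0.53 and 3.
conditions = np.array([[0.0], [0.53], [3.0]])
noise = rng.normal(size=(len(conditions), noise_dim))
fake_images = generator(noise, conditions)  # one image per condition
```

In a real conditional GAN the discriminator is given the same condition alongside the image, so the generator is pushed to produce images that match it.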
I think this is a comprehension issue, but I would appreciate any help.
I'm trying to learn how to use PyTorch for autoencoding. In the nn.Linear function, there are two specified parameters,
nn.Linear(input_size, hidden_size)
When reshaping a tensor to its minimum meaningful representation, as one would in autoencoding, it makes sense that the hidden_size would be smaller. However, in the PyTorch tutorial there is a line specifying identical input_size and hidden_size:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
I guess my question is, what is the purpose of having the same input and hidden size? Wouldn't this just return an identical tensor?
I suspect that this is just a requirement after calling the nn.ReLU() activation function.
As Wikipedia puts it:
An autoencoder is a type of artificial neural network used to learn
efficient codings of unlabeled data. The
encoding is validated and refined by attempting to regenerate the
input from the encoding.
In other words, the idea of an autoencoder is to learn an identity function. This identity function will be learned only for particular inputs (i.e. those without anomalies). From this, the following points derive:
1. The input has the same dimensions as the output.
2. Autoencoders are (generally) built to learn the essential features of the input.
Because of point (1), an autoencoder consists of a series of layers (e.g. a series of nn.Linear() or nn.Conv2d() layers).
Because of point (2), you generally have an Encoder which compresses the information (as in your code snippet, which goes from 28x28 down to 10) and a Decoder that decompresses it (10 -> 28x28). Generally the latent space dimensionality (10) is much smaller than the input (28x28) across implementations of this architecture. Now that the end goal of the Encoder is clear, you can see that the compression passes through intermediate representations along the way (such as the 512 features produced by nn.Linear(28*28, 512)), which are discarded once the series of layers produces the final output (10).
Note that because the model in your question includes a nonlinearity after the linear layer, the model will not learn an identity transform between the input and output. In the specific case of the relu nonlinearity, the model could learn an identity transform if all of the input values were positive, but in general this won't be the case.
I find it a little easier to imagine the issue if we had an even smaller model consisting of Linear --> Sigmoid --> Linear. In such a case, the input will be mapped through the first matrix transform and then "squashed" into the space [0, 1] as the "hidden" layer representation. The next ("output") layer would need to take this squashed view of the input and come up with some way of "unsquashing" it back into the original. But with an affine output layer, it's not possible to do this, so the model will have to learn some other, non-identity, transforms for the two matrices.
There are some neat visualizations of this concept on Chris Olah's blog that are well worth a look.
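The point about ReLU can be checked numerically. The sketch below uses NumPy rather than PyTorch and a tiny layer of size 4 instead of 512, purely to keep it self-contained; the function name is made up. It shows that even with an identity weight matrix, a square linear layer followed by ReLU only reproduces inputs that are non-negative, and that with ordinary (random) weights the output differs from the input.

```python
import numpy as np

def linear_relu(x, weight, bias):
    """A square linear layer followed by ReLU, as in Linear(n, n) -> ReLU."""
    return np.maximum(weight @ x + bias, 0.0)

size = 4
identity_weight = np.eye(size)
zero_bias = np.zeros(size)

positive_input = np.array([0.5, 1.0, 2.0, 3.0])
mixed_input = np.array([-1.0, 0.5, -2.0, 3.0])

# With identity weights, non-negative inputs pass through unchanged...
out_positive = linear_relu(positive_input, identity_weight, zero_bias)
# ...but negative entries are clipped to zero, so this is not an identity map.
out_mixed = linear_relu(mixed_input, identity_weight, zero_bias)

# With random weights the layer mixes the features, so the output is a new
# representation even though its size matches the input size.
rng = np.random.default_rng(0)
random_weight = rng.normal(size=(size, size))
out_random = linear_relu(positive_input, random_weight, zero_bias)
```

So equal input and output sizes only fix the shape of the tensor; the values are a learned transformation of the input.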
After a few tries, I had trained a GAN to produce semi-sensible output. In this model, it almost instantly found a solution and got stuck there. The loss for both the discriminator and the generator was 0.68 (I used a BCE loss), and the accuracies for both went to around 50%. At first glance the output of the generator looked good enough to be real data, but after analysing it I could see it was still not very good.
My solution here was to increase the power of the discriminator (increased the size of it) and re-train. I hoped by making it larger it would force the generator to create better samples. I got the following output.
It seems that as the GAN loss increases and the generator produces worse samples, the discriminator can pick them out more easily.
When I check my output from the trained generator I see it follows some basic rules the real data is following, but again under closer scrutiny, they fail more complex tests the real data would pass. I would like to improve this.
My questions are:
Is my above interpretation of the plots correct?
For this run, have I made the discriminator too powerful? Should I increase the power of the generator?
Is there another technique I should investigate to stop this form of mode collapse?
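As a reference point for reading the loss values: the reported 0.68 is close to ln 2 (about 0.693), which is the BCE value obtained when the discriminator outputs 0.5 for every sample, i.e. when it cannot tell real from generated data. A quick check, assuming the discriminator loss is the average of its loss on real and on fake samples (conventions vary between implementations):

```python
import math

def bce(prediction, target):
    """Binary cross-entropy for a single prediction in (0, 1)."""
    return -(target * math.log(prediction)
             + (1 - target) * math.log(1 - prediction))

d_out = 0.5  # a discriminator that is maximally unsure about every sample

loss_on_real = bce(d_out, 1.0)  # real samples are labelled 1
loss_on_fake = bce(d_out, 0.0)  # generated samples are labelled 0
discriminator_loss = 0.5 * (loss_on_real + loss_on_fake)

# The generator is trained to make the discriminator output 1 on fakes.
generator_loss = bce(d_out, 1.0)

print(round(discriminator_loss, 4), round(generator_loss, 4))
```

Both losses sitting near this value together with roughly 50% accuracy is what an undecided discriminator looks like; on its own it does not say whether the generated samples are good.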
EDIT: The architecture I am using is a form of Graph GAN. The generator is just a series of linear layers. The discriminator is 3 Graph Conv Layers, then some linear layers. Slightly similar to this paper. Two potentially unconventional things I am doing:
There is no batch normalisation; I have found it has a very negative effect on training, though I could try to persevere with it.
I am using StandardScaler to scale my data. This choice was made as it easily allows you to unscale data. This is useful as I can take the output of the generator and easily transform it into an original scale. However, StandardScaler does not scale things between 1 and -1, so I cannot use tanh as the final activation function of my generator, instead, the final layer of the generator is just Linear.
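The scaling point can be illustrated with plain NumPy, which here stands in for what StandardScaler computes per feature (zero mean, unit population standard deviation). The data values are made up. The sketch shows that the transform is easy to invert and that the scaled values are not confined to [-1, 1].

```python
import numpy as np

# Made-up single feature with one large value.
data = np.array([[0.0], [0.0], [0.0], [0.0], [10.0]])

mean = data.mean(axis=0)  # 2.0
std = data.std(axis=0)    # 4.0 (population standard deviation)

scaled = (data - mean) / std     # four values of -0.5 and one of 2.0
restored = scaled * std + mean   # inverse transform back to original scale

print(scaled.ravel())
print(restored.ravel())
```

Because a scaled value of 2.0 lies outside the range of tanh, a linear final layer (or a different scaler) is needed if the generator has to be able to reach such values.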
The outputs of the GAN (once rescaled and the shape has been changed) are similar to:
[[ 46.09169 -25.462175 20.705683 -31.696495 ]
[ 35.10637 -18.956036 15.20579 -24.803787 ]
[ 10.253135 -5.759581 5.9068713 -6.3003526]]
An example of the truth is:
[[ 45.6 30.294546 -17.218746 -29.41284 ]
[ 1.8186008 1.7064333 0.5984112 0.19312467]
[ 44.31433 28.234058 -17.615921 -29.262213 ]]
Notably, the top-left value in the matrix will always be 45.6. My Generator does not even consistently produce this.
I have a CNN which inputs a satellite image and should output a mask where it finds cars. I have manually labelled images and created masks for each image where each pixel is 1 if there is part of a car in that pixel, 0 otherwise.
I am trying to work out the best output layer activation function and loss function, and I'm fishing for opinions. I know there is a wealth of information out there but I find myself getting confused about whether my problem is regression or classification.
Could someone please offer their opinion? I am currently using the following output and loss in keras:
conv10 = Conv2D(1, 1, activation='sigmoid')(conv9)
model = Model(inputs=[inputs], outputs=[conv10])
model.compile(optimizer=Adam(learning_rate=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
Is this a good idea? Thanks!
This seems like a good idea from my point of view, because you want to output a probability P(px is part of a car | image) for each pixel px in image. Therefore, that's a binary classification problem, for which using the binary_crossentropy loss function (plus a sigmoid activation in the output layer) is appropriate.
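To show what the loss measures in this setting, here is a small NumPy sketch of binary cross-entropy averaged over the pixels of a mask. The 2x3 mask and the predicted probabilities are made up, and the function is a plain re-implementation for illustration, not the Keras internals.

```python
import numpy as np

def pixelwise_bce(mask, prob, eps=1e-7):
    """Mean binary cross-entropy over all pixels of a mask."""
    prob = np.clip(prob, eps, 1 - eps)  # avoid log(0)
    losses = -(mask * np.log(prob) + (1 - mask) * np.log(1 - prob))
    return float(np.mean(losses))

# Made-up ground truth: 1 where a pixel belongs to a car, 0 otherwise.
mask = np.array([[0, 0, 1],
                 [0, 1, 1]], dtype=float)

# Made-up sigmoid outputs: fairly confident and correct on every pixel.
prob = np.where(mask == 1, 0.9, 0.1)

loss = pixelwise_bce(mask, prob)

# A hard mask is obtained by thresholding the probabilities at 0.5.
predicted_mask = (prob > 0.5).astype(int)
```

Each pixel is treated as its own binary classification, which is why the sigmoid output and binary cross-entropy pair up naturally here.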
I have a neural network with an input layer of 10 nodes, some hidden layers, and an output layer with only 1 node. I put a pattern into the input layer and, after some processing, the output neuron produces a value, which is a number from 1 to 10. After training, this model is able to produce the output given the input pattern.
Now, my question is whether it is possible to calculate the inverse model: that is, I provide a number on the output side (i.e. using the output side as input) and get a random pattern from those 10 input neurons (i.e. using the input side as output).
I want to do this because I will first train a network on the difficulty of patterns (the input is the pattern and the output is how difficult the pattern is to understand). Then I want to feed the network a number so that it creates random patterns of that difficulty.
I hope I understood your problem correctly, so I will summarize it in my own words: you have a given model and want to determine an input which yields a given output.
Supposing this is correct, there is at least one way I know of to do this approximately. It is very easy to implement, but might take a while to compute a value; there are probably better ways, but I am not sure. (I needed this technique some weeks ago for reinforcement learning and did not find anything better.) Let's assume that your model M maps an input x in R^m to an output y. We now create a new model, which we will call M'. This model will later compute an inverse of M, so that it gives you an input which yields a specific output. To construct M', we create one plain Dense layer whose output has the same dimension m as the input of M, and connect it to the input of M. Next, make all weights of M non-trainable (this is very important!).
Now we are set up to find an inverse value. Assume you want to find an input corresponding to the output y (corresponding here means that it produces the output y; it is not unique). Create a new input v = 1, the unit of R, and form an input-output data pair (v, y). Now use any optimizer you wish to train M' on this pair until the error converges to zero. Once this has happened, you can compute the real input which gives the output y as follows: if the weights of the new input layer are called w and its bias is b, the desired input u is u = w*1 + b (where 1 is the unit input v).
You might be asking why this equation holds, so let me try to answer: your model will try to learn the weights of the new input layer so that the unit input produces the given output. As only the newly added input layer is trainable, only these weights will change. Therefore, each weight in this vector will represent the corresponding component of the desired input vector. By using an optimizer and minimizing the l^2 distance between the wanted output and the output of our inverse model M', we finally determine a set of weights which gives a good approximation of the input vector.
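A minimal sketch of this procedure, using NumPy and a hand-written gradient instead of a deep-learning framework. The frozen network here is a small random tanh network standing in for the trained model, and all sizes, names, and the learning rate are assumptions chosen for illustration. Only the new input layer (w, b) is updated; the frozen weights are never touched.

```python
import numpy as np

rng = np.random.default_rng(0)
m, hidden = 10, 16  # 10 inputs as in the question; hidden size is arbitrary

# Frozen stand-in for the trained model (hypothetical random weights).
W1 = rng.normal(scale=0.5, size=(hidden, m))
b1 = rng.normal(scale=0.1, size=hidden)
W2 = rng.normal(scale=0.5, size=hidden)
b2 = 0.0

def model(u):
    """Frozen model: returns the scalar output and the hidden activations."""
    h = np.tanh(W1 @ u + b1)
    return W2 @ h + b2, h

# Pick a target the model can actually produce.
y_target, _ = model(rng.normal(size=m))

# New trainable input layer: maps the unit input v = 1 to a candidate input u.
v = 1.0
w = np.zeros(m)
b = rng.normal(scale=0.1, size=m)
learning_rate = 0.005

for _ in range(5000):
    u = w * v + b
    y_hat, h = model(u)
    error = y_hat - y_target
    # Gradient of the squared error with respect to u (chain rule through tanh).
    grad_u = 2.0 * error * (W1.T @ ((1.0 - h ** 2) * W2))
    # Only the new layer is updated; W1, b1, W2, b2 stay fixed.
    w -= learning_rate * grad_u * v
    b -= learning_rate * grad_u

u = w * v + b  # the recovered input
final_error = float((model(u)[0] - y_target) ** 2)
```

Starting from different random values of b generally leads to different recovered inputs, which matches the remark that the inverse is not unique.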
I am currently working on a project where I have to extract the facial expression of a user (only one user at a time from a webcam) like sad or happy.
My method for classifying facial expressions is:
Use opencv to detect the face in the image
Use ASM and stasm to get the facial feature points
Now I'm trying to do facial expression classification.
Is SVM a good option? And if it is, how can I start with SVM:
how should I train an SVM for each emotion using these landmarks?
Yes, SVMs have repeatedly been shown to perform well in this task. There have been dozens (if not hundreds) of papers describing such procedures.
For example:
Simple paper
Longer paper
Poster about it
More complex example
Some basic sources of the SVMs themselves can be obtained on http://www.support-vector-machines.org/ (like books titles, software links etc.)
And if you are just interested in using them rather than understanding them, you can get one of the basic libraries:
libsvm http://www.csie.ntu.edu.tw/~cjlin/libsvm/
svmlight http://svmlight.joachims.org/
If you are already using OpenCV, I suggest you use the built-in SVM implementation. Training, saving, and loading in Python are as follows. C++ has a corresponding API that does the same in about the same amount of code. It also has 'train_auto' to find the best parameters.
# Written against the OpenCV 2.4 Python bindings (cv2.SVM);
# OpenCV 3+ moved the SVM class to the cv2.ml module.
import numpy as np
import cv2

# 4 random samples with 5 features each, and a random binary label per sample
samples = np.array(np.random.random((4,5)), dtype = np.float32)
labels = np.array(np.random.randint(0,2,4), dtype = np.float32)

# Linear C-SVC classifier
svm = cv2.SVM()
svmparams = dict( kernel_type = cv2.SVM_LINEAR,
                  svm_type = cv2.SVM_C_SVC,
                  C = 1 )
svm.train(samples, labels, params = svmparams)

# Predict on the training samples, one row at a time
testresult = np.float32( [svm.predict(s) for s in samples])
print(samples)
print(labels)
print(testresult)

# Save the trained model and load it back
svm.save('model.xml')
loaded=svm.load('model.xml')
and the output:
#print samples
[[ 0.24686454 0.07454421 0.90043277 0.37529686 0.34437731]
[ 0.41088378 0.79261768 0.46119651 0.50203663 0.64999193]
[ 0.11879266 0.6869216 0.4808321 0.6477254 0.16334397]
[ 0.02145131 0.51843268 0.74307418 0.90667248 0.07163303]]
#print labels
[ 0. 1. 1. 0.]
#print testresult
[ 0. 1. 1. 0.]
So you provide the n flattened shape models as samples, along with n labels, and you are good to go. You probably don't even need the ASM part: just apply some filters which are sensitive to orientation, like Sobel or Gabor, concatenate the resulting matrices, flatten them, and feed them directly to the SVM. You can probably get maybe 70-90% accuracy.
As someone said, CNNs are an alternative to SVMs. Here are some links that implement LeNet-5. So far, I find SVMs much simpler to get started with.
https://github.com/lisa-lab/DeepLearningTutorials/
http://www.codeproject.com/Articles/16650/Neural-Network-for-Recognition-of-Handwritten-Digi
-edit-
Landmarks are just n (x,y) vectors, right? So why don't you try putting them into an array of size 2n and simply feeding them directly to the code above?
For example, 3 training samples of 4 landmarks (0,0),(10,10),(50,50),(70,70):
samples = [[0,0,10,10,50,50,70,70],
[0,0,10,10,50,50,70,70],
[0,0,10,10,50,50,70,70]]
labels=[0.,1.,2.]
0=happy
1=angry
2=disgust
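The flattening step can be sketched with NumPy as follows. The landmark coordinates are the toy values from the example above; the variable names and the label dictionary are made up for illustration. Each face with n landmarks becomes one row of 2n numbers, which is the sample layout the SVM code expects.

```python
import numpy as np

# 3 faces, each with 4 (x, y) landmarks: shape (3, 4, 2).
landmarks = np.array([
    [(0, 0), (10, 10), (50, 50), (70, 70)],
    [(0, 0), (10, 10), (50, 50), (70, 70)],
    [(0, 0), (10, 10), (50, 50), (70, 70)],
], dtype=np.float32)

# Flatten each face into one feature vector x1, y1, x2, y2, ...: shape (3, 8).
samples = landmarks.reshape(len(landmarks), -1)

# One label per face, using the coding from the example above.
labels = np.array([0, 1, 2], dtype=np.float32)
label_names = {0: "happy", 1: "angry", 2: "disgust"}

print(samples.shape)
print(samples[0])
```

In practice the rows would differ between expressions; identical rows with different labels, as in the toy example, give a classifier nothing to separate.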
You could check this code to get an idea of how this could be done using SVM.
You can find the algorithm explained here