Random noise for GAN - machine-learning

I am new to GANs. I am learning to build a GAN that generates images, but I don't really understand what exactly the random noise given to the generator is. Is it random numbers from 0 to 1, and what should its size be? Also, should the random noise stay constant every time the generator runs?
Any help would be appreciated.

The random noise is a latent feature vector, unique for every generated image.
Consider a noise vector of size 128 and, for now, focus only on its first entry. Suppose it controls the length of the hair on the head. From the training images, the model has learned that a value of 0 means bald and a value of 1 means long hair, so picking a random number between 0 and 1 decides the amount of hair, and the model can generate people with different hair lengths.
In the same way, each of the 128 entries of the noise vector controls one factor of the generated face. That is why choosing a new random noise vector each time generates a new person, while feeding the same noise vector every time makes the model generate the same image.
I hope this helps you understand how a GAN works.
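As a toy illustration of the "one entry = one factor" idea, here is a minimal sketch in plain NumPy; the trained generator is hypothetical and only appears as a comment:
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=128)   # one 128-dimensional noise vector

# Vary only the first entry (the hypothetical "hair length" factor), keep the rest fixed
for hair_length in np.linspace(0.0, 1.0, 5):
    z_varied = z.copy()
    z_varied[0] = hair_length
    # image = generator.predict(z_varied[None, :])  # hypothetical trained generator
    print(z_varied[:3])   # only the first coordinate changes between iterations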

There is a Jupyter-notebook-driven tutorial on GitHub (full disclosure: it is my GitHub).
(Solutions available here.)
The noise, or rather the latent random variable, can be generated pretty much however you like, for example by drawing from a uniform distribution:
import numpy as np

# Generate the latent random variable to feed to the generator by drawing from a uniform distribution
z = np.random.uniform(-1., 1., size=[batch_size, noise_dim])
Yet it makes sense to think about the activation function in the input layer of your generator and to pay attention to its sensitive range.
The generator takes this input as a seed and decodes that latent variable into the source dataset's domain. So obviously the same random variable will lead to the exact same generated sample.
So you should keep drawing new samples while training, and not keep the noise constant.
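A minimal sketch of that idea follows; the generator call itself is hypothetical and left as a comment:
import numpy as np

batch_size, noise_dim = 64, 128   # illustrative values

for step in range(100):
    # Draw a fresh latent batch at every training step instead of reusing a fixed one
    z = np.random.uniform(-1.0, 1.0, size=[batch_size, noise_dim])
    # fake_images = generator(z)   # hypothetical generator forward pass
    # ...update the discriminator and generator here...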

The random noise vector's distribution represents the latent space. It is not that important for GANs, but more so for autoencoders. Typically the noise is generated from a normal distribution, but some researchers have reported improved training results using a spherical distribution (sorry, I don't have the references handy). The range of the noise depends on your input layer. If you are using images, you probably normalized the inputs to be between 0 and 1 or between -1 and 1, so you would use a corresponding distribution range for your noise vector. A typical noise vector might be generated like so:
import tensorflow as tf
noise = tf.random.normal([BATCH_SIZE, noise_dim])
where BATCH_SIZE is the size of the training batch (16, 32, 64, 128, ...) and noise_dim is the size of the noise vector, which depends on your feature space (I often use 1024 for medium-resolution images).
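As a concrete sketch of matching the noise to the input normalisation: MNIST is just a stand-in dataset here and the sizes are illustrative, not a recommendation:
import tensorflow as tf

BATCH_SIZE, noise_dim = 64, 100   # illustrative sizes

# Scale images to [-1, 1], matching a generator whose output layer uses tanh
(train_images, _), _ = tf.keras.datasets.mnist.load_data()
train_images = (tf.cast(train_images, tf.float32) - 127.5) / 127.5

# Standard normal noise is a common matching choice for the latent vector
noise = tf.random.normal([BATCH_SIZE, noise_dim])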

I'm also new to GANs, but I have recently been working on a GAN for signal generation.
Random noise is the input to the generator. At the beginning it has no meaning, and through training you try to give it meaning. Regarding the size, I'm still not sure my conclusion is correct, so I hope others will correct me if I'm wrong: we should search for a size suitable for our problem.
If the latent space is very small, the model will reach a point where it can't produce better quality anymore (a bottleneck), and if it is too big, the model will take a very long time to produce good results and may even fail to converge. Usually one starts with the latent space size used by others for the same problem.

The random vector is not actually arbitrary; typically we sample it from some specific distribution (Gaussian, uniform, etc.). The generator takes the sampled vector and tries to map it to the distribution of the training data by minimising the Jensen-Shannon divergence between the distribution of the generated samples and the distribution of the training data.
The size of the sampled vector which we feed to the generator is a hyperparameter.
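For intuition, here is a small, self-contained sketch in plain NumPy (not part of any GAN library) of the Jensen-Shannon divergence between two discrete distributions; the GAN objective implicitly minimises this quantity between the generated and real data distributions:
import numpy as np

def kl_divergence(p, q):
    # KL(P || Q) for discrete distributions, ignoring zero-probability entries of P
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    # JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), where M = 0.5 * (P + Q)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(js_divergence([0.1, 0.9], [0.8, 0.2]))  # larger value = more dissimilar distributions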

Related

How do I increase the sensitivity of a CNN autoencoder for anomaly detection, when my dataset contains mostly zero values?

I am learning to use autoencoders for anomaly detection. I am using the model provided on the keras website, with data that I have created. My ‘correct data’ used for training the model is a single frequency sine wave with random amplitudes and my ‘anomaly data’ used for testing the model is the original data, with a sine wave of a different frequency superimposed.
In the time domain the model works very well. However, in the frequency domain (I did an FFT), because the data is almost all zeros with only 1 or 2 nonzero values, the sensitivity of the loss function to these data points is extremely low. The model is ultimately able to detect the difference between the correct data and the anomalies with 100% precision, but the loss is near 0 in all cases. If we had a scenario where the non-dominant frequencies were not 0 but simply small numbers, the ML model would surely have no chance of making the correct prediction. Please see the following graphs:
Example of correct data + prediction
Example of anomaly data + prediction
Histogram of predictions, showing clear separation into two groups
What can I do to make my model's loss more sensitive to the desired data? One idea I have is to remove the zero data. However, this means that my input data length would not be fixed, and I don't even know if the CNN can handle that. I don't want to switch to an LSTM, as the convolution is important to me so that the network can handle various frequencies without being explicitly trained on those frequencies.
What other options do I have? Perhaps a loss function which gives higher importance to these values? I do understand that if I have noisy data I can clean it, but I am still concerned that this will be insufficient for less 'black and white' scenarios.
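One option along those lines is a custom loss that up-weights the few non-zero frequency bins. This is only a hypothetical sketch, not taken from the Keras example; the weight and threshold values are made up:
import tensorflow as tf

def weighted_mse(y_true, y_pred, weight=100.0, eps=1e-6):
    # Hypothetical weighted MSE: bins whose true magnitude is non-negligible get a much
    # larger weight, so the handful of non-zero FFT values dominates the reconstruction error.
    w = tf.where(tf.abs(y_true) > eps, weight, 1.0)
    return tf.reduce_mean(w * tf.square(y_true - y_pred))

# model.compile(optimizer="adam", loss=weighted_mse)  # assuming a Keras autoencoder named `model`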

Gaussian Process Regression use case

While reading the paper "Tactile-based active object discrimination and target object search in an unknown workspace", there is something that I just cannot understand:
The paper is about finding an object's position and other properties using only tactile information. In Section 4.1.2, the author says that he uses GPR to guide the exploratory process, and in Section 4.1.4 he describes how he trained his GPR:
Using the example from Section 4.1.2, the input is (x,z) and the output is y.
Whenever there is a contact, the corresponding y-value is stored.
This procedure is repeated several times.
This trained GPR is used to estimate the next exploring point, which is the point where the variance is maximum at.
In the following link you can also see a demonstration: https://www.youtube.com/watch?v=ZiLq3i-BJcA&t=177s . In the first part of the video (0:24-0:29), the first initialization takes place, where the robot samples 4 times. Then, in the next 25 seconds, the robot explores from the corresponding direction. I do not understand how this tiny initialization of the GPR can guide the exploratory process. Could someone please explain how the input points (x,z) for the first exploring part could be estimated?
Any regression algorithm simply maps the input (x,z) to an output y in some way unique to the specific algorithm. For a new input (x0,z0), the algorithm will likely predict something very close to the true output y0 if many similar data points were included in the training. If training data was only available in a vastly different region, the predictions will likely be very bad.
GPR includes a measure of confidence of the predictions, namely the variance. The variance will naturally be very high in regions where no training data has been seen before and low very close to already seen data points. If the 'experiment' takes much longer than evaluating the Gaussian Process, you can use the Gaussian Process fit to make sure you sample regions where you are very uncertain of your answer.
If the goal is to fully explore the entire input space, you could draw a lot of random values of (x,z) and evaluate the variance at these values. Then you could perform the costly experiment at the input point where you are most uncertain in y. Then you can retrain the GPR with all the explored data so far and repeat the process.
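As an illustration of that loop, here is a minimal sketch with scikit-learn's GaussianProcessRegressor; the data points, kernel length scale, and variable names are invented for the example:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A few (x, z) contact points and their measured y-values (made-up numbers)
X_seen = np.array([[0.1, 0.2], [0.4, 0.8], [0.7, 0.3], [0.9, 0.9]])
y_seen = np.array([0.0, 1.0, 0.0, 1.0])

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X_seen, y_seen)

# Evaluate the predictive uncertainty on many random candidate inputs
candidates = np.random.uniform(0.0, 1.0, size=(1000, 2))
mean, std = gpr.predict(candidates, return_std=True)

# The next (costly) experiment is performed where the GP is most uncertain
next_point = candidates[np.argmax(std)]
print(next_point)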
For optimization problems (not the OP's question)
If you wish to find the lowest value of y across the input space, you are not interested in running the experiment in regions that you already know give high values of y and where you are merely uncertain of how high those values will be. So instead of choosing the (x,z) point with the highest variance, you might choose the point where the predicted value of y minus one standard deviation is lowest. Optimizing this way is called Bayesian optimization, and this specific scheme is known as the confidence-bound acquisition (commonly called Upper Confidence Bound, UCB). Expected Improvement (EI), the expected amount by which the previous best score would be improved, is also commonly used.

How can I get the joint probability of new image and latent variable from generative model?

I just learned that a generative model tries to learn p(x|z)p(z) = p(x,z).
But after studying some sample code of generative models such as VAEs and GANs, I found that the output of the model is the generated image x, which is a 2D matrix.
As I understand it, the content of the matrix represents the probability of every pixel given the latent variable; is this right?
If so, is it possible to get the joint probability p(x,z) of the latent variable z and a whole image x from the generative model?
Thanks!
What a generative model is trying to learn is just p(x). p(x|z) = 1 if g(z) = x and 0 otherwise, because GANs and VAEs are deterministic mappings and therefore have a 100% chance of mapping to the same target given the same input.
Extracting the probability of x is not an easy task though, and it depends on the approach. With GANs you can approximate this by sampling from the model: e.g. you sample 1000 images and see how often an image occurred; that image then has a probability of occurrences / 1000. By the law of large numbers you will eventually recover the actual probability distribution of your generator this way.
If you want an exact way to calculate probabilities, you can use flow networks like GLOW or RealNVP, which optimize log(p(x)) directly and provide a way to recover p(x).
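For reference, flow models recover the exact density through the change-of-variables formula: if z = f(x) with f invertible, then log p(x) = log p(z) + log |det(df/dx)|, and this is exactly the quantity GLOW and RealNVP maximise during training.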

Non-linear SVM kernel dimension

I have some problems understanding kernels for non-linear SVMs.
First, what I understood about non-linear SVMs is: using kernels, the input is transformed into a very high dimensional space where the transformed input can be separated by a linear hyperplane.
Kernel, e.g. RBF:
K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2*sigma^2));
where x_i and x_j are two inputs; here we need to tune sigma to adapt to our problem.
(1) Say my input dimension is d, what will be the dimension of the transformed space?
(2) If the transformed space has a dimension of more than 10000, is it effective to use a linear SVM there to separate the inputs?
Well, it is not only a matter of increasing the dimension. That's the general mechanism but not the whole idea; if it were true that the only goal of the kernel mapping were to increase the dimension, one could conclude that all kernel functions are equivalent, and they are not.
The way the mapping is made is what makes linear separation possible in the new space.
Talking about your example, and just to extend a bit what greeness said, the RBF kernel organizes the feature space in terms of hyperspheres, where an input vector needs to be close to an existing sphere in order to produce an activation.
So, to answer your questions directly:
1) Note that you don't work in the feature space directly. Instead, the optimization problem is solved using the inner products of the vectors in the feature space, so computationally you won't increase the dimension of the vectors.
2) It depends on the nature of your data. Having a high-dimensional representation somewhat helps to prevent overfitting, but it will not necessarily be linearly separable. Again, linear separability in the new space is achieved because of the way the map is made, not only because it is in a higher dimension. In that sense, RBF helps, but keep in mind that it might not generalize well if your data is not locally enclosed.
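To make point 1) concrete, here is a small sketch in plain NumPy (values invented) that evaluates the RBF kernel from the question directly on two d-dimensional inputs; no high-dimensional feature vector is ever materialised:
import numpy as np

def rbf_kernel(x_i, x_j, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma ** 2))

x_i = np.array([1.0, 2.0, 3.0])   # a d = 3 input
x_j = np.array([1.5, 1.0, 3.5])
print(rbf_kernel(x_i, x_j, sigma=0.5))  # near 1 for nearby points, near 0 for distant ones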
The transformation usually increases the number of dimensions of your data, though not necessarily to a very high number; it depends. The RBF kernel is one of the most popular kernel functions. It adds a "bump" around each data point. The corresponding feature space is a Hilbert space of infinite dimension.
It's hard to tell whether a transformation into 10000 dimensions is effective for classification without knowing the specific background of your data. However, choosing a good mapping (encoding prior knowledge + getting the right complexity of the function class) for your problem improves results.
For example, the MNIST database of handwritten digits contains 60K training examples and 10K test examples of 28x28 binary images.
A linear SVM has ~8.5% test error.
A polynomial SVM has ~1% test error.
Your question is a very natural one that almost everyone who's learned about kernel methods has asked some variant of. However, I wouldn't try to understand what's going on with a non-linear kernel in terms of the implied feature space in which the linear hyperplane is operating, because most non-trivial kernels have feature spaces that it is very difficult to visualise.
Instead, focus on understanding the kernel trick, and think of the kernels as introducing a particular form of non-linear decision boundary in input space. Because of the kernel trick, and some fairly daunting maths if you're not familiar with it, any kernel function satisfying certain properties can be viewed as operating in some feature space, but the mapping into that space is never performed. You can read the following (fairly) accessible tutorial if you're interested: from zero to Reproducing Kernel Hilbert Spaces in twelve pages or less.
Also note that because of the formulation in terms of slack variables, the hyperplane does not have to separate points exactly: there's an objective function that's being maximised which contains penalties for misclassifying instances, but some misclassification can be tolerated if the margin of the resulting classifier on most instances is better. Basically, we're optimising a classification rule according to some criteria of:
how big the margin is
the error on the training set
and the SVM formulation allows us to solve this efficiently. Whether one kernel or another is better is very application-dependent (for example, text classification and other language processing problems routinely show best performance with a linear kernel, probably due to the extreme dimensionality of the input data). There's no real substitute for trying a bunch out and seeing which one works best (and make sure the SVM hyperparameters are set properly; this talk by one of the LibSVM authors has the gory details).
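For the "try a bunch out" part, a minimal scikit-learn sketch for tuning C and gamma of an RBF SVM could look like this; the toy data and parameter grid are invented:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data as a stand-in for your real features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a target that is not linearly separable

param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1, 1]}
grid = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)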

Is scaling of feature values in LibSVM necessary?

If I have 200 features, and if each feature can have a value ranging from 0 to infinity, should I scale the feature values to be in the range [0-1] before I go ahead and train a LibSVM on top of it?
Now, suppose I did scale the values and trained the model. If I then get one vector with its feature values as input, how do I scale these values of the input test vector before classifying it?
Thanks
Abhishek S
You should store the ranges of your feature values used for training. Then, when you extract a feature value from an unknown instance, use that stored range for scaling.
Use this formula (here for the range [-1.0, 1.0]):
double scaled_val = -1.0 + (1.0 - (-1.0)) * (extracted_val - vmin) / (vmax - vmin);
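Equivalently, if you are working from Python, a small sketch with scikit-learn's MinMaxScaler does the same thing: fit the ranges on the training data once, then reuse them for every test vector. The arrays below are toy values:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])   # toy training features
X_test = np.array([[2.5, 30.0]])                                # one unknown instance

scaler = MinMaxScaler(feature_range=(-1.0, 1.0)).fit(X_train)   # stores vmin/vmax per feature
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # the same ranges are applied to the test vector
print(X_test_scaled)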
The guide provided on the LIBSVM website explains scaling well:
"2.2 Scaling
Scaling before applying SVM is very important. Part 2 of Sarle's Neural Networks FAQ Sarle (1997) explains the importance of this and most of considerations also apply to SVM. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1, +1] or [0, 1].
Of course we have to use the same method to scale both training and testing data."
If you've got infinite feature values, you're not going to be able to use LIBSVM anyway.
More practically, scaling is generally useful so the kernel doesn't have to deal with large numbers, so I would say go for it and scale. It's not a requirement, though.
And as Anony-Mousse implied in the comments, please try running experiments with and without scaling so you can see the difference.
Now, suppose I did scale the values and trained the model. If I then get one vector with its feature values as input, how do I scale these values of the input test vector before classifying it?
You don't need to scale again. You already did that in the pre-training step (i.e. data processing).
