Do generative adversarial networks require class labels? - machine-learning

I am trying to understand how a GAN is trained. I believe I understand the adversarial training process. What I can't seem to find information on is this: do GANs use class labels in the training process? My current understanding says no, because the discriminator is simply trying to discriminate between real and fake images, while the generator is trying to create realistic images (but not images of any specific class).
If this is the case, then how do researchers propose to use the discriminator network for classification tasks? The network would only be able to perform two-way classification between real and fake images. The generator network would also be difficult to use, seeing as we don't know what setting of the input vector 'Z' will result in the required generated image.

It completely depends on the network you are trying to build. If you are talking specifically about the basic GAN, then you are correct: class labels are not needed, as the discriminator network is only classifying real/fake images. There is a conditional variant of the GAN (cGAN) where you do make use of the class labels in both the generator and the discriminator. This allows you to produce examples for a specific class with the generator and classify them with the discriminator (along with the real/fake classification).
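To make this concrete, here is a minimal sketch (assuming PyTorch; the layer sizes, embedding choice, and variable names are made up for illustration, not from any particular paper) of how a cGAN feeds the class label to both networks:

import torch
import torch.nn as nn

n_classes, z_dim, img_dim = 10, 64, 784   # made-up sizes for illustration
label_emb = nn.Embedding(n_classes, n_classes)
G = nn.Sequential(nn.Linear(z_dim + n_classes, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

z = torch.randn(32, z_dim)                          # random noise
y = torch.randint(0, n_classes, (32,))              # the classes we want to generate
fake = G(torch.cat([z, label_emb(y)], dim=1))       # generator is conditioned on the label
score = D(torch.cat([fake, label_emb(y)], dim=1))   # discriminator sees the same label

At sampling time you pick the label y you want, which is what lets you generate examples of a specific class.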
From the reading that I have done, the discriminator network is just used as a tool for training the generator, and the generator is the main network of concern. Why would you use the discriminator that you used to train the GAN for classification when you could just use a ResNet or a VGG net for your classification tasks? These networks would work better anyway. You are right, however, that using the original GAN could cause difficulty because of mode collapse and constantly producing the same image. That is why the conditional variant was introduced.
Hope this clears things up!

Do GANs use class labels in the training process?
The author suspected that GANs don't require labels. This is correct. The discriminator is trained to classify real and fake images. Since we know which images are real and which are generated by the generator, we do not need labels to train the discriminator. The generator is trained to fool the discriminator, which also doesn't require labels.
This is one of the most attractive benefits of GANs [1]. Usually, we refer to methods that do not require labels as unsupervised learning. That said, if we had labels, maybe we could train a GAN that uses the labels to improve performance. This idea underlies the follow-up work by [2] who introduced the conditional GAN.
If this is the case, then how do researchers propose to use the discriminator network for classification tasks?
There seems to be a misunderstanding here. The purpose of the discriminator is NOT to act as a classifier on real data. The purpose of the discriminator is to "tell the generator how to improve its fakes". This is done by using the discriminator as a loss function, which we can backpropagate gradients through if it is a neural network. After training, we usually discard the discriminator.
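For illustration, here is a minimal sketch (assuming PyTorch; the networks, sizes, and optimizer are placeholders, not any particular paper's architecture) of how the discriminator serves as a differentiable loss for the generator:

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

z = torch.randn(32, 64)                    # no labels anywhere in this step
fake = G(z)
# The generator's loss is simply "how confidently D calls these images fake".
# Because D is a differentiable network, gradients flow through it into G.
g_loss = bce(D(fake), torch.ones(32, 1))   # G wants D to output "real" (1) on its fakes
opt_G.zero_grad()
g_loss.backward()
opt_G.step()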
The generator network would also be difficult to use, seeing as we don't know what setting of the input vector 'Z' will result in the required generated image.
It seems the underlying reason for posting the question lies here. The input vector 'Z' is chosen such that it follows some distribution, typically a normal distribution. But then what happens if we take 'Z', a random vector with normally distributed entries, and compute 'G(Z)'? We get a new vector which follows a very complicated distribution that depends on G. The entire idea of GANs is to change G such that this new, complicated distribution is close to the distribution of our data. This idea is formalized with f-divergences in [3].
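As a toy illustration of that "pushforward" idea (plain NumPy, with an arbitrary made-up G rather than a trained generator):

import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)              # Z follows a simple normal distribution
g_of_z = np.tanh(2.0 * z) + 0.1 * z ** 2      # some arbitrary, fixed nonlinear "G"
print(z.mean(), z.std())                      # roughly 0 and 1
print(g_of_z.mean(), g_of_z.std())            # a different distribution that depends entirely on G;
                                              # GAN training adjusts G until it matches the data distribution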
[1] https://arxiv.org/abs/1406.2661
[2] https://arxiv.org/abs/1411.1784
[3] https://arxiv.org/abs/1606.00709

Related

How to deal with dataset of different features?

I am working to create an MLP model on a CEA classification dataset (binary classification). Each sample contains 4 different features, such as resistance and other values, each in its own range (resistance in the hundreds, another in micros, etc.). I am still new to machine learning and this is the first real model I am building. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption is that this kind of data needs to be scaled. If so, what are some useful resources to look at, since I do not quite understand when scaling is required?
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use the scikit-learn library's preprocessing classes MinMaxScaler() and StandardScaler() for normalization and standardization, respectively.
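For example, here is a short sketch (scikit-learn; X and y below are random placeholders for your 4-feature data and labels) that compares both scalers inside a pipeline, so the scaler is fitted on the training folds only:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 4) * [100.0, 1e-6, 10.0, 1000.0]   # 4 features on very different scales
y = np.random.randint(0, 2, 200)                           # binary target

for scaler in (StandardScaler(), MinMaxScaler()):
    model = make_pipeline(scaler, MLPClassifier(hidden_layer_sizes=(16,), max_iter=500))
    print(type(scaler).__name__, cross_val_score(model, X, y, cv=5).mean())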
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.

Mapping a Plant with Machine Learning

There is a dataset of a plant that produces certain numeric outputs based on numeric inputs. The dataset contains the input values and the output value, recorded every 15 minutes over several years.
Since it would be too expensive to model the physical properties of the system in software, I would like to create a model with machine learning which behaves like the system: when given the inputs, the model should provide the output.
As a solution I have tested a feedforward neural network. The results are OK, but in some cases too inaccurate.
What other methods would be available for this problem?
If it's a time-series task, you could use the NARX architecture or an LSTM network. The latter is, like the NARX, a recurrent neural network. Matlab offers an implementation of the first one.
https://en.m.wikipedia.org/wiki/Nonlinear_autoregressive_exogenous_model
https://en.m.wikipedia.org/wiki/Long_short-term_memory
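As a rough sketch of the NARX-style idea without any special toolbox (plain NumPy and scikit-learn; the arrays below are placeholders for your 15-minute records), you can build lagged input/output features and regress the next output on them:

import numpy as np
from sklearn.linear_model import LinearRegression

u = np.random.rand(1000, 3)    # placeholder plant inputs, one row per 15-minute step
y = np.random.rand(1000)       # placeholder plant output
lags = 4                       # how many past steps to feed the model

rows, targets = [], []
for t in range(lags - 1, len(y) - 1):
    past_inputs = u[t - lags + 1 : t + 1].ravel()    # inputs at t-3 .. t
    past_outputs = y[t - lags + 1 : t + 1]           # outputs at t-3 .. t
    rows.append(np.concatenate([past_inputs, past_outputs]))
    targets.append(y[t + 1])                         # predict the next output

model = LinearRegression().fit(np.array(rows), np.array(targets))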
If you "simply" want to fit a polynomial to your data you could use basic linear regression with polynomials of different degree to see which one works best.
Note: It's not called linear because it's only able to fit linear models.
https://en.m.wikipedia.org/wiki/Linear_regression
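A minimal sketch of that (scikit-learn; the data below are made up) which tries several polynomial degrees and compares them by cross-validation:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 3)                                        # placeholder plant inputs
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * np.random.randn(500)   # placeholder plant output

for degree in (1, 2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    print(degree, cross_val_score(model, X, y, cv=5, scoring="r2").mean())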
Some other possibilities are kernel methods such as kernel ridge regression or SVR. The latter is based on support vector machines, which usually perform quite well (at least for classification, in my personal experience).
If you want to try SVR, you can use a small but great library called libSVM. Matlab also offers this.
The following link shows a comparison of these algorithms:
http://scikit-learn.org/stable/auto_examples/plot_kernel_ridge_regression.html
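For a quick start without Matlab, here is a hedged sketch using scikit-learn (whose SVR is built on libsvm; the data are again placeholders, and the hyper-parameters are arbitrary):

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 3)
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * np.random.randn(500)

for model in (KernelRidge(kernel="rbf", alpha=1.0), SVR(kernel="rbf", C=10.0)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5, scoring="r2").mean())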
Edit: If i understand this correctly, it's a time series task if you want to predict the outputs of a future time t+1 from a given time t. Try the NARX model or the LSTM net.

Instance Normalisation vs Batch normalisation

I understand that batch normalisation helps in faster training by pushing the activations towards a unit Gaussian distribution, thus tackling the vanishing-gradient problem. Batch norm is applied differently at training time (using the mean/variance of each batch) and at test time (using the finalized running mean/variance from the training phase).
Instance normalisation, on the other hand, acts as contrast normalisation, as mentioned in this paper: https://arxiv.org/abs/1607.08022. The authors mention that the output stylised images should not depend on the contrast of the input content image, and hence instance normalisation helps.
But then should we not also use instance normalisation for image classification, where the class label should not depend on the contrast of the input image? I have not seen any paper using instance normalisation in place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together? I am eager to get an intuitive as well as theoretical understanding of when to use which normalisation.
Definition
Let's begin with the strict definition of both:
Batch normalization: y_tijk = (x_tijk - mu_i) / sqrt(sigma_i^2 + eps), where, for a batch of T images with C channels and H x W spatial locations, the mean mu_i and variance sigma_i^2 of channel i are computed over the entire batch and all spatial positions (over t, j, k).
Instance normalization: y_tijk = (x_tijk - mu_ti) / sqrt(sigma_ti^2 + eps), where mu_ti and sigma_ti^2 are computed separately for each image t and channel i, over the spatial positions j, k only.
As you can notice, they are doing the same thing, except for the number of input tensors that are normalized jointly. The batch version normalizes all images across the batch and the spatial locations (this is the CNN case; the ordinary fully-connected case is different); the instance version normalizes each element of the batch independently, i.e., across spatial locations only.
In other words, where batch norm computes one mean and std dev (thus making the distribution of the whole layer Gaussian), instance norm computes T of them, one per image, making each individual image's distribution look Gaussian, but not jointly.
A simple analogy: during data pre-processing step, it's possible to normalize the data on per-image basis or normalize the whole data set.
Credit: the formulas are from the instance normalisation paper linked above.
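If it helps, here is a small NumPy sketch of the difference in which axes the statistics are computed over (the axis choices illustrate the formulas above; this is not library code):

import numpy as np

x = np.random.randn(8, 16, 32, 32)           # (batch T, channels C, height H, width W)
eps = 1e-5

# Batch norm: one mean/std per channel, shared by the whole batch.
mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)
var_bn = x.var(axis=(0, 2, 3), keepdims=True)
x_bn = (x - mu_bn) / np.sqrt(var_bn + eps)

# Instance norm: a separate mean/std for every (image, channel) pair.
mu_in = x.mean(axis=(2, 3), keepdims=True)
var_in = x.var(axis=(2, 3), keepdims=True)
x_in = (x - mu_in) / np.sqrt(var_in + eps)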
Which normalization is better?
The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to the FC layer, which shares weights across the batch (the modern way is to use a CONV layer instead of the FC layer, but the argument still applies).
This is where the distribution nuances start to matter: the same neuron is going to receive the input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's fairly possible that per-instance normalization won't improve network convergence at all.
On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighbouring instances. As it turns out, this kind of noise may be either good or bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans et al., which names recurrent neural networks and reinforcement-learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think that the same noise sensitivity was the main issue in the stylization task, which instance norm tried to fight. It would be interesting to check whether weight norm performs better for this particular task.
Can you combine batch and instance normalization?
Though it makes a valid neural network, there's no practical use for it. Batch normalization noise is either helping the learning process (in this case it's preferable) or hurting it (in this case it's better to omit it). In both cases, leaving the network with one type of normalization is likely to improve the performance.
Great question, and already answered nicely. Just to add: I found the visualisation from Kaiming He's Group Norm paper, which contrasts the different normalisations side by side, helpful.
Source: a Medium article contrasting the norms.
I wanted to add more information to this question since there are some more recent works in this area. Your intuition
use instance normalisation for image classification where class label
should not depend on the contrast of input image
is partly correct. I would say that a pig in broad daylight is still a pig when the image is taken at night or at dawn. However, this does not mean that using instance normalization across the whole network will give you better results. Here are some reasons:
Colour distribution still plays a role. An image is more likely to show an apple than an orange if it contains a lot of red.
At later layers, you can no longer picture instance normalization acting as contrast normalization. Class-specific details emerge in deeper layers, and normalizing them per instance will hurt the model's performance greatly.
IBN-Net uses both batch normalization and instance normalization in its model: instance normalization is applied only in the early layers, and this achieved improvements in both accuracy and the ability to generalize. The authors have open-sourced their code.
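As a rough sketch of that idea (PyTorch; this is an illustration in the spirit of IBN-Net, not their actual code), an early block can instance-normalize part of the channels and batch-normalize the rest:

import torch
import torch.nn as nn

class IBNBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.half = half
        self.inorm = nn.InstanceNorm2d(half, affine=True)     # appearance invariance
        self.bnorm = nn.BatchNorm2d(channels - half)          # discriminative features

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.inorm(a), self.bnorm(b)], dim=1)

x = torch.randn(8, 64, 32, 32)
print(IBNBlock(64)(x).shape)   # torch.Size([8, 64, 32, 32])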
IN provides visual and appearance invariance, while BN accelerates training and preserves discriminative features.
IN is preferred in the shallow layers (the early layers of a CNN) to remove appearance variation, while BN is preferred in the deep layers (towards the last CNN layers) in order to maintain discrimination.

Model selection with dropout training neural network

I've been studying neural networks for a bit and recently learned about the dropout training algorithm. There are excellent papers out there to understand how it works, including the ones from the authors.
So I built a neural network with dropout training (it was fairly easy), but I'm a bit confused about how to perform model selection. From what I understand, it looks like dropout is a method to be used when training the final model obtained through model selection.
As for the test part, papers always talk about using the complete network with halved weights, but they do not mention how to use it in the training/validation part (at least the ones I read).
I was thinking about using the network without dropout for the model-selection part. Say that makes me find that the net performs well with N neurons. Then, for the final training (the one I use to train the network for the test part), I use 2N neurons with dropout probability p=0.5. That ensures I have N neurons active on average, thus using the network at the right capacity most of the time.
Is this a correct approach?
By the way, I'm aware of the fact that dropout might not be the best choice with small datasets. The project I'm working on has academic purposes, so it's not really needed that I use the best model for the data, as long as I stick with machine learning good practices.
First of all, model selection and the training of a particular model are completely different issues. For model selection, you usually need a data set that is completely independent of both the training set used to build the model and the test set used to estimate its performance. So if you're doing, for example, a cross-validation, you would need an inner cross-validation to train and compare the candidate models, and an outer cross-validation to estimate how well the selected model performs on data that played no part in the selection.
To see why, consider the following thought experiment (shamelessly stolen from this paper). You have a model that makes completely random predictions. It has a number of parameters that you can set, but they have no effect. If you try different parameter settings for long enough, you'll eventually get a model that performs better than all the others simply because you're sampling from a random distribution. If you use the same data for all of these models, this is the model you will choose. If you have a separate test set, it will quickly tell you that there is no real effect, because the parameter setting that achieved good results during the model-building phase is no better on the separate set.
Now, back to neural networks with dropout. You didn't refer to any particular paper; I'm assuming that you mean Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". I'm not an expert on the subject, but the method seems to me similar to what's used in random forests or bagging: mitigating the flaws an individual learner may exhibit by applying it repeatedly in slightly different contexts. If I understood the method correctly, what you end up with is essentially an average over several possible models, very similar to random forests.
This is a way to make an individual model better, but not a method for model selection. Dropout is a way of adjusting the learned weights of a single neural network model.
To do model selection on this, you would need to train and test neural networks with different parameters and then evaluate those on completely different sets of data, as described in the paper I've referenced above.
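As a compact sketch of nested cross-validation (scikit-learn, with MLPClassifier standing in for your dropout network and random placeholder data), the inner loop picks the hyper-parameters and the outer loop estimates how well that whole selection procedure generalizes:

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier

X = np.random.rand(300, 10)          # placeholder features
y = np.random.randint(0, 2, 300)     # placeholder labels

inner = GridSearchCV(MLPClassifier(max_iter=500),
                     param_grid={"hidden_layer_sizes": [(32,), (64,), (128,)]},
                     cv=3)                       # inner CV: model selection
outer_scores = cross_val_score(inner, X, y, cv=5)  # outer CV: unbiased performance estimate
print(outer_scores.mean())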

Classification Algorithm which can take predefined weights for attributes as input

I have 20 attributes and one target feature. All the attributes are binary (present or not present) and the target feature is multinomial (5 classes).
But for each instance, apart from the presence of some attributes, I also have information about how much effect (on a scale of 1-5) each present attribute had on the target feature.
How do I make use of this extra information to build a classification model that gives better predictions on the test data?
Why not just use the weights as the features, instead of the binary presence indicators? You can code the lack of presence as a 0 on the continuous scale.
EDIT:
The classifier you choose will learn optimal weights on the features during training to separate the classes... thus I don't believe there's anything better you can do if you do not have access to the weights at test time. Essentially, a linear classifier learns a rule of the form:
c_i = sgn(w . x_i)
You're saying you have access to weights, but without an example of what the data look like, and an explanation of where the weights come from, I'd have to say I don't see how you'd use them (or even why you'd want to: is standard classification with binary features not working well enough?)
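To illustrate the suggestion above, here is a tiny sketch (scikit-learn; the arrays are made up) that feeds the 1-5 effect scores directly as feature values, with 0 meaning the attribute is absent:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[3, 0, 5, 1],    # effect scores (1-5) for 4 of the 20 attributes, 0 = absent
              [0, 2, 0, 4],
              [1, 0, 0, 5],
              [0, 4, 2, 0]])
y = np.array([0, 2, 1, 3])     # target classes (the real data has 5 classes)

clf = LogisticRegression(max_iter=1000).fit(X, y)   # multinomial logistic regression
print(clf.predict(X))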
This clearly depends on the actual algorithms that you are using.
For decision trees, the information is useless. They are meant to learn which attributes have how much effect.
Similarly, support vector machines will learn the best linear split, so any kind of weight will disappear since the SVM already learns this automatically.
However, if you are doing NN classification, just scale the attributes as desired, to emphasize differences in the influential attributes.
Sorry, you need to look at other algorithms yourself. There are just too many.
Use the knowledge as a prior over the feature weights. You can then compute a posterior estimate from the data and obtain the final model.
