How to check a trained neural network - machine-learning

I am writing a little bit about Google's DeepDream. It's possible to inspect what a trained network has learned with DeepDream; see the dumbbell example on the Google Research blog.
In that example a network is trained to recognize a dumbbell. They then use DeepDream to visualize what the network has learned, and the result shows that the network was trained badly: it recognizes a dumbbell plus an arm as a dumbbell.
My question is: how are networks checked in practice? With DeepDream, or with which other method?
Best regards

Generally in machine learning you validate your learned network on a dataset you did not use during training (a test set). So in this case, you would have a set of examples with and without dumbbells that was used to train the model, as well as a set (also containing examples with and without dumbbells) that was not seen during the training procedure.
When you have your model, you let it predict the labels of the withheld set. You then compare these predicted labels to the actual ones:
Every time you predict a dumbbell correctly, you increment the number of True Positives;
every time it correctly predicts the absence of a dumbbell, you increment the number of True Negatives;
when it predicts a dumbbell but there is none, you increment the number of False Positives;
finally, if it predicts no dumbbell but there is one, you increment the number of False Negatives.
Based on these four counts you can then calculate measures such as the F1 score or accuracy to quantify the performance of the model. (Have a look at the following wiki: https://en.wikipedia.org/wiki/F1_score )
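For concreteness, here is a minimal Python sketch of how those four counts turn into accuracy, precision, recall and F1. The counts themselves are made-up numbers for illustration, not taken from any real model.

```python
# Minimal sketch: turning TP/TN/FP/FN counts into standard metrics.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

tp, tn, fp, fn = 80, 90, 10, 20   # hypothetical counts from a withheld test set
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```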

Related

How to adjust to the randomness of the neural network weights?

The weights of the network are random during the initialization. Thus, if you train the network multiple times with multiple different random weights, you will get different results.
My question is:
What do you do during hyperparameter tuning? Do you retrain the network multiple times for each hyperparameter configuration, and take the mean of the results as the value for that configuration?
And if this is the case, does anyone use the information provided by the standard deviation?
And for the final results reported on the test data: do we train the network multiple times to compensate for the random weights, or just once?
For example, in the paper A Neural Representation of Sketch Drawings,
they report the log-likelihood for different categories in a table,
and I don't get the methodology behind getting these numbers.
I appreciate any clarification :-)
I'd say fix the seed so you get the same random initialization every time, and play with the hyperparameters only. Of course, if you want to try different random initializers (e.g. one of https://keras.io/initializers/), then that would itself be a hyperparameter.
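As a minimal sketch, assuming a TensorFlow 2.x / Keras setup (the exact calls differ between versions and frameworks), fixing the seed could look like this:

```python
# Sketch: pin down the sources of randomness before each hyperparameter trial.
import random
import numpy as np
import tensorflow as tf

SEED = 1234
random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy (data shuffling, etc.)
tf.random.set_seed(SEED)   # TensorFlow weight initialization, dropout, ...

# With the seeds fixed, two runs with the same hyperparameters start from the
# same random initialization, so differences in the results come from the
# hyperparameters rather than from the initialization.
```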
The paper you cited isn't about the network's weight initialization.
It is about the weighting of two loss functions, as the following key phrase reveals:
Our training procedure follows the approach of the Variational
Autoencoder [15], where the loss function is the sum of two terms: the
Reconstruction Loss, LR, and the Kullback-Leibler Divergence Loss, LKL.
Anyway, to answer your question: there are several other random factors in a neural model, not just the weight initialization.
There are also several methods to handle this randomness and its variance.
Some of them are training the network multiple times, as you mentioned, with different train-test splits, different cross-validation schemes, and many others.
You can fix the initial state of the random generator so that every hyper-parameter tuning run gets the same "randomness" regarding the weights, but you can, and sometimes should, do this at different stages of the training process; i.e. you can use seed(1234) at weight initialization, but use seed(555) when building the train-test split to get a similar distribution of the two sets. A small sketch of that follows below.
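Here is a hedged sketch of that two-seed idea, using NumPy and scikit-learn's train_test_split as stand-ins; the specific libraries and the layer size are assumptions for illustration, not something the answer prescribes.

```python
# Sketch: fix randomness separately for the data split and the weight init.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 6)              # dummy data for illustration
y = (X.sum(axis=1) > 0).astype(int)

# Seed 555 controls how the data is split, so every tuning run sees the
# same train/test distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=555)

# Seed 1234 controls the weight initialization only.
init_rng = np.random.default_rng(1234)
W = init_rng.normal(scale=0.1, size=(X.shape[1], 32))   # first-layer weights
```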

Neural Network Custom Binary Prediction

I am trying to design a neural network that makes a custom binary prediction.
Normally to do binary prediction, I would use a softmax as my last layer, and then my loss could be the difference between the prediction I made and the true binary value.
However, what if I don't want to use a softmax layer? Instead, I output a real-valued number and check whether some condition on this number is true. In a really simple case, I check whether this number is positive: if it is, I predict 1, else I predict 0. Let's say I want all the numbers to be positive, so the true predictions should all be 1, and I want to train this network so that it outputs only positive numbers. I am confused about how to formulate a loss function for this problem so that I am able to backpropagate and train the network.
Does anyone have an idea how to create this kind of network?
I am confused about how to formulate a loss function for this problem so that I am able to backpropagate and train the network.
Here's how you should approach it. Effectively, you need to transform the labels into positive and negative target values (say +1 and -1) and solve the regression problem. The loss function can be a simple L1 or L2 loss. The network will try to learn to output a prediction close to the training target, which you can afterwards interpret by checking whether it's closer to one target or the other, i.e. positive or negative. You can even make some targets larger (e.g. +2 or +10) to emphasize that those examples are very important. Example code: linear regression in tensorflow.
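A minimal sketch of that regression framing in plain NumPy (gradient descent on an L2 loss toward +1/-1 targets; the data and the linear model are made up for illustration):

```python
# Sketch: regress a real-valued output toward +1 / -1 targets with L2 loss,
# then read the class prediction off the sign of the output.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # dummy features
t = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # dummy +1/-1 targets

w = np.zeros(5)
lr = 0.1
for _ in range(500):
    y = X @ w                       # real-valued output
    grad = X.T @ (y - t) / len(t)   # gradient of 0.5 * mean((y - t)^2)
    w -= lr * grad

pred = np.where(X @ w > 0, 1, 0)    # positive output -> predict 1, else 0
```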
However, I have to warn you that this approach has serious drawbacks; see for instance this question. One outlier in the training data can easily skew your predictions. Classification with softmax + cross-entropy loss is more stable, which is why it's almost always the better choice.

Machine Learning - Huge Only positive text dataset

I have a dataset with thousands of sentences belonging to a subject. I would like to know what would be the best way to create a classifier that predicts a text as "True" or "False" depending on whether it talks about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use string to word vector to preprocess the data.
Since there are no negative samples, I deal with a single class. I've tried a one-class classifier (libSVM in Weka), but the number of false positives is so high that I cannot use it.
I also tried adding negative samples, but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN, ...) tend to predict it as a false positive. I guess it's because of the sheer amount of positive samples.
I'm open to discarding ML as the tool to predict the new incoming data if necessary.
Thanks for any help.
I have eventually added data for the negative class and built a Multinomial Naive Bayes classifier which is doing the job as expected.
(the size of the data added is around one million samples :) )
My answer is based on the assumption that adding at least 100 negative samples to the author's dataset of 1000 positive samples is acceptable, since I have not yet received an answer from the author to my question about this.
Since this case of detecting a specific topic looks like a particular case of topic classification, I would recommend starting with a classification approach using two simple classes: one class for your topic and another for all other topics.
I succeeded with the same approach for a face recognition task: at the beginning I built a model with one output neuron, with a high output level if a face was detected and a low one if not.
Nevertheless, this approach gave me too low an accuracy, less than 80%.
But when I tried using 2 output neurons, one class for face presence on the image and another if no face was detected, it gave me more than 90% accuracy for an MLP, even without using a CNN.
The key point here is using a SoftMax function for the output layer. It gives a significant increase in accuracy. From my experience, it increased accuracy on the MNIST dataset from 92% up to 97% for the same MLP model.
About the dataset: the majority of classification algorithms with a trainer, at least in my experience, are more efficient with an equal quantity of samples for each class in the training set. In fact, if one class has less than 10% of the average quantity of the other classes, the model becomes almost useless for detecting that class. So if you have 1000 samples for your topic, then I suggest creating 1000 negative samples covering as many different topics as possible.
Alternatively, if you don't want to create such a big set of negative samples, you can create a smaller one and use batch training with a batch size of 2x your negative sample quantity. To do so, split your positive samples into n chunks, each roughly the size of your negative sample set, and train your NN on n batches per training iteration, using chunk[i] of the positive samples plus all of your negative samples in each batch. Just be aware that lower accuracy will be the price of this trade-off.
Also, you could consider creating a more generic topic detector: figure out all possible topics that can appear in the texts your model should analyze, for example 10 topics, and create a training dataset with 1000 samples per topic. This can also give higher accuracy.
One more point about the dataset: the best practice is to train your model on only part of the dataset, for example 80%, and use the remaining 20% for cross-validation. Validating on data the model has not seen before gives you a good estimate of your model's accuracy in real life, not just on the training set, and helps avoid overfitting.
About building the model: I like to follow a "from simple to complex" approach. So I would suggest starting with a simple MLP with a SoftMax output and a dataset of 1000 positive and 1000 negative samples. After reaching 80%-90% accuracy you can consider using a CNN, and I would also suggest increasing the training dataset size, because deep learning algorithms are more efficient with bigger datasets.
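As a concrete illustration of the "simple MLP with SoftMax output" starting point, here is a hedged sketch using scikit-learn; the vectorizer, layer size and placeholder data are assumptions for illustration, not part of the answer.

```python
# Sketch: bag-of-words features + a small MLP for "your topic" vs. "other topics".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Placeholder corpus: replace with your 1000 positive and ~1000 negative sentences.
texts = ([f"a sentence about the subject number {i}" for i in range(50)] +
         [f"a sentence about something unrelated number {i}" for i in range(50)])
labels = [1] * 50 + [0] * 50                 # 1 = your topic, 0 = any other topic

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

# MLPClassifier uses a softmax output for multi-class problems and a logistic
# output for the binary case, which plays the same role here.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))             # held-out accuracy
```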
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that score higher than the lowest-scoring held-out true positive sample.
Then you iterate this process until it stabilizes.
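A rough sketch of one such "spy" round, using a Naive Bayes classifier from scikit-learn as the base model; the threshold rule, variable names and dense count matrices are assumptions based on the description above, not a reference implementation.

```python
# Sketch: one "spy" round of positive-unlabeled learning, as described above.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def spy_round(X_pos, X_unlabeled, spy_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Hold out a fraction of the positives as "spies" hidden in the negative class.
    n_spies = max(1, int(spy_frac * X_pos.shape[0]))
    idx = rng.permutation(X_pos.shape[0])
    spies, true_pos = X_pos[idx[:n_spies]], X_pos[idx[n_spies:]]

    X = np.vstack([true_pos, X_unlabeled, spies])
    y = np.r_[np.ones(len(true_pos)), np.zeros(len(X_unlabeled) + n_spies)]

    clf = MultinomialNB().fit(X, y)
    threshold = clf.predict_proba(spies)[:, 1].min()      # lowest-scoring spy
    scores = clf.predict_proba(X_unlabeled)[:, 1]
    # Conservatively relabel unlabeled docs scoring above the threshold as positive.
    return scores >= threshold
```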

how to train neural network with probabilistic input

Hello and thanks for helping,
My question is about a long-standing problem that I am trying to tackle:
How do we train a neural network if the input is a probability rather than a value?
To make it more intuitive:
Let's say we have 6 features, and each of them may take the value 1 or -1.
Their values are determined probabilistically; for example, feature 1 can be 1 with 60% probability or -1 with 30% probability.
How do we train the network if in each trial we get an input value drawn according to the probability distribution of each feature?
Actually the answer is more straightforward than you might expect, as many existing neural networks are trained exactly in this manner. You have to do... nothing. Simply sample your batch in each iteration according to your distribution, and that's all. A neural network does not require a finite training set, so you can efficiently train it on a "potentially infinite" one (a generator of samples). This is exactly what is done in image processing with image augmentation: each batch consists of random subsamples of the images (patches), which are sampled from very basic probability distributions.
@Nagabuhushan suggests solving a different problem, where you know a priori the probability of each sample, which, according to the question, is not the case:
we get an input value drawn according to the probability distribution of each feature
Plus, even if that were the case, NNs are not good at multiplying, so one might need additional tweaking of the architecture (log-transforms).
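A minimal sketch of the "sample a fresh batch each iteration" idea in NumPy; the feature probabilities, batch size and placeholder training calls are made up for illustration.

```python
# Sketch: generate a new batch of +/-1 inputs from the per-feature
# probabilities on every training iteration.
import numpy as np

rng = np.random.default_rng(0)
p_one = np.array([0.6, 0.3, 0.5, 0.8, 0.2, 0.7])   # P(feature == 1), hypothetical

def sample_batch(batch_size=32):
    u = rng.random((batch_size, p_one.size))
    return np.where(u < p_one, 1.0, -1.0)           # +1 with prob p_one, else -1

for step in range(1000):
    x = sample_batch()          # fresh inputs drawn from the distribution
    # y = target_fn(x)          # targets depend on your task (placeholder)
    # train_step(model, x, y)   # one optimizer step on this batch (placeholder)
```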
For the values you feed into the net, you should use the probabilities of each feature taking on the value 1. You could use the probabilities of them taking on -1 instead, but be consistent. Also, decide on some order of the features and consistently order their probabilities accordingly.
Edit: I think I may have misunderstood the question. Do your inputs consist of probabilities, or 1's and -1's? If the latter, then a well-architected network should learn the distributions on its own. Just be sure to train it against the same input space that you'll be evaluating it against.

Machine Learning - Support Vector Machines

I came across an SVM example, but I didn't understand it. I would appreciate it if somebody could explain how the prediction works. Please see the explanation below:
The dataset has 10,000 observations with 5 attributes (Sepal Width, Sepal Length, Petal Width, Petal Length, Label). The label is positive if the observation belongs to the I. setosa class, and negative if it belongs to some other class.
There are 6000 observations for which the outcome is known (i.e. they belong to the I. setosa class, so they get a positive label). The labels for the remaining 4000 are unknown, so their label was assumed to be negative. The 6000 observations plus 2500 observations randomly selected from the remaining 4000 form the set for 10-fold cross-validation. An SVM (with 10-fold cross-validation) is then trained on these 8500 observations and the ROC curve is plotted.
Where are we predicting here? The set has 6000 observations for which the values are already known. How did the remaining 2500 get negative labels? When the SVM is used, some observations that are positive get a negative prediction. The prediction didn't make any sense to me here. Why are the other 1500 observations excluded?
I hope my explanation is clear. Please let me know if I haven't explained anything clearly.
I think that the issue is a semantic one: you refer to the set of 4000 samples as being both "unknown" and "negative" -- which of these applies is the critical difference.
If the labels for the 4000 samples are truly unknown, then I'd build a 1-class SVM using the 6000 labelled samples [c.f. validation below]. The predictions would then be generated by testing the N=4000 set to assess whether or not each sample belongs to the setosa class.
If instead we have 6000 setosa and 4000 (known) non-setosa, we could construct a binary classifier on the basis of this data [c.f. validation below], and then use it to predict setosa vs. non-setosa on any other available non-labelled data.
Validation: usually, as part of the model construction process, you take only a subset of your labelled training data and use it to configure the model. For the unused subset, you apply the model to the data (ignoring the labels) and compare what your model predicts against the true labels in order to assess error rates. This applies to both the 1-class and the 2-class situations above.
Summary: if all of your data are labelled, then usually one will still make predictions for a subset of them (ignoring the known labels) as part of the model validation process.
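For illustration, here is a brief sketch of the two options above in scikit-learn terms (OneClassSVM vs. a binary SVC); the feature matrices are random placeholders, not the questioner's actual data.

```python
# Sketch: option 1 (one-class) vs. option 2 (binary) from the answer above.
import numpy as np
from sklearn.svm import OneClassSVM, SVC

X_setosa = np.random.randn(6000, 4)     # placeholder for the 6000 labelled setosa rows
X_other = np.random.randn(4000, 4)      # placeholder for the 4000 remaining rows

# Option 1: the 4000 are truly unknown -> fit a one-class SVM on setosa only,
# then score the unknown rows.
oc = OneClassSVM(nu=0.05).fit(X_setosa)
pred_unknown = oc.predict(X_other)      # +1 = looks like setosa, -1 = outlier

# Option 2: the 4000 are known non-setosa -> ordinary binary SVM.
X = np.vstack([X_setosa, X_other])
y = np.r_[np.ones(len(X_setosa)), np.zeros(len(X_other))]
binary = SVC(probability=True).fit(X, y)   # probability=True enables ROC-style scores
```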
Your SVM classifier is trained to tell whether a new (unknown) instance is or is not an instance of I. setosa. In other words, you are predicting whether the new, unlabeled instance is I. setosa or not.
You probably found incorrectly classified results because your training data has many more instances of the positive case than of the negative one. Also, it's common to have some error margin.
Summarizing: your SVM classifier learned how to identify I. setosa instances; however, it was provided with too few examples of non-I. setosa instances, which is likely to give you a biased model.