DL4J Prediction Formatting - deeplearning4j

I have two questions on deeplearning4j that are somewhat related.
When I execute “INDArray predicted = model.output(features,false);” to generate a prediction, I get the label predicted by the model; it is either 0 or 1. I tried to search for a way to have a probability (value between 0 and 1) instead of strictly 0 or 1. This is useful when you need to set a threshold for what your model should consider as a 0 and what it should consider as a 1. For example, you may want your model to output '1' for any prediction that is higher than or equal to 0.9 and output '0' otherwise.
My second question is that I am not sure why the output is represented as a two-dimensional array (shown after the code below) even though there are only two possibilities, so it would be better to represent it with one value - especially if we want it as a probability (question #1) which is one value.
PS: in case relevant to the question, in the Schema the output column is defined using ".addColumnInteger". Below are snippets of the code used.
Part of the code:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.iterations(1)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.learningRate(learningRate)
.updater(org.deeplearning4j.nn.conf.Updater.NESTEROVS).momentum(0.9)
.list()
.layer(0, new DenseLayer.Builder()
.nIn(numInputs)
.nOut(numHiddenNodes)
.weightInit(WeightInit.XAVIER)
.activation("relu")
.build())
.layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.weightInit(WeightInit.XAVIER)
.activation("softmax")
.weightInit(WeightInit.XAVIER)
.nIn(numHiddenNodes)
.nOut(numOutputs)
.build()
)
.pretrain(false).backprop(true).build();
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
model.setListeners(new ScoreIterationListener(10));
for (int n=0; n<nEpochs; n++) {
model.fit(trainIter);
}
Evaluation eval = new Evaluation(numOutputs);
while (testIter.hasNext()){
DataSet t = testIter.next();
INDArray features = t.getFeatureMatrix();
System.out.println("Input features: " + features);
INDArray labels = t.getLabels();
INDArray predicted = model.output(features,false);
System.out.println("Predicted output: "+ predicted);
System.out.println("Desired output: "+ labels);
eval.eval(labels, predicted);
System.out.println();
}
System.out.println(eval.stats());
Output from running the code above:
Input features: [0.10, 0.34, 1.00, 0.00, 1.00]
Predicted output: [1.00, 0.00]
Desired output: [1.00, 0.00]
*What I want the output to look like (i.e. a one-value probability):**
Input features: [0.10, 0.34, 1.00, 0.00, 1.00]
Predicted output: 0.14
Desired output: 0.0

I will answer your questions inline but I just want to note:
I would suggest taking a look at our docs and examples:
https://github.com/deeplearning4j/dl4j-examples
http://deeplearning4j.org/quickstart
A 100% 0 or 1 is just a badly tuned neural net. That's not at all how things work. A softmax by default returns probabilities. Your neural net is just badly tuned. Look at updating dl4j too. I'm not sure what version you're on but we haven't used strings in activations for at least a year now? You seem to have skipped a lot of steps when starting with us. I'll reiterate again, at least take a look above for a starting point rather than using year old code.
What you're seeing there is just standard deep learning 101. So the advice I'm about to give you can be found on the internet and is applicable for any deep learning software. A two label softmax sums each row to 1. If you want 1 label, use sigmoid with 1 output and a different loss function. We use softmax because it can work for any number of ouputs and all you have to do is change the number of outputs rather than having to change the loss function and activation function on top of that.

Related

Where is perplexity calculated in the Huggingface gpt2 language model code?

I see some github comments saying the output of the model() call's loss is in the form of perplexity:
https://github.com/huggingface/transformers/issues/473
But when I look at the relevant code...
https://huggingface.co/transformers/_modules/transformers/modeling_openai.html#OpenAIGPTLMHeadModel.forward
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
outputs = (loss,) + outputs
return outputs # (loss), lm_logits, (all hidden states), (all attentions)
I see cross entropy being calculated, but no transformation into perplexity. Where does the loss finally get transformed? Or is there a transformation already there that I'm not understanding?
Ah ok, I found the answer. The code is actually returning cross entropy. In the github comment where they say it is perplexity...they are saying that because the OP does
return math.exp(loss)
which transforms entropy to perplexity :)
No latex no problem. By definition the perplexity (triple P) is:
PP(p) = e^(H(p))
Where H stands for chaos (Ancient Greek: χάος) or entropy. In general case we have the cross entropy:
PP(p) = e^(H(p,q))
e is the natural base of the logarithm which is how PyTorch prefers to compute the entropy and cross entropy.

How to apply different cost functions to different output channels of a convolutional network?

I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply sigmoid activation function to the first two channels and then use BCECriterion to computer the loss of the produced images with the ground truth ones. I want to apply squared loss function to the last two channels and finally computer the gradients and do backprop. I would also like to multiply the cost of the squared loss for each of the two last channels by a desired scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follow:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1) + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2) + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it since I will have to do 3 separate backprop steps, one for each of the cost terms. So I wonder, can anyone give me a better solution to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
Your current output layer is followed by nn.SplitTable that splits your output channels and converts your output tensor into a table. You can also combine different functions by using ParallelCriterion so that each criterion is applied on the corresponding entry of output table.
For details, I suggest you read documentation of Torch about tables.
After comments, I added the following code segment solving the original question.
M = 100
C = 4
H = 64
W = 64
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension.
layerOfTables:add(nn.Reshape(M,C,1,H,W))
-- We want to split over the second dimension (i.e. channels).
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable in order to create paths accessing to the data for
-- numereous number of criterions. Each branch from the ConcatTable will
-- have access to the data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a 2 dimensional channel, we need to combine
-- then by using JoinTable. Without JoinTable, the output will be again a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is simplified version of NarrowTable, and it fetches the desired element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use this as if it is a regular
-- criterion function (Please see the examples on documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion())
criterionContainer:add(nn.MSECriterion())
Since I used almost every possible table operation, it looks a little bit nasty. However, this is the only way I could solve this problem. I hope that it helps you and others suffering from the same problem. This is how the result looks like:
dataOut = layerOfTables:forward(dataIn)
print(dataOut)
{
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
}

Add Bias to a Matrix in Torch

In Torch, how do I add a bias vector to each input to when i have a batch input? Suppose I have an input 3*2 matrix (where 2 = number of classes)
A
0.8191 0.2630
0.5344 0.4537
0.7380 0.5885
and I want to add the bias value to each element in the output matrix:
BIAS:
0.6588 0.6525
My final output should look like:
1.4779 0.9155
1.1931 1.1063
1.3967 1.2410
I am new to Torch and just figuring out the Syntax.
You can expand the bias to have the same dimensions as your input:
expandedBias=torch.expand(BIAS,3,2)
yields:
th> expandedBias
0.6588 0.6525
0.6588 0.6525
0.6588 0.6525
After that you can simply add them:
output=A+expandedBias
to give:
th> A+expandedBias
1.4779 0.9155
1.1931 1.1063
1.3967 1.2410
If you are using one of the latest versions of the torch, you do not even need to expand the bias.
you can directly write.
output = A + bias
The bias matrix will be broadcasted automatically. Check the documentation for the details of broadcasting.
https://pytorch.org/docs/stable/notes/broadcasting.html

sklearn Logistic Regression probability

I have a dataset that determines whether a student will be admitted given two scores. I train my model with this data and can determine if a student will be admitted or not using the following code:
model.predict([score1, score2])
This results in the answer:
[1]
How can I get the probability of that? If I try predict_proba, I get:
model.predict_proba([score1, score2])
>>[[ 0.38537034 0.61462966]]
I'd really like to see something like:
>> [0.75]
to indicate that P(admittance | score1, score2) = 0.75
You may notice that 0.38537034+ 0.61462966 = 1. This is because you are getting the probabilities for both classes (admitted and not admitted) from the output of predict_proba. If you had 7 classes, you would instead get something like [[p1, p2, p3, p4, p5, p6, p7]] where p1+p2+p3+p4+p5+p6+p7 = 1 and pi >= 0. So if you want the probability of output i, you go index into your result and look at what pi is. Thats just how that works.
So if you had something where the probability was 0.75 of being not admitted, you would get a result that looks like [[0.25, 0.75]].
(I may have reversed the ordering you used in your code for admitted/not admitted, but it doesn't matter - that just changes the index you look at).
If you want to sklearn's Lr model and you want to get the 2 classes' predicted probability, you should use this:
model.predict_proba(xtest)
You will get the array of two classes prob(shape N*2).

Issue in training hidden markov model and usage for classification

I am having a tough time in figuring out how to use Kevin Murphy's
HMM toolbox Toolbox. It would be a great help if anyone who has an experience with it could clarify some conceptual questions. I have somehow understood the theory behind HMM but it's confusing how to actually implement it and mention all the parameter setting.
There are 2 classes so we need 2 HMMs.
Let say the training vectors are :class1 O1={ 4 3 5 1 2} and class O_2={ 1 4 3 2 4}.
Now,the system has to classify an unknown sequence O3={1 3 2 4 4} as either class1 or class2.
What is going to go in obsmat0 and obsmat1?
How to specify/syntax for the transition probability transmat0 and transmat1?
what is the variable data going to be in this case?
Would number of states Q=5 since there are five unique numbers/symbols used?
Number of output symbols=5 ?
How do I mention the transition probabilities transmat0 and transmat1?
Instead of answering each individual question, let me illustrate how to use the HMM toolbox with an example -- the weather example which is usually used when introducing hidden markov models.
Basically the states of the model are the three possible types of weather: sunny, rainy and foggy. At any given day, we assume the weather can be only one of these values. Thus the set of HMM states are:
S = {sunny, rainy, foggy}
However in this example, we can't observe the weather directly (apparently we are locked in the basement!). Instead the only evidence we have is whether the person who checks on you every day is carrying an umbrella or not. In HMM terminology, these are the discrete observations:
x = {umbrella, no umbrella}
The HMM model is characterized by three things:
The prior probabilities: vector of probabilities of being in the first state of a sequence.
The transition prob: matrix describing the probabilities of going from one state of weather to another.
The emission prob: matrix describing the probabilities of observing an output (umbrella or not) given a state (weather).
Next we are either given the these probabilities, or we have to learn them from a training set. Once that's done, we can do reasoning like computing likelihood of an observation sequence with respect to an HMM model (or a bunch of models, and pick the most likely one)...
1) known model parameters
Here is a sample code that shows how to fill existing probabilities to build the model:
Q = 3; %# number of states (sun,rain,fog)
O = 2; %# number of discrete observations (umbrella, no umbrella)
%# prior probabilities
prior = [1 0 0];
%# state transition matrix (1: sun, 2: rain, 3:fog)
A = [0.8 0.05 0.15; 0.2 0.6 0.2; 0.2 0.3 0.5];
%# observation emission matrix (1: umbrella, 2: no umbrella)
B = [0.1 0.9; 0.8 0.2; 0.3 0.7];
Then we can sample a bunch of sequences from this model:
num = 20; %# 20 sequences
T = 10; %# each of length 10 (days)
[seqs,states] = dhmm_sample(prior, A, B, num, T);
for example, the 5th example was:
>> seqs(5,:) %# observation sequence
ans =
2 2 1 2 1 1 1 2 2 2
>> states(5,:) %# hidden states sequence
ans =
1 1 1 3 2 2 2 1 1 1
we can evaluate the log-likelihood of the sequence:
dhmm_logprob(seqs(5,:), prior, A, B)
dhmm_logprob_path(prior, A, B, states(5,:))
or compute the Viterbi path (most probable state sequence):
vPath = viterbi_path(prior, A, multinomial_prob(seqs(5,:),B))
2) unknown model parameters
Training is performed using the EM algorithm, and is best done with a set of observation sequences.
Continuing on the same example, we can use the generated data above to train a new model and compare it to the original:
%# we start with a randomly initialized model
prior_hat = normalise(rand(Q,1));
A_hat = mk_stochastic(rand(Q,Q));
B_hat = mk_stochastic(rand(Q,O));
%# learn from data by performing many iterations of EM
[LL,prior_hat,A_hat,B_hat] = dhmm_em(seqs, prior_hat,A_hat,B_hat, 'max_iter',50);
%# plot learning curve
plot(LL), xlabel('iterations'), ylabel('log likelihood'), grid on
Keep in mind that the states order don't have to match. That's why we need to permute the states before comparing the two models. In this example, the trained model looks close to the original one:
>> p = [2 3 1]; %# states permutation
>> prior, prior_hat(p)
prior =
1 0 0
ans =
0.97401
7.5499e-005
0.02591
>> A, A_hat(p,p)
A =
0.8 0.05 0.15
0.2 0.6 0.2
0.2 0.3 0.5
ans =
0.75967 0.05898 0.18135
0.037482 0.77118 0.19134
0.22003 0.53381 0.24616
>> B, B_hat(p,[1 2])
B =
0.1 0.9
0.8 0.2
0.3 0.7
ans =
0.11237 0.88763
0.72839 0.27161
0.25889 0.74111
There are more things you can do with hidden markov models such as classification or pattern recognition. You would have different sets of obervation sequences belonging to different classes. You start by training a model for each set. Then given a new observation sequence, you could classify it by computing its likelihood with respect to each model, and predict the model with the highest log-likelihood.
argmax[ log P(X|model_i) ] over all model_i
I do not use the toolbox that you mention, but I do use HTK. There is a book that describes the function of HTK very clearly, available for free
http://htk.eng.cam.ac.uk/docs/docs.shtml
The introductory chapters might help you understanding.
I can have a quick attempt at answering #4 on your list. . .
The number of emitting states is linked to the length and complexity of your feature vectors. However, it certainly does not have to equal the length of the array of feature vectors, as each emitting state can have a transition probability of going back into itself or even back to a previous state depending on the architecture. I'm also not sure if the value that you give includes the non-emitting states at the start and the end of the hmm, but these need to be considered also. Choosing the number of states often comes down to trial and error.
Good luck!

Resources