What do the values in model.predict() mean? - machine-learning

I ran a CNN for a digit recognizer I found online, and when I tested the trained model on a single image, my result was this:
[[1.4540684e-13 5.0518111e-11 1.5008972e-14 3.2845630e-05 2.6759880e-16
9.9996710e-01 2.3198607e-10 6.3636325e-14 9.3972552e-10 1.5128480e-10]]
Can anyone explain what these values mean or how I should map them to a predicted digit?

It depends on how the creator of the CNN model designed it but for the outputs you have, they are likely softmax probabilities associated with each of the 10 classes/digits. Since the 6th value (9.9996710e-01) is nearly equal to 1, the network is indicating that it is highly likely that the input corresponds to the 6th class (which would likely be the digit "5" or "6", depending on whether the classes are counted starting from "0" or "1").
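To make that reading concrete, here is a minimal sketch in plain Python. It assumes the ten outputs really are softmax probabilities and that index 0 corresponds to the first class; the question confirms neither, so treat both as assumptions. The variable names are only illustrative.

```python
# Output copied from the question: one row of 10 values for a single image.
prediction = [[1.4540684e-13, 5.0518111e-11, 1.5008972e-14, 3.2845630e-05, 2.6759880e-16,
               9.9996710e-01, 2.3198607e-10, 6.3636325e-14, 9.3972552e-10, 1.5128480e-10]]

probs = prediction[0]

# If these are softmax probabilities, they should sum to approximately 1.
total = sum(probs)

# The predicted class is the 0-based index of the largest value.
predicted_index = max(range(len(probs)), key=lambda i: probs[i])
confidence = probs[predicted_index]

print(f"sum of outputs: {total:.6f}")
print(f"predicted index: {predicted_index} (confidence {confidence:.4f})")
```

If the model's classes are the digits 0 to 9 in order, the predicted index is the digit itself. With NumPy, the same lookup is usually written as np.argmax(model.predict(x), axis=1).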

Related

Estimating both the category and the magnitude of output using neural networks

Let's say I want to predict which courses a final-year student will take and which grades they will receive in those courses. We have data on previous students' courses and grades for each year (not just the final year) to train with. We also have the grades and courses from previous years for the students whose results we want to estimate. I want to use a recurrent neural network with long short-term memory to solve this problem. (I know this problem can be solved by regression, but I specifically want to see whether a neural network can solve it properly.)
The way I want to set up the output (label) space is to have a feature for each of the possible courses a student can take, with a value between 0 and 1 in each entry that describes whether the student will attend the class (if not, the entry for that course would be 0) and, if so, what their mark would be (i.e. if the student attends class A and gets 57%, then the label for class A will be 0.57).
Am I setting the output space properly?
If yes, which optimization and activation functions should I use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want the network to be given the history of a student and then to output one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking the course, 1 for taking the course) and also give the expected grade? Then the interpretation of the output for a single course would be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise rethinking it.
Some obviously realistic cases do not fit into this pattern. For example, how would you represent that an A+ student is unlikely to take a course? Should the network output 0.9999, because the student is very likely to get the maximum number of points if they take the course, or should it output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values in [0, 1] for each student and each course.
The first value in [0, 1] gives the probability that the student will participate in the course.
The second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value, and simple square error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, square and add for p=2).
A few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
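The proposed loss can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the per-course terms are simply added up (the p=1 case), and the grade term is skipped when the student did not take the course, which is my own choice and not something the answer specifies.

```python
import math

def binary_cross_entropy(p_take, took_course, eps=1e-7):
    """Cross-entropy between predicted participation probability and a 0/1 label."""
    p = min(max(p_take, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(took_course * math.log(p) + (1 - took_course) * math.log(1.0 - p))

def squared_error(predicted_grade, true_grade):
    return (predicted_grade - true_grade) ** 2

def course_loss(prediction, target):
    """prediction = (p_take, grade); target = (took_course, grade or None)."""
    p_take, predicted_grade = prediction
    took_course, true_grade = target
    loss = binary_cross_entropy(p_take, took_course)
    # Assumption: a grade only exists when the course was taken,
    # so the regression term is masked out otherwise.
    if took_course:
        loss += squared_error(predicted_grade, true_grade)
    return loss

def total_loss(predictions, targets):
    """Sum the per-course losses (the p=1 combination)."""
    return sum(course_loss(p, t) for p, t in zip(predictions, targets))

# The three example outputs from above, paired with made-up targets.
predictions = [(0.01, 1.0), (0.5, 0.8), (0.999, 0.15)]
targets = [(0, None), (1, 0.7), (1, 0.15)]
print(total_loss(predictions, targets))
```

In a real Keras model the same idea would be expressed with two output heads and a weighted sum of their losses; the weights between the two terms are a tuning choice.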
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
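As a rough sketch of the "softmax for every group of 6 outputs" idea, here is the output stage for a single student in plain Python. The logits are made up, the helper names are hypothetical, and summing the per-course cross-entropies is one reasonable reading of "sum into a single loss value".

```python
import math

RESULTS = ['A', 'B', 'C', 'D', 'F', 'did_not_take']

def softmax(logits):
    """Numerically stable softmax over one group of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_softmax(flat_logits, group_size=len(RESULTS)):
    """Split a flat vector of n*6 logits into n groups and softmax each group."""
    return [softmax(flat_logits[i:i + group_size])
            for i in range(0, len(flat_logits), group_size)]

def grouped_cross_entropy(flat_logits, true_indices):
    """Sum of per-course cross-entropy losses for one student."""
    groups = grouped_softmax(flat_logits)
    return sum(-math.log(group[t]) for group, t in zip(groups, true_indices))

# Made-up raw outputs for one student and n = 2 courses (2 * 6 = 12 values).
flat_logits = [2.0, 0.5, 0.1, -1.0, -2.0, 0.0,
               -1.0, -1.0, -1.0, -1.0, -1.0, 3.0]

probs = grouped_softmax(flat_logits)
predicted = [RESULTS[max(range(len(g)), key=lambda i: g[i])] for g in probs]
loss = grouped_cross_entropy(flat_logits, [0, 5])  # true results: 'A', 'did_not_take'

print(predicted, loss)
```

Each group of six probabilities sums to 1 independently, so every course gets its own distribution over the six possible results.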
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example, if a student took French I the previous year, it is more likely that he/she will take French II this year than if he/she took French in freshman year and did not continue with French after that.

Keras LSTM: Injecting already-known *future* values into prediction

I've built an LSTM in Keras with the goal of predicting future values of a time series from a high-dimensional, time-indexed input.
However, there's a unique requirement: for certain time points in the future, we know with certainty what some values of the input series will be. For example:
model = SomeLSTM()
trained_model = model.train(train_data)
known_data = [(24, {'feature': 2, 'val': 7.0}), (25, {'feature': 2, 'val': 8.0})]
predictions = trained_model(look_ahead=48, known_data=known_data)
This would train the model up to time t (the end of training) and predict forward 48 time periods from time t, but substitute the known_data values for feature 2 at times 24 and 25.
How exactly can I explicitly inject this into the LSTM at some time?
For reference, here's the model:
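from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

# hidden, look_back, num_features, dropout and look_ahead are defined elsewhere
model = Sequential()
model.add(LSTM(hidden, input_shape=(look_back, num_features)))
model.add(Dropout(dropout))
model.add(Dense(look_ahead))
model.add(Activation('linear'))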
This may be a result of my un-intuitive grasp of LSTMs, and I'd appreciate any clarification. I've dived into the Keras source code, and my first guess is to inject it right into the LSTM state variable, but I'm unsure how to do that at time t (or even if that is correct.)
I think a clean way of doing this is to introduce 2*look_ahead new features, where for each 0 <= i < look_ahead, the (2*i)-th feature is an indicator of whether the value at the i-th time step is known, and the (2*i+1)-th feature is the value itself (0 if not known). Accordingly, you can generate training data with these features to make your model take these known values into account.
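A minimal sketch of that encoding in plain Python is below. It handles one known input series (feature 2 in the question) and assumes that the times 24 and 25 are 0-based offsets into the look-ahead window; the function name and the dict format are hypothetical.

```python
def known_future_features(look_ahead, known):
    """Build the 2*look_ahead extra features described above.

    known maps a future step i (0 <= i < look_ahead) to its known value.
    Feature 2*i is the "is known" indicator, feature 2*i + 1 is the value
    (0.0 when the value is not known).
    """
    features = [0.0] * (2 * look_ahead)
    for step, value in known.items():
        if not 0 <= step < look_ahead:
            raise ValueError(f"step {step} is outside the look-ahead window")
        features[2 * step] = 1.0
        features[2 * step + 1] = float(value)
    return features

# Known values from the question: 7.0 at step 24 and 8.0 at step 25.
extra = known_future_features(look_ahead=48, known={24: 7.0, 25: 8.0})
print(len(extra), extra[48:52])
```

How these extra features reach the network is a separate design choice that the answer leaves open. One option is to repeat them at every input time step, another is to concatenate them with the LSTM output before the final Dense layer.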
I am not exactly sure what you are trying to do, but maybe create your own layer to go at the end that sets the data to the known values, similar to how dropout sets random values to zero. As a side note, I have had better results with pooling than dropout, so maybe try switching that out and training it. Here is a good guide on how to do it. https://www.tutorialspoint.com/keras/keras_customized_layer.htm

How to generate the predicted label in caffe from the output of the last layer?

I have trained LeNet on my own dataset of images (11x27 traffic light images), using caffe and the DIGITS interface. I get 99% accuracy, and when I give new images via DIGITS, it predicts the correct label, so the network seems to work very well.
However, I struggle to predict the labels through Python/Matlab API for caffe. The last layer output (ip2) is a vector with 2 elements (I have 2 classes), which looks like [4.8060, -5.2608] for example (the first component is always positive, the second always negative and the absolute values range from 4 to 20). I know it from many tests in Python, Matlab and DIGITS.
My problem is:
Argmax can't work directly on this layer (it always gives 0).
If I use a softmax function, it always gives me [1, 0] (and that's actually the value of net.blobs['prob'] or out['prob'] in the Python interface, no matter the class of my image).
So, how can I get the correct predicted label?
Thanks!

How to decide numClasses parameter to be passed to Random Forest algorithm in Spark MLlib with PySpark

I am working on classification using the Random Forest algorithm in Spark and have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as the label and the rest serve as features. But I want to treat the label as a category and not a number, so 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as the label into categorical data. What I am having difficulty with is deciding what I should pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has about 10000 unique values, so if I pass 10000 as numClasses, wouldn't that decrease performance dramatically?
Here is the typical signature for building a Random Forest model in MLlib:
from pyspark.mllib.tree import RandomForest

model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something you should not do. Your problem is clearly a regression/ranking problem, not a classification problem. Why would you think of it as classification? Try to answer these two questions:
Do you have at least 100 samples for each value (10,000 * 100 = 1,000,000)?
Is there no structure at all in the classes? For example, are objects with value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered yes to both, then the answer is: "yes, you should encode each distinct value as a different class", which gives 10000 unique classes and leads to:
extremely imbalanced classification (RF without a balancing meta-learner will nearly always fail in such a scenario)
an extreme number of classes (there are no models able to solve it; RF will certainly not solve it)
an extremely low-dimensional problem: given how few features you have, I would be surprised if you could predict even a binary classification from them. You can see how irregular these values are: you have 3 points that differ only in the first value, yet they have completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem. You should either:
regress on the last value (keyword: regression)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)
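For the bucketing option, here is a small sketch in plain Python using the ten label values from the question. Quantile-based bucket edges and the choice of 3 buckets are assumptions made for illustration; the answer only says to use at most 10 buckets and does not say how to choose them.

```python
import bisect

# Label column from the sample rows in the question.
labels = [352.888890, 495.8001345, -495.8001345, 165.22352099, 495.8,
          652.8, 495.8, 495.8001345, -552.8234, 7000.0]

def quantile_edges(values, num_buckets):
    """Pick num_buckets - 1 cut points so buckets hold roughly equal counts."""
    ordered = sorted(values)
    return [ordered[k * len(ordered) // num_buckets] for k in range(1, num_buckets)]

def bucket_of(value, edges):
    """Return the 0-based bucket index for value given sorted cut points."""
    return bisect.bisect_right(edges, value)

num_buckets = 3  # illustrative; anything up to about 10 fits the advice above
edges = quantile_edges(labels, num_buckets)
class_labels = [bucket_of(v, edges) for v in labels]

print(edges)
print(class_labels)
```

With labels encoded this way, numClasses would be num_buckets, since MLlib's trainClassifier expects labels in the range 0 to numClasses - 1.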

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying an arbitrary formula.
I initially tried to replicate the Fibonacci sequence, using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize the data that could be potentially infinite or scale exponentially? I then tried to create an ANN to calculate the slope-intercept formula y = mx+b (2x+2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data how would the network be able to calculate or classify with inputs outside of what was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a+2b+c^2+3d-5e) modulo 2), where the formula is unknown but (some of) the inputs a, b, c, d, and e are given along with the output? Essentially, it would classify whether the calculation's output is odd or even, with inputs anywhere between -infinity and +infinity...
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
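The parity flipping is easy to check directly. The short Python sketch below evaluates the formula from the question at an arbitrary starting point (the particular values are made up) and reports which inputs flip the result when increased by 1.

```python
def formula_parity(a, b, c, d, e):
    """Parity (0 = even, 1 = odd) of the formula from the question."""
    return (a + 2 * b + c ** 2 + 3 * d - 5 * e) % 2

# An arbitrary integer starting point, chosen only for illustration.
base = {'a': 4, 'b': -3, 'c': 7, 'd': 0, 'e': 2}

flips = {}
for name in base:
    bumped = dict(base)
    bumped[name] += 1  # move one step along this input only
    flips[name] = formula_parity(**base) != formula_parity(**bumped)

print(flips)
```

Every input except b flips the output on each unit step (b is multiplied by 2, so it never changes the parity), which means even and odd points are interleaved like a checkerboard instead of forming separable regions.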
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.
