Same vs Different Target Values for each sample for Regression in Machine Learning - machine-learning

I am a newbie in machine learning and am learning the basic concepts of regression. My confusion is best explained with an example of input samples and their target values. (Please note that this example is a simplified, general case; I observed the performance and predicted values on a large custom dataset of images. Also note that the target values are not floats.) For example, I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
and
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
As you can see, every three samples (every two in the test set) share the same target value. Suppose I have a multi-layer perceptron network with one Flatten() and two Dense() layers. After training, the network predicts the same target value for all test samples:
yPredicted = [40, 40, 40, 40]
Because the predicted values are all the same, the correlation between ytest and yPredicted is undefined; it comes back as null and gives an error.
But when I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [332, 433, 456, 675, 234, 879, 242, 634, 789, 432, 897, 982]
And:
xtest = [13, 14, 15, 16]
ytest = [985, 341, 354, 326]
The predicted values are:
yPredicted = [987, 345, 435, 232]
which gives very good correlations.
My question is: what is it in a machine learning algorithm that makes learning better when each input has a different target value? Why does the network not work when having repeated values for a large number of inputs?

Why the network does not work when having repeated values for a large number of inputs?
Most certainly, this is not the reason why your network does not perform well in the first dataset shown.
(You have not provided any code, so inevitably this will be a qualitative answer)
Looking closely at your first dataset:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
it's not difficult to conclude that we have a monotonic (increasing) function y(x) (it is not strictly monotonic, but it is monotonic nevertheless over the whole x range provided).
Given that, your model has absolutely no way of "knowing" that, for x > 12, the qualitative nature of the function changes significantly (and rather abruptly), as apparent from your test set:
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
and you should not expect it to know or "guess" it in any way (despite what many people may seem to believe, NN are not magic).
Looking closely at your second dataset, you will realize that this is not the case with it, hence the network is unsurprisingly able to perform better here; when doing such experiments, it is very important to be sure that we are comparing apples to apples, and not apples to oranges.
Another general issue with your attempts here and your question is the following: neural nets are not good at extrapolation, i.e. predicting such numerical functions outside the numeric domain on which they have been trained. For details, please see my own answer at Is deep learning bad at fitting simple non linear functions outside training scope?
A last unusual thing here is your use of correlation; I am not sure why you chose to do this, but you may be interested to know that, in practice, it is uncommon to assess model performance using a correlation measure between predicted outcomes and ground truth; for regression problems such as yours, we use measures such as the mean squared error (MSE) instead.
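To make that last point concrete, here is a minimal sketch that scores the constant predictions from the first dataset. It assumes Python with NumPy, which the question does not confirm, and the variable names are only illustrative. The MSE is a perfectly ordinary number, whereas the Pearson correlation has a zero standard deviation in its denominator and is therefore undefined.

```python
import numpy as np

# Test targets and the constant predictions quoted in the question.
y_test = np.array([25, 25, 35, 35], dtype=float)
y_pred = np.array([40, 40, 40, 40], dtype=float)

# Mean squared error is defined for any pair of equal-length vectors.
mse = float(np.mean((y_test - y_pred) ** 2))

# Pearson correlation divides by std(y_test) * std(y_pred). A constant
# prediction vector has zero standard deviation, so the ratio is 0 / 0.
pred_std = float(np.std(y_pred))
correlation_is_defined = pred_std > 0.0

print(mse, pred_std, correlation_is_defined)
```

So the "null" correlation is a symptom of the constant output, not a separate problem; the MSE of 125 still tells you how far off the predictions are.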

Related

Error: "DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")" using Flux in Julia

I'm still new to Julia and to machine learning in general, but I'm quite eager to learn. In the project I'm currently working on I have a dimension mismatch problem, and I can't figure out what to do.
I have two arrays as follows:
x_array:
9-element Array{Array{Int64,N} where N,1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73]
[11, 12, 13, 14, 15, 16, 17, 72, 73]
[18, 12, 19, 20, 21, 22, 72, 74]
[23, 24, 12, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 72, 74]
[36, 37, 38, 39, 40, 38, 41, 42, 72, 73]
[43, 44, 45, 46, 47, 48, 72, 74]
[49, 50, 51, 52, 14, 53, 72, 74]
[54, 55, 41, 56, 57, 58, 59, 60, 61, 62, 63, 62, 64, 72, 74]
[65, 66, 67, 68, 32, 69, 70, 71, 72, 74]
y_array:
9-element Array{Int64,1}
75
76
77
78
79
80
81
82
83
and the next model using Flux:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
softmax
)
I zip both arrays, and then feed them into the model using Flux.train!
data = zip(x_array, y_array)
Flux.train!(loss, Flux.params(model), data, opt)
and it immediately throws the following error:
ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")
Now, I know that the first dimension of matrix A is the sum of the hidden layers (256 + 256 + 128 + 128 + 128 + 128) and the second dimension is the input layer, which is 10. The first thing I did was change the 10 to a 9, but then it only throws the error:
ERROR: DimensionMismatch("dimensions must match")
Can someone explain to me what dimensions are the ones that mismatch, and how to make them match?
Introduction
First off, you should know that from an architectural standpoint, you are asking something very difficult from your network; softmax re-normalizes outputs to be between 0 and 1 (weighted like a probability distribution), which means that asking your network to output values like 77 to match y will be impossible. That's not what is causing the dimension mismatch, but it's something to be aware of. I'm going to drop the softmax() at the end to give the network a fighting chance, especially since it's not what's causing the problem.
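As a quick illustration of why softmax cannot produce a value like 77, here is a small sketch in Python with NumPy (not the Julia/Flux code from the question). The raw scores are arbitrary example values and `softmax` is a hand-written helper, not a library call.

```python
import numpy as np

def softmax(scores):
    """Exponentiate and normalize so the outputs sum to 1."""
    scores = np.asarray(scores, dtype=float)
    exps = np.exp(scores - scores.max())  # shift for numerical stability
    return exps / exps.sum()

# Arbitrary raw network outputs, chosen only for illustration.
raw_scores = [-2.0, 5.0, 0.101]
probs = softmax(raw_scores)

# Every entry lies strictly between 0 and 1 and the entries sum to 1,
# so no entry can ever get close to a target such as 77.
largest_output = float(probs.max())
total = float(probs.sum())
print(probs, largest_output, total)
```

Whatever the raw scores are, the largest output stays below 1, which is why a squared-error loss against targets in the 75 to 83 range can never get small with softmax in place.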
Debugging shape mismatches
Let's walk through what actually happens inside of Flux.train!(). The definition is actually surprisingly simple. Ignoring everything that doesn't matter to us, we are left with:
for d in data
gs = gradient(ps) do
loss(d...)
end
end
Therefore, let's start by pulling the first element out of your data, and splatting it into your loss function. You didn't specify your loss function or optimizer in the question. Although softmax usually means you should use crossentropy loss, your y values are very much not probabilities, and so if we drop the softmax we can just use the dead-simple mse() loss. For optimizer, we'll default to good old ADAM:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
#softmax, # commented out for now
)
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM(0.001)
data = zip(x_array, y_array)
Now, to simulate the first run of Flux.train!(), we take first(data) and splat that into loss():
loss(first(data)...)
This gives us a variant of the error message you've seen before: ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 12"). Looking at our data, we see that yes, indeed, the first element of our dataset has a length of 12. And so we will change our model to expect 12 values instead of 10:
model = Chain(
LSTM(12, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
)
And now we re-run:
julia> loss(first(data)...)
50595.52542674723 (tracked)
Huzzah! It worked! We can run this again:
julia> loss(first(data)...)
50578.01417593167 (tracked)
The value changes because the RNN holds memory within itself which gets updated each time we run the network, otherwise we would expect the network to give the same answer for the same inputs!
The problem comes, however, when we try to run the second training instance through our network:
julia> loss([d for d in data][2]...)
ERROR: DimensionMismatch("matrix A has dimensions (1024,12), vector B has length 9")
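The same kind of failure can be reproduced outside Flux. The sketch below uses Python with NumPy as a stand-in and assumes that the (1024, 12) matrix is the input weight matrix of the first LSTM, with the four gates of a 256-unit cell stacked row-wise (4 × 256 = 1024); that reading of the error message is my assumption, and `try_forward` is a hypothetical helper.

```python
import numpy as np

HIDDEN_UNITS = 256
# Stand-in for the first layer's input weights: four gates stacked
# row-wise, each expecting an input vector of length 12.
W_input = np.zeros((4 * HIDDEN_UNITS, 12))

def try_forward(weights, x):
    """Return the output shape, or None if the shapes are incompatible."""
    try:
        return (weights @ np.asarray(x, dtype=float)).shape
    except ValueError:
        return None

first_sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73]   # length 12
second_sample = [11, 12, 13, 14, 15, 16, 17, 72, 73]     # length 9

ok_shape = try_forward(W_input, first_sample)
bad_shape = try_forward(W_input, second_sample)
print(ok_shape, bad_shape)
```

A fixed weight matrix can only multiply vectors of one length, so no single choice of input size will fit all nine samples.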
Understanding LSTMs
This is where we run into Machine Learning problems more than programming problems; the issue here is that we have promised to feed that first LSTM network a vector of length 10 (well, 12 now) and we are breaking that promise. This is a general rule of deep learning; you always have to obey the contracts you sign about the shape of the tensors that are flowing through your model.
Now, the reason you're using LSTMs at all is probably because you want to feed in ragged data, chew it up, then do something with the result. Maybe you're processing sentences, which are all of variable length, and you want to do sentiment analysis, or somesuch. The beauty of recurrent architectures like LSTMs is that they are able to carry information from one execution to another, and they are therefore able to build up an internal representation of a sequence when applied to one time point after another.
When building an LSTM layer in Flux, you are therefore declaring not the length of the sequence you will feed in, but rather the dimensionality of each time point; imagine if you had an accelerometer reading that was 1000 points long and gave you X, Y, Z values at each time point; to read that in, you would create an LSTM that takes in a dimensionality of 3, then feed it 1000 times.
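To illustrate the "dimensionality of each time point" idea without Flux, here is a rough sketch in Python with NumPy. It uses a plain tanh recurrent cell rather than a real LSTM, with random weights and hypothetical names (`encode`, `W_in`, `W_h`); the only point is that the weight shapes depend on the per-timepoint size (1 here), never on the sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_DIM, HIDDEN_DIM = 1, 4   # one number per time point, 4-dim state

# Weights are sized by the per-timepoint dimensionality only.
W_in = rng.normal(size=(HIDDEN_DIM, INPUT_DIM))
W_h = rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM))

def encode(sequence):
    """Feed a sequence one time point at a time; return the final state."""
    h = np.zeros(HIDDEN_DIM)          # fresh state for each sequence
    for value in sequence:
        x_t = np.array([value], dtype=float)
        h = np.tanh(W_in @ x_t + W_h @ h)
    return h

state_a = encode([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73])  # 12 steps
state_b = encode([11, 12, 13, 14, 15, 16, 17, 72, 73])     # 9 steps
print(state_a.shape, state_b.shape)
```

Both sequences end up as a state vector of the same fixed size, which is what lets a later feed-forward layer consume them.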
Writing our own training loop
I find it very instructive to write our own training loop and model execution function so that we have full control over everything. When dealing with time series, it's often easy to get confused about how to call LSTMs and Dense layers and whatnot, so I offer these simple rules of thumb:
When mapping from one time series to another (E.g. constantly predict future motion from previous motion), you can use a single Chain and call it in a loop; for every input time point, you output another.
When mapping from a time series to a single "output" (E.g. reduce sentence to "happy sentiment" or "sad sentiment") you must first chomp all the data up and reduce it to a fixed size; you feed many things in, but at the end, only one comes out.
We're going to re-architect our model into two pieces; first the recurrent "pacman" section, where we chomp up a variable-length time sequence into an internal state vector of pre-determined length, then a feed-forward section that takes that internal state vector and reduces it down to a single output:
pacman = Chain(
LSTM(1, 128), # map from timepoint size 1 to 128
LSTM(128, 256), # blow it up even larger to 256
LSTM(256, 128), # bottleneck back down to 128
)
reducer = Chain(
Dense(128, 9),
#softmax, # keep this commented out for now
)
The reason we split it up into two pieces like this is because the problem statement wants us to reduce a variable-length input series to a single number; we're in the second bullet point above. So our code naturally must take this into account: instead of calling model(x), our loss(x, y) function will do the pacman dance, then call the reducer on the output. Note that we also must reset!() the RNN state so that the internal state is cleared for each independent training example:
function loss(x, y)
    # Reset internal RNN state so that it doesn't "carry over" from
    # the previous invocation of `loss()`.
    Flux.reset!(pacman)
    # Declare `y_hat` in function scope; a variable first assigned inside
    # the `for` body would be local to the loop and undefined after it.
    local y_hat
    # Iterate over every timepoint in `x`
    for x_t in x
        y_hat = pacman(x_t)
    end
    # Take the very last output from the recurrent section, reduce it
    y_hat = reducer(y_hat)
    # Calculate reduced output difference against `y`
    return Flux.mse(y_hat, y)
end
Feeding this into Flux.train!() actually trains, albeit not very well. ;)
Final observations
Although your data is all Int64's, it's pretty typical to use floating point numbers with everything except embeddings (an embedding is a way to take non-numeric data such as characters or words and assign numbers to them, kind of like ASCII); if you're dealing with text, you're almost certainly going to be working with some kind of embedding, and that embedding will dictate what the dimensionality of your first LSTM is, whereupon your inputs will all be "one-hot" encoded.
softmax is used when you want to predict probabilities; it's going to ensure that for each input, the outputs are all between 0 and 1 and moreover that they sum to 1.0, like a good little probability distribution should. This is most useful when doing classification, when you want to wrangle your wild network output values of [-2, 5, 0.101] into something where you can say "we have 99.2% certainty that the second class is correct, and 0.7% certainty it's the third class."
When training these networks, you're often going to want to batch multiple time series at once through your network for hardware efficiency reasons; this is both simple and complex, because on one hand it just means that instead of passing a single Sx1 vector through (where S is the size of your embedding) you're instead going to be passing through an SxN matrix, but it also means that the number of timesteps of everything within your batch must match (because the SxN must remain the same across all timesteps, so if one time series ends before any of the others in your batch you can't just drop it and thereby reduce N halfway through a batch). So what most people do is pad their timeseries all to the same length.
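The padding step mentioned in the last observation can be sketched as follows, in Python with NumPy rather than Julia. `pad_batch`, the pad value 0, and the returned mask are illustrative choices of mine, not something Flux provides under that name.

```python
import numpy as np

def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length sequences into one (N, T) integer array."""
    lengths = np.array([len(seq) for seq in sequences])
    max_len = int(lengths.max())
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    # mask[i, t] is True where batch[i, t] holds real data, not padding.
    mask = np.arange(max_len)[None, :] < lengths[:, None]
    return batch, mask

# The first three sequences from the question.
sequences = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73],
    [11, 12, 13, 14, 15, 16, 17, 72, 73],
    [18, 12, 19, 20, 21, 22, 72, 74],
]
batch, mask = pad_batch(sequences)
print(batch.shape, mask.sum(axis=1))
```

Column `batch[:, t]` then holds time point t for all N series at once, and the mask lets the loss ignore the padded positions.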
Good luck in your ML journey!

LSTM, pattern and noise gap

I want to find sequence patterns in a time series with random noise gaps.
For example, this is the pattern I want to find:
1, 2, 3, 4
But, my samples are:
*1*, 10, *2*, *3*, 11, 12, *4*
*1*, *2*, 10, 14, 15, *3*, 10, 13, *4*
10, *1*, 10, 10, 10, *2*, 11, 12, *3*, *4*
I don't know in advance that the "good" elements are 1, 2, 3 and 4.
I started with an LSTM decoder, but the "noise" hides the good elements. For example, with the 3 samples, I get:
*1*, 10, 13, 10, ...
and 2, 3 and 4 are hidden.
Do you have any idea how to find those patterns?
Thanks.
Frédéric
As a starting point you can use a sequence-to-sequence (seq2seq) model. The linked repo has a nice explanation of how these models work and what types of problems they cover. The crucial point would be how to encode your sequence. Often the items are encoded as one-hot vectors, so if you have a fixed upper bound on the number of distinct numbers/items in your sequence you can use that encoding.
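As a small illustration of the one-hot encoding mentioned above, here is a sketch in Python with NumPy. The upper bound of 16 distinct values and the helper name `one_hot` are assumptions made for the example.

```python
import numpy as np

def one_hot(sequence, num_classes):
    """Encode a sequence of integer ids as rows of a 0/1 matrix."""
    encoded = np.zeros((len(sequence), num_classes), dtype=np.float32)
    encoded[np.arange(len(sequence)), sequence] = 1.0
    return encoded

# First sample from the question; values are assumed to be below 16.
sample = [1, 10, 2, 3, 11, 12, 4]
encoded = one_hot(sample, num_classes=16)
print(encoded.shape)
```

Each time step becomes a vector of length 16 with a single 1, which is the shape a recurrent layer would receive per time step.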
Instead of generating a new sequence without noise from the original one, you can also try to classify each point as noise or not and eliminate those classified as noise as your output. Something along the lines of:
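from keras.layers import Input, LSTM, Dense, TimeDistributed
from keras.models import Model
seq = Input(shape=(timesteps, features))
hidden = LSTM(HIDDEN_UNITS, return_sequences=True)(seq)
out = TimeDistributed(Dense(1, activation='sigmoid'))(hidden)
model = Model(inputs=seq, outputs=out)  # one noise probability per time step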
For training, you will have to know beforehand whether each data point is noise or not.
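A rough sketch of the data handling around such a per-step classifier, in plain Python: building the 0/1 training labels from sequences whose signal values are known, and dropping the points the model flags as noise. Both helper names and the 0.5 threshold are hypothetical, and the probabilities below are made up to stand in for model output.

```python
def noise_labels(sequence, signal_values):
    """Label each point 1 if it is noise, 0 if it belongs to the pattern."""
    return [0 if value in signal_values else 1 for value in sequence]

def drop_noise(sequence, noise_probs, threshold=0.5):
    """Keep only the points whose predicted noise probability is low."""
    return [v for v, p in zip(sequence, noise_probs) if p < threshold]

sample = [1, 10, 2, 3, 11, 12, 4]
labels = noise_labels(sample, signal_values={1, 2, 3, 4})

# Made-up per-step probabilities standing in for the model's output.
fake_probs = [0.1, 0.9, 0.2, 0.1, 0.8, 0.7, 0.3]
cleaned = drop_noise(sample, fake_probs)
print(labels, cleaned)
```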

Would Sklearn GridSearchCV go through all the possible default options of the estimator's parameters?

Algorithms in scikit-learn may have parameters that come with a default set of options, for example:
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
Here the algorithm parameter has the default value "auto", with the following options: algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}
My question is: when using **GridSearchCV** to find the best set of values for the parameters of an algorithm, would GridSearchCV go through all the default options of a parameter even though I don't add it to the parameter_list?
For example, I want to use **GridSearchCV** to find the best parameter values for **kNN**, and I need to examine the n_neighbors and algorithm parameters. Is it enough to pass only the n_neighbors values, as below (because the algorithm parameter has default options),
parameter_list = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
or, I have to specify all the options that I want to examine?
parameter_list = {
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
Thanks.
No, you are misunderstanding the difference between a parameter's default value and its available options.
Looking at the documentation of KNeighborsClassifier, the parameter algorithm is an optional parameter (i.e. you may or may not specify it in the constructor of KNeighborsClassifier).
But if you decide to specify it, then it has these options available: {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}. This means that you can only give algorithm a value from these options and cannot use any other string. The default option is 'auto', which means that if you don't supply any value, it will internally use 'auto'.
Case 1:- KNeighborsClassifier(n_neighbors=3)
Here since no value for algorithm has been specified, so it will by default use algorithm='auto'.
Case 2:- KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
Here as the algorithm has been specified, so it will use 'kd_tree'
Now, GridSearchCV will only pass those parameters to the estimator which are specified in the param_grid. So in your case when you use the first parameter_list from the question, then it will give only n_neighbors to the estimator and algorithm will have only default value ('auto').
If you use the second parameter_list, then both n_neighbors and algorithm will be passed on to the estimator.
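A minimal sketch to check this behaviour yourself, assuming a reasonably recent scikit-learn where GridSearchCV lives in `sklearn.model_selection`. The iris dataset, the short value lists, and `cv=3` are arbitrary choices to keep the run fast.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid over n_neighbors only: `algorithm` is never varied.
narrow = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}, cv=3)
narrow.fit(X, y)

# Grid over both parameters: every combination is tried.
wide = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5], "algorithm": ["ball_tree", "kd_tree", "brute"]},
    cv=3,
)
wide.fit(X, y)

n_narrow = len(narrow.cv_results_["params"])   # candidates actually tried
n_wide = len(wide.cv_results_["params"])
narrow_algorithm = narrow.best_estimator_.get_params()["algorithm"]
print(n_narrow, n_wide, narrow_algorithm)
```

The first search evaluates 3 candidates and leaves algorithm at its default; the second evaluates 3 × 3 = 9 candidates.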

torch7: Setting Variable Learning Rates for Different Conv-net Layers

I am trying to fine-tune a conv-net. It has the following structure (adapted from OverFeat):
net:add(SpatialConvolution(3, 96, 7, 7, 2, 2))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(3, 3, 3, 3))
net:add(SpatialConvolutionMM(96, 256, 7, 7, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(2, 2, 2, 2))
net:add(SpatialConvolutionMM(256, 512, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(512, 512, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(512, 1024, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(1024, 1024, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(3, 3, 3, 3))
net:add(SpatialConvolutionMM(1024, 4096, 5, 5, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(4096, 4096, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(4096, total_classes, 1, 1, 1, 1))
net:add(nn.View(total_classes))
net:add(nn.LogSoftMax())
And I'm using SGD as the optimization method with the following parameters:
optimState = {
learningRate = 1e-3,
weightDecay = 0,
momentum = 0,
learningRateDecay = 1e-7
}
optimMethod = optim.sgd
I am training it as follows:
optimMethod(feval, parameters, optimState)
where:
-- 'feval' is the function with the forward and backward passes on the current batch
parameters,gradParameters = net:getParameters()
From my references, I have learned that while fine-tuning a pre-trained network, it is recommended that the lower (convolutional) layers should have lower learning rates and the higher layers should have relatively higher learning rates.
I referred to torch7's documentation of optim/sgd to set different learning rates for each layer. From there, I gather that by setting config.learningRates, i.e. a vector of individual learning rates, I can achieve what I want. I am new to Torch, so please pardon me if this seems like a silly question, but it would be really helpful if someone could explain to me how and where to create and use this vector to serve my purpose.
Thanks in advance.
I don't know if you still need an answer, as you posted this question one year ago.
Anyway, just in case someone sees this, I've written a post here about how to set different learning rates for different layers in torch.
The solution is to use net:parameters() instead of net:getParameters(). Instead of returning two tensors, it returns two tables of tensors, containing the parameters (and the gradParameters) for each layer in separate tensors.
In this way, you can run an sgd() step (with a different learning rate) for each layer. You can find the full code by clicking the above link.
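The idea of one update per layer can be sketched independently of Torch. The block below is a plain Python/NumPy stand-in, not torch7 code: `params` and `grads` play the role of the two tables returned by net:parameters(), and the learning rates are made-up values with a smaller rate for the lower layer.

```python
import numpy as np

def sgd_step_per_layer(params, grads, learning_rates):
    """Apply a vanilla SGD update to each layer with its own rate."""
    for p, g, lr in zip(params, grads, learning_rates):
        p -= lr * g   # in-place update of this layer's parameters

# Two toy "layers": a lower one (small rate) and a higher one (larger rate).
params = [np.ones((2, 2)), np.ones(3)]
grads = [np.full((2, 2), 2.0), np.full(3, 2.0)]
learning_rates = [1e-4, 1e-2]

sgd_step_per_layer(params, grads, learning_rates)
print(params[0][0, 0], params[1][0])
```

With the same gradient, the lower layer moves by 0.0002 and the higher layer by 0.02, which is the effect per-layer learning rates are meant to have during fine-tuning.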

How to use a Gaussian Process for Binary Classification?

I know that a Gaussian Process model is best suited for regression rather than classification. However, I would still like to apply a Gaussian Process to a classification task but I am not sure what is the best way to bin the predictions generated by the model. I have reviewed the Gaussian Process classification example that is available on the scikit-learn website at:
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
But I found this example confusing (I have listed the things I found confusing about this example at the end of the question). To try and get a better understanding I have created a very basic python code example using scikit-learn that generates classifications by applying a decision boundary to the predictions made by a gaussian process:
#A minimum example illustrating how to use a
#Gaussian Processes for binary classification
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
#Legacy estimator: deprecated in scikit-learn 0.18 and removed in later releases
from sklearn.gaussian_process import GaussianProcess

if __name__ == "__main__":
    #defines some basic training and test data
    #If the descriptive features have large values
    #(i.e., 8s and 9s) the target is 1
    #If the descriptive features have small values
    #(i.e., 2s and 3s) the target is 0
    TRAININPUTS = np.array([[8, 9, 9, 9, 9],
                            [9, 8, 9, 9, 9],
                            [9, 9, 8, 9, 9],
                            [9, 9, 9, 8, 9],
                            [9, 9, 9, 9, 8],
                            [2, 3, 3, 3, 3],
                            [3, 2, 3, 3, 3],
                            [3, 3, 2, 3, 3],
                            [3, 3, 3, 2, 3],
                            [3, 3, 3, 3, 2]])
    TRAINTARGETS = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
    TESTINPUTS = np.array([[8, 8, 9, 9, 9],
                           [9, 9, 8, 8, 9],
                           [3, 3, 3, 3, 3],
                           [3, 2, 3, 2, 3],
                           [3, 2, 2, 3, 2],
                           [2, 2, 2, 2, 2]])
    TESTTARGETS = np.array([1, 1, 0, 0, 0, 0])
    DECISIONBOUNDARY = 0.5
    #Fit a gaussian process model to the data
    gp = GaussianProcess(theta0=10e-1, random_start=100)
    gp.fit(TRAININPUTS, TRAINTARGETS)
    #Generate a set of predictions for the test data
    y_pred = gp.predict(TESTINPUTS)
    print("Predicted Values:")
    print(y_pred)
    print("----------------")
    #Convert the continuous predictions into the classes
    #by splitting on a decision boundary of 0.5
    predictions = []
    for y in y_pred:
        if y > DECISIONBOUNDARY:
            predictions.append(1)
        else:
            predictions.append(0)
    print("Binned Predictions (decision boundary = 0.5):")
    print(predictions)
    print("----------------")
    #print out the confusion matrix, specifying 1 as the positive class
    cm = confusion_matrix(TESTTARGETS, predictions, labels=[1, 0])
    print("Confusion Matrix (1 as positive class):")
    print(cm)
    print("----------------")
    print("Classification Report:")
    print(metrics.classification_report(TESTTARGETS, predictions))
When I run this code I get the following output:
Predicted Values:
[ 0.96914832 0.96914832 -0.03172673 0.03085167 0.06066993 0.11677634]
----------------
Binned Predictions (decision boundary = 0.5):
[1, 1, 0, 0, 0, 0]
----------------
Confusion Matrix (1 as positive class):
[[2 0]
[0 4]]
----------------
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 2
avg / total 1.00 1.00 1.00 6
The approach used in this basic example seems to work fine with this simple dataset. But this approach is very different from the classification example given on the scikit-learn website that I mentioned above (url repeated here):
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
So I'm wondering if I am missing something here. So, I would appreciate if anyone could:
1. With respect to the classification example given on the scikit-learn website:
1.1 explain what the probabilities being generated in this example are probabilities of? Are they the probability of the query instance belonging to the class >0?
1.2 why the example uses a cumulative density function instead of a probability density function?
1.3 why the example divides the predictions made by the model by the square root of the mean square error before they are input into the cumulative density function?
2. With respect to the basic code example I have listed here, clarify whether or not applying a simple decision boundary to the predictions generated by a gaussian process model is an appropriate way to do binary classification?
Sorry for such a long question and thanks for any help.
In the GP classifier, a standard GP distribution over functions is "squashed," usually using the standard normal CDF (the inverse of the probit function, which is why this is called a probit model), to map it to a distribution over binary categories.
Another interpretation of this process is through a hierarchical model (this paper has the derivation), with a hidden variable drawn from a Gaussian Process.
In sklearn's gp library, it looks like the output from y_pred, MSE=gp.predict(xx, eval_MSE=True) are the (approximate) posterior means (y_pred) and posterior variances (MSE) evaluated at points in xx before any squashing occurs.
To obtain the probability that a point from the test set belongs to the positive class, you can convert the normal distribution over y_pred to a binary distribution by applying the Normal CDF (see [this paper again] for details).
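The hierarchical model behind the probit squashing function has its decision boundary at 0 (the standard normal distribution is symmetric around 0, meaning PHI(0)=.5). So, provided your targets are coded symmetrically around 0 (for example -1/+1 rather than 0/1), you should set DECISIONBOUNDARY=0. With the 0/1 coding used in your code, the class-0 predictions in your output (roughly 0.03, 0.06 and 0.12) would all land above a boundary of 0, so recode the targets first.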
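The squashing step itself is short. The sketch below uses only the Python standard library; the predictive means and variances are made-up numbers standing in for `y_pred` and `MSE`, and it assumes the latent value at each test point is treated as Gaussian, so that P(f > 0) = PHI(mean / sqrt(variance)). That is also why the example divides by the square root of the MSE and uses a CDF rather than a density.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, PHI(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def positive_class_probability(mean, variance):
    """P(f > 0) when the latent value f is Normal(mean, variance)."""
    return normal_cdf(mean / math.sqrt(variance))

# Made-up posterior means and variances for three test points.
means = [1.2, 0.0, -0.8]
variances = [0.25, 1.0, 0.16]

probs = [positive_class_probability(m, v) for m, v in zip(means, variances)]
classes = [1 if p > 0.5 else 0 for p in probs]
print(probs, classes)
```

Thresholding the probability at 0.5 is the same as thresholding the latent mean at 0, since PHI(0) = 0.5.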

Resources