How to make stochastic gradient descent not ultimately kill all weights - machine-learning

Stochastic gradient descent as I understand it, is implemented as follows:
Activate the network, read output neurons, compare to desired output, calculate derivatives on weights, update weights, activate with new example.
However, let's say I have the following dataset:
[
{ input: [0, 0], output: [0] },
{ input: [0, 1], output: [0] },
{ input: [1, 0], output: [0] },
{ input: [1, 1], output: [1] },
]
So if I update the weights to decrease the output of the output neuron in 3 out of 4 cases, ultimately, over time I will set my weights to produce an output of 0 or close thereof.
Which mechanism insures that I will get an output of 1 if both input neurons are activated with 1?

I'm not sure if there is a method to do that. But when you have a dataset which you need to perform a classification task on, it is a good practice to make sure you have an almost uniform distribution of classes in the dataset.

Related

How should I split my data for cross validation and grid search?

Should i split my data in to two parts similar in size to use each half for eaxh tasks or i should do grid search on my whole data and then just do cross validation again on my whole data to check my accuracy ?
You need to split the data into test and train (20:80) (eg. test_train_split in sklearn), then run the model with the train data and check the accuracy. If its not what you expect, then you can try applying Hyper parameter Tuning.
You can do this by GridSearchCV, where you need to fit the desired estimator (depending on the type of problem ) and the parameter values.
Attached a sample code :
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search
param_grid = {
'bootstrap': [True],
'max_depth': [50, 55, 60, 65],
'max_features': ["auto","sqrt", 2, 3],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [2, 3, 4],
'n_estimators': [60, 65, 70, 75]
}
grid_search = GridSearchCV(estimator = rfcv, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(X_train, Y_train)
grid_search.best_params_
Based the best parameter results, you can fine tune the grid search.
Eg, if best parameter value is near 60 for n_estimators then you need to change the values as surrounding to 60 like [50,55,60,60]. To figure out the exact value.
Then build the machine learning model based on the best parameters value. Evaluate the train data accuracy and then predict the result using test data values.
rf = rgf(n_estimators = 70, random_state=0, min_samples_split = 2, min_samples_leaf=1, max_features = 'sqrt',bootstrap='True', max_depth=65)
regressor = rf.fit(X_train,Y_train)
pred_tuned = regressor.predict(X_test)
You can find an improvement in your accuracy !!

Correct way to split data to batches for Keras stateful RNNs

As the documentation states
the last state for each sample at index i in a batch will be used as
initial state for the sample of index i in the following batch
does it mean that to split data to batches I need to do it the following way
e.g. let's assume that I am training a stateful RNN to predict the next integer in range(0, 5) given the previous one
# batch_size = 3
# 0, 1, 2 etc in x are samples (timesteps and features omitted for brevity of the example)
x = [0, 1, 2, 3, 4]
y = [1, 2, 3, 4, 5]
batches_x = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
batches_y = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
then the state after learning on x[0, 0] will be initial state for x[1, 0]
and x[0, 1] for x[1, 1] (0 for 1 and 1 for 2 etc)?
Is it the right way to do it?
Based on this answer, for which I performed some tests.
Stateful=False:
Normally (stateful=False), you have one batch with many sequences:
batch_x = [
[[0],[1],[2],[3],[4],[5]],
[[1],[2],[3],[4],[5],[6]],
[[2],[3],[4],[5],[6],[7]],
[[3],[4],[5],[6],[7],[8]]
]
The shape is (4,6,1). This means that you have:
1 batch
4 individual sequences = this is batch size and it can vary
6 steps per sequence
1 feature per step
Every time you train, either if you repeat this batch or if you pass a new one, it will see individual sequences. Every sequence is a unique entry.
Stateful=True:
When you go to a stateful layer, You are not going to pass individual sequences anymore. You are going to pass very long sequences divided in small batches. You will need more batches:
batch_x1 = [
[[0],[1],[2]],
[[1],[2],[3]],
[[2],[3],[4]],
[[3],[4],[5]]
]
batch_x2 = [
[[3],[4],[5]], #continuation of batch_x1[0]
[[4],[5],[6]], #continuation of batch_x1[1]
[[5],[6],[7]], #continuation of batch_x1[2]
[[6],[7],[8]] #continuation of batch_x1[3]
]
Both shapes are (4,3,1). And this means that you have:
2 batches
4 individual sequences = this is batch size and it must be constant
6 steps per sequence (3 steps in each batch)
1 feature per step
The stateful layers are meant to huge sequences, long enough to exceed your memory or your available time for some task. Then you slice your sequences and process them in parts. There is no difference in the results, the layer is not smarter or has additional capabilities. It just doesn't consider that the sequences have ended after it processes one batch. It expects the continuation of those sequences.
In this case, you decide yourself when the sequences have ended and call model.reset_states() manually.

XOR gate with a neural network

I was trying to implement an XOR gate with tensorflow. I succeeded in implementing that, but i don't fully understand why it works. I got help from stackoverflow posts here and here. So both with one hot true and without one hot true outputs. Here is the network as i understood, in order to set things clear.
My Question #1:
Notice the RELU function and Sigmoid function. Why we need that(specifically the RELU function)? You may say that in order to achieve non linearity. I understand how RELU achieves non-linearity. I got the answer from here. Now from what I understand the difference between using RELU and without using RELU is this(see the picture).[I tested the tf.nn.relu function. The output is like this]
Now, if the first function works, why not the second function? From my perspective RELU achieves non-linearity by combining multiple linear functions. So both is linear function(upper two). If first one achieves non linearity, 2nd one should too, shouldn't it? The question is that, without using the RELU why the network gets stuck?
XOR gate with one hot true outputs
hidden1_neuron = 10
def Network(x, weights, bias):
layer1 = tf.nn.relu(tf.matmul(x, weights['h1']) + bias['h1'])
layer_final = tf.matmul(layer1, weights['out']) + bias['out']
return layer_final
weight = {
'h1' : tf.Variable(tf.random_normal([2, hidden1_neuron])),
'out': tf.Variable(tf.random_normal([hidden1_neuron, 2]))
}
bias = {
'h1' : tf.Variable(tf.random_normal([hidden1_neuron])),
'out': tf.Variable(tf.random_normal([2]))
}
x = tf.placeholder(tf.float32, [None, 2])
y = tf.placeholder(tf.float32, [None, 2])
net = Network(x, weight, bias)
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(net, y)
loss = tf.reduce_mean(cross_entropy)
train_op = tf.train.AdamOptimizer(0.2).minimize(loss)
init_op = tf.initialize_all_variables()
xTrain = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
yTrain = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
with tf.Session() as sess:
sess.run(init_op)
for i in range(5000):
train_data = sess.run(train_op, feed_dict={x: xTrain, y: yTrain})
loss_val = sess.run(loss, feed_dict={x: xTrain, y: yTrain})
if(not(i%500)):
print(loss_val)
result = sess.run(net, feed_dict={x:xTrain})
print(result)
The code you see above implements the XOR gate with one hot true outputs. If i take out tf.nn.relu, the network gets stuck. Why?
My Question #2:
How can I understand if a network is going to get stuck on some local minima[or some value]? Is it from the plot of cost function (or loss function)? Say, for the network designed above, I used cross entropy as the loss function. I could not find the plotting of cross entropy function. (If you can provide this, this would be very helpful.)
My Question #3:
Notice on the code there is a line hidden1_neuron = 10. It means that i have set the number of neurons in the hidden layer 10. Reducing the number of neurons to 5 makes the network to get stuck. So what should be the number of neurons on hidden layer?
The output when the network works the way it is supposed to :
2.42076
0.000456363
0.000149548
7.40216e-05
4.34194e-05
2.78939e-05
1.8924e-05
1.33214e-05
9.62602e-06
7.06308e-06
[[ 7.5128479 -7.58900356]
[-5.65254211 5.28509617]
[-6.96340656 6.62380219]
[ 7.26610374 -5.9665451 ]]
The output when the network gets stuck:
1.45679
0.346579
0.346575
0.346575
0.346574
0.346574
0.346574
0.346574
0.346574
0.346574
[[ 15.70696926 -18.21559143]
[ -7.1562047 9.75774956]
[ -0.03214722 -0.03214724]
[ -0.03214722 -0.03214724]]
Question 1
Both the ReLU and Sigmoid function is non-linear. On the contrary, the function drawn to the right of the ReLU function is linear. Applying multiple linear activation functions will still make the network linear.
Therefore, the network gets stuck when trying to perform linear regression on a non-linear problem.
Question 2
Yes, you will have to pay attention to the progression of the error rate. In larger problem instances, you would typically pay attention to the development of the error function on your test set. This is done by measuring the accuracy of the network after a period of training.
Question 3
The XOR problem requires at least 2 input, 2 hidden, and 1 output node, that is: five nodes are required to correctly model the XOR problem with a simple neural network.

How to use a Gaussian Process for Binary Classification?

I know that a Gaussian Process model is best suited for regression rather than classification. However, I would still like to apply a Gaussian Process to a classification task but I am not sure what is the best way to bin the predictions generated by the model. I have reviewed the Gaussian Process classification example that is available on the scikit-learn website at:
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
But I found this example confusing (I have listed the things I found confusing about this example at the end of the question). To try and get a better understanding I have created a very basic python code example using scikit-learn that generates classifications by applying a decision boundary to the predictions made by a gaussian process:
#A minimum example illustrating how to use a
#Gaussian Processes for binary classification
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.gaussian_process import GaussianProcess
if __name__ == "__main__":
#defines some basic training and test data
#If the descriptive features have large values
#(i.e., 8s and 9s) the target is 1
#If the descriptive features have small values
#(i.e., 2s and 3s) the target is 0
TRAININPUTS = np.array([[8, 9, 9, 9, 9],
[9, 8, 9, 9, 9],
[9, 9, 8, 9, 9],
[9, 9, 9, 8, 9],
[9, 9, 9, 9, 8],
[2, 3, 3, 3, 3],
[3, 2, 3, 3, 3],
[3, 3, 2, 3, 3],
[3, 3, 3, 2, 3],
[3, 3, 3, 3, 2]])
TRAINTARGETS = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
TESTINPUTS = np.array([[8, 8, 9, 9, 9],
[9, 9, 8, 8, 9],
[3, 3, 3, 3, 3],
[3, 2, 3, 2, 3],
[3, 2, 2, 3, 2],
[2, 2, 2, 2, 2]])
TESTTARGETS = np.array([1, 1, 0, 0, 0, 0])
DECISIONBOUNDARY = 0.5
#Fit a gaussian process model to the data
gp = GaussianProcess(theta0=10e-1, random_start=100)
gp.fit(TRAININPUTS, TRAINTARGETS)
#Generate a set of predictions for the test data
y_pred = gp.predict(TESTINPUTS)
print "Predicted Values:"
print y_pred
print "----------------"
#Convert the continuous predictions into the classes
#by splitting on a decision boundary of 0.5
predictions = []
for y in y_pred:
if y > DECISIONBOUNDARY:
predictions.append(1)
else:
predictions.append(0)
print "Binned Predictions (decision boundary = 0.5):"
print predictions
print "----------------"
#print out the confusion matrix specifiy 1 as the positive class
cm = confusion_matrix(TESTTARGETS, predictions, [1, 0])
print "Confusion Matrix (1 as positive class):"
print cm
print "----------------"
print "Classification Report:"
print metrics.classification_report(TESTTARGETS, predictions)
When I run this code I get the following output:
Predicted Values:
[ 0.96914832 0.96914832 -0.03172673 0.03085167 0.06066993 0.11677634]
----------------
Binned Predictions (decision boundary = 0.5):
[1, 1, 0, 0, 0, 0]
----------------
Confusion Matrix (1 as positive class):
[[2 0]
[0 4]]
----------------
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 2
avg / total 1.00 1.00 1.00 6
The approach used in this basic example seems to work fine with this simple dataset. But this approach is very different from the classification example given on the scikit-lean website that I mentioned above (url repeated here):
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
So I'm wondering if I am missing something here. So, I would appreciate if anyone could:
With respect to the classification example given on the scikit-learn website:
1.1 explain what the probabilities being generated in this example are probabilities of? Are they the probability of the query instance belonging to the class >0?
1.2 why the example uses a cumulative density function instead of a probability density function?
1.3 why the example divides the predictions made by the model by the square root of the mean square error before they are input into the cumulative density function?
With respect to the basic code example I have listed here, clarify whether or not applying a simple decision boundary to the predictions generated by a gaussian process model is an appropriate way to do binary classification?
Sorry for such a long question and thanks for any help.
In the GP classifier, a standard GP distribution over functions is "squashed," usually using the standard normal CDF (also called the probit function), to map it to a distribution over binary categories.
Another interpretation of this process is through a hierarchical model (this paper has the derivation), with a hidden variable drawn from a Gaussian Process.
In sklearn's gp library, it looks like the output from y_pred, MSE=gp.predict(xx, eval_MSE=True) are the (approximate) posterior means (y_pred) and posterior variances (MSE) evaluated at points in xx before any squashing occurs.
To obtain the probability that a point from the test set belongs to the positive class, you can convert the normal distribution over y_pred to a binary distribution by applying the Normal CDF (see [this paper again] for details).
The hierarchical model of the probit squashing function is defined by a 0 decision boundary (the standard normal distribution is symmetric around 0, meaning PHI(0)=.5). So you should set DECISIONBOUNDARY=0.

Scikit-learn Multiclass Naive Bayes with probabilities for y

I'm doing a tweet classification, where each tweet can belong to one of few classes.
The training set output is given as the probability for belonging that sample to each class.
Eg: tweet#1 : C1-0.6, C2-0.4, C3-0.0 (C1,C2,C3 being classes)
I'm planning to use a Naive Bayes classifier using Scikit-learn. I couldn't find a fit method in naive_bayes.py which takes probability for each class for training.
I need a classifier which accepts output probability for each class for the training set.
(ie: y.shape = [n_samples, n_classes])
How can I process my data set to apply a NaiveBayes classifier?
This is not so easy, as the "classes probability" can have many interpretations.
In case of NB classifier and sklearn the easiest procedure I see is:
Split (duplicate) your training samples according to the following rule:
given (x, [p1, p2, ..., pk ]) sample (where pi is probability for ith class) create artificial training samples:
(x, 1, p1), (x, 2, p2), ..., (x, k, pk). So you get k new observations, each "attached" to one class, and pi is treated as a sample weight, which NB (in sklearn) accepts.
Train your NB with fit(X,Y,sample_weights) (where X is a matrix of your x observations, Y is a matrix of classes from previous step, and sample_weights is a matrix of pi from the previous step.
For example if your training set consists of two points:
( [0 1], [0.6 0.4] )
( [1 3], [0.1 0.9] )
You transform them to:
( [0 1], 1, 0.6)
( [0 1], 2, 0.4)
( [1 3], 1, 0.1)
( [1 3], 2, 0.9)
and train NB with
X = [ [0 1], [0 1], [1 3], [1 3] ]
Y = [ 1, 2, 1, 2 ]
sample_weights = [ 0.6 0.4 0.1 0.9 ]

Resources