KNN distance function with one for loop - machine-learning

part of my assessment I'm writing the code for KNN classifier with one for loop, so I suppose to multiple each training examples with all of the testing examples, and this what I come up with:
for i in range(num_train):
distance = (x_train[i].view(-1, 1) - x_test.view(-1, num_test))
eucl_squ = distance**2
sum = eucl_squ.sum(dim=0)
dists[i] = sum
and it's not correct and the correct one is:
for i in range(num_train):
dists[i]=torch.sum((x_test - x_train[i])**2,dim=1).t()
and I don't know why mine is not right.

Related

Display inverted ROC Curve

my anomaly detection algorithm gave me an array of predictions where all the values greater than 0 should be of the positive class (= 0) and all the other should be classified as anomalies (= 1). I built my classifier as well: (I have three datasets, the one with only non-anomaly values and the other with all anomaly values):
normal = np.load('normal_score.pkl')
anom_1 = np.load('anom1_score.pkl')
anom2_ = np.load('anom2_score.pkl')
y_normal = np.asarray([0]*len(normal)) # I know they are normal
y_anom_1 = np.asarray([1]*len(anom_1)) # I know they are anomaly
y_anom_2 = np.asarray([1]*len(anom_2)) # I know they are anomaly
score = np.concatenate([normal, anom_1, anom_2])
y = np.concatenate([y_normal, y_anom_1, y_anom_2])
auc = roc_auc_score(y, score)
fpr, tpr, thresholds = roc_curve(y, score)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc)
The AUC score I get is 0.02 and the plot looks like:
From what I understood this result is great because I should just reverse the labels to make it almost 0.98, but my question is: is there a way to specify it and automatically reverse it through a function?
The values in my normal score data are all in the range (21;57) and the anomalies values are in the range (-1090; -1836) so it should be easy to spot them.
"I should just reverse the labels to make it almost 0.98"
That's not how it should be done. It is because if you can predict "normal", let's say, with 95% confidence, you can not infer from this that you can also predict "anomaly" with the same confidence.
It becomes crucial in case of heavily imbalanced data which is probably the case here.
You should define which of these two you want to predict with high confidence and what are the target prediction metrics. For example, if you have a target on the precision and recall for predicting the "anomaly" then that should be your class "1" and calculate the metrics accordingly, and vice versa.

In backpropogation, what does it mean when the error of a neural network converges to 0.5?

I've been trying to learn the math behind neural networks and have implemented (in Octave) a version of the following equations which include bias terms.
Back-propagation equations matrix form:
Visual representation of the problem and Network:
clear; clc; close all;
#Initialize weights and bias from input to hidden layer
W1 = rand(3,4)
b1 = ones(3,1)
#Initialize weights from hidden to output
W2 = rand(2,3)
b2 = ones(2,1)
#define sigmoid function
s = #(z) 1./(1 + exp(-z));
ds = #(z) s(z).*(1-s(z));
data = csvread("data.txt");
for j = 1 : 100
for i = 1 : length(data)
x0 = data(i,2:5)';
#Find the truth
if data(i,6) == 1 ;
t = [1;0] ;
else
t = [0;1];
end
#Forward propagate
x1 = s(W1*x0 + b1);
x2 = s(W2*x1 + b2);
iter = (j-1)*length(data) + i;
E((j-1)*length(data) + i) = norm(x2-t)^2;
E(length(E))
#Back propagate
delta2 = (x2-t).*ds(W2*x1+b2);
delta1 = W2'*delta2.*ds(W1*x0+b1);
dedw2 = delta2*x1';
dedw1 = delta1*x0';
alpha = 0.001*(40000-iter)/40000;
W2 = W2 - alpha*dedw2;
W1 = W1 - alpha*dedw1;
b2 = b2 - alpha*delta2;
b1 = b1 - alpha*delta1;
end
end
plot(E)
title('Gradient Descent')
xlabel('Iteration')
ylabel('Error')
When I run this, I converge on weights that give an constant error of 0.5 rather than 0.0. The error plot looks something like this depending on the initial samples of W1 and W2:
The resulting weights W1 and W2 yield output ~[0.5,0.5] for the whole set rather than [1,0](isStairs = true) or [0,1](isStairs = False)
Other information:
If I loop over a single data point instead of the entire learning set, it does converge to zero error for that particular case. (like 20 iterations or so), so I assume my derivatives are correct?
For the model to converge the learning rate has to be insanely small. Not sure what this means.
Is this neural network valid to solve the described problem? If so, what does it mean to converge to an error of 0.5?
The NN learns from data. If there is only one example, it will learn this example by heard and you have zero error. But if you have more examples, they will likely not lie on a nice curve, but are noisy instead. So it is harder to learn the data by heard for the network (it also depends on the number of free parameters that the NN has but you get the idea)... However, you don't want the NN to learn everything in detail. You want it to learn the overall trend (so not the noise). But this also means, that your error won't converge to zero as there is noise, which your NN should not learn... So don't worry if you have a (small) error at the end.
But what about the learning rate? Well, imagine you have 10 examples. Eight of them describe a perfect line but two exhibit noise. One sightly to the right (lets say +1) and the other slightly to the left (-1). If the NN estimates one of those points and updates to minimize the error drawn from it. The update will jump from + to - or vice versa. Depending on your learning rate, this jumping may eventually converge to the middle (which is the correct function) or may go on forever... This is essentially what the learning rate does: it determines how much impact an estimation error has on the update/learning of the network. So a good idea is to choose a larger learning rate the the beginning (where the network has a really bad performance due to its random initialization) and decrease the rate when it already learned something. You can achieve the same thing with a small learning rate but you will need longer time for it;)

Low confidence score in SVC for example from training set

Here is my code for SVC classifier.
vectorizer = TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(training_data)
classifier_linear = svm.LinearSVC()
clf = CalibratedClassifierCV(classifier_linear)
linear_svc_model = clf.fit(train_vectors, train_labels)
training_data here is a list of english sentences and train_lables are the labels associated. I do the usual stopwords removal and some preprocessing before creating final version of training_data. Here is how my testing code:
test_lables = ["no"]
test_vectors = vectorizer.transform(test_lables)
prediction_linear = clf.predict_proba(test_vectors)
counter = 0
class_probability = {}
lables = []
for item in train_labels:
if item in lables:
continue
else:
lables.append(item)
for val in np.nditer(prediction_linear):
new_val = val.item(0)
class_probability[lables[counter]] = new_val
counter = counter + 1
sorted_class_probability = sorted(class_probability.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_class_probability)
Now when I run the code with a phrase that is already there in the training set (a word 'no' in this case), it identifies properly, but the confidence score is even below .9. The output is as follows:
[('no', 0.8474342514152964), ('hi', 0.06830103628879058), ('thanks', 0.03070201906552546), ('confused', 0.02647134535600733), ('ok', 0.015857384248465656), ('yes', 0.005961945963546264), ('bye', 0.005272017662368208)]
When I am studying online, I have seen that usually confidence score for data already in the training set is closer to 1 or almost 1 and rest of them are really negligible. What can I do to get better confidence score? Should I be worried that if I add more classes to it, the confidence score will further dip and it will be difficult for me to surely point out one standout class?
As long as your scores help you classify your inputs correctly, you shouldn't worry at all. If anything, if your confidence on the input already in your training data is too high, that probably means your method has overfit to the data, and cannot generalize to the unseen data.
However, you can tune the complexity of your method by changing the penalization parameters. In the case of a LinearSVC, you have both the penalty and the C parameter. Try different values of those two and observe the effect. Make sure you also observe the effect on an unseen test set.
Just a not that the values of C should be in exponential space, eg. [0.001, 0.01, 0.1, 1, 10, 100, 1000] for you to see meaningful effects.
The SGDClassifier may be relevant to your case if you're interested in such linear models and tuning your parameters.

How to handle gradients when training two sub-graphs simultaneously

The general idea I am trying to realize is a seq2seq-model (taken from the translate.py-example in the models, based on the seq2seq-class). This trains well.
Furthermore I am using the hidden state of the rnn after all the encoding is done, right before decoding starts (I call it the “hidden state at end of encoding”). I use this hidden state at end of encoding to feed it into a further sub-graph which I call “prices” (see below). The training gradients of this sub-graph backprop not only through this additional sub-graph, but also back into the encoder-part of the rnn (which is what I want and need).
The plan is to add more such sub-graph to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.
Now during training when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I train by executing the training in the following way (pseudo-code):
if global_step % 10 == 0:
execute-the-price-training_code
else:
execute-the-decoder-training_code
So I am not training both sub-graphs simultaneously. Now it does converge, but the encoder+decoder-part converges MUCH slower than if I ONLY train this part and never train the prices-sub-graph.
My question is: I should be able to train both sub-graphs simultaneously. But probably I have to rescale the gradients flowing back into the hidden state at end of encoding. Here we get the gradients from the prices sub-graph AND from the decoder-sub-graph. How should this rescaling be done. I didnt find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.
Here is the training-part of the code:
This is the (almost original) training-op-preparation:
if not forward_only:
self.gradient_norms = []
self.updates = []
opt = tf.train.AdadeltaOptimizer(self.learning_rate)
for bucket_id in xrange(len(buckets)):
tf.scalar_summary("seq2seq loss", self.losses[bucket_id])
gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
self.gradient_norms.append(norm)
self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq), global_step=self.global_step))
Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:
with tf.name_scope('prices') as scope:
#First layer
W_price_first_layer = tf.Variable(tf.random_normal([num_layers*size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")
self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)
#self.activation_price_first_layer = tf.nn.Relu(self.output_price_first_layer)
#Second layer to softmax (price ranges)
W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
W_price_t = tf.transpose(W_price)
B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")
self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price),B_price)
self.price_prediction = tf.nn.softmax(self.output_price_second_layer)
self.label_price = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")
#Remember the prices trainables
var_list_prices = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
var_list_all = tf.trainable_variables()
#Backprop
self.loss_price = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
self.loss_price_scalar = tf.reduce_mean(self.loss_price)
self.optimizer_price = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
self.training_op_price = self.optimizer_price.minimize(self.loss_price, var_list=var_list_all)
Thx a bunch
I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.
Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single standard TensorFlow tf.train.Optimizer subclass. TensorFlow will derive the appropriate back-prop computation for your split network.
The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how or if you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.

Time Series Ahead Prediction in Neural Network (N Point Ahead Prediction) Large Scale Iterative Training

(N=90) Point ahead Prediction using Neural Network:
I am trying to predict 3 minutes ahead i.e. 180 points ahead. Because I compressed my time series data as taking the mean of every 2 points as one, I have to predict (N=90) step-ahead prediction.
My time series data is given in seconds. The values are in between 30-90. They usually move from 30 to 90 and 90 to 30, as seen in the example below.
My data could be reach from: https://www.dropbox.com/s/uq4uix8067ti4i3/17HourTrace.mat
I am having trouble in implementing neural network to predict N points ahead. My only feature is previous time. I used elman recurrent neural network and also newff.
In my scenario I need to predict 90 points ahead. First how I separated my input and target data manually:
For Example:
data_in = [1,2,3,4,5,6,7,8,9,10]; //imagine 1:10 only defines the array index values.
N = 90; %predicted second ahead.
P(:, :) T(:) it could also be(2 theta time) P(:, :) T(:)
[1,2,3,4,5] [5+N] | [1,3,5,7,9] [9+N]
[2,3,4,5,6] [6+N] | [2,4,6,8,10] [10+N]
...
until it reaches to end of the data
I have 100 input points and 90 output points in Elman recurrent neural networks. What could be the most efficient hidden node size?
input_layer_size = 90;
NodeNum1 =90;
net = newelm(threshold,[NodeNum1 ,prediction_ahead],{'tansig', 'purelin'});
net.trainParam.lr = 0.1;
net.trainParam.goal = 1e-3;
//At the beginning of my training I filter it with kalman, normalization into range of [0,1] and after that I shuffled the data.
1) I won't able to train my complete data. First I tried to train complete M data which is around 900,000, which didn't gave me a solution.
2) Secondly I tried iteratively training. But in each iteration the new added data is merged with already trained data. After 20,000 trained data the accuracy start to decreases. First trained 1000 data perfectly fits in training. But after when I start iterativelt merge the new data and continue to training, the training accuracy drops very rapidly 90 to 20.
For example.
P = P_test(1:1000) T = T_test(1:1000) counter = 1;
while(1)
net = train(net,P,T, [], [] );%until it reaches to minimum error I train it.
[normTrainOutput] = sim(net,P, [], [] );
P = [ P P(counter*1000:counter*2000)]%iteratively new training portion of the data added.
counter = counter + 1; end
This approach is very slow and after a point it won't give any good resuts.
My third approach was iteratively training; It was similar to previous training but in each iteration, I do only train the 1000 portion of the data, without do any merging with previous trained data.For example when I train first 1000 data until it gets to minimum error which has >95% accuracy. After it has been trained, when I have done the same for the second 1000 portion of the data;it overwrites the weight and the predictor mainly behave as the latest train portion of the data.
> P = P_test(1:1000) T = T_test(1:1000) counter = 1;
while(1)
> net = train(net,P,T, [], [] ); % I did also use adapt()
> [normTrainOutput] = sim(net,P, [], [] );
>
> P = [ P(counter*1000:counter*2000)]%iteratively only 1000 portion of the data is added.
> counter = counter + 1;
end
Trained DATA: This figure is snapshot from my trained training set, blue line is the original time series and red line is the predicted values with trained neural network. The MSE is around 50.
Tested DATA: On the below picture, you can see my prediction for my testing data with the neural network, which is trained with 20,000 input points while keeping MSE error <50 for the training data set. It is able to catch few patterns but mostly I doesn't give the real good accuracy.
I wasn't able to successes any of this approaches. In each iteration I also observe that slight change on the alpha completely overwrites to already trained data and more focus onto the currently trained data portion.
I won't able to come up with a solution to this problem. In iterative training should I keep the learning rate small and number of epochs as small.
And I couldn't find an efficient way to predict 90 points ahead in time series. Any suggestions that what should I do to do in order to predict N points ahead, any tutorial or link for information.
What is the best way for iterative training? On my second approach when I reach 15 000 of trained data, training size starts suddenly to drop. Iteratively should I change the alpha on run time?
==========
Any suggestion or the things I am doing wrong would be very appreciated.
I also implemented recurrent neural network. But on training for large data I have faced with the same problems.Is it possible to do adaptive learning(online learning) in Recurrent Neural Networks for(newelm)? The weight won't update itself and I didn't see any improvement.
If yes, how it is possible, which functions should I use?
net = newelm(threshold,[6, 8, 90],{'tansig','tansig', 'purelin'});
net.trainFcn = 'trains';
batch_size = 10;
while(1)
net = train(net,Pt(:, k:k+batch_size ) , Tt(:, k:k+batch_size) );
end
Have a look at Echo State Networks (ESNs) or other forms of Reservoir Computing. They are perfect for time series prediction, very easy to use and converge fast. You don't need to worry about the structure of the network at all (every neuron in the mid-layer has random weights which do not change). You only learn the output weights.
If I understood the problem correctly, with Echo State Networks, I would just train the network to predict the next point AND 90 points ahead. This can be done by simply forcing the desired output in the output neurons and then performing ridge regression to learn the output weights.
When running the network after having trained it, at every step n, it would output the next point (n+1), which you would feed back to the network as input (to continue the iteration), and 90 points ahead (n+90), which you can do whatever you want with - i.e: you could also feed it back to the network so that it affects the next outputs.
Sorry if the answer is not very clear. It's hard to explain how reservoir computing works in a short answer, but if you just read the article in the link, you will find it very easy to understand the principles.
If you do decide to use ESNs, also read this paper to understand the most important property of ESNs and really know what you're doing.
EDIT: Depending on how "predictable" your system is, predicting 90 points ahead may still be very difficult. For example if you're trying to predict a chaotic system, noise would introduce very large errors if you're predicting far ahead.
use fuzzy logic using membership function to predict the future data. will be efficient method.

Resources