Computing the Euclidean distance for KNN - machine-learning

I've been seeing a lot of examples of computing the Euclidean distance for KNN, but none for sentiment classification.
For example, I have the sentence "a very close game".
How do I compute the Euclidean distance to the sentence "A great game"?

Think of a sentence as a point in multi-dimensional space; only after you have defined a coordinate system can you calculate the Euclidean distance. For instance, you could introduce:
O1 - the sentence length (Length)
O2 - the number of words (WordsCount)
O3 - the alphabetical center (I just thought of it). It can be calculated as the arithmetic mean of the alphabetical center of each word in the sentence.
CharsIndex = Sum(Char.indexInWord) / CharsCountInWord
CharsCode = Sum(Char.charCode) / CharsCountInWord
AlphWordCoordinate = [CharsIndex, CharsCode]
WordsIndex = Sum(Words.CharsIndex) / WordsCount
WordsCode = Sum(Words.CharsCode) / WordsCount
AlphaSentenceCoordinate = (WordsIndex^2 + WordsCode^2 + WordIndexInSentence^2)^(1/2)
So the Euclidean distance can now be calculated as follows:
EuclidianSentenceDistance = (WordsCount^2 + Length^2 + AlphaSentenceCoordinate^2)^(1/2)
Now every sentence can be transformed into a point in three-dimensional space, like P[Length, WordsCount, AlphaCoordinate]. Having a distance, you can compare and classify sentences.
It is not an ideal approach, I guess, but I wanted to show you the idea.
import math

def calc_word_alpha_center(word):
    # Average character position and average character code of one word.
    chars_index = 0
    chars_codes = 0
    for index, char in enumerate(word):
        chars_index += index
        chars_codes += ord(char)
    chars_count = len(word)
    index = chars_index / chars_count
    code = chars_codes / chars_count
    return (index, code)

def calc_alpha_distance(words):
    # "Alphabetical coordinate" of a sentence: the norm of the averaged
    # per-word centers and word positions.
    word_chars_index = 0
    word_code = 0
    word_index = 0
    for index, word in enumerate(words):
        point = calc_word_alpha_center(word)
        word_chars_index += point[0]
        word_code += point[1]
        word_index += index
    words_count = len(words)
    chars_index = word_chars_index / words_count
    code = word_code / words_count
    index = word_index / words_count
    return math.sqrt(math.pow(chars_index, 2) + math.pow(code, 2) + math.pow(index, 2))

def calc_sentence_euclidean_distance(sentence):
    # Distance of the sentence's point P[Length, WordsCount, AlphaCoordinate]
    # from the origin.
    length = len(sentence)
    words = sentence.split(" ")
    words_count = len(words)
    alpha_distance = calc_alpha_distance(words)
    return math.sqrt(math.pow(length, 2) + math.pow(words_count, 2) + math.pow(alpha_distance, 2))

sentence1 = "a great game"
sentence2 = "A great game"
distance1 = calc_sentence_euclidean_distance(sentence1)
distance2 = calc_sentence_euclidean_distance(sentence2)
print(sentence1)
print(str(distance1))
print(sentence2)
print(str(distance2))
Console output
a great game
101.764433866
A great game
91.8477000256
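If you want the distance between the two sentences themselves, rather than each sentence's distance from the origin, you can map each sentence to its point P[Length, WordsCount, AlphaCoordinate] and take the Euclidean distance between the two points. A minimal sketch reusing the functions above:

def sentence_to_point(sentence):
    # Map a sentence to the point P[Length, WordsCount, AlphaCoordinate].
    words = sentence.split(" ")
    return (len(sentence), len(words), calc_alpha_distance(words))

def sentence_distance(s1, s2):
    # Euclidean distance between the two sentence points.
    p1 = sentence_to_point(s1)
    p2 = sentence_to_point(s2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

print(sentence_distance("a very close game", "A great game"))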

Line fitting: how to deal with continuous values?

I'm trying to fit a line using a quadratic polynomial, but because the fit results in continuous values, the integer conversion (for CartesianIndex) rounds them off, and I lose data at those pixels.
I tried the method here, so I get the new y values as:
using Images, Polynomials, Plots, ImageView
img = load("jTjYb.png")
img = Gray.(img)
img = img[end:-1:1, :]
nodes = findall(img .> 0)
xdata = map(p -> p[2], nodes)
ydata = map(p -> p[1], nodes)
f = fit(xdata, ydata, 2)
ydata_new = round.(Int, f.(xdata))
new_line_fitted_img = zeros(size(img))
new_line_fitted_img[xdata, ydata_new] .= 1
imshow(new_line_fitted_img)
which results in a chopped line, as below,
whereas I was expecting it to be a continuous line, as it was in the pre-processing.
Do you expect the following?
[Images: Raw Image, Fitted Polynomial, Superposition]
Code:
using Images, Polynomials

img = load("img.png");
img = Gray.(img)

# Cubic polynomial evaluated with broadcasting.
fx(data, dCoef, cCoef, bCoef, aCoef) = @. data^3*aCoef + data^2*bCoef + data*cCoef + dCoef;

function fit_poly(img::Array{<:Gray, 2})
    img = img[end:-1:1, :]
    nodes = findall(img .> 0)
    xdata = map(p -> p[2], nodes)
    ydata = map(p -> p[1], nodes)
    f = fit(xdata, ydata, 3)
    xdt = unique(xdata)
    xdt, fx(xdt, f.coeffs...)
end;

function draw_poly!(X, y)
    the_min = minimum(y)
    if the_min < 0
        y .-= the_min - 1
    end
    initialized_img = Gray.(zeros(maximum(X), maximum(y)))
    initialized_img[CartesianIndex.(X, y)] .= 1
    # Fill vertical gaps between consecutive columns so the line stays connected.
    dif = diff(y)
    for i in eachindex(dif)
        the_dif = dif[i]
        if abs(the_dif) >= 2
            segment = the_dif ÷ 2
            initialized_img[i, y[i]:y[i]+segment] .= 1
            initialized_img[i+1, y[i]+segment+1:y[i+1]-1] .= 1
        end
    end
    rotl90(initialized_img)
end;

X, y = fit_poly(img);
y = convert(Vector{Int64}, round.(y));
draw_poly!(X, y)
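The key idea in draw_poly! is that whenever two consecutive rounded y values differ by 2 or more, the vertical gap between them is filled in, split between the two neighboring columns. Here is a rough Python sketch of that gap-filling idea (not a line-by-line translation; X and y are assumed to be integer pixel coordinate arrays):

import numpy as np

def draw_poly(X, y, shape):
    # Mark the rounded points, then fill vertical gaps between
    # consecutive columns so the rasterized curve stays connected.
    img = np.zeros(shape)
    img[y, X] = 1
    for i in range(len(y) - 1):
        lo, hi = sorted((y[i], y[i + 1]))
        if hi - lo >= 2:
            mid = (lo + hi) // 2
            img[lo:mid + 1, X[i]] = 1      # half of the gap in this column
            img[mid:hi + 1, X[i + 1]] = 1  # the rest in the next column
    return img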

Removing Softmax from last layer yields a lot better results

I was solving an NLP task of converting English sentences to German in Keras. But the model was not learning... As soon as I removed the softmax from the last layer, it started working! Is this a bug in Keras, or does it have to do with something else?
import tensorflow as tf
from tensorflow.keras.layers import Concatenate, Dense, Embedding, GRU, Softmax
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.activations import tanh

optimizer = Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out the padding tokens (id 0) so they do not contribute to the loss.
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

EPOCHS = 20
batch_size = 64
batch_per_epoch = int(train_x1.shape[0] / batch_size)
embed_dim = 256
units = 1024
attention_units = 10

encoder_embed = Embedding(english_vocab_size, embed_dim)
decoder_embed = Embedding(german_vocab_size, embed_dim)
encoder = GRU(units, return_sequences=True, return_state=True, recurrent_initializer='glorot_uniform')
decoder = GRU(units, return_sequences=True, return_state=True, recurrent_initializer='glorot_uniform')
dense = Dense(german_vocab_size)
attention1 = Dense(attention_units)
attention2 = Dense(attention_units)
attention3 = Dense(1)

def train_step(english_input, german_target):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(encoder_embed(english_input))
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([german_tokenizer.word_index['startseq']] * batch_size, 1)
        # Teacher forcing: feed the ground-truth German token at each step.
        for i in range(1, german_target.shape[1]):
            # Bahdanau (additive) attention over the encoder outputs.
            attention_weights = attention1(enc_output) + attention2(tf.expand_dims(dec_hidden, axis=1))
            attention_weights = tanh(attention_weights)
            attention_weights = attention3(attention_weights)
            attention_weights = Softmax(axis=1)(attention_weights)
            Context_Vector = tf.reduce_sum(enc_output * attention_weights, axis=1)
            Context_Vector = tf.expand_dims(Context_Vector, axis=1)
            x = decoder_embed(dec_input)
            x = Concatenate(axis=-1)([x, Context_Vector])
            dec_output, dec_hidden = decoder(x)
            output = tf.reshape(dec_output, (-1, dec_output.shape[2]))
            prediction = dense(output)
            loss += loss_function(german_target[:, i], prediction)
            dec_input = tf.expand_dims(german_target[:, i], 1)
    batch_loss = (loss / int(german_target.shape[1]))
    variables = (encoder_embed.trainable_variables + decoder_embed.trainable_variables
                 + encoder.trainable_variables + decoder.trainable_variables
                 + dense.trainable_variables + attention1.trainable_variables
                 + attention2.trainable_variables + attention3.trainable_variables)
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
Code Summary
The code just takes the English sentence and the German sentence as input (it takes the German sentence as input to implement the teacher-forcing method) and predicts the translated German sentence.
The loss function is SparseCategoricalCrossentropy, but it subtracts out the loss of the 0s. For example, say we have the sentence 'StartSeq This is Stackoverflow 0 0 0 0 0 EndSeq' (the sentence has zero padding to make all the input sentences the same length). Now we calculate the loss for every word, but not for the 0s. Doing this makes the model better.
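To make the masking concrete, here is a minimal sketch (with made-up token ids) of what the mask in loss_function computes for such a padded sentence:

import tensorflow as tf

# Hypothetical token ids for 'StartSeq This is Stackoverflow' plus zero padding.
real = tf.constant([[1, 7, 4, 9, 0, 0, 0]])
mask = tf.math.logical_not(tf.math.equal(real, 0))
print(mask.numpy())  # [[ True  True  True  True False False False]]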
Note - this model implementation uses Bahdanau attention.
Question
When I apply softmax to the output of the last layer, the model doesn't learn anything. But it learns properly without the softmax in the last layer. Why is this happening?

shape of input to calculate information gain

I want to calculate the information gain on the 20_newsgroup data set.
I am using the code here (I have also put a copy of the code at the bottom of the question).
As you see, the input to the algorithm is X, y.
My confusion is that X is going to be a matrix with documents as rows and features as columns (according to 20_newsgroup it is 11314 x 1000,
in case I only consider 1000 features).
But according to the concept of information gain, it should calculate the information gain for each feature.
(So I was expecting the code to loop through each feature, so that the input to the function would be a matrix where the rows are features and the columns are classes.)
But X is not the features here; X stands for the documents, and I cannot see the part of the code that takes care of this (I mean considering each document and then going through each feature of that document, i.e. looping through the rows while at the same time looping through the columns, since the features are stored in the columns).
I have read this and this and many similar questions, but they are not clear in terms of the input matrix shape.
This is the code for reading 20_newsgroup:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroup_train = fetch_20newsgroups(subset='train')
X, y = newsgroup_train.data, newsgroup_train.target
cv = CountVectorizer(max_df=0.99, min_df=0.001, max_features=1000,
                     stop_words='english', lowercase=True, analyzer='word')
X_vec = cv.fit_transform(X)
X_vec.shape is (11314, 1000), so the rows are the documents, not the features of the 20_newsgroup data set. This makes me wonder: am I calculating information gain in an incorrect way?
This is the code for information gain:
import numpy as np

def information_gain(X, y):
    def _calIg():
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    # Class counts and the entropy of the class distribution before any split.
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    # Iterate over the nonzero entries of the transposed matrix, grouped by feature.
    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            for notappear in range(pre + 1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)
    return np.asarray(information_gain)
Well, after going through the code in detail, I learned more about X.T.nonzero().
It is indeed correct that information gain needs to loop through features,
and it is also correct that the matrix scikit-learn gives us here is document-by-feature.
But:
the code uses X.T.nonzero(), which returns the indices of all the nonzero values of the transposed matrix, and then the next line loops through the length of that array, range(0, len(X.T.nonzero()[0])).
Overall, X.T.nonzero()[0] returns the feature index of every nonzero entry to us, so the loop does go through the features :)
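A tiny example with a made-up dense matrix (the real X_vec is sparse, but the idea is the same) shows what X.T.nonzero() returns:

import numpy as np

# 3 documents (rows) x 2 features (columns).
X = np.array([[1, 0],
              [0, 2],
              [3, 0]])

rows, cols = X.T.nonzero()
print(rows)  # [0 0 1] -> the feature index of each nonzero entry
print(cols)  # [0 2 1] -> the document index of each nonzero entry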

How to find accumulator matrix for line in an image?

I am a newbie in the fields of CV and IP. I was writing the Hough transform algorithm for finding lines, and I am not getting what is wrong with this code, in which I am trying to find the accumulator array.
numRowsInBW = size(BW,1);
numColsInBW = size(BW,2);
% length of the diagonal of the image
D = sqrt((numRowsInBW - 1)^2 + (numColsInBW - 1)^2);
% number of rows in the accumulator array
nrho = 2*(ceil(D/rhoStep)) + 1;
% number of cols in the accumulator array
ntheta = length(theta);
H = zeros(nrho, ntheta);
% find the white pixels, i.e. the edge pixels
[allrows, allcols] = find(BW == 1);
for i = (1 : size(allrows))
    y = allrows(i);
    x = allcols(i);
    for th = (1 : 180)
        d = floor(x*cos(th) - y*sin(th));
        H(d+floor(nrho/2), th) += 1;
    end
end
I am applying this to a simple image.
This is the result I am getting, but this is what was expected.
I am not able to find the mistake. Please help me. Thanks in advance.
There are several issues with your code. The main issue is here:
ntheta = length(theta);
% ...
for i = (1 : size(allrows))
    % ...
    for th = (1 : 180)
        d = floor(x*cos(th) - y*sin(th));
        % ...
th seems to be an angle in degrees, but cos(th) expects radians, so cos(th) is meaningless here. Instead, use cosd and sind.
Another issue is that th iterates from 1 to 180, but there is no guarantee that ntheta is 180. So, loop as follows instead:
for i = 1 : size(allrows)
    % ...
    for j = 1 : numel(theta)
        th = theta(j);
        % ...
and use th as the angle, and j as the index into H.
Finally, given your image and your expected output, you should apply some edge detection first (Canny, for example). Maybe you already did this?
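For reference, here is a minimal NumPy sketch of the same accumulator logic (a sketch under the standard parametrization rho = x*cos(theta) + y*sin(theta), not a fix of the exact code above), with the angles converted from degrees to radians before calling cos and sin:

import numpy as np

def hough_accumulator(bw, thetas_deg):
    # bw: 2-D binary edge image; thetas_deg: angles in degrees.
    rows, cols = np.nonzero(bw)                # edge pixel coordinates
    diag = np.hypot(bw.shape[0] - 1, bw.shape[1] - 1)
    nrho = 2 * int(np.ceil(diag)) + 1          # rho step of 1 pixel assumed
    H = np.zeros((nrho, len(thetas_deg)), dtype=int)
    thetas = np.deg2rad(thetas_deg)            # degrees -> radians
    for y, x in zip(rows, cols):
        for j, th in enumerate(thetas):
            rho = int(np.floor(x * np.cos(th) + y * np.sin(th)))
            H[rho + nrho // 2, j] += 1         # shift so negative rho fits
    return H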

Neural Network for Linear Regression: prediction different every time

I have 200 training examples. I have run linear regression with 6 features on this data set and it works fine, so I want to run neural networks on it too.
Problem: each time I run the program, the prediction (pred) is different, vastly different!
input_layer_size = 6;
hidden_layer_size = 3;
num_labels = 1;
% Load Training Data
load('capitaldata.mat');
% example size
m = size(X, 1);
% initialize theta
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
% Unroll parameters
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];
% find optimal theta
options = optimset('MaxIter', 50);
% set regularization parameter
lambda = 1;
% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunctionLinear(p, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
% Now, costFunction is a function that takes in only one argument (the neural network parameters)
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), num_labels, (hidden_layer_size + 1));
% test case
test = [18 279 86 59 23 16];
pred = predict(Theta1, Theta2, test);
display(pred);
Functions that are called by the above program:
1) randInitializeWeights.m
function W = randInitializeWeights(L_in, L_out)
    % Uniform random weights in [-epsilon_init, epsilon_init]
    W = zeros(L_out, 1 + L_in);
    epsilon_init = 0.12;
    W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end;
2) nnCostFunctionLinear.m should be right since the test result is correct. Let me know if you would like to see it too.
I suspect that the problem is the data set size, the number of features, or the weight initialization.
Thank you in advance for your help!
As a test, you can seed the random number generator with the same value each time, so that it gives the same sequence of random numbers on every run. Search for "random seed" and the name of the software you are using to find out how to set the seed for the random number generator.
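For example, here is a minimal NumPy sketch of the idea (the epsilon of 0.12 mirrors randInitializeWeights above; in MATLAB the corresponding call is rng):

import numpy as np

def rand_initialize_weights(l_in, l_out, epsilon_init=0.12):
    # Uniform initialization in [-epsilon, epsilon], as in randInitializeWeights.
    return np.random.rand(l_out, 1 + l_in) * 2 * epsilon_init - epsilon_init

np.random.seed(42)              # fix the seed
w1 = rand_initialize_weights(6, 3)

np.random.seed(42)              # reseed with the same value
w2 = rand_initialize_weights(6, 3)

print(np.array_equal(w1, w2))   # True: identical weights on every run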
