Best similarity (dissimilarity) measure among multidimensional categorical vectors

I would like to find the similarity (or dissimilarity) among the following data points.
My categorical data set is as follows: { Art, Science, Maths, Medical, Physics, Chemistry, Engineering, etc. }, for example 15 or 20 categories.
I would like to find Sim(Dis) among these libraries, where each library (data point) is a row vector:
                          Books attributes
libraries   total-books   Art   science   Math.   chemistry
lib1        1000          50    200       0       3
lib2        500           12    0         0       44
lib3        etc..
The table shows the number of books found in each library for each category. I compute each category's frequency as a percentage of the library's total books, then order the categories of each library by that percentage, for example:
(I do not include the zero-count categories in the following vectors.)
library 1 = { Science, Art, Chemistry, ... }
library 2 = { Chemistry, Art, ... }
etc.
How can I find the similarity / dissimilarity between lib1 and lib2, and so on?
Any suggestions, please.

If you normalize by the total number of books, you can treat the remaining columns as a histogram.
Then you could try any of the distribution-based distances:
* histogram intersection distance
* Kullback-Leibler divergence
* $\chi^2$ distance
* Jensen-Shannon divergence
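As a rough illustration of the distances listed above, here is a minimal pure-Python sketch. The function names are made up for this example, and it makes one assumption that is not in the answer: each row is normalized by the sum of the four listed category counts (not by the total-books column), so that each histogram sums to 1. The counts are the lib1 and lib2 rows from the question.

```python
import math

def normalize(counts):
    """Turn raw per-category book counts into a histogram that sums to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def intersection_distance(p, q):
    """1 minus the histogram intersection: 0 for identical, 1 for disjoint histograms."""
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))

def chi2_distance(p, q):
    """Symmetric chi-squared distance; bins that are empty in both histograms are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits; infinite when q has an empty bin where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log2(a / b)
    return total

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and always finite."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Category counts from the question's table (Art, science, Math., chemistry).
lib1 = normalize([50, 200, 0, 3])
lib2 = normalize([12, 0, 0, 44])

print("intersection:", intersection_distance(lib1, lib2))
print("chi-squared: ", chi2_distance(lib1, lib2))
print("KL(lib1||lib2):", kl_divergence(lib1, lib2))
print("Jensen-Shannon:", js_divergence(lib1, lib2))
```

Because lib2 has no science books while lib1 does, the plain Kullback-Leibler divergence is infinite here. With sparse category counts like these, the Jensen-Shannon divergence or some smoothing of the zero bins is probably easier to work with.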

Related

Why am I getting almost the same top 10 features using a Multinomial Naive Bayes classifier for the positive and negative classes?

After running MultinomialNB multiple times, I'm getting the same features for the +ve and -ve classes with both BoW and Tf-Idf.
I even tried it on bi-grams and tri-grams, and I still get the same features for both classes.
best_alpha = 6
clf = MultinomialNB( alpha=best_alpha )
clf.fit(X_tr, y_train)
y_train_pred = batch_predict(clf, X_tr)
y_test_pred = batch_predict(clf, X_te)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
This is the code for getting the top 10 features of the positive and negative classes for the Tf-Idf text data.
feats_tfidf contains the features of the categorical, numerical and text data.
For the positive class:
sorted_idx = np.argsort( clf.feature_log_prob_[1] )[-10:]
for p,q in zip(feats_tfidf[ sorted_idx ], clf.feature_log_prob_[1][ sorted_idx ]):
    print('{:45}:{}'.format(p,q))
Output:
Mathematics :-7.134937347073638
Literacy :-6.910334729871051
Grades_3_5 :-6.832969821702653
Ms :-6.791634814736902
Math_Science :-6.748584860699069
Grades_PreK_2 :-6.664767807632341
Literacy_Language :-6.4833650280402875
Mrs :-6.404885953106168
Teacher number of previously posted projects :-3.285663623429455
price :-0.09775430166978438
For the negative class:
sorted_idx = np.argsort( clf.feature_log_prob_[0] )[-10:]
for p,q in zip(feats_tfidf[ sorted_idx ], clf.feature_log_prob_[0][ sorted_idx ]):
    print('{:45}:{}'.format(p,q))
Output:
Literacy :-7.31906682336635
Mathematics :-7.318545582802034
Grades_3_5 :-7.088236519755028
Ms :-6.970453484098645
Math_Science :-6.887189615718408
Grades_PreK_2 :-6.85882128589294
Literacy_Language :-6.8194613665941155
Mrs :-6.648860662073821
Teacher number of previously posted projects :-4.008908256269724
price :-0.08131982830664697
Could someone please tell me whether this is the correct way of doing it?
It should be like this:
sorted_idx = np.argsort(-1 * clf_bow.feature_log_prob_[0])[:10]
feature_names = count_vect.get_feature_names_out()  # use get_feature_names() on scikit-learn older than 1.0
for i in sorted_idx:
    print(feature_names[i])
With [-10:] you print the ten largest values in ascending order, so the largest one comes last,
but we want them printed in descending order, starting from the largest: n, n-1, n-2, ...
I'm working on the same problem, and yes, I too got many top features that are common to both classes, though not in exactly the same order as yours.
Here's how I did it:
I first paired each feature with its probability value (the exponential of the log-probability) and then sorted the pairs in descending order.
top 10 Positive class features
top 10 Negative class features
So yes, I think what you're getting is correct.
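The pairing and sorting described in this answer can be sketched roughly as follows. This is not the answerer's actual code (which was not posted). The helper name `top_features` is made up for this example, and the log-probabilities are copied from the positive-class output in the question.

```python
import math

# Log-probabilities copied from the question's positive-class output.
positive_log_prob = {
    "Mathematics": -7.134937347073638,
    "Literacy": -6.910334729871051,
    "Grades_3_5": -6.832969821702653,
    "Ms": -6.791634814736902,
    "Math_Science": -6.748584860699069,
    "Grades_PreK_2": -6.664767807632341,
    "Literacy_Language": -6.4833650280402875,
    "Mrs": -6.404885953106168,
    "Teacher number of previously posted projects": -3.285663623429455,
    "price": -0.09775430166978438,
}

def top_features(log_prob_by_feature, k=10):
    """Return (feature, probability) pairs sorted by probability, largest first."""
    pairs = [(name, math.exp(lp)) for name, lp in log_prob_by_feature.items()]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:k]

for name, prob in top_features(positive_log_prob, k=3):
    print("{:45}: {:.6f}".format(name, prob))
```

Sorting in descending order changes only the order in which the ten features are printed, not which features are selected, so it lists the same features as the [-10:] slice in the question.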

Preprocessing categorical data already converted into numbers

I'm fairly new to machine learning, so I don't know the correct terminology, but I converted two categorical columns into numbers in the following way. These columns are part of my feature inputs, akin to the sex column in the Titanic dataset.
(They are not the target data y, which I have already created.)
                      changed  p_changed
Date
2010-02-17  0.477182        0          0
2010-02-18  0.395813        0          0
2010-02-19  0.252179        1          1
2010-02-22  0.401321        0          1
2010-02-23  0.519375        1          1
Now the rest of my data X looks something like this:
            Open  High   Low  Close    Volume  Adj Close  log_return  \
Date
2010-02-17  2.07  2.07  1.99   2.03  219700.0       2.03   -0.019513
2010-02-18  2.03  2.03  1.99   2.03  181700.0       2.03    0.000000
2010-02-19  2.03  2.03  2.00   2.02  116400.0       2.02   -0.004938
2010-02-22  2.05  2.05  2.02   2.04  188300.0       2.04    0.009852
2010-02-23  2.05  2.07  2.01   2.05  255400.0       2.05    0.004890
            close_open  Daily_Change  30_Avg_Vol  20_Avg_Vol  15_Avg_Vol  \
Date
2010-02-17        0.00         -0.04    0.909517    0.779299    0.668242
2010-02-18        0.00          0.00    0.747470    0.635404    0.543015
2010-02-19        0.00         -0.01    0.508860    0.417706    0.348761
2010-02-22        0.03         -0.01    0.817274    0.666903    0.562414
2010-02-23        0.01          0.00    1.078411    0.879007    0.742730
As you can see, the rest of my data is continuous (containing many variables), as opposed to the two categorical columns, which only have two values (0 and 1).
I was planning to preprocess all this data in one shot via this simple preprocessing method:
X_scaled = preprocessing.scale(X)
I was wondering if this is a mistake. Is there something else I need to do to the categorical values before using this simple preprocessing?
EDIT: I tried two ways. First I tried scaling the full data, including the categorical data converted to 1's and 0's.
Full_X = OPK_df.iloc[:-5, 0:-5]
Full_X_scaled = preprocessing.scale(Full_X)  # First way: scales everything in one shot.
Then I tried dropping the last two columns, scaling, and then adding the dropped columns back with this code.
X = OPK_df.iloc[:-5, 0:-7]  # Stopping at -7 instead of the original -5 drops two extra columns.
I created another dataframe which has the two columns I dropped:
x2 = OPK_df.iloc[:-5, -7:-5]
x2 = np.array(x2)  # convert it to an array
# Preprocess the data without the last two columns
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
# Then concatenate X_scaled with x2 (the originally dropped columns)
X = np.concatenate((X_scaled, x2), axis=1)
# Create the classifiers
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn2 = KNeighborsClassifier(n_neighbors=5)
knn.fit(Full_X_scaled, y)  # fit on the same features that are scored below
knn2.fit(X, y)
knn.score(Full_X_scaled, y)
0.71396522714526078
knn2.score(X, y)
0.71789119461581608
So there is a higher score when I do drop the two columns during standardization.
You're doing pretty well so far. Do not scale your categorical data. Since those appear to be binary categories, think of them as "Yes" and "No". What does it mean to scale these?
Even worse, consider that you might have categories such as flower types: you've coded Zinnia=0, Rose=1, Orchid=2, etc. What does it mean to scale those? It doesn't make any sense to re-code these as Zinnia=-0.257, Rose=+0.448, etc.
Scaling your continuous input data is the necessary part: it keeps the values within comparable ranges (mathematical influence), allowing you to readily use a single treatment for your loss function. Otherwise, the feature with the largest spread of values would have the greatest influence on training, until your model's weights learned how to properly discount the large values.
For your beginning explorations, don't do any other preprocessing: just scale the continuous input data and start your fitting exercises.
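The advice above could be sketched roughly like this with plain NumPy. The helper `scale_continuous` is hypothetical and not part of scikit-learn. The toy rows reuse the Volume and log_return values from the question together with the two binary columns. Each continuous column is standardized to zero mean and unit variance, and the binary columns pass through untouched.

```python
import numpy as np

def scale_continuous(X, binary_cols):
    """Standardize every column except the ones listed in binary_cols."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    skip = set(binary_cols)
    continuous = [j for j in range(X.shape[1]) if j not in skip]
    mean = X[:, continuous].mean(axis=0)
    std = X[:, continuous].std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    out[:, continuous] = (X[:, continuous] - mean) / std
    return out

# Columns: Volume, log_return (continuous), changed, p_changed (binary).
X = np.array([
    [219700.0, -0.019513, 0, 0],
    [181700.0,  0.000000, 0, 0],
    [116400.0, -0.004938, 1, 1],
    [188300.0,  0.009852, 0, 1],
    [255400.0,  0.004890, 1, 1],
])
X_prepared = scale_continuous(X, binary_cols=[2, 3])
print(X_prepared)
```

This gives the same kind of matrix as the second approach in the question (scale, then concatenate the dropped columns), without having to split and rejoin the data by hand.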

Compute similarity between n entities

I am trying to compute the similarity between n entities that are being described by entity_id, type_of_order, total_value.
An example of the data might look like:
NR  entity_id  type_of_order  total_value
1   1          A              10
2   1          B              90
3   1          C              70
4   2          B              20
5   2          C              40
6   3          A              10
7   3          B              50
8   3          C              20
9   4          B              50
10  4          C              80
My question is: what is a good way of measuring the similarity between, for example, entity_id 1 and 2 with regard to the type_of_order and the total_value for that type of order?
Would a simple KNN give satisfactory results or should I consider other algorithms?
Any suggestion would be much appreciated.
The similarity metric is a heuristic to capture a relationship between two data rows, with respect to the data semantics and the purpose of the training. We don't know your data; we don't know your usage. It would be irresponsible to suggest metrics to solve a problem when we have no idea what problem we're solving.
You have to address this question to the person you find in the mirror. You've given us three features with no idea of what they mean or how they relate. You need to quantify ...
* Relative distances within features: under type_of_order, what is the relationship (distance) between any two measurements? If we arbitrarily assign d(A, B) = 1, then what is d(B, C)? We have no information to help you construct this. Further, if we give that some value c, then what is d(A, C)? In various popular metrics, it could be 1+c, |1-c|, all distances could be 1, or perhaps it's something else, even more than 1+c in some applications.
  Even in the last column, we cannot assume that d(10, 20) = d(40, 50); the actual difference could be a ratio, difference of squares, etc. Again, this depends on the semantics behind these labels.
* Relative weights between features: How do the differences in the various columns combine to provide a similarity? For instance, how does d([A, 10], [B, 20]) compare to d([A, 10], [C, 30])? That's two letters in the left column, two steps of 10 in the right column. How about d([A, 10], [A, 20]) vs d([A, 10], [B, 10])? Are the distances linear, or do the relationships change as we slide up the alphabet or to higher numbers?
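To make the two questions above concrete, here is a small sketch in which every choice is an explicit, made-up parameter: the table of category distances, the two weights and the use of an absolute difference for total_value are all assumptions, not recommendations. It only shows that the answer to "which pair is closer?" flips with those choices.

```python
def combined_distance(row_a, row_b, category_distance, w_type, w_value):
    """Weighted sum of a per-category distance and an absolute value difference."""
    type_a, value_a = row_a
    type_b, value_b = row_b
    if type_a == type_b:
        d_type = 0.0
    else:
        d_type = category_distance[frozenset((type_a, type_b))]
    d_value = abs(value_a - value_b)
    return w_type * d_type + w_value * d_value

# Hypothetical choice: all distinct order types are equally far apart.
category_distance = {
    frozenset(("A", "B")): 1.0,
    frozenset(("B", "C")): 1.0,
    frozenset(("A", "C")): 1.0,
}

same_type = (("A", 10), ("A", 20))   # d([A, 10], [A, 20])
same_value = (("A", 10), ("B", 10))  # d([A, 10], [B, 10])

for w_type, w_value in [(1.0, 0.01), (1.0, 1.0)]:
    d1 = combined_distance(*same_type, category_distance, w_type, w_value)
    d2 = combined_distance(*same_value, category_distance, w_type, w_value)
    print("weights", (w_type, w_value), "same type:", d1, "same value:", d2)
```

With the first pair of weights, the rows that share an order type are closer. With the second pair, the rows that share a value are closer. Which of the two is right depends on what the data means, which is the point of the answer.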

Dynamic Time Warping as a classifier, a good idea?

Before you start reading please forgive me for the bad English, thanks.
I am in the final year of a computer engineering course in Libya.
My graduation project is called "Speech Recognition System for isolated words using classifier fusion method".
The basic idea of the project is that I input a 1-second recording of a number (0-9), and it gets displayed on the screen as text.
My steps are:
* Input the word.
* Pre-process the speech signal.
* Extract features using Mel Frequency Cepstral Coefficients (MFCC).
* Classify the word using:
  * MED Classifier.
  * Dynamic Time Warping Classifier.
  * Bayes Classifier.
* Classifier Fusion: combination of the above classifiers, hoping to compensate for weak classifier performance.
After I used MFCC and extracted my features, I used the MED classifier just to have a look at the whole ASR system and visualize how it should work.
Then I started with the DTW classifier, and to be honest I am not sure I am doing it right. Here is the code; if anyone has used DTW as a classifier before, please tell me whether using DTW is a good idea and, if so, whether I am doing it right.
test.mat has two variables in it: 'm' is the spoken word of the number one, and 'b' is also the spoken word of the number one, but each was recorded separately. I will then keep 'm' and compare it to the recorded word two. The cost of 1 vs 1 should be smaller than the cost of 1 vs 2, but that is not so in my case. Why is that?
clear;
load('test.mat')
b=m;
m=b;
dis=zeros(length(m),length(b));
ac_cost=zeros(length(m),length(b));
cost=0;
p=[];
%we create the distance matrix by calculating the squared Euclidean distance
%between all pairs
for i = 1 : length(m)
    for j = 1 : length(b)
        dis(i,j)=(b(j)-m(i))^2;
    end
end
ac_cost(1,1)=dis(1,1);
%calculate first row
for i = 2 : length(b)
    ac_cost(1,i)=dis(1,i)+ac_cost(1,i-1);
end
%calculate first column
for i = 2 : length(m)
    ac_cost(i,1)=dis(i,1)+ac_cost(i-1,1);
end
%calculate the rest of the matrix
for i = 2 : length(m)
    for j = 2 : length(b)
        ac_cost(i,j)=min([ac_cost(i-1,j-1),ac_cost(i-1,j),ac_cost(i,j-1)])+dis(i,j);
    end
end
%find the best path
i=length(m)
j=length(b)
cost=cost+dis(i,j)+dis(1,1)
while i>1 && j>1
    cost=cost+min([dis(i-1, j-1), dis(i-1, j), dis(i, j-1)]);
    if i==1
        j=j-1;
    elseif j==1
        i=i-1;
    else
        if ac_cost(i-1,j)==min([ac_cost(i-1, j-1), ac_cost(i-1, j), ac_cost(i, j-1)])
            i=i-1;
        elseif ac_cost(i,j-1)==min([ac_cost(i-1, j-1), ac_cost(i-1, j), ac_cost(i, j-1)])
            j=j-1;
        else
            i=i-1;
            j=j-1;
        end
    end
end
Thank you all in advance
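For comparison, here is a minimal Python sketch of DTW used as a nearest-template classifier. It is not a fix of the MATLAB code above. It follows the standard formulation, in which the total alignment cost is simply the last cell of the accumulated-cost matrix, so no separate sum along the backtracked path is needed. The function names and the two template sequences are made up, and scalar samples are assumed, as in the question's code, whereas real MFCC features would be one vector per frame.

```python
def dtw_cost(a, b):
    """Accumulated DTW cost between two 1-D sequences, using squared differences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] is the cheapest cost of aligning a[:i] with b[:j];
    # row 0 and column 0 are padding so the recurrence needs no special cases.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            acc[i][j] = local + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[n][m]

def classify(sample, templates):
    """Return the label of the template with the smallest DTW cost to the sample."""
    return min(templates, key=lambda label: dtw_cost(sample, templates[label]))

# Made-up reference sequences standing in for recorded words.
templates = {"one": [0, 1, 2, 1, 0], "two": [0, 3, 3, 3, 0]}
sample = [0, 0, 1, 2, 2, 1, 0]  # a time-stretched version of "one"

for label, template in templates.items():
    print(label, dtw_cost(sample, template))
print("predicted:", classify(sample, templates))
```

In this toy case the stretched sample aligns with the "one" template at zero cost, which is the behaviour the question expects when comparing 1 vs 1 against 1 vs 2.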

How to compute Document Length and Average Document Length in BM25

Could anyone please tell me how to compute the document length (dl) and the average document length (avdl) in BM25? For example, we have the following 4 documents:
new york times east // Doc1
los angeles times west //Doc2
washington post district columbia //Doc3
wall street journal north //Doc4
The first step is to remove stop-words and perform stemming so that we can consider a document d as a set of constituent terms with corresponding term frequencies $\{tf(t,d) : t \in d\}$.
Now, the notion of document length is slightly different in vector space models and in probabilistic models, e.g. BM25, language models, etc. In the former, document length refers to the norm of a vector, while in the latter it typically refers to the total number of terms in a document.
Nonetheless, the vector norm notion of document length can, in principle, also be applied to probabilistic models, because the term frequency values still remain normalized between 0 and 1. However, the normalized term frequency values would no longer sum to 1.
To illustrate with your example: in the case of the vector space model, the length is defined as the norm of a vector, which, in the case of Doc1, is norm(Doc1) = square root of the sum of squares of the term frequency values for each unique term in Doc1 = sqrt(1^2 + 1^2 + 1^2 + 1^2) = sqrt(4) = 2.
For the probabilistic models, the length would be defined as the sum of the term frequencies of the component terms = 1 + 1 + 1 + 1 = 4. The normalized term frequency value of a term t would be P(t,d) = tf(t,d)/dl(d), so that $\sum_{t \in d} P(t,d) = 1$, e.g. 1/4 + 1/4 + 1/4 + 1/4 = 1.
The BM25Similarity implementation of Lucene uses vector norms as document lengths, whereas Terrier uses the sum of the tfs of the constituent terms as document lengths.
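Both notions of length can be computed for the four example documents with a short sketch like the one below. It assumes whitespace tokenization and skips stop-word removal and stemming (none of the example terms looks like a stop word), and it takes avdl to be the plain mean of the per-document lengths over the collection. The helper names are made up for this example.

```python
import math
from collections import Counter

docs = {
    "Doc1": "new york times east",
    "Doc2": "los angeles times west",
    "Doc3": "washington post district columbia",
    "Doc4": "wall street journal north",
}

def term_frequencies(text):
    """Map each term to its frequency tf(t, d), using whitespace tokenization."""
    return Counter(text.split())

def length_as_term_count(tf):
    """Probabilistic-model notion: sum of the term frequencies in the document."""
    return sum(tf.values())

def length_as_vector_norm(tf):
    """Vector-space notion: Euclidean norm of the term-frequency vector."""
    return math.sqrt(sum(f * f for f in tf.values()))

tfs = {name: term_frequencies(text) for name, text in docs.items()}
dl = {name: length_as_term_count(tf) for name, tf in tfs.items()}
norms = {name: length_as_vector_norm(tf) for name, tf in tfs.items()}
avdl = sum(dl.values()) / len(dl)  # average document length over the collection

print("dl:", dl)
print("vector norms:", norms)
print("avdl:", avdl)
```

Every example document has four distinct terms, so all document lengths coincide here. With documents of different sizes, the dl/avdl ratio in BM25 is what penalizes the longer ones.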
