Multivariate random forest with OpenCV

Let's say we are trying to classify a pencil as healthy or not, and we have two variables for this purpose: the height and weight of the pencil. Now, what should I give to the training method of the random forest implemented in OpenCV? I am really confused by this because I have two different kinds of data; both are numeric, but their units are different. The example below will give a better sense:
Height (cm)   Weight (gr)   Healthy? (bool)
-----------   -----------   ---------------
     10            34              0
      4             6              0
     12            14              1
      8            20              1
      5            18              0
If I train a univariate random forest with only height, the vectors {10, 4, 12, 8, 5} and {0, 0, 1, 1, 0} will be the parameters. However, if I want to use both variables, what will the parameters be?

In Python, if you have multiple variables, the training data can be fed in as a list of tuples, with one tuple (height, weight) per sample.
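For OpenCV itself, the samples are usually packed into a single N x 2 float matrix, one row per pencil. A minimal sketch, assuming the OpenCV 3.x Python bindings (cv2.ml.RTrees) and the toy data from the table above:

import numpy as np
import cv2

# one row per sample: [height_cm, weight_gr]; the units may differ, each column is just a feature
samples = np.array([[10, 34],
                    [ 4,  6],
                    [12, 14],
                    [ 8, 20],
                    [ 5, 18]], dtype=np.float32)
responses = np.array([[0], [0], [1], [1], [0]], dtype=np.int32)  # "healthy?" labels

rtrees = cv2.ml.RTrees_create()
rtrees.setMinSampleCount(2)   # the toy set is tiny, so lower the default minimum samples per split
rtrees.train(samples, cv2.ml.ROW_SAMPLE, responses)

# classify a new pencil (hypothetical values: height 9 cm, weight 25 g)
_, pred = rtrees.predict(np.array([[9, 25]], dtype=np.float32))
print(pred)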

Related

Is there a way to implement a Neural Network able to work with vector target?

I'm trying to implement a neural network model using Keras, where the output is a vector of five elements.
Basically the target contains elements from 0 to 4 and nan, so I can have targets like
[0, 3, 2, 1, 4] and others like [nan, 0, nan, 1, 2]. The important thing is that the elements in the vector are not repeated; only nan can be.
One solution I tried was to use something like a one-hot encoder for the target: this way I transformed a target into a 25-component vector, all zeros with a 1 corresponding to the number to map (i.e. [nan, 0, nan, 1, 2] -> [(0,0,0,0,0), (1,0,0,0,0), (0,0,0,0,0), (0,1,0,0,0), (0,0,1,0,0)] - I'm using the round brackets only to highlight groups of five elements).
Any ideas please?
As far as I have understood, what you're trying to predict is a list of 5 elements, each of which takes a discrete value from the set {nan, 0, 1, 2, 3, 4}.
What you'll need to do is train 5 neural networks (one for each position of the list), each one predicting a value from that set; thus, you need to one-hot encode the outputs, apply a softmax, and select the highest probability for each neural network.
When trying to predict the output list of a new sample, you predict every position, put the predictions in a list, and voilà!
import numpy as np

def predict_sample(sample):
    # each nn_i is the network trained for position i of the output list
    pos_0 = nn0.predict(sample)[0]
    pos_1 = nn1.predict(sample)[0]
    pos_2 = nn2.predict(sample)[0]
    pos_3 = nn3.predict(sample)[0]
    pos_4 = nn4.predict(sample)[0]
    outp = np.array([pos_0, pos_1, pos_2, pos_3, pos_4], dtype=float)
    # if nan is encoded as 5 then:
    outp[outp == 5] = np.nan
    return outp
You cannot guarantee that the predicted values will be unique; only the data will determine that. What you can do, for example, is take the second-highest probability whenever a value has already been predicted at an earlier position of the list.
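If you do want to enforce non-repetition at prediction time, a rough sketch of that idea (assuming each per-position network's predict returns its softmax probabilities over the 6 codes, with nan encoded as 5) is to fall back to the next most probable class whenever a value is already taken:

import numpy as np

def predict_sample_unique(sample, models, nan_code=5):
    # models: the five per-position networks, e.g. [nn0, nn1, nn2, nn3, nn4]
    used = set()
    outp = []
    for nn in models:
        probs = nn.predict(sample)[0]            # softmax probabilities over the 6 codes
        for cls in np.argsort(probs)[::-1]:      # candidate classes, most probable first
            if cls == nan_code:                  # nan may repeat freely
                outp.append(np.nan)
                break
            if cls not in used:                  # non-nan values must be unique
                used.add(int(cls))
                outp.append(int(cls))
                break
    return outp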

how to deal with features which are both numerical and categorical

What is the best approach to deal with a feature which is both numerical and categorical? Take the following feature X for example:
X
1
5
3
0
1
10
10
7
0
5
9
9
Here X represents a credit score which should range from 1 to 10, and X = 0 means that for this instance the credit score doesn't exist.
How should I deal with it while using models like random forest or logistic regression for a classification problem? Thank you.
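One common way to handle this kind of feature (shown here only as a minimal sketch, with hypothetical column names) is to keep the numeric score, add an explicit "score is missing" indicator, and replace the 0 marker with a neutral value so the model does not read it as an extremely low score:

import pandas as pd

df = pd.DataFrame({'X': [1, 5, 3, 0, 1, 10, 10, 7, 0, 5, 9, 9]})

# explicit flag: does a credit score exist for this instance?
df['X_missing'] = (df['X'] == 0).astype(int)

# replace the 0 marker with something neutral, e.g. the median of the existing scores
median_score = df.loc[df['X'] > 0, 'X'].median()
df['X_imputed'] = df['X'].replace(0, median_score)

print(df[['X', 'X_missing', 'X_imputed']])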

How to interpret scikit's learn confusion matrix and classification report?

I have a sentiment analysis task; for this I'm using this corpus. The opinions have 5 classes (very neg, neg, neu, pos, very pos), labeled 1 to 5. So I do the classification as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2, 2))

from sklearn.cross_validation import train_test_split, cross_val_score
import pandas as pd

df = pd.read_csv('/corpus.csv',
                 header=0, sep=',', names=['id', 'content', 'label'])
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values

from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.33)

from sklearn.svm import SVC
svm_1 = SVC(kernel='linear')
svm_1.fit(X, y)
svm_1_prediction = svm_1.predict(X_test)
Then, using the metrics module, I obtained the following confusion matrix and classification report:
from sklearn.metrics import classification_report, confusion_matrix
print '\nClassification report:\n', classification_report(y_test, svm_1_prediction)
print '\nConfusion matrix:\n', confusion_matrix(y_test, svm_1_prediction)
Then, this is the result:
Classification report:
             precision    recall  f1-score   support

          1       1.00      0.76      0.86        71
          2       1.00      0.84      0.91        43
          3       1.00      0.74      0.85        89
          4       0.98      0.95      0.96       288
          5       0.87      1.00      0.93       367

avg / total       0.94      0.93      0.93       858
Confusion matrix:
[[ 54   0   0   0  17]
 [  0  36   0   1   6]
 [  0   0  66   5  18]
 [  0   0   0 273  15]
 [  0   0   0   0 367]]
How can I interpret the above confusion matrix and classification report? I tried reading the documentation and this question, but I still can't interpret what happened here, particularly with this data. Why is this matrix somehow "diagonal"? On the other hand, what do the recall, precision, f1-score and support mean for this data? What can I say about this data? Thanks in advance, guys.
The classification report should be straightforward: a report of precision/recall/F-measure for each class in your test data. In multiclass problems it is not a good idea to read precision, recall and F-measure over the whole data, because any imbalance would make you feel you've reached better results than you really have. That's where such per-class reports help.
Coming to the confusion matrix, it is a much more detailed representation of what's going on with your labels. There were 71 points in the first class (label 1). Out of these, your model was successful in identifying 54 of them correctly as label 1, but 17 were marked as label 5. Similarly, look at the second row: there were 43 points in class 2, of which 36 were marked correctly, while your classifier predicted 1 of them as class 4 and 6 as class 5.
Now you can see the pattern this follows. An ideal classifier with 100% accuracy would produce a pure diagonal matrix, with all points predicted in their correct class.
Coming to recall/precision: they are among the most commonly used measures for evaluating how well your system works. You had 71 points in the first class (label 1), and your classifier got 54 of them right; that's your recall: 54/71 ≈ 0.76. Now look only at the first column of the matrix. There is one cell with 54, and the rest are zeros. This means your classifier marked 54 points as class 1, and all 54 of them were actually in class 1. That's precision: 54/54 = 1. Now look at the last column (class 5). Its entries are scattered across all five rows: 367 of them were marked correctly and the rest were not, which reduces the precision (367/423 ≈ 0.87).
The F-measure is the harmonic mean of precision and recall: F1 = 2 · precision · recall / (precision + recall).
Be sure to read the details about these: https://en.wikipedia.org/wiki/Precision_and_recall
Here's the documentation for scikit-learn's sklearn.metrics.precision_recall_fscore_support method: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support
It seems to indicate that the support is the number of occurrences of each particular class in the true responses (responses in your test set). You can calculate it by summing the rows of the confusion matrix.
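As a quick sanity check, the support, recall and precision in the report can all be recomputed directly from the confusion matrix above; a minimal NumPy sketch:

import numpy as np

# confusion matrix from the question: rows = true class (1..5), columns = predicted class
cm = np.array([[ 54,   0,   0,   0,  17],
               [  0,  36,   0,   1,   6],
               [  0,   0,  66,   5,  18],
               [  0,   0,   0, 273,  15],
               [  0,   0,   0,   0, 367]])

support   = cm.sum(axis=1)                  # how many test samples truly belong to each class
recall    = np.diag(cm) / cm.sum(axis=1)    # correct predictions / true samples per class
precision = np.diag(cm) / cm.sum(axis=0)    # correct predictions / predicted samples per class
f1        = 2 * precision * recall / (precision + recall)

print(support)              # [ 71  43  89 288 367]
print(recall.round(2))      # [0.76 0.84 0.74 0.95 1.  ]
print(precision.round(2))   # [1.   1.   1.   0.98 0.87]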
A confusion matrix tells us about the distribution of our predicted values across all the actual outcomes. Accuracy, recall (sensitivity), precision, specificity and other similar metrics are all derived from the confusion matrix.
F1 scores are the harmonic means of precision and recall.
The support column in the classification report tells us the actual count of each class in the test data.
The rest is explained beautifully above.
Thank you.

Correct way to do Min-Max normalization

I am implementing alphabet classification using OpenCV SVM.
I have a doubt about normalizing the feature vector.
I have two ways of normalizing the feature vector, and I need to find out which is the logically correct normalization method.
Method 1
Suppose I have 3 feature vectors as follows:
[2, 3, 8, 5 ] -> image 1
[3, 5, 2, 5 ] -> image 2
[9, 3, 8, 5 ] -> image 3
Each value in the feature vector is obtained by convolving the pixel with a kernel.
Currently I am finding the maximum and minimum value of each column and doing the normalization based on that.
In the above case the first column is [2, 3, 9]
min = 2
max = 9
and the normalization of the first column is done based on that. Likewise, all other columns are normalized.
Method 2
If the kernel is as follows
[-1 0 1]
[-1 0 1]
[-1 0 1]
then the maximum and minimum values that can be obtained by convolving with the above kernel are as follows (8-bit image, intensity range 0-255):
max val = 765
min val = -765
And normalize every value with the above max/min?
Which is the logically correct way to do the normalization (method 1 or method 2)?
The standard way to do it is method 1 (see the answer to this question). I also recommend that you read this paper for a good reference about SVM training.
However, in your case the range of all features computed with the same kernel will be similar, and method 1 may hurt more than it helps (for example, by amplifying the noise of almost-constant features).
So my advice would be: test both methods and evaluate the performance to see what works best in your case.
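For reference, a minimal sketch of method 1 (per-column min-max scaling) on the example vectors above; one detail worth keeping in mind is that the column min/max computed on the training set should also be reused when normalizing test images:

import numpy as np

def fit_min_max(train_feats):
    # per-column min/max, computed on the training set only
    return train_feats.min(axis=0), train_feats.max(axis=0)

def min_max_normalize(feats, col_min, col_max):
    # scale each column to [0, 1]; guard against constant columns to avoid division by zero
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (feats - col_min) / span

train = np.array([[2, 3, 8, 5],
                  [3, 5, 2, 5],
                  [9, 3, 8, 5]], dtype=np.float32)

col_min, col_max = fit_min_max(train)
print(min_max_normalize(train, col_min, col_max))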

Bag of Visual Words in Opencv

I am using BOW in OpenCV for clustering features of variable size. However, one thing is not clear from the OpenCV documentation, and I am unable to find an answer to this question:
assume: dictionary size = 100.
I use SURF to compute the features, and each image has a variable number of descriptors, e.g. 128 x 34, 128 x 63, etc. Now in BOW each of them is clustered and I get a fixed descriptor size of 128 x 100 for an image. I know 100 is the number of cluster centers created using k-means clustering.
But I am confused: if an image has 128 x 63 descriptors, then how can it be clustered into 100 clusters, which is impossible using k-means UNLESS I convert the descriptor matrix to 1D? Won't converting to 1D lose the valid 128-dimensional information of a single keypoint?
I need to know how the descriptor matrix is manipulated to get 100 cluster centers from only 63 features.
Think of it like this.
You have 10 cluster means in total and 6 features for the current image. The first 3 of those features are closest to the 5th mean, and the remaining 3 are closest to the 7th, 8th and 9th means respectively. Then your feature vector will be [0, 0, 0, 0, 3, 0, 1, 1, 1, 0], or a normalized version of it. It is 10-dimensional, which is equal to the number of cluster means. So you can create a 100000-dimensional vector from 63 features if you want.
But I still think there is something wrong, because after you apply BOW your features should be 1 x 100, not 128 x 100. Your cluster means are 128 x 1 and you are assigning your 128 x 1 features (you have 34 features of size 128 x 1 for the first image, 63 for the second, etc.) to those means. So basically you are assigning 34 or 63 features to 100 means, and your result should be 1 x 100.
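A minimal NumPy sketch of that assignment step (with random descriptors and a random vocabulary, purely to show the shapes): each 128-dimensional descriptor is assigned to its nearest cluster center, and the image ends up represented by a 1 x 100 histogram, regardless of whether it had 34 or 63 keypoints:

import numpy as np

def bow_histogram(descriptors, centers):
    # descriptors: (n_keypoints, 128), centers: (n_clusters, 128)
    # squared Euclidean distance from every descriptor to every cluster center
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)                                   # closest center per descriptor
    hist = np.bincount(nearest, minlength=len(centers)).astype(np.float32)
    return hist / hist.sum()                                     # normalized 1 x n_clusters vector

descs = np.random.rand(63, 128).astype(np.float32)    # e.g. 63 SURF descriptors of one image
vocab = np.random.rand(100, 128).astype(np.float32)   # 100-word vocabulary from k-means
print(bow_histogram(descs, vocab).shape)               # (100,)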
