How to calculate score while using SVM? - machine-learning

I am new to machine learning, and I am a bit confused by the sklearn documentation about how to get the score when using sklearn.svm.SVC.
This is my code:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
for _c in [0.4, 0.6, 0.8, 1.0, 1.2, 1.4]:
    svm = SVC(C=_c, kernel='linear')
    svm.fit(x_train, y_train)
    result = svm.predict(x_test)
    print('C value is {} and score is {}'.format(_c, svm.score(x_test, y_test)))
This is the output:
C value is 0.4 and score is 0.0091324200913242
C value is 0.6 and score is 0.0091324200913242
C value is 0.8 and score is 0.0091324200913242
C value is 1.0 and score is 0.0091324200913242
C value is 1.2 and score is 0.0091324200913242
C value is 1.4 and score is 0.0091324200913242
I see all the scores are the same; my question is how to determine the best score of my model.
Should I pass the predicted values as the y argument of svm.score, i.e.
result = svm.predict(x_test)
svm.score(x_test, result)
or should I pass x_test and y_test, i.e.
svm.score(x_test, y_test)

To your questions:
1. svm.score(x_test, result) is not useful: you would be scoring the model against its own predictions, so the result says nothing about how good the model is.
2. svm.score(x_test, y_test) is the intended usage: score() predicts on x_test internally and compares those predictions against the true targets y_test.
Equivalently, you can compare the true targets with your predictions explicitly:
from sklearn.metrics import accuracy_score
for _c in [0.4, 0.6, 0.8, 1.0, 1.2, 1.4]:
    svm = SVC(C=_c, kernel='linear')
    svm.fit(x_train, y_train)
    result = svm.predict(x_test)
    print('C value is {} and score is {}'.format(_c, accuracy_score(y_test, result)))
This compares your original target values y_test with your predicted values result. That is the idea of testing: you test your predictions against the original values to see how good or bad they are.
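For completeness, here is a small illustrative sketch (using a stand-in dataset, not the asker's data) showing why option 1 is uninformative and that svm.score(x_test, y_test) gives the same number as comparing y_test with the predictions explicitly:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)

svm = SVC(C=1.0, kernel='linear').fit(x_train, y_train)
result = svm.predict(x_test)

print(svm.score(x_test, result))        # always 1.0: the model agrees with its own predictions
print(svm.score(x_test, y_test))        # accuracy against the true labels
print(accuracy_score(y_test, result))   # the same accuracy, computed explicitly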

Related

How are the leaf values of xgboost regression trees related to the prediction?

It seems that the sum of the corresponding leaf values of each tree doesn't equal the prediction. Here is a sample code:
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

X = pd.DataFrame({'x': np.linspace(-10, 10, 10)})
y = X['x'] * 2
model = xgb.XGBRegressor(booster='gbtree', tree_method='exact', n_estimators=100, max_depth=1).fit(X, y)
Xtest = pd.DataFrame({'x': np.linspace(-20, 20, 101)})
Ytest = model.predict(Xtest)
plt.plot(X['x'], y, 'b.-')
plt.plot(Xtest['x'], Ytest, 'r.')
The tree dump reads:
model.get_booster().get_dump()[:2]
['0:[x<0] yes=1,no=2,missing=1\n\t1:leaf=-2.90277791\n\t2:leaf=2.65277767\n',
'0:[x<2.22222233] yes=1,no=2,missing=1\n\t1:leaf=-1.90595233\n\t2:leaf=2.44333339\n']
If I only use one tree to do prediction:
Ytest2 = model.predict(Xtest, ntree_limit=1)
plt.plot(Xtest['x'], Ytest2, '.')
np.unique(Ytest2)  # array([-2.4028, 3.1528], dtype=float32)
Clearly, Ytest2's unique values do not correspond to the leaf values of the first tree, which are -2.90277791 and 2.65277767, although the observed split point is right at 0.
How are the leaf values related to the predictions?
Why are the leaf values in the first tree not symmetric, provided that the input is symmetric?
Before fitting the first tree, xgboost makes an initial prediction. This is controlled by the parameter base_score, which defaults to 0.5. And indeed, -2.902777 + 0.5 ~= -2.4028 and 2.652777 + 0.5 ~= 3.1528.
That also explains your second question: the first tree fits the differences from that initial prediction, and y - 0.5 is not symmetric even though y is. If you set learning_rate=1 you could probably get the predictions to be symmetric after one round, or you could just set base_score=0.
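A minimal sketch (same setup as in the question; the exact leaf values may differ across xgboost versions) that checks this relationship — the one-tree prediction equals base_score plus the first tree's leaf value:
import numpy as np
import pandas as pd
import xgboost as xgb

X = pd.DataFrame({'x': np.linspace(-10, 10, 10)})
y = X['x'] * 2
model = xgb.XGBRegressor(n_estimators=100, max_depth=1, base_score=0.5).fit(X, y)

one_tree = model.predict(X, ntree_limit=1)   # prediction using only the first tree
print(np.unique(one_tree))                   # e.g. [-2.4028  3.1528]
print(np.unique(one_tree) - 0.5)             # e.g. [-2.9028  2.6528], the first tree's leaf values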

How does binary cross entropy loss work on autoencoders?

I wrote a vanilla autoencoder using only Dense layers.
Below is my code:
from keras.layers import Input, Dense
from keras.models import Model
from keras.datasets import mnist

iLayer = Input((784,))
layer1 = Dense(128, activation='relu')(iLayer)
layer2 = Dense(64, activation='relu')(layer1)
layer3 = Dense(28, activation='relu')(layer2)
layer4 = Dense(64, activation='relu')(layer3)
layer5 = Dense(128, activation='relu')(layer4)
layer6 = Dense(784, activation='softmax')(layer5)
model = Model(iLayer, layer6)
model.compile(loss='binary_crossentropy', optimizer='adam')

(trainX, trainY), (testX, testY) = mnist.load_data()
print("shape of the trainX", trainX.shape)
trainX = trainX.reshape(trainX.shape[0], trainX.shape[1] * trainX.shape[2])
print("shape of the trainX", trainX.shape)
model.fit(trainX, trainX, epochs=5, batch_size=100)
Questions:
1) softmax provides a probability distribution. Understood. This means I would have a vector of 784 values, each a probability between 0 and 1, for example [0.02, 0.03, ... up to 784 items], and summing all 784 elements gives 1.
2) I don't understand how binary crossentropy works with these values. Binary cross entropy is for two output values, right?
In the context of autoencoders the input and output of the model are the same. So, if the input values are in the range [0,1], it is acceptable to use sigmoid as the activation function of the last layer. Otherwise, you need to use an appropriate activation function for the last layer (e.g. linear, which is the default one).
As for the loss function, it again comes back to the values of the input data. If the input data are only zeros and ones (and not values in between), then binary_crossentropy is acceptable as the loss function. Otherwise, you need to use other loss functions such as 'mse' (i.e. mean squared error) or 'mae' (i.e. mean absolute error). Note that in the case of input values in the range [0,1] you can use binary_crossentropy, as is commonly done (e.g. the Keras autoencoder tutorial and this paper). However, don't expect the loss value to become zero, since binary_crossentropy does not return zero when both prediction and label are not either zero or one (regardless of whether they are equal). Here is a video from Hugo Larochelle where he explains the loss functions used in autoencoders (the part about using binary_crossentropy with inputs in the range [0,1] starts at 5:30).
Concretely, in your example, you are using the MNIST dataset. So by default the values of MNIST are integers in the range [0, 255]. Usually you need to normalize them first:
trainX = trainX.astype('float32')
trainX /= 255.
Now the values would be in the range [0,1], so sigmoid can be used as the activation function and either binary_crossentropy or mse as the loss function.
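As a minimal sketch of that fix (not the asker's exact architecture; a smaller encoder is used just for illustration), normalize the inputs and replace softmax with sigmoid:
from keras.layers import Input, Dense
from keras.models import Model
from keras.datasets import mnist

(trainX, _), _ = mnist.load_data()
trainX = trainX.reshape(-1, 784).astype('float32') / 255.    # values now in [0, 1]

iLayer = Input((784,))
encoded = Dense(64, activation='relu')(iLayer)
decoded = Dense(784, activation='sigmoid')(encoded)          # sigmoid, not softmax
model = Model(iLayer, decoded)
model.compile(loss='binary_crossentropy', optimizer='adam')  # 'mse' would also work here
model.fit(trainX, trainX, epochs=5, batch_size=100)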
Why can binary_crossentropy be used even when the true label values (i.e. ground truth) are in the range [0,1]?
Note that we are trying to minimize the loss function during training. So if the loss function we use reaches its minimum value (which is not necessarily zero) when the prediction equals the true label, then it is an acceptable choice. Let's verify that this is the case for binary cross-entropy, which is defined as follows:
bce_loss = -y*log(p) - (1-y)*log(1-p)
where y is the true label and p is the predicted value. Let's consider y as fixed and see what value of p minimizes this function: we need to take the derivative with respect to p (I have assumed the log is the natural logarithm function for simplicity of calculations):
bce_loss_derivative = -y*(1/p) - (1-y)*(-1/(1-p)) = 0 =>
-y/p + (1-y)/(1-p) = 0 =>
-y*(1-p) + (1-y)*p = 0    (multiplying both sides by p*(1-p)) =>
-y + y*p + p - y*p = 0 =>
p - y = 0 => y = p
As you can see, binary cross-entropy has its minimum value when p = y, i.e. when the predicted value equals the true label, and this is exactly what we are looking for.
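A quick numeric check of this (illustrative only, with an arbitrary soft label): for a fixed y in [0,1], the loss is smallest at p = y, and that minimum is not zero:
import numpy as np

y = 0.3                                    # an arbitrary "soft" true value
p = np.linspace(0.001, 0.999, 999)
bce = -y * np.log(p) - (1 - y) * np.log(1 - p)
print(p[np.argmin(bce)])                   # ~0.3, the minimizer equals y
print(bce.min())                           # ~0.61 > 0, so the loss does not reach zero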

Difference between the weight parameter in xgb.DMatrix and scale_pos_weight in hyper params list?

I am having a little difficulty understanding the difference between the weight argument in xgb.DMatrix and the scale_pos_weight parameter in the param list. I am going through the following code, which uses the Higgs data.
Due to the data being unbalanced, the author defines a weight parameter:
weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
sumwpos <- sum(weight * (label==1.0))
sumwneg <- sum(weight * (label==0.0))
However, column 32 is already a weight variable, so is the author modifying an already defined weight variable?
Then, the modified weight variable is being set as the "weight" argument of xgb.DMatrix:
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
Additionally, in the param list the author has "scale_pos_weight" = sumwneg / sumwpos.
So scale_pos_weight is a function of sumwneg, which is a function of weight, which is a function of a previously defined weight (column 32). So I am confused.
What does the author do in the following line: weight <- as.numeric(dtrain[[32]]) * testsize / length(label)?
What is the difference between setting the weight in xgb.DMatrix and setting scale_pos_weight again in the param list?
When you set
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
weight should be a vector with one entry per data row.
If, for example, you have the following data:
A B C
1 1 1 1
2 2 2 2
you need to set weight as a vector of two weights:
weight <- c(1, 2)
So the first row gets a weight of 1 and the second row a weight of 2. Why is this useful? Assume event 1 happened once and event 2 happened twice; you would like the weights to reflect how many times each event occurred.
Here are a few more examples of using weights:
If you want recent events to have more "value".
The amount of confidence you have in a data row: set all weights between 0 and 1, where the weight represents how sure you are of that row. For example, weight = 0.88 gives that row 88% confidence.
If you have repeated events: instead of creating more rows, you can add the row once and give it a weight equal to the number of times it was repeated.
scale_pos_weight is usually used when you have imbalanced data. For example, in a classification problem where 5% of the data is labelled 1 and 95% is labelled 0, you would like to give more weight to every positive event, so you can set scale_pos_weight = 19 (or, as the author wrote, sumwneg / sumwpos).
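Here is a minimal sketch in Python (the thread's code is in R, and the numbers here are made up) showing the two knobs side by side: a per-row weight vector passed to the DMatrix, and the global scale_pos_weight parameter:
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 3)
y = (np.random.rand(100) < 0.05).astype(int)   # roughly 5% positives: imbalanced labels

row_weight = np.ones(len(y))                   # one weight per data row
row_weight[:10] = 0.5                          # e.g. less confidence in the first 10 rows

dtrain = xgb.DMatrix(X, label=y, weight=row_weight)
params = {
    'objective': 'binary:logistic',
    # roughly (#negatives / #positives), the analogue of the author's sumwneg / sumwpos
    'scale_pos_weight': float((y == 0).sum()) / max((y == 1).sum(), 1),
}
booster = xgb.train(params, dtrain, num_boost_round=10)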
As for the author redefining weight: I cannot know without the full code what he did there, but I assume he is applying some sort of normalization to the weights.

Naive Bayes classifier performance is unexpected

I have just started using Naive Bayes for text classification. I have coded it from the pseudocode shown below.
I have two classes, positive and negative. I have a total of 2000 samples (IMDB movie reviews), of which 1800 (900 positive, 900 negative) are used to train the classifier and 200 (100 negative, 100 positive) are used to test the system.
It classifies the positive-class documents correctly but fails to classify the negative-class documents properly: all documents belonging to the negative class are misclassified into the positive class, giving an accuracy of 50%.
If I test documents from each class individually (first all documents belonging to the negative class, then the positive test samples), it gives me 100% accuracy, but when I feed it mixed test samples it fails and classifies everything into one class (in my case, positive).
Is there a mistake I am making, or is something missing from this algorithm?
Are the training samples too few, and will classifier performance improve with more training samples?
I have tested the same samples with Weka and RapidMiner, and both give much better accuracy. I know I have made a mistake, but I can't figure out what it is. It is the simplest algorithm to understand, but the accuracy result was totally unexpected and is driving me crazy. Here is my pseudocode. I generate the document vector using tf-idf for term weighting, and the document vector is used for the calculations.
TrainMultinomialNB(C, D)
1. V = ExtractVocabulary(D)
2. N = CountDocs(D)
3. For each c ∈ C
4.   Do Nc = CountDocsInClass(D, c)
5.   prior[c] = Nc / N
6.   textc = ConcatenateTextOfAllDocsInClass(D, c)
7.   For each t ∈ V
8.     Do Tct = CountTokensOfTerm(textc, t)
9.   For each t ∈ V
10.    Do condprob[t][c] = (Tct + 1) / ((Sum over t' in V of Tct') + |V|)
11. Return V, prior, condprob
ApplyMultinomialNB(C, V, prior, condprob, d)
1. W = ExtractTokensFromDoc(V, d)
2. For each c ∈ C
3.   Do score[c] = log(prior[c])
4.   For each t ∈ W
5.     Do score[c] += log(condprob[t][c])
6. Return argmax over c ∈ C of score[c]
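For reference, here is a minimal Python sketch of the same pseudocode (not the asker's implementation; it uses raw token counts rather than tf-idf, with add-one smoothing):
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    # docs: list of (tokens, class_label) pairs
    vocab = {t for tokens, _ in docs for t in tokens}
    classes = {c for _, c in docs}
    prior, condprob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts.values())
        for t in vocab:
            condprob[t][c] = (counts[t] + 1) / (total + len(vocab))
    return vocab, prior, condprob

def apply_multinomial_nb(vocab, prior, condprob, tokens):
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c])               # log(prior[c]), not log(prior)
        for t in tokens:
            if t in vocab:
                scores[c] += math.log(condprob[t][c])
    return max(scores, key=scores.get)

docs = [(['good', 'great', 'fun'], 'pos'), (['bad', 'boring'], 'neg')]
vocab, prior, condprob = train_multinomial_nb(docs)
print(apply_multinomial_nb(vocab, prior, condprob, ['good', 'fun']))   # 'pos'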

Confusion on the cost function in video lecture

I am unable to understand the 2nd and 3rd graphs below.
What does "x" represent here? In graph 1 the value of x doesn't matter, as theta1 is zero. But in graphs 2 and 3, what is the value of "x"?
In graph 2, how did the instructor decide the h(x) value when x is 2?
x can be any independent variable: a signal from a sensor, a person's age, my weight after an all-you-can-eat buffet, ...
theta0 is the zeroth-order term, i.e. independent of x.
theta1 is the first-order term, linear in x.
In graph 2, theta0 is 0 and theta1 is 0.5. Thus when x is 2, h(x) = 0 + 0.5*2 = 1.
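As a tiny illustration (using the values from the answer above), the hypothesis is just h(x) = theta0 + theta1*x:
def h(x, theta0=0.0, theta1=0.5):
    return theta0 + theta1 * x

print(h(2))    # 1.0, the value read off graph 2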
