Classification Supervised Training Confusion - machine-learning

I am new to supervised machine learning, but I've been reading books and articles about it, and I'm stuck on a problem (not so much stuck as I don't understand the logic behind classification algorithms). I am trying to classify records as wrong or not based on historical data.
So this is the original data (training data):
Name Office Age isWrong
F1 1 32 0
F2 2 61 1
F3 1 35 0
F4 0 25 0
F5 1 36 0
F6 2 52 0
F7 2 48 0
F8 1 17 1
F9 2 51 0
F10 0 24 0
F11 4 34 1
F12 0 21 0
F13 2 51 0
F14 0 27 0
F15 3 37 1
(only showing the first 15 of 200 records)
A wrong record is any record that reports an age LOWER than 18 or HIGHER than 60, or an office location that is NOT in {0, 1, 2}. I have more records that show a 1 when any of these conditions is met. I trained my model on this dataset and created a test dataset to check the results. However, I end up getting 0 in the prediction column for every record. I used a Naïve Bayes approach because it assumes independence between the feature variables, which is my case (no relationship between office number and age). I know there are other methods such as Logistic Regression and SVC (SVM), but I assume they require some degree of relationship between the feature variables. Despite that, I still tried those two approaches and got the same results. Am I doing something wrong? Do I need to specify something before training my model?
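For concreteness, the labeling rule described above can be written as a small sketch (shown in Python purely as an illustration; the function name is made up):
def is_wrong(office, age):
    # wrong if age is below 18 or above 60, or the office is not 0, 1 or 2
    return int(age < 18 or age > 60 or office not in (0, 1, 2))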
Here is what I did (very simple):
// Spark ML reads the label from "isWrong" and, by default, expects the input
// features assembled into a vector column named "features" (setFeaturesCol to change)
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);   // train on the labeled data
nbm.transform(dataset2).show();          // append predictions to the test data
Here is dataset2 (top 15):
Name Office Age
F1 9 36 //wrong, office is 9
F2 2 20
F3 1 17
F4 2 43
F5 2 90 // wrong, age is >60
F6 1 36
F7 1 40
F8 2 52
F9 2 49
F10 1 38
F11 0 28
F12 0 18
F13 1 40
F14 1 31
F15 2 45
But like I said, the prediction column displays 0 every time. Any idea why?

I don't know why you are opting for transform(); it just tries to cast the result dtype to the same one the original column has.
To get probabilities you should be using the function:
predict_proba(X): Return probability estimates for the test vector X.
Note that predict_proba() is part of the scikit-learn API rather than Spark ML; with a scikit-learn Naive Bayes the call sequence is:
nb = GaussianNB()
nb.fit(X_train, y_train)
nb.predict_proba(X_test)
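A fuller, self-contained sketch for the data in the question might look like this (GaussianNB, the data-frame names train_df/test_df, and using Office and Age as feature columns are assumptions for illustration):
from sklearn.naive_bayes import GaussianNB

# train_df holds the labeled records, test_df the unlabeled ones
X_train = train_df[['Office', 'Age']]
y_train = train_df['isWrong']
X_test = test_df[['Office', 'Age']]

nb = GaussianNB()
nb.fit(X_train, y_train)

print(nb.predict(X_test))        # hard 0/1 predictions
print(nb.predict_proba(X_test))  # per-class probabilities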

Related

GridSearchCV freezing with linear svm

I have a problem with GridSearchCV freezing (the CPU is active but the program is not advancing) with a linear SVM (with an RBF SVM it works fine).
Depending on the random_state I use for splitting my data, the freezing happens at different CV splits and for different numbers of PCA components.
The features of one sample look like the following (there are 39 features):
[1 117 137 2 80 16 2 39 228 88 5 6 0 10 13 6 22 23 1 227 246 7 1.656934307 0 5 0.434195726 0.010123735 0.55568054 5 275 119.48398 0.9359527 0.80484825 3.1272728 98 334 526 0.13454546 0.10181818]
Another sample's features:
[23149 4 31839 9 219 117 23 5 31897 12389 108 2 0 33 23 0 0 18 0 0 0 23149 0 0 74 0.996405221 0.003549844 4.49347E-05 74 5144 6.4480677 0.286384 0.9947901 3.833787 20 5135 14586 0.0060264384 0.011664075]
If I delete the last 10 features I don't have this problem (before I added these 10 new features my code worked fine). I did not check other combinations of the last 10 features to see whether a specific feature is causing the problem.
I also use StandardScaler to scale the features but still face this issue. The problem occurs less often if I use MinMaxScaler instead (but I read somewhere that it is not good for SVM).
I also set n_jobs to different numbers; it advances a little further but then freezes again.
What do you suggest?
I followed part of this code to write my code:
TypeError grid seach
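For reference, here is a minimal sketch of the setup described above, assuming scikit-learn; the parameter grid, number of PCA components, and data split are placeholders:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# scale, reduce, then fit the SVM inside one pipeline so that scaling and PCA
# are re-fit on each cross-validation split
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svm", SVC(kernel="linear")),
])

param_grid = {
    "pca__n_components": [10, 20, 30],   # example values only
    "svm__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=1, verbose=2)
# search.fit(X_train, y_train)   # X_train: samples x 39 features, y_train: labels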

How to predict multi-label dataset using svm

I'm using a dataset with all decimal values and a timestamp, which has the following features:
1. sno
2. timestamp
3. v1
4. v2
5. v3
I have data for 5 months, with timestamps for every minute. I need to predict whether v1, v2, v3 are being used at any time in the future. The values of v1, v2, v3 are between 0 and 25.
How can I do this?
I've used binary classification before, but I have no clue how to proceed with the multi-label problem. I've used the code below all the time. How should I train the model, and how should I use v1, v2, v3 to fit into 'y'?
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.2)
Data:
sno power voltage v1 v2 v3 timestamp
1 3.74 235.24 0 16 18 2006-12-16 18:03:00
2 4.928 237.14 0 37 16 2006-12-16 18:04:00
3 6.052 236.73 0 37 17 2006-12-16 18:05:00
4 6.752 237.06 0 36 17 2006-12-16 18:06:00
5 6.474 237.13 0 37 16 2006-12-16 18:07:00
6 6.308 235.84 0 36 17 2006-12-16 18:08:00
7 4.464 232.69 0 37 16 2006-12-16 18:09:00
8 3.396 230.98 0 22 18 2006-12-16 18:10:00
9 3.09 232.21 0 12 17 2006-12-16 18:11:00
10 3.73 234.19 0 27 17 2006-12-16 18:12:00
11 2.308 234.96 0 1 17 2006-12-16 18:13:00
12 2.388 236.66 0 1 17 2006-12-16 18:14:00
13 4.598 235.84 0 20 17 2006-12-16 18:15:00
14 4.524 235.6 0 9 17 2006-12-16 18:16:00
15 4.202 235.49 0 1 17 2006-12-16 18:17:00
Following the documentation:
The multiclass support is handled according to a one-vs-one scheme (and should thus also support a one-vs-all strategy).
one-vs-one strategy
The one-vs-one scheme basically means fitting one classifier per pair of classes. At prediction time, the class that receives the most votes (from the outputs of each pairwise classifier) is selected as the prediction. If the voting ends in a tie, i.e. two classes have an equal number of votes, then the classification confidence is used to break it.
To use SVM with such a scheme, one should go:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
...
subclf = SVC(**params)                      # base binary classifier
clf = OneVsOneClassifier(estimator=subclf)  # one classifier per pair of classes
clf.fit(X, y)                               # X: feature matrix, y: multi-class target
one-vs-rest strategy
The other option is a one-vs-rest (one-vs-all) strategy. This strategy fits one classifier per class, against all other classes in the data. It is more popular than the first scheme because the results are easier to interpret and the computational cost is much lower. It is as simple to use as the first example:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
...
subclf = SVC(**params)                       # base binary classifier
clf = OneVsRestClassifier(estimator=subclf)  # one classifier per class
clf.fit(X, y)                                # X: feature matrix, y: multi-class target
To read more about multi-label classification and learning proceed here
Target variable coding
So the basic idea is to construct a combined (i.e. multi-label) target variable such that:
y equals 0 if v1, v2, v3 are all zero
y equals 1 if exactly one of v1, v2, v3 is non-zero
y equals 2 if exactly two of v1, v2, v3 are non-zero
y equals 3 if v1, v2, v3 are all non-zero
The workaround may be the following:
import numpy as np

y = []
for i, j, k in zip(data['v1'], data['v2'], data['v3']):
    # count how many of v1, v2, v3 are non-zero for this row
    if i > 0 and j > 0 and k > 0:
        y.append(3)
    elif (i > 0 and j > 0) or (i > 0 and k > 0) or (j > 0 and k > 0):
        y.append(2)
    elif i > 0 or j > 0 or k > 0:
        y.append(1)
    else:
        y.append(0)
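Putting the pieces together, a rough sketch of fitting the one-vs-rest SVM on this coded target might look like the following (the choice of 'power' and 'voltage' as feature columns is only an assumption for illustration):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = data[['power', 'voltage']].values   # illustrative feature columns from the sample data
y = np.array(y)                         # the 0-3 target built in the loop above

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = OneVsRestClassifier(SVC())
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)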

What to do if response (or label) columns are in another data frame?

I'm a newbie in machine learning, so I need your advice.
Imagine we have two data sets (df1 and df2).
The first data set includes about 5000 observations and some features; to simplify:
name age company degree_of_skill average_working_time alma_mater
1 John 39 A 89 38 Harvard
2 Steve 35 B 56 46 UCB
3 Ivan 27 C 88 42 MIT
4 Jack 26 A 87 37 MIT
5 Oliver 23 B 76 36 MIT
6 Daniel 45 C 79 39 Harvard
7 James 34 A 60 40 MIT
8 Thomas 28 B 89 39 Stanford
9 Charlie 29 C 83 43 Oxford
The learning problem: predict the productivity of the companies in the second data set (df2) for the next period of time (june-2016), based on data from the first data set (df1).
df2:
company productivity date
1 A 1240 april-2016
2 B 1389 april-2016
3 C 1388 april-2016
4 A 1350 may-2016
5 B 1647 may-2016
6 C 1272 may-2016
As we can see, both data sets include the feature "company", but I don't understand how to create a link between them. What should I do with the two data sets to solve the learning problem? Is it possible?
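For illustration, one way such a link could be created is to aggregate df1 to one row per company and join the result onto df2; a rough pandas sketch under that assumption (the aggregation choices are placeholders):
import pandas as pd

# aggregate the per-person features of df1 to one row per company
company_features = df1.groupby('company').agg(
    mean_age=('age', 'mean'),
    mean_skill=('degree_of_skill', 'mean'),
    mean_working_time=('average_working_time', 'mean'),
    n_people=('name', 'count'),
).reset_index()

# join those features onto the productivity records of df2 by company;
# 'productivity' then serves as the target for a regression model
train = df2.merge(company_features, on='company', how='left')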

Tableau running count reset

I have a list of sporting matches by time with result and margin. I want Tableau to keep a running count of the number of matches since the last X (say, since the last draw, where margin = 0).
This will mean that on every record, the running count will increase by one unless that match is a draw, in which case it will drop back to zero.
I have not found a method of achieving this. The only way I can see to restart counts is via dates (e.g. a new year).
As an aside, I can easily achieve this by creating a running count tally OUTSIDE of Tableau.
The interesting thing is that Tableau then doesn't deal with this well when there is more than one result on the same day.
For example, if the structure is:
GameID Date Margin Running count
...
48 01-01-15 54 122
49 08-01-15 12 123
50 08-01-15 0 124
51 08-01-15 17 0
52 08-01-15 23 1
53 15-01-15 9 2
...
Then when trying to plot running count against date, Tableau rearranges the data to show:
GameID Date Margin Running count
...
48 01-01-15 54 122
51 08-01-15 17 0
52 08-01-15 23 1
49 08-01-15 12 123
50 08-01-15 0 124
53 15-01-15 9 2
...
I assume it is doing this because by default it sorts the running count data in ascending order when dates are identical.

ERROR while implementing Cox PH model for recurrent event survival analysis using counting process

I have been trying to run a Cox PH model on a sample data set of 10k customers (randomly taken from a 32 million customer base) to predict the probability of survival at time t (which is a month in my case). I am using recurrent-event survival analysis with a counting process formulation for e-commerce. For this...
1. Observation starting point: right after a customer makes their first purchase
2. Start/Stop times: Months of two consecutive purchases (as in the data)
I have a few independent variables as in the sample data below:
id start stop status tenure orders revenue Quantity
A 0 20 0 0 1 $89.0 1
B 0 17 0 0 1 $556.0 2
C 0 17 0 0 1 $900.0 2
D 32 33 0 1679 9 $357.8 9
D 26 32 1 1497 7 $326.8 7
D 23 26 1 1405 4 $142.9 4
D 17 23 1 1219 3 $63.9 3
D 9 17 1 978 2 $50.0 2
D 0 9 1 694 1 $35.0 1
E 0 15 0 28 2 $156.0 2
F 0 15 0 0 1 $348.0 1
F 12 14 0 375 2 $216.8 3
F 0 12 1 0 1 $67.8 2
G 9 15 0 277 2 $419.0 2
G 0 9 1 0 1 $359.0 1
While running the Cox PH model with the following code:
fit10 <- coxph(Surv(start, stop, status) ~ orders + tenure + Quantity + revenue, data = test)
I keep getting the following error:
Warning: X matrix deemed to be singular; variable 1 2 3 4
I tried searching for the same error online but the answers I found said this could be because of interacting independent variables, whereas my variables are individual and continuous.
