How do I create training and testing files for a multi-label SVM? - machine-learning

My question follows from this Quora answer: https://www.quora.com/Can-anyone-give-me-some-pointers-for-using-SVM-for-user-recognition-using-keystroke-timing/answer/Chomba-Bupe?snid3=364610243&nsrc=1&filter=all
My project is keystroke dynamics (dynamic keyboard), training one user against all other users. The answer says:
For example, if you have three classes A, B and C, you will then have 3 SVMs, each with its own parameters (i.e. weights and biases) and 3 separate outputs corresponding to the 3 classes respectively. When training SVM-A, the other two classes B and C act as negative training sets while A is the positive one; when training SVM-B, A and C are the negative training sets; and for SVM-C, A and B are the negatives. This is the so-called one-vs-all training procedure.
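The relabelling step described in the quoted procedure can be sketched as follows. This is only an illustration in Python, not the node-svm code used in the question; the function name `one_vs_rest_labels` and the sample owners in `users` are hypothetical.

```python
def one_vs_rest_labels(labels, positive_class):
    """Return +1 for samples of positive_class and -1 for every other class."""
    return [1 if label == positive_class else -1 for label in labels]


# Hypothetical owners of five training samples (not taken from the question).
users = ["A", "B", "C", "A", "B"]

# One relabelled target vector per binary SVM: SVM-A, SVM-B and SVM-C.
targets = {user: one_vs_rest_labels(users, user) for user in sorted(set(users))}

for user, y in targets.items():
    print(user, y)
```

Each entry of `targets` would be paired with the same feature rows to train one binary classifier.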
I tried this, but the results are wrong.
My training file is a .csv and contains:
65 134,+1
70 98,+1
73 69,+1
82 122,+1
82 95,+1
83 127,+1
84 7,+1
85 64,+1
65 123,-1
71 115,-1
73 154,-1
73 156,-1
77 164,-1
77 144,-1
79 112,-1
83 91,-1
My testing file is also a .csv and contains:
65 111
68 88
70 103
73 89
82 111
82 79
83 112
84 36
85 71
My code is:
'use strict';
var so = require('stringify-object');
var Q = require('q');
var svm = require('../lib');
var trainingFile = './archivos/training/340.txt';
var testingFile = './archivos/present/340.txt';
var clf = new svm.CSVC({
    gamma: 0.25,
    c: 1, // allow you to evaluate several values during training
    normalize: false,
    reduce: false,
    kFold: 1 // disable k-fold cross-validation
});
Q.all([
    svm.read(trainingFile),
    svm.read(testingFile)
]).spread(function (trainingSet, testingSet) {
    return clf.train(trainingSet)
        .progress(function (progress) {
            console.log('training progress: %d%', Math.round(progress * 100));
        })
        .then(function () {
            return clf.evaluate(testingSet);
        });
}).done(function (evaluationReport) {
    console.log('Accuracy against the testset:\n', so(evaluationReport));
});

Are your labels 1 and -1? If so, you need to know those classes for your test data as well. The point of testing your classifier is to see how well it predicts unseen data, and that can only be measured against known answers.
As a small example, you could build your classifier with your training data:
x_train = [[65, 134], [70, 98], ..., [79, 112], [83, 91]]
y_train = [1, 1, ..., -1, -1]
Then you test your classifier by passing in your test data. Say you pass in the first three examples of your test data and it makes the following predictions:
[65, 111] --> 1
[68, 88] --> -1
[70, 103] --> -1
You then tally up how many of the test examples it predicted correctly, but to do that you need to know the classes of your test data to begin with. If you don't have them, you may want to try cross-validation on your training data instead.
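The tallying step can be sketched in a few lines of Python. The predictions are the three from the example above; the true labels in `actual` are hypothetical, since the test file in the question does not contain any.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


predicted = [1, -1, -1]  # predictions from the example above
actual = [1, 1, -1]      # hypothetical true labels for the same three rows

print(accuracy(predicted, actual))
```

Without the `actual` column there is nothing to compare against, which is why the test file needs labels too.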

Related

Drift Detection in categorical variables of high cardinality (10000+)

I am trying to solve a drift detection problem in which I have to detect drift in high-cardinality (10000+) categorical variables such as ip_address, zipcode and cities. I have data points on the order of millions. I have tried the following methods:
chi-square test from the evidently Python package https://github.com/evidentlyai/evidently/blob/main/src/evidently/analyzers/stattests/chisquare_stattest.py
maximum mean discrepancy (MMD) test with a GaussianRBF kernel from alibi-detect https://github.com/SeldonIO/alibi-detect/blob/master/alibi_detect/cd/mmd.py
I ran into the problems below when applying these methods to my data.
The chi-square test requires the same set of categories in both the training and inference datasets. This is highly unlikely for features like IP address and zipcode. Some categories are present in the training data but not in the inference data. For these I have no observed frequency, so as a workaround I can assume a frequency of 0.
But there are also categories that are newly introduced in the inference dataset and absent from the training dataset, so I cannot derive their expected frequency from the training data. For these I would have 0 in the denominator of the chi-square formula, and the test statistic would be NaN. As a workaround, I can assume a minimum expected frequency of 1. But I wonder whether this is a correct way to approach drift detection.
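The two workarounds described above (observed count 0 for categories missing at inference time, and a floor of 1 on the expected count for unseen categories) can be sketched in plain Python. This is an illustration of the asker's idea, not the evidently implementation; the function name and the `min_expected` parameter are hypothetical, and with the floor applied the result is no longer an exact chi-square statistic.

```python
from collections import Counter


def chi_square_with_floor(train_values, infer_values, min_expected=1.0):
    """Chi-square style statistic over the union of categories.

    Expected counts come from the training frequencies scaled to the size of
    the inference sample, with a floor so unseen categories do not divide by 0.
    """
    train_counts = Counter(train_values)
    infer_counts = Counter(infer_values)
    categories = set(train_counts) | set(infer_counts)
    scale = len(infer_values) / len(train_values)

    statistic = 0.0
    for category in categories:
        observed = infer_counts[category]  # Counter returns 0 when missing
        expected = max(train_counts[category] * scale, min_expected)
        statistic += (observed - expected) ** 2 / expected
    return statistic


train = ["a", "a", "b", "b"]
infer = ["a", "a", "b", "c"]  # "c" never appeared in training
print(chi_square_with_floor(train, infer))
```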
Moreover, the larger problem is the following.
These categorical features can take any value from a very large set of possible values, and I have no control over which values they take. Users of the system can log in from any IP address and any zipcode. This makes it very difficult to identify real drift in the data, because methods like the chi-square test may always report a significant result for such features. Is there any drift detection method that takes the high cardinality and the open-ended nature of such features into account?
The MMD test with a GaussianRBF kernel calculates pairwise distances between two sets of vectors. X is my training dataset, which has 15 million records and 10 features, and Y is my inference dataset, which has 10 million records and 10 features. When I run the MMD test on these datasets, the following quantities are involved:
a) K_XX = within similarity of X
b) K_YY = within similarity of Y
c) K_XY = cross similarity between X and Y
K_XX requires a matrix of size 15 million x 15 million. This gives me a "ResourceExhaustedError":
ResourceExhaustedError Traceback (most recent call last)
<command-574623167633967> in <module>
1 from alibi_detect.cd import MMDDrift
---> 2 detector = MMDDrift(x_ref=X, backend='tensorflow')
3 res = detector.predict(x=Y)
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/mmd.py in __init__(self, x_ref, backend, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
101 if backend == 'tensorflow' and has_tensorflow:
102 kwargs.pop('device', None)
--> 103 self._detector = MMDDriftTF(*args, **kwargs) # type: ignore
104 else:
105 self._detector = MMDDriftTorch(*args, **kwargs) # type: ignore
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/tensorflow/mmd.py in __init__(self, x_ref, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, input_shape, data_type)
86 # compute kernel matrix for the reference data
87 if self.infer_sigma or isinstance(sigma, tf.Tensor):
---> 88 self.k_xx = self.kernel(self.x_ref, self.x_ref, infer_sigma=self.infer_sigma)
89 self.infer_sigma = False
90 else:
/databricks/python/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/kernels.py in call(self, x, y, infer_sigma)
75 y = tf.cast(y, x.dtype)
76 x, y = tf.reshape(x, (x.shape[0], -1)), tf.reshape(y, (y.shape[0], -1)) # flatten
---> 77 dist = distance.squared_pairwise_distance(x, y) # [Nx, Ny]
78
79 if infer_sigma or self.init_required:
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/distance.py in squared_pairwise_distance(x, y, a_min, a_max)
28 x2 = tf.reduce_sum(x ** 2, axis=-1, keepdims=True)
29 y2 = tf.reduce_sum(y ** 2, axis=-1, keepdims=True)
---> 30 dist = x2 + tf.transpose(y2, (1, 0)) - 2. * x @ tf.transpose(y, (1, 0))
31 return tf.clip_by_value(dist, a_min, a_max)
32
ResourceExhaustedError: Exception encountered when calling layer "gaussian_rbf_20" (type GaussianRBF).
OOM when allocating tensor with shape[14335347,14335347] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:AddV2]
The error is expected given the quadratic memory complexity of MMD. Is there any way to mitigate this issue, given that the number of records is in the millions and the categorical features have high cardinality?
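To make the quadratic cost concrete, the size of the tensor named in the error message can be worked out directly. This is a back-of-the-envelope sketch that assumes 4 bytes per value (the "type float" in the message taken as float32); the helper name is hypothetical.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=4):
    """Memory needed to hold a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_value


n = 14_335_347  # shape reported in the OOM message
total_bytes = dense_matrix_bytes(n, n)

print(f"{total_bytes:,} bytes")
print(f"about {total_bytes / 1e12:.0f} TB")
```

That is roughly 822 TB for K_XX alone, before K_YY and K_XY are computed, so the full kernel matrix cannot be held in memory at this sample size.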

Value Error : Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds

I am learning machine learning and building my first model on the MNIST dataset.
Can anyone help me with this? I have tried StratifiedKFold, KFold and other methods to resolve the issue.
Pandas Version '0.25.1', Python Version 3.7, using Anaconda Distribution.
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(mnist, test_size=0.2, random_state=29)
X_train, y_train = train_set.drop('label', axis=1), train_set[['label']]
X_test, y_test = test_set.drop('label', axis=1), test_set[['label']]
y_train_5 = (y_train == 5)  # True for all 5s and False otherwise
y_test_5 = (y_test == 5)
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=29)
sgd_clf.fit(X_train, y_train_5)
from sklearn.model_selection import cross_val_predict
print(X_train.shape)
print(y_train_5.shape)
cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
The last line of the code block gives an error:
RuntimeWarning: Number of classes in training fold (2) does not match total number of classes (1). Results may not be appropriate for your use case. To fix this, use a cross-validation technique resulting in properly stratified folds
RuntimeWarning)
ValueError Traceback (most recent call last)
<ipython-input-39-da1ad024473a> in <module>
3 print(X_train.shape)
4 print(y_train_5.shape)
----> 5 cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, pre_dispatch, method)
787 prediction_blocks = parallel(delayed(_fit_and_predict)(
788 clone(estimator), X, y, train, test, verbose, fit_params, method)
--> 789 for train, test in cv.split(X, y, groups))
790
791 # Concatenate the predictions
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
919 # remaining jobs.
920 self._iterating = False
--> 921 if self.dispatch_one_batch(iterator):
922 self._iterating = self._original_iterator is not None
923
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_predict(estimator, X, y, train, test, verbose, fit_params, method)
887 n_classes = len(set(y)) if y.ndim == 1 else y.shape[1]
888 predictions = _enforce_prediction_order(
--> 889 estimator.classes_, predictions, n_classes, method)
890 return predictions, test
891
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _enforce_prediction_order(classes, predictions, n_classes, method)
933 'is not supported for decision_function '
934 'with imbalanced folds. {}'.format(
--> 935 len(classes), n_classes, recommendation))
936
937 float_min = np.finfo(predictions.dtype).min
ValueError: Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds. To fix this, use a cross-validation technique resulting in properly stratified folds
I ran into a similar problem and, on further investigation, found this warning alongside the error log:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
There are two ways to solve this:
1. Use the hint in the warning message and change your code to:
cross_val_predict(sgd_clf, X_train, y_train_5.values.ravel(), cv=3,
method="decision_function")
See the linked answer here.
2. Using the hint from "A column-vector y was passed when a 1d array was expected", I realised my mistake and did the following.
The same cause shows up in your error log: Number of classes in training fold (2) does not match total number of classes (1).
I assume y_train_5 here is a DataFrame (you are probably working your way through Aurelien's publication).
The expected type for y_train_5 is a one-dimensional array-like object, meaning its shape should be (n,), but a DataFrame is two-dimensional, in your case (n, 1).
All you need to do is pass the Series object for your column instead:
y_train_5.iloc[:,0] (I prefer this)
y_train_5.{COLUMN_NAME} (another variant)
Try running the line below in your console.
> y_train_5.iloc[:,0].shape
(n,)
cross_val_predict(sgd_clf, X_train, y_train_5.iloc[:,0], cv=3,
method="decision_function")
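The "2 vs 1" class counts in the error follow from how the number of classes is derived from the shape of y. The sketch below mirrors the expression visible in the traceback (`len(set(y)) if y.ndim == 1 else y.shape[1]`) using NumPy; the helper name is hypothetical and other scikit-learn versions may compute this differently.

```python
import numpy as np


def count_classes_like_traceback(y):
    """Mirror of the n_classes expression shown in the traceback above."""
    y = np.asarray(y)
    return len(set(y)) if y.ndim == 1 else y.shape[1]


# A boolean target stored as a column vector, shape (n, 1), like a one-column DataFrame.
y_column = np.array([[True], [False], [False], [True]])

print(y_column.shape, count_classes_like_traceback(y_column))
print(y_column.ravel().shape, count_classes_like_traceback(y_column.ravel()))
```

With shape (n, 1) the count comes from the number of columns, which is 1, while the fitted estimator correctly sees 2 classes. Flattening y to shape (n,) makes both counts agree.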

How to predict multi-label dataset using svm

I'm using a dataset of decimal values with a timestamp. It has the following features:
1. sno
2. timestamp
3. v1
4. v2
5. v3
I have 5 months of data with a timestamp for every minute. I need to predict whether v1, v2 and v3 are being used at any given time in the future. The values of v1, v2 and v3 are between 0 and 25.
How can I do this?
I've used binary classification before, but I have no idea how to approach a multi-label prediction problem. Until now I have always used the code below. How should I train the model, and how should I use v1, v2 and v3 to build 'y'?
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.2)
Data:
sno power voltage v1 v2 v3 timestamp
1 3.74 235.24 0 16 18 2006-12-16 18:03:00
2 4.928 237.14 0 37 16 2006-12-16 18:04:00
3 6.052 236.73 0 37 17 2006-12-16 18:05:00
4 6.752 237.06 0 36 17 2006-12-16 18:06:00
5 6.474 237.13 0 37 16 2006-12-16 18:07:00
6 6.308 235.84 0 36 17 2006-12-16 18:08:00
7 4.464 232.69 0 37 16 2006-12-16 18:09:00
8 3.396 230.98 0 22 18 2006-12-16 18:10:00
9 3.09 232.21 0 12 17 2006-12-16 18:11:00
10 3.73 234.19 0 27 17 2006-12-16 18:12:00
11 2.308 234.96 0 1 17 2006-12-16 18:13:00
12 2.388 236.66 0 1 17 2006-12-16 18:14:00
13 4.598 235.84 0 20 17 2006-12-16 18:15:00
14 4.524 235.6 0 9 17 2006-12-16 18:16:00
15 4.202 235.49 0 1 17 2006-12-16 18:17:00
Following the documentation:
The multiclass support is handled according to a one-vs-one scheme (and should thus support one-vs-all strategy).
One-vs-one strategy
The one-vs-one scheme uses one classifier per pair of classes. At prediction time, the class that receives the most votes (one vote per pairwise classifier) is selected as the prediction. If the vote is tied, i.e. two classes receive an equal number of votes, the classification confidence decides.
To use an SVM with this scheme:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
...
subclf = SVC(**params)
clf = OneVsOneClassifier(estimator=subclf)
clf.fit(X_train, y_train)
One-vs-rest strategy
The alternative is the one-vs-all (one-vs-rest) strategy, which fits one classifier per class against all other classes in the data. It is more popular than the first scheme because the results are easier to interpret and the computational cost is much lower. It is as simple to use as the first example:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
...
subclf = SVC(**params)
clf = OneVsRestClassifier(estimator=subclf)
clf.fit(X_train, y_train)
To read more about multi-label classification and learning, proceed here.
Target variable coding
The basic idea is to build a single combined target variable from v1, v2 and v3, where "in use" means a value greater than zero:
y equals 0 if v1, v2 and v3 are all zero
y equals 1 if exactly one of v1, v2, v3 is in use
y equals 2 if exactly two of v1, v2, v3 are in use
y equals 3 if v1, v2 and v3 are all in use
One way to do this:
y = []
for v1, v2, v3 in zip(data['v1'], data['v2'], data['v3']):
    # count how many of the three values are in use (greater than zero)
    y.append(int(v1 > 0) + int(v2 > 0) + int(v3 > 0))
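An alternative to collapsing the three values into one count is to keep one binary label per value, which is the usual shape of a multi-label target. The sketch below builds such an indicator matrix from the first rows of the sample data; the function name is hypothetical. scikit-learn's OneVsRestClassifier accepts a binary indicator matrix like this as y for multi-label problems.

```python
def usage_indicators(v1, v2, v3):
    """One binary label per value: 1 if it is in use (greater than zero)."""
    return [int(v1 > 0), int(v2 > 0), int(v3 > 0)]


# (v1, v2, v3) taken from the first three rows of the sample data.
rows = [(0, 16, 18), (0, 37, 16), (0, 37, 17)]

# Indicator matrix with one row per sample and one column per label.
Y = [usage_indicators(*row) for row in rows]
print(Y)
```

Unlike the 0 to 3 count, this keeps track of which of the three values is in use, not only how many.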

Classification Supervised Training Confusion

I am new to supervised machine learning. I've been reading books and articles about it, and I'm stuck on a problem (or rather, I don't understand the logic behind classification algorithms). I am trying to classify records as wrong or not based on historical data.
So this is the original data (training data):
Name Office Age isWrong
F1 1 32 0
F2 2 61 1
F3 1 35 0
F4 0 25 0
F5 1 36 0
F6 2 52 0
F7 2 48 0
F8 1 17 1
F9 2 51 0
F10 0 24 0
F11 4 34 1
F12 0 21 0
F13 2 51 0
F14 0 27 0
F15 3 37 1
(only showing top 15 results of 200 results)
A wrong record is any record that reports an age lower than 18 or higher than 60, or an office location that is not in {0, 1, 2}. The remaining records likewise show a 1 whenever any of these conditions is met. I trained my model on this dataset and created a test dataset to check the results. However, I end up with 0 in the prediction column for every record. I used a Naïve Bayes approach because it assumes independence between the feature variables, which is my case (there is no relationship between office number and age). I know there are other methods such as Logistic Regression and SVC (SVM), but I assume they require some degree of relationship between the feature variables. Despite that, I tried those two approaches as well and got the same results. Am I doing something wrong? Do I need to specify something before training my model?
Here is what I did (very simple):
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).show();
Here is dataset2 (top 15):
Name Office Age
F1 9 36 //wrong, office is 9
F2 2 20
F3 1 17
F4 2 43
F5 2 90 // wrong, age is >60
F6 1 36
F7 1 40
F8 2 52
F9 2 49
F10 1 38
F11 0 28
F12 0 18
F13 1 40
F14 1 31
F15 2 45
But like I said, the prediction column displays 0 every time. Any idea why?
transform() is the call that scores new data in Spark ML, but looking only at the prediction column hides how confident the model is about each record.
To see that, look at the probabilities. In scikit-learn the method would be:
predict_proba(X): Return probability estimates for the test vector X.
In Spark ML, the fitted model adds a probability column to the output of transform(), so something like the following should show it for your scenario:
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).select("probability", "prediction").show(false);
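The rule from the question (age below 18 or above 60, or an office outside {0, 1, 2}) can also be written directly as a predicate. Applying it to the test records gives ground-truth labels to compare the model's predictions against. This is a plain Python sketch, separate from the Spark code; the names are hypothetical and only the first five rows of dataset2 are used.

```python
VALID_OFFICES = {0, 1, 2}


def is_wrong(office, age):
    """Rule stated in the question for a wrong record."""
    return age < 18 or age > 60 or office not in VALID_OFFICES


# (name, office, age) for the first five rows of dataset2.
dataset2 = [("F1", 9, 36), ("F2", 2, 20), ("F3", 1, 17), ("F4", 2, 43), ("F5", 2, 90)]

expected = {name: int(is_wrong(office, age)) for name, office, age in dataset2}
print(expected)
```

Note that by this rule F3 (age 17) is also a wrong record, although the listing in the question only annotates F1 and F5.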

Where is the main code of convolutional nets?

I want to modify the code of convolutional nets, but I could not find their main routine, i.e. the convolution and pooling itself.
SpatialConvolution.lua has the code below.
function SpatialConvolution:updateOutput(input)
    backCompatibility(self)
    viewWeight(self)
    input = makeContiguous(self, input)
    local out = input.nn.SpatialConvolutionMM_updateOutput(self, input) -- where?
    unviewWeight(self)
    return out
end
So I thought the routine was in SpatialConvolutionMM.
However, SpatialConvolutionMM.lua has the code below.
function SpatialConvolutionMM:updateOutput(input)
    -- backward compatibility
    if self.padding then
        self.padW = self.padding
        self.padH = self.padding
        self.padding = nil
    end
    input = makeContiguous(self, input)
    return input.nn.SpatialConvolutionMM_updateOutput(self, input) -- where??
end
So does anyone know where SpatialConvolutionMM_updateOutput is defined?
There's an open issue on GitHub about the same problem. One of the solutions mentions upgrading nn and cunn:
luarocks install nn
luarocks install cunn
Also, take a look at this reply.
