Drift Detection in categorical variables of high cardinality (10000+)


I am trying to solve a drift detection problem: I need to detect drift in high-cardinality (10000+) categorical variables such as ip_address, zipcode and city. The data contains millions of points. I have tried the following methods:
Chi-square test from the evidently Python package: https://github.com/evidentlyai/evidently/blob/main/src/evidently/analyzers/stattests/chisquare_stattest.py
Maximum mean discrepancy (MMD) test with a GaussianRBF kernel from alibi-detect: https://github.com/SeldonIO/alibi-detect/blob/master/alibi_detect/cd/mmd.py
I ran into the following problems while applying these methods to my data.
The chi-square test requires the same set of categories in both the training and the inference dataset. This is highly unlikely for features like IP address and zipcode. Some categories are present in the training data but not in the inference data. For those I have no observed frequency; as a workaround I can assume a frequency of 0.
But there are also categories that are new in the inference dataset and absent from the training dataset, so I cannot derive their expected frequency from the training data. For those categories the denominator of the chi-square formula is 0 and the test statistic becomes NaN. As a workaround, I can assume a minimum expected frequency of 1, but I wonder whether this is a correct way to approach drift detection.
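To make the workaround described above concrete, here is a minimal sketch in plain Python. It is not how evidently implements the test; the function name `chi_square_with_floor`, the `min_expected` parameter and the sample IP addresses are made up for illustration. It only computes the statistic, not a p-value.

```python
from collections import Counter

def chi_square_with_floor(reference, current, min_expected=1.0):
    """Pearson chi-square statistic over the union of categories.

    Categories absent from `current` get an observed count of 0.
    Categories absent from `reference` get an expected count floored at
    `min_expected` (the workaround from the question, not a standard rule).
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    # Rescale reference counts to the size of the current sample.
    scale = len(current) / len(reference)
    statistic = 0.0
    for category in set(ref_counts) | set(cur_counts):
        observed = cur_counts.get(category, 0)
        expected = max(ref_counts.get(category, 0) * scale, min_expected)
        statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical toy samples.
reference = ["10.0.0.1"] * 50 + ["10.0.0.2"] * 50
same = ["10.0.0.1"] * 5 + ["10.0.0.2"] * 5
shifted = ["10.0.0.1"] * 5 + ["10.0.0.3"] * 5   # one unseen category

print(chi_square_with_floor(reference, same))
print(chi_square_with_floor(reference, shifted))
```

Note that flooring the expected counts means they no longer sum to the size of the inference sample, so the usual chi-square reference distribution may not apply exactly. That is one reason to treat the floor as a heuristic rather than a correct test.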
Moreover, the larger problem is the following.
These categorical features can take any value from a very large set, and I have no control over which values occur: users of the system can log in from any IP address and any zipcode. This makes it very difficult to detect real drift in the data. Methods like the chi-square test may always give a significant result for such features. Is there any drift detection method that takes into account the high cardinality and the open-ended nature of such features?
The MMD test with a GaussianRBF kernel computes pairwise distances between vectors. X is my training dataset, with 15 million records and 10 features, and Y is my inference dataset, with 10 million records and 10 features. When I perform the MMD test on these datasets, the following are computed:
a) K_XX = within similarity of X
b) K_YY = within similarity of Y
c) K_XY = cross similarity between X and Y
K_XX will try to generate a matrix (15 million X 15 million). This gives me "ResourceExhaustedError".
ResourceExhaustedError Traceback (most recent call last)
<command-574623167633967> in <module>
1 from alibi_detect.cd import MMDDrift
---> 2 detector = MMDDrift(x_ref=X, backend='tensorflow')
3 res = detector.predict(x=Y)
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/mmd.py in __init__(self, x_ref, backend, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
101 if backend == 'tensorflow' and has_tensorflow:
102 kwargs.pop('device', None)
--> 103 self._detector = MMDDriftTF(*args, **kwargs) # type: ignore
104 else:
105 self._detector = MMDDriftTorch(*args, **kwargs) # type: ignore
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/tensorflow/mmd.py in __init__(self, x_ref, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, input_shape, data_type)
86 # compute kernel matrix for the reference data
87 if self.infer_sigma or isinstance(sigma, tf.Tensor):
---> 88 self.k_xx = self.kernel(self.x_ref, self.x_ref, infer_sigma=self.infer_sigma)
89 self.infer_sigma = False
90 else:
/databricks/python/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/kernels.py in call(self, x, y, infer_sigma)
75 y = tf.cast(y, x.dtype)
76 x, y = tf.reshape(x, (x.shape[0], -1)), tf.reshape(y, (y.shape[0], -1)) # flatten
---> 77 dist = distance.squared_pairwise_distance(x, y) # [Nx, Ny]
78
79 if infer_sigma or self.init_required:
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/distance.py in squared_pairwise_distance(x, y, a_min, a_max)
28 x2 = tf.reduce_sum(x ** 2, axis=-1, keepdims=True)
29 y2 = tf.reduce_sum(y ** 2, axis=-1, keepdims=True)
---> 30 dist = x2 + tf.transpose(y2, (1, 0)) - 2. * x @ tf.transpose(y, (1, 0))
31 return tf.clip_by_value(dist, a_min, a_max)
32
ResourceExhaustedError: Exception encountered when calling layer "gaussian_rbf_20" (type GaussianRBF).
OOM when allocating tensor with shape[14335347,14335347] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:AddV2]
The error seems pretty obvious given the quadratic memory complexity of MMD. Is there any way to mitigate this issue, given that the number of records is in the millions and the categorical features have high cardinality?
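The scale of the problem can be checked with simple arithmetic. The sketch below computes the size of a dense float32 kernel matrix for the shape reported in the error message. The helper `kernel_matrix_bytes` and the 10,000-row subsample size are illustrative assumptions, not part of alibi-detect.

```python
def kernel_matrix_bytes(n_rows, n_cols, bytes_per_element=4):
    """Memory needed for a dense n_rows x n_cols matrix (float32 by default)."""
    return n_rows * n_cols * bytes_per_element

n_ref = 14_335_347                       # shape reported in the OOM message
full = kernel_matrix_bytes(n_ref, n_ref)
print(f"full K_XX: {full / 1e12:.0f} TB")

# For comparison, a hypothetical 10,000-row subsample of the reference data.
small = kernel_matrix_bytes(10_000, 10_000)
print(f"subsampled K_XX: {small / 1e6:.0f} MB")
```

The full matrix is on the order of hundreds of terabytes, so no single machine can hold it. Any workable approach has to avoid materialising the full kernel matrix, for example by working on much smaller samples.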

Related

missing data in time series

I am new to this field and am exploring time series data. I want to find the missing values, count them, study the distribution of gap lengths, and fill in the gaps. I have, say, 10 .txt files, each with 2 columns as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on, say up to 100; the 10 files do not necessarily have the same number of observations.
In this example the missing values are 4, 5 and 6 in C2, together with the corresponding values in the first column C1 (measured in milliseconds, so the value 928 ms is not a time neighbor of 912 ms). Missing values do not necessarily appear in every file. I want to find those gaps (the total number of missing values across all 10 files) and show a histogram of their lengths.
I wrote a piece of code in R, but I don't get the total number of missing values that I should.
path = "files path"
out.file<-data.frame(TS = 0, Index = 0, File = '')
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
file <- cbind(read.table(file.names[i],
header=F,
sep ="\t",
stringsAsFactors=FALSE),
file.names[i])
colnames(file) <- c('TS', 'Index', 'File')
out.file <- rbind(out.file, file)
}
d = dim(out.file)[1]
misDa = 0
for(i in 2:(d-1)){
if(abs(out.file$Index[i]-out.file$Index[i+1]) > 1)
misDa = misDa+1
}
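As a point of comparison, here is a minimal sketch of the gap computation in Python rather than R, applied to the C2 column from the example above. The function name `gap_lengths` is made up. The sketch assumes the index column is sorted and is processed one file at a time, so that the jump between the last row of one file and the first row of the next is not counted as a gap. It records the length of each gap, whereas the loop above adds 1 per gap, which may be one reason the total looks off.

```python
from collections import Counter

def gap_lengths(indices):
    """Return the number of missing indices in each gap of a sorted index column."""
    gaps = []
    for previous, current in zip(indices, indices[1:]):
        missing = current - previous - 1
        if missing > 0:
            gaps.append(missing)
    return gaps

# C2 column from the example in the question (4, 5 and 6 are missing).
c2 = [0, 1, 2, 3, 7, 8, 9, 10, 11, 12, 13]

gaps = gap_lengths(c2)
total_missing = sum(gaps)        # total number of missing values in this file
length_histogram = Counter(gaps) # gap length -> number of gaps of that length

print(gaps, total_missing, dict(length_histogram))
```

Running `gap_lengths` per file and concatenating the results gives the data for a histogram of gap lengths across all files.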

It is hard to give specific hints without a more extensive example of your data that contains some of the actual NAs.
If you are using R (as it seems), the naniar and imputeTS packages offer nice functions for missing data visualization.
The naniar package is especially good for multivariate data.
The imputeTS package is especially good for time series data.

ValueError: Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds

I am learning machine learning and creating my first model on the MNIST dataset.
Can anyone help me here? I have tried StratifiedKFold, KFold and other methods to resolve this issue.
Pandas Version '0.25.1', Python Version 3.7, using Anaconda Distribution.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

train_set, test_set = train_test_split(mnist, test_size=0.2, random_state=29)
# Double brackets keep the labels as one-column DataFrames of shape (n, 1)
X_train, y_train = train_set.drop('label', axis=1), train_set[['label']]
X_test, y_test = test_set.drop('label', axis=1), test_set[['label']]
y_train_5 = (y_train == 5)  # True for all 5's and False otherwise
y_test_5 = (y_test == 5)
sgd_clf = SGDClassifier(random_state=29)
sgd_clf.fit(X_train, y_train_5)
print(X_train.shape)
print(y_train_5.shape)
cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
Last line of the code block gives an error:
RuntimeWarning: Number of classes in training fold (2) does not match total number of classes (1). Results may not be appropriate for your use case. To fix this, use a cross-validation technique resulting in properly stratified folds
RuntimeWarning)
ValueError Traceback (most recent call last)
<ipython-input-39-da1ad024473a> in <module>
3 print(X_train.shape)
4 print(y_train_5.shape)
----> 5 cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, pre_dispatch, method)
787 prediction_blocks = parallel(delayed(_fit_and_predict)(
788 clone(estimator), X, y, train, test, verbose, fit_params, method)
--> 789 for train, test in cv.split(X, y, groups))
790
791 # Concatenate the predictions
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
919 # remaining jobs.
920 self._iterating = False
--> 921 if self.dispatch_one_batch(iterator):
922 self._iterating = self._original_iterator is not None
923
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_predict(estimator, X, y, train, test, verbose, fit_params, method)
887 n_classes = len(set(y)) if y.ndim == 1 else y.shape[1]
888 predictions = _enforce_prediction_order(
--> 889 estimator.classes_, predictions, n_classes, method)
890 return predictions, test
891
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _enforce_prediction_order(classes, predictions, n_classes, method)
933 'is not supported for decision_function '
934 'with imbalanced folds. {}'.format(
--> 935 len(classes), n_classes, recommendation))
936
937 float_min = np.finfo(predictions.dtype).min
ValueError: Only 2 class/es in training fold, but 1 in overall dataset. This is not supported for decision_function with imbalanced folds. To fix this, use a cross-validation technique resulting in properly stratified folds

I ran into a similar problem and, on further investigation, found this warning alongside the error log:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
There are two ways to solve this.
First, follow the hint in the warning message and change your code to:
cross_val_predict(sgd_clf, X_train, y_train_5.values.ravel(), cv=3,
method="decision_function")
Second, the same hint made me realize my mistake, which also shows up in your error log: Number of classes in training fold (2) does not match total number of classes (1).
I assume y_train_5 here is a DataFrame (you are probably working your way through Aurelien's publication).
The expected type for y_train_5 is an array-like object of shape (n,), i.e. one-dimensional, but a DataFrame is two-dimensional, in your case (n, 1).
All you need to do is pass the Series object for your column instead:
y_train_5.iloc[:,0] (I prefer this)
y_train_5.{COLUMN_NAME} (another variant)
Try running the following in your console:
> y_train_5.iloc[:,0].shape
(n,)
cross_val_predict(sgd_clf, X_train, y_train_5.iloc[:,0], cv=3,
method="decision_function")
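The mismatch "2 classes in training fold, but 1 in overall dataset" can be reproduced from the expression visible in the question's traceback (`n_classes = len(set(y)) if y.ndim == 1 else y.shape[1]`). The sketch below is a NumPy-only illustration with a made-up four-element label array. The helper name is hypothetical and this is not scikit-learn's actual code path beyond that one expression.

```python
import numpy as np

def n_classes_as_in_traceback(y):
    # Mirrors the expression shown in the traceback:
    # n_classes = len(set(y)) if y.ndim == 1 else y.shape[1]
    return len(set(y)) if y.ndim == 1 else y.shape[1]

# Shape (4, 1), like the values of a one-column DataFrame.
y_column = np.array([[True], [False], [False], [True]])
# Shape (4,), like a Series or the result of ravel().
y_flat = y_column.ravel()

print(y_column.shape, n_classes_as_in_traceback(y_column))
print(y_flat.shape, n_classes_as_in_traceback(y_flat))
```

With the column-shaped input, the number of columns (1) is taken as the number of classes, which no longer matches the two classes the estimator saw during fitting. Flattening the labels removes the mismatch.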

Dask distributed does not run SVD if some of the chunks contain only NaN values

First of all thank you for providing dask with all its functionality, which is highly appreciated!
However, when using dask.distributed to run an SVD on a rasterized dataset, the computation seems to fail when individual chunks consist only of NaN values, even though most of the dataset contains valid values.
I read a dataset using xarray.open_mfdataset(chunks={...}) and try to set the chunk size such that the SVD computation (dask.array.linalg) used in the eofs.xarray package makes use of the cores our cluster provides via a dask.distributed client.
<xarray.Dataset>
Dimensions: (time: 8760, x: 1000, y: 840)
Coordinates:
* x (x) float64 2.452e+06 2.458e+06 2.462e+06 ... 7.442e+06 7.448e+06
* y (y) float64 1.352e+06 1.358e+06 1.362e+06 ... 5.542e+06 5.548e+06
* time (time) datetime64[ns] 2005-01-01 ... 2005-12-31T23:00:00
Data variables:
capacity (y, x) float64 dask.array<shape=(840, 1000), chunksize=(840, 840)>
capfac (time, y, x) float32 dask.array<shape=(8760, 840, 1000), chunksize=(876, 840, 840)>
However, when I run the computation, it fails with the below-mentioned error message.
ValueError: error encountered in SVD, check that missing values are in the same places at each time and that all the values are not missing
See complete error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/eofs/standard.py in __init__(self, dataset, weights, center, ddof)
164 # Use the parallel Dask algorithm
--> 165 dsvd = dask.array.linalg.svd(dataNoMissing)
166 A, Lh, E = (x.compute() for x in dsvd)
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/dask/array/linalg.py in svd(a)
803 """
--> 804 return tsqr(a, compute_svd=True)
805
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/dask/array/linalg.py in tsqr(data, compute_svd, _max_vchunk_size)
116 "Current shape: {},\nCurrent chunksize: {}".format(
--> 117 data.shape, data.chunksize
118 )
ValueError: Input must have the following properties:
1. Have two dimensions
2. Have only one column of blocks
Note: This function (tsqr) supports QR decomposition in the case of
tall-and-skinny matrices (single column chunk/block; see qr)Current shape: (8760, nan),
Current chunksize: (876, nan)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-17-f60250fedf8b> in <module>
----> 1 pca.analyze()
~/code/tsa_lib/tsa_lib/time_tools.py in f(*args, **kwargs)
8 def f(*args, **kwargs):
9 before = time.perf_counter() # maybe exchange with time.process_time()
---> 10 rv = func(*args, **kwargs)
11 after = time.perf_counter()
12 print('elapsed time for {.__name__}: {:.2f} minutes'.format(func, (after - before)/60))
~/code/playground/playground/PCA.py in analyze(self)
145 print('PCA completed. Weights used.')
146 else:
--> 147 self.eofs, self.pcs, self.solver = eof_analysis(self.data_variability, n_eofs=None, xarray=True)
148 print('PCA completed. No weights used.')
149
~/code/tsa_lib/tsa_lib/time_tools.py in f(*args, **kwargs)
8 def f(*args, **kwargs):
9 before = time.perf_counter() # maybe exchange with time.process_time()
---> 10 rv = func(*args, **kwargs)
11 after = time.perf_counter()
12 print('elapsed time for {.__name__}: {:.2f} minutes'.format(func, (after - before)/60))
~/code/playground/playground/PCA.py in eof_analysis(data, n_eofs, xarray, wgts, lats)
36 solver = xEof(data, weights=wgts)
37 else:
---> 38 solver = xEof(data)
39
40 eofs = solver.eofsAsCovariance(neofs=n_eofs)
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/eofs/xarray.py in __init__(self, array, weights, center, ddof)
131 weights=wtarray,
132 center=center,
--> 133 ddof=ddof)
134 # Name of the input DataArray.
135 self._name = array.name
~/.conda/envs/spagat_py36/lib/python3.6/site-packages/eofs/standard.py in __init__(self, dataset, weights, center, ddof)
175
176 except (np.linalg.LinAlgError, ValueError):
--> 177 raise ValueError('error encountered in SVD, check that missing '
178 'values are in the same places at each time and '
179 'that all the values are not missing')
ValueError: error encountered in SVD, check that missing values are in the same places at each time and that all the values are not missing
When applying SVD to a rasterized dataset, the above error is raised. Is it possible that the error is raised because individual chunks contain only NaN values?
If so, it could be considered a bug in dask.distributed, because the SVD works fine without chunking. The SVD should not fail just because individual chunks contain only NaN values while other chunks contain valid values, should it?
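The eofs error message asks to check that missing values are in the same places at each time and that not all values are missing. A minimal sketch of that check follows, using NumPy on a tiny made-up in-memory array rather than the dask-backed dataset above. The function name `missing_value_masks` is hypothetical.

```python
import numpy as np

def missing_value_masks(data):
    """Classify grid cells of a (time, y, x) array by their missing-value pattern."""
    nan_per_cell = np.isnan(data).sum(axis=0)   # NaN count over time per (y, x) cell
    n_time = data.shape[0]
    always_missing = nan_per_cell == n_time
    sometimes_missing = (nan_per_cell > 0) & (nan_per_cell < n_time)
    return always_missing, sometimes_missing

# Tiny made-up example: 3 time steps on a 2 x 2 grid.
data = np.ones((3, 2, 2), dtype=np.float32)
data[:, 0, 0] = np.nan      # cell (0, 0) is missing at every time step
data[1, 1, 1] = np.nan      # cell (1, 1) is missing at one time step only

always, sometimes = missing_value_masks(data)
print(int(always.sum()), int(sometimes.sum()))
print(bool(np.isnan(data).all()))   # True only if every value is missing
```

Cells flagged in `sometimes` are the ones whose missing values are not in the same place at each time. Cells flagged in `always` are consistently missing.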

GLMM glmer and glmmADMB - comparison error

I am trying to test whether the number of seeds obtained differs among five populations under different treatments, with maternal plant and paternal plant as random effects. First I fitted a glmer model.
dat <-dat [,c(12,7,6,13,8,11)]
dat$parents<-factor(paste(dat$mother,dat$father,sep="_"))
compareTreat <- function(d)
{
d$treatment <-factor(d$treatment)
print (tapply(d$pop,list(d$pop,d$treatment),length))
print(summary(fit<-glmer(seed_no~treatment+(1|pop/mother)+
(1|pop/father),data=d,family="poisson")))
}
Then I compared two treatments in two populations (pop 64 and pop 121 in this case). The other populations do not have these particular treatments, so I get NA values for those.
compareTreat(subset(dat,treatment%in%c("IE 5x","IE 7x")&pop%in%c(64,121)))
This is the output:
IE 5x IE 7x
10 NA NA
45 NA NA
64 31 27
121 33 28
144 NA NA
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: seed_no ~ treatment + (1 | pop/mother) + (1 | pop/father)
Data: d
AIC BIC logLik deviance df.resid
592.5 609.2 -290.2 580.5 113
Scaled residuals:
Min 1Q Median 3Q Max
-1.8950 -0.8038 -0.2178 0.4440 1.7991
Random effects:
Groups Name Variance Std.Dev.
father.pop (Intercept) 3.566e-01 5.971e-01
mother.pop (Intercept) 9.456e-01 9.724e-01
pop (Intercept) 1.083e-10 1.041e-05
pop.1 (Intercept) 1.017e-10 1.008e-05
Number of obs: 119, groups: father:pop, 81; mother:pop, 24; pop, 2
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.74664 0.24916 2.997 0.00273 **
treatmentIE 7x -0.05789 0.17894 -0.324 0.74629
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
tretmntIE7x -0.364
It seems there are no differences between treatments. But as there are many zeros in the data, a zero-inflated model would be worth trying. I tried glmmADMB and wrote the script like this:
compareTreat<-function(d)
{
d$treatment<-factor(d$treatment)
print(tapply(d$pop,list(d$pop,d$treatment), length))
print(summary(fit_zip<-glmmadmb(seed_no~treatment + (1|pop/mother)+
(1|pop/father),data=d,family="poisson", zeroInflation=TRUE)))
}
Then I compared again the treatments. Here I have not changed the code.
compareTreat(subset(dat,treatment%in%c("IE 5x","IE 7x")&pop%in%c(64,121)))
But in that case, the output is
IE 5x IE 7x
10 NA NA
45 NA NA
64 31 27
121 33 28
144 NA NA
Error in pop:father : NA/NaN argument
In addition: Warning messages:
1: In pop:father :
numerical expression has 119 elements: only the first used
2: In pop:father :
numerical expression has 119 elements: only the first used
3: In eval(parse(text = x), data) : NAs introduced by coercion
Called from: eval(parse(text = x), data)
I tried changing everything I could think of, but I still don't know where the problem is.
If I remove (1|pop/father) from the glmmadmb script, the model runs, but that does not feel correct. I wonder whether the mistake is in the code before the glmmadmb call (though it worked fine with the glmer model) or in the comparison itself after the model. I also tried removing NAs with na.omit in case that was the issue, but it made no difference. Why does the script stop instead of continuing to run?
I am a student and a beginner with RStudio; my R version is 3.4.2, "Short Summer". If someone with experience could point me in the right direction I would be very grateful!
H.

How to predict multi-label dataset using svm

I'm using a dataset of decimal values with a timestamp, which has the following features:
1. sno
2. timestamp
3. v1
4. v2
5. v3
I have 5 months of data with a timestamp for every minute. I need to predict whether v1, v2, v3 will be in use at any time in the future. The values of v1, v2, v3 are between 0 and 25.
How can I do this?
I've used binary classification before, but I have no clue how to proceed with a multi-label problem. I've always used the code below. How should I train the model, and how should I use v1, v2, v3 to build 'y'?
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.2)
Data:
sno power voltage v1 v2 v3 timestamp
1 3.74 235.24 0 16 18 2006-12-16 18:03:00
2 4.928 237.14 0 37 16 2006-12-16 18:04:00
3 6.052 236.73 0 37 17 2006-12-16 18:05:00
4 6.752 237.06 0 36 17 2006-12-16 18:06:00
5 6.474 237.13 0 37 16 2006-12-16 18:07:00
6 6.308 235.84 0 36 17 2006-12-16 18:08:00
7 4.464 232.69 0 37 16 2006-12-16 18:09:00
8 3.396 230.98 0 22 18 2006-12-16 18:10:00
9 3.09 232.21 0 12 17 2006-12-16 18:11:00
10 3.73 234.19 0 27 17 2006-12-16 18:12:00
11 2.308 234.96 0 1 17 2006-12-16 18:13:00
12 2.388 236.66 0 1 17 2006-12-16 18:14:00
13 4.598 235.84 0 20 17 2006-12-16 18:15:00
14 4.524 235.6 0 9 17 2006-12-16 18:16:00
15 4.202 235.49 0 1 17 2006-12-16 18:17:00

Following the documentation:
The multiclass support is handled according to a one-vs-one scheme (and should thus support one-vs-all strategy).
One-vs-one strategy
The one-vs-one scheme trains one classifier per pair of classes. At prediction time, the class that receives the most votes (one vote per pairwise classifier) is selected. If the vote is tied, i.e. two classes have an equal number of votes, the classification confidence is used to break the tie.
To use SVM with such a scheme:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
...
subclf = SVC(**params)
clf = OneVsOneClassifier(estimator=subclf)
clf.fit(X_train, y_train)
One-vs-rest strategy
The alternative is the one-vs-all (one-vs-rest) strategy. It fits one classifier per class, each against all other classes in the data. It is more popular than the first scheme because the results are easier to interpret and the computation time is much lower. It is as simple to use as the first example:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
...
subclf = SVC(**params)
clf = OneVsRestClassifier(estimator=subclf)
clf.fit(X_train, y_train)
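The voting step of the one-vs-one scheme can be sketched without any library. The example below is a toy illustration, not scikit-learn's implementation. The class names, the one-dimensional class centres and the closest-centre "classifiers" are all made up, and ties are not broken by confidence as described above.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, pairwise_classifiers):
    """Each pairwise classifier votes for one of its two classes; most votes wins."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise_classifiers[pair](x)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical 1-D toy problem: class centres at 0, 10 and 20.
centres = {"a": 0.0, "b": 10.0, "c": 20.0}
classes = sorted(centres)

def make_pairwise(first, second):
    # Vote for whichever of the two class centres is closer to x.
    return lambda x: first if abs(x - centres[first]) <= abs(x - centres[second]) else second

pairwise = {pair: make_pairwise(*pair) for pair in combinations(classes, 2)}

# One-vs-one needs K*(K-1)/2 classifiers; one-vs-rest needs only K.
print(len(pairwise), len(classes))
print(one_vs_one_predict(12.0, classes, pairwise))
```

For three classes both schemes need three classifiers, but the one-vs-one count grows quadratically with the number of classes while one-vs-rest grows linearly. This is consistent with the lower computation time mentioned above.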
To read more about multi-label classification and learning proceed here
Target variable coding
The basic idea is to build a composite (i.e. multi-label) target variable such that:
y equals 0 if v1, v2 and v3 are all zero
y equals 1 if exactly one of v1, v2, v3 is nonzero
y equals 2 if exactly two of v1, v2, v3 are nonzero
y equals 3 if v1, v2 and v3 are all nonzero
One way to do this:
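y = []
for i, j, k in zip(data['v1'], data['v2'], data['v3']):
    # Compare each value explicitly: `i and j and k > 0` only compares k
    # with 0 and relies on the truthiness of i and j.
    if i > 0 and j > 0 and k > 0:
        y.append(3)
    elif (i > 0 and j > 0) or (i > 0 and k > 0) or (j > 0 and k > 0):
        y.append(2)
    elif i > 0 or j > 0 or k > 0:
        y.append(1)
    else:
        y.append(0)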
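The same coding can be written more compactly by counting how many of the three values are positive. The sketch below is self-contained. The first three rows are (v1, v2, v3) values taken from the sample data in the question, the last two rows are made up to cover the remaining cases, and the function name `encode_target` is hypothetical.

```python
def encode_target(v1, v2, v3):
    """Number of the three readings that are active (> 0): 0, 1, 2 or 3."""
    return sum(value > 0 for value in (v1, v2, v3))

rows = [
    (0, 16, 18),   # from the sample data
    (0, 37, 16),   # from the sample data
    (0, 1, 17),    # from the sample data
    (0, 0, 0),     # made up: nothing active
    (5, 3, 1),     # made up: everything active
]

y = [encode_target(*row) for row in rows]
print(y)
```

The resulting `y` is a single multiclass target with four possible values, which is the kind of target the one-vs-one and one-vs-rest wrappers above expect.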
