XGBoost error - "Check failed: auc <= local area"

I was running code for a machine learning classification project with XGBoost in Jupyter and received this error message. Can someone clarify what it means and how to avoid it?
I checked my data splits, and the train / test / validation sets all contain examples of both the 0 and the 1 class.
I am using BayesSearchCV to tune the model over 5 folds, with n_jobs=7 to take advantage of parallel processing via joblib.
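For context, a minimal sketch of a setup like the one described (the search space and the X, y names are illustrative placeholders, not the actual code); passing an explicit StratifiedKFold ensures every internal fold contains both classes, since a single-class fold is a common trigger for AUC-related check failures:
import xgboost as xgb
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

# Stratified folds keep both classes in every CV split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

opt = BayesSearchCV(
    xgb.XGBClassifier(),
    {"max_depth": (2, 8), "learning_rate": (0.01, 0.3, "log-uniform")},  # illustrative space
    cv=cv,
    scoring="roc_auc",
    n_jobs=7,   # parallel fits via joblib, as in the question
    n_iter=25,
)
# opt.fit(X, y)  # X, y: the classification data described above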

Related

k-fold cross validation in RankLib

I want to do 5-fold cross-validation on the MQ2008 dataset. I am using RankLib to apply ML algorithms to the dataset, and I am confused about the kcv option RankLib provides for cross-validation.
Command used:
java -jar RankLib.jar -ranker 0 -train train.txt -test test.txt -validate vali.txt -kcv 5
Here we are specifying different files for training, testing, and validation, so how does it divide the data for 5-fold cross-validation?
To do k-fold cross-validation with RankLib, you only need to use one dataset.
The program itself divides the data into training, test, and validation sets at random.
When you use 5-fold cross-validation, the program repeats the process 5 times and gives you the average of the 5 runs as the final result.
You also need to choose a metric for the evaluation; see [ -metric2t <metric> ] on the How to use page.
For example, see the command below. I have only one dataset to feed to the algorithm and use NDCG#10 as the evaluation metric. I also use -kcvmd to save the models in a directory and -kcvmn to name them.
java -jar RankLib-2.1-patched.jar -train trainingData.txt -ranker 8 -kcv 5 -kcvmd kcvModels/ -kcvmn txt -metric2t NDCG#10 -metric2T NDCG#10 -save Models/model.txt

sklearn: high score with low performance

Could you kindly help me decide whether I'm hitting a bug or whether the problem is in my implementation?
I have a data set with 5 features and 2000+ observations, and I use SVR for regression tests, selecting parameters with grid search. If I don't scale my data, I get a best score close to zero, but if I do scale it, the best score is around 0.90.
When I manually test the model, it predicts totally random, wrong values. How can this be? I expect the best score to show how well the trained model validated on unseen data during cross-validation. I should not get a high score if my model cannot generalize well, should I? Could this be a bug?
The scikit-learn version is 0.19.1 (from the Ubuntu Linux 18.04 x64 LTS package).
The Python version is 3.6.7.
Would it be worth upgrading with pip? Any further ideas? Thank you.
Edit: the following code produces a high score yet still generalizes badly. Even though this is regression, the score should reflect how far the predictions are from the test values:
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# exponential grids for C and gamma: 2^-5, 2^-3, ..., 2^13
C_range = 2.0 ** np.arange(-5, 15, 2)
gamma_range = 2.0 ** np.arange(-5, 15, 2)
parameters = {"kernel": ["rbf"], "C": C_range, "gamma": gamma_range}
estimator = svm.SVR()
clf = GridSearchCV(estimator, parameters, cv=3, n_jobs=-1, verbose=0)
clf.fit(x, y)  # x, y: the 5-feature data set described above
print(clf.best_score_)
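One pattern worth checking (an assumption about the surrounding code, since the scaling step is not shown above): if the scaler is fitted on the whole data set before cross-validation, information leaks across folds and the CV score can look better than real generalization. A sketch that keeps the scaling inside each fold with a Pipeline:
import numpy as np
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# The scaler is re-fitted on the training portion of every CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("svr", svm.SVR(kernel="rbf"))])
parameters = {"svr__C": 2.0 ** np.arange(-5, 15, 2),
              "svr__gamma": 2.0 ** np.arange(-5, 15, 2)}
clf = GridSearchCV(pipe, parameters, cv=3, n_jobs=-1)
# clf.fit(x, y)  # same x, y as above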

Matching PyTorch w/ CNTK (VGG on CIFAR)

I am trying to understand how PyTorch works and want to replicate a simple CNN training run on CIFAR. The CNTK script reaches 0.76 accuracy after 168 seconds of training (10 epochs), which is similar to my MXNet script (0.75 accuracy after 153 seconds).
However, my PyTorch script lags far behind at 0.71 accuracy and 354 seconds. I appreciate that I will see differences in accuracy due to stochastic weight initialisation, etc. However, the difference across frameworks is much greater than the difference within a framework between randomly initialised runs.
The reasons I can think of:
MXNet and CNTK are initialised with xavier/glorot uniform; I am not sure how to do this in PyTorch, so perhaps the weights are initialised to 0.
CNTK does gradient clipping by default; I am not sure whether PyTorch has an equivalent.
Perhaps the bias is dropped in PyTorch by default.
I use SGD with momentum; perhaps the PyTorch implementation of momentum is slightly different.
Edit:
I have tried specifying the weight initialisation, but it seems to have no big effect:
import numpy as np
from torch import nn
from torch.nn import init
self.conv1 = nn.Conv2d(3, 50, kernel_size=3, padding=1)
init.xavier_uniform(self.conv1.weight, gain=np.sqrt(2.0))  # xavier/glorot uniform
init.constant(self.conv1.bias, 0)
I'll try to answer your first two questions:
Weight initialisation: different kinds of layers have their own methods; you can find the default weight initialisation of all these layers in the following link: https://github.com/pytorch/pytorch/tree/master/torch/nn/modules
Gradient clipping: you might want to use torch.nn.utils.clip_grad_norm, for example as sketched below.
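A minimal sketch of where clipping fits in a training step (the max-norm of 5.0 and the tiny stand-in model are placeholders; newer PyTorch versions rename the function to clip_grad_norm_, used here):
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the actual CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
# rescale gradients so their total norm is at most 5.0, then update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()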
In addition, I am curious why you don't use torchvision.transforms, torch.utils.data.DataLoader, and torchvision.datasets.CIFAR10 to load and preprocess your data.
There is a similar image classification tutorial for CIFAR in PyTorch:
http://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
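For reference, the loading pattern from that tutorial looks roughly like this (a sketch; the batch size and normalization constants follow the tutorial):
import torch
import torchvision
import torchvision.transforms as transforms

# normalize each channel from [0, 1] to [-1, 1], as in the tutorial
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)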
Hope this can help you.

Weka Classification

I was trying to build a classification machine learning model on a data set that has 32 attributes, the last column being the target class. I reduced the number of attributes from 32 to 6, which I felt would be more useful for my classification model.
I tried J48 and some incremental classification algorithms.
I expected output consisting of a confusion matrix, correctly and incorrectly classified instances, and the kappa value.
But my result did not give any information on correctly and incorrectly classified instances. It also did not report a confusion matrix or the kappa value. All I received is this:
=== Summary ===
Correlation coefficient 0.9482
Mean absolute error 0.2106
Root mean squared error 0.5673
Relative absolute error 13.4077 %
Root relative squared error 31.9157 %
Total Number of Instances 1461
Can anyone tell me why I did not get the confusion matrix, kappa, and correctly/incorrectly classified instances information?
Unfortunately you didn't include your code or say which version of Weka you are using.
By the way, to calculate the confusion matrix, kappa, etc., you can use the methods of the Evaluation class: http://weka.sourceforge.net/doc.dev/weka/classifiers/Evaluation.html
For example, after you train your model:
classifier.buildClassifier(train); // train is an Instances object
Evaluation eval = new Evaluation(train);
// evaluate the model with 10-fold cross-validation
eval.crossValidateModel(classifier, train, 10, new Random(1));
System.out.println(classifier);
// print different stats
System.out.println(eval.toSummaryString());
System.out.println(eval.toMatrixString());
System.out.println(eval.toClassDetailsString());

How to re-evaluate a model in WEKA?

I am trying to solve a numeric classification problem with numeric attributes in WEKA using linear regression, and then I want to test my model on the existing dataset with "Re-evaluate model on current test set".
As a result of the evaluation I am getting the summary:
Correlation coefficient 0.9924
Mean absolute error 1.1017
Root mean squared error 1.2445
Total Number of Instances 17
But I don't get the per-instance predictions shown here: http://weka.wikispaces.com/Making+predictions
How do I get WEKA to produce the result I need?
Thank you.
To answer my own question: for a trained and tested model, right-click on the model in the result list and choose "Visualize classifier errors". There, use the Save option to save the actual and predicted values.
Are you using the command line interface (CLI) or the GUI?
If the CLI, the command given in the link above works fine:
java weka.classifiers.trees.J48 -T unclassified.arff -l j48.model -p 0
When you train the model, you save it as *.model (j48.model) and later load it to make predictions on the test data (unclassified.arff).
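For completeness, the training step that produces j48.model could look like the line below (a sketch; train.arff is a placeholder file name; -t names the training file and -d saves the trained model):
java weka.classifiers.trees.J48 -t train.arff -d j48.model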
