Score/Predict a large dataset using Dask with Lightgbm - dask

Looking to use a Dask distributed cluster to speed up LightGBM scoring/predictions. Essentially I'm looking for the LightGBM equivalent of ParallelPostFit, which currently appears to work only with sklearn models: https://examples.dask.org/machine-learning/parallel-prediction.html
Does anybody know what the LightGBM equivalent is?

In the time since this question was first asked, dask-lightgbm has been absorbed into lightgbm and the dask-lightgbm repository has been archived.
lightgbm (the official Python package for LightGBM) now provides interfaces for training and prediction with Dask.
See https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#prediction-with-dask for documentation or https://github.com/microsoft/LightGBM/blob/fdc582ea6ba13faf15ee6707c7c7542790c8821d/examples/python-guide/dask/prediction.py for a code example.
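As a rough illustration, here is a minimal sketch of prediction through the built-in Dask interface (it assumes lightgbm>=3.2 with its Dask support installed and a running dask.distributed cluster; the data sizes and parameters are placeholders, not taken from the question):

import dask.array as da
import lightgbm as lgb
from distributed import Client, LocalCluster

if __name__ == "__main__":
    # Spin up a local cluster; in practice this would be your existing cluster.
    cluster = LocalCluster(n_workers=2)
    client = Client(cluster)

    # Training data partitioned across workers as Dask arrays.
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.random((100_000,), chunks=(10_000,))

    # DaskLGBMRegressor trains across the cluster; its predict() also accepts
    # Dask collections and returns a lazy dask.array.
    model = lgb.DaskLGBMRegressor(n_estimators=50)
    model.fit(X, y)

    preds = model.predict(X)
    print(preds[:5].compute())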

Did you try these two?
dask-lightgbm
xgboost-dask

You can save the trained model and wrap it in dask-ml's ParallelPostFit.
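A minimal sketch of that approach, assuming dask-ml is installed and using lightgbm's scikit-learn interface (the data shapes and chunk sizes below are illustrative):

import dask.array as da
import lightgbm as lgb
from dask_ml.wrappers import ParallelPostFit
from sklearn.datasets import make_classification

# Train locally on a small sample with the regular scikit-learn interface.
X_small, y_small = make_classification(n_samples=1_000, n_features=20)
clf = lgb.LGBMClassifier(n_estimators=50).fit(X_small, y_small)

# Wrap the already-fitted model; ParallelPostFit parallelizes predict/transform.
wrapped = ParallelPostFit(estimator=clf)

# The large dataset lives in a Dask array; predictions run chunk by chunk.
X_big = da.random.random((1_000_000, 20), chunks=(100_000, 20))
preds = wrapped.predict(X_big)   # lazy dask.array
preds.compute()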

Related

TF-IDF calculation in Dask

Apache Spark comes with a package to do TF-IDF calculations that I find quite handy:
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Is there any equivalent, or maybe a way to do this with Dask? If so, can it also be done in horizontally scaled Dask (i.e., a cluster with multiple GPUs)?
This was also asked on the dask gitter, with the following reply by @stsievert:
counting/hashing vectorizers are similar. They’re in Dask-ML and are the same as TFIDF without the normalization/IDF function.
I think this would be a good github issue/feature request.
See the Dask-ML API documentation for HashingVectorizer.
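For reference, a minimal sketch of the hashing-vectorizer route with Dask-ML (the corpus and n_features value are illustrative; this produces hashed term counts, not fully normalized TF-IDF):

import dask.bag as db
from dask_ml.feature_extraction.text import HashingVectorizer

# A toy corpus as a Dask Bag of raw documents, split across partitions.
docs = db.from_sequence(
    ["the quick brown fox", "jumped over the lazy dog", "the dog barked"],
    npartitions=2,
)

vectorizer = HashingVectorizer(n_features=2**10)
X = vectorizer.fit_transform(docs)   # lazy dask.array of hashed term counts
X.compute()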

How can re-train my logistic model using pymc3?

I have a binary classification problem with around 15 features, which I chose using another model. Now I want to perform Bayesian logistic regression on these features. My target classes are highly imbalanced (the minority class is 0.001%) and I have around 6 million records. I want to build a model that can be retrained nightly or on weekends using Bayesian logistic regression.
Currently, I have divided the data into 15 parts. I train my model on the first part and test on the last part, then I update my priors using the Interpolated method of pymc3 and rerun the model on the 2nd set of data. I check the accuracy and other metrics (ROC, f1-score) after each run.
Problems:
My score is not improving.
Am I using the right approach?
This process is taking too much time.
If someone can guide me with the right approach and code snippets, it would be very helpful.
You can use variational inference. It is faster than sampling and produces very similar results. pymc3 itself provides methods for VI; you can explore those.
That is the only part of the question I can answer. If you can elaborate on your problem a bit further, maybe I can help more.
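As a rough sketch of what the VI suggestion could look like in pymc3 (the model structure, priors, and placeholder data below are assumptions, not taken from the original question):

import numpy as np
import pymc3 as pm

# Placeholder data standing in for the 15 selected features and binary target.
X = np.random.randn(10_000, 15)
y = (np.random.rand(10_000) < 0.01).astype(int)

with pm.Model():
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=X.shape[1])
    intercept = pm.Normal("intercept", mu=0.0, sigma=1.0)
    p = pm.math.sigmoid(pm.math.dot(X, beta) + intercept)
    pm.Bernoulli("obs", p=p, observed=y)

    # ADVI is typically much faster than NUTS sampling on data of this size.
    approx = pm.fit(n=20_000, method="advi")
    trace = approx.sample(1_000)   # posterior draws from the fitted approximation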

Extract Classification Function In Supervised Learning

Possibly I am asking a trivial question, but the answer is crucial to me.
I'm really new to machine learning. I have read about supervised learning and I know the basics of these kinds of algorithms. The question is, when I'm using an algorithm such as J48 on a dataset, how can I find the learned function so I can use it later to classify unlabeled data?
Thank you in advance.
The "function" you are referring to is the classifier itself. It is learned during the training procedure. Consequently, in order to use your model to classify new data you have to dump it to disk or a database. How? That depends entirely on the language/implementation used. For Python you would simply pickle the object. In Java you can serialize your trained object, or use Weka to learn a J48 decision tree and save it for later use:
https://weka.wikispaces.com/Saving+and+loading+models
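For the Python route, a minimal sketch of pickling a trained tree (using scikit-learn's DecisionTreeClassifier as a stand-in for a J48-style decision tree):

import pickle
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier().fit(X, y)

# Persist the trained classifier (the learned "function") to disk.
with open("tree_model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Load it later to classify new, unlabeled data.
with open("tree_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(X[:3]))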

NuSVR vs SVR in scikit-learn

sklearn provides two SVM-based regressors, SVR and NuSVR. The latter claims to be using libsvm. However, other than that, I don't see any description of when to use which.
Does anyone have an idea?
I am trying to do regression on a 3M x 21 matrix with 5-fold cross-validation using SVR, but it is taking forever to finish. I've aborted the job and I'm now considering using NuSVR. But I'm not sure what advantage it provides.
NuSVR - http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html#sklearn.svm.NuSVR
SVR - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR
They are equivalent but slightly different parametrizations of the same implementation.
Most people use SVR.
You cannot use that many samples with a kernel SVR. You could try SVR(kernel="linear"), but that would probably also be infeasible. I recommend using SGDRegressor. You might need to adjust the learning rate and number of epochs, though.
You can also try RandomForestRegressor which should work just fine.
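A minimal sketch of the SGDRegressor suggestion (the synthetic data stands in for the 3M x 21 matrix, and the learning-rate/epoch settings are starting points, not tuned values):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your 3M x 21 feature matrix.
X = np.random.randn(100_000, 21)
y = X @ np.random.randn(21) + 0.1 * np.random.randn(100_000)

model = make_pipeline(
    StandardScaler(),   # SGD is sensitive to feature scaling
    SGDRegressor(learning_rate="invscaling", eta0=0.01, max_iter=50, tol=1e-4),
)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())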
Check out the GitHub code for NuSVR. It says that it is based on libSVM as well. NuSVR's nu parameter lets you control the fraction of support vectors used.

WEKA's MultilayerPerceptron: training then training again

I am trying to do the following with weka's MultilayerPerceptron:
Train with a small subset of the training Instances for a portion of the epochs,
then train with the whole set of Instances for the remaining epochs.
However, when I do the following in my code, the network seems to reset itself and start from a clean slate the second time.
mlp.setTrainingTime(smallTrainingSetEpochs);
mlp.buildClassifier(smallTrainingSet);
mlp.setTrainingTime(wholeTrainingSetEpochs);
mlp.buildClassifier(wholeTrainingSet);
Am I doing something wrong, or is this the way that the algorithm is supposed to work in weka?
If you need more information to answer this question, please let me know. I am kind of new to programming with weka and am unsure as to what information would be helpful.
A thread on the weka mailing list discusses a question very similar to yours.
It seems that this is how weka's MultilayerPerceptron is supposed to work. It is designed as a 'batch' learner; you are trying to use it incrementally. Only classifiers that implement weka.classifiers.UpdateableClassifier can be trained incrementally.
