Logistic Regression Using Mahout

I've just read this interesting article about logistic regression using Mahout. The tutorial is clear to me... but what would a real use case look like? For instance, when a [web] application first starts, some training data needs to be processed... and the result is kept in an OnlineLogisticRegression instance. Then, to test new data, one just needs to invoke OnlineLogisticRegression.classifyFull and look at the probability, represented by a value between 0 and 1, that the data falls into a given class.
But what if I want to improve a model and train it with additional data while the [web] application is online? The idea would be to train the model with additional data once a week or similar in order to improve accuracy. What's the correct way to implement such a mechanism? Are there significant performance issues?

I don't know your exact use case, but here is how I implemented something similar.
I used Naive Bayes; the current model serves predictions online.
Every 15 days I add the new training data to the previous training data and generate a new model. Once the new model is created, a cron job swaps it in to replace the online model.
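For illustration only (this is not Mahout's API), here is a minimal Python sketch of that retrain-and-swap flow, assuming a scikit-learn-style Naive Bayes classifier and hypothetical file paths; the cron job would rerun retrain() and the web application would periodically reload the model file.

    import os
    import joblib
    from sklearn.naive_bayes import MultinomialNB

    MODEL_PATH = "models/online_model.joblib"      # path the web app loads (assumed)
    TMP_PATH = MODEL_PATH + ".tmp"

    def retrain(X_all, y_all):
        """Retrain on old + new data, then atomically replace the serving model."""
        model = MultinomialNB()
        model.fit(X_all, y_all)                    # previous training data plus the new batch
        joblib.dump(model, TMP_PATH)
        os.replace(TMP_PATH, MODEL_PATH)           # atomic swap; safe while the app stays online

    def load_serving_model():
        """Called by the web application (e.g. on a timer) to pick up the latest model."""
        return joblib.load(MODEL_PATH)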

Related

How to stack neural network and xgboost model?

I have trained a neural network and an XGBoost model for the same problem, and now I am confused about how I should stack them. Should I just pass the output of the neural network as an input to the XGBoost model, or should I take a weighted combination of their results separately? Which would be better?
This question cannot be answered definitively. I would suggest trying both possibilities and choosing the one that works best.
Using the output of one model as input to the other model
I assume you know what you have to do to use the output of the NN as an input to XGBoost. You should just take some care over how you handle the train and test data (see below). Use the predicted "probabilities" rather than the binary labels for that. Of course, you could also try it the other way around, so that the NN gets the output of the XGBoost model as an additional input.
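As a rough sketch of that first approach (scikit-learn's MLPClassifier standing in for the neural network and the xgboost Python package; both choices are assumptions, not from the question). Out-of-fold probabilities are used on the training set, which already addresses the data-handling point discussed further below:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_predict
    from sklearn.neural_network import MLPClassifier
    from xgboost import XGBClassifier

    # Placeholder data standing in for your real features/labels.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    nn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)

    # Out-of-fold "probabilities" on the training set (so XGBoost never sees predictions
    # the NN made on records it was trained on), then a normal fit for the test set.
    train_proba = cross_val_predict(nn, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    nn.fit(X_train, y_train)
    test_proba = nn.predict_proba(X_test)[:, 1]

    # The NN probability becomes one extra feature for XGBoost.
    xgb = XGBClassifier(n_estimators=200, learning_rate=0.1)
    xgb.fit(np.column_stack([X_train, train_proba]), y_train)
    print("Stacked accuracy:", xgb.score(np.column_stack([X_test, test_proba]), y_test))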
Using a VotingClassifier
The other possibility is a VotingClassifier with soft voting. You can use VotingClassifier(voting='soft') for that (to be precise, sklearn.ensemble.VotingClassifier). You could also play around with the weights parameter here.
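A minimal sketch of that, again with MLPClassifier standing in for the neural network (VotingClassifier needs sklearn-compatible estimators, and xgboost's XGBClassifier provides that interface):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    ensemble = VotingClassifier(
        estimators=[
            ("nn", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)),
            ("xgb", XGBClassifier(n_estimators=200, learning_rate=0.1)),
        ],
        voting="soft",     # average the predicted probabilities instead of the hard labels
        weights=[1, 1],    # the place to play around with the weighting
    )
    ensemble.fit(X_train, y_train)
    print("Soft-voting accuracy:", ensemble.score(X_test, y_test))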
Difference
The big difference is that with the first approach the XGBoost model can learn in which areas the NN is weak and in which it is strong, while with the VotingClassifier the outputs of both models are weighted equally for all samples. The latter relies on the assumption that the models output a "probability" not too close to 0 or 1 when they are not confident about the prediction for a specific input record, and this assumption might not always hold.
Handling of the Train/Test Data
In both cases, you need to think about how to handle the train/test data. It should ideally be split the same way for both models; otherwise you might introduce a data-leakage problem.
For the VotingClassifier this is no problem, because it can be used like a regular sklearn model class. For the first method (where the output of model 1 is one feature of model 2), make sure that the train-test split (or the cross-validation) uses exactly the same records. If you don't, you run the risk of validating the output of your second model on a record that was in the training set of model 1 (except for the additional feature, of course), and that is a data-leakage problem which results in a score that looks better than how the model would actually perform on unseen production data.

Temporal train-test split for forecasting

I know this may be a basic question, but I want to know if I am using the train/test split correctly.
Say I have data that ends at 2019, and I want to predict values in the next 5 years.
The graph I produced is provided below:
My training data runs from 1996 to 2014 and my test data from 2014 to 2019. The predictions over the test period fit the actual data well. I then made predictions from 2019 to 2024.
Is this the correct way to do it, or should my predictions also be for 2014-2019, just like the test data?
The test/validation data is useful for evaluating which predictor to use. Once you have decided on a model, you should retrain it on the whole dataset (1996-2019) so that you do not lose potentially valuable information from 2014-2019. Take into account that when working with time series, the newer part of the series usually matters more for your prediction than older values.
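A rough sketch of that workflow (the synthetic yearly series and the Holt-Winters model from statsmodels are assumptions for illustration only):

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Synthetic yearly values standing in for the real 1996-2019 series.
    years = pd.Index(range(1996, 2020), name="year")
    rng = np.random.RandomState(0)
    series = pd.Series(np.linspace(10.0, 50.0, len(years)) + rng.normal(0, 2, len(years)), index=years)

    train, test = series.loc[:2014], series.loc[2015:]   # 1996-2014 vs. 2015-2019

    # 1) Use the held-out years only to judge the model.
    model = ExponentialSmoothing(train, trend="add").fit()
    print("Validation MAE:", np.abs(model.forecast(len(test)).values - test.values).mean())

    # 2) Refit on the whole history before forecasting the next five years.
    final_model = ExponentialSmoothing(series, trend="add").fit()
    print(final_model.forecast(5))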

How can I use this dataset to train a fast.ai model to recognize the pupil-limbus ratio?

UPDATE:
I have created the dataset and run the model here:
https://github.com/woodytwoshoes/Eyetrain.git
I'm a medical student trying to produce a machine learning model which recognizes a particular feature of the eye: the Pupil-Limbus Ratio.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4387813/
The images I have saved encode the PLR, as calculated by an algorithm, in the filename:
GoodPLR_[pupil-limbus ratio is here]_[random number is here]
https://drive.google.com/open?id=1J1JRFq_l8aFEshFQVrmDhbDLqK7B24c7
The dataset is small, and I understand this will significantly limit the model, but a larger dataset will arrive in a month's time.
Is it correct that I must use a least-squares regression? I know that a classification model is not appropriate.
Perhaps using Jupyter notebook, is there a simple way to set up a fast.ai model to predict PLR based on this dataset?
Thank you.
PLR is useful in head trauma, neurological conditions, and psychiatry.
I used a self-designed algorithm to quickly create a dataset of images with PLR, but it has a high failure rate and a high error rate. Erroneous PLRs are not contained in the dataset.
I am currently on lesson 1 of fast.ai
https://drive.google.com/open?id=1Uzulez6NQRxXoi_iJyyOQaV3bb1nWIcR
I am hoping for a very rough model with a high error rate due to small dataset. But it is something I can improve later as more data arrives.
A suitable way would be to use a conv-net with transfer learning. fast.ai covers transfer learning in the first lesson itself; they use resnet34. Follow the detailed notes of the lectures and the notebooks. Your exact problem is not very clear though, so please describe it in more detail.
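For the original question, a minimal sketch with the fastai v2 DataBlock API, treating this as image regression; the folder path, the filename pattern GoodPLR_<ratio>_<random>.jpg, and resnet34 are assumptions to adjust to the actual dataset:

    import re
    from fastai.vision.all import *

    def get_plr(fname):
        # e.g. "GoodPLR_0.42_12345.jpg" -> 0.42 (assumed naming pattern)
        return float(re.search(r"GoodPLR_([0-9.]+)_", fname.name).group(1))

    plr_data = DataBlock(
        blocks=(ImageBlock, RegressionBlock),        # a continuous target, not classes
        get_items=get_image_files,
        get_y=get_plr,
        splitter=RandomSplitter(valid_pct=0.2, seed=42),
        item_tfms=Resize(224),
    )
    dls = plr_data.dataloaders(Path("Eyetrain/images"), bs=16)   # assumed image folder

    learn = cnn_learner(dls, resnet34, metrics=mae)  # transfer learning from ImageNet
    learn.fine_tune(5)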

Training Data Vs. Test Data

This might sound like an elementary question but I am having a major confusion regarding Training Set and Test.
When we use supervised learning techniques such as classification to predict something, a common practice is to split the dataset into two parts, a training set and a test set. The training set contains the variable we want to predict; we train the model on it and then "predict" things.
Let's take an example. We are going to predict loan defaulters at a bank, and we have the German credit dataset where we are predicting defaulters and non-defaulters, but there is already a column which says whether a customer is a defaulter or non-defaulter.
I understand the logic of prediction on UNSEEN data, like the Titanic survival data, but what is the point of prediction when the class is already given, as in the German credit lending data?
As you said, the idea is to come up with a model that can predict on UNSEEN data. The test data is only used to measure the performance of the model created from the training data. You want to make sure the model you come up with does not "overfit" your training data; that's why the test data is important. Eventually, you will use the model to predict whether a new borrower is going to default or not, and thus make a business decision about whether to approve the loan application.
The reason the default labels are included is so that you can verify that the model is working as expected and predicting the correct results; without them there is no way for anyone to be confident that their model is working as expected.
The ultimate purpose of training a model is to apply it to what you call UNSEEN data.
Even in your German credit lending example, at the end of the day you will have a trained model that you could use to predict if new - unseen - credit applications will default or not. And you should be able to use it in the future for any new credit application, as long as you are able to represent the new credit data in the same format you used to train your model.
On the other hand, the test set is just a formalism used to estimate how good the model is. You cannot know for sure how accurate your model is going to be with future credit applications, but what you can do is save a small part of your training data and use it only to check the model's performance after it has been built. That's what you would call the test set (or, more precisely, a validation set).
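As a minimal illustration with scikit-learn (synthetic data standing in for the German credit features):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Placeholder for the labelled credit data (features X, defaulted-or-not y).
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # The known labels of the test set are only used to score the model, never to fit it;
    # a genuinely new applicant would simply get model.predict(new_features).
    print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))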

What are the basic steps for training a model?

I have been put in charge of an ML employee. I have never worked with ML before.
He spends most of his time training models. We give him text files and the expected result, and he then trains his SVM model.
There are roughly two models to train each month.
This appears to be full-time work for him.
Could someone please tell me what are the basic steps for training a model? I would like to know if this really requires full-time attention!
We use Python.
Thanks
The basic process to train a model involves the following steps:
Create a model
Divide the data into training and testing sets
Apply N-fold cross-validation to reduce the bias of any single split
Check the accuracy of the model
Repeat the above steps until you reach the required accuracy.
It takes a lot of iteration and fine-tuning of the model to reach higher accuracy (a short sketch of the loop follows below).
You hired a data scientist. Let him do his work!
Hope this helps!
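As an illustration of those steps, a small scikit-learn sketch using an SVM (since that is what he trains); the dataset here is only a placeholder:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)          # placeholder for the text-file data

    # The "repeat until accuracy is good enough" loop, here over one hyperparameter.
    for C in (0.1, 1, 10):
        scores = cross_val_score(SVC(C=C, kernel="rbf"), X, y, cv=5)   # 5-fold validation
        print(f"C={C}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")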
Loading the Data
Pre-process/Clean the Data
Perform EDA
Treat Missing Values/Outliers
Split the Data
Scale the Data
One-Hot Encoding (if needed)
Train the model (fine-tune the parameters)
Evaluate the Model
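Those steps wired together in one hedged scikit-learn sketch (the tiny made-up loan table and its column names are illustrative only):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Load the data (a made-up stand-in for a real loans table).
    df = pd.DataFrame({
        "amount":    [1000, 2500, None, 4000, 1200, 3000, 800, 5000],
        "duration":  [12, 24, 36, 48, 12, 24, 6, 60],
        "purpose":   ["car", "tv", "car", "education", "education", "car", "tv", "car"],
        "housing":   ["own", "rent", "own", "own", "rent", "rent", "own", "rent"],
        "defaulted": [0, 1, 0, 1, 0, 1, 0, 1],
    })
    X, y = df.drop(columns="defaulted"), df["defaulted"]

    numeric, categorical = ["amount", "duration"], ["purpose", "housing"]
    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),   # missing values
                          ("scale", StandardScaler())]), numeric),        # scaling
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),     # one-hot encoding
    ])
    model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

    # Split, train, evaluate.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))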
