How to adjust an already built ML predictive model - machine-learning

How can I continue training a machine learning model after it has already produced results?
What I mean is: I built a model on my dataset of 1 million records, and building it took around a day.
I extracted the model using Python, and I now have a function that I can feed my features into and that returns prediction results,
but over time my dataset has grown to 1.5 million records.
I do not want to redo the whole thing from scratch.
Is there any way to continue on top of the first model I built (the one with 1 million records), so that updating it with the new 0.5 million records takes less time than rebuilding everything from scratch on 1.5 million records?
P.S. I am asking about all algorithms; if there is a way to do this for only some algorithms, it would be good to know which ones they are.
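For illustration, a minimal sketch of what incremental (online) training can look like, assuming a scikit-learn estimator that supports partial_fit such as SGDClassifier; the arrays here are random placeholders, and tree ensembles generally need algorithm-specific mechanisms (e.g. continued boosting) rather than this exact API.

import numpy as np
from sklearn.linear_model import SGDClassifier

# placeholder data standing in for the original 1 million records
X_old = np.random.rand(1000, 10)
y_old = np.random.randint(0, 2, 1000)

model = SGDClassifier(random_state=0)
model.partial_fit(X_old, y_old, classes=np.array([0, 1]))  # initial training

# later: update the same model with only the new records (the extra 0.5 million)
X_new = np.random.rand(500, 10)
y_new = np.random.randint(0, 2, 500)
model.partial_fit(X_new, y_new)  # no need to refit on the full 1.5 million

prediction = model.predict(X_new[:1])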

Related

Re-training a classifier with new data produces massive class changes

I'm training an XGBoost binary classifier every day on 150k records.
Every day, approximately 500 records are added to the training set.
The test set (inference) is 10M records without labels.
I noticed that two models trained on consecutive days produce fundamentally different predictions on the same records (up to 10% of the inference data).
What I have tried so far:
Setting the same seeds and parameters.
Trying different classifiers.
My concern is that the model does not generalize well enough, so there are significant changes in the daily predictions.
I would love to hear about possible ways to address this issue.
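For reference, a rough sketch of what "same seeds and parameters" can look like with the XGBoost scikit-learn API; the data and hyperparameters below are placeholders rather than the actual setup. Pinning the seed, disabling row/column subsampling and using the CPU hist tree method tends to make repeated runs on identical data deterministic; note this only removes run-to-run randomness, and some prediction churn is still expected once new records are added to the training set.

import numpy as np
import xgboost as xgb

# placeholder training data
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, 1000)

clf = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    tree_method="hist",    # CPU histogram method; GPU training is not guaranteed deterministic
    subsample=1.0,         # disable row subsampling
    colsample_bytree=1.0,  # disable column subsampling
    random_state=42,       # fixed seed
)
clf.fit(X_train, y_train)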

Leaderboard in H2O AutoML

I have just started learning to use H2O AutoML and I am trying out a binary classification model.
I am trying to understand why the rankings of the models change with every run.
The top 5 models remain in the top 5, but they shift slightly to a higher or lower rank.
While DRF was ranked 2nd in one run, in another it ranked 3rd.
There are a couple of reasons I can speculate cause the changes:
The seed passed to the algorithms changes each time.
There is no leaderboard frame assigned.
RF involves random sampling as part of the process, resulting in different trees being built each time.
The leaderboard should not change; some other change to the data/code is responsible for the change.
Could you please help me understand this better?
It sounds like you're not setting a seed, so you should start there. In order for the algorithms with inherent randomness (e.g. XGBoost, GBM, Random Forest) to produce the same answer each time, a random seed must be set (at a minimum). In H2O AutoML, there's a single seed argument (which gets piped down to all the individual algorithms), and if you set it to the same value each time, most of the models will be the same on repeated runs. By default, AutoML also does cross-validation with random folds, and setting the seed likewise ensures the same folds are used on each run.
There are a few caveats -- H2O Deep Learning is not reproducible (by default) even if you set a seed, so those models will always change. Since the "All Models" Stacked Ensemble uses Deep Learning models in addition to a bunch of other models, the final ensemble will also be non-reproducible.
Lastly, you should use max_models instead of max_runtime_secs to control how long AutoML runs -- otherwise you may get a different number of models on the leaderboard (and in the All Models Stacked Ensemble) on subsequent runs.
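A minimal sketch of the settings described above, assuming the Python h2o client; the file path and target column name are placeholders.

import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")   # placeholder path
y = "label"                            # placeholder target column
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()         # binary classification target

# fixed seed + max_models (not max_runtime_secs) for comparable repeated runs
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)
print(aml.leaderboard.head())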

Validating accuracy on time-series data with an atypical ending

I'm working on a project to predict demand for a product based on past historical data for multiple stores. I have data from multiple stores over a 5 year period. I split the 5-year time series into overlapping subsequences and use the last 18 months to predict the next 3 and I'm able to make predictions. However, I've run into a problem in choosing a cross-validation method.
I want to have a holdout test split and use some sort of cross-validation for training my model and tuning parameters. However, the last year of the data was a recession in which almost all demand suffered. When I use the last 20% (time-wise) of the data as a holdout set, my test score is very low compared to my out-of-fold cross-validation scores, even though I am using a TimeSeriesSplit CV. This is very likely caused by the recession being new behavior: the model can't predict these strong downswings since it has never seen them before.
The solution I'm thinking of is using a random 20% of the data as a holdout, and a shuffled KFold as cross-validation. Since I am not feeding any information about when each sequence started into the model, except the starting month (1 to 12) of the sequence (to help the model explain seasonality), my theory is that the model should not overfit the data based on that. If all states of the economy are present in the training data, the results of the model should extrapolate to new data too.
I would like a second opinion on this, do you think my assumptions are correct? Is there a different way to solve this problem?
Your overall assumption is correct in that you can probably take random chunks of time to form your training and testing set. However, when doing it this way, you need to be careful. Rather than predicting the raw values of the next 3 months from the prior 18 months, I would predict the relative increase/decrease of sales in the next 3 months vs. the mean of the past 18 months.
(see here: http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf)
Otherwise, the correlation between the next 3 months and your prior 18 months of data might give you a misleading impression of the accuracy of your model.
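A small sketch of that reframing, assuming the sales history is in a pandas DataFrame; the column name and dummy data are placeholders, and the 18-month/3-month windows mirror the setup in the question.

import pandas as pd

# placeholder monthly sales for one store
df = pd.DataFrame({"sales": range(1, 61)})

# mean of the previous 18 months (including the current one)
trailing_mean = df["sales"].rolling(window=18).mean()

# mean of the following 3 months
future_mean = df["sales"].rolling(window=3).mean().shift(-3)

# target: relative increase/decrease instead of raw sales values
df["target"] = future_mean / trailing_mean - 1.0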

Can I use transfer learning in Facebook Prophet?

I want my Prophet model to predict values for every 10-minute interval over the next 24 h (i.e. 24*6 = 144 values).
Let's say I've trained a model on a huge (over 900k rows) .csv file where a sample row is
...
ds=2018-04-24 16:10, y=10
ds=2018-04-24 16:20, y=14
ds=2018-04-24 16:30, y=12
...
So I call model.fit(huge_df) and wait for 1-2 seconds to receive the 144 values.
Then an hour passes, and I want to tune my prediction for the remaining (144 - 6) = 138 values, given the new data (6 rows).
How can I tune my existing Prophet model without having to call model.fit(huge_df + live_df) and wait for some seconds again? I'd like to be able to call something like model.tune(live_df) and get an instant prediction.
As far as I'm aware, this is not really a possibility. I think Prophet uses a variant of the BFGS optimization algorithm to maximize the posterior probability of the model, so as I see it the only way to train the model is to take into account the whole dataset you want to use. The reason transfer learning works with neural networks is that it is just a weight (parameter) initialization, and backpropagation is then run iteratively in the standard SGD training scheme. Theoretically you could initialize the parameters to the ones of the previous model in the case of Prophet, which might or might not work as expected. I'm not aware that anything of the sort is currently implemented, however (but since it's open source you could give it a shot, hopefully reducing convergence times quite a bit).
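As a rough sketch of that parameter-initialization idea (later Prophet versions document a similar "warm-starting" trick), one could pass the previous fit's parameters as the initialization of the new fit. This assumes the Stan parameter names Prophet uses internally (k, m, sigma_obs, delta, beta) and that the number of changepoints stays the same between fits; huge_df and live_df are the dataframes from the question.

import pandas as pd
from prophet import Prophet  # older versions: from fbprophet import Prophet

def warm_start_params(m):
    """Collect the fitted parameters of a trained Prophet model."""
    res = {}
    for pname in ["k", "m", "sigma_obs"]:
        res[pname] = m.params[pname][0][0]
    for pname in ["delta", "beta"]:
        res[pname] = m.params[pname][0]
    return res

# huge_df / live_df are the (ds, y) dataframes from the question
m1 = Prophet().fit(huge_df)                       # slow initial fit
m2 = Prophet().fit(pd.concat([huge_df, live_df]),
                   init=warm_start_params(m1))    # usually converges much faster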
Now, as far as practical advice goes: you probably don't need all the data, so just tail it to what you really need for the problem at hand. For instance, it does not make sense to keep 10 years of data if you only have monthly seasonality. Also, depending on how strongly your data is autocorrelated, you may be able to downsample a bit without losing any predictive power. Another idea would be to try an algorithm that is suitable for online (or mini-batch) learning - you could, for instance, try a CNN with dilated convolutions.
Time series problems are quite different from usual machine learning problems. When we are training a cat/dog classifier, the feature distributions of cats and dogs are not going to change suddenly (evolution is slow). But when it comes to time series problems, training should happen every time prior to forecasting. This becomes even more important when you are doing univariate forecasting (as in your case), since the only features we provide to the model are the past values, and these values change at every instant. Because of these concerns, I don't think something like transfer learning will work for time series.
Instead, what you can do is try converting your time series problem into a regression problem by using a rolling-window approach (a sketch follows below). Then you can save that model and get your predictions. But make sure to retrain it at short intervals, like once a day or so, depending on how frequently you need a forecast.
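A minimal sketch of that rolling-window reframing, assuming an evenly spaced univariate series; the lag count, model choice and column names are placeholders.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

N_LAGS = 12                            # length of the rolling window
df = pd.DataFrame({"y": range(100)})   # placeholder series

# lag features: y(t-1), ..., y(t-N_LAGS)
lag_cols = [f"lag_{i}" for i in range(1, N_LAGS + 1)]
for i in range(1, N_LAGS + 1):
    df[f"lag_{i}"] = df["y"].shift(i)
df = df.dropna()

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(df[lag_cols], df["y"])       # retrain periodically, e.g. once a day

# one-step-ahead forecast from the most recent window (newest value first)
latest = pd.DataFrame([df["y"].tail(N_LAGS).to_numpy()[::-1]], columns=lag_cols)
next_value = model.predict(latest)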

Assistance regarding model choice

I'm new to and investigating machine learning. I have a use case and data, but I am unsure of a few things, mainly how my model will run and what model to start with. Details of the use case and my questions are below. Any advice is appreciated.
My Main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model to run on a continuous basis so it gives a best guess at all times, be it run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes and are scored on how well they do. After each semester, the apprentices either have a sufficient score to move on to the next semester, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and procedures are weighted by complexity.
Regarding features, we have the following:
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demographic information (nationality, age, gender)
What I am unsure of is how the model will work and when we will run it, i.e. if we run it on day one of the semester, I assume everyone will be predicted to fail, as everyone has procedure scores of 0.
The current plan is to run the model 2-3 months into each semester, so there is enough score data and also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models, like Naive Bayes, neural networks, decision trees, etc., which can be used depending on the type of data. If you are looking for an answer that suggests a particular ML model, I would need more details about the data. In general, though, this kind of model-selection flow chart can be helpful; you can also read about model selection in the 5th lecture of Andrew Ng's CS229.
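As an illustration of the kind of baseline meant here, a minimal sketch assuming the features from the question have been assembled into one row per apprentice with a pass/fail label; the column names are hypothetical and the data below is a random placeholder for that feature table.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# random placeholder standing in for the real per-apprentice feature table
rng = np.random.default_rng(0)
feature_cols = ["interview_score", "entry_exam_score",
                "n_simple", "n_intermediate", "n_advanced",
                "avg_simple", "avg_intermediate", "avg_advanced", "age"]
df = pd.DataFrame(rng.random((300, len(feature_cols))), columns=feature_cols)
df["passed_semester"] = (rng.random(300) > 0.3).astype(int)  # 1 = pass, 0 = fail

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, df[feature_cols], df["passed_semester"],
                      cv=5, scoring="roc_auc").mean())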
Now, coming back to the basic methodology: some of these features, like initial interview scores and entry exam scores, you already know in advance, whereas others, like performance in procedures, only become known over the course of the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure scores as 0, take them as the mean/median of the apprentice's past performances in other procedures.
You can even build a sub-model to analyze the relationship between procedure scores and interview scores, as they are not completely independent. (I will explain this in the later part of the answer.)
However, if it is the apprentice's very first semester, you won't yet have such data for that apprentice. In that case, you might need to consider the average performance of other apprentices with profiles similar to the subject apprentice's (a sketch is given at the end of this answer). If the dataset is not very large, a k-nearest-neighbors approach can be quite useful here; for large datasets, however, KNN suffers from the curse of dimensionality.
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
Most probably (although it's just a hypothesis), y will depend more on the initial variables than on the variables obtained later, the reason being that the later variables are not completely independent of the earlier ones.
My point is: if a model can be created to predict the outcome of a semester, then a similar model can be created to predict just the outcome of the first procedure test.
In the end, as the model might be heavily based on demographic factors and other such things, it might not be very successful. For the same reason, we cannot accurately predict election results, soccer match results, etc., as they depend heavily on real-time, dynamic data.
For dynamic predictions based on the different procedure performances, time series analysis can be somewhat helpful. But in any case, the final result will depend heavily on the apprentice's continued motivation and performance, which only becomes clear towards the end of each semester.
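Finally, a rough sketch of the k-nearest-neighbors idea mentioned earlier for first-semester apprentices: estimate the missing initial procedure score from the k most similar past apprentices. The profile features and data here are random placeholders.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# historical apprentices: profile features (e.g. interview score, exam score,
# age, ...) mapped to their observed average procedure score
profiles_hist = np.random.rand(200, 4)
avg_proc_score_hist = np.random.rand(200)

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(profiles_hist, avg_proc_score_hist)

# new apprentice with no procedure history yet: use the neighbors' average
new_profile = np.random.rand(1, 4)
estimated_initial_score = knn.predict(new_profile)  # use this instead of a hard 0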
