Below is a simplified description of the problem:
Three weeks before delivery of a product, the buyer gives an estimate of the quantity that will be delivered on a certain demand date.
This quantity might change as the delivery date comes closer (illustrated in the image below). This seems quite straightforward, but there is a high correlation between the demand weeks: e.g. if the quantity is lowered for one week, it is likely that a surrounding week will increase.
Is there an approach that will get the model to acknowledge the surrounding demand weeks?
I'm currently using random forest regression with the attributes shown in the image and the results are OK but I thought asking for inspiration here might be a good idea.
From your description I understood that you are currently using only the buyer's forecasts as input, and what you would like to do is to also consider the actual Qty of the last week(s) as an input for the next estimation. To achieve this you could create another column in your table that is the actual Qty shifted by one week. That way you get a new column "Actual Qty previous week". Then you can train your model to predict using both the buyer forecast and the actual Qty from last week. Of course you can repeat this and shift by two weeks to also make the week before that available.
In addition, you can also come up with more elaborate calculated features. One idea would be the average deviation of the buyer forecast from the final demand (where you take the average over e.g. the last 10 weeks). That way you would be able to detect that some buyers tend to overestimate and some tend to underestimate.
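For illustration, a minimal pandas sketch of both ideas, the lagged actual Qty and the rolling forecast deviation (the column names, values and the window size are assumptions, not taken from the question):

```python
import pandas as pd

# Hypothetical frame with one row per demand week; column names are made up.
df = pd.DataFrame({
    "demand_week": pd.date_range("2023-01-02", periods=8, freq="W-MON"),
    "buyer_forecast": [100, 120, 90, 110, 105, 95, 130, 115],
    "actual_qty":     [ 98, 125, 85, 112, 100, 97, 128, 118],
})

# Lag features: the actual quantity one and two weeks before the current row.
df["actual_qty_prev_week"]  = df["actual_qty"].shift(1)
df["actual_qty_2weeks_ago"] = df["actual_qty"].shift(2)

# Rolling average deviation of the buyer forecast from the actual demand,
# shifted by one week so the current week's actual is not leaked into its own features.
df["avg_forecast_error_last_4w"] = (
    (df["buyer_forecast"] - df["actual_qty"]).rolling(window=4).mean().shift(1)
)

print(df)
```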
Since you mentioned that variations of qty influence the subsequent weeks, I propose to do just that: create a new feature that shows the variation.
This implies running the predictive algorithm iteratively, one week after the other, each time adding a new feature to the dataset: the variation of the predicted total quantity for the previous weeks.
The method would go like this:
run prediction model for week1
add a feature to the dataset: variation of predicted qty for week 1
run prediction model for week2
add a feature to the dataset: variation of predicted qty for week 1 + week 2
run prediction model for week3
etc ...
This is of course only the idea. It is possible to add different kinds of features (the variation of the last week only, a moving average over the last weeks, whatever would make sense, ...).
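A minimal, self-contained sketch of this iterative loop with a random forest (all column names and numbers here are made up; the exact definition of the "variation" feature is just one possibility):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic example: one row per (buyer, week), with the buyer forecast as a feature.
df = pd.DataFrame({
    "week":           [1, 1, 2, 2, 3, 3],
    "buyer_forecast": [100, 80, 110, 75, 95, 90],
    "actual_qty":     [ 98, 85, 105, 80, None, None],   # weeks 1-2 known, week 3 to predict
})

model = RandomForestRegressor(n_estimators=100, random_state=0)
df["prev_weeks_variation"] = 0.0   # the feature that is built up iteratively

for week in [2, 3]:
    # Train on all earlier weeks with a known actual quantity.
    train = df[(df["week"] < week) & df["actual_qty"].notna()]
    features = ["buyer_forecast", "prev_weeks_variation"]
    model.fit(train[features], train["actual_qty"])

    # Predict the current week.
    current = df["week"] == week
    pred = model.predict(df.loc[current, features])

    # Variation of the predicted total qty vs. the buyer forecast for this week,
    # fed forward as a feature for all later weeks.
    variation = pred.sum() - df.loc[current, "buyer_forecast"].sum()
    df.loc[df["week"] > week, "prev_weeks_variation"] += variation
```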
I want to run a random forest classifier model. My data set is pretty big, with 1 million rows and 300 columns. Of course, I prefer not to run the model for something like 3 days non-stop, so I was wondering if there are some good practices to find the optimal trade-off between running time and prediction quality.
Here are some examples of what I was thinking:
Can I use a random subsample of x rows to tune the parameters and then use those parameters for the model with all the data? (If yes, how do I find the best value for x?)
Is there a way to know at what point it is useless to keep adding more data because the prediction will stop improving? (i.e., what is the minimum number of rows that will give me the best results for the running time)
How can I estimate the running time of the model? With 4,000 rows the model takes 4 min; with 8,000 it takes 10 min. Is the running time exponential, or is it more or less linear, so that I could expect roughly 1,280 min of running time with 1 million rows?
Tuning on a random subsample and then applying those parameters to the full data rarely works, as a small subsample may not be representative of the full data.
About the amount of the data vs the model quality: try using learning curves from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# An estimator for illustration; substitute your own classifier.
estimator = RandomForestClassifier(n_jobs=-1, random_state=0)

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator,
    X,                                     # your feature matrix
    y,                                     # your labels
    cv=5,                                  # 5-fold cross-validation
    n_jobs=-1,                             # use all cores
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% ... 100% of the training data
    return_times=True,
)
This way you'll be able to plot the amount of the data vs the model performance.
Here are some examples of plotting:
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py
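And a minimal plotting sketch using the arrays returned above (matplotlib is assumed to be installed; the axis labels are my own):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))

# Model quality as a function of the training-set size.
plt.subplot(1, 2, 1)
plt.plot(train_sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("Training set size")
plt.ylabel("Cross-validated score")

# Training time as a function of the training-set size.
plt.subplot(1, 2, 2)
plt.plot(train_sizes, fit_times.mean(axis=1), marker="o")
plt.xlabel("Training set size")
plt.ylabel("Fit time (s)")

plt.tight_layout()
plt.show()
```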
Estimating total time is difficult, because it isn't linear.
Some additional practical suggestions:
set n_jobs=-1 to run the model in parallel on all cores;
use a feature selection approach to decrease the number of features. 300 features is really a lot; it should be possible to get rid of around half of them without a serious decline in model performance (see the sketch below).
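One possible way to do that, a hedged sketch using scikit-learn's SelectFromModel on a forest's feature importances (the median threshold roughly halves the feature count; X and y are your data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Fit a (possibly subsampled) forest and keep only the features whose importance
# is above the median importance, which roughly halves the feature count.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
    threshold="median",
)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```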
I'm currently working on a small machine learning project.
The task deals with medical data of a couple of thousand patients. For each patient, 12 measurements of the same set of vital signs were taken, each one hour apart.
These measurements need not have been taken immediately after the patient entered the hospital but could start with some offset. However, the patient stays 24 h in the hospital in total, so the measurements can't start later than 11 hours after admission.
Now the task is to predict for each patient whether none, one or multiple of 10 possible tests will be ordered during the remainder of the stay, and also to predict the future mean value of some of the vital signs for the remainder of the stay.
I have a training set that comes together with the labels that I should predict.
My question is mainly about how I can process the features. I thought about turning the measurement results for a patient into one long vector and using it as a training example for a classifier.
However, I'm not quite sure how I should include the time information of each measurement in the features (should I even consider time at all?).
If I understood correctly, you want to include the time information of each measurement in the features. One way I can think of is to make a vector of length 24, since the patient stays 24 hours in the hospital, and use a one-hot representation: for example, if measurements were taken in the 12th, 15th and 20th hours of the stay, the time feature vector has a 1 at the 12th, 15th and 20th positions and zeros everywhere else. You can then append this time vector to the other features and get a single vector for each patient of length = length(other vector) + length(time vector). Or you can use different approaches to combine these features.
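A minimal NumPy sketch of this one-hot time vector (the hour indices and the other-feature values are made up for illustration):

```python
import numpy as np

STAY_HOURS = 24

def hour_one_hot(measurement_hours):
    """Return a length-24 vector with a 1 at each hour a measurement was taken."""
    v = np.zeros(STAY_HOURS)
    v[measurement_hours] = 1
    return v

# Measurements taken in the 12th, 15th and 20th hours of the stay (0-based indices).
time_vec = hour_one_hot([11, 14, 19])

# Hypothetical vector of the patient's other features (e.g. the vital-sign readings).
other_features = np.array([80.0, 82.5, 79.0])

patient_vector = np.concatenate([other_features, time_vec])
print(patient_vector.shape)   # (27,)
```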
Please let me know if you think this approach makes sense for you. Thanks.
The question: Is it normal / usual / professional to use the past of the labels as features?
I could not find anything reliable on this, although it is a basic question.
Edited: Please note, this is not a time-series question; I have deleted the time-series tag and changed the question. This question is about features that change regularly over time, yes! But we do not create a time series from this, as there are many other features as well which are not like the label and are also important features in the model. Now please think of using past labels as normal features without a time-series approach.
I try to predict a certain month of data that is available monthly, thus a time series, but I am not using it as a time series; it is just monthly available data of various different features.
It is a classification model, and now I want to predict a label column of a selected month of that time-series. The previous months before the selected label month are now the point of the question.
I do not want to drop the past months of the label just because they are "almost" a label (or in other words: they were the label columns of the preceding models in time). I know the past of the label, so why not consider it as features as well?
My predictions are of course much better when I add the past labels of the time series to the features. This is logical, as the labels usually do not change much from one month to the next and can thus be predicted very well if you have fed the model with the past of the label. It would be strange not to use such "past labels" as features, as any simple time-series regression would then be better than the ML model.
Example: Let's say I predict the IQ test result of a person, and I use her past IQ test results as features in addition to other normal "non-label" features like age, education and so on. I use the first 11 months of "past labels" of a year as features in addition to my normal "non-label" features, and I predict the label of the 12th month.
Predicting the label of the 12th month works much better if you add the past of the labels to the features - obviously. This is because the historical labels, if there are any, are of course better indicators of the final outcome than normal columns like age and education.
Possibly related p.s.:
p.s.1: In auto-regressive models, the past of the dependent variable can well be used as an independent variable, see: https://de.wikipedia.org/wiki/Regressionsanalyse
p.s.2: In ML you can perhaps just try any features and take what gives you the best results, a bit like >Good question, try them [feature selection methods] all and see what works best< in https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/ >If the features are relevant to the outcome, the model will figure out how to use them. Or most models will.< The same is said in Does the feature selection matter for learning algorithm with regularization?
p.s.3: Also probably relevant is the problem of multicollinearity: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/ though multicollinearity is said to be no issue for the prediction: >Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
It is perfectly possible and also good practice to include past label columns as features, though it depends on your question: do you want to explain the label only with other features (on purpose), or do you want to use both the other features and your past label columns to predict the next label, as a way of adding a time-series character to the model without using a time-series approach?
The sequence in time is not even important, as long as all such monthly columns are shifted in time consistently by the same amount when going over to the predicting set. The model does not care whether it is the January or the February instance of the same column type; for the model, every feature is isolated.
Example: You can perfectly well run a random forest model on various features, including past label columns that repeat the same column type again and again, only representing different months. Any month's column can be treated as an independent new feature in the ML model; the only important thing is to shift all of those monthly columns by exactly the same period to reach a consistent predicting set. In other words, you should obviously avoid replacing the January column with the March column when you go from a training set January-June to a predicting set February-July; instead you must of course replace January with February (see the sketch below).
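A minimal pandas sketch of such a consistent shift of the monthly label columns (the column names and values are assumptions, not taken from the question):

```python
import pandas as pd

# Hypothetical monthly data: one row per entity, "label_YYYYMM" columns are past labels.
df = pd.DataFrame({
    "age":          [25, 40, 33],
    "education":    [ 3,  5,  4],
    "label_202201": [ 1,  0,  1],
    "label_202202": [ 1,  0,  0],
    "label_202203": [ 0,  0,  1],
    "label_202204": [ 1,  1,  1],   # the label the model is trained on
})

# Training set: predict April from January-March plus the static features.
X_train = df[["age", "education", "label_202201", "label_202202", "label_202203"]]
y_train = df["label_202204"]

# Predicting set: shift ALL monthly columns by the same one month, so that
# February-April now play the roles that January-March played in training.
X_pred = df[["age", "education", "label_202202", "label_202203", "label_202204"]].copy()
X_pred.columns = X_train.columns   # same feature names and positions as in training
```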
Update 202301: model name is "walk-forward"
This model setup is called "walk-forward", see Why isn’t out-of-time validation more ubiquitous? --> option 3 almost at the bottom of the page.
I got this from a comment at Splitting Time Series Data into Train/Test/Validation Sets.
In the following, only training and testing sets are shown. The source writes "validation set", but it is known that these terms get mixed up all over the place, see What is the Difference Between Test and Validation Datasets?; it must be meant as the testing set in the usual understanding of the term.
Thus, with the right wording, it is a walk-forward over a training set and a testing set. This should be the best model for labels that become features in time.
validation set in a "walk-forward" model?
As you can see in the model, no validation set is needed, since the test data must be biased "forward" in time; that is the whole idea of predicting the "step forward in time", and any validation set would have to be in that same biased artificial future - which is already the past at the time of training, but the model does not know this.
The validation happens by default, without a separate dataset split, during the walk-forward, when the model learns again and again to predict the future and the resulting metrics can be compared against each other. As the model is meant to predict the time-biased future, there is no need to prove that or how the artificial future is biased and sort of "overtrained by time". It is the aim of the model to have the validation in the artificial future and to predict the real future only as a last step.
But then, why not still have a validation set on top of this, at least a small k-fold validation? It could play a role if the testing set has a few strong changes that happen in small time windows but which are still important to predict, or at least hint at, yet should not be overtrained within each training step. The validation set would hit some of these time windows and might show whether the model can handle them well enough. Any other method than k-fold would shrink the power of the model too much: the more you take away from the testing set during training, the less it can predict the future.
Wrap up:
Try it out, and in doubt, leave the validation aside and judge upon the model by checking its metrics over time, during the "walk-forward". This model is not like the others.
Thus, in the end, you can, but you do not have to, split a k-fold validation from the testing set. That would look like:
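A minimal code sketch of such a walk-forward with an optional k-fold validation at each step (the scikit-learn classes and all names here are my own assumptions, not from the original post; where exactly to carve out the validation is the design choice discussed above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def walk_forward(frames, use_kfold_validation=False):
    """frames: list of (X, y) pairs, one per month, ordered in time."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    test_scores = []
    for t in range(1, len(frames)):
        # Train on everything known so far (the expanding "past").
        X_train = np.vstack([f[0] for f in frames[:t]])
        y_train = np.concatenate([f[1] for f in frames[:t]])
        X_test, y_test = frames[t]            # the next month: the "artificial future"
        if use_kfold_validation:
            # Optional small k-fold validation on the data available at this step.
            cv_scores = cross_val_score(model, X_train, y_train, cv=3)
            print(f"step {t}: cv score = {cv_scores.mean():.3f}")
        model.fit(X_train, y_train)
        test_scores.append(model.score(X_test, y_test))
    # Metrics over time; the very last step corresponds to predicting the real future.
    return test_scores
```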
After predicting a lot of known futures, the very last step in time is then the prediction of the unknown future.
This also answers Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?.
I built this ML model in Azure ML studio with 4 features including a date column.
Trying to predict if the price is going to be higher tomorrow than it is today. Higher = 1, not higher = 0
It is a Two class neural network (with a Tune model hyperparameters).
When I test it, I expect to get an answer between 0 and 1, which I do. The problem comes when I change the feature from 1 to 0 and get almost the same answer.
I thought that if a 1 gives a score probability of 0.6,
then a 0 (with the same features) should give a score of 0.4.
A snapshot of the data (yes I know I need more)
Model is trained/tuned on the "Over5" feature, and I hope to get an answer from the Two class neural network module in the range between 0 -1.
The Score module also produces results between 0 and 1. Everything looks to be correct.
I changed the normalization method (after a recommendation from a commenter) but it does not change the output much.
Everything seems to be in order but my goal is to get a prediction of the likelihood that a day would finish "Over5" and result in a 1.
When I test the model with a "1" in the Over5 column I get a prediction of 0.55... Then I tested the model with the same settings, only changing the 1 to a 0, and I still get the same output, 0.55...
I do not understand why this is, since the model is trained/tuned on the Over5 feature. Shouldn't changing the input in that column produce different results?
Outputs of a neural network are not probabilities (generally), so that could be a reason that you're not getting the "1 - P" result you're looking for.
Now, if it's simple logistic regression, you'd get probabilities as output, but I'm assuming what you said is true and you're using a super-simple neural network.
Also, what you may be changing is the bias "feature", which could also lead to the model giving you the same result after training. Honestly there's too little information in this post to say for certain what's going on. I'd advise you try normalizing your features and trying again.
EDIT: Do you know if your neural network actually has 2 output nodes, or if it's just one output node? If there are two, then the raw output doesn't matter quite as much as which node had the higher output. If it's just one, I'd look into thresholding it somewhere (like >0.5 means the price will rise, but <=0.5 means the price will fall, or however you want to threshold it.) Some systems used in applications where false positives are more acceptable than false negatives threshold at much lower values, like 0.2.
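For the single-output case, a minimal thresholding sketch (the function name and the example values are made up for illustration):

```python
# Threshold a single raw network output into a class decision.
def classify(raw_output, threshold=0.5):
    """Return 1 ("price will rise") if the output exceeds the threshold, else 0."""
    return int(raw_output > threshold)

print(classify(0.55))                 # 1 with the default 0.5 threshold
print(classify(0.55, threshold=0.2))  # 1 with a more false-positive-tolerant threshold
print(classify(0.15, threshold=0.2))  # 0
```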
I'm trying to train a language model with LSTM based on Penn Treebank (PTB) corpus.
I was thinking that I should simply train with every bigram in the corpus so that it could predict the next word given the previous word, but then it wouldn't be able to predict the next word based on multiple preceding words.
So what exactly is it to train a language model?
In my current implementation, I have batch size = 20 and the vocabulary size is 10,000, so I have 20 resulting matrices of 10k entries (parameters?) and the loss is calculated by comparison with 20 ground-truth matrices of 10k entries, where only the index of the actual next word is 1 and all other entries are zero. Is this a right implementation? I'm getting a perplexity of around 2 that hardly changes over iterations, which is definitely not in the right range of what it usually is, say around 100.
So what exactly is it to train a language model?
I think you don't need to train with every bigram in the corpus. Just use a sequence to sequence model, and when you predict the next word given previous words you just choose the one with the highest probability.
so I have 20 resulting matrices of 10k entries (parameters?)
Yes, per step of decoding.
Is this a right implementation? I'm getting perplexity of around 2 that hardly changes over iterations, which is definitely not in a right range of what it usually is, say around 100.
You can first read some open-source code as a reference, for instance word-rnn-tensorflow and char-rnn-tensorflow. Note that the number you are looking at is most likely the per-word cross-entropy loss rather than the perplexity itself: for a model that is not trained at all and selects words uniformly at random from a 10,000-word vocabulary, this loss is -log(1/10000), which is around 9 per word. As the model is trained the loss decreases, so 2 is reasonable. I think the 100 in your statement may be the loss per sentence rather than per word.
For example, if tf.contrib.seq2seq.sequence_loss is employed to calculate this loss, the result will be less than 10 if you keep both average_across_timesteps and average_across_batch at their default value of True, but if you set average_across_timesteps to False and the average length of the sequences is about 10, it will be about 100.
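A small numeric sketch of how the per-word loss, the per-sequence loss, and the actual perplexity relate (the sequence length of 10 is taken from the answer above):

```python
import numpy as np

vocab_size = 10000
seq_len = 10   # assumed average sequence length

# Per-word cross-entropy of an untrained model that picks words uniformly at random.
per_word_loss = -np.log(1.0 / vocab_size)
print(per_word_loss)              # ~9.21

# The true perplexity is exp(per-word cross-entropy): 10,000 for the uniform model.
print(np.exp(per_word_loss))      # 10000.0

# Summing instead of averaging over timesteps scales the number by the sequence length,
# which is where a value on the order of 100 per sequence comes from.
print(per_word_loss * seq_len)    # ~92.1
```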