I have a dataset with the attributes (Date, Value, Variable-1, Variable-2, Variable-3, Variable-4, Variable-5) and 100k+ rows. I want to predict the future "Value" based on the 5 variables, trained in a time series manner; there will be seasonal trends and low and high scores in "Value". Can someone suggest a statistical or machine learning/deep learning solution for this?
Here is a screenshot of the dataset; I want to forecast the Value variable.
This is a very interesting problem, and you can use the "Vector autoregression (VAR)" method to solve it. Packages are available in both R and Python.
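A minimal sketch with Python's statsmodels (the file name and column layout are assumptions, not from the original post):

import pandas as pd
from statsmodels.tsa.api import VAR

# Assumed CSV with columns Date, Value, Variable-1 ... Variable-5
df = pd.read_csv("data.csv", parse_dates=["Date"], index_col="Date")

model = VAR(df)                            # all columns enter the VAR jointly
results = model.fit(maxlags=15, ic="aic")  # pick the lag order by AIC
forecast = results.forecast(df.values[-results.k_ar:], steps=30)  # 30 steps ahead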
Can we predict the growth percentage in sales of an item, given the change in discount (a positive or negative number) from the previous year as a predictor variable? There seems to be no correlation between these. How can this problem be solved using machine learning?
You are on the wrong track with this question.
Correlation belongs to the statistics side. Please check Pearson's correlation coefficient / Spearman's correlation coefficient to find the correlation between the discount changes and the sales growth.
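A minimal sketch with scipy (the numbers are made up for illustration):

from scipy.stats import pearsonr, spearmanr

# Hypothetical year-over-year discount changes and sales growth (%)
discount_change = [5, -3, 10, 2, -8, 6]
sales_growth = [2.1, -1.0, 4.5, 0.8, -3.2, 2.9]

r, p = pearsonr(discount_change, sales_growth)
rho, p_s = spearmanr(discount_change, sales_growth)
print(f"Pearson r={r:.2f} (p={p:.2f}), Spearman rho={rho:.2f} (p={p_s:.2f})")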
In machine learning, we seldom compare two percentage series; instead, we compare the actual sales/discount values. A simple ML approach is linear regression (most ML is used in multiple dimensions, while your case is one-x one-y data: a single input column to a single output). Please refer to related information online and solve it with Excel or Python code.
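For instance, a one-feature linear regression in scikit-learn might look like this (the values are invented):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical actual discount amounts (x) and sales values (y)
discount = np.array([5, 10, 15, 20, 25]).reshape(-1, 1)
sales = np.array([120, 135, 160, 170, 190])

model = LinearRegression().fit(discount, sales)
print(model.coef_, model.intercept_)
print(model.predict([[30]]))  # predicted sales at a 30-unit discount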
I am new to machine learning and am therefore trying to figure out whether my dataset is large enough to train an LSTM model.
I am trying to do time series forecasting on daily road traffic data. Currently, I have daily data (2012-2019) for 20 different locations. Essentially, I have only ~2800 data points for each location. Is that a good dataset to start with?
Any recommendations on how I can tweak or transform the data to help with my dataset?
Please help! Thank you!!
Consider your dataset as roughly 2800 x 20 examples. You can always run an LSTM/RNN model on this much data, but you should check whether it outperforms baseline models like the Autoregressive Moving Average (ARMA) and the Autoregressive Integrated Moving Average (ARIMA).
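A minimal ARIMA baseline sketch with statsmodels (the series and the (p, d, q) order are placeholders):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.cumsum(np.random.randn(2800))  # stand-in for one location's daily data
baseline = ARIMA(series, order=(2, 1, 2)).fit()
print(baseline.aic)                    # compare against the LSTM on a held-out tail
forecast = baseline.forecast(steps=7)  # one-week-ahead baseline forecast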
Also, if the data is in the format:
Example_1: Day_1: x, Day_2: y, ..., Day_n: z, etc.
then rather than inputting the whole Day_1 ... Day_n feature set to predict Day_n+1, you can always increase your dataset by using Day_1 to predict Day_2, and so on, as in the sketch below.
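A minimal sliding-window sketch in Python (the window length and random series are assumptions):

import numpy as np

def make_windows(series, window):
    # Each window-long slice becomes one example; the next value is its target.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

series = np.random.rand(2800)           # stand-in for one location's daily data
X, y = make_windows(series, window=30)
print(X.shape, y.shape)                 # (2770, 30) (2770,)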
Check this LINK; it is something I worked on which might help.
I need to create a prediction model that predicts the quantity of an item per day.
This is how my data looks in the DB:
item id | date       | quantity
1000    | 2020-02-03 | 5
What I did is convert the date to:
year number
number of the week in the year
weekday number
I trained this model on a dataset of 100,000 items with RegressionFastForest, RegressionFastTree, LbfgsPoissonRegression, and FastTreeTweedie,
but the results are not so good (RMSE score of 3.5 to 4).
Am I doing this wrong?
I am using ML.NET, if it matters.
Thanks.
There are several techniques for time series forecasting. But the main point is: we don't seek a dependency of the value on the date. Instead, we seek the dependence of value[i] on value[i-1].
The most common techniques are the family of ARIMA models and recurrent neural networks. I would recommend reading about them. But if you don't have much time, there is something that can help: auto-ARIMA models.
Implementations of auto-ARIMA exist at least in Python and R. Here's the Python version (the package was originally called pyramid-arima and has since been renamed to pmdarima):

from pmdarima import auto_arima  # formerly: from pyramid.arima import auto_arima

model = auto_arima(y)
forecast = model.predict(n_periods=30)  # forecast 30 steps ahead

where y is your time series.
P.S. Even though it is called an auto model (meaning the algorithm will choose the best hyperparameters by itself), you should still understand what p, q, P, Q, and S mean.
There are several problems with directly applying linear regression to your data.
1) If item id is an index of sorts and does not reflect physical properties of the item, then it is a categorical feature. Use one-hot encoding to replace it with regression-friendly labels.
2) If you assume that your data may have a cyclical dependence on the time of the day/week/month, use the sin and cos of those time components, as in the sketch below. This will not work with the year, as it is not periodic. Here is a good guide with examples in Python.
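A minimal sketch of the cyclical encoding (the frame and date range are invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2020-02-03", periods=14, freq="D")})

# Encode day-of-week on a circle so Sunday (6) sits next to Monday (0)
dow = df["date"].dt.dayofweek
df["dow_sin"] = np.sin(2 * np.pi * dow / 7)
df["dow_cos"] = np.cos(2 * np.pi * dow / 7)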
Good luck!
P.S. I usually use LogisticRegression on sparse representations of categorical features (one-hot encoding) as a benchmark. It will not be as good as a state-of-the-art NN solution, but it gives me a clue what the benchmark looks like.
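A minimal sketch of such a benchmark pipeline in scikit-learn (the ids and labels are made up):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature (item ids) and a binary target
X = [["1000"], ["1001"], ["1002"], ["1000"], ["1001"], ["1002"]]
y = [0, 1, 0, 0, 1, 1]

# One-hot encoding yields a sparse matrix that the linear model consumes directly
bench = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LogisticRegression())
bench.fit(X, y)
print(bench.predict([["1001"]]))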
I'm new to KNIME and trying to use ARIMA for extrapolation of my time series data, but I've failed to make the ARIMA Predictor do its work.
The input data are in the following format:
year,cv_diff
2011,-4799.099999999977
2012,60653.5
2013,64547.5
2014,60420.79999999993
And I would like to predict values, for example, for the years 2015 and 2016.
I'm using the String to Date/Time node to convert the year to a date. In the ARIMA Learner I can choose only the cv_diff field. And this is the first question: for the option 'Column containing univariate time series', should I set the year column or the variable that I'm going to predict? In my case I have only one option, the cv_diff variable. After that I connect the Learner's output with the ARIMA Predictor's input and execute. Execution fails with 'ERROR ARIMA Predictor 2:3 Execute failed: The column with the defined time series was not found. Please configure the node anew.'
Help me understand which variable I should set for the Learner and the Predictor. Should it be a non-time-series variable? And how will the ARIMA nodes then understand which column to use as the time series?
You should set cv_diff as the time series variable and connect the input to the Predictor too. (And do not try to set too large values for the parameters: with so few data points, learning will not work.)
Finally, I've figured it out. The option 'Column containing univariate time series' for the ARIMA Learner node seems a little confusing, especially for those unfamiliar with time series analysis. I shouldn't have provided any time field explicitly, because ARIMA treats the variable on which it is going to make predictions as collected at equal time intervals, and it doesn't matter what kind of intervals they are.
I've found a good explanation of what 'univariate time series' means:
The term "univariate time series" refers to a time series that consists of single (scalar) observations recorded sequentially over equal time increments. Some examples are monthly CO2 concentrations and southern oscillations to predict El Niño effects.
Although a univariate time series data set is usually given as a single column of numbers, time is in fact an implicit variable in the time series. If the data are equi-spaced, the time variable, or index, does not need to be explicitly given. The time variable may sometimes be explicitly used for plotting the series. However, it is not used in the time series model itself.
So, I should choose the cv_diff variable for both the Learner and the Predictor and not provide any timestamps or other time-related columns.
One more thing that I didn't understand at first: I should train on one series of data and then provide another series for which I want predictions. That is a little different from other machine learning workflows, where you only need to provide new data and there is no notion of a series at all.
For a time series dataset, I would like to do some analysis and create a prediction model. Usually, we would split the data (by random sampling throughout the entire dataset) into a training set and a testing set, use the training set with the randomForest function, and keep the testing part to check the behaviour of the model.
However, I have been told that it is not possible to split data by random sampling for time series data.
I would appreciate it if someone could explain how to split data into training and testing sets for time series data, or whether there is any alternative way to do time series random forests.
Regards
We live in a world where "future-to-past causality" only occurs in cool sci-fi movies. Thus, when modeling time series we like to avoid explaining past events with future events. Also, we like to verify that our models, strictly trained on past events, can explain future events.
To model a time series T with RF, rolling is used: for day t, the value T[t] is the target, and the values T[t-k], where k = {1, 2, ..., h} and h is the past horizon, form the features. For a nonstationary time series, T is converted to, e.g., the relative change Trel[t] = (T[t+1] - T[t]) / T[t], as in the sketch below.
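A minimal rolling-features sketch with scikit-learn (the series, horizon h, and forest size are assumptions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rolling_features(T, h):
    # Features: the h most recent values; target: the next value.
    X = np.array([T[t - h:t] for t in range(h, len(T))])
    y = T[h:]
    return X, y

T = 100 + np.cumsum(np.random.randn(500))  # stand-in nonstationary series
Trel = np.diff(T) / T[:-1]                 # relative change, as described above
X, y = rolling_features(Trel, h=10)

rf = RandomForestRegressor(n_estimators=300, oob_score=True).fit(X, y)
print(rf.oob_score_)                       # the out-of-bag measure discussed below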
To evaluate performance, I advise checking the out-of-bag cross-validation measure of RF. Be aware that there are some pitfalls that may render this measure over-optimistic:
Unknown future-to-past contamination: the rolling is somehow faulty, and the model uses future events to explain the same future within the training set.
Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
Possibly other mistakes I don't know of yet.
In the end, everyone can make the above mistakes in some latent way. To check that this is not happening, you need to validate your model with backtesting, where each day is forecasted by a model strictly trained on past events only.
When OOB-CV and backtesting wildly disagree, this may be a hint of some bug in the code.
To backtest, do rolling on T[t-traindays to t-1]. Model this training data and forecast T[t]. Then increase t by one (t++) and repeat.
To speed things up, you may train your model only once, or at every n'th increment of t, as in the sketch below.
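A minimal walk-forward backtest sketch, reusing X and y from the rolling_features helper above (traindays and the retraining step are assumptions):

from sklearn.ensemble import RandomForestRegressor

traindays, step = 250, 5
preds, actuals = [], []
for t in range(traindays, len(y)):
    if (t - traindays) % step == 0:          # retrain only every n'th increment of t
        rf = RandomForestRegressor(n_estimators=300)
        rf.fit(X[t - traindays:t], y[t - traindays:t])  # strictly past data
    preds.append(rf.predict(X[t:t + 1])[0])  # forecast the next day
    actuals.append(y[t])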
Reading the sales file (slice comes from dplyr):
library(dplyr)
Sales <- read.csv("Sales.csv")
Finding the lengths of the training and testing sets:
train_len <- round(nrow(Sales) * 0.8)
test_len <- nrow(Sales)
Splitting your data into training and testing sets; here I have used an 80-20 split, which you can change. Make sure your data is sorted in ascending order of date first.
Training set:
training <- slice(Sales, 1:train_len)
Testing set (note the parentheses: in R, the colon operator binds tighter than +):
testing <- slice(Sales, (train_len + 1):test_len)