How to build a linear regression model for daily predictions - machine-learning

I need to create a model that predicts the quantity of an item per day.
This is how my data looks in the DB:
item id | date       | quantity
1000    | 2020-02-03 | 5
What I did was convert the date into:
year number
week number within the year
weekday number
I trained this model on a dataset of 100,000 items with RegressionFastForest, RegressionFastTree, LbfgsPoissonRegression, and FastTreeTweedie,
but the results are not so good (an RMSE of 3.5-4).
Am I doing this wrong?
I am using ML.NET, if it matters.
Thanks

There are several techniques for time series forecasting, but the main point is this: we don't look for a dependence of the value on the date. Instead, we look for a dependence of value[i] on value[i-1].
The most common techniques are the ARIMA family of models and recurrent neural networks, and I would recommend reading about them. But if you don't have much time, there is something that can help: auto-ARIMA models.
Implementations of auto-ARIMA exist in at least Python and R. Here's the Python version (note: the pyramid package has since been renamed pmdarima):
from pmdarima.arima import auto_arima  # pip install pmdarima (formerly "pyramid-arima")
model = auto_arima(y)
where y is your time series.
P.S. Even though it is called an auto model (meaning the algorithm chooses the best hyperparameters by itself), you should still understand what the orders p, d, q and the seasonal P, D, Q, S mean.
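A slightly fuller sketch for the original daily-quantity question, under my own assumptions (pmdarima, weekly seasonality with m=7, one model per item); none of this comes from the answer itself:

import pandas as pd
from pmdarima.arima import auto_arima

# toy stand-in for one item's daily quantity series
y = pd.Series([5, 7, 6, 8, 5, 9, 4] * 20)
# m=7 assumes daily data with weekly seasonality
model = auto_arima(y, seasonal=True, m=7, suppress_warnings=True)
print(model.predict(n_periods=14))  # forecast the next two weeks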

There are several problems with directly applying linear regression to your data.
1) If item id is an index of sorts and does not reflect physical properties of the item, then it is a categorical feature. Use one-hot encoding to replace it with regression-friendly labels.
2) If you assume your data may have a cyclical dependence on the time of day/week/month, use the sin and cos of those features (see the sketch below). This will not work for the year, as it is not periodic. There are good guides with examples in Python.
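A minimal sketch of both fixes (pandas here is my assumption; the question uses ML.NET, which has its own one-hot encoding transform):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item_id": [1000, 1001],
    "date": pd.to_datetime(["2020-02-03", "2020-02-04"]),
})
# 1) item id is categorical: replace it with one-hot columns
df = pd.get_dummies(df, columns=["item_id"])
# 2) map cyclical time features onto the unit circle, so weekday 6 and
#    weekday 0 end up close together instead of six apart
dow = df["date"].dt.dayofweek
df["dow_sin"] = np.sin(2 * np.pi * dow / 7)
df["dow_cos"] = np.cos(2 * np.pi * dow / 7)
doy = df["date"].dt.dayofyear
df["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
df["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
# the year itself is not periodic, so leave it as a plain number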
Good luck!
P.S. I usually use logistic regression on sparse representations of categorical features (one-hot encoding) as a benchmark. It will not be as good as a state-of-the-art NN solution, but it gives me a clue what the baseline looks like.

Related

Forecasting Value in time series data with multiple independent variables

I have a data set whose attributes are (Date, Value, Variable-1, Variable-2, Variable-3, Variable-4, Variable-5), with 100k+ rows. I want to predict the future "Value" based on the 5 variables, trained in a time series manner; there will be seasonal trends and low and high scores in "Value". Can someone suggest a statistical or machine learning/deep learning solution for this?
(A dataset screenshot was attached; the goal is to forecast the Value variable.)
This is a very interesting problem, and you can use the vector autoregression (VAR) method to solve it. Packages are available in both R and Python.
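A minimal sketch with statsmodels (the file name and column setup are my assumptions):

import pandas as pd
from statsmodels.tsa.api import VAR

# df: rows indexed by Date, numeric columns Value and Variable-1..Variable-5
df = pd.read_csv("data.csv", parse_dates=["Date"], index_col="Date")
model = VAR(df)
results = model.fit(maxlags=14, ic="aic")  # choose the lag order by AIC
# forecast the next 30 steps from the last k_ar observed rows
forecast = results.forecast(df.values[-results.k_ar:], steps=30)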

Predict long jump results: is this a time series forecasting problem or a regression problem?

Here is my data (simplified):
Athlete  Age   Competition  Result (m)
--------------------------------------
Alex     10.2  CompA        3.2
Alex     11.5  CompB        4.3
...
Bob      9.9   CompC        3.5
Bob      10.7  CompD        5.6
...
Dave     10.3  CompB        5.2
Dave     11.6  CompD        6.3
...
So my data covers a set of children at different ages (8-28) and their long jump results in different competitions.
What I want to know:
Given a new child, Paul, if we know his history (ages 8-16, for example), how do we forecast his future results (say at ages 18, 20, 24)?
If we can group jumpers into A-E based on their best results, how do we predict which group Paul will be in in the future (say when he is 18)?
I recently learned a bit about machine learning and deep learning, and I know this is a problem that can be solved using those models, but I'm confused about which models I'm supposed to use.
Am I supposed to do the forecasting for Paul (the new child) based ONLY on Paul's history, or should I also use others' data, like Alex's, Bob's, and Dave's?
Is this a time series forecasting problem, where I'm supposed to use models like ARIMA, ARCH, or LSTM (RNN)?
Or is this a "normal" supervised or unsupervised regression or classification problem, where I'm supposed to use textbook models like linear regression, logistic regression, KNN, NB, DT, SVM, random forest, ANN, DNN, CNN?
Any direction will be greatly appreciated.
The answer is both. Regression here just means the model outputs an unbounded continuous value (e.g., no sigmoid activation on the output layer). So you could use a time series model like an LSTM or GRU (though such a complex model may overfit) and use it to perform regression. This way, the model will learn how other children perform, then use Paul's data to predict how well he will perform. This is not a classification problem! You are predicting a continuous value, not classes, which means it has to be regression.
I would suggest reading books or taking tutorials; I love Deep Learning with Python.
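To make the point about output layers concrete, here is a minimal sketch (Keras is my assumption, not something the answer specifies):

from tensorflow.keras import layers, models

# sequence of past results in, one continuous prediction out
model = models.Sequential([
    layers.LSTM(16, input_shape=(None, 1)),
    layers.Dense(1),  # linear activation: this is what makes it regression
])
model.compile(optimizer="adam", loss="mse")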
The problem you're trying to solve is usually called panel (or supervised) forecasting.
Whether or not to use data from other children is a practical question. You can compare models that use the data against models that use only Paul's data.
There is no need to use deep learning, but of course you can try. Other standard machine learning algorithms (random forests, etc.) or statistical forecasting algorithms (ARIMA, etc.) can also be adapted to solve this kind of problem.
There are a few libraries that solve this problem off the shelf. One is pysf, with a tutorial on weather data (https://github.com/alan-turing-institute/pysf/blob/master/examples/Walkthrough.ipynb); another is gluon-ts (mostly deep learning methods).
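To make the pooled approach concrete, a minimal sketch with a standard regressor (scikit-learn, the toy data, and the best_so_far feature are all my assumptions):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# one row per (athlete, age) with the long jump result
df = pd.DataFrame({
    "athlete": ["Alex", "Alex", "Bob", "Bob", "Dave", "Dave"],
    "age":     [10.2, 11.5, 9.9, 10.7, 10.3, 11.6],
    "result":  [3.2, 4.3, 3.5, 5.6, 5.2, 6.3],
}).sort_values(["athlete", "age"])
# feature: each athlete's best result before the current competition
df["best_so_far"] = (df.groupby("athlete")["result"]
                       .transform(lambda s: s.cummax().shift(1)))
train = df.dropna(subset=["best_so_far"])
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[["age", "best_so_far"]], train["result"])
# predict Paul at 18, given his best result up to age 16 was, say, 4.8 m
paul = pd.DataFrame({"age": [18.0], "best_so_far": [4.8]})
print(model.predict(paul))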

Can we use logistic regression to predict a numerical (continuous) variable, i.e. the revenue of a restaurant?

I have been given a task to predict the revenue of a restaurant based on some variables. Can I use logistic regression to predict the revenue data?
The dataset is from Kaggle's Restaurant Revenue Prediction project.
P.S.: I have been told to use logistic regression; I know it's not the correct algorithm for this problem.
Yes, you can!
Prediction using logistic regression can be done with numerical variables as inputs. The data you have right now contains all the independent variables, and the outcome will be a dichotomous dependent variable (TRUE/1 or FALSE/0).
You can then use the log odds ratio to obtain a probability (range 0-1).
For a reference, you can have a look at this.
-------------------UPDATE-------------------------------
Let me give you an example from my work last year: we had to predict whether a student would qualify in campus placement, given three years of historical test results and final success/failure. (NOTE: this is dichotomous; more on that later.)
The sample data was each student's academic marks and scores in an aptitude test held at the college, plus their status as placed or not.
But in your case, you have to predict the revenue (which is non-dichotomous). So what to do? It seems my case was simpler, right?
Nope!
We were not asked just to predict whether a student would qualify or not; we had to predict each individual student's chances of getting placed, which is not at all dichotomous. Looks like your scenario, right?
So, what you can do is first decide, given your input variables, what final output variable will help in the revenue calculation.
For example: use the data to find out whether the restaurant will make a profit or a loss, then relate that with some algorithm to approximate the revenue.
I'm not sure whether algorithms identical to your need already exist, but I'm sure you can do much better by putting more effort into research and analysis on this topic.
TIP: never ask "will logistic regression ALONE solve my problem?" Instead ask "what can logistic regression do better when combined with some other technique?"
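To make that two-stage idea concrete, a minimal sketch (the synthetic data and every name here are illustrative assumptions, not the Kaggle dataset):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                 # stand-in restaurant features
y_revenue = np.exp(X[:, 0]) + rng.normal(scale=0.1, size=500)  # stand-in revenue

# step 1: discretize revenue into quartile classes so a classifier applies
bins = np.quantile(y_revenue, [0.25, 0.5, 0.75])
y_class = np.digitize(y_revenue, bins)        # labels 0..3, low to high

clf = LogisticRegression(max_iter=1000).fit(X, y_class)

# step 2: crude revenue estimate as a probability-weighted class average
class_means = np.array([y_revenue[y_class == k].mean() for k in range(4)])
print(clf.predict_proba(X[:5]) @ class_means)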

Model that predicts both categorical and numerical output

I am building an RNN for a time series model which has a categorical output.
For example, if the previous pattern is "A", "B", "A", "B", the model predicts that the next is "A".
There is also a numerical level associated with each category.
For example, A is 100 and B is 50,
so the series is A(100), B(50), A(100), B(50).
I have the model framework to predict that the next is "A"; it would be nice to predict the (100) at the same time.
For a real-life example, take national weather data.
You are predicting the next few days' weather type (sunny, windy, raining, etc.); at the same time, it would be nice if the model also predicted the temperature.
Or, for Amazon, analyzing a customer's transaction patterns:
Customer A shopped the categories
electronics ($100), household ($10), ...
Predict which category this customer's next transaction is likely to be in, and at the same time predict the amount of that transaction.
I researched a bit and have not found any relevant research on similar topics.
What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
You will need to normalise your output data, though. Categories should be one-hot encoded, and numerical values should be normalised by dividing by some maximal value.
"I researched a bit and have not found any relevant research on similar topics."
That's because this is not really a "topic". It is completely normal, and it does not require a special kind of network.
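A minimal sketch of the two-headed setup (Keras is my assumption; the sizes and names are illustrative):

from tensorflow.keras import layers, Model

# input: a window of the last 4 one-hot-encoded categories (2 categories here)
inp = layers.Input(shape=(4, 2))
h = layers.LSTM(32)(inp)
# two heads side by side: one categorical, one numerical
cat_out = layers.Dense(2, activation="softmax", name="category")(h)
num_out = layers.Dense(1, name="level")(h)  # level pre-scaled to roughly [0, 1]
model = Model(inp, [cat_out, num_out])
model.compile(optimizer="adam",
              loss={"category": "categorical_crossentropy", "level": "mse"})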

Ordinal classification packages and algorithms

I'm attempting to make a classifier that chooses a rating (1-5) for an item i. For each item i, I have a vector x containing about 40 different quantities pertaining to i. I also have a gold-standard rating for each item. Based on some function of x, I want to train a classifier to give me a rating 1-5 that closely matches the gold standard.
Most of the information I've seen on classifiers deals with just binary decisions, while I have a rating decision. Are there common techniques or code libraries out there to deal with this sort of problem?
I agree with you that ML problems in which the response variable is on an ordinal scale require special handling. "Machine mode" (i.e., returning a class label) seems insufficient because the class labels ignore the ordering relationship among the labels ("1st, 2nd, 3rd"); likewise, "regression mode" (i.e., treating the ordinal labels as floats, {1, 2, 3}) is insufficient because it imposes a metric distance between the response values that the labels don't actually carry (e.g., the gap between "3" and "2" need not equal 1).
R has (at least) several packages directed at ordinal regression. One of these is actually called ordinal, but I haven't used it. I have used the Design package in R for ordinal regression and I can certainly recommend it. Design contains a complete set of functions for the solution, diagnostics, testing, and presentation of ordinal regression problems via the ordinal logistic model. Both packages are available from CRAN. A step-by-step solution of an ordinal regression problem using the Design package is presented on the UCLA stats site.
Also, I recently looked at a paper by a group at Yahoo working on ordinal classification using support vector machines. I have not yet attempted to apply their technique.
Have you tried using Weka? It supports binary, numerical, and nominal attributes out of the box, and the latter two might work well enough for your purposes.
Furthermore, it looks like one of the available classifiers is a meta-classifier called OrdinalClassClassifier.java, which is the result of this research:
Eibe Frank and Mark Hall, "A simple approach to ordinal classification". In Proceedings of the 12th European Conference on Machine Learning, 2001, pp. 145-156.
If you don't need a pre-made approach, then these references (in addition to doug's note about the Yahoo SVM paper) might be useful:
W. Chu and Z. Ghahramani, "Gaussian processes for ordinal regression". Journal of Machine Learning Research, 2006.
Wei Chu and S. Sathiya Keerthi, "New approaches to support vector ordinal regression". In Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 145-152.
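To give a sense of how simple the Frank & Hall approach is, here is a minimal sketch in Python (scikit-learn is my assumption; the paper predates it). It turns K ordered classes into K-1 binary "is y > k?" problems and recombines the probabilities:

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class SimpleOrdinalClassifier:
    """Frank & Hall: K ordered classes -> K-1 binary 'is y > k?' problems."""

    def __init__(self, base=None):
        self.base = base if base is not None else LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        # y: 1-D numpy array of ordinal labels, e.g. ratings 1..5
        self.classes_ = np.sort(np.unique(y))
        # one binary classifier per threshold between adjacent classes
        self.models_ = [clone(self.base).fit(X, (y > k).astype(int))
                        for k in self.classes_[:-1]]
        return self

    def predict(self, X):
        # P(y > k) for each threshold, one column per threshold
        gt = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        n = gt.shape[0]
        # P(y = k) = P(y > k-1) - P(y > k), with boundary terms 1 and 0
        probs = (np.hstack([np.ones((n, 1)), gt])
                 - np.hstack([gt, np.zeros((n, 1))]))
        return self.classes_[np.argmax(probs, axis=1)]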
The problems that doug has raised are all valid. Let me add another one: you didn't say how you would like to measure the agreement between the classification and the gold standard. You should formulate the answer to that question as soon as possible, as it will have a huge impact on your next steps. In my experience, the most problematic part of most optimization tasks is the score function. Ask yourself: are all errors equal? Does misclassifying a "3" as a "4" have the same impact as classifying a "4" as a "3"? What about "1" vs. "5"? Can mistakenly missing one case have disastrous consequences (a missed HIV diagnosis, activating pilot ejection in a plane)?
The simplest way to measure agreement between categorical classifiers is Cohen's kappa; more sophisticated agreement measures are described in the literature.
Having said that, sometimes picking a solution that "just works" instead of "the right one" is faster and easier. If I were you, I would pick a machine learning library (R, Weka; I personally love Orange) and see what I get. Only if you don't get reasonably good results with that should you look for more complex solutions.
If you're not interested in fancy statistics, a one-hidden-layer backpropagation neural network with 3 or 5 output nodes will probably do the trick if the training data is sufficiently large. Most NN classifiers try to minimize the mean squared error, which is not always desired. The support vector machines mentioned earlier are a good alternative.
FANN is a good library for backpropagation NNs; it also has some tools to assist in training the network.
There are two packages in R that might help tame ordinal data:
ordinalForest on CRAN
rpartScore on CRAN
I'm working on an OrdinalClassifier that is based on the sklearn framework (specifically the OVR multiclass classifier) and works well with sklearn workflows such as pipelines, cross-validation, and scoring.
Through testing, I'm finding that it performs very well vs. standard non-ordinal multiclass classification using SVC, and it gives much greater control over optimizing for precision and recall on the positive class. (In my testing, I used sklearn's diabetes dataset and transformed the disease-progression target (y) into low, medium, and high class labels.) Testing via cross-validation is in my repo along with attribution. Scoring is based on weighted F1.
https://github.com/leeprevost/OrdinalClassifier
