interpretable long-term monthly forecast with little data

I have a rather challenging problem to tackle and would like to know how the community would approach it.
I have 3 years of monthly data in which the variable I want to predict doesn't seem to behave like a time series, judging by its lack of autocorrelation. I would like to train a model that uses other time series (x1, x2, ...) to predict y with a 12-month horizon. I also need the model to be interpretable so I can better understand its predictions.
My initial thoughts are to use a simple model (e.g., linear regression, decision trees) with a direct multi-step approach. Moreover, instead of creating lagged versions of the features, I'll create only one aggregation per feature for better interpretability. This way I have just one feature that gives me information about x1, another about x2, and so on.
What do you think about these ideas? How would you set up the cross-validation strategy?
Given all these constraints I'm not expecting a super model, just one that's useful. What I'm looking for:
Validation or rejection of my thoughts, along with reasons.
A good approach to tackle the problem.
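For concreteness, here is a minimal sketch of the setup I have in mind, assuming scikit-learn and pandas (the column names and the trailing 3-month-mean aggregation are just placeholders):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("monthly.csv")   # hypothetical file with columns y, x1, x2; one row per month
H = 12                            # forecast horizon in months

# One aggregated feature per driver instead of many lags, to keep it interpretable.
X = pd.DataFrame({
    "x1_mean3": df["x1"].rolling(3).mean(),
    "x2_mean3": df["x2"].rolling(3).mean(),
})

# Direct multi-step: fit a separate model for each horizon h.
models = {}
for h in range(1, H + 1):
    target = df["y"].shift(-h)                        # y at month t+h
    mask = X.notna().all(axis=1) & target.notna()     # drop rows lost to rolling/shifting
    models[h] = LinearRegression().fit(X[mask], target[mask])

# 12-month forecast from the latest available features.
latest = X.iloc[[-1]]
forecast = [models[h].predict(latest)[0] for h in range(1, H + 1)]
```

For cross-validation I was thinking of something like scikit-learn's TimeSeriesSplit, so that each split trains only on months before the ones it is evaluated on.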

Related

confused about choosing linear or nonlinear regression to model this data

I plotted the data I wanted to model and the result is shown in the picture: https://i.stack.imgur.com/QY17L.jpg. I tried modeling it with a sinc function but failed, so I'd appreciate any ideas.
First, please note that while linear regression may let you fit these data points, it won't really tell you anything about future data. It may fit your test data only if all of it lies on this same curve. If you're looking to predict future prices, you might want to consider a time series model.
However, if you're just trying to use linear regression to fit this data to that curve, you have to get slightly creative. If all your features are linear and you're using linear regression, then one way or another you'll end up with a linear answer, which won't fit this shape. So you will need to build custom features from your data. You can probably get a pretty good approximation using a 10th-degree polynomial, so your features could be X (years since 1992), X^2 (years since 1992, squared), X^3, X^4, ... X^10.
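For illustration, a minimal sketch of that feature construction using scikit-learn's PolynomialFeatures (the data below is a stand-in, not the curve from the image):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(25, dtype=float).reshape(-1, 1)   # hypothetical: years since 1992
y = np.sin(x).ravel() / (x.ravel() + 1.0)       # placeholder for the real values

# Expands x into [x, x^2, ..., x^10] and fits ordinary least squares on top.
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(x, y)
y_hat = model.predict(x)
```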
A variety of other regressors would also work fine, but you'll probably need some sort of time series model (an LSTM, for example) to get anything that generalizes to future predictions.

Usage of nested cross validation for different regressors

I'm working on an assignment in which I have to compare two regressors (random forest and SVR) that I implement with scikit-learn. I want to evaluate both regressors, and after a lot of googling I came across nested cross-validation, where you use the inner loop to tune the hyperparameters and the outer loop to validate on the k folds of the training set. I would like to use the inner loop to tune both my regressors and the outer loop to validate both, so I have the same train and test folds for both regressors.
Is this a proper way to compare two ML algorithms with each other? Are there better ways to compare two algorithms, especially regressors?
I found some blog entries, but I could not find any scientific paper stating that this is a good technique for comparing two algorithms, which is important to me. If you have links to relevant papers, I'd be glad if you could post them, too.
Thanks for the help in advance!
EDIT
I have a very small amount of data (approx. 200 samples) with a high number of features (approx. 250 after feature selection, otherwise about 4500), so I decided to use cross-validation. My dependent variable is a continuous value from 0 to 1.
The problem is a recommender problem, so it makes no sense to test for accuracy in this case. As this is only an assignment, I can only evaluate the algorithms with statistical methods rather than asking users for their opinion or measuring the purchases they make.
I think it depends on what you want to compare. If you just want to compare different models with regard to predictive power (classifier and regressor alike), nested cross-validation is usually good for avoiding overly optimistic metrics while still letting you find the best set of hyperparameters: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
However, sometimes it seems like it is just overkill: https://arxiv.org/abs/1809.09446
Also, depending on how the algorithms behave, what datasets you're working with, their characteristics, and so on, your comparison might need to take into account a lot more than just predictive power. If you give some more details, we will be able to help more.
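To make that concrete, here is a minimal sketch of nested cross-validation with shared outer folds for both regressors, assuming scikit-learn (the parameter grids are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

X, y = np.random.rand(200, 250), np.random.rand(200)      # stand-in for the real data

outer = KFold(n_splits=5, shuffle=True, random_state=0)   # identical folds for both models
inner = KFold(n_splits=3, shuffle=True, random_state=0)   # hyperparameter tuning

candidates = {
    "rf": GridSearchCV(RandomForestRegressor(random_state=0),
                       {"n_estimators": [100, 300]}, cv=inner),
    "svr": GridSearchCV(SVR(), {"C": [0.1, 1, 10]}, cv=inner),
}

for name, search in candidates.items():
    scores = cross_val_score(search, X, y, cv=outer,
                             scoring="neg_mean_squared_error")
    print(name, scores.mean(), scores.std())
```

Because the outer KFold has a fixed random_state, both regressors are evaluated on exactly the same train/test folds, which is the setup you described.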

How do I create a feature vector if I don’t have all the data?

So say for each of my ‘things’ to classify I have:
{house, flat, bungalow, electricityHeated, gasHeated, ... }
Which would be made into a feature vector:
{1,0,0,1,0,...} which would mean a house that is heated by electricity.
For my training data I would have all of this, but for the actual thing I want to classify I might only have what kind of house it is and a couple of other things, not all the data, i.e.
{1,0,0,?,?,...}
So how would I represent this?
I would want to find the probability that a new item would be gasHeated.
I would be using a linear SVM classifier. I don't have any code to show because this is purely theoretical at the moment. Any help would be appreciated :)
When I read this question, it seems that you may have confused features and labels.
You said that you want to predict whether a new item is "gasHeated", so "gasHeated" should be a label rather than a feature.
By the way, one of the most common ways to deal with a missing value is to set it to zero (or some unused value, say -1). But normally you should have missing values in both the training data and the testing data for this trick to be effective. If this only happens in your testing data and not in your training data, it means your training and testing data are not from the same distribution, which violates a basic assumption of machine learning.
Let's say you have a trained model and a testing sample {?,0,0,0}. You can then create two new testing samples, {1,0,0,0} and {0,0,0,0}, and you will have two predictions.
I personally don't think an SVM is a good approach if you have missing values in your testing dataset. As mentioned above, you can get two predictions, but what if they disagree? In my opinion it is difficult to assign a probability to SVM outputs, unlike with logistic regression or naive Bayes. I would prefer a random forest in this situation.
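As a rough sketch of the "two completed samples" idea combined with a random forest's probability estimates (the tiny training set below is made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up features (house, flat, bungalow, ...) and label (1 = gasHeated).
X_train = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]])
y_train = np.array([1, 0, 1, 0])
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Test sample {?, 0, 0, 0}: fill the unknown slot both ways and average
# the predicted probabilities (implicitly assuming a 50/50 prior for it).
filled = np.array([[1, 0, 0, 0], [0, 0, 0, 0]])
p_gas = clf.predict_proba(filled)[:, 1].mean()
print("P(gasHeated):", p_gas)
```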

User behavior prediction/analysis

I am trying to apply machine learning methods to predict/analyze user behavior. The data I have is in the following format:
[image: data type]
I am new to machine learning, so I am trying to understand whether what I am doing makes sense. In the activity column I have two possibilities, which I represent as 0 or 1. In the time column I have time in cyclic form, mapped to the range 0-24 and one-hot encoded. At a certain time the user performs an activity. Does it make sense to use the activity column as the target and try to predict whether, at a certain time, the user will perform one activity or the other?
The reason I am trying to predict activity is that if my model makes a prediction and in real time the user does something else (which he has not been doing over the last week or so), I want to treat that as a deviation from normal behavior.
Am I doing this right or wrong? Any suggestions will be appreciated. Thanks.
I think your idea is valid, but machine learning models are not 100% accurate all the time; that is why "accuracy" is defined for a model.
If you want to create high-performance predictive models, consider deep learning models, since their performance tends to improve as the size of the training data set grows.
I think this is a great use case for a classification problem. Since you have only a few columns (features) in your dataset, I would say start with a simple boosted decision tree classification algorithm.
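For instance, a minimal sketch of that baseline with scikit-learn's GradientBoostingClassifier (the file and column names are made up):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("user_log.csv")                    # hypothetical columns: hour, activity
X = pd.get_dummies(df["hour"].astype("category"))   # one-hot encode the hour of day
y = df["activity"]                                  # 0 or 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))

# At run time, flag a deviation when the user's actual activity differs
# from the model's confident prediction for that hour.
```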
Your thinking is correct; that's basically how fraud detection AI works in some cases. One option to pursue is a decision tree model, which may help you scale dynamically.
I was working on the same project but in a different direction; have a look, maybe it can help :) https://github.com/dmi3coder/behaiv-java

Beginner's guide to troubleshooting badly performing models

I'm creating my first predictive model and its results are absolutely awful.
I need some help figuring out how to troubleshoot this.
I'm doing linear regression and logistic regression classification to predict whether a student will pass a course: 1 for yes, 0 for no.
The dataset is tiny, as we only have complete data for one class: 16 features and just under 60 rows, of which 35 passed and 25 failed.
I'm wondering if my dataset is simply too small.
I don't want to share the dataset just yet, but I will clean it up so it's completely anonymous.
The ROC curve is very jagged (for logistic regression), and the model predicts more false positives than anything else.
I'd appreciate some general troubleshooting advice for a beginner that I can try before we hire a professional.
Thanks for any help provided.
I'd suggest some tips:
In Azure ML there's a module called "filter based feature selection"; you can use it to score your features and check whether they really have predictive power, or even keep just the ones with the highest scores.
If you haven't already, split into train/cross-validation sets, evaluate your model on both, and use that as a diagnostic to identify underfitting (high bias) or overfitting (high variance). Depending on the diagnosis, take actions like the following (see the sketch after this list):
For overfitting: get more data, use fewer features, use a less complex model, add or increase regularization.
For underfitting: add more features, use a more complex model, decrease regularization.
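Here is the sketch mentioned above: a quick bias/variance diagnostic using scikit-learn's learning_curve on stand-in data shaped like the dataset described (about 60 rows, 16 features). A large gap between train and validation scores suggests overfitting; two low scores suggest underfitting:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X = np.random.rand(60, 16)         # stand-in for the real features
y = np.random.randint(0, 2, 60)    # stand-in pass/fail labels

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.3, 1.0, 5))
print("train:", train_scores.mean(axis=1))
print("valid:", val_scores.mean(axis=1))
```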
And don't forget, before you start training, to explore and evaluate your data. Use scatter plots to see whether it is indeed separable, and perform feature engineering and preprocessing. Ask yourself: given these features, would a human expert be able to make predictions? If the answer is no, transform or drop features until it is.
