Identifying regression with ARIMA errors

I have read from several sources that, in order to identify the best ARIMA process for a regression model with ARIMA errors, one should start by estimating a proxy ARIMA(2,0,0)(1,0,0) model and then read the ACF and PACF plots of its residuals to develop the best model (see https://www.otexts.org/fpp/9/1).
But isn't it better to just start with a simple OLS regression and its residuals, rather than the residuals from a proxy model that already includes ARIMA components?
Can anybody tell me whether there are any specific advantages of one approach over the other?
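For concreteness, here is a minimal sketch of the OLS-first route in Python with statsmodels; the file name and the column names 'y' and 'x' are placeholders, not from the question:

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical data file; 'y' and 'x' are placeholder column names
df = pd.read_csv("data.csv")

# Step 1: plain OLS regression of y on x
ols_fit = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()

# Step 2: inspect the ACF/PACF of the OLS residuals to pick ARIMA orders
plot_acf(ols_fit.resid, lags=24)
plot_pacf(ols_fit.resid, lags=24)
plt.show()

# Step 3: refit jointly as a regression with ARIMA errors, using the
# orders suggested by the plots (here (1, 0, 0), purely as an example)
arima_fit = sm.tsa.SARIMAX(df["y"], exog=df["x"], order=(1, 0, 0)).fit()
print(arima_fit.summary())
```

Note that the joint refit in step 3 matters either way: OLS residual plots only suggest candidate orders, and the final error structure should be estimated together with the regression coefficients.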

Related

How to evaluate machine learning model performance on brand new datasets, in addition to train, validation, and test datasets?

The Scenario:
Our data science team builds machine learning models for classification tasks. We evaluate our model performance on train, validation and test datasets. We use precision, recall and F1 score.
We then run the models on brand-new datasets in the production environment and make predictions. One week later, we get feedback on how well our predictive models have performed.
The question:
When we evaluate the performance of our models on the real datasets, what metrics should we use? Is prediction accuracy a better metric in this context?
I think you should either measure the same metrics or some business metrics.
Usually models are optimized for a certain loss/metric, which means a model with a high value on one metric can have a worse value on a different one.
Accuracy is heavily influenced by the class balance in the data, so it should be used with care.
So I suggest using the same metrics.
Another approach is to use business metrics, for example the revenue that these models brought in.
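As a minimal sketch, assuming scikit-learn and placeholder arrays standing in for last week's production predictions and the ground-truth feedback that came back:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder data: predictions made in production a week ago,
# and the true labels received as feedback afterwards
y_pred = [1, 0, 1, 1, 0, 1]
y_true = [1, 0, 0, 1, 0, 1]

# Same metrics as used on the train/validation/test splits
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```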
Model evaluation
Take a look at this paper. It is fairly easy to follow and covers everything you need to know about machine learning model validation.

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic regression
Naive Bayes
Random Forest
Adaboost
I have read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used; it is like a preprocessing technique.
My question: is it best practice to perform feature importance separately for each algorithm, or just use Information Gain? If the former, what are the techniques used for each?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier.
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value <0.05). (The same holds for two-class Linear Discriminant Analysis.)
Random Forest: can return a variable importance index that ranks the variables from most to least important (sketched below, together with mutual information).
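To illustrate both routes, here is a minimal scikit-learn sketch on a synthetic dataset (all data is generated just to show the API): a classifier-independent mutual information ranking next to Random Forest's built-in importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Synthetic data standing in for the ~30-feature dataset
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)

# Classifier-independent ranking: mutual information between each feature and y
mi = mutual_info_classif(X, y, random_state=0)
print("Top 5 by mutual information:", np.argsort(mi)[::-1][:5])

# Classifier-dependent ranking: Random Forest variable importance
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("Top 5 by RF importance:   ", np.argsort(rf.feature_importances_)[::-1][:5])
```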
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model: good in the sense that you are satisfied with its performance, and robust, meaning that you should use a validation and/or a test set. These points are very important because we will analyse how the model makes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of the classifier); these can be used for both purposes.
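As a minimal sketch of the SHAP route, assuming the shap package is installed and using a synthetic dataset in place of the real one:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; in practice use your own validation/test split
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer computes per-feature contributions for tree ensembles;
# for a binary classifier the shape of shap_values (one array per class
# vs. a single array) varies between shap versions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive the model's predictions overall
shap.summary_plot(shap_values, X)
```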
For detailed instructions about this process and more tools, you can look at fast.ai's excellent machine learning course series, where lessons 2/3/4/5 are about this subject.
Hope it helps!

What is a simple machine learning model for time series prediction that can be used in the feature engineering phase?

I need a simple model that is fast to train and suitable for time series prediction, to be used mainly to generate new features. Should I use an LSTM, an SVM, or maybe something else?
Which model suits your data varies, but the mathematically simplest of these is the vanilla RNN.
There is a nice article for your reference:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
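For what it's worth, here is a minimal PyTorch sketch of a vanilla RNN doing one-step-ahead prediction on a synthetic sine wave; the window length, hidden size, and data are all placeholders, and the last hidden state could double as a learned feature:

```python
import torch
import torch.nn as nn

# A minimal vanilla RNN for one-step-ahead prediction on a univariate series
class VanillaRNN(nn.Module):
    def __init__(self, hidden_size=16):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):              # x: (batch, seq_len, 1)
        out, _ = self.rnn(x)
        return self.head(out[:, -1])   # predict the next value from the last state

# Synthetic series, cut into sliding windows
series = torch.sin(torch.linspace(0, 20, 400))
window = 20
X = torch.stack([series[i:i + window]
                 for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

model = VanillaRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):                   # short full-batch training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
print("final MSE:", loss.item())
```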

Is there any best practice for features selection for Machine Learning model to do click through rate prediction

For an e-commerce company, how should one pick features when doing Click Through Rate (CTR) prediction using logistic regression, SVM, or other machine learning models?
I tried gender and statistical features from goods tags, and used SVM and NN, but the results were very bad.
Are there any suggestions or best practices about the important factors for CTR prediction in e-commerce? Thanks!
When you use a library like scikit-learn, you can use GridSearchCV to find the best parameters for the model you're building. You can specify the evaluation metric that you want to optimize; in your case, you first need to understand what the right evaluation metric is.
Read about it here:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
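A minimal sketch of what that looks like, with a synthetic imbalanced dataset standing in for click data and ROC AUC as the metric (a common choice for CTR work, since accuracy is misleading on imbalanced classes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for click data (clicked = 1, not clicked = 0),
# with roughly 90% negatives to mimic the usual imbalance
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9],
                           random_state=0)

# Search over the regularisation strength, optimising ROC AUC
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_, "best AUC:", grid.best_score_)
```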

How to Evaluate and Analyse Machine Learning algorithm Performance?

Sorry if my question sounds too naive... I am really new to machine learning and regression.
I have recently joined a machine learning lab as a master's student. My professor wants me to write the "experiments and analysis" section of a paper the lab is about to submit about a regression algorithm they have developed.
The problem is that I don't know what I have to do. He said the algorithm is stable and complete, they have written the first part of the paper, and I need to write the evaluation part.
I have participated in coding the algorithm and I understand it pretty well, but I don't know what tasks I must carry out to evaluate and analyse its performance:
-Where do I get data?
-What is the testing process?
-What analyses need to be done?
I am new to research and paper writing, and although I have read a lot of papers recently, I have no experience in analyzing ML algorithms.
Could you please guide me through the process and explain it at a newbie level?
Detailed answers are appreciated. Thanks!
You will need a test dataset to evaluate the performance. If you don't have one, divide the training dataset (the one you're currently running this algorithm on) into a training set and a cross-validation set (non-overlapping).
Create the test set by stripping out the targets (y values) from the cross-validation set.
Run the algorithm on the training dataset to train the model.
Once your model is trained, test its performance on the stripped 'test set'.
To evaluate the performance, you can use the RMSE (Root Mean Squared Error) metric. You will need the predictions that your algorithm made for each sample in the test set and the corresponding actual values (that you stripped off earlier). You can find more information here.
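Here is a minimal sketch of that whole workflow with scikit-learn, using synthetic data and a plain linear model standing in for your lab's algorithm:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

# Non-overlapping split: train on one part, hold out the rest for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = LinearRegression().fit(X_train, y_train)  # train the model
y_pred = model.predict(X_test)                    # predict on held-out samples

# RMSE: root of the mean squared difference between predictions and actuals
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
```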
Machine learning model evaluation
Take a look at this paper. It has been written for people without a computer science background, so it should be fairly easy to follow. It covers:
model evaluation workflow
holdout validation
cross-validation
k-fold cross-validation
stratified k-fold cross-validation
leave-one-out cross-validation
leave-p-out cross-validation
leave-one-group-out cross-validation
nested cross-validation
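Several of those schemes are available directly in scikit-learn. As a minimal sketch, here is plain vs. stratified k-fold cross-validation on a synthetic dataset with a placeholder model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Plain k-fold vs. stratified k-fold (preserves the class ratio in each fold)
for name, cv in [("k-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("stratified", StratifiedKFold(n_splits=5, shuffle=True,
                                                random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(name, "mean accuracy:", scores.mean())
```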
