ETS model in statsmodels library - time-series

I have several questions regarding the ETS model in statsmodels library. The description of the model can be found here.
The default initialization_method is estimated. In the description, it says ‘estimated’ uses the same heuristic as initial guesses, but then estimates the initial states as part of the fitting process. What does this mean? What is the heuristic value as the initial guesses? How does the estimation work? Does it try to minimize the sum of squared errors of the one-step ahead forecast?
If I specify damped_trend = True, how does the model choose the optimal damping parameter?


Choosing right metrics for regression model

I have always used the R² score as my metric. I know there are several other evaluation metrics and I have read several articles about them, but since I'm still a beginner in machine learning, I'm confused about the following:
When should each metric be used? Does it depend on the case? If so, please give an example.
I read this article, which said that the R² score is not straightforward and that we need other measures to assess the performance of our model. Does that mean we need more than one evaluation metric to get better insight into model performance?
Is it advisable to measure model performance with just one evaluation metric?
This article said that knowing the distribution of our data and our business goal helps us choose appropriate metrics. What does that mean?
For each metric, how do I know that the model is 'good' enough?
There are different evaluation metrics for regression problems like below.
Mean Squared Error(MSE)
Root-Mean-Squared-Error(RMSE)
Mean-Absolute-Error(MAE)
R² or Coefficient of Determination
Mean Square Percentage Error (MSPE)
and so on.
As you mentioned, you need to choose among them based on your problem type, what you want to measure, and the distribution of your data.
To do this, you need to understand how these metrics evaluate the model. You can check the definitions and pros/cons of the evaluation metrics in this nice blog post.
R² shows what proportion of the variance of your target variable is explained by the independent variables. A good model can have an R² close to 1.0, but a useful model does not have to: a model with a low R² can still have a low MSE (for example, when the target itself varies little). So, to assess the predictive power of your model, it is better to use MSE, RMSE, or other metrics alongside R².
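As a minimal sketch of computing several of these metrics side by side with scikit-learn (the y_true and y_pred values are made up for illustration, not taken from any real model):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions, invented for this example.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = float(np.sqrt(mse))                  # back in the units of the target
mae = mean_absolute_error(y_true, y_pred)   # average absolute error
r2 = r2_score(y_true, y_pred)               # share of target variance explained

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```

Reporting R² together with an error metric in the target's units (RMSE or MAE) usually gives a fuller picture than either alone.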
No. You can use multiple evaluation metrics. The important thing is that when you compare two models, you use the same test dataset and the same evaluation metrics.
For example, if you want to penalize large errors heavily, you can use MSE, because it measures the average squared error of the predictions. Conversely, if your data contain many outliers, MSE gives those examples too much weight, and a metric such as MAE may be more suitable.
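To illustrate how differently MSE and MAE react to a single large error, here is a small sketch with made-up residuals (the helper names mse and mae are just for this example):

```python
import numpy as np

def mse(errors):
    """Mean of squared errors."""
    return float(np.mean(np.square(errors)))

def mae(errors):
    """Mean of absolute errors."""
    return float(np.mean(np.abs(errors)))

# Hypothetical residuals: the second set replaces one error with an outlier.
errors = np.array([1.0, -1.0, 1.0, -1.0])
errors_outlier = np.array([1.0, -1.0, 1.0, -10.0])

print("no outlier:   MSE =", mse(errors), " MAE =", mae(errors))
print("with outlier: MSE =", mse(errors_outlier), " MAE =", mae(errors_outlier))
```

With the outlier, MSE grows far more than MAE, which is the behaviour you want if large errors are costly and the behaviour you want to avoid if the outliers are just noise.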
What counts as a good model depends on the complexity of your problem. For example, a model that predicts heads or tails with 49% accuracy is not good enough, because the baseline for this problem is 50%. For another problem, 49% accuracy may be enough. In summary, it depends on your problem, and you need to define a baseline (for example, human performance) to use as the threshold.
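One way to make the baseline explicit is to score a trivial model on the same data. The sketch below uses scikit-learn's DummyClassifier on synthetic, balanced coin-flip labels; the 0.49 model accuracy is a hypothetical number taken from the example above, not a measured result.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic balanced labels standing in for coin flips (0 = tails, 1 = heads).
y = np.array([0, 1] * 50)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)

model_acc = 0.49  # hypothetical accuracy of the model being judged
beats_baseline = model_acc > baseline_acc
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f} beats baseline: {beats_baseline}")
```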

Are there any methods for finding the value of variable which has significant influence on response?

I have a dataset which has 5 variables and 1 response. The variables are discrete. I want to find the key variable and its value which leads to a significant increase or decrease to the response.
You will need to perform some statistical tests in order to find which variables are the most significant.
If you are familiar with Python, you could use SelectKBest from scikit-learn. It gives each feature a score: the higher the score, the stronger the link between the feature and the output.
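A minimal sketch of this, assuming a continuous response and using f_regression as the scoring function (the data are synthetic, with the response deliberately driven by the third variable; for a categorical response a scorer such as chi2 or f_classif would be the analogous choice):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic data: 5 discrete variables with values 0..3.
X = rng.integers(0, 4, size=(300, 5)).astype(float)
# The response depends on variable index 2 plus noise (an assumption of this sketch).
y = 3.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
scores = selector.scores_            # one score per variable
best = int(np.argmax(scores))        # index of the highest-scoring variable

print("scores:", np.round(scores, 1))
print("strongest variable index:", best)
```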
Additionally, you can train an explainable ML model that is expressive enough to capture the pattern in the data, and compute the feature importances from it.
For example, you could use DecisionTreeClassifier from scikit-learn. Its decision_path method returns the nodes each sample passes through in the tree, and the fitted estimator has a feature_importances_ attribute that, by default, uses the reduction in Gini impurity to compute the importance of each feature.
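As a rough sketch on synthetic data (the response is constructed to flip when the second variable reaches 2, purely for illustration), the importances identify the key variable and the printed tree rules show the value at which the response changes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic data: 5 discrete variables with values 0..3.
X = rng.integers(0, 4, size=(300, 5))
# The binary response depends only on variable index 1 (an assumption of this sketch).
y = (X[:, 1] >= 2).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

importances = tree.feature_importances_   # Gini-based importance per variable
top_feature = int(np.argmax(importances))

print("importances:", np.round(importances, 3))
# The text rules show the split threshold, i.e. the value where the response changes.
print(export_text(tree, feature_names=[f"x{i}" for i in range(5)]))
```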
Last but not least, you can use dimensionality reduction techniques such as PCA, which finds the directions of greatest variance in the features. PCA computes new principal components that are linear combinations of the original features, and from the most explanatory components you can read off which features contribute most. Check this Stack Overflow answer, which explains everything you should know about that.
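A small sketch of reading feature contributions off the first principal component follows. The data are synthetic, and the first variable is deliberately given a much larger variance so that it dominates. Note that PCA looks only at the features and does not use the response at all.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic features; variable 0 has 5x the standard deviation of the others.
X = rng.normal(size=(200, 5)) * np.array([5.0, 1.0, 1.0, 1.0, 1.0])

pca = PCA(n_components=2).fit(X)

explained = pca.explained_variance_ratio_   # variance share of each component
loadings = np.abs(pca.components_[0])       # weight of each variable in PC1
dominant_feature = int(np.argmax(loadings))

print("explained variance ratio:", np.round(explained, 3))
print("variable with the largest weight in PC1:", dominant_feature)
```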

Random Forest - Max Features

I have a question and need your support. I have a data set that I am analyzing, and I need to predict a target. To do this I did some data cleaning, including dropping highly linearly correlated features.
After preparing my data I applied a random forest regressor (it is a regression problem). I am a bit stuck, since I cannot grasp the meaning of max_features, and thus which value to use.
I found the following answer, where it is written:
features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression
I do get different results if I use max_features=sqrt(n) or max_features=n_features
Can anyone give me a good explanation of how to approach this parameter?
That would be really great
max_features is a parameter that needs to be tuned. Values such as sqrt or n/3 are defaults and usually perform decently, but the parameter needs to be optimized for every dataset, as it will depend on the features you have, their correlations and importances.
Therefore, I suggest training the model many times with a grid of values for max_features, trying every possible value from 2 to the total number of your features. Train your RandomForestRegressor with oob_score=True and use oob_score_ to assess the performance of the Forest. Once you have looped over all possible values of max_features, keep the one that obtained the highest oob_score.
For safety, keep the n_estimators on the high end.
PS: this procedure is basically a grid search optimization for one parameter, and is usually done via Cross Validation. Since RFs give you OOB scores, you can use these instead of CV scores, as they are quicker to compute.
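The loop described above could look roughly like this. It is a sketch on synthetic data from make_regression, and the sample size, number of features, and n_estimators=200 are arbitrary choices for the example:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the real data set.
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=10.0, random_state=0
)

oob_scores = {}
for m in range(2, X.shape[1] + 1):
    forest = RandomForestRegressor(
        n_estimators=200,   # kept on the high side so OOB estimates are stable
        max_features=m,
        oob_score=True,     # score each sample using trees that did not see it
        random_state=0,
    )
    forest.fit(X, y)
    oob_scores[m] = forest.oob_score_   # R² on out-of-bag samples

best_m = max(oob_scores, key=oob_scores.get)
print({m: round(s, 3) for m, s in oob_scores.items()})
print("best max_features:", best_m)
```

If the OOB scores of neighbouring values are very close, the differences may be within noise, and any of those values is a reasonable choice.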

training set with only one label, missing the other

Hi I've been doing a machine learning project about predicting if a given (query, answer) pair is a good match (label the pair with 1 if it is a good match, 0 otherwise). But the problem is, in the training set, all the items are labelled with 1. So I got confused because I don't think the training set has strong discriminative power. To be more specific, now I could extract some features like:
1. textual similarity between query and answer
2. some attributes like the posting date, who created it, which aspect is it about etc.
Maybe I should try semi-supervised learning (I have never studied it, so I have no idea whether it will work)? But with such a training set I cannot even do validation.
Actually, you can train a model on only positive examples; a one-class SVM does this. However, this presumes that anything "sufficiently outside" the original data set is negative data, with "sufficiently outside" affected mainly by nu (an upper bound on the fraction of training points treated as errors) and the kernel parameters, such as gamma for an RBF kernel or the degree for a polynomial kernel.
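A minimal sketch with scikit-learn's OneClassSVM, assuming the (query, answer) pairs have already been turned into numeric feature vectors. Here the positives are simulated as a 2-D Gaussian cloud, and the values of nu and gamma are illustrative rather than tuned:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Simulated feature vectors for the positive (label 1) training pairs.
X_pos = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_pos)

# One point close to the training cloud and one far away from it.
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])
pred = model.predict(X_new)   # +1 = looks like the training data, -1 = outlier

print(pred)
```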
A solution for your problem depends on the data you have. You are quite correct that a model trains better when given representative negative examples. The description you give strongly suggests that you do know there are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from classification to predicting a score. If you do need a strict +/- partition (classification), then I suggest that you slightly alter your training set to include only obvious examples, throwing out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.
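The filtering step might be sketched like this, assuming each pair already has some match-strength score. The scores, the 0.5 threshold, and the 0.15 margin are all hypothetical values chosen for the example:

```python
import numpy as np

# Hypothetical match-strength scores for six (query, answer) pairs.
scores = np.array([0.05, 0.45, 0.52, 0.90, 0.30, 0.75])

threshold = 0.5   # comfort threshold for declaring a match (assumed)
margin = 0.15     # width of the "alley" to clear on each side (assumed)

# Keep only pairs that are clearly on one side of the threshold.
keep = np.abs(scores - threshold) > margin
train_scores = scores[keep]
train_labels = (train_scores > threshold).astype(int)

print("kept scores:", train_scores)
print("labels:     ", train_labels)
```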

interpret statistical model metrics

Do you know how to interpret RAE and RSE values? I know a COD closer to 1 is a good sign. Does this indicate that boosted decision tree regression is best?
RAE and RSE closer to 0 are a good sign: you want the error to be as low as possible. See this article for more information on evaluating your model. From that page:
The term "error" here represents the difference between the predicted value and the true value. The absolute value or the square of this difference are usually computed to capture the total magnitude of error across all instances, as the difference between the predicted and true value could be negative in some cases. The error metrics measure the predictive performance of a regression model in terms of the mean deviation of its predictions from the true values. Lower error values mean the model is more accurate in making predictions. An overall error metric of 0 means that the model fits the data perfectly.
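For reference, here is a sketch of how these three quantities are commonly defined, using made-up values: RAE and RSE compare the model's error with that of always predicting the mean, and COD (R²) equals 1 minus RSE.

```python
import numpy as np

# Hypothetical true values and predictions, invented for this example.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])
y_mean = y_true.mean()

# Relative absolute error: total absolute error relative to the mean predictor.
rae = float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_mean)))

# Relative squared error: total squared error relative to the mean predictor.
rse = float(np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_mean) ** 2))

# Coefficient of determination.
cod = 1.0 - rse

print(f"RAE={rae:.3f} RSE={rse:.3f} COD={cod:.3f}")
```

A value of RAE or RSE below 1 means the model does better than predicting the mean for every instance.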
Yes, with your current results, the boosted decision tree performs best. I don't know the details of your work well enough to determine if that is good enough. It honestly may be. But if you determine it's not, you can also tweak the input parameters in your "Boosted Decision Tree Regression" module to try to get even better results. The "ParameterSweep" module can help with that by trying many different input parameters for you and you specify the parameter that you want to optimize for (such as your RAE, RSE, or COD referenced in your question). See this article for a brief description. Hope this helps.
P.S. I'm glad that you're looking into the black carbon levels in Westeros...I'm sure Cersei doesn't even care.
