interpret statistical model metrics - machine-learning

Do you know how to interpret RAE and RSE values? I know a COD closer to 1 is a good sign. Does this indicate that boosted decision tree regression is best?

RAE and RSE values closer to 0 are a good sign... you want error to be as low as possible. See this article for more information on evaluating your model. From that page:
The term "error" here represents the difference between the predicted value and the true value. The absolute value or the square of this difference are usually computed to capture the total magnitude of error across all instances, as the difference between the predicted and true value could be negative in some cases. The error metrics measure the predictive performance of a regression model in terms of the mean deviation of its predictions from the true values. Lower error values mean the model is more accurate in making predictions. An overall error metric of 0 means that the model fits the data perfectly.
Yes, with your current results, the boosted decision tree performs best. I don't know the details of your work well enough to determine whether that is good enough; it honestly may be. But if you decide it's not, you can also tweak the input parameters in your "Boosted Decision Tree Regression" module to try to get even better results. The "ParameterSweep" module can help with that by trying many different input parameters for you; you specify the metric you want to optimize for (such as the RAE, RSE, or COD referenced in your question). See this article for a brief description. Hope this helps.
P.S. I'm glad that you're looking into the black carbon levels in Westeros...I'm sure Cersei doesn't even care.
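If you want to double-check the numbers outside the designer, here is a minimal NumPy sketch using the usual definitions (RAE and RSE compare the model's error to a baseline that always predicts the mean of the true values, and COD is just 1 minus RSE); the toy arrays below are made up for illustration:

```python
import numpy as np

def relative_absolute_error(y_true, y_pred):
    # Total absolute error relative to a baseline that always predicts the mean
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - np.mean(y_true)))

def relative_squared_error(y_true, y_pred):
    # Total squared error relative to the same mean-only baseline
    return np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

def coefficient_of_determination(y_true, y_pred):
    # COD (R^2) is 1 minus the relative squared error
    return 1.0 - relative_squared_error(y_true, y_pred)

# Made-up example values
y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.3, 7.0, 10.4, 11.6])

print("RAE:", relative_absolute_error(y_true, y_pred))       # closer to 0 is better
print("RSE:", relative_squared_error(y_true, y_pred))        # closer to 0 is better
print("COD:", coefficient_of_determination(y_true, y_pred))  # closer to 1 is better
```

A model that always predicted the mean would score RAE = RSE = 1 and COD = 0, which is why you want the error metrics near 0 and the COD near 1.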

Related

Need an interpretation on a statistical expression on missing values

I was reading a paper about missing values on the Internet and am having a problem interpreting the meaning of the first sentence highlighted in bold below:
Missing data present various problems. First, the absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false. Second, the lost data can cause bias in the estimation of parameters. Third, it can reduce the representativeness of the samples. Fourth, it may complicate the analysis of the study. Each of these distortions may threaten the validity of the trials and can lead to invalid conclusions.
Hope to hear some explanations.
Firstly, power is the probability of rejecting the null hypothesis when in fact it is false. So, you could say it is the probability of making the correct decision. The absence of data reduces this statistical power: a low sample size, small effects under investigation, or both reduce the likelihood that a statistically significant finding actually reflects a true effect. Say you have 100 samples and, because of missing values, you discard 40 of them; whatever conclusion you reach using the remaining 60 samples, you can't be very confident that it reflects a true effect.
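To make that concrete, here is a small sketch using statsmodels; the one-sample t-test and the 0.3 effect size are assumptions for illustration, not something from the question:

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
for n in (100, 60):  # full dataset vs. dataset after dropping rows with missing values
    power = analysis.power(effect_size=0.3, nobs=n, alpha=0.05, alternative="two-sided")
    print(f"n = {n:3d}  ->  power = {power:.2f}")
# Dropping 40 incomplete samples noticeably lowers the probability of detecting a real effect.
```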
Secondly, if you choose to replace those missing values with, for example, the mean, you are injecting a form of bias into the data. In fact, however you decide to replace or remove the data, some bias gets injected (though certain kinds of bias are more acceptable in certain situations).
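As a hypothetical illustration of that point (the normal distribution and the 30% missingness rate are invented), mean imputation leaves the mean roughly intact but shrinks the estimated spread of the data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=1_000)   # the "complete" data

# Knock out 30% of the values at random, then fill them with the observed mean
mask = rng.random(x.size) < 0.30
x_missing = x.copy()
x_missing[mask] = np.nan
x_imputed = np.where(np.isnan(x_missing), np.nanmean(x_missing), x_missing)

print("mean before/after imputation:", x.mean().round(3), x_imputed.mean().round(3))
print("std  before/after imputation:", x.std().round(3), x_imputed.std().round(3))
# The standard deviation of the imputed data is noticeably smaller: a biased estimate.
```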
Thirdly, the sentence is quite self-explanatory: missing values reduce the representativeness of the samples, since you don't have all the information you need about those samples.
Lastly, we can say that missing values do complicate our study. They are the last thing we want when working with data, but because of human error and many other sources of error we often have to deal with them using certain operations.

Why is it important to have more than one evaluation method for different machine learning algorithms?

Couldn't find a precise and concise answer. I'm not particularly interested in the different machine learning evaluation methods themselves; I just want to know why it's important to have more than one.
Each metric gives a different insight and evaluates your model differently.
Let's take an example for binary classification:
Accuracy tells you what percentage of your predictions are correct. But what if you also want to know exactly how many 1's you got wrong [i.e. you predicted 0 where the label should be 1]? For this, you calculate the recall score.
So you get the idea: maybe you want good accuracy but also good recall [real-world example: spam detection], so you look at both metrics and choose wisely.
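Here is a tiny made-up example of that trade-off, in the spirit of spam detection with a heavy class imbalance, assuming scikit-learn's metric functions:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced data: 90 negatives (0) and 10 positives (1),
# and a lazy model that predicts 0 for everything.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.9 -- looks great
print("recall:  ", recall_score(y_true, y_pred))    # 0.0 -- every positive was missed
```

Accuracy alone makes the do-nothing model look excellent; recall exposes that it never catches a single positive.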

Choosing right metrics for regression model

I have always been using the r2 score metric. I know there are several evaluation metrics out there, and I have read several articles about them, but since I'm still a beginner in machine learning I'm still very confused about:
When to use each of them. Does it depend on our case? If yes, please give me an example.
I read this article and it said the r2 score is not straightforward and that we need other things to measure the performance of our model. Does that mean we need more than one evaluation metric in order to get better insight into our model's performance?
Is it recommended to measure our model's performance with just one evaluation metric?
This article said that knowing the distribution of our data and our business goal helps us choose appropriate metrics. What does it mean by that?
How do we know, for each metric, that the model is 'good' enough?
There are different evaluation metrics for regression problems, such as:
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
R² or Coefficient of Determination
Mean Square Percentage Error (MSPE)
and so on.
As you mentioned, you need to choose among them based on your problem type, what you want to measure, and the distribution of your data.
To do this, you need to understand how these metrics evaluate the model. You can check the definitions and pros/cons of evaluation metrics from this nice blog post.
R² shows how much of the variation in your target variable is explained by the independent variables. A good model can give an R² score close to 1.0, but that does not mean it must: models with a low R² can still give a low MSE. So to assess the predictive power of your model it is better to use MSE, RMSE, or other metrics besides R².
No, it's not recommended to rely on only one; you can use multiple evaluation metrics. The important thing is that if you compare two models, you need to use the same test dataset and the same evaluation metrics.
For example, if you want to penalize bad predictions heavily, you can use the MSE metric, because it basically measures the average squared error of your predictions; on the other hand, if your data contain many outliers, MSE puts too much weight on those examples.
The definition of a good model changes based on your problem's complexity. For example, if you train a model that predicts heads or tails and it gives 49% accuracy, that is not good enough, because the baseline for this problem is 50%. But for another problem, 49% accuracy may be enough. So, in summary, it depends on your problem, and you need to define or think about the human (baseline) threshold.
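As a rough sketch of the outlier point above (all numbers are invented), MSE and RMSE are dominated by a single badly missed point, while MAE is much less affected:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up predictions; only the last point is badly missed.
y_true = np.array([10.0, 12.0, 11.0, 13.0, 50.0])
y_pred = np.array([10.5, 11.5, 11.2, 12.8, 20.0])

mse = mean_squared_error(y_true, y_pred)
print(f"MSE:  {mse:.2f}")                                  # dominated by the one large miss
print(f"RMSE: {np.sqrt(mse):.2f}")
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.2f}")  # far less sensitive to that outlier
print(f"R^2:  {r2_score(y_true, y_pred):.2f}")
```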

training set with only one label, missing the other

Hi, I've been working on a machine learning project about predicting whether a given (query, answer) pair is a good match (label the pair with 1 if it is a good match, 0 otherwise). The problem is that in the training set all the items are labelled with 1, so I'm confused because I don't think the training set has strong discriminative power. To be more specific, I can extract some features like:
1. textual similarity between query and answer
2. some attributes like the posting date, who created it, which aspect is it about etc.
Maybe I should try semi-supervised learning (I've never studied it, so I have no idea if it will work)? But with such a training set I can't even do validation...
Actually, you can train on a data set with only positive examples; a 1-class SVM does this. However, this presumes that anything "sufficiently outside" the original data set is negative data, with "sufficiently outside" affected mainly by gamma (allowed error rate) and k (degree of the kernel function).
A solution for your problem depends on the data you have. You are quite correct that a model trains better when given representative negative examples. The description you give strongly suggests that you do know there are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from a classification to a prediction case. If you do need a strict +/- partition (classification), then I suggest that you slightly alter your training set: include only obvious examples; throw out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.
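If you try the 1-class route, a minimal scikit-learn sketch might look like the following. The two features and all the numbers are invented; note that in scikit-learn the relevant knobs are nu (the upper bound on the fraction of training errors) together with the kernel parameters (gamma, degree):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Stand-in for the positive-only training set: feature vectors for known good
# (query, answer) pairs, e.g. [textual similarity, recency score].
X_good = rng.normal(loc=[0.8, 0.7], scale=0.1, size=(200, 2))

# nu bounds the fraction of training points treated as errors/outliers;
# gamma controls how tightly the RBF kernel wraps around the positive class.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_good)

X_new = np.array([[0.78, 0.72],   # looks like the known good pairs
                  [0.10, 0.15]])  # far outside them
print(clf.predict(X_new))  # +1 = treated as a match, -1 = treated as a non-match
```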

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and used a two-way ANOVA with repeated measures. The calculation was done in SPSS. In most papers I have seen, the F-value and the degrees of freedom were reported as well. Is it normal to report those values? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result and when does it not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues would be greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to get statistical evidence about whether the measured groups have the same mean or not. So we set up a null hypothesis and an alternative hypothesis, then we apply a test statistic to the data. You can use ANOVA if the groups have the same variance (squared standard deviation).
You need to test this. This is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they don't.
You make the decision from the Sig. value: if the value is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume that the data follow the normal distribution.)
For the ANOVA itself, the null hypothesis is that the groups have equal means, and the alternative hypothesis is that at least one group has a different mean. You can make your decision from the Sig. value as before: if the value is higher than 0.05, we accept the null hypothesis.
The critical F-value is not important if you are calculating on a computer. You can build an acceptance interval from the lower and upper critical F-values, and if the F-value falls inside the interval you accept the null hypothesis, but I have only used this method in statistics class. You don't need the F-value and the df in the report, because they don't explain anything on their own.
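Outside SPSS, a small sketch like the one below (a plain one-way ANOVA on invented data, not your repeated-measures design) shows where the F-value, the degrees of freedom, the Sig. (p) value, and the critical F-value come from:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(10, 2, 20)   # three invented groups of 20 measurements each
g2 = rng.normal(11, 2, 20)
g3 = rng.normal(13, 2, 20)

f_value, p_value = stats.f_oneway(g1, g2, g3)
df_between = 3 - 1            # number of groups - 1
df_within = 3 * 20 - 3        # total observations - number of groups
f_critical = stats.f.ppf(0.95, df_between, df_within)  # cutoff at alpha = 0.05

print(f"F({df_between}, {df_within}) = {f_value:.2f}, p = {p_value:.4f}")
print(f"critical F at alpha = 0.05: {f_critical:.2f}")
# The result is significant when p < 0.05, which is the same as F exceeding the critical F.
```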
