SPSS two-way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and analyzed it with a two-way repeated measures ANOVA. The calculation was done in SPSS. In most papers I have seen, the F-value and the degrees of freedom are reported as well. Is it normal to report those values too? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result and when not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues is greatly appreciated.

My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to test statistically whether the measured groups have the same mean or not. We set up a null hypothesis and an alternative hypothesis and then apply a test statistic to the data. You can use ANOVA if the groups have the same variance (the squared standard deviation).
You need to test this first. It is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they do not. You make the decision from the Sig. value; if it is higher than 0.05, we usually accept (fail to reject) the null hypothesis. If the variances are equal, we can use ANOVA. (I assume the data follow a normal distribution.)
For the ANOVA itself, the null hypothesis is that the groups have equal means, and the alternative hypothesis is that at least one group has a different mean. You make your decision from the Sig. value, as I said before: if the value is higher than 0.05, you accept the null hypothesis.
The critical F-value is not important if you are calculating on a computer. You can build an acceptance interval from the lower and upper critical F-values, and if the F-value falls inside that interval you accept the null hypothesis, but I have only used this method in statistics class. You don't strictly need the F-value and the degrees of freedom in the report, because on their own they don't explain anything; the Sig. (p-value) carries the decision.
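To make the relationship between the F-value, the degrees of freedom, the Sig. (p-value) and the critical F-value concrete, here is a minimal sketch using scipy.stats; the F-value and the two degrees of freedom are made-up numbers standing in for whatever your SPSS output shows.

```python
from scipy import stats

# Hypothetical values read off an SPSS output (replace with your own):
f_value = 4.73      # F statistic for the effect of interest
df_effect = 2       # degrees of freedom of the effect (numerator)
df_error = 28       # degrees of freedom of the error term (denominator)
alpha = 0.05        # conventional significance level

# p-value: probability of an F at least this large if the null hypothesis is true.
# This is what SPSS reports as "Sig.".
p_value = stats.f.sf(f_value, df_effect, df_error)

# Critical F-value: the smallest F that would still be significant at alpha.
f_critical = stats.f.ppf(1 - alpha, df_effect, df_error)

print(f"F({df_effect}, {df_error}) = {f_value:.2f}, p = {p_value:.4f}")
print(f"Critical F at alpha = {alpha}: {f_critical:.2f}")
print("Significant" if f_value > f_critical else "Not significant")
```

Papers usually report this as "F(df_effect, df_error) = ..., p = ...". The F-value is "large enough" exactly when it exceeds the critical value, which is the same decision as p < 0.05, so there is no single "good" F-value independent of the degrees of freedom.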


Cut points from which to choose the best split in a Decision Tree regressor with a continuous feature?

I understand that in the decision tree algorithm, when a split is decided, we choose the best split based on some criterion, and when looking for the best split we have to iterate over some list of candidate values. But it seems very computationally expensive to consider every value of the feature as a possible threshold (or so-called cut point). So there must be some heuristic for choosing these thresholds. For example, if we have a continuous feature and a categorical target (i.e., we are dealing with a classification problem), we can do the following: sort the dataset by the given feature and consider for splitting only the values where the target variable changes its value.
But what do you do in a regression task, i.e. when both the feature and the target are continuous variables? I realize that I have to calculate, for example, the mean variance or the mean median deviation in both branches for each split. But how do you decide which values to choose the best split from? People surely have come up with some solution that avoids iterating over every value of the feature in the training set.
I've done some research, but most sources focus only on the different criteria and on how to determine whether a split is suitable, which doesn't really answer my question.
I've found this question, but Predictor only suggests that it can be done using percentiles, and I don't think there is any guarantee that this is how it is really done in practice.
I've also found this question, but geledek's answer is not very clear to me (the answer is essentially copy-pasted from the presentation it refers to). I'm pretty much fine with Method 1, but I would really appreciate it if someone could explain Method 2 in more detail, or perhaps provide a different source or an explanation of your own.
UPD: I've also looked into the scikit-learn repo on GitHub and found this line. I can't quite follow the overall code, but it seems that this particular line implies that the thresholds are chosen as the averages of neighboring feature values (which corresponds to the aforementioned Method 1 from the question above). Is that correct? I also don't understand this comment: # sum of halves is used to avoid infinite value. How exactly does dividing by two prevent infinite values? Don't you only get infinity when dividing by zero? Is dividing by two necessary because this way we get the average value (and not because we want to avoid infinity)?
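For what it's worth, here is a minimal sketch (plain NumPy, not the actual scikit-learn code) of the midpoint idea described above for a regression split: candidate thresholds are the averages of neighboring sorted feature values, and each split is scored by the weighted variance of the two branches.

```python
import numpy as np

def best_split_1d(x, y):
    """Brute-force search over midpoint thresholds for one continuous feature.

    Candidate thresholds are midpoints of consecutive distinct sorted values
    (the Method 1 / scikit-learn-style idea discussed above). Splits are scored
    by the weighted variance of the two branches (lower is better).
    """
    order = np.argsort(x)
    xs, ys = x[order], y[order]

    best_threshold, best_score = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # identical values cannot be separated by any threshold
        # xs[i-1]/2 + xs[i]/2 instead of (xs[i-1] + xs[i])/2: summing halves
        # cannot overflow to infinity even for values near the float maximum,
        # which is presumably what the "avoid infinite value" comment refers to.
        threshold = xs[i - 1] / 2.0 + xs[i] / 2.0
        left, right = ys[:i], ys[i:]
        score = (len(left) * left.var() + len(right) * right.var()) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy data: the target jumps at x = 4.2, so the best threshold should land nearby.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4.2, 1.0, 3.0) + rng.normal(0, 0.1, 200)
print(best_split_1d(x, y))
```

Note that this still visits every distinct value once after a single sort, which is O(n log n) per feature; heuristics like the percentile-based Method 2 aim to shrink the candidate set further.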

Understanding Precision@K, AP@K, MAP@K

I'm currently evaluating a recommender system based on implicit feedback. I've been a bit confused with regard to the evaluation metrics for ranking tasks. Specifically, I am looking to evaluate by both precision and recall.
Precision@k has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.
I have noticed myself that it tends to be quite volatile, and as such I would like to average the results from multiple evaluation logs.
I was wondering: say I run an evaluation function which returns the following array:
a NumPy array containing the precision@k score for each user.
So now I have an array of all the precision@3 scores across my dataset.
If I take the mean of this array and average across, say, 20 different scores: is this equivalent to Mean Average Precision@K (MAP@K), or am I understanding this a little too literally?
I am writing a dissertation with an evaluation section so the accuracy of the definitions is quite important to me.
There are two averages involved, which makes the concepts somewhat obscure, but they are pretty straightforward (at least in the recsys context). Let me clarify them:
P@K
How many of the top-K recommendations of your system are relevant.
For example, to calculate P@3: take the top 3 recommendations for a given user and check how many of them are good ones. That number divided by 3 gives you P@3.
AP@K
The mean of P@i for i = 1, ..., K.
For example, to calculate AP@3: sum P@1, P@2 and P@3 and divide that value by 3.
AP@K is typically calculated for one user.
MAP@K
The mean of AP@K over all the users.
For example, to calculate MAP@3: sum AP@3 for all the users and divide that value by the number of users.
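As an illustration only, here is a minimal sketch of these definitions exactly as given above (i.e. the simplified AP@K that is the plain mean of P@1, ..., P@K), with made-up item ids:

```python
def precision_at_k(recommended, relevant, k):
    """P@k: fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def average_precision_at_k(recommended, relevant, k):
    """AP@k as described above: the plain mean of P@1, ..., P@k for one user."""
    return sum(precision_at_k(recommended, relevant, i) for i in range(1, k + 1)) / k

def mean_average_precision_at_k(recommended_per_user, relevant_per_user, k):
    """MAP@k: the mean of AP@k over all users."""
    scores = [average_precision_at_k(rec, rel, k)
              for rec, rel in zip(recommended_per_user, relevant_per_user)]
    return sum(scores) / len(scores)

# Toy example with two users (hypothetical recommendations and relevant sets):
recs = [["a", "b", "c"], ["x", "y", "z"]]
rels = [{"a", "c"}, {"y"}]
print(mean_average_precision_at_k(recs, rels, k=3))
```

So averaging your array of P@3 scores over users gives you the mean P@3, not MAP@3: MAP@3 also averages over the cutoff positions 1 to 3 inside each user, as in average_precision_at_k above.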
If you are a programmer, you can check this code, which is the implementation of the apk and mapk functions in ml_metrics, a library maintained by the CTO of Kaggle.
Hope it helped!

interpret statistical model metrics

Do you know how to interpret RAE and RSE values? I know a COD closer to 1 is a good sign. Does this indicate that boosted decision tree regression is best?
RAE and RSE closer to 0 is a good sign: you want the error to be as low as possible. See this article for more information on evaluating your model. From that page:
The term "error" here represents the difference between the predicted value and the true value. The absolute value or the square of this difference are usually computed to capture the total magnitude of error across all instances, as the difference between the predicted and true value could be negative in some cases. The error metrics measure the predictive performance of a regression model in terms of the mean deviation of its predictions from the true values. Lower error values mean the model is more accurate in making predictions. An overall error metric of 0 means that the model fits the data perfectly.
Yes, with your current results, the boosted decision tree performs best. I don't know the details of your work well enough to determine whether that is good enough; it honestly may be. But if you decide it's not, you can also tweak the input parameters of your "Boosted Decision Tree Regression" module to try to get even better results. The "ParameterSweep" module can help with that by trying many different input parameters for you, while you specify the metric you want to optimize for (such as the RAE, RSE, or COD referenced in your question). See this article for a brief description. Hope this helps.
P.S. I'm glad that you're looking into the black carbon levels in Westeros...I'm sure Cersei doesn't even care.

When are precision and recall inversely related?

I am reading about precision and recall in machine learning.
Question 1: When are precision and recall inversely related? That is, when does the situation occur where you can improve your precision but at the cost of lower recall, and vice versa? The Wikipedia article states:
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. Brain surgery provides an obvious example of the tradeoff.
However, I have seen research experiment results where both precision and recall increase simultaneously (for example, as you use different or more features).
In what scenarios does the inverse relationship hold?
Question 2: I'm familiar with the precision and recall concept in two fields: information retrieval (e.g. "return 100 most relevant pages out of a 1MM page corpus") and binary classification (e.g. "classify each of these 100 patients as having the disease or not"). Are precision and recall inversely related in both or one of these fields?
The inverse relation only holds when you have some parameter in the system that you can vary in order to get more or fewer results. Then there is a straightforward relationship: you lower the threshold to get more results, and among them some are true positives and some are false positives. This doesn't always mean that one metric rises exactly as the other falls; the real relationship can be mapped using the ROC curve. As for Q2: likewise, in both of these tasks precision and recall are not necessarily inversely related.
So how do you increase recall or precision without hurting the other? Usually by improving the algorithm or the model. That is, when you just change the parameters of a given model, the inverse relationship will usually hold, although you should keep in mind that it is also usually non-linear. But if you, for example, add more descriptive features to the model, you can increase both metrics at once.
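To illustrate the threshold effect with a toy example (made-up scores and labels, plain NumPy rather than any particular library):

```python
import numpy as np

# Hypothetical predicted scores and true binary labels for 8 examples.
scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10])
labels = np.array([1,    1,    0,    1,    0,    1,    0,    0])

for threshold in [0.9, 0.65, 0.35, 0.05]:
    predicted = scores >= threshold            # predict positive above the threshold
    tp = np.sum(predicted & (labels == 1))     # true positives
    fp = np.sum(predicted & (labels == 0))     # false positives
    fn = np.sum(~predicted & (labels == 1))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold never decreases recall but tends to lower precision, which is the tradeoff the quote describes; a better model or better features shift the whole curve so that both numbers can improve at once.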
Regarding the first question, I interpret these concepts in terms of how restrictive your results must be.
If you're more restrictive, that is, more demanding about the correctness of the results, you want them to be more precise. For that, you might be willing to reject some correct results as long as everything you get is correct. Thus you're raising your precision and lowering your recall. Conversely, if you do not mind getting some incorrect results as long as you get all the correct ones, you're raising your recall and lowering your precision.
As for the second question, looking at it from the point of view of the paragraphs above, I would say that yes, they are inversely related.
To the best of my knowledge, in order to increase both precision and recall, you'll need either a better model (more suitable for your problem) or better data (or, in practice, both).

What does it mean to have zero mean in the data?

I'm trying to find ways to normalize my dataset (represented as a matrix with documents as rows and features as columns) and I came across a technique called feature scaling. I found a Wikipedia article on it here.
One of the methods listed is Standardization which says "Feature standardization makes the values of each feature in the data have zero-mean and unit-variance." What does that mean (no pun intended)?
In this method, "we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation." When they say 'subtract the mean', is it the mean of the entire matrix or the mean of the column pertaining to that feature?
Also, if this feature scaling method is applied, does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
The basic idea is to apply a simple (and reversible) transformation to your dataset to make it easier to handle. You are subtracting a constant from each column and then dividing each column by a (different) constant. Those constants are column-specific.
When they say 'subtract the mean', is it the mean of the entire matrix or the mean of the column pertaining to that feature?
The mean of the column pertaining to that feature.
...does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
Correct. PCA requires data with a mean of zero. Usually this is enforced by subtracting the mean as a first step, so if the mean has already been subtracted, that step is not required. However, there is no harm in performing the "subtract the mean" operation twice, because the second time the mean is already zero and nothing changes. Formally, we might say that standardization is idempotent.
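Here is a minimal sketch of column-wise standardization in NumPy (a made-up matrix X with documents as rows and features as columns), just to make the "per column" point concrete:

```python
import numpy as np

# Made-up data: 100 documents, 3 features with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -3.0, 0.5], scale=[2.0, 0.1, 5.0], size=(100, 3))

# Column-wise (per-feature) mean and standard deviation,
# not the mean of the whole matrix.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)

X_std = (X - col_mean) / col_std

print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature

# Centering again changes nothing (idempotent),
# because the column means are already zero.
X_again = X_std - X_std.mean(axis=0)
print(np.allclose(X_std, X_again))  # True
```

PCA only needs the centering (zero mean) part; dividing by the standard deviation additionally puts all features on the same scale, which is often desirable before PCA anyway.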
From looking at the article, my understanding is that you would subtract the mean of that feature. This gives you a set of values for the feature that describes the same layout of the data, but normalized.
Imagine you added data for a new feature. You would probably want the data for your original features to remain the same and not be influenced by the new feature.
I guess you would still get a "standardized" range of values if you subtracted the mean of the whole dataset, but that would be something different; you're probably more interested in how the data of a single feature lies around its mean.
You could also have a look (or ask the question) on math.stackexchange.com.
