Assessing the significance of change in one parameter between two samples - p-value

Let's assume we have two data samples. One of them has an additional parameter. This parameter does not initially exist in that data sample. But after being added, we can see one of other attributes of the sample (e.g. performance percentage) has increased considerably. In the other sample, the performance percentage is not increased. While the other sample does not have this parameter, how can we make sure if the change in performance percent in the first sample is directly related with the increase in that specific parameter?
Thanks

Related

Need an interpretation on a statistical expression on missing values

I was reading a paper about missing values on the Internet and having a problem in interpreting interpreting the meaning of the first sentence highlighted in bold below:
Missing data present various problems. First, the absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false. Second, the lost data can cause bias in the estimation of parameters. Third, it can reduce the representativeness of the samples. Fourth, it may complicate the analysis of the study. Each of these distortions may threaten the validity of the trials and can lead to invalid conclusions.
Hope to hear some explanations.
Firstly, power is the probability of rejecting the null hypothesis when in fact it is false. So, you could say it is the probability of making the correct decision. The absence of data reduces this statistical power, a low sample size of studies, small effects being investigated, or both adversely impacts the likelihood that a statistically significant finding actually reflects a true effect. Meaning let's say if you've 100 samples and because of missing values you discard 40 samples from the dataset, now whatever conclusion you come up with using the remaining 60 samples, you can't be much confident that it reflects a true effect.
Secondly, If you choose to replace those missing values using the mean for example, then you're injecting a sort of bias to the data, actually, however you decide to replace or remove the data, the bias is getting injected. (though certain bias is more plausible in certain situations)
Thirdly, the sentence is quite explanatory itself, those missing values reduce the representativeness of the samples, as you don't have all the info you need about those samples.
Lastly, we can say it (missing values) actually does complicates our study, It's the last thing we would want when working with data, however because of human error and many other sources of errors we often have to deal with these missing values with certain operations.

How do I interpret this output from sklearn.tree.export_graphviz?

I am working on analyzing grade data. As a new way to look at the data I am using a decision tree, for the first time. I believe I have the code right and now I am trying to interpret it. The features are grades gotten for a series of quizzes, and the classification is the final grade the student received. I have a few questions:
If my understanding is correct, each node has a test and a left branch representing the test being true, and the other for false. And when the tree seems to have asked enough questions, it says what the "class" is. If that is the case, how come there's a class= on boxes well before the leaves? I would have thought that just leaves have a class=
How do I "tune" the overall tree? It seems to have too many boxes. Is this an example of "overfitting"? How can I tune that better?
For example, the use of FINAL_GRADE_PA01 seems arbitrary to be based on the ordering of the data. Is that true or did the analysis actually conclude that that feature was the best discriminator?
If I'm not mistaken, those class values indicate what the model would have predicted, had it stopped branching on that node. It still stores those values, but it doesn't use them if there's a branching from that node.
About the number of nodes, as you see in the docs:
The default values for the parameters controlling the size of the
trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets.
To reduce memory consumption, the complexity and size of the trees
should be controlled by setting those parameter values.
There are several parameters which you can use to reduce the complexity of your model. The following two parameters are just an example:
max_leaf_nodes : int or None, optional (default=None)
Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited
number of leaf nodes.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

SPSS two way repeated measures ANOVA

i am fairly new with statitistic.
I made an experiment and used the two way ANOVA with repeated measures. The calculation was done in SPSS. In most papers I have seen, the f-value and the degree of freedom were reported as well. is it normal to report those values as well? if so, which values do i take from the spss output.
how do I interpret these values? what do they mean?
when does the f-value support a significant result and when not?
what are good values for the f-value and the degree of freedom.
in some article is also read about the critical f-values, how do I get this value?
most articles describe how to calculate those values but do not explain their meaning for the experiment.
some clarification in these issues is greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is that we want statistical proof that the measured groups have the same mean or not. So we make a null hypothesis and an alternative hypothesis, then we use a test statistics on the data. You can use ANOVA if the groups has the same variance (squared standard deviation).
You need to test this. This is a hyptest too, the nullhyp. is the groups have the same variance, the anternative hyp. is they dont.
You need to make decision from the Sig. value, if the value is higher than 0,05, we usually accept the nullhyp. If the variances are equal, we can use ANOVA. (I assume that the data is following the Normal distribution.) The nullhyp. is that the groups have equal means, the alternative hyp is that we have at least 1 group with a different mean. You can make your decision from the Sig. value, as I said before, if the value higher than 0.05 we accept the nullhyp. The F-critical value is not important if you are calculating on a computer. You can make an accepting interval from the lower and the upper F-critical, and if the F-value is in the interval you accept the nullhyp, but I only used this method in statistics class. You don't need the F-value and the df in the report, because they don't explain anything on their own.

What does it mean to have zero mean in the data?

I'm trying to find ways to normalize my dataset (represented as a matrix with documents as rows and columns as features) and I came across a technique called feature scaling. I found a Wikipedia article on it here.
One of the methods listed is Standardization which says "Feature standardization makes the values of each feature in the data have zero-mean and unit-variance." What does that mean (no pun intended)?
In this method, "we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation." When they say 'subtract the mean', is it the mean of the entire matrix or the mean of the column pertaining to that feature?
Also, if this feature scaling method is applied, does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
The basic idea is to do a simple (and reversible) transformation on your dataset set to make it easier to handle. You are subtracting a constant from each column and then dividing each column by a (different) constant. Those constants are column-specific.
When they say 'subtract the mean', is it the mean of the entire matrix
or the mean of the column pertaining to that feature?
The mean of the column pertaining to that feature.
...does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
Correct. PCA requires data with a mean of zero. Usually this is enforced by subtracting the mean as a first step. If the mean has already been subtracted that step is not required. However, there is no harm in performing the "subtract the mean" operation twice. Because the second time the mean will be zero, so nothing will change. Formally, we might say that standardization is idempotent.
From looking at the article, my understanding is that you would subtract the mean of that feature. This will give you a set of data for the feature that describes the same layout of the data but normalized.
Imagine you added data for a new feature. You're probably going to want the data for your original features to remain the same, and not be influenced by the new feature.
I guess you would still get a "standardized" range of values if you subtracted the mean of the whole data set, but that would be something different - you're probably more interested in how the data of a single feature lies around its mean.
You could also have a look (or ask the question) on math.stackexchange.com.

Minimum number of observation when performing Random Forest

Is it possible to apply RandomForests to very small datasets?
I have a dataset with many variables but only 25 observation each. Random forests produce reasonable results with low OOB errors (10-25%).
Is there any rule of thumb regarding the minimum number of observations to use?
In fact one of the response variable is unbalanced, and if I'm going to subsample it I will end up with an even smaller number of observations.
Thanks in advance
Absolutely RF can be used on these type of datasets (i.e. p>n). In fact they use RF in fields like genomics where the number of fields >= 20000 and there are only a very small number of rows - say 10-12. The entire problem is figuring out which of the 20k variables would make up a parsimonious marker (i.e. feature selection is the entire problem).
I don't have any ROTs about minimum size other than if your model doesn't work well on a held back sample (or Hold-One-Back cross validation might work well in your case) well then you should try something else.
Hope this helps

Resources