Testing dispersion of values between condition on a group - standard-deviation

I may have a tricky question. I have a within design where every participant is tested in 2 conditions. I want to compare the dispersion of the measure, for the two condition, with the hypotheses that one condition has a larger dispersion than the other one.
I though about taking the sd to get a measure of dispersion by condition and participant. After that I don't know if I have the right to test for a statistical difference by using a t test on the group (just a paired t test).
I'd be grateful if anyone can help me.
Thank you.

You should consider the F-test for equality of two variances.
You didn't specify a language (and if there's no programming, this question would be more appropriate for CrossValidated) but in R it looks like:
#make some dummy data
d <- data.frame(subj_id = c(1:1000,1:1000),
cond = c(rep("Control",1000),rep("Exposed",1000)),
resp = c(rnorm(1000,10,1),rnorm(1000,10,2)) # make different variances
)
var.test(d$resp[d$cond=="Exposed"],
d$resp[d$cond=="Control"])
# F test to compare two variances
#
# data: d$resp[d$cond == "Exposed"] and d$resp[d$cond == "Control"]
# F = 4.0976, num df = 999, denom df = 999, p-value < 2.2e-16
# alternative hypothesis: true ratio of variances is not equal to 1
# 95 percent confidence interval:
# 3.619378 4.638939
# sample estimates:
# ratio of variances
# 4.097569

Related

MCMC for multi coefficients with normal gaussian distribution

I have a linear model as follows: Acceleration= (C1V)+ C2(X- D) where D= alpha-(betaV)+(gammaM).
note that the values for V,X,M are given in dataset.
My goal is to run 350 times MCMC for each following coefficients: C1,C2,alpha,beta,gamma.
1- I have mean and standard deviation for C1,C2,alpha,beta,gamma.
2- All coefficients (C1,C2,alpha,beta,gamma) are normal distribution.
I have tried two methods to find MCMC for each of coefficients, one is by pymc3(which I'm not sure if I have done correctly), another is defining likelihood function based on the method mentioned in following link by Jonny Homfmeister (in my case, I changed the distribution from binomial to normal gaussian):
https://towardsdatascience.com/bayesian-inference-and-markov-chain-monte-carlo-sampling-in-python-bada1beabca7
The problem is that, after running MCMC for C1,C2,alpha,beta and gamma, and using the mean of posterior (output of MCMC) in my main model, I see that the absolute error has increased! it means the coefficients have not optimized by MCMC and my method does not work properly!
I do appreciate it if someone help me with correct algorithm for MCMC for Normal distribution.
####First method: pymc3##########
import pymc3 as pm
import scipy.stats as st
import arviz as az
for row in range(350):
X_c1 = st.norm(loc=-0.06, scale=0.47).rvs(size=100)
with pm.Model() as model:
prior = pm.Normal('c1', mu=-0.06, sd=0.47) ###### prior #####weights
obs = pm.Normal('obs', mu=prior, sd=0.47, observed=X_c1) #######likelihood
step = pm.Metropolis()
trace_c1 = pm.sample(draws=30, chains=2, step=step, return_inferencedata=True)
###calculate the mean of output (posterior distribution)#################3
mean_c1= az.summary(trace_c1, var_names=["c1"], round_to=2).iloc[0][['mean']]
mean_c1 = mean_c1.to_numpy()
Acceleration= (mean_C1*V)+ C2*(X- D) ######apply model
######second method in the given link##########
## Define the Likelihood P(x|p) - normal distribution
def likelihood(p):
return scipy.stats.norm.cdf(C1, loc=-0.06, scale=0.47)
def prior(p):
return scipy.stats.norm.pdf(p)
def acceptance_ratio(p, p_new):
# Return R, using the functions we created before
return min(1, ((likelihood(p_new) / likelihood(p)) * (prior(p_new) / prior(p))))
p = np.random.normal(C1, 0.47) # Initialzie a value of p
#### Define model parameters
n_samples = 790 ################# I HAVE NO IDEA HOW TO CHOOSE THIS VALUE????###########
burn_in = 99
lag = 2
##### Create the MCMC loop
for i in range(n_samples):
p_new = np.random.random_sample() ###Propose a new value of p randomly from a normal distribution
R = acceptance_ratio(p, p_new) #### Compute acceptance probability
u= np.random.uniform(0, 1) ####Draw random sample to compare R to
if u < R: ##### If R is greater than u, accept the new value of p
p = p_new
if i > burn_in and i%lag == 0: #### Record values after burn in - how often determined by lag
results_1.append(p)

How to split data into train and test sets using torchvision.datasets.Imagefolder?

In my custom dataset, one kind of image is in one folder which torchvision.datasets.Imagefolder can handle, but how to split the dataset into train and test?
You can use torch.utils.data.Subset to split your ImageFolder dataset into train and test based on indices of the examples.
For example:
orig_set = torchvision.datasets.Imagefolder(...) # your dataset
n = len(orig_set) # total number of examples
n_test = int(0.1 * n) # take ~10% for test
test_set = torch.utils.data.Subset(orig_set, range(n_test)) # take first 10%
train_set = torch.utils.data.Subset(orig_set, range(n_test, n)) # take the rest

Tensorflow Deprecation Warning

I am trying to create a convolutional neural network for image classification using one of the open access github codes. I have two classes of images. But, when I start running the one part of the code I keep getting this error
/Users/user/anaconda/envs/tensorflow/lib/python3.5/site-packages/ipykernel/__main__.py:46: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
This is the part of code that has error (although the origin of this error are probably somewhere else, my intuition tells me that it lies in the labelling of images, but I am not sure how to fix that, I tried relabelling multiple times, nothing worked to fix this).
def print_test_accuracy(show_example_errors=False,
show_confusion_matrix=False):
# Number of images in the test-set.
num_test = len(test_images)
# Allocate an array for the predicted classes which
# will be calculated in batches and filled into this array.
cls_pred = np.zeros(shape=num_test, dtype=np.int)
# Now calculate the predicted classes for the batches.
# We will just iterate through all the batches.
# There might be a more clever and Pythonic way of doing this.
# The starting index for the next batch is denoted i.
i = 0
while i < num_test:
# The ending index for the next batch is denoted j.
j = min(i + test_batch_size, num_test)
# Get the images from the test-set between index i and j.
images = test_images[i:j, :]
# Get the associated labels.
labels = test_labels[i:j, :]
# Create a feed-dict with these images and labels.
feed_dict = {x: images,
y_true: labels}
# Calculate the predicted class using TensorFlow.
cls_pred[i:j] = session.run(y_pred_cls, feed_dict=feed_dict)
# Set the start-index for the next batch to the
# end-index of the current batch.
i = j
# Convenience variable for the true class-numbers of the test-set.
cls_true = test_class_labels
# Create a boolean array whether each image is correctly classified.
correct = (cls_true == cls_pred)
# Calculate the number of correctly classified images.
# When summing a boolean array, False means 0 and True means 1.
correct_sum = sum(correct)
# Classification accuracy is the number of correctly classified
# images divided by the total number of images in the test-set.
acc = float(correct_sum) / num_test
# Print the accuracy.
msg = "Accuracy on Test-Set: {0:.1%} ({1} / {2})"
print(msg.format(acc, correct_sum, num_test))
# Plot some examples of mis-classifications, if desired.
if show_example_errors:
print("Example errors:")
plot_example_errors(cls_pred=cls_pred, correct=correct)
# Plot the confusion matrix, if desired.
if show_confusion_matrix:
print("Confusion Matrix:")
plot_confusion_matrix(cls_pred=cls_pred)
Try tf.equal:
correct = tf.equal(cls_pred, cls_true)
or, if it is a probability distribution rather than just the argmax already:
correct = tf.equal(tf.argmax(cls_pred, 1), tf.argmax(cls_true, 1))

What is the meaning of the GridSearchCV best_score_ attribute? (the value is different from the mean of the cross validation array)

I'm confused with the results, probably I'm not getting the concept of cross validation and GridSearch right. I had followed the logic behind this post:
https://randomforests.wordpress.com/2014/02/02/basics-of-k-fold-cross-validation-and-gridsearchcv-in-scikit-learn/
argd = CommandLineParser(argv)
folder,fname=argd['dir'],argd['fname']
df = pd.read_csv('../../'+folder+'/Results/'+fname, sep=";")
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df['A']
explanatory_variable_columns.remove('A')
y = np.array([1 if e else 0 for e in response_variable_column])
X =df[list(explanatory_variable_columns)].as_matrix()
kf_total = KFold(len(X), n_folds=5, indices=True, shuffle=True, random_state=4)
dt=DecisionTreeClassifier(criterion='entropy')
min_samples_split_range=[x for x in range(1,20)]
dtgs=GridSearchCV(estimator=dt, param_grid=dict(min_samples_split=min_samples_split_range), n_jobs=1)
scores=[dtgs.fit(X[train],y[train]).score(X[test],y[test]) for train, test in kf_total]
# SAME AS DOING: cross_validation.cross_val_score(dtgs, X, y, cv=kf_total, n_jobs = 1)
print scores
print np.mean(scores)
print dtgs.best_score_
RESULTS OBTAINED:
# score [0.81818181818181823, 0.78181818181818186, 0.7592592592592593, 0.7592592592592593, 0.72222222222222221]
# mean score 0.768
# .best_score_ 0.683486238532
ADDITIONAL NOTE:
I ran it using another combination of the explanatory variables (using only some of them) and I got the inverse problem. Now the .best_score_ is higher than all the values in the cross validation array.
# score [0.74545454545454548, 0.70909090909090911, 0.79629629629629628, 0.7407407407407407, 0.64814814814814814]
# mean score 0.728
# .best_score_ 0.802752293578
The code is confusing several things.
dtgs.fit(X[train_],y[train_]) does internal 3-fold cross-validation for every parameter combination from param_grid, producing a grid of 20 results, which you can open by calling dtgs.grid_scores_.
[dtgs.fit(X[train_],y[train_]).score(X[test],y[test]) for train_, test in kf_total] Therefore this line fits grid search five times and then takes its score using 5-Fold cross validation. The result is the array of scores of 5-Fold validation.
And when you call dtgs.best_score_ you get the best score in the grid of the results of 3-fold validation of hyperparameters for the last fit (of 5).

Understanding a passage in the Paper about VGGNet

I don't understand a passage in the article about the VGGNet. Maybe someone can help.
In my opinion, the number of weights in a convolutional layer is
p=w*h*d*n+n
where w is the width of the filters, h the height of the filters, d the depth of the filters and n the num of the filters.
In the article the following is written:
assuming that both the input and the output of a three-layer 3 × 3 onvolution stack has C channels, the stack is parametrised by 3*(3^2*C^2) = 27C^2
weights; at the same time, a single 7 × 7 conv. layer would require 7^2*C^2 = 49C^2 parameters.
I do not understand, what is meant by channels here, and why this formula is used.
Can someone explain this to me?
Thanks in advance.
Your intuition is correct; we just need to unpack their explanation a bit. For the first case:
w = 3 # filter width
h = 3 # filter height
d = C # filter depth (number of channels is same as number of input filters; eg RGB is C=3)
n = C # number of output filters/channels
This then makes whdn = 9C^2 parameters. Then, they also say there are three of these stacked, so thats 27C^2.
For a single 7x7 filter, then it's all the same 7x7xCxCx1.
The final difference is that you add n once more at the end in your original post; that is the bias terms, which in VGG they skip (many people skip bias terms; their value is debatable in some settings).

Resources