Fitting 3 Normals using PyMC: wrong convergence on simple data - mcmc

I wrote a PyMC model for fitting 3 Normals to data using (similar to the one in this question).
import numpy as np
import pymc as mc
import matplotlib.pyplot as plt
n = 3
ndata = 500
# simulated data
v = np.random.randint( 0, n, ndata)
data = (v==0)*(10+ 1*np.random.randn(ndata)) \
+ (v==1)*(-10 + 2*np.random.randn(ndata)) \
+ (v==2)*3*np.random.randn(ndata)
# the model
dd = mc.Dirichlet('dd', theta=(1,)*n)
category = mc.Categorical('category', p=dd, size=ndata)
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)
means = mc.Normal('means', 0, 0.001, size=n)
#mc.deterministic
def mean(category=category, means=means):
return means[category]
#mc.deterministic
def prec(category=category, precs=precs):
return precs[category]
obs = mc.Normal('obs', mean, prec, value=data, observed = True)
model = mc.Model({'dd': dd,
'category': category,
'precs': precs,
'means': means,
'obs': obs})
M = mc.MAP(model)
M.fit()
# mcmc sampling
mcmc = mc.MCMC(model)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.means)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.precs)
mcmc.sample(100000,burn=0,thin=10)
tmeans = mcmc.trace('means').gettrace()
tsd = mcmc.trace('precs').gettrace()**-.5
plt.plot(tmeans)
#plt.errorbar(range(len(tmeans)), tmeans, yerr=tsd)
plt.show()
The distributions from which I sample my data are clearly overlapping, yet there are 3 well distinct peaks (see image below). Fitting 3 Normals to this kind of data should be trivial and I would expect it to produce the means I sample from (-10, 0, 10) in 99% of the MCMC runs.
Example of an outcome I would expect. This happened in 2 out of 10 cases.
Example of an unexpected result that happened in 6 out of 10 cases. This is weird because on -5, there is no peak in the data so I can't really a serious local minimum that the sampling can get stuck in (going from (-5,-5) to (-6,-4) should improve the fit, and so on).
What could be the reason that (adaptive Metropolis) MCMC sampling gets stuck in the majority of cases? What would be possible ways to improve the sampling procedure that it doesn't?
So the runs do converge, but do not really explore the right range.
Update: Using different priors, I get the right convergence (appx. first picture) in 5/10 and the wrong one (appx. second picture) in the other 5/10. Basically, the lines changed are the ones below and removing the AdaptiveMetropolis step method:
precs = mc.Gamma('precs', alpha=2.5, beta=1, size=n)
means = mc.Normal('means', [-5, 0, 5], 0.0001, size=n)

Is there a particular reason you would like to use AdaptiveMetropolis? I imagine that vanilla MCMC wasn't working, and you got something like this:
Yea, that's no good. There are a few comments I can make. Below I used vanilla MCMC.
Your means prior variance, 0.001, is too big. This corresponds to a std deviation of about 31 ( = 1/sqrt(0.001) ), which is too small. You are really forcing your means to be close to 0. You want a much larger std. deviation to help explore the area. I decreased the value to 0.00001 and got this:
Perfect. Of course, apriori I knew the true means were 50,0,and -50. Usually we don't know this, so it's always a good idea to set that value to be quite small.
2. Do you really think all the normals line up at 0, like your mean prior suggests? (You set the mean of all of them to 0) The point of this exercise is to find them to be different, so your priors should reflect that. Something like:
means = mc.Normal('means', [-5,0,5], 0.00001, size=n)
more accurately reflects your true belief. This actually also helps convergence by suggesting to the MCMC where the means should be. Of course, you'd have to use your best estimate to come up with these numbers (I've naively chosen -5,0,5 here).

The problem is caused by a low acceptance rate for the category variable. See the answer I gave to a similar question.

Related

Confusion matrix subset of classes not working properly

I have searched for an answer to this question on the internet including suggestion when writing the title but still to no avail so hopefully someone can help!
I am trying to construct a confusion matrix using sci-kit learn. This comes after a keras model.
This is bizarre because i am having the following problem: For the training and test set of the original data... I can construct the confusion matrix as follows (please note this is a multi-label problem and so data has to be subset for the different labels.
The following works fine:
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1))
and the 6:18 etc... until all classes have been subset. The confusion matrix that forms as a result reflects the true outcome of the keras model..
The problem arises when i deploy the model on completely unseen data.
I deploy the model by calling model.predict() and get results as above. However, now I cannot subset confusion matrices in the same way.
The code cm=confusion_matrix etc...causes the output of the CM to be the wrong dimensions, even when specifying 0:6 etc..
I therefore used the code from above used but with the labels argument modification:
age[0,1,2,3,4]
organ[5,6,7,8]
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1), labels=age)
The FIRST label (1:5) works perfectly... However, the next labels do not! I dont get the right values in the confusion matrices and the matching is also incorrect for those that are in there.
To put this in to context: there are over 400 samples in the unseen test data.
model.predict shows very high classification and correct scores for most labels..
calling CM=ytest[:,4:8]etc, does indeed produce a 4x4 matrix, however there are like 5 values in there not 400, and those values that are in there are not correctly matching.
Also.. with the labels age being 012345, subsetting the ytest to 0:6 causes the correct confusion matrix to form (i am unsure as to why the 6 has to be included in the subset... nevertheless i have tried different combinations with the same issue!
I have searched high and low for this answer so would really appreciate some assistance as it is incredibly frustrating. any more code/information i can provide i will be happy to!!
Many thanks!
This is happening because you are trying to subset the generated confusion matrix, but you actually have to generate a new confusion matrix manually with the specified class labels. If you classes A, B, C you will get a 3X3 matrix. If you want to create matrix focusing only on class A, the other classes will become the false class, but the false positive and false negative will change and hence you cannot just sample the initial matrix.
This is how you show actually do it
import matplotlib.pytplot as plt
import seaborn as sns
def generate_matrix(y_true, predict, class_name):
TP, FP, FN, TN = 0, 0, 0, 0
for i in range(len(y_true)):
if y_true[i] == class_name:
if y_true[i] == predict[i]:
TP += 1
else:
FN += 1
else:
if y_true[i] == predict[i]:
TN += 1
else:
FP += 1
return np.array([[TP, FP],
[FN, TN]])
# Plot new matrix
matrix = generate_matrix(actual_labels,
predicted_labels,
class_name = 'A')
This will generate a confusion matrix for class A.

Low confidence score in SVC for example from training set

Here is my code for SVC classifier.
vectorizer = TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(training_data)
classifier_linear = svm.LinearSVC()
clf = CalibratedClassifierCV(classifier_linear)
linear_svc_model = clf.fit(train_vectors, train_labels)
training_data here is a list of english sentences and train_lables are the labels associated. I do the usual stopwords removal and some preprocessing before creating final version of training_data. Here is how my testing code:
test_lables = ["no"]
test_vectors = vectorizer.transform(test_lables)
prediction_linear = clf.predict_proba(test_vectors)
counter = 0
class_probability = {}
lables = []
for item in train_labels:
if item in lables:
continue
else:
lables.append(item)
for val in np.nditer(prediction_linear):
new_val = val.item(0)
class_probability[lables[counter]] = new_val
counter = counter + 1
sorted_class_probability = sorted(class_probability.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_class_probability)
Now when I run the code with a phrase that is already there in the training set (a word 'no' in this case), it identifies properly, but the confidence score is even below .9. The output is as follows:
[('no', 0.8474342514152964), ('hi', 0.06830103628879058), ('thanks', 0.03070201906552546), ('confused', 0.02647134535600733), ('ok', 0.015857384248465656), ('yes', 0.005961945963546264), ('bye', 0.005272017662368208)]
When I am studying online, I have seen that usually confidence score for data already in the training set is closer to 1 or almost 1 and rest of them are really negligible. What can I do to get better confidence score? Should I be worried that if I add more classes to it, the confidence score will further dip and it will be difficult for me to surely point out one standout class?
As long as your scores help you classify your inputs correctly, you shouldn't worry at all. If anything, if your confidence on the input already in your training data is too high, that probably means your method has overfit to the data, and cannot generalize to the unseen data.
However, you can tune the complexity of your method by changing the penalization parameters. In the case of a LinearSVC, you have both the penalty and the C parameter. Try different values of those two and observe the effect. Make sure you also observe the effect on an unseen test set.
Just a not that the values of C should be in exponential space, eg. [0.001, 0.01, 0.1, 1, 10, 100, 1000] for you to see meaningful effects.
The SGDClassifier may be relevant to your case if you're interested in such linear models and tuning your parameters.

How to cut down train error for a dense-matrix factorization task?

This problem may seem very different from the normal Matrix Factorization task which is widely used in recommender system.
My problem is described as below:
Given a dense Matrix M
(approximately 55000*200, may contain much negative elements, 0.1< abs(M[i][j]) <1 )
I have to find two matrix A(55000*1400) and B(1400*200), such that:
AB=M
However, we have some knowledge about A. We have another Matrix C, if C[i][j] = 0, then A[i][j] must be zero, otherwise it can be any value(C[i][j] = 1).
In my practice , I use machine learning to solve the problem, my loss function is:
||(A*C)(element-wise product) x B - M ||(2)(L2 norm)
I have tried adagrad,momentum,adadelta and some other optimization method, but the train error is pretty much and is cut down slowly (learning_rate = 0.1)
UP1:
Well, actually I've got a machine with 32GB memory and I only need 2 min for each epoch. I decompose an element in M only if its corresponding element in C is anotated as 1. Practically , I only decompose M[i][j] when C[i][j] = 1, and after I decompose M[i][j], I solve the gradient for M[i][j] to update A[i : ] and B[ : j] at once. So, the batch I used is too small--just contain one element. Also , I have to mention that C is a pretty sparse matrix. For each line in C, there is only 2-3 elements that are anotated as 1.
After struggling with it for nearly half month, I finally got the answer: I should update the matrix A much more quickly, say, update the parameters at a more smaller step. I originally updated every element in A only once per epoch, much less than B. However, after I changed the code to let A be updated at the same speed as B, then surprise happened: it worked pretty well!
Maybe smaller steps will help SGD work better? I don't really believe it mathematically.

Tensorflow: My loss function produces huge number

I'm trying image inpainting using a NN with weights pretrained using denoising autoencoders. All according to https://papers.nips.cc/paper/4686-image-denoising-and-inpainting-with-deep-neural-networks.pdf
I have made the custom loss function they are using.
My set is a batch of overlapping patches (196x32x32) of an image. My input are the corrupted batches of the image, and the output should be the cleaned ones.
Part of my loss function is
dif_y = tf.subtract(y_xi,y_)
dif_norm = tf.norm(dif_y, ord = 'euclidean', axis = (1,2))
Where y_xi(196 x 1 x 3072) is the reconstructed clean image and y_ (196 x 1 x 3072) is the real clean image. So I actually I subtract all images from their corrupted version and I sum all those differences. I think it is normal to be a very big number.
train_step = tf.train.AdamOptimizer().minimize(loss)
The loss value begins at around 3*10^7 and is converging after 200 runs (I loop for 1000) at a close value. So my output image will be miles away from the original.
Edit: starts at 3.02391e+07 and converges to 3.02337e+07
Is there any way my loss value is correct? If so, how can I dramatically reduce it?
Thanks
Edit 2: My loss function
dif_y = tf.subtract(y,y_)
dif_norm = tf.norm(dif_y, ord = 'euclidean', axis = (1,2))
sqr_norm = tf.square(dif_norm)
prod = tf.multiply(sqr_norm,0.5)
sum_norm2 = tf.reduce_sum(prod,0)
error_1 = tf.divide(sum_norm2,196)
Just for the record if anyone else has a similar problem: Remember to normalize your data! I was actually subtracting values in range [0,1] from values in range [0,255]. Very noobish mistake, I learned it the hard way!
Input values / 255
Expected values / 255
Problem solved.
sum_norm2 = tf.reduce_sum(prod,0) - I don't think this is doing what you want it to do.
Say y and y_ have values for 500 images and you have 10 labels for a 500x10 matrix. When tf.reduce_sum(prod,0) processes that you will have 1 value that is the sum of 500 values each which will be the sum of all values in the 2nd rank.
I don't think that is what you want, the sum of the error across each label. Probably what you want is the average, at least in my experience that is what works wonders for me. Additionally, I don't want a whole bunch of losses, one for each image, but instead one loss for the batch.
My preference is to use something like
loss = tf.reduce_mean ( tf.reduce_mean( prod ) )
This has the additional upshot of making your Optimizer parameters simple. I haven't run into a situation yet where I have to use anything other than 1.0 for the learning_rate for GradientDescent, Adam, or MomentumOptimizer.
Now your loss will be independent of batch size or number of labels.

Interpretation of a learning curve in machine learning

While following the Coursera-Machine Learning class, I wanted to test what I learned on another dataset and plot the learning curve for different algorithms.
I (quite randomly) chose the Online News Popularity Data Set, and tried to apply a linear regression to it.
Note : I'm aware it's probably a bad choice but I wanted to start with linear reg to see later how other models would fit better.
I trained a linear regression and plotted the following learning curve :
This result is particularly surprising for me, so I have questions about it :
Is this curve even remotely possible or is my code necessarily flawed?
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
If it is not, any hint to where I made a mistake?
Here's my code (Octave / Matlab) just in case:
Plot :
lambda = 0;
startPoint = 5000;
stepSize = 500;
[error_train, error_val] = ...
learningCurve([ones(mTrain, 1) X_train], y_train, ...
[ones(size(X_val, 1), 1) X_val], y_val, ...
lambda, startPoint, stepSize);
plot(error_train(:,1),error_train(:,2),error_val(:,1),error_val(:,2))
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')
Learning curve :
S = ['Reg with '];
for i = startPoint:stepSize:m
temp_X = X(1:i,:);
temp_y = y(1:i);
% Initialize Theta
initial_theta = zeros(size(X, 2), 1);
% Create "short hand" for the cost function to be minimized
costFunction = #(t) linearRegCostFunction(X, y, t, lambda);
% Now, costFunction is a function that takes in only one argument
options = optimset('MaxIter', 50, 'GradObj', 'on');
% Minimize using fmincg
theta = fmincg(costFunction, initial_theta, options);
[J, grad] = linearRegCostFunction(temp_X, temp_y, theta, 0);
error_train = [error_train; [i J]];
[J, grad] = linearRegCostFunction(Xval, yval, theta, 0);
error_val = [error_val; [i J]];
fprintf('%s %6i examples \r', S, i);
fflush(stdout);
end
Edit : if I shuffle the whole dataset before splitting train/validation and doing the learning curve, I have very different results, like the 3 following :
Note : the training set size is always around 24k examples, and validation set around 8k examples.
Is this curve even remotely possible or is my code necessarily flawed?
It's possible, but not very likely. You might be picking the hard to predict instances for the training set and the easy ones for the test set all the time. Make sure you shuffle your data, and use 10 fold cross validation.
Even if you do all this, it is still possible for it to happen, without necessarily indicating a problem in the methodology or the implementation.
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
Let's assume that your data can only be properly fitted by a 3rd degree polynomial, and you're using linear regression. This means that the more data you add, the more obviously it will be that your model is inadequate (higher training error). Now, if you choose few instances for the test set, the error will be smaller, because linear vs 3rd degree might not show a big difference for too few test instances for this particular problem.
For example, if you do some regression on 2D points, and you always pick 2 points for your test set, you will always have 0 error for linear regression. An extreme example, but you get the idea.
How big is your test set?
Also, make sure that your test set remains constant throughout the plotting of the learning curves. Only the train set should increase.
If it is not, any hint to where I made a mistake?
Your test set might not be large enough or your train and test sets might not be properly randomized. You should shuffle the data and use 10 fold cross validation.
You might want to also try to find other research regarding that data set. What results are other people getting?
Regarding the update
That makes a bit more sense, I think. Test error is generally higher now. However, those errors look huge to me. Probably the most important information this gives you is that linear regression is very bad at fitting this data.
Once more, I suggest you do 10 fold cross validation for learning curves. Think of it as averaging all of your current plots into one. Also shuffle the data before running the process.

Resources