Tensorflow: My loss function produces huge number - image-processing

I'm trying image inpainting using a NN with weights pretrained using denoising autoencoders. All according to https://papers.nips.cc/paper/4686-image-denoising-and-inpainting-with-deep-neural-networks.pdf
I have made the custom loss function they are using.
My training set is a batch of overlapping patches (196x32x32) of an image. My input is the corrupted batch of patches, and the output should be the cleaned one.
Part of my loss function is
dif_y = tf.subtract(y_xi,y_)
dif_norm = tf.norm(dif_y, ord = 'euclidean', axis = (1,2))
Where y_xi (196 x 1 x 3072) is the reconstructed clean image and y_ (196 x 1 x 3072) is the real clean image. So in effect I take the difference between each reconstructed patch and its clean version and sum up all those differences, which I suppose is why the number is so big.
train_step = tf.train.AdamOptimizer().minimize(loss)
The loss value starts at around 3*10^7 and after about 200 iterations (I loop for 1000) it converges to a value very close to the starting one. So my output image ends up miles away from the original.
Edit: starts at 3.02391e+07 and converges to 3.02337e+07
Is there any way my loss value is correct? If so, how can I dramatically reduce it?
Thanks
Edit 2: My loss function
dif_y = tf.subtract(y,y_)
dif_norm = tf.norm(dif_y, ord = 'euclidean', axis = (1,2))
sqr_norm = tf.square(dif_norm)
prod = tf.multiply(sqr_norm,0.5)
sum_norm2 = tf.reduce_sum(prod,0)
error_1 = tf.divide(sum_norm2,196)

Just for the record, in case anyone else has a similar problem: remember to normalize your data! I was subtracting values in the range [0,1] from values in the range [0,255]. A very noobish mistake, and I learned it the hard way!
Input values / 255
Expected values / 255
Problem solved.
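For completeness, a minimal sketch of that normalization, assuming the patches arrive as uint8 NumPy arrays (the function name is just illustrative):
import numpy as np

def normalize_patches(x_batch, y_batch):
    # Scale both the corrupted inputs and the clean targets to [0, 1]
    # so the network's output range matches the target range.
    return x_batch.astype(np.float32) / 255.0, y_batch.astype(np.float32) / 255.0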

sum_norm2 = tf.reduce_sum(prod,0) - I don't think this is doing what you want it to do.
Say y and y_ hold values for 500 images and you have 10 labels, giving a 500x10 matrix. tf.reduce_sum(prod, 0) collapses the batch axis, so for each label you get the sum of the error over all 500 images, and summing those again gives a single value that grows with both the batch size and the number of labels.
I don't think that is what you want - the sum of the error over every image and label. Probably what you want is the average; at least in my experience that is what works wonders for me. Additionally, I don't want a whole bunch of losses, one for each image, but instead one loss for the whole batch.
My preference is to use something like
loss = tf.reduce_mean(tf.reduce_mean(prod))
This has the additional upshot of making your Optimizer parameters simple. I haven't run into a situation yet where I have to use anything other than 1.0 for the learning_rate for GradientDescent, Adam, or MomentumOptimizer.
Now your loss will be independent of batch size or number of labels.
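To make that concrete, here is one way the loss from Edit 2 above could be rewritten with a mean instead of a sum - just a sketch, assuming y and y_ are the (196 x 1 x 3072) tensors from the question:
import tensorflow as tf

dif_y = tf.subtract(y, y_)                                # per-patch difference
dif_norm = tf.norm(dif_y, ord='euclidean', axis=(1, 2))   # one Euclidean norm per patch
loss = tf.reduce_mean(0.5 * tf.square(dif_norm))          # average over the batch of 196
train_step = tf.train.AdamOptimizer().minimize(loss)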

Related

Use of 'is_unbalance' parameter in Lightgbm

I am trying to use the 'is_unbalance' parameter when training a model for a binary classification problem where the positive class is approximately 3%. If I set 'is_unbalance', the binary log loss drops in the first iteration but then keeps increasing. I notice this behavior only when the parameter is enabled; otherwise there is a steady drop in log_loss. I'd appreciate your help on this. Thanks.
When you do not balance the classes for such an unbalanced dataset, the objective value will obviously keep dropping - and will probably reach the point of assigning all predictions to the majority class while still reporting a fantastic objective value.
Balancing the classes is necessary, but that doesn't mean you should stop at is_unbalance - you can also use scale_pos_weight, define a customized metric, or apply weights to your samples, like the following:
import pandas as pd
import lightgbm as lgb

WEIGHTS = y_train.value_counts(normalize=True).min() / y_train.value_counts(normalize=True)  # class weight = smallest class frequency / class frequency
TRAIN_WEIGHTS = pd.DataFrame(y_train.rename('old_target')).merge(WEIGHTS, how='left', left_on='old_target', right_on=WEIGHTS.index).target.values  # assumes y_train is named 'target'
train_data = lgb.Dataset(X_train, label=y_train, weight=TRAIN_WEIGHTS)
Also, optimizing other hyperparameters should solve the issue of increasing log_loss.
When you set is_unbalance: true, the algorithm will try to automatically balance the weight of the dominated (minority) label, using the pos/neg fraction in the train set.
If you want to change scale_pos_weight instead (it is 1 by default, which means the positive and negative labels are assumed equal), then for an unbalanced dataset you can use the following formula (based on this issue on the LightGBM repository) to set it correctly.
scale_pos_weight = number of negative samples / number of positive samples
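As a rough sketch of that formula in code, reusing the X_train/y_train names from the snippet above (the other parameter values are only illustrative):
import lightgbm as lgb

neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'scale_pos_weight': neg / pos,  # number of negative samples / number of positive samples
}
booster = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=200)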

How to cut down train error for a dense-matrix factorization task?

This problem may seem very different from the usual matrix factorization task widely used in recommender systems.
My problem is described as below:
Given a dense Matrix M
(approximately 55000*200; it may contain many negative elements, with 0.1 < abs(M[i][j]) < 1)
I have to find two matrices A (55000*1400) and B (1400*200) such that:
AB=M
However, we have some knowledge about A: there is another matrix C such that if C[i][j] = 0, then A[i][j] must be zero; otherwise (C[i][j] = 1) it can take any value.
In practice, I use machine learning to solve the problem; my loss function is:
|| (A * C) B − M ||_2, where * denotes the element-wise product and ||·||_2 the L2 (Frobenius) norm
I have tried Adagrad, momentum, Adadelta, and some other optimization methods, but the training error stays quite large and decreases only slowly (learning_rate = 0.1).
Update 1:
Well, actually I have a machine with 32 GB of memory and each epoch only takes about 2 minutes. I decompose an element of M only if its corresponding element in C is annotated as 1. In practice, I only decompose M[i][j] when C[i][j] = 1, and right after decomposing M[i][j] I compute the gradient for that element and update A[i, :] and B[:, j] at once. So the batch I use is very small - it contains just one element. I should also mention that C is a pretty sparse matrix: each row of C has only 2-3 elements annotated as 1.
After struggling with it for nearly half a month, I finally found the answer: I should update matrix A much more frequently - that is, in smaller, more frequent steps. I originally updated every element of A only once per epoch, far less often than B. After I changed the code so that A is updated at the same rate as B, a surprise happened: it worked pretty well!
Maybe smaller, more frequent steps really do help SGD work better? I can't really justify it mathematically.
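For illustration only (this is not the asker's code), here is a minimal NumPy sketch of full-batch gradient descent on the loss as written in the question, with A and B updated in the same step; the learning rate is a placeholder:
import numpy as np

def masked_step(A, B, M, C, lr=0.1):
    # One gradient step on 0.5 * ||(A*C) @ B - M||^2, where C masks the entries of A.
    R = (A * C) @ B - M          # residual over all entries of M
    grad_A = (R @ B.T) * C       # masked entries of A get zero gradient
    grad_B = (A * C).T @ R
    A -= lr * grad_A             # update A ...
    B -= lr * grad_B             # ... and B at the same rate, every step
    return A, B, 0.5 * np.sum(R ** 2)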

TensorFlow custom model optimizer returning NaN. Why?

I want to learn optimal weights and exponents for a custom model I've created:
weights = tf.Variable(tf.zeros([t.num_features, 1], dtype=tf.float64))
exponents = tf.Variable(tf.ones([t.num_features, 1], dtype=tf.float64))
# works fine:
pred = tf.matmul(x, weights)
# doesn't work:
x_to_exponent = tf.mul(tf.sign(x), tf.pow(tf.abs(x), tf.transpose(exponents)))
pred = tf.matmul(x_to_exponent, weights)
cost_function = tf.reduce_mean(tf.abs(pred-y_))
optimizer = tf.train.GradientDescentOptimizer(t.LEARNING_RATE).minimize(cost_function)
The problem is that whenever there is a zero value in x, the optimizer returns NaN for the weights. If I simply add 0.0001 when x = 0, then everything works as expected. But should I really have to do this? Shouldn't the TensorFlow optimizer have a way to handle this?
I've noticed Wikipedia lists no activation functions where x is raised to an exponent. Why isn't there an activation function that looks like the one in the image below?
For the above image I'd like my program to learn that the correct exponent is 0.5.
This is correct behavior on TensorFlow's part, since the gradient is infinity there (and many computations that should mathematically be infinity end up NaN due to indeterminate limits).
If you want to work around the problem, a slightly generalized version of gradient clipping may work. You can get the gradients via Optimizer.compute_gradients, manually clip them via something like
safe_grad = tf.clip_by_value(tf.select(tf.is_nan(grad), 0, grad), -lim, lim)
and then pass the clipped gradients to Optimizer.apply_gradients. The clipping will be necessary to not explode for values near the singularity, where the gradient may be arbitrarily large.
Warning: There is no guarantee that this will work, especially for deeper networks where the nans may pollute large swaths of the network.
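Putting those pieces together, a sketch in the same TF1-style API as the question (using tf.where, the later name of tf.select, and reusing cost_function and t.LEARNING_RATE from above; the clipping bound lim is a made-up value you would have to tune):
lim = 1.0  # hypothetical clipping bound
optimizer = tf.train.GradientDescentOptimizer(t.LEARNING_RATE)
grads_and_vars = optimizer.compute_gradients(cost_function)
safe_gvs = []
for grad, var in grads_and_vars:
    if grad is not None:
        grad = tf.where(tf.is_nan(grad), tf.zeros_like(grad), grad)  # zero out NaN gradients
        grad = tf.clip_by_value(grad, -lim, lim)                     # clip near the singularity
    safe_gvs.append((grad, var))
train_step = optimizer.apply_gradients(safe_gvs)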

Fitting 3 Normals using PyMC: wrong convergence on simple data

I wrote a PyMC model for fitting 3 Normals to data (similar to the one in this question).
import numpy as np
import pymc as mc
import matplotlib.pyplot as plt
n = 3
ndata = 500
# simulated data
v = np.random.randint( 0, n, ndata)
data = (v==0)*(10 + 1*np.random.randn(ndata)) \
     + (v==1)*(-10 + 2*np.random.randn(ndata)) \
     + (v==2)*3*np.random.randn(ndata)
# the model
dd = mc.Dirichlet('dd', theta=(1,)*n)
category = mc.Categorical('category', p=dd, size=ndata)
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)
means = mc.Normal('means', 0, 0.001, size=n)
@mc.deterministic
def mean(category=category, means=means):
    return means[category]

@mc.deterministic
def prec(category=category, precs=precs):
    return precs[category]

obs = mc.Normal('obs', mean, prec, value=data, observed=True)

model = mc.Model({'dd': dd,
                  'category': category,
                  'precs': precs,
                  'means': means,
                  'obs': obs})
M = mc.MAP(model)
M.fit()
# mcmc sampling
mcmc = mc.MCMC(model)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.means)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.precs)
mcmc.sample(100000,burn=0,thin=10)
tmeans = mcmc.trace('means').gettrace()
tsd = mcmc.trace('precs').gettrace()**-.5
plt.plot(tmeans)
#plt.errorbar(range(len(tmeans)), tmeans, yerr=tsd)
plt.show()
The distributions from which I sample my data are clearly overlapping, yet there are three well-separated peaks (see image below). Fitting 3 Normals to this kind of data should be trivial, and I would expect it to recover the means I sample from (-10, 0, 10) in 99% of MCMC runs.
Example of an outcome I would expect. This happened in 2 out of 10 cases.
Example of an unexpected result, which happened in 6 out of 10 cases. This is weird because around -5 there is no peak in the data, so I can't really see a serious local minimum that the sampling could get stuck in (going from (-5,-5) to (-6,-4) should improve the fit, and so on).
What could be the reason that (Adaptive Metropolis) MCMC sampling gets stuck in the majority of cases? What would be possible ways to improve the sampling procedure so that it doesn't?
So the runs do converge, but do not really explore the right range.
Update: Using different priors, I get the right convergence (appx. first picture) in 5/10 and the wrong one (appx. second picture) in the other 5/10. Basically, the lines changed are the ones below and removing the AdaptiveMetropolis step method:
precs = mc.Gamma('precs', alpha=2.5, beta=1, size=n)
means = mc.Normal('means', [-5, 0, 5], 0.0001, size=n)
Is there a particular reason you would like to use AdaptiveMetropolis? I imagine that vanilla MCMC wasn't working, and you got something like this:
Yea, that's no good. There are a few comments I can make. Below I used vanilla MCMC.
1. Your means prior precision, 0.001, is too large. It corresponds to a standard deviation of about 31 (= 1/sqrt(0.001)), which is too small: you are really forcing your means to stay close to 0. You want a much larger standard deviation to help explore the area. I decreased the precision to 0.00001 and got this:
Perfect. Of course, a priori I knew the true means were 50, 0, and -50. Usually we don't know this, so it's always a good idea to set that precision to be quite small.
2. Do you really think all the Normals line up at 0, as your means prior suggests? (You set the prior mean of all of them to 0.) The point of this exercise is to find them to be different, so your priors should reflect that. Something like:
means = mc.Normal('means', [-5,0,5], 0.00001, size=n)
more accurately reflects your true belief. This actually also helps convergence by suggesting to the MCMC where the means should be. Of course, you'd have to use your best estimate to come up with these numbers (I've naively chosen -5,0,5 here).
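As a quick numerical check of the precision-to-standard-deviation conversion used above (PyMC 2's Normal takes a precision tau, with sd = 1/sqrt(tau)):
import numpy as np

print(1 / np.sqrt(0.001))    # ~31.6  -> keeps the means pinned near 0
print(1 / np.sqrt(0.00001))  # ~316   -> loose enough to reach means like +/-50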
The problem is caused by a low acceptance rate for the category variable. See the answer I gave to a similar question.

How to handle a large number of features in machine learning

I developed an image processing program that identifies which number is shown given an image of digits. Each image is 27x27 pixels = 729 pixels. I take each R, G, and B value, which means I have 2187 variables from each image (+1 for the intercept = 2188 in total).
I used the below gradient descent formula:
Repeat {
    θj := θj − (α/m) ∑ (hθ(x) − y) xj
}
Here θj is the coefficient on variable j; α is the learning rate; hθ(x) is the hypothesis; y is the real value; and xj is the value of variable j. m is the number of training examples. hθ(x) and y are evaluated for each training example (that is what the summation sign is over). Further, the hypothesis is defined as:
hθ(x) = 1/(1 + e^(−z))
z = θ0 + θ1x1 + θ2x2 + θ3x3 + ... + θnxn
With this, and 3000 training images, I was able to train my program in just over an hour and when tested on a cross validation set, it was able to identify the correct image ~ 67% of the time.
I wanted to improve that so I decided to attempt a polynomial of degree 2.
However the number of variables jumps from 2188 to 2,394,766 per image! It takes me an hour just to do 1 step of gradient descent.
So my question is: how is such a vast number of variables handled in machine learning? On the one hand, I don't have enough space to even hold that many variables for each training example. On the other hand, I am currently storing 2188 variables per training example, but I have to perform O(n^2) work just to compute the products of each pair of variables (i.e., the degree-2 polynomial terms).
So any suggestions / advice is greatly appreciated.
1. Try some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images).
2. Vectorize your gradient descent - with most math libraries, or in MATLAB etc., it will run much faster (see the sketch below).
3. Parallelize the algorithm and run it on multiple CPUs (though your library for multiplying vectors may already support parallel computation).
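To illustrate the vectorization point, here is a minimal NumPy sketch of the update rule from the question, computed for all features at once instead of one θj at a time (all names are illustrative):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha):
    # X is m x (n+1) with a leading column of ones, y has length m, theta has length n+1.
    h = sigmoid(X @ theta)             # h_theta(x) for all m training examples at once
    grad = X.T @ (h - y) / len(y)      # (1/m) * sum over examples, for every theta_j
    return theta - alpha * grad        # simultaneous update of all coefficients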
Along with Jirka-x1's answer, I would first say that this is one of the key differences between working with image data and, say, text data for ML: high dimensionality.
Second... this is a duplicate, see How to approach machine learning problems with high dimensional input space?
