How does batching interact with the loss function in TensorFlow? - machine-learning

I'm training a multi-objective neural net in TensorFlow with my own loss function and can't find documentation regarding how batching interacts with that functionality.
For example, here is a snippet of my loss function, which takes the tensor/list of predictions and makes sure that their absolute values sum to no more than one:
def fitness(predictions,actual):
    absTensor = tf.abs(predictions)
    sumTensor = tf.reduce_sum(absTensor)
    oneTensor = tf.constant(1.0)
    isGTOne = tf.greater(sumTensor,oneTensor)
    def norm(): return predictions/sumTensor
    def unchanged(): return predictions
    predictions = tf.cond(isGTOne,norm,unchanged)
    etc...
But when I'm passing in a batch of estimates I feel like this loss function is normalising the whole set of inputs to sum to 1 at this point, rather than each individual set summing to 1. I.e.
[[.8,.8],[.8,.8]] -> [[.25,.25],[.25,.25]]
rather than the desired
[[.8,.8],[.8,.8]] -> [[.5,.5],[.5,.5]]
Can anybody clarify or put to rest my suspicions? If this is how my function is currently working, how do I change that?

You must specify a reduction axis for reduction ops; otherwise all axes are reduced. With the batch stored along the first dimension (axis 0), as is conventional, each example's predictions lie along axis 1, so that is the axis to sum over. Line 2 should look like this:
    sumTensor = tf.reduce_sum(absTensor, 1)
After you make that change you will run into another problem. sumTensor is no longer a scalar, so it no longer makes sense as a condition for tf.cond (what would it mean to branch per entry of a batch?). What you really want is tf.select (renamed tf.where in TensorFlow 1.0), since you don't want to branch logic per batch entry. The per-example sum also needs a trailing dimension so that the division broadcasts row by row. Like this:
    isGTOne = tf.greater(sumTensor, oneTensor)
    norm = predictions / tf.expand_dims(sumTensor, 1)
    predictions = tf.select(isGTOne, norm, predictions)
But, looking at this now, I wouldn't even bother conditionally normalizing the entries. Since you are operating at the granularity of a batch now, I don't think you gain any performance from normalizing the entries of a batch one at a time, especially since a division is not expensive. Note that the unconditional version rescales every row, including rows whose absolute sum is already below one, so keep the tf.select version if those rows must stay untouched. Otherwise you might as well just do:
def fitness(predictions,actual):
    absTensor = tf.abs(predictions)
    sumTensor = tf.reduce_sum(absTensor, 1)
    predictions = predictions / tf.expand_dims(sumTensor, 1)
    etc...
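To see the difference between reducing over everything and reducing per example without running a TensorFlow graph, here is a minimal plain-Python sketch of the same arithmetic. The function names normalize_rows and normalize_whole_batch are made up for illustration and are not TensorFlow calls; the batch is assumed to be a list of rows, one row per example.

```python
def normalize_rows(batch):
    """Scale each row on its own so that its absolute values sum to at most 1."""
    result = []
    for row in batch:
        total = sum(abs(v) for v in row)
        if total > 1.0:
            result.append([v / total for v in row])
        else:
            # Rows that already satisfy the constraint are left unchanged.
            result.append(list(row))
    return result


def normalize_whole_batch(batch):
    """What a reduction over all axes does: one sum for the entire batch."""
    total = sum(abs(v) for row in batch for v in row)
    if total > 1.0:
        return [[v / total for v in row] for row in batch]
    return [list(row) for row in batch]


batch = [[0.8, 0.8], [0.8, 0.8]]
per_row = normalize_rows(batch)        # every entry ends up close to 0.5
whole = normalize_whole_batch(batch)   # every entry ends up close to 0.25
```

The per-row version matches the behaviour asked for in the question; the whole-batch version reproduces the suspected problem.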
Hope that helps!

Related

Decision tree split implementation

I am doing this as part of a university assignment, but I can't find any resources online on how to implement it correctly.
I have read a lot of material on the metrics that define an optimal split (entropy, Gini and others), so I understand how to choose the optimal value of a feature for splitting the training set into left and right nodes.
What I don't get is the complexity of the implementation. We also have to choose the optimal feature, which means that computing the optimal value at each node takes O(n^2). That is bad, considering real ML datasets are shaped about 10^2 x 10^6, which is really expensive computationally.
Am I missing some approach that could reduce the complexity?
I currently have this baseline implementation for choosing the best feature and value to split on, but I really want to make it better:
for f_idx in range(X_subset.shape[1]):
    sorted_values = X_subset.iloc[:, f_idx].sort_values()
    for v in sorted_values[self.min_samples_split - 1 : -self.min_samples_split + 1]:
        y_left, y_right = self.make_split_only_y(f_idx, v, X_subset, y_subset)
        # score every candidate, including the first one, before comparing
        G = calc_g(y_subset, y_left, y_right)
        # accept the first candidate unconditionally, later ones only if they improve G
        if threshold is None or G < tr_G:
            threshold = v
            feature_idx = f_idx
            tr_G = G
return feature_idx, threshold
Since no one answered, here is what I found out.
Firstly, yes, this task is very computationally intensive. However, several tricks can reduce the number of splits you need to perform to "grow a tree".
This is especially important since you don't really want a giant overfitted tree, which has little value. What matters more is getting a weak model that can be combined with others in some sort of ensembling technique.
As for regularization tricks, here are a couple I used myself:
limit the maximum depth of the tree
limit the minimum number of items in a node
limit the maximum number of leaves in the tree
limit the minimum quality change in the split criterion after performing an optimal split
For the algorithmic part, there is a smarter way to build a tree. If you do it as in the code I posted earlier, time complexity will be around O(h * N^2 * D), where h is the height of the tree, N is the number of samples and D is the number of features. To work around this, there are several approaches, which I didn't personally code but read about:
Use dynamic programming to accumulate statistics per feature, so you don't have to recalculate them for every split
Use data binning and bucket sort for O(n) sorting
Source of info: https://ml-handbook.ru/chapters/decision_tree/intro
(use Google Translate, since the website is in Russian)
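The idea of accumulating statistics instead of recomputing them can be sketched as follows. This is a minimal illustration under assumptions, not the method from the linked handbook: it assumes a classification task scored with weighted Gini impurity, and the names gini and best_split_for_feature are hypothetical. The feature is sorted once, then one sample at a time is moved from the right-hand class counts to the left-hand ones, so each candidate threshold costs work proportional to the number of classes rather than to the number of samples.

```python
from collections import Counter


def gini(counts, n):
    """Gini impurity from a class -> count mapping over n samples."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split_for_feature(values, labels, min_samples=1):
    """Return (threshold, weighted_gini) for the best split on one feature."""
    n = len(values)
    # Sort sample indices by feature value once: O(n log n).
    order = sorted(range(n), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)
    best_threshold, best_score = None, float("inf")
    for pos in range(n - 1):
        i = order[pos]
        # Move sample i from the right node to the left node.
        left[labels[i]] += 1
        right[labels[i]] -= 1
        n_left, n_right = pos + 1, n - pos - 1
        next_value = values[order[pos + 1]]
        # No threshold can separate two equal feature values.
        if values[i] == next_value:
            continue
        if n_left < min_samples or n_right < min_samples:
            continue
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best_score:
            best_threshold = (values[i] + next_value) / 2.0
            best_score = score
    return best_threshold, best_score


threshold, score = best_split_for_feature([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

Running this once per feature gives a per-node cost of roughly O(D * N log N) instead of rescoring every candidate from scratch.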

How to use tf.metrics.accuracy?

I want to use tf.metrics.accuracy to track the accuracy of my predictions, but I am unsure of how to use the update_op (acc_update_op below) that the function returns:
accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)
I was thinking that adding it to tf.GraphKeys.UPDATE_OPS would make sense, but I am not sure how to do this.
tf.metrics.accuracy is one of the many streamed metric operations in TensorFlow (another is tf.metrics.recall). Upon creation, two variables (count and total) are created to accumulate all incoming results into one final outcome. The first returned value is a tensor that computes total / count. The second returned value is a stateful op that updates these variables. Streamed metric functions are useful when evaluating the performance of a classifier over multiple batches of data. A quick example of use:
# building phase
with tf.name_scope("streaming"):
    accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)
test_fetches = {
    'accuracy': accuracy,
    'acc_op': acc_update_op
}
# when testing the classifier
with tf.name_scope("streaming"):
    # clear counters for a fresh evaluation
    sess.run(tf.local_variables_initializer())
for _i in range(n_batches_in_test):
    fd = get_test_batch()
    outputs = sess.run(test_fetches, feed_dict=fd)
print("Accuracy:", outputs['accuracy'])
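To make the bookkeeping behind a streamed metric concrete, here is a plain-Python sketch of the two-variable accumulation described above. It is an illustration only, not TensorFlow's implementation; the class StreamingAccuracy and its method names are invented for this example.

```python
class StreamingAccuracy:
    """Accumulates accuracy over several batches using two counters."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Plays the role of re-initializing the local variables.
        self.total = 0.0  # number of correct predictions seen so far
        self.count = 0.0  # number of predictions seen so far

    def update(self, labels, predictions):
        # Plays the role of the update op: changes state, returns the new value.
        self.total += sum(1 for l, p in zip(labels, predictions) if l == p)
        self.count += len(labels)
        return self.result()

    def result(self):
        # Plays the role of the value tensor: reads state without changing it.
        return self.total / self.count if self.count else 0.0


metric = StreamingAccuracy()
first = metric.update([1, 0], [1, 1])   # one of two correct
second = metric.update([1, 1], [1, 1])  # three of four correct overall
```

Reading result() never changes the counters, which is why the update has to be run explicitly for every batch.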
I was thinking that adding it to tf.GraphKeys.UPDATE_OPS would make sense, but I am not sure how to do this.
That would not be a good idea unless you are only using the UPDATE_OPS collection for testing purposes. Usually, the collection will already have certain control operations for the training phase (such as moving batch normalization parameters) that are not meant to be run alongside the validation phase. It may be best to either keep them in a new collection or add these operations to the fetch dictionary manually.

How do I perform a differentiable operation selection in TensorFlow?

I am trying to build a neural network that selects a mathematical operation based on a scalar input. The operation is selected from the softmax output produced by the network and is then applied to the scalar input to produce the final output. So far I've come up with applying argmax and one_hot to the softmax output to produce a mask, which is then applied to the concatenated matrix of values from all the possible operations (as shown in the pseudocode below). The issue is that neither argmax nor one_hot appears to be differentiable. I am new to this, so any help would be highly appreciated. Thanks in advance.
#perform softmax
logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)
#perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_3(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)
#create a mask
argmax = tf.argmax(softmax, 1)
mask = tf.one_hot(argmax, num_of_operations)
#produce the input, by masking out those operation results which have not been selected
output = values * mask
I believe that this is not possible. This is similar to Hard Attention described in this paper. Hard attention is used in image captioning to allow the model to focus only on a certain part of the image at each step. Hard attention is not differentiable, but there are two ways around this:
1- Use Reinforcement Learning (RL): RL is made for training models that make decisions. Even though the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. As a simplified example, you can treat the loss as a penalty and send the node with the maximum value in the softmax layer a policy gradient proportional to the penalty, in order to decrease the score of the decision if it was bad (that is, it resulted in a high loss).
2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. So instead of:
output = values * mask
Use:
output = values * softmax
Now each operation's contribution shrinks toward zero the less weight the softmax assigns to it. This is easier to train than RL, but it won't work if you must completely remove the non-selected operations from the final result (set them exactly to zero).
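The contrast between the hard mask and the soft mix can be sketched in plain Python. This is a sketch under assumptions: the three operations applied to x and the logits are made up, and the weighted values are summed into a single output here, a step the snippets above leave open.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def hard_select(values, weights):
    """argmax + one-hot mask: only the highest-weighted value survives."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return values[best]


def soft_select(values, weights):
    """Soft attention: every value contributes in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights))


x = 3.0
values = [x + 1.0, x * 2.0, x ** 2]   # hypothetical operations on the input
weights = softmax([0.1, 2.0, 0.3])    # hypothetical network output

hard = hard_select(values, weights)
soft = soft_select(values, weights)
```

The soft output is a smooth function of the logits, so gradients can reach the parameters that produced them; the hard output changes only when the argmax flips.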
This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290

Tweaking the Loss before the Optimizer Step

I want to add an extra operation before running the AdamOptimizer operation on my loss, so as to help the model deal with repetitions in my data. The relevant code snippet looks something like this:
loss = tf.nn.softmax_cross_entropy_with_logits(logits=predLogits, labels=actLabels)
loss = tf.reshape(loss, [batchsize, -1])
repMask = tf.sqrt(tf.cast(tf.abs(tf.subtract(tf.cast(Y, tf.int64), tf.cast(X, tf.int64))), tf.float32))
lossPost = loss - repMask
train_step = tf.train.AdamOptimizer(LR).minimize(lossPost)
So, in other words, instead of minimizing loss, I want AdamOptimizer to minimize its slightly tweaked version, which is lossPost. I then train the model in the usual way:
_ = sess.run([train_step], feed_dict=feed_dict)
I noticed that adding this workaround of minimizing lossPost instead of loss has no impact on the accuracy of the model. The model produces the exact same output with or without this workaround. It seems that it continues to optimize the original, unmodified loss. Why is this the case?
My original approach was to perform this tweak at the softmax_cross_entropy_with_logits step by using weighted_cross_entropy_with_logits instead, but there is an extra complication there, since there is an extra Vocabulary dimension (this is a character-level-style model). So I thought it would be easier to do this afterwards. As long as it's done prior to the optimization step, it should be doable, shouldn't it?
In your model it seems like X and Y are constants (that is, they depend only on the data). In this case repMask is also constant, as it is defined by
repMask = tf.sqrt(tf.cast(tf.abs(tf.subtract(tf.cast(Y, tf.int64), tf.cast(X, tf.int64))), tf.float32))
Hence loss and lossPost differ by a constant value, and this has no effect on the minimization process. It is like finding the x that minimizes x^2-1 versus the x that minimizes x^2-5: both x are the same, because the constant vanishes when the gradient is taken.
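A small numerical check makes this visible. The sketch below uses plain Python with a central-difference gradient. The function names are invented for illustration, and the weighted variant is only an assumption about what a tweak that does change the optimization might look like (a multiplicative factor rather than a subtracted constant).

```python
def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def loss(x):
    return x ** 2 - 1.0


def loss_minus_constant(x):
    # Same loss shifted by a data-dependent constant.
    return x ** 2 - 5.0


def loss_weighted(x):
    # A multiplicative weight, by contrast, rescales the gradient.
    return 3.0 * x ** 2


x0 = 1.5
g_plain = numerical_grad(loss, x0)
g_shifted = numerical_grad(loss_minus_constant, x0)
g_weighted = numerical_grad(loss_weighted, x0)
```

The first two gradients agree up to rounding, so any gradient-based optimizer takes identical steps on both losses; only the weighted version produces a different update.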

How can I choose the cost and gamma parameters in libsvm (MATLAB) to improve accuracy?

I use libsvm to classify a database that contains 1000 labels. I am new to libsvm and I am having trouble choosing the parameters c and g to improve performance. First, here is the program that I use to set the parameters:
bestcv = 0;
for log2c = -1:3,
    for log2g = -4:1,
        cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
        cv = svmtrain(yapp, xapp, cmd);
        if (cv >= bestcv),
            bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
        end
        fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
    end
end
As a result, this program gives c = 8 and g = 2, and when I use these values of c and g, I get an accuracy rate of 55%. For classification, I use one-against-all SVMs:
numLabels=max(yapp);
numTest=size(ytest,1);
%# train one-against-all models
model = cell(numLabels,1);
for k=1:numLabels
    model{k} = svmtrain(double(yapp==k),xapp, ' -c 1000 -g 10 -b 1 ');
end
%# get probability estimates of test instances using each model
prob_black = zeros(numTest,numLabels);
for k=1:numLabels
    [~,~,p] = svmpredict(double(ytest==k), xtest, model{k}, '-b 1');
    prob_black(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
%# predict the class with the highest probability
[~,pred_black] = max(prob_black,[],2);
acc = sum(pred_black == ytest) ./ numel(ytest) %# accuracy
The problem is that I need to change these parameters to increase performance. For example, when I arbitrarily set c = 10000 and g = 100, I got a better accuracy rate: 70%.
Please, I need help: how can I set these parameters (c and g) to find the optimum accuracy rate? Thank you in advance.
Hyperparameter tuning is a nontrivial problem in machine learning. The simplest approach is what you've already implemented: define a grid of values, and evaluate the model on the grid until you find some optimal combination. A key assumption is that the grid itself is a good approximation of the surface: that it's fine enough to not miss anything important, but not so fine that you waste time computing values that are essentially the same as neighboring values. I'm not aware of any method to know ahead of time, in general, how fine a grid is necessary. As an illustration: imagine that the global optimum is at $(5,5)$ and the function is basically flat elsewhere. If your grid is $(0,0),(0,10),(10,10),(10,0)$, you'll miss the optimum completely. Likewise, if the grid is $(0,0), (-10,-10),(-10,0),(0,-10)$, you'll never be anywhere near the optimum. In both cases, you have no hope of finding the optimum itself.
Some rules of thumb exist for SVM with RBF kernels, though: a grid of $\gamma\in\{2^{-15},2^{-14},...,2^5\}$ and $C \in \{2^{-5}, 2^{-4},...,2^{15}\}$ is one such recommendation.
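A power-of-two grid like this is easy to generate programmatically. The sketch below is plain Python rather than MATLAB and makes several assumptions: toy_score is a made-up stand-in for the cross-validation accuracy that libsvm would return, with an arbitrary peak at C = 2^7 and gamma = 2^-3, and the helper names log2_grid and grid_search are hypothetical.

```python
import math
from itertools import product


def log2_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs whose base-2 exponents come from the given ranges."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, gamma_exponents)]


def toy_score(c, gamma):
    """Hypothetical stand-in for cross-validation accuracy (higher is better)."""
    return -((math.log2(c) - 7.0) ** 2) - (math.log2(gamma) + 3.0) ** 2


def grid_search(score, grid):
    """Evaluate every grid point and keep the best one."""
    return max(grid, key=lambda point: score(*point))


# Exponent ranges taken from the rule of thumb above.
grid = log2_grid(range(-5, 16), range(-15, 6))
best_c, best_gamma = grid_search(toy_score, grid)
```

With 21 exponents per axis this is 441 model fits, each of which would be a full cross-validation run in practice, which is why coarser steps (for example every second exponent) are often tried first.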
If you found a better solution outside of the range of grid values that you tested, this suggests you should define a larger grid. But larger grids take more time to evaluate, so you'll either have to commit to waiting a while for your results, or move to a more efficient method of exploring the hyperparameter space.
Another alternative is random search: define a "budget" of the number of SVMs that you want to try out, and generate that many random tuples to test. This approach is mostly just useful for benchmarking purposes, since it's entirely unintelligent.
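Random search with a fixed budget can be sketched in a few lines of plain Python. The assumptions here are mine: exponents are sampled uniformly in log2 space over the same ranges as the rule-of-thumb grid, and toy_score is again a made-up stand-in for cross-validation accuracy rather than a real libsvm call.

```python
import math
import random


def toy_score(c, gamma):
    """Hypothetical stand-in for cross-validation accuracy (higher is better)."""
    return -((math.log2(c) - 7.0) ** 2) - (math.log2(gamma) + 3.0) ** 2


def random_search(score, budget, c_range=(-5, 15), gamma_range=(-15, 5), seed=0):
    """Try `budget` random (C, gamma) tuples and return the best one found."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best, best_score = None, float("-inf")
    for _ in range(budget):
        # Sample the exponents uniformly, then map back to C and gamma.
        c = 2.0 ** rng.uniform(*c_range)
        gamma = 2.0 ** rng.uniform(*gamma_range)
        s = score(c, gamma)
        if s > best_score:
            best, best_score = (c, gamma), s
    return best, best_score


best_small, score_small = random_search(toy_score, budget=50)
best_large, score_large = random_search(toy_score, budget=200)
```

Because each trial is independent of the others, the loop body can be distributed across workers without any coordination.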
Both grid search and random search have the advantage of being stupidly easy to implement in parallel.
Better options fall in the domain of global optimization. Marc Claesen et al. have devised the Optunity package, which uses particle swarm optimization. My research focuses on refinements of the Efficient Global Optimization algorithm (EGO), which builds up a Gaussian process as an approximation of the hyperparameter response surface and uses that to make educated predictions about which hyperparameter tuples are most likely to improve upon the current best estimate.
Imagine that you've evaluated the SVM at some hyperparameter tuple $(\gamma, C)$ and it has some out-of-sample performance metric $y$. An advantage of EGO-inspired methods is that they assume the values $y^*$ near $(\gamma,C)$ will be "close" to $y$, so we don't necessarily need to spend time exploring nearby tuples, especially if $y-y_{min}$ is very large (where $y_{min}$ is the smallest $y$ value we've discovered). EGO will identify and evaluate the SVM at points where it estimates there is a high probability of improvement, so it will move intelligently through the hyperparameter space: in the ideal case, it will skip over regions of low performance in favor of focusing on regions of high performance.
