I have a regression problem, but the cost function is different: The cost for an underestimate is higher than an overestimate. For example, if predicted value < true value, the cost will be 3*(true-predicted)^2; if predicted value > true value, the cost will be 1*(true-predicted)^2.
I'm thinking of using classical regression models such as linear regression, random forest etc. What modifications should I make to adjust for this cost function?
As I know, the ML API such as scikit-learn does not provide the functionality to directly modify the cost function. If I have to use these APIs, what can I do?
Any recommended reading?
You can use Tensorflow (or theano) for custom cost functions. The common linear regression implementation is here.
To find out how you can implement your custom cost function looking at a huber loss function implemented in tensorflow might help you. Here comes your custom cost function which you should replace in the linked code so instead of
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
in the linked code you'll have:
error = y_known - y_pred
condition = tf.less(error, 0)
overestimation_loss = 1 * tf.square(error)
underestimation_loss = 3 * tf.square(error)
cost = tf.reduce_mean(tf.where(condition, overestimation_loss, underestimation_loss))
Here when condition is true, error is lower than zero which means you y_known is smaller than y_pred so you'll have overestimation and so the tf.where statement will choose overestimation_loss otherwise underestimation loss.
The secret is that you'll compute both losses and choose where to use them using tf.where and condition.
Update:
If you want to use other libraries, if huber loss is implemented you can take a look to get ideas because huber loss is a conditional loss function similar to yours.
You can use asymmetric cost function to make your model overestimate or underestimate. You can replace cost function in this implementation with:
def acost(a): return tf.pow(pred-Y, 2) * tf.pow(tf.sign(pred-Y) + a, 2)
for more detail see this link
Related
Question: How should I define a custom loss function that matches with the default value of the learning rate?
It is obvious that defining a proper loss function is vital for training a neural network.
On the other hand, learning rate is an important parameter that must be set properly.
As I know, loss function and learning rate are tightly related because they directly determine new values of the network weights:
w_new = w_old - learning_rate * (gradient_of_loss w.r.t. w)
For example one may choose the following loss function:
loss_1 = || y_pred - y_true||^2_2
But another chooses the following:
loss_2 = (1/2)|| y_pred - y_true||^2_2
In many neural network/deep learning framework, the value of learning rate is set as default. So using default learning rate, the effective learning rate in latter loss function is half of the former. Please correct me if I'm wrong.
TensorFlow calls each of the inputs to a softmax a logit. They go on to define the softmax's inputs/logits as: "Unscaled log probabilities."
Wikipedia and other sources say that a logit is the log of the odds, and the inverse of the sigmoid/logistic function. I.e., if sigmoid(x) = p(x), then logit( p(x) ) = log( p(x) / (1-p(x)) ) = x.
Is there a mathematical or conventional reason for TensorFlow to call a softmax's inputs "logits"? Shouldn't they just be called "unscaled log probabilities"?
Perhaps TensorFlow just wanted to keep the same variable name for binary logistic regression (where it makes sense to use the term logit) and categorical logistic regression...
This question was covered a little bit here, but no one seemed bothered by the use of the word "logit" to mean "unscaled log probability".
Logit is nowadays used in ML community for any non-normalised probability distribution (basically anything that gets mapped to a probability distribution by a parameter-less transformation, like sigmoid function for a binary variable or softmax for multinomial one). It is not a strict mathematical term, but gained enough popularity to be included in TF documentation.
I use libsvm to classify a data base that contain 1000 labels. I am new in libsvm and I found a problem to choose the parameters c and g to improve performance. First, here is the program that I use to set the parameters:
bestcv = 0;
for log2c = -1:3,
for log2g = -4:1,
cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
cv = svmtrain(yapp, xapp, cmd);
if (cv >= bestcv),
bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
end
fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
end
end
as a result, this program gives c = 8 and g = 2 and when I use these values
c and g, I found an accuracy rate of 55%. for classification, I use svm one against all.
numLabels=max(yapp);
numTest=size(ytest,1);
%# train one-against-all models
model = cell(numLabels,1);
for k=1:numLabels
model{k} = svmtrain(double(yapp==k),xapp, ' -c 1000 -g 10 -b 1 ');
end
%# get probability estimates of test instances using each model
prob_black = zeros(numTest,numLabels);
for k=1:numLabels
[~,~,p] = svmpredict(double(ytest==k), xtest, model{k}, '-b 1');
prob_black(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
%# predict the class with the highest probability
[~,pred_black] = max(prob_black,[],2);
acc = sum(pred_black == ytest) ./ numel(ytest) %# accuracy
The problem is that I need to change these parameters to increase performance. for example, when I put randomly c = 10000 and g = 100, I found a better accuracy rate: 70%.
Please I need help, how can I set theses parameters ( c and g) so to find the optimum accuracy rate? thank you in advance
Hyperparameter tuning is a nontrivial problem in machine learning. The simplest approach is what you've already implemented: define a grid of values, and compute the model on the grid until you find some optimal combination. A key assumption is that the grid itself is a good approximation of the surface: that it's fine enough to not miss anything important, but not so fine that you waste time computing values that are essentially the same as neighboring values. I'm not aware of any method to, in general, know ahead of time how fine a grid is necessary. As illustration: imagine that the global optimum is at $(5,5)$ and the function is basically flat elsewhere. If your grid is $(0,0),(0,10),(10,10),(0,10)$, you'll miss the optimum completely. Likewise, if the grid is $(0,0), (-10,-10),(-10,0),(0,-10)$, you'll never be anywhere near the optimum. In both cases, you have no hope of finding the optimum itself.
Some rules of thumb exist for SVM with RBF kernels, though: a grid of $\gamma\in\{2^{-15},2^{-14},...,2^5\}$ and $C \in \{2^{-5}, 2^{-4},...,2^{15}\}$ is one such recommendation.
If you found a better solution outside of the range of grid values that you tested, this suggests you should define a larger grid. But larger grids take more time to evaluate, so you'll either have to commit to waiting a while for your results, or move to a more efficient method of exploring the hyperparameter space.
Another alternative is random search: define a "budget" of the number of SVMs that you want to try out, and generate that many random tuples to test. This approach is mostly just useful for benchmarking purposes, since it's entirely unintelligent.
Both grid search and random search have the advantage of being stupidly easy to implement in parallel.
Better options fall in the domain of global optimization. Marc Claeson et al have devised the Optunity package, which uses particle swarm optimization. My research focuses on refinements of the Efficient Global Optimization algorithm (EGO), which builds up a Gaussian process as an approximation of the hyperparameter response surface and uses that to make educated predictions about which hyperparameter tuples are most likely to improve upon the current best estimate.
Imagine that you've evaluated the SVM at some hyperparameter tuple $(\gamma, C)$ and it has some out-of-sample performance metric $y$. An advantage to EGO-inspired methods is that it assumes that the values $y^*$ nearby $(\gamma,C)$ will be "close" to $y$, so we don't necessarily need to spend time exploring those tuples nearby, especially if $y-y_{min}$ is very large (where $y_{min}$ is the smallest $y$ value we've discovered). EGO will identify and evaluate the SVM at points where it estimates there is a high probability of improvement, so it will intelligently move through the hyper-parameter space: in the ideal case, it will skip over regions of low performance in favor of focusing on regions of high performance.
My dataset has m features and n data points. Let w be a vector (to be estimated). I'm trying to implement gradient descent with stochastic update method. My minimizing function is least mean square.
The update algorithm is shown below:
for i = 1 ... n data:
for t = 1 ... m features:
w_t = w_t - alpha * (<w>.<x_i> - <y_i>) * x_t
where <x> is a raw vector of m features, <y> is a column vector of true labels, and alpha is a constant.
My questions:
Now according to wiki, I don't need to go through all data points and I can stop when error is small enough. Is it true?
I don't understand what should be the stopping criterion here. If anyone can help with this that would be great.
With this formula - which I used in for loop - is it correct? I believe (<w>.<x_i> - <y_i>) * x_t is my ∆Q(w).
Now according to wiki, I don't need to go through all data points and I can stop when error is small enough. Is it true?
This is especially true when you have a really huge training set and going through all the data points is so expensive. Then, you would check the convergence criterion after K stochastic updates (i.e. after processing K training examples). While it's possible, it doesn't make much sense to do this with a small training set. Another thing people do is randomizing the order in which training examples are processed to avoid having too many correlated examples in a raw which may result in "fake" convergence.
I don't understand what should be the stopping criterion here. If anyone can help with this that would be great.
There are a few options. I recommend trying as many of them and deciding based on empirical results.
difference in the objective function for the training data is smaller than a threshold.
difference in the objective function for held-out data (aka. development data, validation data) is smaller than a threshold. The held-out examples should NOT include any of the examples used for training (i.e. for stochastic updates) nor include any of the examples in the test set used for evaluation.
the total absolute difference in parameters w is smaller than a threshold.
in 1, 2, and 3 above, instead of specifying a threshold, you could specify a percentage. For example, a reasonable stopping criterion is to stop training when |squared_error(w) - squared_error(previous_w)| < 0.01 * squared_error(previous_w) $$.
sometimes, we don't care if we have the optimal parameters. We just want to improve the parameters we originally had. In such case, it's reasonable to preset a number of iterations over the training data and stop after that regardless of whether the objective function actually converged.
With this formula - which I used in for loop - is it correct? I believe (w.x_i - y_i) * x_t is my ∆Q(w).
It should be 2 * (w.x_i - y_i) * x_t but it's not a big deal given that you're multiplying by the learning rate alpha anyway.
Is it a good practice to use sigmoid or tanh output layers in Neural networks directly to estimate probabilities?
i.e the probability of given input to occur is the output of sigmoid function in the NN
EDIT
I wanted to use neural network to learn and predict the probability of a given input to occur..
You may consider the input as State1-Action-State2 tuple.
Hence the output of NN is the probability that State2 happens when applying Action on State1..
I Hope that does clear things..
EDIT
When training NN, I do random Action on State1 and observe resultant State2; then teach NN that input State1-Action-State2 should result in output 1.0
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but they are not the output layer themselves (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives, rather the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function--the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron which either fires (+1) or it doesn't (-1)). A look at the key properties of sigmoidal functions and you can see why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, the sigmoidal function is one category of functions used as the activation function (aka "squashing function") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed in as an argument to the activation function which returns the output for that layer. Another group of functions apparently used as the activation function is piecewise linear function. The step function is the binary variant of a PLF:
def step_fn(x) :
if x <= 0 :
y = 0
if x > 0 :
y = 1
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there an unlimited number of possible activation functions, but in practice, you only see a handful; in fact just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
# logistic function
def sigmoid2(x) :
return 1 / (1 + e**(-x))
# hyperbolic tangent
def sigmoid1(x) :
return math.tanh(x)
what are the factors to consider in selecting an activation function?
First the function has to give the desired behavior (arising from or as evidenced by sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the optimization technique used during training to 'fill in' the values of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written) :
def dsigmoid(y) :
return 1.0 - y**2
Beyond those two requriements, what makes one function between than another is how efficiently it trains the network--i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure i understood--sometimes it's difficult to communicate details of a NN, without the code, so i should probably just say that it's fine subject to this proviso: What you want the NN to predict must be the same as the dependent variable used during training. So for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data) then that's what your NN will return when run in "prediction mode" (post training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
P(y|x,h) = k1 * e**-(k2 * (y - h(x))**2)
You estimate the probabilities directly. Your model is:
P(Y=1|x,h) = h(x)
P(Y=0|x,h) = 1 - h(x)
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
h_max_likelihood = argmax_h product(
h(x)**y * (1-h(x))**(1-y) for x, y in examples)
This leads to the "cross entropy" loss function.
See chapter 6 in Mitchell's Machine Learning
for the loss function and its derivation.
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it will not be guaranteed that the network represents a valid probability density function, since the integral of the network is not guaranteed to equal 1.
E.g., a neural network could map any input form R^n to 1.0. But that is clearly not possible.
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" code samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.