How to define the derivative of a custom activation function in Keras

I have a custom activation function and its derivative. Although I can use the custom activation function, I don't know how to tell Keras what its derivative is.
It seems like it finds one itself, but I have a parameter that has to be shared between the function and its derivative, so how can I do that?
I know there is a relatively easy way to do this in TensorFlow, but I have no idea how to implement it in Keras. Here is how you do it in TensorFlow:
Edit: based on the answer I got, maybe I wasn't clear enough. What I want is to implement a custom derivative for my activation function so that it uses my derivative during backpropagation. I already know how to implement a custom activation function.

Take a look at the source code where the activation functions of Keras are defined:
keras/activations.py
For example:
def relu(x, alpha=0., max_value=None):
    """Rectified Linear Unit.

    # Arguments
        x: Input tensor.
        alpha: Slope of the negative part. Defaults to zero.
        max_value: Maximum value for the output.

    # Returns
        The (leaky) rectified linear unit activation: `x` if `x > 0`,
        `alpha * x` if `x < 0`. If `max_value` is defined, the result
        is truncated to this value.
    """
    return K.relu(x, alpha=alpha, max_value=max_value)
Also note how Keras layers call the activation functions: self.activation = activations.get(activation). The activation argument can be a string or a callable.
Thus, similarly, you can define your own activation function, for example:
def my_activ(x, p1, p2):
    ...
    return ...
Suppose you want to use this activation in a Dense layer. The activation argument expects a callable of a single tensor, so bind the extra parameters first, for example with a lambda:
x = Dense(128, activation=lambda t: my_activ(t, p1, p2))(input)
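Equivalently, functools.partial can bind the parameters up front; a minimal sketch (the values 0.5 and 2.0 are hypothetical, and my_activ is assumed to be implemented with backend ops):
from functools import partial
from keras.layers import Dense

# bind the extra parameters so the activation takes only the tensor argument
my_activ_bound = partial(my_activ, p1=0.5, p2=2.0)  # hypothetical values
x = Dense(128, activation=my_activ_bound)(input)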
If you mean you want to implement your own derivative:
If your activation function is written in TensorFlow/Keras functions whose operations are differentiable (e.g. K.dot(), tf.matmul(), tf.concat(), etc.), then the derivatives will be obtained by automatic differentiation (https://en.wikipedia.org/wiki/Automatic_differentiation). In that case you don't need to write your own derivative.
If you still want to re-write the derivatives, check this document https://www.tensorflow.org/extend/adding_an_op, where you need to register your gradients using tf.RegisterGradient.
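Alternatively, TensorFlow 1.7+ offers the tf.custom_gradient decorator, which avoids registering an op. A minimal sketch of a leaky-ReLU-style activation whose parameter is shared between the forward pass and a hand-written derivative (the function names and the alpha value are illustrative, not from the question):
import tensorflow as tf

def make_activation(alpha):
    # `alpha` is captured by the closure, so it is shared between the
    # forward computation and the custom gradient below.
    @tf.custom_gradient
    def my_activation(x):
        y = tf.where(x > 0, x, alpha * x)  # forward pass

        def grad(dy):
            # hand-written derivative, reusing the same `alpha`
            return dy * tf.where(x > 0, tf.ones_like(x), alpha * tf.ones_like(x))

        return y, grad
    return my_activation

# usage, e.g.: Dense(128, activation=make_activation(alpha=0.1))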

Related

How to change the activations of a layer using a lambda function during training

I am new to Keras and trying to modify the outputs of a layer during training. I want to write a function that takes the layer outputs and returns the modified outputs to the next layer during learning. I have tried using lambda functions but haven't really gotten the hang of it.
def fun(x):
    a = min(x)
    y = np.round(x*(2**a))
    return y

layer_1 = Dense(32, activation='relu')(input)
layer_2 = Dense(12, activation='relu')(layer_1)
lambda_layer = Lambda(fun, output_shape=(12,))(layer_2)
layer_3 = Dense(32, activation='relu')(lambda_layer)
How can I get the layer outputs and modify them before passing them to the next layer?
Using a lambda function is the right approach for your problem. However, keep in mind that the lambda function will be part of your computational graph and during training gradients have to be computed for the whole graph.
For example, you should not use the min() function as you did, but rather use functions that are part of the Keras backend. Replacing all operations with their Keras backend equivalents results in:
import keras.backend as K

def fun(x):
    a = K.min(x)
    y = K.round(x * K.pow(2.0, a))  # elementwise multiply, matching the original x*(2**a)
    return y
Your final model (and so all Lambda layers) should only contain native Keras functions, in order to safely perform all calculations during training.
This fails because you are using non-native operations (like np.round) inside a Lambda function, which expects Keras operations.
Examine the keras.backend docs, and take the functions you want to use from there.
So your function should look something like this:
from keras import backend as K

def fun(x):
    a = K.min(x, axis=-1, keepdims=True)  # specify the axis you need; keepdims keeps `a` broadcastable against x
    y = K.round(x * (2 ** a))
    return y
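For completeness, a sketch of wiring the corrected function into the model from the question (the input size 20 is hypothetical). One caveat: K.round has a zero gradient almost everywhere, so this layer passes no useful gradient back to the layers before it:
from keras.layers import Input, Dense, Lambda
from keras.models import Model

inp = Input(shape=(20,))  # hypothetical input size
layer_1 = Dense(32, activation='relu')(inp)
layer_2 = Dense(12, activation='relu')(layer_1)
lambda_layer = Lambda(fun, output_shape=(12,))(layer_2)
layer_3 = Dense(32, activation='relu')(lambda_layer)
model = Model(inp, layer_3)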

What is the difference between keras.activations.softmax and keras.layers.Softmax?

What is the difference between keras.activations.softmax and keras.layers.Softmax? Why are there two definitions of the same activation function?
keras.activations.softmax: https://keras.io/activations/
keras.layers.Softmax: https://keras.io/layers/advanced-activations/
They are equivalent to each other in terms of what they do. Actually, the Softmax layer would call the activations.softmax under the hood:
def call(self, inputs):
    return activations.softmax(inputs, axis=self.axis)
However, their difference is that the Softmax layer can be used directly as a layer:
from keras.layers import Softmax
soft_out = Softmax()(input_tensor)
activations.softmax, on the other hand, cannot be used directly as a layer. Rather, you can pass it as the activation function of other layers through the activation argument:
from keras import activations
dense_out = Dense(n_units, activation=activations.softmax)
Further, note that a nice thing about using the Softmax layer is that it takes an axis argument: you can compute the softmax over an axis of the input other than the last one (which is the default):
soft_out = Softmax(axis=desired_axis)(input_tensor)
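For instance, a minimal sketch (the 3-D shape is hypothetical) normalizing across the timestep axis instead of the default last axis:
from keras.layers import Input, Softmax

inp = Input(shape=(5, 10))             # hypothetical (batch, timesteps, classes) input
soft_over_time = Softmax(axis=1)(inp)  # softmax across the timestep axis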

How to use my own activation function in tensorflow train API?

Can I define my own activation function and use it in the TensorFlow Train API, i.e. the high-level API with pre-defined estimators like DNNClassifier?
For example, I want to use this code but replace the activation function tf.nn.tanh with something my own:
tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[5, 10, 5],
    n_classes=3,
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.01,
        l1_regularization_strength=0.0001),
    activation_fn=tf.nn.tanh)
If your custom function can be expressed in terms of built-in TensorFlow ops, then it's fairly straightforward. For example:
DNNClassifier(feature_columns=feature_columns,
              ...,
              activation_fn=lambda x: 2*tf.nn.tanh(x) + 3*tf.nn.relu(x) + 1)
In general, activation_fn can be any callable that accepts a tensor of arbitrary shape (because it will be applied after each layer). TensorFlow will be able to backpropagate through such an expression without any problem.
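For readability, the activation can also be a named function; a hedged sketch (the function itself is an arbitrary example built from differentiable TF ops, so automatic differentiation supplies its gradient; feature_columns is assumed defined elsewhere):
import tensorflow as tf

def swishlike(x):
    # arbitrary example activation composed of built-in, differentiable ops
    return x * tf.nn.sigmoid(x)

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[5, 10, 5],
    n_classes=3,
    activation_fn=swishlike)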
However, if you want a completely new custom op, not expressible via existing ones, you'll have to register it and compute its gradient manually. See this question for the details.

How does Keras (or any other ML framework) calculate the gradient of a lambda function layer for backpropagation?

Keras enables adding a layer which calculates a user-defined lambda function.
What I don't get is how Keras knows how to calculate the gradient of this user-defined function for backpropagation.
That is one of the benefits of using Theano/TensorFlow and libraries built on top of them: they give you automatic gradient calculation for mathematical functions and operations.
Keras gets them by calling:
# keras/theano_backend.py
def gradients(loss, variables):
    return T.grad(loss, variables)

# keras/tensorflow_backend.py
def gradients(loss, variables):
    '''Returns the gradients of `variables` (list of tensor variables)
    with regard to `loss`.
    '''
    return tf.gradients(loss, variables, colocate_gradients_with_ops=True)
These are in turn called by the optimizers (keras/optimizers.py) via grads = self.get_gradients(loss, params) to get the gradients used to write the update rule for all the params. Here, params are the trainable weights of the layers. Layers created by the Lambda functional layer don't have any trainable weights, but they affect the loss function through the forward pass and hence indirectly affect the calculation of the gradients of the trainable weights of other layers.
The only time you need to write a new gradient calculation is when you are defining a new basic mathematical operation/function. Also, when you write a custom loss function, autograd almost always takes care of the gradient calculation. Optionally, you can speed up training (not always) by implementing the analytical gradient of your custom functions. For example, the softmax function can be expressed in terms of exp, sum and div, and autograd can take care of that composition, but its analytical/symbolic gradient is usually implemented directly in Theano/TensorFlow.
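To see automatic differentiation at work, a minimal TF 1.x-style sketch that differentiates a softmax spelled out via exp/sum/div primitives, with no hand-written derivative:
import tensorflow as tf

logits = tf.constant([[1.0, 2.0, 3.0]])
# softmax expressed via differentiable primitives
e = tf.exp(logits)
soft = e / tf.reduce_sum(e, axis=-1, keepdims=True)
# autograd builds the backward graph automatically
grads = tf.gradients(tf.reduce_sum(soft[:, 0]), [logits])

with tf.Session() as sess:
    print(sess.run(grads))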
For implementing new Ops, see the links below:
http://deeplearning.net/software/theano/extending/extending_theano.html
https://www.tensorflow.org/versions/r0.12/how_tos/adding_an_op/index.html

Probability and Neural Networks

Is it good practice to use sigmoid or tanh output layers in neural networks directly to estimate probabilities?
i.e. the probability of a given input occurring is the output of the sigmoid function in the NN
EDIT
I wanted to use a neural network to learn and predict the probability of a given input occurring.
You may consider the input as a State1-Action-State2 tuple.
Hence the output of the NN is the probability that State2 happens when applying Action on State1.
I hope that clears things up.
EDIT
When training the NN, I do a random Action on State1 and observe the resultant State2; then I teach the NN that the input State1-Action-State2 should result in the output 1.0.
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output of each layer, but the functions are not the output layer themselves (nor do they calculate probabilities).
Additionally, your question poses a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives; rather, 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function: the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron, which either fires (+1) or it doesn't (-1).) A look at the key properties of sigmoidal functions shows why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, the sigmoidal functions are one category of functions used as the activation function (aka "squashing function") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed as an argument to the activation function, which returns the output for that layer. Another group of functions used as activation functions is the piecewise linear functions. The step function is the binary variant of a PLF:
def step_fn(x):
    if x <= 0:
        return 0
    else:
        return 1
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there is an unlimited number of possible activation functions, but in practice you only see a handful; in fact, just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in Python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
import math

# logistic function
def sigmoid2(x):
    return 1 / (1 + math.exp(-x))

# hyperbolic tangent
def sigmoid1(x):
    return math.tanh(x)
What are the factors to consider in selecting an activation function?
First, the function has to give the desired behavior (arising from, or as evidenced by, the sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the optimization technique used during training to 'fill in' the values of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written):
def dsigmoid(y):
    return 1.0 - y**2
Beyond those two requirements, what makes one function better than another is how efficiently it trains the network, i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure I understood; sometimes it's difficult to communicate details of a NN without the code, so I should probably just say that it's fine subject to this proviso: what you want the NN to predict must be the same as the dependent variable used during training. So for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data), then that's what your NN will return when run in "prediction mode" (post-training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
P(y|x,h) = k1 * e**-(k2 * (y - h(x))**2)
You estimate the probabilities directly. Your model is:
P(Y=1|x,h) = h(x)
P(Y=0|x,h) = 1 - h(x)
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
h_max_likelihood = argmax_h product(
    h(x)**y * (1-h(x))**(1-y) for x, y in examples)
This leads to the "cross entropy" loss function.
See chapter 6 in Mitchell's Machine Learning for the loss function and its derivation.
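Concretely, taking the negative log of that likelihood product yields the cross-entropy sum; a minimal sketch in plain Python (the examples iterable of (x, y) pairs and the hypothesis h are assumed):
import math

def cross_entropy(examples, h):
    # negative log-likelihood of the Bernoulli model above
    return -sum(y * math.log(h(x)) + (1 - y) * math.log(1 - h(x))
                for x, y in examples)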
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it is not guaranteed that the network represents a valid probability density function, since the integral of the network over its input space is not guaranteed to equal 1.
E.g., a neural network could map every input from R^n to 1.0, which is clearly not a valid density.
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks), see contrastive backprop.
