Is "chainer.functions.sigmoid_cross_entropy" a second-order differentiable function? - machine-learning

I am a student studying machine learning.
For my study, we need the second derivative of the loss function, for which we use "chainer.functions.sigmoid_cross_entropy".
A similar function is "chainer.functions.softmax_cross_entropy". It has an argument "enable_double_backprop" that enables the second derivative, but "chainer.functions.sigmoid_cross_entropy" has no such argument.
Is "chainer.functions.sigmoid_cross_entropy" a second-order differentiable function?
Please teach me!
chainer.functions.sigmoid_cross_entropy (x, t, normalize = True, reduce = 'mean')
chainer.functions.softmax_cross_entropy (x, t, normalize = True, cache_score = True,
class_weight = None, ignore_label = -1, reduce = 'mean', enable_double_backprop = False,
soft_target_loss = 'cross-entropy')

Yes, sigmoid_cross_entropy is second-order differentiable. For performance reasons, softmax_cross_entropy is not second-order differentiable unless enable_double_backprop=True is given.
Functions that do not support higher-order derivatives are listed in https://github.com/chainer/chainer/issues/4449.
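To see why a second derivative exists at all, here is a minimal sketch in plain Python (it does not call Chainer). It assumes the per-element loss can be written as softplus(x) - t*x, the usual form of sigmoid cross entropy for a logit x and a binary target t. The helper names are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy(x, t):
    # Per-element loss for logit x and binary target t: softplus(x) - t * x.
    return math.log1p(math.exp(x)) - t * x

def first_derivative(x, t):
    # d(loss)/dx = sigmoid(x) - t
    return sigmoid(x) - t

def second_derivative(x):
    # d^2(loss)/dx^2 = sigmoid(x) * (1 - sigmoid(x)); it does not depend on t.
    s = sigmoid(x)
    return s * (1.0 - s)

def central_second_difference(f, x, h=1e-4):
    # Finite-difference estimate of the second derivative, used as a check.
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

x, t = 0.7, 1.0
numeric = central_second_difference(lambda z: sigmoid_cross_entropy(z, t), x)
analytic = second_derivative(x)
print(numeric, analytic)
```

The two printed values should agree to several decimal places, which is consistent with the loss being smooth in x.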

Related

Approach for Linearizing Nonlinear System around Decision Variables (x, u) from MathematicalProgram at each discrete point in time

This is a follow up question on the same system from the following post (for additional context). I describe a nonlinear system as a LeafSystem_[T] (using templates) with two input ports and one output port. Then, I essentially would like to perform direct transcription using MathematicalProgram with an additional cost function that is dependent on the linearized dynamics at each time step (and hence linearized around the decision variables). I use two input ports as it seemed the most straightforward way for obtaining the linearized dynamics of the form from this paper on DIRTREL (if I can take the Jacobian with respect to input ports)
$\delta x_{i+1} \approx A_i \delta x_i + B_i \delta u_i + G_i w_i$
where i is the timestep, x is the state, the first input port can encapsulate u, and the second input port can model w which may be disturbance, uncertainty, etc.
My main question is: what would be the most suitable set of methods to obtain the linearized dynamics around the decision variables at each time step using automatic differentiation? I was recommended to try automatic differentiation after attempting a symbolic approach in the previous post, but I am not familiar with the setup for doing so. I have experimented with
using primitives.Linearize() (calling it twice, once for each input port) which feels rather clunky and I am not sure whether it is possible to pass in decision variables into context
perhaps converting my system into a multibody and making use of multibody.tree.JacobianWrtVariable()
or formatting my system dynamics so that I can pass them in as the function argument for forwarddiff.jacobian
but have met limited success.
The easiest way to get Ai, Bi is to instantiate your system with AutoDiffXd, namely LeafSystem<AutoDiffXd>. The following code will give you Ai, Bi
MyLeafSystem<AutoDiffXd> my_system;
Eigen::VectorXd x_val = ...
Eigen::VectorXd u_val = ...
Eigen::VectorXd w_val = ...
// xuw_val concatenates x_val, u_val and w_val
Eigen::VectorXd xuw_val(x_val.rows() + u_val.rows() + w_val.rows());
xuw_val.head(x_val.rows()) = x_val;
xuw_val.segment(x_val.rows(), u_val.rows()) = u_val;
xuw_val.tail(w_val.rows()) = w_val;
// xuw_autodiff stores xuw_val in its value(), and an identity matrix in its gradient()
AutoDiffVecXd xuw_autodiff = math::initializeAutoDiff(xuw_val);
AutoDiffVecXd x_autodiff = xuw_autodiff.head(x_val.rows());
AutoDiffVecXd u_autodiff = xuw_autodiff.segment(x_val.rows(), u_val.rows());
AutoDiffVecXd w_autodiff = xuw_autodiff.tail(w_val.rows());
// I suppose you have a function x[n+1] = dynamics(system, x[n], u[n], w[n]). This dynamics function could be a wrapper of CalcUnrestrictedUpdate function.
AutoDiffVecXd x_next_autodiff = dynamics(my_system, x_autodiff, u_autodiff, w_autodiff);
Eigen::MatrixXd x_next_gradient = math::autoDiffToGradientMatrix(x_next_autodiff);
Eigen::MatrixXd Ai = x_next_gradient.block(0, 0, x_val.rows(), x_val.rows());
Eigen::MatrixXd Bi = x_next_gradient.block(0, x_val.rows(), x_val.rows(), u_val.rows());
Eigen::MatrixXd Gi = x_next_gradient.block(0, x_val.rows() + u_val.rows(), x_val.rows(), w_val.rows());
So you get the value of Ai, Bi, Gi in the end.
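The block-slicing step above can be illustrated without Drake. The following is a minimal sketch in plain Python that swaps automatic differentiation for central finite differences, so it only approximates what AutoDiffXd computes exactly. The dynamics function and all names here are hypothetical and chosen only to make the example runnable.

```python
def dynamics(x, u, w):
    # Hypothetical discrete-time dynamics, used only for this illustration.
    return [x[0] + 0.1 * x[1] + w[0],
            x[1] + 0.1 * (u[0] - x[0] ** 3)]

def jacobian_blocks(f, x, u, w, eps=1e-6):
    nx, nu = len(x), len(u)
    z = list(x) + list(u) + list(w)  # stacked vector [x; u; w]

    def g(zz):
        return f(zz[:nx], zz[nx:nx + nu], zz[nx + nu:])

    n_out = len(g(z))
    jac = [[0.0] * len(z) for _ in range(n_out)]
    for j in range(len(z)):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        fp, fm = g(zp), g(zm)
        for i in range(n_out):
            jac[i][j] = (fp[i] - fm[i]) / (2.0 * eps)

    # Slice the stacked Jacobian column-wise into the x, u and w blocks.
    A = [row[:nx] for row in jac]
    B = [row[nx:nx + nu] for row in jac]
    G = [row[nx + nu:] for row in jac]
    return A, B, G

A, B, G = jacobian_blocks(dynamics, [1.0, 0.0], [0.5], [0.0])
print(A, B, G)
```

The column order of the stacked vector determines which columns of the Jacobian belong to Ai, Bi and Gi, which is the same bookkeeping the block() calls perform in the C++ snippet.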
If you need to write a cost function, you will need to create a subclass of solvers::Cost. Inside the Eval function of this derived class, you will implement your code to first compute Ai, Bi, Gi, and then integrate the Riccati equation.
But I think since your cost function depends on Ai, Bi, Gi, the gradient of your cost function will depend on the gradient of Ai, Bi, Gi. Currently we don't provide the function to compute the second order gradient of the dynamics.
How complicated is your dynamical system? Is it possible to write down the dynamics by hand? If so, there are some shortcuts we can do to generate the second order gradient of your dynamics.
@sherm or other Drake dynamics folks, it would be great to get your opinion on how to get the second order gradient (assuming Phil could confirm he does need the second order gradient.)
Sorry for my belated reply.
Since your dynamics can be written by hand, I would suggest creating a templated function to compute Ai, Bi, Gi as
template <typename T>
void ComputeLinearizedDynamics(
const LeafSystem<T>& my_system,
const Eigen::Ref<const drake::VectorX<T>>& x,
const Eigen::Ref<const drake::VectorX<T>>& u,
drake::MatrixX<T>* Ai,
drake::MatrixX<T>* Bi,
drake::MatrixX<T>* Gi);
You will need to write down the matrix Ai, Bi, Gi by hand within this function. Then when you instantiate your LeafSystem with T=AutoDiffXd, this function will compute Ai, Bi, Gi with its gradient, given the state x, input u and disturbance w.
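Then, for the cost function, you could create a subclass of the Cost class as
class MyCost : public solvers::Cost {
public:
// num_vars is the total number of decision variables (all x and u) bound to this cost.
MyCost(const LeafSystem<AutoDiffXd>& my_system, int num_vars) : solvers::Cost(num_vars), my_system_{&my_system} {}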
protected:
void DoEval(const Eigen::Ref<const Eigen::VectorXd>& x_input, Eigen::VectorXd* y) const {
// The computation here is inefficient, as we need to cast
// x_input to Eigen vector of AutoDiffXd, and then call
// DoEval with AutoDiffXd version, and then convert the
// result back to double. But it is easy to implement.
const AutoDiffVecXd x_autodiff = math::initializeAutoDiff(x_input);
AutoDiffVecXd y_autodiff;
this->DoEval(x_autodiff, &y_autodiff);
*y = math::autoDiffToValueMatrix(y_autodiff);
}
void DoEval(const Eigen::Ref<const drake::AutoDiffVecXd>& x_input, drake::AutoDiffVecXd* y) const {
// x_input here contains the whole state and control sequence. You need to first partition x_input into x, u.
drake::AutoDiffVecXd x_all = x_input.head(num_x_ * nT_);
drake::AutoDiffVecXd u_all = x_input.tail(num_u_ * nT_);
y->resize(1);
(*y)(0) = 0;
// I assume S_final_ is stored in this class (as an Eigen::MatrixXd).
// S must hold AutoDiffXd entries because it depends on Ai.
drake::MatrixX<AutoDiffXd> S = S_final_.cast<AutoDiffXd>();
for (int i = nT_ - 1; i >= 0; --i) {
drake::MatrixX<AutoDiffXd> Ai, Bi, Gi;
ComputeLinearizedDynamics<AutoDiffXd>(
*my_system_,
x_all.segment(num_x_ * i, num_x_),
u_all.segment(num_u_ * i, num_u_),
&Ai, &Bi, &Gi);
S = Ai.transpose() * S + S * Ai + ... // This is the Riccati equation.
// Now compute your cost with this S
...
}
}
void DoEval(const Eigen::Ref<const VectorX<symbolic::Variable>>& x, VectorX<symbolic::Expression>* y) const {
// You don't need the symbolic version of the cost in nonlinear optimization.
throw std::runtime_error("Not implemented yet");
}
private:
const LeafSystem<AutoDiffXd>* my_system_;
};
The AutoDiffXd version of DoEval will compute the gradient of the cost for you automatically. You will then need to call the AddCost function in MathematicalProgram to add this cost, with all of x and u as the variables associated with it.

Tensorflow LSTM PTB Example - Understanding forward and backward pass

Right now I am going through the tensorflow example on LSTMs where they use the PTB dataset to create an LSTM network capable of predicting the next word. I've spent a lot of time trying to understand the code and have a good understanding of most of it. However, there is one function which I don't fully grasp:
def run_epoch(session, model, eval_op=None, verbose=False):
  """Runs the model on the given data."""
  costs = 0.0
  iters = 0
  state = session.run(model.initial_state)
  fetches = {
      "cost": model.cost,
      "final_state": model.final_state,
  }
  if eval_op is not None:
    fetches["eval_op"] = eval_op
  for step in range(model.input.epoch_size):
    feed_dict = {}
    for i, (c, h) in enumerate(model.initial_state):
      feed_dict[c] = state[i].c
      feed_dict[h] = state[i].h
    vals = session.run(fetches, feed_dict)
    cost = vals["cost"]
    state = vals["final_state"]
    costs += cost
    iters += model.input.num_steps
  return np.exp(costs / iters)
My confusion is this: each time through the outer loop, I believe we have processed batch_size * num_steps words and done both the forward and the backward propagation. But how, in the next iteration, do we know to start with, for example, the 36th word of each batch if num_steps = 35? I suspect some attribute of the model object changes on each iteration, but I cannot figure out which. Thanks for your help.

Calculating vector distance for classification with mixed features

I'm doing a project comparing the effectiveness of various classification algorithms, but I'm stuck on a frustrating point. The data may be found here: http://archive.ics.uci.edu/ml/datasets/Adult The classification problem is whether or not a person makes over 50k a year based on their census data.
Two example entries are as follows:
45, Private, 98092, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
50, Self-emp-not-inc, 386397, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
I'm familiar with using Euclidean distance to calculate the difference between vectors, but I'm not sure how to work with a mix of continuous and discrete attributes. Are there any effective methods for representing the difference between two vectors in a meaningful way? I'm having a hard time wrapping my head around how large values like the third attribute (a weight calculated by the people who extracted the data set based on factors, so that similar weights should have similar attributes) and differences between it can preserve meaning from discrete features like male or female, which is only a Euclidean distance of 1 if I understand the method correctly. I'm sure some categories could be removed, but I don't want to remove something that factors into classification significantly. I'm tackling k-NN first once I get this figured out, then a Bayesian classifier, and finally a decision tree model like C4.5 or ID3 if I have the time.
Sure, you can extend Euclidean distance in any number of ways. The simplest extension would be the following rule:
distance = 0 in that coordinate if there's a match, 1 otherwise
The challenge will be making the concept of distance "relevant" for the k-NN follow up. In some cases (e.g. education), I think it will be best to map education (discrete variable) into a continuous variable, such as years of education. So you'll need to write a function which maps e.g. "HS-grad" to 12, "Bachelors" to 16, something like that.
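A minimal sketch of the rule above, in plain Python: numeric coordinates contribute their squared difference, and categorical coordinates contribute 0 on a match and 1 otherwise. The education mapping uses only the two values mentioned here; the field names (age, education, workclass, sex) are assumed labels for illustration and other education levels would have to be filled in.

```python
import math

# Years of education for the two levels mentioned above; other levels are left out.
EDUCATION_YEARS = {"HS-grad": 12, "Bachelors": 16}

def mixed_distance(a, b, numeric_keys, categorical_keys):
    total = 0.0
    for key in numeric_keys:
        # Ordinary squared difference for continuous coordinates.
        total += (a[key] - b[key]) ** 2
    for key in categorical_keys:
        # 0 if the categories match, 1 otherwise.
        total += 0.0 if a[key] == b[key] else 1.0
    return math.sqrt(total)

p = {"age": 45, "education": EDUCATION_YEARS["HS-grad"],
     "workclass": "Private", "sex": "Male"}
q = {"age": 50, "education": EDUCATION_YEARS["Bachelors"],
     "workclass": "Self-emp-not-inc", "sex": "Male"}

d = mixed_distance(p, q, ["age", "education"], ["workclass", "sex"])
print(d)
```

Without any rescaling, the numeric terms still dominate the 0/1 terms, which is the relevance problem described above.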
Beyond that, using k-NN directly isn't going to work because the idea of "distance" among multiple dissimilar dimensions isn't well defined. I think you'll be better off throwing some of these dimensions away or weighting them differently. I don't know what the third number in your dataset (e.g. 98092) means, but if you use naive Euclidean distance it would be heavily overweighted compared to other dimensions such as age.
I'm not a machine learning expert, but I would personally be tempted to start k-NN on a reduced dimensionality dataset where you just pick some broad demographics (e.g. age, education, marital status) and ignore the trickier/"noisier" categories.
You need to code your categorical variables as 1-of-n binary variables (n choices for the variable, and of those variables one and only one is active). Then standardise your features---for each feature, subtract its mean and divide by standard deviation. Or normalise into the range 0-1. It's not perfect, but this will at least make dimensions comparable.
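The encoding and scaling described above could be sketched as follows in plain Python. This is only an illustration on two made-up rows with one numeric and one categorical feature; the helper names are hypothetical, and the standard deviation used is the population one (dividing by n), which is an assumption.

```python
import math

def one_hot(value, categories):
    # 1-of-n coding: exactly one entry is 1.0, the rest are 0.0.
    return [1.0 if value == c else 0.0 for c in categories]

def standardise(column):
    # Subtract the mean and divide by the (population) standard deviation.
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    if std == 0.0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - mean) / std for v in column]

ages = [45.0, 50.0]
workclasses = ["Private", "Self-emp-not-inc"]
categories = sorted(set(workclasses))

# Build the raw feature matrix, then standardise every column.
raw = [[age] + one_hot(w, categories) for age, w in zip(ages, workclasses)]
columns = [standardise(list(col)) for col in zip(*raw)]
rows = [list(r) for r in zip(*columns)]

distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[0], rows[1])))
print(rows, distance)
```

After this step every coordinate has zero mean and unit variance over the data, so no single feature dominates the Euclidean distance purely because of its units.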
Create an individual Map for each categorical field and use it to convert each value to a double.
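import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assign each distinct string a numeric code (1.0, 2.0, ...).
def createMap(data: RDD[String]) : Map[String,Double] = {
  var mapData:Map[String,Double] = Map()
  var counter = 0.0
  data.collect().foreach{ item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}
// Convert the salary field into a binary label.
def getLablelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K" => 1
}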
val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct
val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)
val featureVector = census.map{line =>
  val fields = line.split(", ")
  // The salary field (index 14) is the label, so it is left out of the features.
  LabeledPoint(getLablelValue(fields(14)), Vectors.dense(
    fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble,
    gradeTypeMap(fields(3)), fields(4).toDouble, marStatusMap(fields(5)),
    jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
    genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
    fields(12).toDouble, countryMap(fields(13))))
}

How to get the decision function from svm_model

Say I have a feature vector [v1,v2,v3],
then I have a decision function a*v1+b*v2+c*v3 =d
how do I get the values (a,b,c,d) using the information in svm_model?
I saw that these two fields in svm_model
public double[][] sv_coef;// coefficients for SVs in decision functions (sv_coef[k-1][l])
public double[] rho;// constants in decision functions (rho[k*(k-1)/2])
I suspect they could be essential for getting the decision function.
There is also a SVs field in svm_model. Your decision function is w*v + b = 0, where v = [v1,v2,v3]. Then,
w = model.SVs' * model.sv_coef;
b = -model.rho;
For multi-class SVM, you may also need another field called Label
if model.Label(1) == -1
w = -w;
b = -b;
end
Check the FAQ part for more details.
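The same computation can be sketched in plain Python. This assumes a two-class model with a linear kernel, since only then is w a coefficient-weighted sum of the support vectors. The support vectors, coefficients and rho below are made-up numbers, and the function name is hypothetical.

```python
def linear_decision_function(svs, sv_coef, rho):
    # w = SVs' * sv_coef: sum of support vectors weighted by their coefficients.
    n_features = len(svs[0])
    w = [sum(c * sv[j] for c, sv in zip(sv_coef, svs)) for j in range(n_features)]
    b = -rho
    return w, b

# Made-up values standing in for the SVs, sv_coef and rho fields of a trained model.
svs = [[1.0, 0.0, 2.0],
       [0.0, 1.0, 1.0]]
sv_coef = [0.5, -0.5]
rho = 0.25

w, b = linear_decision_function(svs, sv_coef, rho)
a_coef, b_coef, c_coef = w  # coefficients of v1, v2, v3
d = -b                      # a*v1 + b*v2 + c*v3 = d is the same as w.v + b = 0
print(w, b, d)
```

In the notation of the question, (a, b, c) are the entries of w and d equals rho.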

Output function for fminunc in Octave

I am trying to implement the Regularized Logistic Regression Algorithm, using the fminunc() function in Octave for minimising the cost function. As generally advised, I would like to plot the cost function as a function of iterations of the fminunc() function. The function call looks as follows -
[theta, J, exit_flag] = ...
fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
with
options = optimset('GradObj', 'on', 'MaxIter', 400, 'OutputFcn', @showJ_history);
[showJ_history is the intended output function; I hope I have set the options parameter correctly].
But I can't find good sources on the internet explaining how to write this output function: specifically, what parameters fminunc() passes to it and what it must return (if fminunc() requires anything in particular).
Could someone please point me to some helpful links or help me write the output function?
I think you can refer to the source code. Consider also this example:
1;
function f = __rosenb (x)
# http://en.wikipedia.org/wiki/Rosenbrock_function
n = length (x);
f = sumsq (1 - x(1:n-1)) + 100 * sumsq (x(2:n) - x(1:n-1).^2);
endfunction
function bstop = showJ_history(x, optv, state)
plot(optv.iter, optv.fval, 'x')
# setting bstop to true stops optimization
bstop = false;
endfunction
opt = optimset('OutputFcn', @showJ_history);
figure()
xlabel("iteration")
ylabel("cost function")
hold on
[x, fval, info, out] = fminunc (@__rosenb, [5, -5], opt);
