I am trying to formulate a trajectory optimization problem for a glider, where I want to maximize the average horizontal velocity. I have formulated the system as a Drake system, and the state vector consists of the position and velocity.
Currently, I have something like the following:
dircol = DirectCollocation(
    plant,
    context,
    num_time_samples=N,
    minimum_timestep=min_dt,
    maximum_timestep=max_dt,
)
... # other constraints etc
horisontal_pos = dircol.state()[0:2] # Only (x,y)
time = dircol.time()
dircol.AddFinalCost(-w.T.dot(horisontal_pos) / time)
where AddFinalCost() should replace all instances of state() and time() with the final values, as far as I understand from the documentation. min_dt is non-zero and w is a vector of linear weights.
However, I am getting the following error message
Expression (...) is not a polynomial. ParseCost does not support non-polynomial expression.
which makes me think that there is no way of adding the type of cost function that I am looking for. Is there anything that I am missing?
Thank you in advance!
When calling AddFinalCost(e) with e being a symbolic expression, we can only handle it when e is a polynomial function of the state (more precisely, either a quadratic function or a linear function). Hence the error you see complaining that the cost is not polynomial.
You could add the cost like this
import numpy as np

def average_speed(v):
    x = v[0]            # final horizontal position
    time_steps = v[1:]  # the N-1 timestep decision variables
    return x / np.sum(time_steps)

h_vars = [dircol.timestep(i) for i in range(N-1)]
dircol.AddCost(average_speed, vars=[dircol.state(N-1)[0]] + h_vars)
which uses a function average_speed to evaluate the average speed. You can find an example of this approach in https://github.com/RobotLocomotion/drake/blob/e5f3c3e5f7927ef675066d97d3afac55d3481305/bindings/pydrake/solvers/test/mathematicalprogram_test.py#L590
First, the cost function should be scalar-valued, but you passed the vector-valued horisontal_pos / time, which has two entries, position_x / time and position_y / time; a vector cannot be a cost. You should instead provide a scalar-valued cost.
Second, it is unclear to me why you divide by time in the final cost. As far as I understand it, you want the final position to be close to the origin, so something like position_x² + position_y². The code can look like
dircol.AddFinalCost(horisontal_pos[0]**2 + horisontal_pos[1]**2)
I've been trying to fit a system of differential equations to some data I have, and there are 18 parameters to fit; however, ideally some of these parameters should be zero or go to zero. While googling this, one thing I came across was building DE layers into neural networks, and I have found a few GitHub repos with Julia code examples; however, I am new to both Julia and neural ODEs. In particular, I have been modifying the code from this example:
https://computationalmindset.com/en/neural-networks/experiments-with-neural-odes-in-julia.html
Differences: I have a system of 3 DEs, not 2; I have 18 parameters; and I import two CSVs with data to fit, instead of generating a toy dataset to fit.
My dilemma: while googling I came across LASSO/L1 regularization and hope that by adding an L1 penalty to the cost function I can "zero out" some of the parameters. The problem is I don't understand how to modify the cost function to incorporate it. My loss function right now is just
function loss_func()
    pred = net()
    sum(abs2, truth[1] .- pred[1,:]) +
    sum(abs2, truth[2] .- pred[2,:]) +
    sum(abs2, truth[3] .- pred[3,:])
end
but I would like to incorporate the L1 penalty into this. For L1 regression, I came across the equation for the cost function: J′(θ; X, y) = J(θ; X, y) + aΩ(θ), "where θ denotes the trainable parameters, X the input... y [the] target labels. a is a hyperparameter that weights the contribution of the norm penalty", and for L1 regularization the penalty is Ω(θ) = ||w||₁ = ∑|w| (source: https://theaisummer.com/regularization/). I understand that the first term on the RHS is the loss J(θ; X, y) and is what I already have, that a is a hyperparameter that I choose and could be 0.001, 0.1, 1, 100000000, etc., and that the L1 penalty is the sum of the absolute values of the parameters. What I don't understand is how I add the a∑|w| term to my current function - I want to edit it to be something like so:
function cost_func(lambda)
    pred = net()
    penalty(lambda) = lambda * (sum(abs, param[1]) +
                                sum(abs, param[2]) +
                                sum(abs, param[3]))
    sum(abs2, truth[1] .- pred[1,:]) +
    sum(abs2, truth[2] .- pred[2,:]) +
    sum(abs2, truth[3] .- pred[3,:]) +
    penalty(lambda)
end
where param[1], param[2], param[3] refers to the parameters for DEs u[1], u[2], u[3] that I'm trying to learn. I don't know if this logic is correct though or the proper way to implement it, and also I don't know how/where I would access the learned parameters. I suspect that the answer may lie somewhere in this chunk of code
callback_func = function ()
    loss_value = loss_func()
    println("Loss: ", loss_value)
end
fparams = Flux.params(p)
Flux.train!(loss_func, fparams, data, optimizer, cb = callback_func);
but I don't know for certain or even how to use it, if it were the answer.
I've been messing with this and looking at some other NODE implementations (this one in particular), and I have adjusted my cost function so that it is:
function cost_fnct(param)
    prob = ODEProblem(model, u0, tspan, param)
    prediction = Array(concrete_solve(prob, Tsit5(), p = param, saveat = trange))
    loss = Flux.mae(prediction, data)
    penalty = sum(abs, param)
    loss + lambda*penalty
end;
where lambda is the tuning parameter, and using the definition that the L1 penalty is the sum of the absolute value of the parameters. Then, for training:
lambda = 0.01
resinit = DiffEqFlux.sciml_train(cost_fnct, p, ADAM(), maxiters = 3000)
res = DiffEqFlux.sciml_train(cost_fnct, resinit.minimizer, BFGS(initial_stepnorm = 1e-5))
where p is initially just my parameter "guesses", i.e., a vector of ones with the same length as the number of parameters I am attempting to fit.
If you're looking at the first link I had in the original post (here), you can redefine the loss function to add this penalty term and then define lambda before the callback function and subsequent training:
lambda = 0.01
callback_func = function ()
    loss_value = cost_fnct(p)
    println("Loss: ", loss_value)
    println("\nLearned parameters: ", p)
end
fparams = Flux.params(p)
Flux.train!(cost_fnct, fparams, data, optimizer, cb = callback_func);
None of this, of course, includes any sort of cross-validation or tuning-parameter optimization! I'll go ahead and accept my own response to my question, because it's my understanding that unanswered questions get pushed to encourage answers and I want to avoid clogging the tag, but if anyone has a different solution or wants to comment, please feel free to do so.
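For anyone reading this who is not working in Julia: the same loss-plus-L1-penalty pattern in Python with scipy might look like the sketch below. Everything here (the toy 3-state rhs, the zero data array) is a placeholder assumption; only the structure - a data-fit term plus lambda times the L1 norm of the parameters - carries over from the answer above.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

# Toy stand-ins: the real problem has 3 states and 18 parameters.
def rhs(t, u, p):
    return p.reshape(3, 3) @ u  # placeholder linear dynamics

u0 = np.array([1.0, 0.0, 0.0])
t_obs = np.linspace(0.0, 1.0, 20)
u_obs = np.zeros((3, t_obs.size))  # replace with the CSV data

lam = 0.01  # L1 tuning parameter

def objective(p):
    sol = solve_ivp(rhs, (t_obs[0], t_obs[-1]), u0, t_eval=t_obs, args=(p,))
    loss = np.mean(np.abs(sol.y - u_obs))  # data-fit term (MAE)
    return loss + lam * np.sum(np.abs(p))  # plus the L1 penalty

res = minimize(objective, np.ones(9), method="Nelder-Mead")
print(res.x)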
Given an input signal x (e.g. a voltage, sampled a thousand times per second, a couple of minutes long), I'd like to calculate e.g.
/ this is not q
y[3] = -3*x[0] - x[1] + x[2] + 3*x[3]
y[4] = -3*x[1] - x[2] + x[3] + 3*x[4]
. . .
I'm aiming for a variable window length and weight coefficients. How can I do it in q? I'm aware of mavg, signal processing in q, and the moving sum q idiom.
In the DSP world this is called applying a filter kernel by doing convolution. The weight coefficients define the kernel, which makes a high- or low-pass filter. The example above calculates the slope from the last four points, fitting a straight line via the least-squares method.
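For reference, a minimal numpy sketch of the computation I'm after (Python, not q; the weights are the example kernel above, and the signal is a toy random walk):

import numpy as np

x = 10 + np.cumsum(np.random.choice([-1.0, 1.0], 1000))  # toy random-walk signal
w = np.array([3.0, 1.0, -1.0, -3.0])  # kernel; w[0] multiplies the newest sample

# mode="valid" keeps only full windows, so
# y[m] = -3*x[m] - x[m+1] + x[m+2] + 3*x[m+3], i.e. the y[3], y[4], ... above
y = np.convolve(x, w, mode="valid")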
Something like this would work for parameterisable coefficients:
q)x:10+sums -1+1000?2f
q)f:{sum x*til[count x]xprev\:y}
q)f[3 1 -1 -3] x
0n 0n 0n -2.385585 1.423811 2.771659 2.065391 -0.951051 -1.323334 -0.8614857 ..
Specific cases can be made a bit faster (running 0 xprev is not the best thing)
q)g:{prev[deltas x]+3*x-3 xprev x}
q)g[x]~f[3 1 -1 -3]x
1b
q)\t:100000 f[3 1 -1 -3] x
4612
q)\t:100000 g x
1791
There's a kx white paper of signal processing in q if this area interests you: https://code.kx.com/q/wp/signal-processing/
This may be a bit old, but I thought I'd weigh in. There is a paper I wrote last year on signal processing that may be of some value. Working purely within kdb+, depending on the signal sizes you are using, you will see much better performance with an FFT-based convolution between the kernel/window and the signal.
However, I've only written up a simple radix-2 FFT, although in my GitHub repo I do have untested work on a more flexible Bluestein algorithm, which will allow for more variable signal lengths. https://github.com/callumjbiggs/q-signals/blob/master/signal.q
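For a sense of the FFT approach outside q, scipy ships an off-the-shelf FFT-based convolution; a minimal Python sketch (purely illustrative, separate from the q library above, with random stand-in data):

import numpy as np
from scipy.signal import fftconvolve

x = np.random.rand(100000)  # long input signal
w = np.random.rand(30)      # kernel/window weights

# Same result as np.convolve(x, w, "valid") but computed via FFTs,
# roughly O(n log n) instead of O(n*k) for a direct sliding dot product.
y = fftconvolve(x, w, mode="valid")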
If you wish to go down the path of performing a full manual convolution by a moving sum, then the best method would be to break it up into blocks equal to the kernel/window size (which is based on some work Arthur W did many years ago):
q)vec:10000?100.0
q)weights:30?1.0
q)wsize:count weights
q)(weights$(((wsize-1)#0.0),vec)til[wsize]+) each til count vec
32.5931 75.54583 100.4159 124.0514 105.3138 117.532 179.2236 200.5387 232.168.
If your input list is not big, then you could use the technique mentioned here:
https://code.kx.com/q/cookbook/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
That uses the 'scan' adverb. That process creates multiple lists, which might be inefficient for big lists.
Another solution using scan is:
q)f:{sum y*next\[z;x]} / x-input list, y-weights, z-window size-1
q)f[x;-3 -1 1 3;3]
This function also creates multiple lists, so again it might not be very efficient for big lists.
Another option is to use indices to fetch the target items from the input list and perform the calculation. This operates only on the input list.
q) f:{[l;w;i]sum w*l i+til 4} / w- weight, l- input list, i-current index
q) f[x;-3 -1 1 3] each til count x
This is a very basic function. You can add more variables to it as per your requirements.
So I have something that looks like the following: [figure: two Gaussian class-conditional distributions with a decision line between them]
However, I am having real trouble integrating the data on the other side of this decision line to get my errors.
In general, given the analytic form of the decision boundary, you could compute the integrals exactly. However, why not use Monte Carlo, which is fast, simple and generic (it will work for any distributions and decision boundaries)? All you have to do is repeatedly sample from your Gaussians, check whether the sampled point is on the correct side (N_c) or the incorrect side (N_i), and in the limit you will get your integrals from
INTEGRAL_of_distributions_being_on_correct_side ~ N_c / (N_c + N_i)
INTEGRAL_of_distributions_being_on_incorrect_side ~ N_i / (N_c + N_i)
thus in pseudo code:
N_c = 0
N_i = 0
for i=1 to N do
    y ~ P({-, +})    # sample distribution
    x ~ P(X|y)       # sample point from given class
    if side_of_decision(x) == y then
        N_c += 1
    else
        N_i += 1
    end
end
return N_c, N_i
In your case P({-, +}) is probably just 50-50 chance and P(X|-) and P(X|+) are your two Gaussians.
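A minimal Python sketch of this estimator, assuming a 50-50 prior, two example Gaussians, and a placeholder linear boundary (the means, covariances, and side_of_decision below are illustrative assumptions; substitute your fitted parameters and real decision function):

import numpy as np

rng = np.random.default_rng(0)

# Placeholder class-conditional Gaussians; replace with your fitted ones.
mean = {-1: np.array([0.0, 0.0]), +1: np.array([2.0, 2.0])}
cov = {-1: np.eye(2), +1: np.eye(2)}

def side_of_decision(x):
    # Placeholder linear boundary x0 + x1 = 2; replace with your classifier.
    return +1 if x[0] + x[1] > 2.0 else -1

N = 100_000
n_correct = 0
for _ in range(N):
    y = rng.choice([-1, +1])                      # y ~ P({-, +}), 50-50 prior
    x = rng.multivariate_normal(mean[y], cov[y])  # x ~ P(X | y)
    n_correct += side_of_decision(x) == y

print("P(correct) ~", n_correct / N)
print("P(error)   ~", 1 - n_correct / N)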
This has become quite a frustrating question, but I've asked in the Coursera discussions and they won't help. Below is the question:
[screenshot of the quiz: a table of four training examples with midterm exam scores x_1 = {89, 72, 94, 69}, their squares x_2 = {7921, 5184, 8836, 4761}, and the final exam scores; it asks for the normalized feature x_2^(2)]
I've gotten it wrong 6 times now. How do I normalize the feature? Hints are all I'm asking for.
I'm assuming x_2^(2) is the value 5184, unless I am supposed to add the x_0 column of 1's, which they don't mention here but which he certainly mentions in the lectures when talking about creating the design matrix X; in that case x_2^(2) would be the value 72. Assuming one or the other is right (I'm playing a guessing game), what should I use to normalize it? He talks about 3 different ways to normalize in the lectures: one using the maximum value, another using the range (the difference between max and min), and another using the standard deviation - and they want an answer correct to the hundredths. Which one am I to use? This is so confusing.
...use both feature scaling (dividing by the
"max-min", or range, of a feature) and mean normalization.
So for any individual feature f:
f_norm = (f - f_mean) / (f_max - f_min)
e.g. for x2 = (midterm exam)^2 = {7921, 5184, 8836, 4761}:
> x2 <- c(7921, 5184, 8836, 4761)
> mean(x2)
6675.5
> max(x2) - min(x2)
4075
> (x2 - mean(x2)) / (max(x2) - min(x2))
0.306 -0.366 0.530 -0.470
Hence norm(5184) = -0.366
(using R language, which is great at vectorizing expressions like this)
I agree it's confusing that they use the notation x_2^(2) (feature x_2 of training example 2) when what they want is its normalized value, i.e. x_2(norm) or x_2'.
EDIT: in practice everyone calls the built-in scale(...) function, though note that by default it scales by the standard deviation rather than the range.
It's asking you to normalize the second feature, under the second column, using both feature scaling and mean normalization. Therefore,
(5184 - 6675.5) / 4075 = -0.366
Usually we normalize all of them to have zero mean and lie between [-1, 1].
You can do that easily by removing the mean of the samples and then dividing by the maximum of the absolute value.
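A minimal numpy sketch of that scheme, using the x2 values from this question (note this zero-mean/max-abs scaling differs from the range-based scaling the quiz itself asks for):

import numpy as np

x2 = np.array([7921.0, 5184.0, 8836.0, 4761.0])  # the x2 column from the question
centered = x2 - x2.mean()                    # zero mean
x2_norm = centered / np.abs(centered).max()  # now within [-1, 1]
print(x2_norm)                               # approx. [0.5765 -0.6904 1.0 -0.8861]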
"I'm assuming x_2^(2) is the value 5184" is this because it's the second item in the list and using the subscript _2? x_2 is just a variable identity in maths, it applies to all rows in the list. Note that the highest raw mid-term exam result (i.e. that which is not squared) goes down on the final test and the lowest raw mid-term result increases the most for the final exam result. Theta is a fixed value, a coefficient, so somewhere your normalisation of x_1 and x_2 values must become (EDIT: not negative, less than 1) in order to allow for this behaviour. That should hopefully give you a starting basis, by identifying where the pivot point is.
I had the same problem. In my case, I was computing the average as the maximum x2 value (8836) minus the minimum x2 value (4761), divided by two, instead of as the sum of the x2 values divided by the number of examples.
For the same training set, I got the question as
Q. What is the normalized feature x^(3)_1?
Thus, the 3rd training example and 1st feature works out to 94 in the table above.
Now, the normalized form is
x_norm = (x - mean(x)) / range(x)
Values are:
x = 94
mean = (89 + 72 + 94 + 69) / 4 = 81
range = 94 - 69 = 25
Normalized x = (94 - 81) / 25 = 0.52
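A quick plain-Python sketch to double-check that arithmetic (the x1 values are taken from the table above):

x1 = [89, 72, 94, 69]             # midterm scores, feature x1
mean_x1 = sum(x1) / len(x1)       # 81.0
range_x1 = max(x1) - min(x1)      # 94 - 69 = 25
print((94 - mean_x1) / range_x1)  # 0.52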
I'm taking this course at the moment, and a really trivial mistake I made the first time I answered this question was using a comma instead of a dot in the answer, since I did it by hand and in my country we use a comma to denote decimals, e.g. 0,52 instead of 0.52.
So the second time I tried, I used a dot and it worked fine.
We have a boolean variable X which is either true or false and alternates at each time step with a probability p. I.e. if p is 0.2, X would alternate once every 5 time steps on average. We also have a timeline and observations of the value of this variable at various non-uniformly sampled points in time.
How would one learn, from these observations, the probability that X has alternated/changed value by time t+n, where t is the time X was last observed and n is some time in the future, given that p is unknown and we only have observations of the value of X at previous times? Note that I count changing from true to false and back to true again as changing value twice.
I'm going to approach this problem as if it were on a test.
First, let's name the variables.
Bx is the value of the boolean variable after x opportunities to flip (and B0 is the initial state). P is the chance of changing to a different value at every opportunity.
Given that each flip opportunity is independent of the others (there is, for example, no minimum number of opportunities between flips), the math is extremely simple; since events are not affected by past events, we can consolidate them into a single computation, which works best when considering Bx not as a boolean value, but as itself a probability.
Here is the domain of the computations we will use: Bx is a probability (with a value between 0 and 1 inclusive) representing the likelihood of truth. P is a probability (with a value between 0 and 1 inclusive) representing the likelihood of flipping at any given opportunity.
The probability of falseness, 1 - Bx, and the probability of not flipping, 1 - P, are probabilistic identities which should be quite intuitive.
Assuming these simple rules, the general probability of truth of the boolean value is given by the recursive formula Bx+1 = Bx*(1-P) + (1-Bx)*P.
Code (in C++, because it's my favorite language and you didn't tag one):
int max_opportunities = 8;     // Total number of chances to flip.
float flip_chance = 0.2;       // Probability of flipping each opportunity.
float probability_true = 1.0;  // Starting probability of truth.
                               // 1.0 is "definitely true" and 0.0 is
                               // "definitely false", but you can extend this
                               // to situations where the initial value is not
                               // certain (say, 0.8 = 80% probably true) and
                               // it will work just as well.

for (int opportunities = 0; opportunities < max_opportunities; ++opportunities)
{
    probability_true = probability_true * (1 - flip_chance) +
                       (1 - probability_true) * flip_chance;
}
Here is that code on ideone (the answer for P=0.2 and B0=1 and x=8 is B8=0.508398). As you would expect, given that the value becomes less and less predictable as more and more opportunities pass, the final probability will approach Bx=0.5. You will also observe oscillations between more and less likely to be true if your chance of flipping is high; for instance, with P=0.8, the beginning of the sequence is B = {1.0, 0.2, 0.68, 0.392, 0.5648, 0.46112, ...}.
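As a cross-check, the recursion above has a simple closed form, B(x) = 0.5 + (B0 - 0.5)*(1 - 2P)^x, which explains both behaviours: the factor (1 - 2P)^x decays toward 0, and it is negative (hence the oscillation) when P > 0.5. A quick Python sketch verifying it against the loop:

P, B0, steps = 0.2, 1.0, 8

b = B0
for _ in range(steps):
    b = b * (1 - P) + (1 - b) * P  # the recursive update from above

closed_form = 0.5 + (B0 - 0.5) * (1 - 2 * P) ** steps
print(b, closed_form)  # both approx. 0.508398, matching the result above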
For a more complete solution that will work for more complicated scenarios, consider using a stochastic matrix (page 7 has an example).
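For this particular two-state chain, the stochastic matrix is just a 2x2 flip/no-flip matrix; a minimal numpy sketch using the same numbers as above:

import numpy as np

P = 0.2
T = np.array([[1 - P, P],   # row "currently true":  (stay true, flip to false)
              [P, 1 - P]])  # row "currently false": (flip to true, stay false)
v0 = np.array([1.0, 0.0])   # start definitely true

v8 = v0 @ np.linalg.matrix_power(T, 8)
print(v8[0])  # approx. 0.508398, the probability of truth after 8 opportunities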