The objective is not DCP - cvxpy

I am trying to solve the convex problem presented on page 20 of this paper.
In my opinion, the objective is convex.
cvxpy version 1.1.18
prob is DCP: False
It reports:
Exception has occurred: DCPError
Problem does not follow DCP rules.
Specifically: The objective is not DCP. Its following subexpressions
are not:
0.004 / (power(var1[0], 0.5) + power(var1[0], 0.5))
0.004 / (power(var11, 0.5) + power(var11, 0.5))
My code is:
import cvxpy as cp

K = 1000
KK = 500
delta = 1/(1000)*2
a = cp.Variable(KK)
b = cp.Variable(KK+1, nonneg=True)
tau = cp.Variable((KK, 7))
print('create problem')
print('cvx version')
print(cp.__version__)
prob = cp.Problem(cp.Minimize(cp.sum([2*delta/(cp.sqrt(b[i]) + cp.sqrt(b[i+1])) for i in range(KK)])),
                  [...])
A concise example reproduces the same problem:
import cvxpy as cp

# A non-DCP problem.
K = 5
x = cp.Variable(K+1)
prob = cp.Problem(
    cp.Minimize(cp.sum([1/(cp.sqrt(x[i]) + cp.sqrt(x[i+1])) for i in range(K)])),
    [x >= 25, x <= 100])
print("prob is DCP:", prob.is_dcp())
print('solving with cvxpy')
try:
    prob.solve()
except Exception as e:
    print(e)

Use cp.inv_pos(u) instead of 1/u.
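Applied to the concise example, a minimal sketch of the fix (assuming the intended objective is unchanged): cp.inv_pos is convex and nonincreasing, so composing it with the concave expression cp.sqrt(x[i]) + cp.sqrt(x[i+1]) satisfies the DCP composition rules.

import cvxpy as cp

K = 5
x = cp.Variable(K + 1)

# cp.inv_pos(u) is the DCP-recognized convex, nonincreasing form of 1/u (u > 0),
# so applying it to the concave argument cp.sqrt(x[i]) + cp.sqrt(x[i+1])
# follows the DCP composition rules.
objective = cp.Minimize(cp.sum([cp.inv_pos(cp.sqrt(x[i]) + cp.sqrt(x[i + 1]))
                                for i in range(K)]))
prob = cp.Problem(objective, [x >= 25, x <= 100])

print("prob is DCP:", prob.is_dcp())  # should now print True
prob.solve()
print(x.value)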

Related

Modifying the loss in ppo in stable-baselines3

I'm trying to implement an addition to the loss function of the PPO algorithm in stable-baselines3. For this I collected additional observations for the states s(t-10) and s(t+1), which I can access in the train function of the PPO class in ppo.py as part of the rollout_buffer.
I'm using a 3-layer MLP as my network architecture and need the outputs of the second layer for the triplet (s(t−α), s(t), s(t+1)) in order to calculate L = max(d(s(t+1), s(t)) − d(s(t+1), s(t−α)) + γ, 0), where d is the L2 distance.
Finally, I want to add this term to the old loss, so loss = loss + 0.3 * L.
This is my implementation, starting with the original loss in line 242:
loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss
###############################
net1 = nn.Sequential(*list(self.policy.mlp_extractor.policy_net.children())[:-1])
L_losses = []
a = 0
obs = rollout_data.observations
obs_alpha = rollout_data.observations_alpha
obs_plusone = rollout_data.observations_plusone
inds = rollout_data.inds
for i in inds:
    if i > alpha:  # only use observations for which L can be calculated
        fs_t = net1(obs[a])
        fs_talpha = net1(obs_alpha[a])
        fs_tone = net1(obs_plusone[a])
        L = max(
            th.norm(th.subtract(fs_tone, fs_t)) - th.norm(th.subtract(fs_tone, fs_talpha)) + 1.0, 0.0)
        L_losses.append(L)
    else:
        L_losses.append(0)
    a += 1
L_loss = th.mean(th.FloatTensor(L_losses))
loss += 0.3 * L_loss
So with net1 I tried to get a clone of the original network that outputs the activations of the second layer. I am unsure whether this is the right way to do this.
I have some questions about my approach, as the resulting performance is slightly worse than without the added term, although it should be slightly better:
Is my way of getting the outputs of the second layer of the MLP network working?
When loss.backward() is called, can the gradient be calculated correctly (with the new term included)?
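Regarding the first question, here is a minimal standalone sketch (not stable-baselines3-specific; the small nn.Sequential below is a hypothetical stand-in for self.policy.mlp_extractor.policy_net) showing that re-wrapping all children except the last yields the intermediate activations, and that the sliced module shares the original weights rather than cloning them:

import torch as th
import torch.nn as nn

# Hypothetical stand-in for self.policy.mlp_extractor.policy_net:
# a small MLP similar to the default stable-baselines3 architecture.
policy_net = nn.Sequential(
    nn.Linear(4, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
)

# Re-wrapping all but the last child gives a module that returns the
# activations just before that last layer; the parameters are shared
# with policy_net, not copied.
net1 = nn.Sequential(*list(policy_net.children())[:-1])

obs = th.randn(1, 4)
features = net1(obs)
print(features.shape)  # torch.Size([1, 64])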

Error in glove_event$fit_transform in text2vec package

While experimenting with word embeddings using the text2vec package in R, the following error is thrown:
embd_dim <- 5
glove_event <- GlobalVectors$new(rank = embd_dim, x_max = 10,learning_rate = 0.01, alpha = 0.95, lambda = 0.005)
wrd_embd_event <- glove_event$fit_transform(tcm_event, n_iter = 200, convergence_tol = 0.001)
Error in glove_event$fit_transform(tcm_event, n_iter = 200, convergence_tol = 0.001) :
Cost is too big, probably something goes wrong... try smaller learning rate
A smaller learning rate has not helped. Similar outcomes came from experiments with different skip_grams_window values in create_tcm() and different rank values when constructing glove_event.
I am clueless about the source of this error.

Can't replicate RStan ESS code from Vehtari paper

I am trying to replicate an ESS (effective sample size) calculation using the method of Vehtari et al. in: Rank-normalization, folding, and localization: An improved Rhat for assessing convergence of MCMC
I am working from the code here:
https://github.com/avehtari/rhat_ess/blob/master/code/monitornew.R
# Geyer's initial positive sequence
rho_hat_t <- rep.int(0, n_samples)
t <- 0
rho_hat_even <- 1
rho_hat_t[t + 1] <- rho_hat_even
rho_hat_odd <- 1 - (mean_var - mean(acov[t + 2, ])) / var_plus   # 251
rho_hat_t[t + 2] <- rho_hat_odd
while (t < nrow(acov) - 5 && !is.nan(rho_hat_even + rho_hat_odd) &&
       (rho_hat_even + rho_hat_odd > 0)) {
  t <- t + 2
  rho_hat_even = 1 - (mean_var - mean(acov[t + 1, ])) / var_plus  # 256
  rho_hat_odd = 1 - (mean_var - mean(acov[t + 2, ])) / var_plus   # 257
  if ((rho_hat_even + rho_hat_odd) >= 0) {
    rho_hat_t[t + 1] <- rho_hat_even
    rho_hat_t[t + 2] <- rho_hat_odd
  }
}
I can follow the code from the paper except when we get to equation 10 in the paper (calculating the cross-chain autocorrelation). The code (lines 251, 256 and 257) appears in the form:
1 - (mean_var - mean(acov[t + 1, ])) / var_plus
which is close to equation 10, except that it is missing the 's' terms from equation 10.
I can't see anywhere in the code that this is somehow accounted for elsewhere in the way the calculation is being done. I have tried putting the 's' terms back into those lines of code and it makes a big difference to the final ESS value.
Is anyone able to help me understand the discrepancy between paper and code?
Thanks.
In the formula in the paper, s^2 is the estimate of variance and rho the estimate of autocorrelation. Thus s^2 * rho is an estimate of the autocovariance, which is what you see in the code.
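Spelled out (with m indexing the M chains, and assuming I am reading equation 10 of the paper correctly), the estimate is
rho_hat_t = 1 - (W - (1/M) * sum_m s_m^2 * rho_hat_{t,m}) / var_plus
where W, the mean within-chain variance, is mean_var in the code. Since s_m^2 * rho_hat_{t,m} is the lag-t autocovariance estimate for chain m, and acov stores those per-chain autocovariances, the term (1/M) * sum_m s_m^2 * rho_hat_{t,m} is simply mean(acov[t + 1, ]). So
1 - (mean_var - mean(acov[t + 1, ])) / var_plus
is equation 10 with the s^2 terms already folded into acov.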

Should I exit my gradient descent loop as soon as the cost increases?

I'm trying to learn machine learning, so I'm taking a course and am currently studying gradient descent for linear regression. I just learned that if the learning rate is small enough, the value returned by the cost function should continuously decrease until convergence. When I imagine this being done in a loop of code, it seems like I could just keep track of what the cost was in the previous iteration and exit the loop if the new cost is greater than the previous one, since this tells us the learning rate is too large. I'd like to hear opinions since I'm new to this, but in an effort not to make this question primarily opinion-based, my main question is this: Would there be anything wrong with this method of detecting a learning rate that needs to be decreased? I'd appreciate an example of when this method would fail, if possible.
In the example below, we vary the learning rate eta = 10^k with k = -6, -5, -4, ..., 0.
import math
import numpy as np

# Rosenbrock-style test function and its gradient
def f(x):
    return 100 * (x[0] * x[0] - x[1]) ** 2 + (x[0] - 1) ** 2

def df(x):
    a = x[0] * x[0] - x[1]
    ret = np.zeros(2)
    ret[0] = 400 * a * x[0] + 2 * (x[0] - 1)
    ret[1] = -200 * a
    return ret

for k in range(-6, 1):
    eta = math.pow(10.0, k)
    print("eta: " + str(eta))
    x = -np.ones(2)
    for iter in range(1000000):
        fx = f(x)
        if fx < 1e-10:
            print("  solved after " + str(iter) + " iterations; f(x) = " + str(f(x)))
            break
        if fx > 1e10:
            print("  divergence detected after " + str(iter) + " iterations; f(x) = " + str(f(x)))
            break
        g = df(x)
        x -= eta * g
    if iter == 999999:
        print("  not solved; f(x) = " + str(f(x)))
For too small learning rates, the optimization is very slow and the problem is not solved within the iteration budget.
For too large learning rates, the optimization process becomes unstable and diverges very quickly. The learning rate must be "just right" for the optimization process to work well.
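To illustrate the check proposed in the question, here is one possible variant of the inner loop (reusing f and df from above, with a fixed, hypothetical eta) that exits as soon as the cost increases. Note that with exact full-batch gradients a rising cost does suggest the step was too large at that point, but with stochastic or mini-batch gradients the cost can fluctuate upward even at a good learning rate, so this check would trigger too eagerly there.

eta = 0.001  # hypothetical fixed learning rate
x = -np.ones(2)
prev_fx = float("inf")
for iter in range(1000000):
    fx = f(x)
    if fx > prev_fx:
        # Cost increased: with full-batch gradient descent this suggests
        # the learning rate is too large for the current region.
        print("cost increased at iteration " + str(iter) + "; f(x) = " + str(fx))
        break
    prev_fx = fx
    x -= eta * df(x)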

Understanding code wrt Logistic Regression using gradient descent

I was following Siraj Raval's videos on logistic regression using gradient descent:
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations, so that the function converges (slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code:
from numpy import array

# points: an (N, 2) array of (x, y) data points, loaded elsewhere in the script
def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]

# The above functions are called below:
learning_rate = 0.0001
initial_b = 0  # initial y-intercept guess
initial_m = 0  # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why do the values of b & m continue to update for all the iterations? After a certain number of iterations, the function will converge, when we find the values of b & m that give slope = 0.
So why do we continue iterating after that point and keep updating b & m?
This way, aren't we losing the 'correct' b & m values? How is the learning rate helping the convergence process if we continue to update values after converging? And why is there no check for convergence, and so how is this actually working?
In practice, you will most likely not reach a slope of 0 exactly. Think of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot the lowest point of the bowl. Conversely, if the learning rate is too low, learning will be too slow and won't reach the lowest point of the bowl before all iterations are done.
That's why in machine learning, the learning rate is an important hyperparameter to tune.
Actually, once we reach a slope of 0, b_gradient and m_gradient become 0; thus, for:
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m will keep the old, correct values, as nothing is subtracted from them.
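A quick numeric check of that claim, using step_gradient from above on a toy dataset where the exact fit (m = 2, b = 1) is known:

import numpy as np

# Toy data lying exactly on y = 2x + 1, so the optimum is m = 2, b = 1.
points = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])

# At the optimum, both gradients are 0, so an update changes nothing.
new_b, new_m = step_gradient(1.0, 2.0, points, 0.0001)
print(new_b, new_m)  # 1.0 2.0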
