I am using JAX to implement a simple neural network (NN), and I want to access and save the gradients from the backward pass for analysis after the network has finished training. I can inspect the gradients temporarily with the Python debugger (as long as I am not using jit), but I want to save all gradients over the whole training process and analyze them afterwards. I have come up with a rather hacky solution using id_tap and a global variable (see the code below), but I was wondering whether there is a better solution that does not violate the functional principles of JAX.
Many thanks!
import jax.numpy as jnp
from jax import grad, jit, vmap, random, custom_vjp
from jax.experimental.host_callback import id_tap
# experimental solution
global_save_list = {'x':[],'w':[],'g':[],'des':[]}
def global_save_func(ctx, des):
    x, w, g = ctx
    global_save_list['x'].append(x)
    global_save_list['w'].append(w)
    global_save_list['g'].append(g)
    global_save_list['des'].append(des)

@custom_vjp
def qmvm(x, w):
    return jnp.dot(x, w)

def qmvm_fwd(x, w):
    return qmvm(x, w), (x, w)

def qmvm_bwd(ctx, g):
    x, w = ctx
    # here I would like to save the gradients g - or at least running statistics of them
    # experimental solution with id_tap
    id_tap(global_save_func, (x, w, g))
    fwd_grad = jnp.dot(g, w.transpose())
    w_grad = jnp.dot(x.transpose(), g)  # dL/dw = x^T g
    return fwd_grad, w_grad

qmvm.defvjp(qmvm_fwd, qmvm_bwd)
def run_nn(x, w):
    out = qmvm(x, w)    # 1st MVM
    out = qmvm(out, w)  # 2nd MVM
    return out

run_nn_batched = vmap(run_nn)

@jit
def loss(x, w, target):
    out = run_nn_batched(x, w)
    return jnp.sum((out - target)**2)
key = random.PRNGKey(42)
subkey1, subkey2, subkey3 = random.split(key, 3)
A = random.uniform(subkey1, (10, 10, 10), minval = -10, maxval = 10)
B = random.uniform(subkey2, (10, 10, 10), minval = -10, maxval = 10)
C = random.uniform(subkey3, (10, 10, 10), minval = -10, maxval = 10)
for e in range(10):
    gval = grad(loss, argnums=0)(A, B, C)
    # some type of update rule
    # here I would like to access the gradients, preferably knowing to which MVM (1st or 2nd) and example they belong
    # experimental solution:
    print(global_save_list)
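For comparison, here is a purely functional alternative I have been playing with (just a sketch, only tried on this toy setup): thread explicit zero-valued "taps" through the forward pass and differentiate with respect to them. The gradient of the loss with respect to each tap is exactly the gradient flowing into the corresponding MVM output, so the values come back as ordinary return values (and compose with jit) instead of going through a host callback. The names run_nn_tapped, loss_tapped and zero_taps are mine, not part of the code above.
# sketch only: run_nn_tapped / loss_tapped / zero_taps are illustrative names
def run_nn_tapped(x, w, taps):
    # adding a zero tap after each MVM leaves the forward value unchanged,
    # but grad w.r.t. the tap equals the gradient entering that MVM output
    out1 = qmvm(x, w) + taps[0]      # 1st MVM
    out2 = qmvm(out1, w) + taps[1]   # 2nd MVM
    return out2

def loss_tapped(x, w, target, taps):
    out = vmap(run_nn_tapped)(x, w, taps)
    return jnp.sum((out - target)**2)

zero_taps = (jnp.zeros_like(A), jnp.zeros_like(A))
x_grad, tap_grads = grad(loss_tapped, argnums=(0, 3))(A, B, C, zero_taps)
# tap_grads[0] and tap_grads[1] hold the per-example gradients at the
# 1st and 2nd MVM, as ordinary arrays that can be appended to a Python list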
Recently I have been writing code that fits curves (a power law and a truncated power law) to a series of data on a log-log scale graph.
I found scipy.optimize.curve_fit, and the code I wrote is as follows.
Power law (the x and y values of the data are self.fx_ti and self.fy_pti, respectively):
def Powerlaw(self, event):
    fig, ax = plt.subplots(2, 1)
    ax[0].plot(self.fx_ti, self.fy_pti, 'ko', alpha=0.7, markersize=4)
    ax[0].set_title("Probability Distribution")
    ax[0].set(ylabel="P" + r"($\tau_i$) " + r'$s^{-1}$', xlabel=r"$\tau _i$" + " (s)")

    def power_law(ti, a0, mi):
        return a0 * ti**(-mi)

    popt, pcov = curve_fit(power_law, self.fx_ti, self.fy_pti, p0=[0, 0])
    ax[0].plot(self.fx_ti, power_law(self.fx_ti, *popt), 'b', lw=1)
    res = self.fy_pti - power_law(self.fx_ti, *popt)
    ax[1].plot(self.fx_ti, res, "ko", markersize=4)
    ax[0].set_xscale('log')
    ax[0].set_yscale('log')
    plt.ion()
    plt.show()
Running this code gives a pretty good fit.
The problem is with the 'truncated power law' fit, even though the approach is no different from the general power-law fit.
The code I wrote is as follows:
def Truncated(self, event):
    fig, ax = plt.subplots(2, 1)
    ax[0].plot(self.fx_ti, self.fy_pti, 'ko', alpha=0.7, markersize=4)
    ax[0].set_title("Probability Distribution")
    ax[0].set(ylabel="P" + r"($\tau_i$) " + r'$s^{-1}$', xlabel=r"$\tau _i$" + " (s)")

    def truncated_power_law(x, a, b, c):
        return a * x**(-b) * np.exp(-x / c)

    popt, pcov = curve_fit(truncated_power_law, self.fx_ti, self.fy_pti)
    ax[0].plot(self.fx_ti, truncated_power_law(self.fx_ti, *popt), 'b', lw=1)
    res = self.fy_pti - truncated_power_law(self.fx_ti, *popt)
    ax[1].plot(self.fx_ti, res, "ko", markersize=4)
    ax[0].set_xscale('log')
    ax[0].set_yscale('log')
    plt.ion()
    plt.show()
You can see the result in the picture I attached. Any help with getting the truncated power-law fit to work would be much appreciated.
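One thing I have been experimenting with (just a sketch, and I'm not certain it is the right fix) is giving curve_fit a rough initial guess and positive bounds for the truncated fit, since the default p0 of all ones can be far from the optimum for power-law-shaped data. This assumes self.fx_ti and self.fy_pti are NumPy arrays:
# illustrative starting values, inside Truncated, replacing the plain curve_fit call
p0 = [self.fy_pti.max(), 1.0, self.fx_ti.max()]   # rough guesses for a, b, c
popt, pcov = curve_fit(truncated_power_law, self.fx_ti, self.fy_pti,
                       p0=p0, bounds=(0, np.inf))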
I'm trying to implement a gradient-free optimizer function to train convolutional neural networks with Julia using Flux.jl. The reference paper is this: https://arxiv.org/abs/2005.05955. This paper proposes RSO, a gradient-free optimization algorithm that updates a single weight at a time on a sampling basis. The pseudocode of this algorithm is depicted in the picture below.
(optimizer pseudocode figure)
I'm using MNIST dataset.
function train(; kws...)
args = Args(; kws...) # collect options in a struct for convenience
if CUDA.functional() && args.use_cuda
#info "Training on CUDA GPU"
CUDA.allwoscalar(false)
device = gpu
else
#info "Training on CPU"
device = cpu
end
# Prepare datasets
x_train, x_test, y_train, y_test = getdata(args, device)
# Create DataLoaders (mini-batch iterators)
train_loader = DataLoader((x_train, y_train), batchsize=args.batchsize, shuffle=true)
test_loader = DataLoader((x_test, y_test), batchsize=args.batchsize)
# Construct model
model = build_model() |> device
ps = Flux.params(model) # model's trainable parameters
best_param = ps
if args.optimiser == "SGD"
# Regular training step with SGD
elseif args.optimiser == "RSO"
# Run RSO function and update ps
best_param .= RSO(x_train, y_train, args.RSOupdate, model, args.batchsize, device)
end
And the corresponding RSO function:
function RSO(X,L,C,model, batch_size, device)
"""
model = convolutional model structure
X = Input data
L = labels
C = Number of rounds to update parameters
W = Weight set of layers
Wd = Weight tensors of layer d that generates an activation
wid = weight tensor that generates an activation aᵢ
wj = a weight in wid
"""
# Normalize input data to have zero mean and unit standard deviation
X .= (X .- mean(X)) ./ std(X)
train_loader = DataLoader((X, L), batchsize=batch_size, shuffle=true)
#println("model = $(typeof(model))")
std_prep = []
σ_d = Float64[]
D = 1
for layer in model
D += 1
Wd = Flux.params(layer)
# Initialize the weights of the network with Gaussian distribution
for id in Wd
wj = convert(Array{Float32, 4}, rand(Normal(0, sqrt(2/length(id))), (3,3,4,4)))
id = wj
append!(std_prep, vec(wj))
end
# Compute std of all elements in the weight tensor Wd
push!(σ_d, std(std_prep))
end
W = Flux.params(model)
# Weight update
for _ in 1:C
d = D
while d > 0
for id in 1:length(W[d])
# Randomly sample change in weights from Gaussian distribution
for j in 1:length(w[d][id])
# Randomly sample mini-batch
(x, l) = train_loader[rand(1:length(train_loader))]
# Sample a weight from normal distribution
ΔWj[d][id][j] = rand(Normal(0, σ_d[d]), 1)
loss, acc = loss_and_accuracy(data_loader, model, device)
W = argmin(F(x,l, W+ΔWj), F(x,l,W), F(x,l, W-ΔWj))
end
end
d -= 1
end
end
return W
end
The problem here is the second block of the RSO function. I'm trying to evaluate the loss with a change to a single weight in three scenarios, F(x, l, W+ΔWj), F(x, l, W) and F(x, l, W-ΔWj), and choose the weight set with the minimum loss. But how do I do that using Flux.jl? The loss function I'm trying to use is logitcrossentropy(ŷ, y, agg=sum). To generate ŷ I need to run the model with the current weight set W, but changing a single weight parameter inside Zygote.Params() has already proven challenging...
Based on the paper you shared, it looks like you need to change the weight arrays per each output neuron per each layer. Unfortunately, this means that the implementation of your optimization routine is going to depend on the layer type, since an "output neuron" for a convolution layer is quite different than a fully-connected layer. In other words, just looping over Flux.params(model) is not going to be sufficient, since this is just a set of all the weight arrays in the model and each weight array is treated differently depending on which layer it comes from.
Fortunately, Julia's multiple dispatch does make this easier to write if you use separate functions instead of a giant loop. I'll summarize the algorithm using the pseudo-code below:
for layer in model
    for output_neuron in layer
        for weight_element in parameters(output_neuron)
            weight_element = sample(N(0, sqrt(2 / num_outputs(layer))))
        end
    end
    sigmas[layer] = stddev(parameters(layer))
end

for c in 1 to C
    for layer in reverse(model)
        for output_neuron in layer
            for weight_element in parameters(output_neuron)
                x, y = sample(batches)
                dw = N(0, sigmas[layer])
                # optimize weights
            end
        end
    end
end
It's the for output_neuron ... portions that we need to isolate into separate functions.
In the first block, we don't actually do anything different to every weight_element, they are all sampled from the same normal distribution. So, we don't actually need to iterate the output neurons, but we do need to know how many there are.
using Statistics: std

# this function will set the weights according to the
# normal distribution and the number of output neurons
# it also returns the standard deviation of the weights
function sample_weight!(layer::Dense)
    sample = randn(eltype(layer.weight), size(layer.weight))
    num_outputs = size(layer.weight, 1)
    # notice the "." notation which is used to mutate the array
    # scale the N(0, 1) samples so they have std sqrt(2 / num_outputs)
    layer.weight .= sample .* sqrt(2 / num_outputs)
    return std(layer.weight)
end

function sample_weight!(layer::Conv)
    sample = randn(eltype(layer.weight), size(layer.weight))
    num_outputs = size(layer.weight, 4)
    # notice the "." notation which is used to mutate the array
    # scale the N(0, 1) samples so they have std sqrt(2 / num_outputs)
    layer.weight .= sample .* sqrt(2 / num_outputs)
    return std(layer.weight)
end

sigmas = map(sample_weight!, model)
Now, for the second block, we will do a similar trick by defining different functions for each layer.
function optimize_layer!(loss, layer::Dense, data, sigma)
    for i in 1:size(layer.weight, 1)
        for j in 1:size(layer.weight, 2)
            wj = layer.weight[i, j]
            x, y = data[rand(1:length(data))]
            dw = randn() * sigma
            ws = [wj + dw, wj, wj - dw]
            losses = zeros(Float32, length(ws))
            for (k, w) in enumerate(ws)
                layer.weight[i, j] = w
                losses[k] = loss(x, y)
            end
            layer.weight[i, j] = ws[argmin(losses)]
        end
    end
end
function optimize_layer!(loss, layer::Conv, data, sigma)
    for i in 1:size(layer.weight, 4)
        # we use a view to reference the full kernel
        # for this output channel
        wid = view(layer.weight, :, :, :, i)
        # eachindex lets us treat wid like a vector
        for j in eachindex(wid)
            wj = wid[j]
            x, y = data[rand(1:length(data))]
            dw = randn() * sigma
            ws = [wj + dw, wj, wj - dw]
            losses = zeros(Float32, length(ws))
            for (k, w) in enumerate(ws)
                wid[j] = w
                losses[k] = loss(x, y)
            end
            wid[j] = ws[argmin(losses)]
        end
    end
end
for c in 1:C
    # iterate the layers from the output back toward the input
    for (layer, sigma) in reverse(collect(zip(model, sigmas)))
        optimize_layer!(layer, data, sigma) do x, y
            logitcrossentropy(model(x), y; agg = sum)
        end
    end
end
Notice that nowhere did I use Flux.params, which does not help us here. Also, Flux.params would include both the weight and the bias, and the paper doesn't look like it bothers with the bias at all. If you had an optimization method that generically optimized every parameter in the same way regardless of layer type (i.e. like gradient descent), then you could use for p in Flux.params(model) ....
Thanks @darsnack :)
I found your answer a bit late, so in the meantime I figured out my own script that works. It is admittedly a bit hardcoded, but could you also give feedback on it?
function RSO(train_loader, test_loader, C,model, batch_size, device, args)
"""
model = convolutional model structure
C = Number of rounds to update parameters (epochs)
batch_size = size of the mini batch that will be used to calculate loss
device = CPU or GPU
"""
# Evaluate initial weight
test_loss, test_acc = loss_and_accuracy(test_loader, model, device)
println("Initial Weight:")
println(" test_loss = $test_loss, test_accuracy = $test_acc")
random_batch = []
for (x, l) in train_loader
push!(random_batch, (x,l))
end
# Initialize weights
std_prep = []
σ_d = Float64[]
D = 0
for layer in model
D += 1
Wd = Flux.params(layer)
# Initialize the weights of the network with Gaussian distribution
for id in Wd
if typeof(id) == Array{Float32, 4}
wj = convert(Array{Float32, 4}, rand(Normal(0, sqrt(2/length(id))), size(id)))
elseif typeof(id) == Vector{Float32}
wj = convert(Vector{Float32}, rand(Normal(0, sqrt(2/length(id))), length(id)))
elseif typeof(id) == Matrix{Float32}
wj = convert(Matrix{Float32}, rand(Normal(0, sqrt(2/length(id))), size(id)))
end
id = wj
append!(std_prep, vec(wj))
end
# Compute std of all elements in the weight tensor Wd
push!(σ_d, std(std_prep))
end
# Weight update
for c in 1:C
d = D
# First update the weights of the layer closest to the labels
# and then sequentially move closer to the input
while d > 0
Wd = Flux.params(model[d])
for id in Wd
# Randomly sample change in weights from Gaussian distribution
for j in 1:length(id)
# Randomly sample mini-batch
(x, y) = rand(random_batch, 1)[1]
x, y = device(x), device(y)
# Sample a weight from normal distribution
ΔWj = rand(Normal(0, σ_d[d]), 1)[1]
# Weight update with three scenario
## F(x,l, W+ΔWj)
id[j] = id[j]+ΔWj
ŷ = model(x)
ls_pos = logitcrossentropy(ŷ, y, agg=sum) / size(x)[end]
## F(x,l,W)
id[j] = id[j]-ΔWj
ŷ = model(x)
ls_org = logitcrossentropy(ŷ, y, agg=sum) / size(x)[end]
## F(x,l, W-ΔWj)
id[j] = id[j]-ΔWj
ŷ = model(x)
ls_neg = logitcrossentropy(ŷ, y, agg=sum) / size(x)[end]
# Check weight update that gives minimum loss
min_loss = argmin([ls_org, ls_pos, ls_neg])
# Save weight update with minimum loss
if min_loss == 1
id[j] = id[j] + ΔWj
elseif min_loss == 2
id[j] = id[j] + 2*ΔWj
elseif min_loss == 3
id[j] = id[j]
end
end
end
d -= 1
end
train_loss, train_acc = loss_and_accuracy(train_loader, model, device)
test_loss, test_acc = loss_and_accuracy(test_loader, model, device)
track!(args.tracker, test_acc)
println("RSO Round=$c")
println(" train_loss = $train_loss, train_accuracy = $train_acc")
println(" test_loss = $test_loss, test_accuracy = $test_acc")
end
return Flux.params(model)
end
I'm getting this error:
DimensionMismatch("second dimension of A, 1, does not match length of x, 20")
for the following code. I'm trying to train a model on some sample data. I'm using the Flux machine learning library in Julia.
I've checked my dimensions and they seem right to me. What is the problem?
using Flux
using Flux: mse
data = [(i,i) for i in 1:20]
x = [i for i in 1:20]
y = [i for i in 1:20]
m = Chain(
Dense(1, 10, relu),
Dense(10, 1),
softmax)
opt = ADAM(params(m))
loss(x, y) = mse(m(x), y)
evalcb = () -> @show(loss(x, y))
accuracy(x, y) = mean(argmax(m(x)) .== argmax(y))
#this line gives the error
Flux.train!(loss, data, opt,cb = throttle(evalcb, 10))
Your first dense layer has a weight matrix whose size is 10x1. You can check it as follows:
m.layers[1].W
So your data should have size 1x20 so that it can be multiplied by the weights in the chain.
x = reshape(x,1,20)
opt = ADAM(params(m))
loss(x, y) = mse(m(x), y)
evalcb = () -> @show(loss(x, y))
accuracy(x, y) = mean(argmax(m(x)) .== argmax(y))
#Now it should work.
Flux.train!(loss, data, opt,cb = Flux.throttle(evalcb, 10))
I'm trying to get a basic LSTM working in TensorFlow. I'm receiving the following error:
TypeError: 'Tensor' object is not iterable.
The offending line is:
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, x, sequence_length=seqlen, initial_state=init_state)
I'm using TensorFlow 1.0.1 on Windows 7. My inputs and labels have the following shapes:
x_shape = (50, 40, 18), y_shape = (50, 40)
Where:
batch size = 50
sequence length = 40
input vector length at each step = 18
I'm building my graph as follows
def build_graph(learn_rate, seq_len, state_size=32, batch_size=5):
    # use a fixed sequence length
    seqlen = tf.constant(seq_len, shape=[batch_size], dtype=tf.int32)

    # Placeholders
    x = tf.placeholder(tf.float32, [batch_size, None, 18])
    y = tf.placeholder(tf.float32, [batch_size, None])
    keep_prob = tf.constant(1.0)

    # RNN
    cell = tf.contrib.rnn.LSTMCell(state_size)
    init_state = tf.get_variable('init_state', [1, state_size],
                                 initializer=tf.constant_initializer(0.0))
    init_state = tf.tile(init_state, [batch_size, 1])
    rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, x, sequence_length=seqlen,
                                                 initial_state=init_state)

    # Add dropout, as the model otherwise quickly overfits
    rnn_outputs = tf.nn.dropout(rnn_outputs, keep_prob)

    # Prediction layer
    with tf.variable_scope('prediction'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
        preds = tf.tanh(tf.matmul(rnn_outputs, W) + b)

    # MSE
    loss = tf.square(tf.subtract(y, preds))
    # loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y))
    train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss)
Can anyone tell me what I am missing?
Sequence length should be iterable e.g. a list or tensor, not a scalar. In your case specifically, you need to replace sequence length = 40 with a list of the lengths of each input. For instance, if your first sequence has 10 steps, the second 13 and the third 18, you would pass in [10, 13, 18]. This lets TensorFlow's dynamic RNN know how many steps to unroll for (I believe it uses a while loop internally).
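To illustrate (a rough sketch of my own on the TF 1.x API, reusing the names from the question's build_graph): one option is to make seqlen a placeholder and feed the per-example lengths at run time.
# illustrative sketch; replaces the tf.constant seqlen inside build_graph
seqlen = tf.placeholder(tf.int32, [batch_size])
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, x,
                                             sequence_length=seqlen,
                                             initial_state=init_state)
# ...later, at training time:
# sess.run(train_step, feed_dict={x: x_batch, y: y_batch, seqlen: [10, 13, 18, ...]})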
I'm trying to use the method cv2.estimateAffine3D but without success. Here is my code sample :
import numpy as np
import cv2
shape = (1, 4, 3)
source = np.zeros(shape, np.float32)
# [x, y, z]
source[0][0] = [857, 120, 854]
source[0][1] = [254, 120, 855]
source[0][2] = [256, 120, 255]
source[0][3] = [858, 120, 255]
target = source * 10
retval, M, inliers = cv2.estimateAffine3D(source, target)
When I try to run this sample, I obtain the same error as this other post here.
I'm using OpenCV 2.4.3 and Python 2.7.3
Please help me!
This is a known bug that is fixed in 2.4.4.
http://code.opencv.org/issues/2375
If you just need rigid (rotation + translation) alignment, here's the standard method:
def get_rigid(src, dst):  # Assumes both are Nx3 matrices
    src_mean = src.mean(0)
    dst_mean = dst.mean(0)
    # Compute covariance
    H = reduce(lambda s, (a, b): s + np.outer(a, b), zip(src - src_mean, dst - dst_mean), np.zeros((3, 3)))
    u, s, v = np.linalg.svd(H)
    R = v.T.dot(u.T)  # Rotation
    T = - R.dot(src_mean) + dst_mean  # Translation
    return np.hstack((R, T[:, np.newaxis]))
For Python 3, change the covariance line in the previous post to H = reduce(lambda s, a: s + np.outer(a[0], a[1]), zip(src - src_mean, dst - dst_mean), np.zeros((3,3))), since tuple unpacking in lambdas was removed in Python 3 (and reduce must be imported from functools there). Can't comment because of reputation score.
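As a quick sanity check of get_rigid (my own toy example, not from the answer), recovering a pure translation applied to the question's points:
# my own toy example: dst is src shifted by a known translation
src = np.array([[857, 120, 854],
                [254, 120, 855],
                [256, 120, 255],
                [858, 120, 255]], dtype=np.float64)
dst = src + np.array([5.0, -3.0, 2.0])   # pure translation, so R should be ~identity
M = get_rigid(src, dst)                  # 3x4 matrix [R | T]
print(M[:, 3])                           # approximately [5, -3, 2]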