I am trying to impose smoothness on the state covariance matrix while using frequency-domain seasonal components. I initialize my model as follows, with a local level component and a particular frequency and number of harmonics specified.
model = sm.tsa.UnobservedComponents(df, level='llevel',
                                    freq_seasonal=[{'period': 130.51, 'harmonics': 2}],
                                    stochastic_freq_seasonal=[True])
res = model.fit()
>>>
sigma2.irregular 0.730561
sigma2.level 0.187833
sigma2.freq_seasonal_130.51(2) 0.003718
This generates the parameter values shown above. Now, since I am using 2 harmonics, there are in fact 4 error variances, and I want to set them as follows:
model.ssm.state_cov[1,1,0] = 17.65
model.ssm.state_cov[2,2,0] = 0.3102
model.ssm.state_cov[3,3,0] = 17.65
model.ssm.state_cov[4,4,0] = 0.3102
And then get a 'smooth' and 'filter' object and see how they do. I know I can set the parameters under res.params, but these 4 do not appear in the parameter list. Is there a way to do this in this library?
The implementation in Statsmodels assumes a single common error variance parameter across all of the seasonal harmonic error terms, as in Harvey (1989, "Forecasting, Structural Time Series Models and the Kalman Filter"), section 2.3.4.
As a result, it's not particularly easy to set those parameters as you have suggested and then estimate the remaining parameters.
However, it is possible. For this specific case, you can set the variance parameter to 1 and then put the square roots of the variances you actually want on the diagonal of the selection matrix, as follows:
import numpy as np
import statsmodels.api as sm

model = sm.tsa.UnobservedComponents(df, level='llevel',
                                    freq_seasonal=[{'period': 130.51, 'harmonics': 2}],
                                    stochastic_freq_seasonal=[True])
model['selection', 1:, 1:] = np.diag([17.65, 0.3102, 17.65, 0.3102])**0.5

with model.fix_params({'sigma2.freq_seasonal_130.51(2)': 1}):
    res = model.fit()
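If the goal is then to obtain 'filter' and 'smooth' results at those fixed values and see how they do, one option (a minimal sketch, reusing the fitted res from above) is to re-run the Kalman filter and smoother at the estimated parameters:

filt_res = model.filter(res.params)    # results based on the Kalman filter only
smooth_res = model.smooth(res.params)  # results based on the Kalman smoother

Both return results objects analogous to res, so the usual component estimates (level, freq_seasonal, etc.) are available on them.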
I would like to use the DeepQLearning.jl package from https://github.com/JuliaPOMDP/DeepQLearning.jl. In order to do so, we have to do something similar to
using DeepQLearning
using POMDPs
using Flux
using POMDPModels
using POMDPSimulators
using POMDPPolicies
# load MDP model from POMDPModels or define your own!
mdp = SimpleGridWorld();
# Define the Q network (see Flux.jl documentation)
# the gridworld state is represented by a 2 dimensional vector.
model = Chain(Dense(2, 32), Dense(32, length(actions(mdp))))
exploration = EpsGreedyPolicy(mdp, LinearDecaySchedule(start=1.0, stop=0.01, steps=10000/2))
solver = DeepQLearningSolver(qnetwork = model, max_steps=10000,
                             exploration_policy = exploration,
                             learning_rate=0.005, log_freq=500,
                             recurrence=false, double_q=true, dueling=true, prioritized_replay=true)
policy = solve(solver, mdp)
sim = RolloutSimulator(max_steps=30)
r_tot = simulate(sim, mdp, policy)
println("Total discounted reward for 1 simulation: $r_tot")
In the line mdp = SimpleGridWorld(), we create the MDP. When I was trying to create my MDP, I ran into the problem of a very large state space. A state in my MDP is a vector in {1,2,...,m}^n for some m and n. So, when defining the function POMDPs.states(mdp::myMDP), I realized that I would have to iterate over all the states, of which there are m^n.
Am I using the package in the wrong way? Or must we iterate over the states even if there are exponentially many? If the latter, then what is the point of using Deep Q-Learning? I thought Deep Q-Learning could help when the action and state spaces are very large.
DeepQLearning does not require you to enumerate the state space and can handle continuous-space problems.
DeepQLearning.jl only uses the generative interface of POMDPs.jl. As such, you do not need to implement the states function, but only gen and initialstate (see the POMDPs.jl documentation on how to implement the generative interface).
However, due to the discrete-action nature of DQN, you also need POMDPs.actions(mdp::YourMDP), which should return an iterator over the action space.
By making those modifications to your implementation you should be able to use the solver.
The neural network in DQN takes as input a vector representation of the state. If your state is an m-dimensional vector, the neural network input will be of size m. The output size of the network will be equal to the number of actions in your model.
In the case of the grid world example, the input size of the Flux model is 2 (x, y positions) and the output size is length(actions(mdp))=4.
I am trying to produce a mathematical-operation-selection NN model, which is based on a scalar input. The operation is selected based on the softmax result produced by the NN. Then this operation has to be applied to the scalar input in order to produce the final output. So far I've come up with applying argmax and one_hot on the softmax output in order to produce a mask, which is then applied to the concatenated values matrix of all the possible operations to be performed (as shown in the pseudo-code below). The issue is that neither argmax nor one_hot appears to be differentiable. I am new to this, so any help would be highly appreciated. Thanks in advance.
#perform softmax
logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)
#perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_3(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)
#create a mask
argmax = tf.argmax(softmax, 1)
mask = tf.one_hot(argmax, num_of_operations)
#produce the output, by masking out those operation results which have not been selected
output = values * mask
I believe that this is not possible. This is similar to the hard attention described in this paper. Hard attention is used in image captioning to allow the model to focus only on a certain part of the image at each step. Hard attention is not differentiable, but there are two ways to work around this:
1- Use Reinforcement Learning (RL): RL is made to train models that make decisions. Even though the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. For a simplified example, you can consider the loss as a penalty, and send to the node with the maximum value in the softmax layer a policy gradient proportional to the penalty, in order to decrease the score of the decision if it was bad (resulted in a high loss).
2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. So instead of:
output = values * mask
Use:
output = values * softmax
Now, the operations' contributions will converge toward zero depending on how little weight the softmax gives them. This is easier to train than RL, but it won't work if you must completely remove the non-selected operations from the final result (set them to zero completely).
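For concreteness, here is a minimal sketch of the soft-attention variant, reusing the names from your pseudo-code (current_input, W, b, tf_op_1/2/3 and num_of_operations are assumed to be defined as in the question):

logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)                         # shape: [batch, num_of_operations]

op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_3(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)   # shape: [batch, num_of_operations]

# weighted mixture of all candidate operations; gradients now flow through the softmax
output = tf.reduce_sum(values * softmax, axis=1, keepdims=True)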
This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290
The general idea I am trying to realize is a seq2seq model (taken from the translate.py example in the TensorFlow models repository, based on the seq2seq class). This trains well.
Furthermore, I am using the hidden state of the RNN after all the encoding is done, right before decoding starts (I call it the "hidden state at end of encoding"). I feed this hidden state at end of encoding into a further sub-graph which I call "prices" (see below). The training gradients of this sub-graph back-propagate not only through this additional sub-graph, but also back into the encoder part of the RNN (which is what I want and need).
The plan is to add more such sub-graphs to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.
Now, during training, when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I execute the training in the following way (pseudo-code):
if global_step % 10 == 0:
    execute-the-price-training_code
else:
    execute-the-decoder-training_code
So I am not training both sub-graphs simultaneously. Now it does converge, but the encoder+decoder part converges MUCH more slowly than if I ONLY train that part and never train the prices sub-graph.
My question is: I should be able to train both sub-graphs simultaneously. But probably I have to rescale the gradients flowing back into the hidden state at end of encoding. Here we get the gradients from the prices sub-graph AND from the decoder sub-graph. How should this rescaling be done? I didn't find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.
Here is the training part of the code.
This is the (almost original) training-op preparation:
if not forward_only:
    self.gradient_norms = []
    self.updates = []
    opt = tf.train.AdadeltaOptimizer(self.learning_rate)
    for bucket_id in xrange(len(buckets)):
        tf.scalar_summary("seq2seq loss", self.losses[bucket_id])

        gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
        clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
        self.gradient_norms.append(norm)
        self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq),
                                                global_step=self.global_step))
Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:
with tf.name_scope('prices') as scope:
    #First layer
    W_price_first_layer = tf.Variable(tf.random_normal([num_layers*size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
    B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")

    self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
    self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)
    #self.activation_price_first_layer = tf.nn.relu(self.output_price_first_layer)

    #Second layer to softmax (price ranges)
    W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
    W_price_t = tf.transpose(W_price)
    B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")

    self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price), B_price)
    self.price_prediction = tf.nn.softmax(self.output_price_second_layer)

    self.label_price = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")

    #Remember the prices trainables
    var_list_prices = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
    var_list_all = tf.trainable_variables()

    #Backprop
    self.loss_price = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
    self.loss_price_scalar = tf.reduce_mean(self.loss_price)
    self.optimizer_price = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
    self.training_op_price = self.optimizer_price.minimize(self.loss_price, var_list=var_list_all)
Thanks a bunch.
I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.
Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...)), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single standard TensorFlow tf.train.Optimizer subclass. TensorFlow will then derive the appropriate back-prop computation for your split network.
The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how or if you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.
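For example, a minimal sketch along these lines, reusing the tensors from the question (self.losses[bucket_id], self.loss_price_scalar, var_list_all and max_gradient_norm as defined above; price_loss_weight is a hypothetical scaling factor):

# combine both losses into a single scalar so one optimizer back-props through the shared encoder
price_loss_weight = 1.0  # hypothetical weight; adjust if one loss dominates the other
total_loss = self.losses[bucket_id] + price_loss_weight * self.loss_price_scalar

opt = tf.train.AdadeltaOptimizer(self.learning_rate)
gradients = tf.gradients(total_loss, var_list_all)
clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
train_op = opt.apply_gradients(zip(clipped_gradients, var_list_all),
                               global_step=self.global_step)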
I have one dataset and need to do cross-validation, for example a 10-fold cross-validation, on the entire dataset. I would like to use a radial basis function (RBF) kernel with parameter selection (there are two parameters for an RBF kernel: C and gamma). Usually, people select the hyperparameters of an SVM using a dev set, and then apply the best hyperparameters based on the dev set to the test set for evaluation. However, in my case, the original dataset is partitioned into 10 subsets. Sequentially, one subset is tested using the classifier trained on the remaining 9 subsets. Obviously, we do not have fixed training and test data. How should I do hyperparameter selection in this case?
Is your data partitioned into exactly those 10 partitions for a specific reason? If not, you could concatenate/shuffle them together again and then do regular (repeated) cross-validation to perform a parameter grid search. For example, using 10 partitions and 10 repeats gives a total of 100 training and evaluation sets. These are then used to train and evaluate all parameter sets, so you will get 100 results per parameter set you tried. The average performance per parameter set can then be computed from those 100 results.
This process is already built into most ML tools, as in this short example in R using the caret library:
library(caret)
library(lattice)
library(doMC)
registerDoMC(3)

model <- train(x = iris[,1:4],
               y = iris[,5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               tuneGrid = expand.grid(C=3**(-3:3), sigma=3**(-3:3)), # all permutations of these parameters get evaluated
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,
                                        repeats = 10,
                                        returnResamp = 'all', # store results of all parameter sets on all partitions and repeats
                                        allowParallel = T))

# performance of the different parameter sets (e.g. average and standard deviation of performance)
print(model$results)

# visualization of the above
levelplot(x = Accuracy~C*sigma, data = model$results, col.regions=gray(100:0/100), scales=list(log=3))

# results of all parameter sets over all partitions and repeats; the metrics above are calculated from these
str(model$resample)
Once you have evaluated a grid of hyperparameters, you can choose a reasonable parameter set ("model selection"), e.g. by choosing a model that performs well while still being reasonably simple.
BTW: I would recommend repeated cross-validation over plain cross-validation if possible (possibly using more than 10 repeats, but the details depend on your problem); and as @christian-cerri already recommended, having an additional, unseen test set that is used to estimate the performance of your final model on new data is a good idea.
For standard Q-learning combined with a neural network, things are more or less easy.
One stores (s, a, r, s') tuples during interaction with the environment and uses
target = Qnew(s,a) = (1 - alpha) * Qold(s,a) + alpha * ( r + gamma * max_{a'} Qold(s', a') )
as the target value for the neural network approximating the Q-function. So the input of the ANN is (s, a) and the output is the scalar Qnew(s,a). Deep Q-learning papers/tutorials change the structure of the Q-function: instead of providing a single Q-value for the pair (s, a), it now provides the Q-values of all possible actions for a state s, so it is Q(s) instead of Q(s,a).
There comes my problem. The database filled with (s, a, r, s') tuples does not, for a specific state s, contain the reward for all actions, only for some, maybe just one, action. So how do I set up the target values for the network Q(s) = [Q(a_1), ..., Q(a_n)] without having all rewards for the state s in the database? I have seen different loss functions/target values, but all contain the reward.
As you can see, I am puzzled. Can someone help me? There are a lot of tutorials on the web, but this step is in general poorly described and even less well motivated when looking at the theory…
You just set the target value for the action that actually occurs in the observation (s, a, r, s'). For the bootstrap term you take the network's values for all actions at the next state and choose the maximum of them, as you wrote yourself: max_{a'} Qold(s', a'). This is then added to r(s,a), and the result is the target value. For example, assume you have 10 actions and the observation is (s_0, a=5, r(s_0, a=5)=123, s_1). Then the target value is r(s_0, a=5) + \gamma * \max_{a'} Q_target(s_1, a'). With TensorFlow it could be something like:
Q_Action = tf.reduce_sum(tf.multiply(Q_values, tf.one_hot(action, output_dim)), axis=1)  # dim: [batchSize]
in which Q_values is of shape [batchSize, output_dim]. So the output is a vector of size batchSize, and there is a target vector of the same size. The loss is the square of the difference between them.
When you calculate the loss value, you also only run the backward pass for the chosen action; the gradient from the other actions is just zero.
So you only need the reward of the action that actually appears in the observation.
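Putting that together, here is a minimal sketch of the target and loss computation (Q_values, action and output_dim are as above; Q_target_values, reward and gamma are illustrative assumptions for the target-network output at s', the reward batch and the discount factor, not names from the original post):

# Q(s, a) for the action actually taken in each transition
Q_Action = tf.reduce_sum(Q_values * tf.one_hot(action, output_dim), axis=1)   # [batchSize]

# r + gamma * max_a' Q_target(s', a'); stop_gradient keeps the bootstrap target fixed
target = reward + gamma * tf.reduce_max(Q_target_values, axis=1)              # [batchSize]
target = tf.stop_gradient(target)

# squared TD error; only the chosen action's output receives a gradient
loss = tf.reduce_mean(tf.square(Q_Action - target))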