When setting up a neural network, or any numeric optimization system using gradient descent, it's necessary to provide initial values for the weights (or whatever the system's parameters are called).
One strategy is to initialize them to random values (set the random number seed to a known value, and change it for a different starting point). But this isn't always desirable; e.g. right now I'm comparing accuracy in single versus double precision, and the TensorFlow random number generator outputs different values in each case. So I'm talking about a scenario where the initial values will be nonrandom.
Some initial value must be provided. In the absence of any information to specify a value, what should it be? The most obvious values are 0.0 and 1.0. Is there a reason to prefer one of those over the other? Or is there some other value that tends to be preferable for some reason?
As sascha observes, constant initial weights aren't a solution in general anyway, because you have to break symmetry. A better solution for the particular context in which I came across the problem: a random number generator that gives the same sequence regardless of type.
import numpy as np
import tensorflow as tf

dtype = np.float64

# Random number generator that returns the correct type
# and returns the same sequence regardless of type
def rnd(shape=(), **kwargs):
    if isinstance(shape, (int, float)):
        shape = (int(shape),)
    # Always sample in float64 so the sequence does not depend on dtype,
    # then cast down if single precision was requested
    x = tf.random_normal(shape, dtype=np.float64, **kwargs)
    if dtype == np.float32:
        x = tf.to_float(x)
    return x.eval()  # requires a default TF session
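A minimal usage sketch (assuming the TF1-style session API, since eval() needs a default session; stddev and seed are simply passed through to tf.random_normal):

with tf.Session():
    w = rnd((3, 2), stddev=0.1, seed=0)  # 3x2 initial weight matrix
    b = rnd(2, seed=1)                   # length-2 initial bias vector
    # Per the approach above, w and b hold the same numbers whether dtype is
    # np.float32 or np.float64, since sampling always happens in float64.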
I am trying to impose smoothness on the state covariance matrix while using frequency-domain seasonal components. I initialize my model as follows, with a local level component and a particular frequency and number of harmonics specified.
import statsmodels.api as sm

model = sm.tsa.UnobservedComponents(df, level='llevel',
                                    freq_seasonal=[{'period': 130.51, 'harmonics': 2}],
                                    stochastic_freq_seasonal=[True])
res = model.fit()
>>>
sigma2.irregular 0.730561
sigma2.level 0.187833
sigma2.freq_seasonal_130.51(2) 0.003718
This generates the parameter values shown above. Now, since I am using 2 harmonics, there are in fact 4 error variances, and I want to set them as follows:
model.ssm.state_cov[1,1,0] = 17.65
model.ssm.state_cov[2,2,0] = 0.3102
model.ssm.state_cov[3,3,0] = 17.65
model.ssm.state_cov[4,4,0] = 0.3102
And then get 'smooth' and 'filter' objects and see how they do. I know I can set the parameters under res.params, but these 4 do not appear in the parameter list. Is there a way to do this in this library?
The implementation in statsmodels assumes a single common error variance parameter across all of the seasonal harmonic error terms, as in Harvey (1989, "Forecasting, Structural Time Series Models and the Kalman Filter"), section 2.3.4.
As a result, it's not particularly easy to set those parameters as you have suggested and then estimate the remaining parameters.
However, it is possible. For this specific case, you can set the variance parameters to 1 and then put the square root of the variance terms you actually want into the diagonals of the selection matrix, as follows:
import numpy as np

model = sm.tsa.UnobservedComponents(df, level='llevel',
                                    freq_seasonal=[{'period': 130.51, 'harmonics': 2}],
                                    stochastic_freq_seasonal=[True])
model['selection', 1:, 1:] = np.diag([17.65, 0.3102, 17.65, 0.3102])**0.5

with model.fix_params({'sigma2.freq_seasonal_130.51(2)': 1}):
    res = model.fit()
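The fitted results then carry both the Kalman filter and smoother output, so (a small sketch, assuming the fit above ran and that you want to check how the fixed variances feed through):

print(res.params)                 # sigma2.freq_seasonal_130.51(2) stays at its fixed value of 1
print(res.filtered_state.shape)   # (k_states, nobs): one-sided (filtered) state estimates
print(res.smoothed_state.shape)   # (k_states, nobs): two-sided (smoothed) state estimates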
Let's say we have an algorithm that, given a dataset point, runs some analysis on it and returns the results. The algorithm has a user-defined parameter X that affects its run-time (the result of the algorithm is always the same for the same input point). Also, we already know that there's a relation between the dataset point and the parameter X. For instance, if two dataset points are close to each other, their parameter X will also be the same.
Can we say that in this example we have the following and thus can use Q-Learning to find the best parameter X given any dataset point?
Initial state: dataset point, current value of X (for initial state = 0)
Terminal state: dataset point, current value of X (the value chosen based on action)
Actions: Different values that X can have
Reward: -1 if execution time decreases, +1 if it increases, 0 if it stays the same
Is it correct if we define different input dataset points as episodes and different values of X as the steps in each episode (where in each step, an action is chosen either randomly or via the network)? In this case, what would be the input to the neural network?
Since all of the examples and implementations I've seen so far contain several states, where each state depends on the previous one, I'm confused by my scenario, where I only have two states.
In an autoregressive continuous problem, when zeros take up too much of the data, it is possible to treat the situation as a zero-inflated problem (i.e. ZIB). In other words, instead of working to fit f(x), we want to fit g(x)*f(x), where f(x) is the function we want to approximate, i.e. y, and g(x) is a function which outputs a value between 0 and 1 depending on whether the value is zero or non-zero.
Currently, I have two models: one that gives me g(x) and another that fits g(x)*f(x).
The first model gives me a set of weights, and this is where I need your help. I can use the sample_weight argument with model.fit(). But as I work with a tremendous amount of data, I need to use model.fit_generator(), and fit_generator() does not have a sample_weight argument.
Is there a workaround to use sample weights with fit_generator()? Otherwise, how can I fit g(x)*f(x), knowing that I already have a trained model for g(x)?
You can provide sample weights as the third element of the tuple returned by the generator. From Keras documentation on fit_generator:
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. The output of the generator must be either
a tuple (inputs, targets), or
a tuple (inputs, targets, sample_weights).
Update: Here is a rough sketch of a generator that returns the input samples and targets as well as the sample weights obtained from model g(x):
def gen(args):
    while True:
        for i in range(num_batches):
            # get the i-th batch of data
            inputs = ...
            targets = ...
            # get the sample weights from the already-trained g(x) model
            weights = g.predict(inputs)
            yield inputs, targets, weights

model.fit_generator(gen(args), steps_per_epoch=num_batches, ...)
Any binary one-hot encoding is aware only of values seen in training, so feature values not encountered during fitting will be silently ignored. For real-time use, where you have millions of records per second and features have very high cardinality, you need to keep your hasher/mapper updated with the data.
How can we do an incremental update to the hasher (rather than recomputing the entire fit() every time we encounter a new feature-value pair)? What is the suggested approach to tackle this?
It depends on the learning algorithm that you are using. If you are using a method designed for sparse data sets (FTRL, FFM, linear SVM), one possible approach is the following (note that it will introduce collisions in the features and a lot of constant columns).
First, allocate for each sample a vector V of length D, with D as large as possible.
For each categorical variable, evaluate hash(var_name + "_" + var_value) % D. This gives you an integer i, and you can store V[i] = 1.
Therefore, V never grows larger as new features appear. However, as soon as the number of features is large enough, some features will collide (i.e. be written at the same place), and this may result in an increased error rate.
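A minimal sketch of that hashing scheme in plain Python (the helper name, the use of MD5, and the fixed dimension D are illustrative assumptions, not from any particular library):

import hashlib

D = 2 ** 20  # fixed, as-large-as-possible dimension of V

def hashed_indices(sample, D=D):
    # Map {var_name: var_value} pairs to indices of the ones in an
    # implicit length-D vector V; D never changes as new values appear.
    indices = set()
    for var_name, var_value in sample.items():
        key = "{}_{}".format(var_name, var_value).encode("utf-8")
        i = int(hashlib.md5(key).hexdigest(), 16) % D
        indices.add(i)  # i.e. V[i] = 1
    return indices

# Unseen values still land somewhere in [0, D), possibly colliding with others.
print(hashed_indices({"country": "FR", "browser": "firefox"}))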
Edit: you can write your own vectorizer to avoid collisions. First, call L the current number of features. Prepare the same vector V, of length 2L (the factor of 2 will allow you to avoid collisions as new features arrive, at least for some time, depending on the arrival rate of new features).
Starting with an empty dictionary<input_type, int>, associate an integer to each feature. If you have already seen the feature, return the int corresponding to it. If not, create a new entry with an integer corresponding to the new index. I think (but I am not sure) this is what LabelEncoder does for you.
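A rough sketch of that incremental mapping (a hypothetical helper, intended to mimic what LabelEncoder would do but assigning indices on the fly instead of requiring a full fit()):

class IncrementalEncoder:
    # Assign a new column index the first time a feature/value pair is seen.
    def __init__(self):
        self.index = {}  # (var_name, var_value) -> int

    def encode(self, var_name, var_value):
        key = (var_name, var_value)
        if key not in self.index:
            self.index[key] = len(self.index)  # next free index, no collisions
        return self.index[key]

enc = IncrementalEncoder()
print(enc.encode("country", "FR"))   # 0
print(enc.encode("browser", "ff"))   # 1
print(enc.encode("country", "FR"))   # 0 again: an already-seen pair keeps its index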
For standard Q-learning combined with a neural network, things are more or less easy.
One stores (s, a, r, s') during interaction with the environment and uses
target = Qnew(s,a) = (1 - alpha) * Qold(s,a) + alpha * ( r + gamma * max_{a'} Qold(s', a') )
as the target value for the neural network approximating the Q-function. So the input of the ANN is (s, a) and the output is the scalar Qnew(s, a). Deep Q-learning papers/tutorials change the structure of the Q-function: instead of providing a single Q-value for the pair (s, a), the network now provides the Q-values of all possible actions for a state s, so it is Q(s) instead of Q(s, a).
Here comes my problem: the database filled with (s, a, r, s') does not, for a specific state s, contain the reward for all actions, only for some, maybe just one, action. So how do I set up the target values for the network Q(s) = [Q(a_1), ..., Q(a_n)] without having all rewards for the state s in the database? I have seen different loss functions/target values, but all contain the reward.
As you can see, I am puzzled. Can someone help me? There are a lot of tutorials on the web, but this step is in general poorly described, and even less motivated when looking at the theory.
You only compute the target value for the action that appears in the observation (s, a, r, s'). You evaluate the Q-values of the next state s' for all actions and then choose the maximum of them, as you wrote yourself: max_{a'} Qold(s', a'). Then it is added to r(s, a), and the result is the target value. For example, assume you have 10 actions and the observation is (s_0, a=5, r(s_0, a=5)=123, s_1). Then the target value is r(s_0, a=5) + gamma * max_{a'} Q_target(s_1, a'). With TensorFlow it could be something like:
Q_Action = tf.reduce_sum(tf.multiply(Q_values, tf.one_hot(action, output_dim)), axis=1)  # shape: [batchSize]
in which Q_values has shape [batchSize, output_dim]. So the output is a vector of size batchSize, and there is a vector of the same size containing the target values; the loss is the square of their difference.
When you calculate the loss, the backward pass only runs through the action that was actually taken; the gradient from the other actions is just zero.
So, you only need the reward of the existing action.
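Putting the pieces together, here is a rough TF1-style sketch (the network size, state_dim, output_dim, and all variable names are illustrative assumptions, not from any particular paper):

import tensorflow as tf

state_dim, output_dim, gamma = 4, 10, 0.99   # assumed sizes for illustration

def q_network(s, reuse=None):
    # Tiny Q-network: maps a state to one Q-value per action, i.e. Q(s).
    with tf.variable_scope("q", reuse=reuse):
        h = tf.layers.dense(s, 32, activation=tf.nn.relu)
        return tf.layers.dense(h, output_dim)

# A sampled batch of (s, a, r, s') transitions from the replay buffer.
state      = tf.placeholder(tf.float32, [None, state_dim])
action     = tf.placeholder(tf.int32,   [None])
reward     = tf.placeholder(tf.float32, [None])
next_state = tf.placeholder(tf.float32, [None, state_dim])

Q_values      = q_network(state)                    # [batchSize, output_dim]
Q_next_values = q_network(next_state, reuse=True)   # a separate target network in practice

# Q-value of the action actually taken -- the only action we have a reward for.
Q_Action = tf.reduce_sum(tf.multiply(Q_values, tf.one_hot(action, output_dim)), axis=1)

# Target uses only the observed reward plus gamma * max over next-state Q-values.
target = reward + gamma * tf.reduce_max(Q_next_values, axis=1)

# Squared error; gradients flow only through Q_Action, i.e. through the taken action.
loss = tf.reduce_mean(tf.square(tf.stop_gradient(target) - Q_Action))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)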