Deeplearning4j vs R results differ

I believe that deeplearning4j and R with exactly the same parameters should achieve comparable MSE, but I am not sure how to do that.
I have a CSV file in the following format, which contains 46 input variables and 2 outputs. In total there are 10,000 samples. All of the data is normalized, and the model is for regression analysis.
S1 | S2 | ... | S46 | X | Y
In R, I use the neuralnet package, and the code is:
rn <- colnames(traindata)
f <- as.formula(paste("X + Y ~", paste(rn[1:(length(rn)-2)], collapse="+")))
nn <- neuralnet(f,
                rep=1,
                data=traindata,
                hidden=c(10),
                linear.output=T,
                threshold=0.5)
which is quite straightforward.
Since I want to integrate the algorithm into a Java project, I am considering dl4j to train the model. The training set is exactly the same as in the R code, and the test set is randomly selected. The dl4j code is:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(rngSeed) // random seed for reproducibility
    // use stochastic gradient descent as the optimization algorithm
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .iterations(100)
    .learningRate(0.0001) // specify the learning rate
    .updater(Updater.NESTEROVS).momentum(0.9) // Nesterov momentum
    .regularization(true).l2(0.0001)
    .list()
    .layer(0, new DenseLayer.Builder() // hidden layer with Xavier initialization
        .nIn(46)
        .nOut(10)
        .activation(Activation.TANH)
        .weightInit(WeightInit.XAVIER)
        .build())
    .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MSE) // output layer
        .nIn(10)
        .nOut(outputNum)
        .activation(Activation.IDENTITY)
        .build())
    .pretrain(false).backprop(true) // use backpropagation to adjust weights
    .build();
The number of epochs is 10 and the batch size is 128.
Using the test set, the performance of R is: [R performance plot]
The performance of dl4j is the following; I think it does not work out its full potential: [dl4j performance plot]
The dl4j training monitor shows: [dl4j monitor plot]
Since dl4j exposes many more parameters, such as the updater, regularization, and weightInit, I suspect some of them are not properly set. By the way, why are there periodic spikes in the monitor graph?
Can anyone help?

Most neural net training happens in minibatches. By default, Deeplearning4j assumes you aren't doing toy problems (all of the data in memory, fewer than ~10 examples, etc.).
The neural net configuration has a method called miniBatch you should look for.
Set minibatch to false on the configuration (.miniBatch(false)) and you should get the same results.
If you're wondering why this happens, it's because minibatch learning is different from doing everything in memory: minibatch learning automatically divides the gradient by the minibatch size, which is not what you want when you train on everything in memory at once.
Take note of this when you're running other experiments.
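To see what that division means, here is a quick numpy sketch (plain Python, not dl4j): the gradient of a mean loss is exactly the summed full-batch gradient scaled down by the batch size, which changes the effective step size.

import numpy as np

# Gradient of MSE for a linear model y = X @ w, computed with a summed
# loss vs. a mean (minibatch-style) loss over the same 128 examples.
rng = np.random.RandomState(0)
X, w, y = rng.rand(128, 3), rng.rand(3), rng.rand(128)

residual = X @ w - y
grad_sum = 2 * X.T @ residual              # gradient of the summed loss
grad_mean = 2 * X.T @ residual / len(X)    # gradient of the mean loss

# The two differ exactly by the batch size, i.e. the effective learning
# rate shrinks by a factor of 128 under the minibatch convention.
print(np.allclose(grad_sum, len(X) * grad_mean))  # True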
For more on this see:
https://deeplearning4j.org/toyproblems

Related

Can any machine learning algorithm find this pattern: x1 < x2 without generating a new feature (e.g. x1-x2) first?

If I had 2 features x1 and x2 where I know that the pattern is:
if x1 < x2 then
    class1
else
    class2
Can any machine learning algorithm find such a pattern? What algorithm would that be?
I know that I could create a third feature x3 = x1-x2. Then feature x3 can easily be used by some machine learning algorithms. For example a decision tree can solve the problem 100% using x3 and just 3 nodes (1 decision and 2 leaf nodes).
But, is it possible to solve this without creating new features? This seems like a problem that should be easily solved 100% if a machine learning algorithm could only find such a pattern.
I tried MLP and SVM with different kernels, including the RBF kernel, and the results are not great. As an example of what I tried, here is the scikit-learn code where the SVM could only get a score of 0.992:
import numpy as np
from sklearn.svm import SVC
# Generate 1000 samples with 2 features with random values
X_train = np.random.rand(1000,2)
# Label each sample. If feature "x1" is less than feature "x2" then label as 1, otherwise label is 0.
y_train = X_train[:,0] < X_train[:,1]
y_train = y_train.astype(int) # convert boolean to 0 and 1
svc = SVC(kernel = "rbf", C = 0.9) # tried all kernels and C values from 0.1 to 1.0
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))
Output running the code:
SVC score: 0.992000
This is an oversimplification of my problem. The real problem may have hundreds of features and different patterns, not just x1 < x2. However, to start with it would help a lot to know how to solve for this simple pattern.
To understand this, you need to look into the parameters sklearn provides, C in particular, and at how the value of C influences the classifier's training procedure.
If you look at the equation in the User Guide for SVC, there are two main parts to the equation - the first part tries to find a small set of weights that solves the problem, and the second part tries to minimize the classification errors.
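For reference, that equation is the standard soft-margin SVM primal problem:

$$\min_{w,\,b,\,\zeta} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i\,(w^T \phi(x_i) + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0$$

The first term prefers a small set of weights, while the second term, scaled by C, penalizes margin violations (the slack variables $\zeta_i$).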
C is the penalty multiplier associated with misclassifications. If you decrease C, then you reduce the penalty (lower training accuracy but better generalization to test) and vice versa.
Try setting C to 1e+6. You will see that you almost always get 100% accuracy: the classifier has learnt the pattern x1 < x2. With the default settings, however, it decides that 99.2% accuracy is close enough, because of another parameter called tol. This controls how much error is considered negligible, and by default it is set to 1e-3. If you reduce the tolerance, you can also expect to get similar results.
In general, I would suggest using something like GridSearchCV to find the optimal values of hyperparameters like C, as it internally splits the dataset into train and validation folds. This helps to ensure that you are not just tweaking the hyperparameters to get a good training accuracy, but that the classifier will also do well in practice.
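As a concrete sketch of that (the grid values here are hypothetical; adjust the range to your problem):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Same toy data as in the question.
X_train = np.random.rand(1000, 2)
y_train = (X_train[:, 0] < X_train[:, 1]).astype(int)

# Search C on a log scale; GridSearchCV scores each candidate on internal
# validation folds, not on the training data itself.
param_grid = {"C": [0.1, 1.0, 100.0, 1e4, 1e6]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)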

How to find if a data set can train a neural network?

I'm a newbie to machine learning and this is one of the first real-world ML tasks I have been challenged with.
Some experimental data contains 512 independent boolean features and a boolean result.
There are about 1e6 real experiment records in the provided data set.
In the classic XOR example, all 4 out of 4 possible states are required to train the NN. In my case, the 1e6 samples cover only about 2^20 / 2^512 = 2^-492 of the possible input states, which is essentially zero.
I have no more information about the data nature, just these (512 + 1) * 1e6 bits.
I tried an NN with 1 hidden layer on the available data. The output of the trained NN on samples, even ones from the training set, is always close to 0; not a single output is close to 1. I played with the weight initialization and the gradient descent learning rate.
My code uses TensorFlow 1.3 and Python 3. Model excerpt:
with tf.name_scope("Layer1"):
#W1 = tf.Variable(tf.random_uniform([512, innerN], minval=-2/512, maxval=2/512), name="Weights_1")
W1 = tf.Variable(tf.zeros([512, innerN]), name="Weights_1")
b1 = tf.Variable(tf.zeros([1]), name="Bias_1")
Out1 = tf.sigmoid( tf.matmul(x, W1) + b1)
with tf.name_scope("Layer2"):
W2 = tf.Variable(tf.random_uniform([innerN, 1], minval=-2/512, maxval=2/512), name="Weights_2")
#W2 = tf.Variable(tf.zeros([innerN, 1]), name="Weights_2")
b2 = tf.Variable(tf.zeros([1]), name="Bias_2")
y = tf.nn.sigmoid( tf.matmul(Out1, W2) + b2)
with tf.name_scope("Training"):
y_ = tf.placeholder(tf.float32, [None,1])
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels = y_, logits = y)
)
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)
with tf.name_scope("Testing"):
# Test trained model
correct_prediction = tf.equal( tf.round(y), tf.round(y_))
# ...
# Train
for step in range(500):
batch_xs, batch_ys = Datasets.train.next_batch(300, shuffle=False)
_, my_y, summary = sess.run([train_step, y, merged_summaries],
feed_dict={x: batch_xs, y_: batch_ys})
I suspect two cases:
my fault – bad NN implementation, wrong architecture;
bad data. Compared to the XOR example, incomplete training data would result in a failing NN. Still, the training examples fed to the trained NN are supposed to give the right predictions, aren't they?
How can I evaluate whether it is possible at all to train a neural network (a 2-layer perceptron) on the provided data to forecast the result? An example of an acceptable data set would be the XOR example, as opposed to pure random noise.
There are only ad hoc ways to know if it is possible to learn a function with a differentiable network from a dataset. That said, these ad hoc ways do usually work. For example, the network should be able to overfit the training set without any regularisation.
A common technique to gauge this is to only fit the network on a subset of the full dataset. Check that the network can overfit to that, then increase the size of the subset, and increase the size of the network as well. Unfortunately, deciding whether to add extra layers or add more units in a hidden layer is an arbitrary decision you'll have to make.
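For instance, a minimal sketch of that procedure (using sklearn's MLPClassifier for brevity and random stand-in data; the subset sizes and network width are arbitrary):

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(20000, 512)).astype(np.float32)  # stand-in for the real data
y = rng.randint(0, 2, size=20000)

# Try to overfit progressively larger subsets with no regularisation
# (alpha=0). If training accuracy stays far below 100% even on a small
# subset, either the model/training is broken or the labels are noise.
for n in (100, 1000, 10000):
    clf = MLPClassifier(hidden_layer_sizes=(256,), alpha=0.0, max_iter=500)
    clf.fit(X[:n], y[:n])
    print(n, clf.score(X[:n], y[:n]))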
However, looking at your code, there are a few things that could be going wrong here:
Are your outputs balanced? By that I mean, do you have the same number of 1s as 0s in the dataset targets?
Your initialisation in the first layer is all zeros; the gradient through it will be zero, so it can't learn anything (although you have a real initialisation commented out above it).
Sigmoid nonlinearities are more difficult to optimise than simpler nonlinearities, such as ReLUs.
I'd recommend using the built-in definitions for layers in Tensorflow to not worry about initialisation, and switching to ReLUs in any hidden layers (you need sigmoid at the output for your boolean target).
Finally, deep learning isn't actually very good at most "bag of features" machine learning problems because they lack structure. For example, the order of the features doesn't matter. Other methods often work better, but if you really want to use deep learning then you could look at this recent paper, showing improved performance by just using a very specific nonlinearity and weight initialisation (change 4 lines in your code above).
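Putting those suggestions together, here is a minimal TensorFlow 1.x sketch of the fixes (not the asker's full code; innerN and the learning rate are placeholders carried over from the question). One extra bug worth flagging: the original loss applies softmax cross-entropy to a single, already-sigmoided output, and softmax over a single logit is always 1, so that loss is constant and cannot train.

import tensorflow as tf

innerN = 128  # hidden width; placeholder value, tune as needed

x  = tf.placeholder(tf.float32, [None, 512])
y_ = tf.placeholder(tf.float32, [None, 1])

# Built-in layers handle sensible weight initialisation automatically.
hidden = tf.layers.dense(x, innerN, activation=tf.nn.relu)  # ReLU hidden layer
logits = tf.layers.dense(hidden, 1)  # raw logits: no sigmoid here

# Sigmoid cross-entropy for a single boolean target; it applies the
# sigmoid internally, so it must receive raw logits.
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(loss)

# Apply the sigmoid explicitly only where probabilities are needed.
y = tf.nn.sigmoid(logits)
correct_prediction = tf.equal(tf.round(y), tf.round(y_))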

understanding test and validation set usage for early stop and model selection

I implemented an ANN (1 hidden layer of 64 units, learning rate = 0.001, epsilon = 0.001, iters = 500) with Python's OpenCV module. The train error is ~3% and the test error ~12%.
In order to improve the accuracy/generalisation of my NN, I decided to proceed by implementing model selection (of the number of hidden units and the learning rate) to get accurate values for the hyperparameters, and by plotting learning curves to determine if more data is needed (I currently have 2.5k samples).
Having read some sources on NN training and model selection, I'm very confused about the following:
1) In order to perform model selection, I know the following needs to be done:

create set possibleHiddenUnits = {4, 8, 16, 32, 64}
randomly select Tr and Va sets from the total Tr + Va set with some split, e.g. 80/20
foreach ele in possibleHiddenUnits
    (*) compute weights for the NN using backpropagation and an iterative
        optimisation algorithm like Gradient Descent (where we provide the
        termination criteria in the form of number of iterations / epsilon)
    compute the Validation set error using these trained weights
select the number of hidden units which minimises the Va set error
Alternatively, I believe we can also use k-fold cross validation.
a. How do you decide what the number of iterations / epsilon for GD should be?
b. Does 1 iteration out of x iterations of GD (where the entire training set is used to compute the gradients of cost w.r.t. the weights through backprop) constitute an 'epoch'?
2) Sources (What is the difference between train, validation and test set, in neural networks? and How to use k-fold cross validation in a neural network) mention that training a NN should be done in the following way, as it prevents over-fitting:
for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over the training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training
a. I believe this method should be executed once the model selection has been done. But then how do we avoid overfitting the model in step (*) of the model selection process above?
b. Am I right in assuming that one epoch constitutes one iteration of training where the weights are calculated using the entire Tr set through GD + backprop, and that GD involves x (>1) iterations over the entire Tr set to calculate the weights?
Also, of 1b and 2b, which is correct?
This is more of a comment, but since I can't make comments yet I'll write it here. Have you tried other methods like L2 regularization or dropout? I don't know a lot about model selection, but dropout has a very similar effect to taking lots of models and averaging them. Normally dropout should do the trick and you won't have problems with overfitting anymore.
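To the best of my knowledge OpenCV's ANN module does not expose dropout, so here is a minimal sketch of both ideas using the Keras 2 API instead (layer sizes are placeholders; 64 hidden units as in the question):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2

# input_dim is a placeholder; set it to your actual feature count.
model = Sequential([
    Dense(64, activation='relu', input_dim=100,
          kernel_regularizer=l2(0.01)),  # l2 weight penalty
    Dropout(0.5),                        # drop half the units each update
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy',
              metrics=['accuracy'])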

Using RNN to recover sine wave from noisy signal

I am involved with an application that needs to estimate the state of a certain system in real time by measuring a set of (non-linearly) dependent parameters. Up until now the application was using an extended Kalman filter, but it was found to be underperforming in certain circumstances, which is likely caused by the fact that the differences between the real system and its model used in the filter are too significant to be modeled as white noise. We cannot use a more precise model for a number of unrelated reasons.
We decided to try recurrent neural networks for the task. Since my experience with neural networks is quite limited, before tackling the real task itself, I decided to practice with a hand crafted problem first. That problem, however, I could not solve, so I'm asking for help here.
Here's what I did: I generated some sine waveforms of varying phase, frequency, amplitude, and offset. Then I distorted the waveforms with some white noise, and (unsuccessfully) attempted to train an LSTM network to recover my waveforms from the noisy signal. I expected that the network would eventually learn to fit a sine waveform to the noisy data set.
Here's the source (slightly abridged, but it should work):
#!/usr/bin/env python3
import time
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.layers.wrappers import TimeDistributed
from keras.objectives import mean_absolute_error, cosine_proximity

POINTS_PER_WF = int(1e4)
X_SPACE = np.linspace(0, 100, POINTS_PER_WF)

def make_waveform_with_noise():
    def add_noise(vec):
        stdev = float(np.random.uniform(0.01, 0.2))
        return vec + np.random.normal(0, stdev, size=len(vec))
    f = np.random.choice((np.sin, np.cos))
    wf = f(X_SPACE * np.random.normal(scale=5)) *\
         np.random.normal(scale=5) + np.random.normal(scale=50)
    return wf, add_noise(wf)

RESCALING = 1e-3
BATCH_SHAPE = (1, POINTS_PER_WF, 1)

model = Sequential([
    TimeDistributed(Dense(5, activation='tanh'), batch_input_shape=BATCH_SHAPE),
    LSTM(20, activation='tanh', inner_activation='sigmoid', return_sequences=True),
    LSTM(20, activation='tanh', inner_activation='sigmoid', return_sequences=True),
    TimeDistributed(Dense(1, activation='tanh'))
])

def compute_loss(y_true, y_pred):
    skip_first = POINTS_PER_WF // 2
    y_true = y_true[:, skip_first:, :] * RESCALING
    y_pred = y_pred[:, skip_first:, :] * RESCALING
    me = mean_absolute_error(y_true, y_pred)
    cp = cosine_proximity(y_true, y_pred)
    return me + cp

model.summary()
model.compile(optimizer='adam', loss=compute_loss,
              metrics=['mae', 'cosine_proximity'])

NUM_ITERATIONS = 30000
for iteration in range(NUM_ITERATIONS):
    wf, noisy_wf = make_waveform_with_noise()
    y = wf.reshape(BATCH_SHAPE) * RESCALING
    x = noisy_wf.reshape(BATCH_SHAPE) * RESCALING
    info = model.train_on_batch(x, y)

model.save_weights('final.hdf5')
The first dense layer is actually useless; I added it because I wanted to make sure I can successfully combine LSTM and time-distributed dense layers, since my real application will likely need that setup.
The error function was modified a number of times. Initially I was using plain mean squared error, but the training process was extremely slow, and it was mostly converging to simply copying the input noisy signal to the output. The cosine proximity metric I added later essentially defines the degree of similarity between the shapes of the functions; it seemed to speed up the learning quite a bit. Also note that I'm applying the loss function only to the last half of the dataset; the motivation is that I expected the network would need to see a few periods of the signal in order to correctly identify the parameters of the waveform. However, I found that this modification has no visible effect on the performance of the network.
The latest modification of the script uses the Adam optimizer. I also experimented with RMSProp with varying learning rate and decay settings, but found no noticeable difference in the behavior of the network.
I am using the Theano 0.9 (dev) backend configured to use 64-bit floating point in order to prevent possible issues with numerical stability; the epsilon value is set to 1e-14 accordingly.
This is what the output looks like after 15k to 30k training steps (performance stops improving from about 15k steps onward); the first plot is zoomed in for the sake of clarity: [output plots]
Plot legend:
blue (0) - noisy signal, input of the RNN
green (1) - recovered signal, output of the RNN
red (2) - ground truth
My question is: what am I doing wrong?

Translating a TensorFlow LSTM into synapticjs

I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a javascript version that can be run in the browser. The problem is that in all of the literature I've read, LSTMs are modeled as mini-networks (using only connections, nodes and gates), while TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow; the issue is that they appear to have only one bias value per LSTM node, whereas most of the models I've seen would include several biases: for the memory cell, the inputs, and the output.
Internally, the LSTMCell class stores the LSTM weights as one big matrix instead of 8 smaller ones, for efficiency. It is quite easy to divide it horizontally and vertically to get to the more conventional representation. However, it might be easier and more efficient if your library performs a similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function does the matrix multiplication to transform the concatenated input and the previous h state into 4 matrices of shape [batch_size, self._num_units]. The linear transformation uses the single weight matrix and bias variables that you're referring to in the question. The result is then split into the different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.
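For example, a minimal numpy sketch of that splitting, assuming the combined kernel has shape [input_size + num_units, 4 * num_units] with the gate blocks laid out in the i, j, f, o order shown above:

import numpy as np

input_size, num_units = 10, 20

# Stand-ins for the trainable variables TensorFlow gives you.
kernel = np.random.rand(input_size + num_units, 4 * num_units)
bias = np.random.rand(4 * num_units)

# Split vertically into the four gate blocks (i, j, f, o as above)...
W_i, W_j, W_f, W_o = np.split(kernel, 4, axis=1)
b_i, b_j, b_f, b_o = np.split(bias, 4)

# ...and split each gate's weights horizontally into the part applied
# to the input x and the part applied to the previous hidden state h.
W_xi, W_hi = W_i[:input_size], W_i[input_size:]
print(W_xi.shape, W_hi.shape)  # (10, 20) (20, 20)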
