I'm trying to develop some intuition for machine learning. I looked over the examples from https://github.com/deeplearning4j/dl4j-0.4-examples and wanted to build my own. Basically I just took a simple function, a * a + b * b + c * c - a * b * c + a + b + c, generated 10000 outputs for random a, b, c, and tried to train my network on 90% of the inputs. The thing is, no matter what I do, my network never learns to predict the rest of the examples.
Here is my code:
public class BasicFunctionNN {
    private static Logger log = LoggerFactory.getLogger(BasicFunctionNN.class);

    public static DataSetIterator generateFunctionDataSet() {
        Collection<DataSet> list = new ArrayList<>();
        for (int i = 0; i < 100000; i++) {
            double a = Math.random();
            double b = Math.random();
            double c = Math.random();
            double output = a * a + b * b + c * c - a * b * c + a + b + c;
            INDArray in = Nd4j.create(new double[]{a, b, c});
            INDArray out = Nd4j.create(new double[]{output});
            list.add(new DataSet(in, out));
        }
        return new ListDataSetIterator(list, list.size());
    }

    public static void main(String[] args) throws Exception {
        DataSetIterator iterator = generateFunctionDataSet();
        Nd4j.MAX_SLICES_TO_PRINT = 10;
        Nd4j.MAX_ELEMENTS_PER_SLICE = 10;

        final int numInputs = 3;
        int outputNum = 1;
        int iterations = 100;

        log.info("Build model....");
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .iterations(iterations).weightInit(WeightInit.XAVIER).updater(Updater.SGD).dropOut(0.5)
                .learningRate(.8).regularization(true)
                .l1(1e-1).l2(2e-4)
                .optimizationAlgo(OptimizationAlgorithm.LINE_GRADIENT_DESCENT)
                .list(3)
                .layer(0, new DenseLayer.Builder().nIn(numInputs).nOut(8)
                        .activation("identity")
                        .build())
                .layer(1, new DenseLayer.Builder().nIn(8).nOut(8)
                        .activation("identity")
                        .build())
                .layer(2, new OutputLayer.Builder(LossFunctions.LossFunction.RMSE_XENT)
                        .activation("identity")
                        .weightInit(WeightInit.XAVIER)
                        .nIn(8).nOut(outputNum).build())
                .backprop(true).pretrain(false)
                .build();

        // run the model
        MultiLayerNetwork model = new MultiLayerNetwork(conf);
        model.init();
        model.setListeners(Collections.singletonList((IterationListener) new ScoreIterationListener(iterations)));

        // get the dataset using the record reader; the DataSetIterator handles vectorization
        DataSet next = iterator.next();
        SplitTestAndTrain testAndTrain = next.splitTestAndTrain(0.9);
        System.out.println(testAndTrain.getTrain());
        model.fit(testAndTrain.getTrain());

        // evaluate the model
        Evaluation eval = new Evaluation(10);
        DataSet test = testAndTrain.getTest();
        INDArray output = model.output(test.getFeatureMatrix());
        eval.eval(test.getLabels(), output);
        log.info(">>>>>>>>>>>>>>");
        log.info(eval.stats());
    }
}
I also played with the learning rate, and it often happens that the score doesn't improve:
10:48:51.404 [main] DEBUG o.d.o.solvers.BackTrackLineSearch - Exited line search after maxIterations termination condition; score did not improve (bestScore=0.8522868127536543, scoreAtStart=0.8522868127536543). Resetting parameters
As an activation function I also tried relu.
One obvious problem is that you are trying to model a nonlinear function with a linear model. Your neural network has no activation functions, so it can effectively only express functions of the form W1*a + W2*b + W3*c + W4. It does not matter how many hidden units you create: as long as no non-linear activation function is used, your network degenerates to a simple linear model.
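To see this concretely, here is a minimal numpy sketch (illustrative only; not DL4J code, and the shapes and values are made up) showing that stacking dense layers with identity activations collapses into a single linear map:

import numpy as np

# Two "dense" layers with identity activations, random weights for illustration
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 8)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((8, 1)), rng.standard_normal(1)

x = rng.standard_normal((5, 3))            # a batch of five (a, b, c) inputs
two_layers = (x @ W1 + b1) @ W2 + b2       # the "network" with identity activations

W_eff = W1 @ W2                            # the equivalent single linear layer
b_eff = b1 @ W2 + b2
one_layer = x @ W_eff + b_eff

print(np.allclose(two_layers, one_layer))  # True: it is still just a linear model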
Update:
There are also many "small weird things", including but not limited to:
you are using a huge learning rate (0.8)
you are using lots of regularization (quite complex, too: using both l1 and l2 regularizers for regression is not a common approach, especially in neural networks) for a problem where you need none
rectifier units might not be the best choice for expressing the squaring and multiplication operations you are looking for. Rectifiers are very good for classification, especially with deeper architectures, but not for shallow regression. Try sigmoid-like activations (tanh, sigmoid) instead; see the sketch after this list.
I am not entirely sure what "iteration" means in this implementation, but usually it is the number of samples/minibatches used for training. Thus using just 100 might be orders of magnitude too few for gradient descent learning.
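As a quick sanity check that the target function is easy to fit once a non-linear activation and a modest learning rate are used, here is a rough sketch in Python with scikit-learn rather than DL4J (the hyperparameters are only a guess, not a tuned recipe):

import numpy as np
from sklearn.neural_network import MLPRegressor

# Regenerate the data from the question; 10% is held out for testing
rng = np.random.default_rng(42)
X = rng.random((10000, 3))
a, b, c = X[:, 0], X[:, 1], X[:, 2]
y = a * a + b * b + c * c - a * b * c + a + b + c

# Two small hidden layers with tanh activations and a small learning rate
model = MLPRegressor(hidden_layer_sizes=(8, 8), activation='tanh',
                     learning_rate_init=0.01, max_iter=2000, random_state=0)
model.fit(X[:9000], y[:9000])
print(model.score(X[9000:], y[9000:]))  # R^2 on the held-out 10%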
I am implementing the multivariate normal probability density function from scratch in Python. The formula for it is as follows:

$$ f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k \, |\mathbf{\Sigma}|}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) $$
I was able to code the version below, where $\mathbf{x}$ is an input vector (a single sample). However, I would like to make use of numpy's matrix operations and extend it to the case of $\mathbf{X}$ (a set of samples), returning all the samples' probabilities at once, equal to scipy's implementation.
This is the code I have made:
import numpy as np

def multivariate_normal(X, center, cov):
    k = X.shape[0]
    det_cov = np.linalg.det(cov)
    inv_cov = np.linalg.inv(cov)
    o = 1 / np.sqrt((2 * np.pi) ** k * det_cov)
    p = np.exp(-.5 * (np.dot(np.dot((X - center).T, inv_cov), (X - center))))
    return o * p
Thanks in advance.
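One possible vectorized variant (a sketch only; the name multivariate_normal_batch is mine, and it assumes X has shape (n_samples, k)):

import numpy as np

def multivariate_normal_batch(X, center, cov):
    # X: (n_samples, k), center: (k,), cov: (k, k)
    k = X.shape[1]
    det_cov = np.linalg.det(cov)
    inv_cov = np.linalg.inv(cov)
    norm_const = 1 / np.sqrt((2 * np.pi) ** k * det_cov)
    diff = X - center
    # per-sample quadratic form (x - mu)^T Sigma^-1 (x - mu)
    mahalanobis = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return norm_const * np.exp(-0.5 * mahalanobis)

The result should match scipy.stats.multivariate_normal(center, cov).pdf(X) up to floating-point error.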
I was following Siraj Raval's videos on logistic regression using gradient descent:
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations so that the function converges (the slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code:
from numpy import array  # needed for array(points) below

def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]

# The above functions are called below:
learning_rate = 0.0001
initial_b = 0  # initial y-intercept guess
initial_m = 0  # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why do the values of b and m continue to update for all the iterations? After a certain number of iterations the function will converge, i.e. we find the values of b and m that give slope = 0.
So why do we keep iterating after that point and continue updating b and m?
Aren't we losing the 'correct' b and m values this way? How does the learning rate help the convergence process if we continue to update values after converging? Why is there no check for convergence, and how does this actually work?
In practice you will most likely not reach a slope of exactly 0. Think of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot the lowest point of the bowl. Conversely, if the learning rate is too low, learning becomes too slow and won't reach the lowest point of the bowl before all the iterations are done.
That's why in machine learning the learning rate is an important hyperparameter to tune.
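To see the overshooting concretely, here is a tiny sketch (illustrative numbers only) of gradient descent on the one-dimensional "bowl" f(m) = m**2, whose gradient is 2*m:

def descend(learning_rate, steps=10, m=1.0):
    # gradient descent on f(m) = m**2
    for _ in range(steps):
        m = m - learning_rate * (2 * m)
    return m

print(descend(0.1))  # moves steadily toward the minimum at m = 0
print(descend(1.1))  # overshoots further on every step and diverges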
Actually, once we reach a slope of 0, b_gradient and m_gradient become 0;
thus, for:
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m simply keep their old (already correct) values, since nothing is subtracted from them.
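If you did want an explicit convergence check, one option (a sketch that reuses step_gradient and array from the code above, with a tolerance tol I picked arbitrarily) is to stop once the parameters barely move:

def gradient_descent_until_converged(points, starting_b, starting_m,
                                     learning_rate, num_iterations, tol=1e-9):
    b, m = starting_b, starting_m
    for i in range(num_iterations):
        new_b, new_m = step_gradient(b, m, array(points), learning_rate)
        # stop early once the updates are negligible (gradients are ~0)
        if abs(new_b - b) < tol and abs(new_m - m) < tol:
            return [new_b, new_m]
        b, m = new_b, new_m
    return [b, m]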
I am trying to do anomaly detection on my own images, using the example from the deeplearning4j platform. I changed the code like this:
int rngSeed = 123;
Random rnd = new Random(rngSeed);
int width = 28;
int height = 28;
int batchSize = 128;

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(12345)
        .iterations(1)
        .weightInit(WeightInit.XAVIER)
        .updater(Updater.ADAGRAD)
        .activation(Activation.RELU)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .learningRate(0.05)
        .regularization(true).l2(0.0001)
        .list()
        .layer(0, new DenseLayer.Builder().nIn(784).nOut(250)
                .build())
        .layer(1, new DenseLayer.Builder().nIn(250).nOut(10)
                .build())
        .layer(2, new DenseLayer.Builder().nIn(10).nOut(250)
                .build())
        .layer(3, new OutputLayer.Builder().nIn(250).nOut(784)
                .lossFunction(LossFunctions.LossFunction.MSE)
                .build())
        .pretrain(false).backprop(true)
        .build();

MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.setListeners(Collections.singletonList((IterationListener) new ScoreIterationListener(1)));

File trainData = new File("mnist_png/training");
FileSplit fsTrain = new FileSplit(trainData, NativeImageLoader.ALLOWED_FORMATS, rnd);
ImageRecordReader recorderReader = new ImageRecordReader(height, width);
recorderReader.initialize(fsTrain);
DataSetIterator dataIt = new RecordReaderDataSetIterator(recorderReader, batchSize);

List<INDArray> featuresTrain = new ArrayList<>();
while (dataIt.hasNext()) {
    DataSet ds = dataIt.next();
    featuresTrain.add(ds.getFeatureMatrix());
}

System.out.println("************ training **************");
int nEpochs = 30;
for (int epoch = 0; epoch < nEpochs; epoch++) {
    for (INDArray data : featuresTrain) {
        net.fit(data, data);
    }
    System.out.println("Epoch " + epoch + " complete");
}
And it threw an exception while training:
Exception in thread "main" org.deeplearning4j.exception.DL4JInvalidInputException: Input that is not a matrix; expected matrix (rank 2), got rank 4 array with shape [128, 1, 28, 28]
at org.deeplearning4j.nn.layers.BaseLayer.preOutput(BaseLayer.java:363)
at org.deeplearning4j.nn.layers.BaseLayer.activate(BaseLayer.java:384)
at org.deeplearning4j.nn.layers.BaseLayer.activate(BaseLayer.java:405)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.activationFromPrevLayer(MultiLayerNetwork.java:590)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.feedForwardToLayer(MultiLayerNetwork.java:713)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:1821)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:151)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:54)
at org.deeplearning4j.optimize.Solver.optimize(Solver.java:51)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:1443)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:1408)
at org.deeplearning4j.examples.dataExamples.AnomalyTest.main(AnomalyTest.java:86)
It seems that my input data has 4 dimensions while the network needs just 2, so the question is: how do I convert the ImageRecordReader output (or something else) to make it run properly?
So first of all, you may want to understand what a tensor is:
http://nd4j.org/tensor
The record reader returns a multi-dimensional image; you need to flatten it for use with a 2D (dense) neural net, unless you plan on using CNNs for part of your training.
If you take a look at the exception (again, you really should get familiar with ndarrays; they aren't new and are used in every deep learning library), you'll see a shape of:
[128, 1, 28, 28]
That is batch size x channels x rows x columns. You need to add:
.setInputType(InputType.convolutional(28,28,1))
This tells DL4J that it needs to flatten the 4D input to 2D. In this case it indicates rows x columns x channels of 28 x 28 x 1.
If you add this to the bottom of your config it will work.
Of note: if you are trying to do anomaly detection, we also have variational autoencoders you may want to look into as well.
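For intuition about what the flattening does, here is a small numpy illustration (not DL4J code; setInputType performs the equivalent reshape for you):

import numpy as np

batch = np.zeros((128, 1, 28, 28))        # [batch, channels, rows, columns], as in the exception
flat = batch.reshape(batch.shape[0], -1)  # [128, 784], the rank-2 shape dense layers expect
print(flat.shape)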
The total error for the network did not change over 100,000 iterations.
The input is 22 values and the output is a single value. The input array is [195][22] and the output array is [195][1].
BasicNetwork network = new BasicNetwork();
network.addLayer(new BasicLayer(null, true, 22));
network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 10));
network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));
network.getStructure().finalizeStructure();
network.reset();

MLDataSet training_data = new BasicMLDataSet(input, target_output);
final Backpropagation train = new Backpropagation(network, training_data);

int epoch = 1;
do {
    train.iteration();
    System.out.println("Epoch #" + epoch + " Error:" + train.getError());
    epoch++;
} while (train.getError() > 0.01);
train.finishTraining();
What is wrong with this code?
Depending on the data you are trying to classify, your network may be too small to transform the search space into a linearly separable problem, so try adding more neurons or layers (this will probably take longer to train). Unless the problem is already linearly separable, in which case a neural network may be an inefficient way to solve it.
Also, you don't have a training strategy: if the NN falls into a local minimum on the error surface it will be stuck there. See the Encog user guide, https://s3.amazonaws.com/heatonresearch-books/free/Encog3Java-User.pdf; page 166 has a list of training strategies.
final int strategyCycles = 50;
final double strategyError = 0.25;
train.addStrategy(new ResetStrategy(strategyError,strategyCycles));
I'm trying to learn Theano and decided to implement linear regression (using their logistic regression example from the tutorial as a template). I'm getting a weird thing where T.grad doesn't work if my cost function uses .sum(), but does work if my cost function uses .mean(). Code snippet:
(This doesn't work; it results in a w vector full of NaNs):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.sum()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
    pred, err = train(D[0], D[1])
(This works perfectly):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.mean()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
    pred, err = train(D[0], D[1])
The only difference is cost = single_error.sum() vs cost = single_error.mean(). What I don't understand is that the gradient should be exactly the same in both cases (one is just a scaled version of the other). So what gives?
The learning rate (0.1) is way too big. Using mean divides the gradient by the batch size, which helps. But I'm pretty sure you should make it much smaller still, not just divide it by the batch size (which is what using mean amounts to).
Try a learning rate of 0.001.
Try dividing your gradient descent step size by the number of training examples.
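To make the scaling explicit, here is a small numpy sketch (illustrative shapes and values only) showing that the gradient of the summed cost is exactly N times the gradient of the mean cost, so with .sum() the effective step size grows with the number of examples:

import numpy as np

rng = np.random.default_rng(0)
N, feats = 400, 3
x = rng.standard_normal((N, feats))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.3
w, b = rng.standard_normal(feats), 0.0

h = x @ w + b
# gradients of 0.5 * sum((h - y)**2) and 0.5 * mean((h - y)**2) with respect to w
gw_sum = x.T @ (h - y)
gw_mean = x.T @ (h - y) / N

print(np.allclose(gw_sum, N * gw_mean))  # True: the .sum() step is N times larger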