I have built two models: one without any hidden layer, with softmax at the output, and another with one hidden layer that uses sigmoid as its activation function. I expected the model with one hidden layer to perform better, but both models perform almost the same. Why does the model without any hidden layer perform so well? In both cases I trained the network on a large amount of data.
Here is the output of the model without any hidden layer. Can someone please explain why it shows such high accuracy? In the literature I have read that deeper networks have more expressive power.
`step: 4400, train_acc: 0.99, test_acc: 0.996
step: 4500, train_acc: 1.0, test_acc: 0.996
step: 4600, train_acc: 1.0, test_acc: 0.998
step: 4700, train_acc: 0.99, test_acc: 0.998
step: 4800, train_acc: 1.0,test_acc: 1.0
step: 4900, train_acc: 0.99,test_acc: 0.996`
It seems that your data set is linearly separable, which means a linear classifier can reach very high, if not 100%, accuracy on the training set. A model with no hidden layer and a softmax output is exactly such a linear classifier, and one neuron is all it takes to find a decision boundary for a linearly separable two-class problem. Adding more layers, and more neurons per layer, with nonlinear activation functions only serves to build more complex classifiers for more complex patterns.
In conclusion: if you already get the highest accuracy possible, what more would you expect a more complex network to offer? Extra computation cost, of course.
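To make the point concrete, here is a minimal sketch of softmax regression (inputs straight into softmax, no hidden layer) on a toy problem. The two Gaussian blobs, the learning rate, and the step count are all assumptions chosen for illustration; they are not the asker's data or settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs: a linearly separable toy problem (assumed data).
n = 200
X = np.vstack([rng.normal(-2.0, 0.5, size=(n, 2)),
               rng.normal(+2.0, 0.5, size=(n, 2))])
y = np.repeat([0, 1], n)
Y = np.eye(2)[y]                            # one-hot targets

# Softmax regression: a single linear map followed by softmax.
W = np.zeros((2, 2))
b = np.zeros(2)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(200):
    P = softmax(X @ W + b)
    grad = (P - Y) / len(X)                 # gradient of the mean cross-entropy w.r.t. the logits
    W -= 0.5 * (X.T @ grad)
    b -= 0.5 * grad.sum(axis=0)

accuracy = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(f"accuracy without a hidden layer: {accuracy:.3f}")
```

On data like this a hidden layer has nothing left to improve, which matches what the question observes.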
I am training an unsupervised NN model and, for some reason, after exactly one epoch (80 steps) the model stops learning.
Do you have any idea why it might happen and what should I do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problem is min f(x), then the loss in my DNN is loss = f(x). I have 64 inputs, 64 outputs, and 3 layers in between:
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
With more layers and nodes (doubled: 8 layers of 200 nodes each), I get a little more progress toward a lower error, but again the training error goes flat after 100 steps!
The symptom is that the training loss stops improving relatively early. Assuming that your problem is learnable at all, there are many possible reasons for this behavior. The following are the most relevant:
Improper preprocessing of input: Neural networks prefer input with
zero mean. E.g., if the input is all positive, it will restrict the
weights to be updated in the same direction, which may not be
desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling each channel to unit standard deviation may also be helpful.
Generalization ability of the network: The network is not complicated
or deep enough for the task.
This is very easy to check. You can train the network on just a few
images (say, 3 to 10). The network should be able to overfit the
data and drive the loss to almost 0. If it cannot, you may
have to add more layers, such as using more than 1 Dense layer.
Another good idea is to use pre-trained weights (see Applications in the Keras documentation). You may adjust the Dense layers at the top to fit your problem.
Improper weight initialization. Improper weight initialization can
prevent the network from converging (https://youtu.be/gYpoJMlgyXA,
the same video as before).
For the ReLU activation, you may want to use He initialization
instead of the default Glorot initialization. I find that this may be
necessary sometimes but not always.
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful to open the blackbox of convolutional networks.
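As a rough illustration of the preprocessing step described above (zero mean and unit standard deviation per channel), here is a NumPy sketch. The batch shape and the random pixel values are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch of 8 RGB images, 32x32, with pixel values in [0, 255].
images = rng.integers(0, 256, size=(8, 32, 32, 3)).astype(np.float32)

# Per-channel statistics, computed over batch, height, and width.
mean = images.mean(axis=(0, 1, 2), keepdims=True)
std = images.std(axis=(0, 1, 2), keepdims=True)

# The small epsilon guards against division by zero for a constant channel.
standardized = (images - mean) / (std + 1e-7)

channel_means = standardized.mean(axis=(0, 1, 2))
channel_stds = standardized.std(axis=(0, 1, 2))
print(channel_means)   # close to 0 for each channel
print(channel_stds)    # close to 1 for each channel
```

In practice the mean and standard deviation would be computed on the training set only and then reused unchanged for validation and test data.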
I faced the same problem, and this is what I did:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!
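For reference, the label conversion described above is a one-liner in NumPy. This sketch assumes that column 1 of the one-hot encoding marks the positive class.

```python
import numpy as np

# One-hot labels as in the post: one row per sample, one column per class.
one_hot = np.array([[0, 1], [1, 0], [1, 0]])

# Collapse each row to the index of its 1, giving a flat vector of 0/1 labels
# suitable for a single sigmoid output trained with binary cross-entropy.
labels = one_hot.argmax(axis=1)
print(labels)  # [1 0 0]

# Going back the other way, in case a one-hot target is needed again.
restored = np.eye(2, dtype=int)[labels]
```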
I trained a network on real-valued labels (floating point numbers from 0.0 to 1.0) - several residual blocks in the beginning, and the last layers are
fully-connected layer with 64 neurons + ELU activation,
fully-connected layer with 16 neurons + ELU activation,
output logistic regression layer (1 neuron with y = 1 / (1 + exp(-x))).
After training, I visualised weights of the layer with 16 neurons:
figure rows represent the weights that each of the 16 neurons developed for each of the 64 neurons of the previous layer; indices are 0..15 and 0..63;
UPD: figure shows the correlation (Pearson) between neuron weights;
UPD: figure shows the MAD (mean absolute difference) between neuron weights - this proves redundancy even better than correlation.
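For anyone who wants to reproduce this kind of analysis, here is a sketch of how the two matrices could be computed. The weight matrix is random and neuron 4 is deliberately made a near copy of neuron 0; both are assumptions for illustration, as are the two thresholds.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weights: 16 neurons (rows) x 64 inputs from the previous layer.
W = rng.normal(size=(16, 64))
W[4] = W[0] + 0.01 * rng.normal(size=64)    # planted near-duplicate of neuron 0

# Pearson correlation between every pair of neurons (rows are the variables).
corr = np.corrcoef(W)                                        # shape (16, 16)

# Mean absolute difference between every pair of weight vectors.
mad = np.abs(W[:, None, :] - W[None, :, :]).mean(axis=2)     # shape (16, 16)

# Flag pairs that are both highly correlated and close in absolute terms.
rows, cols = np.triu_indices(len(W), k=1)
redundant = [(int(a), int(b)) for a, b in zip(rows, cols)
             if corr[a, b] > 0.95 and mad[a, b] < 0.1]
print(redundant)
```

Correlation alone ignores scale, so two neurons with proportional weights would look redundant even if one is much larger; checking the MAD as well covers that case.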
Now the detailed questions:
Can we say that there are redundant features? I see several redundant groups of neurons: 0,4; 1,6,7 (maybe 8,11,15 too); 2,14; 12,13 (maybe).
Is it bad?
If so, is there any regularizer that penalizes redundant neuron weights and makes neurons develop uncorrelated weights?
I use the Adam optimizer, Xavier initialization (the best of those tested), weight decay 1e-5/batch (the best of those tested); other output layers did not work as well as logistic regression (in terms of precision, recall, and lack of overfitting).
I use only 10 filters in each resnet blocks (which are 10, too) to address overfitting.
Are you using TensorFlow? If yes, is post-training quantization an option?
tensorflow.org/lite/performance/post_training_quantization
This has some similar effect to what you need but also makes other improvements.
Alternatively, maybe you can also try quantization-aware training:
https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize
The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup where there are 3 parameters, and for each, we have only 2 choices. Together, it produces 8 points as depicted on the graph. In practice, I intend to have thousands of possible combinations of 100s of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
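The scoring scheme proposed above can be written down in a few lines. The TPR/FPR values below are made-up numbers for the 8 combinations of 3 binary parameters, and cost_ratio = 1.0 is likewise an assumption.

```python
from itertools import product

import numpy as np

# All 8 combinations of 3 parameters with 2 choices each.
settings = list(product([0, 1], repeat=3))

# Hypothetical measured rates, one entry per combination in `settings`.
tpr = np.array([0.20, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95])
fpr = np.array([0.05, 0.10, 0.30, 0.20, 0.40, 0.35, 0.60, 0.80])
cost_ratio = 1.0

# Parameter score = TPR - FPR * cost_ratio, as defined in the question.
scores = tpr - fpr * cost_ratio

best_setting = settings[int(np.argmax(scores))]   # optimal parameter set
overall = scores.mean()                           # average of all parameter scores

print(best_setting, overall)
```

Note that the average depends on which combinations are included, so two searches over different grids are not directly comparable with this score.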
I have found a lot of reference material for the ROC curve with a single threshold, and while there are other techniques available to determine the performance, the ones mentioned in this question are definitely considered standard approaches. I found no such reading material for the scenario presented on the right.
Bottom line, the question here is two-fold: (1) provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, and (2) provide a reference showing that the suggested methods are a standard approach for the given scenario.
P.S.: I had first posted this question on the "Cross Validated" forum, but didn't get any responses, in fact, got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on Grid Search. As with any model tuning, it's best to optimise hyper-parameters using one portion of the data and evaluate those parameters using another portion of the data, so GridSearchCV is best for this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about what parameters you may want to optimise. A cross-validated grid search is a computationally expensive process, so the smaller the search space, the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go-to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
              'max_features': [5, 8, 10],
              'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example, though you can choose from many default metrics in scikit-learn or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
                                  scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for every possible combination of the parameters I have given it, and use 5-fold cross-validation to evaluate how well each combination performs in terms of ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best performing model.
print(gs.best_params_)
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
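At this point you may want to retrain your classifier on all of the training data. Note that with the default refit=True, GridSearchCV has already refit best_estimator_ on the whole training set, so the explicit fit below only matters if refitting was disabled. Some people prefer not to retrain, but I'm a retrainer!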
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print(metrics.classification_report(y_train, clf.predict(X_train)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
Outputs:
             precision    recall  f1-score   support
          0       1.00      1.00      1.00      1707
          1       1.00      1.00      1.00      1793
avg / total       1.00      1.00      1.00      3500

             precision    recall  f1-score   support
          0       0.51      0.46      0.48       780
          1       0.47      0.52      0.50       720
avg / total       0.49      0.49      0.49      1500
We can see from the poor score on the test set that this model has overfit. But this is not surprising, as the data is just random noise! Hopefully when performing these methods on data with a signal you will end up with a well-tuned model.
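Since the grid search was scored on ROC AUC, it can also be useful to report that same metric on the held-out test set. Below is a smaller, self-contained sketch in Python 3 syntax; the reduced data size, the smaller grid, and the fixed random_state values are my own assumptions to keep it quick and repeatable.

```python
import numpy as np
from sklearn import ensemble, metrics, model_selection

rng = np.random.RandomState(42)
X = rng.random_sample((600, 10))
y = rng.randint(0, 2, 600)            # random labels, so there is no real signal to find
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42)

gs = model_selection.GridSearchCV(
    ensemble.RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [10, 20], 'max_depth': [None, 5]},
    scoring='roc_auc', cv=3)
gs.fit(X_train, y_train)

# Predicted probability of the positive class from the refit best model.
test_scores = gs.best_estimator_.predict_proba(X_test)[:, 1]
test_auc = metrics.roc_auc_score(y_test, test_scores)

print("best cross-validated ROC AUC: %.3f" % gs.best_score_)
print("held-out test ROC AUC:        %.3f" % test_auc)
```

On noise like this both numbers should hover around 0.5, which is the honest answer for data without signal.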
EDIT
This is one of those situations where 'everyone does it' but there's no real clear reference to say this is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on. For example using Google Scholar to search for "grid search" "SVM" "gene expression"
I feeeeel like we're talking about Grid Search in scikit-learn. It (1) provides methods to evaluate optimal (hyper)parameters and (2) is implemented in a massively popular and well referenced statistical software package.
I can't get TensorFlow RELU activations (neither tf.nn.relu nor tf.nn.relu6) working without NaN values for activations and weights killing my training runs.
I believe I'm following all the right general advice. For example I initialize my weights with
weights = tf.Variable(tf.truncated_normal(w_dims, stddev=0.1))
biases = tf.Variable(tf.constant(0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=b_dims))
and use a slow training rate, e.g.,
tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
But any network of appreciable depth results in NaN for the cost and at least some weights (at least in the summary histograms for them). In fact, the cost is often NaN right from the start (before training).
I seem to have these issues even when I use L2 (about 0.001) regularization, and dropout (about 50%).
Is there some parameter or setting that I should adjust to avoid these issues? I'm at a loss as to where to even begin looking, so any suggestions would be appreciated!
Following He et al. (as suggested in lejlot's comment), initializing the weights of the l-th layer from a zero-mean Gaussian distribution with standard deviation
$\sqrt{2 / n_l}$
where $n_l$ is the flattened length of the input vector, or
stddev=np.sqrt(2.0 / np.prod(input_tensor.get_shape().as_list()[1:]))
results in weights that generally do not diverge.
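To see why the scale matters, here is a NumPy-only sketch that pushes a random batch through a stack of bias-free ReLU layers and compares the He standard deviation with a fixed 0.1. The width, depth, and batch size are arbitrary assumptions, and a plain normal distribution stands in for tf.truncated_normal.

```python
import numpy as np

def second_moment_ratio(std_for_fan_in, width=512, depth=10, seed=0):
    """Return mean(h_out**2) / mean(h_in**2) after `depth` ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(256, width))           # random input batch
    start = np.mean(h ** 2)
    for _ in range(depth):
        W = rng.normal(0.0, std_for_fan_in(width), size=(width, width))
        h = np.maximum(h @ W, 0.0)              # linear layer (no bias) followed by ReLU
    return np.mean(h ** 2) / start

# He initialization: the standard deviation shrinks with the fan-in.
he_ratio = second_moment_ratio(lambda fan_in: np.sqrt(2.0 / fan_in))

# Fixed standard deviation, independent of the layer width.
fixed_ratio = second_moment_ratio(lambda fan_in: 0.1)

print("He init:    ", he_ratio)      # stays on the order of 1
print("fixed 0.1:  ", fixed_ratio)   # grows by orders of magnitude with depth
```

With a fixed scale the activations grow geometrically with depth for wide layers, which is one plausible route to the overflow and NaN values described in the question.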
If you use a softmax classifier at the top of your network, try to make the initial weights of the layer just below the softmax very small (e.g. std=1e-4). This makes the initial distribution of outputs of the network very soft (high temperature), and helps ensure that the first few steps of your optimization are not too large and numerically unstable.
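The effect of a small initial scale on the softmax output can be checked numerically. In this sketch the feature matrix is random and the layer sizes (100 features, 10 classes) are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical activations of the layer feeding the softmax: 64 samples, 100 features.
features = rng.normal(size=(64, 100))

def mean_max_prob(std):
    """Average confidence of the most likely class for a given weight scale."""
    W = rng.normal(0.0, std, size=(100, 10))
    return softmax(features @ W).max(axis=1).mean()

small = mean_max_prob(1e-4)   # near-uniform outputs, about 1/10 per class
large = mean_max_prob(1.0)    # nearly one-hot outputs before any training

print(small, large)
```

Near-uniform initial outputs keep the initial cross-entropy close to log(10), whereas confident random outputs produce large losses and large first gradient steps.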
Have you tried gradient clipping and/or a smaller learning rate?
Basically, you will need to process your gradients before applying them, as follows (from tf docs, mostly):
# Replace this with what follows
# opt = tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
# Create an optimizer.
opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.5)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(cross_entropy_loss, tf.trainable_variables())
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_value(gv[0], -5., 5.), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt = opt.apply_gradients(capped_grads_and_vars)
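To show what the clipping step does to the numbers, here is a NumPy illustration on a made-up gradient vector. np.clip mirrors what tf.clip_by_value does element-wise; the norm-based variant is an alternative I am adding for comparison, not something the code above uses.

```python
import numpy as np

grad = np.array([3.0, -12.0, 0.5, 8.0])   # hypothetical gradient values

# Element-wise clipping to [-5, 5]: large entries are cut, so the direction can change.
clipped_by_value = np.clip(grad, -5.0, 5.0)

# Norm clipping: rescale the whole vector if its L2 norm exceeds max_norm.
# The direction is preserved.
max_norm = 5.0
norm = np.linalg.norm(grad)
clipped_by_norm = grad * min(1.0, max_norm / norm)

print(clipped_by_value)
print(clipped_by_norm, np.linalg.norm(clipped_by_norm))
```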
Also, the discussion in this question might help.
Given a feed-forward neural network, how can one:
Ensure that it is independent of the order of the inputs? e.g., feeding [0.2, 0.3] would output the same result as [0.3, 0.2];
Ensure that it is independent of the order of groups of inputs? e.g., feeding [0.2, 0.3, 0.4, 0.5] would output the same result as [0.4, 0.5, 0.2, 0.3], but not [0.5, 0.4, 0.3, 0.2];
Ensure that a permutation of the input sequence gives a permutation of the output sequence? e.g., if [0.2, 0.3] gives as output [0.8, 0.7], then [0.3, 0.2] gives as output [0.7, 0.8].
Given the above:
Is there any other solution besides ensuring that the train set covers all the possible permutations?
Is the parity of the hidden layer somehow constrained (i.e., the number of neurons in the hidden layer must be odd or even)?
Does it make sense to look for some sort of symmetry in the weight matrix?
Well, it looks like a hard job for an NN, but:
1. I'd write a preprocessing (and maybe postprocessing) script that takes care of all your permutations and makes sure the easiest possible input is given to the NN. I think pre(post)processing would be a much easier way to achieve your goal than adjusting the NN (adding one or more hidden layers).
2 & 3. NNs are usually treated as black boxes: you train one and analyse just its input and output. In most cases it doesn't make sense (it is time-demanding) to try to understand how it works inside. Of course there are exceptions, e.g. if you have a functional NN and would like to mine some knowledge from it, but as I said, it is time-consuming.
In general, there are no constraints on the number of hidden neurons per layer. Also, looking for symmetry in the weight matrix doesn't make sense unless you are trying to extract some knowledge ...
Here is my attempt to answer the questions as best as I can.
How to
1. To get the required result you can either
train all permutations, or
sort the input data and train on it (so the network doesn't have to learn the permutations as well)
2. To get the requested result you again have two possibilities:
train all permutations (time-consuming)
or better, use another type of network, for example a recurrent neural network with the echo state network training algorithm (paper here)
3. I would again try to solve it with the echo state network algorithm.
I hope it helps, even though the possible solutions for the second and third problems are not feed-forward networks.
Answering the questions
3. I don't think that it makes any sense to look for symmetries in the weight matrix.
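The "sort the input" suggestion can be sketched as a thin wrapper around any fixed network. The tiny random network below is only a stand-in, and sorting groups lexicographically is my own assumption about how to extend the idea to the grouped case from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in network with random weights: 4 inputs -> 8 tanh units -> 1 output.
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 1)), rng.normal(size=1)

def net(x):
    return np.tanh(x @ W1 + b1) @ W2 + b2

def order_invariant(x):
    """Sort the inputs first, so any permutation reaches the network identically."""
    return net(np.sort(np.asarray(x)))

def group_invariant(x, group_size=2):
    """Sort whole groups (by first element, then second) but keep each group intact."""
    groups = np.asarray(x).reshape(-1, group_size)
    order = np.lexsort(groups.T[::-1])     # the last key passed to lexsort is the primary key
    return net(groups[order].reshape(-1))

a = order_invariant([0.2, 0.3, 0.4, 0.5])
b = order_invariant([0.5, 0.4, 0.3, 0.2])
c = group_invariant([0.2, 0.3, 0.4, 0.5])
d = group_invariant([0.4, 0.5, 0.2, 0.3])
e = group_invariant([0.5, 0.4, 0.3, 0.2])
print(a, b, c, d, e)
```

Here a equals b and c equals d, while e differs because reversing the elements inside each group produces different groups. The same preprocessing has to be applied at training time and at inference time.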