I am training a sklearn MLPClassifier() on an input dataset with 23 features, but layer 0 in coefs_, i.e. clf.coefs_[0], only has 18 arrays in it. Does this mean 5 of the features have zero weights (i.e. are not used)? If so, how do I find which features were dropped during training?
In general, suppose I have N input features and K hidden layers of sizes (S_1, ..., S_K). Is this the correct interpretation of how coefs_ will look?
coefs_[i] can be interpreted as giving the weights for layer i (layer 0 connects the inputs with the first hidden layer).
In the above case, i will run from 0 to K (1 input layer + K hidden layers).
coefs_[i][j] gives the weights connecting unit j in layer i to each of the units in layer i + 1.
E.g.
coefs_[0][0] will have S_1 entries, telling how input feature 0 is connected to each of the S_1 "features" in hidden layer 1.
coefs_[1][0] will have S_2 entries, telling how hidden layer feature 0 is connected to each of the S_2 "features" in hidden layer 2.
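A quick way to check this interpretation is to look at the shapes of the matrices in coefs_. A minimal sketch, with made-up data and hidden_layer_sizes (only the shapes matter):

# Illustrative sketch: the shapes of coefs_ follow the interpretation above.
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(200, 23)              # 23 input features
y = np.random.randint(0, 2, size=200)    # binary target

clf = MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=500)
clf.fit(X, y)

for i, w in enumerate(clf.coefs_):
    print(i, w.shape)
# 0 (23, 10) -> coefs_[0][j] are the weights from input feature j to the 10 units of hidden layer 1
# 1 (10, 5)  -> coefs_[1][j] are the weights from hidden unit j to the 5 units of hidden layer 2
# 2 (5, 1)   -> last hidden layer to the single output unit of this binary problem

Note that coefs_[0] has one row per feature the model was fitted on, so a coefs_[0] with 18 rows suggests the model actually saw 18 features at fit time rather than features being dropped.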
Related
Is there any kind of majority layer in Keras? That is, the input would be some 1D vector, and the output is a single number: the value that has the most occurrences in the input vector?
My use case is this: I'm building an ensemble of neural networks, but let's say I want to end up with a single network. So I'm building a new network with the previous models as inputs, and I want to add a single output layer that simply runs a majority vote. Is this possible with Keras?
Note that such a layer would not make much sense with float activations, since the probability of two floats being exactly equal is 0. Consequently, this only makes sense if your inputs are categorical. For a categorical output you usually one-hot encode the result, so imagine having N networks each producing a one-hot encoding of K possible values, giving an N x K tensor:
Net 1: 0 0 1 0 0 0
Net 2: 1 0 0 0 0 0
Net 3: 0 0 1 0 0 0
Now all we have to do is sum over N:
1 0 2 0 0 0
and take an argmax to find what you want. So you just compose two standard operations available in every NN library: summation and argmax. Note that this solution actually works even if your inputs are floats; you "sum" the votes and take an argmax.
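As a rough Keras sketch of composing those two operations (the member networks are replaced by plain Input placeholders here, and all names and sizes are illustrative):

# Illustrative "sum the votes, then argmax" head for an ensemble.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Input, Model

K_CLASSES = 6   # number of categories (illustrative)
N_NETS = 3      # number of member networks (illustrative)

# Stand-ins for the member networks' outputs (each a length-K one-hot / probability vector).
votes = [Input(shape=(K_CLASSES,), name=f"net_{i}") for i in range(N_NETS)]

summed = layers.Add()(votes)                                        # sum over the N networks
majority = layers.Lambda(lambda t: tf.argmax(t, axis=-1))(summed)   # most-voted class index

ensemble = Model(inputs=votes, outputs=majority)

# The example above: nets vote for classes 2, 0, 2 -> the majority is class 2
print(ensemble.predict([np.array([[0, 0, 1, 0, 0, 0]], dtype="float32"),
                        np.array([[1, 0, 0, 0, 0, 0]], dtype="float32"),
                        np.array([[0, 0, 1, 0, 0, 0]], dtype="float32")]))  # [2]

Note that argmax is not differentiable, so a voting head like this is only useful for inference, not for training the ensemble end-to-end.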
I trained a network on real-valued labels (floating point numbers from 0.0 to 1.0) - several residual blocks at the beginning, and the last layers are:
fully-connected layer with 64 neurons + ELU activation,
fully-connected layer with 16 neurons + ELU activation,
output logistic regression layer (1 neuron with y = 1 / (1 + exp(-x))).
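For concreteness, a minimal Keras sketch of just this head; the residual trunk is omitted, and the 128-dimensional input and the loss are illustrative placeholders, not the actual setup:

# Illustrative sketch of the described head (residual blocks omitted).
from tensorflow.keras import layers, Input, Model

features = Input(shape=(128,))                      # stand-in for the residual blocks' output
x = layers.Dense(64, activation="elu")(features)    # fully-connected, 64 neurons, ELU
x = layers.Dense(16, activation="elu")(x)           # fully-connected, 16 neurons, ELU
out = layers.Dense(1, activation="sigmoid")(x)      # logistic-regression output: 1 / (1 + exp(-x))

head = Model(features, out)
head.compile(optimizer="adam", loss="binary_crossentropy")  # loss choice is illustrative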
After training, I visualised weights of the layer with 16 neurons:
figure rows represent the weights that each of the 16 neurons developed for each of the 64 neurons of the previous layer; indices are 0..15 and 0..63;
UPD: figure shows the correlation (Pearson) between the neurons' weight vectors;
UPD: figure shows the MAD (mean absolute difference) between the neurons' weight vectors - this demonstrates the redundancy even better than correlation does.
Now the detailed questions:
Can we say that there are redundant features? I see several redundant groups of neurons: 0,4; 1,6,7 (maybe 8,11,15 too); 2,14; 12,13 (maybe).
Is this bad?
If so, is there any regularizer that penalizes redundant neuron weights and makes neurons develop uncorrelated weights?
I use the Adam optimizer, Xavier initialization (the best of those tested), and weight decay of 1e-5 per batch (the best of those tested); other output layers did not work as well as logistic regression (in terms of precision & recall & lack of overfitting).
I use only 10 filters in each ResNet block (of which there are also 10) to address overfitting.
Are you using TensorFlow? If yes, is post-training quantization an option?
tensorflow.org/lite/performance/post_training_quantization
This has a somewhat similar effect to what you need, but it also brings other improvements.
Alternatively, you could also try quantization-aware training:
https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize
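As a rough sketch of what the linked post-training quantization looks like in TF 2.x (the stand-in model and file name are illustrative; the linked page also covers full-integer quantization with a representative dataset):

# Illustrative post-training quantization with the TFLite converter.
import tensorflow as tf
from tensorflow.keras import layers

# A tiny stand-in model; replace with your trained network.
model = tf.keras.Sequential([layers.Input(shape=(16,)),
                             layers.Dense(1, activation="sigmoid")])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables dynamic-range quantization
tflite_quant_model = converter.convert()

with open("model_quant.tflite", "wb") as f:            # illustrative file name
    f.write(tflite_quant_model)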
In the context of a convolutional neural network designed to extract DNA motifs, why would one stack convolution layers without max pooling functions in between?
Here's the context in which this architecture appears.
self.model = Sequential()
assert len(num_filters) == len(conv_width)
for i, (nb_filter, nb_col) in enumerate(zip(num_filters, conv_width)):
    conv_height = 4 if i == 0 else 1
    self.model.add(Convolution2D(
        nb_filter=nb_filter, nb_row=conv_height,
        nb_col=nb_col, activation='linear',
        init='he_normal', input_shape=self.input_shape,
        W_regularizer=l1(L1), b_regularizer=l1(L1)))
    self.model.add(Activation('relu'))
    self.model.add(Dropout(dropout))
self.model.add(MaxPooling2D(pool_size=(1, pool_width)))
For a given input dimension you can only reduce the spatial dimensions (typically by a factor of 2 each time) so many times before you arrive at a 1x1 output that can't be reduced any more. Hence, for a deep net you have no choice but to have groups of layers (convolutions) without dimensionality reduction, separated by layers that do reduce dimensionality. So it's not that there is any particular advantage to having convolutional layers without max pooling in between, but rather that you can only have so many max pooling layers in total for a given input size.
Note that the only function of max pooling as used here is dimensionality reduction - there's no other benefit to it. In fact, more modern all-convolutional architectures such as ResNet-50 don't use max pooling (except at the input), and instead use stride 2 convolutions to gradually reduce dimensions.
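To illustrate the stride-2 point, a small sketch in the modern Keras API (shapes and filter counts are arbitrary): both branches halve the spatial dimensions, one with pooling and one without.

# Max pooling vs. a stride-2 convolution as the downsampling step.
from tensorflow.keras import layers, Input, Model

inp = Input(shape=(64, 64, 3))

# (a) convolution followed by max pooling
a = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
a = layers.MaxPooling2D(pool_size=2)(a)                                      # 64x64 -> 32x32

# (b) a stride-2 convolution does the same downsampling without a pooling layer
b = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)  # 64x64 -> 32x32

print(Model(inp, [a, b]).output_shape)   # [(None, 32, 32, 32), (None, 32, 32, 32)]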
The code provided does use activations between convolutions:
self.model = Sequential()
assert len(num_filters) == len(conv_width)
for i, (nb_filter, nb_col) in enumerate(zip(num_filters, conv_width)):
    conv_height = 4 if i == 0 else 1
    self.model.add(Convolution2D(
        nb_filter=nb_filter, nb_row=conv_height,
        nb_col=nb_col, activation='linear',
        init='he_normal', input_shape=self.input_shape,
        W_regularizer=l1(L1), b_regularizer=l1(L1)))
    self.model.add(Activation('relu'))  # <--------------------- ACTIVATION
    self.model.add(Dropout(dropout))
self.model.add(MaxPooling2D(pool_size=(1, pool_width)))
The resulting model is something like
conv -- relu -- dropout -- conv -- relu -- dropout -- ... -- max pool
Why did they put the activation separately instead of specifying "activation" inside the conv itself? No idea; it looks like an odd implementation decision, but from a practical point of view the
self.model.add(Convolution2D(
nb_filter=nb_filter, nb_row=conv_height,
nb_col=nb_col, activation='linear',
init='he_normal', input_shape=self.input_shape,
W_regularizer=l1(L1), b_regularizer=l1(L1)))
self.model.add(Activation('relu'))
and
self.model.add(Convolution2D(
nb_filter=nb_filter, nb_row=conv_height,
nb_col=nb_col, activation='relu',
init='he_normal', input_shape=self.input_shape,
W_regularizer=l1(L1), b_regularizer=l1(L1)))
are equivalent.
I am trying to make a digit recognition program. I will feed in a white/black image of a digit, and my output layer will fire the corresponding digit (one neuron fires, out of the 0 -> 9 neurons in the output layer). I have finished implementing a two-dimensional backpropagation neural network. My topology sizes are [5][3] -> [3][3] -> [1][10]. So it's one 2-D input layer, one 2-D hidden layer, and one 1-D output layer. However, I am getting weird and wrong results (average error and output values).
Debugging at this stage is rather time-consuming, so I would love to hear whether this is the correct design before I continue debugging. Here are the flow steps of my implementation:
Build the network: one bias on each layer except the output layer (no bias). A bias's output value is always 1.0, but its connection weights get updated on each pass like those of every other neuron in the network. All weights range from 0.000 to 1.000 (no negatives).
Get the input data (0 or 1) and set the nth value as the nth neuron's output value in the input layer.
Feed forward: for each neuron 'n' in every layer (except the input layer):
Get the result of SUM (output value * connection weight) over the neurons in the previous layer connected to this nth neuron.
Get the TanHyperbolic - transfer function - of this SUM as the result.
Set the result as the output value of this nth neuron.
Get results: take the output values of the neurons in the output layer.
Backpropagation:
Calculate the network error: on the output layer, get the SUM of the neurons' (target value - output value)^2. Divide this SUM by the size of the output layer. Take its square root as the result. Compute average error = (OldAverageError * SmoothingFactor + Result) / (SmoothingFactor + 1.00).
Calculate output layer gradients: for each output neuron 'n', the nth gradient = (nth target value - nth output value) * TanHyperbolic derivative of the nth output value.
Calculate hidden layer gradients: for each neuron 'n', get SUM (weight going from this nth neuron * gradient of the destination neuron) as the result. Assign (result * TanHyperbolic derivative of this nth output value) as the gradient.
Update all weights: starting from the hidden layer and going back to the input layer, for the nth neuron compute NewDeltaWeight = (NetLearningRate * nth output value * nth gradient + Momentum * OldDeltaWeight). Then assign the new weight as (OldWeight + NewDeltaWeight).
Repeat the process.
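To make these steps concrete, here is a minimal numpy sketch of the same loop with the layers flattened to 1-D and the biases omitted; the sizes, learning rate, momentum, and the symmetric weight initialisation are illustrative choices, not the exact setup described above:

# Illustrative tanh backprop for a 15 -> 9 -> 10 network (flattened 5x3 and 3x3 layers).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 15, 9, 10
eta, alpha = 0.15, 0.5                           # learning rate and momentum (illustrative)

W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))    # note: negative weights allowed here
W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
dW1_prev, dW2_prev = np.zeros_like(W1), np.zeros_like(W2)

x = rng.integers(0, 2, n_in).astype(float)       # one black/white input image, flattened
target = np.zeros(n_out)
target[7] = 1.0                                  # "digit seven": only output neuron 7 should fire

for _ in range(1000):
    # Feed forward: weighted sums pushed through tanh
    h = np.tanh(x @ W1)
    o = np.tanh(h @ W2)

    # Output layer gradients: (target - output) * tanh'(net), with tanh' = 1 - o^2
    grad_o = (target - o) * (1.0 - o ** 2)
    # Hidden layer gradients: outgoing weights times destination gradients, times tanh'(hidden output)
    grad_h = (W2 @ grad_o) * (1.0 - h ** 2)

    # Weight updates: eta * source output * destination gradient, plus momentum
    dW2 = eta * np.outer(h, grad_o) + alpha * dW2_prev
    dW1 = eta * np.outer(x, grad_h) + alpha * dW1_prev
    W2 += dW2
    W1 += dW1
    dW2_prev, dW1_prev = dW2, dW1

print(np.round(np.tanh(np.tanh(x @ W1) @ W2), 2))  # neuron 7 should approach 1, the others 0

With 0/1 targets and tanh outputs, output neuron 7 should drift towards 1 and the others towards 0, which is a useful sanity check before debugging the full 2-D version.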
Here is my attempt at digit number seven. The outputs shown are neuron #0 and neuron #6. Neuron #6 should be carrying 1 and neuron #0 should be carrying 0. In my results, all neurons other than #6 are carrying the same value (#0 is shown as a sample).
Sorry for the long post. If you know this topic, then you probably know how cool it is and how much of it there is to fit into a single post. Thank you in advance.
Softmax with log loss is typically used as the output layer activation function for multiclass problems. You have a multiclass/multinomial problem: the 10 possible digits comprise the 10 classes.
So you can try changing your output layer activation function to softmax:
http://en.wikipedia.org/wiki/Softmax_function
Artificial neural networks: In neural network simulations, the softmax function is often implemented at the final layer of a network used for classification. Such networks are then trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
Let us know what effect that has.
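For illustration, here is a minimal numpy version of the suggested softmax output; the logits are made up, and this is a sketch of the idea rather than a drop-in for the asker's code:

# Softmax over the 10 digit classes.
import numpy as np

def softmax(z):
    """Numerically stable softmax: exponentiate shifted logits and normalise."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.1, -1.2, 0.3, 2.0, 0.0, 0.5, 3.1, -0.4, 0.2, 0.7])  # 10 output sums (illustrative)
probs = softmax(logits)
print(probs.sum())      # ~1.0: a proper probability distribution over the 10 digits
print(probs.argmax())   # 6: the predicted digit

# With log loss (cross-entropy) the output-layer error term simplifies to (target - probs),
# replacing the (target - output) * tanh'(...) term used in the question.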
I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I am having some trouble understanding why exactly the perceptron rule only works for linearly separable data.
Mitchell defines a perceptron as follows:
That is, y is 1 if the weighted sum of the inputs exceeds some threshold, and -1 otherwise.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is with the perceptron rule:
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight w_i associated with input x_i according to the rule:
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of the x's, and you are, in fact, splitting your data into 2 classes with the hyperplane a_1 x_1 + … + a_n x_n = 0: one class is where a_1 x_1 + … + a_n x_n > 0 and the other is where it is not.
Consider a 2D example: X = (x, y) and W = (a, b); then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -(a/b)*x (assuming b > 0; if b < 0 the inequality flips). This equation is linear, and it divides the 2D plane into 2 parts.
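To see the rule in action, a minimal numpy sketch of the perceptron training rule on a linearly separable toy set (the data, learning rate, and the line used to label it are all made up):

# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i, repeated until no mistakes remain.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 2))
t = np.where(2 * X[:, 0] - X[:, 1] + 0.3 > 0, 1, -1)   # labels from a known separating line

Xb = np.hstack([X, np.ones((100, 1))])   # constant 1 appended so the last weight acts as the threshold
w = np.zeros(3)
eta = 0.1

for _ in range(100):                     # iterate over the training set until no example is misclassified
    errors = 0
    for x, target in zip(Xb, t):
        o = 1 if w @ x > 0 else -1       # sgn(w . x)
        if o != target:
            w += eta * (target - o) * x  # update only on mistakes
            errors += 1
    if errors == 0:
        break

print(w)   # a weight vector whose hyperplane w . x = 0 separates the two classes

If the data were not linearly separable, no weight vector could classify every example correctly, so the inner loop would keep finding mistakes forever; that is exactly why the rule is only guaranteed to converge for linearly separable data.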