Background:
Tagging TensorFlow since Keras runs on top of it, and this is more of a general deep-learning question.
I have been working on the Kaggle Digit Recognizer problem and used Keras to train CNN models for the task. The model below is the original CNN structure I used for this competition, and it performed okay.
def build_model1():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), padding="Same", activation="relu", input_shape=[28, 28, 1]))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))
    return model
Then I read some other notebooks on Kaggle and borrowed another CNN structure (copied below), which works much better than the one above: it achieved higher accuracy, a lower error rate, and took many more epochs before overfitting the training data.
def build_model2():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (5, 5), padding="Same", activation="relu", input_shape=(28, 28, 1)))
    model.add(layers.Conv2D(32, (5, 5), padding="Same", activation="relu"))
    model.add(layers.MaxPool2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))
    return model
Question:
Is there any intuition or explanation behind the better performance of the second CNN structure? What is it that makes stacking 2 Conv2D layers better than just using 1 Conv2D layer before max pooling and dropout? Or is there something else that contributes to the result of the second model?
Thank y'all for your time and help.
The main difference between these two approaches is that the latter (2 convs) has more flexibility in expressing non-linear transformations without losing information. Max pooling removes information from the signal, and dropout forces a distributed representation; both effectively make it harder to propagate information. If, for a given problem, a highly non-linear transformation has to be applied to the raw data, stacking multiple convs (with ReLUs) makes it easier to learn, that's it. Also note that you are comparing a model with 3 max poolings against a model with only 2, so the second one will potentially lose less information. Another thing is that it has a much bigger fully connected part at the end, while the first one's is tiny (64 neurons + 0.5 dropout means that you effectively have at most 32 neurons active, which is a tiny layer!). To sum up:
These architectures differ in many aspects, not just in the stacking of conv layers.
Stacking conv layers usually leads to less information being lost in processing; see for example "all convolutional" architectures. (A quick shape comparison of your two models is sketched below.)
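To see these structural differences concretely, here is a small check I would run (my own sketch, not part of the original answer; it assumes the build_model1/build_model2 definitions above with layers and models imported from tf.keras):

from tensorflow.keras import layers, models  # assumed import, also needed by the definitions above

m1 = build_model1()
m2 = build_model2()
m1.summary()  # feature map before Flatten is (3, 3, 64), feeding a Dense(64) layer
m2.summary()  # feature map before Flatten is (7, 7, 64), feeding a Dense(256) layer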
Related
model = Sequential()
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', input_shape=X_train.shape[1:]))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Dropout(0.5))
In the above code, can I use Conv2D(128) once instead of Conv2D(64) twice?
No, you cannot, because the two configurations do not represent the same functions. This pattern was introduced in the VGG network paper, and it is used to increase the representational power of the network. Two layers with 3x3 filters are roughly equivalent (through composition) to one layer with a 5x5 filter; it is not equivalent to adding up the number of filters.
In particular, a convolutional layer with 128 filters is not the same as two convolutional layers with 64 filters each, especially considering that there is a ReLU activation in between them, which makes the behavior more non-linear.
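To make this concrete, here is a rough sketch (my own illustration, not from the original answer, assuming tf.keras): two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, add an extra ReLU in between, and, for a 64-channel input, even use fewer parameters; neither configuration is the same as a single layer with 128 filters.

from tensorflow.keras import layers, models

feat_shape = (32, 32, 64)  # hypothetical 64-channel feature map

one_5x5 = models.Sequential([
    layers.Conv2D(64, (5, 5), padding='same', activation='relu', input_shape=feat_shape),
])
two_3x3 = models.Sequential([
    layers.Conv2D(64, (3, 3), padding='same', activation='relu', input_shape=feat_shape),
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
])

print(one_5x5.count_params())  # 64*5*5*64 + 64       = 102464
print(two_3x3.count_params())  # 2 * (64*3*3*64 + 64) =  73856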
I want to train a CNN with an SVM-style classifier at the last layer. I understand that categorical_hinge is the appropriate loss function for that. I have 6 classes to classify.
My model is as shown below:
model = Sequential()
model.add(Conv2D(50, (3, 3), activation='relu', input_shape=train_data.shape[1:]))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(50, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(50, (3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(400, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation = 'sigmoid'))
Is there a problem with the network, the data processing, or the loss function?
The model does not learn anything after a point, as shown in the image.
What should I do?
Your model has a single output neuron; there is no way this will work with 6 classes. The output of your model should have 6 neurons. Also, the output layer should have no activation function, so that it produces logits that the categorical hinge can use.
Note that the categorical hinge was added recently (2-3 weeks ago), so it's quite new and probably not many people have tested it yet.
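For illustration, a hedged sketch of that change applied to the model above (my own wiring; the optimizer is an arbitrary choice, and the labels must be one-hot encoded):

model.pop()         # drop the Dense(1, activation='sigmoid') head
model.add(Dense(6)) # six linear outputs (logits), no activation
model.compile(loss='categorical_hinge', optimizer='adadelta', metrics=['accuracy'])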
Use hinge loss and a linear activation in the last layer.
from keras.regularizers import l2

model.add(Dense(nb_classes, kernel_regularizer=l2(0.01)))
model.add(Activation('linear'))
model.compile(loss='hinge',
optimizer='adadelta',
metrics=['accuracy'])
For more information, see https://github.com/keras-team/keras/issues/6090
I am enjoying the simplicity that Keras offers; however, I have not been successful in configuring a Keras regression model with multiple outputs.
More specifically, I have a Keras model that consumes X values with 308 columns and has 28 target Y values. The model is (I think) quite simple, and I would have thought it would converge quite quickly, but in fact it does not.
I am guessing here, but I think I have set up the model incorrectly and am looking for assistance on how to configure a Keras model to work properly.
Data information:
Number of rows: 46038
My input shape: X_train: (46038, 308)
My target shape: Y_train: (46038, 28)
The inputs (X) are a series of floats representing values that influence the allocation of a resource. The targets are a series of floats (which total/sum to 1.0, representing the actual percentage allocation to a particular resource). My goal is to predict resource percentage allocations (Y) based upon the provided inputs (X). As such, I believe this is a regression problem and not a classification problem (correct me if I am wrong).
Sample data:
X: [100, 200, 400, 600, 32, 1, 0.1, 0.5, 2500...] (308 columns, with 40000+ rows)
Y: [0.333, 0.667, 0.0, 0.0, 0.0, ...]
In the case of Y above, this means that 0.333 (33%) of the resource is allocated to the first resource, 0.667 (67%) is allocated to the second resource, and 0.0 to all others.
Model:
model = Sequential()
model.add(Dense(256, input_shape=(308,) ))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(256, input_shape=(256,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(28))
model.compile(loss='mean_squared_error', optimizer='adam')
Here are a few specific questions:
1. Is my model configured properly to achieve my goals?
2. Should I have different activation functions?
3. Are my input shapes (308,) setup properly? Are my output shapes (28) correct?
4. Should I have an activation on my output layer (for example: model.add(Activation('softmax')))? If yes, what type would be ideal?
(I don't think it is particularly relevant, but I am using a Tensorflow backend)
model = Sequential()
model.add(Dense(256, input_shape=(308,) ))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(256, input_shape=(256,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(28, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
This should solve the problem. Although it looks like a regression problem, the allocations compete with each other, which makes it more like a classification problem and calls for a softmax nonlinearity and the categorical_crossentropy loss.
Update
For early stopping you'll need a validation set and the following code:
earlyStopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=0, verbose=0, mode='auto')
# a validation split (here 20%) is required so that val_loss exists for EarlyStopping to monitor
model.fit(X, y, batch_size=100, epochs=100, verbose=1,
          callbacks=[earlyStopping], validation_split=0.2, shuffle=True)
Also, you'll need to define a custom metric function that returns the cross-entropy loss instead of accuracy, and pass it to the metrics argument of model.compile.
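A minimal sketch of such a metric (my own formulation; the function name is an arbitrary choice):

from keras import backend as K

def crossentropy_metric(y_true, y_pred):
    # report the categorical cross-entropy between the true and predicted allocations
    return K.categorical_crossentropy(y_true, y_pred)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[crossentropy_metric])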
I am trying to adapt the Keras autoencoder example to my data. I have the following network:
Xtrain = np.reshape(Xtrain, (len(Xtrain), 28, 28, 2))
Xtest = np.reshape(Xtest, (len(Xtest), 28, 28, 2))
input_signal = Input(shape=(28, 28, 2))
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_signal)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same', name='encoder')(x)
# added Dense layers, is that correct?
encoded2 = Flatten()(encoded)
encoded2 = Dense(128, activation='sigmoid')(encoded2)
encoded2 = Dense(128, activation='softmax')(encoded2)
encoded3 = Reshape((4, 4, 8))(encoded2)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded3)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(2, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder = Model(inputs=input_signal, outputs=decoded)
encoder = Model(input_signal, encoded2)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(Xtrain, Xtrain, epochs=100, batch_size=128, shuffle=True, validation_data=(Xtest, Xtest))
When I run this on MNIST data, which is normalized to [0, 1], everything works fine, but with my data, which is in the range [-1, 1], I only see negative losses and 0.0000 accuracy while training. If I do data = np.abs(data), training starts and seems to go well, but taking abs() of the data makes no sense, since it means training on faked data.
The data I'm trying to feed to the network are the IQ channels of a signal, the 1st channel for the real part and the 2nd channel for the imaginary part, so both are normalized to [-1, 1], and both often contain very small values, e.g. 5e-12. I have shaped them into a (28, 28, 2) input.
I have also added Dense layers in the middle of the autoencoder, as I wish to make predictions about classes (which are fitted automatically) once the autoencoder completes training. Did I do this correctly, or does it break the network?
There are several issues with your question, including your understanding of autoencoders and their usage. I strongly suggest at least going through the Keras blog post Building Autoencoders in Keras (if you have already gone through it, arguably you should do it again, this time more thoroughly).
A few general points, most of which are included in the above linked post:
Autoencoders are not used for classification, hence it makes no sense to ask for a metric such as accuracy. Similarly, since the fitting objective is the reconstruction of their input, categorical cross entropy is not the correct loss function to use (try binary cross entropy instead).
The very existence of the intermediate dense layers you use is puzzling, and even more puzzling is the choice of a sigmoid layer followed by a softmax one; the same holds for the sigmoid choice in your final, decoded layer. Both these activation functions are normally used for classification purposes at final layers, so again refer to point (1) above.
I strongly suggest you start with a model demonstrated in the blog post linked above, and, if necessary, incrementally modify it to fit your purpose, as I am not sure what you have built here can even qualify as an autoencoder in the first place.
You are mixing binary ('sigmoid') and categorical ('softmax' and 'categorical_crossentropy') components. Change the following:
Remove the dense layers in between and feed 'encoded' instead of 'encoded3' to the decoder (a sketch of this wiring follows the list)
Change the autoencoder loss to 'binary_crossentropy'
Alternatively, if you really want to try the dense layers in between, just use them without an activation function (None)
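A minimal sketch of the first suggestion, reusing the layers and imports from the question (the wiring is my own; only the decoder input changes, the rest is unchanged):

# decoder now starts from 'encoded' (shape (4, 4, 8)); the Dense/Reshape block is dropped
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(2, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(inputs=input_signal, outputs=decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')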
How do I prevent a lazy convolutional neural network? I end up with a 'lazy CNN' after training it with Keras: whatever the input is, the output is constant. What do you think the problem is?
I am trying to reproduce the experiment from NVIDIA's End to End Learning for Self-Driving Cars paper. Of course, I do not have a real car, but I use Udacity's simulator. The simulator generates images of the view in front of a car.
A CNN receives the image and outputs the steering angle needed to keep the car on the track. The rule of the game is to keep the simulated car running safely on the track. It is not very difficult.
The strange thing is that I sometimes end up with a lazy CNN after training with Keras, one that gives a constant steering angle. The simulated car goes off the track, but the output of the CNN does not change. This happens especially when the network gets deeper, e.g. with the CNN in the paper.
If I use a CNN like this, I can get a useful model after training.
model = Sequential()
model.add(Lambda(lambda x: x/255.0 - 0.5, input_shape = (160,320,3)))
model.add(Cropping2D(cropping=((70,25),(0,0))))
model.add(Conv2D(24, 5, strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(36, 5, strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(48, 5, strides=(2, 2)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(50))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('sigmoid'))
model.add(Dense(1))
But if I use a deeper CNN, I am more likely to end up with a lazy CNN.
Specifically, if I use a CNN like NVIDIA's, I get a lazy CNN after almost every training run.
model = Sequential()
model.add(Lambda(lambda x: x/255.0 - 0.5, input_shape = (160,320,3)))
model.add(Cropping2D(cropping=((70,25),(0,0))))
model.add(Conv2D(24, 5, strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(36, 5, strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(48, 5, strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(64, 3, strides=(1, 1)))
model.add(Activation('relu'))
model.add(Conv2D(64, 3, strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(1164))
model.add(Activation('sigmoid'))
model.add(Dense(100))
model.add(Activation('sigmoid'))
model.add(Dense(50))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('sigmoid'))
model.add(Dense(1))
I use 'relu' for the convolutional layers, and the activation function for the fully connected layers is 'sigmoid'. I have tried changing the activation function, but there is no effect.
Here is my analysis. I do not believe there is a bug in my program, because I can successfully drive the car with the same code and a simpler CNN. I think the reason is the simulator or the structure of the neural network. In a real self-driving car, the training signal, that is, the steering angle, contains noise, since the driver never holds the wheel perfectly still on a real road. But in the simulator, the training signal is very clean: almost 60% of the steering angles are zero. The optimizer can easily do its job by pushing the output of the CNN close to zero; it seems the optimizer is lazy too. However, when we really want the CNN to output something, it also gives zeros. So I add small noise to these zero steering angles. The chance of getting a lazy CNN becomes smaller, but it does not disappear.
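For reference, the noise trick I describe looks roughly like this (my own sketch, assuming the steering angles are stored in a 1-D NumPy array y_train; the noise scale is an arbitrary choice):

import numpy as np

# jitter only the exactly-zero steering angles so a constant zero output no longer fits them
zero_mask = (y_train == 0.0)
y_train = y_train.astype(np.float32).copy()
y_train[zero_mask] += np.random.normal(loc=0.0, scale=0.01, size=zero_mask.sum())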
What do you think about my analysis? Is there another strategy that I can use? I am wondering whether similar problems have been solved in the long history of CNN research.
Resources:
The related files have been uploaded to GitHub. You can repeat the entire experiment with these files.
I can't run your model, because neither the question nor the GitHub repo contains the data. That's why I am only 90% sure of my answer.
But I think the main problem with your network is the sigmoid activation function after the dense layers. I assume it trains well when there are just two of them, but four is too many.
Unfortunately, NVIDIA's End to End Learning for Self-Driving Cars paper doesn't specify it explicitly, but these days the default activation is no longer sigmoid (as it once was), but relu. See this discussion if you're interested in why that is so. So the solution I'm proposing is to try this model:
model = Sequential()
model.add(Lambda(lambda x: x/255.0 - 0.5, input_shape = (160,320,3)))
model.add(Cropping2D(cropping=((70,25),(0,0))))
model.add(Conv2D(24, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(36, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(48, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation="relu"))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation="relu"))
model.add(Flatten())
model.add(Dense(1164, activation="relu"))
model.add(Dense(100, activation="relu"))
model.add(Dense(50, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(1))
It mimics NVIDIA's network architecture and does not suffer from vanishing gradients.