What layers should I use for Keras?

I am building a sample project in Keras. The project is to identify the difference between cats and dogs. I found an example online with the model as such:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation

model = Sequential()
# input_shape=(3, 150, 150) assumes channels-first ordering (3-channel 150x150 images)
model.add(Conv2D(32, (3, 3), input_shape=(3, 150, 150)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
My question is, how do people know which layers to use? Are there guidelines or rules of thumb for when to use Conv2D vs Conv1D vs another layer?

In short - they don't. Coming up with a good architecture is a major part of current deep learning research. There are some rules of thumb and intuitions, but mostly it comes down to experience, or to copying existing architectures that have been reported to work.
In very short words:
Convolutions are used when your data has spatial and/or temporal structure, such as images, videos, or sound.
Pooling has similar use cases to convolution; it still requires spatial and/or temporal structure (unless it is applied to a whole channel/dimension) and provides a way of removing "details" (usually noise) and reducing the dimensionality of the signal.
Recurrent layers are used when your data has a sequential character.
Fully connected layers are needed to "force" a given output dimension (thus they are often used as the last layer), or when you do not know of any structure that can be exploited, since they are pretty much the most generic layers.
However, how to compose these layers, which hyperparameters to use, and how many layers to use are huge open research questions. At the very beginning, the best approach is to copy someone else's architecture and gain some experience/intuition about what does and does not work for the data you are working with. A short sketch of these layer families follows below.
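As a rough, hedged illustration of those layer families (the shapes and layer sizes below are arbitrary examples, not taken from the question):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, LSTM

# Spatial structure: convolution + pooling on 150x150 RGB images (channels-last).
image_model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    MaxPooling2D(pool_size=(2, 2)),   # discards "details" and halves the spatial size
    Flatten(),
    Dense(1, activation='sigmoid'),   # fully connected layer forces the output dimension
])

# Sequential structure: a recurrent layer on sequences of 100 steps with 16 features each.
sequence_model = Sequential([
    LSTM(32, input_shape=(100, 16)),
    Dense(1, activation='sigmoid'),
])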

Related

What is best practice for which CNN fully-connected layers to keep when doing transfer-learning?

I can't seem to find a concrete answer to the question. I am currently doing transfer learning from a VGG19 network, and my target domain is document classification (either solely by visual classification or using CNN's feature extraction for another model).
I want to understand in which cases it is desirable to keep all fully connected layers of the model, and in which cases I should remove them and add a new fully connected layer on top of the last convolutional layer. What does each of these choices imply for training, prediction, etc.?
These are code examples using Keras of what I mean:
Building on top of the last fully connected layer (fc2):
from keras.applications import VGG19
from keras.layers import Dense, Dropout, BatchNormalization
from keras.models import Model
from keras import optimizers

def build_features_model(num_classes):   # wrapping function assumed; only its body was given
    original_model = VGG19(include_top=True, weights='imagenet', input_shape=(224, 224, 3))
    layer_name = 'fc2'
    x = Dropout(0.5)(original_model.get_layer(layer_name).output)
    x = BatchNormalization()(x)
    predictions = Dense(num_classes, activation='softmax')(x)
    features_model = Model(inputs=original_model.input, outputs=predictions)
    adam = optimizers.Adam(lr=0.001)
    features_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
    features_model.summary()
    return features_model
Adding one fully connected layer after the last convolutional layer:
from keras.layers import Flatten

def build_head_model(num_classes):   # wrapping function assumed; only its body was given
    base_model = VGG19(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
    x = Flatten()(base_model.output)
    x = Dense(4096, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = BatchNormalization()(x)
    predictions = Dense(num_classes, activation='softmax')(x)
    head_model = Model(inputs=base_model.input, outputs=predictions)
    adam = optimizers.Adam(lr=0.001)
    head_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
    head_model.summary()
    return head_model
Is there a rule of thumb for what to choose when doing transfer-learning?
Based on my past experience (successfully applying transfer learning from stock-market data to business forecasting), you should keep the original structure, because with transfer learning you want to load the weights trained with the original architecture without running into issues caused by differences in the network architecture. You then unfreeze parts of the CNN, so training starts from a high accuracy and adapts the weights to the target problem (a short sketch follows at the end of this answer).
However, if you remove the Flatten layer, the computational cost will decrease, as you will have fewer parameters to train.
I follow the rule of keeping neural nets as simple as possible (which tends to mean better generalization) while keeping efficiency high.
@Kamen, to complement your comment regarding how much data you will need: it depends on the variance of your data. The more variance, the more layers and weights you need to learn the details. However, as you increase the complexity of the architecture, your network becomes more prone to overfitting, which can be mitigated using Dropout, for instance.
Since fully connected layers are the most expensive part of a network, adding one or two of them increases the parameter count a lot, demanding more training time. With more layers you may get higher accuracy, but you also risk overfitting.
For instance, MNIST with 10,000 examples can reach an accuracy higher than 99% with a quite simple architecture, whereas ImageNet, with about 1,000,000 examples (155 GB), demands a more complex structure such as VGG16.
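A minimal sketch of the keep-the-structure-then-unfreeze approach described above, assuming build_features_model is the wrapper around the first snippet in the question; the function name, the number of classes and the choice of which block to unfreeze are my own illustration, not the answerer's:
from keras import optimizers

model = build_features_model(num_classes=10)          # assumed wrapper from the question; 10 classes is arbitrary

# Phase 1: freeze the pretrained VGG19 convolutional blocks and train only the new head.
for layer in model.layers:
    if layer.name.startswith('block'):                # VGG19 conv layers are named block1_*, ..., block5_*
        layer.trainable = False
model.compile(optimizer=optimizers.Adam(lr=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) on the document data goes here

# Phase 2: unfreeze the last convolutional block and fine-tune with a much smaller learning rate.
for layer in model.layers:
    if layer.name.startswith('block5'):
        layer.trainable = True
model.compile(optimizer=optimizers.Adam(lr=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) again to adapt the weights to the target problem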

Is a linear stack of layers equal to multilinear regression?

So for an application I'm making I'm using tf.keras.models.Sequential. I know that there are linear and multilinear regression models in machine learning. In the documentation of Sequential it is said that the model is a linear stack of layers. Is that equal to multilinear regression? The only explanation of "linear stack of layers" I could find was this question on Stack Overflow.
import numpy as np
import tensorflow as tf

def trainModel(bow, unitlabels, units):
    x_train = np.array(bow)
    print("X_train: ", x_train)
    y_train = np.array(unitlabels)
    print("Y_train: ", y_train)
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(256, activation=tf.nn.relu),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(len(units), activation=tf.nn.softmax)])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=50)
    return model
You are confusing two very important things here: the model itself and the structure of the model.
The structure of the model is indeed linear, because it follows a straight line (is straightforward) from beginning to end, one layer after the other.
The model itself is not linear: the relu activation is there to make sure that the learned function is not linear.
So the linear stack is neither a linear regression nor a multilinear one; "linear" here is not an ML term but plain English for straightforward.
Tell me if I misunderstood the question in any regard.
In the documentation of Sequential it is said that the model is a linear stack of layers. Is that equal to multilinear regression?
Assuming you mean a regression with multiple variables, no.
tf.keras.models.Sequential() defines how the layers in your model are connected; specifically, in this case it means they are fully connected (every output of one layer is connected as an input to every neuron in the next layer). The term linear is used to mean that there is no funny business going on, e.g. recurrence (connections going backwards) or residual connections (connections skipping layers).
For context, a regression with multiple variables is comparable to a single-layer network with a single neuron that has multiple inputs and no transfer function (no activation); see the sketch below.
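To make that comparison concrete, here is a minimal sketch (the synthetic data and the three-feature setup are my own example, not from the question): a multiple linear regression written as a single Dense neuron with no activation.
import numpy as np
import tensorflow as tf

n_features = 3                                            # arbitrary number of input variables
linreg = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation=None, input_shape=(n_features,))   # one neuron, identity transfer function
])
linreg.compile(optimizer='sgd', loss='mse')               # least-squares fit by gradient descent

# Synthetic data: y = 2*x1 - x2 + 0.5*x3 + noise
X = np.random.rand(1000, n_features)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * np.random.randn(1000)
linreg.fit(X, y, epochs=20, verbose=0)                    # learned weights approximate [2, -1, 0.5]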

Is deep learning bad at fitting simple non linear functions outside training scope (extrapolating)?

I am trying to create a simple deep-learning-based model to predict y = x**2.
But it looks like deep learning is not able to learn the general function outside the scope of its training set.
Intuitively I can think that a neural network might not be able to fit y = x**2, as there is no multiplication involved between the inputs.
Please note I am not asking how to create a model to fit x**2. I have already achieved that. I want to know the answers to the following questions:
1. Is my analysis correct?
2. If the answer to 1 is yes, then isn't the prediction scope of deep learning very limited?
3. Is there a better algorithm for predicting functions like y = x**2 both inside and outside the scope of the training data?
Path to complete notebook:
https://github.com/krishansubudhi/MyPracticeProjects/blob/master/KerasBasic-nonlinear.ipynb
training input:
import numpy as np

x = np.random.random((10000, 1)) * 1000 - 500   # 10,000 samples drawn from [-500, 500)
y = x**2
x_train = x
training code
from keras.models import Sequential
from keras import layers, regularizers, optimizers
from keras.callbacks import EarlyStopping

def getSequentialModel():
    model = Sequential()
    model.add(layers.Dense(8, kernel_regularizer=regularizers.l2(0.001),
                           activation='relu', input_shape=(1,)))
    model.add(layers.Dense(1))
    print(model.summary())
    return model

def runmodel(model):
    model.compile(optimizer=optimizers.RMSprop(lr=0.01), loss='mse')
    early_stopping_monitor = EarlyStopping(patience=5)
    h = model.fit(x_train, y, validation_split=0.2,
                  epochs=300,
                  batch_size=32,
                  verbose=False,
                  callbacks=[early_stopping_monitor])
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_18 (Dense)             (None, 8)                 16
_________________________________________________________________
dense_19 (Dense)             (None, 1)                 9
=================================================================
Total params: 25
Trainable params: 25
Non-trainable params: 0
_________________________________________________________________
Evaluation on a random test set:
Deep learning in this example is not good at predicting a simple non-linear function outside its training range, but it is good at predicting values within the sample space of the training data.
Is my analysis correct?
Given my remarks in the comments that your network is certainly not deep, let's accept that your analysis is indeed correct (after all, your model does seem to do a good job inside its training scope), in order to get to your 2nd question, which is the interesting one.
If the answer to 1 is yes, then isn't the prediction scope of deep learning very limited?
Well, this is the kind of question that is not exactly suitable for SO, since the exact meaning of "very limited" is arguably unclear...
So, let's try to rephrase it: should we expect DL models to predict such numerical functions outside the numeric domain on which they have been trained?
An example from a different domain may be enlightening here: suppose we have built a model able to detect & recognize animals in photos with very high accuracy (it is not hypothetical; such models do exist indeed); should we complain when the very same model cannot detect and recognize airplanes (or trees, refrigerators etc - you name it) in these same photos?
Put like that, the answer is a clear & obvious no - we should not complain, and in fact we are certainly not even surprised by such a behavior in the first place.
It is tempting for us humans to think that such models should be able to extrapolate, especially in the numeric domain, since this is something we do very "easily" ourselves; but ML models, while exceptionally good at interpolating, fail miserably at extrapolation tasks such as the one you present here.
Trying to make it more intuitive, think of the whole "world" of such models as confined to the domain of their training sets: my example model above would be able to generalize and recognize animals in unseen photos as long as these animals are "between" (mind the quotes) the ones it has seen during training; in a similar manner, your model does a good job predicting the function value for arguments between the samples you used for training. But in neither case are these models expected to go beyond their training domain (i.e. to extrapolate). There is no "world" for my example model beyond animals, and similarly for your model beyond [-500, 500]...
For corroboration, consider the very recent paper Neural Arithmetic Logic Units, by DeepMind; quoting from the abstract:
Neural networks can learn to represent and manipulate numerical information, but they seldom generalize well outside of the range of numerical values encountered during training.
See also a relevant tweet by a prominent practitioner.
On to your third question:
Is there a better algorithm for predicting functions like y = x**2 both inside and outside the scope of training data?
As should be clear by now, this is a (hot) area of current research; see the above paper for starters...
So, are DL models limited? Definitely - forget the scary tales about AGI for the foreseeable future. Are they very limited, as you put it? Well, I don't know... But, given their limitation in extrapolating, are they useful?
This is arguably the real question of interest, and the answer is obviously - hell, yeah!
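To see the interpolation/extrapolation gap concretely, here is a minimal sketch, assuming model is the small network from the question, already trained on x drawn from [-500, 500]:
import numpy as np

x_inside = np.random.random((1000, 1)) * 1000 - 500      # same range as the training data
x_outside = np.random.random((1000, 1)) * 1000 + 500     # [500, 1500), never seen during training

mse_inside = np.mean((model.predict(x_inside) - x_inside ** 2) ** 2)
mse_outside = np.mean((model.predict(x_outside) - x_outside ** 2) ** 2)
print(mse_inside, mse_outside)    # the out-of-range error is typically orders of magnitude larger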

Keras classification model

I need help building a Keras model for classification.
I have:
Input: 167 points of an optical spectrum.
Output: 11 classes of investigated substances.
But a single spectrum can contain several substances at once (for example, classes 2, 3 and 4).
I tried to use categorical_crossentropy, but it is suitable only for non-intersecting (mutually exclusive) classes.
KerasDoc:
Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all zeros except for a 1 at the index corresponding to the class of the sample). In order to convert integer targets into categorical targets, you can use the Keras utility to_categorical.
My code:
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(64, input_dim=167))
model.add(Dense(32))
model.add(Dense(11))
model.add(Activation('sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
I tried many models but can not get a good result.
You will probably do well with sigmoid and binary_crossentropy (see here).
PS: This is not your case, but with categorical_crossentropy you should ideally use a softmax activation; softmax produces outputs optimized to maximize one class only.
(If anyone would like to complement this answer with a good or better "optimizer", feel free).
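A minimal sketch of that suggestion applied to the question's model (the hidden-layer relu activations are my own addition, not part of the original code):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=167))
model.add(Dense(32, activation='relu'))
model.add(Dense(11, activation='sigmoid'))        # independent per-class probabilities
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Targets are multi-hot vectors, e.g. a spectrum containing classes 2, 3 and 4:
# y_sample = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]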

Overfitting after first epoch

I am using convolutional neural networks (via Keras) as my model for facial expression recognition (55 subjects). My data set is quite hard, with around 450k samples and 7 classes. I have balanced my training set per subject and per class label.
I implemented a very simple CNN architecture (with real-time data augmentation):
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU

# Keras 1 API; borderMode, initialization and nb_output are defined elsewhere in the original script
model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode=borderMode, init=initialization, input_shape=(48, 48, 3)))
model.add(BatchNormalization())
model.add(PReLU())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(256))
model.add(BatchNormalization())
model.add(PReLU())
model.add(Dropout(0.5))
model.add(Dense(nb_output))
model.add(Activation('softmax'))
After the first epoch, my training loss decreases constantly while the validation loss increases. Could overfitting happen that soon? Or is there a problem with my data being confusing? Should I also balance my testing set?
It could be that the task is easy to solve and after one epoch the model has learned enough to solve it, and training for more epochs just increases overfitting.
But if you have balanced the train set and not the test set, what may be happening is that you are training for one task (expression recognition on evenly distributed data) and then you are testing on a slightly different task, because the test set is not balanced.
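As a practical follow-up to the first point (a common technique, not something the answer itself proposes), one can stop training as soon as the validation loss stops improving while keeping the validation/test data in its natural, unbalanced distribution; X_train, y_train, X_val and y_val are assumed to be the prepared arrays and model the CNN above:
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2)   # stop shortly after val loss stops improving
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),                    # validation kept with its natural class balance
          epochs=30, batch_size=128,
          callbacks=[early_stop])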
