
Why does a fully convolutional network plateau first and then learn?
I'm training a fully convolutional network to classify handwritten Chinese characters. The dev dataset I am using has 250 classes with 200-300 samples in each class.
I found that no matter how I tweak the model, all the variants I've tried so far show a similar behaviour: they plateau at first, and then the accuracies start to shoot up while the losses decrease, as shown in the screenshot below:
I would love to know more about the reasons behind this behaviour.
Thanks a lot!
Edit:
Sorry for not providing more details before.
My best-performing network so far is shown below, using an Adadelta optimizer with the learning rate at 0.1. The weights were initialised with Xavier initialisation.
import tensorflow as tf
from keras.layers import Input, Lambda, Conv2D, MaxPooling2D, AveragePooling2D, Flatten, Dense
from keras.models import Model
from keras.optimizers import Adadelta
from keras.losses import categorical_crossentropy

resize_size = 32  # following the FCN paper's 32x32 input size

inp = Input(shape=(30, 30, 1))
x = Lambda(
    lambda image: tf.image.resize_images(
        image, size=(resize_size, resize_size),
        method=tf.image.ResizeMethod.BILINEAR
    )
)(inp)
x = Conv2D(filters=96, kernel_size=(5, 5), strides=(1, 1), padding="same", activation="relu")(x)
x = Conv2D(filters=96, kernel_size=(1, 1), strides=(1, 1), padding="same", activation="relu")(x)
x = MaxPooling2D(pool_size=(3, 3), strides=2)(x)
x = Conv2D(filters=192, kernel_size=(5, 5), strides=(2, 2), padding="same", activation="relu")(x)
x = Conv2D(filters=192, kernel_size=(1, 1), strides=(1, 1), padding="same", activation="relu")(x)
x = MaxPooling2D(pool_size=(3, 3), strides=2)(x)
x = Conv2D(filters=192, kernel_size=(3, 3), padding="same", activation="relu")(x)
x = Conv2D(filters=192, kernel_size=(1, 1), padding="same", activation="relu")(x)
x = Conv2D(filters=10, kernel_size=(1, 1), padding="same", activation="relu")(x)
x = AveragePooling2D(pool_size=(3, 3))(x)
x = Flatten()(x)
x = Dense(250, activation="softmax")(x)

model = Model(inp, x)
model.compile(
    loss=categorical_crossentropy,
    optimizer=Adadelta(lr=0.1),
    metrics=["accuracy"],
)
As for the input data, they are all handwritten Chinese characters that I transformed into an MNIST-like format, with a size of 30x30x1 (the Lambda layer after the Input layer is there because I was following the original FCN paper, which used a 32x32 input size), as shown below:
And this is how the loss and accuracy charts above came about.
Hope this provides better intuition. Thanks.

We can't answer specifically, because you've neglected to identify your network and inputs sufficiently, let alone the training methods. To fully trace the high-level training characteristics, we'd need some detailed visualization of the kernels through the iterations in question.
In general, this is simply because a highly complex model usually needs a few iterations before it gets better than random results. We begin with random weights and kernels. In the first few iterations, the model has to work through the chaos, establish a few useful patterns in the early-level kernels, and find weights that correlate with enough output categories that the accuracy moves above 0.4% (chance level for 250 classes) with statistical significance.
Part of the problem is that, in those first few iterations, the model also stumbles across patterns that are useful in chaos, but actually harm long-term learning. For instance, it may build a pattern for black dots, and guess right that this correlates to mammal eyes and vehicle wheels. All too soon, that generalization, that an airplane and an Airedale are structurally related, turns out to be a wrong assumption. It has to break down the second-level correlations between those categories and find something else.
This is the sort of learning that keeps the accuracy low for longer than you might think. The model spends the first few iterations jumping to hundreds of conclusions about classifications, anything that correlates with one or two correct guesses. Then it has to learn enough to separate valid ones from invalid ones. That is where the model starts making advances it can retain.
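If you want to watch this happening in your own run, here is a minimal sketch (my own, using a Keras callback; "conv2d_1" is just a placeholder for the name of your first convolutional layer) that snapshots the first-layer kernels during training:
import numpy as np
from keras.callbacks import Callback

class KernelSnapshot(Callback):
    # stores copies of one layer's kernels every few batches
    def __init__(self, layer_name, every_n_batches=100):
        super().__init__()
        self.layer_name = layer_name
        self.every_n_batches = every_n_batches
        self.snapshots = []

    def on_batch_end(self, batch, logs=None):
        if batch % self.every_n_batches == 0:
            kernels = self.model.get_layer(self.layer_name).get_weights()[0]
            self.snapshots.append(np.copy(kernels))

# usage: model.fit(..., callbacks=[KernelSnapshot("conv2d_1")])
Plotting the stored snapshots side by side can show the kernels drifting from their random initial state towards more structured filters as the plateau ends.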

Related

Chess evaluation Neural Network is converging to the average

I'm currently working on a Chess AI.
The idea behind this project is to create a neural network that learns how to evaluate a board state and then traverse the next moves using Monte Carlo tree search to find the "best" move to play (evaluated by the NN).
Code on GitHub
TL;DR
The NN gets stuck predicting the average evaluation of the dataset and is thereby not learning to predict the evaluation of the board state.
Implementation
Dataset
The dataset is a collection of chess games. The games are fetched from the official lichess database.
Only games which have an evaluation score (which the NN is supposed to learn) are included.
This reduces the size of the dataset to about 11% of the original.
Data representation
Each move is a datapoint to train the network on.
The input to the NN is 12 arrays of size 8x8 (so-called bitboards), one for each of the 6x2 piece/colour combinations (a small sketch of this encoding is shown after the data-preparation function below).
The move evaluation is normalized to the range [-1, 1] using a scaled tanh function.
Since many evaluations are very close to 0 or to -1/1, a percentage of these is dropped as well, to reduce the imbalance in the dataset.
Without dropping some of the moves with evaluation close to 0 or -1/1 the dataset would look like this:
With dropping some, the dataset looks like this and is a lot less focused at one point:
The output of the NN is a single scalar value between -1 and 1, representing the evaluation of the board state. -1 meaning the board is heavily favored for the black player, 1 meaning the board is heavily favored for the white player.
from typing import Tuple

import numpy as np
from pandas import DataFrame


def create_training_data(dataset: DataFrame) -> Tuple[np.ndarray, np.ndarray]:
    def drop(indices, fract):
        # randomly drop a fraction of the rows at the given indices
        drop_index = np.random.choice(
            indices,
            size=int(len(indices) * fract),
            replace=False)
        dataset.drop(drop_index, inplace=True)

    # thin out extreme and near-zero evaluations to reduce the imbalance
    drop(dataset[abs(dataset[12] / 10.) > 30].index, fract=0.80)
    drop(dataset[abs(dataset[12] / 10.) < 0.1].index, fract=0.90)
    drop(dataset[abs(dataset[12] / 10.) < 0.15].index, fract=0.10)

    # the first 12 entries are the bitboards for the pieces;
    # column 12 holds the evaluation
    y = dataset[12].values
    X = dataset.drop(12, axis=1)

    # move into range of -1 to 1
    y = y.astype(np.float32)
    y = np.tanh(y / 10.)
    return X, y
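The bitboard encoding itself is not shown above; roughly, and assuming the python-chess library for board handling (a sketch, not the exact code from the repo), it looks like this:
import chess
import numpy as np

def board_to_planes(board):
    # one 8x8 plane per (piece type, colour) combination -> shape (12, 8, 8)
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece is None:
            continue
        # piece_type is 1..6 (pawn..king); black pieces go into planes 6..11
        index = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
        planes[index, chess.square_rank(square), chess.square_file(square)] = 1.0
    return planes

# e.g. the starting position has 8 pawns in the white-pawn plane:
print(board_to_planes(chess.Board())[0].sum())  # 8.0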
The neural network
The neural network is implemented using Keras.
A CNN is used to extract features from the board; these are then passed to a dense network that reduces them to an evaluation. This is based on the network AlphaGo Zero used in its implementation.
The CNN is implemented as follows:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Conv2D(256, (3, 3), activation='relu', padding='same', input_shape=(12, 8, 8, 1)))
for _ in range(10):
    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(BatchNormalization())
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(units=64, activation='relu'))
# model.add(Rescaling(scale=1 / 10., offset=0))  # required? Data gets scaled in create_training_data; does the network learn that, or does doing it explicitly help?
model.add(Dense(units=1, activation='tanh'))
model.compile(
    loss='mean_squared_error',
    optimizer=Adam(learning_rate=0.01),
    # metrics=['accuracy', 'mse']  # do these influence training at all?
)
Training
The training is done using Keras.
Multiple sets of 50k-500k moves are used to train the network.
The network is trained for 20 epochs on each move set with a batch size of 64, and 10% of the moves are used for validation.
Afterwards the learning rate is set to 0.001 / (index + 1).
import pandas as pd

for i, chunk in enumerate(pd.read_csv("../dataset/nm_games.csv", header=None, chunksize=100000)):
    X, y = create_training_data(chunk)
    model.fit(
        X,
        y,
        epochs=20,
        batch_size=64,
        validation_split=0.1
    )
    # decay the learning rate after each chunk
    model.optimizer.learning_rate = 0.001 / (i + 1)
Issues
The NN currently does not learn anything. It converges within a few epochs to the average evaluation of the dataset, and its predictions do not depend on the board state.
Example after 20 epochs:
Dataset Evaluation      NN Evaluation   Difference
-0.10164772719144821    0.03077016      0.13241789
 0.6967725157737732     0.03180310      0.66496944
-0.3644430935382843     0.03119821      0.39564130
 0.5291759967803955     0.03258476      0.49659124
-0.25989893078804016    0.03316733      0.29306626
The NN evaluation is stuck at roughly 0.03, the approximate average evaluation of the dataset, and it does not improve from there.
What I tried
Increased and decreased NN size
Added up to 20 extra Conv2D layers, since Google did that in their implementation as well
Removed all 10 extra Conv2D layers, since I read that many NNs are too complex for the dataset
Trained for days at a time
Since the NN is stuck at 0.03 and doesn't move from there, that was wasted.
Dense NN instead of CNN
Did not eliminate the point where the NN gets stuck, but trains faster (i.e. gets stuck faster :) )
model = Sequential()
model.add(Dense(2048, input_shape=(12 * 8 * 8,), activation='relu'))
model.add(Dense(2048, activation='relu'))
model.add(Dense(2048, activation='relu'))
model.add(Dense(1, activation='tanh'))
model.compile(
    loss='mean_squared_error',
    optimizer=Adam(learning_rate=0.001),
    # metrics=['accuracy', 'mse']
)
Sigmoid activation instead of tanh
Moves the evaluation from the range [-1, 1] to [0, 1], but otherwise did not change anything about getting stuck.
Epochs, batch size and chunk size increased and decreased
None of these changes significantly changed the NN evaluation.
Learning rate adaptation
Larger learning rates (0.1) made the NN unstable, converging to either -1, 1 or 0 on each training run.
Smaller learning rates (0.0001) made the NN converge more slowly, but it still got stuck at 0.03.
Code on GitHub
Question
What to do? Is there something I'm missing or is there an error?
My two suggestions:
Use the full dataset and score each position based on whether that player won the game or not. I don't know this dataset, and there might be something off with the evaluations provided by others (or are they verified?). Even if you are sure about their validity, I would test this, as it can provide more information on what the problem might be.
Check your data representation. You have probably already done this a couple of times, but I can tell you from experience that it is easy to introduce an error and overlook it. Adding a test might help you in the long run. Some of my own problems were:
Indication of the current player's colour: do you have a player-colour plane, or do you switch the current player's pieces?
Incorrect translation from 1D to 3D or vice versa (this should not prevent you from training, but catching it saves you a lot of time if you want to port to a different device).
I trained a Go game engine and do not know what representation is used for chess; it took me some time to figure out a good representation for checkers.
Not a solution, but I found that cyclic learning rates worked great for my Go engine; it might be something to look at once the rest works (a small sketch follows).
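For reference, a minimal sketch of a triangular cyclic schedule using Keras' LearningRateScheduler; the base_lr, max_lr and cycle_len values are only illustrative, not tuned for this problem:
import tensorflow as tf

def make_triangular_schedule(base_lr=1e-4, max_lr=1e-3, cycle_len=8):
    def schedule(epoch, lr):
        # position within the current cycle, mapped onto a triangle wave in [0, 1]
        pos = (epoch % cycle_len) / max(cycle_len - 1, 1)
        tri = 1.0 - abs(2.0 * pos - 1.0)
        return base_lr + (max_lr - base_lr) * tri
    return schedule

clr = tf.keras.callbacks.LearningRateScheduler(make_triangular_schedule())
# model.fit(X, y, epochs=20, batch_size=64, validation_split=0.1, callbacks=[clr])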

Wiggle in the initial part of an LSTM prediction

I am working on using LSTMs and GRUs to make time-series predictions. For the most part the predictions are pretty good.
However, there seems to be a wiggle (an initial up-then-down) before the prediction settles out, similar to the left side of this figure from another question.
In my case, it is also causing a slight offset. Does anyone have any idea why this might be the case? Below are the shapes of the training and test sets, as well as the current network structure. I've tried reducing the sequence lengths from 60 timesteps, switching between LSTMs and GRUs, and also having the inputs and outputs overlap slightly, all to no avail. Adding dropout does not seem to help either. The wiggles will not disappear!
My sequence lengths are 60 inputs and 60 outputs, currently.
Xtrain: (920, 60, 2)
Ytrain: (920, 60, 2)
Xtest: (920, 60, 2)
Ytest: (920, 60, 2)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, TimeDistributed, Dense, Activation

def define_model():
    model = Sequential()
    model.add(LSTM(64, return_sequences=True, input_shape=(None, 2)))
    model.add(LSTM(64, return_sequences=True))
    model.add(LSTM(64, return_sequences=True))
    model.add(TimeDistributed(Dense(2)))
    model.add(Activation('linear'))
    model.summary()
    return model
As they are "recurrent" networks, they have a direction. There is no memory at all for the first step, and the initial steps have only a little memory.
Memory is only built up after a few steps; only then is it possible to understand how the sequence is evolving.
The usual solution to this is to use a Bidirectional wrapper, so you have x units working from start to end and another x units working from end to start.
model.add(Bidirectional(LSTM(32, return_sequences = True), input_shape=...))
As for the offset, this is probably something with the data preprocessing.
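Applied to the model from the question, that would look roughly like this (a sketch; 32 units per direction so the total width per layer matches the original 64):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, TimeDistributed, Dense

def define_bidirectional_model():
    model = Sequential()
    # 32 units reading the sequence forwards + 32 reading it backwards, per layer
    model.add(Bidirectional(LSTM(32, return_sequences=True), input_shape=(None, 2)))
    model.add(Bidirectional(LSTM(32, return_sequences=True)))
    model.add(Bidirectional(LSTM(32, return_sequences=True)))
    model.add(TimeDistributed(Dense(2)))  # linear output for each timestep
    model.summary()
    return model
Note that this only makes sense because the whole input window is available before the prediction is made; a backwards-reading layer cannot be used for step-by-step forecasting.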

Find fundamental frequency of signal

I have 1000 datasets; each consists of 8000 signal amplitudes and a label, the fundamental frequency of that signal. What is the best approach to build a neural network to predict the fundamental frequency for a newly provided signal?
For example:
Fundamental freq: 75.88206932 Hz
Snippet of data:
-9.609272558949627507e-02
-4.778297441391140543e-01
-2.434520972570237696e-01
-1.567176020112603263e+00
-1.020037056101358752e+00
-1.129608807811322446e+00
4.303651786855859918e-01
-3.936956061582048694e-01
-1.224883726737033163e+00
-1.776803300708089672e+00
The model I've created (training set shape: (600, 8000, 1)):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv1D(filters=64, kernel_size=3, activation='tanh',
                 input_shape=(data.shape[1], data.shape[2])))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=64, kernel_size=3, activation='tanh'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=64, kernel_size=3, activation='tanh'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(500, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=["accuracy"])
But the model doesn't want to train. Accuracy ~ 0.0.
I do appreciate any advice.
"What is the best approach to build a neural network to predict fundamental frequency for newly provided signal?"
That is way too broad a question for SO, and consequently you should not really expect a sufficiently detailed, meaningful answer.
That said, there are certain issues with your code, and rectifying them will arguably move you a step closer to achieving your end goal.
So, you are making a very fundamental mistake:
Accuracy is suitable only for classification problems; for regression (i.e. numeric prediction) ones, such as yours, accuracy is meaningless.
What's more, Keras unfortunately will not "protect" you from such meaningless requests: you will not get any error, or even a warning, that you are attempting something that does not make sense, such as requesting the accuracy in a regression setting; see my answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)? for more details and a practical demonstration.
So, here your performance metric is actually the same as your loss, i.e. the Mean Squared Error (MSE); you should aim to make this quantity as small as possible on your validation set, and completely remove the metrics=['accuracy'] argument from your model compilation.
Additionally, nowadays we practically never use tanh activation for the hidden layers; you should try relu instead.
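In your case that boils down to something like this (a trimmed-down sketch, not your full architecture; only the activations and the compile call are the point):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(8000, 1)))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))
# regression: the loss itself is the metric; no metrics=['accuracy']
model.compile(loss='mean_squared_error', optimizer='adam')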
You might first FFT the data, either with or without a window, and then use the FFT magnitude vectors as ML training data vectors.
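A sketch of what that preprocessing could look like (the 8 kHz sample rate is an assumption; substitute whatever your recordings actually use):
import numpy as np

def fft_features(signals, sample_rate=8000.0):
    # signals: array of shape (n_examples, n_samples), e.g. (1000, 8000)
    window = np.hanning(signals.shape[1])                # optional Hann window
    spectra = np.abs(np.fft.rfft(signals * window, axis=1))
    freqs = np.fft.rfftfreq(signals.shape[1], d=1.0 / sample_rate)
    return spectra, freqs
The spectra (with an extra channel axis) can then replace the raw amplitudes as the network input, and freqs gives the frequency of each spectral bin.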

Intuition behind Stacking Multiple Conv2D Layers before Dropout in CNN

Background:
Tagging TensorFlow since Keras runs on top of it and this is more a general deep learning question.
I have been working on the Kaggle Digit Recognizer problem and used Keras to train CNN models for the task. The model below has the original CNN structure I used for this competition, and it performed okay.
from tensorflow.keras import layers, models

def build_model1():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), padding="Same", activation="relu", input_shape=[28, 28, 1]))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))
    return model
Then I read some other notebooks on Kaggle and borrowed another CNN structure (copied below), which works much better than the one above in that it achieved better accuracy, lower error rate, and took many more epochs before overfitting the training data.
def build_model2():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (5, 5), padding='Same', activation='relu', input_shape=(28, 28, 1)))
    model.add(layers.Conv2D(32, (5, 5), padding='Same', activation='relu'))
    model.add(layers.MaxPool2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding='Same', activation='relu'))
    model.add(layers.Conv2D(64, (3, 3), padding='Same', activation='relu'))
    model.add(layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))
    return model
Question:
Is there any intuition or explanation behind the better performance of the second CNN structure? What is it that makes stacking 2 Conv2D layers better than just using 1 Conv2D layer before max pooling and dropout? Or is there something else that contributes to the result of the second model?
Thank y'all for your time and help.
The main difference between these two approaches is that the latter (2 convs) has more flexibility in expressing non-linear transformations without losing information. Max pooling removes information from the signal and dropout forces a distributed representation, so both effectively make it harder to propagate information. If, for a given problem, a highly non-linear transformation has to be applied to the raw data, stacking multiple convs (with relu) makes it easier to learn; that's it. Also note that you are comparing a model with 3 max poolings to a model with only 2, so the second one will potentially lose less information. Another thing is that the second model has a much bigger fully connected part at the end, while the first one's is tiny (64 neurons with 0.5 dropout means that on average only about 32 neurons are active, which is a tiny layer!). To sum up:
These architectures differ in many aspects, not just in stacking conv layers.
Stacking conv layers usually leads to less information being lost in processing; see, for example, "all convolutional" architectures.
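As a rough side note (my own back-of-the-envelope numbers, not from the models above): two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with fewer weights and one extra non-linearity in between.
c = 64                              # channels in and out
single_5x5 = 5 * 5 * c * c          # 102400 weights (biases ignored)
two_3x3 = 2 * (3 * 3 * c * c)       # 73728 weights
print(single_5x5, two_3x3)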

How to perform max pooling on a 1-dimensional ConvNet (conv1d) in TensorFlow?

I'm training a convolutional neural network on text (at the character level) and I want to do max pooling. tf.nn.max_pool expects a rank-4 tensor, but 1-d convnets in TensorFlow are rank 3 ([batch, width, depth]), so when I pass the output of conv1d to the max-pool function, this is the error:
ValueError: Shape (1, 144, 512) must have rank 4
I'm new to tensorflow and deep learning frameworks in general and would like advice on the best practice here, because I can imagine there are multiple workarounds. How can I perform max-pooling in the 1-d case?
Thanks.
A quick way would be to add an extra singleton dimension, i.e. make the shape (1, 1, 144, 512); from there you can reduce it back with tf.squeeze.
I'm curious about other approaches though.
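For example, a sketch with the shapes from the question (the pooling window of 2 along the width is arbitrary):
import tensorflow as tf

# conv_out stands in for the rank-3 output of conv1d: [batch, width, depth]
conv_out = tf.zeros([1, 144, 512])
x = tf.expand_dims(conv_out, axis=1)          # -> [1, 1, 144, 512]
pooled = tf.nn.max_pool(x,
                        ksize=[1, 1, 2, 1],
                        strides=[1, 1, 2, 1],
                        padding='VALID')      # pool along the width axis
pooled = tf.squeeze(pooled, axis=1)           # back to rank 3: [1, 72, 512]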
