Caffe stuck at iteration 0 - machine-learning

Caffe stuck at iteration 0 - machine-learning

I'm implementing a domain-adaption project with CPU-only Caffe. It stuck at iteration 0 during the training process. This is what I get:
I0326 18:14:51.217656 9257 net.cpp:693] Ignoring source layer concat_data
I0326 18:14:51.218354 9257 net.cpp:693] Ignoring source layer slice_features_fc7
I0326 18:14:51.218359 9257 net.cpp:693] Ignoring source layer source_features_fc7_slice_features_fc7_0_split
I0326 18:14:51.218361 9257 net.cpp:693] Ignoring source layer target_features_fc7_slice_features_fc7_1_split
I0326 18:14:51.218364 9257 net.cpp:693] Ignoring source layer source_features_fc8_fc8_source_0_split
I0326 18:14:51.218365 9257 net.cpp:693] Ignoring source layer softmax_loss
I0326 18:14:51.218366 9257 net.cpp:693] Ignoring source layer fc8_target
I0326 18:14:51.218369 9257 net.cpp:693] Ignoring source layer mmd_loss_fc7
I0326 18:14:51.218369 9257 net.cpp:693] Ignoring source layer mmd_loss_fc8
I0326 18:17:06.733678 9257 solver.cpp:407] Test net output #0: lp_accuracy = 0.0301887
I0326 18:17:34.953090 9257 solver.cpp:231] Iteration 0, loss = 4.42734
I0326 18:17:34.953160 9257 solver.cpp:247] Train net output #0: fc7_mmd_loss = 0 (* 1 = 0 loss)
I0326 18:17:34.953181 9257 solver.cpp:247] Train net output #1: fc8_mmd_loss = 0 (* 1 = 0 loss)
I0326 18:17:34.953202 9257 solver.cpp:247] Train net output #2: softmax_loss = 4.42734 (* 1 = 4.42734 loss)
I0326 18:17:34.953223 9257 sgd_solver.cpp:106] Iteration 0, lr = 0.0003
System: Ubuntu 16.04
Command line:
./build/tools/caffe train -solver models/DAN/amazon_to_webcam/solver.prototxt -weights models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
Solver.prototxt:
net: "./models/DAN/amazon_to_webcam/train_val.prototxt"
test_iter: 795
test_interval: 300
base_lr: 0.0003
momentum: 0.9
lr_policy: "inv"
gamma: 0.002
power: 0.75
display: 100
max_iter: 50000
snapshot: 60000
snapshot_prefix: "./models/RTN/amazon_to_webcam/trained_model"
solver_mode: CPU
snapshot_after_train: false

Stuck at iteration 0 means that the training is waiting for input on a channel that successfully opened. (Failing to open the channel would produce an error message, at least a time-out problem.)
You need to debug your input flow. If nothing else, put some debugger breakpoints (or even print statements) to check that you're reaching critical parts of the flow.

Eventually, My question has nothing to do with input flow. It's just CPU mode is too slow to train the network. So if you have the same problem, it doesn't hurt to try on a GPU version Caffe. Problem closed.

Related

How to Design the Neural Network?

I was trying to make a deep learning prediction model for predicting whether a person is a CKD patient or not. Can you please tell me? How can I design a neural network for it? How many neurons should I add in each layer? Or is there any other method in Keras to do so? The dataset link: https://github.com/Samar-080301/Python_Project/blob/master/ckd_full.csv
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from sklearn.model_selection import train_test_split
import os
from matplotlib import pyplot as plt
os.chdir(r'C:\Users\samar\OneDrive\desktop\projects\Chronic_Kidney_Disease')
os.getcwd()
x=pd.read_csv('ckd_full.csv')
y=x[['class']]
y['class']=y['class'].replace(to_replace=(r'ckd',r'notckd'), value=(1,0))
x=x.drop(columns=['class'])
x['rbc']=x['rbc'].replace(to_replace=(r'normal',r'abnormal'), value=(1,0))
x['pcc']=x['pcc'].replace(to_replace=(r'present',r'notpresent'), value=(1,0))
x['ba']=x['ba'].replace(to_replace=(r'present',r'notpresent'), value=(1,0))
x['pc']=x['pc'].replace(to_replace=(r'normal',r'abnormal'), value=(1,0))
x['htn']=x['htn'].replace(to_replace=(r'yes',r'no'), value=(1,0))
x['dm']=x['dm'].replace(to_replace=(r'yes',r'no'), value=(1,0))
x['cad']=x['cad'].replace(to_replace=(r'yes',r'no'), value=(1,0))
x['pe']=x['pe'].replace(to_replace=(r'yes',r'no'), value=(1,0))
x['ane']=x['ane'].replace(to_replace=(r'yes',r'no'), value=(1,0))
x['appet']=x['appet'].replace(to_replace=(r'good',r'poor'), value=(1,0))
x[x=="?"]=np.nan
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.01)
#begin the model
model=keras.models.Sequential()
model.add(keras.layers.Dense(128,input_dim = 24, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(128,activation=tf.nn.relu)) # adding a layer with 128 nodes and relu activaation function
model.add(tf.keras.layers.Dense(128,activation=tf.nn.relu)) # adding a layer with 128 nodes and relu activaation function
model.add(tf.keras.layers.Dense(128,activation=tf.nn.relu)) # adding a layer with 128 nodes and relu activaation function
model.add(tf.keras.layers.Dense(128,activation=tf.nn.relu)) # adding a layer with 128 nodes and relu activaation function
model.add(tf.keras.layers.Dense(128,activation=tf.nn.relu)) # adding a layer with 128 nodes and relu activaation function
model.add(tf.keras.layers.Dense(2,activation=tf.nn.softmax)) # adding a layer with 2 nodes and softmax activaation function
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) # specifiying hyperparameters
model.fit(xtrain,ytrain,epochs=5) # load the model
model.save('Nephrologist') # save the model with a unique name
myModel=tf.keras.models.load_model('Nephrologist') # make an object of the model
prediction=myModel.predict((xtest))
C:\Users\samar\anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
if sys.path[0] == '':
Epoch 1/5
396/396 [==============================] - 0s 969us/sample - loss: nan - acc: 0.3561
Epoch 2/5
396/396 [==============================] - 0s 343us/sample - loss: nan - acc: 0.3763
Epoch 3/5
396/396 [==============================] - 0s 323us/sample - loss: nan - acc: 0.3763
Epoch 4/5
396/396 [==============================] - 0s 283us/sample - loss: nan - acc: 0.3763
Epoch 5/5
396/396 [==============================] - 0s 303us/sample - loss: nan - acc: 0.3763

Here is the structure that I achieved 100% test accuracy with:
model=keras.models.Sequential()
model.add(keras.layers.Dense(200,input_dim = 24, activation=tf.nn.tanh))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) # specifiying hyperparameters
xtrain_tensor = tf.convert_to_tensor(xtrain, dtype=tf.float32)
ytrain_tensor = tf.convert_to_tensor(ytrain, dtype=tf.float32)
model.fit(xtrain_tensor , ytrain_tensor , epochs=500, batch_size=128, validation_split = 0.15, shuffle=True, verbose=2) # load the model
results = model.evaluate(xtest, ytest, batch_size=128)
Output:
3/3 - 0s - loss: 0.2560 - accuracy: 0.9412 - val_loss: 0.2227 - val_accuracy: 0.9815
Epoch 500/500
3/3 - 0s - loss: 0.2225 - accuracy: 0.9673 - val_loss: 0.2224 - val_accuracy: 0.9815
1/1 [==============================] - 0s 0s/step - loss: 0.1871 - accuracy: 1.0000
The last line represents the evaluation of the model on the test dataset. Seems like it generalized well :)
------------------------------------------------- Original answer below ---------------------------------------------------
I would go with a logistic regression model first in order to see if there is any predictive value to your dataset.
model=keras.models.Sequential()
model.add(keras.layers.Dense(1,input_dim = 24, activation=tf.nn.sigmoid))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) # specifiying hyperparameters
model.fit(xtrain,ytrain,epochs=100) # Might require more or less epoches. It depends on the amount of noise in your dataset.
If you see you receive an accuracy score that satisfies you, I would give it a try and add 1 or 2 more dense hidden layers with between 10 to 40 nodes.
It's important to mention that my advice is solely based on my experience.
I HIGHLY(!!!!) recommend transforming the y_label into a binary value when 1 represents the positive class (a record is a record of a CKD patient) and 0 represents the negative class.
Let me know if it works, and if it doesn't I'll also try to play with your dataset.

apparently you seem to have problem with your data pre-processing
you can use
df.fillna('ffill')
and also you can use feature columns to do those long tasks example:
CATEGORICAL_COLUMNS = ['columns','which have','categorical data','like sex']
NUMERIC_COLUMNS = ['columns which have','numeric data']
feature_column =[]
for items in CATEGORICAL_COLUMNS:
feature_column.append( tf.feature_clolumns.categorical_columns_with_vocavulary_list(items, df[items].unique()))
for items in NUMERIC_COLUMNS:
feature_column.append( tf.feature_clolumns.numeric_columns(items, df[items].unique()))
now you can use these feature columns to make a prediction for your model which will be more accurate more can be done in data preprocessing here is the official documentation to help you more : tensorflow Documentation on feature columns

Allocator runs out of memory even on very low batch sizes

This problem never used to occur but since today Tensorflow always tries to allocate a huge amount of memory, even when using very small batch sizes.
I followed this tutorial:
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
"Using the bottleneck features of a pre-trained network: 90% accuracy in a minute"
This is my code:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dropout, Flatten, Dense
from keras import applications
img_width, img_height = 150, 150
top_model_weights_path = 'bottleneck_fc_model.h5'
train_data_dir = 'C:\\ImageData\\Augmented\\Train'
validation_data_dir = 'C:\\ImageData\\Augmented\\Validate'
#train_data_dir = 'C:\\Users\\NSA\\flower_photos\\Train'
#validation_data_dir = 'C:\\Users\\NSA\\flower_photos\\Validate'
nb_train_samples = 25
nb_validation_samples = 5
epochs = 10
my_batch_size = 10
def save_bottleneck_features():
datagen = ImageDataGenerator(rescale=1./255)
# build the VGG16 network
model = applications.VGG16(include_top=False, weights='imagenet')
generator = datagen.flow_from_directory(
train_data_dir,
target_size=(img_width, img_height),
batch_size=my_batch_size,
class_mode=None,
shuffle=False)
bottleneck_features_train = model.predict_generator(
generator,
steps=nb_train_samples // my_batch_size,
max_queue_size=10,
workers=1,
use_multiprocessing=False,
verbose=1)
np.save(open('bottleneck_features_train.npy', 'w'),
bottleneck_features_train)
generator = datagen.flow_from_directory(
validation_data_dir,
target_size=(img_width, img_height),
batch_size=my_batch_size,
class_mode=None,
shuffle=False)
bottleneck_features_validation = model.predict_generator(
generator, nb_validation_samples // my_batch_size)
np.save(open('bottleneck_features_validation.npy', 'w'),
bottleneck_features_validation)
def train_top_model():
train_data = np.load(open('bottleneck_features_train.npy'))
train_labels = np.array(
[0] * (nb_train_samples / 2) + [1] * (nb_train_samples / 2))
validation_data = np.load(open('bottleneck_features_validation.npy'))
validation_labels = np.array(
[0] * (nb_validation_samples / 2) + [1] * (nb_validation_samples / 2))
model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_data, train_labels,
epochs=epochs,
batch_size=my_batch_size,
validation_data=(validation_data, validation_labels))
model.save_weights(top_model_weights_path)
save_bottleneck_features()
train_top_model()
And this is the error I get:
PS C:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts> cd 'c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts'; ${env:PYTHONIOENCODING}='UTF-8'; ${env:PYTHONUNBUFFERED}='1'; & 'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\python.exe' 'C:\Users\NSA\.vscode\extensions\ms-python.python-2018.3.1\pythonFiles\PythonTools\visualstudio_py_launcher.py' 'c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts' '50490' '34806ad9-833a-4524-8cd6-18ca4aa74f14' 'RedirectOutput,RedirectOutput' 'c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts\first_try_real_transfer_learning_keras_vgg16.py'
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will
be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Bottleneck Features saven
2018-04-09 16:02:08.772206: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-04-09 16:02:09.345010: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1212] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:02:00.0
totalMemory: 2.00GiB freeMemory: 1.66GiB
2018-04-09 16:02:09.356147: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-09 16:02:10.108947: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1429 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:02:00.0, compute capability: 5.0)
Found 109 images belonging to 2 classes.
2018-04-09 16:02:16.979539: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-04-09 16:02:17.441196: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.19GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-04-09 16:02:17.792983: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-04-09 16:02:18.122577: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2/2 [==============================] - 4s 2s/step
Traceback (most recent call last):
File "c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts\first_try_real_transfer_learning_keras_vgg16.py", line 94, in <module>
save_bottleneck_features()
File "c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts\first_try_real_transfer_learning_keras_vgg16.py", line 56, in save_bottleneck_features
bottleneck_features_train)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\numpy\lib\npyio.py", line 511, in save
pickle_kwargs=pickle_kwargs)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\numpy\lib\format.py", line 565, in write_array
version)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\numpy\lib\format.py", line 335, in _write_array_header
fp.write(header_prefix)
TypeError: write() argument must be str, not bytes
PS C:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts>
The error occurs specifically when calling model.predict_generator()
At first I thought its running out of memory because my batch size is too large, but even when I use a batch size of 1 it requires over 2GiB of memory. I have installed CUDA 9.0, cuDNN 7.0, Tensorflow 1.6.0 and Keras 2.1.5 using TensorFlow backend. This used to work without issue but it suddenly started giving me this error. I'm using a NVIDIA GeForce 940MX

Your problem has nothing to do with memory or tensorflow. A file opened as text is being written bytes.
Instead of opening the file as text:
open('bottleneck_features_train.npy', 'w')
open it as bytes:
open('bottleneck_features_train.npy', 'wb')
This applies to all the calls to open you have.

Tensorflow: loss becomes 'NaN'

I was doing CIFAR-10 training on CPU with Tensorflow. During the first few rounds, the loss seemed alright. But after the step 10210 the loss varies and ends up becoming NaN.
My network model the CIFAR-10 CNN model from their website. Here is my setting,
image_size = 32
num_channels = 3
num_classes = 10
num_batches_to_run = 50000
batch_size = 128
eval_batch_size = 64
initial_learning_rate = 0.1
learning_rate_decay_factor = 0.1
num_epochs_per_decay = 350.0
moving_average_decay = 0.9999
and the result is shown as below.
2017-05-12 21:53:05.125242: step 10210, loss = 4.99 (124.9 examples/sec; 1.025 sec/batch)
2017-05-12 21:53:13.960001: step 10220, loss = 7.55 (139.5 examples/sec; 0.918 sec/batch)
2017-05-12 21:53:23.491228: step 10230, loss = 6.63 (149.5 examples/sec; 0.856 sec/batch)
2017-05-12 21:53:33.355805: step 10240, loss = 8.08 (113.3 examples/sec; 1.129 sec/batch)
2017-05-12 21:53:43.007007: step 10250, loss = 7.18 (126.7 examples/sec; 1.010 sec/batch)
2017-05-12 21:53:52.650118: step 10260, loss = 16.61 (138.0 examples/sec; 0.928 sec/batch)
2017-05-12 21:54:02.537279: step 10270, loss = 9.60 (137.6 examples/sec; 0.930 sec/batch)
2017-05-12 21:54:12.390117: step 10280, loss = 46526.25 (145.5 examples/sec; 0.880 sec/batch)
2017-05-12 21:54:22.060741: step 10290, loss = 133479743509972411931057146822656.00 (130.4 examples/sec; 0.982 sec/batch)
2017-05-12 21:54:31.691058: step 10300, loss = nan (115.8 examples/sec; 1.105 sec/batch)
Any idea about the NaN loss?

This happens a lot in practice when your learning rate is too high, I tend to start at 0.001 and move from there, 0.1 is on the very high side on most datasets, especially if you aren't dividing your loss by your batch size.

You can clip the gradients, if you are using Keras with Tensorflow backend, you could do as follows,
The parameters clipnorm and clipvalue can be used with all optimizers to control gradient clipping:
from keras import optimizers
# All parameter gradients will be clipped to
# a maximum norm of 1.
sgd = optimizers.SGD(lr=0.01, clipnorm=1.)
or
from keras import optimizers
# All parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
sgd = optimizers.SGD(lr=0.01, clipvalue=0.5)

You might have the cross entropy loss and take log(0). Just add a small constant within the log.
(you might also want to look into gradient clipping)

Keras RNN loss does not decrease over epoch

I built a RNN using Keras. The RNN is used to solve a regression problem:
def RNN_keras(feat_num, timestep_num=100):
model = Sequential()
model.add(BatchNormalization(input_shape=(timestep_num, feat_num)))
model.add(LSTM(input_shape=(timestep_num, feat_num), output_dim=512, activation='relu', return_sequences=True))
model.add(BatchNormalization())
model.add(LSTM(output_dim=128, activation='relu', return_sequences=True))
model.add(BatchNormalization())
model.add(TimeDistributed(Dense(output_dim=1, activation='relu'))) # sequence labeling
rmsprop = RMSprop(lr=0.00001, rho=0.9, epsilon=1e-08)
model.compile(loss='mean_squared_error',
optimizer=rmsprop,
metrics=['mean_squared_error'])
return model
The whole process looks fine. But the loss stays the exact same over epochs.
61267 in the training set
6808 in the test set
Building training input vectors ...
888 unique feature names
The length of each vector will be 888
Using TensorFlow backend.
Build model...
# Each batch has 1280 examples
# The training data are shuffled at the beginning of each epoch.
****** Iterating over each batch of the training data ******
Epoch 1/3 : Batch 1/48 | loss = 11011073.000000 | root_mean_squared_error = 3318.232910
Epoch 1/3 : Batch 2/48 | loss = 620.271667 | root_mean_squared_error = 24.904161
Epoch 1/3 : Batch 3/48 | loss = 620.068665 | root_mean_squared_error = 24.900017
......
Epoch 1/3 : Batch 47/48 | loss = 618.046448 | root_mean_squared_error = 24.859678
Epoch 1/3 : Batch 48/48 | loss = 652.977051 | root_mean_squared_error = 25.552946
****** Epoch 1: RMSD(training) = 24.897174
Epoch 2/3 : Batch 1/48 | loss = 607.372620 | root_mean_squared_error = 24.644049
Epoch 2/3 : Batch 2/48 | loss = 599.667786 | root_mean_squared_error = 24.487448
Epoch 2/3 : Batch 3/48 | loss = 621.368103 | root_mean_squared_error = 24.926300
......
Epoch 2/3 : Batch 47/48 | loss = 620.133667 | root_mean_squared_error = 24.901398
Epoch 2/3 : Batch 48/48 | loss = 639.971924 | root_mean_squared_error = 25.297264
****** Epoch 2: RMSD(training) = 24.897174
Epoch 3/3 : Batch 1/48 | loss = 651.519836 | root_mean_squared_error = 25.523636
Epoch 3/3 : Batch 2/48 | loss = 673.582581 | root_mean_squared_error = 25.952084
Epoch 3/3 : Batch 3/48 | loss = 613.930054 | root_mean_squared_error = 24.776562
......
Epoch 3/3 : Batch 47/48 | loss = 624.460327 | root_mean_squared_error = 24.988203
Epoch 3/3 : Batch 48/48 | loss = 629.544250 | root_mean_squared_error = 25.090448
****** Epoch 3: RMSD(training) = 24.897174
I do NOT think it is normal. Do I miss something?
UPDATE:
I find that all predictions are always zero after all epochs. This is the reason why all RMSDs are all the same because the predictions are all the same, i.e. 0. I checked the training y. It only contains just a few zeros. So it is not due to data imbalance.
So now I am thinking if it is because of the layers and activation that I am using.

Your RNN functions seems to be ok.
The speed of reduction in loss depends on optimizer and learning rate.
Any how you are using decay rate 0.9. try with bigger learning rate, any how it is going to decrease with 0.9 rate.
Try out other optimizers with different learning rates
Other optimizers available with keras: https://keras.io/optimizers/
Many times, some optimizers work well on some data sets while some may fails.

Have you tried changing activation function from relu to softmax?
Relu activation has the tendency to diverge. However, if initializing the weight with eigenmatrix may result in a better convergence.

Since you are using RNNs for regression problem (not for classification), you should use 'linear' activation at the last layer.
In your code,
model.add(TimeDistributed(Dense(output_dim=1, activation='relu'))) # sequence labeling
change to activation='linear' instead of 'relu'.
If it doesn't work, remove activation='relu' in second layer.
Also learning rate for rmsprop usually ranges from 0.1 to 0.0001.

How to calculate the number of parameters of convolutional neural networks?

I can't give the correct number of parameters of AlexNet or VGG Net.
For example, to calculate the number of parameters of a conv3-256 layer of VGG Net, the answer is 0.59M = (3*3)*(256*256), that is (kernel size) * (product of both number of channels in the joint layers), however in that way, I can't get the 138M parameters.
So could you please show me where is wrong with my calculation, or show me the right calculation procedure?

If you refer to VGG Net with 16-layer (table 1, column D) then 138M refers to the total number of parameters of this network, i.e including all convolutional layers, but also the fully connected ones.
Looking at the 3rd convolutional stage composed of 3 x conv3-256 layers:
the first one has N=128 input planes and F=256 output planes,
the two other ones have N=256 input planes and F=256 output planes.
The convolution kernel is 3x3 for each of these layers. In terms of parameters this gives:
128x3x3x256 (weights) + 256 (biases) = 295,168 parameters for the 1st one,
256x3x3x256 (weights) + 256 (biases) = 590,080 parameters for the two other ones.
As explained above you have to do that for all layers, but also the fully-connected ones, and sum these values to obtain the final 138M number.
-
UPDATE: the breakdown among layers give:
conv3-64 x 2 : 38,720
conv3-128 x 2 : 221,440
conv3-256 x 3 : 1,475,328
conv3-512 x 3 : 5,899,776
conv3-512 x 3 : 7,079,424
fc1 : 102,764,544
fc2 : 16,781,312
fc3 : 4,097,000
TOTAL : 138,357,544
In particular for the fully-connected layers (fc):
fc1 (x): (512x7x7)x4,096 (weights) + 4,096 (biases)
fc2 : 4,096x4,096 (weights) + 4,096 (biases)
fc3 : 4,096x1,000 (weights) + 1,000 (biases)
(x) see section 3.2 of the article: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).
Details about fc1
As precised above the spatial resolution right before feeding the fully-connected layers is 7x7 pixels. This is because this VGG Net uses spatial padding before convolutions, as detailed within section 2.1 of the paper:
[...] the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers.
With such a padding, and working with a 224x224 pixels input image, the resolution decreases as follow along the layers: 112x112, 56x56, 28x28, 14x14 and 7x7 after the last convolution/pooling stage which has 512 feature maps.
This gives a feature vector passed to fc1 with dimension: 512x7x7.

A great breakdown of the calculation for VGG-16 network is also given in CS231n lecture notes.
INPUT: [224x224x3] memory: 224*224*3=150K weights: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K weights: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K weights: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K weights: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K weights: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K weights: 0
FC: [1x1x4096] memory: 4096 weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 weights: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 weights: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

The below VGG-16 architechture is in the original paper as highlighted by #deltheil in (table 1, column D) , and I quote from there
2.1 ARCHITECTURE
During training, the input to our ConvNets is a fixed-size 224 × 224
RGB images. The only preprocessing we do is subtracting the mean RGB
value, computed on the training set, from each pixel.
The image is passed through a stack of convolutional (conv.) layers,
where we use filters with a very small receptive field: 3 × 3 (which
is the smallest size to capture the notion of left/right, up/down,
center). The convolution stride is fixed to 1 pixel; the spatial
padding of conv. layer input is such that the spatial resolution is
preserved after convolution, i.e. the padding is 1 pixel for 3 × 3
conv. layers. Spatial pooling is carried out by five max-pooling
layers, which follow some of the conv. layers (not all the conv.
layers are followed by max-pooling). Max-pooling is performed over a 2
× 2 pixel window, with stride 2.
A stack of convolutional layers (which has a different depth in
different architectures) is followed by three Fully-Connected (FC)
layers: the first two have 4096 channels each, the third performs
1000-way ILSVRC classification and thus contains 1000 channels (one
for each class).
The final layer is the soft-max layer.
Using the above, and
A formula to find activation shape of a layer!
A formula to calculate the weights corresponding to every layer:
Note:
you can simply multiply respective activation shape column to get the activation size
CONV3: means a filter of 3*3 will convolve on the input!
MAXPOOL3-2: means, 3rd pooling layer, with 2*2 filter, stride=2, padding=0(pretty standard in pooling layers)
Stage-3 : means it has multiple CONV layer stacked! with same padding=1, , stride=1, and filter 3*3
Cin : means the depth a.k.a channel coming from the input layer!
Cout: means the depth a.k.a channel outgoing (you configure it differently- to learn more complex features!),
Cin and Cout are the number of filters that you stack together to learn multiple features at different scales such as in the first layer you might want to learn vertical edges, and horizontal edges and edges at say 45degree, blah blah!, 64 possible different filters each of different kind of edges!!
n: input dimension without depth such n=224 in case of INPUT-image!
p: padding for each layer
s: stride used for each layer
f: filter size i.e 3*3 for CONV and 2*2 for MAXPOOL layers!
After MAXPOOL5-2, you simply flatten the volume and interface it with the first FC layer.!
We get the table:
Finally, if you add all the weights calculated in the last column, you end up with 138,357,544(138 million) parameters to train for VGG-15!

Here is how to compute the number of parameters in each cnn layer:
some definitions
n--width of filter
m--height of filter
k--number of input feature maps
L--number of output feature maps
Then number of paramters #= (n*m *k+1)*L in which the first contribution is from
weights and the second is from bias.

I know this is a old post nevertheless, I think the accepted answer by #deltheil contains a mistake. If not, I would be happy to be corrected. The convolution layer should not have bias.
i.e.
128x3x3x256 (weights) + 256 (biases) = 295,168
should be
128x3x3x256 (weights) = 294,9112
Thanks

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart