Keras LSTM Time Series - machine-learning

I have a problem and at this point I'm completely lost as to how to solve it. I'm using Keras with an LSTM layer to project a time series. I'm trying to use the previous 10 data points to predict the 11th.
Here's the code:
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM

def _load_data(data):
    """
    data should be pd.DataFrame()
    """
    n_prev = 10
    docX, docY = [], []
    for i in range(len(data) - n_prev):
        docX.append(data.iloc[i:i+n_prev].as_matrix())
        docY.append(data.iloc[i+n_prev].as_matrix())
    if not docX:
        pass
    else:
        alsX = np.array(docX)
        alsY = np.array(docY)
        return alsX, alsY
X, y = _load_data(df_test)
X_train = X[:25]
X_test = X[25:]
y_train = y[:25]
y_test = y[25:]
in_out_neurons = 2
hidden_neurons = 300
model = Sequential()
model.add(LSTM(in_out_neurons, hidden_neurons, return_sequences=False))
model.add(Dense(hidden_neurons, in_out_neurons))
model.add(Activation("linear"))
model.compile(loss="mean_squared_error", optimizer="rmsprop")
model.fit(X_train, y_train, nb_epoch=10, validation_split=0.05)
predicted = model.predict(X_test)
So I'm taking the input data (a two-column dataframe) and creating X, an n by 10 by 2 array, and y, an n by 2 array that is one step ahead of the last row in each window of X (i.e. labeling each window with the point directly after it).
predicted is returning
[[ 7.56940445, 5.61719704],
[ 7.57328357, 5.62709032],
[ 7.56728049, 5.61216415],
[ 7.55060187, 5.60573629],
[ 7.56717342, 5.61548522],
[ 7.55866942, 5.59696181],
[ 7.57325984, 5.63150951]]
but I should be getting
[[ 73, 48],
[ 74, 42],
[ 91, 51],
[102, 64],
[109, 63],
[ 93, 65],
[ 92, 58]]
The original data set only has 42 rows, so I'm wondering if there just isn't enough there to work with? Or am I missing a key step in the modeling process maybe? I've seen some examples using Embedding layers etc, is that something I should be looking at?
Thanks in advance for any help!

Hey Ryan!
I know it's late, but I just came across your question; I hope it's not too late and that you still find some useful knowledge here.
First of all, Stack Overflow may not be the best place for this kind of question. The first reason is that yours is a conceptual question, which is not this site's purpose; moreover, your code runs, so it's not even a matter of general programming. Have a look at Stats (Cross Validated).
Second, from what I see there is no conceptual error. You're using everything necessary, that is:
an LSTM with proper dimensions
return_sequences=False just before your Dense layer
a linear activation for your output
an MSE cost/loss/objective function
Third, I however find it extremely unlikely that your network learns anything with so little data. You have to understand that you have fewer data points than parameters here! For the great majority of supervised learning algorithms, the first thing you need is not a good model, it's good data. You cannot learn from so few examples, especially not with a model as complex as an LSTM network.
Fourth, it seems like your target data is made of relatively high values. A first pre-processing step here could be to standardize the data: center it around zero (that is, subtract its mean) and rescale it by its standard deviation. This really helps learning!
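For instance, a minimal sketch of that standardization, assuming the X_train/X_test/y_train/y_test arrays built above (the statistics are computed on the training windows only, so no test information leaks in):
import numpy as np
feat_mean = X_train.mean(axis=(0, 1))  # per-feature mean over samples and timesteps
feat_std = X_train.std(axis=(0, 1))    # per-feature standard deviation
X_train_std = (X_train - feat_mean) / feat_std
X_test_std = (X_test - feat_mean) / feat_std
y_train_std = (y_train - feat_mean) / feat_std  # the targets share the same two features
y_test_std = (y_test - feat_mean) / feat_std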
Fifth, in general here are a few things you should look into to improve learning and reduce overfitting (a small sketch combining some of them follows this list):
Dropout
Batch Normalization
Other optimizers (such as Adam)
Gradient clipping
Random hyperparameter search
(This is not exhaustive; if you're reading this and think something should be added, comment it so it's useful for future readers!)
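For example, here is a rough sketch of how a few of those could be combined for the original problem (this uses a newer Keras layer/optimizer API than the question's code, and the layer sizes are arbitrary):
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.optimizers import Adam
model = Sequential()
model.add(LSTM(50, input_shape=(10, 2)))  # 10 past timesteps, 2 features
model.add(Dropout(0.2))                   # dropout to reduce overfitting
model.add(Dense(2, activation="linear"))  # predict the next (x, y) point
# Adam with gradient clipping (clipnorm caps the gradient norm)
model.compile(loss="mean_squared_error", optimizer=Adam(clipnorm=1.0))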
Last but NOT least, I suggest you look at this tutorial on GitHub, especially the recurrent tutorial for time series with Keras.
PS: Daniel Hnyk updated his post ;)

Related

Accuracy score in K-nearest Neighbour Classifier not matching with GridSearchCV

I'm learning Machine Learning and I'm facing a mismatch I can't explain.
I have a grid to compute the best model, according to the accuracy returned by GridSearchCV.
import sklearn.neighbors
import sklearn.model_selection
import sklearn.metrics

model=sklearn.neighbors.KNeighborsClassifier()
n_neighbors=[3, 4, 5, 6, 7, 8, 9]
weights=['uniform','distance']
algorithm=['auto','ball_tree','kd_tree','brute']
leaf_size=[20,30,40,50]
p=[1]
param_grid = dict(n_neighbors=n_neighbors, weights=weights, algorithm=algorithm, leaf_size=leaf_size, p=p)
grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=param_grid, cv = 5, n_jobs=1)
SGDgrid = grid.fit(data1, targetd_simp['VALUES'])
print("SGD Classifier: ")
print("Best: ")
print(SGDgrid.best_score_)
value=SGDgrid.best_score_
print("params:")
print(SGDgrid.best_params_)
print("Best estimator:")
print(SGDgrid.best_estimator_)
y_pred_train=SGDgrid.best_estimator_.predict(data1)
print(sklearn.metrics.confusion_matrix(targetd_simp['VALUES'],y_pred_train))
print(sklearn.metrics.accuracy_score(targetd_simp['VALUES'],y_pred_train))
The results I get are the following:
SGD Classifier:
Best:
0.38694539229180525
params:
{'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 8, 'p': 1, 'weights': 'distance'}
Best estimator:
KNeighborsClassifier(leaf_size=20, n_neighbors=8, p=1, weights='distance')
[[4962 0 0]
[ 0 4802 0]
[ 0 0 4853]]
1.0
This model is probably highly overfitted. I still have to check that, but it's not the matter in question here.
So, basically, if I understand correctly, GridSearchCV is finding a best accuracy score of 0.3869 (quite poor) across the cross-validation folds, but the final confusion matrix is perfect, as is the accuracy computed from it. It doesn't make much sense to me... How is a model that is, in theory, bad performing so well?
I also added scoring = 'accuracy' in GridSearchCV to be sure that the returned value is actually accuracy, and it returns exactly the same value.
What am I missing here?
The behavior you are describing is rather normal and to be expected. You should know that GridSearchCV has a parameter refit which is set to True by default. It triggers the following:
Refit an estimator using the best found parameters on the whole dataset.
This means that the estimator returned by best_estimator_ has been refit on your whole dataset (data1 in your case). It is therefore data that the estimator has already seen during training and, expectedly, performs especially well on it. You can easily reproduce this with the following example:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [3, 4, 5]})
search.fit(X_train, y_train)
print(search.best_score_)
>>> 0.8533333333333333
print(accuracy_score(y_train, search.predict(X_train)))
>>> 0.9066666666666666
While this is not as impressive as in your case, it is still a clear result. During cross-validation, the model is validated against one fold that was not used for training the model, and thus, against data the model has not seen before. In the second case, however, the model already saw all data during training and it is to be expected that the model will perform better on them.
To get a better feeling of the true model performance, you should use a holdout set with data the model has not seen before:
print(accuracy_score(y_test, search.predict(X_test)))
>>> 0.76
As you can see, the model performs considerably worse on this data and shows us that the former metrics were all a bit too optimistic. The model did in fact not generalize that well.
In conclusion, your result is not surprising and has an easy explanation. The high discrepancy in scores is impressive but still follows the same logic and is actually just a clear indicator of overfitting.
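If you want a confusion matrix that is not inflated by the refit, one option (a sketch reusing the toy example above) is sklearn's cross_val_predict, which collects out-of-fold predictions so that every sample is predicted by a model that never saw it during training:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
# Out-of-fold predictions with the best hyperparameters found by the search.
y_oof = cross_val_predict(search.best_estimator_, X_train, y_train, cv=5)
print(confusion_matrix(y_train, y_oof))
print(accuracy_score(y_train, y_oof))  # typically closer to best_score_ than to the refit score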

Layers for predicting financial data using Tensorflow/tflearn

I'd like to predict the interest rate, and I've got some relevant factors like a stock index and the money supply number, something like that. The number of factors may be up to 200.
For example, the training data looks like the following: X contains the factors and y is the interest rate I want to train on and predict.
factor1 factor2 factor3 ... factor176 factor177 factor178
X= [[ 2.1428 6.1557 5.4101 ..., 5.86 6.0735 6.191 ]
[ 2.168 6.1533 5.2315 ..., 5.8185 6.0591 6.189 ]
[ 2.125 4.7965 3.9443 ..., 5.7845 5.9873 6.1283]...]
y= [[ 3.5593]
[ 3.014 ]
[ 2.7125]...]
So I want to use tensorflow/tflearn to train this model, but I don't really know exactly what method I should choose for the regression. I have tried LinearRegression from tflearn before, but the result was not so great.
For now, I just use the code I found online.
import tflearn

net = tflearn.input_data([None, 178])
net = tflearn.fully_connected(net, 64, activation='linear',
                              weight_decay=0.0005)
net = tflearn.fully_connected(net, 1, activation='linear')
net = tflearn.regression(net,
                         optimizer=tflearn.optimizers.AdaGrad(learning_rate=0.01,
                                                              initial_accumulator_value=0.01),
                         loss='mean_square', learning_rate=0.05)
model = tflearn.DNN(net, tensorboard_verbose=0, checkpoint_path='tmp/')
model.fit(X, y, show_metric=True,
          batch_size=1, n_epoch=100)
The result is roughly 50% accuracy when the error range is ±10%.
I have tried making the window 7 days, but the result is still bad. So I want to know what additional layers I could use to make this network better.
First of all, this network makes no sense. If you do not have any activations on your hidden units, your network is equivalent to linear regression.
So first of all change
net = tflearn.fully_connected(net, 64, activation='linear',
                              weight_decay=0.0005)
to
net = tflearn.fully_connected(net, 64, activation='relu',
                              weight_decay=0.0005)
Another general thing is to always normalise your data. Your X's are big and the y's are big as well; make sure they aren't, for example by whitening them (making them zero mean and unit standard deviation).
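A minimal sketch of that normalization step, using sklearn's StandardScaler (in a real setup you would fit the scalers on the training portion only):
import numpy as np
from sklearn.preprocessing import StandardScaler
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X = x_scaler.fit_transform(X)              # each factor: zero mean, unit variance
y = y_scaler.fit_transform(np.asarray(y))  # y is already shaped (n_samples, 1)
# After predicting, map the outputs back to the original scale:
# y_pred = y_scaler.inverse_transform(model.predict(X_new))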
Finding the right architecture is a hard problem and you will not find any "magical recipes" for it. Start by understanding what you are doing. Log your training and check whether the training loss converges to small values; if it does not, you either are not training long enough, the network is too small, or the training hyperparameters are off (e.g. too large a learning rate, too much regularisation, etc.).

neural network produces similar pattern for all inputs

I am attempting to train an ANN on time series data in Keras. I have three vectors of data that are broken into scrolling window sequences (i.e. for vector l).
np.array([l[i:i+window_size] for i in range( len(l) - window_size)])
The target vector is similarly windowed so the neural net output is a prediction of the target vector for the next window_size number of time steps. All the data is normalized with a min-max scaler. It is fed into the neural network as a shape=(nb_samples, window_size, 3). Here is a plot of the 3 input vectors.
The only output I've managed to muster from the ANN is the following plot. Target vector in blue, predictions in red (plot is zoomed in to make the prediction pattern legible). Prediction vectors are plotted at window_size intervals so each one of the repeated patterns is one prediction from the net.
I've tried many different model architectures, number of epochs, activation functions, short and fat networks, skinny, tall. This is my current one (it's a little out there).
Conv1D(64,4, input_shape=(None,3)) ->
Conv1d(32,4) ->
Dropout(24) ->
LSTM(32) ->
Dense(window_size)
But nothing I try will stop the neural net from outputting this repeated pattern. I must be misunderstanding something about time series or LSTMs in Keras, but I'm very lost at this point, so any help is greatly appreciated. I've attached the full code at this repository.
https://github.com/jaybutera/dat-toy
I played with your code a little and I think I have a few suggestions for getting you on the right track. The code doesn't seem to match your graphs exactly, but I assume you've tweaked it a bit since then. Anyway, there are two main problems:
The biggest problem is in your data preparation step. You basically have the data shapes backwards, in that you have a single timestep of input for X and a timeseries for Y. Your input shape is (18830, 1, 8), when what you really want is (18830, 30, 8) so that the full 30 timesteps are fed into the LSTM. Otherwise the LSTM is only operating on one timestep and isn't really useful. To fix this, I changed the line in common.py from
X = X.reshape(X.shape[0], 1, X.shape[1])
to
X = windowfy(X, winsize)
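For reference, a windowing helper of that kind typically looks something like the sketch below; the actual windowfy in the linked repository may differ in details such as how it handles the last windows:
import numpy as np
def windowfy(X, winsize):
    # Stack consecutive rows into overlapping windows of length winsize,
    # turning (n_samples, n_features) into (n_samples - winsize + 1, winsize, n_features).
    return np.array([X[i:i + winsize] for i in range(len(X) - winsize + 1)])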
Similarly, the output data should probably be only 1 value, from what I've gathered of your goals from the plotting function. There are certainly some situations where you want to predict a whole timeseries, but I don't know if that's what you want in this case. I changed Y_train to use fuels instead of fuels_w so that it only had to predict one step of the timeseries.
Training for 100 epochs might be way too much for this simple network architecture. In some cases when I ran it, it looked like there was some overfitting going on. Observing the decrease of loss in the network, it seems like maybe only 3-4 epochs are needed.
Here is the graph of predictions after 3 training epochs with the adjustments I mentioned. It's not a great prediction, but it looks like it's on the right track now at least. Good luck to you!
EDIT: Example predicting multiple output timesteps:
from sklearn import datasets, preprocessing
import numpy as np
from scipy import stats
from keras import models, layers
INPUT_WINDOW = 10
OUTPUT_WINDOW = 5 # Predict 5 steps of the output variable.
# Randomly generate some regression data (not true sequential data; samples are independent).
np.random.seed(11798)
X, y = datasets.make_regression(n_samples=1000, n_features=4, noise=.1)
# Rescale 0-1 and convert into windowed sequences.
X = preprocessing.MinMaxScaler().fit_transform(X)
y = preprocessing.MinMaxScaler().fit_transform(y.reshape(-1, 1))
X = np.array([X[i:i + INPUT_WINDOW] for i in range(len(X) - INPUT_WINDOW)])
y = np.array([y[i:i + OUTPUT_WINDOW] for i in range(INPUT_WINDOW - OUTPUT_WINDOW,
                                                    len(y) - OUTPUT_WINDOW)])
print(np.shape(X))  # (990, 10, 4) - Ten timesteps of four features
print(np.shape(y))  # (990, 5, 1) - Five timesteps of one feature
# Construct a simple model predicting output sequences.
m = models.Sequential()
m.add(layers.LSTM(20, activation='relu', return_sequences=True, input_shape=(INPUT_WINDOW, 4)))
m.add(layers.LSTM(20, activation='relu'))
m.add(layers.RepeatVector(OUTPUT_WINDOW))
m.add(layers.LSTM(20, activation='relu', return_sequences=True))
m.add(layers.wrappers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
print(m.summary())
m.compile(optimizer='adam', loss='mse')
m.fit(X[:800], y[:800], batch_size=10, epochs=60) # Train on first 800 sequences.
preds = m.predict(X[800:], batch_size=10) # Predict the remaining sequences.
print('Prediction:\n' + str(preds[0]))
print('Actual:\n' + str(y[800]))
# Correlation should be around r = .98, essentially perfect.
print('Correlation: ' + str(stats.pearsonr(y[800:].flatten(), preds.flatten())[0]))

Why does my NN not classify these tic-tac-toe patterns correctly?

I'm trying to teach an AI to recognize patterns of tic-tac-toe with a winning line.
Unfortunately, it's not learning to recognize them correctly. I think my way of representing/encoding the game into vectors is wrong.
I chose a way that is easy for a human (me, in particular!) to understand:
training_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[0,0,1,
0,1,0,
0,0,1],
[0,0,1,
0,1,0,
1,0,0],
[0,1,0,
0,1,0,
0,1,0]], "float32")
target_data = np.array([[0],[0],[1],[1]], "float32")
This uses an array of length 9 to represent a 3 x 3 board. The first three items represent the first row, the next three the second row, and so on. The line breaks should make it obvious. The target data then maps the first two game states to "no wins" and the last two game states to "wins".
Then I wanted to create some validation data that is slightly different to see if it generalizes.
validation_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[1,0,0,
0,1,0,
1,0,0],
[1,0,0,
0,1,0,
0,0,1],
[0,0,1,
0,0,1,
0,0,1]], "float32")
Obviously, again the last two game states should be "wins" whereas the first two should not.
I tried to play with the number of neurons and learning rate, but no matter what I try, my output looks pretty off, e.g.
[[ 0.01207292]
[ 0.98913926]
[ 0.00925775]
[ 0.00577191]]
I tend to think it's the way I represent the game state that may be wrong, but actually I have no idea :D
Can anyone help me out here?
This is the entire code that I use
import numpy as np
from keras.models import Sequential
from keras.layers.core import Activation, Dense
from keras.optimizers import SGD
training_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[0,0,1,
0,1,0,
0,0,1],
[0,0,1,
0,1,0,
1,0,0],
[0,1,0,
0,1,0,
0,1,0]], "float32")
target_data = np.array([[0],[0],[1],[1]], "float32")
validation_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[1,0,0,
0,1,0,
1,0,0],
[1,0,0,
0,1,0,
0,0,1],
[0,0,1,
0,0,1,
0,0,1]], "float32")
model = Sequential()
model.add(Dense(2, input_dim=9, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)
history = model.fit(training_data, target_data, nb_epoch=10000, batch_size=4, verbose=0)
print(model.predict(validation_data))
UPDATE
I tried to follow the advice and used more training data with no success so far.
My training set looks like this now
training_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[0,0,1,
0,0,0,
1,0,0],
[0,0,1,
0,1,0,
0,0,1],
[1,0,1,
0,1,0,
0,0,0],
[0,0,0,
0,1,0,
1,0,1],
[1,0,0,
0,0,0,
0,0,0],
[0,0,0,
0,0,0,
1,0,0],
[0,0,0,
0,1,0,
0,0,1],
[1,0,1,
0,0,0,
0,0,0],
[0,0,0,
0,0,0,
0,0,1],
[1,1,0,
0,0,0,
0,0,0],
[0,0,0,
1,0,0,
1,0,0],
[0,0,0,
1,1,0,
0,0,0],
[0,0,0,
0,0,1,
0,0,1],
[0,0,0,
0,0,0,
0,1,1],
[1,0,0,
1,0,0,
1,0,0],
[1,1,1,
0,0,0,
0,0,0],
[0,0,0,
0,0,0,
1,1,1],
[0,0,1,
0,1,0,
1,0,0],
[0,1,0,
0,1,0,
0,1,0]], "float32")
target_data = np.array([[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1]], "float32")
Considering that I only count patterns of 1s as wins, there are only 8 different win states for the way I represent the data. I made the NN see 5 of them, so that I still have 3 to test against to see if the generalization works. I'm now feeding it 15 states that it should not consider a win.
However, the outcome for my validation seems to actually get worse.
[[ 1.06987642e-07]
[ 4.72647212e-02]
[ 1.97011139e-03]
[ 2.93282426e-07]]
Things I tried:
Changing from sigmoid to softmax
Adding more neurons
Adding more layers
A mix of all of the above
I see your problem immediately: your training set is far too small. Your problem space consists of the 512 corners of a 9-dimensional hypercube. Your training colours two of the corners green and two others red. You now somehow expect the trained model to have correctly intuited the proper colourings for the remaining 508 corners.
No general-purpose machine-learning algorithm will intuit the pattern of "does this board position contain any of the eight approved sequences of three evenly-spaced '1' values?" from only two positive and two negative examples. For one thing, note that your training data has no row wins, does not exclude evenly-spaced points that aren't a win, and ... well, many other patterns in the space.
I expect that you'll need at least two dozen well-chosen examples on each side of the classification to get any appreciable performance from your model. Think in terms of test cases: bits 1-2-3 make a win, but 3-4-5 does not; 3-5-7 make a win, but 1-3-5 and 2-4-6 do not.
Does this move you toward a solution?
One thing you might try is to generate random vectors and then classify them with a subroutine; feed these as training data. Do more for testing and validation data.
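A rough sketch of that idea, sticking with the question's representation (boards are flat length-9 vectors and only lines of 1s count as wins); the helper name and sample count are made up for illustration:
import numpy as np
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
             (0, 4, 8), (2, 4, 6)]             # diagonals
def is_win(board):
    # 1 if any winning line is completely filled with 1s, else 0
    return int(any(all(board[i] == 1 for i in line) for line in WIN_LINES))
rng = np.random.RandomState(0)
boards = rng.randint(0, 2, size=(2000, 9)).astype("float32")  # random board vectors
labels = np.array([[is_win(b)] for b in boards], dtype="float32")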
What Prune said makes a lot of sense. Given that your problem space is 138 terminal board positions (and that's excluding rotations and reflections! - see wiki) it is very unlikely that the learning algorithm can sufficiently adjust the weights and biases, just by training on a 4-entry data set. I had a similar experience in one of my "learning experiments", where, even though the net was trained on the complete data set, because the set was very small, I ended up having to train it over multiple epochs until it was able to output decent predictions.
I think what's important to remember here is that what training a FF neural net ultimately does is to fine-tune weights and biases so that the loss function is minimised as much as possible. The lower the loss, the closer the predictions get to the expected outputs and the better the neural net gets. This means the more training data the merrier :)
I found this complete training set for tic-tac-toe; it's not in the format that you set out with, but who knows, perhaps it will be useful for you. I would be curious to know what the minimum subset of that training set would be for the net to start making reliable predictions :P
This is an interesting problem. I think you're really wanting your system to recognize "lines", but as others have said, with so little training data it's hard for the system to generalize.
A different and counterintuitive approach might be to start with a larger board, say, 10x10, not 3x3, and generate random lines in that space and try to make the model learn them. You might explore convolutional networks in that case. This would be a lot like the handwritten digit recognition problem, and I expect it would succeed easily. Once your system is good at recognizing lines, maybe you can creatively adapt it somehow and scale it down to recognize the tiny lines in the 3x3 case.
(That said, I think you can learn this particular 3x3 problem just by giving your network ALL the data. The space might be too small for generalization, so I wouldn't even try it in this case. After all, in training a net to learn the binary XOR function, we just feed it all 4 examples -- the complete space. You can't train it reliably from just 3 examples.)
I think there are problems here beyond the small data set, and they lie in your representation of the game state. In tic-tac-toe there are three possible states for each space on the board at any given time: [X], [O], or empty []. Furthermore, there are conditions on the game which limit the possible board configurations, i.e. there can be no more than n+1 [X] squares given n [O] squares. I suggest going back and thinking about how to represent the three-state nature of the game squares.
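One possible way to make that three-state nature explicit is to one-hot encode each square, so a board becomes a vector of length 27 instead of 9. This is only a sketch; the encoding (0 = empty, 1 = X, 2 = O) and the helper name are made up for illustration:
import numpy as np
def encode_board(board):
    # board: list of 9 values in {0, 1, 2}; result: flat vector of length 27
    one_hot = np.zeros((9, 3), dtype="float32")
    one_hot[np.arange(9), board] = 1.0
    return one_hot.flatten()
example = encode_board([0, 1, 2,
                        0, 1, 0,
                        2, 1, 0])  # X wins down the middle column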
After playing around with this for a while I think I learned enough to add a valuable answer as well.
1. Grid size
Increasing the size of the grid will make it much easier to come up with more samples for training while still leaving enough room for validation data that the NN won't see during training. I'm not saying it can't be done for a 3 x 3 grid, but increasing the size of the grid will definitely help. I ended up increasing the size to 6 x 6 and looking for straight lines of at least four connected points.
2. Data representation
Representing the data in a one dimensional vector isn't optimal.
Think about it. When we want to represent the following line in our grid...
[0,1,0,0,0,0,
0,1,0,0,0,0,
0,1,0,0,0,0,
0,1,0,0,0,0,
0,0,0,0,0,0,
0,0,0,0,0,0]
...how should our NN know that what we mean isn't actually this pattern in a grid of size 3 x 12?
[0,1,0,0,0,0,0,1,0,0,0,0,
0,1,0,0,0,0,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0]
We can provide much more context to our NN if we represent the data in a way that the NN knows we are talking about a grid of size 6 x 6.
[[0,1,0,0,0,0],
[0,1,0,0,0,0],
[0,1,0,0,0,0],
[0,1,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0]]
The good news is that we can do exactly that using a Convolution2D layer in keras.
3. Target data representation
It's not only helpful to rethink the representation of our training data; we can also tweak the representation of our target data. Initially I wanted to go with a binary question: Does this grid contain a straight line or not? 1 or 0.
It turns out we can do much better by using the same shape for our target data that we use for our input data and redefining our question as: Does this pixel belong to a straight line or not? So, considering we have an input sample that looks like this:
[[0,1,1,0,0,1],
[0,1,0,1,0,0],
[0,1,0,0,1,0],
[0,1,0,0,0,1],
[0,0,0,1,0,0],
[1,0,1,0,0,0]]
Our target output would look like this.
[[0,1,1,0,0,0],
[0,1,0,1,0,0],
[0,1,0,0,1,0],
[0,1,0,0,0,1],
[0,0,0,0,0,0],
[0,0,0,0,0,0]]
That way we're giving the NN much more context about what we are actually looking for. Think about it: if you had to make sense of these samples, I'm sure this target data representation would give your brain much better hints than a target representation that is just 0 or 1.
Now the question is: how can we model our NN to have a target of the same shape as our input data? What usually happens is that each convolutional layer slices the grid into smaller grids to look for certain features, which effectively changes the shape of the data that is passed to the next layer.
However, we can set border_mode='same' for our convolutional layers, which essentially pads the grids with a border of zeros so that the original shape is preserved.
4. Measure
Measuring the performance of our model is key to making the right adjustments. In particular, we want to see how accurate the predictions of our NN are for the training data and for the validation data. These numbers give us the right hints.
For instance, if the accuracy of the predictions on our training data goes up while the accuracy of the predictions on our validation data is stale or even declining, our NN is overfitting: it basically memorizes the training data but doesn't actually generalize the learnings so that it can apply them to data it hasn't seen before (e.g. our validation data).
There are three things we want to do (a small sketch combining them follows this list):
A.) we want to set validation_data = (val_input_data, val_target_data) when we call model.fit(...) so that keras can inform us about the accuracy for our validation data after each epoch.
B.) we want to set verbose=2 when we call model.fit(...) so that keras actually prints out the progress after each epoch.
C.) we want to set metrics=['binary_accuracy'] when we call model.compile(...) to actually include the right metric in these progress logs that keras gives us after each epoch.
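Put together, a sketch of those three settings might look like the following; the array names, epoch/batch numbers, loss and optimizer here are placeholders rather than the exact choices used for the final model:
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['binary_accuracy'])                     # C: report the right metric
model.fit(train_input_data, train_target_data,
          validation_data=(val_input_data, val_target_data),  # A: evaluate on validation data
          verbose=2,                                          # B: print progress after each epoch
          nb_epoch=20, batch_size=32)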
5. Data generation
Last but not least, as most of the other answers suggest: the more data, the better. I ended up writing a data generator that produces the training data and target data samples for me. My validation data is hand-picked, and I made sure that the generator does not produce training data that is identical to my validation data. I ended up training with 1000 samples.
The final model
This is the model I ended up using. It uses Dropout and a feature size of 64. That said, you can play with these numbers and you will notice that lots of models work pretty well.
model = Sequential()
model.add(Convolution2D(64, 3, 3, input_shape=(1, 6, 6), activation='relu', border_mode='same'))
model.add(Dropout(0.25))
model.add(Convolution2D(64, 3, 3, activation='relu', border_mode='same'))
model.add(Dropout(0.25))
model.add(Convolution2D(64, 3, 3, activation='relu', border_mode='same'))
model.add(Dropout(0.25))
model.add(Convolution2D(1, 1, 1, activation='sigmoid', border_mode='same'))

Sequence labeling in Keras

I'm working on a sentence labeling problem. I've done the embedding and padding myself, and my inputs look like:
X_i = [[0,1,1,0,2,3...], [0,1,1,0,2,3...], ..., [0,0,0,0,0...], [0,0,0,0,0...], ....]
For every word in sentence I want to predict one of four classes, so my desired output should look like:
Y_i = [[1,0,0,0], [0,0,1,0], [0,1,0,0], ...]
My simple network architecture is:
model = Sequential()
model.add(LSTM(input_shape = (emb,),input_dim=emb, output_dim=hidden, return_sequences=True))
model.add(TimeDistributedDense(output_dim=4))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, validation_data=(X_test, Y_test), verbose=1, show_accuracy=True)
It shows approximately 95% accuracy while training, but when I try to predict new sentences using the trained model the results are really bad. It looks like the model just learnt some classes for the first words and predicts them every time. I think the problem could be:
The padding I wrote myself (zero vectors at the end of each sentence); can it make learning worse?
Maybe I should try to learn from sentences of different lengths, without padding (if so, can you help me understand how to train such a model in Keras?)
The wrong learning objective, but I tried mean squared error, binary cross-entropy and others, and it doesn't change anything.
Something with TimeDistributedDense and softmax; I think I've got how it works, but I'm still not 100% sure.
I'll be glad to see any hint or help regarding this problem, thank you!
I personally think that you misunderstand what "sequence labeling" means.
Do you mean:
X is a list of sentences, each element X[i] is a word sequence of arbitrary length?
Y[i] is the category of X[i], and the one hot form of Y[i] is a [0, 1, 0, 0] like array?
If it is, then it's not a sequence labeling problem, it's a classification problem.
Don't use TimeDistributedDense, and if it is a multi-class classification problem, i.e., len(Y[i]) > 2, then use "categorical_crossentropy" instead of "binary_crossentropy".
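If that is indeed the setup, a minimal sketch of the classification variant might look like the following (maxlen, the padded sentence length, is a hypothetical name; emb and hidden are the sizes from the question):
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
model = Sequential()
model.add(LSTM(hidden, input_shape=(maxlen, emb)))  # no return_sequences: one vector per sentence
model.add(Dense(4))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')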
