Sequence labeling in Keras - machine-learning

I'm working on a sentence labeling problem. I've done the embedding and padding myself, and my inputs look like:
X_i = [[0,1,1,0,2,3...], [0,1,1,0,2,3...], ..., [0,0,0,0,0...], [0,0,0,0,0...], ....]
For every word in the sentence I want to predict one of four classes, so my desired output should look like:
Y_i = [[1,0,0,0], [0,0,1,0], [0,1,0,0], ...]
My simple network architecture is:
model = Sequential()
model.add(LSTM(input_shape = (emb,),input_dim=emb, output_dim=hidden, return_sequences=True))
model.add(TimeDistributedDense(output_dim=4))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, validation_data=(X_test, Y_test), verbose=1, show_accuracy=True)
It shows approximately 95% accuracy while training, but when I try to predict new sentences with the trained model the results are really bad. It looks like the model just learnt some classes for the first words and outputs them every time. I think the problem could be:
The padding I wrote myself (zero vectors at the end of the sentence): can it make learning worse?
I should try to learn on sentences of different lengths, without padding (if so, can you help me with how to train such a model in Keras?).
A wrong learning objective, but I've tried mean squared error, binary cross-entropy and others, and it doesn't change anything.
Something with TimeDistributedDense and softmax. I think I've got how it works, but I'm still not 100% sure.
I'll be glad to see any hint or help regarding this problem, thank you!

I personally think that you misunderstand what "sequence labeling" means.
Do you mean:
X is a list of sentences, each element X[i] is a word sequence of arbitrary length?
Y[i] is the category of X[i], and the one-hot form of Y[i] is an array like [0, 1, 0, 0]?
If it is, then it's not a sequence labeling problem, it's a classification problem.
Don't use TimeDistributedDense, and if it is a multi-class classification problem, i.e., len(Y[i]) > 2, use "categorical_crossentropy" instead of "binary_crossentropy".
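For example, a minimal sentence-level classifier might look like the sketch below, keeping the emb and hidden sizes from the question; the exact layer arguments depend on your Keras version, so treat this as an illustration rather than drop-in code:
model = Sequential()
model.add(LSTM(input_dim=emb, output_dim=hidden, return_sequences=False))  # one vector per sentence, not per word
model.add(Dense(4))                                                        # one score per class
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')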

Related

How to forecast one output with multiple features by LSTM model?

I am playing with some stock time-series data and trying to predict the trend with multivariate features. Below is the sample dataset I have, which includes different technical indicators (moving average, Parabolic SAR, etc.) for each stock. From different online sources, most of them predict one stock with one feature, like the "Close" price, at a time. How can I make use of all the stocks' features to predict one output, let's say the S&P's close price? I know it may not help boost the prediction accuracy, but I am not sure what I am training right now and hope to get more insight into the LSTM model.
Basically, I put the whole dataset in and do the scaling and training steps. How can the prediction be restricted to one column?
Code:
scaler = MinMaxScaler(feature_range = (0,1))
scaled_feature_data = scaler.fit_transform(feature_data)
X_train, y_train = training_set[:, :-1], training_set[:, -1]
X_test, y_test = testing_set[:, :-1], testing_set[:, -1]
X_train = X_train.reshape((X_train.shape[0],1,X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0],1,X_test.shape[1]))
Model:
model_lstm.add(LSTM(50, return_sequences = True, input_shape = (X_train.shape[1], X_train.shape[2])))
model_lstm.add(Dropout(0.2))
model_lstm.add(LSTM(units=50, return_sequences=True))
model_lstm.add(Dropout(0.2))
model_lstm.add(LSTM(units=50, return_sequences=True))
model_lstm.add(Dropout(0.2))
model_lstm.add(LSTM(units=50))
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(units=1, activation='relu'))
This is a regression problem. You are predicting one variable, so the last layers may look like this:
model_lstm.add(Dense(1))
model_lstm.compile(loss='mse', optimizer='rmsprop')
Also, you probably don't need return_sequences for each of the input tokens. If return_sequences=True, the output will be a 3-D tensor of shape (batch_size, num_tokens, num_features), which cannot be automatically flattened to the (batch_size, num_features) matrix that the Dense(1) layer expects. Just use the output of the last LSTM step; for this, set return_sequences=False. Its value depends on the previous tokens, so you won't lose much information from them.
The whole model may look like this:
model_lstm = Sequential()
model_lstm.add(LSTM(50, return_sequences=False, input_shape=(X_train.shape[1], X_train.shape[2])))
model_lstm.add(Dropout(0.5))
model_lstm.add(Dense(1))
model_lstm.compile(loss='mse', optimizer='rmsprop')
If you want more layers it will become:
model_lstm = Sequential()
model_lstm.add(LSTM(64, return_sequences=True, dropout=0.5, input_shape=(X_train.shape[1], X_train.shape[-1])))
model_lstm.add(Dropout(0.5))
model_lstm.add(LSTM(32, return_sequences=False, dropout=0.5))
model_lstm.add(Dense(1))
model_lstm.compile(loss='mse', optimizer='rmsprop')
Actually I am wondering how all the features, let's say 10 technical indicator features, can help predict the one price column?
I don't know if I understand you correctly, but that's what ML does: it tries to find correlations between the features and how they can be used to predict something. So maybe the "DE30" has (or seems to have) an influence on the price and is therefore helpful. Was that your question?
"From different online sources, most of them are predicting one stock with one feature like "Close" price at a time"
I guess that's for simplification; therefore they used only one feature.
Let me know if that was what you asked for.

Is it okay to use STATEFUL Recurrent NN (LSTM) for classification

I have a dataset C of 50,000 (binary) samples, each with 128 features. The class label is also binary, either 1 or -1. For instance, a sample would look like this: [1,0,0,0,1,0, .... , 0,1] [-1]. My goal is to classify the samples based on the binary classes (i.e., 1 or -1). I thought I would try using a recurrent LSTM to generate a good model for classification. To do so, I have written the following code using the Keras library:
tr_C, ts_C, tr_r, ts_r = train_test_split(C, r, train_size=.8)
batch_size = 200
print('>>> Build STATEFUL model...')
model = Sequential()
model.add(LSTM(128, batch_input_shape=(batch_size, C.shape[1], C.shape[2]), return_sequences=False, stateful=True))
model.add(Dense(1, activation='softmax'))
print('>>> Training...')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(tr_C, tr_r,
batch_size=batch_size, epochs=1, shuffle=True,
validation_data=(ts_C, ts_r))
However, I am getting bad accuracy, not more than 55%. I tried changing the activation function along with the loss function, hoping to improve the accuracy, but nothing works. Surprisingly, when I use a multilayer perceptron, I get very good accuracy, around 97%. Thus, I started questioning whether an LSTM can be used for classification, or whether my code here has something missing or wrong. Kindly, I want to know if the code has something missing or wrong that I can fix to improve the accuracy. Any help or suggestion is appreciated.
You cannot use softmax as the output activation when you have only a single output unit, as it will always output a constant value of 1. You need to either change the output activation to sigmoid or set the number of output units to 2 and the loss to categorical_crossentropy. I would advise the first option.
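A minimal sketch of the first option, changing only the output activation and keeping everything else from your code, might be:
model = Sequential()
model.add(LSTM(128, batch_input_shape=(batch_size, C.shape[1], C.shape[2]), return_sequences=False, stateful=True))
model.add(Dense(1, activation='sigmoid'))  # sigmoid instead of softmax for a single binary output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Note that binary_crossentropy with a single sigmoid output expects 0/1 targets, so the -1 labels would need to be remapped to 0 first.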

neural network produces similar pattern for all inputs

I am attempting to train an ANN on time series data in Keras. I have three vectors of data that are broken into scrolling window sequences (i.e., for a vector l):
np.array([l[i:i+window_size] for i in range( len(l) - window_size)])
The target vector is similarly windowed, so the neural net output is a prediction of the target vector for the next window_size time steps. All the data is normalized with a min-max scaler and fed into the neural network with shape (nb_samples, window_size, 3). Here is a plot of the 3 input vectors.
The only output I've managed to muster from the ANN is the following plot: target vector in blue, predictions in red (the plot is zoomed in to make the prediction pattern legible). Prediction vectors are plotted at window_size intervals, so each one of the repeated patterns is one prediction from the net.
I've tried many different model architectures, numbers of epochs, activation functions, and short, fat, skinny, and tall networks. This is my current one (it's a little out there):
Conv1D(64,4, input_shape=(None,3)) ->
Conv1D(32,4) ->
Dropout(24) ->
LSTM(32) ->
Dense(window_size)
But nothing I try stops the neural net from outputting this repeated pattern. I must be misunderstanding something about time series or LSTMs in Keras. I'm very lost at this point, so any help is greatly appreciated. I've attached the full code at this repository:
https://github.com/jaybutera/dat-toy
I played with your code a little and I think I have a few suggestions for getting you on the right track. The code doesn't seem to match your graphs exactly, but I assume you've tweaked it a bit since then. Anyway, there are two main problems:
The biggest problem is in your data preparation step. You basically have the data shapes backwards: you have a single timestep of input for X and a timeseries for Y. Your input shape is (18830, 1, 8), when what you really want is (18830, 30, 8), so that the full 30 timesteps are fed into the LSTM. Otherwise the LSTM is only operating on one timestep and isn't really useful. To fix this, I changed the line in common.py from
X = X.reshape(X.shape[0], 1, X.shape[1])
to
X = windowfy(X, winsize)
Similarly, the output data should probably be only 1 value, from what I've gathered of your goals from the plotting function. There are certainly some situations where you want to predict a whole timeseries, but I don't know if that's what you want in this case. I changed Y_train to use fuels instead of fuels_w so that it only had to predict one step of the timeseries.
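For reference, a windowing helper like that can be as simple as the windowing expression from the question wrapped in a function (a sketch, not necessarily the exact code in common.py):
import numpy as np

def windowfy(data, winsize):
    # Turn a (n_samples, n_features) array into overlapping windows of
    # shape (n_samples - winsize, winsize, n_features).
    return np.array([data[i:i + winsize] for i in range(len(data) - winsize)])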
Training for 100 epochs might be way too much for this simple network architecture. In some cases when I ran it, it looked like there was some overfitting going on. Observing the decrease of loss in the network, it seems like maybe only 3-4 epochs are needed.
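If you'd rather not hand-pick the epoch count, one option (not something your current code uses, just a suggestion) is Keras's EarlyStopping callback, which stops training once the validation loss stops improving; X_train and Y_train here stand in for your own arrays:
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2)  # stop after 2 epochs with no improvement
model.fit(X_train, Y_train, epochs=100, validation_split=0.1, callbacks=[early_stop])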
Here is the graph of predictions after 3 training epochs with the adjustments I mentioned. It's not a great prediction, but it looks like it's on the right track now at least. Good luck to you!
EDIT: Example predicting multiple output timesteps:
from sklearn import datasets, preprocessing
import numpy as np
from scipy import stats
from keras import models, layers
INPUT_WINDOW = 10
OUTPUT_WINDOW = 5 # Predict 5 steps of the output variable.
# Randomly generate some regression data (not true sequential data; samples are independent).
np.random.seed(11798)
X, y = datasets.make_regression(n_samples=1000, n_features=4, noise=.1)
# Rescale 0-1 and convert into windowed sequences.
X = preprocessing.MinMaxScaler().fit_transform(X)
y = preprocessing.MinMaxScaler().fit_transform(y.reshape(-1, 1))
X = np.array([X[i:i + INPUT_WINDOW] for i in range(len(X) - INPUT_WINDOW)])
y = np.array([y[i:i + OUTPUT_WINDOW] for i in range(INPUT_WINDOW - OUTPUT_WINDOW,
len(y) - OUTPUT_WINDOW)])
print(np.shape(X)) # (990, 10, 4) - Ten timesteps of four features
print(np.shape(y)) # (990, 5, 1) - Five timesteps of one feature
# Construct a simple model predicting output sequences.
m = models.Sequential()
m.add(layers.LSTM(20, activation='relu', return_sequences=True, input_shape=(INPUT_WINDOW, 4)))
m.add(layers.LSTM(20, activation='relu'))
m.add(layers.RepeatVector(OUTPUT_WINDOW))
m.add(layers.LSTM(20, activation='relu', return_sequences=True))
m.add(layers.wrappers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
print(m.summary())
m.compile(optimizer='adam', loss='mse')
m.fit(X[:800], y[:800], batch_size=10, epochs=60) # Train on first 800 sequences.
preds = m.predict(X[800:], batch_size=10) # Predict the remaining sequences.
print('Prediction:\n' + str(preds[0]))
print('Actual:\n' + str(y[800]))
# Correlation should be around r = .98, essentially perfect.
print('Correlation: ' + str(stats.pearsonr(y[800:].flatten(), preds.flatten())[0]))

Why does my NN not classify these tic tac toe pattern correctly?

I'm trying to teach an AI to recognize patterns of tic-tac-toe with a winning line.
Unfortunately, it's not learning to recognize them correctly. I think my way of representing/encoding the game into vectors is wrong.
I chose a way that is easy for a human (me, in particular!) to understand:
training_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[0,0,1,
0,1,0,
0,0,1],
[0,0,1,
0,1,0,
1,0,0],
[0,1,0,
0,1,0,
0,1,0]], "float32")
target_data = np.array([[0],[0],[1],[1]], "float32")
This uses an array of length 9 to represent a 3 x 3 board. The first three items represent the first row, the next three the second row, and so on. The line breaks should make it obvious. The target data then maps the first two game states to "no wins" and the last two game states to "wins".
Then I wanted to create some validation data that is slightly different to see if it generalizes.
validation_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[1,0,0,
0,1,0,
1,0,0],
[1,0,0,
0,1,0,
0,0,1],
[0,0,1,
0,0,1,
0,0,1]], "float32")
Obviously, again the last two game states should be "wins" whereas the first two should not.
I tried to play with the number of neurons and learning rate, but no matter what I try, my output looks pretty off, e.g.
[[ 0.01207292]
[ 0.98913926]
[ 0.00925775]
[ 0.00577191]]
I tend to think it's the way I represent the game state that may be wrong, but actually I have no idea :D
Can anyone help me out here?
This is the entire code that I use:
import numpy as np
from keras.models import Sequential
from keras.layers.core import Activation, Dense
from keras.optimizers import SGD
training_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[0,0,1,
0,1,0,
0,0,1],
[0,0,1,
0,1,0,
1,0,0],
[0,1,0,
0,1,0,
0,1,0]], "float32")
target_data = np.array([[0],[0],[1],[1]], "float32")
validation_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[1,0,0,
0,1,0,
1,0,0],
[1,0,0,
0,1,0,
0,0,1],
[0,0,1,
0,0,1,
0,0,1]], "float32")
model = Sequential()
model.add(Dense(2, input_dim=9, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)
history = model.fit(training_data, target_data, nb_epoch=10000, batch_size=4, verbose=0)
print(model.predict(validation_data))
UPDATE
I tried to follow the advice and used more training data with no success so far.
My training set looks like this now
training_data = np.array([[0,0,0,
0,0,0,
0,0,0],
[0,0,1,
0,0,0,
1,0,0],
[0,0,1,
0,1,0,
0,0,1],
[1,0,1,
0,1,0,
0,0,0],
[0,0,0,
0,1,0,
1,0,1],
[1,0,0,
0,0,0,
0,0,0],
[0,0,0,
0,0,0,
1,0,0],
[0,0,0,
0,1,0,
0,0,1],
[1,0,1,
0,0,0,
0,0,0],
[0,0,0,
0,0,0,
0,0,1],
[1,1,0,
0,0,0,
0,0,0],
[0,0,0,
1,0,0,
1,0,0],
[0,0,0,
1,1,0,
0,0,0],
[0,0,0,
0,0,1,
0,0,1],
[0,0,0,
0,0,0,
0,1,1],
[1,0,0,
1,0,0,
1,0,0],
[1,1,1,
0,0,0,
0,0,0],
[0,0,0,
0,0,0,
1,1,1],
[0,0,1,
0,1,0,
1,0,0],
[0,1,0,
0,1,0,
0,1,0]], "float32")
target_data = np.array([[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1]], "float32")
Considering that I only count patterns of 1 as wins there are only 8 different win states for the way I represent the data. I made the NN see 5 of them so that I still have 3 to test against to see if the generalization works. I'm now feeding it 15 states that it should not consider a win.
However, the outcome for my validation seems to actually get worse.
[[ 1.06987642e-07]
[ 4.72647212e-02]
[ 1.97011139e-03]
[ 2.93282426e-07]]
Things I tried:
Changing from sigmoid to softmax
Adding more neurons
Adding more layers
A mix of all of the above
I see your problem immediately: your training set is far too small. Your problem space consists of the 512 corners of a 9-dimensional hypercube. Your training colours two of the corners green and two others red. You now somehow expect the trained model to have correctly intuited the proper colourings for the remaining 508 corners.
No general-purpose machine-learning algorithm will intuit the pattern of "does this board position contain any of the eight approved sequences of three evenly-spaced '1' values?" from only two positive and two negative examples. For one thing, note that your training data has no row wins, does not exclude evenly-spaced points that aren't a win, and misses many other patterns in the space.
I expect that you'll need at least two dozen well-chosen examples on each side of the classification to get any appreciable performance from your model. Think in terms of test cases: bits 1-2-3 make a win, but 3-4-5 does not; 3-5-7 make a win, but 1-3-5 and 2-4-6 do not.
Does this move you toward a solution?
One thing you might try is to generate random vectors and then classify them with a subroutine; feed these as training data. Do more for testing and validation data.
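A rough sketch of that idea for the 3 x 3 case (the win-checking subroutine and the generator here are my own illustration, not code from the question):
import numpy as np

WIN_PATTERNS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
                (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
                (0, 4, 8), (2, 4, 6)]              # diagonals

def has_win(board):
    # board is a flat length-9 array of 0/1 values
    return any(all(board[i] == 1 for i in pattern) for pattern in WIN_PATTERNS)

def make_dataset(n_samples):
    X = (np.random.rand(n_samples, 9) > 0.7).astype('float32')  # random sparse boards
    y = np.array([[1.0] if has_win(b) else [0.0] for b in X], 'float32')
    return X, y

X_train, y_train = make_dataset(2000)   # many more examples than the hand-written four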
What Prune said makes a lot of sense. Given that your problem space is 138 terminal board positions (and that's excluding rotations and reflections! - see wiki) it is very unlikely that the learning algorithm can sufficiently adjust the weights and biases, just by training on a 4-entry data set. I had a similar experience in one of my "learning experiments", where, even though the net was trained on the complete data set, because the set was very small, I ended up having to train it over multiple epochs until it was able to output decent predictions.
I think what's important to remember here is that what training a FF neural net ultimately does is to fine-tune weights and biases so that the loss function is minimised as much as possible. The lower the loss, the closer the predictions get to the expected outputs and the better the neural net gets. This means the more training data the merrier :)
I found this complete training set for tic tac toe, though it's not in the format that you set out with, but who knows, perhaps it will be useful for you. I would be curious to know, what the min subset of that training set would be, for the net to start making reliable predictions :P
This is an interesting problem. I think you're really wanting your system to recognize "lines", but as others have said, with so little training data it's hard for the system to generalize.
A different and counterintuitive approach might be to start with a larger board, say, 10x10, not 3x3, and generate random lines in that space and try to make the model learn them. You might explore convolutional networks in that case. This would be a lot like the handwritten digit recognition problem, and I expect it would succeed easily. Once your system is good at recognizing lines, maybe you can creatively adapt it somehow and scale it down to recognize the tiny lines in the 3x3 case.
(That said, I think you can learn this particular 3x3 problem just by giving your network ALL the data. It might be too small for generalization, so I wouldn't even try that in this case. After all, in training a net to learn the binary XOR function, we just feed it all 4 examples -- the complete space. You can't train it reliably from just 3 examples.)
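For reference, the XOR analogy as a complete-space training run looks roughly like this (same style of model as in the question; the layer sizes and epoch count are arbitrary):
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], "float32")
y = np.array([[0], [1], [1], [0]], "float32")   # XOR of the two inputs

model = Sequential()
model.add(Dense(4, input_dim=2, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, nb_epoch=2000, batch_size=4, verbose=0)   # the complete space: all 4 examples
print(model.predict(X))   # should approach [0, 1, 1, 0]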
I think there are problems here beyond a small data set, and these lie in your representation of the game state. In tic-tac-toe, there are three possible states for each space on the board at any given time: [X], [O], or empty []. Furthermore, there are conditions on the game which limit possible board configurations, i.e., there can be no more than n+1 [X] squares, given n [O] squares. I suggest going back and thinking about how to represent the three-state nature of the game squares.
After playing around with this for a while I think I learned enough to add a valuable answer as well.
1. Grid size
Increasing the size of the grid will make it much easier to come up with more samples for the training while still leaving enough room for validation data that the NN won't see during the training. I'm not saying it can't be done for a 3 x 3 grid but increasing the size of the grid will definitely help. I ended up increasing the size to 6 x 6 and looking for straight lines of a min length of four connected points.
2. Data representation
Representing the data as a one-dimensional vector isn't optimal.
Think about it. When we want to represent the following line in our grid...
[0,1,0,0,0,0,
0,1,0,0,0,0,
0,1,0,0,0,0,
0,1,0,0,0,0,
0,0,0,0,0,0,
0,0,0,0,0,0]
...how should our NN know that what we mean isn't actually this pattern in a grid of size 3 x 12?
[0,1,0,0,0,0,0,1,0,0,0,0,
0,1,0,0,0,0,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0]
We can provide much more context to our NN if we represent the data in a way that the NN knows we are talking about a grid of size 6 x 6.
[[0,1,0,0,0,0],
[0,1,0,0,0,0],
[0,1,0,0,0,0],
[0,1,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0]]
The good news is that we can do exactly that using a Convolution2D layer in keras.
3. Target data representation
It's not only helpful to rethink the representation of our training data; we can also tweak the representation of our target data. Initially I wanted to go with a binary question: does this grid contain a straight line or not? 1 or 0.
Turns out we can do much better by using the same shape for our target data that we use for our input data and redefine our question as: Does this pixel belong to a straight line or not? So, considering we have an input sample that looks like this:
[[0,1,1,0,0,1],
[0,1,0,1,0,0],
[0,1,0,0,1,0],
[0,1,0,0,0,1],
[0,0,0,1,0,0],
[1,0,1,0,0,0]]
Our target output would look like this.
[[0,1,1,0,0,0],
[0,1,0,1,0,0],
[0,1,0,0,1,0],
[0,1,0,0,0,1],
[0,0,0,0,0,0],
[0,0,0,0,0,0]]
That way we're giving the NN much more context about what we are actually looking for. Think about it: if you had to make sense of these samples, I'm sure this target data representation would also give your brain a much better hint than a target data representation that is just 0 or 1.
Now the question is: how can we model our NN to have a target shape that is the same shape as our input data? What usually happens is that each convolutional layer slices the grid into smaller grids to look for certain features, which effectively changes the shape of the data that is passed to the next layer.
However, we can set border_mode='same' for our convolutional layers, which essentially pads the smaller grids with a border of zeros so that the original shape is preserved.
4. Measure
Measuring the performance of our model is key to making the right adjustments. In particular, we want to see how accurate the predictions of our NN are for the training data and for the validation data. Having these numbers gives us the right hints.
For instance, if the accuracy of the predictions for our training data goes up while the accuracy of the predictions for our validation data is stale or even declines, that means our NN is overfitting: it basically memorizes the training data but doesn't actually generalize what it has learnt so that it can apply it to data it hasn't seen before (e.g. our validation data).
There are three things we want to do:
A.) set validation_data=(val_input_data, val_target_data) when we call model.fit(...) so that Keras reports the accuracy on our validation data after each epoch.
B.) set verbose=2 when we call model.fit(...) so that Keras actually prints out the progress after each epoch.
C.) set metrics=['binary_accuracy'] when we call model.compile(...) so that the right metric is included in the progress logs Keras gives us after each epoch.
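Put together, the relevant calls might look like this (a sketch; the variable names are placeholders for your own arrays, and the loss shown is just one sensible choice for 0/1 pixel targets):
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])
model.fit(input_data, target_data,
          validation_data=(val_input_data, val_target_data),
          nb_epoch=100, batch_size=32, verbose=2)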
5. Data generation
Last but not least, as most of the other answers suggest: the more data, the better. I ended up writing a data generator that produces the training data and target data samples for me. My validation data is hand-picked, and I made sure that the generator does not generate training data that is identical to my validation data. I ended up training with 1000 samples.
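The generator itself is specific to my setup, but as a rough sketch of the idea (only horizontal and vertical lines here for brevity; the real thing also handled diagonals and should re-check for lines created accidentally by the noise):
import numpy as np

GRID = 6
MIN_LEN = 4

def make_sample():
    grid = np.zeros((GRID, GRID), dtype='float32')
    target = np.zeros((GRID, GRID), dtype='float32')
    if np.random.rand() > 0.5:                           # half of the samples contain a line
        length = np.random.randint(MIN_LEN, GRID + 1)
        row = np.random.randint(GRID)
        start = np.random.randint(GRID - length + 1)
        if np.random.rand() > 0.5:                       # horizontal line
            grid[row, start:start + length] = 1
            target[row, start:start + length] = 1
        else:                                            # vertical line
            grid[start:start + length, row] = 1
            target[start:start + length, row] = 1
    noise = np.random.rand(GRID, GRID) > 0.9             # sprinkle random distractor points
    grid[noise] = 1
    return grid.reshape(1, GRID, GRID), target.reshape(1, GRID, GRID)

samples = [make_sample() for _ in range(1000)]
train_input = np.array([s[0] for s in samples])    # shape (1000, 1, 6, 6)
train_target = np.array([s[1] for s in samples])   # matches input_shape=(1, 6, 6) below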
The final model
This is the model I ended up using. It uses Dropout and a feature size of 64. That said, you can play with these numbers and you will notice there are lots of models that work pretty well.
model = Sequential()
model.add(Convolution2D(64, 3, 3, input_shape=(1, 6, 6), activation='relu', border_mode='same'))
model.add(Dropout(0.25))
model.add(Convolution2D(64, 3, 3, activation='relu', border_mode='same'))
model.add(Dropout(0.25))
model.add(Convolution2D(64, 3, 3, activation='relu', border_mode='same'))
model.add(Dropout(0.25))
model.add(Convolution2D(1, 1, 1, activation='sigmoid', border_mode='same'))

Keras LSTM Time Series

I have a problem and at this point I'm completely lost as to how to solve it. I'm using Keras with an LSTM layer to project a time series. I'm trying to use the previous 10 data points to predict the 11th.
Here's the code:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
def _load_data(data):
    """
    data should be pd.DataFrame()
    """
    n_prev = 10
    docX, docY = [], []
    for i in range(len(data) - n_prev):
        docX.append(data.iloc[i:i+n_prev].as_matrix())
        docY.append(data.iloc[i+n_prev].as_matrix())
    if not docX:
        pass
    else:
        alsX = np.array(docX)
        alsY = np.array(docY)
    return alsX, alsY
X, y = _load_data(df_test)
X_train = X[:25]
X_test = X[25:]
y_train = y[:25]
y_test = y[25:]
in_out_neurons = 2
hidden_neurons = 300
model = Sequential()
model.add(LSTM(in_out_neurons, hidden_neurons, return_sequences=False))
model.add(Dense(hidden_neurons, in_out_neurons))
model.add(Activation("linear"))
model.compile(loss="mean_squared_error", optimizer="rmsprop")
model.fit(X_train, y_train, nb_epoch=10, validation_split=0.05)
predicted = model.predict(X_test)
So I'm taking the input data (a two-column dataframe), creating X, which is an n by 10 by 2 array, and y, which is an n by 2 array that is one step ahead of the last row in each array of X (labeling the data with the point directly ahead of it).
predicted returns
[[ 7.56940445, 5.61719704],
[ 7.57328357, 5.62709032],
[ 7.56728049, 5.61216415],
[ 7.55060187, 5.60573629],
[ 7.56717342, 5.61548522],
[ 7.55866942, 5.59696181],
[ 7.57325984, 5.63150951]]
but I should be getting
[[ 73, 48],
[ 74, 42],
[ 91, 51],
[102, 64],
[109, 63],
[ 93, 65],
[ 92, 58]]
The original data set only has 42 rows, so I'm wondering if there just isn't enough there to work with? Or am I missing a key step in the modeling process maybe? I've seen some examples using Embedding layers etc, is that something I should be looking at?
Thanks in advance for any help!
Hey Ryan!
I know it's late, but I just came across your question; I hope it's not too late and that you still find some useful knowledge here.
First of all, Stack Overflow may not be the best place for this kind of question. The first reason is that yours is a conceptual question, which is not this site's purpose. Moreover, your code runs, so it's not even a matter of general programming. Have a look at the stats site instead.
Second, from what I see, there is no conceptual error. You're using everything necessary, that is:
an LSTM with proper dimensions
return_sequences=False just before your Dense layer
a linear activation for your output
an MSE cost/loss/objective function
Third, I however find it extremely unlikely that your network learns anything with so few pieces of data. You have to understand that you have less data than parameters here! For the great majority of supervised learning algorithms, the first thing you need is not a good model, it's good data. You cannot learn from so few examples, especially not with a complex model such as an LSTM network.
Fourth, it seems like your target data is made of relatively high values. The first step of pre-processing here could be to standardize the data: center it around zero (that is, translate your data by its mean) and rescale it by its standard deviation. This really helps learning!
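A minimal sketch of that standardization step with scikit-learn (assuming it is applied while the data is still 2-D, before the windowing; keep the scalers around to invert the predictions later):
from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_train = x_scaler.fit_transform(X_train)   # fit on training data only
X_test = x_scaler.transform(X_test)         # reuse the training mean/std
y_train = y_scaler.fit_transform(y_train)
y_test = y_scaler.transform(y_test)
# predictions can be mapped back with y_scaler.inverse_transform(predicted)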
Fifth, in general here are a few things you should look into to improve learning and reduce overfitting:
Dropout
Batch Normalization
Other optimizer (such as Adam)
Gradient clipping
Random hyper parameter search
(This is not exhaustive, if you're reading this and think something should be added, comment it so it's useful for future readers!)
Last but NOT least I suggest you look at this tutorial on Github, especially the recurrent tutorial for time series with keras.
PS: Daniel Hnyk updated his post ;)
