## How does this binary encoder function work? - binary-data

I'm trying to understand the logic behind this binary encoder.
It automatically takes categorical variables and dummy-codes them (similar to one-hot encoding in sklearn), but reduces the number of output columns to the log2 of the number of unique values.
When I used this library, I noticed that my dummy variables were limited to only a few columns rather than one per unique value. On further investigation I found this @staticmethod, which takes the log2 of the number of unique values in a categorical variable.
My question is WHY? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing this? How does taking the log2 determine how many digits are needed to represent the data?

    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int( np.ceil(np.log2(len(X[col].unique()))) )

Full source code:

    """Binary encoding"""
    import copy
    import pandas as pd
    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from category_encoders.ordinal import OrdinalEncoder
    from category_encoders.utils import get_obj_cols, convert_input

    __author__ = 'willmcginnis'


    class BinaryEncoder(BaseEstimator, TransformerMixin):
        """Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

        Parameters
        ----------
        verbose: int
            integer indicating verbosity of output. 0 for none.
        cols: list
            a list of columns to encode, if None, all string columns will be encoded
        drop_invariant: bool
            boolean for whether or not to drop columns with 0 variance
        return_df: bool
            boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array)
        impute_missing: bool
            boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
        handle_unknown: str
            options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
            impute is used, an extra column will be added in if the transform matrix has unknown categories. This can causes
            unexpected changes in dimension in some cases.

        Example
        -------
        >>>from category_encoders import *
        >>>import pandas as pd
        >>>from sklearn.datasets import load_boston
        >>>bunch = load_boston()
        >>>y = bunch.target
        >>>X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
        >>>enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
        >>>numeric_dataset = enc.transform(X)
        >>>print(numeric_dataset.info())
        <class 'pandas.core.frame.DataFrame'>
        RangeIndex: 506 entries, 0 to 505
        Data columns (total 16 columns):
        CHAS_0     506 non-null int64
        RAD_0      506 non-null int64
        RAD_1      506 non-null int64
        RAD_2      506 non-null int64
        RAD_3      506 non-null int64
        CRIM       506 non-null float64
        ZN         506 non-null float64
        INDUS      506 non-null float64
        NOX        506 non-null float64
        RM         506 non-null float64
        AGE        506 non-null float64
        DIS        506 non-null float64
        TAX        506 non-null float64
        PTRATIO    506 non-null float64
        B          506 non-null float64
        LSTAT      506 non-null float64
        dtypes: float64(11), int64(5)
        memory usage: 63.3 KB
        None
        """

        def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
            self.return_df = return_df
            self.drop_invariant = drop_invariant
            self.drop_cols = []
            self.verbose = verbose
            self.impute_missing = impute_missing
            self.handle_unknown = handle_unknown
            self.cols = cols
            self.ordinal_encoder = None
            self._dim = None
            self.digits_per_col = {}

        def fit(self, X, y=None, **kwargs):
            """Fit encoder according to X and y.

            Parameters
            ----------
            X : array-like, shape = [n_samples, n_features]
                Training vectors, where n_samples is the number of samples
                and n_features is the number of features.
            y : array-like, shape = [n_samples]
                Target values.

            Returns
            -------
            self : encoder
                Returns self.
            """
            # if the input dataset isn't already a dataframe, convert it to one (using default column names)
            # first check the type
            X = convert_input(X)
            self._dim = X.shape[1]
            # if columns aren't passed, just use every string column
            if self.cols is None:
                self.cols = get_obj_cols(X)
            # train an ordinal pre-encoder
            self.ordinal_encoder = OrdinalEncoder(
                verbose=self.verbose,
                cols=self.cols,
                impute_missing=self.impute_missing,
                handle_unknown=self.handle_unknown
            )
            self.ordinal_encoder = self.ordinal_encoder.fit(X)
            for col in self.cols:
                self.digits_per_col[col] = self.calc_required_digits(X, col)
            # drop all output columns with 0 variance.
            if self.drop_invariant:
                self.drop_cols = []
                X_temp = self.transform(X)
                self.drop_cols = [x for x in X_temp.columns.values if X_temp[x].var() <= 10e-5]
            return self

        def transform(self, X):
            """Perform the transformation to new categorical data.

            Parameters
            ----------
            X : array-like, shape = [n_samples, n_features]

            Returns
            -------
            p : array, shape = [n_samples, n_numeric + N]
                Transformed values with encoding applied.
            """
            if self._dim is None:
                raise ValueError('Must train encoder before it can be used to transform data.')
            # first check the type
            X = convert_input(X)
            # then make sure that it is the right size
            if X.shape[1] != self._dim:
                raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim, ))
            if not self.cols:
                return X
            X = self.ordinal_encoder.transform(X)
            X = self.binary(X, cols=self.cols)
            if self.drop_invariant:
                for col in self.drop_cols:
                    X.drop(col, 1, inplace=True)
            if self.return_df:
                return X
            else:
                return X.values

        def binary(self, X_in, cols=None):
            """
            Binary encoding encodes the integers as binary code with one column per digit.
            """
            X = X_in.copy(deep=True)
            if cols is None:
                cols = X.columns.values
                pass_thru = []
            else:
                pass_thru = [col for col in X.columns.values if col not in cols]
            bin_cols = []
            for col in cols:
                # get how many digits we need to represent the classes present
                digits = self.digits_per_col[col]
                # map the ordinal column into a list of these digits, of length digits
                X[col] = X[col].map(lambda x: self.col_transform(x, digits))
                for dig in range(digits):
                    X[str(col) + '_%d' % (dig, )] = X[col].map(lambda r: int(r[dig]) if r is not None else None)
                    bin_cols.append(str(col) + '_%d' % (dig, ))
            X = X.reindex(columns=bin_cols + pass_thru)
            return X

        @staticmethod
        def calc_required_digits(X, col):
            """
            figure out how many digits we need to represent the classes present
            """
            return int( np.ceil(np.log2(len(X[col].unique()))) )

        @staticmethod
        def col_transform(col, digits):
            """
            The lambda body to transform the column values
            """
            if col is None or float(col) < 0.0:
                return None
            else:
                col = list("{0:b}".format(int(col)))
                if len(col) == digits:
                    return col
                else:
                    return [0 for _ in range(digits - len(col))] + col

> My question is WHY? I realize that this reduces the dimensionality of
> the output data, but what is the logic behind doing this?

Basically, the point of categorical encoding is to let your algorithm work with categorical features. Several methods are available for this, including binary encoding. Its logic is close to that of One Hot Encoding (OHE), if you already understand that.
For binary encoding, each unique label in your categorical vector is assigned an arbitrary number between 0 and (the number of unique labels - 1). You then write this number in base 2 and spread its digits, each 0 or 1, across the newly created columns.
As an example, let's say your dataset has three different labels: 'A', 'B' and 'C'.
The following correspondence is built arbitrarily:
'A' -> 1 -> 01;
'B' -> 2 -> 10;
'C' -> 0 -> 00.
An encoding of a given dataset would then be:

| index | my_category | enc_category_0 | enc_category_1 |
|---|---|---|---|
| 0 | A | 0 | 1 |
| 1 | B | 1 | 0 |
| 2 | C | 0 | 0 |
| 3 | A | 0 | 1 |

As for its usefulness: as you said, it reduces the dimensionality. Besides, I guess it helps to avoid having too many zeros in the encoded columns, as happens with OHE. Here is an interesting post: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
> How does taking the log2 determine how many digits are needed to represent the data?

If you understand the working principle, the use of log2 follows. With d binary digits you can write 2^d different codes, so to give each of n unique labels its own code you need the smallest d with 2^d >= n, which is d = ceil(log2(n)). Example: ceil(log2(10)) = ceil(3.32) = 4, so 4 digits are needed to binary encode 10 unique labels.
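
The counting argument can be checked directly. The sketch below uses a hypothetical `required_digits` function that mirrors the `ceil(log2(...))` expression from the question, and verifies that the result is the smallest number of digits whose codes cover all categories.

```python
import math


def required_digits(n_categories):
    """Smallest d such that 2**d >= n_categories (hypothetical name)."""
    return math.ceil(math.log2(n_categories))


for n in (2, 3, 4, 5, 8, 9, 10, 1000):
    d = required_digits(n)
    # d digits give 2**d codes, which is enough; d - 1 digits would be too few
    assert 2 ** d >= n
    assert d == 0 or 2 ** (d - 1) < n
    print(f"{n:>4} categories -> {d} binary columns (one-hot would need {n})")
```

The gap grows quickly: 1000 categories fit in 10 binary columns.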
For more info about the implementation and code example: http://contrib.scikit-learn.org/categorical-encoding/_modules/category_encoders/binary.html#BinaryEncoder
Hope I was clear,
Tchau

## Related

### Map Dask bincount over 2d array columns

I am trying to use bincount over a 2D array. Specifically I have this code:

    import numpy as np
    import dask.array as da

    def dask_bincount(weights, x):
        da.bincount(x, weights)

    idx = da.random.random_integers(0, 1024, 1000)
    weight = da.random.random((1000, 2))
    bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)

The idea is that the bincount can be made with the same idx array on each one of the weight columns. That would return an array of size (np.amax(x) + 1, 2) if I am correct. However, when doing this I get this error message:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-17-5b8eed89ad32> in <module>
    ----> 1 bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)

    ~/.local/lib/python3.9/site-packages/dask/array/routines.py in apply_along_axis(func1d, axis, arr, dtype, shape, *args, **kwargs)
        454     if shape is None or dtype is None:
        455         test_data = np.ones((1,), dtype=arr.dtype)
    --> 456         test_result = np.array(func1d(test_data, *args, **kwargs))
        457         if shape is None:
        458             shape = test_result.shape

    <ipython-input-14-34fd0eb9b775> in dask_bincount(weights, x)
          1 def dask_bincount(weights, x):
    ----> 2     da.bincount(x, weights)

    ~/.local/lib/python3.9/site-packages/dask/array/routines.py in bincount(x, weights, minlength, split_every)
        670         raise ValueError("Input array must be one dimensional. Try using x.ravel()")
        671     if weights is not None:
    --> 672         if weights.chunks != x.chunks:
        673             raise ValueError("Chunks of input array x and weights must match.")
        674

    AttributeError: 'numpy.ndarray' object has no attribute 'chunks'

I thought that when dask arrays are created the library automatically assigns them chunks, so the error does not say much. How can I fix this? I made a script that does it in numpy with map:

    idx_np = np.random.randint(0, 1024, 1000)
    weight_np = np.random.random((1000,2))
    f = lambda y: np.bincount(idx_np, weight_np[:,y])
    result = map(f, [i for i in range(2)])
    np.array(list(result))

    array([[0.9885341 , 0.9977873 , 0.24937023, ..., 0.31024526, 1.40754883, 0.87609759],
           [1.77406303, 0.84787723, 0.14591474, ..., 0.54584068, 0.38357015, 0.85202672]])

I would like to do the same but with dask.

There are multiple problems at play.

Weights should be (2, 1000). You discover this by trying to write the same function in numpy using apply_along_axis:

    idx_np = np.random.random_integers(0, 1024, 1000)
    weight_np = np.random.random((2, 1000))  # <- transposed

    # This gives the same result as the code you provided
    np.apply_along_axis(lambda weight, idx: np.bincount(idx, weight), 1, weight_np, idx_np)

da.apply_along_axis applies the function to numpy arrays. You're getting the error

    AttributeError: 'numpy.ndarray' object has no attribute 'chunks'

This suggests that what makes it into the da.bincount method is actually a numpy array. The fact is that da.apply_along_axis takes each row of weight and sends it to the function as a numpy array. Your function should therefore be a numpy function:

    def bincount(weights, x):
        return np.bincount(x, weights)

However, if you try this, you will still get the same error. I believe that happens for a whole other reason, though.

Dask doesn't know what the output shape will be and tries to infer it. In the code and/or documentation for apply_along_axis, we can see that Dask tries to infer the output shape and dtype by passing in the array [1] (related question). This is a problem, since bincount cannot accept such an argument. What we can do instead is provide shape and dtype to the method so that Dask doesn't have to infer them.

The problem here is that bincount's output shape depends on the maximum value of the input array. Unless you know it beforehand, you will sadly need to compute it. The whole operation therefore won't be fully lazy.

This is the full answer:

    import numpy as np
    import dask.array as da

    idx = da.random.random_integers(0, 1024, 1000)
    weight = da.random.random((2, 1000))

    def bincount(weights, x):
        return np.bincount(x, weights)

    m = idx.max().compute()
    # np.bincount returns max(x) + 1 bins, so the output length is m + 1
    da.apply_along_axis(bincount, 1, weight, idx, shape=(m + 1,), dtype=weight.dtype)

Appendix: randint vs random_integers. Be careful, because these are subtly different:

- randint takes integers from low (inclusive) to high (exclusive)
- random_integers takes integers from low (inclusive) to high (inclusive)

Thus you have to call randint with high + 1 to get the same range of values.
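
If the inclusive upper bound of the indices is known in advance (1024 in the question), the output length can be fixed without computing the maximum first, by passing `minlength` to `np.bincount`. The sketch below shows this in plain NumPy only; `bincount_fixed` and `high` are names chosen for illustration, and whether the same trick keeps the Dask version fully lazy is an assumption that is not tested here.

```python
import numpy as np

rng = np.random.default_rng(0)
high = 1024  # assumed known inclusive upper bound of the indices
idx = rng.integers(0, high + 1, size=1000)  # high is exclusive here, hence + 1
weight = rng.random((2, 1000))              # one row per weight column, as in the answer


def bincount_fixed(weights, x, n_bins):
    # minlength pads the result, so every row has exactly n_bins entries
    return np.bincount(x, weights=weights, minlength=n_bins)


result = np.apply_along_axis(bincount_fixed, 1, weight, idx, high + 1)
print(result.shape)
```

Each output row sums to the total weight of the corresponding input row, which is a quick sanity check on the binning.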

### Getting error as DataFrame.dtypes for data must be int, float, bool or categorical

The full error in XGBoost is:

    ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter `enable_categorical` must be set to `True`.Year

The data is:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 50327 entries, 0 to 50326
    Data columns (total 4 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   C_Id    50327 non-null  int8
     1   Year    50327 non-null  datetime64[ns]
     2   value   50327 non-null  float64
     3   R_Id    50327 non-null  int8
    dtypes: datetime64[ns](1), float64(1), int8(2)
    memory usage: 2.3 MB

Then I did:

    t_date = "2019-01-01 00:00:00"
    X_train = data[data["Year"]<t_date].drop(["value"],axis=1)
    Y_train = data[data["Year"]<t_date]["value"]
    X_test = data[data["Year"]>=t_date].drop(["value"],axis=1)

    model = XGBRegressor(
        max_depth = 8,n_estimators=1000, min_child_weight=300,colsample_bytree=0.8,
        subsample=0.8,eta=0.3,seed=42)
    model.fit(X_train,Y_train,eval_metric="rmse",eval_set=[(X_train,Y_train)],
        verbose =True,early_stopping_rounds=10)

Where am I going wrong? If you need anything, please ask. Thanks for helping!

EDIT: I converted the Year column to string and then to int, but the result looks like this:

    [461] validation_0-rmse:8791.25293
    [462] validation_0-rmse:8791.08789
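
The error message names the Year column, whose dtype is datetime64[ns]. One possible way to get an accepted dtype, sketched here with pandas only, is to split on the timestamp first and then replace the column with its integer year. The tiny frame below is made-up data with the same column names as the question, and this is an assumption about what the asker wants, not a tested XGBoost fix.

```python
import pandas as pd

# Made-up stand-in for the asker's frame, same columns and dtypes
data = pd.DataFrame({
    "C_Id": pd.Series([1, 2, 1], dtype="int8"),
    "Year": pd.to_datetime(["2017-01-01", "2018-01-01", "2019-01-01"]),
    "value": [10.0, 12.5, 11.0],
    "R_Id": pd.Series([3, 3, 4], dtype="int8"),
})

t_date = pd.Timestamp("2019-01-01")
train_mask = data["Year"] < t_date  # split while Year is still a datetime

features = data.drop(columns=["value"]).copy()
features["Year"] = features["Year"].dt.year  # datetime64[ns] -> integer year

X_train, X_test = features[train_mask], features[~train_mask]
y_train = data.loc[train_mask, "value"]
print(X_train.dtypes)
```

After this step every feature column has an integer dtype, which is among the types the error message lists as accepted.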

### RNN Language Model in PyTorch predicting the same three words repeatedly

I am attempting to create a word-level language model using an RNN in PyTorch. Whenever I train, the loss stays about the same for the whole training set, and when I try to sample a new sentence the same three words are predicted in the same order. For example, in my most recent attempt the RNN predicted 'the' then 'same' then 'of', and that sequence just kept repeating. I have tried changing how I've set up the RNN, including using LSTMs, GRUs, and different embeddings, but so far nothing has worked.

The way that I am training the RNN is by taking a sentence of 50 words and selecting a progressively larger part of the sentence, with the next word being the target. At the end of the sentence, I have an EOS tag. I am using text from Republic by Plato as my training set and embedding it using a PyTorch embedding layer. I am then feeding it into an LSTM and then a linear layer to get the right shape.

I am not sure if the problem is in the RNN, the data, the training or something else, so any help would be greatly appreciated. My end goal is simply to be able to generate a sentence. Thanks in advance!

Here is my RNN:

    class LanguageModel(nn.Module):
        """
        Class that defines the recurrent neural network.

        Methods
        -------
        forward(input, h, c)
            Forward propagation through the RNN.
        initHidden()
            Initializes the hidden and cell states.
        """
        def __init__(self, vocabSize, seqLen = 51, embeddingDim = 30, hiddenSize = 32, numLayers = 1, bid = False):
            """
            Initializes the class

            Parameters
            ----------
            seqLen : int, optional
                The length of the input sequence.
            embeddingDim : int, optional
                The dimension that the embedding dimension for the encoder should be.
            vocabSize : int
                The length of the vocab dictionary.
            hiddenSize : int, optional
                The size that the hidden state should be.
            numLayers : int, optional
                The number of LSTM Layers.
            bid : bool, optional
                Whether the RNN should be bidirectional or not.
            """
            super(LanguageModel, self).__init__()
            self.hiddenSize = hiddenSize
            self.numLayers = numLayers
            # Set value of numDirections based on whether or not the RNN is bidirectional.
            if bid == True:
                self.numDirections = 2
            else:
                self.numDirections = 1
            self.encoder = nn.Embedding(vocabSize, embeddingDim)
            self.LSTM = nn.LSTM(input_size = embeddingDim, hidden_size = hiddenSize, num_layers = numLayers, bidirectional = bid)
            self.decoder = nn.Linear(seqLen * self.numDirections * hiddenSize, vocabSize)

        def forward(self, input, h, c):
            """
            Forward propagates through the RNN

            Parameters
            ----------
            input : torch.Tensor
                Input to RNN. Should be formatted using makeInput() and padSeq().
            h : torch.Tensor
                Hidden state.
            c : torch.Tensor
                Cell state.

            Returns
            -------
            torch.Tensor
                Log probabilities for the predicted word from the RNN.
            """
            emb = self.encoder(input)
            emb.unsqueeze_(1) # Add in the batch dimension so the shape is right for the LSTM
            out, (h, c) = self.LSTM(emb, (h, c))
            out = out.view(1, -1) # Reshaping to fit into the loss function.
            out = self.decoder(out)
            logProbs = F.log_softmax(out, dim = 1)
            return logProbs

        def initHidden(self):
            """
            Initializes the hidden and cell states.

            Returns
            -------
            torch.Tensor
                Tensor containing the initial hidden state.
            torch.Tensor
                Tensor containing the initial cell state.
            """
            h = torch.zeros(self.numLayers * self.numDirections, 1, self.hiddenSize)
            c = torch.zeros(self.numLayers * self.numDirections, 1, self.hiddenSize)
            return h, c

Here is how I create my input and targets:

    def makeInput(sentence):
        """
        Prepares a sentence for input to the RNN.

        Parameters
        ----------
        sentence : list
            The sentence to be converted into input. Should be of form: [str]

        Returns
        -------
        torch.Tensor
            Tensor of the indices for each word in the input sentence.
        """
        sen = sentence[0].split() # Split the list into individual words
        sen.insert(0, 'START')
        input = [word2Idx[word] for word in sen] # Iterate over the words in sentence and convert to indices
        return torch.tensor(input)

    def makeTarget(sentence):
        """
        Prepares a sentence to be a target.

        Parameters
        ----------
        sentence : str
            The sentence to be made into a target. Should be of form: [str]

        Returns
        -------
        torch.Tensor
            Tensor of the indices for the target phrase including the <EOS> tag.
        """
        sen = sentence[0].split() # Split the list into individual words
        sen.append('EOS')
        target = [word2Idx[word] for word in sen]
        target = torch.tensor(target, dtype = torch.long)
        return target.unsqueeze_(-1) # Removing dimension for loss function

    def padSeq(seq, refSeq):
        """
        Pads a sequence to be the same shape as another sequence.

        Parameters
        ----------
        seq : torch.Tensor
            The sequence to pad.
        refSeq : torch.Tensor
            The reference sequence. seq will be padded to be the same shape as refSeq.

        Returns
        -------
        torch.Tensor
            Tensor containing the padded sequence.
        """
        padded = pad_sequence([refSeq, seq])
        tmp = torch.t(padded) # Transpose the padded sequence for easier indexing on return
        return tmp[1] # Return only the padded seq not both sequences

And here is my training loop:

    def train():
        """
        Trains the model.
        """
        start = time.time()
        for i, data in enumerate(trainLoader):
            inputTensor = makeInput(data)
            targetTensor = makeTarget(data)
            targetTensor = targetTensor.to(device)
            h, c = model.initHidden()
            h = h.to(device)
            c = c.to(device)
            optimizer.zero_grad()
            loss = 0
            for x in range(inputTensor.size(0)): # Iterate over all of the words in the input sentence
                """ Preparing input for the rnn """
                input = inputTensor[: x + 1] # We only want part of the input so the RNN can learn on predicting the next words
                input = padSeq(input, inputTensor)
                input = input.to(device)
                out = model(input, h, c)
                l = criterion(out, targetTensor[x])
                loss += l
            loss.backward()
            optimizer.step()
            if i % 250 == 0: # Print updates to the models loss every 10 iters.
                print('[{}] Epoch: {} -> {}'.format(timeSince(start), i, loss / inputTensor.size(0)))

### How can I change the max sequence length in a Tensorflow RNN Model?

I am currently trying to adapt my TensorFlow classifier, which tags a sequence of words as positive or negative, to handle much longer sequences without retraining. My model is an RNN with a max sequence length of 210. One input is one word (300 dim); I vectorised the words with Google's word2vec, so I am able to feed a sequence of at most 210 words. Now my question is: how can I change the max sequence length to, for example, 3000 for classifying movie reviews?

My working model with a fixed max sequence length of 210 (tf_version: 1.1.0):

    n_chunks = 210
    chunk_size = 300

    x = tf.placeholder("float",[None,n_chunks,chunk_size])
    y = tf.placeholder("float",None)
    seq_length = tf.placeholder("int64",None)

    with tf.variable_scope("rnn1"):
        lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size, state_is_tuple=True)
        lstm_cell = tf.contrib.rnn.DropoutWrapper (lstm_cell, input_keep_prob=0.8)
        outputs, _ = tf.nn.dynamic_rnn(lstm_cell,x,dtype=tf.float32, sequence_length = self.seq_length)

    fc = tf.contrib.layers.fully_connected(outputs, 1000, activation_fn=tf.nn.relu)
    output = tf.contrib.layers.flatten(fc)
    #*1
    logits = tf.contrib.layers.fully_connected(output, self.n_classes, activation_fn=None)
    cost = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits (logits=logits, labels=y) )
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)
    ...
    #train
    #train_x padded to fit(batch_size*n_chunks*chunk_size)
    sess.run([optimizer, cost], feed_dict={x:train_x, y:train_y, seq_length:seq_length})
    #predict:
    ...
    pred = tf.nn.softmax(logits)
    pred = sess.run(pred,feed_dict={x:word_vecs, seq_length:sq_l})

What modifications I already tried:

1 Replacing n_chunks with None and simply feeding the data in:

    x = tf.placeholder(tf.float32, [None,None,300])
    #model fails to build
    #ValueError: The last dimension of the inputs to `Dense` should be defined.
    #Found `None`.
    # at *1
    ...
    #all entrys in word_vecs still have got the same length for example
    #3000(batch_size*3000(!= n_chunks)*300)
    pred = tf.nn.softmax(logits)
    pred = sess.run(pred,feed_dict={x:word_vecs, seq_length:sq_l})

2 Changing x and then restoring the old model:

    x = tf.placeholder(tf.float32, [None,n_chunks*10,chunk_size])
    ...
    saver = tf.train.Saver(tf.all_variables(), reshape=True)
    saver.restore(sess,"...")
    #fails as well:
    #InvalidArgumentError (see above for traceback): Input to reshape is a
    #tensor with 420000 values, but the requested shape has 840000
    #[[Node: save/Reshape_5 = Reshape[T=DT_FLOAT, Tshape=DT_INT32,
    #_device="/job:localhost/replica:0/task:0/cpu:0"](save/RestoreV2_5,
    #save/Reshape_5/shape)]]
    # run prediction

If it is possible, could you please provide a working example, or explain why it isn't?

I am just wondering why you don't simply assign n_chunks a value of 3000. In your first attempt, you cannot leave both the batch and the sequence dimension as None: the flattened input of the last fully connected layer then has no defined size, which is the error you see at *1. The first dimension can be None only because it depends on the batch size. In your second attempt, you changed x in just one place, and the other places where n_chunks is used may conflict with the new x placeholder.

### How to train a RNN with LSTM cells for time series prediction

I'm currently trying to build a simple model for predicting time series. The goal would be to train the model with a sequence so that the model is able to predict future values. I'm using tensorflow and lstm cells to do so. The model is trained with truncated backpropagation through time. My question is how to structure the data for training. For example, let's assume we want to learn the given sequence:

    [1,2,3,4,5,6,7,8,9,10,11,...]

And we unroll the network for num_steps=4.

Option 1

    input data     label
    1,2,3,4        2,3,4,5
    5,6,7,8        6,7,8,9
    9,10,11,12     10,11,12,13
    ...

Option 2

    input data     label
    1,2,3,4        2,3,4,5
    2,3,4,5        3,4,5,6
    3,4,5,6        4,5,6,7
    ...

Option 3

    input data     label
    1,2,3,4        5
    2,3,4,5        6
    3,4,5,6        7
    ...

Option 4

    input data     label
    1,2,3,4        5
    5,6,7,8        9
    9,10,11,12     13
    ...

Any help would be appreciated.

I'm just about to learn LSTMs in TensorFlow and am trying to implement an example which (luckily) tries to predict a time series / number series generated by a simple math function. But I'm using a different way to structure the data for training, motivated by Unsupervised Learning of Video Representations using LSTMs (LSTM Future Predictor Model).

Option 5:

    input data     label
    1,2,3,4        5,6,7,8
    2,3,4,5        6,7,8,9
    3,4,5,6        7,8,9,10
    ...

Besides this paper, I tried to take inspiration from the given TensorFlow RNN examples. My current complete solution looks like this:

    import math
    import random
    import numpy as np
    import tensorflow as tf

    LSTM_SIZE = 64
    LSTM_LAYERS = 2
    BATCH_SIZE = 16
    NUM_T_STEPS = 4
    MAX_STEPS = 1000
    LAMBDA_REG = 5e-4


    def ground_truth_func(i, j, t):
        return i * math.pow(t, 2) + j


    def get_batch(batch_size):
        seq = np.zeros([batch_size, NUM_T_STEPS, 1], dtype=np.float32)
        tgt = np.zeros([batch_size, NUM_T_STEPS], dtype=np.float32)

        for b in xrange(batch_size):
            i = float(random.randint(-25, 25))
            j = float(random.randint(-100, 100))
            for t in xrange(NUM_T_STEPS):
                value = ground_truth_func(i, j, t)
                seq[b, t, 0] = value

            for t in xrange(NUM_T_STEPS):
                tgt[b, t] = ground_truth_func(i, j, t + NUM_T_STEPS)
        return seq, tgt


    # Placeholder for the inputs in a given iteration
    sequence = tf.placeholder(tf.float32, [BATCH_SIZE, NUM_T_STEPS, 1])
    target = tf.placeholder(tf.float32, [BATCH_SIZE, NUM_T_STEPS])

    fc1_weight = tf.get_variable('w1', [LSTM_SIZE, 1], initializer=tf.random_normal_initializer(mean=0.0, stddev=1.0))
    fc1_bias = tf.get_variable('b1', [1], initializer=tf.constant_initializer(0.1))

    # ENCODER
    with tf.variable_scope('ENC_LSTM'):
        lstm = tf.nn.rnn_cell.LSTMCell(LSTM_SIZE)
        multi_lstm = tf.nn.rnn_cell.MultiRNNCell([lstm] * LSTM_LAYERS)
        initial_state = multi_lstm.zero_state(BATCH_SIZE, tf.float32)
        state = initial_state
        for t_step in xrange(NUM_T_STEPS):
            if t_step > 0:
                tf.get_variable_scope().reuse_variables()

            # state value is updated after processing each batch of sequences
            output, state = multi_lstm(sequence[:, t_step, :], state)

    learned_representation = state

    # DECODER
    with tf.variable_scope('DEC_LSTM'):
        lstm = tf.nn.rnn_cell.LSTMCell(LSTM_SIZE)
        multi_lstm = tf.nn.rnn_cell.MultiRNNCell([lstm] * LSTM_LAYERS)
        state = learned_representation
        logits_stacked = None
        loss = 0.0
        for t_step in xrange(NUM_T_STEPS):
            if t_step > 0:
                tf.get_variable_scope().reuse_variables()

            # state value is updated after processing each batch of sequences
            output, state = multi_lstm(sequence[:, t_step, :], state)
            # output can be used to make next number prediction
            logits = tf.matmul(output, fc1_weight) + fc1_bias

            if logits_stacked is None:
                logits_stacked = logits
            else:
                logits_stacked = tf.concat(1, [logits_stacked, logits])

            loss += tf.reduce_sum(tf.square(logits - target[:, t_step])) / BATCH_SIZE

    reg_loss = loss + LAMBDA_REG * (tf.nn.l2_loss(fc1_weight) + tf.nn.l2_loss(fc1_bias))

    train = tf.train.AdamOptimizer().minimize(reg_loss)

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())

        total_loss = 0.0
        for step in xrange(MAX_STEPS):
            seq_batch, target_batch = get_batch(BATCH_SIZE)

            feed = {sequence: seq_batch, target: target_batch}
            _, current_loss = sess.run([train, reg_loss], feed)
            if step % 10 == 0:
                print("#{}: {}".format(step, current_loss))
            total_loss += current_loss

        print('Total loss:', total_loss)

        print('### SIMPLE EVAL: ###')
        seq_batch, target_batch = get_batch(BATCH_SIZE)
        feed = {sequence: seq_batch, target: target_batch}
        prediction = sess.run([logits_stacked], feed)
        for b in xrange(BATCH_SIZE):
            print("{} -> {})".format(str(seq_batch[b, :, 0]), target_batch[b, :]))
            print(" `-> Prediction: {}".format(prediction[0][b]))

Sample output of this looks like this:

    ### SIMPLE EVAL: ###
    # [input seq] -> [target prediction]
    # `-> Prediction: [model prediction]
    [ 33. 53. 113. 213.] -> [ 353. 533. 753. 1013.])
     `-> Prediction: [ 19.74548721 28.3149128 33.11489105 35.06603241]
    [ -17. -32. -77. -152.] -> [-257. -392. -557. -752.])
     `-> Prediction: [-16.38951683 -24.3657589 -29.49801064 -31.58583832]
    [ -7. -4. 5. 20.] -> [ 41. 68. 101. 140.])
     `-> Prediction: [ 14.14126873 22.74848557 31.29668617 36.73633194]
    ...

The model is an LSTM autoencoder with 2 layers each. Unfortunately, as you can see in the results, this model does not learn the sequence properly. It might be the case that I'm just making a bad mistake somewhere, or that 1000-10000 training steps is just way too few for an LSTM. As I said, I'm also just starting to understand and use LSTMs properly. But hopefully this can give you some inspiration regarding the implementation.

After reading several LSTM introduction blogs, e.g. Jakob Aungiers', option 3 seems to be the right one for a stateless LSTM. If your LSTMs need to remember data from further back than your num_steps, you can train in a stateful way; for a Keras example see Philippe Remy's blog post "Stateful LSTM in Keras". Philippe does not show an example for a batch size greater than one, however. I guess that in your case a batch size of four with a stateful LSTM could be used with the following data (written as input -> label):

    batch #0:
    1,2,3,4 -> 5
    2,3,4,5 -> 6
    3,4,5,6 -> 7
    4,5,6,7 -> 8

    batch #1:
    5,6,7,8 -> 9
    6,7,8,9 -> 10
    7,8,9,10 -> 11
    8,9,10,11 -> 12

    batch #2:
    9,10,11,12 -> 13
    ...

This way, the state of e.g. the 2nd sample in batch #0 is correctly reused to continue training with the 2nd sample of batch #1. This is somewhat similar to your option 4, except that option 4 does not use all available labels.

Update: As an extension of my suggestion, where batch_size equals num_steps, Alexis Huet gives an answer for the case of batch_size being a divisor of num_steps, which can be used for larger num_steps. He describes it nicely on his blog.
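
A small NumPy sketch of how the Option 3 windows and the stateful batch layout described above could be generated. The function name `sliding_windows` and the 13-element toy sequence are illustrative choices, and the code only builds the arrays; it does not train anything.

```python
import numpy as np


def sliding_windows(seq, num_steps):
    """Option 3: every window of num_steps values, labelled with the value that follows it."""
    seq = np.asarray(seq)
    n = len(seq) - num_steps
    inputs = np.stack([seq[i:i + num_steps] for i in range(n)])
    labels = seq[num_steps:]
    return inputs, labels


seq = np.arange(1, 14)  # 1, 2, ..., 13
X, y = sliding_windows(seq, num_steps=4)

# Stateful layout: batch_size == num_steps, so sample k of each batch
# starts exactly where sample k of the previous batch stopped.
batch_size = 4
n_batches = len(X) // batch_size
batches = [
    (X[b * batch_size:(b + 1) * batch_size], y[b * batch_size:(b + 1) * batch_size])
    for b in range(n_batches)
]

for b, (xb, yb) in enumerate(batches):
    print(f"batch #{b}:")
    for row, label in zip(xb, yb):
        print("  ", row, "->", label)
```

Windows that do not fill a complete batch are dropped here; whether to pad or discard them is a separate design decision.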

I believe Option 1 is closest to the reference implementation in /tensorflow/models/rnn/ptb/reader.py

    def ptb_iterator(raw_data, batch_size, num_steps):
        """Iterate on the raw PTB data.

        This generates batch_size pointers into the raw PTB data, and allows
        minibatch iteration along these pointers.

        Args:
            raw_data: one of the raw data outputs from ptb_raw_data.
            batch_size: int, the batch size.
            num_steps: int, the number of unrolls.

        Yields:
            Pairs of the batched data, each a matrix of shape [batch_size, num_steps].
            The second element of the tuple is the same data time-shifted to the
            right by one.

        Raises:
            ValueError: if batch_size or num_steps are too high.
        """
        raw_data = np.array(raw_data, dtype=np.int32)

        data_len = len(raw_data)
        batch_len = data_len // batch_size
        data = np.zeros([batch_size, batch_len], dtype=np.int32)
        for i in range(batch_size):
            data[i] = raw_data[batch_len * i:batch_len * (i + 1)]

        epoch_size = (batch_len - 1) // num_steps

        if epoch_size == 0:
            raise ValueError("epoch_size == 0, decrease batch_size or num_steps")

        for i in range(epoch_size):
            x = data[:, i*num_steps:(i+1)*num_steps]
            y = data[:, i*num_steps+1:(i+1)*num_steps+1]
            yield (x, y)

However, another option is to select a pointer into your data array randomly for each training sequence.
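
The random-pointer alternative mentioned last could look roughly like the sketch below. `random_batch` is a hypothetical helper, not part of the PTB reader; it draws a random start position per training sequence and returns the same input/shifted-by-one target pairing that `ptb_iterator` yields.

```python
import numpy as np


def random_batch(raw_data, batch_size, num_steps, rng):
    """Hypothetical helper: sample batch_size windows at random start positions."""
    raw_data = np.asarray(raw_data)
    # the last valid start leaves room for num_steps inputs plus one shifted target
    starts = rng.integers(0, len(raw_data) - num_steps, size=batch_size)
    x = np.stack([raw_data[s:s + num_steps] for s in starts])
    y = np.stack([raw_data[s + 1:s + num_steps + 1] for s in starts])
    return x, y


rng = np.random.default_rng(0)
x, y = random_batch(np.arange(1, 101), batch_size=3, num_steps=4, rng=rng)
print(x)
print(y)
```

Because the windows are no longer contiguous across batches, this layout only makes sense when the LSTM state is reset for every batch.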