## How does this binary encoder function work? - binary-data

### Map Dask bincount over 2d array columns

```I am trying to use bincount over a 2D array. Specifically I have this code:
import numpy as np
import dask.array as da
def dask_bincount(weights, x):
    da.bincount(x, weights)
idx = da.random.random_integers(0, 1024, 1000)
weight = da.random.random((1000, 2))
bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)
The idea is that bincount can be run with the same idx array on each of the weight columns. That would return an array of shape (np.amax(idx) + 1, 2), if I am correct.
However when doing this I get this error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
----> 1 bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)
~/.local/lib/python3.9/site-packages/dask/array/routines.py in apply_along_axis(func1d, axis, arr, dtype, shape, *args, **kwargs)
454 if shape is None or dtype is None:
455 test_data = np.ones((1,), dtype=arr.dtype)
--> 456 test_result = np.array(func1d(test_data, *args, **kwargs))
457 if shape is None:
458 shape = test_result.shape
----> 2 da.bincount(x, weights)
~/.local/lib/python3.9/site-packages/dask/array/routines.py in bincount(x, weights, minlength, split_every)
670 raise ValueError("Input array must be one dimensional. Try using x.ravel()")
671 if weights is not None:
--> 672 if weights.chunks != x.chunks:
673 raise ValueError("Chunks of input array x and weights must match.")
674
AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
I thought that when Dask arrays are created the library automatically assigns them chunks, so the error message does not tell me much. How can I fix this?
I wrote a script that does it in NumPy with map.
idx_np = np.random.randint(0, 1024, 1000)
weight_np = np.random.random((1000,2))
f = lambda y: np.bincount(idx_np, weight_np[:,y])
result = map(f, [i for i in range(2)])
np.array(list(result))
array([[0.9885341 , 0.9977873 , 0.24937023, ..., 0.31024526, 1.40754883,
0.87609759],
[1.77406303, 0.84787723, 0.14591474, ..., 0.54584068, 0.38357015,
0.85202672]])
I would like to do the same, but with Dask.
```
```There are multiple problems at play.
Weights should be (2, 1000)
You discover this by trying to write the same function in numpy using apply_along_axis.
idx_np = np.random.random_integers(0, 1024, 1000)
weight_np = np.random.random((2, 1000)) # <- transposed
# This gives the same result as the code you provided
np.apply_along_axis(lambda weight, idx: np.bincount(idx, weight), 1, weight_np, idx_np)
da.apply_along_axis applies the function to numpy arrays
You're getting the error
AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
This suggests that what makes it into the da.bincount method is actually a numpy array. The fact is that da.apply_along_axis actually takes each row of weight and sends it to the function as a numpy array.
Your function should therefore actually be a NumPy function:
def bincount(weights, x):
    return np.bincount(x, weights)
However, if you try this, you will still get the same error. I believe that happens for an entirely different reason, though:
Dask doesn't know what the output shape will be and tries to infer it
In the code and/or documentation for apply_along_axis, we can see that Dask tries to infer the output shape and dtype by passing in the array [1] (related question). This is a problem, since bincount cannot accept such an argument: a length-1 weights array does not match the length of idx.
What we can do instead is provide shape and dtype to the method so that Dask doesn't have to infer it.
The problem here is that bincount's output shape depends on the maximum value of the input array. Unless you know it beforehand, you will sadly need to compute it. The whole operation therefore won't be fully lazy.
import numpy as np
import dask.array as da

idx = da.random.random_integers(0, 1024, 1000)
weight = da.random.random((2, 1000))

def bincount(weights, x):
    return np.bincount(x, weights)

# np.bincount returns max(x) + 1 bins, so the declared shape needs the + 1
m = idx.max().compute()
da.apply_along_axis(bincount, 1, weight, idx, shape=(m + 1,), dtype=weight.dtype)
Appendix: randint vs random_integers
Be careful, because these are subtly different:
randint draws integers from low (inclusive) to high (exclusive)
random_integers draws integers from low (inclusive) to high (inclusive)
Thus you have to call randint with high + 1 to cover the same range of values.```
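For reference, here is a minimal NumPy-only sketch of the column-wise weighted bincount discussed above. The names `n_bins` and `bincount_columns` are assumptions made for this illustration. It assumes the number of bins is known up front, so `minlength` can fix the output length without computing the maximum index first. It draws the indices with `Generator.integers` and `endpoint=True`, which gives the inclusive upper bound that `random_integers` provided.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 1025  # assumption: indices are known to lie in [0, 1024]

# endpoint=True makes the upper bound inclusive, like the old random_integers
idx = rng.integers(0, 1024, size=1000, endpoint=True)
weight = rng.random((1000, 2))


def bincount_columns(idx, weight, n_bins):
    """Weighted bincount of idx, applied separately to each column of weight."""
    columns = [
        np.bincount(idx, weights=weight[:, j], minlength=n_bins)
        for j in range(weight.shape[1])
    ]
    return np.stack(columns, axis=1)  # shape (n_bins, n_columns)


result = bincount_columns(idx, weight, n_bins)
print(result.shape)
```

Since the traceback shows that `da.bincount` also takes `minlength`, the same per-column loop might carry over to Dask, provided `idx` and each weight column share the same chunks. That part is untested here.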

### Getting error as DataFrame.dtypes for data must be int, float, bool or categorical

```Full error in XGBOOST is
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When
categorical type is supplied, DMatrix parameter
`enable_categorical` must be set to `True`.Year
The data is
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50327 entries, 0 to 50326
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 C_Id 50327 non-null int8
1 Year 50327 non-null datetime64[ns]
2 value 50327 non-null float64
3 R_Id 50327 non-null int8
dtypes: datetime64[ns](1), float64(1), int8(2)
memory usage: 2.3 MB
Then I did,
t_date = "2019-01-01 00:00:00"
X_train = data[data["Year"]<t_date].drop(["value"],axis=1)
Y_train = data[data["Year"]<t_date]["value"]
X_test = data[data["Year"]>=t_date].drop(["value"],axis=1)
model = XGBRegressor(
    max_depth=8, n_estimators=1000,
    min_child_weight=300, colsample_bytree=0.8,
    subsample=0.8, eta=0.3, seed=42)
model.fit(X_train, Y_train, eval_metric="rmse", eval_set=[(X_train, Y_train)],
          verbose=True, early_stopping_rounds=10)
Where am I going wrong? If you need any more information, please ask.
Thanks for helping!
EDIT:
I converted the Year column to string and then to int,
but the result looks like this:
[461] validation_0-rmse:8791.25293
[462] validation_0-rmse:8791.08789```
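The dtype listing above shows `Year` as `datetime64[ns]`, which is not among the types the error message accepts. Below is a minimal sketch of one possible way around this, using plain NumPy arrays with made-up values: split on the timestamp first, then give the model an integer year instead of the raw datetime. The array names and values are hypothetical, and this is only one of several reasonable encodings.

```python
import numpy as np

# Hypothetical stand-in for the Year column (datetime64[ns], as in the listing)
year = np.array(
    ["2017-01-01", "2018-01-01", "2019-01-01", "2020-01-01"],
    dtype="datetime64[ns]",
)
value = np.array([10.0, 12.0, 11.0, 13.0])

# Split on the timestamp while it is still a datetime
t_date = np.datetime64("2019-01-01T00:00:00")
train_mask = year < t_date

# Truncate to year precision, then convert to a plain integer feature
year_int = year.astype("datetime64[Y]").astype(np.int64) + 1970

X_train = year_int[train_mask].reshape(-1, 1)
Y_train = value[train_mask]
X_test = year_int[~train_mask].reshape(-1, 1)

print(X_train.ravel(), X_test.ravel())
```

On a DataFrame, `data["Year"].dt.year` yields the same integer column.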

### RNN Language Model in PyTorch predicting the same three words repeatedly

```I am attempting to create a word-level language model using an RNN in PyTorch. During training, the loss stays about the same across the whole training set, and when I try to sample a new sentence the same three words are predicted in the same order. For example, in my most recent attempt the RNN predicted 'the', then 'same', then 'of', and that sequence just kept repeating. I have tried changing how I've set up the RNN, including using LSTMs, GRUs, and different embeddings, but so far nothing has worked.
The way that I am training the RNN is by taking a sentence of 50 words and selecting a progressively larger part of the sentence, with the next word being the target. At the end of the sentence, I have an EOS tag. I am using text from Republic by Plato as my training set and embedding it using a PyTorch embedding layer. I am then feeding it into an LSTM and then a linear layer to get the right shape. I am not sure if the problem is in the RNN, the data, the training, or something else, so any help would be greatly appreciated.
If anyone has any experience in NLP or in language modeling, I would greatly appreciate any help you could offer with fixing this problem. My end goal is simply to be able to generate a sentence. Thanks in advance!
Here is my RNN
class LanguageModel(nn.Module):
    """
    Class that defines the recurrent neural network.
    Methods
    -------
    forward(input, h, c)
        Forward propagation through the RNN.
    initHidden()
        Initializes the hidden and cell states.
    """
    def __init__(self, vocabSize, seqLen = 51, embeddingDim = 30, hiddenSize = 32, numLayers = 1, bid = False):
        """
        Initializes the class
        Parameters
        ----------
        seqLen : int, optional
            The length of the input sequence.
        embeddingDim : int, optional
            The dimension that the embedding dimension for the encoder should be.
        vocabSize : int
            The length of the vocab dictionary.
        hiddenSize : int, optional
            The size that the hidden state should be.
        numLayers : int, optional
            The number of LSTM Layers.
        bid : bool, optional
            Whether the RNN should be bidirectional or not.
        """
        super(LanguageModel, self).__init__()
        self.hiddenSize = hiddenSize
        self.numLayers = numLayers
        # Set value of numDirections based on whether or not the RNN is bidirectional.
        if bid == True:
            self.numDirections = 2
        else:
            self.numDirections = 1
        self.encoder = nn.Embedding(vocabSize, embeddingDim)
        self.LSTM = nn.LSTM(input_size = embeddingDim, hidden_size = hiddenSize, num_layers = numLayers, bidirectional = bid)
        self.decoder = nn.Linear(seqLen * self.numDirections * hiddenSize, vocabSize)
    def forward(self, input, h, c):
        """
        Forward propagates through the RNN
        Parameters
        ----------
        input : torch.Tensor
            Input to RNN. Should be formatted using makeInput() and padSeq().
        h : torch.Tensor
            Hidden state.
        c : torch.Tensor
            Cell state.
        Returns
        -------
        torch.Tensor
            Log probabilities for the predicted word from the RNN.
        """
        emb = self.encoder(input)
        emb.unsqueeze_(1) # Add in the batch dimension so the shape is right for the LSTM
        out, (h, c) = self.LSTM(emb, (h, c))
        out = out.view(1, -1) # Reshaping to fit into the loss function.
        out = self.decoder(out)
        logProbs = F.log_softmax(out, dim = 1)
        return logProbs
    def initHidden(self):
        """
        Initializes the hidden and cell states.
        Returns
        -------
        torch.Tensor
            Tensor containing the initial hidden state.
        torch.Tensor
            Tensor containing the initial cell state.
        """
        h = torch.zeros(self.numLayers * self.numDirections, 1, self.hiddenSize)
        c = torch.zeros(self.numLayers * self.numDirections, 1, self.hiddenSize)
        return h, c
Here is how I create my input and targets:
def makeInput(sentence):
    """
    Prepares a sentence for input to the RNN.
    Parameters
    ----------
    sentence : list
        The sentence to be converted into input. Should be of form: [str]
    Returns
    -------
    torch.Tensor
        Tensor of the indices for each word in the input sentence.
    """
    sen = sentence[0].split() # Split the list into individual words
    sen.insert(0, 'START')
    input = [word2Idx[word] for word in sen] # Iterate over the words in sentence and convert to indices
    return torch.tensor(input, dtype = torch.long)
def makeTarget(sentence):
    """
    Prepares a sentence to be a target.
    Parameters
    ----------
    sentence : list
        The sentence to be made into a target. Should be of form: [str]
    Returns
    -------
    torch.Tensor
        Tensor of the indices for the target phrase including the <EOS> tag.
    """
    sen = sentence[0].split() # Split the list into individual words
    sen.append('EOS')
    target = [word2Idx[word] for word in sen]
    target = torch.tensor(target, dtype = torch.long)
    return target.unsqueeze_(-1) # Add a trailing dimension so each target has shape (1,) for the loss function
def padSeq(seq, refSeq):
    """
    Pads a sequence to be the same shape as another sequence.
    Parameters
    ----------
    seq : torch.Tensor
    refSeq : torch.Tensor
        The reference sequence. seq will be padded to be the same shape as refSeq.
    Returns
    -------
    torch.Tensor
    """
    tmp = torch.t(padded) # Transpose the padded sequence for easier indexing on return
    return tmp[1] # Return only the padded seq not both sequences
And here is my training loop:
def train():
    """
    Trains the model.
    """
    start = time.time()
    inputTensor = makeInput(data)
    targetTensor = makeTarget(data)
    targetTensor = targetTensor.to(device)
    h, c = model.initHidden()
    h = h.to(device)
    c = c.to(device)
    loss = 0
    for x in range(inputTensor.size(0)): # Iterate over all of the words in the input sentence
        """ Preparing input for the rnn """
        input = inputTensor[: x + 1] # We only want part of the input so the RNN can learn on predicting the next words
        input = input.to(device)
        out = model(input, h, c)
        l = criterion(out, targetTensor[x])
        loss += l
    loss.backward()
    optimizer.step()
    if i % 250 == 0: # Print updates to the model's loss every 250 iterations.
        print('[{}] Epoch: {} -> {}'.format(timeSince(start), i, loss / inputTensor.size(0)))```
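To make the training scheme described in the question concrete (a progressively longer prefix, with the next word as the target), here is a small pure-Python sketch. The helper name `make_prefix_pairs` and the example sentence are made up for illustration; it mirrors the `START`/`EOS` handling of `makeInput` and `makeTarget` but works on plain strings rather than tensors.

```python
def make_prefix_pairs(sentence, start_token="START", eos_token="EOS"):
    """Return (prefix, next_word) pairs for one whitespace-tokenized sentence."""
    words = sentence.split()
    inputs = [start_token] + words   # what the model reads
    targets = words + [eos_token]    # what it should predict at each step
    return [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]


pairs = make_prefix_pairs("the cat sat")
for prefix, target in pairs:
    print(prefix, "->", target)
```

Printing a handful of such pairs next to the tensors produced by `makeInput` and `makeTarget` is a cheap way to confirm that inputs and targets line up as intended.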

### How can I change the max sequence length in a Tensorflow RNN Model?

```I am currently trying to adapt my TensorFlow classifier, which tags a sequence of words as positive or negative, to handle much longer sequences without retraining. My model is an RNN with a max sequence length of 210. Each input is one word (300 dimensions); I vectorised the words with Google's word2vec, so I can feed a sequence of at most 210 words. My question is: how can I change the max sequence length to, for example, 3000, in order to classify movie reviews?
My working model with a fixed max sequence length of 210 (tf_version: 1.1.0):
n_chunks = 210
chunk_size = 300
x = tf.placeholder("float",[None,n_chunks,chunk_size])
y = tf.placeholder("float",None)
seq_length = tf.placeholder("int64",None)
with tf.variable_scope("rnn1"):
    lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                        state_is_tuple=True)
    lstm_cell = tf.contrib.rnn.DropoutWrapper(lstm_cell,
                                              input_keep_prob=0.8)
    outputs, _ = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32,
                                   sequence_length=self.seq_length)
fc = tf.contrib.layers.fully_connected(outputs, 1000,
                                       activation_fn=tf.nn.relu)
output = tf.contrib.layers.flatten(fc)
#*1
logits = tf.contrib.layers.fully_connected(output, self.n_classes,
                                           activation_fn=None)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=y))
...
#train
sess.run([optimizer, cost], feed_dict={x:train_x, y:train_y,
seq_length:seq_length})
#predict:
...
pred = tf.nn.softmax(logits)
pred = sess.run(pred,feed_dict={x:word_vecs, seq_length:sq_l})
1. Replacing n_chunks with None and simply feeding the data in:
x = tf.placeholder(tf.float32, [None,None,300])
#model fails to build
#ValueError: The last dimension of the inputs to `Dense` should be defined.
#Found `None`.
# at *1
...
#all entries in word_vecs still have the same length, for example
#3000 (batch_size * 3000 (!= n_chunks) * 300)
pred = tf.nn.softmax(logits)
pred = sess.run(pred,feed_dict={x:word_vecs, seq_length:sq_l})
2. Changing x and then restoring the old model:
x = tf.placeholder(tf.float32, [None,n_chunks*10,chunk_size])
...
saver = tf.train.Saver(tf.all_variables(), reshape=True)
saver.restore(sess,"...")
#fails as well:
#InvalidArgumentError (see above for traceback): Input to reshape is a
#tensor with 420000 values, but the requested shape has 840000
#[[Node: save/Reshape_5 = Reshape[T=DT_FLOAT, Tshape=DT_INT32,
#save/Reshape_5/shape)]]
# run prediction
If this is possible, could you please provide a working example, or otherwise explain why it isn't?
```
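```Why not simply set n_chunks to 3000?
In your first attempt, the failure comes from the flatten step rather than from the placeholder itself: with the time dimension set to None, the flattened size (n_chunks * 1000) is unknown, so the final fully connected layer cannot determine the shape of its weight matrix, which is what the `Dense` error at *1 reports. The first dimension can stay None because no weight shape depends on the batch size. In your second attempt, you changed only the x placeholder; everything else that depends on n_chunks, including the saved weights of that final layer, still assumes 210 steps and so conflicts with the new placeholder.```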
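The dependence on sequence length can be illustrated with a few lines of NumPy. This is only a sketch with made-up data: `n_classes = 2` is an assumption (positive/negative), and the helper names are hypothetical. It shows that a dense layer placed after flattening needs a weight matrix whose size grows with the number of steps. A head that first averages over the valid time steps does not have this problem. Mean pooling is a swapped-in alternative here, not what the question's model does, and switching to it would require retraining.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 1000  # per-step output size of the fc layer in the question
n_classes = 2      # assumption: positive / negative


def flatten_head_weight_shape(n_steps):
    """Weight shape a dense layer needs after flattening (n_steps, n_features)."""
    return (n_steps * n_features, n_classes)


def pooled_logits(step_outputs, seq_len, w, b):
    """Average the valid time steps, then apply one dense layer."""
    pooled = step_outputs[:seq_len].mean(axis=0)  # shape (n_features,)
    return pooled @ w + b


w = rng.normal(size=(n_features, n_classes))
b = np.zeros(n_classes)

short_seq = rng.normal(size=(210, n_features))
long_seq = rng.normal(size=(3000, n_features))

# The flatten head changes shape with the sequence length ...
print(flatten_head_weight_shape(210), flatten_head_weight_shape(3000))
# ... while the pooled head reuses the same w and b for both lengths.
print(pooled_logits(short_seq, 210, w, b).shape, pooled_logits(long_seq, 3000, w, b).shape)
```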

### How to train a RNN with LSTM cells for time series prediction

```I'm currently trying to build a simple model for predicting time series. The goal would be to train the model with a sequence so that the model is able to predict future values.
I'm using tensorflow and lstm cells to do so. The model is trained with truncated backpropagation through time. My question is how to structure the data for training.
For example let's assume we want to learn the given sequence:
[1,2,3,4,5,6,7,8,9,10,11,...]
And we unroll the network for num_steps=4.
Option 1
input data      label
1,2,3,4         2,3,4,5
5,6,7,8         6,7,8,9
9,10,11,12      10,11,12,13
...
Option 2
input data      label
1,2,3,4         2,3,4,5
2,3,4,5         3,4,5,6
3,4,5,6         4,5,6,7
...
Option 3
input data      label
1,2,3,4         5
2,3,4,5         6
3,4,5,6         7
...
Option 4
input data      label
1,2,3,4         5
5,6,7,8         9
9,10,11,12      13
...
Any help would be appreciated.
```
```I'm just starting to learn LSTMs in TensorFlow and am trying to implement an example which (luckily) tries to predict a time series / number series generated by a simple math function.
But I'm using a different way to structure the data for training, motivated by the paper "Unsupervised Learning of Video Representations using LSTMs":
LSTM Future Predictor Model
Option 5:
input data      label
1,2,3,4         5,6,7,8
2,3,4,5         6,7,8,9
3,4,5,6         7,8,9,10
...
Besides this paper, I tried to take inspiration from the given TensorFlow RNN examples. My current complete solution looks like this:
import math
import random
import numpy as np
import tensorflow as tf
LSTM_SIZE = 64
LSTM_LAYERS = 2
BATCH_SIZE = 16
NUM_T_STEPS = 4
MAX_STEPS = 1000
LAMBDA_REG = 5e-4
def ground_truth_func(i, j, t):
    return i * math.pow(t, 2) + j

def get_batch(batch_size):
    seq = np.zeros([batch_size, NUM_T_STEPS, 1], dtype=np.float32)
    tgt = np.zeros([batch_size, NUM_T_STEPS], dtype=np.float32)
    for b in xrange(batch_size):
        i = float(random.randint(-25, 25))
        j = float(random.randint(-100, 100))
        for t in xrange(NUM_T_STEPS):
            value = ground_truth_func(i, j, t)
            seq[b, t, 0] = value
        for t in xrange(NUM_T_STEPS):
            tgt[b, t] = ground_truth_func(i, j, t + NUM_T_STEPS)
    return seq, tgt
# Placeholder for the inputs in a given iteration
sequence = tf.placeholder(tf.float32, [BATCH_SIZE, NUM_T_STEPS, 1])
target = tf.placeholder(tf.float32, [BATCH_SIZE, NUM_T_STEPS])
fc1_weight = tf.get_variable('w1', [LSTM_SIZE, 1], initializer=tf.random_normal_initializer(mean=0.0, stddev=1.0))
fc1_bias = tf.get_variable('b1', [1], initializer=tf.constant_initializer(0.1))
# ENCODER
with tf.variable_scope('ENC_LSTM'):
    lstm = tf.nn.rnn_cell.LSTMCell(LSTM_SIZE)
    multi_lstm = tf.nn.rnn_cell.MultiRNNCell([lstm] * LSTM_LAYERS)
    initial_state = multi_lstm.zero_state(BATCH_SIZE, tf.float32)
    state = initial_state
    for t_step in xrange(NUM_T_STEPS):
        if t_step > 0:
            tf.get_variable_scope().reuse_variables()
        # state value is updated after processing each batch of sequences
        output, state = multi_lstm(sequence[:, t_step, :], state)
    learned_representation = state
# DECODER
with tf.variable_scope('DEC_LSTM'):
    lstm = tf.nn.rnn_cell.LSTMCell(LSTM_SIZE)
    multi_lstm = tf.nn.rnn_cell.MultiRNNCell([lstm] * LSTM_LAYERS)
    state = learned_representation
    logits_stacked = None
    loss = 0.0
    for t_step in xrange(NUM_T_STEPS):
        if t_step > 0:
            tf.get_variable_scope().reuse_variables()
        # state value is updated after processing each batch of sequences
        output, state = multi_lstm(sequence[:, t_step, :], state)
        # output can be used to make next number prediction
        logits = tf.matmul(output, fc1_weight) + fc1_bias
        if logits_stacked is None:
            logits_stacked = logits
        else:
            logits_stacked = tf.concat(1, [logits_stacked, logits])
        loss += tf.reduce_sum(tf.square(logits - target[:, t_step])) / BATCH_SIZE
reg_loss = loss + LAMBDA_REG * (tf.nn.l2_loss(fc1_weight) + tf.nn.l2_loss(fc1_bias))
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    total_loss = 0.0
    for step in xrange(MAX_STEPS):
        seq_batch, target_batch = get_batch(BATCH_SIZE)
        feed = {sequence: seq_batch, target: target_batch}
        _, current_loss = sess.run([train, reg_loss], feed)
        if step % 10 == 0:
            print("#{}: {}".format(step, current_loss))
        total_loss += current_loss
    print('Total loss:', total_loss)
    print('### SIMPLE EVAL: ###')
    seq_batch, target_batch = get_batch(BATCH_SIZE)
    feed = {sequence: seq_batch, target: target_batch}
    prediction = sess.run([logits_stacked], feed)
    for b in xrange(BATCH_SIZE):
        print("{} -> {})".format(str(seq_batch[b, :, 0]), target_batch[b, :]))
        print(" `-> Prediction: {}".format(prediction[0][b]))
Sample output of this looks like this:
### SIMPLE EVAL: ###
# [input seq] -> [target prediction]
# `-> Prediction: [model prediction]
[ 33. 53. 113. 213.] -> [ 353. 533. 753. 1013.])
`-> Prediction: [ 19.74548721 28.3149128 33.11489105 35.06603241]
[ -17. -32. -77. -152.] -> [-257. -392. -557. -752.])
`-> Prediction: [-16.38951683 -24.3657589 -29.49801064 -31.58583832]
[ -7. -4. 5. 20.] -> [ 41. 68. 101. 140.])
`-> Prediction: [ 14.14126873 22.74848557 31.29668617 36.73633194]
...
The model is an LSTM autoencoder whose encoder and decoder have 2 layers each.
Unfortunately, as you can see in the results, this model does not learn the sequence properly. It might be that I'm just making a bad mistake somewhere, or that 1000-10000 training steps are just way too few for an LSTM. As I said, I'm also just starting to understand and use LSTMs properly.
But hopefully this can give you some inspiration regarding the implementation.
```
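One detail in the loss line of the code above may be worth checking. `logits` has shape `[BATCH_SIZE, 1]`, while `target[:, t_step]` has shape `[BATCH_SIZE]`. Under NumPy-style broadcasting, which TensorFlow's elementwise operators also follow, subtracting the two gives a `[BATCH_SIZE, BATCH_SIZE]` matrix rather than a per-sample difference. The sketch below reproduces only the shapes, in NumPy with made-up values; whether this explains the poor predictions is not something the post confirms.

```python
import numpy as np

BATCH_SIZE = 16
rng = np.random.default_rng(0)

logits = rng.normal(size=(BATCH_SIZE, 1))  # like tf.matmul(output, fc1_weight) + fc1_bias
target_t = rng.normal(size=(BATCH_SIZE,))  # like target[:, t_step]

broadcast_diff = logits - target_t         # broadcasts to (16, 16)
per_sample_diff = logits[:, 0] - target_t  # stays (16,)

print(broadcast_diff.shape, per_sample_diff.shape)

# The broadcast version sums over all pairs (i, j), not just matching samples
loss_broadcast = np.sum(np.square(broadcast_diff)) / BATCH_SIZE
loss_per_sample = np.sum(np.square(per_sample_diff)) / BATCH_SIZE
print(loss_broadcast, loss_per_sample)
```

Only the diagonal of the broadcast matrix corresponds to matching prediction/target pairs. Indexing `logits[:, 0]` (or reshaping the target slice to `[BATCH_SIZE, 1]`) keeps the difference per sample.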
```After reading several LSTM introduction blogs, e.g. Jakob Aungiers', option 3 seems to be the right one for a stateless LSTM.
If your LSTMs need to remember data from further back than num_steps, you can train in a stateful way; for a Keras example see Philippe Remy's blog post "Stateful LSTM in Keras". Philippe does not show an example for a batch size greater than one, however. I guess that in your case a batch size of four with a stateful LSTM could be used with the following data (written as input -> label):
batch #0:
1,2,3,4 -> 5
2,3,4,5 -> 6
3,4,5,6 -> 7
4,5,6,7 -> 8
batch #1:
5,6,7,8 -> 9
6,7,8,9 -> 10
7,8,9,10 -> 11
8,9,10,11 -> 12
batch #2:
9,10,11,12 -> 13
...
This way, the state of e.g. the 2nd sample in batch #0 is correctly reused to continue training with the 2nd sample of batch #1.
This is somewhat similar to your option 4; however, you are not using all available labels there.
Update:
In extension to my suggestion where batch_size equals the num_steps, Alexis Huet gives an answer for the case of batch_size being a divisor of num_steps, which can be used for larger num_steps. He describes it nicely on his blog.
```
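The batch layout from the answer above can be generated with a few lines of plain Python. This is a sketch under the answer's own assumption that the batch size equals `num_steps`; the function names `sliding_windows` and `stateful_batches` are made up for illustration.

```python
def sliding_windows(series, num_steps):
    """Option 3: each window of num_steps values is paired with the next value."""
    return [
        (series[i : i + num_steps], series[i + num_steps])
        for i in range(len(series) - num_steps)
    ]


def stateful_batches(series, num_steps):
    """Group windows so row k of each batch continues row k of the previous one.

    Uses batch_size == num_steps, as in the layout described above.
    Windows that do not fill a complete batch are dropped.
    """
    windows = sliding_windows(series, num_steps)
    n_full = len(windows) // num_steps
    return [windows[b * num_steps : (b + 1) * num_steps] for b in range(n_full)]


series = list(range(1, 14))  # 1, 2, ..., 13
batches = stateful_batches(series, num_steps=4)
for b, batch in enumerate(batches):
    print("batch #{}:".format(b))
    for x, y in batch:
        print(" ", x, "->", y)
```

Row k of batch #1 starts exactly where row k of batch #0 ended, which is the property a stateful LSTM relies on when it carries the state across batches.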
```I believe Option 1 is closest to the reference implementation in /tensorflow/models/rnn/ptb/reader.py
def ptb_iterator(raw_data, batch_size, num_steps):
    """Iterate on the raw PTB data.

    This generates batch_size pointers into the raw PTB data, and allows
    minibatch iteration along these pointers.

    Args:
        raw_data: one of the raw data outputs from ptb_raw_data.
        batch_size: int, the batch size.
        num_steps: int, the number of unrolls.

    Yields:
        Pairs of the batched data, each a matrix of shape [batch_size, num_steps].
        The second element of the tuple is the same data time-shifted to the
        right by one.

    Raises:
        ValueError: if batch_size or num_steps are too high.
    """
    raw_data = np.array(raw_data, dtype=np.int32)
    data_len = len(raw_data)
    batch_len = data_len // batch_size
    data = np.zeros([batch_size, batch_len], dtype=np.int32)
    for i in range(batch_size):
        data[i] = raw_data[batch_len * i:batch_len * (i + 1)]
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size == 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = data[:, i*num_steps:(i+1)*num_steps]
        y = data[:, i*num_steps+1:(i+1)*num_steps+1]
        yield (x, y)
However, another option is to select a pointer into your data array at random for each training sequence.```
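To see why this iterator corresponds to Option 1, here is a condensed, self-contained variant run on the sequence 1..13. The name `iterate_windows` is hypothetical, and the reshape replaces the explicit copy loop of the original; otherwise the slicing is the same.

```python
import numpy as np


def iterate_windows(raw_data, batch_size, num_steps):
    """Condensed variant of ptb_iterator: yields (x, y) with y shifted by one."""
    raw_data = np.asarray(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # Row i holds the i-th contiguous slice of the series
    data = raw_data[: batch_size * batch_len].reshape(batch_size, batch_len)
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size == 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = data[:, i * num_steps : (i + 1) * num_steps]
        y = data[:, i * num_steps + 1 : (i + 1) * num_steps + 1]
        yield x, y


pairs = list(iterate_windows(range(1, 14), batch_size=1, num_steps=4))
for x, y in pairs:
    print(x[0].tolist(), y[0].tolist())
```

With `batch_size=1` and `num_steps=4`, the windows do not overlap and each label is the input shifted by one step, which is the Option 1 layout from the question.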