error while fitting a hmm.MultinomialHMM with sequences of observed data - machine-learning

I'm trying to fit a multinomial model with my observed data. I have a dataset containing trajectories with different lengths. Since my observations are discrete I try to fit with a multinomial model. The number of observation symbols is 3147 and the number of trajectories(sequences) is 4760.
Whilst I give the observation sequences (X with an array shape defined in hmmlearn class) and the length of observation with () to fit method with the this code:
X
array([[31],
[ 1],
[17],
...,
[ 4],
[ 1],
[16]])
lengths
[28,
6,
11,
7,
2,
2,
...]
model = hmm.MultinomialHMM(n_components=10).fit(X, lengths)
I got an error. Can someone help me and explain what I am doing wrong.
Thanks.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-72-565aadfd0ae4> in <module>()
----> 1 model = hmm.MultinomialHMM(n_components=10).fit(X, lengths)
C:\Python\Anaconda2\lib\site-packages\hmmlearn\base.pyc in fit(self, X, lengths)
427 curr_logprob = 0
428 for i, j in iter_from_X_lengths(X, lengths):
--> 429 framelogprob = self._compute_log_likelihood(X[i:j])
430 logprob, fwdlattice = self._do_forward_pass(framelogprob)
431 curr_logprob += logprob
C:\Python\Anaconda2\lib\site-packages\hmmlearn\hmm.pyc in _compute_log_likelihood(self, X)
403
404 def _compute_log_likelihood(self, X):
--> 405 return np.log(self.emissionprob_)[:, np.concatenate(X)].T
406
407 def _generate_sample_from_state(self, state, random_state=None):
IndexError: index 3147 is out of bounds for axis 1 with size 3147

The value of your X should start from zero rather than from one.

Related

why max_samples does not accept float type?

I am doing machine learning.Here I want to find the best triple (max_samples, n_trees and threshold) that gives the greatest performance in terms of area under ROC curve and area under recall precison curve.
Here is the code:
def meilleur_triplet(x,classes):
for n_trees in np.arange(100,160,10):
for sample_size in np.arange(0.1,1,0.1):
for threshold in np.arange(0.4,1,0.1):
model=IforestLocal(sample_size,n_trees)
model.fit(x)
y_pred,y_score=model.predict(x,threshold)
auc=roc_auc_score(classes,y_pred)
auc_pr=average_precision_score(classes,y_pred)
Now when I use max_samples with a range of int I don't have an error however if it's in float I have the following error:
**
TypeError Traceback (most recent call last)
Input In [201], in <cell line: 1>()
----> 1 meilleur_triplet(X_glass,y_glass)
Input In [200], in meilleur_triplet(x, classes)
6 for threshold in np.arange(0.4,1,0.1):#(0.4,1,0.1)
8 model=IforestLocal(sample_size,n_trees)
----> 9 model.fit(x)
File ~\Desktop\THESE\Maurras\Code_Maurras\iforest_D.py:45, in IsolationForest.fit(self, X)
42 self.sample_size = len_x
44 for i in range(self.n_trees):
---> 45 sample_idx = random.sample(list(range(len_x)), self.sample_size)
46 # TODO: Must be deleted before compute the memory consumption of the methods
47 self.samples.append(sample_idx)
File ~\anaconda3\lib\random.py:450, in Random.sample(self, population, k, counts)
448 if not 0 <= k <= n:
449 raise ValueError("Sample larger than population or is negative")
--> 450 result = [None] * k
451 setsize = 21 # size of a small set minus size of an empty list
452 if k > 5:
TypeError: can't multiply sequence by non-int of type 'numpy.float64'
**
This is where I called the function
meilleur_triplet(X_glass,y_glass)
Thank you please help me

Loss for Multi-label Classification

I am working on a multi-label classification problem. My gt labels are of shape 14 x 10 x 128, where 14 is the batch_size, 10 is the sequence_length, and 128 is the vector with values 1 if the item in sequence belongs to the object and 0 otherwise.
My output is also of same shape: 14 x 10 x 128. Since, my input sequence was of varying length I had to pad it to make it of fixed length 10. I'm trying to find the loss of the model as follows:
total_loss = 0.0
unpadded_seq_lengths = [3, 4, 5, 7, 9, 3, 2, 8, 5, 3, 5, 7, 7, ...] # true lengths of sequences
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
for data in training_dataloader:
optimizer.zero_grad()
# shape of input 14 x 10 x 128
output = model(data)
batch_loss = 0.0
for batch_idx, sequence in enumerate(output):
# sequence shape is 10 x 128
true_seq_len = unpadded_seq_lengths[batch_idx]
# only keep unpadded gt and predicted labels since we don't want loss to be influenced by padded values
predicted_labels = sequence[:true_seq_len, :] # for example, 3 x 128
gt_labels = gt_labels_padded[batch_idx, :true_seq_len, :] # same shape as above, gt_labels_padded has shape 14 x 10 x 128
# loop through unpadded predicted and gt labels and calculate loss
for item_idx, predicted_labels_seq_item in enumerate(predicted_labels):
# predicted_labels_seq_item and gt_labels_seq_item are 1D vectors of length 128
gt_labels_seq_item = gt_labels[item_idx]
current_loss = criterion(predicted_labels_seq_item, gt_labels_seq_item)
total_loss += current_loss
batch_loss += current_loss
batch_loss.backward()
optimizer.step()
Can anybody please check to see if I'm calculating loss correctly. Thanks
Update:
Is this the correct approach for calculating accuracy metrics?
# batch size: 14
# seq length: 10
for epoch in range(10):
TP = FP = TN = FN = 0.
for x, y, mask in tr_dl:
# mask shape: (10,)
out = model(x) # out shape: (14, 10, 128)
y_pred = (torch.sigmoid(out) >= 0.5).float().type(torch.int64) # consider all predictions above 0.5 as 1, rest 0
y_pred = y_pred[mask] # y_pred shape: (14, 10, 10, 128)
y_labels = y[mask] # y_labels shape: (14, 10, 10, 128)
# do I flatten y_pred and y_labels?
y_pred = y_pred.flatten()
y_labels = y_labels.flatten()
for idx, prediction in enumerate(y_pred):
if prediction == 1 and y_labels[idx] == 1:
# calculate IOU (overlap of prediction and gt bounding box)
iou = 0.78 # assume we get this iou value for objects at idx
if iou >= 0.5:
TP += 1
else:
FP += 1
elif prediction == 1 and y_labels[idx] == 0:
FP += 1
elif prediction == 0 and y_labels[idx] == 1:
FN += 1
else:
TN += 1
EPOCH_ACC = (TP + TN) / (TP + TN + FP + FN)
It is usually recommended to stick with batch-wise operations and avoid going into single-element processing steps while in the main training loop. One way to handle this case is to make your dataset return padded inputs and labels with additionally a mask that will come useful for loss computation. In other words, to compute the loss term with sequences of varying sizes, we will use a mask instead of doing individual slices.
Dataset
The way to proceed is to make sure you build the mask in the dataset and not in the inference loop. Here I am showing a minimal implementation that you should be able to transfer to your dataset without much hassle:
class Dataset(data.Dataset):
def __init__(self):
super().__init__()
def __len__(self):
return 100
def __getitem__(self, index):
i = random.randint(5, SEQ_LEN) # for demo puporse, generate x with random length
x = torch.rand(i, EMB_SIZE)
y = torch.randint(0, N_CLASSES, (i, EMB_SIZE))
# pad data to fit in batch
pad = torch.zeros(SEQ_LEN-len(x), EMB_SIZE)
x_padded = torch.cat((pad, x))
y_padded = torch.cat((pad, y))
# construct tensor to mask loss
mask = torch.cat((torch.zeros(SEQ_LEN-len(x)), torch.ones(len(x))))
return x_padded, y_padded, mask
Essentially in the __getitem__, we not only pad the input x and target y with zero values, we also construct a simple mask containing the positions of the padded values in the currently processed element.
Notice how:
x_padded, shaped (SEQ_LEN, EMB_SIZE)
y_padded, shaped (SEQ_LEN, N_CLASSES)
mask, shaped (SEQ_LEN,)
are all three tensors which are shape invariant across the dataset, yet mask contains the padding information necessary for us to compute the loss function appropriately.
Inference
The loss you've used nn.BCEWithLogitsLoss, is the correct one since it's a multi-dimensional loss used for binary classification. In other words, you can use it here in this multi-label classification task, considering each one of the 128 logits as an individual binary prediction. Do not use nn.CrossEntropyLoss) as suggested elsewhere, since the softmax will push a single logit (i.e. class), which is the behaviour required for single-label classification tasks.
Therefore, in the training loop, we simply have to apply the mask to our loss.
for x, y, mask in dl:
y_pred = model(x)
loss = mask*bce(y_pred, y)
# backpropagation, loss postprocessing, logs, etc.
This is what you need for the first part of the question, there are already loss functions implemented in tensorflow: https://medium.com/#aadityaura_26777/the-loss-function-for-multi-label-and-multi-class-f68f95cae525. Yours is just tf.nn.weighted_cross_entropy_with_logits, but you need to set the weight.
The second part of the question is not straightforward, because there's conditioning on the IOU, generally, when you do machine learning, you should heavily depend on matrix multiplication, in your case, you probably need to pre-calculate the IOU -> 1 or 0 as a vector, then multiply with the y_pred , element-wise, this will give you the modified y_pred . After that, you can use any accuracy available function to calculate the final result.
if you can use the CROSSENTROPYLOSS instead of BCEWithLogitsLoss there is something called ignore_index. you can use it to exclude your padded sequences. the difference between the 2 losses is the activation function used (softmax vs sigmoid). but I think you can still use the CROSSENTROPYLOSSfor binary classification as well.

Map Dask bincount over 2d array columns

I am trying to use bincount over a 2D array. Specifically I have this code:
import numpy as np
import dask.array as da
def dask_bincount(weights, x):
da.bincount(x, weights)
idx = da.random.random_integers(0, 1024, 1000)
weight = da.random.random((1000, 2))
bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)
The idea is that the bincount can be made with the same idx array on each one of the weight columns. That would return an array of size (np.amax(x) + 1, 2) if I am correct.
However when doing this I get this error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-5b8eed89ad32> in <module>
----> 1 bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)
~/.local/lib/python3.9/site-packages/dask/array/routines.py in apply_along_axis(func1d, axis, arr, dtype, shape, *args, **kwargs)
454 if shape is None or dtype is None:
455 test_data = np.ones((1,), dtype=arr.dtype)
--> 456 test_result = np.array(func1d(test_data, *args, **kwargs))
457 if shape is None:
458 shape = test_result.shape
<ipython-input-14-34fd0eb9b775> in dask_bincount(weights, x)
1 def dask_bincount(weights, x):
----> 2 da.bincount(x, weights)
~/.local/lib/python3.9/site-packages/dask/array/routines.py in bincount(x, weights, minlength, split_every)
670 raise ValueError("Input array must be one dimensional. Try using x.ravel()")
671 if weights is not None:
--> 672 if weights.chunks != x.chunks:
673 raise ValueError("Chunks of input array x and weights must match.")
674
AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
I thought that when dask array were created the library automatically assigns them chunks, so the error does not say much. How can I fix this?
I made an script that does it on numpy with map.
idx_np = np.random.randint(0, 1024, 1000)
weight_np = np.random.random((1000,2))
f = lambda y: np.bincount(idx_np, weight_np[:,y])
result = map(f, [i for i in range(2)])
np.array(list(result))
array([[0.9885341 , 0.9977873 , 0.24937023, ..., 0.31024526, 1.40754883,
0.87609759],
[1.77406303, 0.84787723, 0.14591474, ..., 0.54584068, 0.38357015,
0.85202672]])
I would like to the same but with dask
There are multiple problems at play.
Weights should be (2, 1000)
You discover this by trying to write the same function in numpy using apply_along_axis.
idx_np = np.random.random_integers(0, 1024, 1000)
weight_np = np.random.random((2, 1000)) # <- transposed
# This gives the same result as the code you provided
np.apply_along_axis(lambda weight, idx: np.bincount(idx, weight), 1, weight_np, idx_np)
da.apply_along_axis applies the function to numpy arrays
You're getting the error
AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
This suggests that what makes it into the da.bincount method is actually a numpy array. The fact is that da.apply_along_axis actually takes each row of weight and sends it to the function as a numpy array.
Your function should therefore actually be a numpy function:
def bincount(weights, x):
return np.bincount(x, weights)
However, if you try this, you will still get the same error. I believe that happens for a whole another reason though:
Dask doesn't know what the output shape will be and tries to infer it
In the code and/or documentation for apply_along_axis, we can see that Dask tries to infer the output shape and dtype by passing in the array [1] (related question). This is a problem, since bincount cannot just accept such argument.
What we can do instead is provide shape and dtype to the method so that Dask doesn't have to infer it.
The problem here is that bincount's output shape depends on the maximum value of the input array. Unless you know it beforehand, you will sadly need to compute it. The whole operation therefore won't be fully lazy.
This is the full answer:
import numpy as np
import dask.array as da
idx = da.random.random_integers(0, 1024, 1000)
weight = da.random.random((2, 1000))
def bincount(weights, x):
return np.bincount(x, weights)
m = idx.max().compute()
da.apply_along_axis(bincount, 1, weight, idx, shape=(m,), dtype=weight.dtype)
Appendix: randint vs random_integers
Be careful, because these are subtly different
randint takes integers from low (inclusive) to high (exclusive)
random_integers takes integers from low (inclusive) to high (inclusive)
Thus you have to call randint with high + 1 to get the same value.

Tensorflow Dimension size must be evenly divisible by N but is M for 'linear/linear_model/x/Reshape'

I've been trying to use tensorflow's tf.estimator, but I'm getting the following errors regarding the shape of input/output data.
ValueError: Dimension size must be evenly divisible by 9 but is 12 for
'linear/linear_model/x/Reshape' (op: 'Reshape') with input shapes:
[4,3], [2] and with input tensors computed as partial shapes: input[1]
= [?,9].
Here is the code:
data_size = 3
iterations = 10
learn_rate = 0.005
# generate test data
input = np.random.rand(data_size,3)
output = np.dot(input, [2, 3 ,7]) + 4
output = np.transpose([output])
feature_columns = [tf.feature_column.numeric_column("x", shape=(data_size, 3))]
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)
input_fn = tf.estimator.inputs.numpy_input_fn({"x":input}, output, batch_size=4, num_epochs=None, shuffle=True)
estimator.train(input_fn=input_fn, steps=iterations)
input data shape is shape=(3, 3):
[[ 0.06525168 0.3171153 0.61675511]
[ 0.35166298 0.71816544 0.62770994]
[ 0.77846666 0.20930611 0.1710842 ]]
output data shape is shape=(3, 1)
[[ 9.399135 ]
[ 11.25179188]
[ 7.38244104]]
I have sense it is related to input data, output data and batch_size, because when input data changed to 1 row it works. When input data rows count equal to batch_size(data_size = 10 and batch_size=10) then it throws other error:
ValueError: Shapes (1, 1) and (10, 1) are incompatible
Any help with the errors would be much appreciated.

Torch: How to shuffle a tensor by its rows?

I am currently working in torch to implement a random shuffle (on the rows, the first dimension in this case) on some input data. I am new to torch, so I have some troubles figuring out how permutation works..
The following is supposed to shuffle the data:
if argshuffle then
local perm = torch.randperm(sids:size(1)):long()
print("\n\n\nSize of X and y before")
print(X:view(-1, 1000, 128):size())
print(y:size())
print(sids:size())
print("\nPerm size is: ")
print(perm:size())
X = X:view(-1, 1000, 128)[{{perm},{},{}}]
y = y[{{perm},{}}]
print(sids[{{1}, {}}])
sids = sids[{{perm},{}}]
print(sids[{{1}, {}}])
print(X:size())
print(y:size())
print(sids:size())
os.exit(69)
end
This prints out
Size of X and y before
99
1000
128
[torch.LongStorage of size 3]
99
1
[torch.LongStorage of size 2]
99
1
[torch.LongStorage of size 2]
Perm size is:
99
[torch.LongStorage of size 1]
5
[torch.LongStorage of size 1x1]
5
[torch.LongStorage of size 1x1]
99
1000
128
[torch.LongStorage of size 3]
99
1
[torch.LongStorage of size 2]
99
1
[torch.LongStorage of size 2]
Out of the value, I can imply that the function did not shuffle the data. How can I make it shuffle correctly, and what is the common solution in lua/torch?
I also faced a similar issue. In the documentation, there is no shuffle function for tensors (there are for dataset loaders). I found a workaround to the problem using torch.randperm.
>>> a=torch.rand(3,5)
>>> print(a)
tensor([[0.4896, 0.3708, 0.2183, 0.8157, 0.7861],
[0.0845, 0.7596, 0.5231, 0.4861, 0.9237],
[0.4496, 0.5980, 0.7473, 0.2005, 0.8990]])
>>> # Row shuffling
...
>>> a=a[torch.randperm(a.size()[0])]
>>> print(a)
tensor([[0.4496, 0.5980, 0.7473, 0.2005, 0.8990],
[0.0845, 0.7596, 0.5231, 0.4861, 0.9237],
[0.4896, 0.3708, 0.2183, 0.8157, 0.7861]])
>>> # column shuffling
...
>>> a=a[:,torch.randperm(a.size()[1])]
>>> print(a)
tensor([[0.2005, 0.7473, 0.5980, 0.8990, 0.4496],
[0.4861, 0.5231, 0.7596, 0.9237, 0.0845],
[0.8157, 0.2183, 0.3708, 0.7861, 0.4896]])
I hope it answers the question!
dim = 0
idx = torch.randperm(t.shape[dim])
t_shuffled = t[idx]
If your tensor is e.g. of shape CxNxF (channels by rows by features), then you can shuffle along the second dimension like so:
dim=1
idx = torch.randperm(t.shape[dim])
t_shuffled = t[:,idx]
A straightforward solution is to use permutation matrices (those that are usual in linear algebra). Since you seem to be interested in the 3d case, we will have to flatten your 3d tensor first. So, here's an example code (ready-to use) that I came up with
data=torch.floor(torch.rand(5,3,2)*100):float()
reordered_data=data:view(5,-1)
perm=torch.randperm(5);
perm_rep=torch.repeatTensor(perm,5,1):transpose(1,2)
indexes=torch.range(1,5);
indexes_rep=torch.repeatTensor(indexes,5,1)
permutation_matrix=indexes_rep:eq(perm_rep):float()
permuted=permutation_matrix*reordered_data
print("perm")
print(perm)
print("before permutation")
print(data)
print("after permutation")
print(permuted:view(5,3,2))
As you will see from one execution, it reorders the tensor data according to the row indexes given in perm.
Based on your syntax, I assume you're using to torch with lua rather than PyTorch.
torch.Tensor.index is your function, it works like below:
x = torch.rand(4, 4)
p = torch.randperm(4)
print(x)
print(p)
print(x:index(1,p:long())

Resources