Dask error - min() arg is an empty sequence

I'm trying to use Dask to handle a reasonably large dataset, but I keep getting
ValueError: min() arg is an empty sequence
when I try to run .describe().compute().
I have confirmed that describe() works in plain pandas with the same dataset, so the problem must be Dask-related.
Here is the line I'm using:
inpFile = dd.read_csv(fPath, sep='\t', error_bad_lines=False, quoting=csv.QUOTE_NONE)
and the full error is:
ValueError                                Traceback (most recent call last)
in ()
----> 1 inpFile.describe().compute()
      2 #inpFile2.describe()
/home/badrul/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in describe(self, split_every)
   1306         num = self._get_numeric_data()
   1307
-> 1308         stats = [num.count(split_every=split_every),
   1309                  num.mean(split_every=split_every),
   1310                  num.std(split_every=split_every),
/home/badrul/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in count(self, axis, split_every)
   1191                           token=token, split_every=split_every)
   1192         if isinstance(self, DataFrame):
-> 1193             result.divisions = (min(self.columns), max(self.columns))
   1194         return result
   1195
ValueError: min() arg is an empty sequence
It doesn't run for very long before failing, so I suspect the data isn't being loaded at all.
The error only appears when I call inpFile.describe().compute().
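The traceback fails in min(self.columns) on the result of _get_numeric_data(), which suggests that Dask found no numeric columns at all. Here is a minimal sketch of one way that can happen, using plain pandas and a made-up sample (the sample data and column names are assumptions, not the asker's file). With quoting=csv.QUOTE_NONE the quote characters stay part of each field, so quoted numbers are read as strings.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample in which every value is wrapped in quotes.
sample = 'id\tprice\n"1"\t"2.5"\n"2"\t"3.5"\n'

# QUOTE_NONE keeps the quote characters, so '"1"' cannot be parsed as a number.
df = pd.read_csv(io.StringIO(sample), sep="\t", quoting=csv.QUOTE_NONE)

numeric_cols = list(df.select_dtypes(include="number").columns)
print(df.dtypes)
print("numeric columns:", numeric_cols)

# With no numeric columns, min() has nothing to work on and raises ValueError,
# the same exception type as in the traceback above.
try:
    min(numeric_cols)
    raised = False
except ValueError:
    raised = True
print("min() raised ValueError:", raised)
```

If inpFile.dtypes shows no numeric columns either, passing an explicit dtype= mapping to dd.read_csv, or stripping the quote characters first, may be worth trying.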

machine learning+deep learning+speech recognition

The code runs in my editor (VS Code) without any problems, but because of RAM and GPU limitations I moved the next step to Colab. There I get an error that seems to be caused by a version mismatch between my local setup and Colab. How can I fix this problem?
The version of Python currently running on Google Colab is 3.8.16; I used tensorflow 2.3.0 and keras 2.4.3.
The error occurs in this part of the code, when I call model.fit() to train the model
(I use CTC loss in the model):
model.fit(
    train_dg,
    validation_data=val_dg,
    epochs=args.epochs,
    callbacks=[PlotLossesKeras(),
               early_stopping,
               cp,
               csv_logger,
               lrs]
)
But I got this error:
Epoch 00001: LearningRateScheduler reducing learning rate to 0.001.
Epoch 1/300
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-87-2b4ea6811b43> in <module>
----> 1 model.fit(train_dg,validation_data=val_dg,epochs=args.epochs,callbacks=[PlotLossesKeras(),early_stopping,cp,csv_logger,lrs])
9 frames
/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     57   try:
     58     ctx.ensure_initialized()
---> 59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
InvalidArgumentError: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 2 num_classes: 16 labels: 16,0,0,0,0,0,0 labels seen so far:
  [[node functional_3/CTCloss/CTCLoss (defined at <ipython-input-17-1689d20fc46d>:887) ]] [Op:__inference_train_function_6401]
Function call stack:
train_function
I tried changing the Python version in Colab, but that doesn't work.
I also changed num_classes in the last layer of my model, and that doesn't work either.
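Read literally, the message says that any label with index >= num_classes - 1 is treated as the null (blank) label, and that batch item 2 starts with label 16 while num_classes is 16. That label sits outside the range 0 to 14 that a 16-class output can represent, which points at the label encoding and not at the Python version. Below is a minimal sketch of a check to run before training (the helper name and the first two sample rows are hypothetical; only the last row is taken from the error message):

```python
from typing import List, Sequence


def find_invalid_ctc_labels(label_rows: Sequence[Sequence[int]],
                            num_classes: int) -> List[int]:
    """Return the indices of rows holding a label outside [0, num_classes - 2]."""
    blank_index = num_classes - 1  # the error message treats this index as null
    bad_rows = []
    for row_idx, row in enumerate(label_rows):
        if any(label < 0 or label >= blank_index for label in row):
            bad_rows.append(row_idx)
    return bad_rows


# Made-up batch; the last row mirrors the labels quoted in the error message.
labels = [
    [3, 7, 1, 0, 0, 0, 0],
    [5, 2, 0, 0, 0, 0, 0],
    [16, 0, 0, 0, 0, 0, 0],
]
bad = find_invalid_ctc_labels(labels, num_classes=16)
print(bad)  # → [2]
```

If such rows turn up, one option is to size the output layer to the largest label index plus two (one extra slot for the blank); another is to re-encode the labels so that they fit. Whether the zero padding is also a problem depends on how the label lengths are passed to the loss, which the question does not show.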

Ran into "TypeError: '<' not supported between instances of 'Tensor' and 'list'" when going through dataset

I am replicating ResNet (source: https://arxiv.org/abs/1512.03385).
I ran into the error "TypeError: '<' not supported between instances of 'Tensor' and 'list'" when iterating over several different datasets in different sections of my code.
I tried different fixes but none worked: (i) I removed enumerate because I worried it might be causing the problem; (ii) I tried iterating over the dataloader rather than the dataset, but that didn't work either.
1st time: when I tried to view images:
for images, _ in train_loader:
    print('images.shape:', images.shape)
    plt.figure(figsize=(16,8))
    plt.axis('off')
    plt.imshow(torchvision.utils.make_grid(images, nrow=16).permute((1, 2, 0)))
    break
2nd/3rd time: when I tried to validate/test the resnet:
with torch.no_grad():
    for j, (inputs, labels) in enumerate(test_loader, start=0):
        outputs = resnet_models[i](inputs)
        _, prediction = torch.max(outputs, dim=1)
You may notice that I didn't run into this error when training the resnet, and the code is quite similar:
for batch, data in enumerate(train_dataloader, start=0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)
Error message (taking the first error as an example. The rest is pretty much the same)
TypeError                                 Traceback (most recent call last)
Input In [38], in <cell line: 8>()
      6 print("Images AFTER NORMALIZATION")
      7 print("--------------------------")
----> 8 for images, _ in training_data:
      9     sort=False
     10     print('images.shape:', images.shape)
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torch/utils/data/dataset.py:471, in Subset.__getitem__(self, idx)
    469 if isinstance(idx, list):
    470     return self.dataset[[self.indices[i] for i in idx]]
--> 471 return self.dataset[self.indices[idx]]
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/datasets/cifar.py:118, in CIFAR10.__getitem__(self, index)
    115 img = Image.fromarray(img)
    117 if self.transform is not None:
--> 118     img = self.transform(img)
    120 if self.target_transform is not None:
    121     target = self.target_transform(target)
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/transforms/transforms.py:95, in Compose.__call__(self, img)
     93 def __call__(self, img):
     94     for t in self.transforms:
---> 95         img = t(img)
     96     return img
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
   1106 # If we don't have any hooks, we want to skip the rest of the logic in
   1107 # this function, and just call forward.
   1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110     return forward_call(*input, **kwargs)
   1111 # Do not call functions when jit is used
   1112 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/transforms/transforms.py:707, in RandomHorizontalFlip.forward(self, img)
    699 def forward(self, img):
    700     """
    701     Args:
    702         img (PIL Image or Tensor): Image to be flipped.
   (...)
    705         PIL Image or Tensor: Randomly flipped image.
    706     """
--> 707     if torch.rand(1) < self.p:
    708         return F.hflip(img)
    709     return img
TypeError: '<' not supported between instances of 'Tensor' and 'list'
I was having the same error message, probably under different circumstances, but I just found my own bug and figured I would share it anyway for various readers. I was using a torchvision transformation in my dataset, which the dataloader was loading from. The transformation was
torchvision.transforms.RandomHorizontalFlip([0.5]),
and the error is that the input to this transformation should not be a list but should be
torchvision.transforms.RandomHorizontalFlip(0.5),
So my recommendation is to check whether a list is being passed to some transformation (or elsewhere) where a plain number is expected.
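The traceback in the question ends at `torch.rand(1) < self.p`, so the list has to be the `p` stored on a transform. Below is a minimal sketch of a scan for that situation. It uses a stand-in class so that it runs without torchvision; the helper name and `FakeFlip` are hypothetical.

```python
from numbers import Real


def find_non_scalar_probabilities(transforms):
    """Return (position, class name, value) for transforms whose `p` is not a number."""
    problems = []
    for position, transform in enumerate(transforms):
        p = getattr(transform, "p", None)  # transforms without `p` are skipped
        if p is not None and not isinstance(p, Real):
            problems.append((position, type(transform).__name__, p))
    return problems


class FakeFlip:
    """Hypothetical stand-in that stores `p` the way RandomHorizontalFlip does."""

    def __init__(self, p):
        self.p = p


pipeline = [FakeFlip(0.5), FakeFlip([0.5])]
problems = find_non_scalar_probabilities(pipeline)
print(problems)  # → [(1, 'FakeFlip', [0.5])]
```

With a real torchvision.transforms.Compose object, the list to scan would be its `transforms` attribute, which is the attribute the traceback iterates over.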

Error getting categorical features from train dataset

My train data looks something like this:
[image: train data]
To extract the categorical features from it, I ran the following code:
categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
But I am getting this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-31-31eb7ac47e21> in <module>
----> 1 categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
<ipython-input-31-31eb7ac47e21> in <listcomp>(.0)
----> 1 categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
4295 if is_scalar(key):
4296 key = com.cast_scalar_indexer(key, warn_float=True)
-> 4297 return getitem(key)
4298
4299 if isinstance(key, slice):
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
What is a possible solution?
Use this to select the columns of dtype 'object':
categorical = train.select_dtypes('object')
If you just want the column names:
categorical_cols = train.select_dtypes('object').columns.tolist()
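As for why the original comprehension fails: train.columns is an Index of column names, so indexing it with a name (a string) raises the IndexError shown in the question. The dtype lives on the column itself, train[c]. Below is a minimal sketch on a made-up frame (the column names and values are assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical stand-in for the asker's training data. dtype="object" is set
# explicitly so the text columns are object-typed regardless of pandas version.
train = pd.DataFrame({
    "city": pd.Series(["Pune", "Delhi"], dtype="object"),
    "age": [31, 45],
    "grade": pd.Series(["A", "B"], dtype="object"),
})

# Look up each column on the DataFrame itself, not on train.columns.
categorical = [c for c in train.columns if train[c].dtype == object]
print(categorical)  # → ['city', 'grade']
```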

parallelise prediction with `map_partitions`

I have a dataframe of shape (25M, 79) and I'm trying to parallelise an sklearn pipeline prediction on it.
When I run it for just one partition, it works as expected:
n_partitions = 1000
ddf = dd.from_pandas(df_x_selection, npartitions=n_partitions)
grid_searcher.best_estimator_.predict_proba(ddf.get_partition(0))
But if I apply it to every partition, then it fails:
n_partitions = 1000
ddf = dd.from_pandas(df_x_selection, npartitions=n_partitions)
def _f(_df, _pipeline, _predicted_class) -> np.array:
    return _pipeline.predict_proba(_df)[:, _predicted_class]
ddf.map_partitions(_f, grid_searcher.best_estimator_, 1, meta=(None, 'f8')).compute()
The error is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
130 raise ValueError(
--> 131 f"Wrong number of items passed {len(self.values)}, "
132 f"placement implies {len(self.mgr_locs)}"
ValueError: Wrong number of items passed 79, placement implies 100
What am I doing wrong?
Thanks
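The traceback alone does not pin down the cause. One thing worth ruling out is the mismatch between the declared meta, (None, 'f8'), which describes a float Series, and the bare NumPy array that _f returns. Below is a minimal sketch of a partition function that returns a Series aligned with the partition's index. `FakePipeline` and `predict_partition` are hypothetical names, and this is not a confirmed fix for the error above.

```python
import numpy as np
import pandas as pd


class FakePipeline:
    """Hypothetical stand-in for grid_searcher.best_estimator_."""

    def predict_proba(self, frame: pd.DataFrame) -> np.ndarray:
        p1 = np.full(len(frame), 0.25)          # constant class-1 probability
        return np.column_stack([1.0 - p1, p1])  # shape (n_rows, 2)


def predict_partition(part: pd.DataFrame, pipeline, predicted_class: int) -> pd.Series:
    """Return one probability per row, indexed like the input partition."""
    proba = pipeline.predict_proba(part)[:, predicted_class]
    return pd.Series(proba, index=part.index, dtype="f8")


part = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 11, 12])
result = predict_partition(part, FakePipeline(), 1)
print(result)
```

A function shaped like this could be passed to ddf.map_partitions in place of _f, with the same meta argument.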

Keras Tokenization (fit on text)

When I run this script:
tokenizer.fit_on_texts(df['text'].values)
sequences = tokenizer.texts_to_sequences(df['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
I am getting this error
AttributeError                            Traceback (most recent call last)
<ipython-input-4-7c08b89b116a> in <module>()
----> 1 tokenizer.fit_on_texts(df['text'].values)
      2 sequences = tokenizer.texts_to_sequences(df['text'].values)
      3 word_index = tokenizer.word_index
      4 print('Found %s unique tokens.' % len(word_index))
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in fit_on_texts(self, texts)
    220                                             self.filters,
    221                                             self.lower,
--> 222                                             self.split)
    223             for w in seq:
    224                 if w in self.word_counts:
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in text_to_word_sequence(text, filters, lower, split)
     41     """
     42     if lower:
---> 43         text = text.lower()
     44
     45     if sys.version_info < (3,):
AttributeError: 'float' object has no attribute 'lower'
The size of my CSV file is 6970963; when I reduce the size, it works. Is there a size limit for the Keras Tokenizer, or am I doing something wrong?
I doubt that file size is the issue. Try wrapping the call in a try block and look at the data you are passing. Use something like this instead:
# instead of this
tokenizer.fit_on_texts(df['text'].values)
# use this to inspect the data when it causes the error
try:
    tokenizer.fit_on_texts(df['text'].values)
except Exception as e:
    print("exception is", e)
    print("data passed in", df['text'].values)
Then you can fix the error accordingly.
Check the datatype of the text you are fitting the tokenizer on. It sees it as a float instead of string. You need to convert to string before fitting a tokenizer on it.
Try something like this:
train_x = [str(x[1]) for x in train_x]
Although this is an old thread, the following may still be the answer.
Your data may contain NaN values, which are floats and not strings. Either force each value to a string with str(word), or replace the NaN values using data.fillna('empty').
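The suggestions above can be sketched as follows, using only pandas so that the check runs without Keras (the sample rows are made up; the column name 'text' is taken from the question):

```python
import pandas as pd

# Hypothetical sample with one missing entry, as a CSV with empty cells would give.
df = pd.DataFrame({"text": ["Hello world", float("nan"), "Another row"]})

# Anything that is not a str would break text.lower() inside the tokenizer.
non_strings = [v for v in df["text"].tolist() if not isinstance(v, str)]
print("non-string entries:", non_strings)

# Replace missing values with an empty string and force everything to str.
clean = df["text"].fillna("").astype(str)
print(clean.tolist())
```

The cleaned values (clean.values) could then be passed to tokenizer.fit_on_texts in place of df['text'].values.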
