I'm trying to use Dask to handle a reasonably large dataset but I keep getting
ValueError: min() arg is an empty sequence
when I try to run .describe().compute()
I have confirmed the Describe works in normal Pandas with the same dataset so it must be dask related.
Here is the line I'm using:
inpFile = dd.read_csv(fPath, sep='\t', error_bad_lines= False,quoting=csv.QUOTE_NONE)
and the full error is:
ValueError Traceback (most recent call
last) in ()
----> 1 inpFile.describe().compute()
2 #inpFile2.describe()
/home/badrul/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py
in describe(self, split_every) 1306 num =
self._get_numeric_data() 1307
-> 1308 stats = [num.count(split_every=split_every), 1309 num.mean(split_every=split_every), 1310
num.std(split_every=split_every),
/home/badrul/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py
in count(self, axis, split_every) 1191
token=token, split_every=split_every) 1192 if
isinstance(self, DataFrame):
-> 1193 result.divisions = (min(self.columns), max(self.columns)) 1194 return result 1195
ValueError: min() arg is an empty sequence
Although it doesn't run for very long so I suspect it's not loading.
The error then comes when I do: inpFile.describe().compute()
Related
I run the code in my editor (VS Code) without any problems, but for next step and due to RAM and GPU limitation, I took it in colab, but got an error that seems to be due to mismatch of versions due to transfer from my editor to colab. how can i fix this problem?
The current version of python running on Google Colab is 3.8.16, I used tensorflow 2.3.0 and keras 2.4.3.
The error is related to this part of code when use the model.fit() for train the model:
(I use CTC_loss in model):
model.fit(
train_dg,
validation_data=val_dg,
epochs=args.epochs,
callbacks=[PlotLossesKeras(),
early_stopping,
cp,
csv_logger,
lrs]
)
But I got this error:
----------------------------------------------------------------------------------------------------
**Epoch 00001: LearningRateScheduler reducing learning rate to 0.001. Epoch 1/300
-----------**---------------------------------------------------------------- InvalidArgumentError Traceback (most recent call last) <ipython-input-87-2b4ea6811b43> in <module>
----> 1 model.fit(train_dg,validation_data=val_dg,epochs=args.epochs,callbacks=[PlotLossesKeras(),early_stopping,cp,csv_logger,lrs])
9 frames /usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
57 try:
58 ctx.ensure_initialized()
---> 59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
InvalidArgumentError: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 2 num_classes: 16 labels: 16,0,0,0,0,0,0 labels seen so far: [[node functional_3/CTCloss/CTCLoss (defined at <ipython-input-17-1689d20fc46d>:887) ]] [Op:__inference_train_function_6401]
Function call stack: train_function
---------------------------------------------------------------------------------------
I try change the version of python in colab but it dosent work.
also change num_classes in the last layer of my model, it dosent work too.
I am replicating ResNet (source: https://arxiv.org/abs/1512.03385).
I ran into the error "TypeError: '<' not supported between instances of 'Tensor' and 'list'" when trying to go through several different dataset in different sections of my code.
I tried different fixes but none worked: (i) I deleted enumerate cause I worried that using this may cause the problem (ii) I tried to go through dataloader rather than dataset but it didn't work
1st time: When I tried to view images:
for images, _ in train_loader:
print('images.shape:', images.shape)
plt.figure(figsize=(16,8))
plt.axis('off')
plt.imshow(torchvision.utils.make_grid(images, nrow=16).permute((1, 2, 0)))
break
2nd/3rd time: when I tried to validate/test the resnet:
with torch.no_grad():
for j, inputs, labels in enumerate(test_loader, start=0):
outputs = resnet_models[i](inputs)
_, prediction = torch.max(outputs, dim=1)
You may notice that I didn't run into this error when training the resnet, and the code is quite similar:
for batch, data in enumerate(train_dataloader, start=0):
inputs, labels = data
inputs, labels = inputs.to(device), labels.to(device)
Error message (taking the first error as an example. The rest is pretty much the same)
TypeError Traceback (most recent call last)
Input In [38], in <cell line: 8>()
6 print("Images AFTER NORMALIZATION")
7 print("--------------------------")
----> 8 for images, _ in training_data:
9 sort=False
10 print('images.shape:', images.shape)
File ~/miniconda3/envs/resnet/lib/python3.9/site->packages/torch/utils/data/dataset.py:471, in Subset.getitem(self, idx)
469 if isinstance(idx, list):
470 return self.dataset[[self.indices[i] for i in idx]]
--> 471 return self.dataset[self.indices[idx]]
File ~/miniconda3/envs/resnet/lib/python3.9/site->packages/torchvision/datasets/cifar.py:118, in CIFAR10.getitem(self, index)
115 img = Image.fromarray(img)
117 if self.transform is not None:
--> 118 img = self.transform(img)
120 if self.target_transform is not None:
121 target = self.target_transform(target)
File ~/miniconda3/envs/resnet/lib/python3.9/site->packages/torchvision/transforms/transforms.py:95, in Compose.call(self, img)
93 def call(self, img):
94 for t in self.transforms:
---> 95 img = t(img)
96 return img
File ~/miniconda3/envs/resnet/lib/python3.9/site->packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
1106 # If we don't have any hooks, we want to skip the rest of the logic in
1107 # this function, and just call forward.
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks >or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/resnet/lib/python3.9/site->packages/torchvision/transforms/transforms.py:707, in RandomHorizontalFlip.forward(self, >img)
699 def forward(self, img):
700 """
701 Args:
702 img (PIL Image or Tensor): Image to be flipped.
(...)
705 PIL Image or Tensor: Randomly flipped image.
706 """
--> 707 if torch.rand(1) < self.p:
708 return F.hflip(img)
709 return img
TypeError: '<' not supported between instances of 'Tensor' and 'list'
I was having the same error message, probably under different circumstances, but I just found my own bug and figured I would share it anyway for various readers. I was using a torchvision transformation in my dataset, which the dataloader was loading from. The transformation was
torchvision.transforms.RandomHorizontalFlip([0.5]),
and the error is that the input to this transformation should not be a list but should be
torchvision.transforms.RandomHorizontalFlip(0.5),
So if there is anything I can recommend, it's just that maybe there is some list argument being passed through that shouldn't be in some transformation or otherwise.
My train data looks something like this:
train data
To extract categorial features out of it I ran following code"
categorial=[c for c in train.columns if train.columns(c).dtype in ['object'] ]
But I am getting error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-31-31eb7ac47e21> in <module>
----> 1 categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
<ipython-input-31-31eb7ac47e21> in <listcomp>(.0)
----> 1 categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
4295 if is_scalar(key):
4296 key = com.cast_scalar_indexer(key, warn_float=True)
-> 4297 return getitem(key)
4298
4299 if isinstance(key, slice):
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
What is the possible solution?
Use this to select the 'object' type variables-
categorical = train.select_dtypes('object')
If you just want the variable names -
categorical_cols = train.select_dtypes('object').columns.tolist()
I have a dataframe of shape (25M, 79) and im trying to parallelise an sklearn pipeline prediction on it.
When I run it for just one partition, it works as expected:
n_partitions = 1000
ddf = dd.from_pandas(df_x_selection, npartitions=n_partitions)
grid_searcher.best_estimator_.predict_proba(ddf.get_partition(0))
But if I apply it to every partition, then it fails:
n_partitions = 1000
ddf = dd.from_pandas(df_x_selection, npartitions=n_partitions)
def _f(_df, _pipeline, _predicted_class) -> np.array:
return _pipeline.predict_proba(_df)[:, _predicted_class]
ddf.map_partitions(_f, grid_searcher.best_estimator_, 1, meta=(None, 'f8')).compute()
The error is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
130 raise ValueError(
--> 131 f"Wrong number of items passed {len(self.values)}, "
132 f"placement implies {len(self.mgr_locs)}"
ValueError: Wrong number of items passed 79, placement implies 100
What am I doing wrong?
Thanks
When i am running this script-->
tokenizer.fit_on_texts(df['text'].values)
sequences = tokenizer.texts_to_sequences(df['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
I am getting this error
AttributeError Traceback (most recent call
last)
<ipython-input-4-7c08b89b116a> in <module>()
----> 1 tokenizer.fit_on_texts(df['text'].values)
2 sequences = tokenizer.texts_to_sequences(df['text'].values)
3 word_index = tokenizer.word_index
4 print('Found %s unique tokens.' % len(word_index))
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in
fit_on_texts(self, texts)
220 self.filters,
221 self.lower,
--> 222 self.split)
223 for w in seq:
224 if w in self.word_counts:
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in
text_to_word_sequence(text, filters, lower, split)
41 """
42 if lower:
---> 43 text = text.lower()
44
45 if sys.version_info < (3,):
AttributeError: 'float' object has no attribute 'lower'
My size of CSV file is 6970963 when I reduce the size it works, is there any size limit of keras Tokenizer or I am doing something wrong
I guess file size is not the issue, try using a try block and look at the data your are passing. Use some thing like this instead of the line
#instead of this
tokenizer.fit_on_texts(df['text'].values)
#use this to look at the data when it is causing that error.
try:
tokenizer.fit_on_texts(df['text'].values)
except Exception as e:
print("exceiption is", e)
print("data passedin ", df['text'].values)
Then you can accordingly fix the error you are getting.
Check the datatype of the text you are fitting the tokenizer on. It sees it as a float instead of string. You need to convert to string before fitting a tokenizer on it.
Try something like this:
train_x = [str(x[1]) for x in train_x]
Although it is an old thread, but still following could be answer.
You data may have nan, which are interpreted as a float instead of nan. either force the type as str(word) or remove the nan using data.fillna('empty')