parallelise prediction with `map_partitions` - dask

I have a dataframe of shape (25M, 79) and I'm trying to parallelise an sklearn pipeline's prediction over it.
When I run it for just one partition, it works as expected:
import dask.dataframe as dd

n_partitions = 1000
ddf = dd.from_pandas(df_x_selection, npartitions=n_partitions)
grid_searcher.best_estimator_.predict_proba(ddf.get_partition(0))
But if I apply it to every partition, then it fails:
n_partitions = 1000
ddf = dd.from_pandas(df_x_selection, npartitions=n_partitions)
import numpy as np

def _f(_df, _pipeline, _predicted_class) -> np.ndarray:
    return _pipeline.predict_proba(_df)[:, _predicted_class]

ddf.map_partitions(_f, grid_searcher.best_estimator_, 1, meta=(None, 'f8')).compute()
The error is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
130 raise ValueError(
--> 131 f"Wrong number of items passed {len(self.values)}, "
132 f"placement implies {len(self.mgr_locs)}"
ValueError: Wrong number of items passed 79, placement implies 100
What am I doing wrong?
Thanks
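For reference, here is a self-contained sketch of the same map_partitions pattern on toy data (the names are hypothetical, not the pipeline from the question). It shows what meta=(None, 'f8') declares: that each partition maps to an unnamed float64 Series.

import dask.dataframe as dd
import numpy as np
import pandas as pd

# Toy frame split into two partitions
pdf = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
toy = dd.from_pandas(pdf, npartitions=2)

def row_sum(part):
    # Each partition arrives as a plain pandas DataFrame;
    # return a float64 Series per partition
    return part.sum(axis=1)

# meta=(None, 'f8') tells Dask the output is an unnamed float64 Series
result = toy.map_partitions(row_sum, meta=(None, 'f8')).compute()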

Related

Ran into "TypeError: '<' not supported between instances of 'Tensor' and 'list'" when going through dataset

I am replicating ResNet (source: https://arxiv.org/abs/1512.03385).
I ran into the error "TypeError: '<' not supported between instances of 'Tensor' and 'list'" when trying to iterate over several different datasets in different sections of my code.
I tried different fixes, but none worked: (i) I removed enumerate because I worried it might be causing the problem; (ii) I iterated over the dataloader rather than the dataset, but that didn't work either.
1st time: When I tried to view images:
for images, _ in train_loader:
    print('images.shape:', images.shape)
    plt.figure(figsize=(16, 8))
    plt.axis('off')
    plt.imshow(torchvision.utils.make_grid(images, nrow=16).permute((1, 2, 0)))
    break
2nd/3rd time: when I tried to validate/test the resnet:
with torch.no_grad():
    for j, inputs, labels in enumerate(test_loader, start=0):
        outputs = resnet_models[i](inputs)
        _, prediction = torch.max(outputs, dim=1)
You may notice that I didn't run into this error when training the resnet, and the code is quite similar:
for batch, data in enumerate(train_dataloader, start=0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)
Error message (taking the first error as an example; the rest are pretty much the same):
TypeError Traceback (most recent call last)
Input In [38], in <cell line: 8>()
6 print("Images AFTER NORMALIZATION")
7 print("--------------------------")
----> 8 for images, _ in training_data:
9 sort=False
10 print('images.shape:', images.shape)
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torch/utils/data/dataset.py:471, in Subset.__getitem__(self, idx)
469 if isinstance(idx, list):
470 return self.dataset[[self.indices[i] for i in idx]]
--> 471 return self.dataset[self.indices[idx]]
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/datasets/cifar.py:118, in CIFAR10.__getitem__(self, index)
115 img = Image.fromarray(img)
117 if self.transform is not None:
--> 118 img = self.transform(img)
120 if self.target_transform is not None:
121 target = self.target_transform(target)
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/transforms/transforms.py:95, in Compose.__call__(self, img)
93 def __call__(self, img):
94 for t in self.transforms:
---> 95 img = t(img)
96 return img
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
1106 # If we don't have any hooks, we want to skip the rest of the logic in
1107 # this function, and just call forward.
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/transforms/transforms.py:707, in RandomHorizontalFlip.forward(self, img)
699 def forward(self, img):
700 """
701 Args:
702 img (PIL Image or Tensor): Image to be flipped.
(...)
705 PIL Image or Tensor: Randomly flipped image.
706 """
--> 707 if torch.rand(1) < self.p:
708 return F.hflip(img)
709 return img
TypeError: '<' not supported between instances of 'Tensor' and 'list'
I was getting the same error message, probably under different circumstances, but I just found my own bug and figured I would share it for other readers. I was using a torchvision transform in my dataset, which the dataloader was loading from. The transform was
torchvision.transforms.RandomHorizontalFlip([0.5]),
and the bug is that the argument to this transform should not be a list; it should be a plain float:
torchvision.transforms.RandomHorizontalFlip(0.5),
So if there is anything I can recommend, it's to check whether a list argument is being passed to some transform (or elsewhere) where a scalar is expected.
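As a minimal sketch of why the list breaks (the surrounding Compose is hypothetical): RandomHorizontalFlip stores its argument as self.p, and its forward pass evaluates torch.rand(1) < self.p, which is exactly the Tensor-vs-list comparison in the traceback above.

import torchvision.transforms as T

# Broken: p is stored as the list [0.5], so forward() evaluates
# torch.rand(1) < [0.5] and raises
# TypeError: '<' not supported between instances of 'Tensor' and 'list'
# transform = T.Compose([T.RandomHorizontalFlip([0.5]), T.ToTensor()])

# Fixed: p is a plain float, as the transform expects
transform = T.Compose([
    T.RandomHorizontalFlip(0.5),
    T.ToTensor(),
])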

Why does using X[0] in MNIST classifier code give me an error?

I was learning to do classification with the MNIST dataset, and I got an error which I am not able to figure out. I have done a lot of Google searches and I am not able to fix it; maybe you are an expert and can help me. Here is the code:
>>> from sklearn.datasets import fetch_openml
>>> mnist = fetch_openml('mnist_784', version=1)
>>> mnist.keys()
output:
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])
>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
output:(70000, 784)
>>> y.shape
output:(70000,)
>>> X[0]
output:
KeyError Traceback (most recent call last)
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-10-19c40ecbd036> in <module>
----> 1 X[0]
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2904 if self.columns.nlevels > 1:
2905 return self._getitem_multilevel(key)
-> 2906 indexer = self.columns.get_loc(key)
2907 if is_integer(indexer):
2908 indexer = [indexer]
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 0
Please answer; it is probably a silly mistake because I am a beginner in ML. It would be really helpful if you gave me a hint as well.
The API of fetch_openml changed between versions. In earlier versions, it returned a numpy.ndarray. Since 0.24.0 (December 2020), the as_frame argument of fetch_openml defaults to 'auto' (instead of the earlier default, False), which gives you a pandas.DataFrame for the MNIST data. You can force the data to be read as a numpy.ndarray by setting as_frame=False. See the fetch_openml reference.
I was also facing the same problem.
scikit-learn: 0.24.0
matplotlib: 3.3.3
Python: 3.9.1
I used the code below to resolve the issue.
import matplotlib as mpl
import matplotlib.pyplot as plt

# instead of some_digit = X[0]
some_digit = X.to_numpy()[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()
You don't need to downgrade your scikit-learn library if you follow the code below:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.keys()
If you load the dataset as a DataFrame and want to access the images, you have two ways to do it:
Transform the DataFrame into an array
# Transform the DataFrame into an array and check the first value
some_digit = X.to_numpy()[0]
# Reshape it to (28, 28). Note: 28 x 28 = 784; if the array doesn't
# have 784 elements, the reshape will fail and the image can't be shown
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()
Transform the row
# Transform the row of your choosing into an array
some_digit = X.iloc[0, :].values
# Reshape it to (28, 28). Note: 28 x 28 = 784; if the array doesn't
# have 784 elements, the reshape will fail and the image can't be shown
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()

Saving extracted features in CNN

I've just started learning machine learning algorithms. I would like to train a VGG-16 network on my own dataset, and I am using tflearn.DNN to simulate the VGG net.
I want to save the output (which is a tensor) of the fully connected layer that extracts 4096 features into a file. I wanted to know how to save these features.
When I ran the following lines:
feed_dict = feed_dict_builder(X, Y, model.inputs, model.targets)
output = model.predictor.evaluate(feed_dict, convnet1)
print(output)
output.save('features.npy')
I got the following exception and error:
Exception in thread Thread-48:
Traceback (most recent call last):
File "/home/anupama/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/anupama/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/anupama/anaconda3/lib/python3.6/site-packages/tflearn/data_flow.py", line 187, in fill_feed_dict_queue
data = self.retrieve_data(batch_ids)
File "/home/anupama/anaconda3/lib/python3.6/site-packages/tflearn/data_flow.py", line 222, in retrieve_data
utils.slice_array(self.feed_dict[key], batch_ids)
File "/home/anupama/anaconda3/lib/python3.6/site-packages/tflearn/utils.py", line 180, in slice_array
return [x[start] for x in X]
File "/home/anupama/anaconda3/lib/python3.6/site-packages/tflearn/utils.py", line 180, in <listcomp>
return [x[start] for x in X]
IndexError: index 2 is out of bounds for axis 1 with size 2
[0.0]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-f2d62c020964> in <module>()
4 output = model.predictor.evaluate(feed_dict, convnet1)
5 print(output)
----> 6 output.save('/home/anupama/Internship/feats')
AttributeError: 'list' object has no attribute 'save'
You should save the FC layer of the network as a separate tensor and use DNN.predictor to evaluate it. Sample code:
import numpy as np
import tflearn
from tflearn.utils import feed_dict_builder
# VGG model definition
...
previous_layer = ...
fc_layer1 = tflearn.fully_connected(previous_layer, 4096, activation='relu', name='fc1')
fc_layer2 = tflearn.fully_connected(fc_layer1, 4096, activation='relu', name='fc2')
network = ...
# Training
model = tflearn.DNN(network)
model.fit(x, y)
# Evaluation
feed_dict = feed_dict_builder(x, y, model.inputs, model.targets)
output = model.predictor.evaluate(feed_dict, [fc_layer2])
np.save('features.npy', output)
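To read the saved features back later, a small usage sketch (the filename matches the snippet above):

import numpy as np

features = np.load('features.npy')
print(features.shape)  # the fc2 activations evaluated above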

Dask error - min() arg is an empty sequence

I'm trying to use Dask to handle a reasonably large dataset but I keep getting
ValueError: min() arg is an empty sequence
when I try to run .describe().compute().
I have confirmed that describe() works in plain pandas with the same dataset, so it must be Dask-related.
Here is the line I'm using:
import csv
import dask.dataframe as dd

inpFile = dd.read_csv(fPath, sep='\t', error_bad_lines=False, quoting=csv.QUOTE_NONE)
and the full error is:
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 inpFile.describe().compute()
      2 #inpFile2.describe()

/home/badrul/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in describe(self, split_every)
   1306     num = self._get_numeric_data()
   1307
-> 1308     stats = [num.count(split_every=split_every),
   1309              num.mean(split_every=split_every),
   1310              num.std(split_every=split_every),

/home/badrul/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in count(self, axis, split_every)
   1191                  token=token, split_every=split_every)
   1192     if isinstance(self, DataFrame):
-> 1193         result.divisions = (min(self.columns), max(self.columns))
   1194     return result
   1195

ValueError: min() arg is an empty sequence
The read itself doesn't run for very long, so I suspect the data isn't actually being loaded at that point. The error then comes when I run inpFile.describe().compute().

Keeping zeros in data with sklearn

I have a csv dataset that I'm trying to use with sklearn. The goal is to predict future web traffic. However, my dataset contains zeros on days that there were no visitors, and I'd like to keep those values. There are more days with zero visitors than there are with visitors (it's a tiny, tiny site). Here's a look at the data:
Col1 is the date:
10/1/11
10/2/11
10/3/11
etc....
Col2 is the # of visitors:
12
1
0
0
1
5
0
0
etc....
sklearn seems to interpret the zero values as NaN values, which is understandable. How can I use those zero values in a logistic function (is that even possible)?
Update:
The estimator is https://github.com/facebookincubator/prophet and when I run the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet

df = pd.read_csv('~/tmp/datafile.csv')
df['y'] = np.log(df['y'])
df.head()
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=365)
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
m.plot(forecast)
m.plot_components(forecast)
plt.show()
I get the following:
growthprediction.py:7: RuntimeWarning: divide by zero encountered in log
df['y'] = np.log(df['y'])
/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py:307: RuntimeWarning: invalid value encountered in double_scalars
k = (df['y_scaled'].ix[i1] - df['y_scaled'].ix[i0]) / T
Traceback (most recent call last):
File "growthprediction.py", line 11, in <module>
m.fit(df);
File "/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py", line 387, in fit
params = model.optimizing(dat, init=stan_init, iter=1e4)
File "/usr/local/lib/python3.6/site-packages/pystan/model.py", line 508, in optimizing
ret, sample = fit._call_sampler(stan_args)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 804, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.StanFit4Model._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:16585)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 398, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:8818)
RuntimeError: k initialized to invalid value (nan)
In this line of your code:
df['y'] = np.log(df['y'])
you are taking the logarithm of 0 wherever df['y'] is zero. Since log(0) is undefined, this produces the divide-by-zero warning and puts NaNs in the resulting column, which is what later crashes the Stan optimizer ("k initialized to invalid value (nan)").
sklearn itself does NOT interpret zero values as NaNs unless you replace them with NaNs in your preprocessing.
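If you want to keep the zero-visitor days through a log transform, one common workaround (not part of the original answer) is np.log1p, which computes log(1 + y) and maps a zero count to 0 instead of -inf. A minimal sketch with hypothetical data in the question's shape:

import numpy as np
import pandas as pd

# Hypothetical data: visitor counts including zero-visitor days
df = pd.DataFrame({"y": [12, 1, 0, 0, 1, 5, 0, 0]})

# log1p(0) == 0.0, so the zeros survive and no divide-by-zero warning
# is raised; invert later with np.expm1 on the predictions
df["y"] = np.log1p(df["y"])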
