Stack overflow on dask __array__ - dask

I have a rather simple program using dask:
import dask.array as darray
import numpy as np
X = np.array([[1.,2.,3.],
              [4.,5.,6.],
              [7.,8.,9.]])
arr = darray.from_array(X)
arr = arr[:,0]
a = darray.min(arr)
b = darray.max(arr)
quantiles = darray.linspace(a, b, 4)
print(np.array(quantiles))
Running this program results in an error like this:
Traceback (most recent call last):
File "discretization.py", line 12, in <module>
print(np.array(quantiles))
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/array/core.py", line 1341, in __array__
x = np.array(x)
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/array/core.py", line 1341, in __array__
x = np.array(x)
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/array/core.py", line 1341, in __array__
x = np.array(x)
[Previous line repeated 325 more times]
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/array/core.py", line 1337, in __array__
x = self.compute()
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/base.py", line 434, in compute
dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/base.py", line 220, in collections_to_dsk
[opt(dsk, keys, **kwargs) for opt, (dsk, keys) in groups.items()],
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/base.py", line 220, in <listcomp>
[opt(dsk, keys, **kwargs) for opt, (dsk, keys) in groups.items()],
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/array/optimization.py", line 42, in optimize
dsk = optimize_blockwise(dsk, keys=keys)
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/blockwise.py", line 547, in optimize_blockwise
out = _optimize_blockwise(graph, keys=keys)
File "/Users/zhujun/job/adf/local_training/venv/lib/python3.7/site-packages/dask/blockwise.py", line 572, in _optimize_blockwise
if isinstance(layers[layer], Blockwise):
File "/anaconda3/lib/python3.7/abc.py", line 139, in __instancecheck__
return _abc_instancecheck(cls, instance)
RecursionError: maximum recursion depth exceeded in comparison
Python is version 3.7.1 and dask is version 2.15.0.
What is wrong with this program?
Thanks in advance.

linspace does not (yet) accept lazy inputs from other dask objects; it needs concrete numbers. Use dask.compute to materialize these numbers first (note this requires import dask):
a, b = dask.compute(darray.min(arr), darray.max(arr))
quantiles = darray.linspace(a, b, 4)
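Putting it together, your program would look roughly like this (a sketch; note the extra import dask so that dask.compute is available):
import dask
import dask.array as darray
import numpy as np

X = np.array([[1.,2.,3.],
              [4.,5.,6.],
              [7.,8.,9.]])

arr = darray.from_array(X)
arr = arr[:,0]

# materialize the lazy min/max into concrete numbers before calling linspace
a, b = dask.compute(darray.min(arr), darray.max(arr))
quantiles = darray.linspace(a, b, 4)
print(np.array(quantiles))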

With either one of these package combinations:

dask==2.15.0
numpy<1.16.0
toolz==0.9.0

or

dask==2.16.0
numpy<1.17.0
toolz==0.9.0
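For instance, the first combination can be pinned with pip (just a sketch; any equivalent dependency manager works):
pip install "dask==2.15.0" "numpy<1.16.0" "toolz==0.9.0"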
The following program can be executed without an issue:
import dask.array as darray
import numpy as np
X = np.array([[1.,2.,3.],
              [4.,5.,6.],
              [7.,8.,9.]])
arr = darray.from_array(X)
arr = arr[:,0]
a = darray.min(arr)
b = darray.max(arr)
q0 = darray.linspace(a, b, 4)
print(np.array(q0))
The key difference in the above package lists is numpy: newer numpy versions trigger the error.
As @mdurant suggested, the implementation of linspace does not yet accept lazy inputs, so the fact that these package combinations work may well be a coincidence.
I will leave this question open until I fully understand what is happening here.

Related

Roberta on local CPU tensor mismatch at non-singleton dimension 1

I downloaded the https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment model to my local PC.
When I pull the model from the website it works perfectly fine, but it gives me a tensor mismatch error locally.
self.MODEL = "C:/Users/metehan/project1/MLTools/twitter-roberta-base-sentiment"
self.model = AutoModelForSequenceClassification.from_pretrained(self.MODEL)
self.tokenizer = AutoTokenizer.from_pretrained(self.MODEL)
self.labels = ['Negative', 'Neutral', 'Positive']
Vocabulary sizes of model and tokenizer are the same and I don't use GPU so model, tokenizer and inputs are at the same location.
encoded_tweet = self.tokenizer(eng_tweet, return_tensors='pt')
output = self.model(**encoded_tweet)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
max_value = max(scores)
(base) C:\Users\metehan\project1>python test.py
Traceback (most recent call last):
File "C:\Users\metehan\project1\MLTools\analyze_tweets.py", line 34, in analyze
output = self.model(**encoded_tweet)
File "C:\Users\metehan\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\metehan\AppData\Roaming\Python\Python39\site-packages\transformers\models\roberta\modeling_roberta.py", line 1206, in forward
outputs = self.roberta(
File "C:\Users\metehan\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\metehan\AppData\Roaming\Python\Python39\site-packages\transformers\models\roberta\modeling_roberta.py", line 814, in forward
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (685) must match the existing size (514) at non-singleton dimension 1. Target sizes: [1, 685]. Tensor sizes: [1, 514]
I tried adding padding and truncation to the tokenizer, but an index error occurred. Setting a max length on the tokenizer didn't work either.
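The attempt was roughly along these lines (a sketch; the exact arguments, and the max length of 512, are assumptions rather than the original code):
encoded_tweet = self.tokenizer(eng_tweet, return_tensors='pt',
                               padding=True, truncation=True, max_length=512)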
Any idea how to fix this?

Dask distributed LocalCluster fails with "TypeError: can't pickle _thread._local objects" when using dask.array.store to hdf5 file

I'm running on one machine with 16 cores and 64 GB RAM and want to use dask with LocalCluster, since I need the profiling tool for optimization.
I set up the LocalCluster as explained here. Still, it gives me the following error:
Traceback (most recent call last):
File "/data/myusername/anaconda3/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 38, in dumps
result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
TypeError: can't pickle _thread._local objects
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/myusername/remote_code/trials/minimal_reproducible_example.py", line 61, in <module>
create_matrix()
File "/home/myusername/remote_code/trials/minimal_reproducible_example.py", line 55, in create_matrix
da.store(w, d_set, dtype="float32")
File "/data/myusername/anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 916, in store
result.compute(**kwargs)
File "/data/myusername/anaconda3/lib/python3.7/site-packages/dask/base.py", line 175, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/data/myusername/anaconda3/lib/python3.7/site-packages/dask/base.py", line 446, in compute
results = schedule(dsk, keys, **kwargs)
File "/data/myusername/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 2499, in get
actors=actors,
File "/data/myusername/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 2426, in _graph_to_futures
"tasks": valmap(dumps_task, dsk3),
File "cytoolz/dicttoolz.pyx", line 179, in cytoolz.dicttoolz.valmap
File "cytoolz/dicttoolz.pyx", line 204, in cytoolz.dicttoolz.valmap
File "/data/myusername/anaconda3/lib/python3.7/site-packages/distributed/worker.py", line 3186, in dumps_task
return {"function": dumps_function(task[0]), "args": warn_dumps(task[1:])}
File "/data/myusername/anaconda3/lib/python3.7/site-packages/distributed/worker.py", line 3195, in warn_dumps
b = dumps(obj)
File "/data/myusername/anaconda3/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 51, in dumps
return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
File "/data/myusername/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle.py", line 1108, in dumps
cp.dump(obj)
File "/data/myusername/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle.py", line 473, in dump
return Pickler.dump(self, obj)
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 786, in save_tuple
save(element)
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 856, in save_dict
self._batch_setitems(obj.items())
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 882, in _batch_setitems
save(v)
File "/data/myusername/anaconda3/lib/python3.7/pickle.py", line 524, in save
rv = reduce(self.proto)
TypeError: can't pickle _thread._local objects
I use the latest versions of all packages that are, AFAIK, needed:
python 3.7.3 with anaconda3 on ubuntu 18.04 LTS
dask: 2.3.0
distributed: 2.3.0
bokeh: 1.3.4
cytoolz: 0.10.0
h5py: 2.9.0
Here is the minimal reproducible example:
import os
import dask.array as da
import h5py
import numpy as np
from dask.distributed import Client

MY_USER_NAME = "myusername"
EARTH_RADIUS = 6372.795
CHUNK_SIZE = 5000
N = 20000

def create_matrix():
    lat_vec = np.random.random(N) * 90
    lon_vec = np.random.random(N) * 180
    lat_vec = np.radians(lat_vec)
    lon_vec = np.radians(lon_vec)
    sin_lat_vec = np.sin(lat_vec)
    cos_lat_vec = np.cos(lat_vec)

    def _blocked_calculate_great_circle_distance(block, block_info=None):
        loc = block_info[0]['array-location']
        (row_start, row_stop) = loc[0]
        (col_start, col_stop) = loc[1]
        # see https://en.wikipedia.org/wiki/Great-circle_distance
        # and https://github.com/ulope/geopy/blob/master/geopy/distance.py
        row_lon = lon_vec[row_start:row_stop]
        col_lon = lon_vec[col_start:col_stop]
        delta_lon = row_lon[:, np.newaxis] - col_lon
        cos_delta_lon = np.cos(delta_lon)
        central_angle = np.arccos(
            sin_lat_vec[row_start:row_stop, np.newaxis] * sin_lat_vec[col_start:col_stop] +
            cos_lat_vec[row_start:row_stop, np.newaxis] * cos_lat_vec[col_start:col_stop]
            * cos_delta_lon)
        return EARTH_RADIUS * central_angle

    dir_path = "/home/" + MY_USER_NAME + "/minimum_reproducible_example/"
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    file_path = os.path.join(dir_path, "matrix.hdf5")
    if os.path.exists(file_path):
        os.remove(file_path)

    with h5py.File(file_path) as f:
        d_set = f.create_dataset('/data', shape=(N, N), dtype='f4', fillvalue=0)
        w = da.from_array(d_set, chunks=(CHUNK_SIZE, CHUNK_SIZE))
        w = w.map_blocks(_blocked_calculate_great_circle_distance, chunks=(CHUNK_SIZE, CHUNK_SIZE), dtype='f4')
        da.store(w, d_set, dtype="float32")

if __name__ == '__main__':
    client = Client(processes=False)
    create_matrix()
Can anybody help me with this?

Neural Network Dense Layer Error in Shape attribute

I have created a feed-forward neural network, but it is giving a TypeError despite my changing the datatype of the parameter. I am really new to Keras and machine learning, so I would appreciate as detailed help as possible. I am attaching the code snippet and the error log below. CODE-
num_of_features = X_train.shape[1]
nb_classes = Y_train.shape[1]

def baseline_model():
    def branch2(x):
        x = Dense(np.floor(num_of_features*50), activation='sigmoid')(x)
        x = Dropout(0.75)(x)
        x = Dense(np.floor(num_of_features*20), activation='sigmoid')(x)
        x = Dropout(0.5)(x)
        x = Dense(np.floor(num_of_features), activation='sigmoid')(x)
        x = Dropout(0.1)(x)
        return x

    main_input = Input(shape=(num_of_features,), name='main_input')
    x = main_input
    x = branch2(x)
    main_output = Dense(nb_classes, activation='softmax')(x)
    model = Model(input=main_input, output=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy', 'categorical_crossentropy'])
    return model

model = baseline_model()
ERROR-
Traceback (most recent call last):
File "h2_fit_neural.py", line 143, in <module>
model = baseline_model()
File "h2_fit_neural.py", line 137, in baseline_model
x = branch2(x)
File "h2_fit_neural.py", line 124, in branch2
x = Dense(np.floor(num_of_features*50), activation='sigmoid')(x)
File "/home/shashank/tensorflow/lib/python3.6/site-packages/keras/engine/base_layer.py", line 432, in __call__
self.build(input_shapes[0])
File "/home/shashank/tensorflow/lib/python3.6/site-packages/keras/layers/core.py", line 872, in build
constraint=self.kernel_constraint)
File "/home/shashank/tensorflow/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/shashank/tensorflow/lib/python3.6/site-packages/keras/engine/base_layer.py", line 249, in add_weight
weight = K.variable(initializer(shape),
File "/home/shashank/tensorflow/lib/python3.6/site-packages/keras/initializers.py", line 218, in __call__
dtype=dtype, seed=self.seed)
File "/home/shashank/tensorflow/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 4077, in random_uniform
dtype=dtype, seed=seed)
File "/home/shashank/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/random_ops.py", line 242, in random_uniform
rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
File "/home/shashank/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/gen_random_ops.py", line 674, in random_uniform
name=name)
File "/home/shashank/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 609, in _apply_op_helper
param_name=input_name)
File "/home/shashank/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'shape' has DataType float32 not in list of allowed values: int32, int64
Why are you using np.floor for the shape in your Dense layers? It produces a float, and you need an int there. Removing np.floor should solve your problem.
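For example, the first Dense layer in branch2 would become (a minimal sketch of that change, using int() to get an integer unit count):
x = Dense(int(num_of_features * 50), activation='sigmoid')(x)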

Isolating the topk routine of dask

I am trying to isolate the topk routine from dask.
Somehow it dies in isolation.
Apparently a numpy array, rather than a dask array, is passed to the x argument during the recursion.
The original source code for topk is at: https://github.com/dask/dask/blob/master/dask/array/routines.py
Test program:
import numpy as np
import dask.array as da
from dask.base import tokenize
from operator import getitem
import dask.sharedict as sharedict
from dask.array.core import Array

def topk(k, x):
    if x.ndim != 1:
        raise ValueError("Topk only works on arrays of one dimension")
    token = tokenize(k, x)
    name = 'chunk.topk-' + token
    dsk = {(name, i): (topk, k, key)
           for i, key in enumerate(x.__dask_keys__())}
    name2 = 'topk-' + token
    dsk[(name2, 0)] = (getitem, (np.sort, (np.concatenate, list(dsk))),
                       slice(-1, -k - 1, -1))
    chunks = ((k,),)
    return Array(sharedict.merge((name2, dsk), x.dask), name2, chunks, dtype=x.dtype)

def main():
    x = np.arange(12)*8
    y = da.from_array(x, 7)
    print(y.topk(2).compute())
    print(topk(2, y).compute())

main()
Error:
File "test_dask_argtopk.py", line 40, in <module>
main()
File "test_dask_argtopk.py", line 38, in main
print(topk(2, y).compute())
File "test_dask_argtopk.py", line 27, in topk
for i, key in enumerate(x.__dask_keys__())}
AttributeError: 'Array' object has no attribute '__dask_keys__'

Dimension mismatch error with scikit pipeline FeatureUnion

This is my first post. I've been trying to combine features with FeatureUnion and Pipeline, but when I add a tf-idf + SVD pipeline the test fails with a 'dimension mismatch' error. My simple task is to create a regression model to predict search relevance. The code and errors are reported below. Is there something wrong in my code?
df = read_tsv_data(input_file)
df = tokenize(df)
df_train, df_test = train_test_split(df, test_size = 0.2, random_state=2016)
x_train = df_train['sq'].values
y_train = df_train['relevance'].values
x_test = df_test['sq'].values
y_test = df_test['relevance'].values

# char ngrams
char_ngrams = CountVectorizer(ngram_range=(2,5), analyzer='char_wb', encoding='utf-8')
# TFIDF word ngrams
tfidf_word_ngrams = TfidfVectorizer(ngram_range=(1, 4), analyzer='word', encoding='utf-8')
# SVD
svd = TruncatedSVD(n_components=100, random_state = 2016)
# SVR
svr_lin = SVR(kernel='linear', C=0.01)

pipeline = Pipeline([
    ('feature_union',
     FeatureUnion(
         transformer_list = [
             ('char_ngrams', char_ngrams),
             ('char_ngrams_svd_pipeline', make_pipeline(char_ngrams, svd)),
             ('tfidf_word_ngrams', tfidf_word_ngrams),
             ('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
         ]
     )
    ),
    ('svr_lin', svr_lin)
])

model = pipeline.fit(x_train, y_train)
y_pred = model.predict(x_test)
When adding the pipeline below to the FeatureUnion list:
('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
The exception below is generated:
2016-07-31 10:34:08,712 : Testing ... Test Shape: (400,) - Training Shape: (1600,)
Traceback (most recent call last):
File "src/model/end_to_end_pipeline.py", line 236, in <module>
main()
File "src/model/end_to_end_pipeline.py", line 233, in main
process_data(input_file, output_file)
File "src/model/end_to_end_pipeline.py", line 175, in process_data
y_pred = model.predict(x_test)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 203, in predict
Xt = transform.transform(Xt)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 523, in transform
for name, trans in self.transformer_list)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
self.results = batch()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 399, in _transform_one
return transformer.transform(X)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 291, in transform
Xt = transform.transform(Xt)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/decomposition/truncated_svd.py", line 201, in transform
return safe_sparse_dot(X, self.components_.T)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 179, in safe_sparse_dot
ret = a * b
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/sparse/base.py", line 389, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
What if you change the second svd usage to a new svd object?
transformer_list = [
    ('char_ngrams', char_ngrams),
    ('char_ngrams_svd_pipeline', make_pipeline(char_ngrams, svd)),
    ('tfidf_word_ngrams', tfidf_word_ngrams),
    ('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, clone(svd)))
]
It seems your problem occurs because you're using the same svd object twice. It is fitted the first time on the CountVectorizer's output and the second time on the TfidfVectorizer's output (or vice versa), so when you call predict on the whole pipeline this svd object cannot handle the CountVectorizer's output, because it was fitted on the TfidfVectorizer's output (or, again, vice versa).
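Note that clone comes from sklearn.base; a minimal sketch of the import plus the changed entry (same names as in your code):
from sklearn.base import clone
('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, clone(svd)))
clone gives you a fresh, unfitted copy of the estimator with the same parameters, so each sub-pipeline gets its own TruncatedSVD.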
