Error using PBSCluster with multiple nodes - dask

I'm encountering some strange behaviour when running xarray with a dask client on our PBS cluster.
When I request multiple nodes, i.e. cluster.scale(<int>) with a value greater than 1, the process constantly fails with the error below.
The strange thing is that with only one machine, cluster.scale(1), the process runs smoothly.
Code to start the workers:
'''
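from dask.distributed import Client
from dask_jobqueue import PBSCluster

# One PBS job per worker: 16 cores, 100 GB, a single worker process per job
cluster = PBSCluster(queue='some_q_name',
                     project='project1',
                     cores=16,
                     memory='100GB',
                     processes=1,
                     walltime='48:00:00')
cluster.scale(2)  # request two workers
client = Client(cluster)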
'''
Error:
'''
distributed.utils - ERROR - "('mean_combine-partial-8d870d263fee8efd863c386e1a41a643', 48, 0, 0, 0)"
Traceback (most recent call last):
File "/work/stavn/anaconda3/envs/Dask_v2/lib/python3.7/site-packages/distributed/utils.py", line 656, in log_errors
yield
File "/work/stavn/anaconda3/envs/Dask_v2/lib/python3.7/site-packages/distributed/scheduler.py", line 1736, in add_worker
typename=types[key],
KeyError: "('mean_combine-partial-8d870d263fee8efd863c386e1a41a643', 48, 0, 0, 0)"
distributed.core - ERROR - "('mean_combine-partial-8d870d263fee8efd863c386e1a41a643', 48, 0, 0, 0)"
Traceback (most recent call last):
File "/work/stavn/anaconda3/envs/Dask_v2/lib/python3.7/site-packages/distributed/core.py", line 459, in handle_comm
result = await result
File "/work/stavn/anaconda3/envs/Dask_v2/lib/python3.7/site-packages/distributed/scheduler.py", line 1736, in add_worker
typename=types[key],
KeyError: "('mean_combine-partial-8d870d263fee8efd863c386e1a41a643', 48, 0, 0, 0)"
'''
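
One thing that may be worth ruling out here, as an assumption rather than a confirmed diagnosis, is a mismatch between the dask/distributed versions seen by the client, the scheduler, and the workers started on different nodes. The sketch below compares package versions in a dictionary shaped like the one returned by client.get_versions(); the helper name find_version_mismatches and the sample data are hypothetical.

```python
def find_version_mismatches(versions, packages=("dask", "distributed")):
    """Return {package: {role: version}} for packages whose versions differ.

    `versions` is assumed to follow the layout of
    distributed.Client.get_versions(): "client" and "scheduler" entries plus
    a "workers" mapping, each carrying a "packages" dictionary.
    """
    seen = {
        "client": versions["client"]["packages"],
        "scheduler": versions["scheduler"]["packages"],
    }
    for address, info in versions["workers"].items():
        seen[address] = info["packages"]

    mismatches = {}
    for package in packages:
        found = {role: pkgs.get(package) for role, pkgs in seen.items()}
        if len(set(found.values())) > 1:
            mismatches[package] = found
    return mismatches


# Hypothetical sample data; on a live cluster pass client.get_versions() instead.
sample = {
    "client": {"packages": {"dask": "2.9.0", "distributed": "2.9.0"}},
    "scheduler": {"packages": {"dask": "2.9.0", "distributed": "2.9.0"}},
    "workers": {
        "tcp://10.0.0.1:40000": {"packages": {"dask": "2.9.0", "distributed": "2.9.0"}},
        "tcp://10.0.0.2:40000": {"packages": {"dask": "2.9.0", "distributed": "2.8.1"}},
    },
}
print(find_version_mismatches(sample))
```

On a running cluster, client.get_versions(check=True) performs a comparable check itself and raises if the required packages do not match.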

Related

How to resolve "[WinError 2] The system cannot find the file specified" while creating an ROI with SCP

2022-09-06T15:53:28 WARNING Traceback (most recent call last):
File "C:/Users/Baghban/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\SemiAutomaticClassificationPlugin\dock\scpdock.py", line 1736, in pointerClickROI
self.createROI(cfg.pntROI)
File "C:/Users/Baghban/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\SemiAutomaticClassificationPlugin\dock\scpdock.py", line 2833, in createROI
o = cfg.utls.multiProcessNoBlocks(rasterPath = tPMD, bandNumberList = bandNumberList2, functionRaster = cfg.utls.regionGrowingAlgMultiprocess, functionBandArgument = functionList, functionVariable = fVarList, progressMessage = cfg.QtWidgetsSCP.QApplication.translate('semiautomaticclassificationplugin', 'Region growing'))
File "C:/Users/Baghban/AppData/Roaming/QGIS/QGIS3\profiles\default/python/plugins\SemiAutomaticClassificationPlugin\core\utils.py", line 4688, in multiProcessNoBlocks
cfg.pool = cfg.poolSCP(processes=threadNumber)
File "C:\PROGRA~1\QGIS3~1.16\apps\Python37\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "C:\PROGRA~1\QGIS3~1.16\apps\Python37\lib\multiprocessing\pool.py", line 176, in __init__
self._repopulate_pool()
File "C:\PROGRA~1\QGIS3~1.16\apps\Python37\lib\multiprocessing\pool.py", line 241, in _repopulate_pool
w.start()
File "C:\PROGRA~1\QGIS3~1.16\apps\Python37\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "C:\PROGRA~1\QGIS3~1.16\apps\Python37\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\PROGRA~1\QGIS3~1.16\apps\Python37\lib\multiprocessing\popen_spawn_win32.py", line 48, in __init__
None, None, False, 0, None, None, None)
FileNotFoundError: [WinError 2] The system cannot find the file specified
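
The traceback ends in popen_spawn_win32.py, where multiprocessing launches a new interpreter process. A possible cause, stated here as an assumption, is that the executable multiprocessing tries to start does not exist at the expected path inside the QGIS installation. The sketch below checks a candidate path and, only if it exists, passes it to multiprocessing.set_executable(); the function name use_interpreter_if_present is hypothetical.

```python
import multiprocessing
import os
import sys


def use_interpreter_if_present(path=None):
    """Point multiprocessing at `path` if that file exists.

    Returns True when the executable was set, False when the file is missing.
    With no argument, the current sys.executable is checked.
    """
    path = sys.executable if path is None else path
    if not path or not os.path.isfile(path):
        return False
    # Child processes created by multiprocessing will be started with `path`.
    multiprocessing.set_executable(path)
    return True


# A path that does not exist is rejected instead of failing later in Pool().
print(use_interpreter_if_present(os.path.join("no_such_dir", "python.exe")))
```

In the QGIS Python console this could be called with the location of the python.exe bundled with QGIS (which depends on the installation) before using the ROI tool.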

Tensorboard cannot write data on pytorch: FileNotFoundError: [Errno 2] No such file or directory

I would like to log some data using tensorboard on my pytorch script. This is the structure of my program:
import torch
from torch.utils.tensorboard import SummaryWriter
#[methods...]
def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    writer = SummaryWriter("runs/last_train")
    #[some code to init the model ...]
    for t in range(epochs):
        print(f"Epoch {t+1}\n-------------------------------")
        train_loss, train_acc = train_loop(train_dataloader, model, loss_fn, optimizer)
        val_loss, val_acc = test_loop(val_dataloader, model, loss_fn)
        writer.add_scalar('Loss/train', train_loss, t)
        writer.add_scalar('Loss/val', val_loss, t)
        writer.add_scalar('Accuracy/train', train_acc, t)
        writer.add_scalar('Accuracy/val', val_acc, t)
    writer.flush()
    print("Done!")
    writer.close()
if __name__ == "__main__":
    main()
The code seems to stop at writer.flush(), with this error:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/iaslab/.local/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
self._record_writer.write(data)
File "/home/iaslab/.local/lib/python3.8/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
self._writer.write(header + header_crc + data + footer_crc)
File "/home/iaslab/.local/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
self.fs.append(self.filename, file_content, self.binary_mode)
File "/home/iaslab/.local/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
self._write(filename, file_content, "ab" if binary_mode else "a")
File "/home/iaslab/.local/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 164, in _write
with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'runs/last_train/events.out.tfevents.1655714469.iaslab-Dell-G15-Special-Edition-5521.10389.0'
The program then hangs. After pressing CTRL+C, I got:
Traceback (most recent call last):
File "trainMLP.py", line 99, in <module>
main()
File "trainMLP.py", line 92, in main
writer.flush()
File "/home/iaslab/.local/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 1039, in flush
writer.flush()
File "/home/iaslab/.local/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 132, in flush
self.event_writer.flush()
File "/home/iaslab/.local/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 121, in flush
self._async_writer.flush()
File "/home/iaslab/.local/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 176, in flush
self._byte_queue.join()
File "/usr/lib/python3.8/queue.py", line 89, in join
self.all_tasks_done.wait()
File "/usr/lib/python3.8/threading.py", line 302, in wait
waiter.acquire()
KeyboardInterrupt
What is the problem? The file actually exists, I have checked.
Thank you in advance!
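
The failing path in the traceback, runs/last_train/..., is relative. If the working directory changes between creating the writer and flushing it (an assumption; the elided code is not shown), the relative path would no longer point at the file that was checked. A minimal sketch of fixing the log directory to an absolute path up front; the helper name absolute_log_dir is hypothetical.

```python
from pathlib import Path


def absolute_log_dir(relative_dir, base=None):
    """Return `relative_dir` anchored to `base` (default: current directory)
    as an absolute path, so later directory changes do not affect it."""
    base = Path.cwd() if base is None else Path(base)
    return (base / relative_dir).absolute()


log_dir = absolute_log_dir("runs/last_train")
print(log_dir)
# The result could then be passed on, e.g. SummaryWriter(str(log_dir)).
```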

Python error when reading a shapefile with PySal

I'm currently working with the PySal library. I'm using the queen_from_shapefile() function, and Python returns an error for some .shp files but works perfectly for the others. All shapefiles were created in the same way. They are area (polygon) shapefiles.
Here is the error:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
graph(1850,117)
File "C:\Users\jbeverag\Desktop\graph_queen_fonction.py", line 37, in graph
qW = ps.queen_from_shapefile(str(planche)+".shp")
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\weights\user.py", line 67, in queen_from_shapefile
w = Queen.from_shapefile(shapefile, idVariable=idVariable)
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\weights\Contiguity.py", line 255, in from_shapefile
w = cls(iterable, ids=ids, id_order=id_order, **kwargs)
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\weights\Contiguity.py", line 199, in __init__
criterion=criterion, method=method)
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\weights\Contiguity.py", line 383, in _build
neighbor_data = ContiguityWeightsPolygons(polygons, wttype=wttype).w
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\weights\_contW_binning.py", line 68, in __init__
self.do_weights()
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\weights\_contW_binning.py", line 98, in do_weights
shpObj = self.collection[i]
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\core\FileIO.py", line 162, in __getitem__
return self.by_row.__getitem__(key)
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\core\FileIO.py", line 145, in __getitem__
return self.p.get(key)
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\core\FileIO.py", line 269, in get
obj = self.__read()
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\core\FileIO.py", line 312, in __read
row = self._read()
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\core\IOHandlers\pyShpIO.py", line 142, in _read
rec = self.dataObj.get_shape(self.pos)
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\core\util\shapefile.py", line 362, in get_shape
return self.shape.unpack(bufferIO(self.fileObj.read(byts)))
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\core\util\shapefile.py", line 633, in unpack
record = _unpackDict(cls.USTRUCT, dat)
File "C:\Users\jbeverag\AppData\Local\Programs\Python\Python36\lib\site-packages\pysal\core\util\shapefile.py", line 136, in _unpackDict
fileObj.read(struct['size']))
struct.error: unpack requires a buffer of 44 bytes
Thanks for your help,
Lacafed
Rebuilding the .shp file fixed the problem, but I don't know what the origin of the problem was.
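
The message "unpack requires a buffer of 44 bytes" suggests that a read returned fewer bytes than a record needs, which would be consistent with a truncated or damaged .shp file (an assumption, since the files themselves are not shown). A shapefile's 100-byte main header stores the file code 9994 at byte 0 and the total file length, in 16-bit words, at byte 24, both big-endian. The sketch below compares that declared length with the actual size on disk; the helper names are hypothetical.

```python
import struct

SHP_HEADER_SIZE = 100   # size of the main .shp header in bytes
SHP_FILE_CODE = 9994    # magic number stored at byte 0 (big-endian)


def declared_shp_length(header):
    """Return the file length in bytes that a .shp main header declares."""
    if len(header) < SHP_HEADER_SIZE:
        raise ValueError("header is shorter than 100 bytes")
    (file_code,) = struct.unpack(">i", header[0:4])
    if file_code != SHP_FILE_CODE:
        raise ValueError("not a .shp main header")
    # The length field counts 16-bit words, so multiply by 2 for bytes.
    (length_in_words,) = struct.unpack(">i", header[24:28])
    return length_in_words * 2


def is_truncated(path):
    """True if the file on disk is shorter than its header says it should be."""
    with open(path, "rb") as f:
        header = f.read(SHP_HEADER_SIZE)
        f.seek(0, 2)          # jump to the end to get the real size
        actual_size = f.tell()
    return actual_size < declared_shp_length(header)
```

Running is_truncated() over the shapefiles that fail and those that work would show whether the failing ones are shorter than their headers claim.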

Client request for tensorflow serving gives error "Attempting to use uninitialized value fully_connected/biases"

I created an LSTM RNN model for text classification in TensorFlow and exported the SavedModel successfully. I tested the model using the SavedModel CLI and everything seems to be working fine. However, I am trying to create a client that can make a request and get a result. I have been following the TensorFlow Serving Inception example (more specifically inception_client.py) for reference. This works well with the Inception model, but I am not sure how to change the request for my own model. How exactly should I change the request?
My signature and saving the model:
# Build the signature_def_map.
classification_signature = signature_def_utils.build_signature_def(
    inputs={signature_constants.CLASSIFY_INPUTS: classification_inputs},
    outputs={
        signature_constants.CLASSIFY_OUTPUT_CLASSES:
            classification_outputs_classes,
    },
    method_name=signature_constants.CLASSIFY_METHOD_NAME)
legacy_init_op = tf.group(
    tf.tables_initializer(), name='legacy_init_op')
# Add the signatures to the servable.
builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
            classification_signature
    },
    assets_collection=tf.get_collection(tf.GraphKeys.ASSET_FILEPATHS),
    legacy_init_op=tf.group(assign_filename_op))
print("added meta graph and variables")
builder.save()
print("model saved")
The model takes inputs_ as its input, which is a list of lists of integers (e.g. [[1,3,4,5,2]]).
inputs_ = tf.placeholder(tf.int32, [None, None], name="input_ints")
How I am using the savedModel CLI (returns right results):
$ saved_model_cli run --dir ./python2_SavedModelFinalInputInts --tag_set serve --signature_def 'serving_default' --input_exprs inputs='[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2634, 758, 938, 579, 1868, 1894, 24, 651, 572, 32, 1847, 232]]'
More information about the savedModel:
$ saved_model_cli show --dir ./python2_prediction_SavedModelFinalInputInts --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['inputs'] tensor_info:
        dtype: DT_INT32
        shape: (-1, -1)
        name: inputs/input_ints:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['outputs'] tensor_info:
        dtype: DT_FLOAT
        shape: (1, 1)
        name: predictions/fully_connected/Sigmoid:0
  Method name is: tensorflow/serving/predict
How I am trying to create a request in the client code:
request1 = predict_pb2.PredictRequest()
request1.model_spec.name = 'mnist'
request1.model_spec.signature_name = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
request1.inputs[signature_constants.PREDICT_INPUTS].CopyFrom(tf.contrib.util.make_tensor_proto(input_nums, shape=[1,100],dtype=tf.int32))
response = stub.Predict(request1,1.0)
result_dict = { 'Analyst Rating': str(response.message) }
return jsonify(result_dict)
I am getting the following error:
[2017-11-29 19:03:29,318] ERROR in app: Exception on /analyst_rating [POST]
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1612, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1598, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/local/lib/python2.7/dist-packages/flask_restful/__init__.py", line 480, in wrapper
resp = resource(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/flask/views.py", line 84, in view
return self.dispatch_request(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/flask_restful/__init__.py", line 595, in dispatch_request
resp = meth(*args, **kwargs)
File "restApi.py", line 91, in post
response = stub.Predict(request,1)
File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
self._request_serializer, self._response_deserializer)
File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
raise _abortion_error(rpc_error_call)
AbortionError: AbortionError(code=StatusCode.FAILED_PRECONDITION, details="Attempting to use uninitialized value fully_connected/biases
[[Node: fully_connected/biases/read = Identity[T=DT_FLOAT, _class=["loc:@fully_connected/biases"], _output_shapes=[[1]], _device="/job:localhost/replica:0/task:0/cpu:0"](fully_connected/biases)]]")
127.0.0.1 - - [29/Nov/2017 19:03:29] "POST /analyst_rating HTTP/1.1" 500 -
{"message": "Internal Server Error"}
Update:
Changing the model's signature from a classification signature to a prediction signature seemed to work: requests now return results. I also changed the legacy_init_op argument from tf.group(assign_filename_op), which I had originally used for asset initialization, to the legacy_init_op built from tf.tables_initializer().
prediction_signature = (tf.saved_model.signature_def_utils.build_signature_def(
    inputs={signature_constants.PREDICT_INPUTS: prediction_inputs},
    outputs={signature_constants.PREDICT_OUTPUTS: prediction_outputs},
    method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))
legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
# Add the signatures to the servable.
builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
            prediction_signature
    },
    # assets_collection=tf.get_collection(tf.GraphKeys.ASSET_FILEPATHS),
    legacy_init_op=legacy_init_op)
I am not entirely sure what the client request should look like for a model with a classification signature, or why it was not working.
(If anyone has an explanation, I will select that as the correct answer.)
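
The client code above sends input_nums with shape=[1,100], and the saved_model_cli example passes a sequence padded on the left with zeros. A minimal sketch of preparing such an input in plain Python, assuming a fixed length of 100 and left zero-padding (both inferred from the snippets above); pad_left is a hypothetical helper, not part of TensorFlow.

```python
def pad_left(token_ids, length=100, pad_value=0):
    """Left-pad token_ids with pad_value to exactly `length` items and wrap
    the result as a batch of one. Longer inputs keep their last `length` ids."""
    ids = list(token_ids)[-length:]
    return [[pad_value] * (length - len(ids)) + ids]


input_nums = pad_left([2634, 758, 938, 579])
print(len(input_nums), len(input_nums[0]))
```

The resulting nested list has the [1, 100] shape that the request declares.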

Tensorflow core

I ran the code below on TensorFlow 1.0 using Python 3.5:
import tensorflow as tf
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b  # + provides a shortcut for tf.add(a, b)
sess = tf.Session()
print(sess.run(adder_node, {a: 3, b: 4.5}))
print(sess.run(adder_node, {a: [1, 3], b: [2, 4]}))
I got this error:
print(sess.run(adder_node, {a:3, b:4.5}))
Traceback (most recent call last):
File "C:\Users\xxxx\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1068, in _run
allow_operation=False)
File "C:\Users\xxxx\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2708, in as_graph_element
return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
File "C:\Users\xxxx\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2787, in _as_graph_element_locked
raise ValueError("Tensor %s is not an element of this graph." % obj)
ValueError: Tensor Tensor("Placeholder_3:0", dtype=float32) is not an element of this graph.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\xxxx\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)
File "C:\Users\xxxx\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1071, in _run
+ e.args[0])
TypeError: Cannot interpret feed_dict key as Tensor: Tensor Tensor("Placeholder_3:0", dtype=float32) is not an element of this graph.
Please help me debug this; I'm not sure where the error is coming from.
