I am trying to implement a program with a producer class and a consumer class. The producer reads a NumPy array (an image) and puts it in shared memory; the consumer reads the array from shared memory and runs a PyTorch inference model on it.
Below is the code that creates the shared memory.
import multiprocessing as multi_processing
import numpy as np

def create_shared_memory(self):
    type_code = "I"  # unsigned int
    size = int(np.prod(self.image_frame_shape))
    frame_lock = multi_processing.Lock()
    shared_memory_array = multi_processing.Array(typecode_or_type = type_code, size_or_initializer = size, lock = frame_lock)
    # NumPy view over the shared buffer (no copy)
    buffered_array = np.frombuffer(shared_memory_array.get_obj(), dtype = type_code).reshape(self.image_frame_shape)
    shared_memory_object_tuple = (shared_memory_array, buffered_array)
    return shared_memory_object_tuple
I have created a pytorch data loader with the below code snippet.
inference_data_loader = create_loader(
    InferCustomDataset(
        frame_list,
        self.validation_transforms,
        input_size = self.model_params['input_size'][1:]
    ),
    **self.model_params
)
And the InferCustomDataset class is as below.
class InferCustomDataset(torch.utils.data.Dataset):
    def __init__(self, imlist, custom_transforms = None, input_size = (224, 224)):
        self._imlist = imlist
        self.transform = custom_transforms
        self.input_size = input_size

    def __getitem__(self, idx):
        img = Image.fromarray(self._imlist[idx]).convert('RGB')
        img = img.resize(self.input_size)
        if self.transform is not None:
            img = self.transform(img)
        return img, torch.tensor(-1, dtype=torch.long)

    def __len__(self):
        return len(self._imlist)
When I try to iterate through the data loader, I get the error below.
For loop: for image_data, _ in inference_data_loader:
Process ConsumerVHP-2:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/mvi/modules/vac/src/consumer_video_handler_process.py", line 183, in run
classes = self.infer_on_frame_list(buffered_images_list)
File "/home/ubuntu/mvi/modules/vac/src/consumer_video_handler_process.py", line 92, in infer_on_frame_list
for image_data, _ in inference_data_loader:
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 363, in __iter__
self._iterator = self._get_iterator()
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 314, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 939, in __init__
torch.cuda.current_device(),
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/site-packages/torch/cuda/__init__.py", line 481, in current_device
_lazy_init()
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/site-packages/torch/cuda/__init__.py", line 206, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f990c904a60>
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1358, in __del__
self._shutdown_workers()
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1317, in _shutdown_workers
self._mark_worker_as_unavailable(worker_id, shutdown=True)
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1258, in _mark_worker_as_unavailable
assert self._workers_status[worker_id] or (self._persistent_workers and shutdown)
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_workers_status'
^CProcess PproducerVHProcess-1:
Traceback (most recent call last):
File "/home/ubuntu/mvi/modules/vac/src/main.py", line 195, in <module>
producer_reader_process.join()
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/multiprocessing/popen_fork.py", line 43, in wait
Traceback (most recent call last):
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
File "/home/ubuntu/anaconda3/envs/conda-pv-pytorch-2/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/mvi/modules/vac/src/producer_video_handler_process.py", line 48, in run
ret = shared_memory_array.acquire()
KeyboardInterrupt
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
It throws RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.
In my main program, if I call set_start_method('spawn'), the consumer only receives a NumPy array of all zeros; it looks like the consumer process is not getting the image from shared memory.
I also tried setting "num_workers": 0, but then I get the error below.
ValueError: persistent_workers option needs num_workers > 0
Could you let me know how to read the NumPy array (image) that the producer wrote to shared memory and run the PyTorch inference on it in the consumer process?
I also tried the torch.multiprocessing module instead of Python's multiprocessing module, but that resulted in the same error.
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I'd appreciate any suggestions on how to fix this. Thank you.
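On the ValueError above: PyTorch's DataLoader rejects persistent_workers=True when num_workers is 0, so the two settings have to change together. A minimal sketch, assuming the loader settings live in a plain dict; the helper name single_process_loader_params is hypothetical, and whether create_loader forwards these keys unchanged is also an assumption.

```python
def single_process_loader_params(params):
    """Return a copy of params that loads data in the calling process."""
    adjusted = dict(params)                 # leave the caller's dict untouched
    adjusted["num_workers"] = 0             # no DataLoader worker processes
    adjusted["persistent_workers"] = False  # only valid when num_workers > 0
    return adjusted


# Example settings; the keys mirror DataLoader arguments.
model_params = {"batch_size": 8, "num_workers": 4, "persistent_workers": True}
print(single_process_loader_params(model_params))
```

With no worker processes, the DataLoader should not need to fork from the consumer process, which is the step where the CUDA error in the traceback was raised.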
The following change fixed the issue.
I was calling the create_shared_memory() method inside my if __name__ == '__main__': block.
shared_mem_handler = SharedMemoryHandler()
shared_memory_object_tuple = shared_mem_handler.create_shared_memory()
I moved this code out of the main block and placed it just below my import statements. That fixed it.
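For comparison, a common pattern with the 'spawn' start method is to create the shared array once in the parent and hand it to each process through args, because a spawned child re-imports the main module instead of inheriting the parent's memory. A minimal stdlib-only sketch; write_frame, read_frame and run_demo are hypothetical names, and the NumPy view from the question would be rebuilt inside each process with np.frombuffer.

```python
import multiprocessing as mp


def write_frame(shared_array, value):
    """Producer side: fill the shared buffer while holding its lock."""
    with shared_array.get_lock():
        for i in range(len(shared_array)):
            shared_array[i] = value


def read_frame(shared_array):
    """Consumer side: copy the shared buffer into a plain list."""
    with shared_array.get_lock():
        return shared_array[:]


def run_demo(frame_size=6):
    """Create the array in the parent and pass it to a spawned producer."""
    ctx = mp.get_context("spawn")
    shared_array = ctx.Array("I", frame_size)  # zero-initialised, with a lock
    producer = ctx.Process(target=write_frame, args=(shared_array, 7))
    producer.start()
    producer.join()
    return read_frame(shared_array)
```

run_demo() has to be called from inside an if __name__ == '__main__': guard in a script file, otherwise the spawned child would try to start processes again while importing the module.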
After a recent py2neo update we are seeing many errors, at random times, saying
ValueError: filedescriptor out of range in select(). We are using py2neo to connect to a remote Neo4j instance.
client_identifier = request.args.get('tribes_client_id')
graph_obj = generic_helpers.get_graph_object(client_identifier) # <- returns py2neo instance
graph_transaction = graph_obj.begin() # <- this is the line causing the exception
Below is the stack trace of the exception:
Traceback (most recent call last):
File "/env/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/env/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/env/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/env/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/env/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/env/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/srv/neo4j_maintenance_routes.py", line 116, in get_words_list_for_lookup
graph_transaction = graph_obj.begin()
File "/env/lib/python3.7/site-packages/py2neo/database.py", line 353, in begin
return Transaction(self, autocommit)
File "/env/lib/python3.7/site-packages/py2neo/database.py", line 781, in __init__
self.transaction = self.connector.begin()
File "/env/lib/python3.7/site-packages/py2neo/internal/connectors.py", line 297, in begin
tx = self.pool.acquire()
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 715, in acquire
return self.acquire_direct(self.address)
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 608, in acquire_direct
connection = self.connector(address, error_handler=self.connection_error_handler)
File "/env/lib/python3.7/site-packages/py2neo/internal/connectors.py", line 227, in connector
encrypted=cx_data["secure"], **kwargs)
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 972, in connect
raise last_error
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 964, in connect
connection = _handshake(s, address, der_encoded_server_certificate, **config)
File "/env/lib/python3.7/site-packages/neobolt/direct.py", line 898, in _handshake
ready_to_read, _, _ = select((s,), (), (), 1)
ValueError: filedescriptor out of range in select()
py2neo -> version 4.3.0
python -> version 3.7.3
neo4j -> version (Enterprise 3.5.3)
Note that we don't get this error every time we create a new transaction; it appears at random times and goes away by itself after a while.
I also run into this issue occasionally. The difference is that I am using neo4j-python-driver (1.7.6).
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neo4j/__init__.py", line 444, in run
self._connect()
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neo4j/__init__.py", line 383, in _connect
self._connection = self._acquirer(access_mode)
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 715, in acquire
return self.acquire_direct(self.address)
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 608, in acquire_direct
connection = self.connector(address, error_handler=self.connection_error_handler)
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neo4j/__init__.py", line 232, in connector
return connect(address, **dict(config, **kwargs))
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 972, in connect
raise last_error
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 964, in connect
connection = _handshake(s, address, der_encoded_server_certificate, **config)
File "/home/my/anaconda3/envs/python37/lib/python3.7/site-packages/neobolt/direct.py", line 898, in _handshake
ready_to_read, _, _ = select((s,), (), (), 1)
ValueError: filedescriptor out of range in select()
When I reran the same code, the error did not occur and the program worked well. I think the issue is probably caused by the same user being logged in more than once: the error can happen when I am still accessing the Neo4j database in the browser and use the same account to fetch data with neo4j-python-driver at the same time. However, sometimes it seems to happen even when I do nothing but run the code.
Extra Info:
Neo4j Graph Database: 3.5 community version
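A possible explanation, which neither post confirms: select() can only watch descriptors numbered below FD_SETSIZE (1024 on typical Linux builds), so the handshake fails once the process holds enough open files or sockets that a new connection gets a higher number. That would fit an error that comes and goes. A small diagnostic sketch; the helper names are hypothetical, the 1024 limit is an assumption about the platform, and /proc/self/fd exists only on Linux.

```python
import os

FD_SETSIZE = 1024  # assumed limit; typical on Linux


def open_fd_numbers(fd_dir="/proc/self/fd"):
    """List descriptor numbers currently open in this process (Linux only)."""
    return sorted(int(name) for name in os.listdir(fd_dir))


def select_is_at_risk(fd_numbers, limit=FD_SETSIZE):
    """True if any descriptor number is too large for select() to handle."""
    return any(fd >= limit for fd in fd_numbers)


if os.path.isdir("/proc/self/fd"):
    fds = open_fd_numbers()
    print(len(fds), "open descriptors; highest is", max(fds))
```

Logging these numbers next to graph_obj.begin() would show whether the descriptor count climbs toward the limit before the error appears, which would point to connections or files not being released.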
I am trying to use Dask to perform a groupby operation on a Dataframe.
The code below does not work but it seems that if I initialize the Client from another console the code works, even though I can't see anything on the dashboard ( http://localhost:8787/status ): I mean, there is a dashboard, but all the figures look empty. I am on macOS.
Code:
from datetime import datetime
import numpy as np
import os
from dask import dataframe as dd
from dask.distributed import Client
import pandas as pd

client = Client()
# open http://localhost:8787/status

csv_path = 'chicago-complete.monthly.2018-07-01-to-2018-07-31/data.csv'
dir_destination = 'data'

df = dd.read_csv(csv_path,
    dtype = {
        'timestamp': str,
        'node_id': str,
        'subsystem': str,
        'sensor': str,
        'parameter': str,
        'value_raw': str,
        'value_hrf': str,
    },
    parse_dates=['timestamp'],
    date_parser=lambda x: pd.datetime.strptime(x, '%Y/%m/%d %H:%M:%S')
)

#%%
if not os.path.exists(dir_destination):
    os.makedirs(dir_destination)

def create_node_csv(df_node):
    # test function
    return len(df_node)

res = df.groupby('node_id').apply(create_node_csv, meta=int)
The CSV file simply consists of string columns. My goal is to group all the rows that contain a given value in a column and then save each group as a separate file using create_node_csv(df_node) (even though right now it is a dummy function). Any other way to do this is appreciated, but I would like to understand what's going on here.
When I run it, the console prints multiple times the following errors:
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 208, in _start_worker
yield w._start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 157, in _start
response = yield self.instantiate()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 226, in instantiate
self.process.start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 370, in start
yield self.process.start()
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 35, in _call_and_set_future
res = func(*args, **kwargs)
File "/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 184, in _start
process.start()
File "/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/anaconda3/lib/python3.7/multiprocessing/context.py", line 291, in _Popen
return Popen(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/anaconda3/lib/python3.7/multiprocessing/popen_forkserver.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
And:
distributed.nanny - WARNING - Worker process 1844 exited with status 1
distributed.nanny - WARNING - Restarting worker
And:
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 217, in _start_worker
raise gen.TimeoutError("Worker failed to start")
tornado.util.TimeoutError: Worker failed to start
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 217, in _start_worker
EDIT:
Based on the answer:
- How do I prevent the creation of a new Client if I run the program again?
- How can I do the following?
def create_node_csv(df_node):
    return len(df_node)
It gives me the following error. Is it related to the meta parameter?
ValueError: cannot reindex from a duplicate axis
When you run the script, Client() causes new Dask workers to be spawned, and these also get copies of variables from the original main process. In some cases, this involves re-importing the script in each worker, each of which then, of course, tries to create a Client and a new set of processes.
The best answer, as with anything that runs in separate processes, is to use functions and protect the main execution. The following is one way to do this without changing your one-script structure:
from datetime import datetime
import numpy as np
import os
from dask import dataframe as dd
from dask.distributed import Client
import pandas as pd

csv_path = 'chicago-complete.monthly.2018-07-01-to-2018-07-31/data.csv'
dir_destination = 'data'

def run():
    client = Client()
    df = dd.read_csv(csv_path, ...)
    if not os.path.exists(dir_destination):
        os.makedirs(dir_destination)

    def create_node_csv(df_node):
        # test function
        return len(df_node)

    res = df.groupby('node_id').apply(create_node_csv, meta=int)
    print(res.compute())

if __name__ == "__main__":
    run()
How do I prevent the creation of a new Client if I run the program again?
In the call to Client() you can include the address of an existing cluster, if you know what that would be. Also, some specific types of deployments (and there are a few) may have a concept of the "current cluster".
I'm trying to replace a Series dask partition with my own partition.
I've used the code snippet given by @MRocklin in this post.
list_of_delayed = dask_df.to_delayed()
new_partition = dask.delayed(pd.read_csv)(filename)
list_of_delayed[i] = new_partition
new_dask_df = dd.from_delayed(list_of_delayed, meta=dask_df._meta)
I've done exactly the same except dask_df is a series in my case. I'm getting the following error:
Traceback (most recent call last):
File "sdfr_dhruvkmr.py", line 465, in <module>
pts = task[(task.task_date <= dtm.Time.iloc[i]) & (task.T_Date == dtm.Date.iloc[i])]
File "/usr/lib/python2.7/site-packages/edask/dataframe.py", line 130, in __getitem__
new_dask_df = dd.from_delayed(list_of_delayed)
File "/usr/lib/python2.7/site-packages/edask/edask/dask/dataframe/io/io.py", line 493, in from_delayed
type(df).__name__)
TypeError: Expected Delayed object, got Delayed
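One reading of "Expected Delayed object, got Delayed", not confirmed in this thread: the traceback passes through a copy of dask bundled under edask, so the object and the type check may refer to two different classes that happen to share the name Delayed. The stdlib-only sketch below reproduces that situation with two stand-in classes; nothing in it is dask code, and all names are hypothetical.

```python
def make_delayed_class():
    """Build a fresh class named Delayed, as two separate installs would."""
    class Delayed:
        pass
    return Delayed


DelayedA = make_delayed_class()  # stands in for one copy of the library
DelayedB = make_delayed_class()  # stands in for the other copy


def check(obj, expected_cls):
    """Type check that reports class names only, like the error in the question."""
    if not isinstance(obj, expected_cls):
        raise TypeError("Expected %s object, got %s"
                        % (expected_cls.__name__, type(obj).__name__))
    return obj


try:
    check(DelayedA(), DelayedB)
except TypeError as exc:
    message = str(exc)
    print(message)
```

If this is what is happening, building the delayed objects and calling from_delayed through the same dask import (both bundled or both standalone) would be the thing to try.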
I'm working on converting some tests from Nose and Twisted to pytest and Twisted, as Nose is no longer in development. The easiest way to convert the tests is to edit the custom decorator that every test carries, which defines a timeout for the individual test.
I've tried using @pytest.mark.timeout. The only method that has worked is 'thread', but it stops the entire test run and won't continue on to the next test. The 'signal' method fails to stop the test, although I can see an error in the junitxml file.
def inlineCallbacksTest ( timeout = None ):
    def decorator ( method ):
        @wraps ( method )
        @pytest.mark.timeout(timeout = timeout, method = 'signal' )
        @pytest.inlineCallbacks
        def testMethod ( *args, **kwargs ):
            return method(*args, **kwargs)
        return testMethod
    return decorator
The tests themselves use twisted to start up and send messages to the software. I don't need the tests to cancel any twisted processes or locks. I would just like pytest to mark the test as a failure after the timeout, and then move onto the next test.
Below is the error I see in the xml file when using signal method of timeout.
</system-out><system-err>
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~ Stack of PoolThread-twisted.internet.reactor-2 (139997693642496) ~~~~~~~
File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 190, in _worker
o = self.q.get()
File "/usr/lib64/python2.7/Queue.py", line 168, in get
self.not_empty.wait()
File "/usr/lib64/python2.7/threading.py", line 339, in wait
waiter.acquire()
~~~~~~~ Stack of PoolThread-twisted.internet.reactor-1 (139997702035200) ~~~~~~~
File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 190, in _worker
o = self.q.get()
File "/usr/lib64/python2.7/Queue.py", line 168, in get
self.not_empty.wait()
File "/usr/lib64/python2.7/threading.py", line 339, in wait
waiter.acquire()
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
Unhandled Error
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1169, in run
self.mainLoop()
--- <exception caught here> ---
File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1181, in mainLoop
self.doIteration(t)
File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 362, in doPoll
l = self._poller.poll(timeout, len(self._selectables))
File "/usr/lib/python2.7/site-packages/pytest_timeout.py", line 110, in handler
timeout_sigalrm(item, timeout)
File "/usr/lib/python2.7/site-packages/pytest_timeout.py", line 243, in timeout_sigalrm
pytest.fail('Timeout >%ss' % timeout)
File "/usr/lib/python2.7/site-packages/_pytest/outcomes.py", line 85, in fail
raise Failed(msg=msg, pytrace=pytrace)
builtins.Failed: Timeout >5.0s
</system-err>
I have looked around for a similar solution, and the closest I could find was this question. Any help or suggestions would be appreciated.
Coming back to this after 4+ years with an answer. The problem seems to be that the exception from the test gets caught by the Twisted reactor. I was able to resolve this by updating Twisted: versions since 16.5 have a new Deferred method, addTimeout. Using it, I was able to change the original decorator to the following. Now, whenever a test times out, it simply raises an exception and moves on to the next one. It may not be the most elegant solution, but I hope this helps someone else out!
import twisted.internet.defer as defer
import pytest_twisted as pt
from functools import wraps
from twisted.internet import reactor

def inlineCallbacksTest ( timeout = None ):
    def testDecorator ( testFunc ):
        # Called by addTimeout when the Deferred is cancelled by the timeout
        def timeoutError ( value, timeout ):
            raise Exception ( "Test Timeout: {} secs have expired".format ( timeout ) )

        @wraps ( testFunc )
        def wrapper ( *args, **kwargs ):
            testDefer = pt.inlineCallbacks ( testFunc )( *args, **kwargs )
            testDefer.addTimeout ( timeout, reactor, timeoutError )
            return testDefer
        return wrapper
    return testDecorator