Pytorch dataloaders : OSError: [Errno 9] Bad file descriptor - docker

Description of the problem
The error will occur if the num_workers > 0 , But when I set num_workers = 0 , the error disappeared, though, this will slow down the trainning speed. I think the multiprocessing really matters here .How can I solve this problem?
env
docker python3.8 Pytorch 1.11.0+cu113
error output
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 149, in _serve
send(conn, destination_pid)
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 50, in send
reduction.send_handle(conn, new_fd, pid)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 184, in send_handle
sendfds(s, [handle])
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 149, in sendfds
File "save_disp.py", line 85, in <module>
sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 151, in _serve
test()
File "save_disp.py", line 55, in test
close()
for batch_idx, sample in enumerate(TestImgLoader):
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
os.close(new_fd)
OSError: [Errno 9] Bad file descriptor
data = self._next_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
raise EOFError
EOFError

Related

Scapy rdpcap function error "MemoryError"

I want to use rdpcap to open a traffic capture.
cap = rdpcap("Chall_1.pcapng")
but i receive the following error and I don't know how to solve it.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/scapy/utils.py", line 979, in __call__
i.__init__(filename, fdesc, magic)
File "/usr/lib/python3/dist-packages/scapy/utils.py", line 1124, in __init__
RawPcapReader.__init__(self, filename, fdesc, magic)
File "/usr/lib/python3/dist-packages/scapy/utils.py", line 1035, in __init__
raise Scapy_Exception(
scapy.error.Scapy_Exception: Not a pcap capture file (bad magic: b'\n\r\r\n')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/valentin/Desktop/Tema3/ctf1.py", line 29, in <module>
cap = rdpcap("Chall_1.pcapng")
File "/usr/lib/python3/dist-packages/scapy/utils.py", line 950, in rdpcap
with PcapReader(filename) as fdesc:
File "/usr/lib/python3/dist-packages/scapy/utils.py", line 985, in __call__
i.__init__(filename, fdesc, magic)
File "/usr/lib/python3/dist-packages/scapy/utils.py", line 1320, in __init__
RawPcapNgReader.__init__(self, filename, fdesc, magic)
File "/usr/lib/python3/dist-packages/scapy/utils.py", line 1209, in __init__
self.f.read(blocklen - 24)
MemoryError
The problem was that I didn't have enoght RAM on my virtual machine.
I infer from this Scapy pull request that the intent is that rdpcap() be able to open both pcap and pcapng files. If that's not working, then it's presumably a Scapy bug; please report it on the Scapy issue list.

SSLError when using open_by_key(Spreadsheet ID)

I referred to the following site. After completing each setting, I wrote gspread_simple.py which shown the site.
https://qiita.com/164kondo/items/eec4d1d8fd7648217935
As a result, the following error was output (I replaced a folder name under C:\Users with ********).
Traceback (most recent call last):
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 696, in urlopen
self._prepare_proxy(conn)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 964, in _prepare_proxy
conn.connect()
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connection.py", line 359, in connect
conn = self._connect_tls_proxy(hostname, conn)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connection.py", line 496, in _connect_tls_proxy
return ssl_wrap_socket(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\util\ssl_.py", line 432, in ssl_wrap_socket
ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\util\ssl_.py", line 474, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 500, in wrap_socket
return self.sslsocket_class._create(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 1040, in _create
self.do_handshake()
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:1122)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\adapters.py", line 439, in send
resp = conn.urlopen(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 783, in urlopen
return self.urlopen(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 783, in urlopen
return self.urlopen(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 783, in urlopen
return self.urlopen(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\util\retry.py", line 573, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='oauth2.googleapis.com', port=443): Max retries exceeded with url: /token (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1122)')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\google\auth\transport\requests.py", line 182, in __call__
response = self.session.request(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='oauth2.googleapis.com', port=443): Max retries exceeded with url: /token (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1122)')))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\********\samp\module1.py", line 18, in <module>
ws = connect_gspread(jsonf,spread_sheet_key)
File "C:\Users\********\samp\module1.py", line 12, in connect_gspread
worksheet = gc.open_by_key(SPREADSHEET_KEY).sheet1
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\gspread\models.py", line 123, in sheet1
return self.get_worksheet(0)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\gspread\models.py", line 283, in get_worksheet
sheet_data = self.fetch_sheet_metadata()
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\gspread\models.py", line 265, in fetch_sheet_metadata
r = self.client.request('get', url, params=params)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\gspread\client.py", line 61, in request
response = getattr(self.session, method)(
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 555, in get
return self.request('GET', url, **kwargs)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\google\auth\transport\requests.py", line 460, in request
self.credentials.before_request(auth_request, method, url, request_headers)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\google\auth\credentials.py", line 133, in before_request
self.refresh(request)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\google\oauth2\service_account.py", line 361, in refresh
access_token, expiry, _ = _client.jwt_grant(request, self._token_uri, assertion)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\google\oauth2\_client.py", line 153, in jwt_grant
response_data = _token_endpoint_request(request, token_uri, body)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\google\oauth2\_client.py", line 105, in _token_endpoint_request
response = request(method="POST", url=token_uri, headers=headers, body=body)
File "C:\Users\********\AppData\Local\Programs\Python\Python39\lib\site-packages\google\auth\transport\requests.py", line 188, in __call__
six.raise_from(new_exc, caught_exc)
File "<string>", line 3, in raise_from
google.auth.exceptions.TransportError: HTTPSConnectionPool(host='oauth2.googleapis.com', port=443): Max retries exceeded with url: /token (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1122)')))
I referred to the following site and so on.
https://qiita.com/vmmhypervisor/items/c8ef0ad87318656facb2
OS: Windows10
Python: 3.9
My Python is not 2.7 series.
As a test, I tried the command.
pip uninstall -y certifi && pip install certifi==2015.04.28
But the output was the same.
And I tried the command.
pip install -U certifi
But the output was the same.

Can't load models from torchvision on GPU clusters in my company

codes:
from torchvision.models import vgg16
vgg_model = vgg16(pretrained=True).features[:16]
the log output:
Traceback (most recent call last):
File "train.py", line 95, in <module>
vgg_model = vgg16(pretrained=True).features[:16]
File "/opt/conda/lib/python3.7/site-packages/torchvision/models/vgg.py", line 150, in vgg16
return _vgg('vgg16', 'D', False, pretrained, progress, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torchvision/models/vgg.py", line 93, in _vgg
progress=progress)
File "/opt/conda/lib/python3.7/site-packages/torch/hub.py", line 480, in load_state_dict_from_url
os.makedirs(model_dir)
File "/opt/conda/lib/python3.7/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/opt/conda/lib/python3.7/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/opt/conda/lib/python3.7/os.py", line 223, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.cache'
I have to run my codes on GPU clusters, and I have no rights to modify these docker images on GPU clusters. What can I do?

ESP8266 Micropython: [Errno 103] ECONNABORTED\r\n') after some time

I am having issues running the following script for a longer period of time:
I use ampy to execute the script on the ESP:
sudo ampy --port /dev/ttyUSB0 run photoresistor.py
photoresistor.py:
#!/usr/bin/env python3
import machine
import network
from time import sleep
from urllib.urequest import urlopen
import json
wifiap = network.WLAN(network.AP_IF)
wifiap.active(False)
routercon = network.WLAN(network.STA_IF)
routercon.active(True)
routercon.ifconfig(('10.0.0.128','255.255.255.0','10.0.0.138','10.0.0.138'))
routercon.connect('mywifi', '123')
while not routercon.isconnected():
pass
posturl=('http://10.0.0.156:23102/rest/v2/send')
adc = machine.ADC(0)
gc.enable()
while True:
value = adc.read()
if value < 200:
message = {'username': 'test', 'message': value, 'chatid': 'test', 'password': 'test', 'notifyself': 'false'}
r = urlopen(posturl, data=json.dumps(message).encode())
r.close()
gc.collect()
sleep(1)
It works as expected in the beginning but after some time I get the following stacktrace:
Traceback (most recent call last):
File "/usr/local/bin/ampy", line 11, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ampy/cli.py", line 337, in run
output = board_files.run(local_file, not no_output)
File "/usr/local/lib/python3.6/dist-packages/ampy/files.py", line 303, in run
out = self._pyboard.execfile(filename)
File "/usr/local/lib/python3.6/dist-packages/ampy/pyboard.py", line 273, in execfile
return self.exec_(pyfile)
File "/usr/local/lib/python3.6/dist-packages/ampy/pyboard.py", line 267, in exec_
raise PyboardError('exception', ret, ret_err)
ampy.pyboard.PyboardError: ('exception', b'', b'Traceback (most recent call last):\r\n File "<stdin>", line 28, in <module>\r\n File "urequests.py", line 152, in post\r\n File "urequests.py", line 89, in request\r\nOSError: [Errno 103] ECONNABORTED\r\n')
No idea what to do.
I tried to play around with the garbage collection but it didn't help.
I suspect that the board doesn't clean up sockets properly.
If the board is sending post requests quickly in the loop (every second for 1 minute) and let it sit afterwards for a short period of time it fails quickly with above ECONNABORTED.
If the board sends post requests more slowly (like 2 in a minute) it takes way longer for it to fail. To conclude: I suspect the OS does not properly clean up resources and still has active connections after r.close() or I am overseeing something in the code.
I am not sure what else I can do to make sure these sockets are closed.
EDIT:
I found out it fails on connect (https://github.com/micropython/micropython-lib/blob/master/urllib.urequest/urllib/urequest.py):
line 28:
s.connect(ai[-1])
however routercon.isconnected() returns true:
>>> routercon.isconnected()
True
>>>
How can it be that altough there is a active connection I am unable to send an http post request?
EDIT2:
When this happens sometimes I also can't post to another endpoint e.g. the test server with the same webservice
>>> r = urlopen(posturl, data=json.dumps(message).encode())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib/urequest.py", line 28, in urlopen
OSError: [Errno 103] ECONNABORTED
>>> r = urlopen("http://10.0.0.8:23102/rest/v2/send", data=json.dumps(message).encode())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib/urequest.py", line 28, in urlopen
OSError: [Errno 103] ECONNABORTED
>>>
Interestingly a http get to google works:
>>> r = urlopen("http://www.google.com")
>>>
If I let it sit idle for some time http post start to work again.
Could it be that the OS is performing a cleanup in the background?
I faced the same problem. Restarting your api endpoint device will solve the issue.

docker-compose command is giving error while building

I am pretty amateur to docker. I was running my application on docker successfully from many days.
Now when I run the command docker-compose up -d it gives me the error as below
raceback (most recent call last):
File "docker-compose", line 3, in <module>
File "compose/cli/main.py", line 88, in main
File "compose/cli/main.py", line 140, in perform_command
File "compose/cli/main.py", line 900, in up
File "compose/project.py", line 385, in up
File "compose/project.py", line 590, in warn_for_swarm_mode
File "site-packages/docker/api/daemon.py", line 73, in info
File "site-packages/docker/utils/decorators.py", line 47, in inner
File "site-packages/docker/api/client.py", line 179, in _get
File "site-packages/requests/sessions.py", line 488, in get
File "site-packages/requests/sessions.py", line 466, in request
File "site-packages/requests/sessions.py", line 641, in
merge_environment_settings
File "site-packages/requests/utils.py", line 605, in get_environ_proxies
File "site-packages/requests/utils.py", line 589, in should_bypass_proxies
File "urllib.py", line 1510, in proxy_bypass
File "urllib.py", line 1484, in proxy_bypass_macosx_sysconf
ValueError: negative shift count
Failed to execute script docker-compose
I searched this error ValueError: negative shift count a lot but nothing useful found.

Resources