Can't load models from torchvision on GPU clusters in my company - docker

codes:
from torchvision.models import vgg16
vgg_model = vgg16(pretrained=True).features[:16]
the log output:
Traceback (most recent call last):
File "train.py", line 95, in <module>
vgg_model = vgg16(pretrained=True).features[:16]
File "/opt/conda/lib/python3.7/site-packages/torchvision/models/vgg.py", line 150, in vgg16
return _vgg('vgg16', 'D', False, pretrained, progress, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torchvision/models/vgg.py", line 93, in _vgg
progress=progress)
File "/opt/conda/lib/python3.7/site-packages/torch/hub.py", line 480, in load_state_dict_from_url
os.makedirs(model_dir)
File "/opt/conda/lib/python3.7/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/opt/conda/lib/python3.7/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/opt/conda/lib/python3.7/os.py", line 223, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.cache'
I have to run my codes on GPU clusters, and I have no rights to modify these docker images on GPU clusters. What can I do?

Related

Docker Odoo - PermissionError: [Errno 13] Permission denied: '/var/lib/odoo/sessions'

That was my first time create a project with Docker.
This it URL github my project : https://github.com/ngocnhi10011999/odoo-docker
The problems was appear when I run
sh dev up
and browser port 8069
ERROR
`
PermissionError
PermissionError: [Errno 13] Permission denied: '/var/lib/odoo/sessions'
Traceback (most recent call last)
File "/usr/lib/python3/dist-packages/odoo/service/wsgi_server.py", line 112, in application
return application_unproxied(environ, start_response)
File "/usr/lib/python3/dist-packages/odoo/service/wsgi_server.py", line 87, in application_unproxied
result = odoo.http.root(environ, start_response)
File "/usr/lib/python3/dist-packages/odoo/http.py", line 1336, in __call__
return self.dispatch(environ, start_response)
File "/usr/lib/python3/dist-packages/odoo/http.py", line 1302, in __call__
return self.app(environ, start_wrapped)
File "/usr/lib/python3/dist-packages/werkzeug/middleware/shared_data.py", line 260, in __call__
return self.app(environ, start_response)
File "/usr/lib/python3/dist-packages/odoo/http.py", line 1487, in dispatch
explicit_session = self.setup_session(httprequest)
File "/usr/lib/python3/dist-packages/odoo/http.py", line 1367, in setup_session
session_gc(self.session_store)
File "/usr/lib/python3/dist-packages/odoo/tools/func.py", line 26, in __get__
value = self.fget(obj)
File "/usr/lib/python3/dist-packages/odoo/http.py", line 1313, in session_store
path = odoo.tools.config.session_dir
File "/usr/lib/python3/dist-packages/odoo/tools/config.py", line 710, in session_dir
os.makedirs(d, 0o700)
File "/usr/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/var/lib/odoo/sessions'
`

Pytorch dataloaders : OSError: [Errno 9] Bad file descriptor

Description of the problem
The error will occur if the num_workers > 0 , But when I set num_workers = 0 , the error disappeared, though, this will slow down the trainning speed. I think the multiprocessing really matters here .How can I solve this problem?
env
docker python3.8 Pytorch 1.11.0+cu113
error output
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 149, in _serve
send(conn, destination_pid)
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 50, in send
reduction.send_handle(conn, new_fd, pid)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 184, in send_handle
sendfds(s, [handle])
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 149, in sendfds
File "save_disp.py", line 85, in <module>
sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 151, in _serve
test()
File "save_disp.py", line 55, in test
close()
for batch_idx, sample in enumerate(TestImgLoader):
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
os.close(new_fd)
OSError: [Errno 9] Bad file descriptor
data = self._next_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
raise EOFError
EOFError

Downloading language model in Spacy breaks in docker

So my docker file has the following line
RUN python -m spacy download en_core_web_sm
Which throws an error stating thinc module is not found.
So, I decided to add the following line above the previous line.
RUN pip install thinc
When I compile I get the following error.
Step 30/48 : RUN pip install thinc
---> Running in 7ba293263245
Collecting thinc
Downloading https://files.pythonhosted.org/packages/25/99/e21c2a1622c193b2c93a628496fea4525a0ce93e3b47f3cb06559b6ff3ee/thinc-7.4.3.tar.gz (1.3MB)
Complete output from command python setup.py egg_info:
BLIS_COMPILER? None
WARNING: pip versions <19.3 (currently installed: 9.0.1) are unable to detect binary wheel compatibility for blis. To avoid a source install with a very long compilation time, please upgrade pip with `pip install --upgrade pip`.
If you know what you're doing and you really want to compile blis from source, please set the environment variable BLIS_REALLY_COMPILE=1.
Traceback (most recent call last):
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 157, in save_modules
yield saved
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 198, in setup_context
yield
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 248, in run_setup
DirectorySandbox(setup_dir).run(runner)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 278, in run
return func()
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 246, in runner
_execfile(setup_script, ns)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 47, in _execfile
exec(code, globals, locals)
File "/tmp/easy_install-v4oqr4r6/blis-0.7.3/setup.py", line 219, in <module>
"cython>=0.25",
SystemExit: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app-root/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 1101, in run_setup
run_setup(setup_script, args)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 251, in run_setup
raise
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 198, in setup_context
yield
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 169, in save_modules
saved_exc.resume()
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 144, in resume
six.reraise(type, exc, self._tb)
File "/opt/app-root/lib/python3.6/site-packages/pkg_resources/_vendor/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 157, in save_modules
yield saved
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 198, in setup_context
yield
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 248, in run_setup
DirectorySandbox(setup_dir).run(runner)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 278, in run
return func()
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 246, in runner
_execfile(setup_script, ns)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/sandbox.py", line 47, in _execfile
exec(code, globals, locals)
File "/tmp/easy_install-v4oqr4r6/blis-0.7.3/setup.py", line 219, in <module>
"cython>=0.25",
SystemExit: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-_lc_0t4n/thinc/setup.py", line 276, in <module>
setup_package()
File "/tmp/pip-build-_lc_0t4n/thinc/setup.py", line 272, in setup_package
cmdclass={"build_ext": build_ext_subclass},
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/distutils/core.py", line 108, in setup
_setup_distribution = dist = klass(attrs)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/dist.py", line 315, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/opt/app-root/lib/python3.6/site-packages/setuptools/dist.py", line 361, in fetch_build_eggs
replace_conflicting=True,
File "/opt/app-root/lib/python3.6/site-packages/pkg_resources/__init__.py", line 850, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/opt/app-root/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1122, in best_match
return self.obtain(req, installer)
File "/opt/app-root/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1134, in obtain
return installer(requirement)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/dist.py", line 429, in fetch_build_egg
return cmd.easy_install(req)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 665, in easy_install
return self.install_item(spec, dist.location, tmpdir, deps)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 695, in install_item
dists = self.install_eggs(spec, download, tmpdir)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 876, in install_eggs
return self.build_and_install(setup_script, setup_base)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 1115, in build_and_install
self.run_setup(setup_script, setup_base, args)
File "/opt/app-root/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 1103, in run_setup
raise DistutilsError("Setup script exited with %s" % (v.args[0],))
distutils.errors.DistutilsError: Setup script exited with 1
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-_lc_0t4n/thinc/
You are using pip version 9.0.1, however version 20.2.4 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The image is
FROM centos/python-36-centos7
before you install python packages add the following line
RUN pip install --upgrade pip

Colab : Memory Allocation

I am facing the below error on google colab on one of code loops.
The Run time selected is GPU and its shows RAM as full but Disk storage usage is 35GB/ 358GB.
Can anyone suggest where I am heading wrong?
The error snippet is as below -
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 405, in _handle_workers
pool._maintain_pool()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 246, in _maintain_pool
self._repopulate_pool()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
w.start()
File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
Thanks

docker-compose command is giving error while building

I am pretty amateur to docker. I was running my application on docker successfully from many days.
Now when I run the command docker-compose up -d it gives me the error as below
raceback (most recent call last):
File "docker-compose", line 3, in <module>
File "compose/cli/main.py", line 88, in main
File "compose/cli/main.py", line 140, in perform_command
File "compose/cli/main.py", line 900, in up
File "compose/project.py", line 385, in up
File "compose/project.py", line 590, in warn_for_swarm_mode
File "site-packages/docker/api/daemon.py", line 73, in info
File "site-packages/docker/utils/decorators.py", line 47, in inner
File "site-packages/docker/api/client.py", line 179, in _get
File "site-packages/requests/sessions.py", line 488, in get
File "site-packages/requests/sessions.py", line 466, in request
File "site-packages/requests/sessions.py", line 641, in
merge_environment_settings
File "site-packages/requests/utils.py", line 605, in get_environ_proxies
File "site-packages/requests/utils.py", line 589, in should_bypass_proxies
File "urllib.py", line 1510, in proxy_bypass
File "urllib.py", line 1484, in proxy_bypass_macosx_sysconf
ValueError: negative shift count
Failed to execute script docker-compose
I searched this error ValueError: negative shift count a lot but nothing useful found.

Resources