GPU Acceleration with WSL 2 - docker

I'm trying to set up TensorFlow to use GPU acceleration with WSL 2 running Ubuntu 20.04. I'm following this tutorial and am running into the error seen here. When I follow the solution there and try to start Docker with sudo service docker start, I'm told that docker is an unrecognized service. Since I can access the help menu and so on, I know Docker is installed. I can get Docker to work with the desktop tool, but since it doesn't support CUDA, as mentioned in the SO post from earlier, that isn't very helpful. I'm not getting any error logs, so please ask if you need more details.
Edit:
Considering the lack of details, here is a list of solutions I've tried to no avail: 1 2 3
Update: I used sudo dockerd to start the Docker daemon and tried running the NVIDIA benchmark container, only to be met with:
INFO[2020-07-18T21:04:05.875283800-04:00] shim containerd-shim started address=/containerd-shim/021834ef5e5600bdf62a6a9e26dff7ffc1c76dd4ec9dadb9c1fcafb6c88b6e1b.sock debug=false pid=1960
INFO[2020-07-18T21:04:05.899420200-04:00] shim reaped id=70316df254d6b2633c743acb51a26ac2d0520f6f8e2f69b69c4e0624eaac1736
ERRO[2020-07-18T21:04:05.909710600-04:00] stream copy error: reading from a closed fifo
ERRO[2020-07-18T21:04:05.909753500-04:00] stream copy error: reading from a closed fifo
ERRO[2020-07-18T21:04:06.001006700-04:00] 70316df254d6b2633c743acb51a26ac2d0520f6f8e2f69b69c4e0624eaac1736 cleanup: failed to delete container from containerd: no such container
ERRO[2020-07-18T21:04:06.001045100-04:00] Handler for POST /v1.40/containers/70316df254d6b2633c743acb51a26ac2d0520f6f8e2f69b69c4e0624eaac1736/start returned error: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
Update 2: After installing a Windows Insider build and making everything as up to date as possible, I encountered a different error:
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Error: only 0 Devices available, 1 requested. Exiting.
I have a GTX 970, so I'm not sure why it's not being detected. After running sudo lshw -C display, it was confirmed that my graphics card isn't being detected. I got:
*-display UNCLAIMED
description: 3D controller
product: Microsoft Corporation
vendor: Microsoft Corporation
physical id: 4
bus info: pci@941e:00:00.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: bus_master cap_list
configuration: latency=0
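
The check described above (reading the lshw output to see whether the display device has been picked up) can be sketched in code. In lshw output, UNCLAIMED means no driver has bound to the device. This is a minimal sketch, assuming the output has the same layout as the listing above; the function name unclaimed_devices is hypothetical.

```python
def unclaimed_devices(lshw_text: str) -> list:
    """Return the product names of lshw entries marked UNCLAIMED."""
    products = []
    in_unclaimed = False
    for raw in lshw_text.splitlines():
        line = raw.strip()
        if line.startswith("*-"):
            # A new device entry starts; remember whether it is unclaimed.
            in_unclaimed = line.endswith("UNCLAIMED")
        elif in_unclaimed and line.startswith("product:"):
            products.append(line.split(":", 1)[1].strip())
    return products


# Sample taken from the listing above (shortened).
sample = """\
*-display UNCLAIMED
description: 3D controller
product: Microsoft Corporation
vendor: Microsoft Corporation
"""
print(unclaimed_devices(sample))  # ['Microsoft Corporation']
```

To use it on a live system, the text could come from running sudo lshw -C display and capturing its standard output.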

Related

Nebula Graph fails on CentOS 6.5

Nebula Graph fails on CentOS 6.5. The error message is as follows:
# storage log
Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
# meta log
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0415 22:32:38.944437 15532 AsyncServerSocket.cpp:762] failed to set SO_REUSEPORT on async server socket Protocol not available
E0415 22:32:38.945001 15510 ThriftServer.cpp:440] Got an exception while setting up the server: 92failed to bind to async server socket: [::]:0: Protocol not available
E0415 22:32:38.945057 15510 RaftexService.cpp:90] Setup the Raftex Service failed, error: 92failed to bind to async server socket: [::]:0: Protocol not available
E0415 22:32:38.949586 15463 NebulaStore.cpp:47] Start the raft service failed
E0415 22:32:38.949597 15463 MetaDaemon.cpp:88] Nebula store init failed
E0415 22:32:38.949796 15463 MetaDaemon.cpp:215] Init kv failed!
Nebula service status is as follows:
[root@redhat6 scripts]# ./nebula.service status all
[WARN] The maximum files allowed to open might be too few: 1024
[INFO] nebula-metad: Exited
[INFO] nebula-graphd: Exited
[INFO] nebula-storaged: Running as 15547, Listening on 44500
Reason for the error: the CentOS 6.5 kernel version is 2.6.32, which is older than 3.9, and SO_REUSEPORT is only supported on Linux 3.9 and above.
Upgrading the system to CentOS 7.5 solves the problem.
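
The kernel version requirement can be checked before starting the services. This is a minimal sketch, assuming the kernel release string begins with major.minor as in the string returned by platform.release(); the helper name supports_reuseport is hypothetical and not part of Nebula Graph.

```python
import platform
import re

# SO_REUSEPORT is available from Linux 3.9 onward (the threshold stated above).
MIN_REUSEPORT_KERNEL = (3, 9)


def supports_reuseport(release: str) -> bool:
    """Return True if a kernel release string such as
    '2.6.32-431.el6.x86_64' is at least 3.9.
    Hypothetical helper, for illustration only."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= MIN_REUSEPORT_KERNEL


if __name__ == "__main__":
    release = platform.release()
    print(release, "supports SO_REUSEPORT:", supports_reuseport(release))
```

A 2.6.32 release string returns False here, while a 3.10 kernel returns True.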

My neo4j server is automatically stopping and starting

I'm running Neo4j Community Edition 3.5.5 on an AWS instance with 8 GB of RAM.
For the first few months it ran fine and returned results in milliseconds, but nowadays it stops and starts automatically. Sometimes it doesn't start for hours, even when we start it manually.
Can anyone please help me with this? I'm getting the logs below.
tail -100f /var/log/neo4j/neo4j.log
2019-07-29 13:17:52.570+0000 WARN The client is unauthorized due to authentication failure.
2019-09-04 05:33:52.328+0000 WARN The client is unauthorized due to authentication failure.
2019-10-17 15:18:14.652+0000 INFO Transaction with id 2683388 has been automatically rolled back due to transaction timeout.
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 3670016000 bytes for committing reserved memory.
An error report file with more information is saved as:
/home/ubuntu/hs_err_pid8965.log
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 3670016000 bytes for committing reserved memory.
An error report file with more information is saved as:
/home/ubuntu/hs_err_pid9050.log
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
2019-10-17 17:14:44.651+0000 INFO Transaction with id 2689294 has been automatically rolled back due to transaction timeout.
This can happen because you are running a lot of MERGE operations without proper indexes. Otherwise, try increasing the heap size in the config file.
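
To put the JVM warning in the log into numbers: the failed request was for 3670016000 bytes, which is exactly 3500 MiB. The sketch below pulls that figure out of such a line so it can be compared with the memory actually free on the instance. The function name failed_commit_mib is hypothetical, and the regular expression assumes the OpenJDK message format shown in the log.

```python
import re

# Matches the size argument in lines like:
#   os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; ...
COMMIT_RE = re.compile(r"os::commit_memory\(0x[0-9a-fA-F]+, (\d+), \d+\) failed")


def failed_commit_mib(log_line: str):
    """Return the size of a failed commit_memory request in MiB, or None."""
    match = COMMIT_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) / (1024 * 1024)


line = ("OpenJDK 64-Bit Server VM warning: INFO: "
        "os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; "
        "error='Cannot allocate memory' (errno=12)")
print(failed_commit_mib(line))  # 3500.0
```

errno=12 is ENOMEM, meaning the operating system could not supply that much memory at that moment, so any heap size set in the config file has to fit in the RAM that is actually free on the 8 GB instance.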

LINKERD: Unable to build docker image from linkerd

https://github.com/linkerd/linkerd#docker
Following the instructions in the README, I executed the following commands:
; linkerd/docker ;namerd/docker
I get the following exception:
[info] Done packaging.
[trace] Stack trace suppressed: run last linkerd/bundle:docker for the full output.
[error] (linkerd/bundle:docker) java.io.IOException: Cannot run program "docker" (in directory "/home/shaikk/linkerd/linkerd/target/docker"): error=2, No such file or directory
[error] Total time: 284 s, completed Mar 6, 2017 9:13:49 AM
I think the No such file or directory error message is referring to the docker binary itself. Can you try running which docker to see if it's in your path? If it's not there, you can install it using the instructions here: https://docs.docker.com/engine/installation/#platform-support-matrix
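
The same PATH lookup that which docker performs can be done programmatically, for example before kicking off a long build. This is a minimal sketch using the standard library's shutil.which; the wrapper name find_on_path is hypothetical.

```python
import shutil


def find_on_path(program: str):
    """Return the full path of `program` if it is on PATH, else None."""
    return shutil.which(program)


if __name__ == "__main__":
    location = find_on_path("docker")
    if location is None:
        print("docker is not on PATH; the build cannot run it")
    else:
        print("docker found at", location)
```

A None result corresponds to the error=2, No such file or directory message in the build output above.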

Random failure of creating a New Cassandra Cluster using OpsCenter

OpsCenter version: 5.1.0 and
DSE Version: 4.6.0
Creating a brand new cluster directly from OpsCenter gives us the following error. It occasionally works with the same settings, but 95% of the time it fails with the same error. OpsCenter is running on its own box but shares the same security groups as the cluster instances. For good measure, I have opened up all TCP ports to all IPs. The following is the stack trace of the error from opscenterd.log:
2015-03-19 10:06:12+0000 [] INFO: Starting provisioning process
2015-03-19 10:06:12+0000 [] INFO: Starting installation phase of cluster provisioning
2015-03-19 10:06:13+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.
2015-03-19 10:06:13+0000 [] INFO: Beginning install of OpsCenter agent to 54.x.x.x
2015-03-19 10:06:26+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version None
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version u'5.1.0'
2015-03-19 10:07:23+0000 [] INFO: Successfully installed agent and dse on node 10.x.x.x
2015-03-19 10:07:23+0000 [] INFO: Beginning "stop" phase of cluster provisioning
2015-03-19 10:07:25+0000 [] WARN: Marking request '10.x.x.x: /ops/stop' (f6708fa2-b45f-42b4-b992-90a82b460ac7) as failed: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 'stop stage' (0b6fcb6b-96ba-404e-a484-b4b6b167b309) as failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 'provision' (daf1c15d-92e3-40b0-83ca-34d548ea835b) as failed: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR:
2015-03-19 10:07:25+0000 [] ERROR: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Failed to provision cluster: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 28c021fd-d21a-4fed-bb5c-a4fe17d362e0 as failed: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:41+0000 [] WARN: Unable to find a matching cluster for node with IP [u'fe80:0:0:0:2000:aff:feeb:31c7%2', u'10.x.x.x', u'0:0:0:0:0:0:0:1%1', u'127.0.0.1']; the message was [u'5.1.0', u'/1947480708/conf']. This usually indicates that an OpsCenter agent is still running on an old node that was decommissioned or is part of a cluster that OpsCenter is no longer monitoring.
Appreciate any help!
Thanks in advance
Harsha
OpsCenter developer here. I make the OpsCenter provisioning features go zoom (or splat occasionally as you've seen). It is with sadness and shame that I must tell you that you're hitting a bug.
The Datastax AMI version 2.4 used by OpsCenter provisioning (https://github.com/riptano/ComboAMI/tree/2.4) does quite a bit of work at boot time via startup scripts. One of those tasks is to set up some gpg repository keys used to validate packages. Intermittently that process can fail, breaking package installs and leading to the series of errors that you're seeing. This failure is intermittent and has greatly increased in frequency recently. If you check /home/ubuntu/datastax-ami/ami.log you should see the gpg key failures that begin the rest of the failure chain.
Unfortunately, this error is pretty far down the technology stack and is difficult to manually work around. If you just need to provision a single cluster, you can retry until you get a good run. Otherwise, your best bet is to manually launch the instances and use local provisioning to deploy dse/dsc to their private IP addresses:
Launch instances using ami-ada2b6c4 (assuming you're in us-east-1)
Make sure to add the instances to the OpsCenterSecurity group.
Make sure you have the private half of the keypair you use (you'll need it during local provisioning)
On the instance data page, hit the advanced pulldown and add the following userdata as text "--raidonly --java7"
Do a local-provisioning run against the private IPs
Not a super-simple workaround. I wish your experience with OpsCenter this time around was more awesome. The good news is I'm on this bug and it will be fixed in an upcoming point release.
Edit: No longer necessary to manually remove /etc/security/limits.d/cassandra.conf
If it's just complaining about Java, then install Java 7; DataStax prefers the Oracle JDK and JRE. You might already have Java 7 alongside another version on your nodes, but with Java 7 not set as the default version. To change this, run:
sudo update-java-alternatives -s java-7-oracle
This is a command you can script to run over ssh, so you don't have to log in to each node.
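
Scripting that command across nodes could look something like the sketch below. The node addresses and the ubuntu login name are assumptions for illustration, not values from the question, and the script assumes key-based ssh access and passwordless sudo on each node. By default it only prints the commands it would run.

```python
import shlex
import subprocess

# Command taken from the answer above.
REMOTE_COMMAND = "sudo update-java-alternatives -s java-7-oracle"

# Hypothetical node addresses; replace with your own.
NODES = ["10.0.0.11", "10.0.0.12"]


def build_ssh_command(host: str, user: str = "ubuntu") -> list:
    """Build the argv for running REMOTE_COMMAND on one node over ssh.
    The default user name is an assumption."""
    return ["ssh", f"{user}@{host}", REMOTE_COMMAND]


def run_on_all(nodes, dry_run: bool = True) -> None:
    """Run (or, in a dry run, just print) the command for every node."""
    for host in nodes:
        argv = build_ssh_command(host)
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_on_all(NODES)  # dry run: only prints the commands
```

Calling run_on_all(NODES, dry_run=False) executes the command on each node in turn and stops at the first failure because of check=True.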

Fresh download and start of Neo4j 1.9.9: fails to start

Steps to recreate:
Downloaded the tar.gz file to Ubuntu 14.04
tar zxvf neo4j.tar.gz
sudo neo4j/bin/neo4j install
sudo service neo4j-service start
Getting the following error:
16:02:44.205 [main] INFO o.n.s.enterprise.EnterpriseNeoServer - Setting startup timeout to: 120000ms based on -1
org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.kernel.impl.transaction.XaDataSourceManager@6c04c230' failed to stop. Please see attached cause exception.
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:530)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:157)
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:123)
at org.neo4j.kernel.InternalAbstractGraphDatabase.run(InternalAbstractGraphDatabase.java:284)
at org.neo4j.kernel.EmbeddedGraphDatabase.<init>(EmbeddedGraphDatabase.java:116)
at org.neo4j.graphdb.factory.GraphDatabaseFactory$1.newDatabase(GraphDatabaseFactory.java:96)
at org.neo4j.graphdb.factory.GraphDatabaseBuilder.newGraphDatabase(GraphDatabaseBuilder.java:207)
at org.neo4j.server.enterprise.EnterpriseDatabase$DatabaseMode$1.createDatabase(EnterpriseDatabase.java:42)
at org.neo4j.server.enterprise.EnterpriseDatabase.start(EnterpriseDatabase.java:78)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:170)
at org.neo4j.server.Bootstrapper.start(Bootstrapper.java:103)
at org.neo4j.server.Bootstrapper.main(Bootstrapper.java:57)
For full stack trace see:
https://gist.github.com/henry74/6a14fa76fffabf6d4313

Resources