In a dask distributed setup the worker sits idle - dask

I'm trying to set up a dask distributed cluster. I've installed dask on three machines to get started:
laptop (where searchCV gets called)
scheduler (small box where the dask scheduler process lives)
HPC (Large box expected to do the work)
I have dask[complete] installed on the laptop and dask on the other machines.
The worker and scheduler start fine and I can see the dashboards, but I can't send them anything. Running GridSearchCV on the laptop gets a result, but it comes from the laptop alone; the worker sits idle.
All machines run Windows 7 (the HPC box runs Windows 10). I've checked the ports with netstat, and each process appears to be listening where it is supposed to.
When running a small example I get the following error:
from dask.distributed import Client

scheduler_address = 'tcp://10.X.XX.XX:8786'
client = Client(scheduler_address)

def square(x):
    return x ** 2

def neg(x):
    return -x

A = client.map(square, range(10))
B = client.map(neg, A)
total = client.submit(sum, B)
print(total.result())
INFO - Batched Comm Closed: in <closed TCP>: ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
distributed.comm.core.CommClosedError
tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: scheduler='tcp://10.X.XX.XX:8786' processes=1 cores=10
I've also filed a bug report, as I don't know if this is a bug or ineptitude on my part (I'm guessing the latter).

Running client.get_versions(check=True) revealed all sorts of version mismatches, despite a clean install with -U. Making the environments match fixed that. The laptop can have different versions of some packages installed; at least it worked for the differences I have. YMMV.
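To make the version check above easier to read, here is a minimal sketch that lists only the packages whose versions differ between nodes. It assumes the dictionary returned by client.get_versions() has "client", "scheduler" and "workers" entries, each carrying a flat "packages" mapping of name to version; that layout is an assumption and has differed between distributed releases. The helper name find_version_mismatches and the sample data are hypothetical.

```python
def find_version_mismatches(versions):
    """Return {package: {node: version}} for packages whose versions differ.

    Assumes `versions` looks like the result of client.get_versions():
    "client" and "scheduler" entries plus a "workers" mapping keyed by
    worker address, each with a flat "packages" dict (an assumption).
    """
    nodes = {"client": versions["client"], "scheduler": versions["scheduler"]}
    for address, info in versions.get("workers", {}).items():
        nodes[address] = info

    # Collect, for every package, the version reported by each node.
    by_package = {}
    for node, info in nodes.items():
        for name, version in info.get("packages", {}).items():
            by_package.setdefault(name, {})[node] = version

    # Keep only packages where at least two nodes disagree.
    return {
        name: by_node
        for name, by_node in by_package.items()
        if len(set(by_node.values())) > 1
    }


# Illustrative data only; not output from a real cluster.
sample = {
    "client": {"packages": {"dask": "2.0.0", "tornado": "6.0"}},
    "scheduler": {"packages": {"dask": "2.0.0", "tornado": "6.0"}},
    "workers": {
        "tcp://worker-1:40000": {"packages": {"dask": "1.2.0", "tornado": "6.0"}},
    },
}
mismatches = find_version_mismatches(sample)
for package, by_node in mismatches.items():
    print(package, by_node)
```

Against a live cluster the call would be find_version_mismatches(client.get_versions()), provided the returned layout matches the assumption above.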

Related

Jenkins ssh why it shows this error com.jcraft.jsch.JSchException: channel is not opened?

I recently posted a question that had been bugging me for days, only to find a solution five minutes after posting, so I'm now posting about a problem I've been stuck on for the last two hours. I have a job in Jenkins that executes a series of commands remotely via SSH, but before the connection is established it throws this error: com.jcraft.jsch.JSchException: channel is not opened. In my topology, the Jenkins server runs on my main PC and I want to communicate with a CentOS 7 VM. On Jenkins I have configured everything (the SSH agent in the global configuration, for example). On the CentOS 7 VM, I don't think there's a need to open port 22. The expected result is simply to be able to execute the script (starting with just connecting). The VM's IP is 192.168.127.129. If you need any other information, ask in a comment. Thanks in advance.
I did not really resolve the problem. However, my VM was using a host-only connection; I changed it to NAT and the problem went away. It isn't a permanent fix or best practice, though: my VM is now connected to the internet and exposed to all of its dangers.

nginx with high traffic socket.io running on docker

I am building a web application for university which has a very high tick rate (clients receive data from the node server more than 30 times per second via socket.io). This works well in docker. I have now installed and configured nginx, and everything works (no exposed ports, socket still running, etc.), but nginx logs every single socket connection from every single client in the docker terminal (with two clients, well above 60 log lines per second). I think this also leads to performance issues and causes small lag for the clients. I did not find any solutions in their docs.

Jenkins Slave Connection Timeout When Connecting

Last week I set up a selenium grid using jenkins and 4 slave windows VMs. As part of doing this I had to unblock ports for both the slave connection and the selenium connection.
The VMs downloaded the jnlp starter and registered correctly, and by the end of the day Friday I had my tests running and reporting as expected.
Happy Monday: I come in to find that over the weekend the connections to all four of the VMs were lost due to connection timeouts. (The initial error indicated the connection had been terminated because the ping took too long; subsequent attempts never successfully connect in the first place.)
My research on SO so far points to issues with the ports, so I checked to make sure they are still enabled, and they are. Next I restarted the jenkins instance, and still no success.
Interestingly, the connection to the jenkins selenium grid IS working, each of the standalone servers starts and registers correctly on the VMs, and they are all able to access the jenkins ui from the browsers, just not able to register as a slave through jnlp.
At this point I am at a loss, I've mirrored the exact same setup that was working last week. I checked with our devOps team that manages the server and verified there have been no changes on that end. The VMs have been untouched.
Found a solution, but it leaves at least one question.
To resolve this I altered the Jenkins global security settings to use a fixed port for TCP connections and made sure it was one of my enabled ports. The connection goes through cleanly now.
That said, this should NOT have worked on its own. When trying to connect earlier, the logs clearly stated that connection attempts at the given port were refused (exact same port, and it was enabled then as well).
I could understand it if the agent had been trying to connect on a different port, but I don't understand why dedicating the port itself would make a difference to the connecting agent.
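For anyone debugging a similar "connection refused" situation, a quick way to confirm from the agent side whether a given TCP port actually accepts connections is a short socket probe. This is a minimal sketch using only the Python standard library; the function name port_is_open is hypothetical, and the host and port you pass in are whatever your own Jenkins setup uses.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running this from one of the VMs against the Jenkins host and the agent port separates "the port is blocked or nothing is listening" from problems further up in the agent handshake. It only tests the TCP connection, not the agent protocol itself.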

Troubleshoot windows error: Failed to schedule Software Protection service for re-start at 2014-09-13T08:09:23Z. Error Code: 0x80040154

My setup:
Operating System: Windows 8.1
Memory: 16GB
HD: 500GB
etc. (not relevant)
Issue:
I noticed the issue when my printer stopped working after a normal restart. At that point I tried to uninstall and reinstall the printer driver. The install failed.
At that point I also realized all my remote desktop connections were failing too, unable to find the remote host.
Here is the error I was seeing when the printer install failed:
can't start printer spooler service not enough resources are available to start the service
Manually trying to start the spooler service and checking the Windows Event Viewer showed that it was failing on:
Failed to schedule Software Protection service for re-start at 2014-09-13T08:09:30Z. Error Code: 0x80040154.
Further trying to restart the protection service revealed that it was failing on:
Task Scheduler service has encountered RPC initialization error in "RpcServerUseProtseq:ncacn_ip_tcp". Additional Data: Error Value: 1721.
In either case my computer had become somewhat useless, as I couldn't install anything and my printer and remote desktop were broken too.
I'm reporting the problem in case someone has a similar issue.
The resolution for my problem is actually posted here, but it is from way back, so I'm bringing it to light again:
http://social.technet.microsoft.com/Forums/windows/en-US/0c438376-1486-4ae4-9847-2de7a8767f27/task-scheduler-service-has-encountered-rpc-initialization-error-in?forum=itprovistasp
For me, what worked was just starting the command prompt in administrator mode, running:
netsh winsock reset
and restarting my machine.
I'm not exactly certain what actually fixed the issue.

MPI error due to Timeout in making connection to a remote process

I'm trying to run a NAS-UPC benchmark to study its profile. UPC uses MPI to communicate with remote processes.
When I run the benchmark with 64 processes, I get the following error:
upcrun -n 64 bt.C.64
"Timeout in making connection to remote process on <<machine name>>"
Can anybody tell me why this error occurs?
This probably means that you're failing to spawn the remote processes. upcrun delegates that to a per-conduit mechanism, which may involve your scheduler (if any). My guess is that you're depending on ssh-type remote access and that it's failing, probably because you don't have keys, an agent, or host-based trust set up. Can you ssh to your remote nodes without a password? Is the environment on the remote nodes sane (paths, etc.)?
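The passwordless-ssh question can be checked non-interactively. The following is a rough sketch, assuming an OpenSSH client is on the PATH: BatchMode=yes makes ssh fail instead of prompting for a password, and ConnectTimeout bounds the wait. The function names and the host name "node01" are hypothetical.

```python
import subprocess


def ssh_probe_command(host, timeout_s=5):
    """Build an ssh command that runs `true` remotely and never prompts."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # fail rather than ask for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up on unreachable hosts
        host,
        "true",
    ]


def can_ssh_without_password(host, timeout_s=5):
    """Return True if the probe command exits with status 0."""
    try:
        result = subprocess.run(
            ssh_probe_command(host, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is missing, or the probe hung past the overall timeout.
        return False
    return result.returncode == 0
```

Calling can_ssh_without_password("node01") for each compute node gives a quick list of the nodes where key, agent, or host-based trust is not working. It says nothing about whether the remote environment (paths, etc.) is sane.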
"upcrun -v" may illuminate the problem, even without resorting to the man page ;)
