MPI error due to timeout in making connection to a remote process

I'm trying to run a NAS-UPC benchmark to study its profile. UPC uses MPI to communicate with remote processes.
When I run the benchmark with 64 processes, I get the following error:
upcrun -n 64 bt.C.64
"Timeout in making connection to remote process on <<machine name>>"
Can anybody tell me why this error occurs?

This probably means that you're failing to spawn the remote processes. upcrun delegates that to a per-conduit mechanism, which may involve your scheduler (if any). My guess is that you're depending on ssh-type remote access, and that's failing, probably because you don't have keys, an agent, or host-based trust set up. Can you ssh to your remote nodes without a password? Is the environment on the remote nodes sane (paths, etc.)?
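The passwordless-ssh check can be scripted. The following is a minimal sketch, not part of upcrun: it assumes an OpenSSH client on the launching machine, and the function names and node names are hypothetical. `BatchMode=yes` makes ssh fail instead of prompting, so a missing key or agent shows up as a failure rather than a hang.

```python
import subprocess

def ssh_probe_command(host, timeout_s=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt; fail if keys/agent are missing
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        host,
        "true",                               # trivial remote command
    ]

def can_ssh_without_password(host, timeout_s=5):
    """Return True if `ssh host true` succeeds non-interactively."""
    try:
        result = subprocess.run(
            ssh_probe_command(host, timeout_s),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # ssh is not installed, or the attempt hung past the timeout
    return result.returncode == 0
```

Calling `can_ssh_without_password("node01")` for each compute node (with `node01` replaced by a real hostname) should return True everywhere before a 64-process run is likely to spawn cleanly.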
"upcrun -v" may illuminate the problem, even without resorting to the man page ;)

Related

Gitlab SSL Configuration for both Internal and External Access

Looking for a little help here. I'm trying to bootstrap a small side business, and I have never been the DevOps guy. I use the web-hosted version of GitLab to store my codebase, but I am unable to use it as a registry for the Docker images that I am creating from that code. The images are quite large, and pushing them back to the registry from the group gitlab-runner installed on my personal machine takes longer than the token expiration allows.
I have an extra machine sitting around, so I installed gitlab-ee on it and exposed it through a dynamic DNS service (NoIP). I then mirrored the repositories that I want to generate images for to my locally hosted GitLab instance. At first, I tried to use a runner on the same machine as my GitLab instance, but jobs always failed because all available memory was consumed and the machine locked up. All in all, the GitLab docs pretty much say not to run the runner and the instance on the same machine. So I went back to the runner I originally used for the web-hosted instance, but I am having issues pushing to my local instance. When trying to push to my repository (through the DDNS URL), I end up getting a lot of this:
e4fdbd3bf512: Retrying in X seconds
And it eventually times out due to the job time limit or the token time limit. I am guessing this is due to my connectivity not being great. What I would like to do is have the runner (installed on a local machine) push to the instance's local IP on my network, but I am unsure how to do this with the SSL setup. When trying to log in and push in my pipeline, I get the following error:
Error response from daemon: Get "https://xxx.xxx.xxx.xxx:xxxx/v2/": x509: cannot validate certificate for xxx.xxx.xxx.xxx because it doesn't contain any IP SANs
How do I correct this without affecting the https:// SSL that is already set up for accessing the instance through the DDNS URL? I appreciate any help you can give me.
I abandoned attempts at getting this to work. I ran through a bunch of scenarios of creating my own CA and trying to create certificates for the IP address and share them with the other machine. Ultimately, GitLab obscures some things with LetsEncrypt. Funnily enough, it was just a connectivity issue where I was getting timeouts. I ended up hard-wiring both machines and getting better throughput. I'm now able to push ~6GB Docker images up through the URL.
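For reference, the x509 error quoted in the question means the certificate's subjectAltName list has no entry for the IP being dialed. A minimal sketch of that check follows, assuming a certificate dict in the shape that Python's `ssl.SSLSocket.getpeercert()` returns for a validated connection. The function name, hostname, and addresses are hypothetical.

```python
import ipaddress

def cert_covers_ip(cert, ip):
    """Return True if the certificate's subjectAltName lists `ip`.

    `cert` is a dict shaped like the result of ssl.SSLSocket.getpeercert().
    """
    wanted = ipaddress.ip_address(ip)
    for kind, value in cert.get("subjectAltName", ()):
        # IP entries are reported with the kind "IP Address"; hostnames use "DNS".
        if kind == "IP Address" and ipaddress.ip_address(value.strip()) == wanted:
            return True
    return False

# Hypothetical certificate issued only for the DDNS hostname.
ddns_only_cert = {"subjectAltName": (("DNS", "mygitlab.example.net"),)}

# Hypothetical certificate that also lists the LAN address.
lan_cert = {
    "subjectAltName": (
        ("DNS", "mygitlab.example.net"),
        ("IP Address", "192.168.1.50"),
    )
}
```

With these samples, `cert_covers_ip(ddns_only_cert, "192.168.1.50")` is False, which corresponds to the "doesn't contain any IP SANs" failure, while `cert_covers_ip(lan_cert, "192.168.1.50")` is True.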

Sidekiq can't connect to database?

I have "mariadb" set to 127.0.0.1 in my /etc/hosts file and sidekiq occasionally throws errors such as:
Mysql2::Error::ConnectionError: Unknown MySQL server host 'mariadb' (16)
The VM is not under significant load or anything like that.
Later edit: seems other gems have trouble resolving hosts too:
WARN -- : Unable to record event with remote Sentry server (Errno::EBUSY - Failed to open TCP connection to XXXX.ingest.sentry.io:443 (Device or resource busy - getaddrinfo)):
Anyone have any idea why that may happen?
I figured this out a couple of weeks ago but wanted to be sure before posting an answer.
I still can't figure out the mechanism behind this issue, but it was caused by fail2ban.
I had it running in a container, polling the httpd logs and blocking the tremendous number of bots scraping my sites.
I also increased the maximum number of file handles and inotify watches:
fs.file-max = 131070
fs.inotify.max_user_watches = 65536
As soon as I got rid of fail2ban and increased the inotify watch limit, the errors disappeared.
Obviously fail2ban goes on the "do not touch" list because of this, and we've rolled out a 404/403/500 handler at the application layer that pushes unknown IPs to Cloudflare.
Although this is probably an edge case, I'm leaving this here in the hope that it helps someone at some point.
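To confirm that intermittent lookups like the "Unknown MySQL server host" and getaddrinfo errors above are really gone, it can help to resolve the name in a loop and count failures. This is a rough sketch, not something from the original setup: the function name is hypothetical, and the `resolve` parameter exists only so the probe can be exercised without a real resolver.

```python
import socket
import time

def probe_resolution(host, attempts=20, delay_s=0.0, resolve=socket.getaddrinfo):
    """Resolve `host` repeatedly; return (attempt_index, error) for each failure."""
    failures = []
    for i in range(attempts):
        try:
            resolve(host, None)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures.append((i, exc))
        if delay_s:
            time.sleep(delay_s)
    return failures
```

Running `probe_resolution("mariadb", attempts=1000, delay_s=0.01)` on the affected VM before and after a change gives a simple failure count to compare. An empty list means every lookup succeeded.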

Deploy is always in progress on AWS OpsWorks

The deploy does not finish, and it has failed. I tried to stop the instance, but all operations stay in progress. What do I need to do?
This usually happens when the instances on which you are running the command do not have a way to connect to the internet. On rare occasions, this could also indicate that the OpsWorks agent is not running, but that is less likely.
Check the firewall settings and outbound internet access. SSH into the machines and try to ping something on the internet.
If you are deploying your app to a private VPC, then you need to add NAT instances so that the instances have internet access.
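Ping uses ICMP, which a firewall may block even when outbound TCP works, so a TCP connection test is a useful complement. Below is a minimal sketch to run on the instance itself. The function name is hypothetical, and any public HTTPS host can serve as the target.

```python
import socket

def can_reach(host, port=443, timeout_s=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_reach("example.com")` returning False on an instance in a private subnet suggests there is no outbound route, which points back to the NAT setup.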

Troubleshoot windows error: Failed to schedule Software Protection service for re-start at 2014-09-13T08:09:23Z. Error Code: 0x80040154

My setup:
Operating System: Windows 8.1
Memory: 16GB
HD: 500GB
Other details are not relevant.
Issue:
I noticed the issue when my printer stopped working after a normal restart. I tried to uninstall and reinstall the printer driver, but the install failed.
At that point I also realized that all my remote desktop connections were failing because the remote host could not be found.
Here is the error I was seeing when the printer install was failing:
can't start printer spooler service not enough resources are available to start the service
Manually trying to start the spooler service and checking the Windows Event Viewer showed that it was failing on:
Failed to schedule Software Protection service for re-start at 2014-09-13T08:09:30Z. Error Code: 0x80040154.
Further trying to restart the protection service revealed that it was failing on:
Task Scheduler service has encountered RPC initialization error in "RpcServerUseProtseq:ncacn_ip_tcp". Additional Data: Error Value: 1721.
Either way, my computer had become somewhat useless: I couldn't install anything, and my printer and remote desktop were broken too.
I'm reporting the problem in case someone has a similar issue.
The resolution for my problem was actually posted here a long time ago, so I'm bringing it to more light:
http://social.technet.microsoft.com/Forums/windows/en-US/0c438376-1486-4ae4-9847-2de7a8767f27/task-scheduler-service-has-encountered-rpc-initialization-error-in?forum=itprovistasp
What worked for me was simply starting a command prompt in administrator mode, running:
netsh winsock reset
and restarting my machine.
I'm not exactly certain what actually fixed the issue.

How can I connect to a remote server for CPU process time monitoring?

I want to connect to a remote server to monitor the CPU process time while I run a stress test.
But it always fails. What can I do to successfully connect to the remote server?
If you are using Linux, you can ssh into the remote server if you know its hostname or IP address, as explained here.
You also need to know the login credentials for the remote server, for example the root password.
To check the CPU process time, memory, etc. during the stress test, you can use SeaLion.
It lets you monitor the output of commands like top and free -m in a graphical interface, so you don't need to connect to the remote server every time you run your test.
There is also New Relic, which is feature-rich and provides functionality such as graphing.
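Without a monitoring service, a rough CPU figure can also be computed from two samples of `/proc/stat` taken over ssh. This is only a sketch under the assumption that the remote server runs Linux. The function names are hypothetical, and the samples could be fetched with something like `ssh user@host cat /proc/stat`. It measures system-wide busy time, not the time of a single process.

```python
def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from the 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")

def cpu_busy_fraction(before_text, after_text):
    """Fraction of CPU time spent non-idle between two /proc/stat samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    # First eight counters: user, nice, system, idle, iowait, irq, softirq, steal.
    # The guest counters that follow are already included in user and nice.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)  # idle + iowait
    return (total - idle) / total
```

Taking one sample just before the stress test and one just after gives the average busy fraction over the whole run. Sampling every few seconds gives a time series instead.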
