Jenkins SSH slave disconnects: how to make the task continue running - jenkins

I have an iOS archive job on a Mac slave that can take a long time, sometimes 30 minutes.
The problem is that the long-lived SSH connection often drops, which causes the task to fail.
How can I avoid this? What I'm looking for is a way for the task to keep running even when the connection drops.

Adding a keepAlive option has been an open feature request since 2014.
The ticket proposes the following workaround:
Change /etc/ssh/ssh_config by appending the following line to the end of the file. This tells the SSH client to send a no-op (keepalive) message periodically so that the SSH connection is not dropped.
ServerAliveInterval 80
The value 80 is in seconds. You may tune this parameter based on your network conditions.
In the Jenkins slave configuration page, change the Launch method to "Launch slave via execution of command on the Master". See the Jenkins built-in help for more details.
So far, I haven't seen any issues with this configuration. Hope this helps.
See also "Remoting issues / SSH slaves".
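The ssh_config change above can be scripted so that it is safe to apply more than once. The following is a minimal sketch, not the ticket's method: the function name `ensure_keepalive` is hypothetical, and wrapping the setting in a `Host *` block is an added assumption, so that the appended line does not end up inside whatever `Host` block happens to be last in the file.

```python
def ensure_keepalive(config_text: str, interval: int = 80) -> str:
    """Return config_text with a ServerAliveInterval setting present.

    If an uncommented ServerAliveInterval line already exists, the text is
    returned unchanged. Otherwise a "Host *" block is appended (assumption:
    a global setting is wanted, as in the workaround above).
    """
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        # ssh_config keywords are case-insensitive and may use "key value"
        # or "key=value".
        parts = stripped.replace("=", " ").split()
        if parts and parts[0].lower() == "serveraliveinterval":
            return config_text
    # Make sure the appended block starts on its own line.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{separator}Host *\n    ServerAliveInterval {interval}\n"
```

To use it, read /etc/ssh/ssh_config on the machine that opens the SSH connection, pass the text through `ensure_keepalive`, and write the result back only if it changed.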

Related

Jenkins IllegalArgumentException while adding a new slave

I want to add a new slave to Jenkins. Following the Jenkins UI, I was given the command below:
java -jar agent.jar -jnlpUrl http://<jenkins_url>/computer/<slave_name>/slave-agent.jnlp -secret 4b59708a20e155c8ccb39f1fb046be09f72c712ed839401195c475d5fdb2b0e5
When I executed that command, the output was:
Exception in thread "main" java.lang.IllegalArgumentException: IV buffer too short for given offset/length combination
    at javax.crypto.spec.IvParameterSpec.<init>(IvParameterSpec.java:80)
    at hudson.remoting.Launcher.parseJnlpArguments(Launcher.java:515)
    at hudson.remoting.Launcher.run(Launcher.java:325)
    at hudson.remoting.Launcher.main(Launcher.java:283)
Could you please help me with this error? Any help will be appreciated. Thanks in advance.
Best Regards.
I reviewed the setting of the IV length in Jenkins code as well as in the Jenkins agents code (remoting) and it seems to be consistently set to 16 bytes everywhere.
However, by running curl to GET the slave-agent.jnlp URL ($JENKINS_URL/computer/$node_name/slave-agent.jnlp), I found that the http:// URL I thought I should be using returns just "302 Found", with the "location" header set to the same URL but with https://. When I curled the https:// URL, I saw messages about missing permissions in Jenkins (first Read, then Connect for agents). Adding those permissions for anonymous users (at $JENKINS_URL/configureSecurity, using matrix-based security) resolved that issue for me.
Or rather, it turned it into another issue, which was "Connection refused". It took me another while to figure out that -- for our Jenkins master running in a container -- in the Global Security configuration, "TCP port for inbound agents" must be set to the container-internal port, while in the node configuration, "Tunnel connection through" has to be set to the external port.
I hope my debugging exercise will be at least partially applicable in your context, too.
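The curl check described above can also be sketched in Python. This is a minimal illustration under stated assumptions: the function names `probe` and `explain` are hypothetical, the hint texts are only illustrative, and treating 401/403 as "missing permissions" is a general HTTP convention rather than something confirmed for every Jenkins setup.

```python
import http.client
from urllib.parse import urlsplit


def probe(url, timeout=10.0):
    """GET url once, without following redirects.

    Returns (status_code, value of the Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def explain(status, location):
    """Turn a probe result into a rough hint (wording is illustrative only)."""
    if status in (301, 302, 303, 307, 308):
        return f"redirect: retry the agent command with {location}"
    if status in (401, 403):
        return "access denied: check Read / Connect permissions for this user"
    if status == 200:
        return "ok: the JNLP file is served directly"
    return f"unexpected status {status}"


# Example call (hypothetical URL, substitute your own Jenkins URL and node name):
# status, location = probe("http://jenkins.example.com/computer/my-node/slave-agent.jnlp")
# print(status, explain(status, location))
```

`http.client` is used here because it never follows redirects on its own, so a 302 from the http:// URL shows up directly instead of being hidden.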

Jenkins SSH: why does it show the error com.jcraft.jsch.JSchException: channel is not opened?

I know I recently posted a question about something that had been bugging me for days and then found the solution five minutes after posting, so now I am posting about a problem I have had for the past two hours. I have a Jenkins job that executes a series of commands remotely via SSH, but before the connection is even established it fails with this error: com.jcraft.jsch.JSchException: channel is not opened. In my topology, the Jenkins server runs on my main PC and I want it to communicate with a CentOS 7 VM whose IP is 192.168.127.129. On the Jenkins side I have configured everything (the SSH agent in the global configuration, for example). On the CentOS 7 VM, I don't think there is any need to open port 22. The expected result is obviously to be able to execute the script (starting with simply connecting). If you need any other information, ask in a comment. Thanks in advance.
I did not find the root cause. However, my VM was using a host-only network connection; I changed it to NAT and the problem went away. This is neither a permanent fix nor best practice, because the VM is now connected to the internet and exposed to all of its dangers.

Jenkins build log shows aborted by user

A Jenkins job (in Network A) runs on a slave machine (say, server A in Network A). As part of the build, the job connects over SSH to another server (say, server B in Network B) and executes further steps there.
The job runs for about 2.5 hours. At random, it fails with the following error message:
18:24:14 Aborted by <USERNAME>
18:24:14 Finished: ABORTED
On server B, where the build is executed, TCP keepalive is set to yes and sends a probe every 80 seconds. At the kernel level, the tcpkeepalive parameter is set to 2.5 hours.
I'm sure the problem is not a timeout on this machine, as I have seen a run with a duration of 157 minutes pass successfully.
The build log contains no further lines and is not descriptive.
How can I effectively debug this problem? We are unable to learn anything from the network traffic, as there is only one session once the slave is connected over SSH.
If this is due to an error within the build, how can I make Jenkins emit a descriptive message so that we can narrow down the root cause?
What specifically can be tracked on the network to check whether this is due to a network glitch?

Jenkins Slave Connection Timeout When Connecting

Last week I set up a Selenium grid using Jenkins and four Windows slave VMs. As part of doing this I had to unblock ports for both the slave connection and the Selenium connection.
The VMs downloaded the JNLP starter and registered correctly, and by the end of the day on Friday my tests were running and reporting as expected.
On Monday I came in to find that, over the weekend, the connections to all four VMs had been lost due to connection timeouts. (The initial error indicated the connection had been terminated because the ping took too long; subsequent attempts never connected successfully in the first place.)
My research on SO so far points to issues with the ports, so I checked that they are still enabled, and they are. Next I restarted the Jenkins instance, still with no success.
Interestingly, the connection to the Jenkins Selenium grid IS working: each of the standalone servers starts and registers correctly on the VMs, and they can all access the Jenkins UI from their browsers. They just cannot register as slaves through JNLP.
At this point I am at a loss. I've mirrored the exact setup that was working last week. I checked with the DevOps team that manages the server and verified there have been no changes on that end. The VMs have been untouched.
Found a solution, but it leaves at least one question.
To resolve this, I changed the Jenkins global security settings to use a fixed port for TCP connections and made sure it was one of my enabled ports. The connection goes through cleanly now.
That said - this should NOT have worked on its own. When trying to connect earlier the logs clearly stated that connection attempts at the given port were refused (exact same port, and it was enabled then as well.)
I can understand if the agent was trying to connect at a different port, but I don't understand why dedicating the port itself would make a difference to the connecting agent.
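When a log says a connection was refused on a port that is supposedly open, it can help to test the port directly from the agent machine and tell "refused" apart from "timed out". The following is a minimal sketch; the function name `port_status` is hypothetical, and the host and port in the usage comment are placeholders, not values from the post.

```python
import socket


def port_status(host, port, timeout=5.0):
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host is reachable, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example call (placeholder values, use your Jenkins host and agent port):
# print(port_status("jenkins.example.com", 50000))
```

A "refused" result for the agent port while the web UI port is "open" would be consistent with Jenkins not listening on the port the agent was told to use, which is one thing a fixed port setting changes.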

MPI error due to Timeout in making connection to a remote process

I'm trying to run a NAS-UPC benchmark to study its profile. UPC uses MPI to communicate with remote processes.
When I run the benchmark with 64 processes, I get the following error:
upcrun -n 64 bt.C.64
"Timeout in making connection to remote process on <<machine name>>"
Can anybody tell me why this error occurs?
This probably means that you're failing to spawn the remote processes. upcrun delegates that to a per-conduit mechanism, which may involve your scheduler (if any). My guess is that you're depending on ssh-type remote access and that it is failing, probably because you don't have keys, an agent, or host-based trust set up. Can you ssh to your remote nodes without a password? Is the environment on the remote nodes sane (paths, etc.)?
"upcrun -v" may illuminate the problem, even without resorting to the man page ;)
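The "can you ssh without a password?" question can be checked for many nodes at once. The following is a minimal sketch, assuming the OpenSSH client is installed; the function names are hypothetical and the node names in the usage comment are placeholders. `BatchMode=yes` makes ssh fail instead of prompting for a password, and `ConnectTimeout` bounds how long an unreachable node is waited for.

```python
import subprocess


def ssh_check_command(host, connect_timeout=5):
    """Build an ssh command that runs `true` on host and never prompts."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # fail rather than ask for a password
        "-o", f"ConnectTimeout={connect_timeout}",  # seconds to wait for the connection
        host,
        "true",
    ]


def can_ssh_without_password(host, connect_timeout=5):
    """Return True if a non-interactive ssh login to host succeeds."""
    try:
        result = subprocess.run(
            ssh_check_command(host, connect_timeout),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=connect_timeout + 10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # ssh is not installed, or the attempt hung past the overall limit.
        return False
    return result.returncode == 0


# Example call (placeholder node names):
# for node in ["node01", "node02"]:
#     print(node, can_ssh_without_password(node))
```

Any node that reports False here is a likely candidate for the spawn timeout, and is worth testing by hand before rerunning upcrun.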

Resources