A Jenkins job (in Network A) runs on a slave machine (say, server A in Network A). As part of the build, the job SSHes to another server (say, server B in Network B) and executes further steps there.
The job runs for about 2.5 hours. It fails at random with the following message:
18:24:14 Aborted by <USERNAME>
18:24:14 Finished: ABORTED
On server B, where the build is executed, TCP keepalive is set to yes and a probe is sent every 80 seconds. At the kernel level, the TCP keepalive parameter is set to 2.5 hours.
I'm sure the problem is not a timeout on this machine, as I have seen a run lasting 157 minutes pass successfully.
The build log contains no further lines and is not descriptive.
How can I effectively debug this problem? We are unable to learn anything from the network traffic, as there is only one session once the slave is connected over SSH.
If this is caused by an error within the build, how can I make Jenkins emit a descriptive message so that we can narrow down the root cause?
What specifically can be tracked on the network to check whether this is due to a network glitch?
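One network-side number worth working out is how long a silent connection survives before the kernel gives up on it. The following is a minimal Python sketch, assuming Linux keepalive semantics (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes). The function name and the sample values are illustrative and are not taken from server B.

```python
def keepalive_detection_seconds(idle_time: int, interval: int, probes: int) -> int:
    """Worst-case seconds before the kernel declares a silent peer dead.

    idle_time: seconds of idleness before the first probe (tcp_keepalive_time)
    interval:  seconds between unanswered probes (tcp_keepalive_intvl)
    probes:    unanswered probes tolerated (tcp_keepalive_probes)
    """
    return idle_time + interval * probes


# Hypothetical values: 2.5 h idle time, plus the usual Linux defaults
# of 75 s between probes and 9 probes.
total = keepalive_detection_seconds(9000, 75, 9)
print(f"dead peer detected after about {total / 60:.1f} minutes")
```

With an idle time this long, the first kernel probe would only be sent after 2.5 hours of silence, so any firewall or NAT device between the two networks with a shorter idle timeout could drop the session first. Whether such a device exists on this path is an assumption to verify.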
I have created a Dataflow pipeline that reads a file from a Storage bucket and applies a simple transform to the data (e.g. trimming spaces).
When I execute the Dataflow job, it starts and the log shows that the workers are started in a zone, but after that nothing happens. The job never completes or fails; I had to stop it manually.
The job is executed by a service account that has the dataflow.worker, dataflow.developer and dataflow.objectAdmin roles.
Can someone suggest why the Dataflow job does not complete, or why nothing is executed after the worker starts?
2021-02-09 11:01:29.753 GMT Worker configuration: n1-standard-1 in europe-west2-b.
Warning
2021-02-09 11:01:30.015 GMT The network sdas-global-dev doesn't have rules that open TCP ports 12345-12346 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.
Info
2021-02-09 11:01:31.067 GMT Executing operation Read files/Read+ManageData/ParDo(ManageData)
Info
2021-02-09 11:01:31.115 GMT Starting 1 workers in europe-west2-b...
Warning
2021-02-09 11:07:33.341 GMT The network sdas-global-dev doesn't have rules that open TCP ports 12345-12346 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.
I found the problem. I was running the job in one region while the VPC was in a different region, so the worker was not able to spin up. Once I set the job's region to match the VPC's, everything went well.
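To catch this kind of mismatch before launching, the region can be compared with the subnetwork path when the pipeline arguments are assembled. This is only a sketch: the helper names, project, bucket and subnetwork below are hypothetical, and it assumes the subnetwork is given in the regions/REGION/subnetworks/NAME form accepted by the --subnetwork option.

```python
def subnetwork_region(subnetwork: str) -> str:
    """Return REGION from a path containing 'regions/REGION/subnetworks/NAME'."""
    parts = subnetwork.split("/")
    return parts[parts.index("regions") + 1]


def dataflow_args(project: str, region: str, subnetwork: str, temp_location: str) -> list:
    """Build Dataflow pipeline arguments, refusing a region/subnetwork mismatch."""
    subnet_region = subnetwork_region(subnetwork)
    if subnet_region != region:
        raise ValueError(
            f"job region {region!r} differs from subnetwork region {subnet_region!r}"
        )
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--subnetwork={subnetwork}",
        f"--temp_location={temp_location}",
    ]


# Hypothetical names, for illustration only.
args = dataflow_args(
    "my-project",
    "europe-west2",
    "regions/europe-west2/subnetworks/my-subnet",
    "gs://my-bucket/tmp",
)
print(args)
```

The resulting list can be passed to the pipeline's options when the job is created.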
I have set up a Jenkins master (on a VM), and it provisions JNLP slaves as Kubernetes pods.
On very rare occasions, the pipeline fails with this message:
java.io.IOException: Pipe closed
at java.io.PipedInputStream.checkStateForReceive(PipedInputStream.java:260)
at java.io.PipedInputStream.receive(PipedInputStream.java:226)
at java.io.PipedOutputStream.write(PipedOutputStream.java:149)
at java.io.OutputStream.write(OutputStream.java:75)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.setupEnvironmentVariable(ContainerExecDecorator.java:510)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:474)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:333)
at hudson.Launcher$ProcStarter.start(Launcher.java:455)
Viewing the Kubernetes logs in Stackdriver, one can see that the pod does manage to connect to the master, e.g.
Handshaking
Agent discovery successful
Trying protocol: JNLP4-Connect
Remote Identity confirmed: <some_hash_here>
Connecting to <jenkins-master-url>:49187
started container
loading plugin ...
but after a while it fails and here are the relevant logs:
org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave$SlaveDisconnector call
INFO: Disabled slave engine reconnects.
hudson.remoting.jnlp.Main$CuiListener status
Terminated
hudson.remoting.Request$2 run
Failed to send back a reply to the request hudson.remoting.Request$2#336ec321: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel#29d0e8b2:JNLP4-connect connection to <jenkins-master-url>/<public-ip-of-jenkins-master>:49187": channel is already closed
"Processing signal 'terminated'"
.
.
.
How can I further troubleshoot this random error?
Can you take a look at the Kubernetes pod events in Stackdriver? We had similar behavior with a different CI system (GitLab CI). Our builds were also failing randomly. It turned out that the JVM inside the container exceeded its memory limit and was killed by Kubernetes (OOMKilled), and the CI system reported this as a build error.
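If the pod is still around, its status can be checked directly for this. Below is a minimal Python sketch that scans the JSON printed by kubectl get pod <name> -o json for containers whose current or previous state was terminated with reason OOMKilled. The function name and the sample document are made up for illustration.

```python
import json


def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose current or last state is OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names


# Trimmed, hypothetical pod status for illustration.
sample = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "jnlp", "state": {"running": {}}, "lastState": {}},
      {"name": "build", "state": {"running": {}},
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
    ]
  }
}
""")
print(oom_killed_containers(sample))
```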
I know I posted a question that had been bugging me for days and then found a solution just 5 minutes after posting, so I am posting about a problem I have been stuck on for the past 2 hours. I have a job in Jenkins that executes a series of commands remotely via SSH, but before the connection is established it throws this error: com.jcraft.jsch.JSchException: channel is not opened. In my topology, the Jenkins server runs on my main PC and I want it to communicate with a CentOS 7 VM. On the Jenkins side I have configured everything (the SSH agent in the global configuration, for example). On the CentOS 7 VM, I don't think there is a need to open port 22. The expected result is obviously being able to execute the script (let's begin by connecting). My VM has the IP 192.168.127.129. If you need any other information, ask in a comment. Thanks in advance.
I did not really resolve the problem. However, my VM was using a host-only connection; I changed it to NAT and the problem went away. This is neither a permanent fix nor best practice, since my VM is now connected to the internet and exposed to all of its dangers.
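Before changing the VM's network mode, it may help to check from the Jenkins host whether the SSH port is reachable at all. This is a minimal Python sketch using only the standard library; the function name and the timeout are arbitrary choices, and a successful TCP connection only shows that something is listening, not that SSH authentication will work.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


# Example call, using the VM address from the question:
# can_connect("192.168.127.129")
```

If this returns False from the Jenkins host, the problem lies in routing or the firewall rather than in the Jenkins SSH configuration.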
I am using jenkins-test-harness to run some tests on my jenkins library code, but when it executes the tests I get the following error for each test :
hudson.UDPBroadcastThread#run: Cannot listen to UDP port 33,848, skipping: java.net.SocketException: Can't assign requested address
The test will pass (if it should pass), but it then takes about 75 seconds for the jenkins server to shut down. I believe that these two are related, but I can't work out why I am getting this error. I have nothing else running on this port.
When I run the tests within a Gradle Docker container, rather than locally on the command line or inside the IDE (IntelliJ). This is very frustrating. While it does not change the result of the tests, it increases the total run time from about 10 minutes to over 1 hour and 15 minutes.
Am I missing a setting which is making this fail?
For me this was caused by Jenkins assuming that the default IP address it would be provided with would be IPv4 when in fact my machine was dual stack, preferring IPv6. I resolved it by ensuring that the integrationTest section of my build.gradle file had systemProperties 'java.net.preferIPv4Stack' : true. A bit like this:
integrationTest {
    /* other statements */
    systemProperties 'java.net.preferIPv4Stack': true
}
I must confess I saw no significant difference to my Jenkins shutdown time. I'd be interested to know if this resolves the error message, and if that resolves your overall issue.
I have an iOS archive job on a Mac slave that takes a long time, sometimes 30 minutes.
The problem is that the long-lived SSH connection often disconnects, which causes the task to fail.
How can I avoid this? What I'm looking for is a way for the task to keep running even when the long-lived connection drops.
Adding a keepAlive option has been a feature request since 2014.
The ticket includes the following proposed workarounds:
Change /etc/ssh/ssh_config by appending the following line to the end of the file. This tells the SSH client to send a no-op message periodically so that the connection is not dropped.
The value 80 is in seconds. You may tune this parameter based on your network conditions.
ServerAliveInterval 80
In the Jenkins slave configuration page, change the Launch method to "Launch slave via execution of command on the Master". See the Jenkins built-in help for more details.
So far, I haven't seen any issues with this configuration. Hope this helps.
See also "Remoting issues / SSH slaves".
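The first workaround can be scripted so that it is safe to run more than once. This is a rough Python sketch that works on the file contents as a string. The function name is made up, and adding a Host * line before the option is my own assumption (so the setting applies to every host rather than to the last Host block in the file); the ticket only says to append the line.

```python
def ensure_server_alive(config_text: str, interval: int = 80) -> str:
    """Return config_text with a ServerAliveInterval setting, adding one if absent."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # ssh_config keywords are case-insensitive and may be followed by '='.
        keyword = stripped.replace("=", " ").split()[0]
        if keyword.lower() == "serveraliveinterval":
            return config_text  # already configured, leave the file alone
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + f"Host *\n    ServerAliveInterval {interval}\n"


# Hypothetical existing configuration.
updated = ensure_server_alive("Host example\n    User jenkins")
print(updated)
```

Reading the real file, writing the result back, and keeping a backup are left out on purpose.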