Ansible stops executing tasks randomly without any error - Jenkins

We are running an automated job in Jenkins that triggers an Ansible playbook. The playbook execution works about 80% of the time, but sometimes it hangs at a task without any error, as shown below:
TASK [Install utilities like vim, mlocate etc.] ***************
changed: [192.0.100.30]
changed: [192.0.100.27]
It does not stop at the same task every time; it hangs at random points.
I am unable to reproduce this issue when I execute the playbook manually.

I have faced a similar issue earlier. It can happen when a specific host is hung or high on resource utilisation; in that case the Ansible task for that host stays queued and sometimes appears hung.
Whenever such a situation recurs, immediately check the health of the host that Ansible is trying to apply the playbook to.
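For example, when the playbook next appears stuck, a quick check like the following against the target host can tell you whether the host itself is the problem (the IP is just taken from the task output above, and site.yml is a placeholder for your playbook):

# Check load, memory, disk and top CPU consumers on the suspect host
ssh 192.0.100.27 'uptime; free -m; df -h /; ps aux --sort=-%cpu | head -5'

# Re-run the playbook verbosely against only that host to see where it waits
ansible-playbook site.yml -vvv --limit 192.0.100.27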

Related

How to identify commands run by Ansible on a remote host in a Falco context?

I would like to know if someone has an idea of how to identify commands run by Ansible on a remote host.
To give you more context, I will describe my workflow in depth:
I have a scheduled job between 1 am and 6 am which runs a compliance Ansible playbook to ensure the production servers' configuration is up to date and well configured; this playbook changes some files inside the /etc folder.
Besides this, I have a Falco stack which keeps an eye on what is going on on the production servers and raises alerts when an event I describe as suspicious is found (it can be a syscall, a network connection, sensitive file editing such as /etc/passwd or pam.conf, and so on).
The problem I'm running into is that my playbook triggers some alerts, for example:
Warning Sensitive file opened for reading by non-trusted program (user=XXXX user_loginuid=XXX program=python3 command=python3 file=/etc/shadow parent=sh gparent=sudo ggparent=sh gggparent=sshd container_id=host image=<NA>)
My question is: can we set a "flag or prefix" on all Ansible commands that would allow me to whitelist that flag or prefix and avoid triggering my alerts for nothing?
PS: whitelisting python3 for the user root is not a solution in my opinion.
Ansible is a Python tool, so the process accessing the file will be python3. The commands that Ansible executes depend on the steps in the playbook.
You can solve your problem by modifying the Falco rules. You can evaluate proc.pcmdline in the Falco rule and the chain of proc.aname to identify that the command was executed by the Ansible process (e.g. the process is python3, the parent is sh, the grandparent is sudo, and so on).
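As a rough sketch (the ancestry chain here just mirrors the alert in the question, python3 <- sh <- sudo <- sh <- sshd, and may differ depending on your connection/become settings), you could add a macro to falco_rules.local.yaml and exclude it from the rule that fires:

# Hypothetical macro describing the Ansible process ancestry seen in the alert above
- macro: ansible_compliance_run
  condition: >
    proc.name = python3 and
    proc.pname = sh and
    proc.aname[2] = sudo and
    proc.aname[3] = sh and
    proc.aname[4] = sshd

Then override the condition of the rule that raises the alert (for this alert it should be the stock "Read sensitive file untrusted" rule) by appending "and not ansible_compliance_run". You may want to narrow the macro further, e.g. to the dedicated user your compliance job runs as, so that not every python3 under sudo gets whitelisted.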

Jenkins kills SSH session when supervisord restarts

I'm using Jenkins to perform a few actions on a remote server.
I have an Execute Shell build step in which I do the following:
sudo ssh <remote server> 'sudo service supervisor restart'
sleep 30
When Jenkins reaches the first line I can see 'Restarting Supervisor', but after a moment I see that Jenkins closes the SSH connection and moves on to the second line.
I tried adding a 'sleep 30' after the restart command but it still doesn't work.
It seems Jenkins doesn't wait for the supervisor restart command to complete.
The problem is that it doesn't happen every time, just sometimes, but it causes a lot of problems when it does.
I think you can never be certain that all processes started by supervisord are in a 'ready' state after a restart. Even if the restart action waited for the processes to be started, it wouldn't know whether they are 'ready'.
In docker-compose setups that need to know whether a certain service is available, I've used an extra 'really ready' check for this, optionally in a loop with a sleep/wait. If the process that you are starting opens a port, you can use one of the variations of 'wait-for' for this.
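For example, something along these lines as the Jenkins Execute Shell step (the port 8080 and the 60-second budget are placeholders for whatever service supervisord actually manages):

sudo ssh <remote server> 'sudo service supervisor restart'

# Poll until the managed service really accepts connections again
for i in $(seq 1 60); do
    nc -z <remote server> 8080 && exit 0   # ready -- let the build step succeed
    sleep 1
done
echo "service did not come back within 60s" >&2
exit 1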

Find the last 5 apps which were restarted in Marathon

I need a curl command to find the applications in Mesos Marathon that were restarted in the last 30 minutes, along with the time of restart.
For example, I run the curl command in the terminal like this:
curl http://marathon:5050/............
Then the output should be like:
APP TIME_OF_RESTART
app1 2018-07-01 23:45PM IST
If I can get the curl command then I can write a script to automate it to provide the details required.
It looks like you may want to use the Marathon Event Bus, which streams all Marathon events.
The event you would be interested in is unhealthy_task_kill_event, if you are looking for tasks that failed health checks enough times and required a "restart".
From the Marathon REST API documentation:
If a task fails more than maxConsecutiveFailures health checks consecutively, that task is killed causing Marathon to start more instances. These restarts are modulated like any other failing app by backoffSeconds, backoffFactor and maxLaunchDelaySeconds. The kill of the unhealthy task is signalled via unhealthy_task_kill_event event.
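As a sketch, subscribing to the event bus looks roughly like this (http://marathon:8080 is a placeholder for your Marathon endpoint, which is separate from the Mesos master on port 5050):

# Stream Marathon events (Server-Sent Events) and keep the restart-related ones
curl -s -H "Accept: text/event-stream" http://marathon:8080/v2/events |
  grep --line-buffered unhealthy_task_kill_event

From the timestamp field of each matching event you can then filter for the last 30 minutes in your script.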

Jenkins: 2 master nodes using NFS

I'm thinking about the following high-availability solution for my environment:
Datacenter with one powered-on Jenkins master node.
Datacenter for disasters with one powered-off Jenkins master node.
Datacenter one is always powered on; the second is only for disasters. My idea is to install the two Jenkins masters with the same IP but sharing an NFS. If the first one goes down, the second starts with the same IP and I still have my service.
My question is: can this solution work?
Thanks all for the help ;)
I don't see any particular reason why it should not work, but you still have to monitor the switch-over. I have faced a situation where jobs that were running when Jenkins abruptly shut down were still in the queue when the service was recovered, but they never completed afterwards; I had to delete the builds manually using the script console.
On the Jenkins forum a lot of people have reported such bugs. Most of them seem to have been fixed, but there are still cases where this can happen, because every time Jenkins is started or restarted the configuration is reloaded from disk, so there is occasional inconsistency between the in-memory config that existed earlier and the reloaded config.
So in your case it might happen that an executor thread is still blocked when the service is recovered. You have to make sure that everything is running fine after recovery.

How many threads should Jenkins run?

I have a Jenkins server that keeps running out of memory and cannot create native threads. I've upped the memory and installed the Monitoring plugin.
There are about 150 projects on the server, and I've been watching the thread count creep up all day. It is now around 990. I expect when it hits 1024, which is the user limit for threads, Jenkins will run out of memory again.
[edit]: I have hit 1016 threads and am now getting the out of memory error
Is this an appropriate number of threads for Jenkins to be running? How can I tell Jenkins to destroy threads when it is finished with them?
tl;dr:
There was a post-build action running a bash script that didn't return anything via stderr or stdout to Jenkins. Therefore, every time the build ran, threads would be created and get stuck waiting. I resolved this issue by having the bash script return an exit status.
long answer
I am running Jenkins on CentOS and have installed it via the RPM. In terms of modifying the Winstone servlet container, you can change that via Jenkins's init script configuration in /etc/sysconfig/jenkins. However, the handler-count options described in the answer below only control the number of HTTP threads that are created, not the overall number of threads.
That would be a solution if my threads were hanging on accessing an HTTP API of Jenkins as part of a post-commit action. However, using the ever-handy Monitoring plugin mentioned in my question, I inspected the stuck threads.
The threads were stuck on something in the com.trilead.ssh2.channel package. The getChannelData method has a while(true) loop that looks for output on the stderr or stdout of an SSH stream. The thread was getting stuck in that loop because nothing was coming through. I learned this on GrepCode.
This was because the post-build action was to connect to a server via SSH and execute a bash script that would inspect a git repo. However, the git repo was misconfigured and the git command would error, but the exit 1 status did not bubble up through the bash script (partially due to a malformed if-elif-else statement).
The script completed and the build was considered a success, but somehow the thread handling the SSH connection from Jenkins was left hanging due to this git error.
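For reference, the shape of the fix was simply making sure every branch of the script produces output and an explicit exit status (the path and checks below are placeholders, not the original script):

#!/bin/bash
# Placeholder sketch: every branch prints something and exits explicitly,
# so the SSH channel receives data and Jenkins sees success or failure
REPO=/opt/myrepo   # placeholder path

if git -C "$REPO" status >/dev/null 2>&1; then
    echo "repo OK"
    exit 0
elif [ -d "$REPO" ]; then
    echo "repo present but git failed" >&2
    exit 1
else
    echo "repo missing" >&2
    exit 1
fi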
But thank you for your help on this question!
If you run Jenkins "out of the box", it uses the Winstone servlet container. You can pass command-line arguments to it as described here. Some of those parameters can limit the number of threads:
--handlerCountStartup = set the no of worker threads to spawn at startup. Default is 5
--handlerCountMax = set the max no of worker threads to allow. Default is 300
--handlerCountMaxIdle = set the max no of idle worker threads to allow. Default is 50
Now, I tried this some time ago and was not 100% convinced that it worked, so no guarantees, but it is worth a try.
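For what it's worth, on an RPM install these flags would typically be passed through the sysconfig file; the exact file and variable name depend on your packaging, so treat this as an assumption:

# /etc/sysconfig/jenkins -- values here are only examples
JENKINS_ARGS="--handlerCountStartup=5 --handlerCountMax=100 --handlerCountMaxIdle=20"

Restart the Jenkins service afterwards for the change to take effect.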
