Jenkins - Master and Slave clocks out of sync

I've installed a master Jenkins instance and 2 slave nodes.
Neither slave is in sync with the master: sometimes the slaves appear to be an hour or even two days ahead of the master, and sometimes behind it - the offset seems to change at random.
Because of this, some Selenium tests, builds and other jobs no longer work correctly. The problem appeared suddenly and occurs regardless of which Jenkins version is installed.
Does anyone have an idea how to fix this problem?
Thank you very much.
Cheers
Christoph

It is hard to explain why the time difference would vary abruptly between the two machines. I assume you are referring to the information given by the http://jenkins.mydomain/computer/ URL.
Normally you want to keep your machines in time sync, and enable NTP clients on each host, each pointing to the same set of NTP servers, either internal to your organization (if that is available), or the standard free NTP services available on the web.
Do you have this setup already and still see abrupt time variations? If so, review your list of NTP servers and make sure every host uses the same list of reliable servers; that should help. Maybe narrow it down to just one server and then expand if need be.
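For reference, here is a minimal sketch of how you might check and enforce time sync on the master and each slave, assuming a Linux host running ntpd or chrony (substitute your internal NTP servers for the public pool if you have them):

# Check whether the clock is currently synchronized and how large the offset is.
timedatectl status      # look for "System clock synchronized: yes"
ntpq -p                 # ntpd: list peers with offset/jitter in milliseconds
chronyc tracking        # chrony: offset from the selected reference clock

# Make every host use the same server list, e.g. in /etc/ntp.conf:
#   server 0.pool.ntp.org iburst
#   server 1.pool.ntp.org iburst
# then restart the daemon:
sudo systemctl restart ntpd     # or: sudo systemctl restart chronyd

Once all hosts report offsets of at most a few milliseconds, the clock difference shown on the /computer/ page should stop jumping around.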

Related

Distributed Erlang - network split recovery and using heart with distributed applications

I have a standard situation: two distributed Erlang nodes, one master and one standby.
When I stop the master, the standby takes over (failover); when I start the master again, the standby stops (takeover). Everything works fine as long as heart is not turned on and there is no network split.
However, when I disconnect the master from the network, after 60 seconds or so the standby gives me the error message ** removing (timedout) connection ** and starts up as if the master node had stopped. This makes sense to me: it doesn't know whether the master is alive or not, and epmd can't connect to the master node, so it is removed from the nodes() list. Let's pretend for a moment that this is the desired outcome.
The problem is that when the connection is restored, I have the master and standby running at the same time, and the standby is oblivious to the fact that the master is running. Pinging the standby during the master's init does not solve the issue. I checked nodes() on the standby after doing so; it sees the master node but still continues to run.
My solution for now has been to create a process that monitors all nodes above it in the hierarchy; if any of them are online (can be pinged), the process calls erlang:halt() to terminate the standby node. It works for simple situations, but maybe someone can tell me if there is a better way? I found a similar problem described on the Elixir forum, so it is probably a known Erlang problem without an easy solution. https://elixirforum.com/t/distributed-application-network-split/10007
If you don't want two nodes running in parallel during a network split, I'm guessing an outside monitoring application needs to be used?
The second major issue is heart. If heart is turned on as is, the failover never happens. If heart is running with a sleep before it calls start, it stops the failover node when it calls the application start. So even when it can't start the master, due to it not having access to vital resources for example, it stops the failover node and doesn't bring it back up after it fails to start the master. I don't know whether heart is simply not supposed to be used with a distributed application, or whether there is an option to run some Erlang code to check that the resources are available before attempting to start the node and before stopping the failover node?
The documentation on heart is not great; it is very hard to find any examples of HEART_COMMAND. I found a way to set the HEART_COMMAND to a script from within my application, but there is a limit to how long the argument can be, and as far as I can tell it is not as long as stated in the documentation. The following, for example, sets a 60-second sleep before calling application start again. It doesn't solve any issues, because after 60 seconds it stops the failover node and hangs if the master node can't start.
heart:set_cmd("sleep 60; ./bin/myapp start")
The solution I've ended up with for now is letting heart of the main release start another release, a pre-loader, which does a preliminary check that all resources are available: if they are, it starts the main release/application, and if they are not, it keeps checking forever. This way the main app runs on the failover node without interruption. So the main release has heart turned on, and the pre-loader does not. I ended up using a bash script because I needed to do more work than I could fit in heart:set_cmd/1, so I'm still calling heart:set_cmd(Dir ++ "/priv/myHeartScript.sh " ++ Arg1 ++ " " ++ Arg2), but don't get carried away with the args, as there is a limit on size! I also used environment variables, which I set in vm.args using -env, to pass data to the script, such as the pre-loader path/name. This allowed me to avoid having to edit the script during deployment.
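As a rough illustration of the script approach described above (file names, paths and the environment variable are hypothetical; the real resource checks live in the pre-loader release), the heart script can simply hand control to the pre-loader, whose location is passed in via -env in vm.args:

#!/bin/sh
# priv/myHeartScript.sh - sketch only; heart runs this when the main release dies.
# PRELOADER_BIN is assumed to be set in vm.args, e.g.:
#   -env PRELOADER_BIN /opt/myapp/preloader/bin/preloader
# The pre-loader release (which does NOT use heart) keeps checking that the required
# resources are available and only then starts the main release again.
exec "$PRELOADER_BIN" start

Keeping the logic in the script and in the pre-loader is what keeps the heart:set_cmd/1 argument short enough to stay under the size limit mentioned above.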
If anyone has a better solution PLEASE let me know.
UPDATE
The team at Erlang Solutions was kind enough to shed some light on the subject. Basically, nobody they know uses Erlang's built-in distributed model. Everything revolves around the data, and as long as it is available in redundant databases you can spin up new applications at any time. They recommend using cloud hosts that can spin up new servers when one goes down, or a redundant node design: keep, say, 5 nodes up in parallel, and if a few go down you can restart them manually or by other means.
As for me, I can say that getting heart to start a pre-loader release/app gets the job done, but it gets complicated fast: launching the app now requires provisioning several extra sys.config/vm.args/rebar.config files. I will be looking into their suggestions for the next iteration.
UPDATE
We moved away from the Erlang distributed model and now use RabbitMQ to send heartbeats to all nodes, including the sending node itself. If a node receives heartbeats only from itself and no other node, it is the master; if it receives heartbeats from more than one node, use any attribute, such as the node name, to choose the master. You don't have to use RabbitMQ, but you need to make sure all nodes can reach the same destination and consume from it.
Also, our DevOps team opposes using heart; they prefer standard Linux tools to monitor the application status and restart it after a crash or a server reboot.

Jenkins: jobs in queue are stuck and not triggered to be restarted

For a while now, our Jenkins has been experiencing critical problems: jobs hang and the job scheduler does not trigger builds. After a Jenkins service restart everything is back to normal, but after some period of time all the problems return (this period can be a week, a day, or even less). Any idea where we can start looking? I'd appreciate any help with this issue.
Muatik made a good point in his comment: the recommended approach is to run jobs on agent (slave) nodes. If you already do that, you can look at:
Jenkins master machine CPU, RAM and hard disk usage. Access the machine and/or use a plugin like Java Melody. I have seen missing graphics in build test results and stuck builds caused by running out of disk space. You could also have hit the RAM or CPU limit for the slaves/jobs you are executing, or you may need more heap space.
Look at Jenkins log files, start with severe exceptions. If the files are too big or you see logrotate exceptions, you can change the logging levels, so that fewer exceptions are logged. For more details see my article on this topic. Try to fix exceptions that you see logged.
Go through recently made changes that could be the cause of such behavior, for example new plugins or changes to config files (jenkins.xml).
Look at TCP connections: run netstat -a. Are there suspicious connections (CLOSE_WAIT status)? See the snippet after this list.
Delete old builds that you do not need.
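As a starting point for the checks above, a few shell commands along these lines can help (the JENKINS_HOME and log paths are examples; adjust them to your installation):

# Disk space and per-job usage under the Jenkins home directory.
df -h
du -sh "$JENKINS_HOME"/jobs/* 2>/dev/null | sort -h | tail -20

# Memory and CPU pressure on the master.
free -m
top -b -n 1 | head -20

# Count connections stuck in CLOSE_WAIT; a large or growing number is suspicious.
netstat -an | grep CLOSE_WAIT | wc -l

# Recent severe exceptions in the Jenkins log (path varies by distribution).
grep -E "SEVERE|OutOfMemoryError" /var/log/jenkins/jenkins.log | tail -20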
We had been facing this issue for the last 4 months and tried everything: changing CPU and memory resources, increasing the desired number of nodes in the ASG. But nothing seemed to work.
Solution:
1. Go to Manage Jenkins -> System Configuration -> Maven project configuration.
2. In the "Usage" field, select "Only build jobs with label expressions matching this node".
Doing this resolved it and Jenkins is working like a rocket now :)

Randomise slave load on Mesos

I am trying to solve a problem with Mesos. I have three build servers for Jenkins, and Jenkins schedules jobs on them through Mesos.
For now, Mesos loads one agent (slave) as heavily as possible, but I want it to spread jobs across all agents.
As I see it, it's better to run three jobs on three agents than on one.
Is it possible to randomise job scheduling?
Alternatively, I have this scenario: two large servers and one mini. I want to schedule jobs on the mini by default and, if it does not have enough resources, fall back to the large servers. How can I achieve this? Is it possible to set a priority for agents (slaves) to specify which agent a job should run on first?
The Mesos plugin for Jenkins attempts to build on the most recently built slave (see this method). This means that once it has built on a machine, it will keep scheduling additional jobs on that machine, as long as it still has spare resources, until it is full. Right now it looks like that behavior isn't optional (I have filed it as a feature request).

How many Remote Nodes can Jenkins manage

How many Remote Nodes can Jenkins manage? Are there any limitations/memory issues?
What is more effective:
1) 100 nodes with 1 executor per node?
2) 5 nodes with 20 executors per node?
Tx.
As far as I know, there is no limit on the number of nodes you can have, although at some point your system might feel like saying: enough is enough! You can run into OS limits such as the number of processes per user (we hit this recently, not with Jenkins but with another application: RAM and disk space were fine but the system stopped responding and we started getting "system cannot fork()" errors) or the total number of open files. Some of these limits can still be raised, but that may not always be allowed or feasible.
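If you suspect you are hitting such OS limits, a quick way to inspect them on a Linux node looks roughly like this (values and file locations vary by distribution):

# Limits for the account running Jenkins or the agent.
ulimit -u       # max user processes (the "cannot fork()" symptom)
ulimit -n       # max open file descriptors

# System-wide ceilings.
cat /proc/sys/kernel/pid_max
cat /proc/sys/fs/file-max

# Persistent per-user overrides usually go in /etc/security/limits.conf, e.g.:
#   jenkins  soft  nproc   4096
#   jenkins  hard  nofile  65536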
If resources (in your case, nodes) are not a constraint, which process wouldn't like to run wild? :) In practice you generally won't have the flexibility to opt for the first option. In the second case, where you have 5 nodes with 20 executors each, all you have to make sure is that you don't tie jobs to a particular node unless you have a compelling reason.
Some slaves are faster, while others are slow. Some slaves are closer (network wise) to a master, others are far away. So doing a good build distribution is a challenge. Currently, Jenkins employs the following strategy:
If a project is configured to stick to one computer, that's always honored.
Jenkins tries to build a project on the same computer that it was previously built.
Jenkins tries to move long builds to slaves, because the amount of network interaction between a master and a slave tends to be logarithmic to the duration of a build (IOW, even if project A takes twice as long to build as project B, it won't require double network transfer.) So this strategy reduces the network overhead.
You should also have a look at these links:
https://wiki.jenkins-ci.org/display/JENKINS/Least+Load+Plugin
https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin

Nagios: Make sure x out of y services are running

I'm introducing 24/7 monitoring for our systems. To avoid unnecessary pages in the middle of the night, I want Nagios NOT to page me if only one or two of the service checks fail, as this has almost no impact on users: the other servers run the same service, so fixing the problem can wait until the next day.
But: I want to get paged if too many of the checks fail.
For example: 50 servers run the same service, 2 fail -> I can still sleep.
The service fails on 15 servers -> I get paged because the impact is getting too high.
What I could do is add a lot (!) of notification dependencies that only trigger if numerous hosts are down. The problem: even though I can specify that I get paged when 15 hosts are down, I still have to define exactly which hosts need to be down for that alert to be sent. I would rather specify that a page is sent if ANY 15 hosts are down.
I'd be glad if somebody could help me with that.
Personally I'm using Shinken, which has business rules for exactly that. Shinken is backward-compatible with Nagios, so it's easy to drop your Nagios configuration into Shinken.
It seems there is a similar addon for Nagios, the Nagios Business Process Intelligence addon, but I have no experience with it.
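To give an idea of what this looks like in Shinken, here is a sketch of a business-rule service (host and service names are made up, and the exact bp_rule syntax should be double-checked against the Shinken documentation for your version). Scaled down to three servers for brevity, the aggregate check stays OK as long as at least two of the three underlying checks are OK, so a single failure does not page; for the 50-server example you would raise the threshold accordingly (e.g. 36 of 50 to tolerate up to 14 failures):

define service {
    use                  generic-service
    host_name            monitoring-host      ; virtual host that carries the aggregate check
    service_description  MyService-Cluster
    check_command        bp_rule!2 of: web01,MyService & web02,MyService & web03,MyService
    notification_options c                    ; only notify on CRITICAL
}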

Resources