Nagios: Make sure x out of y services are running - monitoring

I'm introducing 24/7 monitoring for our systems. To avoid unnecessary pages in the middle of the night, I want Nagios NOT to page me if only one or two of the service checks fail. The other servers run the same service, so the impact on users is almost zero and fixing the problem can wait until the next day.
But: I want to get paged if too many of the checks fail.
For example: 50 servers run the same service, 2 fail -> I can still sleep.
The service fails on 15 servers -> I get paged because the impact is getting too high.
What I could do is add a lot (!) of notification dependencies that only trigger if numerous hosts are down. The problem: even though I can specify that I get paged if 15 hosts are down, I still have to define exactly which hosts need to be down for this alert to be sent. I would rather specify that a page is sent if ANY 15 hosts are down.
I'd be glad if somebody could help me with that.

Personally, I'm using Shinken, which has business rules for exactly that. Shinken is backward compatible with Nagios, so it's easy to drop your Nagios configuration into Shinken.
There seems to be a similar addon for Nagios, the Nagios Business Process Intelligence Addon, but I have no experience with it.
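Whichever tool evaluates it, the "any x out of y" rule boils down to counting failed instances and comparing the count against thresholds. Here is a minimal Python sketch of that logic, following the usual Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL). The function name and the thresholds `warn_at` and `crit_at` are assumptions made for illustration; this is not how Shinken or the addon implements business rules.

```python
# Nagios plugin exit codes (standard convention).
OK, WARNING, CRITICAL = 0, 1, 2


def aggregate_state(states, warn_at, crit_at):
    """Return (exit_code, message) for a list of per-host service states.

    states: iterable of Nagios state codes, one per host (0 means OK).
    warn_at / crit_at: hypothetical thresholds on the number of failed hosts.
    """
    states = list(states)
    failed = sum(1 for s in states if s != OK)
    message = f"{failed} of {len(states)} instances failing"
    if failed >= crit_at:
        return CRITICAL, "CRITICAL: " + message
    if failed >= warn_at:
        return WARNING, "WARNING: " + message
    return OK, "OK: " + message


if __name__ == "__main__":
    # 50 servers, 2 failing: stays below both thresholds, so nobody is paged.
    code, text = aggregate_state([2, 2] + [0] * 48, warn_at=5, crit_at=15)
    print(text)
```

Only the aggregate check would be wired to the pager; the per-host checks can keep notifying by email.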

Related

Cloud services to notify on a script not succeeding for a long time

I have a Python script that resides on a VPS, reads financial news from a public datafeed every hour, and emails me when certain keywords of interest appear. That may happen only a few times a week, but such events are very important and must not be missed. I should also be notified via email on any data fetching or parsing error, and errors of course get recorded in the server's local log file.
But how do I know that my SMTP credentials have not been blocked by the mail provider, or that my VPS has not been shut down by my hosting provider? In that case, I would not be notified and would be unaware of important events (and of the failure to fetch/deliver them) until I decided to log into the VPS manually and take a look at the logs.
Even if I used a backup notification channel, e.g., SMS or Telegram, it still would not protect against a cloud provider service disruption, or my account being blocked due to temporary payment issues, as there would be no instance of the script left to deliver the message on any of the channels. That's why I suspect some 3rd party fault-tolerant service is needed, especially since I'm a freelance coder with lots of similar scripts running on a mixture of VPSes and serverless/Lambdas, possibly for different end clients.
What is the best practice you dear developers are using to be notified when some script has not succeeded for a long enough time? I would like something reliable and ready-to-use, maybe you can recommend some existing monitoring services. At least I was not able to find the ones that solve my particular problem straight away.
To clarify, I don't want to spend time on some manual checking until it's absolutely necessary (in this case, I can tolerate up to 2 hours, and if it does not self-heal within that period, then I need to be notified), and I obviously don't want to get regular annoying reports that the service is doing fine and there simply were no interesting news detected. Plus, I of course want to keep the costs reasonable.
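The requirement above amounts to a heartbeat check (sometimes called a dead man's switch): the script reports each successful run to something outside the VPS, and that outside party raises the alarm when the reports stop. The following is a minimal sketch of the staleness logic only, assuming the watcher keeps a last-success timestamp per job. The names `MAX_SILENCE`, `is_stale` and `jobs_to_alert` are hypothetical, and where the watcher runs and how it notifies you are left open.

```python
from datetime import datetime, timedelta, timezone

# Assumption taken from the question: up to 2 hours without a success is fine.
MAX_SILENCE = timedelta(hours=2)


def is_stale(last_success, now, max_silence=MAX_SILENCE):
    """True when the last reported success is older than max_silence."""
    return now - last_success > max_silence


def jobs_to_alert(last_success_by_job, now, max_silence=MAX_SILENCE):
    """Return the sorted names of jobs whose heartbeat has gone quiet."""
    return sorted(
        name
        for name, seen in last_success_by_job.items()
        if is_stale(seen, now, max_silence)
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    heartbeats = {
        "news-watcher": now - timedelta(hours=3),   # silent for too long
        "other-script": now - timedelta(minutes=30),
    }
    print(jobs_to_alert(heartbeats, now))
```

Because the watcher alerts on silence rather than on success, it sends nothing while the script is healthy, even when no interesting news was found.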

Can't generate more than ~8000 RPM from Locust

I'm using Locust to load test my web servers, running it in distributed mode. The worker nodes are written in Java and use locust4j, the Java port of Locust. The master node and the worker nodes are containerized, and our orchestrator is Kubernetes, so that is where I spin up more workers when I need them.
The problem that I'm running into is that no matter how many users I add, or worker nodes I add, I can't seem to generate more than ~8000 RPM. This is confirmed by the Locust web frontend, as well as the metrics I'm collecting from my web server.
Does anyone have any ideas why this is happening?
I've attached an image of timings I've collected. The snapshots are from running the load test for 60 seconds, timed with a stopwatch.
The usual culprit in these kinds of situations is that your servers can't handle more than that. In my experience, the behavior you'll see on the client side as the servers get overwhelmed is a slow but steady increase in response times. This is one big reason why Locust includes those in the metrics it shows you.
Based on what I'm seeing in your screenshots, this is most likely the case for you. You have some very low minimum times, but your averages, medians, and 90th percentiles are a lot higher than your minimums, and your maximums are very significantly higher than those. Without seeing your charts I can't know for sure, but that's a big red flag.
For more things to look out for, check out this question in the FAQ (especially see the list of server stats to investigate):
https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps
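The link between a throughput ceiling and rising response times can be made concrete with some rough arithmetic. In a closed-loop test each simulated user sends a request, waits for the response, optionally pauses, and repeats, so throughput is approximately users divided by (response time + wait time). The sketch below is a simplification under that assumption, with illustrative numbers that are not taken from the screenshots; the function names are made up for this example.

```python
def requests_per_minute(users, avg_response_s, wait_s=0.0):
    """Approximate throughput of a closed-loop load test.

    Each user completes one request every (avg_response_s + wait_s) seconds,
    i.e. 60 / (avg_response_s + wait_s) requests per minute.
    """
    return users * 60.0 / (avg_response_s + wait_s)


def response_time_at_cap(users, cap_rpm, wait_s=0.0):
    """Average response time implied when throughput is pinned at cap_rpm."""
    return users * 60.0 / cap_rpm - wait_s


if __name__ == "__main__":
    # If the servers top out at 8000 RPM, doubling the users only doubles
    # the time each request spends waiting.
    for users in (100, 200, 400):
        seconds = response_time_at_cap(users, cap_rpm=8000)
        print(users, "users ->", round(seconds, 2), "s per request")
```

If the measured average response time grows roughly in proportion to the user count while RPM stays flat, that points at the system under test rather than at the load generators.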

How are people solving app pool recycle issues on deployment with large apps?

Currently, after a build/deployment, our app (58 projects, large ASP.NET MVC 3 front end, release configuration) takes ~15-20 seconds to load as it goes through the whole 'recycling the app pool' process.
We do have a web farm if that alters people's answers, but the question really is:
What are people doing in large scale applications where a maintenance window isn't viable (we're a 24/7 very active website) to minimize that initial 'first hit' on the app pool recycle after a deploy?
We've used a number of tools to analyze that startup time, and there doesn't really seem to be any way to bring it down. So what I'm looking for are the techniques people employ to minimize the impact on users of deploying a large application.
By default - if you change 15 files in an ASP.NET application at once (even via FTP) then the app pool is automatically recycled. You can change the number of files but as soon as web.config and bin files are changed then it needs to recycle. So in my opinion the ideal solution for an environment like yours would be as follows:
4 web servers (this is an arbitrary number)
each server has a status.aspx that the load balancer looks at - use TeamCity to take 2 of these servers "off line" (off the load balancer) and wait 20 seconds for the traffic to filter across. A distributed cache will help keep user experience problems to a minimum
Use TeamCity to deploy to those 2 servers - run your automated tests etc. and once you are happy put those back into the farm and take the other 2 offline and deploy to those
This can all be scripted / automated. The only issue is that schema changes that are not backwards compatible may prevent the new version of the site from running in parallel with the old version during the 20 seconds it takes the load balancer to kick back in.
This is good old-fashioned Canary Releasing. There are some patterns at http://continuousdelivery.com/patterns/ to take into consideration. I'd also suggest getting a copy of the Continuous Delivery book; it's like a continuous delivery bible and has got me out of a few situations :)
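The batch-by-batch procedure above can be sketched roughly as follows. This is an in-memory illustration only: the `Farm` class and the `deploy` and `smoke_test` callbacks are hypothetical stand-ins for the load balancer and the TeamCity steps, and the 20 second drain wait is marked with a comment rather than modelled.

```python
class Farm:
    """In-memory stand-in for a load balancer pool (hypothetical)."""

    def __init__(self, servers):
        self.versions = {s: "old" for s in servers}
        self.in_pool = set(servers)
        self.min_in_pool = len(servers)  # fewest servers ever serving

    def take_offline(self, server):
        self.in_pool.discard(server)
        self.min_in_pool = min(self.min_in_pool, len(self.in_pool))

    def put_online(self, server):
        self.in_pool.add(server)


def rolling_deploy(farm, version, batch_size, deploy, smoke_test):
    """Deploy to batch_size servers at a time while the rest keep serving."""
    servers = sorted(farm.versions)
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for s in batch:
            farm.take_offline(s)
        # A real script would wait here for in-flight traffic to drain.
        for s in batch:
            deploy(s, version)
            farm.versions[s] = version
            if not smoke_test(s):  # doubles as the warm-up request
                raise RuntimeError(f"smoke test failed on {s}")
        for s in batch:
            farm.put_online(s)


if __name__ == "__main__":
    farm = Farm(["web1", "web2", "web3", "web4"])
    rolling_deploy(farm, "v2", 2, lambda s, v: None, lambda s: True)
    print(farm.versions, "minimum in pool:", farm.min_in_pool)
```

Raising on a failed smoke test leaves the broken batch out of the pool, so the remaining servers keep serving the old version.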
At the very least, you could run a tinyget script against the application once deployment completes to "warm up" the application. However, if a customer hits your site before the script can run, they will still face a delay. What post-deployment steps do you currently have in place?
In a farm environment you could stage deployments too: take one server out of the load balancer, update it, bring it back online, then take the next one out, complete the deployment there, and reintroduce it into the farm. How is your SQL Server set up? Is it clustered?
copy and paste from my post here
We operate a Blue/Green deployment strategy on a 4 tier architecture which has a web site over 4 servers at the top tier. Due to the complexity the architecture introduced for deployments, we needed a way to deploy without disturbing any traffic to the "live" site. Following Fowler's advice, but not quite in the same way, we came up with a solution that means we have 2 sites on each server (a blue and a green, or in our case site A and site B). The live site has the appropriate host header, and once we have deployed and tested to the non-live site, we then flip the headers of the 2 sites so that what was once live is now the non-live site, and vice-versa. The effect is, a robust deployment that can be done in business hours and with the highest level of confidence.
This of course complicates your configuration and deployment slightly, but it's worth the effort. I guess it kind of goes without saying that you want to script both the deployment, and the host header swapping.
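The host header flip itself is a small operation once the non-live site has been deployed and tested. As a rough illustration, assuming each server's bindings are represented as a mapping from site name to host header (the site names and host names below are made up, and a real script would apply the result through the web server's own configuration tooling):

```python
def swap_live(bindings):
    """Swap the host headers of the two sites on one server.

    bindings maps a site name to the host header it currently answers to.
    Returns a new mapping; the input is left untouched.
    """
    if len(bindings) != 2:
        raise ValueError("expected exactly two sites (A and B)")
    (site_a, header_a), (site_b, header_b) = bindings.items()
    return {site_a: header_b, site_b: header_a}


if __name__ == "__main__":
    before = {"siteA": "www.example.com", "siteB": "staging.example.com"}
    print(swap_live(before))
```

Applying the swap a second time restores the original bindings, which makes rolling back the same operation as going live.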
Firstly, unless you're running Google or something bigger, does a 15-20s load time at 3am for a handful of users really impact that much? I'd say the effort invested in eliminating the occasional lag would far outweigh the 15-20s inconvenience of a couple of users.
I consider it a necessary evil of using ASP.NET unfortunately. Using a pre-compiled site (.DLLs instead of the code-behind files) will lessen the time but not necessarily eliminate it.
The best thing you can do is use something like a status notification bar to warn users they may experience some "issues" during "essential maintenance".
But even then, I'd say in terms of user experience it'd be better to keep quiet and have a handful of people blame their "slow internet" when your site takes 20s to load on one occasion, than announce to all and sundry that it will be slow.
You can also try this approach: http://weblogs.asp.net/scottgu/archive/2009/09/15/auto-start-asp-net-applications-vs-2010-and-net-4-0-series.aspx
Without knowing anything about your site, my first thought is that you might be able to break it down into smaller sites so that they start faster individually.
Second, with your web farm, I assume you have some sort of load balancing device in front of it, from which you can pull machines out of the pool while they are being deployed to. Don't put them back in the pool until after you have sent a request against the site to get it started up. You should be able to script this so that you are pretty much clicking a button that takes a machine out, deploys to it, and sends a request once it's back up and happy.
You can consider using aspnet_compiler.exe to precompile your application, because I think the delay after deployment is caused by the compilation phase rather than "whole recycling the app pool".

Best way to run rails with long delays

I'm writing a Rails web service that interacts with various pieces of hardware scattered throughout the country.
When a call is made to the web service, the Rails app then attempts to contact the appropriate piece of hardware, get the needed information, and reply to the web client. The time between the client's call and the reply may be up to 10 seconds, depending upon lots of factors.
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
I basically see two options: either run JRuby and use multithreading, or run several regular Ruby instances and hope that not many people try to use the service at the same time. JRuby seems like the much better solution, but it still doesn't seem to be mainstream, and it lacks out-of-the-box support at Heroku and EngineYard. The multiple instance solution seems like a total kludge.
1) Am I right about my two options? Is there a better one I'm missing?
2) Is there an easy deployment option for JRuby?
"I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results)."
From an engineering perspective, this seems like it would be the best alternative.
Why don't you want to do it?
There's a third option: If you host your Rails app with Passenger and enable global queueing, you can do this transparently. I have some actions that take several minutes, with no issues (caveat: some browsers may time out, but that may not be a concern for you).
If you're worried about browser timeout, or you cannot control the deployment environment, you may want to process it in the background:
User requests data
You enter request into a queue
Your web service returns a "ticket" identifier to check the progress
A background process processes the jobs in the queue
The user polls back, referencing the "ticket" id
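The ticket flow listed above can be sketched as follows. This is a language-neutral illustration written in Python rather than Rails code, using an in-process queue and a single worker thread. In a real deployment the queue and the results store would live outside the web process, and every name here (`submit`, `poll`, `start_worker`) is hypothetical.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}  # ticket id -> result, filled in by the worker
results_lock = threading.Lock()


def submit(request):
    """Enqueue a request and return a ticket id immediately."""
    ticket = uuid.uuid4().hex
    jobs.put((ticket, request))
    return ticket


def poll(ticket):
    """Return the result, or None while the job is still pending."""
    with results_lock:
        return results.get(ticket)


def worker(handle):
    """Background loop: take jobs off the queue and store their results."""
    while True:
        ticket, request = jobs.get()
        outcome = handle(request)  # the slow call to the hardware goes here
        with results_lock:
            results[ticket] = outcome
        jobs.task_done()


def start_worker(handle):
    """Run the worker loop in a daemon thread and return the thread."""
    thread = threading.Thread(target=worker, args=(handle,), daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    start_worker(lambda request: request.upper())
    ticket = submit("read sensor 7")
    jobs.join()  # a real client would poll instead of blocking
    print(poll(ticket))
```

The web request only pays for `submit`, so a slow device ties up the worker rather than a web server process.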
As far as hosting in JRuby, I've deployed a couple of small internal applications using the glassfish gem, but I'm not sure how much I would trust it for customer-facing apps. Just make sure you run config.threadsafe! in production.rb. I've heard good things about Trinidad, too.
You can also run the web service call in a delayed background job so that it's not hogging a web server, and it can even run on a separate physical box. This is also a much more scalable approach. If you make the web call using AJAX, you can ping the server every second or two to see if your results are ready. That way your client is not held in limbo while the results are being calculated, and the request does not time out.

Keeping applications and infrastructure connected

I work in an IT department that is divided into two groups. One group develops and manages applications, the other manages the company's infrastructure and servers. One of the problems we face is a breakdown in communication. I work for the application group, and one of the problems I have is not being notified when a server is taken down by infrastructure, or when a database is being refreshed.
Does anyone have suggestions on how to improve communications between the two groups, or any ideas on how to keep a lightweight log across multiple systems (both Linux and Windows)? Ideally it would be nice if we could have our boxes just tweet their statuses or something.
Thanks for the help,
Ben
One thing you could do to communicate server status is to have your infrastructure group set up a network monitoring system like Nagios. This will give everyone in your application group the ability to get a snapshot view of the status of every server in the system. Having this kind of status is invaluable when you are doing development.
Nagios gives you network monitoring, but also allows you to show scheduled down time for a particular server in the system.
Another thing your group could do to foster communication with the Infrastructure is to have your build system report which servers it is currently using for building and testing your products.
Also, setting up a regular meeting between stakeholders of both groups is probably a good idea. If you are all talking to each other, even for 15 minutes a week, you'll probably see incidents like the one you described above go down quite a bit.
I think this is a bigger issue of change control.
You should have hardware and software change control and an approval process.
Ultimately, infrastructure serves you - the purpose for IT infrastructure is to run applications.
At my current company, a large financial data firm, servers are not TOUCHED without proper authorization through the client and application groups. It seems like a huge pain, but every single server is there for a reason: to meet a specific business goal and run a specific application. There is simply no excuse for the infrastructure group to be changing things or upsetting servers of their own volition.
Response to critical hardware failure might be an exception.
Needed software and OS updates are handled through scheduled maintenance windows and an approved change process.
I like the Nagios idea as well. If you want to set up something that's more of a communication tool, I would recommend a content management system like Drupal.
We use Drupal internally to communicate between teams. When one team takes a server down, they would add an event into Drupal. The rest of us would either get it as an email, an RSS item or just by refreshing the page.
Implement a change control process where changes are submitted, approved and scheduled for BOTH groups. This lets everyone know what is going on. This process can be as light or heavy-weight as you want.

Resources