How to get downtime on Icinga? - icinga

I'm working on a project and we are using Icinga to monitor some services. However, we need to get a downtime from some services, but I can't find it.
For an example:
My service is UP, running for 5 mins.
Suddenly, service is down.
After 10 minutes, service is up again.
Okay, how can I get the 10 minutes of down service? I mean,I know that I can get two times (last time it was up, and when it came back up), but can I get this information somewhere else?
Thanks.

You can look at the service history, either on the GUI (icingaweb2 or another one), or with a request against:
The core API ;
Livestatus.

Related

How can I see how long my Cloud Run deployed revision took to spin up?

I deployed a Vue.js and a Kotlin server app. Cloud Run does promise to put a service to sleep if no request to it arise for a specific time. I did not opened my app for a day now. As I opened it - it was available almost immediatly. Since I know how long it takes to spin up when started locally I kinda don't trust the promise that Cloud Run really had put the app to sleep and span it up so crazy fast.
I'd love to know a way how I can really see how long it took for the spinup - also for startup improvement for the backend service.
After having the service inactive for some time, record the time when you request the service URL and request it.
Then go to the logs for the Cloud Run service, and use this filter to see the logs for the service:
resource.type="cloud_run_revision"
resource.labels.service_name="$SERVICE_NAME"
Look for the log entry with the normal app output after your request, check its time and compare it with the recorded time.
You can't know when the instance will be evicted or if it is kept in memory. It could happen quickly, or take hours or days before eviction. it's "serverless".
About the starting time, when I test, I deploy a new revision and I have a try on it. In the logging service, the first log entry of the new revision provides me the cold start duration. (Usually 300+ ms, compare to usual 20 - 50 ms with warm start).
The billing instance time is the sum of all the containers running times. A container is considered as "running" when it process request(s).

How can I get Jelastic to sleep?

Yesterday I got a trial account on webhosting.net's Jelastic v2.2.2 and configured an environment with a minimum of 0 cloudlets (max 8, i.e., all dynamic, no reserved). Then I deployed a Grails war which was using 3 cloudlets after it started up (around 350 MB). It worked great, and I was very impressed.
However, I did not access my app overnight, and the billing history shows it kept using 3 dynamic cloudlets every hour, even with 0 requests (i.e., 0 MB paid traffic) for 14 hours. Is there some way I can get my Jelastic environment to sleep (i.e., hibernation) after some period with no requests (e.g., after an hour or two)? Then, when it gets a request, I'd like it to automatically wake up (i.e., allocate some cloudlets and restore memory from disk). I see how to stop and restart it manually, but I would like it to work automatically, for any requester.
edit: I found the following documentation, but does it not work for Tomcat/Grails?
Hibernation
Jelastic’s hibernation feature delivers even better utilization of cluster resources. Optimal use of resources is achieved by suspending non-active containers and returning released resources back to the cluster.
Because they are in sleep mode, hibernated containers do not consume resources (only disk space). As a result you save money while your containers are in hibernate mode. If applications are needed again the platform returns them to a running state again in just a few seconds.
It takes a little time to awaken your environment from sleep, so it's not suitable to work how you describe for production use - you would effectively lose visitors because it would seem like your service is offline due to the delays for that first access.
For that reason the 'sleep' function is only active for trial accounts, and the inactivity time before sleep is set by the hosting provider (so you should contact them directly for help on that point).
Of course you should also remember that accesses from search engine spiders etc. may keep your environment awake.

Nagios: Make sure x out of y services are running

I'm introducing 24/7 monitoring for our systems. To avoid unnecessary pages in the middle of the night I want Nagios to NOT page me, if only one or two of the service checks fail, as this won't have any impact on users: The other servers run the same service and the impact on users is almost zero, so fixing the problem has time until the next day.
But: I want to get paged if too many of the checks fail.
For example: 50 servers run the same service, 2 fail -> I can still sleep.
The service fails on 15 servers -> I get paged because the impact is getting to high.
What I could do is add a lot (!) of notification dependencies that only trigger if numerous hosts are down. The problem: Even though I can specify to get paged if 15 hosts are down, I still have to define exactly which hosts need to be down for this alert to be sent. I rather want to specify that if ANY 15 hosts are down a page is made.
I'd be glad if somebody could help me with that.
Personally I'm using Shinken which has business rules just for that. Shinken is backward compatible with Nagios, so it's easy to drop your nagios configuration into shinken.
It seems there is a similar addon for nagios Nagios Business Process Intelligence Addon, but I'm not having experience with this addon.

My app fails to connect to the server some times

I've been helplessly observing this problem for a couple months now, and have decided this is my best shot.
I'm not sure what the cause of the problem is, but I can list some of the things I'm doing. I have an iOS app that uses AFNetworking to connect to a remote server hosted by Google App Engine using HTTP POST requests.
Now, everything works great, but sometimes, very very sporadically and random, I get failed requests. The activity indicator spins and spins for about a minute, and I get no feedback at the end - just a failed request. I check my server logs, and I don't see any errors. After the failed request, I try again, and it works fine. It works fine for the whole day. And then another time randomly the issue repeats itself, sometimes spinning for 10 seconds with a fail, or a minute.
Generally, what can possibly be the cause of this? Is it normal to have some failed connections randomly? Is that something on my part?
But the weird thing is, is that while on my iPhone the app is running, and the indicator is spinning, and it's trying to connect, I try connecting on the iOS simulator, and the connection works just fine. I try again on the iPhone, and it doesn't work.
If I close the app completely and start again, then it works again. So it sounds like it may be a software issue rather than connection issue, but then again I have no evidence or data what so ever.
I know it's vague, but I'm hoping someone may have had a similar problem. Anything helps.
There is a known issue with instance start on GAE for Java. You can star http://code.google.com/p/googleappengine/issues/detail?id=7706 issue.
The same problem was reported for Python but it is not such a big problem.
I think you should check logging level you use on appengine and monitor all your calls. Instance start usually takes more time, so you will be able to see how much time do you use on start and is it really a timeout problem.
For Java version you could try to change log level to debug:
.level = DEBUG
in your logging.properties file. It will give you more information about instance start process.

Web App Performance Problem

I have a website that is hanging every 5 or 10 requests. When it works, it works fast, but if you leave the browser sit for a couple minutes and then click a link, it just hangs without responding. The user has to push refresh a few times in the browser and then it runs fast again.
I'm running .NET 3.5, ASP.NET MVC 1.0 on IIS 7.0 (Windows Server 2008). The web app connects to a SQLServer 2005 DB that is running locally on the same instance. The DB has about 300 Megs of RAM and the rest is free for web requests I presume.
It's hosted on GoGrid's cloud servers, and this instance has 1GB of RAM and 1 Core. I realize that's not much, but currently I'm the only one using the site, and I still receive these hangs.
I know it's a difficult thing to troubleshoot, but I was hoping that someone could point me in the right direction as to possible IIS configuration problems, or what the "rough" average hardware requirements would be using these technologies per 1000 users, etc. Maybe for a webserver the minimum I should have is 2 cores so that if it's busy you still get a response. Or maybe the slashdot people are right and I'm an idiot for using Windows period, lol. In my experience though, it's usually MY algorithm/configuration error and not the underlying technology's fault.
Any insights are appreciated.
What diagnistics are available to you? Can you tell what happens when the user first hits the button? Does your application see that request, and then take ages to process it, or is there a delay and then your app gets going and works as quickly as ever? Or does that first request just get lost completely?
My guess is that there's some kind of paging going on, I beleive that Windows tends to have a habit of putting non-recently used apps out of the way and then paging them back in. Is that happening to your app, or the DB, or both?
As an experiment - what happens if you have a sneekly little "howAreYou" page in your app. Does the tiniest possible amount of work, such as getting a use count from the db and displaying it. Have a little monitor client hit that page every minute or so. Measure Performance over time. Spikes? Consistency? Does the very presence of activity maintain your applicaition's presence and prevent paging?
Another idea: do you rely on any caching? Do you have any kind of aging on that cache?
Your application pool may be shutting down because of inactivity. There is an Idle Time-out setting per pool, in minutes (it's under the pool's Advanced Settings - Process Model). It will take some time for the application to start again once it shuts down.
Of course, it might just be the virtualization like others suggested, but this is worth a shot.
Is the site getting significant traffic? If so I'd look for poorly-optimized queries or queries that are being looped.
Your configuration sounds fine assuming your overall traffic is relatively low.
To many data base connections without being release?
Connecting some service/component that is causing timeout?
Bad resource release?
Network traffic?
Looping queries or in code logic?

Resources