Having a C# application communicate with Nagios - monitoring

We are using Nagios to monitor our network with great results. There is now a new requirement we are struggling with:
We want to notify Nagios of non-fatal but critical application errors. The application does not stop running, but there is some sort of issue that needs looking into.
Once the issue has been looked into, we need some way to "unflag" the issue in Nagios.
We tried using the syslog, but the biggest problem was that once an error was logged, the service was put into an error state with no way to recover. Also, while applications report critical errors to the syslog, most of the time they do not report an "All clear" message.

I've done this using passive checks: http://nagios.sourceforge.net/docs/3_0/passivechecks.html
Basically, your application is just going to feed the Nagios core some data through its external command file. Nagios will eventually read the data and update the alerts, execute event handlers, etc.
Exactly how you set this up will be unique for your case, but if you need any other help just let me know. :)
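To make the idea concrete, here is a minimal sketch in Python (the same line can be written from C# with any file API). It formats a PROCESS_SERVICE_CHECK_RESULT external command and writes it to the command file. The host name, service description and command file path are assumptions for illustration; the service has to be defined in Nagios with passive checks enabled, and the real path comes from the command_file setting in your nagios.cfg. Submitting return code 0 (OK) for the same service later is what "unflags" the issue.

```python
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_passive_result(host, service, code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # A newline ends the command, so flatten multi-line output.
    text = output.replace("\n", " ")
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{text}\n"


def submit_passive_result(command_file, host, service, code, output):
    """Write the result to the Nagios external command file (normally a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(format_passive_result(host, service, code, output))


if __name__ == "__main__":
    # Hypothetical names and path, shown only to illustrate the call.
    print(format_passive_result("app01", "MyApp Errors", CRITICAL, "Order queue stalled"), end="")
```

Raising the alert is a call with CRITICAL; clearing it after someone has looked into the problem is the same call with OK and a message such as "Acknowledged and fixed".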

Related

How to send emails on failures of a dockerized application?

I had a Python application which, with the use of shell tricks, was able to send me emails when new error messages appeared in the log during an execution session scheduled with cron.
Now I am packaging it in Docker and was able to reproduce most of its functionality with docker-compose.
But when it comes to emails on failures, I am not sure what the best way to implement it is.
What are your suggestions? Are there any best practices?
Update:
The app runs a couple of times a day. Previously, everything printed to stderr was duplicated to stdout to preserve chronological order in the main log file. Then the wrapper script would accumulate all stderr from a single session in another, temporary file. If that file was not empty after the session, its content was sent in a single email from me to myself through SMTP with proper authentication. I was happy to receive these emails and able to handle them for the last few months.
Right now I see three possible solutions:
1. Duplicating everything worth sending to a temporary file right in the app, so that the docker logs would persist, then sending it after the session from the entrypoint, provided there is a way to set up all the requirements in the container.
2. Grepping the docker log from the outside. But that's somewhat missing the point of docker.
3. Relaying reports via the local network to another container, with something like https://hub.docker.com/r/juanluisbaptiste/postfix/, which will then send them in an email.
I was not able to properly set up postfix or use the mail utility inside the container, but Python seems to work just fine: https://realpython.com/python-send-email/
TL;DR look at NewRelic's free tier.
There is a lot to unpack here. First, it would be helpful to know more about what you had been doing previously with the backticks. More information about which commands you ran might change how I'd respond to this. Without that, I might make some assumptions that are incorrect or not applicable.
I would consider errors via email a bad idea for a few reasons:
If many errors occur quickly, your inbox will get flooded with emails, and you may also flood a mail server with messages or the network with traffic. It's your mailbox and your network so you can do what you want, but it tends to fail dramatically when it happens, especially in production.
To send an email you need to send those messages through a SMTP server/gateway/relay of some sort and often times automated scripts like this get blocked as they trigger spam detections. When that happens and there is an error, the messages get silently dropped and production issues go unreported. Again, it's your data/errors and you can do that if you want, but I wouldn't consider it best practices as far as reliability goes. I've seen it fail to alert many times in my past experience for this very reason.
It's been my experience (over 20 years in the field) that any alerting sent via email quickly gets routed to a sub-folder via a mail rule and starts getting ignored because it is noisy. As soon as people start to ignore the messages, serious errors get lost in the inbox along with them and go unnoticed.
De-duplication of error messages isn't built in, and you could get hundreds or thousands of emails in a few seconds, leaving the one meaningful error as hard to find as a needle in a haystack of emails.
So I'd recommend not using email for those reasons. If you were really set on using email, you could have a script running inside the container that tails the error log and sends the emails off using some SMTP client (that could be installed in the docker container). You'd likely have to set up credentials depending on your mail server, but it could be done. Again, I'd suggest against it.
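If you do go the email route anyway, a rough sketch of the "one email per session, only if stderr was non-empty" behaviour could look like the following. It uses only the Python standard library, so nothing extra has to be installed in the container. The addresses, subject and SMTP host/port are placeholders, and STARTTLS plus login is an assumption about what your mail server expects.

```python
import smtplib
from email.message import EmailMessage


def build_error_report(stderr_text, sender, recipient, subject="App session errors"):
    """Return one EmailMessage for the whole session, or None if nothing was logged."""
    if not stderr_text.strip():
        return None
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(stderr_text)
    return message


def send_report(message, host, port, username, password):
    """Send the report over SMTP with STARTTLS and authentication."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(message)


if __name__ == "__main__":
    # Placeholder addresses; a real entrypoint would pass the captured stderr of the session.
    report = build_error_report("Traceback: boom", "me@example.com", "me@example.com")
    print(report["Subject"] if report else "nothing to send")
```

Credentials would normally be passed into the container as environment variables or secrets rather than baked into the image.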
Instead I'd recommend having the logs be sent to something like AWS SQS or AWS CloudWatch, where you can set up rules to alert (via SNS, which supports email alerting) if there are N messages in M minutes. You can configure those thresholds as you see fit, and it can also handle de-duplication.
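To show what such a rule amounts to, here is a small self-contained sketch of "alert when a threshold of distinct errors is reached inside a time window". This is not the CloudWatch or SNS API, just the logic a hosted service would apply for you; the class and parameter names are made up for the example.

```python
from collections import deque


class ErrorRateAlerter:
    """Signal an alert when `threshold` distinct errors occur within `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, message) pairs, oldest first

    def record(self, message, now):
        """Record one error; return True if the alert rule fires."""
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()
        # De-duplicate: an identical message already in the window is not counted again.
        if any(seen == message for _, seen in self._events):
            return False
        self._events.append((now, message))
        return len(self._events) >= self.threshold
```

With a threshold of 2 and a 60 second window, a thousand copies of the same error never fire the rule, while two different errors within a minute do.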
If you didn't want to use AWS (or some other Cloud provider) you could perhaps use something like Elasticache to store the events with a script to check for recent events/perform de-duplication.
There are plenty of 3rd party companies that will take this data and give you an all-in-one solution for storing the logs and a nice dashboard with custom notifications (via email/SMS/etc.), some of which are free. NewRelic comes to mind and is free assuming you don't need log retention.
I don't work for any of these companies; these are just some tools I've worked with that I'd consider using before I tried to roll a cron job to send SMTP messages (although I've done that several times in my career when I needed something quick and dirty).

Google Cloud Monitoring: Restart a process automatically when detecting error

I'm using Google Cloud Monitoring (Stackdriver) for endpoint checks. It's very useful, but I need to restart a process manually after receiving an alert. Does anybody have a good idea for automating this?
update1
monit looks nice.
https://mmonit.com/monit/
http://supervisord.org/ is an option preferred by many developers, but to my knowledge there is no built-in solution for this.
It would be a great feature for stackdriver though. Whenever it detects a failure it can run a failsafe script on the machine as a privileged user.
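Until something like that exists, the behaviour monit or supervisord gives you can be approximated with a small watchdog run from cron or a loop. The sketch below is only an illustration of the check-then-restart idea, not a Stackdriver feature; the URL and the restart command are hypothetical and would need to match your own service.

```python
import subprocess
import urllib.request


def endpoint_is_healthy(url, timeout=5):
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False


def restart_if_unhealthy(is_healthy, restart_command, run=subprocess.run):
    """Run restart_command when the health check fails; return True if a restart was attempted."""
    if is_healthy():
        return False
    run(restart_command, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical endpoint and service name.
    restart_if_unhealthy(
        lambda: endpoint_is_healthy("http://localhost:8080/health"),
        ["systemctl", "restart", "myapp"],
    )
```

The health check and the command runner are passed in as parameters so the restart decision can be tested without touching a real service.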

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on in the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
The Dataflow service starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDos, reading and writing your data).
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints out the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
If the VMs start up, the next thing to do is to inspect the worker logs to look for errors indicating problems in launching the agent.
The easiest way to access the logs is using the UI. Go to the Google Cloud Console and then select the Dataflow option in the left hand frame. You should see a list of your jobs. You can click on the job in question. This should show you a graph of your job. On the right side you should see a button "view logs". Please click that. You should then see a UI for navigating the logs and you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist then the worker didn't make enough progress to actually upload the file; or else there might have been a permission problem uploading the log file.
In that case the best thing to do is to try to SSH into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you log in to the VM you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
Please take a look and let us know if you find anything.
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when you don't enable the proper Google APIs in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.

How do I prevent another process from terminating my service?

I have a Windows service that performs various "jobs" for my application (send emails, create backups, check for my application updates, provide some services...)
Recently some customers reported conflicts between some Internet banking sites and my application.
In searching for solutions, I found reports about a plugin (ActiveX) installed by the Internet banking Web site.
This ActiveX installs a bizarre service (GbPlugin, from GAS Tecnologia) that kills suspicious applications based on some idiotic heuristic, and my service is a victim!
Now I'm trying to "immunize" my service.
Are there some ways to restrict the termination of my service to protect it?
I cannot rely on the "auto restart" option in the service properties, because my service must not be killed in the first place!
Both services are running as LOCALSYSTEM.
Most likely that service runs as LOCALSYSTEM and so can kill anything it likes. So it's extremely unlikely that you can defend against it.
Indeed, a quick web search throws up some hits that indicate that the service does indeed run as LOCALSYSTEM.
Your only tenable solution is going to involve the other software. Either compel your users to remove it, or work with its developers to find a way to white-list your program.
Assuming GbPlugin is going through normal SCM procedures to stop services and not just brute-force terminating them, then you have a couple of choices to prevent your service from stopping:
set your service's AllowStop property to False.
in the OnStop event, set the Stopped parameter to False.
Either approach will also prevent you from stopping your own service under normal conditions. To work around that, you could write a separate app that uses the Win32 API ControlService() function to send a custom command to your service. Inside your service, have it override the virtual DoCustomControl() method to look for that command. Have it either reset the AllowStop property back to True, or set a flag somewhere that the OnStop event can look at, then call Controller(SERVICE_CONTROL_STOP) to initiate a normal stop.
Needless to say, this is a bit overkill. If possible, a better option is to simply contact GAS Tecnologia and ask why your service is being flagged by GbPlugin's heuristics and then change that condition in your service, or else ask them to fix GbPlugin to ignore your service.

How to stop a Windows Service programmatically?

I'm writing a simple Windows Service that sends out emails to all employees every month. My question is, how does it stop itself when it's done? I'm a newbie in this field so please help me out. Really appreciated.
It will be deployed on the server to be run monthly. I did not start this thing and the code was given to me like that. It is written in VB.NET and I've now been asked to change a few things in it. I noticed that there is only 'Sub OnStart' and wondered when the service would stop. After the main sub is done, what is the status of this service? Is it stopped or just hanging in there? Sorry, as I said, I am really new to this....
If you have a task that recurs monthly, you may be better off writing a console app and then using Windows Task Scheduler to set it to run monthly. A service should be used for processes that need to run for a long time or constantly, with or without a user logged on.
As every other answer has noted, it sounds like this should be an executable or script that you run as a scheduled task.
However, if you are obligated for some reason to run as a Windows Service and you're working in .NET, you just have to call the Stop() method inherited from ServiceBase once your service completes its work. From the MSDN documentation for the method:
The Stop method sets the service state to indicate a stop is pending and calls the OnStop method. After the application is stopped, the service state is set to stopped. If the application is a hosted service, the application domain is unloaded.
There's one important caveat here: the user account under which the service is running must have permission to stop services (which is a topic for ServerFault).
Once a service's OnStart method completes, it will continue running (doing nothing) until something tells it to stop in one of the following ways:
Programmatically, by calling Stop within the service itself or from an external process using the method Colin Gravill describes in his answer.
Via the command-line.
Through the Windows Computer Management console's "Services" panel.
If this is a Win32 service (i.e. written in C or C++), then you simply call SetServiceStatus(SERVICE_STOPPED) and return from ServiceMain.
On the other hand, if you're just sending emails once a month, why are you using a service at all? Use the Windows Task Scheduler and run a normal application or script.
net stop [service_name] ...on the command line will do it too.
But, I agree with everyone else; it seems that Windows Task Scheduler will meet your needs better.
It might be better to write this as a scheduled task, it would certainly be easier to develop initially. Then it would naturally terminate and wouldn't be consuming resources for the rest of the month.
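To answer the original question, you can get a list of the services on the local machine in C#:
ServiceController[] services = System.ServiceProcess.ServiceController.GetServices();
Then look for the one you want and stop it if it is not already stopped. The Status property is read-only, so call Stop() rather than trying to assign to it:
if (locatedService.Status != ServiceControllerStatus.Stopped) locatedService.Stop();
Full example on MSDN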
Is there a reason it has to be a Windows service? If not, then follow @Macros's solution. However, if it does, then why stop the service? If you stop it, then it'll just have to be restarted when the emails need to be sent. Based on your description, it doesn't sound like it would require a lot of resources, so I'd suggest just installing it and letting it run, firing up once a month to send the emails.
Here's what I did in a similar situation.
A Windows service runs 24/7 and processes work units. It gets the work units through a database view.
table Message
ProcessingStartTime
CompletionDTE
...
The database view only pulls records that are marked not-complete and have a ProcessingStartTime in the past. After the service confirms the transaction, it executes a stored procedure that updates the database record so the view no longer returns it. For this system, end users upload Excel files to an ASP.NET web form that imports them into the database.
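A rough sketch of that pattern, using Python and an in-memory SQLite database as a stand-in for the real server, is below. The Message table, ProcessingStartTime and CompletionDTE come from the description above; the Id and Payload columns and the PendingMessage view name are invented for the example, and a plain UPDATE stands in for the stored procedure because SQLite has none.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Message (
    Id INTEGER PRIMARY KEY,
    Payload TEXT,
    ProcessingStartTime TEXT NOT NULL,  -- 'YYYY-MM-DD HH:MM:SS' in UTC
    CompletionDTE TEXT                  -- NULL until the work unit is done
);
-- The view exposes only work that is due and not yet completed.
CREATE VIEW PendingMessage AS
    SELECT Id, Payload FROM Message
    WHERE CompletionDTE IS NULL
      AND ProcessingStartTime <= datetime('now');
"""


def process_pending(conn, handle):
    """Handle every pending work unit once and mark it complete; return the count."""
    rows = conn.execute("SELECT Id, Payload FROM PendingMessage ORDER BY Id").fetchall()
    for message_id, payload in rows:
        handle(payload)
        # Stand-in for the stored procedure that marks the record as completed.
        conn.execute(
            "UPDATE Message SET CompletionDTE = datetime('now') WHERE Id = ?",
            (message_id,),
        )
    conn.commit()
    return len(rows)
```

The service would call process_pending on a timer. For the monthly email job, scheduling a run means inserting a row whose ProcessingStartTime is the next send date.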
