How to send emails on failures of a dockerized application? - docker

I had a python application, which, with the use of shell tricks, was able to send me emails when new error messages appeared in the log during an execution session, scheduled with cron.
Now I am packing it in docker and was able to reproduce most of its functionality with docker-compose.
But when it comes to emails on failures, I am not sure of what is the best way to implement it.
What are your suggestions? Are there any best practices?
Update:
The app runs a couple of times a day. Previously, everything printed to stderr was duplicated to stdout to preserve chronological order in the main log file. The wrapper script would then accumulate all stderr output from a single session in a separate temporary file. If that file was not empty after the session, its content was sent in a single email from me to myself through SMTP with proper authentication. I was happy to receive these emails and have been able to handle them for the last few months.
Right now I see three possible solutions:
Duplicating everything worth sending to a temporary file right in the app, so that docker logs would persist, then sending it after the session from the entrypoint, provided there is a way to set up all the requirements in the container.
Grepping the docker log from the outside. But that somewhat misses the point of docker.
Relaying reports over the local network to another container, with something like https://hub.docker.com/r/juanluisbaptiste/postfix/, which would then send them in an email.

I was not able to properly set up postfix or use the mail utility inside the container, but sending from python seems to work just fine: https://realpython.com/python-send-email/
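For reference, the wrapper behaviour described above could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the function names, the subject line, and the use of STARTTLS are illustrative choices that are not part of the original setup, and it assumes the app still duplicates its stderr to stdout so the main log stays complete.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def run_session(command):
    """Run one session and return whatever it wrote to stderr."""
    # stdout is left alone, so it still ends up in the container log.
    result = subprocess.run(command, stderr=subprocess.PIPE, text=True)
    return result.stderr


def build_report(stderr_text, sender, recipient):
    """Return an EmailMessage for non-empty stderr, or None if there is nothing to send."""
    if not stderr_text.strip():
        return None
    message = EmailMessage()
    message["Subject"] = "Errors in the last session"  # illustrative subject
    message["From"] = sender
    message["To"] = recipient
    message.set_content(stderr_text)
    return message


def send_report(message, host, port, user, password):
    """Send the report over SMTP with STARTTLS and authentication (assumed setup)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(message)
```

An entrypoint would call run_session with the app's command, pass the result to build_report, and call send_report only when a message comes back. The SMTP host, port, and credentials are placeholders that would have to be supplied to the container, for example through environment variables.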

TL;DR look at NewRelic's free tier.
There is a lot to unpack here. First, it would be helpful to know more about what you had been doing previously with the backticks. More information about which commands you ran might change how I'd respond. Without that, I might make some assumptions that are incorrect or not applicable.
I would consider errors via email a bad idea for a few reasons:
If many errors occur quickly, your inbox will get flooded with emails, and you may also flood the mail server with messages or the network with traffic. It's your mailbox and your network so you can do what you want, but it tends to fail dramatically when it happens, especially in production.
To send an email you need to send those messages through an SMTP server/gateway/relay of some sort, and automated scripts like this often get blocked because they trigger spam detection. When that happens and there is an error, the messages get silently dropped and production issues go unreported. Again, it's your data/errors and you can do that if you want, but I wouldn't consider it best practice as far as reliability goes. I've seen it fail to alert many times for this very reason.
It's been my experience (over 20 years in the field) that any alerting sent via email quickly gets routed to a sub-folder via a mail rule and starts getting ignored because it is noisy. As soon as people start to ignore the messages, serious errors get lost in the inbox along with them and go unnoticed.
De-duplication of error messages isn't built in, and you could get hundreds or thousands of emails in a few seconds, with the one meaningful error buried like a needle in a haystack of emails.
So I'd recommend not using email for those reasons. If you were really set on using email, you could have a script running inside the container that tails the error log and sends the emails off using some SMTP client (that could be installed in the docker container). You'd likely have to set up credentials depending on your mail server, but it could be done. Again, I'd suggest against it.
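If you do go the email route, collapsing repeated lines before sending takes some of the noise out. A minimal sketch in Python, assuming one error per log line (the function name and the "3x message" digest format are made up for illustration):

```python
from collections import Counter


def summarize_errors(lines):
    """Collapse repeated error lines into one digest line each, most frequent first."""
    # Blank lines are skipped; surrounding whitespace is ignored when comparing.
    counts = Counter(line.strip() for line in lines if line.strip())
    return [f"{count}x {text}" for text, count in counts.most_common()]
```

The digest can then be sent as a single message per session instead of one email per error.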
Instead I'd recommend having the logs be sent to something like AWS SQS or AWS CloudWatch, where you can set up rules to alert (via SNS, which supports email alerting) if there are N messages in N minutes. You can configure those thresholds as you see fit, and it can also handle de-duplication.
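To make the "N messages in N minutes" idea concrete, here is a small local sketch of a sliding-window threshold in Python. It is not the CloudWatch or SNS API, just an illustration of the rule itself; the class and parameter names are assumptions.

```python
from collections import deque


class ThresholdAlert:
    """Signal when at least `threshold` events fall within `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Register an event time (in seconds) and return True if the alert should fire."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while self._timestamps[0] <= timestamp - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

With ThresholdAlert(3, 60), for example, the third error recorded within one minute makes record return True, while isolated errors spread further apart never trigger it.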
If you didn't want to use AWS (or some other Cloud provider) you could perhaps use something like Elasticache to store the events with a script to check for recent events/perform de-duplication.
There are plenty of 3rd party companies that will take this data and give you an all-in-one solution for storing the logs and a nice dashboard with custom notifications (via email/SMS/etc.), some of which are free. NewRelic comes to mind and is free assuming you don't need log retention.
I don't work for any of these companies; these are just some tools I've worked with that I'd consider using before I tried to roll a cronjob to send SMTP messages (although I've done that several times in my career when I needed something quick and dirty).

Related

Cloud services to notify on a script not succeeding for a long time

I have a python script that resides on a VPS, reads (each hour) financial news from a public datafeed and emails me when certain keywords of interest appear. That can happen only a few times a week, but such events are very important and must not be missed. On any data fetching or parsing error, I should also be notified via email, and errors of course get recorded into the server's local log file.
But how do I know that my smtp credentials are not blocked by the mail provider, or my VPS is not shut down by my hoster? In that case, I would not be notified and would be unaware of important events (and the failure to fetch/deliver them itself) until I decided to log into VPS manually and take a look at the logs.
Even if I used a backup notification channel, e.g. SMS or Telegram, it still would not protect against a cloud provider service disruption, or my account being blocked due to temporary payment issues, as there would be no running instance of the script to deliver the message on any of the channels. That's why I suspect some 3rd party fault-tolerant service is needed, especially since I'm a freelance coder with lots of similar scripts running on a mixture of VPSes and serverless/Lambdas, possibly for different end clients.
What is the best practice you dear developers are using to be notified when some script has not succeeded for a long enough time? I would like something reliable and ready-to-use, maybe you can recommend some existing monitoring services. At least I was not able to find the ones that solve my particular problem straight away.
To clarify, I don't want to spend time on some manual checking until it's absolutely necessary (in this case, I can tolerate up to 2 hours, and if it does not self-heal within that period, then I need to be notified), and I obviously don't want to get regular annoying reports that the service is doing fine and there simply were no interesting news detected. Plus, I of course want to keep the costs reasonable.

Delay in pop download of mails using fetchmail

We are using fetchmail to retrieve mails from different mailboxes and create tickets in queues configured in the request tracker. Fetching mail from one of the mail-ids is taking a long time. Has anybody had a similar experience? What could be the possible reasons for the delay, and is there any means to debug it?
One approach I tried was to spawn another fetchmail process for the mailbox causing the delay, but request tracker mandates being run by the root user, and a single user cannot run multiple fetchmail processes at a time.
After a little bit of digging into the apache logs, I was able to figure out that the mails being downloaded from that one particular mailbox were quite large, which in turn caused the timeout.
Changing the FcgidIOTimeout configuration from 90 to 180 in the following configuration file helped.
/etc/httpd/conf.d/fcgid.conf.

How to know server is in good health condition or not?

How can I know that my server is working fine, i.e. in good health?
My requirement: users are complaining that they cannot access the web application (website). Sometimes it takes a long time to respond, and sometimes it does not complete the request at all.
I want to know whether my website is in good condition before the users do, and to get an alert message.
I want to know how we can measure whether the server is responsive and whether users are facing any problems. Sometimes my site takes a long time because millions of data records have to be retrieved; in that case I cannot depend on response time.
Please help me with this.
Monitoring response time without any third-party software can be done with scripts like WebInject. WebInject is a Perl script that executes a browsing scenario and tells you whether the result is acceptable or not.
Run a script at a regular interval, say every 10 minutes, that starts a WebInject scenario. If the scenario fails (check the return code of your WebInject call), your script can send you an email or an SMS, sound an alarm, or do whatever is relevant to you.
You can also add some complexity by running a diagnostic script (check the network by pinging relevant hardware, check CPU/RAM usage of your servers, check the number of sessions in your database, ...) and sending the diagnostic by email. You can also save the response times in a database (such as an RRD database) to get a graphical view and be able to do some problem analysis on it.
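The same kind of periodic check can be sketched without WebInject. The Python below is only an illustration of the idea, not a replacement for a real scenario runner: the function names are made up, and the URL, timeout, and acceptable response time are values you would choose per page (a light page makes a better probe than one that loads millions of records).

```python
import http.client
import time
import urllib.request


def is_healthy(status, elapsed_seconds, max_seconds):
    """A check passes when the page answered 200 within the allowed time."""
    return status == 200 and elapsed_seconds <= max_seconds


def probe(url, timeout_seconds):
    """Fetch the URL once and return (status, elapsed_seconds); status is None on failure."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            status = response.status
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error responses (URLError is an OSError).
        status = None
    return status, time.monotonic() - started
```

A cron job could call probe every 10 minutes and send the alert whenever is_healthy returns False.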

Debugging Amazon SQS consumers

I'm working with a PHP frontend which connects to a distributed back end, using Amazon SQS and a variety of message types and message consumers. I'm trying to come up with a way to safely debug those consumers, as we don't want message handlers with new, untested code consuming end-user messages, risking the messages being lost or incorrectly processed.
The actual message queue names are hardcoded as PHP constants in a class, so my first tactic was to create two different sets of queues, one for production and another for debugging, and to externalise the queue name constants into two different files. Depending on whether our debug condition is true or not, I wanted to include one or the other of those constant definitions and assign the constants in the included file to the class constants which currently have the names hardcoded.
This doesn't seem to work, though, because constants seem to act like class variables in PHP, whereas I am trying to assign the values like instance variables. The next tactic was to see if there was anything on Amazon's side that would allow us to debug our message consumers transparently without adding lots of hacks to our code, but I couldn't see anything there that facilitated this. I'd love to know if anyone else has experienced (and ideally, solved) this problem.
SQS doesn't provide a way to inspect the contents of messages in the queue, or for the sender to see if any consumers are failing to process messages.
A common approach to this problem would be to set up two sets of queues as you suggest and have the producer post the same message onto both queues. That way you can debug your code against a stream of production messages without affecting the actual production queue.
I'd recommend moving the decision of which queue to use out of your code and into config, and then deploying different config files to your development boxes vs your production boxes. The risk is always that a development box ends up talking to production systems, so having a single consistent approach to configuring those end-points across all your code is much less risky than doing it on an ad-hoc basis each time you call out to a service.
I'd also recommend putting your production and development queues in different AWS accounts with different access credentials. That way you can give your production account permission to publish to the development account's queue, but you can guarantee that your development systems can't read from the production queue.
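As a rough illustration of keeping the queue choice in config rather than in class constants, here is a minimal sketch. It is written in Python rather than PHP purely for brevity, and the queue names and the APP_ENV variable are hypothetical.

```python
import os

# Hypothetical queue names; real ones would live in each machine's deployed config.
QUEUES_BY_ENVIRONMENT = {
    "production": "orders-production",
    "development": "orders-development",
}


def resolve_queue_name(environ=None):
    """Pick the queue from the APP_ENV variable, refusing unknown or missing values."""
    environ = os.environ if environ is None else environ
    environment = environ.get("APP_ENV")
    if environment not in QUEUES_BY_ENVIRONMENT:
        raise RuntimeError(
            f"APP_ENV must be one of {sorted(QUEUES_BY_ENVIRONMENT)}, got {environment!r}"
        )
    return QUEUES_BY_ENVIRONMENT[environment]
```

Failing loudly on a missing or unknown environment, instead of silently defaulting to one of the queues, is a deliberate choice here: it keeps a misconfigured development box from quietly consuming production messages.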

Best way to run rails with long delays

I'm writing a Rails web service that interacts with various pieces of hardware scattered throughout the country.
When a call is made to the web service, the Rails app then attempts to contact the appropriate piece of hardware, get the needed information, and reply to the web client. The time between the client's call and the reply may be up to 10 seconds, depending upon lots of factors.
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
I basically see two options. Either run JRuby and use multithreading or else run several regular Ruby instances and hope that not many people try to use the service at a time. JRuby seems like the much better solution, but it still doesn't seem to be mainstream and have out of the box support at Heroku and EngineYard. The multiple instance solution seems like a total kludge.
1) Am I right about my two options? Is there a better one I'm missing?
2) Is there an easy deployment option for JRuby?
"I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results)."
From an engineering perspective, this seems like it would be the best alternative.
Why don't you want to do it?
There's a third option: If you host your Rails app with Passenger and enable global queueing, you can do this transparently. I have some actions that take several minutes, with no issues (caveat: some browsers may time out, but that may not be a concern for you).
If you're worried about browser timeout, or you cannot control the deployment environment, you may want to process it in the background:
User requests data
You enter request into a queue
Your web service returns a "ticket" identifier to check the progress
A background process processes the jobs in the queue
The user polls back, referencing the "ticket" id
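The steps above can be sketched in a few lines. This is an in-memory illustration in Python rather than Rails, with made-up names; a real deployment would use a persistent job queue and a separate worker process instead of a thread and a dictionary.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}  # ticket id -> result, filled in by the worker
results_lock = threading.Lock()


def submit(request):
    """Enqueue a request and return a ticket id the client can poll with."""
    ticket = uuid.uuid4().hex
    jobs.put((ticket, request))
    return ticket


def poll(ticket):
    """Return ("done", result) once the job has finished, else ("pending", None)."""
    with results_lock:
        if ticket in results:
            return "done", results[ticket]
    return "pending", None


def worker(handler):
    """Process queued jobs forever, storing each result under its ticket."""
    while True:
        ticket, request = jobs.get()
        outcome = handler(request)  # e.g. the slow call out to the hardware
        with results_lock:
            results[ticket] = outcome
        jobs.task_done()
```

The worker would be started once, for example with threading.Thread(target=worker, args=(handler,), daemon=True).start(), where handler is whatever function performs the slow call.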
As far as hosting in JRuby, I've deployed a couple of small internal applications using the glassfish gem, but I'm not sure how much I would trust it for customer-facing apps. Just make sure you run config.threadsafe! in production.rb. I've heard good things about Trinidad, too.
You can also run the web service call in a delayed background job so that it's not tying up a web server and can even be run on a separate physical box. This is also a much more scalable approach. If you make the web call using AJAX, then you can ping the server every second or two to see if your results are ready. That way your client is not held in limbo while the results are being calculated, and the request does not time out.

Resources