I'm trying to use Serilog (the Serilog.Sinks.Network sink) to write to a Fluent Bit instance (just running locally).
I'm finding that if I start Fluent Bit and then run my app, the logs get processed as expected. However, if I then run my app again without restarting Fluent Bit, the logs no longer appear. Restarting Fluent Bit makes it work again, but again only for that first run. I'm not seeing any errors, even with Fluent Bit turned up to trace logging.
The code is here in case it's useful:
https://github.com/dracan/FluentBitProblem
I'm seeing similar issues when trying other Serilog sinks too, e.g. Serilog.Sinks.Fluentd. It feels like there's something I'm missing about the way Fluent Bit connections work.
A bit more information
Digging a bit further with Wireshark, I can see that the TCP data is being sent for both the working and the failing attempts...
If I use a bogus port number, Wireshark flags it...
And if I log in a while loop and then kill the Fluent Bit process, Wireshark also picks this up as an error...
Which suggests to me that Fluent Bit is receiving the logs but ignoring them for some reason.
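One way to narrow this down is to take the sink out of the picture and hand-feed the input from a tiny script. A minimal sketch in Python — the port 5170 and the one-JSON-object-per-line framing are assumptions based on a default `tcp` input with `Format json`; adjust to match your actual Fluent Bit config:

```python
import json
import socket

def build_payload(record: dict) -> bytes:
    # One JSON object per line; the trailing newline terminates the
    # record for line-oriented TCP inputs.
    return (json.dumps(record) + "\n").encode("utf-8")

def send_log(record: dict, host: str = "127.0.0.1", port: int = 5170) -> None:
    # Open a fresh connection per record so every attempt is independent,
    # mimicking a freshly started app.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_payload(record))

# Usage (with Fluent Bit listening locally):
# send_log({"level": "Information", "message": "hello from a raw socket"})
```

If raw sends like this show up on every run while the sink's only show up on the first, the difference is likely in how the sink manages its connection or frames records, rather than in Fluent Bit itself.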
Related
I have a Python application which, with the help of some shell tricks, was able to email me when new error messages appeared in the log during an execution session, scheduled with cron.
Now I am packaging it in Docker and have been able to reproduce most of its functionality with docker-compose.
But when it comes to emails on failures, I am not sure of the best way to implement it.
What are your suggestions? Are there any best practices?
Update:
The app runs a couple of times a day. Previously, all prints to stderr were duplicated to stdout to preserve chronological order in the main log file. Then the wrapper script would accumulate all stderr from a single session in another, temporary file. If that file was not empty after the session, its contents were sent in a single email from me to myself through SMTP with proper authentication. I was happy to receive these emails and able to act on them for the last few months.
Right now I see three possible solutions:
1. Duplicating everything worth sending to a temporary file right in the app, so that docker logs still persist; then sending it after the session from the entrypoint, provided there is a way to set up all the requirements inside the container.
2. Grepping the docker log from the outside. But that somewhat misses the point of Docker.
3. Relaying reports over the local network to another container, with something like https://hub.docker.com/r/juanluisbaptiste/postfix/, which would then send them in an email.
I was not able to properly set up postfix or use the mail utility inside the container, but Python seems to work just fine: https://realpython.com/python-send-email/
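Since Python's smtplib already works for me, options 1 and 3 could both reuse it. A minimal sketch of the sending side — the relay host, port, and credentials are placeholders, not a recommendation:

```python
import smtplib
from email.message import EmailMessage

def build_report(errors: str, sender: str, recipient: str) -> EmailMessage:
    """Wrap a session's accumulated stderr in a single email."""
    msg = EmailMessage()
    msg["Subject"] = "Errors from last session"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(errors)
    return msg

def send_report(errors: str) -> None:
    # Hypothetical relay and credentials; substitute your own.
    msg = build_report(errors, "me@example.com", "me@example.com")
    with smtplib.SMTP("smtp.example.com", 587) as smtp:
        smtp.starttls()
        smtp.login("me@example.com", "app-password")
        smtp.send_message(msg)
```

The entrypoint would call send_report only when the session's stderr file is non-empty, preserving the one-email-per-session behaviour of the old wrapper script.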
TL;DR: look at New Relic's free tier.
There is a lot to unpack here. First, it would be helpful to know more about what you had been doing previously with the backticks; more detail about those commands might change how I'd respond. Without that, I might make some assumptions that are incorrect or not applicable.
I would consider sending errors via email a bad idea for a few reasons:
If many errors occur quickly, your inbox will get flooded with emails, and/or you'll flood a mail server with messages or the network with traffic. It's your mailbox and your network, so you can do what you want, but this tends to fail dramatically when it happens, especially in production.
To send an email you need to route those messages through an SMTP server/gateway/relay of some sort, and automated scripts like this often get blocked because they trigger spam detection. When that happens during an incident, the messages get silently dropped and production issues go unreported. Again, it's your data and your errors, and you can do that if you want, but I wouldn't consider it best practice as far as reliability goes. I've seen it fail to alert many times in my past experience for this very reason.
It's been my experience (over 20 years in the field) that any alerting sent via email quickly gets routed to a sub-folder by a mail rule and starts getting ignored as noise. As soon as people start to ignore the messages, serious errors get lost in the inbox along with them and go unnoticed.
De-duplication of error messages isn't built in, so you could get hundreds or thousands of emails in a few seconds, with the one meaningful error buried like a needle in a haystack.
So I'd recommend not using email for those reasons. If you were really set on it, you could have a script running inside the container that tails the error log and sends the emails off using some SMTP client (which could be installed in the Docker container). You'd likely have to set up credentials, depending on your mail server, but it could be done. Again, I'd advise against it.
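For completeness, if you did go the tail-and-mail route, the tailing side stays small. A sketch with the actual SMTP send left as a callback — the 5-second batching interval is an arbitrary choice to limit mail volume:

```python
import time

def poll_new_lines(f):
    """Return lines appended to an already-open file since the last call."""
    return f.readlines()

def watch(path, on_batch, poll_seconds=5.0):
    """Tail `path` forever, handing each batch of new lines to `on_batch`
    (e.g. a function that emails them). Batching whole groups of lines
    reduces the mail-flood risk described above."""
    with open(path, "r") as f:
        f.seek(0, 2)  # skip existing content; only report new errors
        while True:
            batch = poll_new_lines(f)
            if batch:
                on_batch(batch)
            time.sleep(poll_seconds)
```

The open file handle remembers its position, so each poll picks up only what was appended since the previous one.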
Instead, I'd recommend having the logs sent to something like AWS SQS or AWS CloudWatch, where you can set up rules to alert (via SNS, which supports email) if there are N messages in M minutes. You can configure those thresholds as you see fit, and it can also handle de-duplication.
If you didn't want to use AWS (or some other cloud provider), you could perhaps use something like ElastiCache to store the events, with a script to check for recent events and perform de-duplication.
There are plenty of third-party companies that will take this data and give you an all-in-one solution for storing the logs, with a nice dashboard and custom notifications (via email/SMS/etc.), some of which are free. New Relic comes to mind and is free, assuming you don't need log retention.
I don't work for any of these companies; they're just tools I've worked with that I'd consider using before rolling a cron job to send SMTP messages (although I've done that several times in my career when I needed something quick and dirty).
One of the service's containers is constantly restarting. From the logs I can see that some requests take around 20 s, and some of them throw exceptions like "An exception occurred in the database while iterating the results of a query. System.InvalidOperationException: An operation is already in progress. at Npgsql.NpgsqlConnection", or time out. When I try to access the DB from my local environment, I cannot reproduce these exceptions. On random requests that take too long, the container restarts. Has anybody had a similar issue?
As the exception says, your application is likely trying to use the same physical connection from multiple threads at the same time, but it's impossible to know without seeing some code. Make sure you understand exactly when connections are being used and by which thread, and if you're still stuck, try to post a minimal code sample that demonstrates the issue.
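The usual fix is one connection per thread, or a pool. Npgsql differs in detail (it reports the misuse as "An operation is already in progress"), but the failure mode can be illustrated with Python's sqlite3 driver, which enforces a same-thread rule by default:

```python
import sqlite3
import threading

shared = sqlite3.connect(":memory:")  # created on the main thread

def misuse(failures):
    try:
        # Same connection object, different thread: the driver rejects it.
        shared.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        failures.append(str(exc))

def correct(results):
    # Each thread opens (or borrows from a pool) its own connection.
    conn = sqlite3.connect(":memory:")
    results.append(conn.execute("SELECT 1").fetchone()[0])
    conn.close()

failures, results = [], []
t1 = threading.Thread(target=misuse, args=(failures,))
t2 = threading.Thread(target=correct, args=(results,))
t1.start(); t2.start(); t1.join(); t2.join()
```

sqlite3 fails loudly at the first cross-thread call; drivers that don't enforce the check instead fail unpredictably under load, which matches the "cannot reproduce locally" symptom.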
If you are using an ELB (Elastic Load Balancer), try increasing its timeout limit.
I'm new to RoR and am building a few beginner projects, but I'm unfamiliar with error processing in RoR. For instance, I am working on this project right here, even though the site I found it through warned it had a few errors (if you know of a program that shortens URLs based on a domain you own and has NO errors, let me know and I'll use it). The problem I'm having is on this step:
rails server
It produces the result the writer shows, then brings up a SECURITY WARNING. Underneath the security warning it prints three timestamps with INFO WEBrick and INFO ruby. However, an hour later it's still sitting there, and it hasn't returned me to the command prompt at the original location of the code (i.e. C:\Location).
Is this an error, or is it supposed to stay loaded? I ask because from the article it seems like I could just move on to the next step (after thirty minutes I hit ENTER just to see what happens, but got no response), but unless I open up a new command prompt, I don't see how.
Have you made any requests?
After the Rails server starts, it will sit "forever" waiting to service client requests (e.g., from a browser). Under Windows, that command prompt won't be useful until the server is shut down, e.g., with a Ctrl-C.
You can either open a new command window, as you've done, or shut the server down and use the same window. It's worth noting that sometimes you'll need to restart the server, though much of the time you won't. Figuring out when, and under what circumstances, is deterministic but occasionally confusing.
I've been helplessly observing this problem for a couple of months now, and have decided this is my best shot.
I'm not sure what the cause of the problem is, but I can list some of the things I'm doing. I have an iOS app that uses AFNetworking to connect to a remote server hosted by Google App Engine using HTTP POST requests.
Now, everything works great, but sometimes, very sporadically and at random, I get failed requests. The activity indicator spins for about a minute, and I get no feedback at the end, just a failed request. I check my server logs and don't see any errors. After the failed request I try again, and it works fine; it works fine for the whole day. Then at some other random time the issue repeats itself, sometimes spinning for 10 seconds before failing, sometimes for a minute.
Generally, what could be the cause of this? Is it normal to have some connections fail at random? Is it something on my part?
But the weird thing is that while the app is running on my iPhone, with the indicator spinning as it tries to connect, I can try connecting from the iOS simulator and the connection works just fine. I try again on the iPhone, and it doesn't work.
If I close the app completely and start it again, it works again. So it sounds like it may be a software issue rather than a connection issue, but then again I have no evidence or data whatsoever.
I know it's vague, but I'm hoping someone may have had a similar problem. Anything helps.
There is a known issue with instance start on GAE for Java. You can star the issue at http://code.google.com/p/googleappengine/issues/detail?id=7706.
The same problem was reported for Python, but there it is not such a big problem.
I think you should check the logging level you use on App Engine and monitor all your calls. Instance start usually takes more time, so you will be able to see how much time start-up consumes and whether it really is a timeout problem.
For the Java version you could try lowering the log level (java.util.logging has no DEBUG level; FINE is the closest equivalent):
.level = FINE
in your logging.properties file. It will give you more information about the instance start process.
I have a fairly big application running inside a Spree extension. The issue is that all requests are very slow, even locally. I see messages like "Waiting for localhost" or "waiting for server" in my browser status bar for 3-4 seconds on each request before execution starts. The execution time logged in the log file is quite good, but the overall response time is poor because of the initial delay. Where should I start looking to improve this situation?
One possible root cause for this kind of problem is that initial DNS name resolution is failing before eventually succeeding. You can check whether this is the case using tcpdump (if it's available for your platform) or Wireshark. Look for traffic to and from your client host on port 53 and see whether the name responses come back in a timely fashion.
If it turns out that this is the problem, then you need to make sure the client is configured so that the first resolver it tries knows about your server addresses (I'm guessing these are local LAN addresses that are failing). Different platforms have different ways of configuring this. A quick hack would be to put the address of your server in the client's hosts file to see if that fixes it.
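If tcpdump or Wireshark aren't handy, you can also time resolution directly from a small script. A sketch in Python — the hostname to test is whatever your client actually resolves:

```python
import socket
import time

def resolve_time(hostname: str) -> float:
    """Return seconds spent resolving a hostname via the system resolver."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return time.perf_counter() - start

# Compare e.g. resolve_time("localhost") against the name your app uses;
# a multi-second gap points at the resolver rather than the app.
```

A 3-4 second resolution time here, matching the delay you see in the browser, would confirm the DNS theory before you touch any configuration.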
Once you send a request, you will see "waiting for host" right up until the Ruby work is done and the server starts sending a response. So pretty much any slow processing will produce this message. Start with the actions where you see the behaviour and break them down into pieces to see which pieces are slow. If EVERYTHING is slow, then look at the things that are common to every action: before filters, ApplicationController code, or something similar.
What I do, when I'm just poking around to see what needs fixing, is put `puts` statements in my code at different stages to print the current time; then I can see which stage is taking a long time.
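The same print-the-time idea can be made a little more systematic; a sketch in Python for illustration (in Rails you'd do the equivalent with `puts Time.now` or `Rails.logger` around each stage):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name, timings):
    """Record how long a named stage of a request takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Stand-in stages for whatever your action actually does:
timings = {}
with stage("load", timings):
    data = list(range(1000))
with stage("render", timings):
    out = ",".join(map(str, data[:10]))
# Inspect `timings` afterwards to see which stage dominates.
```

Printing or logging the per-stage numbers after each request quickly narrows a 3-4 second delay down to one stage instead of the whole action.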