Logging directly to standard output - docker

Where I work, we are migrating our entire infrastructure which was until now based on monolithic services that ran directly on a windows/linux VM to a docker based architecture that will be orchestrated by Kubernetes.
One of the things that came to my mind is how we would handle logs in this new infrastructure.
Up until now, each app had its own way of handling logs, some were using log4net/log4j to write to file system, some were writing to GrayLog via a dedicated library.
The main problem I have with that is that one of the core ideas of programming micro-services in a Docker environment is that every service should assume as little as possible about the rest of services or the platform.
So basically I was looking into how I can abstract the logging process from the application, make it independent from the rest of the infrastructure.
One interesting thing that I found was that you could write the logs to standard output (stdout) and then configure Kubernetes to pull these logs and direct them to a centralised storage or a centralised logging server (like GrayLog) https://kubernetes.io/docs/concepts/cluster-administration/logging/
I have several concerns with this approach, for once, I haven't seen too many companies that do it, most popular logging solutions are to use a dedicated library to log to filesystem.
I am also concerned about how it might impact performance, some languages block if you write to stdout, whereas when you use a standard logging library, the logs are queued.
So what about services that output massive amount of user related logs?
I was interested about what you think, I didn't see this approach used widely, maybe there is reason for that.

Logging to whatever stream (File, stdout, GrayLog...) can either be synchronous (blocking) or asynchronous (non-blocking). Inherently, that has nothing to do with the medium you log to per-se. It is true that using System.out.println in Java will result in heavy thread-contention.
All the major logging frameworks (like log4j) provide you with the means to log in an asynchronous fashion to every medium that you like.
Your perception of not many companies doing this I think is wrong. Logging to stdout and configuring your underlying architecture to forward logs somewhere is the defacto standard of all PaaS/containerized applications.
So my tip is going to be: Log to stdout using a good logging framework which ensures asynchronous usage of the stream. For the rest you'll probably be fine.

Related

Logging in vernemq plugin

I am trying to implement logging on connected users in my vernemq client using erlang. From documentation, I found that this could be bad, due to the scalability of the client and the assumption that there might be a lot of clients connecting and disconnecting. This is not my case, I will just have a bunch of clients but a lot of messages. Anyway, to my question. Is it possible to change the log file when using error_logger? Or should I use a different module for logging? Log file can be in any location if it had to, but I need it separated from vernemqs console.log. A followup question would be, can I somehow get a floating window on logs? I don't need to keep logs from previous year and I don't want to manually clean them every day or week or something like that
Thanks for any responses
From OTP21 on, you should use logger instead of error_logger, although the error_logger API is kept for compatibility (it justs uses logger under the hood).
With logger, which you can configure with the system configuration, you can use file backends such as logger_std_h (check the example configurations).
In logger_std_h you can set file rotation.

What is recommended solution for monitoring heterogeneous infrastructure?

I am looking for monitoring tool for the following use cases:
Collect basic metrics about virtual machine (cpu usage, memory usage, i/o, available space)
Extract metrics from SQL Server (probably running some queries)
Extract information from external service about processing i.e how many processing are currently running and for how long. I am thinking about writing python scripts, but don't know how to combine with monitoring tool
Have the ability to plot charts and manage alerts and it will nice to have ability to send not only mails, but send message to slack/ms teams.
I was thing about Prometheus, because it has wmi_exporter, node_exporter, sql exporter, alert manager with possibility to send notifications to multiple destinations, but I don't know what to do with this external service and python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up a node_exporter and having it scraped by Prometheus, but I don't think it has e.g. information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there exists a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short lived batch jobs.
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.

Docker performances for getting information: polling vs events

I have Docker swarm full of containers. I need to monitor when something is up or down. I can do this in 2 ways:
attaching to the swarm and listen to events.
polling service list
The issue with events is that there might be huge traffic, plus if some event is not processed, we will simply loose information on whats going on.
For me it is not super important to get immediate results, but to have correct information on whats going on.
Any pros/cons from real-life project?
Listening to events- its immediate, but risky as if your event listening program crashes because of any reason, you will miss an important information and lead to wrong result. This Registrator program is based on events.
Polling- eventual consistent result. but if it solves your problem it is less painful way to grabbing the data. No matter if your program crashes or restart. We are using this approach for service discovery in our project and so far it served the purpose.
From my experience, checking if something is up or down should be done using a health check, and should be agnostic to the underlying architecture running your service (otherwise you will have to write a new health check every time you change platform). Of course - you might have services with specific needs that cannot be monitored that way - if this is the case you're welcome to comment on that.
If you are using Swarm for stateless services only, I suggest creating a health check route that can verify the service is healthy and even disconnect faulty containers from the service.
If you are running statefull stuff this might be trickier, but there are solutions for that too, usually using some kind of monitoring agent over your statefull container (We are using cloudwatch since we run on AWS, but there are many alternatives)
Hope this helps.

How to specifically identify server issues from a load test result (using LoadRunner)?

How do you isolate a performance issue to a specific component of the application infrastructure? Specifically, are there distinct markers in the result logs that distinguish between bottlenecks at web, application and/or database server levels?
I was asked this question in an interview and went blank on it. Seems this information is not available anywhere.
In addition to SiteScope and other agentless monitoring of system components, you need to make sure your scenario and scripts are working as expected. This includes proper error checking and use of transactions (and a host of other things). If the transactions are granular enough, this will give you insight into at least the requests that have performance issues. Once you have these indicators, work with the infrastructure team to review logs and other information. Being an iterative process, tests can be made to focus on a smaller and smaller section of the infrastructure.
In addition, loadrunner scripts don't have to be made strictly 'coming in through the frontdoor'. If you have a multi-tiered system, scripts can be made to hit the web/app/database servers directly.
For what to look for, focus on any measurements that have 'knees' or 'hockey stick' type of behaviour. You can hook into any of the server resource type measurements directly in the controller and integrate other team's stats in the analysis phase. Compare with benchmarks at lower virtual user levels to determine what is acceptable and unacceptable.
Good luck!
If the interview is focused on LoadRunner and SiteScope is considered - I'd come to conclusion that it's more focused on HP/Mercury solutions.. In that case I'd suggest you to look into HP Diagnostics and it's LoadRunner integration capabilities.
This type of information is usually not available by just looking at the standard results from a performance test.
Parts of the information you are looking for MAY be found by using SiteScope to monitor all the relevant servers in the test. SiteScope offers many counters to look at such as CPU, Memory, Disk I/O and Network I/O - as seen on each server.
This information perhaps gives clues as to where the bottleneck is, and the more counters you add to SiteScope, the bigger the change to pinpoint the bottleneck.
It is a very common misconception that AppServer and DBServer bottlenecks could be identified by just looking at the raw response times or hits, pages etc (web protocol), unless of course the URI accessed defines the exact component(s) in the system...

What are the requirements for an application health monitoring system?

What, at a minimum, should an application health-monitoring system do for you (the developer) and/or your boss (the IT Manager) and/or the operations (on-call) staff?
What else should it do above the minimum requirements?
Is monitoring the 'infrastructure' applications (ms-exchange, apache, etc.) sufficient or do individual user applications, web sites, and databases also need to be monitored?
if the latter, what do you need to know about them?
ADDENDUM: thanks for the input, i was really looking for application-level monitoring not infrastructure monitoring, but it is good to know about both
Whether the application is running.
Unusual cpu/memory/network usage.
Report any unhandled exceptions.
Status of various modules (if applicable).
Status of external components (databases, webservices, fileservers, etc.)
Number of pending background tasks (if applicable).
Maybe track usage of the application and report statistics on most/less used functionalities so you know where optimizations are most beneficial.
The answer is 'it depends'. Why do you need to monitor? How large is your operations staff? Do you need reporting? What is the application environment? Who cares if the application fails? Who cares if an exception happens? Are any of the errors recoverable? I could ask questions like these for a long time.
Great question.
We've been looking for some application-level monitoring solution for our needs some time ago without any luck. Popular monitoring solution are mostly addressed to monitor infrastrcture and - in my opinion - they are too complicated for a requirements of most of small and mid-sized companies.
We required (mainly) following features:
alerts - we wanted to know about
incident as fast as possible
painless management - hosted service wouldbe
the best
visualizations - it's good to know what is going on and take some knowledge from the data
Because we didn't find suitable solution we started to write our own. Finally we've ended with up-and-running service called AlertGrid. (You can check it for free of course.)
The idea behind it is to provide an easy way to handle custom monitoring scenarios. Integration API is very simple (one function with two required parameters). At the momment we and others are using it for:
monitor scheduled tasks (cron jobs)
monitor entire application logic execution
alert on errors in applications
we are also working on examples of basic infrastructure monitoring using AlertGrid
This is such an open ended question, but I would start with physical measurements.
1. Are all the machines I think are hosting this site pingable?
2. Are all the machines which should be serving content actually serving some content? (Ideally this would be hit from an external network.)
3. Is each expected service on each machine running?
3a. Have those services run recently?
4. Does each machine have hard drive space left? (Don't forget the db)
5. Have these machines been backed up? When was the last time?
Once one lays out the physical monitoring of the systems, one can address those specific to a system?
1. Can an automated script log in? How long did it take?
2. How many users are live? Have there been a million fake accounts added?
...
These sorts of questions get more nebulous, and can be very system specific. They also usually can be derived reactively when responding to phsyical measurements. Hard drive fill up, maybe the web server logs got filled up because a bunch of agents created too many fake users. That kind of thing.
While plan A shouldn't necessarily be reactive, it is the way many a site setup a monitoring system.
Minimum: make sure it is running :)
However, some other stuff would be very useful. For example, the CPU load, RAM usage and (in multiuser systems) which user is running what. Also, for applications that access network, a list of network connections for each app. And (if you have access to client computer(s)) it would be cool to be able to see the 'window title' of the app - maybe check each 2-3 minutes if it changed and save it. Also, a list of files open by the application could be very useful, but it is not a must.
I think this is fairly simple - monitor so that you can be warned early enough before something goes wrong. That means monitor dependencies and the application itself.
It's really hard to provide specifics if you're not going to give details on the application you're monitoring, so I'd say use that as a general rule.
At a minimum you want to know that the system is healthy. This is subjective in what defines your system is healthy. Is it computers are up, the needed resources exist, the data is flowing through the system, the data is properly producing results, etc, etc.
In my project we do monitoring of most of this and then some. It really comes down to what is the highest level that you can use to analyze that everything is working. In our case we need to know down to the data output. If you just need to know down to the are these machines up it saves you on trying to show an inexperienced end user what is wrong.
There are also "off the shelf" tools that will do a lot of the hard work for you if you are just looking too hard into data results. I particularly liked Nagios when I was looking around but we needed more than it could easily show so I wrote our own monitoring system. Basically we also watch for "peculiarities" in the system, memory / cpu spikes, etc...
thanks everyone for the input, i was really looking for application-level monitoring not infrastructure monitoring, but it is good to know about both
the difference is:
infrastructure monitoring would be servers plus MS Exchange Server, Apache, IIS, and so forth
application monitoring would be user machines and the specific programs that they use to do their jobs, and/or servers plus the data-moving/backend applications that they run to keep the data flowing
sometimes it's hard to draw the line - an oversimplified definition might be "if your team wrote it, it's an application; if you bought it, it's infrastructure"
i think in practice it is best to monitor both
What you need to do is to break down the business process of the application and then have the software emit events at major business components. In addition, you'll need to create end to end synthetic transactions (eg. emulating end users clicking on a website). All that data would be fed into an monitoring tool. In the past, I've done JMX for applications of which flowed into Tivoli Monitoring's JMX Adapter and then I've done scripts that implement a "fake user" and then pipe in the results into Tivoli Monitoring's Script Adapter. Tivoli Monitoring takes the data and then creates application health and performance charts from that raw data.

Resources