Remove Airflow Scheduler logs - docker

I am using Docker Apache Airflow VERSION 1.9.0-2 (https://github.com/puckel/docker-airflow).
The scheduler produces a significant amount of logs, and the filesystem quickly runs out of space, so I am trying to programmatically delete the scheduler logs created by Airflow, found in the scheduler container at /usr/local/airflow/logs/scheduler.
I have all of these maintenance tasks set up:
https://github.com/teamclairvoyant/airflow-maintenance-dags
However, these tasks only delete logs on the worker, and the scheduler logs are in the scheduler container.
I have also set up remote logging, sending logs to S3, but as mentioned in the SO post "Removing Airflow task logs", this setup does not stop Airflow from writing to the local disk.
Additionally, I have tried creating a shared named volume between the worker and the scheduler, as outlined in "Docker Compose - Share named volume between multiple containers". However, I get the following error in the worker:
ValueError: Unable to configure handler 'file.processor': [Errno 13] Permission denied: '/usr/local/airflow/logs/scheduler'
and the following error in scheduler:
ValueError: Unable to configure handler 'file.processor': [Errno 13] Permission denied: '/usr/local/airflow/logs/scheduler/2018-04-11'
So, how do people delete scheduler logs?

Inspired by this reply, I have added the airflow-log-cleanup.py DAG (with some changes to its parameters) from here to remove all old Airflow logs, including scheduler logs.
My changes are minor, except that, given my EC2 instance's disk size (7.7 GB on /dev/xvda1) and that I had 4 DAGs, the default of 30 days for DEFAULT_MAX_LOG_AGE_IN_DAYS seemed too large, so I changed it to 14 days. Feel free to adjust it to your environment:
DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get("max_log_age_in_days", 30)
changed to
DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get("max_log_age_in_days", 14)

The following could be one option to resolve this issue.
Log in to the Docker container (make sure the container is running first):
docker exec -it <name-or-id-of-container> sh
Then use cron jobs to schedule an rm command on those log files.
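For example, a minimal sketch of such a cron entry (assuming the default log path from the question and a 7-day retention; adjust both to your environment):

# Delete scheduler log files older than 7 days, daily at 03:00
0 3 * * * find /usr/local/airflow/logs/scheduler -type f -mtime +7 -delete
# Then clean up any directories left empty
30 3 * * * find /usr/local/airflow/logs/scheduler -mindepth 1 -type d -empty -delete

Note that the puckel image may not have cron running by default, so you may need to install/start it inside the container, or run the find command from the host via docker exec instead.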

This answer to "Removing Airflow Task logs" also fits your use case in Airflow 1.10.
Basically, you need to implement a custom log handler and configure Airflow logging to use that handler instead of the default (see UPDATING.md, not the README nor the docs, in the Airflow source repo).
One word of caution: due to the way logging, multiprocessing, and Airflow's default handlers interact, it is safer to override handler methods than to extend them by calling super() in a derived handler class, because Airflow's default handlers don't use locks.
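To illustrate the "override, don't extend" point, here is a minimal standalone sketch (the class name and file-writing approach are hypothetical illustrations, not Airflow's actual handler API):

import logging

class SchedulerFileHandler(logging.Handler):
    # Hypothetical handler: writes records to a file itself, rather than
    # subclassing an Airflow handler and calling super().emit(), per the
    # caution above.
    def __init__(self, filename):
        super().__init__()
        self.filename = filename

    def emit(self, record):
        try:
            with open(self.filename, "a") as f:
                f.write(self.format(record) + "\n")
        except Exception:
            self.handleError(record)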

I spent a lot of time trying to add "maintenance" DAGs that would clear logs generated by the different Airflow components started as Docker containers.
The problem was in fact more at the Docker level: each of those processes produces tons of logs that are, by default, stored in JSON files by Docker. The solution was to change the logging driver so that logs are no longer stored on the Docker host instance, but sent directly to AWS CloudWatch Logs in my case.
I just had to add the following to each service in the docker-compose.yml file (https://github.com/puckel/docker-airflow) :
logging:
  driver: awslogs
  options:
    awslogs-group: myAWSLogsGroupID
Note that the EC2 instance on which my "docker-composed" Airflow app runs has an AWS role that allows it to create a log stream and add log events (the CreateLogStream and PutLogEvents actions in AWS IAM).
If you run it on a machine outside of the AWS ecosystem, you'd need to ensure it has access to AWS through credentials.
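For reference, a minimal sketch of such an IAM policy (the log-group name matches the example above; scope the Resource to your own group):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:*:*:log-group:myAWSLogsGroupID:*"
    }
  ]
}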

Related

Unable to pull logs from Airflow Worker

I've got a simple docker development setup for Airflow that includes separate containers for the Airflow UI and Worker. I'm encountering a 403 Forbidden error whenever I attempt to view the log for a task in the Airflow UI.
So far I've ensured they all have the same secret key (in fact, using Docker Volumes they're all reading the exact same configuration file) but this doesn't seem to help. I haven't done anything about time sync, but I'd expect that docker containers would effectively be sharing the system clock anyway so I don't see how they'd get out of sync in the first place.
I can find the log file on the Airflow worker, and the task has run successfully, but something is obviously missing that should allow the Airflow UI to display it (and it would be much more convenient for my workflow to see the logs in the UI rather than rummaging around on the worker).
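For reference, the shared key the question refers to lives under [webserver] in airflow.cfg (a sketch; the value is a placeholder, and it must be identical on every component that serves or fetches logs):

[webserver]
secret_key = some-long-random-string

If the containers really do read the exact same file, a lingering 403 usually points elsewhere, e.g. an environment variable such as AIRFLOW__WEBSERVER__SECRET_KEY overriding the file in one of the containers.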

How to return logs to Airflow container from remotely called, Dockerized celery workers

I am working on a Dockerized Python/Django project that includes a container for Celery workers, into which I have been integrating the off-the-shelf Airflow Docker containers.
I have Airflow successfully running Celery tasks in the pre-existing container, by instantiating a Celery app with the Redis broker and backend specified and making a remote call via send_task; however, none of the logging carried out by the Celery task makes it back to the Airflow logs.
Initially, as a proof of concept (I am completely new to Airflow), I had set it up to run the same code by exposing it to the Airflow containers and creating Airflow tasks to run it on the Airflow Celery worker container. This did capture all the logging, but it's definitely not how we want it architected, as it makes the Airflow containers very fat due to the duplicated dependencies from the Django project.
The documentation says "Most task handlers send logs upon completion of a task", but I wasn't able to find more detail that might give me a clue how to enable the same in my situation.
Is there any way to get these logs back to airflow when running the celery tasks remotely?
Instead of "returning the logs to Airflow", an easy-to-implement alternative (because Airflow natively supports it) is to activate remote logging. This way, all logs from all workers would end up e.g. on S3, and the webserver would automatically fetch them.
The following illustrates how to configure remote logging using an S3 backend. Other options (e.g. Google Cloud Storage, Elastic) can be implemented similarly.
Set remote_logging to True in airflow.cfg
Build an Airflow connection URI. This example from the official docs is particularly useful IMO. One should end up with something like:
aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY#/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
It is also possible to create the connection through the webserver GUI, if needed.
Make the connection URI available to Airflow. One way of doing so is to make sure that the environment variable AIRFLOW_CONN_{YOUR_CONNECTION_NAME} is available. Example for connection name REMOTE_LOGS_S3:
export AIRFLOW_CONN_REMOTE_LOGS_S3=aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY#/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
Set remote_log_conn_id to the connection name (e.g. REMOTE_LOGS_S3) in airflow.cfg
Set remote_base_log_folder in airflow.cfg to the desired bucket/prefix. Example:
remote_base_log_folder = s3://my_bucket_name/my/prefix
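Putting the airflow.cfg pieces from the steps above together, the end state would look roughly like this (a sketch; the [logging] section name applies to Airflow 2.x, while older versions keep these keys under [core]):

[logging]
remote_logging = True
remote_log_conn_id = REMOTE_LOGS_S3
remote_base_log_folder = s3://my_bucket_name/my/prefix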
This related SO post goes deeper into remote logging.
If debugging is needed, looking at a worker's logs locally (i.e., inside the worker container) should help.

Logging from multiple processes in a single docker container

I have an application (let's call it Master) which runs on Linux and starts several processes (let's call them Workers) using fork/exec. Therefore each Worker has its own PID and writes its own logs.
When running directly on a host machine (without Docker), each process uses syslog for logging, and rsyslog puts the output from each Worker into a separate file, using a config like this:
$template workerfile,"/var/log/%programname%.log"
:programname, startswith, "worker" ?workerfile
:programname, isequal, "master" "/var/log/master"
Now, I want to run my application inside a Docker container. Docker starts the Master process as the main process (in the CMD section of the Dockerfile), which then forks the Workers at runtime (not sure if it's a canonical way to use Docker, but that's what I have). Of course, I only get the stdout of the Master process from Docker, and the Workers' logs get lost.
So my question is, any way I could get the logs from the forked processes?
To be precise, I want the logs from different processes to appear in individual files on the host machine eventually.
I tried running the rsyslog daemon inside the Docker container (just like I do when running without Docker), writing logs to a mounted volume, but it doesn't seem to work. I guess it requires a workaround like supervisord to run the Master process and rsyslogd at the same time, which looks like overkill to me.
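For what it's worth, the supervisord workaround mentioned above would be a short config along these lines (a sketch; the master binary path is a placeholder):

[supervisord]
nodaemon=true

[program:rsyslog]
; run rsyslogd in the foreground so supervisord can manage it
command=/usr/sbin/rsyslogd -n

[program:master]
command=/usr/local/bin/master

with the Dockerfile's CMD pointing at supervisord instead of the Master binary.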
I couldn't find any simple solution for that, though my problem seems to be trivial.
Any help is appreciated, thanks

How to stop and restart a Compute Engine VM that runs a Docker container

I'm running a Docker container on Compute Engine, using the Container Image VM property.
However, if I stop and restart the VM, my app works but the logs aren't collected any more.
When I run docker ps I only see my own Docker image. However, for a new VM that hasn't been stopped I also see a container image called gcr.io/stackdriver-agents/stackdriver-logging-agent.
Are there any specific steps I need to take to restore the VM as it was before it was stopped? How can I make logging work again, and are there other differences I should be aware of?
I understand you are running a Docker container on Compute Engine, and when you stop/restart the VM the logs are no longer collected; you also want to know how to restore the VM to its previous state and what the stackdriver-logging-agent container is.
As described in this article [1], you can use GCE snapshots to create backups of persistent disks attached to an instance, including boot volumes. This is useful for backing up your data, recreating a disk that might have been lost, or copying a persistent disk. That said, this is currently the only method for recovering a deleted disk.
Therefore, unfortunately, if no snapshots were taken of the VM's disk(s), the deleted disk volume cannot be recovered; this process is irreversible [2].
In the future, you can set the disk's 'auto-delete' option [3] to no when creating an instance; this way the disk will remain even if the instance is deleted.
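For example (a sketch; the instance, disk, snapshot, and zone names are placeholders):

# back up a persistent disk
gcloud compute disks snapshot my-disk --snapshot-names=my-backup --zone=us-central1-a
# keep the disk around even if the instance is deleted
gcloud compute instances create my-vm --disk=name=my-disk,auto-delete=no --zone=us-central1-a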
As for the logging agent image: it's a container image that streams logs from your VM instances and from selected third-party software packages to Stackdriver Logging. It is a best practice to run the Logging agent on all your VM instances; it is this agent that records your logs and sends them to Stackdriver Logging, which answers your question as to why the logs stopped appearing once it was no longer running.
For the logs not being collected, you can try the following to reset the service on your affected Windows instance:
1) Stop the "StackdriverLogging" service. You can do it from the command line with "net stop StackdriverLogging".
2) Navigate to the following directory: "C:\Program Files (x86)\Stackdriver\LoggingAgent\Main\pos\winevtlog.pos\worker0"
3) Remove the file "storage.json" located in that directory.
4) Restart the StackdriverLogging service: execute "net start StackdriverLogging" from the command line.
This should reset the logging agent state and make logging functional again.
[1] https://cloud.google.com/compute/docs/disks/create-snapshots
[2] https://cloud.google.com/compute/docs/disks/#pdspecs
[3] https://cloud.google.com/sdk/gcloud/reference/compute/instances/create#--disk

What's the best practice for Docker logging?

I'm using Docker with my web service.
When I deploy using Docker, I lose some log files (nginx access log, service log, system log, etc.), because the Docker deployment process takes containers down and brings new ones up.
So I thought about this problem: the logging server and the service server (for the API) should be separated.
I'm considering these methods:
First, using Logstash (in ELK), attaching all my log files.
Second, using a batch system that moves log files to another server every midnight.
Is that okay? I'd welcome a better answer.
Thanks.
There are many approaches admins commonly use for container logging:
1) Mount the log directory to the host, so even if the container goes down or up, the logs persist on the host (see the sketch after this list).
2) An ELK stack, using Logstash/Filebeat to push logs to an Elasticsearch server by tailing the files, so new log content is pushed to the server as it appears.
3) For application logs, such as Maven-based projects, there are many plugins that push logs to a server.
4) A batch system, which is not recommended, because if the container dies before midnight, the logs are lost.
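A minimal sketch of option 1 in a docker-compose service (the image name and host path are placeholders):

services:
  web:
    image: my-web-service
    volumes:
      # bind-mount so nginx logs live on the host filesystem
      - /var/log/my-service:/var/log/nginx

With this bind mount, the access and error logs survive container restarts and redeployments because they are written to the host rather than the container's writable layer.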
