How to return logs to Airflow container from remotely called, Dockerized celery workers - docker

I am working on a Dockerized Python/Django project that includes a container for Celery workers, and I have been integrating the off-the-shelf Airflow Docker containers into it.
I have Airflow successfully running Celery tasks in the pre-existing container, by instantiating a Celery app with the Redis broker and result backend specified and making a remote call via send_task; however, none of the logging carried out by the Celery task makes it back to the Airflow logs.
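Roughly, the calling side looks like this (a minimal sketch with placeholder broker/backend URLs and task name, not the exact code):

    from celery import Celery

    # Minimal sketch: broker/backend URLs and the task name are placeholders.
    app = Celery(
        "remote_tasks",
        broker="redis://redis:6379/0",
        backend="redis://redis:6379/0",
    )

    # Called from an Airflow task (e.g. a PythonOperator callable); the task itself
    # runs on the pre-existing Django/Celery worker container.
    result = app.send_task("myproject.tasks.process_data", args=[42])
    result.get(timeout=300)  # wait for completion; the task's own logging stays on that worker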
Initially, as a proof of concept (I am completely new to Airflow), I had set it up to run the same code by exposing it to the Airflow containers and creating Airflow tasks to run it on the Airflow Celery worker container. This did result in all the logging being captured, but it's definitely not the way we want it architected, as it makes the Airflow containers very fat due to duplicating the dependencies of the Django project.
The documentation says "Most task handlers send logs upon completion of a task", but I wasn't able to find more detail that might give me a clue as to how to enable the same behaviour in my situation.
Is there any way to get these logs back to Airflow when running the Celery tasks remotely?

Instead of "returning the logs to Airflow", an easy-to-implement alternative (because Airflow natively supports it) is to activate remote logging. This way, all logs from all workers would end up e.g. on S3, and the webserver would automatically fetch them.
The following illustrates how to configure remote logging using an S3 backend. Other options (e.g. Google Cloud Storage, Elastic) can be implemented similarly.
1) Set remote_logging to True in airflow.cfg.
2) Build an Airflow connection URI. This example from the official docs is particularly useful IMO. One should end up with something like (a short Python sketch for generating such a URI follows these steps):
aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
It is also possible to create the connection through the webserver GUI, if needed.
3) Make the connection URI available to Airflow. One way of doing so is to make sure that the environment variable AIRFLOW_CONN_{YOUR_CONNECTION_NAME} is set. Example for the connection name REMOTE_LOGS_S3:
export AIRFLOW_CONN_REMOTE_LOGS_S3=aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
4) Set remote_log_conn_id to the connection name (e.g. REMOTE_LOGS_S3) in airflow.cfg.
5) Set remote_base_log_folder in airflow.cfg to the desired bucket/prefix. Example:
remote_base_log_folder = s3://my_bucket_name/my/prefix
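For step 2, here is a minimal Python sketch of generating such a URI programmatically (assuming the Amazon provider is installed; the credentials and endpoint are the placeholder values above):

    import json
    from airflow.models import Connection

    c = Connection(
        conn_id="remote_logs_s3",
        conn_type="aws",
        login="AKIAIOSFODNN7EXAMPLE",
        password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        extra=json.dumps({"endpoint_url": "http://s3:4566"}),
    )
    # Prints a ready-to-export variable, e.g. AIRFLOW_CONN_REMOTE_LOGS_S3=aws://...
    print(f"AIRFLOW_CONN_{c.conn_id.upper()}={c.get_uri()}")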
This related SO question goes into more depth on remote logging.
If debugging is needed, looking into any worker logs locally (i.e., inside the worker) should help.

Related

ML serving service architecture with Docker

I am in the early stage of developing an image segmentation service. Currently, I have a simple Flask server that is responsible for receiving data and running a Docker container with an AI model on the local GPU server. I am also thinking about something asynchronous like FastAPI or Node.js to implement a scheduler for prediction tasks. Which is better: a) the server calls the Docker container over ssh, and the container runs only when called, predicts the images, saves the results, and stops, or b) running an API server inside the AI container? Each container is around 5-10GB. Keeping all containers running looks more expensive, but I am not sure which practice is better.
I tried to call the container each time and stop it after work was done.
You should avoid approaches based on dynamically starting containers and approaches based on ssh. I'd recommend a long-running process that accepts some network input, like your existing Flask server, and either always has the ML model running or launches it as a subprocess.
If you can use a subprocess, that could be a good match here. When the subprocess exits, all of its memory resources will be automatically cleaned up, so you won't have the cost of the subprocess when it's not being used. If the container happens to exit, the subprocess will get cleaned up with it. Subprocesses are also basic Unix functionality, so you can locally develop your service without needing any particular complex setup.
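A minimal sketch of that subprocess approach (the route, the run_model.py script, and its flags are all hypothetical):

    import subprocess
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/segment", methods=["POST"])
    def segment():
        payload = request.get_json()
        # Hypothetical model entry point: reads an input path, writes results, then exits,
        # so GPU memory and RAM are released as soon as the prediction is done.
        proc = subprocess.run(
            ["python", "run_model.py",
             "--input", payload["input_path"],
             "--output", payload["output_path"]],
            capture_output=True,
            text=True,
            timeout=600,
        )
        if proc.returncode != 0:
            return jsonify({"status": "error", "stderr": proc.stderr}), 500
        return jsonify({"status": "ok", "output": payload["output_path"]})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)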
Dynamically launching containers comes with many challenges. It ties your application to the Docker API, which will make it harder to run, even in local development. Using that API grants unrestricted root-level access to the host system (you can very easily run a container that compromises the host). You need to remember to clean up after your own containers. The setup may not work in other container systems like Kubernetes that don't make a Docker socket available.
An ssh-based system presents different complexities. You need to distribute credentials to various places. If you're trying to run an ssh daemon inside a Docker container, that is difficult to configure securely (what creates the host keys? how do you provision users and private keys?). You also need to think about various failure cases around the ssh transport that might not be present in a purely-local system.

Unable to pull logs from Airflow Worker

I've got a simple docker development setup for Airflow that includes separate containers for the Airflow UI and Worker. I'm encountering a 403 Forbidden error whenever I attempt to view the log for a task in the Airflow UI.
So far I've ensured they all have the same secret key (in fact, using Docker volumes they're all reading the exact same configuration file), but this doesn't seem to help. I haven't done anything about time sync, but I'd expect that Docker containers effectively share the system clock anyway, so I don't see how they'd get out of sync in the first place.
I can find the log file on the Airflow worker, and the task has run successfully, but something is obviously missing that should allow the Airflow UI to display it (and it would be much more convenient for my workflow to see the logs in the UI rather than having to rummage around on the worker).
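For what it's worth, this 403 usually means the webserver failed to authenticate against the worker's log-serving endpoint, which relies on the shared [webserver] secret_key (and reasonably synchronized clocks). One way to rule out a key mismatch is to generate a single key once and inject exactly the same value into every container, e.g. via the AIRFLOW__WEBSERVER__SECRET_KEY environment variable (a minimal sketch):

    # Generate one key and reuse the very same value for the webserver, scheduler, and worker containers.
    import secrets

    print(f"AIRFLOW__WEBSERVER__SECRET_KEY={secrets.token_hex(16)}")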

Logging from multiple processes in a single docker container

I have an application (let's call it Master) which runs on linux and starts several processes (let's call them Workers) using fork/exec. Therefore each Worker has its own PID and writes its own logs.
When running directly on a host machine (without Docker), each process uses syslog for logging, and rsyslog writes the output from each Worker to a separate file, using a config like this:
$template workerfile,"/var/log/%programname%.log"
:programname, startswith, "worker" ?workerfile
:programname, isequal, "master" "/var/log/master"
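(Just to illustrate the tagging these rules match on, here is a minimal Python sketch of a process logging to the local syslog socket under its own program name; the actual application's language doesn't matter, the mechanism is the same.)

    import logging
    from logging.handlers import SysLogHandler

    def make_syslog_logger(program_name: str) -> logging.Logger:
        handler = SysLogHandler(address="/dev/log")   # local syslog/rsyslog socket
        handler.ident = f"{program_name}: "           # becomes %programname% in the rsyslog rules above
        logger = logging.getLogger(program_name)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        return logger

    # e.g. inside a forked Worker process:
    make_syslog_logger("worker1").info("worker started")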
Now, I want to run my application inside a Docker container. Docker starts the Master process as the main process (in the CMD section of the Dockerfile), and then it forks the Workers at runtime (not sure if it's a canonical way to use Docker, but that's what I have). Of course, I'm only getting the stdout of the Master process from Docker, and the Workers' logs get lost.
So my question is: is there any way I could get the logs from the forked processes?
To be precise, I want the logs from different processes to appear in individual files on the host machine eventually.
I tried to run the rsyslog daemon inside the Docker container (just like I do when running without Docker), writing logs to a mounted volume, but it doesn't seem to work. I guess it requires a workaround like supervisord to run the Master process and rsyslogd at the same time, which looks like overkill to me.
I couldn't find any simple solution for that, though my problem seems to be trivial.
Any help is appreciated, thanks

Remove Airflow Scheduler logs

I am using Docker Apache airflow VERSION 1.9.0-2 (https://github.com/puckel/docker-airflow).
The scheduler produces a significant amount of logs, and the filesystem will quickly run out of space, so I am trying to programmatically delete the scheduler logs created by Airflow, found in the scheduler container in /usr/local/airflow/logs/scheduler.
I have all of these maintenance tasks set up:
https://github.com/teamclairvoyant/airflow-maintenance-dags
However, these tasks only delete logs on the worker, and the scheduler logs are in the scheduler container.
I have also set up remote logging, sending logs to S3, but as mentioned in this SO post, Removing Airflow task logs, this setup does not stop Airflow from writing logs to the local machine.
Additionally, I have also tried creating a shared named volume between the worker and the scheduler, as outlined in Docker Compose - Share named volume between multiple containers. However, I get the following error in the worker:
ValueError: Unable to configure handler 'file.processor': [Errno 13] Permission denied: '/usr/local/airflow/logs/scheduler'
and the following error in the scheduler:
ValueError: Unable to configure handler 'file.processor': [Errno 13] Permission denied: '/usr/local/airflow/logs/scheduler/2018-04-11'
So, how do people delete scheduler logs?
Inspired by this reply, I have added the airflow-log-cleanup.py DAG (with some changes to its parameters) from here to remove all old Airflow logs, including scheduler logs.
My changes are minor, except that, given my EC2 instance's disk size (7.7G for /dev/xvda1), the default value of 30 days for DEFAULT_MAX_LOG_AGE_IN_DAYS seemed too large (I had 4 DAGs), so I changed it to 14 days; feel free to adjust it to your environment:
DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get("max_log_age_in_days", 30)
changed to
DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get("max_log_age_in_days", 14)
The following could be one option to resolve this issue.
Log in to the Docker container using the following mechanism:
#>docker exec -it <name-or-id-of-container> sh
While running the above command, make sure the container is running.
Then use cron jobs to schedule an rm command on those log files.
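For example, a small cleanup script along these lines could be scheduled via cron (the paths and retention below are illustrative; adjust to your setup):

    #!/usr/bin/env python3
    """Delete scheduler log directories older than MAX_AGE_DAYS (illustrative paths)."""
    import shutil
    import time
    from pathlib import Path

    SCHEDULER_LOG_DIR = Path("/usr/local/airflow/logs/scheduler")
    MAX_AGE_DAYS = 14
    cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600

    for entry in SCHEDULER_LOG_DIR.iterdir():
        # The scheduler creates one dated directory per day (e.g. 2018-04-11).
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry, ignore_errors=True)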
This answer to "Removing Airflow Task logs" also fits your use case in Airflow 1.10.
Basically, you need to implement a custom log handler and configure Airflow logging to use that handler instead of the default (see UPDATING.md in the Airflow source repo, not the README nor the docs!).
One word of caution: due to the way logging, multiprocessing, and Airflow's default handlers interact, it is safer to override handler methods than to extend them by calling super() in a derived handler class, because Airflow's default handlers don't use locks.
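As a rough illustration (Airflow 1.10-style; the custom handler class is hypothetical, and the handler key names vary between versions), the idea is to copy the default logging config, swap in your own handler, and point logging_config_class at it in airflow.cfg:

    # log_config.py -- made importable via PYTHONPATH and referenced from airflow.cfg as
    #   logging_config_class = log_config.LOGGING_CONFIG
    from copy import deepcopy

    from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

    LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

    # Replace the handler that writes the scheduler/processor logs with a custom one.
    # my_package.log_handlers.RotatingProcessorHandler is a hypothetical class that
    # overrides the relevant methods outright instead of calling super().
    LOGGING_CONFIG["handlers"]["processor"]["class"] = (
        "my_package.log_handlers.RotatingProcessorHandler"
    )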
I spent a lot of time trying to add "maintenance" DAGs that would clear logs generated by the different Airflow components started as Docker containers.
The problem was in fact more at the Docker level: each of those processes produces tons of logs that are, by default, stored in JSON files by Docker. The solution was to change the logging driver so that logs are no longer stored on the Docker host instance but sent directly, in my case, to AWS CloudWatch Logs.
I just had to add the following to each service in the docker-compose.yml file (https://github.com/puckel/docker-airflow):
logging:
  driver: awslogs
  options:
    awslogs-group: myAWSLogsGroupID
Note that the EC2 instance on which my "docker-composed" Airflow app is running has an AWS role that allows it to create a log stream and add log events (the CreateLogStream and PutLogEvents actions in AWS IAM).
If you run it on a machine outside of the AWS ecosystem, you'd need to ensure it has access to AWS through credentials.

What's the best practice for Docker logging?

I'm using Docker with my web service.
When I deploy using Docker, I lose some log files (nginx access log, service log, system log, etc.), because the Docker deployment works by taking containers down and bringing them back up.
So I thought about this problem: the logging server and the service server (for the API) must be separated.
I'm considering these methods:
First, using Logstash (in ELK) and attaching all my log files.
Second, using a batch system that moves the log files to another server every midnight.
Is that okay?
I expect a better answer.
Thanks.
There are many approaches to logging that admins commonly use for containers:
1) Mount the log directory to the host, so that even if the container goes down and up, the logs persist on the host.
2) An ELK stack, using Logstash/Filebeat to push logs to the Elasticsearch server with file tailing, so new log content is pushed to the server as it is written.
3) For application logs, e.g. Maven-based projects, there are many plugins that push logs to a server.
4) A batch system, which is not recommended, because if a container dies before midnight its logs will be lost.
