Unable to submit PySpark job while context lives in JupyterLab - Docker

I have created a Spark standalone cluster on Docker, which can be found here.
The issue I'm facing is that as soon as I run the first cell in JupyterLab to create a SparkContext, I lose the ability to submit jobs (Python programs). I keep getting the message:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I'm not sure where the issue lies, but it seems as if the driver is blocked.
I don't know how else to formulate the question, since I can submit PySpark jobs just fine as long as the app from Jupyter has not been submitted.
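For reference, a common cause of that message is the notebook's application claiming every core and all the memory of the standalone cluster, so any application submitted afterwards never gets resources. A minimal sketch, assuming a hypothetical master URL and resource values (they are not taken from the original setup), of capping what the notebook session takes:
from pyspark.sql import SparkSession

# Hypothetical configuration: cap the cores and memory the notebook application
# claims on the standalone cluster so that other submitted jobs can still be scheduled.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")    # assumed master URL for the Docker setup
    .appName("jupyter-session")
    .config("spark.cores.max", "2")         # leave cores free for other drivers' executors
    .config("spark.executor.memory", "1g")  # leave memory free on the workers
    .getOrCreate()
)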

Related

Unable to pull logs from Airflow Worker

I've got a simple Docker development setup for Airflow that includes separate containers for the Airflow UI and worker. I'm encountering a 403 Forbidden error whenever I attempt to view the log for a task in the Airflow UI.
So far I've ensured they all have the same secret key (in fact, via Docker volumes they're all reading the exact same configuration file), but this doesn't seem to help. I haven't done anything about time sync, but I'd expect Docker containers to share the host's system clock anyway, so I don't see how they'd drift out of sync in the first place.
I can find the log file on the Airflow worker, and the task has run successfully, but something is obviously missing that should allow the Airflow UI to display it (and it would be much more convenient for my workflow to see the logs in the UI rather than having to rummage around on the worker).
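For context, the webserver fetches task logs over HTTP from a small log server that each worker runs (port 8793 by default), and the shared secret key is used to authorize those requests. A minimal reachability check, assuming a hypothetical airflow-worker hostname, run from inside the webserver container:
import socket

# Hypothetical check: can the webserver container reach the worker's log-serving
# port (worker_log_server_port, 8793 by default)? The hostname is an assumption.
with socket.create_connection(("airflow-worker", 8793), timeout=5):
    print("worker log server reachable")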

How to return logs to Airflow container from remotely called, Dockerized celery workers

I am working on a Dockerized Python/Django project that includes a container for Celery workers, into which I have been integrating the off-the-shelf Airflow Docker containers.
I have Airflow successfully running Celery tasks in the pre-existing container by instantiating a Celery app with the Redis broker and backend specified and making a remote call via send_task; however, none of the logging carried out by the Celery task makes it back to the Airflow logs.
Initially, as a proof of concept (I am completely new to Airflow), I set it up to run the same code by exposing it to the Airflow containers and creating Airflow tasks to run it on the Airflow Celery worker container. This did capture all the logging, but it's definitely not how we want it architected, as it makes the Airflow containers very fat due to duplicating the dependencies of the Django project.
The documentation says "Most task handlers send logs upon completion of a task" but I wasn't able to find more detail that might give me a clue how to enable the same in my situation.
Is there any way to get these logs back to airflow when running the celery tasks remotely?
Instead of "returning the logs to Airflow", an easy-to-implement alternative (because Airflow natively supports it) is to activate remote logging. This way, all logs from all workers would end up e.g. on S3, and the webserver would automatically fetch them.
The following illustrates how to configure remote logging using an S3 backend. Other options (e.g. Google Cloud Storage, Elastic) can be implemented similarly.
Set remote_logging to True in airflow.cfg
Build an Airflow connection URI. This example from the official docs is particularly useful IMO. You should end up with something like the following (the sketch after these steps shows how the secret key gets percent-encoded):
aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
It is also possible to create the connection through the webserver GUI, if needed.
Make the connection URI available to Airflow. One way of doing so is to make sure that the environment variable AIRFLOW_CONN_{YOUR_CONNECTION_NAME} is available. Example for connection name REMOTE_LOGS_S3:
export AIRFLOW_CONN_REMOTE_LOGS_S3=aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
Set remote_log_conn_id to the connection name (e.g. REMOTE_LOGS_S3) in airflow.cfg
Set remote_base_log_folder in airflow.cfg to the desired bucket/prefix. Example:
remote_base_log_folder = s3://my_bucket_name/my/prefix
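As mentioned in the connection URI step above, the secret key (and the endpoint URL) must be percent-encoded when embedded in the URI. A minimal sketch of producing the value shown above; the keys are the well-known sample credentials from the AWS docs and the endpoint is an assumed LocalStack-style address:
from urllib.parse import quote

access_key = "AKIAIOSFODNN7EXAMPLE"                      # sample access key from the AWS docs
secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"  # sample secret key from the AWS docs
endpoint = quote("http://s3:4566/", safe="")             # percent-encodes to http%3A%2F%2Fs3%3A4566%2F

uri = f"aws://{access_key}:{quote(secret_key, safe='')}@/?endpoint_url={endpoint}"
print(uri)  # aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F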
This related SO question goes deeper into remote logging.
If debugging is needed, looking into any worker logs locally (i.e., inside the worker) should help.

Dataflow Pipeline Follows Notebook Execution Number. Can't Update Pipeline

I am trying to update my Dataflow pipeline. I like developing using Jupyter notebooks on Google Cloud. However, I've run into this error when trying to update:
"The new job is missing steps [5]: read/Read."
I understand the reason is that I re-ran some cells in my notebook and added some new ones, so instead of "[5]: read/Read" it is now "[23]: read/Read", but surely Dataflow doesn't need to care about the Jupyter notebook execution count. Is there some way to turn it off and just refer to the steps by their given names, without the numbers?
The Dataflow notebooks documentation recommends restarting the kernel and rerunning all cells to avoid that behavior:
"(Optional) Before using your notebook to run Dataflow jobs, restart the kernel, rerun all cells, and verify the output. If you skip this step, hidden states in the notebook might affect the job graph in the pipeline object."
Doing so keeps the execution numbers stable, so the step names still match when you update the job.
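If rerunning the whole notebook is impractical, Dataflow's update options also accept a mapping from old step names to new ones. This is not what the quoted docs recommend, just an alternative sketch with assumed project, region, and job names:
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values: map the old step name (from the earlier execution count)
# to the new one so the updated job graph can be matched against the running job.
options = PipelineOptions(
    project="my-project",          # assumed project id
    region="us-central1",          # assumed region
    job_name="my-streaming-job",   # must match the job being updated
    streaming=True,
    update=True,
    transform_name_mapping={"[5]: read": "[23]: read"},
)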

Trying to understand the difference between using Docker (Scheduler, Queue, Workers) vs. Docker (Airflow)

Please pardon me if I'm asking a very amateur question, but after reading multiple threads, posts, references, etc., I still do not understand the differences.
My current understanding:
1st method)
A traditional Docker setup is composed of three containers:
A scheduler that manages the schedule of the jobs
A queue that manages the queue of multiple jobs
A worker that does the work for each queued job
I read from this source: https://laravel-news.com/laravel-scheduler-queue-docker
2nd method)
Docker + Apache Airflow composes a single container that does the same as the three containers above:
A worker (Airflow: since in Airflow we can set up the scheduler and also the queue)
I watched this tutorial: https://www.youtube.com/watch?v=vvr_WNzEXBE&t=575s
I first learnt from these two sources (and others), but I am confused about the following:
Since I can use docker-compose to build all the services, all I need is just one container (2nd method) and then the scheduler that is already in Airflow to control the workflow, right? That would mean I do not need to create multiple containers as in the 1st method, which separates the tasks into different containers.
If the two are different, then what are the differences? I have tried to find out for days but still could not figure it out. I am sorry, I am new to this subject and am still studying it.
Thank you.
I think you are pointing at the multi-node vs. single-node Airflow comparison. A multi-node Airflow setup will give you more computing power and higher availability for your Apache Airflow instance. You can run everything (webserver, scheduler and worker) on one machine/Docker instance, but if your project grows you can build a cluster and scale your pipeline.
In fact, an Airflow instance can have a number of daemons (workers) that work together to provide the full functionality of Airflow.
With multiple Gunicorn workers you can take and execute more tasks from the queue, in parallel/concurrently. On a single machine (depending on your use case and the machine's cores) you can define this in {AIRFLOW_HOME}/airflow.cfg (for example, workers=6).
Since these daemons are independent of each other, people distribute them across multiple nodes/instances (Docker containers, in your case). So this is probably what you have seen.
Update:
About the tutorial links you shared:
As you asked in the comment section, the YouTube tutorial you pointed to is also using one Docker container where you run everything; you aren't using multi-node there.
As for your first link (about Laravel scheduling), I am not sure, but it seems like it also uses only one container.
How you link multiple nodes of Airflow in a multi-node setup:
As an example, you use the same external database instance (MySQL, Postgres) and all your nodes interact with it; similarly, they all take tasks from the same queue (possibly an external/shared RabbitMQ cluster).
What the scheduler is for, and how to execute it:
The scheduler is the component that actually schedules the DAGs, for example running them weekly/daily/monthly etc. as you have declared. In essence, you have only one scheduler, while it is the workers that you may need more of. That doesn't mean you can't have more; you may have two webservers etc., but then you need to handle the port difference and share metadata between them.
To run the scheduler, you just need to run airflow scheduler; once it is running successfully it will start picking up your DAGs and executing them. For the first run it will use start_date, and for subsequent runs it will use the schedule_interval that you have defined in the DAG (that's why it is called the scheduler).
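For illustration, a minimal DAG (the names and schedule are made up) showing the start_date and schedule_interval that the scheduler reads, as described above:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: the scheduler uses start_date for the first run and
# schedule_interval for every run after that.
with DAG(
    dag_id="example_weekly_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo hello")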

Port allocation when running build job in Jenkins

My project is structured in such a way that the build job in Jenkins is triggered by a push to Git. As part of my application logic, I spin up Kafka and Elasticsearch instances to be used in my test cases downstream.
The issue I have right now is that when a developer pushes his changes to Git, it triggers a build in Jenkins, which in turn runs our code and spawns a Kafka broker on localhost:9092 and Elasticsearch on localhost:9200.
When another developer, working on some other change simultaneously, pushes his code, it triggers the build job again and tries to spin up another instance of Kafka/Elasticsearch, but fails with the exception "Port already in use".
I am looking at options on how to handle this scenario.
Will running these instances inside a Docker container help to some extent? How do I handle the port issue in that case?
Yes, dockerizing these instances can indeed help, as you can spawn them multiple times.
You could create a Docker container per component, including your application, and then let them talk to each other by linking them or by using docker-compose.
That way you would not have to expose the ports to the "outside" world but keep them internal within the Docker environment, so you would not hit "Port already in use". The only remaining problem is memory: e.g., if 100 pushes are made to the Git repo, you might run out of memory...
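If the brokers do need to be published on the host after all (for example, because the tests run outside Docker), another option, not covered in the answer above, is to let the OS pick a free host port per build and pass it on to the containers and test code; a minimal sketch:
import socket

def free_port() -> int:
    """Ask the OS for a currently unused TCP port."""
    with socket.socket() as s:
        s.bind(("", 0))  # port 0 means "any free port"
        return s.getsockname()[1]

# Hypothetical usage: map these to the containers' 9092/9200 when starting them,
# e.g. docker run -p <kafka_port>:9092 ... and -p <es_port>:9200 ...
kafka_port, es_port = free_port(), free_port()
print(kafka_port, es_port)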

Resources