I have a Docker container which contains PySpark. It is working correctly, but I want to connect to it and launch a Spark script from another container. I created a small script to test the connection.
import pyspark
conf = pyspark.SparkConf().setAppName('MyApp').setMaster('spark://myspark:7077')
sc = pyspark.SparkContext(conf=conf)
However, I get the error RuntimeError: Java gateway process exited before sending its port number.
This is probably because I don't have Java installed in my other container. Installing it there is not an option. Is there a way to connect and execute a PySpark script without Java?
I encountered an issue and I'm not sure whether any of you have encountered it before. I tried to start a Linux-based container job hosted on Windows Server 2019. The host machine has Docker EE installed and was able to run the container.
However, when I tried to trigger the Azure Pipeline to run the job on the self-hosted machine, it showed the following error:
Failed to create network
It appears that the agent failed to create the network using the default driver (bridge) before starting up the container, because the self-hosted server is Windows Server: Windows uses the NAT driver, whereas Docker's default is the bridge driver.
I wonder whether it is possible to override the driver with the NAT driver in azure-pipelines? I tried the following method, but it did not seem to override it.
azure-pipelines
Or, is there an alternative way to stop the agent from creating the network before starting the container?
Or, is there an alternative way to run a Linux container on Windows Server?
I'm starting to use PySpark. I'm wondering how to use containerization with PySpark.
I would like to isolate my python application and dependencies in a container.
Can I place my Python application within a container and give the image directly to a Spark cluster? Will it be able to do its work, distributing the image to the workers and then distributing the work across the multiple "containers"?
For developing Spark applications in a container, you could use:
jupyter/pyspark-notebook. There's also the Spark UI, which you can access on port 4040. More information here: Jupyter Apache Spark
Or the AWS Glue Docker image: amazon/aws-glue-libs:glue_libs_1.0.0_image_01. More information on how to set it up: Developing AWS Glue ETL jobs locally using a container
When your application is ready, you can just submit it to the cluster. I'm not sure why Docker is needed at that point.
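For illustration, a minimal sketch of what such a submission-ready entry point could look like (the file name main.py and the trivial job are assumptions for the example, not something from the question):

# main.py - a minimal sketch of a PySpark entry point that could be baked into
# an image and submitted to a cluster. No master URL is hard-coded: spark-submit
# (or the cluster manager) supplies it, so the same script also runs locally.
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("containerized-app").getOrCreate()
    # Trivial job just to prove the executors are doing work.
    count = spark.range(1_000_000).filter("id % 2 = 0").count()
    print(f"even numbers: {count}")
    spark.stop()

if __name__ == "__main__":
    main()

You would then submit it with something like spark-submit --master spark://<your-master-host>:7077 main.py, adjusting the master URL to your cluster.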
I am trying to export an application's metrics using the JMX exporter. Basically, I added the Java agent to the JVM parameters so it runs as an agent and configured it to expose localhost:5555. Finally, I created a Docker container for the application.
The application runs on a remote machine. If it were running locally, I could check localhost:5555/metrics and see whether the metrics are exported. But in my case the app runs in a container on a remote machine. So how can I check whether metrics are being exported? (Prometheus has not been configured yet, so I cannot check there.)
As long as the container publishes port 5555 to a port on its host (let's assume the same port 5555, i.e. it's running with something of the form docker run ... --publish=5555:5555 ...), and you can access the host machine, you can curl (or browse) the endpoint:
REMOTE_HOST=...
curl http://${REMOTE_HOST}:5555/metrics
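If it's more convenient to run the check as a script, the same probe can be done from Python using only the standard library; this is just a sketch, and remote-host.example is a placeholder for the machine publishing port 5555:

# check_metrics.py - fetch the exporter endpoint and print a few lines as a sanity check.
import urllib.request

REMOTE_HOST = "remote-host.example"  # placeholder: the host that publishes port 5555
url = f"http://{REMOTE_HOST}:5555/metrics"

with urllib.request.urlopen(url, timeout=5) as resp:
    body = resp.read().decode("utf-8")

# The Prometheus exposition format is plain text, so the first lines are enough to confirm it works.
print("\n".join(body.splitlines()[:10]))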
I am running Dask Gateway in a Kubernetes namespace. I am able to connect to the Gateway using the following code when not running in a Docker container.
from dask.distributed import Client
from dask_gateway import Gateway
gateway = Gateway('http://[redacted traefik ip]')
cluster = gateway.new_cluster()
However, when I run the same code from a Docker container, I get this warning after gateway.new_cluster().
distributed.comm.tcp - WARNING - Closing dangling stream in <TLS local=tls://[local ip]:51060 remote=gateway://[redacted ip]:80/dask-gateway.e71c345decde470e8f9a23c3d5a64956>
What is the cause of this? I have also tried running this with --net=host on the Docker container, but that resulted in the same error.
Additional Info: This doesn't appear to be a Docker networking issue... I am able to use the Coiled clusters from within a Docker container, but not the Dask-Gateway clusters...
It appears that the initial outgoing connection from the Docker container to the traefik pod succeeds, and a dask-scheduler is successfully spun up in the cluster. However, the connection then drops (a timeout?), which prevents further interaction.
I wrote a docker-compose.yaml for two services, MySQL and JasperReports, along with a Dockerfile for each service. When I run docker-compose up -d, it first builds the images for both services and then runs the containers based on the declared dependencies. However, I need the MySQL image to be built and its container started first, and only after that should the JasperReports image be built and its container started, because the JasperReports server uses the MySQL host and port. Is this possible with docker-compose? How could I achieve this?
As explained in the docker-compose docs in Control startup and shutdown order in Compose, you may write a custom shell script that waits for mysql to be ready (i.e. accepting connections) before starting the jasper service.
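A minimal sketch of such a wait step, written here in Python instead of shell; the host name mysql and port 3306 are assumptions based on the usual Compose service name and MySQL's default port:

# wait_for_mysql.py - poll the MySQL port until it accepts TCP connections, then exit 0.
import socket
import sys
import time

def wait_for(host, port, timeout=60.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False

if __name__ == "__main__":
    sys.exit(0 if wait_for("mysql", 3306) else 1)

The jasper service's entrypoint can run this script (or an equivalent wait-for-it.sh) before launching the server, so JasperReports only starts once MySQL is accepting connections.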