I am trying to get a docker-compose file working with both Airflow and Spark. Airflow typically maps 8080:8080, and Spark needs port 8080 as well. I have the following docker-compose file:
version: '3.7'
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker:/conf
      - ./data:/tmp/data
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    logging:
      options:
        max-size: 10m
        max-file: "3"
  webserver:
    image: puckel/docker-airflow:1.10.9
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=y
      - EXECUTOR=Local
    logging:
      options:
        max-size: 10m
        max-file: "3"
    volumes:
      - ./dags:/usr/local/airflow/dags
      # Add this to have third party packages
      - ./requirements.txt:/requirements.txt
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8082:8080" # NEED TO CHANGE THIS LINE
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
but I specifically need to change the line:
    ports:
      - "8082:8080" # NEED TO CHANGE THIS LINE
under the webserver service so there is no port conflict. However, when I change the container port to something other than 8080:8080, it doesn't work (I cannot connect to or find the server). How do I change the container port successfully?
When you specify a port mapping in Docker you give two ports, for instance 8082:8080.
The right-hand port is the one the service listens on inside the container.
You can have several containers that listen internally on the same port. They are still not reachable from your localhost - that is what the ports section is for.
On your localhost, however, you cannot bind the same port more than once. That is why Docker fails when you put 8080 on the left side more than one time.
In your current compose file, the spark service is mapped to host port 8080 (left side of 8080:8080) and the webserver service is mapped to host port 8082 (left side of 8082:8080).
If you want to access Spark, go to http://localhost:8080, and for the Airflow webserver, go to http://localhost:8082.
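To make that concrete, here is a minimal sketch of just the relevant ports sections (service names and port values taken from your compose file; everything else omitted):

services:
  master:
    ports:
      - "8080:8080"   # host 8080 -> container 8080 (Spark master UI)
  webserver:
    ports:
      - "8082:8080"   # host 8082 -> container 8080 (Airflow webserver), no conflict on the host

Both containers keep listening on 8080 internally; only the host-side ports differ.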
Related
I'm trying to figure out how to run a Redis cluster using Docker Compose.
However, I'm getting the following error:
Error response from daemon: Ports are not available: exposing port TCP 0.0.0.0:6380 -> 0.0.0.0:0: listen tcp 0.0.0.0:6380: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
version: '3.9'
services:
  dynamodb-local:
    container_name: dynamodb-local
    image: amazon/dynamodb-local:latest
    command: -jar DynamoDBLocal.jar -sharedDb -dbPath ./data
    ports:
      - 8000:8000
    volumes:
      - ./docker/dynamodb:/home/dynamodblocal/data
    working_dir: /home/dynamodblocal
    networks:
      - webnet
  redis-master: # Setting up master node
    image: bitnami/redis:latest
    ports:
      - 6379:6379
    environment:
      - REDIS_REPLICATION_MODE=master
      - REDIS_PASSWORD=my_master_password
    volumes:
      - ./docker/redis:/bitnami/redis/data # Redis master data volume
      - ./docker/redis/conf/redis.conf:/opt/bitnami/redis/conf/redis.conf # Redis master configuration volume
    networks:
      - webnet
  redis-replica:
    image: bitnami/redis:latest
    ports:
      - 6380-6382:6379
    depends_on:
      - redis-master
    environment:
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_PASSWORD=my_master_password
      - REDIS_PASSWORD=my_replica_password
    deploy:
      replicas: 3
    networks:
      - webnet
networks:
  webnet:
    driver: bridge
I am fairly new to Docker and am trying to get a docker-compose file running with both Airflow and PySpark. Below is what I have so far:
version: '3.7'
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker:/conf
      - ./data:/tmp/data
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    logging:
      options:
        max-size: 10m
        max-file: "3"
  webserver:
    image: puckel/docker-airflow:1.10.9
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=y
      - EXECUTOR=Local
    logging:
      options:
        max-size: 10m
        max-file: "3"
    volumes:
      - ./dags:/usr/local/airflow/dags
      # Add this to have third party packages
      - ./requirements.txt:/requirements.txt
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8082:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
And I am trying to run the following simple DAG just to confirm pyspark is operating correctly:
import pyspark
from airflow.models import DAG
from airflow.utils.dates import days_ago, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
import random

args = {
    "owner": "ian",
    "start_date": days_ago(1)
}

dag = DAG(dag_id="pysparkTest", default_args=args, schedule_interval=None)

def run_this_func(**context):
    sc = pyspark.SparkContext()
    print(sc)

with dag:
    run_this_task = PythonOperator(
        task_id='run_this',
        python_callable=run_this_func,
        provide_context=True,
        retries=10,
        retry_delay=timedelta(seconds=1)
    )
When I do this, it fails with the error Java gateway process exited before sending its port number. I have found several posts that say to run export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell", which I have tried to add to the master's command like so:
version: '3.7'
services:
  master:
    image: gettyimages/spark
    command: >
      sh -c "bin/spark-class org.apache.spark.deploy.master.Master -h master
      && export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell""
    hostname: master
    ...
But I still get the same error. Any ideas what I am doing wrong?
I don't think you need to modify the master's command. Leave it as they did here.
In addition, how do you expect the Python code, which runs in a different container, to connect to the master container? I think you should pass the master URL to the SparkContext, something like:
def run_this_func(**context):
    sc = pyspark.SparkContext("spark://master:7077")
    print(sc)
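If you also want to keep the PYSPARK_SUBMIT_ARGS approach, note that it has to be set in the container that actually runs the DAG (the Airflow webserver), not in the Spark master's command. A rough sketch of where it would go, assuming you want the task to submit against the compose master rather than local[2] (the exact value is an assumption, adjust to your setup):

  webserver:
    image: puckel/docker-airflow:1.10.9
    environment:
      - LOAD_EX=y
      - EXECUTOR=Local
      # assumption: point pyspark at the compose master service instead of local[2]
      - PYSPARK_SUBMIT_ARGS=--master spark://master:7077 pyspark-shell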
I have airflow running locally on port 8080 with the following docker-compose.yaml:
version: '3.7'
services:
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    logging:
      options:
        max-size: 10m
        max-file: "3"
  webserver:
    image: puckel/docker-airflow:1.10.9
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=y
      - EXECUTOR=Local
    logging:
      options:
        max-size: 10m
        max-file: "3"
    volumes:
      - ./dags:/usr/local/airflow/dags
      # Add this to have third party packages
      - ./requirements.txt:/requirements.txt
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
However, I need port 8080 for another process. I tried updating the mapping to both "8080:8081" and "8081:8081", but neither worked; the server would not respond. "8080:8080", however, works like a charm. What am I missing here?
I think you missed the only correct option. The syntax for ports is:
host:container
so in your case
8081:8080
should technically work, assuming of course that Airflow runs on port 8080 inside the container and has that port exposed (which it seems to, according to the Dockerfile).
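For reference, a minimal sketch of that mapping in the webserver service:

  webserver:
    ports:
      - "8081:8080"   # browse to http://localhost:8081; Airflow still listens on 8080 inside the container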
You could change the HTTP port via the webserver command and update the Docker port mapping accordingly. In the YAML:
    ...
    command: webserver -p 9999
    ports:
      - "9999:9999"
Sometimes this comes in handy when you need the container to listen directly on a specific port and want to avoid port mappings, for example in scenarios where you are also changing the network configuration with network_mode: host in the YAML.
In that case you can use just command: webserver -p 9999 and get rid of the ports: section completely.
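A rough sketch of that host-networking variant (assuming host networking is really what you want; with network_mode: host, published ports are ignored anyway):

  webserver:
    image: puckel/docker-airflow:1.10.9
    network_mode: host
    command: webserver -p 9999   # Airflow binds directly to port 9999 on the host
    # no ports: section needed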
I am trying to create a distributed Spark cluster with only one worker using this docker-compose file:
master:
  image: gettyimages/spark:2.0.0-hadoop-2.7
  command: bin/spark-class org.apache.spark.deploy.master.Master -h master
  hostname: master
  container_name: spark-master
  environment:
    SPARK_CONF_DIR: /conf
    SPARK_PUBLIC_DNS: <MASTER IP>
  expose:
    - 7001
    - 7002
    - 7003
    - 7004
    - 7005
    - 7077
    - 6066
  ports:
    - 4040:4040
    - 6066:6066
    - 7077:7077
    - 8080:8080
  volumes:
    - ./conf/master:/conf
    - ./data:/tmp/data
    - ~/spark/data/:/spark/data/
worker:
  image: gettyimages/spark:2.0.0-hadoop-2.7
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
  hostname: worker
  container_name: spark-worker
  environment:
    SPARK_CONF_DIR: /conf
    SPARK_WORKER_CORES: 2
    SPARK_WORKER_MEMORY: 1g
    SPARK_WORKER_PORT: 8881
    SPARK_WORKER_WEBUI_PORT: 8081
    SPARK_PUBLIC_DNS: <WORKER IP>
  links:
    - master
  expose:
    - 7012
    - 7013
    - 7014
    - 7015
    - 8881
  ports:
    - 8081:8081
  volumes:
    - ./conf/worker:/conf
    - ./data:/tmp/data
    - ~/apps/sparkapp/worker/data:/spark/data/
But the problem is that the Docker daemon creates both containers on the same machine, which takes away the whole point of having a distributed network. How can I create a distributed Spark cluster using Docker?
If the problem is the same ports for the Spark workers, you actually have two options:
1. Do not publish the workers' ports at all - the workers do not need them to connect to the master and do their work. This may not be convenient, though, because you cannot reach the workers' web UI.
2. Use the range syntax, like "8081-8999:8081", so that each additional worker started with docker-compose up --scale worker=2 is published on a different host port (see the sketch below).
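A minimal sketch of option 2 applied to your worker service (only the changed parts shown; container_name has to be removed, since a service with a fixed container name cannot be scaled):

worker:
  image: gettyimages/spark:2.0.0-hadoop-2.7
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
  ports:
    - "8081-8999:8081"   # each scaled worker is published on the next free host port in the range

Then start it with: docker-compose up --scale worker=2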
Docker is running and I want to run a container on Windows 10. When I run docker-compose from Windows PowerShell, some of the download jobs complete, then an error occurs and the container cannot run. It seems that the jupyter service fails to build or to open a directory. Could anyone help me with this problem? The command and the error are as follows:
PS C:\Users\mmva> cd C:\Users\mmva\Documents\GitHub\CerebralCortex-DockerCompose
PS C:\Users\mmva\Documents\GitHub\CerebralCortex-DockerCompose> docker-compose up
Building jupyter
Step 1/19 : FROM jupyter/jupyterhub
latest: Pulling from jupyter/jupyterhub
efd26ecc9548: Extracting [==================================================>] 51.34MB/51.34MB
a3ed95caeb02: Download complete
298ffe4c3e52: Download complete
758b472747c8: Download complete
8b9809a68afc: Download complete
93b253b5483d: Download complete
ef8136abb53c: Download complete
ERROR: Service 'jupyter' failed to build: failed to register layer: re-exec error: exit status 1: output: Failed to OpenForBackup failed in Win32: open \\?\C:\ProgramData\Docker\windowsfilter\eb9ac9d604f051d5490a876043809e7929197356387569bc50a3694b77d1b721\usr\share\man\man3\Locale::gettext.3pm.gz: The filename, directory name, or volume label syntax is incorrect. (0x1f) \\?\C:\ProgramData\Docker\windowsfilter\eb9ac9d604f051d5490a876043809e7929197356387569bc50a3694b77d1b721\usr\share\man\man3\Locale::gettext.3pm.gz
My Docker version is 17.09.0-ce-win33 (13620).
I think the docker-compose file format version is 3.
The content of the docker-compose file:
version: '3'
# IPTABLES RULES IF NECESSARY
#-A INPUT -i br+ -j ACCEPT
#-A INPUT -i docker0 -j ACCEPT
#-A OUTPUT -o br+ -j ACCEPT
#-A OUTPUT -o docker0 -j ACCEPT
# The .env file is for production use with server-specific configurations
services:
  # Frontend web proxy for accessing services and providing TLS encryption
  nginx:
    build: ./nginx
    container_name: md2k-nginx
    restart: always
    volumes:
      - ./nginx/site:/var/www
      - ./nginx/nginx-selfsigned.crt:/etc/ssh/certs/ssl-cert.crt
      - ./nginx/nginx-selfsigned.key:/etc/ssh/certs/ssl-cert.key
    ports:
      - "443:443"
      - "80:80"
    links:
      - apiserver
      - grafana
      - jupyter
  apiserver:
    build: ../CerebralCortex-APIServer
    container_name: md2k-api-server
    restart: always
    expose:
      - 80
    links:
      - mysql
      - kafka
      - minio
    depends_on:
      - mysql
    environment:
      - MINIO_HOST=${MINIO_HOST:-minio}
      - MINIO_ACCESS_KEY=${MINIO_ACCESS_KEY:-ZngmrLWgbSfZUvgocyeH}
      - MINIO_SECRET_KEY=${MINIO_SECRET_KEY:-IwUnI5w0f5Hf1v2qVwcr}
      - MYSQL_HOST=${MYSQL:-mysql}
      - MYSQL_DB_USER=${MYSQL_ROOT_USER:-root}
      - MYSQL_DB_PASS=${MYSQL_ROOT_PASSWORD:-random_root_password}
      - KAFKA_HOST=${KAFKA_HOST:-kafka}
      - JWT_SECRET_KEY=${MINIO_SECRET_KEY:-IwUnI5w0f5Hf1v2qVwcr}
      - FLASK_HOST=${FLASK_HOST:-0.0.0.0}
      - FLASK_PORT=${FLASK_PORT:-80}
      - FLASK_DEBUG=${FLASK_DEBUG:-False}
    volumes:
      - ./data:/data
  # Data vizualizations
  grafana:
    image: "grafana/grafana"
    container_name: md2k-grafana
    restart: always
    ports:
      - "3000:3000"
    links:
      - influxdb
    environment:
      - GF_SERVER_ROOT_URL=%(protocol)s://%(domain)s:%(http_port)s/grafana/
      # - GF_INSTALL_PLUGINS=raintank-worldping-app,grafana-clock-panel,grafana-simple-json-datasource
    volumes:
      - timeseries-storage:/var/lib/grafana
      # - timeseries-storage:/etc/grafana
  influxdb:
    image: "influxdb:alpine"
    container_name: md2k-influxdb
    restart: always
    ports:
      - "8086:8086"
    volumes:
      - timeseries-storage:/var/lib/influxdb
  # Data Science Dashboard Interface
  jupyter:
    build: ./jupyterhub
    container_name: md2k-jupyterhub
    ports:
      - 8000
    restart: always
    network_mode: "host"
    pid: "host"
    environment:
      TINI_SUBREAPER: 'true'
    volumes:
      - ./jupyterhub/conf:/srv/jupyterhub/conf
    command: jupyterhub --no-ssl --config /srv/jupyterhub/conf/jupyterhub_config.py
  # Cerebral Cortex backend
  kafka:
    image: wurstmeister/kafka:0.10.2.0
    container_name: md2k-kafka
    restart: always
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: ${MACHINE_IP:-10.0.0.1}
      KAFKA_ADVERTISED_PORT: 9092
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_MESSAGE_MAX_BYTES: 2000000
      KAFKA_CREATE_TOPICS: "filequeue:4:1,processed_stream:16:1"
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true'
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - data-storage:/kafka
    depends_on:
      - zookeeper
  zookeeper:
    image: wurstmeister/zookeeper
    container_name: md2k-zookeeper
    restart: always
    ports:
      - "2181:2181"
  mysql:
    image: "mysql:5.7"
    container_name: md2k-mysql
    restart: always
    ports:
      - 3306:3306 # Default mysql port
    environment:
      - MYSQL_ROOT_PASSWORD=${MYSQL_ROOT_PASSWORD:-random_root_password}
      - MYSQL_DATABASE=${MYSQL_DATABASE:-cerebralcortex}
      - MYSQL_USER=${MYSQL_USER:-cerebralcortex}
      - MYSQL_PASSWORD=${MYSQL_PASSWORD:-cerebralcortex_pass}
    volumes:
      - ./mysql/initdb.d:/docker-entrypoint-initdb.d
      - metadata-storage:/var/lib/mysql
  minio:
    image: "minio/minio"
    container_name: md2k-minio
    restart: always
    ports:
      - 9000:9000 # Default minio port
    environment:
      - MINIO_ACCESS_KEY=${MINIO_ACCESS_KEY:-ZngmrLWgbSfZUvgocyeH}
      - MINIO_SECRET_KEY=${MINIO_SECRET_KEY:-IwUnI5w0f5Hf1v2qVwcr}
    command: server /export
    volumes:
      - object-storage:/export
  cassandra:
    build: ./cassandra
    container_name: md2k-cassandra
    restart: always
    ports:
      - 9160:9160 # Thrift client API
      - 9042:9042 # CQL native transport
    environment:
      - CASSANDRA_CLUSTER_NAME=cerebralcortex
    volumes:
      - data-storage:/var/lib/cassandra
volumes:
  object-storage:
  metadata-storage:
  data-storage:
  temp-storage:
  timeseries-storage:
  user-storage:
  log-storage: