How to put a file in HDFS using Airflow? - docker

I need to put a file into HDFS from an Airflow DAG task.
I have installed Docker and, inside it, Airflow, a namenode, a datanode, a resourcemanager, etc.
By SSHing into the namenode I am able to put a file into the HDFS cluster, but I want to do it from an Airflow DAG task so I can orchestrate everything in a pipeline.
Can anyone help me put and get files from HDFS using Airflow DAG tasks?
Below is my docker-compose file:
version: "3"
services:
postgres: # create postgres container
image: postgres:9.6
container_name: postgres_container
environment:
- POSTGRES_USER=airflow
- POSTGRES_PASSWORD=airflow
- POSTGRES_DB=airflow
airflow: # create airflow container
build: './airflow_docker'
container_name: airflow_container
restart: always
depends_on:
- postgres
environment:
- LOAD_EX=n
- EXECUTOR=Local
volumes: # mount the following local folders
- ./dags:/usr/local/airflow/dags
- ./data:/usr/local/airflow/data
ports:
- "8080:8080" # expose port
command: webserver
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
namenode:
image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
container_name: namenode
restart: always
ports:
- 9870:9870
- 9000:9000
volumes:
- hadoop_namenode:/hadoop/dfs/name
environment:
- CLUSTER_NAME=test
env_file:
- ./hadoop.env
datanode:
image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
container_name: datanode
restart: always
volumes:
- hadoop_datanode:/hadoop/dfs/data
environment:
SERVICE_PRECONDITION: "namenode:9870"
env_file:
- ./hadoop.env
resourcemanager:
image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8
container_name: resourcemanager
restart: always
environment:
SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864"
env_file:
- ./hadoop.env
nodemanager1:
image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8
container_name: nodemanager
restart: always
environment:
SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088"
env_file:
- ./hadoop.env
historyserver:
image: bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8
container_name: historyserver
restart: always
environment:
SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088"
volumes:
- hadoop_historyserver:/hadoop/yarn/timeline
env_file:
- ./hadoop.env
volumes:
hadoop_namenode:
hadoop_datanode:
hadoop_historyserver:
And this is the hadoop.env file
CORE_CONF_fs_defaultFS=hdfs://namenode:9000
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_scheduler_class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_mapreduce_map_output_compress=true
YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec
YARN_CONF_yarn_nodemanager_resource_memory___mb=16384
YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8
YARN_CONF_yarn_nodemanager_disk___health___checker_max___disk___utilization___per___disk___percentage=98.5
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle
MAPRED_CONF_mapreduce_framework_name=yarn
MAPRED_CONF_mapred_child_java_opts=-Xmx4096m
MAPRED_CONF_mapreduce_map_memory_mb=4096
MAPRED_CONF_mapreduce_reduce_memory_mb=8192
MAPRED_CONF_mapreduce_map_java_opts=-Xmx3072m
MAPRED_CONF_mapreduce_reduce_java_opts=-Xmx6144m
MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/

Wrap your HDFS commands/operations inside a bash/shell script and call it from the DAG using a BashOperator. Before putting/getting an HDFS file, if you want to check whether it exists, use the Airflow HDFS sensors such as HdfsSensor, HdfsFolderSensor and HdfsRegexSensor. Please note that Airflow is a workflow-management/data-pipeline-orchestration tool, not a data ingestion/ETL tool.
from datetime import datetime
from airflow.operators.bash_operator import BashOperator

hdfs_operations_task = BashOperator(
    task_id='hdfs_operations_task',
    start_date=datetime(2020, 9, 9, 10, 0, 0, 0),
    # the trailing space stops Airflow from treating the .sh path as a Jinja template
    bash_command="/hdfs_operations.sh ",
    dag=dag)
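
If the Airflow container does not have the Hadoop client binaries on its PATH, an alternative is to go through WebHDFS, which your hadoop.env already enables (HDFS_CONF_dfs_webhdfs_enabled=true). Below is a minimal sketch, assuming the hdfs PyPI package (HdfsCLI) is installed in the Airflow container; the HDFS paths and the root user are placeholders. A WebHDFS write is redirected from the namenode to a datanode by hostname, which works here because all services sit on the same compose network.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from hdfs import InsecureClient  # pip install hdfs (HdfsCLI)


def put_file():
    # WebHDFS endpoint exposed by the namenode container (port 9870 in Hadoop 3.x)
    client = InsecureClient('http://namenode:9870', user='root')
    # push a file from the mounted ./data folder into HDFS
    client.upload('/user/root/input.csv', '/usr/local/airflow/data/input.csv', overwrite=True)


def get_file():
    client = InsecureClient('http://namenode:9870', user='root')
    # pull the file back out of HDFS
    client.download('/user/root/input.csv', '/usr/local/airflow/data/output.csv', overwrite=True)


dag = DAG('hdfs_put_get', start_date=datetime(2020, 9, 9), schedule_interval=None)

put_task = PythonOperator(task_id='put_file', python_callable=put_file, dag=dag)
get_task = PythonOperator(task_id='get_file', python_callable=get_file, dag=dag)

put_task >> get_task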

Related

Wrong HDFS Configured Capacity in Docker Stack

I'm using a Docker stack that implements, on the same machine, a Hadoop Namenode, two Datanodes, two Node Managers, a Resource Manager, a History Server, and other technologies.
I encountered an issue with the HDFS Configured Capacity that is shown in the HDFS UI.
My machine has 256GB of capacity, and I'm using the two-datanode setup mentioned above. Instead of distributing the total capacity between the two nodes, HDFS duplicates the capacity of the entire machine by reporting 226.87GB for each datanode.
Any thoughts on how to make HDFS show the right capacity?
Here is the portion of the docker-compose file that implements the hadoop technologies mentioned above.
services:
  # Hadoop master
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    container_name: namenode
    ports:
      - 9870:9870
      - 8020:8020
    volumes:
      - ./namenode/home/${ADMIN_NAME:?err}:/home/${ADMIN_NAME:?err}
      - ./namenode/hadoop-data:/hadoop-data
      - ./namenode/entrypoint.sh:/entrypoint.sh
      - hadoop-namenode:/hadoop/dfs/name
    env_file:
      - ./hadoop.env
      - .env
    networks:
      - hadoop

  resourcemanager:
    restart: always
    image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8
    container_name: resourcemanager
    ports:
      - 8088:8088
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode1:9864"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop

  # Hadoop slave 1
  datanode1:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
    container_name: datanode1
    volumes:
      - hadoop-datanode-1:/hadoop/dfs/data
    environment:
      SERVICE_PRECONDITION: "namenode:9870"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop

  nodemanager1:
    image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8
    container_name: nodemanager1
    volumes:
      - ./nodemanagers/entrypoint.sh:/entrypoint.sh
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode1:9864 resourcemanager:8088"
    env_file:
      - ./hadoop.env
      - .env
    networks:
      - hadoop

  # Hadoop slave 2
  datanode2:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
    container_name: datanode2
    volumes:
      - hadoop-datanode-2:/hadoop/dfs/data
    environment:
      SERVICE_PRECONDITION: "namenode:9870"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop

  nodemanager2:
    image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8
    container_name: nodemanager2
    volumes:
      - ./nodemanagers/entrypoint.sh:/entrypoint.sh
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode2:9864 resourcemanager:8088"
    env_file:
      - ./hadoop.env
      - .env
    networks:
      - hadoop

  historyserver:
    image: bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8
    container_name: historyserver
    ports:
      - 8188:8188
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode1:9864 datanode2:9864 resourcemanager:8088"
    volumes:
      - hadoop-historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
    networks:
      - hadoop
You will need to create the docker volumes with a defined size that fits on your machine and then have each DN use those volumes. When a DN inspects the size of its volumes, it will then see the size of the volume rather than the capacity of your entire machine, and use that as its capacity.
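
For example (a sketch, untested; the /var/hadoop paths and the 100G size are placeholders), you could back each datanode volume with a fixed-size ext4 image file and let the local volume driver loop-mount it:

# on the host, create one fixed-size image file per datanode first:
#   truncate -s 100G /var/hadoop/dn1.img && mkfs.ext4 -F /var/hadoop/dn1.img
#   truncate -s 100G /var/hadoop/dn2.img && mkfs.ext4 -F /var/hadoop/dn2.img
volumes:
  hadoop-datanode-1:
    driver: local
    driver_opts:
      type: ext4
      device: /var/hadoop/dn1.img
      o: loop
  hadoop-datanode-2:
    driver: local
    driver_opts:
      type: ext4
      device: /var/hadoop/dn2.img
      o: loop

Each DN then sees a 100GB filesystem instead of the whole 256GB disk, so the Configured Capacity in the HDFS UI should add up accordingly.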

Docker image names are changed to sha256 after subsequent runs

I am using Docker 20.10.7 on macOS and docker-compose to run multiple docker containers.
When I start it for the first time, all the docker images are properly labeled with their names.
However, after subsequent runs (docker-compose up, docker-compose down), all the image names are suddenly changed to sha256 digests.
Please advise how to avoid this behavior. Thank you.
UPDATE #1
This is the docker-compose file I use to start the containers.
Initially they displayed with properly labeled image names.
However, even after running a docker system prune command, they continue to be labeled as sha256:...
version: '3.8'
services:
  influxdb:
    image: influxdb:1.8
    container_name: influxdb
    ports:
      - "8083:8083"
      - "8086:8086"
      - "8090:8090"
      - "2003:2003"
    env_file:
      - 'env.influxdb.properties'
    volumes:
      - /Users/user1/Docker/influxdb/data:/var/lib/influxdb
    restart: unless-stopped

  telegraf:
    image: telegraf:latest
    container_name: telegraf
    links:
      - db
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    env_file:
      - 'env.grafana.properties'
    links:
      - influxdb
    volumes:
      - /Users/user1/Docker/grafana/data:/var/lib/grafana
    restart: unless-stopped

  db:
    image: mysql
    container_name: db-container
    command: --default-authentication-plugin=mysql_native_password
    ports:
      - '3306:3306'
    environment:
      MYSQL_ROOT_PASSWORD: P#ssw0rd
      MYSQL_USER: root
      MYSQL_PASSWORD: P#ssw0rd
      MYSQL_DATABASE: db1
    volumes:
      - /Users/user1/Docker/mysql/data:/var/lib/mysql
      - "../sql/schema.sql:/docker-entrypoint-initdb.d/1.sql"
    healthcheck:
      test: "/usr/bin/mysql --user=root --password=P#ssw0rd --execute \"SHOW DATABASES;\""
      interval: 2s
      timeout: 20s
      retries: 10
    restart: always

  adminer:
    image: adminer
    container_name: adminer
    restart: always
    ports:
      - 8081:8080

  redis:
    image: bitnami/redis
    container_name: redis
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
      #- REDIS_DISABLE_COMMANDS=FLUSHDB,FLUSHALL
    ports:
      - '6379:6379'
    volumes:
      - '/Users/user1/Docker/redis/data:/bitnami/redis/data'
      - ./redis.conf:/opt/bitnami/redis/mounted-etc/redis.conf

Attribute Error in CeleryBeat Due to DatabaseScheduler

I am trying to use celery for asynchronous jobs, and I am using celery, docker and digitalocean.
My docker-compose file contains the lines depicted below; as you can see, there is a celery beat part.
The celery beat part specifies "django_celery_beat.schedulers:DatabaseScheduler", and as far as I understand, it cannot find django_celery_beat.schedulers:DatabaseScheduler. I cannot work out how to solve the problem.
version: '3.3'
services:
  web:
    build: .
    image: proje
    command: gunicorn -b 0.0.0.0:8000 proje.wsgi -w 4 --timeout 300 -t 80
    restart: unless-stopped
    tty: true
    env_file:
      - ./.env.production
    networks:
      - app-network
    depends_on:
      - migration
      - database
      - redis
    healthcheck:
      test: ["CMD", "wget", "http://localhost/healthcheck"]
      interval: 3s
      timeout: 3s
      retries: 10

  celery:
    image: proje
    command: celery -A proje worker -l info -n worker1#%%h
    restart: unless-stopped
    networks:
      - app-network
    environment:
      - DJANGO_SETTINGS_MODULE=proje.settings
    env_file:
      - ./.env.production
    depends_on:
      - redis

  celerybeat:
    image: proje
    command: celery -A proje beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler
    restart: unless-stopped
    networks:
      - app-network
    environment:
      - DJANGO_SETTINGS_MODULE=proje.settings
    env_file:
      - ./.env.production
    depends_on:
      - redis

  migration:
    image: proje
    command: python manage.py migrate
    volumes:
      - .:/usr/src/app/
    env_file:
      - ./.env.production
    depends_on:
      - database
    networks:
      - app-network

  webserver:
    image: nginx:alpine
    container_name: webserver
    restart: unless-stopped
    tty: true
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./static/:/var/www/static/
      - ./conf/nginx/:/etc/nginx/conf.d/
      - webserver-logs:/var/log/nginx/
    networks:
      - app-network

  database:
    image: "postgres:12" # use latest official postgres version
    restart: unless-stopped
    env_file:
      - .databaseenv # configure postgres
    ports:
      - "5432:5432"
    volumes:
      - database-data:/var/lib/postgresql/data/
    networks:
      - app-network

  redis:
    image: "redis:5.0.8"
    restart: unless-stopped
    command: [ "redis-server", "/redis.conf" ]
    working_dir: /var/lib/redis
    ports:
      - "6379:6379"
    volumes:
      - ./conf/redis/redis.conf:/redis.conf
      - redis-data:/var/lib/redis/
    networks:
      - app-network

# Docker networks
networks:
  app-network:
    driver: bridge

volumes:
  database-data:
  webserver-logs:
  redis-data:
And it gives me the result depicted below. I have been stuck on this in my project for months.
Any help will be appreciated.
I have uploaded all of this to an Ubuntu server and it worked. I think my computer (Win 10) has some incompatibility.
Thanks.

Creating spark cluster with drone.yml not working

I have a docker-compose.yml with the image and configuration below:
version: '3'
services:
  spark-master:
    image: bde2020/spark-master:2.4.4-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark

  spark-worker-1:
    image: bde2020/spark-worker:2.4.4-hadoop2.7
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
Here is the docker-compose up log ---> https://jpst.it/1Xc4K
The containers come up and run fine; the spark worker connects to the spark master without any issues. The problem is that I created a drone.yml and added a services section with:
services:
  jce-cassandra:
    image: cassandra:3.0
    ports:
      - "9042:9042"

  jce-elastic:
    image: elasticsearch:5.6.16-alpine
    ports:
      - "9200:9200"
    environment:
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"

  janusgraph:
    image: janusgraph/janusgraph:latest
    ports:
      - "8182:8182"
    environment:
      JANUS_PROPS_TEMPLATE: cassandra-es
      janusgraph.storage.backend: cql
      janusgraph.storage.hostname: jce-cassandra
      janusgraph.index.search.backend: elasticsearch
      janusgraph.index.search.hostname: jce-elastic
    depends_on:
      - jce-elastic
      - jce-cassandra

  spark-master:
    image: bde2020/spark-master:2.4.4-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark

  spark-worker-1:
    image: bde2020/spark-worker:2.4.4-hadoop2.7
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
but here the spark worker does not connect to the spark master and throws exceptions; here are the exception log details. Can someone please guide me on why I am facing this issue?
Note: I am trying to create these services in drone.yml for my integration testing.
Answering as an answer for better formatting. The comments suggest sleeping. Assuming this is the dockerfile (https://hub.docker.com/r/bde2020/spark-worker/dockerfile), you could sleep by adding the command:
spark-worker-1:
  image: bde2020/spark-worker:2.4.4-hadoop2.7
  container_name: spark-worker-1
  # wrap in a shell so the && actually chains the two commands
  command: bash -c "sleep 10 && /bin/bash /worker.sh"
  depends_on:
    - spark-master
  ports:
    - "8081:8081"
  environment:
    - "SPARK_MASTER=spark://spark-master:7077"
Although sleep 10 is probably excessive; sleep 5 or even sleep 2 would likely do.
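
If a fixed sleep turns out to be flaky, a variant is to poll the master's port before starting the worker. This is only a sketch and assumes nc (netcat) is present in the worker image, which is not verified:

spark-worker-1:
  image: bde2020/spark-worker:2.4.4-hadoop2.7
  container_name: spark-worker-1
  # loop until spark-master accepts connections on 7077, then start the worker
  command: bash -c "while ! nc -z spark-master 7077; do sleep 1; done; /bin/bash /worker.sh"
  depends_on:
    - spark-master
  ports:
    - "8081:8081"
  environment:
    - "SPARK_MASTER=spark://spark-master:7077"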

Hue access to HDFS: bypass default hue.ini?

The setup
I am trying to compose a lightweight, minimal hadoop stack with the images provided by bde2020 (for learning purposes). Right now, the stack includes (among others)
a namenode
a datanode
hue
Basically, I started from Big Data Europe's official docker compose, and added a hue image based on their documentation.
The issue
Hue's file browser can't access HDFS:
Cannot access: /user/dav. The HDFS REST service is not available. Note: you are a Hue admin but not a HDFS superuser, "hdfs" or part of HDFS supergroup, "supergroup".
HTTPConnectionPool(host='namenode', port=50070): Max retries exceeded with url: /webhdfs/v1/user/dav?op=GETFILESTATUS&user.name=hue&doas=dav (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f8119a3cf10>: Failed to establish a new connection: [Errno 111] Connection refused',))
What I tried so far to narrow down the issue
I explicitly put all the services on the same network.
I pointed dfs_webhdfs_url to localhost:9870/webhdfs/v1 in the namenode env file (source) and edited hue.ini in hue's container accordingly (by adding webhdfs_url=http://namenode:9870/webhdfs/v1).
When I log into hue's container, I can see that namenode's port 9870 is open (nmap -p 9870 namenode); 50070 is not. I don't think my issue is network related. Despite the edit to hue.ini, Hue still goes for port 50070. So, how can I force Hue to use port 9870 in my current setup? (if that is indeed the cause)
docker-compose
version: '3.7'
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.1.1-java8
    container_name: namenode
    hostname: namenode
    domainname: hadoop
    ports:
      - 9870:9870
    volumes:
      - hadoop_namenode:/hadoop/dfs/name
      - ./entrypoints/namenode/entrypoint.sh:/entrypoint.sh
    env_file:
      - ./hadoop.env
      - .env
    networks:
      - hadoop_net
    # TODO adduser --ingroup hadoop dav

  datanode1:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.1.1-java8
    container_name: datanode
    hostname: datanode1
    domainname: hadoop
    volumes:
      - hadoop_datanode:/hadoop/dfs/data
    environment:
      SERVICE_PRECONDITION: "namenode:9870"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop_net

  resourcemanager:
    image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.1.1-java8
    container_name: resourcemanager
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode:9864"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop_net

  nodemanager1:
    image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.1.1-java8
    container_name: nodemanager
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode:9864 resourcemanager:8088"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop_net

  historyserver:
    image: bde2020/hadoop-historyserver:2.0.0-hadoop3.1.1-java8
    container_name: historyserver
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode:9864 resourcemanager:8088"
    volumes:
      - hadoop_historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
    networks:
      - hadoop_net

  filebrowser:
    container_name: hue
    image: bde2020/hdfs-filebrowser:3.11
    ports:
      - "8088:8088"
    env_file:
      - ./hadoop.env
    volumes: # BYPASS DEFAULT webhdfs url
      - ./overrides/hue/hue.ini:/opt/hue/desktop/conf.dist/hue.ini
    environment:
      - NAMENODE_HOST=namenode
    networks:
      - hadoop_net

networks:
  hadoop_net:

volumes:
  hadoop_namenode:
  hadoop_datanode:
  hadoop_historyserver:
I was able to get the Filebrowser working with this INI:
[desktop]
http_host=0.0.0.0
http_port=8888
time_zone=America/Chicago
dev=true
app_blacklist=impala,zookeeper,oozie,hbase,security,search
[hadoop]
[[hdfs_clusters]]
[[[default]]]
fs_defaultfs=hdfs://namenode:8020
webhdfs_url=http://namenode:50070/webhdfs/v1
security_enabled=false
And this compose file:
version: "2"
services:
namenode:
image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8
container_name: namenode
ports:
- 8020:8020
- 50070:50070
# - 59050:59050
volumes:
- hadoop_namenode:/hadoop/dfs/name
environment:
- CLUSTER_NAME=test
env_file:
- ./hadoop.env
networks:
- hadoop
datanode1:
image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
container_name: datanode1
ports:
- 50075:50075
# - 50010:50010
# - 50020:50020
depends_on:
- namenode
volumes:
- hadoop_datanode1:/hadoop/dfs/data
env_file:
- ./hadoop.env
networks:
- hadoop
hue:
image: gethue/hue
container_name: hue
ports:
- 8000:8888
depends_on:
- namenode
volumes:
- ./conf/hue.ini:/hue/desktop/conf/pseudo-distributed.ini
networks:
- hadoop
- frontend
volumes:
hadoop_namenode:
hadoop_datanode1:
networks:
hadoop:
frontend:
hadoop.env also has to add hue as a proxy user:
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_replication=1
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
Yeah, found it. A few key elements:
in Hadoop 3.x, WebHDFS no longer listens on 50070; 9870 is the standard port
overriding hue.ini involves mounting a file named hue-overrides.ini
the Hue image from gethue is more up to date than the one from bde2020 (their Hadoop stack rocks, though)
Docker-compose
version: '3.7'
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.1.1-java8
    container_name: namenode
    ports:
      - 9870:9870
      - 8020:8020
    volumes:
      - hadoop_namenode:/hadoop/dfs/name
      - ./overrides/namenode/entrypoint.sh:/entrypoint.sh
    env_file:
      - ./hadoop.env
      - .env
    networks:
      - hadoop

  filebrowser:
    container_name: hue
    image: gethue/hue:4.4.0
    ports:
      - "8000:8888"
    env_file:
      - ./hadoop.env
    volumes: # HERE
      - ./overrides/hue/hue-overrides.ini:/usr/share/hue/desktop/conf/hue-overrides.ini
    depends_on:
      - namenode
    networks:
      - hadoop
      - frontend

  datanode1:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.1.1-java8
    container_name: datanode1
    volumes:
      - hadoop_datanode:/hadoop/dfs/data
    environment:
      SERVICE_PRECONDITION: "namenode:9870"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop

  resourcemanager:
    image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.1.1-java8
    container_name: resourcemanager
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode1:9864"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop

  nodemanager1:
    image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.1.1-java8
    container_name: nodemanager
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode1:9864 resourcemanager:8088"
    env_file:
      - ./hadoop.env
    networks:
      - hadoop

  historyserver:
    image: bde2020/hadoop-historyserver:2.0.0-hadoop3.1.1-java8
    container_name: historyserver
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode1:9864 resourcemanager:8088"
    volumes:
      - hadoop_historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
    networks:
      - hadoop

networks:
  hadoop:
  frontend:

volumes:
  hadoop_namenode:
  hadoop_datanode:
  hadoop_historyserver:
hadoop.env
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec
HDFS_CONF_dfs_replication=1
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
hue-overrides.ini
[desktop]
http_host=0.0.0.0
http_port=8888
time_zone=France
dev=true
app_blacklist=impala,zookeeper,oozie,hbase,security,search
[hadoop]
[[hdfs_clusters]]
[[[default]]]
fs_defaultfs=hdfs://namenode:8020
webhdfs_url=http://namenode:9870/webhdfs/v1
security_enabled=false
Thanks @cricket_007
