I am trying to run Apache Airflow in ECS using the v1-10-stable branch of apache/airflow via my fork of airflow. I am using env variables to set the executor and the Postgres and Redis info for the webserver.
AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:airflow_password@postgres:5432/airflow_db"
AIRFLOW__CELERY__RESULT_BACKEND="db+postgresql://airflow_user:airflow_password@postgres:5432/airflow_db"
AIRFLOW__CELERY__BROKER_URL="redis://redis_queue:6379/1"
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
AIRFLOW__CORE__LOAD_EXAMPLES=False
I am using CMD-SHELL [ -f /home/airflow/airflow/airflow-webserver.pid ] as the health check for the ECS container. I can connect to Postgres and Redis from the docker container, so security groups are not the issue either.
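For context, the health check sits in the container definition of the ECS task roughly like this (a sketch: the command is the one above, while the interval, timeout, and retries values are illustrative assumptions):
"healthCheck": {
    "command": ["CMD-SHELL", "[ -f /home/airflow/airflow/airflow-webserver.pid ]"],
    "interval": 30,
    "timeout": 5,
    "retries": 3
}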
With docker ps I can see that the container is healthy, and the container port is mapped to the EC2 instance as 0.0.0.0:32794->8080/tcp.
But when I try to open the webserver UI, it does not load, and curl does not work either. I have tried curl localhost:32794 from the EC2 instance and curl localhost:8080 from the container, but neither works. telnet works in both cases.
In the container logs, I can see that the gunicorn workers are constantly timing out:
[2019-11-25 05:30:39,236] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:17337)
[2019-11-25 05:30:39 +0000] [17337] [INFO] Worker exiting (pid: 17337)
[2019-11-25 05:30:39,430] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:39,472] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:39,479] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39,447] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39,524] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39,719] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39,930] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:40,139] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:40,244] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:40 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:17338)
[2019-11-25 05:30:40 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:17339)
[2019-11-25 05:30:40 +0000] [17393] [INFO] Booting worker with pid: 17393
[2019-11-25 05:30:40,412] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
The EC2 instance is running Amazon Linux 2, and I can see these logs constantly in /var/log/messages:
Nov 25 05:57:15 ip-172-31-67-43 ec2net: [rewrite_aliases] Rewriting aliases of eth0
Nov 25 05:58:16 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 131000ms.
Nov 25 06:00:27 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 127900ms.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Created slice User Slice of root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Starting User Slice of root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Started Session 77 of user root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Starting Session 77 of user root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Removed slice User Slice of root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Stopping User Slice of root.
Nov 25 06:02:35 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 131620ms.
Nov 25 06:04:36 ip-172-31-67-43 systemd: Started Session 78 of user ec2-user.
Nov 25 06:04:36 ip-172-31-67-43 systemd-logind: New session 78 of user ec2-user.
Nov 25 06:04:36 ip-172-31-67-43 systemd: Starting Session 78 of user ec2-user.
Nov 25 06:04:46 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 125300ms.
Nov 25 06:06:52 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 115230ms.
Nov 25 06:08:47 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 108100ms.
Regarding these timeout errors:
[CRITICAL] WORKER TIMEOUT
You can set the Gunicorn timeouts via these two Airflow environment variables:
AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT
Number of seconds the webserver waits before killing a gunicorn master that doesn't respond.
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT
Number of seconds the gunicorn webserver waits before timing out on a worker.
See the Airflow documentation for more info.
I had to solve this error for my Airflow install. I set both of these timeouts to 300 seconds and that allowed me to get the Airflow web UI loading so that I could then debug the underlying cause of the slow page load.
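For example, matching the environment-variable style used in the question:
AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT=300
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300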
I was getting this error when deploying Airflow with AIRFLOW_HOME set to my EFS mount. Setting it to ~/airflow fixed the issue.
I am trying to configure a Redis cluster using the docker image bitnami/redis-cluster.
Following is the docker-compose.yml:
version: '3.8'

services:
  redis-node-0:
    image: bitnami/redis-cluster:6.2.7
    volumes:
      - redis-node-data-0:/bitnami/redis/data
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
      - REDIS_NODES=redis-node-0 redis-node-1 redis-node-2

  redis-node-1:
    image: bitnami/redis-cluster:6.2.7
    volumes:
      - redis-node-data-1:/bitnami/redis/data
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
      - REDIS_NODES=redis-node-0 redis-node-1 redis-node-2

  redis-node-2:
    image: bitnami/redis-cluster:6.2.7
    volumes:
      - redis-node-data-2:/bitnami/redis/data
    ports:
      - 6379:6379
    depends_on:
      - redis-node-0
      - redis-node-1
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
      - REDIS_NODES=redis-node-0 redis-node-1 redis-node-2
      - REDIS_CLUSTER_REPLICAS=1
      - REDIS_CLUSTER_CREATOR=yes

volumes:
  redis-node-data-0:
    driver: local
  redis-node-data-1:
    driver: local
  redis-node-data-2:
    driver: local

networks:
  default:
    name: local_network
The Docker containers are running perfectly.
Output of docker ps:
CONTAINER ID   IMAGE                         COMMAND                  CREATED          STATUS          PORTS                    NAMES
bea9a7c52eba   bitnami/redis-cluster:6.2.7   "/opt/bitnami/script…"   12 minutes ago   Up 12 minutes   0.0.0.0:6379->6379/tcp   local-redis-node-2-1
63c08f1330e0   bitnami/redis-cluster:6.2.7   "/opt/bitnami/script…"   12 minutes ago   Up 12 minutes   6379/tcp                 local-redis-node-1-1
e1b163d75254   bitnami/redis-cluster:6.2.7   "/opt/bitnami/script…"   12 minutes ago   Up 12 minutes   6379/tcp                 local-redis-node-0-1
As I have set local-redis-node-2-1 as the REDIS_CLUSTER_CREATOR, it is in charge of initializing the cluster. Therefore, I am executing all the commands below on this node.
Going inside the container: docker exec -it local-redis-node-2-1 redis-cli
Then, trying to save data in Redis with set a 1, I get the error: (error) CLUSTERDOWN Hash slot not served
Output of cluster slots: (empty array)
I tried docker exec -it local-redis-node-2-1 redis-cli --cluster fix localhost:6379, but it assigned all the slots to local-redis-node-2-1 only; none were assigned to the other nodes. Below is the output of cluster slots after fixing via the above command:
127.0.0.1:6379> cluster slots
1) 1) (integer) 0
2) (integer) 16383
3) 1) "172.18.0.9"
2) (integer) 6379
3) "819770cf8b39793517efa10b9751083c854e15d7"
I must be doing something wrong. I would love to know the manual solution as well, but I am more interested in a way to set the cluster slots via docker-compose.
Can someone help me set the cluster slots automatically via docker-compose.yml?
Will the read replicas work with this docker-compose.yml, or have I missed something?
Also, can someone confirm whether the cluster will work fine once the cluster slots are resolved?
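For reference, a manual initialization along these lines should assign the slots (a sketch, using the container IPs that appear in the logs below; note that redis-cli rejects --cluster-replicas 1 with only three nodes, since one replica per master requires at least six nodes, so the replica count here has to be 0):
# Hypothetical one-off command, run against the three node IPs from the logs:
docker exec -it local-redis-node-2-1 \
  redis-cli --cluster create 172.18.0.8:6379 172.18.0.6:6379 172.18.0.9:6379 \
  --cluster-replicas 0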
Logs of local-redis-node-2-1 below:
redis-cluster 00:47:45.70
redis-cluster 00:47:45.73 Welcome to the Bitnami redis-cluster container
redis-cluster 00:47:45.76 Subscribe to project updates by watching https://github.com/bitnami/containers
redis-cluster 00:47:45.78 Submit issues and feature requests at https://github.com/bitnami/containers/issues
redis-cluster 00:47:45.80
redis-cluster 00:47:45.83 INFO ==> ** Starting Redis setup **
redis-cluster 00:47:46.00 WARN ==> You set the environment variable ALLOW_EMPTY_PASSWORD=yes. For safety reasons, do not use this flag in a production environment.
redis-cluster 00:47:46.05 INFO ==> Initializing Redis
redis-cluster 00:47:46.29 INFO ==> Setting Redis config file
Changing old IP 172.18.0.8 by the new one 172.18.0.8
Changing old IP 172.18.0.6 by the new one 172.18.0.6
Changing old IP 172.18.0.9 by the new one 172.18.0.9
redis-cluster 00:47:47.30 INFO ==> ** Redis setup finished! **
1:C 10 Dec 2022 00:47:47.579 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 10 Dec 2022 00:47:47.580 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 10 Dec 2022 00:47:47.580 # Configuration loaded
1:M 10 Dec 2022 00:47:47.584 * monotonic clock: POSIX clock_gettime
1:M 10 Dec 2022 00:47:47.588 * Node configuration loaded, I'm 819770cf8b39793517efa10b9751083c854e15d7
1:M 10 Dec 2022 00:47:47.595 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
1:M 10 Dec 2022 00:47:47.598 * Running mode=cluster, port=6379.
1:M 10 Dec 2022 00:47:47.599 # Server initialized
1:M 10 Dec 2022 00:47:47.612 * Ready to accept connections
1:M 10 Dec 2022 00:47:49.673 # Cluster state changed: ok
Logs of local-redis-node-1-1 below:
redis-cluster 00:47:45.43
redis-cluster 00:47:45.46 Welcome to the Bitnami redis-cluster container
redis-cluster 00:47:45.48 Subscribe to project updates by watching https://github.com/bitnami/containers
redis-cluster 00:47:45.51 Submit issues and feature requests at https://github.com/bitnami/containers/issues
redis-cluster 00:47:45.54
redis-cluster 00:47:45.56 INFO ==> ** Starting Redis setup **
redis-cluster 00:47:45.73 WARN ==> You set the environment variable ALLOW_EMPTY_PASSWORD=yes. For safety reasons, do not use this flag in a production environment.
redis-cluster 00:47:45.79 INFO ==> Initializing Redis
redis-cluster 00:47:46.00 INFO ==> Setting Redis config file
Changing old IP 172.18.0.8 by the new one 172.18.0.8
Changing old IP 172.18.0.6 by the new one 172.18.0.6
Changing old IP 172.18.0.9 by the new one 172.18.0.9
redis-cluster 00:47:47.10 INFO ==> ** Redis setup finished! **
1:C 10 Dec 2022 00:47:47.387 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 10 Dec 2022 00:47:47.388 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 10 Dec 2022 00:47:47.388 # Configuration loaded
1:M 10 Dec 2022 00:47:47.392 * monotonic clock: POSIX clock_gettime
1:M 10 Dec 2022 00:47:47.395 * Node configuration loaded, I'm 2ab0b8db952cc101f7873cdcf8cf691f8f6bae7b
1:M 10 Dec 2022 00:47:47.403 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
1:M 10 Dec 2022 00:47:47.406 * Running mode=cluster, port=6379.
1:M 10 Dec 2022 00:47:47.407 # Server initialized
1:M 10 Dec 2022 00:47:47.418 * Ready to accept connections
1:M 10 Dec 2022 00:56:02.716 # New configEpoch set to 1
1:M 10 Dec 2022 00:56:41.943 # Cluster state changed: ok
Logs of local-redis-node-0-1 below:
redis-cluster 00:47:45.43
redis-cluster 00:47:45.46 Welcome to the Bitnami redis-cluster container
redis-cluster 00:47:45.49 Subscribe to project updates by watching https://github.com/bitnami/containers
redis-cluster 00:47:45.51 Submit issues and feature requests at https://github.com/bitnami/containers/issues
redis-cluster 00:47:45.54
redis-cluster 00:47:45.56 INFO ==> ** Starting Redis setup **
redis-cluster 00:47:45.73 WARN ==> You set the environment variable ALLOW_EMPTY_PASSWORD=yes. For safety reasons, do not use this flag in a production environment.
redis-cluster 00:47:45.79 INFO ==> Initializing Redis
redis-cluster 00:47:46.00 INFO ==> Setting Redis config file
Changing old IP 172.18.0.8 by the new one 172.18.0.8
Changing old IP 172.18.0.6 by the new one 172.18.0.6
Changing old IP 172.18.0.9 by the new one 172.18.0.9
redis-cluster 00:47:47.11 INFO ==> ** Redis setup finished! **
1:C 10 Dec 2022 00:47:47.387 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 10 Dec 2022 00:47:47.388 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 10 Dec 2022 00:47:47.388 # Configuration loaded
1:M 10 Dec 2022 00:47:47.391 * monotonic clock: POSIX clock_gettime
1:M 10 Dec 2022 00:47:47.395 * Node configuration loaded, I'm 5ffeca48faa750a5f47c76639598fdb9b7b8b720
1:M 10 Dec 2022 00:47:47.402 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
1:M 10 Dec 2022 00:47:47.405 * Running mode=cluster, port=6379.
1:M 10 Dec 2022 00:47:47.405 # Server initialized
1:M 10 Dec 2022 00:47:47.415 * Ready to accept connections
I have a couple of docker swarm nodes. I tried to create a service on the leader with the command below; the service creation process is still going on, and it has been more than 40 minutes now.
docker service create \
--mode global \
--mount type=bind,src=/project/m32/,dst=/root/m32/ \
--publish mode=host,target=310,published=310 \
--publish mode=host,target=311,published=311 \
--publish mode=host,target=312,published=312 \
--publish mode=host,target=313,published=313 \
--constraint "node.labels.m32 == true" \
--name m32 \
local-registry/ubuntu:07
overall progress: 1 out of 2 tasks
ew0edluvz39p: ready [======================================> ]
kzc7jf7irsrh: running [==================================================>]
In the service's task list, the tasks keep showing as Ready and Shutdown:
$ docker service ps m32
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
s4q0rqrqbpdn m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Ready Ready 1 second ago
r6vibgptm5oc \_ m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Shutdown Complete 1 second ago
joq2p6c9jpnx \_ m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Shutdown Complete 7 seconds ago
a5h8gac02vfx \_ m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Shutdown Complete 13 seconds ago
f51stfsdlhvp \_ m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Shutdown Complete 19 seconds ago
zqcbxkm4fwhr m32.kzc7jf7irsrhnx3kurcwqjb2j local-registry/ubuntu:07 sl-090 Ready Ready less than a second ago
za8efvi9x4yw \_ m32.kzc7jf7irsrhnx3kurcwqjb2j local-registry/ubuntu:07 sl-090 Shutdown Complete less than a second ago
$ sudo systemctl status docker.service
Nov 24 19:58:48 svr2 dockerd[2797]: time="2021-11-24T19:58:48.200421563+05:30" level=info msg="ignoring event" container=ea8b76fedb18159ba0cd8f279a9ca4264399c>
Nov 24 20:01:39 svr2 dockerd[2797]: time="2021-11-24T20:01:39.602028420+05:30" level=info msg="NetworkDB stats svr2(00bbf0799aa6) - netID:ubuzyty9mq4tb7xyb>
Nov 24 20:06:39 svr2 dockerd[2797]: time="2021-11-24T20:06:39.802013427+05:30" level=info msg="NetworkDB stats svr2(00bbf0799aa6) - netID:ubuzyty9mq4tb7xyb>
Nov 24 20:11:40 svr2 dockerd[2797]: time="2021-11-24T20:11:40.001992437+05:30" level=info msg="NetworkDB stats svr2(00bbf0799aa6) - netID:ubuzyty9mq4tb7xyb>
Nov 24 20:14:17 svr2 dockerd[2797]: time="2021-11-24T20:14:17.871605342+05:30" level=error msg="Error getting service xkauq9a599iv: service xkauq9a599iv not f>
Nov 24 20:14:52 svr2 dockerd[2797]: time="2021-11-24T20:14:52.833890158+05:30" level=error msg="Error getting service xkauq9a599iv: service xkauq9a599iv not f>
Nov 24 20:15:12 svr2 dockerd[2797]: time="2021-11-24T20:15:12.395692837+05:30" level=error msg="Error getting service pwaa8cvdd683: service pwaa8cvdd683 not f>
Nov 24 20:15:17 svr2 dockerd[2797]: time="2021-11-24T20:15:17.773200054+05:30" level=error msg="Error getting service xk0v0g2roypx: service xk0v0g2roypx not f>
Nov 24 20:16:18 svr2 dockerd[2797]: time="2021-11-24T20:16:18.529344060+05:30" level=error msg="Error getting service xk0v0g2roypx: service xk0v0g2roypx not f>
Nov 24 20:16:40 svr2 dockerd[2797]: time="2021-11-24T20:16:40.201888504+05:30" level=info msg="NetworkDB stats svr2(00bbf0799aa6) - netID:ubuzyty9mq4tb7xyb>
It looks like a loop keeps creating containers. What is wrong with my approach? Any help to fix this problem would be highly appreciated. Thanks
You really need to pass --restart-max-attempts 5 to your docker service create to ensure that services don't restart too many times in a loop. It's bad for the stability of docker, and hard to debug. Rather have a task just give up and stop so you can see something is wrong and diagnose it.
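For a service that already exists, the same cap can be applied in place (a sketch, using the service name from the question):
docker service update --restart-max-attempts 5 m32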
To see specifically what is wrong, you would want to look at the logs of each task. Use the individual task IDs to see why each one failed:
# The logs for a task
docker service logs s4q0rqrqbpdn
# A general breakdown of a task
docker inspect s4q0rqrqbpdn
Sometimes you need to track down the actual container for the task and inspect that. docker container commands are not swarm-aware, so:
# list the service showing the full task id.
docker service ps <service> --no-trunc
# then docker context use <node> / ssh <node> to switch to a node of interest.
# Then, the container name is the "ID"."NAME" from the PS list. For example:
docker context use sl-089
docker container inspect m32.ew0edluvz39pazold0wnv2ean.s4q0rqrqbpdnABCDEFGABCDEFG
Inspecting the container can show if it was killed because of an OOM or certain other reasons that don't otherwise show up.
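For example, a quick OOM check against that container (a sketch; .State.OOMKilled and .State.ExitCode are standard inspect fields, and the container name is the placeholder constructed above):
docker container inspect \
  --format '{{.State.OOMKilled}} {{.State.ExitCode}}' \
  m32.ew0edluvz39pazold0wnv2ean.s4q0rqrqbpdnABCDEFGABCDEFG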
I have successfully installed Jenkins on an AWS server running Ubuntu 18.04, and the service is up and running on the server:
jenkins.service - LSB: Start Jenkins at boot time
Loaded: loaded (/etc/init.d/jenkins; generated)
Active: active (exited) since Mon 2020-01-20 05:26:40 UTC; 8min ago
Docs: man:systemd-sysv-generator(8)
Process: 1905 ExecStop=/etc/init.d/jenkins stop (code=exited, status=0/SUCCESS)
Process: 1951 ExecStart=/etc/init.d/jenkins start (code=exited, status=0/SUCCESS)
Jan 20 05:26:39 ip-3.106.165.24 systemd[1]: Stopped LSB: Start Jenkins at boot time.
Jan 20 05:26:39 ip-3.106.165.24 systemd[1]: Starting LSB: Start Jenkins at boot time...
Jan 20 05:26:39 ip-3.106.165.24 jenkins[1951]: Correct java version found
Jan 20 05:26:39 ip-3.106.165.24 jenkins[1951]: * Starting Jenkins Automation Server jenkins
Jan 20 05:26:39 ip-3.106.165.24 su[1997]: Successful su for jenkins by root
Jan 20 05:26:39 ip-3.106.165.24 su[1997]: + ??? root:jenkins
Jan 20 05:26:39 ip-3.106.165.24 su[1997]: pam_unix(su:session): session opened for user jenkins by (uid=0)
Jan 20 05:26:39 ip-3.106.165.24 su[1997]: pam_unix(su:session): session closed for user jenkins
Jan 20 05:26:40 ip-3.106.165.24 jenkins[1951]: ...done.
Jan 20 05:26:40 ip-3.106.165.24 systemd[1]: Started LSB: Start Jenkins at boot time.
My security groups are set up with port 22 for SSH, port 8080 for Jenkins, and port 80 for HTTP.
When I attempt to access Jenkins through the web browser I get a "This site can't be reached" error.
Not sure what else I can try; I have tried every solution under the sun but the problem persists.
I can SSH into the server and I can access port 80 as I setup nginx on the server successfully.
Any help would be greatly appreciated
Regards
Danny
Not sure why port 8080 was not working. I had to change the port to 8081 in the Jenkins config file at the following location (note there are multiple Jenkins config files):
/etc/default/jenkins
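The relevant setting in that file is the HTTP port (a sketch; HTTP_PORT is the variable the stock Debian/Ubuntu package reads):
HTTP_PORT=8081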
I also changed the security group settings in AWS to match.
It works fine now.
Further to this, it turned out my ISP was blocking port 8080. When I hotspotted my phone, port 8080 was accessible.
I've installed a neo4j database on a Google Cloud Compute instance and want to connect to the database from my laptop.
[1] I have neo4j running on Google Cloud
● neo4j.service - Neo4j Graph Database
Loaded: loaded (/lib/systemd/system/neo4j.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2017-09-30 09:33:39 UTC; 1h 3min ago
Main PID: 2099 (java)
Tasks: 41
Memory: 504.5M
CPU: 18.652s
CGroup: /system.slice/neo4j.service
└─2099 /usr/bin/java -cp /var/lib/neo4j/plugins:/etc/neo4j:/usr/share/neo4j/lib/*:/var/lib/neo4j/plugins/* -server -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+TrustFinalNonStaticFields -XX:+DisableExplicitGC -Djdk.tls.ephemeralDHKeySize=2048 -Dunsupported.dbms.udc.source=debian -Dfile.encoding=UTF-8 org.neo4j.server.CommunityEntryPoint --home-dir=/var/lib/neo4j --config-dir=/etc/neo4j
Sep 30 09:33:40 neo4j-graphdb-server neo4j[2099]: certificates: /var/lib/neo4j/certificates
Sep 30 09:33:40 neo4j-graphdb-server neo4j[2099]: run: /var/run/neo4j
Sep 30 09:33:40 neo4j-graphdb-server neo4j[2099]: Starting Neo4j.
Sep 30 09:33:42 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:42.948+0000 INFO ======== Neo4j 3.2.5 ========
Sep 30 09:33:42 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:42.988+0000 INFO Starting...
Sep 30 09:33:44 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:44.308+0000 INFO Bolt enabled on 127.0.0.1:7687.
Sep 30 09:33:47 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:47.043+0000 INFO Started.
Sep 30 09:33:48 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:48.160+0000 INFO Remote interface available at http://localhost:7474/
Sep 30 09:39:17 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:39:17.918+0000 WARN badMessage: 400 No URI for HttpChannelOverHttp#27d4a9b{r=0,c=false,a=IDLE,uri=-}
Sep 30 09:46:18 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:46:18.374+0000 WARN badMessage: 400 for HttpChannelOverHttp#6cbed0ca{r=0,c=false,a=IDLE,uri=-}
[2] I've created a firewall rule on Google Cloud to allow external access to the DB server
The network tag of "google-db-server" has been added to the Google Cloud Compute server.
My expectation is that the rule below will allow any external machine to connect to port 7474 on the Google Cloud Compute instance
user@machine:~/home$ gcloud compute firewall-rules create custom-allow-neo4j --action ALLOW --rules tcp:7474 --description "Enable access to the neo4j database" --direction IN --target-tags google-db-server
user@machine:~/home$ gcloud compute firewall-rules list --format json
[
  {
    "allowed": [
      {
        "IPProtocol": "tcp",
        "ports": [
          "7474"
        ]
      }
    ],
    "creationTimestamp": "2017-09-30T00:25:51.220-07:00",
    "description": "Enable access to the neo4j database",
    "direction": "INGRESS",
    "id": "5767618134171383824",
    "kind": "compute#firewall",
    "name": "custom-allow-neo4j",
    "network": "https://www.googleapis.com/compute/v1/projects/graphdb-experiment/global/networks/default",
    "priority": 1000,
    "selfLink": "https://www.googleapis.com/compute/v1/projects/graphdb-experiment/global/firewalls/custom-allow-neo4j",
    "sourceRanges": [
      "0.0.0.0/0"
    ],
    "targetTags": [
      "google-db-server"
    ]
  },
[3] Running nmap from the Google Cloud server instance shows that port 7474 is available locally, and I can telnet to that port locally
google_user@google-db-server:~$ nmap -p 22,80,443,7474 localhost
Starting Nmap 7.01 ( https://nmap.org ) at 2017-09-30 10:46 UTC
Nmap scan report for localhost (127.0.0.1)
Host is up (0.000081s latency).
PORT STATE SERVICE
22/tcp open ssh
80/tcp open http
443/tcp closed https
7474/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds
google-user@google-db-server:~$ telnet localhost 7474
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
[4] However I am unable to connect from my laptop and nmap shows port 7474 as unavailable
user@machine:~/home$ nmap -p 22,80,443,7474 35.201.26.52
Starting Nmap 7.01 ( https://nmap.org ) at 2017-09-30 20:50 AEST
Nmap scan report for 52.26.201.35.bc.googleusercontent.com (35.201.26.52)
Host is up (0.28s latency).
PORT STATE SERVICE
22/tcp open ssh
80/tcp open http
443/tcp closed https
7474/tcp closed unknown
Nmap done: 1 IP address (1 host up) scanned in 0.75 seconds
So despite the firewall rule being created to allow any IP address to connect to the Google Cloud Compute instance on tcp:7474, I'm still unable to access this port from my laptop.
Am I missing some additional steps?
It looks like neo4j is only listening on the loopback interface. This means it only accepts connections from the same machine. You can verify this by running sudo netstat -lntp. If you see 127.0.0.1:7474, it's only listening on loopback. It should be 0.0.0.0:7474.
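For illustration, loopback-only output would look something like this (a sketch; the PID is the java process from the service status above):
$ sudo netstat -lntp | grep 7474
tcp        0      0 127.0.0.1:7474          0.0.0.0:*               LISTEN      2099/java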
You can fix this in the Neo4j config by binding the connectors to 0.0.0.0; since port 7474 is the HTTP connector (Bolt is 7687), the relevant setting is dbms.connector.http.listen_address, or dbms.connectors.default_listen_address to cover all connectors. Your Linux distribution may also have a different place to set this configuration.
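A minimal sketch of the change, assuming Neo4j 3.2 with the stock config at /etc/neo4j/neo4j.conf:
# Bind every connector to all interfaces:
dbms.connectors.default_listen_address=0.0.0.0
# Or per connector:
dbms.connector.http.listen_address=0.0.0.0:7474
dbms.connector.bolt.listen_address=0.0.0.0:7687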
I'm currently running docker swarm on 3 nodes. First I created a network:
docker network create -d overlay xx_net
after that a service as
docker service create --network xx_net --replicas 1 -p 12345:12345 --name nameofservice nameofimage:1
If I read correctly, this is the routing mesh (OK for me). But I can only access the service on the IP of the node where the container is running, even though it should be available on every node's IP.
If I drain a node, the container starts up on a different node and is then available on the new IP.
More information added below:
I rebooted all servers: 3 workers, one of which is the manager.
After boot, all seems to work OK!
I'm using the rabbitmq image from Docker Hub. The Dockerfile is quite small: FROM rabbitmq:3-management. The container has been started on worker 2.
I can connect to rabbitmq's management page from all workers: worker1-ip:15672, worker2-ip:15672, worker3-ip:15672, so I think all the needed ports are open.
After about 1 hour, the rabbitmq container was moved from worker 2 to worker 3; I do not know the reason.
After that I can no longer connect via worker1-ip:15672 or worker2-ip:15672, but worker3-ip:15672 still works!
I drained worker3 with docker node update --availability drain worker3.
The container started on worker1.
After that I can only connect via worker1-ip:15672, no longer via worker2 or worker3.
One more test:
I restarted all docker services on all workers, and everything works again?!
- let's wait a few hours...
Today's status:
2 of 3 nodes are working OK. From the manager's service log:
Jul 12 07:53:32 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:53:32.787953754Z" level=info msg="memberlist: Marking dockerswarmworker2-459b4229d652 as failed, suspect timeout reached"
Jul 12 07:53:39 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:53:39.787783458Z" level=info msg="memberlist: Marking dockerswarmworker2-459b4229d652 as failed, suspect timeout reached"
Jul 12 07:55:27 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:55:27.790564790Z" level=info msg="memberlist: Marking dockerswarmworker2-459b4229d652 as failed, suspect timeout reached"
Jul 12 07:55:41 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:55:41.787974530Z" level=info msg="memberlist: Marking dockerswarmworker2-459b4229d652 as failed, suspect timeout reached"
Jul 12 07:56:33 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:56:33.027525926Z" level=error msg="logs call failed" error="container not ready for logs: context canceled" module="node/agent/taskmanager" node.id=b6vnaouyci7b76ol1apq96zxx
Jul 12 07:56:33 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:56:33.027668473Z" level=error msg="logs call failed" error="container not ready for logs: context canceled" module="node/agent/taskmanager" node.id=b6vnaouyci7b76ol1apq96zxx
Jul 12 08:13:22 dockerswarmmanager dockerd[7180]: time="2017-07-12T08:13:22.787796692Z" level=info msg="memberlist: Marking dockerswarmworker2-03ec8453a81f as failed, suspect timeout reached"
Jul 12 08:21:37 dockerswarmmanager dockerd[7180]: time="2017-07-12T08:21:37.788694522Z" level=info msg="memberlist: Marking dockerswarmworker2-03ec8453a81f as failed, suspect timeout reached"
Jul 12 08:24:01 dockerswarmmanager dockerd[7180]: time="2017-07-12T08:24:01.525570127Z" level=error msg="logs call failed" error="container not ready for logs: context canceled" module="node/agent/taskmanager" node.id=b6vnaouyci7b76ol1apq96zxx
Jul 12 08:24:01 dockerswarmmanager dockerd[7180]: time="2017-07-12T08:24:01.525713893Z" level=error msg="logs call failed" error="container not ready for logs: context canceled" module="node/agent/taskmanager" node.id=b6vnaouyci7b76ol1apq96zxx
And from the failing worker's docker log:
Jul 12 08:20:47 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:20:47.486202716Z" level=error msg="Bulk sync to node h999-99-999-185.scenegroup.fi-891b24339f8a timed out"
Jul 12 08:21:38 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:38.288117026Z" level=warning msg="memberlist: Refuting a dead message (from: h999-99-999-185.scenegroup.fi-891b24339f8a)"
Jul 12 08:21:39 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:39.404554761Z" level=warning msg="Neighbor entry already present for IP 10.255.0.3, mac 02:42:0a:ff:00:03"
Jul 12 08:21:39 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:39.404588738Z" level=warning msg="Neighbor entry already present for IP 104.198.180.163, mac 02:42:0a:ff:00:03"
Jul 12 08:21:39 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:39.404609273Z" level=warning msg="Neighbor entry already present for IP 10.255.0.6, mac 02:42:0a:ff:00:06"
Jul 12 08:21:39 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:39.404622776Z" level=warning msg="Neighbor entry already present for IP 104.198.180.163, mac 02:42:0a:ff:00:06"
Jul 12 08:21:47 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:47.486007317Z" level=error msg="Bulk sync to node h999-99-999-185.scenegroup.fi-891b24339f8a timed out"
Jul 12 08:22:47 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:22:47.485821037Z" level=error msg="Bulk sync to node h999-99-999-185.scenegroup.fi-891b24339f8a timed out"
Jul 12 08:23:17 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:23:17.630602898Z" level=error msg="Bulk sync to node h999-99-999-185.scenegroup.fi-891b24339f8a timed out"
And this one from the working worker:
Jul 12 08:33:09 h999-99-999-185.scenegroup.fi dockerd[10330]: time="2017-07-12T08:33:09.219973777Z" level=warning msg="Neighbor entry already present for IP 10.0.0.3, mac xxxxx"
Jul 12 08:33:09 h999-99-999-185.scenegroup.fi dockerd[10330]: time="2017-07-12T08:33:09.220539013Z" level=warning msg="Neighbor entry already present for IP "managers ip here", mac xxxxxx"
I restarted docker on the problematic worker and it started to work again.
I'll keep following...
Today's results:
2 of 3 workers are available, one is not.
I didn't do a thing.
After 4 hours of the swarm running on its own, all seems to work again?!
Services have been moved from one worker to another for no apparent reason; all results point to a communication problem.
Quite confusing.
Upgrade to Docker 17.06.
Ingress overlay networking was broken for a long time, until about 17.06-rc3.
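To confirm the engine version on each node before and after upgrading, a quick check:
docker version --format '{{.Server.Version}}'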