Configure Google Cloud Compute firewall to allow external access to DB server - neo4j

I've installed a neo4j database on a Google Cloud Compute instance and want to connect to the database from my laptop.
[1] I have neo4j running on Google Cloud
● neo4j.service - Neo4j Graph Database
Loaded: loaded (/lib/systemd/system/neo4j.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2017-09-30 09:33:39 UTC; 1h 3min ago
Main PID: 2099 (java)
Tasks: 41
Memory: 504.5M
CPU: 18.652s
CGroup: /system.slice/neo4j.service
└─2099 /usr/bin/java -cp /var/lib/neo4j/plugins:/etc/neo4j:/usr/share/neo4j/lib/*:/var/lib/neo4j/plugins/* -server -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow -XX:+AlwaysPreTouch -XX:+U
nlockExperimentalVMOptions -XX:+TrustFinalNonStaticFields -XX:+DisableExplicitGC -Djdk.tls.ephemeralDHKeySize=2048 -Dunsupported.dbms.udc.source=debian -Dfile.encoding=UTF-8 org.neo4j.server.Commu
nityEntryPoint --home-dir=/var/lib/neo4j --config-dir=/etc/neo4j
Sep 30 09:33:40 neo4j-graphdb-server neo4j[2099]: certificates: /var/lib/neo4j/certificates
Sep 30 09:33:40 neo4j-graphdb-server neo4j[2099]: run: /var/run/neo4j
Sep 30 09:33:40 neo4j-graphdb-server neo4j[2099]: Starting Neo4j.
Sep 30 09:33:42 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:42.948+0000 INFO ======== Neo4j 3.2.5 ========
Sep 30 09:33:42 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:42.988+0000 INFO Starting...
Sep 30 09:33:44 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:44.308+0000 INFO Bolt enabled on 127.0.0.1:7687.
Sep 30 09:33:47 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:47.043+0000 INFO Started.
Sep 30 09:33:48 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:33:48.160+0000 INFO Remote interface available at http://localhost:7474/
Sep 30 09:39:17 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:39:17.918+0000 WARN badMessage: 400 No URI for HttpChannelOverHttp#27d4a9b{r=0,c=false,a=IDLE,uri=-}
Sep 30 09:46:18 neo4j-graphdb-server neo4j[2099]: 2017-09-30 09:46:18.374+0000 WARN badMessage: 400 for HttpChannelOverHttp#6cbed0ca{r=0,c=false,a=IDLE,uri=-}
[2] I've created a firewall rule on Google Cloud to allow external access to the DB server
The network tag of "google-db-server" has been added to the Google Cloud Compute server.
My expectation is that the rule below will allow any external machine to connect to port 7474 on the Google Cloud Compute instance
user@machine:~/home$ gcloud compute firewall-rules create custom-allow-neo4j --action ALLOW --rules tcp:7474 --description "Enable access to the neo4j database" --direction IN --target-tags google-db-server
user@machine:~/home$ gcloud compute firewall-rules list --format json
[
{
"allowed": [
{
"IPProtocol": "tcp",
"ports": [
"7474"
]
}
],
"creationTimestamp": "2017-09-30T00:25:51.220-07:00",
"description": "Enable access to the neo4j database",
"direction": "INGRESS",
"id": "5767618134171383824",
"kind": "compute#firewall",
"name": "custom-allow-neo4j",
"network": "https://www.googleapis.com/compute/v1/projects/graphdb-experiment/global/networks/default",
"priority": 1000,
"selfLink": "https://www.googleapis.com/compute/v1/projects/graphdb-experiment/global/firewalls/custom-allow-neo4j",
"sourceRanges": [
"0.0.0.0/0"
],
"targetTags": [
"google-db-server"
]
},
[3] Running nmap from the Google Cloud server instance shows that port 7474 is available locally, and I can telnet to that port locally
google_user@google-db-server:~$ nmap -p 22,80,443,7474 localhost
Starting Nmap 7.01 ( https://nmap.org ) at 2017-09-30 10:46 UTC
Nmap scan report for localhost (127.0.0.1)
Host is up (0.000081s latency).
PORT STATE SERVICE
22/tcp open ssh
80/tcp open http
443/tcp closed https
7474/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds
google_user@google-db-server:~$ telnet localhost 7474
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
[4] However I am unable to connect from my laptop and nmap shows port 7474 as unavailable
user@machine:~/home$ nmap -p 22,80,443,7474 35.201.26.52
Starting Nmap 7.01 ( https://nmap.org ) at 2017-09-30 20:50 AEST
Nmap scan report for 52.26.201.35.bc.googleusercontent.com (35.201.26.52)
Host is up (0.28s latency).
PORT STATE SERVICE
22/tcp open ssh
80/tcp open http
443/tcp closed https
7474/tcp closed unknown
Nmap done: 1 IP address (1 host up) scanned in 0.75 seconds
So despite the firewall rule being created to allow any IP address to connect to the Google Cloud Compute instance on tcp:7474, I'm still unable to access this port from my laptop.
Am I missing some additional steps?

It looks like neo4j is only listening on the loopback interface, which means it only accepts connections from the same machine. You can verify this by running sudo netstat -lntp: if you see 127.0.0.1:7474, it is listening on loopback only; it should be 0.0.0.0:7474.
You can fix this in the neo4j config by setting dbms.connector.http.listen_address to 0.0.0.0:7474 (port 7474 is the HTTP connector; dbms.connector.bolt.listen_address controls Bolt on 7687). Your Linux distribution may also keep this configuration in a different place.
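As a sketch, the relevant lines in neo4j.conf (for Neo4j 3.2.x the file is typically /etc/neo4j/neo4j.conf; verify the setting names against your installed version) would be:

```
# Listen on all interfaces instead of loopback only
dbms.connectors.default_listen_address=0.0.0.0

# Or per connector:
dbms.connector.http.listen_address=0.0.0.0:7474
dbms.connector.bolt.listen_address=0.0.0.0:7687
```

Restart the service afterwards (sudo systemctl restart neo4j) and re-run sudo netstat -lntp to confirm the bind address changed.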

Related

Why can't DHT find the resource when downloading with a trackerless torrent?

Please follow the steps below on your VPS and the issue may reproduce; replace the variable $vps_ip with your real VPS IP throughout.
wget https://saimei.ftp.acc.umu.se/debian-cd/current/amd64/iso-cd/debian-10.4.0-amd64-netinst.iso
transmission-create -o debian.torrent debian-10.4.0-amd64-netinst.iso
This creates a trackerless torrent; now show info on it:
transmission-show debian.torrent
Name: debian-10.4.0-amd64-netinst.iso
File: debian.torrent
GENERAL
Name: debian-10.4.0-amd64-netinst.iso
Hash: a7fbe3ac2451fc6f29562ff034fe099c998d945e
Created by: Transmission/2.92 (14714)
Created on: Mon Jun 8 00:04:33 2020
Piece Count: 2688
Piece Size: 128.0 KiB
Total Size: 352.3 MB
Privacy: Public torrent
TRACKERS
FILES
debian-10.4.0-amd64-netinst.iso (352.3 MB)
Open the port that Transmission is using on your VPS:
firewall-cmd --zone=public --add-port=51413/tcp --permanent
firewall-cmd --reload
Check it from your local PC:
sudo nmap $vps_ip -p51413
Host is up (0.24s latency).
PORT STATE SERVICE
51413/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 1.74 seconds
Add the torrent and seed it with Transmission's default username and password on your VPS (use your own if you have already changed them):
transmission-remote -n "transmission:transmission" --add debian.torrent
localhost:9091/transmission/rpc/ responded: "success"
transmission-remote -n "transmission:transmission" --list
ID Done Have ETA Up Down Ratio Status Name
1 0% None Unknown 0.0 0.0 None Idle debian-10.4.0-amd64-netinst.iso
Sum: None 0.0 0.0
transmission-remote -n "transmission:transmission" -t 1 --start
localhost:9091/transmission/rpc/ responded: "success"
Get the debian.torrent from your VPS onto your local PC:
scp root@$vps_ip:/root/debian.torrent /tmp
Now try to download it on your local PC:
aria2c --enable-dht=true /tmp/debian.torrent
06/08 09:28:04 [NOTICE] Downloading 1 item(s)
06/08 09:28:04 [NOTICE] IPv4 DHT: listening on UDP port 6921
06/08 09:28:04 [NOTICE] IPv4 BitTorrent: listening on TCP port 6956
06/08 09:28:04 [NOTICE] IPv6 BitTorrent: listening on TCP port 6956
*** Download Progress Summary as of Mon Jun 8 09:29:04 2020 ***
===============================================================================
[#a34431 0B/336MiB(0%) CN:0 SD:0 DL:0B]
FILE: /tmp/debian-10.4.0-amd64-netinst.iso
-------------------------------------------------------------------------------
I waited about one hour; the download progress stayed at 0%.
If you're using DHT, you have to open a UDP port in your firewall and then, depending on what you're doing, you can specify that port to aria2c. From the docs:
DHT uses UDP. Since aria2 doesn't configure firewalls or routers for port forwarding, it's up to you to do it manually.
$ aria2c --enable-dht --dht-listen-port=6881 file.torrent
See this page for some more examples of using DHT with aria2c.
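Tying that back to the firewall steps above, this is a sketch of the missing firewall side (assuming firewalld, as used earlier; 6881 matches the --dht-listen-port example from the docs):

```
# The earlier steps only opened 51413/tcp; DHT additionally needs a UDP port
firewall-cmd --zone=public --add-port=6881/udp --permanent
firewall-cmd --reload
```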

Jenkins up and running but can't access through browser on port 8080 AWS ubuntu

I have successfully installed Jenkins on an AWS Ubuntu 18.04 server, and the service is up and running:
jenkins.service - LSB: Start Jenkins at boot time
Loaded: loaded (/etc/init.d/jenkins; generated)
Active: active (exited) since Mon 2020-01-20 05:26:40 UTC; 8min ago
Docs: man:systemd-sysv-generator(8)
Process: 1905 ExecStop=/etc/init.d/jenkins stop (code=exited, status=0/SUCCESS)
Process: 1951 ExecStart=/etc/init.d/jenkins start (code=exited, status=0/SUCCESS)
Jan 20 05:26:39 ip-3.106.165.24 systemd[1]: Stopped LSB: Start Jenkins at boot time.
Jan 20 05:26:39 ip-3.106.165.24 systemd[1]: Starting LSB: Start Jenkins at boot time...
Jan 20 05:26:39 ip-3.106.165.24 jenkins[1951]: Correct java version found
Jan 20 05:26:39 ip-3.106.165.24 jenkins[1951]: * Starting Jenkins Automation Server jenkins
Jan 20 05:26:39 ip-3.106.165.24 su[1997]: Successful su for jenkins by root
Jan 20 05:26:39 ip-3.106.165.24 su[1997]: + ??? root:jenkins
Jan 20 05:26:39 ip-3.106.165.24 su[1997]: pam_unix(su:session): session opened for user jenkins by (uid=0)
Jan 20 05:26:39 ip-3.106.165.24 su[1997]: pam_unix(su:session): session closed for user jenkins
Jan 20 05:26:40 ip-3.106.165.24 jenkins[1951]: ...done.
Jan 20 05:26:40 ip-3.106.165.24 systemd[1]: Started LSB: Start Jenkins at boot time.
My security groups are set up with port 22 for SSH, port 8080 for Jenkins, and port 80 for HTTP.
When I attempt to access Jenkins through the web browser I get a "This site can't be reached" error.
Not sure what else I can try, as I have tried every solution under the sun but still the problem persists.
I can SSH into the server and I can access port 80 as I setup nginx on the server successfully.
Any help would be greatly appreciated
Regards
Danny
I'm not sure why port 8080 was not working. I had to change the port to 8081 in the Jenkins config file at the following location (there are multiple Jenkins config files):
/etc/default/jenkins
I also changed the security group settings in AWS to match.
It works fine now.
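For reference, the line to change (a sketch based on the Debian/Ubuntu packaging of Jenkins, where the variable lives in /etc/default/jenkins) looks like:

```
# /etc/default/jenkins
HTTP_PORT=8081
```

followed by a service restart so Jenkins picks up the new port.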
A further note: my ISP was blocking port 8080. When I hotspotted my phone, port 8080 was accessible.

airflow gunicorn [CRITICAL] WORKER TIMEOUT

I am trying to run Apache Airflow in ECS using the v1-10-stable branch of apache/airflow via my fork. I am using env variables to pass the executor, Postgres, and Redis info to the webserver.
AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:airflow_password@postgres:5432/airflow_db"
AIRFLOW__CELERY__RESULT_BACKEND="db+postgresql://airflow_user:airflow_password@postgres:5432/airflow_db"
AIRFLOW__CELERY__BROKER_URL="redis://redis_queue:6379/1"
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
AIRFLOW__CORE__LOAD_EXAMPLES=False
I am using CMD-SHELL [ -f /home/airflow/airflow/airflow-webserver.pid ] as the health check for the ECS container. I can connect to Postgres and Redis from the docker container, so there is no security-group issue either.
With docker ps I can see that the container is healthy and container port mapping with ec2 instance 0.0.0.0:32794->8080/tcp
But when I try to open the webserver UI, it doesn't load. Even with curl it's not working. I have tried curl localhost:32794 from the EC2 instance and curl localhost:8080 from the container, but neither works. telnet works in both cases.
In the container logs, I can see that the gunicorn workers are constantly timing out:
[2019-11-25 05:30:39,236] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:17337)
[2019-11-25 05:30:39 +0000] [17337] [INFO] Worker exiting (pid: 17337)
[2019-11-25 05:30:39,430] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:39,472] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:39,479] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39,447] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39,524] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39,719] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-11-25 05:30:39,930] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:40,139] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:40,244] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
[2019-11-25 05:30:40 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:17338)
[2019-11-25 05:30:40 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:17339)
[2019-11-25 05:30:40 +0000] [17393] [INFO] Booting worker with pid: 17393
[2019-11-25 05:30:40,412] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/airflow/dags
The EC2 instance is running Amazon Linux 2, and I can see these logs constantly in /var/log/messages:
Nov 25 05:57:15 ip-172-31-67-43 ec2net: [rewrite_aliases] Rewriting aliases of eth0
Nov 25 05:58:16 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 131000ms.
Nov 25 06:00:27 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 127900ms.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Created slice User Slice of root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Starting User Slice of root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Started Session 77 of user root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Starting Session 77 of user root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Removed slice User Slice of root.
Nov 25 06:01:01 ip-172-31-67-43 systemd: Stopping User Slice of root.
Nov 25 06:02:35 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 131620ms.
Nov 25 06:04:36 ip-172-31-67-43 systemd: Started Session 78 of user ec2-user.
Nov 25 06:04:36 ip-172-31-67-43 systemd-logind: New session 78 of user ec2-user.
Nov 25 06:04:36 ip-172-31-67-43 systemd: Starting Session 78 of user ec2-user.
Nov 25 06:04:46 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 125300ms.
Nov 25 06:06:52 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 115230ms.
Nov 25 06:08:47 ip-172-31-67-43 dhclient[2724]: XMT: Solicit on eth0, interval 108100ms.
Regarding these timeout errors:
[CRITICAL] WORKER TIMEOUT
You can set Gunicorn timeouts via these two Airflow environment variables:
AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT
Number of seconds the webserver waits before killing gunicorn master that doesn’t respond.
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT
Number of seconds the gunicorn webserver waits before timing out on a worker.
See the Airflow documentation for more info.
I had to solve this error for my Airflow install. I set both of these timeouts to 300 seconds and that allowed me to get the Airflow web UI loading so that I could then debug the underlying cause of the slow page load.
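As a minimal sketch of that setup (values are illustrative; the variable names follow Airflow's AIRFLOW__{SECTION}__{KEY} convention described above):

```shell
# Raise both Gunicorn timeouts to 300 seconds before launching the webserver.
export AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT=300
export AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300

# Confirm the values the webserver process will inherit.
echo "master=$AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT worker=$AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT"
```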
I was getting this error when deploying Airflow with AIRFLOW_HOME set to my EFS mount. Setting it to ~/airflow fixed the issue.

How to troubleshoot failed docker tasks

I'm just starting my way in the docker world and many (basic) principles of how everything is organized are still unclear to me. Please help me understand how we should approach troubleshooting failed docker tasks.
My docker service doesn't work, but that is a secondary problem. The primary issue is that it's totally unclear how to troubleshoot it.
This is my docker-compose file; it consists of an application service and a mongodb service. The application service writes logs to /opt/myapp/log/app.log. Complete sources can be found here. I also built a corresponding docker image and uploaded it to Docker Hub.
Let's start the stack:
docker swarm init
Swarm initialized: current node (xpkngdn0vpr73nioalzbkem1k) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-6109nv6pn7eb9gtam8bq4m198k5sk7ztzf7hy7yfv5c47kcrmq-9fbrmmccd977kx22mivs7segn 192.168.65.3:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
docker stack deploy -c docker-compose.yml myapp
Creating network myapp_default
Creating service myapp_web
Creating service myapp_db
After that, let's wait a little while (~1 minute) and proceed:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
31e9f3a8f5aa deniszhdanov/docker-swarm-troobleshoot-service:1 "java -jar /opt/myap…" 31 seconds ago Up 25 seconds 8090/tcp myapp_web.1.sij3z7cbbynsxos6608ru2f8a
9fc6a5868c12 mongo:latest "docker-entrypoint.s…" About a minute ago Up About a minute 27017/tcp myapp_db.1.gxl5xwj1tg80nr16clbskk2oc
a3ff2ba0c8c5 deniszhdanov/docker-swarm-troobleshoot-service:1 "java -jar /opt/myap…" About a minute ago Exited (137) 32 seconds ago myapp_web.1.3dv8x2dx6kig4qkf1wc2axro8
We see that there is a failed task. Let's try to understand what went wrong:
docker commit a3ff2ba0c8c5 snapshot
sha256:bec4756cadebbada400b4d1037cac671168396bf73b7d3e875c6f98f63522afd
docker run --rm -it snapshot /bin/sh
/opt/myapp # cat /opt/myapp/log/app.log
2018-10-05 15:34:20 - Starting Start on a3ff2ba0c8c5 with PID 1 (/opt/myapp/lib/myapp.jar started by root in /opt/myapp)
2018-10-05 15:34:21 - No active profile set, falling back to default profiles: default
The task's log doesn't contain enough information to troubleshoot the problem. However, that data is available when we run the application image standalone:
docker run --rm -d deniszhdanov/docker-swarm-troobleshoot-service:1
825a818b425feb7ed1f593c14a411efb68457aee9c6bfcf27f745fd58cfa0001
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
825a818b425f deniszhdanov/docker-swarm-troobleshoot-service:1 "java -jar /opt/myap…" 46 seconds ago Up 44 seconds 8090/tcp zen_bardeen
docker exec -it 825a818b425f /bin/sh
/opt/myapp # cat /opt/myapp/log/app.log
2018-10-05 15:30:09 - Starting Start on 825a818b425f with PID 1 (/opt/myapp/lib/myapp.jar started by root in /opt/myapp)
2018-10-05 15:30:09 - No active profile set, falling back to default profiles: default
2018-10-05 15:30:09 - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@18ef96: startup date [Fri Oct 05 15:30:09 GMT 2018]; root of context hierarchy
2018-10-05 15:30:10 - Cluster created with settings {hosts=[db:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
2018-10-05 15:30:10 - Adding discovered server db:27017 to client view of cluster
2018-10-05 15:30:11 - Exception in monitor thread while connecting to server db:27017
com.mongodb.MongoSocketException: db: Name does not resolve
at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:188)
at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:59)
at com.mongodb.connection.SocketStream.open(SocketStream.java:57)
at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:126)
at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:114)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: db: Name does not resolve
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:186)
... 5 common frames omitted
Huh, it took a while to describe all of that; thanks to everyone who managed to read this far :)
Questions:
Why do we have different states for standalone containers and tasks?
What is the recommended way to troubleshoot failing tasks?
Regards, Denis
Eventually, I found this docker troubleshooting page, checked the docker daemon logs, and found this:
2018-10-06 11:58:29.662461+0800 localhost com.docker.hyperkit[583]: [91168.810550] CPU: 3 PID: 50578 Comm: java Not tainted 4.9.93-linuxkit-aufs #1
2018-10-06 11:58:29.663013+0800 localhost com.docker.hyperkit[583]: [91168.811356] Hardware name: BHYVE, BIOS 1.00 03/14/2014
2018-10-06 11:58:29.663984+0800 localhost com.docker.hyperkit[583]: [91168.811909] 0000000000000000 ffffffffa243922a ffffbb9dc0763de8 ffff937eb67aed00
2018-10-06 11:58:29.664792+0800 localhost com.docker.hyperkit[583]: [91168.812878] ffffffffa21f5d85 0000000000000000 0000000000000000 ffffbb9dc0763de8
2018-10-06 11:58:29.665681+0800 localhost com.docker.hyperkit[583]: [91168.813694] ffff937e6df576a0 0000000000000202 ffffffffa27f9dae ffff937eb67aed00
2018-10-06 11:58:29.666082+0800 localhost com.docker.hyperkit[583]: [91168.814585] Call Trace:
2018-10-06 11:58:29.666638+0800 localhost com.docker.hyperkit[583]: [91168.814984] [<ffffffffa243922a>] ? dump_stack+0x5a/0x6f
2018-10-06 11:58:29.667238+0800 localhost com.docker.hyperkit[583]: [91168.815534] [<ffffffffa21f5d85>] ? dump_header+0x78/0x1ed
2018-10-06 11:58:29.667980+0800 localhost com.docker.hyperkit[583]: [91168.816144] [<ffffffffa27f9dae>] ? _raw_spin_unlock_irqrestore+0x16/0x18
2018-10-06 11:58:29.668686+0800 localhost com.docker.hyperkit[583]: [91168.816878] [<ffffffffa21a1f90>] ? oom_kill_process+0x83/0x324
2018-10-06 11:58:29.669273+0800 localhost com.docker.hyperkit[583]: [91168.817583] [<ffffffffa21a25b7>] ? out_of_memory+0x239/0x267
2018-10-06 11:58:29.669944+0800 localhost com.docker.hyperkit[583]: [91168.818162] [<ffffffffa21ef2cd>] ? mem_cgroup_out_of_memory+0x4b/0x79
2018-10-06 11:58:29.670652+0800 localhost com.docker.hyperkit[583]: [91168.818834] [<ffffffffa21f34a6>] ? mem_cgroup_oom_synchronize+0x26b/0x294
2018-10-06 11:58:29.671338+0800 localhost com.docker.hyperkit[583]: [91168.819560] [<ffffffffa21ef650>] ? mem_cgroup_is_descendant+0x48/0x48
2018-10-06 11:58:29.671982+0800 localhost com.docker.hyperkit[583]: [91168.820253] [<ffffffffa21a2612>] ? pagefault_out_of_memory+0x2d/0x6f
2018-10-06 11:58:29.672615+0800 localhost com.docker.hyperkit[583]: [91168.820886] [<ffffffffa20459b0>] ? __do_page_fault+0x3c6/0x45f
2018-10-06 11:58:29.673158+0800 localhost com.docker.hyperkit[583]: [91168.821516] [<ffffffffa27fb3c8>] ? page_fault+0x28/0x30
2018-10-06 11:58:29.677517+0800 localhost com.docker.hyperkit[583]: [91168.822159] Task in /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235 killed as a result of limit of /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235
2018-10-06 11:58:29.678200+0800 localhost com.docker.hyperkit[583]: [91168.826457] memory: usage 51188kB, limit 51200kB, failcnt 8764
2018-10-06 11:58:29.678913+0800 localhost com.docker.hyperkit[583]: [91168.827160] memory+swap: usage 102400kB, limit 102400kB, failcnt 78
2018-10-06 11:58:29.679469+0800 localhost com.docker.hyperkit[583]: [91168.827815] kmem: usage 884kB, limit 9007199254740988kB, failcnt 0
2018-10-06 11:58:29.681923+0800 localhost com.docker.hyperkit[583]: [91168.828391] Memory cgroup stats for /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235: cache:20KB rss:50284KB rss_huge:0KB mapped_file:4KB dirty:8KB writeback:0KB swap:51212KB inactive_anon:25264KB active_anon:25020KB inactive_file:8KB active_file:8KB unevictable:0KB
2018-10-06 11:58:29.682630+0800 localhost com.docker.hyperkit[583]: [91168.830820] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
2018-10-06 11:58:29.683423+0800 localhost com.docker.hyperkit[583]: [91168.831586] [50168] 0 50168 499844 14545 94 5 12964 0 java
2018-10-06 11:58:29.684186+0800 localhost com.docker.hyperkit[583]: [91168.832329] Memory cgroup out of memory: Kill process 50168 (java) score 1078 or sacrifice child
2018-10-06 11:58:29.685451+0800 localhost com.docker.hyperkit[583]: [91168.833172] Killed process 50168 (java) total-vm:1999376kB, anon-rss:49108kB, file-rss:9072kB, shmem-rss:0kB
2018-10-06 11:58:29.970702+0800 localhost com.docker.hyperkit[583]: [91169.119073] oom_reaper: reaped process 50168 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2018-10-06 11:58:30.206071+0800 localhost com.docker.driver.amd64-linux[579]: osxfs: die event: de-registering container bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235
I.e. the reason was that I had inadvertently copied a resources/limits/memory setup from one of the docker-compose tutorials, and docker kept killing my app because of it.
At the end of the day the problem was trivial and the troubleshooting was not hard either: it was only necessary to look into the docker daemon logs. Only due to my inexperience with docker did I spend an evening trying to find the root cause on the docker container/swarm side (service logs, container logs, etc.). Well, practice makes perfect :)
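For reference, the kind of compose stanza that caused this looks like the following sketch (service and image names are taken from the docker ps output above; the 50M value matches the ~51200kB limit visible in the OOM log). Raising or removing the limit stops the kernel from killing the JVM:

```
services:
  web:
    image: deniszhdanov/docker-swarm-troobleshoot-service:1
    deploy:
      resources:
        limits:
          memory: 50M   # too small for a JVM; the kernel OOM-kills the container
```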

Kubernetes master not starting up in OpenStack Heat

I have been trying to set up a Kubernetes cluster for the last week or so in OpenStack using this guide. I have faced a few issues in the process, one of which is described in this question -> kube-up.sh fails in OpenStack
On issuing the ./cluster/kube-up.sh script, it tries to bring up the cluster using the openstack stack create step (Log). Here, for some reason, the kubernetes master does not come up properly, and this is where the installation fails. I was able to SSH into the master node and found this in /var/log/cloud-init-output.log:
[..]
Complete!
* INFO: Running install_centos_stable_post()
* INFO: Running install_centos_check_services()
* INFO: Running install_centos_restart_daemons()
* INFO: Running daemons_running()
* INFO: Salt installed!
2017-01-02 12:57:31,574 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2017-01-02 12:57:31,576 - util.py[WARNING]: Running scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 0.7.5 finished at Mon, 02 Jan 2017 12:57:31 +0000. Datasource DataSourceOpenStack [net,ver=2]. Up 211.20 seconds
On digging further I found this snippet in the /var/log/messages file -> https://paste.ubuntu.com/23733430/
So I would assume that the Docker daemon is not starting up. Also, something is wrong with my etcd configuration, because of which the flanneld service is not starting either. Here is the output of service flanneld status:
● flanneld.service - Flanneld overlay address etcd agent
Loaded: loaded (/usr/lib/systemd/system/flanneld.service; enabled; vendor preset: disabled)
Active: activating (start) since Tue 2017-01-03 13:32:37 UTC; 48s ago
Main PID: 15666 (flanneld)
CGroup: /system.slice/flanneld.service
└─15666 /usr/bin/flanneld -etcd-endpoints= -etcd-prefix= -iface=eth0 --ip-masq
Jan 03 13:33:16 kubernetesstack-master flanneld[15666]: E0103 13:33:16.229827 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:17 kubernetesstack-master flanneld[15666]: E0103 13:33:17.230082 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:18 kubernetesstack-master flanneld[15666]: E0103 13:33:18.230326 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:19 kubernetesstack-master flanneld[15666]: E0103 13:33:19.230560 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:20 kubernetesstack-master flanneld[15666]: E0103 13:33:20.230822 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:21 kubernetesstack-master flanneld[15666]: E0103 13:33:21.231325 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:22 kubernetesstack-master flanneld[15666]: E0103 13:33:22.231581 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:23 kubernetesstack-master flanneld[15666]: E0103 13:33:23.232140 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:24 kubernetesstack-master flanneld[15666]: E0103 13:33:24.234041 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jan 03 13:33:25 kubernetesstack-master flanneld[15666]: E0103 13:33:25.234277 15666 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
My etcd daemon is running:
[root@kubernetesstack-master salt]# netstat -tanlp | grep etcd
tcp 0 0 192.168.173.3:4379 0.0.0.0:* LISTEN 20338/etcd
tcp 0 0 192.168.173.3:4380 0.0.0.0:* LISTEN 20338/etcd
Although it's running on a non-standard port.
I'm also on a corporate network behind a proxy. Any pointers on how to debug this further are appreciated; as of now I have reached a dead end. Asking in the kubernetes Slack channels has also produced zero results!
/usr/bin/flanneld -etcd-endpoints=
That line is the source of your troubles, assuming you didn't elide the output before posting it. Your situation is made worse by etcd running on non-standard ports, but thankfully I think the solution to both problems is the same fix.
I would expect running systemctl cat flanneld.service (you may need sudo, depending on the strictness of your systemd setup) to output the unified systemd descriptor for flanneld, including any "drop-ins", overrides, etc, and if my theory is correct, one of them will be either Environment= or EnvironmentFile= and that's the place I bet flanneld.service expected to have ETCD_ENDPOINTS= or FLANNELD_ETCD_ENDPOINTS= (as seen here) available to the Exec.
So hopefully that file is either missing or is actually blank, and in either case you are one swift vi away from teaching flanneld about your etcd endpoints, and everything being well in the world again.
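As a sketch of what the fix might look like (the file path and variable names are assumptions; systemctl cat flanneld.service will show the EnvironmentFile= your unit actually reads, and the port comes from the netstat output above):

```
# /etc/sysconfig/flanneld
FLANNELD_ETCD_ENDPOINTS="http://192.168.173.3:4379"
FLANNELD_ETCD_PREFIX="/coreos.com/network"
```

After saving it, systemctl restart flanneld should show -etcd-endpoints populated in the running command line.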
