Run dask scheduler and workers in Amazon's ECS (Fargate) - docker

I tried to run scheduler and worker Docker containers on Amazon's ECS.
I'm using this example:
https://docs.dask.org/en/latest/setup/docker.html
The scheduler works perfectly; I successfully connected to it from my local machine:
distributed.scheduler - INFO - Receive client connection: Client-0ae5b0fa
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove client Client-0ae5b0fa
distributed.scheduler - INFO - Close client connection: Client-0ae5b0fa
distributed.scheduler - INFO - Remove client Client-0ae5b0fa
I tried to run the worker the same way, with this command:
dask-worker tcp://SCHEDULER_PUBLIC_IP:8786
The worker writes these logs and exits:
+ '[' '' ']'
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
no environment.yml
+ '[' '' ']'
+ exec 'dask-worker tcp://SCHEDULER_PUBLIC_IP:8786'
/usr/bin/prepare.sh: line 30: /dask-worker tcp://SCHEDULER_PUBLIC_IP:8786: No such file or directory
I expected the worker to connect to the scheduler, because the same commands worked when I tried them on an EC2 instance. I also tried this with all ports open to TCP connections, and still nothing.
Environment:
Dask docker container version: 6bfa3b19b4be (1 AUG 2021) (latest)
Fargate version: 1.4.0 (latest)
Container has 2 vCPUs, 4 GB memory

The problem was that my command was not comma-delimited. It was:
dask-worker 1.1.1.1:8786
It is supposed to be:
dask-worker,1.1.1.1:8786
so that Docker understands these are separate arguments:
Command ["dask-worker","1.1.1.1:8786"]
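For reference, this is roughly what the same thing looks like when registering the task definition through the AWS CLI instead of the console form. This is a minimal sketch: the daskdev/dask image and the scheduler address are placeholders for your own values.
# The command is an explicit JSON array, one element per argument:
aws ecs register-task-definition \
  --family dask-worker \
  --requires-compatibilities FARGATE \
  --network-mode awsvpc \
  --cpu 2048 --memory 4096 \
  --container-definitions '[{
    "name": "dask-worker",
    "image": "daskdev/dask:latest",
    "command": ["dask-worker", "tcp://SCHEDULER_PUBLIC_IP:8786"]
  }]'
In JSON form the split into separate arguments is explicit, which is why the console's single text field needs the comma.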

Related

SSL(curl) connection error in ElasticSearch setup

I have set up a 3-node Elasticsearch cluster using docker-compose, following the steps below.
One of the master nodes, es11, gets the error below; the same curl command works fine on the other two nodes, es12 and es13:
Error:
curl -X GET 'https://localhost:9316'
curl: (35) Encountered end of file
The error below appears in the logs:
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es13][SOMEIP:9316][internal:cluster/coordination/join]",
"Caused by: org.elasticsearch.transport.ConnectTransportException: [es11][SOMEIP:9316] handshake failed. unexpected remote node {es13}{SOMEVALUE}{SOMEVALUE
"at org.elasticsearch.transport.TransportService.lambda$connectionValidator$6(TransportService.java:468) ~[elasticsearch-7.17.6.jar:7.17.6]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:95) ~[elasticsearch-7.17.6.jar:7.17.6]",
"at org.elasticsearch.transport.TransportService.lambda$handshake$9(TransportService.java:577) ~[elasticsearch-7.17.6.jar:7.17.6]",
https://localhost:9316 in a browser gives a "site can't be reached" error as well. It seems the SSL certificate created in step 4 below has some issue on es11.
Any leads? Or, if I repeat step 4, do I need to copy the certs again to es12 and es13?
Below is elasticsearch.yml:
cluster.name: "docker-cluster"
network.host: 0.0.0.0
Ports as defined in all 3 nodes' docker-compose.yml:
environment:
  - node.name=es11
  - transport.port=9316
ports:
  - 9216:9200
  - 9316:9316
1. Initialize a Docker swarm: on es11 run docker swarm init, then follow the instructions to join es12 and es13 to the swarm.
2. Create an overlay network: docker network create -d overlay --attachable elastic
3. If necessary, bring down the current cluster and remove all the associated volumes by running docker-compose down -v
4. Create SSL certificates for ES with docker-compose -f create-certs.yml run --rm create_certs
5. Copy the certs for es12 and es13 to the respective servers.
6. Use busybox to create the overlay network on es12 and es13: sudo docker run -itd --name containerX --net [network name] busybox
7. Configure certs on es12 and es13 with docker-compose -f config-certs.yml run --rm config_certs
8. Start the cluster with docker-compose up -d on each server.
9. Set the passwords for the built-in ES accounts by logging into the cluster (docker exec -it es11 sh), then running bin/elasticsearch-setup-passwords interactive --url localhost:9316
(as per your https://discuss.elastic.co thread)
You cannot talk HTTP to the transport protocol port, which you have defined in transport.port. You need to talk to port 9200 in the container, which you have mapped to 9216 outside the container.
The transport port runs a binary protocol and is not HTTP accessible.
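Concretely, something like this should reach the cluster from the host (a sketch; -k skips CA verification, or point --cacert at the CA created in step 4):
# Talk to the HTTP API on the mapped port, not the transport port:
curl -k -u elastic 'https://localhost:9216'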

Cannot run nodetool commands and cqlsh to Scylla in Docker

I am new to Scylla and I am following the instructions to try it in a container as per this page: https://hub.docker.com/r/scylladb/scylla/.
The following command ran fine.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla
I see the container is running.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e6c4e19ff1bd scylladb/scylla "/docker-entrypoint.…" 14 seconds ago Up 13 seconds 22/tcp, 7000-7001/tcp, 9042/tcp, 9160/tcp, 9180/tcp, 10000/tcp some-scylla
However, I'm unable to use nodetool or cqlsh. I get the following output.
$ docker exec -it some-scylla nodetool status
Using /etc/scylla/scylla.yaml as the config file
nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)
See 'nodetool help' or 'nodetool help <command>'.
and
$ docker exec -it some-scylla cqlsh
Connection error: ('Unable to connect to any servers', {'172.17.0.2': error(111, "Tried connecting to [('172.17.0.2', 9042)]. Last error: Connection refused")})
Any ideas?
Update
Looking at docker logs some-scylla, I see some errors in the logs; the last one is as follows.
2021-10-03 07:51:04,771 INFO spawned: 'scylla' with pid 167
Scylla version 4.4.4-0.20210801.69daa9fd0 with build-id eb11cddd30e88ef39c32c847e70181b5cf786355 starting ...
command used: "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --developer-mode=1 --overprovisioned --listen-address 172.17.0.2 --rpc-address 172.17.0.2 --seed-provider-parameters seeds=172.17.0.2 --blocked-reactor-notify-ms 999999999"
parsed command line options: [log-to-syslog: 0, log-to-stdout: 1, default-log-level: info, network-stack: posix, developer-mode: 1, overprovisioned, listen-address: 172.17.0.2, rpc-address: 172.17.0.2, seed-provider-parameters: seeds=172.17.0.2, blocked-reactor-notify-ms: 999999999]
ERROR 2021-10-03 07:51:05,203 [shard 6] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application
2021-10-03 07:51:05,316 INFO exited: scylla (exit status 1; not expected)
2021-10-03 07:51:06,318 INFO gave up: scylla entered FATAL state, too many start retries too quickly
Update 2
The reason for the error was described on the Docker Hub page linked above. I had to start the container specifying the number of CPUs with --smp 1, as follows.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla --smp 1
According to the above page:
This command will start a Scylla single-node cluster in developer mode
(see --developer-mode 1) limited by a single CPU core (see --smp).
Production grade configuration requires tuning a few kernel parameters
such that limiting number of available cores (with --smp 1) is the
simplest way to go.
Multiple cores requires setting a proper value to the
/proc/sys/fs/aio-max-nr. On many non production systems it will be
equal to 65K. ...
As you have found out, in order to use additional CPU cores you'll need to increase the fs.aio-max-nr kernel parameter.
You may run as root:
# sysctl -w fs.aio-max-nr=65535
which should be enough for most systems. Should you still see an error preventing Scylla from using all of your CPU cores, increase the value further.
Note that the above setting is not persistent. Edit /etc/sysctl.conf to make it persistent across reboots.
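For example (a minimal sketch, run as root, using the value from above):
# Append the setting to /etc/sysctl.conf and reload so it survives reboots:
echo 'fs.aio-max-nr = 65535' >> /etc/sysctl.conf
sysctl -p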

Cannot connect to Docker container running in VSTS

I have a test which starts a Docker container, performs the verification (which is talking to the Apache httpd in the Docker container), and then stops the Docker container.
When I run this test locally, this test runs just fine. But when it runs on hosted VSTS, thus a hosted build agent, it cannot connect to the Apache httpd in the Docker container.
This is the .vsts-ci.yml file:
queue: Hosted Linux Preview
steps:
- script: |
    ./test.sh
This is the test.sh shell script to reproduce the problem:
#!/bin/bash
set -e
set -o pipefail
function tearDown {
  docker stop test-apache
  docker rm test-apache
}
trap tearDown EXIT
docker run -d --name test-apache -p 8083:80 httpd
sleep 10
curl -D - http://localhost:8083/
When I run this test locally, the output that I get is:
$ ./test.sh
469d50447ebc01775d94e8bed65b8310f4d9c7689ad41b2da8111fd57f27cb38
HTTP/1.1 200 OK
Date: Tue, 04 Sep 2018 12:00:17 GMT
Server: Apache/2.4.34 (Unix)
Last-Modified: Mon, 11 Jun 2007 18:53:14 GMT
ETag: "2d-432a5e4a73a80"
Accept-Ranges: bytes
Content-Length: 45
Content-Type: text/html
<html><body><h1>It works!</h1></body></html>
test-apache
test-apache
This output is exactly as I expect.
But when I run this test on VSTS, the output that I get is (irrelevant parts replaced with …).
2018-09-04T12:01:23.7909911Z ##[section]Starting: CmdLine
2018-09-04T12:01:23.8044456Z ==============================================================================
2018-09-04T12:01:23.8061703Z Task : Command Line
2018-09-04T12:01:23.8077837Z Description : Run a command line script using cmd.exe on Windows and bash on macOS and Linux.
2018-09-04T12:01:23.8095370Z Version : 2.136.0
2018-09-04T12:01:23.8111699Z Author : Microsoft Corporation
2018-09-04T12:01:23.8128664Z Help : [More Information](https://go.microsoft.com/fwlink/?LinkID=613735)
2018-09-04T12:01:23.8146694Z ==============================================================================
2018-09-04T12:01:26.3345330Z Generating script.
2018-09-04T12:01:26.3392080Z Script contents:
2018-09-04T12:01:26.3409635Z ./test.sh
2018-09-04T12:01:26.3574923Z [command]/bin/bash --noprofile --norc /home/vsts/work/_temp/02476800-8a7e-4e22-8715-c3f706e3679f.sh
2018-09-04T12:01:27.7054918Z Unable to find image 'httpd:latest' locally
2018-09-04T12:01:30.5555851Z latest: Pulling from library/httpd
2018-09-04T12:01:31.4312351Z d660b1f15b9b: Pulling fs layer
[…]
2018-09-04T12:01:49.1468474Z e86a7f31d4e7506d34e3b854c2a55646eaa4dcc731edc711af2cc934c44da2f9
2018-09-04T12:02:00.2563446Z % Total % Received % Xferd Average Speed Time Time Time Current
2018-09-04T12:02:00.2583211Z Dload Upload Total Spent Left Speed
2018-09-04T12:02:00.2595905Z
2018-09-04T12:02:00.2613320Z 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 8083: Connection refused
2018-09-04T12:02:00.7027822Z test-apache
2018-09-04T12:02:00.7642313Z test-apache
2018-09-04T12:02:00.7826541Z ##[error]Bash exited with code '7'.
2018-09-04T12:02:00.7989841Z ##[section]Finishing: CmdLine
The key thing is this:
curl: (7) Failed to connect to localhost port 8083: Connection refused
10 seconds should be enough for Apache to start.
Why can curl not communicate with Apache on its port 8083?
P.S.:
I know that a hard-coded port like this is rubbish and that I should use an ephemeral port instead. I wanted to get it running with a hard-coded port first, because that's simpler than using an ephemeral port, and then switch to an ephemeral port as soon as the hard-coded one works. And if the hard-coded port didn't work because the port was unavailable, the error would look different: in that case, docker run would fail because the port can't be allocated.
Update:
Just to be sure, I've rerun the test with sleep 100 instead of sleep 10. The results are unchanged, curl cannot connect to localhost port 8083.
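(To take timing out of the equation entirely, the fixed sleep could be replaced by polling until Apache answers; a minimal sketch using coreutils timeout:)
# Poll for up to 60 seconds instead of sleeping a fixed time:
timeout 60 sh -c 'until curl -fsS http://localhost:8083/ >/dev/null; do sleep 1; done'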
Update 2:
When extending the script to execute docker logs, docker logs shows that Apache is running as expected.
When extending the script to execute docker ps, it shows the following output:
2018-09-05T00:02:24.1310783Z CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2018-09-05T00:02:24.1336263Z 3f59aa014216 httpd "httpd-foreground" About a minute ago Up About a minute 0.0.0.0:8083->80/tcp test-apache
2018-09-05T00:02:24.1357782Z 850bda64f847 microsoft/vsts-agent:ubuntu-16.04-docker-17.12.0-ce-standard "/home/vsts/agents/2…" 2 minutes ago Up 2 minutes musing_booth
The problem is that the VSTS build agent runs in a Docker container. When the Docker container for Apache is started, it runs on the same level as the VSTS build agent Docker container, not nested inside the VSTS build agent Docker container.
There are two possible solutions:
1. Replacing localhost with the IP address of the Docker host, keeping the port number 8083.
2. Replacing localhost with the IP address of the Docker container, changing the host port number 8083 to the container port number 80.
Access via the Docker Host
In this case, the solution is to replace localhost with the IP address of the Docker host. The following shell snippet can do that:
host=localhost
if grep '^1:name=systemd:/docker/' /proc/1/cgroup
then
    apt-get update
    apt-get install -y net-tools
    host=$(route -n | grep '^0.0.0.0' | sed -e 's/^0.0.0.0\s*//' -e 's/ .*//')
fi
curl -D - http://$host:8083/
curl -D - http://$host:8083/
The if grep '^1:name=systemd:/docker/' /proc/1/cgroup check inspects whether the script is running inside a Docker container. If so, it installs net-tools to get access to the route command, and then parses the default gateway out of the route output to obtain the IP address of the host. Note that this only works if the container's default gateway actually is the host.
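If installing net-tools is undesirable, the default gateway can usually be read with iproute2 instead, which most images already ship (a sketch under that assumption):
# The container's default gateway is the Docker host on the default bridge network:
host=$(ip route show default | awk '{print $3}')
curl -D - http://$host:8083/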
Direct Access to the Docker Container
After launching the Docker container, its IP addresses can be obtained with the following command:
docker container inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <container-id>
Replace <container-id> with your container id or name.
So, in this case, it would be (assuming that the first IP address is okay):
ips=($(docker container inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' test-apache))
host=${ips[0]}
curl http://$host/

neo4j-shell can not connect to neo4j Server

I'm using the Docker version of neo4j (v3.1.0) and I'm having difficulties connecting to the neo4j server using neo4j-shell.
After running an instance of neo4j:3.1.0 in Docker, I run a bash inside the container:
$ docker exec -it neo4j /bin/bash
And from there I try to run the neo4j-shell like this:
/var/lib/neo4j/bin/neo4j-shell
But it errors:
$ /var/lib/neo4j/bin/neo4j-shell
ERROR (-v for expanded information):
Connection refused
-host Domain name or IP of host to connect to (default: localhost)
-port Port of host to connect to (default: 1337)
-name RMI name, i.e. rmi://<host>:<port>/<name> (default: shell)
-pid Process ID to connect to
-c Command line to execute. After executing it the shell exits
-file File containing commands to execute, or '-' to read from stdin. After executing it the shell exits
-readonly Connect in readonly mode (only for connecting with -path)
-path Points to a neo4j db path so that a local server can be started there
-config Points to a config file when starting a local server
Example arguments for remote:
-port 1337
-host 192.168.1.234 -port 1337 -name shell
-host localhost -readonly
...or no arguments for default values
Example arguments for local:
-path /path/to/db
-path /path/to/db -config /path/to/neo4j.config
-path /path/to/db -readonly
I also tried other hosts: localhost, 127.0.0.1, and 172.17.0.6 (the container IP). Since that didn't work, I tried to list the open ports on my container:
$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 :::7687 :::* LISTEN
tcp 0 0 :::7473 :::* LISTEN
tcp 0 0 :::7474 :::* LISTEN
Active UNIX domain sockets (only servers)
Proto RefCnt Flags Type State I-Node Path
As you can see, port 1337 is not open! I've looked into the config file, and the line specifying the port is commented out, which means it should be set to its default value (1337).
Can anyone help me connect to neo4j using neo4j-shell?
BTW, the neo4j server is up and running and I can use its web access through port :7474.
In 3.1 it seems the shell is not enabled by default.
You will need to pass your own configuration file with the shell enabled; uncomment:
# Enable a remote shell server which Neo4j Shell clients can log in to.
dbms.shell.enabled=true
(I find the amount of work for changing one value in Docker quite heavy, but yeah..)
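As an alternative to baking a config file into the image, the official neo4j image maps environment variables onto configuration settings (dots become underscores, with a NEO4J_ prefix); a sketch, assuming the 3.1 image supports that convention:
# Enable the shell via an environment variable instead of editing neo4j.conf:
docker run -d --name neo4j -p 7474:7474 -p 7687:7687 \
  -e NEO4J_dbms_shell_enabled=true neo4j:3.1.0
docker exec -it neo4j /var/lib/neo4j/bin/neo4j-shell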
Or use the new cypher-shell:
ikwattro@graphaware-team ~> docker ps -a | grep 'neo4j'
34b3c6718504 neo4j:3.1.0 "/docker-entrypoint.s" 2 minutes ago Up 2 minutes 7473-7474/tcp, 7687/tcp compassionate_easley
2395bd0b1fe9 neo4j:3.1.0 "/docker-entrypoint.s" 5 minutes ago Exited (143) 3 minutes ago cranky_goldstine
949feacbc0f9 neo4j:3.1.0 "/docker-entrypoint.s" 5 minutes ago Exited (130) 5 minutes ago modest_boyd
c38572b078de neo4j:3.0.6-enterprise "/docker-entrypoint.s" 6 weeks ago Exited (0) 6 weeks ago fastfishpim_neo4j_1
ikwattro@graphaware-team ~> docker exec --interactive --tty compassionate_easley bin/cypher-shell
username: neo4j
password: *****
Connected to Neo4j 3.1.0 at bolt://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j>
NB: cypher-shell supports begin and commit:
neo4j> :begin
neo4j# create (n:Node);
Added 1 nodes, Added 1 labels
neo4j# :commit;
neo4j>
-
neo4j> :begin
neo4j# create (n:Person {name:"John"});
Added 1 nodes, Set 1 properties, Added 1 labels
neo4j# :rollback
neo4j> :commit
There is no open transaction to commit
neo4j>
http://neo4j.com/docs/operations-manual/current/tools/cypher-shell/

Docker Cloud Service Discovery Two Containers

In Docker Cloud I am trying to get my container to talk to the other container. I believe the problem is the hostname not resolving (this is set in /conf.d/kafka.yaml, shown below).
To get the two containers to communicate, I have tried many variations, including the full hostname kafka-development-1, kafka-development-1.kafka, etc.
The error I keep getting is in the datadog-agent info. Within the container I run ./etc/init.d/datadog-agent info and receive:
kafka
-----
- instance #kafka-kafka-development-9092 [ERROR]: 'Cannot connect to instance
kafka-development:9092 java.io.IOException: Failed to retrieve RMIServer stub:
javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: \n\tjava.net.SocketException: Connection reset]' collected 0 metrics
- Collected 0 metrics, 0 events & 0 service checks
The steps I take, in detail:
SSH into the Docker node:
$ docker ps
CONTAINER | PORTS
datadog-agent-kafka-development-1.2fb73f62 | 8125/udp, 9001/tcp
kafka-development-1.3dc7c2d0 | 0.0.0.0:9092->9092/tcp
I log into the containers to see their values. This is the datadog-agent:
$ docker exec -it datadog-agent-kafka-development-1.2fb73f62 /bin/bash
$ > echo $DOCKERCLOUD_CONTAINER_HOSTNAME
datadog-agent-kafka-development-1
$ > tail /etc/hosts
172.17.0.7 datadog-agent-kafka-development-1
10.7.0.151 datadog-agent-kafka-development-1
This is the kafka container:
$ docker exec -it kafka-development-1.3dc7c2d0 /bin/bash
$ > echo $DOCKERCLOUD_CONTAINER_HOSTNAME
kafka-development-1
$ > tail /etc/hosts
172.17.0.6 kafka-development-1
10.7.0.8 kafka-development-1
$ > echo $KAFKA_ADVERTISED_HOST_NAME
kafka-development.c23d1d00.svc.dockerapp.io
$ > echo $KAFKA_ADVERTISED_PORT
9092
$ > echo $KAFKA_ZOOKEEPER_CONNECT
zookeeper-development:2181
Datadog conf.d/kafka.yaml:
instances:
  - host: kafka-development
    port: 9092 # This is the JMX port on which Kafka exposes its metrics (usually 9999)
    tags:
      kafka: broker
      env: development
    # ... Defaults Below
Can anyone see what I am doing wrong?
