Flink cannot be run in Marathon - docker

I have three physical nodes with docker installed on them. I have one docker container with Mesos, Marathon, Hadoop and Flink. I configured Master node and Slave nodes for Mesos,Zookeeper and Marathon. I do these works step by step.
First, In Master node, I enter to docker container with this command:
docker run -v /home/user/.ssh:/root/.ssh --privileged -p 5050:5050 -p 5051:5051 -p 5052:5052 -p 2181:2181 -p 8082:8081 -p 6123:6123 -p 8080:8080 -p 50090:50090 -p 50070:50070 -p 9000:9000 -p 2888:2888 -p 3888:3888 -p 4041:4040 -p 7077:7077 -p 52222:22 -e WEAVE_CIDR=10.32.0.2/12 -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins -e LIBPROCESS_IP=10.32.0.2 -e MESOS_RESOURCES=ports*:[11000-11999] -ti hadoop_marathon_mesos_flink_2 /bin/bash
Then run Mesos and Zookeeper :
/home/zookeeper-3.4.14/bin/zkServer.sh restart
/home/mesos-1.7.2/build/bin/mesos-master.sh --ip=10.32.0.1 --hostname=10.32.0.1 --roles=marathon,flink --quorum=1 --work_dir=/var/run/mesos --log_dir=/var/log/mesos
After that run Marathon in the same container:
/home/marathon-1.7.189-48bfd6000/bin/marathon --master 10.32.0.1:5050 --zk zk://10.32.0.1:2181/marathon --hostname 10.32.0.1 --webui_url 10.32.0.1:8080 --logging_level debug
And finally, I run hadoop:
/opt/hadoop/sbin/start-dfs.sh
Marathon,Mesos and Hadoop are run without any problems.
The most important part of my work is running Flink in Marathon. I configured Flink in docker container like this:
env.java.home: /opt/java
jobmanager.rpc.address: 10.32.0.1
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum: 10.32.0.1:2181,10.32.0.2:2181
recovery.zookeeper.path.mesos-workers: /mesos-workers
In Marathon UI, I create Application and put this JSON file on it, but it is failed.
{
"id": "flink",
"cmd": "/home/flink-1.7.0/bin/mesos-appmaster.sh
-Dmesos.master=10.32.0.1:5050,10.32.0.2:5050
-Dmesos.initial-tasks=1",
"cpus": 1.0,
"mem": 1024
}
Flink application is failed in Mesos UI. It shows this error:
I0428 06:01:39.586699 6155 exec.cpp:162] Version: 1.7.2
I0428 06:01:39.596458 6154 exec.cpp:236] Executor registered on agent 984595ae-e811-48fb-a9f5-ca6128e1cc1a-S0
I0428 06:01:39.598870 6157 executor.cpp:188] Received SUBSCRIBED event
I0428 06:01:39.599761 6157 executor.cpp:192] Subscribed executor on 10.32.0.3
I0428 06:01:39.599963 6157 executor.cpp:188] Received LAUNCH event
I0428 06:01:39.601236 6157 executor.cpp:697] Starting task flink.16a7cc18-697b-11e9-928f-ce235caa831e
I0428 06:01:39.613719 6157 executor.cpp:712] Forked command at 6163
I0428 06:01:39.787395 6157 executor.cpp:1013] Command exited with status 1 (pid: 6163)
I0428 06:01:40.791885 6162 process.cpp:927] Stopped the socket accept loop
The strange thing is that in STDout, I see this text; even though I set JAVA_HOME in /etc/environment and flink-conf.yam.
Please specify JAVA_HOME. Either in Flink config ./conf/flink-conf.yaml or as system-wide JAVA_HOME.
Would you please tell me what I should do for that problem?
Many Thanks.

You can check your Flink log in Slave node. Also, it is better to change your JSON file like this. It helps you to follow your application.
{
"id": "flink",
"cmd": "/home/flink-1.7.0/bin/mesos-appmaster.sh -Djobmanager.heap.mb=1024
-Djobmanager.rpc.port=6123 -Drest.port=8081
-Dmesos.resourcemanager.tasks.mem=1024 -Dtaskmanager.heap.mb=1024
-Dtaskmanager.numberOfTaskSlots=2 -Dparallelism.default=2
-Dmesos.resourcemanager.tasks.cpus=1",
"cpus": 1.0,
"mem": 1024,
"fetch": [
{
"uri": "/home/flink-1.7.0/bin/mesos-appmaster.sh",
"executable": true
}
]
}
Also, JAVA_HOME to Flink_conf.yaml in every nodes, Master and Slaves.
env.java.home: /opt/java
With adding JAVA_HOME, you do not see the error in STDOUT.
I think it is useful.

Related

Logstash Crashes When Filebeat Running

Logstash is running well without beats configuration over tcp and I can see the all logs when I send over tcp.
input {tcp{
port => 8500 }
}
output { elasticsearch { hosts => ["elasticsearch:9200"] }
}
But I want to send logs to logstash from filebeat. I changed logstash config with this:
input {
beats {
port => 5044
}
}
output { elasticsearch { hosts => ["elasticsearch:9200"] }
}
This is docker run for logstash
docker run -d -p 8500:8500 -h logstash --name logstash --link elasticsearch:elasticsearch -v C:\elk2\config-dir:/config-dir docker.elastic.co/logstash/logstash:7.5.2 -f /config-dir/logstash.conf
I am running filebeat in docker with following:
docker run -d docker.elastic.co/beats/filebeat:6.8.6 setup --template -E output.logstash.enabled=true -E 'output.logstash.hosts=["127.0.0.1:5044"]'
But whenever I run filebeat, logstash and filenbeat containers are being stopped:
There is no docker log meaningfull:
[2020-01-24T14:13:37,104][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
[2020-01-24T14:13:37,978][INFO ][logstash.javapipeline ] Pipeline terminated {"pipeline.id"=>".monitoring-logstash"}
[2020-01-24T14:13:38,657][INFO ][logstash.runner ] Logstash shut down.
You need to expose your beat listening port
docker run -d -p 5044:5044 -h logstash --name logstash --link elasticsearch:elasticsearch -v C:\elk2\config-dir:/config-dir docker.elastic.co/logstash/logstash:7.5.2 -f /config-dir/logstash.conf

kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs

I start a docker container to run a Kafka server with
docker run -p 2181:2181 -p 9092:9092 --env ADVERTISED_HOST=192.168.99.100 --env ADVERTISED_PORT=9092 spotify/kafka
I find the IP address of the Docker container. This is 172.17.0.2 and I can ping this address.
Now I want a producer that sends messages:
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='172.17.0.2:9092')
for i in range(100):
producer.send('foobar', b'hola')
producer.close()
However this gives:
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
How to solve this?
Had the same error but because my topic name wasn't right/set, same as python_noob.

consul container exits with a protocol version error

I am trying to make a container for consul and it keeps failing with this output, funny, I don't really think it is an error
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
following is the command I am using:
docker container run --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -dns-port 53 -bootstrap-expect 1 -ui -datacenter dc1 -v "/var/lib/consul:/consul/data" -data-dir /var/lib/consul
It is a single node fresh installation with latest version from registry, so there is no upgrade or version mismatch with any agent/client happening here.
Two things to fix. First, the -v volume argument must be for docker command, not for consul command. Move it to the right place:
docker container run -v "/consul/data:/var/lib/consul" -data-dir /var/lib/consul --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -dns-port 53 -bootstrap-expect 1 -ui -datacenter dc1
Also invert them (they are /host/dir:/container/dir)
Second, by default Consul can't listen to privileged ports (i.e. 53). See this: https://www.consul.io/docs/guides/forwarding.html, so remove the -dns-port 53 and implement any approach that they recommends:
docker container run -v "/consul/data:/var/lib/consul" -data-dir /var/lib/consul --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -bootstrap-expect 1 -ui -datacenter dc1
I recommend the DNSMasq setup, it is easy to implement.
#Robert Alright, I think we also went a bit off topic here. The real issue is the message it shows and exits immidiately after that.
I tried your example and it gives the same message/error (don't think it is an error though)
[root#ip-X-X-X-X user]# docker container run --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -dns-port 53 -bootstrap-expect 1 -ui -datacenter dc1 -v "/var/lib/consul:/consul/data" -data-dir /var/lib/consul
==> Found address 'X.X.X.X' for interface 'eth0', setting bind option...
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
[root#ip-X-X-X-X user]# docker container ls | grep consul-server
[root#ip-10-201-14-34 user]#
Same for recursors example:
[root#ip-X.X.X.X user]# docker container run --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -dns-port 53 -bootstrap-expect 1 -ui -datacenter dc1 -v "/var/lib/consul:/consul/data" -data-dir /var/lib/consul -recursers 8.8.8.8
==> Found address 'X.X.X.X' for interface 'eth0', setting bind option...
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
[root#ip-X-X-X-X user]# docker container ls | grep consul-server
[root#ip-10-201-14-34 user]#

Spring boot app fail to link consul in docker

I am trying to use Consul as discovery service, and another two spring boot app to register with Consul; and put them into docker;
following are my codes:
app:
server:
port: 3333
spring:
application:
name: adder
cloud:
consul:
host: consul
port: 8500
discovery:
preferIpAddress: true
healthCheckPath: /health
healthCheckInterval: 15s
instanceId: ${spring.application.name}:${spring.application.instance_id:${server.port}}
2 docker-compose.yml
consul1:
image: "progrium/consul:latest"
container_name: "consul1"
hostname: "consul1"
command: "-server -bootstrap -ui-dir /ui"
adder:
image: wsy/adder
ports:
- "3333:3333"
links:
- consul1
environment:
WAIT_FOR_HOSTS: consul1:8500
There is another similar question Cannot link Consul and Spring Boot app in Docker;
the answer suggests, the app should wait for consul to fully work by using depends_on, which I tried, but didn't work;
the error message is as following:
adder_1 | com.ecwid.consul.transport.TransportException: java.net.ConnectException: Connection refused
adder_1 | at com.ecwid.consul.transport.AbstractHttpTransport.executeRequest(AbstractHttpTransport.java:80) ~[consul-api-1.1.8.jar!/:na]
adder_1 | at com.ecwid.consul.transport.AbstractHttpTransport.makeGetRequest(AbstractHttpTransport.java:39) ~[consul-api-1.1.8.jar!/:na]
besides spring boot application.yml and docker-compose.yml, following is App's Dockerfile
FROM java:8
VOLUME /tmp
ADD adder-0.0.1-SNAPSHOT.jar app.jar
RUN bash -c 'touch /app.jar'
ADD start.sh start.sh
RUN bash -c 'chmod +x /start.sh'
EXPOSE 3333
ENTRYPOINT ["/start.sh", " java -Djava.security.egd=file:/dev/./urandom -jar /app.jar"]
and the start.sh
#!/bin/bash
set -e
wait_single_host() {
local host=$1
shift
local port=$1
shift
echo "waiting for TCP connection to $host:$port..."
while ! nc ${host} ${port} > /dev/null 2>&1 < /dev/null
do
echo "TCP connection [$host] not ready, will try again..."
sleep 1
done
echo "TCP connection ready. Executing command [$host] now..."
}
wait_all_hosts() {
if [ ! -z "$WAIT_FOR_HOSTS" ]; then
local separator=':'
for _HOST in $WAIT_FOR_HOSTS ; do
IFS="${separator}" read -ra _HOST_PARTS <<< "$_HOST"
wait_single_host "${_HOST_PARTS[0]}" "${_HOST_PARTS[1]}"
done
else
echo "IMPORTANT : Waiting for nothing because no $WAIT_FOR_HOSTS env var defined !!!"
fi
}
wait_all_hosts
exec $1
I can infer that your Consul configuration is located in your application.yml instead of bootstrap.yml, that's the problem.
According to this answer, bootstrap.yml is loaded before application.yml and Consul client has to check its configuration before the application itself and therefore look at the bootstrap.yml.
Example of a working bootstrap.yml :
spring:
cloud:
consul:
host: consul
port: 8500
discovery:
prefer-ip-address: true
Run Consul server and do not forget the name option to match with your configuration:
docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap
Consul server is now running, run your application image (builded previously with your artifact) and link it to the Consul container:
docker run -d -name=my-consul-client-app --link consul:consul acme/spring-app
Your problem is that depends_on does only control the startup order of your services. You have to wait until the consul servers are up and running before starting your spring app. You can do this with this script:
#!/bin/bash
set -e
default_host="database"
default_port="3306"
host="${2:-$default_host}"
port="${3:-$default_port}"
echo "waiting for TCP connection to $host:$port..."
while ! (echo >/dev/tcp/$host/$port) &>/dev/null
do
sleep 1
done
echo "TCP connection ready. Executing command [$1] now..."
exec $1
Usage in you docker file:
COPY wait.sh /wait.sh
RUN chmod +x /wait.sh
CMD ["/wait.sh", "java -jar yourApp-jar" , "consulURL" , "ConsulPort" ]
I just want to clarify that, at last I still don't have a solution, and can't understand the situation here; I tried the suggestion from Ohmen, in APP container, I am able to ping consul1; But the APP still fails to connect consul;
If I only start the consul by following command:
docker-compose -f docker-compose-discovery.yml up discovery
Then I can run the APP directly(through Main), and it is able to connect with spring.cloud.consul.host: discovery;
But if I try to run APP in docker container, like following:
docker run --name adder --link discovery-consul:discovery wsy/adder
It fails again with connection refused;
I am very new to docker & docker-compose; I thought it would be a good example to start, but it seems not that easy for me;

Docker Swarm with Consul - Manager not electing primary

I'm trying to setup a HA docker cluster on 3 dedicated pc's. I've successfully followed the instructions on docs.docker.com/engine/installation/linux/ubuntulinux and now I'm trying to follow the instructions on https://docs.docker.com/swarm/install-manual. Since I'm not using any virtualization I start at "Set up an consul discovery backend". The PC's (running ubuntu trusty 14.04 server edition) are all in the LAN 192.168.2.0/24. ubuntu001 has .104, ubuntu002 has .106, and ubuntu003 has .105
I did the following according to the instructions:
arnolde#ubuntu001:~$ docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap
arnolde#ubuntu001:~$ docker run -d -p 4000:4000 swarm manage -H :4000 --replication --advertise 192.168.2.104:4000 consul://192.168.2.104
arnolde#ubuntu002:~# docker run -d swarm manage -H :4000 --replication --advertise 192.168.2.106:4000 consul://192.168.2.104:8500
arnolde#ubuntu003:~$ docker run -d swarm join --advertise=192.168.2.105:2375 consul://192.168.2.104:8500
But then when trying the next step, the swarm manager does NOT show up as "Primary" like it says it should, and no primary is listed:
arnolde#ubuntu001:~$ docker -H :4000 info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: swarm/1.1.0
Role: replica
Primary:
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Plugins:
Volume:
Network:
Kernel Version: 3.19.0-25-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
And:
arnolde#ubuntu001:~$ docker -H :4000 run hello-world
docker: Error response from daemon: No elected primary cluster manager.
I searched and found https://github.com/docker/swarm/issues/1491 which recommends to use dockerswarm/swarm:master instead, which I did, but it didn't help:
arnolde#ubuntu001:~$ docker run -d -p 4000:4000 dockerswarm/swarm:master manage -H :4000 --replication --advertise 192.168.2.104:4000 consul://192.168.2.104
I didn't find any other input regarding swarm+consul+primary so here I am... any suggestions? Unfortunately I'm not sure how to troubleshoot since I don't even know where to look for logging/debugging info, i.e. if the manager is connecting to consul successfully etc...
I was able to solve it myself after explicitly adding the port number to the consul:// parameter, apparently the docker docs are incomplete:
arnolde#ubuntu001:~$ docker run -d -p 4000:4000 dockerswarm/swarm:master manage -H :4000 --replication --advertise 192.168.2.104:4000 consul://192.168.2.104:8500
arnolde#ubuntu001:~$ docker -H :4000 info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: swarm/1.1.0
Role: replica
Primary: 192.168.2.106:4000
Also I added "-p 4000:4000" to the command on the replica manager (on ubuntu002). Not sure if that was necessary (or even a good idea).
My friends,the first step you should edit docker start daemon configure to write listen the port any other configure ,my environment is centos7,so my daemon configure is in /usr/lib/docker/.... edit "ExecStart=/usr/bin/docker daemon -H fd:// -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-store=consul://192.168.1.102:8500 --cluster-advertise=192.168.1.103:0" each node. and the second step: "docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap" anymore...

Resources