Artifactory service not starting on Windows

I'm trying to run Artifactory on Windows Server but the Artifactory service will not start. I'm running Windows Server 2016 Datacenter in a VM in Hyper-V. I have tried Server installs with and without the Windows GUI. Artifactory is installed via Chocolatey:
choco install Artifactory -y
When I try to start the service with PowerShell:
Start-Service Artifactory
I immediately get this error in PowerShell:
Service 'artifactory (Artifactory)' cannot be started due to the following error: Cannot start service Artifactory on computer '.'.
Windows event logs show these two errors in this order:
A timeout was reached (30000 milliseconds) while waiting for the Artifactory service to connect.
The Artifactory service failed to start due to the following error:
The service did not respond to the start or control request in a timely fashion.
Again, these errors appear immediately, so the 30-second timeout message is misleading.
But I am able to manually start the Artifactory process:
C:\Program Files\artifactory\bin\artifactory.bat
Artifactory Logs
commons-daemon.2017-08-10.log
[2017-08-10 10:02:53] [info] [ 2344] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:53] [info] [ 2344] Service Artifactory name Artifactory
[2017-08-10 10:02:53] [info] [ 2344] Service 'Artifactory' installed
[2017-08-10 10:02:53] [info] [ 2344] Commons Daemon procrun finished
[2017-08-10 10:02:54] [info] [ 3420] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:54] [info] [ 3420] Updating service...
[2017-08-10 10:02:54] [info] [ 3420] Service 'Artifactory' updated
[2017-08-10 10:02:54] [info] [ 3420] Update service finished.
[2017-08-10 10:02:54] [info] [ 3420] Commons Daemon procrun finished
[2017-08-10 10:02:54] [info] [ 1468] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:54] [info] [ 1468] Updating service...
[2017-08-10 10:02:54] [info] [ 1468] Service 'Artifactory' updated
[2017-08-10 10:02:54] [info] [ 1468] Update service finished.
[2017-08-10 10:02:54] [info] [ 1468] Commons Daemon procrun finished
[2017-08-10 10:02:54] [info] [ 1000] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:54] [info] [ 1000] Updating service...
[2017-08-10 10:02:54] [info] [ 1000] Service 'Artifactory' updated
[2017-08-10 10:02:54] [info] [ 1000] Update service finished.
[2017-08-10 10:02:54] [info] [ 1000] Commons Daemon procrun finished
[2017-08-10 10:02:54] [info] [ 5016] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:55] [info] [ 5016] Updating service...
[2017-08-10 10:02:55] [info] [ 5016] Service 'Artifactory' updated
[2017-08-10 10:02:55] [info] [ 5016] Update service finished.
[2017-08-10 10:02:55] [info] [ 5016] Commons Daemon procrun finished
[2017-08-10 10:02:55] [info] [ 4308] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:55] [info] [ 4308] Updating service...
[2017-08-10 10:02:55] [info] [ 4308] Service 'Artifactory' updated
[2017-08-10 10:02:55] [info] [ 4308] Update service finished.
[2017-08-10 10:02:55] [info] [ 4308] Commons Daemon procrun finished
[2017-08-10 10:02:55] [info] [ 1168] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:55] [info] [ 1168] Updating service...
[2017-08-10 10:02:55] [info] [ 1168] Service 'Artifactory' updated
[2017-08-10 10:02:55] [info] [ 1168] Update service finished.
[2017-08-10 10:02:55] [info] [ 1168] Commons Daemon procrun finished
artifactory-services.2017-08-10.log
[2017-08-10 10:02:56] [info] [ 3172] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:56] [info] [ 3172] Updating service...
[2017-08-10 10:02:56] [info] [ 3172] Service 'Artifactory' updated
[2017-08-10 10:02:56] [info] [ 3172] Update service finished.
[2017-08-10 10:02:56] [info] [ 3172] Commons Daemon procrun finished
[2017-08-10 10:02:56] [info] [ 540] Commons Daemon procrun (1.0.11.0 64-bit) started
[2017-08-10 10:02:56] [info] [ 540] Updating service...
[2017-08-10 10:02:56] [info] [ 540] Service 'Artifactory' updated
[2017-08-10 10:02:56] [info] [ 540] Update service finished.
[2017-08-10 10:02:56] [info] [ 540] Commons Daemon procrun finished
Update
Using procmon, I noticed that when I tried to start the 'artifactory' service it was launching 'artifactory-service.exe'. Running that program directly myself resulted in the following error:
The system cannot find the Registry key for service 'artifactory-service'
Load configuration failed
The system cannot find the file specified.
Commons Daemon procrun failed with exit value: 2 (Failed to load configuration)
The system cannot find the file specified.
Checking procmon again shows that when I start 'artifactory-service.exe' it is trying to access registry entry 'HKLM\SOFTWARE\WOW6432Node\Apache Software Foundation\Procrun 2.0\artifactory-service' and not finding it. I confirmed with regedit that this registry entry does not exist. I am inclined to think this is part of the reason the service is failing to start.
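A quick sketch to confirm that from PowerShell (the key path is the one procmon reported):
# Returns False when the Procrun configuration key is missing.
Test-Path 'HKLM:\SOFTWARE\WOW6432Node\Apache Software Foundation\Procrun 2.0\artifactory-service'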

This seems to be caused by artifactory-service.exe writing an unusual character into the service definition. After running installService.bat, when I inspected the service, the "Path to executable" contained
...\artifactory-pro-5.5.1\bin\artifactory-service.exe ೴//RS//Artifactory
where the unusual character is a strange Unicode character such as this one:
http://www.fileformat.info/info/unicode/char/0cf4/index.htm
The culprit appears to be artifactory-service.exe itself, which is just an older build of the "Commons Daemon Service Runner", prunsrv.exe v1.0.11.0. I cannot find documentation of this error, so I do not know the underlying cause.
What I did to solve this was take the most recent build of prunsrv.exe, v1.0.15.0, from a Tomcat 8 installation (tomcat8.exe), rename it to artifactory-service.exe, and place it back in the %ARTIFACTORY_HOME%\bin installation folder. This allowed the service to install and start without issue.
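A minimal sketch of that swap in PowerShell, assuming %ARTIFACTORY_HOME% is set; the Tomcat source path is hypothetical, use wherever your tomcat8.exe actually lives:
# Back up the shipped binary, then drop in the newer prunsrv build.
# tomcat8.exe is the same Commons Daemon runner under another name.
$bin = Join-Path $env:ARTIFACTORY_HOME 'bin'
Copy-Item (Join-Path $bin 'artifactory-service.exe') (Join-Path $bin 'artifactory-service.exe.bak')
# Source path below is hypothetical; point it at your Tomcat 8 download.
Copy-Item 'C:\temp\apache-tomcat-8.5.23\bin\tomcat8.exe' (Join-Path $bin 'artifactory-service.exe')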

Great! Your procedure works for me, many thanks. The gremlin is that illegal character we see in the service's full path once installed. I was puzzled because the Artifactory service works fine under Windows 10 but refused to work under Windows Server 2016. Extending the ServicesPipeTimeout registry value didn't help either.
In case it isn't clear to some of you: you just need to rename tomcat8.exe from the latest Tomcat 8.5.23 download to artifactory-service.exe. Inspecting the file properties shows the original file name, prunsrv.exe, and its version, 1.0.11.0 or 1.0.15.0.
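To see which prunsrv build a given binary really is, the embedded version resource can also be read from PowerShell; a small sketch:
# OriginalFilename shows prunsrv.exe; FileVersion shows 1.0.11.0 vs 1.0.15.0.
(Get-Item "$env:ARTIFACTORY_HOME\bin\artifactory-service.exe").VersionInfo |
    Format-List OriginalFilename, FileVersion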

Related

Unable to run consul in docker host network mode

I am trying to run a Consul docker container in host network mode, as suggested on Docker Hub, but I am unable to access the UI at port 8500.
My docker host IP address: 192.168.30.12
network interface which is used by host: ens192
Here is my docker run command:
docker run -d --net=host -v /home/docker/conf.json:/consul/config/config.json -v /home/docker/data/:/consul/data/ -e CONSUL_BIND_INTERFACE=ens192 -e CONSUL_CLIENT_INTERFACE=ens192 --name=consulserver1 -d consul agent -server -bootstrap-expect=1 -client 0.0.0.0 -bind=192.168.30.12
I also see the following errors in the docker logs:
==> Found address '192.168.30.12' for interface 'ens192', setting bind option...
==> Found address '192.168.30.12' for interface 'ens192', setting client option...
==> Starting Consul agent...
Version: '1.14.4'
Build Date: '2023-01-26 15:47:10 +0000 UTC'
Node ID: 'd8e91718-dcf3-70be-dd29-c558158959f0'
Node name: 'docker-try1'
Datacenter: 'dc1' (Segment: '<all>')
Server: true (Bootstrap: true)
Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, gRPC-TLS: 8503, DNS: 8600)
Cluster Addr: 192.168.30.12 (LAN: 8301, WAN: 8302)
Gossip Encryption: false
Auto-Encrypt-TLS: false
HTTPS TLS: Verify Incoming: false, Verify Outgoing: false, Min Version: TLSv1_2
gRPC TLS: Verify Incoming: false, Min Version: TLSv1_2
Internal RPC TLS: Verify Incoming: false, Verify Outgoing: false (Verify Hostname: false), Min Version: TLSv1_2
==> Log data will now stream in as it occurs:
2023-02-17T15:18:30.052Z [WARN] agent: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
2023-02-17T15:18:30.052Z [WARN] agent: Node name "docker-try1" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2023-02-17T15:18:30.052Z [WARN] agent: bootstrap = true: do not enable unless necessary
2023-02-17T15:18:30.057Z [WARN] agent.auto_config: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
2023-02-17T15:18:30.057Z [WARN] agent.auto_config: Node name "docker-try1" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2023-02-17T15:18:30.057Z [WARN] agent.auto_config: bootstrap = true: do not enable unless necessary
2023-02-17T15:18:30.061Z [INFO] agent.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:d8e91718-dcf3-70be-dd29-c558158959f0 Address:192.168.30.12:8300}]"
2023-02-17T15:18:30.061Z [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.30.12:8300 [Follower]" leader-address= leader-id=
2023-02-17T15:18:30.062Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: docker-try1.dc1 192.168.30.12
2023-02-17T15:18:30.062Z [WARN] agent.server.serf.wan: serf: Failed to re-join any previously known node
2023-02-17T15:18:30.062Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: docker-try1 192.168.30.12
2023-02-17T15:18:30.063Z [INFO] agent.router: Initializing LAN area manager
2023-02-17T15:18:30.063Z [WARN] agent.server.serf.lan: serf: Failed to re-join any previously known node
2023-02-17T15:18:30.063Z [INFO] agent.server: Adding LAN server: server="docker-try1 (Addr: tcp/192.168.30.12:8300) (DC: dc1)"
2023-02-17T15:18:30.063Z [INFO] agent.server.autopilot: reconciliation now disabled
2023-02-17T15:18:30.064Z [INFO] agent.server: Handled event for server in area: event=member-join server=docker-try1.dc1 area=wan
2023-02-17T15:18:30.064Z [INFO] agent.server.cert-manager: initialized server certificate management
2023-02-17T15:18:30.064Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=udp
2023-02-17T15:18:30.065Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=tcp
2023-02-17T15:18:30.065Z [INFO] agent: Starting server: address=[::]:8500 network=tcp protocol=http
2023-02-17T15:18:30.065Z [INFO] agent: Started gRPC listeners: port_name=grpc_tls address=[::]:8503 network=tcp
2023-02-17T15:18:30.065Z [INFO] agent: started state syncer
2023-02-17T15:18:30.065Z [INFO] agent: Consul agent running!
2023-02-17T15:18:37.152Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="No cluster leader" index=0
2023-02-17T15:18:37.152Z [ERROR] agent.server.cert-manager: failed to handle cache update event: error="leaf cert watch returned an error: No cluster leader"
2023-02-17T15:18:37.248Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2023-02-17T15:18:39.483Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2023-02-17T15:18:39.483Z [INFO] agent.server.raft: entering candidate state: node="Node at 192.168.30.12:8300 [Candidate]" term=7
2023-02-17T15:18:39.486Z [INFO] agent.server.raft: election won: term=7 tally=1
2023-02-17T15:18:39.486Z [INFO] agent.server.raft: entering leader state: leader="Node at 192.168.30.12:8300 [Leader]"
2023-02-17T15:18:39.486Z [INFO] agent.server: cluster leadership acquired
2023-02-17T15:18:39.487Z [INFO] agent.server: New leader elected: payload=docker-try1
2023-02-17T15:18:39.493Z [INFO] agent.server.autopilot: reconciliation now enabled
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="federation state anti-entropy"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="federation state pruning"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="streaming peering resources"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="metrics for streaming peering resources"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="peering deferred deletion"
2023-02-17T15:18:39.493Z [INFO] connect.ca: initialized primary datacenter CA from existing CARoot with provider: provider=consul
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="intermediate cert renew watch"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="CA root pruning"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="CA root expiration metric"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="CA signing expiration metric"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="virtual IP version check"
2023-02-17T15:18:39.493Z [INFO] agent.leader: stopping routine: routine="virtual IP version check"
2023-02-17T15:18:39.493Z [INFO] agent.leader: stopped routine: routine="virtual IP version check"
2023-02-17T15:18:40.065Z [ERROR] agent.server.autopilot: Failed to reconcile current state with the desired state
2023-02-17T15:18:41.061Z [INFO] agent: Synced node info
I think I figured it out.
There was a firewall blocking TCP ports. As soon as I opened all the ports recommended in the Consul documentation (Consul Ports), it started working.
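For reference, a sketch of opening those ports with ufw; ufw is an assumption here, so adjust for your firewall. The port list follows Consul's Required Ports documentation:
sudo ufw allow 8300/tcp   # server RPC
sudo ufw allow 8301       # Serf LAN gossip (TCP and UDP)
sudo ufw allow 8302       # Serf WAN gossip (TCP and UDP)
sudo ufw allow 8500/tcp   # HTTP API / UI
sudo ufw allow 8503/tcp   # gRPC TLS
sudo ufw allow 8600       # DNS (TCP and UDP)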

Dockerfile with Tomcat catalina.sh run does not run

I can build the Dockerfile. When I do docker run -it path-to-image/tomcat9:latest and check the logs, there is no catalina.out, and the run fails with /bin/sh: ["catalina.sh",: command not found.
Here is my Dockerfile
FROM gitlab-registry.gs.mil/gets-development/docker/openjdk11
USER root
# Copy Tomcat and start
ADD imageFiles/apache-tomcat-9.0.65.tar.gz /usr/local/
RUN mv /usr/local/apache-tomcat-9.0.65/ /usr/local/tomcat
ENV WORKPATH /usr/local
WORKDIR $WORKPATH
ENV CATALINA_HOME /usr/local/tomcat
ENV CATALINA_BASE /usr/local/tomcat
ENV PATH $PATH:$CATALINA_HOME/bin:$CATALINA_HOME/lib
EXPOSE 8080
CMD ["/usr/local/tomcat/bin/catalina.sh", "run"]
Build command:
docker build -t gitlab-registry.gs.mil/gets-development/docker/tomcat9-test .
Start command:
docker run --name tomcatTest -it gitlab-registry.gs.mil/gets-development/docker/tomcat9-test:latest /bin/bash
Trying to connect to localhost from inside the docker container fails:
curl: (7) Failed to connect to localhost port 8080: Connection refused
There are no log files:
[root@b058163e9605 local]# cd tomcat/logs/
[root@b058163e9605 logs]# ls -als
total 0
0 drwxr-x--- 2 root root 6 Jul 14 12:28 .
0 drwxr-xr-x 9 root root 220 Aug 5 16:17 ..
[root@b058163e9605 logs]#
This tells me that Tomcat did not start. When I start Tomcat manually inside the container, it launches successfully:
[root@b058163e9605 bin]# ./catalina.sh run
.....
08-Aug-2022 13:12:02.934 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["http-nio-8080"]
08-Aug-2022 13:12:03.038 INFO [main] org.apache.catalina.startup.Catalina.load Server initialization in [1590] milliseconds
08-Aug-2022 13:12:03.204 INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service [Catalina]
08-Aug-2022 13:12:03.205 INFO [main] org.apache.catalina.core.StandardEngine.startInternal Starting Servlet engine: [Apache Tomcat/9.0.65]
08-Aug-2022 13:12:03.224 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deploying web application directory [/usr/local/tomcat/webapps/ROOT]
08-Aug-2022 13:12:03.877 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory [/usr/local/tomcat/webapps/ROOT] has finished in [652] ms
08-Aug-2022 13:12:03.879 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deploying web application directory [/usr/local/tomcat/webapps/docs]
08-Aug-2022 13:12:03.945 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory [/usr/local/tomcat/webapps/docs] has finished in [66] ms
08-Aug-2022 13:12:03.947 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deploying web application directory [/usr/local/tomcat/webapps/examples]
08-Aug-2022 13:12:04.559 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory [/usr/local/tomcat/webapps/examples] has finished in [613] ms
08-Aug-2022 13:12:04.562 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deploying web application directory [/usr/local/tomcat/webapps/host-manager]
08-Aug-2022 13:12:04.626 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory [/usr/local/tomcat/webapps/host-manager] has finished in [63] ms
08-Aug-2022 13:12:04.626 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deploying web application directory [/usr/local/tomcat/webapps/manager]
08-Aug-2022 13:12:04.717 INFO [main] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory [/usr/local/tomcat/webapps/manager] has finished in [90] ms
08-Aug-2022 13:12:04.733 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
08-Aug-2022 13:12:04.767 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [1728] milliseconds
Finally, the Docker logs only show what I did inside the container, and no other information.
Please assist.
Your docker run command does not launch Tomcat but simply bash. Notice the last argument:
docker run --name tomcatTest -it gitlab-registry.gs.mil/gets-development/docker/tomcat9-test:latest /bin/bash
Change it to:
docker run --name tomcatTest gitlab-registry.gs.mil/gets-development/docker/tomcat9-test:latest
If you want a shell to investigate what is going on inside a running container, use
docker exec -it tomcatTest /bin/bash
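To reach Tomcat from the host you also need to publish the port, which the commands above don't do; a sketch:
# -p maps the EXPOSEd container port to the host.
docker run -d --name tomcatTest -p 8080:8080 gitlab-registry.gs.mil/gets-development/docker/tomcat9-test:latest
# catalina.sh run logs to stdout rather than catalina.out, so read it with:
docker logs -f tomcatTest
curl -I http://localhost:8080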

Management page won't load when using RabbitMQ docker container

I'm running RabbitMQ locally using:
docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management
Some log:
narley@brittes ~ $ docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management
2020-01-08 22:31:52.079 [info] <0.8.0> Feature flags: list of feature flags found:
2020-01-08 22:31:52.079 [info] <0.8.0> Feature flags: [ ] drop_unroutable_metric
2020-01-08 22:31:52.079 [info] <0.8.0> Feature flags: [ ] empty_basic_get_metric
2020-01-08 22:31:52.079 [info] <0.8.0> Feature flags: [ ] implicit_default_bindings
2020-01-08 22:31:52.080 [info] <0.8.0> Feature flags: [ ] quorum_queue
2020-01-08 22:31:52.080 [info] <0.8.0> Feature flags: [ ] virtual_host_metadata
2020-01-08 22:31:52.080 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2020-01-08 22:31:52.160 [info] <0.268.0> ra: meta data store initialised. 0 record(s) recovered
2020-01-08 22:31:52.162 [info] <0.273.0> WAL: recovering []
2020-01-08 22:31:52.164 [info] <0.277.0>
Starting RabbitMQ 3.8.2 on Erlang 22.2.1
Copyright (c) 2007-2019 Pivotal Software, Inc.
Licensed under the MPL 1.1. Website: https://rabbitmq.com
  ##  ##      RabbitMQ 3.8.2
  ##  ##
  ##########  Copyright (c) 2007-2019 Pivotal Software, Inc.
  ######  ##
  ##########  Licensed under the MPL 1.1. Website: https://rabbitmq.com
Doc guides: https://rabbitmq.com/documentation.html
Support: https://rabbitmq.com/contact.html
Tutorials: https://rabbitmq.com/getstarted.html
Monitoring: https://rabbitmq.com/monitoring.html
Logs: <stdout>
Config file(s): /etc/rabbitmq/rabbitmq.conf
Starting broker...2020-01-08 22:31:52.166 [info] <0.277.0>
node : rabbit@1586b4698736
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : bwlnCFiUchzEkgAOsZwQ1w==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit@1586b4698736
2020-01-08 22:31:52.210 [info] <0.277.0> Running boot step pre_boot defined by app rabbit
...
...
...
2020-01-08 22:31:53.817 [info] <0.277.0> Setting up a table for connection tracking on this node: tracked_connection_on_node_rabbit@1586b4698736
2020-01-08 22:31:53.827 [info] <0.277.0> Setting up a table for per-vhost connection counting on this node: tracked_connection_per_vhost_on_node_rabbit@1586b4698736
2020-01-08 22:31:53.828 [info] <0.277.0> Running boot step routing_ready defined by app rabbit
2020-01-08 22:31:53.828 [info] <0.277.0> Running boot step pre_flight defined by app rabbit
2020-01-08 22:31:53.828 [info] <0.277.0> Running boot step notify_cluster defined by app rabbit
2020-01-08 22:31:53.829 [info] <0.277.0> Running boot step networking defined by app rabbit
2020-01-08 22:31:53.833 [info] <0.624.0> started TCP listener on [::]:5672
2020-01-08 22:31:53.833 [info] <0.277.0> Running boot step cluster_name defined by app rabbit
2020-01-08 22:31:53.833 [info] <0.277.0> Running boot step direct_client defined by app rabbit
2020-01-08 22:31:53.922 [info] <0.674.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-01-08 22:31:53.922 [info] <0.780.0> Statistics database started.
2020-01-08 22:31:53.923 [info] <0.779.0> Starting worker pool 'management_worker_pool' with 3 processes in it
completed with 3 plugins.
2020-01-08 22:31:54.316 [info] <0.8.0> Server startup complete; 3 plugins started.
* rabbitmq_management
* rabbitmq_management_agent
* rabbitmq_web_dispatch
Then I go to http://localhost:15672 and the page doesn't load. No error is displayed.
The interesting thing is that it worked the last time I used it (about 3 weeks ago).
Can anyone give me some help?
Cheers!
Have a try:
Step 1, go into the docker container:
docker exec -it rabbitmq bash
Step 2, run this inside the container:
rabbitmq-plugins enable rabbitmq_management
It works for me.
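The same thing as a single command from the host, with no interactive shell (docker exec runs the command directly in the container):
docker exec rabbitmq rabbitmq-plugins enable rabbitmq_management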
I got it working by simply upgrading Docker: I was running 18.09.7 and upgraded to 19.03.5.
In my case, clearing the cookies fixed this issue instantly.

ECS exit with code 0, stopped and restarted docker container many times

I am deploying a docker container with ECS. I can run the container fine locally, but when I use ECS it always stops the container and starts a new one. I checked the docker logs and there is no error; docker stats also looks fine. Below is what the ECS agent log shows:
2018-02-16T07:24:01Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: Cgroup resource set up for task complete
2018-02-16T07:24:01Z [INFO] Task engine [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: pulling container openimage-http concurrently
2018-02-16T07:24:01Z [INFO] Task engine [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: Recording timestamp for starting image pulltime: 2018-02-16 07:24:01.209507829 +0000 UTC m=+30.870498308
2018-02-16T07:27:29Z [INFO] Adding image name- 035804961478.dkr.ecr.us-west-2.amazonaws.com/adobesearch/openimage:classifierscpu to Image state- sha256:b0b930a705e7df39404de30c58da2e573a2deb519b212c0d0a7ab74cf9d85192
2018-02-16T07:27:29Z [INFO] Updating container reference openimage-http in Image State - sha256:b0b930a705e7df39404de30c58da2e573a2deb519b212c0d0a7ab74cf9d85192
2018-02-16T07:27:29Z [INFO] Saving state! module="statemanager"
2018-02-16T07:27:29Z [INFO] Saving state! module="statemanager"
2018-02-16T07:27:29Z [INFO] Task engine [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: Finished pulling container 035804961478.dkr.ecr.us-west-2.amazonaws.com/adobesearch/openimage:classifierscpu in 3m27.982214992s
2018-02-16T07:27:29Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: Cgroup resource set up for task complete
2018-02-16T07:27:29Z [INFO] Task engine [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: creating container: openimage-http
2018-02-16T07:27:29Z [INFO] Task engine [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: created container name mapping for task: openimage-http -> ecs-classifierscpu-http-TaskDefinition-1AYVW69U0MPZX-1-openimage-http-b2cbd59b9dd7f4f7c501
2018-02-16T07:27:29Z [INFO] Saving state! module="statemanager"
2018-02-16T07:27:29Z [INFO] Task engine [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: created docker container for task: openimage-http -> e9710d91b36113d319dc04a274fe0fdf2908e102c25767ab2fab14e822277049
2018-02-16T07:27:29Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: unable to create task state change event []: create task state change event api: status not recognized by ECS: CREATED
2018-02-16T07:27:29Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: Cgroup resource set up for task complete
2018-02-16T07:27:29Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: redundant container state change. openimage-http to CREATED, but already CREATED
2018-02-16T07:27:29Z [INFO] Task engine [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: starting container: openimage-http
2018-02-16T07:27:30Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: sending container change event [openimage-http]: arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d openimage-http -> RUNNING, Ports [{5000 8080 0.0.0.0 0}], Known Sent: NONE
2018-02-16T07:27:30Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: sent container change event [openimage-http]: arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d openimage-http -> RUNNING, Ports [{5000 8080 0.0.0.0 0}], Known Sent: NONE
2018-02-16T07:27:30Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: sending task change event [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d -> RUNNING, Known Sent: NONE, PullStartedAt: 2018-02-16 07:24:01.209507829 +0000 UTC m=+30.870498308, PullStoppedAt: 2018-02-16 07:27:29.191733309 +0000 UTC m=+238.852723707, ExecutionStoppedAt: 0001-01-01 00:00:00 +0000 UTC]
2018-02-16T07:27:30Z [INFO] TaskHandler: batching container event: arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d openimage-http -> RUNNING, Ports [{5000 8080 0.0.0.0 0}], Known Sent: NONE
2018-02-16T07:27:30Z [INFO] TaskHandler: Adding event: TaskChange: [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d -> RUNNING, Known Sent: NONE, PullStartedAt: 2018-02-16 07:24:01.209507829 +0000 UTC m=+30.870498308, PullStoppedAt: 2018-02-16 07:27:29.191733309 +0000 UTC m=+238.852723707, ExecutionStoppedAt: 0001-01-01 00:00:00 +0000 UTC, arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d openimage-http -> RUNNING, Ports [{5000 8080 0.0.0.0 0}], Known Sent: NONE] sent: false
2018-02-16T07:27:30Z [INFO] TaskHandler: Sending task change: TaskChange: [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d -> RUNNING, Known Sent: NONE, PullStartedAt: 2018-02-16 07:24:01.209507829 +0000 UTC m=+30.870498308, PullStoppedAt: 2018-02-16 07:27:29.191733309 +0000 UTC m=+238.852723707, ExecutionStoppedAt: 0001-01-01 00:00:00 +0000 UTC, arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d openimage-http -> RUNNING, Ports [{5000 8080 0.0.0.0 0}], Known Sent: NONE] sent: false
2018-02-16T07:27:30Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: sent task change event [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d -> RUNNING, Known Sent: NONE, PullStartedAt: 2018-02-16 07:24:01.209507829 +0000 UTC m=+30.870498308, PullStoppedAt: 2018-02-16 07:27:29.191733309 +0000 UTC m=+238.852723707, ExecutionStoppedAt: 0001-01-01 00:00:00 +0000 UTC]
2018-02-16T07:27:30Z [INFO] Managed task [arn:aws:ecs:us-west-2:035804961478:task/c13ba3f3-6ac8-49c5-a649-3d90e363ce4d]: redundant container state change. openimage-http to RUNNING, but already RUNNING
2018-02-16T07:27:30Z [WARN] Error publishing metrics: stats engine: no task metrics to report
I also deployed another simple docker test image on ECS where I ONLY changed the image tag (the tag of the other docker image in ECS) in the template, and it works perfectly. But in the failing case, the container runs perfectly well locally, and also on the instance if I manually do docker run. I really don't know where else to look for the error.
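For what it's worth, one way to narrow this down, assuming you can SSH to the container instance, is to inspect the stopped container directly and confirm how it exited; a sketch:
# Exit code, OOM-kill flag, and any runtime error for the stopped container;
# get <container-id> from `docker ps -a`.
docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.Error}}' <container-id>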

Issue when joining serf nodes located in different Docker containers

Context: Host is AWS-EC2 / Ubuntu 14.04.5 with Docker version 17.05.0-ce. Containers are built from publicly available repo image cbhihe/serf-alpine-bash. All containers are located on the same EC2 instance and share the same default bridge network with net-interface "docker0".
Trying to join nodes serfDC1 (id d4fd90692e18) and serfDC2 (id 6353e7f6134d), by passing cmds from the host's shell:
$ docker exec serfDC1 serf agent -node=Node1 -bind=0.0.0.0:7946
==> Starting Serf agent...
==> Starting Serf agent RPC...
==> Serf agent running!
Node name: 'd4fd90692e18'
Bind addr: '0.0.0.0:7946'
RPC addr: '127.0.0.1:7373'
Encrypted: false
Snapshot: false
Profile: lan
==> Log data will now stream in as it occurs:
2017/06/04 00:01:10 [INFO] agent: Serf agent starting
2017/06/04 00:01:10 [INFO] serf: EventMemberJoin: d4fd90692e18 127.0.0.1
2017/06/04 00:01:11 [INFO] agent: Received event: member-join
^C
After discovering that Node1's container IP is 172.17.0.4, I can issue the serf agent -join command for Node2:
$ docker exec serfDC2 serf agent -node=Node2 -join=172.17.0.4
==> Starting Serf agent...
==> Starting Serf agent RPC...
==> Serf agent running!
Node name: '6353e7f6134d'
Bind addr: '0.0.0.0:7946'
RPC addr: '127.0.0.1:7373'
Encrypted: false
Snapshot: false
Profile: lan
==> Joining cluster...(replay: false)
Join completed. Synced with 1 initial agents
==> Log data will now stream in as it occurs:
2017/06/04 00:18:35 [INFO] agent: Serf agent starting
2017/06/04 00:18:35 [INFO] serf: EventMemberJoin: 6353e7f6134d 127.0.0.1
2017/06/04 00:18:35 [INFO] agent: joining: [172.17.0.4] replay: false
2017/06/04 00:18:35 [INFO] serf: EventMemberJoin: d4fd90692e18 127.0.0.1
2017/06/04 00:18:35 [INFO] agent: joined: 1 nodes
2017/06/04 00:18:36 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:36 [INFO] agent: Received event: member-join
2017/06/04 00:18:37 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34876
2017/06/04 00:18:37 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:37 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:18:38 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:39 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34879
2017/06/04 00:18:39 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:40 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:18:41 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:42 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34881
2017/06/04 00:18:42 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:42 [INFO] memberlist: Marking d4fd90692e18 as failed, suspect timeout reached (0 peer confirmations)
2017/06/04 00:18:42 [INFO] serf: EventMemberFailed: d4fd90692e18 127.0.0.1
2017/06/04 00:18:43 [INFO] agent: Received event: member-failed
2017/06/04 00:18:44 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:19:05 [INFO] serf: attempting reconnect to d4fd90692e18 127.0.0.1:7946
^C
This resulted in a failure to join, as shown by:
$ docker exec serfDC2 serf members
6353e7f6134d 127.0.0.1:7946 alive
d4fd90692e18 127.0.0.1:7946 failed
$ docker exec serfDC1 serf members
d4fd90692e18 127.0.0.1:7946 alive
6353e7f6134d 127.0.0.1:7946 failed
I have been at this for quite some time now and am at my wit's end as to where I should turn. Hashicorp's and Docker's documentation do not seem to cover this aspect of the initial handshake between two serf agents in different containers.
Could somebody show me where I took a wrong turn? Any answer would be great, really. Tx.
Serf nodes need to 'announce' themselves with a routable address. In your case they are telling each other 'hi, I'm localhost:...', so each one tries to answer on localhost, which is wrong because each container has its own localhost.
There is an option, -iface, that configures the agent to advertise the eth0 IP to the other nodes in the network. Use it instead of the -bind option. The ports are the defaults, so there is no need to customize them.
So, for the node1:
serf agent -node=Node1 -iface=eth0
And for the node2:
serf agent -node=Node2 -join=172.17.0.2 -iface=eth0
From docs:
-iface - This flag can be used to provide a binding interface. It can be used instead of -bind if the interface is known but not the address.
It's working properly for me:
Node1:
==> Log data will now stream in as it occurs:
2017/06/04 01:56:40 [INFO] agent: Serf agent starting
2017/06/04 01:56:40 [INFO] serf: EventMemberJoin: Node1 172.17.0.2
2017/06/04 01:56:41 [INFO] agent: Received event: member-join
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node2 172.17.0.3
2017/06/04 01:57:03 [INFO] agent: Received event: member-join
Node2:
==> Log data will now stream in as it occurs:
2017/06/04 01:57:02 [INFO] agent: Serf agent starting
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node2 172.17.0.3
2017/06/04 01:57:02 [INFO] agent: joining: [172.17.0.2] replay: false
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node1 172.17.0.2
2017/06/04 01:57:02 [INFO] agent: joined: 1 nodes
2017/06/04 01:57:03 [INFO] agent: Received event: member-join
Edit:
If each container is in its own VM (EC2 instance), then each instance has its own docker network and they are not interconnected, so you have to advertise the EC2 instance IP and expose the corresponding ports. Use -advertise:
-advertise - The advertise flag is used to change the address that we advertise to other nodes in the cluster.
Node1:
serf agent -node=Node1 -iface=eth0 -advertise=INSTANCE_IP
Node2:
serf agent -node=Node2 -join=NODE1_INSTANCE_IP -iface=eth0
And remember to expose the serf port in docker run:
docker run -p 7946:7946 (...rest of the command...)
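Note that Serf's gossip layer (memberlist) uses both TCP and UDP on 7946, so publishing both is the safer option:
docker run -p 7946:7946 -p 7946:7946/udp (...rest of the command...)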
