Spring Boot Admin - Running in Docker Swarm weirdly

I am running multiple Spring Boot servers, all connected to a Spring Boot Admin instance. Everything runs in the same Docker Swarm.
Spring Boot Admin keeps reporting on these "fake" instances that pop up and die. They are up for 1 second and then become unresponsive. When I clear them, they come back. The details for that instance show this error:
Fetching live health status failed. This is the last known information.
Request failed with status code 502
This is the same for all my APIs, and it is giving us an inaccurate health reading of our services. How can I get Admin to stop reporting on these non-existent containers?
I've looked in all my nodes and can't find any containers (running or stopped) that match the unresponsive containers that Admin is reporting.
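One thing that may be worth checking (a sketch of a possible cause, not a confirmed fix): when services register themselves through spring-boot-admin-client, registrations from replaced Swarm tasks can outlive their containers, and Admin then keeps polling addresses that now return 502 through the routing mesh. The client properties below exist in Spring Boot Admin 2.x; the admin-server URL is a hypothetical Swarm service name:

    # application.yml of each monitored service (sketch)
    spring:
      boot:
        admin:
          client:
            url: http://admin-server:8080   # hypothetical service name
            auto-deregistration: true       # deregister when the container stops cleanly
            instance:
              prefer-ip: true               # register the task IP instead of a hostname
                                            # that may resolve to the Swarm VIP

Note that auto-deregistration only helps on clean shutdowns; tasks killed outright will still linger in Admin until they are marked offline.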

Related

Error reading JDBC_PING table, Keycloak cluster

Can somebody help me with this problem?
I have a cluster of two Keycloak Docker containers backed by a Postgres database, and I use JDBC_PING for Keycloak cluster discovery. The problem is that when checking the logs of one of the instances I get the following errors:
Error reading JDBC_PING table (https://i.stack.imgur.com/vrsdp.png)
Rollback (https://i.stack.imgur.com/2z0MF.png)
Multiple threads active within it (https://i.stack.imgur.com/lemFD.png)
All of the containers are deployed on Azure ACI, with an Application Gateway managing traffic.
Can somebody point me in the right direction for debugging? I don't know what to check. Only one container throws this error.
Edit: The Keycloak cluster itself is not the problem: I disabled JDBC_PING and left only one instance, and the exceptions are still appearing, so I now think it is a connection timeout. It is really weird that it happens only in production; staging works fine. Still investigating :(
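Because JDBC_PING reads its discovery table through the same datasource Keycloak uses, any flaky database connectivity surfaces as "Error reading JDBC_PING table" even when the cluster itself is healthy. For reference, a minimal sketch of wiring JDBC_PING on the legacy jboss/keycloak image (the service names and credentials here are illustrative, not taken from the question):

    # docker-compose sketch, assuming the jboss/keycloak image
    services:
      keycloak:
        image: jboss/keycloak
        environment:
          DB_VENDOR: postgres
          DB_ADDR: postgres          # hypothetical database host
          DB_DATABASE: keycloak
          DB_USER: keycloak
          DB_PASSWORD: secret
          JGROUPS_DISCOVERY_PROTOCOL: JDBC_PING
          # JDBC_PING piggybacks on the Keycloak datasource, so a database
          # connection timeout and this error tend to appear together
          JGROUPS_DISCOVERY_PROPERTIES: datasource_jndi_name=java:jboss/datasources/KeycloakDS

Given that production differs from staging, idle-timeout settings on the Application Gateway or on the Postgres side would be a reasonable first place to compare.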

Composer instance freeze, metadata.google.internal authentication error

Our Composer instance dropped all its active workers in the middle of the day. Node memory and CPU utilization disappeared for 2 out of 3 nodes.
First errors were:
_mysql_exceptions.OperationalError: (2006, "Can't connect to MySQL server on 'airflow-sqlproxy-service.default.svc.cluster.local' (110))"
Restarting the Composer instance (with a dummy env variable) does not help and gives the error below. Killing the GKE workers that are in error does not help either. Stackdriver has this:
ERROR: (gcloud.container.clusters.describe) You do not currently have an active account selected.
And another error seems to point to a problem with Google's internal authentication service:
ERROR: (gcloud.container.clusters.get-credentials) There was a problem refreshing your current auth tokens: Unable to find the server at metadata.google.internal
The Composer storage bucket seems to have 'Storage Legacy Bucket ...' permissions for some service accounts. Are there changes going on in the authentication backend, or what else could be the underlying cause of this sudden and weird freeze?
Versions are composer-1.8.2 and airflow-1.10.3.
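Both gcloud errors point at the GKE metadata server: "Unable to find the server at metadata.google.internal" means token refresh cannot even reach it. One hedged way to narrow this down (a sketch; the pod name and image are arbitrary choices, not part of Composer) is to run a throwaway pod in the Composer GKE cluster and query the metadata server directly:

    # metadata-check.yaml - kubectl apply -f metadata-check.yaml, then kubectl logs metadata-check
    apiVersion: v1
    kind: Pod
    metadata:
      name: metadata-check
    spec:
      restartPolicy: Never
      containers:
        - name: check
          image: curlimages/curl
          # The Metadata-Flavor header is required by the GCE metadata server
          command: ["curl", "-sS", "-H", "Metadata-Flavor: Google",
                    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"]

If this fails on the affected nodes but works on the healthy one, the problem is node-level metadata/DNS reachability rather than anything in Airflow itself.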

Hyperledger Composer Identity Issue error after network restart (code:20, authorization failure)

I am using Docker Swarm and docker-compose to set up my Fabric (v1.1) and Composer (v0.19.18) networks.
I wanted to test how my Swarm/Fabric networks would respond to a host/EC2 failure, so I manually rebooted the host running the fabric-ca, orderer, and peer0 containers.
Before the reboot, everything runs perfectly with respect to issuing identities. After the reboot, although all of the Fabric containers restart and appear to be functioning properly, I am unable to issue identities with the main admin@network card.
After the reboot, composer network ping -c admin@network returns successfully, but composer identity issue (via the CLI or JavaScript) returns code 20 errors as described here:
"fabric-ca request register failed with errors [[{\"code\":20,\"message\":\"Authorization failure\"}]]"
I am guessing the issue stems from the host reboot and some difference in how it restarts the Fabric containers. I can post my docker-compose files if necessary.
If your fabric-ca-server has restarted and its registration database hasn't been persisted (for example, the database is stored on the container's file system, so losing the container means losing that file system's contents), then the ca-server will create a completely new bootstrap identity called admin for issuing identities. That identity won't be the one you already have, so yours is no longer valid for the fabric-ca-server (note that it is still a valid identity for the Fabric network). This is why you now get an authorisation failure: the identity called admin that you currently hold is no longer known to your fabric-ca-server.
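A minimal sketch of how to avoid this, assuming the stock hyperledger/fabric-ca image and its default server home directory (the volume name and bootstrap credentials are illustrative): put the CA's home, which holds the default sqlite registration database, on a named volume so it survives container loss and host reboots.

    # docker-compose sketch: persist the fabric-ca server state
    services:
      ca:
        image: hyperledger/fabric-ca:1.1.0
        command: fabric-ca-server start -b admin:adminpw
        volumes:
          - ca-data:/etc/hyperledger/fabric-ca-server   # default FABRIC_CA_HOME
    volumes:
      ca-data:

With the registration database persisted, the previously issued admin identity stays known to the CA after a restart.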

Gateway app cannot connect to microservices

We are using JHipster and Docker for our microservices architecture. We just deployed our application stack to Docker Swarm (docker-compose version 3) with only one node active, and we are having issues with the gateway app: Zuul times out connecting to the backend microservices. We have a different environment where we are not using Swarm (docker-compose version 2), and there it works great.
In Swarm I was able to curl the backend microservices from the gateway app using containername:port, but not containerIp:port. I am lost here, as I could not narrow down whether it is a Swarm issue or a JHipster issue. I even changed 'prefer-ip-address: false' in our app properties, but the issue is the same. Any leads on what the issue could be?
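Since the service name resolves from the gateway but the raw container IP does not, the registry may be handing Zuul task IPs that are not routable the way Swarm's own DNS names are. A hedged sketch of making each microservice register in Eureka under its Swarm service name (this assumes the standard JHipster/Eureka setup; the service name is illustrative):

    # application-prod.yml of a backend microservice (sketch)
    eureka:
      instance:
        prefer-ip-address: false
        hostname: my-microservice   # hypothetical: the Swarm service name,
                                    # resolvable via Swarm's built-in DNS

Alternatively, compose file format 3.3+ supports deploy.endpoint_mode: dnsrr, which bypasses the ingress VIP and may behave closer to the non-Swarm environment.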

Docker Swarm Late Server Startup

I've been using Docker Swarm for a while and I'm really pleased with how simple it is to set up a swarm cluster and run replicated services. However, I've hit a problem that looks like a blocker for my use case.
I'm using Docker 1.12 and swarm mode.
My problem is that the internal IPVS load balancer sends requests to tasks whose health status is still "starting", even though my application has not properly started yet.
My application takes some time to start, but the Swarm load balancer starts sending requests as soon as the container is in state "running".
After running some tests I realized that if I scale up by one instance, the new instance is available to the load balancer immediately, and a client may get a connection-refused response if the load balancer sends its request to the still-starting server.
I've implemented a health check, and I was expecting an instance to become available to the load balancer only after its first successful health check.
Is there any way to configure the load balancer or the scheduler to only send requests to instances that have properly started?
Best Regards,
Bruno Vale
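For reference, a sketch of the kind of health check Swarm consults before adding a task to the IPVS pool (compose v3 syntax; the image, endpoint, and timings are illustrative). Docker only delays routing when the image or service actually defines a HEALTHCHECK, and compose file v3.4+ adds start_period as a grace period for slow-starting apps:

    # docker-compose sketch: route to the task only once the app answers
    services:
      app:
        image: my-app:latest       # hypothetical image
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]  # assumes curl exists in the image
          interval: 10s
          timeout: 3s
          retries: 3
          start_period: 60s        # compose v3.4+

On Docker 1.12 specifically, start_period is not available, so a generous interval/retries combination is the closest equivalent.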
