Our Composer instance dropped all of its active workers in the middle of the day, and node memory and CPU utilization disappeared for 2 of the 3 nodes.
The first errors were:
_mysql_exceptions.OperationalError: (2006, "Can't connect to MySQL server on 'airflow-sqlproxy-service.default.svc.cluster.local' (110))"
Restarting the Composer instance (by adding a dummy environment variable) does not help and fails with the error below. Killing the GKE workers that are in an error state does not help either. Stackdriver has this:
ERROR: (gcloud.container.clusters.describe) You do not currently have an active account selected.)
Another error seems to point to a problem with an internal Google authentication service:
ERROR: (gcloud.container.clusters.get-credentials) There was a problem refreshing your current auth tokens: Unable to find the server at metadata.google.internal)
The Composer storage bucket seems to have 'Storage Legacy Bucket ...' permissions for some service accounts. Are there changes going on in the authentication backend, or what else could be the underlying cause of this sudden and strange freeze?
Versions are composer-1.8.2 and airflow-1.10.3.
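For reference, this is a minimal check that can be run from a worker pod (or GKE node) to see whether the metadata server that the auth errors point at is reachable and still handing out tokens. It is only a sketch and assumes the standard metadata.google.internal endpoint and the default service account:

# Sketch: probe the GCE/GKE metadata server that gcloud relies on for
# token refresh. Run from a Composer worker pod or a GKE node.
import requests

METADATA_TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)

try:
    resp = requests.get(
        METADATA_TOKEN_URL,
        headers={"Metadata-Flavor": "Google"},
        timeout=5,
    )
    resp.raise_for_status()
    token = resp.json()
    print("metadata server OK, token expires in", token.get("expires_in"), "seconds")
except requests.RequestException as exc:
    # Mirrors the "Unable to find the server at metadata.google.internal" case.
    print("metadata server unreachable or refused the request:", exc)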
Can somebody help me with this problem?
I have a cluster of two Keycloak Docker containers using a Postgres database, and I use JDBC_PING for Keycloak cluster discovery. The problem is that when checking the logs of one of the instances I get the following errors:
Error reading JDBC_PING table(https://i.stack.imgur.com/vrsdp.png)
Rollback(https://i.stack.imgur.com/2z0MF.png)
Multiple threads active within it(https://i.stack.imgur.com/lemFD.png)
All of them are deployed on Azure ACI, using an Application Gateway to manage traffic.
Can somebody point me in the right direction for debugging?
I don't know what to check.
Only one container throws this error.
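For reference, this is roughly how the discovery table that the "Error reading JDBC_PING table" message refers to can be inspected. It is only a sketch: it assumes the default JGroups table name JGROUPSPING with its default columns, and the connection string below is a placeholder for the real Postgres instance:

# Sketch: list the members Keycloak/JGroups has registered for JDBC_PING.
# Assumptions: default table JGROUPSPING (own_addr, cluster_name, ping_data);
# DSN is a placeholder and must point at the actual Postgres database.
import psycopg2

DSN = "host=my-postgres dbname=keycloak user=keycloak password=secret"  # placeholder

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT own_addr, cluster_name FROM JGROUPSPING")
        rows = cur.fetchall()

print(f"{len(rows)} row(s) in JGROUPSPING")
for own_addr, cluster_name in rows:
    # Stale rows left behind by dead containers can confuse discovery.
    print(cluster_name, "->", own_addr)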
Edit: This is not a problem with the Keycloak cluster itself, because I disabled jdbc_ping and left only one instance and the exceptions are still appearing; I now think it is a connection timeout. It is really weird that it happens only on production, while staging works fine. Still investigating :(
We have the following scenario:
Current working setup
Web API project using a single Dockerfile
A release pipeline with an 'Azure App Service deploy' task.
Proposed new setup
Web API project using a multi-container Docker Compose file
A release pipeline with an 'Azure Web App for Containers' task.
Upon deploying the new setup we receive the below error message:
ERROR - multi-container unit was not started successfully
Unhandled exception. System.AggregateException: One or more errors occurred.
(Parameters: Connection String: XXX, Resource: https://vault.azure.net, Authority:
https://login.windows.net/xxxxx. Exception Message:
Tried to get token using Managed Service Identity.
Access token could not be acquired. Connection refused)
The exception is thrown because the app can't connect to Azure MSI (Managed Service Identity), which it does to obtain a token before connecting to Key Vault.
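To see whether the local MSI endpoint is even injected into the multi-container app, a check along the following lines can be run inside the container. This is only a sketch: it assumes the App Service-style MSI_ENDPOINT / MSI_SECRET environment variables that AppAuthentication falls back to when running in App Service, and it requests a Key Vault token directly:

# Sketch: request a Key Vault token straight from the local MSI endpoint,
# bypassing AppAuthentication, to see whether MSI is wired up at all.
# Assumes the App Service-style MSI_ENDPOINT / MSI_SECRET environment variables.
import os
import requests

msi_endpoint = os.environ.get("MSI_ENDPOINT")
msi_secret = os.environ.get("MSI_SECRET")

if not msi_endpoint or not msi_secret:
    # Matches what we observe: MSI_SECRET is blank inside the multi-container app.
    print("MSI endpoint/secret not present in this container")
else:
    resp = requests.get(
        msi_endpoint,
        params={"resource": "https://vault.azure.net", "api-version": "2017-09-01"},
        headers={"Secret": msi_secret},
        timeout=5,
    )
    print(resp.status_code, resp.text[:200])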
I have tried the following based upon some research and solutions others have found:
Connecting with "RunAs=App" (this seems to be the default parameter-less constructor anyway)
Building up the connection string myself manually by pulling the "MSI_SECRET" environment variable from the machine. This is always blank.
Restarting MSI.
Upgrading and downgrading AppAuthentication package
MSI appears to be configured correctly, as it works perfectly with our current working setup, so we can rule that out.
It's worth noting that this is a system-assigned identity, not a user-assigned one.
The documentation that states which services support managed identities only mentions 'Azure Container Instances', not 'Azure Managed Container Instances', and even that is Linux/Preview only, so it may simply not be supported:
Services that support managed identities for Azure resources
We've spent a considerable amount of time getting to this point with the configuration and deployment and it would be great if we could resolve this last issue.
Any help appreciated.
Unfortunately, there currently is no multi-container support for managed identities. The multi-container feature is in preview and so does not have all its functionality working yet.
However, the documentation you linked to is not very clear about the supported scenarios, so I am working on getting it updated to better clarify this. I will update this answer once that's done.
I run a very simple Micro Integrator service that has only one proxy service and a single sequence. In this sequence, the incoming XML message is forwarded to the Amazon SQS service.
If I run this in Integration Studio on the built-in instance, I have no problems. However, when I package the file into a CAR and feed it to the Docker instance, it boots up and instantly gets bombarded with requests. That is to say, the following logs take over and the container can no longer be manually stopped:
[2020-04-15 12:45:44,585]  INFO {org.apache.synapse.transport.passthru.SourceHandler} - Writer null when calling informWriterError
[2020-04-15 12:45:46,589] ERROR {org.apache.synapse.transport.passthru.SourceHandler} - HttpException occurred
org.apache.http.ProtocolException: Invalid request line: ÇÃ^ú§ß¡ðO©%åË*29xÙVÀ$À(=À&À*kjÀ
    at org.apache.http.impl.nio.codecs.AbstractMessageParser.parse(AbstractMessageParser.java:208)
    at org.apache.synapse.transport.http.conn.LoggingNHttpServerConnection$LoggingNHttpMessageParser.parse(LoggingNHttpServerConnection.java:407)
    at org.apache.synapse.transport.http.conn.LoggingNHttpServerConnection$LoggingNHttpMessageParser.parse(LoggingNHttpServerConnection.java:381)
    at org.apache.http.impl.nio.DefaultNHttpServerConnection.consumeInput(DefaultNHttpServerConnection.java:265)
    at org.apache.synapse.transport.http.conn.LoggingNHttpServerConnection.consumeInput(LoggingNHttpServerConnection.java:114)
    at org.apache.synapse.transport.passthru.ServerIODispatch.onInputReady(ServerIODispatch.java:82)
    at org.apache.synapse.transport.passthru.ServerIODispatch.onInputReady(ServerIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:113)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:159)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:338)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:316)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:277)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:105)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:586)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.ParseException: Invalid request line: ÇÃ^þvHÅFmÉ(#ë¸'º¯æ¦V
I made sure no outside connections were possible, and I also found older threads from someone describing this problem, but their solution (changing something in the keystore) did not work.
I also made sure to include the SQS certificate in the container.
I have no connections set up to connect to the container, so that should be out of the equation as well.
What am I missing here?
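In case it is useful for reproducing this, a throwaway listener like the one below can be run in place of the container to capture the raw bytes hitting the published port and identify who is sending them (8280 here is only an assumption standing in for the exposed HTTP transport port):

# Sketch: stand-in listener that logs whatever clients send to the port the
# container normally publishes, to identify the source of the traffic.
# Assumption: 8280 is a placeholder for the actual published port.
import socket

PORT = 8280

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", PORT))
    srv.listen()
    print("listening on", PORT)
    while True:
        conn, addr = srv.accept()
        with conn:
            data = conn.recv(512)
            # Readable HTTP request lines and binary (e.g. TLS) handshakes
            # look very different when printed here.
            print(addr, repr(data[:120]))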
I have no idea why, but I have identified the culprit to be none other than Portainer. When I shut down Portainer, the stream of requests stops.
According to Wireshark, the requests are all made towards:
GET http://172.17.0.1:9000/api/endpoints/<containerID>/docker/<someId>/logs
It seems that because the WSO2 container I'm trying to run is an ESB that serves its own endpoints and returns 400 status codes for non-existent ones, Portainer keeps retrying until it succeeds. This is just my observation, so I could be wrong.
I confirmed my findings by uploading my container to AWS, where the problem did not exist.
We have been using a service worker on our mobile web app for some time now.
We use Sentry as our event logging tool.
We are getting a lot of errors of the type:
Cannot update a null/nonexistent service worker registration
Error: AbortError: Failed to update a ServiceWorker for scope ('https://www.some.production.domain/') with script ('https://www.some.production.domain/sw.js'): Timed out while trying to start the Service Worker.
And so: is there a standard way to know why this happens and whether we should be worried about these kinds of errors?
Or even to get more details to try to figure out why they seem to happen randomly?
I am running multiple Spring Boot servers, all connected to a Spring Boot Admin instance. Everything is running in the same Docker Swarm.
Spring Boot Admin keeps reporting on these "fake" instances that pop up and die. They are up for 1 second and then become unresponsive. When I clear them, they come back. The details for that instance show this error:
Fetching live health status failed. This is the last known information.
Request failed with status code 502
Here's a screenshot:
This is the same for all my APIs, and it is causing us to get an inaccurate health reading of our services. How can I get Admin to stop reporting on these non-existent containers?
I've looked in all my nodes and can't find any containers (running or stopped) that match the unresponsive containers that Admin is reporting.
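As a debugging aid, a sketch like the one below can dump everything the Admin server currently has registered, so it can be compared against the output of docker service ps on the nodes. It assumes Spring Boot Admin 2.x with its /instances REST endpoint reachable at the placeholder URL and no authentication in front of it:

# Sketch: list every instance Spring Boot Admin currently has registered,
# so the "fake" entries can be matched (or not) against real swarm tasks.
# Assumptions: Spring Boot Admin 2.x, API at ADMIN_URL, no auth.
import requests

ADMIN_URL = "http://spring-boot-admin:8080"  # placeholder

instances = requests.get(f"{ADMIN_URL}/instances", timeout=5).json()
for inst in instances:
    reg = inst.get("registration", {})
    status = inst.get("statusInfo", {}).get("status")
    print(inst.get("id"), reg.get("name"), reg.get("serviceUrl"), status)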