We're using Contentful to manage CMS content. When you save content in Contentful, it sends webhooks to a service we've set up on Cloud Run, which in turn ensures the updated content is built and deployed.
Until now, the Cloud Run service has been limited to a maximum of 1 container with a limit of 80 concurrent requests. This should be plenty for the few webhooks we get occasionally.
Now, while debugging complaints about content not being updated, I bumped into a very persistent and irritating issue - Google Cloud Run does not try to process both of the 2 webhooks sent by Contentful, but instead responds to one of them with status 429 and "Rate exceeded." in the response body.
This response does not come from our backend. In the Cloud Run Logs tab I can see this message generated by Google: "The request was aborted because there was no available instance."
I've tried:
Increasing number of processes on the container from 1 to 2 - should not be necessary due to use of an async framework
Increasing number of containers from 1 to 2
The issue persists for the webhooks from Contentful.
If I make requests from my local machine with hey, which defaults to 200 requests at a concurrency of 50, they all go through without any 429 status codes returned.
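For comparison, the Contentful pattern (2 requests at nearly the same moment, after a quiet period) can be mimicked with something like the sketch below. It assumes the endpoint accepts a POST with an empty JSON body, which is only a guess about the webhook handler; post_once and burst are illustrative names, not part of our service.

```python
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def post_once(url: str) -> int:
    """Send one POST with an empty JSON body and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is what we want.
        return err.code


def burst(url: str, n: int = 2, send=post_once) -> Counter:
    """Fire n requests concurrently and count the status codes that come back."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return Counter(pool.map(send, [url] * n))


# Example (URL is a placeholder): burst("https://[redacted].a.run.app/webhook")
```

Running this after the service has been idle for a while should show whether the 429 depends on the client or only on the timing of the requests.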
What is going on that generates 429 status codes when a specific client - in this case Contentful - makes ONLY 2 requests in quick succession? How do we disable or bypass this behavior?
gcloud run services describe <name> gives me these details of the deployment:
+ Service [redacted] in region europe-north1
URL: https://[redacted].a.run.app
Ingress: all
Traffic:
100% LATEST (currently [redacted])
Last updated on 2021-01-19T13:48:46.172388Z by [redacted]:
Revision [redacted]
Image: eu.gcr.io/[redacted]/[redacted]:c0a2e7a6-56d5-4f6f-b241-1dd9ed96dd30
Port: 8080
Memory: 256Mi
CPU: 1000m
Service account: [redacted]-compute@developer.gserviceaccount.com
Env vars:
WEB_CONCURRENCY 2
Concurrency: 80
Max Instances: 2
Timeout: 300s
This is more speculation than an answer, but I would try re-deploying your Cloud Run service with min-instances set to 1 (or more).
Here is why.
In the Cloud Run troubleshooting docs they write (emphasis mine):
This error can also be caused by a sudden increase in traffic, a long container startup time or a long request processing time.
Your Cloud Run service receives webhook events from a CMS (Contentful). And, as you wrote, these updates are rather sporadic. So I think that your situation could be the same as the one described in this comment on Medium:
I tested “max-instances: 2” and the conclusion is I got 429 — Rate exceeded responses from Google frontend proxy because no container was running. It seems that a very low count of instances will deregister your service completely from the load-balancer until a second request was made.
If Google Cloud did indeed de-register your Cloud Run service completely because it was not receiving any traffic, re-deploying the service with at least one container instance could fix your issue. Another way would be to call your Cloud Run service every once in a while just to keep it "warm".
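The keep-warm idea could look roughly like the sketch below. It is only an illustration, assuming the service has some cheap URL that is safe to hit; the URL, the interval and the names ping and keep_warm are hypothetical. In practice a scheduler would usually trigger the request instead of a long-running loop.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List


def ping(url: str, timeout_s: float = 10.0) -> int:
    """Send one GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 429) raise HTTPError; keep the code.
        # Connection-level failures (URLError) are deliberately not caught.
        return err.code


def keep_warm(
    url: str,
    interval_s: float,
    rounds: int,
    fetch: Callable[[str], int] = ping,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Ping the URL `rounds` times, waiting `interval_s` between pings."""
    statuses = []
    for i in range(rounds):
        statuses.append(fetch(url))
        if i < rounds - 1:
            sleep(interval_s)
    return statuses


# Example (placeholder URL, one ping every 5 minutes for an hour):
# keep_warm("https://[redacted].a.run.app/", interval_s=300, rounds=12)
```

The returned status codes make it easy to see whether the pings themselves ever receive a 429.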
Some context here:
An old Symfony app is used in multiple EC2 instances. Handles millions of requests each day without issues.
For dev purposes, the app was put into a container, which the developers use locally so they don't have to install all the requirements. The dockerized app uses the same nginx/supervisor/php-fpm configs as the production EC2 instances.
To simplify some dev processes, it was decided to create multiple dev environments using AWS Fargate instead of EC2 instances.
The image is pushed to ECR and is deployed using FARGATE strategy to clusters.
The approach is perhaps overkill, since we have 1 cluster running only 1 service with 1 task. That service uses an ELB -> target group.
The application works fine, but after some time (hours or days), some requests are returned with different headers. The response is JSON, but the content type is returned as HTML, and other headers such as access-control-allow-headers, access-control-allow-credentials and access-control-allow-methods are dropped from the response, triggering a CORS error in the client's browser.
The weird part is that if 1 page makes 10 requests to this service, 9 will work correctly, but 1 request will return 200 with different headers. That endpoint will consistently behave the same way for any user until the task is restarted.
The response headers are returned by the Symfony app. I also tried forcing those headers by adding them to the nginx config by default for every response, and the result is the same.
The docker image exposes port 80 to the service.
The load balancer has the rule to forward HTTPS (443) traffic to port 80, so traffic can reach the container.
The load balancer has HTTP/2 enabled.
The only notable difference besides the EC2/Fargate implementations is the load balancer. The production load balancer is an old Classic Load Balancer with only HTTP/1 enabled, and the new ones are Application Load Balancers using HTTP/2.
This is driving me crazy. Has anyone experienced something like this?
Incorrect headers
Correct headers
I am using Azure Kubernetes Service and I'm having a huge problem working out why pods (of a specific type) aren't starting. The only thing that happens is that when the new pods start, the health check times out and AKS silently goes back to the old deployed services that worked. I have added a lot of trace output to the service to detect where it fails, whether external calls are blocked etc., and I have a global try/catch in Program.cs, but no information comes out. AKS listens on stdout, grabs the logs there and pushes them to an external tool. I have tried increasing the values for when the health check should start etc., as below, but with no result:
livenessProbe:
  .
  .
  initialDelaySeconds: 60
  periodSeconds: 10
readinessProbe:
  .
  .
  initialDelaySeconds: 50
  periodSeconds: 15
When running the service locally, it's up in 10-15 sec.
Obviously things seem to fail before the service has started, or something is timing out, and I'm wondering...
Can I fetch logs or monitor what's happening and why pods are so slow in AKS when they are starting?
Is it possible to monitor what comes out on stdout on a virtual machine that belongs to the AKS cluster?
It feels like I have tested everything, but I can't find any reason why the health monitoring is refusing requests.
Thanks!
If you enabled Azure Monitor for containers when you created your cluster, the logs of your application will be pushed to a Log Analytics workspace, in the ContainerLog table. If Azure Monitor is not enabled, you can use kubectl to see what is written to stdout and stderr with the following command:
kubectl logs {pod-name} -n {namespace}
You can also check the Kubernetes events. If this is really the problem, you'll see events saying that the probes failed:
kubectl get events -n {namespace}
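If the event list is long, the probe failures can be picked out programmatically. The sketch below is only an illustration: it assumes the events were exported with the -o json flag, and it relies on failed probes being recorded with the reason "Unhealthy". The function name and the output fields chosen are my own.

```python
import json
from typing import Any, Dict, List


def failed_probe_events(events_json: str) -> List[Dict[str, Any]]:
    """Return a short summary of every event whose reason is "Unhealthy"."""
    items = json.loads(events_json).get("items", [])
    return [
        {
            # Name of the object (normally the pod) the event refers to.
            "pod": event.get("involvedObject", {}).get("name"),
            # E.g. "Readiness probe failed: ..." or "Liveness probe failed: ...".
            "message": event.get("message"),
            # How many times the same event has been seen.
            "count": event.get("count"),
        }
        for event in items
        if event.get("reason") == "Unhealthy"
    ]


# Example: save the output of
#   kubectl get events -n {namespace} -o json
# to events.json, then:
#   print(failed_probe_events(open("events.json").read()))
```

The messages usually say whether it was the liveness or the readiness probe that failed and why (timeout, connection refused, unexpected status code).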
I run a very simple Micro Integrator service that has only 1 proxy service and a single sequence. In this sequence the incoming XML message is transferred to Amazon SQS.
If I run this in Integration Studio on the built-in instance, I have no problems. However, when I package it into a CAR and feed it to the Docker instance, it boots up and instantly gets bombarded with requests. That is to say, the following logs take over and the container can no longer be stopped manually:
[2020-04-15 12:45:44,585] INFO {org.apache.synapse.transport.passthru.SourceHandler} - Writer null when calling informWriterError
[2020-04-15 12:45:46,589] ERROR {org.apache.synapse.transport.passthru.SourceHandler} - HttpException occurred
org.apache.http.ProtocolException: Invalid request line: ÇÃ^ú§ß¡ðO©%åË*29xÙVÀ$À(=À&À*kjÀ
    at org.apache.http.impl.nio.codecs.AbstractMessageParser.parse(AbstractMessageParser.java:208)
    at org.apache.synapse.transport.http.conn.LoggingNHttpServerConnection$LoggingNHttpMessageParser.parse(LoggingNHttpServerConnection.java:407)
    at org.apache.synapse.transport.http.conn.LoggingNHttpServerConnection$LoggingNHttpMessageParser.parse(LoggingNHttpServerConnection.java:381)
    at org.apache.http.impl.nio.DefaultNHttpServerConnection.consumeInput(DefaultNHttpServerConnection.java:265)
    at org.apache.synapse.transport.http.conn.LoggingNHttpServerConnection.consumeInput(LoggingNHttpServerConnection.java:114)
    at org.apache.synapse.transport.passthru.ServerIODispatch.onInputReady(ServerIODispatch.java:82)
    at org.apache.synapse.transport.passthru.ServerIODispatch.onInputReady(ServerIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:113)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:159)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:338)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:316)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:277)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:105)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:586)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.ParseException: Invalid request line:
ÇÃ^þvHÅFmÉ
(#ë¸'º¯æ¦V
I made sure no outside connections were possible. I also found older threads where someone described this problem, but their solution (changing something in the keystore) did not work.
Also, I made sure to include the SQS certificate in the container.
I have not set up anything to connect to the container, so that should be out of the equation as well.
What am I missing here?
I have no idea why, but I have identified the culprit as none other than Portainer. When I shut down Portainer, the stream of requests stops.
According to Wireshark, the requests are all made to
GET http://172.17.0.1:9000/api/endpoints/< containerID >/docker/< someId >/logs
It seems that because the WSO2 container I'm trying to run is an ESB that uses endpoints and returns 400 status codes for non-existent endpoints, Portainer retries until it succeeds. This is just my observation, so I could be wrong.
I have confirmed my findings by uploading my container to AWS where the problem did not exist.
I am running multiple Spring-Boot servers all connected to a Spring Boot Admin instance. Everything is running in the same Docker Swarm.
Spring Boot Admin keeps reporting on these "fake" instances that pop up and die. They are up for 1 second and then become unresponsive. When I clear them, they come back. The details for that instance show this error:
Fetching live health status failed. This is the last known information.
Request failed with status code 502
Here's a screenshot:
This is the same for all my APIs, and it gives us an inaccurate health reading of our services. How can I get Admin to stop reporting on these non-existent containers?
I've looked in all my nodes and can't find any containers (running or stopped) that match the unresponsive containers that Admin is reporting.
So I am building a web application for university which has a very high tick rate (clients receiving data from the Node server more than 30 times per second via Socket.IO). This works well in Docker. Now I have installed nginx and configured it, and everything works well (no exposed ports, socket still running etc.), but nginx now logs every single socket connection from every single client in the Docker terminal (with two clients, well above 60 log lines per second). I think this also leads to performance issues and causes small lags for the clients. I did not find any solution in the nginx docs.