I have a FastAPI service that I deploy with Traefik via Docker Swarm. Everything runs fine until I request an endpoint that is supposed to return a JSON file just over 10 KB. The call to this endpoint takes several minutes before returning an empty string along with a 200 status code. I've checked all the logs I could find, and it looks like FastAPI completes the request properly and instantly (hence the 200 status code). It should send the response to my Traefik reverse proxy, hosted on a different node of my swarm, which would forward it to the client. However, it looks like the response is lost somewhere along the way: the client never gets the expected JSON file and receives an empty string instead.
Has this ever happened to you? Is there a parameter to set on Docker Swarm networks so they can handle this kind of payload (it doesn't seem that heavy to me)? Any help would be greatly appreciated, thanks a lot!
I tried changing the endpoint name, deploying the service with standard Docker and Traefik (it works perfectly), and returning a smaller JSON file (it works well with very small JSONs). I'm out of options :)
To answer my own question, if anyone stumbles onto this: it was caused by the GCP infrastructure hosting my swarm.
The VPC provided by GCP had an MTU (Maximum Transmission Unit) of 1460, whereas Docker networks default to 1500. Packets sent to my reverse proxy were thus dropped as soon as they were bigger than 1460 bytes.
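The fix is to create the overlay network with an MTU that matches the VPC. A minimal sketch, assuming a swarm-scoped overlay network (the network name is a placeholder):

    docker network create \
        --driver overlay \
        --opt com.docker.network.driver.mtu=1460 \
        app-net

Services attached to this network then exchange packets that fit inside the VPC's 1460-byte MTU, so nothing gets silently dropped between nodes.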
The Google Cloud Platform Kubernetes Engine based backend deployment I work on has between 4 and 60 nodes running at all times, spanning two different services.
However, I want to interface with an API that employs IP whitelisting, which means that all outgoing requests would have to be funneled through one single IP address.
How do I do this? The deployment uses an Nginx Ingress controller, which doesn't allow many options when it comes to the egress part of things.
I tried setting up a VM outside of the deployment, but still on GCP in the same region, and was unable to set up a forward proxy. At least, not one that I could connect to from my local device. I'm not sure whether this was because of GCP's firewall or something of that sort. This was using Squid, as well as Apache, with no success in either.
I also looked at the Cloud NAT option, but it seems I would have to recreate all the services, CI/CD pipelines, DNS settings, etc. I would ideally avoid that, as it would be a few days' worth of work and would call for some downtime of the systems as well.
Ideally I would have a working forward proxy. I tried looking for Docker images that would function as one, but that does not seem to be a thing, sadly. SSHing into a VM to set up such a proxy hasn't led to success yet, either.
You have already found the solution: you have to rebuild things using either Cloud NAT or an equivalent solution of your own. Even that option is relatively recent, and I've not actually tried it myself; as recently as six months ago we were told this was not supported for GKE. Our solution was the proxy idea you mentioned: an HTTP proxy running outside of GKE, with requests directed through it at the app-code level rather than at the infrastructure level. It was not fun.
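For illustration, a minimal sketch of what "directing things through the proxy at the app code level" can look like, here in Python with the requests library; the proxy address and the API URL are hypothetical placeholders, not anything from the question:

    import requests

    # Hypothetical forward proxy (e.g. Squid) on a VM with the whitelisted static IP.
    PROXY_URL = "http://10.128.0.42:3128"

    proxies = {"http": PROXY_URL, "https": PROXY_URL}

    # Outbound calls to the whitelisted API are funneled through the proxy,
    # so the API only ever sees the proxy VM's IP address.
    response = requests.get(
        "https://api.example.com/v1/resource",
        proxies=proxies,
        timeout=10,
    )
    print(response.status_code)

The downside, as noted above, is that every service making such calls has to be taught about the proxy; nothing at the infrastructure level enforces it.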
I am writing an application using Asterisk-Java. It is designed to run on a server that also runs Asterisk. So far, so good.
My application, which originates calls (using the AMI) and manages user input (using Asterisk-Java's FastAGI and an embedded AgiServer), works great on both my development server and the production server.
For deployment purposes, I am now asked to create a Docker container that would pack up Asterisk and my application, so that it could be easily deployed to other places without having to go through installations and configurations.
The thing is, my application does not behave the same way in the Docker container: on the development/production servers, I can get a DTMF code using the getData function; in the Docker container, getData never seems to receive the DTMF data from Asterisk (I can stream a file, but the function eventually times out, which means it did not get anything).
I first thought of an unexposed port, but since this communication problem seems to be between the AGI server and Asterisk, which are both running in the container, I find that hard to believe.
I have no other ideas; please suggest something.
Check out the dtmfmode parameter for your SIP peer...
If you are using RFC 2833 (DTMF via RTP), unexposed media ports could very well be the reason.
You could try to tune your port settings (it could be a lot of ports!), as sketched below.
Or try to use DTMF via SIP INFO as an alternative.
But that wouldn't fix any underlying media problems...
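A hedged sketch of both options (the peer name and RTP range are assumptions, not taken from the question):

    ; /etc/asterisk/sip.conf -- hypothetical peer
    [mypeer]
    dtmfmode=rfc2833    ; DTMF carried in-band in the RTP stream (RFC 2833)
    ;dtmfmode=info      ; alternative: DTMF via SIP INFO, needs no media ports

    ; /etc/asterisk/rtp.conf -- narrow the RTP range so exposing it stays practical
    rtpstart=10000
    rtpend=10100

With RFC 2833, the container must also publish the RTP range, e.g. docker run -p 5060:5060/udp -p 10000-10100:10000-10100/udp ...; otherwise RTP from the phone may never reach Asterisk, and the in-band DTMF with it.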
I have put together an architecture that, at a high level, is best described below.
A five-node Docker Swarm cluster
Say 5 instances of my dockerized microservice, running one copy on each of the swarm nodes
The service offers functionality via REST endpoints
One such functionality is downloads, and they work perfectly. I wrote some code in the Scala/Play framework, dockerized the service, and deployed it.
I also know that since I use Swarm, it internally does load balancing per request for me.
I have some questions on WebSockets and on how the load balancer avoids ruining things during downloads.
I start a 5 GB file download and it works. I am using HTTP streaming, or chunked transfer; I guess it does not matter. Now my question: once my REST endpoint for the download is hit, the TCP connection remains open until the server closes it. Is it because the connection stays open that swarm load balancing does not interfere? In short, each time a client makes an HTTP call, Swarm load-balances it, but once the TCP socket is established, as in this specific download example, the request is served by one node because the connection is not re-established during the download process?
If a client opens a WebSocket, it will hit one of the swarm nodes where the service is running, and since the WebSocket connection stays open, that same service instance will push the notifications?
If for some reason the WebSocket dies, a new connection might be established by the client, but the request might end up on some other service instance and will stay there until a new connection is established again?
Are the above 3 points correct in my understanding? Is there some reading material/blog where I can find more elaboration on this?
Maybe use nginx as the proxy/load balancer, in ip_hash mode:
Specifies that a group should use a load balancing method where requests are distributed between servers based on client IP addresses. The first three octets of the client IPv4 address, or the entire IPv6 address, are used as a hashing key. The method ensures that requests from the same client will always be passed to the same server except when this server is unavailable. In the latter case client requests will be passed to another server. Most probably, it will always be the same server as well.
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#ip_hash
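A minimal sketch of that idea; the upstream addresses are hypothetical placeholders for the swarm nodes:

    upstream swarm_service {
        ip_hash;                  # pin each client IP to one backend
        server 10.0.0.11:9000;
        server 10.0.0.12:9000;
        server 10.0.0.13:9000;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://swarm_service;
            # needed for WebSocket upgrades to pass through the proxy
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }

With ip_hash, a client that reconnects (for example after a dropped WebSocket) will most likely land on the same backend again, which addresses the third point in the question.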
All,
We recently had an issue where the ELB health check failed to cover a certain use case or scenario, which caused an application impact.
Can anyone suggest a fault-tolerant approach to handle this?
We have a Node.js app running on port 80.
We have 3 instances in the target group, and that group is enrolled in the ELB.
The ELB health check was configured to hit the root path on port 80 and report success if it gets an HTTP 200.
Recently one of the nodes had its application mount 100% full, while the root mount still had space.
Though the health check was succeeding as far as the ELB was concerned, the server didn't respond to any other services and was effectively unhealthy. This means that some requests succeeded while others failed (those routed to the disk-filled server).
We did receive notifications from other monitoring systems about the disk filling up, but due to the overwhelming volume of emails and limited resources it got missed.
Is there any way we can improve the health check strategy so that these scenarios are communicated to the Auto Scaling group or ELB, so that we can have such nodes removed and replaced automatically?
Rather than just checking that the index page returns a 200 response, you can configure Elastic Load Balancing to point to a custom health check page (eg healthcheck.php).
You could run some code on that page to test the general health of the application (database connectivity, disk space, free memory). If everything checks out OK, return a 200 response. If something is wrong, return a 500 response. This will cause the Load Balancer to treat the instance as Unhealthy and it will stop serving traffic to the instance.
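A minimal sketch of such a health check endpoint, shown here in Python with Flask for brevity (the question's app is Node.js, and the mount path and threshold below are assumptions):

    import shutil

    from flask import Flask

    app = Flask(__name__)

    # Hypothetical values: check the mount the application writes to,
    # and treat less than 5% free space as unhealthy.
    APP_MOUNT = "/var/app"
    MIN_FREE_RATIO = 0.05

    @app.route("/healthcheck")
    def healthcheck():
        usage = shutil.disk_usage(APP_MOUNT)
        if usage.free / usage.total < MIN_FREE_RATIO:
            # A 500 tells the load balancer this instance is unhealthy.
            return "disk almost full", 500
        # Further checks (database connectivity, free memory, ...) go here.
        return "ok", 200

The key point is that the check exercises the same resources the application depends on, so a full application mount turns into a failing health check instead of passing unnoticed.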
If Auto Scaling is configured to use the ELB Health Check, then Auto Scaling will terminate the unhealthy instance and automatically replace it with a new instance.
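For reference, pointing an existing Auto Scaling group at the ELB health check is a one-liner with the AWS CLI (the group name is a placeholder):

    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name my-asg \
        --health-check-type ELB \
        --health-check-grace-period 300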
I am not sure where the root of my problem actually comes from, so I'll try to explain the bigger picture.
In short, the symptom: after upgrading Consul from 0.7.3 to 0.8.1, my agents (explained below) could no longer connect to the cluster leader due to duplicated node IDs (why that probably happens is also explained below).
I could neither fix it with https://www.consul.io/docs/agent/options.html#_disable_host_node_id nor fully understand why I ran into this, and that's where the bigger picture, and maybe even different questions, come from.
I have the following setup:
I run an application stack with about 8 containers for different services (different microservices, DB types and so on).
I use a single Consul server per stack (yes, the Consul server runs inside the software stack; it has its reasons, because I need this to be offline-deployable and every stack lives for itself).
The Consul server handles registration, service discovery and also KV/configuration.
Important/questionable: every container has a Consul agent started with "consul agent -config-dir /etc/consul.d", connecting to this one server. The configuration looks like this, including other files with the encrypt token / ACL token. Do not wonder about servicename(); it is replaced by an m4 macro at image build time.
The clients are secured by a gossip key and ACL keys
Important: All containers are on the same hardware node
The server configuration looks like this, if it is of any importance. In addition, the ACLs look like this, and the ACL master and client token/gossip JSON files are in that configuration folder.
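Since the actual files are not shown here, a purely hypothetical minimal sketch of what such a per-container agent config could look like (every value below is a placeholder, not taken from the setup above):

    {
      "datacenter": "dc1",
      "data_dir": "/var/lib/consul",
      "retry_join": ["consul-server"],
      "encrypt": "<gossip-key>",
      "acl_token": "<client-acl-token>"
    }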
Sorry if the above is TL;DR, but the reason behind all the explanation is this multi-agent setup (or 1 agent per container).
My reasons for that:
I use tiller to configure the containers, so the dimploy gem will usually try to connect to localhost:8500. To accomplish that without making the Consul configuration extraordinarily complicated, I use this local agent, which then forwards requests to the actual server and thus handles all the encryption-key/ACL negotiation stuff.
I use several 'consul watch' tasks on the server to trigger re-configuration; they also run against localhost:8500 without any extra configuration.
That said, the reason I run 1 agent per container is the simplicity it gives local services talking to the Consul backend: they do not need to know about authentication as long as they connect through 127.0.0.1:8500 (that being the security boundary).
Final Question:
Is the multi-Consul-agent setup actually designed to be used this way? The reason I ask is that, as far as I understand, the node-ID duplication issue I now get when starting 0.8.1 comes from "the host" being the same, i.e. the hardware node being identical for all the Consul agents, right?
Is my design wrong, or do I need to generate my own node IDs from now on and it's all just fine?
It seems this issue has been identified by HashiCorp and addressed in https://github.com/hashicorp/consul/blob/master/CHANGELOG.md#085-june-27-2017, where -disable-host-node-id has been set to true by default. The node ID is thus no longer generated from the host hardware but is a random UUID, which solves the issue I had running several Consul nodes on the same physical hardware.
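On versions before 0.8.5, where the flag still defaulted to false, the same effect can be had by passing it explicitly when starting each agent (using the same config directory as above):

    consul agent -disable-host-node-id -config-dir /etc/consul.d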
So the way I deployed was fine.