Connectivity loss in federated prometheus - monitoring

I'm considering setting up a Prometheus monitoring stack that uses federation to cope with poor connectivity. My use case is the following:
I have N separate small on-prem setups, each consisting of several machines with Docker containers running on them.
Each of those setups is connected to the cloud, but the connection is poor and can drop at times.
I need a single Prometheus instance in the cloud that aggregates data from all of those on-prem "small setups".
My idea is that the local Prometheus servers scrape my jobs, machines, etc., and federation then brings those metrics into the "central" Prometheus instance.
Now, I'm not sure what will happen if I lose connectivity between the cloud and on-prem. Will the central server download all the samples from the local Prometheus servers once connectivity is back? Or will it never learn what happened during the network outage?

Related

add nodes to docker swarm from different servers

I am new to Docker swarm. I read the documentation and googled the topic, but the results were vague.
Is it possible to add worker or manager nodes from distinct and separate virtual private servers?
The idea is to connect many unrelated hosts into a swarm, which then distributes work over many systems and gives you resiliency in case of hardware failures. The only things you need to watch out for are that the internet connection between the hosts is stable and that all of the ports listed in the official documentation are open. And you are good to go :)
Oh, and between managers you want a VERY stable internet connection without any random ping spikes, or you may encounter weird behaviour (because of Raft consensus and decision making).
Other than that, it is fine.
Refer to Administer and maintain a swarm of Docker Engines
In production, the best practice for maximising swarm HA is to spread your swarm managers across multiple availability zones. Availability zones are geographically co-located but distinct sites, i.e. instead of having a single London data centre, have three, each connected to a different internet and power utility. That way, if any single ISP or power utility has an outage, you still have two data centres connected to the internet.
Swarm was designed with this kind of highly available topology in mind and can scale to having its managers, and workers, distributed across nodes in different data centres.
However, Swarm is sensitive to latency over longer distances, so global distribution is not a good idea. Within a single city, latencies between data centres will be in the low tens of ms, which is fine.
Connecting data centres in different cities or continents moves the latency into the low to mid hundreds of ms, which does cause problems and leads to instability.
Otherwise, go ahead. Build your swarm across AZ-distributed nodes.
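Since the answers above stress that the required ports must be open between hosts, here is a minimal Python sketch for probing them before joining a node. It assumes the ports listed in the Docker swarm documentation (2377/tcp for cluster management, 7946/tcp and 7946/udp for node communication, 4789/udp for overlay traffic) and only probes the TCP ones, because a plain connect cannot verify UDP. The function names are made up for illustration.

```python
import socket
from typing import Dict

# TCP ports from the Docker swarm documentation. The UDP ports
# (7946/udp and 4789/udp) cannot be verified with a simple connect.
SWARM_TCP_PORTS = {
    2377: "cluster management",
    7946: "node-to-node communication",
}


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_swarm_ports(host: str, timeout: float = 2.0) -> Dict[int, bool]:
    """Probe every known swarm TCP port on host and report which are reachable."""
    return {port: tcp_port_open(host, port, timeout) for port in SWARM_TCP_PORTS}
```

Running `check_swarm_ports` against a manager's address from each prospective node gives a quick first indication of firewall problems. It says nothing about latency or connection stability, which the answers above call out as the bigger risk between managers.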

How to collect messages (total number and size) between microservices?

I have a microservices-based software architecture.
There is a PHP application that orchestrates the communication among the microservices and the application's whole logic.
I need to simulate the communication between microservices as a graph.
There will be edges with weights, which will represent the affinities between microservices.
I am searching for a tool to collect all messages and their sizes.
I have read that there are distributed tracing systems like Zipkin, which I have already deployed, that could accomplish this task.
But I cannot find how to collect the messages I want.
This is the PHP library I used for the instrumentation of my app:
[https://github.com/openzipkin/zipkin-php]
Any ideas about other tools or how to use Zipkin differently to achieve my goal?
Let me add my three bits to this thread. Speaking of Envoy: yes, when attached to your application it adds a lot of useful features from the observability bucket, e.g. network-level statistics and tracing.
Here is the question: have you considered running your legacy apps inside a service mesh, like Istio?
Istio simplifies the deployment and configuration of Envoy for you. It injects a sidecar container (istio-proxy, in fact an Envoy instance) into your application Pod and gives you extra features such as a set of service metrics out of the box*.
Example: stats produced by Envoy in Prometheus format, like istio_request_bytes, are visualized in the Kiali Metrics dashboard for inbound traffic as request_size (check screenshot).
*As mentioned by @David Kruk, you still need to have a Prometheus server deployed in your cluster to be able to pull these metrics into the Kiali dashboards.
You can learn more about Istio here. There is also a dedicated section on how to visualize metrics collected by Istio (e.g. request size).
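Whichever tool ends up collecting the data (Zipkin spans, Envoy statistics, or something else), the weighted graph from the question comes down to aggregating per-message records into edges. A minimal Python sketch, assuming each observation has already been reduced to a (source, target, size) record; the `Message` and `EdgeStats` types and their field names are hypothetical and not part of Zipkin or Istio:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Message:
    """One observed call between two services (hypothetical record shape)."""
    source: str
    target: str
    size_bytes: int


@dataclass
class EdgeStats:
    """Weights of one directed edge: number of messages and their total size."""
    count: int = 0
    total_bytes: int = 0


def build_graph(messages: Iterable[Message]) -> Dict[Tuple[str, str], EdgeStats]:
    """Aggregate messages into directed edges keyed by (source, target)."""
    edges: Dict[Tuple[str, str], EdgeStats] = defaultdict(EdgeStats)
    for msg in messages:
        edge = edges[(msg.source, msg.target)]
        edge.count += 1
        edge.total_bytes += msg.size_bytes
    return dict(edges)


if __name__ == "__main__":
    sample = [
        Message("orchestrator", "billing", 512),
        Message("orchestrator", "billing", 256),
        Message("orchestrator", "catalog", 1024),
    ]
    for (src, dst), stats in build_graph(sample).items():
        print(f"{src} -> {dst}: {stats.count} messages, {stats.total_bytes} bytes")
```

Either `count` or `total_bytes` (or a combination of the two) can then serve as the affinity weight of an edge.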

Cloud Run Inter Zone Egress

I have a question on inter-zone egress charges on Google Cloud Run (managed). As I understand it, there is no control over which zones Cloud Run chooses. So potentially, when deploying several microservices that talk to each other, there could be significant charges.
In Kubernetes this can be alleviated via service topology (preferring the same zone or even the same host if available). Is there any way to achieve this with Cloud Run?
https://kubernetes.io/docs/concepts/services-networking/service-topology/
According to the Cloud Run pricing and internet egress pricing, the cost stays the same regardless of whether the apps are within the same zone or not.
Now, if you plan to have heavy traffic between your apps, you should consider a different setup. Either GKE or Cloud Run for Anthos will allow you to set up communication between your apps through internal IP addresses, which is free of charge assuming they are in the same zone. Refer to this table.

How to configure Prometheus in a multi-location scenario?

I love using Prometheus for monitoring and alerting. Until now, all my targets (nodes and containers) lived on the same network as the monitoring server.
But now I'm facing a scenario where we will deploy our application stack (as a bunch of Docker containers) to several client machines in their networks. Nearly all of the clients' networks are behind a firewall or NAT, so scraping becomes quite difficult.
As we're still accountable for our stack, I'd like to have a central monitoring server, alerting and dashboards.
I was wondering what the best architecture would be if I want to implement it with Prometheus, but I couldn't find any convincing approaches. My ideas so far:
Use a Pushgateway on our side and push all data out of the client networks. As the docs state, it's not intended to be used that way: https://prometheus.io/docs/practices/pushing/
Use a federation setup (https://prometheus.io/docs/prometheus/latest/federation/): Place a Prometheus server in every client network behind a reverse proxy (to enable SSL and authentication) and aggregate relevant metrics there. Open/forward just a single port for federation scraping.
Other, more experimental setups, such as SSH tunneling (e.g. here https://miek.nl/2016/february/24/monitoring-with-ssh-and-prometheus/) or a VPN!?
Thank you in advance for your help!
Nobody has posted an answer, so I will try to give my opinion on the second choice, because that's what I think I would do in your situation.
The second setup seems the most flexible: you have access to the data and only need to open one port for the federating server, so it should still be secure.
Another bonus of this type of setup is that even if the firewall stops working for one reason or another, you will still have a Prometheus scraping locally. You will get an alert because you won't be able to reach the server(s), but when the connection comes back you will still have all the data. You won't have a hole in the Grafana dashboards for lack of data, except during the incident.
The issue with this setup is that you need to maintain as many servers as there are networks. A solution for this would be a Packer image or maybe an Ansible playbook to deploy them.
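For the federation option, the central server scrapes each client-side Prometheus on its `/federate` endpoint and selects series with one or more `match[]` parameters. That endpoint returns the current value of the selected series, so it is worth checking by hand what it exposes through the reverse proxy. A small Python sketch that builds such a URL; the host name and the selectors are placeholders, not values from the question:

```python
from typing import List
from urllib.parse import urlencode


def federate_url(base_url: str, matchers: List[str]) -> str:
    """Build the /federate URL for a Prometheus server.

    Each entry in matchers is an instant vector selector such as
    '{job="node"}' and is sent as its own match[] query parameter.
    """
    if not matchers:
        raise ValueError("at least one match[] selector is required")
    query = urlencode([("match[]", selector) for selector in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical client-side Prometheus behind a reverse proxy.
    print(federate_url(
        "https://prometheus.client-a.example",
        ['{job="node"}', '{__name__=~"job:.*"}'],
    ))
```

The resulting URL can be fetched with any HTTP client that supports the proxy's authentication, which helps when debugging why a series does not show up on the central server.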

microservices & service discovery with random ports

My question is related to microservices & service discovery of a service which is spread between several hosts.
The setup is as follows:
2 docker hosts (host A & host B)
a Consul server (service discovery)
Let’s say that I have 2 services:
service A
service B
Service B is deployed 10 times (with random ports): 5 times on host A and 5 times on host B.
When service A communicates with service B, for example, it sends a request to serviceB.example.com (hard coded).
In order to get an IP and a port, service A should query the Consul server for an SRV record.
It will get 10 ip:port pairs, for which the client should apply some load-balancing logic.
Is there a simpler way to handle this without me developing a client resolver (+LB) library for that matter?
Is there anything like that already implemented somewhere?
Am I doing it all wrong ?
There are a few options:
Load balance on the client as you suggest, for which you'll need to either find a ready-built service discovery library that works with SRV records and handles load balancing and circuit breaking, or build your own. Another answer suggested Netflix's Ribbon, which I have not used and which will only be interesting if you are on the JVM. Note that if you are building your own, you might find it simpler to use Consul's HTTP API for discovering services rather than DNS SRV records. That way you can "watch" for changes too, rather than caching the list and letting it get stale.
If you don't want to reinvent that particular wheel, another popular and simple option is to use an HAProxy instance as the load balancer. You can integrate it with Consul via consul-template, which will automatically watch for new/failed instances of your services and update the LB config. HAProxy then provides robust load balancing and health checking with a lot of options (HTTP/TCP, different balancing algorithms, etc.). One possible setup is to have a local HAProxy instance on each Docker host and a fixed port assigned statically to each logical service (you can store it in Consul KV), so you connect to localhost:1234 for service A, for example, and localhost:2345 for service B. A local instance means you don't pay for an extra round trip to a load balancer instance and then to the actual service instance, but this might not be an issue for you.
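To give a feel for how little the first option needs in its simplest form, here is a Python sketch of client-side round-robin over the instances Consul reports. It assumes the JSON body was fetched from Consul's HTTP health endpoint (`/v1/health/service/<name>`, optionally with `?passing`), where each entry carries a `Service` object with `Address` and `Port` plus a `Node` object with `Address`. Fetching, watching for changes and circuit breaking are deliberately left out, and the function names are made up for illustration.

```python
import itertools
import json
from typing import Iterator, List, Tuple

Endpoint = Tuple[str, int]


def endpoints_from_health(payload: str) -> List[Endpoint]:
    """Extract (address, port) pairs from a Consul health-endpoint JSON body."""
    endpoints = []
    for entry in json.loads(payload):
        service = entry["Service"]
        # Service.Address may be empty; the node address applies in that case.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


def round_robin(endpoints: List[Endpoint]) -> Iterator[Endpoint]:
    """Cycle endlessly over the endpoints in the order given."""
    if not endpoints:
        raise ValueError("no endpoints available")
    return itertools.cycle(endpoints)


if __name__ == "__main__":
    # Hand-written sample in the shape described above, not real Consul output.
    sample = json.dumps([
        {"Node": {"Address": "10.0.0.1"}, "Service": {"Address": "", "Port": 32768}},
        {"Node": {"Address": "10.0.0.2"}, "Service": {"Address": "", "Port": 32771}},
    ])
    picker = round_robin(endpoints_from_health(sample))
    for _ in range(3):
        print(next(picker))
```

A real client would refresh the endpoint list when Consul reports a change and skip instances that fail, which is exactly the work the HAProxy plus consul-template setup takes off your hands.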
I suggest you check out Kontena. It will solve this kind of problem out of the box. Every service will have an internal DNS name that you can use in communication between services. Kontena also has a built-in load balancer that is very easy to use, which makes it very easy to create and scale microservices.
There are also lots of built-in features that help with developing containerized applications, like a private image registry, VPN access to running services, secrets management, stateful services, etc.
Kontena is an open source project and the code is available on GitHub.
If you are looking for a minimal setup, you can wrap the values you receive from Consul with Ribbon, Netflix's client-side load balancer.
You will find it as a module for Spring Cloud.
I didn't find an up-to-date standalone example, only this link to chrisgray's dropwizard-consul implementation that is using it in a Dropwizard context. But it might serve as a starting point for you.

Resources