Fluentbit forward retry limit and HA - fluentd

I want to have a setup of multiple fluentbit instances that forward to a single fluentd (or two).
The HA architecture of fluentd is clear, but is it applicable to a fluentbit => fluentd architecture?
Fluentd has file-based buffering with retries that can be configured to last days or weeks, so that no records are lost if the target is unresponsive for a long time. Can we have file-based buffering and retries over such a long period with fluentbit?

As of now, Fluent Bit cannot be set up in HA mode exactly like Fluentd. As mentioned here:
"The only HA mode supported is in-memory buffering, meaning: if the logs cannot be shipped to the destination, there is retry logic in place. If you refer to HA as having primary and secondary destinations with balancing support, that feature has not been implemented yet."
From another bug description, it seems that if you configure two IP addresses behind the hostname used in the fluentbit configuration, you might be able to achieve HA. Depending on your context, YMMV.
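To make the "two IP addresses behind a hostname" idea concrete, here is a minimal Python sketch of what a client can do with such a name: resolve every address and use the first one that accepts a connection. This is only an illustration of the DNS-based approach, not Fluent Bit's actual behavior (which the bug description leaves uncertain). The function names are hypothetical, and the port 24224 in the usage note is an assumption.

```python
import socket
from typing import Callable, Iterable, List, Optional, Tuple

Address = Tuple[str, int]


def resolve_all(hostname: str, port: int) -> List[Address]:
    """Return every distinct IPv4 address published for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    addresses: List[Address] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        addr = (sockaddr[0], sockaddr[1])
        if addr not in addresses:
            addresses.append(addr)
    return addresses


def tcp_probe(addr: Address, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to addr can be opened."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    addresses: Iterable[Address],
    probe: Callable[[Address], bool] = tcp_probe,
) -> Optional[Address]:
    """Return the first address the probe accepts, or None if all fail."""
    for addr in addresses:
        if probe(addr):
            return addr
    return None
```

A caller would combine the two, for example first_reachable(resolve_all("fluentd.example.internal", 24224)), where the hostname is a placeholder. Note that this only picks a live destination; it does nothing about buffering records while every destination is down.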

Related

Why not use host networking in Docker, since Docker and Kubernetes networking is so complex?

Using Docker can simplify CI/CD, but it also introduces complexity: not everybody is able to manage Docker networking, even when choosing open source solutions like Flannel or Calico.
So why not use the host network in Docker, and what is lost by doing so?
I know port conflicts are one issue; are there any others?
There are two parts to an answer to your question:
1. Pods must have individual, cluster-routable IP addresses, and one should be very cautious about recycling them.
2. You can, if you wish, not use any software-defined network (SDN).
So with the first part, it is usually a huge hassle to provision a big enough CIDR to house the address range required for supporting every Pod that is running across every Namespace, and have the space be big enough to avoid recycling addresses for a very long time. Thus, having an SDN allows using "fake" addresses that one need not bother the "real" network with knowing about. No routers need to be updated, no firewalls, no DHCP, whatever.
That said, as with the second part, you don't have to use an SDN: that's exactly what the container network interface (CNI) is designed to paper over. You can use the CNI provider that makes you the happiest, including using static IP addresses or the outer network's DHCP server.
But your comment about port collisions is pretty high up the list of reasons one wouldn't just want to hostNetwork: true and be done with it; I'm actually not certain if the default kubernetes scheduler is aware of hostNetwork: true and the declared ports: on the containers: in order to avoid co-scheduling two containers that would conflict. I guess try it and see, or, better yet, don't try it -- use CNI so the next poor person who tries to interact with your cluster doesn't find a snowflake setup.
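The port collision point can be demonstrated without any cluster at all. The following is a small Python sketch, assuming only that two processes sharing one network namespace (which is what host networking gives you) both try to listen on the same TCP port; the helper names are made up for the example.

```python
import socket
from typing import Optional


def bind_listener(port: int) -> Optional[socket.socket]:
    """Listen on 127.0.0.1:port; return None if the port is already taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically EADDRINUSE: someone in this namespace owns the port.
        sock.close()
        return None


def demo_collision() -> bool:
    """Return True if a second listener on the same port is refused."""
    first = bind_listener(0)  # port 0 lets the OS pick a free port
    assert first is not None
    port = first.getsockname()[1]
    second = bind_listener(port)  # same namespace, same port
    collided = second is None
    if second is not None:
        second.close()
    first.close()
    return collided
```

With a per-Pod network namespace each container gets its own port space, so the second bind would succeed; with the host network, both binds compete for the same one.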

On-prem docker swarm deployment with HA

I’m doing on-prem deployments using docker swarm and I need application and DB high availability.
As far as application HA is concerned, it works great within docker (service discovery and load balancing), but I’m not sure how to use it on my network. I mean, how can I assign a virtual IP to all of my docker managers so that if any of them goes down, that virtual IP automatically points to another docker manager in the cluster? I don’t want to have a single point of failure in my architecture, which is why I’m not inclined to put any (single) reverse proxy solution in front of my swarm cluster (because, to my understanding, if nginx/HAProxy goes down, the whole system goes into the abyss. I would love to know that I’m wrong).
Secondly, I use WebSockets in my application for push notifications, and they don’t behave normally behind load balancing because the socket handshakes get distorted.
I want a solution to these problems without writing anything in code (HA-specific and non-generic like hard coding IPs etc). Any suggestions? I hope I explained my problem correctly.
Docker Flow Proxy or Traefik can be placed on the set of swarm nodes that you want to receive incoming connections, and they use DNS routing to get packets to the correct containers. Both have a sticky sessions option (I know Docker Flow does, not sure about Traefik).
Then you can either:
1. If your incoming connections are just client HTTP/S requests, use DNS Round Robin with multiple A records, which works great, or
2. Buy an expensive, fault-tolerant hardware reverse proxy like F5, or
3. Use some network-layer IP failover at the OS and physical network level (not really related to Docker), but I'm not sure how well that would work with Swarm.
Number 2 is the typical solution in private datacenters that need full HA at all layers.
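For the WebSocket concern, the point of sticky sessions is that every request from one client lands on the same backend. Here is a minimal Python sketch of one way to get that property, hashing a client identifier onto a backend list. This is an illustrative technique only, not how Docker Flow Proxy or Traefik implement stickiness, and the backend names are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_backend(client_id: str, backends: Sequence[str]) -> str:
    """Map a client identifier (e.g. its IP) to one backend, deterministically."""
    if not backends:
        raise ValueError("no backends available")
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap it
    # onto the backend list.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

The same client_id always yields the same backend as long as the list is unchanged. With plain modulo hashing, adding or removing a backend remaps many clients, so treat this as a sketch of the idea rather than something to deploy.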

Kubernetes With DPDK

I'm trying to figure out if Kubernetes will work for a certain use case. I understand the networking/clustering concept, and even the load balancing and how that can be used with things like nginx. However, assuming this is not deployed on a public cloud and things like ELB won't be available, could it still be used for a high-speed networking application using DPDK? For example, if we assume the cluster networking provided by k8s is only used for the control/management path, and the containers themselves handle the NIC directly with DPDK, is this something it's commonly used for?
Secondly, I understand the replication controller and petsets feature I think, but I'm not really clear on whether the intent of those features is for high availability or not. It seems that the "pod fails and the RC replaces it on a different node" isn't necessarily for HA, and there aren't really guarantees on how fast it builds a new pod. Am I incorrect?
For the second question, if the replication controller has a size larger than 1, it is highly available.
For example, if you have a service "web-svc" in front of the replication controller "web-app" with size 3, then each request will be load balanced to one of the 3 pods:
web-svc ----> {web-app-pod1, web-app-pod2, web-app-pod3}
If some of the 3 pods fail, kubernetes will replace them with new ones.
A pet set is similar to a replication controller, but is used for stateful applications like databases.
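The "replace failed pods" behavior described above can be pictured as a reconciliation loop. The Python sketch below is a toy model, not Kubernetes code: the class name, the numbering of replacement pods, and the in-memory dictionary are all assumptions made for illustration.

```python
from typing import Dict, List


class ReplicationControllerSketch:
    """Toy model: keep `desired` healthy pods, replacing any that fail."""

    def __init__(self, desired: int, prefix: str = "web-app-pod") -> None:
        self.desired = desired
        self.prefix = prefix
        self.pods: Dict[str, bool] = {}  # pod name -> healthy?
        self._next_id = 1

    def reconcile(self) -> List[str]:
        """Remove failed pods, then create pods until the desired size is met."""
        for name in [n for n, healthy in self.pods.items() if not healthy]:
            del self.pods[name]
        created: List[str] = []
        while len(self.pods) < self.desired:
            name = f"{self.prefix}{self._next_id}"
            self._next_id += 1
            self.pods[name] = True
            created.append(name)
        return created
```

In this model, when one of three pods fails, the other two stay in the pod set throughout, which is where the availability comes from with a size larger than 1. Nothing here says how quickly a real replacement becomes ready.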

Bosun HA and scalability

I have a small bosun setup, and it's collecting metrics from numerous services, and we are planning to scale these services on the cloud.
This will mean more data coming into bosun and hence, the load/efficiency/scale of bosun is affected.
I am afraid of losing data, due to network overhead, and in case of failures.
I am looking for any performance benchmark reports for bosun, or any inputs on benchmarking/testing bosun for scale and HA.
Also, any inputs on good practices to be followed to scale bosun will be helpful.
My current thinking is to run numerous bosun binaries as a cluster, backed by a distributed opentsdb setup.
Also, I am wondering whether it is worthwhile to run some bosun executors as plain 'collectors' of scollector data (with the bosun -n command), and some just to calculate the alerts.
The problem with this approach is that the same alerts might be triggered from multiple bosun instances (running without the -n option). Is there a better way to de-duplicate the alerts?
The current best practices are:
- Use https://godoc.org/bosun.org/cmd/tsdbrelay to forward metrics to opentsdb. This gets the bosun binary out of the "critical path". It should also forward the metrics to bosun for indexing, and can duplicate the metric stream to multiple data centers for DR/Backups.
- Make sure your hadoop/opentsdb cluster has at least 5 nodes. You can't do live maintenance on a 3 node cluster, and hadoop usually runs on a dozen or more nodes. We use Cloudera Manager to manage the hadoop cluster, and others have recommended Apache Ambari.
- Use a load balancer like HAProxy to split the /api/put write traffic across multiple instances of tsdbrelay in an active/passive mode. We run one instance on each node (with tsdbrelay forwarding to the local opentsdb instance) and direct all write traffic at a primary write node (with multiple secondary/backup nodes).
- Split the /api/query traffic across the remaining nodes pointed directly at opentsdb (no need to go thru the relay) in an active/active mode (aka round robin or hash based routing). This improves query performance by balancing them across the non-write nodes.
- We only run a single bosun instance in each datacenter, with the DR site using the read only flag (any failover would be manual). It really isn't designed for HA yet, but in the future may allow two nodes to share a redis instance and allow active/active or active/passive HA.
By using tsdbrelay to duplicate the metric streams you don't have to deal with opentsdb/hbase replication and instead can setup multiple isolated monitoring systems in each datacenter and duplicate the metrics to whichever sites are appropriate. We have a primary and a DR site, and choose to duplicate all metrics to both data centers. I actually use the DR site daily for Grafana queries since it is closer to where I live.
You can find more details about production setups at http://bosun.org/resources including copies of all of the haproxy/tsdbrelay/etc configuration files we use at Stack Overflow.
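The split between active/passive writes and active/active queries described above can be summarized in a few lines of Python. This is a sketch of the routing decision only, not HAProxy configuration (the real configuration files are at the link above); the class, the node names, and the in-memory health flags are assumptions made for the example.

```python
from itertools import cycle
from typing import Dict, List


class RouterSketch:
    """Route /api/put active/passive and /api/query round robin."""

    def __init__(self, write_nodes: List[str], query_nodes: List[str]) -> None:
        self.write_nodes = write_nodes  # ordered: primary first, then backups
        self.query_nodes = query_nodes
        self._query_cycle = cycle(query_nodes)
        self.healthy: Dict[str, bool] = {
            node: True for node in write_nodes + query_nodes
        }

    def route(self, path: str) -> str:
        if path.startswith("/api/put"):
            # Active/passive: always the first healthy node in priority order.
            for node in self.write_nodes:
                if self.healthy[node]:
                    return node
            raise RuntimeError("no healthy write node")
        if path.startswith("/api/query"):
            # Active/active: rotate through the nodes, skipping unhealthy ones.
            for _ in range(len(self.query_nodes)):
                node = next(self._query_cycle)
                if self.healthy[node]:
                    return node
            raise RuntimeError("no healthy query node")
        raise ValueError(f"unrouted path: {path}")
```

All writes go to the primary until it is marked unhealthy, at which point the first backup takes over, while queries rotate across the non-write nodes.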

Can a bastion host be launched by auto-scaling-group for failure recovery?

Can I launch a bastion host through an auto-scaling-group by setting "MinSize": 1 and "DesiredCapacity": 1?
I understand that normally an ASG is used along with ELB or SQS and CloudWatch for load balancing or scaling purposes. I feel my purpose here is different -- I want to keep my bastion machine up and running, and once it's down, I want to bring it back as soon as possible. (I don't need my bastion host to be "HA", but I'd like it to be able to automatically recover, say within 3 mins.)
Is this a valid use case for an auto scaling group?
Yes, using an Auto Scaling Group in this fashion will ensure that a failed host is replaced automatically if it fails EC2 health checks.
However, this is not the best and up to date way to solve your problem. EC2 supports Auto Recovery as of last year. Recovery can be configured to perform a variety of actions on an instance that fails EC2 health checks. The advantage it has over Auto Scaling is that things like Elastic IPs can be migrated over to the new instance. The docs contain all the information you'll need to set this up.
Yes, that's a valid use case.
Auto scaling groups force you to set up automatically creatable instances: you define a launch configuration that specifies things like the instance type and the image you want to launch, and the number of instances in the group.
When you set the desired instances to '1', the autoscaling group (AG) will start enforcing that one instance will be running.
Problem: each instance gets assigned a different IP when it boots, so you won't know where to reach it.
There are two ways around this:
- use an ELB so you can always reach it at the ELB's address. When only running one instance, this is kind of overkill
- make the instance associate an Elastic IP with itself when it boots. I don't think that Amazon supports this out of the box yet, but you can find scripts that do this for you on the web.
Note that this setup won't prevent failure. But once an instance fails, it's a matter of terminating it and a new one will be back up in 5 minutes or so.
Refer to the following link from Amazon for the architecture and best practices for a bastion host: http://docs.aws.amazon.com/quickstart/latest/linux-bastion/architecture.html
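As a rough illustration of the single-instance group plus Elastic IP approach discussed above, here is a Python sketch. It only builds the request parameters and takes the EC2 client as an argument, so it runs without AWS credentials. The keyword names follow boto3's create_auto_scaling_group and associate_address calls as I understand them; "MaxSize": 1 is my assumption (the question only mentions MinSize and DesiredCapacity), and the function names and all IDs are hypothetical.

```python
from typing import Any, Dict, List


def bastion_asg_parameters(
    name: str, launch_configuration: str, subnet_ids: List[str]
) -> Dict[str, Any]:
    """Parameters for an Auto Scaling group pinned at exactly one instance."""
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_configuration,
        "MinSize": 1,
        "MaxSize": 1,  # assumption: cap the group at one bastion
        "DesiredCapacity": 1,
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


def claim_elastic_ip(ec2_client: Any, instance_id: str, allocation_id: str) -> Any:
    """Attach a pre-allocated Elastic IP to the freshly booted instance."""
    return ec2_client.associate_address(
        InstanceId=instance_id,
        AllocationId=allocation_id,
        AllowReassociation=True,  # take the address over from the dead host
    )
```

In a real setup the parameters would be passed to an autoscaling client, and claim_elastic_ip would run from a boot script on the instance with an IAM role that permits the association. Both of those details are outside what the answers above describe.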

Resources