Wavefront Alert for AWS RDS Freeable Memory

I have multiple RDS instances that are monitored in Wavefront with the AWS RDS integration dashboard.
Question: What WQL query should I use to create an alert that triggers when the aws.rds.freeablememory value is less than 5% of total RAM for the corresponding RDS instance?
NOTE: The instance types have different RAM sizes.

Per the feedback I received from the Wavefront support team, there is currently no parameter that exposes an instance's total RAM, so this type of alert cannot be created directly.
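Since total RAM is not available as a queryable parameter, one hedged workaround is to compute the 5% threshold outside Wavefront and maintain per-instance static alerts. The sketch below uses boto3 to look up each RDS instance's class, map it to the underlying EC2 instance type's memory, and print a byte threshold you could plug into a per-source alert; the assumption that stripping the "db." prefix yields a valid EC2 instance type holds for common classes but is not guaranteed for all.

```python
# Hedged sketch: compute a 5% freeable-memory threshold per RDS instance.
# Assumes the "db." prefix maps cleanly onto an EC2 instance type
# (db.m5.large -> m5.large), which is true for common classes but not all.
import boto3

rds = boto3.client("rds")
ec2 = boto3.client("ec2")

for db in rds.describe_db_instances()["DBInstances"]:
    instance_class = db["DBInstanceClass"]           # e.g. "db.m5.large"
    ec2_type = instance_class.removeprefix("db.")    # e.g. "m5.large"
    info = ec2.describe_instance_types(InstanceTypes=[ec2_type])
    mem_mib = info["InstanceTypes"][0]["MemoryInfo"]["SizeInMiB"]
    threshold_bytes = int(mem_mib * 1024 * 1024 * 0.05)
    # Use this as the static threshold in a per-source alert, roughly:
    # ts(aws.rds.freeablememory, source="<identifier>") < threshold_bytes
    print(f'{db["DBInstanceIdentifier"]}: alert below {threshold_bytes} bytes')
```

The per-source thresholds then need to be refreshed whenever an instance class changes, which is the main maintenance cost of this approach.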

Related

Cockroachdb storage increasing without any reason

I am using CockroachDB to store users' login credentials. I only have one table, with no records in it, yet I noticed that the cluster storage is slowly increasing even though there is no activity in the database. Why is this happening, and how do I stop it? (Screenshots from yesterday and today showing the storage growth were attached.)
A CockroachDB cluster will regularly store information about its own health and status in system tables, so that it is available to you should you need it (e.g. in your dashboard). Once these tables hit their retention limit, storage will level off. Overall, however, the amount of stored system data should be very small in relation to overall limits. For example, in this case, 3MB is only 0.06% of the free storage limit.

How to make my Amazon Connect instance multi-region or multi-AZ?

I have Amazon Connect instances in a specific region and I want to implement failover: I want to make my Amazon Connect instances multi-region or multi-AZ, so that if the primary region fails, the secondary instances in the other region can pick up the workload without downtime.
You don't need to do anything; Amazon takes care of Connect resiliency as part of the service. See: https://docs.aws.amazon.com/connect/latest/adminguide/reliability-bp.html
Amazon Connect Global Resiliency provides a set of APIs that you use to:
Provision a linked Amazon Connect instance in another AWS Region.
Provision and manage phone numbers that are global and accessible in both Regions.
Distribute telephony traffic between the instances and across Regions in 10% increments.
For example, you can distribute traffic 100% in US East (N. Virginia) / 0% in US West (Oregon), or 50% in each Region.
Access reserved capacity across Regions.
https://docs.aws.amazon.com/connect/latest/adminguide/setup-connect-global-resiliency.html
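As a rough illustration of what those APIs look like from code, the sketch below uses boto3's connect client to replicate an instance into a second Region and then shift telephony traffic between the Regions. The IDs are placeholders and the exact parameter shapes are from memory, so treat this as an outline to check against the current boto3 documentation rather than a drop-in script.

```python
# Rough sketch of the Amazon Connect Global Resiliency calls via boto3.
# IDs are placeholders; the traffic distribution group is assumed to have
# been created already (CreateTrafficDistributionGroup).
import boto3

connect = boto3.client("connect", region_name="us-east-1")

# 1. Provision a linked replica instance in another Region.
connect.replicate_instance(
    InstanceId="INSTANCE_ID",            # placeholder: source instance ID/ARN
    ReplicaRegion="us-west-2",
    ReplicaAlias="my-connect-replica",
)

# 2. Shift telephony traffic between Regions in 10% increments,
#    e.g. 50/50 across US East and US West.
connect.update_traffic_distribution(
    Id="TRAFFIC_DISTRIBUTION_GROUP_ID",  # placeholder
    TelephonyConfig={
        "Distributions": [
            {"Region": "us-east-1", "Percentage": 50},
            {"Region": "us-west-2", "Percentage": 50},
        ]
    },
)
```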

Share storage/volume between worker nodes in Kubernetes?

Is it possible to have a centralized storage/volume that can be shared between two pods/instances of an application that exist in different worker nodes in Kubernetes?
So to explain my case:
I have a Kubernetes cluster with 2 worker nodes. In each one of these I have 1 instance of app X running, which means I have 2 instances of app X running in total at the same time.
Both instances subscribe on the topic topicX, that has 2 partitions, and are part of a consumer group in Apache Kafka called groupX.
As I understand it the message load will be split among the partitions, but also among the consumers in the consumer group. So far so good, right?
So to my problem:
In my overall solution I have a hierarchy keyed by the unique combination of country and ID. Each combination of country and ID has a pickle model (a Python machine learning model), which is stored in a directory accessed by the application. For each combination of country and ID I receive one message per minute.
At the moment I have 2 countries, so to be able to scale properly I wanted to split the load between two instances of app X, each one handling its own country.
The problem is that Kafka can balance the messages between the different instances, and since an instance does not know in advance which country a message belongs to, I have to store the pickle files in both instances.
Is there a way to solve this? I would rather keep the setup as simple as possible so it is easy to scale and add a third, fourth and fifth country later.
Keep in mind that this is an overly simplified way of explaining the problem. The number of instances is much higher in reality etc.
Yes, it's possible. If you look at this table, any PV (PersistentVolume) type that supports the ReadWriteMany access mode will let you share the same data store between your Kafka consumers on different nodes. In summary, these are:
AzureFile
CephFS
Glusterfs
Quobyte
NFS
VsphereVolume (works when pods are collocated)
PortworxVolume
In my opinion, NFS is the easiest to implement. Note that AzureFile, Quobyte, and Portworx are paid solutions.
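For instance, with an NFS-backed StorageClass already in place, a ReadWriteMany claim that both app X pods can mount might look like the sketch below, written with the official kubernetes Python client. The claim name, namespace, size, and the nfs-client StorageClass name are all assumptions for illustration.

```python
# Minimal sketch: create a ReadWriteMany PVC that pods on different worker
# nodes can mount for the shared pickle models. Names/sizes are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-models"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],          # the key requirement here
        storage_class_name="nfs-client",         # assumed NFS provisioner class
        resources=client.V1ResourceRequirements(requests={"storage": "5Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```

Both instances of app X would then mount the same claim as a volume, so each pod can read any country's pickle file regardless of which Kafka partition its messages arrive on.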

Can a bastion host be launched by auto-scaling-group for failure recovery?

Can I launch a bastion host through an Auto Scaling group, setting "MinSize": 1 and "DesiredCapacity": 1?
I understand that normally an ASG is used along with an ELB or SQS and CloudWatch for load-balancing or scaling purposes, and my purpose here is different: I want to keep my bastion machine up and running, and once it's down, I want to bring it back as soon as possible. (I don't need my bastion host to be "HA", but I'd like it to recover automatically, say within 3 minutes.)
Is there such a use case for an Auto Scaling group?
Yes, using an Auto Scaling group in this fashion will ensure that a failed host is replaced automatically if it fails EC2 health checks.
However, this is no longer the best and most up-to-date way to solve your problem. EC2 has supported Auto Recovery since last year. Recovery can be configured to perform a variety of actions on an instance that fails EC2 health checks. The advantage it has over Auto Scaling is that things like Elastic IPs can be migrated over to the recovered instance. The docs contain all the information you'll need to set this up.
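A minimal sketch of that setup, assuming boto3 and a known instance ID, is a CloudWatch alarm on the StatusCheckFailed_System metric with the ec2:recover action attached; the alarm name, region, and evaluation settings below are illustrative rather than prescriptive.

```python
# Sketch: auto-recover an instance when the EC2 system status check fails.
# Instance ID, region, and thresholds are placeholders.
import boto3

region = "us-east-1"
instance_id = "i-0123456789abcdef0"  # placeholder bastion instance ID

cloudwatch = boto3.client("cloudwatch", region_name=region)
cloudwatch.put_metric_alarm(
    AlarmName="bastion-auto-recover",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # The recover action moves the instance to healthy hardware while keeping
    # its instance ID, private IPs, and any attached Elastic IPs.
    AlarmActions=[f"arn:aws:automate:{region}:ec2:recover"],
)
```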
Yes, that's a valid use case.
Auto Scaling groups force you to set up automatically creatable instances: you define a launch configuration that specifies things like the instance type, the image you want to launch, and the number of instances in the group.
When you set the desired capacity to 1, the Auto Scaling group (ASG) will enforce that one instance is running.
Problem: a replacement instance gets assigned a different IP when it boots, so you won't know where to reach it.
There are two ways around this:
- use an ELB so you can always reach it at the ELB's address. When you're only running one instance, this is kind of overkill
- make the instance assign itself an Elastic IP when it boots. I don't think Amazon supports this out of the box yet, but you can find scripts that do this for you on the web (a hedged sketch of such a script is shown below).
Note that this setup won't prevent failure, but once an instance fails, it's a matter of terminating it and a new one will be back up in 5 minutes or so.
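One version of such a script, run from user data at boot, could look like the sketch below: it reads the instance ID from the instance metadata service and attaches a pre-allocated Elastic IP. The allocation ID and region are placeholders, and the metadata call uses the simple IMDSv1 form; an IMDSv2-only instance would need a session token first.

```python
# Hedged sketch: attach a pre-allocated Elastic IP to this instance at boot.
# ALLOCATION_ID and region are placeholders; requires an instance role that
# allows ec2:AssociateAddress. Uses IMDSv1 for brevity (IMDSv2 needs a token).
import boto3
import urllib.request

ALLOCATION_ID = "eipalloc-0123456789abcdef0"  # placeholder

instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=5
).read().decode()

ec2 = boto3.client("ec2", region_name="us-east-1")  # match your instance's Region
ec2.associate_address(
    InstanceId=instance_id,
    AllocationId=ALLOCATION_ID,
    AllowReassociation=True,  # reclaim the EIP even if still associated elsewhere
)
```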
Refer to the following link from Amazon on the architecture and best practices for a bastion host: http://docs.aws.amazon.com/quickstart/latest/linux-bastion/architecture.html

Naming statsd metrics for short lived streams

I am trying to model statistics to submit to statsd/graphite. However, what I am monitoring is "session" centric. For example, I have a game that is played in real time. There are multiple instances of a game active on the servers. Each game has multiple participants (a variable number), and each instance of a game has a unique ID, as does each player.
I want to track (and graph) each player's stats, but then roll the metric up for the whole instance and then for all instances of a game. For example, there may be two instances of a game active at a given time. Let's say each has two players in the game:
GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_1 10
GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_2 20
GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_3 50
GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_4 70
where game_instances and player_ids are 128-bit numbers.
And I want to be able to see that the total of all voice errors for game_instance_a is 30, while all voice errors across the system total 150.
Given this, I have three questions:
What guidance would you have on naming the metrics?
Is it kosher to have metrics that have "dynamic" identifiers as part of the name?
What are the scale limits on this? If I had 100K game instances with, say, as many as 1000 players in a game, is this going to kill statsd/graphite?
Thanks!
What guidance would you give on naming the metrics?
Graphite recommends that "Volatile path components should be kept as deep into the hierarchy as possible". This essentially means that if you can push the parts of the metrics that are frequently unique to the end of the "bucket" without impacting your grouping queries you should try to do so.
Here is a great post on using Graphite that includes naming recommendations. And here is another one with additional info from Jason Dixon (an excellent source for Graphite stuff in general).
Is it kosher to have metrics that have "dynamic" identifiers as part of the name?
I usually try to avoid identifiers in the metric names unless they are very low in number (<100). Because Graphite will store a .wsp file for every metric name, you'll have a difficult time resizing or adjusting the storage settings should you decide to change your configuration. Additionally, the Graphite UI will create a "folder" for every metric name, so you can easily make the UI unusable.
In your case, I'd probably graph the total number of game instances, the total number of players, and the number of errors (by type), etc. Additionally, I might try to track players per instance (generally) and maybe errors per instance (again without the actual instance ID in the name, e.g. GameTitle.RealTime.PerInstance.VoiceErrors) if I had that capability (i.e. state stored per instance in my application).
Logstash, Elasticsearch, Kibana
I'd suggest logging this error information with instance and player IDs and using Logstash to send your logs to Elasticsearch and Kibana. Then I'd watch Graphite for real-time error and health anomaly detection and use Kibana (and Elasticsearch underneath) to dig deeper.
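As a hedged illustration of both recommendations, the sketch below uses the statsd Python package to count errors in fixed, low-cardinality metric buckets while pushing the high-cardinality instance and player IDs into a log line for Logstash/Elasticsearch; the host, port, prefix, and bucket names are assumptions for illustration.

```python
# Sketch: keep metric names low-cardinality, log the 128-bit IDs separately.
# Host/port/prefix and bucket names are illustrative assumptions.
import logging
from statsd import StatsClient  # pip install statsd

log = logging.getLogger("voice_errors")
statsd = StatsClient(host="localhost", port=8125, prefix="GameTitle.RealTime")

def record_voice_error(game_instance_id: str, player_id: str) -> None:
    # Fixed bucket names: cheap for Whisper storage and safe for the UI.
    statsd.incr("VoiceErrors")              # system-wide error count
    statsd.incr("PerInstance.VoiceErrors")  # mirrors the PerInstance bucket above
    # The identifiers go to logs, where Logstash/Elasticsearch can index them.
    log.warning("voice_error game_instance=%s player=%s",
                game_instance_id, player_id)
```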
What are the scale limits on this? If I had 100K game instances with, say, as many as 1000 players in a game, is this going to kill statsd/graphite?
Statsd should have no problem with this, as it just acts as a (mostly) dumb aggregator. While it does maintain some state internally, I don't anticipate a problem.
I don't think you'll have problems with the internal Graphite Whisper Storage itself, as it is just using files and folders. But, as I mentioned above, the Graphite Web UI will be unusable and I think you'll also run the risk of other manageability issues.
Summary
Keep the volatile (dynamic) metric buckets at the end of the name and avoid going above a couple hundred of these.
