aggregating codahale metrics counts across ECS task instances in CloudWatch - monitoring

I've got an ECS service reporting metrics to CloudWatch, collected with Codahale Metrics. Some of the metrics are counts, e.g. the count of requests made to an external service. Each service instance maintains its own count and reports it to CloudWatch. As I understand it, this means the values in CloudWatch are the individual per-instance counts, with no way to see, e.g., the total. If each of three instances made 300 requests, the value reported would be 300, with no way to sum it up to 900.
What is the best way to fix it? Is adding an additional dimension with eg ecs task id to the reported CloudWatch metric the way?
I'm graphing the results in Grafana, but likely it's not the important part.

Metrics are already aggregated in CloudWatch, assuming they have the same namespace and name. If these service request metrics measure the same thing, they should be the same metric. You can then add dimensions to them, such as TaskId, RequestedService, or whatever you want to aggregate by.
Typically you have the opposite challenge in CloudWatch Metrics to the one you are describing. Metrics are already aggregated together, and you want to drill down to particular values to debug some issue. For example, if you had a problem with a particular container task you would set the dimension TaskId=todo1, or if you suspected a service was down you'd set RequestedService=todo2.
I suspect you are creating a separate metric for each service you make requests to. Instead, you only want one metric, with dimensions added to it as described earlier.
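As a rough illustration of the arithmetic only, the sketch below models data points that share a namespace and metric name and sums them, either in total or per dimension value. It is plain Python and makes no CloudWatch API calls; the namespace "MyService", the metric name "ExternalRequests", and the task IDs are made up for the example. It also assumes each data point is the number of requests made during one reporting period. How CloudWatch itself combines metrics that carry different dimensions should be checked against its documentation.

```python
from collections import defaultdict

def sum_by_metric(points, group_dims=()):
    """Sum data point values per (namespace, name), optionally split by dimensions."""
    totals = defaultdict(float)
    for namespace, name, dims, value in points:
        # The grouping key always has namespace and name; dimension values are
        # appended only for the dimensions we want to drill down by.
        key = (namespace, name) + tuple(dims.get(d) for d in group_dims)
        totals[key] += value
    return dict(totals)

# Hypothetical data points: one per task instance for the same reporting period.
points = [
    ("MyService", "ExternalRequests", {"TaskId": "task-a"}, 300),
    ("MyService", "ExternalRequests", {"TaskId": "task-b"}, 300),
    ("MyService", "ExternalRequests", {"TaskId": "task-c"}, 300),
]

total = sum_by_metric(points)                           # aggregated view
per_task = sum_by_metric(points, group_dims=("TaskId",))  # drill-down view
```

Here `total` holds a single entry of 900 for the metric, while `per_task` keeps the three 300s apart, which mirrors the "aggregate by default, drill down by dimension" idea above.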
Also, for this particular use case you might want to consider OpenTelemetry or X-Ray, which will build a service graph for you and handle the specific case of tracing requests through different services. That does take a bit of effort to set up, though.

Related

What to report in a time series database when the measurement failed?

I use a time series database to report some network metrics, such as the download time or DNS lookup time for some endpoints. However, sometimes the measurement fails, for example if the endpoint is down or if there is a network issue. In these cases, what should be done according to best practices? Should I report an impossible value, like -1, or just not write anything at all to the database?
The problem I see when not writing anything, is that I cannot know if my test is not running anymore, or if it is a problem with the endpoint/network.
The best practice is to capture the failures in their own time series for separate analysis.
Failures or bad readings will skew the series, so they should be filtered out or replaced with a projected value for 'normal' events. The beauty of a time series is that one measure (time) is globally common, so it is easy to project between two known points when one is missing.
The failure information is also important, as it is an early indicator of issues or outages on your target. You can record the network error and other diagnostic information to find trends and to determine whether it is the client or your server that is having the issue. Further, several instances can be deployed to monitor the same target so that they cancel out each other's noise.
You can also monitor a known endpoint, like Google's 204 page, to check network connectivity. If all the monitors report an error connecting to your site but not to the known endpoint, your server is indeed down.
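A minimal sketch of the two ideas above, assuming samples arrive as (timestamp, value, error) tuples sorted by timestamp: failures go into their own series, and a missing value is projected linearly between the two nearest known points. The tuple layout, function names, and numbers are all hypothetical, and this is plain Python rather than any particular database client.

```python
def split_series(samples):
    """Split samples into a value series and a separate failure series."""
    values, failures = [], []
    for ts, value, error in samples:
        if error is None:
            values.append((ts, value))
        else:
            failures.append((ts, error))
    return values, failures

def interpolate(values, ts):
    """Linearly project a value at ts from the known points on either side."""
    before = [p for p in values if p[0] <= ts]
    after = [p for p in values if p[0] >= ts]
    if not before or not after:
        return None  # nothing to project from on one side
    (t0, v0), (t1, v1) = before[-1], after[0]
    if t1 == t0:
        return v0  # ts coincides with a known point
    return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)

# Hypothetical download-time samples in milliseconds; the middle probe failed.
samples = [
    (0, 120.0, None),
    (60, None, "timeout"),
    (120, 140.0, None),
]

values, failures = split_series(samples)
projected = interpolate(values, 60)
```

With these inputs the failure series contains the single timeout at t=60, and the projected value at that timestamp falls halfway between the two neighbouring readings. A probe that stops running entirely shows up as a gap in both series, which distinguishes it from an endpoint failure.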

Is it possible to track Metrics within a sink such as FileIO?

While playing around with org.apache.beam.sdk.metrics, I was wondering the following: can you track metrics from an "obscure" stage (and by obscure I mean that the code of that stage is not yours) to catch failures, delays and the like? For example, within the CassandraIO connector when inserts fail?
If so, how can I access that information?
So far I've been tracking metrics within my very own stages doing Metrics.counter("new_counter", "new_metric").inc(n) and similar.
The metrics need to be added manually to each connector, much like you already do with your own metrics (i.e. Metrics.counter(.....).inc(..)).
If the connector has metrics of its own, it will publish them. Unfortunately, it seems like that's not the case for CassandraIO. :(
If you are interested, I would invite you to submit a pull request to the Apache Beam repository to add metrics that you find interesting for CassandraIO or any other connector that you like.

Scaling chat log workers horizontally

I've thought about this a lot but can't come up with a solution I'm happy with.
Basically this is the problem: log 100k+ chats (some slower, some faster) into Cassandra. So save userId, channelId, timestamp and the message.
Cassandra already supports horizontal scaling out of the box, I have no issue here.
Now my software that reads these chats does it over TCP (IRC). Something like 300 messages / sec are usual for the top 1k channels and 1 single IRC connection can't handle that from my experiments.
What I now want to build is multiple instances (with Docker/Kubernetes) of the logger and share the load between them. So ideally, if I have maybe 4 workers and 1k chats (example), they would each join at least 250 channels. I say at least because I would want optional redundancy, so I can have 2 loggers in the same chat to make sure no messages get lost.
There is no issue with duplicates, because all messages have a unique ID.
Now how would I best share the currently joined channels between the workers, dynamically? I want to avoid having a master or controlling point. It should also be easy to add more workers, which then reduce the load on the other workers.
Are there any good articles about this kind of behaviour? Maybe good concepts or protocols already defined? Like I said, I want to avoid another central control point, so no RabbitMQ, Redis or whatever.
Edit: I've looked into something like the Raft consensus algorithm, but I don't think it makes sense, since I don't want my clients to agree on a shared state but instead to divide the state between them "equally".
I think in this case looking for a description of an existing algorithm might not be very useful: the problem is not complicated or generic enough to be worth a publication.
As described, the problem could be solved by using Cassandra itself as a mediator and to share chat channel assignment information among the workers.
So (the trivial part) channels would have IDs and assigned worker ID(s), plus, in the optional case of redundancy, the required number of workers (2, or however many workers you want processing this chat). Before assigning itself to a channel, a worker would check whether there are already enough assignees. If so, it would continue to the next channel. If not, it would assign itself to the channel. This is one of the options (alternatively, you can have workers holding the channel IDs, but since redundancy is rare, this way seems simpler). Workers would have a limit on the number of channels they can process and would not try to exceed it by taking on more channels.
Now we only have to deal with the case of too many workers being assigned to the same channel, exceeding the requirement and exhausting worker capacity because they all monitor the same channels. This can happen if the workers all start at once: channels might end up with more assigned workers than needed. Even though this is unlikely to create a real problem in the described case (just a bit more redundancy than requested), you can handle it by prioritising workers. It is much like the employment of school teachers in BC, Canada, which is done on a seniority basis (the most senior gets the job first), except that here it would be done voluntarily by the workers themselves, not by the school administration. What this means is that each worker would have to check all its assigned channels and, should a channel have more workers than needed at that time, check whether it has the lowest priority among all the assignees. If it does, it would resign: remove itself and stop processing the channel.
That requires assigning distinct priorities to the workers, which could easily be achieved when spawning them by simply giving each the next sequential number (the oldest has the highest priority, or vice versa if you are concerned about old, potentially dying workers taking up all the load and would prefer new ones to take on more while still fresh). More elaborately, this could also be done using Cassandra lightweight transactions, as described in one of the answers here (the one by AlonL). With just a few workers (you mentioned ~4), either way should work, and the scaling concerns mentioned in the other answers there aren't a big deal for a few integer priorities. Also, instead of sequential number assignment, you could require the workers to self-assign a random 32-bit integer priority on initialization. This has virtually no chance of collision, so a loop "until no collisions" should exit on the very first iteration (which makes the second iteration a very rarely exercised code path that requires an explicit test).
The trick is basically to limit the amount of data requiring synchronisation and to put the load of regulation onto the workers themselves. There is no need for consensus algorithms, as there is not much complexity and we are not dealing with a huge number of potentially fraudulent workers trying to get assignments ahead of more senior peers.
The only issue I should mention is that there could be implicit worker rotation if channels go offline, which makes the worker stop processing them. You may get a different worker assignment the next time the channel comes online.
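The claim-and-resign rule described in this answer can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: the `table` dict stands in for a Cassandra table of channel assignments, `REQUIRED` and `CAPACITY` are made-up limits, and the "all workers start at once" race is imitated by letting every worker decide from the same stale snapshot. Ties on priority are broken by worker name so the outcome does not depend on the random priorities being distinct.

```python
import random

REQUIRED = 1   # workers wanted per channel (2 would give redundancy)
CAPACITY = 3   # most channels a single worker will take on

class Worker:
    def __init__(self, name, rng):
        self.name = name
        self.priority = rng.getrandbits(32)  # self-assigned random priority

    def load(self, table):
        return sum(1 for assignees in table.values() if self in assignees)

    def claim(self, table, snapshot):
        """Join channels that look under-assigned in the (possibly stale) snapshot."""
        for channel in table:
            if self.load(table) >= CAPACITY:
                break
            if len(snapshot[channel]) < REQUIRED:
                table[channel].add(self)

    def resign(self, table):
        """Leave over-assigned channels where this worker has the lowest priority."""
        for assignees in table.values():
            if self in assignees and len(assignees) > REQUIRED:
                lowest = min(assignees, key=lambda w: (w.priority, w.name))
                if lowest is self:
                    assignees.remove(self)

def copy_of(table):
    return {channel: set(assignees) for channel, assignees in table.items()}

rng = random.Random(0)
workers = [Worker(f"w{i}", rng) for i in range(4)]
table = {f"chan{i}": set() for i in range(8)}

# Simulated race: every worker sees the same empty snapshot and over-claims.
stale = copy_of(table)
for w in workers:
    w.claim(table, stale)

# Each worker then repeatedly resigns where it is surplus and claims what is free.
changed = True
while changed:
    before = copy_of(table)
    for w in workers:
        w.resign(table)
        w.claim(table, copy_of(table))
    changed = before != copy_of(table)
```

After the loop settles, every channel has exactly `REQUIRED` workers and no worker exceeds `CAPACITY`, even though the initial race piled all four workers onto the same three channels. A real implementation would also have to handle workers dying and the non-atomic read-then-write against Cassandra, which this sketch ignores.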

Should I store a global counter or an aggregated value in a TSDB

This question is really about the data schema. I have a program which has a bunch of discrete events, and I want to get beautiful graphs out.
From my knowledge, I understand that I should really keep a counter of the number of events that have occurred, and on a regular interval, transfer that cumulative counter to the TSDB (as part of a cron job or similar).
What I currently have, though, is a system where the monitor, on a regular interval, tells the TSDB how many events occurred during that interval (a fixed hard-coded value!).
Which of these two design patterns is better? What are the factors that affect that decision? Do I have a counter value here or is it just a measurement?
I have various concerns, including but not limited to the efficiency of the monitoring tool.
You tagged the question with InfluxDB but it seems like what you are really asking about is the collection agent. For that I would look at Telegraf.
StatsD is also a really great lightweight API that is available for most major languages now, from which you can efficiently emit different types of stats (counters, timings, etc); either for every event or at a sample rate you define.
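For a sense of how lightweight StatsD is, a counter increment is a single short text line, typically sent over UDP. The sketch below only formats that line and applies client-side sampling; the metric name "app.events" is made up, and actually sending the line to an agent such as Telegraf is left out. Treat it as an illustration of the plain-text counter format rather than a replacement for a real StatsD client library.

```python
import random

def statsd_counter(name, value=1, sample_rate=1.0):
    """Format a counter increment as a StatsD text line: name:value|c[|@rate]."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # The rate tells the server how to scale the count back up.
        line += f"|@{sample_rate}"
    return line

def sampled_counter(name, sample_rate, rng=random):
    """Return a line for roughly sample_rate of the calls, otherwise None."""
    if rng.random() < sample_rate:
        return statsd_counter(name, 1, sample_rate)
    return None

every_event = statsd_counter("app.events")
one_in_ten = statsd_counter("app.events", 1, 0.1)
```

Emitting one increment per event, as here, means the application sends per-event deltas and leaves the aggregation over each flush interval to the collection agent.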
I implemented a solution that gathers metrics emitted from my app using StatsD, metrics that are pulled (JMX queries), and the basic host-level stats you get for free with Telegraf. Every host (30+) runs a single Telegraf instance, which delivers its stats to a centralized InfluxDB server on some interval (e.g. 30 seconds).
So with an approach like that you get a good balance of performance and data precision.

Is there any way to set numWorkers dynamically in the middle of dataflow job running?

I am using Google Dataflow at work.
I need to set the number of workers dynamically while a Dataflow batch job is running.
That's mainly because of Cloud Bigtable QPS.
We are using 3 Bigtable cluster nodes, and they can't handle receiving all the traffic from 500 workers at once.
So I have to change the number of workers (from 500 to 25) just before inserting all the processed data into Bigtable.
Is there any way to achieve this goal?
Dataflow does not provide the ability to manually change the resource allocation of a batch job while it is running, however:
1) We plan to incorporate throttling into our autoscaling algorithms, so Dataflow would detect that it needs to downsize while writing to your Bigtable. I don't have a concrete ETA, but this is definitely on our roadmap.
2) Meanwhile, you can try to artificially limit the parallelism of your pipeline with a trick like this:
Take your PCollection<Something> (Something being the data type you're writing to Bigtable)
Pipe it through a sequence of transforms: ParDo(pair with a random key in 0..25), GroupByKey, ParDo(ungroup and remove random key). You get, again, a PCollection<Something>
Write this collection to Bigtable.
The trick here is that there is no parallelization within a single key after a GroupByKey, so the result of GroupByKey is a collection of 25 key-value pairs (where the value is an Iterable<Something>) that can't be processed by more than 25 workers in parallel. The ParDo's following it will likely get fused together with the writing to Bigtable, and will thus have a parallelism of 25.
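The shape of this trick can be shown with plain Python lists standing in for the PCollection. This is not Beam or Dataflow code: the three functions merely imitate the ParDo, GroupByKey and ungrouping ParDo steps, and `NUM_KEYS`, the function names, and the sample data are assumptions for the example. The keys are drawn from 0 to 24 so that there are exactly 25 of them.

```python
import random
from collections import defaultdict

NUM_KEYS = 25  # upper bound on parallelism after the grouping step

def pair_with_random_key(elements, rng):
    """Imitates the first ParDo: attach a random key in [0, NUM_KEYS)."""
    return [(rng.randrange(NUM_KEYS), element) for element in elements]

def group_by_key(pairs):
    """Imitates GroupByKey: one (key, list of values) entry per distinct key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def ungroup(groups):
    """Imitates the last ParDo: drop the keys and flatten the values again."""
    return [value for values in groups.values() for value in values]

rng = random.Random(42)
elements = list(range(10_000))  # hypothetical rows destined for Bigtable

groups = group_by_key(pair_with_random_key(elements, rng))
restored = ungroup(groups)
```

However many elements go in, `groups` never has more than `NUM_KEYS` entries, and each entry is the unit of work that a single worker would process; `restored` contains the same elements as the input, only reordered.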
The caveat is that Dataflow is within its right to materialize any intermediate collections if it predicts that this will improve performance of the pipeline. It may even do this just for the sake of increasing the degree of parallelism (which goes explicitly against your goal in this example). But if you have an urgent job to run, I believe right now this will probably do what you want.
Meanwhile the only long-term solution I can suggest, until we have throttling, is to use a smaller limit on number of workers, or use a larger Bigtable cluster, or both.
There's a lot of relevant information in the DATA & ANALYTICS: Analyzing 25 billion stock market events in an hour with NoOps on GCP talk from GCP/Next.
FWIW, you can increase the number of nodes of Bigtable before your batch job, give Bigtable a few minutes to adjust, and then start your job. You can turn down the Bigtable cluster when you're done with the batch job.

Resources