A way to use "aggregator" (custom counter) within Combine step? - google-cloud-dataflow

My team uses a lot of aggregators (custom counters) in many of our Dataflow pipelines for monitoring and analysis purposes.
We mostly write DoFn classes to do so, but we sometimes use Combine.perKey(), by writing our own combine class that implements SerializableFunction<Iterable<T>, S> (usually in our case, T and S are the same). Some of the jobs we run have a small fraction of very hot keys, and we are looking to utilize some of the features offered by Combine (such as hot key fanout), but there is one issue with this approach.
It appears that aggregators are only available within DoFn, and I am wondering whether there is a way around this, or whether this is likely to be added in the future. Mostly, we use a bunch of custom counters to count the number of certain events/objects of different types for analysis and monitoring purposes. In some cases, we can probably apply another DoFn after the Combine step to do this, but in other cases we really need to count things during the combine process -- for instance, we want to know the distribution of objects over keys to understand how many hot keys we have and what draws the line between hot keys and very hot keys. There are a few other cases that seem tricky to us.
I searched around, but I couldn't find many resources on how to use aggregators during the Combine step, so any help will be really appreciated!
If needed, I can perhaps describe what kind of Combine step we use and what we are trying to count, but it'll take some time and I'd like to have a general solution around this.

This is not currently possible. In the future (as part of Apache Beam) it is likely to be possible to define metrics (which are like aggregators) within a CombineFn which should address this.
In the meantime, for your use case you can do as you describe. You can have a Combine.perKey(), and then have multiple steps consuming the result -- one for your actual processing and others to report various metrics.
You could also look at the methods in CombineFns which allow creating a composed CombineFn. For instance, you could use your CombineFn and a simple Count, so that the reporting DoFn can report the number of elements in each key (consuming the Count) and the actual processing DoFn can consume the result of your CombineFn.
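The composed-combine idea can be sketched outside of Dataflow. The following is a plain-Python illustration, not the Dataflow or Beam API: combine_per_key, report_hot_keys and the sample events are hypothetical names invented for this sketch. It applies a stand-in user combiner (sum) and a simple count to each key in one combine step, then lets a reporting consumer read only the count while the processing consumer reads only the combined value.

```python
from collections import defaultdict


def combine_per_key(pairs, combiners):
    """Group (key, value) pairs and apply every combiner to each key's values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # One tuple of results per key, in the same order as `combiners`.
    return {key: tuple(fn(values) for fn in combiners)
            for key, values in grouped.items()}


def report_hot_keys(combined, threshold):
    """Reporting consumer: looks only at the count half of each result."""
    return sorted(key for key, (_, count) in combined.items()
                  if count >= threshold)


# Hypothetical input; `sum` stands in for the user's own combine function.
events = [("a", 3), ("b", 1), ("a", 4), ("a", 5), ("c", 2), ("b", 7)]
combined = combine_per_key(events, (sum, len))

# Processing consumer: looks only at the combined value.
totals = {key: total for key, (total, _) in combined.items()}
hot_keys = report_hot_keys(combined, threshold=3)
```

In a real pipeline the two consumers would be separate steps reading the same combined output, as described above.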

Related

Custom PCollectionView

There have been a number of times when I wanted to create a custom PCollectionView. Is this possible? For now, the only workaround I have is to create a PTransform, return a PCollection, and then apply a View.asSingleton() transform, but I've noticed (at least as of several months ago) that this is much slower than running a native view transform such as View.asList(). And since I'll be calling this PCollectionView method millions of times, it makes a difference whether it takes a few milliseconds or, say, a second.
How do you want to view the contents of your PCollection? The answer to this question will determine how you should approach things.
Cloud Dataflow (more generally, any Apache Beam backend) has a few ways that it will materialize your PCollection to allow you to efficiently access it as a side input. So list, singleton, map, and multimap are each pretty efficient for their usual access patterns (iteration, key lookup, etc). The architecture of Dataflow (now Beam) is such that you can define custom views, but if it requires a new access pattern then it will require backend support to be efficient.
Also, you might care to know that after the first access to a singleton side input, the value will usually be cached.

Sharing atoms between http-connected erlang clusters

We have a few non-erlang-connected clusters in our infrastructure and currently use term_to_binary to encode erlang terms for messages between the clusters. On the receiving side we use binary_to_term(Bin, [safe]) to only convert to existing atoms (should there be any in the message).
Occasionally (especially after starting a new cluster/stack), we run into the problem that there are partially known atoms encoded in the message, i.e. the sending cluster knows this atom, but the receiving does not. This can be for various reasons, most common is that the receiving node simply has not loaded a module containing some record definition. We currently employ some nasty work-arounds which basically amount to maintaining a short-ish list of potentially used atoms, but we're not quite happy with this error-prone approach.
Is there a smart way to share atoms between these clusters? Or is it recommended to not use the binary format for such purposes?
Looking forward to your insights.
I would think hard about why non-Erlang nodes are sending atom values in the first place. Most likely there is some adjustment that can be made to the protocol being used to communicate -- or most often there is simply not a real protocol defined and the actual protocol in use evolved organically over time.
Not knowing any details of the situation, there are two solutions to this:
Go deep and use an abstract serialization technique like ASN.1 or JSON or whatever, using binary strings instead of atoms. This makes the most sense when you have a largish set of well understood, structured data to send (which may wrap unstructured or opaque data).
Remain shallow and instead write a functional API interface for the processes/modules you will be sending to/calling first, to make sure you fully understand what your protocol actually is, and then back that up by making each interface call correspond to a matching network message which, when received, dispatches the same procedures an API function call would have.
The basic problem is the idea of non-Erlang nodes being able to generate atoms that the cluster may not be aware of. This is a somewhat sticky problem. In many cases the places where you are using atoms you can instead use binaries to similar effect and retain the same semantics without confusing the runtime. It's the difference between {<<"new_message">>, Data} and {new_message, Data}; matching within a function head works the same way, just slightly more noisy syntactically.
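As a rough illustration of the "explicit protocol" suggestion, here is a Python sketch rather than Erlang. The message shape {"tag": ..., "data": ...}, the handler names and the HANDLERS table are all assumptions made up for this example. Tags travel as plain strings inside JSON, and the receiver dispatches through a fixed table, so an unknown tag is rejected instead of ever needing to exist as a symbol on the receiving side.

```python
import json


def handle_new_message(data):
    # Hypothetical handler: the same procedure a local API call would run.
    return ("stored", data)


def handle_ping(_data):
    return ("pong", None)


# The protocol, written down: the only tags this receiver accepts.
HANDLERS = {"new_message": handle_new_message, "ping": handle_ping}


def dispatch(raw):
    """Decode one JSON message and route it by its string tag."""
    message = json.loads(raw)
    handler = HANDLERS.get(message.get("tag"))
    if handler is None:
        return ("error", "unknown_tag")
    return handler(message.get("data"))
```

The Erlang equivalent would match on binaries such as <<"new_message">> in function heads, as described above.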

Rails object structure for reporting metrics

I've asked a couple of questions around this subject recently, and I think I'm managing to narrow down what I need to do.
I am attempting to create some "metrics" (quotes because these should not be confused with metrics relating to the performance of the application; these are metrics that are generated based on application data) in a Rails app; essentially I would like to be able to use something similar to the following in my view:
@metric(@customer,'total_profit','01-01-2011','31-12-2011').result
This would give the total profit for the given customer for 2011.
I can, of course, create a metric model with a custom result method, but I am confused about the best way to go about creating the custom metrics (e.g. total_profit, total_revenue, etc.) in such a way that they are easily extensible so that custom metrics can be added on a per-user basis.
My initial thoughts were to attempt to store the formula for each custom metric in a structure with operand, operation and operation_type models, but this quickly got very messy and verbose, and was proving very hard to do in terms of adding each metric.
My thoughts now are that perhaps I could create a custom metrics helper method that would hold each of my metrics (thus I could just hard code each one, and pass variables to each method), but how extensible would this be? This option doesn't seem very rails-esque.
Can anyone suggest a better alternative for approaching this problem?
EDIT: The answer below is a good one in that it keeps things very simple, though I'm concerned it may be fraught with danger, as it uses eval (thus there is no prospect of ever using user code). Is there another option for doing this (my previous option, where operands etc. were broken down into chunks, used a combination of constantize and instance_variable_get; is there a way these could be used to make the execution of a string safer)?
This question was largely answered with some discussion here: Rails - Scalable calculation model.
For anyone who comes across this, the solution is essentially to ensure an operation always has two operands, but an operand can either be an attribute, or the result of a previous calculation (i.e. it can be a metric itself), and it is thus highly scalable. This avoids the need to eval anything, and thus avoids the potential security holes that this entails.
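A minimal sketch of that structure, written in Python rather than Rails and with every name (Attribute, Metric, OPERATIONS, the sample record) invented for illustration: each metric has exactly two operands, and an operand is either an attribute lookup or another metric. Operations come from a fixed table, so no string is ever evaluated.

```python
import operator
from dataclasses import dataclass
from typing import Union

# Fixed table of allowed operations; nothing is ever passed to eval.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


@dataclass(frozen=True)
class Attribute:
    """Leaf operand: reads a named value from the record."""
    name: str

    def result(self, record):
        return record[self.name]


@dataclass(frozen=True)
class Metric:
    """Binary operation whose operands are attributes or other metrics."""
    left: Union["Attribute", "Metric"]
    op: str
    right: Union["Attribute", "Metric"]

    def result(self, record):
        return OPERATIONS[self.op](self.left.result(record),
                                   self.right.result(record))


# Hypothetical metrics: the second one reuses the first as an operand.
total_profit = Metric(Attribute("revenue"), "-", Attribute("costs"))
profit_margin = Metric(total_profit, "/", Attribute("revenue"))

customer_2011 = {"revenue": 200.0, "costs": 150.0}
```

Because the operands and operation are plain data, per-user metrics could be stored as rows and rebuilt into this tree, though the storage layout is left open here.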

What's a succinct, useful and efficient way to store large time-series in F#?

I'm currently learning F# and I'm exploring using it to analyse financial time-series. Can anyone recommend a good data structure to store time-series data in?
F# offers a rich selection of native types and I'm looking for a simple combination that would provide an elegant, succinct and efficient solution.
I'm looking to store tick data, which consists of millions of records, each with a time stamp and several (~5-20) fields of numerical and textual data, with possible missing values.
My first thoughts are perhaps a sequence of tuples or records, but I was wondering if someone could kindly suggest something that has worked well in the real world.
EDIT:
A few extra points for clarification:
The common operations that I'm likely to require are:
Time based lookup - i.e. find the most recent data point at a given time
Time based joins
Appends
(Updates and deletes are going to be rare. )
I should make it clear I'm exploring using F# primarily as an interactive tool for research, with the ability to compile as a (really big) added bonus.
ANOTHER EDIT:
I should also have mentioned that my role/use of F# and this data is purely within research, not development. The intention is that once we understand the data (and what we want to do with it) better, we can later specify tools that our developers would build, such as data warehouses, at which point we'd start using their data structures.
I am concerned, though, that our models are computationally intensive, use a lot of memory and can't always be coded in a recursive manner. So we may end up having to query out large chunks anyway.
I should also say that I've always used Matlab or R for these sorts of tasks before but I'm now interested in F# as it offers that interactive, high level flexibility for Research but the same code can be used in production.
My apologies for not giving this context information at the start (It's my first question), I can see now that it helps people form their answers.
My thanks again to everyone that's taken the time to help me.
It really sounds like your data should be stored and queried in a relational database. (Where is it currently stored? Loading millions of records with several fields into memory must be an expensive operation, and could leave you with stale data and difficulty persisting changes.) You could then use the F# LINQ to SQL implementation (which I believe you can find in the Power Pack) to have F# expressions translated to SQL expressions.
Here's a link from Don Syme about LINQ Support in F# Power Pack: http://blogs.msdn.com/b/dsyme/archive/2009/10/23/a-quick-refresh-on-query-support-in-the-f-power-pack.aspx
The best choice of data structure depends upon what operations you want to do on it.
The simplest would be an array of structs. This has the advantages of fast random lookup, good space efficiency for an uncompressed representation and good locality. If there is sharing between substructures (like the strings) then intern them to make sure they get shared.
Alternatives might be a seq that is loaded from disk on demand, a singly-linked list that allows you to prepend elements quickly, or a balanced binary tree that allows operations like insertion at random locations efficiently.
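To make the array-based option concrete, here is a small sketch in Python rather than F#. TickSeries, its method names and the sample ticks are hypothetical. It keeps timestamps in a sorted array, supports cheap appends, and answers the "most recent data point at a given time" lookup from the question with a binary search. Missing field values are simply stored as None.

```python
from bisect import bisect_right


class TickSeries:
    """Append-only time series backed by parallel, time-ordered arrays."""

    def __init__(self):
        self.times = []  # sorted timestamps
        self.rows = []   # one row of fields per timestamp

    def append(self, time, row):
        # Appends must arrive in time order so the array stays sorted.
        if self.times and time < self.times[-1]:
            raise ValueError("ticks must be appended in time order")
        self.times.append(time)
        self.rows.append(row)

    def as_of(self, time):
        """Return the most recent row at or before `time`, or None."""
        index = bisect_right(self.times, time)
        return self.rows[index - 1] if index else None


# Hypothetical ticks: (price, venue), with a missing venue on the last one.
series = TickSeries()
series.append(10, (100.5, "X"))
series.append(20, (100.7, "Y"))
series.append(30, (100.6, None))
```

The same idea carries over to an F# array of structs with Array.BinarySearch, though the exact shape there is left to the reader.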

Is it ever a good idea to use association lists instead of records?

Would any experienced Erlang programmers out there ever recommend association lists over records?
One case might be where two (or more) nodes on different machines are exchanging messages. We want to be able to upgrade the software on each machine independently. Some upgrades may involve adding a field to one (or more) of the messages being sent. It seems like using a record as the message would mean you'd always have to do the upgrade on both machines in lock step so that the extra field didn't cause the receiver to ignore the record. Whereas if you used something like an association list (which still has a "record-like" API), the not-yet-upgraded receiver would still receive the message successfully and just ignore the new field. I realize this isn't always the desired behavior, but often it is. Also, assume the messages are fairly small so the lookup time doesn't matter.
Assuming the above makes some sense, I have the following additional questions:
Is there a standard (or widely used) library for alists? Some trivial googling didn't turn up anything.
Are there other cases where you would use an association list (or something like it)?
You have basically three choices:
Use Records
Use Association Lists (proplists)
Use Combination
I use records where the likelihood of changing it is very low. That way I get the pattern matching and speed up that I want.
I use proplists where I need hashtable like functionality. I get flexibility at the expense of pattern matching and speed.
And sometimes I use both. A record with one field that is a proplist. That way I can pattern match on a portion of it and yet have flexibility where I need it.
All three choices have different trade-offs so you basically just have to evaluate your particular needs and make a choice. It may take some prototyping and playing around to figure out which trade-offs make sense and which features you absolutely must have.
For a small number of keys you can use lists (aka proplists); for a larger number you should use dict. In both cases the biggest disadvantage is that you can't pattern match the way you can with records. There is also a speed penalty, but in most cases it is irrelevant.
Note that lists:keysearch/3 is pretty much "assq".
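The upgrade scenario from the question can be illustrated with a short Python sketch (not Erlang; alist_get, handle_order and the message fields are invented for this example). The receiver looks up only the keys it knows about in a list of key/value pairs, so a message from an upgraded sender that carries an extra field is still handled.

```python
def alist_get(alist, key, default=None):
    """Return the value of the first pair whose key matches, else `default`."""
    for k, v in alist:
        if k == key:
            return v
    return default


def handle_order(message):
    # A not-yet-upgraded receiver: it reads only the fields it knows.
    return (alist_get(message, "id"), alist_get(message, "qty", 1))


# Hypothetical messages: the upgraded sender adds a "priority" field.
old_message = [("id", 7), ("qty", 2)]
new_message = [("id", 8), ("qty", 5), ("priority", "high")]
```

In Erlang the lookup role would be played by proplists or lists:keysearch/3, as mentioned above.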

Resources