Custom PCollectionView - google-cloud-dataflow

There have been a number of times when I wanted to create a custom PCollectionView. Is this possible? For now, the only workaround I have is to create a PTransform, return a PCollection, and then apply a PCollectionView.asSingleton() transform, but I've noticed (at least several months ago) that this is much slower than running a native PCollectionView transform, such as View.AsList(). And since I'll be calling this PCollectionView method millions of times, it makes a difference if it takes a few milliseconds vs say a second.

How do you want to view the contents of your PCollection? The answer to this question will determine how you should approach things.
Cloud Dataflow (more generally, any Apache Beam backend) has a few ways that it will materialize your PCollection to allow you to efficiently access it as a side input. So list, singleton, map, and multimap are each pretty efficient for their usual access patterns (iteration, key lookup, etc). The architecture of Dataflow (now Beam) is such that you can define custom views, but if it requires a new access pattern then it will require backend support to be efficient.
Also you might care to know that after the first access to a singleton sided input, the value will usually be cached.

Related

A way to use "aggregator" (custom counter) within Combine step?

My team uses a lot of aggregators (custom counters) for many of dataflow pipelines we use for monitoring and analysis purposes.
We mostly write DoFn classes to do so, but we sometimes use Combine.perKey(), by writing our own combine class that implements SerializableFunction<Iterable<T>, S> (usually in our case, T and S are the same). Some of the jobs we run have a small fraction of very hot keys, and we are looking to utilize some of the features offered by Combine (such as hot key fanout), but there is one issue with this approach.
It appears that aggregators are only available within DoFn, and I am wondering if there is a way around this, or this is a likely feature to be added in the future. Mostly, we use a bunch of custom counters to count the number of certain events/objects of different types for analysis and monitoring purposes. In some cases, we can probably apply another DoFn after the Combine step to do this, but in other cases we really need to count things during the combine process -- for instance, we want to know the distribution of objects over keys to understand how many hot keys we have and what draws the line between hot keys and very hot keys, for instance. There are a few other cases that seem tricky to us.
I searched around, but I couldn't find much resource around how one can use aggregators during the Combine step, so any help will be really appreciated!
If needed, I can perhaps describe what kind of Combine step we use and what we are trying to count, but it'll take some time and I'd like to have a general solution around this.
This is not currently possible. In the future (as part of Apache Beam) it is likely to be possible to define metrics (which are like aggregators) within a CombineFn which should address this.
In the meantime, for your use case you can do as you describe. You can have a Combine.perKey(), and then have multiple steps consuming the result -- one for your actual processing and others to report various metrics.
You could also look at the methods in CombineFns which allow creating a composed CombineFn. For instance, you could use your CombineFn and a simple Count, so that the reporting DoFn can report the number of elements in each key (consuming the Count) and the actual processing DoFn can consume the result of your CombineFn.

Custom Collector for Neo4j Legacy Index

I have a Neo4j application that uses the legacy Lucene indexes on certain relationship properties. Whenever I query these I am looking for exact matches, and all of them. While doing some profiling I discovered that the application is spending a highly disproportionate amount of time retrieving these results as it is pulling them in chunks from a prioritized queue. Given that I do not care about the ordering and want all of the results, what can I do to change the underlying behavior?
From my own searching, I came across Lucene's Collector implementations and it seems like a custom one that collects everything and never bothers scoring could be the answer, but I do not know how I can inject one into Neo4j. I am not opposed to using reflection or other means if it is not actually supported by Neo4j.
The application accesses Neo4j via the embedded Java methods.
We're working on some of that as part of our upgrade to Lucene5, there custom collectors for some of these use-case will be implemented. Hopefully we can make something available in the next weeks.

Sharing atoms between http-connected erlang clusters

We have a few non-erlang-connected clusters in our infrastructure and currently use term_to_binary to encode erlang terms for messages between the clusters. On the receiving side we use binary_to_term(Bin, [safe]) to only convert to existing atoms (should there be any in the message).
Occasionally (especially after starting a new cluster/stack), we run into the problem that there are partially known atoms encoded in the message, i.e. the sending cluster knows this atom, but the receiving does not. This can be for various reasons, most common is that the receiving node simply has not loaded a module containing some record definition. We currently employ some nasty work-arounds which basically amount to maintaining a short-ish list of potentially used atoms, but we're not quite happy with this error-prone approach.
Is there a smart way to share atoms between these clusters? Or is it recommended to not use the binary format for such purposes?
Looking forward to your insights.
I would think hard about why non-Erlang nodes are sending atom values in the first place. Most likely there is some adjustment that can be made to the protocol being used to communicate -- or most often there is simply not a real protocol defined and the actual protocol in use evolved organically over time.
Not knowing any details of the situation, there are two solutions to this:
Go deep and use an abstract serialization technique like ASN.1 or JSON or whatever, using binary strings instead of atoms. This makes the most sense when you have a largish set of well understood, structured data to send (which may wrap unstructured or opaque data).
Remain shallow and instead write a functional API interface for the processes/modules you will be sending to/calling first, to make sure you fully understand what your protocol actually is, and then back that up by making each interface call correspond to a matching network message which, when received, dispatches the same procedures an API function call would have.
The basic problem is the idea of non-Erlang nodes being able to generate atoms that the cluster may not be aware of. This is a somewhat sticky problem. In many cases the places where you are using atoms you can instead use binaries to similar effect and retain the same semantics without confusing the runtime. Its the difference between {<<"new_message">>, Data} and {new_message, Data}; matching within a function head works the same way, just slightly more noisy syntactically.

Rails object structure for reporting metrics

I've asked a couple of questions around this subject recently, and I think I'm managing to narrow down what I need to do.
I am attempting to create some "metrics" (quotes because these should not be confused with metrics relating to the performance of the application; these are metrics that are generated based on application data) in a Rails app; essentially I would like to be able to use something similar to the following in my view:
#metric(#customer,'total_profit','01-01-2011','31-12-2011').result
This would give the total profit for the given customer for 2011.
I can, of course, create a metric model with a custom result method, but I am confused about the best way to go about creating the custom metrics (e.g. total_profit, total_revenue, etc.) in such a way that they are easily extensible so that custom metrics can be added on a per-user basis.
My initial thoughts were to attempt to store the formula for each custom metric in a structure with operand, operation and operation_type models, but this quickly got very messy and verbose, and was proving very hard to do in terms of adding each metric.
My thoughts now are that perhaps I could create a custom metrics helper method that would hold each of my metrics (thus I could just hard code each one, and pass variables to each method), but how extensible would this be? This option doesn't seem very rails-esque.
Can anyone suggest a better alternative for approaching this problem?
EDIT: The answer below is a good one in that it keeps things very simple - though i'm concerned it may be fraught with danger, as it uses eval (thus there is no prospect of ever using user code). Is there another option for doing this (my previous option where operands etc. were broken down into chunks used a combination of constantize and get_instance_variable - is there a way these could be used to make the execution of a string safer)?
This question was largely answered with some discussion here: Rails - Scalable calculation model.
For anyone who comes across this, the solution is essentially to ensure an operation always has two operands, but an operand can either be an attribute, or the result of a previous calculation (i.e. it can be a metric itself), and it is thus highly scalable. This avoids the need to eval anything, and thus avoids the potential security holes that this entails.

Is it ever a good idea to use association lists instead of records?

Would any experienced Erlang programmers out there ever recommend association lists over records?
One case might be where two (or more) nodes on different machines are exchanging messages. We want to be able to upgrade the software on each machine independently. Some upgrades may involve adding a field to one (or more) of the messages being sent. It seems like using a record as the message would mean you'd always have to do the upgrade on both machines in lock step so that the extra field didn't cause the receiver to ignore the record. Whereas if you used something like an association list (which still has a "record-like" API), the not-yet-upgraded receiver would still receive the message successfully and just ignore the new field. I realize this isn't always the desired behavior, but often it is. Also, assume the messages are fairly small so the lookup time doesn't matter.
Assuming the above makes some sense, I have the following additional questions:
Is there a standard (or widely used) library for alists? Some trivial googling didn't turn up anything.
Are there other cases where you would use an association list (or something like it)?
You have basically three choices:
Use Records
Use Association Lists (proplists)
Use Combination
I use records where the likelihood of changing it is very low. That way I get the pattern matching and speed up that I want.
I use proplists where I need hashtable like functionality. I get flexibility at the expense of pattern matching and speed.
And sometimes I use both. A record with one field that is a proplist. That way I can pattern match on a portion of it and yet have flexibility where I need it.
All three choices have different trade-offs so you basically just have to evaluate your particular needs and make a choice. It may take some prototyping and playing around to figure out which trade-offs make sense and which features you absolutely must have.
For small amount of keys you can use lists aka proplists for bigger you should use dict. In both cases biggest disadvantage is that you can't use pattern match in way as used for records. There is also speed penalty but it is in most cases irrelevant.
Note that lists:keysearch/3 is pretty much "assq".

Resources