Per-partition GroupByKey in Beam - google-cloud-dataflow

Beam's GroupByKey groups records by key across all partitions and outputs a single iterable per key per window. This "brings associated data together into one location".
Is there a way I can group records by key locally, so that I still get a single iterable per key per window as output, but only over the local records in the partition instead of a global group-by-key over all locations?

If I understand your question correctly, you don't want to transfer data over the network if a part of it (a partition) was processed on the same machine and can therefore be grouped locally.
Normally, Beam doesn't give you details about where and how your code will run, since that may vary depending on the runner/engine/resource manager. However, if you can fetch some unique information about your worker (like its hostname, IP, or MAC address), you can use it as part of your key and group all related data by it. Quite likely these data partitions then won't be moved to other machines, since all the needed input data is already sitting on the same machine and can be processed locally, although, as far as I know, there is no 100% guarantee of that.
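For illustration, here is a minimal sketch using the Beam Python SDK of the hostname-in-the-key idea (the data and transform labels are hypothetical). As noted above, it only narrows the grouping to elements handled by the same worker; it does not guarantee the shuffle stays on the local machine:

```python
import socket

import apache_beam as beam


def tag_with_hostname(kv):
    # Extend the key with the hostname of the worker processing this element,
    # so GroupByKey produces one iterable per (key, hostname) per window.
    key, value = kv
    return ((key, socket.gethostname()), value)


with beam.Pipeline() as pipeline:
    grouped = (
        pipeline
        | "Create" >> beam.Create([("a", 1), ("a", 2), ("b", 3)])
        | "TagWithHost" >> beam.Map(tag_with_hostname)
        | "GroupPerHost" >> beam.GroupByKey()
    )
```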

Related

Why are read-only nodes called read-only in the case of data store replication?

I was going through the article, https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs which says, "If separate read and write databases are used, they must be kept in sync". One obvious benefit I can understand from having separate read replicas is that they can be scaled horizontally. However, I have some doubts:
It says, "Updating the database and publishing the event must occur in a single transaction". My understanding is that there is no guarantee that the updated data will be available immediately on the read-only nodes because it depends on when the event will be consumed by the read-only nodes. Did I get it correctly?
Data must be first written to read-only nodes before it can be read i.e. write operations are also performed on the read-only nodes. Why are they called read-only nodes? Is it because the write operations are performed on these nodes not directly by the data producer application; but rather by some serverless function (e.g. AWS Lambda or Azure Function) that picks up the event from the topic (e.g. Kafka topic) to which the write-only node has sent the event?
Is the data sharded across the read-only nodes or does every read-only node have the complete set of data?
All of these have "it depends"-like answers...
Yes, usually, although some implementations might choose to (try to) update read models transactionally with the update. With multiple nodes you're quickly forced to learn the CAP theorem, though, and so in many CQRS contexts, eventual consistency is just accepted as a feature, as the gains from tolerating it usually significantly outweigh the losses.
I suspect the bit you quoted refers to transactionally updating the write store together with publishing the event. Even this can be difficult to achieve, and it is one of the problems event sourcing seeks to solve (a minimal outbox-style sketch follows this answer).
Yes. It's trivially obvious - in this context - that data must be written before it can be read, but your apps, as consumers of the data, see those nodes as read-only.
Both are valid outcomes. Usually this part is less an application concern and is more delegated to the capabilities of your chosen read-model infrastructure (Mongo, Cosmos, Dynamo, etc).
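As a side note on the "single transaction" point above, here is a minimal outbox-style sketch (a hypothetical schema, with sqlite3 standing in for the write store) of how the state change and the event record can be committed atomically, leaving actual publication to a separate relay process:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, event TEXT)")
conn.execute("INSERT INTO accounts VALUES ('acct-1', 100.0)")
conn.commit()

# One transaction: the state update and the outbox row either both commit
# or both roll back. A separate relay would later read the outbox table and
# publish its rows to the message topic.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 'acct-1'")
    conn.execute(
        "INSERT INTO outbox (event) VALUES (?)",
        ('{"type": "Debited", "account": "acct-1", "amount": 10}',),
    )
```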

Apache Geode for data replication

We need to maintain and modify an in-memory hash table from within a single java process. We also need to persist it, so that its contents can be recovered after a crash, deploy or when the machine running the application fails.
We have tight latency requirements.
Would Apache Geode fit our requirements? We will run two additional nodes, which can be used on application startup to populate the hash table values.
Geode is a distributed key-value cache, kind of like a hash table on steroids, so yes, it would fit your requirements.
You can choose to persist your data, or not.
You can have n nodes hosting your data, managed by a locator process that will automatically distribute it to all nodes, or to a subset of nodes, on the same machine or on n other machines.

How exactly does BroadcastHashJoin work in Spark?

I'm trying to understand how a BroadcastHashJoin is executed.
I know that the small table is broadcast to all nodes, but is the result then sent back to the driver?
I'm using the Spark UI to understand how the network traffic is managed, but I don't get any relevant results and the driver row is always empty:
Why can't I see traffic to the driver?
The relation to be broadcast is collected to the driver.
The collected relation is hashed locally.
The hashed relation is used to create a broadcast variable.
The broadcast relation is used to compute the join in parallel on the executors.
The missing driver data you see most likely corresponds to the hashing part, which is not executed inside a job and therefore doesn't create useful metrics.
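To make the steps above concrete, here is a minimal PySpark sketch (hypothetical data) that forces a broadcast hash join: the small DataFrame is collected to the driver, hashed, shipped to every executor as a broadcast variable, and the join itself then runs in parallel on the executors:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# A large table and a small lookup table with a shared join key.
large = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.createDataFrame([(0, "a"), (1, "b")], ["key", "label"])

# The broadcast() hint asks Spark to use a broadcast hash join.
joined = large.join(broadcast(small), on="key")

# The physical plan should show a BroadcastHashJoin node.
joined.explain()
```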

Elasticsearch / Kibana: Application-side joins

Is it possible with Kibana (preferably the shining new version 4 beta) to perform application-side joins?
I know that ES / Kibana is not built to replace relational databases and that it is normally a better idea to denormalize my data. In this use case, however, that is not the best approach, since the index size is exploding and performance is dropping:
I'm indexing billions of documents containing session information of network flows like this: source ip, source port, destination ip, destination port, timestamp.
Now I also want to collect additional information for each ip address, such as geolocation, asn, reverse dns etc. Adding this information to every single session document makes the whole database unmanageable: there are millions of documents with the same ip addresses, and the redundancy of adding the same additional information to all of them leads to massive bloat and an unresponsive user experience, even on a cluster with hundreds of gigabytes of RAM.
Instead I would like to create a separate index containing only unique ip addresses and the metadata that I have collected to each one of them.
The question is: how can I still analyze my data using Kibana? For each document returned by the query, Kibana should perform a lookup in the ip-index and "virtually enrich" each ip address with this information. Something like adding virtual fields, so the structure would look like this (on the fly):
source ip, source port, source country, source asn, source fqdn
I'm aware that this would come at the cost of multiple queries.
I don't think there is such a thing, but maybe you could play around with the filters:
You create nice and simple visualizations that each filter on a different type and display only one simple piece of data.
You put these different visualizations in a dashboard in order to display all the data associated with a type of join.
You use the filters as your join key and use the full dashboard, composed of the different panels, to get insights into specific join keys (ips in your case, or sessions).
You need to create one dashboard for every type of join that you want to make.
Note that you will need to harmonize the names and mappings of the fields in your different documents!
Keep us updated; that's an interesting problem, and I would like to know how it turns out with so many documents.
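For reference, the two-query lookup the question describes could look roughly like this outside Kibana, using the elasticsearch Python client (8.x-style API); the index and field names here are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# First query: fetch a page of session documents.
sessions = es.search(index="sessions", size=100, query={"match_all": {}})
hits = [h["_source"] for h in sessions["hits"]["hits"]]

# Second query: look up the metadata for every ip address seen in those sessions.
ips = {h["source_ip"] for h in hits} | {h["destination_ip"] for h in hits}
meta = es.search(index="ip-metadata", size=len(ips),
                 query={"terms": {"ip": sorted(ips)}})
by_ip = {d["_source"]["ip"]: d["_source"] for d in meta["hits"]["hits"]}

# Merge in application code: the "application-side join".
enriched = [
    {**h,
     "source_meta": by_ip.get(h["source_ip"]),
     "destination_meta": by_ip.get(h["destination_ip"])}
    for h in hits
]
```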

What is Mnesia replication strategy?

What strategy does Mnesia use to define which nodes will store replicas of particular table?
Can I force Mnesia to use specific number of replicas for each table? Can this number be changed dynamically?
Are there any sources (besides the source code) with detailed (not just overview) description of Mnesia internal algorithms?
Manual. You're responsible for specifying what is replicated where.
Yes, as above, manually. This can be changed dynamically.
I'm afraid (though I may be wrong) that there are none besides the source code. In terms of documentation, the whole Erlang distribution is hardly the leader in the software world.
Mnesia does not automatically manage the number of replicas of a given table.
You are responsible for specifying each node that will store a table replica (and hence their number). A replica may then be:
stored in memory,
stored on disk,
stored both in memory and on disk,
not stored on that node - in this case the table will be accessible but data will be fetched on demand from some other node(s).
It's possible to reconfigure the replication strategy when the system is running, though to do it dynamically (based on a node-down event for example) you would have to come up with the solution yourself.
The Mnesia system events could be used to discover a situation when a node goes down; given you know what tables were stored on that node you could check the number of their online replicas based on the nodes which were still online and then perform a replication if needed.
I'm not aware of any application/library which already manages this kind of thing, and it seems like quite an advanced endeavor (from my point of view, at least) to make one.
However, Riak is a database which manages data distribution among its nodes transparently to the user and is configurable with respect to the options you mentioned. That may be the way to go for you.
