In Akka I've found that the name of an actor system is case-sensitive, but so far I have been unable to find any documentation on whether the other parts of actor paths are also case-sensitive.
I need to build a system that consists of:
Nodes, where each node can accept one input.
The node that received the input shares it with all nodes in the network.
Each node does a computation on the input (the same computation, but each node has a different database, so the results are different for each node).
The node that received the input consolidates each node's result and applies some logic to determine the overall result.
This result is returned to the caller.
It's very similar to a map-reduce use case, except there will only be a few nodes (maybe 10-20), so solutions like Hadoop seem like overkill.
Do you know of any simple framework/sdk to build:
Network (discovery, maybe gossip protocol)
Distribute a task/data to each node
Aggregate the results
Can be in any language.
Thanks very much
Regards,
fernando
OK, to begin with, there are many ways to do this. I would suggest the following if you are just starting to tackle this architecture:
Pub/Sub with Broker
Brokers like RabbitMQ are designed to make it easy for a variable number of nodes to connect and speak to one another. Most importantly, they allow for transparency and observability: you can easily ask the broker which nodes are connected and even view messages in transit. Basically, they are a 'batteries included' means of dealing with a large number of clients.
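For a sense of how little code this takes, here is a minimal sketch using the pika client, assuming a broker on localhost and a fanout exchange named 'tasks' (both names are illustrative, not anything from your setup):

```python
# Minimal sketch: fan a task out to every connected node via RabbitMQ.
# Assumes a broker on localhost and the pika client; exchange/queue names
# are illustrative.
import pika

def publish_task(payload: bytes):
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.exchange_declare(exchange="tasks", exchange_type="fanout")
    channel.basic_publish(exchange="tasks", routing_key="", body=payload)
    conn.close()

def consume_tasks(on_task):
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.exchange_declare(exchange="tasks", exchange_type="fanout")
    # Each node gets its own exclusive queue bound to the fanout exchange,
    # so every node sees every published task.
    queue = channel.queue_declare(queue="", exclusive=True).method.queue
    channel.queue_bind(exchange="tasks", queue=queue)
    channel.basic_consume(queue=queue,
                          on_message_callback=lambda ch, m, props, body: on_task(body),
                          auto_ack=True)
    channel.start_consuming()
```

Binding one exclusive queue per node to the fanout exchange is what gives you the "share the input with all nodes" behaviour.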
Brokerless (Update)
I was looking for a more 'symmetric' architecture where each node is the same and there is no centralized broker/queue manager.
You can use brokerless pub/sub systems, but I personally avoid them. While they have tooling, it is hard to understand their registration protocols when something odd happens. I generally just use multicast, as it is very straightforward, especially if each node has just one network interface, and you can extend/modify behavior purely with routing infrastructure.
Here is how your scheme would work with multicast:
All nodes join a known multicast address (e.g. 239.1.2.3:8000)
All nodes would need to respond to a 'who's here' message
All nodes would need a 'do work' API, either via multicast or directly from consumer to node (with the node address taken from the 'who's here' response)
You would need to define these messages yourself, but given how short I expect them to be, it should be pretty simple (a rough sketch follows below).
The 'who's here' message from the consumer could just be a message with a binary zero.
The 'who's here' response could just be a 1 followed by the node's information (making it a TLV would probably be best, though)
I'm not sure whether each node has unique arguments, so I don't know exactly how to shape your 'do work' message or response.
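A minimal sketch of the 'who's here' exchange over UDP multicast, assuming the group address above and the 0x00/0x01 framing just described; everything else (names, timeouts) is illustrative:

```python
# Minimal sketch of the 'who's here' discovery exchange over UDP multicast.
# The group address, port, and message framing follow the scheme above;
# all other details are assumptions.
import socket
import struct

GROUP, PORT = "239.1.2.3", 8000

def make_listener():
    """Node side: join the multicast group so probes can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

def node_loop(node_name: bytes):
    """Answer every 'who's here' probe with 0x01 + this node's information."""
    sock = make_listener()
    while True:
        data, addr = sock.recvfrom(1024)
        if data == b"\x00":                        # 'who's here' probe
            sock.sendto(b"\x01" + node_name, addr) # reply directly to the asker

def discover(timeout=1.0):
    """Consumer side: send one probe to the group and collect replies."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    sock.sendto(b"\x00", (GROUP, PORT))
    nodes = []
    try:
        while True:
            data, addr = sock.recvfrom(1024)
            if data.startswith(b"\x01"):
                nodes.append((addr, data[1:].decode()))
    except socket.timeout:
        pass
    return nodes
```

The 'do work' request can then go directly to each address returned by discover(), or out over the same group if you prefer.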
I was going through the article, https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs which says, "If separate read and write databases are used, they must be kept in sync". One obvious benefit I can understand from having separate read replicas is that they can be scaled horizontally. However, I have some doubts:
It says, "Updating the database and publishing the event must occur in a single transaction". My understanding is that there is no guarantee that the updated data will be available immediately on the read-only nodes because it depends on when the event will be consumed by the read-only nodes. Did I get it correctly?
Data must first be written to the read-only nodes before it can be read, i.e. write operations are also performed on the read-only nodes. Why are they called read-only nodes? Is it because the writes to these nodes are performed not directly by the data-producing application, but rather by some serverless function (e.g. AWS Lambda or Azure Functions) that picks up the event from the topic (e.g. a Kafka topic) to which the write-only node sent the event?
Is the data sharded across the read-only nodes or does every read-only node have the complete set of data?
All of these have "it depends"-like answers...
Yes, usually, although some implementations might choose to (try to) update read models transactionally with the update. With multiple nodes you're quickly forced to learn the CAP theorem, though, and so in many CQRS contexts, eventual consistency is just accepted as a feature, as the gains from tolerating it usually significantly outweigh the losses.
I suspect the bit you quoted actually refers to updating the write store and publishing the event in a single transaction. Even this can be difficult to achieve, and it is one of the problems event sourcing seeks to solve.
Yes. It's trivially obvious - in this context - that data must be written before it can be read, but your apps as consumers of the data see them as read-only.
Both are valid outcomes. Usually this part is less an application concern and is more delegated to the capabilities of your chosen read-model infrastructure (Mongo, Cosmos, Dynamo, etc).
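To make the flow concrete, here is a rough, illustrative sketch (not from the article) in which a command handler updates the write model and publishes an event, and a separate projector later applies it to the read model; the in-memory structures stand in for a real database, topic, and consumer function:

```python
# Illustrative only: in-memory stand-ins for the write store, event bus,
# and read store. Real systems would use a database, Kafka, a projector
# function, etc., and would make the write + publish atomic.
from dataclasses import dataclass

@dataclass
class PriceChanged:
    product_id: str
    new_price: float

write_store = {}   # authoritative write model
read_store = {}    # denormalized read model, updated asynchronously
event_bus = []     # stand-in for a topic (e.g. Kafka)

def handle_change_price(product_id: str, new_price: float):
    """Command handler: update the write model and publish the event.
    In a real system these two steps must happen atomically."""
    write_store[product_id] = new_price
    event_bus.append(PriceChanged(product_id, new_price))

def run_projector():
    """Consumer (e.g. a serverless function) that builds the read model.
    Until it runs, the read model is stale: eventual consistency."""
    while event_bus:
        event = event_bus.pop(0)
        read_store[event.product_id] = {"price": event.new_price}

handle_change_price("sku-1", 9.99)
print(read_store)      # {} -- read model not yet updated
run_projector()
print(read_store)      # {'sku-1': {'price': 9.99}}
```

The gap between the two print statements is exactly the window of eventual consistency discussed above.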
We need to export information about all the Service Principals of an AAD tenant periodically. We expect the number to be large, so we need to partition the export. When exporting users from Microsoft Graph we were able to partition based on the first letter of mailNickname, using startswith(mailNickname, '<letter>') as the filter, but trying that on appId and id with servicePrincipals errored out with Request_UnsupportedQuery. Is there another way to parallelize the data export?
Example request: https://graph.microsoft.com/beta/servicePrincipals?$filter=accountEnabled eq true and startswith(appId, '0')&$select=id,appId,displayName&$top=999
What you are attempting is similar to the approach I shared here (feel free to experiment with that code):
https://github.com/piotrci/Microsoft-Graph-Efficient-Operations/blob/master/Microsoft-Graph-Efficient-Operations/ScenarioImplementations/UserScenarios.cs
Filtering on ids is usually not supported by Graph resources. In my short experiment, I was able to use the servicePrincipal's displayName to partition the collection.
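For example, something along these lines could generate one request per displayName prefix (illustrative only; remember to URL-encode the query string in a real client, and treat startswith support on displayName as something to verify against your tenant):

```python
# Illustrative sketch: one request URL per displayName prefix so the export
# can be fetched in parallel. In practice, URL-encode the query string
# (e.g. by passing it as params to your HTTP client).
import string

BASE = "https://graph.microsoft.com/beta/servicePrincipals"
SELECT = "$select=id,appId,displayName&$top=999"

def partitioned_urls():
    for letter in string.ascii_lowercase + string.digits:
        flt = f"$filter=accountEnabled eq true and startswith(displayName, '{letter}')"
        yield f"{BASE}?{flt}&{SELECT}"
```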
Note, however, that such an approach does not guarantee uniform partitioning. Also, in your scenario (periodic full exports), is this optimization necessary?
Suggestion: consider using Graph's delta query to do a full export once, and then only pick up delta changes. This may be a much better optimization if you expect high volume, but limited churn to the resources.
https://graph.microsoft.com/beta/servicePrincipals/delta
https://learn.microsoft.com/en-us/graph/delta-query-overview
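A rough sketch of that delta flow, assuming you already have an access token with the appropriate permissions; the paging fields (@odata.nextLink / @odata.deltaLink) are the standard delta-query contract, and everything else here is illustrative:

```python
# Rough sketch of a delta-based export. The first call does a full export;
# later runs pass the saved delta link to fetch only changes since then.
import requests

GRAPH_DELTA_URL = "https://graph.microsoft.com/beta/servicePrincipals/delta"

def export_service_principals(access_token: str, delta_link: str = None):
    headers = {"Authorization": f"Bearer {access_token}"}
    url = delta_link or f"{GRAPH_DELTA_URL}?$select=id,appId,displayName"
    items = []
    while url:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        page = resp.json()
        items.extend(page.get("value", []))
        if "@odata.nextLink" in page:            # more pages in this round
            url = page["@odata.nextLink"]
        else:                                    # done; keep the delta link
            delta_link = page.get("@odata.deltaLink")
            url = None
    return items, delta_link
```

Persist the returned delta link between runs; only the very first export has to walk the full collection.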
I am looking to dynamically set an Erlang node to 'hidden', or set 'connect_all', after the node has already been created. Is this possible in Erlang?
There is an undocumented net_kernel:hidden_connect_node(NodeName) function that can be used on a per-connection basis, so that NodeName does not share all the connection details of the caller.
There is no guarantee related to its long term support, but that's currently the only way to do things dynamically.
Thanks to #I GIVE TERRIBLE ADVICE (AND WRITE AWESOME ERLANG BOOKS) for sharing this gem of knowledge. I would also like to highlight how it has been particularly useful in my specific case:
Context:
I have several machines that host an Erlang node running my OTP application
The nodes are configured in a wireless peer-to-peer setup
For testing purposes, I would like to observe the behaviour of the cluster when multi-hop is required from a node A to another node B.
So far my best (and only) solution has been to physically move around the nodes such that they can only reach neighbours in range of their Wi-Fi antenna.
Bottom line: for those in situations similar to the one I have described, this is a very handy function for clustering nodes without completely removing the default transitive behaviour.
I am about to develop a distributed system. The system, among other functionality, needs to allocate some resources (large resources that can be fragmented into smaller blocks). To do that, I want to use the Chord/Pastry P2P approach (stations on a logical ring network).
Pastry has a very interesting approach to resource allocation: when a user station needs to send something, the hash of the station's GUID is used to find the key in the DHT, so something like this happens:
User station -> GUID (hash of the user station's IP) -> hash -> I obtain a value X -> use this value to find, in the Pastry ring, the station whose GUID (hash of the Pastry node's public key) has that same value (or the immediate predecessor) -> put the data there.
This means that, ideally, every user always locates its own data on the same Pastry station (Pastry node). The protocol also mirrors data on neighbours, so a user can find its data on a few nodes.
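A minimal, illustrative sketch of that lookup (plain consistent hashing on a sorted ring, not Pastry itself; the node keys and the choice of SHA-1 are assumptions):

```python
# Illustrative consistent-hashing lookup on a ring; a real Pastry/Chord
# implementation adds routing tables, leaf sets, and replication.
import hashlib
from bisect import bisect_right

def ring_hash(value: str) -> int:
    """Map any identifier (node public key, user station IP, ...) onto the ring."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, node_keys):
        # Each node's position is the hash of its public key (or other GUID source).
        self.points = sorted(ring_hash(k) for k in node_keys)
        self.by_point = {ring_hash(k): k for k in node_keys}

    def owner(self, station_ip: str) -> str:
        """Return the node whose position matches the station's hash,
        or its immediate predecessor (wrapping around), as described above."""
        x = ring_hash(station_ip)
        i = bisect_right(self.points, x) - 1   # last point <= x, or -1
        point = self.points[i]                 # index -1 wraps to the largest point
        return self.by_point[point]

ring = Ring(["node-key-A", "node-key-B", "node-key-C"])
print(ring.owner("10.0.0.42"))   # always maps the same station to the same node
```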
Is this a good approach? Are there any possible side effects on proceeding as before?
Pastry-like P2P solutions are theoretical models. As such, you should take them for what they are: an abstraction.
These models don't take into account the practical work of finding a peer, or the technical difficulties encountered when trying to establish a connection to a remote peer (for example, NAT traversal and firewall issues). A peer can also simply be down.
The cost of connecting to the next peer is not always 1; it can be much more. So to answer your question: you cannot rely on the selected model alone.
That being said, if the hash results are distributed uniformly, then the variation in performance between peers will be low, unless some peers are particularly hard to reach behind a NAT, a proxy, or a firewall.