Scaling chat log workers horizontally - docker

I've thought about this a lot but can't come up with a solution I'm happy with.
Basicly this is the problem: Log 100k+ Chats (some slower, some faster) into cassandra. So save userId, channelId, timestamp and the message.
Cassandra already supports horizontal scaling out of the box, I have no issue here.
Now my software that reads these chats does it over TCP (IRC). Something like 300 messages / sec are usual for the top 1k channels and 1 single IRC connection can't handle that from my experiments.
What I now want to build is multiple instances (with Docker/Kubernetes) of the logger and share the load between those. So ideally if I have maybe 4 workers and 1k chats (example). They would each join atleast 250 channels. I say atleast because I would want optional redundancy so I can have 2 loggers in the same chat to make sure no messages get lost.
There is no issue with duplicates, because all messages have a unique ID.
Now how would I best and dynamically share the current channels joined between the workers. I wanna avoid having a master or controlling point. Should also be easy to add more workers that then reduce the load on other workers.
Are there any good articles about this kind of behaviour? Maybe good concepts or protocols already defined? Like I said i wanna avoid another central control point so no rabbitmq, redis or whatever.
Edit: I've looked into something like the Raft Consensus Algorithm, but it doesn't make sense I think, since I don't want my clients to agree on a shared state instead divide the state between them "equally".

I think in this case looking for a description of existing algorithm might be not very useful: the problem is not complicated and generic enough to be worth publication.
As described, the problem could be solved by using Cassandra itself as a mediator and to share chat channel assignment information among the workers.
So (trivial part) channels would have IDs and assigned worker ID(s), plus in the optional case of redundancy - required amount of workers (2 or whatever number of workers you want to process this chat). Worker, before assigning itself to a channel would check if there is already enough assignees. If so would continue to the next channel. If not, assign itself to the channel. This is one of the options (alternatively you can have workers holding the channel IDs, but since redundancy is rare this way seems to be simpler). Workers would have a limit of channels they can process and will not try exceeding it by assigning more channels.
Now we only have to deal with the case of assigning too much workers to the same channel, exceeding requirements and exhausting the worker capacity by monitoring all the same channels. Otherwise, if they start all at once, channels might have more assigned workers than needed. Even though it is unlikely will create a real problem in described case (just a bit more redundancy than requested), you can handle that by prioritising workers. Much like employing of school teachers in Canada, BC is done on seniority basis - the most senior gets job first, except that here it'd be voluntarily done by the workers themselves, not by school administration. What this means, is that each worker would have to check all it's assigned channels and, should there be more workers than needed at this time, would check if it has the smallest priority among all the assignees. If it does, it would resign - remove itself and stop processing the channel.
That requires assigning distinct priorities of the workers, which could be easily achieved when spawning them, by simply setting each to a next sequential number (the oldest has the highest priority, or v.v if you concerned of old, potentially dying workers taking up all the load, and would prefer new ones to take on more while still fresh). More elaborately, this could also be done by using Cassandra Lightweight transactions as described in one of the answers here (the one by AlonL). With just a few (you mentioned ~4) workers either way should work and concerns about scaling mentioned in the other answers there isn't a big deal for a few integer priorities. Also, instead of sequential number assignment, requiring the workers to self-assign a random 32-bit integer priority on initialization has virtually no chance of collision, so loop "until no collisions" should exit on the very first iteration (which would make a second iteration very rarely code path requiring an explicit test).
The trick is basically to limit the amount of data requiring synchronisation and putting the load of regulation onto the workers themselves. There is no need for consensus algorithms as there is not much complexity and we are not dealing with huge number of potentially fraudulent workers, trying to get assignments ahead of more senior peers.
The only issue I should mention is that there could be implicit worker rotation if channels go offline which makes worker to stop processing. You will get a different worker assignment next time the channel goes online.

Related

Monitoring a causal cluster's redundancy

I'm on Neo4j 3.1.2. I'm trying to automate monitoring a causal cluster for proper redundancy, preferably over the http interface, dbms.cluster.overview being the most obvious call. But when they die, servers drop off this list without regard to how they exit. The operations manual says there is a difference between clean shutdowns and unclean ones. How do I figure out if a server left cleanly or uncleanly? Is there a procedure to clean up a unclean failure that's never coming back?
In general I would like to know the number of core servers Neo4j is checking for consensus. I don't see an API to find that number. That way I could tell how close to failure we are.
The expected_core_cluster size setting is used when bootstrapping the cluster on first formation. A cluster will not form without the configured amount of cores and this should in general be configured to the full and fixed amount.
This setting is then also used as the minimum consensus group size. The consensus group size (core machines successfully voted into the raft) can shrink and grow dynamically but bounded on the lower end at this number.
The intention is in almost all cases for users to leave this setting alone. If you have 5 machines then you can survive failures down to 3 remaining, e.g. with 2 dead members. The three remaining can still vote another replacement member in successfully up to a total of 6 (2 of which are still dead) and then after this, one of the superfluous dead members will be immediately and automatically voted out (so you are left with 5 members in the consensus group, 1 of which is currently dead). Operationally you can now bring the last machine up by bringing in another replacement or repairing the dead one.
If the intention really is to bring the expected_core_cluster size down to 3 then today you will have to update the setting and do a rolling restart. This is considered an uncommon scenario. Causal clustering optimizes operationally for repair/replace.
The only difference between a clean and an unclean shutdown is that the former leads to a quicker discovery of a core member disappearing, since that is based on a timeout.

Erlang OTP based application - architecture ideas

I'm trying to write an Erlang application (OTP) that would parse a list of users and then launch workers that will work 24X7 to collect user-data (using three different APIs) from remote servers and store it in ets.
What would be the ideal architecture for this kind of application. Do I launch a bunch of workers - one for each user (assuming small number users)? What will happen if number of users increases very rapidly?
Also, to call different APIs I need to put up a Timer mechanism in the worker process.
Any hint will be really appreciated.
Spawning new process for each user is not a such bad idea. There are http servers that do this for each connection, and they doing quite fine.
First of all cost of creating new process is minimal. And cost of maintaining processes is even smaller. If one of the has nothing to do, it won't do anything; there is none (almost) runtime overhead from inactive processes, which in the end means that you are doing only the work you have to do (this is in fact the source of Erlang systems reactivity).
Some issue might be memory usage. Each process has it's own memory stack, and in use-case when they actually do not need to store any internal data, you might be allocating some unnecessary memory. But this also could be modified (even during runtime), and in most cases such memory will be garbage collected.
Actually I would not worry about such things too soon. Issues you might encounter might depend on many things, mostly amount of outside data or user activity, and you can not really design this. Most probably you won't encounter any of them for quite some time. There's no need for premature optimization, especially if you could bind yourself to design that would slow down rest of your development process. In Erlang, with processes being main source of abstraction you can easily swap this process-per-user with pool-of-workers, and ets with external service. But only if you really need it.
What's most important is fact that representing "user" as process would be closest to problem domain. "Users" are independent entities, and deserve separate processes (they have their own state, and they can act or react independent to each other). It is quite similar to using Objects and Classes in other languages (it is over-simplification, but it should get you going).
If you were writing this in Python or C++ would you worry about how many objects you were creating? Only in extreme cases. In Erlang the same general rule applies for processes. Don't worry about how many you are creating.
As for architecture, the only element that is an architectural issue in your question is whether you should design a fixed worker pool or a 1-for-1 worker pool. The shape of the supervision tree would be an outcome of whichever way you choose.
If you are scraping data your real bottleneck isn't going to be how many processes you have, it will be how many network requests you are able to make per second on each API you are trying to access. You will almost certainly get throttled.
(A few months ago I wrote a test demonstration of a very similar system to what you are describing. The limiting factor was API request limits from providers like fb, YouTube, g+, Yahoo, not number of processes.)
As always with Erlang, write some system first, and then benchmark it for real before worrying about performance. You will usually find that performance isn't an issue, and the times that it is you will discover that it is much easier to optimize one small part of an existing system than to design an optimized system from scratch. So just go for it and write something that basically does what you want right now, and worry about optimization tweaks after you have something that basically does what you want. After getting some concrete performance data (memory, request latency, etc.) is the time to start thinking about performance.
Your problem will almost certainly be on the API providers' side or your network latency, not congestion within the Erlang VM.

NServiceBus appropriate for load distribution of periodic tasks

Would NServiceBus or an equivalent ESB be appropriate for an application that has a bunch of different kinds of background maintenance-type tasks? For example:
Scanning databases for the occurence of certain words in user-generated content
Updating database tables that store the results of relatively expensive queries
Creating/maintaining external indexes for content
Sending event notification emails for a scheduled event.
My idea is to employ some kind of task scheduler (either the Windows builtin one, Quartz.NET, or my own database-based solution) to publish different kinds messages onto the bus periodically. The period may be as short as one minute or as long as a days. The reason I want to use the bus is so that I can scale out the number of subscribers as the system becomes larger and busier and the tasks become either more frequent or more resource-intensive. It would also provide redundancy as long as I always have at least two subscribers running.
The obvious alternative to this would be to write my own Windows Service that is triggered by the scheduler and performs the work, but I feel like making that scale beyond a single machine and provide fault tolerance might be more difficult than using the ESB as that plumbing.
Does this sound like a reasonable approach? Alternative suggestions?
TIA
As the author of NServiceBus, I'm quite probably biased, but there is a tradeoff between learning a new technology and writing (possibly a simpler version of) your own. I would recommend considering the longer term maintainance (and documentation) costs of your own solution as compared to one written in house.
In terms of the feature-set you described, NServiceBus does provide facilities for all of that.

Twitter Streaming API recording and Processing using Windows Azure and F#

A month ago I tried to use F# agents to process and record Twitter StreamingAPI Data here. As a little exercise I am trying to transfer the code to Windows Azure.
So far I have two roles:
One worker role (Publisher) that puts messages (a message being the json of a tweet) to a queue.
One worker role (Processor) that reads messages from the queue, decodes the json and dumps the data into a cloud table.
Which leads to lots of questions:
Is it okay to think of a worker role as an agent ?
In practice the message can be larger than 8 KB so I am going to need to use a blob storage and pass as message the reference to the blob (or is there another way?), will that impact performance ?
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster ?
Sorry for pounding all these questions, hope you don't mind,
Thanks a lot!
There is an opensource library named Lokad.Cloud which can process big message transparently, you can check it on http://code.google.com/p/lokad-cloud/
Is it okay to think of a worker role as an agent?
Yes, definitely.
In practice the message can be larger than 8 KB so I am going to need to use a blob storage and pass as message the reference to the blob (or is there another way?), will that impact performance ?
Yes, using the technique you're talking about (saving the JSON to blob storage with a name of "JSONMessage-1" and then sending a message to a queue with contents of "JSONMessage-1") seems to be the standard way of passing messages in Azure that are bigger than 8KB. As you're making 4 calls to Azure storage rather than 2 (1 to get the queue message, 1 to get the blob contents, 1 to delete from the queue, 1 to delete the blob) it will be slower. Will it be noticeably slower? Probably not.
If a good number of messages are going to be smaller than 8KB when Base64 encoded (this is a gotcha in the StorageClient library) then you can put in some logic to determine how to send it.
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster ?
As long as you've written your worker role so that it's self contained and the instances don't get in each others way, then yes, increasing the instance count will increase the through put.
If you're role is mainly just reading and writing to storage, you might benefit by multi-threading the worker role first, before increasing the instance count which will save money.
Is it okay to think of a worker role
as an agent ?
This is the perfect way to think of it. Imagine the workers at McDonald's. Each worker has certain tasks and they communicate with each other via messages (spoken).
In practice the message can be larger
than 8 KB so I am going to need to use
a blob storage and pass as message the
reference to the blob (or is there
another way?), will that impact
performance?
As long as the message is immutable this is the best way to do it. Strings can be very large and thus are allocated to the heap. Since they are immutable passing around references is not an issue.
Is it correct to say that if needed I
can increase the number of instances
of the Processor worker role, and the
queue will magically be processed
faster?
You need to look at what your process is doing and decide if it is IO bound or CPU bound. Typically IO bound processes will have an increase in performance by adding more agents. If you are using the ThreadPool for your agents the work will be balanced quite well even for CPU bound processes but you will hit a limit. That being said don't be afraid to mess around with your architecture and MEASURE the results of each run. This is the best way to balance the amount of agents to use.

Using Erlang, how should I distribute load amongst a cluster?

I was looking at the slave/pool modules and it seems similar to what I
want, but it also seems like I have a single point of failure in my
application (if the master node goes down).
The client has a list of gateways (for the sake of fallback - all do
the same thing) which accept connections, and one is chosen from
randomly by the client. When the client connects all nodes are
examined to see which has the least load and then the IP of the least-
loaded server is forwarded back to the client. The client then
connects to this server and everything is executed there.
In summary, I want all nodes to act as both gateways and to actually
process client requests. The load balancing is only done when the
client initially connects - all of the actual packets and processed on
the client's "home" node.
How would I do this?
I don't know if there is this modules implemented yet but what I can say, load balance is overrated. What I can argue is, random placing of jobs is best bet unless you know far more information how load will come in future and in most of cases you really doesn't. What you wrote:
When the client connects all nodes are examined to see which has the least load and then the IP of the least- loaded server is forwarded back to the client.
How you know that all those least loaded node will not be highest loaded just in next ms? How you know that all those high loaded nodes which you will not include in list will not drop load just in next ms? You really can't know it unless you have very rare case.
Just measure (or compute) your node's performance and set node's probability be chosen depend of it. Choose node randomly regardless of current load. Use this as initial approach. When you set it up, then you can try make up some more sophisticated algorithm. I bet that it will be very hard work to beat this initial approach. Trust me, very hard.
Edit: To be more clear in one subtle detail, I strongly argue that you can't predict future load from current and historical load but you should use knowledge about tasks durations probability and current decomposition of task's lifetime. This work is so hard to try achieve.
The purpose of a supervision tree is to manage the processes not necessarily forward requests. There is no reason you couldn't use different code to send requests directly to members of the list of available processes. See the pool:get_nodes or pool:get_node() functions for one way to get those lists.
You can let the pool module handle the management of the processes (restarting, monitoring, and killing processing) and use some other module to transparently redirect requests to the pool of processes. Maybe you were looking for distributed pools though? It'll be hard to get away from the master process in erlang whithout going to distributed nodes. The whole running system is pretty much one large supervision tree.
I recently remembered the pg module which allows you to setup process groups. messages sent to the group go to every process in the group. It might get you part way toward what you want. you would have to write the code to decide which process handles the request for real but you would get a pool without a master using it.

Resources