Extract, transform, load within Rabbit? - erlang

One of the things I do pretty often is transform SQL data into cache and document-based stores, for performance reasons. I don't want my frontend applications hitting my database, so I have high-speed cache solutions, as well as efficient Solr and other solutions.
I use RabbitMQ as the central communication hub to achieve this ETL flow, which looks like this: the backend application sends a message to Rabbit with the new data, or with changes made to existing data. A node.js script then consumes the queue, assembles small batches of data, and populates all the necessary systems: Redis, Mongo, Solr, etc.
However, I'm wondering if there's a better way of doing this. Maybe Rabbit has some kind of scripting support for writing Erlang logic for queues?

However, I'm wondering if there's a better way of doing this. Maybe Rabbit has some kind of scripting support for writing Erlang logic for queues?
It doesn't. It's just a message queueing system.
Personally, I think your current design sounds good.
The only thing I would wonder is whether each of your target systems has a queue of its own. That way, any one of them can go down without affecting the others.
I would probably do something like this:
the back-end produces a data message and sends it through RMQ
RMQ is configured with a fanout exchange and has one bound queue per target system
each system receives the message in its own queue (see the sketch below)
Otherwise, what you have sounds about right to me!
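If you go that route, the setup is just a handful of exchange/queue declarations. Here's a minimal sketch using the RabbitMQ Erlang client (amqp_client) - the exchange and queue names are made up for illustration:

    %% Sketch: fanout topology with one durable queue per target system.
    -module(etl_topology).
    -export([setup/0]).

    -include_lib("amqp_client/include/amqp_client.hrl").

    setup() ->
        {ok, Conn} = amqp_connection:start(#amqp_params_network{host = "localhost"}),
        {ok, Ch}   = amqp_connection:open_channel(Conn),
        %% Fanout: every bound queue receives a copy of each published message.
        #'exchange.declare_ok'{} =
            amqp_channel:call(Ch, #'exchange.declare'{exchange = <<"etl">>,
                                                      type     = <<"fanout">>,
                                                      durable  = true}),
        %% One queue per target system, all bound to the same exchange, so any
        %% one consumer can fall behind or die without affecting the rest.
        [bind(Ch, Q) || Q <- [<<"etl.redis">>, <<"etl.mongo">>, <<"etl.solr">>]],
        ok.

    bind(Ch, Q) ->
        #'queue.declare_ok'{} =
            amqp_channel:call(Ch, #'queue.declare'{queue = Q, durable = true}),
        #'queue.bind_ok'{} =
            amqp_channel:call(Ch, #'queue.bind'{queue = Q, exchange = <<"etl">>}).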

Related

Respond with a large number of objects through a Rails API

I currently have an API for one of my projects, and a service that is responsible for generating export files as CSVs, archiving them, and storing them somewhere in the cloud.
Since my API is written in Rails and my service in plain Ruby, I use the Her gem in the service to interact with the API. But I find my current implementation rather unperformant, since I do a Model.all in my service, which in turn triggers a request that may contain far too many objects in the response.
I am curious on how to improve this whole task. Here's what I've thought of:
implement pagination at the API level and call Model.where(page: xxx) from my service;
generate the actual CSV at the API level and send it back to the service (this may be done synchronously or asynchronously).
If I were to use the first approach, how many objects should I retrieve per page? How big should a response be?
If I were to use the second approach, this would bring quite an overhead to the request (and I guess API requests shouldn't take that long) and I also wonder whether it's really the API's job to do this.
What approach should I follow? Or, is there something better that I'm missing?
You need to pass a lot of information through a Ruby process; that's never simple. I don't think you're missing anything here.
If you decide to generate CSVs at the API level, then what do you gain by maintaining the service? You could just ditch the service altogether, because replacing it with an nginx proxy would do the same thing better (if you're just streaming the response from the API host).
If you decide to paginate, there will be a performance cost for sure, but nobody can tell you exactly how much you should paginate: bigger pages will be faster and consume more memory (reducing throughput by letting you run fewer workers); smaller pages will be slower and consume less memory, but demand more workers because of IO wait times.
The exact numbers will depend on the IO response times of your API app, of the cloud, and of your infrastructure. I'm afraid no one can give you a simple answer to follow without a stress test, and once you set up a stress test, you will get a number of your own anyway - better than anybody's estimate.
A suggestion: write a bit more about your problem, the constraints you are working under, etc., and maybe someone can help you with a more radical solution. For some reason I get the feeling that what you're really looking for is a background processor like Sidekiq or Delayed Job; or maybe connecting your service to the DB directly through a DB view, if you are anxious to decouple your apps; or an nginx proxy for API responses; or nothing at all... but I really can't tell without more information.
I think it really depends on how you want to define 'performance' and what the goal of your API is. If you want to make sure no request to your API takes longer than 20 ms to respond, then adding pagination would be a reasonable approach - especially if the CSV generation is just an edge case and the API is really built for other services. The number of items per page would then be limited by the speed at which you can deliver them. Your service would not be particularly more performant (rather less so), since it needs to call the API multiple times.
Creating an async call (maybe with a webhook as a callback) would be worth adding to your API if you think dumping the whole record set is a valid use case for its services.
Having said that, I think strictly speaking it is the job of the API to be quick and responsive, so maybe try to figure out how caching can improve response times, making it reasonable to page through all the records. On the other hand, it is the job of the service to be mindful of the number of calls it makes to the API, so maybe store old records locally and only poll for updates instead of dumping the whole set of records each time.

Erlang supervisor processes

I have been learning Erlang intensively, and after finishing 'Programming Erlang' by Joe Armstrong, there is one thing that I keep coming back to.
In my mind, a supervisor spawns one process per child handler. So each declared gen_server-type handler will run as a separate process.
What happens if you are building a tiny web server and you want each request to be its own process? Do you still conform to OTP principles and use a gen_server somehow (how?), or do you create your own behaviour?
How does Cowboy handle this, for example? Does it still use gen_server?
tl;dr: I find that trying to figure out the "correct" supervision structure at the beginning of a project is a form of premature optimization.
The "right" way to design your supervision tree depends on what the worker parts of your application are doing. In the case of a web server I would probably first explore something along the lines of:
top supervisor (singular)
data service supervisor (one per service type)
worker pool (all workers under the service sup)
client connection supervisor (one)
connection worker pool (or one per connection, have to play with it to decide)
logical supervisor (as appropriate -- massive variance here, depending on problem domain)
workers or supervisors (as appropriate -- have to explore/know the problem domain to have any idea how this should be structured)
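To make the shape concrete, here is a purely illustrative skeleton of the top level - none of the child modules exist, they just stand in for the supervisors named in the outline:

    -module(top_sup).
    -behaviour(supervisor).

    -export([start_link/0]).
    -export([init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
        %% Each child is itself a supervisor over its own pool of workers.
        Children =
            [#{id    => data_service_sup,
               start => {data_service_sup, start_link, []},
               type  => supervisor},
             #{id    => client_conn_sup,
               start => {client_conn_sup, start_link, []},
               type  => supervisor},
             #{id    => logic_sup,
               start => {logic_sup, start_link, []},
               type  => supervisor}],
        {ok, {SupFlags, Children}}.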
So that's several workers per supervisor type at the lower level. I haven't used Cowboy, so I don't know how it is organized. The point I'm trying to make is that while the mechanics of handling data services and serving web pages are relatively trivial, the part of the system that actually does the core problem-solving work might not be - and that is going to dictate everything interesting about the system.
It is a bad thing to have your problem-solving bits mixed into the same module as your web-displaying or connection-handling bits. Ideally, you should be able to use the same logic units in a native application, a web application, and a networked service without any changes.
Ultimately, the answer to whether you should have supervisors to workers 1:1 or 1:n depends on what you're doing and on which restart strategy gives you the best balance among recovery to a known consistent state, latency felt by the user, and resource usage.
One of my favorite things about Erlang is that I can start with a naive supervisor structure like the one above, play with it until I see where it's not so good, and rather easily switch things around and experiment with alternatives without fundamentally altering my system much. (The same goes for playing with alternative data representations, if you write proper abstractions around them.) So first, get something that works in testing. Then load it up and see if you can break it. Then start worrying about the details, after you understand where the problems actually are.
It is a common pattern in Erlang to spawn one server per client. You then use a supervisor with the simple_one_for_one strategy for the child servers, which allows you to ask the supervisor to start a child on demand. Generally this is used when you don't know how many processes you will need, and when the processes are independent (a crash of one process should not impact the others). A minimal example is sketched below.
There is very good information on learnyousomeerlang.com (the LYSE supervisor chapter); the whole site is worth reading.
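Here's what that supervisor might look like - just a sketch, with all module names invented:

    -module(client_sup).
    -behaviour(supervisor).

    -export([start_link/0, start_client/1]).
    -export([init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    %% Called once per connecting client; the argument list is appended
    %% to the start args in the child spec below.
    start_client(Socket) ->
        supervisor:start_child(?MODULE, [Socket]).

    init([]) ->
        SupFlags = #{strategy => simple_one_for_one,
                     intensity => 5,
                     period => 10},
        %% Every child is an instance of the same module; 'temporary' means
        %% a crashed client process is simply dropped, not restarted.
        ChildSpec = #{id => chat_client,
                      start => {chat_client, start_link, []},
                      restart => temporary},
        {ok, {SupFlags, [ChildSpec]}}.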

Why do we need simple_one_for_one?

Somebody told me that simple_one_for_one is very useful for chat applications, because each chat client is a server process (gen_server). Is this right?
And I wonder why we need it. Why not create just one central server (gen_server) to handle all chat client communication? Is it because the number of chat clients may be so large that a single server couldn't handle them fast enough and would slow the system down?
I think maybe creating as many servers as simple_one_for_one does may take too much system resource. I'm new to OTP, so I really need an explanation on this point.
Yes, the idea is that you would have a process (gen_server) per client.
This lets you isolate failure of one client from another.
If you had everyone in a single process, you would have to be very careful to handle all the things that might go wrong and crash your process (thus disconnecting all your clients).
With one process per client, you can code for the happy path and just let it crash when things go wrong. The worst case is that you drop a single client.
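To make that concrete, the per-client process can be an ordinary gen_server - just a sketch, with the module name, message shapes, and use of gen_tcp all invented for illustration:

    -module(chat_client).
    -behaviour(gen_server).

    -export([start_link/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    start_link(Socket) ->
        gen_server:start_link(?MODULE, Socket, []).

    init(Socket) ->
        %% One process per connected client; the state is just that client's
        %% socket (assumed to be in active mode and owned by this process).
        {ok, Socket}.

    handle_cast({send, Msg}, Socket) ->
        %% Happy path only: if the send fails, this process crashes and
        %% only this one client is dropped.
        ok = gen_tcp:send(Socket, Msg),
        {noreply, Socket}.

    handle_info({tcp, Socket, Data}, Socket) ->
        %% Incoming data from this client only; real logic would go here.
        io:format("got ~p~n", [Data]),
        {noreply, Socket};
    handle_info({tcp_closed, _Socket}, Socket) ->
        {stop, normal, Socket}.

    handle_call(_Request, _From, Socket) ->
        {reply, ok, Socket}.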
Processes are fairly cheap (nothing like creating threads); on a modern machine you can have millions of them.
If your user base is in the many millions, I'm sure you'd end up with more than one server anyway. So something that can easily scale to the hundreds of thousands or low millions of clients on one box is plenty.

How is request processing with rails, redis, and node.js asynchronous?

For web development I'd like to mix Rails and node.js, since I want to get the best of both worlds (Rails for fast web development and Node for concurrency). I know that some people choose to just use the full Ruby stack with EventMachine integrated into the Rails controllers, so that every request can be non-blocking by using fibers in an event-loop model. I have been able to understand how that works in the big picture.
At this moment, however, I want to try non-blocking request processing with Rails and node.js using a message queue. I have heard that this can be achieved by using Redis as an intermediary, but I'm still having trouble figuring out how it works. From what I can understand: we have two apps, A (Rails) and B (node.js), plus Redis. The Rails app handles requests from users through controllers in a REST manner, then passes the data on to Redis, which forms queues; the node.js app picks items off the queue and does whatever is necessary afterwards (writing to or reading from the backend DB).
My questions:
1. So how would that improve concurrency and scalability? From what I know, since Rails handles the requests through controllers synchronously and then writes to Redis, the requests will still be blocking, even though the node.js end can pick up the queue asynchronously. (I have a feeling that it's not asynchronous if it's not non-blocking end to end.)
2. Would node.js be considered a proxy or an application here, if Redis is the intermediary?
3. I'm new to Redis and still learning it. If I'm using a 100% NoSQL solution for my backend database, such as MongoDB or CouchDB, can Redis replace them entirely, or is Redis more of a message queue tool like RabbitMQ?
4. Is a message queue a different concurrency concept from threading or the event-loop model, or is it supposed to supplement them?
That's all my questions. I'm new to the message queue concept. I will appreciate any help, pointers in the right direction, and articles that help me learn more. Thanks.
You are mixing some things here that don't go together.
Let's first make sure we are on the same page regarding the strengths/weaknesses of the technologies involved.
Rails: Used for its web-development simplicity; perfect for serving database-backed web applications.
Not very performant when it has to serve a large number of long-running requests, as you'd run out of threads on your Ruby workers - but well suited for anything that can scale horizontally with more web nodes (multiple web servers - one DB).
Node.js: Great for high-concurrency scenarios. Not as easy as Rails for writing a regular web application, but it can handle a near-insane number of long-running, low-CPU tasks efficiently.
Redis: A key-value store that supports operations on its data structures (incrementing/decrementing values, appending/prepending and pushing/popping on lists - all operations that make this DB work consistently with multiple clients writing at once).
Now, as you can see, there is no benefit in having Rails AND Node serve the same request while communicating through Redis. Going through the Rails stack would not provide any benefit if the request ends up being handled by the Node server.
And even if you only offload some processing to the Node server, it's still the Rails web server that handles the request and has to wait for a response from Node - killing the desired scalability. It simply makes no sense.
Where you would use a setup with Node and Rails together is in certain areas of your app that have drastically different scaling requirements.
If you are, for example, writing a website that displays live stats for football games, you can easily see that there are two different concerns in your app: the "normal" part of the site, containing signup, billing, and profile stuff, which screams for a quick implementation in Rails; and the "live" portion of the site, where users see live results and you expect to handle a lot of clients at once - all waiting for something to happen (low CPU - high concurrency).
In such a case it may be beneficial to actually separate the two parts of the site into a Ruby app and a Node app, sharing data about the user through a store like Redis (really, you just need some shared state that both can look at and write to for synchronization purposes).
So you would, for example, use Rails for the signup/login portions; once a user has signed up, write the session cookie into Redis alongside the user's permissions (which game is he allowed to follow), and hand the user off to the Node.js app.
There, the Node app can read the session information from Redis and serve the user.
Word of advice:
You don't get scalability by simply throwing Node.js into your toolbox. You really have to find out what Node.js is good at (low-CPU, high-IO concurrent operations) and how you can leverage that to remedy some of the problems your currently chosen technology has.
I can answer 3 for you. Redis does not guarantee that when you perform an operation the result will actually be on disk, and its transaction handling is a bit "different". It also requires the whole database to fit in memory. Depending on the situation this may or may not be an issue. It is, however, incredibly fast. It is not a messaging queue: you can easily make a queue out of it, but that is not its purpose. If you only want a queuing system, you can probably do better with something else.

NServiceBus appropriate for load distribution of periodic tasks

Would NServiceBus or an equivalent ESB be appropriate for an application that has a bunch of different kinds of background maintenance-type tasks? For example:
Scanning databases for the occurrence of certain words in user-generated content
Updating database tables that store the results of relatively expensive queries
Creating/maintaining external indexes for content
Sending event notification emails for a scheduled event.
My idea is to employ some kind of task scheduler (either the Windows built-in one, Quartz.NET, or my own database-based solution) to publish different kinds of messages onto the bus periodically. The period may be as short as one minute or as long as a day. The reason I want to use the bus is so that I can scale out the number of subscribers as the system becomes larger and busier and the tasks become either more frequent or more resource-intensive. It would also provide redundancy, as long as I always have at least two subscribers running.
The obvious alternative would be to write my own Windows service that is triggered by the scheduler and performs the work, but I feel that making it scale beyond a single machine and provide fault tolerance might be more difficult than using the ESB as that plumbing.
Does this sound like a reasonable approach? Alternative suggestions?
TIA
As the author of NServiceBus, I'm quite probably biased, but there is a tradeoff between learning a new technology and writing (possibly a simpler version of) your own. I would recommend weighing the longer-term maintenance (and documentation) costs of a solution written in house against those of adopting an existing one.
In terms of the feature set you described, NServiceBus does provide facilities for all of that.
