Respond with a large number of objects through a Rails API

I currently have an API for one of my projects, plus a service that is responsible for generating CSV export files, archiving them, and storing them somewhere in the cloud.
Since my API is written in Rails and my service in plain Ruby, I use the Her gem in the service to interact with the API. But my current implementation doesn't perform well: I do a Model.all in my service, which triggers a single request whose response may contain far too many objects.
I am curious about how to improve this whole task. Here's what I've thought of:
implement pagination at API level and call Model.where(page: xxx) from my service (sketched right after this list);
generate the actual CSV at API level and send the CSV back to the service (this may be done sync or async).
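For the first option, the service-side loop might look roughly like this. It assumes the API accepts page and per_page params (which it doesn't yet) and returns an empty collection past the last page; 500 is an arbitrary starting point:

    require "csv"

    PAGE_SIZE = 500 # tune this with real measurements

    CSV.open("export.csv", "w") do |csv|
      csv << %w[id name created_at] # assumed columns
      page = 1
      loop do
        batch = Model.where(page: page, per_page: PAGE_SIZE).to_a
        break if batch.empty?
        batch.each { |record| csv << [record.id, record.name, record.created_at] }
        page += 1
      end
    end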
If I were to use the first approach, how many objects should I retrieve per page? How big should a response be?
If I were to use the second approach, it would add considerable overhead to the request (and I guess API requests shouldn't take that long), and I also wonder whether it's really the API's job to do this.
What approach should I follow? Or, is there something better that I'm missing?

You need to pass a lot of information through a Ruby process; that's never simple, and I don't think you're missing anything here.
If you decide to generate CSVs at the API level, then what do you gain by maintaining the service? You could ditch the service altogether, because an nginx proxy would do the same thing better (if you're just streaming the response from the API host).
If you decide to paginate, there will be a performance cost for sure, but nobody can tell you exactly how big your pages should be. Bigger pages will be faster but consume more memory (reducing throughput by letting you run fewer workers); smaller pages will be slower and consume less memory, but demand more workers because of IO wait times.
The exact numbers will depend on the IO response times of your API app, your cloud storage, and your infrastructure. I'm afraid no one can give you a simple answer without experimentation, and once you set up a stress test, you will get a number of your own anyway - better than anybody's estimate.
A suggestion: write a bit more about your problem, the constraints you are working under, etc., and maybe someone can help you with a more radical solution. For some reason I get the feeling that what you're really looking for is a background processor like Sidekiq or Delayed Job, or maybe connecting your service to the DB directly through a DB view if you are anxious to keep your apps decoupled, or an nginx proxy for API responses, or nothing at all... but I really can't tell without more information.
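If it does turn out to be a background-processing problem, a Sidekiq job on the API side (where Model is a real ActiveRecord model) could look roughly like this; CloudStorage and ExportMailer are hypothetical stand-ins for your storage and notification code:

    require "csv"
    require "sidekiq"

    class ExportJob
      include Sidekiq::Worker

      def perform(user_id)
        csv = CSV.generate do |out|
          out << %w[id name created_at]
          # find_each streams records from the DB in batches
          Model.find_each { |record| out << [record.id, record.name, record.created_at] }
        end
        CloudStorage.upload("export-#{user_id}.csv", csv) # hypothetical storage call
        ExportMailer.ready(user_id).deliver_later          # hypothetical notification
      end
    end

The controller then only does ExportJob.perform_async(current_user.id) and returns immediately.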

I think it really depends on how you want to define 'performance' and what the goal of your API is. If you want to make sure no request to your API takes longer than 20 ms to respond, then adding pagination would be a reasonable approach, especially if the CSV generation is just an edge case and the API is really built for other services. The number of items per page would then be limited by the speed at which you can deliver them. Your service would not become more performant (rather less so), since it needs to call the API multiple times.
Creating an async call (maybe with a webhook as callback) would be worth adding to your API if you think dumping the whole record set is a valid use case for services.
Having said that, I think strictly speaking it is the job of the API to be quick and responsive, so maybe try to figure out how caching can improve response times enough that paging through all the records is reasonable. On the other hand, it is the job of the service to be mindful of the number of calls it makes to the API, so maybe store old records locally and only poll for updates instead of dumping the whole set of records each time.
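A rough sketch of that incremental polling, assuming the API grows an updated_since filter (it has no such thing today) and that local_store is whatever local cache the service keeps:

    require "time"

    # A file is the simplest possible place to remember the last sync time.
    last_sync = File.exist?(".last_sync") ? File.read(".last_sync") : nil

    Model.where(updated_since: last_sync).to_a.each do |record|
      local_store.upsert(record)
    end

    File.write(".last_sync", Time.now.utc.iso8601)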

Related

Extract, transform, load within Rabbit?

One of the things that I do pretty often is transform SQL data into cache and document-based stores, for performance reasons. I don't want my frontend applications hitting my database, so I have high-speed cache solutions, as well as efficient Solr and other systems.
I use RabbitMQ as the central communication hub to achieve this ETL flow, which looks like this: the backend application sends a message to Rabbit with the new data, or with changes made to existing data. I then have a node.js script that consumes the queue, makes small batches of data, and populates all the necessary systems: Redis, Mongo, Solr, etc.
However, I'm wondering if there's a better way of doing this. Maybe Rabbit has some kind of scripting support for creating Erlang logic for queues?
It doesn't; it's just a message queueing system.
Personally, I think your current design sounds good.
The only thing I would wonder is whether or not each of your target systems has a queue of its own. That way, any one of them can go down without affecting the others.
I would probably do something like this:
the back-end produces a data message and sends it through RMQ
RMQ is configured with a fanout exchange, and has one bound queue per target system
each system receives the message in its own queue
Otherwise, what you have sounds about right to me!
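For illustration, here is roughly what that layout looks like when set up with the Bunny client; the exchange and queue names are made up:

    require "bunny"
    require "json"

    conn = Bunny.new
    conn.start
    channel = conn.create_channel

    # One fanout exchange: every bound queue receives a copy of each message.
    exchange = channel.fanout("etl.data")

    # One durable queue per target system, so any consumer can go down
    # without affecting the others.
    %w[redis mongo solr].each do |target|
      channel.queue("etl.#{target}", durable: true).bind(exchange)
    end

    # The back-end publishes once; all three queues get the message.
    exchange.publish({ id: 42, title: "example" }.to_json)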

How is request processing with Rails, Redis, and node.js asynchronous?

For web development I'd like to mix Rails and node.js, since I want to get the best of both worlds (Rails for fast web development and Node for concurrency). I know that some people choose to just use a full Ruby stack with EventMachine integrated into the Rails controllers, so that every request can be non-blocking by using fibers in an event-loop model. I have been able to understand how that works in the big picture.
At this moment, however, I want to try doing non-blocking request processing with Rails and node.js using a message queue. I heard that this can be achieved by using Redis as an intermediary. I'm still having trouble figuring out how that works. From what I can understand, we have two apps, A (Rails) and B (node.js), plus Redis. The Rails app handles requests from users through controllers in a REST manner; from there, Rails passes them to Redis, Redis forms queues, and the node.js app picks up those queues and does whatever is necessary afterward (writing to or reading from the backend DB).
My questions:
So how would that improve concurrency and scalability? From what I know, since Rails handles the requests through controllers synchronously and then writes to Redis, the requests will still be blocking, even though the node.js end can pick up the queue asynchronously. (I have a feeling it's not really asynchronous if it's not non-blocking end to end.)
Would node.js be considered a proxy or an application here if Redis is the intermediary?
I'm new to Redis and still learning it. If I'm using a 100% NoSQL solution for my backend database, such as MongoDB or CouchDB, can Redis replace them entirely, or is Redis seen more as a message-queuing tool like RabbitMQ?
Is message queuing a different concurrency concept from threading or the event-loop model, or is it supposed to supplement them?
Those are all my questions. I'm new to the message queue concept, and I will appreciate any help, pointers in the right direction, and articles that help me learn more. Thanks.
You are mixing some things here that don't go together.
Let's first make sure we are on the same page regarding the strengths and weaknesses of the technologies involved:
Rails: Used for its web-development simplicity, and perfect for serving database-backed web applications. Not very performant when it has to serve a large number of long-running requests, as you'd run out of threads on your Ruby workers - but well suited for anything that can scale horizontally with more web nodes (multiple web servers, one DB).
Node.js: Great for high-concurrency scenarios. Not as easy as Rails for writing a regular web application, but it can handle an almost insane number of long-running, low-CPU tasks efficiently.
Redis: A key-value store that supports operations on its data structures (increment/decrement values, append/prepend and push/pop on lists - all operations that make this DB work consistently with multiple clients writing at once).
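To make the Redis point concrete, a few of those atomic operations through the redis-rb client:

    require "redis"

    redis = Redis.new

    redis.incr("pageviews")          # atomic increment
    redis.decr("stock:item42")       # atomic decrement
    redis.rpush("jobs", "payload-1") # append to a list
    redis.lpush("jobs", "payload-0") # prepend to a list
    redis.lpop("jobs")               # => "payload-0"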
Now, as you can see, there is no benefit in having Rails AND Node serve the same request, communicating through Redis. Going through the Rails stack would not provide any benefit if the request ends up being handled by the Node server.
And even if you only offload some processing to the Node server, it's still the Rails web server that handles the request and has to wait for a response from Node - killing the desired scalability. It simply makes no sense.
Where a setup with Node and Rails together does make sense is in certain areas of your app that have drastically different scaling requirements.
If, for example, you are writing a website that displays live stats for football games, you can easily see that there are two different concerns in your app: the "normal" site that contains signup, billing, and profile stuff, which screams for a quick implementation through Rails; and the "live" portion of the site, where users see live results and you expect to handle a lot of clients at once - all waiting for something to happen (low CPU, high concurrency).
In such a case it may be beneficial to actually separate the two parts of the site into a Ruby app and a Node app, sharing data about the user through a store like Redis (really you just need some shared state that both can read and write for synchronization purposes).
So you would, for example, use Rails for the signup/login portions - once the user has signed up, write the session cookie into Redis alongside the user's permissions (which games they are allowed to follow) and hand the user off to the Node.js app.
There the Node app can read the session information from Redis and serve the user.
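A rough sketch of the Rails side of that handoff; the key names and the user object are illustrative only:

    require "redis"
    require "json"
    require "securerandom"

    redis = Redis.new

    # After a successful login on the Rails side:
    session_id = SecureRandom.hex(16)
    redis.setex("session:#{session_id}", 3600, # expire after an hour
                { user_id: user.id, allowed_games: user.allowed_game_ids }.to_json)

    # The session_id goes into a cookie; the Node.js app reads the same
    # "session:<id>" key from Redis to authorize the user.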
Word of advice:
You don't get scalability by simply throwing Node.js into your toolbox. You really have to find out what Node.js is good at (low-CPU, high-IO concurrent operations) and how you can leverage that to remedy some of the problems of your currently chosen technology.
I can answer question 3 for you. Redis does not guarantee that when you perform an operation the result will actually be on disk, and its transaction handling is a bit "different". It also requires the whole database to fit in memory. Depending on the situation, this may or may not be an issue. It is, however, incredibly fast. It is not a message queue; you can easily make a queue out of it, but that is not its purpose. If all you want is a queuing system, you can probably do better with something else.
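For completeness, this is roughly what a homegrown queue on top of a Redis list looks like; note the lack of delivery guarantees, which is exactly why a dedicated broker may serve you better:

    require "redis"

    redis = Redis.new

    # Producer
    redis.lpush("work", "job-1")

    # Consumer: BRPOP blocks until an item is available
    _list, job = redis.brpop("work")
    puts "processing #{job}"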

Best way to run Rails with long delays

I'm writing a Rails web service that interacts with various pieces of hardware scattered throughout the country.
When a call is made to the web service, the Rails app then attempts to contact the appropriate piece of hardware, get the needed information, and reply to the web client. The time between the client's call and the reply may be up to 10 seconds, depending upon lots of factors.
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another API call to get the actual results).
I basically see two options: either run JRuby and use multithreading, or run several regular Ruby instances and hope that not many people try to use the service at once. JRuby seems like the much better solution, but it still doesn't seem to be mainstream or to have out-of-the-box support at Heroku and Engine Yard. The multiple-instance solution seems like a total kludge.
1) Am I right about my two options? Is there a better one I'm missing?
2) Is there an easy deployment option for JRuby?
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another API call to get the actual results).
From an engineering perspective, this seems like it would be the best alternative.
Why don't you want to do it?
There's a third option: If you host your Rails app with Passenger and enable global queueing, you can do this transparently. I have some actions that take several minutes, with no issues (caveat: some browsers may time out, but that may not be a concern for you).
If you're worried about browser timeout, or you cannot control the deployment environment, you may want to process it in the background:
User requests data
You enter request into a queue
Your web service returns a "ticket" identifier to check the progress
A background process processes the jobs in the queue
The user polls back, referencing the "ticket" id
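A sketch of that flow using Delayed Job; Ticket, HardwareQuery, and HardwareClient are hypothetical names, not a prescription:

    # Steps 1-3: enqueue the hardware request and hand back a ticket id.
    class TicketsController < ApplicationController
      def create
        ticket = Ticket.create!(status: "pending")
        Delayed::Job.enqueue HardwareQuery.new(ticket.id)
        render json: { ticket_id: ticket.id }
      end

      # Step 5: the client polls with the ticket id.
      def show
        ticket = Ticket.find(params[:id])
        render json: { status: ticket.status, result: ticket.result }
      end
    end

    # Step 4: the background worker does the slow hardware call.
    HardwareQuery = Struct.new(:ticket_id) do
      def perform
        ticket = Ticket.find(ticket_id)
        ticket.update!(status: "done", result: HardwareClient.fetch(ticket))
      end
    end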
As far as hosting JRuby goes, I've deployed a couple of small internal applications using the glassfish gem, but I'm not sure how much I would trust it for customer-facing apps. Just make sure you run config.threadsafe! in production.rb. I've heard good things about Trinidad, too.
You can also run the web service call in a delayed background job so that it's not hogging a web server, and it can even be run on a separate physical box. This is also a much more scalable approach. If you make the web call using AJAX, you can ping the server every second or two to see if your results are ready; that way your client is not held in limbo while the results are being calculated, and the request does not time out.

Storing a global hash in Ruby on Rails

I am looking at an application where I need to build out the user friend model as a graph structure. I need to go several degrees deep, so using standard SQL in MySQL will not work due to the circular references. I have looked at the available graph algorithms, and they involve loading the entire record set into a Graph object and then operating on that. I can't afford to do this for every operation.
I would like to store the Graph object as a global object in memory and just make calls and updates to it. However, since Rails scales by creating separate processes, I am going to have an almost immediate synchronization problem, because a single Rails process will only scale to a few simultaneous users.
Does anyone know a way to store an object in memory in Rails and keep it in sync both between requests and across the multiple Mongrel/whatever processes?
At this point I am looking at a Java service for the graph operations, since it scales using a thread model instead of a process model. It can scale big enough that I won't have to deal with the scaling issue for a while.
I would like an all-Rails solution, though, because it would be easier to build and maintain.
One option would be to build a small Rack application that you load the data into. The logic for querying and computing your data lives in that Rack app. You can use normal HTTP calls to localhost to transmit whatever you need (generated HTML? something else?) from the Rack app to your Rails app.
This is basically a workaround for having multiple processes for your Rails application. I'm sure there are better solutions out there, such as memcached, NoSQL databases, and so on.
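A minimal sketch of that idea; FriendGraph is a stand-in for whatever graph structure you build, and the handler call assumes an older Rack version:

    require "rack"
    require "json"

    # Built once at boot; the graph lives in this single process.
    GRAPH = FriendGraph.load_from_db

    app = lambda do |env|
      request = Rack::Request.new(env)
      friends = GRAPH.friends_within(request.params["user_id"], degrees: 3)
      [200, { "Content-Type" => "application/json" }, [friends.to_json]]
    end

    # Your Rails app then makes plain HTTP calls to localhost:9292.
    Rack::Handler::WEBrick.run(app, Port: 9292)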
Redis is the tool you're looking for.
It sounds like you may need a distributed hash table, or maybe something like CouchDB as an alternative to an RDBMS.

Is it possible to have a stateless timed function?

I'm trying to set a reminder in a system to fire at a certain time.
This is a web-based app, so it's not like it will be in memory all the time.
Ideally I'd like to avoid using a service or job on the server (mainly out of curiosity, to see if there is a more efficient way to do it).
For example, imagine how many eBay auctions are ending at any given moment, with emails being sent out seemingly right on time.
Do people reckon there is just a big loop going over and over, moving items into a queue, etc.? Or is there something lower-level helping out (stored procedures, triggers, etc.)?
Thanks everyone.
What you have to realize about eBay - and most large database-backed websites - is that the interactions between humans and the database that come through the web server are only a part (sometimes a very small part) of the functionality of the system.
To use eBay as an example, the email that goes out when auctions expire is not handled by a web server. They are far more likely to have that scripted. In other words, there is another program running on a number of their systems that looks at the database for ended auctions, does some processing on them, sends emails, etc.
If I were doing something similar (albeit on a much smaller scale), I'd build my web services in the usual way, but have a job that runs automatically every few minutes to do the maintenance work. It would start up, look at the database for work, process anything that was required, then exit.
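For example (the model and mailer names are made up), such a job could be a plain class invoked from cron via rails runner:

    # Invoked from cron, e.g.:
    #   */5 * * * * cd /app && bin/rails runner ExpiredAuctionSweep.run
    class ExpiredAuctionSweep
      def self.run
        # Find finished, un-notified auctions, notify, and mark them done.
        Auction.where("ends_at <= ? AND notified = ?", Time.now, false).find_each do |auction|
          AuctionMailer.ended(auction).deliver_now # hypothetical mailer
          auction.update!(notified: true)
        end
      end
    end

The job holds no state between runs; everything it needs to know lives in the database, which is what makes the "stateless" part work.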
