Use Sidekiq to keep a cache of results full

Say I have a particularly expensive calculation to perform during a specific user request. The plus side is that this calculation can be performed ahead of time and pushed into a general queue for people to pull from.
Is there a way to use Sidekiq in a Ruby/Rails backend to keep this cache of results full to a certain level? Where would I store the results of this calculation?
e.g.
On server load, calculate 20 sets of results, and cache somewhere.
On user request, pop off a result to allow for immediate server response.
Regenerate one set of results in the background to fill back up to 20 in the queue.
Obviously I may need to use a different number than 20 depending on how long the computation takes and the rate of user requests, but I think you get the idea.

I'm curious to know what kind of calculation actually fits this profile but that's not really important.
Since you are using Sidekiq (or would like to use Sidekiq) it means you have a Redis database. A Redis database is a great place to put this kind of info.
So you can just create a LIST in Redis of your results. During application startup, fire off 20 Sidekiq jobs to create your calculations. The worker doing the calculation can push the result onto the list in Redis.
As you handle requests, just pop a result off the list and queue another sidekiq job to make yourself a new calculation.
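A minimal sketch of that flow, assuming Sidekiq's Redis connection is reused via Sidekiq.redis; the worker, the ExpensiveCalculation class, the list key, and the target size of 20 are all illustrative, not from the original answer.

# app/workers/precompute_worker.rb
# Hypothetical worker: computes one result and pushes it onto a Redis list.
class PrecomputeWorker
  include Sidekiq::Worker

  LIST_KEY    = "precomputed_results" # illustrative key name
  TARGET_SIZE = 20

  def perform
    result = ExpensiveCalculation.new.run # stand-in for your expensive computation
    Sidekiq.redis { |conn| conn.rpush(LIST_KEY, result.to_json) }
  end

  # On boot (e.g. from an initializer), top the list up to TARGET_SIZE.
  def self.fill!
    current = Sidekiq.redis { |conn| conn.llen(LIST_KEY) }
    (TARGET_SIZE - current).times { perform_async }
  end
end

# In the controller handling the user request: pop a ready-made result
# and immediately queue a replacement so the list stays topped up.
class ResultsController < ApplicationController
  def show
    raw = Sidekiq.redis { |conn| conn.lpop(PrecomputeWorker::LIST_KEY) }
    PrecomputeWorker.perform_async
    @result = raw ? JSON.parse(raw) : ExpensiveCalculation.new.run # fall back if the list ran dry
  end
end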

Related

How to optimise computation intensive request response on rails [duplicate]

Marked as a duplicate of: How do I handle long requests for a Rails App so other users are not delayed too much? Closed 6 years ago.
I have an application which does a lot of computation on a few pages (requests). The web interface sends an AJAX request. The computation sometimes takes about 2-5 minutes. The problem is that by that time the AJAX request times out.
We can certainly increase the timeout on the web portal, but that doesn't sound like the right solution. Also, to improve performance, I have:
Removed N+1/Duplicate queries
Implemented Caching
What else could be done here to reduce the calculation time?
Also, if it still takes longer, I was thinking of the following solutions:
Do the computation beforehand and store it in the DB, so when the actual request comes there is no need for calculation. (I'm apprehensive about this approach, since we will have to modify or erase-and-recalculate this data whenever there is an application logic change.)
Load the whole data into the cache when the application starts or the data gets modified. But the computation still has to be done the first time, and we can't keep all the data in the cache when the application starts, so it needs to be cached on demand.
Maybe do something like an Angular promise, which gets fulfilled when the response comes from the server.
Do we have any alternative to do this efficiently?
UPDATE:
Depending on user input, the calculation might take a few seconds, or it might take 2-5 minutes. The scenario is: the user imports an Excel file, which is parsed and saved in the DB (with a background job). Now on another page, the user wants to see a report/analytics graph derived from a few calculations on that imported data. The calculation depends on many factors, so I do not want to save the result in the DB (as noted above). Also, when the user requests the report/graph, it would be a bad experience to tell them that the graph will be shown after some time and that they'll get an email/notification etc.
The extremely typical solution is to enqueue a job for background processing, and return a job ID to the front-end. Your front-end can then poll for completion using that job ID, or you can trigger a notification such as an email to be sent to the user when the job completes.
There are a multitude of gems for this, and it is such a popular and accepted solution that Rails introduced its own ActiveJob for this exact purpose.
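A rough sketch of that pattern with Active Job; the ReportJob class, the cache-based status tracking, and the controller actions are assumptions for illustration, not from the original answer.

# app/jobs/report_job.rb
class ReportJob < ApplicationJob
  queue_as :default

  def perform(import_id)
    result = ReportCalculator.new(import_id).run        # hypothetical heavy calculation
    Rails.cache.write("report_job/#{job_id}", result)   # stash the result under the job id
  end
end

# app/controllers/reports_controller.rb
class ReportsController < ApplicationController
  # POST /reports -> kick off the job, hand the job id back to the front-end
  def create
    job = ReportJob.perform_later(params[:import_id])
    render json: { job_id: job.job_id }, status: :accepted
  end

  # GET /reports/:id/status -> the front-end polls this until the result shows up
  def status
    result = Rails.cache.read("report_job/#{params[:id]}")
    if result
      render json: { status: "done", result: result }
    else
      render json: { status: "pending" }
    end
  end
end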
Here are a few possible solutions:
1. Optimize your tables with indexes to reduce data fetching time.
2. Preload all rows you'll be dealing with at the beginning, so you won't do a query each time you calculate something... it's faster/easier to @things.select { |r| r.blah } than to Thing.where(conditions).
3. Instead of all that, just do the computing in PL/SQL on the database side. Sure, it's not the same as writing Ruby code, but it could be faster.
4. And yes, cache the whole result set in memcache or Redis or something (and expire it when something changes).
5. Run the calculation in the background (crontab?) and store the results in JSON somewhere, or cache the entire HTML file (if you're not localizing or anything).
PS: I'm doing 1,2,3 combined with 5 (caching JSON results into memcache and then pulling the array and formatting/localizing) for a few M records from about 12 tables... sports data mainly.
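For item 4, a minimal Rails.cache sketch (the cache store, key names, and calculator class are illustrative assumptions); the entry is expired whenever the underlying data changes.

# Reading: compute once, then serve from the cache (memcached/Redis via Rails.cache).
def analytics_report(import)
  Rails.cache.fetch("analytics_report/#{import.id}", expires_in: 12.hours) do
    ReportCalculator.new(import).run # hypothetical expensive calculation
  end
end

# Invalidation: bust the cache when the source data changes.
class Import < ApplicationRecord
  after_commit { Rails.cache.delete("analytics_report/#{id}") }
end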

Parse large and multiple fetches for statistics

In my app I want to retrieve a large amount of data from Parse to build a view of statistics. However, in the future, as data builds up, this may be a huge amount.
For example, 10,000 results. Even if I fetched in batches of 1,000 at a time, this would result in 10 fetches. This could rapidly send me over Parse's 30-requests-per-second limit, especially when several other chunks of data need to be fetched at the same time for other stats.
Any recommendations/tips/advice for this scenario?
You will also run into limits with the skip and limit query variables. And heavy weight lifting on a mobile device could also present issues for you.
If you can, you should pre-aggregate these statistics, perhaps once per day, so that you can simply request the pre-computed results directly.
Alternatively, create a Cloud Code function to do some processing for you and return the results. Again, you may well run into limits here, so a Cloud Job may meet your needs better; you may then need to effectively create a request object which is processed by the job, and then poll for completion or send out push notifications on completion.

Background job taking twice the time that the same operation within rails

In my Rails application, I have a long calculation requiring a lot of database access.
To keep it short, my calculation takes 25 seconds.
When implementing the same calculation within a background job (one big single worker), it takes twice as long (i.e. 50 seconds). I have tried several techniques to put the job in a background process, but none made a difference: Delayed Job, Sidekiq, and running the process within my Rails app but in a thread created for the work. All have the same x2 impact on performance.
This performance difference only exists in the Rails 'production' environment. It looks like there is an optimisation done by Rails that is not done in my background job.
My technical environment is the following:
I am using Ruby 2.0 / Rails 4.
I am using Unicorn (but I have the same problem without it).
The job is using Rails.cache to store some partial computation.
I am using PostgreSQL.
Does anybody have a clue where this impact might come from?
I'm assuming you're comparing the background job speed to the speed of running the operation during a web request? If so, you're likely benefiting from Rails's QueryCache, which caches db queries during a web request. Try disabling it like described here:
Disabling Rails SQL query caching globally
If that causes the web request version of the job to take as long as the background job, you've found your culprit. You can then enable the query cache on your background job to speed it up (if it makes sense for your application).
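If that turns out to be the cause, here is a sketch of wrapping the worker's body in ActiveRecord's query cache, mirroring what Rails does for web requests; the worker and calculation class names are made up.

class HeavyCalculationWorker
  include Sidekiq::Worker

  def perform(record_id)
    # Enable ActiveRecord's SQL query cache for the duration of the block,
    # so repeated identical SELECTs hit memory instead of the database.
    ActiveRecord::Base.cache do
      LongCalculation.new(record_id).run # hypothetical calculation hitting the DB repeatedly
    end
  end
end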
A background job is not something you use to speed things up. Its main purpose is 'fire and forget': it removes 25 seconds of synchronous calculation from the request, even if the asynchronous calculation itself takes longer. You can tell the user that their request is being processed and return the result of the calculation later.
You may gain speed from background jobs by splitting a big task into several small ones and running them at the same time. In your case I think that is impossible, because of the dependencies between the operations in your calculation.
So if you want to speed up your calculation, you need to look into denormalizing your data structure: store some calculated values for your big calculation at the moment the source data for that calculation is updated. You will then calculate less on user request and more on data storage, and that is a good place to use a background job. When you finish updating the data, create a background task to update the caches. If a user requests the calculation before that task has finished, they will still need to wait for the cache to fill up.
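A rough sketch of that write-time approach, assuming a source model, a made-up cache-refresh worker, and Rails.cache as the store; all names are illustrative.

# Hypothetical: whenever the source data changes, refresh the precomputed value
# in the background instead of computing it during the user's read request.
class SourceRecord < ApplicationRecord
  after_commit :refresh_calculation_cache

  private

  def refresh_calculation_cache
    RefreshCalculationWorker.perform_async(id)
  end
end

class RefreshCalculationWorker
  include Sidekiq::Worker

  def perform(source_record_id)
    value = BigCalculation.new(source_record_id).run       # expensive part, done off-request
    Rails.cache.write("big_calculation/#{source_record_id}", value)
  end
end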
Update: I think I still need to answer your main question. Basically, the additional time on background task processing comes from the implementation. Because of the 'fire and forget' approach, nobody wants the background task scheduler to consume a lot of processor time just monitoring for new jobs. I am not completely sure, but I think that if your calculation were twice as complex, the time difference would still be about the same 25 seconds.
My guess is that the extra time is coming from the need for your background worker to load rails and all of your application. My clue is that you said the difference was greatest with Rails in production mode. In production mode, subsequent calls to the app make use of the app and class cache.
How to check this hypothesis:
Change your background job to do the following:
1. Print a log message before you initiate the worker.
2. Start the worker.
3. Run your calculation. As part of your calculation startup, print a log message.
4. Print another log message.
5. Run your calculation again.
6. Print another log message.
Then compare the two times for running your calculation.
Of course, you'll also gain some extra time benefits from database caching, code might remain resident in memory, etc. But if the second run is much much faster, then the fact that the second run didn't restart Rails is more significant.
Also, the time between the log messages from steps 1 and 3 will help you understand the start-up time.
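A quick way to time those steps, sketched as a standalone script run from the app root; the file name and ExpensiveCalculation class are placeholders.

# measure_job.rb -- run with: bundle exec ruby measure_job.rb
t0 = Time.now
puts "[#{t0}] booting Rails..."
require_relative "config/environment"   # loads the full app, just like a fresh worker would

t1 = Time.now
puts "[#{t1}] Rails loaded in #{(t1 - t0).round(1)}s, starting first calculation"
ExpensiveCalculation.new.run             # placeholder for the 25-second calculation

t2 = Time.now
puts "[#{t2}] first run took #{(t2 - t1).round(1)}s, starting second calculation"
ExpensiveCalculation.new.run

t3 = Time.now
puts "[#{t3}] second run took #{(t3 - t2).round(1)}s"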
Fixes
Why wait?
Most important: why do you need the results faster? E.g., tell your user that the result will be emailed to them after it is calculated. Or let your user see that the calculation is proceeding in the background and, later, show them the result.
The key for any long running calculation is to do it in the background and encourage the user to not wait for the result. They should be able to do something else until they get the result.
Start the calculation automatically: As soon as the user logs in, or after they do something interesting, start the calculation. That way, when (and if) the user asks for the calculation, the answer will either be done already or will be done soon.
Cache the result and bust the cache as needed: Similar to the above, start the calculation periodically and automatically. If the user changes some data, restart the calculation by busting the cache. There are also ways to halt an ongoing calculation if data changes during the calculation.
Pre-calculate part of the calculation: Why is a DBMS calculation taking 25 seconds or more? It could be that you should change the calculation. Investigate adding indexes, summary tables, de-normalizing, splitting the calculation into smaller steps that can be pre-calculated, etc. A migration sketch follows below.
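Hypothetical migration (Rails 4 syntax; table and column names invented) illustrating the "indexes plus summary table" idea.

class AddCalculationSupports < ActiveRecord::Migration
  def change
    # Index the columns the calculation filters and joins on.
    add_index :calculation_inputs, [:account_id, :category]

    # Summary table holding pre-aggregated values so the request-time
    # calculation only has to combine a handful of rows.
    create_table :calculation_summaries do |t|
      t.references :account, index: true
      t.string  :category
      t.decimal :total, precision: 12, scale: 2
      t.integer :row_count
      t.timestamps
    end
  end
end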

opening and closing streaming clients for specific durations

I'd like to infrequently open a Twitter streaming connection with TweetStream and listen for new statuses for about an hour.
How should I go about opening the connection, keeping it open for an hour, and then closing it gracefully?
Normally for background processes I would use Resque or Sidekiq, but from my understanding those are for completing tasks as quickly as possible, not chilling and keeping a connection open.
I thought about using a global variable like $twitter_client but that wouldn't horizontally scale.
I also thought about building a second application that runs on one box to handle this functionality, but that seems excessive if it can be integrated into the main app somehow.
To clarify, I have no trouble starting a process, capturing tweets, and using them appropriately. I'm just not sure what I should be starting. A new app? A daemon of some sort?
I've never encountered a problem like this, and am completely lost. Any direction would be much appreciated!
Although not a direct fix, this is what I would look at:
Time
You're working with time, so I'd look at what time-centric processes could be used to keep the connection open for an hour.
Specifically, I'd look at running some sort of job on the server, which you could fire at specific times (programmatically if required), to open and close the connection. I only have experience with Resque, but as you say, it's probably not up to the job. If I find any better solutions, I'll certainly update the answer.
Storage
Once you've connected to TweetStream, you'll want to look at how you can capture the tweets for that time period. It seems a waste to create a data table just for the job, so I'd be inclined to use something like Redis to store the tweets that you need.
This can then be used to output the tweets you need, allowing you to effectively store/capture them and then delete them after the hour-long window has passed.
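One way to sketch the combination of the two ideas above is a rake task kicked off by cron. This assumes the TweetStream gem is configured with Twitter credentials in an initializer and the redis gem is available; the track term, Redis key, and one-hour window are placeholders.

# lib/tasks/tweet_capture.rake
# Run from cron (or manually) around the hour you care about:
#   bundle exec rake tweets:capture
namespace :tweets do
  task capture: :environment do
    redis    = Redis.new
    deadline = Time.now + 3600             # keep the connection open for about an hour

    client = TweetStream::Client.new
    client.track("some_search_term") do |status|
      # Keep the captured tweets in a Redis list and let the whole key
      # expire a while after the window closes.
      redis.rpush("captured_tweets", status.text)
      redis.expire("captured_tweets", 2 * 3600)

      client.stop if Time.now >= deadline   # close the stream gracefully after an hour
    end
  end
end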
Delivery
I don't know what context you're using this feature in, so I'll just give you as generic a process idea as possible.
To display the tweets, I'd personally create some sort of record in the DB to show the time you're pinging TweetStream that day (if it changes; if it's constant, just set a constant in an initializer), and then include some logic to try and get the tweets from Redis. If you're able to collect them, show them as you wish; else, don't print anything.
Hope that gives you a broader spectrum of ideas?

Storing Data In Memory: Session vs Cache vs Static

A bit of backstory: I am working on a web application that requires quite a bit of time to prep/crunch data before giving it to the user to edit/manipulate. The data request takes ~15-20 secs to complete and a couple of secs to process. Once there, the user can manipulate values on the fly. Any manipulation of values will require the data to be reprocessed completely.
Update: To avoid confusion, I am only making the data call one time (the 15 sec hit) and then want to keep the results in memory so that I will not have to call it again until the user is 100% done working with it. So, the first pull will take a while, but, using Ajax, I am going to hit the in-memory data to constantly update and keep the response time to around 2 secs or so (I hope).
In order to make this efficient, I am moving the initial data into memory and using Ajax calls back to the server so that I can reduce the processing time needed to handle the recalculation that occurs with this user's updates.
Here is my question: with performance in mind, what would be the best way to store this data, assuming that only one user will be working with this data at any given moment?
Also, the user could potentially be working in this process for a few hours. When the user is working w/ the data, I will need some kind of failsafe to save the user's current data (either in a db or in a serialized binary file) should their session be interrupted in some way. In other words, I will need a solution that has an appropriate hook to allow me to dump out the memory object's data in the case that the user gets disconnected / distracted for too long.
So far, here are my musings:
Session State - Pros: Locked to one user. Has the Session End event, which will meet my failsafe requirements. Cons: Slowest perf of my current options. The Session End event is sometimes tricky to ensure it fires properly.
Caching - Pros: Good perf. Has access to dependencies, which could be a bonus later down the line but isn't really useful in the current scope. Cons: No easy failsafe step other than a write based on time intervals. Global in scope - will have to ensure that users do not collide with each other's work.
Static - Pros: Best perf. Easiest to maintain as I can directly leverage my current class structures. Cons: No easy failsafe step other than a write based on time intervals. Global in scope - will have to ensure that users do not collide with each other's work.
Does anyone have any suggestions/comments on which option I should choose?
Thanks!
Update: Forgot to mention, I am using VB.Net, Asp.Net, and Sql Server 2005 to perform this task.
I'll vote for secret option #4: use the database for this. If you're talking about a 20+ second turnaround time on the data, you are not going to gain anything by trying to do this in-memory, given the limitations of the options you presented. You might as well set this up in the database (give it a table of its own, or even a separate database if the requirements are that large).
I'd go with the caching method for storing the data across any page loads. You can name the cache you want to store the data in to avoid conflicts.
For tracking user-made changes, I'd go with a more old-school approach: append to a text file each time the user makes a change and then sweep that file at intervals to save changes back to the DB. If you name the files based on the user/account or some other session-unique indicator, then there's no issue with conflicts, and the app (or some other support app, which might be a better idea in general) can sweep through all such files and update the DB even if the session is over.
The first part of this can be adjusted to stagger the writes more: save changes to the Session, then write those to a file at intervals, then sweep the file at larger intervals. You can tune it for performance and choose what level of possible user-change loss is acceptable.
Use the Session, but don't rely on it.
Simply, let the user "name" the dataset, and make a point of actively persisting it for the user, either automatically, or through something as simple as a "save" button.
You cannot rely on the session simply because it is (typically) tied to the user's browser instance. If they accidentally close the browser (click the X button, their PC crashes, etc.), then they lose all of their work. Which would be nasty.
Once the user has that kind of control over the "persistent" state of the data, you can rely on the Session to keep it in memory and leverage that as a cache.
I think you've pretty much just answered your question with the pros/cons. But if you are looking for some peer validation, my vote is for the Session. Although the performance is slower (do you know by how much slower?), your processing is going to take a long time regardless. Do you think the user will know the difference between 15 seconds and 17 seconds? Both are "forever" in web terms, so go with the one that seems easiest to implement.
Perhaps a bit off topic, but I'd recommend putting those long processing calls in asynchronous (not to be confused with AJAX's asynchronous) pages.
Take a look at this article and ping me back if it doesn't make sense.
http://msdn.microsoft.com/en-us/magazine/cc163725.aspx
I suggest to create a copy of the data in a new database table (let's call it EDIT) as you send the initial results to the user. If performance is an issue, do this in a background thread.
As the user edits the data, update the table (also in a background thread if performance becomes an issue). If you have to use threads, you must make sure that the first thread is finished before you start updating the rows.
This allows a user to walk away, come back, even restart the browser and commit whenever she feels satisfied with the result.
One possible alternative to what the others mentioned, is to store the data on the client.
Assuming the dataset is not too large and the code that manipulates it can run client side, you could store the data as an XML data island or JSON object. This data could then be manipulated/processed and handled entirely client side with no round trips to the server. If you need to persist this data back to the server, the resulting data could be posted via AJAX or a standard postback.
If this does not work with your requirements I'd go with just storing it on the SQL server as the other comment suggested.
