Rails process batches of elements within an array concurrently - ruby-on-rails

First time learning about concurrency and threading within Rails, so any advice is very appreciated.
I currently have an array of 50 strings. I have an 3rd party API call that takes in the string and returns a numeric value. Right now I am simply calling the API on each string one at a time, which takes a really long time.
After looking at a few SO like this one, this other one and finally this one, it seems like I have to use some sort of threading to achieve what I want to do. My plan is to break down the array into batches of ten strings, and then run 5 API calls on each array of ten strings concurrently in hopes that it will drastically reduce the time.
I've never done threading of any kind with rails before, so I just wondering if I am on the right track following the third SO post above, or if I should use other techniques that may be better for my need.

The approach you take will depend on your use case. Do you need to wait for all the calls to be made to do something with the result? Can it be asynchronous?
If you are looking into threads to distribute the work then the third SO post you mentioned is a good way to do it.
If your use case permits the process to be async, I'd definitely look into a scheduler, as mentioned in the first SO post. I've use DelayedJob for this goal, there are some other alternatives.
On a related topic, I usually implement a micro-service that receives those requests and processes them async instead of having DelayedJob in the same app, but is just a matter of preference.
Something REALLY important to have in mind if you go with the async approach is that if you are accessing ActiveRecord records inside a thread you need to explicitly check out the database connection. Rails only handles the check in/out of connections in the main thread. Be really careful on this since it can cause connection leaks really hard to track.
The first answer on this SO post shows how to ensure the db connection to be released.
Hope that helps.

Related

How to optimise computation intensive request response on rails [duplicate]

This question already has answers here:
How do I handle long requests for a Rails App so other users are not delayed too much?
(3 answers)
Closed 6 years ago.
I have an application, which does a lot of computation on few pages(requests). The web interface sends an AJAX request. The computation takes sometimes about 2-5 minutes. The problem is, by this time AJAX request times out.
We can certainly increase the timeout on the web portal, but that doesn't sound like right solution. Also, to improve performance:
Removed N+1/Duplicate queries
Implemented Caching
What else could be done here to reduce the calculation time?
Also, if it still takes longer, I was thinking of following solutions:
Do the computation beforehand and store it in DB. So when the actual request comes, there is no need of calculation. (Apprehensive about this approach. Since we will have to modify/Erase-and-recalculate this data, whenever there is some application logic change.)
Load the whole data in cache when application starts/data gets modified. But for the first time computation has to be done. Also, can't keep whole data in the cache when the application starts. So need to store it in the cache as per demand.
Maybe, do something like Angular promise, where promise gets fulfilled when the response comes from the server.
Do we have any alternative to do this efficiently?
UPDATE:
Depending on user input, the calculation might happen in few seconds. And also it might take 2-5 minutes. The scenario is, user imports an excel. The excel has been parsed and saved in DB. Now on another page, user wants to see the report/analytics graph derived with few calculations on the imported data(which has already been saved to db with background job). The calculation has to be done with many factors, so do not want to save it in DB(As pointed above). Also, when user request the report/analytics graph, It'll be bad experience to tell him that graph will be shown after sometime. You'll get email/notification etc.
The extremely typical solution is to enqueue a job for background processing, and return a job ID to the front-end. Your front-end can then poll for completion using that job ID, or you can trigger a notification such as an email to be sent to the user when the job completes.
There are a multitude of gems for this, and it is such a popular and accepted solution that Rails introduced its own ActiveJob for this exact purpose.
Here are a few possible solutions:
Optimize your tables with indexes to reduce data fetching time.
Preload all rows you'll be dealing with at the beginning, so you won't do a query each time you calculate something... it's faster/easier to #things.select { |r| r.blah } than to Thing.where(conditions)
Instead of all that, just do the computing in PLSQL on the database side. Sure, it's not the same as writing Ruby code but it could be faster.
And yes, cache the whole results set into memcache or redis or something (and expire when something change)
Run the calculation in the background (crontab?) and store the results in a JSON somewhere, or cache the entire HTML file (if you're not localizing or anything)
PS: I'm doing 1,2,3 combined with 5 (caching JSON results into memcache and then pulling the array and formatting/localizing) for a few M records from about 12 tables... sports data mainly.

Is the process dictionary appropriate in this case?

I've read several comments here and elsewhere suggesting that Erlang's process dictionary was a bad idea and should die. Normally, as a total Erlang newbie, I'd just avoid it. However, in this situation my other options aren't great.
I have a main dispatcher function that looks something like this:
dispatch(State) ->
receive
{cmd1, Params} ->
NewState = do_cmd1_stuff(Params, State),
dispatch(NewState);
{cmd2, Params} ->
NewState = do_cmd2_stuff(Params, State),
dispatch(NewState);
BadMsg ->
log_error(BadMsg),
dispatch(State)
end.
Obviously, my names are more meaningful to me, but that's the gist of it. Deep down in a function called by a function called by a function called by do_cmd2_stuff(), I want to send out messages to all my users telling them about something I've done. In order to do that, I need to get the list of users from the point where I send the messages. The user list doesn't lend itself easily to sticking in the global state, since that's just one data structure representing the only block of data on which I operate.
The way I see it, I have a couple unpleasant options other than using the process dictionary. I can send the user list through all the various levels of functions down to the very bottom one that does the broadcasting. That's unpleasant because it causes all my functions to gain a parameter, whether they really care about it or not.
Alternatively, I could have all the do_cmdN_stuff() functions return a message to send. That's not great either though, since sending the message may not be the last thing I want to do and it clutters up my dispatcher with a bunch of {Msg, NewState} tuples. Furthermore, some of the functions might not have any messages to send some of the time.
Like I said earlier, I'm very new to Erlang. Maybe someone with more experience can point me at a better way. Is there one? Is the process dictionary appropriate in this case?
The general rule is that if you have doubts, you shouldn't use the process dictionary.
If the two options you mentioned aren't good enough (I personally like the one where you return the messages to send) and what you want is some particular piece of code to track users and forward messages to them, maybe what you want to do is have a process holding that info.
Pid ! {forward, Msg}
where Pid will take care of sending everything to a bunch of other processes. Now, you would still need to pass the Pid around, unless you give it a name in some registry to find it. Either with register/2, global or gproc.
A simple answer would be to nest your global within a state record, which is then threaded through the system, at least at the stop level. This makes it easy to add new fields to the state in the future, not an uncommon occurrence, and allow you to keep your global state data structure untouched. So initially
-record(state, {users=[],state_data}).
Defining it as a record makes it easy to access and extend when necessary.
As you mentioned you can always pass the user list as extra param, thats not so bad.
If you don't want to do this just put it in State. You can have a special State just for this part of the calculation that also contains the user list.
Then there always is the possibility of putting it in ETS or in another server process.
What exactly to do is hard to recommend since it depends a lot on your exact application and preferences.
Just choose from the mentioned possibilities as if the process dictionary doesn't exist. Maybe your code needs restructuring if none of the variants look elegant, there always is some better way without the process dictionary.
Its really bad it is still there, because its alluring to many beginning Erlang users.
You really should not use process dictionary. I accept using dictionary only if
It is short living process.
I have full control about the process from spawn to termination i.e. I use minimum and well known set of external modules.
I need performance gain badly. It means avoid copy of data when using ets and dict/gb_tree is too slow (for GC reason).
ad 1. is not your case, you are using in server. ad 2. I don't know if it is your case. ad 3. is not your case because you need list of recipient so you don't gain nothing from that process dictionary is very fast key/value storage. In your case I don't see any reason why you should not include what you need to your State. IMHO State is exactly the right place for it.
Its an interesting question because it involves the fundamentals of functional design.
My opinion:
Try as much as possible to make the function return the messages, then send them. This separates the two different tasks nicely, and separates the purely functional task from the one that causes side effects.
If this isn't possible, pass receivers as argument even if its a bit messy. If the broadcasting function uses that data, it should be given to it explicitly, for clarity and predictability.
Using ETS as Peer Stritzinger suggests is really not any better than the PD, both hides the fact that the broadcasting function uses the receiver list and makes it dependent on global data.
I'm not sure about the Erlang way of encapsulating some state in a process, as I GIVE TERRIBLE ADVICE suggests. Is it really any better that ETS or PD?
clutters up my dispatcher with a bunch
of {Msg, NewState}
This is my experience also, that you often end up like this. It's not particularly pretty, but functional design seems to encourage this. Could some language feature be introduced to make it more beautiful and natural?
EDIT:
6 years ago I wrote:
Could some language feature be introduced to make it more beautiful and natural?
After learning much more about functional programming I have realised that examples of this are state-monads and do-notation that are found in Haskell.
I would consider sending a special message to self() from deep inside the call stack, and handling it at the top level dispatch method that you've sketched, where list of users is available.

Threading in Rails - do params[] persist?

I am trying to spawn a thread in Rails. I am usually not comfortable using threads as I will need to have an in-depth knowledge of Rails' request/response cycle, yet I cannot avoid using one as my request times out.
In order to avoid the time out, I am using a thread within a request. My question here is simple. The thread that I've used accesses a params[] variable inside it. And things seem to work OK now. I want to know whether this is right? I'd be happy if someone can throw some light on using Threads in Rails during request/response cycle.
[Starting a bounty]
The short answer is yes, but only to a degree; the binding in which the thread was created will continue to persist. The params will still exist only if no one (including Rails) goes out of their way to modify or delete the params hash. Instead, they rely on the garbage collector to clean up these objects. Since the thread has access to the current context (called the "binding" in Ruby) when it was created, all the variables that can be reached from that scope (effectively the entire state when the thread was created) cannot be deleted by the garbage collector. However, as executing continues in the main thread, the values of the variables in that context can be changed by the main thread, or even by the thread you created, if it can access it. This is the benefit--and the downfall--of threads: they share memory with everything else.
You can emulate a very similar environment to Rails to test your problem, using a function as such: http://gist.github.com/637719. As you can see, the code still prints 5.
However, this is not the correct way to do this. The better way to pass data to a thread is to pass it to Thread.new, like so:
# always dup objects when passing into a thread, else you really
# haven't done yourself any good-it would still be the same memory
Thread.new(params.dup) do |params|
puts params[:foo]
end
This way, you can be sure than any modifications to params will not affect your thread. The best practice is to only use data you pass to your thread in this way, or things that the thread itself created. Relying on the state of the program outside the thread is dangerous.
So as you can see, there are good reasons that this is not recommended. Multithreaded programming is hard, even in Ruby, and especially when you're dealing with as many libraries and dependencies as are used in Rails. In fact, Ruby seems to make it deceptively easy, but it's basically a trap. The bugs you will encounter are very subtle; they will happen "randomly" and be very hard to debug. This is why things like Resque, Delayed Job, and other background processing libraries are used so widely and recommended for Rails apps, and I would recommend the same.
The question is more does rails keep the request open whilst the thread is running than does it persist the value.
It won't persist the value as soon as the request ends and I also wouldn't recommend holding the request open unless there is a real need. As other users have said some stuff is just better in a delayed job.
Having said that we used threading a couple of times to query multiple sources concurrently and actually reduce the response time of an app (that was only for admins so didn't need to have fast response times) and if memory serves correctly the thread can keep the request open if you call join at the end and wait for each thread to finish before continuing.

Clone a lua state

Recently, I have encountered many difficulties when I was developing using C++ and Lua. My situation is: for some reason, there can be thousands of Lua-states in my C++ program. But these states should be same just after initialization. Of course, I can do luaL_loadlibs() and lua_loadfile() for each state, but that is pretty heavy(in fact, it takes a rather long time for me even just initial one state). So, I am wondering the following schema: What about keeping a separate Lua-state(the only state that has to be initialized) which is then cloned for other Lua-states, is that possible?
When I started with Lua, like you I once wrote a program with thousands of states, had the same problem and thoughts, until I realized I was doing it totally wrong :)
Lua has coroutines and threads, you need to use these features to do what you need. They can be a bit tricky at first but you should be able to understand them in a few days, it'll be well worth your time.
take a look to the following lua API call I think it is what you exactly need.
lua_State *lua_newthread (lua_State *L);
This creates a new thread, pushes it on the stack, and returns a pointer to a lua_State that represents this new thread. The new thread returned by this function shares with the original thread its global environment, but has an independent execution stack.
There is no explicit function to close or to destroy a thread. Threads are subject to garbage collection, like any Lua object.
Unfortunately, no.
You could try Pluto to serialize the whole state. It does work pretty well, but in most cases it costs roughly the same time as normal initialization.
I think it will be hard to do exactly what you're requesting here given that just copying the state would have internal references as well as potentially pointers to external data. One would need to reconstruct those internal references in order to not just have multiple states pointing to the clone source.
You could serialize out the state after one starts up and then load that into subsequent states. If initialization is really expensive, this might be worth it.
I think the closest thing to doing what you want that would be relatively easy would be to put the states in different processes by initializing one state and then forking, however your operating system supports it:
http://en.wikipedia.org/wiki/Fork_(operating_system)
If you want something available from within Lua, you could try something like this:
How do you construct a read-write pipe with lua?

How to prepare to be tech crunched

There is a good chance that we will be tech crunched in the next few days. Unfortunately, we have not gone live yet so we don't have a good estimation of how our system handles a production audience.
Our production setup consists of 2 EngineYard slices each with 3 mongrel instances, using Postgres as the database server.
Obviously a huge portion of how our app will hold up is to do with our actual code and queries etc. However, it would be good to see if there are any tips/pointers on what kind of load to expect or experiences from people who have been through it. Does 6 mongrel instances (possibly 8 if the servers can take it) sound like it will handle the load, or are at least most of it?
I have worked on several rails applications that experienced high load due to viral growth on Facebook.
Your mongrel count should be based on several factors. If your mongrels make API calls or deliver email and must wait for responses, then you should run as many as possible. Otherwise, try to maintain one mongrel per CPU core, with maybe a couple extra left over.
Make sure your server is using a Fair Proxy Balancer (not round robin). Here is the nginx module that does this: http://github.com/gnosek/nginx-upstream-fair/tree/master
And here are some other tips on improving and benchmarking your application performance to handle the load:
ActiveRecord
The most common problem Rails applications face is poor usage of ActiveRecord objects. It can be quite easy to make 100's of queries when only one is necessary. The easiest way to determine if this could be a problem with your application is to set up New Relic. After making a request to each major page on your site, take a look at the newrelic SQL overview. If you see a large number of very similar queries sequentially (select * from posts where id = 1, select * from posts where id = 2, select * from posts...) this may be a sign that you need to use a :include in one of your ActiveRecord calls.
Some other basic ActiveRecord tips (These are just the ones I can think of off the top of my head):
If you're not doing it already, make sure to correctly use indexes on your database tables.
Avoid making database calls in views, especially partials, it can be very easy to lose track of how much you are making database queries in views. Push all queries and calculations into your models or controllers.
Avoid making queries in iterators. Usually this can be done by using an :include.
Avoid having rails build ActiveRecord objects for large datasets as much as possible. When you make a call like Post.find(:all).size, a new class is instantiated for every Post in your database (and it could be a large query too). In this case you would want to use Post.count(:all), which will make a single fast query and return an integer without instantiating any objects.
Associations like User..has_many :objects create both a user.objects and user.object_ids method. The latter skips instantiation of ActiveRecord objects and can be much faster. Especially when dealing with large numbers of objects this is a good way to speed things up.
Learn and use named_scope whenever possible. It will help you keep your code tiny and makes it much easier to have efficient queries.
External APIs & ActionMailer
As much as you can, do not make API calls to external services while handling a request. Your server will stop executing code until a response is received. Not only will this add to load times, but your mongrel will not be able to handle new requests.
If you absolutely must make external calls during a request, you will need to run as many mongrels as possible since you may run into a situation where many of them are waiting for an API response and not doing anything else. (This is a very common problem when building Facebook applications)
The same applies to sending emails in some cases. If you expect many users to sign up in a short period of time, be sure to benchmark the time it takes for ActionMailer to deliver a message. If it's not almost instantaneous then you should consider storing emails in your database an using a separate script to deliver them.
Tools like BackgroundRB have been created to solve this problem.
Caching
Here's a good guide on the different methods of caching in rails.
Benchmarking (Locating performance problems)
If you suspect a method may be slow, try benchmarking it in console. Here's an example:
>> Benchmark.measure { User.find(4).pending_invitations }
=> #<Benchmark::Tms:0x77934b4 #cutime=0.0, #label="", #total=0.0, #stime=0.0, #real=0.00199985504150391, #utime=0.0, #cstime=0.0>
Keep track of methods that are slow in your application. Those are the ones you want to avoid executing frequently. In some cases only the first call will be slow since Rails has a query cache. You can also cache the method yourself using Memoization.
NewRelic will also provide a nice overview of how long methods and SQL calls take to execute.
Good luck!
Look into some load testing software like WEBLoad or if you have money, Quick Test Pro. This will help give you some idea. WEBLoad might be the best test in your situation.
You can generate thousands of virtual nodes hitting your site and you can inspect the performance of your servers from that load.
In my experience having watched some of our customers absorb a crunching, the traffic was fairly modest- not the bone crushing spike people seem to expect. Now, if you get syndicated and make on Yahoo's page or something, things may be different.
Search for the experiences of Facestat.com if you want to read about how they handled it (the Yahoo FP.)
My advise is just be prepared to turn off signups or go to a more static version of your site if your servers get too hot. Using a monitoring/profiling tool is a good idea as well, I like FiveRuns Manage tool for ease of setup.
Since you're using EngineYard, you should be able to allocate more machines to handle the load if necessary
Your big problems will probably not be the number of incoming requests, but will be the amount of data in your database showing you where your queries aren't using the indexes your expecting, or are returning too much data, e.g. The User List page works with 10 users, but dies when you try to show 10,000 users on that one page because you didn't add pagination (will_paginate plugin is almost your friend - watch out for 'select count(*)' queries that are generated for you)
So the two things to watch:
Missing indexes
Too much data per page
For #1, there's a plugin that runs an 'explain ...' query after every query so you can check index usage manually
There is a plugin that can generate data for you for various types of data that may help you fill your database up to test these queries too.
For #2, use will_paginate plugin or some other way to reduce data per page.
We've got basically the same setup as you, 2 prod slices and a staging slice at EY. We found ab to be a great load testing tool - just write a bash script with the urls that you expect to get hit and point it at your slice. Watch NewRelic stats and it should give you some idea of the load your app can handle and where you might need to optimise.
We also found query_reviewer to be very useful as well. It is great for finding those un-indexed tables and n+1 queries.

Resources