When is key/value caching on Heroku worthwhile? - ruby-on-rails

Background:
I have a web service which takes as input 1 to 20 objects and then performs an operation on each that takes roughly 100-300ms. The results of that operation are valid for one hour on average, and the output is a hash of strings and integers. The average request has 5 objects, and thus a response time of roughly 1000ms. I am expecting a pretty low cache hit rate until the service picks up traction; let's call it a 10% hit rate for now.
My application is hosted on Heroku, and for the purposes of this question, I do not wish to move it.
What I've Tried
I started with the free offering from IronCache (through the Heroku add-on) and did some very rough tests. put() and get() requests each take roughly 20-40ms for simple objects. There is no support for batch operations, so assuming a 100% cache miss rate, this would add 20-40ms per object to my response; in my average case of 5 objects, that is roughly 150ms extra.
IronCache does not support batched operations, but it seems like batching would solve my issue.
My Question
Given this profile, is it worthwhile to use a hosted caching (key/value) store on Heroku? If so, which?

I went with MemCachier, a Heroku add-on which offers a 25MB free tier. It uses Dalli as its Ruby library, which supports get_multi, as well as a multi method that takes a block and defers sending until the end of the block.
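As a rough sketch (not from the original answer) of how the batched reads might look with Dalli against MemCachier, assuming the request's input objects and a hypothetical compute_result helper standing in for the 100-300ms operation:

require 'dalli'

# MemCachier exposes its credentials through these environment variables.
cache = Dalli::Client.new(ENV['MEMCACHIER_SERVERS'],
                          :username => ENV['MEMCACHIER_USERNAME'],
                          :password => ENV['MEMCACHIER_PASSWORD'])

keys = objects.map { |o| "result/#{o.id}" }   # objects = the 1-20 input objects
hits = cache.get_multi(*keys)                 # one round trip for all lookups

objects.each do |o|
  next if hits["result/#{o.id}"]              # cache hit, nothing to do
  value = compute_result(o)                   # hypothetical 100-300ms operation
  cache.set("result/#{o.id}", value, 3600)    # TTL matches the one-hour validity
end

The writes could also be wrapped in cache.multi { ... } to defer sending until the end of the block, as described above.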

If batched operations will make caching worthwhile for you, you should use Redis; it supports them.
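For comparison, a sketch of the same pattern with the redis gem (recent versions), which batches reads with MGET and writes with a pipeline; the key names and compute_result helper are again hypothetical:

require 'json'
require 'redis'

redis = Redis.new(:url => ENV['REDIS_URL'])

keys   = objects.map { |o| "result/#{o.id}" }
cached = redis.mget(*keys)                 # one round trip for all reads

redis.pipelined do |pipe|                  # one round trip for all writes
  objects.zip(cached).each do |obj, hit|
    next if hit
    pipe.setex("result/#{obj.id}", 3600, compute_result(obj).to_json)
  end
end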

Related

Difference between CKQueryOperation and Perform(Fetch...)

I'm new to working with CloudKit and database fetching, and I've looked at the CKDatabaseOperation calls, so I'm trying to understand the real differences between adding an operation to a database and using "normal" function calls on that database when they both produce, more or less, the same results.
Why would adding an operation be more desirable over a function call and in what situations?
Thanks for helping me understand this. I'm trying to learn as much as I can about Swift.
Overview:
In CloudKit, most tasks can be done in 2 ways:
Convenience APIs (functions with completion handlers)
Operations
1. Convenience APIs
Advantages:
As the name implies, they are convenient to use
Disadvantages:
Usually require more server requests.
Can't build dependencies between operations.
2. Operations:
Advantages:
More configurable, with more options.
Require fewer server requests (better for your server request quota).
Built on Operation, so you get all of Operation's capabilities, like dependencies (you will need them in a real app).
Disadvantages:
They are not as convenient to use; you need to create the operation yourself. It takes a little more time to code, but it is well worth it.
Example 1 (Fetch):
If you use CKDatabase.fetch, you would need to specify the record IDs that you want to fetch.
If you use CKQueryOperation, you can query based on field values.
Example 2 (Save / Update):
If you use CKDatabase.save, you can save 1 record per function call, and each call results in a separate server request. If you want to save 200 records, you would have to run it in a loop and make 200 server requests, which is not very efficient. CloudKit also limits the number of server requests you can make per second, so this way you would exhaust your quota very quickly.
If you use CKModifyRecordsOperation, you can save 200 records all at once*, by passing them as an array, so you would be making far fewer server requests.
*Note: The server imposes a limit on the number of records it can save in 1 request, but it is definitely better than creating a separate request to save each record.
Reference:
https://developer.apple.com/library/content/documentation/DataManagement/Conceptual/CloudKitQuickStart/Introduction/Introduction.html#//apple_ref/doc/uid/TP40014987-CH1-SW1
Watch WWDC CloudKit videos
It might also help to watch the WWDC videos about Operation (earlier referred to as NSOperation).

How to optimise computation intensive request response on rails [duplicate]

This question already has answers here:
How do I handle long requests for a Rails App so other users are not delayed too much?
(3 answers)
Closed 6 years ago.
I have an application which does a lot of computation on a few pages (requests). The web interface sends an AJAX request, and the computation sometimes takes about 2-5 minutes. The problem is that by then the AJAX request times out.
We can certainly increase the timeout on the web portal, but that doesn't sound like the right solution. Also, to improve performance, I have already:
Removed N+1/Duplicate queries
Implemented Caching
What else could be done here to reduce the calculation time?
Also, if it still takes longer, I was thinking of following solutions:
Do the computation beforehand and store it in the DB, so when the actual request comes there is no need for calculation. (I'm apprehensive about this approach, since we would have to modify or erase-and-recalculate this data whenever the application logic changes.)
Load all the data into the cache when the application starts or the data gets modified. But the computation still has to be done the first time, and we can't keep the whole dataset in the cache from application start, so it would need to be stored in the cache on demand.
Maybe do something like an Angular promise, where the promise is fulfilled when the response comes back from the server.
Do we have any alternative to do this efficiently?
UPDATE:
Depending on user input, the calculation might take a few seconds, or it might take 2-5 minutes. The scenario: a user imports an Excel file, which is parsed and saved to the DB with a background job. On another page, the user then wants to see a report/analytics graph derived from a few calculations on that imported data. The calculation depends on many factors, so we do not want to save the results in the DB (as noted above). Also, when the user requests the report/analytics graph, it would be a bad experience to tell them that the graph will be shown after some time and that they'll get an email/notification, etc.
The extremely typical solution is to enqueue a job for background processing and return a job ID to the front end. The front end can then poll for completion using that job ID, or you can trigger a notification, such as an email, when the job completes.
There are a multitude of gems for this, and it is such a popular and accepted solution that Rails introduced its own ActiveJob for exactly this purpose.
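A minimal sketch of that flow with ActiveJob (Rails 5-style); the job, ReportCalculator, and the polling endpoint are hypothetical names, not from the question:

# app/jobs/report_calculation_job.rb
class ReportCalculationJob < ApplicationJob
  queue_as :default

  def perform(import_id)
    import = Import.find(import_id)
    result = ReportCalculator.new(import).call            # the 2-5 minute computation
    Rails.cache.write("report/#{import_id}", result, expires_in: 1.hour)
  end
end

# app/controllers/reports_controller.rb
class ReportsController < ApplicationController
  # POST /reports -> kick off the job, hand back an ID for polling
  def create
    job = ReportCalculationJob.perform_later(params[:import_id])
    render json: { job_id: job.job_id }, status: :accepted
  end

  # GET /reports/:import_id -> the front end polls this until the result appears
  def show
    result = Rails.cache.read("report/#{params[:import_id]}")
    if result
      render json: { status: "done", data: result }
    else
      render json: { status: "pending" }
    end
  end
end

Any queue backend (Sidekiq, Delayed Job, Resque) can sit behind ActiveJob here.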
Here are a few possible solutions:
Optimize your tables with indexes to reduce data fetching time.
Preload all the rows you'll be dealing with at the beginning, so you won't run a query each time you calculate something; it's faster/easier to @things.select { |r| r.blah } than to hit the database with Thing.where(conditions) every time.
Instead of all that, just do the computing in PL/SQL on the database side. Sure, it's not the same as writing Ruby code, but it could be faster.
And yes, cache the whole result set into memcached or Redis or something (and expire it when something changes); see the sketch after this list.
Run the calculation in the background (crontab?) and store the results as JSON somewhere, or cache the entire HTML file (if you're not localizing or anything).
PS: I'm doing 1, 2, 3 combined with 5 (caching JSON results into memcached and then pulling the array and formatting/localizing) for a few million records from about 12 tables... sports data, mainly.
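For points 4-5 above, a minimal Rails.cache version might look like this; ReportCalculator and the cache key format are assumptions, not from the answer:

# Returns the cached result if present, otherwise computes and stores it.
def report_for(import)
  Rails.cache.fetch("report/#{import.id}/#{import.updated_at.to_i}", :expires_in => 12.hours) do
    ReportCalculator.new(import).call   # the expensive computation
  end
end

Keying on updated_at means re-importing the data naturally expires the old entry.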

Heroku database performance experience needed?

We are experiencing some serious scaling challenges for our intelligent search engine/aggregator. Our database holds around 200k objects. From profiling and New Relic it seems most of our troubles may come from the database. We are using the smallest dedicated database Heroku provides (Ronin).
We have been looking into indexing and caching. So far we managed to solve our problems by reducing database calls and caching content intelligently, but now even this seems to reach an end. We are constantly asking ourselves if our code/configuration is good enough or if we are simply not using enough "hardware".
We suspect that the database plan we buy from Heroku may be performing insufficiently. For example, just doing a simple count (no joins, nothing else) on the 200k items takes around 250ms. This seems like a long time, even allowing for Postgres's known weakness with counts.
We have also started to use geolocation lookups based on latitude/longitude. Both columns are indexed floats. Doing a distance calculation involves pretty complicated math, but we are using the well-recommended geocoder gem, which is said to generate well-optimized queries. Even so, geocoder still takes 4-10 seconds to perform a lookup on, say, 40,000 objects, returning only the nearest 10. This again sounds like a long time, and all the experienced people we have consulted say it sounds very odd, again hinting at database performance.
So basically we wonder: What can we expect from the database? Might there be a problem? And what can we expect if we decide to upgrade?
An additional question I have is: I read here that we can improve performance by loading the entire database into memory. Are we supposed to configure this ourselves, and if so, how?
UPDATE ON THE LAST QUESTION:
I got this from the helpful people at Heroku support:
"What this means is having enough memory (a large enough dedicated
database) to store your hot data set in memory. This isn't something
you have to do manually, Postgres is configured automatically use all
available memory on our dedicated databases.
I took a look at your database and it looks like you're currently
using about 1.25 GB of RAM, so you haven't maxed your memory usage
yet."
UPDATE ON THE NUMBERS AND FIGURES
Okay so now I've had time to look into the numbers and figures, and I'll try to answer the questions below as follows:
First of all, the DB consists of around 29 tables with a lot of relations, but in reality most queries are done on a single table (some additional resources are joined in to provide all the information needed for the views).
The table has 130 columns.
Currently it holds around 200k records, but only 70k are active; hence all indexes are made as partial indexes on this "state".
All the columns we search on are indexed correctly, none is of text type, and many are just booleans.
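As a sketch of the partial-index setup described above (the table and column names are assumptions; the where: option requires Rails 4+ migrations, older versions would need raw SQL):

class AddPartialIndexOnActiveItems < ActiveRecord::Migration[5.2]
  def change
    # Index only the ~70k active rows, since those are the ones queried.
    add_index :items, [:latitude, :longitude],
              :where => "state = 'active'",
              :name  => "index_items_on_coords_active"
  end
end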
Answers to questions:
Hmm, the baseline performance is kind of hard to pin down; we have so many different selects. The time taken typically varies from 90ms to 250ms when selecting a limit of 20 rows. We have a LOT of counts on the same table, all varying from 250ms to 800ms.
Hmm, well, that's hard to say, because they won't give it a shot.
We have around 8-10 users/clients running requests at the same time.
Our query load: New Relic's database report says this about the last 24 hours: throughput 9.0 cpm, total time 0.234 s, avg time 25.9 ms.
Yes, we have examined the query plans of our long-running queries. The count queries are especially slow, often over 500ms for a pretty simple count on the 70k records, done on indexed columns, with a result of around 300.
I've tuned a few Rails apps hosted on Heroku, and also hosted on other platforms, and usually the problems fall into a few basic categories:
Doing too much in Ruby that could be done at the DB level (sorting, filtering, joining data, etc.)
Slow queries
Inefficient use of indexes (not enough, or too many)
Trying too hard to do it all in the db (this is not as common in rails, but does happen)
Not optimizing cacheable data
Not effectively using background processing
Right now it's hard to help you because your question doesn't contain any specifics. I think you'll get a better response if you pinpoint the biggest issue you need help with and then ask.
Some info that will help us help you:
What is the average response time of your actions? (from new relic, request-log-analyzer, logs)
What is the slowest request that you want help with?
What are the queries and code in that request?
Is the site's performance different when you run it locally vs. on Heroku?
In the end I think you'll find that it is not an issue specific to Heroku, and if you had your app deployed on Amazon, EngineYard, etc. you'd see the same performance. The good news is that I think your problems are common and shouldn't be too hard to fix once you've done some benchmarking and profiling.
-John McCaffrey
We are constantly asking...
...this seems a lot...
...that is suspected...
...What can we expect...
Good news! You can put an end to seeming, suspecting, wondering, and expecting through the magic of measurement!
Seriously though, you've not mentioned any of the basic points you'd need to get a useful answer:
What's the baseline performance of the DB running a sequential scan and single-row index fetches? You say Heroku say your DB fits in RAM, so you shouldn't see disk I/O issues when you measure.
Does this performance match whatever Heroku say it should be?
How many concurrent clients?
What's your query load - what queries and how often?
Have you checked the query plans for any of your suspiciously long-running queries?
Once you've got this sort of information, maybe someone can say something useful. As it stands anything you read here is just guesswork.
First: you should check your Postgres configuration (SHOW ALL from within psql or another client, or just look at postgresql.conf in the data directory). The parameter with the largest impact on performance is effective_cache_size, which should be set to about (total_physical_ram - memory_in_use_by_kernel_and_all_processes). For a 4GB machine, this often works out to around 3GB (4-1). (This is very coarse tuning, but it will give the best results as a first step.)
Second: why do you want all the counts? Better to use a typical query: just ask for what is needed, not what is available. (Reason: there is no possible optimisation for a COUNT(*): either the whole table or a whole index needs to be scanned.)
Third: start gathering and analysing some query plans (for typical queries that perform badly). You can get a query plan by putting EXPLAIN ANALYZE before the actual query. (Another way is to increase the logging level and obtain the plans from the logfile.) A bad query plan can point you at missing statistics or indexes, or even at bad data modelling.
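For example, from a Rails console you could pull a plan for one of the slow counts roughly like this (the table and column names are assumptions):

# Run EXPLAIN ANALYZE for a suspicious count through the ActiveRecord connection.
sql = "EXPLAIN ANALYZE SELECT COUNT(*) FROM items WHERE state = 'active'"
ActiveRecord::Base.connection.execute(sql).each { |row| puts row["QUERY PLAN"] }

# While you're there, sanity-check the setting mentioned above.
puts ActiveRecord::Base.connection.execute("SHOW effective_cache_size").first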
New Relic monitoring can be included as an add-on for Heroku (http://devcenter.heroku.com/articles/newrelic). At the very least this should give you a lot of insight into what is happening behind the scenes, and may help you pinpoint some issues.

How to prepare to be tech crunched

There is a good chance that we will be tech crunched in the next few days. Unfortunately, we have not gone live yet, so we don't have a good estimate of how our system handles a production audience.
Our production setup consists of 2 EngineYard slices each with 3 mongrel instances, using Postgres as the database server.
Obviously a huge portion of how our app will hold up has to do with our actual code and queries, etc. However, it would be good to see if there are any tips/pointers on what kind of load to expect, or experiences from people who have been through it. Do 6 mongrel instances (possibly 8 if the servers can take it) sound like they will handle the load, or at least most of it?
I have worked on several rails applications that experienced high load due to viral growth on Facebook.
Your mongrel count should be based on several factors. If your mongrels make API calls or deliver email and must wait for responses, then you should run as many as possible. Otherwise, try to maintain one mongrel per CPU core, with maybe a couple extra left over.
Make sure your server is using a Fair Proxy Balancer (not round robin). Here is the nginx module that does this: http://github.com/gnosek/nginx-upstream-fair/tree/master
And here are some other tips on improving and benchmarking your application performance to handle the load:
ActiveRecord
The most common problem Rails applications face is poor usage of ActiveRecord objects. It can be quite easy to make hundreds of queries when only one is necessary. The easiest way to determine whether this is a problem in your application is to set up New Relic. After making a request to each major page on your site, take a look at the New Relic SQL overview. If you see a large number of very similar queries run sequentially (select * from posts where id = 1, select * from posts where id = 2, select * from posts...) it may be a sign that you need a :include in one of your ActiveRecord calls.
Some other basic ActiveRecord tips (these are just the ones I can think of off the top of my head; a short sketch illustrating a few of them follows the list):
If you're not doing it already, make sure to correctly use indexes on your database tables.
Avoid making database calls in views, especially partials; it can be very easy to lose track of how many database queries you are making in views. Push all queries and calculations into your models or controllers.
Avoid making queries in iterators. Usually this can be done by using a :include.
Avoid having Rails build ActiveRecord objects for large datasets as much as possible. When you make a call like Post.find(:all).size, a new object is instantiated for every Post in your database (and it could be a large query too). In this case you would want to use Post.count, which makes a single fast query and returns an integer without instantiating any objects.
Associations like has_many :objects on User create both a user.objects and a user.object_ids method. The latter skips instantiation of ActiveRecord objects and can be much faster. Especially when dealing with large numbers of objects, this is a good way to speed things up.
Learn and use named_scope whenever possible. It will help you keep your code tiny and makes it much easier to write efficient queries.
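A short sketch of a few of these tips in Rails 2-era syntax, using hypothetical Post/Comment models:

# Eager-load comments to avoid an N+1 query inside a loop.
posts = Post.find(:all, :include => :comments, :limit => 20)

# Count without instantiating any ActiveRecord objects.
total = Post.count          # single SELECT COUNT(*), returns an integer

# Fetch only ids, skipping instantiation (assumes User has_many :posts).
ids = current_user.post_ids

# A named_scope keeps frequently used conditions tidy and composable.
class Post < ActiveRecord::Base
  named_scope :published, :conditions => { :published => true }
end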
External APIs & ActionMailer
As much as you can, do not make API calls to external services while handling a request. Your server will stop executing code until a response is received. Not only will this add to load times, but your mongrel will not be able to handle new requests.
If you absolutely must make external calls during a request, you will need to run as many mongrels as possible since you may run into a situation where many of them are waiting for an API response and not doing anything else. (This is a very common problem when building Facebook applications)
The same applies to sending emails in some cases. If you expect many users to sign up in a short period of time, be sure to benchmark the time it takes ActionMailer to deliver a message. If it's not almost instantaneous, then you should consider storing emails in your database and using a separate script to deliver them.
Tools like BackgroundRB have been created to solve this problem.
Caching
Here's a good guide on the different methods of caching in Rails.
Benchmarking (Locating performance problems)
If you suspect a method may be slow, try benchmarking it in console. Here's an example:
>> Benchmark.measure { User.find(4).pending_invitations }
=> #<Benchmark::Tms:0x77934b4 @cutime=0.0, @label="", @total=0.0, @stime=0.0, @real=0.00199985504150391, @utime=0.0, @cstime=0.0>
Keep track of the methods that are slow in your application. Those are the ones you want to avoid executing frequently. In some cases only the first call will be slow, since Rails has a query cache. You can also cache the method yourself using memoization.
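A hand-rolled memoization of the slow method from the benchmark might look like this (the invitations association and accepted column are assumptions):

class User < ActiveRecord::Base
  has_many :invitations

  def pending_invitations
    # Computed once per object; subsequent calls reuse the cached array.
    @pending_invitations ||= invitations.find(:all, :conditions => { :accepted => false })
  end
end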
New Relic will also provide a nice overview of how long methods and SQL calls take to execute.
Good luck!
Look into some load-testing software like WebLOAD or, if you have money, QuickTest Professional. This will help give you some idea; WebLOAD might be the best option in your situation.
You can generate thousands of virtual nodes hitting your site and you can inspect the performance of your servers from that load.
In my experience, having watched some of our customers absorb a crunching, the traffic was fairly modest, not the bone-crushing spike people seem to expect. Now, if you get syndicated and make it onto Yahoo's front page or something, things may be different.
Search for the experiences of Facestat.com if you want to read about how they handled it (the Yahoo FP.)
My advice is just to be prepared to turn off signups or switch to a more static version of your site if your servers get too hot. Using a monitoring/profiling tool is a good idea as well; I like the FiveRuns Manage tool for its ease of setup.
Since you're using EngineYard, you should be able to allocate more machines to handle the load if necessary.
Your big problem will probably not be the number of incoming requests, but the amount of data in your database, which will show you where your queries aren't using the indexes you're expecting or are returning too much data. For example, the User List page works fine with 10 users but dies when you try to show 10,000 users on that one page because you didn't add pagination (the will_paginate plugin is almost your friend; watch out for the 'select count(*)' queries that are generated for you).
So the two things to watch:
Missing indexes
Too much data per page
For #1, there's a plugin that runs an 'explain ...' query after every query so you can check index usage manually.
There is also a plugin that can generate sample data of various types, which may help you fill your database up to test these queries.
For #2, use the will_paginate plugin or some other way to reduce the data per page.
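A minimal will_paginate sketch for point #2 (the Post model and page size are placeholders):

# Controller: load only one page of records instead of the whole table.
@posts = Post.paginate(:page => params[:page], :per_page => 25)
# In the view, <%= will_paginate @posts %> renders the page links.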
We've got basically the same setup as you: 2 production slices and a staging slice at EngineYard. We found ab to be a great load-testing tool; just write a bash script with the URLs that you expect to get hit and point it at your slice. Watch the New Relic stats and it should give you some idea of the load your app can handle and where you might need to optimise.
We also found query_reviewer to be very useful. It is great for finding those un-indexed tables and N+1 queries.

Application Context in Rails

Rails comes with a handy session hash into which we can cram stuff to our heart's content. I would, however, like something like ASP's application context, which, instead of sharing data only within a single session, shares it with all sessions in the same application. I'm writing a simple dashboard app and would like to pull data every 5 minutes, rather than every 5 minutes for each session.
I could, of course, store the cache update times in a database, but so far haven't needed to set up a database for this app, and would love to avoid that dependency if possible.
So, is there any way to get (or simulate) this sort of thing? If there's no way to do it without a database, is there any kind of "fake" database engine that comes with Rails, runs in memory, but doesn't bother persisting data between restarts?
Right answer: memcached. Fast, clean, supports multiple processes, and integrates very cleanly with Rails these days. Not even that bad to set up, but it is one more thing to keep running.
90% answer: There are probably multiple Rails processes running around (one for each Mongrel you have, for example). Depending on the specifics of your caching needs, it's quite possible that having one cache per Mongrel isn't the worst thing in the world. For example, suppose you were caching the results of a long-running query which
gets fresh data every 8 hours
is used every page load, 20,000 times a day
needs to be accessed in 4 processes (Mongrels)
then you can drop those 20,000 queries a day down to 12 with about a single line of code:
@@arbitrary_name ||= Model.find_by_stupidly_long_query(param)
The double at-mark, a piece of Ruby syntax you might not be familiar with, denotes a class variable, which here effectively acts as a process-wide global. ||= is the commonly used Ruby idiom to perform the assignment if and only if the variable is currently nil or otherwise evaluates to false. The value will stay good until you explicitly clear it OR until the process stops for any reason: server restart, explicitly killed, what have you.
And after you go from 20k calculations a day down to 12 in about 15 seconds (OK, two minutes: you need to wrap it in a trivial if block which stores the cache update time in another class variable), you might find that there is no need to spend additional engineering effort on getting it down to 4 a day.
I actually use this on one of my production sites, for caching a few expensive queries which literally only need to be evaluated once in the life of the process (i.e. they change only at deployment time; I suppose I could precalculate the results and write them to disk or the DB, but why do that when SQL can do the work for me).
You don't get any magic expiry syntax, reliability is pretty slim, and it can't be shared across processes, but it's 90% of what you need in a line of code.
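A rough sketch of the "trivial if block" mentioned above, with the refresh interval and wrapper class as placeholders:

class ExpensiveLookup
  @@cached_result = nil
  @@cached_at     = nil

  # Recompute at most once every 8 hours per process; otherwise reuse the cached value.
  def self.result(param)
    if @@cached_at.nil? || @@cached_at < 8.hours.ago
      @@cached_result = Model.find_by_stupidly_long_query(param)
      @@cached_at     = Time.now
    end
    @@cached_result
  end
end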
You should have a look at memcached: http://wiki.rubyonrails.org/rails/pages/MemCached
There is a helpful Railscast on Rails 2.1 caching. It is very useful if you plan on using memcached with Rails.
Using the stock Rails cache is roughly equivalent to this.
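For comparison, a sketch of the stock-cache version of the same idea (the key name and expiry are placeholders; :expires_in needs a store that supports it, such as memcached):

result = Rails.cache.fetch("stupidly_long_query/#{param}", :expires_in => 8.hours) do
  Model.find_by_stupidly_long_query(param)
end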
@p3t0r is right, memcached is probably the best option, but you could also use the SQLite database that comes with Rails. That won't work across multiple machines, though, whereas memcached will. Also, SQLite will persist to disk, though I think you can set it up not to if you want. Rails itself has no application-scoped storage, since it is run as one process per request handler, so it has no shared memory space like ASP.NET or a Java server would.
So what you are asking for is quite impossible in Rails because of the way it is designed. What you are asking for is a shared object, and Rails is strictly single-threaded; memcached or a similar tool for sharing data between distributed processes is the only way to go.
Rails.cache freezes the objects it stores. This kind of makes sense for a cache, but NOT for an application context. I guess that instead of doing a round trip to the moon to accomplish this simple task, all you have to do is create a constant inside config/environment.rb:
APP_CONTEXT = Hash.new
Pretty simple, ah?
