Questions about caching in high-traffic website - asp.net-mvc

Suppose we are building an E-commerce site that allows consumers to search for products by typing in keywords. Say there are at most 200,000 products, and there are millions of consumers using the system. Let’s say the product table is updated fairly frequently. Since the number of products is not that high and we can probably store the entire product table in memory and search against it instead of hitting the database. We are hoping to create distributed caches that store the same data but reside in different servers (for high availability and performance reason) and we need to be able to synchronize data among these caches and invalidate caches when product table is modified.
Our application is built using ASP.NET MVC and NHibernate. I am trying to understand whether NHibernate’s level-2 caching would help with my situation. I would really appreciate if you guys can shed some light on this.
I understand that level-2 caching will help cache query result so if two different users are searching using the same keyword, the L2 Cache will serve the result from the cache instead of from the database. But it doesn’t help us much since the product table is updated frequently and the cached result will be stale.
My question is am I understanding L2 caching correctly and is there exists anything that help manage cache the way I would like to do (multiple caches, the same data, synchronize between cache and invalidate cache). Any thoughts is highly appreciated.

Having used both the second-level cache (using the memcached provider) and the NHibernate.Search add-on it seems to me you could benefit from both.
The NHibernate.Search component depends on Lucene.Net and keyword search is decoupled from the Database it self. A different index file is created per class mapped and optimizations can be set on the property level using attributes, giving you an extra level of granularity. Additionally, you can implement best match and propositions (check Lucene in Action and/or Hibernate Search in action). As a note, you don't have to maintain the index (unless you explicitly request an index rebuild); the implementation manages everything behind the scenes although you can manipulate the index if you wish to do so. So, adding/deleting/updating a product will automatically update the according index.
For the second-level cache you get instant performance boost. On a test environment with a data set of approx 2 mil rows i had more than 20% improvement even on an extremely low request count. The performance boost is gradually larger as the request count increases - the application first hits the 2nd level cache and if it does not find it then hits the DB to fetch the required rows and inserts them on the cache for future queries. Again you can manage stuff like cache duration and other configuration settings, as well as explicitly clear the cache (all of it, a part of it, or particular entries) if you wish to do so. Note that cache state is managed by the application during save/update/delete.
For scallability
* the 2nd level cache depends on the provider (ie memcached is highly performant and scalable and supports distributed instances).
* for the Lucene.Net/NHibernate.Search you will need to set up a specific place that the indexes will reside and that place must be accessible for read/write by all web-application instances. Note here that the sensitive link is I/O and file contention, so setting up a machine with a faster than light file system will prevent that from happening (i am speaking for your scenario with many thousands of search requests per second)
As a side note i would highly recommend NHibernate.Search since it is extremely faster than LIKE queries and is easier to use than implementing SQL-Server's FullText search inside the application (which i have done).

Whether a second level cache will help depends on exactly how frequently your product table is updated in relation to cache hits. If you add 100 new products an hour but receive 10,000 queries an hour, even a 10% cache hit rate will make a big difference. If the rates are reversed, a second level cache will be of almost no value.
I suggest you set up a stress test environment that closely approximates your production environment and perform benchmarking on the various second level cache providers.
Also check that your DB is configured properly for an update-heavy scenario.

I recommend using NHibernate.Search w/ Lucene. It works together with the 2nd level cache. Lucene can do sophisticated text searching ripping fast and then return back the entity keys to NHibernate which pulls the full entity out of its 2nd level cache. The NHibernate.Search extension does the work of keeping your Lucene index in sync.
TekPub did a recent episode on your exact scenario of searching product descriptions. The episode compares NHibernate queries, SQL Full-text indexing and Lucene w/ NHibernate.Search.

Related

How to quickly warmup Neo4J page cache for a set of nodes and relationships

I'm setting up a REST API backed by a Neo4J instance of complex data about 400 GB of store size and need to warmup page cache for a set of nodes and relationships.
There are specific calls for a set of input data which takes a lot of time to query and get a response. So I want a way to warmup the cache only for the nodes and relationships which are accessed for these calls. I tried using APOC.warmup.run(true, true) (took around 15 mins which is acceptable for me), but it does load the whole store to memory which I can't do since I have a constraint on memory.I tried writing a simple Cypher which traverses through these path, but it is taking lot of time to execute and when I check the memory growth of Neo4J instance which is very slow as compared to APOC warmup.
I am also thinking if there is a way to extend/customize APOC warmup to load only specific parts of the store, but want to see if there are people out there who already tried something similar before.
I expect a quick way of warming up specific part of the store rather than the whole store.
You should also have enough page-cache configured for that (400G)
If you use apoc.warmup.run(true,true,true) also the schema indexes will be warmed up.
I guess it only took 15 minutes for you because your disk read performance might be lower? And how many CPUs do you have?
You cannot really load a specific part of the store.
But recent Neo4j versions track page-cache usage and restore it after restart.
So you can just run your queries and after a restart the same pages will be in the page cache.
You also see with PROFILE if you have page-faults in your query.

Getting most recent paths visited across sessions in Rails app

I have a simple rails app with no database and no controllers. It uses High Voltage for routing queries, then uses javascript to go get data using the params hash.
A typical URL looks like this:
http://example.com/?id=37ed660aa222e61ebbbc02db
I'd like to grab the ten unique URLs users have most recently visited and pass them to a view. Note that I said users, preferably across concurrent sessions.
Is there a way to retrieve this using ActiveSupport::Notifications or Production.log? Any examples, including where the code should best go, would be greatly appreciated!
I think that Redis would be ideally suited to this. It's one of the NoSQL key-value store db's, but its support for the value part being an ordered list, queue, etc. should make it easy to store unique urls in a FIFO list as they are visited, limit the size of that list (discard urls at the 'old' end of the list), and retrieve the most recent N urls to pass to your view. Your list should stay small enough that it would all stay in memory and be very fast. You might be able to do this with memcached or mongo or another one as well; I think it would be best though if the solution kept the stored values in memory.
If you aren't already using redis (or similar), it might seem like overkill to set it up and maintain just for this feature. But you can make it pay for itself by also using it for caching, background job processing (Resque / Sidekiq), and probably other things in your app.

ASP.NET MVC 3 in-memory data store

I have a project which provides users with a list of current tasks that need to be completed. Any user can complete any task, and so to ensure that only one user is working on a task at a time I need to be able to 'lock' it. I'm using SignalR for this, so a user requests a lock on a task, and if they are successful (ie. if noone else has locked it) then they will be able to access the further information that they need.
My problem is how to store the list of locked tasks. The original plan was simply to add an additional bit field 'IsLocked' to the Task table and update this when the user requested a lock and when the task was unlocked. We have about 300 concurrent users, however, and a task takes only about 3-4 minutes, meaning huge numbers of additional - and tiny - queries on the database. Therefore we were wondering about in-memory storage, simply storing a list of task ids in a 'lockedTasks' list.
I had considered using caching, but am unsure on the best ways to do this, or even if better alternatives exist. If anyone has any experience in this then some advice would be great thanks
I would avoid memory completely as IIS is not that great with it, if you found your self in the IIS need for refreshing the Application Pool for some sort of reason, your list is simply gone!
Maybe a MemCache system? If it does not loose things in the above way, but...
I would advice to be in the middle, IO File is fast that request data to a Database, specially if it's not in the same machine (witch for security reasons, it should never be), so... why not, and just to hold your list, you don't use one of the currently famous NoSQL database?
MongoDB is a document database that has a .NET Library and it's easy to use, it is not as fast as Memmory, but extremely quicker than Physical databases for what you want.
Normally the NoSQL Database will be hosted in the App_Data folder so it will be extremely fast to access and you can just hold there the task_id and user_id of all locked tasks.
Have you considered stateful filters?
Check out this links for more info:
ASP.NET MVC Filters and Statefulness
Brad Wilson: Advanced MVC
3 - (Video)
Brad Wilson: Advanced MVC 3 - (PDF)
I'm sorry, but if your app can't handle a single query every 3-4 minutes x 300 users, then you're doing something very wrong. Just browsing a site typically generates orders of magnitude more queries than that.

When/what to cache in Rails 3

Caching is something that I kind of ignored for a long time, as projects that I worked on were on local intranets with very little activity. I'm working on a much larger Rails 3 personal project now, and I'm trying to work out what and when I should cache things.
How do people generally determine this?
If I know a site is going to be relatively low-activity, should I just cache every single page?
If I have a page that calls several partials, is it better to do fragment caching in those partials, or page caching on those partials?
The Ruby on Rails guides did a fine job of explaining how caching in Rails 3 works, but I'm having trouble understanding the decision-making process associated with it.
Don't ever cache for the sake of it, cache because there's a need (with the exception of something like the homepage, which you know is going to be super popular.) Launch the site, and either parse your logs or use something like NewRelic to see what's slow. From there, you can work out what's worth caching.
Generally though, if something takes 500ms to complete, you should cache, and if it's over 1 second, you're probably doing too much in the request, and you should farm whatever you're doing to a background process…for example, fetching a Twitter feed, or manipulating images.
EDIT: See apneadiving's answer too, he links to some great screencasts (albeit based on Rails 2, but the theory is the same.)
You'll want to think about caching several kinds of things:
Requests that are hit a lot, and seldom change
Requests that are "expensive" to draw, lots of database calls, etc. Also hopefully these seldom change.
The other side of caching that shouldn't go without mention, is expiration. Its also often the harder part. You have to know when a cache is no longer good, and clear it out so fresh content will be generated. Sweepers, or Observers, depending on how you implement your cache can help you with this. You could also do it just based on a time value, allow caches to have a max-age and clear them after that no matter what.
As for fragment vs full page caching, think of it in terms of how often those parts are updated. If 3 partials of a page are never updated, and one is, maybe you want to cache those 3, and allow that 1 to be fetched live for so you can have up to the second accuracy. Or if the different partials of a page should have different caching rules: maybe a "timeline" section is cached, but has a cache-age of 1 minute. While the "friends" partial is cached for 12 hours.
Hope this helps!
If the site is relatively low activity you shouldn't cache any page. You cache because of performance problems, and performance problems come about because you have too much data to query, too many users, or worse, both of those situations at the same time.
Before you even think about caching, the first thing you do is look through your application for the requests that are taking up the most time. Not the slowest requests, but the requests your application spends the most aggregate time performing. That is if you have a request A that runs 10 times at 1500ms and request B that runs 5000 times at 250ms you work on optimizing B first.
It's actually pretty easy to grep through your production.log and extract rendering times and URLs to combine them into a simple report. You can even do that in real-time if you want.
Once you've identified a problematic request, you go about picking apart what it's doing to service the request. The first thing is to look for any queries that can be combined by using eager loading or by looking ahead a bit more to anticipate what you'll need. The next thing is to ensure you're not loading data that isn't used.
So many times you'll see code to list users and it's loading 50KB per person of biographical data, their Facebook and Twitter handles, literally everything about them, and all you use is their name.
Fetch as little as you need, and fetch it in the most efficient way you can. Use connection.select_rows when you don't need models.
The next step is to look at what kind of queries you're running, and how they're under-performing. Ensure your indexes are all set properly and are being used. Check that you're not doing complicated JOIN operations that could be resolved by a bit of tactical de-normalization.
Have a look at what data you are storing in your application, and try and find things that can be removed from your production database and warehoused somewhere else. Cycle your data out regularly when it's no longer relevant, preserve it in a separate database if you need to.
Then go over and have a look at how your database server is tuned. Does it have sufficiently large buffers? Is it on hardware that could be upgraded with more memory at a nominal cost? Too many people are running a completely un-tuned database server and with a few simple settings they can get ten-fold performance increases.
If, and only if, you still have a performance problem at this point then you might want to consider caching.
You know why you don't cache first? It's because once you cache something, that cached data is immediately stale. If parts of your application use this data under the assumption it's always up to date, you will have problems. If you don't expire this cache when the data does change, you will have problems. If you cache the data and never use it again, you're just clogging up your cache and you will have problems. Basically you'll have lots of problems when you use caching, so it's often a last resort.

How to prepare to be tech crunched

There is a good chance that we will be tech crunched in the next few days. Unfortunately, we have not gone live yet so we don't have a good estimation of how our system handles a production audience.
Our production setup consists of 2 EngineYard slices each with 3 mongrel instances, using Postgres as the database server.
Obviously a huge portion of how our app will hold up is to do with our actual code and queries etc. However, it would be good to see if there are any tips/pointers on what kind of load to expect or experiences from people who have been through it. Does 6 mongrel instances (possibly 8 if the servers can take it) sound like it will handle the load, or are at least most of it?
I have worked on several rails applications that experienced high load due to viral growth on Facebook.
Your mongrel count should be based on several factors. If your mongrels make API calls or deliver email and must wait for responses, then you should run as many as possible. Otherwise, try to maintain one mongrel per CPU core, with maybe a couple extra left over.
Make sure your server is using a Fair Proxy Balancer (not round robin). Here is the nginx module that does this: http://github.com/gnosek/nginx-upstream-fair/tree/master
And here are some other tips on improving and benchmarking your application performance to handle the load:
ActiveRecord
The most common problem Rails applications face is poor usage of ActiveRecord objects. It can be quite easy to make 100's of queries when only one is necessary. The easiest way to determine if this could be a problem with your application is to set up New Relic. After making a request to each major page on your site, take a look at the newrelic SQL overview. If you see a large number of very similar queries sequentially (select * from posts where id = 1, select * from posts where id = 2, select * from posts...) this may be a sign that you need to use a :include in one of your ActiveRecord calls.
Some other basic ActiveRecord tips (These are just the ones I can think of off the top of my head):
If you're not doing it already, make sure to correctly use indexes on your database tables.
Avoid making database calls in views, especially partials, it can be very easy to lose track of how much you are making database queries in views. Push all queries and calculations into your models or controllers.
Avoid making queries in iterators. Usually this can be done by using an :include.
Avoid having rails build ActiveRecord objects for large datasets as much as possible. When you make a call like Post.find(:all).size, a new class is instantiated for every Post in your database (and it could be a large query too). In this case you would want to use Post.count(:all), which will make a single fast query and return an integer without instantiating any objects.
Associations like User..has_many :objects create both a user.objects and user.object_ids method. The latter skips instantiation of ActiveRecord objects and can be much faster. Especially when dealing with large numbers of objects this is a good way to speed things up.
Learn and use named_scope whenever possible. It will help you keep your code tiny and makes it much easier to have efficient queries.
External APIs & ActionMailer
As much as you can, do not make API calls to external services while handling a request. Your server will stop executing code until a response is received. Not only will this add to load times, but your mongrel will not be able to handle new requests.
If you absolutely must make external calls during a request, you will need to run as many mongrels as possible since you may run into a situation where many of them are waiting for an API response and not doing anything else. (This is a very common problem when building Facebook applications)
The same applies to sending emails in some cases. If you expect many users to sign up in a short period of time, be sure to benchmark the time it takes for ActionMailer to deliver a message. If it's not almost instantaneous then you should consider storing emails in your database an using a separate script to deliver them.
Tools like BackgroundRB have been created to solve this problem.
Caching
Here's a good guide on the different methods of caching in rails.
Benchmarking (Locating performance problems)
If you suspect a method may be slow, try benchmarking it in console. Here's an example:
>> Benchmark.measure { User.find(4).pending_invitations }
=> #<Benchmark::Tms:0x77934b4 #cutime=0.0, #label="", #total=0.0, #stime=0.0, #real=0.00199985504150391, #utime=0.0, #cstime=0.0>
Keep track of methods that are slow in your application. Those are the ones you want to avoid executing frequently. In some cases only the first call will be slow since Rails has a query cache. You can also cache the method yourself using Memoization.
NewRelic will also provide a nice overview of how long methods and SQL calls take to execute.
Good luck!
Look into some load testing software like WEBLoad or if you have money, Quick Test Pro. This will help give you some idea. WEBLoad might be the best test in your situation.
You can generate thousands of virtual nodes hitting your site and you can inspect the performance of your servers from that load.
In my experience having watched some of our customers absorb a crunching, the traffic was fairly modest- not the bone crushing spike people seem to expect. Now, if you get syndicated and make on Yahoo's page or something, things may be different.
Search for the experiences of Facestat.com if you want to read about how they handled it (the Yahoo FP.)
My advise is just be prepared to turn off signups or go to a more static version of your site if your servers get too hot. Using a monitoring/profiling tool is a good idea as well, I like FiveRuns Manage tool for ease of setup.
Since you're using EngineYard, you should be able to allocate more machines to handle the load if necessary
Your big problems will probably not be the number of incoming requests, but will be the amount of data in your database showing you where your queries aren't using the indexes your expecting, or are returning too much data, e.g. The User List page works with 10 users, but dies when you try to show 10,000 users on that one page because you didn't add pagination (will_paginate plugin is almost your friend - watch out for 'select count(*)' queries that are generated for you)
So the two things to watch:
Missing indexes
Too much data per page
For #1, there's a plugin that runs an 'explain ...' query after every query so you can check index usage manually
There is a plugin that can generate data for you for various types of data that may help you fill your database up to test these queries too.
For #2, use will_paginate plugin or some other way to reduce data per page.
We've got basically the same setup as you, 2 prod slices and a staging slice at EY. We found ab to be a great load testing tool - just write a bash script with the urls that you expect to get hit and point it at your slice. Watch NewRelic stats and it should give you some idea of the load your app can handle and where you might need to optimise.
We also found query_reviewer to be very useful as well. It is great for finding those un-indexed tables and n+1 queries.

Resources