Is there and easy way to do this ? I have the search working ... I have paging working but not together
So the easiest way to me seems to just cache the search results and then refer to that collection in the controller as the collection to page through.. Does that make sense ?
Or how should I approach this from a high level ? (I come from 10 years of ASP.NET webforms)
Caching search results improves performance of paging through the results data, but consumes memory. If searches are done by multiple users you need to cache search results for each user. Depending on the potential amount of data to be held in memory this may or may not be feasible. (Such caching should be done for a short time with sliding/absolute expiration.)
Doing search every time you need to provide data for the current result page creates higher CPU load, but does not consume memory between requests. Of course, request for page data must carry search parameters, so that the search can be redone.
This is the usual compromise between placing more stress on CPU or memory, and it is up to you to decide what to choose. I would cache the results with a short sliding expiration time (1 minute) and still make sure that the search can be performed again if the cached result is not present.
Related
Our site has counters for views, comments and likes that are displayed on article thumbnails. Those thumbnails are cached and then used in various places, often cached themselves. The issue is that as the counters increment, all these different cache fragments quickly go stale. The result is that different pages displaying the same articles will show counters with different values.
The site is made in RoR 4.2.1 and caching is done through Memcached.
What is the best way to go about maximizing caching for best performance while still keeping updated content?
There are multiple approaches to do this, and it all boils down to how much optimization you need:
Fetch the count for page views in a separate AJAX call once the cached page has loaded. This solution should be highly compatible with your current design.
Cache partials instead of complete page. This means that you'll have cached copies of all partials except that for page views, and so you might need to make major changes in your code.
If the requests are too high, you can also write an approximation function (in js), which given past page views tells you the estimated page views, which can be displayed to end user. The base data can be refreshed every minute (or custom frequency).
I'm running virtual machine, so all the system information i can get. How can I use them to detect a page or revalant pages volatile? The result can be just a approximate volatile time of empirical conclusion. I want to use time series analysis to predict the next time of a page's content modification, is it possible and accurate? Are there any better methods? Thanks very much!
I'm going to answer for pages inside a process as the question if related to the OS gets very complex.
You can use VirtualQuery() and VirtualQueryEx()to determine the status of a given memory page. This includes if it is read only, guard page, DLL image section, writeable, etc. From these statuses you can infer the volatility of some pages.
All the read only pages can be assumed to be none-volatile. But that isn't strictly accurate since you can use VirtualProtect() to change the protection status of a page. And you can use VirtualProtextEx() to the same in a different application. So you'd need to re-check these.
What about the other pages? Any writeable page you're going to have to periodically check them. For example calculate a checksum and compare to previous checksum's to see if they've changed. And then record the time between changes.
You could use the NTDLL Function NtQueryInformationProcess() with ProcessWorkingSetWatch to get data on the page faults for the system.
Note sure if this what you're looking for but it's the simplest approach I can think of. It's potentially a bit CPU hungry. And reading each page regularly to calculate the checksums will trash your cache.
I have a simple jquery which calls a servlet via get and then Neo4j is used to return data in JSON format.
The system is workable after the FIRST query but the very first time it is used the system ins unbelievably slow. This is some kind of initialisation issue. I am using Heroku web hosting.
The code is fairly long so I am not posting now, but are there any known issues regarding the first invocation of Neo4j?
I have done limited testing so far for performance as I had a lot of JSON problems anyway and they only just got resolved.
Summary:
JQuery(LINUX)<--> get (JSON) <---> Neo4j
First Query - response is 10-20 secs
Second Query - time is 2-3 secs
More queries - 2/3 secs.
This is not a one-off; I tested this a few times and always the same pattern comes up.
This is a normal behaviour of Neo4j where store files are mapped into memory lazily for parts of the files that become hot, and becoming hot requires perhaps thousands of requests to such a part. This is a behaviour that has big stores in mind, whereas for smaller stores it merely gets in the way (why not map the whole thing if it fits in memory?).
Then on top of that is an "object" cache that further optimizes access, that get populated lazily for requested entities.
Using an SSD instead of spinning media will usually speed up the initial non-memory-mapped random access quite a bit, but in your scenario I recognize that's not viable.
There are thoughts on beeing more sensitive to hot parts of the store (i.e. memory map even if not as hot) at the start of a database lifecycle, or more precisely have the heat sensitivity be a function of how much is currently memory mapped versus how much can be mapped at maximum. This has shown to make initial requests much more responsive.
Caching is something that I kind of ignored for a long time, as projects that I worked on were on local intranets with very little activity. I'm working on a much larger Rails 3 personal project now, and I'm trying to work out what and when I should cache things.
How do people generally determine this?
If I know a site is going to be relatively low-activity, should I just cache every single page?
If I have a page that calls several partials, is it better to do fragment caching in those partials, or page caching on those partials?
The Ruby on Rails guides did a fine job of explaining how caching in Rails 3 works, but I'm having trouble understanding the decision-making process associated with it.
Don't ever cache for the sake of it, cache because there's a need (with the exception of something like the homepage, which you know is going to be super popular.) Launch the site, and either parse your logs or use something like NewRelic to see what's slow. From there, you can work out what's worth caching.
Generally though, if something takes 500ms to complete, you should cache, and if it's over 1 second, you're probably doing too much in the request, and you should farm whatever you're doing to a background process…for example, fetching a Twitter feed, or manipulating images.
EDIT: See apneadiving's answer too, he links to some great screencasts (albeit based on Rails 2, but the theory is the same.)
You'll want to think about caching several kinds of things:
Requests that are hit a lot, and seldom change
Requests that are "expensive" to draw, lots of database calls, etc. Also hopefully these seldom change.
The other side of caching that shouldn't go without mention, is expiration. Its also often the harder part. You have to know when a cache is no longer good, and clear it out so fresh content will be generated. Sweepers, or Observers, depending on how you implement your cache can help you with this. You could also do it just based on a time value, allow caches to have a max-age and clear them after that no matter what.
As for fragment vs full page caching, think of it in terms of how often those parts are updated. If 3 partials of a page are never updated, and one is, maybe you want to cache those 3, and allow that 1 to be fetched live for so you can have up to the second accuracy. Or if the different partials of a page should have different caching rules: maybe a "timeline" section is cached, but has a cache-age of 1 minute. While the "friends" partial is cached for 12 hours.
Hope this helps!
If the site is relatively low activity you shouldn't cache any page. You cache because of performance problems, and performance problems come about because you have too much data to query, too many users, or worse, both of those situations at the same time.
Before you even think about caching, the first thing you do is look through your application for the requests that are taking up the most time. Not the slowest requests, but the requests your application spends the most aggregate time performing. That is if you have a request A that runs 10 times at 1500ms and request B that runs 5000 times at 250ms you work on optimizing B first.
It's actually pretty easy to grep through your production.log and extract rendering times and URLs to combine them into a simple report. You can even do that in real-time if you want.
Once you've identified a problematic request, you go about picking apart what it's doing to service the request. The first thing is to look for any queries that can be combined by using eager loading or by looking ahead a bit more to anticipate what you'll need. The next thing is to ensure you're not loading data that isn't used.
So many times you'll see code to list users and it's loading 50KB per person of biographical data, their Facebook and Twitter handles, literally everything about them, and all you use is their name.
Fetch as little as you need, and fetch it in the most efficient way you can. Use connection.select_rows when you don't need models.
The next step is to look at what kind of queries you're running, and how they're under-performing. Ensure your indexes are all set properly and are being used. Check that you're not doing complicated JOIN operations that could be resolved by a bit of tactical de-normalization.
Have a look at what data you are storing in your application, and try and find things that can be removed from your production database and warehoused somewhere else. Cycle your data out regularly when it's no longer relevant, preserve it in a separate database if you need to.
Then go over and have a look at how your database server is tuned. Does it have sufficiently large buffers? Is it on hardware that could be upgraded with more memory at a nominal cost? Too many people are running a completely un-tuned database server and with a few simple settings they can get ten-fold performance increases.
If, and only if, you still have a performance problem at this point then you might want to consider caching.
You know why you don't cache first? It's because once you cache something, that cached data is immediately stale. If parts of your application use this data under the assumption it's always up to date, you will have problems. If you don't expire this cache when the data does change, you will have problems. If you cache the data and never use it again, you're just clogging up your cache and you will have problems. Basically you'll have lots of problems when you use caching, so it's often a last resort.
Suppose we are building an E-commerce site that allows consumers to search for products by typing in keywords. Say there are at most 200,000 products, and there are millions of consumers using the system. Let’s say the product table is updated fairly frequently. Since the number of products is not that high and we can probably store the entire product table in memory and search against it instead of hitting the database. We are hoping to create distributed caches that store the same data but reside in different servers (for high availability and performance reason) and we need to be able to synchronize data among these caches and invalidate caches when product table is modified.
Our application is built using ASP.NET MVC and NHibernate. I am trying to understand whether NHibernate’s level-2 caching would help with my situation. I would really appreciate if you guys can shed some light on this.
I understand that level-2 caching will help cache query result so if two different users are searching using the same keyword, the L2 Cache will serve the result from the cache instead of from the database. But it doesn’t help us much since the product table is updated frequently and the cached result will be stale.
My question is am I understanding L2 caching correctly and is there exists anything that help manage cache the way I would like to do (multiple caches, the same data, synchronize between cache and invalidate cache). Any thoughts is highly appreciated.
Having used both the second-level cache (using the memcached provider) and the NHibernate.Search add-on it seems to me you could benefit from both.
The NHibernate.Search component depends on Lucene.Net and keyword search is decoupled from the Database it self. A different index file is created per class mapped and optimizations can be set on the property level using attributes, giving you an extra level of granularity. Additionally, you can implement best match and propositions (check Lucene in Action and/or Hibernate Search in action). As a note, you don't have to maintain the index (unless you explicitly request an index rebuild); the implementation manages everything behind the scenes although you can manipulate the index if you wish to do so. So, adding/deleting/updating a product will automatically update the according index.
For the second-level cache you get instant performance boost. On a test environment with a data set of approx 2 mil rows i had more than 20% improvement even on an extremely low request count. The performance boost is gradually larger as the request count increases - the application first hits the 2nd level cache and if it does not find it then hits the DB to fetch the required rows and inserts them on the cache for future queries. Again you can manage stuff like cache duration and other configuration settings, as well as explicitly clear the cache (all of it, a part of it, or particular entries) if you wish to do so. Note that cache state is managed by the application during save/update/delete.
For scallability
* the 2nd level cache depends on the provider (ie memcached is highly performant and scalable and supports distributed instances).
* for the Lucene.Net/NHibernate.Search you will need to set up a specific place that the indexes will reside and that place must be accessible for read/write by all web-application instances. Note here that the sensitive link is I/O and file contention, so setting up a machine with a faster than light file system will prevent that from happening (i am speaking for your scenario with many thousands of search requests per second)
As a side note i would highly recommend NHibernate.Search since it is extremely faster than LIKE queries and is easier to use than implementing SQL-Server's FullText search inside the application (which i have done).
Whether a second level cache will help depends on exactly how frequently your product table is updated in relation to cache hits. If you add 100 new products an hour but receive 10,000 queries an hour, even a 10% cache hit rate will make a big difference. If the rates are reversed, a second level cache will be of almost no value.
I suggest you set up a stress test environment that closely approximates your production environment and perform benchmarking on the various second level cache providers.
Also check that your DB is configured properly for an update-heavy scenario.
I recommend using NHibernate.Search w/ Lucene. It works together with the 2nd level cache. Lucene can do sophisticated text searching ripping fast and then return back the entity keys to NHibernate which pulls the full entity out of its 2nd level cache. The NHibernate.Search extension does the work of keeping your Lucene index in sync.
TekPub did a recent episode on your exact scenario of searching product descriptions. The episode compares NHibernate queries, SQL Full-text indexing and Lucene w/ NHibernate.Search.