Rails3: Precomputing/storing data for graphs? - ruby-on-rails

I'm currently building a little admin platform with statistics and graphs with highchart and highstock. Right now the graphs are always fetching the data from the database everytime the load. But since the data will grow substantially in the future this is very inefficient and slows the database down. My question now is what the best approach it would be to store or precompute the data so the graphs don't have to fetch it from the database everytime they load.

If the amount of data is going to be really huge and you can sacrifice some realtime-ish manner of presenting it, the best way would be to compute the data for charts in some seperate database table and show the charts out of it. You can setup a background process (using whenever or delayed_job or whatever you like) to periodically update the pre-processed data table with fresh values.
Another option would be caching the chart response by any means you like (using built-in Rails caching, writing your custom cache etc) to deliver the same data to a big number of users with reduced DB hit.
However, in general, preprocessing seems to be the winner as stats tables usually contain very "sparse" data which can be pre-computed to a much smaller set to display on a chart yet having the option to apply some filtering / sorting if needed.
EDIT just forgot to mention there might be some room for optimization on the database side. For instance, if you can limit the periods the user is able to view data for (= cap the amount to be queried per used), than proper indexing and DB setup can deliver substantial performance without man-in-the middle things like caching or precomputing.

Related

Rails saving and/or caching complicated query results

I have an application that, at its core, is a sort of data warehouse and report generator. People use it to "mine" through a large amount of data with ad-hoc queries, produce a report page with a bunch of distribution graphs, and click through those graphs to look at specific result sets of the underlying items being "mined." The problem is that the database is now many hundreds of millions of rows of data, and even with indexing, some queries can take longer than a browser is willing to wait for a response.
Ideally, at some arbitrary cutoff, I want to "offline" the user's query, and perform it in the background, save the result set to a new table, and use a job to email a link to the user which could use this as a cached result to skip directly to the browser rendering the graphs. These jobs/results could be saved for a long time in case people wanted to revisit the particular problem they were working on, or emailed to coworkers. I would be tempted to just create a PDF of the result, but it's the interactive clicking of the graphs that I'm trying to preserve here.
None of the standard Rails caching techniques really captures this idea, so maybe I just have to do this all by hand, but I wanted to check to see if I wasn't missing something that I could start with. I could create a keyed model result in the in-memory cache, but I want these results to be preserved on the order of months, and I deploy at least once a week.
Considering Data loading from lots of join tables. That's why it's taking time to load.
Also you are performing calculation/visualization tasks with the data you fetch from DB, then show on UI.
I like to recommend some of the approaches to your problem:
Minimize the number of joins/nested join DB queries
Add some direct tables/columns, ex. If you are showing counts of comments of user the you can add new column in user table to store it in user table itself. You can add scheduled job to update data or add callback to update count
also try to minimize the calculations(if any) performing on UI side
you can also use the concept of lazy loading for fetching the data in chunks
Thanks, hope this will help you to decide where to start 🙂

Is it okay if Operational data store holds data for rolling time period?

We are planning to build an Operational data store for the front-end users data extraction requirements.
As far as I know the Kimball's approach to build ODS\DW, it should hold the data for complete time period and not like the rolling time period.
The reason being, there could be a need to extract older data from ODS\DW.
So I need your thoughts on this. How should I approach ?
I would create a snapshot table that could hold the values for the rolling period for each day, and filter on the client side of things which snapshot to display.
Once the period is over then the final values can be stored on the permanent data mart.
Kimball's approach for a data warehouse would be to load transactional data to any data warehouse if you can, because it is more flexible in terms of being rolled up. Certainly at the ODS stage you wouldn't want to 'pre-aggregate' your data, if there could be a need to get hold of older data.
If you store both the transactional data and then pre-aggregated versions of the data (in aggregate fact tables, with indexes/views or with a cube, or just filtering on the report side as the other answer suggests), you can get the best of both worlds.
(Note: Kimball's approach in fact does not require an ODS: they're fine if you want to build one, but their focus is on the dimensionally modelled data warehouse.)

Worst case behavior for updates using SQLite PRAGMA synchronous = OFF and journal_mode = MEMORY in iOS

By using both the synchronous=OFF and journal_mode=MEMORY options, I am able to reduce the speed of updates from 15 ms to around 2 ms which is a major performance improvement. These updates happen one at a time, so many other optimizations (like using transactions about a bunch of them) are not applicable.
According to the SQLite documentation, the DB can go 'corrupt' in the worst case if there is a power outage of some type. However, is not the worst thing that can happen is for the data to be lost, or possibly part of a transaction to be lost (which I guess is a form of corruption). Is it really possible for arbitrary corruption to occur with either of these options? If so, why?
I am not using any transactions, so partially written data from transactions is not a concern, and I can handle loosing data once in a blue moon. But if 'corruption' means that all the data in the DB can be randomly changed in an unpredictable way, that would be a strong reason to not use these options.
Does any one know what the real worst-case behavior would be on iOS?
Tables are organized as B-trees with the rowid as the key.
If some writes get lost while SQLite is updating the tree structure, the entire table might become unreadable.
(The same can happen with indexes, but those could be simply dropped and recreated.)
Data is organized in pages (typically 1 KB or 4 KB). If some page update gets lost while some tree is being reorganized, all the data in these pages (i.e., some random rows from the table with nearby rowid values) might become corruped.
If SQLite needs to allocate a new page, and that page contains plausible data (e.g., deleted data from the same table), and the writing of that page gets lost, then you have incorrect data in the table, without the ability to detect it.

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
etc.
I'm wondering what are the cleanest and most efficient strategies for aggregating the data. I can think of a couple but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce impact on the database. The fewer round trips and less extraneous data stored the better
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real world experience with this and if so which approach would you suggest and what are the positives and negatives? If there is a better solution that I am not thinking of I'd love ot hear it...
Thanks
JP
I needed to do something similar in a recent project. We've implemented a full audit system in a secondary database, it tracks changes on every record on the live db. Essentially every insert, update and delete actually updates 2 records, one in the live db and one in the audit db.
Since we have this data in realtime on the audit db, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. Its duplicating data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3am, ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily, so its not that many records, once again all from the secondary database.
I agree that you should create a separate database for your statistics, it will reduce the impact on your database.
You can go with your idea of having "Staging" tables and "Aggregate" tables; that way, if you want to access the near-real-time data you go o the staging table, when you want to historical data, you go to the aggregates.
Finally, I would recommend you use an asynchronous call to save your statistics; that way your pages will not have an impact in response time.
I suggest that you will create a separate database for this. The best way is to use BI technique. There is a separate services in
SQL server for Bi.

Questions about caching in high-traffic website

Suppose we are building an E-commerce site that allows consumers to search for products by typing in keywords. Say there are at most 200,000 products, and there are millions of consumers using the system. Let’s say the product table is updated fairly frequently. Since the number of products is not that high and we can probably store the entire product table in memory and search against it instead of hitting the database. We are hoping to create distributed caches that store the same data but reside in different servers (for high availability and performance reason) and we need to be able to synchronize data among these caches and invalidate caches when product table is modified.
Our application is built using ASP.NET MVC and NHibernate. I am trying to understand whether NHibernate’s level-2 caching would help with my situation. I would really appreciate if you guys can shed some light on this.
I understand that level-2 caching will help cache query result so if two different users are searching using the same keyword, the L2 Cache will serve the result from the cache instead of from the database. But it doesn’t help us much since the product table is updated frequently and the cached result will be stale.
My question is am I understanding L2 caching correctly and is there exists anything that help manage cache the way I would like to do (multiple caches, the same data, synchronize between cache and invalidate cache). Any thoughts is highly appreciated.
Having used both the second-level cache (using the memcached provider) and the NHibernate.Search add-on it seems to me you could benefit from both.
The NHibernate.Search component depends on Lucene.Net and keyword search is decoupled from the Database it self. A different index file is created per class mapped and optimizations can be set on the property level using attributes, giving you an extra level of granularity. Additionally, you can implement best match and propositions (check Lucene in Action and/or Hibernate Search in action). As a note, you don't have to maintain the index (unless you explicitly request an index rebuild); the implementation manages everything behind the scenes although you can manipulate the index if you wish to do so. So, adding/deleting/updating a product will automatically update the according index.
For the second-level cache you get instant performance boost. On a test environment with a data set of approx 2 mil rows i had more than 20% improvement even on an extremely low request count. The performance boost is gradually larger as the request count increases - the application first hits the 2nd level cache and if it does not find it then hits the DB to fetch the required rows and inserts them on the cache for future queries. Again you can manage stuff like cache duration and other configuration settings, as well as explicitly clear the cache (all of it, a part of it, or particular entries) if you wish to do so. Note that cache state is managed by the application during save/update/delete.
For scallability
* the 2nd level cache depends on the provider (ie memcached is highly performant and scalable and supports distributed instances).
* for the Lucene.Net/NHibernate.Search you will need to set up a specific place that the indexes will reside and that place must be accessible for read/write by all web-application instances. Note here that the sensitive link is I/O and file contention, so setting up a machine with a faster than light file system will prevent that from happening (i am speaking for your scenario with many thousands of search requests per second)
As a side note i would highly recommend NHibernate.Search since it is extremely faster than LIKE queries and is easier to use than implementing SQL-Server's FullText search inside the application (which i have done).
Whether a second level cache will help depends on exactly how frequently your product table is updated in relation to cache hits. If you add 100 new products an hour but receive 10,000 queries an hour, even a 10% cache hit rate will make a big difference. If the rates are reversed, a second level cache will be of almost no value.
I suggest you set up a stress test environment that closely approximates your production environment and perform benchmarking on the various second level cache providers.
Also check that your DB is configured properly for an update-heavy scenario.
I recommend using NHibernate.Search w/ Lucene. It works together with the 2nd level cache. Lucene can do sophisticated text searching ripping fast and then return back the entity keys to NHibernate which pulls the full entity out of its 2nd level cache. The NHibernate.Search extension does the work of keeping your Lucene index in sync.
TekPub did a recent episode on your exact scenario of searching product descriptions. The episode compares NHibernate queries, SQL Full-text indexing and Lucene w/ NHibernate.Search.

Resources