Rails saving and/or caching complicated query results - ruby-on-rails

I have an application that, at its core, is a sort of data warehouse and report generator. People use it to "mine" through a large amount of data with ad-hoc queries, produce a report page with a bunch of distribution graphs, and click through those graphs to look at specific result sets of the underlying items being "mined." The problem is that the database is now many hundreds of millions of rows of data, and even with indexing, some queries can take longer than a browser is willing to wait for a response.
Ideally, past some arbitrary cutoff, I want to "offline" the user's query: perform it in the background, save the result set to a new table, and have a job email the user a link that uses this cached result to skip directly to rendering the graphs in the browser. These jobs/results could be saved for a long time in case people wanted to revisit the particular problem they were working on, or be emailed to coworkers. I would be tempted to just create a PDF of the result, but it's the interactive clicking of the graphs that I'm trying to preserve here.
None of the standard Rails caching techniques really captures this idea, so maybe I just have to do this all by hand, but I wanted to check to see if I wasn't missing something that I could start with. I could create a keyed model result in the in-memory cache, but I want these results to be preserved on the order of months, and I deploy at least once a week.
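For what it's worth, one rough sketch of the "offline the query" idea using ActiveJob. Everything here is hypothetical scaffolding, not from the question: it assumes a SavedReport model (a saved_reports table with user_id, params, status, and a JSON results column), a ReportQuery object wrapping the ad-hoc query, and a ReportMailer.

```ruby
# A minimal sketch, assuming the SavedReport / ReportQuery / ReportMailer
# names above exist in the app; adjust to whatever your models are called.
class OfflineReportJob < ApplicationJob
  queue_as :reports

  def perform(saved_report_id)
    report = SavedReport.find(saved_report_id)

    # Run the expensive ad-hoc query in the background.
    rows = ReportQuery.new(report.params).run # returns an array of hashes

    # Persist the result set so the browser can render graphs from it later.
    report.update!(results: rows, status: "complete", completed_at: Time.current)

    # Email the user a link back to the cached result.
    ReportMailer.ready(report).deliver_later
  end
end

# In the controller, past the cutoff you enqueue instead of running inline:
#   report = SavedReport.create!(user: current_user, params: query_params, status: "queued")
#   OfflineReportJob.perform_later(report.id)
#   redirect_to report_pending_path(report)
```

Because the results live in an ordinary table rather than the in-memory cache, they survive deploys and can be kept for months or shared with coworkers via the link.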

It sounds like your data is loaded from lots of joined tables, which is why it takes so long to load.
You are also performing calculation/visualization work on the data you fetch from the DB before showing it in the UI.
I'd like to recommend a few approaches to your problem:
Minimize the number of joins/nested joins in your DB queries
Add some denormalized tables/columns. For example, if you are showing the count of a user's comments, you can add a new column to the users table to store that count directly; a scheduled job can refresh it, or a callback can update the count (see the sketch after this list)
Also try to minimize any calculations performed on the UI side
You can also use lazy loading to fetch the data in chunks
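As an illustration of the second point, Rails' built-in counter cache covers the comment-count case; the model and column names below are just conventional examples, not from the question.

```ruby
# Sketch: keep users.comments_count up to date automatically.
class Comment < ApplicationRecord
  belongs_to :user, counter_cache: true
end

# Migration adding the column and backfilling existing rows.
class AddCommentsCountToUsers < ActiveRecord::Migration[7.0]
  def change
    add_column :users, :comments_count, :integer, default: 0, null: false

    reversible do |dir|
      dir.up { User.find_each { |u| User.reset_counters(u.id, :comments) } }
    end
  end
end
```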
Thanks, hope this will help you to decide where to start 🙂

Related

Pre-Made ActiveRecord to Optimize Performance / Save Resources

Essentially, each time a visitor reaches the application, the controller performs a database query to check which items are the most relevant to show.
Although the items shown vary with time, they are not personally selected for each user.
This means that instead of calculating the items each time a visitor arrives, it would be better for the system to perform a single query every 10 minutes or so, store the result, and reuse it on each visit.
What is the best way to apply this idea? I was thinking of cron jobs and maybe storing the result in Redis, but I'm not sure; any help is appreciated!
There are a number of ways to do this. One way that I've used in the past with success is to have a table in your database that represents the most relevant items and then have a cron job that updates that table.
Fragment caching, as #wesley6j recommended, isn't a bad way to go either, and you can combine the two techniques if you want.
If you want more detailed suggestions, you can provide some more details about what you are trying to achieve.
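To make the first suggestion concrete, here is a rough sketch of a table refreshed on a schedule. The RelevantItem model, the relevance query, and the whenever schedule are placeholders for whatever your app actually computes.

```ruby
# lib/tasks/relevant_items.rake -- hypothetical task name and models
namespace :relevant_items do
  desc "Recompute the most relevant items to show to visitors"
  task refresh: :environment do
    items = Item.order(views_count: :desc).limit(20) # your real relevance query here

    RelevantItem.transaction do
      RelevantItem.delete_all
      items.each_with_index do |item, position|
        RelevantItem.create!(item: item, position: position)
      end
    end
  end
end

# config/schedule.rb (whenever gem), run every 10 minutes:
#   every 10.minutes do
#     rake "relevant_items:refresh"
#   end
```

The controller then just reads RelevantItem.order(:position), which is cheap, and you can still wrap the rendered list in a fragment cache on top of that.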

The best way to handle erratic data on iOS

I am working on an application where I have a connection to a database. The database contains from 300MB to 4GB worth of data, as each customer has their own database. The issue I am having is in gathering the data: because of the potential database size, just downloading and storing the information locally isn't possible. The data can get quite complex and can vary. For example:
A customer has a Job and they want to search for that job from the app.
I then fetch a list of jobs matching the search criteria.
The customer sees the job they want to view and I start the gathering process.
This job can potentially touch many tables, sometimes repeatedly.
There is the jobs table and a relational table to map to a person. Then there is another table that contains non-customer relational information, then there are calendar events associated with the job, which in turn can associate different people. Then there are emails attached to the job, which in turn can bring in additional people and events.
So I have a working model that gathers all of this information. The problem I have is that I cannot figure out a good method of signaling to my view that the data is completely downloaded. My initial thought was to use the NotificationCenter to post a message when certain parts of the task were finished, allowing the core Job object to notify the view when everything was complete.
I know this is a pretty generalized question, but I'm honestly stumped as to how to take an unknown number of table results and translate that into a notice that my app can actually use.
My initial recommendation would be Core Data. It's designed for this kind of problem. No, I'm not saying to download the entire database into Core Data. I'm saying to use Core Data to manage your object model, because that's what it's good at.
As you receive data from the server, compose it into NSManagedObjects and stick them in the data store. On the UI side, create an NSFetchedResultsController to keep you informed as the data updates asynchronously. You don't necessarily need to persist this store. You could just keep it in memory and throw it away whenever you're done with the query, but keeping it on disk could be a nice caching solution. Again, don't think of Core Data as "a local database." Think of it as a model persistence engine that you can query for objects.
One advantage of this model is that you can provide the best available data to the user as it becomes available. But say you really don't want to get the information until it's all available. That's fine, too. Just let the network side keep updating its context, and then only save it when everything's complete. That way NSFetchedResultsController gets a single atomic update. The nice thing about Core Data is that it has these concepts built in, so you can adjust your update strategy without requiring a massive redesign.
The Notification Center will work great for this.
Post the notification at logical points in your data load to trigger a UI update for your users.

Rails3: Precomputing/storing data for graphs?

I'm currently building a little admin platform with statistics and graphs using Highcharts and Highstock. Right now the graphs always fetch the data from the database every time they load. But since the data will grow substantially in the future, this is very inefficient and slows the database down. My question is what the best approach would be to store or precompute the data so the graphs don't have to fetch it from the database every time they load.
If the amount of data is going to be really huge and you can sacrifice some of the real-time manner of presenting it, the best way would be to compute the data for the charts into a separate database table and show the charts from it. You can set up a background process (using whenever or delayed_job or whatever you like) to periodically update the pre-processed data table with fresh values.
Another option would be caching the chart response by any means you like (using built-in Rails caching, writing your custom cache etc) to deliver the same data to a big number of users with reduced DB hit.
However, in general, preprocessing seems to be the winner, as stats tables usually contain very "sparse" data which can be pre-computed into a much smaller set to display on a chart, while still keeping the option to apply some filtering/sorting if needed.
EDIT: I just forgot to mention there might be some room for optimization on the database side. For instance, if you can limit the periods the user is able to view data for (i.e. cap the amount to be queried per user), then proper indexing and DB setup can deliver substantial performance without man-in-the-middle things like caching or precomputing.
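As a sketch of the pre-processed table approach mentioned above: a background job rolls daily stats up into a chart data table, and the charts read only from that table. The model and column names here are illustrative, not from the question.

```ruby
# Hypothetical job that a scheduler (whenever, sidekiq-cron, delayed_job, ...)
# runs nightly to fill a chart_data_points table.
class PrecomputeChartDataJob < ApplicationJob
  queue_as :low

  def perform(date = Date.yesterday)
    # Aggregate one day's raw events by type.
    totals = Event.where(created_at: date.all_day).group(:event_type).count

    totals.each do |event_type, count|
      ChartDataPoint
        .find_or_initialize_by(day: date, event_type: event_type)
        .update!(value: count)
    end
  end
end
```

Highcharts/Highstock then receives ChartDataPoint.where(day: range) instead of hitting the raw tables on every page load.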

Using statistical tables with Rails

I'm building an app that needs to store a fair amount of events that the users carry out. (Think LOTS as in millions per month).
I need to report on these events (total of type x in the last month, etc.) and need something resilient and fast.
I've toyed with Redis etc. to store aggregates of the data, but this could just mean that I'm building up a massive store of single-figure aggregates that aren't rebuildable.
Whilst this isn't a bad solution, I'm looking at storing the raw event data in tables that I can then query on a needs basis, and potentially generate aggregate counters on a periodic basis. This would thus give me the ability to add counters over time, and also carry out ad-hoc inspections on what is going on, something which aggregates don't allow.
Question is, how is best to do this? I obviously don't want to have to create a model for each table (which is what Rails would prefer), so do I just create the tables and interact with raw SQL on a needs basis, or is there some other choice for dealing with this sort of data?
I've worked on an app that had that type of data flow, and the solution was the following:
-> store everything
-> create aggregates
-> delete everything after a short period (1 week or something) to free up resources
So you can simply store events with Rails, have a background job create the aggregates from another fast script (cron + SQL), read the aggregates with Rails, and use yet another background script for raw event deletion.
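A rough sketch of that pipeline, also answering the "do I just interact with raw SQL" part of the question: write events through one thin model, then aggregate and purge with raw SQL from a scheduled task. Table and column names are examples only, and the INTERVAL syntax assumes PostgreSQL.

```ruby
# app/models/event.rb -- a single thin model for the raw event stream
class Event < ApplicationRecord; end # events(kind, user_id, created_at)

# lib/tasks/rollup.rake -- nightly cron/rake task
task rollup_events: :environment do
  conn = ActiveRecord::Base.connection

  # Fold yesterday's raw events into a daily_event_counts table.
  conn.execute(<<~SQL)
    INSERT INTO daily_event_counts (day, kind, total)
    SELECT DATE(created_at), kind, COUNT(*)
    FROM events
    WHERE created_at >= CURRENT_DATE - INTERVAL '1 day'
      AND created_at <  CURRENT_DATE
    GROUP BY DATE(created_at), kind
  SQL

  # Drop raw events older than a week to keep the table small.
  conn.execute("DELETE FROM events WHERE created_at < CURRENT_DATE - INTERVAL '7 days'")
end
```

This keeps the ad-hoc inspection window (a week of raw events) while the aggregate table grows slowly and stays cheap to query.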
Also... Rails and performance don't quite go hand in hand usually ;)

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
etc.
I'm wondering what are the cleanest and most efficient strategies for aggregating the data. I can think of a couple but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce the impact on the database. The fewer round trips and the less extraneous data stored, the better.
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real-world experience with this, and if so, which approach would you suggest, and what are the positives and negatives? If there is a better solution that I am not thinking of, I'd love to hear it...
Thanks
JP
I needed to do something similar in a recent project. We implemented a full audit system in a secondary database; it tracks changes to every record in the live DB. Essentially every insert, update and delete actually updates 2 records, one in the live DB and one in the audit DB.
Since we have this data in realtime in the audit DB, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. It's duplicating data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3am; ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily run, so it's not that many records, once again all from the secondary database.
I agree that you should create a separate database for your statistics; it will reduce the impact on your main database.
You can go with your idea of having "staging" tables and "aggregate" tables; that way, if you want to access the near-real-time data you go to the staging tables, and when you want historical data, you go to the aggregates.
Finally, I would recommend you use an asynchronous call to save your statistics; that way saving them will not affect your page response times.
I suggest that you create a separate database for this. The best way is to use BI techniques; SQL Server has separate services for BI.
