calculating lots of statistics on database user data: optimizing performance - ruby-on-rails

I have the following situation (in Rails 3): my table contains financial transactions for each user (users can buy and sell products). Since lots of such transactions occur, I present statistics related to the current user on the website, e.g. current balance, overall profit, how many products were sold/bought overall, averages, etc. (the same also on a per-month/per-year basis instead of overall). Parts of this information are displayed to the user on many forms/pages so that the user can always see his current account information (different bits of the statistics are displayed on different pages, though).
My question is: how can I optimize database performance (and is it worth it)? Surely, if the user is just browsing, there is no need to re-calculate all of the values every time a new page is loaded unless a change to the underlying database has been made?
My first solution would be to store these statistics in their own table and update them once a financial transaction has been added/edited (in Rails, maybe using an :after_update callback?). Taking this further, if, for example, a new transaction has been made, then I can just update the average incrementally instead of re-calculating the whole thing.
My second idea would be to use some kind of caching (if this is possible?), or to store these values in the session object.
Which one is the preferred/recommended way, or is all of this a waste of time given that the current largest number of financial transactions is in the range of 7,000-9,000?
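A minimal sketch of the first solution (a statistics table kept current by a callback), assuming illustrative names: a user_statistics table with balance, transactions_count and average_amount columns defaulting to 0:

    class Transaction < ActiveRecord::Base
      belongs_to :user
      # Fold each new transaction into the stored statistics instead of
      # re-reading every row. Edits would need similar handling, e.g. an
      # :after_update that applies the difference between old and new amounts.
      after_create :update_user_statistics

      private

      def update_user_statistics
        stats = UserStatistic.find_or_create_by_user_id(user_id)
        new_count = stats.transactions_count + 1
        stats.update_attributes(
          balance:            stats.balance + amount,
          average_amount:     (stats.average_amount * stats.transactions_count + amount) / new_count,
          transactions_count: new_count
        )
      end
    end

Reading the statistics then becomes a single indexed lookup per user, no matter how many transactions exist.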

You probably want to investigate summary tables, also known as materialized views.
This link may be helpful:
http://wiki.postgresql.org/wiki/Materialized_Views
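If you are on PostgreSQL 9.3 or later you can create the summary as a native materialized view straight from a migration and refresh it when transactions change; on older versions, the wiki page above describes maintaining an equivalent summary table by hand. A rough sketch with illustrative table and column names:

    class CreateUserStatistics < ActiveRecord::Migration
      def up
        execute <<-SQL
          CREATE MATERIALIZED VIEW user_statistics AS
          SELECT user_id,
                 SUM(amount) AS balance,
                 COUNT(*)    AS transactions_count,
                 AVG(amount) AS average_amount
          FROM transactions
          GROUP BY user_id;
        SQL
      end

      def down
        execute "DROP MATERIALIZED VIEW user_statistics;"
      end
    end

    # Refresh from a callback or a scheduled job after transactions change:
    #   ActiveRecord::Base.connection.execute("REFRESH MATERIALIZED VIEW user_statistics")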

Related

Rails saving and/or caching complicated query results

I have an application that, at its core, is a sort of data warehouse and report generator. People use it to "mine" through a large amount of data with ad-hoc queries, produce a report page with a bunch of distribution graphs, and click through those graphs to look at specific result sets of the underlying items being "mined." The problem is that the database is now many hundreds of millions of rows of data, and even with indexing, some queries can take longer than a browser is willing to wait for a response.
Ideally, at some arbitrary cutoff, I want to "offline" the user's query: perform it in the background, save the result set to a new table, and use a job to email the user a link that uses this cached result to skip directly to rendering the graphs in the browser. These jobs/results could be saved for a long time in case people wanted to revisit the particular problem they were working on, or emailed to coworkers. I would be tempted to just create a PDF of the result, but it's the interactive clicking of the graphs that I'm trying to preserve here.
None of the standard Rails caching techniques really captures this idea, so maybe I just have to do this all by hand, but I wanted to check to see if I wasn't missing something that I could start with. I could create a keyed model result in the in-memory cache, but I want these results to be preserved on the order of months, and I deploy at least once a week.
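A rough sketch of the "offline the query" idea. OfflineReportJob, ReportResult and ReportMailer are hypothetical placeholders; the queueing mechanism is whatever you already run (Delayed Job, Resque, etc.):

    class OfflineReportJob
      def initialize(user_id, query_sql)
        @user_id   = user_id
        @query_sql = query_sql
      end

      def perform
        rows = ActiveRecord::Base.connection.select_all(@query_sql)
        result = ReportResult.create!(
          user_id:    @user_id,
          query_sql:  @query_sql,
          payload:    rows.to_yaml,          # lives in a table, so it survives deploys
          expires_at: 6.months.from_now
        )
        # Email a link such as /reports/results/:id; the controller reloads
        # the stored payload and renders the interactive graphs from it.
        ReportMailer.report_ready(@user_id, result.id).deliver
      end
    end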
The data is being loaded through lots of joined tables, and that is why it takes time to load.
You are also performing calculation/visualization work on the data you fetch from the DB before showing it in the UI.
I would recommend some of the following approaches to your problem:
Minimize the number of joins/nested-join DB queries.
Add some denormalized tables/columns. For example, if you are showing the count of a user's comments, you can add a new column to the users table to store that count itself, and keep it up to date with a scheduled job or a callback (see the sketch below).
Also try to minimize any calculations performed on the UI side.
You can also use lazy loading to fetch the data in chunks.
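For the comment-count example above, Rails' built-in counter cache already does most of the work. A minimal sketch, assuming users has an integer comments_count column defaulting to 0:

    class Comment < ActiveRecord::Base
      belongs_to :user, counter_cache: true   # keeps users.comments_count in sync
    end

    # user.comments_count is now a plain column read instead of a COUNT(*) query.
    # A scheduled job can repair any drift:
    #   User.find_each { |u| User.reset_counters(u.id, :comments) }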
Thanks, I hope this helps you decide where to start 🙂

Getting Granular Data from Google Analytics to enable Machine Learning applications

In the context of Google Analytics, I wonder if I can get granular data for an account in the form of a table (or multiple tables that could be joined) containing all relevant information collected per user and then per session.
For each user there should be rows describing in detail the activities and outcomes --micro and macro-- of each session. Features would include source, time of visit, duration of visit, pages visited, time per page, goal conversions etc.
Having the row-level data in a granular form would enable me to apply machine learning algorithms that would help me explore the data and optimize decisions (web design, budget allocation, bidding).
This is possible, though not by default. You will need to set up custom dimensions to be able to identify individual clients, sessions, and timestamps, so that you can get row-level user data rather than pre-aggregated data. A good place to start is https://www.simoahava.com/analytics/improve-data-collection-with-four-custom-dimensions/
There is no way to collect all data per user in one simple query. You will need to run multiple queries, pivot tables, etc. and merge them to get the full dataset you are currently envisaging.
Beyond the problem you currently have, there is then also the problem of downloading the data.
1) There is a 10,000-row limit per request, so you will need to write a loop to download all available rows.
2) Depending on your traffic, you are likely to encounter sampled data, so you will need to download the data per day, or even per hour, to avoid Google Analytics sampling. (Both points are sketched below.)
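A rough sketch of both points. fetch_report is a hypothetical helper wrapping whatever Google Analytics client and authentication you already use; start-index and max-results mirror the Core Reporting API's paging parameters, and the date range is arbitrary:

    require 'date'

    all_rows = []
    (Date.new(2017, 1, 1)..Date.new(2017, 1, 31)).each do |day|   # one query window per day
      start_index = 1
      loop do
        rows = fetch_report(
          start_date:  day.to_s,
          end_date:    day.to_s,
          start_index: start_index,
          max_results: 10_000
        )
        all_rows.concat(rows)
        break if rows.size < 10_000   # last page for this day
        start_index += 10_000
      end
    end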

Datawarehousing, is it important to track historical data (SQL Server)?

We started designing a process for detecting changes in our ERP database for creating data warehouse databases. Since they don't like placing triggers on the ERP databases, or even enabling CDC (SQL Server), we are thinking of reading the changes from the databases that get replicated to a central repository through transactional replication, and then keeping an extra copy that will merge the changes (we will have CDC on the extra copy)...
I wonder whether data that changes within, let's say, 15 minutes is important enough to warrant a change in our design. The way we plan to design this, we would not be able to track every single change; we would only get the latest value after a period of time. For example, if a value on a row changes from A to B, and then 1 minute later from B to C, the replication system will bring only that last value to the central repository. We will then merge the table with our extra copy (that extra copy might have had the value A; it will be updated with C, and we will have lost the value B).
Is there a good scenario in a data warehouse database where you need to track ALL changes a table has gone through ?
Taking care of historical data in a DW is important in some cases such as:
When the dimension value changes. Say a supplier merges with another and changes its commercial name.
When the fact table uses calculations derived from information outside the fact table that changes, for example a currency conversion rate.
When you need to run queries that reflect fact information in previous periods (versions of the fact table).
An example where every change matters would be a bank account's balance, a warehouse item count, a stock price, etc.
For your particular case, you should check with your customer how the system will be used and what exactly its benefits are, and design accordingly. How granularly the changes should be captured (every hour, day, etc.) is primarily your customer's call.
Some techniques for handling dimension data changes are described in Kimball's Slowly Changing Dimensions.
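To make the Kimball reference concrete, here is a rough Type 2 slowly-changing-dimension update, written in ActiveRecord style only because the rest of this page is Rails-oriented; the column names (supplier_key, valid_from, valid_to, current) are assumptions:

    # Illustrative Type 2 update: close the current row and insert a new
    # version instead of overwriting it, so history is preserved.
    class SupplierDimension < ActiveRecord::Base
      def self.change_name(supplier_key, new_name)
        transaction do
          current_row = where(supplier_key: supplier_key, current: true).first
          current_row.update_attributes(valid_to: Time.now, current: false)
          create!(
            supplier_key: supplier_key,
            name:         new_name,
            valid_from:   Time.now,
            valid_to:     nil,
            current:      true
          )
        end
      end
    end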
In direct answer to your question: depends on the application.
Examples:
The value is the description field of an item in some inventory, where the items themselves do not change (i.e. item ID X is always a sparkly-thingy). In this case, saving short-lived descriptions is probably not required.
The value is the last reading of a temperature sensor. If it goes over a certain value, action is taken to bring the temperature back down. In this case you certainly need to save each and every change.
This raises three points:
The second case, where every single change is required, shows very bad design. Such a system would surely insert new values with a timestamp into a table rather than updating a single value.
Bad designs do exist. Hence:
The amount of data being warehoused depends on the nature of the data.
a. Will you be able to derive any intelligence from your warehoused data?
b. Will you be able to know based on changes at the database level what happened at the business level?
c. What happens to your data when the database schema changes because you upgraded the ERP product?
I'm wondering whether saving a log of changes on the table level is usable. You might be able to reverse engineer what a set of changes means and then save that to the warehouse, or actually get the ERP to "tell" you what it has done and save those changes.

Database design - recording transactions

I would appreciate some advice on how to structure a database for the following scenario:
I'm using Ruby on Rails, and so I have the following tables:
Products
Salespeople
Stores
Products are manufactured in batches, so each product item has a Batch code; I think I will therefore also need a table of batches, each of which refers to a product type.
Batch
In the real world, Salespeople take Product items (from a specific Batch) and in due course issue it to a Store. Importantly, Batches are large, and may be spread across many Salespeople, and subsequently, Stores.
At some future date, I would like to run the following reports:
Show all Batches of a Product issued to a specific Store.
Show all Batches held by a Salesperson (i.e. not yet sold).
Now, I'm assuming I need to build a table of Transactions, something like the following (a possible migration is sketched below):
Transaction
salesperson_id
batch_id (through which the product can be determined)
store_id
typeOfTransaction (whether the Salesperson has obtained some stock, or sold some stock)
quantity
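In migration form, that table might look roughly like this (a sketch; typeOfTransaction is renamed transaction_type to fit Rails conventions, and the indexes are assumptions):

    class CreateTransactions < ActiveRecord::Migration
      def change
        create_table :transactions do |t|
          t.references :salesperson
          t.references :batch             # the product is reachable through the batch
          t.references :store             # nil until the stock is issued to a store
          t.string     :transaction_type  # e.g. "received" or "sold"
          t.integer    :quantity
          t.timestamps
        end
        add_index :transactions, [:batch_id, :store_id]
        add_index :transactions, :salesperson_id
      end
    end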
By dynamically running through a table of Transaction records, I could derive the above reports. However, this seems inefficient and, over time, increasingly slow.
My question is: what is the best way to keep track of transactions like this, preferably without requiring dynamic processing of all transactions to derive total items from a batch given to a given store.
I don't believe I can just keep a central record of stock, as Product comes in Batches, and Batches are distributed by Salespeople across Stores.
Thank you.
My question is: what is the best way to keep track of transactions like this, preferably without requiring dynamic processing of all transactions to derive total items from a batch given to a given store.
I don't believe I can just keep a central record of stock, as Product comes in Batches, and Batches are distributed by Salespeople across Stores.
Believe it. :-)
In my experience, the only correct way to store this kind of stuff is to break it down into something akin to T-ledger accounting, i.e. debit/credit with a chart of accounts. It requires dynamic processing to derive totals, as you've found out, but anything short of that will lead to tricky queries when dealing with reports and audit trails.
You can speed things up significantly, by maintaining partial or complete aggregate balances using triggers (e.g. monthly stock movements per store). This will reduce the number of rows you need to sum when running larger queries. Which of these you'll want to maintain will depend on your app and your reporting requirements.
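A sketch of the aggregate-balance idea, using an ActiveRecord callback rather than a database trigger; the model and column names (StockMovement, MonthlyStoreBalance, a signed quantity) are illustrative:

    # Raw movements stay append-only; a callback (or a database trigger)
    # maintains a monthly net balance per store. quantity is assumed to be
    # signed (receipts positive, issues negative).
    class StockMovement < ActiveRecord::Base
      after_create :update_monthly_balance

      private

      def update_monthly_balance
        summary = MonthlyStoreBalance.find_or_create_by_store_id_and_month(
          store_id, created_at.beginning_of_month.to_date
        )
        MonthlyStoreBalance.update_counters(summary.id, quantity: quantity)
      end
    end

    # Larger reports then sum a handful of monthly rows plus the current
    # month's raw movements, instead of every movement ever recorded.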

Recommendations on handling object status fields in rails apps: store versus calculate?

I have a rails app that tracks membership cardholders, and needs to report on a cardholder's status. The status is defined - by business rule - as being either "in good standing," "in arrears," or "canceled," depending on whether the cardholder's most recent invoice has been paid.
Invoices are sent 30 days in advance, so a customer who has just been invoiced is still in good standing, one who is 20 days past the payment due date is in arrears, and a member who fails to pay his invoice more than 30 days after it is due would be canceled.
I'm looking for advice on whether it would be better to store the cardholder's current status as a field at the customer level (and deal with the potential update anomalies resulting from potential updates of invoice records without updating the corresponding cardholder's record), or whether it makes more sense to simply calculate the current cardholder status based on data in the database every time the status is requested (which could place a lot of load on the database and slow down the app).
Recommendations? Or other ideas I haven't thought of?
One important constraint: while it's unlikely that anyone will modify the database directly, there's always that possibility, so I need to try to put some safeguards in place to prevent the various database records from becoming out of sync with each other.
The storage of calculated data in your database is generally an optimisation. I would suggest that you calculate the value on every request and then monitor the performance of your application. If the fact that this data is not stored becomes an issue for you, then that is the time to refactor and store the value in the database.
Storing calculated values, particularly those that depend on multiple tables, is generally a bad idea for the reasons that you have mentioned.
When/if you do refactor and store the value in the DB then you probably want a batch job that checks the value for data integrity on a regular basis.
The simplest approach would be to calculate the current cardholder status based on data in the database every time the status is requested. That way you have no duplication of data, and therefore no potential problems with the duplicates becoming out of step.
If, and only if, your measurements show that this calculation is causing a significant slowdown, then you can think about caching the value.
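For instance, the on-demand calculation could live in a single model method so every page asks the same question. A sketch assuming due_date and paid_at columns on invoices (the 30-day thresholds follow the business rule in the question; the column names are assumptions):

    class Cardholder < ActiveRecord::Base
      has_many :invoices

      def status
        invoice = invoices.order("due_date DESC").first
        return "in good standing" if invoice.nil? || invoice.paid_at.present?
        if Date.today <= invoice.due_date
          "in good standing"                       # invoiced but not yet due
        elsif Date.today <= invoice.due_date + 30
          "in arrears"                             # up to 30 days past due
        else
          "canceled"                               # more than 30 days past due
        end
      end
    end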
Recently I had a similar decision to make, and I decided to store the status as a field in the database. This is because I wanted to reduce SQL queries, and it looks simpler. I chose to do it that way because I will very often need this status, and calculating it is (at least in my case) a bit complicated.
A possible problem with this is that it can get out of sync, so I added some after_save and after_destroy callbacks to the child model to keep it synchronized. And of course, if somebody modified the database in a different way, it would cause some problems.
You can write a simple rake task that checks all statuses and, if needed, corrects them. You can run it from cron so you don't have to worry about it.
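A sketch of such a task, assuming the model stores a status column and exposes a hypothetical calculated_status method that recomputes it (for example, the logic from the sketch above):

    # lib/tasks/cardholders.rake
    namespace :cardholders do
      desc "Correct any stored statuses that have drifted out of sync"
      task fix_statuses: :environment do
        Cardholder.find_each do |cardholder|
          correct = cardholder.calculated_status
          if cardholder.status != correct
            cardholder.update_attribute(:status, correct)
            puts "Fixed cardholder ##{cardholder.id}: #{correct}"
          end
        end
      end
    end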
