I would appreciate some advice on how to structure a database for the following scenario:
I'm using Ruby on Rails, and so I have the following tables:
Products are manufactured in batches, so each product item has a Batch code, so I think I will also need a table of batches, which refers to a product type.
In the real world, Salespeople take Product items (from a specific Batch) and in due course issue them to a Store. Importantly, Batches are large and may be spread across many Salespeople and, subsequently, many Stores.
At some future date, I would like to run the following reports:
Show all Batches of a Product issued to a specific Store.
Show all Batches held by a Salesperson (i.e. not yet sold).
Now, I'm assuming I need to build a table of Transactions, something like,
batch_id (through which the product can be determined)
typeOfTransaction (whether the Salesperson has obtained some stock, or sold some stock)
By dynamically running through a table of Transaction records, I could derive the above reports. However, this seems inefficient and likely to become increasingly slow over time.
My question is: what is the best way to keep track of transactions like this, preferably without requiring dynamic processing of all transactions to derive the total number of items from a batch given to a given store?
I don't believe I can just keep a central record of stock, as Product comes in Batches, and Batches are distributed by Salespeople across Stores.
Thank you.

Believe it. :-)
In my experience, the only correct way to store this kind of data is to break it down into something akin to T-ledger accounting, i.e. debits and credits against a chart of accounts. It requires dynamic processing to derive totals, as you've found out, but anything short of that will lead to tricky queries when dealing with reports and audit trails.
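As a rough sketch of the ledger idea, here is a small Python example using the standard-library sqlite3 module. The table name, the column names and the holder labels ('warehouse', 'sp:alice', 'store:12') are assumptions made up for illustration; they are not from the question. Every transfer is written as two balanced rows, and both reports become SUM queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stock_movements (
        id       INTEGER PRIMARY KEY,
        batch_id INTEGER NOT NULL,
        holder   TEXT    NOT NULL,   -- hypothetical account label
        qty      INTEGER NOT NULL    -- positive = received, negative = handed over
    )
""")

def transfer(batch_id, source, dest, qty):
    """Record one movement as two balanced rows (double entry)."""
    conn.executemany(
        "INSERT INTO stock_movements (batch_id, holder, qty) VALUES (?, ?, ?)",
        [(batch_id, source, -qty), (batch_id, dest, qty)],
    )

def holdings(holder):
    """Return {batch_id: quantity} for batches the holder still has stock of."""
    rows = conn.execute(
        "SELECT batch_id, SUM(qty) FROM stock_movements "
        "WHERE holder = ? GROUP BY batch_id HAVING SUM(qty) <> 0 "
        "ORDER BY batch_id",
        (holder,),
    )
    return dict(rows.fetchall())

# Sample data: Alice takes stock from two batches and issues some to a store.
transfer(1, "warehouse", "sp:alice", 100)
transfer(1, "sp:alice", "store:12", 30)
transfer(2, "warehouse", "sp:alice", 50)
transfer(2, "sp:alice", "store:12", 50)

held_by_alice = holdings("sp:alice")     # batches not yet issued
issued_to_store = holdings("store:12")   # batches issued to the store
```

Because each transfer nets to zero, the sum over the whole table is always zero, which gives a cheap consistency check for audits.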
You can speed things up significantly by maintaining partial or complete aggregate balances using triggers (e.g. monthly stock movements per store). This reduces the number of rows you need to sum when running larger queries. Which of these you'll want to maintain depends on your app and your reporting requirements.
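To illustrate the trigger approach, here is a minimal sketch in Python with sqlite3. It keeps a complete running balance per batch and holder rather than a monthly one, and all table, column and trigger names are hypothetical. The trigger syntax shown is SQLite's; other databases spell this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock_movements (
        batch_id INTEGER NOT NULL,
        holder   TEXT    NOT NULL,
        qty      INTEGER NOT NULL
    );
    CREATE TABLE stock_balances (
        batch_id INTEGER NOT NULL,
        holder   TEXT    NOT NULL,
        qty      INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (batch_id, holder)
    );
    -- Keep the balance row in step with every inserted movement.
    CREATE TRIGGER movements_after_insert
    AFTER INSERT ON stock_movements
    BEGIN
        INSERT OR IGNORE INTO stock_balances (batch_id, holder, qty)
        VALUES (NEW.batch_id, NEW.holder, 0);
        UPDATE stock_balances
           SET qty = qty + NEW.qty
         WHERE batch_id = NEW.batch_id AND holder = NEW.holder;
    END;
""")

def balance(batch_id, holder):
    """Read the pre-aggregated balance: one row, no summing."""
    row = conn.execute(
        "SELECT qty FROM stock_balances WHERE batch_id = ? AND holder = ?",
        (batch_id, holder),
    ).fetchone()
    return row[0] if row else 0

conn.executemany(
    "INSERT INTO stock_movements (batch_id, holder, qty) VALUES (?, ?, ?)",
    [(1, "sp:alice", 100), (1, "sp:alice", -30), (1, "store:12", 30)],
)
```

The movements table stays the source of truth, so the balances can be rebuilt from it at any time if they are ever in doubt.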


Is InfluxDB a good choice to aggregate a certain number of events that happened during a timeframe?

I'm trying out InfluxDB to find out whether my use case fits.
My app generates a bunch of events like product created, product deleted, product purchased, payment received, category created, etc. Each event has some other properties, such as who created the product or what the payment method was.
I want to know how many products were purchased, how many payments were made using a specific payment method, or how many payments were made in a day, up to now, or within a specified time range. The same goes for all the other events, like payment, shipping, etc. I am yet to understand the concept of a TSDB. Every example I see has some value that varies, e.g. temperature 23, 30, 23, 35, 24, 33 and so on. In my app each event has a value of 1, since each event contributes one unit of that event.
Is InfluxDB a good choice for this use case? If yes, how would I model my data for use cases like this?
You could use a TSDB for your e-commerce analysis, but it might be better to try a data warehouse instead, especially if your data volume grows rapidly.
TSDBs are best used with time series data plus time series analysis. For example, a TSDB fits if you care when the shopping cart is filled and emptied but don't care that much about what is in the shopping cart.
Your use case is more like OLAP, and you could check out ClickHouse.
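To make the OLAP-style modelling concrete, here is a small sketch in Python. It uses the standard-library sqlite3 module purely as a stand-in for a warehouse or column store; the table layout, column names and sample rows are all assumptions. Each event is one row, so "value of 1" simply becomes COUNT(*) with filters on type, property and time range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_type     TEXT NOT NULL,
        payment_method TEXT,            -- NULL for non-payment events
        occurred_at    TEXT NOT NULL    -- ISO-8601 timestamp
    )
""")
# Made-up sample rows.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("payment_received",  "card",   "2024-01-01T09:00:00"),
    ("payment_received",  "paypal", "2024-01-01T10:30:00"),
    ("payment_received",  "card",   "2024-01-02T08:15:00"),
    ("product_purchased", None,     "2024-01-02T08:15:00"),
])

def count_events(event_type, start=None, end=None, payment_method=None):
    """Count events of one type, optionally within [start, end) and by method."""
    sql = "SELECT COUNT(*) FROM events WHERE event_type = ?"
    params = [event_type]
    if start is not None:
        sql += " AND occurred_at >= ?"
        params.append(start)
    if end is not None:
        sql += " AND occurred_at < ?"
        params.append(end)
    if payment_method is not None:
        sql += " AND payment_method = ?"
        params.append(payment_method)
    return conn.execute(sql, params).fetchone()[0]

def count_per_day(event_type):
    """Daily totals for one event type, as (day, count) pairs."""
    rows = conn.execute(
        "SELECT substr(occurred_at, 1, 10) AS day, COUNT(*) "
        "FROM events WHERE event_type = ? GROUP BY day ORDER BY day",
        (event_type,),
    )
    return rows.fetchall()
```

For example, count_events("payment_received", payment_method="card") answers "how many payments used this method", and passing start and end restricts the count to a time window.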

Implementing offline Item based recommendation using Mahout

I am trying to add recommendations to our e-commerce website using Mahout. I have decided to use an item-based recommender. I have around 60K products, 200K users and 4M user-product preferences. I am looking for a way to provide recommendations by calculating the item similarities offline, so that the recommender.recommend() method would provide results in under 100 milliseconds.
// FileDataModel takes a java.io.File rather than a String path
DataModel dataModel = new FileDataModel(new File("/FilePath"));
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel, _itemSimilarity));
I was hoping someone could point me to a method or a blog post to help me understand the procedure and challenges of computing the item similarities offline. Also, what is the recommended procedure for storing the pre-computed item similarities: should they be stored in a separate DB, or in memcache?
PS - I plan to refresh the user-product preference data every 10-12 hours.
MAHOUT-1167 introduced into (the soon to be released) Mahout 0.8 trunk a way to calculate similarities in parallel on a single machine. I'm just mentioning it so you keep it in mind.
If you are just going to refresh the user-product preference data every 10-12 hours, you are better off having a batch process that stores the precomputed recommendations somewhere and then delivering them to the end user from there. I cannot give detailed information or advice, because this will vary greatly according to many factors, such as your current architecture, software stack, network capacity and so on. In other words, in your batch process, just run over all your users and ask for 10 recommendations for every one of them, then store the results somewhere to be delivered to the end user.
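The batch idea can be sketched in a few lines of plain Python. This is not Mahout code: it is a toy that computes Tanimoto similarities over boolean preferences and scores candidates by summing similarities, which is a simplification and not necessarily what Mahout does internally. The function names and the tiny data set are made up, and the all-pairs loop would need a smarter approach at 60K items.

```python
from collections import defaultdict
from itertools import combinations

def tanimoto_similarities(prefs):
    """prefs: {user: set(items)} -> {(item_a, item_b): similarity}, item_a < item_b."""
    users_by_item = defaultdict(set)
    for user, items in prefs.items():
        for item in items:
            users_by_item[item].add(user)
    sims = {}
    for a, b in combinations(sorted(users_by_item), 2):
        both = len(users_by_item[a] & users_by_item[b])
        if both:
            either = len(users_by_item[a] | users_by_item[b])
            sims[(a, b)] = both / either   # |A and B| / |A or B|
    return sims

def recommend(user_items, all_items, sims, how_many=10):
    """Rank items the user lacks by summed similarity to items the user has."""
    scores = {}
    for candidate in all_items - user_items:
        score = sum(
            sims.get((min(candidate, i), max(candidate, i)), 0.0)
            for i in user_items
        )
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda item: (-scores[item], item))[:how_many]

def precompute(prefs, how_many=10):
    """The batch job: one stored recommendation list per user."""
    sims = tanimoto_similarities(prefs)
    all_items = set().union(*prefs.values())
    return {
        user: recommend(items, all_items, sims, how_many)
        for user, items in prefs.items()
    }

prefs = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
    "u4": {"A"},
}
store = precompute(prefs)   # in practice, written to a DB or cache
```

At request time the service only does a key lookup such as store["u4"], which is what keeps the response comfortably inside a tight latency budget.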
If you need a response within 100 milliseconds, it's better to do batch processing in the background on your server. That may include the following jobs.
Fetching data from your own user database (60K products, 200K users and 4M user-product preferences).
Prepare your data model based on the nature of your data (number of parameters, size of data, preference values and a lot more). This could be an important step.
Run the algorithm on the data model (you need to choose the right algorithm according to your requirements). At this point the recommendation data is available.
May need to process the resultant data as per the requirement.
Store this data into a database (It is NoSQL in all my projects)
The above steps should be running periodically as a batch process.
Whenever a user requests recommendations, your service provides a response by reading the recommendation data from the pre-calculated DB.
You may look at Apache Mahout (for recommendations) for this kind of task.
These are the steps in brief. Hope this helps!

calculating lots of statistics on database user data: optimizing performance

I have the following situation (in Rails 3): my table contains financial transactions for each user (users can buy and sell products). Since lots of such transactions occur, I present statistics related to the current user on the website, e.g. current balance, overall profit, how many products were sold/bought overall, averages, etc. (the same also on a per-month/per-year basis instead of overall). Parts of this information are displayed to the user on many forms/pages so that the user can always see his current account information (different statistics are displayed on different pages, though).
My question is: how can I optimize database performance (and is it worth it)? Surely, if the user is just browsing, there is no need to re-calculate all of the values every time a new page is loaded unless a change to the underlying database has been made?
My first solution would be to store these statistics in their own table and update them once a financial transaction has been added/edited (in Rails maybe using :after_update ?). Taking this further, if, for example, a new transaction has been made, then I can just modify the average instead of re-calculating the whole thing.
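For what it's worth, "just modify the average" works if the statistics row stores the count and the running total rather than the average itself. Here is a minimal Python sketch of that bookkeeping; the class and method names are hypothetical, and in the real app these fields would live in the statistics table and be updated from the model callback.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    count: int = 0
    total: float = 0.0

    @property
    def average(self):
        # Derived on read, so it can never drift from count and total.
        return self.total / self.count if self.count else 0.0

    def add(self, amount):
        """Call when a transaction is created."""
        self.count += 1
        self.total += amount

    def replace(self, old_amount, new_amount):
        """Call when an existing transaction is edited."""
        self.total += new_amount - old_amount

stats = RunningStats()
for amount in (10.0, 20.0, 30.0):
    stats.add(amount)
average_before_edit = stats.average
stats.replace(30.0, 60.0)   # the third transaction is edited
average_after_edit = stats.average
```

Each update touches one row and does constant work, regardless of how many transactions the user has.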
My second idea would be to use some kind of caching (if this is possible?), or to store these values in the session object.
Which one is the preferred/recommended way, or is all of this a waste of time as the current largest number of financial transactions is in the range of 7000-9000?
You probably want to investigate summary tables, also known as materialized views.

Using statistical tables with Rails

I'm building an app that needs to store a fair amount of events that the users carry out. (Think LOTS as in millions per month).
I need to report on these events (total of type x in the last month, etc.) and need something resilient and fast.
I've toyed with Redis etc to store aggregates of the data, but this could just mean that I'm building up a massive store of single figure aggregates that aren't rebuildable.
Whilst this isn't a bad solution, I'm looking at storing the raw event data in tables that I can then query on a needs basis, and potentially generate aggregate counters on a periodic basis. This would thus give me the ability to add counters over time, and also carry out ad-hoc inspections on what is going on, something which aggregates don't allow.
The question is, what is the best way to do this? I obviously don't want to have to create a model for each table (which is what Rails would prefer), so do I just create the tables and interact with raw SQL on a needs basis, or is there some other choice for dealing with this sort of data?
I've worked on an app that had that type of data flow, and the solution is the following:
-> store everything
-> create aggregates
-> delete everything after a short period (1 week or something) to free up resources
So you can simply store events with Rails, create the aggregates in the background from another fast script (cron + SQL), read the aggregates with Rails, and run yet another background script to delete the raw events.
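A rough sketch of that flow, in Python with sqlite3 rather than cron + SQL. The table and function names are made up, and for brevity it folds the aggregate step and the delete step into one job that rolls up and removes everything older than a cutoff, which is a simplification of running them as separate scripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        event_type  TEXT NOT NULL,
        occurred_at TEXT NOT NULL      -- ISO-8601 timestamp
    );
    CREATE TABLE daily_counts (
        day        TEXT    NOT NULL,
        event_type TEXT    NOT NULL,
        n          INTEGER NOT NULL,
        PRIMARY KEY (day, event_type)
    );
""")

def roll_up(cutoff):
    """Fold raw events older than cutoff into daily_counts, then delete them."""
    with conn:  # one transaction: aggregate and delete succeed or fail together
        rows = conn.execute(
            "SELECT substr(occurred_at, 1, 10), event_type, COUNT(*) "
            "FROM events WHERE occurred_at < ? GROUP BY 1, 2",
            (cutoff,),
        ).fetchall()
        for day, event_type, n in rows:
            cur = conn.execute(
                "UPDATE daily_counts SET n = n + ? "
                "WHERE day = ? AND event_type = ?",
                (n, day, event_type),
            )
            if cur.rowcount == 0:
                conn.execute(
                    "INSERT INTO daily_counts (day, event_type, n) "
                    "VALUES (?, ?, ?)",
                    (day, event_type, n),
                )
        conn.execute("DELETE FROM events WHERE occurred_at < ?", (cutoff,))

# Made-up sample data: three old events and one recent one.
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("login",    "2024-01-01T10:00:00"),
    ("login",    "2024-01-01T11:00:00"),
    ("purchase", "2024-01-01T12:00:00"),
    ("login",    "2024-01-09T09:00:00"),
])
conn.commit()
roll_up("2024-01-08")
```

Running the job twice with the same cutoff is harmless, because the second run finds no raw rows left to fold in.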
Also, Rails and performance don't usually go hand in hand ;)

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
I'm wondering what are the cleanest and most efficient strategies for aggregating the data. I can think of a couple but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce the impact on the database. The fewer round trips and the less extraneous data stored, the better.
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real-world experience with this and, if so, which approach would you suggest, and what are the positives and negatives? If there is a better solution that I am not thinking of, I'd love to hear it.
I needed to do something similar in a recent project. We implemented a full audit system in a secondary database, which tracks changes to every record in the live DB. Essentially, every insert, update and delete actually updates two records: one in the live DB and one in the audit DB.
Since we have this data in real time in the audit DB, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. It's duplicating data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3am, ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily run, so it's not that many records, once again all from the secondary database.
I agree that you should create a separate database for your statistics; it will reduce the impact on your main database.
You can go with your idea of having "Staging" tables and "Aggregate" tables; that way, if you want to access the near-real-time data you go to the staging table, and when you want historical data you go to the aggregates.
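Here is a small in-memory sketch of that split, in Python. The class, its method names and the sample dates are hypothetical, and plain Python containers stand in for the staging and aggregate tables. A read combines the aggregated count with whatever is still sitting in staging, so figures stay near real time between aggregation runs.

```python
from collections import Counter
from datetime import date

def iso_week(d):
    """Bucket a date into an ISO week label such as '2024-W01'."""
    year, week, _ = d.isocalendar()
    return f"{year}-W{week:02d}"

class ViewStats:
    def __init__(self):
        self.staging = []                  # raw rows not yet aggregated
        self.by_product_week = Counter()   # the "aggregate table"

    def record(self, product_id, viewed_on):
        """Cheap write path: append to staging only."""
        self.staging.append((product_id, viewed_on))

    def flush(self):
        """The scheduled job: fold staging rows into the aggregates."""
        for product_id, viewed_on in self.staging:
            self.by_product_week[(product_id, iso_week(viewed_on))] += 1
        self.staging.clear()

    def views(self, product_id, week):
        """Aggregated history plus rows still waiting in staging."""
        pending = sum(
            1 for p, d in self.staging
            if p == product_id and iso_week(d) == week
        )
        return self.by_product_week[(product_id, week)] + pending

stats = ViewStats()
stats.record(42, date(2024, 1, 1))
stats.record(42, date(2024, 1, 7))
stats.flush()                        # nightly aggregation has run
stats.record(42, date(2024, 1, 3))   # arrives after the job
stats.record(42, date(2024, 1, 8))
week1_views = stats.views(42, "2024-W01")
week2_views = stats.views(42, "2024-W02")
```

The "by product" breakdown is the same structure with the week dropped from the key.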
Finally, I would recommend you use an asynchronous call to save your statistics; that way, saving them will not affect the response time of your pages.
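The asynchronous save can be sketched with a queue and a background worker. This is Python rather than ASP.NET, and the names (track, saved) are made up; the list stands in for the statistics database. The request path only enqueues, and the worker does the slow write.

```python
import queue
import threading

stats_queue = queue.Queue()
saved = []   # stand-in for the statistics database

def worker():
    """Drain the queue in the background and persist each item."""
    while True:
        item = stats_queue.get()
        if item is None:             # sentinel: stop the worker
            stats_queue.task_done()
            break
        saved.append(item)           # the slow database write would go here
        stats_queue.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()

def track(event):
    """Called from the request path; returns without waiting for the write."""
    stats_queue.put(event)

track({"product_id": 42, "kind": "view"})
track({"product_id": 42, "kind": "search_hit"})
stats_queue.join()   # only for the demo: wait until both items are saved
```

A single worker preserves the order of events; an in-process queue like this loses anything unsaved if the process dies, which is the trade-off against a durable queue such as MSMQ.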
I suggest that you create a separate database for this. The best way is to use BI techniques; SQL Server has separate services for BI.
