Rails Database and Model for Income and Expenses - ruby-on-rails

I'm new to Rails and am trying to figure out how to create models to track income and expenses in my app. Should I:
1) Create one model and database table called Finance, and then set a field called "type" to either income or expense, then continue with description, amount, date?
2) Or should I create two models and two tables called Income and Expenses, each with description, amount, and date?
I intend to use this data to allow photographers to track income and expenses related to their business. So for example when the photographer books an appointment they can associate income and expenses with that appointment. They can also see a report which shows monthly income, expenses, and profit.

I would say go with one table and use STI (i.e. use the type field).
Both income and expenses are inherently the same thing, just the "direction" of the operation is different. So to me it makes sense to use the same data model, with exceptions hidden in specific subtypes.
Now as for the issues mentioned in the other answer:
Ordering both items at the same time becomes easy with one table. It will be painful with two.
When you index your table properly, it doesn't matter whether it's one table or two. With an index on the type column, a lookup for one type touches about as many rows as it would in a separate table, so performance is not that different. Aggregation will be easier and faster with one table as well.
Table locking is not an issue unless you use a storage engine that locks whole tables (like MyISAM), which you should not be doing.
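For illustration, here is a minimal plain-Ruby sketch of the single-table idea. It is not ActiveRecord code: every entry has the same shape, and a kind field plays the role of the type column. The names Entry, monthly_report, amount_cents and the sample data are assumptions made up for this example. It shows how the monthly income, expenses, and profit report from the question falls out of one collection.

```ruby
require "date"

# Hypothetical stand-in for a row of one shared table.
# kind is :income or :expense (the role the answer gives to the type column).
Entry = Struct.new(:kind, :description, :amount_cents, :date)

# Group entries by calendar month and total each direction.
def monthly_report(entries)
  by_month = entries.group_by { |e| e.date.strftime("%Y-%m") }
  by_month.sort_by { |month, _rows| month }.map do |month, rows|
    income   = rows.select { |e| e.kind == :income }.sum(&:amount_cents)
    expenses = rows.select { |e| e.kind == :expense }.sum(&:amount_cents)
    [month, { income: income, expenses: expenses, profit: income - expenses }]
  end.to_h
end

# Made-up sample data, amounts in integer cents.
SAMPLE_ENTRIES = [
  Entry.new(:income,  "Wedding shoot", 150_000, Date.new(2024, 1, 10)),
  Entry.new(:expense, "Lens rental",    20_000, Date.new(2024, 1, 12)),
  Entry.new(:expense, "Studio hire",    35_000, Date.new(2024, 2, 3)),
]

monthly_report(SAMPLE_ENTRIES).each do |month, totals|
  puts "#{month}: #{totals}"
end
```

In a Rails app the grouping would presumably be pushed down to the database, but the point stands: with one shape for both directions, the report is a single pass over one set of rows.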

It's basically just a question of preference. You can do all database queries with one or two tables (using UNION). So I'd prefer two tables to have a cleaner model structure. Imagine you want to save an income entry:
One Table: You always have to set the type
Two Tables: You just have to choose the right model
But I can imagine one database query that could(!) be faster using only one table:
If you want to ORDER both types let's say by date.
And there's another point where one table is better, but that doesn't apply to your model:
If there's an infinite number of types. Or: If the number of types can change.
For everything else two separate tables are better. Concerning query performance:
If the tables get really huge and you for example want to retrieve all income entries, it's of course faster to look up those entries in a table with 300000 entries than in a table with 600000 entries.
With a deeper look at the DBMS there's another reason for using two tables:
Table locking. Some database engines lock whole tables for write operations. Thus only half of the data would get locked and the other half can still be accessed at the same time.
I will have a look at the ORDER thing with two tables. Maybe I'm wrong and the performance impact isn't even there.
Results:
I've created three simple tables (using MySQL):
inc: id (int, PK), money (int, not null)
exp: id (int, PK), money (int, not null)
combi: id (int, PK), type (tinyint, index, not null), money (int, not null)
And then filled the tables with random data:
money: from 1 to 10000
type: from 1 to 2
inc: 100000 entries
exp: 100000 entries
combi: 200000 entries
Run these queries:
SELECT id, money
FROM combi
WHERE money > 5000
ORDER BY money
LIMIT 200000;
0,1 sec ... without index: 0,1 sec
SELECT * FROM (
SELECT id, money FROM inc WHERE money > 5000
UNION
SELECT id, money FROM exp WHERE money > 5000
) a
ORDER BY money LIMIT 200000;
0,16 sec
SELECT id, money
FROM combi
WHERE money > 5000 AND type = 1
ORDER BY money
LIMIT 200000;
0,14 sec ... without index: 0,085 sec
SELECT id, money
FROM inc
WHERE money > 5000
ORDER BY money
LIMIT 200000;
0,04 sec
And you can see the expected results:
when you need income and expenses in one query, then one table is faster
when you need only income OR expenses, then two tables are faster
But what I don't understand: why is the query with type = 1 so much slower? I thought using the index would make it nearly as fast.

Related

Should I use a model archive in rails

I have a model product with a has_many relation prices. The prices table is growing rapidly, only few current prices are normally needed, but I want to keep all as a history.
So I am thinking of archiving all old prices. What is the best way to do that?
Previously I had a column old and filtered on it whenever I only wanted the current prices. But now the prices table has 2.5 million rows and only 200k are needed in most situations. That's why I thought I would create a new model, price_archive, copy all "old" prices to price_archive, and delete them from prices. All logic would move to a module used by both models, so I can use price and price_archive in the same way.
Pros for the archive approach:
Most of the queries are done on the smaller data set (200k, not growing much)
Cons:
Displaying both ordered by time requires sorting a combined data set, because the times overlap. So it looks like (part.prices.to_a + part.prices_archive.to_a).sort_by(&:time). Not a big problem, because this will be used very seldom. But:
I have other models (e.g. order) that use prices in a belongs_to relation, so those need price_id and price_archive_id (with one id always being nil), so that they still reference a price.
Most queries are: show all prices for product (in a select box) and mark the price that is connected to this order (or add it to the select box, when it is archived)
So the code would be something like:
Order.where(*where*).includes(:price, :price_archive, :part => :prices)
The db will query: prices WHERE part_id = ? [on 200k] + prices WHERE id = ? [on 200k] + price_archives WHERE id = ? [on 2300k, but with primary_key]
instead of prices WHERE part_id = ? [on 2500k, with normal index]
Is there a better way or should I stay with the old column?
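A minimal plain-Ruby sketch of the two costs listed above, assuming hypothetical PriceRow and OrderRow structs in place of the real models. Merging both sets by time needs sort_by (sort with &:time would raise, because sort yields two elements to the block), and every order lookup has to check two ids. All names and sample values here are made up for illustration.

```ruby
# Hypothetical stand-ins for rows of prices / price_archives and orders.
PriceRow = Struct.new(:id, :amount, :time)
OrderRow = Struct.new(:id, :price_id, :price_archive_id)

# Combine current and archived prices into one list ordered by time.
def prices_by_time(current, archived)
  (current + archived).sort_by(&:time)
end

# Resolve the price an order points at; exactly one of the two ids is set.
def price_for(order, current_by_id, archived_by_id)
  if order.price_id
    current_by_id.fetch(order.price_id)
  else
    archived_by_id.fetch(order.price_archive_id)
  end
end

# Made-up sample rows with overlapping times.
CURRENT_PRICES  = [PriceRow.new(2, 11.0, Time.utc(2024, 3, 1)),
                   PriceRow.new(3, 12.0, Time.utc(2024, 1, 15))]
ARCHIVED_PRICES = [PriceRow.new(1, 10.0, Time.utc(2024, 2, 1))]

puts prices_by_time(CURRENT_PRICES, ARCHIVED_PRICES).map(&:id).inspect
```

The branch in price_for is the part that would spread to every model referencing a price, which is worth weighing against the smaller table.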

Dealing with multiple fact tables concerning related processes in dimensional modeling

I have the following scenario where OLTP sales data is stored in two separate physical tables:
Sales
Refunds/Cancellations
A refund always refers to an existing sale (thus 'negating' it), though the dimensions of these tables are nearly the same (date, sales clerk, store, etc.). The data schema looks something like the following:
CREATE TABLE sale
(
sale_id uuid,
transaction_at timestamp with time zone,
store_id uuid,
clerk_id uuid,
clerk_number bigint,
currency character varying(3),
pos_id uuid,
total numeric,
net_total numeric
);
CREATE TABLE refund
(
refund_id uuid,
sale_id uuid, -- referenced sale to void
refunded_at timestamp with time zone,
pos_id uuid,
clerk_id uuid,
clerk_number bigint
);
I am trying to figure out how to model this data for ingestion into a DW. Since I am relatively new to dimensional modeling I have begun reading The Data Warehouse Toolkit, but I am as of now unsure of the best approach to handle this case.
To my mind, these are two separate fact tables describing two different business processes (i.e. making a sale and getting a refund), though due to normalization concerns the refunds table (besides containing most of the same dimensions) is basically a pointer to a row in the sales table (which is fine for OLTP).
Analytical reports down the line would obviously want to look at these in a few ways:
All net sales per dimension (gross sales minus refunds)
All refund amounts per dimension
Other potential business use cases
As is, the first two cases would require joining the fact tables to either subtract the sales amount (case 1) or to get the information on refund amounts (case 2).
The approach that seems to make the most sense for me is something like the following (via some ETL/ELT process):
Load the (gross) sales data into a table in the DW
Load and denormalize refund data into a table in the DW, joining actual sale data so that amounts etc. are located in the fact table
Join either table with common conformed dimensions for further roll-ups and querying
This makes sense to me because:
Both fact tables have all required information from the physical event, and
There is no explicit dependency between the fact tables, and
Common dimensions can be reused
However, in this case, I still would not be able to get the net sales without joining these two tables. This makes me think that there should be a separate net_sale fact table, but this is problematic:
From a business point of view, sales without refunds are the vast majority of events that occur. A net_sale table would copy basically 99% of all sale data.
From a business process point of view, this table would describe an event that does not exist as such (there is no "net sale", only an aggregated view of sales amount per dimension minus refund costs).
Skimming the third chapter of The Data Warehouse Toolkit, I do not see this case mentioned explicitly (though there might be some parallels w.r.t. factless fact tables and derived facts). What kind of approach would work in a case like this?
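One way to look at the net-sales concern, sketched in plain Ruby under stated assumptions: if the refund fact is denormalized as proposed, each fact table can be aggregated separately by the shared dimension and the totals combined afterwards, without a row-level join or a separate net_sale table. SaleFact, RefundFact, the cents fields, and the sample rows are hypothetical; the refund's store_id and amount are assumed to have been copied from the referenced sale during ETL.

```ruby
# Hypothetical fact rows; amounts are integer cents to avoid float rounding.
SaleFact   = Struct.new(:sale_id, :store_id, :total_cents)
# Denormalized refund: store and amount assumed copied from the sale during ETL.
RefundFact = Struct.new(:refund_id, :sale_id, :store_id, :refunded_cents)

# Sum one amount field per store; missing stores default to 0.
def totals_by_store(rows, amount_field)
  rows.each_with_object(Hash.new(0)) do |row, acc|
    acc[row.store_id] += row[amount_field]
  end
end

# Aggregate each fact table on its own, then subtract per store.
def net_sales_by_store(sales, refunds)
  gross    = totals_by_store(sales, :total_cents)
  refunded = totals_by_store(refunds, :refunded_cents)
  (gross.keys | refunded.keys).map { |store| [store, gross[store] - refunded[store]] }.to_h
end

# Made-up sample data.
SALES = [
  SaleFact.new("s1", "store-a", 10_000),
  SaleFact.new("s2", "store-a",  5_000),
  SaleFact.new("s3", "store-b",  7_500),
]
REFUNDS = [RefundFact.new("r1", "s2", "store-a", 5_000)]

puts net_sales_by_store(SALES, REFUNDS).inspect
```

The same pattern applies to any dimension both tables share (date, clerk, and so on), which is what reusing the common dimensions buys you.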

How to store data in fact table with multiple products in an order in data warehouse

I am trying to design a dimensional model for a data warehouse for one of my projects (Sales Order). I'm new to this concept.
So far, I could understand that the product, customer and date can be stored in the dimension table and the order info will be in the fact table.
Date_dimension table structure will be
date_dim_id, date, week_number, month_number
Product_dimension table structure will be
product_dim_id, product_name, desc, sku
Order_fact table structure will be
order_id, product_dim_id(fk), date_dim_id(fk), order_quantity, order_total_price, etc
If an order is placed with two or more products, will there be repeated entries in the order_fact table for the same order_id and date_dim_id?
Please help on this. I'm confused here. I know that in a relational database, order table will have one entry per order and relation between the product and order will be maintained in a different table having the order_id and product_id as the foreign key.
Thanks in advance.
This is a classic case where you should (probably) have two fact tables:
FactOrderHeader and FactOrderDetail.
FactOrderHeader will have a record for each order, storing information regarding the value of the order and any order-level discounts, though those could be expressed as an OrderDetail record in some cases.
FactOrderDetail will have a record for each order line, storing information regarding the product, product cost, product sale price, number of items, item discount, etc.
You may need to have a DimOrderHeader as well, if there are non-Fact pieces of information that you want to store, for example, date the order was taken, delivered, paid.
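To make the header/detail split concrete, here is a small plain-Ruby sketch. OrderLine, order_headers, the cents field, and the sample rows are hypothetical names for illustration only. It shows that at order-line grain the same order_id and date_dim_id do repeat, once per product, and that header rows can be derived by rolling the lines up.

```ruby
# Hypothetical detail-grain row: one per product on an order.
OrderLine = Struct.new(:order_id, :product_dim_id, :date_dim_id,
                       :quantity, :line_total_cents)

# Made-up sample: order 1001 has two products, so two detail rows.
LINES = [
  OrderLine.new(1001, 1, 20240105, 2, 4_000),
  OrderLine.new(1001, 7, 20240105, 1, 1_500),
  OrderLine.new(1002, 1, 20240106, 1, 2_000),
]

# Roll detail rows up to one header row per order.
def order_headers(lines)
  lines.group_by(&:order_id).map do |order_id, rows|
    { order_id: order_id,
      date_dim_id: rows.first.date_dim_id,
      line_count: rows.size,
      order_total_cents: rows.sum(&:line_total_cents) }
  end
end

order_headers(LINES).each { |header| puts header.inspect }
```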

Granularity in Star Schema leads to multiple values in Fact Table?

I'm trying to understand star schema at the moment & struggling a lot with granularity.
Say I have a fact table that has session_id, user_id, order_id, product_id and I want to roll-up to sessions by user by week (keeping in mind that not every session would lead to an order or a product & the DW needs to track the sessions for non-purchasing users as well as those who purchase).
I can see no reason to track order_ids or session_ids in the fact table so it would become something like:
week_date, user_id, total_orders, total_sessions ...
But how would I then track product_ids if a user makes more than one purchase in a week? I assume I can't keep multiple product ids in an array (eg: "20/02/2012","5","3","PR01,PR32,PR22")?
I'm thinking it may have to be kept at 'every session' level but that could lead to a very large amount of data. How would you implement granularity for an example such as above?
Dimensional modelling requires dimensions as well as facts.
You need a Date/Calendar dimension, which includes columns like this:
calendar (id,cal_date,cal_year,cal_month,...)
The "grain" of your fact table is the key to data storage. If you have transactions, then the transaction should be the grain, and you store one row per transaction. Use proper (integer) surrogate keys to your dimensions, and your table won't be as large as you fear.
Now you can write a query like this, to sum sales of product by year:
select product_name, cal_year, sum(purchase_amount)
from fact_whatever
inner join calendar on calendar.id = fact_whatever.calendar_id
inner join product on product.id = fact_whatever.product_id
group by product_name, cal_year

Rating System Database Structure

I have two entity groups: Restaurants and Users. Restaurants can be rated (1-5) by users, and the rating from each user should be retrievable.
Restaurant(id, name, ..... , total_number_of_votes, total_voting_points )
User (id, name ...... )
Rating (id, restaurant_id, user_id, rating_value)
Do I need to store the average value so that it does not have to be calculated every time? Which table is the best place to store avg_rating, total_no_of_votes, and total_voting_points?
Well, if you store the average value somewhere; it will only be accurate as of the last time you calculated it. (i.e. you have 5 reviews; then store the averages somewhere. You get 5 more new reviews, and then your saved average is incorrect).
My opinion is that this sort of logic is perfectly suited to a middle-tier. Calculating an average shouldn't be very resource intensive, and really shouldn't impact performance.
If you really want to store it in the database; I would probably store them in their own table, and update those values via triggers. However, this could be even more resource intensive than calculating it in the middle-tier.
Some databases, for example PostgreSQL, allow you to store an array as part of a row, e.g.
create table restaurants (
...,
ratings integer[],
...
);
So you could, for example, keep the last 5 ratings in the same row as the restaurant. When you get a new rating, shuffle the old ratings left, and add the new rating at the end, then calculate the average.
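A rough plain-Ruby sketch of both ideas, with hypothetical names (RatingSummary, add, recent_average): the running totals from the question's schema (total_number_of_votes, total_voting_points) give the overall average without rescanning the ratings, and a fixed-size window keeps only the last 5 ratings as described above. In a real app the totals would have to be updated in the same transaction as the new Rating row; that part is assumed and not shown.

```ruby
# Hypothetical in-memory stand-in for the per-restaurant summary columns.
class RatingSummary
  attr_reader :total_votes, :total_points, :recent

  def initialize(window: 5)
    @window = window
    @total_votes = 0
    @total_points = 0
    @recent = []
  end

  # Record one rating: bump the running totals and slide the window.
  def add(rating)
    raise ArgumentError, "rating must be 1..5" unless (1..5).cover?(rating)
    @total_votes += 1
    @total_points += rating
    @recent.push(rating)
    @recent.shift if @recent.size > @window  # drop the oldest rating
    self
  end

  # Overall average from the running totals; nil when there are no votes.
  def average
    @total_votes.zero? ? nil : @total_points.fdiv(@total_votes)
  end

  # Average of the last few ratings only.
  def recent_average
    @recent.empty? ? nil : @recent.sum.fdiv(@recent.size)
  end
end

summary = RatingSummary.new
[5, 4, 3, 5, 5, 1].each { |r| summary.add(r) }
puts "overall: #{summary.average.round(2)}, recent: #{summary.recent_average}"
```

Stored totals stay exact as long as every insert, update, and delete of a rating adjusts them; the window trades that exactness for a cheap "recent" figure.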

Resources