Keep historical database relations integrity when data changes - ruby-on-rails

I hesitate between various alternatives when it comes to relations that have "historical" value.
For example, let's say a User has bought an item at a certain date. If I just store this the classic way, like:
transaction_id: 1
user_id: 2
item_id: 3
created_at: 01/02/2010
Then obviously the user might change their name, the item might change its price, and three years later when I try to create a report of what happened I will have false data.
I have two alternatives:
keep it simple as shown above, but use something like https://github.com/airblade/paper_trail and do something like:
t = Transaction.find(1)
u = t.user.version_at(t.created_at)
create tables like transaction_users and transaction_items and copy the users/items into these tables when a transaction is made. The structure would then become:
transaction_id: 1
transaction_user_id: 2
transaction_item_id: 3
created_at: 01/02/2010
Both approaches have their merits, though solution 1 looks much simpler... Do you see a problem with solution 1? How is this "historical data" problem usually solved? I have to solve this for 2-3 models like this in my project; what do you reckon would be the best solution?

Taking the example of the Item price, you could also:
Store a copy of the price at the time in the transaction table
Create a temporal table for item prices
Storing a copy of the price in the transaction table:
TABLE Transaction(
user_id -- User buying the item
,trans_date -- Date of transaction
,item_no -- The item
,item_price -- A copy of Price from the Item table as-of trans_date
)
Getting the price as of the time of transaction is then simply:
select item_price
from transaction;
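In Rails terms, that copy can be taken with a callback when the transaction is created (a minimal sketch; the Transaction/Item models and the item_price column are assumptions matching the table above):
class Transaction < ApplicationRecord
  belongs_to :user
  belongs_to :item

  before_create :snapshot_item_price

  private

  # Copy the item's current price onto the transaction so that later
  # price changes on Item cannot rewrite history.
  def snapshot_item_price
    self.item_price ||= item.price
  end
end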
Creating a temporal table for item prices:
TABLE item (
item_no
,etcetera -- All other information about the item, such as name, color
,PRIMARY KEY(item_no)
)
TABLE item_price(
item_no
,from_date
,price
,PRIMARY KEY(item_no, from_date)
,FOREIGN KEY(item_no)
REFERENCES item(item_no)
)
The data in the second table would look something like:
ITEM_NO FROM_DATE  PRICE
======= ========== =====
A       2010-01-01   100
A       2011-01-01    90
A       2012-01-01    50
B       2013-03-01    60
This says that from the first of January 2010 the price of item A was 100. It changed on the first of January 2011 to 90, and then again to 50 from the first of January 2012.
You will most likely add a TO_DATE to the table, even though it's a denormalization (the TO_DATE is the next FROM_DATE).
Finding the price as of the transaction would be something along the lines of:
select t.item_no
,t.trans_date
,p.price
from transaction t
join item_price p on(
t.item_no = p.item_no
and t.trans_date between p.from_date and p.to_date
);
ITEM_NO TRANS_DATE PRICE
======= ========== =====
A       2010-12-31   100
A       2011-01-01    90
A       2011-05-01    90
A       2012-01-01    50
A       2012-05-01    50
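The same as-of join through ActiveRecord might look like this (a sketch, assuming Rails-conventional plural table names transactions and item_prices, with the to_date column added as described):
Transaction
  .joins("JOIN item_prices p ON p.item_no = transactions.item_no
          AND transactions.trans_date BETWEEN p.from_date AND p.to_date")
  .select("transactions.item_no, transactions.trans_date, p.price")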

I went with PaperTrail; it keeps the history of all my models, even their destruction. I can always switch to solution 2 later on if it doesn't scale.
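For anyone following the same route, the PaperTrail side is a one-line model change (a minimal sketch; note that in recent PaperTrail versions the as-of lookup lives on the paper_trail proxy rather than directly on the model):
class User < ApplicationRecord
  has_paper_trail # every create/update/destroy is recorded in the versions table
end

t = Transaction.find(1)
u = t.user.paper_trail.version_at(t.created_at) # the User as they were at purchase time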

Related

Rails Postgres, select users with relation and group them based on users starting time

I have Users and Checkpoints tables; each User can make multiple Checkpoints per day.
I want to aggregate how many Checkpoints have been taken each day in the past 6 months, based on each user's starting point, to create a graph showing the average user's Checkpoints within their first x months.
For example:
if user1 started on January 1st, user2 started on March 15th, and user3 started on July 6th, those would each be considered day 1; I would want the data from each of their day 1 even though they occur at different periods of time.
Here is the current query I came up with, but unfortunately it returns data based on a fixed time window for all of the users:
SELECT dates.date AS date,
checkpoints_count,
checkpoints_users
FROM (SELECT DATE_TRUNC('DAY', ('2000-01-01'::DATE - offs))::DATE AS date
-- 180 = 6 month in days
FROM GENERATE_SERIES(0, 180) AS offs) AS dates
LEFT OUTER JOIN (
SELECT
checkpoints_date::DATE AS date,
COUNT(id) AS checkpoints_count,
COUNT(user_id) AS checkpoints_users
FROM checkpoints
WHERE user_id in (1, 2, 3)
AND checkpoints_date::DATE
BETWEEN '2000-01-01'::DATE AND '2000-06-01'::DATE
GROUP BY checkpoints_date::DATE
) AS ck
ON dates.date = ck.date
ORDER BY dates.date;
EDIT
Here is a SQL example that works (if I understand what you are looking for; your SQL seems really complicated for what you are asking, but I'm not a SQL expert):
SELECT t1.*
FROM checkpoints AS t1
WHERE t1.user_id IN (1, 2, 3)
AND t1.id = (SELECT t2.id
FROM checkpoints AS t2
WHERE t2.user_id = t1.user_id
ORDER BY t2.check_date ASC
LIMIT 1)
Since this is tagged Ruby on Rails, I'll give you a Rails answer.
If you know your user IDs (or use some other query to get them), you have:
user_ids = [1, 2, 3]
first_checkpoints = []
user_ids.each do |id|
  first_checkpoints << Checkpoint.where(user_id: id).order(:date).first
end
# returns an array of the first checkpoint of each user in the list
This assumes a column in the checkpoints table called date. You didn't give a table structure for the two tables, so this answer is a bit general. There might be a pure ActiveRecord answer, but this will get you what you want.
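Since the question is tagged Postgres, the loop above can also be collapsed into one query with DISTINCT ON (a sketch, still assuming a date column on checkpoints):
user_ids = [1, 2, 3]
# one row per user_id, keeping the earliest checkpoint (Postgres-only)
first_checkpoints = Checkpoint
  .where(user_id: user_ids)
  .select("DISTINCT ON (user_id) *")
  .order(:user_id, :date)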

Only return one record per hour over a time period in Rails

I have written a Rails 4 app that accepts and plots sensor data. Sometimes there are 10 points per hour (but this number is not fixed). I'm plotting the data and doing a simple query of Point.all to get all the data points.
In order to reduce the query size, I would like to only return one record per hour. It doesn't matter which record is returned. The first record each hour using the created_at field would be fine.
How do I construct a query to do this?
You can get the first one, but maybe the average value is better. All you need to do is group by hour. I am not 100% sure about the SQLite syntax, but something in this sense:
connection.execute("SELECT AVG(READING_VALUE) FROM POINTS GROUP BY STRFTIME('%Y%m%d%H', CREATED_AT)")
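Through ActiveRecord, the same per-hour average can be expressed without a raw execute (a sketch, assuming the column is reading_value and SQLite's STRFTIME as above):
# => a hash mapping each hour bucket to its average reading
Point.group("STRFTIME('%Y%m%d%H', created_at)").average(:reading_value)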
Inspired from this answer, here is an alternative which retrieves the latest record in that hour (if you don't want to average):
Point.from(
Point.select("max(unix_timestamp(created_at)) as max_timestamp")
.group("HOUR(created_at)") # subquery
)
.joins("INNER JOIN points ON subquery.max_timestamp = unix_timestamp(created_at)")
This will result in the following query:
SELECT `points`.*
FROM (
SELECT max(unix_timestamp(created_at)) as max_timestamp
FROM `points`
GROUP BY HOUR(created_at)
) subquery
INNER JOIN points ON subquery.max_timestamp = unix_timestamp(created_at)
You can also use MIN instead to get the first record of the hour.

Get all comments by a set of users made in the 24 hours prior to their last comment

I have a many-to-one relationship between the models User and Comment.
I would like to collect, preferably in a hash, all comments made by a set of users in the 24 hours prior to each user's last comment (including his/her last comment).
This is what I have come up with, but I don't know how to restrict the hash to comments from the time span mentioned.
Comment.order('updated_at desc').where(user_id: array_of_users_ids).group_by(&:user).each do |user, comments|
# rearrange the hash here?
end
The SQL solution for this would be a correlated subquery, made slightly tricky by non-standard date arithmetic depending on the RDBMS in use.
You'd be looking for a query such as:
select ...
from comments
where comments.user_id in (...) and
comments.updated_at >= (
select max(updated_at) - interval '1 day'
from comments c2
where c2.user_id = comments.user_id)
You ought to be able to achieve this with:
Comment.where(user_id: array_of_users_ids).
where("comments.updated_at >= (
select max(updated_at) - interval '1 day'
from comments c2
where c2.user_id = comments.user_id)").
order(... etc
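Feeding that relation into the question's group_by then yields the hash per user (a sketch; the order clause comes from the question's own code):
comments_by_user = Comment
  .where(user_id: array_of_users_ids)
  .where("comments.updated_at >= (select max(updated_at) - interval '1 day'
          from comments c2 where c2.user_id = comments.user_id)")
  .order('updated_at desc')
  .group_by(&:user) # => { user => [their comments from the 24h before their last one] }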

Change Data Capture with table joins in ETL

In my ETL process I am using Change Data Capture (CDC) to discover only rows that have been changed in the source tables since the last extraction. Then I do the transformation only for these rows. The problem is when I have, for example, 2 tables which I want to join into one dimension, and only one of them has changed. For example, I have the tables Countries and Towns as follows:
Countries:
ID Name
1 France
Towns:
ID Name Country_ID
1 Lyon 1
Now let's say a new row is added to the Towns table:
ID Name Country_ID
1 Lyon 1
2 Paris 1
The Countries table has not been changed, so CDC for these tables shows me only the new row from the Towns table. The problem is when I do the join between Countries and Towns: there is no row in the Countries change set, so the join will result in an empty set.
Do you have an idea how to solve it? Of course there might be more difficult cases, involving 3 and more tables, and consequential joins.
This is a typical problem found when doing realtime Change Data Capture, or even incremental-only daily changes.
There are multiple ways to solve this.
One way would be to do your joins on the natural keys in the dimension or mapping table, to get the associated country (SELECT distinct country_name, [..other attributes..] from dim_table where country_id = X).
Another alternative would be to do the join as part of the change capture process - when a row is loaded to towns, a trigger goes off that loads the foreign key values into the associated staging tables (country, etc).
There is a lot I could babble on about for more information, but I will be specific to what is in your question. I would suggest the following to get the results...
1st pass: everything that matches via the join...
UNION ALL
2nd pass: all towns where there isn't a country
(a left outer join with a where condition that
requires the ID in the countries table to be null/missing).
You would default the Country ID value in that unmatched join to something designated as an "unmatched value"; typically 0 or -1 is used, or a series of standard negative numbers that you can assign descriptions to later, to identify why the data is bad. For your example, -1 could be "Found Town Without Country".
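A sketch of those two passes run through the Rails connection (the staging table names towns_delta and countries are assumptions, as is -1 as the unmatched default):
ActiveRecord::Base.connection.execute(<<~SQL)
  -- 1st pass: changed towns whose country exists
  SELECT t.id, t.name, c.id AS country_id
  FROM towns_delta t
  JOIN countries c ON c.id = t.country_id
  UNION ALL
  -- 2nd pass: changed towns without a country, defaulted to -1 ("Found Town Without Country")
  SELECT t.id, t.name, -1 AS country_id
  FROM towns_delta t
  LEFT OUTER JOIN countries c ON c.id = t.country_id
  WHERE c.id IS NULL
SQL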

Logic and code help

Here are my models and associations:
User has many Awards
Award belongs to User
Prize has many Awards
Award belongs to Prize
Let's pretend that there are four Prizes (captured as records):
Pony
Toy
Gum
AwesomeStatus
Every day a User can be awarded one or more of these Prizes, but the User can only receive each Prize once per day. If the User wins AwesomeStatus, for example, a record is added to the Awards table with a FK to User and Prize. Obviously, if the User doesn't win the AwesomeStatus for the day, no record is added.
At the end of the day (before midnight, let's say), I want to return a list of Users who lost their AwesomeStatus. (Of course, to lose your AwesomeStatus, you had to have it the day before.) Unfortunately, in my case, I don't think observers will work, so I will have to rely on a script. Regardless, how would you go about determining which Users lost their AwesomeStatus? Note: don't make your solution overly dependent on the period of time -- in this case a day. I want to maintain flexibility in how many times per whatever period Users have an opportunity to win the prize (and to lose it).
I would probably do something like this:
The class Award should also have a column awarded_at which contains the day the prize was awarded. So when it is time to create the award, it can be done like this:
# This will make sure that no award will be created if it already exists for the current date
@user.awards.find_or_create_by_prize_id_and_awarded_at(@prize.id, Time.now.strftime("%Y-%m-%d"))
And then we can have a scope to load all users with an award that will expire today and no active awards for the supplied prize.
# user.rb
scope :are_losing_award, lambda { |prize_id, expires_after|
joins("INNER JOIN awards AS expired_awards ON users.id = expired_awards.user_id AND expired_awards.awarded_at = '#{(Time.now - expires_after.days).strftime("%Y-%m-%d")}'
LEFT OUTER JOIN awards AS active_awards ON users.id = active_awards.user_id AND active_awards.awarded_at > '#{(Time.now - expires_after.days).strftime("%Y-%m-%d")}' AND active_awards.prize_id = #{prize_id}").
where("expired_awards.prize_id = ? AND active_awards.id IS NULL", prize_id)
}
So then we can call it like this:
# Give me all users who got the prize three days ago and have not gotten it again since
User.are_losing_award(@prize.id, 3)
There might be some ways to write the scope better with ARel queries or something, I'm no expert with that yet, but this way should work until then :)
I'd add an integer "time period" field to awards, which stands for a given period of time (day, week, 5 hour period, whatever you want).
Now, you can search the awards table for users who have the award status at t-1, but not at t:
SELECT prev.user_id
FROM awards prev
LEFT OUTER JOIN awards current ON prev.user_id = current.user_id
AND prev.prize_id = current.prize_id
AND current.time_period = 1000
WHERE prev.prize_id = 1
AND current.prize_id IS NULL
AND prev.time_period = 999
Just use updated_at, or add an awarded_at as suggested above, and use it like this:
scope :awarded, proc {|date| where(["updated_at <= ?", date])}
In your Award model. Print it like this, maybe:
awesome_status = Prize.find_by_name('AwesomeStatus')
p "Users who do not have AwesomeStatus anymore:"
User.all.each {|user| p user.username if user.awards.awarded(1.day.ago).collect(&:prize_id).include?(awesome_status.id)}
If you want it to be dynamic, displayed somewhere, etc., throw a lasts_for into Prize and compare against it, and simply write a maintenance cronjob that sets an active boolean on Award to false instead of deleting the association.
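That maintenance job could be a small rake task along these lines (a sketch; the lasts_for number of days on Prize and the active flag on Award are the hypothetical columns suggested above):
namespace :awards do
  task expire: :environment do
    Award.where(active: true).includes(:prize).find_each do |award|
      # deactivate rather than destroy, so the history stays queryable
      award.update(active: false) if award.awarded_at < award.prize.lasts_for.days.ago.to_date
    end
  end
end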
