Best practices for fact table that depends on two processes - data-warehouse

I am building a star schema for an online business. One of the key processes is email newsletter signup.
But the analysis depends on two processes, and I can't figure out the best way to model it.
Here's how the process works:
1. Person visits the website
2. Person fills out a web form and is recorded as a contact in our CRM
3. Person receives a link asking them to confirm that the email address is really theirs
4. Person clicks the link and is considered confirmed
5. Person can now receive emails from us
The signup and the confirmation take place at different times. Most people click the confirmation link on the same day, but we send two follow-up emails over the days after the signup, so some people only confirm their email a few days later.
On top of that, a person can sign up several times on the website. Most of our signups are people who exchange their email address in return for some sort of resource, like an eBook.
As long as the person's email is not marked as confirmed in our system, we ask the person to confirm on each signup.
Since we have multiple offers, it's not uncommon for a person to request eBook A, eBook B and eBook C and only confirm after several signups.
In the fact table, each signup for an email address that is not yet confirmed is marked with ConfirmationRequested -> True.
If the person clicks the confirmation link in ANY of the confirmation request emails, they should be considered confirmed for each of those signups.
How I want to analyse the data
- See how many signups we had
- See how many signups were re-signups and how many were new contacts in the CRM (new email address)
- See how many new contacts have confirmed their email address (and become full subscribers)
- See how many re-signups were asked to confirm their email and how many have done so
- Analyse how long it takes for people to confirm their email address
- Analyse the confirmation rate
- Filter contacts by their confirmation status and analyse what people who have or have not confirmed have in common
I don't really care about confirmations in isolation from signups.
And for my purposes I would like to have a ConfirmationStatus dimension that is...

- "Confirmed" if the person confirms within 7 days of sign up
- "Pending" if the person hasn't confirmed, but 7 days haven't passed since signup yet
- "Not Confirmed" if the person hasn't confirmed within 7 days (even if the person does confirm at some later point)
On top of that I usually look at this report on Mondays to analyse the previous week and compare it to other weeks. (I already have a working version of this report in a flat table, but I am trying to learn how to build proper star schemas.)
This has the additional challenge that contacts who signed up on a Sunday, for example, had less than a day to confirm. They would drag down the confirmation rate, and the latest week would look bad compared to previous weeks in which all contacts had the full 7 days to confirm.
So I calculate a "Confirmed within signup week" confirmation count and rate for all weeks to allow apples to apples comparisons.
How to model this...
I have considered the following options...
Option #1: Separate fact tables
Since these are separate processes that happen at separate times I have learned that I should create separate fact tables and then drill across common dimensions.
I could calculate signups that requested confirmations from the signup table and then calculate confirmations within a week of the signup through the contact and date dimensions.
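Roughly like this (a rough sketch only, not an existing schema; FactSignups, FactConfirmations, DimDate and their attribute names are placeholders):

-- Option #1 sketch: signups per week, plus how many were confirmed within 7 days,
-- by looking up confirmations for the same contact
SELECT d.YearWeek,
       COUNT(*) AS Signups,
       SUM(CASE WHEN EXISTS (
                  SELECT 1
                  FROM FactConfirmations c
                  WHERE c.ContactKey = s.ContactKey
                    AND c.ConfirmationDateTime >= s.SignupDateTime
                    AND c.ConfirmationDateTime < s.SignupDateTime + INTERVAL '7 days')
                THEN 1 ELSE 0 END) AS ConfirmedWithin7Days
FROM FactSignups s
JOIN DimDate d ON d.DateKey = s.SignupDateKey
GROUP BY d.YearWeek;
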
But that wouldn't allow me to filter the signups by confirmation status.
That's why I am considering...
Option #2: A fact table that combines both signups and confirmations
I am thinking of something like this:
Dim Signup Info

| Column                | Type |
|-----------------------|------|
| SignupInfoKey         | SK   |
| SignupType            | SCD1 |
| ConfirmationRequested | SCD1 |
| ConfirmationSucceeded | SCD1 |
| ConfirmationStatus    | SCD1 |

Dim Contact

| Column     | Type |
|------------|------|
| ContactKey | SK   |
| Name       | SCD1 |
| Email      | SCD1 |
| ...        |      |

Fact Signups

| Column               | Type |
|----------------------|------|
| SignupDateKey        | FK   |
| ConfirmationDate     | FK   |
| SignupInfoKey        | FK   |
| ContactKey           | FK   |
| SignupId             | DD   |
| SignupDateTime       | DD   |
| ConfirmationDateTime | DD   |
| Signups              | M    |
| NewContacts          | M    |
| ConfirmationMin      | M    |
| ConfirmationDays     | M    |
I need the ConfirmationDate in the fact table to calculate the "Confirmed Within Week" measures at report time (I am using Power BI and it's easy there). I could of course also create a "ConfirmedWithinWeek" dimension and then filter based on that, but it wouldn't be as flexible... What if I decide later on to look at the data on a daily or monthly basis, for example?
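For example, the weekly numbers could then be computed at report time roughly like this (a SQL sketch of what the Power BI measures do; YearWeek and WeekEndDate are assumed DimDate attributes):

-- Sketch: signups, confirmations within the signup week, and the comparable rate
-- (WeekEndDate = the Sunday that ends the signup week; rows with a NULL
--  ConfirmationDateTime fall through to the ELSE branch)
SELECT d.YearWeek,
       SUM(f.Signups) AS Signups,
       SUM(CASE WHEN f.ConfirmationDateTime < d.WeekEndDate + INTERVAL '1 day'
                THEN 1 ELSE 0 END) AS ConfirmedWithinSignupWeek,
       AVG(CASE WHEN f.ConfirmationDateTime < d.WeekEndDate + INTERVAL '1 day'
                THEN 1.0 ELSE 0.0 END) AS ConfirmationRateWithinWeek
FROM FactSignups f
JOIN DimDate d ON d.DateKey = f.SignupDateKey
GROUP BY d.YearWeek;
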
Another concern is that it will require reprocessing and updating the fact table for the past 7 days on each incremental load.
I know that's ok for dimensions, but is that ok for fact tables too?
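To make the concern concrete, each load would have to do something like this (Postgres-style sketch; stg_confirmations is a made-up staging table holding newly arrived confirmations):

-- Back-fill late-arriving confirmations onto open signup rows from the last 7 days
-- (the ConfirmationDate key and the SignupInfo key would be updated the same way)
UPDATE FactSignups AS f
SET ConfirmationDateTime = c.ConfirmationDateTime
FROM stg_confirmations AS c
WHERE c.ContactKey = f.ContactKey
  AND f.ConfirmationDateTime IS NULL
  AND f.SignupDateTime >= CURRENT_DATE - INTERVAL '7 days';
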
So my questions are
Is option #2 a good solution or is there a better way to do this?
Is it ok to update fact tables or is that discouraged?
Overall my question is: What am I missing?
This seems like a very common thing. One obvious example would be an order star that has fact table columns for AmountOrdered, AmountPaid, AmountRefunded and dimensions like "Order Status", "Paid Status" and "Refunded Status".
But none of my searches have resulted in answers to this common problem. Surely there must be a term for the problem and a pattern name for the solution where I can learn more about it?

Related

Using Crosstab to Generate Data for Charts

I'm trying to write an efficient query to create a view that will contain counts of the number of successful logins per day, broken down by type of user, with no duplicate users per day.
I have 3 tables involved in this query: one table that contains all successful login attempts, one table for standard user accounts, and one table for admin user accounts. All user_id values are unique across the entire database, so no user account will share the same user_id with an admin account:
TABLE 1: user_account
user_id | username
--------|----------
1       | user1
2       | user2
TABLE 2: admin_account
user_id | username
--------|----------
6       | admin6
7       | admin7
TABLE 3: successful_logins
user_id | timestamp
--------|------------------------------
1       | 2022-01-23 14:39:12.63798-07
1       | 2022-01-28 11:16:45.63798-07
1       | 2022-01-28 01:53:51.63798-07
2       | 2022-01-28 15:19:21.63798-07
6       | 2022-01-28 09:42:36.63798-07
2       | 2022-01-23 03:46:21.63798-07
7       | 2022-01-28 19:52:16.63798-07
2       | 2022-01-29 23:12:41.63798-07
2       | 2022-01-29 18:50:10.63798-07
The resulting view I would like to generate would contain the following information from the above 3 tables:
VIEW: login_counts
date_of_login | successful_user_logins | successful_admin_logins
--------------|------------------------|-------------------------
2022-01-23    | 1                      | 1
2022-01-28    | 2                      | 2
2022-01-29    | 1                      | 0
I'm currently reading up on how crosstabs work but having trouble figuring out how to write the query based on my table setups.
I actually was able to get the values I needed by using the following query:
SELECT
    to_char(s.timestamp, 'YYYY-MM-DD') AS login_date,
    count(distinct u.user_id) AS successful_user_logins,
    count(distinct a.user_id) AS successful_admin_logins
FROM successful_logins s
LEFT JOIN user_account u ON u.user_id = s.user_id
LEFT JOIN admin_account a ON a.user_id = s.user_id
GROUP BY login_date
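Since the end goal is the login_counts view, that same query can simply be wrapped (only the date alias is renamed to match the view layout above):

CREATE VIEW login_counts AS
SELECT
    to_char(s.timestamp, 'YYYY-MM-DD') AS date_of_login,
    count(distinct u.user_id) AS successful_user_logins,
    count(distinct a.user_id) AS successful_admin_logins
FROM successful_logins s
LEFT JOIN user_account u ON u.user_id = s.user_id
LEFT JOIN admin_account a ON a.user_id = s.user_id
GROUP BY date_of_login;
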
However, I was told it would be even quicker using crosstabs, especially considering the successful_logins table contains millions of records. So I'm trying to also create a version of the query using crosstab and then compare the execution times of both.
Any help would be greatly appreciated. Thanks!
Turns out it isn't possible to do what I was asking about using crosstabs, so the original query I have will have to do.

Graph Modelling a 'transitive' relationship

This is a follow-up to an earlier question that I posted and for which I accepted an answer. After getting feedback I have a further question, so I am posting it as a new question in the hope of getting an answer.
Having discussed it with users, the requirement just got more complex. What they actually do is something like a table in the relational world with the following columns (it's denormalised with a lot of repetitive data):
| PartnerName | Service   | Offered? | CurrentlyUsing  | WeCouldSellThese |
|-------------|-----------|----------|-----------------|------------------|
| XX          | Baking    | Yes      | Competitor A, B | Product A        |
| XX          | Baking    | Yes      | Competitor A, B | Product C        |
| XX          | Baking    | Yes      | Competitor A, B | Product D        |
| XX          | OnlyDough | Yes      | Product A       | Product C        |
| XX          | Packing   | No       |                 | Product E        |
Basically, they need to store what is currently being used and whether or not the service is currently offered by the partner; either way they still try to sell products (Offered Yes or No will both still lead to a market). There is a many-to-many relationship between service and product as well, which means there is a "3-node" relationship: a particular partner, for a particular product, for a particular service. Here are the 2 options I'm thinking of. The trouble with Option 1 is that Product A would have many To_Build outgoing relationships, so I don't have a way to figure out which partner it is for.
Here are the options after I bring a new entity to split the relationship:
You can use an extra node (say, labelled "Build") to "reify" the "3-node relationship", linking the Build node to the partner, the service and the product involved.
By the way, you should also consider whether the Could_Offer relationship is redundant. For example, you could add an isOffered property to the Could_Build relationship and eliminate the Could_Offer relationship.

Unexpected behaviour with FireDAC Master-Detail relationships

I face a problem with FireDAC Master-Detail relationships.
FireDAC has two modes for M/D relationships, Parameter-Based and Range-Based: http://docwiki.embarcadero.com/RADStudio/Berlin/en/Master-Detail_Relationship_(FireDAC)
The first one uses parameters on every query to retrieve the corresponding details after every scroll; the second one first loads all the data into the datasets and sets the fields that define the master-detail relationship (filtering the details after every scroll on the master).
You can combine both methods, getting the advantages of each (queries returning a limited number of records, reduced traffic with the database, offline mode, ...).
It works nicely and fast, except when one of the details is empty. This seems to be the reason (quoted from the documentation):
Combining Methods
To combine both methods, an application should use both Parameters and Range-based setups and include fiDetails into FetchOptions.Cache. Then FireDAC at first uses range-based M/D. And if a dataset is empty, then FireDAC uses parameter-based M/D. The new queried records are appended to the internal records storage.
Also, you can use the TFDDataSet.OnMasterSetValues event handler to override M/D behavior.
Suppose you have
Master BILLS
+---------+------------+
| Bill_Id | Date |
+---------+------------+
| 1 | 01/01/2017 |
+---------+------------+
Detail LINES
+---------+---------+------------+
| Bill_Id | Line_Id | Concept |
+---------+---------+------------+
| 1 | 1 | Television |
| 1 | 2 | Computer |
+---------+---------+------------+
Subdetail TAXES
+---------+---------+-----+--------+
| Bill_Id | Line_Id | Tax | Import |
+---------+---------+-----+--------+
| 1 | 1 | 14% | 74.25 |
| 1 | 1 | 7% | 36.12 |
+---------+---------+-----+--------+
I have these 3 FDQuery components with parameters:
qryBills.SQL = 'select * from BILLS where Bill_Id = :Id';
qryLines.SQL = 'select * from LINES where Bill_Id = :Id';
qryTaxes.SQL = 'select * from TAXES where Bill_Id = :Id';
And the Master-Detail relationship is defined by range
qryLines.MasterFields = 'Bill_Id';
qryTaxes.MasterFields = 'Bill_Id;Line_Id';
If all the details contain records then everything is fine, but when a detail is empty (like in my example, where there are no taxes for Line #2), then when I scroll to that empty detail its query is re-launched (as the documentation says), duplicating the records of the non-empty details.
I mean:
1. I open the three datasets for Bill_Id #1.
2. Everything looks fine: I see the master record, Line #1 and its two taxes.
3. I move to the second line and it still looks fine; the taxes appear empty.
4. When I go back to the first line, I now see its two taxes twice.
5. If I go to the second line again and return to the first one, I will see its two taxes three times.
...
The problem is that every time I move to the second line, its subdetail is empty, so it relaunches the qryTaxes query, duplicating its entire content.
It's not uncommon to have empty details. Do you know of a way to prevent the detail query from being re-launched when that happens? I can't find one.
Thank you.

Removing duplicates in InfluxDB

I would like to perform a query to remove duplicates. What I define as a duplicate here is a measurement where we have more than one data point for the same timestamp. They will have different tags, so they are not overwritten by default, but I would like to remove the oldest inserted ones, regardless of the tags.
So for example, measurement of logins (it doesn't really make sense but it's to avoid using abstract entities):
Email   | Name    | TS        | Login Time
--------|---------|-----------|------------
a#a.com | Alice   | xxxxx1000 | 2017-05-19
a#a.com | Alice   | xxxxx1000 | 2017-05-18
a#a.com | Alice   | xxxxx1000 | 2017-05-17
b#b.com | Bob     | xxxxx1000 | 2017-05-18
c#c.com | Charlie | xxxxx1200 | 2017-05-19
I would like to remove the second and third lines: they have the same timestamp as the first and belong to the same measurement, but they have different login times, and I would like to keep only the last one.
I know well that I could solve this with a query, but the requirement is more complex than this (visualization in Grafana of weird KPI data) and I need to remove actual duplicates (generated and loaded twice).
Thank you.
You can fetch all login user names using group by and then order by time, so that the latest login time comes up first; then you can delete the remaining ones.
Also, you might need to copy your latest items to another measurement, since you can't remove individual rows in InfluxDB.
For this you might use limit 1 offset 0, so that only the latest login time comes out of the query.
Let me know if I have understood it correctly.

Ruby on Rails: Join Tables Concept

So I have been out of the coding game for a while and recently decided to pick up Rails. I have a question about the concept of join tables in Rails. Specifically:
1) Why are these join tables needed in the database?
2) Why can't I just JOIN two tables on the fly like we do in SQL?
A join table allows a clean linking of association between two independent tables. Join tables reduce data duplication while making it easy to find relationships in your data later on.
E.g. if you compare a table called users:
| id | name    |
|----|---------|
| 1  | Sara    |
| 2  | John    |
| 3  | Anthony |
with a table called languages:
| id | title   |
|----|---------|
| 1  | English |
| 2  | French  |
| 3  | German  |
| 4  | Spanish |
You can see that both truly exist as separate concepts from one another. Neither is subordinate to the other in the way a single user may have many orders (where each order row might store a foreign key representing the user_id of the user that made it).
When a language can have many users, and a user can have many languages -- we need a way to join them.
We can do that by creating a join table, such as user_languages, to store every link between a user and the language(s) that they may speak. With each row containing every matchup between the pairs:
| id | user_id | language_id |
|----|---------|-------------|
| 1  | 1       | 1           |
| 2  | 1       | 2           |
| 3  | 1       | 4           |
| 4  | 2       | 1           |
| 5  | 3       | 1           |
With this data we can see that Sara (user_id: 1) is trilingual, while John(user_id: 2) and Anthony(user_id: 3) only speak English.
By creating a join table in-between both tables to store the linkage, we preserve our ability to make powerful queries in relation to data on other tables. For example, with a join table separating users and languages it would now be easy to find every User that speaks English or Spanish or both.
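In raw SQL terms, using the table and column names from the examples above, that query could look something like this:

-- Every user that speaks English or Spanish (or both)
SELECT DISTINCT u.name
FROM users u
JOIN user_languages ul ON ul.user_id = u.id
JOIN languages l ON l.id = ul.language_id
WHERE l.title IN ('English', 'Spanish');
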
But where join tables get even more powerful is when you add new tables. If in the future we wanted to link languages to a new table called schools, we could simply create a new join table called school_languages. Even better, we can add this join table without needing to make any changes to the languages SQL table itself.
As Rails models, the data relationship between these tables would look like this:
User --> user_languages <-- Language --> school_languages <-- School
By default every school and user would be linked to Language using the same language_id(s).
This is powerful. Because with two join tables (user_languages & school_languages) now referencing the same unique language_id, it will now be easy to write queries about how either relates. For example we could find all schools who speak the language(s) of a user, or find all users who speak the language(s) of a school. As our tables expand, we can ride the joins to find relations about pretty much anything in our data.
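As a sketch of riding those joins (assuming a schools table with id and name columns, and school_languages with school_id and language_id):

-- All schools that offer at least one language spoken by user 1 (Sara)
SELECT DISTINCT s.name
FROM schools s
JOIN school_languages sl ON sl.school_id = s.id
JOIN user_languages ul ON ul.language_id = sl.language_id
WHERE ul.user_id = 1;
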
tl;dr: Join tables preserve relations between separate concepts, making it easy to make powerful relational queries as you add new tables.
