I'm working on a data warehouse that seeks to capture website visits and purchases. We have a hypothesis that by identifying patterns from previous site visits you can get insights into visitor behavior for the current site visit.
The grain of my fact table is individual website visits, and we assign a 1 if the customer makes a purchase and a 0 if she does not. Our fact is additive. We would like to be able to explore and understand how the actions of prior visits influence the action of the current visit, so I'm trying to figure out how you would go about modeling this. On a particular site visit a visitor could have 1, 2 or 12 prior site visits.
So my question is: how would I model a past visit dimension that includes the past visit date and past visit activity (purchase or no purchase, time on site, etc.)? Is this an example of a use for a bridge table?
A bridge table in a data-warehouse is primarily (exclusively?) for dealing with many to many relationships, which you don't appear to have.
If the grain of your fact table is website visits then you don't need a 'past visit' dimension, since your fact table contains the visit history already.
You have two dimensions here:
Customer
Date
Time on site is presumably a number, and since you are treating purchase/no purchase as a boolean score (1,0) these are both measures and belong in the fact table.
The Customer dimension is for your customer attributes. Don't put measures here (e.g. prior scores). You should also consider how to handle changes (probably SCD type 2).
You could put your date field directly in the fact table but it is more powerful as a separate dimension, since you can much more easily analyze by quarters, financial years, public holidays etc.
So,
Example Fact_Website_Visit table:
Fact_Website_Visit_Key | Dim_Customer_Key | Dim_Date_Key | Purchase(1,0) | Time_On_Site
Example Dim_Customer Dimension:
Dim_Customer_Key | Customer_ID | Customer_Demographic
Example Dim_Date Dimension:
Dim_Date_Key | Full_Date | IsWeekend
To demonstrate how this works I've written an example report to see sale success and average time spent online on weekends grouped by customer demographic:
SELECT
    Dim_Customer.Customer_Demographic,
    COUNT(fact.Fact_Website_Visit_Key) AS [# of Visits],
    SUM(fact.Purchase) AS [Total Purchases],
    AVG(fact.Time_On_Site) AS [Average Minutes Online],
    -- 100.0 forces decimal arithmetic; with integer columns the ratio would otherwise truncate to 0
    100.0 * SUM(fact.Purchase) / COUNT(fact.Fact_Website_Visit_Key) AS [% sale success]
FROM
    Fact_Website_Visit fact
    INNER JOIN Dim_Customer ON fact.Dim_Customer_Key = Dim_Customer.Dim_Customer_Key
    INNER JOIN Dim_Date ON fact.Dim_Date_Key = Dim_Date.Dim_Date_Key
WHERE
    Dim_Date.IsWeekend = 'Y'
GROUP BY
    Dim_Customer.Customer_Demographic
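Since the fact table already holds every visit, prior-visit behavior can be derived from it rather than modeled as a separate dimension. Here is a minimal Python sketch of that idea, assuming hypothetical fact rows of (customer key, visit date, purchase flag, minutes on site); the row layout and the function name are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (customer_key, visit_date, purchase, time_on_site_minutes)
visits = [
    (1, date(2015, 1, 3), 0, 4.0),
    (1, date(2015, 1, 10), 1, 12.5),
    (1, date(2015, 1, 17), 0, 3.0),
    (2, date(2015, 1, 4), 1, 8.0),
]

def with_prior_visit_stats(rows):
    """For each visit, attach counts derived from that customer's earlier visits."""
    history = defaultdict(lambda: {"visits": 0, "purchases": 0})
    result = []
    # Sort by customer, then by date, so earlier visits are processed first.
    for customer, visit_date, purchase, _minutes in sorted(rows, key=lambda r: (r[0], r[1])):
        prior = history[customer]
        result.append({
            "customer": customer,
            "date": visit_date,
            "purchase": purchase,
            "prior_visits": prior["visits"],
            "prior_purchases": prior["purchases"],
        })
        # Only now does the current visit become part of the history.
        prior["visits"] += 1
        prior["purchases"] += purchase
    return result

stats = with_prior_visit_stats(visits)
```

Each output row describes the current visit together with what the same customer did before it, which is the comparison the question is after.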
I have data in BigQuery with columns such as a timestamp and a userid. Some users visit the website multiple times.
The goal is to find out the time difference of users visiting multiple times.
Even if they visit 14 times, I need to find the difference between every consecutive visit.
This is a sample of my data:
This should help (assuming you want the delta in minutes). You can always switch to whatever period you need (hour, second, etc.)
Please note the usage of the analytic function LAG, which uses data partitioned by user_id and ordered by the timestamp ts. Also, note that the first appearance of a user_id gets a difference of 0 because this is the first time the user showed up :). Hope it helps.
select
  user_id,
  coalesce(timestamp_diff(ts_a, ts_b, minute), 0) as diff_from_prv_visit_minutes
from (
  select
    user_id,
    ts as ts_a,
    lag(ts) over (partition by user_id order by ts) as ts_b
  from `mydataset.mytable`
)
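To make the LAG logic concrete, here is a rough Python sketch of the same idea: sort each user's visits by timestamp and subtract the previous one, with 0 for the first visit. The sample rows and the function name are invented for illustration, and the whole-minute rounding is my assumption about what you want.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical sample rows: (user_id, ts)
visits = [
    ("u1", datetime(2020, 1, 1, 10, 0)),
    ("u2", datetime(2020, 1, 1, 9, 30)),
    ("u1", datetime(2020, 1, 1, 10, 45)),
    ("u1", datetime(2020, 1, 2, 10, 45)),
]

def minutes_since_previous_visit(rows):
    """Return (user_id, ts, whole minutes since that user's previous visit)."""
    out = []
    ordered = sorted(rows)  # tuples sort by user_id first, then by ts
    for user_id, group in groupby(ordered, key=lambda r: r[0]):
        previous_ts = None  # plays the role of LAG(ts) within the partition
        for _, ts in group:
            if previous_ts is None:
                diff = 0  # first visit of this user, like the COALESCE(..., 0)
            else:
                diff = int((ts - previous_ts).total_seconds() // 60)
            out.append((user_id, ts, diff))
            previous_ts = ts
    return out

result = minutes_since_previous_visit(visits)
```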
I am writing what could be defined as an accountancy/invoicing app using Rails 5. I am in need of implementing a section that predicts the company's cashflow in the future. So far I've got the following:
Actual bank movements and balances (in the past), imported from the bank
Future invoices (income) which are expected to be paid on a certain date
Future one-time expenses which are expected to be paid on a certain date
Using these three sets of data, I can calculate, for any given date in the future, the sum of: the last known bank balance, plus all the future invoices values coming IN, minus all the future expenses going OUT, so I get, theoretically, the expected balance of the company for any given date.
My doubt arises when it comes to recurrent expenses (or potentially incomes). Given that all of the items I mentioned before (bank movements, invoices and expenses) are actual ActiveRecord records stored in my database, I'm not sure about how to treat the recurrent expenses, for example:
Let's imagine I want to enter a known future recurrent paycheck of a certain employee, which is $2000 every first day of the month.
1- Should I generate at some point the next X entries and treat them as normal future expenses (each with its own ID, date and amount)?
2- The other option I've thought of is having some kind of "declaration" of the nature of the recurrent expense, as in "it's $2000 on day 1 of every month until -forever-", similar to a cronjob. But, if I were to take this approach, I'd like to have an ActiveRecord-like interface, so that I can do something like:
cashflow = []
last_movement = BankMovement.last
value = last_movement.balance
(last_movement.date..(last_movement.date + 12.months)).each do |day|
  value += Invoice.pending.expected_on(day).sum(:gross_amount)
  value -= Expense.pending.expected_on(day).sum(:gross_amount)
  value -= RecurringExpense.expected_on(day).sum(:gross_amount)
  cashflow.push({ date: day, balance: value })
end
This feels almost right, but I'm not sure how to link the actual expense with the recurrent/calculated one when it comes in. How can I then change the date if the expense gets paid the day after it was supposed to be? I need to have an actual record of each one of those, at least whenever they are "consolidated".
I'm not really sure if I was clear enough with my trouble here, so, should anyone want and have some spare time to help me out, please feel free to ask for any extra relevant info, I'd really appreciate some help, especially if we can find a way of doing this "the Rails way"!
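To illustrate what option 2 could look like, here is a minimal sketch in plain Python (not Rails) of a recurrence "declaration" that can answer how much is expected on a given day. The class, its fields and the projection helper are all hypothetical names. The sketch only handles a fixed day of the month that exists in every month, and it does not address linking a calculated occurrence to the actual paid record.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class MonthlyRecurringExpense:
    """Hypothetical declaration: `amount` is due on `day_of_month` of every month."""
    amount: float
    day_of_month: int               # assumed to be 1..28 so it exists in every month
    starts_on: date
    ends_on: Optional[date] = None  # None means "until forever"

    def amount_expected_on(self, day: date) -> float:
        if day < self.starts_on:
            return 0.0
        if self.ends_on is not None and day > self.ends_on:
            return 0.0
        return self.amount if day.day == self.day_of_month else 0.0

def project_balance(opening_balance, start, days, recurring):
    """Walk forward day by day, subtracting whatever the declarations expect."""
    balance = opening_balance
    projection = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        balance -= sum(r.amount_expected_on(day) for r in recurring)
        projection.append((day, balance))
    return projection

paycheck = MonthlyRecurringExpense(amount=2000, day_of_month=1, starts_on=date(2024, 1, 1))
projection = project_balance(10000, date(2024, 1, 15), 60, [paycheck])
```

In this example the 60-day window from 2024-01-15 covers February 1 and March 1, so the paycheck is subtracted twice.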
Scenario: There are 3 kinds of utilization metrics that I have to derive for the users. In my application, user activity is tracked using login history, the number of customer calls made by the user, and the number of status changes performed by the user.
All this information is maintained in 3 different tables in my application db: UserLoginHistory, CallHistory and OrderStatusHistory. Every action made by each user is stored in these 3 tables along with DateTime info.
Now I am trying to create a reporting db that will help me generate the overall utilization of each user. Basically the report should show me, for each user over a period:
UserName
Role
Number of Logins Made
Number of Calls Made
Number of Status updates Made
Now I am in the process of designing my fact table. How should I go about creating a fact table for this scenario? Should I create a single fact table whose rows capture all these details at the date grain (the level of my DimDate table), or 3 different fact tables and relate them?
Neither of the 2 options I described above is convincing, and I am looking for a better design. Thanks.
As a rule of thumb, when you have a report which uses different facts/metrics (Number of Logins Made, Number of Calls Made, Number of Status updates Made) with the same granularity (UserName, Role, Day/Hour/Minute), you put them in the same fact table, to avoid expensive joins.
For many reasons this is not always possible, but your case seems to me a bit different.
You have three tables with the user activity, where probably you store more detailed information about logins, calls and Status updates. What you need for your report is a table with your metrics and the values aggregated for the time granularity that you need.
Let's say you need the report at the day level, you need a table like this:
Day | UserID | RoleID | #Logins | #Calls | #StatusUpdate
20150101 | 1 | 1 | 1 | 5 | 3
20150101 | 2 | 1 | 4 | 15 | 8
If the business later requires the report by hour, then you will need:
DayHour | UserID | RoleID | #Logins | #Calls | #StatusUpdate
20150101 10:00AM | 1 | 1 | 1 | 2 | 1
20150101 11:00AM | 1 | 1 | 0 | 3 | 2
20150101 09:00AM | 2 | 1 | 2 | 10 | 4
20150101 10:00AM | 2 | 1 | 2 | 5 | 4
The Day level table is then just an aggregated (by Day) version of the second one. The DayHour attribute is a child of the Day one.
If you need minute-level detail, you go further down in granularity.
You can also start directly with a summary table at the minute level, but I would double check the requirement with the business; usually a one-hour (or 15-minute) granularity is enough.
Then if they need more detailed information, you can always drill down by querying your original tables. The good thing is that when you drill to that level you should have just a small set of rows to query (like just a few hours for a specific UserName) and your database should be able to handle it.
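As a rough illustration of building the day-level table, here is a Python sketch that counts events from the three history tables into one row per (day, user). The in-memory lists stand in for UserLoginHistory, CallHistory and OrderStatusHistory, and their (user_id, timestamp) layout is an assumption.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical extracts of the three history tables: (user_id, happened_at)
login_history = [(1, datetime(2015, 1, 1, 10, 5)), (2, datetime(2015, 1, 1, 9, 0))]
call_history = [
    (1, datetime(2015, 1, 1, 10, 15)),
    (1, datetime(2015, 1, 1, 11, 0)),
    (2, datetime(2015, 1, 1, 9, 30)),
]
status_history = [(1, datetime(2015, 1, 1, 11, 20))]

def build_daily_fact(logins, calls, status_updates):
    """Return {(day_key, user_id): {"logins": n, "calls": n, "status_updates": n}}."""
    fact = defaultdict(lambda: {"logins": 0, "calls": 0, "status_updates": 0})
    sources = (("logins", logins), ("calls", calls), ("status_updates", status_updates))
    for metric, events in sources:
        for user_id, happened_at in events:
            # Truncating the timestamp to YYYYMMDD sets the grain to one day.
            day_key = int(happened_at.strftime("%Y%m%d"))
            fact[(day_key, user_id)][metric] += 1
    return dict(fact)

daily_fact = build_daily_fact(login_history, call_history, status_history)
```

Switching the key format to include the hour would give the DayHour grain instead, with the day table remaining a roll-up of it.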
I'm currently designing a website which can help my rowing team plan training times and such. The basic idea is that every rower can set the times they can train. Coaches can then see the availability of all the rowers in a handy table and can use this to plan a training.
My question is, how should I represent availability in the class diagram and database?
The idea that I had was to divide days into time blocks: block 1 stands for 7:00 - 7:30, block 2 stands for 7:30 - 8:00, and so on. Then I will create a table 'timeblocks' with the following attributes:
block_id
user_id
date (day, month and year)
block_number
availability
Is this an efficient way of storing availability data?
Another way is to normalize this table into two pieces: a separate block table and an availability table.
block:
Block_id
block_range
Time_Block:
Time_blockId
Block_ID
user_ID
Date
Availability
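For what it's worth, the block_range value does not have to be typed in by hand; it can be computed from the block number. A small Python sketch, assuming block 1 starts at 7:00 and every block lasts 30 minutes as in the question (the constant and function names are invented):

```python
from datetime import date, datetime, time, timedelta

FIRST_BLOCK_START = time(7, 0)        # block 1 starts at 7:00 (from the question)
BLOCK_LENGTH = timedelta(minutes=30)  # every block lasts half an hour

def block_range(block_number):
    """Return the (start, end) times of day covered by a 1-based block number."""
    if block_number < 1:
        raise ValueError("block numbers start at 1")
    # time objects do not support arithmetic, so anchor them to an arbitrary date.
    anchor = datetime.combine(date(2000, 1, 1), FIRST_BLOCK_START)
    start = anchor + (block_number - 1) * BLOCK_LENGTH
    end = start + BLOCK_LENGTH
    return start.time(), end.time()

first = block_range(1)
second = block_range(2)
```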
I've just begun diving into data warehousing and I have one question that I just can't seem to figure out.
I have a business which has ten stores, each with a certain number of employees. In my data warehouse I have a dimension representing the store. The employee dimension is an SCD, with columns for start/end dates and the store at which the employee is working.
My fact table is based on suggestions the employees give (anonymously) to the store managers. This table contains the suggestion type (cleanliness, salary issue, etc), the date it was submitted (foreign keyed to a Time dimension table), and the store at which it was submitted.
What I want to do is create a report showing the ratio of the number of suggestions to the number of employees in a given year. Because the number of employees changes periodically, I can't just do a simple query for the total number of employees.
Unfortunately I've searched the web quite a bit trying to find a solution but the majority of the examples are retail based sales, which is different from what I'm trying to do.
Any help would be appreciated. I do have the AdventureWorksDW installed on my machine so I can use that as a point of reference if anyone offers a suggestion using that.
Thanks in advance!
The slowly changing dimension should have a natural key that identifies the source of the row (otherwise how would it know what to compare to detect changes?). This should be constant across all versions of the dimension row. You can get a count of employees by computing a distinct count of the natural key.
Edit: If your transaction table (suggestion) has a date on it, a distinct count of employees grouped by a computed function of the suggestion date (e.g. datepart (yy, s.SuggestionDate)) and the business unit should do it. You don't need to worry about the date on the employee dimension as the applicable row should join directly to the transaction table.
Add another fact table for the number of employees in each store for each month; you could use the max number for the month. Then average the months for the year and use this as the "number of employees in a year".
Load your new fact table at the end of each month. The new table would look like:
fact table: EmployeeCount
KeyEmployeeCount int -- surrogate key
KeyDate int -- FK to date dimension, point to last day of a month
KeyStore int -- FK to store dimension
NumberOfEmployees int -- (max) number of employees for the month in a given store
If you need a finer resolution, use "per week" or even "per day". The main idea is to average the NumberOfEmployees measure for a given store over the year.
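A minimal Python sketch of the calculation, assuming hypothetical month-end snapshot rows and suggestion rows reduced to (year, store); the averaging is over whatever monthly snapshots exist for that store and year.

```python
from collections import defaultdict

# Hypothetical month-end snapshots: (year, month, store_key, number_of_employees)
employee_counts = [
    (2015, 1, 1, 10),
    (2015, 2, 1, 12),
    (2015, 3, 1, 14),
]
# Hypothetical suggestion fact rows reduced to (year, store_key)
suggestions = [(2015, 1)] * 6

def suggestions_per_employee(snapshots, suggestion_rows):
    """Return {(year, store_key): suggestions / average monthly headcount}."""
    monthly = defaultdict(list)
    for year, _month, store, headcount in snapshots:
        monthly[(year, store)].append(headcount)
    totals = defaultdict(int)
    for year, store in suggestion_rows:
        totals[(year, store)] += 1
    ratios = {}
    for key, counts in monthly.items():
        average_headcount = sum(counts) / len(counts)
        if average_headcount > 0:  # skip stores with no employees to avoid dividing by zero
            ratios[key] = totals[key] / average_headcount
    return ratios

ratios = suggestions_per_employee(employee_counts, suggestions)
```

Here the average headcount is 12 and there are 6 suggestions, so the ratio for store 1 in 2015 comes out at 0.5.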