Scenario: There are 3 kinds of utilization metrics that I have derived for users. In my application, each user's activity is tracked through their login history, the number of customer calls they make, and the number of status changes they perform.
All of this information is maintained in 3 different tables in my application DB: UserLoginHistory, CallHistory and OrderStatusHistory. Every action a user performs is stored in these 3 tables along with DateTime information.
Now I am trying to create a reporting DB that will help me generate the overall utilization of each user. Basically, for each user over a period, the report should show me:
UserName
Role
Number of Logins Made
Number of Calls Made
Number of Status updates Made
Now I am in the process of designing my fact table. How should I go about creating a fact table for this scenario? Should I create a single fact table with rows capturing all these details at the granular date level (the level of my DimDate table), or 3 different fact tables and relate them?
Neither of the two options I described above is convincing, and I am looking for a better design. Thanks.
As a rule of thumb, when you have a report that uses different facts/metrics (Number of Logins Made, Number of Calls Made, Number of Status Updates Made) at the same granularity (UserName, Role, Day/Hour/Minute), you put them in the same fact table to avoid expensive joins.
For many reasons this is not always possible, but your case seems to me a bit different.
You have three tables with the user activity, where you probably store more detailed information about logins, calls and status updates. What you need for your report is a table with your metrics, with the values aggregated at the time granularity you need.
Let's say you need the report at the day level; then you need a table like this:
Day      | UserID | RoleID | #Logins | #Calls | #StatusUpdate
20150101 | 1      | 1      | 1       | 5      | 3
20150101 | 2      | 1      | 4       | 15     | 8
If tomorrow the business requires the report by hour, then you will need:
DayHour          | UserID | RoleID | #Logins | #Calls | #StatusUpdate
20150101 10:00AM | 1      | 1      | 1       | 2      | 1
20150101 11:00AM | 1      | 1      | 0       | 3      | 2
20150101 09:00AM | 2      | 1      | 2       | 10     | 4
20150101 10:00AM | 2      | 1      | 2       | 5      | 4
The Day-level table is then just an aggregated (by Day) version of the second one; the DayHour attribute is a child of the Day attribute.
If you need minute-level detail, you go further down in granularity.
You can also start directly with a summary table at the minute level, but I would double-check the requirement with the business; usually a one-hour (or 15-minute) range is enough.
If they then need more detailed information, you can always drill down by querying your original tables. The good thing is that when you drill to that level you should have only a small set of rows to query (e.g. just a few hours for a specific UserName), and your database should be able to handle it.
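As a concrete sketch of this approach, the day-level summary could be loaded by aggregating each source table to the day and joining the results. FactUserUtilization, DimUser, the *Key columns and the timestamp column names (LoginAt, CalledAt, ChangedAt) are assumptions for illustration, not your actual schema:

INSERT INTO FactUserUtilization (DateKey, UserKey, RoleKey, Logins, Calls, StatusUpdates)
SELECT
    d.DateKey,
    u.UserKey,
    u.RoleKey,
    COALESCE(l.Logins, 0),
    COALESCE(c.Calls, 0),
    COALESCE(s.StatusUpdates, 0)
FROM DimUser u
CROSS JOIN DimDate d
LEFT JOIN (
    SELECT UserID, CAST(LoginAt AS DATE) AS ActivityDate, COUNT(*) AS Logins
    FROM UserLoginHistory
    GROUP BY UserID, CAST(LoginAt AS DATE)
) l ON l.UserID = u.UserID AND l.ActivityDate = d.FullDate
LEFT JOIN (
    SELECT UserID, CAST(CalledAt AS DATE) AS ActivityDate, COUNT(*) AS Calls
    FROM CallHistory
    GROUP BY UserID, CAST(CalledAt AS DATE)
) c ON c.UserID = u.UserID AND c.ActivityDate = d.FullDate
LEFT JOIN (
    SELECT UserID, CAST(ChangedAt AS DATE) AS ActivityDate, COUNT(*) AS StatusUpdates
    FROM OrderStatusHistory
    GROUP BY UserID, CAST(ChangedAt AS DATE)
) s ON s.UserID = u.UserID AND s.ActivityDate = d.FullDate
WHERE d.FullDate BETWEEN '2015-01-01' AND '2015-01-31';  -- load window

Note that the CROSS JOIN produces a row even for days with no activity at all (all zeros); filter those out if you prefer a sparser fact table.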
Related
I get data from our CMS that shows all actions of staff within that system.
My challenge is to be able to show in a chart only the first iteration of each story as published and the hour in which it occurred.
A single story can be published multiple times during a day.
Using COUNTUNIQUEIFS I can get the number of unique stories per hour with:
=COUNTUNIQUEIFS(Sheet0!I2:I60000,Sheet0!G2:G60000,"NL_Stories/Ready",Sheet0!E2:E60000,"Webpub",Sheet0!B2:B60000,"08")
=COUNTUNIQUEIFS(Sheet0!I2:I60000,Sheet0!G2:G60000,"NL_Stories/Ready",Sheet0!E2:E60000,"Webpub",Sheet0!B2:B60000,"09")
Etc
However, if a story is published in the period from 8am-9am (08 in column B) and then published again between 9am-10am (09 in column B), it is counted again.
How can I limit the count to just the first time a story is published and exclude it from any of the other hours?
I have attached a spreadsheet, with one tab holding the raw data and the others showing what I currently do.
Any assistance appreciated.
https://docs.google.com/spreadsheets/d/1V-kZyUUfXtaf6pMYDCxNSUjjWclOrUcW2Pk2y678RZo/edit?usp=sharing
I have data in BigQuery with columns such as a timestamp and a userid; some users visit the website multiple times.
The goal is to find the time difference between visits for users who visit multiple times.
Even if they visit 14 times, I need the difference between every pair of consecutive visits.
This is a sample of my data:
This should help (assuming you want the delta in minutes). You can always switch to whatever period you need (hour, second, etc.).
Please note the use of the analytic function LAG, which works on data partitioned by user_id and ordered by the timestamp ts. Also note that the first appearance of a user_id gets a difference of 0, because that is the first time the user showed up :). Hope it helps.
select
  user_id,
  coalesce(timestamp_diff(ts_a, ts_b, minute), 0) as diff_from_prv_visit_minutes
from (
  select
    user_id,
    ts as ts_a,
    lag(ts) over (partition by user_id order by ts) as ts_b
  from `mydataset.mytable`
)
Dataset: I'm given the number of minutes individual customers use a product each day, and I am trying to cluster this data in order to find common usage patterns.
My question: How can I format the data so that, for example, a power user with high levels of use for a year looks the same as a different power user who has only been able to use the device for a month before I ended data collection?
So far I've turned each customer into an array where each cell is the number of minutes used that day. The array starts when the user first uses the product and ends after the user's first year of use. All entries must be double values (e.g. 200.0 minutes used) for the clustering model. I've considered setting all cells/days after the last day of data collection to either -1.0 or NULL. Is either of these a valid approach? If not, what would you suggest?
For the problem where you want both users to look the same (one who used the product heavily every day for a year, the other who could only use it for a month), create a new feature whose value is:
avg_usage per time_bin
time_bin can be a month, a day or another time bin which best fits your needs.
This way, a user who uses the product, say, 200 minutes per day for one year will get:
200 * 30 * 12 / 12 = 6000 minutes per month
and the other user, who joined just last month with exactly the same usage, will also get:
200 * 30 * 1 / 1 = 6000 minutes per month.
This way it doesn't matter when a user started using the product; the only thing that matters is the usage rate.
Another thing you might want to take into consideration is that products may be left unused for a while. For example, if I own a computer and go away on vacation, the days I didn't use it arguably say nothing about my general usage of the product. So, based on your data, product and intuition, you might consider removing gaps like that and not taking them into account in the calculation.
The total amount of time a user has had your product could itself be a signal, but if they only started recently and are still using it today, that is something you need to account for, and this average-per-bin technique may help.
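A minimal sketch of the average-per-bin idea, assuming a hypothetical table daily_usage(user_id, usage_date, minutes_used) with one row per user per active day (BigQuery-style SQL):

-- Sketch only: average minutes per monthly bin, so a one-month user and a
-- one-year user with the same daily usage end up with comparable values.
-- daily_usage and its columns are assumed names, not from the question.
select
  user_id,
  sum(minutes_used) / count(distinct date_trunc(usage_date, month)) as avg_minutes_per_month
from daily_usage
group by user_id

Dividing by the number of months in which usage was actually recorded also drops fully idle months, in the spirit of the gap-removal point above; divide by the number of months since first use instead if you want idle months to count.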
I'm working on a data warehouse that seeks to capture website visits and purchases. We have a hypothesis that by identifying patterns from previous site visits you can get insights into visitor behavior for the current site visit.
The grain of my fact table is the individual website visit, and we assign a 1 if the customer makes a purchase and a 0 if she does not, so our fact is additive. We would like to explore and understand how the actions of prior visits influence the action of the current visit, so I'm trying to figure out how to model this. On a particular site visit a visitor could have 1, 2 or 12 prior site visits.
So my question is: how would I model a past-visit dimension that includes the past visit date and past visit activity (purchase or no purchase, time on site, etc.)? Is this a use case for a bridge table?
A bridge table in a data warehouse is primarily (exclusively?) for dealing with many-to-many relationships, which you don't appear to have.
If the grain of your fact table is website visits, then you don't need a 'past visit' dimension, since your fact table already contains the visit history.
You have two dimensions here:
Customer
Date
Time on site is presumably a number, and since you are treating purchase/no purchase as a boolean score (1 or 0), these are both measures and belong in the fact table.
The Customer dimension is for your customer attributes. Don't put measures there (e.g. prior scores). You should also consider how to handle changes to customer attributes (probably SCD Type 2).
You could put your date field directly in the fact table, but it is more powerful as a separate dimension, since you can then analyze by quarters, financial years, public holidays, etc. much more easily.
So,
Example Fact_Website_Visit table:
Fact_Website_Visit_Key | Dim_Customer_Key | Dim_Date_Key | Purchase(1,0) | Time_On_Site
Example Dim_Customer Dimension:
Dim_Customer_Key | Customer_ID | Customer_Demographic
Example Dim_Date Dimension:
Dim_Date_Key | Full_Date | IsWeekend
To demonstrate how this works, here is an example report showing sale success and average time spent online on weekends, grouped by customer demographic:
SELECT
    Dim_Customer.Demographic,
    COUNT(fact.Fact_Website_Visit_Key) AS [# of Visits],
    SUM(fact.Purchase) AS [Total Purchases],
    AVG(fact.Time_On_Site) AS [Average Minutes Online],
    SUM(fact.Purchase) * 100.0 / COUNT(fact.Fact_Website_Visit_Key) AS [% sale success] -- 100.0 avoids integer division truncating to 0
FROM
    Fact_Website_Visit fact
    INNER JOIN Dim_Customer ON fact.Dim_Customer_Key = Dim_Customer.Dim_Customer_Key
    INNER JOIN Dim_Date ON fact.Dim_Date_Key = Dim_Date.Dim_Date_Key
WHERE
    Dim_Date.IsWeekend = 'Y'
GROUP BY
    Dim_Customer.Demographic
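If you later want prior-visit behavior available as measures (e.g. the number of earlier visits or earlier purchases by the same customer), it can be derived at query time from the same fact table with window functions, rather than through a separate past-visit dimension. A sketch, assuming the example schema above:

-- Sketch only: prior-visit measures derived from the visit history in the fact table.
-- For a customer's first visit the window frame is empty, so Prior_Visits is 0
-- and Prior_Purchases is NULL.
SELECT
    fact.Fact_Website_Visit_Key,
    fact.Dim_Customer_Key,
    COUNT(*) OVER (
        PARTITION BY fact.Dim_Customer_Key
        ORDER BY Dim_Date.Full_Date
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS Prior_Visits,
    SUM(fact.Purchase) OVER (
        PARTITION BY fact.Dim_Customer_Key
        ORDER BY Dim_Date.Full_Date
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS Prior_Purchases
FROM Fact_Website_Visit fact
INNER JOIN Dim_Date ON fact.Dim_Date_Key = Dim_Date.Dim_Date_Key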
I am writing a Rails app that deals with product inventory. I would like to include the following features, and am struggling to develop an efficient algorithm:
View stock history (how many were in stock on each date)
Quantity removed from warehouse, and quantity added to warehouse over specific periods of time
Amount of time the product was out of stock in any given period
My questions are as follows:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity, i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
Thanks for your help.
First of all, I recommend implementing a date dimension in your application, as it seems you will be doing a lot of time-related calculations. Search Google for "date dimension", as a full explanation is beyond the scope of your questions; that said, I believe implementing and using one will be of great benefit to your app.
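As a rough illustration of what such a date dimension might look like (the columns shown are just an example, not a prescribed schema):

-- Sketch only: one row per calendar date, pre-populated for the range you care about.
CREATE TABLE date_dimension (
    date_key     INTEGER PRIMARY KEY,  -- e.g. 20150101
    full_date    DATE    NOT NULL,
    year         INTEGER NOT NULL,
    month        INTEGER NOT NULL,
    day_of_month INTEGER NOT NULL,
    day_of_week  INTEGER NOT NULL,
    is_weekend   BOOLEAN NOT NULL
);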
As far as your direct questions go:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
Yes, you could do this. I would probably call it HistoricProductSnapshot and keep track of the product activity there on a daily basis. With this information, plus the date dimension, you could do calculations such as "how many of product X did we have 5 days ago, or a month ago?"
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
I do not have experience writing inventory-control software, but I believe that with the snapshot table I mentioned in the question above you would only have to keep track of quantities per day. The change in product counts could then be calculated from the snapshot table. You could, for example, have a function that outputs the product quantities for a given time range as an array. Example: from March 1 to March 7 the stock amounts for product Y were [45, 40, 39, 27, 22, 45, 44].
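A minimal sketch of the snapshot approach (table and column names are assumptions for illustration; the date arithmetic is PostgreSQL-flavoured and varies by database):

-- Sketch only: one row per product per day with the quantity on hand.
CREATE TABLE historic_product_snapshots (
    product_id    INTEGER NOT NULL,
    snapshot_date DATE    NOT NULL,
    quantity      INTEGER NOT NULL,
    PRIMARY KEY (product_id, snapshot_date)
);

-- "How many of product X did we have 5 days ago?"
SELECT quantity
FROM historic_product_snapshots
WHERE product_id = 42
  AND snapshot_date = CURRENT_DATE - 5;

-- Daily change in stock, derived from consecutive snapshots.
SELECT product_id,
       snapshot_date,
       quantity - LAG(quantity) OVER (PARTITION BY product_id ORDER BY snapshot_date) AS change_from_previous_day
FROM historic_product_snapshots;

-- Days a product was out of stock in a given period.
SELECT COUNT(*) AS days_out_of_stock
FROM historic_product_snapshots
WHERE product_id = 42
  AND quantity = 0
  AND snapshot_date BETWEEN DATE '2015-03-01' AND DATE '2015-03-31';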
Hope that helps. As I said, I am not a product-inventory guy, but I have worked with point-of-sale systems, and the procedure above should give you a good enough start for what you are trying to do.
This gem could be useful for tracking changes in models: https://github.com/collectiveidea/audited
Keep the data raw. I would personally create a new data entry every day showing how many items you have in stock that day. Or you can make the interval much shorter, such as every 12 hours.
For our particular use case:
We had a table called Days, which had a many-to-many relationship with products, and each "relationship" row had a value called quantity (to keep track of the quantity of each product per day). Additionally, each such row had a one-to-many relationship with transactions, holding the time of each transaction and the remaining stock.
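A rough sketch of that structure (names and types are assumptions based on the description above, not the actual schema):

-- Sketch only: Days <-> Products via a join table carrying the daily quantity,
-- with each day/product row owning its transactions.
CREATE TABLE days (
    id  INTEGER PRIMARY KEY,
    day DATE NOT NULL
);

CREATE TABLE products (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE day_products (              -- the many-to-many "relationship" rows
    id         INTEGER PRIMARY KEY,
    day_id     INTEGER NOT NULL REFERENCES days (id),
    product_id INTEGER NOT NULL REFERENCES products (id),
    quantity   INTEGER NOT NULL          -- stock of the product on that day
);

CREATE TABLE transactions (              -- one day_products row has many transactions
    id              INTEGER PRIMARY KEY,
    day_product_id  INTEGER NOT NULL REFERENCES day_products (id),
    happened_at     TIMESTAMP NOT NULL,
    remaining_stock INTEGER NOT NULL
);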
I would personally advise you to use the quantity of stock as the raw data, as it will let you derive things such as how many items were removed in a given transaction, and when an item went out of stock and when it came back into stock, all from the data. When you have data on which you need to perform statistical calculations, it's best to store it as raw values (the quantity of the item).