Dimensional Modeling: app session or activity measures - data-warehouse

I am trying to answer the questions below, which come from the business (the business generates revenue from multiple apps through a customer-pay model). The business is interested in the following:
new users (trend with respect to previous months)
daily active users
Day 1 retention
I came up with the dimensional model below:
Dimension: users, app, deviceid, useractions, plan, date
Fact: fact_activity(userid, appid, deviceid, actionid)
Actions could be: app installed, app launch, registered, completed purchase, posted comments, played game, etc.
The questions I have are:
Should the fact table contain action_type instead of actionid (to avoid a join with useractions)?
Definition of Day 1 retention: number of apps installed / app launches the next day. How do I avoid counting a single user multiple times when they use multiple devices?
Would it be advisable to have device details in the user dimension, or separate?
If I need to measure average session duration, should I use another fact at session level or tweak the activity fact?

Your questions are really unanswerable without significantly more information about your business processes, data definitions, etc. In effect, you are asking someone to design a dimensional model for you before they can answer your questions, which is obviously not going to happen.
However, I can give you some very generic pointers that may help you:
Dimensions
A Dimension describes an entity, so if attributes can't be described as belonging to the same entity then they shouldn't be in the same dimension. In your case, I assume a Device and a User are not the same thing, and therefore they need to be separate dimensions.
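To illustrate why keeping User and Device separate helps with the double-counting worry, here is a minimal sketch in plain Python. It assumes one possible reading of Day 1 retention (the share of users who installed on a given day and launched the app the next day); the rows, column order and function name are hypothetical. Because the fact row carries a user key and a device key separately, a user-level measure can count distinct user keys and ignore how many devices were involved.

```python
from datetime import date, timedelta

# Hypothetical fact_activity rows: (user_id, device_id, action, activity_date).
# User and device are separate keys, so one user can appear with several devices.
FACT_ACTIVITY = [
    ("u1", "phone-1", "app installed", date(2024, 1, 1)),
    ("u1", "tablet-1", "app installed", date(2024, 1, 1)),
    ("u1", "phone-1", "app launch", date(2024, 1, 2)),
    ("u1", "tablet-1", "app launch", date(2024, 1, 2)),
    ("u2", "phone-2", "app installed", date(2024, 1, 1)),
    ("u3", "phone-3", "app installed", date(2024, 1, 1)),
    ("u3", "phone-3", "app launch", date(2024, 1, 3)),
]


def day1_retention(rows, install_day):
    """Share of users who installed on install_day and launched the next day.

    Assumed definition, for illustration only. Sets of user ids are used so a
    user with several devices is counted once.
    """
    next_day = install_day + timedelta(days=1)
    installers = {u for u, _, a, d in rows if a == "app installed" and d == install_day}
    returners = {u for u, _, a, d in rows if a == "app launch" and d == next_day}
    if not installers:
        return 0.0
    return len(installers & returners) / len(installers)


retention = day1_retention(FACT_ACTIVITY, date(2024, 1, 1))
print(f"{retention:.2f}")
```

Here u1 installs and launches on two devices but contributes one installer and one returner, so the result is one retained user out of three installers.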
Facts
You need to define your measures i.e. what precisely are the things you are going to want to aggregate (count, sum, avg, etc) and how are they defined/calculated.
For each measure, you also need to define its grain, i.e. what is the minimum set of dimensions that uniquely identify it. Once you have the grain defined, measures that share the same grain can be held in the same fact table; measures with different grains cannot.
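As a rough sketch of that grain check, the snippet below groups measures by their grain, so each group is a candidate fact table. The measure names and the dimensions listed for each grain are assumptions made up for illustration; the real ones depend on your own definitions.

```python
from collections import defaultdict

# Hypothetical measures and grains, for illustration only.
MEASURE_GRAINS = {
    "action_count": {"user", "app", "device", "action", "date"},
    "purchase_amount": {"user", "app", "device", "action", "date"},
    "session_duration_seconds": {"user", "app", "device", "session", "date"},
}


def group_by_grain(measure_grains):
    """Return lists of measures that share exactly the same grain."""
    groups = defaultdict(list)
    for measure, grain in measure_grains.items():
        # frozenset makes the grain hashable and order-independent.
        groups[frozenset(grain)].append(measure)
    return [sorted(measures) for measures in groups.values()]


fact_tables = group_by_grain(MEASURE_GRAINS)
for measures in fact_tables:
    print(measures)
```

Under these assumed grains, the session-level measure lands in its own group, which is the situation where a separate session fact would be needed.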

Related

Handling Contracts extension and licensing/Subscriptions addition/removal in dimensional model

Background: I am trying to design a star schema for a data warehouse. We have a business model in which we sell a few products that our customers can buy and then use. The customers are companies, and people in their organization can be mapped to the licenses they have bought for products.
I have the following dimensions.
Account_dim: This dimension contains the full list of companies that are current or prospective customers of our company. It can include companies that don't yet have a contract with us and are still in the discussion phase, so some rows might not have a contract.
User_dim: This is the list of users a company has nominated as its points of contact. A user belongs to one particular account in Account_dim. One account can have many users.
Product_Dim: This dimension contains all the information about the products we sell: the cost of a license and how many users are allowed on a license. For example, if a customer bought product A, a maximum of two users can use it.
Now I have three tables that have data regarding the contract.
Contract: It contains information regarding a contract we have which will include the contract start date and end date and the account which this contract is assigned to.
products_bought: This table contains the products bought under a contract. A contract can hold multiple products. Each product row has the product start date/end date and the price the client has paid for the asset.
allocated users: Each product bought can have users mapped to it who are allowed to use the product; these are users from user_dim for that account. Basically, this attaches a license to a user.
I am trying to model the contract, product bought and allocated user so I can generate the following data.
The amount of money an account has spent on products.
The utilization of licenses by an account. For example, an account that has a product allowing 3 users but has only one user mapped to it should show that the product is underutilized.
I tried denormalizing all three tables into one fact table, but I am running into a problem: the contract end date can change if the contract is extended, and new assets can be mapped to it. Last but not least, the company can remove a user and then map another user to the product, remove users because they left the company, or add more users.
How can this best be modeled? Because the contract and asset users can change, should they be SCDs rather than a fact table? Or how should I implement a fact that handles these changes, which must be captured to maintain a history of usage over time?
Your best bet is to read a book on how to go about designing a data warehouse, such as The Data Warehouse Lifecycle Toolkit, as this will give you all the information you need to be able to answer questions like this.
However, to specifically address your question, the best way to approach this is as follows:
Define your measures: what are the values that you wish to be able to aggregate in your reports
Define the grain of each measure: what are the dimensions that uniquely identify each measure. For example, a transaction amount might be defined by Store, Customer and Date/Time; if you dropped any of these then the transaction amount would change; if you added another dimension, such as rainfall, it would not change the transaction amount (n.b. having defined the grain of a measure you should never add dimensions that would change the grain e.g. Product Dimension, in this example)
Once you have defined your measures and their grains you can add all the other dimensions to them (that won't affect their grain) and then decide whether to hold them in separate fact tables or combine them into one fact table:
Rule: if two measures don't have the same grain you must not put them in the same fact table
Guidance: for measures that meet the above rule, if there is also significant overlap in the other dimensions you want to use for each measure then consider combining them into a single fact table. My rule of thumb is that if you have 2-3 dimensions that don't apply to all measures then that's OK; if you hit 5 or more then you probably need to be thinking of splitting the measures into separate facts
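The rule and the rule of thumb above can be written down as a small decision helper. This is only a sketch: the function name and the example dimensions are hypothetical, and the thresholds are simply the 2-3 and 5-or-more figures from the guidance (the case of exactly four non-shared dimensions is not covered by the guidance, so it is returned as a judgement call).

```python
def placement(grain_a, grain_b, other_dims_a, other_dims_b):
    """Suggest whether two measures can share a fact table.

    grain_*      : dimensions that define each measure's grain
    other_dims_* : extra dimensions you want to use with each measure
    """
    # Rule: different grains must never share a fact table.
    if set(grain_a) != set(grain_b):
        return "separate"
    # Guidance: count the extra dimensions that apply to only one measure.
    unshared = set(other_dims_a) ^ set(other_dims_b)
    if len(unshared) <= 3:
        return "combine"
    if len(unshared) >= 5:
        return "separate"
    return "judgement call"


grain = {"store", "customer", "datetime"}
print(placement(grain, grain, {"promotion"}, {"promotion", "payment_type"}))
print(placement(grain, grain | {"product"}, set(), set()))
```

The first call differs by one extra dimension and suggests combining; the second has different grains, so the rule applies before the guidance is even considered.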

Datawarehouse: Multivalued Slowly Changing Dimensions

I am currently creating a data warehouse for a (coffee) aggregator in Latin America. They have two main business operations:
buying coffee from farmers and selling it in the international market and
providing micro-credit loans to these farmers to increase their yield.
My plan is to create a datawarehouse on top of their operational systems/dbs.
The first business process I will integrate is the credit operation, after that I will add the buying of the coffee of individual farmers.
For the credit operation I envision a single fact table which consists of the loan-amount, with dimensions to farmer, loan-officer etc. But before getting into the fact table concerning loans, I am currently working on the creation of the farmer dimension.
I have a nice little farmer dimension with some keys, geographical location, sex, education, etc.
I also would like to include the "economic production" of the farmer. This is information that is captured in the loan application process and basically says what kind of coffee they produce and the size of the land they produce this on. The relation between farmer and economic production is thus 1:n
Furthermore, this changes from year to year and is obviously only known for farmers that have done a loan application.
The goal of this information, is to be able to (even before the credit fact table is created) create some basic figures / insights on the farmers, their spatial distribution and their economic activity/output.
I am thus thinking of having a farmer dimension connected to a "production dimension". This production dimension would be (1) time varying and (2) multivalued. The time variance I plan to implement according to type 2 (valid_from, valid_to and currently_valid columns).
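For concreteness, a type 2 change on such a production table might look like the sketch below, written in plain Python rather than SQL. The valid_from, valid_to and currently_valid columns are the ones named above; the farmer_id, coffee_type and land_hectares columns, the sample values and the function name are assumptions for illustration. Keeping one current row per farmer and coffee type is what makes it multivalued.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ProductionRow:
    farmer_id: str
    coffee_type: str              # hypothetical attribute
    land_hectares: float          # hypothetical attribute
    valid_from: date
    valid_to: Optional[date] = None   # None means "still open"
    currently_valid: bool = True


def apply_type2_change(rows: List[ProductionRow], farmer_id: str,
                       coffee_type: str, land_hectares: float,
                       change_date: date) -> None:
    """Expire the current row for (farmer, coffee type) and append a new version."""
    for row in rows:
        if (row.farmer_id == farmer_id and row.coffee_type == coffee_type
                and row.currently_valid):
            if row.land_hectares == land_hectares:
                return  # nothing changed, keep the current version
            row.valid_to = change_date
            row.currently_valid = False
    rows.append(ProductionRow(farmer_id, coffee_type, land_hectares, change_date))


rows = [
    ProductionRow("F1", "arabica", 2.0, date(2013, 1, 1)),
    ProductionRow("F1", "robusta", 1.0, date(2013, 1, 1)),
]
apply_type2_change(rows, "F1", "arabica", 3.5, date(2014, 1, 1))
current = [r for r in rows if r.currently_valid]
for r in current:
    print(r.farmer_id, r.coffee_type, r.land_hectares, r.valid_from)
```

After the change the farmer has three rows in total: the expired arabica row, the untouched robusta row and the new arabica row, with two of them currently valid.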
Since I am rather new to the whole data warehouse scene, I have been reading up a lot on common techniques and principles, mainly from Kimball's excellent (!) book. However, I haven't come across anything that describes such a dimension-dimension connection.
My questions therefore are:
is this common and considered a good approach?
where can I find some information on best practices concerning this matter
EDIT: A second possibility that I am thinking about would be to create some kind of factless fact table which deals with "customer interaction" (e.g. the loan application process in which such information from farmers is collected). This fact table would then have an FK to the farmer and an FK to the production dimension, as well as an FK to a time dimension table. As it has no facts associated with it, it would only form some kind of 1:n linking table. In my opinion, the only difference from the former method would be that the time dimension is now in a separate table as opposed to being included in the production table.
EDIT2: Or should I perhaps create a production fact table, although it does not coincide with a business process of the aggregator? In that case the surface area used to produce a certain crop would probably become the measure, and the crop types / varieties etc. would potentially go into a separate dimension.

Transaction Fact Table approach

I'm working on financial data mart structure.
And I have some doubts about which approach is better.
The source system database, Dynamics AX 2009, has three tables for customer transactions:
One table for open transactions, where the customer still needs to pay for the service/product;
One table for settled transactions, which holds what the customer has already paid;
Finally, a table that has all customer transactions: it holds transactions from open to settled, plus other transactions such as customer to bank or ledger accounts.
I have thought of two options. The first is to maintain one fact table per source table: a fact for open transactions, a fact for settled transactions and a fact for all customer transactions.
The second is to create a single fact to hold all transactions; to do so I would have to do a full join on the three tables.
I'm not sure about either approach, as the first seems to just copy tables from production and create the proper dimensions.
With the second one I would create a massive fact table whose data would constantly change, as open transactions are deleted in the source system when they are settled.
Another doubt: should I create a fact with an SCD (slowly changing dimension) structure to maintain historical data (start date, end date, flag)?
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: they help you categorize your measures and get a specific view on them. This approach will also open up many more possibilities for creating knowledge with your cube. With separate cubes for open/settled/etc. transactions, it will be harder, or not possible, to contrast this data.
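A minimal sketch of that single-fact idea, in plain Python instead of a cube: every row carries a status value, and the same amount measure is sliced by it. The customer ids, amounts and function names are made up for illustration, and the status values are only the open/settled categories mentioned in the question.

```python
from collections import defaultdict

# Hypothetical rows of one customer-transaction fact: (customer, status, amount).
FACT_CUSTOMER_TRANSACTION = [
    ("C001", "open", 120.0),
    ("C001", "settled", 300.0),
    ("C002", "open", 80.0),
    ("C002", "settled", 50.0),
    ("C002", "settled", 25.0),
]


def total_by_status(rows):
    """Sum the amount measure for each value of the status dimension."""
    totals = defaultdict(float)
    for _customer, status, amount in rows:
        totals[status] += amount
    return dict(totals)


def open_share_by_customer(rows):
    """Contrast open against all amounts per customer, using the same fact."""
    open_amount = defaultdict(float)
    all_amount = defaultdict(float)
    for customer, status, amount in rows:
        all_amount[customer] += amount
        if status == "open":
            open_amount[customer] += amount
    return {c: open_amount[c] / all_amount[c] for c in all_amount}


totals = total_by_status(FACT_CUSTOMER_TRANSACTION)
shares = open_share_by_customer(FACT_CUSTOMER_TRANSACTION)
print(totals)
print(shares)
```

The second function is the kind of comparison that becomes awkward when open and settled transactions live in separate cubes.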
Since the data is changing constantly, you should consider updating your fact table at a set interval and rebuilding your cube when needed.
Whether you use SCD or not really depends on the data you process and what it is used for. Is there a business case that calls for it? Is there a technical use?
I think this is something you have to decide on your own.

Neo4j graph model for a social network

I've created a graph model for a social network and needed some concrete advice regarding the design in regards to scaling. Pardon the n00bness of these questions but I'm not finding very many clear examples out there...
NOTE: the status updates and activity nodes /relationships are linked lists - with the newest entries constantly being placed at the top of the list.
Linked lists allow for news feed generation, but there could be hundreds of records per user. I presume the limit clause isn't sufficient even though the data is in descending order by date. Do I have to keep a separate linked list that holds only the most recent 10 status/activity updates, and constantly replace the head of that list, to get better activity feed generation, or will one properly sorted list do the job (with a limit clause)?
These nodes all have properties (json data with content, IDs, etc) - how do "global" indexes come into play here so that I can find, for example, users that like Depeche Mode without waiting a lifetime for results? I know how to add a node to an index, just wondering if I'm missing a part of the picture here..
Security - logins and passwords.. I would presume a graph database could store them, but I'd presume it's a security risk at this point - would it be better to keep this in postgres etc?
How would you improve this model to handle scalability? Imagine 20 million users banging away on this..
Imagine 40 million users - what's wrong with this model when it comes to scalability?
Part 1.
You can write Cypher or Gremlin queries that do what you want. Remember that you can traverse forwards and backwards on edges. Given a user, it should always be relatively constant time to pull up the last ten things they did.
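To see why one list with a limit is enough, here is a toy model of the newest-first linked list in plain Python. It is not Neo4j, Cypher or Gremlin code; the class and method names are hypothetical. The point is that reading the latest ten entries walks at most ten links from the head, no matter how long the list is.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, List, Optional


@dataclass
class StatusNode:
    text: str
    next: Optional["StatusNode"] = None  # link to the next-older status


class User:
    def __init__(self, name: str) -> None:
        self.name = name
        self.head: Optional[StatusNode] = None  # newest status

    def post(self, text: str) -> None:
        # New entries go on top of the list, as described in the question.
        self.head = StatusNode(text, self.head)

    def statuses(self) -> Iterator[str]:
        node = self.head
        while node is not None:
            yield node.text
            node = node.next

    def latest(self, limit: int = 10) -> List[str]:
        # islice stops the traversal after `limit` nodes.
        return list(islice(self.statuses(), limit))


user = User("alice")
for i in range(500):
    user.post(f"status {i}")

feed = user.latest(10)
print(feed[0], feed[-1])
```

With 500 posts the feed still only touches the ten nodes nearest the head, from status 499 down to status 490.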
Part 2.
If you are representing a band as an entity of a certain type, index on that attribute. Then you'll be able to pull out that node and traverse outwards to find all the users who like that band. If you don't have an independent entity, or it is somehow implicit, you'll want to enable full text search for your respective graph database.
Part 3.
Learn more about security. The only thing you would be storing would be a properly hashed string of the user's password. At that point you would be fine using any graph db and good security practices.
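As a sketch of what "properly hashed" could mean in practice, the snippet below uses PBKDF2 from Python's standard library with a random per-user salt and a constant-time comparison. The iteration count and function names are assumptions for illustration, not a recommendation taken from this answer; the salt and digest are what you would store as properties on the user node.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # assumed work factor; tune it for your hardware


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store both, never the password itself."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)


salt, digest = hash_password("correct horse")
print(verify_password("correct horse", salt, digest))
print(verify_password("wrong guess", salt, digest))
```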
Part 4/5.
Once you have one user, worry about the next thousand.
When you have a thousand users, worry about the next hundred thousand.
When you have one hundred thousand, worry about the next million.
When you have a million users, you can start worrying about the questions you asked.
Until you have at least 0.1% of the users/volume you want to scale to, it's mental masturbation to try and ask questions about how to scale up to a certain size.

How specific do I get in BDD scenarios?

Take two different ways of stating the same behavior.
Option A:
Given a customer has 50 items in their shopping cart
When they check out
Then they will receive a 10% discount on their order
Option B:
Given a customer has a high volume of items in their shopping cart
When they check out
Then they will receive a high volume discount on their order
The former is far more specific. If someone has some question about exactly when a customer gets a high volume discount or how much to give them, reading this scenario makes it very clear. Serving the purposes of documenting the behavior, it's about as specific as it can be, although any change in those values will require changing the scenario.
The second is more generalized and doesn't have the clarity of the first. Automating it would require incorporating the values "50" and "10" in the step implementations. On the other hand, the scenario captures the core business need: a high volume customer gets a discount. If we later decide to use "40" and "15", the scenario doesn't have to change because the core business need hasn't really changed (though the step implementation would). Also, the term "high volume customer" communicates something about why we're giving them the discount.
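To make the trade-off concrete, here is a rough sketch of the two step implementations in plain Python, without any BDD framework. The discount function, the constants and the step function names are hypothetical, and the sketch assumes that exactly 50 items already qualifies. In Option A the numbers are parsed out of the scenario text; in Option B they have to live in the step code.

```python
import re

# Hypothetical system under test.
HIGH_VOLUME_THRESHOLD = 50        # items (assumed to be inclusive)
HIGH_VOLUME_DISCOUNT_PERCENT = 10


def discount_percent(item_count: int) -> int:
    if item_count >= HIGH_VOLUME_THRESHOLD:
        return HIGH_VOLUME_DISCOUNT_PERCENT
    return 0


# Option A: the values come from the scenario text.
def option_a_passes(given: str, then: str) -> bool:
    items = int(re.fullmatch(
        r"a customer has (\d+) items in their shopping cart", given).group(1))
    expected = int(re.fullmatch(
        r"they will receive a (\d+)% discount on their order", then).group(1))
    return discount_percent(items) == expected


# Option B: the scenario says "high volume", so the step code holds the values.
OPTION_B_ITEMS = 50
OPTION_B_DISCOUNT_PERCENT = 10


def option_b_passes() -> bool:
    return discount_percent(OPTION_B_ITEMS) == OPTION_B_DISCOUNT_PERCENT


print(option_a_passes("a customer has 50 items in their shopping cart",
                      "they will receive a 10% discount on their order"))
print(option_b_passes())
```

With Option A, trying a cart of 49 or 100 items only needs a new line of scenario text; with Option B, the same experiment means editing OPTION_B_ITEMS in the step code.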
So, which is better? Rather, under what circumstances should I favor the former or the latter?
I think I'll go for option A.
The thing is that BDD scenarios must serve as documentation of the system.
So if a non-technical person (a business person, a tester or someone from the customer support team) wants to know how your discount system works, they will surely want to know what counts as a high volume of items and what discount is applied.
And they would not want to have to dig into the plumbing code to get this information.
I think this information is important and cannot be hidden from the reader.
Another benefit is that it allows a non-developer (a tester, for example) to write new scenarios and check what happens if there is 1 item in the cart or 100 items.
When you get too abstract about things, it gets harder to apply deliberate discovery.
So with a scenario like Option B, you lose the opportunity to ask yourself these questions:
What happens if we have more than 50 items, say 100 items? Is there any other discount available?
What happens if we have 1 item? Surely we should not apply a discount. Or should we apply a discount based on the total price of the cart instead of the number of items in it? Someone buying only one really expensive item should benefit from a discount too.
Is 10% the only available type of discount? Do we have, for example, fixed-amount discounts? Do we have more complex discount strategies?
When the business variables are visible, you can play around with them and figure out things you may have forgotten, or think about new interesting (or not) features.
As a general rule, I'd hide whatever the reader does not need to know in a scenario; in this case, the number of items and the applied discount value really do matter to the reader.
