In a data warehouse, should a measure be based on a fact or a dimension?

Let's say there is a data warehouse created from shop data. A fact is a single purchase of a product. There is a dimension that describes a customer.
There is a need to create a measure that stores the number of distinct customers. Can this measure be based on the customer identifier in the dimension table, or does it need to be based on the fact table? In which cases is one or the other solution better?
[Visualization based on the AdventureWorks2016 database]
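To make the difference concrete, here is a minimal pandas sketch with made-up table and column names (loosely modelled on AdventureWorks): a distinct count over the fact table counts only customers who actually purchased, while the dimension contains every known customer.

    import pandas as pd

    # Hypothetical, simplified tables for illustration only.
    dim_customer = pd.DataFrame({"CustomerKey": [1, 2, 3, 4]})
    fact_sales = pd.DataFrame({
        "SalesOrderID": [100, 101, 102],
        "CustomerKey":  [1, 1, 3],
    })

    # Distinct customers based on the fact table: only customers with purchases (2).
    print(fact_sales["CustomerKey"].nunique())

    # Distinct customers based on the dimension table: all known customers (4).
    print(dim_customer["CustomerKey"].nunique())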

How to add global and local explainability to predictions to understand why a customer churns?

The main goal is to understand:
How likely is a customer to churn?
Identify the churn reason(s) per user.
For now, I'm using a random forest model. I can see the most important features across all users. Is there a way I could get the important features per user? E.g., maybe one customer is leaving because they don't like the product, and another one is leaving because the product is expensive, etc.
Thanks in advance!
Individual SHAP values can be used for this kind of local interpretability. SHAP values compute the contribution of each predictor and can be applied globally, to particular groups (e.g., customers who leave), or to individual customers.
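As a minimal sketch of that idea in Python, assuming the shap package and a toy stand-in for the churn dataset (the feature names and model settings below are illustrative, not from the question):

    import pandas as pd
    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy stand-in for a churn dataset; replace with your own customer features.
    X, y = make_classification(n_samples=500, n_features=6, random_state=0)
    X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    # Depending on the shap version, classifiers return either a list with one
    # array per class or a single array; this assumes the list form and takes
    # the "churn" class.
    churn_shap = shap_values[1] if isinstance(shap_values, list) else shap_values

    # Local explanation: the top feature contributions for a single customer.
    customer_idx = 0
    contributions = pd.Series(churn_shap[customer_idx], index=X.columns)
    print(contributions.abs().sort_values(ascending=False).head(5))

The same per-customer vectors can be aggregated over all customers (or over churners only) to recover the global picture that the random forest feature importances give you.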

Designing a data warehouse for inventory management

I have a college assignment requirement to build a data warehouse for product inventory management that can help the inventory team understand on-hand value and, using historical data, predict when to bring in new inventory. I have been reading to find the best way to do it using cubes or data marts. My question here is: do I have to create a data warehouse first and build the cube/data mart on top of that, or can I extract transactional data directly into the cube/data mart?
Next, is it mandatory to build a star schema (or other DW schema) for this assignment? After reading multiple articles, my understanding is that an OLAP cube can have multiple facts surrounded by dimensions.
Your question is far bigger than you know!
As a general principle, you would have one or more staging databases which land the data from one or more OLTP systems. The staging databases would then feed data to a data warehouse (DWH). On top of the DWH you would build a number of marts; these are typically subject-area specific.
There are several DWH methodologies:
Kimball Star Schema - you mention star schema above; this is broadly the Kimball Star Schema, proposed by Ralph Kimball. I would also include Snowflake Schemas here, which are a variation on Star Schemas.
Inmon Model - proposed by Bill Inmon.
Data Vault - proposed by Dan Linstedt. It has a large user base in the Benelux countries. There are variations on the Data Vault.
It's important not to get confused between a DWH methodology and the technology used to implement a DWH, though some technologies lend themselves to particular methodologies. For example, OLAP cubes work easily with Kimball star schemas. There is no particular need to use relational technology for every layer; some NoSQL databases (like Cassandra) lend themselves well to staging databases.
To answer your specific questions
Do I have to create a Data warehouse first and on top of that build a Cube/Data mart, or can I directly extract transactional data into the Cube/Data Mart?
OLAP cubes are optional if you have a specific mart that is tailored to your reporting, but it depends on your reporting and analysis requirements and on the speed of access you need.
A Data Mart could actually be built only using an OLAP cube, coming straight from the DWH.
Specifically on inventory management, all of these DWH methodologies would be suitable.
I can't answer your last question, as that seems to be the point of the assignment and you haven't given enough information to answer it, but you need to do some research into dimensional modelling, so I hope this has pointed you in the right direction!
The answer is yes, a star model will always help with analysis, but it is relational, whereas a cube is multidimensional (it performs all the data crossings) and often uses star models as its data source (recommended).
OLAP cubes are generally used for fast analysis and summaries of data.
So, as a standard approach, I recommend you build all the star models you need and then generate the OLAP cubes for your analysis.
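As a minimal, illustrative sketch of what a star-schema query looks like in practice (the fact and dimension tables below are invented for an inventory example, not taken from the question):

    import pandas as pd

    # Hypothetical dimension tables.
    dim_product = pd.DataFrame({
        "product_key": [1, 2],
        "product_name": ["Widget", "Gadget"],
        "category": ["Hardware", "Hardware"],
    })
    dim_date = pd.DataFrame({
        "date_key": [20240101, 20240102],
        "month": ["2024-01", "2024-01"],
    })

    # Hypothetical fact table of inventory snapshots.
    fact_inventory = pd.DataFrame({
        "date_key": [20240101, 20240101, 20240102],
        "product_key": [1, 2, 1],
        "quantity_on_hand": [100, 40, 90],
        "unit_cost": [2.5, 10.0, 2.5],
    })

    # Join the fact to its dimensions and aggregate: on-hand value per month and category.
    star = (fact_inventory
            .merge(dim_product, on="product_key")
            .merge(dim_date, on="date_key"))
    star["on_hand_value"] = star["quantity_on_hand"] * star["unit_cost"]
    print(star.groupby(["month", "category"])["on_hand_value"].sum())

An OLAP cube would precompute aggregations like this across the dimensions so they can be browsed interactively.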
As this is a 'homework' question, I would guess that the lecturer is looking for pros/cons between Kimball and Inmon, which are the two 'default' designs for end-user reporting. In the real world, Data Vault can also be applied as part of the DWH strategy, but it serves a different purpose and is not recommended for end-user consumption.
Data Vault is a design pattern to bring data in from source systems unmolested. Data will inevitably need to be cleaned before being presented to the end-user solution, and DV allows the DWH ETL process to be re-run if any issues are found or the business requirements change, especially if the granularity level goes down (e.g. the original fact table was for sales and the dimension requirements were for salesman and product category; now they want fact sales by sales round and salesman for product subcategory and category. Without DV you do not have the granular data to replay the historical information and rebuild the DWH).

Should I build a separate dimension for Visas?

I am designing a data mart for university students and am confused about visa and passport information: should it go in the student dimension, or should I define a separate dimension for it? Which would be the better approach?

Can we predict the dates on which each customer will make transaction(s)?

I came across a project where we have variables in a data set such as customer IDs, the dates they purchased the products, the types of products they purchased, and the product prices. I want to predict on what date each customer is likely to make a transaction and what product they are likely to purchase. Dates could be in days, weeks, or months.
From my understanding, I think I'll have to split the problem into different models: a 1st model predicting the product(s) that EACH customer will purchase, and a 2nd model predicting the date of the transaction that is likely to occur for EACH customer. Obviously for the first model we should be using classification machine learning models. I am not sure which model I should be using for the 2nd one. It could be time series, but I have not predicted dates with a model yet. I hope I am on the right track.
Main questions are:
Can we predict the dates with any machine learning techniques, in terms of days, weeks, or months?
Can we predict the dates and products that each customer is going to purchase together, or do we need to split the problem and build separate models for it?
Suggestions will be very much appreciated!
Check out the BTYD package:
http://cran.r-project.org/web/packages/BTYD/vignettes/BTYD-walkthrough.pdf
It uses Bayesian models to model customer purchase behaviour, both at the individual customer level and in aggregate. It can certainly solve your problem of "when" customers will buy. Regarding the problem of "which products", I suspect that you could separately model the purchasing process for a particular product (or set of products).
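If you prefer Python, a rough analogue is the lifetimes package with a BG/NBD purchase-timing model; everything below (the toy transactions, column names, and 30-day horizon) is made up for illustration:

    import pandas as pd
    from lifetimes import BetaGeoFitter
    from lifetimes.utils import summary_data_from_transaction_data

    # Toy transaction log: one row per purchase, per customer.
    transactions = pd.DataFrame({
        "customer_id": [1, 1, 1, 2, 2, 3, 3, 4],
        "date": pd.to_datetime([
            "2023-01-05", "2023-02-10", "2023-04-01",
            "2023-01-20", "2023-03-15",
            "2023-02-02", "2023-05-20",
            "2023-03-01",
        ]),
    })

    # Collapse raw transactions into frequency / recency / customer age (T).
    summary = summary_data_from_transaction_data(transactions, "customer_id", "date")

    # Fit the BG/NBD model and estimate expected purchases per customer over the next 30 days.
    bgf = BetaGeoFitter(penalizer_coef=0.01)
    bgf.fit(summary["frequency"], summary["recency"], summary["T"])
    summary["expected_30d"] = bgf.conditional_expected_number_of_purchases_up_to_time(
        30, summary["frequency"], summary["recency"], summary["T"]
    )
    print(summary.sort_values("expected_30d", ascending=False))

This gives a purchase-timing estimate per customer; the "which product" part would still need a separate model, as the answer suggests.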

Pattern for managing large user-uploaded datasets?

I'm a relatively new programmer. I am creating a web-based GIS tool where users can upload custom datasets ranging from 10 rows to 1 million. The datasets can have variable columns and datatypes. How do you manage these user-submitted datasets?
Is the creation of a table per dataset a bad idea? (BTW, I'll be using PostgreSQL as the database.)
My apologies if this is already answered somewhere, but my search did not turn up any good results. I may be using bad keywords in my search.
Thanks!
Creating a table per dataset is not a 'bad' idea at all. swivel.com was a very similar app to what you are describing, and we used a table per dataset; it worked very well for graph generation on user-uploaded datasets and for comparing data across datasets using joins. We had over 10k datasets and close to a million graphs, and some datasets were very large.
You also get lots of free usage out of your ORM layer; for instance, we could use Active Record for working with a dataset (each dataset is a generated model class with its table set to the actual table).
Pitfall-wise, you have to do a LOT of joins if you have any kind of cross-dataset calculations.
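A minimal sketch of the table-per-dataset idea in Python with psycopg2, assuming PostgreSQL as in the question; the connection string, naming scheme, and column types are illustrative, and the types should come from a server-side whitelist rather than raw user input:

    import psycopg2
    from psycopg2 import sql

    def create_dataset_table(conn, dataset_id, columns):
        # columns: list of (name, postgres_type) pairs inferred from the upload;
        # postgres_type must come from a whitelist, e.g. {"TEXT", "INTEGER", "DOUBLE PRECISION"}.
        table_name = f"dataset_{dataset_id}"  # one physical table per uploaded dataset
        col_defs = sql.SQL(", ").join(
            sql.SQL("{} {}").format(sql.Identifier(name), sql.SQL(pg_type))
            for name, pg_type in columns
        )
        query = sql.SQL("CREATE TABLE {} (id BIGSERIAL PRIMARY KEY, {})").format(
            sql.Identifier(table_name), col_defs
        )
        with conn.cursor() as cur:
            cur.execute(query)
        conn.commit()
        return table_name

    # Usage (assumes a reachable database):
    # conn = psycopg2.connect("dbname=gis user=app")
    # create_dataset_table(conn, 42, [("region", "TEXT"), ("population", "INTEGER")])

sql.Identifier quotes the user-supplied names, which keeps uploaded column names from breaking (or injecting into) the DDL.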
My coworkers and I recently tackled a similar problem where we had a poor data model in MySQL and were looking for better ways to implement it. We weighed a few different options, including MongoDB, and ended up using the entity-attribute-value (EAV) model. The EAV model is essentially a 3-column model; it allowed us to use a single model to represent a variable number of columns and data types.
You can read a little about our problem here, but it sounds like it might be a good fit for you too.
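For comparison, here is a rough sketch of what the EAV shape looks like and how one dataset can be pivoted back into columns; the table and column names are invented for illustration:

    import pandas as pd

    # One narrow table holds every dataset: one row per cell.
    # entity_id = row within a dataset, attribute = column name, value = cell value (stored as text).
    eav = pd.DataFrame({
        "dataset_id": [7, 7, 7, 7],
        "entity_id":  [1, 1, 2, 2],
        "attribute":  ["region", "population", "region", "population"],
        "value":      ["North", "1200", "South", "800"],
    })

    # Reassemble one dataset into its wide, column-per-attribute form.
    wide = (eav[eav["dataset_id"] == 7]
            .pivot(index="entity_id", columns="attribute", values="value"))
    print(wide)

The trade-off versus table-per-dataset is that every value is stored as a generic type, so type handling and large pivots become the application's problem.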
