I am asking this question in context of Data Warehousing only.
Are Dimensional models & De-normalized models the same or different?
As far as I have heard from DW enthusiasts, there is no such thing as a Normalized or De-normalized data model.
But my understanding is that breaking down the dimensions, i.e. snowflaking, gives the Dimensional model, whereas a model with flattened hierarchy dimensions is called a De-normalized data model. Both are data modelling concepts in Data Warehousing.
I need your expert advice on this.
And what can we call a data model that has no surrogate keys, but instead uses the primary keys (codes) from the operational (OLTP) system to join Fact and Dimension tables together?
A Dimensional model is normally thought of as 'denormalised', because of the way dimension tables are handled.
A data warehouse with 'snowflaked' dimensions can still be called a dimensional model, but snowflaking goes against the advice of Kimball, whose approach is what most people think of when they think of dimensional modelling.
Breaking down the dimensions (i.e. snowflaking) is normalising those tables, and dimensional modelling (as described by Kimball) suggests avoiding snowflaking where possible, although people of course sometimes do, for all sorts of reasons. The model with flattened hierarchy dimensions is a denormalised data model, and this is the main thing that people mean when they talk of a dimensional model.
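To make the difference concrete, here's a tiny sketch using R data frames as stand-in tables (toy data and illustrative column names, not a real warehouse schema). The flattened dimension repeats the category attributes on every product row, while the snowflaked version moves them into their own table that has to be joined back in:

    # Denormalised (flattened) product dimension, as in a classic star schema:
    # the category attributes are repeated on every product row.
    dim_product_flat <- data.frame(
      product_key   = 1:3,
      product_name  = c("Widget", "Gadget", "Gizmo"),
      category_name = c("Hardware", "Hardware", "Electronics"),
      category_mgr  = c("Ann", "Ann", "Bob")
    )

    # Snowflaked (normalised) version: category attributes live in their own
    # table and are referenced by a key from the product dimension.
    dim_category <- data.frame(
      category_key  = 1:2,
      category_name = c("Hardware", "Electronics"),
      category_mgr  = c("Ann", "Bob")
    )
    dim_product_snow <- data.frame(
      product_key  = 1:3,
      product_name = c("Widget", "Gadget", "Gizmo"),
      category_key = c(1, 1, 2)
    )

    # Querying the snowflaked form needs an extra join to get back to the
    # flattened shape that reporting queries want:
    merge(dim_product_snow, dim_category, by = "category_key")

Kimball-style dimensional modelling generally prefers the first, flattened form and accepts the redundancy in exchange for simpler joins.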
As for a system that doesn't have surrogate keys: it could still be called a data warehouse, and you could still call it a dimensional model, but it goes against Kimball's recommended approach (whether for better or worse!).
Related
I am trying to build a marketing mix model which is using multiple predictors.
These predictors are basically investments across different categories, used to predict returns.
However, the R2 value for the model is very low due to the presence of outliers.
I am not able to remove them because of business constraints, since huge investments cannot simply be eliminated.
I am trying to find a logical way to account for these outliers and still get a good R2 value.
I tried segmenting the categories and modelling the data for each category separately, but that is just a weak fix, since we are looking for an approach that works across the board.
An mlr3 model includes a lot of redundant data that is not needed when applying the model. The traditional R approach is to save all the data used for model training, which leads to growing memory usage. With a traditional R model this can usually be fixed easily by just assigning NULL to the redundant fields, but it is not so clear how to do that for mlr3.
You can directly access the underlying model in mlr3 using the $model slot, see e.g. the basics chapter in the mlr3 book. This is where the trained model is put and what's used to make the predictions, so you can modify this in exactly the same way as you would modify the model directly.
Of course, some of this may break other mlr3 functionality, e.g. information on feature importance that is used by some other functions. But in principle, you can perform exactly the same model customization that you can do for the "raw" model.
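As a minimal sketch of what that can look like (assuming mlr3learners is installed, and that in your mlr3 version the trained model is reachable and writable via learner$state$model, which is what the $model binding reads from), you can apply the usual glm slimming trick and keep predicting through the learner:

    library(mlr3)
    library(mlr3learners)  # assumed installed; provides classif.log_reg

    task    = tsk("german_credit")
    learner = lrn("classif.log_reg")
    learner$train(task)

    print(object.size(learner$model), units = "Kb")  # size before stripping

    # The underlying object is a plain glm fit, so the usual glm trick applies:
    # drop the copies of the training data, but keep terms, coefficients,
    # xlevels, contrasts, qr and family, which predict() still needs.
    for (slot in c("data", "model", "y", "residuals", "fitted.values",
                   "weights", "prior.weights", "linear.predictors", "effects")) {
      learner$state$model[[slot]] = NULL
    }

    print(object.size(learner$model), units = "Kb")  # size after stripping

    # Predictions still go through the learner as usual.
    learner$predict(task)

As noted above, stripping fields like this may break functionality that relies on them (residual diagnostics, some importance measures), so only remove what you know you won't need.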
I'm new to Machine Learning, and I'd like to ask a question about model generalization. In my case, I'm going to produce some mechanical parts, and I'm interested in controlling the input parameters to obtain certain properties in the final part.
More particularly, I'm interested in 8 parameters (say, P1, P2, ..., P8). To keep down the number of pieces that have to be produced while maximizing the combinations of parameters explored, I've divided the problem into 2 sets. For the first set of pieces, I'll vary the first 4 parameters (P1 ... P4), while the others are held constant. For the second set, I'll do the opposite (P5 ... P8 varying and P1 ... P4 constant).
So I'd like to know if it's possible to build a single model that takes the eight parameters as inputs to predict the properties of the final part. I ask because, since I'm not varying all 8 variables at once, I thought that maybe I would have to build one model for each set of parameters, and that the predictions of the 2 different models couldn't be related to one another.
Thanks in advance.
In most cases, having two different models will give better accuracy than one big model. The reason is that each local model only looks at 4 features and can identify patterns among them to make predictions.
But this particular approach will most certainly fail to scale. Right now you only have two sets of data, but what if that grows and you have 20 sets? It will not be feasible to create and maintain 20 ML models in production.
What works best for your case will need some experimentation. Take a random sample from the data and train both options on it: one big model and two local models, then evaluate their performance. Look not just at accuracy but also at the F1 score, AUC-PR and the ROC curve to find out what works best for you. If you do not see a major performance drop, then one big model for the entire dataset will be the better option. If you know that your data will always be divided into these two sets and you don't care about scalability, then go with the two local models.
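A rough sketch of that comparison on simulated data, using plain lm() models purely as placeholders for whatever learner you actually use (with real data you would of course score a held-out set or use cross-validation, and the classification metrics above, rather than in-sample R-squared):

    set.seed(1)
    n <- 200
    X <- as.data.frame(matrix(runif(n * 8), ncol = 8,
                              dimnames = list(NULL, paste0("P", 1:8))))

    # Mimic the two experimental sets: in set 1 only P1..P4 vary, in set 2 only P5..P8.
    X[1:(n / 2), 5:8]     <- 0.5
    X[(n / 2 + 1):n, 1:4] <- 0.5

    d     <- X
    d$y   <- rowSums(X) + rnorm(n, sd = 0.1)   # stand-in for the measured property
    d$set <- rep(c(1, 2), each = n / 2)

    # One global model on all eight parameters
    global <- lm(y ~ P1 + P2 + P3 + P4 + P5 + P6 + P7 + P8, data = d)

    # Two local models, one per experimental set
    local1 <- lm(y ~ P1 + P2 + P3 + P4, data = subset(d, set == 1))
    local2 <- lm(y ~ P5 + P6 + P7 + P8, data = subset(d, set == 2))

    # Crude in-sample comparison; replace with a proper held-out evaluation.
    summary(global)$r.squared
    summary(local1)$r.squared
    summary(local2)$r.squared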
Are some types of data sets just not predictive?
A current real-life example of my own: my goal is to create a predictive model for cross-selling insurance products, e.g. Car Insurance to Health Insurance.
My data set consists mainly of characteristic data such as what state they live in, age, gender, type of car etc...
I've tried various models, from XGBoost trees to regularised logistic regression, and the AUC cannot get above 0.65.
So that leads me to - are some types of data sets just not predictive?
How do you help stakeholders understand this?
Some datasets may not be very predictive, especially if you're lacking variables that account for much of the variance. It's hard to say whether that is the case without talking to subject matter experts. With that said, models are all well and good, but I would also make sure that you're spending a significant amount of time engineering features. Often, representing the data the right way can be the difference between a working model and a bad one, especially with tree models.
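To illustrate the sort of thing meant by "representing the data the right way", here is a toy sketch on made-up cross-sell-style data; the column names, bands and groupings are assumptions for illustration, not a recipe:

    set.seed(42)
    policies <- data.frame(
      age      = sample(18:80, 500, replace = TRUE),
      state    = sample(c("NSW", "VIC", "QLD"), 500, replace = TRUE),
      gender   = sample(c("F", "M"), 500, replace = TRUE),
      car_type = sample(c("hatch", "sedan", "suv", "sports"), 500, replace = TRUE)
    )

    # Re-represent the raw characteristics so the signal is easier to pick up:
    policies$age_band     <- cut(policies$age, c(17, 25, 35, 50, 65, Inf))  # non-linear age effects
    policies$sports_car   <- policies$car_type == "sports"                  # domain-driven grouping
    policies$state_gender <- interaction(policies$state, policies$gender)   # explicit interaction

    str(policies)

Tree models can discover some of these interactions on their own, but engineered features like these often make the difference; and if none of it moves the AUC, that in itself is useful evidence for stakeholders that the available characteristics only carry so much signal.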
How are dimensional models used differently in the two approaches to data warehousing?
I understand that a data warehouse created using the bottom-up approach has data marts as the building blocks of the data warehouse, and each data mart has its own dimensional model. Is it the same for the top-down approach? Does Inmon's method use dimensional models?
Kimball's method uses a collection of data marts with a common "dimension bus" as the data warehouse.
Inmon's method has a subject-oriented, normalized structure as the warehouse, and from that structure the data is exported to data marts, which may (or may not) be star-shaped like Kimball's.
For very large warehouses, those two architectures converge, or at least become similar, due to the introduction of master-data management structures/storage in the Kimball-type architecture.
There is a white paper on Inmon's site called A Tale of Two Architectures which nicely summarizes the two approaches.
Dimensional modelling is a design pattern sometimes used for Data Marts. It's not a very effective technique for complex Data Warehouse design due to the redundancy and in-built bias in dimensional models. Kimball's "bottom-up" approach attempts to sidestep the issue by referring to a collection of Data Marts as a "Data Warehouse" - an excuse that looks far less credible today than it did in the 1990s when Kimball first proposed it.
Inmon recommends Normal Form as the most flexible, powerful and efficient basis for building a Data Warehouse.