Fact and Dimension Relational Model Snowflake Schema - data-warehouse

How to create fact and dimension tables with Sales data frame from Daily Target data frame

Related

Multiple data sets from different sources for time series forecasting

I have an interesting question about time series forecasting. If someone has temporal data from multiple sensors, and each dataset covers the same span, e.g., 2010 to 2015, how should the data be organized to train a single forecasting model on all of those sensors? If one just stacks the datasets, the result is, e.g., sensorDataset1 (2010–2015) followed by sensorDataset2 (2010–2015), and the time range starts over again with sensors 3, 4, ..., n. Is this a problem with time series data or not?
If yes, what is the proper way to handle it?
I tried stacking all the data and training the model anyway, and it actually has a good error, but I wonder whether that approach is valid.
Try resampling your individual sensor datasets to the same period.
For example, if sensor 1 has a data entry every 5 minutes and sensor 2 has an entry every 10 minutes, resample your data to a common period across all sensors. Each data point you show to your model will then be of better quality, which should improve the performance of your model.
How much this influences your error depends on what you're trying to forecast and on the relationships that exist between the variables in your data.
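If the sensor readings live in pandas objects indexed by timestamp, a minimal sketch of this resampling idea might look like the following; the 5- and 10-minute frequencies, the random values, and the names sensor1/sensor2 are only illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Illustrative data: sensor 1 reports every 5 minutes, sensor 2 every 10 minutes.
idx1 = pd.date_range("2010-01-01", periods=12, freq="5min")
idx2 = pd.date_range("2010-01-01", periods=6, freq="10min")
sensor1 = pd.Series(np.random.rand(12), index=idx1, name="sensor1")
sensor2 = pd.Series(np.random.rand(6), index=idx2, name="sensor2")

# Resample both sensors to a common 10-minute period (mean within each bin)
# and align them side by side, so each timestamp carries one value per sensor.
common = pd.concat(
    [sensor1.resample("10min").mean(), sensor2.resample("10min").mean()],
    axis=1,
)
print(common.head())
```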

How to adjust an already-built ML predictive model

How can I continue training a machine learning model after getting prediction results from it?
What I mean by that is that I built a model on my 1 million record dataset, and this model took around one day to build.
I extracted the model results using Python, and now I have a function that I can feed my features to and it gives me prediction results,
but over time my dataset has grown to 1.5 million records.
I do not want to redo the whole thing all over again from scratch.
Is there any way to continue on top of the first model I built (the one with 1 million records), so that the new model takes less time to adjust based on the new 0.5 million records compared to rebuilding everything from scratch on 1.5 million records?
P.S. I am asking about all algorithms; if there is any way to do this for some algorithms, it would be good to know which ones they are.
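Whether this is possible depends heavily on the algorithm. As a hedged sketch, scikit-learn estimators that expose partial_fit (for example SGDClassifier) can be updated with only the new records; the arrays below are hypothetical stand-ins for the original and the new data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-ins: X_old/y_old for the original 1M-record data,
# X_new/y_new for the 0.5M records that arrived later.
rng = np.random.default_rng(0)
X_old, y_old = rng.random((1000, 10)), rng.integers(0, 2, 1000)
X_new, y_new = rng.random((500, 10)), rng.integers(0, 2, 500)

# First (expensive) fit: all classes must be declared up front for partial_fit.
model = SGDClassifier(random_state=0)
model.partial_fit(X_old, y_old, classes=np.unique(y_old))

# Later: update the already-trained model with only the new records,
# instead of refitting from scratch on all 1.5M rows.
model.partial_fit(X_new, y_new)

print(model.predict(X_new[:5]))
```

Estimators without partial_fit (most tree ensembles, for instance) generally need a full retrain, so which algorithms support this kind of incremental update has to be checked case by case.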

Before clustering, should I do an analysis on the time series?

I have a question. I have a lot of different items, the different articles of a company (26,000), and I have the sell quantity for the 52 weeks of 2017. I need to build a forecasting model for the future, so I decided to cluster the items.
The goal is to group the items that were sold in similar quantities during 2017; for a new collection of items I would then do a classification based on the clusters and build a specific forecasting model per cluster. It's my first time using machine learning, so I need help.
Do I need to do a correlation analysis before I do the clustering?
I could create a metric based on correlation and pass it to my clustering function as the distance metric.
Clustering time series data on the raw values cannot be expected to yield good results.
Time series data is about trends, not the actual values.
Try transforming your data to reflect the trends and then do the clustering.
For example, suppose your data is 5, 10, 45, 23.
Transform it to 0, 1, 1, 0 (1 means an increase in value over the previous one). By doing so you can cluster together the items that increase or decrease together.
This is just an opinion; you will have to try out various transformations and see what works on your data. https://datascience.stackexchange.com/ is a relevant place to ask such questions.
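A minimal sketch of this transform-then-cluster idea, assuming scikit-learn's KMeans and randomly generated weekly quantities (the array shapes and the number of clusters are placeholders, not recommendations):

```python
import numpy as np
from sklearn.cluster import KMeans

# Fake data: each row is one item's weekly sell quantity (here 6 items x 52 weeks).
rng = np.random.default_rng(0)
raw = rng.integers(0, 100, size=(6, 52))

# Transform raw quantities into trend indicators: 1 where the value increased
# versus the previous week, 0 otherwise (the first week has no predecessor, so 0).
trend = (np.diff(raw, axis=1, prepend=raw[:, :1]) > 0).astype(int)

# Cluster the items on their trend patterns rather than on raw quantities.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(trend)
print(labels)
```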

Should I keep/remove identical training examples that represent different objects?

I have prepared a dataset to recognise a certain type of object (about 2240 negative object examples and only about 90 positive object examples). However, after calculating 10 features for each object in the dataset, the number of unique training instances dropped to about 130 and 30, respectively.
Since the identical training instances actually represent different objects, can I say that this duplication holds relevant information (e.g. the distribution of object feature values), which may be useful in one way or another?
If you omit the duplicates, that will skew the base rate of each distinct object. If the training data are a representative sample of the real world, then you don't want that, because you will actually be training for a slightly different world (one with different base rates).
To clarify the point, consider a scenario in which there are just two distinct objects. Your original data contains 99 of object A and 1 of object B. After throwing out duplicates, you have 1 object A and 1 object B. A classifier trained on the de-duplicated data will be substantially different than one trained on the original data.
My advice is to leave the duplicates in the data.
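A tiny pandas sketch of the base-rate point, using toy data that mirrors the 99-to-1 example above:

```python
import pandas as pd

# 99 copies of object A and 1 object B, all reduced to identical feature vectors.
df = pd.DataFrame({"feature": [1.0] * 99 + [2.0], "label": ["A"] * 99 + ["B"]})

# Base rates with duplicates kept: A 0.99, B 0.01.
print(df["label"].value_counts(normalize=True))

# Base rates after dropping duplicates: A 0.5, B 0.5 -- a very different world.
print(df.drop_duplicates()["label"].value_counts(normalize=True))
```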

What are the types of dimension tables in star schema design? [closed]

When reading about star schema design, I have seen that many people use various names for different types of dimension tables.
Please list the names and a short description of each type. If there are any, also list alias names.
I have come across these types of dimension tables so far:
Regular dimension
Standard star dimension.
Time Dimension
A special case of the standard star dimension.
Parent-child dimension
Used to model hierarchical structures, e.g., a BOM (bill of materials).
Snowflake dimension
Can also be used to model hierarchical structures.
Degenerate dimensions
When the dimension attribute is stored as part of the fact table, and not in a separate dimension table. Typically used for high cardinality dimensions like "transaction number".
Junk dimension
A single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions. Typically used for low cardinality, non-related dimensions like gender or other booleans.
Role playing dimensions
For instance, a "Date" dimension can be used for "Date of Sale" as well as "Date of Delivery" or "Date of Hire" (see the sketch after this list).
Mini dimensions
For rapidly changing large dimensions. Typically used for managing high frequency, low cardinality change in a dimension.
Conformed dimensions
Implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation. Also seen under the name Shared dimension.
Monster Dimension
A very large dimension.
Shrunken dimension
A subset of a dimension's attributes that apply to a higher level of summary. For example, a Month dimension would be a shrunken dimension of the Date dimension. The Month dimension could be connected to a forecast fact table whose grain is at the monthly level.
Inferred Dimensions
While loading fact records, a dimension record may not yet be ready. One solution is to generate a surrogate key with Null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.
Static Dimension
Not extracted from the original data source, but created within the context of the data warehouse. A static dimension can be loaded manually, for example with status codes, or it can be generated by a procedure, such as a Date or Time dimension.
Multi value Dimension
Simply a bridge table between the entities involved in a many-to-many relationship. The many-to-many relationship can also be between a fact and a dimension.
Then there is a group of dimension tables I will call Dynamic dimensions.
These can be further divided into 2 groups.
Slowly changing dimension/Rapidly changing dimension
Attributes of a dimension that undergo changes over time.
Slowly Growing Dimension/Rapidly Growing Dimension
Relates to the growth of records/elements in the dimension.
NB: These can then be combined with the size of the dimension table, resulting in "Rapidly Changing Monster Dimension", "Slowly changing mini dimension" etc.
Special cases:
I'm not sure about these ones, so please help with a description/use scenario.
Data Mining Dimensions
Virtual dimension
Demographic Dimensions
Write-Enabled Dimensions
Dependent Dimensions
Independent Dimensions
Primary Dimensions
Secondary Dimensions
Tertiary Dimensions
Informational dimension
Dimension triage dimension
Non-conforming dimensions from the general ledger
Reference dimension
A reference dimension relationship between a cube dimension and a measure group exists when the key column for the dimension is joined indirectly to the fact table through a key in another dimension table, as in a snowflake schema design.
An alias for a reference dimension could be Snowflake dimension, since a reference dimension relationship represents the relationship between dimension tables and a fact table in a snowflake schema design.
Data quality dimension
Some authors suggest adding a special dimension called a data quality dimension to describe each fact table record further.
Typical values in a data quality dimension could then be “Normal value,” “Out-of-bounds value,” “Unlikely value,” “Verified value,” “Unverified value,” and “Uncertain value.”
NB: The chosen values in a data quality dimension depend on the specific business needs in a given situation.
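As a hedged illustration of the role-playing and degenerate dimension entries above, here is a minimal pandas sketch in which a single Date dimension is joined to a sales fact table under two roles, while the transaction number stays on the fact table itself; all table and column names are invented for the example:

```python
import pandas as pd

# Role-playing "Date" dimension: one physical table reused under several roles.
dim_date = pd.DataFrame({
    "date_key": [20170101, 20170102, 20170103],
    "calendar_date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "month": [1, 1, 1],
})

# Fact table: one foreign key per role; "transaction_number" is kept on the
# fact itself, i.e. a degenerate dimension.
fact_sales = pd.DataFrame({
    "transaction_number": ["T-001", "T-002"],
    "date_of_sale_key": [20170101, 20170102],
    "date_of_delivery_key": [20170102, 20170103],
    "amount": [120.0, 80.0],
})

# Join the same dimension twice, once per role, prefixing columns to tell the roles apart.
sales = (
    fact_sales
    .merge(dim_date.add_prefix("sale_"),
           left_on="date_of_sale_key", right_on="sale_date_key")
    .merge(dim_date.add_prefix("delivery_"),
           left_on="date_of_delivery_key", right_on="delivery_date_key")
)
print(sales[["transaction_number", "sale_calendar_date",
             "delivery_calendar_date", "amount"]])
```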
