I am new to data warehousing and I need to design a fact constellation schema for a retail stores network. Can anybody suggest a good tutorial? I have rarely seen any on the web.
The Data Warehouse Toolkit is essential reading for data warehousing, and includes an entire chapter where retail sales is used as an example.
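For orientation only, here is a rough sketch of what a fact constellation looks like: two fact tables (sales and inventory) sharing the same dimensions. It is not taken from the book; it assumes Python's built-in sqlite3 module, and every table and column name is hypothetical.

```python
import sqlite3

# Hypothetical retail fact constellation: two fact tables share three dimensions.
SCHEMA = """
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);

CREATE TABLE fact_sales (
    date_key      INTEGER REFERENCES dim_date(date_key),
    store_key     INTEGER REFERENCES dim_store(store_key),
    product_key   INTEGER REFERENCES dim_product(product_key),
    quantity_sold INTEGER,
    sales_amount  REAL
);
CREATE TABLE fact_inventory (
    date_key         INTEGER REFERENCES dim_date(date_key),
    store_key        INTEGER REFERENCES dim_store(store_key),
    product_key      INTEGER REFERENCES dim_product(product_key),
    quantity_on_hand INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One row per table, just enough to run a query.
conn.execute("INSERT INTO dim_date VALUES (1, '2024-01-01')")
conn.execute("INSERT INTO dim_store VALUES (1, 'Downtown')")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 3, 29.97)")
conn.execute("INSERT INTO fact_inventory VALUES (1, 1, 1, 40)")

# Join the two facts on their shared dimension keys.
rows = conn.execute("""
    SELECT s.store_name, p.product_name, f.quantity_sold, i.quantity_on_hand
    FROM fact_sales f
    JOIN fact_inventory i
      ON i.date_key = f.date_key
     AND i.store_key = f.store_key
     AND i.product_key = f.product_key
    JOIN dim_store s   ON s.store_key = f.store_key
    JOIN dim_product p ON p.product_key = f.product_key
""").fetchall()
print(rows)
```

The point of the shape is that sales and inventory can be compared by store, product, and date because both fact tables use the same dimension keys.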
I am trying to set up a data warehouse for a service where each customer has their own database with a unique schema. How do I set up a warehouse so that each customer gets their own semantic layer / relational model automatically (since we, centrally, do not know what is in each database) and can easily report on their data? Is there an automatic process we can follow? Am I missing something?
It depends on whether you want a consolidated view of the data, or if each customer's data is to remain segregated.
If consolidation is the objective (and there are huge benefits for a multi-tenant SAAS vendor to have a consolidated overview of customer data) then Nithin B's suggestion is good.
If separate warehouses are required, then you'll need to think about how to optimise your costs. The two biggest components will be ETL/ELT, and database hosting.
The fastest way to ETL/ELT is data warehouse automation. You'll find a good list of vendors on our web site (http://ajilius.com/competitors). Look for a solution that will give you the flexibility to meet your deployment options (cloud and/or on-premise), as well as the geographic reach you'll need for accessing customer data.
Will you be hosting your own databases or in the cloud? How much data will each tenant require? A good starting point would be PostgreSQL or SQL Server (SMP), and Ajilius gives you the flexibility to instantly migrate to MPP platforms if your needs outgrow those platforms.
There are many ways to address this.
Land all the tables in a Landing area in different schemas.
Stage the data into appropriate staging tables for dim and fact loads.
Create a dim table to identify the customer area, for example Dim_Source.
Load the data into the fact tables. Specific customers can then filter the data in the facts by using their Dim_Source values.
This design would help overall Enterprise reporting as well.
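The steps above can be sketched roughly as follows. This is only an illustration, assuming Python's built-in sqlite3 module; apart from Dim_Source, the table, column, and function names (fact_orders, total_for_customer, total_enterprise) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dim_Source identifies which customer (tenant) a fact row came from.
CREATE TABLE dim_source (source_key INTEGER PRIMARY KEY, customer_name TEXT);
-- Hypothetical consolidated fact table shared by all customers.
CREATE TABLE fact_orders (
    source_key INTEGER REFERENCES dim_source(source_key),
    order_date TEXT,
    amount     REAL
);
""")
conn.executemany("INSERT INTO dim_source VALUES (?, ?)",
                 [(1, "customer_a"), (2, "customer_b")])
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", [
    (1, "2024-01-01", 100.0),
    (1, "2024-01-02", 50.0),
    (2, "2024-01-01", 70.0),
])

def total_for_customer(conn, customer_name):
    """Customer-level reporting: filter the fact through Dim_Source."""
    row = conn.execute("""
        SELECT COALESCE(SUM(f.amount), 0.0)
        FROM fact_orders f
        JOIN dim_source d ON d.source_key = f.source_key
        WHERE d.customer_name = ?
    """, (customer_name,)).fetchone()
    return row[0]

def total_enterprise(conn):
    """Enterprise reporting: the same fact, with no source filter."""
    return conn.execute(
        "SELECT COALESCE(SUM(amount), 0.0) FROM fact_orders").fetchone()[0]

print(total_for_customer(conn, "customer_a"))
print(total_enterprise(conn))
```

In a real deployment the per-customer filter would normally be enforced by the database or the reporting layer rather than left to each query.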
Hope that helps.
I would start with a Kimball BUS Matrix.
Cheers
Nithin
Am I doing this correctly? There's no measure so this is throwing me off a bit.
I am designing my database to hold records of user profiles. Users can come in and edit their profile on a front-end portal that links to this DB when records are edited/updated/deleted. The DB also needs to produce XML feeds for a public website.
The warehouse:
Yes, a fact table can exist without measures, it is called a factless fact table.
Read more at: http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/factless-fact-table/ and in other documentation.
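To make the idea concrete, here is a minimal sketch of a factless fact table, assuming Python's built-in sqlite3 module. The tables (dim_user, dim_date, fact_profile_edit) are hypothetical and are not the asker's actual model. The fact table holds only foreign keys; the "measure" comes from counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
-- Factless fact table: one row per event, foreign keys only, no measure columns.
CREATE TABLE fact_profile_edit (
    user_key INTEGER REFERENCES dim_user(user_key),
    date_key INTEGER REFERENCES dim_date(date_key)
);
""")
conn.executemany("INSERT INTO dim_user VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-02")])
conn.executemany("INSERT INTO fact_profile_edit VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])

# Analysis works by counting event rows per dimension value.
edits_per_user = conn.execute("""
    SELECT u.user_name, COUNT(*)
    FROM fact_profile_edit f
    JOIN dim_user u ON u.user_key = f.user_key
    GROUP BY u.user_name
    ORDER BY u.user_name
""").fetchall()
print(edits_per_user)
```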
While you absolutely can have a fact table without measures (RaduM has linked to an explanation), if you have no measures anywhere in your model I would question whether this database should use a dimensional model at all.
Dimensional models are intended for BI functions - data analysis, reporting, feeding into cubes, etc. Your description in a later comment about the use of this database seems to suggest this database is actually just the back end database for a website? If so, I would suggest avoiding dimensional modelling altogether. A standard normalised data model is likely to be far more suitable.
Data warehouses are normally secondary datastores which are not your live application database. Data is pulled from your primary sources into the data warehouse for reporting and analytics needs.
Transactional databases - like the one you are describing - are generally modelled in a more standard and more highly normalised manner. The usual gold standard is third normal form or higher. If you're unclear on the rules of database normalisation and the concept of third normal form, then I would strongly suggest that you obtain some training on this (there are online tutorials around if you search), and then have a crack at remodelling your scenario in this way. If you get stuck, post up a new question with the problem(s) you're running into.
You might also find this previous question helpful - it describes the difference between OLTP and OLAP. While you're not using OLAP, dimensional models are often used as the RDBMS layer behind an OLAP database:
What are OLTP and OLAP. What is the difference between them?
We are building a data warehouse by consuming file feeds from different sources.
The file feeds are all denormalized/flattened (in the Transactions (fact) file, the Account attributes keep repeating in all the records).
Also, the account information changes often (the feed gives an as-is version of the data).
What is the best practice in this situation? Should the data warehouse have a star schema model (with the Account information as a slowly changing dimension and a Transaction fact)? Will re-normalizing make the ETL process complex?
In my company, whenever some input is denormalized, we normalize it and from there we proceed with loading our schemas (whatever your schema is).
The reason is that, being de-normalized, those inputs are difficult to check for inconsistencies (data quality). Apart from that, conforming all of your inputs to some standard allows your code to be more maintainable.
In our case, following the Kimball practices has been a total success: fact tables, slowly changing dimensions and all that jazz.
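As a rough sketch of that approach (not our production code), the function below splits a flattened feed into a type 2 slowly changing Account dimension and a Transaction fact. It is plain Python; the field names (account_id, status, txn_id, amount, feed_date) and the function load_feed are assumptions made up for the illustration.

```python
def load_feed(rows, dim, facts):
    """Split flattened rows into SCD type 2 dimension versions and fact rows.

    rows:  flattened feed records (dicts) with account attributes repeated.
    dim:   list of account versions, mutated in place.
    facts: list of transaction facts, mutated in place.
    """
    # Index the current version of each account by its natural key.
    current = {d["account_id"]: d for d in dim if d["is_current"]}
    for row in rows:
        version = current.get(row["account_id"])
        if version is None or version["status"] != row["status"]:
            if version is not None:
                # Attribute changed: expire the old version.
                version["valid_to"] = row["feed_date"]
                version["is_current"] = False
            # Open a new version with a new surrogate key.
            version = {
                "account_key": len(dim) + 1,
                "account_id": row["account_id"],
                "status": row["status"],
                "valid_from": row["feed_date"],
                "valid_to": None,
                "is_current": True,
            }
            dim.append(version)
            current[row["account_id"]] = version
        # The fact references the version that was current when it arrived.
        facts.append({
            "txn_id": row["txn_id"],
            "account_key": version["account_key"],
            "amount": row["amount"],
        })

dim, facts = [], []
load_feed([
    {"txn_id": 1, "account_id": "A1", "status": "basic", "amount": 10.0, "feed_date": "2024-01-01"},
    {"txn_id": 2, "account_id": "A1", "status": "basic", "amount": 5.0, "feed_date": "2024-01-01"},
], dim, facts)
load_feed([
    {"txn_id": 3, "account_id": "A1", "status": "premium", "amount": 20.0, "feed_date": "2024-01-02"},
], dim, facts)
print(len(dim), [f["account_key"] for f in facts])
```

Only one tracked attribute (status) is compared here; a real load would compare every attribute you want history for, and would usually run as set-based SQL or in an ETL tool rather than row by row.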
Hard to answer without such details as daily volume, latency threshold, resource availability, reporting requirements, platform and tool constraints, etc. A traditional ODS, where you import into and store a normalized structure before creating data marts from that, is great but not optimal for big data or real time analysis. A more modern approach, using a data lake in Hadoop or a virtualization layer, may not be feasible for your organization.
General Opinions:
1) re-normalizing does seem unnecessary from both a complexity and performance standpoint unless you have some ongoing use for the normalized data store.
2) Whether or not you build a traditional star schema or a graph or whatever should be governed by the reporting requirements and tools, not the source data format. Those sources will change, btw.
3) "Transaction" does not sound like a fact to me. A purchase transaction, e.g., could feed a sales fact, an accumulating snapshot for a sales cycle, a funnel conversion fact, etc.
4) I'm not sure whether "Account" is a customer, or a balance account such as a credit card, online payment service, bank account, etc. They imply different SCD types. In any case, Google will be sufficient to get plenty of information about building those dimensions.
I am working on a project to implement an historian.
I can't really find a difference between an historian and a data warehouse.
Any details would be useful.
Data Historian
Data historians are groups of tables within a database that store historical information about a process or information system.
Data historians are used to keep historical data regarding a manufacturing system. This data can be changes in state of a data point, current values and summary data for these points. Usually this data comes from automated systems like PLCs, DCS or other process controlling system. However some historian data can be human entered.
There are several historians available for commercial use. However, many of the most common historians have tended to be custom developed. The commercial versions would be products like OSIsoft’s PI or GE’s Data Historian.
Some examples of data that could be stored in a data historian are items (or tags) like:
- Total products manufactured for the day
- Total defects created on a particular crew shift
- Current temperature of a motor on the production line
- Set point for the maximum allowable value being monitored by another tag
- Current speed of a conveyor
- Maximum flow rate of a pump over a period of time
- Human entered marker showing a manual event occurred
- Total amount of a chemical added to a tank
These items are some of the important data tags that might be captured. However, once captured, the next step is the presentation or reporting of that data. This is where the work of analysis is of great importance. The date/time stamp of one tag can have a huge correlation to one or more other tags. Carefully storing this in the historian’s database is critical to good reporting.
The retrieval of data stored in a data historian is the slowest part of the system to be implemented. Many companies do a great job of putting data into a historian, but then do not go back and retrieve any of the data. Many times this author has gone into a site that claims to have a historian only to find that the data is “in there somewhere”, but has never had a report run against the data to validate the accuracy of the data.
The rule-of-thumb should be to provide feedback on any of the tags entered as soon as possible after storage into the historian. Reporting on the first few entries of a newly added tag is important, but ongoing review is important too. Once the data is incorporated into both a detailed listing and a summarized list the data can be reviewed for accuracy by operations personnel on a regular basis.
This regular review process by the operational personnel is very important. The finest data gathering systems that might historically archive millions of data points will be of little value to anyone if the data is not reviewed for accuracy by those that are experts in that information.
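As a small illustration of the detailed listing and the summarized list described above, here is a sketch using Python's built-in sqlite3 module. The table tag_history, its columns, the tag names, and the sample values are all invented for the example; commercial historians have their own storage and query interfaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical historian table: one row per tag reading, with its timestamp.
conn.execute(
    "CREATE TABLE tag_history (tag_name TEXT, recorded_at TEXT, value REAL)")
conn.executemany("INSERT INTO tag_history VALUES (?, ?, ?)", [
    ("motor_temp_c", "2024-01-01T08:00:00", 61.5),
    ("motor_temp_c", "2024-01-01T08:01:00", 63.0),
    ("conveyor_speed", "2024-01-01T08:00:00", 1.2),
])

def detail(conn, tag_name):
    """Detailed listing: every stored reading for one tag, in time order."""
    return conn.execute(
        "SELECT recorded_at, value FROM tag_history "
        "WHERE tag_name = ? ORDER BY recorded_at", (tag_name,)).fetchall()

def summary(conn):
    """Summarized list: reading count, minimum and maximum per tag."""
    return conn.execute(
        "SELECT tag_name, COUNT(*), MIN(value), MAX(value) "
        "FROM tag_history GROUP BY tag_name ORDER BY tag_name").fetchall()

print(detail(conn, "motor_temp_c"))
print(summary(conn))
```

Running both reports soon after a tag is added gives operations personnel something concrete to check against what they see on the plant floor.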
Data Warehouse
Data warehousing combines data from multiple, usually varied, sources into one comprehensive and easily manipulated database. Different methods can then be used by a company or organization to access this data for a wide range of purposes. Analysis can be performed to determine trends over time and to create plans based on this information. Smaller companies often use more limited formats to analyze more precise or smaller data sets, though warehousing can also utilize these methods.
Accessing Data Through Warehousing
Common methods for accessing systems of data warehousing include queries, reporting, and analysis. Because warehousing creates one database, the number of sources can be nearly limitless, provided the system can handle the volume. The final result, however, is homogeneous data, which can be more easily manipulated. System queries are used to gain information from a warehouse and this creates a report for analysis.
Uses for Data Warehouses
Companies commonly use data warehousing to analyze trends over time. They might use it to view day-to-day operations, but its primary function is often strategic planning based on long-term data overviews. From such reports, companies make business models, forecasts, and other projections. Routinely, because the data stored in data warehouses is intended to provide more overview-like reporting, the data is read-only.
Is it beneficial to pull the data for an analytical CRM application from the data warehouse, or should it be pulled from the source systems without a data warehouse? Please help me answer this.
For CRM it is better to fetch the data from the data warehouse, where data transformations are developed according to the business needs using various ETL tools. Using these transformations, you can integrate the CRM analytics for analysing large chunks of data.
I guess the answer will lie in a few factors:
- what data you need,
- the granularity of that data, and
- the ease of extract.
If you need data from more than one source system, then you will have to join that data yourself. One big strength of getting the data from a DWH is that it tends to hold data from a number of source systems, well connected across those systems, with business rules applied consistently across them.
A data warehouse should hold data at the lowest granularity, but sometimes, for pragmatic reasons, decisions may have been taken to partly summarise the data, so you may not have the appropriate granularity.
The big advantage of a DWH is that it has a simple dimensional model structure (for a Kimball star schema anyhow), so as long as the first two are true, I would always get my data from the DWH.
g/l!
Sharing my thoughts: the business case for pulling from the data warehouse rather than directly from the CRM system would be:
A DWH can hold many more indicators for decision making and analysis at the enterprise level, across various systems, than a single system like CRM. Therefore, if you want to take your analysis of CRM data further, you can easily merge in information from other systems to perform better analytics/BI from the DWH.
You may want conformity across systems so that customer data is seen in a single view. For example, you can have pipeline and sales information from CRM and then perform revenue calculation in another system for the same customer. It's possible that you want both sets of details in a single place, with the same customer record linked to both measures. Then you might want to add risk (credit information) from an external source to the same record in the DWH. This brings true scalability in terms of reporting and ad hoc requests.
Remove the non-core work and detach the CRM production system from BI and reporting (not talking about specific CRM reports). This has various advantages in terms of both operations and convenience. You can search on this subject to understand the benefits further.
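To illustrate the single customer view from the second point, here is a rough sketch using Python's built-in sqlite3 module. All names (dim_customer, fact_pipeline, fact_revenue, credit_rating) and the sample figures are hypothetical. Each fact is summed on its own before being lined up against the shared customer record, so that rows from one system do not multiply rows from the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conformed customer dimension, enriched with external credit information.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    credit_rating TEXT
);
-- Pipeline comes from the CRM; revenue comes from another system.
CREATE TABLE fact_pipeline (customer_key INTEGER, pipeline_amount REAL);
CREATE TABLE fact_revenue  (customer_key INTEGER, revenue_amount REAL);
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "Acme", "A"), (2, "Globex", "B")])
conn.executemany("INSERT INTO fact_pipeline VALUES (?, ?)",
                 [(1, 500.0), (1, 250.0), (2, 100.0)])
conn.executemany("INSERT INTO fact_revenue VALUES (?, ?)", [(1, 1000.0)])

# Aggregate each fact separately per customer, then present one row each.
single_view = conn.execute("""
    SELECT c.customer_name,
           c.credit_rating,
           (SELECT COALESCE(SUM(p.pipeline_amount), 0.0)
              FROM fact_pipeline p WHERE p.customer_key = c.customer_key),
           (SELECT COALESCE(SUM(r.revenue_amount), 0.0)
              FROM fact_revenue r WHERE r.customer_key = c.customer_key)
    FROM dim_customer c
    ORDER BY c.customer_name
""").fetchall()
print(single_view)
```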
For now these are the only points that come to me. I will try adding more thoughts later.
P.S: I am more than happy to be corrected :-)