How to design data vault schema for efficient queries for BPMS system? - data-warehouse

Current situation:
We have a BPMS (business process management suite) in place. There is increasing demand on historical and operative reports. The data model in the BPMS is not designed for historical queries. So we are analysing the possible solutions.
Solution in mind:
The idea is to push data on events in flow to an external database. Typical events in BPM are: new process instance was created, status changed, a step in the process was performed or status of the process instance was changed. Data vault is besides the star schema one of the interesting alternatives. Let’s assume there are two Hubs: PI (processitem instances) and OU (organisational unit) and a Link table LINK_PI_OU. Each time the process item is assigned to an organisational unit a new line will be added to the link table. The LOAD_DATE in the link table contains the datetime when this record was added. The record in the link table with the latest LOAD_DATE shows the current assignment of the process instance.
Question:
Let’ assume the business wants to know to whom all open process instances are currently assigned grouped by organisational unit.
How will a query look like for this report? Can it really be performant?
Or am I on the complete wrong way?

In general terms I didnt think that Data-Vault is intended to be an end user report layer or even a faux transactional system.
Im not completely clear on your archectiture, but in my understanding D-V is a historical repository that keeps all data for an enterprise that feeds a (Kimball/Inmon)datawarehouse. So in high level terms ...
Transaction systems => D-V => DWH => (cubes =>) users
This being the case, I wouldnt be posing queries to a Data Vault, instead I would write some ETL to populate a data warehouse and pose queries at the DWH.
The other view, I guess, is that you could build a set of views on top of the D-V, that would hide the structure from users, but I think I'm a bit of a purist and would go for a DWH.

As #Marcud D said, Data Vault is the model of Data Warehouse and usually when using DV modelling, you have to build data marts from DV for reporting purposes. I think that organizational unit should be modeled as Satellite table, not as Hub table. So, in any way, you should build a query to feed a specific data mart from DV model and then use it for reporting purposes.

Related

How to handle multitenant data warehouse (each customer has a unique schema)?

so I am trying to set up a data warehouse for a service where each customer has their own database with a unique schema. How do I go about setting up a warehouse so each customer has their own semantic layer / relational model set up automatically (since we (centrally) do not know what is in each database) So that each customer can easily report on their data? Is there any automatic process we can follow? Am I missing something?
It depends on whether you want a consolidated view of the data, or if each customer's data is to remain segregated.
If consolidation is the objective (and there are huge benefits for a multi-tenant SAAS vendor to have a consolidated overview of customer data) then Nithin B's suggestion is good.
If separate warehouses are required, then you'll need to think about how to optimise your costs. The two biggest components will be ETL/ELT, and database hosting.
The fastest way to ETL/ELT is data warehouse automation. You'll find a good list of vendors on our web site (http://ajilius.com/competitors). Look for a solution that will give you the flexibility to meet your deployment options (cloud and/or on-premise), as well as the geographic reach you'll need for accessing customer data.
Will you be hosting your own databases or in the cloud? How much data will each tenant require? A good starting point would be PostgreSQL or SQL Server (SMP), and Ajilius gives you the flexibility to instantly migrate to MPP platforms if your needs outgrow those platforms.
There are many ways to address this.
Land all the tables in a Landing area in different schemas.
Stage the data into appropriate staging tables for dim and fact loads.
Create a dim table to identify the Customer Area. For eg: Dim_Source
Load the data into the fact tables. Any specific customers can filter the data from the facts by using the Dim_Source values.
This design would help overall Enterprise reporting as well.
Hope that helps.
I would start with a Kimball BUS Matrix.
Cheers
Nithin

A datawarehouse without measures

Am I doing this correctly? There's no measure so this is throwing me off a bit.
I am designing my database to hold records of user profiles. The Users can come in and edit profile on a front end portal that links to the this DB when records are edited/updated/deleted. The DB also needs to produce XML feeds for a public website.
The warehouse:
Yes, a fact table can exist without measures, it is called a factless fact table.
Please inform more on : http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/factless-fact-table/ and other documentation.
While you absolutely can have a fact table without measures - as RaduM has linked to an explanation of - if you have no measures anywhere in your model I would question whether this database should use a dimensional model at all.
Dimensional models are intended for BI functions - data analysis, reporting, feeding into cubes, etc. Your description in a later comment about the use of this database seems to suggest this database is actually just the back end database for a website? If so, I would suggest avoiding dimensional modelling altogether. A standard normalised data model is likely to be far more suitable.
Data warehouses are normally secondary datastores which are not your live application database. Data is pulled from your primary sources into the data warehouse for reporting and analytics needs.
Transactional databases - like the one you are describing - are generally modelled in a more standard and more highly normalised manner. The usual gold standard is third normal form or higher. If you're unclear on the rules of database normalisation and the concept of third normal form, then I would strongly suggest that you obtain some training on this (there are online tutorials around if you search), and then have a crack at remodelling your scenario in this way. If you get stuck, post up a new question with the problem(s) you're running into.
You might also find this previous question helpful - it describes the difference between OLTP and OLAP. While you're not using OLAP, dimensional models are often used as the the RDBMS layer behind an OLAP database:
What are OLTP and OLAP. What is the difference between them?

Loading Denormalized Data into a Datawarehouse

We are building a data warehouse by consuming file feeds from different sources.
The file feeds are all denormalized/flattened (In the Transactions (fact) file, the Account attributes keeps repeating in all the records).
Also, the account information changes often (the feed gives an as-is version of the data).
What is the best practice in this situation. Should the data warehouse have a star schema model (with the Account information as a slowly changing dimension and a Transaction fact). Will re-normalizing make the ETL process complex?
In my company, whenever some input is denormalized, we normalize it and from there we proceed with loading our schemas (whatever your schema is).
The reason is that, being de-normalized, those inputs are difficult to check for inconsistencies (data quality). Apart from that, conforming all of your inputs to some standard allows your code to be more maintainable.
In our case, following the Kimball practices has been a total success, fact table, slow changing dimensions and all that jazz.
Hard to answer without such details as daily volume, latency threshold, resource availability, reporting requirements, platform and tool constraints, etc. A traditional ODS, where you import into and store a normalized structure before creating data marts from that, is great but not optimal for big data or real time analysis. A more modern approach, using a data lake in Hadoop or a virtualization layer, may not be feasible for your organization.
General Opinions:
1) re-normalizing does seem unnecessary from both a complexity and performance standpoint unless you have some ongoing use for the normalized data store.
2) Whether or not you build a traditional star schema or a graph or whatever should be governed by the reporting requirements and tools, not the source data format. Those sources will change, btw.
3) "Transaction" does not sound like a fact to me. A purchase transaction, e.g., could feed a sales fact, an accumulating snapshot for a sales cycle, a funnel conversion fact, etc.
4) I'm not sure whether "Account" is a customer, or a balance account such as a credit card, online payment service, bank account, etc. They imply different SCD types. In any case, Google will be sufficient to get plenty of information about building those dimensions.

When we use Datamart and Datawarehousing?

I am new to DW . When we should use the term Datamart and when we should use the term Datawarehousing . Please explain with example may be your own example or in terms of Adventureworks .
I'm don't work on MS SQL Server. But here's a generic example with a business use case.
Let me add another term to this. First off, there is a main transactional database which interacts with your application (assuming you have an application to interact with, obviously). The data gets written into the Master database (hopefully, you are using Master-Slave replication), and simultaneously gets copied into the salve. According to the business and reporting requirements, cleansing and ETL is performed on the application data and data is aggregated and stored in a denormalized form to improve reporting performance and reduce the number of joins. Complex pre-calculated data is readily available to the business user for reporting and analysis purposes. This is a dimensional database - which is a denormalized form of the main transactional database (most probably in 3NF).
But, as you may know, all businesses have different supporting systems which also bring in data in the form of spreadsheets, csvs and flatfiles. This data is usually for a single domain, such as, call center, collections so on and so forth. We can call every such separate domain data as data mart. The data from different domains is also operated upon by an ETL tool and is denormalized in its own fashion. When we combine all the datamarts and dimensional databases for solving reporting and analysis problem for business, we get a data warehouse.
Assume that you have a major application, running on a website - which is your main business. You have all primary consumer interaction on that website. That will give you your primary dimensional database. For consumer support, you may have a separate solution, such as Avaya or Genesys implemented in your company - they will provide you the data on the same (or probably different server). You prepare ETLs to load that data onto your own server. You call the resultant data as data marts. And you combine all of these things to get a data warehouse. I know, I am being repetitive but that is on purpose.

Datawarehouse for analytical CRM

Is it beneficial to pull the data from Datawarehouse for analytical CRM application or it should be pulled from the source systems without the need of Datawarehouse??....Please help me answering.....
For CRM it is better to fetch the data from datawarehouse. Where a data transformations developed according to the buiness needs using various ETL tools, using this transofrmations you can integrate the CRM analytics for analysing the large chunk of data.
I guess the answer will lie in a few factors
what data you need,
the granularity of that data and,
the ease of extract
If you need data that you will need to access more than one source system, then you will have to do the joining of that data between them. One big strength of getting the data from a DWH, is that they tend to have data from a number of source systems and are well connected across these source systems with busienss rules being applied consistently across them.
A datawarehouse should have lowest granularity data, but sometimes, for pragmatic reasons, decisions may have been taken to partly summarise the data, thus you may not have the approproate granularity.
The big advantage of a DWH is that it is a simle dimensional model structure (for a kimball star schema any how), so as long as the first two are true, I would always get my data from the DWH.
g/l!
Sharing my thoughts on business case to pull from datawarehouse rather than directly from CRM system would be -
DWH can hold lot more indicators for Decision making and analysis at enterprise level across various systems than a single system like CRM. Therefore if you want to further your analysis on CRM data you can merge easily information from other system to perform better analytics/BI from DWH.
If you want to bring conformity across systems for seeing data of customer with single view. For example, you can have pipeline and sales information from CRM and then perform revenue calculation in another system for the same customer. Its possible that you want both sets of details in single place with same customer record linked to both measures.Then you might want to add Risk (Credit information) from external source into the same record in DWH. It brings true scability in terms of reporting and adhoc requests.
Remove the non-core work and dettach the CRM production system from BI and reporting (not talking of specific CRM reports). This has various advantages both terms of operations and convinence. You can google on this subject more to understand the benefits.
For now these are the only points that come to me. I will try adding more thoughts later.
P.S: I am more than happy to be corrected :-)

Resources