How to integrate various data marts? - data-warehouse

I recently joined a healthcare company, and they have a separate data mart for each type of disease. Let's say I have three different DMs as follows:
HIV
HepC
Respiratory
How would I go about integrating these into one data warehouse?
From what I have read, this is the Kimball approach.
And I should look for similar dimensions and try to build on that.
Any other recommendations?

Your question is too vague. Without knowing what you want to do with a data warehouse, and how the data marts are structured, it's hard to comment on how you should go about it. You might want to step back and think about two things, and explain: what do I want to do? and what do I have?
Talk with the stakeholders to settle on what they want to have in a data warehouse. How do they want to use it? Is it for internal analytics, or for simple aggregation reports? If the latter, what kind of metrics need to be aggregated? If they are doing complex analytics, what kind of metrics do they need for it? I recommend identifying a list of "needs" and prioritizing them, so you can think about which dimensions need to be delivered first.
After that, look closely at what you have. What does each disease data mart contain? Does it have information about the disease? Taxonomy? Patients who have that disease? Procedures done for that disease? Identify the structure of the data marts, and make a list of attributes that can be derived from them.
After that, you might have a more fruitful conversation on the methods of integration.
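As a starting point for that inventory, here is a minimal sketch that lists attributes appearing in more than one mart, which are candidates for shared (conformed) dimensions. The column names below are hypothetical, and a matching name does not guarantee a matching meaning, so each candidate still needs to be checked by hand.

```python
from collections import defaultdict

# Hypothetical column inventories for each data mart; the real marts
# will have their own table and column names.
mart_columns = {
    "HIV": {"patient_id", "visit_date", "provider_id", "viral_load"},
    "HepC": {"patient_id", "visit_date", "provider_id", "genotype"},
    "Respiratory": {"patient_id", "visit_date", "fev1"},
}


def shared_attributes(inventories, min_marts=2):
    """Return attributes that appear in at least `min_marts` marts."""
    seen_in = defaultdict(set)
    for mart, columns in inventories.items():
        for column in columns:
            seen_in[column].add(mart)
    return {
        column: sorted(marts)
        for column, marts in seen_in.items()
        if len(marts) >= min_marts
    }


# Attributes shared across marts, with the marts that contain them.
candidates = shared_attributes(mart_columns)
```

Attributes that show up in only one mart (such as the hypothetical `viral_load`) are more likely to be mart-specific measures than shared dimensions.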

Related

A datawarehouse without measures

Am I doing this correctly? There's no measure so this is throwing me off a bit.
I am designing my database to hold records of user profiles. The users can come in and edit their profile on a front-end portal that links to this DB when records are edited/updated/deleted. The DB also needs to produce XML feeds for a public website.
The warehouse:
Yes, a fact table can exist without measures, it is called a factless fact table.
Please read more at http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/factless-fact-table/ and in other documentation.
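To make the idea concrete, here is a minimal sketch of a factless fact table using Python's built-in sqlite3 module. The table and column names (`dim_user`, `dim_date`, `fact_profile_edit`) are assumptions made up for illustration, not taken from the question's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
    -- Factless fact table: only foreign keys, no numeric measure columns.
    CREATE TABLE fact_profile_edit (
        user_key INTEGER REFERENCES dim_user(user_key),
        date_key INTEGER REFERENCES dim_date(date_key)
    );
    INSERT INTO dim_user VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO dim_date VALUES (20240101, '2024-01-01'), (20240102, '2024-01-02');
    INSERT INTO fact_profile_edit VALUES (1, 20240101), (1, 20240102), (2, 20240102);
""")

# With no stored measure, the number is derived by counting event rows.
edits_per_user = dict(conn.execute("""
    SELECT u.user_name, COUNT(*)
    FROM fact_profile_edit f
    JOIN dim_user u ON u.user_key = f.user_key
    GROUP BY u.user_name
""").fetchall())
```

Each row records only that an event happened for a combination of dimension keys; any figure needed for reporting comes from counting rows.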
While you absolutely can have a fact table without measures - as RaduM has linked to an explanation of - if you have no measures anywhere in your model I would question whether this database should use a dimensional model at all.
Dimensional models are intended for BI functions - data analysis, reporting, feeding into cubes, etc. Your description in a later comment about the use of this database seems to suggest this database is actually just the back end database for a website? If so, I would suggest avoiding dimensional modelling altogether. A standard normalised data model is likely to be far more suitable.
Data warehouses are normally secondary datastores which are not your live application database. Data is pulled from your primary sources into the data warehouse for reporting and analytics needs.
Transactional databases - like the one you are describing - are generally modelled in a more standard and more highly normalised manner. The usual gold standard is third normal form or higher. If you're unclear on the rules of database normalisation and the concept of third normal form, then I would strongly suggest that you obtain some training on this (there are online tutorials around if you search), and then have a crack at remodelling your scenario in this way. If you get stuck, post up a new question with the problem(s) you're running into.
You might also find this previous question helpful - it describes the difference between OLTP and OLAP. While you're not using OLAP, dimensional models are often used as the RDBMS layer behind an OLAP database:
What are OLTP and OLAP. What is the difference between them?

When to not use neo4j?

Neo4j is a great tool for mapping relational data, but I am curious under what conditions it would not be a good tool to use.
In which use cases would using neo4j be a bad idea?
You might want to check out this slide deck and in particular slides 18-22.
Your question could have a lot of details to it, but let me try to focus on the big pieces. Graph databases are naturally indexed by relationships. So graph databases will be good when you need to traverse a lot of relationships. Graphs themselves are very flexible, so they'll be good when the inter-connections between your data need to change from time to time, or when the data about your core objects that's important to store needs to change. Graphs are a very natural method of modeling some (but not all) data sources, things like peer to peer networks, road maps, organizational structures, etc.
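To illustrate what "traversing relationships" means, here is a small sketch in plain Python. It is not Neo4j code and does not use any Neo4j API; the organizational structure is hypothetical and stored as a simple adjacency list.

```python
from collections import deque

# Hypothetical organizational structure: manager -> direct reports.
reports = {
    "ceo": ["cto", "cfo"],
    "cto": ["dev_lead"],
    "cfo": [],
    "dev_lead": ["dev_1", "dev_2"],
    "dev_1": [],
    "dev_2": [],
}


def reachable(graph, start):
    """Breadth-first traversal: every node reachable from `start`, in visit order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


# Everyone in the CTO's organization, found by following relationships.
cto_org = reachable(reports, "cto")
```

Each step follows a stored relationship directly from one node to the next, which is the access pattern graph databases are built around.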
Graphs tend not to be good at managing huge lists of things. For example, if you were going to build a customer transaction database with analytics (where you have 1 million customers and 50 million transactions, and all you do is post transactions all day long), then it's probably not a good fit. An RDBMS is great at that. Notice how that use case doesn't really exploit relationships.
Make sure to read those two links I provided, they have much more discussion.
For maintenance reasons, any service aggregating data feeds has until now been well advised to keep their sources independent.
If I want to explore relationships between different feeds, this can be done at application level, using data tracking (for example) user preferences amongst the other feeds.
Graph databases are about managing relationship complexity, but this complexity is in many cases a design choice. Putting all your kids in one bathtub is fine until you drop the soap.

Transaction Fact Table approach

I'm working on a financial data mart structure.
And I'm having some doubts about what the better approach is.
The source system database, Dynamics AX 2009, has three tables for customer transactions:
One table for open transactions, where the customer still needs to pay for the service/product;
One table for settled transactions, which holds what the customer has already paid;
Finally, a table that has all customer transactions; it holds transactions from open to settled, and also other transactions such as customer to bank or ledger accounts.
I thought of two options. The first is to maintain one fact table per source table: a fact for open transactions, a fact for all customer transactions and a fact for settled transactions.
The second is to create a single fact table to hold all transactions; to do so I would have to do a full join on the three tables.
I'm not sure about either approach, as the first seems to just copy the tables from production and create the proper dimensions.
With the second I would create a massive fact table where data would constantly change, as open transactions are deleted on the source system when they are settled.
Another doubt: should I create a fact with an SCD (slowly changing dimension) structure to maintain historical data (start date, end date, flag)?
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
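One way to run that check on paper is to write down which dimensions apply to each measure and group the measures accordingly. The sketch below does exactly that; the measure and dimension names are hypothetical and only stand in for whatever the real sales ledger contains.

```python
# Hypothetical measures and the dimensions that apply to each of them.
measure_dimensions = {
    "invoice_amount": {"customer", "date", "invoice"},
    "settled_amount": {"customer", "date", "invoice"},
    "bank_transfer_amount": {"customer", "date", "bank_account"},
}


def group_by_grain(measures):
    """Group measures by the exact set of dimensions that applies to them."""
    groups = {}
    for measure, dims in measures.items():
        groups.setdefault(frozenset(dims), []).append(measure)
    return groups


# One entry per distinct grain; each entry is a candidate fact table.
fact_candidates = group_by_grain(measure_dimensions)
```

If everything lands in one group, a single fact table is plausible; more than one group suggests either several fact tables or reworking the dimensions.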
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: they help you categorize your measures and get a specific view on them. This approach will also open up many more possibilities to create knowledge with your cube. With separate cubes for open/settled/etc. transactions, it will be harder or impossible to compare this data.
Since the data is changing constantly, you should consider updating your fact table at a fixed interval and rebuilding your cube if needed.
Whether you use SCD or not really depends on the data you process and what it is used for. Is there a business case requiring it? Is there a technical use?
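A minimal sketch of this single-fact-table idea, using Python's built-in sqlite3 module. The table names, columns and amounts are invented for illustration and are not the Dynamics AX schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_status (status_key INTEGER PRIMARY KEY, status_name TEXT);
    -- One fact table for all customer transactions; status is a dimension.
    CREATE TABLE fact_customer_transaction (
        transaction_id INTEGER PRIMARY KEY,
        customer_key INTEGER,
        status_key INTEGER REFERENCES dim_status(status_key),
        amount REAL
    );
    INSERT INTO dim_status VALUES (1, 'open'), (2, 'settled');
    INSERT INTO fact_customer_transaction VALUES
        (100, 1, 1, 250.0),
        (101, 1, 2, 100.0),
        (102, 2, 2, 75.0);
""")

# The status dimension slices the same measure into open vs settled.
amount_by_status = dict(conn.execute("""
    SELECT s.status_name, SUM(f.amount)
    FROM fact_customer_transaction f
    JOIN dim_status s ON s.status_key = f.status_key
    GROUP BY s.status_name
""").fetchall())
```

Because open and settled amounts sit in the same table, comparing them is a single GROUP BY rather than a join across separate facts.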
I think this is something you have to decide on your own.
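For reference, the start date / end date / flag structure mentioned in the question is usually applied to dimension rows rather than fact rows. The following is only a sketch of that pattern with hypothetical field names, kept in plain Python lists instead of a database.

```python
from datetime import date


def apply_scd2(history, key, new_attrs, effective):
    """Close the current row for `key` if its attributes changed, then add a new current row."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return history  # nothing changed, keep the current row open
            row["end_date"] = effective
            row["is_current"] = False
    history.append({
        "key": key,
        "attrs": new_attrs,
        "start_date": effective,
        "end_date": None,
        "is_current": True,
    })
    return history


# Hypothetical customer whose city changes once; the third call is a no-op.
history = []
apply_scd2(history, "C1", {"city": "Lisbon"}, date(2024, 1, 1))
apply_scd2(history, "C1", {"city": "Porto"}, date(2024, 6, 1))
apply_scd2(history, "C1", {"city": "Porto"}, date(2024, 7, 1))
```

The old row is kept with an end date instead of being overwritten, which is what makes historical reporting possible and also what makes the table grow.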

Datawarehouse for analytical CRM

Is it beneficial to pull the data from a data warehouse for an analytical CRM application, or should it be pulled from the source systems without the need for a data warehouse? Please help me answer this.
For CRM it is better to fetch the data from the data warehouse, where data transformations have been developed according to the business needs using various ETL tools. Using these transformations you can integrate the CRM analytics for analysing large chunks of data.
I guess the answer will lie in a few factors
what data you need,
the granularity of that data and,
the ease of extract
If you need data from more than one source system, then you will have to join that data between them yourself. One big strength of getting the data from a DWH is that it tends to hold data from a number of source systems, well connected across those systems, with business rules applied consistently across them.
A data warehouse should hold data at the lowest granularity, but sometimes, for pragmatic reasons, decisions may have been taken to partly summarise the data, so you may not have the appropriate granularity.
The big advantage of a DWH is that it has a simple dimensional model structure (for a Kimball star schema anyhow), so as long as the first two are true, I would always get my data from the DWH.
g/l!
Sharing my thoughts: the business case for pulling from the data warehouse rather than directly from the CRM system would be:
A DWH can hold many more indicators for decision making and analysis at enterprise level, across various systems, than a single system like CRM. Therefore, if you want to take your analysis of CRM data further, you can easily merge in information from other systems to perform better analytics/BI from the DWH.
You may want to bring conformity across systems so that customer data is seen in a single view. For example, you can have pipeline and sales information from CRM and then perform revenue calculation in another system for the same customer. It's possible that you want both sets of details in a single place, with the same customer record linked to both measures. Then you might want to add risk (credit information) from an external source to the same record in the DWH. This brings true scalability in terms of reporting and ad hoc requests.
You remove the non-core work and detach the CRM production system from BI and reporting (not talking of specific CRM reports). This has various advantages, both in terms of operations and convenience. You can search on this subject to understand the benefits.
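The single customer view described in the points above can be sketched as follows. The three source extracts and their field names are hypothetical, and the sketch assumes all systems already share one customer identifier, which in practice is often the hard part.

```python
# Hypothetical extracts from three systems, keyed by a shared customer id.
crm_pipeline = {"C1": {"pipeline": 5000.0}, "C2": {"pipeline": 1200.0}}
revenue = {"C1": {"revenue": 4200.0}}
credit_risk = {"C1": {"risk": "low"}, "C2": {"risk": "medium"}}


def single_customer_view(*sources):
    """Merge per-customer records from several systems into one record each."""
    view = {}
    for source in sources:
        for customer_id, attrs in source.items():
            view.setdefault(customer_id, {}).update(attrs)
    return view


# One record per customer, carrying attributes from every system that has them.
customers = single_customer_view(crm_pipeline, revenue, credit_risk)
```

In a warehouse this merge is done once in the ETL layer, so every report sees the same combined record instead of repeating the join.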
For now these are the only points that come to me. I will try adding more thoughts later.
P.S: I am more than happy to be corrected :-)

practical problems of transforming data in data warehouse

I need to explain the practical problems that might be encountered when transforming transactional (and other) data from diverse sources into the data warehouse. According to my knowledge this is about cleansing and scrubbing data. If anyone knows about any practical problems, please help me. Thanks for your help.
That's a broad topic, but I'll offer a few good starting points.
For starters, think about history. If a transaction updates some data point, do you need to apply that retroactively, or do you need to remember what the value was at any given point in time? For example, suppose you have a monthly report of customers by city, and one of your customers moves. How should the DW reflect that?
Think about data acceptance. Is every input row a good input? For example, if you're dealing with web data, there are crawlers and spammers that you might not want to count the same as you count user traffic.
Think about data synchronization. Do all your inputs use the same keys? Do you know how to translate between them? Does Team A mean the same thing by "cust_id" as Team B does? A project glossary is very helpful here.
Think about localization. Are your inputs all in the same time zone? Do they all use the same calendar system? Do you need to handle Unicode?
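As an example of the time zone issue, here is a minimal sketch that converts timestamps from two hypothetical source systems to UTC before loading. The source names and fixed offsets are assumptions; real sources with daylight saving time would need proper time zone rules rather than fixed offsets.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical source systems and their fixed UTC offsets.
SOURCE_OFFSETS = {
    "orders_eu": timezone(timedelta(hours=1)),
    "orders_us": timezone(timedelta(hours=-5)),
}


def to_utc(source, local_timestamp):
    """Attach the source's offset to a naive timestamp and convert it to UTC."""
    naive = datetime.fromisoformat(local_timestamp)
    return naive.replace(tzinfo=SOURCE_OFFSETS[source]).astimezone(timezone.utc)


# Two local timestamps that look six hours apart but are the same instant.
eu = to_utc("orders_eu", "2024-03-01 09:00:00")
us = to_utc("orders_us", "2024-03-01 03:00:00")
```

Without this step, the two rows would appear to be six hours apart in the warehouse even though they describe the same moment.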
Think about reporting. Are the data you're capturing able to answer the questions people will ask of the DW? If not, how can you capture data that can?
Think about presentation. Should you be showing customers the same data you're using for internal reporting? Does finance need to see a different slice of the data than marketing?
This really only scratches the surface of the issues that come up on a major DW project. I would refer you to Ralph Kimball's assorted books on Data Warehousing for a more in depth discussion of problems and solutions. Hope this helps you get started.
You give the answer in your question.
According to my knowledge this is about cleansing and scrubbing data.
And you are correct. Cleansing data means that you have a company-wide list of clean element attributes, and a mapping that changes the unclean elements into clean elements.
Processing the data against the clean element attributes is a piece of cake compared to creating the company-wide list of clean element attributes.
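The processing side really is the easy part, as this sketch suggests. The clean list and the mapping below are hypothetical stand-ins for whatever the company eventually agrees on.

```python
# Hypothetical company-wide list of clean values, plus a mapping from
# known unclean variants to them.
CLEAN_COUNTRIES = {"United States", "United Kingdom"}
COUNTRY_MAPPING = {
    "usa": "United States",
    "u.s.": "United States",
    "uk": "United Kingdom",
    "great britain": "United Kingdom",
}


def cleanse_country(raw):
    """Return the clean value, or None if the input cannot be mapped."""
    value = raw.strip()
    if value in CLEAN_COUNTRIES:
        return value
    return COUNTRY_MAPPING.get(value.lower())


# Unmappable inputs come back as None so they can be reviewed, not guessed.
cleaned = [cleanse_country(v) for v in [" USA", "United Kingdom", "uk ", "Atlantis"]]
```

Values that map to None are the ones that go back to the business for a decision, which is where the real effort lies.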
You have to get people from different departments to agree on what data to warehouse, and to agree on what each element means. This is a difficult sociological problem. It's not a terribly hard technical problem.
Good luck getting your company-wide list of clean element attributes.

Resources