I am new to DW . When we should use the term Datamart and when we should use the term Datawarehousing . Please explain with example may be your own example or in terms of Adventureworks .
I'm don't work on MS SQL Server. But here's a generic example with a business use case.
Let me add another term to this. First off, there is a main transactional database which interacts with your application (assuming you have an application to interact with, obviously). The data gets written into the Master database (hopefully, you are using Master-Slave replication), and simultaneously gets copied into the salve. According to the business and reporting requirements, cleansing and ETL is performed on the application data and data is aggregated and stored in a denormalized form to improve reporting performance and reduce the number of joins. Complex pre-calculated data is readily available to the business user for reporting and analysis purposes. This is a dimensional database - which is a denormalized form of the main transactional database (most probably in 3NF).
But, as you may know, all businesses have different supporting systems which also bring in data in the form of spreadsheets, csvs and flatfiles. This data is usually for a single domain, such as, call center, collections so on and so forth. We can call every such separate domain data as data mart. The data from different domains is also operated upon by an ETL tool and is denormalized in its own fashion. When we combine all the datamarts and dimensional databases for solving reporting and analysis problem for business, we get a data warehouse.
Assume that you have a major application, running on a website - which is your main business. You have all primary consumer interaction on that website. That will give you your primary dimensional database. For consumer support, you may have a separate solution, such as Avaya or Genesys implemented in your company - they will provide you the data on the same (or probably different server). You prepare ETLs to load that data onto your own server. You call the resultant data as data marts. And you combine all of these things to get a data warehouse. I know, I am being repetitive but that is on purpose.
Related
so I am trying to set up a data warehouse for a service where each customer has their own database with a unique schema. How do I go about setting up a warehouse so each customer has their own semantic layer / relational model set up automatically (since we (centrally) do not know what is in each database) So that each customer can easily report on their data? Is there any automatic process we can follow? Am I missing something?
It depends on whether you want a consolidated view of the data, or if each customer's data is to remain segregated.
If consolidation is the objective (and there are huge benefits for a multi-tenant SAAS vendor to have a consolidated overview of customer data) then Nithin B's suggestion is good.
If separate warehouses are required, then you'll need to think about how to optimise your costs. The two biggest components will be ETL/ELT, and database hosting.
The fastest way to ETL/ELT is data warehouse automation. You'll find a good list of vendors on our web site (http://ajilius.com/competitors). Look for a solution that will give you the flexibility to meet your deployment options (cloud and/or on-premise), as well as the geographic reach you'll need for accessing customer data.
Will you be hosting your own databases or in the cloud? How much data will each tenant require? A good starting point would be PostgreSQL or SQL Server (SMP), and Ajilius gives you the flexibility to instantly migrate to MPP platforms if your needs outgrow those platforms.
There are many ways to address this.
Land all the tables in a Landing area in different schemas.
Stage the data into appropriate staging tables for dim and fact loads.
Create a dim table to identify the Customer Area. For eg: Dim_Source
Load the data into the fact tables. Any specific customers can filter the data from the facts by using the Dim_Source values.
This design would help overall Enterprise reporting as well.
Hope that helps.
I would start with a Kimball BUS Matrix.
Cheers
Nithin
We have to create rather large Ruby on Rails application based on large database. This database is updated daily, each table has about 500 000 records (or more) and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for user to move from version to version, which are kind of "snapshots" of main database at different points of time. In addition some portions of data need to be served to other external applications with and API.
Considering large amounts of data we thought of splitting database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have it's own application, creating a service with API to interact with the data. It's needed as we don't want to create multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with project of this magnitude and we're trying to find the best possible solution. We don't know if this kind of data separation has any sense. If so, how to provide proper communication of different applications with individual services and between services themselves, as this will be also required.
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot. In this, you have essentially three complementary models which work together to provide up-to-date well-performing solution. However the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates. In this model inserts occur, and updates to some metadata used for some queries only as absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes provided that integrity to the other models is not impinged. In a financial application this might be things like customer or vendor data, employee data, and the like.
Is it beneficial to pull the data from Datawarehouse for analytical CRM application or it should be pulled from the source systems without the need of Datawarehouse??....Please help me answering.....
For CRM it is better to fetch the data from datawarehouse. Where a data transformations developed according to the buiness needs using various ETL tools, using this transofrmations you can integrate the CRM analytics for analysing the large chunk of data.
I guess the answer will lie in a few factors
what data you need,
the granularity of that data and,
the ease of extract
If you need data that you will need to access more than one source system, then you will have to do the joining of that data between them. One big strength of getting the data from a DWH, is that they tend to have data from a number of source systems and are well connected across these source systems with busienss rules being applied consistently across them.
A datawarehouse should have lowest granularity data, but sometimes, for pragmatic reasons, decisions may have been taken to partly summarise the data, thus you may not have the approproate granularity.
The big advantage of a DWH is that it is a simle dimensional model structure (for a kimball star schema any how), so as long as the first two are true, I would always get my data from the DWH.
g/l!
Sharing my thoughts on business case to pull from datawarehouse rather than directly from CRM system would be -
DWH can hold lot more indicators for Decision making and analysis at enterprise level across various systems than a single system like CRM. Therefore if you want to further your analysis on CRM data you can merge easily information from other system to perform better analytics/BI from DWH.
If you want to bring conformity across systems for seeing data of customer with single view. For example, you can have pipeline and sales information from CRM and then perform revenue calculation in another system for the same customer. Its possible that you want both sets of details in single place with same customer record linked to both measures.Then you might want to add Risk (Credit information) from external source into the same record in DWH. It brings true scability in terms of reporting and adhoc requests.
Remove the non-core work and dettach the CRM production system from BI and reporting (not talking of specific CRM reports). This has various advantages both terms of operations and convinence. You can google on this subject more to understand the benefits.
For now these are the only points that come to me. I will try adding more thoughts later.
P.S: I am more than happy to be corrected :-)
Current situation:
We have a BPMS (business process management suite) in place. There is increasing demand on historical and operative reports. The data model in the BPMS is not designed for historical queries. So we are analysing the possible solutions.
Solution in mind:
The idea is to push data on events in flow to an external database. Typical events in BPM are: new process instance was created, status changed, a step in the process was performed or status of the process instance was changed. Data vault is besides the star schema one of the interesting alternatives. Let’s assume there are two Hubs: PI (processitem instances) and OU (organisational unit) and a Link table LINK_PI_OU. Each time the process item is assigned to an organisational unit a new line will be added to the link table. The LOAD_DATE in the link table contains the datetime when this record was added. The record in the link table with the latest LOAD_DATE shows the current assignment of the process instance.
Question:
Let’ assume the business wants to know to whom all open process instances are currently assigned grouped by organisational unit.
How will a query look like for this report? Can it really be performant?
Or am I on the complete wrong way?
In general terms I didnt think that Data-Vault is intended to be an end user report layer or even a faux transactional system.
Im not completely clear on your archectiture, but in my understanding D-V is a historical repository that keeps all data for an enterprise that feeds a (Kimball/Inmon)datawarehouse. So in high level terms ...
Transaction systems => D-V => DWH => (cubes =>) users
This being the case, I wouldnt be posing queries to a Data Vault, instead I would write some ETL to populate a data warehouse and pose queries at the DWH.
The other view, I guess, is that you could build a set of views on top of the D-V, that would hide the structure from users, but I think I'm a bit of a purist and would go for a DWH.
As #Marcud D said, Data Vault is the model of Data Warehouse and usually when using DV modelling, you have to build data marts from DV for reporting purposes. I think that organizational unit should be modeled as Satellite table, not as Hub table. So, in any way, you should build a query to feed a specific data mart from DV model and then use it for reporting purposes.
I am developing a web-based application using Rails. I am debating between using a Graph Database, such as InfoGrid, or a Document Database, such as MongoDB.
My application will need to store both small sets of data, such as a URL, and very large sets of data, such as Virtual Machines. This data will be tied to a single user.
I am interested in learning about peoples experiences with either Graph or Document databases and why they would use either of the options.
Thank you
I don't feel enough experienced with both worlds to properly and fully answer your question, however I'm using a document database for some time and here are some personal hints.
The document databases are based on a concept of key,value, and static views and are pretty cool for finding a set of documents that have a particular value.
They don't conceptualize the relations between documents.
So if your software have to provide advanced "queries" where selection criteria act on several 'types of document' or if you simply need to perform a selection using several elements, the [key,value] concept is not appropriate.
There are also a number of other cases where document databases are inappropriate : presenting large datasets in "paged" tables, sortable on several columns is one of the cases where the performances are low and disk space usage is huge.
So in many cases you'll have to perform "server side" processing in order to pick up the pieces, and with rails, or any other ruby based framework, you might run into performance issues.
The graph database are based on the concept of tripplestore, meaning that they also conceptualize the relations between the entities.
The graph can be traversed using the relations (and entity roles), and might be more convenient when performing searches across relation-structured data.
As I have no experience with graph database, I'm not aware if the graph database can be easily queried/traversed with several criterias, however if an advised reader has such an information I'd really appreciate any examples of such queries/traversals.
I'm currently reading about InfoGrid and trying to figure if such databases could by handy in order to perform complex requests on a very large set of data, relations included ....
From what I can read, the InfoGrah should be considered as a "data federator" able to search/mine the data from several sources (Stores) wich can also be a NoSQL database such as Mongo.
Wich means that you could use a mongo store for updates and InfoGraph for data searching, and maybe spare a lot of cpu and disk when it comes to complex searches inside a nosql database.
Of course it might seem a little "overkill" if your app simply stores a large set of huge binary files in a database and all you need is to perform simple key queries and to retrieve the result. In that cas a nosql database such as mongo or couch would probably be handy.
Hope some of this helps ;)
When connecting related documents by edges, will you get a shallow or a deep graph? I think the answer to that question is important when deciding between graphdbs and documentdbs. See Square Pegs and Round Holes in the NOSQL World by Jim Webber for thoughts along these lines.