I am building a master database to store all relevant information about our customers. I am using Neo4j.
Below is a sample of our model. We have a Person, who can be registered in three of our mobile applications (App.01, App.02, App.03; we use the CPF as the key, which is similar to an SSN). In those apps the user can register with an email, represented by the Email entity. A user can also have multiple addresses, represented by the Address entity.
The question is:
As I am building master data, IMO, if someone queries the MDM database asking for the "best" information about a person, I would return, for example:
Name: John
Best email: email2 (because it has two apps using it)
Best address: addr1 (because it has two apps using it)
So I am going to build some heuristics to define what the "best" email and address are.
For this purpose, I have some options:
I could create an edge from John to email2 and to addr1, so it would be easy for a user of the MDM to get John's "best" address/email.
I could build a REST API endpoint and apply this heuristic at query time.
Does anyone have experience using a graph database for MDM, or designing an MDM database?
Is it a good approach?
This question is a follow-up to the question: Using Neo4j to build a Master Data Management
The graph data model is good to store your master data, however, your master data most likely will co-exist with operational and reference data in the form of dimensions.
If you decide to go with a graph model for your MDM, make sure that you have a well-defined semantic model for the core dimensions in MDM, usually:
products
customer
employees
Assets
Location
These core dimensions become attributes of your nodes.
Also, decide which MDM architecture style you are going to adopt. Some popular ones are:
The Registry - Graph fits very well with this style because your master data remains in the SOR (system of record) and the references can be represented in the graph very nicely.
Master Data Hub - Extra transformations are required to transpose your system of record from tabular form to the graph.
Master-Master - This style fits well with your MDM in the graph if you do not have too many legacy apps that depend on your MDM.
Approach 1 would add a lot of essentially redundant information (about 2N extra relationships, where N is the number of people), and also require more complex coding to handle changes to a person's apps. And, as always when information is stored redundantly, you would have to be especially careful that inconsistencies do not creep in. But, it should be faster when querying for the "best" contact info.
Approach 2 keeps the DB the same size, but requires a more complex and slower query to get the "best" contact info. However, changing a person's apps and contact info is straightforward.
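As a rough illustration of approach 2, the "best" value could be computed at read time by counting how many distinct apps use each email or address. This is only a sketch: the function name `best_contact`, the (app, value) pair layout, and the assignment of John's apps to emails and addresses are assumptions, and the "most apps wins" rule is just the example heuristic from the question.

```python
def best_contact(registrations):
    """Return the value used by the most distinct apps, or None if there are none.

    `registrations` is an iterable of (app, value) pairs, e.g. ("App.01", "email1").
    Ties are broken in favour of the value that was seen first.
    """
    apps_per_value = {}
    for app, value in registrations:
        # A set avoids counting the same app twice for the same value.
        apps_per_value.setdefault(value, set()).add(app)
    if not apps_per_value:
        return None
    return max(apps_per_value, key=lambda v: len(apps_per_value[v]))


# Hypothetical data for John, matching the counts given in the question.
john_emails = [("App.01", "email1"), ("App.02", "email2"), ("App.03", "email2")]
john_addresses = [("App.01", "addr1"), ("App.02", "addr1"), ("App.03", "addr2")]

best_email = best_contact(john_emails)
best_address = best_contact(john_addresses)
```

With approach 1, the same function could instead run whenever a registration changes, and its result would be stored as the extra "best" relationship.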
To decide which approach to use, you should consider whether DB size is an issue, and also look at your use cases and how frequently they will be performed.
Here is a simple heuristic if DB size is not an issue. Suppose G is the frequency at which you need to get a person's "best" contact info, and M is the frequency at which you need to modify a person's apps or contact info. You would pick approach 1 if the value of G/M exceeds some threshold value, K, which you would have to choose based on the trade-offs described above.
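The G/M rule can be written down as a very small function. This is a sketch only: the function name, the per-day units, and the handling of M = 0 are assumptions, and K still has to be chosen for your own workload.

```python
def pick_approach(reads_per_day, writes_per_day, k):
    """Return 1 (store "best" edges) if G/M exceeds K, otherwise 2 (compute at query time).

    reads_per_day  -- G, how often the "best" contact info is requested
    writes_per_day -- M, how often a person's apps or contact info change
    k              -- threshold you pick for your own system
    """
    if writes_per_day == 0:
        # Assumption: data that never changes is cheap to keep precomputed.
        return 1
    return 1 if reads_per_day / writes_per_day > k else 2
```

For example, `pick_approach(1000, 10, 50)` gives approach 1 because G/M is 100, while `pick_approach(100, 10, 50)` gives approach 2 because G/M is only 10.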
This seems like a pretty common use-case. Let's say we have sensitive PII that we want to protect, such as SSNs. We mask that data using dynamic data masking in Snowflake. Now we have an engineer that is writing data transformations, and they need to join two tables using SSN. They don't have clearance to view the SSNs, but they can view the other information on both tables. I want the engineer to be able to join the two tables, and see all the combined unsecured data, while keeping the SSN secret from the engineer. I'm really not sure why Snowflake doesn't use real values for joins behind the scenes while refusing to return them in results. Is there a workaround?
One idea is to make the masking policy return a hash of the initial value. That has a couple of limitations. First, it is explicitly warned against in the Snowflake docs. Second, it requires runtime hashing of all the values, which slows down query execution seemingly needlessly. Third, there is the issue of hash collisions which could break joins. This could result in an engineer spending days working to track down a bug in their code, only to realize that the extra rows in their dataset are the result of a hash collision.
Another potential solution is using an external tokenization provider (docs). I don't understand this option well, but it appears that this would mean that I would need to store the actual values and their tokenized form with a third party service, then make an API call each time I wanted to use the values in a query. That seems less than ideal. I'd rather the solution be contained within Snowflake.
I'd love to hear any thoughts, thanks in advance.
If you care about database integrity and want to avoid errors, don't use SSNs as identifiers.
An SSN can be a property of a person, but don't use it as their primary key.
As the United States Social Security Administration says:
A 1990 OIG, HHS study indicated that 45% of organizations, both public and private, using SSNs make no effort to verify SSN accuracy. This leads to the real possibility that transfers of data from one organization to another could be inaccurate; computer matching of data between different organizations could be invalid; and innocent persons could be subjected to unwarranted intrusions into their privacy or improper changes in their benefits or services or even misidentified with serious results.
Also:
The SSN is the single most widely used record identifier for both government and the private sector, exerting a broad influence on the lives of most Americans. However, by itself, it is not a personal identifier because it lacks systematic assignment to every person and the means to authenticate a person's identity.
https://www.ssa.gov/history/reports/ssnreportc2.html
Instead you could create a unique id for each person within your database, and use that key for joins.
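A minimal sketch of that idea, in plain Python rather than Snowflake SQL: each distinct SSN is mapped once to a random surrogate id, the mapping stays in a restricted lookup, and the tables the engineer sees carry only the surrogate. All names here (`assign_surrogate_keys`, `join_on_surrogate`, `person_id`, the sample rows and the obviously fake SSNs) are hypothetical.

```python
import uuid


def assign_surrogate_keys(ssns):
    """Map each distinct SSN to a random surrogate id.

    The returned mapping is the only place the SSN appears; in practice it
    would live in a table that engineers cannot read.
    """
    return {ssn: uuid.uuid4().hex for ssn in dict.fromkeys(ssns)}


def join_on_surrogate(left, right):
    """Inner-join two lists of row dicts on their 'person_id' column."""
    index = {}
    for row in right:
        index.setdefault(row["person_id"], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l["person_id"], [])]


# Restricted step: performed once by someone cleared to see the SSNs.
key_map = assign_surrogate_keys(["000-00-0001", "000-00-0002"])

# Engineer-visible tables: no SSN column, only the surrogate key.
accounts = [
    {"person_id": key_map["000-00-0001"], "plan": "gold"},
    {"person_id": key_map["000-00-0002"], "plan": "basic"},
]
claims = [
    {"person_id": key_map["000-00-0001"], "claim": 120},
    {"person_id": key_map["000-00-0001"], "claim": 80},
]

joined = join_on_surrogate(accounts, claims)
```

Because the surrogate is random rather than derived from the SSN, it cannot be reversed or collide the way a hash could, at the cost of maintaining the lookup.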
I am trying to set up a data warehouse for a service where each customer has their own database with a unique schema. How do I go about setting up a warehouse so that each customer gets their own semantic layer / relational model automatically (since we, centrally, do not know what is in each database), so that each customer can easily report on their data? Is there any automatic process we can follow? Am I missing something?
It depends on whether you want a consolidated view of the data, or if each customer's data is to remain segregated.
If consolidation is the objective (and there are huge benefits for a multi-tenant SaaS vendor in having a consolidated overview of customer data) then Nithin B's suggestion is good.
If separate warehouses are required, then you'll need to think about how to optimise your costs. The two biggest components will be ETL/ELT, and database hosting.
The fastest way to ETL/ELT is data warehouse automation. You'll find a good list of vendors on our web site (http://ajilius.com/competitors). Look for a solution that will give you the flexibility to meet your deployment options (cloud and/or on-premise), as well as the geographic reach you'll need for accessing customer data.
Will you be hosting your own databases or in the cloud? How much data will each tenant require? A good starting point would be PostgreSQL or SQL Server (SMP), and Ajilius gives you the flexibility to instantly migrate to MPP platforms if your needs outgrow those platforms.
There are many ways to address this.
Land all the tables in a Landing area in different schemas.
Stage the data into appropriate staging tables for dim and fact loads.
Create a dim table to identify the Customer Area. For example: Dim_Source
Load the data into the fact tables. Any specific customers can filter the data from the facts by using the Dim_Source values.
This design would help overall Enterprise reporting as well.
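The steps above could be sketched with an in-memory SQLite database as follows. Only `Dim_Source` comes from the answer; the fact table `fact_sales`, its columns, and the customer names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dim_Source identifies which customer (tenant) a fact row came from.
    CREATE TABLE dim_source (source_key INTEGER PRIMARY KEY, customer_name TEXT);
    -- Hypothetical shared fact table, tagged with the source key.
    CREATE TABLE fact_sales (source_key INTEGER REFERENCES dim_source, amount REAL);

    INSERT INTO dim_source VALUES (1, 'customer_a'), (2, 'customer_b');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")


def total_for(customer_name):
    """Per-customer report: filter the shared fact table through Dim_Source."""
    row = conn.execute(
        """SELECT COALESCE(SUM(f.amount), 0)
           FROM fact_sales f
           JOIN dim_source d ON d.source_key = f.source_key
           WHERE d.customer_name = ?""",
        (customer_name,),
    ).fetchone()
    return row[0]


def totals_by_customer():
    """Enterprise report: the same facts, grouped across all customers."""
    rows = conn.execute(
        """SELECT d.customer_name, SUM(f.amount)
           FROM fact_sales f
           JOIN dim_source d ON d.source_key = f.source_key
           GROUP BY d.customer_name"""
    ).fetchall()
    return dict(rows)
```

Each customer's reporting view would apply the `Dim_Source` filter, while the central team can query the unfiltered facts.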
Hope that helps.
I would start with a Kimball BUS Matrix.
Cheers
Nithin
We are building a data warehouse by consuming file feeds from different sources.
The file feeds are all denormalized/flattened (in the Transactions (fact) file, the Account attributes keep repeating in all the records).
Also, the account information changes often (the feed gives an as-is version of the data).
What is the best practice in this situation? Should the data warehouse have a star schema model (with the Account information as a slowly changing dimension and a Transaction fact)? Will re-normalizing make the ETL process complex?
In my company, whenever some input is denormalized, we normalize it and from there we proceed with loading our schemas (whatever your schema is).
The reason is that, being de-normalized, those inputs are difficult to check for inconsistencies (data quality). Apart from that, conforming all of your inputs to some standard allows your code to be more maintainable.
In our case, following the Kimball practices has been a total success: fact tables, slowly changing dimensions and all that jazz.
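To make the slowly changing dimension part concrete, here is a small sketch of a Type 2 load from a flattened feed. It is not our production code: the column names (`account_id`, `account_name`, `valid_from`, `valid_to`) and the function `apply_scd2` are hypothetical, and a real load would compare every tracked attribute rather than just the name.

```python
from datetime import date


def apply_scd2(dimension, feed_rows, load_date):
    """Update a Type 2 account dimension (a list of dicts) from flattened feed rows.

    Unchanged accounts are left alone; changed accounts get their current row
    closed (valid_to set) and a new current row appended.
    """
    current = {r["account_id"]: r for r in dimension if r["valid_to"] is None}
    # Normalize: collapse the repeated account attributes to one entry per account.
    latest = {row["account_id"]: row["account_name"] for row in feed_rows}
    for account_id, name in latest.items():
        existing = current.get(account_id)
        if existing is not None and existing["account_name"] == name:
            continue  # nothing changed for this account
        if existing is not None:
            existing["valid_to"] = load_date  # close the old version
        dimension.append({
            "account_id": account_id,
            "account_name": name,
            "valid_from": load_date,
            "valid_to": None,  # None marks the current version
        })
    return dimension


dim_account = []
day1 = [
    {"account_id": "A1", "account_name": "Acme", "amount": 10},
    {"account_id": "A1", "account_name": "Acme", "amount": 20},
]
day2 = [{"account_id": "A1", "account_name": "Acme Ltd", "amount": 5}]

apply_scd2(dim_account, day1, date(2024, 1, 1))
apply_scd2(dim_account, day2, date(2024, 1, 2))
```

After the two loads the dimension holds two versions of account A1, and the transaction rows can be loaded into the fact table against whichever version was current on their date.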
Hard to answer without such details as daily volume, latency threshold, resource availability, reporting requirements, platform and tool constraints, etc. A traditional ODS, where you import into and store a normalized structure before creating data marts from that, is great but not optimal for big data or real time analysis. A more modern approach, using a data lake in Hadoop or a virtualization layer, may not be feasible for your organization.
General Opinions:
1) Re-normalizing does seem unnecessary from both a complexity and a performance standpoint unless you have some ongoing use for the normalized data store.
2) Whether or not you build a traditional star schema or a graph or whatever should be governed by the reporting requirements and tools, not the source data format. Those sources will change, btw.
3) "Transaction" does not sound like a fact to me. A purchase transaction, e.g., could feed a sales fact, an accumulating snapshot for a sales cycle, a funnel conversion fact, etc.
4) I'm not sure whether "Account" is a customer, or a balance account such as a credit card, online payment service, bank account, etc. They imply different SCD types. In any case, Google will be sufficient to get plenty of information about building those dimensions.
I am building an ad analytics tool which assumes a data structure like this:
Account
Campaign
Keyword
Conversion
I have a lot of information about individual conversion events, which can be tied back to the cost data of each campaign, keyword, ad group, etc. In SQL, you could consider each property a sort of foreign key (text-based) to the campaign, keyword or ad in a particular account, but that's inefficient and slow. It doesn't sound like a great idea to make campaign_id, keyword_id, etc. fields and populate them either, because I want the analytics to be available in near-real time.
What would be a good way to model this with MongoDB?
Assuming a very high volume of conversion events (millions per day or more), a storage engine alone (MongoDB or anything else) won't help you. What you need is the ability to run map-reduce jobs on the data in order to calculate the analytics. You can scale-out your cluster as necessary to achieve near-real time performance.
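To show the shape of such a job, here is a toy single-process sketch of the map and reduce steps. It is not Hadoop code; the event fields (`campaign`, `keyword`, `revenue`) and function names are assumptions, and a real cluster would run the same two steps in parallel across many nodes.

```python
from collections import defaultdict


def map_event(event):
    """Map step: emit one ((campaign, keyword), (count, revenue)) pair per conversion."""
    yield (event["campaign"], event["keyword"]), (1, event["revenue"])


def reduce_pairs(pairs):
    """Reduce step: sum conversion counts and revenue for each key."""
    totals = defaultdict(lambda: (0, 0.0))
    for key, (count, revenue) in pairs:
        total_count, total_revenue = totals[key]
        totals[key] = (total_count + count, total_revenue + revenue)
    return dict(totals)


def run_job(events):
    """Chain the two steps over an iterable of conversion events."""
    return reduce_pairs(pair for event in events for pair in map_event(event))


events = [
    {"campaign": "spring", "keyword": "shoes", "revenue": 10.0},
    {"campaign": "spring", "keyword": "shoes", "revenue": 5.0},
    {"campaign": "summer", "keyword": "boots", "revenue": 2.5},
]
summary = run_job(events)
```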
The free/open-source options that I can suggest are Hadoop (and probably HBase and Hive) or Riak.
There are other options - I'm only suggesting these two because I've personal experience with them in a high scale production environment. We're currently using Hadoop to power an analytics system processing billions of events per day.
If you're not into rolling your own and are able and willing to pay (a lot!) then look at GreenPlum and Vertica.
I'll be happy to share more information on potential solution designs - but I'll need more data on what you're trying to achieve - scale, use cases etc.
I'm not sure that MongoDB is really the right choice for something like this, since MongoDB is really more about storing less well-structured (or more complex) documents rather than hierarchical records like this one. However, if you are going the MongoDB route, then you can just use the account, campaign and keyword tags directly. There is no substantive benefit to abstracting these into meaningless keys in MongoDB. You can index these fields directly in MongoDB.
I don't know what your volumes are going to be and what other factors are affecting your technology choices. However, assuming that your accounts, campaigns and keywords don't change that frequently, you could do this with a plain old RDBMS (SQL or Oracle etc.) using lookup tables for these determinants, where the foreign keys are meaningless integers. If you're doing live analytics you could adopt a star schema and keep all of the numeric FKs on the base fact table (Conversion), so that you aren't joining a chain of four tables to get the whole picture. Instead you'd be doing three one-hop joins. This would allow you to summarize at any level with only a single join.
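A small sketch of that layout, using an in-memory SQLite database. The table and column names and the sample rows are assumptions; the point is only that the Conversion fact carries all three integer keys, so a summary at any level needs a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup (dimension) tables keyed by meaningless integers.
    CREATE TABLE account  (account_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE campaign (campaign_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE keyword  (keyword_id  INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: every numeric FK sits directly on the conversion row.
    CREATE TABLE conversion (
        account_id INTEGER, campaign_id INTEGER, keyword_id INTEGER, revenue REAL
    );

    INSERT INTO account  VALUES (1, 'acme');
    INSERT INTO campaign VALUES (1, 'spring'), (2, 'summer');
    INSERT INTO keyword  VALUES (1, 'shoes'), (2, 'boots');
    INSERT INTO conversion VALUES (1, 1, 1, 10.0), (1, 1, 2, 5.0), (1, 2, 1, 2.5);
""")

LEVELS = {"account", "campaign", "keyword"}


def revenue_by(level):
    """Summarize revenue at one level with a single one-hop join."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    # `level` is checked against a fixed whitelist before being formatted in.
    sql = (
        f"SELECT d.name, SUM(c.revenue) FROM conversion c "
        f"JOIN {level} d ON d.{level}_id = c.{level}_id "
        f"GROUP BY d.name ORDER BY d.name"
    )
    return conn.execute(sql).fetchall()
```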
Current situation:
We have a BPMS (business process management suite) in place. There is increasing demand on historical and operative reports. The data model in the BPMS is not designed for historical queries. So we are analysing the possible solutions.
Solution in mind:
The idea is to push data on events in the flow to an external database. Typical events in BPM are: a new process instance was created, a step in the process was performed, or the status of the process instance was changed. Besides the star schema, Data Vault is one of the interesting alternatives. Let's assume there are two Hubs, PI (process item instances) and OU (organisational unit), and a Link table LINK_PI_OU. Each time the process item is assigned to an organisational unit, a new row is added to the link table. The LOAD_DATE in the link table contains the datetime when this record was added. The record in the link table with the latest LOAD_DATE shows the current assignment of the process instance.
Question:
Let's assume the business wants to know to whom all open process instances are currently assigned, grouped by organisational unit.
What would a query for this report look like? Can it really be performant?
Or am I completely on the wrong track?
In general terms, I didn't think that Data Vault was intended to be an end-user reporting layer or even a faux transactional system.
I'm not completely clear on your architecture, but in my understanding D-V is a historical repository that keeps all data for an enterprise and feeds a (Kimball/Inmon) data warehouse. So in high-level terms ...
Transaction systems => D-V => DWH => (cubes =>) users
This being the case, I wouldn't be posing queries to a Data Vault. Instead I would write some ETL to populate a data warehouse and pose queries at the DWH.
The other view, I guess, is that you could build a set of views on top of the D-V, that would hide the structure from users, but I think I'm a bit of a purist and would go for a DWH.
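For reference, a query of the kind asked about (whether wrapped in such a view or used in the ETL that feeds the DWH) might look like the sketch below, run here against an in-memory SQLite database. The hub and link names follow the question; the satellite `sat_pi_status`, all column names, and the sample rows are assumptions, since the question does not say where the open/closed status is stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_pi (pi_key INTEGER PRIMARY KEY, pi_business_key TEXT);
    CREATE TABLE hub_ou (ou_key INTEGER PRIMARY KEY, ou_business_key TEXT);
    CREATE TABLE link_pi_ou (pi_key INTEGER, ou_key INTEGER, load_date TEXT);
    -- Hypothetical satellite holding the status history of each process instance.
    CREATE TABLE sat_pi_status (pi_key INTEGER, status TEXT, load_date TEXT);
    -- Indexes that let the "latest row per instance" lookups avoid full scans.
    CREATE INDEX ix_link_pi ON link_pi_ou (pi_key, load_date);
    CREATE INDEX ix_sat_pi  ON sat_pi_status (pi_key, load_date);

    INSERT INTO hub_pi VALUES (1, 'PI-1'), (2, 'PI-2'), (3, 'PI-3');
    INSERT INTO hub_ou VALUES (1, 'Sales'), (2, 'Support');
    INSERT INTO link_pi_ou VALUES
        (1, 1, '2024-01-01'), (1, 2, '2024-01-05'),  -- PI-1 moved to Support
        (2, 1, '2024-01-02'),
        (3, 2, '2024-01-03');
    INSERT INTO sat_pi_status VALUES
        (1, 'OPEN', '2024-01-01'),
        (2, 'OPEN', '2024-01-02'),
        (3, 'OPEN', '2024-01-03'), (3, 'CLOSED', '2024-01-04');
""")

OPEN_BY_OU = """
    SELECT ou.ou_business_key, COUNT(*) AS open_instances
    FROM link_pi_ou l
    JOIN hub_ou ou ON ou.ou_key = l.ou_key
    JOIN sat_pi_status s ON s.pi_key = l.pi_key
    WHERE l.load_date = (SELECT MAX(l2.load_date) FROM link_pi_ou l2
                         WHERE l2.pi_key = l.pi_key)      -- current assignment
      AND s.load_date = (SELECT MAX(s2.load_date) FROM sat_pi_status s2
                         WHERE s2.pi_key = s.pi_key)      -- current status
      AND s.status = 'OPEN'
    GROUP BY ou.ou_business_key
    ORDER BY ou.ou_business_key
"""


def open_instances_by_ou():
    """Count currently open process instances per organisational unit."""
    return conn.execute(OPEN_BY_OU).fetchall()
```

The two "latest LOAD_DATE" subqueries are what make this pattern awkward to run repeatedly against the vault itself, which is part of why a mart with a ready-made current-assignment table is usually preferred for reporting.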
As @Marcud D said, Data Vault is a way of modelling the data warehouse, and when using DV modelling you usually have to build data marts from the DV for reporting purposes. I think that the organizational unit should be modeled as a Satellite table, not as a Hub table. Either way, you should build a query to feed a specific data mart from the DV model and then use that mart for reporting.