A question in regards to Dimensional Modelling and Role Playing.
We have an Address dimension which is role-playing. We receive addresses from different sources, including CRM systems. Addresses can also be of different types, such as the address of a company, of an individual, etc. So, from the role-playing Address dimension, a single address could be tagged as a company address in one fact and a billing address in another.
There are different fact tables, and they have different keys that hold address data. Fact_Sales, for example, would have keys such as Customer_Address_Key and Company_Head_Office_Address_Key. So I believe we are, in effect, role-playing the addresses in these facts.
Question:
Our lead Data Architect has a concern around this.
• We are capturing a lot of addresses from a number of systems. How would we identify where these addresses came from, and what type of addresses they are, without going to the fact tables?
I would still suggest going through the facts, but I would like to consult the wider community before putting my feet firmly on the ground.
Is there a better way to do this, perhaps a separate table which defines the combination of Address_Key, Address_Type_Key and Source_Key?
Please let me know if you need any further clarification or pictures etc.
Cheers
Nithin
It sounds like, in the situation you have, you should just include columns for the type of address and the source of the address in the address dimension itself, so it stands alone and you don't have to go via a fact to know what kind of thing it is. You wouldn't need a separate table with keys as you mentioned; the data can safely be denormalised in the dimension.
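For illustration only, here's a minimal sketch of what that denormalised dimension could look like; all table and column names below are placeholders rather than anything from your model:

```sql
-- Hypothetical role-playing Address dimension with the type and source
-- of each address denormalised in as plain attributes.
CREATE TABLE Dim_Address (
    Address_Key     INT           NOT NULL PRIMARY KEY,
    Address_Line_1  VARCHAR(100),
    Address_Line_2  VARCHAR(100),
    City            VARCHAR(50),
    State           VARCHAR(50),
    Postal_Code     VARCHAR(20),
    Address_Type    VARCHAR(30),  -- e.g. 'Company', 'Individual', 'Billing'
    Source_System   VARCHAR(30)   -- e.g. 'CRM', 'ERP'
);
```

Any fact can then reference Address_Key through whichever role-playing key it needs (Customer_Address_Key, Company_Head_Office_Address_Key, and so on), and the type and source are visible without touching the facts.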
As an aside:
Although many people do have a separate address table, the approach from the Kimball Group would not be to have an 'address' or location dimension as a multi-purpose dimension that stands alone; it provides part of what describes something else (like a company, a customer, or even a 'delivery location'). Instead you'd have the dimension (e.g. Customer), and within that dimension you'd have a number of address fields, named appropriately (CustomerAddress1, CustomerAddress2, CustomerCity). You may choose to administer the address centrally for convenience behind the scenes, with the other dimensions formed by means of views or further ETL, but in the presentation of the star schema the address table would not be seen separately. The addresses are still conformed in that they're called the same thing and mean the same thing.
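As a rough sketch of that 'administer centrally, present embedded' idea (the staging table names here are assumptions, not anyone's standard), the presented Customer dimension could simply be a view over the centrally maintained address data:

```sql
-- Hypothetical: address data is maintained once behind the scenes,
-- but presented as flattened attributes of the Customer dimension.
CREATE VIEW Dim_Customer AS
SELECT
    c.Customer_Key,
    c.Customer_Name,
    a.Address_Line_1 AS CustomerAddress1,
    a.Address_Line_2 AS CustomerAddress2,
    a.City           AS CustomerCity
FROM Stg_Customer c
JOIN Stg_Address  a ON a.Address_Id = c.Address_Id;
```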
However, plenty of people do go with a separate Address table as you've done.
It is very reasonable to include source as an attribute of the dimension. The bigger question is how you select the "current" address for a customer when you have multiple sources. That is where things will get tricky.
You need "Current Customer Address" to mean the same thing throughout your business, regardless of the source from which it was captured. I would refer to this as a conformed dimension: you need to conform all of your address sources to the same structure so you can use them as a single dimension.
In the large majority of your facts, the source of the address is irrelevant. You only need to know that it is the current address. You may have a smaller model that provides analysis of the source of the customer address.
The hard part is deciding which source is most trustworthy when the address appears in multiple sources. You need to consider both the source and the date of the last update. In other words, is the primary source still preferred when a less trustworthy source has a more recent update?
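One way to express that kind of survivorship rule is a ranking query. The sketch below is generic SQL, and the candidate table name plus the Source_Priority and Last_Updated columns are assumptions about how you might stage the competing addresses:

```sql
-- Pick one "current" address per customer by ranking the candidates:
-- most trusted source first, most recent update as the tie-breaker.
SELECT Customer_Key, Address_Key
FROM (
    SELECT
        Customer_Key,
        Address_Key,
        ROW_NUMBER() OVER (
            PARTITION BY Customer_Key
            ORDER BY Source_Priority ASC,   -- 1 = most trusted source
                     Last_Updated   DESC    -- recency breaks ties
        ) AS rn
    FROM Customer_Address_Candidates
) ranked
WHERE rn = 1;
```

Flipping the ORDER BY (recency first, source priority as the tie-breaker) expresses the opposite policy, which is exactly the decision described above.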
Type is usually just an attribute of the address. However, if your address can be used for multiple things (physical, shipping, billing, etc.), that may need to be defined by the role-playing relationship. For other analytics on addresses, you can break city/state and zip into separate dimensions if you need to break things down by geographic location. I would recommend that City and State be treated as a single entity: if you treat City as separate from State, you'll get funny results when slicing by cities that exist in more than one state.
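To see why City on its own misbehaves, compare slicing by City alone with slicing by the combined City/State value. Sales_Amount and Dim_Address here are illustrative names, with Fact_Sales and Customer_Address_Key taken from the question:

```sql
-- Slicing by City alone merges, e.g., Springfield, IL with Springfield, MA.
SELECT a.City, SUM(f.Sales_Amount) AS Total_Sales
FROM Fact_Sales f
JOIN Dim_Address a ON a.Address_Key = f.Customer_Address_Key
GROUP BY a.City;

-- Treating City and State as a single entity keeps them apart.
SELECT a.City, a.State, SUM(f.Sales_Amount) AS Total_Sales
FROM Fact_Sales f
JOIN Dim_Address a ON a.Address_Key = f.Customer_Address_Key
GROUP BY a.City, a.State;
```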
This seems like a pretty common use-case. Let's say we have sensitive PII that we want to protect, such as SSNs. We mask that data using dynamic data masking in Snowflake. Now we have an engineer that is writing data transformations, and they need to join two tables using SSN. They don't have clearance to view the SSNs, but they can view the other information on both tables. I want the engineer to be able to join the two tables, and see all the combined unsecured data, while keeping the SSN secret from the engineer. I'm really not sure why Snowflake doesn't use real values for joins behind the scenes while refusing to return them in results. Is there a workaround?
One idea is to make the masking policy return a hash of the initial value. That has a couple of limitations. First, it is explicitly warned against in the Snowflake docs. Second, it requires runtime hashing of all the values, which slows down query execution seemingly needlessly. Third, there is the issue of hash collisions which could break joins. This could result in an engineer spending days working to track down a bug in their code, only to realize that the extra rows in their dataset are the result of a hash collision.
Another potential solution is using an external tokenization provider (docs). I don't understand this option well, but it appears that this would mean that I would need to store the actual values and their tokenized form with a third party service, then make an API call each time I wanted to use the values in a query. That seems less than ideal. I'd rather the solution be contained within Snowflake.
I'd love to hear any thoughts, thanks in advance.
If you care about database integrity and avoiding errors: don't use SSNs as identifiers.
An SSN can be a property of a person, but don't use it as their primary key.
As the United States Social Security Administration says:
A 1990 OIG, HHS study indicated that 45% of organizations, both public and private, using SSNs make no effort to verify SSN accuracy. This leads to the real possibility that transfers of data from one organization to another could be inaccurate; computer matching of data between different organizations could be invalid; and innocent persons could be subjected to unwarranted intrusions into their privacy or improper changes in their benefits or services or even misidentified with serious results.
Also:
The SSN is the single most widely used record identifier for both government and the private sector, exerting a broad influence on the lives of most Americans. However, by itself, it is not a personal identifier because it lacks systematic assignment to every person and the means to authenticate a person's identity.
https://www.ssa.gov/history/reports/ssnreportc2.html
Instead you could create a unique id for each person within your database, and use that key for joins.
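A minimal sketch of that idea, in Snowflake-flavoured SQL with hypothetical table and column names: the SSN stays a masked attribute, and the join runs on a surrogate key that carries no sensitive meaning.

```sql
-- Hypothetical tables. person_id is an internal surrogate key
-- (populated by a sequence or similar), safe to expose and join on.
CREATE TABLE person (
    person_id  NUMBER       NOT NULL PRIMARY KEY,
    ssn        VARCHAR(11),              -- covered by the masking policy
    full_name  VARCHAR(200)
);

CREATE TABLE tax_record (
    record_id  NUMBER       NOT NULL PRIMARY KEY,
    person_id  NUMBER       NOT NULL,    -- refers to person.person_id
    tax_year   INT,
    amount     NUMBER(12,2)
);

-- The engineer can join and see the combined unsecured data
-- without ever selecting the SSN column.
SELECT p.full_name, t.tax_year, t.amount
FROM person p
JOIN tax_record t ON t.person_id = p.person_id;
```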
With a normal 'graph database', the data is broken up into nodes and edges, and there isn't much of a restriction/schema on the connections. That makes it great for modeling straightforward graphs where the relationships are relatively consistent -- movies with cast and crew; computer networks with IPs and devices; social networks with users and connections; etc.
Are there any graph-like databases that can be more specialized? For example, to be able to model something like an electrical circuit, where each component has a sort of 'schema' or well-defined inputs and outputs -- i.e., a Resistor has two connections and various properties; a Transistor has three connections and various properties; etc.
I'm not asking about particular circuit simulators, such as https://www.falstad.com/circuit/circuitjs.html, but more about whether it's possible in any graph (or pseudo-graph) databases to model and enforce very specific, well-defined relationships in a network, such as circuit design.
Definitely possible.
I've been working on this problem with Neo4j, and Restagraph is the result. It provides a REST API that enforces a schema on any updates to the database, and I've packaged it as a Docker image.
I haven't really promoted it so far, because it's only recently been mature enough for my own use, and I really need to improve the documentation. If you try it out, though, I'd love to hear any feedback you have.
TLDR: in general yes, but it depends.
This is a really broad question, so let me break it down.
It's a bit of an exaggeration to talk about all graph databases (which are not as standardized as SQL databases, which in turn are not very standardized themselves), so take this answer with a grain of salt: yes, that is possible.
As in SQL databases, you can usually set up constraints to be checked before any change to the data is persisted.
Most graph databases incorporate something along the lines of a "type", similar to what a table represents in SQL databases. Some allow you to constrain relationships so that they may only target specific types, so you could restrict relationships, e.g. between a node using a CAN bus and one using an I2C bus, to those specific types.
If a database does not provide these mechanisms, it's usually possible to constrain relationships based on the existence of specific keys and values in the model. To take an example other than your circuit one: imagine a node-based system with typed inputs and outputs, where an int-based output can only be connected to an int-based input, a float-based output only to a float-based input, etc. You could then add an output_type and an input_type field to the nodes and constrain relationships between those values.
As soon as you add the ability to write procedures (comparable to stored procedures in SQL), you can enforce very complex data integrity constraints.
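To make the comparison concrete, here is roughly how "only certain things may be connected" looks in relational terms; the circuit tables are purely hypothetical, and graph databases with schema support let you express something equivalent on node labels and relationship types:

```sql
-- Components are typed; pins belong to a component and carry a signal type.
CREATE TABLE component (
    component_id   INT PRIMARY KEY,
    component_type VARCHAR(20) NOT NULL
        CHECK (component_type IN ('RESISTOR', 'TRANSISTOR'))
);

CREATE TABLE pin (
    pin_id       INT PRIMARY KEY,
    component_id INT NOT NULL REFERENCES component(component_id),
    signal_type  VARCHAR(10) NOT NULL   -- e.g. 'INT' or 'FLOAT', as in the node-based example
);

-- An edge may only connect existing pins; a trigger or stored procedure
-- could additionally enforce that both ends share the same signal_type.
CREATE TABLE connection (
    from_pin INT NOT NULL REFERENCES pin(pin_id),
    to_pin   INT NOT NULL REFERENCES pin(pin_id),
    PRIMARY KEY (from_pin, to_pin)
);
```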
So, while it is possible, the question is whether you should.
How much logic you actually want to put into your database is a decades-long, heated argument. At some point in your application architecture, you will have to check the validity of the data you are handling. Handling data consistency in the database itself solves a lot of the problems with race conditions, and with the performance cost of multiple round trips between the application and the database, which occur when the consistency checks are done in the application layer.
Putting a lot of your logic into the database severely limits your ability to switch databases ("vendor lock-in"), might lead to code duplication between your application layer and your database, and sprays your logic between two (or more) layers of your architecture (which makes it harder to find bugs, introduces temporal coupling, and might re-introduce race conditions and performance problems where you have to use transactions again).
My personal take is along the lines of Steve Wozniak - see your database as another service. If that service can provide you with everything you need to ensure data integrity, it might be a good idea to just use the database directly. But if this increases the problems I mentioned before, you might be better off putting a layer between your database and your business logic.
Objectivity/DB is an object/graph database that uses a schema. You can absolutely do what you are proposing. It supports complex object definitions, including type inheritance, and it has a full graph/navigational query language similar to Cypher. www.objectivity.com
I am building a master database to store all relevant information about our customers. I am using Neo4j.
Below is a sample of our model. We have Person, which can be registered in 3 of our mobile applications (App.01, App.02, App.03; we use a CPF key, which is like an SSN). In those apps the user can be registered with an email, which is represented by the Email entity. These users can have multiple addresses, represented by the Address entity.
The question is:
As I am building master data, IMO, if someone queries the MDM database asking for the "best" information about a person, I would return, for example:
Name: John
Best email: email2 (because it has two apps using it)
Best address: addr1 (because it has two apps using it)
So I am going to build some heuristics to define the "best" email and address.
For this purpose, I have some options:
I could create an edge from John to email2 and to addr1, so it would be easy for a user of the MDM to get the "best" address/email for John.
I could build a REST API endpoint and apply this heuristic at query time.
Does anyone have experience using graph database or design MDM database?
Is it a good approach?
This question is a complement to the question: Using Neo4j to build a Master Data Management
The graph data model is good for storing your master data; however, your master data will most likely co-exist with operational and reference data in the form of dimensions.
If you decide to go with a graph model for your MDM, make sure that you have a well-defined semantic model for the core dimensions in MDM, usually:
Products
Customers
Employees
Assets
Locations
These core dimensions become attributes of your nodes.
Also, decide which MDM architecture style you are going to adopt; some popular ones are:
The Registry - Graph fits very well with this style because your master data remains in the system of record (SOR) and the references can be represented in the graph very nicely.
Master Data Hub - Extra transformations are required to transpose your system of record from tabular form to the graph.
Master-Master - This style fits well with a graph-based MDM if you do not have too many legacy apps that depend on your MDM.
Approach 1 would add a lot of essentially redundant information (about 2N extra relationships, where N is the number of people), and also require more complex coding to handle changes to a person's apps. And, as always when information is stored redundantly, you would have to be especially careful that inconsistencies do not creep in. But, it should be faster when querying for the "best" contact info.
Approach 2 keeps the DB the same size, but requires a more complex and slower query to get the "best" contact info. However, changing a person's apps and contact info is straightforward.
To decide which approach to use, you should consider whether DB size is an issue, and also look at your use cases and how frequently they will be performed.
Here is a simple heuristic if DB size is not an issue. Suppose G is the frequency at which you need to get a person's "best" contact info, and M is the frequency at which you need to modify a person's apps or contact info. You would pick approach 1 if the value of G/M exceeds some threshold value, K, that you would have to decide on, taking into consideration the above considerations.
I'm working on a financial data mart structure, and I'm having some doubts about the best approach.
The source system database, Dynamics AX 2009, has three tables for customer transactions:
One table for open transactions, where the customer still needs to pay for the service/product;
One table for settled transactions, which holds what the customer has already paid;
Finally, a table that has all customer transactions: it holds transactions from open to settled, and also other transactions such as customer-to-bank or ledger account transactions.
I thought of two options. The first is to maintain a fact table for each of the three tables: a fact for open transactions, a fact for all customer transactions, and a fact for settled transactions.
The second is to create a single fact to hold all transactions; to do so, I would have to do a full join on the three tables.
I'm not sure about either approach, as the first seems to just copy the tables from production and create the proper dimensions.
With the second, I would create a massive fact table where data would constantly change, as open transactions are deleted in the source system when they are settled.
Another doubt: should I create a fact with an SCD (slowly changing dimension) structure to maintain historical data (start date, end date, flag)?
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: they help you to categorize your measures and get a specific view of them. This approach will also open up many more possibilities for analysis with your cube. With separate cubes for open/settled/etc. transactions, it would be harder or even impossible to compare this data.
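As a sketch of that shape (the names are made up, not taken from Dynamics AX), a single transaction fact plus a small status dimension could look like this:

```sql
-- Small dimension distinguishing the kinds of customer transaction.
CREATE TABLE Dim_Transaction_Status (
    Status_Key  INT PRIMARY KEY,
    Status_Name VARCHAR(20)   -- 'Open', 'Settled', 'Other'
);

-- One fact table holding all customer transactions at the same grain.
CREATE TABLE Fact_Customer_Transaction (
    Transaction_Key INT PRIMARY KEY,
    Customer_Key    INT NOT NULL,
    Date_Key        INT NOT NULL,
    Status_Key      INT NOT NULL REFERENCES Dim_Transaction_Status(Status_Key),
    Amount          DECIMAL(19,4)
);
```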
Since the data is changing constantly, you should consider updating your fact table on a regular schedule and rebuilding your cube when needed.
Whether or not you use SCD really depends on the data you process and what it is used for. Is there a business case that requires it? Is there a technical use?
I think this is something you have to decide on your own.
With reference to localities and postal codes
Each postal code can have one or more localities
Each locality can have one or more postal codes
Accordingly, should this be created as an M:M scenario with a third join table, 'areas'?
The postal code table would only have a single column being the postal code itself and the locality table would also only have a single column being the locality name.
The alternative is a single table including both, but that would result in repeated data.
Thanks in advance...
The question you have asked is largely a matter of opinion. There are many factors that might lead you to lower the normalization, based on how you plan to query the data.
Traditional normalization usually suggests that the M:M scenario is correct, but that leaves applications constantly joining three tables to relate the information, which may not be the most efficient approach if the applications do this at high frequency.
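For reference, a minimal sketch of the M:M version the question describes, with the 'areas' join table resolving the many-to-many relationship (column names are assumptions):

```sql
CREATE TABLE postal_code (
    postal_code VARCHAR(10) PRIMARY KEY
);

CREATE TABLE locality (
    locality_name VARCHAR(100) PRIMARY KEY
);

-- Join table: one row per (postal code, locality) pairing.
CREATE TABLE area (
    postal_code   VARCHAR(10)  NOT NULL REFERENCES postal_code(postal_code),
    locality_name VARCHAR(100) NOT NULL REFERENCES locality(locality_name),
    PRIMARY KEY (postal_code, locality_name)
);
```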
The alternative of a single table with repeated data could be optimal if accompanied by well-designed non-clustered indexing, so that joins are minimized and index seeks are optimized in execution plans. However, storage would be taxed due to the non-clustered indexes, and apps of course have to know that the data coming back could be duplicated. But if the point is simply validating whether a locality is within a zip code, that is expected.
Short story: there is the textbook answer in a perfect world, and then, in practice, there may be other factors of performance, storage, query optimization, and application tendencies that could make lower normal forms preferable in certain situations.