Postal code database normalisation - normalization

With reference to localities and postal codes
Each postal code can have one or more localities
Each locality can have one or more postal codes
Accordingly should this be created as a M:M scenario with a 3rd join table 'areas'?
The postal code table would only have a single column being the postal code itself and the locality table would also only have a single column being the locality name.
The alternative is a single table including both but it would result in repeated data.
Thanks in advance...

The question you have asked is mostly a matter of opinion. Many factors might lead you to lower the level of normalization, depending on how you plan to query the data.
Traditional normalization usually suggests that the M:M scenario is correct, but that leaves applications constantly joining 3 tables to relate the information, which may not be the most efficient approach if the applications do this at high frequency.
The alternative of a single table with repeated data could be optimal if accompanied by well-designed non-clustered indexes, so that joins are minimized and index seeks are optimized in execution plans. However, the non-clustered indexes would consume extra storage, and apps of course have to know that the data coming back could contain duplicates. But if the point is simply to validate whether a locality is within a zip code, this is expected.
Short story: there is the textbook answer for a perfect world, and then in practice there may be other factors, such as performance, storage, query optimization, and application tendencies, that make lower normal forms preferable in certain situations.
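As a rough illustration of the M:M design from the question, here is a minimal sketch using Python's built-in sqlite3 module. The table name areas comes from the question; the sample postal codes, locality names and helper functions are made up for the example. One thing it suggests: when the two parent tables hold nothing but their natural key, the areas table alone can answer the lookups, and the parents mainly serve to validate values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE postal_codes (postal_code TEXT PRIMARY KEY);
    CREATE TABLE localities   (locality    TEXT PRIMARY KEY);

    -- Junction table: one row per valid (postal code, locality) pair.
    CREATE TABLE areas (
        postal_code TEXT NOT NULL REFERENCES postal_codes(postal_code),
        locality    TEXT NOT NULL REFERENCES localities(locality),
        PRIMARY KEY (postal_code, locality)
    );
    -- Second index so lookups by locality are as cheap as lookups by code.
    CREATE INDEX areas_by_locality ON areas (locality, postal_code);

    -- Hypothetical sample data.
    INSERT INTO postal_codes VALUES ('1000'), ('1001');
    INSERT INTO localities   VALUES ('Springfield'), ('Shelbyville');
    INSERT INTO areas VALUES
        ('1000', 'Springfield'),
        ('1000', 'Shelbyville'),
        ('1001', 'Springfield');
""")


def localities_for(code):
    """Return all localities that share the given postal code."""
    rows = conn.execute(
        "SELECT locality FROM areas WHERE postal_code = ? ORDER BY locality",
        (code,),
    ).fetchall()
    return [row[0] for row in rows]


def is_valid_pair(code, locality):
    """Check whether a locality lies within a postal code."""
    row = conn.execute(
        "SELECT 1 FROM areas WHERE postal_code = ? AND locality = ?",
        (code, locality),
    ).fetchone()
    return row is not None
```

If the parent tables used surrogate ids instead of the natural values, the same lookups would need the three-table join the answer warns about.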

Related

How can I use dynamic data masking to protect a column but still allow it to be used in joins?

This seems like a pretty common use-case. Let's say we have sensitive PII that we want to protect, such as SSNs. We mask that data using dynamic data masking in Snowflake. Now we have an engineer that is writing data transformations, and they need to join two tables using SSN. They don't have clearance to view the SSNs, but they can view the other information on both tables. I want the engineer to be able to join the two tables, and see all the combined unsecured data, while keeping the SSN secret from the engineer. I'm really not sure why Snowflake doesn't use real values for joins behind the scenes while refusing to return them in results. Is there a workaround?
One idea is to make the masking policy return a hash of the initial value. That has a couple of limitations. First, it is explicitly warned against in the Snowflake docs. Second, it requires runtime hashing of all the values, which slows down query execution seemingly needlessly. Third, there is the issue of hash collisions which could break joins. This could result in an engineer spending days working to track down a bug in their code, only to realize that the extra rows in their dataset are the result of a hash collision.
Another potential solution is using an external tokenization provider (docs). I don't understand this option well, but it appears that this would mean that I would need to store the actual values and their tokenized form with a third party service, then make an API call each time I wanted to use the values in a query. That seems less than ideal. I'd rather the solution be contained within Snowflake.
I'd love to hear any thoughts, thanks in advance.
If you care about database integrity and want to avoid errors: Don't use SSNs as identifiers.
A SSN can be a property of a person, but don't use it as their primary key.
As the United States Social Security Administration says:
A 1990 OIG, HHS study indicated that 45% of organizations, both public and private, using SSNs make no effort to verify SSN accuracy. This leads to the real possibility that transfers of data from one organization to another could be inaccurate; computer matching of data between different organizations could be invalid; and innocent persons could be subjected to unwarranted intrusions into their privacy or improper changes in their benefits or services or even misidentified with serious results.
Also:
The SSN is the single most widely used record identifier for both government and the private sector, exerting a broad influence on the lives of most Americans. However, by itself, it is not a personal identifier because it lacks systematic assignment to every person and the means to authenticate a person's identity.
https://www.ssa.gov/history/reports/ssnreportc2.html
Instead you could create a unique id for each person within your database, and use that key for joins.
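A minimal sketch of that suggestion, using SQLite through Python's sqlite3 module rather than Snowflake. The table names, column names and sample rows are hypothetical. The point is only that the join runs on a generated person_id, so the SSN column never has to appear in the engineer's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (
        person_id INTEGER PRIMARY KEY,  -- surrogate key, not sensitive
        ssn       TEXT UNIQUE,          -- sensitive attribute, can stay masked
        name      TEXT NOT NULL
    );
    CREATE TABLE claims (
        claim_id     INTEGER PRIMARY KEY,
        person_id    INTEGER NOT NULL REFERENCES persons(person_id),
        amount_cents INTEGER NOT NULL
    );

    -- Hypothetical sample data (the SSNs are placeholders).
    INSERT INTO persons VALUES (1, '000-00-0001', 'Ada'), (2, '000-00-0002', 'Grace');
    INSERT INTO claims  VALUES (10, 1, 25000), (11, 1, 7500), (12, 2, 3000);
""")

# The join touches only person_id; ssn is never selected or compared.
rows = conn.execute("""
    SELECT p.name, SUM(c.amount_cents)
    FROM persons AS p
    JOIN claims  AS c ON c.person_id = p.person_id
    GROUP BY p.person_id
    ORDER BY p.person_id
""").fetchall()
```

Whatever masking policy is applied to ssn, it no longer affects the join, because nothing in the query depends on that column.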

Is ~44 columns too much for a model? Does it make sense to break a one-to-one relation?

I am interested in what the best practice is for a model that has a lot of data attached to it. Most of my app revolves around one model (SKU), and it seems to have more and more things associated with it.
For example, my SKU model has multiple prices, dimensions, weight, recommended prices for multiple price levels, title, description, shelf life, etc. Would it make sense to break all the pricing info to another table? Or break up the SKU into different uses of the SKU and associate them? For example, WebSKU, StockSKU, etc.
As mentioned in the answer linked by Tom, if all your attributes really belong to that model there is no reason to break it up. However, if you have columns like price1, price2, price3 or dimension_x_1, dimension_y_1, dimension_x_2, dimension_y_2, etc, then it usually means you should be creating another table to contain those.
For example, you could set it up so that you have the following models
class Sku < ActiveRecord::Base
  has_many :prices
  has_many :dimensions
end

class Price < ActiveRecord::Base
  belongs_to :sku
end

class Dimension < ActiveRecord::Base
  belongs_to :sku
end
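At the table level, the same split might look like the sketch below (SQLite via Python's sqlite3; the column names level and amount_cents and the sample data are assumptions, not part of the original models). It shows why a separate prices table beats price1, price2, price3 columns: a new price level is a new row, not a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE skus (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    -- One row per (SKU, price level) instead of price1, price2, price3 columns.
    CREATE TABLE prices (
        id           INTEGER PRIMARY KEY,
        sku_id       INTEGER NOT NULL REFERENCES skus(id),
        level        TEXT NOT NULL,
        amount_cents INTEGER NOT NULL,
        UNIQUE (sku_id, level)
    );

    -- Hypothetical sample data.
    INSERT INTO skus VALUES (1, 'Widget');
    INSERT INTO prices (sku_id, level, amount_cents) VALUES
        (1, 'retail', 1999),
        (1, 'wholesale', 1499);
""")

# Adding another price level needs no ALTER TABLE, just an INSERT.
conn.execute(
    "INSERT INTO prices (sku_id, level, amount_cents) VALUES (?, ?, ?)",
    (1, "distributor", 1299),
)


def prices_for(sku_id):
    """Return a {level: amount_cents} mapping for one SKU."""
    rows = conn.execute(
        "SELECT level, amount_cents FROM prices WHERE sku_id = ?", (sku_id,)
    ).fetchall()
    return dict(rows)
```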
As everyone else said, the design of a database should respond to the logic behind it. Why? Mainly, because it will be easier to maintain and understand.
I was also going to draw attention to normalization rules, as @sawa did.
Generally, it is a good approach to normalize your database, as it provides several advantages. You should read this Wikipedia link (at least as a starting point).
Following the normal forms will help you design your database with the logic behind your data in mind.
But denormalization also has its advantages. The first (and always considered) is optimized read performance. This basically means keeping data in one table that you would have had in different tables when following the normalization rules, and it generally makes sense when that data is logically related.
You have to aim to achieve a balance depending on the problem you are facing.
On the other hand, from the tags on your post I can see you are using Ruby on Rails, which uses the Active Record pattern. One consequence of the database model you are presenting is that you will probably have a domain model that is just as complex, i.e. very large. I don't know every detail of your project, but I guess that it will quickly grow into a god object, making your code hard to maintain, extend and understand.
A database should be designed not according to how many columns it has, but according to logic, particularly following Codd's normal forms. If there is systematic redundancy in your database, then that is a sign for splitting it into multiple tables. If not, keep it as is.
I think it is good to design a data model with an understanding of how the DB engine works with files and memory. The first bottleneck of PostgreSQL is file IO; memory consumption is also important. When PostgreSQL reads table data (FYI: table data is not read during index-only scans), it reads pages of 8 KB (a compile-time parameter). The more tuples fit in such a page, the less file IO and memory consumption, and the better the cache usage (more frequent hits, faster prewarming, etc.) and performance.
So, if one has a really high-load project, it can be useful to think about separating frequently used data into isolated tables (and, as a next step, placing these tables in a separate tablespace on SSD or a powerful RAID).
I.e. there should be some balance between logical simplicity and performance tweaks.
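To put rough numbers on the page argument, here is a back-of-the-envelope estimate in Python. The overhead constants are approximations assumed for illustration (about 24 bytes of page header, about 24 bytes of header per row plus a 4-byte line pointer). It ignores alignment padding, fillfactor and TOAST, so treat the results as orders of magnitude only.

```python
PAGE_SIZE = 8192           # default PostgreSQL page size in bytes
PAGE_HEADER = 24           # assumed approximate page header size
TUPLE_OVERHEAD = 24 + 4    # assumed per-row header plus line pointer


def rows_per_page(row_bytes):
    """Estimate how many rows of a given data width fit in one page."""
    per_page = (PAGE_SIZE - PAGE_HEADER) // (row_bytes + TUPLE_OVERHEAD)
    if per_page < 1:
        # Rows this wide would not be stored inline; out of scope here.
        raise ValueError("row does not fit in a single page")
    return per_page


def pages_needed(row_count, row_bytes):
    """Estimate the number of pages a table of row_count rows occupies."""
    per_page = rows_per_page(row_bytes)
    return -(-row_count // per_page)  # ceiling division


# Compare a wide row with a narrow one holding only the hot columns.
wide = pages_needed(1_000_000, 400)
narrow = pages_needed(1_000_000, 40)
```

Under these assumptions the narrow table needs several times fewer pages for the same number of rows, which is the effect the answer describes when it suggests isolating frequently used columns.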

How to Organize an out of control table?

Hello and good morning.
I am working on a side project where I am adding an analytics board to an already existing app. The problem is that the users table now has over 400 columns. My question is: what is a better way of organizing this table, such as splitting it off into separate tables? How do you do that, and how do you link the new tables to each other?
Another concern is: if I separate the table, will I still be able to save into it through the user model? I have code right now that says:
user.wallet += 100
user.save
If I separate wallet from user and link the two tables, will I have to change this code? The reason I'm asking is that there is a ton of code like this in the app.
Thank you so much if you can help me understanding how to organize a database. As a bonus if there is a book that talks about database organization can you recommend it to me (preferably one that is in rails).
Edit: Is there also a way to do all of this without losing any data? For example, transfer the data to a new column on the new table and then destroy the old column.
Please read about:
Database Normalization
You'll get loads of hits when searching for that string and there are many books about database design covering that subject.
It is most likely that this table of yours lacks normalization, but you have to see for yourself!
Just to give an orientation: I would get a little anxious when dealing with a tenth of that number of columns. That said, I have to stress that there might be well-normalized tables with 400 columns as well as sloppily created examples with just 10 columns.
Generally speaking, the probability of dealing with badly designed tables, and hence facing trouble, simply rises with the number of columns.
So take your time, and if you find that the users table needs normalization, the next step would indeed be to spread the data over several tables. Because that clearly (and most likely even heavily) affects the code of your application, this is where you have to thoroughly balance the pros and cons; it is simply impossible to judge that from far away.
Say you have substantial problems (e.g. fierce performance problems, though you didn't mention any) that could be eased by normalization: there are different approaches to splitting the data. Here please read about:
Cardinalities
Usually the new tables are linked by foreign keys, that is, identical data (like a user id) that appears in multiple tables and is used to join them.
And finally, yes, you can do that without losing data as the overall amount of information never changes when normalizing.
In case your last question was meant to be technical: There is no problem in reading data from one column and inserting them into a new one (of a new table). That has to happen in a certain order as foreign keys have to be filled before you can use them. See
Referential Integrity
However, quite obvious: Deleting data and dropping columns interferes with the operability of your application. Good planning is due.
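The copy step described above can be sketched as follows, using SQLite through Python's sqlite3 module. The wallets table, its balance column and the sample users are hypothetical; only users and wallet come from the question. The sketch copies the data inside one transaction and verifies it, leaving the old column in place until the application no longer reads it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the existing wide table (only the relevant columns shown).
    CREATE TABLE users (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        wallet INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO users VALUES (1, 'ann', 100), (2, 'bob', 250);
""")

with conn:  # commits on success, rolls back if anything below raises
    # 1. Create the new table, keyed by the parent's id (the foreign key).
    conn.execute("""
        CREATE TABLE wallets (
            user_id INTEGER PRIMARY KEY REFERENCES users(id),
            balance INTEGER NOT NULL
        )
    """)
    # 2. Copy the data; the parent rows already exist, so the keys are valid.
    conn.execute(
        "INSERT INTO wallets (user_id, balance) SELECT id, wallet FROM users"
    )
    # 3. Verify that every user has a wallet row with the same value.
    mismatches = conn.execute("""
        SELECT COUNT(*)
        FROM users AS u
        LEFT JOIN wallets AS w ON w.user_id = u.id
        WHERE w.balance IS NOT u.wallet
    """).fetchone()[0]
    if mismatches:
        raise RuntimeError(f"{mismatches} rows were not copied correctly")

# 4. Only after the application reads and writes wallets would the old
#    users.wallet column be dropped, as a separate, later migration.
```

Keeping the drop as a separate step is a design choice, not a requirement: it leaves a window in which the old and new columns can be compared before anything is destroyed.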

Transaction Fact Table approach

I'm working on a financial data mart structure.
And I'm having some doubts about what the better approach is.
The source system database, Dynamics AX 2009, has three tables for customer transactions:
One table for open transactions, where the customer still needs to pay for the service/product;
One table for settled transactions, which holds what the customer has already paid;
Finally, a table that has all customer transactions; it holds transactions from open to settled and also other transactions, such as customer to bank or ledger accounts.
I thought of two options. The first is to maintain one fact table per source table: a fact for open transactions, a fact for settled transactions, and a fact for all customer transactions.
The second is to create a single fact table holding all transactions; to do so I would have to do a full join on the three tables.
I'm not sure about either approach, as the first seems to just copy the tables from production and add the proper dimensions.
With the second I would create a massive fact table whose data would constantly change, as open transactions are deleted on the source system when they are settled.
Another doubt: should I create a fact table with an SCD (slowly changing dimension) structure to maintain historical data (start date, end date, flag)?
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: they help you categorize your measures and get a specific view on them. This approach will also open up many more possibilities for creating knowledge with your cube. With separate cubes for open/settled/etc. transactions, it will be harder or even impossible to compare this data.
Since the data is changing constantly, you should consider updating your fact table at a given interval and rebuilding your cube when needed.
If you use scd or not really depends on the data you process and what it is used for. Is there a business case claiming it? Is there a technical use?
I think this is something you have to decide on your own.
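A minimal sketch of the single-fact-table option, assuming all transactions really do sit at the same grain (one row per customer transaction). It uses SQLite through Python's sqlite3 module; every table name, column name and sample row here is hypothetical and not taken from Dynamics AX.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension that categorizes each transaction.
    CREATE TABLE dim_status (
        status_key  INTEGER PRIMARY KEY,
        status_name TEXT NOT NULL
    );
    -- One fact row per customer transaction, whatever its status.
    CREATE TABLE fact_customer_transaction (
        transaction_id INTEGER PRIMARY KEY,
        customer_key   INTEGER NOT NULL,
        status_key     INTEGER NOT NULL REFERENCES dim_status(status_key),
        amount_cents   INTEGER NOT NULL
    );

    -- Hypothetical sample data.
    INSERT INTO dim_status VALUES (1, 'open'), (2, 'settled');
    INSERT INTO fact_customer_transaction VALUES
        (100, 7, 1,  5000),
        (101, 7, 2, 12000),
        (102, 8, 1,  3000),
        (103, 8, 2,  4000);
""")


def totals_by_status():
    """Sum the amounts per status by slicing the one fact table."""
    rows = conn.execute("""
        SELECT s.status_name, SUM(f.amount_cents)
        FROM fact_customer_transaction AS f
        JOIN dim_status AS s ON s.status_key = f.status_key
        GROUP BY s.status_name
    """).fetchall()
    return dict(rows)


before = totals_by_status()

# When the source settles a transaction, the load updates the status key
# instead of moving the row between tables.
conn.execute(
    "UPDATE fact_customer_transaction SET status_key = 2 WHERE transaction_id = 100"
)
after = totals_by_status()
```

If some measures turn out not to share the same dimensions, this sketch no longer applies and the multiple-fact-table advice given earlier in the thread would be the safer route.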

How do I avoid complex joins in star schema?

My fact table holds the score a user received in a course he took. Some of the details of the course, which I have to show on the report, come from more than one table (in the actual OLTP db).
Do I create a non-normalized version of that course entry in a dimension table?
Or do I just join the fact table directly to the course table, joined in turn to the other tables that describe this course (course_type, the faculty who created this course, etc.)?
Snowflaking or bridge tables do make the joins more complicated, and not just from a coding perspective, it also makes it less simple for BI users.
In most cases, I would put these directly in existing or additional dimension tables.
For instance, you have a scores fact table, which has the user details in a dimension which may or may not hold demographics on the user (perhaps it's only a bridge). Sometimes it is better to split out demographic information. So even though the gender and age might be associated with a user entity, in the dimensional model, these might be individual dimensions or lumped into a single dimension - all depending on the usage scenarios.
Perhaps your scores are attached to a state and states have regions (snowflake). It might be far more efficient for analysis to have the region dimension linked directly instead of going through the state dimension.
I think what you will find is that the dimensional model is a very pragmatic denormalization approach. The main things which are non-negotiable are the facts - after that the choice of dimensions is very much informed by the behavior of the data, your foresight for common usage scenarios - and avoiding falling into the too few dimensions and too many dimensions problems.
Maybe I do not understand your question, but a fact table in a star schema is supposed to be joined to dimension tables surrounding it.
If you do not feel like making joins, simply create a view, and use the view for reporting.
If you were to post a model (schema), it would be easier to comment/help.
It is a common practice to consolidate several dimensions together, sacrificing normalization in favor of performance. This is usually done when your typical query will need all dimensions together (as opposed to using different bits for different use cases).
Also remember that while you receive a reduction in join overhead, there are some drawbacks:
Loss of flexibility, which might hinder development as the warehouse expands
Full table scans take longer (in traditional row-based RDBMS such as SQL Server)
Disk space consumption
You will have to consider each case separately.
It might be worthwhile to also consider the option of creating a materialized view, if such ability is offered by your RDBMS.
We commonly have a snowflake schema as the physical DWH design, but add a reporting view layer that flattens the snowflake schema into a star schema.
This way your OLAP cube becomes much simpler and easier to manage.
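The reporting view layer mentioned above might look like this sketch, again SQLite through Python's sqlite3 module. The course_type and faculty tables are named in the question; the view name dim_course, the fact table fact_score, the columns and the sample rows are assumptions. The view hides the snowflake joins so the report joins the fact table to a single flat dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (snowflaked) course tables.
    CREATE TABLE course_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE faculty     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE course (
        id             INTEGER PRIMARY KEY,
        title          TEXT NOT NULL,
        course_type_id INTEGER NOT NULL REFERENCES course_type(id),
        faculty_id     INTEGER NOT NULL REFERENCES faculty(id)
    );
    CREATE TABLE fact_score (
        user_id   INTEGER NOT NULL,
        course_id INTEGER NOT NULL REFERENCES course(id),
        score     INTEGER NOT NULL
    );

    -- Reporting view: flattens the snowflake into one star-style dimension.
    CREATE VIEW dim_course AS
    SELECT c.id    AS course_id,
           c.title AS title,
           t.name  AS course_type,
           f.name  AS faculty
    FROM course AS c
    JOIN course_type AS t ON t.id = c.course_type_id
    JOIN faculty     AS f ON f.id = c.faculty_id;

    -- Hypothetical sample data.
    INSERT INTO course_type VALUES (1, 'online'), (2, 'classroom');
    INSERT INTO faculty     VALUES (1, 'Faculty A'), (2, 'Faculty B');
    INSERT INTO course      VALUES (1, 'SQL Basics', 1, 1), (2, 'Data Modelling', 2, 2);
    INSERT INTO fact_score  VALUES (1, 1, 80), (2, 1, 90), (1, 2, 70);
""")

# The report needs a single join, from the fact table to the flat view.
report = conn.execute("""
    SELECT d.title, d.course_type, d.faculty, AVG(s.score)
    FROM fact_score AS s
    JOIN dim_course AS d ON d.course_id = s.course_id
    GROUP BY d.course_id
    ORDER BY d.course_id
""").fetchall()
```

A plain view still performs the underlying joins at query time; a materialized view or a physically denormalized dimension table trades storage and refresh effort for fewer joins.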

Resources