Is ~44 columns too much for a model? Does it make sense to break a one-to-one relation? - ruby-on-rails

I am interested in what the best practice is for a model that has a lot of data attached to it. Most of my app revolves around one model (SKU), and it seems to have more and more things associated with it.
For example, my SKU model has multiple prices, dimensions, weight, recommended prices for multiple price levels, title, description, shelf life, etc. Would it make sense to break all the pricing info out into another table? Or to break up the SKU into different uses of the SKU and associate them? For example, WebSKU, StockSKU, etc.

As mentioned in the answer linked by Tom, if all your attributes really belong to that model there is no reason to break it up. However, if you have columns like price1, price2, price3 or dimension_x_1, dimension_y_1, dimension_x_2, dimension_y_2, etc, then it usually means you should be creating another table to contain those.
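For example, you could set it up so that you have the following models:
class Sku < ActiveRecord::Base
  has_many :prices      # one row per price level instead of price1, price2, ...
  has_many :dimensions  # one row per set of dimensions
end
class Price < ActiveRecord::Base
  belongs_to :sku
end
class Dimension < ActiveRecord::Base
  belongs_to :sku
end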
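To make the split concrete at the table level, here is a minimal sketch in plain SQL, run through Python's built-in sqlite3 module so that it works without a Rails app. The table and column names (skus, prices, price_level, amount_cents) are my own assumptions for illustration, not something taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE skus (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    -- One row per (sku, price level) replaces price1, price2, price3 columns.
    CREATE TABLE prices (
        id           INTEGER PRIMARY KEY,
        sku_id       INTEGER NOT NULL REFERENCES skus(id),
        price_level  TEXT    NOT NULL,
        amount_cents INTEGER NOT NULL,
        UNIQUE (sku_id, price_level)
    );
""")
conn.execute("INSERT INTO skus (id, title) VALUES (1, 'Widget')")
conn.executemany(
    "INSERT INTO prices (sku_id, price_level, amount_cents) VALUES (?, ?, ?)",
    [(1, "retail", 999), (1, "wholesale", 650), (1, "web", 899)],
)


def prices_for(sku_id):
    """Return {price_level: amount_cents} for one SKU."""
    rows = conn.execute(
        "SELECT price_level, amount_cents FROM prices WHERE sku_id = ?",
        (sku_id,),
    )
    return dict(rows.fetchall())
```

With this layout, adding a new price level is just another row in prices, not another column on the SKU table.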

As everyone else said, the design of a database should respond to the logic behind it. Why? Mainly because that makes it easier to maintain and understand.
I was also going to draw attention to normalization rules, as sawa did.
Generally, it is a good approach to normalize your database, as it provides several advantages. You should read the Wikipedia article on database normalization (at least as a starting point).
Following the normal forms will help you design your database around the logic behind your data.
But denormalization also has its advantages. The first (and the one always considered) is optimized read performance. This basically means keeping data in one table that you would have split across different tables when following the normal forms, and it generally makes sense when that data is logically related.
You have to aim for a balance that depends on the problem you are facing.
On the other hand, from the tags on your post I can see you are using Ruby on Rails, which uses the Active Record pattern. One consequence of the database model you are presenting is that you will probably have a domain model that is just as complex, that is, very large. I don't know every detail of your project, but I guess it will quickly grow into a god object, making your code hard to maintain, extend and understand.

A database should be designed not according to how many columns it has, but according to logic, in particular by following Codd's normal forms. If there is systematic redundancy in your database, that is a sign you should split it into multiple tables. If not, keep it as is.

I think it is good to design a data model with some awareness of how the DB engine works with files and memory. The first bottleneck of PostgreSQL is file I/O. Memory consumption is also important. When PostgreSQL reads table data (FYI: table data is not read during index-only scans) it reads 8 KB pages (the page size is a compile-time parameter). The more tuples fit in such a page, the less file I/O and memory consumption there is, and the better the cache behaves (more frequent hits, faster prewarming, etc.), so performance improves.
So, if you have a really heavily loaded project, it can be useful to think about separating frequently used data into its own tables (and, as a next step, placing these tables in a separate tablespace on SSD or a powerful RAID).
That is, there should be some balance between logical simplicity and performance tweaks.
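As a rough back-of-the-envelope illustration of the page argument, the sketch below estimates how many rows fit into one 8 KB page for a wide and a narrow row. The overhead constants are approximations, and the row widths (600 and 80 bytes) and the row count are made-up numbers. It ignores alignment padding, fillfactor and TOAST, so treat the results as an order-of-magnitude estimate only.

```python
PAGE_SIZE = 8192    # default PostgreSQL page size in bytes
PAGE_HEADER = 24    # approximate fixed header per page
TUPLE_HEADER = 24   # approximate per-row header after alignment
ITEM_POINTER = 4    # per-row line pointer stored in the page


def rows_per_page(row_data_bytes):
    """Estimate how many rows of the given data width fit in one page."""
    per_row = row_data_bytes + TUPLE_HEADER + ITEM_POINTER
    return (PAGE_SIZE - PAGE_HEADER) // per_row


def pages_needed(row_count, row_data_bytes):
    """Estimate how many pages a table of row_count rows occupies."""
    per_page = rows_per_page(row_data_bytes)
    return -(-row_count // per_page)  # ceiling division


# Hypothetical table of 100,000 rows: all columns together vs. hot columns only.
wide_pages = pages_needed(100_000, 600)
narrow_pages = pages_needed(100_000, 80)
print(wide_pages, narrow_pages)
```

Under these assumptions the narrow table needs several times fewer pages, which is the effect the answer describes: fewer pages to read and to keep in cache for the same number of rows.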

Related

Transaction Fact Table approach

I'm working on a financial data mart structure,
and I have some doubts about what the better approach is.
The source system database, Dynamics AX 2009, has three tables for customer transactions:
One table for open transactions, where the customer still needs to pay for a service/product;
One table for settled transactions, which holds what the customer has already paid;
Finally, a table that has all customer transactions: it holds transactions from open to settled, and also other transactions such as customer to bank or ledger accounts.
I have thought of two options. In the first, I would maintain one fact table per source table: a fact for open transactions, a fact for all customer transactions and a fact for settled transactions.
In the second, I would create a single fact to hold all transactions; to do so I would have to do a full join on the three tables.
I'm not sure about either approach, as the first seems to just copy tables from production and create the proper dimensions.
With the second one I would create a massive fact table whose data would constantly change, as open transactions are deleted in the source system when they are settled.
Another doubt: should I create a fact with an SCD (slowly changing dimension) structure to maintain historical data (start date, end date, flag)?
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
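One way to run the grain check described above is to write down, for each measure, the set of dimensions that apply to it, and then group the measures by that set. The sketch below does that. The measure and dimension names are hypothetical placeholders, not the actual Dynamics AX fields.

```python
from collections import defaultdict


def group_by_grain(measure_dimensions):
    """Group measures by the exact set of dimensions that apply to them."""
    groups = defaultdict(list)
    for measure, dims in measure_dimensions.items():
        groups[frozenset(dims)].append(measure)
    return dict(groups)


# Hypothetical measures and the dimensions each one can be sliced by.
measures = {
    "open_amount": {"customer", "date", "invoice"},
    "invoiced_amount": {"customer", "date", "invoice"},
    "settled_amount": {"customer", "date", "invoice", "settlement"},
}

candidate_fact_tables = group_by_grain(measures)
for dims, names in candidate_fact_tables.items():
    print(sorted(dims), "->", sorted(names))
```

Each group is a candidate Fact table. If you end up with more than one group, either keep separate Fact tables or bring the measures down to a common lowest grain, as discussed above.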
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: they help you categorize your measures and get a specific view of them. This approach will also open up many more possibilities for creating knowledge with your cube. With separate cubes for open/settled/etc. transactions, it will be harder or even impossible to contrast this data.
Since the data is changing constantly, you should consider updating your fact table at set intervals and rebuilding your cube when needed.
Whether you use SCD or not really depends on the data you process and what it is used for. Is there a business case that calls for it? Is there a technical use?
I think this is something you have to decide on your own.
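For what it's worth, the single-fact-table idea with a status dimension could look roughly like the following. This is only a sketch using SQLite through Python's sqlite3 module. The table names (fact_customer_transaction, dim_transaction_status), the columns and the sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_transaction_status (
        status_key INTEGER PRIMARY KEY,
        status     TEXT NOT NULL UNIQUE
    );
    CREATE TABLE fact_customer_transaction (
        transaction_id INTEGER PRIMARY KEY,
        customer_key   INTEGER NOT NULL,
        status_key     INTEGER NOT NULL
                       REFERENCES dim_transaction_status(status_key),
        amount_cents   INTEGER NOT NULL
    );
    INSERT INTO dim_transaction_status VALUES (1, 'open'), (2, 'settled');
    INSERT INTO fact_customer_transaction VALUES
        (100, 1, 1, 5000),
        (101, 1, 2, 2500),
        (102, 2, 2, 7000);
""")


def totals_by_status():
    """Sum the amounts per transaction status using the status dimension."""
    rows = conn.execute("""
        SELECT s.status, SUM(f.amount_cents)
        FROM fact_customer_transaction AS f
        JOIN dim_transaction_status AS s USING (status_key)
        GROUP BY s.status
        ORDER BY s.status
    """)
    return rows.fetchall()
```

When a transaction is settled, only its status_key changes, and open and settled amounts can be compared in one query.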

Core Data Model Design - Attributes vs Entities

I've been developing a very basic core data application for over a year now (Toy Collector, http://bit.ly/tocapp), and I'm looking at doing a redesign so that I can build in iCloud support. I figured while I'm doing that, I might as well update my core data model (if needed), and I'm having a heck of a time tracking down "best practices" for the following:
Currently, I have 2 entities:
Toy, Keywords
Toy has all the information about the object: Name, Year, Set, imageName, Owned, Wanted, Manufacturer, etc, (18 attributes in all)
Keywords has the normalized words to help speed up the search
My question is whether or not there is any advantage to breaking out some of the Toy attributes into their own entities. For example, I could have a manufacturer entity that stores the dozen or so manufacturers, instead of keeping that information in the Toy object. My gut tells me this could reduce the memory footprint (instead of 50,000 objects storing a manufacturer string, there would simply be 12 manufacturer strings in an entity with a relationship to the main Toy entity). Does that kind of organization really matter? Am I trying to overcomplicate things? I just feel like my entity has a lot of attributes, and I'm not sure if taking the time to break it apart into multiple entities would make a difference.
Any advice or pointers would be appreciated!
Zack
Your question is pretty broad, since it addresses the topic of database design. Let me say upfront that it is almost impossible to give you any sensible suggestions, since I would need to know a lot more about your app, use cases, etc. than is possible through an S.O. question.
Coming to your concrete questions, I would say that you correctly identify one of the advantages of splitting a table into multiple ones; actually, the advantage is not just reducing the database footprint but keeping data redundancy to a minimum. Redundancy not only affects memory footprint but also the manageability and modifiability of your data, and it can even cause anomalies or corruption. There is a whole topic in database theory, known as database normalisation, that addresses this kind of concern.
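To illustrate the modifiability point, here is a small sketch that renames a manufacturer in a denormalized table and in a normalized one. It uses plain SQLite through Python's sqlite3 module, not Core Data. The table names (toy_flat, toy, manufacturer) and the sample data are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: the manufacturer name is repeated on every toy.
    CREATE TABLE toy_flat (
        id INTEGER PRIMARY KEY, name TEXT, manufacturer TEXT
    );
    -- Normalized: every toy points at a single manufacturer row.
    CREATE TABLE manufacturer (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE toy (
        id INTEGER PRIMARY KEY, name TEXT,
        manufacturer_id INTEGER REFERENCES manufacturer(id)
    );
""")
conn.execute("INSERT INTO manufacturer (id, name) VALUES (1, 'Acme')")
for i in range(1, 1001):
    conn.execute("INSERT INTO toy_flat VALUES (?, ?, 'Acme')", (i, f"Toy {i}"))
    conn.execute("INSERT INTO toy VALUES (?, ?, 1)", (i, f"Toy {i}"))

# Renaming the manufacturer touches every toy row in the flat table...
flat_rows_touched = conn.execute(
    "UPDATE toy_flat SET manufacturer = 'Acme Toys' WHERE manufacturer = 'Acme'"
).rowcount
# ...but only one row in the normalized layout.
normalized_rows_touched = conn.execute(
    "UPDATE manufacturer SET name = 'Acme Toys' WHERE name = 'Acme'"
).rowcount
```

In the flat layout a rename that misses some rows leaves two spellings of the same manufacturer, which is the kind of anomaly normalisation is meant to prevent.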
On the other hand, as is always the case, redundancy can help performance, namely when you can fetch your data through a simple query instead of multiple queries or table joins. There is a technique for improving database performance known as denormalization, which is the exact opposite of normalisation. Your current schema is fully denormalized.
Using Core Data, which is an object graph manager that often runs on top of SQLite, a relational database, you also have to take into account that Core Data will automatically build your object graph and fetch the data into memory when you need it. This means that even if you can take a smaller footprint on disk for granted, the same might not hold for the RAM footprint of your query results (Core Data will at some point "explode", so to speak, your data from multiple tables into one object plus its attributes).
In your specific case, you should also possibly take into account the cost of migrating your existing user base (if the database is not read-only).
All in all, I would say that if your app does not have any database footprint issues at the moment; if you do not feel that creating new tables might be useful, e.g., in order to add new functionality, such as listing all manufacturers; and, finally, if you do not foresee tasks like renaming a manufacturer at some point, then maybe refactoring your database will not add much benefit. But, as I say, without knowing your app in detail and your roadmap for it, it is difficult to say anything really spot on. In any case, I hope these general considerations will help you make a decision.
EDIT:
If you want to investigate your Core Data performance and try to understand where the bottlenecks are, try the Core Data tool in Instruments (Product/Profile menu). There are a lot of things that can go wrong.
On the other hand, it is really hard to help you further without having more details about the type of searches your app allows. One thing that is not clear to me is whether your searches are slow only when they return a lot of results or also when they return just a few.
Normalizing might help performance if, after doing a search, you only use one of the normalized entities (e.g., to display the toy name in a table). In this case all of the relationships to other entities would be faults (and hence would neither occupy memory nor take time to fetch), and this might speed things up. But if you do a search and then display the information from the other tables as well, there might not be any advantage; quite the opposite, since the faults would have to be resolved immediately, producing more accesses to the database.
It is also true that, depending on how you use it, Core Data might not be the best way to handle your data. Have a look at Brent Simmons' post describing his experience.

Does a serialized hash column make more sense than an associated model/table for flexibility?

I have been researching quite a bit and the general consensus is to avoid serialized hashes in a DB whenever possible, however the design I have lends itself to this structure, so I'm hoping to get some opinions and/or advice. Here is the scenario:
I have a model/table :products which houses financial products. Each product has_many investment strategies, which I had originally stored in a separate :strategies model/table. Since each product has completely different strategies, and each strategy has different attributes, it's become extremely difficult (and hacky) to manipulate each strategy's attributes into normalized, consistent columns (to the point where I have products that I simply cannot add to the application). Additionally, a strategy's attributes can sometimes change based on the amount of money allocated to that strategy.
In order to solve this issue, I am looking into removing the :strategies model/table altogether and simply adding a strategies column to my :products model/table. The new column would house a multi-dimensional hash of each product's strategies. This option allows for tremendous flexibility from a data storage perspective.
My primary question is, do I lose any functionality by restructuring my database this way? There will be times when I need to search for a product by its strategies' attributes, and I have read that searching within a multi-dimensional hash is difficult at best. Is this considered bad practice? Is there a third solution that I haven't considered?
The advantage of rolling with multiple tables for this design is that you can leverage the database to protect your data with constraints, functions and triggers. The database is the only place you can protect your customers' data with 100% confidence. These tried and true techniques have lost their luster in recent years and are viewed as cumbersome and/or unnecessary by those who do not understand them.
Hash-based stores within relational databases are currently changing quickly due to the popularity of NoSQL databases; however, traditionally it has been difficult to fully protect your customers' data at the database level with this implementation. Therefore, the application layer is where much of this protection lives. With that said, this is an area of active innovation, and maybe someday it will be solved.
The big advantage of using the hash as a column in a table is that you can get up and running more quickly while you're figuring out your problem. In addition, you can pivot more easily because most modifications are made in the application layer on the fly.
Full-text searching and complex queries can also be a bit more difficult if you're using a hash-based store within a relational database.
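To show what "a bit more difficult" means in practice, here is a minimal sketch where the strategies are stored as an opaque serialized blob (JSON text in this case, standing in for a serialized Ruby hash). It uses Python's sqlite3 and json modules. The product names and strategy attributes are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, strategies TEXT)"
)
products = [
    (1, "Fund A", {"growth": {"risk": "high", "min_allocation": 1000}}),
    (2, "Fund B", {"income": {"risk": "low"}}),
]
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(pid, name, json.dumps(strategies)) for pid, name, strategies in products],
)


def products_with_strategy_attribute(key, value):
    """Find product ids by scanning every row and decoding its blob in the app."""
    matches = []
    query = "SELECT id, strategies FROM products ORDER BY id"
    for pid, blob in conn.execute(query):
        strategies = json.loads(blob)
        if any(attrs.get(key) == value for attrs in strategies.values()):
            matches.append(pid)
    return matches
```

The search works, but it reads and deserializes every product, and the database cannot use an index or a constraint on anything inside the blob. That is the trade-off against the flexibility you gain.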
The general rule of thumb is: if you need the data to be safe and/or have some complex reporting to do, go relational. Think of a big financial-services type app ;) Otherwise, if you're building a more social, data-display style app, or just mocking things up, there is nothing wrong with a serialized hash column. Most importantly, remember to write tests so you can refactor more confidently if you choose wrong!
My $0.02
I would be curious to know which decision you choose and how it has worked out.

Rails and databases - Store old data in a separate table?

Okay, so I'm putting together a book store with Ruby on Rails. Books are fast moving and varied, so at any point of time there are a small number in the store. Books that have been ordered and shipped must be stored, mainly for the purpose of records.
Hence, I have a situation where a small section of data from a table is going to be very frequently accessed. A much, much larger section of it will very rarely be accessed at all. My plan is to move books that have been ordered and shipped to a separate table, so that the table of current books is small and very quick to access.
Does this approach make sense? Is there a better way of achieving this?
If I am to use this approach, is there a way of sharing a model between tables in Rails?
I agree with Randy's comment about considering the number of books in the database, and whether or not it's really worth it. Only after you try it and come back with real performance numbers should you consider optimizing in this way, I believe.
On the other hand, there's plenty of precedent for having the idea of an "archive" table. From a design standpoint, this is totally fine. It's a question of the tradeoff between complexity and performance. But again, only after you try it and see whether or not the performance is acceptable, will you have a solid reason to choose one approach over another.
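If you do end up going the archive-table route, the move itself can be a single transaction. Below is a rough sketch in plain SQL via Python's sqlite3 module, not Rails. The table names (books, archived_books) and the shipped flag are assumptions of mine, not your actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (
        id      INTEGER PRIMARY KEY,
        title   TEXT    NOT NULL,
        shipped INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE archived_books (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    INSERT INTO books (id, title, shipped) VALUES
        (1, 'In stock', 0),
        (2, 'Shipped last year', 1),
        (3, 'Shipped today', 1);
""")


def archive_shipped_books():
    """Move shipped rows to the archive table in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO archived_books (id, title) "
            "SELECT id, title FROM books WHERE shipped = 1"
        )
        moved = conn.execute("DELETE FROM books WHERE shipped = 1").rowcount
    return moved
```

Running the copy and the delete in the same transaction means a failure halfway through cannot leave a book in both tables or in neither.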

How do I avoid complex joins in star schema?

My fact table holds the score a user got in a course he took. Some of the details of the course, which I have to show on the report, come from more than one table (in the actual OLTP db).
Do I create a denormalized version of that course entry in a dimension table?
Or do I just join the fact table directly to the course table, joined in turn to the other tables that describe the course (course_type, the faculty that created the course, etc.)?
Snowflaking or bridge tables do make the joins more complicated, not just from a coding perspective: they also make things less simple for BI users.
In most cases, I would put these directly in existing or additional dimension tables.
For instance, you have a scores fact table, which has the user details in a dimension which may or may not hold demographics on the user (perhaps it's only a bridge). Sometimes it is better to split out demographic information. So even though the gender and age might be associated with a user entity, in the dimensional model, these might be individual dimensions or lumped into a single dimension - all depending on the usage scenarios.
Perhaps your scores are attached to a state and states have regions (snowflake). It might be far more efficient for analysis to have the region dimension linked directly instead of going through the state dimension.
I think what you will find is that the dimensional model is a very pragmatic denormalization approach. The main things which are non-negotiable are the facts - after that the choice of dimensions is very much informed by the behavior of the data, your foresight for common usage scenarios - and avoiding falling into the too few dimensions and too many dimensions problems.
Maybe I do not understand your question, but a fact table in a star schema is supposed to be joined to dimension tables surrounding it.
If you do not feel like making joins, simply create a view, and use the view for reporting.
If you were to post a model (schema), it would be easier to comment/help.
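Since no schema was posted, here is a guess at what the view suggestion could look like, using SQLite through Python's sqlite3 module. All table and column names (fact_score, course, course_type, faculty, report_score) and the sample row are assumptions loosely based on the names in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course_type (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE faculty (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course (
        id INTEGER PRIMARY KEY,
        title TEXT,
        course_type_id INTEGER REFERENCES course_type(id),
        faculty_id INTEGER REFERENCES faculty(id)
    );
    CREATE TABLE fact_score (
        user_id INTEGER,
        course_id INTEGER REFERENCES course(id),
        score INTEGER
    );
    -- The view hides the joins so that reports query one flat relation.
    CREATE VIEW report_score AS
        SELECT f.user_id, f.score, c.title AS course_title,
               t.name AS course_type, fa.name AS faculty
        FROM fact_score AS f
        JOIN course AS c ON c.id = f.course_id
        JOIN course_type AS t ON t.id = c.course_type_id
        JOIN faculty AS fa ON fa.id = c.faculty_id;
    INSERT INTO course_type VALUES (1, 'Online');
    INSERT INTO faculty VALUES (1, 'Engineering');
    INSERT INTO course VALUES (1, 'Databases 101', 1, 1);
    INSERT INTO fact_score VALUES (42, 1, 88);
""")

report_rows = conn.execute(
    "SELECT user_id, course_title, course_type, faculty, score FROM report_score"
).fetchall()
```

The report then selects from report_score only; the joins still happen, but they are written once, in the view definition.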
It is a common practice to consolidate several dimensions together, sacrificing normalization in favor of performance. This is usually done when your typical query will need all dimensions together (as opposed to using different bits for different use cases).
Also remember that while you receive a reduction in join overhead, there are some drawbacks:
Loss of flexibility, which might hinder development as the warehouse expands
Full table scans take longer (in traditional row-based RDBMS such as SQL Server)
Disk space consumption
You will have to consider each case separately.
It might be worthwhile to also consider the option of creating a materialized view, if such ability is offered by your RDBMS.
We commonly have a snowflake schema as the physical DWH design, but add a reporting view layer that flattens the snowflake schema into a star schema.
This way your OLAP cube becomes much simpler and easier to manage.
