I am fairly new to data warehouses and I want to make sure my plan makes sense. I am using a star schema to capture insurance information for reporting purposes. The users want to see everything as of a specific day. Since every field can change daily, I am planning to have each record in the fact table join to a unique record in each dimension. I know that this will generate a lot of records, but I am unable to think of another way to do it.
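Roughly what I have in mind, as a sketch (the table and column names here are just made up for illustration):

CREATE TABLE dim_policy (
    policy_key  bigserial PRIMARY KEY,   -- surrogate key; a new row for every change
    policy_id   varchar(20) NOT NULL,    -- natural key from the source system
    status      varchar(20) NOT NULL,
    valid_from  date NOT NULL,
    valid_to    date NOT NULL            -- e.g. '9999-12-31' for the current row
);

CREATE TABLE fact_policy_daily (
    snapshot_date date NOT NULL,
    policy_key    bigint NOT NULL REFERENCES dim_policy (policy_key),
    premium       numeric(18,2) NOT NULL
);

-- "As of" a given day, each fact row already points at the dimension
-- row that was in effect on that day:
SELECT f.snapshot_date, d.policy_id, d.status, f.premium
FROM fact_policy_daily f
JOIN dim_policy d ON d.policy_key = f.policy_key
WHERE f.snapshot_date = DATE '2024-01-31';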
I appreciate any insight.
I'm working on an existing Rails app, on PostgreSQL, that calculates commissions and various data for contractors.
Employees have many Contractors. Contractors and Employees both have fields that are used in business logic to calculate commissions.
My client wants to have a yearly snapshot of all of their data, so that they can be free to change business logic, add and remove employees, etc without losing their past (calculated) data.
My initial thought for implementing this would be Postgres schemas. I would have a cron task every year that takes the database as-is and copies every table and record into a schema for that year. That would be equivalent to simply keeping the older version of the DB around. I am worried, however, that application logic would break once columns are added in the future.
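Roughly, the yearly cron task would run something like this (a sketch; the schema name is illustrative, and note that CREATE TABLE ... AS copies data but not indexes or constraints):

CREATE SCHEMA snapshot_2023;
CREATE TABLE snapshot_2023.employees   AS TABLE public.employees;
CREATE TABLE snapshot_2023.contractors AS TABLE public.contractors;
-- ...and so on for every table in the public schema.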
For example, a schema is created one year, and a column that is used in a commissions calculation gets added to the contractors table later. How would I also save the old version of the commissions formula that doesn't depend on the new column?
The only solution I can think of is to simply keep the old formulas and conditionally use them based on the schema. I feel like this is very dirty and can lead to a lot of garbage as business logic changes.
How do you recommend I approach this problem? Thanks in advance for your help!
I think you should store the calculated commission in your DB to prevent recalculation. An accepted calculated value is a fact; just persist that value.
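For example, a minimal sketch (table and column names are just illustrative):

CREATE TABLE commission_results (
    id             bigserial PRIMARY KEY,
    contractor_id  bigint NOT NULL REFERENCES contractors (id),
    period_year    int NOT NULL,
    amount         numeric(12,2) NOT NULL,           -- the accepted, calculated value
    calculated_at  timestamptz NOT NULL DEFAULT now()
);

Reports then read the persisted amount instead of re-running whatever formula was current at the time.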
Should you need to audit the calculated fields some time later, I'm not sure the old calculation logic should be made very convenient to retrieve at the application layer. You might need to trace back through your code's SVN history for this. Or the data warehouse could hold the calculation logic; the application would only provide the required calculation parameters and let the auditor handle it.
If the use case is to easily roll back to a specific set of historical business rules out of the blue, then I wouldn't recommend accommodating such a requirement.
I'm still very much a noob and I've been having quite a bit of trouble figuring out how to structure my DB for my Gym/Workout Log app.
The data is to be presented in TableViews with rows/sections. The idea is that the end user would first select a day of the week and name his workout, and he could have multiple workouts under the same day of the week if he wishes. Then, within each workout, he could have multiple exercises, and within each exercise, he could have an array of weights and repetitions that need to maintain the order in which they are pushed in (there might be some trouble there, since I've heard that arrays do not always maintain the same order when queried).
There are a few ways that I can go about structuring my DB, but I know that I have to avoid sub-collections, because although sub-collections will structure my DB beautifully, they are a pain to work with when it comes to reading and performing cascading deletes. I've read that maps are the way to go, but that's kind of what I'm having trouble with, especially in terms of reading the data. I'm going to post what I have come up with, and I'm hoping that someone can suggest how I can improve the model or what I can change so that I can access the string values of days, workouts, exercises, and weights/repetitions as easily as possible, because the way I have it set up, those values are stored as keys. Much appreciated!
You can nest data up to 100 levels deep, but that doesn't mean you should. You can organize your data as a subcollection, which is a collection within a specific document. You can query across subcollections with the same collection ID using Collection Group Queries. You can find best practices here
Hello and good morning.
I am working on a side project where I am adding an analytics board to an already existing app. The problem is that the users table now has over 400 columns. My question is: what's a better way of organizing this table, such as splitting it off into separate tables? How do you do that, and how do the new tables communicate with one another?
Another concern is that if I separate the table, will I still be able to save into it through the user model? I have code right now that says:
user.wallet += 100
user.save
If I separate wallet from user and link the two tables, will I have to change this code? The reason I'm asking is that there is a ton of code like this in the app.
Thank you so much if you can help me understand how to organize a database. As a bonus, if there is a book that talks about database organization, can you recommend it to me (preferably one that uses Rails)?
Edit: Is there also a way to do all of this without losing any data? For example, transferring the data to a new column on the new table and then destroying the old column.
Please read about:
Database Normalization
You'll get loads of hits when searching for that string and there are many books about database design covering that subject.
It is most likely that this table of yours lacks normalization, but you have to see for yourself!
Just to give some orientation: I would get a little anxious when dealing with a tenth of that number of columns. That said, I have to stress that there can be well-normalized tables with 400 columns, as well as sloppily designed examples with just 10 columns.
Generally speaking, the probability of dealing with badly designed tables, and hence facing trouble, simply rises with the number of columns.
So take your time, and if you find that the users table does need normalization, the next step would indeed be to spread the data over several tables. Because that clearly (and most likely even heavily) affects the coding of your application, this is where you have to thoroughly balance pros and cons - it's simply impossible to judge that from far away.
Say you have substantial problems (e.g. severe performance problems - otherwise you probably wouldn't have posted) that could be eased by normalization; there are different approaches for how to split the data. Here, please read about:
Cardinalities
Usually the new tables are linked by
Foreign Keys
- that is, identical data (like a user ID) that appears in multiple tables and is used to join them.
And finally, yes, you can do that without losing data, as the overall amount of information never changes when normalizing.
In case your last question was meant to be technical: there is no problem in reading data from one column and inserting it into a new one (in a new table). That has to happen in a certain order, as foreign keys have to be filled before you can use them. See
Referential Integrity
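For instance, a rough sketch of that migration in Postgres, using the wallet example from the question (run it inside a transaction, so a failure leaves everything untouched):

BEGIN;

CREATE TABLE wallets (
    id       bigserial PRIMARY KEY,
    user_id  bigint NOT NULL UNIQUE REFERENCES users (id),  -- foreign key to users
    balance  integer NOT NULL DEFAULT 0
);

-- Copy the data first, so the foreign keys are filled...
INSERT INTO wallets (user_id, balance)
SELECT id, wallet FROM users;

-- ...and only then drop the old column.
ALTER TABLE users DROP COLUMN wallet;

COMMIT;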
However, quite obviously, deleting data and dropping columns interferes with the operability of your application. Good planning is essential.
I'm working on a financial data mart structure, and I'm having some doubts about the best approach to take.
The source system database, Dynamics AX 2009, has three tables for customer transactions:
One table for open transactions, where the customer still needs to pay for the service/product;
One table for settled transactions, which holds what the customer has already paid;
Finally, a table with all customer transactions, which holds transactions from open to settled, and also other transactions such as customer-to-bank or ledger-account entries.
I thought of two options. First, maintain a fact table representing each of the three source tables: a fact for open transactions, a fact for all customer transactions, and a fact for settled transactions.
Second, create a single fact table to hold all transactions; to do so, I would have to do a full join on the three tables.
I'm not sure about either approach, as the first seems to just copy the tables from production and create the proper dimensions.
With the second, I would create a massive fact table where the data would constantly change, as open transactions are deleted from the source system when they are settled.
Another doubt: should I create a fact table with an SCD (slowly changing dimension) structure (start date, end date, flag) to maintain historical data?
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
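As a hypothetical illustration of a grain mismatch: freight is typically an order-level measure, while quantity is a line-level measure. Forcing both into one line-grain fact table would repeat freight on every line and over-count it, so they belong in separate fact tables:

CREATE TABLE fact_order_line (
    order_date_key int NOT NULL,
    customer_key   int NOT NULL,
    product_key    int NOT NULL,            -- only applies at line grain
    quantity       int NOT NULL,            -- line-grain measure
    line_amount    numeric(18,2) NOT NULL
);

CREATE TABLE fact_order (
    order_date_key int NOT NULL,
    customer_key   int NOT NULL,            -- no product_key: it doesn't apply here
    freight        numeric(18,2) NOT NULL   -- order-grain measure
);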
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: they help you categorize your measures and get a specific view of them. This approach will also open up many more possibilities for creating knowledge with your cube. With separate cubes for open/settled/etc. transactions, it would be harder or even impossible to compare this data.
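A sketch of that idea (names are illustrative): the open/settled distinction becomes a small status dimension, so all transactions live in one fact table and can still be filtered or contrasted.

CREATE TABLE dim_transaction_status (
    status_key  int PRIMARY KEY,
    status_name varchar(20) NOT NULL   -- 'Open', 'Settled', 'Bank', 'Ledger', ...
);

CREATE TABLE fact_customer_transaction (
    date_key     int NOT NULL,
    customer_key int NOT NULL,
    status_key   int NOT NULL REFERENCES dim_transaction_status (status_key),
    amount       numeric(18,2) NOT NULL
);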
Since the data is changing constantly, you should consider updating your fact table at a set interval and rebuilding your cube when needed.
Whether you use SCD or not really depends on the data you process and what it is used for. Is there a business case calling for it? Is there a technical use?
I think this is something you have to decide on your own.
I'm new to dimensional modelling, and I believe you guys can help me with the following doubts.
In the production system I have a transaction table - a sales table, for example. Its unique identifier is a primary key called SaleId.
My doubt is: when modelling the fact table, should the SaleId be included in the fact table as a natural key?
Also, should the fact table have a surrogate key?
Please feel free to send me any link as reference.
Thanks in advance
Technically speaking, it is probably not a natural key - it does look system generated. However, sometimes it is very valid to store a system generated ID in a Fact for use as a Degenerate Dimension. Usually, these are cases where either the business users do have sight of this system generated ID (order numbers, invoice numbers, purchase order numbers, etc.), or where there's no other useful way of identifying some rows which can be usefully grouped together.
If the users of your BI solutions are likely to want to be able to drill down into information and look at it by sale, then the SaleID might well be a good candidate for this treatment. Have a think whether there's any other way for them to get to this level - could a customer be associated with two distinct sales on the same day? If so, would your users want to look at them as two separate sales? You might need to speak to them to find out what's going to be useful for them. If for some reason you can't get a clear answer, I'd say keep it - there's little harm, and you can always remove it later if it's not used.
Here's the Kimball group's take on Degenerate Dimensions, in case you're at all unclear on how they work:
http://www.kimballgroup.com/2003/06/design-tip-46-another-look-at-degenerate-dimensions/
As far as Fact table surrogate keys go - I always use them. As Kimball's Design Tip #81 points out, they're sometimes useful, and it's the kind of thing I'd rather put in at the beginning and not use than realise later on that it would have been useful to have. Point 2 - where you might want to make updates by inserting new rows and deleting the old ones - certainly applies to work I've done.
The requirement for a primary key in a fact table depends on the type of the fact table. Transactional facts which are never updated do not need it. Periodic snapshots probably don't need it, unless the current period is a to-date measure. Accumulating snapshots definitely need it.
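To make both points concrete, here's a minimal sketch in Postgres syntax (column names are illustrative) of a fact table with a surrogate key of its own, and with SaleID kept as a degenerate dimension - a bare identifier on the fact row with no dimension table behind it:

CREATE TABLE fact_sales (
    fact_sales_key bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- fact surrogate key
    date_key       int NOT NULL,
    customer_key   int NOT NULL,
    product_key    int NOT NULL,
    sale_id        varchar(20) NOT NULL,   -- degenerate dimension: no table to join to
    amount         numeric(18,2) NOT NULL
);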