Are there any free or good paid tools that allow business users to edit data warehouse dimensions and then initiate updates to related tables?
Looking for a really simple solution. One example is to let business users change the Product dimension so they can assign/change Product Category or Price.
I am on SQL Server 2008R2
Just as an example of back-applying: when the user changes a product price they may wish to back-date it. This requires the following changes:
Create a new dimension record (assuming this is an SCD2), generating a new surrogate key with a start/end date
Replace the old surrogate key in the fact with the new surrogate key from the effective date
So this is at the very least a two-step process, which I wrap up in a stored procedure called by the ADP.
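A minimal T-SQL sketch of that procedure might look like the following; DimProduct, FactSales and all the column names are assumptions about the schema, not taken from any particular tool:

```sql
-- Sketch only: assumes an SCD2 DimProduct with an IDENTITY surrogate key
-- and a FactSales table carrying that surrogate key.
CREATE PROCEDURE dbo.BackdateProductPrice
    @ProductCode   varchar(20),
    @NewPrice      money,
    @EffectiveDate date
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @OldKey int, @NewKey int;

    -- Current (open-ended) dimension row for this product
    SELECT @OldKey = ProductKey
    FROM dbo.DimProduct
    WHERE ProductCode = @ProductCode AND EndDate IS NULL;

    BEGIN TRANSACTION;

    -- Step 1: close the old row and create the new SCD2 row
    UPDATE dbo.DimProduct
    SET EndDate = DATEADD(day, -1, @EffectiveDate)
    WHERE ProductKey = @OldKey;

    INSERT INTO dbo.DimProduct (ProductCode, Price, StartDate, EndDate)
    SELECT ProductCode, @NewPrice, @EffectiveDate, NULL
    FROM dbo.DimProduct
    WHERE ProductKey = @OldKey;

    SET @NewKey = SCOPE_IDENTITY();

    -- Step 2: repoint existing facts from the effective date onwards
    UPDATE f
    SET f.ProductKey = @NewKey
    FROM dbo.FactSales AS f
    WHERE f.ProductKey = @OldKey
      AND f.SaleDate >= @EffectiveDate;

    COMMIT;
END
```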
Again, all the usual suspects (Microsoft, IBM, etc.) have what they call MDM tools, but they are all really complicated, requiring the definition of a business model (which is fair enough).
I'm working on an existing Rails app, on Postgresql, that calculates commissions and various data for contractors.
Employees have many Contractors. Contractors and Employees both have fields that are used in business logic to calculate commissions.
My client wants to have a yearly snapshot of all of their data, so that they can be free to change business logic, add and remove employees, etc without losing their past (calculated) data.
My initial thought in implementing this would be Postgres schemas. I would have a cron task every year that takes the database as-is and copies every table and record to a schema for that year. That would be equivalent to simply having the older version of the DB in the future. I am worried, however, that application logic would break once columns are added in the future.
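Concretely, I imagine the yearly cron task running something along these lines (the schema name and table list are just examples):

```sql
-- Sketch: snapshot the current tables into a year-stamped schema.
CREATE SCHEMA IF NOT EXISTS snapshot_2015;

CREATE TABLE snapshot_2015.employees   AS TABLE public.employees;
CREATE TABLE snapshot_2015.contractors AS TABLE public.contractors;
CREATE TABLE snapshot_2015.commissions AS TABLE public.commissions;
```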
For example, a schema is created one year and a column gets added to a contractors table later that is used in a commissions calculation. How would I also save the old version of this commissions formula that doesn't depend on the new column?
The only solution I can think of is to simply keep the old formulas and conditionally use them based on the schema. I feel like this is very dirty and can lead to a lot of garbage as business logic changes.
How do you recommend I approach this problem? Thanks in advance for your help!
I think you should store the calculated commission in your db to prevent recalculation. An accepted calculated value is a fact; just persist that value.
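For example, a persisted commissions table could be as simple as this; all names are illustrative, and the inputs column could just as well be plain text depending on your Postgres version:

```sql
-- Sketch: store the accepted commission as a fact at calculation time,
-- together with the parameters that produced it (for later audit).
CREATE TABLE contractor_commissions (
    id            bigserial     PRIMARY KEY,
    contractor_id bigint        NOT NULL REFERENCES contractors (id),
    period_year   int           NOT NULL,
    commission    numeric(12,2) NOT NULL,  -- the accepted, calculated value
    inputs        jsonb,                   -- calculation parameters used
    calculated_at timestamptz   NOT NULL DEFAULT now()
);
```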
Should you need to audit the calculated fields some time later, I'm not sure the old calculation logic should be made very convenient to retrieve in the application layer. You might need to trace back through your code in svn for this. Alternatively, the data warehouse could hold the calculation logic; the application would only provide the required calculation parameters and let the auditor handle it.
If the use case is to easily roll back to specific historical business rules out of the blue, then I wouldn't recommend accommodating such a requirement.
Background: I am trying to design a star schema for a data warehouse. We have the following business model: we have a few products that our customers can buy and then use. The customers are companies, and they have people in their organization who can be mapped to the licenses they have bought for products.
I have the following dimensions.
Account_dim: This dimension contains the list of companies that are our current or prospective customers. It can include companies that still don't have a contract with us and are still in the discussion phase, so some rows might not have a contract.
User_dim: This is the list of users the company has nominated as points of contact for their company. A user will belong to one particular account in Account_dim; one account can have many users.
Product_Dim: This dimension contains all the information regarding the products we sell: the cost of a license and how many users are allowed on a license. So if, for example, an account bought product A, a maximum of two users can use it.
Now I have three tables that have data regarding the contract.
Contract: contains information regarding a contract we have, including the contract start and end dates and the account this contract is assigned to.
products_bought: This table contains the products bought under a contract. A contract can hold multiple products bought. Each product row has the product start/end dates and the price the client has paid for the asset.
allocated users: Each product bought can have users mapped to it who are allowed to use the product; these are the users in User_dim for that account. Basically, this attaches a license to a user.
I am trying to model the contract, products bought and allocated users so I can generate the following data:
The amount of money an account has spent on products.
The utilization of licenses by an account. For example, if an account has a product that allows 3 users but only one user is mapped to it, the product is under-utilized.
I tried denormalizing all three tables into one fact table, but I am running into the problem that the contract end date can change if the contract is extended, and new assets can be mapped to it. Last but not least, the company can remove a user and then map another user to the product, remove users because they left the company, or add more users.
How can this best be modeled? Because the contract and allocated users can change, should they be SCDs rather than fact tables, or how should I implement a fact table that handles these changes, which must also be captured to maintain a history of usage over time?
Your best bet is to read a book on how to go about designing a data warehouse, such as The Data Warehouse Lifecycle Toolkit, as this will give you all the information you need to be able to answer questions like this.
However, to specifically address your question, the best way to approach this is as follows:
Define your measures: what are the values that you wish to be able to aggregate in your reports
Define the grain of each measure: what are the dimensions that uniquely identify each measure. For example, a transaction amount might be defined by Store, Customer and Date/Time; if you dropped any of these then the transaction amount would change; if you added another dimension, such as rainfall, it would not change the transaction amount (n.b. having defined the grain of a measure you should never add dimensions that would change the grain e.g. Product Dimension, in this example)
Once you have defined your measures and their grains you can add all the other dimensions to them (that won't affect their grain) and then decide whether to hold them in separate fact tables or combine them into one fact table:
Rule: if two measures don't have the same grain you must not put them in the same fact table
Guidance: for measures that meet the above rule, if there is also significant overlap in the other dimensions you want to use for each measure then consider combining them into a single fact table. My rule of thumb is that if you have 2-3 dimensions that don't apply to all measures then that's OK; if you hit 5 or more then you probably need to be thinking of splitting the measures into separate facts
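To make that concrete for the licensing example in the question, one possible split into two fact tables (one per grain) might look roughly like this; all names and types are illustrative:

```sql
-- Grain: one row per product bought under a contract (spend).
CREATE TABLE fact_contract_product (
    contract_key   int           NOT NULL,  -- contract dimension / degenerate key
    account_key    int           NOT NULL,  -- references account_dim
    product_key    int           NOT NULL,  -- references product_dim
    start_date_key int           NOT NULL,
    end_date_key   int           NOT NULL,
    price_paid     decimal(12,2) NOT NULL,
    licensed_users int           NOT NULL   -- max users allowed for this product
);

-- Grain: one row per product bought, per user, per allocation change
-- (a transactional fact used for utilisation over time).
CREATE TABLE fact_license_allocation (
    contract_key int      NOT NULL,
    account_key  int      NOT NULL,
    product_key  int      NOT NULL,
    user_key     int      NOT NULL,  -- references user_dim
    date_key     int      NOT NULL,  -- date the user was mapped or removed
    allocated    smallint NOT NULL   -- +1 = user mapped, -1 = user removed
);
```

Spend is then a sum over the first table, and current utilisation is SUM(allocated) per contract/product in the second, compared against licensed_users.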
I am building a master database to store all relevant information about our customers. I am using Neo4j.
Below is a sample of our model. We have Person, which can be registered in 3 of our mobile applications (App.01, App.02, App.03 - we use the CPF as a key, it is like an SSN). In those apps the user can be registered with an email, represented by the Email entity. These users can have multiple addresses, represented by the Address entity.
The question is:
As I am building master data, IMO, if someone queries the MDM database asking for all the "best" information about a person, I would return, for example:
Name: John
Best email: email2 (because it has two apps using it)
Best address: addr1 (because it has two apps using it)
So I am going to build some heuristics to define what the "best" email and address are.
For this purpose, I have some options:
I could create an edge from John to email2 and to addr1, so it will be easy for a user of the MDM to get the "best" address/email for John.
I could build a REST API endpoint and apply this heuristic at query time.
Does anyone have experience using graph databases or designing MDM databases?
Is it a good approach?
This question is a complement for the question: Using Neo4j to build a Master Data Management
The graph data model is good for storing your master data; however, your master data will most likely co-exist with operational and reference data in the form of dimensions.
If you decide to go with a graph model for your MDM, make sure that you have a well-defined semantic model for the core dimensions in MDM, usually:
Products
Customers
Employees
Assets
Locations
These core dimensions become attributes of your nodes.
Also, decide which MDM architecture style you are going to adopt; some popular ones are:
The Registry - graph fits very well with this style because your master data remains in the system of record and the references can be represented in the graph very nicely.
Master Data Hub - extra transformations are required to transpose your system of record from tabular form to the graph.
Master-Master - this style fits well with your MDM in the graph if you do not have too many legacy apps that depend on your MDM.
Approach 1 would add a lot of essentially redundant information (about 2N extra relationships, where N is the number of people), and also require more complex coding to handle changes to a person's apps. And, as always when information is stored redundantly, you would have to be especially careful that inconsistencies do not creep in. But, it should be faster when querying for the "best" contact info.
Approach 2 keeps the DB the same size, but requires a more complex and slower query to get the "best" contact info. However, changing a person's apps and contact info is straightforward.
To decide which approach to use, you should consider whether DB size is an issue, and also look at your use cases and how frequently they will be performed.
Here is a simple heuristic if DB size is not an issue. Suppose G is the frequency at which you need to get a person's "best" contact info, and M is the frequency at which you need to modify a person's apps or contact info. You would pick approach 1 if the value of G/M exceeds some threshold value, K, that you would have to decide on, taking into consideration the above considerations.
I am going to write a simple credit system in which users can "add" and "deduct" credits. Currently I am thinking of two approaches.
Simple one: store the user's credit as a balance field in the database; all actions ("add", "deduct") are logged but not used to compute the latest balance.
History-based: don't store the balance in the database. The balance is computed by looking at the history of transactions, e.g. ("add", "deduct").
Both cases would work, I think, but I am looking to see if there are any caveats when designing such a system; in particular, I am favoring the history-based approach.
Or, are there any reference implementations or open source modules I can use?
Update: or are there any Ruby/Rails-based modules, like AuthLogic, that I can plug into my existing code without reinventing the wheel (e.g. transactions, rollback, security, etc.)?
Absolutely use both.
The balance-based way gives you fast access to the current amount.
The history-based way gives you auditing. The history table should store the transaction (as you describe), a timestamp, the balance before the transaction happened, and ideally a way to track the funds' source/destination.
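A minimal shape for those two pieces might be (names are only illustrative):

```sql
-- Sketch: a stored balance for fast reads, plus an append-only history for audit.
CREATE TABLE credit_balances (
    user_id bigint        PRIMARY KEY,
    balance decimal(12,2) NOT NULL DEFAULT 0
);

CREATE TABLE credit_transactions (
    id             bigint        PRIMARY KEY,
    user_id        bigint        NOT NULL REFERENCES credit_balances (user_id),
    kind           varchar(10)   NOT NULL,  -- 'add' or 'deduct'
    amount         decimal(12,2) NOT NULL,
    balance_before decimal(12,2) NOT NULL,  -- balance before this transaction ran
    source         varchar(100),            -- where the funds came from / went to
    created_at     timestamp     NOT NULL
);
```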
See the Ruby Toolbox for bookkeeping and the Plutus double-entry bookkeeping gem.
In addition, if your credit system may affect users, then I recommend also using logging, and ideally reading about secure log verification and provable timestamp chaining.
For logging details see: techniques for ensuring verifiability of event log files.
For open source code that does credit, you may want to look into: http://www.gnucash.org/
Adding and deducting credits implies that you might also need to be aware of where these credits came from and where they went. Any time you get into a situation like this, whether it is with currency or some other numerical quantity that needs to be tracked and accounted for, you should consider using a double entry accounting pattern.
This pattern has worked for centuries and gives you all of the functionality you need to be able to see what your balances are and how they got to be that way:
Audit log of all transactions (including sources and sinks of "funds")
Running balance of all accounts over time (if you choose to record it)
Easy validation of the correctness of records
Ability to "write-once" - no updates means no tampering
If you aren't familiar with the details, start here: Double Entry Bookkeeping or ask anyone who has taken an introductory course in bookkeeping.
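If you want to see roughly what the pattern looks like in plain SQL, independent of any particular gem, a stripped-down sketch could be:

```sql
-- Sketch: classic double-entry layout; every entry's postings sum to zero.
CREATE TABLE accounts (
    id   bigint       PRIMARY KEY,
    name varchar(100) NOT NULL       -- e.g. 'user:42:credits', 'purchases'
);

CREATE TABLE entries (
    id          bigint    PRIMARY KEY,
    description varchar(200),
    created_at  timestamp NOT NULL
);

-- One row per debit or credit leg of an entry.
CREATE TABLE postings (
    id         bigint        PRIMARY KEY,
    entry_id   bigint        NOT NULL REFERENCES entries (id),
    account_id bigint        NOT NULL REFERENCES accounts (id),
    amount     decimal(12,2) NOT NULL  -- positive = debit, negative = credit
);
```

The balance of any account is then just SUM(amount) over its postings, and because rows are only ever inserted the tables double as the audit log.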
You asked for a Ruby on Rails open source solution that you could plug and play into your application. You can use Plutus. Here is an excerpt from the description of this project on Github:
The plutus plugin provides a complete double entry accounting system
for use in any Ruby on Rails application. The plugin follows general
Double Entry Bookkeeping practices. ... Plutus consists of tables that
maintain your accounts, entries and debits and credits. Each entry can
have many debits and credits. The entry table, which records your
business transactions is, essentially, your accounting Journal.
Yes, use both.
On top of that, you'll sometimes need to reverse a transaction or a group of transactions. When doing that, create a new reversing transaction to record the money transfer.
Sometimes you'll need to unify several transactions under one roof. I suggest creating a third table called 'tokens' that acts as the payments manager; you unify those grouped transactions under that token.
For example: token.transactions = (select * from transactions t where t.token = '123')
Current situation:
We have a BPMS (business process management suite) in place. There is increasing demand for historical and operational reports. The data model in the BPMS is not designed for historical queries, so we are analysing possible solutions.
Solution in mind:
The idea is to push data about events in the flow to an external database. Typical events in BPM are: a new process instance was created, a step in the process was performed, or the status of the process instance was changed. Besides the star schema, Data Vault is one of the interesting alternatives. Let's assume there are two hubs, PI (process item instances) and OU (organisational units), and a link table LINK_PI_OU. Each time a process item is assigned to an organisational unit, a new line is added to the link table. The LOAD_DATE in the link table contains the datetime when this record was added, and the record in the link table with the latest LOAD_DATE shows the current assignment of the process instance.
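In table form, the relevant part of the model described above would look roughly like this (simplified):

```sql
-- Sketch of the hubs and link described above (simplified).
CREATE TABLE HUB_PI (
    PI_HK      char(32)    NOT NULL PRIMARY KEY,  -- surrogate/hash key
    PI_BK      varchar(50) NOT NULL,              -- business key of the process item instance
    LOAD_DATE  timestamp   NOT NULL,
    RECORD_SRC varchar(20) NOT NULL
);

CREATE TABLE HUB_OU (
    OU_HK      char(32)    NOT NULL PRIMARY KEY,
    OU_BK      varchar(50) NOT NULL,              -- business key of the organisational unit
    LOAD_DATE  timestamp   NOT NULL,
    RECORD_SRC varchar(20) NOT NULL
);

-- One row per assignment of a process item instance to an organisational unit;
-- the row with the latest LOAD_DATE is the current assignment.
CREATE TABLE LINK_PI_OU (
    LINK_HK    char(32)    NOT NULL PRIMARY KEY,
    PI_HK      char(32)    NOT NULL REFERENCES HUB_PI (PI_HK),
    OU_HK      char(32)    NOT NULL REFERENCES HUB_OU (OU_HK),
    LOAD_DATE  timestamp   NOT NULL,
    RECORD_SRC varchar(20) NOT NULL
);
```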
Question:
Let's assume the business wants to know to whom all open process instances are currently assigned, grouped by organisational unit.
What would a query for this report look like? Can it really be performant?
Or am I on completely the wrong track?
In general terms, I didn't think that Data Vault is intended to be an end-user reporting layer or even a faux transactional system.
I'm not completely clear on your architecture, but in my understanding D-V is a historical repository that keeps all data for an enterprise and feeds a (Kimball/Inmon) data warehouse. So in high-level terms...
Transaction systems => D-V => DWH => (cubes =>) users
This being the case, I wouldn't pose queries to a Data Vault; instead I would write some ETL to populate a data warehouse and pose queries against the DWH.
The other view, I guess, is that you could build a set of views on top of the D-V that would hide the structure from users, but I think I'm a bit of a purist and would go for a DWH.
As #Marcud D said, Data Vault is a data warehouse model, and usually when using DV modelling you have to build data marts from the DV for reporting purposes. I think that the organizational unit should be modeled as a Satellite table, not as a Hub table. So, either way, you should build a query to feed a specific data mart from the DV model and then use that for reporting purposes.
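For illustration, a query feeding such a data mart (current assignment of all open process instances per organisational unit) might look roughly like the following, assuming the hubs and link from the question plus a SAT_PI satellite holding the instance status; all names are assumptions:

```sql
-- Sketch: take the latest link row per process instance, then group by OU.
SELECT  ou.OU_BK AS organisational_unit,
        COUNT(*) AS open_process_instances
FROM    LINK_PI_OU l
JOIN    (
            SELECT PI_HK, MAX(LOAD_DATE) AS LOAD_DATE
            FROM   LINK_PI_OU
            GROUP  BY PI_HK
        ) cur ON cur.PI_HK = l.PI_HK AND cur.LOAD_DATE = l.LOAD_DATE
JOIN    HUB_OU ou ON ou.OU_HK = l.OU_HK
JOIN    SAT_PI s  ON s.PI_HK  = l.PI_HK   -- simplification: assumes one current status row
WHERE   s.STATUS = 'OPEN'
GROUP BY ou.OU_BK;
```

In practice the satellite join needs its own latest-LOAD_DATE filter (or a PIT table), which is exactly why the answers above suggest feeding a dedicated data mart rather than reporting straight off the vault.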