Is it possible to have conditional OLAP dimension aggregators? - data-warehouse

I have a set of OLAP cubes, in the form of a snow-flake schema, each representing one factory.
I have three concepts that for some factories clearly behave as 3 dimensions, and for other factories clearly behave as 2 dimensions.
The concepts are always the same: "products", "sales agents" and "customers".
But for some cases, I doubt if I should model it as a purely 3 dimensional cube or I should play around with some tweak or trick with a 2 dimensional cube.
Cases A and B are the ones that are clear for me, and Case C is the one that generates my wonderings.
CASE A: Clearly a 3 dimensional cube
Any agent can sell any product to any company. Several agents are resposible together for the same set of customers.
I model this case as this:
CASE B: Clearly a 2 dimensional cube
Every agent is 'responsible' for a portfolio of customers, and he can sell any product but only to his customers. The analysis is made on 'current responsability on the portfolio' so if an agent leaves the company, all his customers are reassigned to a new agent and the customer uniquely belongs to the new agent.
I model this case as this:
CASE C: My doubts
A customer may have been assigned a single agent or a set of several agents each one being responsible for a ProductCategory.
For example:
Alice manages TablesAndWoods ltd and GreenForest ltd.
Bob manages Chairs ltd and FastWheels ltd.
Carol manages Forniture ltd ONLY for ProductType = 'machinery' and also manages FrozenBottles ltd for ANY type of product.
Dave also manages Forniture ltd but ONLY for ProductType = 'consumables' and also manages HighCeilings ltd for ANY type of product.
QUESTION:
In this example "Case C":
Are customer and agent independent dimensions because Forniture ltd has relation both to Caroland Dave, so it is a 3D cube?
Or it is a 2D cube, where agent is not an independent dimension, but it is an aggregator of customer "conditioned" somehow by the ProductCategory product aggregator?
I would like to see how would you model this.
Thanks in advance.

Here is how I would model it:
Your fact table is Sales.
Your dimensions are (probably) Date, Product, Customer and Agent. This is closest to your Case A.
Collapse your snowflake (white entities) into the dimensions. The presence of these entities suggest that you should consider whether type-2 slowly changing dimensions are needed for at-time analysis.
Consider a Bridge table to capture the many-to-many relationship between Agent and Product.

Related

Creating variable of team demographics- similarity/dissimilarity

I wish to create a variable depicting similarity with respect to race to team members. In other words, I want to know of the people an individual shares a manager with, what percentage of the team is of the same race?
The variables I currently have are participant id, participant race, manager id, manager race, and team size. I know the racial breakdown of the teams, but I need the percentage of similar others in a team for each participant (in one column, not across columns split by race).
First calculate the number of participants of the same race for each participants, then divide that in the team size to get the ratio.
In the following code I assume you have the team ID although you didn't mention that in your post:
aggregate out=* mode=addvariables /break team_id participant_race /NRaceInTeam=n.
compute PRaceInTeam=NRaceInTeam/Team_Size.

Neo4j Choice Values Design Decision

TLDR: single and multi-select choice fields: for a generic ETL tool should I model them as nodes connected with relationships or direct labels
Detail:
I am thinking through an ETL tool/workflow that I am building that pulls from an API source to create graphs on the fly. The api generally returns data structured as individual entries that belong to "lists" such as People, Company, City, etc. with individual attributes/properties also present on the entries.
I know that I want to create node labels for each 'List', BUT the thing is that multiple list types have single and multi-select choice fields on an individual entry (such as company type for the company list. Company type options might include institution, bank, school, investor, real estate company, etc. and a company can have one or many types).
I ideally would want a generic rule that can handle all choice fields across all lists when writing to the neo4j graph. (Another choice field is "Tier" for example, with Tier 1, Tier 2, Tier 3 and Tier 4 as potential values)
Would it be best practice to
create a node label for each choice field category (i.e. a node label called "CompanyType" with X instances of that node one for each company type with a "name" property equal to the company type), and then link the Company node to its one or many CompanyType nodes via a "company_type" edge
or
create a node label for each choice field option and apply that label to the Company node in question (So a Company node might then have 5-6 labels such as Company:Tier1:Tier2:Investor:Bank)

E-R diagram confusion

I am in the process of designing this E-R diagram for a shop of which I have shown part of below (the rest is not relevant). See the link please:
E-R diagram
The issue that I have is that the shop only sells two items, Socks and Shoes.
Have I correctly detailed this in my diagram? I'm not sure if my cardinalities and/or my design is correct. A customer has to buy at least one of these items for the order to exist (but has the liberty to buy any number).
The Shoe and Sock entities would have their respective ID attribute, and I am planning to translate to a relational schema like this:
(I forgot to add to my diagram the ORDER_CONTAINS relationship to have an attribute called "Quantity". )
Table: Order_Contains
ORDER_ID | SHOEID | SOCKID | QTY
primary key | FK, could be null |FK, could be null | INT
This clearly won't work since the Qty would be meaningless. Is there a way I can reduce the products to just two products and make all this work?
Having two one-to-many relationships combined into one with nullable fields is a poor design. How would you record an order containing both shoes and socks - a row per shoe with SOCKID set to NULL and vice-versa for socks, or would you combine rows? In the former case the meaning of QTY is clear though it depends on the contents of SHOEID/SOCKID fields, but what would the QTY mean in the latter case? How would you deal with rows where both SHOEID and SOCKID are NULL and the QTY is positive? Keep in mind Murphy's law of databases - if it can be recorded it will be. Worse, your primary key (ORDER_ID) will prevent you from recording more than one row, so a customer couldn't buy more than one (pair of) socks or shoes.
A better design would be to have two separate relations:
Order_Socks (ORDER_ID PK/FK, SOCKID PK/FK, QTY)
Order_Shoes (ORDER_ID PK/FK, SHOEID PK/FK, QTY)
With this, there's only one way to record the contents of an order and it's unambiguous.
You have not explained very well the context here. I'll try to explain from what I understand, and give you some hints.
Do your shop only and always (forever) sell 2 products? Do the details of these products (color, model, weight, width, etc...) need to be persisted in the database? If yes, then we have two entities in the model, SOCKS and SHOES. Each entity has its own properties. A purchase or a order is usually seen as an event on the ERD. If your customers always buys (or order) socks with shoes, then there will always be a link between three entities:
CLIENTS --- SHOES --- SOCKS
This connection / association / relationship is an event, and this would be the purchase (or order).
If a customer can buy separate shoes and socks, then socks and shoes are subtypes of a super entity, called PRODUCTS, and a purchase is an event between CUSTOMERS and PRODUCTS. Here in this case we have a partitioning relationship.
If however, your customers buy separate products, and your store will not sell forever only 2 products, and details of the products are not always the same and will not be saved as columns in a table, then the case is another.
Shoes and socks are considered products, as well as other items that can be considered in future. Thus, we have records/rows in a PRODUCTS table.
When a customer places an order (or a purchase), he (she) is acquiring products. There is a strong link between customers and products here, again usually an event, which would be the purchase (or a order).
I do not know if you do it, but before thinking of start a diagram, type the problem context in a paper or a document. Show all details present in the situation.
The entities are seen when they have properties. If you need to save the name of a customer, the customer's eye color, the customer's e-mail, and so on, then you will have certainly a CUSTOMER entity.
If you see entities relate in some way, then you have a relationship, and you should ask yourself what kind of relationship these entities form. In your case of products and customers, we have a purchasing relationship there between. The established relationship is a purchase (or an order, you call it). One customer can buy various products, and one product (not on the same shelf, is the type, model) can be purchased for several customers, thus, we have a Many-To-Many relationship.
The relationship created changes according to the context. Whatever, we'll invent something crazy here as examples. Say we have customers and products. Say you want to persist a situation where customers lick Products (something really crazy, just for you to see how the context says the relationship).
There would be an intimate connection between customers and products entities (really close... I think...). In this case, the relationship represents a history of customers licking products. This would generate an EVENT. In this event you could put properties such as the date, the amount of times a customer licked a proper product, the weather, the time, the traffic light color on the street, etc., only what you need to persist according to your context, your needs.
Remember that for N-N relationships created, we need to see if new entities (out of relationship) will emerge. This usually happens when you are decomposing the conceptual model to the logical model. Probably, product orders will generate not one but two entities: The ORDER and the products of orders. It is within the products of orders that you place the list of products ordered from each customer, and the quantity.
I would like to present various materials to study ERD, but unfortunately they are all in Portuguese. I hope I have helped you in some way. If you want to be more specific about your problem, I think I can really help you best. Anything, please ask.

Neo4J Grandchild Relationships Linked to Nodes

I'm trying to model contractor relationships in Neo4J and I'm struggling with how to conceptualize subcontracts. I have nodes for Government Agencies (label: Agency) and Contractors (label:Company). Each of these have geospatial Office nodes with the HAS_OFFICE relationship. I'm thinking of creating a node that represents a Government Contract (label: Contract).
Here's what I'm struggling with: A Contract has a Government Agency (I'm thinking this is a "HAS CONTRACT" relationship) and one or more prime contractor(s) (I'm thinking this is a "PRIME" relationship). Here's where it gets complicated. Each of those primes contractors can have subcontractors under the prime contract only. Graphically, this is:
(Agency) -[HAS_CONTRACT]-> (Contract) -[PRIME]-> (Company 1) -[SUB]-> (Company 2)
The problem I'm struggling with is that the [SUB] relationship is only for certain contracts -- not all. For example:
Agency 1 -HAS-> Contract ABC -P-> Company 1 -S-> Company 2
Agency 1 -HAS-> Contract ABC -P-> Company 3 -S-> Company 4
Agency 2 -HAS-> Contract XYZ -P-> Company 1
Agency 2 -HAS-> Contract XYZ -P-> Company 4 -S-> Company 2
I want some way to search on that so I can ask cypher questions like "Find ways Agency 2 can put money on contract with Company 2." It should come back with the XYZ contract through Company 4, and NOT the XYZ contract through Company 1.
It seems like maybe storing and filtering on data within the relationship would work, but I'm struggling with how. Can I say Prima and Sub relationships have a property, "contract_id" that must match Contract['id']? If so, how?
Edit: I don't want to have to specify the contract name for the query. Based on #MarkM's reply, I'm thinking something like:
MATCH (a:Agency)-[:HAS]-(c:Contract)-[:PRIME {contract_id:c.id}]
-(p:Company)-[:SUB {contract_id:c.id}]-(s:Company)
RETURN s
I'd also like to be able to use things like shortestPath to find the shortest path between an agency and a contractor that follows a single contract ID.
I'd create the subcontractor by having two relationships, one to the contractor and one to the contract.
(:Agency)-[:ISSUES]->(con:Contract)-[:PRIMARY]->(contractor:Company)
(con:Contract)-[:SECONDARY]->(subContractor:Company)<-[:SUBCONTRACTS]-(contractor:Company)
Perhaps you can mode your use-case as a graph-gist, which is a good way of documenting and discussing modeling issues.
This seems pretty simple; I apologize if I've misunderstood the question.
If you want subcontractors you can simply query:
MATCH (a:Agency)-[:HAS]-(:Contract)-[:PRIME]-(p:Company)-[:SUB]-(s:Company) RETURN s
This will return all companies that are subcontractors. The query matches the whole pattern. So if you want XYZ contract subcontractors you simply give it the parameter:
MATCH (a:Agency)-[:HAS]-(:Contract {contractID: XYZ})-[:PRIME]-(p:Company)-[:SUB]-(s:Company) RETURN s
You'll only get company 2.
EDIT: based on your edit:
"Find ways Agency 2 can put money on contract with Company 2"
This seems to require some domain-specific knowledge which I don't have. I assume Agency 2 can only put money on subcontractors but not primes?? I might help if you reword so we know exactly what your trying to get from the graph. From my reading it looks like you want all companies that are subcontractors under Company 2's contracts. Is that right?
If that's what you want, again you just give Neo the path:
MATCH (a:Agency: {AgencyID: 2)-[:HAS]
-(c:Contract)-[:PRIME]-(:Company)-[:SUB]-(s:Company: {companyID: 2)
RETURN c, s
This will give you a list of all contracts under XYZ for which Company 2 is a subcontractor. With the current example, it will one row: [c:Contract XYZ, s:Company 2]. If Agency 2 had more contracts under which Company 2 subcontracted, you would get more rows.
You can't do this: [:PRIME {contract_id:c.id}] [:SUB {contract_id:c.id}] because Prime and Sub relationships shouldn't have contract_id properties. They don't need them — the very fact that they are connected to a contract is enough.
One thing that might make this a little more complicated is if the subcontractors also have subcontractors, but that's not evident.
Okay take 2:
So the problem isn't captured well in the original example data — sorry for missing it. A better example is:
Agency 1 -HAS-> Contract ABC -P-> Company 1 -S-> Company 2
Agency 1 -HAS-> Contract XYZ -P-> Company 1 -S-> Company 3
Now if I ask
MATCH (a:Agency)-[:HAS_CONTRACT]-(ABC:Contract {id:ABC})-[:PRIME]
-(c:Company)-[:SUBS]-(c2) RETURN c2
I'll get both Company 2 and 3 even though only 2 is on ABC, Right?
The problem here is the data model not the query. There's no way to distinguish a company's subs because they are all connected directly to the company node. You could put a property on the sub relationship with the prime ID, but a better way that really captures the information is to add another contract node under company. Whether you label this as a different type depends on your situation.
Company1 then [:HAS] a contract which the subs are connected to. The contract can then point back to the prime contract with a relationship of something like [:PARENT] or [:PRIME] or maybe from the prime to the sub with a [:SUBCONTRACT] relationship
Now everything becomes much easier. You can find all subcontracts under a particular contract, all subcontracts a particular company [:HAS], etc. To find all subcontractors under a particular contract you could query something like this:
MATCH (contract:Contract { id:"ContractABC" })-[:PRIME]-(c:Company)
-[:HAS]->(subcontract:Contract)-[:PARENT]-(contract)
WITH c, subcontract
MATCH (subcontract)-[:SUBS]-(subcontractor:Company)
RETURN c, subcontractor
This should give you a list of all companies and their subcontractors under contract ABC. In this case Company 1, Company 2 (but not company 3).
Here's a console example: http://console.neo4j.org/?id=flhv8e
I've left the original [:SUB] relationships but you might not need them.

Should I flatten multiple customer into one row of dimension or using a bridge table

I'm new to datawarehousing and I have a star schema with a contract fact table. It holds basic contract information like Start date, end date, amount ...etc.
I have to link theses facts to a customer dimension. there's a maximum of 4 customers per contract. So I think that I have two options either I flatten the 4 customers into one row for ex:
DimCutomers
name1, lastName1, birthDate1, ... , name4, lastName4, birthDate4
the other option from what I've heard is to create a bridge table between the facts and the customer dimension. Thus complexifying the model.
What do you think I should do ? What are the advantages / drawbacks of each solution and is there a better solution ?
I would start by creating a customer dimension with all customers in it, and with only one customer per row. A customer dimension can be a useful tool by itself for CRM and other purposes and it means you'll have a single, reliable list of customers, which makes whatever design you then implement much easier.
After that it depends on the relationship between the customer(s) and the contract. The main scenarios I can think of are that a) one contract has 4 customer 'roles', b) one contract has 1-4 customers, all with the same role, and c) one contract has 1-n customers, all with the same role.
Scenario A would be that each contract has 4 customer roles, e.g. one customer who requested the contract, a second who signs it, a third who witnesses it and a fourth who pays for it. In that case your fact table will have one row per contract and 4 customer ID columns, each of which references the customer dimension:
...
RequesterCustomerID int,
SignatoryCustomerID int,
WitnessCustomerID int,
BillableCustomerID int,
...
Of course, if one customer is both a requester and a witness then you'll have the same ID in both RequesterCustomerID and WitnessCustomerID because you only have one row for him in your customer dimension. This is completely normal.
Scenario B is that all customers have the same role, e.g. each contract has 1-4 signatories. If the number of signatories can never be more than 4, and if you're very confident that this will 'always' be true, then the simple solution is also to have one row per contract in the fact table with 4 columns that reference the customer dimension:
...
SignatoryCustomer1 int,
SignatoryCustomer2 int,
SignatoryCustomer3 int,
SignatoryCustomer4 int,
...
Even if most contracts only have 1 or 2 signatories, it's not doing much harm to have 2 less frequently used columns in the table.
Scenario C is where one contract has 1-n customers, where n is a number that varies widely and can even be very large (class action lawsuit?). If you have 50 customers on one contract, then adding 50 columns to the fact table becomes difficult to manage. In this case I would add a bridge table called ContractCustomers or whatever that links the fact table with the customer dimension. This isn't as 'neat' as the other solutions, but a pure star schema isn't very good at handling n:m relationships like this anyway.
There may also be more complex cases, where you mix scenarios A and C: a contract has 3 requesters, 5 signatories, 2 witnesses and the bill is split 3 ways between the requesters. In this case you will have no choice but to create some kind of bridge table that contains the specific customer mix for each contract, because it simply can't be represented cleanly with just one fact and one dimension table.
Either way can work but each solution has different implications. Certainly you need customer and contract tables. A key question is: is it always a maximum of four or may it eventually increase beyond that? If it will stay at 4, then you can have a repeating group of 4 customer IDs in the contract. The disadvantage of this is that it is fixed. If a contract does not have four, there are some empty spaces. If, however, there might be more than 4, then the only viable solution is to use a bridge table because in a bridge table you add more customers by inserting new rows (with no change to the table structure). In the fixed solution, in this case you add more than 4 customers by altering the table. A bridge table is an example of what, for many decades now, ER modeling has called an associative entity. It is the more flexible of the two solutions. However, I worked on a margin system once wherein large margin amounts needed five levels of manager approval. It has been five and will always be five, they told me. Each approving manager represented a different organizational level. In this case, we used a repeating group of five manager IDs, one for each level, and included them in the trade. So it is important to understand the current business rules and the future outlook.

Resources