Is a Sales Transaction modeled as a Hub or a Link in Data Vault 2.0?

I'm a rookie in Data Vault, so please excuse my ignorance. I'm currently ramping up on Data Vault 2.0 and modeling a Raw Data Vault in parallel. I have a few assumptions and need help validating them.
1) Individual Hubs are modeled for:
a) Product (holds pk-Product_Hkey, BK, metadata),
b) Customer (holds pk-Customer_Hkey, BK, metadata),
c) Store (holds pk-Store_Hkey, BK, metadata).
Now a Sales Txn that involves all of the above Business Objects should be modeled as a Link table:
d) Link table - Sales_Link (holds pk-Sales_Hkey, Sales Txn ID, Product_Hkey (fk), Customer_Hkey (fk), Store_Hkey (fk), metadata), and a Satellite needs to be associated with the Link table holding some descriptive data about the Link.
Is the above approach valid?
My rationale for the above Link table is that I consider the Sales Txn ID a non-BK, and hence the Sales Txn must be hosted in a Link as opposed to a Hub.
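For concreteness, here is a minimal sketch (my own illustration, not part of the design above) of how hash keys for such a structure are commonly derived in Data Vault 2.0: each hub key hashes a single business key, while the link key hashes the combination of participating business keys plus the Sales Txn ID carried as a degenerate key. Key values and the delimiter are made up.

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """DV 2.0-style hash key: MD5 over upper-cased, delimiter-joined business key parts."""
    normalized = "||".join(bk.strip().upper() for bk in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hub hash keys: one business key each (values are made up).
product_hkey  = hash_key("PRD-001")
customer_hkey = hash_key("CUST-042")
store_hkey    = hash_key("STORE-07")

# Link hash key: all participating business keys plus the Sales Txn ID
# carried on the link as a degenerate / dependent-child key.
sales_lkey = hash_key("PRD-001", "CUST-042", "STORE-07", "TXN-0001")
```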
2) The operational data has different types of customers (Retail, Professional). All customers (agnostic of type) should be modeled in one Hub, and the distinction between customer types should be made by modeling different Satellites (one for Retail, one for Professional) tied to the Customer hub.
Is the above valid?
I have researched online technical forums but found conflicting theories, so I'm posting it here.
There is no code applicable here.

I would suggest modeling sales as a Hub if you are fine with the points below; otherwise a Link is a perfectly good design.
Sales transaction as a hub (Sales_Hub):
What is the business key? Can you consider the "Sales Txn ID" (a unique number) as a BK?
Is this hub, or the same BK, used in another Link (other than Sales_Link), i.e. link-on-link?
Are you OK with a Sales_Link that has no satellite, since all the descriptive data would live on Sales_Hub's satellite?
Also, it will store the same BK + audit metadata in two places (Hub and Link) and require additional joins to fetch data from the Hub's satellite.
Question 2 is valid when the customer information (Retail, Professional, etc.) is stored in separate tables at the source system(s).
If the data comes through a single source table, you should model one satellite and then apply soft rules to bifurcate customers into their types in the Business Data Vault, as sketched below.
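A minimal sketch of such a soft rule, assuming a single staged customer feed with a customer_type column (the column names and values are assumptions):

```python
# Hypothetical soft rule: one staged customer feed split into two
# type-specific structures for the Business Data Vault.
staged_customers = [
    {"customer_bk": "CUST-001", "customer_type": "RETAIL", "name": "Alice"},
    {"customer_bk": "CUST-002", "customer_type": "PROFESSIONAL", "name": "Bob Ltd"},
]

sat_customer_retail, sat_customer_professional = [], []
for row in staged_customers:
    # Route each row to the satellite load matching its type.
    if row["customer_type"] == "PROFESSIONAL":
        sat_customer_professional.append(row)
    else:
        sat_customer_retail.append(row)
```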

Datawarehouse design

I am going to design a data warehouse (although it's not an easy process). I am wondering, throughout the ETL process, how the data in the data warehouse is going to be extracted/transformed into a Data Mart?
Is there any model design difference between a data warehouse and a data mart? Also, usually star schema or snowflake? So should we place the tables like the following:
In Datawarehouse
dim_tableA
dim_tableB
fact_tableA
fact_tableB
And in Datamart A
dim_tableA (full copy from datawarehouse)
fact_tableA (full copy from datawarehouse)
And in Datamart B
dim_tableB (full copy from datawarehouse)
fact_tableB (full copy from datawarehouse)
Is there a real-life example which can demonstrate the model difference between a data warehouse and a data mart?
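As a toy illustration only (the table names follow the layout above; the region filter and columns are invented), a data mart does not have to be a full physical copy; it can be a filtered view or extract of the warehouse tables:

```python
import sqlite3

# Toy warehouse with one dimension and one fact table (schema is invented).
wh = sqlite3.connect(":memory:")
wh.executescript("""
    CREATE TABLE dim_tableA (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_tableA (product_id INTEGER, sales_amount REAL, region TEXT);
""")

# "Datamart A" built as views over the warehouse tables instead of full copies:
# the dimension is exposed as-is, the fact is filtered to the mart's scope.
wh.executescript("""
    CREATE VIEW martA_dim_tableA  AS SELECT * FROM dim_tableA;
    CREATE VIEW martA_fact_tableA AS
        SELECT product_id, sales_amount FROM fact_tableA WHERE region = 'EMEA';
""")
```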
I echo both of Nick's responses, and in a more technical way, following the Kimball methodology:
In my opinion and experience, at a high level we have data marts like Service Analytics, Financial Analytics, Sales Analytics, Marketing Analytics, Customer Analytics, etc. These are grouped as below:
Subject Areas -> Logical grouping (star modelling) -> Data Marts -> Dimensions & Facts (as per Kimball)
Example:
AP Real Time -> Supplier, Supplier Transactions, GL Data -> Financial Analytics + Customer Analytics -> Physical Tables
Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department. ... A data warehouse is a large centralized repository of data that contains information from many sources within an organization.
Depending on their needs, companies can use multiple data marts for different departments and opt for data mart consolidation by merging different marts to build a single data warehouse later. This approach is called the Kimball Dimensional Design Method. Another method, called The Inmon Approach, is to first design a data warehouse and then create multiple data marts for particular services as needed.
An example: in a data warehouse, email clicks are recorded based on a click date, with the email address being just one of the click parameters. For a CRM expert, the email address (or any other customer identifier) will be the entry point: for each contact, the frequency of clicks, the date of the last click, etc.
The data mart is a prism that adapts the data to the user. Its keys to success depend a lot on the way the data is organized: the more understandable it is to the user, the better the result. This is why the name of each field and its calculation method must match the business's usage as closely as possible.
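To make the email-click example concrete, here is a small sketch (invented column names and sample data) of the same click events re-keyed per contact, mart-style:

```python
from collections import defaultdict
from datetime import date

# Warehouse-style click events, keyed by click date (invented sample data).
clicks = [
    {"click_date": date(2024, 1, 3), "email": "a@example.com"},
    {"click_date": date(2024, 1, 9), "email": "a@example.com"},
    {"click_date": date(2024, 1, 5), "email": "b@example.com"},
]

# Mart-style view: one row per contact with click frequency and last click date.
per_contact = defaultdict(lambda: {"click_count": 0, "last_click": None})
for c in clicks:
    entry = per_contact[c["email"]]
    entry["click_count"] += 1
    if entry["last_click"] is None or c["click_date"] > entry["last_click"]:
        entry["last_click"] = c["click_date"]
```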

Microservices (Application-Level joins) more API calls - leads to more latency?

I have 2 microservices, one for Orders and one for Customers,
exactly like the example below:
http://microservices.io/patterns/data/database-per-service.html
Which works without any problem.
I can list Customer data and Order data based on an input CustomerId.
But now there is a new requirement to develop a new screen
which shows Orders for an input Date and shows the CustomerName beside each Order's information.
When going to implementation,
I can fetch the list of Orders for the input Date,
but to show the corresponding CustomerNames based on the list of CustomerIds,
I make multiple API calls to the Customer microservice, each call sending a CustomerId to get the CustomerName,
which leads to more latency.
I know the above solution is a bad one.
So any ideas, please?
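For reference, this is roughly the per-order lookup pattern described above (the service URLs and field names are invented); the one-call-per-CustomerId loop is what multiplies the latency:

```python
import requests

# Hypothetical service URLs and fields: one HTTP round trip per order's
# customer, i.e. the N+1 pattern that causes the latency.
orders = requests.get(
    "http://orders-service/orders", params={"date": "2024-01-01"}
).json()
for order in orders:
    customer = requests.get(
        f"http://customers-service/customers/{order['customerId']}"
    ).json()
    order["customerName"] = customer["name"]
```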
The point of a microservices architecture is to split your problem domain into (technically, organizationally and semantically) independent parts. Making the "microservices" glorified (apified) tables actually creates more problems than it solves, if it solves any problem at all.
Here are a few things to do first:
List architectural constraints (i.e. the reason for doing microservices). Is it separate scaling ability, organizational problems, making teams independent, etc.?
List business-relevant boundaries in the problem domain (i.e. parts that theoretically don't need each other to work, or don't require synchronous communication).
With that information, here are a few ways to fix the problem:
Restructure the services based on business boundaries instead of technical ones. This means not using tables or layers or other technical stuff to split functions. Services should be a complete vertical slice of the problem domain.
Or as a work-around create a third system which aggregates data and can create reports.
Or if you find there is actually no reason to keep the microservices approach, just do it in a way you are used to.
The new requirement needs data from across domains.
Below are the options:
1. Fetch the customer Id and Name on every call. The issue is latency, as there would be multiple round trips.
2. Keep a cache of CustomerName by Id in the Order service (I am assuming there is a finite number of customers). The issue would be when to refresh or invalidate the cache; for that you may need to expose a REST call to invalidate entries. For new customers which are not in the cache, go and fetch from the DB and update the cache for the future (see the sketch after this list).
3. Use CQRS, in which all the needed data (Orders, Customers, etc.) goes to a separate read-side table. In this schema you can create a composite SQL query, which removes the round trips.
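Here is a minimal sketch of option 2, the in-process cache (the endpoint URL and response fields are assumptions, and a TTL stands in for an explicit invalidation call):

```python
import time
import requests

_CACHE = {}          # customer_id -> (customer_name, cached_at)
_TTL_SECONDS = 300   # crude stand-in for an explicit invalidation endpoint

def customer_name(customer_id):
    """Return the customer's name, calling the Customer service only on a cache miss."""
    cached = _CACHE.get(customer_id)
    if cached and time.time() - cached[1] < _TTL_SECONDS:
        return cached[0]
    # Hypothetical Customer-service endpoint and response shape.
    resp = requests.get(f"http://customers-service/customers/{customer_id}")
    name = resp.json()["name"]
    _CACHE[customer_id] = (name, time.time())
    return name
```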

SurveyMonkey: Where are QType and respondent_id in the get_survey_details extract?

I'm trying to replicate the SurveyMonkey relational database format (a relational database view of your data, with a separate file created for each database table; knowledge of SQL (Structured Query Language) is necessary) in order to download responses for our reporting analytics using the SurveyMonkey API. However, I'm not able to find the QType and respondent_id data in the get_survey_details API extract method. Can someone help?
1. QType is found in the Questions.xls data in the current relational database format download.
I was able to find all of the other data in the Questions.xls data in the get_survey_details API (question_id, page_id, position, heading), but not QType.
2. respondent_id is found in the Responses.xls data in the relational database format download.
I can see that respondent_id is in the get_responses API method, but that does not have the associated Key1 data that I also need. The Key1 data is the answer_id data in the get_survey_details API, which is why I expected to find the corresponding respondent_id there as well.
SurveyMonkey's deprecated relational database download (RDD) format and API provide data using very different paradigms. Using the API to recreate the RDD format in order to work with an old integration is probably a poor use of time. A more productive idea would be to use the API to build a more modern integration from the ground-up taking advantage of things like real-time data availability to modernize the functionality. But if you're determined:
You will need to map the family and subtype of the question type to the QTypes you're used to. The information you need to build the mapping can be found on SurveyMonkey's developer portal in Data Types.
get_responses returns answer_id as row and/or col. For matrix question types, you will have both, which cross-reference to answers and answer items from get_survey_details. For matrix questions, you might consider concatenating the row and col to create a single unique key value like the Key1 you're accustomed to.
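As an illustration only (the QType labels are placeholders and the answer shape is assumed, not the documented API contract), the mapping and the Key1 construction might look like this:

```python
# Placeholder mapping from (family, subtype) to legacy QType labels;
# the real pairs and labels come from the Data Types page on the developer portal.
QTYPE_BY_FAMILY_SUBTYPE = {
    ("single_choice", "vertical"): "QTYPE_PLACEHOLDER_1",
    ("matrix", "rating"): "QTYPE_PLACEHOLDER_2",
}

def key1_from_answer(answer):
    """Build a single Key1-style value from a get_responses answer entry.

    Assumes the entry carries 'row' and/or 'col' ids, as described above.
    """
    row = answer.get("row", "")
    col = answer.get("col", "")
    return f"{row}_{col}" if col else str(row)
```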
I've done this; it got us over the immediate need when the RDD format was withdrawn.
Now that I have more time, I'm looking at a better design, but as always, backwards compatibility with a large code base is the drag.
To answer your question on Qtype, see my reply at
What are the expected values for the various "ENUM" types returned by the SurveyMonkey API?

Node state tracking / logging using Neo4j

I'm exploring potential use cases for neo4j, and I find that the relationship model is great, but I'm curious if the database can support something along the lines of a business transaction log.
For instance, a video rental store:
Customer A rents Video A on 01/01/2014
Customer A returns Video A on 01/20/2014
Customer B rents Video A on 01/25/2014
Customer B returns video A on 02/15/2014
Customer C rents Video A on 03/10/2014
etc...
The business requirement would be to track all rental transaction relationships relating to the Video A node.
This seems to be technically possible. Would one create a new relationship for every time that a new rental occurs? Are there better ways to approach this? Is this a misuse of the technology?
Nice! This is the exact use case that led me to develop FlockData (github link). FD uses Neo4j to track event-type activity against a domain document (the Rental in your example), and then uses Tags to create nodes that represent the metadata associated with the domain doc (Movie/Person). You have an event node for each change in state of the Rental. There are a couple of graphs over on LinkedIn showing "User Created", "User Approved" and "User Audited".
FD uses 3 databases to achieve its goals - Neo4j for the network of relationships, KV store for the bulky data (Redis or Riak) and ElasticSearch to let users find their Business Context Document (the Rental) via free text.
In terms of your specific question, exercise caution with nodes that have a lot of relationships. Check out this article on modelling dates. Peter Neubauer has a similar article somewhere in the Neo4j docs.
I'd look at it depending on what you're trying to get out of it. If you're looking to develop a recommendation engine, or to see the relationships between users and/or movies, a graph DB is a pretty natural solution. If you're looking at tracking the state changes of Video A over time, a temporal database is modeled for that (http://en.wikipedia.org/wiki/Temporal_database). For a straight-up transactional system, a traditional relational database will work easily. Personally, I think you'll have better options with a graph DB. In your example, you would have 3 customer nodes, 1 video node, 3 relationships of type :RENTS and 2 of :RETURNS. You'd want to make sure that your property model supports the same user re-renting the same movie (store the dates in an array, not a single value). Just some thoughts...
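A small sketch using the official Neo4j Python driver (the connection details, labels, and property names are assumptions), recording each rental as its own :RENTS relationship with a date property; the array-of-dates variant mentioned above would instead append to a property on a single relationship:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # One :RENTS relationship per rental event, carrying its own date,
    # so repeat rentals of the same video by the same customer stay distinct.
    session.run(
        """
        MERGE (c:Customer {name: $customer})
        MERGE (v:Video {title: $video})
        CREATE (c)-[:RENTS {on: date($rented_on)}]->(v)
        """,
        customer="Customer A", video="Video A", rented_on="2014-01-01",
    )
driver.close()
```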

Solr Join - getting data from different index

I'm working on a project where we have 2 million products and 50 clients with different pricing schemes. Indexing 2M x 50 records is not an option at the moment. I have looked at Solr joins and cannot get them to work the way I want them to. I know it's like a self join, so I'm kinda skeptical it would work, but here it is anyway.
Here is the sample schema:
core0 - product
core1 - client
So given a client id, I want to display all bags manufactured by Samsonite sorted by lowest price.
If there's a better way of approaching this, I'm open to redesigning the existing schema.
Thank you in advance.
Solr is not a relational database. You should take a look at the sharding feature and split your indexes. Also, you could write custom plugins to compute the price data based on the client's id/name/whatever at index time (BAD: you'll still get a product replicated for each client).
How we do it (so you can get an example):
clients are handled by SQLite
products are stored in Solr with their "base" price
each client has a "pricing rule" applied via a custom query handler when they query the db (it's just a value modifier; sketched below)
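A rough application-side sketch of that idea (the Solr core, field names, and pricing-rule table are assumptions; the setup above applies the modifier inside a custom query handler instead): fetch base prices from Solr, look up the client's rule in SQLite, and apply it before sorting:

```python
import sqlite3
import requests

def products_for_client(client_id):
    # The client's pricing rule lives outside Solr (here: a multiplier in SQLite).
    db = sqlite3.connect("clients.db")
    (multiplier,) = db.execute(
        "SELECT price_multiplier FROM pricing_rules WHERE client_id = ?", (client_id,)
    ).fetchone()

    # Products and their "base" price come from Solr (field names are assumptions).
    resp = requests.get(
        "http://localhost:8983/solr/product/select",
        params={"q": "category:bags AND manufacturer:Samsonite", "rows": 100, "wt": "json"},
    )
    docs = resp.json()["response"]["docs"]

    # Apply the client's rule as a value modifier, then sort by the adjusted price.
    for doc in docs:
        doc["client_price"] = doc["base_price"] * multiplier
    return sorted(docs, key=lambda d: d["client_price"])
```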
