Dimensional Modeling - Outrigger dimension for geography

Currently, I'm working on dimensional modeling and have a question about an outrigger dimension.
The company trades as a broker between customers and suppliers.
For a fact table, "Fact Trades", we include dimCustomer and dimSupplier.
Each of these dimensions has an address.
My question is whether it is correct to create outrigger dimensions that refer to geography. That way we could measure how much we have delivered from an origin and how much we have delivered to a city.
I am curious as to what best practice is. I hope you can help explain how this should be modelled correctly and why.
I hope my question was clear and that I have posted it in the correct place.
Thanks in advance.

I can think of at least 3 possible options; your particular circumstances will determine which is best for you:
If you often filter your fact by geography but without needing company/person information (i.e. how many trades were between London and New York?) then I would create a standalone geography dimension and link it directly to your fact (twice - for customer and supplier); see the sketch after these options. This also doesn't stop you from having geographic attributes in your customer/supplier Dims, as a dimensional model is not normalised
If geographic attributes change at a significantly more frequent rate than the customer/supplier attributes, and the customer/supplier Dims have a lot of attributes, then it may be worth creating an outrigger dim for the geographical attributes - as this reduces the maintenance required for the customer/supplier Dims. However, given that most companies/people rarely change their address, this is probably unlikely
Keep the geographical attributes in the customer/supplier Dims. I would probably do this anyway even if also picking option 1 above
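To make option 1 concrete, here is a minimal sketch using an in-memory SQLite database from Python. All table and column names (FactTrades, DimGeography, OriginGeographyKey, DestinationGeographyKey, etc.) are illustrative assumptions rather than a prescribed naming standard; the point is only that a single geography dimension can be joined twice, once per role.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
CREATE TABLE DimGeography (
    GeographyKey INTEGER PRIMARY KEY,
    City TEXT,
    Country TEXT
);
CREATE TABLE FactTrades (
    TradeKey INTEGER PRIMARY KEY,
    CustomerKey INTEGER,
    SupplierKey INTEGER,
    OriginGeographyKey INTEGER REFERENCES DimGeography(GeographyKey),
    DestinationGeographyKey INTEGER REFERENCES DimGeography(GeographyKey),
    Quantity REAL
);
""")

cur.executemany("INSERT INTO DimGeography VALUES (?, ?, ?)",
                [(1, "London", "UK"), (2, "New York", "USA")])
cur.executemany("INSERT INTO FactTrades VALUES (?, ?, ?, ?, ?, ?)",
                [(1, 10, 20, 1, 2, 500.0),
                 (2, 11, 21, 1, 2, 250.0),
                 (3, 12, 22, 2, 1, 100.0)])

# "How much have we delivered from London to New York?"
# The same dimension is joined twice, once per role (origin and destination).
cur.execute("""
SELECT SUM(f.Quantity)
FROM FactTrades AS f
JOIN DimGeography AS o ON o.GeographyKey = f.OriginGeographyKey
JOIN DimGeography AS d ON d.GeographyKey = f.DestinationGeographyKey
WHERE o.City = 'London' AND d.City = 'New York'
""")
print(cur.fetchone()[0])  # 750.0
```

Whether you also keep denormalised city/country columns on dimCustomer and dimSupplier (option 3) is independent of this; the two approaches can coexist.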
Just out of interest - do customer and supplier have significantly different sets of attributes (I assume they are both companies or people)? Is it necessary to create separate Dims for them?


Is this problem suitable for machine learning - brain.js?

The problem I would like to solve is how to choose the best seats on a train based on some ordered user preferences, e.g. whether they'd like a seat facing forwards, backwards (or don't care), whether they'd like a seat at a table or not, whether they need to be near a toilet, luggage rack, buffet car, or near the door. Window / aisle seat. Whether they want the aisle to the left or the right (can be very important for someone with a stiff knee!).
Most customers will specify one or two preferences, others may specify more. For some, being near the toilet might be the most important factor, for others having that table to work at might be the most important.
There may be more than one passenger (although they will share preferences). These should be sat as close to each other as possible. 2 passengers would ideally be sat next to each other, or opposite each other at a table seat. A group of 8 passengers might best be split into 2 groups of 4 or 4 groups of 2...
Position is defined by carriage number (seats in the same carriage are better than seats in different carriages) and by x/y coordinate within that carriage - so it's easy enough to calculate the distance between any pair of seats, but a BIG job to calculate distances between EVERY pair of seats...
Each [available] seat (pre-filtered by ticket class) will have the above attributes either defined or set to NULL (for unknown - seat facing is often unknown).
So for training I can provide a vast array of example trains and customer preferences with the best balance of preferences versus position.
For execution I want to provide a run-time-specific array of seats with attributes, a set of user preferences and a set of weightings for those preferences (e.g. passenger 1 thinks being near the toilet is most important, passenger 2 thinks having a table is most important, passenger 3 thinks being in the quiet carriage is...) and finally the number of passengers.
Output will be an array of seats (one per passenger) that strike the best compromise between matching as many customer preferences as possible (usually not possible to match all preferences) and keeping the seats fairly close to each other.
e.g. We might be able to match 2 preferences with seats 2 rows apart, but match 3 preferences with seats 10 rows apart...
Obviously distance will need a weighting just like the individual preferences, and that weighting is what decides between those two outcomes. I suppose a distance not greater than X just becomes one more customer preference...
I've not done any ML work before, so it's all going to be a learning exercise for me. I wish I had the time to just play and see what comes out, but I don't. I'm happy to do that, but I need to have a reasonable expectation of a positive result, otherwise I'll have to focus on a more traditional approach. Limited time and all that...
So, my questions are:
Is this a suitable problem for machine learning?
If so, is brain.js a good choice, or is something else more suitable? AWS ML service perhaps?
Any advice on how to organise all my data into something suitable for an ML engine to process?
Machine Learning is good at finding hidden patterns in complex data. In your case, you would need a lot of data where user preferences are already matched with optimal seating arrangements.
You could then try to see if the ML model can actually make optimal seating arrangements by itself. It’s an interesting problem but it may also lead to unexpected seating :)
If you don’t have training data you could collect it live, by registering where people sit down, knowing their preferences.
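If it helps to picture what "preferences already matched with seating arrangements" would look like as training data, here is a rough, language-agnostic sketch (written in Python for brevity rather than JavaScript): each (seat, passenger-preferences) pair is flattened into a numeric vector with a single suitability label, which is the shape of data most supervised learners expect. All field names, the 0.5-for-unknown convention and the label value are assumptions for illustration.

```python
def encode_seat(seat, prefs):
    """Flatten one seat and one passenger's preferences into a numeric vector."""
    def tri(value):
        # True -> 1.0, False -> 0.0, unknown/NULL -> 0.5
        return 0.5 if value is None else float(value)

    return [
        tri(seat.get("facing_forward")),
        tri(seat.get("table")),
        tri(seat.get("window")),
        min(seat.get("rows_to_toilet", 0), 20) / 20.0,   # crude 0..1 normalisation
        prefs.get("facing_forward", 0.0),                 # preference weights, 0..1
        prefs.get("table", 0.0),
        prefs.get("window", 0.0),
        prefs.get("near_toilet", 0.0),
    ]

# One labelled training example: the input is the encoded seat/preference pair,
# the output is how good that assignment turned out to be, taken from historical
# or hand-labelled data. A neural net library would train on many of these.
example = {
    "input": encode_seat(
        {"facing_forward": True, "table": None, "window": True, "rows_to_toilet": 4},
        {"table": 1.0, "window": 0.5},
    ),
    "output": [0.7],
}
print(example["input"], example["output"])
```

Collecting enough of these labelled examples is usually the hard part, which is why observing where people actually sit, as suggested above, is a sensible way to bootstrap.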

Data Warehousing - Star Schema vs Flat Table

I'm trying to design a Data Warehouse as a single store of commonly required data ranging from finance systems and project scheduling systems to a myriad of scientific systems, i.e. many different data marts.
I have been reading up on Data Warehousing and popular methods such as star schemas and the Kimball methodology, but one question I cannot find an answer to is:
Why is it better to design your DW Data Mart as a star schema rather than a single flat table?
Surely having no joins between facts and attributes/dimensions is faster and simpler than having lots of small joins to all the dimension tables? Disk space is not a problem, we'll just throw more disks at the database if necessary. Is the star schema slightly outdated these days or is it still data architect dogma?
Your question is very good: the Kimball mantra for dimensional modelling is to improve performance and to improve usability.
But I don't think it is outdated or dogma; it is a reasonable, practical approach for many situations and platforms.
The way relational DBs store data means there's a balancing act to be struck between the numbers and types of tables, the routes in to the data for typical queries, easy maintainability and description of relationships between data, the numbers of joins, the way the joins are constructed, the indexability of columns, etc.
3NF (or further) is one end of the spectrum, suiting OLTP systems, and a single table is the other end of the spectrum. Dimensional models are in the middle and appropriate for reporting, at least when using certain technologies.
Performance isn't all about 'number of joins', although a star schema performs better for reporting workloads than a fully normalised database, in part because of a reduced number of joins. Dimensions are typically very wide. If you include all those dimension fields in every row of every fact, you have very large rows indeed, and finding your way into those rows will perform very badly for typical queries.
Facts are numerous, so if you can make those tables compact, with the 'wordier' dimensions filterable, you hit a sweet spot of performance that a single table isn't going to match, unless heavily indexed.
And yes, a single table for a fact is simpler in terms of the number of tables, but is it really easier to navigate? Dimensions and facts are easy concepts to understand, and what if you want to cross your queries across facts? You've got many different data marts, but one of the benefits of having a data warehouse in the first place is that these aren't distinct: they're related and can be reported across. Conformed dimensions enable this.
If you combine your fact and dimensions into a single table, you'll either lose the visibility into dimension attributes that have never been used, or your measures will be thrown off by inclusion of a dummy event for the unused dimension attribute.
For example, a restaurant menu is a dimension and the purchased food is a fact. If you combined these into one table, how would you identify which food has never been ordered? For that matter, prior to your first order, how would you identify what food was available on the menu?
The dimension represents the possibilities, the fact represents the realization of the possibilities.
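A tiny illustration of that menu example, sketched with SQLite from Python (the table names DimMenuItem and FactOrder are made up for the example): with a separate dimension, "never ordered" is a simple anti-join, whereas a single combined table has no row at all for food that was never sold.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE DimMenuItem (MenuItemKey INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE FactOrder   (OrderKey INTEGER PRIMARY KEY, MenuItemKey INTEGER, Amount REAL);
""")
cur.executemany("INSERT INTO DimMenuItem VALUES (?, ?)",
                [(1, "Soup"), (2, "Steak"), (3, "Trifle")])
cur.executemany("INSERT INTO FactOrder VALUES (?, ?, ?)",
                [(1, 1, 4.50), (2, 2, 18.00)])

# Menu items with no matching fact rows: on the menu, never ordered.
cur.execute("""
SELECT m.Name
FROM DimMenuItem AS m
LEFT JOIN FactOrder AS o ON o.MenuItemKey = m.MenuItemKey
WHERE o.OrderKey IS NULL
""")
print(cur.fetchall())  # [('Trifle',)]
```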
Combining facts and dimensions in the same table limits the scalability and the flexibility.
Suppose that one day the business decides to change a dimension description (for example, the product name). Dimension tables aren't as deep as the fact tables, so the update process or SCD management should be easier and less resource-intensive.
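To make that update-cost point concrete, compare the two statements below (table and column names are illustrative only): renaming a product touches one narrow dimension row in a star schema, but potentially millions of wide rows in a flat table.

```python
# Illustrative SQL only; the table names DimProduct and FlatSales are assumptions.
rename_in_star = """
UPDATE DimProduct
SET    ProductName = 'New Name'
WHERE  ProductKey  = 42;            -- one narrow dimension row
"""

rename_in_flat = """
UPDATE FlatSales
SET    ProductName = 'New Name'
WHERE  ProductName = 'Old Name';    -- potentially millions of wide fact rows
"""
```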

Neo4j Relationship design

Revisiting Neo4j after a long absence. I have read a lot of articles but still find I have a few questions to get me going again....
Bidirectional relationships
I have a “connected to”-type scenario where 2 nodes are connected to each other. In fact, the idea is to model a type of flow. However, the flow in both directions is not always the same. I’m uncertain of the best method to use: 1 relationship with 2 properties or 2 distinct relationships?
The former feels like the comfortable choice but then doesn't feel natural in terms of modelling the actual facts - for example, what to call the properties, because FlowIn and FlowOut wouldn't make sense when looked at from each node's perspective. I also wonder about the performance of properties versus relationships in this case - these values will need to be updated.
Representing Time
Now I want to take a step further and represent the flow between nodes at specific times or, more accurately, between specific times. So between 2pm and 3pm the flow between #1 and #2 will be x.
How should this be done in an optimal way? A relationship per time frame per connection seems... verbose. Could representing a timeframe as a node be of value?!
Are there any Maximum Flow samples with Cypher out there?
Particularly interested in push-relabel max flow problem solving.
Thank you for any advice you might have to offer.
While you have definitely given some thought to your problem, the question is a little unclear. This seems to be a question about graph data models: you would like to know how best to organize a model to represent a complex relationship. If you are trying to track the "flow" between two nodes, then assign a weight property to a unidirectional edge.
Bidirectional relationships should be carefully considered. Neo4j can process them just as fast as unidirectional relationships. A quote from GraphAware about using bidirectional relationships:
Relationships in Neo4j can be traversed in both directions with the same speed. Moreover, direction can be completely ignored. Therefore, there is no need to create two different relationships between nodes, if one implies the other.
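For what it's worth, here is one hedged way the "weight property" advice could look in practice, using the official neo4j Python driver; it also sketches the per-time-window flows from the question. The label, relationship type and property names (Station, FLOWS_TO, rate, window_start, window_end), the connection details, and the choice of one relationship per direction and per time window are all illustrative assumptions, not a recommended schema.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

SET_FLOW = """
MERGE (a:Station {name: $a})
MERGE (b:Station {name: $b})
MERGE (a)-[r:FLOWS_TO {window_start: $win_start, window_end: $win_end}]->(b)
SET r.rate = $rate
"""

def record_flow(a, b, win_start, win_end, rate):
    # One relationship per direction and per time window; the reverse flow
    # (b -> a) is simply another FLOWS_TO with its own rate, rather than
    # extra properties piled onto a single relationship.
    with driver.session() as session:
        session.run(SET_FLOW, a=a, b=b, win_start=win_start,
                    win_end=win_end, rate=rate)

record_flow("Node1", "Node2", "14:00", "15:00", 42.0)
record_flow("Node2", "Node1", "14:00", "15:00", 17.5)
driver.close()
```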
I believe your problems can be alleviated by gaining a better understanding of graph data models. Looking at a few different models and understanding the why will help more than understanding Cypher syntax at this point. May I suggest reading the survey by two professors at the University of Chile titled "Survey of Graph Database Models". The "Hypernode" model on page 21 may be of particular interest to you, since it sounds like you are trying to model a complex cyclic object. From page 21:
Hypernodes can be used to represent simple (flat) and complex objects (hierarchical, composite, and cyclic) as well as mappings and records. A key feature is its inherent ability to encapsulate information.
Hopefully that information helps you in your efforts to model a complex relationship.

Logic for selecting best nearby venues for display on a map

I have an app that displays information about certain venues. Each venue is awarded a rating on a scale from 0-100. The app includes a map, and on the map I'd like to show the best nearby venues. (The point is to recommend to the user alternative venues that they might like.)
What is the best way to approach this problem?
If I fetch the nearest x venues, many bad venues (i.e. those with a low rating) show.
If I fetch the highest rated venues, many of them will be too far away to be useful as recommendations.
This seems like a pretty common challenge for any geolocation app, so I'm interested to know what approach other people have taken.
I have considered "scoring" each possible venue by taking into account its rating and its distance in miles.
I've also considered fetching the highest rated venues within a y mile radius, but this gets problematic because in some cities there are a lot of venues in a small area (e.g. New York) and in others it's reasonable to recommend venues that are farther away.
(This is a Rails app, and I'm using Solr with the Sunspot gem to retrieve the data. But I'm not necessarily looking for answers in code here, more just advice about the logic.)
Personally, I would implement a few formulas and use some form of A/B testing to get an idea as to which ones yield the best results on some outcome metric. What exactly that metric is is up to you. It could be clicks, or it could be something more complicated.
Start out with the simplest formula you can think of (ideally one that is computationally cheap as well) to establish a baseline. From there, you can iterate, but the absolute key concept is that you'll have hard data to tell you if you're getting better or worse, not just a hunch (perhaps that a more complicated formula is better). Even if you got your hands on Yelp's formula, it might not work for you.
For instance, as you mentioned, a single score calculated based on some linear combination of inverse distance and establishment quality would be a good starting point and you can roll it out in a few minutes. Make sure to normalize each component score in some way. Here's a possible very simple algorithm you could start with:
1. Filter venues as much as possible on fast-to-query attributes (by type, country, etc.).
2. Filter the remaining venues to within a fairly wide radius (you'll need to do some research into exactly how to do this in a performant way; there are plenty of posts on Stack Overflow and elsewhere on this. You'll want to index your database table on latitude and longitude, and follow a number of other best practices).
3. Score the remaining venues using some weights that seem intuitive to you (I arbitrarily picked 0.25 and 0.75, but they should add up to 1; a runnable sketch of this step follows the list):
score = 0.25 * (1 - (distance - closest distance in remaining set) / (furthest distance in remaining set - closest distance in remaining set)) + 0.75 * (quality score / highest quality score in remaining set)
4. Sort them by score and take the top n.
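Here is that sketch: the 0.25/0.75 weights and the min-max normalisation mirror the formula in step 3, while the venue field names ('distance', 'quality') are assumed for illustration.

```python
def score_venues(venues, w_distance=0.25, w_quality=0.75):
    """venues: list of dicts with 'distance' (miles) and 'quality' (0-100)."""
    min_d = min(v["distance"] for v in venues)
    max_d = max(v["distance"] for v in venues)
    max_q = max(v["quality"] for v in venues)

    for v in venues:
        # Closer is better, so invert the normalised distance.
        norm_d = 0.0 if max_d == min_d else (v["distance"] - min_d) / (max_d - min_d)
        v["score"] = w_distance * (1 - norm_d) + w_quality * (v["quality"] / max_q)

    return sorted(venues, key=lambda v: v["score"], reverse=True)

venues = [
    {"name": "A", "distance": 0.5, "quality": 60},
    {"name": "B", "distance": 3.0, "quality": 95},
    {"name": "C", "distance": 1.2, "quality": 80},
]
top_n = score_venues(venues)[:2]
print([v["name"] for v in top_n])  # ['C', 'B'] for this data
```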
I would put money on Yelp using some fancy-pants version of this simple idea. They may be using machine learning to actually select the weights for each component score, but the conceptual basis is similar.
While there are plenty of possibilities for calculating formulas of varying complexity, the only way to truly know which one works best is to gather data.
I would fix the number of venues returned at, say, 7.
Discard all venues with scores in the lowest quartile of reviewer scores, to avoid bad customer experiences, then return the top 7 within a postcode. If this results in fewer than 7 entries, then look to the neighboring postcodes to find the best scores to complete the list.
This would result in a list of top to mediocre scores locally, perhaps with some really good scores only a short distance away.
From a UX perspective this would easily allow users to either select a postcode/area they are interested in or allow the app to determine its location.
From a data perspective, you already have addresses. The only "tricky" bit is determining what the neighboring postcodes/areas are, but I'm sure someone has figured that out already.
As an aside, I'm a great believer in things changing. Like restaurants changing hands or the owners waking up and getting better. I would consider offering a "dangerous" list of sub-standard eateries "at your own risk" as another form of evening entertainment. Personally I have found some of my worst dining experiences have formed some of my best dining out stories :-) And if the place has been harshly judged in the past you can sometimes find it is now a gem in the making.
First, I suggest that you use a Bayesian average to maintain an overall rating for all the venues; more info here: https://github.com/tyrauber/acts_rateable
Then you can retrieve the nearest venues ordered by distance and then by rating: two ORDER BY clauses in your query.
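In case it's useful, a minimal sketch of what a Bayesian average does: it blends each venue's own ratings with the global mean so venues with only a handful of reviews aren't ranked purely on those few scores. The constant C (how many "virtual" reviews at the global mean to assume) is a tuning choice; the 10 here is arbitrary.

```python
def bayesian_average(ratings, global_mean, C=10):
    n = len(ratings)
    return (C * global_mean + sum(ratings)) / (C + n)

print(bayesian_average([100, 95, 98], global_mean=70))   # few reviews: pulled toward 70
print(bayesian_average([100] * 200, global_mean=70))     # many reviews: dominated by its own scores
```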

GORM: designing a domain class

I have a question:
I have a domain class: LoanAccount. We have different loan products, but they differ only in how the interest is calculated.
for example:
1. A Regular Loan calculates interest using the Annuity Interest Rate formula
2. A Vehicle Loan calculates interest using the Flat Interest Rate formula
3. A Temporary Loan calculates interest with another formula (I have no idea what that is).
We could also change the rules every year... and then we use a different formula as well...
My Question:
Should I put all the formula logic in services?
Should I make each loan type a different domain class?
Or should I make one domain class that has different interest rate calculation methods?
Any example would be good :)
Thank you in advance !
My suggestion is to separate interest calculating logic from the domain objects.
Hard-wiring the domain object and its interest calculation is likely to lead you into trouble:
It would be more complicated to change the type of interest calculation for an existing account type (which could be an expected business request).
When a new account type is created, you can easily use all the calculation methods you have already implemented for it.
It's likely that the interest-calculating algorithm will grow in complexity in the future, and it may need properties that should not be part of the Account domain object, like some business constants, a list of transactions, etc.
Grails (because of Spring) naturally supports having business logic in services (declarative transactions etc.) rather than in the domain objects. You will always have less pain when going along with the framework than otherwise.
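The separation described above amounts to a strategy pattern: the domain object stays thin and a service picks the right calculator per product. Here is a language-agnostic sketch (written in Python rather than Groovy purely for brevity); all class names, method names and the example formulas are illustrative assumptions, not the actual business rules.

```python
from dataclasses import dataclass

@dataclass
class LoanAccount:
    principal: float
    annual_rate: float          # e.g. 0.12 for 12%
    term_months: int
    product_type: str           # "REGULAR", "VEHICLE", ...

class FlatRateCalculator:
    def interest(self, loan: LoanAccount) -> float:
        # Flat rate: interest on the original principal for the whole term.
        return loan.principal * loan.annual_rate * loan.term_months / 12

class AnnuityCalculator:
    def interest(self, loan: LoanAccount) -> float:
        # Annuity: equal instalments; total interest = instalments - principal.
        i = loan.annual_rate / 12
        n = loan.term_months
        instalment = loan.principal * i / (1 - (1 + i) ** -n)
        return instalment * n - loan.principal

class InterestService:
    """Plays the role of a Grails service: chooses the calculator per product."""
    calculators = {"REGULAR": AnnuityCalculator(), "VEHICLE": FlatRateCalculator()}

    def interest_for(self, loan: LoanAccount) -> float:
        return self.calculators[loan.product_type].interest(loan)

loan = LoanAccount(principal=10_000, annual_rate=0.12, term_months=12, product_type="VEHICLE")
print(InterestService().interest_for(loan))   # 1200.0
```

Adding a new product type, or changing the formula used by an existing one, then only means registering a different calculator; the LoanAccount domain class does not change.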
