Data warehousing dimension design - data-warehouse

I'm hoping someone can help me with this.
Suppose we have 2 dimensions in our vehicle data warehouse: TRUCKS and PACKAGES. Both are Type 2 SCD.
dim_TRUCKS contains the following data:
TRUCK_KEY  NAME    PRICE
1          Ram     45000
2          F150    48000
3          Tundra  43000
dim_PACKAGES contains the following data:
PACKAGE_KEY  NAME     PRICE
4            Offroad  4000
5            Luxury   7000
6            Sport    2000
The biz rules and requirements state that each TRUCK offers only one PACKAGE. (I know that's not realistic, but it best conveys the particular business dilemma I'm faced with).
The PACKAGE that each TRUCK offers can change over time.
So the question is what's the best way to design and implement this?
My initial thought is to simply add the PACKAGE_KEY to dim_TRUCKS, such as this:
TRUCK_KEY  NAME    PRICE  PACKAGE_KEY
1          Ram     45000  4
2          F150    48000  4
3          Tundra  43000  6
Obviously what I'd end up with is an attribute of a SCD being based on another SCD. Is that bad design? Is there a better way to go?
Thanks much.

I would not model the business rule "one package per truck" into the dimension. I would rather keep PACKAGE and TRUCK as separate dimensions, both referenced from the fact table.
Reason
If the business rule changes in the future (and the probability of that is usually high), you would have more to remodel than if you keep it simple now.
Also, you are right: always try to keep complexity down, and since you can choose, I'd go with the less complex way of referring to each dimension directly from the fact table.
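To make that concrete, here is a minimal pandas sketch of the fact-based design. The fact table, its grain (one row per truck sale) and all column names are my own illustrative assumptions, not something stated in the question; the point is only that the fact row carries both surrogate keys, so the truck-to-package association is recorded per event rather than inside dim_TRUCKS.

```python
import pandas as pd

# Dimension tables as given in the question (Type 2 versioning columns
# omitted for brevity).
dim_trucks = pd.DataFrame(
    {"TRUCK_KEY": [1, 2, 3],
     "NAME": ["Ram", "F150", "Tundra"],
     "PRICE": [45000, 48000, 43000]})

dim_packages = pd.DataFrame(
    {"PACKAGE_KEY": [4, 5, 6],
     "NAME": ["Offroad", "Luxury", "Sport"],
     "PRICE": [4000, 7000, 2000]})

# Hypothetical fact at the truck-sale grain: each row references the truck
# AND the package that applied at the time of the event, so the
# truck-package relationship lives in the fact, not in dim_TRUCKS.
fact_sales = pd.DataFrame(
    {"SALE_DATE_KEY": [20240105, 20240218],
     "TRUCK_KEY": [1, 3],
     "PACKAGE_KEY": [4, 6],
     "SALE_AMOUNT": [49000, 45000]})

# Reporting: join out to each dimension independently.
report = (fact_sales
          .merge(dim_trucks, on="TRUCK_KEY")
          .merge(dim_packages, on="PACKAGE_KEY", suffixes=("_TRUCK", "_PACKAGE")))
print(report[["SALE_DATE_KEY", "NAME_TRUCK", "NAME_PACKAGE", "SALE_AMOUNT"]])
```

With this shape, a later change in the business rule (e.g. a truck offering several packages) needs no dimension remodelling; only the fact rows differ.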

Related

Dimensional Modeling - Outrigger dimension for geography

Currently, I'm working on dimensional modeling and have a question in regards to an outrigger dimension.
The company is trading and acts as a broker between customer and supplier.
For a fact table, "Fact Trades", we include dimCustomer and dimSupplier.
Each of these dimensions have an address.
My question is whether it is correct to create outrigger dimensions that refer to geography. This way we can measure how much we have delivered from an origin city and how much we have delivered to a destination city.
[Diagram: dimensional model]
I am curious as to what is best practice. I hope you can help explain how this should be modelled correctly and why.
I hope my question was clear and that I have posted it in the correct place.
Thanks in advance.
I can think of at least 3 possible options; your particular circumstances will determine which is best for you:
1. If you often filter your fact by geography but without needing company/person information (i.e. how many trades were between London and New York?), then I would create a standalone geography dimension and link it directly to your fact, twice: once for customer and once for supplier (see the sketch after this list). This doesn't stop you also having geographic attributes in your customer/supplier Dims, as a dimensional model is not normalised.
2. If geographic attributes change at a significantly more frequent rate than the customer/supplier attributes, and the customer/supplier Dims have a lot of attributes, then it may be worth creating an outrigger dim for the geographical attributes, as this reduces the maintenance required for the customer/supplier Dims. However, given that most companies/people rarely change their address, this is probably unlikely.
3. Keep the geographical attributes in the customer/supplier Dims. I would probably do this anyway, even if also picking option 1 above.
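To illustrate option 1, here is a minimal pandas sketch (the table layout, keys and sample data are my own illustrative assumptions, not taken from your model): a single geography dimension referenced twice from Fact Trades, once in the customer role and once in the supplier role.

```python
import pandas as pd

# One geography dimension, used in two roles (all data here is made up).
dim_geography = pd.DataFrame(
    {"GEOGRAPHY_KEY": [10, 11, 12],
     "CITY": ["London", "New York", "Hamburg"],
     "COUNTRY": ["UK", "USA", "Germany"]})

# Fact Trades carries two foreign keys into the same dimension,
# one for the customer's location and one for the supplier's.
fact_trades = pd.DataFrame(
    {"TRADE_ID": [1, 2],
     "CUSTOMER_KEY": [100, 101],
     "SUPPLIER_KEY": [200, 201],
     "CUSTOMER_GEOGRAPHY_KEY": [11, 10],
     "SUPPLIER_GEOGRAPHY_KEY": [10, 12],
     "TRADE_AMOUNT": [5000, 7500]})

# Join the same dimension twice, once per role.
customer_geo = dim_geography.add_prefix("CUSTOMER_")
supplier_geo = dim_geography.add_prefix("SUPPLIER_")
report = (fact_trades
          .merge(customer_geo, on="CUSTOMER_GEOGRAPHY_KEY")
          .merge(supplier_geo, on="SUPPLIER_GEOGRAPHY_KEY"))

# e.g. "how many trades were between London and New York?"
print(report.groupby(["SUPPLIER_CITY", "CUSTOMER_CITY"])["TRADE_ID"].count())
```

The same pattern covers "delivered from / delivered to" if the two roles are origin and destination instead of customer and supplier.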
Just out of interest - do customer and supplier have significantly different sets of attributes (I assume they are both companies or people)? Is it necessary to create separate Dims for them?

Is this problem suitable for machine learning - brain.js?

The problem I would like to solve is how to choose the best seats on a train based on some ordered user preferences, eg. whether they'd like a seat facing forwards or backwards (or don't care), whether they'd like a seat at a table or not, whether they need to be near a toilet, luggage rack, buffet car, or the door, window vs aisle seat, and whether they want the aisle to the left or the right (which can be very important for someone with a stiff knee!).
Most customers will specify one or two preferences; others may specify more. For some, being near the toilet might be the most important factor; for others, having that table to work at might be the most important.
There may be more than one passenger (although they will share preferences). These should be seated as close to each other as possible: 2 passengers would ideally be seated next to each other, or opposite each other at a table seat. A group of 8 passengers might best be split into 2 groups of 4 or 4 groups of 2...
Position is defined by carriage number (seats in the same carriage are better than seats in different carriages) and by x/y coordinate within that carriage - so it's easy enough to calculate the distance between any pair of seats, but a BIG job to calculate distances between EVERY pair of seats...
Each [available] seat (pre-filtered by ticket class) will have the above attributes either defined or set to NULL (for unknown - seat facing is often unknown).
So for training I can provide a vast array of example trains and customer preferences with the best balance of preferences versus position.
For execution I want to provide a run-time-specific array of seats with attributes, a set of user preferences and a set of weightings for those preferences (eg. passenger 1 thinks being near the toilet is most important, passenger 2 thinks having a table is most important, passenger 3 thinks being in the quiet carriage is...) and finally the number of passengers.
Output will be an array of seats (one per passenger) that strike the best compromise between matching as many customer preferences as possible (usually not possible to match all preferences) and keeping the seats fairly close to each other.
eg. We might be able to match 2 preferences with seats 2 rows apart, but match 3 preferences with seats 10 rows apart...
Obviously distance will need a weighting in the same way as the individual preferences do, in order to choose between those two. I suppose a distance not greater than X just becomes one more customer preference...
I've not done any ML work before, so it's all going to be a learning exercise for me. I wish I had the time to just play and see what comes out, but I don't. I'm happy to do that, but I need to have a reasonable expectation of a positive result; otherwise I'll have to focus on a more traditional approach. Limited time and all that...
So, my questions are:
Is this a suitable problem for machine learning?
If so, is brain.js a good choice, or is something else more suitable? AWS ML service perhaps?
Any advice on how to organise all my data into something suitable for an ML engine to process?
Machine Learning is good at finding hidden patterns in complex data. In your case, you would need a lot of data where user preferences are already matched with optimal seating arrangements.
You could then try to see if the ML model can actually make optimal seating arrangements by itself. It’s an interesting problem but it may also lead to unexpected seating :)
If you don’t have training data you could collect it live, by registering where people sit down, knowing their preferences.

DWH SCD type 2 implementation in SQL Server - scd2, scd1

We are implementing a new DWH solution. I have many dimensions that require slowly changing Type 2 attributes. I was considering implementing a combination of Type 2 and Type 1 attributes in my dimensions. That is, for some dimension attributes we track history by inserting new rows in the dim table (Type 2); for other attributes we just update the existing row for any changes (Type 1).
Questions:
1. Is this good practice? Is it OK to have a combination of SCD 1 and 2 for the same dim?
2. Is there any limit on the number of SCD 2 attributes in a dimension? My dimension is pretty wide, around 300 columns, and one of the users is requesting that about 150 of them be tracked as SCD Type 2. Is it OK to have so many SCD 2 attributes in a dim? Is there going to be any impact on the performance of downstream reporting/BI solutions like cubes and dashboards because of this?
3. In the OLTP system, we maintain an "audit" table to log any updates. Though this is not in a very easily queryable format, we get answers to most of our questions about changes from it. We don't need much reporting on data changes. Of course there are some important columns, like Status, for which we definitely need SCD 2, but for the rest I am not sure that keeping history for a lot of other columns in the DWH adds any value. My question is: when we already have this audit table in OLTP, how do I decide which attributes need SCD 2 in the DWH?
1. Good practice? Yes. It's a standard feature of dimensional modelling that is overlooked too often. I've seen dimensions with combinations of SCD0, SCD1 and SCD2, and there's nothing to prevent other SCD types being used as well.
2. No limit on columns, but that does seem a little excessive. You probably want to use a "hash" method to detect the SCD2 changes, where you calculate a hash over the SCD2 columns and use this value to detect whether any of those columns have changed (a sketch follows at the end of this answer).
3. Sorry, but I don't understand the question about audit logs. Are these logs your data source?
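On point 2, here is a minimal sketch of the hash-based change detection, written in Python purely for illustration (in SQL Server the ETL could compute the same thing with HASHBYTES); the column names and sample rows are made up. Only the Type 2 columns feed the hash, so a change in a Type 1 column never triggers a new dimension row.

```python
import hashlib
import pandas as pd

# Columns tracked as SCD Type 2 (illustrative subset of a wide dimension).
SCD2_COLS = ["STATUS", "SEGMENT", "REGION"]

def scd2_hash(row: pd.Series) -> str:
    """Hash the Type 2 columns of one row; NULLs get a fixed marker so
    that NULL -> value (and vice versa) is detected as a change."""
    parts = ["<NULL>" if pd.isna(row[c]) else str(row[c]) for c in SCD2_COLS]
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()

# Current dimension rows (hypothetical) and today's incoming source rows.
dim_current = pd.DataFrame(
    {"CUSTOMER_ID": [1, 2],
     "STATUS": ["Active", "Active"],
     "SEGMENT": ["Retail", "Corporate"],
     "REGION": ["North", "South"]})
incoming = pd.DataFrame(
    {"CUSTOMER_ID": [1, 2],
     "STATUS": ["Active", "Closed"],     # customer 2 changed
     "SEGMENT": ["Retail", "Corporate"],
     "REGION": ["North", "South"]})

dim_current["SCD2_HASH"] = dim_current.apply(scd2_hash, axis=1)
incoming["SCD2_HASH"] = incoming.apply(scd2_hash, axis=1)

# Only rows whose hash differs need the current row expired and a new
# Type 2 version inserted.
merged = incoming.merge(dim_current[["CUSTOMER_ID", "SCD2_HASH"]],
                        on="CUSTOMER_ID", suffixes=("", "_CURRENT"))
changed = merged[merged["SCD2_HASH"] != merged["SCD2_HASH_CURRENT"]]
print(changed["CUSTOMER_ID"].tolist())   # -> [2]
```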

Questions on how to model many semi-boolean attributes in a star schema

What's the best way to model 37 different attributes/"checkpoints" (that can be graded as Pass/Fail/Not Applicable) in a dimension for a star schema where each row in the fact table is a communication that is graded against the checkpoints in question?
TL;DR:
I've developed a star schema model where each row in the fact table is a single communication. These communications go through a series of graded "checks" (e.g. "Posted on Time", "Correct Email Subject", "XYZ content copied correctly", etc.) and each check can be graded as "Passed", "Missed", or "Not Applicable".
Different types of communications are graded on different sets of checks (e.g. one type of communication may only be graded on three checks, the rest being "Not Applicable", whereas another type of communication is graded on 19 checks). There are 37 total unique checks.
I've built a "CommunicationGrading" Type 2 slowly changing dimension to facilitate reporting of which "checks" communications are scoring most poorly. The dimension has 37 columns, one for each attribute, and each row is a permutation of the attributes and the score they can receive (pass/fail/NA). A new row is added when a new permutation becomes available - filling all possible permutations unfortunately returns millions of rows, whereas this way is < 100 rows, much less overhead. I've created 37 separate measures that aggregate the # of communications that have missed each of the 37 separate "checks".
I can quickly build a treemap in PBI, drag the 37 measures onto it, see the total # of communications that have missed each "check", and determine that X communications missed Y checkpoint this month. The problem comes when I want to use the visual as a slicer (e.g. selecting a check/tile on the treemap to see, in a table beneath the treemap, which individual communications missed that check) or when determining the top N "checks" for a given slice of data.
From what I can tell, the issue is that I'm using 37 different attributes and measures rather than one attribute and one measure (where I could drag the single measure into Values and the single attribute/column containing all checks into the Group field of the treemap visual). The problem is, I'm stumped on how best to model this Grading dimension. Would it involve trimming the dimension down to just two columns, one for the checks and one for the checks' possible scores, and then creating a bridge table to handle the M:M relationship? Other ideas?
Your dimension (implemented as a junk dimension - something to google) is one way of doing it, although if going down that road I'd break it down into multiple dimensions of related checkpoints to massively reduce the permutations in each. It also isn't clear why this would need to be a Type 2 - is there a history of this dimension you would need to track?
However, one approach I'd suggest exploring is having a new fact row for each communication's score at each checkpoint: you could have one dimension for the grade result (passed, failed, not applicable) and one dimension for the checkpoint (which is just the checkpoint description). It would also allow you to count on that fact rather than having to maintain 37 different measures. You may wish to keep a fact at the communication level if there is some aggregate information to retain, but that would depend on your requirements.
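To illustrate that grain change, here is a minimal pandas sketch (checkpoint names and sample data are invented): the 37 wide check columns are unpivoted into one row per communication per checkpoint, so a single result column and a single measure replace the 37 separate measures, and the checkpoint itself becomes an ordinary dimension attribute you can group and slice on.

```python
import pandas as pd

# Wide grading data: one row per communication, one column per check
# (only three of the 37 checks shown; names are illustrative).
graded = pd.DataFrame(
    {"COMMUNICATION_ID": [101, 102, 103],
     "POSTED_ON_TIME": ["Passed", "Missed", "Passed"],
     "CORRECT_EMAIL_SUBJECT": ["Passed", "Passed", "Not Applicable"],
     "XYZ_CONTENT_COPIED": ["Missed", "Not Applicable", "Passed"]})

# Unpivot to the checkpoint grain: one fact row per communication per check.
fact_checkpoint_score = graded.melt(
    id_vars="COMMUNICATION_ID",
    var_name="CHECKPOINT",        # becomes a checkpoint dimension attribute
    value_name="GRADE_RESULT")    # becomes a small grade-result dimension

# One measure now answers "which checks are missed most", and the same
# column can act as a slicer down to individual communications.
missed = (fact_checkpoint_score[fact_checkpoint_score["GRADE_RESULT"] == "Missed"]
          .groupby("CHECKPOINT")["COMMUNICATION_ID"]
          .count()
          .sort_values(ascending=False))
print(missed)
```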

Predicting Football match winners based only on previous data of same match

I'm a huge football (soccer) fan and interested in Machine Learning too. As a project for my ML course I'm trying to build a model that would predict the chance of winning for the home team, given the names of the home and away teams. (I query my dataset and accordingly create data points based on previous matches between those 2 teams.)
I have data for several seasons for all teams; however, I have the following issues that I would like some advice with. The EPL (English Premier League) has 20 teams which play each other at home and away (380 total games in a season). Thus, each season, any 2 teams play each other only twice.
I have data for the past 10+ years, resulting in 2*10=20 datapoints for the two teams. However I do not want to go past 3 years since I believe teams change quite considerably over time (ManCity, Liverpool) and this would only introduce more error into the system.
So this results in just around 6-8 data points for each pair of teams. However, I do have several features (up to 20+) for each data point, like full-time goals, half-time goals, passes, shots, yellows, reds, etc. for both teams, so I can include features like recent form, recent home form, recent away form etc.
However, the idea of having only 6-8 data points to train with seems incorrect to me. Any thoughts on how I could counter this problem? (If this is a problem in the first place, that is.)
Thanks!
EDIT: FWIW, here's a link to my report which I compiled at the completion of my project. https://www.dropbox.com/s/ec4a66ytfkbsncz/report.pdf . It's not 'great' stuff but I think some of the observations I managed to elicit were pretty cool (like how my prediction worked very well for the Bundesliga because Bayern win the league all the time).
That's an interesting problem which I don't think has a unique solution. However, there are a couple of little things that I would try if I were in your position.
I share your concern about 6-8 points per class being too little data to build a reliable model. So I would try to model the problem a bit differently. In order to have more data for each class, instead of having 20 classes I would have only two (home/away win), and I would add two features, one for the home team and one for the away team. In that setup, you can still predict which team would win given whether it is playing home or away, and your problem has more data to produce a result.
Another idea would be to take data from other European leagues. Since teams are now a feature and not a class, it shouldn't add too much noise to your model, and you could benefit from the additional data (assuming that those features are valid in other leagues).
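A rough sketch of that reframing, assuming scikit-learn and made-up fixture data (feature names and values are purely illustrative): every historical match becomes one training row, the home and away team names are one-hot encoded features rather than classes, and the target is simply whether the home side won.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One row per historical match (toy data; real rows would come from the
# full fixture list, not just the 6-8 head-to-head meetings).
matches = pd.DataFrame(
    {"home_team": ["Man City", "Liverpool", "Arsenal", "Man City"],
     "away_team": ["Liverpool", "Arsenal", "Man City", "Arsenal"],
     "home_recent_form": [2.3, 1.8, 1.5, 2.4],   # e.g. mean points, last 5 games
     "away_recent_form": [2.0, 1.6, 2.2, 1.4],
     "home_win": [1, 0, 0, 1]})                  # target: did the home side win?

features = ["home_team", "away_team", "home_recent_form", "away_recent_form"]
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["home_team", "away_team"]),
    remainder="passthrough")

model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(matches[features], matches["home_win"])

# Predict the home-win probability for a new fixture.
fixture = pd.DataFrame([{"home_team": "Liverpool", "away_team": "Man City",
                         "home_recent_form": 1.9, "away_recent_form": 2.3}])
print(model.predict_proba(fixture[features])[0, 1])
```

Because teams are features rather than classes, rows from the whole fixture list (or from other leagues) can be pooled into one training set.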
I have a similar system - a good base for source data is football-data.co.uk.
I have used the last N seasons for each league and built a model (believe me, more than 3 years is a must!). It depends on your criterion function - whether the criterion is best fit or maximum profit, you can build your own prediction model accordingly.
One very good thing to know is that each league is different; also, bookmakers give different home-win odds on a favourite in Belgium than in the 5th English league, where you can find real value odds, for instance.
From that you can compile an interesting model, such as betting tips to beat the bookmakers on specific matches, using your pattern to find value bets. Or you can try to chase as many winning tips as you can, but possibly earn less (draws earn a lot of money, even though fewer draws come in).
Hopefully I've given you some ideas; feel free to ask for more.
I don't know if this is still helpful, but features like full-time goals, half-time goals, passes, shots, yellows, reds, etc. are features that you don't have for the new match that you want to classify.
I would treat this as a classification problem (you want to classify the match into one of 3 categories: 1, X, or 2) and add more features that you can also compute for the new match, e.g. the number of missing players (due to injury/red cards), the number of wins/draws/losses each team has had in a row immediately BEFORE the match, which is the home team (already mentioned), goals scored in the last few matches home and away, etc.
Having 6-8 matches is the real problem. This dataset is very small and there would be a lot of over-fitting, but if you use features like the ones I mentioned, I think you could also use older data.
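As a small illustration of building pre-match form features of that kind (column names and data are invented), a rolling window shifted by one match ensures the feature only uses results known before kick-off:

```python
import pandas as pd

# Match results in date order (toy data; 1 = the team won that match).
results = pd.DataFrame(
    {"date": pd.to_datetime(["2023-08-12", "2023-08-19", "2023-08-26",
                             "2023-09-02", "2023-09-16"]),
     "team": ["Arsenal", "Arsenal", "Arsenal", "Arsenal", "Arsenal"],
     "won": [1, 1, 0, 1, 1]})

# Wins in the previous 3 matches, shifted so the current match is excluded
# (i.e. a feature you actually know before the match you want to predict).
results["wins_last_3"] = (results
                          .groupby("team")["won"]
                          .transform(lambda s: s.shift(1).rolling(3, min_periods=1).sum()))
print(results)
```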
