how to decide if customer is a dimension - data-warehouse

I am creating a data warehouse (data mart) for a project-based (labor-centric) organization. (That is, they sell labor-based "projects"; they don't sell physical products.) They are interested in project- and customer-related dimension info. I need to make a design decision about a certain dimension. Should I make this dimension "Project" (with customer info as attributes on this dimension)? Or should I make two separate dimensions, one for project and another for customer? What are some questions to ask (or things to think about) to help me make this decision?

If the customer and the project represent axes for analysis, you can proceed with the following design:
The customer and the project can be Slowly Changing Dimensions (SCD), where you choose a type from the following:
SCD Type Summary
Type 1 - Overwrite the changed attribute (no history kept).
Type 2 - Preserve history by adding each change as a new row.
The fact table can then hold measures like the cost, the number of working days, and so on.
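To make that concrete, here is a minimal sketch in Python/pandas of the two-dimension design, assuming hypothetical table and column names (DimCustomer, DimProject, a labor fact at a monthly grain). The customer dimension is shown as Type 2, so a changed attribute produces a new row with effective dates.

# Minimal sketch (pandas) of separate Customer and Project dimensions
# feeding one fact table. All table and column names are hypothetical.
import pandas as pd

# DimCustomer as an SCD Type 2 dimension: history is kept as new rows.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2, 3],                 # surrogate key
    "customer_id": ["C100", "C100", "C200"],   # natural/business key
    "customer_name": ["Acme", "Acme", "Globex"],
    "region": ["East", "West", "North"],       # attribute that changed for C100
    "effective_date": ["2022-01-01", "2023-06-01", "2022-01-01"],
    "end_date": ["2023-05-31", "9999-12-31", "9999-12-31"],
    "is_current": [False, True, True],
})

# DimProject as its own dimension (could be Type 1 or Type 2 as well).
dim_project = pd.DataFrame({
    "project_key": [10, 11],
    "project_id": ["P-001", "P-002"],
    "project_type": ["Consulting", "Implementation"],
})

# Fact table at the grain of, say, one row per project per customer per month,
# holding the labor-related measures mentioned above.
fact_project_labor = pd.DataFrame({
    "customer_key": [2, 3],
    "project_key": [10, 11],
    "month": ["2024-01", "2024-01"],
    "cost": [12000.0, 8000.0],
    "working_days": [20, 12],
})

# Example analysis: cost and working days by customer region and project type.
report = (fact_project_labor
          .merge(dim_customer, on="customer_key")
          .merge(dim_project, on="project_key")
          .groupby(["region", "project_type"])[["cost", "working_days"]]
          .sum())
print(report)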

Related

Is this problem suitable for machine learning - brain.js?

The problem I would like to solve is how to choose the best seats on a train based on some ordered user preferences. E.g. whether they'd like a seat facing forwards, backwards (or don't care), whether they'd like a seat at a table or not, whether they need to be near a toilet, luggage rack, buffet car, or near the door. Window / aisle seat. Whether they want the aisle to the left or the right (can be very important for someone with a stiff knee!).
Most customers will specify one or two preferences, others may specify more. For some, being near the toilet might be the most important factor; for others, having that table to work at might be the most important.
There may be more than one passenger (although they will share preferences). These should be sat as close to each other as possible. 2 passengers would ideally be sat next to each other, or opposite each other at a table seat. A group of 8 passengers might best be split into 2 groups of 4 or 4 groups of 2...
Position is defined by carriage number (seats in the same carriage are better than seats in different carriages) and by x/y coordinate within that carriage - so it's easy enough to calculate the distance between any pair of seats, but a BIG job to calculate distances between EVERY pair of seats...
Each [available] seat (pre-filtered by ticket class) will have the above attributes either defined or set to NULL (for unknown - seat facing is often unknown).
So for training I can provide a vast array of example trains and customer preferences with the best balance of preferences versus position.
For execution I want to provide a run-time specific array of seats with attributes, a set of user preferences and a set of weightings for those preferences (e.g. passenger 1 thinks being near the toilet is most important, passenger 2 thinks having a table is most important, passenger 3 thinks being in the quiet carriage is...), and finally the number of passengers.
Output will be an array of seats (one per passenger) that strike the best compromise between matching as many customer preferences as possible (usually not possible to match all preferences) and keeping the seats fairly close to each other.
E.g. we might be able to match 2 preferences with seats 2 rows apart, but match 3 preferences with seats 10 rows apart...
Obviously distance will need a weighting just like the individual preferences, so it's possible to choose between those two options. I suppose a distance not greater than X just becomes one more customer preference...
I've not done any ML work before, so it's all going to be a learning exercise for me. I wish I had the time to just play and see what comes out, but I don't. I'd be happy to do that, but I need to have a reasonable expectation of a positive result; otherwise I'll have to focus on a more traditional approach. Limited time and all that...
So, my questions are:
Is this a suitable problem for machine learning?
If so, is brain.js a good choice, or is something else more suitable? AWS ML service perhaps?
Any advice on how to organise all my data into something suitable for an ML engine to process?
Machine Learning is good at finding hidden patterns in complex data. In your case, you would need a lot of data where user preferences are already matched with optimal seating arrangements.
You could then try to see if the ML model can actually make optimal seating arrangements by itself. It’s an interesting problem but it may also lead to unexpected seating :)
If you don’t have training data you could collect it live, by registering where people sit down, knowing their preferences.
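On the data-organisation question: whatever engine you pick, the training data has to be flattened into numeric rows somehow. A minimal sketch, assuming a hypothetical encoding where each row pairs one seat's attributes with one passenger's preference weights, and the label says whether that seat was the one actually chosen (unknown/NULL attributes get a neutral value):

# One possible way to flatten seats + preferences into numeric training rows.
# All field names and encodings here are hypothetical assumptions, just to
# show the shape of data an ML library (brain.js, scikit-learn, ...) expects.

def encode_seat(seat):
    """Map seat attributes to numbers; None (unknown) gets its own value."""
    def tri(value):          # True -> 1.0, False -> 0.0, unknown -> 0.5
        return 0.5 if value is None else (1.0 if value else 0.0)
    return [
        tri(seat.get("facing_forward")),
        tri(seat.get("has_table")),
        tri(seat.get("near_toilet")),
        tri(seat.get("window")),
        seat.get("carriage", 0) / 10.0,   # crude normalisation
    ]

def encode_preferences(prefs):
    """Preference weights in a fixed order, 0 = don't care."""
    keys = ["facing_forward", "has_table", "near_toilet", "window"]
    return [float(prefs.get(k, 0)) for k in keys]

# A training example: one row per (seat, preferences) pair,
# label 1 if this seat was the one actually picked for the passenger.
seat = {"facing_forward": True, "has_table": None, "near_toilet": False,
        "window": True, "carriage": 3}
prefs = {"near_toilet": 3, "has_table": 1}   # ordered/weighted preferences

row = encode_seat(seat) + encode_preferences(prefs)
label = 1   # this seat was chosen in the historical booking
print(row, label)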

Questions on how to model many semi-boolean attributes in a star schema

What's the best way to model 37 different attributes/"checkpoints" (that can be graded as Pass/Fail/Not Applicable) in a dimension for a star schema where each row in the fact table is a communication that is graded against the checkpoints in question?
TL;DR:
I've developed a star schema model where each row in the fact table is a single communication. These communications go through a series of graded "checks" (e.g. "Posted on Time", "Correct Email Subject", "XYZ content copied correctly", etc.) and each check can be graded as "Passed", "Missed", or "Not Applicable".
Different types of communications are graded on different sets of checks (e.g. one type of communication may only be graded on three checks, the rest being "Not Applicable", whereas another type of communication is graded on 19 checks). There are 37 total unique checks.
I've built a "CommunicationGrading" Type 2 slowly changing dimension to facilitate reporting of which "checks" communications are scoring most poorly. The dimension has 37 columns, one for each attribute, and each row is a permutation of the attributes and the score they can receive (pass/fail/NA). A new row is added when a new permutation becomes available - filling all possible permutations unfortunately returns millions of rows, whereas this way is < 100 rows, much less overhead. I've created 37 separate measures that aggregate the # of communications that have missed each of the 37 separate "checks".
I can quickly build a treemap in PBI, drag the 37 measures on there, see the total # of communications that have missed each "check", and determine that X # of communications missed Y checkpoint this month. The problem comes when I want to use the visual as a slicer (e.g. selecting a check/tile on the treemap to see, in a table beneath the treemap, which individual communications missed that check) or when determining the top N "checks" given a slice of data.
From what I can tell, the issue is because I'm using 37 different attributes and measures rather than one attribute and one measure (where I could drag the single measure into Values and the single attribute/column containing all checks into Group field in the treemap visual). The problem is, I'm stumped on how to best model this/the Grading dimension. Would it involve trimming the dimension down to just two columns, one for the checks and one for the checks' possible scores, then creating a bridge table to handle the M:M relationship? Other ideas?
Your dimension (implemented as a junk dimension, something to google) is one way of doing it, although if going down that road I'd break it into multiple dimensions of related checkpoints to massively reduce the permutations in each. It also isn't clear why this would need to be Type 2: is there history of this dimension you would need to track?
However, one approach I'd suggest exploring is having a new fact row for each communication's score at each checkpoint: you could have one dimension for the grade result (passed, failed, not applicable) and one dimension for the checkpoint (which is just the checkpoint description). It would also allow you to count on that fact rather than having to maintain 37 different measures. You may wish to keep a fact at the communication level if there is some aggregate information to retain, but that depends on your requirements.
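A rough pandas sketch of that suggestion, with hypothetical table and column names: the fact is at the grain of one row per communication per applicable checkpoint, the two small dimensions hold the checkpoint description and the grade, and a single count measure can then be sliced by checkpoint.

# Sketch (pandas, hypothetical names) of a fact at the grain of
# one row per communication per applicable checkpoint.
import pandas as pd

# Only 3 of the 37 checkpoints shown here.
dim_checkpoint = pd.DataFrame({
    "checkpoint_key": [1, 2, 3],
    "checkpoint_name": ["Posted on Time", "Correct Email Subject", "XYZ content copied"],
})

dim_grade = pd.DataFrame({
    "grade_key": [1, 2, 3],
    "grade": ["Passed", "Missed", "Not Applicable"],
})

# One row per (communication, checkpoint) that was actually graded.
fact_communication_check = pd.DataFrame({
    "communication_id": [101, 101, 102, 102, 102],
    "checkpoint_key":   [1,   2,   1,   2,   3],
    "grade_key":        [1,   2,   1,   1,   2],
})

# Single measure: count of misses, sliceable by checkpoint
# (no need for 37 separate measures).
missed = (fact_communication_check
          .merge(dim_grade, on="grade_key")
          .query("grade == 'Missed'")
          .merge(dim_checkpoint, on="checkpoint_key")
          .groupby("checkpoint_name")["communication_id"]
          .count())
print(missed)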

Designing a Data Warehouse/ Star Schema - Choosing facts

Consider a crowdfunding system whereby anyone in the world can invest in a project.
I have the normalized database design in place and now I am trying to create a data warehouse for it (OLAP).
I have come up with the following:
This has been denormalized and I have chosen Investment as the fact table because I think the following examples could be useful business needs:
Look at investments by project type
Investments by time periods i.e. total amount of investments made per week etc.
Having done some reading (The Data Warehouse Toolkit: Ralph Kimball) I feel like my schema isn't quite right. The book says to declare the grain (in my case each Investment) and then add facts within the context of the declared grain.
Some facts I have included do not seem to match the grain: TotalNumberOfInvestors, TotalAmountInvestedInProject, PercentOfProjectTarget.
But I feel these could be useful as you could see what these amounts are at the time of that investment.
Do these facts seem appropriate? Finally, is the TotalNumberOfInvestors fact implicitly made with the reference to the Investor dimension?
I think "one row for each investment" is a good candidate grain.
The problem with your fact table design is that you include columns which should actually be calculations in your data application (OLAP cube).
TotalNumberOfInvestors can be calculated by taking the distinct count of investors.
TotalAmountInvestedInProject should be removed from the fact table because it is actually a calculation with assumptions. Try grouping by project and then taking the sum of InvestmentAmount, which is a more natural approach.
PercentOfProjectTarget is calculated by taking the sum of FactInvestment.InvestmentAmount divided by the sum of DimProject.TargetAmount. A constraint for making this calculation work is having at least one member of DimProject in your report.
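As a small illustration, here is how these three values could be derived at report time rather than stored on the fact (the column names follow the question; the sample data is made up):

# Illustration (pandas) of deriving the three values at query/report time
# instead of storing them on FactInvestment. Column names follow the question;
# the data below is made up.
import pandas as pd

fact_investment = pd.DataFrame({
    "InvestmentKey": [1, 2, 3, 4],
    "ProjectKey": [10, 10, 10, 20],
    "InvestorKey": [100, 101, 100, 102],
    "InvestmentAmount": [500.0, 250.0, 750.0, 1000.0],
})

dim_project = pd.DataFrame({
    "ProjectKey": [10, 20],
    "TargetAmount": [5000.0, 2000.0],
})

per_project = fact_investment.groupby("ProjectKey").agg(
    TotalNumberOfInvestors=("InvestorKey", "nunique"),        # distinct count
    TotalAmountInvestedInProject=("InvestmentAmount", "sum"),  # grouped sum
).reset_index()

report = per_project.merge(dim_project, on="ProjectKey")
report["PercentOfProjectTarget"] = (
    report["TotalAmountInvestedInProject"] / report["TargetAmount"] * 100
)
print(report)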
Hope this helps,
Mark.
Either calculate these additional measures in a reporting tool or create a set of aggregated fact tables on top of the base one. They will be less granular and will reference only a subset of dimensions.
Projects seem to be a good candidate for such an aggregate fact table. It will be an accumulating snapshot fact table that you can also use to track projects' life cycle.
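A minimal sketch of what such a project-level accumulating snapshot could look like, derived by rolling the investment fact up to one row per project (names are hypothetical; a real design would also carry milestone dates such as launched/funded/closed):

# Hypothetical sketch: roll the investment fact up to one row per project,
# the kind of accumulating snapshot row that is updated as the project
# moves through its life cycle.
import pandas as pd

fact_investment = pd.DataFrame({
    "ProjectKey": [10, 10, 20],
    "InvestorKey": [100, 101, 102],
    "InvestmentAmount": [500.0, 750.0, 1000.0],
    "InvestmentDate": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-01-20"]),
})

fact_project_snapshot = fact_investment.groupby("ProjectKey").agg(
    FirstInvestmentDate=("InvestmentDate", "min"),
    LatestInvestmentDate=("InvestmentDate", "max"),
    InvestorCount=("InvestorKey", "nunique"),
    AmountRaisedToDate=("InvestmentAmount", "sum"),
).reset_index()

# In a real warehouse this row would be updated in place (or re-snapshotted)
# as new investments arrive and milestones occur.
print(fact_project_snapshot)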

GORM: designing a domain class

I have a question:
I have a domain class: LoanAccount. We have different loan products, but they differ only in how the interest is calculated.
For example:
1. Regular Loan calculates interest using the Annuity Interest Rate formula
2. Vehicle Loan calculates interest using the Flat Interest Rate formula
3. Temporary Loan calculates interest with another formula (I have no idea what that is).
We could also change the rules every year ... and use a different formula as well ...
My Question:
Should I put all the formula logic in services?
Should I make every loan a different domain class?
Or should I make one domain class that has different interest rate calculation methods?
Any example would be good :)
Thank you in advance !
My suggestion is to separate interest calculating logic from the domain objects.
Hard-wiring the domain object to its interest calculation is likely to lead you into trouble:
It would be more complicated to change the type of interest calculation for an existing account type (which could be an expected business request).
When a new account type is created, you can easily reuse all the calculation methods you have already implemented.
It's likely that the interest-calculating algorithm will grow in complexity in the future, and it may need inputs that should not be part of the Account domain object, like business constants, a list of transactions, etc.
Grails (because of Spring) naturally supports keeping business logic in services (declarative transactions, etc.) rather than in the domain objects. You will always have less pain going along with the framework than against it.
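What is being described is essentially the strategy pattern: the domain object stays thin and delegates the calculation to an injected collaborator. A language-agnostic sketch in Python follows (in Grails each calculator would typically live in a service bean; all names and formulas here are illustrative assumptions):

# Sketch of keeping interest logic out of the domain class (strategy pattern).
# In Grails each calculator would typically be a Grails/Spring service bean;
# the class names and formulas below are hypothetical.

class AnnuityInterestCalculator:
    def interest_for(self, principal, annual_rate, periods):
        # Annuity payment formula; total interest = payments - principal.
        r = annual_rate / 12
        payment = principal * r / (1 - (1 + r) ** -periods)
        return payment * periods - principal

class FlatInterestCalculator:
    def interest_for(self, principal, annual_rate, periods):
        # Flat rate: interest charged on the original principal every period.
        return principal * (annual_rate / 12) * periods

class LoanAccount:
    """Domain object holds only data; the calculator is injected."""
    def __init__(self, principal, annual_rate, periods, calculator):
        self.principal = principal
        self.annual_rate = annual_rate
        self.periods = periods
        self.calculator = calculator

    def total_interest(self):
        return self.calculator.interest_for(
            self.principal, self.annual_rate, self.periods)

# Swapping the product (or next year's rules) means swapping the calculator,
# not touching the domain class.
vehicle_loan = LoanAccount(10000, 0.12, 24, FlatInterestCalculator())
regular_loan = LoanAccount(10000, 0.12, 24, AnnuityInterestCalculator())
print(round(vehicle_loan.total_interest(), 2), round(regular_loan.total_interest(), 2))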

Predicting Football match winners based only on previous data of same match

I'm a huge football (soccer) fan and interested in Machine Learning too. As a project for my ML course I'm trying to build a model that would predict the chance of winning for the home team, given the names of the home and away teams. (I query my dataset and accordingly create datapoints based on previous matches between those 2 teams.)
I have data for several seasons for all teams, however I have the following issues that I would like some advice with. The EPL (English Premier League) has 20 teams which play each other at home and away (380 total games in a season). Thus, each season, any 2 teams play each other only twice.
I have data for the past 10+ years, resulting in 2*10=20 datapoints for the two teams. However, I do not want to go past 3 years since I believe teams change quite considerably over time (ManCity, Liverpool) and this would only introduce more error into the system.
So this results in just around 6-8 data points for each pair of teams. However, I do have several features (up to 20+) for each data point, like full-time goals, half-time goals, passes, shots, yellows, reds, etc. for both teams, so I can include features like recent form, recent home form, recent away form etc.
However, the idea of having only 6-8 datapoints to train with seems incorrect to me. Any thoughts on how I could counter this problem? (If this is a problem in the first place, that is.)
Thanks!
EDIT: FWIW, here's a link to my report which I compiled at the completion of my project. https://www.dropbox.com/s/ec4a66ytfkbsncz/report.pdf . It's not 'great' stuff but I think some of the observations I managed to elicit were pretty cool (like how my prediction worked very well for the Bundesliga because Bayern win the league all the time).
That's an interesting problem which I don't think has a unique solution. However, there are a couple of little things that I would try if I were in your position.
I share your concern about 6-8 points per class being too little data to build a reliable model. So I would try to model the problem a bit differently. In order to have more data for each class, instead of having 20 classes I would have only two (home/away), and I would add two features, one for the home team and one for the away team. In that setup, you can still predict which team would win given whether it is playing home or away, and your problem has more data to produce a result.
Another idea would be to take data from other European leagues. Since teams are now a feature and not a class, it shouldn't add too much noise to your model and you could benefit from the additional data (assuming that those features are valid in other leagues).
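A minimal scikit-learn sketch of that reframing, assuming hypothetical column names: one row per match, the two team names one-hot encoded as features alongside simple form numbers, and a binary home-win label, so every match in the league (or in several leagues) contributes a training row.

# Sketch (scikit-learn, hypothetical columns) of the reframing: one row per
# match, home/away team as categorical features, label = home win or not.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

matches = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Liverpool", "Arsenal"],
    "away_team": ["Chelsea", "Liverpool", "Arsenal", "Liverpool"],
    "home_recent_form": [0.7, 0.5, 0.8, 0.6],   # e.g. points ratio, last 5 games
    "away_recent_form": [0.6, 0.8, 0.7, 0.9],
    "home_win": [1, 0, 1, 0],                    # binary label
})

features = ["home_team", "away_team", "home_recent_form", "away_recent_form"]
pre = ColumnTransformer(
    [("teams", OneHotEncoder(handle_unknown="ignore"), ["home_team", "away_team"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
model.fit(matches[features], matches["home_win"])

# Predict P(home win) for a new fixture.
new_match = pd.DataFrame([{"home_team": "Chelsea", "away_team": "Arsenal",
                           "home_recent_form": 0.6, "away_recent_form": 0.7}])
print(model.predict_proba(new_match)[0][1])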
I have a similar system - a good base for source data is football-data.co.uk.
I have used the last N seasons for each league and built a model (believe me, more than 3 years is a must!). It depends on your criterion function - whether the criterion is best fit or maximum profit, you can build your own predictive model accordingly.
One very good thing to know is that each league is different; bookmakers also give different home-win odds on a favourite in Belgium than in the 5th English league, where you can find real value odds, for instance.
Out of that you can compile an interesting model, such as betting tips to beat bookmakers on specific matches, using your pattern to find value bets. Or you can try to chase as many winning tips as you can but possibly earn less (draws pay a lot of money even though fewer draws come in).
Hopefully I have given you some ideas; for more, feel free to ask.
Don't know if this is still helpful, but features like full-time goals, half-time goals, passes, shots, yellows, reds, etc. are features that you won't have for the new match that you want to classify.
I would treat this as a classification problem (you want to classify the match into one of 3 categories: 1, X, or 2) and add more features that you can also compute for the new match, e.g. the number of missing players (due to injury/red cards), the number of wins/draws/losses each team has had in a row immediately BEFORE the match, which team is at home (already mentioned), goals scored in the last few matches home and away, etc.
Having 6-8 matches is the real problem. This dataset is very small and there would be a lot of over-fitting, but if you use features like the ones I mentioned, I think you could also use older data.
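For example, here is a small pandas sketch (hypothetical column names) of deriving one such pre-match feature, rolling home form, from earlier results, while keeping the full-time result only as the 1/X/2 label so the current match's own outcome never leaks into its features:

# Illustration: build features from information known before kick-off
# (rolling form computed from earlier results) and keep the full-time
# result only as the 1/X/2 label. Column names are hypothetical.
import pandas as pd

results = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-01", "2024-08-08", "2024-08-15", "2024-08-22"]),
    "home_team": ["Arsenal", "Chelsea", "Arsenal", "Chelsea"],
    "away_team": ["Chelsea", "Arsenal", "Chelsea", "Arsenal"],
    "full_time_result": ["1", "X", "2", "1"],   # label: home win / draw / away win
})

# Points the home team took from each match, used to build rolling form.
points = {"1": 3, "X": 1, "2": 0}
results["home_points"] = results["full_time_result"].map(points)

# Home team's average points over its previous home matches; shift(1) ensures
# the current match's own result is excluded from its feature.
results["home_form"] = (results.groupby("home_team")["home_points"]
                        .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean()))

print(results[["date", "home_team", "away_team", "home_form", "full_time_result"]])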
