What's the difference between additive, semi-additive, and non-additive measures - data-warehouse

I have searched the net for the difference between additive, semi-additive, and non-additive measures in a data warehouse. I have found some results, but I have difficulty understanding the differences because they don't include examples. Could you please explain the difference between additive, semi-additive, and non-additive measures with examples?

The numeric measures in a fact table fall into three categories. The most flexible and useful facts are fully additive; additive measures can be summed across any of the dimensions associated with the fact table.
An example of a fully additive measure is sales (purchases from a store). You can add hourly sales to get the sales for a day, week, month, quarter, or year. You can add sales across stores or regions.
Semi-additive measures can be summed across some dimensions, but not all; checking account or savings account balance amounts are common semi-additive facts.
You can recreate a balance amount from the transactions file, but it doesn't make any sense to add the balance amounts from October, November, and December (across the time dimension).
Finally, some measures are completely non-additive, such as ratios. A good approach for non-additive facts is, where possible, to store the fully additive components of the non-additive measure, sum those components into the final answer set, and only then calculate the non-additive fact.
It is worth emphasizing that non-additive measures are usually averages, percentages, and ratios, and performing further aggregation on them is meaningless (for example, summing two selling ratios of a product will not be meaningful at all).
Thus non-additive facts should not be stored in the fact table; they are usually derived by manipulating the additive facts that are stored there.
Source: [1]
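To make the three categories concrete, here is a small illustrative sketch in pandas; the table, column names, and numbers are all invented for the example.

import pandas as pd

# Invented example data: monthly facts for two stores.
facts = pd.DataFrame({
    "month":   ["Oct", "Oct", "Nov", "Nov"],
    "store":   ["A",   "B",   "A",   "B"],
    "sales":   [100.0, 150.0, 120.0, 130.0],   # fully additive
    "balance": [500.0, 700.0, 520.0, 680.0],   # semi-additive (end-of-month balance)
    "orders":  [10,    20,    12,    13],      # additive component of a ratio
    "returns": [1,     4,     3,     2],       # additive component of a ratio
})

# Additive: summing sales over any dimension is meaningful.
total_sales = facts["sales"].sum()

# Semi-additive: summing balances across stores is fine,
# but across months you average (or take the last value) instead of summing.
balance_by_month = facts.groupby("month")["balance"].sum()   # across stores: OK
avg_balance = facts.groupby("store")["balance"].mean()       # across time: average, not sum

# Non-additive: never sum the ratio itself; sum the components, then divide.
return_rate = facts["returns"].sum() / facts["orders"].sum()

print(total_sales, return_rate)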

Related

Features from consecutive measurements for classification

I'm currently working on a small machine learning project.
The task deals with medical data of a couple of thousand patients. For each patient, 12 measurements of the same set of vital signs were taken, each one hour apart.
These measurements need not have been taken immediately after the patient entered the hospital but could start with some offset. However, the patient stays 24 h in the hospital in total, so the measurements can't start later than 11 hours after admission.
Now the task is to predict for each patient whether none, one or multiple of 10 possible tests will be ordered during the remainder of the stay, and also to predict the future mean value of some of the vital signs for the remainder of the stay.
I have a training set that comes together with the labels that I should predict.
My question is mainly about how to process the features. I thought about turning the measurement results for a patient into one long vector and using it as a training example for a classifier.
However, I'm not quite sure how I should include the time information of each measurement in the features (should I even consider time at all?).
If I understood correctly, you want to include the time information of each measurement in the features. One way is to make a zero vector of length 24, since the patient stays in the hospital for 24 hours. Then you can use a one-hot representation: for example, if measurements were taken in the 12th, 15th, and 20th hours of the stay, your time feature vector will have 1 at the 12th, 15th, and 20th positions and zeros everywhere else. You can append this time vector to the other features to make a single vector for each patient of length = length(other vector) + length(time vector), or you can use different approaches to combine these features.
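A minimal sketch of that idea, assuming a 24-slot hour grid; the hour indices follow the example above and the vital-sign values are made up.

import numpy as np

def time_one_hot(measurement_hours, stay_length=24):
    # Return a 0/1 vector marking the hours at which measurements were taken.
    vec = np.zeros(stay_length, dtype=int)
    vec[measurement_hours] = 1
    return vec

# Example: measurements taken in the 12th, 15th, and 20th hours of the stay.
time_vec = time_one_hot([12, 15, 20])

# Append the time vector to the other (made-up) vital-sign features.
other_features = np.array([72.0, 36.6, 118.0])   # e.g. heart rate, temperature, systolic BP
patient_vector = np.concatenate([other_features, time_vec])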
Please let me know if this approach makes sense to you. Thanks.

Is it necessary to make time series data stationary before applying tree-based ML methods, i.e. Random Forest, XGBoost, etc.?

As in the case of ARIMA models, we have to make our data stationary. Is it necessary to make our time series data stationary before applying tree-based ML methods?
I have a dataset of customers with monthly electricity consumption for the past 2 to 10 years, and I am supposed to predict each customer's consumption for the next 5 to 6 months. In the dataset, some customers show strange behavior: for a particular month their consumption varies considerably from what they consumed in the same month of the last year or the last 3 to 4 years, and this change is not because of temperature. And as we don't know the reason behind this change, the model is unable to predict that consumption correctly.
So would making each customer's time series stationary help in this case or not?

Any Statistical or Machine Learning Method to Predict Salary

I am working at a FinTech company. We provide loans to our customers. Customers who want to apply for a loan must fill in some information in our app, and one piece of that information is their salary. Using web scraping, we are able to grab our customers' bank transaction data for the last 3-7 months.
Using any statistical or machine learning technique, how can I easily spot whether the stated salary amount (or something very close to it) appears in a customer's bank transaction data? Should I make one model (logic) for each customer, or should a single model apply to all customers?
Please advise.
I don't think you need machine learning for this.
Out of the list of all transactions, keep only those that add money to the account rather than subtract money from it.
Round all amounts to a certain accuracy (e.g. 2510 USD -> 2500 USD).
Build a dataset that contains the total amount added to the account for each day. In other words, group transactions by day, and add 0's wherever needed.
Apply a discrete Fourier transform to find the periodic components in this time series.
There should be only one periodic item, repeating every 30ish days.
Set the values of all other periodically repeating items to 0.
Apply the inverse discrete Fourier transform to get only the information that repeats every 28/30 days (a numpy sketch of these steps follows below).
For more information on the Fourier transform, check out https://en.wikipedia.org/wiki/Fourier_transform
For a practical example (using MATLAB), check out https://nl.mathworks.com/help/matlab/examples/fft-for-spectral-analysis.html?requestedDomain=www.mathworks.com
It shows how to give a frequency decomposition of a time-signal. If you apply the same logic, you can use this frequency decomposition to figure out which frequencies are dominant (typically the salary will be one of them).
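As a rough illustration of the steps above, here is a numpy sketch on synthetic deposit data; the salary amount, the 25-35 day search band, and the "keep only the single strongest monthly component" simplification are all assumptions made for the example.

import numpy as np

# Synthetic example: 180 days of rounded incoming transactions with a salary of
# 2500 arriving every 30 days plus a few irregular small credits (all made up).
rng = np.random.default_rng(0)
days = 180
deposits = np.zeros(days)
deposits[::30] += 2500                        # monthly salary
deposits[rng.integers(0, days, 20)] += 100    # noise: occasional small credits

# Discrete Fourier transform of the daily-deposit series.
spectrum = np.fft.rfft(deposits)
freqs = np.fft.rfftfreq(days, d=1.0)          # cycles per day

# Look for the strongest component with a period between 25 and 35 days
# ("30ish days", as the steps above describe).
periods = np.zeros_like(freqs)
periods[1:] = 1.0 / freqs[1:]
band = (periods >= 25) & (periods <= 35)
dominant = np.flatnonzero(band)[np.argmax(np.abs(spectrum[band]))]
print(f"Dominant period: ~{periods[dominant]:.0f} days")

# Zero out every other frequency (keeping the mean), then apply the inverse DFT
# to recover just the monthly, salary-like part of the signal.
filtered = np.zeros_like(spectrum)
filtered[[0, dominant]] = spectrum[[0, dominant]]
periodic_part = np.fft.irfft(filtered, n=days)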

Assistance regarding model choice

I'm new to, and investigating, Machine Learning. I have a use case and data, but I am unsure of a few things, mainly how my model will run and what model to start with. Details of the use case and questions are below. Any advice is appreciated.
My main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model to run on a continuous basis so it gives a best guess at all times, be it run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each semester 6 months in duration. Over the course of a semester, apprentices perform various operations and processes and are scored on how well they do. After each semester, an apprentice either has a sufficient score to move on to the next semester, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate, or advanced, and is weighted by complexity.
Regarding features, we have the following:
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demographic information (nationality, age, gender)
What I am unsure of is how the model will work and when we will run it, i.e. if we run it on day one of the semester, I assume everyone will be predicted to fail, as everyone has procedure scores of 0.
The current plan is to run the model 2-3 months into each semester, so there is enough score data and also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models, such as Naive Bayes, Neural Networks, Decision Trees, etc., which can be used depending on the type of the data. If you are looking for an answer that suggests a particular ML model, then I would need more data. In general, however, this flow-chart can be helpful in selecting one. You can also read about model selection in the 5th lecture of Andrew Ng's CS229.
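As a hedged sketch of what such a pass/fail classifier could look like: the toy data, the column names, and the choice of a decision tree here are purely illustrative assumptions, not a recommendation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up feature table; column names mirror the features listed above.
df = pd.DataFrame({
    "interview_score":  [70, 55, 80, 60, 90, 45, 75, 65],
    "entry_exam":       [65, 50, 85, 58, 92, 40, 70, 60],
    "n_simple":         [20, 12, 25, 15, 30, 10, 22, 18],
    "n_intermediate":   [10,  5, 14,  7, 16,  3, 11,  8],
    "n_advanced":       [ 3,  1,  5,  2,  6,  0,  4,  2],
    "avg_score":        [72, 48, 85, 55, 91, 40, 74, 62],
    "passed":           [ 1,  0,  1,  0,  1,  0,  1,  1],   # y: pass/fail label
})

X, y = df.drop(columns="passed"), df["passed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))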
Now coming back to the basic methodology: some of these features, like initial interview scores and entry exam scores, you already know in advance, whereas some of them, like performance in procedures, only become known over the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure scores as 0, take them as the mean/median of the subject apprentice's past performance in other procedures.
You can even build a sub-model to analyze the relation between procedure scores and interview scores, as they are not completely independent. (I will explain this in the later part of the answer.)
However, if this is the subject apprentice's very first semester, you won't already have such data for that apprentice. In that case, you might need to consider the average performance of other apprentices with profiles similar to the subject apprentice's. If the dataset is not very large, a K-Nearest-Neighbors approach can be quite useful here; for large datasets, however, KNN suffers from the curse of dimensionality. (A sketch of both ideas follows after these suggestions.)
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
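A small sketch of the imputation ideas above, using invented data and column names: each apprentice's own past median where history exists, and a K-Nearest-Neighbors estimate over similar profiles (here, similarity on entry-exam score only, purely as an assumption) where it does not.

import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical history of past procedure scores per apprentice.
history = pd.DataFrame({
    "apprentice_id": [1, 1, 2, 2, 3],
    "score":         [78, 82, 65, 70, 90],
})

# Early in the semester, use each apprentice's own past median instead of 0.
past_median = history.groupby("apprentice_id")["score"].median()

# First-semester apprentices have no history; fall back to the k most similar
# apprentices, with similarity measured on the entry-exam score.
profiles = pd.DataFrame({
    "apprentice_id": [1, 2, 3],
    "entry_exam":    [60, 55, 85],
})
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(profiles[["entry_exam"]], past_median.loc[profiles["apprentice_id"]].values)

new_apprentice_exam = [[70]]                  # entry-exam score of a new apprentice
estimated_start_score = knn.predict(new_apprentice_exam)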
Most probably (although it's just a hypothesis), y will depend more on the initial variables than on the variables obtained later, the reason being that the later variables are not completely independent of the former variables.
My point is, if a model can be created to predict the output of a semester, then a similar model can be created to predict just the output of the 1st procedure-test.
In the end, as the model might be heavily based on demographic factors and other things, it might not be a very successful model. For the same reason, we cannot accurately predict election results, soccer match results, etc., as they are heavily dependent upon real-time dynamic data.
For dynamic predictions based on different procedure performances, time series analysis can be somewhat helpful. But in any case, the final result will depend heavily on the apprentice's continuity in motivation and performance, which will become clearer towards the end of each semester.

time dimension and milliseconds

I am facing a dilemma designing the time dimension: I am not sure whether I should include the millisecond in the time dimension or create a separate dimension for the millisecond grain.
I can see advantages and disadvantages in including the millisecond grain within the time dimension.
Advantages:
I can perform calculations directly over the dimensional keys (date and time dimensions are the only data warehouse dimensions allowed to contain any intelligence in them), and the purpose of the fact table is to measure response times.
Disadvantages:
The time dimension gets too big and I might lose query performance.
Other information that is important to know:
The marketing guy told me to expect around 50 million facts per month (we know how they are, so I should be prepared for a few more).
The facts are to be aggregated in a non-additive way, that is: I want quality-of-service measures such as average (semi-additive), median, and percentiles.
Each fact has 12 time checkpoints.
I think this is a kind of dimensional data separation, which can decrease the size of the dimension table, as it should.
A comparable example is separating dimDate and dimTime: you can keep them together, but if you do, you will end up with one big dimension table.
Now, I think a practical solution is to separate dimMillisecond from dimTime, and because it contains only one field, you can simply add it to your fact table without creating a physical dimMillisecond; it then acts as a degenerate dimension attribute (DDA) in your fact table.
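For illustration, here is a small pandas sketch of that layout, with invented keys and values: the fact table carries date and time surrogate keys plus the millisecond offset as a degenerate attribute, and the non-additive quality-of-service measures are computed at query time rather than stored.

import pandas as pd

# Hypothetical fact table: date/time surrogate keys plus the millisecond part
# kept directly on the fact row as a degenerate dimension attribute.
facts = pd.DataFrame({
    "date_key":         [20240101, 20240101, 20240102, 20240102],
    "time_key":         [93000, 93001, 101500, 101501],    # HHMMSS, second grain
    "millisecond":      [120, 980, 45, 610],                # degenerate attribute
    "response_time_ms": [310, 270, 1250, 180],
})

# Quality-of-service measures (average, median, percentiles) computed on the fly.
qos = facts.groupby("date_key")["response_time_ms"].agg(
    avg="mean",
    med="median",
    p95=lambda s: s.quantile(0.95),
)
print(qos)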
regards.
