MDX Query between Fact and dimension table - join

I have a cube in SSAS with several dimensions and one fact table. One of the dimensions is dimGoodsType, which has a [Weight] attribute. I have a factSoldItems table with a [Price] measure. Now I want to calculate sum(price * weight) (each sold item has a dimGoodsTypeId, so its weight comes from the related GoodsType). How can I define this formula in MDX?

You can define another measure group in your cube, with dimGoodsType as its source table and the Weight column as a measure, and connect it to the Goods Type dimension as usual. Then, in the properties tab of the Price measure, you can set the Measure Expression to [Measures].[Price] * [Measures].[Weight]. This calculation takes place before any aggregation. The main problem is that if you define a straightforward calculation as Price * Weight, SSAS will first sum all weights and all prices in the context of the current cell, and only after that perform the multiplication, whereas you want to always multiply at the leaf level and sum from there.
The other solution is to create a view_factSoldItems view where you add a calculated column Weighted Price as price * weight, and then add that column to the cube as a measure.
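The order-of-operations problem described above can be illustrated with a small Python sketch (hypothetical toy data): multiplying after aggregation gives a very different number than multiplying per row and then summing, which is what the Measure Expression (or view column) approach guarantees.

```python
# Hypothetical sold-item rows: price from the fact table, weight from dimGoodsType.
sold_items = [
    {"price": 10.0, "weight": 2.0},
    {"price": 4.0,  "weight": 5.0},
]

# Correct: multiply at the leaf level (per row), then aggregate.
leaf_level = sum(item["price"] * item["weight"] for item in sold_items)

# Wrong: aggregate first, then multiply -- what a naive Price * Weight
# calculated member would do in the context of the current cell.
aggregated = sum(i["price"] for i in sold_items) * sum(i["weight"] for i in sold_items)

print(leaf_level)   # 40.0
print(aggregated)   # 98.0
```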

Related

Calculating proportional diversity in groups

In my data set, consisting of employees nested in teams, I have to calculate a proportional diversity index for gender. This index is the percentage of minority-group members present within each team. In my data set, males are coded as 0 and females as 1. Now I wonder if there is any simple way to come up with the number of minority members in each team.
Thanks for your guidance.
If what you need is just the percentage of males and females in each team, you can calculate:
sort cases by teamVar.
split file by teamVar.
freq genderVar.
split file off.
This will get you the results in the output window.
If you want the results in another dataset you can use aggregate:
dataset declare byteam.
aggregate /outfile=byteam
  /break=teamVar
  /Pfemales=PIN(genderVar 1 1)
  /Pmales=PIN(genderVar 0 0).
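For readers more comfortable outside SPSS, here is a rough plain-Python equivalent of that AGGREGATE step (the column names and toy rows are made up for illustration): percentage of females (gender == 1) per team.

```python
from collections import defaultdict

# Hypothetical employee rows nested in teams; gender: 0 = male, 1 = female.
rows = [
    {"team": "A", "gender": 1},
    {"team": "A", "gender": 0},
    {"team": "A", "gender": 0},
    {"team": "B", "gender": 1},
    {"team": "B", "gender": 1},
]

counts = defaultdict(lambda: [0, 0])  # team -> [female count, total count]
for row in rows:
    counts[row["team"]][0] += row["gender"] == 1
    counts[row["team"]][1] += 1

pct_female = {team: 100.0 * f / n for team, (f, n) in counts.items()}
print(pct_female)
```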

In the algorithm LambdaRank (in Learning to Rank) what does |∆NDCG| mean?

This article describes the LambdaRank algorithm for information retrieval. In formula 8, page 6, the authors propose to multiply the gradient (lambda) by a term called |∆NDCG|.
I do understand that this term is the difference of two NDCGs when swapping two elements in the list:
the size of the change in NDCG (|∆NDCG|) given by swapping the rank positions of U1 and U2
(while leaving the rank positions of all other urls unchanged)
However, I do not understand which ordered list is considered when swapping U1 and U2. Is it the list ordered by the predictions from the model at the current iteration ? Or is it the list ordered by the ground-truth labels of the documents ? Or maybe, the list of the predictions from the model at the previous iteration as suggested by Tie-Yan Liu in his book Learning to Rank for Information Retrieval ?
Short answer: It's the list ordered by the predictions from the model at the current iteration.
Let's see why it makes sense.
At each training iteration, we perform the following steps (these steps are standard for all Machine Learning algorithms, whether it's classification or regression or ranking tasks):
Calculate scores s[i] = f(x[i]) returned by our model for each document i.
Calculate the gradients of model's weights ∂C/∂w, back-propagated from RankNet's cost C. This gradient is the sum of all pairwise gradients ∂C[i, j]/∂w, calculated for each document's pair (i, j).
Perform a gradient descent step (i.e. w := w - u * ∂C/∂w, where u is the step size).
In the "Speeding up RankNet" paragraph, the notion λ[i] was introduced as the contribution of each document's computed score (using the model weights at the current iteration) to the overall gradient ∂C/∂w (at the current iteration). If we order our list of documents by the scores from the model at the current iteration, each λ[i] can be thought of as an "arrow" attached to each document, whose sign tells us in which direction, up or down, that document should be moved to increase NDCG. Again, NDCG is computed from the order predicted by our model.
Now, the problem is that the lambdas λ[i, j] for each pair (i, j) contribute equally to the overall gradient. That means the ranking of documents below, say, the 100th position is given equal importance to the ranking of the top documents. This is not what we want: we should prioritize having relevant documents at the very top much more than having a correct ranking below the 100th position.
That's why we multiply each of those "arrows" by |∆NDCG|, to emphasize the top rankings more than the rankings at the bottom of our list.
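To make this concrete, here is a minimal sketch of computing |∆NDCG| for one swap. The relevance list is assumed to be already ordered by the current model's scores, and the DCG gain/discount follow the common (2^rel - 1) / log2(pos + 2) convention; the data is hypothetical.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain for a list ordered by model score.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def delta_ndcg(relevances, i, j):
    # |∆NDCG| from swapping the documents at rank positions i and j
    # in the list ordered by the current model's scores.
    ideal = dcg(sorted(relevances, reverse=True))  # normalizer (IDCG)
    before = dcg(relevances) / ideal
    swapped = relevances[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    after = dcg(swapped) / ideal
    return abs(after - before)

# Graded relevance labels, in the order predicted by the model.
rels = [3, 2, 3, 0, 1]
print(delta_ndcg(rels, 0, 3))  # swapping a top, relevant doc far down: large |∆NDCG|
print(delta_ndcg(rels, 3, 4))  # swapping two low-ranked docs: small |∆NDCG|
```

Note how a swap involving a top position yields a much larger |∆NDCG| than a swap near the bottom, which is exactly the weighting effect described above.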

DAX - 2 phased weighted average with 2 different weight measures

I have a rather complex problem to express. In DAX (PowerPivot) I am trying to create a measure that uses two different weighted averages in one measure, depending on the aggregation level.
The problem is complicated even more because the weight measures have different levels of duplication (a distinct SUM needs to be applied to them). I have been able to create distinct-SUM measures 1 and 2 to solve that:
[Weight_1] = SUMX(DISTINCT(Table1[Level2]), [SupportWeight1])
[SupportWeight1] = MAX(Table1[RevenueLevel2])
[Weight_2] = SUMX(DISTINCT(Table1[Level3]), [SupportWeight2])
[SupportWeight2] = MAX(Table1[RevenueLevel3])
So far so good. This was necessary because, as you will see in the example below, both measures need to be "deduplicated" during aggregation. Weight_1 is unique per the Level2 dimension, and Weight_2 is unique at a higher level, per the Level3 dimension.
After that I wanted to create a weighted average utilizing Weight_1, creating a new supporting column:
[Measure x Weight_1] = [Measure] * [Weight_1]
I forgot to mention: the weights are unique at higher granularity (Level2 and Level3), whereas [Measure] is unique at the lowest granularity, Level1.
Now I had everything to create the first weighted average:
[Measure Weight_1] = SUMX(Table1, [Measure x Weight_1]) / SUMX(Table1, [Weight_1])
It was done and works as expected.
The tricky part starts now. I thought that by simply creating one more support column and a final measure I would accomplish the "final weighted" measure, and that it would behave as expected: one way on Levels 1, 2, 3 and a different way on Levels 4, 5, 6, ...
So I used [Measure Weight_1] and created:
[Measure Weight_1 x Weight_2] = [Measure Weight_1] * [Weight_2]
and consequently the final measure:
[Measure Weight_2 over Weight_1] = SUMX(Table1, [Measure Weight_1 x Weight_2]) / SUMX(Table1, [Weight_2])
However, it does not work; obviously those measures are on different granularities and do not come together during aggregation. The issue in the final measure is that the Level3 aggregation is an arithmetic average, not a weighted average, but I am expecting the same result there as in [Measure Weight_1], simply because Weight_2 has the same value on Levels 3, 2 and 1.
Something like this would probably be handled in MDX with some kind of scoping function.
Maybe the issue is the column [Measure Weight_1 x Weight_2]; maybe I need to aggregate this first as well.
Maybe it could be accomplished with ROLLUP functions, but I am not certain how to write it.
I am stuck here.
To rewrite my desired solution programmatically:
Weighted Average of Measure X =
  IF Dim = Level1 THEN Measure
  IF Dim = Level2 THEN AVG(Measure)
  IF Dim = Level3 THEN SUM(AVG(Measure) * Weight_1) / SUM(Weight_1)
  IF Dim = Level4 THEN SUM(SUM(AVG(Measure) * Weight_1) / SUM(Weight_1) * Weight_2) / SUM(Weight_2)
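The pseudocode above can be sketched in plain Python to pin down the intended arithmetic (the data layout, column names, and toy values are all hypothetical): [Measure] lives at the Level1 leaf, Weight_1 is one value per Level2 member, and Weight_2 is one value per Level3 member.

```python
from collections import defaultdict
from statistics import mean

# Leaf rows: (level3 member, level2 member, measure value).
rows = [
    ("L3a", "L2a", 10.0),
    ("L3a", "L2a", 20.0),
    ("L3a", "L2b", 30.0),
    ("L3b", "L2c", 40.0),
]
weight1 = {"L2a": 2.0, "L2b": 1.0, "L2c": 3.0}  # unique per Level2
weight2 = {"L3a": 5.0, "L3b": 1.0}              # unique per Level3

def level3_value(level3):
    # SUM(AVG(Measure) * Weight_1) / SUM(Weight_1) over the Level2 groups.
    groups = defaultdict(list)
    for l3, l2, m in rows:
        if l3 == level3:
            groups[l2].append(m)
    num = sum(mean(ms) * weight1[l2] for l2, ms in groups.items())
    den = sum(weight1[l2] for l2 in groups)
    return num / den

def level4_value():
    # SUM(Level3 result * Weight_2) / SUM(Weight_2).
    level3s = {l3 for l3, _, _ in rows}
    num = sum(level3_value(l3) * weight2[l3] for l3 in level3s)
    den = sum(weight2[l3] for l3 in level3s)
    return num / den

print(level3_value("L3a"))  # (15*2 + 30*1) / 3 = 20.0
print(level4_value())       # (20*5 + 40*1) / 6 = 23.33...
```

The key point the sketch makes explicit is that the Weight_2 average is taken over the already-aggregated Level3 results, never over the raw leaf rows, which is why a single SUMX over Table1 cannot express it.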

IBM SPSS 20: Rank and output a Matrix of Pearson's Correlations?

I was given a data set with 1,000 variables and have been asked to run Pearson's correlations on the explanatory variables and a binary dependent variable. I generate the correlations using the following code:
correlations /variables = x1 to x500 with y.
correlations /variables = x501 to x1000 with y.
The resulting output is a table which appears unsortable in SPSS or other software (e.g. Excel):
x1 Pearson Correlation
p-value
N
-----------------------
x2 Pearson Correlation
p-value
N
-----------------------
.
.
.
-----------------------
xi Pearson Correlation
p-value
N
-----------------------
I want to be able to rank the variables according to Pearson's Correlation and then p-value. Does SPSS have the capability to save the Variable Name, Pearson's Correlation value and p-Value as a table, and then rank them?
I am too used to Stata and R and could not find anything in the manual. Would a workaround be to run a univariate regression 1,000 times, with a single explanatory variable each time, and try saving those coefficients?
Thanks!
You can easily pivot the statistics into the columns of the output table, which would give you a sortable arrangement. Try it with a few variables to see how this works: double-click the table to activate it, then use Pivot > Pivoting Trays to open the controls for pivoting.
To do this for your real data, you will want to capture the table using OMS, creating a new dataset, which you can then sort or do other data manipulation operations. When you create your OMS command, you will want to tell it to pivot the table so that the dataset arrangement is convenient.
Bear in mind that fishing for the highest correlations is likely to give you an overly optimistic view of the predictive power of the top variables.
The NAIVEBAYES procedure (Statistics Server) might be another approach to consider. Check the Command Syntax Reference for details.
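For comparison, the "correlate each x with y, then rank by correlation" step is easy to sketch outside SPSS. The following plain-Python version (toy data, hypothetical variable names) computes Pearson's r per variable and sorts by absolute value:

```python
import math

def pearson(xs, ys):
    # Pearson product-moment correlation of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Binary dependent variable and a few toy explanatory variables.
y = [0, 0, 1, 1, 1]
variables = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [5, 4, 3, 2, 1],
    "x3": [1, 1, 2, 1, 2],
}

# Rank variables by absolute correlation with y, strongest first.
ranked = sorted(variables, key=lambda v: abs(pearson(variables[v], y)), reverse=True)
print(ranked)
```

This only replicates the ranking step; the OMS route above is still the way to capture the actual SPSS output (including p-values) as a dataset.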

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space into halves, such as Support Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
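The one-vs-all decision rule above is just an argmax over per-class confidences. Here is a minimal sketch with stand-in "classifiers" (the scoring function is a made-up toy, not a real SVM; in practice each would be a trained binary model):

```python
def make_classifier(center):
    # Hypothetical toy scorer standing in for a trained binary classifier:
    # inputs closer to `center` get a higher "probability" of the class.
    return lambda x: 1.0 / (1.0 + abs(x - center))

# One binary classifier per class label, as in the C1/C2/C3 example above.
classifiers = {1: make_classifier(0.0), 2: make_classifier(5.0), 3: make_classifier(10.0)}

def predict(x):
    # Pick the class whose one-vs-all classifier is most confident.
    # As noted above, the scores need not sum to 1.
    return max(classifiers, key=lambda k: classifiers[k](x))

print(predict(9.0))  # nearest center is 10.0 -> class 3
print(predict(0.4))  # nearest center is 0.0  -> class 1
```

The same rule extends to any finite label set: one binary classifier per label, then argmax.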
If your output labels are ordered (with some kind of meaningful, rather than arbitrary, ordering), another option opens up. For example, in finance you might want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data), you can assign the values -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
Another approach is the Crammer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)
