I have a rather complex problem to express. In DAX (Power Pivot) I am trying to create a measure that uses two different weighted averages in one measure, depending on the aggregation level.
The problem is complicated further because the weight measures have different levels of duplication (a distinct SUM needs to be applied to them). I have been able to create distinct-SUM measures 1 and 2 to solve that:
[Weight_1] = SUMX(DISTINCT(Table1[Level2]), [SupportWeight1])
[SupportWeight1] = MAX(Table1[RevenueLevel2])
[Weight_2] = SUMX(DISTINCT(Table1[Level3]), [SupportWeight2])
[SupportWeight2] = MAX(Table1[RevenueLevel3])
So far so good. This was necessary because, as you will see in the example below, both measures need to be "deduplicated" during aggregation.
Weight_1 is unique per the Level2 dimension, and Weight_2 is unique one level higher, per the Level3 dimension.
After that I wanted to create a weighted average utilizing Weight_1, creating a new supporting column:
[Measure x Weight_1] = [Measure] * [Weight_1]
I forgot to mention: the weights are unique at the higher granularities (Level2 and Level3), whereas [Measure] is unique at the lowest granularity, Level1.
Now I had everything to create the first weighted average:
[Measure Weight_1] = SUMX(Table1, [Measure x Weight_1]) / SUMX(Table1, [Weight_1])
It was done and works as expected.
Now the tricky part started. I was thinking that by simply creating one more supporting column and a final measure I would accomplish the "final weighted" measure, and that it would behave as expected: one way on Levels 1, 2, 3 and a different way on Levels 4, 5, 6, and so on.
So I used [Measure Weight_1] and created:
[Measure Weight_1 x Weight_2] = [Measure Weight_1] * [Weight_2]
Consequently, the final measure:
[Measure Weight_2 over Weight_1] = SUMX(Table1, [Measure Weight_1 x Weight_2]) / SUMX(Table1, [Weight_2])
However, it does not work: obviously those measures are at different granularities and they do not come together during aggregation. The issue in the final measure is that the Level3 aggregation is an arithmetic average, not a weighted average, whereas I expect to get the same result there as in [Measure Weight_1], simply because Weight_2 has the same value at Levels 3, 2 and 1.
In MDX, something like this would probably be handled with a Focus-type function.
Maybe the issue is the column [Measure Weight_1 x Weight_2]; maybe I need to aggregate this first as well.
Maybe it could be accomplished with ROLLUP functions, but I am not certain how to write it.
I am stuck here.
Let me try to write out my desired solution programmatically:
Weighted Average of Measure X =
IF Dim = Level1 Then Measure
IF Dim = Level2 Then AVG(Measure)
IF Dim = Level3 Then SUM(AVG(Measure)*Weight_1) / SUM(Weight_1)
IF Dim = Level4 Then SUM(SUM(AVG(Measure)*Weight_1) / SUM(Weight_1) * Weight_2) / SUM(Weight_2)
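To make the intended arithmetic concrete, here is a rough sketch of the same nested logic in Python/pandas rather than DAX (the toy table, column names and numbers are all made up; this only illustrates the two-stage weighted average I am after, not a working DAX measure):

import pandas as pd

# Toy data mirroring the granularities above: Measure is unique per Level1 row,
# Weight_1 is duplicated within a Level2 member, Weight_2 within a Level3 member.
df = pd.DataFrame({
    "Level3":   ["A", "A", "A", "B", "B"],
    "Level2":   ["a1", "a1", "a2", "b1", "b2"],
    "Level1":   [1, 2, 3, 4, 5],
    "Measure":  [10.0, 20.0, 30.0, 40.0, 50.0],
    "Weight_1": [2.0, 2.0, 3.0, 1.0, 4.0],
    "Weight_2": [5.0, 5.0, 5.0, 1.0, 1.0],
})

# Level2: plain average of Measure, with Weight_1 / Weight_2 deduplicated (MAX)
lvl2 = df.groupby(["Level3", "Level2"]).agg(
    avg_measure=("Measure", "mean"),
    w1=("Weight_1", "max"),
    w2=("Weight_2", "max"),
).reset_index()

# Level3: weighted average of the Level2 averages, weighted by Weight_1
lvl2["m_x_w1"] = lvl2["avg_measure"] * lvl2["w1"]
lvl3 = lvl2.groupby("Level3").agg(
    num=("m_x_w1", "sum"), den=("w1", "sum"), w2=("w2", "max")
).reset_index()
lvl3["wavg_1"] = lvl3["num"] / lvl3["den"]

# Level4 and above: weighted average of the Level3 results, weighted by Weight_2
wavg_2 = (lvl3["wavg_1"] * lvl3["w2"]).sum() / lvl3["w2"].sum()
print(lvl3[["Level3", "wavg_1"]])
print(wavg_2)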
I have a data set with 20 non-overlapping different swap rates (spot1y, 1y1y, 2y1y, 3y1y, 4y1y, 5y2y, 7y3y, 10y2y, 12y3y...) over the past year.
I want to use PCA / multiple regression and look at residuals in order to determine which sectors on the curve are cheap/rich. Has anyone had experience with this? I've done PCA, but not for time series. I'd ideally like to model something similar to the first figure here, but in USD.
https://plus.credit-suisse.com/rpc4/ravDocView?docid=kv66a7
Thanks!
Here are some broad strokes that can help answer your question. Also, that's a neat analysis from CS :)
Let's be pythonistas and use NumPy. You can imagine your dataset as a 20x261 array of floats. The first place to start is creating the array. Suppose you have a CSV file storing the raw data persistently. Then a reasonable first step to load the data would be something as simple as:
import numpy
x = numpy.loadtxt("path/to/my/file", delimiter=",")
The object x is our raw time series matrix, and we verify that x.shape == (20, 261). The next step is to transform this array into its covariance matrix. Whether that has already been done on the raw data or still has to be done, the first step is centering each time series on its mean, like this:
x_centered = x - x.mean(axis=1, keepdims=True)
The purpose of this step is to simplify any necessary rescaling, and it is a good habit that usually shouldn't be skipped. The call to x.mean uses the parameters axis and keepdims to make sure each row (e.g. the time series for spot1y, ...) is centered on its own mean.
The next step is to square and scale x to produce a swap rate covariance array. With 2-dimensional arrays like x there are two ways to square it: one leads to a 261x261 array and the other to a 20x20 array. It's the second array we are interested in, and the squaring procedure that works for our purposes is:
x_centered_squared = numpy.matmul(x_centered, x_centered.transpose())
Then, to scale, one can choose between 1/261 and 1/(261-1) depending on the statistical context (population versus sample covariance), which looks like this:
x_covariance = x_centered_squared * (1/261)
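As a quick cross-check, numpy.cov computes the same matrix directly; it treats each row as one variable and uses the 1/(261-1) convention by default:

# Optional sanity check: numpy.cov centers the rows and divides by (n - 1)
# by default, so it matches the "sample covariance" scaling choice.
assert numpy.allclose(numpy.cov(x), x_centered_squared / (261 - 1))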
The array x_covariance has an entry for how each swap rate changes with itself, and changes with any one of the other swap rates. In linear-algebraic terms, it is a symmetric operator that characterizes the spread of each swap rate.
Linear algebra also tells us that this array can be decomposed into its associated eigen-spectrum, the elements of which are scalar-vector pairs, i.e. eigenvalue-eigenvector pairs. In the analysis you shared, x_covariance's eigenvalues are plotted in exhibit two as percent variance explained. To produce the data for a plot like exhibit two (which you will always want to furnish to the readers of your PCA), you simply divide each eigenvalue by the sum of all of them, then multiply each by 100.0. Due to the convenient properties of x_covariance, a suitable way to compute its spectrum is:
vals, vects = numpy.linalg.eig(x_covariance)
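For instance, the percent-variance-explained numbers for an exhibit-two style chart come straight from vals (note that numpy.linalg.eig does not return the eigenvalues in any particular order, so sort before plotting):

# Percent of variance explained by each component, largest first.
explained = 100.0 * vals / vals.sum()
print(numpy.sort(explained)[::-1])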
We are now in a position to talk about residuals! Here is their definition (in our namespace): residuals_ij = x_centered_ij - reconstructed_ij, for i = 1..20 and j = 1..261. Thus for every (centered) datum there is a corresponding residual, and to find them we need to build the reconstructed array. We can do this column by column, operating on each column of x_centered with a change-of-basis operator to produce each reconstructed_i, which can be viewed as coordinates in a proper subspace of the original basis. The analysis describes a modified Gram-Schmidt approach to compute the change-of-basis operator we need, which ensures this subspace's basis is an orthogonal set.
What we are going to do is take the eigenvectors corresponding to the three largest eigenvalues and turn them into three mutually orthogonal vectors, called b1, b2, b3 below (the name x is already taken by the raw data matrix). There are plenty of discussions online about applying the Gram-Schmidt process to all sorts of practical problems, but for simplicity let's follow the analysis by hand:
# numpy.linalg.eig does not sort its output, so first find the columns of
# vects that correspond to the three largest eigenvalues.
order = numpy.argsort(vals)[::-1]
b1 = vects[:, order[0]]
b1b1 = numpy.dot(b1, b1)
b2 = vects[:, order[1]] - (numpy.dot(b1, vects[:, order[1]]) / b1b1) * b1
b2b2 = numpy.dot(b2, b2)
b3 = (vects[:, order[2]]
      - (numpy.dot(b1, vects[:, order[2]]) / b1b1) * b1
      - (numpy.dot(b2, vects[:, order[2]]) / b2b2) * b2)
It's reasonable to normalize these vectors either now or later; below it is done while assembling the change-of-basis array.
With the raw data we implicitly assumed the standard basis {e1, e2, ..., e20}; we now need a map between that basis and {b1, b2, b3}, which is given by:
# Normalize the basis vectors (see the note above) and stack them as columns.
b1, b2, b3 = (b / numpy.linalg.norm(b) for b in (b1, b2, b3))
ch_of_basis = numpy.array([b1, b2, b3]).transpose()   # shape (20, 3)
This can be used to compute each reconstructed_i, like this:
reconstructed = []
for measurement in x_centered.transpose().tolist():
    coords = numpy.dot(ch_of_basis.transpose(), measurement)   # coordinates in the {b1, b2, b3} basis
    reconstructed.append(numpy.dot(ch_of_basis, coords))       # map back to the original 20-dim basis
reconstructed = numpy.array(reconstructed).transpose()
And then you get the residuals by subtraction (on the centered data, so the means don't end up in the residuals):
residuals = x_centered - reconstructed
This flow might obviously need further tuning, but it's the gist of how to compute all the residuals. To get that periodic bar plot, take the average of each row in residuals.
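In code, that last step is just:

# Average residual per swap rate, e.g. for a bar chart of rich/cheap sectors.
avg_residuals = residuals.mean(axis=1)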
I have a cube in SSAS. It has several dimensions and one fact table. One of the dimensions is dimGoodsType, with a [Weight] attribute. I have factSoldItems, which has a [Price] measure. Now I want to calculate sum(price * weight) (each sold item has its dimGoodsTypeId, so it has the weight related to its goods type). How can I define this formula in MDX?
You can define another measure group in your cube with dimGoodsType as the source table and the Weight column as a measure, and connect it to the Goods Type dimension as usual. Then, in the properties tab of the Price measure, you can set the Measure Expression to [Measures].[Price] * [Measures].[Weight]. This calculation takes place before any aggregation. The main problem is that if you define the straightforward calculation Price * Weight, SSAS will first sum all weights and sum all prices in the context of the current cell, and only after that perform the multiplication, whereas you always want the multiplication to happen at the leaf level and to sum from there.
The other solution could be to create a view, view_factSoldItems, where you add a calculated column Weighted Price as price * weight, and then add this measure to the cube.
I implemented a cosine-theta function which calculates the relation between two articles. If two articles are very similar, their words should have quite some overlap. However, a cosine-theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters to give to such functions, and I do not know whether they are satisfactory solutions. I was thinking: I have the cosine-theta score, and I can calculate the percentage of overlap between the two articles (e.g. the number of overlapping words divided by the number of words in the article), and maybe some more interesting features. Then with that data I could perhaps write a function (what type of function I do not know, and that is part of the question!) and minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with 0/1 labels in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# than 4 overlapping words on articles of 492 words).
p = len(v3) / min(len(v1), len(v2))
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity.
if not denominator:
    return 0.0
else:
    return float(numerator) / denominator
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. Unless you have a labelled data set, it is up to you to choose how the overlap affects whether or not two strings "match". If you do have a labelled data set (i.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and optimise against that. I would recommend something like a neural net or an SVM due to the potentially high-dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, suppose a model out of 100 samples only predicts 1 answer (giving 99 unknowns). Technically, if that one answer is correct, the model has 100% precision but very low recall. Generally in machine learning you will find a trade-off between recall and precision.
Some people like to use metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I only want to target consumers who are likely to buy my product. If, however, we are looking to test for a deadly disease or markers for bank fraud, then it is feasible for that test to be precise only 10% of the time, as long as its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (either the "related" or the "unrelated" cluster) a point belongs to. Note, however, that here your features feel like they would be incredibly hard to define.
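For the labelled route, here is a minimal sketch of what training and scoring such a classifier could look like with scikit-learn; the feature choice (cosine score and the overlap ratio p) and the random stand-in data are assumptions, so plug in your own labelled pairs:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical feature matrix: one row per article pair, columns are
# [cosine score, overlap ratio p]. Random stand-in data so the sketch runs;
# replace with features and 0/1 labels from your own labelled pairs.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 0] + 0.3 * X[:, 1] > 0.8).astype(int)

clf = SVC(kernel="rbf")                                       # an SVM, as suggested above
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())  # F1 combines precision and recall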
This may sound like a very naive question. I checked Google and many YouTube videos for beginners, and pretty much all of them explain data weighting as if it were the most obvious thing. I still do not understand why data is being weighted.
Let's assume I have four features:
a b c d
1 2 1 4
If I pass each value through the sigmoid function, I already receive a bounded value (between 0 and 1).
I really don't understand why the data needs to be, or is recommended to be, weighted first. If you could explain this to me in a very simple manner, I would appreciate it a lot.
I think you are not talking about weighting data but features.
A feature is a column in your table, and by data I would understand rows.
The confusion now comes from the fact that weighting rows is also sometimes sensible, e.g., if you want to punish misclassification of the positive class more.
Why do we need to weight features?
I assume you are talking about a model like
prediction = sigmoid(sum_i weight_i * feature_i) > base
Let's assume you want to predict whether a person is overweight based on body weight, height, and age.
In R we can generate a sample dataset as:
height = rnorm(100,1.80,0.1) # normally distributed, mean 1.8, sd 0.1
weight = rnorm(100,70,10)
age = runif(100,0,100)
ow = weight / (height**2) > 25 # overweight if BMI > 25
data = data.frame(height,weight,age,ow)
If we now plot the data, you can see that (at least in my sample) the data can be separated with a straight line in the weight/height plane; age, however, does not provide any value. If we weight the features prior to the sum/sigmoid, we can put all factors into relation.
Furthermore, as you can see from the following plot, weight and height have very different domains. Hence they need to be put into relation, so that the separating line has the right slope, since the values of weight are one order of magnitude larger than those of height.
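To tie this back to the formula above, here is a rough sketch in Python (the weights and threshold are made up, purely for illustration) of how the weights put the features on a comparable footing:

import math

# Sketch of: prediction = sigmoid(sum_i weight_i * feature_i) > base.
def predict(features, weights, base=0.5):
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z)) > base

# Features: height (m), body weight (kg), age (years). With equal weights the
# body weight term (~70) swamps height (~1.8); learned weights rescale the
# features so each one contributes on a comparable footing.
print(predict([1.80, 70.0, 30.0], [1.0, 1.0, 1.0]))      # weight dominates
print(predict([1.80, 70.0, 30.0], [-10.0, 0.2, 0.01]))   # made-up, rescaled weights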
I have been given this raw data to use in SPSS and I'm so confused, since I'm used to R instead.
An experiment monitored the amount of weight gained by anorexic girls after various treatments. Girls were assigned to one of three groups: Group 1 had no therapy, Group 2 had cognitive behaviour therapy, and Group 3 had family therapy. The researchers wanted to know whether the two treatment groups produced weight gain relative to the control group.
This is the data
group1<- c(-9.3,-5.4,12.3,-2,-10.2,-12.2,11.6,-7.1,6.2,9.2,8.3,3.3,11.3,-10.6,-4.6,-6.7,2.8,3.7,15.9,-10.2)
group2<-c(-1.7,-3.5,14.9,3.5,17.1,-7.6,1.6,11.7,6.1,-4,20.9,-9.1,2.1,-1.4,1.4,-3.7,2.4,12.6,1.9,3.9,15.4)
group3<-c(11.4,11.0,5.5,9.4,13.6,-2.9,7.4,21.5,-5.3,-3.8,13.4,13.1,9,3.9,5.7,10.7)
I have been asked to come up with the mean and standard deviation of weight change for each treatment group (which I believe is the independent variable), and then to run an ANOVA on the data with pairwise comparisons.
I don't know where to start with this data besides putting it into SPSS. With R I would use the summary and anova functions, but with SPSS I'm lost.
Please help.
For comparison of means and one-way ANOVA (and all of the potential options), navigate the menus to Analyze -> Compare Means. Below is an example using Tukey post-hoc comparisons. In the future, just search the command syntax reference; a search for ANOVA would have told you all you needed to know.
DATA LIST FREE (",") / val.
BEGIN DATA
-9.3,-5.4,12.3,-2,-10.2,-12.2,11.6,-7.1,6.2,9.2,8.3,3.3,11.3,-10.6,-4.6,-6.7,2.8,3.7,15.9,-10.2
-1.7,-3.5,14.9,3.5,17.1,-7.6,1.6,11.7,6.1,-4,20.9,-9.1,2.1,-1.4,1.4,-3.7,2.4,12.6,1.9,3.9,15.4
11.4,11.0,5.5,9.4,13.6,-2.9,7.4,21.5,-5.3,-3.8,13.4,13.1,9,3.9,5.7,10.7
END DATA.
DATASET NAME val.
DO IF $casenum <= 20.
COMPUTE grID = 1.
ELSE IF $casenum > 20 AND $casenum <= 41.
COMPUTE grID = 2.
ELSE.
COMPUTE grID = 3.
END IF.
*Means and Standard Deviations.
MEANS
TABLES=val BY grID
/CELLS MEAN COUNT STDDEV .
*Anova.
ONEWAY val BY grID
/MISSING ANALYSIS
/POSTHOC = TUKEY ALPHA(.05).