Calculating proportional diversity in groups - SPSS

In my data set, consisting of employees nested in teams, I have to calculate a proportional diversity index for gender. This index is the percentage of minority group members present within each team. In my data set, males are coded as 0 and females as 1. Now I wonder if there is a simple way to come up with the number of minority members in each team.
Thanks for your guidance.

If what you need is just the percentage of males and females in each team, you can calculate:
sort cases by teamVar.
split file by teamVar.
freq genderVar.
split file off.
This will get you the results in the output window.
If you want the results in another dataset, you can use AGGREGATE:
dataset declare byteam.
aggregate outfile=byteam /break=teamVar
/Pfemales=PIN(genderVar, 1, 1)
/Pmales=PIN(genderVar, 0, 0).
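If you ever need the same summary outside SPSS, here is a rough equivalent in Python with pandas (just a sketch: the column names teamVar and genderVar mirror the variables used above, and the example values are made up):
import pandas as pd

# Hypothetical data frame with one row per employee (0 = male, 1 = female).
df = pd.DataFrame({
    "teamVar":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "genderVar": [0, 1, 1, 0, 0, 1, 0, 1, 1],
})

# Because gender is coded 0/1, the team mean times 100 is the percentage of females.
byteam = (100 * df.groupby("teamVar")["genderVar"].mean()).rename("Pfemales").to_frame()
byteam["Pmales"] = 100 - byteam["Pfemales"]
print(byteam)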

Related

How to verify if two text datasets are from different distributions?

I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence.
How do I measure whether both datasets are from the same distribution?
The purpose is to verify transfer learning from one distribution to another only if the difference between the distributions is statistically significant.
I am planning to use a chi-square test, but I'm not sure if it will help for text data, considering the high degrees of freedom.
update:
Example:
Suppose I want to train a sentiment classification model. I train a model on the IMDb dataset and evaluate it on the IMDb and Yelp datasets. I found that my model trained on IMDb still does well on Yelp. But the question is: how different are these datasets?
Train Dataset : https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Train.csv
Eval 1: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Valid.csv
Eval 2: https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset
Now,
How different are train and eval 1?
How different are train and eval 2?
Is the dissimilarity between train and eval 2 due to chance? What is the statistical significance and the p-value?
The question "are text A and text B coming from the same distribution?" is somewhat ill-defined. For example, these two questions (1,2) can be viewed as generated from the same distribution (the distribution of all questions on StackExchange) or from different distributions (the distributions of two different subdomains of StackExchange). So it's not clear what property you want to test.
Anyway, you can come up with any test statistic of your choice, approximate its distribution under the "single source" null hypothesis by simulation, and calculate the p-value of your test.
As a toy example, let's take two small corpora: two random articles from English Wikipedia. I'll do it in Python:
import requests
from bs4 import BeautifulSoup
urls = [
'https://en.wikipedia.org/wiki/Nanjing_(Liao_dynasty)',
'https://en.wikipedia.org/wiki/United_States_Passport_Card'
]
# download each article and keep only the main article text
texts = [BeautifulSoup(requests.get(u).text, 'html.parser').find('div', {'class': 'mw-parser-output'}).text for u in urls]
Now I use a primitive tokenizer to count individual words in texts, and use root mean squared difference in word relative frequencies as my test statistic. You can use any other statistic, as long as you calculate it consistently.
import re
from collections import Counter
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(re.findall(TOKEN, t)) for t in texts]
print([sum(c.values()) for c in counters])
# [5068, 4053]: texts are of approximately the same size
def word_freq_rmse(c1, c2):
    result = 0
    vocab = set(c1.keys()).union(set(c2.keys()))
    n1, n2 = sum(c1.values()), sum(c2.values())
    n = len(vocab)
    for word in vocab:
        result += (c1[word]/n1 - c2[word]/n2)**2 / n
    return result**0.5
print(word_freq_rmse(*counters))
# rmse is 0.001178, but is this a small or large difference?
I get a value of 0.001178, but I don't know whether this is a large difference. So I need to simulate the distribution of this test statistic under the null hypothesis, i.e. when both texts are from the same distribution. To simulate it, I merge the two texts into one, split the tokens randomly into two parts of the original sizes, and calculate my statistic when comparing these two random parts.
import random
tokens = [tok for t in texts for tok in re.findall(TOKEN, t)]
split = sum(counters[0].values())
distribution = []
for i in range(1000):
    random.shuffle(tokens)
    c1 = Counter(tokens[:split])
    c2 = Counter(tokens[split:])
    distribution.append(word_freq_rmse(c1, c2))
Now I can see how unusual the value of my observed test statistic is under the null hypothesis:
observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value) # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011 0.0006 0.0004
We see that when the texts are from the same distribution, my test statistic is on average 0.0004 and almost never exceeds 0.0006, so the value of 0.0011 is very unusual, and the null hypothesis that my two texts originate from the same distribution should be rejected.
I wrote an article about a problem which is similar to yours, but not exactly the same.
https://towardsdatascience.com/a-new-way-to-bow-analysis-feature-engineering-part1-e012eba90ef
The problem that I was trying to solve is to check if a word has different (significant) distributions across categories or labels.
There are a few similarities between your problem and the one I mentioned above.
You want to compare two data sources, which can be treated as two different categories.
Also, to compare the data sources, you will have to compare the words, as sentences can't be compared directly.
So, my proposed solution would be as follows:
Create word features across the two datasets using a count vectorizer and get the top X words from each.
Let's say you have N distinct words in total. Now initialize count = 0 and compare the distribution of each word across the two datasets; if the difference is significant, increment the counter. There could also be cases where a word exists in only one of the datasets, and that is good news: it shows that the word is a distinguishing feature, so increment the count in this case too.
Let's say the final count is n. The lower the n/N ratio, the more similar the two texts are, and vice versa.
Also, to verify this methodology, split the data from a single source into two parts (by random sampling) and run the above analysis; the n/N ratio should be close to 0, indicating that the two data sources are similar, which is indeed the case.
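A rough sketch of this procedure in Python (my own illustration, not part of the original article: it uses a crude whitespace tokenizer, a per-word chi-square test from scipy, and arbitrary choices for top_x and alpha):
from collections import Counter
from scipy.stats import chi2_contingency

def distinguishing_ratio(texts_a, texts_b, top_x=500, alpha=0.05):
    # crude whitespace tokenization; swap in any tokenizer or CountVectorizer output
    counts_a = Counter(w for t in texts_a for w in t.lower().split())
    counts_b = Counter(w for t in texts_b for w in t.lower().split())
    vocab = [w for w, _ in (counts_a + counts_b).most_common(top_x)]
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    n_signif = 0
    for w in vocab:
        a, b = counts_a[w], counts_b[w]
        if a == 0 or b == 0:
            n_signif += 1  # word appears in only one corpus: a distinguishing feature
            continue
        table = [[a, total_a - a], [b, total_b - b]]
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            n_signif += 1  # significantly different relative frequency
    return n_signif / len(vocab)  # low ratio -> the two corpora look similar
This returns the n/N ratio described above; running it on two random halves of a single corpus should give a value close to 0.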
Please let me know whether this approach worked or not; also, if you think there are any flaws in it, I would love to think about them and try to evolve it.

In the LambdaRank algorithm (in Learning to Rank), what does |∆NDCG| mean?

This article describes the LambdaRank algorithm for information retrieval. In formula 8 on page 6, the authors propose to multiply the gradient (lambda) by a term called |∆NDCG|.
I do understand that this term is the difference of two NDCGs when swapping two elements in the list:
the size of the change in NDCG (|∆NDCG|) given by swapping the rank positions of U1 and U2
(while leaving the rank positions of all other urls unchanged)
However, I do not understand which ordered list is considered when swapping U1 and U2. Is it the list ordered by the predictions from the model at the current iteration? Or is it the list ordered by the ground-truth labels of the documents? Or maybe the list of predictions from the model at the previous iteration, as suggested by Tie-Yan Liu in his book Learning to Rank for Information Retrieval?
Short answer: It's the list ordered by the predictions from the model at the current iteration.
Let's see why it makes sense.
At each training iteration, we perform the following steps (these steps are standard for all machine learning algorithms, whether for classification, regression, or ranking tasks):
Calculate scores s[i] = f(x[i]) returned by our model for each document i.
Calculate the gradient of the model's weights ∂C/∂w, back-propagated from RankNet's cost C. This gradient is the sum of all pairwise gradients ∂C[i, j]/∂w, calculated for each pair of documents (i, j).
Perform a gradient ascent step (i.e. w := w + u * ∂C/∂w, where u is the step size).
In "Speeding up RankNet" paragraph, the notion λ[i] was introduced as contributions of each document's computed scores (using the model weights at current iteration) to the overall gradient ∂C/∂w (at current iteration). If we order our list of documents by the scores from the model at current iteration, each λ[i] can be thought of as "arrows" attached to each document, the sign of which tells us to which direction, up or down, that document should be moved to increase NDCG. Again, NCDG is computed from the order, predicted by our model.
Now, the problem is that the lambdas λ[i, j] for each pair (i, j) contribute equally to the overall gradient. That means the ranking of documents below, let's say, the 100th position is given equal importance to the ranking of the top documents. This is not what we want: we should prioritize having relevant documents at the very top much more than having a correct ranking below the 100th position.
That's why we multiply each of those "arrows" by |∆NDCG|, to emphasize the top of the ranking more than the bottom of our list.
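To make this concrete, here is a small sketch (my own, not from the paper) of computing |∆NDCG| for a swap of two documents in the list ordered by the current model scores; it assumes the usual gain 2^rel - 1 and logarithmic discount:
import numpy as np

def dcg(relevances):
    # DCG with gain 2^rel - 1 and discount 1 / log2(rank + 1)
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum((2.0 ** relevances - 1) / np.log2(ranks + 1))

def delta_ndcg(scores, relevances, i, j):
    # |∆NDCG| from swapping documents i and j in the ranking
    # induced by the model scores at the current iteration.
    order = np.argsort(-scores)             # current predicted ranking, best first
    rel_sorted = relevances[order]
    ideal = dcg(np.sort(relevances)[::-1])  # ideal DCG, used for normalization
    before = dcg(rel_sorted) / ideal
    pos = {doc: p for p, doc in enumerate(order)}
    swapped = rel_sorted.copy()
    swapped[pos[i]], swapped[pos[j]] = swapped[pos[j]], swapped[pos[i]]
    after = dcg(swapped) / ideal
    return abs(after - before)

# Toy example: scores from the current model and graded relevance labels.
scores = np.array([2.3, 0.5, 1.7, 0.1])
relevances = np.array([0, 2, 1, 0])
print(delta_ndcg(scores, relevances, 0, 1))
Note that only the model scores determine the order; the relevance labels enter only through the gains.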

MDX Query between Fact and dimension table

I have a cube in SSAS; it has different dimensions and one fact table. One of the dimensions is dimGoodsType with a [weight] attribute. I have a factSoldItems table which has a [price] measure. Now I want to calculate sum(price * weight) (each sold item has its dimGoodsTypeId, so it has a weight related to its goods type). How can I define this formula in MDX?
You can define another measure group in your cube with dimGoodsType as the data source table and the Weight column as a measure, and connect it to the Goods Type dimension as usual. Then, in the properties tab of the Price measure, you can set the Measure Expression to [Measures].[Price] * [Measures].[Weight]. This calculation takes place before any aggregation. The main problem is that if you define a straightforward calculation as Price * Weight, SSAS will first sum all weights and all prices in the context of the current cell, and only then perform the multiplication; but you want to always do the multiplication at the leaf level and sum from there.
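To see why the order of operations matters, here is a tiny illustration in Python/pandas (made-up numbers, not MDX) of multiplying at the leaf level and then summing, versus multiplying the already-summed aggregates:
import pandas as pd

# Hypothetical fact rows, each already joined to its goods-type weight.
df = pd.DataFrame({
    "goods_type": ["A", "A", "B"],
    "price":      [10.0, 20.0, 5.0],
    "weight":     [2.0,  2.0,  3.0],
})

# Leaf-level multiplication, then aggregation (what the measure expression does):
leaf_then_sum = (df["price"] * df["weight"]).sum()           # 10*2 + 20*2 + 5*3 = 75

# Aggregation first, then multiplication (the naive calculated member):
sum_then_multiply = df["price"].sum() * df["weight"].sum()   # 35 * 7 = 245

print(leaf_then_sum, sum_then_multiply)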
The other solution could be to create a view_factSoldItems view in which you add a calculated column, Weighted Price, as price * weight, and then add this measure to the cube.

Machine learning: why do we need to weight data?

This may sound like a very naive question. I checked on Google and many YouTube videos for beginners, and pretty much all of them explain data weighting as if it were the most obvious thing. I still do not understand why data is being weighted.
Let's assume I have four features:
a b c d
1 2 1 4
If I pass each value through the sigmoid function, I'll already receive a value between 0 and 1.
I really don't understand why the data needs to be, or is recommended to be, weighted first. If you could explain this to me in a very simple manner, I would appreciate it a lot.
I think you are not talking about weighting data but weighting features.
A feature is a column in your table, and by data I would understand the rows.
The confusion comes from the fact that weighting rows is also sometimes sensible, e.g., if you want to punish misclassification of the positive class more.
Why do we need to weight features?
I assume you are talking about a model like
prediction = sigmoid(sum_i weight_i * feature_i) > base
Let's assume you want to predict whether a person is overweight based on body weight, height, and age.
In R, we can generate a sample dataset as follows:
height = rnorm(100, 1.80, 0.1)  # normally distributed, mean 1.8, sd 0.1
weight = rnorm(100, 70, 10)     # body weight in kg
age = runif(100, 0, 100)        # uniformly distributed age
ow = weight / (height**2) > 25  # overweight if BMI > 25
data = data.frame(height, weight, age, ow)
If we now plot the data, you can see that at least my sample of the data can be separated with a straight line in the weight/height plane. However, age does not provide any value. If we weight the features prior to the sum/sigmoid, we can put all the factors into relation.
Furthermore, as you can see from the plot, weight and height have very different domains. Hence, they need to be put into relation so that the separating line has the right slope, since the values of weight are an order of magnitude larger.
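As an illustration of how the learned weights put features with very different scales into relation, here is a small sketch in Python/scikit-learn mirroring the R example above (my own sketch, not part of the original answer); the coefficient learned for age should come out much closer to zero than the one for weight, because age carries no signal:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
height = rng.normal(1.80, 0.1, 100)          # metres
weight = rng.normal(70, 10, 100)             # kilograms
age = rng.uniform(0, 100, 100)               # years, unrelated to the label
ow = (weight / height**2 > 25).astype(int)   # overweight if BMI > 25

X = np.column_stack([height, weight, age])
model = LogisticRegression(max_iter=1000).fit(X, ow)
print(model.coef_)  # per-feature weights learned by the model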

How to derive means, SDs, and an ANOVA for this data in SPSS

I have been given this raw data to use in SPSS and I'm so confused, since I'm used to R instead.
An experiment monitored the amount of weight gained by anorexic girls after various treatments. Girls were assigned to one of three groups. Group 1 had no therapy, Group 2 had cognitive behaviour therapy, and Group 3 had family therapy. The researchers wanted to know if the two treatment groups produced weight gain relative to the control group.
This is the data:
group1<- c(-9.3,-5.4,12.3,-2,-10.2,-12.2,11.6,-7.1,6.2,9.2,8.3,3.3,11.3,-10.6,-4.6,-6.7,2.8,3.7,15.9,-10.2)
group2<-c(-1.7,-3.5,14.9,3.5,17.1,-7.6,1.6,11.7,6.1,-4,20.9,-9.1,2.1,-1.4,1.4,-3.7,2.4,12.6,1.9,3.9,15.4)
group3<-c(11.4,11.0,5.5,9.4,13.6,-2.9,7.4,21.5,-5.3,-3.8,13.4,13.1,9,3.9,5.7,10.7)
I have been asked to come up with the mean and standard deviation of the independent variable, which I believe is the treatment groups, as a function of weight.
Then do an ANOVA for the data and pairwise comparisons.
I don't know where to start with this data besides putting it into SPSS.
With R I would use the summary and anova functions, but with SPSS I'm lost.
Please help
For a comparison of means and a one-way ANOVA (and all of the potential options), navigate the menus to Analyze -> Compare Means. Below is an example using Tukey post-hoc comparisons. In the future, just search the command syntax reference; a search for ANOVA would have told you all you needed to know.
DATA LIST FREE (",") / val.
BEGIN DATA
-9.3,-5.4,12.3,-2,-10.2,-12.2,11.6,-7.1,6.2,9.2,8.3,3.3,11.3,-10.6,-4.6,-6.7,2.8,3.7,15.9,-10.2
-1.7,-3.5,14.9,3.5,17.1,-7.6,1.6,11.7,6.1,-4,20.9,-9.1,2.1,-1.4,1.4,-3.7,2.4,12.6,1.9,3.9,15.4
11.4,11.0,5.5,9.4,13.6,-2.9,7.4,21.5,-5.3,-3.8,13.4,13.1,9,3.9,5.7,10.7
END DATA.
DATASET NAME val.
DO IF $casenum <= 20.
COMPUTE grID = 1.
ELSE IF $casenum > 20 AND $casenum <= 41.
COMPUTE grID = 2.
ELSE.
COMPUTE grID = 3.
END IF.
*Means and Standard Deviations.
MEANS
TABLES=val BY grID
/CELLS MEAN COUNT STDDEV .
*Anova.
ONEWAY val BY grID
/MISSING ANALYSIS
/POSTHOC = TUKEY ALPHA(.05).
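If you want to sanity-check the SPSS output outside of SPSS, here is a quick sketch in Python (not part of the original answer; it assumes scipy and statsmodels are installed):
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [-9.3, -5.4, 12.3, -2, -10.2, -12.2, 11.6, -7.1, 6.2, 9.2,
          8.3, 3.3, 11.3, -10.6, -4.6, -6.7, 2.8, 3.7, 15.9, -10.2]
group2 = [-1.7, -3.5, 14.9, 3.5, 17.1, -7.6, 1.6, 11.7, 6.1, -4, 20.9,
          -9.1, 2.1, -1.4, 1.4, -3.7, 2.4, 12.6, 1.9, 3.9, 15.4]
group3 = [11.4, 11.0, 5.5, 9.4, 13.6, -2.9, 7.4, 21.5, -5.3, -3.8,
          13.4, 13.1, 9, 3.9, 5.7, 10.7]

print(f_oneway(group1, group2, group3))             # overall one-way ANOVA
vals = group1 + group2 + group3
labels = ["g1"] * len(group1) + ["g2"] * len(group2) + ["g3"] * len(group3)
print(pairwise_tukeyhsd(vals, labels, alpha=0.05))  # Tukey pairwise comparisons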
