I am just learning SPSS; my background is in PL/SQL and T-SQL.
I have a dataset and need to create three groups based on deviation from the mean of a specific variable:
Greater than 1 Standard Deviation Above the Mean
Greater than 1 Standard Deviation Below the Mean
All Others
I wanted to use a scratch variable, but I have no idea how to find the standard deviation of an existing variable and store it in a scratch variable to use in my grouping conditions.
Any help appreciated
The AGGREGATE command can calculate the SD of a variable and add it to the dataset, like this:
aggregate /outfile=* mode=addvariables /break=
  /SDyourvar=sd(yourvar) /MEANyourvar=mean(yourvar).
Now you can use the variables to create groups like this for example:
do if yourvar < (MEANyourvar - SDyourvar).
compute group=-1.
else if yourvar > (MEANyourvar + SDyourvar).
compute group=1.
else.
compute group=0.
end if.
Or for a shorter version:
compute group=(yourvar > (MEANyourvar+SDyourvar)) - (yourvar < (MEANyourvar-SDyourvar)).
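For readers coming from a procedural background, the same grouping logic can be sketched in plain Python (a standalone illustration, not SPSS; `statistics.stdev` is the sample SD, matching what AGGREGATE's SD computes):

```python
import statistics

def sd_groups(values):
    """Assign each value to -1 / 0 / 1 depending on whether it lies
    more than one (sample) standard deviation below / within / above the mean."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample SD, like SPSS AGGREGATE's SD
    # Same trick as the short SPSS version: (above) - (below) gives -1/0/1.
    return [(v > m + s) - (v < m - s) for v in values]

data = [1, 2, 3, 4, 5, 6, 7, 100]
print(sd_groups(data))  # → [0, 0, 0, 0, 0, 0, 0, 1]
```

Here the mean is 16 and the sample SD is 34, so only 100 falls more than one SD above the mean.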
I know I can create a language model with 1 head:
from transformers import AutoModelForMultipleChoice
model = AutoModelForMultipleChoice.from_pretrained("distilbert-base-cased").to(device)
But how can I create the same base model structure (e.g., distilbert-base-cased) with 2 heads? Say, one is AutoModelForMultipleChoice and the second is AutoModelForSequenceClassification. I need the only difference between the 2 models (1 head vs 2 heads) to be the additional head (from parameters perspective).
So my input for the two-head model would be something like [sequence_label, multiple_choice_labels].
In the general case, you will need to create a custom class derived from DistilBertPreTrainedModel. Inside __init__() you define the architectures of both heads; then you write your own forward() that computes a custom loss involving both heads and returns the result.
But if you are talking specifically about DistilBertForMultipleChoice and DistilBertForSequenceClassification, there is a shortcut: the head architectures happen to be identical (see the source), and the difference is only in the loss function. So you can try to train your model as a multi-label sequence-classification problem, where the label per sequence is [sequence_label, multiple_choice_label_0, multiple_choice_label_1, ...]. For example, if you have an entry like {sequence, choice0, choice1, seq_label: True, correct_choice: 0},
your dataset will be
[ {'text': (sequence, choice0), 'label': (1, 1, 0)},
  {'text': (sequence, choice1), 'label': (1, 0, 0)} ]
This way the result of the sequence classification will be in the first position, and to get the probability of the correct choice you apply a softmax to the rest of the logits.
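As a concrete sketch of the decoding step (plain Python with hypothetical logits, just to illustrate the idea): the first logit answers the sequence-classification question, and a softmax over the remaining logits turns them into choice probabilities.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical model output for one sequence with two choices:
# logits[0] -> sequence label, logits[1:] -> the choices.
logits = [2.0, 1.5, -0.5]
seq_prob = 1 / (1 + math.exp(-logits[0]))  # sigmoid for the binary sequence label
choice_probs = softmax(logits[1:])         # probabilities over the choices
```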
I have a list of about 30 variables, all named something like test_1, test_2, test_3, etc. I need to check whether the values are all the same, and I typically do so by exporting to Excel and using an IF statement comparing the min value to the max (i.e. if min = max, then all the values are the same).
Is there a way I can do this right in SPSS without having to export? It seems inefficient to compare if test_1=test_2 and test_2=test_3 etc.
This is sort of a hack, but it gets the job done: you can calculate the standard deviation across all your variables:
compute sd_test=SD(test_1, test_2, ..., test_n).
EXECUTE.
sd_test will be 0 for records where all the test_i variables are equal.
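The same zero-SD idea, sketched in plain Python for illustration (a simple `min == max` check would work just as well):

```python
import statistics

def all_equal(values):
    """The SD of identical values is 0, so a zero (population) SD
    means every value in the row is the same.
    Equivalent check: min(values) == max(values)."""
    return statistics.pstdev(values) == 0

print(all_equal([4, 4, 4]))  # → True
print(all_equal([4, 4, 5]))  # → False
```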
I'm running tensorflow 2.1 and tensorflow_probability 0.9. I have fit a Structural Time Series Model with a seasonal component.
I wish to implement an Integrated Random walk in order to smooth the trend component, as per Time Series Analysis by State Space Methods: Second Edition, Durbin & Koopman. The integrated random walk is achieved by setting the level component variance to equal 0.
Is implementing this constraint possible in Tensorflow Probability?
Further to this, Durbin & Koopman also discuss higher-order random walks. Could these be implemented?
Thanks in advance for your time.
If I understand correctly, an integrated random walk is just the special case of LocalLinearTrend in which the level simply integrates the randomly evolving slope component (i.e. it has no independent source of variation). You could patch this in by subclassing LocalLinearTrend and fixing level_scale = 0. in the models it builds:
class IntegratedRandomWalk(sts.LocalLinearTrend):

  def __init__(self,
               slope_scale_prior=None,
               initial_slope_prior=None,
               observed_time_series=None,
               name=None):
    super(IntegratedRandomWalk, self).__init__(
        slope_scale_prior=slope_scale_prior,
        initial_slope_prior=initial_slope_prior,
        observed_time_series=observed_time_series,
        name=name)
    # Remove 'level_scale' parameter from the model.
    del self._parameters[0]

  def _make_state_space_model(self,
                              num_timesteps,
                              param_map,
                              initial_state_prior=None,
                              initial_step=0):
    # Fix `level_scale` to zero, so that the level
    # cannot change except by integrating the slope.
    param_map['level_scale'] = 0.
    return super(IntegratedRandomWalk, self)._make_state_space_model(
        num_timesteps=num_timesteps,
        param_map=param_map,
        initial_state_prior=initial_state_prior,
        initial_step=initial_step)
(it would be mathematically equivalent to build a LocalLinearTrend with level_scale_prior concentrated at zero, but that constraint makes inference difficult, so it's generally better to just ignore or remove the parameter entirely, as I did here).
By higher-order random walks, do you mean autoregressive models? If so, sts.Autoregressive might be relevant.
There does not seem to be an "easy" way (as there is in R or Python) to create interaction terms between dummy variables in gretl. Do we really need to code them manually, which would be difficult for factors with many levels? Here is a minimal example of manual coding:
open credscore.gdt
SelfemplOwnRent=OwnRent*Selfempl
# model 1
ols Acc 0 OwnRent Selfempl SelfemplOwnRent
Now, my manual interaction term will not work for factors with many levels, and in fact it does not even do the job for binary variables.
Thanks,
ML
One way of doing this is to use lists: use the dummify command to generate dummies for each level and the ^ operator to create the interactions. Example:
open griliches.gdt
discrete med
list X = dummify(med)
list D = dummify(mrt)
list INT = X^D
ols lw 0 X D INT
The discrete command turns your variable into a discrete variable and allows dummify to be used (this step is not necessary if the variable is already discrete). All interaction terms are now stored in the list INT, and you can easily access them in the following ols command.
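Outside gretl, the same dummify-and-cross idea can be sketched in plain Python (the helper names and the tiny med/mrt samples here are made up for illustration): build one 0/1 dummy column per level, then multiply the columns pairwise to get the interaction dummies.

```python
from itertools import product

def dummify(values):
    """One 0/1 dummy column per distinct level, keyed by level."""
    levels = sorted(set(values))
    return {lv: [int(v == lv) for v in values] for lv in levels}

def interact(cols_a, cols_b):
    """Elementwise products of every column pair: the interaction dummies,
    analogous to gretl's X^D on two dummified lists."""
    return {(a, b): [x * y for x, y in zip(cols_a[a], cols_b[b])]
            for a, b in product(cols_a, cols_b)}

med = [8, 12, 8, 16]   # a factor with three levels
mrt = [0, 1, 1, 0]     # a binary factor
INT = interact(dummify(med), dummify(mrt))
```

Each key of `INT` names the pair of levels being crossed, which mirrors how gretl labels the series it stores in the list INT.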
@Markus Loecher, on your second question:
You can always use the rename command to rename a series, so you would have to loop over all elements of the list INT to do so. However, I would rather suggest renaming both input series (mrt and med in the example above) before computing the interaction terms if you want shorter series names.
I have an sav file with plenty of variables. What I would like to do now is create macros/routines that detect basic properties of a range of item sets, using SPSS syntax.
COMPUTE scale_vars_01 = v_28 TO v_240.
The code above is intended to define a range of items which I would like to observe in further detail. How can I get the number of elements in the "array" scale_vars_01, as an integer?
Thanks for the info. (As you can see, SPSS syntax is still kind of strange to me, and I am considering using Python instead, but that might be too much overhead for my relatively simple purposes.)
One way is to use COUNT, such as:
COUNT Total = v_28 TO v_240 (LO THRU HI).
This will count all of the valid values in the vector. It will not work if the vector contains mixed types (e.g. string and numeric) or if the vector has missing values. A less efficient way to get the entire count, using DO REPEAT, is below:
DO IF $casenum = 1.
  COMPUTE Total = 0.
  DO REPEAT V = v_28 TO v_240.
    COMPUTE Total = Total + 1.
  END REPEAT.
ELSE.
  COMPUTE Total = LAG(Total).
END IF.
This will work for mixed-type variables, and it will count fields with missing values. (The DO IF would work the same way with COUNT; it forces a data pass, but for large datasets and long variable lists it only evaluates the count for the first case.)
Python is probably the most efficient way to do this though - and I see no reason not to use it if you are familiar with it.
BEGIN PROGRAM.
import spss
beg = 'X1'
end = 'X10'
MyVars = []
for i in range(spss.GetVariableCount()):  # xrange on Python 2 installations
    x = spss.GetVariableName(i)
    MyVars.append(x)
n_vars = MyVars.index(end) - MyVars.index(beg) + 1
print(n_vars)
END PROGRAM.
SPSS Statistics has a built-in macro facility that could be used to define sets of variables, but the Python APIs provide much more powerful ways to access and use the metadata. There is also an extension command, SPSSINC SELECT VARIABLES, that can define macros based on variable metadata such as patterns in names, measurement level, type, and other properties. It generates a macro listing these variables that can then be used in standard syntax.