Cox proportional hazards regression in SPSS using a reference group

I am running a Cox proportional hazards regression in SPSS to assess the association of a 'predictor' with the risk of a disease over a 10-year follow-up. I have another variable, 'age_quartiles', with values 1, 2, 3 and 4, and I want to use '1' as the reference to get HRs for 2, 3 and 4 relative to '1'. When I put this variable in Strata I still get only one HR, as follows ('S_URAT_07' is the continuous predictor):
Question: How do I get HRs of the predictor for the event in 'age_quartiles' 2, 3 and 4, keeping 1 as the reference group? 'age_quartiles' is not a predictor here. Am I supposed to choose a specific method?

As I answered yesterday to this same question on Cross Validated:
The model you're fitting involves only one parameter for changes in hazard as S_URAT_07 varies (i.e., B is the change in log hazard for a one-unit increase in S_URAT_07), regardless of the level of age_quartiles. When age_quartiles is used as a strata (stratification) variable, what differs across its levels is the baseline hazard function, and the hazards in different strata are then no longer constrained to be proportional to one another.
If you specify age_quartiles as a factor (called a categorical covariate in COXREG) rather than a strata variable, you'll again get a single coefficient for S_URAT_07, but also a set of three coefficients that reflect proportionally differing baselines for each level of age_quartiles. You can specify simple contrasts on the factor with the first level as the reference category to reflect comparisons with that category.
If you specify age_quartiles as a factor and also include the interaction between it and S_URAT_07, then you get separate proportional baseline hazard functions, but you also allow the impact of S_URAT_07 to differ depending on the age_quartiles level.
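To make the interaction case concrete, here is a minimal sketch of how per-quartile hazard ratios for S_URAT_07 would be derived from such a model. The coefficient values are invented for illustration and are not SPSS output. The sketch also assumes indicator (dummy) coding with quartile 1 as the reference, so that each interaction coefficient is the difference in log-hazard slope relative to quartile 1; other contrast codings change what the main-effect coefficient means.

```python
import math

# Hypothetical coefficients on the log-hazard scale, NOT real output.
# Assumes indicator coding with age_quartiles = 1 as the reference level.
b_predictor = 0.10                            # slope of S_URAT_07 in quartile 1
b_interaction = {2: 0.05, 3: -0.02, 4: 0.08}  # slope differences vs quartile 1


def hazard_ratio_per_unit(quartile):
    """HR for a one-unit increase in the predictor within one quartile."""
    slope = b_predictor + b_interaction.get(quartile, 0.0)
    return math.exp(slope)


hazard_ratios = {q: hazard_ratio_per_unit(q) for q in (1, 2, 3, 4)}
for q, hr in hazard_ratios.items():
    print(f"quartile {q}: HR per unit of S_URAT_07 = {hr:.3f}")
```

Under that coding, the reference quartile's HR is simply exp(B) for S_URAT_07, and every other quartile adds its own interaction coefficient before exponentiating.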


Generalized linear mixed model output in SPSS

I am writing my master's thesis and have run a generalized linear mixed regression model in SPSS (version 28) on count data.
Research question: what effect did population mobility have on Covid-19 incidence at the federal state level in Germany from February 2020 to November 2021?
To test the effect of population mobility (independent variable) on Covid-19 incidence (dependent variable), hierarchical models were used, with fixed factors:
mobility variables in 6 places (scale)
cumulative vaccination rate, second dose only (scale)
season, with summer as the reference category (nominal)
and random effects:
first model: days variable (time level) (scale)
second model: federal states variable, where each state has a number from 1 to 16 (place level) (nominal)
third model: both days and federal states (time and place level)
First, I built an intercept-only model to check which type of regression is more suitable for the count data (Poisson or negative binomial) and to choose the better of two candidate offset variables. Based on the AIC and BIC, negative binomial regression fit the data best.
Secondly, I checked the collinearity among the original 6 mobility variables and excluded those that were highly correlated, based on the VIF (only one variable was excluded).
Thirdly, I built 7 generalized linear models by gradually adding the fixed effects (the 5 mobility variables, the cumulative vaccination rate for dose 2, and season with summer as the reference category) to the intercept-only model. From these 7 models, the one with the best model fit was selected as the final model.
Finally, I built generalized linear mixed models from this final model by adding a classic random effect: first the days variable only (random-intercept component for time; TIME level), then the federal states variable only (random-intercept component for place; PLACE level), and finally both together.
I am not sure whether I ran this last step, the generalized linear mixed models, correctly.
These are my steps:
Analyze -> Mixed Models -> Generalized Linear Mixed Model -> Fields and Effects -> case
1. Target distribution and relationship (link) with the linear model -> Custom:
Distribution -> negative binomial
Link function -> log
2. Fixed effects -> include intercept & 5 mobility variables & cumulative vaccination rate & season
3. Random effects -> no intercept & days variable (TIME LEVEL)
Random effect covariance type: variance component
4. Weight and offset -> use offset field -> log expected cases adjusted wave variable
Build options such as General and Estimation remain unchanged (as suggested by SPSS)
Model options such as Estimated Means remain unchanged (as suggested by SPSS)
I followed the same steps for the other 2 models, except for the random effects:
3. Random effects -> no intercept & federal state variable (PLACE LEVEL)
3. Random effects -> no intercept & days variable & federal state variable (TIME & PLACE LEVEL)
1. The variance of the random effect of the days variable (time level) was very small, 5.565E-6, indicating only a marginal effect in the model (MODEL 1).
2. The covariance of the random effect of the federal states was zero and the variance was 0.079 (place level) (MODEL 2).
3. The variance of the random effect of the days variable was very small, 4.126E-6, and the covariance of the random effect of the federal states was zero and the variance was 0.060 (time and place level) (MODEL 3).
Can someone please check my steps, tell me which of the models from the last step is best for presenting the results, and also explain the last point in the output in the picture?
Thanks in advance to all of you...

Are data dependencies relevant when preparing data for neural network?

Data: When I have N rows of data like this: (x,y,z) where logically f(x,y)=z, that is z is dependent on x and y, like in my case (setting1, setting2 ,signal) . Different x's and y's can lead to the same z, but the z's wouldn't mean the same thing.
There are 30 unique setting1, 30 setting2 and 1 signal for each (setting1, setting2)-pairing, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (make them all into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 columns.
Is it correct to keep all the duplicate setting1 and setting2 values in the data set? Or should I remove them and include each unique value only once, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on samples where each observation is [900,3].
You flatten each observation and get an input layer of 3*900 values.
Some of those values are the result of a function of others.
Which function it is matters: if it is a linear function, the NN might not work well:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk the NN being biased towards said variables.
E.g. If you are running LMS on [x1,x2,x3,average(x1,x2)] to predict y, you basically assign a higher weight to the x1 and x2 variables.
Unless you have a reason to believe that those weights should be higher, don't include their function.
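A small sketch of why such an input is redundant for a linear model: once a column equals average(x1, x2), different weight vectors give exactly the same predictions, so the fit cannot tell them apart. The data rows and weights below are invented for illustration.

```python
import math


def with_average(x1, x2, x3):
    """Build the feature row [x1, x2, x3, average(x1, x2)]."""
    return [x1, x2, x3, (x1 + x2) / 2]


def predict(weights, features):
    """Plain linear model: weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))


rows = [with_average(2, 4, 5), with_average(1, 7, 0), with_average(-3, 6, 2)]

# Two different weight vectors: one uses x1 and x2 directly,
# the other puts all of that weight on the derived average column.
weights_direct = [1.0, 1.0, 0.5, 0.0]
weights_via_average = [0.0, 0.0, 0.5, 2.0]

pred_direct = [predict(weights_direct, r) for r in rows]
pred_via_average = [predict(weights_via_average, r) for r in rows]

# Identical predictions on every row: the weights are not identifiable.
same = all(math.isclose(a, b) for a, b in zip(pred_direct, pred_via_average))
print(same)
```

Any mix of the two weight vectors fits equally well, which is the sense in which the derived column adds no information.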
I was not able to find a link to support this, but my intuition is that you might want to shrink your input layer in addition to omitting the dependent values:
From Professor A. Ng's ML course I remember that the input should be the minimum set of values that are 'reasonable' for making the prediction.
'Reasonable' is vague, but I understand it like this: if you are trying to predict the price of a house, include footage, area quality and distance from a major hub; do not include average sunspot activity on the day of the open home, even if you have that data.
I would remove the duplicates. I would also look for any other data that can be omitted, and maybe run PCA over the full set of N x [900,3].
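As a rough sketch of what the layouts look like, the snippet below builds one data set and flattens it three ways. The setting values and the signal function are placeholders, since the real ones are not given. The "signals only" variant additionally assumes that every data set uses the same 30 x 30 grid in the same order, which is my assumption, not something stated in the question.

```python
# Hypothetical data set: 30 x 30 grid of settings, one signal per pair.
setting1_values = list(range(30))          # placeholder setting values
setting2_values = list(range(100, 130))    # placeholder setting values


def signal(s1, s2):
    """Stand-in for the measured signal; the real f(x, y) is unknown."""
    return 0.1 * s1 + 0.01 * s2


# One data set in the original [900, 3] layout.
data_set = [(s1, s2, signal(s1, s2))
            for s1 in setting1_values for s2 in setting2_values]

# Option A: flatten everything, repeating each setting value 30 times.
flat_with_duplicates = [value for row in data_set for value in row]

# Option B: unique settings once, then the 900 signals in a fixed grid order.
flat_unique = setting1_values + setting2_values + [z for _, _, z in data_set]

# Option C: signals only; the position in the row encodes (setting1, setting2),
# provided every data set uses the same grid and the same ordering.
signals_only = [z for _, _, z in data_set]

print(len(flat_with_duplicates), len(flat_unique), len(signals_only))
```

If the settings really are identical across all data sets, they are constant columns and carry no information for the classifier, so option C loses nothing as long as the ordering is fixed.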

Case of No examples left while constructing a Decision Tree

I was reading about decision trees (page 720) in the book Artificial Intelligence: A Modern Approach, 3rd edition. The book describes some cases that may occur after we split the training set (examples) by choosing an attribute. One of the cases mentioned is:
If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value calculated from the plurality classification of all the examples that were used in constructing the node’s parent.
I understand that by plurality classification they mean majority rule. But I am unable to understand when the above case could occur. Could someone give an example of a decision tree where it does?
Think of the problem as constructing a 2D table of occurrence counts where the column represents some feature or class to be considered and the rows represent particular configurations of other variables.
For example:
X Y Z | class counts
1 1 1 | ...
1 1 2 | ...
1 1 3 | ...
The table represents the joint distribution of the training set.
A particular combination of X, Y and Z (say 1,3,1) may not have been seen during training. The more variables you have, the more likely you are to encounter unseen combinations. If you have 10 variables, each with two states, there are 2^10 = 1024 possible configurations of those variables. With three states each, the number of configurations would be 3^10 = 59049, and so on.
Frankly, I would use 1/numberCols for any particular column with a missing row as you don't really have any information regarding it. You could use 1/Sum(rows) for each column but this may unnecessarily bias the result. Depends on the data.
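Here is a minimal sketch of the "no examples left" case from the book passage, using a made-up training set. The attribute X has three possible values, but the value 3 never occurs in the examples that reach the parent node, so that branch is empty and falls back to the parent's plurality class. The helper names are mine, and a real learner would recurse on the non-empty branches instead of stopping after one split.

```python
from collections import Counter

# Hypothetical training examples at a parent node: (attributes, class).
parent_examples = [
    ({"X": 1, "Y": 1}, "yes"),
    ({"X": 1, "Y": 2}, "yes"),
    ({"X": 1, "Y": 3}, "yes"),
    ({"X": 2, "Y": 1}, "no"),
    ({"X": 2, "Y": 2}, "no"),
]
x_domain = [1, 2, 3]  # X can take the value 3, but no example has it


def plurality_value(examples):
    """Most common class among the given examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def split_on(attribute, domain, examples):
    """Partition examples by attribute value, one branch per domain value."""
    return {v: [e for e in examples if e[0][attribute] == v] for v in domain}


branches = split_on("X", x_domain, parent_examples)
leaf_for = {}
for value, subset in branches.items():
    if not subset:
        # No examples left: fall back to the parent's plurality class.
        leaf_for[value] = plurality_value(parent_examples)
    else:
        # Simplification: a real learner would recurse on this subset.
        leaf_for[value] = plurality_value(subset)

print(leaf_for)
print(2 ** 10, 3 ** 10)  # number of configurations for 10 variables
```

The branch for X = 3 gets "yes" only because "yes" is the most common class among the parent's examples, not because anything was observed for that value.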

Non-linear interaction terms in Stata

I have a continuous dependent variable polity_diff and a continuous primary independent variable nb_eq. I have hypothesized that the effect of nb_eq will vary with different levels of the continuous variable gini_round in a non-linear manner: The effect of nb_eq will be greatest for mid-range values of gini_round and close to 0 for both low and high levels of gini_round (functional shape as a second-order polynomial).
My question is: How is this modelled in Stata?
To this point I've tried with a categorized version of gini_round which allows me to compare the different groups, but obviously this doesn't use data to its fullest. I can't get my head around the inclusion of a single interaction term which allows me to test my hypothesis. My best bet so far is something along the lines of the following (which is simplified by excluding some if-arguments etc.):
xtreg polity_diff c.nb_eq##c.gini_round_squared, fe vce(cluster countryno),
but I have close to 0 confidence that this is even nearly right.
Here's how I might do it:
sysuse auto, clear
reg price c.weight#(c.mpg##c.mpg) i.foreign
margins, dydx(weight) at(mpg = (10(10)40))
margins, dydx(weight) at(mpg=(10(10)40)) contrast(atcontrast(ar(2(1)4)._at) wald)
We interact weight with a second-degree polynomial in mpg. The first margins command calculates the average marginal effect of weight at different values of mpg. The graph looks like what you describe. The second margins command compares the slopes at adjacent values of mpg and does a joint test that they are all equal.
I would probably give weight its own effect as well (two octothorpes rather than one), but the graph does not come out like your example:
reg price c.weight##(c.mpg##c.mpg) i.foreign
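To see what these specifications imply, here is a small Python sketch of the marginal effect that margins, dydx() is evaluating. In both models the effect of x (weight, or nb_eq in your case) is a quadratic in the moderator g (mpg, or gini_round). The coefficients are invented so that the effect is 0 at g = 0 and g = 1 and largest in the middle; they are not estimates from the auto data.

```python
# Hypothetical coefficients for a model of the form
#   y = ... + b0*x + b1*x*g + b2*x*g^2 + ...
# (x plays the role of nb_eq or weight, g of gini_round or mpg).
b0 = 0.0    # own effect of x; absent (fixed at 0) in the single-# model
b1 = 4.0    # coefficient on x * g
b2 = -4.0   # coefficient on x * g^2


def marginal_effect(g):
    """dy/dx at a given value of the moderator g."""
    return b0 + b1 * g + b2 * g ** 2


# With b2 < 0 the effect is an inverted U in g, peaking at -b1 / (2 * b2).
peak_g = -b1 / (2 * b2)

for g in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"g = {g:.2f}: effect of x = {marginal_effect(g):.2f}")
print("peak at g =", peak_g)
```

Your hypothesis corresponds to a negative coefficient on the x * g^2 term with the peak inside the observed range of gini_round. With the ## version, b0 shifts the whole curve up or down, which is why that graph need not pass through 0 at the extremes.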

Conditional Random Field feature functions

I've been reading some papers on CRFs and am slightly confused about the feature functions. Unary (node) and binary (edge) features f are normally of the form
$$f(y_c, x_c) = \mathbb{1}\{y_c = \tilde{y}_c\}\, f_g(x_c),$$
where $\mathbb{1}\{\cdot\}$ is the indicator function, evaluating to 1 if the enclosed condition is true and 0 otherwise, and $f_g$ is a function of the data $x_c$ that extracts useful attributes (features) from it.
Now it seems to me that to create CRF features the true labels ($y_c$) must be known. This is true for training, but in the testing phase the true class labels are unknown (since we are trying to determine their most likely value).
Am I missing something? How can this be correctly implemented?
The idea with the CRF is that it assigns a score to each setting of the labels. So what you do, notionally, is compute the scores for all possible label assignments and then whichever labeling gets the biggest score is what the CRF predicts/outputs. This is only going to make sense if the CRF gives different scores to different label assignments. When you think of it that way it's clear that the labels must be involved in the feature functions for this to work.
Let's say the log-probability function for your CRF is F(x,y), so it assigns a number to each combination of a data sample x and a labeling y. When you get a new data sample, the predicted labeling at test time is just argmax_y F(new_x, y). That is, you find the value of y that makes F(new_x,y) the biggest, and that is the predicted labeling.
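A toy sketch of that argmax, assuming a two-word chain, two labels, and made-up weights. The label in each indicator feature is a candidate value, not the true label, so every feature can be evaluated for any proposed labeling. The brute-force search over all labelings is only for illustration; for chains, real implementations use dynamic programming such as Viterbi.

```python
from itertools import product

LABELS = ("N", "V")  # hypothetical label set


def g(word):
    """Data-only attribute f_g(x_c): 1.0 if the word ends in 's'."""
    return 1.0 if word.endswith("s") else 0.0


def indicator(condition):
    """1.0 if the condition holds, else 0.0."""
    return 1.0 if condition else 0.0


# Hypothetical weights, one per candidate label or candidate label pair.
w_node = {"N": -0.5, "V": 1.0}
w_edge = {("N", "N"): -1.0, ("N", "V"): 1.5, ("V", "N"): 0.5, ("V", "V"): -1.0}


def score(words, labels):
    """F(x, y): weighted sum of all node and edge features for one labeling."""
    total = 0.0
    for word, y in zip(words, labels):
        for cand in LABELS:  # node features 1{y_c = cand} * g(x_c)
            total += w_node[cand] * indicator(y == cand) * g(word)
    for y_prev, y_next in zip(labels, labels[1:]):
        for pair, w in w_edge.items():  # edge features 1{(y_i, y_j) = pair}
            total += w * indicator((y_prev, y_next) == pair)
    return total


def predict(words):
    """Brute-force argmax_y F(x, y) over every possible labeling."""
    return max(product(LABELS, repeat=len(words)),
               key=lambda labels: score(words, labels))


new_x = ["dog", "runs"]
print(predict(new_x))
```

No true labels are needed at test time: score is simply evaluated once per candidate labeling, and the highest-scoring one is returned.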
