is GLM the best way to deal with a (numeric IV+ categorical IV =binary DV) dataset? how to plot? - glm

i have a dataset includes 1 numeric IV + 5 categorical IVs , the DV is also binary, so i used glmer to explore the predication effects:
$model <- glmer(Y~(Xn+Xc1+Xc2+Xc3+Xc3)^2+(1|id),data,family=binomial)
but is this the best way? and how to plot the main effect and interactions?

Related

Transforming Features to increase similarity

I have a large dataset (~20,000 samples x 2,000 features-- each sample w/ a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values are between two arbitrary values A and B (such that B-A is much smaller than the total range of values in y), the subsequent model is much better at predicting other values with the A-->B range not used in the training of the model.
However, the overall similarity of the input X vectors for these values are in no way more similar than any random selection of X values across the whole dataset.
Is there an available method to transform the input X-vectors such that those with more similar y-values are "closer" (I'm not particular the methodology, but it could be something like cosine similarity), and those with not similar y-values are separated?
After more thought, I believe this question can be re-framed as a supervised clustering problem. What might be able to accomplish this might be as simple as:
import umap
print(df.shape)
>> (23,312, 2149)
print(len(target))
>> 23,312
embedding = umap.UMAP().fit_transform(df, y=target)

How can I include survey weights in a poisson pint process model fitted to a logistic regression quadrature scheme?

Is it possible to include weights in a poisson point process model fitted to a logistic regression quadrature scheme? My data is a stratified sample and I would like to account for this sampling strategy in order to have valid population level predictions.
This is a question about the model-fitting function ppm in the R package spatstat.
Yes, you can include survey weights. The easiest way is to create a covariate surveyweight, which could be a function(x,y) or a pixel image or a column of data associated with your quadrature scheme. Then when fitting the model using ppm, add the model term +offset(log(surveyweight)).
The result of ppm will be a fitted model that describes the observed point pattern. You can do prediction, simulation etc from this model, but be aware that these will be predictions or simulations of the observed point process including the effect of non-constant survey effort.
To get a prediction or simulation of the original point process (i.e. after removing the effect of non-constant survey effort) you need to replace the original covariate surveyweight by another covariate that is constant and equal to 1, then pass this to predict.ppm in the argument newdata.
Here are a few lines to elaborate on the answer by #adrian-baddeley.
If you have the setup of your related question and we imagine you have the weights and two covariates in a data.frame in the same order as the points of your quadscheme:
library(spatstat)
X <- split(chorley)$larynx
D <- split(chorley)$lung
Q <- quadscheme.logi(X,D)
covar <- data.frame(weights = runif(npoints(chorley)),
covar1 = rnorm(npoints(chorley)),
covar2 = rnorm(npoints(chorley)))
fit <- ppm(Q ~ offset(log(weights)) + covar1 + covar2, data = covar)

replace missing values in categorical data

Let's suppose I have a column with categorical data "red" "green" "blue" and empty cells
red
green
red
blue
NaN
I'm sure that the NaN belongs to red green blue, should I replace the NaN by the average of the colors or is a too strong assumption? It will be
col1 | col2 | col3
1 0 0
0 1 0
1 0 0
0 0 1
0.5 0.25 0.25
Or even scale the last row but keeping the ratio so these values have less influence? Usually what is the best practice?
0.25 0.125 0.125
The simplest strategy for handling missing data is to remove records that contain a missing value.
The scikit-learn library provides the Imputer() pre-processing class that can be used to replace missing values. Since it is categorical data, using mean as replacement value is not recommended. You can use
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
The Imputer class operates directly on the NumPy array instead of the DataFrame.
Last but not least, not ALL ML algorithm cannot handle missing value. Different implementations of ML also different.
It depends on what you want to do with the data.
Is the average of these colors useful for your purpose?
You are creating a new possible value doing that, that is probably not wanted. Especially since you are talking about categorical data, and you are handling it as if it was numeric data.
In Machine Learning you would replace the missing values with the most common categorical value regarding a target attribute (what you want to predict).
Example: You want to predict if a person is male or female by looking at their car, and the color feature has some missing values. If most of the cars from male(female) drivers are blue(red), you would use that value to fill missing entries of cars from male(female) drivers.
In addition to Lan's answer's approach, which seems most commonly used, you can use something based on matrix factorization. For example there is a variant of Generalized Low Rank Models that can impute such data, just as probabilistic matrix factorization is used to impute continuous data.
GLRMs can be used from H2O which provides bindings for both Python and R.

libsvm not giving support vectors / no support vectors

I am using jlibsvm to do SVM for regression .My data set is very small (42 samples) . When I use the dataset to create the model using epsilon SVR with sigmoid kernel then no support vectors are generated.
This is what I get in my model file :
svm_type epsilon_svr
kernel_type sigmoid
gamma 0.02380952425301075
coef0 0.0
label
rho -66.42803
total_sv 0
probA -1.0
SV
When I use some other data set on the libsvm website I get a model file with support vectors fine.
Can someone please suggest why no support vectors are being generated for my data set ?
My data set file is formatted right so no issues there...
This could mean that the best found classification, given your data and the hyperparameters, is to assign the same label to all samples.
Are your samples unbalanced? What's the number of positive and negative samples? You might want to try to add a weighting to positive/negative samples to account for that
It could also be the samples are hard to separate given their structure and the kernel type. Have you tried a different structure?
With only 42 data samples, maybe you could add them to your question and get better answers.

PCA first or normalization first?

When doing regression or classification, what is the correct (or better) way to preprocess the data?
Normalize the data -> PCA -> training
PCA -> normalize PCA output -> training
Normalize the data -> PCA -> normalize PCA output -> training
Which of the above is more correct, or is the "standardized" way to preprocess the data? By "normalize" I mean either standardization, linear scaling or some other techniques.
You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:
>> C = [1 0.5; 0.5 1];
>> A = chol(rho);
>> X = randn(100,2) * A;
If I now perform PCA, I correctly find that the principal components (the rows of the weights vector) are oriented at an angle to the coordinate axes:
>> wts=pca(X)
wts =
0.6659 0.7461
-0.7461 0.6659
If I now scale the first feature of the data set by 100, intuitively we think that the principal components shouldn't change:
>> Y = X;
>> Y(:,1) = 100 * Y(:,1);
However, we now find that the principal components are aligned with the coordinate axes:
>> wts=pca(Y)
wts =
1.0000 0.0056
-0.0056 1.0000
To resolve this, there are two options. First, I could rescale the data:
>> Ynorm = bsxfun(#rdivide,Y,std(Y))
(The weird bsxfun notation is used to do vector-matrix arithmetic in Matlab - all I'm doing is subtracting the mean and dividing by the standard deviation of each feature).
We now get sensible results from PCA:
>> wts = pca(Ynorm)
wts =
-0.7125 -0.7016
0.7016 -0.7125
They're slightly different to the PCA on the original data because we've now guaranteed that our features have unit standard deviation, which wasn't the case originally.
The other option is to perform PCA using the correlation matrix of the data, instead of the outer product:
>> wts = pca(Y,'corr')
wts =
0.7071 0.7071
-0.7071 0.7071
In fact this is completely equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation. It's just more convenient. In my opinion you should always do this unless you have a good reason not to (e.g. if you want to pick up differences in the variation of each feature).
You need to normalize the data first always. Otherwise, PCA or other techniques that are used to reduce dimensions will give different results.
Normalize the data at first. Actually some R packages, useful to perform PCA analysis, normalize data automatically before performing PCA.
If the variables have different units or describe different characteristics, it is mandatory to normalize.
the answer is the 3rd option as after doing pca we have to normalize the pca output as the whole data will have completely different standard. we have to normalize the dataset before and after PCA as it will more accuarate.
I got another reason in PCA objective function.
May you see detail in this link
enter link description here
Assuming the X matrix has been normalized before PCA.

Resources