P for interactions after multiple imputation with R - interaction

I have run a multivariable regression analysis using glm.mids() function after multiple imputation with mice on a large dataset.
I should now calculate the interaction between two of the covariates included in the model. What I would need are the results that can be obtained using allEffects() function, but it does not work after multiple imputation.
Is there any other package providing odds ratios for the interaction model obtained using glm.mids()?

Related

Result verification with Weka Experiment tab with individual classifier models

I ran different classifiers on the same dataset. I got some statistical values after run the classifiers.
This is the summary of all classifiers
I am using Weka to trained the model. Weka itself has a method to compare different algorithms. For that we need to use the Experiment tab. I have done with this option as well for the same dataset.
Weka gave me the result for Kappa statistics when use Experiment tab
Rootmean squared error is
Relative absolute error
and so on.....
Now I am unable to understand that the values I got from Experiment tab how does those are similar to the values that I have shared in the table format in the first picture?
I presume that the initial table was populated with statistics obtained from cross-validation runs in the Weka Explorer.
The Explorer aggregates the predictions across a single cross-validation run so that it appears that you had a single test set of that size. It is only to be used as an explorative tool, hence the name.
The Experimenter records the metrics (like accuracy, rmse, etc) generated from each fold pair across the number of runs that you perform during your experiment. The metrics collected across multiple classifiers and/or datasets can then be analyzed using significance tests. By default, 10 runs of 10-fold CV are used, which is recommended for such comparisons. This results in 100 individual values for each metric from which mean and standard deviation are generated. */v indicate whether there is a statistically significant loss/win.

What's the correct approach to evaluate a model using K-fold cross validation over multiple seeds?

I am training a deep learning model using a 5-fold CV over three random seeds (random seeds are for model initialization, CV is split once). For each fold, I save the best model. Hence, I get 15 models after the simulation. To assess the performance, I take the best of these 15 models (unchanged during the entire evaluation process) and evaluate it using the validation fold of all the 5-folds for each seed. I then average the results across these seeds.
I would like to know if I am doing the right thing here.
I have read that there are two ways to compute CV performance: [1] pooling, where the performance is calculated globally over the union of all the test sets [2] averaging, where the performance is computed for every test set separately, with results being the average of these.
I intend to use method two (averaging).
Yes, you can use the averaging method for the 5-fold CV, but I don't understand what you mean by "For each fold, I save the best model". Moreover, three random seed values are not enough. You should use at least 10 different values and plot a boxplot for the corresponding results across these seeds.

Understanding multiple Linear regression

I am doing multiple regression problem. I have the below data set as below.
rank--discipline--yrs.since.phd--yrs.service--sex--salary
[ 1 1 19 18 1 139750],......
I am taking salary as dependent variable, and other variable as independent variable. After doing data pre processing, I ran the gradient descent, regression model. I estimated bias(intercept), coefficient for all independent features.
I want to do scattered plot for the actual values and regression line
for the hypothesis I predicted. Since we have more than one features here,
I have the below questions.
While plotting actual values (scatted plot), how do I decide the x-axis values. Meaning, I have list of values. for example, first row [1,1,19,18,1]=>139750 How do I transform or map [1,1,19,18,1] to x-axis.? I need to somehow make [1,1,19,18,1] to one value, so I can mark a point of (x,y) in the plot.
While plotting regression line, what would be the feature values, so I can calculate the hypothesis value.?
Meaning now, I have the intercept, and weight of all features, but I dont have the feature values. How do I decide upon the feature values now.?
I want to calculate the points and use matplot to do the jobs. I am aware that there are lot of tools available outside including matplotlib to do the job. But I want to get the basic understanding.
Thanks.
I am still not sure I completely understand your question, so if something is not what you expected comment below and we will work it out.
Now,
Query 1: In all your datasets you are going to have multiple inputs and there is no way to view the target variable salary in your case with respect to all, in a single graph, what is usually done is either you apply the concept of dimensionality reduction on your data using t-sne (link) or you use principal component analysis (PCA) to reduce the dimensionality of your data, and make your output a function of two or three variables and then plot it on the screen, the other technique that I prefer is rather plotting target vs each variable separately as subplot, The reason for this is we don't even have a way to comprehend how we will see the data that is in more than three dimensions.
Query 2: If you are not determined to use matplotlib, I will suggest seaborn.regplot(), but let's also do it in matplotlib. Suppose the variable you want to use first is 'discipline' vs 'salary'.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['discipline']]
Y = df['salary']
lm.fit(X,Y)
After running this lm.coef_ will give you the coefficient, and lm.intercept_ will give you the intercept, in a linear equation that forms this variable, then you can plot the data between two variables and a line using matplotlib easily.
what you can do is ->
from pandas import plotting as pdplt
pdplt.scatter_matrix(dataframe, pass the remaining required parameters)
by this you will get a matrix of plots(in your case it's 6X6) which will exactly show how each column in your dataframe relates to the other columns and you can clearly visualise which feature dominates the result and also how the features are correlated to each other.
If you ask me this is the first thing I used to do with such types of problems and then remove all correlated features and select the features which best approximate the output.
But as you have to plot a 2d plot and in the above approach you might get more than a single feature which dominate the output then what you can do is a miracle named PCA.
If you ask me PCA is one of the most beautiful thing in machine learning. What it will do that is somehow merges all your feautres in some magical ratio which will generate principle components for your data. Principal components are those components which govern/major contribution to your model. You apply pca by simply importing from sklearn and then select the first principle component(as you need a 2d plot) or might select 2 priciple components and plot a 3d graph. But always remember this that these pricipal components are not the real features of your model but they are some magical combination and how PCA did so is very very interesting(by using concepts like eigen values and vectors) and you can build by your own also.
Apart from all these you can apply Singular Value decomposition(SVD) to your model which is the essence of whole linear algebra which is a type of matrix decomposition existing for all matrix. What this do is decompose your matrix into three matrix out of which the diagonal matrix which consists of singular values(a scaling factor) in descending order and what you have to do is that select the top singular values (in your case only the first one having highest magnitude) and construct back a feature matrix from 5 columns to 1 columns and then plot that. You can do svd by using the numpy.linalg
Once you applied any one of these methods then what you can do is learn your hypothesis with only the single most important selected feature and finally plot the graph. But take a tip, just for plotting a 2d graph you should avoid other important features beacuse maybe you have 3 principal components all having almost the same contribution and may the top three singular values are very close to each other. So take my words and take all important features into account and if you need the visualisation of these important features then use scatter matrix
Summary ->
All I want to mention is that you can do the same process with all these things and also can invent your own statistical or mathematical model for compressing your feature space.
But for me I prefer to go with PCA and in such type of problems I even first plot the scatter matrix to get an visual intuition to the data. And also PCA and SVD helps to remove redundancy and hence overfitting.
For rest details refer to docs.
Happy machine learning...

Information leakage in Cross-validation

Description of classification problem:
Assume a regular dataset X with n samples and d features.
This classification problem is somewhat hard (many features, few samples, low overall AUC ~70%).
It might be useful to mention that feature selection/extraction, dimension reduction, kernels, many classifiers have been applied. So I am not interested in trying these.
I am not looking forward to see an improvement in overall AUC. The goal is to find relevant features in haystack of features.
Description of my approach:
I select all pairwise combination of d features and create many two dimensional sub-datasets x with n samples.
On each sub-dataset x, I perform a 10-fold cross-validation (using all samples of the main dataset X). A very long process, assume weeks of computation.
I select top k pairs (according to highest AUC for example) and label them as +. All other pairs are labeled as -.
For each pair, I can compute several properties (e.g. relations between each pair using Expert's knowledge). These properties can be calculated without using the labels in main dataset X.
Now I have pairs which are labeled as + or -. In addition, each pair has many properties calculated based on Expert's knowledge (i.e. features). Hence, I have a new classification problem. Lets call this newly generated dataset Y.
I train a classifier on Y while following cross-validation rules. Surprisingly, I can predict the + and - labels with 90% AUC.
As far as I can see, it means that I am able to select relevant features. However, seeing a 90% AUC makes me worried about information leakage somewhere in this long process. Specially in step 3.
I was wondering if anyone can see any leakage in this approach.
Information Leakage:
Incorporation of target labels in the actual features. Your classifier will produce good prediction while did not learn anything.
Showing your test set to you classifier during the training phase. Your classifier will "memorize" the test set and its corresponding labels without "learning" anything.
Update 1:
I want to stress that indeed I am using all data points of X in step 1. However, I am not using them ever again (even for testing). The final 90% AUC is obtained from predicting labels of dataset Y.
On the other hand, it would be useful to note that, even if I randomize the values of my main dataset X, the computed features for dataset Y is going to be the same. However, the sample labels in Y would change because the previous + pairs might not be a good one anymore. Therefore they will be labeled as -.
Update 2:
Although I haven't got any opinion, I am going to state what I have got during 4 days of talking with pattern recognition researchers. Briefly I became confident that there is no information leakage (as long as I wont go back to the first dataset X and using its labels). Later on, in case I wanted to check to see if I could have better performance in X (i.e. predicting sample labels), I need to use only a part of dataset X for pairwise comparison (as training set). Then I can use the rest of samples in X as test set while using positively predicted pairs of Y as features.
I will set this as an answer in case no one could reject this method.
If your processes in step 1 uses all data. then the features you are learning have information from the whole data set. Since you selected based on the whole dataset and THEN validation, you are leaking serious information.
You should probably stick with tools that are well known / already done for you before running out and trying weird strategies like this. Try using a model with L1 regularization to do feature selection for your, or start with some of the simpler searches like Sequential Backward Selection.
If you do cross validation correctly in the end, each training will perform its own independent feature selection. If you do one global feature selection and then do CV, you are going to be doing it wrong and probably leaking information.

measuring the accuracy of a model and the importance of a feature in SVM

I'm starting to use LIBSVM for regression analysis. My world has about 20 features and thousands to millions of training samples.
I'm curious about two things:
Is there a metric that indicates the accuracy or confidence of the model, perhaps in the .model file or elsewhere?
How can I determine whether or not a feature is significant? E.g., if I'm trying to predict body weight as a function of height, shoulder width, gender and hair color, I might discover that hair color is not a significant feature in predicting weight. Is that reflected in the .model file, or is there some way to find out?
libSVM calculates p-values for test points based upon the certainty of the classifier (i.e., how far is the test point from the decision boundary and how wide are the margins).
I think you should consider the determination of feature importance a separate problem from training your SVMs. There are tons of approaches for "feature selection" (just open any text book) but one easy to understand, straightforward approach would be a simple cross-validation as follows:
Divide your dataset into k folds (e.g., k = 10 is common)
For each of the k folds:
Separate your data into train/test sets (the current fold is the test set, the rest are the training set)
Train your SVM classifier using only n-1 of your n features
Measure the prediction performance
Average the performance of your n-1 feature classifier for all k test folds
Repeat 1-3 for all remaining features
You could also do the reverse where you test each of the n features separately but you will likely miss out on important second and higher order interactions between the features.
In general, however, SVMs are good at ignoring irrelevant features.
You may also want to try and visualize your data using Principal Components Analysis to get a feel for how the data is distributed.
The F-score is a metric commonly used for features selection in Machine Learning.
Since version 3.0, LIBSVM library includes a directory called tools. In that directory is a python script called fselect.py, which calculates F-score. To use it, just execute from the command line and pass in the file comprised of training data (and optionally a testing data file).
python fselect.py data_training data_testing
The output is comprised of an fscore for each of the features in your data set which corresponds to the importance of that feature to the model result (regression score).

Resources