I have a complete ozone data set that contains a few missing values. I would like to use SPSS to perform single imputation on my data.
Before I start imputing, I would like to randomly simulate missing data patterns with 5%, 10%, 15%, 25% and 40% of the data missing, in order to evaluate the accuracy of the imputation methods.
Can someone please show me how to create these random missing data patterns in SPSS?
Also, can someone please tell me how to obtain performance indicators such as mean absolute error, coefficient of determination and root mean square error, so I can check which method estimates the missing values best?
Unfortunately, my current SPSS installation does not include the missing-values module, so I can only give some general advice.
First, for your missing data pattern: simply go to Data -> Select Cases -> Random sample of cases, delete the desired share of cases, and then run the imputation.
The values you mentioned should be provided by SPSS if you use its imputation module. There is a manual:
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/de/client/Manuals/IBM_SPSS_Missing_Values.pdf
The answer to your first question: assume your study variable is y and you want to simulate missingness in y. This is an example of SPSS syntax that computes an extra variable y_miss according to your missing data pattern (here with 5% of the values set to missing):
do if uniform(1) < .05.
compute y_miss = $SYSMIS.
else.
compute y_miss = y.
end if.
execute.
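For the performance indicators, here is a minimal sketch outside SPSS (plain Python/NumPy, with made-up numbers) of how MAE, RMSE and the coefficient of determination can be computed once you have the original values and their imputed counterparts:

```python
import numpy as np

# Hypothetical example: compare imputed values against the true values
# that were deleted when simulating missingness. All numbers are invented.
y_true = np.array([3.1, 2.8, 4.0, 3.5, 2.9])     # original (held-out) values
y_imputed = np.array([3.0, 3.0, 3.8, 3.6, 2.7])  # values filled in by imputation

mae = np.mean(np.abs(y_true - y_imputed))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_imputed) ** 2))  # root mean square error
ss_res = np.sum((y_true - y_imputed) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
r2 = 1 - ss_res / ss_tot                            # coefficient of determination
print(mae, rmse, r2)
```

Repeating this for each simulated missingness level (5%, 10%, …) and each imputation method gives you a table from which the best method can be read off.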
Please help me find an approach to the following problem. Let X be a matrix X_{m×n} = (x_1, …, x_n), where each x_i is a time series, and let Y be a vector Y_{m×1}. To predict values of Y, we train some model, say linear regression, so we get Y = f(X). Now we need to find an X for some given value of Y. The most naive approach is brute force, but what are the proper ways to solve such problems? Perhaps the scipy.optimize package is of use here; please enlighten me, or point me to an explanation or material to read.
Most scipy.optimize algorithms use gradient-based methods. For this kind of optimization problem, we can apply them to reverse-engineer the data (for example, to find the best date to invest in the stock market).
If you want a good result, you should choose a sensible step size and a suitable optimization method.
However, we should not frame the problem as "predicting" x_i, because what we are really doing is finding a local/global maximum/minimum. For a method such as Newton-CG, your data/equation must already contain all the information needed (or a simulation); the method itself makes no prediction.
If you want to make a prediction on "time", you could categorize the time data into year, month, etc., and then use unsupervised learning to group the data. If a trend is obtained, you can reverse-engineer the result to recover the time.
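As a concrete sketch of the scipy.optimize route (all data and names below are invented): fit the regression, then minimize the squared difference between f(x) and the target value of Y. Note that the inverse is generally not unique, so the optimizer returns just one feasible x, typically the one nearest the starting point:

```python
import numpy as np
from scipy.optimize import minimize

# Invented data: fit a linear model y = w.x + b by least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true, b_true = np.array([1.5, -2.0, 0.5]), 0.3
y = X @ w_true + b_true

A = np.hstack([X, np.ones((len(X), 1))])        # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

# Invert the model: find x minimizing (f(x) - y_target)^2. This is
# under-determined, so we get one of many valid solutions.
y_target = 4.0
res = minimize(lambda x: (x @ w + b - y_target) ** 2,
               x0=np.zeros(3), method="BFGS")
print(res.x, res.x @ w + b)  # f(res.x) is close to y_target
```

Because the objective is smooth, gradient-based methods like BFGS or Newton-CG converge quickly here; brute force is unnecessary.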
I am using the Learning from Data textbook by Yaser Abu-Mostafa et al. I am curious about the following statement in the linear regression chapter and would like to verify that my understanding is correct.
After discussing the "pseudo-inverse" way to get the "best" weights (best for minimizing squared error), i.e. w_lin = (X^T X)^-1 X^T y, the book states: "The linear regression weight vector is an attempt to map the inputs X to the outputs y. However, w_lin does not produce y exactly, but produces an estimate X w_lin which differs from y due to in-sample error."
If the data lie exactly on a hyperplane, won't X w_lin match y exactly (i.e. in-sample error = 0)? In other words, is the statement above only about data that do not lie exactly on a hyperplane?
Here, a single 'w_lin' cannot fit every data point (every pair (X, y)) exactly.
The linear regression model finds the best possible weight vector (the best possible 'w_lin') over all data points, such that X*w_lin gives a result very close to 'y' for each data point.
Hence the error will not be zero unless all data points lie exactly on a line (or, in higher dimensions, a hyperplane).
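Both cases can be checked numerically with a small sketch (data invented for illustration): when y lies exactly on a hyperplane, the pseudo-inverse fit has zero in-sample error; adding noise makes the error strictly positive:

```python
import numpy as np

# Sketch of the pseudo-inverse solution w_lin = (X^T X)^-1 X^T y and its
# in-sample error, with invented data.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])  # intercept + 2 features

# Case 1: y lies exactly on a hyperplane -> in-sample error is 0.
w_true = np.array([0.5, 2.0, -1.0])
y_exact = X @ w_true
w_lin1 = np.linalg.pinv(X) @ y_exact
exact_fit = np.allclose(X @ w_lin1, y_exact)   # True: X w_lin reproduces y

# Case 2: noise added -> X w_lin only approximates y.
y_noisy = y_exact + rng.normal(scale=0.1, size=50)
w_lin2 = np.linalg.pinv(X) @ y_noisy
in_sample_err = np.mean((X @ w_lin2 - y_noisy) ** 2)  # strictly positive
print(exact_fit, in_sample_err)
```

So yes: the book's caveat only bites when the data do not lie exactly on a hyperplane, which is the typical case with noisy measurements.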
The community might not get the whole context unless the book is opened, because not everything the author says may have been covered in your post. But let me try to answer.
Whenever a model is formed, certain constants are used whose values are not known beforehand but are chosen to fit the line/curve as well as possible. Also, the equations often contain an element of randomness: variables that take random values cause some error between the actual and expected outputs.
Suggested reading: Errors and residuals
I'm still a little unsure of whether questions like these belong on stackoverflow. Is this website only for questions with explicit code? The "How to Format" just tells me it should be a programming question, which it is. I will remove my question if the community thinks otherwise.
I have created a neural network and am predicting reasonable values for most of my data (the task is multi-variate time series forecasting).
I scale my data before inputting it using scikit-learn's MinMaxScaler(0,1) or MinMaxScaler(-1,1) (the two primary scalings I am using).
The model learns and predicts, and I invert the scaling using MinMaxScaler's inverse_transform method to see visually how close my predictions are to the actual values. However, I notice that the inverse-scaled values for a particular part of the predicted vector have become very noisy. Here is what I mean (inverse-scaled prediction):
Left end: noisy; right end: not-so noisy.
I initially thought that perhaps my network didn't learn that part of the vector well and was just outputting ~random values. But I notice that the predicted values before the inverse scaling match the actual values very well, and that these values are typically near 0 or -1 (the lower limit of the feature range) because they have a very large spread (unscaled mean = 1E-1, max = 1E+1 [not an outlier]). Example (scaled prediction):
So, when inverse-transforming these values (again, often near -1 or 0), the transformed values exhibit large noise, as shown in the images.
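This amplification can be reproduced in a small sketch (numbers are invented; only the wide spread mirrors the question): with MinMaxScaler, a fixed error in scaled space is multiplied by the column's full data range on inverse_transform, so a wide-range column turns tiny scaled errors into large original-unit noise:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One column with a very wide spread, similar to the situation described.
col = np.array([[0.01], [0.1], [1.0], [10.0]])
scaler = MinMaxScaler(feature_range=(0, 1)).fit(col)

true_scaled = scaler.transform(col)
pred_scaled = true_scaled + 0.01          # a small prediction error in scaled space
err = scaler.inverse_transform(pred_scaled) - col
print(err)  # each 0.01 scaled-space error becomes ~0.1 in original units
```

The factor is exactly the column's (max - min), here 9.99, so errors that look negligible near the lower bound of the feature range become very visible after the inverse transform.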
Questions:
1.) Should I be using a different scaler, or scaling differently — perhaps one that scales nonlinearly (e.g. exponentially/logarithmically)? MinMaxScaler() scales each column independently. Simply dropping the high-magnitude data isn't an option, since they are real, meaningful data.
2.) What other solutions could help with this?
Please let me know if you'd like anything else clarified.
I am working with multivariate data with random effects.
My hypothesis is this: D has an effect on A1 and A2, where A1 and A2 are binary data, and D is a continuous variable.
I also have a random effect, R, that is a factor variable.
So my model would be something like this: A1andA2 ~ D, random = ~1|R
I tried to use the function manyglm in the mvabund package, but it cannot deal with random effects. I could use lme4, but it cannot deal with multivariate data.
I could convert my multivariate data into a 4-level factor variable, but I didn't find any method that accepts a factor (rather than binary) response variable. I could also convert the continuous D into a factor variable.
Do you have any advice about what to use in that situation?
First, I know this should be a comment and not a complete answer, but I can't comment yet and thought you might still appreciate the pointer.
You should be able to analyze your data with the MCMCglmm R package (see here for an intro), as it can handle mixed models with multivariate response data.
Data: I have N rows of data of the form (x, y, z), where logically f(x, y) = z, that is, z depends on x and y — in my case (setting1, setting2, signal). Different x's and y's can lead to the same z, but those z's wouldn't mean the same thing.
There are 30 unique setting1, 30 setting2 and 1 signal for each (setting1, setting2)-pairing, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (make them all into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 columns.
Question:
Is it correct to keep all the duplicated setting1, setting2 values in the data set, or should I remove them and include each unique value only once, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings would be lost this way. Is this relevant? Or should I not bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on a sample where each observation is [900,3].
You are flattening it, getting an input layer of size 3*900.
Some of those values are the result of a function of others.
It matters which function: if it is a linear function, the NN might not work well.
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk the NN being biased towards said variables.
E.g. If you are running LMS on [x1,x2,x3,average(x1,x2)] to predict y, you basically assign a higher weight to the x1 and x2 variables.
Unless you have a reason to believe that those weights should be higher, don't include their function.
I was not able to find a link to support this, but my intuition is that you might want to shrink your input layer in addition to omitting the dependent values.
From Professor A. Ng's ML course I remember that the input should be the minimum set of values that is "reasonable" for making the prediction.
"Reasonable" is vague, but I understand it like this: if you try to predict the price of a house, include footage, area quality, and distance from a major hub; do not include the average sunspot activity during the open-house day, even though you have that data.
I would remove the duplicates; I would also look for any other data that can be omitted, and maybe run PCA over the full set of N x [900,3] data sets.
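For concreteness, here is a small NumPy sketch (with invented values) of the two layouts from the question — the fully flattened 3 x 900 row versus the deduplicated 30 + 30 + 900 row. Note that the deduplicated layout only preserves the setting-to-signal pairing if the signal values always stay in the same fixed (setting1, setting2) order:

```python
import numpy as np

# One hypothetical [900, 3] data set: columns are setting1, setting2, signal.
# 30 unique values per setting, one signal per (setting1, setting2) pair.
data = np.column_stack([
    np.repeat(np.arange(30), 30),               # setting1, each repeated 30x
    np.tile(np.arange(30), 30),                 # setting2, cycling 0..29
    np.random.default_rng(2).normal(size=900),  # signal values (invented)
])

# Option 1: flatten everything, duplicating the settings -> 2700 columns.
flat_full = data.reshape(-1)

# Option 2: keep each unique setting once -> 30 + 30 + 900 = 960 columns.
# This relies on the signals staying in a fixed (setting1, setting2) order.
flat_dedup = np.concatenate([
    np.unique(data[:, 0]),   # 30 unique setting1 values
    np.unique(data[:, 1]),   # 30 unique setting2 values
    data[:, 2],              # 900 signal values, in their canonical order
])
print(flat_full.shape, flat_dedup.shape)  # (2700,) (960,)
```

If every data set shares the same 30 x 30 settings grid, the settings carry no information at all and could arguably be dropped entirely, leaving just the 900 signal columns.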