How do you weight cases in a BRFSS dataset opened in SPSS?

I'm a Penn State student working on a class project in SPSS where I'm trying to weight cases in the 2019 BRFSS dataset that I have uploaded to the program. Most of the class is using the NHANES dataset, and they weight samples by 2-year interview weight. My professor and I are having a very difficult time trying to figure out which sample weighting to use for the BRFSS dataset. There are instructions for how to do this in SAS (use the _LLCPWT for weighting cases), but this does not work in SPSS. Is anyone aware of how to turn on sample weighting in the 2019 BRFSS dataset in SPSS?

You can use the WEIGHT command to weight cases. The weights should be in a variable in your dataset.
If the variable containing the sample weights is called _LLCPWT, then you can activate weighting by running the following command in a Syntax window:
WEIGHT BY _LLCPWT.
When activated, the bottom right corner of your Data Editor window will indicate 'Weight On'.

Related

How to find the impact of one variable on the other when there seems to be no correlation between them?

Can we predict the growth percentage in sales of an item, given the change in discount (a positive or negative number) from the previous year as the predictor variable? There seems to be no correlation between these. How can this problem be solved using machine learning?
You are approaching this question from the wrong direction.
Correlation belongs to the statistics side of the problem. Check Pearson's correlation coefficient or Spearman's correlation coefficient to measure the correlation between the discount changes and the sales growth.
In machine learning, we seldom compare two percentage figures directly; instead, we compare the actual sales and discount values. A simple approach is linear regression (most ML is applied to multi-dimensional data, whereas your case is a single predictor and a single output). Related material is widely available online, and the problem can be solved in Excel or with a short Python script, e.g. the sketch below.
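A minimal linear-regression sketch in Python; the file name and column names (sales.csv, discount, sales) are hypothetical placeholders for your own data:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # "sales.csv", "discount" and "sales" are placeholders for your own dataset.
    df = pd.read_csv("sales.csv")
    X = df[["discount"]]   # predictor: the actual discount value
    y = df["sales"]        # target: the actual sales value

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)   # slope and intercept of the fitted line
    print(model.score(X, y))               # R^2: how much variance the discount explains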

Result verification with Weka Experiment tab with individual classifier models

I ran different classifiers on the same dataset and got various statistical values after running them; a summary of all classifiers is shown in the first picture (a table).
I am using Weka to train the models. Weka itself has a way to compare different algorithms: the Experiment tab. I have run this option as well for the same dataset, and Weka gave me results for the Kappa statistic, root mean squared error, relative absolute error, and so on.
Now I am unable to understand how the values I got from the Experiment tab relate to the values I shared in table format in the first picture.
I presume that the initial table was populated with statistics obtained from cross-validation runs in the Weka Explorer.
The Explorer aggregates the predictions across a single cross-validation run so that it appears as if you had a single test set of that size. It is only intended as an exploratory tool, hence the name.
The Experimenter records the metrics (e.g., accuracy, RMSE) generated from each train/test fold across the number of runs that you perform during your experiment. The metrics collected across multiple classifiers and/or datasets can then be analyzed using significance tests. By default, 10 runs of 10-fold CV are used, which is recommended for such comparisons. This results in 100 individual values for each metric, from which the mean and standard deviation are generated. The markers * and v indicate whether there is a statistically significant loss or win.
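To see where those 100 values come from, here is a rough scikit-learn analogue of 10 runs of 10-fold cross-validation; this is Python/sklearn rather than Weka, and the dataset and classifier are only placeholders:

    # Rough analogue of "10 runs of 10-fold CV": repeated cross-validation yields
    # 10 x 10 = 100 per-fold scores, from which a mean and standard deviation are taken.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
    scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv)

    print(len(scores))                  # 100 individual accuracy values
    print(scores.mean(), scores.std())  # the kind of summary an Experimenter-style comparison reports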

What is a proper Learning Technique for the given data sample

I am working in MATLAB.
I have data samples of two unrelated variables at 256 time-steps. Their plots, with value on the Y-axis and time-step on the X-axis, are shown below.
[Typical plot of the first variable, Pos]
[Typical plot of the second variable, Vel]
Now I need to predict the values of these variables at the next 10 time-steps. To evaluate various machine learning techniques for this, I took the values of the variables at the first 246 time-steps, predicted the next 10 time-steps, and then compared the predictions with the actual values by calculating the mean squared error, say ms_error.
I have done this using time series (NAR), linear regression, fuzzy inference systems, and neural networks, but none of these give an ms_error of less than 2.
Can someone suggest a learning algorithm to predict future values for data samples like these two?
You could try symbolic regression via Genetic Programming.
Genetic Programming makes no assumption about the structure of the function that fits your data points, so it's well suited to this sort of discovery task.
Symbolic regression was one of the earliest applications of GP and continues to be widely studied.
There are many ready-to-use environments for every major programming language, and many tutorials on the subject, e.g.:
C++: Beagle
Java: ECJ
Matlab: GPTIPS - GPLAB
Python: DEAP
(I don't mean these are the best, just well known. Of course a Google search could turn up other software more suitable for your needs.)
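As a concrete illustration, here is a rough symbolic-regression sketch with DEAP (Python). It fits a placeholder series and is not a tuned solution, just the general shape of the approach applied to the 246-train / 10-forecast setup described in the question:

    # Rough DEAP sketch: evolve an expression value = f(t) on the first 246 time-steps,
    # then use it to forecast the last 10. "series" is a placeholder for Pos or Vel;
    # time is rescaled to [0, 1] to keep intermediate values small.
    import math
    import operator
    import random

    import numpy as np
    from deap import algorithms, base, creator, gp, tools

    series = np.sin(np.arange(256) / 20.0)   # placeholder data; substitute your 256 samples
    t_all = np.arange(256) / 255.0
    train_t, train_y = t_all[:246], series[:246]

    pset = gp.PrimitiveSet("MAIN", 1)        # one input: the (rescaled) time-step
    pset.addPrimitive(operator.add, 2)
    pset.addPrimitive(operator.sub, 2)
    pset.addPrimitive(operator.mul, 2)
    pset.addPrimitive(math.sin, 1)
    pset.addEphemeralConstant("const", lambda: random.uniform(-1.0, 1.0))
    pset.renameArguments(ARG0="t")

    creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)

    toolbox = base.Toolbox()
    toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=3)
    toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("compile", gp.compile, pset=pset)

    def mse(individual):
        # Fitness: mean squared error of the evolved expression on the training range.
        func = toolbox.compile(expr=individual)
        errors = [(func(t) - y) ** 2 for t, y in zip(train_t, train_y)]
        return (sum(errors) / len(errors),)

    toolbox.register("evaluate", mse)
    toolbox.register("select", tools.selTournament, tournsize=3)
    toolbox.register("mate", gp.cxOnePoint)
    toolbox.register("expr_mut", gp.genFull, min_=0, max_=2)
    toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)
    toolbox.decorate("mate", gp.staticLimit(key=operator.attrgetter("height"), max_value=17))
    toolbox.decorate("mutate", gp.staticLimit(key=operator.attrgetter("height"), max_value=17))

    pop, _ = algorithms.eaSimple(toolbox.population(n=300), toolbox,
                                 cxpb=0.5, mutpb=0.1, ngen=40, verbose=False)

    best = toolbox.compile(expr=tools.selBest(pop, 1)[0])
    forecast = [best(t) for t in t_all[246:]]          # predicted values for the last 10 steps
    ms_error = np.mean((np.array(forecast) - series[246:]) ** 2)
    print(ms_error)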

measuring the accuracy of a model and the importance of a feature in SVM

I'm starting to use LIBSVM for regression analysis. My problem has about 20 features and thousands to millions of training samples.
I'm curious about two things:
Is there a metric that indicates the accuracy or confidence of the model, perhaps in the .model file or elsewhere?
How can I determine whether or not a feature is significant? E.g., if I'm trying to predict body weight as a function of height, shoulder width, gender and hair color, I might discover that hair color is not a significant feature in predicting weight. Is that reflected in the .model file, or is there some way to find out?
libSVM can produce probability estimates for test points based upon the certainty of the classifier (i.e., how far the test point is from the decision boundary and how wide the margins are).
I think you should consider the determination of feature importance a separate problem from training your SVMs. There are tons of approaches for "feature selection" (just open any text book), but one easy-to-understand, straightforward approach is a simple leave-one-feature-out cross-validation, as follows (a sketch appears after this answer):
1. Divide your dataset into k folds (e.g., k = 10 is common).
2. For each of the k folds:
   a. Separate your data into train/test sets (the current fold is the test set, the rest are the training set).
   b. Train your SVM classifier using only n-1 of your n features.
   c. Measure the prediction performance.
3. Average the performance of your n-1 feature classifier over all k test folds.
4. Repeat 1-3 for each remaining feature.
You could also do the reverse where you test each of the n features separately but you will likely miss out on important second and higher order interactions between the features.
In general, however, SVMs are good at ignoring irrelevant features.
You may also want to try and visualize your data using Principal Components Analysis to get a feel for how the data is distributed.
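A rough Python sketch of the leave-one-feature-out procedure above, using scikit-learn's SVC in place of calling LIBSVM directly; the dataset here is just a stand-in for your own:

    # Leave-one-feature-out cross-validation for feature importance: a large drop in
    # score when a feature is held out suggests that feature matters.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
    model = make_pipeline(StandardScaler(), SVC())
    k = 10

    baseline = cross_val_score(model, X, y, cv=k).mean()
    for i in range(X.shape[1]):
        X_without_i = np.delete(X, i, axis=1)    # drop feature i, keep the other n-1
        score = cross_val_score(model, X_without_i, y, cv=k).mean()
        print(f"feature {i}: score without it = {score:.3f} (baseline {baseline:.3f})")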
The F-score is a metric commonly used for feature selection in machine learning.
Since version 3.0, the LIBSVM library includes a directory called tools. In that directory is a Python script called fselect.py, which calculates the F-score. To use it, just execute it from the command line and pass in the file containing the training data (and optionally a testing data file):
python fselect.py data_training data_testing
The output is an F-score for each of the features in your data set, which corresponds to the importance of that feature to the model result (regression score).
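If you want to compute the same kind of score yourself, the F-score used for feature selection is essentially a ratio of between-class separation to within-class variance per feature. A minimal NumPy sketch of that idea (my own helper for a two-class dataset, not part of LIBSVM):

    import numpy as np

    def f_score(X, y):
        # Per-feature F-score for a binary-labelled dataset.
        # X: (n_samples, n_features) array; y: binary labels. Higher = more discriminative.
        classes = np.unique(y)
        assert len(classes) == 2, "this F-score is defined for two classes"
        X_pos, X_neg = X[y == classes[1]], X[y == classes[0]]
        mean_all = X.mean(axis=0)
        num = (X_pos.mean(axis=0) - mean_all) ** 2 + (X_neg.mean(axis=0) - mean_all) ** 2
        den = X_pos.var(axis=0, ddof=1) + X_neg.var(axis=0, ddof=1)
        return num / den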

Ordinal classification packages and algorithms

I'm attempting to make a classifier that chooses a rating (1-5) for an item i. For each item i, I have a vector x containing about 40 different quantities pertaining to i. I also have a gold-standard rating for each item. Based on some function of x, I want to train a classifier to give me a rating 1-5 that closely matches the gold standard.
Most of the information I've seen on classifiers deals with just binary decisions, while I have a rating decision. Are there common techniques or code libraries out there to deal with this sort of problem?
I agree with you that ML problems in which the response variable is on an ordinal scale require special handling. 'Machine-mode' (i.e., returning a class label) seems insufficient because the class labels ignore the relationship among the labels ("1st, 2nd, 3rd"); likewise 'regression-mode' (i.e., treating the ordinal labels as floats, {1, 2, 3}), because it ignores the metric distance between the response variables (e.g., 3 - 2 != 1).
R has (at least) several packages directed to ordinal regression. One of these is actually called Ordinal, but I haven't used it. I have used the Design package in R for ordinal regression and I can certainly recommend it. Design contains a complete set of functions for solution, diagnostics, testing, and results presentation of ordinal regression problems via the ordinal logistic model. (Both packages are available from CRAN.) A step-by-step solution of an ordinal regression problem using the Design package is presented on the UCLA Stats site.
Also, I recently looked at a paper by a group at Yahoo working on ordinal classification using Support Vector Machines. I have not attempted to apply their technique.
Have you tried using Weka? It supports binary, numerical, and nominal attributes out of the box, the latter two of which might work well enough for your purposes.
Furthermore, it looks like one of the classifiers that's available is a meta-classifier called OrdinalClassClassifier.java, which is the result of this research:
Eibe Frank and Mark Hall, A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning, 2001, pp. 145-156.
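If you would rather stay in Python, the idea behind that meta-classifier is straightforward to sketch: train k-1 binary models, each estimating P(rating > i), and obtain class probabilities by differencing. A rough, hypothetical implementation along those lines (the class name and base model are my own choices, not Weka's code):

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression

    class SimpleOrdinalClassifier:
        # Sketch of the Frank & Hall (2001) decomposition: k-1 binary models each
        # estimate P(y > threshold); class probabilities are obtained by differencing.
        def __init__(self, base_estimator=None):
            self.base_estimator = base_estimator or LogisticRegression(max_iter=1000)

        def fit(self, X, y):
            self.classes_ = np.sort(np.unique(y))        # assumed ordered, e.g. 1..5
            self.models_ = {}
            for threshold in self.classes_[:-1]:         # k-1 binary problems
                clf = clone(self.base_estimator)
                clf.fit(X, (y > threshold).astype(int))  # "is the rating above t?"
                self.models_[threshold] = clf
            return self

        def predict_proba(self, X):
            # P(y > t) for each threshold t, then difference to get per-class probabilities.
            p_gt = np.column_stack(
                [self.models_[t].predict_proba(X)[:, 1] for t in self.classes_[:-1]]
            )
            probs = np.empty((X.shape[0], len(self.classes_)))
            probs[:, 0] = 1.0 - p_gt[:, 0]                  # lowest class
            probs[:, 1:-1] = p_gt[:, :-1] - p_gt[:, 1:]     # middle classes
            probs[:, -1] = p_gt[:, -1]                      # highest class
            return np.clip(probs, 0.0, 1.0)

        def predict(self, X):
            return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

    # Usage: clf = SimpleOrdinalClassifier().fit(X_train, ratings_train)
    #        predicted_ratings = clf.predict(X_test)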
If you don't need a pre-made approach, then these references (in addition to doug's note about the Yahoo SVM paper) might be useful:
W Chu and Z Ghahramani, Gaussian processes for ordinal regression. Journal of Machine Learning Research, 2006.
Wei Chu and S. Sathiya Keerthi, New approaches to support vector ordinal regression. In Proceedings of the 22nd international conference on Machine Learning, 2005, 145-152.
The problems that doug has raised are all valid. Let me add another one: you didn't say how you would like to measure the agreement between the classification and the "gold standard". You have to formulate the answer to that question as soon as possible, as it will have a huge impact on your next steps. In my experience, the most problematic part of most optimization tasks is the score function. Ask yourself whether all errors are equal. Does misclassifying a "3" as a "4" have the same impact as classifying a "4" as a "3"? What about "1" vs "5"? Can mistakenly missing one case have disastrous consequences (a missed HIV diagnosis, or pilot ejection activated in a plane)?
The simplest way to measure the agreement between categorical classifiers is Cohen's kappa; more sophisticated agreement measures are described in the statistics literature.
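For example, with scikit-learn, a quadratic-weighted kappa penalizes a 1-vs-5 disagreement more heavily than a 3-vs-4 one (the ratings below are made up):

    from sklearn.metrics import cohen_kappa_score

    gold = [1, 2, 3, 4, 5, 3, 2]   # gold-standard ratings
    pred = [1, 2, 4, 4, 5, 2, 2]   # classifier output

    print(cohen_kappa_score(gold, pred))                        # unweighted agreement
    print(cohen_kappa_score(gold, pred, weights="quadratic"))   # distance-aware agreement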
Having said that, sometimes picking a solution that "just works" instead of "the right one" is faster and easier. If I were you, I would pick a machine learning library (R, Weka, I personally love Orange) and see what you get. Only if you don't get reasonably good results with that should you look for more complex solutions.
If you are not interested in fancy statistics, a one-hidden-layer backpropagation neural network with 3 or 5 output nodes will probably do the trick if the training data is sufficiently large. Most NN classifiers try to minimize the mean squared error, which is not always desired. The Support Vector Machines mentioned earlier are a good alternative.
FANN is a good library for backpropagation NNs; it also has some tools to assist in training the network.
There are two packages in R that might help in taming ordinal data:
ordinalForest on CRAN
rpartScore on CRAN
I'm working on an OrdinalClassifier that is based on the sklearn framework (specifically the OVR multiclass classifier) and that works well with the sklearn workflow, such as pipelines, cross-validation, and scoring.
Through testing, I'm finding that it performs very well versus standard non-ordinal multiclass classification using SVC, and it gives much greater control over optimizing for precision and recall on the positive class. (In my testing, I used sklearn's diabetes dataset and transformed the disease-progression target y into low, medium, and high class labels.) Testing via cross-validation is on my repo along with attribution. Scoring is based on weighted F1.
https://github.com/leeprevost/OrdinalClassifier
