Categorical PCA (CATPCA) in SPSS (23) - spss

I am trying to conduct nonlinear principal component analysis using CATPCA in SPSS. I am following [a tutorial] (http://www.ncbi.nlm.nih.gov/pubmed/22176263) by Linting & Kooij (2012) and did not find that certain steps are straightforward. For the timebeing, my questions are:
How do I get a screeplot within CATPCA. The authors describe it as a necessary step but I can't seem to find it within the CATPCA drop-down menu.
Similarly, the tutorial describes the use of bootstrap confidence interval to test the significance of the factor loadings but the Bootstrap Confidence Ellipses option under the Save menu seems disabled (or I can't seem to activate those). What am I missing?
These are the most pressing questions that I encountered thus far. Thank you.

CATPCA does not produce a scree plot. You can create one manually by copying the eigenvalues out of the Model Summary table in the output, or (if you will need to create a lot of scree plots) you can use the SPSS Output Management System (OMS) to automate pulling the values out of the table and creating the plot.
In order to enable the the Bootstrap Confidence Ellipses controls on the Save subdialog, you need to check "Perform bootstrapping" on the Bootstrap subdialog.

See footnote of the literature (Linting & Kooij,2012 p20). "Eigenvalues are from the bottom row of the Correlations transformed variables table." You can create scree plot from these eigenvalues.

Related

How to create a generalized dataset to detect all display digits with Roboflow

I want to detect digits on a display. For doing that I am using a custom 19 classes dataset. The choosen model has been yolov5-X. The resolution is 640x640. Some of the objets are:
0-9 digits
Some text as objects
Total --> 17 classes
I am having problems to detect all the digits when I want to detect 23, 28, 22 for example. If they are very close to each other the model finds problems.
I am using roboflow to create diferent folders in which I add some prepcocessings to have a full control of what I am entering into the model. All are checked and entered in a new folder called TRAIN_BASE. In total I have 3500 images with digits and the majority of variance is with hue and brightness.
Any advice to make the model able to catch all the digits besides being to close from each other?
Here are the steps I follow:
First of all, The use of mosaic dataset was not a good choice the purpose of detecting digits on a display because in a real scenario I was never gonna find pieces of digits. That reason made the model not to recognize some digits if it was not shure.
example of the digits problem concept
Another big improvement was to change the anchor boxes of the yolo model to adapt them to small objects. To know which anchor boxes I needed. Just with adding this argument to train.py is enought in the script provided by ultralitics to print custom anchors and add them to your custom architecture.
To check which augmentations can be good and which not, the next article explains it quite visually.
P.D: Thanks for the fast response to help the comunity gave me.

Is it possible to split a dataset in Google Dataprep? If so, how?

I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train, test, and validation sets. Ideally I'm looking to do a stratified split on a target column, but for starters I'd settle for a simple uniform random split by percent of whole (e.g. 50% train, 30% validation, 20% test).
So far I haven't been able to find anything about whether this is even possible with Dataprep, so I'm wondering if anyone knows definitively if this is possible and, if so, how to accomplish it.
EDIT 1
Thanks #jakub-janoštík for getting me going in the right direction! I modified your answer slightly and came up with the following (in wrangle form):
case condition: customConditions cases: [false,0] default: rand() as: 'split_condition'
case condition: customConditions cases: [split_condition < 0.6,'train'],[split_condition >= 0.8,'test'] default: 'validation' as: 'dataset_type'
drop col: split_condition action: Drop
By assigning random values in a separate step, I got the guaranteed percentage split I was looking for. The flow ended up looking like this:
Image: final flow diagram with dataset splitting
EDIT 2
I just figured out how to do the stratified split too, so I thought I'd add it in case anyone else is trying to do this. Here's the rough steps:
Split your dataset based on whatever subpopulations you're targeting (e.g. target0, target1)
For each subpopulation, do the uniform random split described above (e.g. now you have target0-train, target0-test, target0-validation, target1-train, etc.)
For each set type (i.e. train, test, validation):
Create a new recipe from one of the sets
Edit the recipe, and use the Union transform to merge it with other datasets of the same type (e.g. target0-train union with target1-train). The union button is in the middle of the toolbar on the Edit Recipe page.
I hope that's helpful to someone!
I'm looking at the same problem and I was able to partially solve this using "case on custom condition" and "Random" functions. What I do is that I create new column named target and apply following logic:
After applying this you'll have new column with these 3 new labels and you can generate 3 new datasets by applying row filtering rules based on those values. Thing to keep in mind is that each time you'll run the job you'll get different validation set. So if you want to keep it fixed you need to use the dataset created in first run as input for future runs (and randomise only train and test sets).
If you need more control on the distribution of labels in your datasets there is ROWNUMBER window function that could potentially be used. But I haven't been able to make it work yet.

Using R to map seedling locations using set reference points

I'm looking for some guidance on the approach I should take to mapping some points with R.
Earlier this year I went off to a forest to map the spatial distribution of some seedlings. I created a grid—every two meters I set down a flag with a tagname, and what I did is I would measure the distance from a flag to a seedling, as well as the angle using a military compass. I chose this method in hopes of getting better accuracy (GPS Garmins prove useless for this sort of task under canopy cover).
I am really new to spatial distribution work altogether, so I was hoping someone could provide guidance on what R packages I should use.
First step is to create a grid with my reference points (the flags). Second step is to tell R to use a reference point and my directions to mark the location of a seedling. From there come other things, such as cluster analysis.
The largest R package for analysing point pattern data like yours is spatstat which has a very detailed documentation and an accompanying book.
For more specific help you would need to upload (part of) your data so we can see how it is organised and how you should read it in and convert to standard x,y coordinates.
Full disclosure: I'm a co-author of both the package and the book.

can SPSS regard ordinal measures as producing continuous data?

In SPSS, when defining the measure of a variable, the usual options are "Scale", "Ordinal", and "Nominal" (see image).
However, when using actual dialog boxes to do analyses, SPSS will often ask us to describe whether the data are "Continuous" or "Categorical". E.g., I was watching this video by James Gaskin (a great YouTube teacher by the way), and saw this dialog box (image below).
My Question: In the second image, you can see that the narrator put some "Ordinal" variables in the "Continuous" box. Is it okay to do that? How come?
For most procedures, the treatment of a variable is determined by how you use it. The measurement level is just a reminder, so you can treat a variable however it makes sense.
There are some procedures that automatically determine how to treat a variable based on the measurement level, including CTABLES, the Chart Builder, and TREE, but you can change the level temporarily in the dialog box or in syntax or change it persistently via VARIABLE LEVEL or in the Data Editor. Also, most of the statistical extension commands use the declared measurement level to determine whether a variable is continuous or a factor.

Add data series to highlight cases on a box plot (Excel, SPSS or R)

first time user of this forum - guidance on how to provide enough information is very appreciated. I am trying to replicate the presentation of data used in the Medical education field. This will help improve the quality of examiners' marking of trainees in a Clinical Exam. What I would like to communicate will be similar to what is already communicated in the College of General Practitioners regarding one of their own exams, please see www.gp10.com.au/slides/thursday/slide29.pdf to help understand what it is I want to present. I have access to Excel, SPSS and R, so any help with any of these would be great. However as a first attempt I have used SPSS and created 3 variables: dummy variable, a "station score" and a "global rating score"(GRS). The "station score"(ST) is a value between 0 and 10 (non-integers) and is on the y-axis similar to the pdf presentation of "Candidate Final Marks". The x-axis is the "global rating scale", an integer from 1 to 6 and is represented in the pdf as the "Overall Performance Scale". When I use SPSS's boxplot I get a boxplot as depicted.
.
What I would like to do is overlay a single examiners own scoring of X number of examinees. So for one examiner (examiner A) provided the following marks:
ST: 5.53,7.38,7.38,7.44,6.81
GRS: 3,4,4,5,3
(this is transposed into two columns).
Whether it be SPSS, Excel or R how would I be able to overlay the box and whisker plots with the individual data points provided by the one examiner? This would help show the degree to which the examiners' marking styles are in concordance with the expected distribution of ST scores across GRS. Any help greatly appreciated! I like Excel graphics but I have found it very difficult to work with when choosing the examiners' data as a separate series - somehow the examiners' GRS scores do not line up nicely on the x-axis. I am very new to R but am also very interested in R, and would expend time to get a good result in R if a good result is viable. I understand JMP may be preferable for this type of thing but access to this may not be possible.

Resources