SPSS - Filter columns based on specific criteria - spss

I have a dataset (see below) from which I want to filter out any observations where the only 1 is in the McDonalds column, such as ID #3 (I do not want McDonalds in my analyses). I want to keep any observations that have a 1 in another column, even if there is also a 1 in the McDonalds column (such as IDs #1-2). I have tried using the Select Cases option with McDonalds=0, but this also filters out observations that have 1s in the other columns. Below is a sample of my dataset; I actually have many more columns and was trying to avoid naming every other column individually in the "Select Cases" option in SPSS. Would anyone be able to help me please? Thanks.
Data:

To avoid naming each of the other columns separately, you can use the to keyword in the syntax. Also, you essentially want to keep rows that have a 1 in any of the other columns regardless of the value in the McDonalds column, so there is no need to mention McDonalds in the syntax.
So, for example, if your column names are McDonalds, RedBull, var3, var4, var5, TacoBell, you could use either of the following options:
select if any(1, RedBull to TacoBell).
or this:
select if sum(RedBull to TacoBell)>=1.
Note: using the to convention requires that the relevant variables be contiguous in the data.
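The selection rule above can be sketched outside SPSS as a quick sanity check. This is a minimal Python illustration, not SPSS syntax; the sample rows and the helper name has_other_brand are hypothetical. For 0/1 indicator columns, "any column equals 1" is the same as the row sum over those columns being at least 1.

```python
# Hypothetical sample rows; column names follow the example in the answer.
rows = [
    {"id": 1, "McDonalds": 1, "RedBull": 1, "var3": 0, "TacoBell": 0},
    {"id": 2, "McDonalds": 1, "RedBull": 0, "var3": 0, "TacoBell": 1},
    {"id": 3, "McDonalds": 1, "RedBull": 0, "var3": 0, "TacoBell": 0},
]

# Every column except the ID and McDonalds, mirroring "RedBull to TacoBell".
other_columns = [c for c in rows[0] if c not in ("id", "McDonalds")]


def has_other_brand(row):
    """True when at least one non-McDonalds column equals 1."""
    return any(row[c] == 1 for c in other_columns)


kept = [row for row in rows if has_other_brand(row)]
kept_ids = [row["id"] for row in kept]
print(kept_ids)
```

ID 3, which has a 1 only in the McDonalds column, is the only row dropped.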

You just need to add the "OR" operator (which is the vertical bar: |) between all the mentioned conditions.
So basically, you want to keep the cases when McDonalds = 0 | RedBull = 1 | TacoBell = 1.
You can either copy the above line into the Select Cases -> If option, or write the following lines into an SPSS syntax file, replacing DataSet1 with the name of your dataset:
DATASET ACTIVATE DataSet1.
USE ALL.
COMPUTE filter_$=(McDonalds = 0 | RedBull = 1 | TacoBell = 1).
VARIABLE LABELS filter_$ 'McDonalds = 0 | RedBull = 1 | TacoBell = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
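To see which combinations this OR condition flags, here is a small Python sketch of the same logical expression. It is only an illustration under assumed 0/1 values; the function name flag and the case labels are hypothetical. Note that a row with no brand at all is also flagged as selected, because McDonalds = 0 is true for it.

```python
def flag(mcdonalds, redbull, tacobell):
    """Return 1 (selected) or 0 (not selected), like the filter_$ variable."""
    return int(mcdonalds == 0 or redbull == 1 or tacobell == 1)


# (McDonalds, RedBull, TacoBell) combinations; values are illustrative.
cases = {
    "McDonalds only": (1, 0, 0),
    "McDonalds and RedBull": (1, 1, 0),
    "TacoBell only": (0, 0, 1),
    "no brand at all": (0, 0, 0),
}

flags = {label: flag(*values) for label, values in cases.items()}
for label, value in flags.items():
    print(label, value)
```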

Related

selecting range of values based upon first few characters in spss?

I know that with
select if char.substr(variable_name,1,3)="I22".
I can select values based on their first few characters, but that is not exactly my question. I need to select a range of values that start with a few given characters. Here is an example of what I want:
If I have the following cases:
I22A33
I22B33
I22C33
I22D33
So I want to select I22B33 and I22C33 out of the above 4 values, so it's like a range of cases between B and C.
One way to flag any cases that meet your criteria is using INDEX and a series of OR conditions. Not particularly modular, but if you just have a couple of conditions you're searching for it could get you on your way.
Edit: These searches are case-insensitive (due to UPCASE) and search for matches at the start of the string. To search for matches anywhere within the string set the condition to > 0 (instead of = 1).
COMPUTE f_I22 = (INDEX(UPCASE(var_name),'I22B33') = 1)
    OR (INDEX(UPCASE(var_name),'I22C33') = 1).
EXECUTE.
Assuming that all the values you want to select start with either "I22B" or "I22C", you can simply use:
select if char.substr(variable_name,1,4)="I22B" or
    char.substr(variable_name,1,4)="I22C".
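The same prefix-range idea can be written as a range test on the character that follows the common stem, which scales better than one OR clause per letter. This is a minimal Python sketch of the logic rather than SPSS syntax; the function name in_prefix_range and its parameters are hypothetical.

```python
codes = ["I22A33", "I22B33", "I22C33", "I22D33"]


def in_prefix_range(code, stem="I22", low="B", high="C"):
    """True when code starts with stem and its next character lies in [low, high]."""
    if not code.startswith(stem) or len(code) <= len(stem):
        return False
    return low <= code[len(stem)] <= high


selected = [c for c in codes if in_prefix_range(c)]
print(selected)
```

Widening the range only requires changing low and high, for example low="B", high="F".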

Filtering out based on count using Apache Beam

I am using Dataflow and Apache Beam to process a dataset and store the result in a headerless csv file with two columns, something like this:
A1,a
A2,a
A3,b
A4,a
A5,c
...
I want to filter out certain entries based on the following two conditions:
1- In the second column, if the number of occurrences of a certain value is less than N, then remove all such rows. For instance if N=10 and c only appears 7 times, then I want all those rows to be filtered out.
2- In the second column, if the number of occurrences of a certain value is more than M, then only keep M many of such rows and filter out the rest. For instance if M=1000 and a appears 1200 times, then I want 200 of such entries to be filtered out, and the other 1000 cases to be stored in the csv file.
In other words, I want to make sure that every value in the second column appears at least N and at most M times.
My question is whether this is possible by using some filter in Beam? Or should it be done as a post-process step once the csv file is created and saved?
You can use beam.Filter to collect the second-column values that satisfy your lower-bound condition into a PCollection.
Then join that PCollection (as a side input) with your original PCollection to filter out all the lines that need to be excluded.
As for the upper bound, since you want to keep up to that many elements rather than exclude them completely, you should do some post-processing or come up with a combine transform to do that.
An example with the Python SDK, using word count:
import random
import re

import apache_beam as beam


class ReadWordsFromText(beam.PTransform):
    """Reads text files and emits one element per word."""

    def __init__(self, file_pattern):
        self._file_pattern = file_pattern

    def expand(self, pcoll):
        return (pcoll.pipeline
                | beam.io.ReadFromText(self._file_pattern)
                | beam.FlatMap(lambda line: re.findall(r'[\w\']+', line.strip(), re.UNICODE)))


p = beam.Pipeline()
words = (p
         | 'read' >> ReadWordsFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')
         | 'lower' >> beam.Map(lambda word: word.lower()))

# Assume this is the data PCollection you want to filter.
data = words | beam.Map(lambda word: (word, random.randint(1, 101)))

# Count how often each word occurs, then keep only words seen more than 100 times.
counts = words | 'count' >> beam.combiners.Count.PerElement()
words_with_counts_bigger_than_100 = (
    counts
    | beam.Filter(lambda count: count[1] > 100)
    | beam.Map(lambda count: count[0]))
Now you have a PCollection of the words that occur more than 100 times.
def cross_join(left, rights):
    # Emit (left, x) for each side-input word x that equals the key of left.
    for x in rights:
        if left[0] == x:
            yield (left, x)


data_with_word_counts_bigger_than_100 = data | beam.FlatMap(
    cross_join, rights=beam.pvalue.AsIter(words_with_counts_bigger_than_100))
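Now you have filtered the elements below the lower bound out of the data set. Each remaining element is paired with its matching word, for example (('king', 66), 'king').
Note that the 66 in ('king', 66) is the fake random data I put in.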
To debug with such visualizations, you can use Interactive Beam. You can set up your own notebook runtime by following the instructions, or you can use the hosted solution provided by Google Dataflow Notebooks.
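If the upper bound is handled as a post-processing step, both rules from the question can be sketched in plain Python. This is not Beam code; it is a minimal illustration that assumes the rows fit in memory and that, when a value occurs more than M times, keeping the first M rows in input order is acceptable. The function name bound_occurrences is hypothetical.

```python
from collections import Counter


def bound_occurrences(rows, n, m):
    """Keep rows whose second-column value occurs at least n times, capped at m rows per value."""
    counts = Counter(value for _, value in rows)
    emitted = Counter()
    kept = []
    for key, value in rows:
        if counts[value] < n:
            continue  # too rare: drop every row with this value
        if emitted[value] >= m:
            continue  # already kept m rows for this value
        emitted[value] += 1
        kept.append((key, value))
    return kept


rows = [("A1", "a"), ("A2", "a"), ("A3", "b"), ("A4", "a"), ("A5", "c")]
result = bound_occurrences(rows, n=2, m=2)
print(result)
```

With n=2 and m=2, the single b and c rows are dropped and only the first two of the three a rows are kept.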

stack data and restructure without using var to cases or casestovar in SPSS

I have the following situation: a loop (stacked data) with only 1 index variable and with multiple items corresponding to the statements, as in the picture below (sorry it is Excel, but it is the same as in SPSS):
stack data - cases on multiple lines, but never filling for 1 respondent all the columns
I want to reach the following situation without using casestovars to restructure, because that creates a lot of empty variables. I remember that older versions had a command like Update, which moved the cases up, to reach the following result:
reducing the cases per respondent
Like starting from this:
ID Index Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6
1 1 1 1
1 2 1 1
1 3 1 1
To reach this:
ID Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6
1 1 1 1 1 1 1
But without using casestovars. Is there any command in SPSS syntax for this?
Thank you very much, have a nice day!
Not entirely sure how variable your data structure is likely to be in reality, but if it is as demoed, where you have only a single response for each of q1_1 to q1_6 per respondent ID, then the following would be sufficient:
dataset declare dsAgg.
aggregate outfile="dsAgg" /break=respid /q1_1 to q1_6=max(q1_1 to q1_6).
Also not sure of the significance of duplicate index values within the same respondent IDs, if this was intended or not.
The following syntax could do the job -
* first we'll recreate your example data.
data list list/respid index q1_1 to q1_6.
begin data
1,1,1,,,,,
1,2,,2,,,,
1,3,,,1,,,
1,4,,,,2,,
1,5,,,,,1,
1,6,,,,,,2
2,1,3,,,,,
2,1,,4,,,,
2,2,,,5,,,
2,2,,,,4,,
2,3,,,,,3,
2,3,,,,,,2
end data.
* now to work: first thing is to make sure the data from each ID are together.
sort cases by respid index.
* the loop will fill down the data to the last line of each ID.
do repeat qq=q1_1 to q1_6.
if respid=lag(respid) and missing(qq) qq=lag(qq).
end repeat.
* the following lines will help recognize the last line for each ID and select it.
compute lineNR=$casenum.
aggregate /outfile=* mode=ADDVARIABLES/break=respid/MXlineNR=max(lineNR).
select if lineNR=MXlineNR.
exe.
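Both answers boil down to collapsing the rows of each respondent into one row that keeps the non-missing value of every column. Here is a minimal Python sketch of that idea, mirroring the max-per-column aggregation; the sample layout is hypothetical because the blank cells of the question's table are not recoverable, and collapse_by_id is an illustrative name.

```python
def collapse_by_id(rows):
    """Merge rows sharing an ID, keeping the largest non-missing value per column."""
    merged = {}
    for respid, values in rows:
        current = merged.setdefault(respid, [None] * len(values))
        for i, value in enumerate(values):
            if value is not None and (current[i] is None or value > current[i]):
                current[i] = value
    return merged


# None marks a missing cell; each stacked row fills only some of q1_1..q1_6.
rows = [
    (1, [1, 1, None, None, None, None]),
    (1, [None, None, 1, 1, None, None]),
    (1, [None, None, None, None, 1, 1]),
]
collapsed = collapse_by_id(rows)
print(collapsed)
```

If a respondent could have several different non-missing values in the same column, taking the maximum would silently discard the others, so that case needs a different rule.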

How to apply content based filtering in Neo4j

I have data in the format below, where the first column represents the product nodes and all the following columns represent properties of the products. I want to apply a content-based filtering algorithm using cosine similarity in Neo4j. For that, I believe I need to define the fx columns as properties of each product node, then read these properties as a vector and apply cosine similarity between the products. I am having trouble doing two things:
1. How to define these columns as properties in one go (as there could be more than 100 columns).
2. How to read all the property values as a vector to be able to apply cosine similarity.
Product f1 f2 f3 f4 f5
P1 0 1 0 1 1
P2 1 0 1 1 0
P3 1 1 1 1 1
P4 0 0 0 1 0
You can use LOAD CSV to input your data.
For example, this query will read in your data file and output for each input line (ignoring the header line) a name string and a props collection:
LOAD CSV FROM 'file:///data.csv' AS line FIELDTERMINATOR ' '
WITH line SKIP 1
RETURN HEAD(line) AS name, [p IN TAIL(line) | TOFLOAT(p)] AS props
Even though your data has a header line, the above query skips over it, as it is not needed. In fact, we don't want to use the WITH HEADERS option of LOAD CSV, since that would convert each data line into a map, whereas it is more convenient for our current purposes to get each data line as a collection of values.
The above query assumes that all the columns are space-separated, that the first column will always contain a name string, and that all other columns contain the numeric values that should be put into the same collection (named props).
If you replace RETURN with WITH, you can append additional clauses to the query that make use of the name and props values.
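For reference, the cosine similarity the question wants to compute on these props collections can be sketched in plain Python using the sample table. This only illustrates the calculation outside Neo4j; it is not Cypher, and the function name cosine_similarity is an illustrative choice. Returning 0.0 for an all-zero vector is an assumption made here to avoid dividing by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


# Property vectors taken from the sample table in the question.
props = {
    "P1": [0, 1, 0, 1, 1],
    "P2": [1, 0, 1, 1, 0],
    "P3": [1, 1, 1, 1, 1],
    "P4": [0, 0, 0, 1, 0],
}

sim_p1_p2 = cosine_similarity(props["P1"], props["P2"])
sim_p3_p4 = cosine_similarity(props["P3"], props["P4"])
print(sim_p1_p2, sim_p3_p4)
```

P1 and P2 share a single 1 (in f4) and each has three 1s, so their similarity is 1/3.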

Sort all the cases of a specific variable in descending order while the others remain the same, using SPSS syntax

I have two variables (id and Var1) in SPSS as below. I want to sort Var1 in descending order, but the other variables should not be reordered along with Var1, i.e. the other variables should remain as they were before the sort.
My data is...
id Var1
-- ----
M-1 3
M-2 4
M-3 2
M-4 7
But I want like this..
id Var1
-- ----
M-1 7
M-2 4
M-3 3
M-4 2
My Syntax/code is...
data list list
/id(A3) Var1(F2.0).
begin data.
M-1 3
M-2 4
M-3 2
M-4 7
end data.
sort cases by Var1(D).
execute.
When I run this code, it also sorts id according to Var1. But I do not want the sort command to apply to all variables; I only want to sort the currently selected variable in SPSS.
Can anyone help using SPSS Syntax?
You could split the dataset, sort the Var1 variable, and then merge the parts back together. One way to do so would be this:
* create data.
data list list
/id(A3) Var1(F2.0).
begin data.
M-1 3
M-2 4
M-3 2
M-4 7
end data.
DATASET NAME ids.
DATASET COPY sortvar.
* Delete sort variable (Var1) from dataset "ids".
DELETE VARIABLES Var1.
* Keep only sort variable in dataset "sortvar".
DATASET ACTIVATE sortvar.
DELETE VARIABLES id.
* sort Var1.
SORT CASES BY Var1(D).
* Merge datasets.
MATCH FILES
/FILE ids
/FILE sortvar.
EXECUTE.
If you have lots of variables to delete in the sortvar dataset, you could also use the MATCH FILES command with the KEEP subcommand:
* Delete all variables but Var1.
DATASET ACTIVATE sortvar.
MATCH FILES
/FILE *
/KEEP Var1.
Alternatively, you can use the SAVE command in combination with the KEEP or DROP subcommands in order to split the dataset.
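The split, sort, and merge idea amounts to sorting one column on its own and pairing it back with the other column by position. A minimal Python sketch with the question's data (an illustration of the logic, not SPSS syntax):

```python
ids = ["M-1", "M-2", "M-3", "M-4"]
var1 = [3, 4, 2, 7]

# Sort only Var1 in descending order; the ids keep their original order.
sorted_var1 = sorted(var1, reverse=True)

# Re-attach the sorted values to the ids by row position.
result = list(zip(ids, sorted_var1))
for row_id, value in result:
    print(row_id, value)
```

Keep in mind that after this step a value no longer belongs to the id it was originally recorded with.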

Resources