SPSS Selecting some cases while excluding others - spss

I am trying to select multiple cases and exclude some in my data in SPSS. I have different department numbers and I want to exclude dept 0133 and 0083 all while only selecting that course level equal 100 and 200 level courses. Is this possible to do in one selection? (Note: dept is a string variable)

Looks like this should do it:
select if any(level, 100, 200) and not any(dept,'0133','0083').

Related

How to identify cases that have multiple variables with the same value in SPSS

I have a dataset in which there are multiple variables for various times.
Here is a sample part of the dataset:
I'm trying to identify the number/percentage of cases that have the same value in any of the multiple variables.
For example, if I have a database of teachers who left a school where they worked and there are variables for why the teacher left each school, how would I find out if a teacher left multiple schools for the same reason. So I have reasonleft1, reasonleft2, reasonleft3, up to 20. Each reasonleft has the same coded response options. For example, 1=better opportunity elsewhere, 2=retired, 3=otherwise left workforce, etc. I'm stumped on how to figure out if any case/teacher left multiple schools for the same reason. Which teachers left multiple schools out of the 20 for 1=better opportunity elsewhere, for example.
Thanks!
This can be done in the following two steps:
You need to restructure the dataset so that each "time" appears in a separate row.
Now you can aggregate to count the number of appearances of each reason per person.
The following syntax will do that:
varstocases
/make facilititype from facilititype1_pre facilititype2_pre facilititype3_pre
/make timeinplace from timeinplace1_pre timeinplace2_pre timeinplace3_pre
/make reasonleft from reasonleft1_pre reasonleft2_pre reasonleft3_pre
/index = timeN(reasonleft).
* you should continue the numbering for as much as needed.
dataset declare MyAgg.
aggregate outfile=MyAgg /break=ID reasonleft/Ntimes=n.
At this point you have a new dataset which has the count of each reason for each ID. If you wish you can go back to wide format, and create a column for each reason (the values in each column are the count of times this reason appeared for the ID).
This way:
dataset activate MyAgg.
casestovars /id=ID /index=reasonleft/separator="_".

Ranking pcollection elements

I am using Google DataFlow Java SDK 2.2.0. Use case as follows:
PCollection pEmployees: employees and corresponding department name. may contain up to 10 million elements.
PCollection pDepartments: department name and number of elements to be published per department. will contain few hundred elements.
task: Collect elements from pEmployees as per the department-wise number for all departments from pDepartments. This will be a big collection (up to a few hundred thousand elements or few GBs).
We cannot user Top transform here as it would work one at a time on pEmployee, whereas we have multiple departments and that too, in a PCollection. We can assign a row number to each of the elements from pEmployees, join it with pDepartments and filter the records where row_number > target number from pDepartments. This will require a global ranking.
Question: how can we assign rank/row numbers to the elements in a pcollection?.
This is very close to the Sample transform, but not quite, because it applies the same threshold to all keys when used as .perKey(). Generally, Beam currently doesn't support per-key combines with different combine function parameters.
I'd recommend to emulate it by using CoGroupByKey to join pEmployees and pDepartments and obtain tuples (CoGbkResult) containing department name, N = number of elements, and all employees in that department. Then simply iterate through the employees and emit the first N and discard the rest.

Google Big Query: How to select subset of smaller table using nested fake join?

I would like to solve the problem related to How can I join two tables using intervals in Google Big Query? by selecting subset of of smaller table.
I wanted to use solution by #FelipeHoffa using row_number function Row number in BigQuery?
I have created nested query as follows:
SELECT a.DevID DeviceId,
a.device_make OS
FROM
(SELECT device_id DevID, device_make, A, lat, long, is_gps
FROM [Data.PlacesMaster] WHERE not device_id is null and is_gps is true) a JOIN (select ROW_NUMBER() OVER() row_number,top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, count from (SELECT top_left_lat, top_left_long, bottom_right_lat,bottom_right_long, A, COUNT(*) count from [Karol.fast_food_box]
GROUP BY (....?)
ORDER BY COUNT DESC,
WHERE row_number BETWEEN 1000 AND 2000)) b ON a.A=b.A
WHERE (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
GROUP EACH BY DeviceId,
OS
Could you help in finalising it please? I cannot break the smaller table by "group by", i need to have consistency between two tables and select only items with lat,long from MASTER.table that fit into the given bounding box of a smaller table. I need to match lat,long into box really, my solution form How can I join two tables using intervals in Google Big Query? works only for small tables (approx 1000 to 2000 rows), hence this issue. Thank you in advance.
It looks like you're applying two approaches at once: 1) split a table into chunks of rows, and run on each, and 2) include a field, "A", tagging your boxes and your points into 'regions', that you can equi-join on. Approach (1) just does the same total work in more pieces (also, it's adding complication), so I would suggest focusing on approach (2), which cuts the work down to be ~quadratic in each 'region' rather than quadratic in the size of the whole world.
So the key thing is what values your A takes on, and how many points and boxes carry each A value. For example, if A is a country code, that has the right logical structure, but it probably doesn't help enough once you get a lot of data in any one country. If it goes to the state or province code, that gets you one step farther. Quantized lat/long grid cells generalize better. Sooner or later you do have to deal with falling across a region edge, which can be somewhat tricky. I would use a lat/long grid myself.
What A values are you using? In your data, what is the A value with the maximum (number of points * number of boxes)?

How to select random subset of cases in SPSS based on student number?

I am setting some student assignments where most students will be using SPSS. In order to encourage students to do their own work, I want students to have a partially unique dataset. Thus, I'd like to get each to the open the master data file, and then get the student to run a couple of lines of syntax that produces a unique data file. In pseudo code, I'd like to do something like the following where 12345551234 is a student number:
set random number generator = 12345551234
select 90% random subset ofcases and drop the rest.
What is simple SPSS syntax dropping a subset of cases from the data file?
After playing around I came up with this syntax, but perhaps there are simpler or otherwise better suggestions.
* Replace number below with student number or first 10 numbers of student number.
SET SEED=1234567891.
FILTER OFF.
USE ALL.
SAMPLE .90.
EXECUTE.

rails 3 + activerecord: is there a single query to count(field1) grouped by field2?

I'm trying to find the best way to summarize the data in a table
I have a table Info with fields
id
region_number integer (NOT associated with another table)
member_name string
member_active T/F
Members belong to a region, have a name, and are either active or not.
I'm wondering if there is a single query that will create a table with 3 columns, and as many rows as there are unique region_numbers:
For each unique region_number:
region_number
COUNT of members in that region
COUNT of members in that region with active=TRUE
Suppose I have 50 regions, I see how to do it with 2x50 queries but that surely is not the right approach!
You can always group on several things if you're prepared to do a tiny bit of post-processing:
SELECT region_number, COUNT(*) AS instances, member_active
GROUP BY region_number, member_active
WHERE region_number IN ?
This allows you do to one query for all region numbers at the same time. There will be one row for the T values, one for the F, but only if those are present.
If you see a case where you're doing a lot of queries that differ only in identifiers, that's something you can usually execute in one shot like this.

Resources