Compute the Sum of Columns in SPSS - spss

I have a lot of columns in SPSS and for a calculation, I need to get the sum of each and every one of them. Is there a way to do this in SPSS?
An example of what I mean is shown below:
age   gender   question 1   question 2
-------------------------------------------------
25    m        2            3
19    f        4            2
20    f        3            4
               -------      -------
               need sum     need sum

If you just need an output table with the results, then see the DESCRIPTIVES command.
Alternatively, if you need the results in an output dataset for further processing then see the AGGREGATE command.

Use Analyze > Reports > Report Summaries in Columns, and add your columns.
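As a quick cross-check outside SPSS, the per-column sums that either approach would report can be sketched in plain Python. The rows are the three from the example in the question; the names q1_sum and q2_sum are stand-ins, not SPSS variable names.

```python
# Rows from the example table: (age, gender, question 1, question 2).
rows = [
    (25, "m", 2, 3),
    (19, "f", 4, 2),
    (20, "f", 3, 4),
]

# Sum each numeric question column across all rows.
q1_sum = sum(r[2] for r in rows)
q2_sum = sum(r[3] for r in rows)

print(q1_sum, q2_sum)
```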

Related

Sheets ArrayFormula. Find nearest number by group

Master Data
Group-Value pairs
1 | 1
1 | 2
1 | 3
2 | 5
2 | 8
3 | 10
3 | 12
Work Data
Group-Value pairs + desired result
1 | 4 | 3 (3≤4, max in group 1)
1 | 2 | 2 (2≤2, max in group 1)
2 | 6 | 5 (5≤6, max in group 2)
3 | 7 | no result (both 10 and 12 are greater than 7)
The task is to find the largest number in the matching group that is less than or equal to the given number.
For Group 1, value 4:
=> filter Master Data (1,2,3) => find 3
I have no problem doing it once; I need to do it with ArrayFormula.
My attempts to solve it were using modifications of the vlookup formula, with wrong outputs so far.
Samples and my working "arena":
https://docs.google.com/spreadsheets/d/11Cd2BGpGN-0h2bL0LQ_EpIDBKKT2hvTVHoxGC6i8uTE/edit?usp=sharing
Notes: no need to solve it in a single formula, because it may slow down the result.
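For reference, the lookup being asked for can be sketched outside Sheets as a per-group binary search. This is only an illustration of the expected results, using the Master and Work data from the question; the function name nearest_not_above is made up for this sketch.

```python
from bisect import bisect_right
from collections import defaultdict

# Master and Work data from the question, as (group, value) pairs.
master = [(1, 1), (1, 2), (1, 3), (2, 5), (2, 8), (3, 10), (3, 12)]
work = [(1, 4), (1, 2), (2, 6), (3, 7)]

def nearest_not_above(master, work):
    """For each work pair, return the largest master value in the same
    group that is <= the work value, or None when there is none."""
    by_group = defaultdict(list)
    for group, value in master:
        by_group[group].append(value)
    for values in by_group.values():
        values.sort()

    results = []
    for group, target in work:
        values = by_group.get(group, [])
        k = bisect_right(values, target)  # number of values <= target
        results.append(values[k - 1] if k else None)
    return results

print(nearest_not_above(master, work))
```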
I used
=ArrayFormula(VLOOKUP(D4:D8&text(E4:E8,"0000"),A4:A10&text(B4:B10,"0000"),1,true))
starting in J4
then
=ArrayFormula(if(--left(J4:J8)=D4:D8,--right(J4:J8,4),""))
starting in K4.
Needs further refinement but doesn't make any assumptions about max of previous group.
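The idea behind these two formulas, sketched in Python: build a sortable text key from the group and the zero-padded value, do an approximate (last key not above) lookup, then discard the hit if it came from a different group. The helper names are hypothetical, and the sketch assumes group labels all have the same number of characters so that text order matches numeric order.

```python
from bisect import bisect_right

WIDTH = 4  # same padding as text(...,"0000") in the formula

def padded_key(group, value):
    """Group followed by the zero-padded value, e.g. (1, 3) -> '10003'."""
    return f"{group}{value:0{WIDTH}d}"

master = [(1, 1), (1, 2), (1, 3), (2, 5), (2, 8), (3, 10), (3, 12)]
keys = sorted(padded_key(g, v) for g, v in master)

def lookup(group, value):
    # Approximate match: last key that is <= the search key.
    k = bisect_right(keys, padded_key(group, value))
    if k == 0:
        return None
    hit = keys[k - 1]
    # Second step: keep the hit only if it belongs to the same group.
    if hit[:-WIDTH] != str(group):
        return None
    return int(hit[-WIDTH:])

print([lookup(g, v) for g, v in [(1, 4), (1, 2), (2, 6), (3, 7)]])
```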
EDIT
So after further work it would look like this
=ArrayFormula(if(D4:D="",,
if(D4:D=
vlookup(D4:D&text(E4:E,"0000"),filter({A4:A&text(B4:B,"0000"),A4:A},A4:A<>""),2,true),
vlookup(D4:D&text(E4:E,"0000"),filter({A4:A&text(B4:B,"0000"),B4:B},A4:A<>""),2,true),"")))
A lot like @player0's solution in fact.
I guess you could make it a bit more general by doing something like
=text(B4,rept("0",ceiling(log10(max(B4:B)))))
assuming these are positive integers.
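One thing to watch with the log10 version: when the maximum is an exact power of ten (for example 1000), ceiling(log10(max)) gives one digit fewer than the number actually has. A sketch of the same idea in Python that counts digits directly instead; pad_width and padded are illustrative names only, and positive integers are assumed as above.

```python
def pad_width(values):
    """Digits needed to print the largest value (positive integers assumed)."""
    return len(str(max(values)))

def padded(value, width):
    """Zero-pad value to the given width, like text(value, rept("0", width))."""
    return f"{value:0{width}d}"

values = [1, 2, 3, 5, 8, 10, 12]
width = pad_width(values)
print([padded(v, width) for v in values])
```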
Alternative method
I think this is a better way. Find the start row of each group, and count how many rows r in that group have a value less than or equal to the required value. Then just go forward r-1 rows from the first line of the group to find the matching value:
=ArrayFormula(if(countifs(A4:A,D4:D,B4:B,"<="&E4:E)>0,
vlookup(
vlookup(D4:D,{A4:A,row(A4:A)},2,false)+countifs(A4:A,D4:D,B4:B,"<="&E4:E)-1,{row(A4:A),B4:B},2,false),))
Assuming of course that the Master data is sorted by group and value - otherwise you would have to use sort():
=ArrayFormula(if(countifs(A4:A,D4:D,B4:B,"<="&E4:E)>0,
vlookup(
vlookup(D4:D,{sort(A4:A,A4:A,1,B4:B,1),row(A4:A)},2,false)+countifs(A4:A,D4:D,B4:B,"<="&E4:E)-1,{row(A4:A),SORT(B4:B,A4:A,1,B4:B,1)},2,false),))
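The same "start row plus count" logic, as a minimal Python sketch. It assumes the master list is already sorted by group and then value, mirroring the assumption above; the function name is hypothetical.

```python
# Master data sorted by group, then value (as the formula assumes).
master = [(1, 1), (1, 2), (1, 3), (2, 5), (2, 8), (3, 10), (3, 12)]

def lookup_by_offset(group, value):
    # countifs(A:A, group, B:B, "<=" & value)
    r = sum(1 for g, v in master if g == group and v <= value)
    if r == 0:
        return None  # nothing in this group is small enough
    # Exact-match vlookup on the group: index of its first row.
    start = next(i for i, (g, _) in enumerate(master) if g == group)
    # Step forward r-1 rows from the first row of the group.
    return master[start + r - 1][1]

print([lookup_by_offset(g, v) for g, v in [(1, 4), (1, 2), (2, 6), (3, 7)]])
```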
My solution was based on the technique of finding the maximum number by a row. The sample formula is here:
https://docs.google.com/spreadsheets/d/1VY157ykKsCVDqEKDBp3oAVaG0LTXAz8wUCggCrFXMDM/edit#gid=628408999
My whole solution is here:
https://docs.google.com/spreadsheets/d/11Cd2BGpGN-0h2bL0LQ_EpIDBKKT2hvTVHoxGC6i8uTE/edit#gid=0
Step 1
Get joined numbers by groups from a Master Table.
1 | 3,2,1
2 | 8,5
3 | 12,10
Used offset to achieve this ↑. And used vlookup to match this semi-result with work table.
Step 2
Used if + split to check whether each resulting value was ≤ my work value, and in the same formula used query to find the maximum by each row.
compose a query: used join + sequence
=IF(M3=0,,"select "&JOIN(", ",INDEX("max(Col"&SEQUENCE(M3)&")")))
result:
select max(Col1), max(Col2), max(Col3), max(Col4), max(Col5)
Found the maximum by each group:
=index(TRANSPOSE(QUERY(TRANSPOSE(data), "select ...")))
This final formula was the 🔑 to solving the problem.
Note: a result of 0 from my formula means "no matches". This is fine for me.
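A rough Python equivalent of the two steps, for readers who want to check the expected numbers: collect the values per group, then take the row-wise maximum of the values that pass the ≤ test, with 0 standing for "no matches" as in the note above. This is a sketch of the logic only, not of the offset/query formulas themselves.

```python
# Step 1: joined numbers by group from the Master Table.
by_group = {1: [3, 2, 1], 2: [8, 5], 3: [12, 10]}

# Work table rows: (group, value).
work = [(1, 4), (1, 2), (2, 6), (3, 7)]

def max_not_above(group, target):
    # Step 2: keep only values <= target, then take the maximum.
    # default=0 mirrors the "0 means no matches" convention.
    return max((v for v in by_group.get(group, []) if v <= target), default=0)

results = [max_not_above(g, t) for g, t in work]
print(results)
```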
try:
=INDEX(IFNA(IF(E4:E>=
VLOOKUP(D4:D&TEXT(E4:E, "00000"), {A4:A&TEXT(FILTER(B4:B, B4:B<>""), "00000"), B4:B}, 2),
VLOOKUP(D4:D&TEXT(E4:E, "00000"), {A4:A&TEXT(FILTER(B4:B, B4:B<>""), "00000"), B4:B}, 2), 0)))

Checking to which range a value belongs in Google Sheets

I have some data in the following way
Category | [Range 1_min] | [Range 1_max] | [Range 2_min] | [Range 2_max] | ...
A        | 120           | 130           |               |               | ...
B        | 100           | 119           | 131           | 140           | ...
I want to be able to quickly query a number and have it return the category it belongs to, for example 135 belongs to B and 121 belongs to A.
I already have a script that does this, but since there are 1000+ categories, it takes a long time to run. Is there a faster way of doing this?
Thanks.
You can use LOOKUP:
=ArrayFormula(LOOKUP(2,1/((G2>=B2:B)*(G2<=C2:C)+(G2>=D2:D)*(G2<=E2:E)),A2:A))
Addition:
For more ranges you can add MMULT (not sure it's easier):
=ArrayFormula(LOOKUP(1,5/(MMULT(--(K2>={B2:B,D2:D,F2:F,H2:H}),ROW(A1:A4)^0)*MMULT(--(K2<={C2:C,E2:E,G2:G,I2:I}),ROW(A1:A4)^0)),A2:A))
Some notes on adapting it:
change the first argument of LOOKUP to 1
in the second LOOKUP argument, change the numerator to 5 (the number of columns to compare + 1)
in the second MMULT argument, ROW(A1:A4), use as many rows as there are columns to compare (i.e. for 4 columns -> ROW(A1:A4), for 6 columns -> ROW(A1:A6), etc.)
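If the lookup ever moves back into a script, one common way to speed it up is to flatten all ranges into a single sorted list once and binary-search it per query. The sketch below uses the two sample categories from the question and assumes ranges never overlap; table and category_of are hypothetical names.

```python
from bisect import bisect_right

# Sample data from the question: category -> list of (min, max) ranges.
table = {
    "A": [(120, 130)],
    "B": [(100, 119), (131, 140)],
}

# Flatten once and sort by range start (assumes non-overlapping ranges).
intervals = sorted((lo, hi, cat) for cat, ranges in table.items() for lo, hi in ranges)
starts = [lo for lo, _, _ in intervals]

def category_of(x):
    """Return the category whose range contains x, or None."""
    k = bisect_right(starts, x) - 1  # last range starting at or before x
    if k < 0:
        return None
    lo, hi, cat = intervals[k]
    return cat if x <= hi else None

print(category_of(135), category_of(121))
```

Each query is then a single binary search rather than a scan over every category.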

InfluxDB: Is flux the only way to add simple calculations as a column in a query?

I'm trying a query like so:
SELECT COUNT("value"), F("value"),G("value") FROM "someTable" WHERE time >= t1 AND time < t2 GROUP BY (aggregateWindow),*
F = sum of squares, and this wouldn't be too hard if I could do something like the following SUM("value"*"value"), but apparently that doesn't work in Influx (or maybe I'm using the syntax wrong).
G = time stamp of aggregate in unix epoch + aggregateWindow. So for example, if aggregateWindow == 1s, then I would want the following output (assuming there's only one point in that aggregateWindow whose value is value):
time value F G
---- ----- -- -----------------
1600272300000000000 1 1 1600272301000000000
1600272301000000000 2 4 1600272302000000000
1600272302000000000 3 9 1600272303000000000
1600272303000000000 4 16 1600272304000000000
1600272304000000000 5 25 1600272305000000000
I know you can implement sum of squares via flux as described here, but I'm worried about the performance of Flux vs regular Influx queries as mentioned here. So basically I'm asking, is flux the only and most efficient way of making a query like this?
Simple:
SELECT
COUNT("value"),
"value"*"value" AS F,
POW("value", 2) AS FwithPOWfunction
FROM "someTable"
WHERE
time >= t1 AND time < t2
GROUP BY (aggregateWindow)
You can't create a new time column, but you can apply an offset to the time grouping: GROUP BY time(<time_interval>,<offset_interval>)
The docs are your friend for more details and the correct syntax:
https://docs.influxdata.com/influxdb/v1.8/query_language/explore-data/#group-by-time-intervals-and-fill
https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#pow
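To make the intended output concrete, here is a client-side sketch in Python of the per-window count, sum of squares (F), and window-end timestamp (G) from the question. It is not InfluxQL or Flux, only the arithmetic; the function name is an assumption and the sample points are taken from the table in the question.

```python
from collections import defaultdict

NS_PER_SECOND = 1_000_000_000

# (timestamp in ns, value) pairs from the question's example table.
points = [
    (1600272300000000000, 1),
    (1600272301000000000, 2),
    (1600272302000000000, 3),
    (1600272303000000000, 4),
    (1600272304000000000, 5),
]

def aggregate(points, window_ns):
    """Return (window_start, count, sum_of_squares, window_end) per window."""
    buckets = defaultdict(list)
    for t, v in points:
        buckets[t - t % window_ns].append(v)  # floor to the window start
    return [
        (start, len(vs), sum(v * v for v in vs), start + window_ns)
        for start, vs in sorted(buckets.items())
    ]

for row in aggregate(points, NS_PER_SECOND):
    print(row)
```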

Google Sheets Query Group By / First-N-Per-Group

I'm trying to find a simple solution for first-n-per-group.
I have a table of data: the first column holds dates and the rest hold data. I want to group by date, as multiple entries per date are allowed. The second column holds some numbers, and for each date I want the FIRST record.
Currently the aggregate function I could possibly use is MIN() but that will return the lowest value and not the first.
A B
01/01/2018 10
01/01/2018 15
02/01/2018 10
02/01/2018 2
02/01/2018 100
02/01/2018 20
03/01/2018 5
03/01/2018 2
Desired output
A B
01/01/2018 10
02/01/2018 10
03/01/2018 5
Current results using MIN() - undesired
A B
01/01/2018 10
02/01/2018 2
03/01/2018 2
It's a shame there isn't a FIRST() aggregate function in Google Sheets, which would make this a lot easier.
I saw a couple of examples of using the Row Number and ArrayQuery, but that doesn't seem to work for me. There are about 5000 rows of data so trying to keep this as efficient as possible, and not have to recalculate the entire sheet on any change, each taking a few seconds.
Currently I have this, which appends a third column with the Row Number:
=query({A1:B, arrayformula(row(A1:B))}, "select min(Col1),min(Col2) group by Col1")
Thanks
EDIT 1
A suggested solution was =SORTN(A:B,2^99,2,1,1), which is a clean simple one. However, this requires a large range of "free space" to display the returned dataset. Imagine 3000+ rows.
I was hoping for a QUERY() -based solution, as I wanted to do further operations with the results. Specifically, count the occurrences of distinct values.
For example: I wanted a returned dataset of
A B
01/01/2018 10
02/01/2018 10
03/01/2018 5
Yet I want to count the occurrences of those values (and then ignoring the dates). For example:
B C
10 2
5 1
Perhaps I've confused the situation by using numbers? The "data" in ColB is TEXT (short 3 letter codes); however, I used numbers to show I couldn't use the MIN() function, as that returns the numerically lowest value.
So in brief:
Go through all rows (3000+ rows) and group by the FIRST row of a particular date
return the FIRST value of that row
COUNT() all unique occurrences of those FIRST values, disregarding the date. Just a list with the unique values and their count (again, only the first one of any particular day)
=SORTN(A:B,2^99,2,1,1)
If your data is sorted as in the sample, you can easily remove duplicates with SORTN().
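For checking the expected result of the three steps in the question, a small Python sketch: keep the first value seen per date, then count how often each first value occurs. The rows are the sample data from the question and are assumed to be in sheet order.

```python
from collections import Counter

# Sample rows from the question, in sheet order: (date, value).
rows = [
    ("01/01/2018", 10),
    ("01/01/2018", 15),
    ("02/01/2018", 10),
    ("02/01/2018", 2),
    ("02/01/2018", 100),
    ("02/01/2018", 20),
    ("03/01/2018", 5),
    ("03/01/2018", 2),
]

# First value per date: setdefault only stores the first one it sees.
first_per_date = {}
for date, value in rows:
    first_per_date.setdefault(date, value)

# Count occurrences of those first values, ignoring the dates.
counts = Counter(first_per_date.values())

print(first_per_date)
print(counts)
```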

SPSS: Inconsistent totals due to rounding of numbers

I am using weights when running the data with SPSS custom tables.
Thus it is expected that the column or row values may not add up to the row total, column total, or table total due to rounding of decimals.
sample table result:
                          variable 2
                          category 1   category 2   Total
variable 1   category 1   45           52           97
             category 2   60           56           115
             Total        105          107          211
Is there a way to force SPSS to output the correct row, column, or table totals?
expected table output:
                          variable 2
                          category 1   category 2   Total
variable 1   category 1   45           52           97
             category 2   60           56           116
             Total        105          108          213
If you are using the CROSSTABS procedure to produce these figures, then you should use the ASIS option.
To be clear: the total displayed by CTABLES is mathematically correct. However, if you want to display as the total the sum of the displayed values in the rows, instead, the only way to do this is by using the STATS TABLE CALC extension command to recompute the totals using the rounded values.
Here is how to do that.
First, you need to create a Python module named customcalc.py with the following contents
def custom(datacells, ncells, roworcol):
    '''Calculate sum of formatted values'''
    total = sum(float(datacells.GetValueAt(roworcol, i)) for i in range(ncells))
    return total
This file should be saved in the python\lib\site-packages directory under your Statistics installation or anywhere else that Python can find it.
Then, after your CTABLES command, run this syntax
STATS TABLE CALC SUBTYPE="customtable" PROCESS=PRECEDING
/TARGET custommodule="customcalc"
FORMULA="customcalc.custom(datacells, ncells, roworcol)" DIMENSION=COLUMNS LEVEL = -2 LOCATION="Total"
LABEL="Rounded Count".
That custom function adds up the formatted values in each row instead of the full precision values. If you have suppressed the default statistic name, Count, so that "Total" is the innermost label, use LEVEL=-1 instead of the LEVEL=-2 shown above.
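The difference between the two kinds of total can be seen in a few lines of standalone Python. The weighted cell counts below are hypothetical (the question does not give the unrounded values); they are chosen so that the cells display as 60 and 56, as in the second row of the sample table.

```python
# Hypothetical full-precision weighted counts for one table row.
cells = [59.6, 55.6]

# Total of the full-precision values, rounded once for display.
true_total = round(sum(cells))

# Each cell rounded for display, then added up: this is the kind of
# total the custom function above recomputes from the formatted values.
displayed_cells = [round(c) for c in cells]
sum_of_displayed = sum(displayed_cells)

print(displayed_cells, true_total, sum_of_displayed)
```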

Resources