Running Count is Slow in Google Sheets


Here's my way of calculating running count by groups in Sheets:
=LAMBDA(a,INDEX(if(a="",,COUNTIFS(a,a,row(a),"<="&row(a)))))(B4:B)
The complexity of this formula is R^2, i.e. 1,000,000 operations for 1K rows. I'd love to find a more efficient formula, and I have tried combinations of LAMBDA and SCAN. So far I have only found a fast way to do it for one group at a time:
=INDEX(IF(B4:B="🌽 Corn",SCAN(0,B4:B,LAMBDA(i,v,if(v="🌽 Corn",i+1,i))),))
Can we do the same for all groups? Do you have an idea?
Note: a script solution would use an object as a hash map to make it fast.
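To make the R^2 versus hash comparison concrete, here is a minimal plain JavaScript sketch. The function names and the flat input array are assumptions for illustration only, and the quadratic loop models how COUNTIFS behaves conceptually, not how Sheets evaluates it internally.

```javascript
// Quadratic baseline: for every row, rescan all rows and keep the matches
// at or above the current row (the two COUNTIFS criteria).
function runningCountNaive(values) {
  return values.map((v, i) => {
    if (v === '') return '';
    let count = 0;
    for (let j = 0; j < values.length; j++) {
      if (j <= i && values[j] === v) count++;
    }
    return count;
  });
}

// Single pass: remember the latest count of each group in a Map.
function runningCountHashed(values) {
  const seen = new Map();
  return values.map((v) => {
    if (v === '') return '';
    const next = (seen.get(v) || 0) + 1;
    seen.set(v, next);
    return next;
  });
}

console.log(runningCountHashed(['a', 'b', 'b', 'b', 'a'])); // [ 1, 1, 2, 3, 2 ]
```

The first function does R comparisons per row, the second does one map lookup per row, which is the speed-up the question is asking a formula to match.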
Legal Tests
We have a list of N items in total with m groups. Group m(i) is a unique item which may repeat randomly. Sample dataset:
a
b
b
b
a
↑ Sample for 5 items total and 2 groups: N=5; m=2. Groups are "a" and "b"
The task is to find the function that works fastest for different values of N and m:
Case #1. 1000+ occurrences of an item from a group m(i)
Case #2. 1000+ different groups m
General case: a significant number of total items, N ~ 50K+
Playground
Sample Google Sheet with 50K rows of data. Please click on the button 'Use Template':
Test Sheet with 50K values
Speed Results
Tested solutions:
Countifs from the question and Countif from the answer.
Xlookup from answer
Complex Match logic from answer
🏆Sorting logic from the answer
In my environment, the sorting option works faster than the other provided solutions. Test results are here, tested with the code from here.

Transpose groups m = 5
I've found a possible way for a small number of counted groups.
In my tests with 20K rows and 5 groups, the cumulative count was faster with this function:
INDEX(if(B4:B="",,LAMBDA(eq,BYROW(index(TRANSPOSE(SPLIT(TRANSPOSE(BYCOL(eq,LAMBDA(c,query("-"&SCAN(0,c,LAMBDA(i,v,i+v)),,2^99))))," -"))*eq),LAMBDA(r,sum(r))))(--(B4:B=TRANSPOSE(UNIQUE(B4:B))))))
It's ugly, but for now I cannot write a better version because the BYCOL function cannot return arrays.
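For readers who find the formula hard to follow, this is a rough JavaScript sketch of the same idea: build a 0/1 matrix with one column per group, take a running sum down each column, and keep only the entry of the row's own group. The function name is hypothetical, and this is a reading of the formula, not a translation of how Sheets executes it.

```javascript
function runningCountByColumns(values) {
  // One column per distinct non-blank group, like TRANSPOSE(UNIQUE(B4:B)).
  const groups = [...new Set(values.filter((v) => v !== ''))];
  // eq[r][c] is 1 when row r belongs to group c, otherwise 0.
  const eq = values.map((v) => groups.map((g) => (v === g ? 1 : 0)));
  // running[c] holds the cumulative sum of column c so far (the SCAN part).
  const running = new Array(groups.length).fill(0);
  return eq.map((flags, r) => {
    if (values[r] === '') return '';
    let total = 0;
    flags.forEach((flag, c) => {
      running[c] += flag;
      // Multiplying by the flag keeps only the row's own group, then the row is summed.
      total += running[c] * flag;
    });
    return total;
  });
}

console.log(runningCountByColumns(['a', 'b', 'b', 'b', 'a'])); // [ 1, 1, 2, 3, 2 ]
```

The work grows with rows times groups, which fits the observation that this approach only pays off when m is small.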
Apps Script
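The perfect solution would be a "hash"-like function in Google Sheets:
/**
 * runningCount
 *
 * @param {Range} data
 *
 * @customfunction
 */
function runningCount(data) {
  // A single cell arrives as a scalar, so wrap it into a 2D array
  if (!Array.isArray(data)) data = [[data]];
  // Prototype-less object, so keys such as "constructor" are counted correctly
  var obj = Object.create(null);
  var l = data[0].length;
  var k;
  var res = [], row;
  for (var i = 0; i < data.length; i++) {
    row = [];
    for (var ii = 0; ii < l; ii++) {
      k = '' + data[i][ii];
      if (k === '') {
        // Keep blank cells blank
        row.push('');
      } else {
        // Increment the counter stored under this key
        if (!(k in obj)) {
          obj[k] = 1;
        } else {
          obj[k]++;
        }
        row.push(obj[k]);
      }
    }
    res.push(row);
  }
  return res;
}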

You can try this:
=QUERY(
REDUCE(
{"", 0},
B4:B10000,
LAMBDA(
acc,
cur,
{
acc;
cur, XLOOKUP(
cur,
INDEX(acc, 0, 1),
INDEX(acc, 0, 2),
0,
0,
-1
) + 1
}
)
),
"SELECT Col2 OFFSET 1",
0
)
A bit better than R^2. Works fast enough on 10 000 rows. On 100 000 rows it works, but it is quite slow.
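A rough JavaScript model of what the REDUCE and XLOOKUP combination does may help explain the "a bit better than R^2" remark. The function name is hypothetical, and blanks are not treated specially here, just as in the formula.

```javascript
function runningCountByAccumulator(values) {
  // Seed row, like {"", 0} in the formula.
  const acc = [['', 0]];
  for (const cur of values) {
    // Default when the group has not been seen yet (XLOOKUP's missing value).
    let last = 0;
    // Search from the last row to the first (search mode -1).
    for (let i = acc.length - 1; i >= 0; i--) {
      if (acc[i][0] === cur) {
        last = acc[i][1];
        break;
      }
    }
    acc.push([cur, last + 1]);
  }
  // Drop the seed row and keep the counts ("SELECT Col2 OFFSET 1").
  return acc.slice(1).map((row) => row[1]);
}

console.log(runningCountByAccumulator(['a', 'b', 'b', 'b', 'a'])); // [ 1, 1, 2, 3, 2 ]
```

The backward search stops early when a group has recurred recently, but in the worst case it still walks the whole accumulator, and the accumulator itself is rebuilt on every step in the formula.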

Another approach. Works roughly 4 times faster than the first one.
=LAMBDA(
shift,
ref,
big_ref,
LAMBDA(
base_ref,
big_ref,
ARRAYFORMULA(
IF(
A2:A = "",,
MATCH(VLOOKUP(A2:A, base_ref, 2,) + ROW(A2:A), big_ref,) - VLOOKUP(A2:A, base_ref, 3,)
)
)
)
(
ARRAYFORMULA(
{
ref,
SEQUENCE(ROWS(ref)) * shift,
MATCH(SEQUENCE(ROWS(ref)) * shift, big_ref,)
}
),
big_ref
)
)
(
10 ^ INT(LOG10(ROWS(A:A)) + 1),
UNIQUE(A2:A),
SORT(
{
MATCH(A2:A, UNIQUE(A2:A),) * 10 ^ INT(LOG10(ROWS(A:A)) + 1) + ROW(A2:A);
SEQUENCE(ROWS(UNIQUE(A2:A))) * 10 ^ INT(LOG10(ROWS(A:A)) + 1)
}
)
)
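One way to read the formula above: every row gets a numeric key made of its group number times a large shift plus its row number, every group also gets a sentinel key with row 0, and the running count is the distance between the row's key and its group's sentinel in the sorted key list. The sketch below follows that reading in plain JavaScript. The names are hypothetical, a Map stands in for MATCH against UNIQUE, and the shift is computed from the digit count instead of LOG10.

```javascript
function runningCountByShiftedKeys(values) {
  const groups = [...new Set(values.filter((v) => v !== ''))];
  const groupIndex = new Map(groups.map((g, i) => [g, i + 1]));
  // A power of ten larger than any row number, so keys of different groups never overlap.
  const shift = 10 ** String(values.length).length;
  const keyOf = (v, row) => groupIndex.get(v) * shift + row;

  // All row keys plus one sentinel per group (row 0), sorted ascending.
  const sorted = [];
  values.forEach((v, i) => {
    if (v !== '') sorted.push(keyOf(v, i + 1));
  });
  groups.forEach((g) => sorted.push(keyOf(g, 0)));
  sorted.sort((a, b) => a - b);

  // Binary search for the position of a key that is known to be present.
  const position = (key) => {
    let lo = 0;
    let hi = sorted.length - 1;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (sorted[mid] < key) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  };

  return values.map((v, i) =>
    v === '' ? '' : position(keyOf(v, i + 1)) - position(keyOf(v, 0))
  );
}

console.log(runningCountByShiftedKeys(['a', 'b', 'b', 'b', 'a'])); // [ 1, 1, 2, 3, 2 ]
```

For the sample, the sorted keys are 10, 11, 15, 20, 22, 23, 24, so the second "a" (key 15) sits two places after its sentinel (key 10).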

Sorting algorithm
The idea is to use SORT to reduce the complexity of the calculation. Sorting is built-in functionality, and it works faster than COUNTIFS.
1. Sort the column together with its row indexes
2. Find the place where each new group starts
3. Create a counter of elements for the sorted range
4. Sort the result back using the indexes from step 1
Data is in range A2:A
1. Sort + Indexes
=SORT({A2:A,SEQUENCE(ROWS(A2:A))})
2. Group Starts
C2:C is a range with sorted groups
=MAP(SEQUENCE(ROWS(A2:A)),LAMBDA(v,if(v=1,0,if(INDEX(C2:C,v)<>INDEX(C2:C,v-1),1,0))))
3. Counters
Count the items of each group using the column of 0/1 values, where 1 marks the start of a group:
=SCAN(0,F2:F,LAMBDA(ini,v,IF(v=1,1,ini+1)))
4. Sort the resulting counts back
=SORT(H2:H,D2:D,1)
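The four steps can be sketched in plain JavaScript as follows. This is only an illustration under the assumption that all values are strings; the function name is hypothetical, and the final "sort back" is done by writing each count directly to its original position.

```javascript
function runningCountBySorting(values) {
  // Step 1: pair every value with its original position and sort by value,
  // breaking ties by position so rows of one group keep their order.
  const sorted = values
    .map((v, i) => [v, i])
    .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : a[1] - b[1]));

  const result = new Array(values.length);
  let counter = 0;
  sorted.forEach(([v, originalIndex], k) => {
    // Steps 2 and 3: restart the counter wherever a new group starts.
    const sameGroup = k > 0 && sorted[k - 1][0] === v;
    counter = sameGroup ? counter + 1 : 1;
    // Step 4: put the count back at the row's original position.
    result[originalIndex] = v === '' ? '' : counter;
  });
  return result;
}

console.log(runningCountBySorting(['a', 'b', 'b', 'b', 'a'])); // [ 1, 1, 2, 3, 2 ]
```

The cost is dominated by the sort, roughly N log N, instead of the N^2 comparisons of the COUNTIFS version.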
The Final Solution
Suggested by Tom Sharpe:
cut out one stage of the calculation by omitting the map and going
straight to a scan like this:
=LAMBDA(a,INDEX(if(a="",, LAMBDA(srt, SORT( SCAN(1,SEQUENCE(ROWS(a)), LAMBDA(ini,v,if(v=1,1,if(INDEX(srt,v,1)<>INDEX(srt,v-1,1),1,ini+1)))), index(srt,,2),1) ) (SORT({a,SEQUENCE(ROWS(a))})))))(A2:A)
↑ In my tests this solution is faster.
I packed it into a named function. Sample file with the solution:
https://docs.google.com/spreadsheets/d/1OSnLuCh-duW4eWH3Y6eqrJM8nU1akmjXJsluFFEkw6M/edit#gid=0
[Image: explanation of the logic and the speed of sorting, with a link to more about the speed test]

Here's an implementation of kishkin's second approach that offloads much of the lookup table setup to lambdas early on. The changes in logic are not that big, but they seem to benefit the formula quite a bit:
5 uniques | 5000 rows | 4000 rows | 3000 rows | 2000 rows | 1000 rows
lambda offload | 14.87x | 14.45x | 10.04x | 10.50x | 7.05x
sort redux | 7.73x | 5.89x | 4.89x | 3.96x | 2.24x
max makhrov sort | 4.23x | 4.52x | 3.65x | 3.31x | 1.95x
array countifs | 2.59x | 2.66x | 2.55x | 2.56x | 2.90x
kishkin2 | 0.83x | 0.80x | 0.81x | 1.03x | 1.19x
naïve countif | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x
I primarily tested using this benchmark and would welcome testing by others.
=arrayformula(
lambda(
groups,
lambda(
uniques, shiftingFactor,
lambda(
shiftedOrdinals,
lambda(
ordinalLookup,
lambda(
groupLookup,
iferror(
match(
vlookup(groups, groupLookup, 2, true) + row(groups),
ordinalLookup,
1
)
-
vlookup(groups, groupLookup, 3, true)
)
)(
sort(
{
uniques,
shiftedOrdinals,
match(shiftedOrdinals, ordinalLookup, 1)
}
)
)
)(
sort(
{
match(groups, uniques, 1) * shiftingFactor + row(groups);
shiftedOrdinals
}
)
)
)(sequence(rows(uniques)) * shiftingFactor)
)(
unique(groups),
10 ^ int(log10(rows(groups)) + 1)
)
)(A2:A)
)
The formula performs best when the number of groups is small. Here are some benchmark results with a simple numeric 50k row corpus where the number of uniques differs:
50k rows | 11 uniques | 1000 uniques
lambda offload | 14.41x | 3.57x
array countifs | 1.00x | 1.00x
Performance degrades as the number of groups increases, and I even got a few incorrect results when the number of groups approached 20k.

Mmm, it will probably be more efficient, but you'll have to try:
=Byrow(B4:B,lambda(each,if(each="","",countif(B4:each,each))))
or
=map(B4:B,lambda(each,if(each="","",countif(B4:each,each))))
Let me know!
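As a rough way to reason about the expanding range B4:each, the sketch below counts how many cell comparisons such a formula needs if each COUNTIF scans its whole range. That scanning model is an assumption, since the internals of Sheets are not documented here, and the function name is hypothetical.

```javascript
function runningCountExpandingRange(values) {
  let comparisons = 0;
  const counts = values.map((each, i) => {
    if (each === '') return '';
    let count = 0;
    // COUNTIF over the range from the first row down to the current row.
    for (let j = 0; j <= i; j++) {
      comparisons++;
      if (values[j] === each) count++;
    }
    return count;
  });
  return { counts, comparisons };
}

const outcome = runningCountExpandingRange(['a', 'b', 'b', 'b', 'a']);
console.log(outcome.counts); // [ 1, 1, 2, 3, 2 ]
console.log(outcome.comparisons); // 15
```

For N non-blank rows this is N(N+1)/2 comparisons, about half of the N^2 in the original COUNTIFS formula, so under this model it is still quadratic.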

Related

Modify an existing named function that returns a Cartesian product / cross join by adding an argument that specifies the # of columns/values per row

I found the following Google Sheets named function CARTESIAN_PRODUCT as an answer to a related question Here:
=IF(COLUMNS(range) = 1, IFNA(FILTER(range, range <> "")), LAMBDA(sub_product, last_col, REDUCE(, SEQUENCE(ROWS(sub_product)), LAMBDA(acc, cur, LAMBDA(new_range, IF(cur = 1, new_range, {acc; new_range}))({ARRAYFORMULA(IF(SEQUENCE(ROWS(last_col)), INDEX(sub_product, cur,))), last_col}))))(CARTESIAN_PRODUCT(ARRAY_CONSTRAIN(range, ROWS(range), COLUMNS(range) - 1)), LAMBDA(r, IFNA(FILTER(r, r <> "")))(INDEX(range,, COLUMNS(range)))))
This function has 1 argument, range, which specifies the columns with the values, and returns a Cartesian product / cross join with the same number of columns as are included in the range:
Example
I would like to modify this named function by adding an argument that specifies the # of columns/values per row. For example, I'd like to be able to take the same range as in the image above and return 2 columns instead of 3:
Desired Result
I found a similar pair of named functions that work together to return all unique combinations from a single column (which I know is not a Cartesian product / cross join) and that include an additional argument, r, that specifies the # of columns/values per row Here:
COMBINATIONS_INDICES:
=LAMBDA(f_range; LAMBDA(f_range_rows; IF(OR(r <= 0; r > f_range_rows);; IF(r = f_range_rows; SEQUENCE(1; r); LAMBDA(n; max_inds; REDUCE(SEQUENCE(1; r); SEQUENCE(PRODUCT(SEQUENCE(n)) / PRODUCT(SEQUENCE(n - r)) / PRODUCT(SEQUENCE(r)) - 1); LAMBDA(acc; cur; {acc; LAMBDA(ind; IF(ind = 1; SEQUENCE(1; r; INDEX(acc; ROWS(acc); 1) + 1); {ARRAY_CONSTRAIN(INDEX(acc; ROWS(acc);); 1; ind - 1)\ SEQUENCE(1; r - ind + 1; INDEX(acc; ROWS(acc); ind) + 1)}))(MATCH(2; ARRAYFORMULA(1 / (max_inds - INDEX(acc; ROWS(acc);) > 0))))})))(f_range_rows; SEQUENCE(1; r; f_range_rows - r + 1)))))(ROWS(f_range)))(FLATTEN(range))
and COMBINATIONS:
=LAMBDA(comb_inds; IF(comb_inds = "";; LAMBDA(f_range; MAP(comb_inds; LAMBDA(i; INDEX(f_range; i))))(FLATTEN(range))))(COMBINATIONS_INDICES(range; r))
Example
So far I've been unsuccessful in my attempts to add an argument like what can be found in the COMBINATIONS_INDICES and COMBINATIONS functions that specifies the # of columns/values per row to the CARTESIAN_PRODUCT function.
Can this be done?
Edit:
Here is a screenshot of what the result would look like if we had 4 columns and wanted to restrict it to 2 and 3 columns.

Try out this named function:
=IFERROR(FILTER(SPLIT(REDUCE(,SEQUENCE(1,COLUMNS(range)),LAMBDA(a,c,FLATTEN(a&"ζ"&TRANSPOSE(FILTER(INDEX(range,,c),INDEX(range,,c)<>""))))),"ζ"),TRANSPOSE(QUERY({SEQUENCE(cols);SEQUENCE(COLUMNS(range)-cols,1,0,0)},"where Col1 is not null"))),NA())
The arguments are range and cols.

Calculate sum of row but its initial row number and row count

Let's say I have a column of numbers:
1
2
3
4
5
6
7
8
Is there a formula that can calculate the sum of k numbers starting from the n-th row? For example, starting from the 4th row and adding 3 numbers down the column, PartialSum(4, 3) would be 4 + 5 + 6 = 15.
BTW, I can't use Apps Script because it currently fails with the error code RESOURCE_EXHAUSTED, and in general I have had issues with the stability of Apps Script before too.

As Tanaike mentioned, the error code when using Google Apps Script was just a temporary bug that seems to be solved at this moment.
Now, I can think of 2 possible solutions for this using custom functions:
Solution 1
If your data follows a specific numeric order one by one just like the example provided in the post, you may want to consider using the following code:
function PartialSum(n, k) {
  // Assumes the value in row n equals n, so the sum is n + (n + 1) + ... + (n + k - 1)
  let sum = n;
  for (let i = 1; i < k; i++) {
    sum = sum + n + i;
  }
  return sum;
}
Solution 2
If your data does not follow any particular order and you just want to sum a specific number of rows that follow the row you select, then you can use:
function PartialSum(n, k) {
  let ss = SpreadsheetApp.getActiveSheet();
  // Read k cells starting at row n of column 1 (change the column as needed)
  let values = ss.getRange(n, 1, k, 1).getValues();
  // Start from 0: the row number n is a position, not a value to add
  let sum = 0;
  for (let i = 0; i < values.length; i++) {
    sum = sum + Number(values[i][0]);
  }
  return sum;
}
Result:
References:
Custom Functions in Google Sheets

Formula:
= SUM( OFFSET( initialCellName, 0, 0, numberOfElementsInColumn, 1) )
Example: add 7 elements starting from cell A5:
= SUM( OFFSET( A5, 0, 0, 7, 1) )
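The same "start at row n, take k cells" window can be sketched in plain JavaScript on an in-memory column. The function name and the array input are assumptions for illustration; nothing here touches a spreadsheet.

```javascript
// values is the column as a flat array, n is the 1-based start row, k is the count.
function partialSum(values, n, k) {
  return values
    .slice(n - 1, n - 1 + k)
    .reduce((sum, v) => sum + v, 0);
}

console.log(partialSum([1, 2, 3, 4, 5, 6, 7, 8], 4, 3)); // 15
```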

In Google Sheets get first n results from an array without using query

I have a table:
The formula in W.Avg. Value per Point for size "Small" is this:
=AVERAGE.WEIGHTED(
FILTER(
INDEX(INDIRECT("recent_sdcalcs_size"),0,2),
INDEX(INDIRECT("recent_listings"),0,3) = H15,
INDEX(INDIRECT("recent_sdcalcs_size"),0,2)<>""
),
FILTER(
INDEX(INDIRECT("recent_listings"),0,5),
INDEX(INDIRECT("recent_listings"),0,3) = H15,
INDEX(INDIRECT("recent_sdcalcs_size"),0,2)<>""
)
)
Where H15 = "Small".
This formula works... It will look up all defined values and weights for "Small" listings on another sheet and give me a weighted average of those values. The problem is that the sample arrays of values and weights returned by the two FILTER() functions are not representative of the population of glows.
I know the population distribution of glows (% Chance under Glows). From this I have created the following table:
This gives me a representative count for each glow relative to each size (and vice versa). And, again, everything to this point is working just fine.
But the final step is where I'm struggling. When I lookup size "Small" values in the FILTER() functions above I need to limit the results returned to AVERAGE.WEIGHTED() to (in this example) 62 none, 31 Dusky, 15 Lucent, 7 Bright, 3 Brilliant, 1 Radiant, and none of Dazzling, StarLike, Crown Jewel, or Titanic. As per the "Small" row in the second image.
The spreadsheet this is all from is here.
I actually achieved this already using QUERY()'s "limit" feature. Unfortunately using QUERY() anywhere in the spreadsheet makes the (unrelated) Google Apps scripts go from taking ~10 seconds to execute to taking ~200 seconds to execute. This makes QUERY() unusable in my case. Here's what I had for that (doesn't actually work, see comments):
=AVERAGE.WEIGHTED(
FILTER(
ARRAYFORMULA(QUERY(
INDIRECT("recent_data"),
"select ( E * F ) where (
D = '"&$H$27:$H$36&"' and
C = '"&$H15&"' and
V > 0
) limit "&$C46:$L46
)),
ISNUMBER(ARRAYFORMULA(QUERY(
INDIRECT("recent_data"),
"select ( E * F ) where (
D = '"&$H$27:$H$36&"' and
C = '"&$H15&"' and
V > 0
) limit "&$C46:$L46
)))
),
FILTER(
ARRAYFORMULA(QUERY(
INDIRECT("recent_data"),
"select F where (
D = '"&$H$27:$H$36&"' and
C = '"&$H15&"' and
V > 0
) limit "&$C46:$L46
)),
ISNUMBER(ARRAYFORMULA(QUERY(
INDIRECT("recent_data"),
"select F where (
D = '"&$H$27:$H$36&"' and
C = '"&$H15&"' and
V > 0
) limit "&$C46:$L46
)))
)
)
I also tried to use SORTN(). But that's not going to work for a number of reasons.
=ARRAYFORMULA(SORTN(FILTER(
INDEX(INDIRECT("recent_sdcalcs_size"),0,2),
INDEX(INDIRECT("recent_listings"),0,3) = H15,
INDEX(INDIRECT("recent_sdcalcs_size"),0,2)<>""
),TRANSPOSE($C46:$L46)))
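To pin down the requirement, here is a small JavaScript sketch of "take at most a given number of rows per glow, then compute the weighted average". The row shape (glow, value, weight), the function names, and the sample numbers are all hypothetical and do not come from the spreadsheet; this illustrates the logic only and is not a formula that can be pasted into Sheets.

```javascript
// Keep rows in their original order, but at most limits[glow] rows per glow.
function takeFirstPerGroup(rows, limits) {
  const taken = new Map();
  return rows.filter((row) => {
    const used = taken.get(row.glow) || 0;
    if (used >= (limits[row.glow] || 0)) return false;
    taken.set(row.glow, used + 1);
    return true;
  });
}

// Weighted average of value, weighted by weight (like AVERAGE.WEIGHTED).
function weightedAverage(rows) {
  const totalWeight = rows.reduce((sum, row) => sum + row.weight, 0);
  return rows.reduce((sum, row) => sum + row.value * row.weight, 0) / totalWeight;
}

const listings = [
  { glow: 'none', value: 10, weight: 1 },
  { glow: 'none', value: 20, weight: 1 },
  { glow: 'none', value: 90, weight: 1 },
  { glow: 'Dusky', value: 30, weight: 2 },
];
const limited = takeFirstPerGroup(listings, { none: 2, Dusky: 1 });
console.log(weightedAverage(limited)); // 22.5
```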

Using sumif by row in an arrayformula

I've got a sumif at the start of every row of my data adding up numbers if they are >0 and another doing the same for numbers <0 like this:
=SUMIF(P6:X6;">0")
This works, but it's quite a pain to drag the cell down every time I add more data. Is there a way for me to turn this into an ARRAYFORMULA that just keeps going down?

The formula for sums ">0" is:
=arrayformula(mmult(A2:C*--(A2:C>0), transpose(A2:C2 * 0 + 1)))
and for sums "<0":
=arrayformula(mmult(A2:C*--(A2:C<0), transpose(A2:C2 * 0 + 1)))
transpose(A2:C2 * 0 + 1) is a column of ones: [1; 1; 1; ...]. MMULT multiplies each row of the first matrix by this column, which collapses every row into a single sum.
--(A2:C>0): the double minus converts booleans into 1 (if true) and 0 (if false).
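The MMULT trick can be mirrored in plain JavaScript to show why multiplying by a column of ones gives one sum per row. The function name and the predicate argument are assumptions made for this sketch.

```javascript
function rowSumsWhere(matrix, predicate) {
  // Zero out the cells that fail the condition, like A2:C * --(A2:C>0).
  const masked = matrix.map((row) => row.map((v) => v * (predicate(v) ? 1 : 0)));
  // Column vector of ones with one entry per data column.
  const ones = matrix[0].map(() => [1]);
  // Matrix product (rows x cols) * (cols x 1) gives one sum per row.
  return masked.map((row) => [
    row.reduce((sum, v, c) => sum + v * ones[c][0], 0),
  ]);
}

const data = [
  [1, -2, 3],
  [-4, 5, -6],
];
console.log(rowSumsWhere(data, (v) => v > 0)); // [ [ 4 ], [ 5 ] ]
console.log(rowSumsWhere(data, (v) => v < 0)); // [ [ -2 ], [ -10 ] ]
```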

How to calculate the summation of pairwise minima of two ranges with standard formulas?

Problem statement
I'd like to calculate the following formula in a Google Spreadsheet, where x and y are both ranges of n rows and 1 column and t is a variable, using only standard formulas:
$$\sum_{i=1}^{n} \min(t \cdot x_i,\; y_i)$$
Current situation
Right now I'm feeding x (say, A1:A10), y (say, B1:B10) and t (say, D1) to a custom function (myFunction(t, x, y), see below), but executing scripts is rather performance intensive, so I'd like to know if there is a way to make this calculation without using a custom function.
function myFunction(t, x, y)
{
  var sum = 0;
  for (var i = 0; i < x.length; i++)
  {
    // x and y arrive as ranges, so each x[i] and y[i] is one row of the range
    var xi = parseInt(x[i]);
    var yi = parseInt(y[i]);
    sum += Math.min(t * xi, yi);
  }
  return sum;
}
In this example, E1 would become: =myFunction(D1, A1:A10, B1:B10)
Desired situation
I am looking for something like =SUM(MIN(D1 * A1:A10, B1:B10)), but a confirmation or an educated guess that this is not possible is of course also welcome.

I have not been able to test, but I think that the following formula can do what you need.
=SUM(ARRAYFORMULA(IF(ARRAYFORMULA(D1 * A1:A10) < B1:B10; ARRAYFORMULA(D1 * A1:A10); B1:B10)))
UPDATE
Indeed, Jelle Fresen suggested a better approach in the comments, which eliminates the redundant ARRAYFORMULA calls.
=SUM(ARRAYFORMULA(IF(D1 * A1:A10 < B1:B10; D1 * A1:A10; B1:B10)))
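A quick way to check the formula by hand is to mirror its element-wise IF in plain JavaScript. The function name and the sample numbers below are made up for illustration.

```javascript
// For each pair, take t * x[i] when it is smaller than y[i], otherwise y[i], and add them up.
function sumOfPairwiseMinima(t, x, y) {
  return x.reduce(
    (sum, xi, i) => sum + (t * xi < y[i] ? t * xi : y[i]),
    0
  );
}

// 2*1 = 2 vs 3 gives 2; 2*2 = 4 vs 3 gives 3; 2*3 = 6 vs 3 gives 3.
console.log(sumOfPairwiseMinima(2, [1, 2, 3], [3, 3, 3])); // 8
```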
