I think I've gone crazy: Google Sheet Lookup

I think I've gone crazy: Google Sheet Lookup - google-sheets

I'm trying to do a really simple lookup on a data validation field:
I have 4 simple values:
risk free
low
intermediate
high
and besides those I have the values for those:
0
0.1
0.2
0.5
I use the lookup formula as follows:
=LOOKUP(J11,H7:H10,I7:I10)
When I then change the values to let's say low or risk free, it always shows me the value 0.5.
But when I change the words to this:
abc
def
ghi
jkl
it delivers the right values.
I tried several different sheets, browsers as well as google accounts which uses different languages (english & german).
Can somebody please explain this to me :D

drop the lookup and use:
=VLOOKUP(J11, H7:I10, 2, 0)

Related

What factors determine the memory used in lambda functions?

=SUM(SEQUENCE(10000000))
The formula above is able to sum upto 10 million virtual array elements. We know that 10 million is the limit according to this question and answer. Now, if the same is implemented as Lambda using Lambda helper function REDUCE:
=REDUCE(,SEQUENCE(10000000),LAMBDA(a,c,a+c))
We get,
Calculation limit was reached when trying to calculate this formula
Official documentation says
This can happen in 2 cases:
The computation for the formula takes too long.
It uses too much memory.
To resolve it, use a simpler formula to reduce complexity.
So, it says the reason is space and time complexity. But what is the exact space used to throw this error? How is this determined?
In the REDUCE function above, the limit was at around 66k for a virtual array:
=REDUCE(,SEQUENCE(66660),LAMBDA(a,c,a+c))
However, if we remove the addition criteria and make it return only the current value c, the allowed virtual array size seems to increase to 190k:
=REDUCE(,SEQUENCE(190000),LAMBDA(a,c,c))
After which it throws a error. So, what factors determine the memory limit here? I think it's memory limit, because it throws the error almost within a few seconds.

If you're affected by this issue, you can send feedback to Google:
Open a spreadsheet, preferably one where you bumped into the issue.
Replace any sensitive information with anonymized but realistic-looking data. Remove any sensitive information that is not needed to reproduce the issue.
Choose Help > Report a Problem or Help > Help Sheets Improve. If you are on a paid Google Workspace Domain, see Contact Google Workspace support.
Explain why the calculation limit is an issue for you.
Request:
Justice: Removing arbitrary limits on lambda functions
Equality: Avoiding discrimination against lambda functions
Transparency: Documenting the said discrimination in more clarity and detail
Include a link to this Stack Overflow answer post.
Update Oct '22 (Credit to MaxMarkhov)
The limit is now 10x higher at 1.9 million 1999992. This is still less than 1/5th of 10 million virtual array limit of non-lambda formulas, but much better than before. Also non-lambda formulas's limit doesn't reduce with number of operations. But lambda helper formulas limit still does decrease with number of operations. So, even though it's 10x higher, that just means ~5 extra operations inside lambda(see table below).
A partial answer
We know for a fact, the following factors decide the calculation limit drum roll:
Number of operations
(Nested)LAMBDA() function calls
The base number for 1 operation seems to be 199992 1 2(=REDUCE(,SEQUENCE(199992),LAMBDA(a,c,c))). But for a zero-op or a no-op(=REDUCE(,SEQUENCE(10000000),LAMBDA(a,c,0))), the memory limit is much higher, but you'll still run into time limit. We also know number of operations is a factor, because
=REDUCE(,SEQUENCE(66664/1),LAMBDA(a,c,a+c)) fails
=REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c)) works.
=REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c+0)) fails
Note that the size of operands doesn't matter. If =REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+0)) works, =REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+100000)) will also work.
For each increase in number of operations inside the lambda function, the maximum allowed array size falls by 2n-1(Credit to #OlegValter for actually figuring out there's a factor multiple here):
Maximum sequence
Number of operations (inside lambda)
Reduction(from 199992)
Formula
199992
1
1
REDUCE(,SEQUENCE(199992),LAMBDA(a,c,c))
66664
2
1/3
REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c))
39998
3
1/5
REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+10000))
28570
4
1/7
REDUCE(,SEQUENCE(28570),LAMBDA(a,c,a+c+10000+0))
Operations outside the LAMBDA functions also count. For eg, =REDUCE(,SEQUENCE(199992/1),LAMBDA(a,c,c)) will fail due to extra /1 operation, but you only need to reduce the array size linearly by 1 or 2 per operation, i.e., this =REDUCE(,SEQUENCE(199990/1),LAMBDA(a,c,c)) will work3.
In addition LAMBDA function calls itself cost more. So, refactoring your code doesn't eliminate the memory limit, but reduces it furthermore. For eg, if your code uses LAMBDA(a,c,(a-1)+(a-1)), if you add another lambda like this: LAMBDA(a,c,LAMBDA(aminus,aminus+aminus)(a-1)), it errors out with much less array elements than before(~20% less). LAMBDA is much more expensive than repeating calls.
There are many other factors at play, especially with other LAMBDA functions. Google might change their mind about these arbitrary limits later. But this gives a start.
Possible workarounds:
LAMBDA itself isn't restricted. You can nest as much as you want to. Only LAMBDA Helper Functions are restricted. (Credit to player0)
Named functions which don't use LAMBDA(helper functions) themselves, aren't subjected to the same restrictions. But they're subject to maximum recursion restrictions.
Another workaround is to avoid using lambda as a arrayformula and use autofill or drag fill feature, by making the lambda function return only one value per function. Note that this might actually make your sheet slow. But apparently, Google is ok with that - multiple individual calls instead of a single array call. For example, I've written a permutations function here to get a list of all permutations. While it complains about "memory limit" for a array with more than 6 items, it can work easily by autofill/dragfill/copy+paste with relative ranges.

not even an answer
by brute-forcing a few ideas it looks like there are more hidden variables than previously thought. it is probably safe to say that the upper limit is a result of "running out of memory" especially when calculation time does not play any role. the thing is that there are factors even outside of LAMBDA that affect the computational capabilities of the formula. here is a brief summary of the issue in layman's terms:
WHY WERE/ARE LAMBDA'S MINIONS STUPID?!
UPDATE: limit boundaries were moved 10-fold higher, so none of the below testing formulae limits represent the actual up-to-date state, however, lambda minions are still not limitless!
let's imagine a memory buffer from the 1999 era with a limited size of 30 units that kicks in only when you use LAMBDA with friends (MAP, SCAN,BYCOL, BYROW, REDUCE, MAKEARRAY). keep in mind that in google sheets when we use any other formula, the limiting factor is usually the cell count limit.
example 1
output capability: 199995 cells!
reduction from 199995: 1/1 (meh, but ok)
example 2
output capability: 49998 cells!
reduction from 199995: 1/~4 (*double-checking the calendar if the year is really 2022*)
example 3
output capability: 995 cells!
reduction from 199995: 1/201 !! (*remembering this company built a quantum computer*)
further testing
establishing the baseline: all below formulae are maxed out so they work as "one step before erroring out". please keep noticing the numbers as a direct representation of row (not cell) processing abilities
starting with a simple:
=ROWS(BYROW(SEQUENCE(99994), LAMBDA(x, AVERAGE(x))))
by adding one more x the following would error out so even the length of strings matters:
=ROWS(BYROW(SEQUENCE(99994), LAMBDA(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, AVERAGE(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx))))
doubling the array brings no issues:
=ROWS(BYROW({SEQUENCE(99994), SEQUENCE(99994)}, LAMBDA(x, AVERAGE(x))))
but additional "stuff" will reduce the output by 1:
=ROWS(BYROW({SEQUENCE(99993), SEQUENCE(99993, 1, 5)}, LAMBDA(x, AVERAGE(x))))
interestingly this one runs with no problem so now even the complexity of input matters (?):
=ROWS(BYROW(SEQUENCE(99994, 6, 0, 5), LAMBDA(x, AVERAGE(x))))
and with this one, it seems that even choice of formula selection matters:
=ROWS(BYROW(RANDARRAY(99996, 2), LAMBDA(x, AVERAGE(x))))
but what if we move from virtual input to real input... A1 cell being set to =RANDARRAY(105000, 3) we can have:
=ROWS(BYROW(A1:B99997, LAMBDA(x, AVERAGE(x))))
again, it's not a matter of cells because even with 8 columns we can get the same:
=ROWS(BYROW(A1:H99997, LAMBDA(x, AVERAGE(x))))
not bad, however, indirecting the range will put us back to 99995:
=ROWS(BYROW(INDIRECT("A1:B"&99995), LAMBDA(x, AVERAGE(x))))
another fact is that LAMBDA as a standalone function runs flawlessly even with an array 105000×8 (that's solid 840K cells)
=LAMBDA(x, AVERAGE(x))(A1:H105000)
so is this really the memory issue of LAMBDA (?) or the factors that determine the memory used in LAMBDA are limits of unknown origin bestowed upon LAMBDA by individual incapabilities of:
MAP
SCAN
BYCOL
BYROW
REDUCE
MAKEARRAY
and their unoptimized memory demands shaken by wast variety of yet unknown variables within our spacetime

Edit 2022/10/26
Seems, Google Sheets Team has just increased the max. limit 10x times 😍.
1999992 from 199992
My original formula supposed it would be 199992 cells, but as you see the "behind" logic changes and may also change in the future.
LAMBDA+Friends Limit
The maximum number of rows you can use in the formula (guess):
Limit = 1999992/(1 + inside_lambdas) - outside_lambdas
inside_lambdas and outside_lambdas are functions and parameters, each count 1:
+ / * -
5, A1, "text",
MOD, AVERAGE, etc.
{"array element"}
The limit is about cells operated by the "lambda+" formula: reduce, byrow, etc.
My tests are here:
Lambda Limits \ Sample Sheet
Steps to fix:
Do Not use Lambda if possible :(
Do most of the calculations outside lambda if possible
Split formulas to multiple cells, having the limit in mind. Copy formulas, each one has its own limit.
Ask Google to Fix this. In Sheets use the menu Help > Help Sheets Improve
Write to the support if you have a paid account.
Final notes:
my formula for the limit is guess, and it works for my examples and tests. Please try it and comment to this answer if you find an error.
the formula does not answer how long variable names affect the limit (=ROWS(BYROW(SEQUENCE(99994), LAMBDA(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, AVERAGE(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx))))) Need more tests to figure out the correct effect on the limit. As this does not break: =ROWS(BYROW(SEQUENCE(199992), LAMBDA(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, AVERAGE(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)))), my suggestion is this is the max. length of the variable name, and it does not change the cells limit.
Google Sheets team may change the logic "behind" the formula, so all tests may appear invalid in a time.

max-series-per-database limit exceeded clarification needed / how to calculate number of series in use

We recently started to encounter this error:
{"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=1"}
When writing metric data like this:
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1103,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
I know that Influx recommends you keep your series cardinality low, and our impression was that series cardinality would mean keeping each tag individually to a small number of values. e.g. we felt comfortable sending instance_id=1103 as a tag, because we know that there will never be more than 2000 distinct instance_id tag values.
But after running into this error... I'm afraid maybe I was mistaken here. Do we actually need to keep the cardinality of all possible combinations of all tags low? e.g. do these two things count as two separate series towards the 1,000,000 default max, because the instance_id is different?
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1111,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=2222,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
If those count as two separate series... then is there a better way to structure this data in Influx? 1,000,000 total seems like a tiny amount if each separate combination of tags is a separate series...
Does InfluxDB 2.x help with this?
Is there a better tool that can handle a large number of tags and not bump into limits like this?

There is no way to figure out what data was not recorded. Update the max-series-per-database configuration to be more than 1M in order to stop dropping data.
This can be an indication that you are creating a lot of series. i saw some documentation on why that isn't great.
Hope this helps!

Google Sheets IF AND OR Logic

I am making a scoring system on Google sheets and I am struggling with the logic I need for the final step.
This question might be related, but I can't seem to apply the logic.
There are a number of chemicals tested and for each an amount detected (AD) si given, and each has a benchmark amount allowed (AL). From AL and AD we calculate AD/AL= %AL.
The Total Score (TS) is calculated based on an additive and weighted formula that takes into consideration the individual %ALs, but I won't go into that formula.
The final step is for me to "calculate" the Display Score (DS), which has some rules to it, and this is where I need the logic. The rules are as follows:
If any of the %Als are over 100 (this is will make TS>100 too) and DS should show "100+"
If none of the %ALs are over 99, (TS may be above or below 100) then DS can NOT be over 99, so it should show TS, maxing out at 99.
I want to do this within the sheet itself. I think the correct tool is logic operators IF, AND, OR.
I have made many attempts, these are some: (I am replacing cell references with the acronyms I used above)
=IF(TS>100,"100+",TS)
=IF(OR(AND(MAX(RANGE_OF_%ALS)<100,TS>99),(AND(MAX(RANGE_OF_%ALS)>100,TS>100)),99,"100+"))
I have also tried to think about how I would solve this in Python (just to explore it, I don't want to use Python for the solution). This was my attempt:
if Max%AL<100:
if TS<100:
print(TS)
else:
print("99")
else:
if TS>100:
print("100+")
Those are my attempts at thinking through the problem. I would appreciate some help.
This is a link to a copy of my sheet: https://docs.google.com/spreadsheets/d/1ZBnaFUepVdduEE2GBdxf5iEsfDsFNPIYhrhblHDHEYs/edit?usp=sharing

Please try:
=if(max(RANGE_OF_%ALS)>1,"100+",if(max(RANGE_OF_%ALS)<=0.99,MIN(TS,0.99),"?"))

What should i do to maintain performance of a mobile app which is using database?

I'm building an app using database.
I have a words table and everytime user types something, this app will record and update word the database.
And the frequency field will be auto increase after user enter one matched word.
But the trouble is user type day by day and i afraid the search performance will be reduce after times and also the Int field will reach to the limit (max limit Int) someday.
So, i limit the database to around less than 50.000 records.
I delete less-used records after a certain time.
But i don't know how to deal with frequency Int field of each word?
How to know exactly frequency usage of each word without increasing the field forever?

I recommend that you use a logarithmic scale for the frequency values. That's what is often done in situations like this. See Wikipedia to learn about logarithmic scales.
For example, if you have a word MAN that has a frequency of 15, the value you store in the database would be log(15) ~= 1.17609125906.
If you then find 4 new occurrences of MAN, then you want to add 4 to the field. You cannot add the log values directly because log(x)+log(y)=log(x*y). (See the Logarithm Rules section of this article for more information on log rules.)
Instead -- assuming you use a base 10 logarithm, you would use this formula:
SET frequency = log(10^frequency+4)

Depending on the length of your words, the few bytes for the frequency don't matter. With an unsigned four bytes integer, you can count up to more than two billion, which is way above the number of words what the user can type in in their whole lifespan.
So may want to go for two or three bytes, but the savings may be negligible.
Anyway, there are the following approaches for preventing overflow:
You can detect it, and then undo the operations, scale everything down by some factor of two, and then redo.
You can periodically check all your numbers and do the scaling when approaching the limit.
You can do a probabilistic update like below.
Probabilistic update
Instead of simply incrementing the frequency every time by one, you do it only with a probability which gets lower and lower as the counter grows. For example, you can do the increment with a probability of 1.0 / (oldValue + 1) or 2 ** -oldValue. The latter leads to a logarithmic growth, but, unlike the idea in the other answer, it works.
There are obviously some disadvantages due to the randomness and precision loss, but when all you care about is the relative frequency, it should be good enough.

Is there a cleverer Ruby algorithm than brute-force for finding correlation in multidimensional data?

My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
A quick illustration -
The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
MATCHES = {}
for each (PERSON in (people except USER)) do
for each (RATING that PERSON has made) do
if (USER has rated the item that RATING refers to) do
MATCHES[PERSON's id] += difference between PERSON's rating and USER's rating
end
end
end
lowest values in MATCHES are the best matches for USER
The problem here being that as the number of items, ratings, and people increase, this code will take a very significant time to run, and ignoring caching for now, this is code that has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!

Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!

Technically your task is matching long strings made out of characters of a 5 letter alphabet. This kind of stuff is researched extensively in the area of computational biology. (Typically with 4 letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.

Is your data sparse? With rating, most of the time not every user rates every object.
Naively comparing each object to every other is O(n*n*d), where d is the number of operations. However, a key trick of all the Hadoop solutions is to transpose the matrix, and work only on the non-zero values in the columns. Assuming that your sparsity is s=0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 out of 100, your computation will be theoretically 10000 times faster.
Note that the resulting data will still be a O(n*n) distance matrix, so strictl speaking the problem is still quadratic.
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched AFAICT.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart