SPSS percentile issue

I am working with SPSS 18.
I am using FREQUENCIES to calculate the 95th percentile of a variable.
FREQUENCIES SdrelPromSldDeu_Acr_5_0
/FORMAT=NOTABLE
/PERCENTILES 1,5,95,99.
The result is given in a table
Statistics
SdrelPromSldDeu_Acr_5_0
N Valid 8881
Missing 0
Percentiles 1 -1,001060644014
5 -1,000541440102
95 6619,140632636228
99 9223372,036854776000
But if I double-click the 9223372,036854776 to copy it, another number appears: 1.0757943411193715E7.
If I use MEANS to get the maximum value, the result is 2.4329524990388575E8, so the number that appears on the double-click seems possible.
I have seen 9223372,03 in other cases as well, as if it were some kind of upper limit SPSS is able to display.
Can anybody tell me if the 9223372,03 represents anything useful? Should I trust the bigger number?
Thanks!

It appears to be a bug in the display of SPSS.
The number you have shown is eerily similar to
9223372036854775807
which is the highest value representable by a signed 64-bit ("long") integer.
see also:
https://en.wikipedia.org/wiki/9223372036854775807
Since your actual number (about 1.08E7) is some 11 orders of magnitude smaller, it should not come anywhere near this limit. Hence the conclusion that it must be a bug in the display software.
Do not trust it.
(the underlying number may or may not be right, but the displayed 9223372,03 is surely wrong)
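In fact, the displayed value matches the 64-bit maximum with the decimal separator shifted twelve places, which supports the display-bug theory. A quick check (plain Python arithmetic, nothing SPSS-specific):
LONG_MAX = 2**63 - 1        # largest signed 64-bit integer
print(LONG_MAX)             # 9223372036854775807
print(LONG_MAX / 10**12)    # 9223372.036854776 -- the value SPSS displayed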

Related

Google Sheet yields infinitesimal number as remainder of an integer/whole number

I have a worksheet where I need a checker that determines whether a number (the sum of two numbers divided by another value, the DIVISOR) is an integer, i.e. has no decimals. The checker mostly worked fine, but it flagged a few items as non-integers even though they are exact multiples of the DIVISOR.
https://docs.google.com/spreadsheets/d/17-idS5G0kUI7JoHAx3qcJOiJ-zofmMrg93hUvZuxPiA/edit#gid=0
I have two values (V1 and V2) whose sum I need to divide by a certain number (Divisor).
I need the OUTPUT to be an integer/whole number. Since SUM(V1,V2) is an exact multiple of the DIVISOR, the OUTPUT is supposed to be a whole number. I also expanded the number of decimal places to make sure there are no trailing digits after the decimal point.
However, upon running the MOD function over the OUTPUT, it generated some infinitesimal value.
I also tried TRUNCATING the OUTPUT and getting the DIFFERENCE between the TRUNC and OUTPUT. It yielded the same remainder value as the MOD result.
I downloaded the GSheet and opened it in MS Excel. There seems to be no problem with the DIFFERENCE result, but the MOD function yielded yet another value.
actually, this is not a bug, and it's pretty common. it's called a floating point "error" and, in a nutshell, it has to do with how decimal numbers are stored in google sheets (and excel, or any other app)
more details can be found here: https://en.wikipedia.org/wiki/IEEE_754
to counter it you will need to introduce rounding like:
=ROUND(SUM(A1:A))
this is not an ideal solution for all cases, so depending on your requirements you may need one of these instead of ROUND (a quick demonstration follows the list):
ROUNDUP
ROUNDDOWN
TRUNC
TEXT
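A minimal Python sketch of the same IEEE 754 behavior (illustrative only; the exact values are arbitrary):
# 0.1 and 0.2 have no exact binary (IEEE 754) representation,
# so their sum is not exactly 0.3 and a "divisibility" check sees leftovers
print(0.1 + 0.2)                 # 0.30000000000000004
print((0.1 + 0.2) % 0.1)         # ~2.8e-17 instead of the expected 0
# rounding before the check, as suggested above, removes the noise
print(round((0.1 + 0.2) / 0.1))  # 3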

What factors determine the memory used in lambda functions?

=SUM(SEQUENCE(10000000))
The formula above is able to sum up to 10 million virtual array elements. We know that 10 million is the limit according to this question and answer. Now, if the same is implemented as a LAMBDA using the lambda helper function REDUCE:
=REDUCE(,SEQUENCE(10000000),LAMBDA(a,c,a+c))
We get,
Calculation limit was reached when trying to calculate this formula
Official documentation says
This can happen in 2 cases:
The computation for the formula takes too long.
It uses too much memory.
To resolve it, use a simpler formula to reduce complexity.
So it says the reason is space or time complexity. But what exactly is the space limit that triggers this error? How is it determined?
In the REDUCE function above, the limit was at around 66k for a virtual array:
=REDUCE(,SEQUENCE(66660),LAMBDA(a,c,a+c))
However, if we remove the addition and make it return only the current value c, the allowed virtual array size seems to increase to 190k:
=REDUCE(,SEQUENCE(190000),LAMBDA(a,c,c))
After which it throws an error. So, what factors determine the memory limit here? I think it's a memory limit, because the error is thrown within just a few seconds.
If you're affected by this issue, you can send feedback to Google:
Open a spreadsheet, preferably one where you bumped into the issue.
Replace any sensitive information with anonymized but realistic-looking data. Remove any sensitive information that is not needed to reproduce the issue.
Choose Help > Report a Problem or Help > Help Sheets Improve. If you are on a paid Google Workspace Domain, see Contact Google Workspace support.
Explain why the calculation limit is an issue for you.
Request:
Justice: Removing arbitrary limits on lambda functions
Equality: Avoiding discrimination against lambda functions
Transparency: Documenting the said discrimination in more clarity and detail
Include a link to this Stack Overflow answer post.
Update Oct '22 (Credit to MaxMarkhov)
The limit is now 10x higher, at 1999992 (~1.9 million). This is still less than 1/5th of the 10 million virtual array limit of non-lambda formulas, but much better than before. Also, a non-lambda formula's limit doesn't shrink with the number of operations, while the limit for lambda helper formulas still does. So even though it's 10x higher, that only buys about 5 extra operations inside the lambda (see the table below).
A partial answer
We know for a fact that the following factors decide the calculation limit (drum roll):
Number of operations
(Nested) LAMBDA() function calls
The base number for 1 operation seems to be 199992 (=REDUCE(,SEQUENCE(199992),LAMBDA(a,c,c))). But for a zero-op or no-op (=REDUCE(,SEQUENCE(10000000),LAMBDA(a,c,0))), the memory limit is much higher, though you'll still run into the time limit. We also know the number of operations is a factor, because:
=REDUCE(,SEQUENCE(66664/1),LAMBDA(a,c,a+c)) fails
=REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c)) works.
=REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c+0)) fails
Note that the size of the operands doesn't matter: if =REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+0)) works, =REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+100000)) will also work.
For each additional operation inside the lambda function, the maximum allowed array size falls by a factor of 2n-1, where n is the number of operations (credit to @OlegValter for actually figuring out that there's a multiplicative factor here):
Maximum sequence | Operations (inside lambda) | Reduction (from 199992) | Formula
199992           | 1                          | 1                       | REDUCE(,SEQUENCE(199992),LAMBDA(a,c,c))
66664            | 2                          | 1/3                     | REDUCE(,SEQUENCE(66664),LAMBDA(a,c,a+c))
39998            | 3                          | 1/5                     | REDUCE(,SEQUENCE(39998),LAMBDA(a,c,a+c+10000))
28570            | 4                          | 1/7                     | REDUCE(,SEQUENCE(28570),LAMBDA(a,c,a+c+10000+0))
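The 2n-1 pattern in the table can be checked with a few lines of plain arithmetic (Python here; nothing Sheets-specific):
base = 199992                          # observed maximum for one operation
for ops in (1, 2, 3, 4):
    print(round(base / (2 * ops - 1)))  # 199992, 66664, 39998, 28570 -- matches the table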
Operations outside the LAMBDA function also count. For example, =REDUCE(,SEQUENCE(199992/1),LAMBDA(a,c,c)) will fail due to the extra /1 operation, but outside operations only reduce the limit linearly, by 1 or 2 per operation; that is, =REDUCE(,SEQUENCE(199990/1),LAMBDA(a,c,c)) will work.
In addition, LAMBDA function calls themselves cost more. So refactoring your code doesn't lift the memory limit; it reduces it further. For example, if your code uses LAMBDA(a,c,(a-1)+(a-1)) and you add another lambda like this: LAMBDA(a,c,LAMBDA(aminus,aminus+aminus)(a-1)), it errors out with far fewer array elements than before (~20% fewer). A LAMBDA call is much more expensive than repeating the computation.
There are many other factors at play, especially with the other LAMBDA helper functions. Google might change its mind about these arbitrary limits later, but this gives a start.
Possible workarounds:
LAMBDA itself isn't restricted. You can nest as much as you want to. Only LAMBDA Helper Functions are restricted. (Credit to player0)
Named functions which don't themselves use LAMBDA (helper functions) aren't subject to the same restrictions, but they are subject to maximum recursion restrictions.
Another workaround is to avoid using lambda as an arrayformula and use the autofill or drag-fill feature instead, by making the lambda function return only one value per call. Note that this might actually make your sheet slow, but apparently Google is OK with that: multiple individual calls instead of a single array call. For example, I've written a permutations function here to get a list of all permutations. While it complains about the "memory limit" for an array with more than 6 items, it works easily via autofill/drag-fill/copy+paste with relative ranges.
not even an answer
by brute-forcing a few ideas, it looks like there are more hidden variables than previously thought. it is probably safe to say that the upper limit is a result of "running out of memory", especially since calculation time does not seem to play any role. the thing is that there are factors even outside of LAMBDA that affect the computational capabilities of the formula. here is a brief summary of the issue in layman's terms:
WHY WERE/ARE LAMBDA'S MINIONS STUPID?!
UPDATE: the limit boundaries were moved 10-fold higher, so none of the testing formulae limits below represent the actual up-to-date state; however, lambda minions are still not limitless!
let's imagine a memory buffer from the 1999 era with a limited size of 30 units that kicks in only when you use LAMBDA with friends (MAP, SCAN, BYCOL, BYROW, REDUCE, MAKEARRAY). keep in mind that in google sheets, when we use any other formula, the limiting factor is usually the cell count limit.
example 1
output capability: 199995 cells!
reduction from 199995: 1/1 (meh, but ok)
example 2
output capability: 49998 cells!
reduction from 199995: 1/~4 (*double-checking the calendar if the year is really 2022*)
example 3
output capability: 995 cells!
reduction from 199995: 1/201 !! (*remembering this company built a quantum computer*)
further testing
establishing the baseline: all the formulae below are maxed out so they work "one step before erroring out". please note that the numbers directly represent row (not cell) processing abilities
starting with a simple:
=ROWS(BYROW(SEQUENCE(99994), LAMBDA(x, AVERAGE(x))))
by adding one more x, the following would error out, so even the length of strings matters:
=ROWS(BYROW(SEQUENCE(99994), LAMBDA(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, AVERAGE(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx))))
doubling the array brings no issues:
=ROWS(BYROW({SEQUENCE(99994), SEQUENCE(99994)}, LAMBDA(x, AVERAGE(x))))
but additional "stuff" will reduce the output by 1:
=ROWS(BYROW({SEQUENCE(99993), SEQUENCE(99993, 1, 5)}, LAMBDA(x, AVERAGE(x))))
interestingly, this one runs with no problem, so it seems even the complexity of the input matters (?):
=ROWS(BYROW(SEQUENCE(99994, 6, 0, 5), LAMBDA(x, AVERAGE(x))))
and with this one, it seems that even the choice of formula matters:
=ROWS(BYROW(RANDARRAY(99996, 2), LAMBDA(x, AVERAGE(x))))
but what if we move from virtual input to real input... with the A1 cell set to =RANDARRAY(105000, 3), we can have:
=ROWS(BYROW(A1:B99997, LAMBDA(x, AVERAGE(x))))
again, it's not a matter of cells, because even with 8 columns we can get the same:
=ROWS(BYROW(A1:H99997, LAMBDA(x, AVERAGE(x))))
not bad; however, INDIRECTing the range will put us back to 99995:
=ROWS(BYROW(INDIRECT("A1:B"&99995), LAMBDA(x, AVERAGE(x))))
another fact is that LAMBDA as a standalone function runs flawlessly even with a 105000×8 array (that's a solid 840K cells):
=LAMBDA(x, AVERAGE(x))(A1:H105000)
so is this really a memory issue of LAMBDA itself (?), or are the factors that determine the memory used in LAMBDA limits of unknown origin, bestowed upon LAMBDA by the individual incapabilities of:
MAP
SCAN
BYCOL
BYROW
REDUCE
MAKEARRAY
and their unoptimized memory demands, shaken by a vast variety of yet unknown variables within our spacetime
Edit 2022/10/26
It seems the Google Sheets team has just increased the max limit 10-fold 😍:
1999992, up from 199992
My original formula assumed the limit would be 199992 cells, but as you can see, the logic "behind" it changed and may change again in the future.
LAMBDA+Friends Limit
The maximum number of rows you can use in the formula (a guess):
Limit = 1999992/(1 + inside_lambdas) - outside_lambdas
inside_lambdas and outside_lambdas are functions and parameters; each counts as 1:
+ / * -
5, A1, "text"
MOD, AVERAGE, etc.
{"array element"}
The limit applies to the cells operated on by the "lambda+" formula: REDUCE, BYROW, etc. A worked plug-in example follows below.
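For instance, using the pre-update numbers from the answer above, and assuming (hypothetically) that the body of LAMBDA(a,c,a+c) counts as two inside-lambda items with nothing outside, the guess reproduces the observed scale:
# guessed limit: Limit = 1999992 / (1 + inside_lambdas) - outside_lambdas
inside_lambdas, outside_lambdas = 2, 0  # assumed counts for REDUCE(,SEQUENCE(n),LAMBDA(a,c,a+c))
limit = 1999992 / (1 + inside_lambdas) - outside_lambdas
print(limit)  # 666664.0 -- about 10x the 66664 maximum observed before the update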
My tests are here:
Lambda Limits \ Sample Sheet
Steps to fix:
Do Not use Lambda if possible :(
Do most of the calculations outside lambda if possible
Split formulas across multiple cells, keeping the limit in mind. Copy formulas; each copy has its own limit.
Ask Google to Fix this. In Sheets use the menu Help > Help Sheets Improve
Write to the support if you have a paid account.
Final notes:
my formula for the limit is a guess, and it works for my examples and tests. Please try it and comment on this answer if you find an error.
the formula does not answer how long variable names affect the limit (see the long-variable-name =ROWS(BYROW(SEQUENCE(99994), LAMBDA(xxx..., AVERAGE(xxx...)))) example in the previous answer). More tests are needed to figure out the exact effect. Since the same long-named version does not break even with SEQUENCE(199992), my guess is that there is simply a maximum length for a variable name, and that it does not change the cells limit.
The Google Sheets team may change the logic "behind" the formula, so all of these tests may become invalid in time.

Prometheus increase not handling process restarts

I am trying to figure out the behavior of Prometheus' increase() querying function with process restarts.
When there is a process restart within a 2m interval and I query:
sum(increase(my_metric_total[2m]))
I get a value less than expected.
For example, in a simple experiment I mock:
3 lcm_restarts
1 process restart
2 lcm_restarts
All within a 2 minute interval.
Upon querying:
sum(increase(lcm_restarts[2m]))
I receive a value of ~4.5 when I am expecting 5.
[screenshot: lcm_restarts graph]
[screenshot: sum(increase(lcm_restarts[2m])) result]
Could someone please explain?
Pretty concise and well-prepared first question here. Please keep this spirit!
When working with counters, functions such as rate(), irate() and increase() adjust for resets caused by restarts. Contrary to what the name suggests, the increase() function does not calculate the absolute increase over the given time frame; it is just a different way to write rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and the last measurement in the series and calculates the per-second increase over that span. This is why you may observe non-integer increases even if you always increment by whole numbers: the measurements almost never fall exactly at the start and end of the interval.
For more details, have a look at the Prometheus docs for the increase() function. There are also some good hints on what to do and what not to do when working with counters in the Robust Perception blog.
Looking at your label dimensions, I also think that counter resets don't apply to your constructed example. There is a label called reason that changed between the restarts and so created a second time series (not continuing the existing one). So you are effectively summing up the increases of two different time series, each with its own extrapolation.
So basically there isn't anything wrong with what you are doing; you just shouldn't rely on getting highly precise numbers out of Prometheus for this use case.
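To make the mechanism concrete, here is a toy Python sketch of the first-to-last-sample calculation described above (the scrape timestamps and values are made up; this is not Prometheus code, and real Prometheus additionally extrapolates toward the window edges):
# counter samples (timestamp in seconds, value) falling inside a 2-minute window
samples = [(5, 1), (35, 2), (65, 3), (95, 4), (110, 5)]
(t0, v0), (t1, v1) = samples[0], samples[-1]
per_second = (v1 - v0) / (t1 - t0)  # per-second increase between first and last sample
print(per_second * 120)             # ~4.57 for the 2m window, not the "true" 5
# increases before the first sample in the window are never observed,
# which is why the reported value undershoots the raw count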
Prometheus may return unexpected results from the increase() function for the following reasons:
Prometheus may return fractional results from increase() over an integer counter because of extrapolation. See this issue for details.
Prometheus may return lower than expected results from increase(m[d]) because it doesn't take into account possible counter increase between the last raw sample just before the specified lookbehind window [d] and the first raw sample inside the lookbehind window [d]. See this article and this comment for details.
Prometheus skips the increase for the first sample in a time series. For example, increase() over the following series of samples would return 1 instead of 11: 10 11 11. See these docs for details.
These issues are going to be fixed according to this design doc. In the meantime, it is possible to use other Prometheus-like systems, such as VictoriaMetrics, which are free from these issues.

What should I do to maintain the performance of a mobile app that uses a database?

I'm building an app that uses a database.
I have a words table, and every time the user types something, the app records the word and updates the database.
The frequency field is auto-incremented whenever the user enters a matching word.
The trouble is that the user keeps typing day after day, and I'm afraid that search performance will degrade over time and that the Int field will someday reach its limit (the maximum Int value).
So I limit the database to fewer than about 50,000 records.
I delete less-used records after a certain time.
But I don't know how to deal with the frequency Int field of each word.
How can I track each word's usage frequency without the field growing forever?
I recommend that you use a logarithmic scale for the frequency values. That's what is often done in situations like this. See Wikipedia to learn about logarithmic scales.
For example, if you have a word MAN that has a frequency of 15, the value you store in the database would be log(15) ~= 1.17609125906.
If you then find 4 new occurrences of MAN, you want the field to reflect 4 more raw occurrences. You cannot simply add to the stored log value, because log(x)+log(y)=log(x*y). (See the Logarithm Rules section of this article for more information on log rules.)
Instead -- assuming you use a base 10 logarithm -- you would use this formula:
SET frequency = log(10^frequency+4)
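A minimal sketch of this update rule (Python, base-10 logs; the word counts are made up):
import math

def add_occurrences(log_freq, new_count):
    # convert back to a raw count, add the new occurrences, re-take the log
    return math.log10(10 ** log_freq + new_count)

freq = math.log10(15)            # stored value for MAN, ~1.17609125906
freq = add_occurrences(freq, 4)  # 4 new occurrences of MAN
print(round(10 ** freq))         # 19 raw occurrences recovered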
Depending on the length of your words, the few bytes for the frequency don't matter much. With an unsigned four-byte integer, you can count to more than four billion, which is far more words than a user could type in their whole lifetime.
So you may want to go with two or three bytes instead, but the savings may be negligible.
Anyway, there are the following approaches for preventing overflow:
You can detect the overflow, undo the operation, scale everything down by some factor of two, and then redo it.
You can periodically check all your numbers and do the scaling when approaching the limit.
You can do a probabilistic update like below.
Probabilistic update
Instead of simply incrementing the frequency by one every time, you increment it only with a probability that gets lower and lower as the counter grows. For example, you can do the increment with a probability of 1.0 / (oldValue + 1) or 2 ** -oldValue. The latter leads to logarithmic growth but, unlike the idea in the other answer, it works.
There are obviously some disadvantages due to the randomness and precision loss, but when all you care about is the relative frequency, it should be good enough.
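A small sketch of the 2 ** -oldValue variant (this is essentially a Morris approximate counter; the event count below is arbitrary):
import random

def probabilistic_increment(counter):
    # increment with probability 2**-counter, so the stored value grows ~log2(true count)
    if random.random() < 2.0 ** -counter:
        counter += 1
    return counter

c = 0
for _ in range(100_000):  # 100k real events
    c = probabilistic_increment(c)
print(c, 2 ** c)          # c lands around 17; 2**c roughly estimates the event count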

Precise definition of HL7(v2) field RXE-25?

Can anyone explain what the RXE-25 field in HL7 v2 means? The description is "Give Strength". I have read the official explanation, but I find it ambiguous. I am not sure whether this field should contain a) the strength of a single tablet/dose form or b) the total strength to administer.
For example, hydroxychloroquine [HCQ] is a lupus medication that comes in 200 mg tablets. Lupus patients are frequently started on 400 mg per day (i.e. 2 tablets).
Let's say RXE-3 ("Give Amount - Minimum") is "2" and RXE-5 ("Give Units") is "tablet". And let's assume there are multiple tablet strengths, so we can't infer the dose from those fields alone. Would one put the per-tablet strength in RXE-25 (i.e. "200" mg), or instead the entire dose (2 tablets = "400" mg)?
My understanding of all the 'Give' fields is that they represent the amount given per dose. So, to answer your questions:
b)the total strength to administer
AND
put the entire dose (2 tablets="400" mg)
Actually, I found the answer hidden in another part of the HL7 documentation, specifically under the RXO segment (RXO-18 Requested Give Strength). Per that documentation, this applies to various RX_ segments.
The example given:
One way would be: "Ampicillin 250 mg capsules, 2 capsules four times a
day." In this case the give amount would be 2, the give units would be
capsules, the strength would be 250 and the strength units would
milligrams.
So it seems the GIVE STRENGTH AMOUNT (if present, and if no GIVE DRUG STRENGTH VOLUME is specified) is multiplied by the GIVE AMOUNT to arrive at the total dose. So I believe the answer to the example in the question would be a) 200 mg.
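Under that reading, the arithmetic for the question's HCQ example is simply (an illustration of the interpretation above, not prescribed HL7 processing):
give_amount = 2      # RXE-3 Give Amount - Minimum: tablets per dose
give_strength = 200  # RXE-25 Give Strength: per-tablet strength, per this answer
units = "mg"         # the accompanying strength units field

total_per_dose = give_amount * give_strength
print(total_per_dose, units)  # 400 mg actually administered per dose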
