Forcing a table rehash not working after a previous rehash - lua

I've created a function that resizes an array and sets new entries to 0, but can also decrease the size of the array in 2 different ways:
1. Simply setting the n property to the new size (the length operator cannot be used because of this reason).
2. Setting all values after the new size to nil up to 2*size to force a rehash.
local function resize(array, elements, free)
local size = array.n
if elements < size then -- Decrease Size
array.n = elements
if free then
size = math.max(size, #array) -- In case of multiple resizes
local base = elements + 1
for idx = base, 2*size do -- Force a rehash -> free extra unneeded memory
array[idx] = nil
end
end
elseif elements > size then -- Increase Size
array.n = elements
for idx = size + 1, elements do
array[idx] = 0
end
end
end
How I tested it:
local mem = {n=0};
resize(mem, 50000)
print(mem.n, #mem) -- 50000 50000
print(collectgarbage("count")) -- relatively large number
resize(mem, 10000, true)
print(mem.n, #mem) -- 10000 10000
print(collectgarbage("count")) -- smaller number
resize(mem, 20, true)
print(mem.n, #mem) -- 20 20
print(collectgarbage("count")) -- same number as above, but it should be a smaller number
However when I don't pass true as the third argument to the second call of resize (so it doesn't force a rehash on the second call), the third call does end up rehashing it.
Am I missing something? I'm expecting the third one to also rehash after the second one has.

Here is a clearer picture of what the table usually looks like before and after the resizes:
table: 0x15bd3d0 n: 0 #: 0 narr: 0 nrec: 1
table: 0x15bd3d0 n: 50000 #: 50000 narr: 65536 nrec: 1
table: 0x15bd3d0 n: 10000 #: 10000 narr: 16384 nrec: 2
table: 0x15bd3d0 n: 20 #: 20 narr: 16384 nrec: 2
And here is what happens:
During the resize to 50000 elements, the table is rehashed several times, and at the end it contains exactly one hash part slot for the n field and enough array part slots for the integer keys.
During the shrinking to 10000 elements, you first assign nil to the integer keys 10001 to 65536, and then from 65537 to 100000. The first group of assignments will never cause a rehash, because you assign to existing fields. This has to do with the guarantees for the next function. The second group of assignments will cause rehashes, but since you are assigning nils, Lua will realize at some point that the array part of the table is more than half empty (see comment at the beginning of ltable.c). Lua will then shrink the array part to a reasonable size and use a second hash slot for the new key. But since you are assigning nils, that second hash slot is never occupied, and Lua is free to re-use it for all the remaining assignments (and it often but not always does). You wouldn't notice a rehash at this point anyway, because you will always end up with the 16384 array slots and 2 hash slots (one for n, one for the new element to be assigned).
The shrinking to 20 elements just continues this way, with the exception that a second hash slot is already available. So you might never get a rehash (and the array size stays larger than necessary), but if you do (Lua for some reason doesn't like the one free hash slot), you'll see the number of array slots drop to a reasonable level.
This is what it looks like when you do get a rehash during the second shrinking:
table: 0x11c43d0 n: 0 #: 0 narr: 0 nrec: 1
table: 0x11c43d0 n: 50000 #: 50000 narr: 65536 nrec: 1
table: 0x11c43d0 n: 10000 #: 10000 narr: 16384 nrec: 2
table: 0x11c43d0 n: 20 #: 20 narr: 32 nrec: 2
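The narr values in these dumps follow from the sizing rule mentioned in the ltable.c comment. Below is a rough Python model of that rule, not Lua's actual code: it assumes the table's integer keys are exactly 1..k with no holes, and the function name array_part_size is made up for illustration.

```python
def array_part_size(k):
    """Simplified model: array-part size chosen on a rehash when the
    integer keys are exactly 1..k (an assumption, not Lua's real code)."""
    best, n = 0, 1
    while n / 2 < k:
        # Keep the largest power of two n for which more than half
        # of the slots 1..n would be in use.
        if min(k, n) > n / 2:
            best = n
        n *= 2
    return best

for k in (50000, 10000, 20):
    print(k, array_part_size(k))
```

For 50000, 10000 and 20 elements this model gives 65536, 16384 and 32, which matches the narr column above whenever a rehash actually happens.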
If you want to repeat my experiments, the git HEAD version of lua-getsize (original version here) now also returns the number of slots in the array/hash parts of a table.

Related

Generating a Lua table with random non repeating numbers

I'm looking to generate a table of random values, but want to make sure that none of those values are repeated within the table.
So my basic table generation looks like this:
numbers = {}
for i = 1, 5 do
table.insert(numbers, math.random(20))
end
So that will work for populating a table with 5 random values between 1 and 20. However, making sure none of those values repeat is where I'm stuck.
One approach would be to shuffle an array of numbers and then take the first n numbers. The wrong way to go about shuffling an array is to maintain a list of previously generated random numbers, checking against that with each newly generated random number before adding it to the final array. Such a solution is O(n^2) in time complexity when iterating over the array during the check; this will be painful for large arrays, or for small arrays when many must be created. Lua has constant time array access since tables are really hash tables, so you could get away with this, except: sometimes many random numbers will need to be tried before a suitable one (that has not already been used) is found. This can be a real problem near the end of an array of many random numbers, i.e., when you want 1000 random numbers and have filled all but the last slot, how many random tries (and how many iterations of the 999 numbers already selected) will it take to find the only number (42, of course) that is still available?
The right way to go about shuffling is to use a shuffling algorithm. The Fisher-Yates shuffle is a common solution to this problem. The idea is that you start at one end of an array, and swap each element with a random element that occurs later in the list until the entire array has been shuffled. This solution is O(n) in time complexity, thus much less wasteful of computational resources.
Here is an implementation in Lua:
function shuffle (arr)
for i = 1, #arr - 1 do
local j = math.random(i, #arr)
arr[i], arr[j] = arr[j], arr[i]
end
end
Testing in the REPL:
> t = { 1, 2, 3, 4, 5, 6 }
> table.inspect(t)
1 = 1
2 = 2
3 = 3
4 = 4
5 = 5
6 = 6
> shuffle(t)
> table.inspect(t)
1 = 4
2 = 5
3 = 1
4 = 6
5 = 2
6 = 3
This can easily be extended to create lists of random numbers:
function shuffled_numbers (n)
local numbers = {}
for i = 1, n do
numbers[i] = i
end
shuffle(numbers)
return numbers
end
REPL interaction:
> s = shuffled_numbers(10)
> table.inspect(s)
1 = 9
2 = 5
3 = 3
4 = 4
5 = 7
6 = 6
7 = 2
8 = 10
9 = 8
10 = 1
If you want to see what is happening during the shuffle, add a print statement in the shuffle function:
function shuffle (arr)
for i = 1, #arr - 1 do
local j = math.random(i, #arr)
print(string.format("%d (%d) <--> %d (select %d)", i, arr[i], j, arr[j]))
arr[i], arr[j] = arr[j], arr[i]
end
end
Now you can see the swaps as they occur if you recall that in the above implementation of shuffled_numbers the array { 1, 2, ..., n } is the starting point of the shuffle. Note that sometimes a number is swapped with itself, which is to say that the number in the current unselected position is a valid choice, too. Also note that the last number is automatically the correct selection, since it is the only number that has not yet been randomly selected:
> s = shuffled_numbers(10)
1 (1) <--> 5 (select 5)
2 (2) <--> 10 (select 10)
3 (3) <--> 5 (select 1)
4 (4) <--> 9 (select 9)
5 (3) <--> 8 (select 8)
6 (6) <--> 9 (select 4)
7 (7) <--> 8 (select 3)
8 (7) <--> 10 (select 2)
9 (6) <--> 9 (select 6)
> table.inspect(s)
1 = 5
2 = 10
3 = 1
4 = 9
5 = 8
6 = 4
7 = 3
8 = 2
9 = 6
10 = 7
Obtaining a selection of 5 random numbers between 1 and 20 is easy enough to accomplish using the shuffle function; one of the virtues of this approach is that the shuffling operation has been abstracted to an O(n) procedure which can shuffle any array, numeric or otherwise. The function that calls shuffle is responsible for supplying the input and returning the results.
A simple solution for more flexibility in the range of random numbers returned:
-- Take the first N numbers from a shuffled range [A, B].
function shuffled_range_take (n, a, b)
local numbers = {}
for i = a, b do
numbers[i - a + 1] = i -- store at indices 1..(b - a + 1) so # and shuffle still work when a ~= 1
end
shuffle(numbers)
return { table.unpack(numbers, 1, n) }
-- table.unpack won't work for very large ranges, e.g. [1, 1000000]
-- You could instead use this for arbitrarily large ranges:
-- local take = {}
-- for i= 1, n do
-- take[i] = numbers[i]
-- end
-- return take
end
REPL interaction creating a table containing 5 random values between 1 and 20:
> s = shuffled_range_take(5, 1, 20)
> table.inspect(s)
1 = 1
2 = 10
3 = 4
4 = 8
5 = 20
But there is a disadvantage to the shuffle method in some circumstances. When the number of elements needed is small compared with the number of available elements, the above solution must shuffle a large array to obtain comparatively few random elements. The shuffle is O(n) in the number of elements available, while the memoization method is roughly O(n) in the number of elements chosen. A memoization method like that of @AlexanderMashin performs poorly when the goal is to create an array of 20 random numbers between 1 and 20, because the final numbers chosen may need to be chosen many times before suitable numbers are found. But when only 5 random numbers between 1 and 20 are needed, this problem with duplicate choices is less of an issue. This approach seems to perform better than the shuffle, up to about 10 numbers needed from 20 random numbers. When more than 10 numbers are needed from 20, the shuffle begins to perform better. This break-even point is different for larger numbers of elements to choose from; for 1000 available elements, parity is reached at about 700 chosen. When performance is critical, testing is the only way to determine the best solution.
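One possible middle ground, sketched here in Python rather than Lua, is to stop the Fisher-Yates loop after the first k positions. The helper name take_random is hypothetical. It still builds the whole range, so setup stays O(n) in the number of available elements, but it performs only k swaps and never has to reject a candidate.

```python
import random

def take_random(k, a, b, rng=random):
    """Return k distinct integers drawn at random from [a, b]."""
    pool = list(range(a, b + 1))
    if not 0 <= k <= len(pool):
        raise ValueError("k must be between 0 and the size of the range")
    for i in range(k):
        # Pick the element for position i from the not-yet-fixed tail.
        j = rng.randrange(i, len(pool))
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]

print(take_random(5, 1, 20))
```

Whether this beats either method above for a given k and range is something only measurement can settle.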
numbers = {}
local i = 1;
while i<=5 do
n = 0
local rand = math.random(20)
for x=1,#numbers do
if numbers[x] == rand then
n = n + 1
end
end
if n == 0 then
table.insert(numbers, rand)
i = i + 1
end
n = 0
end
The method I used was a for loop that scans each element in the table and increments the variable n if one of them equals the random value. If n is different from 0, the value is not inserted into the table and i is not incremented (which is why I had to use a while loop to control i).
If you want to print each of the elements in the table to check the values, you can use this:
for i=1,#numbers do
print(numbers[i])
end
I suggest an alternative method based on the fact that it is easy to make sets in Lua: they are just tables with true values.
-- needed is how many random numbers in the table are needed,
-- maximum is the maximum value of a random non-negative integer.
local function fill_table( needed, maximum )
math.randomseed ( os.time () ) -- reseed the random numbers generator
local numbers = {}
local used = {} -- which numbers are already used
for i = 1, needed do
local random
repeat
random = math.random( maximum )
until not used[random]
used[random] = true
numbers[i] = random
end
return numbers
end
Make a table holding the values 1 to 20 (use for/do/end), then run the following line as many times as you need numbers:
rand_number=table.remove(tablename, math.random(1,#tablename))
This way rand_number never holds the same value twice. I use this to simulate a "Lottozahlengenerator" (German for lottery number generator) and for playing random video/music clips where duplicates are unwanted.

hypothesis function space in decision tree

I am reading the book "Artificial Intelligence" by Stuart Russell and Peter Norvig (Chapter 18). The following paragraph is from the decision trees context.
For a wide variety of problems, the decision tree format yields a
nice, concise result. But some functions cannot be represented
concisely. For example, the majority function, which returns true if
and only if more than half of the inputs are true, requires an
exponentially large decision tree.
In other words, decision trees are good for some kinds of functions
and bad for others. Is there any kind of representation that is
efficient for all kinds of functions? Unfortunately, the answer is no.
We can show this in a general way. Consider the set of all Boolean
functions on "n" attributes. How many different functions are in this
set? This is just the number of different truth tables that we can
write down, because the function is defined by its truth table.
A truth table over "n" attributes has 2^n rows, one for each
combination of values of the attributes.
We can consider the “answer” column of the table as a 2^n-bit number
that defines the function. That means there are (2^(2^n)) different
functions (and there will be more than that number of trees, since
more than one tree can compute the same function). This is a scary
number. For example, with just the ten Boolean attributes of our
restaurant problem there are 2^1024 or about 10^308 different
functions to choose from.
What does the author mean by treating the "answer" column of the table as a 2^n-bit number that defines the function?
How did the author derive the (2^(2^n)) different functions?
Please elaborate on the above questions, preferably with a simple example, such as n = 3.
Consider a general truth table for a 3-input function, where the result for each triple is also a Boolean (1 or 0), represented by variables i through 'p':
A B C f(a,b,c)
0 0 0 i
0 0 1 j
0 1 0 k
0 1 1 l
1 0 0 m
1 0 1 n
1 1 0 o
1 1 1 p
We can now represent any function on three variables as an 8-bit number, ijklmnop. For instance, and is 00000001; or is 01111111; one_hot (exactly one input True) is 01101000.
For 3 variables, you have 2^3 bits in the "answer", the complete function definition. Since there are 8 bits in the "answer", there are 2^8 possible functions we can define.
Does that outline the field of comprehension for you?
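To make the count concrete, here is a small Python sketch that enumerates every possible "answer" column for n inputs. The function name truth_table_functions is made up for illustration; each function is represented as a string of 2^n answer bits in the same row order as the table above.

```python
from itertools import product

def truth_table_functions(n):
    """Yield every Boolean function of n inputs as a string of 2**n answer bits."""
    rows = 2 ** n
    for bits in product("01", repeat=rows):
        yield "".join(bits)

funcs = list(truth_table_functions(3))
print(len(funcs))            # 256, i.e. 2 ** (2 ** 3)
print("00000001" in funcs)   # True: the 3-input AND is one of them
```

Enumeration is only practical for tiny n; for n = 10 the same count is 2 ** 1024, which Python can compute as an integer but nobody can list.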
More detail on an example function
You simply (once you see the pattern) make the eight bits correspond to the entries in the table. For instance, the table for one-hot looks like this:
A B C f(a,b,c)
0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 0
Reading down the "answer" column, labeled f(a,b,c), you get the 8-bit sequence 01101000. That 8-bit number is sufficient to completely define the function: the rows listing all the combinations of a, b, c are in a fixed (numerical) sequence.
You can write any such function in a template format:
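def and3(a, b, c):  # "and" is a reserved word in Python, so the function needs another name
    and_def = '00000001'
    index = 4*a + 2*b + 1*c  # row number of (a, b, c) in the truth table
    return and_def[index]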
Now, if we generalize this to any 3-input binary function:
def bin_func(a, b, c, func_def):
    return func_def[4*a + 2*b + 1*c]
If you wish, you can further generalize the template for a list of inputs: concatenate the bits and use that integer as the index into the func_def string.
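A minimal sketch of that generalization, with the hypothetical name bool_func: the input bits are read as a binary number, most significant bit first, and that number indexes into the answer string.

```python
def bool_func(func_def, *inputs):
    """Evaluate the function encoded by func_def (a string of 2**n answer
    bits) on n input bits, given most significant bit first."""
    if len(func_def) != 2 ** len(inputs):
        raise ValueError("func_def must have 2**n bits for n inputs")
    index = 0
    for bit in inputs:
        index = 2 * index + int(bit)   # concatenate the bits into a row number
    return int(func_def[index])

ONE_HOT = "01101000"
print(bool_func(ONE_HOT, 0, 1, 0))  # 1
print(bool_func(ONE_HOT, 1, 1, 0))  # 0
```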
Does that clear it up?

How to combine search text with other criteria using Redis?

I successfully wrote an intersection of text search and other criteria using Redis. To achieve that I'm using a Lua script. The issue is that I'm not only reading, but also writing values from that script. From Redis 3.2 it's possible to achieve that by calling redis.replicate_commands(), but not before 3.2.
Below is how I'm storing the values.
Names
> HSET product:name 'Cool product' 1
> HSET product:name 'Nice product' 2
Price
> ZADD product:price 49.90 1
> ZADD product:price 54.90 2
Then, to get all products that match 'ice', for example, I call:
> HSCAN product:name 0 MATCH *ice*
However, since HSCAN uses a cursor, I have to call it multiple times to fetch all results. This is where I'm using a Lua script:
local cursor = 0
local fields = {}
local ids = {}
local key = 'product:name'
local value = '*' .. ARGV[1] .. '*'
repeat
local result = redis.call('HSCAN', key, cursor, 'MATCH', value)
cursor = tonumber(result[1])
fields = result[2]
for i, id in ipairs(fields) do
if i % 2 == 0 then
ids[#ids + 1] = id
end
end
until cursor == 0
return ids
It's not possible to use the result of a script with another call, like SADD key EVAL(SHA) ..., and it's also not possible to use global variables within scripts. So I changed the part inside the fields loop to write the IDs to a key that can be read outside the script:
if i % 2 == 0 then
ids[#ids + 1] = id
redis.call('SADD', KEYS[1], id)
end
I had to add redis.replicate_commands() to the first line. With this change I can get all ID's from the key I passed when calling the script (see KEYS[1]).
And, finally, to get a list of 100 product IDs priced between 40 and 50 where the name contains "ice", I do the following:
> ZUNIONSTORE tmp:price 1 product:price WEIGHTS 1
> ZREMRANGEBYSCORE tmp:price 0 40
> ZREMRANGEBYSCORE tmp:price 50 +INF
> EVALSHA b81c2b... 1 tmp:name ice
> ZINTERSTORE tmp:result 2 tmp:price tmp:name
> ZCOUNT tmp:result -INF +INF
> ZRANGE tmp:result 0 99
I use the ZCOUNT call to know in advance how many result pages I'll have, doing count / 100.
As I said before, this works nicely with Redis 3.2. But when I tried to run the code on AWS, which only supports Redis up to 2.8, I couldn't make it work anymore. I'm not sure how to iterate with the HSCAN cursor without using a script or without writing from the script. Is there a way to make it work on Redis 2.8?
Some considerations:
I know I can do part of the processing outside Redis (like iterate the cursor or intersect the matches), but it'll affect the application overall performance.
I don't want to deploy a Redis instance by my own to use version 3.2.
The criteria above (price range and name) is just an example to keep things simple here. I have other fields and type of matches, not only those.
I'm not sure if the way I'm storing the data is the best way. I'm open to suggestions about it.
The only problem I found here is writing the values from inside the Lua script. Instead of storing them in the script, return them from the script (as a String[]) and store them in a set with a separate call, SADD key member [member ...]. Then proceed with the intersection and return the results.
> ZUNIONSTORE tmp:price 1 product:price WEIGHTS 1
> ZREMRANGEBYSCORE tmp:price 0 40
> ZREMRANGEBYSCORE tmp:price 50 +INF
> nameSet[] = EVALSHA b81c2b... 0 ice
> SADD tmp:name nameSet
> ZINTERSTORE tmp:result 2 tmp:price tmp:name
> ZCOUNT tmp:result -INF +INF
> ZRANGE tmp:result 0 99
IMO your design is a good one. One piece of advice: use pipelining wherever possible, so that the commands are sent to the server in one go.
Hope this helps
UPDATE
There is no such thing as an array ([ ]) in Lua; you have to use a Lua table instead. Your script already returns ids, which the client receives as an array, so you can pass it to SADD in a separate call.
String[] nameSet = (String[]) evalsha b81c2b... 0 ice -> This is in Java
SADD tmp:name nameSet
And the corresponding lua script is the same as that of your 1st one.
local cursor = 0
local fields = {}
local ids = {}
local key = 'product:name'
local value = '*' .. ARGV[1] .. '*'
repeat
local result = redis.call('HSCAN', key, cursor, 'MATCH', value)
cursor = tonumber(result[1])
fields = result[2]
for i, id in ipairs(fields) do
if i % 2 == 0 then
ids[#ids + 1] = id
end
end
until cursor == 0
return ids
The problem isn't that you're writing to the database, it's that you're doing a write after a HSCAN, which is a non-deterministic command.
In my opinion there's rarely a good reason to use a SCAN command in a Lua script. The main purpose of the command is to allow you to do things in small batches so you don't lock up the server processing a huge key space (or hash key space). Since scripts are atomic, though, using HSCAN doesn't help—you're still locking up the server until the whole thing's done.
Here are the options I can see:
If you can't risk locking up the server with a lengthy command:
Use HSCAN on the client. This is the safest option, but also the slowest.
If you want to do as much processing in a single atomic Lua command as possible:
Use Redis 3.2 and script effects replication.
Do the scanning in the script, but return the values to the client and initiate the write from there. (That is, Karthikeyan Gopall's answer.)
Instead of HSCAN, do an HKEYS in the script and filter the results using Lua's pattern matching. Since HKEYS is deterministic you won't have a problem with the subsequent write. The downside, of course, is that you have to read in all of the keys first, regardless of whether they match your pattern. (Though HSCAN is also O(N) in the size of the hash.)

What is the relation between address lines and memory?

These are my assignments:
Write a program to find the number of address lines in an n Kbytes of memory. Assume that n is always to the power of 2.
Sample input: 2
Sample output: 11
I don't need specific coding help, but I don't know the relation between address lines and memory.
To express in very easy terms, without any bus-multiplexing, the number of bits required to address a memory is the number of lines (address or data) required to access that memory.
Quoting from the Wikipedia article,
a system with a 32-bit address bus can address 2^32 (4,294,967,296) memory locations.
For a simple example, consider this: you have 3 address lines (A, B, C), so the values which can be formed using 3 bits are
A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Total 8 values. So using ABC, you can access any of those eight values, i.e., you can reach any of those memory addresses.
So, TL;DR, the simple relationship is: with n lines, we can represent 2^n addresses.
An address line usually refers to a physical connection between a CPU/chipset and memory. They specify which address to access in the memory. So the task is to find out how many bits are required to pass the input number as an address.
In your example, the input is 2 kilobytes = 2048 = 2^11, hence the answer 11. If your input is 64 kilobytes, the answer is 16 (65536 = 2^16).
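The arithmetic above can be sketched in a few lines of Python. This assumes one byte per address and a size that is a power of two, as the assignment states; the function name address_lines is made up for illustration.

```python
def address_lines(n_kbytes):
    """Number of address lines needed for n_kbytes KB, one byte per address."""
    total_bytes = n_kbytes * 1024
    # A positive power of two has exactly one bit set.
    if total_bytes <= 0 or total_bytes & (total_bytes - 1):
        raise ValueError("size must be a positive power of two")
    # For 2**k the bit length is k + 1, so k is the number of lines.
    return total_bytes.bit_length() - 1

print(address_lines(2))   # 11
print(address_lines(64))  # 16
```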

Constrained Sequence to Index Mapping

I'm puzzling over how to map a set of sequences to consecutive integers.
All the sequences follow this rule:
A_0 = 1
A_n >= 1
A_n <= max(A_0 .. A_n-1) + 1
I'm looking for a solution that will be able to, given such a sequence, compute an integer for doing a lookup into a table and, given an index into the table, generate the sequence.
Example: for length 3, there are 5 valid sequences. A fast function for doing the following map (preferably in both directions) would be a good solution
1,1,1 0
1,1,2 1
1,2,1 2
1,2,2 3
1,2,3 4
The point of the exercise is to get a packed table with a 1-1 mapping between valid sequences and cells.
The size of the set is bounded only by the number of unique sequences possible.
I don't know now what the length of the sequence will be but it will be a small, <12, constant known in advance.
I'll get to this sooner or later, but thought I'd throw it out for the community to have "fun" with in the meantime.
these are different valid sequences
1,1,2,3,2,1,4
1,1,2,3,1,2,4
1,2,3,4,5,6,7
1,1,1,1,2,3,2
these are not
1,2,2,4
2,
1,1,2,3,5
Related to this
There is a natural sequence indexing, but it is not so easy to calculate.
Let us look at A_n for n>0 only, since A_0 = 1.
Indexing is done in 2 steps.
Part 1:
Group sequences by the places where A_n = max(A_0 .. A_n-1) + 1. Call these places steps.
The values at the steps are the consecutive numbers (2,3,4,5,...).
At a non-step place k we can put any number from 1 to one plus the number of steps with index less than k.
Each group can be represented as a binary string where 1 is a step and 0 a non-step. E.g. 001001010 means the group with 112aa3b4c, a<=2, b<=3, c<=4. Because groups are indexed with a binary number, there is a natural indexing of groups, from 0 to 2^length - 1. Let's call the value of a group's binary representation the group order.
Part 2:
Index sequences inside a group. Since groups define step positions, only numbers on non-step positions are variable, and they are variable in defined ranges. With that it is easy to index sequence of given group inside that group, with lexicographical order of variable places.
It is easy to calculate the number of sequences in one group. It is a number of the form 1^i_1 * 2^i_2 * 3^i_3 * ....
Combining:
This gives a 2 part key: <Steps, Group>, which then needs to be mapped to the integers. To do that we have to find how many sequences are in groups that have order less than some value. For that, let's first find how many sequences are in groups of a given length. That can be computed by passing through all groups and summing the number of sequences, or similarly with a recurrence. Let T(l, n) be the number of sequences of length l (A_0 is omitted) where the maximal value of the first element can be n+1. Then the following holds:
T(l,n) = n*T(l-1,n) + T(l-1,n+1)
T(1,n) = n + 1
Because l + n <= sequence length + 1 there are ~sequence_length^2/2 T(l,n) values, which can be easily calculated.
Next is to calculate the number of sequences in groups of order less than or equal to a given value. That can be done by summing T(l,n) values. E.g. the number of sequences in groups with order <= 1001010 binary is equal to
T(7,1) + # for 1000000
2^2 * T(4,2) + # for 001000
2^2 * 3 * T(2,3) # for 010
Optimizations:
This will give a mapping but the direct implementation for combining the key parts is >O(1) at best. On the other hand, the Steps portion of the key is small and by computing the range of Groups for each Steps value, a lookup table can reduce this to O(1).
I'm not 100% sure about the formula above, but it should be something like it.
With these remarks and recurrence it is possible to make functions sequence -> index and index -> sequence. But not so trivial :-)
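For comparison, here is a sketch in Python of both directions using plain lexicographic order over the whole sequence, which is a simpler ordering than the group-based one described above. It relies on the same kind of recurrence, written with a base case of zero remaining positions. All names (count, seq_to_index, index_to_seq) are made up for illustration, and seq_to_index assumes its input is already a valid sequence.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count(remaining, m):
    """Ways to fill `remaining` more positions when the current maximum is m."""
    if remaining == 0:
        return 1
    # m choices keep the maximum at m; one choice (m + 1) raises it.
    return m * count(remaining - 1, m) + count(remaining - 1, m + 1)

def seq_to_index(seq):
    """Lexicographic rank of a valid sequence among all valid ones of its length."""
    index, m = 0, 1
    for i in range(1, len(seq)):
        remaining = len(seq) - i - 1
        # Every smaller value at this position is <= m and skips a full block.
        index += (seq[i] - 1) * count(remaining, m)
        if seq[i] == m + 1:
            m += 1
    return index

def index_to_seq(index, length):
    """Inverse of seq_to_index for sequences of the given length."""
    seq, m = [1], 1
    for i in range(1, length):
        remaining = length - i - 1
        c = count(remaining, m)
        if index < m * c:      # a value in 1..m, maximum unchanged
            seq.append(index // c + 1)
            index %= c
        else:                  # the step value m + 1
            index -= m * c
            seq.append(m + 1)
            m += 1
    return seq

print(count(2, 1))              # 5 sequences of length 3
print(seq_to_index([1, 2, 1]))  # 2
print(index_to_seq(4, 3))       # [1, 2, 3]
```

With the count values cached, each conversion takes a number of steps proportional to the sequence length, which is small here (under 12).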
I think a hash without sorting should be the thing.
Since A_0 is always 1, maybe we can think of the sequence as a number in base 12 and use its base 10 value as the key for the lookup. (Still not sure about this.)
This is a python function which can do the job for you assuming you got these values stored in a file and you pass the lines to the function
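def valid_lines(lines):
    """Yield each line as a list of ints if it satisfies the sequence rules."""
    for line in lines:
        seq = [int(x) for x in line.strip().split(",") if x.strip()]
        if not seq or seq[0] != 1:
            continue
        # every later element must lie in 1 .. (max of the earlier elements) + 1
        if all(1 <= seq[i] <= max(seq[:i]) + 1 for i in range(1, len(seq))):
            yield seq

with open('/tmp/numbers.txt') as lines:
    for valid_line in valid_lines(lines):
        print(valid_line)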
Given the sequence, I would sort it, then use the hash of the sorted sequence as the index of the table.
