Lua: understanding table array part and hash part

In section 4, Tables, of The Implementation of Lua 5.0 there is an example:
local t = {100, 200, 300, x = 9.3}
So we have t[4] == nil. If I write t[0] = 0, it will go to the hash part.
If I write t[5] = 500, where will it go? The array part or the hash part?
I would be eager to hear the answer for the Lua 5.1, Lua 5.2 and LuaJIT 2 implementations, if there is a difference.

Contiguous integer keys starting from 1 always go in the array part.
Keys that are not positive integers always go in the hash part.
Other than that, it is unspecified, so you cannot predict where t[5] will be stored according to the spec (and it may or may not move between the two, for example if you create then delete t[4].)
LuaJIT 2 is slightly different - it will also store t[0] in the array part.
If you need it to be predictable (which is probably a design smell), stick to pure-array tables (contiguous integer keys starting from 1 - if you want to leave a gap, use a value of false instead of nil) or pure hash tables (avoid non-negative integer keys).
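As a quick illustration of the two predictable patterns (a minimal sketch, not taken from any particular implementation):

-- Pure-array style: contiguous integer keys from 1, with false as a
-- placeholder for "missing" values so the sequence never has nil holes.
local arr = {10, 20, false, 40}
print(#arr)                     --> 4, well defined: no nil gaps

-- Pure-hash style: only non-integer keys; count explicitly instead of using #.
local cfg = {name = "t", scale = 9.3}
local count = 0
for _ in pairs(cfg) do count = count + 1 end
print(count)                    --> 2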

Quoting from The Implementation of Lua 5.0:
The array part tries to store the values corresponding to integer keys from 1 to some limit n. Values corresponding to non-integer keys or to integer keys outside the array range are stored in the hash part.
The index of the array part starts from 1; that's why t[0] = 0 goes to the hash part.
The computed size of the array part is the largest n such that at least half the slots between 1 and n are in use (to avoid wasting space with sparse arrays) and there is at least one used slot between n/2+1 and n (to avoid a size n when n/2 would do).
According to this rule, in the example table:
local t = {100, 200, 300, x = 9.3}
The array part, which holds 3 elements, may have a size of 3, 4, or 5. (EDIT: the size should be 4, see #dualed's comment.)
Assume the array part has a size of 4. When writing t[5] = 500, the array part can no longer hold the element t[5]. What if the array part resizes to 8? With a size of 8, the array part holds 4 elements, which is equal to (so, not less than) half of the array size, and the range between n/2+1 and n, which in this case is 5 to 8, contains one used slot: t[5]. So an array size of 8 satisfies the requirement, and in this case t[5] will go to the array part.
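To make the quoted rule concrete, here is a simplified sketch. It is not the actual C code in ltable.c (which works on per-power-of-two counts during rehash); it just applies the two quoted conditions, with "at least half" exactly as worded above, to power-of-two candidate sizes for a given list of positive integer keys:

-- Largest power-of-two size n such that at least half of the slots 1..n are
-- used and at least one used slot lies in n/2+1..n.
local function compute_array_size(int_keys)
  local present, count = {}, 0
  for _, k in ipairs(int_keys) do present[k], count = true, count + 1 end
  local best, n = 0, 1
  while n <= 2 * count do           -- a larger n could never be half full
    local used, used_upper_half = 0, 0
    for i = 1, n do
      if present[i] then
        used = used + 1
        if i > n / 2 then used_upper_half = used_upper_half + 1 end
      end
    end
    if 2 * used >= n and used_upper_half > 0 then best = n end
    n = n * 2
  end
  return best
end

print(compute_array_size({1, 2, 3}))      --> 4 (the constructor example above)
print(compute_array_size({1, 2, 3, 5}))   --> 8 (after t[5] = 500)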

Related

Lua length operator (#) with nil values

After reading this topic and after experimenting a bit, I am trying to understand how the Lua length operator works when a table contains nil values.
Before I started to investigate, I thought that the length was simply the number of consecutive non-nil elements, starting at index 1:
print(#{nil}) -- 0
print(#{"o"}) -- 1
print(#{"o",nil}) -- 1
print(#{"o","o"}) -- 2
print(#{"o","o",nil}) -- 2
That looks pretty simple, right?
But my headache started when I accidentally added an element after a nil-terminated table:
print(#{"o",nil,"o"})
My guess was that it should probably print 1 because it would stop counting when the first nil is found. Or maybe it should print 2 if the length operator is greedy enough to look for non-nil elements after the first nil. But the above code prints 3.
So I ran several other tests to see what happens:
-- nil before the end
print(#{nil,"o"}) -- 2
print(#{nil,"o","o"}) -- 3
print(#{"o",nil,"o"}) -- 3
-- several nil elements
print(#{"o",nil,nil}) -- 1
print(#{nil,"o",nil}) -- 0
print(#{nil,nil,"o"}) -- 3
I should mention that repl.it currently uses Lua 5.1.5 which is rather old, but if you test with the Lua demo, which currently uses Lua 5.3.5, you’ll get the same results.
By looking at those results and by looking at this answer, I assume that:
if the last element is not nil, the length operator returns the full size of the table, including nil entries if any
if the last element is nil, it counts the number of consecutive non-nil and stops counting at the first nil
Are those assumptions correct?
Can we predict a 100% well-defined behavior when a table contains one or several nil values?
The Lua documentation states that the length of a table is only defined if the table is a sequence. Does that mean that the length operator has undefined behavior for non-sequences?
Apart from the length operator, can nil values cause any trouble in a table?
We can predict some behaviour, but it is not standardised, and as such you should never rely on it. It's quite possible that the behaviour may change within this major version of Lua.
Should you ever need to fill a table with nil values, I suggest wrapping the table and replacing holes with a unique placeholder value (e.g. NIL = {}; if v == nil then t[k] = NIL end). This is quite cheap to test against and safe.
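A minimal sketch of that placeholder idea (the NIL sentinel and the set/get wrappers are illustrative names, not a standard API):

local NIL = {}                      -- unique sentinel, never equal to real data

local function set(t, k, v)
  if v == nil then v = NIL end      -- store the hole explicitly
  t[k] = v
end

local function get(t, k)
  local v = t[k]
  if v == NIL then return nil end   -- translate the sentinel back to nil
  return v
end

local t = {}
set(t, 1, "a"); set(t, 2, nil); set(t, 3, "c")
print(#t, get(t, 2))                --> 3   nil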
That said...
As there is even a difference in the result of # depending on how the table is defined, you'll have to distinguish between statically defined (constant) tables and dynamically defined (mutated) tables.
Static table definitions:
#{nil,nil,nil,nil,nil, 1} -- 6
#{3, 2, nil, 1} -- 4
#{nil,nil,nil, 1, 1,nil} -- 0
#{nil,nil, 1, 1, 1,nil} -- 5
#{nil, 1, 1, 1, 1,nil} -- 5
#{nil,nil,nil,nil, 1,nil} -- 0
#{nil,nil, 1,nil, 1,nil,nil} -- 5
#{nil,nil,nil, 1,nil,nil, 1,nil} -- 4
Using this kind of definition, as long as the last value is non-nil, you will get a length equal to the position of the last value. If the last value is nil, Lua starts a (non-linear) search from the tail until it finds the first non-nil value.
Dynamic data definition
local x={}; x[5]=1;print(#x) -- 0
local x={}; x[1]=1;x[2]=1;x[3]=1;x[5]=1;print(#x) -- 3
local x={}; x[1]=1;x[2]=1;x[4]=1;x[5]=1;print(#x) -- 5
#{[5]=1} -- 0
local x={nil,nil,nil,1};x[5]=1;print(#x) -- 0
As soon as the table has been changed once, the operator works the other way (that includes static definitions with []). If the first element is nil, # always returns 0; if not, it starts a search that I did not investigate further (I guess you can check the sources, though I don't think it's a standard binary search) until it finds a nil value that is preceded by a non-nil value.
As said before, relying on this behaviour is not a good idea, and invites lots of issues down the road. Though if you want to make a nasty unmaintainable program to mess with a colleague, that's a sure way to do it.
When a table is a sequence (its positive numeric keys are exactly 1..n, with no nil gaps), # is defined to be precisely the count of those elements.
For non-sequence tables, it is a bit more complicated. Lua 5.2 seems to leave the result as undefined. For 5.1 and 5.3, the result of the operation is a border.
A border in a table is any positive index that contains a non-nil value followed by nil, or 0 if the first element is nil. # is defined to return any value that satisfies these conditions.
Looking at it from another perspective, since tables contain an "array" part and a "map" part, Lua has no way of knowing where the "map" indices start. For example, you can create a table with 1000 values and then set the first 999 of them to nil; that could leave you with a table of "size" 1000. However, you can also start with an empty table and set the 1000th element, having a table of "size" 0 but still structurally equivalent to the first one. The result of # is then simply the first valid value the internal algorithm finds.
The length operator produces undefined behaviour for tables that aren't sequences (i.e. tables with nil elements in the middle of the array). This means that even if the Lua implementation always behaves in a certain way, you shouldn't rely on that behaviour, as it may change in future versions of Lua, or in different implementations like LuaJIT.
You can use nils in tables - there is nothing wrong with that - just don't use the length operator on a table which might have nils before non-nil values.
The post you linked to contains more details about how the actual algorithm works. It mentions counting elements with a "binsearch", i.e. a binary search. This is not the same as just counting the elements one by one - if there are nils in the table, then depending on their exact position, the binary search algorithm may treat them as the end of the table, or may just ignore them.
To sum up, the algorithm is harder to predict than you were assuming, and even though it is technically possible to predict what will happen in any given case, you shouldn't rely on that behaviour as it is liable to change.

Lua Complex Tables/List Sizes

I am trying to find the number of entries for test[0]
test = {}
test[0] = {}
test[0].x = {}
test[0].x[0] = 1
test[0].x[1] = 1
test[0].x[2] = 1
test[0].y = {}
test[0].y[0] = 1
I am expecting table.getn(test[0]) to be 2 for entries test[0].x and test[0].y but it results in 0. Why is this, and what do I need to do to get what I am looking for?
Note that table.getn in Lua 5.0 has been replaced by the # operator since Lua 5.1
The size of a table is only valid for the sequence part of a table (i.e., positive numeric keys from 1 to some number n, where n is the size).
In this example, test[0] has only two keys, "x" and "y". As a result its size is 0.
table.getn and the Lua 5.1 length operator are defined to operate on "lists" or arrays. Your table isn't one: it has no numerical indices.
So the result is undefined in Lua 5.1 (though it will be zero here) and 0 in Lua 5.0, as the size is defined to be one less than the first integer index with a nil value, which here is the integer index 1.
Also worth noting is that table.getn(test[0].x) will return 2 and table.getn(test[0].y) will return 0 (since Lua arrays start at 1).
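If what you actually want is the number of entries in test[0] regardless of key type, counting with pairs does it (a small sketch; count_keys is just an illustrative name):

local function count_keys(t)
  local n = 0
  for _ in pairs(t) do n = n + 1 end
  return n
end

print(count_keys(test[0]))   --> 2 ("x" and "y")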

Why does Lua's length (#) operator return unexpected values?

Lua has the # operator to compute the "length" of a table being used as an array.
I checked this operator and I am surprised.
This is the code that I ran under Lua 5.2.3:
t = {};
t[0] = 1;
t[1] = 2;
print(#t); -- 1, aha, Lua counts from one
t[2] = 3;
print(#t); -- 2, three values, but only two are counted
t[4] = 3;
print(#t); -- 4, but 3 is missing?
t[400] = 400;
t[401] = 401;
print(#t); -- still 4, now I am confused?
t2 = {10, 20, nil, 40}
print(#t2); -- 4, but the documentation says this is not a sequence?
Can someone explain the rules?
About tables in general
(oh, can't you just give me an array)
In Lua, a table is the single general-purpose data structure. Table keys can be of any type, like number, string, boolean. Only nil keys aren't allowed.
Whether tables can or can't contain nil values is a surprisingly difficult question which I tried to answer in depth here. Let's just assume that setting t[k] = nil should be observably the same as never setting k at all.
Table construction syntax (like t2 = {10, 20, nil, 40}) is a syntactic sugar for creating a table and then setting its values one by one (in this case: t2 = {}, t2[1] = 10, t2[2] = 20, t2[3] = nil, t2[4] = 40).
Tables as arrays
(oh, from this angle it really looks quite arrayish)
As tables are the only complex data structure in Lua, the language (for convenience) provides some ways for manipulating tables as if they were arrays.
Notably, this includes the length operator (#t) and many standard functions, like table.insert, table.remove, and more.
The behavior of the length operator (and, in consequence, the mentioned utility functions) is only defined for array-like tables with a particular set of keys, so-called sequences.
Quoting the Lua 5.2 Reference manual:
the length of a table t is only defined if the table is a sequence, that is, the set of its positive numeric keys is equal to {1..n} for some integer n
As a result, the behavior of calling #t on a table that is not a sequence at that time is undefined.
It means that any result could be expected, including 0, -1, or false, or an error being raised (unrealistic for the sake of backwards compatibility), or even Lua crashing (quite unrealistic).
Indirectly, this means that the behavior of utility functions that expect a sequence is undefined if called with a non-sequence.
Sequences and non-sequences
(it's really not obvious)
So far, we know that using the length operator on tables that are not sequences is a bad idea. That means we should either write our programs in a particular way that guarantees those tables will always be sequences in practice, or, if we are given a table without any assumptions about its content, dynamically ensure that it is indeed a sequence (a sketch of such a check follows the examples below).
Let's practice. Remember: positive numeric keys have to be in the form {1..n}, e.g. {1}, {1, 2, 3}, {1, 2, 3, 4, 5}, etc.
t = {}
t[1] = 123
t[2] = "bar"
t[3] = 456
Sequence. Easy.
t = {}
t[1] = 123
t[2] = "bar"
t[3] = 456
t[5] = false
Not a sequence. {1, 2, 3, 5} is missing 4.
t = {}
t[1] = 123
t[2] = "bar"
t[3] = 456
t[4] = nil
t[5] = false
Not a sequence. nil values aren't considered part of the table, so again we're missing 4.
t = {}
t[1] = 123
t[2] = "bar"
t[3.14] = 456
t[4] = nil
t[5] = false
Not a sequence. 3.14 is positive, but isn't an integer.
t = {}
t[0] = "foo"
t[1] = 123
t[2] = "bar"
Sequence. 0 isn't counted for the length and utility functions will ignore it, but this is a valid sequence. The definition only gives requirements about positive number keys.
t = {}
t[-1] = "foo"
t[1] = 123
t[2] = "bar"
Sequence. Similar.
t = {}
t[1] = 123
t["bar"] = "foo"
t[2] = "bar"
t[false] = 1
t[3] = 0
Sequence. We don't care about non-numeric keys.
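As promised above, a minimal sketch of such a dynamic check (the is_sequence name is made up; this is one possible implementation among many):

-- A table is a sequence if its positive numeric keys are exactly 1..n.
local function is_sequence(t)
  local numeric = 0
  for k in pairs(t) do
    if type(k) == "number" and k > 0 then
      if math.floor(k) ~= k then return false end  -- 3.14 and friends
      numeric = numeric + 1
    end
  end
  for i = 1, numeric do
    if t[i] == nil then return false end           -- a gap inside 1..n
  end
  return true
end

print(is_sequence({123, "bar", 456}))    --> true
print(is_sequence({[1] = 1, [3] = 3}))   --> false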
Diving into the implementation
(if you really have to know)
But what happens in the C implementation of Lua when we call # on a non-sequence?
Background: tables in Lua are internally divided into an array part and a hash part. That's an optimization. Lua tries to avoid allocating memory often, so it preallocates for the next power of two. That's another optimization.
When the last item in the array part is nil, the result of # is the length of the shortest valid sequence found by binsearching the array part for the first nil-followed key.
When the last item in the array part is not nil AND the hash part is empty, the result of # is the physical length of the array part.
When the last item in the array part is not nil AND the hash part is NOT empty, the result of # is the length of the shortest valid sequence found by binsearching the hash part for the first nil-followed key (that is, such a positive integer i that t[i] ~= nil and t[i+1] == nil), assuming that the array part is full of non-nils(!).
So the result of # is almost always the (desired) length of the shortest valid sequence, unless the last element in the array part representing a non-sequence is non-nil. Then, the result is bigger than desired.
Why is that? It seems like yet another optimization (for power-of-two sized arrays). The complexity of # on such tables is O(1), while other variants are O(log(n)).
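To make the "binsearching for a nil-followed key" idea concrete, here is a rough model in plain Lua. It only sees the table's contents, not the internal array/hash split, so treat it as an approximation of what the real luaH_getn in ltable.c does, not the actual algorithm:

-- Return some border: an index i >= 0 with (i == 0 or t[i] ~= nil) and t[i+1] == nil.
local function find_border(t)
  if t[1] == nil then return 0 end
  -- grow an upper bound by doubling until we hit a nil slot
  local lo, hi = 1, 2
  while t[hi] ~= nil do
    lo = hi
    hi = hi * 2
  end
  -- binary search between a non-nil index (lo) and a nil index (hi)
  while hi - lo > 1 do
    local mid = math.floor((lo + hi) / 2)
    if t[mid] ~= nil then lo = mid else hi = mid end
  end
  return lo
end

print(find_border({"o", nil, "o"}))  -- any such value is a valid border; it need
                                     -- not match what your Lua build returns for #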
In Lua, only specially formed tables are considered an array. They are not really arrays such as one might have in the C language; the items are still in a hash table, but the keys are numeric and contiguous from 1 to N. Lua arrays are one-based, not zero-based.
The bottom line is that if you do not know if the table you have formed meets the Lua criteria for an array then you must count up the items in the table to know the length of the table. That is the only way. Here is a function to do it:
function table_count(T)
  local count = 0
  for _ in pairs(T) do count = count + 1 end
  return count
end
If you populate a table with the "insert" function used in the manner of the following example, then you will be guaranteed of making an "array" table.
s={}
table.insert(s,[whatever you want to store])
table.insert could be in a loop or called from other places in your code. The point is, if you put items in your table in this way then it will be an array table and you can use the # operator to know how many items are in the table, otherwise you have to count the items.

What's the difference between table.insert(t, i) and t[#t+1] = i?

In Lua, there seem to be two ways of appending an element to an array:
table.insert(t, i)
and
t[#t+1] = i
Which should I use, and why?
Which to use is a matter of preference and circumstance: as the # length operator was introduced in version 5.1, t[#t+1] = i will not work in Lua 5.0, whereas table.insert has been present since 5.0 and will work in both. On the other hand, t[#t+1] = i uses exclusively language-level operators, whereas table.insert involves a function (which has a slight amount of overhead to look up and call, and which depends on the table module in the environment).
In the second edition of Programming in Lua (an update of the Lua 5.0-oriented first edition), Roberto Ierusalimschy (the designer of Lua) states that he prefers t[#t+1] = i, as it's more visible.
Also, depending on your use case, the answer may be "neither". See the manual entry on the behavior of the length operator:
If the array has "holes" (that is, nil values between other non-nil values), then #t can be any of the indices that directly precedes a nil value (that is, it may consider any such nil value as the end of the array).
As such, if you're dealing with an array with holes, using either one (table.insert uses the length operator) may "append" your value to a lower index in the array than you want. How you define the size of your array in this scenario is up to you, and, again, depends on preference and circumstance: you can use table.maxn (disappearing in 5.2 but trivial to write), you can keep an n field in the table and update it when necessary, you can wrap the table in a metatable, or you could use another solution that better fits your situation (in a loop, a local tsize in the scope immediately outside the loop will often suffice).
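For instance, a drop-in replacement for table.maxn is indeed only a few lines (a sketch; the my_maxn name is made up):

-- Largest positive numeric key, or 0 if there is none (what table.maxn returned).
local function my_maxn(t)
  local max = 0
  for k in pairs(t) do
    if type(k) == "number" and k > max then max = k end
  end
  return max
end

local t = {1, 2, nil, 4}
t[10] = true
print(my_maxn(t))   --> 10, while #t could legitimately be 2, 4, or 10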
The following is slightly on the amusing side but possibly with a grain of aesthetics. Even though there are obvious reasons that mytable:operation() is not supplied like mystring:operation(), one can easily roll one's own variant, and get a third notation if desired.
Table = {}
Table.__index = table

function Table.new()
  local t = {}
  setmetatable(t, Table)
  return t
end

mytable = Table.new()
mytable:insert('Hello')
mytable:insert('World')

for _, s in ipairs(mytable) do
  print(s)
end
insert can insert at an arbitrary position (as its name states); it only defaults to #t + 1, whereas t[#t + 1] = i will always append to the (end of the) table. See section 5.5 in the Lua manual.
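A quick illustration, using only the documented two- and three-argument forms of table.insert:

local t = {"a", "c"}
table.insert(t, "d")           -- two arguments: append at position #t + 1
table.insert(t, 2, "b")        -- three arguments: insert at position 2, shifting the rest up
print(table.concat(t, ","))    --> a,b,c,d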
The # operator only counts the integer-indexed (array-like) part of a table:
t = {1, 2, 3, 4, 5, x = 1, y = 2}
With the code above,
print(#t) --> prints 5, not 7
so # ignores the non-integer keys. If you want to use #, make sure you know which kinds of keys the table holds. table.insert works with any kind of value, but counting every element (for example with pairs) is slower than #.

Constrained Sequence to Index Mapping

I'm puzzling over how to map a set of sequences to consecutive integers.
All the sequences follow this rule:
A_0 = 1
A_n >= 1
A_n <= max(A_0 .. A_n-1) + 1
I'm looking for a solution that will be able to, given such a sequence, compute an integer for doing a lookup into a table, and, given an index into the table, generate the sequence.
Example: for length 3, there are 5 valid sequences. A fast function for doing the following mapping (preferably in both directions) would be a good solution:
1,1,1 0
1,1,2 1
1,2,1 2
1,2,2 3
1,2,3 4
The point of the exercise is to get a packed table with a 1-1 mapping between valid sequences and cells.
The size of the set is bounded only by the number of unique sequences possible.
I don't know now what the length of the sequence will be but it will be a small, <12, constant known in advance.
I'll get to this sooner or later, but thought I'd throw it out for the community to have "fun" with in the meantime.
These are all valid sequences:
1,1,2,3,2,1,4
1,1,2,3,1,2,4
1,2,3,4,5,6,7
1,1,1,1,2,3,2
These are not:
1,2,2,4
2,
1,1,2,3,5
Related to this
There is a natural sequence indexing, but it is not so easy to calculate.
Let's look at A_n for n > 0, since A_0 = 1.
Indexing is done in 2 steps.
Part 1:
Group sequences by the places where A_n = max(A_0 .. A_n-1) + 1. Call these places steps.
On steps, the values are the consecutive numbers 2, 3, 4, 5, ...
On non-step places we can put any number from 1 up to the current maximum, i.e. one more than the number of steps at earlier positions.
Each group can be represented as a binary string where 1 is a step and 0 a non-step. E.g. 001001010 means the group 112aa3b4c, with a <= 2, b <= 3, c <= 4. Because groups are indexed by a binary number, there is a natural indexing of groups, from 0 to 2^length - 1. Let's call the value of a group's binary representation the group order.
Part 2:
Index the sequences inside a group. Since the group defines the step positions, only the numbers on non-step positions are variable, and they vary within defined ranges. With that it is easy to index a sequence of a given group inside that group, using lexicographic order of the variable places.
It is easy to calculate the number of sequences in one group: it is a product of the form 1^i_1 * 2^i_2 * 3^i_3 * ..., where i_j is the number of non-step places whose allowed range is 1..j.
Combining:
This gives a two-part key, <Steps, Group>, which then needs to be mapped to the integers. To do that we have to find how many sequences are in groups that have order less than some value. For that, let's first find how many sequences there are in groups of a given length. That can be computed by passing through all groups and summing the numbers of sequences, or with a recurrence. Let T(l, n) be the number of sequences of length l (A_0 is omitted) where the maximal value of the first element can be n+1. Then:
T(l,n) = n*T(l-1,n) + T(l-1,n+1)
T(1,n) = n
Because l + n <= sequence length + 1, there are about sequence_length^2/2 values of T(l,n), which can easily be calculated.
Next is to calculate the number of sequences in groups of order less than or equal to a given value. That can be done by summing T(l,n) values. E.g. the number of sequences in groups with order <= 1001010 (binary) is equal to
T(7,1) + # for 1000000
2^2 * T(4,2) + # for 001000
2^2 * 3 * T(2,3) # for 010
Optimizations:
This will give a mapping, but the direct implementation for combining the key parts is >O(1) at best. On the other hand, the Steps portion of the key is small, and by computing the range of Groups for each Steps value, a lookup table can reduce this to O(1).
I'm not 100% sure about the formula above, but it should be something like that.
With these remarks and the recurrence it is possible to write the functions sequence -> index and index -> sequence. But it is not so trivial :-)
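For what it is worth, here is a sketch in Lua (the language used elsewhere on this page) of a more direct alternative: rank and unrank the sequences in plain lexicographic order using a count of valid completions, rather than the group decomposition described above. All function names are made up; treat it as an illustration, not this answer's exact scheme.

-- completions(l, m): how many valid tails of length l exist when the running
-- maximum so far is m. The next element may be 1..m (max unchanged) or m+1.
local function completions(l, m, memo)
  if l == 0 then return 1 end
  memo = memo or {}
  local key = l .. ":" .. m
  if not memo[key] then
    memo[key] = m * completions(l - 1, m, memo) + completions(l - 1, m + 1, memo)
  end
  return memo[key]
end

-- sequence -> zero-based index in lexicographic order (assumes seq[1] == 1)
local function seq_to_index(seq)
  local index, maxv, memo = 0, 1, {}
  for i = 2, #seq do
    local v, rest = seq[i], #seq - i
    for smaller = 1, v - 1 do   -- count sequences with a smaller value here
      local newmax = (smaller == maxv + 1) and maxv + 1 or maxv
      index = index + completions(rest, newmax, memo)
    end
    if v == maxv + 1 then maxv = maxv + 1 end
  end
  return index
end

-- zero-based index -> sequence of the given length
local function index_to_seq(index, length)
  local seq, maxv, memo = {1}, 1, {}
  for i = 2, length do
    local rest = length - i
    for v = 1, maxv + 1 do
      local newmax = (v == maxv + 1) and maxv + 1 or maxv
      local block = completions(rest, newmax, memo)
      if index < block then
        seq[i], maxv = v, newmax
        break
      end
      index = index - block
    end
  end
  return seq
end

-- The five length-3 sequences from the question map to 0..4 and back:
for i = 0, 4 do
  print(i, table.concat(index_to_seq(i, 3), ","))   --> 1,1,1  1,1,2  ...  1,2,3
end
print(seq_to_index({1, 2, 3}))                      --> 4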
I think a hash without sorting should do it.
As A_0 always starts at 1, maybe we can think of the sequence as a number in base 12 and use its base-10 value as the key for the lookup. (Still not sure about this.)
This is a Python function which can do the job for you, assuming you have these values stored in a file and you pass the lines to the function:
def valid_lines(lines):
    # Yield the sequences that satisfy A_0 == 1 and A_n <= max(A_0..A_{n-1}) + 1
    for line in lines:
        seq = [int(x) for x in line.split(",") if x.strip()]
        ok, running_max = bool(seq) and seq[0] == 1, 1
        for v in seq[1:]:
            ok = ok and 1 <= v <= running_max + 1
            running_max = max(running_max, v)
        if ok:
            yield seq

lines = (line for line in open('/tmp/numbers.txt'))
for valid_line in valid_lines(lines):
    print(valid_line)
Given the sequence, I would sort it, then use the hash of the sorted sequence as the index of the table.
