How does Lua 5.x represent sparse arrays?

Say, I have an array like this:
T = {1,2,[1000] = 3, [-1] = -1}
I know 1 and 2 will be in the contiguous array part and -1 will be in the hash part.
But I don't know where 3 will be, or how it will be represented 'inside' Lua.
Would there be 997 wasted slots between 2 and 3? Would 3 be delegated to the hash part for efficiency? Would there be two linked contiguous tables, one starting at index 1 and a second starting at index 1000?

It depends on which version of Lua you use. In Lua 4, tables are implemented strictly as hash tables. In Lua 5, tables are part hash table and part array; see The Implementation of Lua 5.0, where Section 4 covers tables and sparse arrays.
The array part tries to store the values corresponding to integer keys
from 1 to some limit n. Values corresponding to non-integer keys or
to integer keys outside the range are stored in the hash part. ... The
computed size of the array part is the largest n such that at least
half the blocks between 1 and n are in use... and there is at least
one slot used between n/2+1 and n.
In your example, 1000 would likely be outside the initial n, and would not cause the array part to grow as it would be too sparse.
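If it helps to see that sizing rule as code, here is a rough, purely illustrative sketch in Swift (not Lua's actual C source, which lives in ltable.c) of how an array-part size could be chosen from a table's positive integer keys:

// Pick the array-part size n: the largest power of two such that more than
// half of the slots 1...n would be occupied by existing keys.
func arrayPartSize(forKeys keys: Set<Int>) -> Int {
    let positive = keys.filter { $0 >= 1 }
    var best = 0                              // 0 means everything goes to the hash part
    var n = 1
    while positive.count > n / 2 {            // a larger n could still be half full
        let used = positive.filter { $0 <= n }.count
        if used > n / 2 { best = n }          // more than half of 1...n in use
        n *= 2
    }
    return best
}

With the keys from the question, arrayPartSize(forKeys: [1, 2, 1000]) returns 2: 1 and 2 sit in the array part, while 3 (at key 1000) and -1 land in the hash part, and no space is wasted on keys 3 to 999.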

You shouldn't need to worry about these details: just trust that Lua tables are implemented efficiently with expected constant-time access to an entry given its key. The array part is just an implementation detail to reduce memory usage by not needing to store some keys.
As explained by rpattiso, there is no memory waste in your example.

Related

Lua table, if the key starts from 1000, will there be a performance loss?

a={}
a[1000]=1
I don't define anything else. Will storage space be occupied for indices 1 to 999?
Also, given that a[1] through a[999] are nil, will a traversal have to walk indices 1 to 1000 in sequence?
No, there is not going to be space allocated for other (1-999) elements (unless you create 1000 elements and then delete 1-999). Lua supports "sparse" arrays and will use the hash part of the table to store those key/value pairs.
If you are asking whether a[1000] is going to be slower if 1-999 elements are not present, then it's possible, as in this case the hash part of the table is going to be used (instead of the array part), but you'll have to benchmark to see if there is any observable difference that matters in your case.

Swift 3 and Index of a custom linked list collection type

In Swift 3, Collection indices have to conform to Comparable instead of Equatable.
The full story can be read here: swift-evolution/0065.
Here's a relevant quote:
Usually an index can be represented with one or two Ints that
efficiently encode the path to the element from the root of a data
structure. Since one is free to choose the encoding of the “path”, we
think it is possible to choose it in such a way that indices are
cheaply comparable. That has been the case for all of the indices
required to implement the standard library, and a few others we
investigated while researching this change.
In my implementation of a custom linked-list collection, a node (pointing to a successor) is the opaque index type. However, given two instances, it is not possible to tell whether one precedes the other without risking traversal of a significant part of the chain.
I'm curious, how would you implement Comparable for a linked list index with O(1) complexity?
The only idea that I currently have is to somehow count steps while advancing the index, storing it within the index type as a property and then comparing those values.
A serious downside of this solution is that indices must be invalidated when the collection is mutated. While that seems reasonable for arrays, I do not want to give up the huge benefit linked lists have: they do not invalidate indices of unchanged nodes.
EDIT:
It can be done at the cost of two additional integer properties on the collection, assuming the singly linked list implements front insert, front remove and back append. Any meddling in the middle would break the O(1) complexity requirement anyway.
Here's my take on it.
a) I introduced one private integer type property to my custom Index type: depth.
b) I introduced two private integer type properties to the collection: startDepth and endDepth, which both default to zero for an empty list.
Each front insert decrements the startDepth.
Each front remove increments the startDepth.
Each back append increments the endDepth.
Thus all indices in startIndex..<endIndex map onto the integer range startDepth..<endDepth.
c) Whenever the collection vends an index via startIndex or endIndex, that index inherits the corresponding depth value from the collection. When the collection is asked to advance an index via index(after:), I simply initialize a new Index instance with an incremented depth value (depth + 1).
Conforming to Comparable boils down to comparing left-hand side depth value to the right-hand side one.
Note that because I expand the integer range from both sides as well, all the depth values for the middle indices remain unchanged (thus are not invalidated).
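A rough sketch of the Index type described in (a)-(c) might look like this (the names Node and ListIndex are mine, and the surrounding collection type with its startDepth/endDepth bookkeeping is omitted):

// A plain singly linked node; the index wraps a reference to it.
final class Node<Element> {
    var value: Element
    var next: Node<Element>?
    init(_ value: Element, next: Node<Element>? = nil) {
        self.value = value
        self.next = next
    }
}

// The index carries the node plus its depth, so ordering two indices is a
// single integer comparison regardless of their distance in the chain.
struct ListIndex<Element>: Comparable {
    let node: Node<Element>?      // nil plays the role of endIndex's node
    let depth: Int                // seeded from the collection's startDepth/endDepth

    static func == (lhs: ListIndex, rhs: ListIndex) -> Bool { lhs.depth == rhs.depth }
    static func < (lhs: ListIndex, rhs: ListIndex) -> Bool { lhs.depth < rhs.depth }
}

// index(after:) in the collection just walks one node forward and bumps the depth.
func index<Element>(after i: ListIndex<Element>) -> ListIndex<Element> {
    ListIndex(node: i.node?.next, depth: i.depth + 1)
}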
Conclusion:
The benefit of O(1) index comparisons comes at the cost of a minor increase in memory footprint and a few integer increments and decrements. I expect index lifetimes to be short and the number of collections relatively small.
If anyone has a better solution I'd gladly take a look at it!
I may have another solution. If you use floats instead of integers, you can get a kind of O(1) insertion-in-the-middle performance if you set the sortIndex of the inserted node to a value between the predecessor's and the successor's sortIndex. This would require storing (and updating) the predecessor's sortIndex on your nodes (I imagine this should not be too hard, since it is only changed on insertion or removal and can always be propagated 'up').
In your index(after:) method you need to query the successor node, but since you use your node as the index, that is straightforward.
One caveat is the finite precision of floating point: if on insertion the distance between the two sort indices is too small, you need to reindex at least part of the list. Since you said you only expect small scale, I would just go through the whole list and use each node's position for that.
This approach has all the benefits of your own, with the added benefit of good performance on insertion in the middle.
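A rough sketch of that variant (hypothetical names again; the re-keying step on precision exhaustion is left out):

// Each node carries a Double sortKey used only for O(1) index comparison.
final class KeyedNode<Element> {
    var value: Element
    var next: KeyedNode<Element>?
    var sortKey: Double
    init(_ value: Element, sortKey: Double, next: KeyedNode<Element>? = nil) {
        self.value = value
        self.sortKey = sortKey
        self.next = next
    }
}

// Insert after `node`, picking a key halfway between its key and its successor's.
// With finite precision the midpoints eventually collide, at which point at
// least part of the list has to be re-keyed, as the caveat above notes.
@discardableResult
func insert<Element>(_ value: Element, after node: KeyedNode<Element>) -> KeyedNode<Element> {
    let nextKey = node.next?.sortKey ?? node.sortKey + 1
    let newNode = KeyedNode(value, sortKey: (node.sortKey + nextKey) / 2, next: node.next)
    node.next = newNode
    return newNode
}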

Fortran entries of array change seemingly at random

I have been working with a FORTRAN program. I have noticed seemingly random changes in a 1D matrix I'm working with. It is a matrix of 4000 integers. Values are added to the matrix one by one, starting with index 1 and iterating by 1 for each added value. The matrix does not get fully "filled", currently only 100 values are placed into the matrix. So one would expect that the first 100 entries of the matrix will be non-zero (all added values are non-zero) and the remaining 3900 entries will be 0. However, several of the entries of the matrix end up being large negative numbers, but I'm certain that no portion of my code touches these entries.
What could be causing this issue? I'm sorry but I can't post the code for you all to work with.
The code has several other large matrices, taking up a total of ~100 MB of space. Could this potentially be a memory issue?
Thanks!
You have to initialize your array, otherwise it will almost always contain garbage. This would do it:
array = 0.0e0 ! real array
or
array = 0.0d0 ! double precision
or
array = 0 ! integer
A "matrix" is two-dimensional; your array is one-dimensional.
Things do not change unless you ask them to change.
FORTRAN does not initialize variables other than (as I recall) in a labeled COMMON; their initial values are undefined, so they may well start out as garbage. Try initializing your data with a DATA statement. If you have to initialize a labeled COMMON, you will have to supply a BLOCK DATA subprogram.

Redis Data Structure Space Requirements

What is the difference in space between sorted sets and lists in Redis? My guess is that sorted sets are some kind of balanced binary tree, and lists are a linked list. On top of the three values I'm encoding for each entry (key, score, value; I'll munge score and value together for the linked list), the overhead is that the linked list needs to keep track of one other node and the binary tree needs to keep track of two, so the extra space for using a sorted set is O(N).
If my value and score are both longs, and the pointers to other nodes are also longs, it seems like a single node goes from 3 longs to 4 longs on a 64-bit computer, which is a 33% increase in space.
Is this true?
It is much more than your estimation. Let's suppose ziplists are not used (i.e. you have a significant number of items).
A Redis list is a classical doubly linked list: 3 pointers (prev, next, value) per item.
A sorted set is a dictionary plus a skip list. In the dictionary, items are also stored with 3 pointers (key, value, next). The skip list memory footprint is more complex to evaluate: each node takes 1 double (score), 2 pointers (obj, backward), plus n couples (pointer, span value) with n between 1 and 32. Most items take only 1 or 2 couples.
In other words, when it is not represented as a ziplist, a sorted set is by far the Redis data structure with the most overhead. Compared to a list, the memory overhead is more than 200% (i.e. 3 times).
Note: the best way to evaluate memory consumption with Redis is to try to build a big list or sorted set with pseudo-data and use INFO to get the memory footprint.

Lookup table size reduction

I have an application in which I have to store a couple of million integers in a lookup table. Obviously I cannot store that amount of data in memory; the data has to live on an embedded system, so I am very limited in space. I would therefore like to ask about recommended methods for reducing the size of the lookup table. I cannot use function approximation such as neural networks; the values need to be in a table. The range of the integers is not known at the moment. When I say integers, I mean 32-bit values.
Basically the idea is to use some compression method to reduce the amount of memory without losing much precision. This needs to run in hardware, so the computation overhead cannot be very high.
In my algorithm I have to access one value of the table, do some operations with it, and afterwards update the value. In the end what I should have is a function to which I pass an index and get a value back, and another function to write a value into the table.
I found one method called tile coding, which is based on several lookup tables. Does anyone know any other method?
Thanks.
I'd look at the types of numbers you need to store and pull out the information that's common for many of them. For example, if they're tightly clustered, you can take the mean, store it, and store the offsets. The offsets will have fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then store the offset to the next number.
It would help to know what your key is to look up the numbers.
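For instance, a minimal sketch of that offsets-from-a-base idea (hypothetical Swift, assuming the values cluster within a 16-bit range of their minimum):

// Store one 32-bit base plus a 16-bit offset per value: 2 bytes per entry
// instead of 4. Returns nil if the values are spread too far apart.
func deltaEncode(_ values: [Int32]) -> (base: Int32, offsets: [UInt16])? {
    guard let base = values.min() else { return nil }
    var offsets: [UInt16] = []
    offsets.reserveCapacity(values.count)
    for v in values {
        let delta = Int(v) - Int(base)                     // always >= 0
        guard delta <= Int(UInt16.max) else { return nil } // cluster too wide
        offsets.append(UInt16(delta))
    }
    return (base, offsets)
}

func deltaDecode(base: Int32, offset: UInt16) -> Int32 {
    base + Int32(offset)
}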
I need more detail on the problem. If you cannot store the real value of the integers but instead an approximation, that means you are going to reduce (throw away) some of the data (detail), correct? I think you are looking for a hash, which can be an art form in itself. For example, say you have 32-bit values: one hash would be to take the 4 bytes and xor them together, resulting in a single 8-bit value, reducing your storage by a factor of 4 but also reducing the real value of the original data. Typically you could/would go further and perhaps only use a few of those 8 bits, say the lower 4, and reduce the value further.
I think the real question is whether you need the data or you don't. If you need the data, you need to compress it or find more memory to store it. If you don't, then use a hash of some sort to reduce the number of bits until you reach the amount of memory you have for storage.
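For what it's worth, the xor-folding idea above would look something like this (Swift, illustrative only; lossy by design):

// Fold a 32-bit value down to 8 bits by xoring its four bytes together.
// Many different inputs map to the same output, so this is only usable when
// an approximate / lossy value is acceptable.
func xorFold(_ x: UInt32) -> UInt8 {
    let bytes = [UInt8(x & 0xFF),
                 UInt8((x >> 8) & 0xFF),
                 UInt8((x >> 16) & 0xFF),
                 UInt8((x >> 24) & 0xFF)]
    return bytes.reduce(0, ^)
}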
Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the
use of a parameterized functional form
to represent the value function
(and/or the policy), as opposed to a
simple table."
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If you have numbers in the range of 1 to 32 million, you'll require 4 MB of memory for a big table covering all 32M distinct numbers.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
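A minimal bit-array sketch along those lines (Swift here rather than Python, illustrative only):

// One bit per possible value: 8 million values fit in 1 MB, 32 million in 4 MB.
struct BitSet {
    private var words: [UInt64]

    init(maxValue: Int) {
        words = Array(repeating: 0, count: (maxValue + 64) / 64)
    }

    mutating func insert(_ n: Int) {
        words[n >> 6] |= (1 as UInt64) << (n & 63)
    }

    func contains(_ n: Int) -> Bool {
        words[n >> 6] & ((1 as UInt64) << (n & 63)) != 0
    }
}

// var seen = BitSet(maxValue: 32_000_000)              // about 4 MB of storage
// seen.insert(12_345_678); seen.contains(12_345_678)   // true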
If you are merely looking for the presence of the number in question, a Bloom filter might be what you are looking for. Honestly, though, your question is fairly vague and confusing. It would help to explain what the Q values are and what you do with them once you find them in the table.
If your set of integers is homogeneous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers, in your case, in half.
Assume the integer n itself, because its set is homogeneous, can be the hash. Assume you have 0x10000 (64K) buckets. Each bucket index is iBucket = n & 0xFFFF. Each item in a bucket need only store 16 bits, since the low 16 bits are the bucket index. The other thing you have to do to keep the data small is to store the count of items in the bucket and use an array to hold the items in the bucket; a linked list would be too large and slow. When you iterate the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assume a bucket is a pointer to the array plus a count. On a 32-bit system this is 64 bits max; if the number of ints were small enough, we might be able to do some fancy things and use 32 bits for a bucket. 64K buckets * 8 bytes = 512KB, and 2 million 16-bit values = 4MB. So this gets you a method to look up the ints and about 40% compression, as sketched below.
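Sketched here in Swift (illustrative only; an embedded C version would preallocate fixed-size buckets rather than grow arrays, but the layout is the same idea):

// 0x10000 buckets indexed by the low 16 bits of the value; each bucket stores
// only the high 16 bits, halving the per-value payload.
struct BucketTable {
    private var buckets = [[UInt16]](repeating: [], count: 1 << 16)

    mutating func insert(_ value: UInt32) {
        let bucket = Int(value & 0xFFFF)       // low 16 bits pick the bucket
        let rest = UInt16(value >> 16)         // high 16 bits are all that is stored
        if !buckets[bucket].contains(rest) {
            buckets[bucket].append(rest)
        }
    }

    func contains(_ value: UInt32) -> Bool {
        buckets[Int(value & 0xFFFF)].contains(UInt16(value >> 16))
    }
}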
