How to efficiently extract all values in a large master table that start with a specified string - lua

I'm designing the UI for a Lua program, one element of which requires the user to either select an existing value from a master table or create a new value in that table.
I would normally use an IUP list with EDITBOX = "YES".
However, the number of items that the user can select may run into many hundreds or possibly thousands, and the performance when populating the list in iup (and also selecting from it) is unacceptably slow. I cannot control the number of items in the table.
My current thinking is to create a list with an editbox, but without any values. As the user types into the editbox (after perhaps 2-3 characters) the list would populate with the subset of table items that start with the characters typed. The user could then select an item from the list or keep typing to narrow the options or create a new item.
For this to work, I need to be able to create a new table with the items from the master table that start with the entered characters.
One option would be to iterate through the master table using the Penlight 'startswith' function to create the new table:
require "pl.init"
local subtable = {} --empty result table
local startstring = "xyz" -- will actually be set by the iup control
for _, v in ipairs (mastertable) do
if stringx.startswith(v, startstring) then
table.insert(subtable,v)
end
end
However, I'm worried about the performance of doing that if the master table is huge. Is there a more efficient way to code this, or a different way I could implement the UI?

There are various approaches you can take to improve the big-O performance of your prefix search, at the cost of increased code complexity; that said, given the size of your dataset (thousands of items) and the intended use (triggered by user interaction, rather than e.g. game logic that needs to run every frame), I think a simple linear search over the options is almost certainly going to be fast enough.
To test this theory, I timed the following code:
local dict = {}
for word in io.lines('/usr/share/dict/words') do
table.insert(dict, word)
end
local matched = {}
local search = "^" .. (...)
for _,word in ipairs(dict) do
if word:match(search) then
table.insert(matched, word)
end
end
print('Found '..#matched..' words.')
I used /usr/bin/time -v and tried it with both lua 5.2 and luaJIT.
Note that this is fairly pessimistic compared to your code:
no attempt made to localize library functions that are repeatedly called, or use # instead of table.insert
timing includes not just the search but also the cost of loading the dictionary into memory in the first place
string.match is almost certainly slower than stringx.startswith
dictionary contains ~100k entries rather than the "hundreds to thousands" you expect in your application
Even with all those caveats, it costs 50-100ms in lua5.2 and 30-50ms in luaJIT, over 50 runs.
If I use os.clock() to time the actual search, it consistently costs about 10ms in lua5.2 and 3-4 in luajit.
Now, this is on a fairly fast laptop (Core i7), but also non-optimized code running on a dataset 10-100x larger than you expect to process; given that, I suspect that the naïve approach of just looping over the entries calling startswith will be plenty fast for your purposes, and result in code that's significantly simpler and easier to debug.

Related

Garbage collection and the memoization of a string-to-string function

The following exercise comes from p. 234 of Ierusalimschy's Programming in Lua (4th edition). (NB: Earlier in the book, the author explicitly rejects the word memoization, and insists on using memorization instead. Keep this in mind as you read the excerpt below.)
Exercise 23.3: Imagine you have to implement a memorizing table for a function from strings to strings. Making the table weak will not do the removal of entries, because weak tables do not consider strings as collectable objects. How can you implement memorization in that case?
I am stumped!
Part of my problem is that I have not been able to devise a way to bring about the (garbage) collection of a string.
In contrast, with a table, I can equip it with finalizer that will report when the table is about to be collected. Is there a way to confirm that a given string (and only that string) has been garbage-collected?
Another difficulty is simply figuring out what the desired function's specification is. The best I can do is to figure out what it isn't. Earlier in the book (p. 225), the author gave the following example of a "memorizing" function:
Imagine a generic server that takes requests in the form of strings with Lua code. Each time it gets a request, it runs load on the string, and then calls the resulting function. However, load is an expensive function, and some commands to the server may be quite frequent. Instead of calling load repeatedly each time it receives a common command like "closeconnection()", the server can memorize the results from load using an auxiliary table. Before calling load, the server checks in the table whether the given string already has a translation. If it cannot find a match then (and only then) the server calls load and stores the result into the table. We can pack this behavior in a new function:
[standard memo(r)ized implementation omitted; see variant using a weak-value table below]
The savings with this scheme can be huge. However, it may also cause unsuspected waste. ALthough some commands epeat over and over, many other commands happen only once. Gradually, the ["memorizing"] table results accumulates all commands the server has ever received plus their respective codes; after enough time, this behavior will exhaust the server's memory.
A weak table provides a simple solution to this problem. If the results table has weak values, each garbage-collection cycle will remove all translations not in use at that moment (which means virtually all of them)1:
local results = {}
setmetatable(results, {__mode = "v"}) -- make values weak
function mem_loadstring (s)
local res = results[s]
if res == nil then -- results not available?
res = assert(load(s)) -- compute new results
result[s] = res -- save for later reuse
end
return res
end
As the original problem statement notes, this scheme won't work when the function to be memo(r)ized returns strings, because the garbage collector does not treat strings as "collectable".
Of course, if one is allowed to change the desired function's interface so that instead of returning a string, it returns a singleton table whose sole item is the real result string, then the problem becomes almost trivial, but I find it hard to believe that the author had such a crude "solution" in mind2.
In case it matters, I am using Lua 5.3.
1 As an aside, if the rationale for memo(r)ization is to avoid invoking load more often than necessary, the scheme proposed by the author does not make sense to me. It seems to me that this scheme is based on the assumption (a heuristic, really) that a translation that is used frequently, and thus would pay to memo(r)ize, is also one that is always reachable (and hence not collectable). I don't see why this should necessarily, or even likely, be the case.
2 One may be able to put lipstick on this pig in the form of a __tostring method that would allow the table (the one returned by the memo(r)ized function) to masquerade as a string in certain contexts; it's still a pig, though.
Your idea is correct: wrap string into a table (because table is collectable).
function memormoize (func_from_string_to_string)
local cached = {}
setmetatable(cached, {__mode = "v"})
return
function(s)
local c = cached[s] or {func_from_string_to_string(s)}
cached[s] = c
return c[1]
end
end
And I see no pigs in this solution :-)
one that is always reachable (and hence not collectable). I don't see why this should necessarily, or even likely, be the case.
There will be no "always reachable" items in a weak table.
But most frequent items will be recalculated only once per GC cycle.
The ideal solution (never collect frequently used items) would require more complex implementation.
For example, you can move items from normal cache to weak cache when item's "inactivity timer" reaches some threshold.

Lua tables: performance hit for starting array indexing at 0?

I'm porting FFT code from Java to Lua, and I'm starting to worry a bit about the fact that in Lua the array part of a table starts indexing at 1 while in Java array indexing starts at 0.
For the input array this causes no problem because the Java code is set up to handle the possibility that the data under consideration is not located at the start of the array. However, all of the working arrays internal to the code are assumed to starting indexing at 0. I know that the code will work as written -- Lua tables are awesome like that -- but I have no sense at all about the performance hit I might incur by having the "0" element of the array going into the hash table part of the underlying C structure (or indeed, if that is what will happen).
My question: is this something worth worrying about? Should I be planning to profile and hand-optimize the code? (The code will eventually be used to transform many relatively small (> 100 time points) signals of varying lengths not known in advance.)
I have made small, probably not that reliable, test:
local arr = {}
for i=0,10000000 do
arr[i] = i*2
end
for k, v in pairs(arr) do
arr[k] = v*v
end
And similar version with 1 as the first index. On my system:
$ time lua example0.lua
real 2.003s
$ time lua example1.lua
real 2.014s
I was also interested how table.insert will perform
for i=1,10000000 do
table.insert(arr, 2*i)
...
and, suprisingly
$ time lua example2.lua
real 6.012s
Results:
Of course, it depends on what system you're running it, probably also whic lua version, but it seems that it makes little to no difference between zero-start and one-start. Bigger difference is caused by the way you insert things to array.
I think the correct answer in this case is changing the algorithm so that everything is indexed with 1. And consider that part of the conversion.
Your FFT will be less surprising to another Lua user (like me), given that all "array-like" tables are indexed by one.
It might not be as stressful as you might think, given the way numeric loops are structured in Lua (where the "start" and the "end" are "inclusive"). You would be exchanging this:
for i=0,#array-1 do
... (do stuff with i)
end
By this:
for i=1,#array do
... (do stuff with i)
end
The non-numeric loops would remain unchanged (except that you will be able to use ipairs too, if you so desire).

Redis Capped Sorted Set, List, or Queue?

Has anyone implemented a capped data-structure of any kind in Redis? I'm working on building something like a news feed. The feed will wind up being manipulated and read from very frequently, and holding it in a sorted set in Redis would be cheap and perfect for my use case. The only issue is I only ever need n items per feed, and I'm worried about memory overflow, so I'd like to ensure each feed never gets above n items. It seems pretty trivial to make a capped sorted collection in Redis with Lua:
redis-cli EVAL "$(cat update_feed.lua)" 1 feeds:some_feed "thing_to_add", n
Where update_feed.lua looks something like (without testing it):
redis.call('ZADD', KEYS[1], os.time(), ARGV[1])
local num = redis.call('ZCARD', KEYS[1])
if num > ARGV[2]:
redis.call('ZREMRANGEBYRANK', KEYS[1], -n, -inf)
That's not bad at all, and pretty cheap, but it seems like such a basic thing that could be doable much more cheaply by instantiating the sorted set with only n buckets to begin with. I can't find a way to do that in redis, so I guess my question is: did I miss something, and if I didn't, why is there no structure for this in redis, even if it just ran the basic Lua script I described, it seems like it would be a typical enough use-case that it ought to be implemented as an option for redis data structures?
You can use LTRIM if it is a list.
Excerpt from the documentation.
LPUSH mylist someelement
LTRIM mylist 0 99
This pair of commands will push a new element on the list, while making sure that the list will not grow larger than 100 elements. This is very useful when using Redis to store logs for example. It is important to note that when used in this way LTRIM is an O(1) operation because in the average case just one element is removed from the tail of the list.
I use sorted sets myself for this. I too thought about using lists, but then I found that manipulating the INSIDE of a list is fairly expensive -- O(n) -- while manipulating the inside of a sorted set is O(log n).
That's what sealed the deal for me--will you ever be manipulating the inside of the set? If so, stick with sorted sets and just flush the oldest whenever you have to, just like you were thinking.

How to count occurrences of a substring within string fast with Ruby

I have a text file sized 300MB, I want to count the occurrences of each 10,000 substrings in the file. I want to know how to do it fast.
Now, I use the following code:
content = IO.read("path/to/mytextfile")
Word.each do |w|
w.occurrence = content.scan(w.name).size
w.save
end
Word is an ActiveRecord class.
It took me almost 1 day to finish the counting. Is there anyway to do it faster? Thanks.
Edit1:
Thank you again. I am running rails 2.3.9. The name filed of words table contains what I am searching for, and it contains only unique values. Instead of using Word.each, I use batch(1000 rows a time) load. It should help.
I rewrited the whole code with the idea from bpaulon. Now it only took a few hours to finish the counting.
I profiled the new version code, now the largest time costing methods are utf8 encode supported string truncating code
def truncate(n)
self.slice(/\A.{0,#{n}}/m)
end
and characters counting code
def utf8_length
self.unpack('U*').size
end
Any other faster methods to replace them?
Your use of scan creates an array, counts the size of it, then throws it away. If you have a lot of occurrences of the substring inside a big file, you will create a big array temporarily, potentially burning up CPU time with memory management, but that should still run pretty quickly, even with 300MB.
Because Word is an ActiveRecord class, it is dependent on the schema and any indexes in your database, plus any issues your database server might be having. If the database is not optimized or is responding slowly or the query used to retrieve the data is not efficient, then the iteration will be slow. You might find it a lot faster to grab groups of Word so they are in RAM, then iterate over them.
And, if the database and your code are running on the same machine, you could be suffering from resource constraints like having only one drive, not enough RAM, etc.
Without knowing more about your environment and hardware it's hard to say.
EDIT:
I can grab the substrings into an array/hash first, then add the count results to the array or hash, and write the results back to database after all the counting is done. You think it be faster, right?
No, I doubt that will help a lot, and, without knowing where the problem lies all you might do is make the problem worse because you'll have to load 10,000 records as objects from the database, then build a 10,000 element hash or array which will also be in memory along with the DB records, then write them out.
Ruby will only use a single core, currently, but you can gain speed by using Ruby 1.9+. I'd recommend installing RVM and letting it manage your Ruby. Be sure to read the instructions on that page, then run rvm notes and follow those directions.
What is your Word model and the underlying schema and indexes look like? Is the database on the same machine?
EDIT: From looking at your table schema, you have no indexes except for id which really won't help much for normal look-ups. I'd recommend presenting your schema on Stack Overflow's sibling site https://dba.stackexchange.com/ and explain what you want to do. At a minimum I'd add a key to the text fields to help avoid full table scans for any searches you do.
What might help more is to read: Retrieving Multiple Objects in Batches from "Active Record Query Interface".
Also, look at the SQL being emitted when your Word.each is running. Is it something like "select * from word"? If so, Rails is pulling in 10,000 records to iterate over them one by one. If it is something like "select * from word where id=1" then for every record you have a database read followed by a write when you update the count. That is the scenario that the "Retrieving Multiple Objects in Batches" link will help fix.
Also, I am guessing that content is the text you are searching for, but I can't tell for sure. Is it possible you have duplicated text values causing you to do scans more than once for the same text? If so, select your records using a unique condition on that field and then update your counts for all matching records at one time.
Have you profiled your code to see if Ruby itself can help you pinpoint the problem? Modify your code a little to process 100 or 1000 records. Start the app with the -r profile flag. When the app exits profiler will output a table showing where time was spent.
What version of Rails are you running?
I think you could approach this problem differently
You do not need to scan the file this many times, you could create a db, like in mongo or mysql, and for each word you find, you fetch the db for it and then adds on some "counter" field.
You could ask me "but then I will have to scan my database a lot and it could take a lot more". Well, sure you wouldn't ask this, but it won't take more time because databases are focused in IO, besides you could always index it.
EDIT: There is no way to delimit at all?? Let's say that where you have the a Word.name string you really holds a (not simple) regex. Could the regex contain the \n? Well, if the regex can contain any value, you should estimate the maximum size of string the regex can fetch, double it, and scan the file by that ammount of chars but moving the cursor by that number.
Lets say your estimate of the maximum your regex could fetch it is like 20 chars nad your file has from 0 to 30000 chars. You pass each regex you have from 0 to 40 chars, then again from 20 to 60, from 40 to 80, etc...
You should also hold the position you found of your smaller regex so it wouldn't repeat it.
Finally, this solution seems to be not worth the effort, your problem may have a greater solution based on what that regexes are, but it will be faster than invoke scan Words.count times your your 300Mb string.
You could load your entire "Word" table into a Trie, then do back-tracking since you said there are no delimiters in the text.
So for each character in the text, go down the Trie of words. If you hit a word, increment its count. "Going down the trie" involves three cases:
There's no node at this character. (If you're mid-search, pop the back-tracking stack)
There's a node at this character. (But it's not a Word)
There's a node at this character. (It's a Word - increment and "dirty")
Back-tracking is just keeping track of places you want to go after you've exhausted this "search" of the Trie, which is when you run out of nodes to visit. This will probably be each character you visit that is a root of the Trie.
After you've done this, you can then visit all the nodes you changed and just update the records they represent.
This will take some time to implement, but will surely be faster than each & scan.

(Secure) Random string?

In Lua, one would usually generate random values, and/or strings by using math.random & math.randomseed, where os.time is used for math.randomseed.
This method however has one major weakness; The returned number is always just as random as the current time, AND the interval for each random number is one second, which is way too long if one needs many random values in a very short time.
This issue is even pointed out by the Lua Users wiki: http://lua-users.org/wiki/MathLibraryTutorial, and the corresponding RandomStringS receipe: http://lua-users.org/wiki/RandomStrings.
So I've sat down and wrote a different algorithm (if it even can be called that), that generates random numbers by (mis-)using the memory addresses of tables:
math.randomseed(os.time())
function realrandom(maxlen)
local tbl = {}
local num = tonumber(string.sub(tostring(tbl), 8))
if maxlen ~= nil then
num = num % maxlen
end
return num
end
function string.random(length,pattern)
local length = length or 11
local pattern = pattern or '%a%d'
local rand = ""
local allchars = ""
for loop=0, 255 do
allchars = allchars .. string.char(loop)
end
local str=string.gsub(allchars, '[^'..pattern..']','')
while string.len(rand) ~= length do
local randidx = realrandom(string.len(str))
local randbyte = string.byte(str, randidx)
rand = rand .. string.char(randbyte)
end
return rand
end
At first, everything seems perfectly random, and I'm sure they are... at least for the current program.
So my question is, how random are these numbers returned by realrandom really?
Or is there an even better way to generate random numbers in a shorter interval than one second (which kind of implies that os.time shouldn't be used, as explaind above), without relying on external libraries, AND, if possible, in an entirely crossplatform manner?
EDIT:
There seems to be a major misunderstanding regarding the way the RNG is seeded; In production code, the call to math.randomseed() happens just once, this was just a badly chosen example here.
What I mean by the random value is only random once per second, is easily demonstrated by this paste: http://codepad.org/4cDsTpcD
As this question will get downvoted regardless my edits, I also cancelled my previously accepted answer - In hope for a better one, even if just better opinions. I understand that issues regarding random values/numbers has been discussed many times before, but I have not found such a question that could be relevant to Lua - Please keep that in mind!
You should not call seed each time you call random, you ought to call it only once, on the program initialization (unless you get the seed from somewhere, for example, to replicate some previous "random" behaviour).
Standard Lua random generator is of poor quality in the statistical sense (as it is, in fact, standard C random generator), do not use it if you care for that. Use, for example, lrandom module (available in LuaRocks).
If you need more secure random, read from /dev/random on Linux. (I think that Windows should have something along the same lines — but you may need to code something in C to use it.)
Relying on table pointer values is a bad idea. Think about alternate Lua implementations, in Java, for example — there is no telling what they would return. (Also, the pointer values may be predictable, and they may be, under certain circumstances the same each time the program is invoked.)
If you want finer precision for the seed (and you will want this only if you're launching the program more often than once per second), you should use a timer with better resolution. For example, socket.gettime() from LuaSocket. Multiply it by some value, since math.randomseed is working with integer part only, and socket.gettime() returns time in (floating point) seconds.
require 'socket'
math.randomseed(socket.gettime() * 1e6)
for i = 1, 1e3 do
print(math.random())
end
This method however has one major
weakness; The returned number is
always just as random as the current
time, AND the interval for each random
number is one second, which is way too
long if one needs many random values
in a very short time.
It has those weaknesses only if you implement it incorrectly.
math.randomseed is supposed to be called sparingly - usually just once at the beginning of your program, and it usually seeds using os.time. Once the seed is set, you can use math.random many times, and it will yield random values.
See what happens on this sample:
> math.randomseed(1)
> return math.random(), math.random(), math.random()
0.84018771715471 0.39438292681909 0.78309922375861
> math.randomseed(2)
> return math.random(), math.random(), math.random()
0.70097636929759 0.80967634907443 0.088795455214007
> math.randomseed(1)
> return math.random(), math.random(), math.random()
0.84018771715471 0.39438292681909 0.78309922375861
When I change the seed from 1 to 2, I get different random results. But when I go back to 1, the "random sequence" is reset. I obtain the same values as before.
os.time() returns an ever-increasing number. Using it as a seed is appropriate; then you can invoke math.random forever and have different random numbers every time you invoke it.
The only scenario you have to be a bit worried about non-randomness is when your program is supposed to be executed more than once per second. In that case, as the others are saying, the simplest solution is using a clock with higher definition.
In other words:
Invoke math.randomseed with an appropiate seed (os.time() is ok 99% of the cases) at the beginning of your program
Invoke math.random every time you need a random number.
Regards!
Some thoughts on the first part of your question:
So my question is, how random are these numbers returned by realrandom really?
Your function is attempting to discover the address of a table by using a quirk of its default implementation of tostring(). I don't believe that the string returned by tostring{} has a specified format, or that the value included in that string has any documented meaning. In practice, it is derived from the address of something related to the specific table, and so distinct tables convert to distinct strings. However, the next version of Lua is free to change that to anything that is convenient. Worse, the format it takes will be highly platform dependent because it appears to use the %p format specifier to sprintf() which is only specified as being a sensible representation of a pointer.
There's also a much bigger issue. While the address of the nth table created in a process might seem random on your platform, tt might not be random at all. Or it might vary in only a few bits. For example, on my win7 box only a few bits vary, and not very randomly:
C:...>for /L %i in (1,1,20) do # lua -e "print{}"
table: 0042E5D8
table: 0061E5D8
table: 0024E5D8
table: 0049E5D8
table: 0042E5D8
table: 0042E5D8
table: 0042E5D8
table: 0064E5D8
table: 0042E5D8
table: 002FE5D8
table: 0042E5D8
table: 0049E5D8
table: 0042E5D8
table: 0042E5D8
table: 0042E5D8
table: 0024E5D8
table: 0042E5D8
table: 0042E5D8
table: 0061E5D8
table: 0042E5D8
Other platforms will vary, of course. I'd even expect there to be platforms where the address of the first allocated table is completely deterministic, and hence identical on every run of the program.
In short, the address of an arbitrary object in your process image is not a very good source of randomness.
Edit: For completeness, I'd like to add a couple of other thoughts that came to mind over night.
The stock tostring() function is supplied by the base library and implemented by the function luaB_tostring(). The relevant bit is this fragment:
switch (lua_type(L, 1)) {
...
default:
lua_pushfstring(L, "%s: %p", luaL_typename(L, 1), lua_topointer(L, 1));
break;
If you really are calling this function, then the end of the string will be an address, represented by standard C sprintf() format %p, strongly related to the specific table. One observation is that I've seen several distinct implementations for %p. Windows MSVCR80.DLL (the version of the C library used by the current release of Lua for Windows) makes it equivalent to %08X. My Ubuntu Karmic Koala box appears to make it equivalent to %#x which notably drops leading zeros. If you are going to parse out that part of the string, then you should do it in a way that is more flexible in the face of variation of the meaning of %p.
Note, also, that doing anything like this in library code may expose you to a couple of surprises.
First, if the table passed to tostring() has a metatable that provides the function __tostring(), then that function will be called, and the fragment quoted above will never be executed at all. In your case, that issue cannot arise because tables have individual metatables, and you didn't accidentally apply a metatable to your local table.
Second, by the time your module loads, some other module or user-supplied code might have replaced the stock tostring() with something else. If the replacement is benign, (such as a memoization wrapper) then it likely doesn't matter to the code as written. However, this would be a source of attack, and is entirely outside the control of your module. That doesn't strike me as a good idea if the goal is some kind of improved security for your random seed material.
Third, you might not be loaded in a stock Lua interpreter at all, and the larger application (Lightroom, WoW, Wireshark, ...) may choose to replace the base library functions with their own implementations. This is a much less likely issue for tostring(), but note that the base library's print() is a frequent target for replacement or removal in alternate implementations and there are modules (Lua Lanes, for one) that break if print is not the implementation in the base library.
A few important things come to mind:
In most other languages you typically only call the random 'seed' function once at the beginning of the program or perhaps at limited times throughout its execution. You generally do not want to call it each time you generate a random number/sequence. If you call it once when the program starts you get around the "once per second" limitation. By calling it each time you may actually end up with less randomness in your results.
Your realrandom() function seems to rely on a private implementation detail of Lua. What happens in the next major release if this detail changes to always return the same number, or only even numbers, etc.... Just because it works for now is not a strong enough guarantee, especially in the case of wanting a secure RNG.
When you say "everything seems perfectly random" how are you measuring this performance? We humans are terrible at determining if a sequence is random or not and just looking at a sequence of numbers would be virtually impossible to truly tell if they were random or not. There are many ways to quantify the "randomness" of a series including frequency distribution, autocorrelation, compression, and many more far beyond my understanding.
If you are writing a true "secure PRNG" for production do not write your own! Investigate and use a library or algorithm by experts who has spent years/decades studying, designing and trying to break it. True secure random number generation is hard.
If you need more info start on the PRNG article on Wikipedia and use the references/links there as needed.

Resources