luajit copy table is slow - memory

Within a larger Lua script, I have to copy several tables dt:
for i=1,dt:nrow() do
  local r = {}
  for j=1,dt:ncol() do
    r[j] = dt[i][j]
  end
  rslt:append(r)
end
The tables are about 50,000 rows x 25 columns, containing mainly doubles. LuaJIT takes about 10 times as long as "standard" Lua. On all other calculations/operations I do beforehand, LuaJIT is faster (1.5 to 3 times).

As silly as this may sound, try pre-allocating the r table with 25 values:
local r = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Unfortunately the stock Lua API doesn't allow pre-allocating tables from script code, so this is the only portable way to avoid the re-allocations caused by the array assignments in the inner loop. My tests show a noticeable improvement, but nothing close to 10x (although I don't use your methods, so your results may vary).
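If you are running under LuaJIT 2.1, a tidier variant is possible: the table.new extension module (sketch below, assuming your LuaJIT build ships it) lets you pre-size the table so the literal with 25 zeros is not needed:
-- LuaJIT 2.1 extension: pre-size the array part (25 slots) and the hash part (0).
local new_tab = require("table.new")

for i = 1, dt:nrow() do
  local r = new_tab(25, 0)   -- no re-allocations while filling the row
  for j = 1, dt:ncol() do
    r[j] = dt[i][j]
  end
  rslt:append(r)
end
If the column count varies, pass dt:ncol() instead of the literal 25.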

Related

Is there a faster way to find primes in lua?

I am working on Project Euler, and my code is just taking way too long to compute. I am supposed to find the sum of all primes less than 2,000,000, but my program would take years to complete. I would try some different ways to find primes, but the problem is that I only know one way.
Anyways, here is my code:
sum=2
flag=0
prime=3
while prime<2000000 do
  for i=2,prime-1 do
    if prime%i==0 then
      flag=1
    end
  end
  if flag==0 then
    print(prime)
    sum=sum+prime
  end
  prime=prime+1
  flag=0
  if prime==2000000 then
    print(sum)
  end
end
Does anyone know of any more ways to find primes, or even a way to optimize this? I always try to figure coding out myself, but this one is truly stumping me.
Anyways, thanks!
This code is based on the Sieve of Eratosthenes.
Whenever a prime is found, its multiples are marked as non-prime; the remaining integers are primes.
nonprimes={}
max=2000000
sum=2
prime=3
while prime<max do
  if not nonprimes[prime] then
    -- found a prime
    sum = sum + prime
    -- mark multiples of prime
    i=prime*prime
    while i < max do
      nonprimes[i] = true
      i = i + 2*prime
    end
  end
  -- primes cannot be even
  prime = prime + 2
end
print(sum)
As an optimization, even numbers are never considered, which halves both the array size and the number of iterations. This is also why the marked multiples of a found prime are of the form (2k+1)*prime.
Your program also had some bugs, and computing on the order of n^2 divisions is very expensive.
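To make that last point concrete, here is a minimal sketch of my own (not the code above) of plain trial division that tests only odd divisors up to the square root and stops at the first hit; it is still far slower than the sieve, but it shows how many of the original divisions are unnecessary:
-- Trial division, for comparison only: odd divisors up to sqrt(n), early exit.
local function is_prime(n)
  if n % 2 == 0 then return n == 2 end
  for d = 3, math.sqrt(n), 2 do
    if n % d == 0 then return false end
  end
  return true
end

local sum = 2                    -- 2 is the only even prime
for p = 3, 1999999, 2 do
  if is_prime(p) then sum = sum + p end
end
print(sum)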

How to filter list 1 without elements of list 2

What is the best way to create a new list based on List1 without elements of List2?
List1 = ["Candy", "Brandy", "Sandy", "Lady", "Baby", "Shady"].
List2 = ["Sandy", "Shady", "Candy", "Sandy"].
The contents of the new list should be:
List3 = ["Brandy", "Lady", "Baby"].
If you're on an Erlang/OTP release older than 22, the best way to do this is to use a module that handles sets, such as ordsets:
> ordsets:subtract(ordsets:from_list(List1), ordsets:from_list(List2)).
["Baby","Brandy","Lady"]
If you're using Erlang/OTP 22 or later (released in 2019), the best way is to use the -- operator:
> List3 = List1 -- List2.
["Brandy","Lady","Baby"]
Starting in Erlang/OTP 22 the runtime complexity of this operation is O(n log n); in earlier Erlang versions it is O(n*m), so it would perform very badly if both lists are very long.
See the Retired Myths chapter in the Erlang Efficiency Guide:
12.3 Myth: List subtraction ("--" operator) is slow
List subtraction used to have a run-time complexity proportional to the product of the length of its operands, so it was extremely slow when both lists were long.
As of OTP 22 the run-time complexity is "n log n" and the operation will complete quickly even when both lists are very long. In fact, it is faster and uses less memory than the commonly used workaround to convert both lists to ordered sets before subtracting them with ordsets:subtract/2.
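One behavioral difference worth keeping in mind (a quick shell check of my own, not part of the quoted guide): -- keeps the order and duplicates of the left-hand list and removes only one occurrence per element of the right-hand list, while ordsets:subtract/2 returns a sorted, duplicate-free list.
> ["a","b","a","c"] -- ["a"].
["b","a","c"]
> ordsets:subtract(ordsets:from_list(["a","b","a","c"]), ordsets:from_list(["a"])).
["b","c"]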

H5PY Writes Very Slow

I have an h5py dataset like the one below. I want to index the records by string instead of by numeric value, so that, e.g., I could get the value of the first record with dset[dset.attrs['id1']].
I am trying to write the attributes with the code below, but it is extremely slow. If I do a %timeit dset.attrs[rid] = idx inside the loop, a single write takes about 310 ms. The strings I am writing are 36 characters long, and I have about 100k records to write, which would take about 9 hours. Something must be terribly wrong? Also, the CPU is pegged.
import h5py

ids = ['id1', 'id2', 'id3']
h5 = h5py.File("/tmp/ds.h5", "w")
dset = h5.create_dataset("lds", (100000, ), dtype='float32')
for idx, id in enumerate(ids):  # loop takes forever
    dset.attrs[id] = idx  # takes about ~310ms
EDIT
Minimal "working" example.
for idx, rid in enumerate(range(10)):
    %timeit dset.attrs[str(rid)] = idx
10 loops, best of 3: 470 ms per loop
10 loops, best of 3: 470 ms per loop
...
Nearly 0.5 second for a single write.
Pass the value 'latest' for the libver parameter when opening the file; this is a lot faster. So, e.g.
h5py.File('ds.h5', 'w', libver='latest')
See here: https://github.com/h5py/h5py/issues/705
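For reference, here is a minimal, self-contained sketch of that fix applied to the example above (the id list is shortened and only illustrative; the timings quoted earlier are the reporter's, not re-measured):
import h5py

ids = ['id1', 'id2', 'id3']  # in practice ~100k 36-character strings

# Opening with libver='latest' avoids the slow attribute writes seen above.
with h5py.File('/tmp/ds.h5', 'w', libver='latest') as h5:
    dset = h5.create_dataset('lds', (100000,), dtype='float32')
    for idx, rid in enumerate(ids):
        dset.attrs[rid] = idx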

Possible to use less/greater than operators with IF ANY?

Is it possible to use the < and > operators with the IF ANY function? Something like this:
select if (any(>10,Q1) AND any(<2,Q2 to Q10))
You definitely need to create an auxiliary variable to do this.
@Jignesh Sutar's solution works fine. However, there are often multiple ways in SPSS to accomplish a certain task.
Here is another solution where the COUNT command comes in handy.
It is important to note that the following solution assumes that the values of the variables are integers. If you have float values (1.5 for instance) you'll get a wrong result.
* count occurrences where Q2 to Q10 is less than 2.
COUNT #QLT2 = Q2 TO Q10 (LOWEST THRU 1).
* select if Q1>10 and
* there is at least one occurrence where Q2 to Q10 is less than 2.
SELECT IF (Q1>10 AND #QLT2>0).
There is also a variant of this solution that deals with float values correctly, though I find it less intuitive:
* count occurrences where Q2 to Q10 is 2 or higher.
COUNT #QGE2 = Q2 TO Q10 (2 THRU HIGHEST).
* select if Q1>10 and
* not all of the 9 variables Q2 to Q10 are 2 or higher.
SELECT IF (Q1>10 AND #QGE2<9).
Note: Variables beginning with # are temporary variables. They are not stored in the data set.
I don't think you can (would be nice if you could - you can do something similar in Excel with COUNTIF & SUMIF IIRC).
You'd have to construct a new variable that tests the multiple ANY less-than conditions, as per the example below:
input program.
loop #j = 1 to 1000.
  compute ID=#j.
  vector Q(10).
  loop #i = 1 to 10.
    compute Q(#i) = trunc(rv.uniform(-20,20)).
  end loop.
  end case.
end loop.
end file.
end input program.
execute.

vector Q=Q2 to Q10.
loop #i=1 to 9 if Q(#i)<2.
  compute #QLT2=1.
end loop if Q(#i)<2.
select if (Q1>10 and #QLT2=1).
exe.

Why is this RegExp taking 16 minutes to process on Rails?

I've written a function to remove email addresses from my data using gsub. The code is below. The problem is that it takes a total of 27 minutes to execute the function on a set of 10,000 records (16 minutes for the first pattern, 11 minutes for the second). Elsewhere in the code I process about 20 other regexps using a similar flow (iterating through data.each) and they all finish in less than a second. (BTW, I recognize that my regexps aren't perfect and may catch some strings that aren't email addresses.)
Is there something about these two regexps that is causing the processing time to be so high? I've tried it on seven different data sources, all with the same result, so the problem isn't peculiar to my data set.
def remove_email_addresses!(data)
  email_patterns = [
    /[[:graph:]]+@[[:graph:]]+/i,
    /[[:graph:]]+ +at +[^ ][ [[:graph:]]]{0,40} +dot +com/i
  ]
  data.each do |row|
    email_patterns.each do |pattern|
      row[:title].gsub!(pattern, "") unless row[:title].blank?
      row[:description].gsub!(pattern, "") unless row[:description].blank?
    end
  end
end
First, check that your faster code isn't just doing var =~ /blah/ matching rather than replacement: matching is several orders of magnitude faster.
In addition to reducing backtracking and replacing + and * with ranges for safety, as follows...
email_patterns = [
  /\b[-_.\w]{1,128}@[-_.\w]{1,128}/i,
  /\b[-_.\w]{1,128} {1,10}at {1,10}[^ ][-_.\w ]{0,40} {1,10}dot {1,10}com/i
]
... you could also try "unrolling your loop", though this is unlikely to help unless there is some kind of interaction between the iterators (which there shouldn't be, but...). That is:
data.each do |row|
  row[:title].gsub!(email_patterns[0], "") unless row[:title].blank?
  row[:description].gsub!(email_patterns[0], "") unless row[:description].blank?
  row[:title].gsub!(email_patterns[1], "") unless row[:title].blank?
  row[:description].gsub!(email_patterns[1], "") unless row[:description].blank?
end
Finally, if this causes little to no speedup, consider profiling with something like ruby-prof to find out whether the regexes themselves are the issue, or whether there's a problem in the do iterator or the unless clauses instead.
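If you go the profiling route, a minimal sketch with ruby-prof (assuming the gem is installed, that data holds your 10,000 records, and noting that the printer API differs slightly between ruby-prof versions) could look like this:
require "ruby-prof"

# Profile only the e-mail scrubbing pass; the flat report shows whether the
# time is spent inside gsub! or in the surrounding iteration and blank? checks.
result = RubyProf.profile do
  remove_email_addresses!(data)
end

RubyProf::FlatPrinter.new(result).print($stdout)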
Could it be that the data is large enough that it causes issues with paging once read in? If so, might it be faster to read the data in and parse it in chunks of N entries, rather than process the whole lot at once?
