When is vectorization favored in Julia?

I have two functions for determining pi numerically in Julia. The second function (which I think is vectorized) is slower than the first.
Why is vectorization slower here? Are there rules for when to vectorize and when not to?
function determine_pi(n)
    area = zeros(Float64, n)
    sum = 0
    for i = 1:n
        if (rand()^2 + rand()^2) <= 1
            sum = sum + 1
        end
        area[i] = sum * 1.0 / i
    end
    return area
end
and another function
function determine_pi_vec(n)
    res = cumsum(map(x -> x <= 1 ? 1 : 0, rand(n).^2 + rand(n).^2)) ./ [1:n]
    return res
end
When run for n = 10^7, below are the execution times (after running a few times):
n = 10^7
@time returnArray = determine_pi(n)
# elapsed time: 0.183211324 seconds (80000128 bytes allocated)
@time returnArray2 = determine_pi_vec(n);
# elapsed time: 2.436501454 seconds (880001336 bytes allocated, 30.71% gc time)

Vectorization is good if:
It makes the code easier to read, and performance isn't critical.
It's a linear-algebra operation: using a vectorized style can be good because Julia can use BLAS and LAPACK to perform your operation with very specialized, high-performance code.
In general, I personally find it best to start with vectorized code, look for any speed problems, and then devectorize the troublesome parts.
Your second version is slow not so much because it is vectorized, but because of the anonymous function: unfortunately in Julia 0.3, these are normally quite a bit slower. map in general doesn't perform very well, I believe because Julia can't infer the output type of the function (it's still "anonymous" from the perspective of the map function). I wrote a different vectorized version which avoids anonymous functions and is possibly a little easier to read:
function determine_pi_vec2(n)
    return cumsum((rand(n).^2 .+ rand(n).^2) .<= 1) ./ (1:n)
end
Benchmarking with
function bench(n, f)
    f(10)        # warm-up call so compilation isn't included in the timings
    srand(1000)
    @time f(n)
    srand(1000)
    @time f(n)
    srand(1000)
    @time f(n)
end
bench(10^8, determine_pi)
gc()
bench(10^8, determine_pi_vec)
gc()
bench(10^8, determine_pi_vec2)
gives me the results
elapsed time: 5.996090409 seconds (800000064 bytes allocated)
elapsed time: 6.028323688 seconds (800000064 bytes allocated)
elapsed time: 6.172004807 seconds (800000064 bytes allocated)
elapsed time: 14.09414031 seconds (8800005224 bytes allocated, 7.69% gc time)
elapsed time: 14.323797823 seconds (8800001272 bytes allocated, 8.61% gc time)
elapsed time: 14.048216404 seconds (8800001272 bytes allocated, 8.46% gc time)
elapsed time: 8.906563284 seconds (5612510776 bytes allocated, 3.21% gc time)
elapsed time: 8.939001114 seconds (5612506184 bytes allocated, 4.25% gc time)
elapsed time: 9.028656043 seconds (5612506184 bytes allocated, 4.23% gc time)
So vectorized code can definitely be about as good as devectorized code in some cases, even when we aren't in the linear-algebra case.

Related

Confused over precision of Lua's os.clock

I thought Lua's os.clock() returned time in seconds. But in the documentation here, https://www.lua.org/pil/22.1.html, the example they have
local x = os.clock()
local s = 0
for i=1,100000 do s = s + i end
print(string.format("elapsed time: %.2f\n", os.clock() - x))
is rounding the result to 2 decimal places. Does os.clock() return seconds.milliseconds?
Also, running this in Lua gives
> print(os.clock())
0.024615
What are these decimal places?
os.clock and os.time are not the same sort of time.
os.time deals with "wall-clock" time, the sort of time humans use.
os.clock is a counter reporting CPU time. The decimal number you get from os.clock is the number of seconds the CPU spent running the current task. The CPU time has no correlation to wall-clock time other than using the same base time units (seconds).
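If it helps to see the distinction outside Lua, here is a rough Python analogy (not part of the Lua API; time.process_time() plays the role of os.clock and time.time() the role of os.time):

import time

wall_start = time.time()          # wall-clock time, like Lua's os.time
cpu_start = time.process_time()   # CPU time of this process, like Lua's os.clock

time.sleep(1)                     # sleeping takes wall-clock time but almost no CPU time
total = sum(range(10**6))         # busy work takes both

print("wall-clock elapsed:", time.time() - wall_start)
print("CPU time elapsed:", time.process_time() - cpu_start)

The decimal places you see from os.clock are simply fractional seconds of CPU time, not a separate millisecond field.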

Apache Ignite uses too much RAM

I've tried to use Ignite to store events, but I'm facing a problem of excessive RAM usage while inserting new data.
I'm running an Ignite node with a 1GB heap and the default configuration:
curs.execute("""CREATE TABLE trololo (id LONG PRIMARY KEY, user_id LONG, event_type INT, timestamp TIMESTAMP) WITH "template=replicated" """);
n = 10000
for i in range(200):
values = []
for j in range(n):
id_ = i * n + j
event_type = random.randint(1, 5)
user_id = random.randint(1000, 5000)
timestamp = datetime.datetime.utcnow() - timedelta(hours=random.randint(1, 100))
values.append("({id}, {user_id}, {event_type}, '{timestamp}')".format(
id=id_, user_id=user_id, event_type=event_type, uid=uid, timestamp=timestamp.strftime('%Y-%m-%dT%H:%M:%S-00:00')
))
query = "INSERT INTO trololo (id, user_id, event_type, TIMESTAMP) VALUES %s;" % ",".join(values)
curs.execute(query)
But after loading about 10^6 events, I get 100% CPU usage because the whole heap is taken and the GC is trying (unsuccessfully) to free some space.
Then I stop for about 10 minutes; after that the GC successfully frees some space and I can continue loading new data.
Then the heap fills up again and it all starts over.
It's really strange behaviour, and I couldn't find a way to load 10^7 events without these problems.
Approximately, each event should take:
8 + 8 + 4 + 10 (timestamp size?), which is about 30 bytes
30 bytes x 3 (overhead), so it should be less than 100 bytes per record
So 10^7 * 10^2 = 10^9 bytes = 1GB
So it seems that 10^7 events should fit into 1GB of RAM, shouldn't they?
Actually, since version 2.0, Ignite stores all data off-heap with the default settings.
The main problem here is that you generate a very big query string with 10,000 inserts; it has to be parsed and, of course, is held on the heap. After decreasing this size for each query, you will get better results (see the sketch below).
But also, as you can see in the docs for capacity planning, Ignite adds around 200 bytes of overhead for each entry. Additionally, add around 200-300MB per node for internal memory and a reasonable amount of memory for the JVM and GC to operate efficiently.
If you really want to use only a 1GB heap you can try to tune the GC, but I would recommend increasing the heap size.
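A rough sketch of that change, reusing the loop from the question (the batch size of 1000 is only an illustrative guess, not a tuned value, and insert_events is my own helper name):

import datetime
import random
from datetime import timedelta

BATCH = 1000  # illustrative guess; smaller batches keep each SQL string (and its parsing) cheap

def insert_events(curs, total, batch=BATCH):
    # curs: a cursor connected to the Ignite cluster, as in the question
    for start in range(0, total, batch):
        values = []
        for id_ in range(start, min(start + batch, total)):
            event_type = random.randint(1, 5)
            user_id = random.randint(1000, 5000)
            timestamp = datetime.datetime.utcnow() - timedelta(hours=random.randint(1, 100))
            values.append("({id}, {user_id}, {event_type}, '{ts}')".format(
                id=id_, user_id=user_id, event_type=event_type,
                ts=timestamp.strftime('%Y-%m-%dT%H:%M:%S-00:00')))
        curs.execute("INSERT INTO trololo (id, user_id, event_type, timestamp) VALUES %s;"
                     % ",".join(values))

Each INSERT statement is then roughly ten times smaller, so parsing it puts far less pressure on the heap.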

Single-threaded program profiles 15% of runtime in semaphore_wait_trap

On Mac OS using Mono, if I compile and profile the program below, I get the following results:
% fsharpc --nologo -g foo.fs -o foo.exe
% mono --profile=default:stat foo.exe
...
Statistical samples summary
Sample type: cycles
Unmanaged hits: 336 (49.1%)
Managed hits: 349 (50.9%)
Unresolved hits: 1 ( 0.1%)
Hits % Method name
154 22.48 Microsoft.FSharp.Collections.SetTreeModule:height ...
105 15.33 semaphore_wait_trap
74 10.80 Microsoft.FSharp.Collections.SetTreeModule:add ...
...
Note the second entry, semaphore_wait_trap.
Here is the program:
[<EntryPoint>]
let main args =
    let s = seq { 1..1000000 } |> Set.ofSeq
    s |> Seq.iter (fun _ -> ())
    0
I looked in the source for the Set module, but I didn't find any (obvious) locking.
Is my single-threaded program really spending 15% of its execution time messing with semaphores? If it is, can I make it not do that and get a performance boost?
According to Instruments, it's sgen/gc calling semaphore_wait_trap:
Sgen is documented as stopping all other threads while it collects:
Before doing a collection (minor or major), the collector must stop all running threads so that it can have a stable view of the current state of the heap, without the other threads changing it.
In other words, when the code is trying to allocate memory and a GC is required, the time it takes shows up under semaphore_wait_trap since that's your application thread. I suspect the mono profiler doesn't profile the gc thread itself so you don't see the time in the collection code.
The germane output then is really the GC summary:
GC summary
GC resizes: 0
Max heap size: 0
Object moves: 1002691
Gen0 collections: 123, max time: 14187us, total time: 354803us, average: 2884us
Gen1 collections: 3, max time: 41336us, total time: 60281us, average: 20093us
If you want your code to run faster, don't collect as often.
Understanding the actual cost of collection can be done through dtrace since sgen has dtrace probes.

Better way to reverse binary

I'm trying to reverse a binary like this:
reverse(Bin) ->
    list_to_binary(lists:reverse([rev_bits(<<B>>) || B <- binary:bin_to_list(Bin)])).

rev_bits(<<A:1, B:1, C:1, D:1, E:1, F:1, G:1, H:1>>) ->
    <<H:1, G:1, F:1, E:1, D:1, C:1, B:1, A:1>>.
I don't like this code. Could you please advise better way to accomplish this routine?
Somewhat like your rev_bits function:
rev(<<>>, Acc) -> Acc;
rev(<<H:1/binary, Rest/binary>>, Acc) ->
    rev(Rest, <<H/binary, Acc/binary>>).
I believe binary concatenation is optimised so this should be quite fast already.
Edited: use clauses instead of case…of…end.
Better alternative:
rev(Binary) ->
    Size = erlang:size(Binary)*8,
    <<X:Size/integer-little>> = Binary,
    <<X:Size/integer-big>>.
Benchmark results comparing against fenollp's iteration method. The benchmark was done by calling both functions with a random binary containing 8192 random bytes:
Calling reverse 10 times
BENCHMARK my method: Calling reverse/1 function 10 times. Process took 0.000299 seconds
BENCHMARK fenollp iteration method: Calling reverse_recursive/1 function 10 times. Process took 0.058528 seconds
Calling reverse 100 times
BENCHMARK my method: Calling reverse/1 function 100 times. Process took 0.002703 seconds
BENCHMARK fenollp iteration method: Calling reverse_recursive/1 function 100 times. Process took 0.391098 seconds
The method proposed by me is usually at least 100 times faster.
binary:encode_unsigned(binary:decode_unsigned(Bin, little)).
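Both the Size/integer-little answer and this one-liner rely on the same trick: read the whole binary as a little-endian integer, then write it back out big-endian, which reverses the byte order. A rough Python sketch of the same idea (the helper name is mine, purely for illustration):

def reverse_bytes(data: bytes) -> bytes:
    # Read the bytes as a little-endian integer, then write it back out big-endian:
    # the byte order comes out reversed. Passing len(data) preserves zero bytes
    # that a minimal encoding would otherwise drop.
    return int.from_bytes(data, "little").to_bytes(len(data), "big")

assert reverse_bytes(b"\x01\x02\x03") == b"\x03\x02\x01"

Note that these approaches reverse the order of whole bytes; the rev_bits clause in the question additionally reverses the bit order within each byte.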

Does F# do automatic memoisation?

I have this code:
for i in 1 .. 10 do
    let (tree, interval) = time (fun () -> insert [12.; 6. + 1.0] exampletree 128.)
    printfn "insertion time: %A" interval.TotalMilliseconds
    ()
with the time function defined as
let time f =
    let start = DateTime.Now
    let res = f ()
    let finish = DateTime.Now
    (res, finish - start)
The function insert is not relevant here, other than the fact that it doesn't employ mutation and thus returns the same value every time.
I get the results:
insertion time: 218.75
insertion time: 0.0
insertion time: 0.0
insertion time: 0.0
insertion time: 0.0
insertion time: 0.0
insertion time: 0.0
insertion time: 0.0
insertion time: 0.0
insertion time: 0.0
The question is: why does the code calculate the result only once (judging by the insertion times, the result is always correct and equal)? Also, how can I force the program to do the computation multiple times (I need that for profiling purposes)?
Edit: Jared has supplied the right answer. Now that I know what to look for, I can get the stopwatch code from a timeit function for F#.
I had the following results:
insertion time: 243.4247
insertion time: 0.0768
insertion time: 0.0636
insertion time: 0.0617
insertion time: 0.065
insertion time: 0.0564
insertion time: 0.062
insertion time: 0.069
insertion time: 0.0656
insertion time: 0.0553
F# does not do automatic memoization of your functions. In this case memoization would be incorrect: even though you don't mutate items directly, you are accessing a mutable value (DateTime.Now) from within your function. Memoizing that, or a function accessing it, would be a mistake since it can change from call to call.
What you're seeing here is an effect of the .NET JIT. The first time this is run, the function f() is JIT'd, which produces a noticeable delay. The other times it's already JIT'd and executes in a time which is smaller than the granularity of DateTime.
One way to prove this is to use a more granular measuring class like Stopwatch. This will show that the function really does execute every time.
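As an analogy only (this is Python, not F#, and the names are mine): the difference is between subtracting coarse wall-clock timestamps and using a high-resolution monotonic counter, which is the role Stopwatch plays on .NET.

import datetime
import time

def time_it(f):
    # Coarse: subtracting wall-clock datetimes, analogous to the DateTime.Now approach
    start = datetime.datetime.now()
    res = f()
    coarse = (datetime.datetime.now() - start).total_seconds()

    # Fine: a high-resolution monotonic counter, analogous to Stopwatch
    start = time.perf_counter()
    f()
    fine = time.perf_counter() - start

    return res, coarse, fine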
The first timing is probably due to JIT compilation. The actual code you're timing probably runs in less time than DateTime is able to measure.
Edit: Beaten by 18 secs... I'm just glad I had the right idea :)
