Memory allocation with views - memory

Consider the following code
using Distributions
using BenchmarkTools
u = randn(100, 2)
res = ones(100)
idx = 1
u_vector = u[:, idx]
@btime $res = $1.0 .- $u_vector;
@btime $res = $1.0 .- $u[:,idx];
@btime @views $res = $1.0 .- $u[:,idx];
These are the results that I got from the three lines with @btime:
julia> @btime $res = $1.0 .- $u_vector;
37.478 ns (1 allocation: 896 bytes)
julia> @btime $res = $1.0 .- $u[:,idx];
607.383 ns (13 allocations: 1.97 KiB)
julia> @btime @views $res = $1.0 .- $u[:,idx];
397.597 ns (6 allocations: 1.08 KiB)
The second @btime line has the most time and allocations, but that's in line with my expectations, since I'm slicing. However, I'm not sure why the third line with @views is not the same as the first line. I thought that by using @views I'm no longer creating a copy. Is there a way to "fix" the third line? In my real code, the user provides idx, so idx is not known in advance. Therefore, I want to reduce allocations when I slice.

What I assume you are looking for is:
julia> @btime $res .= 1.0 .- view($u, :, $idx);
13.126 ns (0 allocations: 0 bytes)
The point is that you want to avoid allocating the vector on the RHS, and that is why you should use .= rather than =.
I also changed @views to a view call. It does not matter here, but, in general, using @views can be tricky, and I avoid it unless there is a reason, see here.
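Since idx comes from the user in your real code, here is a minimal sketch of how this could be wrapped up (the function name fill_res! is illustrative, not from the original post):
function fill_res!(res::AbstractVector, u::AbstractMatrix, idx::Integer)
    # .= broadcasts into the existing buffer; view avoids copying the column
    res .= 1.0 .- view(u, :, idx)
    return res
end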

On my machine, the results are different. Note that you shouldn't use $ for 1.0; on the other hand, you should use $ for idx. Here is the result on my machine:
julia> @btime $res = 1.0 .- $u_vector;
103.455 ns (1 allocation: 896 bytes)
julia> @btime $res = 1.0 .- $u[:,$idx];
241.978 ns (2 allocations: 1.75 KiB)
julia> @btime @views $res = 1.0 .- $u[:,$idx];
105.058 ns (1 allocation: 896 bytes)
I ran the test several times with similar results. Here is the output of versioninfo():
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
Threads: 1 on 8 virtual cores

Table printing a list of lists in Common Lisp

I wish to print this data in a table with the columns aligned. I tried with format but the columns were not aligned. Does anyone know how to do it? Thank you.
(("tiscali" 10000 2.31 0.84 -14700.0 "none")
 ("atlantia" 50 22.65 22.68 1.5 "none")
 ("bper-banca" 1000 1.59 2.01 423.0 "none")
 ("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
 ("tesmec" 10000 0.12 0.14 150.0 "none")
 ("cover-50" 120 8.95 9.6 78.0 "none")
 ("ovs" 1000 1.71 1.93 217.0 "none")
 ("credito-emiliano" 200 5.7 6.26 112.0 "none"))
I tried to align the columns with the ~T directive, with no luck. Is there a piece of code that prints table data nicely?
Let's break this down.
First, let's give your data a nice name:
(defparameter *data*
  '(("tiscali" 10000 2.31 0.84 -14700.0 "none")
    ("atlantia" 50 22.65 22.68 1.5 "none")
    ("bper-banca" 1000 1.59 2.01 423.0 "none")
    ("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
    ("tesmec" 10000 0.12 0.14 150.0 "none")
    ("cover-50" 120 8.95 9.6 78.0 "none")
    ("ovs" 1000 1.71 1.93 217.0 "none")
    ("credito-emiliano" 200 5.7 6.26 112.0 "none")))
Now, come up with a way to print each line using format and destructuring-bind. Widths of various fields are hard-coded in.
(defun print-line (line)
  (destructuring-bind (a b c d e f) line
    ;; ~20a: string padded to 20 columns; ~5d: integer right-justified in 5
    ;; columns; ~6,2f / ~10,2f: floats with 2 decimals in 6 / 10 columns.
    (format T "~20a ~5d ~6,2f ~6,2f ~10,2f ~4a~%" a b c d e f)))
Once you know you can print a line, you just need to do that for each line.
(mapcar 'print-line *data*)
Result:
tiscali              10000   2.31   0.84  -14700.00 none
atlantia                50  22.65  22.68       1.50 none
bper-banca            1000   1.59   2.01     423.00 none
alerion-cleanpower      30  44.14  36.45    -230.70 none
tesmec               10000   0.12   0.14     150.00 none
cover-50               120   8.95   9.60      78.00 none
ovs                   1000   1.71   1.93     217.00 none
credito-emiliano       200   5.70   6.26     112.00 none
I have something like this in my personal code, which I have reproduced here in a simplified way:
(defpackage :tabular (:use :cl))
(in-package :tabular)
I have a function that turns any object into a list of values (a row); here the input is already a list of values, so it is already in the correct shape.
(defgeneric columnize (object)
  (:documentation "Representation of object as a list of fields")
  (:method ((o list)) o))
I also define a transpose function that works with lists of various sizes:
(defun transpose (lists)
  (when (notany #'null lists)
    (cons
     (mapcar #'first lists)
     (transpose (mapcar #'cdr lists)))))
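For example (illustrative):
(transpose '((1 2 3) (4 5 6)))
;; => ((1 4) (2 5) (3 6))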
Here is your data, as defined by Chris:
(defparameter *data*
  '(("tiscali" 10000 2.31 0.84 -14700.0 "none")
    ("atlantia" 50 22.65 22.68 1.5 "none")
    ("bper-banca" 1000 1.59 2.01 423.0 "none")
    ("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
    ("tesmec" 10000 0.12 0.14 150.0 "none")
    ("cover-50" 120 8.95 9.6 78.0 "none")
    ("ovs" 1000 1.71 1.93 217.0 "none")
    ("credito-emiliano" 200 5.7 6.26 112.0 "none")))
And finally, a function that prints a list of objects in a tabular way.
Basically, I convert each object to a list of values, convert those to strings, and compute their lengths. This gives a matrix of sizes, which I transpose to get the list of sizes for each column; this is used to compute the width of each column, based on the maximum size of the actual data.
In practice, I also allow the generic function to add indicators like how to justify (left/right), etc.
(defun tabulate (stream objects)
  (loop
    for n from 0
    for o in objects
    for row = (mapcar #'princ-to-string (columnize o))
    collect row into rows
    collect (mapcar #'length row) into row-widths
    finally
       (flet ((build-format-arguments (max-width row)
                (when (> max-width 0)
                  (list max-width #\space row))))
         (loop
           with number-width = (ceiling (log n 10))
           with col-widths = (transpose row-widths)
           with max-col-widths = (mapcar (lambda (s) (reduce #'max s)) col-widths)
           for index from 0
           for row in rows
           for entries = (mapcan #'build-format-arguments max-col-widths row)
           ;; ~v,'0d prints the zero-padded row number; each ~v,,,va pads a
           ;; field with spaces to the width of its column.
           do (format stream
                      "~v,'0d. ~{~v,,,va~^ ~}~%"
                      number-width index entries)))))
For example:
(fresh-line)
(tabulate *standard-output* *data*)
Gives:
0. tiscali            10000 2.31  0.84  -14700.0 none
1. atlantia           50    22.65 22.68 1.5      none
2. bper-banca         1000  1.59  2.01  423.0    none
3. alerion-cleanpower 30    44.14 36.45 -230.7   none
4. tesmec             10000 0.12  0.14  150.0    none
5. cover-50           120   8.95  9.6   78.0     none
6. ovs                1000  1.71  1.93  217.0    none
7. credito-emiliano   200   5.7   6.26  112.0    none
As you can see, there are some adjustments that could be made to format floating-point values so that they align on the decimal point, but this is already quite useful.

Lua bad argument #2

I am a total beginner with Lua / ESP8266 and I am trying to find out where this error comes from:
PANIC: unprotected error in call to Lua API (bad argument #2 to 'set' (index out of range))
This is the whole message in the serial monitor:
NodeMCU 2.2.0.0 built with Docker provided by frightanic.com
.branch: master
.commit: 11592951b90707cdcb6d751876170bf4da82850d
.SSL: false
.Build type: float
.LFS: disabled
.modules: adc,bit,dht,file,gpio,i2c,mqtt,net,node,ow,spi,tmr,uart,wifi
build created on 2019-12-07 23:52
powered by Lua 5.1.4 on SDK 2.2.1(6ab97e9)
> Config done, IP is 192.168.2.168
LED-Server started
PANIC: unprotected error in call to Lua API (bad argument #2 to 'set' (index out of range))
ets Jan 8 2013,rst cause:2, boot mode:(3,6)
load 0x40100000, len 27780, room 16
tail 4
chksum 0xbc
load 0x3ffe8000, len 2188, room 4
tail 8
chksum 0xba
load 0x3ffe888c, len 136, room 0
tail 8
chksum 0xf2
csum 0xf2
å¬ú‰.Éo‰ísÉÚo|Ï.å.õd$`..#íú..æÑ2rí.lúN‡.Éo„..l`.Ñ‚r€lÑ$.å...l`.Ñ‚s≤pɉ$.å....l`.Ñ‚r€l.èæ.å...$l`.{$é.êo.Ñü¬cc.ÑÑ".|l.Bè.c.‰è¬.lc‰ÚnÓ.2NN‚....å#€‚n.ÏéÑ.l..$Ïådè|Ïl.é.lÄ.o¸.Ñæ.#".llÏÑè..c...åû„åc.l.Ñb.{$r.
I uploaded this code (https://github.com/Christoph-D/esp8266-wakelight) to the ESP8266, and built the correct NodeMCU firmware with all required modules.
The serial output is OK for a couple of seconds, then I get this error and it starts rebooting repeatedly.
Where would I start looking for the problem?
Thanks a lot!!!
EDIT: there are only a few places in the Lua files where anything about "set" is written:
local function update_buffer(buffer, c)
    if not c.r_frac then c.r_frac = 0 end
    if not c.g_frac then c.g_frac = 0 end
    if not c.b_frac then c.b_frac = 0 end
    local r2 = c.r_frac >= 0 and c.r + 1 or c.r - 1
    local g2 = c.g_frac >= 0 and c.g + 1 or c.g - 1
    local b2 = c.b_frac >= 0 and c.b + 1 or c.b - 1
    local r3, g3, b3
    local set = buffer.set
    for i = 1, NUM_LEDS do
        if i > c.r_frac then r3 = c.r else r3 = r2 end
        if i > c.g_frac then g3 = c.g else g3 = g2 end
        if i > c.b_frac then b3 = c.b else b3 = b2 end
        set(buffer, i - 1, g3, r3, b3)
    end
end
Is there anything wrong?
Just above the for-loop where set is called, try adding this:
print(buffer:size(), NUM_LEDS)
If everything is OK, it should print the same number twice. If NUM_LEDS is larger, then that's your bug.
I don't really get why it uses the global variable in that place anyway; it'd make much more sense to use buffer:size() instead for exactly this reason.
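A minimal sketch of that change (illustrative only; it reuses the names from the snippet above):
-- bound the loop by the buffer's own size so a mismatched NUM_LEDS
-- can never index past the end of the buffer
for i = 1, buffer:size() do
    if i > c.r_frac then r3 = c.r else r3 = r2 end
    if i > c.g_frac then g3 = c.g else g3 = g2 end
    if i > c.b_frac then b3 = c.b else b3 = b2 end
    set(buffer, i - 1, g3, r3, b3)
end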

julia: @timev differing values for bytes allocated

Is it better to use @allocated versus "bytes allocated" for measuring memory usage? I'm a bit surprised that bytes allocated changes from invocation to invocation.
julia> @timev map(x->2*x, [1:100])
0.047360 seconds (89.54 k allocations: 4.269 MiB)
elapsed time (ns): 47359831
bytes allocated: 4476884
pool allocs: 89536
non-pool GC allocs:1
1-element Array{StepRange{Int64,Int64},1}:
2:2:200
julia> @timev map(x->2*x, [1:100])
0.047821 seconds (89.56 k allocations: 4.271 MiB)
elapsed time (ns): 47820714
bytes allocated: 4478708
pool allocs: 89554
non-pool GC allocs:1
1-element Array{StepRange{Int64,Int64},1}:
2:2:200
julia> @timev map(x->2*x, [1:100])
0.045273 seconds (89.58 k allocations: 4.274 MiB)
elapsed time (ns): 45272518
bytes allocated: 4481108
pool allocs: 89580
non-pool GC allocs:1
1-element Array{StepRange{Int64,Int64},1}:
2:2:200
Firstly, you should read the performance tips section of the Julia manual: https://docs.julialang.org/en/v1/manual/performance-tips/index.html
You are violating tip number one: don't benchmark in global scope. A big red flag should be that this simple operation takes 4/100 of a second and allocates 4MB.
For benchmarking, always use the BenchmarkTools.jl package. Below is example usage.
(BTW, do you really mean to operate on [1:100]? This is a single-element vector, where the single element is a Range object. Did you perhaps intend to work on 1:100 or maybe collect(1:100)?)
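For illustration, here is roughly what the REPL shows for the two (output sketched from a Julia 1.x session):
julia> [1:100]                  # a single-element vector holding the range
1-element Array{UnitRange{Int64},1}:
 1:100

julia> length(collect(1:100))   # a hundred Int64 elements
100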
julia> using BenchmarkTools
julia> foo(y) = map(x->2*x, y)
foo (generic function with 2 methods)
julia> v = 1:100
1:100
julia> @btime foo($v)
73.372 ns (1 allocation: 896 bytes)
julia> v = collect(1:100);
julia> @btime foo($v);
73.699 ns (2 allocations: 912 bytes)
julia> @btime foo($v);
73.100 ns (2 allocations: 912 bytes)
julia> @btime foo($v);
74.033 ns (2 allocations: 912 bytes)
julia> v = [1:100];
julia> @btime foo($v);
55.563 ns (2 allocations: 128 bytes)
As you can see, runtimes are almost 6 orders of magnitude faster than what you are seeing, and allocations are stable.
Notice also that the last example, which uses [1:100], is faster than the others, but that's because it's doing something else.
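That "something else" is easy to see: multiplying a range by an integer yields another range, so map only has to process a single element, e.g.:
julia> 2 * (1:100)
2:2:200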

dask read_parquet with pyarrow memory blow up

I am using Dask to write and read Parquet files. I write using the fastparquet engine and read using the pyarrow engine.
My worker has 1 GB of memory. With fastparquet the memory usage is fine, but when I switch to pyarrow, it just blows up and causes the worker to restart.
I have a reproducible example below which fails with pyarrow on a worker with a 1 GB memory limit.
In reality my dataset is much bigger than this. The only reason for using pyarrow is the speed boost it gives me while scanning compared to fastparquet (somewhere around 7x-8x).
dask : 0.17.1
pyarrow : 0.9.0.post1
fastparquet : 0.1.3
import dask.dataframe as dd
import numpy as np
import pandas as pd

size = 9900000
tmpdir = '/tmp/test/outputParquet1'
d = {'a': np.random.normal(0, 0.3, size=size).cumsum() + 50,
     'b': np.random.choice(['A', 'B', 'C'], size=size),
     'c': np.random.choice(['D', 'E', 'F'], size=size),
     'd': np.random.normal(0, 0.4, size=size).cumsum() + 50,
     'e': np.random.normal(0, 0.5, size=size).cumsum() + 50,
     'f': np.random.normal(0, 0.6, size=size).cumsum() + 50,
     'g': np.random.normal(0, 0.7, size=size).cumsum() + 50}
df = dd.from_pandas(pd.DataFrame(d), 200)
df.to_parquet(tmpdir, compression='snappy', write_index=True,
              engine='fastparquet')

# engine = 'pyarrow'  # fails due to worker restart
engine = 'fastparquet'  # works fine
df_partitioned = dd.read_parquet(tmpdir + "/*.parquet", engine=engine)
print(df_partitioned.count().compute())
df_partitioned.query("b=='A'").count().compute()
Edit: My original setup has Spark jobs running that write data in parallel into partitions using fastparquet, so the metadata file is created in the innermost partition rather than the parent directory. Hence I use glob paths instead of the parent directory (fastparquet is much faster with a parent-directory read, whereas pyarrow wins when scanning with a glob path).
I recommend selecting the columns you need in the read_parquet call
df = dd.read_parquet('/path/to/*.parquet', engine='pyarrow', columns=['b'])
This will allow you to efficiently read only the few columns you need rather than all of the columns at once.
Some timing results on my non-memory-restricted system:
With your example data
In [17]: df_partitioned = dd.read_parquet(tmpdir, engine='fastparquet')
In [18]: %timeit df_partitioned.count().compute()
2.47 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: df_partitioned = dd.read_parquet(tmpdir, engine='pyarrow')
In [20]: %timeit df_partitioned.count().compute()
1.93 s ± 96.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With columns b and c converted to categorical before writing
In [30]: df_partitioned = dd.read_parquet(tmpdir, engine='fastparquet')
In [31]: %timeit df_partitioned.count().compute()
1.25 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [32]: df_partitioned = dd.read_parquet(tmpdir, engine='pyarrow')
In [33]: %timeit df_partitioned.count().compute()
1.82 s ± 63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With fastparquet direct, single-threaded
In [36]: %timeit fastparquet.ParquetFile(tmpdir).to_pandas().count()
1.82 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With 20 partitions instead of 200 (fastparquet, categories)
In [42]: %timeit df_partitioned.count().compute()
863 ms ± 78.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You could also filter as you load the data, e.g. by a specific column:
df = dd.read_parquet('/path/to/*.parquet', engine='fastparquet',
                     filters=[(COLUMN, 'operation', 'SOME_VALUE')])
where the operation is one of ==, >, <, and so on.
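For instance, with the example data above, a concrete filter might look like this (illustrative values; row-group filtering relies on the min/max statistics written by the engine):
df = dd.read_parquet(tmpdir + "/*.parquet", engine='fastparquet',
                     filters=[('b', '==', 'A')])
df.count().compute()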

Condition for memory access conflict in memory-banked vector processors

The Hennessy-Patterson book on Computer Architecture (Quantitative Approach 5ed) says that in a vector architecture with multiple memory banks, a bank conflict can happen if the following condition is met (Page 279 in 5ed):
(Number of banks) / LeastCommonMultiple(Number of banks, Stride) < Bank busy time
However, I think it should be GreatestCommonFactor (GCD) instead of LCM, because a memory conflict will occur if the effective number of banks you have is less than the busy time. By effective number of banks I mean this: say you have 8 banks and a stride of 2; then effectively you have 4 banks, because the memory accesses will be lined up at only four banks (e.g., if your accesses are all even numbers, starting from 0, they will be lined up at banks 0, 2, 4, and 6).
In fact, this formula even fails for the example given right below it. Suppose we have 8 memory banks with a busy time of 6 clock cycles and a total memory latency of 12 clock cycles; how long will it take to complete a 64-element vector load with a stride of 1? Here they calculate the time as 12 + 64 = 76 clock cycles. However, a memory bank conflict would occur according to the given condition, so we clearly couldn't have one access per cycle (the 64 in the equation).
Am I getting it wrong, or has the wrong formula managed to survive 5 editions of this book (unlikely)?
GCD(banks, stride) should come into it; your argument about that is correct.
Let's try this for a few different strides and see what we get, for number of banks b = 8.
# generated with the calc(1) function
define f(s) { print s, " | ", lcm(s,8), " | ", gcd(s,8), " | ", 8/lcm(s,8), " | ", 8/gcd(s,8) }
stride | LCM(s,b) | GCF(s,b) | b/LCM(s,b) | b/GCF(s,b)
   1   |    8     |    1     |    1       |    8          # 8 < 6 = false: no conflict
   2   |    8     |    2     |    1       |    4          # 4 < 6 = true: conflict
   3   |   24     |    1     |   ~0.333   |    8          # 8 < 6 = false: no conflict
   4   |    8     |    4     |    1       |    2          # 2 < 6 = true: conflict
   5   |   40     |    1     |    0.2     |    8
   6   |   24     |    2     |   ~0.333   |    4
   7   |   56     |    1     |   ~0.143   |    8
   8   |    8     |    8     |    1       |    1
   9   |   72     |    1     |   ~0.111   |    8
   x   |  >=8     |  2^0..3  |   <=1      |  1, 2, 4, or 8
b/LCM(s,b) is always <=1, so it always predicts conflicts.
I think GCF (aka GCD) looks right for the stride values I've looked at so far. You only have a problem if the stride doesn't distribute the accesses over all the banks, and that's what b/GCF(s,b) tells you.
Stride = 8 should be the worst-case, using the same bank every time. gcd(8,8) = lcm(8,8) = 8. So both expressions give 8/8 = 1 which is less than the bank busy/recovery time, thus correctly predicting conflicts.
Stride=1 is of course the best case (no conflicts if there are enough banks to hide the busy time). gcd(8,1) = 1 correctly predicts no conflicts: (8/1 = 8, which is not less than 6). lcm(8,1) = 8. (8/8 < 6 is true) incorrectly predicts conflicts.
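A small sketch of the corrected check (my reading of the argument above, not the book's formula): a conflict is predicted when the number of banks actually touched is smaller than the bank busy time.
from math import gcd

def predicts_conflict(banks, stride, busy_time):
    # banks // gcd(banks, stride) = number of distinct banks the stride hits
    return banks // gcd(banks, stride) < busy_time

# reproduces the table above for b = 8 banks, busy time = 6 cycles
for s in range(1, 10):
    print(s, predicts_conflict(8, s, 6))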
