In place update columns of a matrix - memory

I'm trying to optimize the speed and allocation of this loop
using Random

function loop(n,k)
    m = rand(10,n)
    r = rand(10,1)
    for j in 1:k
        for i in 2:size(m,2)
            @inbounds m[:,i-1] = m[:,i] + rand!(r)
        end
    end
end
The memory allocation is quite big: @time loop(10000,30) reports 599.94 k allocations, and the number grows with k. I think there are two contributing factors: (1) the allocation of m[:,i] and (2) the allocation for m[:,i-1]. I hoped @inbounds would help, but adding or removing it makes no difference to the allocations.
Is there a way to reduce allocations? I'm not really creating any new objects, so the allocation count should be invariant to k. I tried replacing @inbounds with @view, but then the code didn't even run. I don't think I can use broadcast! here.

Use views, since they do not materialize a copy of the data, and use broadcasting:
function loop2(n,k)
    m = rand(10,n)
    for j in 1:k
        for i in 2:size(m,2)
            @inbounds m[:,i-1] .= @view(m[:,i]) .+ rand(10)
        end
    end
end
The new version is 3x faster, makes 3.5x fewer allocations, and uses less than half the memory:
julia> @btime loop(1000,1000)
  227.918 ms (3485003 allocations: 327.64 MiB)

julia> @btime loop2(1000,1000)
  75.154 ms (999002 allocations: 152.51 MiB)

Przemyslaw points out the main issues, but I see a further benefit from combining his ideas with your original idea of preallocating r and then using the in-place rand!:
julia> using Random

julia> function loop3(n,k)
           m = rand(10,n)
           r = rand(10)
           for j in 1:k
               for i in 2:size(m,2)
                   @inbounds m[:,i-1] .= @view(m[:,i]) .+ rand!(r)
               end
           end
       end
loop3 (generic function with 1 method)
Which gives:
julia> @btime loop(1_000, 1_000)
  266.851 ms (3485003 allocations: 327.64 MiB)

julia> @btime loop2(1_000, 1_000)
  101.003 ms (999002 allocations: 152.51 MiB)

julia> @btime loop3(1_000, 1_000)
  61.447 ms (3 allocations: 78.36 KiB)
which is basically now just the allocation of m and r:
julia> @btime begin
           rand(10, 1_000); rand(10)
       end
  15.881 μs (3 allocations: 78.36 KiB)
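As a final note (my addition, not part of the original answers): the same zero-allocation behavior can be had with a fully explicit element loop, which avoids even the view objects. A minimal sketch:

using Random

function loop4(n, k)
    m = rand(10, n)
    r = rand(10)
    for j in 1:k
        for i in 2:size(m, 2)
            rand!(r)  # refill the preallocated buffer in place
            @inbounds for row in 1:size(m, 1)
                m[row, i-1] = m[row, i] + r[row]
            end
        end
    end
end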

cv::cuda::GpuMat::create allocates much more than requested

I'm using the latest OpenCV 4.x with CUDA support + CUDA 11.6.
I want to allocate a GpuMat image in device memory like this:
cv::cuda::GpuMat test1;
test1.create(100, 1000000, CV_8UC1);
and I measure the consumed memory before and after the create call (using the nvidia-smi tool).
Before:
| 0 N/A N/A 372354 C ...aur/example_build/example 199MiB |
After:
| 0 N/A N/A 389636 C ...aur/example_build/example 295MiB |
So + ~100 MB - makes sense.
But when I allocate the image this way (W and H swapped):
cv::cuda::GpuMat test1;
test1.create(1000000, 100, CV_8UC1);
I see this:
Before:
| 0 N/A N/A 379124 C ...aur/example_build/example 199MiB |
After:
| 0 N/A N/A 379124 C ...aur/example_build/example 689MiB |
I expected the same increment as in the first case, though.
In some cases, consumption is 5x more than expected when the image is "high and narrow". What am I understanding wrong?
OpenCV GpuMat uses a pitched allocation. If the minimum pitch is, for example, 512 bytes, then allocating a "narrow" image is going to be extra-expensive.
On my Tesla V100, the minimum pitch (kind of like the minimum "width" allocated for each line) for a pitched allocation is 512 bytes. 512/100 = 5x.
I don't have a real workaround to suggest: either allocate a wider image, or accept the extra cost.
I think most CUDA GPUs will have a minimum pitch of 512 bytes, because the minimum texture alignment is 512 bytes. You can use the following code to find yours:
$ cat t2060.cu
#include <iostream>
int main(){
char *d;
size_t p;
cudaMallocPitch(&d, &p, 1, 100);
std::cout << p << std::endl;
}
$ nvcc -o t2060 t2060.cu
$ compute-sanitizer ./t2060
========= COMPUTE-SANITIZER
512
========= ERROR SUMMARY: 0 errors
$
(As an aside, the ~500MB consumed for a nominally 100MB image that is only 100 bytes wide is exactly what the pitch explains: each 100-byte row occupies a full 512-byte pitched line, so 689MiB - 199MiB ≈ 5 x 100MB.)
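If you would rather query the limit than measure it with a test allocation, the device properties report the texture pitch alignment directly. A minimal sketch (my addition; on most devices the minimum pitch matches this value):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("texturePitchAlignment: %zu bytes\n", prop.texturePitchAlignment);
    return 0;
}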

CVXPY trying to formulate big problem: ValueError: negative dimensions are not allowed. Is RAM usage the problem here?

I am facing what is, in my opinion, a fairly large optimization problem. You can see the code below. The variable b has size 500 x 96. What I am trying to do is match a sum of timeseries profiles (35136 fifteen-minute timesteps) to a bigger profile by minimizing their difference. With the same formulation and a much smaller problem (672 timesteps and a b variable of size 10 x 5), the problem is solved in under 2 seconds without a problem. But when I run it at full scale I get the error you see below.
I am running this in JupyterLab with Python 3.7.4. Python was installed with conda.
I would expect the problem to solve as the much smaller one did. But when I run this one, RAM usage explodes up to 100 GB (about 99% of the available RAM on the server). After a while the RAM usage goes down, and then a periodic swinging begins (RAM goes up and down between 50% and 100% every few minutes). From the error, and after a lot of googling, my suspicion is that the problem is too big for the memory and that at some point the data gets broken down into smaller pieces. I do not think it even reaches the point where the solver does its work. I tried to optimize the code by vectorizing everything (current version) and avoiding loops in the formulation, but this did not change anything. Do you guys have any clue whether this is a bug or a limitation? Or do you maybe have an idea on how to solve this?
X_opt = cp.Constant(np.asarray(X.iloc[:,:500])) # the array size is (35136,500)
K_opt = cp.Constant(np.asarray(K.YearlyDemand)) # the vector size is 96
b = cp.Variable((500,96),boolean = True, value = np.zeros((500,96)))
Y_opt = cp.Constant(np.asarray(y)) # the vector size is 35136
constraints = []
constraints.append( cp.sum(b, axis = 0) == 1 ) # the sum of the elements of every column of b must be equal to 1
constraints.append( cp.sum(b, axis = 1) <= 1 ) # the sum of the elements of every row of b must be smaller or equal to 1
objective = cp.Minimize(cp.sum(cp.abs(Y_opt-cp.sum((cp.diag(K_opt)*((X_opt@b).T)).T, axis = 1))))
prob = cp.Problem(objective, constraints)
prob.solve(solver = cp.GLPK_MI, verbose = True)
ValueError Traceback (most recent call last)
in
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\problems\problem.py in solve(self, *args, **kwargs)
287 else:
288 solve_func = Problem._solve
--> 289 return solve_func(self, *args, **kwargs)
290
291 @classmethod
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\problems\problem.py in _solve(self, solver, warm_start, verbose, parallel, gp, qcp, **kwargs)
567 self._construct_chains(solver=solver, gp=gp)
568 data, solving_inverse_data = self._solving_chain.apply(
--> 569 self._intermediate_problem)
570 solution = self._solving_chain.solve_via_data(
571 self, data, warm_start, verbose, kwargs)
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\reductions\chain.py in apply(self, problem)
63 inverse_data = []
64 for r in self.reductions:
---> 65 problem, inv = r.apply(problem)
66 inverse_data.append(inv)
67 return problem, inverse_data
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\reductions\matrix_stuffing.py in apply(self, problem)
98 # Batch expressions together, then split apart.
99 expr_list = [arg for c in cons for arg in c.args]
--> 100 Afull, bfull = extractor.affine(expr_list)
101 if 0 not in Afull.shape and 0 not in bfull.shape:
102 Afull = cvxtypes.constant()(Afull)
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\utilities\coeff_extractor.py in affine(self, expr)
76 size = sum([e.size for e in expr_list])
77 op_list = [e.canonical_form[0] for e in expr_list]
---> 78 V, I, J, b = canonInterface.get_problem_matrix(op_list, self.id_map)
79 A = sp.csr_matrix((V, (I, J)), shape=(size, self.N))
80 return A, b.flatten()
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\canonInterface.py in get_problem_matrix(linOps, id_to_col, constr_offsets)
65
66 # Unpacking
---> 67 V = problemData.getV(len(problemData.V))
68 I = problemData.getI(len(problemData.I))
69 J = problemData.getJ(len(problemData.J))
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\cvxcore.py in getV(self, values)
320
321 def getV(self, values):
--> 322 return _cvxcore.ProblemData_getV(self, values)
323
324 def getI(self, values):
ValueError: negative dimensions are not allowed
This problem is solved here:
https://github.com/cvxgrp/cvxpy/issues/826#issuecomment-648618636
Note that the general problem is that large problems create underlying matrices too large for the numpy.int32 indices which CVXPY uses. You can modify the code in CVXPY fairly easily to continue using the SCS solver.
You will have to modify the file canonInterface.py here:
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\
If you have trouble finding the second file to modify, just modify the first one, and use the traceback to find the second file.
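For a back-of-the-envelope sense of why int32 indexing is the limit being hit (my illustration, using the sizes from the question):

import numpy as np

n_vars = 500 * 96          # scalar entries of the boolean variable b
n_rows = 35136             # scalar terms coming from the objective alone
print(n_rows * n_vars)           # 1_686_528_000 slots in the stuffed matrix
print(np.iinfo(np.int32).max)    # 2_147_483_647
# With the extra rows from the abs() reformulation and the constraints,
# the index computations inside cvxcore exceed int32 and wrap negative,
# which surfaces as "negative dimensions are not allowed".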

python opencv create image from bytearray

I am capturing video from a Ricoh Theta V camera. It delivers the video as Motion JPEG (MJPEG). To get the video you have to do an HTTP POST, which, alas, means I cannot use the cv2.VideoCapture(url) feature.
So the way to do this per numerous posts on the web and SO is something like this:
import cv2
import numpy as np

# 'stream' is the response object returned by the HTTP POST to the camera
bytes = bytes()
while True:
    bytes += stream.read(1024)
    a = bytes.find(b'\xff\xd8')
    b = bytes.find(b'\xff\xd9')
    if a != -1 and b != -1:
        jpg = bytes[a:b+2]
        bytes = bytes[b+2:]
        i = cv2.imdecode(np.fromstring(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)
        cv2.imshow('i', i)
        if cv2.waitKey(1) == 27:
            exit(0)
That actually works, except it is slow. I'm processing a 1920x1080 JPEG stream on a MacBook Pro running OSX 10.12.6. The call to imdecode takes approximately 425,000 microseconds (425 ms) per image.
Any idea how to do this without imdecode, or how to make imdecode faster? I'd like it to work at 60 FPS with HD video (at least).
I'm using Python3.7 and OpenCV4.
Updated Again
I looked into JPEG decoding from the memory buffer using PyTurboJPEG, the code goes like this to compare with OpenCV's imdecode():
#!/usr/bin/env python3
import cv2
import numpy as np
from turbojpeg import TurboJPEG, TJPF_GRAY, TJSAMP_GRAY
# Load image into memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# Decode JPEG from memory into Numpy array using OpenCV
i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
# Use default library installation
jpeg = TurboJPEG()
# Decode JPEG from memory using turbojpeg
i1 = jpeg.decode(r)
cv2.imshow('Decoded with TurboJPEG', i1)
cv2.waitKey(0)
And the answer is that TurboJPEG is 7x faster! That is 4.6ms versus 32.2ms.
In [18]: %timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
32.2 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit i1 = jpeg.decode(r)
4.63 ms ± 55.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Kudos to @Nuzhny for spotting it first!
Updated Answer
I have been doing some further benchmarks on this and was unable to verify your claim that it is faster to save an image to disk and read it with imread() than it is to use imdecode() from memory. Here is how I tested in IPython:
import cv2
# First use 'imread()'
%timeit i1 = cv2.imread('image.jpg', cv2.IMREAD_COLOR)
116 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Now prepare the exact same image in memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# And try again with 'imdecode()'
%timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
113 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, I find imdecode() around 3% faster than imread() on my machine. Even if I include the np.asarray() into the timing, it is still quicker from memory than disk - and I have seriously fast 3GB/s NVME disks on my machine...
Original Answer
I haven't tested this but it seems to me that you are doing this in a loop:
read 1k bytes
append it to a buffer
look for the JPEG SOI marker (0xffd8)
look for JPEG EOI marker (0xffd9)
if you have found both the start and the end of a JPEG frame, decode it
1) Now, most JPEG images with any interesting content I have seen are between 30kB and 300kB, so you are going to do 30-300 append operations on the buffer. I don't know much about Python, but I guess that may cause a re-allocation of memory, which may be slow.
2) Next you are going to look for the SOI marker in the first 1kB, then again in the first 2kB, then again in the first 3kB, then again in the first 4kB - even if you have already found it!
3) Likewise, you are going to look for the EOI marker in the first 1kB, the first 2kB...
So, I would suggest you try the following (a rough sketch in code follows the list):
1) allocating a bigger buffer at the start and acquiring directly into it at the appropriate offset
2) not searching for the SOI marker if you have already found it - e.g. set it to -1 at the start of each frame and only try and find it if it is still -1
3) only look for the EOI marker in the new data on each iteration, not in all the data you have already searched on previous iterations
4) furthermore, actually, don't bother looking for the EOI marker unless you have already found the SOI marker, because the end of a frame without the corresponding start is no use to you anyway - it is incomplete.
I may be wrong in my assumptions, (I have been before!) but at least if they are public someone cleverer than me can check them!!!
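Here is a rough sketch of suggestions 1)-4) combined (my illustration, untested; stream is the HTTP response object from the question):

import cv2
import numpy as np

buf = bytearray()
soi = -1          # index of the SOI marker, -1 = not found yet
searched = 0      # how far we have already scanned for markers
while True:
    chunk = stream.read(65536)      # bigger reads mean far fewer appends
    if not chunk:
        break
    buf += chunk
    if soi == -1:
        # only search the data we have not scanned before
        soi = buf.find(b'\xff\xd8', searched)
    if soi != -1:
        # only look for EOI once SOI is found, and only in the new data
        eoi = buf.find(b'\xff\xd9', max(soi + 2, searched))
        if eoi != -1:
            jpg = bytes(buf[soi:eoi + 2])
            del buf[:eoi + 2]       # drop the consumed frame
            soi, searched = -1, 0
            img = cv2.imdecode(np.frombuffer(jpg, np.uint8), cv2.IMREAD_COLOR)
            cv2.imshow('i', img)
            if cv2.waitKey(1) == 27:
                break
            continue
    searched = max(len(buf) - 1, 0) # markers are 2 bytes: keep 1 byte of overlap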
I recommend using turbo-jpeg. It has a Python API: PyTurboJPEG.

sliding window in verilog when doing convolution

I am working on my CNN project in Verilog, but I am having some problems implementing the convolution of an image with a 3x3 filter. I wrote the code for the convolution module, but now, when it comes to the convolution itself, I have to read the values from the memory that contains the pixels of the image. The thing is that I have to read these values in a particular order, since convolution takes the dot product of 2 matrices and then strides by 1 to the right. So let's say the image is a 5x5 matrix stored in a memory array
[ a1  a2  a3  a4  a5
  a6  a7  a8  a9  a10
  a11 a12 a13 a14 a15 ] - memory RAM
how can I read the values of the memory in the following order:
a1, then a2, then a3; then a6, then a7, then a8; and the last row a11, a12, a13; and then stride and start over with a2, a3, etc., until I reach the end of the array. Please suggest a solution for how I should address the memory in this situation; a code snippet would be highly appreciated. Thank you.
p.s. my memory array will contain a lot of data, approximately a [400x300] matrix, where the filter is [3x3].
Looks like a simple case of nested for-loops. This walks through the 16-entry memory as you wanted:
for (start=0; start<3; start=start+1)
    for (i=1; i<16; i=i+5)
        for (j=0; j<3; j=j+1)
            data = mem[start+i+j]; // C: printf("%d\n",start+i+j);
Note that the code is both C and Verilog compatible so you can test your sequence in a C-compiler if you want (I did).
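If you want to try exactly that, here is a minimal C harness (my addition) around the same loops that prints the generated address sequence:

#include <stdio.h>

int main(void) {
    int start, i, j;
    for (start = 0; start < 3; start = start + 1)
        for (i = 1; i < 16; i = i + 5)
            for (j = 0; j < 3; j = j + 1)
                printf("%d\n", start + i + j);
    return 0;
}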
If you don't like the for loops you can make them into counters. In HDL you always reverse the order and start with the inner loop:
if (j<3)
    j <= j + 1;
else
begin
    j <= 0;
    if (i<16) // should be 15 if you start from 0
        i <= i + 5;
    else
    begin
        i <= 1; // You sure it should not be zero?
        if (start<3)
            start <= start + 1;
        else
        begin
            start <= 0;
            all_done <= 1'b1;
        end // end of start
    end // end of i
end // end of j
In a different part of the design you can now use start+i+j as the address.
Lastly: I would start with indices 0, 1, 2, as your picture is likely to start from memory address 0. You need to change the 'i' loop for that.
(HDL code is not compiled or tested)
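For the full [400x300] image from the question, the same addressing scheme generalizes roughly like this (my sketch, 0-based and untested, assuming a row-major memory with COLS pixels per row):

// ROWS x COLS image, 3x3 window, stride 1
for (row = 0; row <= ROWS - 3; row = row + 1)
    for (col = 0; col <= COLS - 3; col = col + 1)
        for (wr = 0; wr < 3; wr = wr + 1)
            for (wc = 0; wc < 3; wc = wc + 1)
                data = mem[(row + wr) * COLS + (col + wc)];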

torch/nn - Joining arrays of Tensors element-wise

The subject of this question is joining tensors for neural networks with the torch/nn and torch/nngraph libraries for Lua. I started coding in Lua a few weeks ago, so my experience is very minimal. In the text below, I refer to Lua tables as arrays.
Context
I am working with a recurrent neural network for speech recognition.
At some point in the network there are N arrays of M Tensors.
a = {a1, a2, ..., aM},
b = {b1, b2, ..., bM},
... N times
Where ai and bi are Tensors and {} represents an array.
What needs to be done is to join all those arrays element-wise, so that the output is an array of M Tensors where output[i] is the result of joining every i-th Tensor from the N arrays over the second dimension.
output = {z1, z2, ..., zM}
Example
|| used to represent Tensors
x = {|1 1|, |2 2|}
     |1 1|  |2 2|
Tensors of size 2x2

y = {|3 3 3|, |4 4 4|}
     |3 3 3|  |4 4 4|
Tensors of size 2x3

        |
        |  Join{x,y}
        \/

z = {|1 1 3 3 3|, |2 2 4 4 4|}
     |1 1 3 3 3|  |2 2 4 4 4|
Tensors of size 2x5
So the first Tensor of x (size 2x2) was joined with the first Tensor of y (size 2x3) over the second dimension, and the same for the second Tensor of each array, resulting in z, an array of 2x5 Tensors.
Problem
Now this is a basic concatenation, but I can't seem to find a module in the torch/nn library that would allow me to do that. I could write my own module of course, but if an already existing module does it then I would rather go with that.
The only existing module I know of that joins tables is (obviously) JoinTable. It takes an array of Tensors and joins them together. I want to join arrays of Tensors element-wise.
Also, as we feed input to our network, the number of Tensors in the N arrays varies, so M from the context above is not constant.
Idea
What I thought I could do in order to use the JoinTable module is convert my arrays into Tensors and then use JoinTable on the converted N Tensors. But then again I would need a module that does such a conversion, and another one to convert back to an array in order to feed the result to the next layers of the network.
Last resort
Write a new module that iterates over all given arrays and concatenates element-wise. Of course it's do-able, but the whole purpose of this post is to find a way to avoid writing smelly modules. It seems weird to me that such a module doesn't already exist.
Conclusion
I finally decided to do as I wrote in Last resort. I wrote a new module that iterates over all given arrays and concatenates element-wise.
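The core of such a module looks roughly like this (a minimal sketch, a reconstruction rather than the exact code):

require 'torch'

-- join N arrays of tensors element-wise along dimension 2
local function joinElementwise(arrays)
    local out = {}
    for i = 1, #arrays[1] do
        local slice = {}
        for n = 1, #arrays do
            slice[n] = arrays[n][i]   -- i-th tensor of the n-th array
        end
        out[i] = torch.cat(slice, 2)  -- concatenate along the second dim
    end
    return out
end

-- e.g. joinElementwise({x, y})[1] gives the 2x5 tensor from the example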
Though, the answer given by @fmguler does the same without having to write a new module.
You can do it with nn.SelectTable and nn.JoinTable like this:
require 'nn'
x = {torch.Tensor{{1,1},{1,1}}, torch.Tensor{{2,2},{2,2}}}
y = {torch.Tensor{{3,3,3},{3,3,3}}, torch.Tensor{{4,4,4},{4,4,4}}}
res = {}
res[1] = nn.JoinTable(2):forward({nn.SelectTable(1):forward(x),nn.SelectTable(1):forward(y)})
res[2] = nn.JoinTable(2):forward({nn.SelectTable(2):forward(x),nn.SelectTable(2):forward(y)})
print(res[1])
print(res[2])
If you want this to be done in a module, wrap it in nngraph:
require 'nngraph'
x = {torch.Tensor{{1,1},{1,1}}, torch.Tensor{{2,2},{2,2}}}
y = {torch.Tensor{{3,3,3},{3,3,3}}, torch.Tensor{{4,4,4},{4,4,4}}}
xi = nn.Identity()()
yi = nn.Identity()()
res = {}
--you can loop over columns here>>
res[1] = nn.JoinTable(2)({nn.SelectTable(1)(xi),nn.SelectTable(1)(yi)})
res[2] = nn.JoinTable(2)({nn.SelectTable(2)(xi),nn.SelectTable(2)(yi)})
module = nn.gModule({xi,yi},res)
--test like this
result = module:forward({x,y})
print(result)
print(result[1])
print(result[2])
--gives the result
th> print(result)
{
1 : DoubleTensor - size: 2x5
2 : DoubleTensor - size: 2x5
}
th> print(result[1])
1 1 3 3 3
1 1 3 3 3
[torch.DoubleTensor of size 2x5]
th> print(result[2])
2 2 4 4 4
2 2 4 4 4
[torch.DoubleTensor of size 2x5]
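Following the "--you can loop over columns here" hint, the graph construction generalizes to any number of tensors per array (a sketch, reusing xi and yi from the snippet above and assuming M is known when the graph is built):

M = 2  -- number of tensors per input array
res = {}
for i = 1, M do
    res[i] = nn.JoinTable(2)({nn.SelectTable(i)(xi), nn.SelectTable(i)(yi)})
end
module = nn.gModule({xi, yi}, res)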
