Optimizing image acquisition with Matlab parallel computing toolbox tools - image-processing

Using a single MATLAB worker I can easily achieve the maximal frames per second (fps) of my camera (using the MATLAB imaq toolbox). This simple code does it:
matlabpool(1)
start(vid)
pause(1); % give matlab time to initialize the camera
for j = 1:frames
    data = getsnapshot(vid);
end
However, once I try to do some image processing on the fly, the effective rate drops by 50%. Since I have 5 more workers in the matlabpool (and also a GPU), can I optimize this so that each grabbed frame is processed by a different worker? For example:
for j = 1:frames
    data = getsnapshot(vid);
    % <do some analysis with worker mod(j,5)+2, i.e. workers 2 to 6>
end
The issue is that 'data' is obtained serially from the camera, and the analysis takes about 2 rounds of the loop, so if a different worker (or core) took care of it each time, the maximum fps could be obtained again...

The way I see it, the workflow here is serial by nature.
The best you can do is vectorize/parallelize your image processing function (so you still grab images one by one, but distribute the processing across multiple cores).
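Not MATLAB, but the same grab-serially / process-in-parallel pattern is easy to sketch in Python with a standard-library pool. `grab_frame` and `analyze` below are hypothetical stand-ins for `getsnapshot(vid)` and the real analysis; for CPU-bound analysis you would swap in a ProcessPoolExecutor so separate cores do the work.

```python
from concurrent.futures import ThreadPoolExecutor

def grab_frame(j):
    # Stand-in for getsnapshot(vid): returns a fake 4-pixel "frame".
    return [j] * 4

def analyze(frame):
    # Stand-in for the per-frame analysis that takes ~2 loop rounds.
    return sum(frame)

def acquire_and_process(frames, workers=5):
    """Acquire frames strictly one by one, but analyze them concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for j in range(frames):
            frame = grab_frame(j)                        # serial acquisition
            futures.append(pool.submit(analyze, frame))  # parallel analysis
        return [f.result() for f in futures]             # results in frame order
```

Calling `acquire_and_process(6)` returns per-frame results in acquisition order while up to 5 frames are being analyzed at once, which is the effect the question is after.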

I think I got the solution:
for i = 1:frames
    for sf = 1:6  % I have 6 cores
        m(:,:,sf) = getsnapshot(vid);
    end
    spmd
        result = f(m(:,:,labindex));
    end
end
I managed to get better results with GPU parallelization, though...

Related

Why does a dask worker fail due to MemoryError on a "small" task? [Dask.bag]

I am running a pipeline on multiple images. The pipeline consists of reading the images from the file system, doing some processing on each of them, then saving the images back to the file system. However, the dask workers fail due to MemoryError.
Is there a way to ensure the dask workers don't load too many images into memory? i.e. wait until there is enough space on a worker before starting the processing pipeline on a new image.
I have one scheduler and 40 workers with 4 cores and 15 GB RAM each, running CentOS 7. I am trying to process 125 images in a batch; each image is fairly large but small enough to fit on a worker; around 3 GB are required for the whole process.
I tried to process a smaller number of images and it works great.
EDITED
from dask.distributed import Client, LocalCluster
# LocalCluster is used to show the config of the workers on the actual cluster
client = Client(LocalCluster(n_workers=2, resources={'process': 1}))
paths = ['list', 'of', 'paths']
# Read the file data from each path
data = client.map(read, paths, resources={'process': 1})
# Apply foo to the data n times
for _ in range(n):
    data = client.map(foo, data, resources={'process': 1})
# Save the processed data
data = client.map(save, data, resources={'process': 1})
# Retrieve results
client.gather(data)
I expected the images to be processed as space became available on the workers, but it seems like the images are all loaded simultaneously onto the different workers.
EDIT:
My issue is that all the tasks get assigned to workers at once and the workers don't have enough memory. I found how to limit the number of tasks a worker handles at a single moment ([see here](https://distributed.readthedocs.io/en/latest/resources.html#resources-are-applied-separately-to-each-worker-process)).
However, with that limit, when I execute my tasks they all finish the read step, then the process step and finally the save step. This is an issue since the images are spilled to disk.
Would there be a way to make every task finish completely before starting a new one?
e.g. on Worker-1: read(img1)->process(img1)->save(img1)->read(img2)->...
Dask does not generally know how much memory a task will need; it can only know the size of the outputs, and that only once they are finished. This is because Dask simply executes a python function and then waits for it to complete, and all sorts of things can happen within a python function. You should generally expect as many tasks to begin as you have available worker cores - as you are finding.
If you want a smaller total memory load, then your solution should be simple: have a small enough number of workers, so that if all of them are using the maximum memory you can expect, you still have some spare in the system to cope.
To EDIT: you may want to try running optimize on the graph before submission (although this should happen anyway, I think), as it sounds like your linear chains of tasks should be "fused". http://docs.dask.org/en/latest/optimize.html
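Outside of dask, the behaviour the edit asks for ("finish each image before starting a new one") can be sketched with a fused task plus a semaphore that bounds the number of in-flight images. This is a stdlib illustration of the idea, not dask API; `pipeline` here stands for a fused read->process->save chain, and the names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

def process_bounded(items, pipeline, max_in_flight=2, workers=4):
    """Run pipeline(item) for each item, with at most max_in_flight items started."""
    slots = Semaphore(max_in_flight)

    def run(item):
        try:
            return pipeline(item)  # fused read->process->save: one image per task
        finally:
            slots.release()        # free the slot only once the whole chain is done

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for item in items:
            slots.acquire()        # block until an image "slot" is free
            futures.append(pool.submit(run, item))
        return [f.result() for f in futures]
```

With `max_in_flight=2`, at most two images ever sit in memory at once, regardless of how many worker threads exist, which is the memory cap the question wants.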

Does autodiff in tensorflow work if I have a for loop involved in constructing my graph?

I have a situation where I have a batch of images, and in each image I have to perform some operation over a tiny patch of that image. Now the problem is that the patch size varies per image in the batch, which implies that I cannot vectorize it. I could vectorize by considering the entire range of pixels in an image, but my patch is a really small fraction of each image and I don't want to waste memory by performing the operation and storing the results for all the pixels in every image.
So in short, I need to use a loop. Now I see that tensorflow has just a while loop defined and no for loops. So my question is: if I use a plain Python-style for loop for performing operations over my tensor, will autodiff fail to calculate gradients in my graph?
Tensorflow does not know (and does not care) how the graph has been constructed; you could even write each node by hand, as long as you use the proper functions to do so. So in particular a Python for loop has nothing to do with TF. A TF while loop, on the other hand, gives you the ability to express dynamic computations inside the graph, so if you want to process data in a sequence and only need the current element in memory, only a while loop can achieve that. If you create a huge graph by hand (through the loop), it will always be executed in full, with everything stored in memory. As long as this fits on your machine you should be fine. The other issue is dynamic length: if sometimes you need to run the loop 10 times and sometimes 1000, you have to use tf.while_loop; you cannot do that with a Python for loop (unless you create separate graphs for each possible length).

TensorFlow: How does one check for bottlenecks in data input pipeline?

I'm currently using tf-slim to create and read tfrecord files into my models, and through this method there is an automatic tensorboard visualization available showing:
The tf.train.batch batch/fraction_of_32_full visualization, which is consistently near 0. I believe this should depend on how fast the dequeue operation gives the tf.train.batch FIFO queue its tensors.
The parallel reader parallel_read/filenames/fraction_of_32_full and parallel_read/fraction_of_5394_full visualizations, which are always at 1.0. I believe this op is what extracts the tensors from the tfrecords and puts them into a queue ready for dequeuing.
My question is this: is my dequeuing operation too slow, causing a bottleneck in my model evaluation?
Why does "fraction_of_32" appear although I'm using a batch size of 256? Also, is a queue fraction value of 1.0 the ideal case, since it would mean the data is always ready for the GPU to work on?
If my dequeuing operation is too slow, how do I actually improve its speed? I've checked the source code for tf-slim and it seems that the decoder is embedded within the function I'm using, and I'm not sure there's an external way to work around it.
I had a similar problem. If batch/fraction_of_32_full gets close to zero, it means that you are consuming data faster than you are producing it.
32 is the default size of the queue, regardless of your batch size. It is wise to set it at least as large as the batch size.
This is the relevant doc: https://www.tensorflow.org/api_docs/python/tf/train/batch
Setting num_threads = multiprocessing.cpu_count(), and capacity = batch_size can help to keep the queue full.
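The queue-fullness intuition can be illustrated with a toy, deterministic model (plain Python, not TF code): a prefetch queue filled at one rate and drained at another. The rates and capacity below are made up; the point is only that an average fullness near 0 means the consumer outpaces the producer, while 1.0 means data is always ready.

```python
def simulate_queue(produce_per_step, consume_per_step, capacity, steps):
    """Toy prefetch queue: return the average fullness fraction (0..1)."""
    size = 0
    fullness = []
    for _ in range(steps):
        size = min(capacity, size + produce_per_step)  # producer enqueues
        fullness.append(size / capacity)               # fullness the consumer sees
        size = max(0, size - consume_per_step)         # consumer dequeues a batch
    return sum(fullness) / steps

# Consumer faster than producer -> queue hovers near empty (low fraction)
starved = simulate_queue(produce_per_step=8, consume_per_step=32, capacity=32, steps=100)
# Producer keeps up -> queue is always full when the consumer looks (fraction 1.0)
healthy = simulate_queue(produce_per_step=64, consume_per_step=32, capacity=32, steps=100)
```

In the starved case the producer refills only a quarter of the capacity between dequeues, mirroring a fraction_of_32_full chart stuck near zero; raising the producer rate (more reader threads, larger capacity) pushes the fraction toward 1.0.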

Calculate power consumption when the CPU doesn't change to LPM mode in Contiki

I need to calculate the power consumption of the CPU, according to this formula:
Power(mW) = cpu * 1.8 / time
where time is the sum of cpu + lpm.
I need to measure at the start of a certain process and at the end; however, the time passed is too short and the CPU doesn't change to LPM mode, as seen in the following values taken with powertrace_print():
all_cpu all_lpm all_transmit all_listen
116443 1514881 148 1531616
17268 1514881 148 1532440
Calculating the power consumption of the CPU, I got 1.8 mW (which is exactly the current draw of the CPU in active mode).
My question is: how do I calculate power consumption in this case?
If the MCU does not go into LPM, then it spends all the time in active mode, so the 1.8 mW result you get looks correct.
Perhaps you want to ask something different? If you want to measure the time required to execute a specific block of code, you can add RTIMER_NOW() calls at the start and end of the block.
The time resolution of RTIMER_NOW() may be too coarse for short operations. You can use a higher-frequency timer for that, depending on your platform; e.g. read the TBR register directly if you're compiling for an MSP430-based sensor node.
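The arithmetic of the formula above is easy to check in a few lines of Python. The 1.8 mW figure is the active-mode draw from the question; the 99175-tick value is just the difference between the two cpu samples shown (116443 - 17268), used here as an assumed delta.

```python
def cpu_power_mw(cpu_ticks, lpm_ticks, active_mw=1.8):
    """Power(mW) = cpu * active_mw / (cpu + lpm): active draw scaled by duty cycle."""
    total = cpu_ticks + lpm_ticks
    if total == 0:
        raise ValueError("no time elapsed between the two samples")
    return cpu_ticks * active_mw / total

# The lpm delta is 0 (the MCU never slept), so the average equals
# the full active-mode draw, ~1.8 mW, exactly as observed:
print(cpu_power_mw(cpu_ticks=99175, lpm_ticks=0))
# Spending half the ticks in LPM would halve the average (ignoring LPM draw), ~0.9 mW:
print(cpu_power_mw(cpu_ticks=99175, lpm_ticks=99175))
```

So the formula is behaving correctly; with no LPM time there is simply nothing to average against, and the answer is the active-mode figure itself.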

DirectX 11, Combining pixel shaders to prevent bottlenecks

I'm trying to implement one complex algorithm using the GPU. The only problem is HW limitations: the maximum available feature level is 9_3.
The algorithm is basically a "stereo matching"-like algorithm for two images. Because of the mentioned limitations, all calculations have to be performed in vertex/pixel shaders only (there is no compute API available). Vertex shaders are rather useless here, so I consider them pass-through vertex shaders.
Let me shortly describe the algorithm:
Take two images and calculate cost volume maps (basically converting RGB to grayscale -> translating the right image by D and subtracting it from the left image). This step is repeated around 20 times for different D, which generates a Texture3D.
Problem here: I cannot simply create one pixel shader which calculates those 20 repetitions in one go, because of the size limitation of pixel shaders (max. 512 arithmetic instructions), so I'm forced to call Draw() in a loop in C++, which unnecessarily involves the CPU while all operations are done on the same two images - it seems to me like I have a bottleneck here. I know that there are multiple render targets, but: there are max. 8 targets (I need 20+), and if I try to generate 8 results in one pixel shader I exceed its size limit (512 arithmetic instructions on my HW).
Then, for each of the calculated textures, I need to calculate a box filter with windows where r > 9.
Another problem here: because the window is so big, I need to split the box filtering into two pixel shaders (vertical and horizontal directions separately), because the loop unrolling stage produces very long code. Implementing those loops manually won't help, since it would still create too big a pixel shader. So another bottleneck here - the CPU needs to be involved to pass results from a temp texture (the result of the V pass) to the second pass (the H pass).
Then, in the next step, some arithmetic operations are applied to each pair of results from the 1st and 2nd steps.
I haven't reached this point yet in my development, so I have no idea what kind of bottlenecks are waiting for me here.
Then the minimal D (the parameter value from the 1st step) is taken for each pixel, based on the pixel value from step 3.
... same as in step 3.
Here is a VERY simple graph showing my current implementation (excluding steps 3 and 4).
The red dots are temporary buffers (textures) where partial results are stored, and at every red dot the CPU gets involved.
Question 1: Isn't it possible to somehow let the GPU perform each branch from top to bottom without involving the CPU and causing a bottleneck? I.e., to program the sequence of graphics pipelines in one go and then let the GPU do its job?
One additional question about the render-to-texture approach: do all textures reside in GPU memory all the time, even between Draw() calls and pixel/vertex shader switches? Or does some transfer from GPU to CPU happen? Because this may be another issue leading to a bottleneck.
Any help would be appreciated!
Thank you in advance.
Best regards,
Lukasz
Writing computational algorithms in pixel shaders can be very difficult. Writing such algorithms for the 9_3 target can be impossible - too many restrictions. But, well, I think I know how to work around your problems.
1. Shader repetition
First of all, it is unclear what you call a "bottleneck" here. Yes, theoretically, draw calls in a for loop are a performance loss. But is it the bottleneck? Does your application really lose performance there? How much? Only profilers (CPU and GPU) can answer that. But to run one, you must first complete your algorithm (stages 3 and 4). So I'd rather stick with the current solution, implement the whole algorithm, then profile and fix the performance issues.
But if you feel ready for tweaks... The common "repetition" technique is instancing. You create one more vertex buffer (called an instance buffer), which contains parameters not for each vertex, but for each draw instance. Then you do all the work with one DrawInstanced() call.
For your first stage, the instance buffer can contain your D value and the index of the target Texture3D layer. You can pass them through from the vertex shader.
As always, you have a tradeoff here: simplicity of code versus (probably) performance.
2. Multi-pass rendering
CPU needs to be involved to pass results from temp texture (result of V pass) to the second pass (H pass)
Typically, you chain passes like this, so no CPU is involved:
// Pass 1: from texture 0 to texture 1
// ...set up pipeline state for Pass 1 here...
pContext->PSSetShaderResources(slot, 1, &pTexture0SRV); // source (shader resource view)
pContext->OMSetRenderTargets(1, &pTexture1RTV, nullptr); // target (render target view)
pContext->Draw(...);
// Pass 2: from texture 1 to texture 2
// ...set up pipeline state for Pass 2 here...
pContext->PSSetShaderResources(slot, 1, &pTexture1SRV); // previous target is now the source
pContext->OMSetRenderTargets(1, &pTexture2RTV, nullptr);
pContext->Draw(...);
// Pass 3: ...
// Pass 3: ...
Note that texture 1 must be created with both the D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET flags (you bind a shader-resource view and a render-target view of the same texture). You can have multiple input textures and multiple render targets. Just make sure that every pass knows what the previous pass outputs.
And if a previous pass uses more resources than the current one, don't forget to unbind the unneeded ones to prevent hard-to-find errors:
ID3D11ShaderResourceView* nullSRV = nullptr;
pContext->PSSetShaderResources(2, 1, &nullSRV);
pContext->PSSetShaderResources(3, 1, &nullSRV);
pContext->PSSetShaderResources(4, 1, &nullSRV);
// Only texture slots 0 and 1 will be used
3. Resource data location
Do all textures reside in GPU memory all the time, even between Draw() method calls and pixel/vertex shader switching?
We can never know that. The driver chooses the appropriate location for resources. But if your resources are created with DEFAULT usage and a CPU access flag of 0, you can be almost sure they will always be in video memory.
Hope it helps. Happy coding!
