Image Processing Pipelining in VHDL - memory

I am currently developing a Sobel filter in VHDL. I am using a 640x480 picture that is stored in a BRAM. The algorithm uses a 3x3 matrix of pixels of the image to process each output pixel. The problem is that I currently only know how to put an image into a BRAM with one pixel per address, which means I can only read one pixel per clock. To pipeline the design I would ideally read three pixel values (one from each row of the picture) per clock, so that after the initial latency I can load three new pixel values and produce one output pixel on every clock. I am looking for a way to do this but cannot figure it out.
The only way I can think of to fix this is to keep the image in 3 BRAMs; that way I can read values from 3 rows on each clock cycle. However, there is not enough memory to fit even one more RAM large enough to hold a 640x480 image, let alone three. I could lower the picture size to do it this way, but I really want to keep my current 640x480 image size.
Any help or guidance would be greatly appreciated.

A simple solution is to store the image in 4 separate memories, each holding a quarter of it: the first memory contains lines 0, 4, 8, ..., the second contains lines 1, 5, 9, ..., and so on. I would use 4 memories even though you only need 3 lines, since 4 evenly divides 480 (and every other standard resolution), and taking a binary number modulo 4 is trivial, which is needed to order the memories.
You can use the upper bits of the line number to address your RAMs, and the two LSBs to figure out the relative order of each RAM output (the code below only demonstrates the idea; it is not usable as is):
address <= line(line'left downto 2) & col; -- or something more efficient in packing
data0   <= ram0(address);
data1   <= ram1(address);
data2   <= ram2(address);
data3   <= ram3(address);

case line(1 downto 0) is
    when "00" =>
        line0 <= data0;
        line1 <= data1;
        line2 <= data2;
    when "01" =>
        line0 <= data1;
        line1 <= data2;
        line2 <= data3;
    when "10" =>
        line0 <= data2;
        line1 <= data3;
        line2 <= data0;
    when "11" =>
        line0 <= data3;
        line1 <= data0;
        line2 <= data1;
    when others => null;
end case;

I made a Sobel filter a few years ago. To do that, I wrote a pipeline that delivers 9 pixels on each clock cycle:
architecture rtl of matrix_3x3_builder_8b is
    -- Shift register spanning two full image lines plus three pixels
    type fifo_t is array (0 to 2*IM_WIDTH + 2) of std_logic_vector(7 downto 0);
    signal fifo_int : fifo_t;
begin
    p0_build_3x3: process(rst_i, clk_i)
    begin
        if rst_i = '1' then
            fifo_int <= (others => (others => '0'));
        elsif rising_edge(clk_i) then
            if data_valid_i = '1' then
                for i in 1 to 2*IM_WIDTH + 2 loop
                    fifo_int(i) <= fifo_int(i-1);
                end loop;
                fifo_int(0) <= data_i;
            end if;
        end if;
    end process p0_build_3x3;

    -- Three taps per line, spaced one image line apart, form the 3x3 window
    data_o1 <= fifo_int(0*IM_WIDTH + 0);
    data_o2 <= fifo_int(0*IM_WIDTH + 1);
    data_o3 <= fifo_int(0*IM_WIDTH + 2);
    data_o4 <= fifo_int(1*IM_WIDTH + 0);
    data_o5 <= fifo_int(1*IM_WIDTH + 1);
    data_o6 <= fifo_int(1*IM_WIDTH + 2);
    data_o7 <= fifo_int(2*IM_WIDTH + 0);
    data_o8 <= fifo_int(2*IM_WIDTH + 1);
    data_o9 <= fifo_int(2*IM_WIDTH + 2);
end rtl;
Here you read the image pixel by pixel to build your 3x3 matrix. The pipeline takes longer to fill up, but once it is full, you get a new matrix on every clock pulse.
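If it helps to see the same idea outside of HDL, here is a small behavioral C++ model of that shift register (the struct and its names are mine, for illustration only); pushing one pixel per "clock" and reading the nine taps reproduces the 3x3 window:

#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t IM_WIDTH = 640;   // matches the generic in the VHDL

// Behavioral model of the shift register above: it stores the last
// 2*IM_WIDTH + 3 pixels; taps spaced one image line apart expose the window.
struct LineBuffer3x3 {
    std::array<std::uint8_t, 2 * IM_WIDTH + 3> fifo{};

    void push(std::uint8_t pixel) {
        for (std::size_t i = fifo.size() - 1; i > 0; --i)
            fifo[i] = fifo[i - 1];          // same shift as the VHDL loop
        fifo[0] = pixel;
    }

    // row and col in 0..2; row 0 is the most recently received line
    std::uint8_t window(std::size_t row, std::size_t col) const {
        return fifo[row * IM_WIDTH + col];
    }
};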

If you want to continue storing the whole image, then I would do as Jonathan Drolet recommended and cycle between four RAMs while writing, and read all 4 at once (muxing the three you care about into 3 registers).
This works because your RAMs will be deep enough that you still get full BRAM utilization at 1/4 the depth (still 77k deep), and your reads can be predictably segmented.
For the specifics of this problem, Nicolas Roudel's method is much cheaper in BRAM, although you can't store the whole image at one time, so whatever you send your results to can't backpressure you unless you can backpressure your data source. That may or may not matter for your application.
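To put numbers on that: storing a full 640x480 frame takes 640 * 480 = 307,200 pixel locations, while the line-buffer pipeline above only ever holds 2 * 640 + 3 = 1,283 pixels.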
When you try to do something like this with extremely wide but fairly shallow (1k deep) RAMs, segmenting will use more block RAM (or even start inferring distributed RAM). When the reads do not follow a particular pattern (the pattern in your case is that they are all sequential, adjacent locations), the RAM cannot be segmented. The best strategy to maintain efficient BRAM use is often to build a quad-port RAM from the natively dual-port block RAMs by clocking them with a 2x clock that is phase-aligned with your normal clock, allowing one write and three reads every 1x clock cycle.

Related

MedianBlur() calculate max kernel size

I want to use the MedianBlur function with a very high ksize, like 301 or more. But if the ksize is too high, sometimes the function will crash. The error message is:
OpenCV Error: (k < 16) in cv::medianBlur_8u_O1, in file ../opencv\modules\imgproc\src\smooth.cpp
(I use opencv4nodejs, but I also tried the original OpenCV 3.4.6).
I tried reducing the ksize in a try/catch loop, but that is not very effective, since I have to work with videos.
I checked out the OpenCV source code and did some research.
In OpenCV 3.4.6, the crash comes from line 241 of the file opencv\modules\imgproc\src\median_blur.simd.hpp:
for ( k = 0; k < 16; ++k )
{
    sum += H.coarse[k];
    if ( sum > t )
    {
        sum -= H.coarse[k];
        break;
    }
}
CV_Assert( k < 16 ); // Error here
t is calculated based on ksize, but the calculations behind sum and the H.coarse array are quite involved.
Researching further, I found a paper about the algorithm: https://www.researchgate.net/publication/321690537_Efficient_Scalable_Median_Filtering_Using_Histogram-Based_Operations
I am trying to read it, but honestly I don't understand much of it.
How do I calculate the maximum ksize with a given image?
The maximum kernel size is determined by the bit depth of the image. As mentioned in the publication you cited:
"An 8-bit value is limited to a max value of 255. Our goal is to
support larger kernel sizes, including kernels that are greater in
size than 17 × 17, thus the larger 32-bit data type is used"
so for an image of data type CV_8U the maximum kernel size is 255.
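If you just need to avoid the crash, a small guard before the call is enough. A minimal C++ sketch (the wrapper name and clamping policy are mine, not part of OpenCV): for apertures larger than 5, cv::medianBlur only accepts CV_8U input, and per the above the 8-bit path tops out at 255.

#include <algorithm>
#include <opencv2/imgproc.hpp>

// Clamp the requested aperture to what the input depth supports,
// then call cv::medianBlur. "requested" is a hypothetical caller value.
cv::Mat boundedMedianBlur(const cv::Mat& src, int requested)
{
    int maxK = (src.depth() == CV_8U) ? 255 : 5; // >5 requires CV_8U
    int ksize = std::min(requested, maxK);
    if (ksize % 2 == 0) ksize -= 1;              // ksize must be odd
    ksize = std::max(ksize, 3);                  // and greater than 1
    cv::Mat dst;
    cv::medianBlur(src, dst, ksize);
    return dst;
}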

arbitrarily weighted moving average (low- and high-pass filters)

Given an input signal x (e.g. a voltage, sampled a thousand times per second, a couple of minutes long), I'd like to calculate e.g.
/ this is not q
y[3] = -3*x[0] - x[1] + x[2] + 3*x[3]
y[4] = -3*x[1] - x[2] + x[3] + 3*x[4]
. . .
I'm aiming for variable window length and weight coefficients. How can I do this in q? I'm aware of mavg, signal processing in q, and the moving sum q idiom.
In the DSP world this is called applying a filter kernel by convolution. The weight coefficients define the kernel, which makes it a high- or low-pass filter. The example above calculates the slope of the last four points, fitting a straight line via the least-squares method.
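To see where those coefficients come from: for sample times t = 0, 1, 2, 3 (mean 1.5), the least-squares slope is sum( (t - 1.5) * x[t] ) / sum( (t - 1.5)^2 ) = ( -1.5*x[0] - 0.5*x[1] + 0.5*x[2] + 1.5*x[3] ) / 5 = ( -3*x[0] - x[1] + x[2] + 3*x[3] ) / 10, i.e. the kernel above up to a constant factor of 1/10.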
Something like this would work for parameterisable coefficients:
q)x:10+sums -1+1000?2f
q)f:{sum x*til[count x]xprev\:y}
q)f[3 1 -1 -3] x
0n 0n 0n -2.385585 1.423811 2.771659 2.065391 -0.951051 -1.323334 -0.8614857 ..
Specific cases can be made a bit faster (running 0 xprev is not the best thing):
q)g:{prev[deltas x]+3*x-3 xprev x}
q)g[x]~f[3 1 -1 -3]x
1b
q)\t:100000 f[3 1 -1 -3] x
4612
q)\t:100000 g x
1791
There's a Kx white paper on signal processing in q if this area interests you: https://code.kx.com/q/wp/signal-processing/
This may be a bit old, but I thought I'd weigh in. There is a paper I wrote last year on signal processing that may be of some value. Working purely within kdb, depending on the signal sizes you are using, you will see much better performance with an FFT-based convolution between the kernel/window and the signal.
However, I've only written up a simple radix-2 FFT, although in my GitHub repo I do have untested work for a more flexible Bluestein algorithm, which allows for more variable signal lengths: https://github.com/callumjbiggs/q-signals/blob/master/signal.q
If you wish to go down the path of performing a full manual convolution by a moving sum, then the best method is to break it up into blocks equal to the kernel/window size (this is based on some work Arthur W did many years ago):
q)vec:10000?100.0
q)weights:30?1.0
q)wsize:count weights
q)(weights$(((wsize-1)#0.0),vec)til[wsize]+) each til count vec
32.5931 75.54583 100.4159 124.0514 105.3138 117.532 179.2236 200.5387 232.168.
If your input list is not big, then you could use the technique mentioned here:
https://code.kx.com/q/cookbook/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
It uses the 'scan' adverb. That approach creates multiple intermediate lists, which might be inefficient for big lists.
Another solution using scan is:
q)f:{sum y*next\[z;x]} / x-input list, y-weights, z-window size-1
q)f[x;-3 -1 1 3;3]
This function also creates multiple lists, so again it might not be very efficient for big lists.
Another option is to use indices to fetch the target items from the input list and perform the calculation. This operates only on the input list:
q) f:{[l;w;i]sum w*l i+til 4} / w- weight, l- input list, i-current index
q) f[x;-3 -1 1 3] each til count x
This is a very basic function; you can add more variables to it as per your requirements.
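If you want to sanity-check any of the q versions against a reference implementation outside q, here is a minimal C++ sketch of the same sliding dot product (the function name and example data are made up for illustration):

#include <cstddef>
#include <cstdio>
#include <vector>

// y[n] = w[0]*x[n-W+1] + ... + w[W-1]*x[n]; the first W-1 outputs are
// undefined (q fills them with nulls, here they are left at zero).
std::vector<double> wma(const std::vector<double>& x, const std::vector<double>& w)
{
    const std::size_t W = w.size();
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = W - 1; n < x.size(); ++n)
        for (std::size_t k = 0; k < W; ++k)
            y[n] += w[k] * x[n - (W - 1) + k];
    return y;
}

int main()
{
    std::vector<double> x = {1, 2, 4, 7, 11, 16};
    std::vector<double> w = {-3, -1, 1, 3};   // the slope kernel from the question
    for (double v : wma(x, w))
        std::printf("%g ", v);
    std::printf("\n");
}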

Image computation on GPU and value returning

I have a C# project in which I retrieve grey-scale images from cameras and do some computation with the image data. The computations are quite time-consuming, since I need to loop over the whole image several times, and I am doing it all on the CPU.
Now I would like to try to get the evaluation running on the GPU, but I am struggling to achieve that, since I have never done any GPU calculations before.
The software should be able to run on several computers with varying hardware, so CUDA, for example, is not a solution for me, since the code should also run on laptops that only have onboard graphics. After some research I came across Cloo (I found it through this project), which seems to be quite a reasonable choice.
So far I have integrated Cloo into my project and tried to get this hello world example running. I guess it is running, since I don't get any exception, but I don't know where I can see the printed output.
For my computations I need to pass the image to the GPU and I also need the x-y coordinates during the computation. So, in C# the computation looks like this:
int a = 0;
for (int y = 0; y < img_height; y++) {
    for (int x = 0; x < img_width; x++) {
        a += image[x, y] * x * y;
    }
}

int b = 0;
for (int y = 0; y < img_height; y++) {
    for (int x = 0; x < img_width; x++) {
        b += image[x, y] * (x - a) * y;
    }
}
Now I want these calculations to run on the GPU, and I want to parallelize the y-loop so that each task runs one x-loop. Then I could take all the resulting partial values of a and add them up before the second loop block starts.
Afterwards I would like to return the values a and b to my C# code and use them there.
So, to wrap up my questions:
Is Cloo a recommendable choice for this task?
What is the best way to pass the image-data (16bit, short-array) and the dimensions (img_width, img_height) to the GPU?
How can I return a value from the GPU? As far as I know, kernels are always declared as kernel void...
What would be the best way to implement the loops?
I hope my questions are clear and I provided sufficient information to understand my struggles. Any help is appreciated. Thanks in advance.
Let's reverse-engineer the problem: understanding the efficient processing of the "dependency chain" of image[][], img_height, img_width, a, b.
Ad 4) The tandem of identical for-loops has poor performance.
Given the code as defined, there could be just a single loop, with reduced overhead costs, ideally also maximising cache-aligned vectorised code.
A cache-naive re-formulation:
int a = 0;
int c = 1;
for ( int y = 0; y < img_height; y++ ) {
    for ( int x = 0; x < img_width; x++ ) {
        int intermediate = image[x,y] * y; // .SET  PROD( i[x,y], y )
        a += x * intermediate;             // .REUSE 1st
        c -= intermediate;                 // .REUSE 2nd
    }
}
int b = a * c; // was my fault upon being in a hurry leaving for weekend :o)
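To see why the single-pass form recovers b: b = sum( image[x,y] * (x - a) * y ) = sum( image[x,y] * x * y ) - a * sum( image[x,y] * y ) = a - a * sum( image[x,y] * y ) = a * c, since the loop accumulates exactly c = 1 - sum( image[x,y] * y ).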
Moving the code into the split tandem loops only increases these overheads and defeats any cache-friendly tricks in code-performance tweaking.
Ad 3 + 2) The kernel call-signature and the CPU-side methods allow this.
OpenCL and Cloo document these details, so nothing magical beyond the documented methods is needed here.
Yet there are latency costs associated with each host-to-device (H2D) and device-to-host (D2H) transfer. Given you state that the 16-bit 1920x1200 image data is to be re-processed ~10 times in a loop, there is a chance these latencies need not be paid on every pass through that loop.
The worst performance-killer is a very shallow mathematical density in the kernel. There is simply not much to calculate here, so the chances for any efficient SIMD / GPU parallel tricks are pretty low.
In this sense, smart-vectorised CPU-side code will do much better than the (H2D + D2H) overhead-burdened, latency-hostile, computationally shallow GPU-kernel processing.
Ad 1) Given 2 + 3 and 4 above, point 1 may easily lose its sense.
As prototyped, and with additional cache-friendly vectorised tricks, the in-RAM, in-cache vectorised code has a good chance to beat all OpenCL and mixed-GPU/CPU ad-hoc kernel-compilation-generated device code and its computing efforts.

Reliably Finding the Number of Holes in an Object

I'm currently using MATLAB's bweuler function to find the number of "holes" in an object. I'm having difficulty due to the relatively low resolution of these images, currently only achieving a 50% success rate. Here is an example of an image that will fail (cropped for testing convenience):
Normalized and scaled up so that you can see what's going on:
The number of holes should be 1 here, but I haven't found a reliable way to threshold that small area out while leaving its surroundings intact.
My current method uses an adaptive threshold, but this is problematic because the correct window size is not constant across all samples. For instance, this example will work with a window size of 6, but others will fail.
For reference, here are a couple more failure cases:
N=1 ->
N=2 ->
And a few cases my current method handles:
N=6 ->
N=1 ->
N=1 ->
I'll post a short version of my current code as well, though as I said previously, I do not believe that an adaptive thresholding method will work. I will be playing with some other morphological segmentation methods in the meantime. Just read one of the test images in and call findholes(image).
function N = findholes(I)
    % Euler number E = (number of objects) - (number of holes), so for a
    % single object the hole count is 1 - E.
    bw = imclearborder(adaptivethreshold(I, 9));
    N = abs(1 - bweuler(bw, 4));
end
function bw = adaptivethreshold(IM, ws, C, tm)
%ADAPTIVETHRESHOLD An adaptive thresholding algorithm that separates the
%foreground from the background with nonuniform illumination.
%  bw = adaptivethreshold(IM, ws, C) outputs a binary image bw with the
%  local threshold mean-C or median-C applied to the image IM.
%  ws is the local window size.
%  C is a constant offset subtracted from the final input image to im2bw (range 0...1).
%  tm is 0 or 1, a switch between mean and median. tm=0 mean (default); tm=1 median.
%
%  Contributed by Guanglei Xiong (xgl99@mails.tsinghua.edu.cn)
%  at Tsinghua University, Beijing, China.
%
%  For more information, please see
%  http://homepages.inf.ed.ac.uk/rbf/HIPR2/adpthrsh.htm
if (nargin < 2)
    error('You must provide IM and ws');
end
if (nargin < 3)
    C = 0;
    tm = 0;
elseif (nargin == 3)
    tm = 0;
elseif (tm ~= 0 && tm ~= 1)
    error('tm must be 0 or 1.');
end
IM = mat2gray(IM);
if tm == 0
    mIM = imfilter(IM, fspecial('average', ws), 'replicate');
else
    mIM = medfilt2(IM, [ws ws]);
end
sIM = mIM - IM - C;
bw = im2bw(sIM, 0);
bw = imcomplement(bw);
end

Understanding Overlap and Add for Filtering

I am trying to implement the overlap-add method in order to apply a filter in a real-time context. However, it seems that I am doing something wrong, as the resulting output has a larger error than I would expect. To compare the accuracy of my computations I created a file that I process in one chunk, compare it with the output of the overlap-add process, and take the resulting difference as an indicator of the accuracy of the computation. So here is my overlap-add process:
I take a chunk of length L from my input signal
I pad the chunk with zeros to length L*2
I transform that signal into frequency domain
I multiply the signal in the frequency domain with my filter response of length L*2 in the frequency domain (the filter response is actually created by interpolating control points in the UI, so it is not transformed from the time domain; however, using length L*2 in the frequency domain should be similar to using an FFTed time-domain signal of length L padded to L*2)
Then I transform the resulting signal back to the time domain and add it to the output stream with an overlap of L
Is there anything wrong with that procedure? After reading a lot of different papers and books I've become pretty unsure which is the right way to deal with this.
Here is some more data from the tests I have been running:
I created a signal, which consists of three cosine waves
I used this filter function in the time domain for filtering. (It's symmetric, as it is applied to the whole output of the FFT, which also is symmetric for real input signals)
The output of the IFFT looks like this: it can be seen that low frequencies are attenuated more than frequencies in the mid range.
For the overlap-add/save and the windowed processing I divided the input signal into 8 chunks of 256 samples. After reassembling them they look like this (samples 490 - 540).
Output Signal overlap and add:
output signal overlap and save:
output signal using STFT with Hanning window:
It can be seen that the overlap-add/save results differ from the STFT version at the point where the chunks are joined (sample 511). This is the main error, and it leads to different results when comparing the windowed process and overlap-add/save. However, the STFT is closer to the output signal that was processed in one chunk.
I have been stuck at this point for a few days now. What is wrong here?
Here is my source:
// overlap and add
// init buffers
for (UInt32 j = 0; j < samples; j++) {
    output[j] = 0.0;
}
// process multiple chunks of data
for (UInt32 i = 0; i < (float)div * 2; i++) {
    for (UInt32 j = 0; j < chunklength / 2; j++) {
        // copy input data to the first half of the current buffer
        inBuffer[j] = input[(int)((float)i * chunklength / 2 + j)];
        // pad second half with zeros
        inBuffer[j + chunklength / 2] = 0.0;
    }
    // clear buffers ([0] = real part, [1] = imaginary part)
    for (UInt32 j = 0; j < chunklength; j++) {
        outBuffer[j][0] = 0.0;
        outBuffer[j][1] = 0.0;
        FFTBuffer[j][0] = 0.0;
        FFTBuffer[j][1] = 0.0;
    }
    FFT(inBuffer, FFTBuffer, chunklength);
    // processing
    for (UInt32 j = 0; j < chunklength; j++) {
        // multiply with filter
        FFTBuffer[j][0] *= multiplier[j];
        FFTBuffer[j][1] *= multiplier[j];
    }
    // inverse transform
    IFFT((const double**)FFTBuffer, outBuffer, chunklength);
    for (UInt32 j = 0; j < chunklength; j++) {
        // overlap and add into the output stream
        if ((int)((float)i * chunklength / 2 + j) < samples) {
            output[(int)((float)i * chunklength / 2 + j)] += outBuffer[j][0];
        }
    }
}
Following the suggestion below, I tried the following:
I IFFTed my filter. It looks like this:
set the second half to zero:
FFTed the signal and compared the magnitudes to the old filter (blue):
After trying overlap-add with this filter, the results obviously got worse instead of better. To make sure my FFT works correctly, I IFFTed and FFTed the filter without setting the second half to zero; the result is identical to the original filter, so the problem shouldn't be the FFT. I suppose this is more a matter of general understanding of the overlap-add method, but I still can't figure out what is going wrong...
One thing to check is the length of the impulse response of your filter: a block of L input samples convolved with an M-tap impulse response produces L + M - 1 output samples, so with an FFT size of 2L you need M <= L + 1. If the impulse response is longer than the zero padding allows, you will get wrap-around (time-aliasing) errors.
I think the problem might be in the windowing approach you are using. You simply add zeros to the chunks, so there is no actual overlap of input data. In the overlap-add method as described, you need to damp the edges of the window: where you add zeros to the chunk, you would instead add the weighted input signal, and the weight in your case should be 0.5, since only two windows overlap.
The rest of the procedure seems OK. You then simply take FFTs, multiply, take inverse FFTs, and finally add up all the chunks to get the final signal, which should be exactly the same as if you had filtered the whole signal at once.
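As a side note, the overlap-add bookkeeping itself can be sanity-checked without any FFT: by linearity, convolving each block separately and summing the shifted per-block results must reproduce filtering the whole signal at once, provided each block's result has room for its full L + M - 1 samples (the condition from the first answer). A minimal C++ sketch with direct convolution standing in for the FFT/IFFT pair (all names and data are illustrative):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Direct linear convolution: result has x.size() + h.size() - 1 samples.
std::vector<double> conv(const std::vector<double>& x, const std::vector<double>& h)
{
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < h.size(); ++j)
            y[i + j] += x[i] * h[j];
    return y;
}

int main()
{
    const std::size_t L = 8;                        // block length
    std::vector<double> h = {0.5, 0.3, 0.2};        // short impulse response, M <= L + 1
    std::vector<double> x(50);
    for (std::size_t n = 0; n < x.size(); ++n) x[n] = n % 7;

    // Overlap-add: convolve each block separately, add tails into the output.
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t start = 0; start < x.size(); start += L) {
        std::size_t len = std::min(L, x.size() - start);
        std::vector<double> block(x.begin() + start, x.begin() + start + len);
        std::vector<double> yb = conv(block, h);    // len + M - 1 samples
        for (std::size_t k = 0; k < yb.size(); ++k)
            y[start + k] += yb[k];                  // tails overlap and add
    }

    // Compare against convolving the whole signal at once.
    std::vector<double> ref = conv(x, h);
    double err = 0.0;
    for (std::size_t k = 0; k < y.size(); ++k)
        err += std::fabs(y[k] - ref[k]);
    std::printf("total abs error: %g\n", err);      // should be ~0
}

If the equivalent FFT version deviates from this, the error is in the transform sizes or the overlap indexing rather than in the overlap-add principle itself.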
