Separable gaussian blur - optimize vertical pass - image-processing

I have implemented a separable Gaussian blur. The horizontal pass was relatively easy to optimize with SIMD. However, I am not sure how to optimize the vertical pass.
Accessing elements is not very cache friendly, and filling a SIMD lane would mean reading many different pixels. I was thinking about transposing the image, running the horizontal pass, and then transposing the image back; however, I am not sure it would bring any improvement because of the two transpose operations.
I have quite large images (16k resolution) and the kernel size is 19; vectorizing the vertical pass gained only about 15%.
My vertical pass is as follows (it is inside a generic class typed to T, which can be uint8_t or float):
int yStart = kernelHalfSize;
int xStart = kernelHalfSize;
int yEnd = input.GetHeigh() - kernelHalfSize;
int xEnd = input.GetWidth() - kernelHalfSize;
const T * inData = input.GetData().data();
V * outData = output.GetData().data();
int kn = kernelHalfSize * 2 + 1;
int kn4 = kn - kn % 4;
for (int y = yStart; y < yEnd; y++)
{
size_t yW = size_t(y) * output.GetWidth();
size_t outX = size_t(xStart) + yW;
size_t xEndSimd = xStart;
int len = xEnd - xStart;
len = len - len % 4;
xEndSimd = xStart + len;
for (int x = xStart; x < xEndSimd; x += 4)
{
size_t inYW = size_t(y) * input.GetWidth();
size_t x0 = ((x + 0) - kernelHalfSize) + inYW;
size_t x1 = x0 + 1;
size_t x2 = x0 + 2;
size_t x3 = x0 + 3;
__m128 sumDot = _mm_setzero_ps();
int i = 0;
for (; i < kn4; i += 4)
{
__m128 kx = _mm_set_ps1(kernelDataX[i + 0]);
__m128 ky = _mm_set_ps1(kernelDataX[i + 1]);
__m128 kz = _mm_set_ps1(kernelDataX[i + 2]);
__m128 kw = _mm_set_ps1(kernelDataX[i + 3]);
__m128 dx, dy, dz, dw;
if constexpr (std::is_same<T, uint8_t>::value)
{
//we need to convert uint8_t inputs to float
__m128i u8_0 = _mm_loadu_si128((const __m128i*)(inData + x0));
__m128i u8_1 = _mm_loadu_si128((const __m128i*)(inData + x1));
__m128i u8_2 = _mm_loadu_si128((const __m128i*)(inData + x2));
__m128i u8_3 = _mm_loadu_si128((const __m128i*)(inData + x3));
__m128i u32_0 = _mm_unpacklo_epi16(
_mm_unpacklo_epi8(u8_0, _mm_setzero_si128()),
_mm_setzero_si128());
__m128i u32_1 = _mm_unpacklo_epi16(
_mm_unpacklo_epi8(u8_1, _mm_setzero_si128()),
_mm_setzero_si128());
__m128i u32_2 = _mm_unpacklo_epi16(
_mm_unpacklo_epi8(u8_2, _mm_setzero_si128()),
_mm_setzero_si128());
__m128i u32_3 = _mm_unpacklo_epi16(
_mm_unpacklo_epi8(u8_3, _mm_setzero_si128()),
_mm_setzero_si128());
dx = _mm_cvtepi32_ps(u32_0);
dy = _mm_cvtepi32_ps(u32_1);
dz = _mm_cvtepi32_ps(u32_2);
dw = _mm_cvtepi32_ps(u32_3);
}
else
{
/*
//load 8 consecutive values
auto dd = _mm256_loadu_ps(inData + x0);
//extract parts by shifting and casting to 4 values float
dx = _mm256_castps256_ps128(dd);
dy = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 4, 3, 2, 1)));
dz = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 5, 4, 3, 2)));
dw = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 6, 5, 4, 3)));
*/
dx = _mm_loadu_ps(inData + x0);
dy = _mm_loadu_ps(inData + x1);
dz = _mm_loadu_ps(inData + x2);
dw = _mm_loadu_ps(inData + x3);
}
//calculate 4 dots at once
//[dx, dy, dz, dw] <dot> [kx, ky, kz, kw]
auto mx = _mm_mul_ps(dx, kx); //dx * kx
auto my = _mm_fmadd_ps(dy, ky, mx); //mx + dy * ky
auto mz = _mm_fmadd_ps(dz, kz, my); //my + dz * kz
auto res = _mm_fmadd_ps(dw, kw, mz); //mz + dw * kw
sumDot = _mm_add_ps(sumDot, res);
x0 += 4;
x1 += 4;
x2 += 4;
x3 += 4;
}
for (; i < kn; i++)
{
auto v = _mm_set_ps1(kernelDataX[i]);
auto v2 = _mm_set_ps(
*(inData + x3), *(inData + x2),
*(inData + x1), *(inData + x0)
);
sumDot = _mm_add_ps(sumDot, _mm_mul_ps(v, v2));
x0++;
x1++;
x2++;
x3++;
}
sumDot = _mm_mul_ps(sumDot, _mm_set_ps1(weightX));
if constexpr (std::is_same<V, uint8_t>::value)
{
__m128i asInt = _mm_cvtps_epi32(sumDot);
asInt = _mm_packus_epi32(asInt, asInt);
asInt = _mm_packus_epi16(asInt, asInt);
uint32_t res = _mm_cvtsi128_si32(asInt);
((uint32_t *)(outData + outX))[0] = res;
outX += 4;
}
else
{
float tmpRes[4];
_mm_storeu_ps(tmpRes, sumDot); //unaligned store: tmpRes is not guaranteed to be 16-byte aligned
outData[outX + 0] = tmpRes[0];
outData[outX + 1] = tmpRes[1];
outData[outX + 2] = tmpRes[2];
outData[outX + 3] = tmpRes[3];
outX += 4;
}
}
for (int x = xEndSimd; x < xEnd; x++)
{
int kn = kernelHalfSize * 2 + 1;
const T * v = input.GetPixelStart(x - kernelHalfSize, y);
float tmp = 0;
for (int i = 0; i < kn; i++)
{
tmp += kernelDataX[i] * v[i];
}
tmp *= weightX;
outData[outX] = ImageUtils::clamp_cast<V>(tmp);
outX++;
}
}

There’s a well-known trick for that.
In both passes, read the input sequentially and compute with SIMD, but write the result into another buffer, transposed, using scalar stores. Protip: SSE 4.1 has _mm_extract_ps; just don’t forget to cast your destination image pointer from float* to int*. For these stores I would also recommend _mm_stream_si32, because you want as much cache space as possible left for your input data. When you compute the second pass, you will be reading sequential memory addresses again, and the hardware prefetcher will deal with the latency.
This way both passes are identical; I usually call the same function twice, with different buffers.
The two transposes caused by your 2 passes cancel each other. Here’s an HLSL version, BTW.
There’s more. If your kernel size is only 19, it fits in 3 AVX registers. I think shuffle/permute/blend instructions are still faster than even L1 cache loads, i.e. it might be better to load the kernel outside the loop.
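To make the transposed-write idea concrete, here is a minimal scalar sketch, not the SIMD version. The names blurRowsTransposed and separableBlur are made up for illustration, and clamping at the border is an assumption (the code in the question skips the border instead). The innermost loop is where the SIMD arithmetic would go, and the single store per output pixel corresponds to the scalar store of one lane.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One blur pass (hypothetical helper): convolves every row of `src`
// (row-major, `width` columns by `height` rows) with `kernel` and writes
// the result transposed into `dst` (row-major, `height` columns by `width` rows).
// Assumption: pixels outside the image are clamped to the nearest edge pixel.
void blurRowsTransposed(const std::vector<float>& src, std::vector<float>& dst,
                        int width, int height, const std::vector<float>& kernel)
{
    assert(kernel.size() % 2 == 1);
    assert(src.size() == std::size_t(width) * height);
    const int half = int(kernel.size()) / 2;
    dst.assign(src.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        const float* row = src.data() + std::size_t(y) * width; // sequential reads
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -half; k <= half; ++k) {
                int xx = x + k;
                if (xx < 0) xx = 0;
                if (xx >= width) xx = width - 1;
                sum += row[xx] * kernel[k + half];
            }
            dst[std::size_t(x) * height + y] = sum; // transposed, scattered write
        }
    }
}

// Full separable blur: the same pass runs twice and the two transposes cancel.
std::vector<float> separableBlur(const std::vector<float>& image, int width, int height,
                                 const std::vector<float>& kernel)
{
    std::vector<float> tmp, out;
    blurRowsTransposed(image, tmp, width, height, kernel); // tmp holds the transposed image
    blurRowsTransposed(tmp, out, height, width, kernel);   // out is back in the original layout
    return out;
}
```

Only the write side is cache-unfriendly here, which is why non-temporal stores are suggested above; the reads in both passes stay sequential.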

Related

Histogram based on image as vector graphic

I would like to transform histograms based on images to vector graphics.
This could be a start:
function preload() {
img = loadImage("https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Cirrus_sky_panorama.jpg/1200px-Cirrus_sky_panorama.jpg");
}
function setup() {
createCanvas(400, 400);
background(255);
img.resize(0, 200);
var maxRange = 256
colorMode(HSL, maxRange);
image(img, 0, 0);
var histogram = new Array(maxRange);
for (i = 0; i <= maxRange; i++) {
histogram[i] = 0
}
loadPixels();
for (var x = 0; x < img.width; x += 5) {
for (var y = 0; y < img.height; y += 5) {
var loc = (x + y * img.width) * 4;
var h = pixels[loc];
var s = pixels[loc + 1];
var l = pixels[loc + 2];
var a = pixels[loc + 3];
b = int(l);
histogram[b]++
}
}
image(img, 0, 0);
stroke(300, 100, 80)
push()
translate(10, 0)
for (x = 0; x <= maxRange; x++) {
index = histogram[x];
y1 = int(map(index, 0, max(histogram), height, height - 300));
y2 = height
xPos = map(x, 0, maxRange, 0, width - 20)
line(xPos, y1, xPos, y2);
}
pop()
}
<script src="https://cdn.jsdelivr.net/npm/p5@1.4.1/lib/p5.js"></script>
But I would need downloadable vector graphic files as results that are closed shapes without any gaps between. It should look like that for example:
<svg viewBox="0 0 399.84 200"><polygon points="399.84 200 399.84 192.01 361.91 192.01 361.91 182.87 356.24 182.87 356.24 183.81 350.58 183.81 350.58 184.74 344.91 184.74 344.91 188.19 339.87 188.19 339.87 189.89 334.6 189.89 334.6 185.29 328.93 185.29 328.93 171.11 323.26 171.11 323.26 172.55 317.59 172.55 317.59 173.99 311.92 173.99 311.92 179.42 306.88 179.42 306.88 182.03 301.21 182.03 301.21 183.01 295.54 183.01 295.54 179.04 289.87 179.04 289.87 175.67 284.21 175.67 284.21 182.03 278.54 182.03 278.54 176 273.5 176 273.5 172.42 267.83 172.42 267.83 179.42 262.79 179.42 262.79 182.03 257.12 182.03 257.12 183.01 251.45 183.01 251.45 178.63 245.78 178.63 245.78 175.21 240.11 175.21 240.11 182.03 234.86 182.03 234.86 150.42 229.2 150.42 229.2 155.98 223.53 155.98 223.53 158.06 217.86 158.06 217.86 167.44 212.19 167.44 212.19 162.58 206.52 162.58 206.52 155.98 200.85 155.98 200.85 158.06 195.18 158.06 195.18 167.44 189.51 167.44 189.51 177.46 183.84 177.46 183.84 166.93 178.17 166.93 178.17 153.69 172.5 153.69 172.5 155.87 166.82 155.87 166.82 158.05 161.78 158.05 161.78 155.63 156.11 155.63 156.11 160.65 150.84 160.65 150.84 146.59 145.17 146.59 145.17 109.63 139.49 109.63 139.49 113.67 133.82 113.67 133.82 61.48 128.15 61.48 128.15 80.59 123.11 80.59 123.11 93.23 117.44 93.23 117.44 97.97 111.76 97.97 111.76 78.07 106.09 78.07 106.09 61.66 100.42 61.66 100.42 93.23 94.75 93.23 94.75 98.51 89.7 98.51 89.7 85.4 84.03 85.4 84.03 111.03 78.99 111.03 78.99 120.57 73.32 120.57 73.32 124.14 67.65 124.14 67.65 23.48 61.97 23.48 61.97 0 56.3 0 56.3 120.57 50.63 120.57 50.63 167.01 45.38 167.01 45.38 170.83 39.71 170.83 39.71 172.26 34.03 172.26 34.03 178.7 28.36 178.7 28.36 175.36 22.69 175.36 22.69 170.83 17.02 170.83 17.02 172.26 11.34 172.26 11.34 178.7 5.67 178.7 5.67 103.85 0 103.85 0 200 399.84 200"/></svg>
Has anyone an idea how to program that? It doesn't necessarily need to be based on p5.js, but would be cool.
Closing Gaps
In order to have a gapless histogram, you need to meet the following condition:
numberOfBars * barWidth === totalWidth
Right now you are using the p5 line() function to draw your bars. You have not explicitly set the width of your bars, so it uses the default value of 1px wide.
We know that the numberOfBars in your code is always maxRange which is 256.
Right now the total width of your histogram is width - 20, where width is set to 400 by createCanvas(400, 400). So the totalWidth is 380.
256 * 1 !== 380
If you have 256 pixels of bars in a 380 pixel space then there are going to be gaps!
We need to change the barWidth and/or the totalWidth to balance the equation.
For example, you can change your canvas size to 276 (256 + your 20px margin) and the gaps disappear!
createCanvas(276, 400);
However this is not an appropriate solution because now your image is cropped and your pixel data is wrong. But actually...it was already wrong before!
Sampling Pixels
When you call the global loadPixels() function in p5.js you are loading all of the pixels for the whole canvas. This includes the white areas outside of your image.
for (var x = 0; x < img.width; x += 5) {
for (var y = 0; y < img.height; y += 5) {
var loc = (x + y * img.width) * 4;
It is a 1-dimensional array, so your approach of limiting the x and y values here is not giving you the correct position. Your loc variable needs to use the width of the entire canvas rather than the width of just the image, since the pixels array includes the entire canvas.
var loc = (x + y * width) * 4;
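As a language-neutral illustration of why the row width matters, here is a tiny C++ sketch of the same index arithmetic. The function name rgbaOffset is made up, and it assumes a pixel density of 1 so that one canvas pixel maps to one entry group in the buffer.

```cpp
#include <cassert>
#include <cstddef>

// Offset of the red byte of pixel (x, y) in a row-major RGBA buffer whose
// rows are `rowWidth` pixels wide (hypothetical helper, 4 bytes per pixel).
std::size_t rgbaOffset(int x, int y, int rowWidth)
{
    assert(x >= 0 && x < rowWidth && y >= 0);
    return (std::size_t(y) * rowWidth + x) * 4;
}
```

For pixel (10, 2) on the 400 pixel wide canvas this gives (2 * 400 + 10) * 4 = 3240, while using a narrower row width such as 300 gives 2440, which is a different pixel of the canvas buffer.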
Alternatively, you can look at just the pixels of the image by using img.loadPixels() and img.pixels.
img.loadPixels();
for (var x = 0; x < img.width; x += 5) {
for (var y = 0; y < img.height; y += 5) {
var loc = (x + y * img.width) * 4;
var h = img.pixels[loc];
var s = img.pixels[loc + 1];
var l = img.pixels[loc + 2];
var a = img.pixels[loc + 3];
b = int(l);
histogram[b]++;
}
}
The pixel values are always returned in RGBA regardless of the colorMode. So your third channel value is actually the blue, not the lightness. You can make use of the p5.js lightness() function to compute the lightness from the RGBA.
Updated Code
The actual lightness histogram looks dumb because 100% dwarfs all of the other bars.
function preload() {
img = loadImage("https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Cirrus_sky_panorama.jpg/1200px-Cirrus_sky_panorama.jpg");
}
function setup() {
const barCount = 100;
const imageHeight = 200;
createCanvas(400, 400);
background(255);
colorMode(HSL, barCount - 1);
img.resize(0, imageHeight);
imageMode(CENTER);
image(img, width / 2, imageHeight / 2);
img.loadPixels();
const histogram = new Array(barCount).fill(0);
for (let x = 0; x < img.width; x += 5) {
for (let y = 0; y < img.height; y += 5) {
const loc = (x + y * img.width) * 4;
const r = img.pixels[loc];
const g = img.pixels[loc + 1];
const b = img.pixels[loc + 2];
const a = img.pixels[loc + 3];
const barIndex = floor(lightness([r, g, b, a]));
histogram[barIndex]++;
}
}
fill(300, 100, 80);
strokeWeight(0);
const maxCount = max(histogram);
const barWidth = width / barCount;
const histogramHeight = height - imageHeight;
for (let i = 0; i < barCount; i++) {
const count = histogram[i];
const y1 = round(map(count, 0, maxCount, height, imageHeight));
const y2 = height;
const x1 = i * barWidth;
const x2 = x1 + barWidth;
rect(x1, y1, barWidth, height - y1);
}
}
<script src="https://cdn.jsdelivr.net/npm/p5@1.4.1/lib/p5.js"></script>
But the blue channel histogram looks pretty good!
function preload() {
img = loadImage("https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Cirrus_sky_panorama.jpg/1200px-Cirrus_sky_panorama.jpg");
}
function setup() {
const barCount = 100;
const imageHeight = 200;
createCanvas(400, 400);
background(255);
img.resize(0, imageHeight);
imageMode(CENTER);
image(img, width / 2, imageHeight / 2);
img.loadPixels();
const histogram = new Array(barCount).fill(0);
for (let x = 0; x < img.width; x += 5) {
for (let y = 0; y < img.height; y += 5) {
const loc = (x + y * img.width) * 4;
const r = img.pixels[loc];
const g = img.pixels[loc + 1];
const b = img.pixels[loc + 2];
const a = img.pixels[loc + 3];
const barIndex = floor(barCount * b / 256); // dividing by 256 keeps b = 255 inside the last bar
histogram[barIndex]++;
}
}
fill(100, 100, 300);
strokeWeight(0);
const maxCount = max(histogram);
const barWidth = width / barCount;
const histogramHeight = height - imageHeight;
for (let i = 0; i < barCount; i++) {
const count = histogram[i];
const y1 = round(map(count, 0, maxCount, height, imageHeight));
const y2 = height;
const x1 = i * barWidth;
const x2 = x1 + barWidth;
rect(x1, y1, barWidth, height - y1);
}
}
<script src="https://cdn.jsdelivr.net/npm/p5@1.4.1/lib/p5.js"></script>
Just to add to Linda's excellent answer(+1), you can use p5.svg to render to SVG using p5.js:
let histogram;
function setup() {
createCanvas(660, 210, SVG);
background(255);
noStroke();
fill("#ed225d");
// make an array of 256 random values in the (0, 255) range
histogram = Array.from({length: 256}, () => int(random(255)));
//console.log(histogram);
// plot the histogram
barPlot(histogram, 0, 0, width, height);
// change shape rendering so bars appear connected
document.querySelector('g').setAttribute('shape-rendering','crispEdges');
// save the plot
save("histogram.svg");
}
function barPlot(values, x, y, plotWidth, plotHeight){
let numValues = values.length;
// calculate the width of each bar in the plot
let barWidth = plotWidth / numValues;
// calculate min/max value (to map height)
let minValue = min(values);
let maxValue = max(values);
// for each value
for(let i = 0 ; i < numValues; i++){
// map the value to the plot height
let barHeight = map(values[i], minValue, maxValue, 0, plotHeight);
// render each bar, offseting y
rect(x + (i * barWidth),
y + (plotHeight - barHeight),
barWidth, barHeight);
}
}
<script src="https://unpkg.com/p5@1.3.1/lib/p5.js"></script>
<script src="https://unpkg.com/p5.js-svg@1.0.7"></script>
(In the p5 editor (or when testing locally) a save dialog should pop up.
If you use the browser's Developer Tools to inspect the bar chart it should confirm it's an SVG (not <canvas/>))
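Since the question says the solution does not have to be p5.js, here is a rough sketch of the other route: building one closed polygon, like the example SVG in the question, directly from the bar counts. It is plain C++ with no graphics library; the function names histogramPolygonPoints and histogramSvg are made up, and writing the string to a file is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Builds the "points" attribute of a single closed SVG polygon that outlines
// a bar histogram: start at the bottom-left corner, trace the top of every
// bar from left to right, end at the bottom-right corner.
std::string histogramPolygonPoints(const std::vector<int>& counts, double width, double height)
{
    assert(!counts.empty());
    int maxCount = 1; // avoids division by zero for an all-zero histogram
    for (int c : counts) if (c > maxCount) maxCount = c;
    const double barWidth = width / counts.size();

    std::ostringstream pts;
    pts << "0," << height; // bottom-left corner
    for (std::size_t i = 0; i < counts.size(); ++i) {
        const double top = height - height * counts[i] / maxCount; // SVG y grows downward
        pts << ' ' << i * barWidth << ',' << top         // left end of the bar top
            << ' ' << (i + 1) * barWidth << ',' << top;  // right end of the bar top
    }
    pts << ' ' << width << ',' << height; // bottom-right corner; the polygon closes itself
    return pts.str();
}

// Wraps the outline in a minimal standalone SVG document.
std::string histogramSvg(const std::vector<int>& counts, double width, double height)
{
    std::ostringstream svg;
    svg << "<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 " << width << ' ' << height << "\">"
        << "<polygon points=\"" << histogramPolygonPoints(counts, width, height) << "\"/></svg>";
    return svg.str();
}
```

Because neighbouring bars share their x coordinate, the vertical steps between bars come out automatically and there cannot be gaps. For counts {1, 2} on a 4 by 10 drawing area the points are 0,10 0,5 2,5 2,0 4,0 4,10.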

Converting cv::Mat to MTLTexture

An intermediate step of my current project requires conversion of opencv's cv::Mat to MTLTexture, the texture container of Metal. I need to store the Floats in the Mat as Floats in the texture; my project cannot quite afford the loss of precision.
This is my attempt at such a conversion.
- (id<MTLTexture>)texForMat:(cv::Mat)image context:(MBEContext *)context
{
id<MTLTexture> texture;
int width = image.cols;
int height = image.rows;
Float32 *rawData = (Float32 *)calloc(height * width * 4,sizeof(float));
int bytesPerPixel = 4;
int bytesPerRow = bytesPerPixel * width;
float r, g, b,a;
for(int i = 0; i < height; i++)
{
Float32* imageData = (Float32*)(image.data + image.step * i);
for(int j = 0; j < width; j++)
{
r = (Float32)(imageData[4 * j]);
g = (Float32)(imageData[4 * j + 1]);
b = (Float32)(imageData[4 * j + 2]);
a = (Float32)(imageData[4 * j + 3]);
rawData[image.step * (i) + (4 * j)] = r;
rawData[image.step * (i) + (4 * j + 1)] = g;
rawData[image.step * (i) + (4 * j + 2)] = b;
rawData[image.step * (i) + (4 * j + 3)] = a;
}
}
MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA16Float
width:width
height:height
mipmapped:NO];
texture = [context.device newTextureWithDescriptor:textureDescriptor];
MTLRegion region = MTLRegionMake2D(0, 0, width, height);
[texture replaceRegion:region mipmapLevel:0 withBytes:rawData bytesPerRow:bytesPerRow];
free(rawData);
return texture;
}
But it doesn't seem to be working. It reads zeroes every time from the Mat, and throws up EXC_BAD_ACCESS. I need the MTLTexture in MTLPixelFormatRGBA16Float to keep the precision.
Thanks for considering this issue.
One problem here is you’re loading up rawData with Float32s but your texture is RGBA16Float, so the data will be corrupted (16Float is half the size of Float32). This shouldn’t cause your crash, but it’s an issue you’ll have to deal with.
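Also, as “chappjc” noted, you’re using ‘image.step’ when writing your data out, but that buffer should be contiguous and never have a step other than (width * bytesPerPixel). Note that image.step is measured in bytes while rawData is indexed in floats, so those writes run past the end of the allocation, which is the likely cause of the EXC_BAD_ACCESS.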
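A minimal sketch of the copy with the step used only on the source side. It is plain C++ without OpenCV or Metal so it stays self-contained; packRows is a made-up name, and the srcStep parameter plays the role of image.step. It assumes the texture is created as MTLPixelFormatRGBA32Float so the floats can be copied unchanged; an RGBA16Float texture would need a float-to-half conversion, which this sketch does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `height` rows of `rowBytes` payload bytes each from a source whose
// rows start `srcStep` bytes apart (like cv::Mat::step) into a tightly
// packed buffer whose rows start exactly `rowBytes` apart.
std::vector<unsigned char> packRows(const unsigned char* src, std::size_t srcStep,
                                    std::size_t rowBytes, int height)
{
    assert(height >= 0 && srcStep >= rowBytes);
    std::vector<unsigned char> packed(rowBytes * std::size_t(height));
    for (int i = 0; i < height; ++i) {
        std::memcpy(packed.data() + std::size_t(i) * rowBytes, // destination: no padding
                    src + std::size_t(i) * srcStep,            // source: may have padding
                    rowBytes);
    }
    return packed;
}
```

For a four-channel float image, rowBytes would be width * 4 * sizeof(float), and the same value would be passed as bytesPerRow when uploading the packed buffer.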

Multi otsu(multi-thresholding) with openCV

I am trying to carry out multi-thresholding with Otsu. The method I am currently using maximises the between-class variance, and I have managed to get the same threshold value as the OpenCV library gives. However, that is from running Otsu's method just once.
Documentation on how to do multi-level (or rather recursive) thresholding is rather limited. Where do I go after obtaining the original Otsu value? I would appreciate some hints. I have been playing around with the code, adding one outer for loop, but the next value calculated is always 254 for any given image :(
My code if need be:
//compute histogram first
cv::Mat imageh; //image edited to grayscale for histogram purpose
//imageh=image; //to delete and uncomment below;
cv::cvtColor(image, imageh, CV_BGR2GRAY);
int histSize[1] = {256}; // number of bins
float hranges[2] = {0.0, 256.0}; // min and max pixel value
const float* ranges[1] = {hranges};
int channels[1] = {0}; // only 1 channel used
cv::MatND hist;
// Compute histogram
calcHist(&imageh, 1, channels, cv::Mat(), hist, 1, histSize, ranges);
IplImage* im = new IplImage(imageh);//assign the image to an IplImage pointer
IplImage* finalIm = cvCreateImage(cvSize(im->width, im->height), IPL_DEPTH_8U, 1);
double otsuThreshold= cvThreshold(im, finalIm, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU );
cout<<"opencv otsu gives "<<otsuThreshold<<endl;
int totalNumberOfPixels= imageh.total();
cout<<"total number of Pixels is " <<totalNumberOfPixels<< endl;
float sum = 0;
for (int t=0 ; t<256 ; t++)
{
sum += t * hist.at<float>(t);
}
cout<<"sum is "<<sum<<endl;
float sumB = 0; //sum of background
int wB = 0; // weight of background
int wF = 0; //weight of foreground
float varMax = 0;
int threshold = 0;
//iterate to find the maximum of the between-class variance (the between-class variance should be maximised)
for (int t=0 ; t<256 ; t++)
{
wB += hist.at<float>(t); // Weight Background
if (wB == 0) continue;
wF = totalNumberOfPixels - wB; // Weight Foreground
if (wF == 0) break;
sumB += (float) (t * hist.at<float>(t));
float mB = sumB / wB; // Mean Background
float mF = (sum - sumB) / wF; // Mean Foreground
// Calculate Between Class Variance
float varBetween = (float)wB * (float)wF * (mB - mF) * (mB - mF);
// Check if new maximum found
if (varBetween > varMax) {
varMax = varBetween;
threshold = t;
}
}
cout<<"threshold value is: "<<threshold;
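To extend Otsu's thresholding method to multi-level thresholding, the between-class variance for M classes becomes $\sigma_B^2 = \sum_{k=0}^{M-1} \omega_k (\mu_k - \mu_T)^2$, where $\omega_k$ is the fraction of pixels in class $k$, $\mu_k$ is the mean intensity of that class, and $\mu_T$ is the mean intensity of the whole image.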
Please check out Deng-Yuan Huang, Ta-Wei Lin, Wu-Chih Hu, Automatic
Multilevel Thresholding Based on Two-Stage Otsu's Method with Cluster
Determination by Valley Estimation, Int. Journal of Innovative
Computing, 2011, 7:5631-5644 for more information.
http://www.ijicic.org/ijicic-10-05033.pdf
Here is my C# implementation of Otsu Multi for 2 thresholds:
/* Otsu (1979) - multi */
Tuple < int, int > otsuMulti(object sender, EventArgs e) {
//image histogram
int[] histogram = new int[256];
//total number of pixels
int N = 0;
//accumulate image histogram and total number of pixels
foreach(int intensity in image.Data) {
if (intensity != 0) {
histogram[intensity] += 1;
N++;
}
}
double W0K, W1K, W2K, M0, M1, M2, currVarB, optimalThresh1, optimalThresh2, maxBetweenVar, M0K, M1K, M2K, MT;
optimalThresh1 = 0;
optimalThresh2 = 0;
W0K = 0;
W1K = 0;
M0K = 0;
M1K = 0;
MT = 0;
maxBetweenVar = 0;
for (int k = 0; k <= 255; k++) {
MT += k * (histogram[k] / (double) N);
}
for (int t1 = 0; t1 <= 255; t1++) {
W0K += histogram[t1] / (double) N; //Pi
M0K += t1 * (histogram[t1] / (double) N); //i * Pi
M0 = M0K / W0K; //(i * Pi)/Pi
W1K = 0;
M1K = 0;
for (int t2 = t1 + 1; t2 <= 255; t2++) {
W1K += histogram[t2] / (double) N; //Pi
M1K += t2 * (histogram[t2] / (double) N); //i * Pi
M1 = M1K / W1K; //(i * Pi)/Pi
W2K = 1 - (W0K + W1K);
M2K = MT - (M0K + M1K);
if (W2K <= 0) break;
M2 = M2K / W2K;
currVarB = W0K * (M0 - MT) * (M0 - MT) + W1K * (M1 - MT) * (M1 - MT) + W2K * (M2 - MT) * (M2 - MT);
if (maxBetweenVar < currVarB) {
maxBetweenVar = currVarB;
optimalThresh1 = t1;
optimalThresh2 = t2;
}
}
}
return new Tuple<int, int>((int) optimalThresh1, (int) optimalThresh2);
}
And this is the result I got by thresholding an image scan of soil with the above code:
(T1 = 110, T2 = 147).
Otsu's original paper: "Nobuyuki Otsu, A Threshold Selection Method
from Gray-Level Histogram, IEEE Transactions on Systems, Man, and
Cybernetics, 1979, 9:62-66" also briefly mentions the extension to
Multithresholding.
https://engineering.purdue.edu/kak/computervision/ECE661.08/OTSU_paper.pdf
Hope this helps.
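Since the question is about C++, here is a rough C++ sketch of the same exhaustive two-threshold search. It is not a port of the C# code above: it uses prefix sums so each class can be evaluated directly, it skips threshold pairs that leave a class empty, and the name otsuTwoThresholds is made up. Classes are taken as [0, t1], (t1, t2] and (t2, 255].

```cpp
#include <array>
#include <cassert>
#include <utility>

// Returns {t1, t2} maximising the between-class variance for three classes.
std::pair<int, int> otsuTwoThresholds(const std::array<long long, 256>& hist)
{
    // Prefix sums of pixel counts and of intensity-weighted counts.
    std::array<double, 257> cnt{}, sum{};
    for (int i = 0; i < 256; ++i) {
        cnt[i + 1] = cnt[i] + hist[i];
        sum[i + 1] = sum[i] + double(i) * hist[i];
    }
    const double total = cnt[256];
    assert(total > 0);
    const double muT = sum[256] / total; // mean of the whole image

    // omega_k * (mu_k - mu_T)^2 for the class covering bins [lo, hi];
    // returns -1 to mark an empty class.
    auto term = [&](int lo, int hi) {
        const double n = cnt[hi + 1] - cnt[lo];
        if (n <= 0) return -1.0;
        const double mu = (sum[hi + 1] - sum[lo]) / n;
        return (n / total) * (mu - muT) * (mu - muT);
    };

    double best = -1.0;
    std::pair<int, int> bestT{0, 0};
    for (int t1 = 0; t1 < 254; ++t1) {
        const double v0 = term(0, t1);
        if (v0 < 0) continue;
        for (int t2 = t1 + 1; t2 < 255; ++t2) {
            const double v1 = term(t1 + 1, t2);
            const double v2 = term(t2 + 1, 255);
            if (v1 < 0 || v2 < 0) continue;
            const double v = v0 + v1 + v2;
            if (v > best) { best = v; bestT = {t1, t2}; } // strict: the first maximum wins
        }
    }
    return bestT;
}
```

With a histogram that has equal spikes at 10, 100 and 200, every pair that separates the spikes scores the same, so the first such pair (10, 100) is returned.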
Here is a simple general approach for 'n' thresholds in python (>3.0) :
# developed by- SUJOY KUMAR GOSWAMI
# source paper- https://people.ece.cornell.edu/acharya/papers/mlt_thr_img.pdf
import cv2
import numpy as np
import math
img = cv2.imread('path-to-image')
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
a = 0
b = 255
n = 6 # number of thresholds (better choose even value)
k = 0.7 # free variable to take any positive value
T = [] # list which will contain 'n' thresholds
def sujoy(img, a, b):
    if a>b:
        s=-1
        m=-1
        return m,s
    img = np.array(img)
    t1 = (img>=a)
    t2 = (img<=b)
    X = np.multiply(t1,t2)
    Y = np.multiply(img,X)
    s = np.sum(X)
    m = np.sum(Y)/s
    return m,s
for i in range(int(n/2-1)):
    img = np.array(img)
    t1 = (img>=a)
    t2 = (img<=b)
    X = np.multiply(t1,t2)
    Y = np.multiply(img,X)
    mu = np.sum(Y)/np.sum(X)
    Z = Y - mu
    Z = np.multiply(Z,X)
    W = np.multiply(Z,Z)
    sigma = math.sqrt(np.sum(W)/np.sum(X))
    T1 = mu - k*sigma
    T2 = mu + k*sigma
    x, y = sujoy(img, a, T1)
    w, z = sujoy(img, T2, b)
    T.append(x)
    T.append(w)
    a = T1+1
    b = T2-1
    k = k*(i+1)
T1 = mu
T2 = mu+1
x, y = sujoy(img, a, T1)
w, z = sujoy(img, T2, b)
T.append(x)
T.append(w)
T.sort()
print(T)
For the full paper and more information, visit this link.
I've written an example of how Otsu thresholding works in Python before. You can see the source code here: https://github.com/subokita/Sandbox/blob/master/otsu.py
In the example there are 2 variants: otsu2(), which is the optimised version as seen on the Wikipedia page, and otsu(), which is a more naive implementation based on the algorithm description itself.
If you are okay with reading Python code (in this case it is pretty simple, almost pseudocode-like), you might want to look at otsu() in the example and modify it. Porting it to C++ is not hard either.
@Antoni4 gives the best answer in my opinion, and it's very straightforward to increase the number of levels.
This is for three-level thresholding:
#include "Shadow01-1.cuh"
void multiThresh(double &optimalThresh1, double &optimalThresh2, double &optimalThresh3, cv::Mat &imgHist, cv::Mat &src)
{
double W0K, W1K, W2K, W3K, M0, M1, M2, M3, currVarB, maxBetweenVar, M0K, M1K, M2K, M3K, MT;
unsigned char *histogram = (unsigned char*)(imgHist.data);
int N = src.rows*src.cols;
W0K = 0;
W1K = 0;
M0K = 0;
M1K = 0;
MT = 0;
maxBetweenVar = 0;
for (int k = 0; k <= 255; k++) {
MT += k * (histogram[k] / (double) N);
}
for (int t1 = 0; t1 <= 255; t1++)
{
W0K += histogram[t1] / (double) N; //Pi
M0K += t1 * (histogram[t1] / (double) N); //i * Pi
M0 = M0K / W0K; //(i * Pi)/Pi
W1K = 0;
M1K = 0;
for (int t2 = t1 + 1; t2 <= 255; t2++)
{
W1K += histogram[t2] / (double) N; //Pi
M1K += t2 * (histogram[t2] / (double) N); //i * Pi
M1 = M1K / W1K; //(i * Pi)/Pi
W2K = 1 - (W0K + W1K);
M2K = MT - (M0K + M1K);
if (W2K <= 0) break;
M2 = M2K / W2K;
W3K = 0;
M3K = 0;
for (int t3 = t2 + 1; t3 <= 255; t3++)
{
W2K += histogram[t3] / (double) N; //Pi
M2K += t3 * (histogram[t3] / (double) N); // i*Pi
M2 = M2K / W2K; //(i*Pi)/Pi
W3K = 1 - (W1K + W2K);
M3K = MT - (M1K + M2K);
M3 = M3K / W3K;
currVarB = W0K * (M0 - MT) * (M0 - MT) + W1K * (M1 - MT) * (M1 - MT) + W2K * (M2 - MT) * (M2 - MT) + W3K * (M3 - MT) * (M3 - MT);
if (maxBetweenVar < currVarB)
{
maxBetweenVar = currVarB;
optimalThresh1 = t1;
optimalThresh2 = t2;
optimalThresh3 = t3;
}
}
}
}
}
@Guilherme Silva
Your code has a BUG
You Must Replace:
W3K = 0;
M3K = 0;
with
W2K = 0;
M2K = 0;
and
W3K = 1 - (W1K + W2K);
M3K = MT - (M1K + M2K);
with
W3K = 1 - (W0K + W1K + W2K);
M3K = MT - (M0K + M1K + M2K);
;-)
Regards
EDIT(1): [Toby Speight]
I discovered this bug by applying the effect to the same picture at different resolutions (sizes) and seeing that the outputs differed too much from each other (even when changing the resolution only a little).
W3K and M3K must be the totals minus the previous WKs and MKs.
(I arrived at this by analogy with the version of the code that has one level fewer.)
At the moment, due to my limited English, I cannot explain better how and why.
To be honest, I'm still not 100% sure that this way is correct, even though I could tell from my outputs that it gives better results (even with one more level, i.e. 5 shades of gray).
You could try it yourself ;-)
Sorry
My Outputs:
3 Thresholds
4 Thresholds
I found a useful piece of code in this thread. I was looking for a multi-level Otsu implementation for double/float images, so I tried to generalize the example to N levels with a double/float matrix as input. The code below uses the Armadillo library as a dependency, but it can easily be adapted to standard C++ arrays: just replace the vec and uvec objects with one-dimensional double and integer arrays, and mat and umat with two-dimensional ones. Two other Armadillo functions used here are vectorise and hist.
// Input parameters:
// map - input image (double matrix)
// mask - region of interest to be thresholded
// nBins - number of bins
// nLevels - number of Otsu thresholds
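#include <armadillo>
#include <algorithm>
#include <iostream>
#include <vector>
using namespace arma;
using namespace std;
// forward declaration: nCombinations is defined after OtsuFilterMulti
int nCombinations(int n, int r);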
mat OtsuFilterMulti(mat map, int nBins, int nLevels) {
mat mapr; // output thresholded image
mapr = zeros<mat>(map.n_rows, map.n_cols);
unsigned int numElem = 0;
vec threshold = zeros<vec>(nLevels);
vec q = zeros<vec>(nLevels + 1);
vec mu = zeros<vec>(nLevels + 1);
vec muk = zeros<vec>(nLevels + 1);
uvec binv = zeros<uvec>(nLevels);
if (nLevels <= 1) return mapr;
numElem = map.n_rows*map.n_cols;
uvec histogram = hist(vectorise(map), nBins);
double maxval = map.max();
double minval = map.min();
double odelta = (maxval - minval) / nBins; // distance between histogram bins
vec oval = zeros<vec>(nBins);
double mt = 0, variance = 0.0, bestVariance = 0.0;
for (int ii = 0; ii < nBins; ii++) {
oval(ii) = minval + (double)odelta*ii + (double)odelta*0.5; // centers of histogram bins
mt += (double)ii*((double)histogram(ii)) / (double)numElem;
}
for (int ii = 0; ii < nLevels; ii++) {
binv(ii) = ii;
}
double sq, smuk;
int nComb;
nComb = nCombinations(nBins,nLevels);
std::vector<bool> v(nBins);
std::fill(v.begin(), v.begin() + nLevels, true);
umat ibin = zeros<umat>(nComb, nLevels); // indices from combinations will be stored here
int cc = 0;
int ci = 0;
do {
for (int i = 0; i < nBins; ++i) {
if(ci==nLevels) ci=0;
if (v[i]) {
ibin(cc,ci) = i;
ci++;
}
}
cc++;
} while (std::prev_permutation(v.begin(), v.end()));
uvec lastIndex = zeros<uvec>(nLevels);
// Perform operations on pre-calculated indices
for (int ii = 0; ii < nComb; ii++) {
for (int jj = 0; jj < nLevels; jj++) {
smuk = 0;
sq = 0;
if (lastIndex(jj) != ibin(ii, jj) || ii == 0) {
q(jj) += double(histogram(ibin(ii, jj))) / (double)numElem;
muk(jj) += ibin(ii, jj)*(double(histogram(ibin(ii, jj)))) / (double)numElem;
mu(jj) = muk(jj) / q(jj);
q(jj + 1) = 0.0;
muk(jj + 1) = 0.0;
if (jj>0) {
for (int kk = 0; kk <= jj; kk++) {
sq += q(kk);
smuk += muk(kk);
}
q(jj + 1) = 1 - sq;
muk(jj + 1) = mt - smuk;
mu(jj + 1) = muk(jj + 1) / q(jj + 1);
}
if (jj>0 && jj<(nLevels - 1)) {
q(jj + 1) = 0.0;
muk(jj + 1) = 0.0;
}
lastIndex(jj) = ibin(ii, jj);
}
}
variance = 0.0;
for (int jj = 0; jj <= nLevels; jj++) {
variance += q(jj)*(mu(jj) - mt)*(mu(jj) - mt);
}
if (variance > bestVariance) {
bestVariance = variance;
for (int jj = 0; jj<nLevels; jj++) {
threshold(jj) = oval(ibin(ii, jj));
}
}
}
cout << "Optimized thresholds: ";
for (int jj = 0; jj<nLevels; jj++) {
cout << threshold(jj) << " ";
}
cout << endl;
for (unsigned int jj = 0; jj<map.n_rows; jj++) {
for (unsigned int kk = 0; kk<map.n_cols; kk++) {
for (int ll = 0; ll<nLevels; ll++) {
if (map(jj, kk) >= threshold(ll)) {
mapr(jj, kk) = ll+1;
}
}
}
}
return mapr;
}
int nCombinations(int n, int r) {
if (r>n) return 0;
if (r*2 > n) r = n-r;
if (r == 0) return 1;
int ret = n;
for( int i = 2; i <= r; ++i ) {
ret *= (n-i+1);
ret /= i;
}
return ret;
}

What's the best way to fit a set of points in an image one or more good lines using RANSAC using OpenCV?

What's the best way to fit one or more good lines to a set of points in an image using RANSAC in OpenCV?
Is RANSAC the most efficient way to fit a line?
RANSAC is not the most efficient but it is better for a large number of outliers. Here is how to do it using opencv:
A useful structure-
struct SLine
{
SLine():
numOfValidPoints(0),
params(-1.f, -1.f, -1.f, -1.f)
{}
cv::Vec4f params;//(cos(t), sin(t), X0, Y0)
int numOfValidPoints;
};
Total Least squares used to make a fit for a successful pair
cv::Vec4f TotalLeastSquares(
std::vector<cv::Point>& nzPoints,
std::vector<int> ptOnLine)
{
//if there are enough inliers calculate model
float x = 0, y = 0, x2 = 0, y2 = 0, xy = 0, w = 0;
float dx2, dy2, dxy;
float t;
for( size_t i = 0; i < nzPoints.size(); ++i )
{
x += ptOnLine[i] * nzPoints[i].x;
y += ptOnLine[i] * nzPoints[i].y;
x2 += ptOnLine[i] * nzPoints[i].x * nzPoints[i].x;
y2 += ptOnLine[i] * nzPoints[i].y * nzPoints[i].y;
xy += ptOnLine[i] * nzPoints[i].x * nzPoints[i].y;
w += ptOnLine[i];
}
x /= w;
y /= w;
x2 /= w;
y2 /= w;
xy /= w;
//Covariance matrix
dx2 = x2 - x * x;
dy2 = y2 - y * y;
dxy = xy - x * y;
t = (float) atan2( 2 * dxy, dx2 - dy2 ) / 2;
cv::Vec4f line;
line[0] = (float) cos( t );
line[1] = (float) sin( t );
line[2] = (float) x;
line[3] = (float) y;
return line;
}
The actual RANSAC
SLine LineFitRANSAC(
float t,//distance from main line
float p,//chance of hitting a valid pair
float e,//percentage of outliers
int T,//number of expected minimum inliers
std::vector<cv::Point>& nzPoints)
{
int s = 2;//number of points required by the model
int N = (int)ceilf(log(1-p)/log(1 - pow(1-e, s)));//number of independent trials
std::vector<SLine> lineCandidates;
std::vector<int> ptOnLine(nzPoints.size());//is inlier
cv::RNG rng((uint64)-1);
SLine line;
for (int i = 0; i < N; i++)
{
//pick two points
int idx1 = (int)rng.uniform(0, (int)nzPoints.size());
int idx2 = (int)rng.uniform(0, (int)nzPoints.size());
cv::Point p1 = nzPoints[idx1];
cv::Point p2 = nzPoints[idx2];
//points too close - discard
if (cv::norm(p1- p2) < t)
{
continue;
}
//line equation -> (y1 - y2)X + (x2 - x1)Y + x1y2 - x2y1 = 0
float a = static_cast<float>(p1.y - p2.y);
float b = static_cast<float>(p2.x - p1.x);
float c = static_cast<float>(p1.x*p2.y - p2.x*p1.y);
//normalize them
float scale = 1.f/sqrt(a*a + b*b);
a *= scale;
b *= scale;
c *= scale;
//count inliers
int numOfInliers = 0;
for (size_t i = 0; i < nzPoints.size(); ++i)
{
cv::Point& p0 = nzPoints[i];
float rho = std::fabs(a*p0.x + b*p0.y + c); // std::fabs from <cmath>, so the int overload of abs is never picked
bool isInlier = rho < t;
if ( isInlier ) numOfInliers++;
ptOnLine[i] = isInlier;
}
if ( numOfInliers < T)
{
continue;
}
line.params = TotalLeastSquares( nzPoints, ptOnLine);
line.numOfValidPoints = numOfInliers;
lineCandidates.push_back(line);
}
int bestLineIdx = 0;
int bestLineScore = 0;
for (size_t i = 0; i < lineCandidates.size(); i++)
{
if (lineCandidates[i].numOfValidPoints > bestLineScore)
{
bestLineIdx = i;
bestLineScore = lineCandidates[i].numOfValidPoints;
}
}
if ( lineCandidates.empty() )
{
return SLine();
}
else
{
return lineCandidates[bestLineIdx];
}
}
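The trial count N in LineFitRANSAC comes from the usual RANSAC estimate N = log(1 - p) / log(1 - (1 - e)^s). As a rough illustration only, the same expression can be pulled into a small helper; the name RansacTrialCount is hypothetical and not part of the code above.

```cpp
#include <cmath>

// Hypothetical helper (not in the original post): number of RANSAC trials so
// that, with probability p, at least one sample of s points contains no
// outliers when a fraction e of the data are outliers.
int RansacTrialCount(double p, double e, int s)
{
    const double allInliers = std::pow(1.0 - e, s);   // chance one sample is clean
    return static_cast<int>(std::ceil(std::log(1.0 - p) / std::log(1.0 - allInliers)));
}
```

For example, with p = 0.99, e = 0.5 and s = 2 the ratio is about 16.01, so 17 trials are needed; with e = 0.2 the count drops to 5.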
Take a look at the Least Mean Squares method. It's faster and simpler than RANSAC.
Also take a look at OpenCV's fitLine method.
RANSAC performs better when you have a lot of outliers in your data, or a complex hypothesis.
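To make the comparison concrete, here is a minimal sketch of a plain least-squares line fit in standard C++. It assumes the simplest variant, ordinary least squares on y = slope * x + intercept, which cannot represent vertical lines; the Point2 and LineFit types and the function name are hypothetical stand-ins, not OpenCV types.

```cpp
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };                           // hypothetical stand-in for cv::Point
struct LineFit { double slope, intercept; bool valid; };  // valid == false: no unique fit

// Ordinary least-squares fit of y = slope * x + intercept over all points.
// Every point has equal weight, so a few strong outliers can pull the line away.
LineFit FitLineLeastSquares(const std::vector<Point2>& pts)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (const Point2& p : pts)
    {
        sx  += p.x;
        sy  += p.y;
        sxx += p.x * p.x;
        sxy += p.x * p.y;
    }
    const double n = static_cast<double>(pts.size());
    const double denom = n * sxx - sx * sx;   // zero when all x are equal (vertical line)
    if (pts.size() < 2 || denom == 0.0)
    {
        return {0.0, 0.0, false};
    }
    const double slope = (n * sxy - sx * sy) / denom;
    return {slope, (sy - slope * sx) / n, true};
}
```

Fitting the points (0,1), (1,3), (2,5) gives slope 2 and intercept 1. Because there is no inlier test, this is only a good choice when the data are mostly clean, which is the trade-off against RANSAC described above.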

Fast Pixel Count on Binary Image - ARM NEON intrinsics - iOS Dev

Can someone suggest a fast function to count the number of white pixels in a binary image? I need it for iOS app development. I am working directly on the image memory, which is defined as
bool *imageData = (bool *) malloc(noOfPixels * sizeof(bool));
I am implementing the function
int whiteCount = 0;
for (int q=i; q<i+windowHeight; q++)
{
for (int w=j; w<j+windowWidth; w++)
{
if (imageData[q*W + w] == 1)
whiteCount++;
}
}
This is obviously the slowest possible approach. I have heard that ARM NEON intrinsics on iOS can perform several operations in one cycle. Maybe that's the way to go?
The problem is that I am not very familiar with NEON and don't have enough time to learn assembly language at the moment. So it would be great if anyone could post NEON intrinsics code for the problem above, or any other fast implementation in C/C++.
The only NEON intrinsics code I have been able to find online is for RGB to gray conversion:
http://computer-vision-talks.com/2011/02/a-very-fast-bgra-to-grayscale-conversion-on-iphone/
Firstly you can speed up the original code a little by factoring out the multiply and getting rid of the branch:
int whiteCount = 0;
for (int q = i; q < i + windowHeight; q++)
{
const bool * const row = &imageData[q * W];
for (int w = j; w < j + windowWidth; w++)
{
whiteCount += row[w];
}
}
(This assumes that imageData[] is truly binary, i.e. each element can only ever be 0 or 1.)
Here is a simple NEON implementation:
#include <arm_neon.h>
// ...
int q, w; // q: row index, w: column index (i and j are the window origin)
int whiteCount = 0;
uint32x4_t v_count = { 0 };
for (q = i; q < i + windowHeight; q++)
{
const bool * const row = &imageData[q * W];
uint16x8_t vrow_count = { 0 };
for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
{
uint8x16_t v = vld1q_u8((const uint8_t *)&row[w]); // load 16 x 8 bit pixels starting at column w
vrow_count = vpadalq_u8(vrow_count, v); // accumulate 16 bit row counts
}
for ( ; w < j + windowWidth; ++w) // scalar clean up loop
{
whiteCount += row[w];
}
v_count = vpadalq_u16(v_count, vrow_count); // update 32 bit image counts
} // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0); // v_count is uint32x4_t, so use the u32 accessor
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount
(This assumes that imageData[] is truly binary, imageWidth <= 2^19, and sizeof(bool) == 1.)
Updated version for unsigned char and values of 255 for white, 0 for black:
#include <arm_neon.h>
// ...
int q, w; // q: row index, w: column index (i and j are the window origin)
int whiteCount = 0;
const uint8x16_t v_mask = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
uint32x4_t v_count = { 0 };
for (q = i; q < i + windowHeight; q++)
{
const uint8_t * const row = &imageData[q * W];
uint16x8_t vrow_count = { 0 };
for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
{
uint8x16_t v = vld1q_u8(&row[w]); // load 16 x 8 bit pixels starting at column w
v = vandq_u8(v, v_mask); // mask out all but LS bit
vrow_count = vpadalq_u8(vrow_count, v); // accumulate 16 bit row counts
}
for ( ; w < j + windowWidth; ++w) // scalar clean up loop
{
whiteCount += (row[w] == 255);
}
v_count = vpadalq_u16(v_count, vrow_count); // update 32 bit image counts
} // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0); // v_count is uint32x4_t, so use the u32 accessor
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount
(This assumes that imageData[] has values of 255 for white and 0 for black, and imageWidth <= 2^19.)
Note that all the above code is untested and may need some further work.
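Since the NEON versions are untested, it may help to have a portable reference to compare them against on any machine. The following is a sketch under stated assumptions, not part of the original answer: the function name CountWhiteWindow is hypothetical, the image is assumed to be row-major with one byte per pixel, and every pixel is assumed to be exactly 0 or 1. It sums eight pixels at a time with a 64-bit multiply instead of SIMD intrinsics.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical reference implementation: counts white pixels (value 1) in the
// window starting at row i, column j. W is the full image width in pixels.
// Assumes every pixel byte is 0 or 1.
int CountWhiteWindow(const std::uint8_t* imageData, int W,
                     int i, int j, int windowHeight, int windowWidth)
{
    const std::uint64_t kOnes = 0x0101010101010101ULL;
    std::uint64_t total = 0;
    for (int q = i; q < i + windowHeight; q++)
    {
        const std::uint8_t* row = imageData + static_cast<std::size_t>(q) * W + j;
        int w = 0;
        for (; w + 8 <= windowWidth; w += 8)          // 8 pixels per step
        {
            std::uint64_t chunk;
            std::memcpy(&chunk, row + w, sizeof chunk); // unaligned-safe load
            // Each byte is 0 or 1, so the multiply leaves the sum of all
            // eight bytes (at most 8) in the top byte.
            total += (chunk * kOnes) >> 56;
        }
        for (; w < windowWidth; w++)                  // scalar clean up loop
        {
            total += row[w];
        }
    }
    return static_cast<int>(total);
}
```

For images that store white as 255 rather than 1, masking each chunk with the same 0x0101... constant before the multiply would reduce every byte to 0 or 1 first, mirroring the vandq_u8 step in the second NEON version.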
http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
Section 6.55.3.6
The vectorized algorithm will do the comparisons and put them in a structure for you, but you'd still need to go through each element of the structure and determine if it's a zero or not.
How fast does that loop currently run and how fast do you need it to run? Also remember that NEON will work in the same registers as the floating point unit, so using NEON here may force an FPU context switch.

Resources