OpenCV - trouble creating matrix from other matrix (malloc.c:2451: sYSMALLOc: Assertion)

I'm creating a new Mat by jumping pixels of the original image, but I get this error:
PRM algorithm: malloc.c:2451: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
My code is:
int width = round(img.cols / M);
int height = round(img.rows / M);
cv::Mat res(height, width, CV_8U);
for (int i = 0; i < width; i++) {
for (int j = 0; j < height; j++) {
res.at<cv::Vec3b>(i, j)[0] = img.at<cv::Vec3b>(i * M, j * M)[0];
res.at<cv::Vec3b>(i, j)[1] = img.at<cv::Vec3b>(i * M, j * M)[1];
res.at<cv::Vec3b>(i, j)[2] = img.at<cv::Vec3b>(i * M, j * M)[2];
}
}
return res;
I also tried using uchar* ptr = img.ptr<uchar>(i) and ptr[j] to access the data directly, but I get the same error.
I searched and tried some suggested solutions, such as the one in "sYSMALLOc: Assertion Failed error in opencv", but the problem persists.

In general, this error occurs when you try to access memory at an invalid location. Your code has two issues that contribute to the problem.
First, you access elements of res with cv::Vec3b, which implies a three-channel image, but you initialize it as single channel. Change the initialization to:
cv::Mat res(height, width, CV_8UC3); // Needs to be three channels!
Second, .at<>(i, j) accesses elements by their row index i and column index j. However, your i and j indices refer to column and row, respectively. Your iteration limits should be swapped:
for (int i = 0; i < height; i++) {
for (int j = 0; j < width; j++) {
// ...
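Both fixes can be seen together in a small stand-alone sketch. It does not use OpenCV: Image3 and subsample are hypothetical names, and the struct only mimics a row-major, interleaved 3-channel 8-bit buffer. The point is that the output is allocated with three channels and that the first index is always the row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 3-channel 8-bit image (not an OpenCV type).
struct Image3 {
    int rows;
    int cols;
    std::vector<unsigned char> data;  // rows * cols * 3 bytes, row-major, interleaved

    Image3(int r, int c)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c * 3, 0) {}

    // Offset of channel 0 of the pixel at (row, col): row index first, then column.
    std::size_t offset(int row, int col) const {
        assert(row >= 0 && row < rows && col >= 0 && col < cols);
        return (static_cast<std::size_t>(row) * cols + col) * 3;
    }
};

// Keep every m-th pixel in both directions.
Image3 subsample(const Image3& img, int m) {
    assert(m > 0);
    Image3 res(img.rows / m, img.cols / m);      // three channels, like the source
    for (int i = 0; i < res.rows; i++) {         // i walks over rows
        for (int j = 0; j < res.cols; j++) {     // j walks over columns
            const std::size_t src = img.offset(i * m, j * m);
            const std::size_t dst = res.offset(i, j);
            for (int k = 0; k < 3; k++) {
                res.data[dst + k] = img.data[src + k];
            }
        }
    }
    return res;
}
```

Because the output size uses integer division, i * m and j * m never leave the source image.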

Related

OpenCV Mat: efficiently linearizing the upper-right triangle

How can I efficiently linearize a Mat (symmetric matrix) into one row, using only its upper-right triangle?
For example, when I have:
0aabbb
b0aaaa
ba0bba
bac0aa
aaaa0c
abcab0
and then from that I get:
aabbbaaaabbaaac
Something like this:
...
template<class T>
Mat SSMJ::triangleLinearized(Mat mat){
int c = mat.cols;
Mat row = Mat(1, ((c*c)-c)/2, mat.type());
int i = 0;
for(int y = 0; y < mat.rows; y++)
for(int x = y + 1; x < mat.cols; x++) {
row.at<T>(i)=mat.at<T>(y, x);
i++;
}
return row;
}
...
Since the data in your mat is just a 1D array stored in mat.data, you can do whatever you want with it. I don't think you will find anything smarter (without using vectorized methods) than just copying from this array.
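const int n = 6; // the matrix is n x n
char data[] = { 0,1,2,3,4,5,
                0,1,2,3,4,5,
                0,1,2,3,4,5,
                0,1,2,3,4,5,
                0,1,2,3,4,5,
                0,1,2,3,4,5};
char result[n * (n - 1) / 2];
int offset = 0;
// row i contributes the n - 1 - i elements to the right of the diagonal
for (int i = 0; i < n - 1; offset += n - 1 - i, i++) {
    memcpy(&result[offset], &data[n * i + i + 1], n - 1 - i);
}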
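Or with an OpenCV Mat it would be (assuming a square CV_8UC1 matrix):
int n = mat.cols;
std::vector<uchar> result(n * (n - 1) / 2); // exactly the size of the upper triangle
int offset = 0;
for (int i = 0; i < n - 1; offset += n - 1 - i, i++) {
    memcpy(&result[offset], mat.ptr<uchar>(i) + i + 1, n - 1 - i);
}
// wrap the buffer in a 1 x offset Mat; clone() so the Mat owns its data
Mat resultMat = Mat(1, offset, CV_8UC1, result.data()).clone();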
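For element types other than char, the same row-by-row copy can be written generically. This is a minimal sketch without OpenCV; upperTriangle is a hypothetical helper and the matrix is assumed to be a square, row-major std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy the strictly upper triangle of an n x n row-major matrix into one row.
template <class T>
std::vector<T> upperTriangle(const std::vector<T>& m, std::size_t n) {
    assert(m.size() == n * n);
    std::vector<T> out;
    out.reserve(n * (n - 1) / 2);
    for (std::size_t y = 0; y + 1 < n; y++) {
        // The elements to the right of the diagonal in row y are contiguous,
        // so each row needs a single range copy.
        const std::ptrdiff_t first = static_cast<std::ptrdiff_t>(y * n + y + 1);
        const std::ptrdiff_t last = static_cast<std::ptrdiff_t>((y + 1) * n);
        out.insert(out.end(), m.begin() + first, m.begin() + last);
    }
    return out;
}
```

Applied to the 6x6 example from the question, this yields the 15 characters aabbbaaaabbaaac.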

OpenCV - polynomial function fitting

In OpenCV (or another C++ library), is there a function similar to MATLAB's fit that can do 3D polynomial surface fitting (i.e. f(x,y) = p00 + p10*x + p01*y + p20*x^2 + p11*x*y + p02*y^2)? Thanks
I don't think OpenCV has a function for this, but you can do it like this:
int main( int argc, char** argv )
{
Mat z = imread("1449862093156643.jpg",CV_LOAD_IMAGE_GRAYSCALE);
Mat M = Mat_<double>(z.rows*z.cols,6);
Mat I=Mat_<double>(z.rows*z.cols,1);
for (int i=0;i<z.rows;i++)
for (int j = 0; j < z.cols; j++)
{
double x=(j - z.cols / 2) / double(z.cols),y= (i - z.rows / 2) / double(z.rows);
M.at<double>(i*z.cols+j, 0) = x*x;
M.at<double>(i*z.cols+j, 1) = y*y;
M.at<double>(i*z.cols+j, 2) = x*y;
M.at<double>(i*z.cols+j, 3) = x;
M.at<double>(i*z.cols+j, 4) = y;
M.at<double>(i*z.cols+j, 5) = 1;
I.at<double>(i*z.cols+j, 0) = z.at<uchar>(i,j);
}
SVD s(M);
Mat q;
s.backSubst(I,q);
cout<<q;
imshow("Orignal",z);
cout<<q.at<double>(2,0);
Mat background(z.rows,z.cols,CV_8UC1);
for (int i=0;i<z.rows;i++)
for (int j = 0; j < z.cols; j++)
{
double x=(j - z.cols / 2) / double(z.cols),y= (i - z.rows / 2) / double(z.rows);
double quad=q.at<double>(0,0)*x*x+q.at<double>(1,0)*y*y+q.at<double>(2,0)*x*y;
quad+=q.at<double>(3,0)*x+q.at<double>(4,0)*y+q.at<double>(5,0);
background.at<uchar>(i,j) = saturate_cast<uchar>(quad);
}
imshow("Simulated background",background);
waitKey();
return 0;
}
Original post is here
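The same least-squares fit can be sketched without OpenCV. Note the swap of technique: the answer above solves the system with SVD, which is the more robust choice for ill-conditioned data, while this sketch builds the 6x6 normal equations and solves them with Gauss-Jordan elimination. Sample and fitQuadraticSurface are hypothetical names, and the coefficient order follows the question (p00, p10, p01, p20, p11, p02).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

struct Sample { double x, y, z; };

// Least-squares fit of
//   z ~ p[0] + p[1]*x + p[2]*y + p[3]*x^2 + p[4]*x*y + p[5]*y^2
// via the normal equations (A^T A) p = A^T z.
std::array<double, 6> fitQuadraticSurface(const std::vector<Sample>& pts) {
    assert(pts.size() >= 6);
    double a[6][7] = {};  // augmented system [A^T A | A^T z]
    for (const Sample& s : pts) {
        const double basis[6] = {1.0, s.x, s.y, s.x * s.x, s.x * s.y, s.y * s.y};
        for (int r = 0; r < 6; r++) {
            for (int c = 0; c < 6; c++) a[r][c] += basis[r] * basis[c];
            a[r][6] += basis[r] * s.z;
        }
    }
    // Gauss-Jordan elimination with partial pivoting.
    for (int col = 0; col < 6; col++) {
        int pivot = col;
        for (int r = col + 1; r < 6; r++) {
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        }
        assert(std::fabs(a[pivot][col]) > 1e-12);  // samples must not be degenerate
        for (int c = 0; c < 7; c++) std::swap(a[col][c], a[pivot][c]);
        for (int r = 0; r < 6; r++) {
            if (r == col) continue;
            const double f = a[r][col] / a[col][col];
            for (int c = col; c < 7; c++) a[r][c] -= f * a[col][c];
        }
    }
    std::array<double, 6> p;
    for (int i = 0; i < 6; i++) p[i] = a[i][6] / a[i][i];
    return p;
}
```

As in the answer above, it helps to normalize x and y to a small range around zero before fitting, so the powers of the coordinates stay comparable in magnitude.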
There is an undocumented function in OpenCV (contrib.hpp) called cv::polyfit(). It takes a Mat of x coordinates, a Mat of y coordinates, and an output Mat that receives the coefficients. Using Mats for this is not very convenient, but you can build a wrapper that accepts a vector of cv::Point:
vector<float> fitPoly(const vector<Point> &src, int order){
    Mat src_x((int)src.size(), 1, CV_32F);
    Mat src_y((int)src.size(), 1, CV_32F);
    for (int i = 0; i < (int)src.size(); i++){
        src_x.at<float>(i, 0) = (float)src[i].x;
        src_y.at<float>(i, 0) = (float)src[i].y;
    }
    // polyfit returns void: it fills a preallocated (order + 1) x 1 column
    Mat coeffs(order + 1, 1, CV_32F);
    cv::polyfit(src_x, src_y, coeffs, order);
    return vector<float>(coeffs.begin<float>(), coeffs.end<float>());
}

ARM NEON Optimization for image transformation

I'm applying an NV12 video transformation which shuffles pixels of the video. On an ARM device such as Google Nexus 7 2013, performance is pretty bad at 30fps for a 1024x512 area with the following C code:
* Pre-processing done only once at beginning of video *
//Temporary tables for the destination
for (j = 0; j < height; j++)
for (i = 0; i < width; i++) {
toY[i][j] = j * width + i;
toUV[i][j] = j / 2 * width + ((int)(i / 2)) * 2;
}
//Temporary tables for the source
for (j = 0; j < height; j++)
for (i = 0; i < width; i++) {
fromY[i][j] = funcY(i, j) * width + funcX(i, j);
fromUV[i][j] = funcY(i, j) / 2 * width + ((int)(funcX(i, j) / 2)) * 2;
}
* Process done at each frame *
for (j = 0; j < height; j++)
for (i = 0; i < width; i++) {
destY[ toY[i][j] ] = srcY[ fromY[i][j] ];
if ((i % 2 == 0) && (j % 2 == 0)) {
destUV[ toUV[i][j] ] = srcUV[ fromUV[i][j] ];
destUV[ toUV[i][j] + 1 ] = srcUV[ fromUV[i][j] + 1 ];
}
}
Though it's computed only once, funcX/Y is a pretty complex transformation so it's not very easy to optimize this part.
Is there still a way to speed up the double loop computed at each frame with the given "from" tables?
You create FOUR lookup tables 8 times as large as the original image?
You put an unnecessary if statement in the inner most loop?
What about swapping i and j?
Honestly, your question should be tagged with [c] instead of arm, neon, or image-processing to start with.
Since you didn't show what funcY and funcX do, the best answer I can give is following.
(Performance skyrocketed. And it's something really really fundamental)
//Temporary tables for the source
pTemp = fromYUV;
for (j = 0; j < height; j+=2)
{
for (i = 0; i < width; i+=2) {
*pTemp++ = funcY(i, j) * width + funcX(i, j);
*pTemp++ = funcY(i+1, j) * width + funcX(i+1, j);
*pTemp++ = funcY(i, j) / 2 * width + ((int)(funcX(i, j) / 2)) * 2;
}
for (i = 0; i < width; i+=2) {
*pTemp++ = funcY(i, j+1) * width + funcX(i, j+1);
*pTemp++ = funcY(i+1, j+1) * width + funcX(i+1, j+1);
}
}
* Process done at each frame *
pTemp = fromYUV;
pTempY = destY;
pTempUV = destUV;
for (j = 0; j < height; j+=2)
{
for (i = 0; i < width; i+=2) {
*pTempY++ = srcY[*pTemp++];
*pTempY++ = srcY[*pTemp++];
*pTempUV++ = srcUV[*pTemp++];
}
for (i = 0; i < width; i+=2) {
*pTempY++ = srcY[*pTemp++];
*pTempY++ = srcY[*pTemp++];
}
}
You should avoid (when possible):
- access to multiple memory areas
- random memory access
- if statements within loops
The worst crime you committed is the order of i and j. (Which you don't need to start with)
If you access a pixel at the coordinate x and y, it's pixel = image[y][x] and NOT image[x][y]
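A minimal sketch of that idea for a single plane, under stated assumptions: buildTable and remapPlane are hypothetical names, mapX/mapY stand in for the question's funcX/funcY, and only the Y plane is handled (the interleaved UV plane would need its own table). The table is one flat array stored in destination order, so the per-frame loop writes sequentially and has no branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Built once: table[y * width + x] holds the source offset for destination pixel (x, y).
template <class MapX, class MapY>
std::vector<std::uint32_t> buildTable(int width, int height, MapX mapX, MapY mapY) {
    std::vector<std::uint32_t> table(static_cast<std::size_t>(width) * height);
    std::size_t k = 0;
    for (int y = 0; y < height; y++) {      // rows outermost: matches memory order
        for (int x = 0; x < width; x++) {
            table[k++] = static_cast<std::uint32_t>(mapY(x, y) * width + mapX(x, y));
        }
    }
    return table;
}

// Per frame: sequential writes, one table read and one gathered read per pixel.
void remapPlane(const std::vector<std::uint8_t>& src,
                const std::vector<std::uint32_t>& table,
                std::vector<std::uint8_t>& dst) {
    assert(dst.size() == table.size());
    for (std::size_t k = 0; k < table.size(); k++) {
        dst[k] = src[table[k]];
    }
}
```

The reads from src remain scattered, which is inherent to an arbitrary shuffle; the gain comes from making everything else sequential.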

Fast Gaussian Blur image filter with ARM NEON

I'm trying to make a mobile fast version of Gaussian Blur image filter.
I've read other questions, like: Fast Gaussian blur on unsigned char image- ARM Neon Intrinsics- iOS Dev
For my purpose I need only a fixed-size (7x7), fixed-sigma (2) Gaussian filter.
So, before optimizing for ARM NEON, I'm implementing a 1D Gaussian kernel in C++ and comparing its performance with the OpenCV GaussianBlur() method directly in a mobile environment (Android with NDK). This should give much simpler code to optimize.
However, the result is that my implementation is 10 times slower than the OpenCV4Android version. I've read that OpenCV4 Tegra has an optimized GaussianBlur implementation, but I don't think that the standard OpenCV4Android has that kind of optimization, so why is my code so slow?
Here is my implementation (note: reflect101 is used for pixel reflection when applying filter near borders):
Mat myGaussianBlur(Mat src){
Mat dst(src.rows, src.cols, CV_8UC1);
Mat temp(src.rows, src.cols, CV_8UC1);
float sum, x1, y1;
// coefficients of 1D gaussian kernel with sigma = 2
double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
//Normalize coeffs
float coeffs_sum = 0.9230247873f;
for (int i = 0; i < 7; i++){
coeffs[i] /= coeffs_sum;
}
// filter vertically
for(int y = 0; y < src.rows; y++){
for(int x = 0; x < src.cols; x++){
sum = 0.0;
for(int i = -3; i <= 3; i++){
y1 = reflect101(src.rows, y - i);
sum += coeffs[i + 3]*src.at<uchar>(y1, x);
}
temp.at<uchar>(y,x) = sum;
}
}
// filter horizontally
for(int y = 0; y < src.rows; y++){
for(int x = 0; x < src.cols; x++){
sum = 0.0;
for(int i = -3; i <= 3; i++){
x1 = reflect101(src.cols, x - i); // reflect against the width, not the height
sum += coeffs[i + 3]*temp.at<uchar>(y, x1);
}
dst.at<uchar>(y,x) = sum;
}
}
return dst;
}
A big part of the problem here is that the algorithm is overly precise, as @PaulR pointed out. It's usually best to keep your coefficient table no more precise than your data. In this case, since you appear to be processing uchar data, you would use roughly an 8-bit coefficient table.
Keeping these weights small will particularly matter in your NEON implementation because the narrower you have the arithmetic, the more lanes you can process at once.
Beyond that, the first major slowdown that stands out is having the image edge reflection code within the main loop. That makes the bulk of the work less efficient, because most pixels are far from the edges and need no special handling.
It might work out better if you use a special version of the loop near the edges, and then when you're safe from that you use a simplified inner loop that doesn't call that reflect101() function.
Second (more relevant to prototype code) is that it's possible to add the wings of the window together before applying the weighting function, because the table contains the same coefficients on both sides.
sum = src.at<uchar>(y, x) * coeffs[3];
for(int i = -3; i < 0; i++) {
int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
sum += coeffs[i + 3] * tmp;
}
This saves you six multiplies per pixel, and it's a step towards some other optimisations around controlling overflow conditions.
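To see that folding the wings does not change the result, here is a small stand-alone comparison with integer weights. fullSum and foldedSum are hypothetical names and the weights in the usage note are arbitrary; the only requirement is that the kernel is symmetric.

```cpp
#include <cassert>

// Straightforward 7-tap weighted sum: seven multiplies.
int fullSum(const int (&p)[7], const int (&w)[7]) {
    int sum = 0;
    for (int k = 0; k < 7; k++) sum += w[k] * p[k];
    return sum;
}

// Same result for a symmetric kernel (w[k] == w[6 - k]): four multiplies.
int foldedSum(const int (&p)[7], const int (&w)[7]) {
    assert(w[0] == w[6] && w[1] == w[5] && w[2] == w[4]);
    int sum = w[3] * p[3];                 // centre tap
    for (int k = 0; k < 3; k++) {
        sum += w[k] * (p[k] + p[6 - k]);   // add the two wings first, multiply once
    }
    return sum;
}
```

For pixels {10, 20, 30, 40, 50, 60, 70} and weights {4, 8, 12, 14, 12, 8, 4}, both functions return 2480.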
Then there are a couple of other problems related to the memory system.
The two-pass approach is good in principle, because it saves you from performing a lot of recomputation. Unfortunately it can push the useful data out of L1 cache, which can make everything quite a lot slower. It also means that when you write the result out to memory, you're quantising the intermediate sum, which can reduce precision.
When you convert this code to NEON, one of the things you will want to focus on is trying to keep your working set inside the register file, but without discarding calculations before they've been fully utilised.
When people do use two passes, it's usual for the intermediate data to be transposed -- that is, a column of input becomes a row of output.
This is because the CPU will really not like fetching small amounts of data across multiple lines of the input image. It works out much more efficient (because of the way the cache works) if you collect together a bunch of horizontal pixels, and filter those. If the temporary buffer is transposed, then the second pass also collects together a bunch of horizontal points (which would be vertical in the original orientation) and it transposes its output again so it comes out the right way.
If you optimise to keep your working set localised, then you might not need this transposition trick, but it's worth knowing about so that you can set yourself a healthy baseline performance. Unfortunately, localisation like this does force you to go back to the non-optimal memory fetches, but with the wider data types that penalty can be mitigated.
If this is specifically for 8 bit images then you really don't want floating point coefficients, especially not double precision. Also you don't want to use floats for x1, y1. You should just use integers for coordinates and you can use fixed point (i.e. integer) for the coefficients to keep all the filter arithmetic in the integer domain, e.g.
Mat myGaussianBlur(Mat src){
Mat dst(src.rows, src.cols, CV_8UC1);
Mat temp(src.rows, src.cols, CV_16UC1); // <<<
int sum, x1, y1; // <<<
// coefficients of 1D gaussian kernel with sigma = 2
double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
int coeffs_i[7] = { 0 }; // <<<
//Normalize coeffs
float coeffs_sum = 0.9230247873f;
for (int i = 0; i < 7; i++){
coeffs_i[i] = (int)(coeffs[i] / coeffs_sum * 256); // <<<
}
// filter vertically
for(int y = 0; y < src.rows; y++){
for(int x = 0; x < src.cols; x++){
sum = 0; // <<<
for(int i = -3; i <= 3; i++){
y1 = reflect101(src.rows, y - i);
sum += coeffs_i[i + 3]*src.at<uchar>(y1, x); // <<<
}
temp.at<ushort>(y,x) = sum; // temp is CV_16UC1, so access it as ushort
}
}
// filter horizontally
for(int y = 0; y < src.rows; y++){
for(int x = 0; x < src.cols; x++){
sum = 0; // <<<
for(int i = -3; i <= 3; i++){
x1 = reflect101(src.cols, x - i); // reflect against the width, not the height
sum += coeffs_i[i + 3]*temp.at<ushort>(y, x1); // <<<
}
dst.at<uchar>(y,x) = sum / (256 * 256); // <<<
}
}
return dst;
}
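The fixed-point idea can be tried in isolation on a single row. This is a minimal sketch, not the code above: it uses 16.16 fixed point instead of 8.8, adds a rounding term, and supplies its own reflect101, which is an assumed implementation because the question does not show that function. The weights are assumed to sum to at most 1.0, so the result fits in 8 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirror an out-of-range index back into [0, n) without repeating the edge
// pixel (the "reflect 101" rule: ...cb|abcd|cb...). Valid for -n < p < 2n - 1.
int reflect101(int n, int p) {
    if (p < 0) return -p;
    if (p >= n) return 2 * n - 2 - p;
    return p;
}

// 7-tap horizontal filter in 16.16 fixed point with round-to-nearest.
std::vector<std::uint8_t> blurRow(const std::vector<std::uint8_t>& row,
                                  const double (&coeffs)[7]) {
    int w[7];
    for (int i = 0; i < 7; i++) {
        w[i] = static_cast<int>(coeffs[i] * 65536);  // convert once, outside the loop
    }
    const int n = static_cast<int>(row.size());
    assert(n >= 4);  // keeps every reflected index in range
    std::vector<std::uint8_t> out(row.size());
    for (int x = 0; x < n; x++) {
        int sum = 32768;  // 0.5 in 16.16, so the shift below rounds to nearest
        for (int i = -3; i <= 3; i++) {
            sum += w[i + 3] * row[reflect101(n, x + i)];
        }
        out[x] = static_cast<std::uint8_t>(sum >> 16);
    }
    return out;
}
```

The largest intermediate value is about 255 * 65536, which fits comfortably in a 32-bit int.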
This is the code after implementing all the suggestions of @Paul R and @sh1, summarized as follows:
1) use only integer arithmetic (with precision to taste)
2) add the values of the pixels at the same distance from the mask center before multiplying, to reduce the number of multiplications.
3) apply only horizontal filters, to take advantage of the row-major storage of the matrices
4) separate the loops along the edges from those inside the image, to avoid unnecessary calls to the reflection function. I removed the reflection function entirely and inlined its logic in the loops along the edges.
5) In addition, as a personal observation, to improve rounding without calling a (slow) function such as "round" or "cvRound", I add 0.5 (= 32768 in 16.16 fixed point) to both the temporary and the final pixel results, which reduces the difference compared to OpenCV.
Performance is now much better: it went from about 15 times to about 6 times slower than OpenCV.
However, the resulting matrix is not perfectly identical to the one produced by OpenCV's GaussianBlur. This does not appear to be due to the arithmetic width, which is sufficient, and the difference persists regardless. Note that the difference is minimal: between 0 and 2 (in absolute value) of pixel intensity between the matrices produced by the two versions. The coefficients are the same as those used by OpenCV, obtained with getGaussianKernel with the same size and sigma.
Mat myGaussianBlur(Mat src){
Mat dst(src.rows, src.cols, CV_8UC1);
Mat temp(src.rows, src.cols, CV_8UC1);
int sum;
int x1;
double coeffs[] = {0.070159, 0.131075, 0.190713, 0.216106, 0.190713, 0.131075, 0.070159};
int coeffs_i[7] = { 0 };
for (int i = 0; i < 7; i++){
coeffs_i[i] = (int)(coeffs[i] * 65536); //65536
}
// filter horizontally - inside the image
for(int y = 0; y < src.rows; y++){
uchar *ptr = src.ptr<uchar>(y);
for(int x = 3; x < (src.cols - 3); x++){
sum = ptr[x] * coeffs_i[3];
for(int i = -3; i < 0; i++){
int tmp = ptr[x+i] + ptr[x-i];
sum += coeffs_i[i + 3]*tmp;
}
temp.at<uchar>(y,x) = (sum + 32768) / 65536;
}
}
// filter horizontally - edges - needs reflect
for(int y = 0; y < src.rows; y++){
uchar *ptr = src.ptr<uchar>(y);
for(int x = 0; x <= 2; x++){
sum = 0;
for(int i = -3; i <= 3; i++){
x1 = x + i;
if(x1 < 0){
x1 = -x1;
}
sum += coeffs_i[i + 3]*ptr[x1];
}
temp.at<uchar>(y,x) = (sum + 32768) / 65536;
}
}
for(int y = 0; y < src.rows; y++){
uchar *ptr = src.ptr<uchar>(y);
for(int x = (src.cols - 3); x < src.cols; x++){
sum = 0;
for(int i = -3; i <= 3; i++){
x1 = x + i;
if(x1 >= src.cols){
x1 = 2*src.cols - x1 - 2;
}
sum += coeffs_i[i + 3]*ptr[x1];
}
temp.at<uchar>(y,x) = (sum + 32768) / 65536;
}
}
// transpose to apply the horizontal filter again - better cache data locality
// (note: the loops below reuse src.rows/src.cols, so this assumes a square image)
transpose(temp, temp);
// filter horizontally - inside the image
for(int y = 0; y < src.rows; y++){
uchar *ptr = temp.ptr<uchar>(y);
for(int x = 3; x < (src.cols - 3); x++){
sum = ptr[x] * coeffs_i[3];
for(int i = -3; i < 0; i++){
int tmp = ptr[x+i] + ptr[x-i];
sum += coeffs_i[i + 3]*tmp;
}
dst.at<uchar>(y,x) = (sum + 32768) / 65536;
}
}
// filter horizontally - edges - needs reflect
for(int y = 0; y < src.rows; y++){
uchar *ptr = temp.ptr<uchar>(y);
for(int x = 0; x <= 2; x++){
sum = 0;
for(int i = -3; i <= 3; i++){
x1 = x + i;
if(x1 < 0){
x1 = -x1;
}
sum += coeffs_i[i + 3]*ptr[x1];
}
dst.at<uchar>(y,x) = (sum + 32768) / 65536;
}
}
for(int y = 0; y < src.rows; y++){
uchar *ptr = temp.ptr<uchar>(y);
for(int x = (src.cols - 3); x < src.cols; x++){
sum = 0;
for(int i = -3; i <= 3; i++){
x1 = x + i;
if(x1 >= src.cols){
x1 = 2*src.cols - x1 - 2;
}
sum += coeffs_i[i + 3]*ptr[x1];
}
dst.at<uchar>(y,x) = (sum + 32768) / 65536;
}
}
transpose(dst, dst);
return dst;
}
According to Google's documentation, on Android devices using float/double is twice as slow as using int/uchar.
You may find some ways to speed up your C++ code in this Android document:
https://developer.android.com/training/articles/perf-tips

OpenCV Mat class: Accessing elements of a multi-channel matrix

I currently want to read in some values into a 3-channel, 480 row by 640 column matrix of 8 bit unsigned integer values. I am initializing the matrix like this:
Declaration:
rgbMatrix = Mat::zeros(480,640,CV_8UC3);
When I try to iterate through the entire matrix I am unable to assign/grab values using the following method. The values simply stay 0. My code looks like this:
for (int i = 0; i < rgbMatrix.rows; i++)
{
for (int j = 0; j < rgbMatrix.cols; j++)
{
(rgbMatrix.data + rgbMatrix.step * i)[j * rgbMatrix.channels() + 0] = *value0*;
(rgbMatrix.data + rgbMatrix.step * i)[j * rgbMatrix.channels() + 1] = *value1*;
(rgbMatrix.data + rgbMatrix.step * i)[j * rgbMatrix.channels() + 2] = *value2*;
}
}
However, when I declare three separate 1-channel matrices (also 480 row by 640 column of 8 bit unsigned integer values) and attempt to access elements of those matrices the following code works:
Declaration:
rgbMatrix0 = Mat::zeros(480,640,CV_8UC1);
rgbMatrix1 = Mat::zeros(480,640,CV_8UC1);
rgbMatrix2 = Mat::zeros(480,640,CV_8UC1);
for (int i = 0; i < rgbMatrix0.rows; i++)
{
for (int j = 0; j < rgbMatrix0.cols; j++)
{
(rgbMatrix0.data + rgbMatrix0.step * i)[j] = *value0*;
(rgbMatrix1.data + rgbMatrix1.step * i)[j] = *value1*;
(rgbMatrix2.data + rgbMatrix2.step * i)[j] = *value2*;
}
}
Now, I want to use just one matrix for these operations, as having to keep track of three separate variables will get tiresome after a while. I have a feeling that I am not accessing the right point in memory for the three-channel matrix. Does anyone know how I can accomplish what I did in the second portion of code but using one three-channel matrix instead of three separate one-channel matrices?
Thanks.
There are plenty of ways to do it, for example:
cv::Mat rgbMatrix(480,640,CV_8UC3);
for (int i = 0; i < rgbMatrix.rows; i++)
for (int j = 0; j < rgbMatrix.cols; j++)
for (int k = 0; k < 3; k++)
rgbMatrix.at<cv::Vec3b>(i,j)[k] = value;
[k] here is the channel index.
To set all the matrix elements to a specific value, such as 5, you can do this:
cv::Mat rgbMatrix2(cv::Size(640,480), CV_8UC3, cv::Scalar(5,5,5)); // cv::Size takes (width, height)
std::cout << rgbMatrix2 << std::endl;
Sorry, I can't see your code since I'm writing from an iPhone. With a 3-channel matrix you can get a pixel using (take a reference so that writes go to the matrix):
Vec3b& pix = rgbMatrix.at<Vec3b>(row,col);
Now you can access the channels using:
pix[0] = 255; pix[1] += pix[2];
P.S. Generally an rgbMatrix pixel is of type Vec3b or Vec3d. Always call image.at<> with the type that matches the matrix.
It is very simple: use Vec3b for uchar, Vec3i for int, Vec3f for float, and Vec3d for double.
Mat rgbMatrix = Mat::zeros(480,640,CV_8UC3); // three channels to match Vec3b
for (int i = 0; i < rgbMatrix.rows; i++)
{
for (int j = 0; j < rgbMatrix.cols; j++)
{
rgbMatrix.at<Vec3b>(i,j)[0] = *value0;
rgbMatrix.at<Vec3b>(i,j)[1] = *value1;
rgbMatrix.at<Vec3b>(i,j)[2] = *value2;
}
}
vector<cv::Point3f> xyzBuffer;
cv::Mat xyzBuffMat = cv::Mat(307200, 1, CV_32FC3);
for (int i = 0; i < xyzBuffer.size(); i++) {
xyzBuffMat.at<cv::Vec3f>(i, 0)[0] = xyzBuffer[i].x; // row i, column 0, channel 0
xyzBuffMat.at<cv::Vec3f>(i, 0)[1] = xyzBuffer[i].y;
xyzBuffMat.at<cv::Vec3f>(i, 0)[2] = xyzBuffer[i].z;
}
Here, 0, 1, and 2 are respectively the channels that store x, y and z values.
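For the raw-pointer style used in the question, the address arithmetic for an interleaved multi-channel buffer can be sketched without OpenCV. Interleaved is a hypothetical struct; its step field plays the role of Mat::step (bytes per row, possibly including padding), and the element type is assumed to be 8-bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an 8-bit interleaved image buffer (not an OpenCV type).
struct Interleaved {
    int rows;
    int cols;
    int channels;
    std::size_t step;                // bytes per row; may exceed cols * channels
    std::vector<std::uint8_t> data;

    Interleaved(int r, int c, int ch, std::size_t padding = 0)
        : rows(r), cols(c), channels(ch),
          step(static_cast<std::size_t>(c) * ch + padding),
          data(step * static_cast<std::size_t>(r), 0) {}

    // Byte offset = row * step + col * channels + channel.
    std::uint8_t& at(int row, int col, int ch) {
        assert(row >= 0 && row < rows && col >= 0 && col < cols);
        assert(ch >= 0 && ch < channels);
        return data[static_cast<std::size_t>(row) * step
                    + static_cast<std::size_t>(col) * channels
                    + static_cast<std::size_t>(ch)];
    }
};
```

With 2 rows, 3 columns, 3 channels and 2 padding bytes per row, step is 11, so at(1, 2, 1) refers to byte 1 * 11 + 2 * 3 + 1 = 18.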

Resources