OpenCV, Haar cascade classifier: scaling feature or computing image pyramid?

I read the Viola and Jones paper.
They state clearly in the paper that their algorithm is faster than others because it avoids computing an image pyramid by scaling the feature rectangles instead.
But after googling around for a long time, all I found is that OpenCV implements the image pyramid method instead of scaling the feature rectangles. An integral image is computed for every sub-image in the pyramid, and this is done for every frame if the algorithm is used to process video instead of a single picture.
What's the rationale for this choice? I don't quite get it.
My understanding is exactly the opposite: for video applications, the features only need to be scaled once, and the scaled features can be reused for all frames. Then only the integral image of the full-size frame needs to be computed.
Am I correct on this?
Viola and Jones also reported a frame rate of 15 fps on a Pentium III computer, but I hardly see anybody achieving that performance with the OpenCV implementation on a modern computer. That's strange, isn't it?
Any input will be helpful. Thank you.

I have tried to verify this by looking into the code, based on version 2.4.10. The short answer is: both. OpenCV can scale the image according to the scale factor at which detection is performed, and it can also rescale the features to different window sizes according to the scale factor. The justification is below:
1. Look at the older function cvHaarDetectObjectsForROC from the objdetect module (haar.cpp). The notable arguments are const CvArr* _img, double scaleFactor, int minNeighbors, CvSize minSize and CvSize maxSize.
cvHaarDetectObjectsForROC( const CvArr* _img,
        CvHaarClassifierCascade* cascade, CvMemStorage* storage,
        std::vector<int>& rejectLevels, std::vector<double>& levelWeights,
        double scaleFactor, int minNeighbors, int flags,
        CvSize minSize, CvSize maxSize, bool outputRejectLevels )
{
    CvMat stub, *img = (CvMat*)_img;
    .... // skip a bit ahead to this part
    if( flags & CV_HAAR_SCALE_IMAGE )
    {
        CvSize winSize0 = cascade->orig_window_size; // this would be the trained size of 24x24 pixels mentioned in the paper
        for( factor = 1; ; factor *= scaleFactor )
        {
            // detection window for current scale
            CvSize winSize = { cvRound(winSize0.width*factor), cvRound(winSize0.height*factor) };
            // resized image size
            CvSize sz = { cvRound( img->cols/factor ), cvRound( img->rows/factor ) };
            // take every possible scale factor as long as the resulting window doesn't exceed the maximum size given and is bigger than the minimum one
            if( winSize.width > maxSize.width || winSize.height > maxSize.height )
                break;
            if( winSize.width < minSize.width || winSize.height < minSize.height )
                continue;
            img1 = cvMat( sz.height, sz.width, CV_8UC1, imgSmall->data.ptr );
            ... // skip sum, square sum, tilted sums a.k.a. integral image arrays initialization
            cvResize( img, &img1, CV_INTER_LINEAR ); // scaling down the image here
            cvIntegral( &img1, &sum1, &sqsum1, _tilted ); // compute integral representation for the scaled down version
            ... // skip some lines
            cvSetImagesForHaarClassifierCascade( cascade, &sum1, &sqsum1, _tilted, 1. ); // -> sets the structures and also rescales the features according to the last parameter, which is the scale factor.
            // Notice it is 1.0 because the image was scaled down this time.
            <call detection function with notable arguments: cascade,... factor, cv::Mat(&sum1), cv::Mat(&sqsum1) ...>
            // the above call is a parallel for that evaluates a window at a certain position in the image with the cascade classifier
            // note the class naming HaarDetectObjects_ScaleImage_Invoker in the actual code and skipped here.
        } // end for
    } // if
    else // CV_HAAR_SCALE_IMAGE not set: keep the image and rescale the cascade instead
    {
        int n_factors = 0; // total number of factors
        cvIntegral( img, sum, sqsum, tilted ); // -> makes a single integral image for the given image (the original one passed in the cvHaarDetectObjects)
        // below aims to see the total number of scale factors at which detection is performed.
        for( n_factors = 0, factor = 1;
             factor*cascade->orig_window_size.width < img->cols - 10 &&
             factor*cascade->orig_window_size.height < img->rows - 10;
             n_factors++, factor *= scaleFactor );
        ... // skip some lines
        for( ; n_factors-- > 0; factor *= scaleFactor )
        {
            CvSize winSize = { cvRound( cascade->orig_window_size.width * factor ), cvRound( cascade->orig_window_size.height * factor )};
            ... // skip check for minSize and maxSize here
            cvSetImagesForHaarClassifierCascade( cascade, sum, sqsum, tilted, factor ); // -> notice here the scale factor is given so that the trained Haar features can be rescaled.
            <parallel for detect call given a startX, endX and startY endY, window size and cascade> // Note the name here HaarDetectObjects_ScaleCascade_Invoker used in actual code and skipped here
        }
    } // end of if
    ... // skip rest
} // end of cvHaarDetectObjectsForROC function
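To make the "scale the cascade" branch more concrete, here is a minimal sketch of the underlying idea: build one integral image, then evaluate a feature rectangle at any scale with four lookups. This is not OpenCV code. IntegralImage, Rect and scaleRect are names I made up for illustration, and real Haar features also need weights and variance normalization, which are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summed-area table with an extra zero row and column, so rectSum needs no edge cases.
struct IntegralImage {
    int width;
    int height;
    std::vector<std::int64_t> table;  // (width + 1) * (height + 1) entries

    IntegralImage(const std::vector<std::uint8_t>& pixels, int w, int h)
        : width(w), height(h),
          table(static_cast<std::size_t>(w + 1) * static_cast<std::size_t>(h + 1), 0) {
        assert(w > 0 && h > 0);
        assert(pixels.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
        for (int y = 0; y < h; ++y) {
            std::int64_t rowSum = 0;  // running sum of the current row
            for (int x = 0; x < w; ++x) {
                rowSum += pixels[static_cast<std::size_t>(y) * w + x];
                table[index(x + 1, y + 1)] = table[index(x + 1, y)] + rowSum;
            }
        }
    }

    std::size_t index(int x, int y) const {
        return static_cast<std::size_t>(y) * (width + 1) + x;
    }

    // Sum over the rectangle [x, x + w) x [y, y + h), using four table lookups.
    std::int64_t rectSum(int x, int y, int w, int h) const {
        assert(x >= 0 && y >= 0 && w >= 0 && h >= 0);
        assert(x + w <= width && y + h <= height);
        return table[index(x + w, y + h)] - table[index(x, y + h)]
             - table[index(x + w, y)] + table[index(x, y)];
    }
};

// Hypothetical feature rectangle, given in trained-window coordinates.
struct Rect { int x, y, w, h; };

// Scale a trained rectangle to the current detection window size.
Rect scaleRect(const Rect& r, double factor) {
    assert(factor > 0.0);
    return Rect{ static_cast<int>(std::lround(r.x * factor)),
                 static_cast<int>(std::lround(r.y * factor)),
                 static_cast<int>(std::lround(r.w * factor)),
                 static_cast<int>(std::lround(r.h * factor)) };
}
```

The cost of rectSum does not depend on the rectangle size, which is why rescaling the features can replace building a pyramid. The rounding in scaleRect is also where accuracy can suffer at non-integer scale factors.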
With the newer C++ API, the CascadeClassifier class, when it loads the new .xml cascade format produced by the traincascade.exe application, scales the image according to the scale factor (for Haar features it should be scaled up, from what I know). The detectMultiScale method of the class ends up calling the detectSingleScale method at some point in the code:
if( !detectSingleScale( scaledImage, stripCount, processingRectSize, stripSize, yStep, factor, candidates, rejectLevels, levelWeights, outputRejectLevels ) )
    break; // from cascadedetect.cpp in the detectMultiScale method.
A possible reason I can think of: in order to have a unified design in C++, this is the only method that offers a single, transparent interface for different types of features.
I have left my trail of thought here so that, if I have misunderstood or omitted something, another user can correct me by following it.
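Both code paths walk the same geometric series of scale factors. The following is a rough sketch of that loop under my own simplifications: the names Size, ScaleStep and enumerateScales are hypothetical, and the stop condition here is simply "the window no longer fits", not the exact checks in haar.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Size { int width; int height; };
struct ScaleStep { double factor; Size window; };

// List the detection window sizes visited for a given scale factor.
std::vector<ScaleStep> enumerateScales(Size trained, Size image, double scaleFactor,
                                       Size minSize, Size maxSize) {
    assert(scaleFactor > 1.0);
    assert(trained.width > 0 && trained.height > 0);
    std::vector<ScaleStep> steps;
    for (double factor = 1.0;; factor *= scaleFactor) {
        Size win{ static_cast<int>(std::lround(trained.width * factor)),
                  static_cast<int>(std::lround(trained.height * factor)) };
        // Stop once the window no longer fits the image or exceeds maxSize.
        if (win.width > image.width || win.height > image.height) break;
        if (win.width > maxSize.width || win.height > maxSize.height) break;
        // Skip windows that are still smaller than minSize.
        if (win.width < minSize.width || win.height < minSize.height) continue;
        steps.push_back({factor, win});
    }
    return steps;
}
```

For a 24x24 cascade on a 100x100 image with scaleFactor 2 and minSize 30x30, this visits the 48x48 and 96x96 windows. With the pyramid approach each of those steps costs a resize plus a new integral image. With feature scaling each step only rescales the rectangles.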


How can I find the brightest point in a CIImage (in Metal maybe)?

I created a custom CIKernel in Metal. This is useful because it is close to real-time. I am avoiding any CGContext or CIContext that might lag in real time. My kernel essentially does a Hough transform, but I can't seem to figure out how to read the white points from the image buffer.
Here is kernel.metal:
#include <CoreImage/CoreImage.h>
extern "C" {
    namespace coreimage {
        float4 hough(sampler src) {
            // Math
            // More Math
            // eventually:
            if (luminance > 0.8) {
                float2 position = src.coord(); // coord() returns a float2
                // Somehow add this to an array because I need to know the x,y pair
            }
            return float4(luminance, luminance, luminance, 1.0);
        }
    }
}
I am fine if this part can be extracted to a different kernel or function. The caveat with CIKernel is that its return type is a float4 representing the new color of a pixel. Ideally, instead of an image -> image filter, I would like an image -> array sort of deal, i.e. reduce instead of map. I have a bad hunch this will require me to render it and deal with it on the CPU.
Ultimately I want to retrieve the qualifying coordinates (there can be multiple per image) back in my Swift function.
As per suggestions of the answer, I am doing large per-pixel calculations on the GPU, and some math on the CPU. I designed 2 additional kernels that work like the builtin reduction kernels. One kernel returns a 1 pixel high image of the highest values in each column, and the other kernel returns a 1 pixel high image of the normalized y-coordinate of the highest value:
/// Returns the maximum value in each column.
/// - Parameter src: a sampler for the input texture
/// - Returns: maximum value for each column
float4 maxValueForColumn(sampler src) {
    const float2 size = float2(src.extent().z, src.extent().w);
    /// Destination pixel coordinate, normalized
    const float2 pos = src.coord();
    float maxV = 0;
    for (float y = 0; y < size.y; y++) {
        float v = src.sample(float2(pos.x, y / size.y)).x;
        if (v > maxV) {
            maxV = v;
        }
    }
    return float4(maxV, maxV, maxV, 1.0);
}
/// Returns the normalized coordinate of the maximum value in each column.
/// - Parameter src: a sampler for the input texture
/// - Returns: normalized y-coordinate of the maximum value for each column
float4 maxCoordForColumn(sampler src) {
    const float2 size = float2(src.extent().z, src.extent().w);
    /// Destination pixel coordinate, normalized
    const float2 pos = src.coord();
    float maxV = 0;
    float maxY = 0;
    for (float y = 0; y < size.y; y++) {
        float v = src.sample(float2(pos.x, y / size.y)).x;
        if (v > maxV) {
            maxY = y / size.y;
            maxV = v;
        }
    }
    return float4(maxY, maxY, maxY, 1.0);
}
This won't give every pixel where luminance is greater than 0.8, but for my purposes, it returns enough: the highest value in each column, and its location.
Pro: copying only (2 * image width) bytes over to the CPU instead of every pixel saves TONS of time (a few ms).
Con: If you have two major white points in the same column, you will never know. You might have to alter this and do calculations by row instead of column if that fits your use-case.
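For the CPU side, here is a rough sketch of turning the two one-pixel-high results back into coordinates. It is written in plain C++ only to show the arithmetic. It assumes both rows were rendered as linear floats in the 0...1 range (which sidesteps the UInt8 issue), and BrightPoint and decodeColumnMaxima are names I made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct BrightPoint { int x; int y; float value; };

// maxValues[x]: largest value in column x.
// maxYNorm[x]:  y / imageHeight of that value, as written by the second kernel.
std::vector<BrightPoint> decodeColumnMaxima(const std::vector<float>& maxValues,
                                            const std::vector<float>& maxYNorm,
                                            int imageHeight, float threshold) {
    assert(maxValues.size() == maxYNorm.size());
    assert(imageHeight > 0);
    std::vector<BrightPoint> points;
    for (std::size_t x = 0; x < maxValues.size(); ++x) {
        if (maxValues[x] <= threshold) continue;  // column has no qualifying pixel
        // Undo the normalization and clamp to a valid row index.
        long y = std::lround(maxYNorm[x] * static_cast<float>(imageHeight));
        y = std::max(0L, std::min(y, static_cast<long>(imageHeight - 1)));
        points.push_back({static_cast<int>(x), static_cast<int>(y), maxValues[x]});
    }
    return points;
}
```

This keeps at most one point per column, which is the same limitation as the kernels themselves.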
There seems to be a problem in rendering the outputs. The Float values returned in Metal do not correlate with the UInt8 values I am getting in Swift.
This unanswered question describes the problem.
Edit: This answered question provides a very convenient Metal function. When you call it on a Metal value (e.g. 0.5) and return it, you will get the correct value (e.g. 128) on the CPU.
Check out the filters in the CICategoryReduction (like CIAreaAverage). They return images that are just a few pixels tall, containing the reduction result. But you still have to render them to be able to read the values in your Swift function.
The problem with using this approach for your case is that you don't know beforehand how many coordinates you will return. Core Image needs to know the extent of the output when it calls your kernel, though. You could just assume a static maximum number of coordinates, but that all sounds tedious.
I think you are better off using Accelerate APIs for iterating the pixels of your image (parallelized, super efficiently) on the CPU to find the corresponding coordinates.
You could do a hybrid approach where you do the per-pixel heavy math on the GPU with Core Image and then do the analysis on the CPU using Accelerate. You can even integrate the CPU part into your Core Image pipeline using a CIImageProcessorKernel.
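The CPU half of that hybrid approach boils down to a scan over the rendered luminance buffer. Here is a plain C++ sketch of the scan, not Accelerate code and not vectorized. The float buffer layout (row-major, one value per pixel) and the names Pixel and findBrightPixels are assumptions of mine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pixel { int x; int y; };

// Collect the coordinates of every pixel whose luminance exceeds the threshold.
std::vector<Pixel> findBrightPixels(const std::vector<float>& luminance,
                                    int width, int height, float threshold) {
    assert(width > 0 && height > 0);
    assert(luminance.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<Pixel> result;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (luminance[static_cast<std::size_t>(y) * width + x] > threshold) {
                result.push_back({x, y});
            }
        }
    }
    return result;
}
```

Unlike a Core Image kernel, the output size here can vary per frame, which is the whole reason to do this step on the CPU.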

Resize organized point cloud

I have an organized point cloud (1280 * 720) captured from a 3D camera. I wonder whether there is a method to resize (cut down) this point cloud to a smaller size (e.g. 128 * 72) while keeping the cloud organized.
(I think this shouldn't be the same as downsampling. "Resize" here means something like zooming an image.)
I am using Point Cloud Library 1.8.0 but am stuck on this.
Any advice is welcome, thanks first!
The answer of Rooscannon is correct in principle, but has some bugs in it. The correct uniform subsampling of an organized point cloud is as follows:
// Downsampling or keypoint extraction
int scale = 3;
PointCloud<PointXYZRGB>::Ptr keypoints (new PointCloud<PointXYZRGB>);
keypoints->width = cloud->width / scale;
keypoints->height = cloud->height / scale;
keypoints->points.resize(keypoints->width * keypoints->height);
for( size_t i = 0, ii = 0; i < keypoints->height; ii += scale, i++){
    for( size_t j = 0, jj = 0; j < keypoints->width; jj += scale, j++){
        keypoints->at(j, i) = cloud->at(jj, ii); //at(column, row)
    }
}
So the loop conditions, the indexing and the initialization of the subsampled point cloud are different. Otherwise, the subsampled point cloud would not be organized anymore.
Just take one point out of every N, where N is the factor by which you want to reduce your cloud.
Something like this should work:
for (pcl::PointCloud<pcl::PointXYZ>::const_iterator it = src->begin(); it< src->end(); it+=times)
The only problem is that the cloud might contain some NaN values. To correct this, just set is_dense to false in dest and call removeNaNFromPointCloud on it.
Hope this helps!
Can't comment but removing NaNs from your point cloud by default makes it unorganized. Quite likely the NaNs are there as dummy points in case your instrument was not able to observe a point in the matrix just to keep the matrix dimensions correct. Removing those breaks the matrix structure and you'll have a different amount of points than your 1280 * 720 matrix would expect.
If you wish to down sample an organized point cloud say by a factor of 2, you could try something like
int scale = 2;
pcl::PointCloud<pcl::your_point_type> down_sampled_cloud;
down_sampled_cloud.width = original_cloud.width / scale;
down_sampled_cloud.height = original_cloud.height / scale;
for( int ii = 0; ii < original_cloud.height; ii+=scale){
    for( int jj = 0; jj < original_cloud.width; jj+=scale ){
        // copy the point at column jj, row ii into down_sampled_cloud
    }
}
Change scale to what you wish.
This method just down samples the original point cloud, it will not interpolate points between existing points. Scaling by a decimal factor is trickier and might yield unwanted results if the surface is not continuous.
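If the target size is not an exact divisor of the source size, the same idea can be written as a nearest-neighbor resize, like zooming an image without interpolation. This is a sketch on a flat row-major vector, not PCL code: resizeOrganized is a hypothetical helper, and with PCL you would do the same index mapping through cloud.at(column, row).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nearest-neighbor resize of an organized (row-major) grid of points.
// NaN placeholder points are copied as-is, so the result stays organized.
template <typename Point>
std::vector<Point> resizeOrganized(const std::vector<Point>& src, int srcW, int srcH,
                                   int dstW, int dstH) {
    assert(srcW > 0 && srcH > 0 && dstW > 0 && dstH > 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * static_cast<std::size_t>(srcH));
    std::vector<Point> dst(static_cast<std::size_t>(dstW) * static_cast<std::size_t>(dstH));
    for (int r = 0; r < dstH; ++r) {
        // Map the destination row to a source row with integer arithmetic.
        const std::size_t srcRow = static_cast<std::size_t>(r) * srcH / dstH;
        for (int c = 0; c < dstW; ++c) {
            const std::size_t srcCol = static_cast<std::size_t>(c) * srcW / dstW;
            dst[static_cast<std::size_t>(r) * dstW + c] = src[srcRow * srcW + srcCol];
        }
    }
    return dst;
}
```

Going from 1280 * 720 to 128 * 72 this picks every 10th row and column, so it matches the uniform subsampling above. No points are averaged or interpolated.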

How to Display a 3D image when we have Depth and rgb Mat's in OpenCV (captured from Kinect)

We captured a 3D image using Kinect with the OpenNI library and got the RGB and depth images as OpenCV Mats using this code.
puts( "Kinect initialization..." );
Device device;
if ( device.open( openni::ANY_DEVICE ) != 0 )
{
    puts( "Kinect not found !" );
    return -1;
}
puts( "Kinect opened" );
VideoStream depth, color;
color.create( device, SENSOR_COLOR );
puts( "Camera ok" );
depth.create( device, SENSOR_DEPTH );
puts( "Depth sensor ok" );
VideoMode paramvideo;
paramvideo.setResolution( 640, 480 );
paramvideo.setFps( 30 );
paramvideo.setPixelFormat( PIXEL_FORMAT_DEPTH_100_UM );
depth.setVideoMode( paramvideo );
paramvideo.setPixelFormat( PIXEL_FORMAT_RGB888 );
color.setVideoMode( paramvideo );
puts( "Video stream settings ok" );
// If the depth/color synchronisation is not necessary, start is faster :
//device.setDepthColorSyncEnabled( false );
// Otherwise, the streams can be synchronized with a reception in the order of our choice :
device.setDepthColorSyncEnabled( true );
device.setImageRegistrationMode( openni::IMAGE_REGISTRATION_DEPTH_TO_COLOR );
VideoStream** stream = new VideoStream*[2];
stream[0] = &depth;
stream[1] = &color;
puts( "Kinect initialization completed" );
if ( device.getSensorInfo( SENSOR_DEPTH ) != NULL )
{
    VideoFrameRef depthFrame, colorFrame;
    cv::Mat colorcv( cv::Size( 640, 480 ), CV_8UC3, NULL );
    cv::Mat depthcv( cv::Size( 640, 480 ), CV_16UC1, NULL );
    cv::namedWindow( "RGB", CV_WINDOW_AUTOSIZE );
    cv::namedWindow( "Depth", CV_WINDOW_AUTOSIZE );
    int changedIndex;
    while( device.isValid() )
    {
        OpenNI::waitForAnyStream( stream, 2, &changedIndex );
        switch ( changedIndex )
        {
            case 0:
                depth.readFrame( &depthFrame );
                if ( depthFrame.isValid() )
                {
                    depthcv.data = (uchar*) depthFrame.getData();
                    cv::imshow( "Depth", depthcv );
                }
                break;
            case 1:
                color.readFrame( &colorFrame );
                if ( colorFrame.isValid() )
                {
                    colorcv.data = (uchar*) colorFrame.getData();
                    cv::cvtColor( colorcv, colorcv, CV_BGR2RGB );
                    cv::imshow( "RGB", colorcv );
                }
                break;
            default:
                puts( "Error retrieving a stream" );
        }
        cv::waitKey( 1 );
    }
    cv::destroyWindow( "RGB" );
    cv::destroyWindow( "Depth" );
}
We added some code to the above, got the RGB and depth Mats from it, and then processed the RGB image using OpenCV.
Now we need to display that image in 3D.
We are using:
1) Windows 8 x64
2) Visual Studio 2012 x64
3) OpenCV 2.4.10
4) OpenNI
5) Kinect1
6) Kinect SDK 1.8.0
Questions:
1) Can we display this image directly using OpenCV, or do we need external libraries?
2) If we need external libraries, which one is better for this simple task: OpenGL, PCL, or something else?
3) Does PCL support Visual Studio 2012 and OpenNI2? And since PCL comes packed with another version of OpenNI, do these two versions conflict?
To improve on the answer of antarctician:
to display the image in 3D you need to create your point cloud first.
The RGB and depth images give you the necessary data to create an organized colored point cloud. To do so, you need to calculate the x, y, z values for each point. The z value comes from the depth pixel, but the x and y must be calculated.
To do it you can do something like this:
void Viewer::get_pcl(cv::Mat& color_mat, cv::Mat& depth_mat, pcl::PointCloud<pcl::PointXYZRGBA>& cloud ){
    float x,y,z;
    for (int j = 0; j< depth_mat.rows; j ++){
        for(int i = 0; i < depth_mat.cols; i++){
            // the RGB data is created
            pcd_BGRA.B = color_mat.at<cv::Vec3b>(j,i)[0];
            pcd_BGRA.R = color_mat.at<cv::Vec3b>(j,i)[2];
            pcd_BGRA.G = color_mat.at<cv::Vec3b>(j,i)[1];
            pcd_BGRA.A = 0;
            pcl::PointXYZRGBA vertex;
            int depth_value = (int) depth_mat.at<unsigned short>(j,i);
            // find the world coordinates
            openni::CoordinateConverter::convertDepthToWorld(depth, i, j, (openni::DepthPixel) depth_mat.at<unsigned short>(j,i), &x, &y,&z );
            // the point is created with depth and color data
            if ( limitx_min <= i && limitx_max >=i && limity_min <= j && limity_max >= j && depth_value != 0 && depth_value <= limitz_max && depth_value >= limitz_min){
                vertex.x = (float) x;
                vertex.y = (float) y;
                vertex.z = (float) depth_value;
            } else {
                // if the data is outside the boundaries
                vertex.x = bad_point;
                vertex.y = bad_point;
                vertex.z = bad_point;
            }
            vertex.rgb = pcd_BGRA.RGB_float;
            // the point is pushed back in the cloud
            cloud.points.push_back( vertex );
        }
    }
}
and PCD_BGRA is
union PCD_BGRA
{
    struct // the four channels must sit side by side, otherwise they would all alias the first byte
    {
        uchar B; // LSB
        uchar G; // ---
        uchar R; // MSB
        uchar A; //
    };
    float RGB_float;
    uint RGB_uint;
};
Of course, this is for the case where you want to use PCL, but the calculation of the x, y, z values holds more or less regardless. It relies on openni::CoordinateConverter::convertDepthToWorld to find the position of the point in 3D. You can also do this manually:
const float invfocalLength = 1.f / 525.f;
const float centerX = 319.5f;
const float centerY = 239.5f;
const float factor = 1.f / 1000.f;
float dist = factor * (float)(*depthdata);
p.x = (x-centerX) * dist * invfocalLength;
p.y = (y-centerY) * dist * invfocalLength;
p.z = dist;
Here centerX, centerY, and the focal length are the intrinsic calibration of the camera (these values are for the Kinect), and factor converts the raw depth value into the unit you need (meters or millimeters); its value depends on your program.
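Wrapped into a function, the manual calculation might look like the sketch below. It is plain C++ with no OpenNI or PCL dependency. Intrinsics, Point3, backProject and isValid are names I made up, and returning NaN for a zero depth reading is my assumption, chosen so that an organized cloud keeps its dimensions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

struct Intrinsics { float fx; float fy; float cx; float cy; };
struct Point3 { float x; float y; float z; };

// Back-project pixel (u, v) with a raw depth reading into camera coordinates
// using the pinhole model. depthScale converts raw units to the output unit.
Point3 backProject(int u, int v, std::uint16_t rawDepth, const Intrinsics& k, float depthScale) {
    assert(k.fx > 0.0f && k.fy > 0.0f && depthScale > 0.0f);
    if (rawDepth == 0) {
        // No measurement: return a NaN placeholder instead of dropping the point.
        const float nan = std::numeric_limits<float>::quiet_NaN();
        return Point3{nan, nan, nan};
    }
    const float z = depthScale * static_cast<float>(rawDepth);
    return Point3{ (static_cast<float>(u) - k.cx) * z / k.fx,
                   (static_cast<float>(v) - k.cy) * z / k.fy,
                   z };
}

bool isValid(const Point3& p) { return !std::isnan(p.z); }
```

With the Kinect values above this would be called as backProject(u, v, depth, Intrinsics{525.f, 525.f, 319.5f, 239.5f}, 0.001f) for a depth map in millimeters and output in meters. For accurate results the intrinsics should come from calibrating your own device.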
For the questions:
1) Yes, you can display it using the latest OpenCV with the viz module, or with another external library that suits your needs.
2) OpenGL is nice, but PCL (or OpenCV) is easier to use if you do not know any of them (I mean for displaying point clouds).
3) I haven't used it on Windows, but in theory it can be used with Visual Studio 2012. As far as I know, the version that PCL comes packed with is OpenNI 1, and it won't affect OpenNI2.
I haven't done this with OpenNI and OpenCV but I hope I can help you. So first of all to answer your first two questions:
Probably yes, as far as I understood you want to visualize a 3D point cloud. OpenCV is only an image processing library and you would need a 3D rendering library to do what you want.
I worked with OpenSceneGraph and would recommend it. However you can also use OpenGL or Direct X.
If you only want to visualize a point cloud such as the "3D View" of the Kinect Studio, you wouldn't need PCL as it would be too much for this simple job.
The basic idea of doing this task is to create 3D quads as the same number of pixels you have on your images. For example if you have a 640x480 resolution, you would need 640*480 quads. Each quad would have the color of the corresponding pixel depending on the pixel values from the color image. You would then move these quads back and forth on the Z axis, depending on the values from the depth image. This can be done with modern OpenGL or if you feel more comfortable with C++, OpenSceneGraph(which is also based on OpenGL).
You would have to be careful about two things:
Drawing so many quads can be slow even on a modern computer. You would need to read about "instanced rendering" to render a large number of instances of an object(in our case a quad) in a single GPU draw call. This can be done using the vertex shader.
Since the RGB and the Depth camera of the Kinect have different physical locations, you would need to calibrate both of them. There are functions to do this in the official Kinect SDK, however I don't know about OpenNI.
If you decide to do this with OpenGL, I would suggest reading about the GPU pipeline if you aren't familiar with it. This would help you to save a lot of time when working with the vertex shaders.

Using iOS Accelerate Framework for 2D Signal Processing on Non-Power-of-Two images?

I'm editing my question slightly to address the issue of working specifically with non-power-of-two images. I've got a basic structure that works with square grayscale images with sizes like 256x256 or 1024x1024, but can't see how to generalize to arbitrarily sized images. The FFT functions seem to want the log2 of the width and height, but then it's unclear how to unpack the resulting data, or whether the data is just getting scrambled. I suppose the obvious thing to do would be to center the NPOT image within a larger, all-black image and then ignore any values in those positions when looking at the data. But I'm wondering if there's a less awkward way to work with NPOT data.
I'm having a bit of trouble with the Accelerate Framework documentation. I would normally use FFTW3, but I'm having trouble getting that to compile on an actual iOS device (see this question). Can anybody point me to a super simple implementation using Accelerate that does something like the following:
1) Turns image data into an appropriate data structure that can be passed to Accelerate's FFT methods.
In FFTW3, at its simplest, using a grayscale image, this involves placing the unsigned bytes into a "fftw_complex" array, which is simply a struct of two floats, one holding the real value and the other the imaginary (and where the imaginary is initialized to zero for each pixel).
2) Takes this data structure and performs an FFT on it.
3) Prints out the magnitude and phase.
4) Performs an IFFT on it.
5) Recreates the original image from the data resulting from the IFFT.
Although this is a very basic example, I am having trouble using the documentation from Apple's site. The SO answer by Pi here is very helpful, but I am still somewhat confused about how to use Accelerate to do this basic functionality using a grayscale (or color) 2D image.
Anyhow, any pointers or especially some simple working code that processes a 2D image would be extremely helpful!
\\\ EDIT \\\
Okay, after taking some time to dive into the documentation and some very helpful code on SO as well as on pkmital's github repo, I've got some working code that I thought I'd post since 1) it took me a while to figure it out and 2) since I have a couple of remaining questions...
Initialize FFT "plan". Assuming a square power-of-two image:
#include <Accelerate/Accelerate.h>
UInt32 N = log2(length*length);
UInt32 log2nr = N / 2;
UInt32 log2nc = N / 2;
UInt32 numElements = 1 << ( log2nr + log2nc );
float SCALE = 1.0/numElements;
SInt32 rowStride = 1;
SInt32 columnStride = 0;
FFTSetup setup = create_fftsetup(MAX(log2nr, log2nc), FFT_RADIX2);
Pass in a byte array for a square power-of-two grayscale image and turn it into a COMPLEX_SPLIT:
in_fft.realp = ( float* ) malloc ( numElements * sizeof ( float ) );
in_fft.imagp = ( float* ) malloc ( numElements * sizeof ( float ) );
for ( UInt32 i = 0; i < numElements; i++ ) {
    if (i < t->width * t->height) {
        in_fft.realp[i] = t->data[i] / 255.0;
        in_fft.imagp[i] = 0.0;
    }
}
Run the FFT on the transformed image data, then grab the magnitude and phase:
out_fft.realp = ( float* ) malloc ( numElements * sizeof ( float ) );
out_fft.imagp = ( float* ) malloc ( numElements * sizeof ( float ) );
fft2d_zop ( setup, &in_fft, rowStride, columnStride, &out_fft, rowStride, columnStride, log2nc, log2nr, FFT_FORWARD );
magnitude = (float *) malloc(numElements * sizeof(float));
phase = (float *) malloc(numElements * sizeof(float));
for (int i = 0; i < numElements; i++) {
    magnitude[i] = sqrt(out_fft.realp[i] * out_fft.realp[i] + out_fft.imagp[i] * out_fft.imagp[i]) ;
    phase[i] = atan2(out_fft.imagp[i],out_fft.realp[i]);
}
Now you can run an IFFT on the out_fft data to get the original image...
out_ifft.realp = ( float* ) malloc ( numElements * sizeof ( float ) );
out_ifft.imagp = ( float* ) malloc ( numElements * sizeof ( float ) );
fft2d_zop (setup, &out_fft, rowStride, columnStride, &out_ifft, rowStride, columnStride, log2nc, log2nr, FFT_INVERSE);
vsmul( out_ifft.realp, 1, &SCALE, out_ifft.realp, 1, numElements ); // the scalar is passed by pointer
vsmul( out_ifft.imagp, 1, &SCALE, out_ifft.imagp, 1, numElements );
Or you can run an IFFT on the magnitude to get an autocorrelation...
in_ifft.realp = ( float* ) malloc ( numElements * sizeof ( float ) );
in_ifft.imagp = ( float* ) malloc ( numElements * sizeof ( float ) );
for (int i = 0; i < numElements; i++) {
    in_ifft.realp[i] = (magnitude[i]);
    in_ifft.imagp[i] = 0.0;
}
// pass in_ifft (built from the magnitude above), not in_fft
fft2d_zop ( setup, &in_ifft, rowStride, columnStride, &out_ifft, rowStride, columnStride, log2nc, log2nr, FFT_INVERSE );
vsmul( out_ifft.realp, 1, &SCALE, out_ifft.realp, 1, numElements );
vsmul( out_ifft.imagp, 1, &SCALE, out_ifft.imagp, 1, numElements );
Finally, you can put the ifft results back into an image array:
for ( UInt32 i = 0; i < numElements; i++ ) {
    t->data[i] = (int) (out_ifft.realp[i] * 255.0);
}
I haven't figured out how to use the Accelerate framework to handle non-power-of-two images. If I allocate enough memory in the setup, then I can do an FFT followed by an IFFT to get my original image back. But if I try to do an autocorrelation (with the magnitude of the FFT), then I get wonky results. I'm not sure of the best way to pad the image appropriately, so hopefully someone has an idea of how to do this. (Or can share a working version of the vDSP_conv method!)
I would say that in order to perform work on arbitrary image sizes, all you have to do is size your input value array appropriately to the next power of 2.
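As a starting point, here is a small sketch of that padding step: round each dimension up to the next power of two and copy the image into the middle of a zero-filled buffer. It is plain C++ and does not touch Accelerate. PaddedImage, nextPowerOfTwo and padToPowerOfTwo are hypothetical names, and centering is only one of several possible placements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest power of two that is greater than or equal to n.
int nextPowerOfTwo(int n) {
    assert(n > 0 && n <= (1 << 30));
    int p = 1;
    while (p < n) p <<= 1;
    return p;
}

struct PaddedImage {
    int width;
    int height;
    int offsetX;  // column where the original image starts
    int offsetY;  // row where the original image starts
    std::vector<float> data;
};

// Center a w x h row-major image inside a zero-filled power-of-two canvas.
PaddedImage padToPowerOfTwo(const std::vector<float>& src, int w, int h) {
    assert(w > 0 && h > 0);
    assert(src.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    PaddedImage out;
    out.width = nextPowerOfTwo(w);
    out.height = nextPowerOfTwo(h);
    out.offsetX = (out.width - w) / 2;
    out.offsetY = (out.height - h) / 2;
    out.data.assign(static_cast<std::size_t>(out.width) * out.height, 0.0f);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const std::size_t dst = static_cast<std::size_t>(y + out.offsetY) * out.width + (x + out.offsetX);
            out.data[dst] = src[static_cast<std::size_t>(y) * w + x];
        }
    }
    return out;
}
```

The log2 of out.width and out.height are the values the FFT setup expects. For an autocorrelation in particular, the FFT result is circular, so it is common to pad each dimension to at least twice the image size minus one before rounding up. Otherwise the wrapped-around copies overlap the real result.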
The hard part is where to put your original image data and what to fill with. What you are really trying to do to the image or data mine from the image is crucial.
In the linked PDF below, pay particular attention to the paragraph just above 12.4.2
While the above speaks about manipulation along 2 axes, we could potentially apply a similar idea to the first dimension and then carry it on to the second. If I'm correct, then this example could apply (and this is by no means an exact algorithm yet):
Say we have an image that is 900 by 900.
First we could split the image into vertical strips of widths 512, 256, 128, and 4.
We would then process four 1D FFTs for each row: one for the first 512 pixels, the next for the following 256 pixels, the next for the following 128, and the last for the remaining 4. Since the output of the FFT is essentially the popularity of each frequency, these could simply be added (from the frequency perspective ONLY, not the angular offset).
We could then push this same technique toward the second dimension. At this point we would have taken every input pixel into consideration without actually having to pad.
This is really just food for thought. I have not tried it myself, and indeed should research it further. If you are doing this kind of work right now, you may have more time for that than I do at this point.

Having some difficulty in image stitching using OpenCV

I'm currently working on image stitching using OpenCV 2.3.1 on Visual Studio 2010, but I'm having some trouble.
Problem Description
I'm trying to write code for stitching multiple images coming from a few cameras (about 3~4), i.e., the code should keep stitching images until I ask it to stop.
The following is what I've done so far:
(For simplicity, I'll replace some parts of the code with just a few words)
1. Reading frames (images) from 2 cameras (currently I'm just working with 2 cameras).
2. Feature detection, descriptor calculation (SURF)
3. Feature matching using FlannBasedMatcher
4. Removing outliers and calculating the homography from the inliers using RANSAC.
5. Warping one of the two images.
For step 5, I followed the answer in the following thread and just changed some parameters:
Stitching 2 images in opencv
However, the result is terrible.
I just uploaded the result to YouTube; of course, only those who have the link will be able to see it.
My code is shown below:
(Only crucial parts are shown)
VideoCapture cam1, cam2;
Mat frm1, frm2;
cam1 >> frm1;
cam2 >> frm2;
// (SURF detection, descriptor calculation
// and matching using FlannBasedMatcher)
double max_dist = 0; double min_dist = 100;
//-- Quick calculation of max and min distances between keypoints
for( int i = 0; i < descriptors_1.rows; i++ )
{
    double dist = matches[i].distance;
    if( dist < min_dist ) min_dist = dist;
    if( dist > max_dist ) max_dist = dist;
}
// (Draw only "good" matches,
// i.e. whose distance is less than 3*min_dist)
vector<Point2f> frame1;
vector<Point2f> frame2;
for( int i = 0; i < good_matches.size(); i++ )
{
    //-- Get the keypoints from the good matches
    frame1.push_back( keypoints_1[ good_matches[i].queryIdx ].pt );
    frame2.push_back( keypoints_2[ good_matches[i].trainIdx ].pt );
}
Mat H = findHomography( Mat(frame1), Mat(frame2), CV_RANSAC );
cout << "Homography: " << H << endl;
/* warp the image */
Mat warpImage2;
warpPerspective(frm2, warpImage2,
                H, Size(frm2.cols, frm2.rows), INTER_CUBIC);
Mat final(Size(frm2.cols*3 + frm1.cols, frm2.rows),CV_8UC3);
Mat roi1(final, Rect(frm1.cols, 0, frm1.cols, frm1.rows));
Mat roi2(final, Rect(2*frm1.cols, 0, frm2.cols, frm2.rows));
imshow("final", final);
What else should I do to make the stitching better?
Besides, is it reasonable to make the Homography matrix fixed instead of keeping computing it ?
What I mean is to specify the angle and the displacement between the 2 cameras by myself so as to derive a Homography matrix that satisfies what I want.
Thanks. :)
It sounds like you are going about this sensibly, but if you have access to both of the cameras, and they will remain stationary with respect to each other, then calibrating offline, and simply applying the transformation online will make your application more efficient.
One point to note is, you say you are using the findHomography function from OpenCV. From the documentation, this function:
Finds a perspective transformation between two planes.
However, your points are not restricted to a specific plane as they are imaging a 3D scene. If you wanted to calibrate offline, you could image a chessboard with both cameras, and the detected corners could be used in this function.
Alternatively, you may like to investigate the Fundamental matrix, which can be calculated with a similar function. This matrix describes the relative position of the cameras, but some work (and a good textbook) will be required to extract them.
If you can find it, I would strongly recommend having a look at Part II: "Two-View Geometry" in the book "Multiple View Geometry in computer vision", by Richard Hartley and Andrew Zisserman, which goes through the process in detail.
I have been working lately on image registration. My algorithm takes two images, calculates the SURF features, finds correspondences, finds the homography matrix and then stitches both images together. I did it with the following code:
void stich(Mat base, Mat target,Mat homography, Mat& panorama){
    Mat corners1(1, 4,CV_32F);
    Mat corners2(1,4,CV_32F);
    Mat corners(1,4,CV_32FC2); // two channels: perspectiveTransform and at<Vec2f> need (x, y) pairs
    vector<Mat> planes;
    /* compute corners
       of warped image */
    perspectiveTransform(corners, corners, homography);
    /* compute size of resulting
       image and allocate memory */
    double x_start = min( min( (double)corners.at<Vec2f>(0,0)[0], (double)corners.at<Vec2f>(0,1)[0]),0.0);
    double x_end = max( max( (double)corners.at<Vec2f>(0,2)[0], (double)corners.at<Vec2f>(0,3)[0]), (double)base.cols);
    double y_start = min( min( (double)corners.at<Vec2f>(0,0)[1], (double)corners.at<Vec2f>(0,2)[1]), 0.0);
    double y_end = max( max( (double)corners.at<Vec2f>(0,1)[1], (double)corners.at<Vec2f>(0,3)[1]), (double)base.rows);
    /* Creating image
       with same channels, depth
       as target
       and proper size */
    panorama.create(Size(x_end - x_start + 1, y_end - y_start + 1), target.depth());
    /* Planes should
       have same n.channels
       as target */
    for (int i=0;i<target.channels();i++){
        // ...
    }
    // create translation matrix in order to copy both images to correct places
    Mat T;
    // copy base image to correct position within output image
    warpPerspective(base, panorama, T,panorama.size(),INTER_LINEAR| CV_WARP_FILL_OUTLIERS);
    // change homography to take necessary translation into account
    gemm(T, homography,1,T,0,T);
    // warp second image and copy it to output image
    warpPerspective(target,panorama, T, panorama.size(),INTER_LINEAR);
}
Any question I will try
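The geometry behind the function above (warp the corners, take the bounding box, then fold a translation into the homography) can be sketched without OpenCV as follows. Mat3, Point2, Bounds and the helper functions are names I made up for illustration. Unlike the code above, this version takes the min and max over all four warped corners, which is a slightly more conservative choice of mine.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;
struct Point2 { double x; double y; };
struct Bounds { double xMin; double yMin; double xMax; double yMax; };

// Apply a 3x3 homography to a point, including the perspective divide.
Point2 applyHomography(const Mat3& H, Point2 p) {
    const double w = H[2][0] * p.x + H[2][1] * p.y + H[2][2];
    assert(std::abs(w) > 1e-12);
    return Point2{ (H[0][0] * p.x + H[0][1] * p.y + H[0][2]) / w,
                   (H[1][0] * p.x + H[1][1] * p.y + H[1][2]) / w };
}

Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 c{};  // zero-initialized
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

Mat3 translation(double tx, double ty) {
    Mat3 t{};
    t[0][0] = 1.0; t[1][1] = 1.0; t[2][2] = 1.0;
    t[0][2] = tx;  t[1][2] = ty;
    return t;
}

// Bounding box of the base image plus the warped target image.
Bounds panoramaBounds(const Mat3& H, double targetW, double targetH, double baseW, double baseH) {
    Bounds b{0.0, 0.0, baseW, baseH};
    const Point2 corners[4] = { {0.0, 0.0}, {targetW, 0.0}, {0.0, targetH}, {targetW, targetH} };
    for (const Point2& c : corners) {
        const Point2 q = applyHomography(H, c);
        b.xMin = std::min(b.xMin, q.x);  b.yMin = std::min(b.yMin, q.y);
        b.xMax = std::max(b.xMax, q.x);  b.yMax = std::max(b.yMax, q.y);
    }
    return b;
}
```

The base image is then placed with T = translation(-xMin, -yMin) and the target with multiply(T, H), which corresponds to the gemm call above. The panorama size is roughly (xMax - xMin) by (yMax - yMin).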
