I am working on a project aimed at tracking the eye pupil. For this I have built a head-mounted system that captures images of the eye. With the hardware portion complete, I am stuck on the software part. I am using OpenCV. Please let me know what would be the most efficient way to track the pupil; HoughCircles didn't perform well.
After that I also tried an HSV filter; here are the code and
a link to screenshots of the raw image and the processed one. Please help me resolve this issue. The link also contains a video of the eye pupil that I am using in this code.
https://picasaweb.google.com/118169326982637604860/16November2011?authuser=0&authkey=Gv1sRgCPKwwrGTyvX1Aw&feat=directlink
Code:
include "cv.h"
include"highgui.h"
IplImage* GetThresholdedImage(IplImage* img)
{
IplImage *imgHSV=cvCreateImage(cvGetSize(img),8,3);
cvCvtColor(img,imgHSV,CV_BGR2HSV);
IplImage *imgThresh=cvCreateImage(cvGetSize(img),8,1);
cvInRangeS(imgHSV,cvScalar(0, 84, 0, 0),cvScalar(179, 256, 11, 0),imgThresh);
cvReleaseImage(&imgHSV);
return imgThresh;
}
void main(int *argv,char **argc)
{
IplImage *imgScribble= NULL;
char c=0;
CvCapture *capture;
capture=cvCreateFileCapture("main.avi");
if(!capture)
{
printf("Camera could not be initialized");
exit(0);
}
cvNamedWindow("Simple");
cvNamedWindow("Thresholded");
while(c!=32)
{
IplImage *img=0;
img=cvQueryFrame(capture);
if(!img)
break;
if(imgScribble==NULL)
imgScribble=cvCreateImage(cvGetSize(img),8,3);
IplImage *timg=GetThresholdedImage(img);
CvMoments *moments=(CvMoments*)malloc(sizeof(CvMoments));
cvMoments(timg,moments,1);
double moment10 = cvGetSpatialMoment(moments, 1, 0);
double moment01 = cvGetSpatialMoment(moments, 0, 1);
double area = cvGetCentralMoment(moments, 0, 0);
static int posX = 0;
static int posY = 0;
int lastX = posX;
int lastY = posY;
posX = moment10/area;
posY = moment01/area;
// Print it out for debugging purposes
printf("position (%d,%d)\n", posX, posY);
// We want to draw a line only if its a valid position
if(lastX>0 && lastY>0 && posX>0 && posY>0)
{
// Draw a yellow line from the previous point to the current point
cvLine(imgScribble, cvPoint(posX, posY), cvPoint(lastX, lastY), cvScalar(0,255,255), 5);
}
// Add the scribbling image and the frame...
cvAdd(img, imgScribble, img);
cvShowImage("Simple",img);
cvShowImage("Thresholded",timg);
c=cvWaitKey(3);
cvReleaseImage(&timg);
delete moments;
}
//cvReleaseImage(&img);
cvDestroyWindow("Simple");
cvDestroyWindow("Thresholded");
}
I am now able to track the eye and find the center coordinates of the pupil precisely.
First I thresholded the image taken by the head-mounted camera. Then I used a contour-finding algorithm and computed the centroid of the contours. This gives me the center coordinates of the eye pupil. The method works fine in real time and also detects eye blinks with very good accuracy.
Now my aim is to embed this feature into a game (a racing game): if I look left/right the car moves left/right, and if I blink the car slows down. How should I proceed? Would I need a game engine to do that?
I have heard of some open-source game engines compatible with Visual Studio 2010 (Unity, etc.). Is that feasible? If yes, how should I proceed?
I am one of the developers of SimpleCV. We maintain an open-source Python library for computer vision; you can download it at SimpleCV.org. SimpleCV is great for solving these types of problems by hacking around on the command line. I was able to extract the pupil in only a couple of lines of code. Here you go:
img = Image("eye4.jpg") # load the image
bm = BlobMaker() # create the blob extractor
# invert the image so the pupil is white, threshold the image, and invert again
# and then extract the information from the image
blobs = bm.extractFromBinary(img.invert().binarize(thresh=240).invert(),img)
if(len(blobs)>0): # if we got a blob
blobs[0].draw() # the zeroth blob is the largest blob - draw it
locationStr = "("+str(blobs[0].x)+","+str(blobs[0].y)+")"
# write the blob's centroid to the image
img.dl().text(locationStr,(0,0),color=Color.RED)
# save the image
img.save("eye4pupil.png")
# and show us the result.
img.show()
Here are the results.
So your next steps are to use some sort of tracker, like a Kalman filter, to track the pupil robustly. You may want to model the eye as a sphere and track the pupil's centroid in spherical coordinates (i.e. theta and phi). You will also want to write a bit of code to detect blink events so the system doesn't go all wonky when the user blinks. I suggest using a Canny edge detector to find the largest horizontal lines in the image and assuming those are the eyelids. I hope this helps, and please let us know how your work progresses.
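If it helps, here is a minimal sketch of such a tracker using OpenCV's C++ cv::KalmanFilter (plain OpenCV rather than SimpleCV) with a constant-velocity state [x, y, vx, vy]; the noise covariances are assumptions you would tune:

#include <opencv2/opencv.hpp>

// Constant-velocity Kalman filter for the pupil centroid: state [x, y, vx, vy],
// measurement [x, y]. The noise covariances below are placeholder assumptions.
cv::KalmanFilter makePupilFilter()
{
    cv::KalmanFilter kf(4, 2, 0);
    kf.transitionMatrix = (cv::Mat_<float>(4, 4) <<
        1, 0, 1, 0,
        0, 1, 0, 1,
        0, 0, 1, 0,
        0, 0, 0, 1);
    cv::setIdentity(kf.measurementMatrix);
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-4));
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
    cv::setIdentity(kf.errorCovPost, cv::Scalar::all(1));
    return kf;
}

// Call once per frame with the measured blob centroid; returns the smoothed position.
cv::Point2f trackPupil(cv::KalmanFilter& kf, const cv::Point2f& centroid)
{
    kf.predict();                                                   // time update
    cv::Mat measurement = (cv::Mat_<float>(2, 1) << centroid.x, centroid.y);
    cv::Mat estimate = kf.correct(measurement);                     // measurement update
    return cv::Point2f(estimate.at<float>(0), estimate.at<float>(1));
}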
It all depends on how good your system needs to be. If it's a two-month university project, it's fine to find and track some blobs or to use a ready-made solution, as kscottz recommended.
But if you aim to have a more serious system, you must go deeper.
The approach I recommend is to detect facial interest points. A good example is Active Appearance Models, which seem to be the best at tracking faces:
http://www2.imm.dtu.dk/~aam/
and
http://www.youtube.com/watch?v=M1iu__viJN8
It requires a solid understanding of computer vision algorithms, good programming skills, and some work, but the results will be worth the effort.
And do not be fooled by the fact that the demos show whole-face tracking. You can train it to track anything: hands, eyes, flowers or leaves, etc.
(Before starting with AAM, you may want to read more about other face-tracking algorithms. They may be better for you)
This is my solution: I am able to track the eye and find the center coordinates of the pupil precisely.
First I thresholded the image taken by the head-mounted camera. Then I used a contour-finding algorithm and computed the centroid of the contours. This gives me the center coordinates of the eye pupil. The method works fine in real time and also detects eye blinks with very good accuracy.
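For reference, a minimal sketch of this threshold / contour / centroid pipeline in the OpenCV C++ API (the threshold value and the choice of the largest contour are assumptions to adapt to your camera):

#include <opencv2/opencv.hpp>
#include <vector>

// Sketch of the described pipeline: threshold the dark pupil, find contours,
// and take the centroid of the largest one via image moments. The threshold
// of 30 is an assumption and needs tuning for the actual camera.
cv::Point2f findPupilCenter(const cv::Mat& frame)
{
    cv::Mat gray, binary;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::threshold(gray, binary, 30, 255, cv::THRESH_BINARY_INV);     // pupil becomes white

    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(binary, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty())
        return cv::Point2f(-1, -1);                                  // no pupil found (possible blink)

    size_t best = 0;                                                 // largest contour assumed to be the pupil
    for (size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[best]))
            best = i;

    cv::Moments m = cv::moments(contours[best]);
    if (m.m00 == 0)
        return cv::Point2f(-1, -1);
    return cv::Point2f((float)(m.m10 / m.m00), (float)(m.m01 / m.m00));
}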
Context: I have two cameras of very different focus and size that I want to align for image processing. One is RGB, the other near-infrared. The cameras are in a static rig, so they are fixed relative to each other. Because the focus and field of view are so different, it's hard to even get both images to recognize the chessboard at the same time; it pretty much only works when the chessboard is centered in both images with very little skew/tilt.
I need to perform computations on the aligned images, so I need as good a mapping between the optical frames as I can get. Right now the results I'm getting are pretty far off. I'm not sure if I'm using the method itself wrong, or if I am misusing the output. Details and images below.
Computation: I am using OpenCV stereoCalibrate to estimate the rotation and translation matrices with the following code, and throwing out bad results based on final error.
int flag = cv::CALIB_FIX_INTRINSIC;
double err = cv::stereoCalibrate(
    temp_points_object_vec, temp_points_alignvec, temp_points_basevec,
    camera_mat_align, camera_distort_align,
    camera_mat_base, camera_distort_base,
    mat_align.size(), rotate_mat, translate_mat, essential_mat, F, flag,
    cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 30, 1e-6));

if (last_error_ == -1.0 || (err < last_error_ + improve_threshold_)) {
    // -1.0 indicates the first calibration, so accept the points.
    // Otherwise the error must be within the acceptable range.
    points_alignvec_.push_back(addalign);
    points_basevec_.push_back(addbase);
    points_object_vec_.push_back(object_points);
}
The result doesn't produce an OpenCV error as is, and due to the large difference between images, more than half of the matched points are rejected. Results are much better since I added the conditional on the error, but still pretty poor. The error as computed above starts around 30 but doesn't get lower than 15-17; for comparison, I believe a "good" error would be < 1. So for starters, I know the output isn't great, but on top of that, I'm not sure I'm using the output correctly for visual validation. I've attached images showing some of the best and worst results I see. The middle image on the right of each shows the "cross-validated" chessboard keypoints. These are computed like this (note that addalign is the temporary vector containing only the chessboard keypoints from the current image in the frame to be aligned):
for (int i = 0; i < addalign.size(); i++) {
    cv::Point2f validate_pt;// = rotate_mat * addalign.at(i) + translate_mat;
    // Project pixel from image aligned to 3D
    cv::Point3f ray3d = align_camera_model_.projectPixelTo3dRay(addalign.at(i));
    // Rotate and translate
    rotate_mat.convertTo(rotate_mat, CV_32F);
    cv::Mat temp_result = rotate_mat * cv::Mat(ray3d, false);
    cv::Point3f ray_transformed;
    temp_result.copyTo(cv::Mat(ray_transformed, false));
    cv::Mat tmat = cv::Mat(translate_mat, false);
    ray_transformed.x += tmat.at<float>(0);
    ray_transformed.y += tmat.at<float>(1);
    ray_transformed.z += tmat.at<float>(2);
    // Reproject to base image pixel
    cv::Point2f pixel = base_camera_model_.project3dToPixel(ray_transformed);
    corners_validated.push_back(pixel);
}
Here are two images showing sample outputs, including both raw images, both images with "drawChessboard," and a cross-validated image showing the base image with above-computed keypoints translated from the alignment image.
Better result
Worse result
In the computation of corners_validated, I'm not sure I'm using rotate_mat and translate_mat correctly. I'm sure there is probably an OpenCV method that does this more efficiently, but I just did it the way that made sense to me at the time.
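For what it's worth, one alternative (a sketch, assuming the rays from projectPixelTo3dRay can be treated as points at unit depth) is to let cv::projectPoints apply the rotation, translation, base intrinsics and distortion in a single call; the parameter names below are taken from the question:

#include <opencv2/opencv.hpp>
#include <vector>

// Hedged sketch: cv::projectPoints folds the rotation, translation, base
// intrinsics and distortion into one call, which may replace the manual loop
// above. The parameter names mirror the question; the rays returned by
// projectPixelTo3dRay are treated here as 3D points at unit depth.
std::vector<cv::Point2f> crossValidate(const std::vector<cv::Point3f>& rays,
                                       const cv::Mat& rotate_mat,
                                       const cv::Mat& translate_mat,
                                       const cv::Mat& camera_mat_base,
                                       const cv::Mat& camera_distort_base)
{
    cv::Mat rvec;
    cv::Rodrigues(rotate_mat, rvec);                 // 3x3 rotation -> rotation vector
    std::vector<cv::Point2f> pixels;
    cv::projectPoints(rays, rvec, translate_mat,
                      camera_mat_base, camera_distort_base, pixels);
    return pixels;
}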
Also relevant: this is all inside a ROS package, using ROS Noetic on Ubuntu 20.04, which only permits the use of OpenCV 4.2, so I don't have access to some of the newer OpenCV methods.
From pictures of tools on a sheet of paper, I'm asked to find their outline contour in order to vectorize them.
I'm a total beginner in computer-vision-related problems and the only thing I thought of was OpenCV and edge detection.
The result is better than what I imagined, but it's still very unreliable, especially if the source picture isn't "perfect".
I took two photographs of a wrench they gave me.
After playing around with the OpenCV bindings for Node, I get this:
Then, I tried with the lower-quality picture:
That's totally unexploitable.
I can get something a little better by changing the Canny threshold, but that must be automated (given that the picture is relatively correct).
So I've got a few questions:
Am I taking the right approach? Is GrabCut better for this? A combination of GrabCut and Canny edge detection? I still need vertices at the end, but I feel that GrabCut does what I want too.
The borders are rough and have certain errors. I can increase approxPolyDP's multiplier, but not without a loss of precision on the good parts.
Related to the above point, I'm thinking of integrating the Savitzky-Golay algorithm to smooth the outline, instead of polygon simplification with approxPolyDP. Is that a good idea?
Normally, the line of the outer border must form a simple, cuttable shape. Is there a way in OpenCV to prevent that line from doing impossible things, like crossing over itself, or simply to detect the problem? Those configurations are of course impossible, but they happen when the detection fails (as in the second picture).
I'm searching for a way to do automatic Canny threshold calculation, since I must tweak it manually for each image. Do you have a good example of that?
I noticed that converting the image to grayscale before edge detection sometimes deteriorates the result and sometimes improves it. Which should I choose? (Tools can be of any color, by the way!)
Here is the source for my tests:
const cv = require('opencv');

const lowThresh = 90;
const highThresh = 90;
const nIters = 1;

const GRAY  = [120, 120, 120];
const WHITE = [255, 255, 255];

cv.readImage('./files/viv1.jpg', function(err, im) {
  if (err) throw err;

  const width = im.width();
  const height = im.height();
  if (width < 1 || height < 1) throw new Error('Image has no size');

  const out = new cv.Matrix(height, width);
  im.convertGrayscale();
  const im_canny = im.copy();
  im_canny.canny(lowThresh, highThresh);
  im_canny.dilate(nIters);

  const contours = im_canny.findContours();
  let maxArea = 0;
  let biggestContour = 0;

  for (let i = 0; i < contours.size(); i++) {
    const area = contours.area(i);
    if (area > maxArea) {
      maxArea = area;
      biggestContour = i;
    }
    out.drawContour(contours, i, GRAY);
  }

  const arcLength = contours.arcLength(biggestContour, true);
  contours.approxPolyDP(biggestContour, 0.001 * arcLength, true);
  out.drawContour(contours, biggestContour, WHITE, 5);

  out.save('./tmp/out.png');
  console.log('Image saved to ./tmp/out.png');
});
You'll need to add some pre-processing to clean up the image. Because you have large variations in intensity due to shadows, poor lighting, high shine on the tools, etc., you should equalize the image. This will help you get a better response in the regions that are currently poorly lit or have a lot of shine.
Here's an OpenCV tutorial on histogram equalization in C++: http://docs.opencv.org/2.4/doc/tutorials/imgproc/histograms/histogram_equalization/histogram_equalization.html
Hope this helps
EDIT:
You can have an automatic threshold based on some sort of loss function. For example: if you know that the tool will be completely captured in the frame, you know that you should get a high value at every column from x = 10 to x = 800 (say). You could then keep reducing the threshold until you get a high value at every column in that range. This is a very naive way of doing it, but it's an interesting experiment, I think, since you are generating the images yourself and have control over object placement.
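A rough C++ sketch of that naive idea (it uses the plain OpenCV API rather than the Node bindings; the column range 10..800 and the threshold steps are assumptions):

#include <opencv2/opencv.hpp>
#include <algorithm>

// Keep lowering the Canny threshold until every column the tool is known to
// span contains at least one edge pixel. Range and step values are assumptions.
int autoCannyThreshold(const cv::Mat& gray, int xStart = 10, int xEnd = 800)
{
    for (int thresh = 200; thresh >= 10; thresh -= 10) {
        cv::Mat edges;
        cv::Canny(gray, edges, thresh, thresh * 2);

        bool everyColumnHit = true;
        int lastCol = std::min(xEnd, edges.cols);
        for (int x = xStart; x < lastCol; ++x) {
            if (cv::countNonZero(edges.col(x)) == 0) { everyColumnHit = false; break; }
        }
        if (everyColumnHit)
            return thresh;   // highest threshold that still responds across the tool
    }
    return 10;               // fall back to the lowest threshold tried
}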
You might also try running your images through an adaptive threshold first. This type of binarization is fairly adept at segmenting foreground and background in cases like this, even with inconsistent lighting/shadows (which seems to be the issue in your second example above). Adaptive thresholding will require some parameter fine-tuning, but once the entire tool is segmented from the background, Canny edge detection should produce more consistent results.
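As a sketch of that suggestion, again in C++ rather than the Node bindings (the block size and constant C are assumptions to tune):

#include <opencv2/opencv.hpp>

// Adaptive thresholding segments the tool from the paper despite uneven lighting,
// after which Canny tends to behave more consistently.
cv::Mat segmentAndDetectEdges(const cv::Mat& bgr)
{
    cv::Mat gray, mask, edges;
    cv::cvtColor(bgr, gray, cv::COLOR_BGR2GRAY);
    cv::adaptiveThreshold(gray, mask, 255,
                          cv::ADAPTIVE_THRESH_GAUSSIAN_C,
                          cv::THRESH_BINARY_INV,
                          51,   // block size: large enough to span the tool's width (assumption)
                          10);  // constant subtracted from the local mean (assumption)
    cv::medianBlur(mask, mask, 5);          // knock out salt-and-pepper noise
    cv::Canny(mask, edges, 50, 150);        // edges of the segmented silhouette
    return edges;
}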
As for the roughness in your contours, you could try setting your findContours mode to one of the CV_CHAIN_APPROX methods described here.
I am working on a project to detect objects of interest using background subtraction and track them using optical flow in OpenCV C++. I was able to detect the object of interest using background subtraction, and I was able to implement OpenCV's Lucas-Kanade optical flow in a separate program. But I am stuck on how to combine these two programs into one. frame1 holds the actual frame from the video; contours2 contains the selected contours of the foreground object.
To summarize, how do I feed the foreground object obtained from the background subtraction method to calcOpticalFlowPyrLK? Or correct me if my approach is wrong. Thank you in advance.
Mat mask = Mat::zeros(fore.rows, fore.cols, CV_8UC1);
drawContours(mask, contours2, -1, Scalar(255), 4, CV_FILLED);
if (first_frame)
{
    goodFeaturesToTrack(mask, features_next, 1000, 0.01, 10, noArray(), 3, false, 0.04);
    fm0 = mask.clone();
    features_prev = features_next;
    first_frame = false;
}
else
{
    features_next.clear();
    if (!features_prev.empty())
    {
        calcOpticalFlowPyrLK(fm0, mask, features_prev, features_next, featuresFound, err, winSize, 3, termcrit, 0, 0.001);
        for (int i = 0; i < features_prev.size(); i++)
            line(frame1, features_prev[i], features_next[i], CV_RGB(0, 0, 255), 1, 8);
        imshow("final optical", frame1);
        waitKey(1);
    }
    goodFeaturesToTrack(mask, features_next, 1000, 0.01, 10, noArray(), 3, false, 0.04);
    features_prev = features_next;
    fm0 = mask.clone();
}
Your approach of using optical flow for tracking is wrong. The idea behind the optical flow approach is that a moving point in two consecutive images has the same pixel intensity at its start and end positions. That means the motion of a feature is estimated by observing its appearance in the start image and searching for that structure in the end image (very simplified).
calcOpticalFlowPyrLK is a point tracker, which means points in the previous image are tracked into the current one. Therefore the method needs the original gray-valued images from your system, because it can only estimate motion in structured/textured regions (you need x and y gradients in your image).
I think your code should do something like this:
Extract objects by background subtraction (by contour); in the literature this is called a blob.
Extract objects in the next image and apply a blob association (which contour belongs to which); this is also called blob tracking.
It is possible to do blob tracking with calcOpticalFlowPyrLK (see the sketch below), e.g. in a very simple way:
Track points from the contour, or points inside the blob.
Association: the previous contour corresponds to a current one if the tracked points that belong to the previous contour land on the current contour.
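A minimal sketch of that idea (one way of doing it, not the only one): pick good features inside the blob mask but track them on the original grayscale frames, then use where the tracked points land to associate blobs between frames.

#include <opencv2/opencv.hpp>
#include <vector>

// Pick features inside the blob mask, track them on the grayscale frames
// (not on the mask), and keep only the successfully tracked points.
void trackBlobPoints(const cv::Mat& prevGray, const cv::Mat& currGray,
                     const cv::Mat& blobMask,           // 8-bit mask from background subtraction
                     std::vector<cv::Point2f>& prevPts,
                     std::vector<cv::Point2f>& currPts)
{
    if (prevPts.empty())
        cv::goodFeaturesToTrack(prevGray, prevPts, 500, 0.01, 10, blobMask);

    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, currGray, prevPts, currPts, status, err,
                             cv::Size(21, 21), 3);

    // keep only the points that were tracked successfully
    std::vector<cv::Point2f> keptPrev, keptCurr;
    for (size_t i = 0; i < status.size(); ++i)
        if (status[i]) { keptPrev.push_back(prevPts[i]); keptCurr.push_back(currPts[i]); }
    prevPts = keptPrev;
    currPts = keptCurr;
}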
I think the output of background subtraction in OpenCV is not a grayscale image. For optical flow input we need grayscale images.
I'm trying to detect objects that are similar to circles using OpenCV's HoughCircles. The problem is: HoughCircles fails to detect such objects in some cases.
Does anyone know an alternative way to detect objects similar to circles like these?
Update
Update
Hello folks, I'm adding a GIF of the result of my detection method.
It's easier to use a GIF to explain the problem. The undesired effect that I want to remove is the variation in circle size. Even for a static shape like the one on the right, the result on the left is imprecise. Does anyone know a solution for that?
Update
All that I need from this object is its diameter. I've done it using findContours, but now I can't use findContours because it is too slow when combining OpenCV with OpenMP. Does anyone know a fast alternative to findContours?
Update
This is the code that I'm using to detect these shapes:
for (int j = 0; j <= NUM_THREADS - 1; j++)
{
    capture >> frame[j];
}

#pragma omp parallel shared(frame, processOutput, circles, diameterArray, diameter)
{
    int n = omp_get_thread_num();
    cvtColor(frame[n], processOutput[n], CV_BGR2GRAY);
    GaussianBlur(processOutput[n], processOutput[n], Size(9, 9), 2, 2);
    threshold(processOutput[n], processOutput[n], 21, 250, CV_THRESH_BINARY);
    dilate(processOutput[n], processOutput[n], Mat(), Point(-1, -1), 2, 1, 1);
    erode(processOutput[n], processOutput[n], Mat(), Point(-1, -1), 2, 1, 1);
    Canny(processOutput[n], processOutput[n], 20, 20 * 2, 3);
    HoughCircles(processOutput[n], circles[n], CV_HOUGH_GRADIENT, 1, frame[n].rows / 8, 100, 21, 50, 100);
}

#pragma omp parallel private(m, n) shared(circles)
{
    #pragma omp for
    for (n = 0; n <= NUM_THREADS - 1; n++)
    {
        for (m = 0; m < circles[n].size(); m++)
        {
            // HoughCircles returns (x, y, radius) for each circle
            Point center(cvRound(circles[n][m][0]), cvRound(circles[n][m][1]));
            int radius = cvRound(circles[n][m][2]);
            diameter = 2 * radius;
            diameterArray[n] = diameter;
            circle(frame[0], center, 3, Scalar(0, 255, 0), -1, 8, 0);
            circle(frame[0], center, radius, Scalar(0, 0, 255), 3, 8, 0);
        }
    }
}
Edited based on new description and additional performance and accuracy requirements.
This is getting beyond the scope of an "OpenCV sample project", and getting into the realm of actual application development. Both performance and accuracy become requirements.
This requires a combination of techniques. So, don't just pick one approach. You will have to try all combinations of approaches, as well as fine-tune the parameters to find an acceptable combination.
#1. overall approach for continuous video frame recognition tasks
Use a slow but accurate method to acquire an initial detection result.
Once a positive detection is found on one frame, the next frame should switch to a fast local search algorithm using the position detected on the most recent frame.
As a reminder, don't forget to update the "most recent position" for use by the next frame.
#2. suggestion for initial object acquisition.
Stay with your current approach, and incorporate the suggestions.
You can still fine-tune the balance between speed and precision, because a correct but imprecise result (off by tens of pixels) will be updated and refined when the next frame is processed with the local search approach.
Try my suggestion of increasing the dp parameter.
A large value of dp reduces the resolution at which Hough Gradient Transform is performed. This reduces the precision of the center coordinates, but will improve the chance of detecting a dented circle because the dent will become less significant when the transform is performed at a lower resolution.
An added benefit is that reduced resolution should run faster.
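For example, a sketch of the same call with dp = 2 (the remaining parameters are copied from the question's HoughCircles call):

#include <opencv2/opencv.hpp>
#include <vector>

// dp = 2 runs the Hough accumulator at half the image resolution, which
// tolerates dented circles better and is faster.
std::vector<cv::Vec3f> detectCircles(const cv::Mat& preprocessed)
{
    std::vector<cv::Vec3f> circles;
    cv::HoughCircles(preprocessed, circles, CV_HOUGH_GRADIENT,
                     2,                          // dp: coarser accumulator
                     preprocessed.rows / 8,      // min distance between centers
                     100, 21, 50, 100);          // Canny high thresh, accumulator thresh, min/max radius
    return circles;
}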
#3. suggestion for fast local search around a previously detected position
Because of the limited search space and amount of data needed, it is possible to make local search both fast and precise.
For tracking the movement of the iris boundary through video frames, I suggest using a family of algorithms called Snakes (active contour models).
The focus is on tracking the movement of edges through intensity profiles. There are many algorithms that implement the Snakes model. Unfortunately, most implementations are tailored to very complex shape recognition, which would be overkill and too slow for your project.
Basic idea: (assuming that the previous result is a curve)
Choose some sampling points on the curve.
Scan the edge profile (perpendicular to the curve) at each sampling point in the new frame, starting from the position in the old frame. Look for the sharpest change.
Remember the new edge position for this sampling point.
After all of the sampling points have been updated, create a new curve by joining all of the updated sampling point positions.
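A very simplified sketch of that profile scan, assuming the previous result is a circle with a known center and radius (an illustration only, not a full Snakes implementation):

#include <opencv2/opencv.hpp>
#include <vector>
#include <cmath>

// For each sampling point on the previous frame's circle, scan along the
// outward radial direction in the new frame and snap the point to the
// strongest intensity change. All parameters are assumptions to tune.
std::vector<cv::Point2f> refineCircle(const cv::Mat& gray, cv::Point2f center,
                                      float radius, int nSamples = 32, int searchRange = 10)
{
    std::vector<cv::Point2f> updated;
    cv::Rect bounds(0, 0, gray.cols, gray.rows);
    for (int i = 0; i < nSamples; ++i)
    {
        double angle = 2.0 * CV_PI * i / nSamples;
        cv::Point2f dir((float)std::cos(angle), (float)std::sin(angle));  // outward radial direction

        int bestOffset = 0;
        float bestChange = 0.0f;
        for (int d = -searchRange; d < searchRange; ++d)
        {
            cv::Point p0 = center + (radius + d) * dir;       // rounded to pixel coordinates
            cv::Point p1 = center + (radius + d + 1) * dir;
            if (!bounds.contains(p0) || !bounds.contains(p1))
                continue;
            float change = std::abs((float)gray.at<uchar>(p1) - (float)gray.at<uchar>(p0));
            if (change > bestChange) { bestChange = change; bestOffset = d; }
        }
        updated.push_back(center + (radius + bestOffset) * dir);  // join these into the new curve
    }
    return updated;
}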
There are many varieties and different levels of sophistication of implementations, which you can find on the Internet. Unfortunately, it has been reported that the one packaged with OpenCV might not work very well. You may have to try different open-source implementations, and ultimately you may have to implement one that is simple but well tuned to your project's needs.
#4. Advice for your speed optimization attempts.
Use a software performance profiler.
Add some timing and logging code around each call to OpenCV function to print out the time spent on each step. You will be surprised. The reason is that some OpenCV functions are more heavily vectorized and parallelized than others, perhaps as a result of the labor of love.
Unfortunately, for the slowest step - initial object acquisition, there is not much you can parallelize (by multithread).
This is perhaps already obvious to you since you did not put #pragma omp for around the first block of code. (It would not help anyway.)
Vectorization (SIMD) would only benefit pixel-level processing. If OpenCV implements it, great; if not, there is not much you can do.
My guess is that cvtColor, GaussianBlur, threshold, dilate, erode could have been vectorized, but the others might not be.
Give the following a try:
Find the contour in the source image.
Find the minimum enclosing circle for the contour.
Now draw the contour to a new Mat, filled (CV_FILLED).
Similarly, draw the enclosing circle to another new Mat, filled.
Perform an XOR operation between the two and count the non-zero pixels.
You can decide whether the contour is close to a circle by comparing the non-zero pixel count between the contour and the enclosing circle against a threshold. You can choose the threshold by calculating the area of the enclosing circle and taking a percentage of it.
The idea is simple: the area between the contour and its enclosing circle decreases as the contour gets closer to a circle.
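A sketch of this test in C++ (the 20% threshold is an assumption you would tune):

#include <opencv2/opencv.hpp>
#include <vector>

// Draw the filled contour and its filled minimum enclosing circle on separate
// masks, XOR them, and compare the leftover area to a fraction of the circle area.
bool isRoughlyCircular(const std::vector<cv::Point>& contour, cv::Size imgSize)
{
    cv::Point2f center;
    float radius = 0.f;
    cv::minEnclosingCircle(contour, center, radius);

    cv::Mat contourMask = cv::Mat::zeros(imgSize, CV_8UC1);
    cv::Mat circleMask  = cv::Mat::zeros(imgSize, CV_8UC1);
    std::vector<std::vector<cv::Point> > wrapper(1, contour);
    cv::drawContours(contourMask, wrapper, 0, cv::Scalar(255), -1);   // -1 = filled
    cv::circle(circleMask, center, cvRound(radius), cv::Scalar(255), -1);

    cv::Mat diff;
    cv::bitwise_xor(contourMask, circleMask, diff);

    double circleArea = CV_PI * radius * radius;
    return cv::countNonZero(diff) < 0.2 * circleArea;   // smaller difference => closer to a circle
}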
I've got a problem with precise detection of markers using OpenCV.
I've recorded a video presenting the issue: http://youtu.be/IeSSW4MdyfU
As you can see, the markers I'm detecting are slightly shifted at some camera angles. I've read on the web that this may be a camera calibration problem, so I'll tell you how I'm calibrating the camera, and maybe you can tell me what I am doing wrong.
At the beginning I'm collecting data from various images and storing the calibration corners in the _imagePoints vector like this:
std::vector<cv::Point2f> corners;
_imageSize = cvSize(image->size().width, image->size().height);
bool found = cv::findChessboardCorners(*image, _patternSize, corners);
if (found) {
    cv::Mat *gray_image = new cv::Mat(image->size().height, image->size().width, CV_8UC1);
    cv::cvtColor(*image, *gray_image, CV_RGB2GRAY);
    cv::cornerSubPix(*gray_image, corners, cvSize(11, 11), cvSize(-1, -1),
                     cvTermCriteria(CV_TERMCRIT_EPS + CV_TERMCRIT_ITER, 30, 0.1));
    cv::drawChessboardCorners(*image, _patternSize, corners, found);
}
_imagePoints->push_back(corners);
Then, after collecting enough data, I'm calculating the camera matrix and coefficients with this code:
std::vector< std::vector<cv::Point3f> > *objectPoints = new std::vector< std::vector<cv::Point3f> >();
for (unsigned long i = 0; i < _imagePoints->size(); i++) {
    std::vector<cv::Point2f> currentImagePoints = _imagePoints->at(i);
    std::vector<cv::Point3f> currentObjectPoints;
    for (int j = 0; j < currentImagePoints.size(); j++) {
        cv::Point3f newPoint = cv::Point3f(j % _patternSize.width, j / _patternSize.width, 0);
        currentObjectPoints.push_back(newPoint);
    }
    objectPoints->push_back(currentObjectPoints);
}

std::vector<cv::Mat> rvecs, tvecs;
static CGSize size = CGSizeMake(_imageSize.width, _imageSize.height);
cv::Mat cameraMatrix = [_userDefaultsManager cameraMatrixwithCurrentResolution:size]; // previously detected matrix
cv::Mat coeffs = _userDefaultsManager.distCoeffs; // previously detected coeffs
cv::calibrateCamera(*objectPoints, *_imagePoints, _imageSize, cameraMatrix, coeffs, rvecs, tvecs);
The results are as you've seen in the video.
What am I doing wrong? Is it an issue in the code? How many images should I use to perform the calibration (right now I'm trying to obtain 20-30 images before ending the calibration)?
Should I use images that contain wrongly detected chessboard corners, like this:
or should I use only properly detected chessboards like these:
I've been experimenting with a circles grid instead of a chessboard, but the results were much worse than they are now.
In case you're wondering how I'm detecting the marker: I'm using the solvePnP function:
solvePnP(modelPoints, imagePoints, [_arEngine currentCameraMatrix], _userDefaultsManager.distCoeffs, rvec, tvec);
with modelPoints specified like this:
markerPoints3D.push_back(cv::Point3d(-kMarkerRealSize / 2.0f, -kMarkerRealSize / 2.0f, 0));
markerPoints3D.push_back(cv::Point3d(kMarkerRealSize / 2.0f, -kMarkerRealSize / 2.0f, 0));
markerPoints3D.push_back(cv::Point3d(kMarkerRealSize / 2.0f, kMarkerRealSize / 2.0f, 0));
markerPoints3D.push_back(cv::Point3d(-kMarkerRealSize / 2.0f, kMarkerRealSize / 2.0f, 0));
and imagePoints are the coordinates of the marker corners in the image being processed (I'm using a custom algorithm to find them).
In order to properly debug your problem I would need all the code :-)
I assume you are following the approach suggested in the tutorials (calibration and pose) cited by @kobejohn in his comment, so that your code follows these steps:
collect various images of chessboard target
find chessboard corners in images of point 1)
calibrate the camera (with cv::calibrateCamera) and so obtain as a result the intrinsic camera parameters (let's call them intrinsic) and the lens distortion parameters (let's call them distortion)
collect an image of your own custom target (the target is seen at 0:57 in your video and shown in the following figure) and find some relevant points in it (let's call the points you found in the image image_custom_target_vertices and the corresponding 3D points world_custom_target_vertices).
estimate the rotation matrix (let's call it R) and the translation vector (let's call it t) of the camera from the image of your own custom target obtained in point 4), with a call to cv::solvePnP like this one: cv::solvePnP(world_custom_target_vertices, image_custom_target_vertices, intrinsic, distortion, R, t)
given the 8 corners of the cube in 3D (let's call them world_cube_vertices), you get the 8 2D image points (let's call them image_cube_vertices) by means of a call to cv::projectPoints like this one: cv::projectPoints(world_cube_vertices, R, t, intrinsic, distortion, image_cube_vertices)
draw the cube with your own draw function.
Now, the final result of the draw procedure depends on all the previous computed data and we have to find where the problem lies:
Calibration: as you observed in your answer, in 3) you should discard the images where the corners are not properly detected. You need a threshold for the reprojection error in order to discard "bad" chessboard target images. Quoting from the calibration tutorial:
Re-projection Error
Re-projection error gives a good estimation of just how exact the found parameters are. This should be as close to zero as possible. Given the intrinsic, distortion, rotation and translation matrices, we first transform the object point to an image point using cv2.projectPoints(). Then we calculate the absolute norm between what we got with our transformation and the corner finding algorithm. To find the average error we calculate the arithmetical mean of the errors calculated for all the calibration images.
Usually you will find a suitable threshold with some experiments. With this extra step you will get better values for intrinsic and distortion.
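A hedged sketch of that extra step: compute the per-image reprojection error with cv::projectPoints and drop the images whose error exceeds your threshold before recalibrating.

#include <opencv2/opencv.hpp>
#include <vector>
#include <cmath>

// Per-image RMS reprojection error in pixels, following the tutorial quoted above.
// rvec/tvec are the per-image outputs of cv::calibrateCamera.
double perImageReprojError(const std::vector<cv::Point3f>& objectPts,
                           const std::vector<cv::Point2f>& imagePts,
                           const cv::Mat& rvec, const cv::Mat& tvec,
                           const cv::Mat& intrinsic, const cv::Mat& distortion)
{
    std::vector<cv::Point2f> projected;
    cv::projectPoints(objectPts, rvec, tvec, intrinsic, distortion, projected);
    return cv::norm(imagePts, projected, cv::NORM_L2) / std::sqrt((double)projected.size());
}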
Finding your own custom target: it does not seem to me that you explain how you find your own custom target in the step I labeled as point 4). Do you get the expected image_custom_target_vertices? Do you discard images where the results are "bad"?
Pose of the camera: I think that in 5) you use the intrinsic found in 3); are you sure nothing changed in the camera in the meantime? Referring to Callari's Second Rule of Camera Calibration:
Second Rule of Camera Calibration: "Thou shalt not touch the lens after calibration". In particular, you may not refocus nor change the f-stop, because both focusing and iris affect the nonlinear lens distortion and (albeit less so, depending on the lens) the field of view. Of course, you are completely free to change the exposure time, as it does not affect the lens geometry at all.
And then there may be some problems in the draw function.
So, I've experimented a lot with my code, and I still haven't fixed the main issue (shifted objects), but I've managed to answer some of the calibration questions I asked.
First of all: in order to obtain good calibration results you have to use images with properly detected grid elements/circle positions! Using all captured images in the calibration process (even those where the pattern isn't properly detected) will result in a bad calibration.
I've experimented with various calibration patterns:
The asymmetric circles pattern (CALIB_CB_ASYMMETRIC_GRID) gave much worse results than any other pattern. By worse results I mean that it produced a lot of wrongly detected corners, like these:
I've experimented with CALIB_CB_CLUSTERING and it hasn't helped much - in some cases (different lighting environments) it got better, but not by much.
The symmetric circles pattern (CALIB_CB_SYMMETRIC_GRID) gave better results than the asymmetric grid, but still much worse than the standard grid (chessboard). It often produced errors like these:
The chessboard (found using the findChessboardCorners function) produced the best results - it doesn't produce misaligned corners very often, and almost every calibration gives results similar to the best possible results from the symmetric circles grid.
For every calibration I've been using 20-30 images taken from different angles. I've even tried with 100+ images, but it didn't produce a noticeable change in the calibration results compared with a smaller number of images. It's worth noting that a larger number of test images increases the time needed to compute the camera parameters in a non-linear way (100 test images at 480x360 resolution take 25 minutes to compute on an iPad 4, compared with 4 minutes for ~50 images).
I've also experimented with the solvePnP parameters, but that also hasn't given me any acceptable results: I've tried all 3 detection methods (ITERATIVE, EPNP and P3P), but I haven't seen any noticeable change.
I've also tried with useExtrinsicGuess set to true, using the rvec and tvec from the previous detection, but this resulted in the complete disappearance of the detected cube.
I've run out of ideas - what else could be affecting these shifting problems?
For those still interested:
This is an old question, but I think your problem is not bad calibration.
I developed an AR app for iOS, using OpenCV and SceneKit, and I have had your same issue.
I think your problem is the wrong render position of the cube:
OpenCV's solvePnP returns the X, Y, Z coordinates of the marker center, but you want to render the cube over the marker, at a specific distance along the marker's Z axis - exactly half of the cube's side length. So you need to offset the Z coordinate of the marker's translation vector by this distance.
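A minimal sketch of that offset (variable names are illustrative; the sign depends on which way your marker's Z axis points):

#include <opencv2/opencv.hpp>

// Shift the translation returned by solvePnP half a cube side along the marker's
// own Z axis, so the cube sits on the marker instead of being centered on it.
// cubeSide and the sign of the offset are assumptions to adapt to your setup.
cv::Mat cubeTranslation(const cv::Mat& rvec, const cv::Mat& tvec, double cubeSide)
{
    cv::Mat R;
    cv::Rodrigues(rvec, R);                                              // 3x3 rotation (CV_64F)
    cv::Mat offset = (cv::Mat_<double>(3, 1) << 0.0, 0.0, cubeSide / 2.0);
    return tvec + R * offset;                                            // offset expressed in the camera frame
}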
In fact, when you look at your cube from the top, the cube is rendered properly.
I have made an image to explain the problem, but my reputation prevents me from posting it.