SceneKit Metal Depth Buffer - ios

I'm attempting to write an augmented reality app using SceneKit, and I need accurate 3D points from the current rendered frame, given a 2D pixel and depth using SCNSceneRenderer's unprojectPoint method. This requires an x, y, and z where the x and y is a pixel coordinate and normally the z is a value read from the depth buffer at that frame.
The SCNView's delegate has this method to render the depth frame:
func renderer(_ renderer: SCNSceneRenderer, willRenderScene scene: SCNScene, atTime time: TimeInterval) {
renderDepthFrame()
}
func renderDepthFrame(){
// setup our viewport
let viewport: CGRect = CGRect(x: 0, y: 0, width: Double(SettingsModel.model.width), height: Double(SettingsModel.model.height))
// depth pass descriptor
let renderPassDescriptor = MTLRenderPassDescriptor()
let depthDescriptor: MTLTextureDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: MTLPixelFormat.depth32Float, width: Int(SettingsModel.model.width), height: Int(SettingsModel.model.height), mipmapped: false)
let depthTex = scnView!.device!.makeTexture(descriptor: depthDescriptor)
depthTex.label = "Depth Texture"
renderPassDescriptor.depthAttachment.texture = depthTex
renderPassDescriptor.depthAttachment.loadAction = .clear
renderPassDescriptor.depthAttachment.clearDepth = 1.0
renderPassDescriptor.depthAttachment.storeAction = .store
let commandBuffer = commandQueue.makeCommandBuffer()
scnRenderer.scene = scene
scnRenderer.pointOfView = scnView.pointOfView!
scnRenderer!.render(atTime: 0, viewport: viewport, commandBuffer: commandBuffer, passDescriptor: renderPassDescriptor)
// setup our depth buffer so the cpu can access it
let depthImageBuffer: MTLBuffer = scnView!.device!.makeBuffer(length: depthTex.width * depthTex.height*4, options: .storageModeShared)
depthImageBuffer.label = "Depth Buffer"
let blitCommandEncoder: MTLBlitCommandEncoder = commandBuffer.makeBlitCommandEncoder()
blitCommandEncoder.copy(from: renderPassDescriptor.depthAttachment.texture!, sourceSlice: 0, sourceLevel: 0, sourceOrigin: MTLOriginMake(0, 0, 0), sourceSize: MTLSizeMake(Int(SettingsModel.model.width), Int(SettingsModel.model.height), 1), to: depthImageBuffer, destinationOffset: 0, destinationBytesPerRow: 4*Int(SettingsModel.model.width), destinationBytesPerImage: 4*Int(SettingsModel.model.width)*Int(SettingsModel.model.height))
blitCommandEncoder.endEncoding()
commandBuffer.addCompletedHandler({(buffer) -> Void in
let rawPointer: UnsafeMutableRawPointer = UnsafeMutableRawPointer(mutating: depthImageBuffer.contents())
let typedPointer: UnsafeMutablePointer<Float> = rawPointer.assumingMemoryBound(to: Float.self)
self.currentMap = Array(UnsafeBufferPointer(start: typedPointer, count: Int(SettingsModel.model.width)*Int(SettingsModel.model.height)))
})
commandBuffer.commit()
}
This works. I get depth values between 0 and 1. The problem is that I can't use them in the unprojectPoint because they don't appear to be scaled the same as the initial pass, despite using the same SCNScene and SCNCamera.
My questions:
Is there any way to get the depth values directly from SceneKit SCNView's main pass without having to do an extra pass with a separate SCNRenderer?
Why don't the depth values in my pass match the values I get from doing a hit test and then unprojecting? The depth values from my pass go from 0.78 to 0.94. The depth values in the hit test range from 0.89 to 0.97, which curiously enough, matches the OpenGL depth values of the scene when I rendered it in Python.
My hunch is this is a difference in viewports and SceneKit is doing something to scale the depth values from -1 to 1 just like OpenGL.
EDIT: And in case you're wondering, I can't use the hitTest method directly. It's too slow for what I'm trying to achieve.

SceneKit uses a log scale reverse Z-Buffer by default. You can disable the reverse Z-Buffer quite easily (scnView.usesReverseZ = false) but taking the log depth to [0, 1] range with linear distribution requires access to the depth buffer, the value of the far clipping range and the value of the near clipping range. Here is the process of taking a non-reverse-z-log-depth to a linearly distributed depth in the range of [0, 1]:
float delogDepth(float depth, float nearClip, float farClip) {
// The depth buffer is in Log Format. Probably a 24bit float depth with 8 for stencil.
// https://outerra.blogspot.com/2012/11/maximizing-depth-buffer-range-and.html
// We need to undo the log format.
// https://stackoverflow.com/questions/18182139/logarithmic-depth-buffer-linearization
float logTuneConstant = nearClip / farClip;
float deloggedDepth = ((pow(logTuneConstant * farClip + 1.0, depth) - 1.0) / logTuneConstant) / farClip;
// The values are going to hover around a particular range. Linearize that distribution.
// This part may not be necessary, depending on how you will use the depth.
// http://glampert.com/2014/01-26/visualizing-the-depth-buffer/
float negativeOneOneDepth = deloggedDepth * 2.0 - 1.0;
float zeroOneDepth = ((2.0 * nearClip) / (farClip + nearClip - negativeOneOneDepth * (farClip - nearClip)));
return zeroOneDepth;
}

As a workaround, I switched to OpenGL ES and read the depth buffer by adding a fragment shader that packs the depth value into the RGBA renderbuffer SCNShadable.
See here for more info: http://concord-consortium.github.io/lab/experiments/webgl-gpgpu/webgl.html
I understand this is a valid approach as it is used in shadow mapping quite often on OpenGL ES devices and WebGL, but this feels hacky to me and I shouldn't have to do this. I would still be interested in another answer if someone can figure out Metal's viewport transformation.

Related

Shadow Mapping - Space Transformations are going bad

I am currently studying shadow mapping, and my biggest issue right now is the transformations between spaces. This is my current working theory/steps.
Pass 1:
Get depth of pixel from camera, store in depth buffer
Get depth of pixel from light, store in another buffer
Pass 2:
Use texture coordinate to sample camera's depth buffer at current pixel
Convert that depth to a view space position by multiplying the projection coordinate with invProj matrix. (also do a perspective divide).
Take that view position and multiply by invV (camera's inverse view) to get a world space position
Multiply world space position by light's viewProjection matrix.
Perspective divide that projection-space coordinate, and manipulate into [0..1] to sample from light depth buffer.
Get current depth from light and closest (sampled) depth, if current depth > closest depth, it's in shadow.
Shader Code
Pass1:
PS_INPUT vs(VS_INPUT input) {
output.pos = mul(input.vPos, mvp);
output.cameraDepth = output.pos.zw;
..
float4 vPosInLight = mul(input.vPos, m);
vPosInLight = mul(vPosInLight, light.viewProj);
output.lightDepth = vPosInLight.zw;
}
PS_OUTPUT ps(PS_INPUT input){
float cameraDepth = input.cameraDepth.x / input.cameraDepth.y;
//Bundle cameraDepth in alpha channel of a normal map.
output.normal = float4(input.normal, cameraDepth);
//4 Lights in total -- although only 1 is active right now. Going to use r/g/b/a for each light depth.
output.lightDepths.r = input.lightDepth.x / input.lightDepth.y;
}
Pass 2 (Screen Quad):
float4 ps(PS_INPUT input) : SV_TARGET{
float4 pixelPosView = depthToViewSpace(input.texCoord);
..
float4 pixelPosWorld = mul(pixelPosView, invV);
float4 pixelPosLight = mul(pixelPosWorld, light.viewProj);
float shadow = shadowCalc(pixelPosLight);
//For testing / visualisation
return float4(shadow,shadow,shadow,1);
}
float4 depthToViewSpace(float2 xy) {
//Get pixel depth from camera by sampling current texcoord.
//Extract the alpha channel as this holds the depth value.
//Then, transform from [0..1] to [-1..1]
float z = (_normal.Sample(_sampler, xy).a) * 2 - 1;
float x = xy.x * 2 - 1;
float y = (1 - xy.y) * 2 - 1;
float4 vProjPos = float4(x, y, z, 1.0f);
float4 vPositionVS = mul(vProjPos, invP);
vPositionVS = float4(vPositionVS.xyz / vPositionVS.w,1);
return vPositionVS;
}
float shadowCalc(float4 pixelPosL) {
//Transform pixelPosLight from [-1..1] to [0..1]
float3 projCoords = (pixelPosL.xyz / pixelPosL.w) * 0.5 + 0.5;
float closestDepth = _lightDepths.Sample(_sampler, projCoords.xy).r;
float currentDepth = projCoords.z;
return currentDepth > closestDepth; //Supposed to have bias, but for now I just want shadows working haha
}
CPP Matrices
// (Position, LookAtPos, UpDir)
auto lightView = XMMatrixLookAtLH(XMLoadFloat4(&pos4), XMVectorSet(0,0,0,1), XMVectorSet(0,1,0,0));
// (FOV, AspectRatio (1000/680), NEAR, FAR)
auto lightProj = XMMatrixPerspectiveFovLH(1.57f , 1.47f, 0.01f, 10.0f);
XMStoreFloat4x4(&_cLightBuffer.light.viewProj, XMMatrixTranspose(XMMatrixMultiply(lightView, lightProj)));
Current Outputs
White signifies that a shadow should be projected there. Black indicates no shadow.
CameraPos (0, 2.5, -2)
CameraLookAt (0, 0, 0)
CameraFOV (1.57)
CameraNear (0.01)
CameraFar (10.0)
LightPos (0, 2.5, -2)
LightLookAt (0, 0, 0)
LightFOV (1.57)
LightNear (0.01)
LightFar (10.0)
If I change the CameraPosition to be (0, 2.5, 2), basically just flipped on the Z axis, this is the result.
Obviously a shadow shouldn't change its projection depending on where the observer is, so I think I'm making a mistake with the invV. But I really don't know for sure. I've debugged the light's projView matrix, and the values seem correct - going from CPU to GPU. It's also entirely possible I've misunderstood some theory along the way because this is quite a tricky technique for me.
Aha! Found my problem. It was a silly mistake, I was calculating the depth of pixels from each light, but storing them in a texture that was based on the view of the camera. The following image should explain my mistake better than I can with words.
For future reference, the solution I decided was to scrap my idea for storing light depths in texture channels. Instead, I basically make a new pass for each light, and bind a unique depth-stencil texture to render the geometry to. When I want to do light calculations, I bind each of the depth textures to a shader resource slot and go from there. Obviously this doesn't scale well with many lights, but for my student project where I'm only required to have 2 shadow casters, it suffices.
_context->DrawIndexed(indexCount, 0, 0); //Draw to regular render target
_sunlight->use(1, _context); //Use sunlight shader (basically just runs a Vertex Shader & Null Pixel shader so depth can be written to depth map)
_sunlight->bindDSVSetNullRenderTarget(_context);
_context->DrawIndexed(indexCount, 0, 0); //Draw to sunlight depth target
bindDSVSetNullRenderTarget(ctx){
ID3D11RenderTargetView* nullrv = { nullptr };
ctx->OMSetRenderTargets(1, &nullrv, _sunlightDepthStencilView);
}
//The purpose of setting a null render target before doing the draw call is
//that a draw call with only a depth target bound is much faster.
//(At least I believe so, from my reading online)

How to draw specific part of texture in fragment shader (SpriteKit)

I have a node size of 64x32 and texture size of 192x192 and I am trying to draw the first part of this texture at the first node, the second part at the second node...
Fragment shader (attached to SKSpriteNode with texture size of 64x32)
void main() {
float bX = 64.0 / 192.0 * (offset.x + 1);
float aX = 64.0 / 192.0 * (offset.x );
float bY = 32.0 / 192.0 * (offset.y + 1);
float aY = 32.0 / 192.0 * (offset.y);
float normalizedX = (bX - aX) * v_tex_coord.x + aX;
float normalizedY = (bY - aY) * v_tex_coord.y + aY;
gl_FragColor = texture2D(u_temp, vec2(normalizedX, normalizedY));
}
offset.x - [0, 2]
offset.y - [0, 5]
u_temp - texture size of 192x192
function to convert a value from [0,1] to, for example, [0, 0.33]
But the result seems to be wrong:
SKSpriteNode with attached texture
SKSpriteNode without texture (what I want to achieve with texture)
When a texture is in an altas, it's not addressed by coordinates from (0,0) to (1,1) anymore. The atlas is really one large texture that has been assembled behind the scenes. When you use a particular named image from an atlas in a normal sprite, SpriteKit is looking up that image name in information about how the atlas was assembled and then telling the GPU something like "draw this sprite with bigAtlasTexture, coordinates (0.1632,0.8814) through (0.1778, 0.9143)". If you're going to write a custom shader using the same texture, you need that information about where it lives inside the atlas, which you get from textureRect:
https://developer.apple.com/documentation/spritekit/sktexture/1519707-texturerect
So you have your texture which is not really one image but defined by a location textureRect() in a big packed-up image of lots of textures. I find it easiest to think in terms of (0,0) to (1,1), so when writing a shader I usually do textureRect => subtract and scale to get to (0,0)-(1,1) => compute desired modified coordinates => scale and add to get to textureRect again => texture2D lookup.
Since your shader will need to know about textureRect but you can't call that from the shader code, you have two choices:
Make an attribute or uniform to hold that information, fill it in from the outside, and then have the shader reference it.
If the shader is only used for a specific texture or for a few textures, then you can generate shader code that's specialized for the required textureRect, i.e., it just has some constants in the code for the texture.
Here's a part of an example using approach #2:
func myShader(forTexture texture: SKTexture) -> SKShader {
// Be careful not to assume that the texture has v_tex_coord ranging in (0, 0) to
// (1, 1)! If the texture is part of a texture atlas, this is not true. I could
// make another attribute or uniform to pass in the textureRect info, but since I
// only use this with a particular texture, I just pass in the texture and compile
// in the required v_tex_coord transformations for that texture.
let rect = texture.textureRect()
let shaderSource = """
void main() {
// Normalize coordinates to (0,0)-(1,1)
v_tex_coord -= vec2(\(rect.origin.x), \(rect.origin.y));
v_tex_coord *= vec2(\(1 / rect.size.width), \(1 / rect.size.height));
// Update the coordinates in whatever way here...
// v_tex_coord = desired_transform(v_tex_coord)
// And then go back to the actual coordinates for the real texture
v_tex_coord *= vec2(\(rect.size.width), \(rect.size.height));
v_tex_coord += vec2(\(rect.origin.x), \(rect.origin.y));
gl_FragColor = texture2D(u_texture, v_tex_coord);
}
"""
let shader = SKShader(source: shaderSource)
return shader
}
That's a cut-down version of some specific examples from here:
https://github.com/bg2b/RockRats/blob/master/Asteroids/Hyperspace.swift

Porting from OpenGL to MetalKit - Projection Matrix (?) Problems

Question
I'm working on porting from OpenGL (OGL) to MetalKit (MTK) on iOS. I'm failing to get identical display in the MetalKit version of the app. I modified the projection matrix to account for differences in Normalized Device Coordinates between the two frameworks, but don't know what else to change to get identical display. Any ideas what else needs to be changed to port from OpenGL to MetalKit?
Projection Matrix Changes so far...
I understand that the Normalized Device Coordinates (NDC) are different in OGL vs MTK:
OGL NDC: -1 < z < 1
MTK NDC: 0 < z < 1
I modified the projection matrix to address the NDC difference, as indicated here. Unfortunately, this modification to the projection matrix doesn't result in identical display to the old OGL code.
I'm struggling to even know what else to try.
Background
For reference, here's some misc background information:
The view matrix is very simple (identity matrix); i.e. camera is at (0, 0, 0) and looking toward (0, 0, -1)
In the legacy OpenGL code, I used GLKMatrix4MakeFrustum to produce the projection matrix, using the screen bounds for left, right, top, bottom, and near=1, far=1000
I stripped the scene down to bare bones while debugging and below are 2 images, the first from legacy OGL code and the second from MTK, both just showing the "ground" plane with a debug texture and a black background.
Any ideas about what else might need to change to get to identical display in MetalKit would be greatly appreciated.
Screenshots
OpenGL (legacy)
MetalKit
Edit 1
I tried to extract code relevant to calculation and use of the projection matrix:
float aspectRatio = 1.777; // iPhone 8 device
float top = 1;
float bottom = -1;
float left = -aspectRatio;
float right = aspectRatio;
float RmL = right - left;
float TmB = top - bottom;
float nearZ = 1;
float farZ = 1000;
GLKMatrix4 projMatrix = { 2 * nearZ / RmL, 0, 0, 0,
0, 2 * nearZ / TmB, 0, 0,
0, 0, -farZ / (farZ - nearZ), -1,
0, 0, -farZ * nearZ / (farZ - nearZ), 0 };
GLKMatrix4 viewMatrix = ...; // Identity matrix: camera at origin, looking at (0, 0, -1), yUp=(0, 1, 0);
GLKMatrix4 modelMatrix = ...; // Different for various models, but even when this is the identity matrix in old/new code the visual output is different
GLKMatrix4 mvpMatrix = GLKMatrix4Multiply(projMatrix, GLKMatrix4Multiply(viewMatrix, modelMatrix));
...
GLKMatrix4 x = mvpMatrix; // rename for brevity below
float mvpMatrixArray[16] = {x.m00, x.m01, x.m02, x.m03, x.m10, x.m11, x.m12, x.m13, x.m20, x.m21, x.m22, x.m23, x.m30, x.m31, x.m32, x.m33};
// making MVP matrix available to vertex shader
[renderCommandEncoder setVertexBytes:&mvpMatrixArray
length:16 * sizeof(float)
atIndex:1]; // vertex data is at "0"
[renderCommandEncoder setVertexBuffer:vertexBuffer
offset:0
atIndex:0];
...
[renderCommandEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
vertexStart:0
vertexCount:4];
Sadly this issue ended up being due to a bug in the vertex shader that was pushing all geometry +1 on the Z axis, leading to the visual differences.
For any future OpenGL-to-Metal porters: the projection matrix changes above, accounting for the differences in normalized device coordinates, are enough.
Without seeing the code it's hard to say what the problem is. One of the most common issues could be a wrongly configured viewport:
// Set the region of the drawable to draw into.
[renderEncoder setViewport:(MTLViewport){0.0, 0.0, _viewportSize.x, _viewportSize.y, 0.0, 1.0 }];
The default values for the viewport are:
originX = 0.0
originY = 0.0
width = w
height = h
znear = 0.0
zfar = 1.0
*Metal: znear = minZ, zfar = maxZ.
MinZ and MaxZ indicate the depth-ranges into which the scene will be
rendered and are not used for clipping. Most applications will set
these members to 0.0 and 1.0 to enable the system to render to the
entire range of depth values in the depth buffer. In some cases, you
can achieve special effects by using other depth ranges. For instance,
to render a heads-up display in a game, you can set both values to 0.0
to force the system to render objects in a scene in the foreground, or
you might set them both to 1.0 to render an object that should always
be in the background.
Applications typically set MinZ and MaxZ to 0.0 and 1.0 respectively
to cause the system to render to the entire depth range. However, you
can use other values to achieve certain affects. For example, you
might set both values to 0.0 to force all objects into the foreground,
or set both to 1.0 to render all objects into the background.

Get the up vector of the camera in ARKit

I'm trying to get the four vectors that make up the boundaries of the frustum in ARKit, and the solution I came up with is as follows:
Find the field of view angles of the camera
Then find the direction and up vectors of the camera
Using these information, find the four vectors using cross products and rotations
This may be a sloppy way of doing it, however it is the best one I got so far.
I am able to get the FOV angles and the direction vector from the ARCamera.intrinsics and ARCamera.transform properties. However, I don't know how to get the up vector of the camera at this point.
Below is the piece of code I use to find the FOV angles and the direction vector:
func session(_ session: ARSession, didUpdate frame: ARFrame) {
if xFovDegrees == nil || yFovDegrees == nil {
let imageResolution = frame.camera.imageResolution
let intrinsics = frame.camera.intrinsics
xFovDegrees = 2 * atan(Float(imageResolution.width) / (2 * intrinsics[0,0])) * 180 / Float.pi
yFovDegrees = 2 * atan(Float(imageResolution.height) / (2 * intrinsics[1,1])) * 180 / Float.pi
}
let cameraTransform = SCNMatrix4(frame.camera.transform)
let cameraDirection = SCNVector3(-1 * cameraTransform.m31,
-1 * cameraTransform.m32,
-1 * cameraTransform.m33)
}
I am also open to suggestions for ways to find the the four vectors I'm trying to get.
I had not understood how this line worked:
let cameraDirection = SCNVector3(-1 * cameraTransform.m31,
-1 * cameraTransform.m32,
-1 * cameraTransform.m33)
This gives the direction vector of the camera because the 3rd row of the transformation matrix gives where the new z-direction of the transformed camera points at. We multiply it by -1 because the default direction of the camera is the negative z-axis.
Considering this information and the fact that the default up vector for a camera is the positive y-axis, the 2nd row of the transformation matrix gives us the up vector of the camera. The following code gives me what I want:
let cameraUp = SCNVector3(cameraTransform.m21,
cameraTransform.m22,
cameraTransform.m23)
It could be that I'm misunderstanding what you're trying to do, but I'd like to offer an alternative solution (the method and result is different than your answer).
For my purposes, I define the up vector as (0, 1, 0) when the phone is pointing straight up - basically I want the unit vector that is pointing straight out of the top of the phone. ARKit defines the up vector as (0, 1, 0) when the phone is horizontal to the left - so the y-axis is pointing out of the right side of the phone - supposedly because they expect AR apps to prefer horizontal orientation.
camera.transform returns the camera's orientation relative to its initial orientation when the AR session started. It is a 4x4 matrix - the first 3x3 of which is the rotation matrix - so when you write cameraTransform.m21 etc. you are referencing part of the rotation matrix, which is NOT the same as the up vector (however you define it).
So if I define the up vector as the unit y-vector where the y axis is pointing out of the top of the phone, I have to write this as (-1, 0, 0) in ARKit space. Then simply multiplying this vector (slightly modified... see below) by the camera's transform will give me the "up vector" that I'm looking for. Below is an example of using this calculation in a ARSessionDelegate callback.
func session(_ session: ARSession, didUpdate frame: ARFrame) {
// the unit y vector is appended with an extra element
// for multiplying with the 4x4 transform matrix
let unitYVector = float4(-1, 0, 0, 1)
let upVectorH = frame.camera.transform * unitYVector
// drop the 4th element
let upVector = SCNVector3(upVectorH.x, upVectorH.y, upVectorH.z)
}
You can use let unitYVector = float4(0, 1, 0, 1) if you are working with ARKit's horizontal orientation.
You can also do the same sort of calculation to get the "direction vector" (pointing out of the front of the phone) by multiplying unit vector (0, 0, 1, 1) by the camera transform.

iOS revert camera projection

I'm trying to estimate my device position related to a QR code in space. I'm using ARKit and the Vision framework, both introduced in iOS11, but the answer to this question probably doesn't depend on them.
With the Vision framework, I'm able to get the rectangle that bounds a QR code in the camera frame. I'd like to match this rectangle to the device translation and rotation necessary to transform the QR code from a standard position.
For instance if I observe the frame:
* *
B
C
A
D
* *
while if I was 1m away from the QR code, centered on it, and assuming the QR code has a side of 10cm I'd see:
* *
A0 B0
D0 C0
* *
what has been my device transformation between those two frames? I understand that an exact result might not be possible, because maybe the observed QR code is slightly non planar and we're trying to estimate an affine transform on something that is not one perfectly.
I guess the sceneView.pointOfView?.camera?.projectionTransform is more helpful than the sceneView.pointOfView?.camera?.projectionTransform?.camera.projectionMatrix since the later already takes into account transform inferred from the ARKit that I'm not interested into for this problem.
How would I fill
func get transform(
qrCodeRectangle: VNBarcodeObservation,
cameraTransform: SCNMatrix4) {
// qrCodeRectangle.topLeft etc is the position in [0, 1] * [0, 1] of A0
// expected real world position of the QR code in a referential coordinate system
let a0 = SCNVector3(x: -0.05, y: 0.05, z: 1)
let b0 = SCNVector3(x: 0.05, y: 0.05, z: 1)
let c0 = SCNVector3(x: 0.05, y: -0.05, z: 1)
let d0 = SCNVector3(x: -0.05, y: -0.05, z: 1)
let A0, B0, C0, D0 = ?? // CGPoints representing position in
// camera frame for camera in 0, 0, 0 facing Z+
// then get transform from 0, 0, 0 to current position/rotation that sees
// a0, b0, c0, d0 through the camera as qrCodeRectangle
}
====Edit====
After trying number of things, I ended up going for camera pose estimation using openCV projection and perspective solver, solvePnP This gives me a rotation and translation that should represent the camera pose in the QR code referential. However when using those values and placing objects corresponding to the inverse transformation, where the QR code should be in the camera space, I get inaccurate shifted values, and I'm not able to get the rotation to work:
// some flavor of pseudo code below
func renderer(_ sender: SCNSceneRenderer, updateAtTime time: TimeInterval) {
guard let currentFrame = sceneView.session.currentFrame, let pov = sceneView.pointOfView else { return }
let intrisics = currentFrame.camera.intrinsics
let QRCornerCoordinatesInQRRef = [(-0.05, -0.05, 0), (0.05, -0.05, 0), (-0.05, 0.05, 0), (0.05, 0.05, 0)]
// uses VNDetectBarcodesRequest to find a QR code and returns a bounding rectangle
guard let qr = findQRCode(in: currentFrame) else { return }
let imageSize = CGSize(
width: CVPixelBufferGetWidth(currentFrame.capturedImage),
height: CVPixelBufferGetHeight(currentFrame.capturedImage)
)
let observations = [
qr.bottomLeft,
qr.bottomRight,
qr.topLeft,
qr.topRight,
].map({ (imageSize.height * (1 - $0.y), imageSize.width * $0.x) })
// image and SceneKit coordinated are not the same
// replacing this by:
// (imageSize.height * (1.35 - $0.y), imageSize.width * ($0.x - 0.2))
// weirdly fixes an issue, see below
let rotation, translation = openCV.solvePnP(QRCornerCoordinatesInQRRef, observations, intrisics)
// calls openCV solvePnP and get the results
let positionInCameraRef = -rotation.inverted * translation
let node = SCNNode(geometry: someGeometry)
pov.addChildNode(node)
node.position = translation
node.orientation = rotation.asQuaternion
}
Here is the output:
where A, B, C, D are the QR code corners in the order they are passed to the program.
The predicted origin stays in place when the phone rotates, but it's shifted from where it should be. Surprisingly, if I shift the observations values, I'm able to correct this:
// (imageSize.height * (1 - $0.y), imageSize.width * $0.x)
// replaced by:
(imageSize.height * (1.35 - $0.y), imageSize.width * ($0.x - 0.2))
and now the predicted origin stays robustly in place. However I don't understand where the shift values come from.
Finally, I've tried to get an orientation fixed relatively to the QR code referential:
var n = SCNNode(geometry: redGeometry)
node.addChildNode(n)
n.position = SCNVector3(0.1, 0, 0)
n = SCNNode(geometry: blueGeometry)
node.addChildNode(n)
n.position = SCNVector3(0, 0.1, 0)
n = SCNNode(geometry: greenGeometry)
node.addChildNode(n)
n.position = SCNVector3(0, 0, 0.1)
The orientation is fine when I look at the QR code straight, but then it shifts by something that seems to be related to the phone rotation:
Outstanding questions I have are:
How do I solve the rotation?
where do the position shift values come from?
What simple relationship do rotation, translation, QRCornerCoordinatesInQRRef, observations, intrisics verify? Is it O ~ K^-1 * (R_3x2 | T) Q ? Because if so that's off by a few order of magnitude.
If that's helpful, here are a few numerical values:
Intrisics matrix
Mat 3x3
1090.318, 0.000, 618.661
0.000, 1090.318, 359.616
0.000, 0.000, 1.000
imageSize
1280.0, 720.0
screenSize
414.0, 736.0
==== Edit2 ====
I've noticed that the rotation works fine when the phone stays horizontally parallel to the QR code (ie the rotation matrix is [[a, 0, b], [0, 1, 0], [c, 0, d]]), no matter what the actual QR code orientation is:
Other rotation don't work.
Coordinate systems' correspondence
Take into consideration that Vision/CoreML coordinate system doesn't correspond to ARKit/SceneKit coordinate system. For details look at this post.
Rotation's direction
I suppose the problem is not in matrix. It's in vertices placement. For tracking 2D images you need to place ABCD vertices counter-clockwise (the starting point is A vertex located in imaginary origin x:0, y:0). I think Apple Documentation on VNRectangleObservation class (info about projected rectangular regions detected by an image analysis request) is vague. You placed your vertices in the same order as is in official documentation:
var bottomLeft: CGPoint
var bottomRight: CGPoint
var topLeft: CGPoint
var topRight: CGPoint
But they need to be placed the same way like positive rotation direction (about Z axis) occurs in Cartesian coordinates system:
World Coordinate Space in ARKit (as well as in SceneKit and Vision) always follows a right-handed convention (the positive Y axis points upward, the positive Z axis points toward the viewer and the positive X axis points toward the viewer's right), but is oriented based on your session's configuration. Camera works in Local Coordinate Space.
Rotation direction about any axis is positive (Counter-Clockwise) and negative (Clockwise). For tracking in ARKit and Vision it's critically important.
The order of rotation also makes sense. ARKit, as well as SceneKit, applies rotation relative to the node’s pivot property in the reverse order of the components: first roll (about Z axis), then yaw (about Y axis), then pitch (about X axis). So the rotation order is ZYX.
Math (Trig.):
Notes: the bottom is l (the QR code length), the left angle is k, and the top angle is i (the camera)

Resources