In a new project we plan to create the following AR showcase:
We want to have a wall with some pipes and cables on it. These will have sensors mounted on them to control and monitor the pipe/cable system. Since each sensor will have the same dimensions and appearance, we plan to add an individual QR code to each sensor. Reading the documentation of ARWorldTrackingConfiguration and ARImageTrackingConfiguration shows that ARKit is capable of recognizing known images. But the requirements for those images make me wonder whether the application would work as we want it to when using several QR codes:
From detectionImages:
[...], identifying art in a museum or adding animated elements to a movie poster.
From Apple's keynote:
Good images to track: high texture, high local contrast, well-distributed histogram, no repetitive structures.
Since QR codes don't fully match these requirements, I'm wondering whether it's possible to use about 10 QR codes and have ARKit recognize each of them individually and reliably, especially when, say, 3 codes are in view at the same time. Does anyone have experience with tracking several QR codes, or even with a similar showcase?
Recognizing (several) QR codes has nothing to do with ARKit and can be done in 3 different ways (AVFoundation, CIDetector, Vision), of which the latter is preferable in my opinion because you may also want to use its object-tracking capabilities (VNTrackObjectRequest). It is also more robust in my experience.
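For illustration, the tracking part might look like the following sketch (my own example, not from the answer; the initial bounding box is assumed to come from an earlier detection such as a VNBarcodeObservation):

import Vision

final class ObjectTracker {
    private let sequenceHandler = VNSequenceRequestHandler()
    private var lastObservation: VNDetectedObjectObservation

    // Seed the tracker with a bounding box from a previous detection.
    init(initialBoundingBox: CGRect) {
        lastObservation = VNDetectedObjectObservation(boundingBox: initialBoundingBox)
    }

    // Call once per captured frame; returns the updated bounding box, if any.
    func track(in pixelBuffer: CVPixelBuffer) -> CGRect? {
        let request = VNTrackObjectRequest(detectedObjectObservation: lastObservation)
        request.trackingLevel = .accurate
        try? sequenceHandler.perform([request], on: pixelBuffer)
        guard let updated = request.results?.first as? VNDetectedObjectObservation else { return nil }
        lastObservation = updated
        return updated.boundingBox
    }
}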
If you need to place objects in the ARKit scene using the locations of the QR codes, you will need to execute a hit test on the ARFrame to find a code's 3D location (transform) in the world. At that location you place a custom ARAnchor, and using that anchor you can add a custom SceneKit node to the scene.
UPDATE: So the suggested strategy would be:
1. Find the QR codes and their 2D locations with Vision.
2. Find their 3D locations (worldTransform) with ARFrame.hitTest().
3. Create a custom (subclassed) ARAnchor and add it to the session.
4. In renderer(_ renderer: SCNSceneRenderer, nodeFor anchor: ARAnchor), add a custom node (such as an SCNText with a billboard constraint) for your custom ARAnchor.
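A rough sketch of those four steps (my interpretation, not the answerer's exact code), assuming the methods live on a view controller that owns an ARSCNView named sceneView and acts as its ARSCNViewDelegate; for brevity it uses ARAnchor(name:transform:) from iOS 12 instead of an ARAnchor subclass:

import ARKit
import SceneKit
import Vision

// 1. Find QR codes and their 2D locations with Vision, run against the current ARFrame.
func detectQRCodes(in frame: ARFrame) {
    let request = VNDetectBarcodesRequest { [weak self] request, _ in
        guard let observations = request.results as? [VNBarcodeObservation] else { return }
        for observation in observations {
            self?.addAnchor(for: observation, in: frame)
        }
    }
    request.symbologies = [.qr] // spelled .QR on older SDKs
    let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage, options: [:])
    try? handler.perform([request])
}

// 2. + 3. Hit-test the code's centre to get a worldTransform and add an anchor there.
func addAnchor(for observation: VNBarcodeObservation, in frame: ARFrame) {
    // Vision's normalized coordinates have their origin at the bottom-left;
    // ARFrame.hitTest expects the origin at the top-left, so flip Y.
    let point = CGPoint(x: observation.boundingBox.midX,
                        y: 1 - observation.boundingBox.midY)
    guard let result = frame.hitTest(point, types: .featurePoint).first else { return }
    let anchor = ARAnchor(name: observation.payloadStringValue ?? "QR",
                          transform: result.worldTransform)
    sceneView.session.add(anchor: anchor)
}

// 4. Provide a node for the anchor, e.g. a billboarded text label.
func renderer(_ renderer: SCNSceneRenderer, nodeFor anchor: ARAnchor) -> SCNNode? {
    guard let name = anchor.name else { return nil }
    let text = SCNText(string: name, extrusionDepth: 0.01)
    let node = SCNNode(geometry: text)
    node.scale = SCNVector3(0.005, 0.005, 0.005)
    node.constraints = [SCNBillboardConstraint()]
    return node
}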
If by any chance you are using RxSwift, the easiest way is with the RxVision framework, because it allows you to easily pass the relevant ARFrame along into the handler:
var requests = [RxVNRequest<ARFrame>]()
let barcodesRequest: RxVNDetectBarcodesRequest<ARFrame> = VNDetectBarcodesRequest.rx.request(symbologies: [.QR])

barcodesRequest
    .observable
    .observeOn(MainScheduler.instance)
    .subscribe { [unowned self] event in
        switch event {
        case .next(let completion):
            // define this method first; completion carries the ARFrame you supplied
            self.detectCodeHandler(value: completion.value, request: completion.request, error: completion.error)
        default:
            break
        }
    }
    .disposed(by: disposeBag)
// e.g. inside renderer(_:didAdd:for:)
if let imageAnchor = anchor as? ARImageAnchor {
    guard let buffer: CVPixelBuffer = sceneView.session.currentFrame?.capturedImage else {
        print("could not get a pixel buffer")
        return
    }
    let image = CIImage(cvPixelBuffer: buffer)
    var message = ""
    let features = detector.features(in: image)
    for feature in features as! [CIQRCodeFeature] {
        message = feature.messageString ?? ""
        break
    }
    if imageAnchor.referenceImage.name == "QR1" {
        if message == "QR1" {
            // add node 1
        } else {
            sceneView.session.remove(anchor: anchor)
        }
    } else if imageAnchor.referenceImage.name == "QR2" {
        if message == "QR2" {
            // add node 2
        } else {
            sceneView.session.remove(anchor: anchor)
        }
    }
}
detector here is a CIDetector. You also need to check renderer(_:didUpdate:for:). I got this working with 4 QR codes, assuming no two QR codes are visible in the same frame at the same time.
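For reference, a minimal way to set up such a detector might look like this (the accuracy option is my assumption, not part of the answer):

import CoreImage

// A QR-code detector; high accuracy trades speed for reliability.
let detector = CIDetector(ofType: CIDetectorTypeQRCode,
                          context: nil,
                          options: [CIDetectorAccuracy: CIDetectorAccuracyHigh])!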
I am using ML Kit to detect QR codes from an image. For Android it is working properly; for iOS I am using the pod below:
pod 'GoogleMLKit/BarcodeScanning'
Here is sample code to detect a QR code from an image picked from the gallery. Every time, the features array comes back empty.
let format: BarcodeFormat = BarcodeFormat.all
let barcodeOptions = BarcodeScannerOptions(formats: format)

let visionImage = VisionImage(image: image)
visionImage.orientation = image.imageOrientation

let barcodeScanner = BarcodeScanner.barcodeScanner(options: barcodeOptions)
barcodeScanner.process(visionImage) { features, error in
    guard error == nil, let features = features, !features.isEmpty else {
        // Error handling
        return
    }
    // Recognized barcodes
    print("Data :: \(features.first?.rawValue ?? "")")
}
We noticed this may happen when there is no padding (quiet zone) around the QR code. I tried adding some padding to it, and it works after that. Could you confirm that it works for you as well?
On our side, the ML Kit team is also working on a public document about this limitation. Thanks for reporting this.
Julie from ML Kit team
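As an illustration of that workaround (not from the ML Kit team's reply), a white quiet-zone margin can be drawn around the UIImage before handing it to ML Kit; the 40-point padding value is an arbitrary assumption:

import UIKit

// Draws the original image onto a larger white canvas so the QR code
// has a quiet zone on every side.
func addingQuietZone(to image: UIImage, padding: CGFloat = 40) -> UIImage {
    let paddedSize = CGSize(width: image.size.width + padding * 2,
                            height: image.size.height + padding * 2)
    let renderer = UIGraphicsImageRenderer(size: paddedSize)
    return renderer.image { context in
        UIColor.white.setFill()
        context.fill(CGRect(origin: .zero, size: paddedSize))
        image.draw(at: CGPoint(x: padding, y: padding))
    }
}

The padded image would then be passed to VisionImage(image:) in place of the original.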
There are two different requests that you can use for face detection tasks with the iOS Vision Framework: VNDetectFaceLandmarksRequest and VNDetectFaceRectanglesRequest. Both of them return an array of VNFaceObservation, one for each detected face. VNFaceObservation has a variety of optional properties, including boundingBox and landmarks. The landmarks object then also includes optional properties like nose, innerLips, leftEye, etc.
Do the two different Vision requests differ in how they perform face detection?
It seems that VNDetectFaceRectanglesRequest only finds a bounding box (and maybe some other properties) but does not find any landmarks. VNDetectFaceLandmarksRequest, on the other hand, seems to find both the bounding box and the landmarks.
Are there cases where one request type will find a face and the other one will not? Is VNDetectFaceLandmarksRequest superior to VNDetectFaceRectanglesRequest, or does the latter have advantages in performance or reliability?
Here is example code showing how the two Vision requests can be used:
let faceLandmarkRequest = VNDetectFaceLandmarksRequest()
let faceRectangleRequest = VNDetectFaceRectanglesRequest()
let requestHandler = VNImageRequestHandler(ciImage: image, options: [:])
try requestHandler.perform([faceRectangleRequest, faceLandmarkRequest])

if let rectangleResults = faceRectangleRequest.results as? [VNFaceObservation] {
    let boundingBox1 = rectangleResults.first?.boundingBox // this is an optional type
}

if let landmarkResults = faceLandmarkRequest.results as? [VNFaceObservation] {
    let boundingBox2 = landmarkResults.first?.boundingBox // this is an optional type
    let landmarks = landmarkResults.first?.landmarks // this is an optional type
}
VNDetectFaceRectanglesRequest is the more lightweight operation: it only finds the face rectangle.
VNDetectFaceLandmarksRequest is a heavier operation, which can also locate landmarks on the face.
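One way to combine the two (my own suggestion, not part of the answer above) is to run the cheap rectangle request first and feed its results into the landmarks request via inputFaceObservations, so the landmark detector skips its own face-detection pass:

import Vision

func detectLandmarks(in image: CIImage) throws -> [VNFaceObservation] {
    let handler = VNImageRequestHandler(ciImage: image, options: [:])

    // Stage 1: cheap face-rectangle detection.
    let rectanglesRequest = VNDetectFaceRectanglesRequest()
    try handler.perform([rectanglesRequest])

    // Stage 2: landmark detection, restricted to the faces already found.
    let landmarksRequest = VNDetectFaceLandmarksRequest()
    landmarksRequest.inputFaceObservations = rectanglesRequest.results as? [VNFaceObservation]
    try handler.perform([landmarksRequest])

    return (landmarksRequest.results as? [VNFaceObservation]) ?? []
}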
Has anyone successfully extracted the estimatedDepthData and segmentationBuffer from an ARKit application? I am trying to identify collisions between a person and a rendered asset. Since a segmentation mask and a depth mask already exist at runtime when people occlusion is on, I am wondering whether I can get those buffers and use them to identify collision events.
I've done a bit of research and it seems like I might need to set up a custom renderer to handle that, but I was wondering whether anyone has figured out an easier way. I am using a very straightforward configuration.
let configuration = ARWorldTrackingConfiguration()
configuration.planeDetection = .horizontal
configuration.frameSemantics.insert(.personSegmentationWithDepth)
// Run the view's session
sceneView.session.run(configuration)
Both segmentation and depth data are accessible through properties of the ARFrame. You will need to conform to the ARSessionDelegate protocol in order to get updates.
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    if let segmentationData = frame.segmentationBuffer {
        // Do something with the segmentation data
    }
    if let depthData = frame.estimatedDepthData {
        // Do something with the depth data
    }
}
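For the callback above to fire, the delegate has to be registered on the session; a minimal sketch, assuming an ARSCNView named sceneView as in the question:

import ARKit
import UIKit

class ViewController: UIViewController, ARSessionDelegate {
    @IBOutlet var sceneView: ARSCNView!

    override func viewDidLoad() {
        super.viewDidLoad()
        // Receive session(_:didUpdate:) callbacks with each new ARFrame.
        sceneView.session.delegate = self
    }
}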
I'm trying to take two images using the camera, and align them using the iOS Vision framework:
let visionHandler = VNSequenceRequestHandler()

func align(firstImage: CIImage, secondImage: CIImage) {
    let request = VNTranslationalImageRegistrationRequest(targetedCIImage: firstImage) { request, error in
        if error != nil {
            fatalError()
        }
        let observation = request.results!.first as! VNImageTranslationAlignmentObservation
        let alignedSecondImage = secondImage.transformed(by: observation.alignmentTransform)
        let compositedImage = firstImage.applyingFilter(
            "CIAdditionCompositing",
            parameters: ["inputBackgroundImage": alignedSecondImage])
        // Save the compositedImage to the photo library.
    }
    try! visionHandler.perform([request], on: secondImage)
}
But this produces grossly mis-aligned images:
You can see that I've tried three different types of scenes — a close-up subject, an indoor scene, and an outdoor scene. I tried more outdoor scenes, and the result is the same in almost every one of them.
I was expecting a slight misalignment at worst, but not such a complete misalignment. What is going wrong?
I'm not passing the orientation of the images into the Vision framework, but that shouldn't be a problem for aligning images. It's a problem only for things like face detection, where a rotated face isn't detected as a face. In any case, the output images have the correct orientation, so orientation is not the problem.
My compositing code is working correctly; it's only the Vision framework that's the problem. If I remove the calls to the Vision framework and put the phone on a tripod, the composition works perfectly and there's no misalignment. So the problem is the Vision framework.
This is on iPhone X.
How do I get Vision framework to work correctly? Can I tell it to use gyroscope, accelerometer and compass data to improve the alignment?
You should set secondImage as the targeted image and perform the handler with firstImage.
I used your compositing approach.
check out this example from MLBoy:
let request = VNTranslationalImageRegistrationRequest(targetedCIImage: image2, options: [:])
let handler = VNImageRequestHandler(ciImage: image1, options: [:])
do {
    try handler.perform([request])
} catch let error {
    print(error)
}
guard let observation = request.results?.first as? VNImageTranslationAlignmentObservation else { return }
let alignmentTransform = observation.alignmentTransform
image2 = image2.transformed(by: alignmentTransform)
let compositedImage = image1.applyingFilter("CIAdditionCompositing", parameters: ["inputBackgroundImage": image2])
I have an issue with getting coordinates from a VNClassificationObservation.
My goal is to recognize an object and display a popup with the object's name. I'm able to get the name, but I can't get the object's coordinates or frame.
Here is the code:
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: requestOptions)
do {
    try handler.perform([classificationRequest, detectFaceRequest])
} catch {
    print(error)
}
Then I handle the results:
func handleClassification(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNClassificationObservation] else {
        fatalError("unexpected result type from VNCoreMLRequest")
    }

    // Filter observations
    let filteredObservations = observations[0...10].filter { $0.confidence > 0.1 }

    // Update UI
    DispatchQueue.main.async { [weak self] in
        for observation in filteredObservations {
            print("observation: ", observation.identifier)
            // HERE: I need to display a popup with the observation name
        }
    }
}
UPDATED:
lazy var classificationRequest: VNCoreMLRequest = {
    // Load the ML model through its generated class and create a Vision request for it.
    do {
        let model = try VNCoreMLModel(for: Inceptionv3().model)
        let request = VNCoreMLRequest(model: model, completionHandler: self.handleClassification)
        request.imageCropAndScaleOption = .centerCrop
        return request
    } catch {
        fatalError("can't load Vision ML model: \(error)")
    }
}()
A pure classifier model can only answer "what is this a picture of?", not detect and locate objects in the picture. All the free models on the Apple developer site (including Inception v3) are of this kind.
When Vision works with such a model, it identifies the model as a classifier based on the outputs declared in the MLModel file, and returns VNClassificationObservation objects as output.
If you find or create a model that's trained to both identify and locate objects, you can still use it with Vision. When you convert that model to Core ML format, the MLModel file will describe multiple outputs. When Vision works with a model that has multiple outputs, it returns an array of VNCoreMLFeatureValueObservation objects — one for each output of the model.
How the model declares its outputs would determine which feature values represent what. A model that reports a classification and a bounding box could output a string and four doubles, or a string and a multi array, etc.
Addendum: Here's a model that works on iOS 11 and returns VNCoreMLFeatureValueObservation: TinyYOLO
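For illustration only (the output names and shapes depend entirely on how your model was converted), reading such results might look like this:

import CoreML
import Vision

func handleDetection(request: VNRequest, error: Error?) {
    guard let results = request.results as? [VNCoreMLFeatureValueObservation] else { return }

    // One observation per declared model output; inspect the value's type to tell them apart.
    for observation in results {
        let value = observation.featureValue
        if value.type == .string {
            print("label output: \(value.stringValue)")
        } else if value.type == .multiArray, let array = value.multiArrayValue {
            // e.g. bounding-box coordinates or raw grid predictions, depending on the model
            print("array output with shape \(array.shape)")
        }
    }
}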
That's because classifiers do not return object coordinates or frames. A classifier only gives a probability distribution over a list of categories.
What model are you using here?
For detecting and tracking objects, you'll have to create your own model, for example with Darknet. I struggled with the same problem and used Turi Create to train a model; instead of providing only images to the framework, you also have to provide bounding boxes. Apple has documented how to create those models here:
Apple TuriCreate docs
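Once you have such an object-detection model in Core ML format, Vision hands back observations that carry both labels and bounding boxes. A rough sketch, where ObjectDetector is a placeholder name for the exported model class:

import Vision

func makeDetectionRequest() throws -> VNCoreMLRequest {
    // ObjectDetector is a hypothetical Core ML model class exported from Turi Create.
    let model = try VNCoreMLModel(for: ObjectDetector().model)
    return VNCoreMLRequest(model: model) { request, _ in
        guard let results = request.results as? [VNRecognizedObjectObservation] else { return }
        for object in results {
            // Each observation carries ranked labels plus a normalized bounding box.
            let best = object.labels.first
            print("\(best?.identifier ?? "?") (\(best?.confidence ?? 0)) at \(object.boundingBox)")
        }
    }
}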