How to get object rect/coordinates from VNClassificationObservation - ios

I have an issue with getting results from VNClassificationObservation.
My goal is to recognize the object and display a popup with the object's name. I'm able to get the name, but I can't get the object's coordinates or frame.
Here is the code:
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: requestOptions)
do {
    try handler.perform([classificationRequest, detectFaceRequest])
} catch {
    print(error)
}
Then I handle the results:
func handleClassification(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNClassificationObservation] else {
        fatalError("unexpected result type from VNCoreMLRequest")
    }
    // Filter observations
    let filteredObservations = observations[0...10].filter({ $0.confidence > 0.1 })
    // Update UI
    DispatchQueue.main.async { [weak self] in
        for observation in filteredObservations {
            print("observation: ", observation.identifier)
            // HERE: I need to display a popup with the observation name
        }
    }
}
UPDATED:
lazy var classificationRequest: VNCoreMLRequest = {
    // Load the ML model through its generated class and create a Vision request for it.
    do {
        let model = try VNCoreMLModel(for: Inceptionv3().model)
        let request = VNCoreMLRequest(model: model, completionHandler: self.handleClassification)
        request.imageCropAndScaleOption = .centerCrop
        return request
    } catch {
        fatalError("can't load Vision ML model: \(error)")
    }
}()

A pure classifier model can only answer "what is this a picture of?", not detect and locate objects in the picture. All the free models on the Apple developer site (including Inception v3) are of this kind.
When Vision works with such a model, it identifies the model as a classifier based on the outputs declared in the MLModel file, and returns VNClassificationObservation objects as output.
If you find or create a model that's trained to both identify and locate objects, you can still use it with Vision. When you convert that model to Core ML format, the MLModel file will describe multiple outputs. When Vision works with a model that has multiple outputs, it returns an array of VNCoreMLFeatureValueObservation objects — one for each output of the model.
How the model declares its outputs determines which feature values represent what. A model that reports a classification and a bounding box could output a string and four doubles, or a string and a multiarray, etc.
Addendum: Here's a model that works on iOS 11 and returns VNCoreMLFeatureValueObservation: TinyYOLO
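For illustration, here is a minimal sketch of how such results might be read, assuming a detection model whose outputs Vision surfaces as VNCoreMLFeatureValueObservation (how the multiarray is laid out is entirely model-specific):
func handleDetection(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNCoreMLFeatureValueObservation] else {
        return
    }
    for observation in observations {
        // Each observation wraps one model output; what it contains depends
        // entirely on how the model declares its outputs.
        if let multiArray = observation.featureValue.multiArrayValue {
            print("output shape: \(multiArray.shape)")
            // Decoding boxes and class scores from the multiarray is
            // model-specific (e.g. TinyYOLO needs its own grid/anchor
            // post-processing).
        }
    }
}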

That's because classifiers do not return object coordinates or frames. A classifier only gives a probability distribution over a list of categories.
What model are you using here?

For locating and identifying objects you'll have to train your own object detection model, for example with Darknet. I struggled with the same problem and used Turi Create to train a model; instead of just providing images to the framework, you also have to provide the bounding boxes. Apple has documented how to create those models here:
Apple TuriCreate docs
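With such a model (for example one exported from Turi Create), Vision returns VNRecognizedObjectObservation results on iOS 12+, which carry both labels and a bounding box. A minimal sketch, where ObjectDetector is a placeholder name for your own detection model class:
lazy var detectionRequest: VNCoreMLRequest = {
    // ObjectDetector is a hypothetical, app-specific detection model class.
    let model = try! VNCoreMLModel(for: ObjectDetector().model)
    return VNCoreMLRequest(model: model) { request, error in
        guard let results = request.results as? [VNRecognizedObjectObservation] else { return }
        for object in results {
            // boundingBox is normalized (0...1) with a bottom-left origin;
            // convert it to view coordinates before drawing.
            let label = object.labels.first?.identifier ?? "unknown"
            print(label, object.boundingBox)
        }
    }
}()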

Related

Vision Framework (iOS): How are VNDetectFaceLandmarksRequest and VNDetectFaceRectanglesRequest different?

There are two different requests that you can use for face detection tasks with the iOS Vision Framework: VNDetectFaceLandmarksRequest and VNDetectFaceRectanglesRequest. Both of them return an array of VNFaceObservation, one for each detected face. VNFaceObservation has a variety of optional properties, including boundingBox and landmarks. The landmarks object then also includes optional properties like nose, innerLips, leftEye, etc.
Do the two different Vision requests differ in how they perform face detection?
It seems that VNDetectFaceRectanglesRequest only finds a bounding box (and maybe some other properties), but does not find any landmarks. On the other hand, VNDetectFaceLandmarksRequest seems to find both the bounding box and the landmarks.
Are there cases where one request type will find a face and the other one will not? Is VNDetectFaceLandmarksRequest superior to VNDetectFaceRectanglesRequest, or does the latter maybe have advantages in performance or reliability?
Here is example code showing how these two Vision requests can be used:
let faceLandmarkRequest = VNDetectFaceLandmarksRequest()
let faceRectangleRequest = VNDetectFaceRectanglesRequest()
let requestHandler = VNImageRequestHandler(ciImage: image, options: [:])
try requestHandler.perform([faceRectangleRequest, faceLandmarkRequest])

if let rectangleResults = faceRectangleRequest.results as? [VNFaceObservation] {
    let boundingBox1 = rectangleResults.first?.boundingBox // this is an optional type
}
if let landmarkResults = faceLandmarkRequest.results as? [VNFaceObservation] {
    let boundingBox2 = landmarkResults.first?.boundingBox // this is an optional type
    let landmarks = landmarkResults.first?.landmarks // this is an optional type
}
VNDetectFaceRectanglesRequest is a more lightweight operation for finding face rectangles.
VNDetectFaceLandmarksRequest is a heavier operation, which also locates landmarks on the face.
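Both requests report boundingBox in Vision's normalized coordinate space, with the origin at the bottom-left. A minimal conversion sketch, assuming the image fills a view of size viewSize (aspect-fit letterboxing would need extra adjustment):
// Convert a Vision bounding box (normalized, bottom-left origin)
// into a UIKit rect (points, top-left origin).
func convert(_ boundingBox: CGRect, toViewOfSize viewSize: CGSize) -> CGRect {
    let width = boundingBox.width * viewSize.width
    let height = boundingBox.height * viewSize.height
    let x = boundingBox.minX * viewSize.width
    // Flip the y axis: Vision's origin is bottom-left, UIKit's is top-left.
    let y = (1 - boundingBox.maxY) * viewSize.height
    return CGRect(x: x, y: y, width: width, height: height)
}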

CoreML Output labels NSCFString - Labels not showing correctly

I am working on an iOS app where I need to use a CoreML model to perform image classification.
I used Google Cloud Platform AutoML Vision to train the model. Google provides a CoreML version of the model and I downloaded it to use in my app.
I followed Google's tutorial and everything appeared to be going smoothly. However, when it was time to start using the model, I got a very strange prediction: I got the confidence value, but the label was a very strange string that I didn't recognize.
<VNClassificationObservation: 0x600002091d40> A7DBD70C-541C-4112-84A4-C6B4ED2EB7E2 requestRevision=1 confidence=0.332127 "CICAgICAwPmveRIJQWdsYWlzX2lv"
The string I am referring to is CICAgICAwPmveRIJQWdsYWlzX2lv.
After some research and debugging I found out that this is an NSCFString.
https://developer.apple.com/documentation/foundation/1395135-nsclassfromstring
Apparently this is part of the Foundation API. Does anyone have any experience with this?
The Core ML file also comes with a dict.txt file containing the correct labels. Do I have to convert this string to those labels? How do I do that?
This is the code I have so far.
//
//  Classification.swift
//  Lepidoptera
//
//  Created by Tomás Mamede on 15/09/2020.
//  Copyright © 2020 Tomás Santiago. All rights reserved.
//

import Foundation
import SwiftUI
import Vision
import CoreML
import ImageIO

class Classification {

    private lazy var classificationRequest: VNCoreMLRequest = {
        do {
            let model = try VNCoreMLModel(for: AutoML().model)
            let request = VNCoreMLRequest(model: model, completionHandler: { [weak self] request, error in
                if let classifications = request.results as? [VNClassificationObservation] {
                    print(classifications.first ?? "No classification!")
                }
            })
            request.imageCropAndScaleOption = .scaleFit
            return request
        }
        catch {
            fatalError("Error! Can't use Model.")
        }
    }()

    func classifyImage(receivedImage: UIImage) {
        let orientation = CGImagePropertyOrientation(rawValue: UInt32(receivedImage.imageOrientation.rawValue))
        if let image = CIImage(image: receivedImage) {
            DispatchQueue.global(qos: .userInitiated).async {
                let handler = VNImageRequestHandler(ciImage: image, orientation: orientation!)
                do {
                    try handler.perform([self.classificationRequest])
                }
                catch {
                    fatalError("Error classifying image!")
                }
            }
        }
    }
}
The labels are stored in your mlmodel file. If you open the mlmodel in the Xcode 12 model viewer, it will display what those labels are.
My guess is that instead of actual labels, your mlmodel file contains "CICAgICAwPmveRIJQWdsYWlzX2lv" and so on.
It looks like Google's AutoML does not put the correct class labels into the Core ML model.
You can make a dictionary in the app that maps "CICAgICAwPmveRIJQWdsYWlzX2lv" and so on to the real labels.
Or you can replace these labels inside the mlmodel file by editing it using coremltools. (My e-book Core ML Survival Guide has a chapter on how to replace the labels in the model.)
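If you go the in-app mapping route, a minimal sketch might look like this (the dictionary contents are hypothetical; you would fill them in from dict.txt or your own list):
// Hypothetical mapping from the identifiers the model emits to readable labels.
let labelMap: [String: String] = [
    "CICAgICAwPmveRIJQWdsYWlzX2lv": "Aglais io"
    // ... one entry per label in dict.txt
]

func readableLabel(for observation: VNClassificationObservation) -> String {
    return labelMap[observation.identifier] ?? observation.identifier
}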

Why my pre-trained mlmodel is so wrong in object recognition?

Recently I wanted to check out Core ML and Create ML, so I created a simple app with object recognition.
I created a model only for bananas and carrots (just as a first try). I used over 60 images to train the model, and in the Create ML app the training process looked fine.
Everything was going great until I printed the results to the console and saw that my model is 100% confident that a waterfall is a banana...
Ideally, I thought the output would be 0% confidence for banana and 0% confidence for carrot (because I used an image of a waterfall).
Could you explain why the output looks like this and give any advice on how to improve my app?
This is my code for image recognition:
func recognizeObject(image: CIImage) {
    guard let myModel = try? VNCoreMLModel(for: FruitVegeClassifier_1().model) else {
        fatalError("Couldn't load ML Model")
    }
    let recognizeRequest = VNCoreMLRequest(model: myModel) { (recognizeRequest, error) in
        guard let output = recognizeRequest.results as? [VNClassificationObservation] else {
            fatalError("Your model failed !")
        }
        print(output)
    }
    let handler = VNImageRequestHandler(ciImage: image)
    do {
        try handler.perform([recognizeRequest])
    } catch {
        print(error)
    }
}
In the console we can see:
[<VNClassificationObservation: 0x600001c77810> 24503983-5770-4F43-8078-F3F6243F47B2 requestRevision=1 confidence=1.000000 "banana", <VNClassificationObservation: 0x600001c77840> E73BFBAE-D6E1-4D31-A2AE-0B3C860EAF99 requestRevision=1 confidence=0.000000 "carrot"]
and the image looks like this:
Thanks for any help!
If you only trained on images of bananas and carrots, the model should only be used on images of bananas and carrots.
When you give it a totally different kind of images, it will try to match it to the patterns it has learned, which are either bananas or carrots and nothing else.
In other words, these models do not work the way you were expecting them to.
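To make the "probability distribution" point concrete: a classifier's final softmax layer normalizes whatever raw scores it computes so that they sum to 1 over the known classes, and there is no "none of the above" bucket unless you train one. A minimal illustration (the raw scores are made up):
import Foundation

// Softmax squeezes arbitrary scores into probabilities that sum to 1.
func softmax(_ scores: [Double]) -> [Double] {
    let exps = scores.map { exp($0) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

// Even if the image is a waterfall, the model only has "banana" and "carrot"
// scores to choose from, so one of them ends up with nearly all the probability.
let probabilities = softmax([8.0, 1.5]) // hypothetical raw scores
print(probabilities)                    // ≈ [0.9985, 0.0015]
A common mitigation is to add an explicit "other"/background class trained on varied unrelated images; a plain confidence threshold does not help when the model is confidently wrong, as it is here.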

Several QR Codes with ARKit

In a new project we plan to create following AR showcase:
We want to have a wall with some pipes and cables on it. These will have sensors mounted on them to control and monitor the pipe/cable system. Since each sensor has the same dimensions and appearance, we plan to add an individual QR code to each sensor. Reading the documentation of ARWorldTrackingConfiguration and ARImageTrackingConfiguration shows that ARKit is capable of recognizing known images. But the requirements for those images make me wonder whether the application would work as intended when using several QR codes:
From detectionImages:
[...], identifying art in a museum or adding animated elements to a movie poster.
From Apple's keynote:
Good Images to Track: High Texture, High local Contrast, well distributed histogram, no repetitive structures
Since QR codes don't completely match these requirements, I'm wondering if it's possible to use about 10 QR codes and have ARKit recognize each of them individually and reliably, especially when, say, 3 codes are in view. Does anyone have experience tracking several QR codes, or with a similar showcase?
Recognizing (several) QR codes has nothing to do with ARKit and can be done in 3 different ways (AVFoundation, CIDetector, Vision), of which the latter is preferable in my opinion because you may also want to use its object tracking capabilities (VNTrackObjectRequest). It is also more robust in my experience.
If you need to place objects in the ARKit scene using the locations of the QR codes, you will need to execute a hitTest on the ARFrame to find each code's 3D location (transform) in the world. At that location you place a custom ARAnchor, and using the anchor you can add a custom SceneKit node to the scene.
UPDATE: So the suggested strategy would be: 1. find the QR codes and their 2D locations with Vision; 2. find their 3D locations (worldTransform) with ARFrame.hitTest(); 3. create a custom (subclassed) ARAnchor and add it to the session; 4. in renderer(_ renderer: SCNSceneRenderer, nodeFor anchor: ARAnchor) add a custom node (such as an SCNText with a billboard constraint) for your custom ARAnchor.
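Here is a minimal sketch of steps 2 and 3, assuming you already have a Vision barcode observation and the ARFrame it was computed from; for brevity it uses a named ARAnchor (iOS 12+) rather than a subclass, but the idea is the same:
import ARKit
import Vision

func placeAnchor(for observation: VNBarcodeObservation,
                 in frame: ARFrame,
                 session: ARSession) {
    // Vision's boundingBox is normalized with a bottom-left origin;
    // ARFrame.hitTest expects a top-left origin, so flip the y axis.
    let center = CGPoint(x: observation.boundingBox.midX,
                         y: 1 - observation.boundingBox.midY)
    // Hit-test against feature points to estimate the code's 3D position.
    if let result = frame.hitTest(center, types: [.featurePoint]).first {
        let anchor = ARAnchor(name: observation.payloadStringValue ?? "QR",
                              transform: result.worldTransform)
        session.add(anchor: anchor)
    }
}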
If by any chance you are using RxSwift, this can be done most easily with the RxVision framework, because it lets you pass the relevant ARFrame along into the handler:
var requests = [RxVNRequest<ARFrame>]()
let barcodesRequest: RxVNDetectBarcodesRequest<ARFrame> = VNDetectBarcodesRequest.rx.request(symbologies: [.QR])

self
    .barcodesRequest
    .observable
    .observeOn(Scheduler.main)
    .subscribe { [unowned self] (event) in
        switch event {
        case .next(let completion):
            self.detectCodeHandler(value: completion.value, request: completion.request, error: completion.error) // define the method first
        default:
            break
        }
    }
    .disposed(by: disposeBag)
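To complete step 4 of the strategy above, the SceneKit renderer delegate can attach a billboarded text label to the anchor placed earlier; a minimal sketch, matching the anchor by the name it was created with:
func renderer(_ renderer: SCNSceneRenderer, nodeFor anchor: ARAnchor) -> SCNNode? {
    guard let payload = anchor.name else { return nil }
    let text = SCNText(string: payload, extrusionDepth: 0.01)
    let textNode = SCNNode(geometry: text)
    textNode.scale = SCNVector3(0.002, 0.002, 0.002)  // SCNText geometry is large by default
    textNode.constraints = [SCNBillboardConstraint()] // always face the camera
    let node = SCNNode()
    node.addChildNode(textNode)
    return node
}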
if let imageAnchor = anchor as? ARImageAnchor {
    guard let buffer: CVPixelBuffer = sceneView.session.currentFrame?.capturedImage else {
        print("could not get a pixel buffer")
        return
    }
    let image = CIImage(cvPixelBuffer: buffer)
    var message = ""
    let features = detector.features(in: image)
    for feature in features as! [CIQRCodeFeature] {
        message = feature.messageString ?? ""
        break
    }
    if imageAnchor.referenceImage.name == "QR1" {
        if message == "QR1" {
            // add node 1
        } else {
            sceneView.session.remove(anchor: anchor)
        }
    } else if imageAnchor.referenceImage.name == "QR2" {
        if message == "QR2" {
            // add node 2
        } else {
            sceneView.session.remove(anchor: anchor)
        }
    }
}
detector here is a CIDetector. You also need to check renderer(_:didUpdate:for:). I worked with 4 QR codes; it works assuming no two QR codes are visible in the same frame at the same time.

Swift: Process UIImage data for use in Firebase custom TFLite model

I am using Swift, Firebase, and TensorFlow to build an image recognition model. I have a re-trained MobileNet model that takes an input array of shape [1, 224, 224, 3], copied into my Xcode bundle. When I try to add data from an image as an input, I get the error: Input 0 should have 602112 bytes, but found 627941 bytes. I am using the following code:
let input = ModelInputs()
do {
    let newImage = image.resizeTo(size: CGSize(width: 224, height: 224))
    let data = UIImagePNGRepresentation(newImage)
    // Store input data in `data`
    // ...
    try input.addInput(data)
    // Repeat as necessary for each input index
} catch let error as NSError {
    print("Failed to add input: \(error.localizedDescription)")
}

interpreter.run(inputs: input, options: ioOptions) { outputs, error in
    guard error == nil, let outputs = outputs else {
        print(error!.localizedDescription) // ERROR BEING CALLED HERE
        return
    }
    // Process outputs
    print(outputs)
    // ...
}
How can I re-process the image data so it is 602112 bytes? I am quite confused; if someone could help me, it would be great :)
Please check out the Quick Start iOS demo app in Swift on how to use a custom TFLite model:
https://github.com/firebase/quickstart-ios/tree/master/mlmodelinterpreter
In particular, I think this is what you are looking for:
https://github.com/firebase/quickstart-ios/blob/master/mlmodelinterpreter/MLModelInterpreterExample/UIImage%2BTFLite.swift#L47
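The expected size comes straight from the tensor shape: 1 × 224 × 224 × 3 float32 values × 4 bytes = 602112 bytes, whereas UIImagePNGRepresentation returns a compressed PNG file rather than raw pixel data. The linked UIImage+TFLite.swift extension performs this conversion; as a rough sketch of the idea (not the demo's exact code, and the normalization depends on how the model was trained), you draw the image into an RGBA bitmap and repack it as normalized RGB floats:
import UIKit

// Convert a UIImage into raw Float32 RGB data (1 x 224 x 224 x 3), which is
// what a float MobileNet input expects. A sketch; quantized models would need
// UInt8 values instead.
func rgbFloatData(from image: UIImage, size: CGSize = CGSize(width: 224, height: 224)) -> Data? {
    let width = Int(size.width), height = Int(size.height)
    var pixels = [UInt8](repeating: 0, count: width * height * 4)
    guard let cgImage = image.cgImage,
          let context = CGContext(data: &pixels,
                                  width: width,
                                  height: height,
                                  bitsPerComponent: 8,
                                  bytesPerRow: width * 4,
                                  space: CGColorSpaceCreateDeviceRGB(),
                                  bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { return nil }
    context.draw(cgImage, in: CGRect(origin: .zero, size: size))

    // Drop the alpha channel and scale each channel to [0, 1].
    var floats = [Float](repeating: 0, count: width * height * 3)
    for i in 0..<(width * height) {
        floats[i * 3 + 0] = Float(pixels[i * 4 + 0]) / 255.0 // R
        floats[i * 3 + 1] = Float(pixels[i * 4 + 1]) / 255.0 // G
        floats[i * 3 + 2] = Float(pixels[i * 4 + 2]) / 255.0 // B
    }
    return floats.withUnsafeBufferPointer { Data(buffer: $0) } // 224 * 224 * 3 * 4 = 602112 bytes
}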
Good luck!
