Why is my pre-trained mlmodel so wrong at object recognition? - ios

Recently I wanted to check out CoreML and CreateML, so I created a simple app with object recognition.
I created a model only for bananas and carrots (just to try it out). I used over 60 images to train my model, and in the Create ML app the training process looked fine.
Everything was going great until I printed out the results in the console and saw that my model is 100% confident that a waterfall is a banana...
Ideally, I thought the output would be 0% confidence for banana and 0% confidence for carrot (because I used an image of a waterfall).
Could you explain why the output looks like this and give any advice on how to improve my app?
This is my code for image recognition:
func recognizeObject(image: CIImage) {
    // Load the Core ML model wrapped for use with Vision.
    guard let myModel = try? VNCoreMLModel(for: FruitVegeClassifier_1().model) else {
        fatalError("Couldn't load ML Model")
    }

    let recognizeRequest = VNCoreMLRequest(model: myModel) { recognizeRequest, error in
        guard let output = recognizeRequest.results as? [VNClassificationObservation] else {
            fatalError("Your model failed!")
        }
        print(output)
    }

    let handler = VNImageRequestHandler(ciImage: image)
    do {
        try handler.perform([recognizeRequest])
    } catch {
        print(error)
    }
}
In the console we can see that:
[<VNClassificationObservation: 0x600001c77810> 24503983-5770-4F43-8078-F3F6243F47B2 requestRevision=1 confidence=1.000000 "banana", <VNClassificationObservation: 0x600001c77840> E73BFBAE-D6E1-4D31-A2AE-0B3C860EAF99 requestRevision=1 confidence=0.000000 "carrot"]
and the image looks like this:
Thanks for any help!

If you only trained on images of bananas and carrots, the model should only be used on images of bananas and carrots.
When you give it a totally different kind of image, it will try to match it to the patterns it has learned, which are either bananas or carrots and nothing else.
In other words, these models do not work the way you were expecting them to.
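The reason you see 100% rather than 0% is that a classifier's confidences are normalized over the classes it knows, so they always sum to 1; a waterfall image still has to be split between "banana" and "carrot". If you want the app to say "unknown", the real fix is to retrain with an extra category of unrelated images; at best you can also check the top confidence in the completion handler. A minimal sketch of that check, reusing the names from the question (the 0.8 cutoff is an arbitrary assumption, and it will not catch cases like this one where the model is wrongly certain):

let recognizeRequest = VNCoreMLRequest(model: myModel) { request, _ in
    guard let results = request.results as? [VNClassificationObservation],
          let top = results.first else {
        return
    }
    // Confidences are normalized over the known classes only, so they sum to 1.
    if top.confidence < 0.8 { // arbitrary cutoff (assumption)
        print("Not confident enough, treating as unknown")
    } else {
        print("\(top.identifier): \(top.confidence)")
    }
}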

Related

iOS fast image difference comparison

I'm looking for a fast way to compare two frames of video and decide if a lot has changed between them. This will be used to decide whether I should send a request to an image recognition service over REST, so I don't want to keep sending frames until there might be some different results. The Vuforia SDK does something similar. I'm starting with a frame buffer from ARKit, scaled to 640x480 and converted to an RGB888 vBuffer_image. It could compare just a few points, but it needs to reliably detect whether the difference is significant.
I started by calculating the difference between a few points using vDSP functions, but this has a disadvantage: if I move the camera even very slightly to the left or right, the same points cover different portions of the image, and the calculated difference is high even if nothing really changed much.
I was thinking about using histograms, but I haven't tested this approach yet.
What would be the best solution for this? It needs to be fast; it can compare just a smaller version of the image, etc.
I have tested another approach using VNFeaturePrintObservation from Vision. This works a lot better, but I'm afraid it might be more CPU demanding; I need to test it on some older devices. Anyway, this is the part of the code that works nicely. If someone could suggest a better approach to test, please let me know:
private var lastScanningImageFingerprint: VNFeaturePrintObservation?

// Returns true if these are different enough
private func compareScanningImages(current: VNFeaturePrintObservation, last: VNFeaturePrintObservation?) -> Bool {
    guard let last = last else { return true }
    var distance = Float(0)
    try! last.computeDistance(&distance, to: current)
    print(distance)
    return distance > 10
}

// After scanning is done, subclass should prepare suggestedTargets array.
private func performScanningIfNeeded(_ sender: Timer) {
    guard !scanningInProgress else { return } // Wait for previous scanning to finish
    guard let vImageBuffer = delegate?.currentFrameScaledImage else { return }
    guard let image = CGImage.create(from: vImageBuffer) else { return }

    func featureprintObservationForImage(image: CGImage) -> VNFeaturePrintObservation? {
        let requestHandler = VNImageRequestHandler(cgImage: image, options: [:])
        let request = VNGenerateImageFeaturePrintRequest()
        do {
            try requestHandler.perform([request])
            return request.results?.first as? VNFeaturePrintObservation
        } catch {
            print("Vision error: \(error)")
            return nil
        }
    }

    guard let imageFingerprint = featureprintObservationForImage(image: image) else { return }
    guard compareScanningImages(current: imageFingerprint, last: lastScanningImageFingerprint) else { return }

    print("SCANN \(Date())")
    lastScanningImageFingerprint = featureprintObservationForImage(image: image)
    executeScanning(on: image) { [weak self] in
        self?.scanningInProgress = false
    }
}
Tested on an older iPhone: as expected, this causes some frame drops on the camera preview, so I need a faster algorithm.
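For what it's worth, the plain-vDSP variant mentioned above (mean absolute difference between two downscaled frames) only takes a few lines. This is a rough sketch, assuming both frames have already been converted to equal-length Float luminance buffers; as noted in the question, it stays sensitive to small camera translations:

import Accelerate

// Rough sketch: mean absolute difference between two downscaled grayscale frames.
// Assumes both buffers have the same length (e.g. 64 x 48 luminance values in 0...1).
func framesDifferSignificantly(_ a: [Float], _ b: [Float], threshold: Float = 0.05) -> Bool {
    precondition(a.count == b.count, "Buffers must have the same length")
    var diff = [Float](repeating: 0, count: a.count)
    vDSP_vsub(b, 1, a, 1, &diff, 1, vDSP_Length(a.count))         // diff = a - b
    var meanMagnitude: Float = 0
    vDSP_meamgv(diff, 1, &meanMagnitude, vDSP_Length(diff.count)) // mean of |diff|
    return meanMagnitude > threshold                              // threshold is an assumption
}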

CoreML Output labels NSCFString - Labels not showing correctly

I am working on an iOS app where I need to use a CoreML model to perform image classification.
I used Google Cloud Platform AutoML Vision to train the model. Google provides a CoreML version of the model and I downloaded it to use in my app.
I followed Google's tutorial and everything appeared to be going smoothly. However, when it was time to start using the model, I got very strange predictions: I got the confidence of the prediction, but the label was a very strange string that I didn't recognize.
<VNClassificationObservation: 0x600002091d40> A7DBD70C-541C-4112-84A4-C6B4ED2EB7E2 requestRevision=1 confidence=0.332127 "CICAgICAwPmveRIJQWdsYWlzX2lv"
The string I am referring to is CICAgICAwPmveRIJQWdsYWlzX2lv.
After some research and debugging I found out that this is an NSCFString.
https://developer.apple.com/documentation/foundation/1395135-nsclassfromstring
Apparently this is part of the Foundation API. Does anyone have any experience with this?
With the CoreML file also comes a dict.txt file with the correct labels. Do I have to convert this string to one of those labels? How do I do that?
This is the code I have so far.
//
//  Classification.swift
//  Lepidoptera
//
//  Created by Tomás Mamede on 15/09/2020.
//  Copyright © 2020 Tomás Santiago. All rights reserved.
//

import Foundation
import SwiftUI
import Vision
import CoreML
import ImageIO

class Classification {

    private lazy var classificationRequest: VNCoreMLRequest = {
        do {
            let model = try VNCoreMLModel(for: AutoML().model)
            let request = VNCoreMLRequest(model: model, completionHandler: { [weak self] request, error in
                if let classifications = request.results as? [VNClassificationObservation] {
                    print(classifications.first ?? "No classification!")
                }
            })
            request.imageCropAndScaleOption = .scaleFit
            return request
        } catch {
            fatalError("Error! Can't use Model.")
        }
    }()

    func classifyImage(receivedImage: UIImage) {
        let orientation = CGImagePropertyOrientation(rawValue: UInt32(receivedImage.imageOrientation.rawValue))
        if let image = CIImage(image: receivedImage) {
            DispatchQueue.global(qos: .userInitiated).async {
                let handler = VNImageRequestHandler(ciImage: image, orientation: orientation!)
                do {
                    try handler.perform([self.classificationRequest])
                } catch {
                    fatalError("Error classifying image!")
                }
            }
        }
    }
}
The labels are stored in your mlmodel file. If you open the mlmodel in the Xcode 12 model viewer, it will display what those labels are.
My guess is that instead of actual labels, your mlmodel file contains "CICAgICAwPmveRIJQWdsYWlzX2lv" and so on.
It looks like Google's AutoML does not put the correct class labels into the Core ML model.
You can make a dictionary in the app that maps "CICAgICAwPmveRIJQWdsYWlzX2lv" and so on to the real labels.
Or you can replace these labels inside the mlmodel file by editing it using coremltools. (My e-book Core ML Survival Guide has a chapter on how to replace the labels in the model.)
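A minimal sketch of the dictionary approach, assuming you work out the real label behind each opaque identifier yourself (the single entry below is a hypothetical example, not something read from dict.txt):

import Vision

// Hypothetical mapping from the opaque identifiers Vision returns to readable labels.
// Fill this in yourself, or generate it from dict.txt once you know the correspondence.
let labelMap: [String: String] = [
    "CICAgICAwPmveRIJQWdsYWlzX2lv": "Aglais_io" // example entry, label assumed
]

func readableLabel(for observation: VNClassificationObservation) -> String {
    // Fall back to the raw identifier if it is not in the map.
    return labelMap[observation.identifier] ?? observation.identifier
}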

Swift: Process UIImage data for use in Firebase custom TFLite model

I am using Swift, Firebase, and TensorFlow to build an image recognition model. I have a re-trained MobileNet model that takes an input array of [1, 224, 224, 3]. I copied it into my Xcode bundle, and when I try to add data from an image as an input, I get the error: Input 0 should have 602112 bytes, but found 627941 bytes. I am using the following code:
let input = ModelInputs()
do {
    let newImage = image.resizeTo(size: CGSize(width: 224, height: 224))
    let data = UIImagePNGRepresentation(newImage)
    // Store input data in `data`
    // ...
    try input.addInput(data)
    // Repeat as necessary for each input index
} catch let error as NSError {
    print("Failed to add input: \(error.localizedDescription)")
}

interpreter.run(inputs: input, options: ioOptions) { outputs, error in
    guard error == nil, let outputs = outputs else {
        print(error!.localizedDescription) // ERROR BEING CALLED HERE
        return
    }
    // Process outputs
    print(outputs)
    // ...
}
How can I re-process the image data to be 602112 bytes? I am quite confused; if someone could please help me, that would be great :)
Please check out the Quick Start iOS demo app in Swift on how to use a custom TFLite model:
https://github.com/firebase/quickstart-ios/tree/master/mlmodelinterpreter
In particular, I think this is what you are looking for:
https://github.com/firebase/quickstart-ios/blob/master/mlmodelinterpreter/MLModelInterpreterExample/UIImage%2BTFLite.swift#L47
Good luck!
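For context, 602112 bytes is exactly 224 x 224 x 3 Float32 values (150,528 floats x 4 bytes), while UIImagePNGRepresentation returns compressed PNG data, which is why the sizes don't match. Below is a rough sketch of the kind of conversion the linked UIImage+TFLite helper performs, written independently here; whether the values should be scaled to [0, 1], [-1, 1], or left as 0...255 depends on how the MobileNet model was re-trained:

import UIKit

// Converts a UIImage into 224 x 224 x 3 Float32 values (602112 bytes), dropping alpha.
func rgbDataForTFLite(from image: UIImage, size: CGSize = CGSize(width: 224, height: 224)) -> Data? {
    guard let cgImage = image.cgImage else { return nil }
    let width = Int(size.width), height = Int(size.height)
    let bytesPerPixel = 4
    var pixelData = [UInt8](repeating: 0, count: width * height * bytesPerPixel)

    // Draw the image scaled into a 224 x 224 RGBA (alpha-ignored) bitmap buffer.
    let drawn = pixelData.withUnsafeMutableBytes { buffer -> Bool in
        guard let context = CGContext(data: buffer.baseAddress,
                                      width: width,
                                      height: height,
                                      bitsPerComponent: 8,
                                      bytesPerRow: width * bytesPerPixel,
                                      space: CGColorSpaceCreateDeviceRGB(),
                                      bitmapInfo: CGImageAlphaInfo.noneSkipLast.rawValue) else {
            return false
        }
        context.draw(cgImage, in: CGRect(origin: .zero, size: size))
        return true
    }
    guard drawn else { return nil }

    // Keep only R, G and B, converting each channel to Float32 in 0...1 (the normalization is an assumption).
    var floats = [Float32]()
    floats.reserveCapacity(width * height * 3)
    for pixel in stride(from: 0, to: pixelData.count, by: bytesPerPixel) {
        floats.append(Float32(pixelData[pixel]) / 255.0)
        floats.append(Float32(pixelData[pixel + 1]) / 255.0)
        floats.append(Float32(pixelData[pixel + 2]) / 255.0)
    }
    return floats.withUnsafeBufferPointer { Data(buffer: $0) } // 224 * 224 * 3 * 4 = 602112 bytes
}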

Why is the Vision framework unable to align two images?

I'm trying to take two images using the camera, and align them using the iOS Vision framework:
func align(firstImage: CIImage, secondImage: CIImage) {
    let request = VNTranslationalImageRegistrationRequest(targetedCIImage: firstImage, options: [:]) { request, error in
        if error != nil {
            fatalError()
        }
        let observation = request.results!.first as! VNImageTranslationAlignmentObservation
        let alignedSecondImage = secondImage.transformed(by: observation.alignmentTransform)
        let compositedImage = firstImage.applyingFilter(
            "CIAdditionCompositing",
            parameters: ["inputBackgroundImage": alignedSecondImage])
        // Save the compositedImage to the photo library.
    }
    try! visionHandler.perform([request], on: secondImage)
}

let visionHandler = VNSequenceRequestHandler()
But this produces grossly mis-aligned images:
You can see that I've tried three different types of scenes — a close-up subject, an indoor scene, and an outdoor scene. I tried more outdoor scenes, and the result is the same in almost every one of them.
I was expecting a slight misalignment at worst, but not such a complete misalignment. What is going wrong?
I'm not passing the orientation of the images into the Vision framework, but that shouldn't be a problem for aligning images. It's a problem only for things like face detection, where a rotated face isn't detected as a face. In any case, the output images have the correct orientation, so orientation is not the problem.
My compositing code is working correctly; it's only the Vision framework that's a problem. If I remove the calls to the Vision framework and put the phone on a tripod, the composition works perfectly and there's no misalignment. So the problem is the Vision framework.
This is on iPhone X.
How do I get Vision framework to work correctly? Can I tell it to use gyroscope, accelerometer and compass data to improve the alignment?
You should set secondImage as the targeted image and perform the handler with firstImage.
I used your compositing approach.
Check out this example from MLBoy:
let request = VNTranslationalImageRegistrationRequest(targetedCIImage: image2, options: [:])
let handler = VNImageRequestHandler(ciImage: image1, options: [:])
do {
    try handler.perform([request])
} catch let error {
    print(error)
}

guard let observation = request.results?.first as? VNImageTranslationAlignmentObservation else { return }
let alignmentTransform = observation.alignmentTransform

image2 = image2.transformed(by: alignmentTransform)
let compositedImage = image1.applyingFilter("CIAdditionCompositing", parameters: ["inputBackgroundImage": image2])

How to get object rect/coordinates from VNClassificationObservation

I have an issue with getting the object's location from VNClassificationObservation.
My goal is to recognize the object and display a popup with the object's name; I'm able to get the name, but I can't get the object's coordinates or frame.
Here is the code:
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: requestOptions)
do {
    try handler.perform([classificationRequest, detectFaceRequest])
} catch {
    print(error)
}
Then I handle the results:
func handleClassification(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNClassificationObservation] else {
        fatalError("unexpected result type from VNCoreMLRequest")
    }

    // Filter observations
    let filteredObservations = observations[0...10].filter({ $0.confidence > 0.1 })

    // Update UI
    DispatchQueue.main.async { [weak self] in
        for observation in filteredObservations {
            print("observation: ", observation.identifier)
            // HERE: I need to display popup with observation name
        }
    }
}
UPDATED:
lazy var classificationRequest: VNCoreMLRequest = {
    // Load the ML model through its generated class and create a Vision request for it.
    do {
        let model = try VNCoreMLModel(for: Inceptionv3().model)
        let request = VNCoreMLRequest(model: model, completionHandler: self.handleClassification)
        request.imageCropAndScaleOption = .centerCrop
        return request
    } catch {
        fatalError("can't load Vision ML model: \(error)")
    }
}()
A pure classifier model can only answer "what is this a picture of?", not detect and locate objects in the picture. All the free models on the Apple developer site (including Inception v3) are of this kind.
When Vision works with such a model, it identifies the model as a classifier based on the outputs declared in the MLModel file, and returns VNClassificationObservation objects as output.
If you find or create a model that's trained to both identify and locate objects, you can still use it with Vision. When you convert that model to Core ML format, the MLModel file will describe multiple outputs. When Vision works with a model that has multiple outputs, it returns an array of VNCoreMLFeatureValueObservation objects — one for each output of the model.
How the model declares its outputs would determine which feature values represent what. A model that reports a classification and a bounding box could output a string and four doubles, or a string and a multi array, etc.
Addendum: Here's a model that works on iOS 11 and returns VNCoreMLFeatureValueObservation: TinyYOLO
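For illustration, here is a rough sketch of what handling such a model's results might look like; the result type is from the Vision documentation, but the number of outputs and the layout of the multi arrays are entirely model-specific, so the decoding step is left as a placeholder:

import Vision
import CoreML

// Completion handler for a VNCoreMLRequest backed by a model with multiple outputs
// (e.g. an object detector such as TinyYOLO). Vision returns one
// VNCoreMLFeatureValueObservation per declared model output.
func handleObjectDetection(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNCoreMLFeatureValueObservation] else { return }
    for observation in observations {
        if let multiArray = observation.featureValue.multiArrayValue {
            // Decoding boxes/classes from the multi array (grid cells, anchors,
            // confidence threshold) depends entirely on the specific model.
            print("Output shape: \(multiArray.shape)")
        }
    }
}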
That's because classifiers do not return object coordinates or frames. A classifier only gives a probability distribution over a list of categories.
What model are you using here?
For tracking and identifying objects, you'll have to create your own model, for example with Darknet. I struggled with the same problem and used TuriCreate to train a model; instead of just providing images to the framework, you'll also have to provide bounding boxes. Apple has documented how to create those models here:
Apple TuriCreate docs
