I am trying to improve the performance of skeleton drawing with VNDetectHumanBodyPoseRequest body tracking, even at distances beyond 5 metres and with a stable iPhone XS camera.
The tracking shows low confidence for my lower right limbs, noticeable lag, and jitter. I am unable to replicate the performance showcased in this year's WWDC demo video.
Here is the relevant code, adapted from Apple's sample code:
class Predictor {
    func extractPoses(_ sampleBuffer: CMSampleBuffer) throws -> [VNRecognizedPointsObservation] {
        let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .down)
        let request = VNDetectHumanBodyPoseRequest()
        do {
            // Perform the body pose-detection request.
            try requestHandler.perform([request])
        } catch {
            print("Unable to perform the request: \(error).\n")
        }
        return (request.results as? [VNRecognizedPointsObservation]) ?? []
    }
}
I've captured the video data and am handling the sample buffers here:
class CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        let observations = try? predictor.extractPoses(sampleBuffer)
        observations?.forEach { processObservation($0) }
    }

    func processObservation(_ observation: VNRecognizedPointsObservation) {
        // Retrieve all torso points.
        guard let recognizedPoints =
            try? observation.recognizedPoints(forGroupKey: .all) else {
            return
        }
        let storedPoints = Dictionary(uniqueKeysWithValues: recognizedPoints.compactMap { (key, point) -> (String, CGPoint)? in
            return (key.rawValue, point.location)
        })
        DispatchQueue.main.sync {
            let mappedPoints = Dictionary(uniqueKeysWithValues: recognizedPoints.compactMap { (key, point) -> (String, CGPoint)? in
                guard point.confidence > 0.1 else { return nil }
                let norm = VNImagePointForNormalizedPoint(point.location,
                                                          Int(drawingView.bounds.width),
                                                          Int(drawingView.bounds.height))
                return (key.rawValue, norm)
            })
            let time = 1000 * observation.timeRange.start.seconds
            // Draw the points onscreen.
            DispatchQueue.main.async {
                self.drawingView.draw(points: mappedPoints)
            }
        }
    }
}
The drawingView.draw function belongs to a custom UIView overlaid on the camera view and draws the points using CALayer sublayers. The AVCaptureSession code is exactly the same as the sample code here.
I tried the VNDetectHumanBodyPoseRequest(completionHandler:) variant, but it made no difference to performance for me. I could try smoothing with a moving average filter, but there is still a problem with outlier predictions that are very inaccurate.
What am I missing?
This was a bug in iOS 14 beta 1-3, I think. After upgrading to beta 4 and later releases, tracking is much better. The API also became a bit clearer, with more fine-grained type names in the latest beta updates.
Note that I didn't get an official answer from Apple about this bug, but the problem will probably disappear completely in the official iOS 14 release.
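For the jitter and outlier predictions mentioned in the question, a small per-joint smoothing filter can also help, independent of the OS fix. Below is a minimal sketch in pure Swift; the `JointSmoother` type and the window size of 5 are illustrative, not part of Vision's API. A sliding-window median suppresses single-frame outliers better than a plain moving average:

```swift
import CoreGraphics

/// Smooths a stream of 2D joint positions with a sliding-window median,
/// which suppresses outlier predictions better than a plain moving average.
struct JointSmoother {
    private var window: [CGPoint] = []
    private let windowSize: Int

    init(windowSize: Int = 5) {
        self.windowSize = windowSize
    }

    mutating func smoothed(_ point: CGPoint) -> CGPoint {
        window.append(point)
        if window.count > windowSize { window.removeFirst() }
        // Take the median of each coordinate independently.
        let xs = window.map { $0.x }.sorted()
        let ys = window.map { $0.y }.sorted()
        return CGPoint(x: xs[xs.count / 2], y: ys[ys.count / 2])
    }
}
```

One would keep one `JointSmoother` per joint key (e.g. in a dictionary keyed by `key.rawValue`) and pass each frame's point through it before handing the result to `drawingView.draw`.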
Related
I'm trying to create an app that records video at 100 FPS using AVAssetWriter AND detects whether a person is performing an action using an Action Classifier from Create ML. But when I try to put the two together, the FPS drops to 30 while recording and detecting actions.
If I do the recording by itself, it records at 100 FPS.
I am able to set the camera's FPS to 100 through the device configuration.
The captureOutput function is set up as follows:
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    bufferImage = sampleBuffer
    guard let calibrationData = CMGetAttachment(sampleBuffer, key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, attachmentModeOut: nil) as? Data else {
        return
    }
    cameraCalibrationMatrix = calibrationData.withUnsafeBytes { $0.pointee }
    if self.isPredictorActivated == true {
        do {
            let poses = try predictor.processFrame(sampleBuffer)
            if predictor.isReadyToMakePrediction {
                let prediction = try predictor.makePrediction()
                let confidence = prediction.confidence * 100
                DispatchQueue.main.async {
                    self.predictionLabel.text = prediction.label + " " + String(confidence.rounded(toPlaces: 0))
                    if prediction.label == "HandsUp" && prediction.confidence > 0.85 {
                        print("Challenging")
                        self.didChallengeVideo()
                    }
                }
            }
        } catch {
            print(error.localizedDescription)
        }
    }
    let presentationTimeStamp = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
    if assetWriter == nil {
        createWriterInput(for: presentationTimeStamp)
    } else {
        let chunkDuration = CMTimeGetSeconds(CMTimeSubtract(presentationTimeStamp, chunkStartTime))
        if chunkDuration > 1500 || isChallenging {
            assetWriter.endSession(atSourceTime: presentationTimeStamp)
            // Make a copy, as finishWriting is asynchronous.
            let newChunkURL = chunkOutputURL!
            let chunkAssetWriter = assetWriter!
            chunkAssetWriter.finishWriting {
                print("finishWriting says: \(chunkAssetWriter.status.rawValue) \(String(describing: chunkAssetWriter.error))")
                print("queuing \(newChunkURL)")
                print("Chunk Duration: \(chunkDuration)")
                let asset = AVAsset(url: newChunkURL)
                print("FPS of CHUNK \(String(describing: asset.tracks.first?.nominalFrameRate))")
                if self.isChallenging {
                    self.challengeVideoProcess(video: asset)
                }
                self.isChallenging = false
            }
            createWriterInput(for: presentationTimeStamp)
        }
    }
    if !assetWriterInput.append(sampleBuffer) {
        print("append says NO: \(assetWriter.status.rawValue) \(String(describing: assetWriter.error))")
    }
}
Performing action classification every frame is quite expensive, so it may affect the overall performance of the app (including footage FPS). I don't know how often you need a prediction, but I would suggest running the Action Classifier at most 2-3 times per second and seeing if that helps.
Running the action classifier every frame won't change your classification much, because each run adds just one frame to the classifier's action window, so there is no need to run it that often.
For example, if your action classifier was set up with a 3 s window and trained on 30 fps videos, your classification is based on 3 × 30 = 90 frames. One frame won't make a difference.
Also make sure that your 100 fps footage matches the footage you used for training the action classifier. Otherwise you can get wrong predictions, because an Action Classifier trained on 30 fps video will treat 1 s of 100 fps footage as more than 3.33 s.
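The 2-3 predictions per second suggested above can be implemented with a simple frame counter inside captureOutput. A sketch in pure Swift (the `PredictionThrottle` name and the interval of 40 frames, i.e. ~2.5 predictions/s at 100 fps, are illustrative):

```swift
/// Decides whether to run the (expensive) action-classifier prediction on
/// this frame. Poses are still fed to the classifier's window every frame;
/// only the prediction itself is throttled.
struct PredictionThrottle {
    private var frameCount = 0
    private let interval: Int   // run a prediction every `interval` frames

    init(interval: Int) { self.interval = interval }

    mutating func shouldPredict() -> Bool {
        frameCount += 1
        if frameCount >= interval {
            frameCount = 0
            return true
        }
        return false
    }
}
```

Keep calling `predictor.processFrame(sampleBuffer)` on every frame so the classifier's action window stays full, and gate only `makePrediction()` behind `shouldPredict()`.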
I'm looking for a fast way to compare two frames of video and decide whether a lot has changed between them. This will be used to decide whether I should send a request to an image recognition service over REST, so I don't want to keep sending frames while the results would likely be the same. The Vuforia SDK does something similar. I'm starting with a frame buffer from ARKit, scaled to 640×480 and converted to an RGB888 vBuffer_image. Comparing just a few points could be enough, but it needs to determine reliably whether the difference is significant.
I started by calculating the difference between a few points using vDSP functions, but this has a disadvantage: if I move the camera even slightly left or right, the same points cover different portions of the image, and the calculated difference is high even if nothing really changed much.
I was thinking about using histograms, but I haven't tested that approach yet.
What would be the best solution for this? It needs to be fast; it can compare just a smaller version of the image, etc.
I have tested another approach using VNFeaturePrintObservation from Vision. This works a lot better, but I'm afraid it might be more CPU-demanding. I need to test it on some older devices. Anyway, here is the part of the code that works nicely. If someone can suggest a better approach to test, please let me know:
private var lastScanningImageFingerprint: VNFeaturePrintObservation?

// Returns true if the images are different enough.
private func compareScanningImages(current: VNFeaturePrintObservation, last: VNFeaturePrintObservation?) -> Bool {
    guard let last = last else { return true }
    var distance = Float(0)
    try! last.computeDistance(&distance, to: current)
    print(distance)
    return distance > 10
}

// After scanning is done, the subclass should prepare the suggestedTargets array.
private func performScanningIfNeeded(_ sender: Timer) {
    guard !scanningInProgress else { return } // Wait for the previous scan to finish.
    guard let vImageBuffer = delegate?.currentFrameScalledImage else { return }
    guard let image = CGImage.create(from: vImageBuffer) else { return }

    func featureprintObservationForImage(image: CGImage) -> VNFeaturePrintObservation? {
        let requestHandler = VNImageRequestHandler(cgImage: image, options: [:])
        let request = VNGenerateImageFeaturePrintRequest()
        do {
            try requestHandler.perform([request])
            return request.results?.first as? VNFeaturePrintObservation
        } catch {
            print("Vision error: \(error)")
            return nil
        }
    }

    guard let imageFingerprint = featureprintObservationForImage(image: image) else { return }
    guard compareScanningImages(current: imageFingerprint, last: lastScanningImageFingerprint) else { return }
    print("SCANN \(Date())")
    // Reuse the fingerprint we already computed instead of running the request again.
    lastScanningImageFingerprint = imageFingerprint
    executeScanning(on: image) { [weak self] in
        self?.scanningInProgress = false
    }
}
Tested on an older iPhone: as expected, this causes some frame drops in the camera preview, so I need a faster algorithm.
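One cheaper direction worth testing: average the scaled frame down to a coarse grid and compare grids between frames. Because each cell averages a whole block of pixels, a slight camera shift changes the cell means only a little, unlike the per-point vDSP comparison. A sketch over an 8-bit grayscale buffer (the grayscale layout, the grid size, and any threshold are assumptions to tune, not part of the question's code):

```swift
/// Reduces an 8-bit grayscale image to a coarse grid of block averages.
/// Comparing grids tolerates small camera shifts better than exact pixels.
func blockAverages(pixels: [UInt8], width: Int, height: Int, grid: Int = 8) -> [Double] {
    var result = [Double](repeating: 0, count: grid * grid)
    let bw = width / grid, bh = height / grid
    for gy in 0..<grid {
        for gx in 0..<grid {
            var sum = 0
            for y in (gy * bh)..<((gy + 1) * bh) {
                for x in (gx * bw)..<((gx + 1) * bw) {
                    sum += Int(pixels[y * width + x])
                }
            }
            result[gy * grid + gx] = Double(sum) / Double(bw * bh)
        }
    }
    return result
}

/// Mean absolute difference between two grids; compare against a tuned threshold.
func gridDistance(_ a: [Double], _ b: [Double]) -> Double {
    zip(a, b).map { abs($0 - $1) }.reduce(0, +) / Double(a.count)
}
```

Convert the 640×480 RGB888 buffer to grayscale (or run the function per channel), keep the previous frame's grid, and trigger the REST request only when `gridDistance` exceeds a threshold you tune empirically.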
I am trying to use the Google Mobile Vision API to detect when a user smiles in the camera feed. The problem is that the Google Mobile Vision API is not detecting any faces, while Apple's Vision API immediately recognizes and tracks every face I test the app with. I am using func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) { } to detect when a user is smiling. Apple's Vision API seems to work fine, but Google's API does not detect any faces. How would I fix my code to get Google's API to work as well? What am I doing wrong?
My Code...
var options = [GMVDetectorFaceTrackingEnabled: true,
               GMVDetectorFaceLandmarkType: GMVDetectorFaceLandmark.all.rawValue,
               GMVDetectorFaceMinSize: 0.15] as [String: Any]
var GfaceDetector = GMVDetector.init(ofType: GMVDetectorTypeFace, options: options)

extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {
        let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)
        let attachments = CMCopyDictionaryOfAttachments(kCFAllocatorDefault, sampleBuffer, kCMAttachmentMode_ShouldPropagate)
        let ciImage1 = CIImage(cvImageBuffer: pixelBuffer!, options: attachments as! [String: Any]?)
        let Gimage = UIImage(ciImage: ciImage1)
        var Gfaces = GfaceDetector?.features(in: Gimage, options: nil) as? [GMVFaceFeature]
        let options: [String: Any] = [CIDetectorImageOrientation: exifOrientation(orientation: UIDevice.current.orientation),
                                      CIDetectorSmile: true,
                                      CIDetectorEyeBlink: true]
        let allFeatures = faceDetector?.features(in: ciImage1, options: options)
        let formatDescription = CMSampleBufferGetFormatDescription(sampleBuffer)
        let cleanAperture = CMVideoFormatDescriptionGetCleanAperture(formatDescription!, false)
        var smilingProb = CGFloat()
        guard let features = allFeatures else { return }
        print("GFace \(String(describing: Gfaces?.count))")
        // THE PRINT ABOVE RETURNS 0

        // MARK: ------ Google System Setup
        for face: GMVFaceFeature in Gfaces! {
            print("Google1")
            if face.hasSmilingProbability {
                print("Google \(face.smilingProbability)")
                smilingProb = face.smilingProbability
            }
        }
        for feature in features {
            if let faceFeature = feature as? CIFaceFeature {
                let faceRect = calculateFaceRect(facePosition: faceFeature.mouthPosition, faceBounds: faceFeature.bounds, clearAperture: cleanAperture)
                let featureDetails = ["has smile: \(faceFeature.hasSmile), \(smilingProb)",
                                      "has closed left eye: \(faceFeature.leftEyeClosed)",
                                      "has closed right eye: \(faceFeature.rightEyeClosed)"]
                update(with: faceRect, text: featureDetails.joined(separator: "\n"))
            }
        }
        if features.count == 0 {
            DispatchQueue.main.async {
                self.detailsView.alpha = 0.0
            }
        }
    }
}
UPDATE
I copied and pasted the Google Mobile Vision detection code into another app and it worked. The difference was that, instead of constantly receiving frames, that app had only one image to analyze. Could this have something to do with how often I send requests, or with the format/quality of the CIImage?
ANOTHER UPDATE
I have identified an issue with how my app works. The image the API receives is not upright or aligned with the orientation of the phone. For example, if I hold my phone up in front of my face (in normal portrait mode), the image is rotated 90 degrees anticlockwise. I have absolutely no idea why this is happening, as the live camera preview looks normal. The Google docs say:
The face detector expects images and the faces in them to be in an upright orientation. If you need to rotate the image, pass in orientation information in the dictionary options with GMVDetectorImageOrientation key. The detector will rotate the images for you based on the orientation value.
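Based on the quoted documentation, the orientation can be passed per detection call via the GMVDetectorImageOrientation key rather than rotating the UIImage. A sketch follows; the GMVUtility helper name, its Swift signature, and the .back camera position are assumptions that should be checked against the GoogleMobileVision headers in your SDK version:

```swift
// Sketch: let GMVDetector rotate the frame itself instead of rotating the UIImage.
// GMVUtility (if present in your SDK version) maps the device orientation and
// camera position to the GMVImageOrientation value the detector expects.
let gmvOrientation = GMVUtility.imageOrientation(
    from: UIDevice.current.orientation,
    withCaptureDevicePosition: .back,     // use .front for the selfie camera
    defaultDeviceOrientation: .portrait)
let detectionOptions: [AnyHashable: Any] = [
    GMVDetectorImageOrientation: gmvOrientation.rawValue
]
// Pass the options on every call, since the device orientation can change.
let Gfaces = GfaceDetector?.features(in: Gimage, options: detectionOptions) as? [GMVFaceFeature]
```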
New Question: (I believe the answer to either one of these questions will solve my problem)
A: How would I use the GMVDetectorImageOrientation key to set the orientation correctly?
B: How would I rotate the UIImage 90 degrees clockwise (NOT the UIImageView)?
THIRD UPDATE
I have successfully rotated the image right side up, but Google Mobile Vision is still not detecting any faces. The image is a bit distorted, but I don't think the amount of distortion is affecting Google Mobile Vision's response. So:
How would I use the GMVDetectorImageOrientation key to set the orientation correctly?
ANY HELP/RESPONSE IS APPRECIATED.
I've set up an AVCaptureSession with a video data output and am attempting to use iOS 11's Vision framework to read QR codes. The camera is set up like basically any AVCaptureSession; I will abbreviate and just show the output setup.
let output = AVCaptureVideoDataOutput()
output.setSampleBufferDelegate(self, queue: captureQueue)
captureSession.addOutput(output)
// I did this to get the CVPixelBuffer to be oriented in portrait.
// I don't know if it's needed and I'm not sure it matters anyway.
output.connection(with: .video)!.videoOrientation = .portrait
So the camera is up and running as always. Here is the code I am using to perform a VNImageRequestHandler for QR codes.
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up, options: [:])
    let qrRequest = VNDetectBarcodesRequest { request, error in
        let barcodeObservations = request.results as? [VNBarcodeObservation]
        guard let qrCode = barcodeObservations?.flatMap({ $0.barcodeDescriptor as? CIQRCodeDescriptor }).first else { return }
        if let code = String(data: qrCode.errorCorrectedPayload, encoding: .isoLatin1) {
            debugPrint(code)
        }
    }
    qrRequest.symbologies = [.QR]
    try! imageRequestHandler.perform([qrRequest])
}
I am using a QR code that encodes http://www.google.com as a test. The debugPrint line prints out:
AVGG\u{03}¢ò÷wwrævöövÆRæ6öÐì\u{11}ì
I have tested this same QR code with the AVCaptureMetadataOutput that has been around for a while and that method decodes the QR code correctly. So my question is, what have I missed to get the output that I am getting?
(Obviously I could just use the AVCaptureMetadataOutput as a solution, because I can see that it works. But that doesn't help me learn how to use the Vision framework.)
Most likely the problem is here:
if let code = String(data: qrCode.errorCorrectedPayload, encoding: .isoLatin1)
Try using .utf8.
Also, I would suggest looking at the raw output of errorCorrectedPayload without any encoding; maybe it already has the correct encoding.
The definition of errorCorrectedPayload says:
-- QR Codes are formally specified in ISO/IEC 18004:2006(E). Section 6.4.10 "Bitstream to codeword conversion" specifies the set of 8-bit codewords in the symbol immediately prior to splitting the message into blocks and applying error correction. --
Using VNBarcodeObservation.payloadStringValue, instead of transforming VNBarcodeObservation.barcodeDescriptor, seems to work fine.
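The answer above in code form: errorCorrectedPayload is the raw codeword stream as specified in ISO/IEC 18004 (which still carries the mode indicator and character-count header), which is why decoding it directly as Latin-1 prints garbage. payloadStringValue sidesteps that because Vision has already parsed the bitstream. A sketch of the completion handler, reusing the names from the question's code:

```swift
let qrRequest = VNDetectBarcodesRequest { request, error in
    guard let barcodeObservations = request.results as? [VNBarcodeObservation] else { return }
    // Vision has already parsed the ISO/IEC 18004 bitstream for us;
    // payloadStringValue is the decoded message, or nil if undecodable.
    if let code = barcodeObservations.compactMap({ $0.payloadStringValue }).first {
        debugPrint(code) // "http://www.google.com" for the test code in the question
    }
}
qrRequest.symbologies = [.QR]
```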
I am trying to use Core Image's CIFaceFeature for detecting face emotions, since those are native APIs. I have created a sample view controller project and added the related code below. When I launch this iOS application, it opens the camera. When I look into the camera and smile, the sample code below detects it fine.
I also need to detect other emotions such as surprise, sadness, and anger. I understand that CIFaceFeature doesn't have direct APIs for these other emotions. But is it possible to combine the available properties (such as hasSmile, leftEyeClosed, rightEyeClosed, etc.) to detect emotions such as surprise, sadness, and anger in an iOS program?
If anyone has worked with these APIs or this scenario, please suggest and share your ideas.
func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {
    let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)
    let opaqueBuffer = Unmanaged<CVImageBuffer>.passUnretained(imageBuffer!).toOpaque()
    let pixelBuffer = Unmanaged<CVPixelBuffer>.fromOpaque(opaqueBuffer).takeUnretainedValue()
    let sourceImage = CIImage(cvPixelBuffer: pixelBuffer, options: nil)
    options = [CIDetectorSmile: true as AnyObject, CIDetectorEyeBlink: true as AnyObject, CIDetectorImageOrientation: 6 as AnyObject]
    let features = self.faceDetector!.features(in: sourceImage, options: options)
    for feature in features as! [CIFaceFeature] {
        if feature.hasSmile {
            DispatchQueue.main.async {
                self.updateSmileEmotion()
            }
        } else {
            DispatchQueue.main.async {
                self.resetEmotionLabel()
            }
        }
    }
}

func updateSmileEmotion() {
    self.emtionLabel.text = "HAPPY"
}

func resetEmotionLabel() {
    self.emtionLabel.text = " "
}
There are a variety of libraries that can do sentiment analysis on images, and most of them rely on machine learning. It's highly unlikely that you'll get the same kind of results just from what CIFaceFeature gives you, because it's pretty limited even in comparison to other facial recognition libraries. See Google Cloud Vision, the IBM Watson Cloud iOS SDK, and Microsoft Cognitive Services.