How to calculate regionOfInterest for iOS Vision's VNRequest

How to calculate regionOfInterest for iOS Vision's VNRequest - ios

We have implemented scanning app which shows camera preview at bottom of screen with 300px height and screen width.
What is the way of calculating RegionOfInterest to pass Vision to detect barcodes as per camera preview size?
We have configured camera as below-
func setupCamera() {
guard let captureDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: AVMediaType.video, position: .back) else {
print("Could not create capture device.")
return
}
self.captureDevice = captureDevice
if captureDevice.supportsSessionPreset(.hd4K3840x2160) {
captureSession.sessionPreset = AVCaptureSession.Preset.hd4K3840x2160
bufferAspectRatio = 3840.0 / 2160.0
} else {
captureSession.sessionPreset = AVCaptureSession.Preset.hd1920x1080
bufferAspectRatio = 1920.0 / 1080.0
}
guard let deviceInput = try? AVCaptureDeviceInput(device: captureDevice) else {
print("Could not create device input.")
return
}
if captureSession.canAddInput(deviceInput) {
captureSession.addInput(deviceInput)
}
// Configure video data output.
videoDataOutput.alwaysDiscardsLateVideoFrames = true
videoDataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
videoDataOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_420YpCbCr8BiPlanarFullRange]
if captureSession.canAddOutput(videoDataOutput) {
captureSession.addOutput(videoDataOutput)
videoDataOutput.connection(with: AVMediaType.video)?.preferredVideoStabilizationMode = .off
} else {
print("Could not add VDO output")
return
}
// Set zoom and autofocus to help focus on very small text.
do {
try captureDevice.lockForConfiguration()
//captureDevice.videoZoomFactor = 2
captureDevice.autoFocusRangeRestriction = .near
captureDevice.unlockForConfiguration()
} catch {
print("Could not set zoom level due to error: \(error)")
return
}
captureSession.startRunning()
}
Calculation of RegionOfInterest as below
func calculateRegionOfInterest() {
// Figure out size of ROI.
let size: CGSize = CGSize(width: previewWidth/screenWidth, height: previewHeight/screenHeight) //ratio of preview to screen.
// Make it centered.
regionOfInterest.origin = CGPoint(x: (1 - size.width) / 2, y: (1 - size.height) / 2)
regionOfInterest.size = size
// ROI changed, update transform.
setupOrientationAndTransform()
// Update the cutout to match the new ROI.
DispatchQueue.main.async {
// Wait for the next run cycle before updating the cutout. This
// ensures that the preview layer already has its new orientation.
self.updateCutout()
}
}
func updateCutout() {
// Figure out where the cutout ends up in layer coordinates.
let roiRectTransform = bottomToTopTransform.concatenating(uiRotationTransform)
let cutout = previewView.videoPreviewLayer.layerRectConverted(fromMetadataOutputRect: regionOfInterest.applying(roiRectTransform))
// Create the mask.
let path = UIBezierPath(rect: cutoutView.frame)
path.append(UIBezierPath(rect: cutout))
maskLayer.path = path.cgPath
// Move the number view down to under cutout.
var numFrame = cutout
numFrame.origin.y += numFrame.size.height
numberView.frame = numFrame
}
func setupOrientationAndTransform() {
// Recalculate the affine transform between Vision coordinates and AVF coordinates.
// Compensate for region of interest.
let roi = regionOfInterest
roiToGlobalTransform = CGAffineTransform(translationX: roi.origin.x, y: roi.origin.y).scaledBy(x: roi.width, y: roi.height)
// Compensate for orientation (buffers always come in the same orientation).
switch currentOrientation {
case .landscapeLeft:
textOrientation = CGImagePropertyOrientation.up
uiRotationTransform = CGAffineTransform.identity
case .landscapeRight:
textOrientation = CGImagePropertyOrientation.down
uiRotationTransform = CGAffineTransform(translationX: 1, y: 1).rotated(by: CGFloat.pi)
case .portraitUpsideDown:
textOrientation = CGImagePropertyOrientation.left
uiRotationTransform = CGAffineTransform(translationX: 1, y: 0).rotated(by: CGFloat.pi / 2)
default: // We default everything else to .portraitUp
textOrientation = CGImagePropertyOrientation.right
uiRotationTransform = CGAffineTransform(translationX: 0, y: 1).rotated(by: -CGFloat.pi / 2)
}
// Full Vision ROI to AVF transform.
visionToAVFTransform = roiToGlobalTransform.concatenating(bottomToTopTransform).concatenating(uiRotationTransform)
}
Setting regionOfInterest and textOrientation to VNImageRequestHandler as below-
override func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
var request: VNDetectBarcodesRequest()
if #available(iOS 15.0, *) {
request.revision = VNDetectBarcodesRequestRevision2
} else {
// Fallback on earlier versions
request.revision = VNDetectBarcodesRequestRevision1
}
// Only run on the region of interest for maximum speed.
request.regionOfInterest = regionOfInterest
let requestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: textOrientation, options: [:])
do {
try requestHandler.perform([request])
} catch {
print(error)
}
}
}
Region of Interest is working as expected for iPad in Portrait mode but for iPod and iPad in Landscape mode the above code does not scan barcodes from top and bottom corners when barcode is completely visible in camera preview.
Reference links:
Vision framework barcode detection region of interest not working
https://developer.apple.com/documentation/vision/reading_phone_numbers_in_real_time

Related

Number text recognition not highlighting/recognizing text

I am following the apple phone number recognition sample. Normally it creates a red outline around the recognized text. Mine does not seem to do recognizing the text and creating the red outline even though I used their code. The only difference is my view controller class is called "TextScanViewController" where their's is just "ViewController". I went through and made sure that any "ViewControllers" were changed to "TextScanViewController". Am I missing something else that I should change?
Here is what it should look like (when I use the original Apple project) compared to what it is doing (should have red outlines but is not showing them even if the text is perfectly in the center of the rectangle)
Should look like:
Looks like:
There are 5 different swift files I am using (PreviewView, TextScanViewController, VisionViewController, StringUtils, AppDelegate)
TextScanViewController:
import UIKit
import AVFoundation
import Vision
class TextScanViewController: UIViewController {
// MARK: - UI objects
#IBOutlet weak var previewView: PreviewView!
#IBOutlet weak var cutoutView: UIView!
#IBOutlet weak var numberView: UILabel!
var maskLayer = CAShapeLayer()
// Device orientation. Updated whenever the orientation changes to a
// different supported orientation.
var currentOrientation = UIDeviceOrientation.portrait
// MARK: - Capture related objects
private let captureSession = AVCaptureSession()
let captureSessionQueue = DispatchQueue(label: "com.example.apple-samplecode.CaptureSessionQueue")
var captureDevice: AVCaptureDevice?
var videoDataOutput = AVCaptureVideoDataOutput()
let videoDataOutputQueue = DispatchQueue(label: "com.example.apple-samplecode.VideoDataOutputQueue")
// MARK: - Region of interest (ROI) and text orientation
// Region of video data output buffer that recognition should be run on.
// Gets recalculated once the bounds of the preview layer are known.
var regionOfInterest = CGRect(x: 0, y: 0, width: 1, height: 1)
// Orientation of text to search for in the region of interest.
var textOrientation = CGImagePropertyOrientation.up
// MARK: - Coordinate transforms
var bufferAspectRatio: Double!
// Transform from UI orientation to buffer orientation.
var uiRotationTransform = CGAffineTransform.identity
// Transform bottom-left coordinates to top-left.
var bottomToTopTransform = CGAffineTransform(scaleX: 1, y: -1).translatedBy(x: 0, y: -1)
// Transform coordinates in ROI to global coordinates (still normalized).
var roiToGlobalTransform = CGAffineTransform.identity
// Vision -> AVF coordinate transform.
var visionToAVFTransform = CGAffineTransform.identity
// MARK: - View controller methods
override func viewDidLoad() {
super.viewDidLoad()
// Set up preview view.
previewView.session = captureSession
// Set up cutout view.
cutoutView.backgroundColor = UIColor.gray.withAlphaComponent(0.5)
maskLayer.backgroundColor = UIColor.clear.cgColor
maskLayer.fillRule = .evenOdd
cutoutView.layer.mask = maskLayer
// Starting the capture session is a blocking call. Perform setup using
// a dedicated serial dispatch queue to prevent blocking the main thread.
captureSessionQueue.async {
self.setupCamera()
// Calculate region of interest now that the camera is setup.
DispatchQueue.main.async {
// Figure out initial ROI.
self.calculateRegionOfInterest()
}
}
}
override func viewWillTransition(to size: CGSize, with coordinator: UIViewControllerTransitionCoordinator) {
super.viewWillTransition(to: size, with: coordinator)
// Only change the current orientation if the new one is landscape or
// portrait. You can't really do anything about flat or unknown.
let deviceOrientation = UIDevice.current.orientation
if deviceOrientation.isPortrait || deviceOrientation.isLandscape {
currentOrientation = deviceOrientation
}
// Handle device orientation in the preview layer.
if let videoPreviewLayerConnection = previewView.videoPreviewLayer.connection {
if let newVideoOrientation = AVCaptureVideoOrientation(deviceOrientation: deviceOrientation) {
videoPreviewLayerConnection.videoOrientation = newVideoOrientation
}
}
// Orientation changed: figure out new region of interest (ROI).
calculateRegionOfInterest()
}
override func viewDidLayoutSubviews() {
super.viewDidLayoutSubviews()
updateCutout()
}
// MARK: - Setup
func calculateRegionOfInterest() {
// In landscape orientation the desired ROI is specified as the ratio of
// buffer width to height. When the UI is rotated to portrait, keep the
// vertical size the same (in buffer pixels). Also try to keep the
// horizontal size the same up to a maximum ratio.
let desiredHeightRatio = 0.15
let desiredWidthRatio = 0.6
let maxPortraitWidth = 0.8
// Figure out size of ROI.
let size: CGSize
if currentOrientation.isPortrait || currentOrientation == .unknown {
size = CGSize(width: min(desiredWidthRatio * bufferAspectRatio, maxPortraitWidth), height: desiredHeightRatio / bufferAspectRatio)
} else {
size = CGSize(width: desiredWidthRatio, height: desiredHeightRatio)
}
// Make it centered.
regionOfInterest.origin = CGPoint(x: (1 - size.width) / 2, y: (1 - size.height) / 2)
regionOfInterest.size = size
// ROI changed, update transform.
setupOrientationAndTransform()
// Update the cutout to match the new ROI.
DispatchQueue.main.async {
// Wait for the next run cycle before updating the cutout. This
// ensures that the preview layer already has its new orientation.
self.updateCutout()
}
}
func updateCutout() {
// Figure out where the cutout ends up in layer coordinates.
let roiRectTransform = bottomToTopTransform.concatenating(uiRotationTransform)
let cutout = previewView.videoPreviewLayer.layerRectConverted(fromMetadataOutputRect: regionOfInterest.applying(roiRectTransform))
// Create the mask.
let path = UIBezierPath(rect: cutoutView.frame)
path.append(UIBezierPath(rect: cutout))
maskLayer.path = path.cgPath
// Move the number view down to under cutout.
var numFrame = cutout
numFrame.origin.y += numFrame.size.height
numberView.frame = numFrame
}
func setupOrientationAndTransform() {
// Recalculate the affine transform between Vision coordinates and AVF coordinates.
// Compensate for region of interest.
let roi = regionOfInterest
roiToGlobalTransform = CGAffineTransform(translationX: roi.origin.x, y: roi.origin.y).scaledBy(x: roi.width, y: roi.height)
// Compensate for orientation (buffers always come in the same orientation).
switch currentOrientation {
case .landscapeLeft:
textOrientation = CGImagePropertyOrientation.up
uiRotationTransform = CGAffineTransform.identity
case .landscapeRight:
textOrientation = CGImagePropertyOrientation.down
uiRotationTransform = CGAffineTransform(translationX: 1, y: 1).rotated(by: CGFloat.pi)
case .portraitUpsideDown:
textOrientation = CGImagePropertyOrientation.left
uiRotationTransform = CGAffineTransform(translationX: 1, y: 0).rotated(by: CGFloat.pi / 2)
default: // We default everything else to .portraitUp
textOrientation = CGImagePropertyOrientation.right
uiRotationTransform = CGAffineTransform(translationX: 0, y: 1).rotated(by: -CGFloat.pi / 2)
}
// Full Vision ROI to AVF transform.
visionToAVFTransform = roiToGlobalTransform.concatenating(bottomToTopTransform).concatenating(uiRotationTransform)
}
func setupCamera() {
guard let captureDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: AVMediaType.video, position: .back) else {
print("Could not create capture device.")
return
}
self.captureDevice = captureDevice
// NOTE:
// Requesting 4k buffers allows recognition of smaller text but will
// consume more power. Use the smallest buffer size necessary to keep
// down battery usage.
if captureDevice.supportsSessionPreset(.hd4K3840x2160) {
captureSession.sessionPreset = AVCaptureSession.Preset.hd4K3840x2160
bufferAspectRatio = 3840.0 / 2160.0
} else {
captureSession.sessionPreset = AVCaptureSession.Preset.hd1920x1080
bufferAspectRatio = 1920.0 / 1080.0
}
guard let deviceInput = try? AVCaptureDeviceInput(device: captureDevice) else {
print("Could not create device input.")
return
}
if captureSession.canAddInput(deviceInput) {
captureSession.addInput(deviceInput)
}
// Configure video data output.
videoDataOutput.alwaysDiscardsLateVideoFrames = true
videoDataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
videoDataOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_420YpCbCr8BiPlanarFullRange]
if captureSession.canAddOutput(videoDataOutput) {
captureSession.addOutput(videoDataOutput)
// NOTE:
// There is a trade-off to be made here. Enabling stabilization will
// give temporally more stable results and should help the recognizer
// converge. But if it's enabled the VideoDataOutput buffers don't
// match what's displayed on screen, which makes drawing bounding
// boxes very hard. Disable it in this app to allow drawing detected
// bounding boxes on screen.
videoDataOutput.connection(with: AVMediaType.video)?.preferredVideoStabilizationMode = .off
} else {
print("Could not add VDO output")
return
}
// Set zoom and autofocus to help focus on very small text.
do {
try captureDevice.lockForConfiguration()
captureDevice.videoZoomFactor = 2
captureDevice.autoFocusRangeRestriction = .near
captureDevice.unlockForConfiguration()
} catch {
print("Could not set zoom level due to error: \(error)")
return
}
captureSession.startRunning()
}
// MARK: - UI drawing and interaction
func showString(string: String) {
// Found a definite number.
// Stop the camera synchronously to ensure that no further buffers are
// received. Then update the number view asynchronously.
captureSessionQueue.sync {
self.captureSession.stopRunning()
DispatchQueue.main.async {
self.numberView.text = string
self.numberView.isHidden = false
}
}
}
#IBAction func handleTap(_ sender: UITapGestureRecognizer) {
captureSessionQueue.async {
if !self.captureSession.isRunning {
self.captureSession.startRunning()
}
DispatchQueue.main.async {
self.numberView.isHidden = true
}
}
}
}
// MARK: - AVCaptureVideoDataOutputSampleBufferDelegate
extension TextScanViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
// This is implemented in VisionViewController.
}
}
// MARK: - Utility extensions
extension AVCaptureVideoOrientation {
init?(deviceOrientation: UIDeviceOrientation) {
switch deviceOrientation {
case .portrait: self = .portrait
case .portraitUpsideDown: self = .portraitUpsideDown
case .landscapeLeft: self = .landscapeRight
case .landscapeRight: self = .landscapeLeft
default: return nil
}
}
}
PreviewView:
import Foundation
import UIKit
import AVFoundation
class PreviewView: UIView {
var videoPreviewLayer: AVCaptureVideoPreviewLayer {
guard let layer = layer as? AVCaptureVideoPreviewLayer else {
fatalError("Expected `AVCaptureVideoPreviewLayer` type for layer. Check PreviewView.layerClass implementation.")
}
return layer
}
var session: AVCaptureSession? {
get {
return videoPreviewLayer.session
}
set {
videoPreviewLayer.session = newValue
}
}
// MARK: UIView
override class var layerClass: AnyClass {
return AVCaptureVideoPreviewLayer.self
}
}
VisionViewController:
import UIKit
import AVFoundation
import Vision
class VisionViewController: TextScanViewController {
var request: VNRecognizeTextRequest!
// Temporal string tracker
let numberTracker = StringTracker()
override func viewDidLoad() {
// Set up vision request before letting ViewController set up the camera
// so that it exists when the first buffer is received.
request = VNRecognizeTextRequest(completionHandler: recognizeTextHandler)
super.viewDidLoad()
}
// MARK: - Text recognition
// Vision recognition handler.
func recognizeTextHandler(request: VNRequest, error: Error?) {
var numbers = [String]()
var redBoxes = [CGRect]() // Shows all recognized text lines
var greenBoxes = [CGRect]() // Shows words that might be serials
guard let results = request.results as? [VNRecognizedTextObservation] else {
return
}
let maximumCandidates = 1
for visionResult in results {
guard let candidate = visionResult.topCandidates(maximumCandidates).first else { continue }
// Draw red boxes around any detected text, and green boxes around
// any detected phone numbers. The phone number may be a substring
// of the visionResult. If a substring, draw a green box around the
// number and a red box around the full string. If the number covers
// the full result only draw the green box.
var numberIsSubstring = true
if let result = candidate.string.extractPhoneNumber() {
let (range, number) = result
// Number may not cover full visionResult. Extract bounding box
// of substring.
if let box = try? candidate.boundingBox(for: range)?.boundingBox {
numbers.append(number)
greenBoxes.append(box)
numberIsSubstring = !(range.lowerBound == candidate.string.startIndex && range.upperBound == candidate.string.endIndex)
}
}
if numberIsSubstring {
redBoxes.append(visionResult.boundingBox)
}
}
// Log any found numbers.
numberTracker.logFrame(strings: numbers)
show(boxGroups: [(color: UIColor.red.cgColor, boxes: redBoxes), (color: UIColor.green.cgColor, boxes: greenBoxes)])
// Check if we have any temporally stable numbers.
if let sureNumber = numberTracker.getStableString() {
showString(string: sureNumber)
numberTracker.reset(string: sureNumber)
}
}
override func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
// Configure for running in real-time.
request.recognitionLevel = .fast
// Language correction won't help recognizing phone numbers. It also
// makes recognition slower.
request.usesLanguageCorrection = false
// Only run on the region of interest for maximum speed.
request.regionOfInterest = regionOfInterest
let requestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: textOrientation, options: [:])
do {
try requestHandler.perform([request])
} catch {
print(error)
}
}
}
// MARK: - Bounding box drawing
// Draw a box on screen. Must be called from main queue.
var boxLayer = [CAShapeLayer]()
func draw(rect: CGRect, color: CGColor) {
let layer = CAShapeLayer()
layer.opacity = 0.5
layer.borderColor = color
layer.borderWidth = 2
layer.frame = rect
boxLayer.append(layer)
previewView.videoPreviewLayer.insertSublayer(layer, at: 1)
}
// Remove all drawn boxes. Must be called on main queue.
func removeBoxes() {
for layer in boxLayer {
layer.removeFromSuperlayer()
}
boxLayer.removeAll()
}
typealias ColoredBoxGroup = (color: CGColor, boxes: [CGRect])
// Draws groups of colored boxes.
func show(boxGroups: [ColoredBoxGroup]) {
DispatchQueue.main.async {
let layer = self.previewView.videoPreviewLayer
self.removeBoxes()
for boxGroup in boxGroups {
let color = boxGroup.color
for box in boxGroup.boxes {
let rect = layer.layerRectConverted(fromMetadataOutputRect: box.applying(self.visionToAVFTransform))
self.draw(rect: rect, color: color)
}
}
}
}
}
StringUtils:
import Foundation
extension Character {
// Given a list of allowed characters, try to convert self to those in list
// if not already in it. This handles some common misclassifications for
// characters that are visually similar and can only be correctly recognized
// with more context and/or domain knowledge. Some examples (should be read
// in Menlo or some other font that has different symbols for all characters):
// 1 and l are the same character in Times New Roman
// I and l are the same character in Helvetica
// 0 and O are extremely similar in many fonts
// oO, wW, cC, sS, pP and others only differ by size in many fonts
func getSimilarCharacterIfNotIn(allowedChars: String) -> Character {
let conversionTable = [
"s": "S",
"S": "5",
"5": "S",
"o": "O",
"Q": "O",
"O": "0",
"0": "O",
"l": "I",
"I": "1",
"1": "I",
"B": "8",
"8": "B"
]
// Allow a maximum of two substitutions to handle 's' -> 'S' -> '5'.
let maxSubstitutions = 2
var current = String(self)
var counter = 0
while !allowedChars.contains(current) && counter < maxSubstitutions {
if let altChar = conversionTable[current] {
current = altChar
counter += 1
} else {
// Doesn't match anything in our table. Give up.
break
}
}
return current.first!
}
}
extension String {
// Extracts the first US-style phone number found in the string, returning
// the range of the number and the number itself as a tuple.
// Returns nil if no number is found.
func extractPhoneNumber() -> (Range<String.Index>, String)? {
// Do a first pass to find any substring that could be a US phone
// number. This will match the following common patterns and more:
// xxx-xxx-xxxx
// xxx xxx xxxx
// (xxx) xxx-xxxx
// (xxx)xxx-xxxx
// xxx.xxx.xxxx
// xxx xxx-xxxx
// xxx/xxx.xxxx
// +1-xxx-xxx-xxxx
// Note that this doesn't only look for digits since some digits look
// very similar to letters. This is handled later.
let pattern = #"""
(?x) # Verbose regex, allows comments
(?:\+1-?)? # Potential international prefix, may have -
[(]? # Potential opening (
\b(\w{3}) # Capture xxx
[)]? # Potential closing )
[\ -./]? # Potential separator
(\w{3}) # Capture xxx
[\ -./]? # Potential separator
(\w{4})\b # Capture xxxx
"""#
guard let range = self.range(of: pattern, options: .regularExpression, range: nil, locale: nil) else {
// No phone number found.
return nil
}
// Potential number found. Strip out punctuation, whitespace and country
// prefix.
var phoneNumberDigits = ""
let substring = String(self[range])
let nsrange = NSRange(substring.startIndex..., in: substring)
do {
// Extract the characters from the substring.
let regex = try NSRegularExpression(pattern: pattern, options: [])
if let match = regex.firstMatch(in: substring, options: [], range: nsrange) {
for rangeInd in 1 ..< match.numberOfRanges {
let range = match.range(at: rangeInd)
let matchString = (substring as NSString).substring(with: range)
phoneNumberDigits += matchString as String
}
}
} catch {
print("Error \(error) when creating pattern")
}
// Must be exactly 10 digits.
guard phoneNumberDigits.count == 10 else {
return nil
}
// Substitute commonly misrecognized characters, for example: 'S' -> '5' or 'l' -> '1'
var result = ""
let allowedChars = "0123456789"
for var char in phoneNumberDigits {
char = char.getSimilarCharacterIfNotIn(allowedChars: allowedChars)
guard allowedChars.contains(char) else {
return nil
}
result.append(char)
}
return (range, result)
}
}
class StringTracker {
var frameIndex: Int64 = 0
typealias StringObservation = (lastSeen: Int64, count: Int64)
// Dictionary of seen strings. Used to get stable recognition before
// displaying anything.
var seenStrings = [String: StringObservation]()
var bestCount = Int64(0)
var bestString = ""
func logFrame(strings: [String]) {
for string in strings {
if seenStrings[string] == nil {
seenStrings[string] = (lastSeen: Int64(0), count: Int64(-1))
}
seenStrings[string]?.lastSeen = frameIndex
seenStrings[string]?.count += 1
print("Seen \(string) \(seenStrings[string]?.count ?? 0) times")
}
var obsoleteStrings = [String]()
// Go through strings and prune any that have not been seen in while.
// Also find the (non-pruned) string with the greatest count.
for (string, obs) in seenStrings {
// Remove previously seen text after 30 frames (~1s).
if obs.lastSeen < frameIndex - 30 {
obsoleteStrings.append(string)
}
// Find the string with the greatest count.
let count = obs.count
if !obsoleteStrings.contains(string) && count > bestCount {
bestCount = Int64(count)
bestString = string
}
}
// Remove old strings.
for string in obsoleteStrings {
seenStrings.removeValue(forKey: string)
}
frameIndex += 1
}
func getStableString() -> String? {
// Require the recognizer to see the same string at least 10 times.
if bestCount >= 10 {
return bestString
} else {
return nil
}
}
func reset(string: String) {
seenStrings.removeValue(forKey: string)
bestCount = 0
bestString = ""
}
}
AppDelegate:
import UIKit
#UIApplicationMain
class AppDelegate: UIResponder, UIApplicationDelegate {
var window: UIWindow?
}

I was using the wrong class on the view controller.. instead of it being TextScanViewController it should have been set to Visionviewcontroller... it was skipping a whole class. I didn't realize how classes are inherited and that there was an important order to them. I have a lot to learn but learning a lot! :)

AVCaptureVideoPreviewLayer converts capture device coordinates incorrectly

I'm working on implementing a camera barcode scanner with Vision framework. My setup seems fairly simple: a preview layer for video from the device's rear camera, frames from the video stream are processed using VNDetectBarcodesRequest, and if a barcode is detected, I would like to show a rectangle around it in the preview.
Vision framework coordinates are normalized to the video frame size, so I have to convert them back to the layer coordinates to be able to draw the rectangle. The recommended way to do this is to use AVCaptureVideoPreviewLayer.layerPointConverted(fromCaptureDevicePoint:). There is also a difference between the point of origin (it's in the lower left corner for Vision framework), so I have to take care of that as well. Here's what the result looks like:
let convertPoint = { (point: CGPoint) in
let coordinateFlip = CGAffineTransform.identity
.translatedBy(x: 0, y: 1)
.scaledBy(x: 1, y: -1)
return videoLayer.layerPointConverted(fromCaptureDevicePoint: point.applying(coordinateFlip))
}
The problem is that this conversion is incorrect: the rectangle ends up being rotated 90º clockwise:
I have no idea why, I set video orientation to .portrait on both AVCaptureVideoPreviewLayer.connection and AVCaptureVideoDataOutput.connection(with: .video). The Vision coordinates seem to be correct, if I convert them manually like so:
extension VNBarcodeObservation {
func verticesConvertedToCoordinatesOf(_ targetLayer: CALayer) -> [CGPoint] {
let layerSize = targetLayer.bounds.size
return [topLeft, topRight, bottomRight, bottomLeft]
.map { CGPoint(x: $0.x * layerSize.width, y: $0.y * layerSize.height) }
}
}
the result is correct:
Why is AVCaptureVideoPreviewLayer.layerPointConverted(fromCaptureDevicePoint:) returning incorrect results?
It seems like somehow AVCaptureVideoPreviewLayer is stuck in the .landscapeRight orientation.
Here's how I set up video capture:
let captureSessionQueue = DispatchQueue.global(qos: .userInteractive)
let camera = AVCaptureDevice.default(for: .video)
let session = AVCaptureSession()
session.beginConfiguration()
session.setCamera(camera, preferredPresets: [.hd1280x720])
let request = VNDetectBarcodesRequest(symbologies: VNDetectBarcodesRequest.supportedSymbologies)
request.regionOfInterest = CGRect(x: 0.15, y: 0.45, width: 0.7, height: 0.4)
let delegate = CaptureVideoDataOutputDelegate()
delegate.vnRequests = [request]
let deviceOutput = AVCaptureVideoDataOutput()
deviceOutput.setSampleBufferDelegate(delegate, queue: captureSessionQueue)
deviceOutput.alwaysDiscardsLateVideoFrames = true
session.addOutput(deviceOutput)
deviceOutput.connection(with: .video)?.videoOrientation = .portrait
session.commitConfiguration()
captureSessionQueue.async {
session.startRunning()
}
and CaptureVideoDataOutputDelegate:
public class CaptureVideoDataOutputDelegate: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
public var vnRequests: [VNRequest]?
private let vnHandler = VNSequenceRequestHandler()
private var frameCounter: Int64 = 0
public func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
defer { frameCounter += 1 }
guard
// Try to reduce CPU usage and energy consumption by processing only every 10th frame
frameCounter.isMultiple(of: 10),
let vnRequests = vnRequests,
let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)
else { return }
do {
let size = CGSize(width: CVPixelBufferGetWidth(pixelBuffer), height: CVPixelBufferGetHeight(pixelBuffer))
print(">>> will scan a \(size) frame #\(frameCounter) for barcodes (connection: \(connection), orientation: \(connection.videoOrientation))")
try vnHandler.perform(vnRequests, on: pixelBuffer, orientation: .up)
} catch {
print(">>> failed to scan frame #\(frameCounter) for barcodes: \(error)")
}
}
}
I have verified that both deviceOutput.connection(with: .video) and videoLayer.connection are not nil when I try to configure video orientation. Examining the size of buffer received by CaptureVideoDataOutputDelegate confirms that orientation is correct (the size is 720x1280, not 1280x720). Orientation of the connection is also .portrait at this point.

AVCaptureVideoDataOutputSampleBufferDelegate drop frames using CIFilters for video filtering

I have very strange case where AVCaptureVideoDataOutputSampleBufferDelegate drops frames if I use 13 different filter chains. Let me explain:
I have CameraController setup, nothing special, here is my delegate method:
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
if !paused {
if connection.output?.connection(with: .audio) == nil {
//capture video
// my try to avoid "Out of buffers error", no luck ;(
lastCapturedBuffer = nil
let err = CMSampleBufferCreateCopy(allocator: kCFAllocatorDefault, sampleBuffer: sampleBuffer, sampleBufferOut: &lastCapturedBuffer)
if err == noErr {
}
connection.videoOrientation = .portrait
// getting image
let pixelBuffer = CMSampleBufferGetImageBuffer(lastCapturedBuffer!)
// remove if any
CVPixelBufferLockBaseAddress(pixelBuffer!, CVPixelBufferLockFlags(rawValue: 0))
// captured - is just ciimage property
captured = CIImage(cvPixelBuffer: pixelBuffer!)
//remove if any
CVPixelBufferUnlockBaseAddress(pixelBuffer!,CVPixelBufferLockFlags(rawValue: 0))
//CVPixelBufferUnlockBaseAddress(pixelBuffer!, .readOnly)
// transform image to targer resolution
let srcWidth = CGFloat(captured.extent.width)
let srcHeight = CGFloat(captured.extent.height)
let dstWidth: CGFloat = ConstantsManager.shared.k_video_width
let dstHeight: CGFloat = ConstantsManager.shared.k_video_height
let scaleX = dstWidth / srcWidth
let scaleY = dstHeight / srcHeight
var transform = CGAffineTransform.init(scaleX: scaleX, y: scaleY)
captured = captured.transformed(by: transform).cropped(to: CGRect(x: 0, y: 0, width: dstWidth, height: dstHeight))
// mirror for front camera
if front {
var t = CGAffineTransform.init(scaleX: -1, y: 1)
t = t.translatedBy(x: -ConstantsManager.shared.k_video_width, y: 0)
captured = captured.transformed(by: t)
}
// video capture logic
let writable = canWrite()
if writable,
sessionAtSourceTime == nil {
sessionAtSourceTime = CMSampleBufferGetPresentationTimeStamp(lastCapturedBuffer!)
videoWriter.startSession(atSourceTime: sessionAtSourceTime!)
}
if writable, (videoWriterInput.isReadyForMoreMediaData) {
videoWriterInput.append(lastCapturedBuffer!)
}
// apply effect in realtime <- here is problem. If I comment next line, it will be fixed but effect will n't be applied
captured = FilterManager.shared.applyFilterForCamera(inputImage: captured)
// current frame in case user wants to save image as photo
self.capturedPhoto = captured
// sent frame to Camcoder view controller
self.delegate?.didCapturedFrame(frame: captured)
} else {
// capture sound
let writable = canWrite()
if writable, (audioWriterInput.isReadyForMoreMediaData) {
//print("write audio buffer")
audioWriterInput?.append(lastCapturedBuffer!)
}
}
} else {
// paused
}
}
I also implemented didDrop delegate method, here is how I figure out why it drops frames:
func captureOutput(_ output: AVCaptureOutput, didDrop sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
print("did drop")
var mode: CMAttachmentMode = 0
let reason = CMGetAttachment(sampleBuffer, key: kCMSampleBufferAttachmentKey_DroppedFrameReason, attachmentModeOut: &mode)
print("reason \(String(describing: reason))") // Optional(OutOfBuffers)
}
So I did it like a pro and just commented parts of code to find where is the problem. So, it here:
captured = FilterManager.shared.applyFilterForCamera(inputImage: captured)
FilterManager - is singleton, here is called func:
func applyFilterForCamera(inputImage: CIImage) -> CIImage {
return currentVsFilter!.apply(sourceImage: inputImage)
}
currentVsFilter is object of VSFilter type - here is example of one:
import Foundation
import AVKit
class TestFilter: CustomFilter {
let _name = "Тестовый Фильтр"
let _displayName = "Test Filter"
var tempImage: CIImage?
var final: CGImage?
override func name() -> String {
return _name
}
override func displayName() -> String {
return _displayName
}
override init() {
super.init()
print("Test Filter init")
// setup my custom kernel filter
self.noise.type = GlitchFilter.GlitchType.allCases[2]
}
// this returns composition for playback using AVPlayer
override func composition(asset: AVAsset) -> AVMutableVideoComposition {
let composition = AVMutableVideoComposition(asset: asset, applyingCIFiltersWithHandler: { request in
let inputImage = request.sourceImage.cropped(to: request.sourceImage.extent)
DispatchQueue.global(qos: .userInitiated).async {
let output = self.apply(sourceImage: inputImage, forComposition: true)
request.finish(with: output, context: nil)
}
})
let size = FilterManager.shared.cropRectForOrientation().size
composition.renderSize = size
return composition
}
// this returns actual filtered CIImage, used for both AVPlayer composition and realtime camera
override func apply(sourceImage: CIImage, forComposition: Bool = false) -> CIImage {
// rendered text
tempImage = FilterManager.shared.textRenderedImage()
// some filters chained one by one
self.screenBlend?.setValue(tempImage, forKey: kCIInputImageKey)
self.screenBlend?.setValue(sourceImage, forKey: kCIInputBackgroundImageKey)
self.noise.inputImage = self.screenBlend?.outputImage
self.noise.inputAmount = CGFloat.random(in: 1.0...3.0)
// result
tempImage = self.noise.outputImage
// correct crop
let rect = forComposition ? FilterManager.shared.cropRectForOrientation() : FilterManager.shared.cropRect
final = self.context.createCGImage(tempImage!, from: rect!)
return CIImage(cgImage: final!)
}
}
And now, the most strange thing, I have 30 VSFilters and when I got to 13(switching one by one by UIButton) I got error "Out of Buffer", this one:
kCMSampleBufferDroppedFrameReason_OutOfBuffers
What I tested:
I changed vsFilters order in filters array inside FilterManager singleton - same
I tried switch from first to 12 one by one, then go back - works, but after I switched to 13tn(of 30th from 0) - bug
Looks like it can handle only 12 VSFIlter objects, like if it retains them somehow or maybe it's related to threading, I don't know.
This app made for iOs devices, tested on iPhone X iOs 13.3.1
This is video editor app to apply different effects to both live stream from camera and video files from camera roll
Maybe someone has experience with this?
Have a great day
Best, Victor
Edit 1. If I reinit cameraController(AVCaptureSession. input/output devices) it works but this is ugly option and it adds lag when switching filters

Ok, so I finally won this battle. In case some one else get this "OutOfBuffer" problem, here is my solution
As I figured out, CIFilter grabs CVPixelBuffer and don't release it while filtering images. It's kinda creates one huge buffer, I guess. Strange thing: it don't create memory leak, so I guess it grabs not particular buffer but creates strong reference to it. As rumors(me) say, it can handle only 12 such references.
So, my approach was to copy CVPixelBuffer and then work with it instead of buffer I got from AVCaptureVideoDataOutputSampleBufferDelegate didOutput func
Here is my new code:
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
if !paused {
//print("camera controller \(id) got frame")
if connection.output?.connection(with: .audio) == nil {
//capture video
connection.videoOrientation = .portrait
// getting image
guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
// this works!
let copyBuffer = pixelBuffer.copy()
// captured - is just ciimage property
captured = CIImage(cvPixelBuffer: copyBuffer)
//remove if any
// transform image to targer resolution
let srcWidth = CGFloat(captured.extent.width)
let srcHeight = CGFloat(captured.extent.height)
let dstWidth: CGFloat = ConstantsManager.shared.k_video_width
let dstHeight: CGFloat = ConstantsManager.shared.k_video_height
let scaleX = dstWidth / srcWidth
let scaleY = dstHeight / srcHeight
var transform = CGAffineTransform.init(scaleX: scaleX, y: scaleY)
captured = captured.transformed(by: transform).cropped(to: CGRect(x: 0, y: 0, width: dstWidth, height: dstHeight))
// mirror for front camera
if front {
var t = CGAffineTransform.init(scaleX: -1, y: 1)
t = t.translatedBy(x: -ConstantsManager.shared.k_video_width, y: 0)
captured = captured.transformed(by: t)
}
// video capture logic
let writable = canWrite()
if writable,
sessionAtSourceTime == nil {
sessionAtSourceTime = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
videoWriter.startSession(atSourceTime: sessionAtSourceTime!)
}
if writable, (videoWriterInput.isReadyForMoreMediaData) {
videoWriterInput.append(sampleBuffer)
}
self.captured = FilterManager.shared.applyFilterForCamera(inputImage: self.captured)
// current frame in case user wants to save image as photo
self.capturedPhoto = captured
// sent frame to Camcoder view controller
self.delegate?.didCapturedFrame(frame: captured)
} else {
// capture sound
let writable = canWrite()
if writable, (audioWriterInput.isReadyForMoreMediaData) {
//print("write audio buffer")
audioWriterInput?.append(sampleBuffer)
}
}
} else {
// paused
//print("paused camera controller \(id)")
}
}
and there is func to copy buffer:
func copy() -> CVPixelBuffer {
precondition(CFGetTypeID(self) == CVPixelBufferGetTypeID(), "copy() cannot be called on a non-CVPixelBuffer")
var _copy : CVPixelBuffer?
CVPixelBufferCreate(
kCFAllocatorDefault,
CVPixelBufferGetWidth(self),
CVPixelBufferGetHeight(self),
CVPixelBufferGetPixelFormatType(self),
nil,
&_copy)
guard let copy = _copy else { fatalError() }
CVPixelBufferLockBaseAddress(self, CVPixelBufferLockFlags.readOnly)
CVPixelBufferLockBaseAddress(copy, CVPixelBufferLockFlags(rawValue: 0))
let copyBaseAddress = CVPixelBufferGetBaseAddress(copy)
let currBaseAddress = CVPixelBufferGetBaseAddress(self)
print("copy data size: \(CVPixelBufferGetDataSize(copy))")
print("self data size: \(CVPixelBufferGetDataSize(self))")
memcpy(copyBaseAddress, currBaseAddress, CVPixelBufferGetDataSize(copy))
//memcpy(copyBaseAddress, currBaseAddress, CVPixelBufferGetDataSize(self) * 2)
CVPixelBufferUnlockBaseAddress(copy, CVPixelBufferLockFlags(rawValue: 0))
CVPixelBufferUnlockBaseAddress(self, CVPixelBufferLockFlags.readOnly)
return copy
}
I used it as extension
I hope, this will help anyone with similar problem
Best, Victor

Toggle flash in ios swift

I am building an image clasifier app. On camera screen I have a switch button which I want to use to toggle flash so that user can switch on flash in low light.
Here is my code:
import UIKit
import AVFoundation
import Vision
// controlling the pace of the machine vision analysis
var lastAnalysis: TimeInterval = 0
var pace: TimeInterval = 0.33 // in seconds, classification will not repeat faster than this value
// performance tracking
let trackPerformance = false // use "true" for performance logging
var frameCount = 0
let framesPerSample = 10
var startDate = NSDate.timeIntervalSinceReferenceDate
var flash=0
class ImageDetectionViewController: UIViewController {
var callBackImageDetection :(State)->Void = { state in
}
#IBOutlet weak var previewView: UIView!
#IBOutlet weak var stackView: UIStackView!
#IBOutlet weak var lowerView: UIView!
#IBAction func swithch(_ sender: UISwitch) {
if(sender.isOn == true)
{
stopActiveSession();
let captureSession=AVCaptureSession()
let captureDevice: AVCaptureDevice?
setupCamera(flash: 1)
}
}
var previewLayer: AVCaptureVideoPreviewLayer!
let bubbleLayer = BubbleLayer(string: "")
let queue = DispatchQueue(label: "videoQueue")
var captureSession = AVCaptureSession()
var captureDevice: AVCaptureDevice?
let videoOutput = AVCaptureVideoDataOutput()
var unknownCounter = 0 // used to track how many unclassified images in a row
let confidence: Float = 0.8
// MARK: Load the Model
let targetImageSize = CGSize(width: 227, height: 227) // must match model data input
lazy var classificationRequest: [VNRequest] = {
do {
// Load the Custom Vision model.
// To add a new model, drag it to the Xcode project browser making sure that the "Target Membership" is checked.
// Then update the following line with the name of your new model.
// let model = try VNCoreMLModel(for: Fruit().model)
let model = try VNCoreMLModel(for: CodigocubeAI().model)
let classificationRequest = VNCoreMLRequest(model: model, completionHandler: self.handleClassification)
return [ classificationRequest ]
} catch {
fatalError("Can't load Vision ML model: \(error)")
}
}()
// MARK: Handle image classification results
func handleClassification(request: VNRequest, error: Error?) {
guard let observations = request.results as? [VNClassificationObservation]
else { fatalError("unexpected result type from VNCoreMLRequest") }
guard let best = observations.first else {
fatalError("classification didn't return any results")
}
// Use results to update user interface (includes basic filtering)
print("\(best.identifier): \(best.confidence)")
if best.identifier.starts(with: "Unknown") || best.confidence < confidence {
if self.unknownCounter < 3 { // a bit of a low-pass filter to avoid flickering
self.unknownCounter += 1
} else {
self.unknownCounter = 0
DispatchQueue.main.async {
self.bubbleLayer.string = nil
}
}
} else {
self.unknownCounter = 0
DispatchQueue.main.async {[weak self] in
guard let strongSelf = self
else
{
return
}
// Trimming labels because they sometimes have unexpected line endings which show up in the GUI
let identifierString = best.identifier.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)
strongSelf.bubbleLayer.string = identifierString
let state : State = strongSelf.getState(identifierStr: identifierString)
strongSelf.stopActiveSession()
strongSelf.navigationController?.popViewController(animated: true)
strongSelf.callBackImageDetection(state)
}
}
}
func getState(identifierStr:String)->State
{
var state :State = .none
if identifierStr == "entertainment"
{
state = .entertainment
}
else if identifierStr == "geography"
{
state = .geography
}
else if identifierStr == "history"
{
state = .history
}
else if identifierStr == "knowledge"
{
state = .education
}
else if identifierStr == "science"
{
state = .science
}
else if identifierStr == "sports"
{
state = .sports
}
else
{
state = .none
}
return state
}
// MARK: Lifecycle
override func viewDidLoad() {
super.viewDidLoad()
previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
previewView.layer.addSublayer(previewLayer)
}
override func viewDidAppear(_ animated: Bool) {
self.edgesForExtendedLayout = UIRectEdge.init(rawValue: 0)
bubbleLayer.opacity = 0.0
bubbleLayer.position.x = self.view.frame.width / 2.0
bubbleLayer.position.y = lowerView.frame.height / 2
lowerView.layer.addSublayer(bubbleLayer)
setupCamera(flash:2)
}
override func viewDidLayoutSubviews() {
super.viewDidLayoutSubviews()
previewLayer.frame = previewView.bounds;
}
// MARK: Camera handling
func setupCamera(flash :Int) {
let deviceDiscovery = AVCaptureDevice.DiscoverySession(deviceTypes: [.builtInWideAngleCamera], mediaType: .video, position: .back)
if let device = deviceDiscovery.devices.last {
if(flash == 1)
{
if (device.hasTorch) {
do {
try device.lockForConfiguration()
if (device.isTorchAvailable) {
do {
try device.setTorchModeOn(level:0.2 )
}
catch
{
print(error)
}
device.unlockForConfiguration()
}
}
catch
{
print(error)
}
}
}
captureDevice = device
beginSession()
}
}
func beginSession() {
do {
videoOutput.videoSettings = [((kCVPixelBufferPixelFormatTypeKey as NSString) as String) : (NSNumber(value: kCVPixelFormatType_32BGRA) as! UInt32)]
videoOutput.alwaysDiscardsLateVideoFrames = true
videoOutput.setSampleBufferDelegate(self, queue: queue)
captureSession.sessionPreset = .hd1920x1080
captureSession.addOutput(videoOutput)
let input = try AVCaptureDeviceInput(device: captureDevice!)
captureSession.addInput(input)
captureSession.startRunning()
} catch {
print("error connecting to capture device")
}
}
func stopActiveSession()
{
if captureSession.isRunning == true
{
captureSession.stopRunning()
}
}
override func viewWillDisappear(_ animated: Bool) {
self.stopActiveSession()
}
deinit {
print("deinit called")
}
}
// MARK: Video Data Delegate
extension ImageDetectionViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
// called for each frame of video
func captureOutput(_ captureOutput: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
let currentDate = NSDate.timeIntervalSinceReferenceDate
// control the pace of the machine vision to protect battery life
if currentDate - lastAnalysis >= pace {
lastAnalysis = currentDate
} else {
return // don't run the classifier more often than we need
}
// keep track of performance and log the frame rate
if trackPerformance {
frameCount = frameCount + 1
if frameCount % framesPerSample == 0 {
let diff = currentDate - startDate
if (diff > 0) {
if pace > 0.0 {
print("WARNING: Frame rate of image classification is being limited by \"pace\" setting. Set to 0.0 for fastest possible rate.")
}
print("\(String.localizedStringWithFormat("%0.2f", (diff/Double(framesPerSample))))s per frame (average)")
}
startDate = currentDate
}
}
// Crop and resize the image data.
// Note, this uses a Core Image pipeline that could be appended with other pre-processing.
// If we don't want to do anything custom, we can remove this step and let the Vision framework handle
// crop and resize as long as we are careful to pass the orientation properly.
guard let croppedBuffer = croppedSampleBuffer(sampleBuffer, targetSize: targetImageSize) else {
return
}
do {
let classifierRequestHandler = VNImageRequestHandler(cvPixelBuffer: croppedBuffer, options: [:])
try classifierRequestHandler.perform(classificationRequest)
} catch {
print(error)
}
}
}
let context = CIContext()
var rotateTransform: CGAffineTransform?
var scaleTransform: CGAffineTransform?
var cropTransform: CGAffineTransform?
var resultBuffer: CVPixelBuffer?
func croppedSampleBuffer(_ sampleBuffer: CMSampleBuffer, targetSize: CGSize) -> CVPixelBuffer? {
guard let imageBuffer: CVImageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
fatalError("Can't convert to CVImageBuffer.")
}
// Only doing these calculations once for efficiency.
// If the incoming images could change orientation or size during a session, this would need to be reset when that happens.
if rotateTransform == nil {
let imageSize = CVImageBufferGetEncodedSize(imageBuffer)
let rotatedSize = CGSize(width: imageSize.height, height: imageSize.width)
guard targetSize.width < rotatedSize.width, targetSize.height < rotatedSize.height else {
fatalError("Captured image is smaller than image size for model.")
}
let shorterSize = (rotatedSize.width < rotatedSize.height) ? rotatedSize.width : rotatedSize.height
rotateTransform = CGAffineTransform(translationX: imageSize.width / 2.0, y: imageSize.height / 2.0).rotated(by: -CGFloat.pi / 2.0).translatedBy(x: -imageSize.height / 2.0, y: -imageSize.width / 2.0)
let scale = targetSize.width / shorterSize
scaleTransform = CGAffineTransform(scaleX: scale, y: scale)
// Crop input image to output size
let xDiff = rotatedSize.width * scale - targetSize.width
let yDiff = rotatedSize.height * scale - targetSize.height
cropTransform = CGAffineTransform(translationX: xDiff/2.0, y: yDiff/2.0)
}
// Convert to CIImage because it is easier to manipulate
let ciImage = CIImage(cvImageBuffer: imageBuffer)
let rotated = ciImage.transformed(by: rotateTransform!)
let scaled = rotated.transformed(by: scaleTransform!)
let cropped = scaled.transformed(by: cropTransform!)
// Note that the above pipeline could be easily appended with other image manipulations.
// For example, to change the image contrast. It would be most efficient to handle all of
// the image manipulation in a single Core Image pipeline because it can be hardware optimized.
// Only need to create this buffer one time and then we can reuse it for every frame
if resultBuffer == nil {
let result = CVPixelBufferCreate(kCFAllocatorDefault, Int(targetSize.width), Int(targetSize.height), kCVPixelFormatType_32BGRA, nil, &resultBuffer)
guard result == kCVReturnSuccess else {
fatalError("Can't allocate pixel buffer.")
}
}
// Render the Core Image pipeline to the buffer
context.render(cropped, to: resultBuffer!)
// For debugging
// let image = imageBufferToUIImage(resultBuffer!)
// print(image.size) // set breakpoint to see image being provided to CoreML
return resultBuffer
}
// Only used for debugging.
// Turns an image buffer into a UIImage that is easier to display in the UI or debugger.
func imageBufferToUIImage(_ imageBuffer: CVImageBuffer) -> UIImage {
CVPixelBufferLockBaseAddress(imageBuffer, CVPixelBufferLockFlags(rawValue: 0))
let baseAddress = CVPixelBufferGetBaseAddress(imageBuffer)
let bytesPerRow = CVPixelBufferGetBytesPerRow(imageBuffer)
let width = CVPixelBufferGetWidth(imageBuffer)
let height = CVPixelBufferGetHeight(imageBuffer)
let colorSpace = CGColorSpaceCreateDeviceRGB()
let bitmapInfo = CGBitmapInfo(rawValue: CGImageAlphaInfo.noneSkipFirst.rawValue | CGBitmapInfo.byteOrder32Little.rawValue)
let context = CGContext(data: baseAddress, width: width, height: height, bitsPerComponent: 8, bytesPerRow: bytesPerRow, space: colorSpace, bitmapInfo: bitmapInfo.rawValue)
let quartzImage = context!.makeImage()
CVPixelBufferUnlockBaseAddress(imageBuffer, CVPixelBufferLockFlags(rawValue: 0))
let image = UIImage(cgImage: quartzImage!, scale: 1.0, orientation: .right)
return image
}
I am getting error An AVCaptureOutput instance may not be added to more than one session'
Now I want to give user the facility to toggle flash. How to destroy active camera session and open new with flash on?
Can anyone help me also any other way to achieve this?

Face Detection with Camera

How can I do face detection in realtime just as "Camera" does?
I noticed that AVCaptureStillImageOutput is deprecated after 10.0, so I use
AVCapturePhotoOutput instead. However, I found that the image I saved for facial detection is not so satisfied? Any ideas?
UPDATE
After giving a try of #Shravya Boggarapu mentioned. Currently, I use AVCaptureMetadataOutput to detect the face without CIFaceDetector. It works as expected. However, when I'm trying to draw bounds of the face, it seems mislocated. Any idea?
let metaDataOutput = AVCaptureMetadataOutput()
captureSession.sessionPreset = AVCaptureSessionPresetPhoto
let backCamera = AVCaptureDevice.defaultDevice(withDeviceType: .builtInWideAngleCamera, mediaType: AVMediaTypeVideo, position: .back)
do {
let input = try AVCaptureDeviceInput(device: backCamera)
if (captureSession.canAddInput(input)) {
captureSession.addInput(input)
// MetadataOutput instead
if(captureSession.canAddOutput(metaDataOutput)) {
captureSession.addOutput(metaDataOutput)
metaDataOutput.setMetadataObjectsDelegate(self, queue: DispatchQueue.main)
metaDataOutput.metadataObjectTypes = [AVMetadataObjectTypeFace]
previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
previewLayer?.frame = cameraView.bounds
previewLayer?.videoGravity = AVLayerVideoGravityResizeAspectFill
cameraView.layer.addSublayer(previewLayer!)
captureSession.startRunning()
}
}
} catch {
print(error.localizedDescription)
}
and
extension CameraViewController: AVCaptureMetadataOutputObjectsDelegate {
func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputMetadataObjects metadataObjects: [Any]!, from connection: AVCaptureConnection!) {
if findFaceControl {
findFaceControl = false
for metadataObject in metadataObjects {
if (metadataObject as AnyObject).type == AVMetadataObjectTypeFace {
print("😇😍😎")
print(metadataObject)
let bounds = (metadataObject as! AVMetadataFaceObject).bounds
print("origin x: \(bounds.origin.x)")
print("origin y: \(bounds.origin.y)")
print("size width: \(bounds.size.width)")
print("size height: \(bounds.size.height)")
print("cameraView width: \(self.cameraView.frame.width)")
print("cameraView height: \(self.cameraView.frame.height)")
var face = CGRect()
face.origin.x = bounds.origin.x * self.cameraView.frame.width
face.origin.y = bounds.origin.y * self.cameraView.frame.height
face.size.width = bounds.size.width * self.cameraView.frame.width
face.size.height = bounds.size.height * self.cameraView.frame.height
print(face)
showBounds(at: face)
}
}
}
}
}
Original
see in Github
var captureSession = AVCaptureSession()
var photoOutput = AVCapturePhotoOutput()
var previewLayer: AVCaptureVideoPreviewLayer?
override func viewWillAppear(_ animated: Bool) {
super.viewWillAppear(true)
captureSession.sessionPreset = AVCaptureSessionPresetHigh
let backCamera = AVCaptureDevice.defaultDevice(withMediaType: AVMediaTypeVideo)
do {
let input = try AVCaptureDeviceInput(device: backCamera)
if (captureSession.canAddInput(input)) {
captureSession.addInput(input)
if(captureSession.canAddOutput(photoOutput)){
captureSession.addOutput(photoOutput)
captureSession.startRunning()
previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
previewLayer?.videoGravity = AVLayerVideoGravityResizeAspectFill
previewLayer?.frame = cameraView.bounds
cameraView.layer.addSublayer(previewLayer!)
}
}
} catch {
print(error.localizedDescription)
}
}
func captureImage() {
let settings = AVCapturePhotoSettings()
let previewPixelType = settings.availablePreviewPhotoPixelFormatTypes.first!
let previewFormat = [kCVPixelBufferPixelFormatTypeKey as String: previewPixelType
]
settings.previewPhotoFormat = previewFormat
photoOutput.capturePhoto(with: settings, delegate: self)
}
func capture(_ captureOutput: AVCapturePhotoOutput, didFinishProcessingPhotoSampleBuffer photoSampleBuffer: CMSampleBuffer?, previewPhotoSampleBuffer: CMSampleBuffer?, resolvedSettings: AVCaptureResolvedPhotoSettings, bracketSettings: AVCaptureBracketedStillImageSettings?, error: Error?) {
if let error = error {
print(error.localizedDescription)
}
// Not include previewPhotoSampleBuffer
if let sampleBuffer = photoSampleBuffer,
let dataImage = AVCapturePhotoOutput.jpegPhotoDataRepresentation(forJPEGSampleBuffer: sampleBuffer, previewPhotoSampleBuffer: nil) {
self.imageView.image = UIImage(data: dataImage)
self.imageView.isHidden = false
self.previewLayer?.isHidden = true
self.findFace(img: self.imageView.image!)
}
}
The findFace works with normal image. However, the image I capture via camera will not work or sometimes only recognize one face.
Normal Image
Capture Image
func findFace(img: UIImage) {
guard let faceImage = CIImage(image: img) else { return }
let accuracy = [CIDetectorAccuracy: CIDetectorAccuracyHigh]
let faceDetector = CIDetector(ofType: CIDetectorTypeFace, context: nil, options: accuracy)
// For converting the Core Image Coordinates to UIView Coordinates
let detectedImageSize = faceImage.extent.size
var transform = CGAffineTransform(scaleX: 1, y: -1)
transform = transform.translatedBy(x: 0, y: -detectedImageSize.height)
if let faces = faceDetector?.features(in: faceImage, options: [CIDetectorSmile: true, CIDetectorEyeBlink: true]) {
for face in faces as! [CIFaceFeature] {
// Apply the transform to convert the coordinates
var faceViewBounds = face.bounds.applying(transform)
// Calculate the actual position and size of the rectangle in the image view
let viewSize = imageView.bounds.size
let scale = min(viewSize.width / detectedImageSize.width,
viewSize.height / detectedImageSize.height)
let offsetX = (viewSize.width - detectedImageSize.width * scale) / 2
let offsetY = (viewSize.height - detectedImageSize.height * scale) / 2
faceViewBounds = faceViewBounds.applying(CGAffineTransform(scaleX: scale, y: scale))
print("faceBounds = \(faceViewBounds)")
faceViewBounds.origin.x += offsetX
faceViewBounds.origin.y += offsetY
showBounds(at: faceViewBounds)
}
if faces.count != 0 {
print("Number of faces: \(faces.count)")
} else {
print("No faces 😢")
}
}
}
func showBounds(at bounds: CGRect) {
let indicator = UIView(frame: bounds)
indicator.frame = bounds
indicator.layer.borderWidth = 3
indicator.layer.borderColor = UIColor.red.cgColor
indicator.backgroundColor = .clear
self.imageView.addSubview(indicator)
faceBoxes.append(indicator)
}

There are two ways to detect faces: CIFaceDetector and AVCaptureMetadataOutput. Depending on your requirements, choose what is relevant for you.
CIFaceDetector has more features, it gives you the location of the eyes and mouth, a smile detector, etc.
On the other hand, AVCaptureMetadataOutput is computed on the frames and the detected faces are tracked and there is no extra code to be added by us. I find that, because of tracking. faces are detected more reliably in this process. The downside of this is that you will simply detect faces, no the position of the eyes or mouth.
Another advantage of this method is that orientation issues are smaller as you can use videoOrientation whenever the device orientation changes and the orientation of the faces will relative to that orientation.
In my case, my application uses YUV420 as the required format so using CIDetector (which works with RGB) in real-time was not viable. Using AVCaptureMetadataOutput saved a lot of effort and performed more reliably due to continuous tracking.
Once I had the bounding box for the faces, I coded extra features, such as skin detection and applied it on the still image.
Note: When you capture a still image, the face box information is added along with the metadata so there are no sync issues.
You can also use a combination of the two to get better results.
Explore and evaluate the pros and cons as per your application.
The face rectangle is wrt image origin. So, for the screen, it may be different.
Use:
for (AVMetadataFaceObject *faceFeatures in metadataObjects) {
CGRect face = faceFeatures.bounds;
CGRect facePreviewBounds = CGRectMake(face.origin.y * previewLayerRect.size.width,
face.origin.x * previewLayerRect.size.height,
face.size.width * previewLayerRect.size.height,
face.size.height * previewLayerRect.size.width);
/* Draw rectangle facePreviewBounds on screen */
}

To perform face detection on iOS, there are either CIDetector (Apple)
or Mobile Vision (Google) API.
IMO, Google Mobile Vision provides better performance.
If you are interested, here is the project you can play with. (iOS 10.2, Swift 3)
After WWDC 2017, Apple introduces CoreML in iOS 11.
The Vision framework makes the face detection more accurate :)
I've made a Demo Project. containing Vision v.s. CIDetector. Also, it contains face landmarks detection in real time.

A bit late, but here it is the solution for the coordinates problem. There is a method you can call on the preview layer to transform the metadata object to your coordinate system: transformedMetadataObject(for: metadataObject).
guard let transformedObject = previewLayer.transformedMetadataObject(for: metadataObject) else {
continue
}
let bounds = transformedObject.bounds
showBounds(at: bounds)
Source: https://developer.apple.com/documentation/avfoundation/avcapturevideopreviewlayer/1623501-transformedmetadataobjectformeta
By the way, in case you are using (or upgrade your project to) Swift 4, the delegate method of AVCaptureMetadataOutputsObject has change into:
func metadataOutput(_ output: AVCaptureMetadataOutput, didOutput metadataObjects: [AVMetadataObject], from connection: AVCaptureConnection)
Kind regards

extension CameraViewController: AVCaptureMetadataOutputObjectsDelegate {
func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputMetadataObjects metadataObjects: [Any]!, from connection: AVCaptureConnection!) {
if findFaceControl {
findFaceControl = false
let faces = metadata.flatMap { $0 as? AVMetadataFaceObject } .flatMap { (face) -> CGRect in
guard let localizedFace =
previewLayer?.transformedMetadataObject(for: face) else { return nil }
return localizedFace.bounds }
for face in faces {
let temp = UIView(frame: face)
temp.layer.borderColor = UIColor.white
temp.layer.borderWidth = 2.0
view.addSubview(view: temp)
}
}
}
}
Be sure to remove the views created by didOutputMetadataObjects.
Keeping track of the active facial ids is the best way to do this ^
Also when you're trying to find the location of faces for your preview layer, it is much easier to use facial data and transform. Also I think CIDetector is junk, metadataoutput will use hardware stuff for face detection making it really fast.

Create CaptureSession
For AVCaptureVideoDataOutput create following settings
output.videoSettings = [ kCVPixelBufferPixelFormatTypeKey as AnyHashable: Int(kCMPixelFormat_32BGRA) ]
3.When you receive CMSampleBuffer, create image
DispatchQueue.main.async {
let sampleImg = self.imageFromSampleBuffer(sampleBuffer: sampleBuffer)
self.imageView.image = sampleImg
}
func imageFromSampleBuffer(sampleBuffer : CMSampleBuffer) -> UIImage
{
// Get a CMSampleBuffer's Core Video image buffer for the media data
let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
// Lock the base address of the pixel buffer
CVPixelBufferLockBaseAddress(imageBuffer!, CVPixelBufferLockFlags.readOnly);
// Get the number of bytes per row for the pixel buffer
let baseAddress = CVPixelBufferGetBaseAddress(imageBuffer!);
// Get the number of bytes per row for the pixel buffer
let bytesPerRow = CVPixelBufferGetBytesPerRow(imageBuffer!);
// Get the pixel buffer width and height
let width = CVPixelBufferGetWidth(imageBuffer!);
let height = CVPixelBufferGetHeight(imageBuffer!);
// Create a device-dependent RGB color space
let colorSpace = CGColorSpaceCreateDeviceRGB();
// Create a bitmap graphics context with the sample buffer data
var bitmapInfo: UInt32 = CGBitmapInfo.byteOrder32Little.rawValue
bitmapInfo |= CGImageAlphaInfo.premultipliedFirst.rawValue & CGBitmapInfo.alphaInfoMask.rawValue
//let bitmapInfo: UInt32 = CGBitmapInfo.alphaInfoMask.rawValue
let context = CGContext.init(data: baseAddress, width: width, height: height, bitsPerComponent: 8, bytesPerRow: bytesPerRow, space: colorSpace, bitmapInfo: bitmapInfo)
// Create a Quartz image from the pixel data in the bitmap graphics context
let quartzImage = context?.makeImage();
// Unlock the pixel buffer
CVPixelBufferUnlockBaseAddress(imageBuffer!, CVPixelBufferLockFlags.readOnly);
// Create an image object from the Quartz image
let image = UIImage.init(cgImage: quartzImage!);
return (image);
}

By looking at your code I detected 2 things that could lead to wrong/poor face detection.
One of them is the face detector features options where you are filtering the results by [CIDetectorSmile: true, CIDetectorEyeBlink: true]. Try to set it to nil: faceDetector?.features(in: faceImage, options: nil)
Another guess I have is the result image orientation. I noticed you use AVCapturePhotoOutput.jpegPhotoDataRepresentation method to generate the source image for the detection and the system, by default, it generates that image with a specific orientation, of type Left/LandscapeLeft, I think. So, basically you can tell the face detector to have that in mind by using the CIDetectorImageOrientation key.
CIDetectorImageOrientation: the value for this key is an integer NSNumber from 1..8 such as that found in kCGImagePropertyOrientation. If present, the detection will be done based on that orientation but the coordinates in the returned features will still be based on those of the image.
Try to set it like faceDetector?.features(in: faceImage, options: [CIDetectorImageOrientation: 8 /*Left, bottom*/]).

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart