Piping AudioKit Microphone to Google Speech-to-Text - iOS

I'm trying to get AudioKit to pipe the microphone to Google's Speech-to-Text API as seen here, but I'm not entirely sure how to go about it.
To prepare the audio for the Speech-to-Text engine, you need to set up the encoding and pass it through in chunks. In the example Google uses, they use Apple's AVFoundation, but I'd like to use AudioKit so I can perform some pre-processing, such as cutting off low amplitudes.
I believe the right way to do this is to use a Tap:
First, I should match the format by:
var asbd = AudioStreamBasicDescription()
asbd.mSampleRate = 16000.0
asbd.mFormatID = kAudioFormatLinearPCM
asbd.mFormatFlags = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked
asbd.mBytesPerPacket = 2
asbd.mFramesPerPacket = 1
asbd.mBytesPerFrame = 2
asbd.mChannelsPerFrame = 1
asbd.mBitsPerChannel = 16
AudioKit.format = AVAudioFormat(streamDescription: &asbd)!
Then create a tap such as:
open class TestTap {
internal let bufferSize: UInt32 = 1_024
@objc public init(_ input: AKNode?) {
input?.avAudioNode.installTap(onBus: 0, bufferSize: bufferSize, format: AudioKit.format) { buffer, _ in
// do work here
}
}
}
But I wasn't able to identify the right way of handling this data so it can be sent to the Google Speech-to-Text API via the streamAudioData method in real time with AudioKit. Perhaps I am going about this the wrong way?
UPDATE:
I've created a Tap as such:
open class TestTap {
internal var audioData = NSMutableData()
internal let bufferSize: UInt32 = 1_024
func toData(buffer: AVAudioPCMBuffer) -> NSData {
let channelCount = 2 // given the PCMBuffer channel count is 2
let channels = UnsafeBufferPointer(start: buffer.floatChannelData, count: channelCount)
return NSData(bytes: channels[0], length:Int(buffer.frameCapacity * buffer.format.streamDescription.pointee.mBytesPerFrame))
}
@objc public init(_ input: AKNode?) {
input?.avAudioNode.installTap(onBus: 0, bufferSize: bufferSize, format: AudioKit.format) { buffer, _ in
self.audioData.append(self.toData(buffer: buffer) as Data)
// We recommend sending samples in 100ms chunks (from Google)
let chunkSize: Int /* bytes/chunk */ = Int(0.1 /* seconds/chunk */
* AudioKit.format.sampleRate /* samples/second */
* 2 /* bytes/sample */ )
if self.audioData.length > chunkSize {
SpeechRecognitionService
.sharedInstance
.streamAudioData(self.audioData,
completion: { response, error in
if let error = error {
print("ERROR: \(error.localizedDescription)")
SpeechRecognitionService.sharedInstance.stopStreaming()
} else if let response = response {
print(response)
}
})
self.audioData = NSMutableData()
}
}
}
}
and in viewDidLoad:, I'm setting AudioKit up with:
AKSettings.sampleRate = 16_000
AKSettings.bufferLength = .shortest
However, Google complains with:
ERROR: Audio data is being streamed too fast. Please stream audio data approximately at real time.
I've tried changing multiple parameters, such as the chunk size, to no avail.

I found the solution here.
Final code for my Tap is:
open class GoogleSpeechToTextStreamingTap {
internal var converter: AVAudioConverter!
@objc public init(_ input: AKNode?, sampleRate: Double = 16000.0) {
let format = AVAudioFormat(commonFormat: AVAudioCommonFormat.pcmFormatInt16, sampleRate: sampleRate, channels: 1, interleaved: false)!
self.converter = AVAudioConverter(from: AudioKit.format, to: format)
self.converter?.sampleRateConverterAlgorithm = AVSampleRateConverterAlgorithm_Normal
self.converter?.sampleRateConverterQuality = .max
let sampleRateRatio = AKSettings.sampleRate / sampleRate
let inputBufferSize = 4410 // 100ms of 44.1K = 4410 samples.
input?.avAudioNode.installTap(onBus: 0, bufferSize: AVAudioFrameCount(inputBufferSize), format: nil) { buffer, time in
let capacity = Int(Double(buffer.frameCapacity) / sampleRateRatio)
let bufferPCM16 = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(capacity))!
var error: NSError? = nil
self.converter?.convert(to: bufferPCM16, error: &error) { inNumPackets, outStatus in
outStatus.pointee = AVAudioConverterInputStatus.haveData
return buffer
}
let channel = UnsafeBufferPointer(start: bufferPCM16.int16ChannelData!, count: 1)
let data = Data(bytes: channel[0], count: capacity * 2)
SpeechRecognitionService
.sharedInstance
.streamAudioData(data,
completion: { response, error in
if let error = error {
print("ERROR: \(error.localizedDescription)")
SpeechRecognitionService.sharedInstance.stopStreaming()
} else if let response = response {
print(response)
}
})
}
}
}
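For reference, this is roughly how I wire the tap up. Treat it as a sketch: the booster is only there to keep the monitored output silent, and the exact start-up call depends on your AudioKit version:
let mic = AKMicrophone()
let silence = AKBooster(mic, gain: 0)            // keep the mic routed but inaudible
AudioKit.output = silence
let tap = GoogleSpeechToTextStreamingTap(mic)    // keep a strong reference to the tap
try AudioKit.start()                             // AudioKit.start() does not throw in older versions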

You can likely record using AKNodeRecorder and pass along the buffer from the resulting AKAudioFile to the API. If you want something closer to real time, you could try installing a tap on the avAudioNode property of the AKNode you want to record and pass the buffers to the API continuously.
However, I'm curious why you see the need for pre-processing - I'm sure the Google API is plenty optimized for recordings produced by the sample code you noted.
I've had a lot of success / fun with the iOS Speech API. Not sure if there's a reason you want to go with the Google API, but I'd consider checking it out and seeing if it might better serve your needs if you haven't already.
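For comparison, here's a rough sketch of the Speech framework route (SFSpeechRecognizer fed from an AVAudioEngine tap); authorization UI and error handling are omitted, so treat it as a starting point rather than production code:
import Speech
import AVFoundation

let recognizer = SFSpeechRecognizer()
let request = SFSpeechAudioBufferRecognitionRequest()
let engine = AVAudioEngine()

SFSpeechRecognizer.requestAuthorization { status in
    guard status == .authorized else { return }
    let input = engine.inputNode
    input.installTap(onBus: 0, bufferSize: 1024, format: input.outputFormat(forBus: 0)) { buffer, _ in
        request.append(buffer)   // feed mic buffers to the recognizer
    }
    engine.prepare()
    try? engine.start()
    recognizer?.recognitionTask(with: request) { result, _ in
        if let result = result {
            print(result.bestTranscription.formattedString)
        }
    }
}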
Hope this helps!

Related

Combining AVAudioPCMBuffers

I am recording audio with an AVAudioEngine using installTap(onBus:bufferSize:format:). This generates AVAudioPCMBuffers that I accumulate. When I'm done recording, I want to concatenate those into a single AVAudioPCMBuffer so I can use it with other code that operates on buffers. (While in some cases I want to write this to a file, in general I do not.)
Is there a way to combine the buffers without dropping all the way down to the Core Audio layer and manipulating the AudioBufferList?
Would something like this work? (Untested and hard-coded for Float data but might be a start)
extension AVAudioPCMBuffer {
func append(_ buffer: AVAudioPCMBuffer) {
append(buffer, startingFrame: 0, frameCount: buffer.frameLength)
}
func append(_ buffer: AVAudioPCMBuffer, startingFrame: AVAudioFramePosition, frameCount: AVAudioFrameCount) {
precondition(format == buffer.format, "Format mismatch")
precondition(startingFrame + AVAudioFramePosition(frameCount) <= AVAudioFramePosition(buffer.frameLength), "Insufficient audio in buffer")
precondition(frameLength + frameCount <= frameCapacity, "Insufficient space in buffer")
let dst = floatChannelData!
let src = buffer.floatChannelData!
memcpy(dst.pointee.advanced(by: stride * Int(frameLength)),
src.pointee.advanced(by: stride * Int(startingFrame)),
Int(frameCount) * stride * MemoryLayout<Float>.size)
frameLength += frameCount
}
convenience init?(concatenating buffers: AVAudioPCMBuffer...) {
precondition(buffers.count > 0)
let totalFrames = buffers.reduce(0) { $0 + $1.frameLength }
self.init(pcmFormat: buffers[0].format, frameCapacity: totalFrames)
buffers.forEach { append($0) }
}
}
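Usage might look something like this (again untested; `node` stands for whatever node you're tapping, and it assumes all the accumulated buffers share one format):
var recorded: [AVAudioPCMBuffer] = []
node.installTap(onBus: 0, bufferSize: 4096, format: nil) { buffer, _ in
    recorded.append(buffer)
}
// ... later, after removing the tap:
let totalFrames = recorded.reduce(0) { $0 + $1.frameLength }
if let combined = AVAudioPCMBuffer(pcmFormat: recorded[0].format, frameCapacity: totalFrames) {
    recorded.forEach { combined.append($0) }
}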

Clipping sound with Opus on Android, sent from iOS

I am recording audio on iOS from an audio unit, encoding the bytes with Opus, and sending them via UDP to the Android side. The problem is that the sound plays back a bit clipped. I have also tested by sending the raw data from iOS to Android, and it plays perfectly.
My AudioSession code is
try audioSession.setCategory(.playAndRecord, mode: .voiceChat, options: [.defaultToSpeaker])
try audioSession.setPreferredIOBufferDuration(0.02)
try audioSession.setActive(true)
My recording callBack code is:
func performRecording(
_ ioActionFlags: UnsafeMutablePointer<AudioUnitRenderActionFlags>,
inTimeStamp: UnsafePointer<AudioTimeStamp>,
inBufNumber: UInt32,
inNumberFrames: UInt32,
ioData: UnsafeMutablePointer<AudioBufferList>) -> OSStatus
{
var err: OSStatus = noErr
err = AudioUnitRender(audioUnit!, ioActionFlags, inTimeStamp, 1, inNumberFrames, ioData)
if let mData = ioData[0].mBuffers.mData {
let ptrData = mData.bindMemory(to: Int16.self, capacity: Int(inNumberFrames))
let bufferPtr = UnsafeBufferPointer(start: ptrData, count: Int(inNumberFrames))
count += 1
addedBuffer += Array(bufferPtr)
if count == 2 {
let _ = TPCircularBufferProduceBytes(&circularBuffer, addedBuffer, UInt32(addedBuffer.count * 2))
count = 0
addedBuffer = []
let buffer = TPCircularBufferTail(&circularBuffer, &availableBytes)
memcpy(&targetBuffer, buffer, Int(min(bytesToCopy, Int(availableBytes))))
TPCircularBufferConsume(&circularBuffer, UInt32(min(bytesToCopy, Int(availableBytes))))
self.audioRecordingDelegate(inTimeStamp.pointee.mSampleTime / Double(16000), targetBuffer)
}
}
return err;
}
Here inNumberFrames is usually around 341, so I am appending two arrays together to get a bigger frame size (640 is needed for Android), and then encoding exactly 640 frames at a time with the help of TPCircularBuffer.
func gotSomeAudio(timeStamp: Double, samples: [Int16]) {
let encodedData = opusHelper?.encodeStream(of: samples)
let myData = encodedData!.withUnsafeBufferPointer {
Data(buffer: $0)
}
var protoModel = ProtoModel()
seqNumber += 1
protoModel.sequenceNumber = seqNumber
protoModel.timeStamp = Date().currentTimeInMillis()
protoModel.payload = myData
DispatchQueue.global().async {
do {
try self.tcpClient?.send(data: protoModel)
} catch {
print(error.localizedDescription)
}
}
let diff = CFAbsoluteTimeGetCurrent() - start
print("Time diff is \(diff)")
}
In the above code I am Opus-encoding a frame size of 640, adding it to the Protobuf payload, and sending it via UDP.
On the Android side I am parsing the Protobuf, decoding the 640 frame size, and playing it with AudioTrack. There is no problem on the Android side, as I have recorded and played sound using Android alone; the problem only appears when I record sound on iOS and play it through the Android side.
Please don't suggest increasing the frame size by setting the preferred IO buffer duration; I want to do this without changing that.
https://stackoverflow.com/a/57873492/12020007 was helpful, as was https://stackoverflow.com/a/58947295/12020007.
I have updated my code according to your suggestion and removed the delegate and the array concatenation, but there is still clipping on the Android side. I have also measured the time it takes to encode the bytes, which is approximately 2-3 ms.
The updated callback code is:
var err: OSStatus = noErr
// we are calling AudioUnitRender on the input bus of AURemoteIO
// this will store the audio data captured by the microphone in ioData
err = AudioUnitRender(audioUnit!, ioActionFlags, inTimeStamp, 1, inNumberFrames, ioData)
if let mData = ioData[0].mBuffers.mData {
_ = TPCircularBufferProduceBytes(&circularBuffer, mData, inNumberFrames * 2)
print("mDataByteSize: \(ioData[0].mBuffers.mDataByteSize)")
count += 1
if count == 2 {
count = 0
let buffer = TPCircularBufferTail(&circularBuffer, &availableBytes)
memcpy(&targetBuffer, buffer, min(bytesToCopy, Int(availableBytes)))
TPCircularBufferConsume(&circularBuffer, UInt32(min(bytesToCopy, Int(availableBytes))))
let encodedData = opusHelper?.encodeStream(of: targetBuffer)
let myData = encodedData!.withUnsafeBufferPointer {
Data(buffer: $0)
}
var protoModel = ProtoModel()
seqNumber += 1
protoModel.sequenceNumber = seqNumber
protoModel.timeStamp = Date().currentTimeInMillis()
protoModel.payload = myData
do {
try self.udpClient?.send(data: protoModel)
} catch {
print(error.localizedDescription)
}
}
}
return err;
Your code is doing Swift memory allocation (Array concatenation) and Swift method calls (your recording delegate) inside the audio callback. Apple (in a WWDC session on Audio) recommends not doing any memory allocation or method calls inside the real-time audio callback context (especially when requesting short Preferred IO Buffer Durations). Stick to C function calls, such as memcpy and TPCircularBuffer.
Added: Also, don't discard samples. If you get 680 samples, but only need 640 for a packet, keep the 40 "left over" samples and use them appended in front of a later packet. The circular buffer will save them for you. Rinse and repeat. Send all the samples you get from the audio callback when you've accumulated enough for a packet, or yet another packet when you end up accumulating 1280 (2*640) or more.
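A rough sketch of that accumulation pattern with TPCircularBuffer (circularBuffer, mData and inNumberFrames are the names from your callback; targetBuffer is preallocated outside the callback so no Swift allocation happens inside it):
let packetSamples = 640
let packetBytes = UInt32(packetSamples * MemoryLayout<Int16>.size)
var targetBuffer = [Int16](repeating: 0, count: packetSamples)   // preallocate once, outside the callback

// inside the render callback:
_ = TPCircularBufferProduceBytes(&circularBuffer, mData, inNumberFrames * 2)
var availableBytes: UInt32 = 0
while let tail = TPCircularBufferTail(&circularBuffer, &availableBytes), availableBytes >= packetBytes {
    memcpy(&targetBuffer, tail, Int(packetBytes))
    TPCircularBufferConsume(&circularBuffer, packetBytes)
    // hand targetBuffer (exactly 640 samples) to the Opus encoder/sender;
    // any remainder (< 640 samples) stays in the circular buffer for the next callback
}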

Get mic data callbacks every 20 milliseconds for a VoIP app

I am developing a VoIP calling app and am now at the stage where I need to transfer the voice data to the server. For that, I want to get real-time audio data from the mic in 20-millisecond callbacks.
I have searched many links but am unable to find a solution, as I am new to the audio frameworks.
Details
We have our own stack, similar to WebRTC, which delivers RTP data from the remote side every 20 milliseconds and asks for 20 milliseconds of data from the mic in return. What I am trying to achieve is to get 20 milliseconds of data from the mic and pass it to the stack, so I need to know how to do that. The audio format is pcmFormatInt16 and the sample rate is 8000 Hz, with 20 milliseconds of data per callback.
I have searched for
AVAudioEngine,
AUAudioUnit,
AVCaptureSession Etc.
1. I am using AVAudioSession and AUAudioUnit, but setPreferredIOBufferDuration on the audio session does not take exactly the value I set, so I am not getting the exact data size I expect. Can anybody help me with setPreferredIOBufferDuration?
2. One more issue: auAudioUnit.outputProvider() gives inputData as an UnsafeMutableAudioBufferListPointer. The inputData list has two elements and I only want one. Can anybody help me convert it into a data format that can be played with AVAudioPlayer?
I have followed this link before:
https://gist.github.com/hotpaw2/ba815fc23b5d642705f2b1dedfaf0107
let hwSRate = audioSession.sampleRate
try audioSession.setActive(true)
print("native Hardware rate : \(hwSRate)")
try audioSession.setPreferredIOBufferDuration(preferredIOBufferDuration)
try audioSession.setPreferredSampleRate(8000) // at 8000.0 Hz
print("Changed native Hardware rate : \(audioSession.sampleRate) buffer duration \(audioSession.ioBufferDuration)")
try auAudioUnit = AUAudioUnit(componentDescription: self.audioComponentDescription)
auAudioUnit.outputProvider = { // AURenderPullInputBlock
(actionFlags, timestamp, frameCount, inputBusNumber, inputData) -> AUAudioUnitStatus in
if let block = self.renderBlock { // AURenderBlock?
let err : OSStatus = block(actionFlags,
timestamp,
frameCount,
1,
inputData,
.none)
if err == noErr {
// save samples from current input buffer to circular buffer
print("inputData = \(inputData) and frameCount: \(frameCount)")
self.recordMicrophoneInputSamples(
inputDataList: inputData,
frameCount: UInt32(frameCount) )
}
}
let err2 : AUAudioUnitStatus = noErr
return err2
}
Log:
Changed native Hardware rate : 8000.0 buffer duration 0.01600000075995922
Try to get 40 ms of data from the audio interface and then split it up into 20 ms chunks, as in the sketch below.
Also check whether you are able to set the sampling frequency (8 kHz) of the audio interface.
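Something along these lines, assuming 8 kHz Int16 mono so that 20 ms = 160 frames (bufferList comes from your render block; in the real callback you would memcpy into preallocated storage rather than build Swift arrays):
let framesPer20ms = 160
if let base = bufferList.mBuffers.mData?.assumingMemoryBound(to: Int16.self) {
    let frameCount = Int(bufferList.mBuffers.mDataByteSize) / MemoryLayout<Int16>.size
    var start = 0
    while start + framesPer20ms <= frameCount {
        let chunk = Array(UnsafeBufferPointer(start: base + start, count: framesPer20ms))
        // hand `chunk` (20 ms of samples) to the VoIP stack here
        start += framesPer20ms
    }
    // carry any leftover frames (< 160) over to the next callback
}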
The render block will give you callbacks according to the setup the hardware accepted for AUAudioUnit and the audio session. We have to manage a buffer if we want a different input size from the mic. Output to the speaker should be the size it expects, such as 128, 256, or 512 bytes.
try audioSession.setPreferredSampleRate(sampleRateProvided) // at 48000.0
try audioSession.setPreferredIOBufferDuration(preferredIOBufferDuration)
These values can be different from our preferred size. That is why we have to use buffering logic to get our preferred input size.
Link: https://gist.github.com/hotpaw2/ba815fc23b5d642705f2b1dedfaf0107
renderBlock = auAudioUnit.renderBlock
if ( enableRecording
&& micPermissionGranted
&& audioSetupComplete
&& audioSessionActive
&& isRecording == false ) {
auAudioUnit.inputHandler = { (actionFlags, timestamp, frameCount, inputBusNumber) in
if let block = self.renderBlock { // AURenderBlock?
var bufferList = AudioBufferList(
mNumberBuffers: 1,
mBuffers: AudioBuffer(
mNumberChannels: audioFormat!.channelCount,
mDataByteSize: 0,
mData: nil))
let err : OSStatus = block(actionFlags,
timestamp,
frameCount,
inputBusNumber,
&bufferList,
.none)
if err == noErr {
// save samples from current input buffer to circular buffer
print("inputData = \(bufferList.mBuffers.mDataByteSize) and frameCount: \(frameCount) and count: \(count)")
count += 1
if !self.isMuteState {
self.recordMicrophoneInputSamples(
inputDataList: &bufferList,
frameCount: UInt32(frameCount) )
}
}
}
}
auAudioUnit.isInputEnabled = true
auAudioUnit.outputProvider = { ( // AURenderPullInputBlock?
actionFlags,
timestamp,
frameCount,
inputBusNumber,
inputDataList ) -> AUAudioUnitStatus in
if let block = self.renderBlock {
if let dataReceived = self.getInputDataForConsumption() {
let mutabledata = NSMutableData(data: dataReceived)
var bufferListSpeaker = AudioBufferList(
mNumberBuffers: 1,
mBuffers: AudioBuffer(
mNumberChannels: 1,
mDataByteSize: 0,
mData: nil))
let err : OSStatus = block(actionFlags,
timestamp,
frameCount,
1,
&bufferListSpeaker,
.none)
if err == noErr {
bufferListSpeaker.mBuffers.mDataByteSize = UInt32(mutabledata.length)
bufferListSpeaker.mBuffers.mData = mutabledata.mutableBytes
inputDataList[0] = bufferListSpeaker
print("Output Provider mDataByteSize: \(inputDataList[0].mBuffers.mDataByteSize) output FrameCount: \(frameCount)")
return err
} else {
print("Output Provider \(err)")
return err
}
}
}
return 0
}
auAudioUnit.isOutputEnabled = true
do {
circInIdx = 0 // initialize circular buffer pointers
circOutIdx = 0
circoutSpkIdx = 0
circInSpkIdx = 0
try auAudioUnit.allocateRenderResources()
try auAudioUnit.startHardware() // equivalent to AudioOutputUnitStart ???
isRecording = true
} catch let e {
print(e)
}

Read UInt32 from InputStream

I need to communicate with a server that has a special message format: each message begins with 4 bytes (together an unsigned long / UInt32 in big-endian format) which determine the length of the following message. After those 4 bytes, the message is sent as a normal string.
So I first need to read 4 bytes into an integer (32-bit unsigned). In Java I do this like:
DataInputStream is;
...
int len = is.readInt();
How can I do this in Swift 4?
At the moment I use
var lengthbuffer = [UInt8](repeating: 0, count: 4)
let bytecount = istr.read(&lengthbuffer, maxLength: 4)
let lengthbytes = lengthbuffer[0...3]
let bigEndianValue = lengthbytes.withUnsafeBufferPointer {
($0.baseAddress!.withMemoryRebound(to: UInt32.self, capacity: 1) { $0 })
}.pointee
let bytes_expected = Int(UInt32(bigEndian: bigEndianValue))
But this doesn't look like the most elegant way. Furthermore, sometimes (I cannot reproduce it reliably) a wrong value is read (too big). When I then try to allocate memory for the following message, the app crashes:
let buffer = UnsafeMutablePointer<UInt8>.allocate(capacity: bytes_expected)
let bytes_read = istr.read(buffer, maxLength: bytes_expected)
So what is the Swift way to read a UInt32 from an InputStream?
EDIT:
My current code (implemented things from the comments. Thanks!) looks like this:
private let inputStreamAccessQueue = DispatchQueue(label: "SynchronizedInputStreamAccess") // NOT concurrent!!!
// This is called on Stream.Event.hasBytesAvailable
func handleInput() {
self.inputStreamAccessQueue.sync(flags: .barrier) {
guard let istr = self.inputStream, istr.hasBytesAvailable else {
log.error(self.buildLogMessage("handleInput() called when inputstream has no bytes available"))
return
}
let lengthbuffer = UnsafeMutablePointer<UInt8>.allocate(capacity: 4)
defer { lengthbuffer.deallocate(capacity: 4) }
let lenbytes_read = istr.read(lengthbuffer, maxLength: 4)
guard lenbytes_read == 4 else {
self.errorHandler(NetworkingError.InputError("Input Stream received \(lenbytes_read) (!=4) bytes"))
return
}
let bytes_expected = Int(UnsafeRawPointer(lengthbuffer).load(as: UInt32.self).bigEndian)
log.info(self.buildLogMessage("expect \(bytes_expected) bytes"))
let buffer = UnsafeMutablePointer<UInt8>.allocate(capacity: bytes_expected)
let bytes_read = istr.read(buffer, maxLength: bytes_expected)
guard bytes_read == bytes_expected else {
print("Error: Expected \(bytes_expected) bytes, read \(bytes_read)")
return
}
guard let message = String(bytesNoCopy: buffer, length: bytes_expected, encoding: .utf8, freeWhenDone: true) else {
log.error("ERROR WHEN READING")
return
}
self.handleMessage(message)
}
}
This works most of the time, but sometimes istr.read() does not read bytes_expected bytes; instead bytes_read < bytes_expected. This results in another hasBytesAvailable event and handleInput() is called again. This time, of course, the first 4 bytes that are read do not contain the length of a new message but some content of the last message. My code does not know that, so the first bytes are interpreted as the length. In many cases this is a really big value => allocating too much memory => crash.
I think this is the explanation for the bug. But how to solve it?
Call read() on the stream while hasBytesAvailable = true? Is there maybe a better solution?
I would assume that when I loop, the hasBytesAvailable event would still fire after every read() => handleInput would still be called again too early... How can I avoid this?
EDIT 2: I have implemented the loop now; unfortunately it is still crashing with the same error (and probably for the same reason). Relevant code:
let bytes_expected = Int(UnsafeRawPointer(lengthbuffer).load(as: UInt32.self).bigEndian)
var message = ""
var bytes_missing = bytes_expected
while bytes_missing > 0 {
print("missing", bytes_missing)
let buffer = UnsafeMutablePointer<UInt8>.allocate(capacity: bytes_missing)
let bytes_read = istr.read(buffer, maxLength: bytes_missing)
guard bytes_read > 0 else {
print("bytes_read not <= 0: \(bytes_read)")
return
}
guard bytes_read <= bytes_missing else {
print("Read more bytes than expected. missing=\(bytes_missing), read=\(bytes_read)")
return
}
guard let partial_message = String(bytesNoCopy: buffer, length: bytes_expected, encoding: .utf8, freeWhenDone: true) else {
log.error("ERROR WHEN READING")
return
}
message = message + partial_message
bytes_missing -= bytes_read
}
My console output when it crashes:
missing 1952807028
malloc: *** mach_vm_map(size=1952808960) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
So it seems that the whole handleInput() method is called too early, although I use the barrier! What am I doing wrong?
I'd do it like this (ready to be pasted into a playground):
import Foundation
var stream = InputStream(data: Data([0,1,0,0]))
stream.open()
defer { stream.close() }
var buffer = UnsafeMutablePointer<UInt8>.allocate(capacity: 4)
defer { buffer.deallocate(capacity: 4) }
guard stream.read(buffer, maxLength: 4) >= 4 else {
// handle all cases: end of stream, error, waiting for more data to arrive...
fatalError()
}
let number = UnsafeRawPointer(buffer).load(as: UInt32.self)
number // 256
number.littleEndian // 256
number.bigEndian // 65536
Using UnsafeRawPointer.load directly (without explicit rebinding) is safe for trivial types according to the documentation. Trivial types are generally those that don't require ARC operations.
Alternatively, you can access the same memory as a different type without rebinding through untyped memory access, so long as the bound type and the destination type are trivial types.
I would suggest load(as:) to convert the buffer to the UInt32, and I would make sure you make the endianness explicit, e.g.
let value = try stream.read(type: UInt32.self, endianness: .little)
Where:
enum InputStreamError: Error {
case readFailure
}
enum Endianness {
case little
case big
}
extension InputStream {
func read<T: FixedWidthInteger>(type: T.Type, endianness: Endianness = .little) throws -> T {
let size = MemoryLayout<T>.size
var buffer = [UInt8](repeating: 0, count: size)
let count = read(&buffer, maxLength: size)
guard count == size else {
throw InputStreamError.readFailure
}
return buffer.withUnsafeBytes { pointer -> T in
switch endianness {
case .little: return T(littleEndian: pointer.load(as: T.self))
case .big: return T(bigEndian: pointer.load(as: T.self))
}
}
}
func readFloat(endianness: Endianness) throws -> Float {
return try Float(bitPattern: read(type: UInt32.self, endianness: endianness))
}
func readDouble(endianness: Endianness) throws -> Double {
return try Double(bitPattern: read(type: UInt64.self, endianness: endianness))
}
}
Note, I made read(type:endianness:) a generic, so it can be reused with any of the standard integer types. I have also thrown in readFloat and readDouble for good measure.
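Reading one of your length-prefixed messages would then look roughly like this (inputStream stands for your open stream; the loop guards against partial reads, which may well be the cause of the crash you describe):
do {
    let length = try inputStream.read(type: UInt32.self, endianness: .big)
    var payload = [UInt8](repeating: 0, count: Int(length))
    var readSoFar = 0
    while readSoFar < payload.count {
        let n = inputStream.read(&payload[readSoFar], maxLength: payload.count - readSoFar)
        guard n > 0 else { throw InputStreamError.readFailure }
        readSoFar += n
    }
    let message = String(bytes: payload, encoding: .utf8)
    print(message ?? "<not UTF-8>")
} catch {
    print("read failed: \(error)")
}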

AAC encoding using AudioConverter and writing to AVAssetWriter

I'm struggling to encode audio buffers received from AVCaptureSession using
AudioConverter and then appending them to an AVAssetWriter.
I'm not getting any errors (including OSStatus responses), and the
CMSampleBuffers generated seem to have valid data, however the resulting file
simply does not have any playable audio. When writing together with video, the video
frames stop getting appended a couple of frames in (appendSampleBuffer()
returns false, but with no AVAssetWriter.error), probably because the asset
writer is waiting for the audio to catch up. I suspect it's related to the way
I'm setting up the priming for AAC.
The app uses RxSwift, but I've removed the RxSwift parts so that it's easier to
understand for a wider audience.
Please check out comments in the code below for more... comments
Given a settings struct:
import Foundation
import AVFoundation
import CleanroomLogger
public struct AVSettings {
let orientation: AVCaptureVideoOrientation = .Portrait
let sessionPreset = AVCaptureSessionPreset1280x720
let videoBitrate: Int = 2_000_000
let videoExpectedFrameRate: Int = 30
let videoMaxKeyFrameInterval: Int = 60
let audioBitrate: Int = 32 * 1024
/// Settings that are `0` mean variable rate.
/// The `mSampleRate` and `mChannelsPerFrame` are overwritten at run-time
/// to values based on the input stream.
let audioOutputABSD = AudioStreamBasicDescription(
mSampleRate: AVAudioSession.sharedInstance().sampleRate,
mFormatID: kAudioFormatMPEG4AAC,
mFormatFlags: UInt32(MPEG4ObjectID.AAC_Main.rawValue),
mBytesPerPacket: 0,
mFramesPerPacket: 1024,
mBytesPerFrame: 0,
mChannelsPerFrame: 1,
mBitsPerChannel: 0,
mReserved: 0)
let audioEncoderClassDescriptions = [
AudioClassDescription(
mType: kAudioEncoderComponentType,
mSubType: kAudioFormatMPEG4AAC,
mManufacturer: kAppleSoftwareAudioCodecManufacturer) ]
}
Some helper functions:
public func getVideoDimensions(fromSettings settings: AVSettings) -> (Int, Int) {
switch (settings.sessionPreset, settings.orientation) {
case (AVCaptureSessionPreset1920x1080, .Portrait): return (1080, 1920)
case (AVCaptureSessionPreset1280x720, .Portrait): return (720, 1280)
default: fatalError("Unsupported session preset and orientation")
}
}
public func createAudioFormatDescription(fromSettings settings: AVSettings) -> CMAudioFormatDescription {
var result = noErr
var absd = settings.audioOutputABSD
var description: CMAudioFormatDescription?
withUnsafePointer(&absd) { absdPtr in
result = CMAudioFormatDescriptionCreate(nil,
absdPtr,
0, nil,
0, nil,
nil,
&description)
}
if result != noErr {
Log.error?.message("Could not create audio format description")
}
return description!
}
public func createVideoFormatDescription(fromSettings settings: AVSettings) -> CMVideoFormatDescription {
var result = noErr
var description: CMVideoFormatDescription?
let (width, height) = getVideoDimensions(fromSettings: settings)
result = CMVideoFormatDescriptionCreate(nil,
kCMVideoCodecType_H264,
Int32(width),
Int32(height),
[:],
&description)
if result != noErr {
Log.error?.message("Could not create video format description")
}
return description!
}
This is how the asset writer is initialized:
guard let audioDevice = defaultAudioDevice() else
{ throw RecordError.MissingDeviceFeature("Microphone") }
guard let videoDevice = defaultVideoDevice(.Back) else
{ throw RecordError.MissingDeviceFeature("Camera") }
let videoInput = try AVCaptureDeviceInput(device: videoDevice)
let audioInput = try AVCaptureDeviceInput(device: audioDevice)
let videoFormatHint = createVideoFormatDescription(fromSettings: settings)
let audioFormatHint = createAudioFormatDescription(fromSettings: settings)
let writerVideoInput = AVAssetWriterInput(mediaType: AVMediaTypeVideo,
outputSettings: nil,
sourceFormatHint: videoFormatHint)
let writerAudioInput = AVAssetWriterInput(mediaType: AVMediaTypeAudio,
outputSettings: nil,
sourceFormatHint: audioFormatHint)
writerVideoInput.expectsMediaDataInRealTime = true
writerAudioInput.expectsMediaDataInRealTime = true
let url = NSURL(fileURLWithPath: NSTemporaryDirectory(), isDirectory: true)
.URLByAppendingPathComponent(NSProcessInfo.processInfo().globallyUniqueString)
.URLByAppendingPathExtension("mp4")
let assetWriter = try AVAssetWriter(URL: url, fileType: AVFileTypeMPEG4)
if !assetWriter.canAddInput(writerVideoInput) {
throw RecordError.Unknown("Could not add video input") }
if !assetWriter.canAddInput(writerAudioInput) {
throw RecordError.Unknown("Could not add audio input") }
assetWriter.addInput(writerVideoInput)
assetWriter.addInput(writerAudioInput)
And this is how audio samples are being encoded; the problem area is most likely
around here. I've re-written this so that it doesn't use any Rx-isms.
var outputABSD = settings.audioOutputABSD
var outputFormatDescription: CMAudioFormatDescription! = nil
CMAudioFormatDescriptionCreate(nil, &outputABSD, 0, nil, 0, nil, nil, &outputFormatDescription)
var converter: AudioConverter?
// Indicates whether priming information has been attached to the first buffer
var primed = false
func encodeAudioBuffer(settings: AVSettings, buffer: CMSampleBuffer) throws -> CMSampleBuffer? {
// Create the audio converter if it's not available
if converter == nil {
var classDescriptions = settings.audioEncoderClassDescriptions
var inputABSD = CMAudioFormatDescriptionGetStreamBasicDescription(CMSampleBufferGetFormatDescription(buffer)!).memory
var outputABSD = settings.audioOutputABSD
outputABSD.mSampleRate = inputABSD.mSampleRate
outputABSD.mChannelsPerFrame = inputABSD.mChannelsPerFrame
var converter: AudioConverterRef = nil
var result = noErr
result = withUnsafePointer(&outputABSD) { outputABSDPtr in
return withUnsafePointer(&inputABSD) { inputABSDPtr in
return AudioConverterNewSpecific(inputABSDPtr,
outputABSDPtr,
UInt32(classDescriptions.count),
&classDescriptions,
&converter)
}
}
if result != noErr { throw RecordError.Unknown }
// At this point I made an attempt to retrieve priming info from
// the audio converter assuming that it will give me back default values
// I can use, but ended up with `nil`
var primeInfo: AudioConverterPrimeInfo? = nil
var primeInfoSize = UInt32(sizeof(AudioConverterPrimeInfo))
// The following returns a `noErr` but `primeInfo` is still `nil``
AudioConverterGetProperty(converter,
kAudioConverterPrimeInfo,
&primeInfoSize,
&primeInfo)
// I've also tried to set `kAudioConverterPrimeInfo` so that it knows
// the leading frames that are being primed, but the set didn't seem to work
// (`noErr` but getting the property afterwards still returned `nil`)
}
let converter = converter!
// Need to give a big enough output buffer.
// The assumption is that it will always be <= to the input size
let numSamples = CMSampleBufferGetNumSamples(buffer)
// This becomes 1024 * 2 = 2048
let outputBufferSize = numSamples * Int(inputABSD.mBytesPerPacket)
let outputBufferPtr = UnsafeMutablePointer<Void>.alloc(outputBufferSize)
defer {
outputBufferPtr.destroy()
outputBufferPtr.dealloc(1)
}
var result = noErr
var outputPacketCount = UInt32(1)
var outputData = AudioBufferList(
mNumberBuffers: 1,
mBuffers: AudioBuffer(
mNumberChannels: outputABSD.mChannelsPerFrame,
mDataByteSize: UInt32(outputBufferSize),
mData: outputBufferPtr))
// See below for `EncodeAudioUserData`
var userData = EncodeAudioUserData(inputSampleBuffer: buffer,
inputBytesPerPacket: inputABSD.mBytesPerPacket)
withUnsafeMutablePointer(&userData) { userDataPtr in
// See below for `fetchAudioProc`
result = AudioConverterFillComplexBuffer(
converter,
fetchAudioProc,
userDataPtr,
&outputPacketCount,
&outputData,
nil)
}
if result != noErr {
Log.error?.message("Error while trying to encode audio buffer, code: \(result)")
return nil
}
// See below for `CMSampleBufferCreateCopy`
guard let newBuffer = CMSampleBufferCreateCopy(buffer,
fromAudioBufferList: &outputData,
newFromatDescription: outputFormatDescription) else {
Log.error?.message("Could not create sample buffer from audio buffer list")
return nil
}
if !primed {
primed = true
// Simply picked 2112 samples based on convention, is there a better way to determine this?
let samplesToPrime: Int64 = 2112
let samplesPerSecond = Int32(settings.audioOutputABSD.mSampleRate)
let primingDuration = CMTimeMake(samplesToPrime, samplesPerSecond)
// Without setting the attachment the asset writer will complain about the
// first buffer missing the `TrimDurationAtStart` attachment, is there are way
// to infer the value from the given `AudioBufferList`?
CMSetAttachment(newBuffer,
kCMSampleBufferAttachmentKey_TrimDurationAtStart,
CMTimeCopyAsDictionary(primingDuration, nil),
kCMAttachmentMode_ShouldNotPropagate)
}
return newBuffer
}
Below is the proc that fetches samples for the audio converter, and the data
structure that gets passed to it:
private class EncodeAudioUserData {
var inputSampleBuffer: CMSampleBuffer?
var inputBytesPerPacket: UInt32
init(inputSampleBuffer: CMSampleBuffer,
inputBytesPerPacket: UInt32) {
self.inputSampleBuffer = inputSampleBuffer
self.inputBytesPerPacket = inputBytesPerPacket
}
}
private let fetchAudioProc: AudioConverterComplexInputDataProc = {
(inAudioConverter,
ioDataPacketCount,
ioData,
outDataPacketDescriptionPtrPtr,
inUserData) in
var result = noErr
if ioDataPacketCount.memory == 0 { return noErr }
let userData = UnsafeMutablePointer<EncodeAudioUserData>(inUserData).memory
// If its already been processed
guard let buffer = userData.inputSampleBuffer else {
ioDataPacketCount.memory = 0
return -1
}
var inputBlockBuffer: CMBlockBuffer?
var inputBufferList = AudioBufferList()
result = CMSampleBufferGetAudioBufferListWithRetainedBlockBuffer(
buffer,
nil,
&inputBufferList,
sizeof(AudioBufferList),
nil,
nil,
0,
&inputBlockBuffer)
if result != noErr {
Log.error?.message("Error while trying to retrieve buffer list, code: \(result)")
ioDataPacketCount.memory = 0
return result
}
let packetsCount = inputBufferList.mBuffers.mDataByteSize / userData.inputBytesPerPacket
ioDataPacketCount.memory = packetsCount
ioData.memory.mBuffers.mNumberChannels = inputBufferList.mBuffers.mNumberChannels
ioData.memory.mBuffers.mDataByteSize = inputBufferList.mBuffers.mDataByteSize
ioData.memory.mBuffers.mData = inputBufferList.mBuffers.mData
if outDataPacketDescriptionPtrPtr != nil {
outDataPacketDescriptionPtrPtr.memory = nil
}
return noErr
}
This is how I am converting AudioBufferLists to CMSampleBuffers:
public func CMSampleBufferCreateCopy(
buffer: CMSampleBuffer,
inout fromAudioBufferList bufferList: AudioBufferList,
newFromatDescription formatDescription: CMFormatDescription? = nil)
-> CMSampleBuffer? {
var result = noErr
var sizeArray: [Int] = [Int(bufferList.mBuffers.mDataByteSize)]
// Copy timing info from the previous buffer
var timingInfo = CMSampleTimingInfo()
result = CMSampleBufferGetSampleTimingInfo(buffer, 0, &timingInfo)
if result != noErr { return nil }
var newBuffer: CMSampleBuffer?
result = CMSampleBufferCreateReady(
kCFAllocatorDefault,
nil,
formatDescription ?? CMSampleBufferGetFormatDescription(buffer),
Int(bufferList.mNumberBuffers),
1, &timingInfo,
1, &sizeArray,
&newBuffer)
if result != noErr { return nil }
guard let b = newBuffer else { return nil }
CMSampleBufferSetDataBufferFromAudioBufferList(b, nil, nil, 0, &bufferList)
return newBuffer
}
Is there anything that I am obviously doing wrong? Is there a proper way to
construct CMSampleBuffers from AudioBufferList? How do you transfer priming
information from the converter to CMSampleBuffers that you create?
For my use case I need to do the encoding manually as the buffers will be
manipulated further down the pipeline (although I've disabled all
transformations after the encode in order to make sure that it works.)
Any help would be much appreciated. Sorry that there's so much code to
digest, but I wanted to provide as much context as possible.
Thanks in advance :)
Some related questions:
CMSampleBufferRef kCMSampleBufferAttachmentKey_TrimDurationAtStart crash
Can I use AVCaptureSession to encode an AAC stream to memory?
Writing video + generated audio to AVAssetWriterInput, audio stuttering
How do I use CoreAudio's AudioConverter to encode AAC in real-time?
Some references I've used:
Apple sample code demonstrating how to use AudioConverter
Note describing AAC encoder delay
Turns out there were a variety of things that I was doing wrong. Instead of posting a garble of code, I'm going to try and organize this into bite-sized pieces of things that I discovered.
Samples vs Packets vs Frames
This had been a huge source of confusion for me:
Each CMSampleBuffer can have 1 or more samples (discovered via CMSampleBufferGetNumSamples).
Each CMSampleBuffer that contains 1 sample represents a single audio packet.
Therefore, CMSampleBufferGetNumSamples(sample) will return the number of packets contained in the given buffer.
Packets contain frames. This is governed by the mFramesPerPacket property of the buffer's AudioStreamBasicDescription. For linear PCM buffers, the total size of each sample buffer is frames * bytes per frame. For compressed buffers (like AAC), there is no relationship between the total size and frame count.
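In code, the relationship for an uncompressed (linear PCM) buffer looks roughly like this (sampleBuffer is whatever CMSampleBuffer you received; force-unwraps kept for brevity):
let packetCount = CMSampleBufferGetNumSamples(sampleBuffer)
let asbd = CMAudioFormatDescriptionGetStreamBasicDescription(
    CMSampleBufferGetFormatDescription(sampleBuffer)!)!.pointee
let frames = packetCount * Int(asbd.mFramesPerPacket)   // mFramesPerPacket is 1 for linear PCM
let totalBytes = frames * Int(asbd.mBytesPerFrame)      // only meaningful for PCM, not for AAC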
AudioConverterComplexInputDataProc
This callback is used to retrieve more linear PCM audio data for encoding. It's imperative that you supply at least the number of packets specified by ioNumberDataPackets. Since I've been using the converter for real-time push-style encoding, I needed to ensure that each data push contained at least that minimum number of packets. Something like this (pseudo-code):
let minimumPackets = outputFramesPerPacket / inputFramesPerPacket
var buffers: [CMSampleBuffer] = []
while getTotalSize(buffers) < minimumPackets {
buffers = buffers + [getNextBuffer()]
}
AudioConverterFillComplexBuffer(...)
Slicing CMSampleBuffers
You can actually slice CMSampleBuffers if they contain multiple packets. The tool for this is CMSampleBufferCopySampleBufferForRange. This is nice because you can provide the AudioConverterComplexInputDataProc with the exact number of packets it asks for, which makes handling timing information for the resulting encoded buffer easier. If you give the converter 1500 frames of data when it expects 1024, the resulting sample buffer will have a duration of 1024/sampleRate as opposed to 1500/sampleRate.
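A minimal sketch of the slicing call (buffer is the multi-packet sample buffer, and 1024 is just the number of packets the converter asked for; recent SDKs spell this function with argument labels):
var slice: CMSampleBuffer?
let status = CMSampleBufferCopySampleBufferForRange(
    kCFAllocatorDefault,
    buffer,                  // the buffer to slice
    CFRangeMake(0, 1024),    // the first 1024 packets
    &slice)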
Priming and trim duration
When doing AAC encoding, you must set the trim duration like so:
CMSetAttachment(buffer,
kCMSampleBufferAttachmentKey_TrimDurationAtStart,
CMTimeCopyAsDictionary(primingDuration, kCFAllocatorDefault),
kCMAttachmentMode_ShouldNotPropagate)
One thing I did wrong was that I added the trim duration at encode time. This should be handled by your writer so that it can guarantee the information gets added to your leading audio frames.
Also, the value of kCMSampleBufferAttachmentKey_TrimDurationAtStart should never be greater than the duration of the sample buffer. An example of priming:
Priming frames: 2112
Sample rate: 44100
Priming duration: 2112 / 44100 = ~0.0479s
First frame, frames: 1024, priming duration: 1024 / 44100
Second frame, frames: 1024, priming duration: 1088 / 44100
Creating the new CMSampleBuffer
AudioConverterFillComplexBuffer has an optional outputPacketDescriptionsPtr. You should use it. It will point to a new array of packet descriptions that contains sample size information. You need this sample size information to construct the new compressed sample buffer:
let bufferList: AudioBufferList
let packetDescriptions: [AudioStreamPacketDescription]
var newBuffer: CMSampleBuffer?
CMAudioSampleBufferCreateWithPacketDescriptions(
kCFAllocatorDefault, // allocator
nil, // dataBuffer
false, // dataReady
nil, // makeDataReadyCallback
nil, // makeDataReadyRefCon
formatDescription, // formatDescription
Int(bufferList.mNumberBuffers), // numSamples
CMSampleBufferGetPresentationTimeStamp(buffer), // sbufPTS (first PTS)
&packetDescriptions, // packetDescriptions
&newBuffer)
