Combining AVAudioPCMBuffers - ios

I am recording audio with an AVAudioEngine using installTap(onBus:bufferSize:format:block:). This generates AVAudioPCMBuffers that I accumulate. When I'm done recording, I want to concatenate those into a single AVAudioPCMBuffer so I can use it with other code that operates on buffers. (While in some cases I want to write this to a file, in general I do not.)
Is there a way to combine the buffers without dropping all the way down to the Core Audio layer and manipulating the AudioBufferList?

Would something like this work? (Untested and hard-coded for Float data, but it might be a start.)
extension AVAudioPCMBuffer {
    func append(_ buffer: AVAudioPCMBuffer) {
        append(buffer, startingFrame: 0, frameCount: buffer.frameLength)
    }

    func append(_ buffer: AVAudioPCMBuffer,
                startingFrame: AVAudioFramePosition,
                frameCount: AVAudioFrameCount) {
        precondition(format == buffer.format, "Format mismatch")
        precondition(startingFrame + AVAudioFramePosition(frameCount) <= AVAudioFramePosition(buffer.frameLength),
                     "Insufficient audio in buffer")
        precondition(frameLength + frameCount <= frameCapacity, "Insufficient space in buffer")

        // NOTE: this copies channel 0 only, which covers mono and interleaved formats;
        // deinterleaved multichannel data would need a per-channel copy.
        let dst = floatChannelData!
        let src = buffer.floatChannelData!
        memcpy(dst.pointee.advanced(by: stride * Int(frameLength)),
               src.pointee.advanced(by: stride * Int(startingFrame)),
               Int(frameCount) * stride * MemoryLayout<Float>.size)
        frameLength += frameCount
    }

    convenience init?(concatenating buffers: AVAudioPCMBuffer...) {
        precondition(buffers.count > 0)
        let totalFrames = buffers.reduce(0) { $0 + $1.frameLength }
        self.init(pcmFormat: buffers[0].format, frameCapacity: totalFrames)
        buffers.forEach { append($0) }
    }
}
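For completeness, a rough sketch of how this might be used with the buffers collected from the tap; recordedBuffers is a hypothetical [AVAudioPCMBuffer] accumulated in the tap block. Since the variadic initializer can't take an array directly, the combined buffer is built by hand here:

// recordedBuffers: [AVAudioPCMBuffer] collected in the installTap block
let totalFrames = recordedBuffers.reduce(0) { $0 + $1.frameLength }
if let first = recordedBuffers.first,
   let combined = AVAudioPCMBuffer(pcmFormat: first.format, frameCapacity: totalFrames) {
    recordedBuffers.forEach { combined.append($0) }
    // combined now holds the whole recording as a single buffer
}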

Related

Clipping sound with Opus on Android, sent from iOS

I am recording audio on iOS from an audio unit, encoding the bytes with Opus, and sending them via UDP to the Android side. The problem is that the sound plays back slightly clipped. I have also tested by sending the raw data from iOS to Android, and it plays perfectly.
My AVAudioSession code is:
try audioSession.setCategory(.playAndRecord, mode: .voiceChat, options: [.defaultToSpeaker])
try audioSession.setPreferredIOBufferDuration(0.02)
try audioSession.setActive(true)
My recording callback code is:
func performRecording(
    _ ioActionFlags: UnsafeMutablePointer<AudioUnitRenderActionFlags>,
    inTimeStamp: UnsafePointer<AudioTimeStamp>,
    inBufNumber: UInt32,
    inNumberFrames: UInt32,
    ioData: UnsafeMutablePointer<AudioBufferList>) -> OSStatus
{
    var err: OSStatus = noErr
    err = AudioUnitRender(audioUnit!, ioActionFlags, inTimeStamp, 1, inNumberFrames, ioData)

    if let mData = ioData[0].mBuffers.mData {
        let ptrData = mData.bindMemory(to: Int16.self, capacity: Int(inNumberFrames))
        let bufferPtr = UnsafeBufferPointer(start: ptrData, count: Int(inNumberFrames))
        count += 1
        addedBuffer += Array(bufferPtr)

        if count == 2 {
            let _ = TPCircularBufferProduceBytes(&circularBuffer, addedBuffer, UInt32(addedBuffer.count * 2))
            count = 0
            addedBuffer = []

            let buffer = TPCircularBufferTail(&circularBuffer, &availableBytes)
            memcpy(&targetBuffer, buffer, Int(min(bytesToCopy, Int(availableBytes))))
            TPCircularBufferConsume(&circularBuffer, UInt32(min(bytesToCopy, Int(availableBytes))))

            self.audioRecordingDelegate(inTimeStamp.pointee.mSampleTime / Double(16000), targetBuffer)
        }
    }
    return err
}
Here I am getting an inNumberFrames of roughly 341, and I am appending two arrays together to get a bigger frame size (640 is needed for Android), but I am only encoding 640 frames with the help of TPCircularBuffer.
func gotSomeAudio(timeStamp: Double, samples: [Int16]) {
    let encodedData = opusHelper?.encodeStream(of: samples)
    let myData = encodedData!.withUnsafeBufferPointer {
        Data(buffer: $0)
    }

    var protoModel = ProtoModel()
    seqNumber += 1
    protoModel.sequenceNumber = seqNumber
    protoModel.timeStamp = Date().currentTimeInMillis()
    protoModel.payload = myData

    DispatchQueue.global().async {
        do {
            try self.tcpClient?.send(data: protoModel)
        } catch {
            print(error.localizedDescription)
        }
    }

    let diff = CFAbsoluteTimeGetCurrent() - start
    print("Time diff is \(diff)")
}
In the above code I am Opus-encoding a 640-sample frame, adding it to the Protobuf payload, and sending it via UDP.
On the Android side I am parsing the Protobuf, decoding the 640-sample frame, and playing it with AudioTrack. There is no problem on the Android side, as I have recorded and played sound using Android alone; the problem only appears when I record the sound on iOS and play it through the Android side.
Please don't suggest increasing the frame size by setting the preferred IO buffer duration. I want to do this without changing it.
https://stackoverflow.com/a/57873492/12020007 was helpful.
https://stackoverflow.com/a/58947295/12020007
I have updated my code according to your suggestion and removed the delegate and the array concatenation, but there is still clipping on the Android side. I have also measured the time it takes to encode the bytes: approximately 2-3 ms.
The updated callback code is:
var err: OSStatus = noErr
// We are calling AudioUnitRender on the input bus of AURemoteIO.
// This will store the audio data captured by the microphone in ioData.
err = AudioUnitRender(audioUnit!, ioActionFlags, inTimeStamp, 1, inNumberFrames, ioData)

if let mData = ioData[0].mBuffers.mData {
    _ = TPCircularBufferProduceBytes(&circularBuffer, mData, inNumberFrames * 2)
    print("mDataByteSize: \(ioData[0].mBuffers.mDataByteSize)")
    count += 1

    if count == 2 {
        count = 0
        let buffer = TPCircularBufferTail(&circularBuffer, &availableBytes)
        memcpy(&targetBuffer, buffer, min(bytesToCopy, Int(availableBytes)))
        TPCircularBufferConsume(&circularBuffer, UInt32(min(bytesToCopy, Int(availableBytes))))

        let encodedData = opusHelper?.encodeStream(of: targetBuffer)
        let myData = encodedData!.withUnsafeBufferPointer {
            Data(buffer: $0)
        }

        var protoModel = ProtoModel()
        seqNumber += 1
        protoModel.sequenceNumber = seqNumber
        protoModel.timeStamp = Date().currentTimeInMillis()
        protoModel.payload = myData

        do {
            try self.udpClient?.send(data: protoModel)
        } catch {
            print(error.localizedDescription)
        }
    }
}
return err
Your code is doing Swift memory allocation (Array concatenation) and Swift method calls (your recording delegate) inside the audio callback. Apple (in a WWDC session on Audio) recommends not doing any memory allocation or method calls inside the real-time audio callback context (especially when requesting short Preferred IO Buffer Durations). Stick to C function calls, such as memcpy and TPCircularBuffer.
Added: Also, don't discard samples. If you get 680 samples but only need 640 for a packet, keep the 40 "left over" samples and append them in front of a later packet; the circular buffer will save them for you. Rinse and repeat: send a packet whenever you've accumulated enough samples from the audio callback, and send yet another packet when you end up accumulating 1280 (2 * 640) or more.
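As a rough, untested illustration of that accumulate-and-drain pattern (a sketch reusing the circularBuffer, targetBuffer, mData, and ioData names from the question's callback; 640 Int16 samples per packet as the question requires):

// Inside the render callback: produce everything, then drain only whole packets.
let packetSize = 640                                  // samples per Opus packet
let packetBytes = packetSize * MemoryLayout<Int16>.size

// 1. Push every captured byte into the ring buffer (a C call, no allocation).
_ = TPCircularBufferProduceBytes(&circularBuffer, mData, ioData[0].mBuffers.mDataByteSize)

// 2. Drain in whole packets; any leftover samples stay buffered for the next callback.
var availableBytes: UInt32 = 0
while let tail = TPCircularBufferTail(&circularBuffer, &availableBytes),
      Int(availableBytes) >= packetBytes {
    memcpy(&targetBuffer, tail, packetBytes)          // targetBuffer: [Int16] of packetSize samples
    TPCircularBufferConsume(&circularBuffer, UInt32(packetBytes))
    // encode targetBuffer with Opus here, or hand it to a lock-free queue for sending
}

This way no samples are dropped regardless of how many frames each callback delivers, and only C calls touch the ring buffer on the audio thread.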

Get mic data callbacks every 20 milliseconds for a VoIP app

I am developing a VoIP calling app, and I am now at the stage where I need to transfer the voice data to the server. For that, I want to get real-time audio data from the mic in 20-millisecond callbacks.
I have searched many links, but I am unable to find a solution, as I am new to the audio frameworks.
Details
We have our own stack, similar to WebRTC, which delivers RTP data from the remote side every 20 milliseconds and asks for mic data every 20 milliseconds. What I am trying to achieve is to get 20 milliseconds of data from the mic and pass it straight to the stack, so I need to know how to do that. The audio format is pcmFormatInt16 and the sample rate is 8000 Hz, with 20 milliseconds of data per callback.
I have searched for
AVAudioEngine,
AUAudioUnit,
AVCaptureSession, etc.
1. I am using AVAudioSession and AUAudioUnit, but setPreferredIOBufferDuration on the audio session is not applying the exact value I set, so I am not getting the exact data size I expect. Can anybody help me with setPreferredIOBufferDuration?
2. One more issue: auAudioUnit.outputProvider gives inputData as an UnsafeMutableAudioBufferListPointer. The inputData list has two elements, and I want only one. Can anybody help me convert it into a data format that can be played with AVAudioPlayer?
I have followed this link before:
https://gist.github.com/hotpaw2/ba815fc23b5d642705f2b1dedfaf0107
let hwSRate = audioSession.sampleRate
try audioSession.setActive(true)
print("native Hardware rate : \(hwSRate)")
try audioSession.setPreferredIOBufferDuration(preferredIOBufferDuration)
try audioSession.setPreferredSampleRate(8000) // at 8000.0 Hz
print("Changed native Hardware rate : \(audioSession.sampleRate) buffer duration \(audioSession.ioBufferDuration)")

try auAudioUnit = AUAudioUnit(componentDescription: self.audioComponentDescription)

auAudioUnit.outputProvider = { // AURenderPullInputBlock
    (actionFlags, timestamp, frameCount, inputBusNumber, inputData) -> AUAudioUnitStatus in
    if let block = self.renderBlock { // AURenderBlock?
        let err: OSStatus = block(actionFlags,
                                  timestamp,
                                  frameCount,
                                  1,
                                  inputData,
                                  .none)
        if err == noErr {
            // save samples from current input buffer to circular buffer
            print("inputData = \(inputData) and frameCount: \(frameCount)")
            self.recordMicrophoneInputSamples(
                inputDataList: inputData,
                frameCount: UInt32(frameCount))
        }
    }
    let err2: AUAudioUnitStatus = noErr
    return err2
}
Log:-
Changed native Hardware rate : 8000.0 buffer duration 0.01600000075995922
Try to get 40 ms of data from the audio interface and then split it up into 20 ms chunks.
Also check whether you are able to set the sampling frequency (8 kHz) of the audio interface.
The render block will give you callbacks according to the setup the hardware accepted for the AUAudioUnit and the AVAudioSession. We have to manage a buffer if we want a different input size from the mic. Output to the speaker should be the size it expects, such as 128, 256, or 512 bytes.
try audioSession.setPreferredSampleRate(sampleRateProvided) // at 48000.0
try audioSession.setPreferredIOBufferDuration(preferredIOBufferDuration)
The values the hardware actually uses can differ from our preferred ones. That is why we have to use buffering logic to get our preferred input size.
Link: https://gist.github.com/hotpaw2/ba815fc23b5d642705f2b1dedfaf0107
renderBlock = auAudioUnit.renderBlock

if ( enableRecording
     && micPermissionGranted
     && audioSetupComplete
     && audioSessionActive
     && isRecording == false ) {

    auAudioUnit.inputHandler = { (actionFlags, timestamp, frameCount, inputBusNumber) in
        if let block = self.renderBlock { // AURenderBlock?
            var bufferList = AudioBufferList(
                mNumberBuffers: 1,
                mBuffers: AudioBuffer(
                    mNumberChannels: audioFormat!.channelCount,
                    mDataByteSize: 0,
                    mData: nil))
            let err: OSStatus = block(actionFlags,
                                      timestamp,
                                      frameCount,
                                      inputBusNumber,
                                      &bufferList,
                                      .none)
            if err == noErr {
                // save samples from current input buffer to circular buffer
                print("inputData = \(bufferList.mBuffers.mDataByteSize) and frameCount: \(frameCount) and count: \(count)")
                count += 1
                if !self.isMuteState {
                    self.recordMicrophoneInputSamples(
                        inputDataList: &bufferList,
                        frameCount: UInt32(frameCount))
                }
            }
        }
    }

    auAudioUnit.isInputEnabled = true

    auAudioUnit.outputProvider = { ( // AURenderPullInputBlock?
        actionFlags,
        timestamp,
        frameCount,
        inputBusNumber,
        inputDataList ) -> AUAudioUnitStatus in

        if let block = self.renderBlock {
            if let dataReceived = self.getInputDataForConsumption() {
                let mutabledata = NSMutableData(data: dataReceived)
                var bufferListSpeaker = AudioBufferList(
                    mNumberBuffers: 1,
                    mBuffers: AudioBuffer(
                        mNumberChannels: 1,
                        mDataByteSize: 0,
                        mData: nil))
                let err: OSStatus = block(actionFlags,
                                          timestamp,
                                          frameCount,
                                          1,
                                          &bufferListSpeaker,
                                          .none)
                if err == noErr {
                    bufferListSpeaker.mBuffers.mDataByteSize = UInt32(mutabledata.length)
                    bufferListSpeaker.mBuffers.mData = mutabledata.mutableBytes
                    inputDataList[0] = bufferListSpeaker
                    print("Output Provider mDataByteSize: \(inputDataList[0].mBuffers.mDataByteSize) output FrameCount: \(frameCount)")
                    return err
                } else {
                    print("Output Provider \(err)")
                    return err
                }
            }
        }
        return 0
    }

    auAudioUnit.isOutputEnabled = true

    do {
        circInIdx = 0 // initialize circular buffer pointers
        circOutIdx = 0
        circoutSpkIdx = 0
        circInSpkIdx = 0
        try auAudioUnit.allocateRenderResources()
        try auAudioUnit.startHardware() // equivalent to AudioOutputUnitStart ???
        isRecording = true
    } catch let e {
        print(e)
    }
} // end of setup `if`

Why am I not getting all the bytes of the image in the server?

I am building an iOS app that takes a photo and sends it to a TCP server running on my computer. The way I'm doing it is configuring the connection with Streams like this:
func setupCommunication() {
    var readStream: Unmanaged<CFReadStream>?
    var writeStream: Unmanaged<CFWriteStream>?

    CFStreamCreatePairWithSocketToHost(kCFAllocatorDefault,
                                       "192.168.1.40" as CFString, 2323, &readStream, &writeStream)
    outputStream = writeStream!.takeRetainedValue()
    outputStream.schedule(in: .current, forMode: .common)
    outputStream.open()
}
Then, when I press the camera button, the photo is taken and sent through the outputStream. Since the TCP server doesn't know how much data it has to read, the first 8 bytes correspond to the size of the image, and the image is sent right after, as we can see in this code:
func photoOutput(_ output: AVCapturePhotoOutput,
                 didFinishProcessingPhoto photo: AVCapturePhoto, error: Error?) {
    if let image = photo.fileDataRepresentation() {
        print(image)
        print(image.count)

        var nBytes = UInt64(image.count)
        let nData = Data(bytes: &nBytes, count: 8)
        _ = nData.withUnsafeBytes({
            outputStream.write($0, maxLength: nData.count)
        })
        _ = image.withUnsafeBytes({
            outputStream.write($0, maxLength: image.count)
        })
        outputStream.close()
    }
}
On the server side, which is written in C, I perform the following actions.
Read the first 8 bytes to learn the size of the image:
printf("\n[*] New client connected\n");

while (n_recv < sizeof(uint64_t)) {
    if ((n = read(client_sd, buffer, BUF_SIZ)) == -1) {
        printf("\n[-] Error reading data from the client\n");
        close(client_sd);
        close(server_sd);
        return 0;
    }
    n_recv += n;
}
memcpy(&img_size, buffer, sizeof(uint64_t));
printf("\n[+] Client says he's going to send %llu bytes\n", img_size);
Allocate enough memory to store the received image, and if we already read any bytes of the image along with its size, copy them:
if ((img_data = (uint8_t *) malloc(img_size)) == NULL) {
    printf("\n[-] Error allocating memory for image\n");
    close(client_sd);
    close(server_sd);
    return 0;
}

n_recv -= sizeof(uint64_t);
if (n_recv > 0) {
    memcpy(img_data, buffer, n_recv);
}
From now on, n_recv is the number of bytes received of the image only, not including the first 8 bytes for the size. Then just read till the end.
while (n_recv < img_size) {
    if ((n = read(client_sd, buffer, BUF_SIZ)) == -1) {
        printf("\n[-] Error reading data from the client\n");
        close(client_sd);
        close(server_sd);
        return 0;
    }
    memcpy(img_data + n_recv, buffer, n);
    n_recv += n;
}
printf("\n[+] Data correctly recived from client\n");
close(client_sd);
close(server_sd);
This works pretty nicely at first. In fact, I can see that I'm getting the right number for the image size every time.
However, I'm not getting the full image, and the server just stays blocked in the read function. To see what's happening, I added
printf("%llu\n", n_recv);
inside the loop that reads the image, to watch the number of bytes received. It stops partway through the image, for a reason I'm not able to explain.
What is causing the communication to stop? Is the problem in the server code, or is it something related to the iOS app?
First, the C code looks okay to me, but do you realize you are missing return-code/result handling in Swift?
In the C code you check the return value of read to know whether the bytes were read, i.e. you check whether read returns -1.
However, in the Swift code you assume that ALL the data was written. You never check the result of the write operation on OutputStream, which tells you how many bytes were written, or returns -1 on failure.
You should be doing the same thing (after all, you did it in C). For such cases I created two extensions:
extension InputStream {
    /**
     * Reads from the stream into a data buffer.
     * Returns the count of the amount of bytes read from the stream.
     * Returns -1 if reading fails or an error has occurred on the stream.
     **/
    func read(data: inout Data) -> Int {
        let bufferSize = 1024
        let buffer = UnsafeMutablePointer<UInt8>.allocate(capacity: bufferSize)
        defer { buffer.deallocate() }   // avoid leaking the scratch buffer

        var totalBytesRead = 0
        while true {
            let count = read(buffer, maxLength: bufferSize)
            if count == 0 {
                return totalBytesRead
            }
            if count == -1 {
                if let streamError = self.streamError {
                    debugPrint("Stream Error: \(String(describing: streamError))")
                }
                return -1
            }
            data.append(buffer, count: count)
            totalBytesRead += count
        }
    }
}
extension OutputStream {
    /**
     * Writes from a buffer into the stream.
     * Returns the count of the amount of bytes written to the stream.
     * Returns -1 if writing fails or an error has occurred on the stream.
     **/
    func write(data: Data) -> Int {
        var bytesRemaining = data.count
        var bytesWritten = 0

        while bytesRemaining > 0 {
            let count = data.withUnsafeBytes {
                self.write($0.advanced(by: bytesWritten), maxLength: bytesRemaining)
            }
            if count == 0 {
                return bytesWritten
            }
            if count < 0 {
                if let streamError = self.streamError {
                    debugPrint("Stream Error: \(String(describing: streamError))")
                }
                return -1
            }
            bytesRemaining -= count
            bytesWritten += count
        }
        return bytesWritten
    }
}
Usage:
var readStream: Unmanaged<CFReadStream>?
var writeStream: Unmanaged<CFWriteStream>?

// For testing I used 127.0.0.1
CFStreamCreatePairWithSocketToHost(kCFAllocatorDefault, "192.168.1.40" as CFString, 2323, &readStream, &writeStream)

// Actually not sure if these need to be retained or unretained, might be fine..
// Again, not sure..
var inputStream = readStream!.takeRetainedValue() as InputStream
var outputStream = writeStream!.takeRetainedValue() as OutputStream

inputStream.schedule(in: .current, forMode: .common)
outputStream.schedule(in: .current, forMode: .common)

inputStream.open()
outputStream.open()

var dataToWrite = Data()            // Your image
var dataRead = Data(capacity: 256)  // Server response -- pre-allocate something large enough that you "think" you might read

outputStream.write(data: dataToWrite)
inputStream.read(data: &dataRead)
Now you get error handling (printing) and you have buffered reading/writing. After all, you're not guaranteed that the socket, pipe, or whatever the stream is attached to has read/written ALL your bytes at once, hence reading and writing in chunks.
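Applied to the original problem (an 8-byte size prefix followed by the image), the write side might then look roughly like this, checking both results instead of assuming every byte was written; this is a sketch built on the write(data:) extension above:

// image is the Data returned by photo.fileDataRepresentation()
var imageSize = UInt64(image.count)
let sizePrefix = Data(bytes: &imageSize, count: MemoryLayout<UInt64>.size)

if outputStream.write(data: sizePrefix) != sizePrefix.count {
    print("Failed to write the size prefix")
} else if outputStream.write(data: image) != image.count {
    print("Failed to write the image data")
}
outputStream.close()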

Piping AudioKit Microphone to Google Speech-to-Text

I'm trying to get AudioKit to pipe the microphone to Google's Speech-to-Text API as seen here but I'm not entirely sure how to go about it.
To prepare the audio for the Speech-to-Text engine, you need to set up the encoding and pass it through in chunks. In the example Google provides, they use Apple's AVFoundation, but I'd like to use AudioKit so I can perform some pre-processing, such as cutting off low amplitudes, etc.
I believe the right way to do this is to use a Tap:
First, I should match the format by:
var asbd = AudioStreamBasicDescription()
asbd.mSampleRate = 16000.0
asbd.mFormatID = kAudioFormatLinearPCM
asbd.mFormatFlags = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked
asbd.mBytesPerPacket = 2
asbd.mFramesPerPacket = 1
asbd.mBytesPerFrame = 2
asbd.mChannelsPerFrame = 1
asbd.mBitsPerChannel = 16
AudioKit.format = AVAudioFormat(streamDescription: &asbd)!
Then create a tap such as:
open class TestTap {
    internal let bufferSize: UInt32 = 1_024

    @objc public init(_ input: AKNode?) {
        input?.avAudioNode.installTap(onBus: 0, bufferSize: bufferSize, format: AudioKit.format) { buffer, _ in
            // do work here
        }
    }
}
But I wasn't able to identify the right way to handle this data so it can be sent to the Google Speech-to-Text API via the streamAudioData method in real time with AudioKit. But perhaps I am going about this the wrong way?
UPDATE:
I've created a Tap as such:
open class TestTap {
    internal var audioData = NSMutableData()
    internal let bufferSize: UInt32 = 1_024

    func toData(buffer: AVAudioPCMBuffer) -> NSData {
        let channelCount = 2 // given PCMBuffer channel count is 2
        let channels = UnsafeBufferPointer(start: buffer.floatChannelData, count: channelCount)
        return NSData(bytes: channels[0],
                      length: Int(buffer.frameCapacity * buffer.format.streamDescription.pointee.mBytesPerFrame))
    }

    @objc public init(_ input: AKNode?) {
        input?.avAudioNode.installTap(onBus: 0, bufferSize: bufferSize, format: AudioKit.format) { buffer, _ in
            self.audioData.append(self.toData(buffer: buffer) as Data)

            // We recommend sending samples in 100ms chunks (from Google)
            let chunkSize: Int /* bytes/chunk */ = Int(0.1 /* seconds/chunk */
                * AudioKit.format.sampleRate /* samples/second */
                * 2 /* bytes/sample */ )

            if self.audioData.length > chunkSize {
                SpeechRecognitionService
                    .sharedInstance
                    .streamAudioData(self.audioData,
                                     completion: { response, error in
                                        if let error = error {
                                            print("ERROR: \(error.localizedDescription)")
                                            SpeechRecognitionService.sharedInstance.stopStreaming()
                                        } else if let response = response {
                                            print(response)
                                        }
                                     })
                self.audioData = NSMutableData()
            }
        }
    }
}
and in viewDidLoad:, I'm setting AudioKit up with:
AKSettings.sampleRate = 16_000
AKSettings.bufferLength = .shortest
However, Google complains with:
ERROR: Audio data is being streamed too fast. Please stream audio data approximately at real time.
I've tried changing multiple parameters such as the chunk size to no avail.
I found the solution here.
Final code for my Tap is:
open class GoogleSpeechToTextStreamingTap {

    internal var converter: AVAudioConverter!

    @objc public init(_ input: AKNode?, sampleRate: Double = 16000.0) {
        let format = AVAudioFormat(commonFormat: AVAudioCommonFormat.pcmFormatInt16,
                                   sampleRate: sampleRate, channels: 1, interleaved: false)!

        self.converter = AVAudioConverter(from: AudioKit.format, to: format)
        self.converter?.sampleRateConverterAlgorithm = AVSampleRateConverterAlgorithm_Normal
        self.converter?.sampleRateConverterQuality = .max

        let sampleRateRatio = AKSettings.sampleRate / sampleRate
        let inputBufferSize = 4410 // 100ms of 44.1K = 4410 samples.

        input?.avAudioNode.installTap(onBus: 0, bufferSize: AVAudioFrameCount(inputBufferSize), format: nil) { buffer, time in

            let capacity = Int(Double(buffer.frameCapacity) / sampleRateRatio)
            let bufferPCM16 = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(capacity))!

            var error: NSError? = nil
            self.converter?.convert(to: bufferPCM16, error: &error) { inNumPackets, outStatus in
                outStatus.pointee = AVAudioConverterInputStatus.haveData
                return buffer
            }

            let channel = UnsafeBufferPointer(start: bufferPCM16.int16ChannelData!, count: 1)
            let data = Data(bytes: channel[0], count: capacity * 2)

            SpeechRecognitionService
                .sharedInstance
                .streamAudioData(data,
                                 completion: { response, error in
                                    if let error = error {
                                        print("ERROR: \(error.localizedDescription)")
                                        SpeechRecognitionService.sharedInstance.stopStreaming()
                                    } else if let response = response {
                                        print(response)
                                    }
                                 })
        }
    }
}
You can likely record using AKNodeRecorder, and pass along the buffer from the resulting AKAudioFile to the API. If you wanted more real-time, you could try installing a tap on the avAudioNode property of the AKNode you want to record and pass the buffers to the API continuously.
However, I'm curious why you see the need for pre-processing - I'm sure the Google API is plenty optimized for recordings produced by the sample code you noted.
I've had a lot of success / fun with the iOS Speech API. Not sure if there's a reason you want to go with the Google API, but I'd consider checking it out and seeing if it might better serve your needs if you haven't already.
Hope this helps!
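For reference, a minimal sketch of what the built-in Speech framework route looks like, feeding tapped buffers into an SFSpeechAudioBufferRecognitionRequest; authorization via SFSpeechRecognizer.requestAuthorization is assumed to have been granted already, and the node name is hypothetical:

import Speech

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true

// Keep a reference to the task if you need to cancel it later.
let task = recognizer.recognitionTask(with: request) { result, error in
    if let result = result {
        print(result.bestTranscription.formattedString)
    } else if let error = error {
        print("Speech error: \(error.localizedDescription)")
    }
}

// In the tap block, append each AVAudioPCMBuffer as it arrives:
// node.installTap(onBus: 0, bufferSize: 1024, format: nil) { buffer, _ in
//     request.append(buffer)
// }
// Call request.endAudio() when recording stops.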

AAC encoding using AudioConverter and writing to AVAssetWriter

I'm struggling to encode audio buffers received from AVCaptureSession using AudioConverter and then append them to an AVAssetWriter.
I'm not getting any errors (including OSStatus responses), and the
CMSampleBuffers generated seem to have valid data, however the resulting file
simply does not have any playable audio. When writing together with video, the video
frames stop getting appended a couple of frames in (appendSampleBuffer()
returns false, but with no AVAssetWriter.error), probably because the asset
writer is waiting for the audio to catch up. I suspect it's related to the way
I'm setting up the priming for AAC.
The app uses RxSwift, but I've removed the RxSwift parts so that it's easier to
understand for a wider audience.
Please check out comments in the code below for more... comments
Given a settings struct:
import Foundation
import AVFoundation
import CleanroomLogger
public struct AVSettings {

    let orientation: AVCaptureVideoOrientation = .Portrait
    let sessionPreset = AVCaptureSessionPreset1280x720
    let videoBitrate: Int = 2_000_000
    let videoExpectedFrameRate: Int = 30
    let videoMaxKeyFrameInterval: Int = 60
    let audioBitrate: Int = 32 * 1024

    /// Settings that are `0` mean variable rate.
    /// The `mSampleRate` and `mChannelsPerFrame` are overwritten at run-time
    /// to values based on the input stream.
    let audioOutputABSD = AudioStreamBasicDescription(
        mSampleRate: AVAudioSession.sharedInstance().sampleRate,
        mFormatID: kAudioFormatMPEG4AAC,
        mFormatFlags: UInt32(MPEG4ObjectID.AAC_Main.rawValue),
        mBytesPerPacket: 0,
        mFramesPerPacket: 1024,
        mBytesPerFrame: 0,
        mChannelsPerFrame: 1,
        mBitsPerChannel: 0,
        mReserved: 0)

    let audioEncoderClassDescriptions = [
        AudioClassDescription(
            mType: kAudioEncoderComponentType,
            mSubType: kAudioFormatMPEG4AAC,
            mManufacturer: kAppleSoftwareAudioCodecManufacturer) ]
}
Some helper functions:
public func getVideoDimensions(fromSettings settings: AVSettings) -> (Int, Int) {
    switch (settings.sessionPreset, settings.orientation) {
    case (AVCaptureSessionPreset1920x1080, .Portrait): return (1080, 1920)
    case (AVCaptureSessionPreset1280x720, .Portrait): return (720, 1280)
    default: fatalError("Unsupported session preset and orientation")
    }
}

public func createAudioFormatDescription(fromSettings settings: AVSettings) -> CMAudioFormatDescription {
    var result = noErr
    var absd = settings.audioOutputABSD
    var description: CMAudioFormatDescription?
    withUnsafePointer(&absd) { absdPtr in
        result = CMAudioFormatDescriptionCreate(nil,
                                                absdPtr,
                                                0, nil,
                                                0, nil,
                                                nil,
                                                &description)
    }
    if result != noErr {
        Log.error?.message("Could not create audio format description")
    }
    return description!
}

public func createVideoFormatDescription(fromSettings settings: AVSettings) -> CMVideoFormatDescription {
    var result = noErr
    var description: CMVideoFormatDescription?
    let (width, height) = getVideoDimensions(fromSettings: settings)
    result = CMVideoFormatDescriptionCreate(nil,
                                            kCMVideoCodecType_H264,
                                            Int32(width),
                                            Int32(height),
                                            [:],
                                            &description)
    if result != noErr {
        Log.error?.message("Could not create video format description")
    }
    return description!
}
This is how the asset writer is initialized:
guard let audioDevice = defaultAudioDevice() else {
    throw RecordError.MissingDeviceFeature("Microphone")
}
guard let videoDevice = defaultVideoDevice(.Back) else {
    throw RecordError.MissingDeviceFeature("Camera")
}

let videoInput = try AVCaptureDeviceInput(device: videoDevice)
let audioInput = try AVCaptureDeviceInput(device: audioDevice)

let videoFormatHint = createVideoFormatDescription(fromSettings: settings)
let audioFormatHint = createAudioFormatDescription(fromSettings: settings)

let writerVideoInput = AVAssetWriterInput(mediaType: AVMediaTypeVideo,
                                          outputSettings: nil,
                                          sourceFormatHint: videoFormatHint)
let writerAudioInput = AVAssetWriterInput(mediaType: AVMediaTypeAudio,
                                          outputSettings: nil,
                                          sourceFormatHint: audioFormatHint)

writerVideoInput.expectsMediaDataInRealTime = true
writerAudioInput.expectsMediaDataInRealTime = true

let url = NSURL(fileURLWithPath: NSTemporaryDirectory(), isDirectory: true)
    .URLByAppendingPathComponent(NSProcessInfo.processInfo().globallyUniqueString)
    .URLByAppendingPathExtension("mp4")

let assetWriter = try AVAssetWriter(URL: url, fileType: AVFileTypeMPEG4)
if !assetWriter.canAddInput(writerVideoInput) {
    throw RecordError.Unknown("Could not add video input")
}
if !assetWriter.canAddInput(writerAudioInput) {
    throw RecordError.Unknown("Could not add audio input")
}
assetWriter.addInput(writerVideoInput)
assetWriter.addInput(writerAudioInput)
And this is how audio samples are being encoded; the problem area is most likely around here. I've rewritten this so that it doesn't use any Rx-isms.
var outputABSD = settings.audioOutputABSD
var outputFormatDescription: CMAudioFormatDescription! = nil
CMAudioFormatDescriptionCreate(nil, &outputABSD, 0, nil, 0, nil, nil, &outputFormatDescription)

var converter: AudioConverter?
// Indicates whether priming information has been attached to the first buffer
var primed = false
func encodeAudioBuffer(settings: AVSettings, buffer: CMSampleBuffer) throws -> CMSampleBuffer? {

    // Create the audio converter if it's not available
    if converter == nil {
        var classDescriptions = settings.audioEncoderClassDescriptions
        var inputABSD = CMAudioFormatDescriptionGetStreamBasicDescription(CMSampleBufferGetFormatDescription(buffer)!).memory
        var outputABSD = settings.audioOutputABSD
        outputABSD.mSampleRate = inputABSD.mSampleRate
        outputABSD.mChannelsPerFrame = inputABSD.mChannelsPerFrame

        var converter: AudioConverterRef = nil
        var result = noErr
        result = withUnsafePointer(&outputABSD) { outputABSDPtr in
            return withUnsafePointer(&inputABSD) { inputABSDPtr in
                return AudioConverterNewSpecific(inputABSDPtr,
                                                 outputABSDPtr,
                                                 UInt32(classDescriptions.count),
                                                 &classDescriptions,
                                                 &converter)
            }
        }
        if result != noErr { throw RecordError.Unknown }

        // At this point I made an attempt to retrieve priming info from
        // the audio converter assuming that it will give me back default values
        // I can use, but ended up with `nil`
        var primeInfo: AudioConverterPrimeInfo? = nil
        var primeInfoSize = UInt32(sizeof(AudioConverterPrimeInfo))

        // The following returns `noErr` but `primeInfo` is still `nil`
        AudioConverterGetProperty(converter,
                                  kAudioConverterPrimeInfo,
                                  &primeInfoSize,
                                  &primeInfo)

        // I've also tried to set `kAudioConverterPrimeInfo` so that it knows
        // the leading frames that are being primed, but the set didn't seem to work
        // (`noErr` but getting the property afterwards still returned `nil`)
    }

    let converter = converter!

    // Need to give a big enough output buffer.
    // The assumption is that it will always be <= to the input size
    let numSamples = CMSampleBufferGetNumSamples(buffer)
    // This becomes 1024 * 2 = 2048
    let outputBufferSize = numSamples * Int(inputABSD.mBytesPerPacket)
    let outputBufferPtr = UnsafeMutablePointer<Void>.alloc(outputBufferSize)

    defer {
        outputBufferPtr.destroy()
        outputBufferPtr.dealloc(1)
    }

    var result = noErr
    var outputPacketCount = UInt32(1)
    var outputData = AudioBufferList(
        mNumberBuffers: 1,
        mBuffers: AudioBuffer(
            mNumberChannels: outputABSD.mChannelsPerFrame,
            mDataByteSize: UInt32(outputBufferSize),
            mData: outputBufferPtr))

    // See below for `EncodeAudioUserData`
    var userData = EncodeAudioUserData(inputSampleBuffer: buffer,
                                       inputBytesPerPacket: inputABSD.mBytesPerPacket)

    withUnsafeMutablePointer(&userData) { userDataPtr in
        // See below for `fetchAudioProc`
        result = AudioConverterFillComplexBuffer(
            converter,
            fetchAudioProc,
            userDataPtr,
            &outputPacketCount,
            &outputData,
            nil)
    }

    if result != noErr {
        Log.error?.message("Error while trying to encode audio buffer, code: \(result)")
        return nil
    }

    // See below for `CMSampleBufferCreateCopy`
    guard let newBuffer = CMSampleBufferCreateCopy(buffer,
                                                   fromAudioBufferList: &outputData,
                                                   newFromatDescription: outputFormatDescription) else {
        Log.error?.message("Could not create sample buffer from audio buffer list")
        return nil
    }

    if !primed {
        primed = true
        // Simply picked 2112 samples based on convention, is there a better way to determine this?
        let samplesToPrime: Int64 = 2112
        let samplesPerSecond = Int32(settings.audioOutputABSD.mSampleRate)
        let primingDuration = CMTimeMake(samplesToPrime, samplesPerSecond)

        // Without setting the attachment the asset writer will complain about the
        // first buffer missing the `TrimDurationAtStart` attachment; is there a way
        // to infer the value from the given `AudioBufferList`?
        CMSetAttachment(newBuffer,
                        kCMSampleBufferAttachmentKey_TrimDurationAtStart,
                        CMTimeCopyAsDictionary(primingDuration, nil),
                        kCMAttachmentMode_ShouldNotPropagate)
    }

    return newBuffer
}
Below is the proc that fetches samples for the audio converter, and the data
structure that gets passed to it:
private class EncodeAudioUserData {
    var inputSampleBuffer: CMSampleBuffer?
    var inputBytesPerPacket: UInt32

    init(inputSampleBuffer: CMSampleBuffer,
         inputBytesPerPacket: UInt32) {
        self.inputSampleBuffer = inputSampleBuffer
        self.inputBytesPerPacket = inputBytesPerPacket
    }
}

private let fetchAudioProc: AudioConverterComplexInputDataProc = {
    (inAudioConverter,
     ioDataPacketCount,
     ioData,
     outDataPacketDescriptionPtrPtr,
     inUserData) in

    var result = noErr

    if ioDataPacketCount.memory == 0 { return noErr }

    let userData = UnsafeMutablePointer<EncodeAudioUserData>(inUserData).memory

    // If it's already been processed
    guard let buffer = userData.inputSampleBuffer else {
        ioDataPacketCount.memory = 0
        return -1
    }

    var inputBlockBuffer: CMBlockBuffer?
    var inputBufferList = AudioBufferList()
    result = CMSampleBufferGetAudioBufferListWithRetainedBlockBuffer(
        buffer,
        nil,
        &inputBufferList,
        sizeof(AudioBufferList),
        nil,
        nil,
        0,
        &inputBlockBuffer)
    if result != noErr {
        Log.error?.message("Error while trying to retrieve buffer list, code: \(result)")
        ioDataPacketCount.memory = 0
        return result
    }

    let packetsCount = inputBufferList.mBuffers.mDataByteSize / userData.inputBytesPerPacket
    ioDataPacketCount.memory = packetsCount

    ioData.memory.mBuffers.mNumberChannels = inputBufferList.mBuffers.mNumberChannels
    ioData.memory.mBuffers.mDataByteSize = inputBufferList.mBuffers.mDataByteSize
    ioData.memory.mBuffers.mData = inputBufferList.mBuffers.mData

    if outDataPacketDescriptionPtrPtr != nil {
        outDataPacketDescriptionPtrPtr.memory = nil
    }

    return noErr
}
This is how I am converting AudioBufferLists to CMSampleBuffers:
public func CMSampleBufferCreateCopy(
    buffer: CMSampleBuffer,
    inout fromAudioBufferList bufferList: AudioBufferList,
    newFromatDescription formatDescription: CMFormatDescription? = nil)
    -> CMSampleBuffer? {

    var result = noErr

    var sizeArray: [Int] = [Int(bufferList.mBuffers.mDataByteSize)]

    // Copy timing info from the previous buffer
    var timingInfo = CMSampleTimingInfo()
    result = CMSampleBufferGetSampleTimingInfo(buffer, 0, &timingInfo)
    if result != noErr { return nil }

    var newBuffer: CMSampleBuffer?
    result = CMSampleBufferCreateReady(
        kCFAllocatorDefault,
        nil,
        formatDescription ?? CMSampleBufferGetFormatDescription(buffer),
        Int(bufferList.mNumberBuffers),
        1, &timingInfo,
        1, &sizeArray,
        &newBuffer)
    if result != noErr { return nil }

    guard let b = newBuffer else { return nil }
    CMSampleBufferSetDataBufferFromAudioBufferList(b, nil, nil, 0, &bufferList)
    return newBuffer
}
Is there anything that I am obviously doing wrong? Is there a proper way to
construct CMSampleBuffers from AudioBufferList? How do you transfer priming
information from the converter to CMSampleBuffers that you create?
For my use case I need to do the encoding manually as the buffers will be
manipulated further down the pipeline (although I've disabled all
transformations after the encode in order to make sure that it works.)
Any help would be much appreciated. Sorry that there's so much code to
digest, but I wanted to provide as much context as possible.
Thanks in advance :)
Some related questions:
CMSampleBufferRef kCMSampleBufferAttachmentKey_TrimDurationAtStart crash
Can I use AVCaptureSession to encode an AAC stream to memory?
Writing video + generated audio to AVAssetWriterInput, audio stuttering
How do I use CoreAudio's AudioConverter to encode AAC in real-time?
Some references I've used:
Apple sample code demonstrating how to use AudioConverter
Note describing AAC encoder delay
Turns out there were a variety of things that I was doing wrong. Instead of posting a garble of code, I'm going to try to organize this into bite-sized pieces of things that I discovered.
Samples vs Packets vs Frames
This had been a huge source of confusion for me:
Each CMSampleBuffer can contain 1 or more samples (discovered via CMSampleBufferGetNumSamples).
Each CMSampleBuffer that contains 1 sample represents a single audio packet.
Therefore, CMSampleBufferGetNumSamples(sample) will return the number of packets contained in the given buffer.
Packets contain frames. This is governed by the mFramesPerPacket property of the buffer's AudioStreamBasicDescription. For linear PCM buffers, the total size of each sample buffer is frames * bytes per frame. For compressed buffers (like AAC), there is no relationship between the total size and frame count.
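To make the linear PCM relationship concrete, a small sketch in modern Swift (a hypothetical helper, not part of the original code):

func pcmByteSize(of buffer: CMSampleBuffer) -> Int? {
    guard let format = CMSampleBufferGetFormatDescription(buffer),
          let asbd = CMAudioFormatDescriptionGetStreamBasicDescription(format)?.pointee,
          asbd.mFormatID == kAudioFormatLinearPCM else { return nil }
    // For PCM, mFramesPerPacket == 1, so samples == packets == frames,
    // and the payload size is simply frames * bytes-per-frame.
    return CMSampleBufferGetNumSamples(buffer) * Int(asbd.mBytesPerFrame)
}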
AudioConverterComplexInputDataProc
This callback is used to retrieve more linear PCM audio data for encoding. It's imperative that you supply at least the number of packets specified by ioNumberDataPackets. Since I've been using the converter for real-time push-style encoding, I needed to ensure that each data push contains the minimum amount of packets. Something like this (pseudo-code):
let minimumPackets = outputFramesPerPacket / inputFramesPerPacket
var buffers: [CMSampleBuffer] = []
while getTotalSize(buffers) < minimumPackets {
    buffers = buffers + [getNextBuffer()]
}
AudioConverterFillComplexBuffer(...)
Slicing CMSampleBuffer's
You can actually slice CMSampleBuffers if they contain multiple samples. The tool to do this is CMSampleBufferCopySampleBufferForRange. This is nice because you can provide the AudioConverterComplexInputDataProc with the exact number of packets that it asks for, which makes handling the timing information for the resulting encoded buffer easier. If you give the converter 1500 frames of data when it expects 1024, the resulting sample buffer will have a duration of 1024/sampleRate as opposed to 1500/sampleRate.
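For example, a slicing helper might look roughly like this (modern Swift, assuming the C-style CMSampleBufferCopySampleBufferForRange signature; the helper name is hypothetical):

func firstPackets(_ packetsNeeded: CMItemCount, of buffer: CMSampleBuffer) -> CMSampleBuffer? {
    guard CMSampleBufferGetNumSamples(buffer) >= packetsNeeded else { return nil }
    var slice: CMSampleBuffer?
    // The range is expressed in samples (packets), not bytes.
    let status = CMSampleBufferCopySampleBufferForRange(kCFAllocatorDefault,
                                                        buffer,
                                                        CFRangeMake(0, packetsNeeded),
                                                        &slice)
    return status == noErr ? slice : nil
}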
Priming and trim duration
When doing AAC encoding, you must set the trim duration like so:
CMSetAttachment(buffer,
                kCMSampleBufferAttachmentKey_TrimDurationAtStart,
                CMTimeCopyAsDictionary(primingDuration, kCFAllocatorDefault),
                kCMAttachmentMode_ShouldNotPropagate)
One thing I did wrong was that I added the trim duration at encode time. This should be handled by your writer so that it can guarantee the information gets added to your leading audio frames.
Also, the value of kCMSampleBufferAttachmentKey_TrimDurationAtStart should never be greater than the duration of the sample buffer. An example of priming:
Priming frames: 2112
Sample rate: 44100
Priming duration: 2112 / 44100 = ~0.0479s
First frame, frames: 1024, priming duration: 1024 / 44100
Second frame, frames: 1024, priming duration: 1088 / 44100
Creating the new CMSampleBuffer
AudioConverterFillComplexBuffer has an optional outputPacketDescriptionsPtr. You should use it. It will point to a new array of packet descriptions that contains sample size information. You need this sample size information to construct the new compressed sample buffer:
let bufferList: AudioBufferList
let packetDescriptions: [AudioStreamPacketDescription]
var newBuffer: CMSampleBuffer?

CMAudioSampleBufferCreateWithPacketDescriptions(
    kCFAllocatorDefault,            // allocator
    nil,                            // dataBuffer
    false,                          // dataReady
    nil,                            // makeDataReadyCallback
    nil,                            // makeDataReadyRefCon
    formatDescription,              // formatDescription
    Int(bufferList.mNumberBuffers), // numSamples
    CMSampleBufferGetPresentationTimeStamp(buffer), // sbufPTS (first PTS)
    &packetDescriptions,            // packetDescriptions
    &newBuffer)
