My app uses Audio Converter Services to convert audio from 44.1 kHz to 48 kHz (16-bit linear PCM, mono) with AudioConverterFillComplexBuffer.
After upgrading iOS to 11.0 (or maybe 11.4), the audio contains "noises" that are caused by the callback returning samples with a value of zero at the "edges" of the buffer (I'm not sure whether it is the first or the last sample).
Has anyone noticed a change, or does anyone know what changed? It has been working fine for years, and it still works fine on devices that run iOS 9.x.
This is my setup:
// prepare the formats
// origin
AudioStreamBasicDescription originFormat = {0};
FillOutASBDForLPCM(originFormat, 44100.00, 1, sizeof(SInt16)*8, sizeof(SInt16)*8, false, false, false);
originFormat.mFormatFlags |= kAudioFormatFlagIsSignedInteger | kAudioFormatFlagsNativeEndian | kAudioFormatFlagIsPacked;
originFormat.mReserved = 0;
// destination
AudioStreamBasicDescription destFormat = {0};
FillOutASBDForLPCM(destFormat, 48000.0, 1, sizeof(SInt16)*8, sizeof(SInt16)*8, false, false, false);
destFormat.mFormatFlags |= kAudioFormatFlagIsSignedInteger | kAudioFormatFlagsNativeEndian | kAudioFormatFlagIsPacked;
destFormat.mReserved = 0;
// create a converter
AudioConverterRef audioConverter;
AudioConverterNew(&originFormat, &destFormat, &audioConverter);
I have found that converting between sample rates used to be more tolerant of missing data at the edges of the buffer.
For example, if you converted a buffer of 1024 frames and needed all of them converted to a new sample rate, but never provided samples before or after the buffer, Apple used to round the numbers so that the noise was minimal.
However, starting with iOS 11.4 (or so), the first frame of the converted buffer is very close to zero (probably because the converter looks for samples before the first sample and can't find any).
The fix was to provide some extra samples around the buffer in question. For example, to convert the 1024-frame buffer, I sent the converter about 100 samples before and after that range (1224 in total), then read the result starting from sample number 100. Once I did this for every buffer, the result was clean.
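The padding described above can be sketched in plain C. This is only an illustration: padded_range and output_frames_to_skip are hypothetical helper names, not Core Audio API, and scaling the skip count by the rate ratio is my assumption (the answer itself simply reads from sample 100).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the slice of the source stream to feed the converter. */
typedef struct {
    size_t start; /* first source frame handed to the converter */
    size_t count; /* total source frames handed to the converter */
    size_t lead;  /* context frames that precede the wanted range */
} PaddedRange;

/* Widen [wantedStart, wantedStart + wantedCount) by up to `pad` frames on each
   side, clamped to the bounds of a stream that holds `totalFrames` frames. */
static PaddedRange padded_range(size_t wantedStart, size_t wantedCount,
                                size_t totalFrames, size_t pad)
{
    assert(wantedStart + wantedCount <= totalFrames);
    PaddedRange r;
    size_t available = totalFrames - (wantedStart + wantedCount);
    size_t trail = available < pad ? available : pad;
    r.lead = wantedStart < pad ? wantedStart : pad;
    r.start = wantedStart - r.lead;
    r.count = r.lead + wantedCount + trail;
    return r;
}

/* Assumption: if the skip is counted in OUTPUT frames, the leading context
   scales with the rate ratio. Rounded to nearest. */
static size_t output_frames_to_skip(size_t leadInputFrames,
                                    double inRate, double outRate)
{
    return (size_t)((double)leadInputFrames * outRate / inRate + 0.5);
}
```

For a 1024-frame range in the middle of a stream with pad = 100, this yields 1224 input frames, matching the numbers above. At the very start or end of the stream less context exists, so the range is clamped.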
I'm attempting to sync recorded audio (from an AVAudioEngine inputNode) to an audio file that was playing during the recording process. The result should be like multitrack recording where each subsequent new track is synced with the previous tracks that were playing at the time of recording.
Because sampleTime differs between the AVAudioEngine's output and input nodes, I use hostTime to determine the offset of the original audio and the input buffers.
On iOS, I would assume that I'd have to use AVAudioSession's various latency properties (inputLatency, outputLatency, ioBufferDuration) to reconcile the tracks as well as the host time offset, but I haven't figured out the magic combination to make them work. The same goes for the various AVAudioEngine and Node properties like latency and presentationLatency.
On macOS, AVAudioSession doesn't exist (outside of Catalyst), meaning I don't have access to those numbers. Meanwhile, the latency/presentationLatency properties on the AVAudioNodes report 0.0 in most circumstances. On macOS, I do have access to AudioObjectGetPropertyData and can ask the system about kAudioDevicePropertyLatency, kAudioDevicePropertyBufferSize, kAudioDevicePropertySafetyOffset, etc., but I am again at a bit of a loss as to what formula reconciles all of these.
I have a sample project at https://github.com/jnpdx/AudioEngineLoopbackLatencyTest that runs a simple loopback test (on macOS, iOS, or Mac Catalyst) and shows the result. On my Mac, the offset between tracks is ~720 samples. On others' Macs, I've seen as much as 1500 samples offset.
On my iPhone, I can get it close to sample-perfect by using AVAudioSession's outputLatency + inputLatency. However, the same formula leaves things misaligned on my iPad.
What's the magic formula for syncing the input and output timestamps on each platform? I know it may be different on each, which is fine, and I know I won't get 100% accuracy, but I would like to get as close as possible before going through my own calibration process.
Here's a sample of my current code (full sync logic can be found at https://github.com/jnpdx/AudioEngineLoopbackLatencyTest/blob/main/AudioEngineLoopbackLatencyTest/AudioManager.swift):
//Schedule playback of original audio during initial playback
let delay = 0.33 * state.secondsToTicks
let audioTime = AVAudioTime(hostTime: mach_absolute_time() + UInt64(delay))
state.audioBuffersScheduledAtHost = audioTime.hostTime
...
//in the inputNode's inputTap, store the first timestamp
audioEngine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (pcmBuffer, timestamp) in
    if self.state.inputNodeTapBeganAtHost == 0 {
        self.state.inputNodeTapBeganAtHost = timestamp.hostTime
    }
}
...
//after playback, attempt to reconcile/sync the timestamps recorded above
let timestampToSyncTo = state.audioBuffersScheduledAtHost
let inputNodeHostTimeDiff = Int64(state.inputNodeTapBeganAtHost) - Int64(timestampToSyncTo)
let inputNodeDiffInSamples = Double(inputNodeHostTimeDiff) / state.secondsToTicks * inputFileBuffer.format.sampleRate //secondsToTicks is calculated using mach_timebase_info
//play the original metronome audio at sample position 0 and try to sync everything else up to it
let originalAudioTime = AVAudioTime(sampleTime: 0, atRate: renderingEngine.mainMixerNode.outputFormat(forBus: 0).sampleRate)
originalAudioPlayerNode.scheduleBuffer(metronomeFileBuffer, at: originalAudioTime, options: []) {
    print("Played original audio")
}
//play the tap of the input node at its determined sync time -- this _does not_ appear to line up in the result file
let inputAudioTime = AVAudioTime(sampleTime: AVAudioFramePosition(inputNodeDiffInSamples), atRate: renderingEngine.mainMixerNode.outputFormat(forBus: 0).sampleRate)
recordedInputNodePlayer.scheduleBuffer(inputFileBuffer, at: inputAudioTime, options: []) {
    print("Input buffer played")
}
When running the sample app, here's the result I get:
This answer is applicable to native macOS only
General Latency Determination
Output
In the general case the output latency for a stream on a device is determined by the sum of the following properties:
kAudioDevicePropertySafetyOffset
kAudioStreamPropertyLatency
kAudioDevicePropertyLatency
kAudioDevicePropertyBufferFrameSize
The device safety offset, stream, and device latency values should be retrieved for kAudioObjectPropertyScopeOutput.
On my Mac for the audio device MacBook Pro Speakers at 44.1 kHz this equates to 71 + 424 + 11 + 512 = 1018 frames.
Input
Similarly, the input latency is determined by the sum of the following properties:
kAudioDevicePropertySafetyOffset
kAudioStreamPropertyLatency
kAudioDevicePropertyLatency
kAudioDevicePropertyBufferFrameSize
The device safety offset, stream, and device latency values should be retrieved for kAudioObjectPropertyScopeInput.
On my Mac for the audio device MacBook Pro Microphone at 44.1 kHz this equates to 114 + 2404 + 40 + 512 = 3070 frames.
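The two sums above can be reproduced with a small C sketch. ScopeLatency and total_latency_frames are hypothetical names for illustration; in a real program the four fields would be filled from AudioObjectGetPropertyData for the relevant scope.

```c
#include <stdint.h>

/* Hypothetical container for the four values read for one scope
   (kAudioObjectPropertyScopeInput or kAudioObjectPropertyScopeOutput). */
typedef struct {
    uint32_t safetyOffset;    /* kAudioDevicePropertySafetyOffset */
    uint32_t streamLatency;   /* kAudioStreamPropertyLatency */
    uint32_t deviceLatency;   /* kAudioDevicePropertyLatency */
    uint32_t bufferFrameSize; /* kAudioDevicePropertyBufferFrameSize */
} ScopeLatency;

/* Total latency for one direction, in frames. */
static uint32_t total_latency_frames(ScopeLatency s)
{
    return s.safetyOffset + s.streamLatency + s.deviceLatency + s.bufferFrameSize;
}

/* Convert a frame count to milliseconds at the given sample rate. */
static double frames_to_ms(uint32_t frames, double sampleRate)
{
    return 1000.0 * (double)frames / sampleRate;
}
```

With the values quoted above, the output side gives 1018 frames (about 23 ms at 44.1 kHz) and the input side gives 3070 frames (about 70 ms).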
AVAudioEngine
How the information above relates to AVAudioEngine is not immediately clear. Internally AVAudioEngine creates a private aggregate device and Core Audio essentially handles latency compensation for aggregate devices automatically.
During experimentation for this answer I've found that some (most?) audio devices don't report latency correctly. At least that is how it seems, which makes accurate latency determination nigh impossible.
I was able to get fairly accurate synchronization using my Mac's built-in audio using the following adjustments:
// Some non-zero value to get AVAudioEngine running
let startDelay = 0.1
// The original audio file start time
let originalStartingFrame: AVAudioFramePosition = AVAudioFramePosition(playerNode.outputFormat(forBus: 0).sampleRate * startDelay)
// The output tap's first sample is delivered to the device after the buffer is filled once
// A number of zero samples equal to the buffer size is produced initially
let outputStartingFrame: AVAudioFramePosition = Int64(state.outputBufferSizeFrames)
// The first output sample makes its way back into the input tap after accounting for all the latencies
let inputStartingFrame: AVAudioFramePosition = outputStartingFrame - Int64(state.outputLatency + state.outputStreamLatency + state.outputSafetyOffset + state.inputSafetyOffset + state.inputLatency + state.inputStreamLatency)
On my Mac the values reported by the AVAudioEngine aggregate device were:
// Output:
// kAudioDevicePropertySafetyOffset: 144
// kAudioDevicePropertyLatency: 11
// kAudioStreamPropertyLatency: 424
// kAudioDevicePropertyBufferFrameSize: 512
// Input:
// kAudioDevicePropertySafetyOffset: 154
// kAudioDevicePropertyLatency: 0
// kAudioStreamPropertyLatency: 2404
// kAudioDevicePropertyBufferFrameSize: 512
which equated to the following offsets:
originalStartingFrame = 4410
outputStartingFrame = 512
inputStartingFrame = -2625
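As a check on the arithmetic, here is a plain C sketch of the same offset calculation. The struct and function names are hypothetical, and the sketch rounds the start delay to the nearest frame, whereas the Swift code above truncates.

```c
#include <stdint.h>

/* Hypothetical bundle of the six latency values used in the formula above. */
typedef struct {
    int64_t outputLatency;       /* kAudioDevicePropertyLatency, output scope */
    int64_t outputStreamLatency; /* kAudioStreamPropertyLatency, output scope */
    int64_t outputSafetyOffset;  /* kAudioDevicePropertySafetyOffset, output scope */
    int64_t inputLatency;        /* kAudioDevicePropertyLatency, input scope */
    int64_t inputStreamLatency;  /* kAudioStreamPropertyLatency, input scope */
    int64_t inputSafetyOffset;   /* kAudioDevicePropertySafetyOffset, input scope */
} RoundTripLatency;

/* Start frame of the original file: the start delay expressed in frames.
   Rounds to nearest (a deviation from the truncating Swift conversion). */
static int64_t original_starting_frame(double sampleRate, double startDelaySeconds)
{
    return (int64_t)(sampleRate * startDelaySeconds + 0.5);
}

/* Start frame of the recorded input: one output buffer of initial silence,
   minus every latency on the way out and back in. */
static int64_t input_starting_frame(int64_t outputBufferFrames, RoundTripLatency l)
{
    int64_t roundTrip = l.outputLatency + l.outputStreamLatency + l.outputSafetyOffset
                      + l.inputSafetyOffset + l.inputLatency + l.inputStreamLatency;
    return outputBufferFrames - roundTrip;
}
```

Plugging in the aggregate-device values listed above (11, 424, 144 for output and 0, 2404, 154 for input) gives a round trip of 3137 frames, so 512 - 3137 = -2625.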
I may not be able to answer your question, but I believe there is a property not mentioned in your question that does report additional latency information.
I've only worked at the HAL/AUHAL layers (never AVAudioEngine), but in discussions about computing the overall latencies, some audio device/stream properties come up: kAudioDevicePropertyLatency and kAudioStreamPropertyLatency.
Poking around a bit, I see those properties mentioned in the documentation for AVAudioIONode's presentationLatency property (https://developer.apple.com/documentation/avfoundation/avaudioionode/1385631-presentationlatency). I expect that the hardware latency reported by the driver will be there. (I suspect that the standard latency property reports the latency for an input sample to appear in the output of a "normal" node, and that the IO case is special.)
It's not in the context of AVAudioEngine, but here's one message from the CoreAudio mailing list that talks a bit about using the low level properties that may provide some additional background: https://lists.apple.com/archives/coreaudio-api/2017/Jul/msg00035.html
The one who solves this has to have the Sherlock Holmes trophy. Here it goes.
I'm using AudioQueues to record sound (LPCM, SInt16, 4 buffers). In the callback, I tried measuring the mean amplitude by converting the samples to float and using vDSP_meamgv. Here are some example means:
Mean, No of samples
44.400364, 44100
36.077393, 44100
27.672422, 41984
2889.821289, 44100
57.481972, 44100
58.967506, 42872
54.691631, 44100
2894.467285, 44100
62.697800, 42872
63.732948, 44100
66.575623, 44100
2979.566406, 42872
As you can see, every fourth (last) buffer is wild. I looked at the individual samples: there are lots of 0's and lots of huge numbers, and no normal numbers like in the other buffers. Things get more interesting. If I use 3 buffers instead, the third one (always the last) is the bogey. And this holds for any number of buffers I choose.
I put an if in the callback to not enqueue the wild buffers, and once that buffer is gone there are no more huge numbers; the other buffers continue to fill normally. I put in a button that re-enqueues the buffer after it has been dropped, and once I re-enqueue it, it again gets filled with gigantic samples (namely that buffer!).
And now the cherry on top: I put my code to calculate the mean in other projects, like SpeakHere from Apple, and the same thing happens there o.O, although the app works fine, recording and playing back what was recorded.
I just don't get it. I've racked my brain trying to figure this one out. If somebody has a clue...
Here's the callback, if it helps:
void Recorder::MyInputBufferHandler(void * inUserData,
                                    AudioQueueRef inAQ,
                                    AudioQueueBufferRef inBuffer,
                                    const AudioTimeStamp * inStartTime,
                                    UInt32 inNumPackets,
                                    const AudioStreamPacketDescription* inPacketDesc) {
    Recorder* eu = (Recorder*)inUserData;
    vDSP_vflt16((SInt16*)inBuffer->mAudioData, 1, eu->conveier, 1, inBuffer->mAudioDataByteSize);
    float mean;
    vDSP_meamgv(eu->conveier, 1, &mean, inBuffer->mAudioDataByteSize);
    printf("values: %f, %d\n",mean,inBuffer->mAudioDataByteSize);
    // if (mean<2300)
    AudioQueueEnqueueBuffer(inAQ, inBuffer, 0, NULL);
}
'conveier' is a float array I've preallocated.
It's also me that gets the trophy. The error was that the vDSP functions shouldn't have got the mAudioDataByteSize parameter, because they need the number of ELEMENTS in the array. In my case each element (SInt16) has 2 bytes, so I should have passed mAudioDataByteSize / 2. When it read the last buffer, it fell off the edge by another length and counted some random data. Voila! Very basic mistake, but when you look in all the wrong places, it doesn't appear so.
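To make the bytes-versus-elements distinction concrete, here is a small portable C sketch that computes the same quantity (the mean of the sample magnitudes) with a plain loop instead of vDSP. The function name mean_magnitude_s16 is made up for this example; the key line is the division of the byte size by the element size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mean of the absolute sample values in a buffer of 16-bit signed samples.
   `byteSize` plays the role of mAudioDataByteSize. */
static float mean_magnitude_s16(const void *audioData, size_t byteSize)
{
    assert(byteSize % sizeof(int16_t) == 0);

    const int16_t *samples = (const int16_t *)audioData;
    size_t count = byteSize / sizeof(int16_t); /* elements, NOT bytes */
    if (count == 0) {
        return 0.0f;
    }

    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        int32_t v = samples[i]; /* widen so that -32768 can be negated safely */
        sum += (double)(v < 0 ? -v : v);
    }
    return (float)(sum / (double)count);
}
```

In the original callback, the equivalent fix is to pass inBuffer->mAudioDataByteSize / sizeof(SInt16) as the length argument to both vDSP calls.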
For anybody that stepped on the same rake...
PS. It came to me while taking a bath :)
How to calculate correct PTS value for frame before encoding in FFmpeg C API?
For encoding I'm using function avcodec_encode_video2 and then writing it by av_interleaved_write_frame.
I found some formulas, but none of them work.
In doxygen example they are using
frame->pts = 0;
for (;;) {
// encode & write frame
// ...
frame->pts += av_rescale_q(1, video_st->codec->time_base, video_st->time_base);
}
This blog says that formula must be like this:
(1 / FPS) * sample rate * frame number
Someone uses only frame number to set pts:
frame->pts = videoCodecCtx->frame_number;
Or an alternative way:
int64_t now = av_gettime();
frame->pts = av_rescale_q(now, (AVRational){1, 1000000}, videoCodecCtx->time_base);
And the last one:
// 40 * 90 means 40 ms and 90 because of the 90kHz by the standard for PTS-values.
frame->pts = encodedFrames * 40 * 90;
Which one is correct? I think an answer to this question will be helpful not only for me.
It's better to think about PTS more abstractly before trying code.
What you're doing is meshing 3 "time sets" together. The first is the time we're used to, based on 1000 ms per second, 60 seconds per minute, and so on. The second is the codec time for the particular codec you are using. Each codec has a certain way it wants to represent time, usually in a 1/number format, meaning that for every second there are "number" ticks. The third works similarly to the second, except that it is the time base of the container you are using.
Some people prefer to start with actual time, others with a frame count. Neither is "wrong".
Starting with a frame count, you first need to convert it based on your frame rate. Note that all conversions I speak of use av_rescale_q(...). The purpose of this conversion is to turn a counter into time, so you rescale with your frame rate (usually the video stream time base). Then you have to convert that into the time_base of your video codec before encoding.
Similarly, with a real time, your first conversion needs to be from current_time - start_time scaled to your video codec time.
Anyone using only frame counter is probably using a codec with a time_base equal to their frame rate. Most codecs do not work like this and their hack is not portable. Example:
frame->pts = videoCodecCtx->frame_number; // BAD
Additionally, anyone using hardcoded numbers in their av_rescale_q is leveraging the fact that they know what their time_base is and this should be avoided. The code isn't portable to other video formats. Instead use video_st->time_base, video_st->codec->time_base, and output_ctx->time_base to figure things out.
I hope understanding it from a higher level will help you see which of those are "correct" and which are "bad practice". There is no single answer, but maybe now you can decide which approach is best for you.
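To illustrate the frame-count route without depending on FFmpeg, here is a self-contained C sketch. Rational and rescale_q are simplified stand-ins for AVRational and av_rescale_q (round to nearest, no overflow protection), and pts_for_frame is a hypothetical helper, so treat this as a picture of the arithmetic rather than the library's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for AVRational: a time base of num/den seconds per tick. */
typedef struct {
    int num;
    int den;
} Rational;

/* Convert `value` ticks of time base `from` into ticks of time base `to`,
   rounding to nearest. Simplified: intermediate products may overflow for
   very large inputs, which the real av_rescale_q guards against. */
static int64_t rescale_q(int64_t value, Rational from, Rational to)
{
    assert(from.num > 0 && from.den > 0 && to.num > 0 && to.den > 0);

    int64_t numerator = (int64_t)from.num * to.den;
    int64_t denominator = (int64_t)from.den * to.num;
    int64_t scaled = value * numerator;

    if (scaled >= 0) {
        return (scaled + denominator / 2) / denominator;
    }
    return -((-scaled + denominator / 2) / denominator);
}

/* A frame counter is a timestamp in a time base of 1/fps. */
static int64_t pts_for_frame(int64_t frameIndex, int fps, Rational targetTimeBase)
{
    Rational frameTimeBase = {1, fps};
    return rescale_q(frameIndex, frameTimeBase, targetTimeBase);
}
```

For 25 fps and a 1/90000 time base, each frame advances the PTS by 3600 ticks, which is the same number as the hardcoded 40 * 90 in the question. The difference is that nothing here is hardcoded, so the same call works for any frame rate and time base.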
Time is not measured in seconds, milliseconds, or any other standard unit. Instead, it is measured in units of the AVCodecContext's time_base.
So if you set codecContext->time_base to 1/1, time is measured in seconds.
cctx->time_base = (AVRational){1, 1};
Assume you want to encode at a steady 30 fps. Then the time at which a frame is encoded is framenumber * (1.0/fps).
But once again, the PTS is not measured in seconds or any standard unit either. It is measured in units of the AVStream's time_base.
In the question, the author mentioned 90k as the standard resolution for PTS. But you will see that this is not always true. The exact resolution is stored in the AVStream. You can read it back with:
if ((err = avformat_write_header(ofctx, NULL)) < 0) {
std::cout << "Failed to write header" << err << std::endl;
return -1;
}
av_dump_format(ofctx, 0, "test.webm", 1);
std::cout << stream->time_base.den << " " << stream->time_base.num << std::endl;
The value of stream->time_base is only populated after calling avformat_write_header.
Therefore, the right formula for calculating PTS is:
//The following assumes that codecContext->time_base = (AVRational){1, 1};
videoFrame->pts = frameduration * (frameCounter++) * stream->time_base.den / (stream->time_base.num * fps);
So really there are 3 components in the formula,
fps
codecContext->time_base
stream->time_base
so pts is the frame time (frameCounter / fps, measured in codecContext->time_base) divided by stream->time_base.
I have detailed my discovery here
There's also the option with setting it like frame->pts = av_frame_get_best_effort_timestamp(frame) but I'm not sure this is the correct approach either.
I've specified and instantiated two Audio Units: a multichannel mixer unit and a generator of subtype AudioFilePlayer.
I would have thought I needed to set the ASBD of the filePlayer's output to match the ASBD I set for the mixer input. However when I attempt to set the filePlayer's output I get a kAudioUnitErr_FormatNotSupported (-10868) error.
Here's the stream format I set on the mixer input (successfully) and am also trying to set on the filePlayer (it's the monostream format copied from Apple's mixerhost sample project):
Sample Rate: 44100
Format ID: lpcm
Format Flags: C
Bytes per Packet: 2
Frames per Packet: 1
Bytes per Frame: 2
Channels per Frame: 1
Bits per Channel: 16
In the course of troubleshooting this I queried the filePlayer AU for the format it is 'natively' set to. This is what's returned:
Sample Rate: 44100
Format ID: lpcm
Format Flags: 29
Bytes per Packet: 4
Frames per Packet: 1
Bytes per Frame: 4
Channels per Frame: 2
Bits per Channel: 32
All the example code I've found sends the output of the filePlayer unit to an effect unit and sets the filePlayer's output to match the ASBD set for the effect unit. Given that I have no effect unit, it seems like setting the filePlayer's output to the mixer input's ASBD would be the correct (and required) thing to do.
How have you configured the AUGraph? I might need to see some code to help you out.
Setting the output scope of AUMultiChannelMixer ASBD once only (as in MixerHost) works. However if you have any kind of effect at all, you will need to think about where their ASBDs are defined and how you arrange your code so CoreAudio does not jump in and mess with your effects AudioUnits ASBDs. By messing with I mean overriding your ASBD to the default kAudioFormatFlagIsFloat, kAudioFormatFlagIsPacked, 2 channels, non-interleaved. This was a big pain for me at first.
I would set the effects AudioUnits to their default ASBD. Assuming you have connected the AUFilePlayer node, then you can pull it out later in the program like this
result = AUGraphNodeInfo (processingGraph,
filePlayerNode,
NULL,
&filePlayerUnit);
And then proceed to set
AudioUnitSetProperty(filePlayerUnit,
kAudioUnitProperty_StreamFormat,
kAudioUnitScope_Output,
0,
&monoStreamFormat,
sizeof(monoStreamFormat));
Hopefully this helps.
Basically I didn't bother setting the filePlayer ASBD but rather retrieved the 'native' ASBD it was set to and updated only the sample rate and channel count.
Likewise, I didn't set the input format on the mixer and let the mixer figure its format out.
I need help understanding the following ASBD. It's the default ASBD assigned to a fresh instance of RemoteIO (I got it by executing AudioUnitGetProperty(..., kAudioUnitProperty_StreamFormat, ...) on the RemoteIO audio unit, right after allocating and initializing it).
Float64 mSampleRate 44100
UInt32 mFormatID 1819304813
UInt32 mFormatFlags 41
UInt32 mBytesPerPacket 4
UInt32 mFramesPerPacket 1
UInt32 mBytesPerFrame 4
UInt32 mChannelsPerFrame 2
UInt32 mBitsPerChannel 32
UInt32 mReserved 0
The question is, shouldn't mBytesPerFrame be 8? If I have 32 bits (4 bytes) per channel, and 2 channels per frame, shouldn't each frame be 8 bytes long (instead of 4)?
Thanks in advance.
The value of mBytesPerFrame depends on mFormatFlags. From CoreAudioTypes.h:
Typically, when an ASBD is being used, the fields describe the complete layout
of the sample data in the buffers that are represented by this description -
where typically those buffers are represented by an AudioBuffer that is
contained in an AudioBufferList.
However, when an ASBD has the kAudioFormatFlagIsNonInterleaved flag, the
AudioBufferList has a different structure and semantic. In this case, the ASBD
fields will describe the format of ONE of the AudioBuffers that are contained in
the list, AND each AudioBuffer in the list is determined to have a single (mono)
channel of audio data. Then, the ASBD's mChannelsPerFrame will indicate the
total number of AudioBuffers that are contained within the AudioBufferList -
where each buffer contains one channel. This is used primarily with the
AudioUnit (and AudioConverter) representation of this list - and won't be found
in the AudioHardware usage of this structure.
I believe that because the format flags specify kAudioFormatFlagIsNonInterleaved it follows that the size of a frame in any buffer can only be the size of a 1 channel frame. If this is correct mChannelsPerFrame is certainly a confusing name.
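A short C sketch can make this concrete. The enum constants below mirror the bit values of the kAudioFormatFlag... constants for linear PCM in CoreAudioTypes.h (redefined locally so the example compiles without the Apple headers), and expected_bytes_per_frame is a hypothetical helper that assumes packed samples.

```c
#include <stdint.h>

/* Local copies of the linear PCM flag bits from CoreAudioTypes.h. */
enum {
    FLAG_IS_FLOAT           = 1u << 0, /* kAudioFormatFlagIsFloat */
    FLAG_IS_BIG_ENDIAN      = 1u << 1, /* kAudioFormatFlagIsBigEndian */
    FLAG_IS_SIGNED_INTEGER  = 1u << 2, /* kAudioFormatFlagIsSignedInteger */
    FLAG_IS_PACKED          = 1u << 3, /* kAudioFormatFlagIsPacked */
    FLAG_IS_ALIGNED_HIGH    = 1u << 4, /* kAudioFormatFlagIsAlignedHigh */
    FLAG_IS_NON_INTERLEAVED = 1u << 5  /* kAudioFormatFlagIsNonInterleaved */
};

/* Bytes per frame for packed linear PCM.
   Interleaved: one frame holds a sample for every channel.
   Non-interleaved: the ASBD describes ONE mono buffer, so a frame in that
   buffer holds a single sample regardless of mChannelsPerFrame. */
static uint32_t expected_bytes_per_frame(uint32_t formatFlags,
                                         uint32_t channelsPerFrame,
                                         uint32_t bitsPerChannel)
{
    uint32_t bytesPerSample = bitsPerChannel / 8;
    if (formatFlags & FLAG_IS_NON_INTERLEAVED) {
        return bytesPerSample;
    }
    return bytesPerSample * channelsPerFrame;
}
```

The mFormatFlags value 41 from the question decomposes as 32 + 8 + 1, that is non-interleaved, packed, float. With 2 channels at 32 bits, the helper returns 4, which matches the reported mBytesPerFrame. The same layout without the non-interleaved bit would give 8.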
I hope someone else will confirm / clarify this.