iOS: Accurate AudioTimeStamp when rendering Audio Units

In my AudioInputRenderCallback I'm looking to capture an accurate timestamp of certain audio events. To test my code, I'm inputting a click track at 120 BPM, i.e. a click every 500 milliseconds (the click is accurate; I checked and double-checked). I first get the decibel level of every sample and check whether it's over a threshold; this works as expected. I then take the hostTime from the AudioTimeStamp and convert it to milliseconds. The first click gets assigned to a static timestamp, and on the second pass I calculate the interval and reassign the static value. I expected to see a 500 ms interval. To be able to calculate the click correctly I have to be within 5 milliseconds, but the numbers bounce back and forth between 510 and 489. I understand it's not an RTOS, but can iOS be this accurate? Are there any issues with using the mach_absolute_time-based hostTime member of the AudioTimeStamp?

Audio Units are buffer based. The minimum length of an iOS Audio Unit buffer seems to be around 6 ms, so if you use the timestamps of the buffer callbacks, your time resolution (or time-sampling jitter) will be about ±6 ms.
If you look at the actual raw PCM samples inside the Audio Unit buffer and pattern-match the "attack" transient (by threshold, autocorrelation, etc.), you might be able to get sub-millisecond resolution.
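The in-buffer search described above can be sketched as follows. This is a Python stand-in for logic that would live in a C/Swift render callback; the function name and parameters are illustrative, not a real iOS API. The key idea is that the event time is the buffer's timestamp plus the sample offset divided by the sample rate.

```python
# Sketch of the in-buffer onset search: find the first sample over the
# threshold and convert its offset within the buffer into milliseconds.

def onset_time_ms(samples, buffer_start_ms, sample_rate_hz, threshold):
    """Return the timestamp (ms) of the first sample whose absolute
    amplitude crosses `threshold`, or None if no sample does."""
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            # Offset of the sample within the buffer, in milliseconds.
            return buffer_start_ms + (i / sample_rate_hz) * 1000.0
    return None

# Example: a click 100 samples into a 44.1 kHz buffer stamped at 1000 ms.
buf = [0.0] * 100 + [0.9] + [0.0] * 155
t = onset_time_ms(buf, 1000.0, 44100.0, 0.5)
```

At 44.1 kHz each sample is about 0.0227 ms apart, which is why this approach can beat the ±6 ms callback granularity.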

Related

Play pure tones periodically with AudioKit

I want to play very basic sounds using the AudioKit oscillator at a given frequency.
For example: play a simple sine wave at 400 Hz for 50 milliseconds, then 100 ms of silence, then play 600 Hz for 50 ms …
I have a Metal view where I render some visual stimuli. My intention was to use the basic AKOscillator and the render function of the CADisplayLink to play/stop the sound at certain frames.
I tried using oscillator.play() and oscillator.stop(), or changing the amplitude with oscillator.amplitude = 0 and oscillator.amplitude = 1, but the result in both cases is a jitter of about 10 ms.
If I first create .wav files and then play them with AKPlayer.play(), the timing is correct.
I want the flexibility to use any frequency at any time. How can I do something similar to the first approach? Is wrapping the oscillator in a MIDI instrument the way to go?
CADisplayLink runs in the UI thread, so it will have some sub-frame-time jitter, since the UI thread has lower priority than the audio (and perhaps other OS) threads.
The only way I've succeeded in reliable sub-millisecond-accurate real-time synthesis of arbitrary-frequency audio is to put a sine-wave oscillator (or waveform generator, or table look-up) inside an Audio Unit callback (RemoteIO on iOS) and count samples, dynamically inserting the start/stop of the desired sine wave (or other waveform) into the callback buffer(s) at the correct sample-buffer offset index.
At first I pre-generated sine-wave look-up tables, but later did some profiling and found that just calling the sin() function with a rotating phase, at audio sample rates, took a barely measurable fraction of a percent of CPU time on any contemporary iOS device. If you do use an incrementing phase, make sure to keep the phase in a reasonable numeric range by occasionally adding or subtracting multiples of 2*pi.
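The sample-counting approach above can be sketched like this. It is a Python model (class and method names are made up for illustration); a real implementation would run this logic inside the render callback in C or Swift. Tone changes are scheduled at exact sample indices, and the phase is wrapped to stay in a safe numeric range.

```python
import math

SAMPLE_RATE = 44100.0

class ToneScheduler:
    """Fill successive buffers, turning a sine on/off at exact sample
    offsets; a global sample counter persists across buffer callbacks."""

    def __init__(self):
        self.sample_count = 0      # global sample counter across buffers
        self.phase = 0.0
        self.freq = 0.0            # 0.0 means silence
        self.events = []           # sorted list of (start_sample, freq)

    def schedule(self, start_sample, freq):
        self.events.append((start_sample, freq))
        self.events.sort()

    def render(self, n_frames):
        out = []
        for _ in range(n_frames):
            # Apply any event whose start sample has been reached.
            while self.events and self.events[0][0] <= self.sample_count:
                _, self.freq = self.events.pop(0)
                self.phase = 0.0
            out.append(math.sin(self.phase) if self.freq else 0.0)
            self.phase += 2.0 * math.pi * self.freq / SAMPLE_RATE
            if self.phase >= 2.0 * math.pi:     # keep the phase bounded
                self.phase -= 2.0 * math.pi
            self.sample_count += 1
        return out
```

For the 50 ms tone in the question, you would schedule 400 Hz at sample 0 and silence at sample 2205 (50 ms at 44.1 kHz); accuracy is then one sample period, about 23 µs.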

How do you make Media Source work with timestampOffset lower than appendWindowStart?

I want to use appendBuffer and append only a piece of the media I have.
To cut the piece from the end, I use appendWindowEnd, and it works.
To cut it from the beginning, I have to set timestampOffset lower than appendWindowStart. I have seen shaka-player doing something similar.
var appendWindowStart = Math.max(0, currentPeriod.startTime - windowFudge);
var appendWindowEnd = followingPeriod ? followingPeriod.startTime : duration;
...
var timestampOffset = currentPeriod.startTime - mediaState.stream.presentationTimeOffset;
From my tests, it works when timestampOffset is:
the same as appendWindowStart
1/10 second lower
It doesn't work when timestampOffset is lower than that: the segment doesn't get added. Does that have something to do with my media, or does the spec/implementation not allow it?
From MDN web docs:
The appendWindowStart property of the SourceBuffer interface controls the timestamp for the start of the append window, a timestamp range that can be used to filter what media data is appended to the SourceBuffer. Coded media frames with timestamps within this range will be appended, whereas those outside the range will be filtered out.
Just found this in the specification, so I am updating the question:
If presentation timestamp is less than appendWindowStart, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame.
Some implementations may choose to collect some of these coded frames with presentation timestamp less than appendWindowStart and use them to generate a splice at the first coded frame that has a presentation timestamp greater than or equal to appendWindowStart even if that frame is not a random access point. Supporting this requires multiple decoders or faster than real-time decoding so for now this behavior will not be a normative requirement.
If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame.
Some implementations may choose to collect coded frames with presentation timestamp less than appendWindowEnd and frame end timestamp greater than appendWindowEnd and use them to generate a splice across the portion of the collected coded frames within the append window at time of collection, and the beginning portion of later processed frames which only partially overlap the end of the collected coded frames. Supporting this requires multiple decoders or faster than real-time decoding so for now this behavior will not be a normative requirement. In conjunction with collecting coded frames that span appendWindowStart, implementations may thus support gapless audio splicing.
If the need random access point flag on track buffer equals true, then run the following steps:
If the coded frame is not a random access point, then drop the coded frame and jump to the top of the loop to start processing the next coded frame.
Set the need random access point flag on track buffer to false.
and
Random Access Point
A position in a media segment where decoding and continuous playback can begin without relying on any previous data in the segment. For video this tends to be the location of I-frames. In the case of audio, most audio frames can be treated as a random access point. Since video tracks tend to have a more sparse distribution of random access points, the location of these points are usually considered the random access points for multiplexed streams.
Does that mean that, for video, I have to choose a timestampOffset which lands on an I-frame?
The use of timestampOffset doesn't require an I-frame. It just shifts the timestamp of each frame by that value, and that shift calculation is performed before anything else (before appendWindowStart gets involved).
It's the use of appendWindowStart that is affected by where your I-frames are.
appendWindowStart and appendWindowEnd act as a filter (a logical AND) over the data you're adding.
MSE doesn't reprocess your data; by setting appendWindowStart you're telling the source buffer that any data prior to that time is to be excluded.
Also, MSE works at the fundamental level of the GOP (group of pictures): from one I-frame to the next.
So let's imagine this group of images, made of 16-frame GOPs, each frame having a duration of 1 s.
.IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP
Say now you set appendWindowStart to 10.
In an ideal world you would have:
. PPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP
All frames with a time starting prior to appendWindowStart have been dropped.
However, those P-frames can now no longer be decoded, so MSE, as specified, sets the "need random access point" flag to true; the next frame added to the source buffer can then only be an I-frame,
and so you end up with this in your source buffer:
. IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP
Being able to add the frames between appendWindowStart and the next I-frame would be incredibly hard and expensive in time.
It would require decoding all frames before adding them to the source buffer, and storing them either as raw YUV data or, if hardware accelerated, as GPU-backed images.
A source buffer could contain over a minute of video at any given time. Imagine if it had to deal with decompressed data rather than compressed data.
Now, if you wanted to preserve the current memory constraint (around 100 MiB of data maximum per source buffer), you would have to recompress the content on the fly before adding it to the source buffer.
Not going to happen.
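The spec steps quoted above can be modeled in a few lines. This is a toy simulation, not browser code: frames before appendWindowStart are dropped and set the "need random access point" flag, so the next frame kept must be an I-frame. (It simplifies the end-of-window check to a plain timestamp comparison, whereas the spec compares the frame end timestamp.)

```python
# Toy model of MSE coded-frame filtering with an append window and the
# "need random access point" flag.

def filter_frames(frames, append_window_start, append_window_end):
    """frames: list of (timestamp, kind) with kind 'I' or 'P'.
    Returns the frames that survive the append window."""
    kept = []
    need_rap = False  # "need random access point" flag
    for ts, kind in frames:
        if ts < append_window_start or ts >= append_window_end:
            need_rap = True
            continue                 # drop the coded frame
        if need_rap:
            if kind != 'I':
                continue             # still waiting for an I-frame
            need_rap = False
        kept.append((ts, kind))
    return kept

# 4 GOPs of 16 one-second frames, as in the diagram above.
gops = [(g * 16 + i, 'I' if i == 0 else 'P')
        for g in range(4) for i in range(16)]
```

With appendWindowStart = 10, the P-frames at timestamps 10–15 are dropped even though they lie inside the window, and the buffer starts at the I-frame at timestamp 16, matching the diagrams above.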

iOS dynamically slow down the playback of a video, with continuous value

I have a problem with the iOS SDK. I can't find an API to slow down a video with continuous values.
I have made an app with a slider and an AVPlayer, and I would like to change the speed of the video from 50% to 150% according to the slider value.
So far I have only managed to change the speed with discrete values, and by recompiling the video (to do that, I used the AVMutableComposition APIs).
Do you know if it is possible to change the speed continuously, without recompiling?
Thank you very much!
Jery
The AVPlayer rate property allows playback-speed changes if the associated AVPlayerItem is capable of them (responds YES to canPlaySlowForward or canPlayFastForward). The rate is 1.0 for normal playback and 0 for stopped; it can be set to other values, but will probably round to the nearest discrete value it is capable of, such as 2:1, 3:2, and 5:4 for faster speeds, and 1:2, 2:3, and 4:5 for slower speeds.
With the older MPMoviePlayerController and its similar currentPlaybackRate property, I found that it would take any setting and report it back, but would still round it to one of the discrete values above. For example, set it to 1.05 and you would get normal speed (1:1), even though currentPlaybackRate would say 1.05 if you read it. Set it to 1.2 and it would play at 1.25x (5:4). And it was limited to 2:1 (double speed), beyond which it would hang or jump.
For some reason the iOS API Reference doesn't mention these discrete speeds; they were found by experimentation. They make some sense. Since the hardware displays video frames at a fixed rate (e.g. 30 or 60 frames per second), some multiples are easier than others: half speed can be achieved by showing each frame twice, and double speed by dropping every other frame. Dropping 1 out of every 3 frames gives you 150% (3:2) speed. But 105% is harder, requiring dropping 1 out of every 21 frames. Especially if this is done in hardware, you can see why they might have limited it to certain multiples.
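The rounding behavior described above can be sketched as a nearest-neighbor snap. Note that the list of ratios comes from the experimentation described in this answer, not from any documented API, and the function is purely illustrative.

```python
# Snap a requested playback rate to the nearest of the discrete ratios
# observed experimentally (2:1, 3:2, 5:4, 1:1, 4:5, 2:3, 1:2).

OBSERVED_RATES = [2/1, 3/2, 5/4, 1/1, 4/5, 2/3, 1/2]

def effective_rate(requested):
    """Return the observed rate the player would actually use."""
    return min(OBSERVED_RATES, key=lambda r: abs(r - requested))
```

This reproduces the examples above: a requested 1.05 snaps to 1.0, and 1.2 snaps to 1.25.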

How can I ensure the correct frame rate when recording an animation using DirectShow?

I am attempting to record an animation (computer graphics, not video) to a WMV file using DirectShow. The setup is:
A Push Source that uses an in-memory bitmap holding the animation frame. Each time FillBuffer() is called, the bitmap's data is copied over into the sample, and the sample is timestamped with a start time (frame number * frame length) and duration (frame length). The frame rate is set to 10 frames per second in the filter.
An ASF Writer filter. I have a custom profile file that sets the video to 10 frames per second. It's a video-only filter, so there's no audio.
The pins connect, and when the graph is run, a wmv file is created. But...
The problem is it appears DirectShow is pushing data from the Push Source at a rate greater than 10 FPS. So the resultant wmv, while playable and containing the correct animation (as well as reporting the correct FPS), plays the animation back several times too slowly because too many frames were added to the video during recording. That is, a 10 second video at 10 FPS should only have 100 frames, but about 500 are being stuffed into the video, resulting in the video being 50 seconds long.
My initial attempt at a solution was just to slow down the FillBuffer() call by adding a sleep() for 1/10th second. And that indeed does more or less work. But it seems hackish, and I question whether that would work well at higher FPS.
So I'm wondering if there's a better way to do this. Actually, I'm assuming there's a better way and I'm just missing it. Or do I just need to smarten up the manner in which FillBuffer() in the Push Source is delayed and use a better timing mechanism?
Any suggestions would be greatly appreciated!
I do this with threads: the main thread adds bitmaps to a list, and the recorder thread takes bitmaps from that list.
Main thread
Animate your graphics at time T and render a bitmap.
Add the bitmap to the render list. If the list is full (say, more than 8 frames), wait; this is so you won't use too much memory.
Advance T by a delta corresponding to the desired frame rate.
Render thread
When a frame is requested, pick and remove a bitmap from the render list. If the list is empty, wait.
You need a thread-safe structure such as TThreadList to hold the bitmaps. It's a bit tricky to get right, but your current approach is guaranteed to give you timing problems.
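The two-thread scheme above can be sketched with a bounded queue, which gives the "wait if the list is full" behaviour for free. This is a Python illustration (the function names and the string stand-in for a bitmap are made up); the original answer's TThreadList would play the queue's role.

```python
import queue
import threading

# Main thread renders frames into a bounded queue; the recorder thread
# consumes them at its own pace.

FRAME_QUEUE = queue.Queue(maxsize=8)   # at most 8 frames buffered

def render_frames(n_frames, dt):
    t = 0.0
    for _ in range(n_frames):
        bitmap = f"frame@{t:.3f}s"     # stand-in for an actual bitmap
        FRAME_QUEUE.put(bitmap)        # blocks while the queue is full
        t += dt                        # advance animation time, not wall time

def record_frames(n_frames, out):
    for _ in range(n_frames):
        out.append(FRAME_QUEUE.get())  # blocks while the queue is empty

frames = []
producer = threading.Thread(target=render_frames, args=(20, 0.1))
consumer = threading.Thread(target=record_frames, args=(20, frames))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

Because the producer advances animation time by a fixed delta per frame rather than by wall-clock time, the recorded frames are evenly spaced regardless of how fast the recorder pulls them.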
I do just that in my recorder application (www.videophill.com) for testing purposes.
I use the Sleep() method to delay the frames, but take great care to ensure that the timestamps of the frames are correct. Also, when Sleep()ing from frame to frame, try to use 'absolute' time differences, because Sleep(100) will sleep about 100 ms, not exactly 100 ms.
If that won't work for you, you can always go for IReferenceClock, but I think that's overkill here.
So:
DateTime start = DateTime.Now;
int frameCounter = 0;
while (wePush)
{
    FillDataBuffer(...);
    frameCounter++;
    DateTime nextFrameTime = start.AddMilliseconds(frameCounter * 100);
    // TotalMilliseconds is a double; cast it, and guard against a negative delay.
    int delay = (int)(nextFrameTime - DateTime.Now).TotalMilliseconds;
    if (delay > 0)
        Sleep(delay);
}
EDIT:
Keep in mind that IWMWriter is time-insensitive as long as you feed it samples that are properly time-stamped.

In image processing, what is real time?

In image processing applications, what is considered real time? Is 33 fps real time? Is 20 fps? If 33 and 20 fps are considered real time, then is 1 or 2 fps also real time?
Can anyone throw some light on this?
In my experience, it's a pretty vague term. Often, what is meant is that the algorithm will run at the rate of the source (e.g. a camera) supplying the images; however, I would prefer to state this explicitly ("the algorithm can process images at the frame rate of the camera").
Real time image processing = produce output simultaneously with the input.
The input may be 25 fps, but you may choose to process 1 of every 5 frames (making it 5 fps processing) and your application is still real time.
TV streaming software: all the frames are processed.
Security application and the input is CCTV security cams: you may choose to skip some frames to fit the performance.
3d game or simulation: fps changes depending on the current scene.
And they are all real time.
Strictly speaking, I would say real-time means that the application is generating images based on user input as it occurs, e.g. a mouse movement which changes the facing of an avatar.
How successful it is at this task - 1 fps, 10 fps, 100 fps, etc - is actually another question.
Real-time describes an approach, not a performance metric.
If however you ask what is the slowest fps which passes as usable by a human, the answer is about 15, I think.
I think it depends on what the real-time application is. If the app shows a slideshow with one picture every 3 seconds, and the app can process each picture within those 3 seconds and show it, then it is real-time processing.
If the movie is 29.97 frames per second, and the app can process all 29.97 frames within the second, then it is also real time.
For example, if an app can take a movie from a VCR or a cable box's analog output, compress it into 29.97 fps video, and send it all to a remote location for another person to watch, then it is real-time processing.
(Hard) Real time is when an outcome has no value when delivered too early or too late.
Any FPS is real time provided that displayed frames represent what should be displayed at the very instant they are displayed.
The notion of real-time display is not really tied to a specific frame rate - it could be defined as the minimum frame rate at which movement is perceived as being continuous. So for slow moving objects in a visual frame (e.g. ships in a harbour, or stars in the night sky) a relatively slow frame rate might suffice, whereas for rapid movement (e.g. a racing car simulator) a much higher frame rate would be needed.
There is also a secondary consideration of latency. A real-time display must have sufficiently low latency in relation to other events (e.g. behaviour of a real-time simulation) that there is no perceptible lag in display updates.
That's not actually an easy question (even without taking into account differences between individuals).
Wikipedia has a good article explaining why. For what it's worth, cinema films run at 24 fps, so if you're happy with that, that's what I'd consider real time.
It depends on what exactly you are trying to do. For some purposes 1 fps, or even 2 spf (seconds per frame), could be considered real time. For others, that's way too slow.
That said, real time means that it takes as long (or less) to process x frames as it would take to just present those x frames.
It depends.
automatic aircraft cannon - 1000 fps
monitoring - 10 - 15 fps
authentication - 1 fps
medical devices - 1 fph
I guess the term is used with different meanings in different contexts. In industrial image processing, real-time processing is usually the opposite of offline processing. In offline processing applications, you record images (many of them) and process them at a later time. In real-time processing, the system that acquires the images also processes them, at the same time, so the processing must keep pace with the acquisition: the time to process a frame must not exceed the acquisition interval.
Real-time means your implementation is fast enough to meet some deadline. The deadline is part of your system's specification. If it's an interactive UI and the users are not too picky, 15Hz update can be OK, although it can feel laggy. If you're using it to drive a car along the motorway 30Hz is about right. If it's a missile, well, maybe 100Hz?
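The deadline view of "real time" in the answer above can be expressed as a simple check: an implementation is real-time for a given frame rate if its per-frame processing time fits within the frame period. This is an illustrative sketch; the function name and the trial count are made up.

```python
import time

def is_real_time(process, fps, trials=5):
    """Return True if `process()` completes within 1/fps seconds,
    taking the worst case over a few trials."""
    deadline = 1.0 / fps
    worst = 0.0
    for _ in range(trials):
        start = time.perf_counter()
        process()
        worst = max(worst, time.perf_counter() - start)
    return worst <= deadline

# A trivially cheap "algorithm" easily meets a 30 Hz deadline...
assert is_real_time(lambda: sum(range(1000)), fps=30)
# ...while one that takes 100 ms per frame cannot.
assert not is_real_time(lambda: time.sleep(0.1), fps=30)
```

Taking the worst case over several trials rather than the average reflects the hard-real-time framing from earlier answers: one missed deadline is a miss.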