How do you make Media Source work with timestampOffset lower than appendWindowStart?

I want to use appendBuffer and append only a piece of the media I have.
To cut the piece from the end, I use appendWindowEnd, and it works.
To cut it from the beginning, I have to set timestampOffset lower than appendWindowStart. I have seen shaka-player doing something similar:
var appendWindowStart = Math.max(0, currentPeriod.startTime - windowFudge);
var appendWindowEnd = followingPeriod ? followingPeriod.startTime : duration;
var timestampOffset = currentPeriod.startTime;
From my tests, it works when timestampOffset is the same as appendWindowStart, or up to 1/10 second lower.
It doesn't work when timestampOffset is lower than that: the segment doesn't get added. Does that have something to do with my media, or does the spec/implementation not allow it?
From MDN web docs:
The appendWindowStart property of the SourceBuffer interface controls the timestamp for the start of the append window, a timestamp range that can be used to filter what media data is appended to the SourceBuffer. Coded media frames with timestamps within this range will be appended, whereas those outside the range will be filtered out.
Just found this in the specification, so I am updating the question:
If presentation timestamp is less than appendWindowStart, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame.
Some implementations may choose to collect some of these coded frames with presentation timestamp less than appendWindowStart and use them to generate a splice at the first coded frame that has a presentation timestamp greater than or equal to appendWindowStart even if that frame is not a random access point. Supporting this requires multiple decoders or faster than real-time decoding so for now this behavior will not be a normative requirement.
If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame.
Some implementations may choose to collect coded frames with presentation timestamp less than appendWindowEnd and frame end timestamp greater than appendWindowEnd and use them to generate a splice across the portion of the collected coded frames within the append window at time of collection, and the beginning portion of later processed frames which only partially overlap the end of the collected coded frames. Supporting this requires multiple decoders or faster than real-time decoding so for now this behavior will not be a normative requirement. In conjunction with collecting coded frames that span appendWindowStart, implementations may thus support gapless audio splicing.
If the need random access point flag on track buffer equals true, then run the following steps:
If the coded frame is not a random access point, then drop the coded frame and jump to the top of the loop to start processing the next coded frame.
Set the need random access point flag on track buffer to false.
Random Access Point
A position in a media segment where decoding and continuous playback can begin without relying on any previous data in the segment. For video this tends to be the location of I-frames. In the case of audio, most audio frames can be treated as a random access point. Since video tracks tend to have a more sparse distribution of random access points, the location of these points are usually considered the random access points for multiplexed streams.
Does that mean that, for video, I have to choose a timestampOffset that lands on an I-frame?

Using timestampOffset doesn't require an I-frame. It just shifts the timestamp of each frame by that value. That shift is applied before anything else (before appendWindowStart gets involved).
It's the use of appendWindowStart that is affected by where your I-frames are.
appendWindowStart and appendWindowEnd act as an AND over the data you're adding.
MSE doesn't reprocess your data; by setting appendWindowStart you're telling the source buffer that any data before that time is to be excluded.
Also, MSE works at the fundamental level of the GOP (group of pictures): from one I-frame to the next.
So let's imagine a sequence of images made of 16-frame GOPs, each frame having a duration of 1s.
Say you now set appendWindowStart to 10.
In an ideal world you would have:
All previous 9 frames, whose start time is before appendWindowStart, have been dropped.
However, the remaining P-frames can't be decoded now, so the spec has MSE set the "need random access point flag" to true, and the next frame added to the source buffer can only be an I-frame.
So you end up with this in your source buffer:
Adding the frames between appendWindowStart and the next I-frame would be incredibly hard and time-consuming.
It would require decoding all frames before adding them to the source buffer, and storing them either as raw YUV data or, if hardware accelerated, as GPU-backed images.
A source buffer could contain over a minute of video at any given time. Imagine if it had to deal with decompressed data rather than compressed data.
And if we wanted to preserve the same memory constraint as now (around 100 MiB of data maximum per source buffer), you would have to recompress the content on the fly before adding it to the source buffer.
Not gonna happen.
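To make the drop logic concrete, here is a minimal Python sketch of the two spec steps quoted in the question (the append window check and the "need random access point flag"). It is a simplified model and not how any browser implements it: the Frame class, the filter_frames function and the frame layout (two 16-frame GOPs, 1 s per frame, timestamps starting at 0) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    pts: float       # presentation timestamp in seconds, before timestampOffset
    duration: float  # frame duration in seconds
    keyframe: bool   # True for an I-frame (random access point)


def filter_frames(frames, timestamp_offset, window_start, window_end):
    """Return the shifted timestamps of the frames that would be appended."""
    kept = []
    need_rap = True  # models the "need random access point flag"
    for frame in frames:
        pts = frame.pts + timestamp_offset  # the offset is applied first
        end = pts + frame.duration
        if pts < window_start or end > window_end:
            need_rap = True  # outside the append window: drop the frame
            continue
        if need_rap:
            if not frame.keyframe:
                continue  # inside the window, but not decodable on its own
            need_rap = False
        kept.append(pts)
    return kept


# Hypothetical stream: two 16-frame GOPs, I-frames at 0 s and 16 s.
frames = [Frame(pts=float(i), duration=1.0, keyframe=(i % 16 == 0)) for i in range(32)]
kept = filter_frames(frames, timestamp_offset=0.0, window_start=10.0,
                     window_end=float("inf"))
print(kept[0], len(kept))  # 16.0 16
```

With appendWindowStart at 10, the frames from 10 s to 15 s are inside the window but still dropped, because the I-frame they depend on was filtered out. The buffered range only starts at the next I-frame, at 16 s.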


Openlayers-3 fitExtent capture replay not working

I'm creating a 'bookmarking' feature on my map, recording the extent of the current view via ol.View.calculateExtent(). Once I've grabbed this extent I persist it (no loss of precision, in 'EPSG:900913').
Problem now is if I feed this extent into ol.View.fitExtent() I don't get exactly the same view, I get a slightly 'zoomed out' one.
The coordinates are exactly the same, as are the map size (ol.Map.getSize()) and even the resolution (ol.View().getResolution()), but each time I call up my recorded 'view' it is further out than the one I recorded.
Any ideas how I can exactly record the current 'view' and replay it accurately? Is this rounding? Should I not be using fitExtent?
N.B. This doesn't ALWAYS happen! At high zooms it can sometimes accurately record and return me to the same view: resolutions of 2.388657133911758, 1.194328566955879 and 305.748113140705, when recorded, do not seem to exhibit this behaviour.
It was replaced in v3.7.0:
Replace ol.View.fitExtent() and ol.View.fitGeometry() with ... This combines two previously distinct functions into one more flexible call which takes either a geometry or an extent.

iOS Accurate AudioTimeStamp when rendering Audio Units

In my AudioInputRenderCallback I'm looking to capture an accurate time stamp of certain audio events. To test my code, I'm inputting a click track at 120 BPM, i.e. a click every 500 milliseconds (the click is accurate; I checked and double-checked). I first get the decibel level of every sample and check whether it's over a threshold; this works as expected. I then take the hostTime from the AudioTimeStamp and convert it to milliseconds. The first click gets assigned to a static timestamp; each later click calculates the interval since that timestamp and then reassigns the static one. I expected to see an interval of 500. To be able to calculate the click correctly I have to be within 5 milliseconds. The numbers seem to bounce back and forth between 510 and 489. I understand it's not an RTOS, but can iOS be this accurate? Are there any issues with using the mach_absolute_time member of the AudioTimeStamp?
Audio Units are buffer based. The minimum length of an iOS Audio Unit buffer seems to be around 6 ms. So if you use the time-stamps of the buffer callbacks, your time resolution or time sampling jitter will be about ±6 ms.
If you look at the actual raw PCM samples inside the Audio Unit buffer and pattern match the "attack" transient (by threshold or autocorrelation, etc.) you might be able to get sub-millisecond resolution.
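The idea of locating the transient inside the buffer can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Core Audio code: onset_time_seconds, the 44100 Hz sample rate and the simple threshold test are assumptions, and buffer_start_seconds stands for the buffer's host time after conversion to seconds.

```python
def onset_time_seconds(buffer_start_seconds, samples, sample_rate, threshold):
    """Return the time of the first sample at or above threshold, or None."""
    for index, sample in enumerate(samples):
        if abs(sample) >= threshold:
            # Offset of the sample inside the buffer, converted to seconds.
            return buffer_start_seconds + index / sample_rate
    return None


# Hypothetical 256-sample buffer starting at t = 1.0 s, click at sample 100.
samples = [0.0] * 256
samples[100] = 0.9
click_time = onset_time_seconds(1.0, samples, 44100, 0.5)
print(click_time)
```

At 44100 Hz one sample lasts about 0.023 ms, so the position of the click within the buffer is resolved far more finely than the buffer callback time alone.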

What is the decoded output of a video codec?

I am wondering if someone can explain to me what exactly is the output of video decoding. Let's say it is a H.264 stream in an MP4 container.
Judging from how things get displayed on the screen, I guess the decoder can provide two different types of output:
Point - (x, y) coordinate of the location and the (R, G, B) color for the pixel
Rectangle (x, y, w, h) units for the rectangle and the (R, G, B) color to display
There is also the issue of time stamp.
Can you please enlighten me, or point me to the right link, on what is generated by a decoder and how a video client can use this information to display something on screen?
I intend to download VideoLAN source and examine it but some explanation would be helpful.
Thank you in advance for your help.
None of the above.
Usually the output will be a stream of bytes that contains just the color data. The X,Y location is implied by the dimensions of the video.
In other words, the first three bytes might encode the color value at (0, 0), the next three bytes the value at (0, 1), and so on. Some formats might use four-byte groups, or even a number of bits that doesn't add up to a whole number of bytes -- for example, if you use 5 bits for each color component and you have three color components, that's 15 bits per pixel. This might be padded to 16 bits (exactly two bytes) for efficiency, since that will align data in a way that CPUs can better process it.
When you've processed exactly as many values as the video is wide, you've reached the end of that row. When you've processed exactly as many rows as the video is high, you've reached the end of that frame.
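As a rough illustration of how position is implied by the dimensions, the Python sketch below looks up one pixel in a packed three-bytes-per-pixel frame. It assumes a row-major layout with no padding at the end of each row; pixel_at and the tiny 4x2 test frame are made up for the example, and real decoders often produce other layouts (for example planar YUV).

```python
def pixel_at(frame_bytes, width, x, y, bytes_per_pixel=3):
    """Return the color components of pixel (x, y) in a packed, row-major frame."""
    offset = (y * width + x) * bytes_per_pixel
    return tuple(frame_bytes[offset:offset + bytes_per_pixel])


# Hypothetical 4x2 frame whose bytes are simply 0, 1, 2, ... for easy checking.
width, height = 4, 2
frame = bytes(range(width * height * 3))

print(pixel_at(frame, width, 1, 0))  # (3, 4, 5): second pixel of the first row
print(pixel_at(frame, width, 0, 1))  # (12, 13, 14): first pixel of the second row
```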
As for the interpretation of those bytes, that depends on the color space used by the codec. Common color spaces are YUV, RGB, and HSL/HSV.
It depends strongly on the codec in use and what input format(s) it supports; the output format is usually restricted to the set of formats that are acceptable inputs.
Timestamp data is a bit more complex, since it can be encoded in the video stream itself or in the container. At a minimum, the stream would need a framerate; from that, the time of each frame can be determined by counting how many frames have been decoded already. Another approach, like the one taken by AVI, is to include a byte-offset for every Nth frame (or just the keyframes) at the end of the file to enable rapid seeking. (Otherwise, you would need to decode every frame up to the timestamp you're looking for in order to determine where in the file that frame is.)
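For the constant-framerate case, the frame-counting rule is just a division. The Python sketch below uses exact fractions so that rates such as 30000/1001 don't accumulate rounding error; frame_time is a hypothetical helper, and streams with variable frame rates or container timestamp tables would not work this way.

```python
from fractions import Fraction


def frame_time(frame_index, fps):
    """Presentation time in seconds of frame number frame_index (0-based)."""
    return Fraction(frame_index) / Fraction(fps)


print(float(frame_time(250, 25)))              # 10.0
print(frame_time(1, Fraction(30000, 1001)))    # 1001/30000
```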
And if you're considering audio data too, note that with most codecs and containers, the audio and video streams are independent and know nothing about each other. During encoding, the software that writes both streams into the container format does a process called muxing. It will write out the data in chunks of N seconds each, alternating between streams. This allows whoever is reading the stream to get N seconds of video, then N seconds of audio, then another N seconds of video, and so on. (More than one audio stream might be included too -- this technique is frequently used to mux together video, and English and Spanish audio tracks into a single file that contains three streams.) In fact, even subtitles can be muxed in with the other streams.
cdhowie got most of it.
When it comes to timestamps, the MPEG4 container contains tables that tell the video client when to display each frame. You should look at the spec for MPEG4. I think you normally have to pay for it, but it's definitely downloadable from places.

How can I ensure the correct frame rate when recording an animation using DirectShow?

I am attempting to record an animation (computer graphics, not video) to a WMV file using DirectShow. The setup is:
A Push Source that uses an in-memory bitmap holding the animation frame. Each time FillBuffer() is called, the bitmap's data is copied over into the sample, and the sample is timestamped with a start time (frame number * frame length) and duration (frame length). The frame rate is set to 10 frames per second in the filter.
An ASF Writer filter. I have a custom profile file that sets the video to 10 frames per second. It's a video-only filter, so there's no audio.
The pins connect, and when the graph is run, a wmv file is created. But...
The problem is it appears DirectShow is pushing data from the Push Source at a rate greater than 10 FPS. So the resultant wmv, while playable and containing the correct animation (as well as reporting the correct FPS), plays the animation back several times too slowly because too many frames were added to the video during recording. That is, a 10 second video at 10 FPS should only have 100 frames, but about 500 are being stuffed into the video, resulting in the video being 50 seconds long.
My initial attempt at a solution was just to slow down the FillBuffer() call by adding a sleep() for 1/10th second. And that indeed does more or less work. But it seems hackish, and I question whether that would work well at higher FPS.
So I'm wondering if there's a better way to do this. Actually, I'm assuming there's a better way and I'm just missing it. Or do I just need to smarten up the manner in which FillBuffer() in the Push Source is delayed and use a better timing mechanism?
Any suggestions would be greatly appreciated!
I do this with threads. The main thread is adding bitmaps to a list and the recorder thread takes bitmaps from that list.
Main thread
Animate your graphics at time T and render bitmap
Add bitmap to renderlist. If list is full (say more than 8 frames) wait. This is so you won't use too much memory.
Advance T with deltatime corresponding to desired framerate
Render thread
When a frame is requested, pick and remove a bitmap from the renderlist. If list is empty wait.
You need a threadsafe structure such as TThreadList to hold the bitmaps. It's a bit tricky to get right, but your current approach is guaranteed to give you timing problems.
I am doing just that in my recorder application (for the purposes of testing the whole thing).
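The two-thread scheme above can be sketched with a bounded queue. This is a Python illustration of the pattern only, not DirectShow code: the animate and record functions, the string "bitmaps" and the None end marker are assumptions, and queue.Queue plays the role of the threadsafe list.

```python
import queue
import threading

FRAME_COUNT = 20
FRAME_LENGTH = 0.1               # seconds per frame at 10 frames per second

frames = queue.Queue(maxsize=8)  # put() waits while 8 frames are pending
recorded = []


def animate():
    """Producer: render frames at animation time T, not wall-clock time."""
    for n in range(FRAME_COUNT):
        t = n * FRAME_LENGTH             # advance T by one frame length
        frames.put((t, f"bitmap {n}"))   # stand-in for a rendered bitmap
    frames.put(None)                     # hypothetical end-of-stream marker


def record():
    """Consumer: take the next frame whenever the recorder asks for one."""
    while True:
        item = frames.get()              # waits while the queue is empty
        if item is None:
            break
        recorded.append(item)


producer = threading.Thread(target=animate)
consumer = threading.Thread(target=record)
producer.start()
consumer.start()
producer.join()
consumer.join()
print(len(recorded))  # 20
```

Because each frame carries its own animation time, the recorder can run faster or slower than real time without changing how many frames end up in the output.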
I am using Sleep() method to delay the frames, but am taking great care to ensure that timestamps of the frames are correct. Also, when Sleep()ing from frame to frame, please try to use 'absolute' time differences, because Sleep(100) will sleep about 100ms, not exactly 100ms.
If it won't work for you, you can always go for IReferenceClock, but I think that's overkill here.
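DateTime start = DateTime.Now;
int frameCounter = 0;
while (wePush)
{
    // Absolute deadline of this frame, so Sleep() inaccuracies don't accumulate.
    DateTime nextFrameTime = start.AddMilliseconds(frameCounter * 100);
    int delay = (int)(nextFrameTime - DateTime.Now).TotalMilliseconds;
    if (delay > 0) Thread.Sleep(delay); // needs System.Threading
    frameCounter++; // push the frame, stamped at the old frameCounter * 100 ms, before this
}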
Keep in mind: IWMWriter is time insensitive as long as you feed it with SAMPLES that are properly time-stamped.

openCV: is it possible to time cvQueryFrame to synchronize with a projector?

When I capture camera images of projected patterns using openCV via 'cvQueryFrame', I often end up with an unintended artifact: the projector's scan line. That is, since I'm unable to precisely time when 'cvQueryFrame' captures an image, the image taken does not respect the constant 30Hz refresh of the projector. The result is that typical horizontal band familiar to those who have turned a video camera onto a TV screen.
Short of resorting to hardware sync, has anyone had some success with approximate (e.g., 'good enough') informal projector-camera sync in openCV?
Below are two solutions I'm considering, but was hoping this is a common enough problem that an elegant solution might exist. My less-than-elegant thoughts are:
Add a slider control in the cvWindow displaying the video for the user to control a timing offset from 0 to 1/30th second, then set up a queue timer at this interval. Whenever a frame is needed, rather than calling 'cvQueryFrame' directly, I would request a callback to execute 'cvQueryFrame' at the next firing of the timer. In this way, theoretically the user would be able to use the slider to reduce the scan line artifact, provided that the timer resolution is sufficient.
After receiving a frame via 'cvQueryFrame', examine the frame for the tell-tale horizontal band by looking for a delta in HSV values for a vertical column of pixels. Naturally this would only work when the subject being photographed contains a fiducial strip of uniform color under smoothly varying lighting.
I've used several cameras with OpenCV, most recently a Canon SLR (7D).
I don't think that your proposed solution will work. cvQueryFrame basically copies the next available frame from the camera driver's buffer (or advances a pointer in a memory mapped region, or blah according to your driver implementation).
In any case, the timing of the cvQueryFrame call has no effect on when the image was captured.
So as you suggested, hardware sync is really the only route, unless you have a special camera, like a Point Grey camera, which gives you explicit software control of the frame integration start trigger.
I know this has nothing to do with synchronizing but, have you tried extending the exposure time? Or doing so by intentionally "blending" two or more images into one?
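The blending idea can be sketched as a per-pixel average over several captured frames. The Python example below works on flat lists of grayscale values to stay self-contained; blend is a hypothetical helper, and whether averaging hides the band well enough depends on the camera and projector timing.

```python
def blend(frames):
    """Average several equally sized frames pixel by pixel (integer result)."""
    count = len(frames)
    return [sum(pixels) // count for pixels in zip(*frames)]


# Two hypothetical 3-pixel frames; the dark pixel differs between captures.
frame_a = [0, 100, 200]
frame_b = [100, 100, 0]
print(blend([frame_a, frame_b]))  # [50, 100, 100]
```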
