What is the decoded output of a video codec? - video-encoding

Folks,
I am wondering if someone can explain to me what exactly the output of video decoding is. Let's say it is an H.264 stream in an MP4 container.
For displaying something on the screen, I guess the decoder can provide two different types of output:
Point - (x, y) coordinate of the location and the (R, G, B) color for the pixel
Rectangle (x, y, w, h) units for the rectangle and the (R, G, B) color to display
There is also the issue of time stamp.
Can you please enlighten me, or point me to the right link, on what is generated out of a decoder and how a video client can use this information to display something on screen?
I intend to download the VideoLAN source and examine it, but some explanation would be helpful.
Thank you in advance for your help.
Regards,
Peter

None of the above.
Usually the output will be a stream of bytes that contains just the color data. The X,Y location is implied by the dimensions of the video.
In other words, the first three bytes might encode the color value at (0, 0), the next three bytes the value of the next pixel in that row, and so on. Some formats might use four-byte groups, or even a number of bits that doesn't add up to one byte -- for example, if you use 5 bits for each color component and you have three color components, that's 15 bits per pixel. This might be padded to 16 bits (exactly two bytes) for efficiency, since that aligns the data in a way that CPUs can process more easily.
When you've processed exactly as many values as the video is wide, you've reached the end of that row. When you've processed exactly as many rows as the video is high, you've reached the end of that frame.
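Here's a minimal sketch of that layout, assuming tightly packed 3-bytes-per-pixel RGB with no row padding (many real decoders output planar, subsampled YUV instead, where the math differs):
// Sketch only: packed RGB, 3 bytes per pixel, rows stored top to bottom.
var bytesPerPixel = 3;
function pixelOffset(x, y, width) {
    return (y * width + x) * bytesPerPixel;
}
// Reading pixel (10, 5) of a 1280-pixel-wide frame from the decoded buffer `frame`:
// var o = pixelOffset(10, 5, 1280);
// var r = frame[o], g = frame[o + 1], b = frame[o + 2];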
As for the interpretation of those bytes, that depends on the color space used by the codec. Common color spaces are YUV, RGB, and HSL/HSV.
Which color space you get depends strongly on the codec in use and what input format(s) it supports; the decoder's output format is usually restricted to the set of formats the codec accepts as input when encoding.
Timestamp data is a bit more complex, since it can be encoded in the video stream itself or in the container. At a minimum, the stream needs a frame rate; from that, the time of each frame can be determined by counting how many frames have been decoded already. Another approach, taken by AVI, is to include a byte offset for every Nth frame (or just the keyframes) at the end of the file to enable rapid seeking. (Otherwise, you would need to decode every frame up to the timestamp you're looking for in order to determine where in the file that frame is.)
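As a sketch of the simplest case, where the stream only carries a frame rate and you just count frames:
// Sketch: derive a frame's presentation time from its index and a fixed frame rate.
function frameTime(frameIndex, framesPerSecond) {
    return frameIndex / framesPerSecond; // seconds since the start of the stream
}
// e.g. frameTime(90, 30) === 3, so frame index 90 of a 30 fps stream is shown at the 3-second mark.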
And if you're considering audio data too, note that with most codecs and containers, the audio and video streams are independent and know nothing about each other. During encoding, the software that writes both streams into the container format performs a process called muxing. It writes out the data in chunks of N seconds each, alternating between streams. This allows whoever is reading the stream to get N seconds of video, then N seconds of audio, then another N seconds of video, and so on. (More than one audio stream might be included too -- this technique is frequently used to mux video plus English and Spanish audio tracks into a single file that contains three streams.) In fact, even subtitles can be muxed in with the other streams.
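To make the interleaving concrete, here is a rough sketch of that muxing loop; the chunking strategy and the { timestamp, data } shape are purely illustrative, not any particular container format:
// Hypothetical sketch: write alternating chunks of roughly chunkSeconds of video and
// audio so a reader can stream both without seeking. Both inputs are assumed to be
// arrays of { timestamp, data } objects sorted by timestamp (in seconds).
function mux(videoFrames, audioFrames, chunkSeconds, writeChunk) {
    var t = 0, v = 0, a = 0;
    while (v < videoFrames.length || a < audioFrames.length) {
        var end = t + chunkSeconds;
        var videoChunk = [];
        while (v < videoFrames.length && videoFrames[v].timestamp < end) videoChunk.push(videoFrames[v++]);
        var audioChunk = [];
        while (a < audioFrames.length && audioFrames[a].timestamp < end) audioChunk.push(audioFrames[a++]);
        if (videoChunk.length) writeChunk('video', videoChunk); // N seconds of video...
        if (audioChunk.length) writeChunk('audio', audioChunk); // ...then N seconds of audio
        t = end;
    }
}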

cdhowie got most of it.
When it comes to timestamps, the MPEG-4 container contains tables that tell the video client when to display each frame. You should look at the spec for MPEG-4 Part 14; you normally have to pay for it, I think, but it's definitely downloadable from places.
http://en.wikipedia.org/wiki/MPEG-4_Part_14
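For a concrete feel for those tables, here is a rough sketch (not a real MP4 parser) of how the 'stts' (decoding time-to-sample) table expands into per-frame times. The entries array and the timescale are assumed to have been parsed already (the timescale comes from the track's 'mdhd' box), and composition offsets ('ctts') are ignored:
// Sketch: each stts entry says "the next sampleCount frames each last sampleDelta
// timescale units"; accumulating the deltas gives every frame's decode time.
function decodeTimes(entries, timescale) {
    var times = [];
    var t = 0;
    for (var i = 0; i < entries.length; i++) {
        for (var j = 0; j < entries[i].sampleCount; j++) {
            times.push(t / timescale); // seconds
            t += entries[i].sampleDelta;
        }
    }
    return times;
}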

Related

How do you make Media Source work with timestampOffset lower than appendWindowStart?

I want to use appendBuffer and append only a piece of the media I have.
To cut the piece from the end, I use appendWindowEnd and it works.
To cut it from the beginning I have to set timestampOffset lower than appendWindowStart. I have seen shaka-player doing something similar.
var appendWindowStart = Math.max(0, currentPeriod.startTime - windowFudge);
var appendWindowEnd = followingPeriod ? followingPeriod.startTime : duration;
...
var timestampOffset = currentPeriod.startTime - mediaState.stream.presentationTimeOffset;
From my tests, it works when timestampOffset is:
the same as appendWindowStart
1/10 of a second lower
It doesn't work when timestampOffset is lower than that: the segment doesn't get added. Does that have something to do with my media, or does the spec/implementation not allow it?
From MDN web docs:
The appendWindowStart property of the SourceBuffer interface controls the timestamp for the start of the append window, a timestamp range that can be used to filter what media data is appended to the SourceBuffer. Coded media frames with timestamps within this range will be appended, whereas those outside the range will be filtered out.
Just found this in the specification, so I am updating the question:
If presentation timestamp is less than appendWindowStart, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame.
Some implementations may choose to collect some of these coded frames with presentation timestamp less than appendWindowStart and use them to generate a splice at the first coded frame that has a presentation timestamp greater than or equal to appendWindowStart even if that frame is not a random access point. Supporting this requires multiple decoders or faster than real-time decoding so for now this behavior will not be a normative requirement.
If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame.
Some implementations may choose to collect coded frames with presentation timestamp less than appendWindowEnd and frame end timestamp greater than appendWindowEnd and use them to generate a splice across the portion of the collected coded frames within the append window at time of collection, and the beginning portion of later processed frames which only partially overlap the end of the collected coded frames. Supporting this requires multiple decoders or faster than real-time decoding so for now this behavior will not be a normative requirement. In conjunction with collecting coded frames that span appendWindowStart, implementations may thus support gapless audio splicing.
If the need random access point flag on track buffer equals true, then run the following steps:
If the coded frame is not a random access point, then drop the coded frame and jump to the top of the loop to start processing the next coded frame.
Set the need random access point flag on track buffer to false.
and
Random Access Point
A position in a media segment where decoding and continuous playback can begin without relying on any previous data in the segment. For video this tends to be the location of I-frames. In the case of audio, most audio frames can be treated as a random access point. Since video tracks tend to have a more sparse distribution of random access points, the location of these points are usually considered the random access points for multiplexed streams.
Does that mean that, for video, I have to choose a timestampOffset which lands on an I-frame?
The use of timestampOffset doesn't require an I-frame. It just shifts the timestamp of each frame by that value. That shift calculation is performed before anything else (before appendWindowStart gets involved).
It's the use of appendWindowStart that is affected by where your I-frames are.
appendWindowStart and appendWindowEnd act as an AND over the data you're adding.
MSE doesn't reprocess your data; by setting appendWindowStart you're telling the source buffer that any data prior to that time is to be excluded.
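A minimal sketch of that ordering, where sourceBuffer came from MediaSource.addSourceBuffer() and periodStart, periodEnd, presentationTimeOffset and segment are your own values:
// The offset is applied to each frame first; the shifted timestamps are then
// tested against the append window while appendBuffer() processes the segment.
sourceBuffer.timestampOffset = periodStart - presentationTimeOffset;
sourceBuffer.appendWindowStart = periodStart; // frames whose (shifted) timestamp is before this are dropped
sourceBuffer.appendWindowEnd = periodEnd;     // frames whose (shifted) end time is after this are dropped
sourceBuffer.appendBuffer(segment);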
Also, MSE works at the fundamental level of the GOP (group of pictures): from one I-frame to the next.
So let's imagine this group of images, made of 16-frame GOPs, each frame having a duration of 1 s.
.IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP
Say now you set appendWindowStart to 10
In the ideal world you would have:
. PPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP
All 9 previous frames, whose time starts prior to appendWindowStart, have been dropped.
However, those P-frames now can't be decoded, so per the spec MSE sets the "need random access point flag" to true, and the next frame added to the source buffer can only be an I-frame,
and so you end up in your source buffer with:
. IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP IPPPPPPPPPPPPPPP
Being able to add the frames between appendWindowStart and the next I-frame would be incredibly hard and computationally expensive.
It would require decoding all frames before adding them to the source buffer, and storing them either as raw YUV data or, if hardware accelerated, as GPU-backed images.
A source buffer could contain over a minute of video at any given time. Imagine if it had to deal with decompressed data rather than compressed data.
And if you wanted to preserve the same memory constraint as now (around 100 MiB of data maximum per source buffer), you would have to recompress the content on the fly before adding it to the source buffer.
not gonna happen.

iOS Extract all frames from a video

I have to extract all the frames from a video file and then save them to files.
I tried to use AVAssetImageGenerator, but it's very slow - it takes 1-3 s per frame (sample 1280x720 MPEG-4 video), not counting the process of saving to file.
Is there any way to make it much faster?
OpenGL, GPU, (...)?
I will be very grateful for showing me right direction.
AVAssetImageGenerator is a random-access (seeking) interface, and seeking takes time, so one optimisation could be to use an AVAssetReader, which will quickly and sequentially vend you frames. You can also choose to work in YUV format, which will give you smaller frames and (I think) faster decoding.
However, those raw frames are enormous: 1280 px * 720 px * 4 bytes/pixel (if in RGBA), which is about 3.6 MB each. You're going to need some pretty serious compression if you want to keep them all (MPEG-4 at 720p comes to mind :).
So what are you trying to achieve?
Are you sure you want to fill up your users' disks at a rate of 108 MB/s (at 30 fps) or 864 MB/s (at 240 fps)?
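The arithmetic behind those figures, as a quick sanity check:
// RGBA at 4 bytes per pixel:
var bytesPerFrame = 1280 * 720 * 4;          // 3,686,400 bytes, i.e. roughly 3.6 MB per frame
var bytesPerSecond30 = bytesPerFrame * 30;   // ~110 million bytes/s (the ~108 MB/s above)
var bytesPerSecond240 = bytesPerFrame * 240; // ~885 million bytes/s (the ~864 MB/s above)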

Programmatically get non-overlapping images from MP4

My ultimate goal is to get meaningful snapshots from MP4 videos that are either 30 min or 1 hour long. "Meaningful" is a bit ambitious, so I have simplified my requirements.
The image should be crisp - non-overlapping, and ideally not blurry. Initially, I thought getting a keyframe would work, but I had no idea that keyframes could have overlapping images embedded in them like this:
Of course, some keyframe images look like this and those are much better:
I was wondering if someone might have source code to:
Take a sequence of say 10-15 continuous keyframes (jpg or png) and identify the best keyframe from all of them.
This must happen entirely programmatically. I found this paper: http://research.microsoft.com/pubs/68802/blur_determination_compressed.pdf
and felt that I could "rank" a few images based on it, but then I was dissuaded by this link: Extracting DCT coefficients from encoded images and video, given that my source video is an MP4. Of course, this confuses me, because the input into the system is just a sequence of jpg images.
Another link that is interesting is:
Detection of Blur in Images/Video sequences
However, I am not sure if this will work for "overlapping" images.
The first pic is from an interlaced video at a scene change. The two fields belong to different scenes. De-interlacing the video will help; try the ffmpeg filter -filter:v yadif. I am not sure exactly how yadif works, but if it extracts the two fields and scales them to the original size, it will work. Another approach is to detect whether the two fields (extract alternate lines and form images with half the height, then diff them) are very different from each other, and to ignore those images.
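Here is a hypothetical sketch of that field-difference idea (the frame is assumed to be available as a Uint8Array of width * height grayscale/luma samples; the threshold is arbitrary):
// Split the frame into its two fields (even and odd lines), compute the mean
// absolute difference between them, and flag frames where the fields disagree
// strongly -- which is what a scene change across an interlaced frame looks like.
function fieldsDiffer(gray, width, height, threshold) {
    var sum = 0, count = 0;
    for (var y = 0; y + 1 < height; y += 2) {
        for (var x = 0; x < width; x++) {
            var even = gray[y * width + x];      // line from field 1
            var odd = gray[(y + 1) * width + x]; // line from field 2
            sum += Math.abs(even - odd);
            count++;
        }
    }
    return (sum / count) > threshold;
}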

Length (Time) of a (non-VBR) MP3 file

I'm currently researching the MP3 format in order to build an MP3 decoder.
After some thinking, I figured out that the simplest way to calculate the length of the song would be to divide the size by the bitrate (taking into account the size of the ID3 tag, etc.) and transform the result to minutes. Using this method on a few songs, I got accurate times.
I always assumed the time of the song is the length of the pure audio data, but with this method, the frame headers are also "considered" part of the song (when calculating the time).
Also, I understood that the audio data in the MP3 file is compressed, so when it's decompressed its size will of course be larger, and then the time calculation seems inaccurate.
Am I missing something here? Because it just doesn't make any sense to me that the song's length is calculated from the compressed data and not the uncompressed data, and that the frame headers, which are a DWORD each, are not ignored.
I always assumed the time of the song is the length of the pure audio data, but with this method, the frame headers are also "considered" part of the song (when calculating the time). Also, I understood that the audio data in the MP3 file is compressed, so when it's decompressed its size will of course be larger, and then the time calculation seems inaccurate.
When a media stream, such as an MP3 file, is compressed with a constant bitrate, that bitrate reflects the compressed size of the data, not the uncompressed size. So your math is fine.
What will throw you off with this approach is metadata tags (e.g., ID3) -- those are part of the file size but are not counted in the bitrate, since they aren't audio data. Luckily, they tend to be relatively small, so they won't affect your results much.
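A sketch of that estimate (CBR only; the names are illustrative): duration is roughly (file size minus tag bytes) divided by the byte rate.
// kbps -> bytes per second is (bitrate * 1000) / 8.
function estimateDurationSeconds(fileSizeBytes, tagBytes, bitrateKbps) {
    var audioBytes = fileSizeBytes - tagBytes;      // strip ID3v2/ID3v1 tags
    var bytesPerSecond = (bitrateKbps * 1000) / 8;
    return audioBytes / bytesPerSecond;
}
// e.g. a 4,000,000-byte file with a 50,000-byte tag at 128 kbps:
// estimateDurationSeconds(4000000, 50000, 128) is about 246.9 s, i.e. roughly 4:07.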

Mixing Sound Waves (CoreAudio on iOS)

It seems to me that CoreAudio adds sound waves together when mixing into a single channel. My program will make synthesised sounds. I know the amplitude of each of the sounds. When I play them together, should I add them together and scale the resultant wave to keep it within range? I can do it like this:
MaxAmplitude = max1 + max2 + max3 //Maximum amplitude of each sound wave
if MaxAmplitude > 1 then //Over range
Output = (wave1 + wave2 + wave3)/MaxAmplitude //Meet range
else
Output = (wave1 + wave2 + wave3) //Normal addition
end if
Can I do it this way? Should I pre-analyse the sound waves to find the actual maximum amplitude (because the peaks may not coincide on the timeline) and use that?
What I want is a method to play several synthesised sounds together without drastically reducing the overall volume, while still sounding seamless. If I play a chord with several synthesised instruments, I don't want the single notes to have to be practically silent.
Thank you.
Changing the scale suddenly on a single-sample basis, which is what your "if" statement does, can sound very bad, similar to clipping.
You can look into adaptive AGC (automatic gain control) which will change the scale factor more slowly, but could still clip or get sudden volume changes during fast transients.
If you use lookahead with the AGC algorithm to prevent sudden transients from clipping, then your latency will get worse.
If you do use AGC, then isolated notes may sound like they were played much more loudly than when played in a chord, which may not correctly represent a musical composition's intent (although this type of compression is common in annoying TV and radio commercials).
Scaling down the mixer output volume so that the notes never clip, and never have their volume reduced except when the composition indicates, will result in a mix with greatly reduced volume for a large number of channels (which is why properly reproduced classical music on the radio is often too quiet to draw enough listeners to make enough money).
It's all a trade-off.
I don't see this as a problem. If you know the max amplitude of all your waves (for all time), it should work. Be sure not to change the amplitude on a per-sample basis but to decide it for every "note-on". It is a very simple algorithm, but it could suit your needs.
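As a quick sketch of that per-note-on approach (in plain JavaScript just for illustration; the waves are assumed to be equal-length Float32Arrays of samples in [-1, 1], and peakAmplitudes their known maxima):
// Decide one gain for the whole note from the worst-case sum of the peak
// amplitudes, then mix with that constant gain -- no per-sample scale changes.
function mixNotes(waves, peakAmplitudes) {
    var worstCase = 0;
    for (var i = 0; i < peakAmplitudes.length; i++) worstCase += peakAmplitudes[i];
    var gain = worstCase > 1 ? 1 / worstCase : 1;
    var out = new Float32Array(waves[0].length);
    for (var w = 0; w < waves.length; w++) {
        for (var s = 0; s < waves[w].length; s++) out[s] += waves[w][s];
    }
    for (var k = 0; k < out.length; k++) out[k] *= gain;
    return out;
}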
