Capture, encode, then stream video from an iPhone to a server

I've got experience with building iOS apps but don't have experience with video. I want to build an iPhone app that streams real time video to a server. Once on the server I will deliver that video to consumers in real time.
I've read quite a bit of material. Can someone let me know if the following is correct and fill in the blanks for me.
To record video on the iPhone I should use the AVFoundation classes. When using an AVCaptureSession, the delegate method captureOutput:didOutputSampleBuffer:fromConnection: gives me access to each frame of video. Now that I have the video frame, I need to encode it.
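For reference, here is a minimal sketch of that capture path in Swift (video only, default settings, error handling trimmed); every captured frame arrives in the delegate callback, which is where it would be handed to the encoder:

```swift
import AVFoundation

final class FrameCapture: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    private let output = AVCaptureVideoDataOutput()
    private let queue = DispatchQueue(label: "camera.frames")

    func start() throws {
        guard let camera = AVCaptureDevice.default(for: .video) else { return }
        let input = try AVCaptureDeviceInput(device: camera)
        if session.canAddInput(input) { session.addInput(input) }

        output.setSampleBufferDelegate(self, queue: queue)
        if session.canAddOutput(output) { session.addOutput(output) }

        session.startRunning()
    }

    // Called once per captured frame; hand the sample buffer to whatever
    // does the encoding (for example an AVAssetWriterInput).
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // encode / forward sampleBuffer here
    }
}
```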
I understand that AVFoundation only offers H.264 encoding via AVAssetWriter, and not via a class that easily supports streaming to a web server. Therefore, I am left with writing the video to a file.
I've read other posts that say they can use two AVAssetWriters to write 10-second blocks and then use NSStream to send those 10-second blocks to the server. Can someone explain how to use two AVAssetWriters together to achieve this? If anyone has code, could they please share it?

You are correct that the only way to use the hardware encoders on the iPhone is by using the AVAssetWriter class to write the encoded video to a file. Unfortunately the AVAssetWriter does not write the moov atom to the file (which is required to decode the encoded video) until the file is closed.
Thus one way to stream the encoded video to a server would be to write 10 second blocks of video to a file, close it, and send that file to the server. I have read that this method can be used with no gaps in playback caused by the closing and opening of files, though I have not attempted this myself.
I found another way to stream video here.
This example opens two AVAssetWriters. On the first frame it writes to two files but immediately closes one of them so the moov atom gets written. Then, with the moov atom data, the second file can be used as a pipe to get a stream of encoded video data. This example only works for sending video data, but it is very clean, easy-to-understand code that helped me figure out how to deal with many video issues on the iPhone.
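For illustration, here is a rough sketch of the simpler rolling-writer variant (finish a file roughly every 10 seconds and immediately start a fresh writer, then upload the closed file). It is video-only, error handling is omitted, and names like RollingSegmentWriter and onSegmentFinished are made up for this sketch:

```swift
import AVFoundation

final class RollingSegmentWriter {
    private var writer: AVAssetWriter?
    private var input: AVAssetWriterInput?
    private var segmentStart = CMTime.invalid
    private let segmentDuration = CMTime(seconds: 10, preferredTimescale: 600)
    private var segmentIndex = 0
    var onSegmentFinished: ((URL) -> Void)?   // e.g. upload the closed file

    func append(_ sampleBuffer: CMSampleBuffer) {
        let pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)

        if writer == nil {
            startNewSegment(at: pts)
        } else if CMTimeSubtract(pts, segmentStart) >= segmentDuration {
            // Closing the file is what makes the moov atom get written.
            finishCurrentSegment()
            startNewSegment(at: pts)
        }

        if let input = input, input.isReadyForMoreMediaData {
            input.append(sampleBuffer)
        }
    }

    private func startNewSegment(at time: CMTime) {
        segmentIndex += 1
        let url = FileManager.default.temporaryDirectory
            .appendingPathComponent("segment_\(segmentIndex).mp4")
        try? FileManager.default.removeItem(at: url)

        guard let newWriter = try? AVAssetWriter(outputURL: url, fileType: .mp4) else { return }
        let newInput = AVAssetWriterInput(mediaType: .video, outputSettings: [
            AVVideoCodecKey: AVVideoCodecType.h264,
            AVVideoWidthKey: 1280,
            AVVideoHeightKey: 720
        ])
        newInput.expectsMediaDataInRealTime = true
        newWriter.add(newInput)
        newWriter.startWriting()
        newWriter.startSession(atSourceTime: time)

        writer = newWriter
        input = newInput
        segmentStart = time
    }

    private func finishCurrentSegment() {
        guard let finishing = writer else { return }
        input?.markAsFinished()
        let url = finishing.outputURL
        finishing.finishWriting { [weak self] in
            self?.onSegmentFinished?(url)   // ship this ~10 s file to the server
        }
        writer = nil
        input = nil
    }
}
```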

Related

AVSampleBufferDisplayLayer plays too fast

So I have put together a sample project, https://github.com/liuxuan30/TestH264.git, that uses VideoToolbox to decode an H.264 stream file captured from a camera and display it.
The H.264 decoder using VideoToolbox is copied from the internet; I didn't write it. When I try to play my H.264 stream file, it plays too fast compared to ffmpeg or ffplay, which both play it back at normal speed.
I wanted to ask: how do I fix this behaviour? Thanks.
This happens because of this constant kCMSampleAttachmentKey_DisplayImmediately:
If this key is present, the sample should be displayed as soon as possible rather than
according to its presentation timestamp. Use this attachment at run time to request this
behavior from a display pipeline such as the AVSampleBufferDisplayLayer class.
This attachment is not written to media files.
from the Apple documentation
So you have two display options:
Display immediately - probably the right choice for a real-time stream, when you need to display each frame as soon as possible
Display frames at their presentation timestamps
* "compared to ffmpeg or ffplay, which both play it back at normal speed"
ffplay and ffmpeg probably use the presentation timestamps at this point.
I get the same result as you with your test H.264 file, but that happens because all the decoded frames arrive at once, so the decoder displays them immediately.
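For example, this is roughly how that attachment gets toggled before a buffer is enqueued on an AVSampleBufferDisplayLayer. markDisplayImmediately is just a helper name for this sketch, and note that timestamped playback additionally requires the layer to have a running controlTimebase:

```swift
import AVFoundation
import CoreMedia

// Set (or clear) kCMSampleAttachmentKey_DisplayImmediately on a sample buffer.
// true  -> the frame is shown as soon as it is enqueued (real-time behaviour)
// false -> the layer honours the buffer's presentation timestamp
func markDisplayImmediately(_ sampleBuffer: CMSampleBuffer, _ immediately: Bool) {
    guard let attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer,
                                                                    createIfNecessary: true),
          CFArrayGetCount(attachments) > 0 else { return }

    // One attachment dictionary per sample; video buffers carry a single sample.
    let dict = unsafeBitCast(CFArrayGetValueAtIndex(attachments, 0),
                             to: CFMutableDictionary.self)
    let flag: CFBoolean
    if immediately {
        flag = kCFBooleanTrue
    } else {
        flag = kCFBooleanFalse
    }
    CFDictionarySetValue(dict,
                         Unmanaged.passUnretained(kCMSampleAttachmentKey_DisplayImmediately).toOpaque(),
                         Unmanaged.passUnretained(flag).toOpaque())
}

// Usage with an AVSampleBufferDisplayLayer:
//   markDisplayImmediately(buffer, false)   // respect timestamps
//   displayLayer.enqueue(buffer)
```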
You can watch this video for more information about the VideoToolbox framework:
Direct Access to Video Encoding and Decoding

Transcoding fMP4 to HLS while writing on iOS using FFmpeg

TL;DR
I want to convert fMP4 fragments to TS segments (for HLS) as the fragments are being written using FFmpeg on an iOS device.
Why?
I'm trying to achieve live uploading on iOS while maintaining a seamless, HD copy locally.
What I've tried
Rolling AVAssetWriters where each writes for 8 seconds, then concatenating the MP4s together via FFmpeg.
What went wrong - There are blips in the audio and video at times. I've identified 3 reasons for this.
1) Priming frames for the audio written by the AAC encoder create gaps.
2) Since video frames are 33.33 ms long and audio frames roughly 0.022 s long, it's possible for them not to line up at the end of a file.
3) The lack of frame-accurate encoding, which is present on Mac OS but not available for iOS. Details Here
FFmpeg muxing a large video-only MP4 file with raw audio into TS segments. The work was based on the Kickflip SDK.
What went wrong - Every once in a while an audio-only file would get uploaded, with no video whatsoever. I was never able to reproduce it in-house, but it was pretty upsetting to our users when they didn't record what they thought they did. There were also issues with accurate seeking on the final segments, almost as if the TS segments were incorrectly timestamped.
What I'm thinking now
Apple was pushing fMP4 at WWDC this year (2016) and I hadn't looked into it much at all before that. Since an fMP4 file can be read and played while it's being written, I thought it should be possible for FFmpeg to transcode the file as it's being written as well, as long as we hold off sending the bytes to FFmpeg until each fragment within the file is finished.
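As a sketch of the "hold off until each fragment is finished" part: a fragmented MP4 is just a sequence of top-level boxes (ftyp, moov, then repeating moof + mdat pairs), each starting with a 4-byte big-endian size and a 4-byte type, so a poller can walk the growing file and hand off only the bytes up to the last complete box. This is only an outline and assumes 32-bit box sizes (no largesize boxes):

```swift
import Foundation

// Returns the offset just past the last complete top-level box, starting the
// scan at `offset`. Everything before the returned offset is safe to feed to
// the transcoder/uploader; call again later as the file keeps growing.
func endOfCompleteBoxes(in fileURL: URL, from offset: UInt64) throws -> UInt64 {
    let handle = try FileHandle(forReadingFrom: fileURL)
    defer { handle.closeFile() }

    let fileSize = handle.seekToEndOfFile()
    var cursor = offset

    while cursor + 8 <= fileSize {
        handle.seek(toFileOffset: cursor)
        let header = handle.readData(ofLength: 8)
        guard header.count == 8 else { break }

        // 4-byte big-endian box size, then 4-byte box type (e.g. "moof", "mdat").
        let size = UInt64(header[0]) << 24 | UInt64(header[1]) << 16
                 | UInt64(header[2]) << 8  | UInt64(header[3])
        let type = String(bytes: header[4..<8], encoding: .ascii) ?? "????"

        guard size >= 8, cursor + size <= fileSize else { break }  // box not finished yet
        print("complete box:", type, size, "bytes")
        cursor += size
    }
    return cursor
}
```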
However, I'm not familiar enough with the FFmpeg C API; I only used it briefly in attempt #2.
What I need from you
Is this a feasible solution? Is anybody familiar enough with fMP4 to know if I can actually accomplish this?
How will I know that AVFoundation has finished writing a fragment within the file so that I can pipe it into FFmpeg?
How can I take data from a file on disk, chunk at a time, pass it into FFmpeg and have it spit out TS segments?
Strictly speaking, you don't need to transcode the fMP4 if it contains H.264 + AAC; you just need to repackage the sample data as TS (using ffmpeg -codec copy, or GPAC).
Regarding alignment (points 1 and 2), I suppose this all depends on your encoder settings (frame rate, sample rate and GOP size). It is certainly possible to make sure that audio and video align exactly at fragment boundaries (see for example: this table). If you're targeting iOS, I would recommend using HLS protocol version 3 (or 4), which allows timing to be represented more accurately. This also allows you to stream audio and video separately (non-multiplexed).
I believe ffmpeg should be capable of pushing a live fMP4 stream (i.e. using a long-running HTTP POST), but playout requires origin software to do something meaningful with it (i.e. stream it out as HLS).

Identify audio sample in MDAT Atom without MOOV Atom

I am trying to write a live video broadcaster over RTSP from an iOS device. I am using AVAssetWriter so I can take advantage of hardware encoding. To send over RTSP I have to get the avcC information out of the MOOV block; however, the MOOV block is only written by AVAssetWriter when you finish the session, which of course never happens while I am streaming live.
I have gotten around this for the video by encoding, writing, and then finishing a single sample buffer to a file, and then parsing that file to get the avcC information out. That works just fine.
After that, for the live stream, since AVAssetWriter will only write to a file, I write to a file and then read from it with a chasing file offset. When I do this with video only, I can read the NALUs from the MDAT atom in the written file without any MOOV information, because the size of each NALU is given in its first 4 bytes. So I can read that amount, process it, and send it on its way over an RTSP stream. With video only, everything works perfectly and I get a really good HD stream to a streaming server.
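For reference, the video-side "chasing reader" described above can be as simple as walking those 4-byte (AVCC-style) length prefixes. This sketch assumes `data` is a chunk read from the mdat payload and indexed from 0; it returns how many bytes were consumed so the next read of the growing file can resume there:

```swift
import Foundation

func readNALUs(from data: Data, handler: (Data) -> Void) -> Int {
    var cursor = 0
    while cursor + 4 <= data.count {
        // First 4 bytes: big-endian NALU length.
        let length = Int(data[cursor]) << 24 | Int(data[cursor + 1]) << 16
                   | Int(data[cursor + 2]) << 8 | Int(data[cursor + 3])
        let end = cursor + 4 + length
        guard length > 0, end <= data.count else { break }   // NALU not fully written yet
        handler(data.subdata(in: cursor + 4 ..< end))         // hand the raw NALU to the RTSP packetizer
        cursor = end
    }
    return cursor   // bytes consumed; resume from here on the next pass
}
```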
The problem I am now having is when I try to incorporate audio from the mic into the stream. I can encode it just fine with AVAssetWriter and I get a properly interleaved MP4 file to read from; however, unlike the H.264 NALUs, the audio samples in the file do not carry their own size in their first bytes. So far the only way I can see to determine that is with the STSZ and STCO atoms in the MOOV, which of course I don't have because it is a live stream.
With all that in mind, does anyone know a way to identify audio sample segments in an MDAT atom without the information from the MOOV atom? As soon as I figure that out, I'm home free.
Thanks in advance for any insight.
After a lot of research and emails out to people, I at least have an answer, and the answer is: I can't do it this way. Normally, AAC samples in streams that don't have an index are wrapped in ADTS headers, which hold the length field for the packet. However, since I am using AVAssetWriter for the audio, and AVAssetWriter writes directly to an MP4 file, the ADTS wrapper is stripped off because of the index that will be in the MOOV atom.
Therefore, I will have to encode the audio differently, probably through Audio Queue Services, and merge it with the video packets when building the RTSP stream.
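For anyone coming down this road later: once you do have raw AAC packets from a separate encoder (Audio Queue Services, AudioConverter, etc.), the usual trick is to prepend a 7-byte ADTS header to each packet yourself, since that header carries the frame length the stream needs. A sketch, assuming AAC-LC at 44.1 kHz stereo (adjust the profile, sample-rate index and channel configuration to match your encoder settings):

```swift
import Foundation

func adtsHeader(forPayloadLength payloadLength: Int) -> Data {
    let profile = 2            // AAC-LC
    let freqIndex = 4          // 44.1 kHz
    let channelConfig = 2      // stereo
    let frameLength = payloadLength + 7   // the header itself counts toward the length

    var header = [UInt8](repeating: 0, count: 7)
    header[0] = 0xFF                                             // syncword (high 8 bits)
    header[1] = 0xF1                                             // syncword low, MPEG-4, layer 0, no CRC
    header[2] = UInt8(((profile - 1) << 6) | (freqIndex << 2) | ((channelConfig >> 2) & 0x1))
    header[3] = UInt8(((channelConfig & 0x3) << 6) | ((frameLength >> 11) & 0x3))
    header[4] = UInt8((frameLength >> 3) & 0xFF)
    header[5] = UInt8(((frameLength & 0x7) << 5) | 0x1F)         // + buffer fullness (high bits)
    header[6] = 0xFC                                             // buffer fullness (low bits), 1 AAC frame
    return Data(header)
}

// Usage: adtsHeader(forPayloadLength: aacPacket.count) + aacPacket
```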
Maybe this will help someone else in the future looking down this same road.
Many thanks to Geraint Davies at http://www.gdcl.co.uk for leading me down the right path.

Is it possible for AVAssetWriter to output to memory

I would like to write an iPhone app that continuously captures video, H.264-encodes it in 10-second intervals, and uploads it to a storage server. This can be done with AVAssetWriter, and I can keep deleting the old files as I create new ones. However, since flash memory has a limited number of write cycles, this scheme will wear out the flash after a few thousand write cycles. Is there a way to redirect AVAssetWriter to memory, or to create a RAM drive on the iPhone?
Thanks!
Yes, AVAssetWriter is the only way to get to the hardware encoder, and simply reading back the file while it's being written doesn't give you the moov atom, so AVFoundation- or MPMediaPlayer-based players won't be able to read it back. You only have a couple of choices: periodically stop the AVAssetWriter and write to a new file on a background thread, effectively segmenting your movie into smaller complete files; or deal with the incomplete MP4 on the server side, in which case you will have to parse the raw NALUs and recreate the missing moov atom. If you're using FFmpeg, mov.c is the source to look at; it is also where an incomplete MP4 file would fail.

Play socket-streamed h.264 movie on iOS using AVFoundation

I'm working on a small iPhone app which streams movie content over a network connection using regular sockets. The video is in H.264 format. However, I'm having difficulties with playing/decoding the data. I've been considering using FFmpeg, but the license makes it unsuitable for the project. I've been looking into Apple's AVFoundation framework (AVPlayer in particular), which seems to be able to handle H.264 content; however, I'm only able to find methods to initiate the movie using a URL – not by providing a memory buffer streamed from the network.
I’ve been doing some tests to make this happen anyway, using the following approaches:
Play the movie using a regular AVPlayer. Every time data is received from the network, it's written to a file using fopen in append mode. The AVPlayer's asset is then reloaded/recreated with the updated data. There seem to be two issues with this approach: firstly, the screen goes black for a short moment while the first asset is unloaded and the new one loaded. Secondly, I do not know exactly where playback stopped, so I'm unsure how to find the right position to start playing the new asset from.
The second approach is to write the data to the file as in the first approach, but with the difference that the data is loaded into a second asset. An AVQueuePlayer is then used: the second asset is inserted/queued in the player and played once buffering is done. The first asset can then be unloaded without a black screen. However, using this approach it's even more troublesome (than the first approach) to find out where to start playing the new asset.
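For what it's worth, the queueing mechanics of that second approach look roughly like this (segment URLs are placeholders); it still leaves the "where to resume" problem open:

```swift
import AVFoundation

let queuePlayer = AVQueuePlayer()

func enqueueSegment(at url: URL) {
    let item = AVPlayerItem(url: url)
    // Append behind whatever is already queued so playback rolls from one
    // asset into the next without tearing the player down (no black flash).
    queuePlayer.insert(item, after: queuePlayer.items().last)
    if queuePlayer.rate == 0 {
        queuePlayer.play()
    }
}
```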
Has anyone done something like this and made it work? Is there a proper way of doing this using AVFoundation?
The official method to do this is the HTTP Live Streaming format, which supports multiple quality levels (among other things) and automatically switches between them (e.g. if the user moves from Wi-Fi to cellular).
You can find the docs here: Apple HTTP Live Streaming docs
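Once the server repackages the feed as HLS, the client side reduces to pointing AVPlayer at the playlist. A minimal sketch (the URL is a placeholder):

```swift
import AVKit
import AVFoundation
import UIKit

func presentLiveStream(from viewController: UIViewController) {
    let playlistURL = URL(string: "https://example.com/live/stream.m3u8")!  // placeholder
    let player = AVPlayer(url: playlistURL)

    let playerController = AVPlayerViewController()
    playerController.player = player
    viewController.present(playerController, animated: true) {
        player.play()
    }
}
```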
