Decoding a proprietary HEVC/MP4 stream - parsing

One of those times where I am just out of ideas and hoping for a saint.
I am currently trying to decode and use a proprietary video stream of an IP cam, and I feel like I am very close but just cannot find the last piece of the puzzle. The camera is set to 1 FPS, CBR and an I-frame interval of 1 for maximum consistency.
Overview of what I currently do: buffer packets, look for the header of the camera's proprietary protocol (35 bytes), look for the next one, flush everything in between out to a file (for the sake of this post, that chunk is called a "segment"), rinse, repeat.
If I set the stream to a very low quality, that is 352*288 with a very low bitrate, I can open and play back the resulting file in MPC absolutely fine (or convert it with FFmpeg and then play it back in VLC).
But here comes the issue: as I increase the video quality, after a certain point the video starts to get corrupted. One thing that also starts to happen when this occurs: the maximum "segment" that is found is capped at 8183 bytes (quite a peculiar size, I found, as it's very close to 2^13). So I decided to look into what actually gets written whenever an 8176-byte section is encountered, and what I've found seems very peculiar as well - many almost-matching bytes! (These bytes are only written for the first 8176-byte segment of a frame.)
Sample 1:
0000 0001 4001 0c01 ffff 0160 0000 0300 b000 0003 0000 0300 3cac 0900 0000 0142 0101 0160 0000 0300 b000 0003 0000 0300 3ca0 0b08 0485 8dae 4932 fcdc 0404 0402 0000 0001 4401 c0f2 f03c 9000 0000 014e 01e5 04cc cc00 0080 0000 0001 2601 af1b 686f 315f 8bcd 7007
Sample 2 (A couple of seconds later):
0000 0001 4001 0c01 ffff 0160 0000 0300 b000 0003 0000 0300 3cac 0900 0000 0142 0101 0160 0000 0300 b000 0003 0000 0300 3ca0 0b08 0485 8dae 4932 fcdc 0404 0402 0000 0001 4401 c0f2 f03c 9000 0000 014e 01e5 049b 9b00 0080 0000 0001 2601 af17 68c3 3d14 cf63 2cab
As you can see, up until the 8000 0000 0126 01af they seem to be some type of header for / by... something. Edit: This part turns out to contain the VPS / SPS / PPS of the frame that follows.
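(A quick way to sanity-check that reading of the dumps: in HEVC, the NAL unit type lives in the upper six bits of the first byte after each Annex B start code. A minimal sketch, using the first NAL byte of each unit in the samples above:)

#include <cstdint>
#include <cstdio>

// HEVC: nal_unit_type = (first byte after the start code >> 1) & 0x3F
int nalType(uint8_t firstByte) { return (firstByte >> 1) & 0x3F; }

int main() {
    const uint8_t nals[] = {0x40, 0x42, 0x44, 0x4E, 0x26}; // from the dumps
    for (uint8_t b : nals)
        std::printf("0x%02X -> NAL type %d\n", b, nalType(b));
    // Prints types 32, 33, 34, 39, 19: VPS, SPS, PPS, prefix SEI, IDR slice.
    return 0;
}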
It almost seems like the demuxer just has no effing idea that the current frame has more data to it than the initial 8176-byte segment, seeing how at a quality setting where one frame consists of one 8176-byte and one ~2000-byte segment, the video is fine in the upper two thirds and only starts to corrupt at the lower end. This "point of corruption" of course moves up further and further as I increase the bitrate of the stream.
Why don't you just use a proper camera?!
This camera is actually fine.
Just use its normal RTSP stream then?
Well, there's the issue of why I even started doing this - it only supports RTSP over UDP while this proprietary protocol runs over TCP, and if there's packet loss (which there is) the RTSP stream will start to corrupt, which is of course what I am trying to avoid.
Hope there's somebody here who might be able to help me. If you need sample files or anything, I'd be happy to provide them for any soul interested in trying to help.
Thanks!
Edit: Seems like I might be onto something. I've just downloaded the trial version of Zond 265 (an HEVC analyzer), and when opening the resulting file in it, it errors for every frame with both "Unexpected remaining X bytes found" and "end_of_subset_one_bit shall be equal to 1". If I take those remaining bytes and remove that same amount of bytes in front of them, both of those errors go away! (A new one appears though: decode CTU #x: exception.) The image is obviously still corrupted, as information is now missing, but at least it's something to work off. Still no real idea what the next step would be, though.

So I've managed to solve my issue; here's what I did. I found DVR software on some dodgy site that supports the very same protocol and managed to access the camera through it. I then recorded the stream through it, as well as through my own software, and bindiffing the two results is what gave me the final click. Turns out that I was slicing a couple of bytes too much off the header (pretty much slicing into the video stream data), but not always. Occasionally (and on the very first frame) the response header is 8 bytes longer than usual, which is indicated by the video stream starting with 00 00 01 FC. So by keeping those 8 bytes I had always sliced off in the stream, or cutting them out in that situation, I get a non-corrupted video stream :)
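For anyone who lands here with a similar camera, this is roughly what the fix looks like in code. To be clear, this is just my reading of the post: the 35-byte header length, the 8-byte extension and the 00 00 01 FC marker are this camera's proprietary framing, not any standard, and the names are made up:

#include <cstdint>
#include <cstring>

constexpr size_t HEADER_LEN = 35; // camera's proprietary response header
constexpr size_t EXTRA_LEN  = 8;  // occasional extra header bytes

// Number of bytes to strip before the HEVC payload begins.
size_t headerLength(const uint8_t* pkt, size_t len) {
    static const uint8_t marker[4] = {0x00, 0x00, 0x01, 0xFC};
    if (len >= HEADER_LEN + sizeof(marker) &&
        std::memcmp(pkt + HEADER_LEN, marker, sizeof(marker)) == 0)
        return HEADER_LEN + EXTRA_LEN; // long variant: drop 8 more bytes
    return HEADER_LEN;                 // common case
}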

Related

Beagle Bone Black Audio Cape rev B synchronization issues

Basically the audio cape is working, except for one strange phenomenon that mystifies me. I will try to explain.
When I play a .wav file, for example with speaker-test -t wav, then if I'm lucky I hear "Front Left" - "Front Right" as one expects. But 9 times out of 10 I hear white noise with the audio ("front left", "front right") very faint in the background, or at other times the sound is simply distorted. The same happens when I play a file with aplay or mplayer.
So when I am lucky, or the timing with respect to the system clock is in sync, I hear the audio clearly; if out of sync, it might be white noise or distorted playback.
I have googled extensively and have not found any solution. So I hope one of you guys knows what's happening here. It has to be something low level.
I'm quite a newbie in this matter, but according to this: Troubleshooting Linux Sound, all seems to work OK.
These are my system parameters and settings:
root@beaglebone:~# lsb_release -a
Distributor ID: Angstrom
Description: Angstrom GNU/Linux v2012.12 (Core edition)
Release: v2012.12
Codename: Core edition
root@beaglebone:~# cat /sys/devices/bone_capemgr*/slots
0: 54:PF---
1: 55:PF---
2: 56:P---L CBB-Relay,00A0,Logic_Supply,CBB-Relay
3: 57:PF---
4: ff:P-O-L Bone-LT-eMMC-2G,00A0,Texas Instrument,BB-BONE-EMMC-2G
5: ff:P-O-- Bone-Black-HDMI,00A0,Texas Instrument,BB-BONELT-HDMI
6: ff:P-O-L Bone-Black-HDMIN,00A0,Texas Instrument,BB-BONELT-HDMIN
7: ff:P-O-L Override Board Name,00A0,Override Manuf,BB-BONE-AUDI-02
root@beaglebone:~# speaker-test -t wav
speaker-test 1.0.25
Playback device is default
Stream parameters are 48000Hz, S16_LE, 1 channels
WAV file(s)
Rate set to 48000Hz (requested 48000Hz)
Buffer size range from 128 to 32768
Period size range from 8 to 2048
Using max buffer size 32768
Periods = 4
was set period_size = 2048
was set buffer_size = 32768
0 - Front Left
Time per period = 0.641097
0 - Front Left
root@beaglebone:~# mplayer AxelF.wav
MPlayer2 2.0-379-ge3f5043 (C) 2000-2011 MPlayer Team
162 audio & 361 video codecs
Playing AxelF.wav.
Detected file format: WAV format (libavformat)
[wav @ 0xb6082780]max_analyze_duration reached
[lavf] stream 0: audio (pcm_s16le), -aid 0
Load subtitles in .
==============================================================
Forced audio codec: mad
Opening audio decoder: [pcm] Uncompressed PCM audio decoder
AUDIO: 44100 Hz, 2 ch, s16le, 1411.2 kbit/100.00% (ratio: 176400->176400)
Selected audio codec: [pcm] afm: pcm (Uncompressed PCM)
==============================================================
AO: [alsa] 44100Hz 2ch s16le (2 bytes per sample)
Video: no video
Starting playback...
A: 1.6 (01.6) of 15.9 (15.8) 0.3%
MPlayer interrupted by signal 2 in module: unknown
Exiting... (Quit)
I can shed some light on what is causing the artifacts that you experience. I am sorry I do not yet have a countermeasure - I am struggling with the same problem. You describe the perceptible consequences pretty accurately.
Sound data travels from the ARM System on Chip to the Audio Codec on the audio cape using the I2S bus. I2S is a serial protocol, it sends one bit at a time, starting each sample with the most significant bit, then sending all bits down to the least significant bit. After the least significant bit of one sample is sent, the most significant bit of the sample on the next audio channel is sent. To be able to interpret the bit stream, the receiving audio codec needs to know when a new sound sample starts with its most significant bit, and also, to which channel each sound sample belongs. For this purpose, the "Word Select" (WS) signal is part of I2S and changes its value to indicate the start of the sound sample and also identifies the channel, see this I2S timing diagram for a better understanding of the concept.
What you and I perceive on our not-quite-working audio capes can be fully explained by the bit stream being interpreted out-of-step by the audio codec:
When you hear loud noise and the target signal soft in the background, then one or more of the least significant bits of the preceding sample are interpreted as the most significant bits of the current sample. The more bits are shifted, the softer the target signal, until you might only perceive noise when (this is a guess!) about 4 bits are shifted.
When the shift is in the other direction, i.e. most significant bit of the current sample was interpreted as the least significant bit of the preceding sample, then what you hear will sound correct for soft parts of the signal, i.e. when the most significant bit is not actually used (this is a simplification, see below). For louder parts of the signal, e.g. drum beats, you will perceive the missing most significant bit as distortion. Of course, the distortion gets worse and starts at softer levels as more bits are shifted in this direction.
In the above paragraph, the most significant bit will change with the sign of the data, so the statement that it is not actually used is valid only insofar as the most significant bit will have the same value as the next most significant bit for soft sounds. See Two's Complement for an introduction to how negative integers are represented in computers.
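To convince yourself of the effect, here is a small standalone simulation (illustrative only - it has nothing to do with the actual driver): it serializes 16-bit samples MSB-first, the way the I2S data line carries them, then re-frames the bit stream a few positions out of step, the way a codec that mis-times WS would.

#include <cstdint>
#include <cstdio>
#include <vector>

// Serialize samples MSB-first, then deserialize with a slip of `shift` bits.
std::vector<int16_t> slip(const std::vector<int16_t>& in, int shift) {
    std::vector<bool> bits;
    for (int16_t s : in)
        for (int b = 15; b >= 0; --b)
            bits.push_back((static_cast<uint16_t>(s) >> b) & 1);
    std::vector<int16_t> out;
    for (size_t i = shift; i + 16 <= bits.size(); i += 16) {
        uint16_t v = 0;
        for (int b = 0; b < 16; ++b)
            v = static_cast<uint16_t>((v << 1) | bits[i + b]);
        out.push_back(static_cast<int16_t>(v));
    }
    return out;
}

int main() {
    std::vector<int16_t> ramp;                 // a soft, smooth signal
    for (int i = 0; i < 8; ++i) ramp.push_back(static_cast<int16_t>(i * 1000));
    for (int16_t s : slip(ramp, 3))            // 3-bit slip: loud garbage
        std::printf("%d ", s);
    std::printf("\n");
    return 0;
}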
I am not sure where the corruption occurs. It could be that the WS signal is not correctly interpreted by the audio codec on the cape, or the WS signal is not correctly sent by the ARM System-on-Chip, or the bit shift might already happen inside the ARM CPU, e.g. in the ALSA driver.

Why does a 20-bit address space on a 16-bit machine give access to 1 megabyte and not 2 megabytes?

OK, this question sounds simple but I am taken by surprise. In the ancient days when 1 megabyte was a huge amount of memory, Intel was trying to figure out how to use 16 bits to access 1 megabyte of memory. They came up with the idea of using segment and offset address values to generate a 20-bit address.
Now, 20 bits gives 2^20 = 1,048,576 locations that can be addressed. Assuming that we access 1 byte per address location, we get 1,048,576/(1024*1024) = 2^20/2^20 megabytes = 1 megabyte. OK, understood.
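(For reference, the segment:offset scheme is easy to demonstrate; a tiny sketch of how the 8086 forms its 20-bit physical address:)

#include <cstdint>
#include <cstdio>

// 8086 real mode: physical = segment * 16 + offset, a 20-bit result.
uint32_t physicalAddress(uint16_t segment, uint16_t offset) {
    return (static_cast<uint32_t>(segment) << 4) + offset;
}

int main() {
    std::printf("%05X\n", static_cast<unsigned>(physicalAddress(0x1234, 0x0010))); // 12350
    std::printf("%05X\n", static_cast<unsigned>(physicalAddress(0xFFFF, 0xFFFF))); // 10FFEF, just past 1 MiB
    return 0;
}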
The confusion comes here: we have a 16-bit data bus in the ancient 8086 and can access 2 bytes at a time rather than 1. Doesn't that equate a 20-bit address to being able to access a total of 2 megabytes of data? Why do we assume that each address only has 1 byte stored in it when the data bus is 2 bytes wide? I am confused here.
It is very important to consider the bus when trying to understand this. This is probably more of an electrical question than a software one, but here is the answer:
For the 8086, when reading from ROM, the least significant address line (A0) is not used, reducing the number of address lines to 19 right then and there.
In the case where the CPU needs to read 16 bits from an odd address, say the bytes at 0x3 and 0x4, it will actually do two 16-bit reads: one from 0x2 and one from 0x4, and discard the bytes at 0x2 and 0x5.
For 8-bit ROM reads, the read on the bus is still 16 bits, but the unneeded byte is discarded.
But for RAM there is sometimes a need to write just a single byte; this gets a little more complex. There is an extra output signal on the processor called BHE# (Bus High Enable). The combination of A0 and BHE# determines whether the access is 8 or 16 bits wide, and whether it is at an odd or even address.
Understanding these two signals is key to answering your question. Stating it as simply as possible:
8-bit even access: A0 low, BHE# deasserted (only the low byte lane is used)
8-bit odd access: A0 high, BHE# asserted (only the high byte lane is used)
16-bit access (must be even): A0 low, BHE# asserted (both byte lanes are used)
And we cannot have a bus cycle with A0 high and BHE# deasserted, because an odd access to the even byte lane of the bus is meaningless.
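(A compact way to restate that truth table as code - just a sketch, with BHE# written as its electrical level, where 0 means asserted since it is active-low:)

#include <cstdio>

const char* busAccess(int a0, int bhe_n) {
    if (a0 == 0 && bhe_n == 0) return "16-bit word at an even address";
    if (a0 == 0 && bhe_n == 1) return "8-bit access, low byte lane (even address)";
    if (a0 == 1 && bhe_n == 0) return "8-bit access, high byte lane (odd address)";
    return "no transfer (meaningless combination)";
}

int main() {
    for (int a0 = 0; a0 <= 1; ++a0)
        for (int bhe_n = 0; bhe_n <= 1; ++bhe_n)
            std::printf("A0=%d BHE#=%d -> %s\n", a0, bhe_n, busAccess(a0, bhe_n));
    return 0;
}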
Relating this back to your original understanding: you are completely correct in the case of memory devices. A 1-megabyte 16-bit memory chip will indeed only have 19 address lines; to that chip, the addressable unit is a 16-bit word, and in effect it does not physically have an A0 address input.
... almost. 16-bit writable memory devices have two extra signals (BHE# and BLE#) which are connected to the CPU's BHE# and A0 respectively. This is so they know to ignore part of the bus when an 8-bit access is under way, making them hybrid 8/16-bit devices. ROM chips do not have these signals.
For the hardware-unenlightened, this is a fairly complex area we're touching on here, and it gets very complex indeed in terms of performance considerations and in large systems with mixed 8- and 16-bit hardware.
It's all explained in fantastic detail in the 8086 datasheet.
It's because a byte is the 'atom' in memory addressing, and code must be able to access all the individual bytes in the address space. It was really a matter of software and compatibility with existing 8-bit software back then.
This too may interest you: How a single byte of memory is accessed by CPU in a 32-bit memory and 32-bit processor

Interpreting Frame Control bytes in 802.11 Wireshark trace

I have a Wi-Fi capture (.pcap) that I'm analysing and have run across what appear to me to be inconsistencies between the 802.11 spec and Wireshark's interpretation of the data. Specifically what I'm trying to pull apart is the 2-byte 802.11 Frame Control field.
Taken from http://www4.ncsu.edu/~aliu3/802.bmp, the format of the Frame Control field's subfields is as follows:
And below is a Wireshark screen cap of the packet that has me confused:
So as per the Wireshark screenshot, the flags portion (last 8 bits) of the Frame Control field is 0x22, which is fine. How the Version/Type/Subtype byte being 0x08 matches up with Wireshark's description of the frame is what has me confused.
0x08 = 0000 1000b, which I thought would translate to Version = 00, Type = 00 (which I thought meant a management, not data, frame) and Subtype = 1000 (which I thought would be a beacon frame). So I would expect this frame to be a management frame and, more specifically, a beacon frame. Wireshark however reports it as a Data frame. The second thing confusing me is where Wireshark is even pulling 0x20 from in the line Type/Subtype: Data (0x20).
Can anyone clarify my interpretation of the 802.11 spec/Wireshark capture for me and why the two aren't consistent?
The data frame in your example is 0x08 because of the layout of that byte of the frame control (FC) field. 0x08 = 00001000:
- The first 4 bits (0000) are the subtype. 0000 is the subtype of this frame
- The next 2 bits (10) are the type, which is 2 decimal and thus a data frame
- The last 2 bits (00) are the version, which is 0
The table below translates the hex value of the subtype-type-version byte of the FC for several frame types. Comparing the QoS data frame to the normal data frame might really help get this down pat. Mind you, the table might have an error or two, as I just whipped it up.
You are right that 1000 is a beacon frame, you just were looking at the wrong bits.
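To make the bit positions concrete, here is the same decode as a small sketch (the 0x08 input is the byte from the capture above; bits are numbered from the least significant end):

#include <cstdint>
#include <cstdio>

// First Frame Control byte: version = bits 0-1, type = bits 2-3,
// subtype = bits 4-7.
struct FrameControl {
    unsigned version, type, subtype;
};

FrameControl parseFC(uint8_t b) {
    return { b & 0x03u, (b >> 2) & 0x03u, (b >> 4) & 0x0Fu };
}

int main() {
    FrameControl fc = parseFC(0x08);
    // Wireshark's combined "Type/Subtype" value is (type << 4) | subtype.
    std::printf("version=%u type=%u subtype=%u combined=0x%02X\n",
                fc.version, fc.type, fc.subtype,
                (fc.type << 4) | fc.subtype);   // prints 0x20: Data
    return 0;
}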
You have a radiotap header; you can get the decimal representation of the type like so from the pcap packet data:
int type = pkt_data[20] >> 2; /* first frame-control byte (offset 20 past the radiotap header in this capture); >> 2 drops the two version bits */
This is a common error, and has certainly bitten me several times.
It is down to the byte ordering.
When you have a multi-byte number to represent, the question arises as to which byte you put/send first.
Natural (human) byte order is to put the big part first, then the smaller parts after it, left-to-right; this is called Big-Endian. Note that the bits in each byte are never the wrong way around from a programmer's point of view.
e.g. 1234 decimal requires 2 bytes, 04D2 hex.
Do you write/send 04 D2, or D2 04?
The first is Big-endian, the second is Little-endian.
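(A two-line experiment shows which order your own machine stores that value in - a minimal sketch:)

#include <cstdint>
#include <cstdio>

int main() {
    // 1234 decimal = 0x04D2: dump it byte by byte to see the native order
    // (most desktop CPUs print "d2 04", i.e. little-endian).
    uint16_t v = 0x04D2;
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
    std::printf("%02x %02x\n", p[0], p[1]);
    return 0;
}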
To confuse it more, the mechanisms involved may use different byte-orders.
There is the network byte order (big-endian in the TCP/IP world - though 802.11, like most IEEE 802 protocols, transmits its multi-byte fields little-endian, which is what applies here), the architecture byte order (which can be different for each CPU architecture), and the data may sit in a buffer, so it will vary depending on whether you read the buffer top-to-bottom or bottom-to-top.
It doesn't help that the explanation of which bits do what can also be 'backwards', as in your original post.
I am using Wireshark version 2.4.3 on Windows. My capture file of data frames is like below.
Frame control field = 0x0842, i.e. 0000 1000 0100 0010 in binary.
Frame control flags field = 0x42, i.e. 0100 0010 in binary.
So, as per my understanding, the least significant 8 bits of a frame control field correspond to the flags.
The most significant 8 bits correspond to subtype, type and version, i.e. in my case 0000 = subtype, 10 = type, 00 = version.
That is a data frame of subtype 0.
It might be an error with Wireshark in your case. It should display the frame control field as 0x0822 instead of 0x2208.
The flags field is properly displayed as 0x22.
In my case, using Wireshark 2.4.3, the frame control field is displayed correctly as 0x0842, with flags 0x42.

How to deal with cv::VideoCapture decode errors?

I'm streaming H264 content from an IP camera using the VideoCapture from OpenCV (compiled with ffmpeg support).
So far things work ok, but every once in a while I get decoding errors (from ffmpeg I presume):
[h264 @ 0x103006400] mb_type 137 in I slice too large at 26 10
[h264 @ 0x103006400] error while decoding MB 26 10
[h264 @ 0x103006400] negative number of zero coeffs at 25 5
[h264 @ 0x103006400] error while decoding MB 25 5
[h264 @ 0x103006400] cbp too large (421) at 35 13
[h264 @ 0x103006400] error while decoding MB 35 13
[h264 @ 0x103006400] mb_type 121 in P slice too large at 20 3
[h264 @ 0x103006400] error while decoding MB 20 3
These messages show up in the console. Is there any clean way to listen for these? I'd like to skip processing the glitchy frames.
Any hints/tips?
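One approach worth trying - a sketch only, and it assumes OpenCV's FFmpeg backend ends up using the same libavutil your program links against - is to install an FFmpeg log callback and flag frames that produced decode errors:

#include <opencv2/opencv.hpp>
extern "C" {
#include <libavutil/log.h>
}
#include <atomic>
#include <cstdarg>

std::atomic<bool> g_decodeError{false};

// Forward everything to FFmpeg's default logger, but remember that an
// error-level message was seen.
void logCallback(void* ptr, int level, const char* fmt, va_list vl) {
    if (level <= AV_LOG_ERROR)
        g_decodeError = true;
    av_log_default_callback(ptr, level, fmt, vl);
}

int main() {
    av_log_set_callback(logCallback);
    cv::VideoCapture cap("rtsp://camera/stream"); // placeholder URL
    cv::Mat frame;
    while (cap.read(frame)) {
        if (g_decodeError.exchange(false))
            continue; // a decode error was logged while reading: skip frame
        // ... process the (hopefully clean) frame ...
    }
    return 0;
}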
I recently solved the same problem, and I'll try to explain the steps I followed.
I updated to the most recent opencv_ffmpeg.dll (I renamed opencv_ffmpeg.dll to opencv_ffmpeg310.dll to use it with OpenCV 3.1, and renamed the same dll opencv_ffmpeg2412.dll to use it with OpenCV 2.4.12).
By doing that, basic frame capturing and display worked without problems. But I still had the issue when doing image processing or detection on the frames, which causes a delay between captures.
To solve that second problem I used a thread to grab frames continuously and update a global Mat for processing.
Here you can find my test code (it needs some improvements, like using a mutex to lock the Mat while it is updated).
I hope the information is useful (sorry for my poor English).
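In case the linked test code moves, here is a minimal sketch of the same idea with the mutex added (the stream URL is a placeholder):

#include <opencv2/opencv.hpp>
#include <atomic>
#include <mutex>
#include <thread>

std::mutex g_mtx;
cv::Mat g_latest;                    // most recent frame from the stream
std::atomic<bool> g_running{true};

// Drain the capture as fast as the stream delivers frames so the decoder
// never falls behind, however slow the processing loop is.
void grabLoop(cv::VideoCapture& cap) {
    cv::Mat frame;
    while (g_running && cap.read(frame)) {
        std::lock_guard<std::mutex> lock(g_mtx);
        frame.copyTo(g_latest);
    }
}

int main() {
    cv::VideoCapture cap("rtsp://camera/stream"); // placeholder URL
    std::thread grabber(grabLoop, std::ref(cap));
    while (g_running) {
        cv::Mat frame;
        {
            std::lock_guard<std::mutex> lock(g_mtx);
            if (!g_latest.empty()) g_latest.copyTo(frame);
        }
        if (!frame.empty()) {
            // ... heavy detection / image processing goes here ...
            cv::imshow("frame", frame);
        }
        if (cv::waitKey(1) == 27) g_running = false; // Esc to quit
    }
    grabber.join();
    return 0;
}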
I have the same problem. It seems to me that the problem comes from the fact that the source originating the stream is slower than the one decoding it. Probably your decoder sits in an endless loop reading and decoding frames, which can run faster than what your source can send you.
I don't know how to stop and wait until the buffer is full. I'm using a file: my camera source writes to a file, and I read frames from it in my decoding program. So far I haven't been able to sync them.
What sturkmen said is absolutely right. My OpenCV version is 2.4.13, and for some reason I cannot update to 3.1.0, although I know there would be no decoding errors like this with OpenCV 3.1.0. So first I copied the library opencv_ffmpeg310_64.dll to my executable's path E:\GITHUB\JpegRtspCamera\vs2013\JpegRtspCamera\x64\Release,
then I just deleted opencv_ffmpeg2413.dll and renamed opencv_ffmpeg310_64.dll to opencv_ffmpeg2413.dll. It works!!!

How does sending tinygrams cause network congestion?

I've read advice in many places to the effect that sending a lot of small packets will lead to network congestion. I've even experienced this with a recent multi-threaded TCP app I wrote. However, I don't know if I understand the exact mechanism by which this occurs.
My initial guess is that if the MTU of the physical transmission medium is fixed, and you send a bunch of small packets, then each packet may potentially take up an entire transmission frame on the physical medium.
For example, my understanding is that even though Ethernet supports variable-length frames, most equipment uses a maximum payload of 1500 bytes. At 100 Mbit/s, a 1500-byte frame "goes by" on the wire every 0.12 milliseconds. If I transmit a 1-byte message (plus TCP and IP headers) every 0.12 milliseconds, I will effectively saturate the 100 Mbit Ethernet connection while carrying only 8333 bytes of user data per second.
Is this a correct understanding of how tinygrams cause network congestion?
Do I have all my terminology correct?
In wired Ethernet at least, there is no "synchronous clock" that times the beginning of every frame. There is a minimum frame size, but it's more like 64 bytes, not 1500. There are also minimum gaps between frames, but that might only apply to shared-access networks (ATM and modern Ethernet are switched, not shared-access). It is the maximum size that is limited to 1500 bytes of payload on virtually all Ethernet equipment.
But the smaller your packets get, the higher the ratio of framing headers to data. Eventually you are spending 40-50 bytes of overhead for a single byte. And more for its acknowledgement.
If you could just hold for a moment and collect another byte to send in that packet, you have doubled your network efficiency. (this is the reason for Nagle's Algorithm)
There is a tradeoff on a channel with errors: the longer the frame you send, the more likely it is to experience an error and have to be retransmitted. Newer wireless standards load up the frame with forward error correction bits to avoid retransmissions.
The classic example of "tinygrams" is 10,000 users all sitting on a campus network, typing into their terminal sessions. Every keystroke produces a single packet (and an acknowledgement)... At a typing rate of 4 keystrokes per second, that's 80,000 packets per second just to move 40 kbytes per second. On a "classic" 10 Mbit shared-medium Ethernet, this is impossible to achieve, because you can only send about 32,000 minimum-sized packets in one second - excluding the effect of collisions:
96 bits inter-frame gap
+ 64 bits preamble
+ 112 bits Ethernet header
+ 32 bits trailer
-----------------------------
= 304 bits overhead per Ethernet frame
+ 8 bits of data (this doesn't even include IP or TCP headers!!!)
----------------------------
= 312 bits per tinygram
10,000,000 bits/s ÷ 312 bits/packet = 32,051 packets/second.
(That still ignores Ethernet's 46-byte minimum payload: pad the 1 keystroke byte plus 40 bytes of TCP/IP headers out to 46 bytes, and each frame costs 672 bits, dropping the limit to the classic 14,880 frames per second.)
Perhaps a better way to state this is that an Ethernet maxed out moving tinygrams can only move about 256 kbit/s of user data across a 10 Mbit/s medium, for an efficiency of 2.56%.
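If you want to play with the numbers, here is the same arithmetic as a small program (framing constants as above; TCP/IP headers still ignored):

#include <cstdio>

int main() {
    // Fixed per-frame cost on the wire, in bits:
    // 96 (inter-frame gap) + 64 (preamble) + 112 (header) + 32 (trailer).
    const double overheadBits = 96 + 64 + 112 + 32; // = 304
    const double linkBps = 10e6;                    // classic 10 Mbit/s
    // NB: real Ethernet pads payloads below 46 bytes up to 46, so the
    // 46-byte row is the effective floor for a 1-byte tinygram.
    for (int payload : {1, 46, 256, 1500}) {
        double frameBits  = overheadBits + payload * 8.0;
        double framesPerS = linkBps / frameBits;
        double efficiency = payload * 8.0 / frameBits * 100.0;
        std::printf("%4d-byte payload: %7.0f frames/s, %5.1f%% efficiency\n",
                    payload, framesPerS, efficiency);
    }
    return 0;
}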
A TCP packet transmitted over a link will have something like 40 bytes of header information. Therefore, if you break a transmission into 100 one-byte packets, each packet carries 40 bytes of header for 1 byte of payload, so about 98% of the transmitted bytes are overhead. If instead you send it as one 100-byte packet, the total transmitted data is only 140 bytes, so only about 28% is overhead. In both cases you've transmitted 100 bytes of payload over the network, but in one you used 140 bytes of network resources to accomplish it, and in the other you've used about 4,100 bytes. In addition, it takes more resources on the intermediate routers to correctly route 100 41-byte packets than one 140-byte packet. Routing packets with 1-byte payloads is pretty much the worst-case scenario for router performance, so routers will generally exhibit their worst-case behavior in this situation.
In addition, especially with TCP, as performance degrades due to small packets, the machines can try to do things to compensate (like retransmitting) that actually make things worse - hence the use of Nagle's algorithm to try to avoid this.
BDK has about half the answer (+1 for him). A large part of the problem is that every message comes with 40 bytes of overhead. It's actually a little worse than that, though.
Another issue is that there is actually a minimum frame size on Ethernet. (This is not the MTU. The MTU is a maximum before fragmentation starts - a different issue entirely.) The minimum payload is pretty small (46 bytes, enough to swallow your 20-byte TCP header and 20-byte IP header), but if you don't use that much, the frame is padded out to that size and sent anyway.
Another issue is protocol overhead. Each packet sent by TCP causes an ACK packet to be sent back by the recipient as part of the protocol.
The result is that if you do something silly, like sending one TCP packet every time the user hits a key, you can easily end up with a tremendous amount of wasted overhead data floating around.
