Multithreading in HM reference software - video-encoding

Encoding a UHD sequence with the HEVC HM reference software takes days on CPUs, even on monster machines. I want to know whether it's possible to increase the number of threads (even if it decreases encoding quality) to speed up the process, and if so, how (I'd like at least a 4x speed-up).
Is this possible by increasing the number of tiles (by default there is only one tile per picture), or do we have to change the source code? And if so, where exactly?

It seems the answer to increasing encoding speed was not the number of tiles but WPP (wavefront parallel processing).
HM lets you increase the number of tiles, on the condition that the minimum tile width is 4 CTUs (4*64 pel) and the minimum tile height is 1 CTU (64 pel), so you can't just choose any number.
When you activate WPP, you can process up to 17 CTU rows at the same time, but you cannot use WPP and tiles at the same time.
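For reference, these tools are switched on from the encoder configuration file; here is a minimal sketch (option names as in recent HM encoder.cfg files; check them against your HM version):

# encoder.cfg excerpt: parallelism-related options (enable either WPP or tiles, not both)
WaveFrontSynchro     : 1      # 1 = enable WPP, one substream per CTU row
NumTileColumnsMinus1 : 0      # e.g. 3 = 4 tile columns (min tile width 4 CTUs)
NumTileRowsMinus1    : 0      # e.g. 1 = 2 tile rows (min tile height 1 CTU)
TileUniformSpacing   : 1      # spread the tiles evenly over the picture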
Testing this with the BasketballDrive HD sequence at QP=37:
           T (sec)       Rate (kbps)   PSNR (dB)
1 tile     171013.381    1761.7472     34.5743
4 tiles    166401.603    1822.1880     34.5439    (saves about 3 hours)
WPP        166187.201    1785.4048     34.5483    (~same)
It could save more with a UHD sequence, but it's not enough for me: 3 h is nothing for JEM, and WPP has been removed from the new VTM (FVC).

Related

Turtle - What precisely is the turtle's speed? X actions/second?

A student asked me this and I can't find the answer. You can set the turtle's speed to 0-10. But what does that actually mean? x actions / second?
We are on Code.org, which translates its code in the lessons into JavaScript, but this command is found in the Play Lab, which provides no translation. I am assuming this is analogous to JS turtle, but if you know the answer for Python Turtle, etc., I'd love to hear it.
What precisely is the turtle's speed? X actions/second? ... if you
know the answer for Python Turtle, etc, I'd love to hear it.
In standard Python, the turtle's speed() method indirectly controls the speed of the turtle by dividing up the turtle's motion into smaller or larger steps, where each step has a defined delay.
By default, if we don't mess with setworldcoordinates(), or change the default screen update delay using delay(), or tracer(), then the motion of a turtle is broken up into a number of individual steps determined by:
steps = int(distance / (3 * 1.1**speed * speed))
At the default speed (3 or 'slow'), a 100px line would be drawn in 8 steps. At the slowest speed (1 or 'slowest'), 30 steps. At a fast speed (10 or 'fast'), 1 step. (Oddly, the default speed isn't the 'normal' (6) speed!) Each step incurs a screen update delay of 10ms by default.
Using a speed of 0 ('fastest'), or turning off tracer(), avoids this process altogether and just draws lines in one step.
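As a quick sanity check on that formula, here is a small sketch (plain Python using the expression above, not the turtle module itself) that reproduces those step counts:

def steps_for(distance, speed):
    # steps used to animate a straight move, per the formula quoted above
    # (valid for speed 1..10; speed 0 / tracer(0) skips the animation entirely)
    return int(distance / (3 * 1.1**speed * speed))

for speed in (1, 3, 6, 10):
    print(speed, steps_for(100, speed))   # -> 30, 8, 3 and 1 steps for a 100px line

Multiplying the step count by the ~10 ms default update delay gives a rough lower bound on how long each move animates.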
There's a similar logic for how the speed() setting affects the number of steps the turtle takes to rotate when you do right() or left().
https://docs.python.org/3/library/turtle.html#turtle.speed
From the docs you can see that it is just an arbitrary value.

VTCompressionSession Bitrate/Datarate overshooting

I have been working on an H264 hardware accelerated encoder implementation using VideoToolbox's VTCompressionSession for a while now, and a consistent problem has been the unreliable bitrate coming out of it. I have read many forum posts and looked through existing code for this, and tried to follow suit, but the bitrate out of my encoder is almost always somewhere between 5% and 50% off what it is set at, and on occasion I've seen some huge errors, like even 400% overshoot, where even one frame will be twice the size of the given average bitrate.
My session is set up as follows:
kVTCompressionPropertyKey_AverageBitRate = desired bitrate
kVTCompressionPropertyKey_DataRateLimits = [desired bitrate / 8, 1]; accounting for bits vs bytes
kVTCompressionPropertyKey_ExpectedFrameRate = framerate (30, 15, 5, or 1 fps)
kVTCompressionPropertyKey_MaxKeyFrameInterval = 1500
kVTCompressionPropertyKey_MaxKeyFrameIntervalDuration = 1500 / framerate
kVTCompressionPropertyKey_AllowFrameReordering = NO
kVTCompressionPropertyKey_ProfileLevel = kVTProfileLevel_H264_Main_AutoLevel
kVTCompressionPropertyKey_RealTime = YES
kVTCompressionPropertyKey_H264EntropyMode = kVTH264EntropyMode_CABAC
kVTCompressionPropertyKey_BaseLayerFrameRate = framerate / 2
And I adjust the average bitrate and datarate values throughout the session to try and compensate for the volatility (if it's too high, I reduce them a bit, if too low, I increase them, with restrictions on how high and low to go).
I create the session and then apply the above configuration as a single dictionary using VTSessionSetProperties and feed frames into it like this:
VTCompressionSessionEncodeFrame(compressionSessionRef,
                                static_cast<CVImageBufferRef>(pixelBuffer),
                                CMTimeMake(capturetime, 1000),
                                kCMTimeInvalid,
                                frameProperties,
                                frameDetailsStruct,
                                &encodeInfoFlags);
So I'm supplying timing information as the API says to do.
Then I add up the size of the output for each frame and divide over a periodic time period, to determine the outgoing bitrate and error from desired. This is where I see the significant volatility.
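For what it's worth, here is a small sketch of that windowed measurement (in Python just to illustrate the logic; the names are mine, not the actual encoder code):

from collections import deque
import time

class BitrateMeter:
    """Measures outgoing bitrate over a sliding time window of encoded frames."""
    def __init__(self, window_seconds=2.0):
        self.window = window_seconds
        self.samples = deque()                     # (timestamp, bytes) per frame

    def add_frame(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, nbytes))
        # drop frames that have fallen out of the measurement window
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def bitrate_bps(self):
        if len(self.samples) < 2:
            return 0.0
        span = self.samples[-1][0] - self.samples[0][0]
        total_bits = 8 * sum(n for _, n in self.samples)
        return total_bits / span if span > 0 else 0.0

meter = BitrateMeter(window_seconds=2.0)
# call meter.add_frame(len(encoded_frame)) for each output sample buffer,
# then compare meter.bitrate_bps() against the target to get the error.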
I'm looking for any help in getting the bitrate under control, as I'm not sure what to do at this point. Thank you!
I think you can check the frameTimestamp set in VTCompressionSessionEncodeFrame; it seems to affect the bitrate. If you change the frame rate, change the frameTimestamp accordingly.

Optimal layout of 2D array with least memory access time

Let us say I have a 2D array that I can read from a file
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
I am looking to store them as a 1D array arr[16].
I am aware of row-wise and column-wise storage.
Either choice breaks up the 2D neighbourhoods, though. Say I would like to convolve this with a 2x2 filter; then at conv(1,1) I would be accessing the elements 1, 2, 5, 6.
Instead, can I optimize the storage pattern so that the elements 1, 2, 5, 6 are stored next to each other rather than far apart?
That would reduce the memory latency issue.
It depends on your processor, but supposing you have a typical Intel cache line size of 64 bytes, then picking square subregions that are each 64 bytes in size feels like a smart move.
If your individual elements are a byte each, then 8x8 subtiles make sense. So, e.g.
/* Assumes 1-byte elements and a row width of WIDTH elements, both multiples of 8. */
#define index(x, y) (((x) & 7) | (((y) & 7) << 3) | \
    ((((x) >> 3) + ((y) >> 3) * (WIDTH >> 3)) << 6))
So in each full tile:
in 49 of every 64 cases all data is going to be within the same cache line;
in a further 14 it's going to lie across two cache lines; and
in one case in 64 it is going to need four.
So that's an average of (49*1 + 14*2 + 1*4)/64 = 1.265625 cache lines touched per output pixel, versus 2.03125 in the naive row-major case (the two rows of the window always sit in different cache lines, and each row additionally straddles a line boundary once in every 64 positions, giving 2 + 2/64).
I found what I was looking for: it is called Morton ordering of an array, which has been shown to reduce memory access time. Another method would be to use a Hilbert curve, which is shown to be even more effective than Morton ordering.
I am attaching a link to an article explaining this:
https://insidehpc.com/2015/10/morton-ordering/
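For illustration, a minimal Morton (Z-order) index in Python, assuming coordinates of up to 16 bits (just the standard bit-interleaving trick, not code from the linked article):

def morton_index(x, y):
    """Interleave the bits of x and y (Z-order / Morton code); 16-bit coords assumed."""
    code = 0
    for bit in range(16):
        code |= ((x >> bit) & 1) << (2 * bit)       # x bits go to even positions
        code |= ((y >> bit) & 1) << (2 * bit + 1)   # y bits go to odd positions
    return code

# neighbouring 2x2 blocks map to nearby indices:
# morton_index(0, 0) == 0, morton_index(1, 0) == 1,
# morton_index(0, 1) == 2, morton_index(1, 1) == 3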

FSK demodulation with GNU Radio

I'm trying to demodulate a signal using GNU Radio Companion. The signal is FSK (Frequency-shift keying), with mark and space frequencies at 1200 and 2200 Hz, respectively.
The data in the signal is text generated by a device called GeoStamp Audio. The device generates audio from GPS data fed into it in real time, and it can also decode that audio. I have the decoded text version of the audio for reference.
I have set up a flow graph in GNU Radio (see below), and it runs without error, but with all the variations I've tried, I still can't get the data.
The output of the flow graph should be binary (1s and 0s) that I can later convert to normal text, right?
Is it correct to feed in a wav audio file the way I am?
How can I recover the data from the demodulated signal -- am I missing something in my flow graph?
This is a FFT plot of the wav audio file before demodulation:
This is the result of the scope sink after demodulation (maybe looks promising?):
UPDATE (August 2, 2016): I'm still working on this problem (occasionally), and unfortunately still cannot retrieve the data. The result is a promising-looking string of 1's and 0's, but nothing intelligible.
If anyone has suggestions for figuring out the settings on the Polyphase Clock Sync or Clock Recovery MM blocks, or the gain on the Quad Demod block, I would greatly appreciate it.
Here is one version of an updated flow graph based on Marcus's answer (also trying other versions with polyphase clock recovery):
However, I'm still unable to recover data that makes any sense. The result is a long string of 1's and 0's, but not the right ones. I've tried tweaking nearly all the settings in all the blocks. I thought maybe the clock recovery was off, but I've tried a wide range of values with no improvement.
So, at first sight, my approach here would look something like:
What happens here is that we take the input, shift it in frequency domain so that mark and space are at +-500 Hz, and then use quadrature demod.
"Logically", we can then just make a "sign decision". I'll share the configuration of the Xlating FIR here:
Notice that the signal is first shifted so that the center frequency (middle between 2200 and 1200 Hz) ends up at 0Hz, and then filtered by a low pass (gain = 1.0, Stopband starts at 1 kHz, Passband ends at 1 kHz - 400 Hz = 600 Hz). At this point, the actual bandwidth that's still present in the signal is much lower than the sample rate, so you might also just downsample without losses (set decimation to something higher, e.g. 16), but for the sake of analysis, we won't do that.
The time sink should now show better values. Have a look at the edges; they are probably not extremely steep. For clock sync I'd hence recommend trying the polyphase clock recovery instead of Mueller and Müller; choosing just about any "somewhat round" pulse shape could work.
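As a rough sketch of that chain using GNU Radio's Python API (file name, sample rate and baud rate are my own guesses at a starting point; it uses the simpler Mueller and Müller clock recovery, so swap in the polyphase block as suggested above if the edges are too soft):

#!/usr/bin/env python
# Sketch of the xlating-filter + quadrature-demod chain described above.
from gnuradio import gr, blocks, analog, digital, filter
from gnuradio.filter import firdes

class fsk_demod(gr.top_block):
    def __init__(self, samp_rate=44100, center=1700, baud=1200):
        gr.top_block.__init__(self)
        src = blocks.wavfile_source("input.wav", False)
        to_c = blocks.float_to_complex()
        # shift 1700 Hz (midpoint of 1200/2200) down to 0 Hz, low-pass to ~600 Hz
        taps = firdes.low_pass(1.0, samp_rate, 600, 400)
        xlate = filter.freq_xlating_fir_filter_ccf(1, taps, center, samp_rate)
        # mark/space now sit at -500/+500 Hz; quadrature demod turns them into +/- levels
        demod = analog.quadrature_demod_cf(samp_rate / (2 * 3.14159 * 500))
        sync = digital.clock_recovery_mm_ff(samp_rate / float(baud),
                                            0.25 * 0.175**2, 0.5, 0.175, 0.005)
        slicer = digital.binary_slicer_fb()    # sign decision -> bits
        sink = blocks.file_sink(gr.sizeof_char, "bits.bin")
        self.connect(src, to_c, xlate, demod, sync, slicer, sink)

if __name__ == "__main__":
    fsk_demod().run()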
For fun and giggles, I clicked together a quick demo demod (GRC here), which shows:

Get byte range positions for a specified time range of an mp3 file

I'd like to be able to determine at what byte positions a segment of an NSData compressed mp3 file begins and ends.
For example, if I am playing an mp3 file using the AVPlayer (or any player) that is 1 minute long and 1000000 bytes, I'd like to know approximately at how many bytes in the file the 30 second mark happens, then how many bytes the 40 second mark happens.
Note that due to the mp3 file being compressed I can't just divide the bytes in half to determine the 30 second mark.
If this can't be done with Swift/Objective-C, do you know if this determination can be done with any programming language? Thanks!
It turns out I had a different problem to solve. I was trying to approximate the byte position of a specific time, say, the 4:29 point of a 32:45 long podcast episode, within a few seconds of accuracy.
I used a function along these lines to calculate the approximate byte position:
startTimeBytesPosition = (startTimeInSeconds / episodeDuration) * episodeFileSize
That function worked like a charm for some episodes, but for others the resulting start time would be off by about 30-40 seconds.
It turns out this inaccuracy was happening because some mp3s contain metadata at the very beginning of the file, and image files stored within that metadata can be 500+ KB, so my byte-position calculation for any episode with an embedded image would be off by about 500 KB (which translated into about 30-40 seconds in this case).
To resolve this, I first determine the size in bytes of the metadata in the mp3 file, and then use that to offset the approximation:
startTimeBytesPosition = metadataBytesOffset + (startTimeInSeconds / episodeDuration) * episodeFileSize
So far this code seems to be doing a good job of approximating time based on byte position accurately within a few seconds.
I should note that this assumes that the metadata for the image will always appear at the beginning of the mp3 file, and I don't know if that will always be the case.
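For illustration, here is a minimal sketch in Python (the function names are mine, not from the original answer; it assumes the metadata is a leading ID3v2 tag and scales over the bytes after the tag, a slight variant of the formula above):

def id3v2_size(path):
    """Return the size in bytes of a leading ID3v2 tag (0 if there is none)."""
    with open(path, "rb") as f:
        header = f.read(10)
    if len(header) < 10 or header[:3] != b"ID3":
        return 0
    # bytes 6-9 hold the tag size as a 28-bit "synchsafe" integer (7 bits per byte)
    size = 0
    for b in header[6:10]:
        size = (size << 7) | (b & 0x7F)
    return size + 10                      # plus the 10-byte tag header itself

def byte_position(path, start_time, duration, file_size):
    """Approximate byte offset of start_time, skipping the leading metadata."""
    offset = id3v2_size(path)
    audio_bytes = file_size - offset
    return offset + int((start_time / duration) * audio_bytes)

# e.g. byte_position("episode.mp3", 269, 1965, 1_000_000) for the 4:29 point
# of a 32:45 episode; assumes a roughly constant bitrate after the tag.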
