The default iPhone audio recorder has a sampling rate of 44.1 kHz and a bit rate of 64 kbps. When exporting an 8-minute audio recording from the recorder, I can see that the exported file comes out to a little under 4 MB. When I try to export an audio file from my custom audio recorder, my 3-minute recording fails to export because it's 22 MB. How are they getting their file size so low with such a high sampling rate? Also, I see that the exported audio file is .m4a, but in iTunes the file "kind" is AAC. Shouldn't that make the audio file .aac?
I just read this: "What is the difference between M4A and AAC Audio Files?" So an m4a file can contain an AAC audio track?
Kind of confused here.
AAC is one of many audio codecs; the codec is the flavor of compression. Once encoded, the audio becomes binary data which needs to get wrapped inside a container format for transport over the wire or into a file; one such container is m4a.
Doing the math on 64 kbps for an 8-minute clip: 64 kbit/s * 60 * 8 = 30,720 kilobits, which at 8 bits per byte is about 3.84 MB, the figure you quote. That 64 kbps is the compressed AAC bit rate. The uncompressed bit rate is driven by two underlying factors: sample rate (44.1 kHz) and bit depth, the number of bits of resolution per sample. Standard CD-quality bit depth is 16 bits, whereas telephone audio can be as low as 8 bits per sample, and 2-channel stereo doubles the required bits per second compared to single-channel mono. Uncompressed 44.1 kHz / 16-bit mono PCM works out to about 706 kbps, so AAC at 64 kbps is roughly an 11-to-1 compression.
Under the covers, all audio processing/recording/playback uses PCM, which is raw audio prior to compression.
Your figure of 22 MB for 3 minutes: 22,000,000 bytes / 180 s ≈ 122 kB/s, i.e. roughly 980 kbps, which is in the range of uncompressed PCM (about 706 kbps for 16-bit mono at 44.1 kHz, 1411 kbps for stereo). So I would say your 22 MB is uncompressed, or close to it.
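To put numbers on that, a quick back-of-the-envelope check (a minimal sketch in Python, using the durations and rates quoted above):
# rough file-size estimates from bit rate and duration (illustrative only)
def size_mb(bitrate_bps, seconds):
    return bitrate_bps * seconds / 8 / 1_000_000   # bits -> bytes -> megabytes

print(size_mb(64_000, 8 * 60))              # ~3.84 MB: 64 kbps AAC for 8 minutes
print(size_mb(44_100 * 16, 3 * 60))         # ~15.9 MB: uncompressed 16-bit mono PCM, 3 minutes
print(size_mb(44_100 * 16 * 2, 3 * 60))     # ~31.8 MB: uncompressed 16-bit stereo PCM, 3 minutes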
hope this helps
I recently created a Video Swin Transformer model that takes a (batch_size, 3, 32, 224, 224) tensor for video, i.e. [batch_size, channels, temporal_dim, height, width], and outputs logits. The goal is to have the model predict on a live stream from a camera. Is there any way to capture the fixed sequence of 32 frames repeatedly and have the model predict on the live stream? If prediction time is longer than 32 frames, can I stretch the frames out over a longer time period, like a minute? Thanks.
You can try to use my library ffmpegio, which suits your need:
To install:
pip install ffmpegio
To get blocks of 32 frames from your input URL:
import ffmpegio
url = 'input stream url'
temporal_dim = 32
height = 224
width = 224
size = [width,height]
pix_fmt = 'rgb24'
with ffmpegio.open(url, 'rv', blocksize=temporal_dim, s=size, pix_fmt=pix_fmt) as stream:
    for frames in stream:  # frames is a [time, height, width, ch] ndarray
        vswin_in = frames.transpose(3, 0, 1, 2)  # reorder to [ch, time, height, width] for your model
You can specify any other FFmpeg options as you wish (like a scaling/cropping filter to make the input frames 224 px square, or input stream options).
Caveat: I haven't tested live-stream buffering extensively. If you encounter any issues, please post an issue on GitHub.
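If it helps, here is a minimal sketch of feeding each captured block to a model; it assumes PyTorch, uses an Identity module as a stand-in for your trained Video Swin Transformer, and the normalization is a placeholder that should match your training pipeline:
import ffmpegio
import torch

url = 'input stream url'
temporal_dim, height, width = 32, 224, 224

model = torch.nn.Identity()   # placeholder: swap in your trained Video Swin Transformer
model.eval()

with ffmpegio.open(url, 'rv', blocksize=temporal_dim, s=[width, height], pix_fmt='rgb24') as stream:
    for frames in stream:                       # uint8 ndarray, [time, height, width, ch]
        clip = torch.from_numpy(frames).float() / 255.0   # placeholder normalization
        clip = clip.permute(3, 0, 1, 2)         # -> [ch, time, height, width]
        clip = clip.unsqueeze(0)                # -> [1, ch, time, height, width]
        with torch.no_grad():
            logits = model(clip)
If inference takes longer than 32 frames of wall-clock time, the simplest option is to predict on the most recent block and skip the blocks that arrived in the meantime; to spread 32 frames over a longer window (e.g. a minute), you would reduce the input frame rate on the FFmpeg side before blocking (I haven't verified the exact ffmpegio option for that).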
I would like to know if it's possible to resample an already written AVAudioFile.
All the references that I found don't address this particular problem, since:
They propose resampling while the user is recording an AVAudioFile, i.e. while installTap is running. In this approach, the AVAudioConverter works on each buffer chunk given by the inputNode and appends it to the AVAudioFile. [1] [2]
The point is that I would like to resample my audio file regardless of the recording process.
The harder approach would be to upsample the signal by a factor L and apply decimation by a factor M, using vDSP:
Audio on Compact Disc has a sampling rate of 44.1 kHz; to transfer it to a digital medium that uses 48 kHz, method 1 above can be used with L = 160, M = 147 (since 48000/44100 = 160/147). For the reverse conversion, the values of L and M are swapped. Per above, in both cases, the low-pass filter should be set to 22.05 kHz. [3]
The last one obviously seems like too hard-coded a way to solve it. I hope there's a way to resample the file with AVAudioConverter, but it lacks documentation :(
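For what it's worth, the L and M factors in the quoted method are just the reduced ratio of the target and source sample rates; a quick check (a Python sketch, for illustration only):
from fractions import Fraction

ratio = Fraction(48_000, 44_100)        # target rate over source rate, reduced automatically
L, M = ratio.numerator, ratio.denominator
print(L, M)                             # 160 147, matching the quote above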
There is a camera that shoots at 20 frames per second; each frame is 4000x3000 pixels.
The frames are sent to software that contains OpenCV. OpenCV resizes the frames to 1920x1080, then they must be sent to FFmpeg to be encoded to H.264 or H.265 using NVIDIA NVENC.
The encoded video is then streamed over HTTP to a maximum of 10 devices.
The infrastructure is very good (10 Gb LAN) with state-of-the-art switches, routers, etc.
Right now, I can get 90 FPS when encoding the images from an NVMe SSD, so the required encoding speed is achievable.
The question is: how do I get the images from OpenCV to FFmpeg?
The stream will be watched in a web app made with the MERN stack (in case that is relevant).
For cv::Mat you have cv::VideoWriter. If you wish to use FFmpeg directly, then assuming the Mat is continuous, which can be enforced:
if (!mat.isContinuous())
{
    mat = mat.clone();
}
you can feed mat.data into sws_scale as the single source plane:
const int srcStride[1] = { static_cast<int>(mat.step) };   // bytes per row of the BGR source
sws_scale(videoSampler, &mat.data, srcStride, 0, mat.rows, videoFrame->data, videoFrame->linesize);
or directly into AVFrame
For cv::cuda::GpuMat there is no VideoWriter implementation, but you can use the NVIDIA Video Codec SDK and similarly feed cv::cuda::GpuMat::data into NvEncoderCuda; just make sure your GpuMat has 4 channels (BGRA):
NV_ENC_BUFFER_FORMAT eFormat = NV_ENC_BUFFER_FORMAT_ABGR;
std::unique_ptr<NvEncoderCuda> pEnc(new NvEncoderCuda(cuContext, nWidth, nHeight, eFormat));
...
// Convert the 3-channel BGR GpuMat to 4 channels into a separate destination,
// since the channel count (and hence the allocation) changes.
cv::cuda::GpuMat srcBGRA;
cv::cuda::cvtColor(srcIn, srcBGRA, cv::ColorConversionCodes::COLOR_BGR2BGRA);
NvEncoderCuda::CopyToDeviceFrame(cuContext, srcBGRA.data, (uint32_t)srcBGRA.step,
                                 (CUdeviceptr)encoderInputFrame->inputPtr,
                                 (int)encoderInputFrame->pitch,
                                 pEnc->GetEncodeWidth(),
                                 pEnc->GetEncodeHeight(),
                                 CU_MEMORYTYPE_DEVICE,   // GpuMat data lives in device memory
                                 encoderInputFrame->bufferFormat,
                                 encoderInputFrame->chromaOffsets,
                                 encoderInputFrame->numChromaPlanes);
Here's my complete sample of using GpuMat with NVIDIA Video Codec SDK
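For completeness, and only as an illustrative sketch of a CPU-side alternative, frames can also be piped from OpenCV into an ffmpeg process over stdin; the capture source, output target, and encoder below are placeholders (h264_nvenc requires an NVENC-enabled ffmpeg build):
import subprocess
import cv2

width, height, fps = 1920, 1080, 20
cmd = [
    "ffmpeg", "-y",
    "-f", "rawvideo", "-pix_fmt", "bgr24",          # raw BGR frames, as OpenCV stores them
    "-s", f"{width}x{height}", "-r", str(fps),
    "-i", "-",                                      # frames arrive on stdin
    "-c:v", "h264_nvenc",                           # NVENC H.264 encoder
    "-f", "mpegts", "out.ts",                       # placeholder output; could be an HTTP target
]
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)

cap = cv2.VideoCapture(0)                           # placeholder source
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (width, height))
    proc.stdin.write(frame.tobytes())

proc.stdin.close()
proc.wait()
cap.release()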
For FDK AAC, I want to access the spectral data before and after Huffman encoding/decoding, in both the encoder and the decoder.
For accessing the spectral data before Huffman encoding, I am using the pSpectralCoefficient pointer and dumping 1024 samples (on the decoder side), and using qcOutChannel[ch]->quantSpec and dumping 1024 samples (on the encoder side). Is this correct?
Secondly, how do I access the Huffman-encoded signal in the encoder and decoder? If someone can tell me the location in the code, the name of the pointer to use, and the length of this data, I will be extremely thankful.
Thirdly, what is the frame size in the frequency domain (before Huffman encoding)?
I am dumping 1024 samples of *pSpectralCoefficient. Is that correct?
Is it possible that some frames are 1024 bins long and others are a set of 8 sub-blocks of 128 frequency bins each? If so, is there any flag that can give me this information?
Thank you for your time. I would really appreciate any help with this as soon as possible.
Regards,
Akshay
To pull out that specific data from the bitstream you will need to step through the decoder and find the desired pieces of the stream. To do that you have to have the AAC bitstream specification. The current AAC specification is:
ISO/IEC 14496-3:2009 "Information technology -- Coding of audio-visual objects -- Part 3: Audio"
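Regarding the third question above: an AAC frame carries 1024 spectral coefficients per channel, either as one long window or as eight short windows of 128 bins each, and the window sequence field in the bitstream's ics_info signals which case applies. A trivial illustration of the two layouts (a Python sketch, not FDK-specific):
import numpy as np

coeffs = np.arange(1024)                 # one channel's spectral coefficients for one frame
long_window = coeffs                     # long window: a single block of 1024 bins
short_windows = coeffs.reshape(8, 128)   # eight-short sequence: 8 sub-blocks of 128 bins
print(long_window.shape, short_windows.shape)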
I read some tutorials about MPEG transport streams, but there are 2 fundamental issues I do not understand:
1. The MPEG-TS muxer receives PES packets from audio and video and outputs MPEG-TS packets. How does it do this muxing? Is it that whenever a packet from any program is waiting on its input, the muxer wakes up and slices that PES into MPEG-TS packets?
2. Can the user select which bit rate the MPEG-TS muxer will output? What is the connection between the rate of the encoding and the rate of the MPEG-TS?
Thank you very much,
Ran
MPEG2-TS muxing is a complex art-form. Suggested reading: MPEG2-TS specification, SPTS/MPTS, VBR vs. CBR, Hypothetical reference decoder and buffers (EB, MB, TB), jitter and drift.
A very short answer to your questions can be summarized like this:
For each encoder, on the other end of the line there is a decoder which wants to display a video frame (or audio frame) every frame interval. This frame needs to be decoded before its presentation time, and if it uses other frames as reference, those also need to be decoded prior to presentation.
When multiplexing, the data must arrive sufficiently ahead of presentation. A video frame to be presented at time n must be available at the decoder at time n - x, where x is a measure of time depending on the buffer rates of the decoder (see MB, TB, EB). If the TS bit rate is too low, "underflow" occurs and the video is not in the decoder on time. If the TS bit rate is too large, "overflow" occurs and the buffers have to drop packets, which will also create visual artifacts.
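To make the underflow/overflow point concrete, here is a toy model (a Python sketch, illustration only; real TS muxing uses the T-STD buffer model with PCR timing, which this glosses over):
# toy decoder buffer: filled at a constant TS bit rate, drained by one coded frame per interval
def simulate(ts_bitrate_bps, frame_sizes_bits, fps, buffer_bits):
    fill_per_interval = ts_bitrate_bps / fps
    level = 0.0
    for i, frame_bits in enumerate(frame_sizes_bits):
        level += fill_per_interval
        if level > buffer_bits:                       # data arrives faster than the buffer can hold
            print(f"frame {i}: overflow by {level - buffer_bits:.0f} bits")
            level = buffer_bits
        if level < frame_bits:                        # frame not fully delivered by its deadline
            print(f"frame {i}: underflow ({level:.0f} of {frame_bits} bits available)")
        else:
            level -= frame_bits                       # frame removed at decode time

# e.g. 25 fps video averaging 4 Mb/s carried in a 3.5 Mb/s TS underflows almost immediately:
simulate(3_500_000, [160_000] * 50, 25, 1_800_000)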