This article explains how to do image decoding and preprocessing on the server side with DALI while using triton-inference-server.
I am trying to find something similar for decoding video from an H.264-encoded byte array on the server side, before the input "NTHWC" array is passed to a video recognition model such as those in mmaction2 or swin-transformer, using an ensemble model.
All I can find is how to load video from files, but nothing on loading videos from external_source.
Also, as a workaround, I guess I can do this with python-backend by writing the encoded video bytes to a file and preprocessing the video there, but that will not inherently support batch processing: I would have to handle each batch either sequentially or by starting multiprocess pools, which I guess is highly suboptimal.
Any help is highly appreciated.
Related
I am trying to build a system that works exactly like YouTube Content ID: it should generate fingerprints of a video and search for those fingerprints in a database. I want to know what fingerprinting algorithm or method YouTube Content ID uses to generate and compare fingerprints, and how it searches for fingerprints in the database.
I don't think the exact algorithm is known. You could use scene detection and chunking to create bounded-size chunks of video and audio. Then, you could use locality sensitive hashing techniques to index these chunks, so that similar chunks receive identical hashes. However, this is not straightforward and subject to active research.
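To make the locality sensitive hashing idea concrete, here is a minimal sketch using random-hyperplane signatures. It is not what Content ID does (that is not public); it assumes each chunk has already been reduced to a fixed-length feature vector, and the names `make_hyperplanes`, `lsh_signature` and `hamming` are made up for illustration. Similar vectors tend to get identical or nearly identical signatures, so signatures can be used as bucket keys in an index.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, n_bits=16, seed=1)
    chunk_a = [0.5, -1.0, 2.0, 0.25]   # hypothetical chunk features
    chunk_b = [0.55, -0.9, 2.1, 0.2]   # a slightly perturbed copy
    sig_a = lsh_signature(chunk_a, planes)
    sig_b = lsh_signature(chunk_b, planes)
    print(hamming(sig_a, sig_b))       # small distance suggests similar chunks
```

In a real index you would typically use several such signatures per chunk and treat a match in any bucket as a candidate to verify with a more expensive comparison.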
There are certain machine learning algorithms in use that take video files as input. If I have to pull all the videos from YouTube that are associated with a certain tag and provide them as input to such an algorithm, what should my input format be?
There is no format in which you can pass a video to a machine learning algorithm, since it won't understand the contents of the video.
You need to preprocess the video first, which might depend on how you have to use it. In general you can do something like converting each frame of the video to CSV (same as preprocessing an image), which you can pass to your machine learning algorithm. If you want to process your frames sequentially, you may want to use a Recurrent Neural Network. Also if the video has some audio, then just find its audio time series, and combine each part of the time series with its corresponding video frame.
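As a rough sketch of that preprocessing, assuming the frames are already decoded into nested lists of pixel values and the audio is a flat list of samples (the function names and the integer `fps` are assumptions for illustration), each frame can be flattened into one row and paired with the audio samples that fall in its time window:

```python
def flatten_frame(frame):
    """Flatten an H x W x C nested list into a single row of numbers."""
    return [value for row in frame for pixel in row for value in pixel]


def audio_chunk_for_frame(audio, sample_rate, fps, frame_index):
    """Return the audio samples that play while the given frame is shown."""
    start = frame_index * sample_rate // fps
    end = (frame_index + 1) * sample_rate // fps
    return audio[start:end]


def build_rows(frames, audio, sample_rate, fps):
    """One row per frame: flattened pixels followed by the matching audio."""
    return [
        flatten_frame(frame)
        + list(audio_chunk_for_frame(audio, sample_rate, fps, index))
        for index, frame in enumerate(frames)
    ]


if __name__ == "__main__":
    # Two tiny 1x2 RGB frames and 8 audio samples at 8 Hz, 2 frames per second.
    frames = [
        [[[255, 0, 0], [0, 255, 0]]],
        [[[0, 0, 255], [10, 20, 30]]],
    ]
    audio = [0, 1, 2, 3, 4, 5, 6, 7]
    for row in build_rows(frames, audio, sample_rate=8, fps=2):
        print(",".join(str(value) for value in row))
```

If the sample rate is not a multiple of the frame rate, rows can differ in length by one sample, so you may need to pad or trim before feeding them to a model.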
I’m working on a dataset made up of AVI videos. I want to apply Gist to its frames and use the Gist features of each frame to train my classifier to recognize actions. If I convert these videos to MP4 format and then compute Gist, what will the result be?
MP4 (MPEG-4 Part 14) is just a container; it says nearly nothing about how the actual data is compressed. In short: if you use lossy compression, the Gist descriptors will change; if you use lossless compression, they will stay the same. Since the most common default video compressors are lossy, your Gist descriptors will most probably change.
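A toy illustration of the lossy versus lossless point, with loud caveats: `toy_descriptor` is a stand-in (row-wise gradient sums), not Gist; `zlib` stands in for a lossless codec; and coarse quantization stands in for a lossy one. None of this is a real video codec, it only shows why the descriptor survives one round trip and not the other.

```python
import zlib


def toy_descriptor(pixels, width):
    """Stand-in descriptor: sum of absolute horizontal differences per row."""
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    return [
        sum(abs(row[j + 1] - row[j]) for j in range(len(row) - 1))
        for row in rows
    ]


def lossless_roundtrip(pixels):
    """Compress and decompress without losing information."""
    return list(zlib.decompress(zlib.compress(bytes(pixels))))


def lossy_roundtrip(pixels, step=16):
    """Crude lossy stand-in: quantize every pixel down to a multiple of step."""
    return [(p // step) * step for p in pixels]


if __name__ == "__main__":
    frame = [10, 20, 35, 50, 200, 190, 180, 100]  # a 2x4 grayscale frame
    print(toy_descriptor(frame, 4))
    print(toy_descriptor(lossless_roundtrip(frame), 4))  # unchanged
    print(toy_descriptor(lossy_roundtrip(frame), 4))     # changed
```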
I am trying to capture online streamed content and process it image by image. I have APIs written for images in OpenCV in Python 2.7, and I am trying to extend them and explore different possibilities (and of course choose the best method) for capturing and processing these online video streams. Can this be done in OpenCV? If not (or if something else is simpler), is there any other alternative (a Python alternative is highly preferred)?
Thanks
Ajay
I am currently working on a webcam streaming server project that needs to adjust the stream's bitrate dynamically according to the client's settings (screen size, processing power...) or the network bandwidth. The encoder is ffmpeg, since it's free and open source, and the codec is MPEG-4 Part 2. We use live555 for the server part.
How can I encode MBR MPEG-4 videos using ffmpeg to achieve this?
The multi-bitrate video you are describing is called "Scalable Video Coding" (SVC). See this wiki link for a basic understanding.
Basically, in a scalable video codec, the base layer stream is completely decodable by itself; additional information is represented in the form of one or more enhancement streams. There are a couple of techniques for doing this, including lower/higher resolution, framerate, and changes in quantization. The following papers explain in detail Scalable Video Coding for MPEG-4 and H.264 respectively. Here is another good paper that explains what you intend to do.
Unfortunately, this is broadly a research topic, and to date no open-source encoder (ffmpeg, Xvid) supports such multi-layer encoding. I guess even commercial encoders don't support it either. It is significantly complex. You could check whether the reference encoder for H.264 supports it.
The alternative (but CPU-expensive) way is to transcode in real time while transmitting the packets. In this case, you should start off with a reasonably good-quality source. If you are using FFmpeg as an API, this should not be a problem. Switching between multiple resolutions could still be messy, but you can keep changing the target encoding rate.
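A minimal sketch of the "keep changing the target encoding rate" idea, assuming you measure the client's bandwidth yourself and restart or reconfigure the encoder when the chosen rate changes. The bitrate ladder, the headroom factor and both function names are assumptions for illustration; `-c:v mpeg4` and `-b:v` are standard ffmpeg options for selecting the MPEG-4 Part 2 encoder and the target video bitrate.

```python
def pick_bitrate_kbps(bandwidth_kbps, ladder=(250, 500, 1000, 2000), headroom=0.8):
    """Pick the highest ladder bitrate that fits within a share of the bandwidth."""
    budget = bandwidth_kbps * headroom
    candidates = [rate for rate in ladder if rate <= budget]
    # Fall back to the lowest rung when even that exceeds the budget.
    return max(candidates) if candidates else min(ladder)


def ffmpeg_args(source, target, bitrate_kbps):
    """Build an ffmpeg command line that transcodes to MPEG-4 Part 2."""
    return [
        "ffmpeg",
        "-i", source,
        "-c:v", "mpeg4",              # MPEG-4 Part 2 encoder
        "-b:v", f"{bitrate_kbps}k",   # target video bitrate
        target,
    ]


if __name__ == "__main__":
    measured_kbps = 700  # hypothetical measurement for one client
    rate = pick_bitrate_kbps(measured_kbps)
    print(" ".join(ffmpeg_args("input.avi", "output.mp4", rate)))
```

When using the FFmpeg libraries directly rather than the command line, the same selection logic would feed the encoder's bitrate setting instead of a new process.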