Block based Motion Estimation in Video Compression - video-encoding

As we know, almost all video encoders use some form of temporal coding. They use block-based (rectangular-area) motion estimation to find the best match for a block of pixels of the current frame in reference/previous frames. This gives the motion vector. This works well if the motion is translational (i.e. the block moves left/right or up/down). But what if the object rotates? If a rectangular object rotates, motion estimation would not be as accurate and hence would not yield the smallest residual (original minus prediction).
So what methods does a video encoder adopt to deal with such rotational motion?
Does it handle such a situation by coding that block as an Intra block (coded as-is, without reference to any previous frame) within the P frame,
or
are there other tricks to deal with it while still coding it as a P macroblock?

As far as I know, video encoders don't have any special case for rotational movements. First, detection of rotational motion itself would consume a lot of time. Also, motion estimation is done at the macroblock level and therefore, there might be quite a few macroblocks in the frame that are not moving in a rotational manner, unless the whole frame itself is somehow rotating.
One "trick" that I can suggest is the following-
Calculate the PSNR between the predicted frame (P-frame) and the actual frame. If the PSNR is too low, it makes more sense to encode the frame as an intra-coded frame (I-frame). Note that this cannot be done for live transmissions because it would be time consuming. But it can be done when encoding time is not a factor. In that case you could simply use a Full Search.
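To make the two ideas above concrete, here is a minimal sketch of a full-search block match using the sum of absolute differences (SAD) and of a PSNR check. Frames are assumed to be plain lists of rows of 8-bit luma values, and the function names (`sad`, `full_search`, `psnr`) are hypothetical, not taken from any encoder. Real encoders use faster search patterns and sub-pixel refinement.

```python
import math

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

def full_search(cur, ref, bx, by, n, radius):
    """Try every displacement within +/- radius; return (best_mv, best_sad)."""
    h, w = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates that fall outside the reference frame.
            inside_x = 0 <= bx + dx and bx + dx + n <= w
            inside_y = 0 <= by + dy and by + dy + n <= h
            if not (inside_x and inside_y):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

def psnr(a, b, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    se = sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    mse = se / (len(a) * len(a[0]))
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Synthetic example: the current frame is the reference shifted right by 2 pixels.
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][x - 2] if x >= 2 else 0 for x in range(8)] for y in range(8)]
mv, cost = full_search(cur, ref, bx=4, by=2, n=2, radius=3)
```

A block that has rotated would typically leave a non-zero best SAD here, which is exactly the larger residual the question is worried about.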

The point of motion estimation is that it is a computationally cheap way of reducing the size of 'typical' videos.
If you were to use motion based coding on something like a video of a waterfall it would fail to reduce the size.
A similar concept applies to JPEG photos. The JPEG compression only works because it takes advantage of the particular sensitivity of the human eye.
Ultimately, data is data and you cannot losslessly shrink arbitrary data. The best you can do is to make some guesses about the source and destination and then try to recreate something that will be indistinguishable to the viewer, but which uses less data. That is why motion estimation WORKS. 99.99 percent of movies that humans watch have humans in them, moving around like humans do: left and right, up and down. And by WORKS, I mean it can be done quickly enough to make it worthwhile for the millions of hours of footage produced every year.
This, of course has something to do with Shannon entropy http://en.wikipedia.org/wiki/Entropy_(information_theory) , but that article makes my brain start to seep out through my eye sockets a bit...

The first issue is the computational complexity, which increases dramatically with every rotation that is added. Suppose motion estimation takes 'x' seconds. Adding, say, a 90-degree rotation to the right costs another 'x' seconds, since the same reference-frame search window must be checked again with the rotated block. Adding a 90-degree rotation to the left adds yet another 'x' seconds, and so on. The main problem is that motion estimation is typically the block that consumes the major part of the encoding time in the entire encoder.
The second issue is the complexity of the motion compensation unit. If we use a rotated block in estimation or prediction, then we must apply the same transformation when generating the compensated frame, in both the encoder and the decoder. The worst part is that it adds a lot of complexity on the decoder side as well.
The third issue is the prediction unit that supports variable block sizes. The standard always defines motion vectors for fixed block sizes. If rotated blocks are proposed, then the rotation directions need to be standardized in the decoder too, which affects the motion compensation unit, the entropy encoder/decoder, etc.
The fourth issue is motion vector coding. Since we add rotational motion vectors, we need to add extra bits to the motion vectors. So put these two things on a balance: "adding extra bits for each MV" and "improving compression efficiency using rotational motion vectors". Which one weighs more? If the balance is even, or if "adding extra bits for each MV" weighs more, then there is no use in using rotational MVs.
The fifth issue requires a deeper understanding of the encoder block diagram. The encoder we are using is analogous to an adaptive differential pulse-code modulator or any similar predictive coder. The video signal is always encoded differentially. When a video signal, or any signal, is coded differentially, the time difference between the previous and the current sample is very small (here 1/frame-rate), so individual blocks almost always appear to move translationally. Rotational MVs would therefore only come into play when using multiple reference frames, where the temporal distance to the reference frame is much larger than one frame interval, or at least larger than the GOP size. Even in that case, rotational MVs could give only a very slight improvement in PSNR while increasing motion estimation time dramatically.
Another thing is about the subjective and statistical study of the Motion direction.
Despite all this, there have been some proposals in JCT-VC for implementing it, but they were not approved for the current HEVC standard. Maybe they will try to figure it out and solve all these issues in the future.

Related

Improving an algorithm for detecting fish in a canal

I have many hours of video captured by an infrared camera placed by marine biologists in a canal. Their research goal is to count herring that swim past the camera. It is too time consuming to watch each video, so they'd like to employ some computer vision to help them filter out the frames that do not contain fish. They can tolerate some false positives and false negatives, and we do not have sufficient tagged training data yet, so we cannot use a more sophisticated machine learning approach at this point.
I am using a process that looks like this for each frame:
Load the frame from the video
Apply a Gaussian (or median) blur
Subtract the background using the BackgroundSubtractorMOG2 class
Apply a brightness threshold — the fish tend to reflect the sunlight, or an infrared light that is turned on at night — and dilate
Compute the total area of all of the contours in the image
If this area is greater than a certain percentage of the frame, the frame may contain fish. Extract the frame.
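As a rough illustration of the last three steps, the sketch below marks pixels that both differ from a background image and exceed a brightness threshold, then compares the marked fraction of the frame against a minimum. It is a simplification under stated assumptions: a single static background frame stands in for BackgroundSubtractorMOG2, the blur and dilation steps are omitted, a pixel count stands in for contour area, and all function and parameter names are hypothetical.

```python
def foreground_mask(frame, background, diff_thresh, bright_thresh):
    """Mark pixels that changed relative to the background AND are bright."""
    return [
        [abs(p - b) > diff_thresh and p > bright_thresh for p, b in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]

def foreground_fraction(mask):
    """Fraction of the frame covered by marked pixels."""
    total = sum(len(row) for row in mask)
    return sum(cell for row in mask for cell in row) / total

def may_contain_fish(frame, background, diff_thresh, bright_thresh, min_fraction):
    """Decision step: keep the frame if enough of it is bright foreground."""
    mask = foreground_mask(frame, background, diff_thresh, bright_thresh)
    return foreground_fraction(mask) >= min_fraction

# Tiny synthetic frame: two bright changed pixels and one dim changed pixel.
background = [[10] * 4 for _ in range(4)]
frame = [row[:] for row in background]
frame[1][1] = 200
frame[1][2] = 200
frame[3][0] = 40   # changed, but too dim to pass the brightness threshold
fraction = foreground_fraction(foreground_mask(frame, background, 20, 128))
```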
To find optimal parameters for these operations, such as the blur algorithm and its kernel size, the brightness threshold, etc., I've taken a manually tagged video and run many versions of the detector algorithm using an evolutionary algorithm to guide me to optimal parameters.
However, even the best parameter set I can find still creates many false negatives (about 2/3rds of the fish are not detected) and false positives (about 80% of the detected frames in fact contain no fish).
I'm looking for ways that I might be able to improve the algorithm. I don't know specifically what direction to look in, but here are two ideas:
Can I identify the fish by the ellipse of their contour and the angle (they tend to be horizontal, or at an upward or downward angle, but not vertical or head-on)?
Should I do something to normalize the lighting conditions so that the same brightness threshold works whether day or night?
(I'm a novice when it comes to OpenCV, so examples are very appreciated.)
I think you're heading in the right direction. Your camera is fixed, so it will be easy to extract the fish image.
But you're lacking a good tool to speed up the process. Believe me, coding will cost you a lot of time.
Personally, in the past I first chose a small sample of the data. Then I used bgslibrary to check which background subtraction method worked for my data. Then I coded the program by hand to run on the entire data set. The GUI is very easy to use and the library is awesome.
GUI video
Hope this will help you.

Measure of motion in a video

I am trying to get a measure of motion between frames of an American Football match. When the players are set on the line of scrimmage, the motion should be minimal as few parts are moving. Then once the play begins, there will be a sharp increase in motion.
I am aware of optical flow being used to find the general motion between parts of an image. However, is there any way to quantify how much motion is occurring?
By taking the sum of all motion vectors, you get global motion between frames.
By definition, the motion vectors calculated by optical flow algorithms will be the transformations between corresponding parts of frame(i) and frame(i+1). Therefore, the amount of motion between two frames is the sum of the motion vectors (motion of each part). You may simply add together their magnitude for a rough estimate of how chaotic the transformation/motion was. You may also add together the vectors to receive a new vector that shows net motion.
As far as I can tell, you want the first value; sum(motionVectors.magnitude). This will give you a number allowing you to compare (roughly), the amount of motion in two different frames. This also allows you to graph motion over time, allowing you to compare two videos using metrics such as 'unexpected motion', or 'total motion'.
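A minimal sketch of the two quantities described above, assuming the optical flow result is already available as a list of (dx, dy) vectors (the function names are hypothetical):

```python
import math

def total_motion(vectors):
    """Sum of magnitudes: a rough measure of how much is moving overall."""
    return sum(math.hypot(dx, dy) for dx, dy in vectors)

def net_motion(vectors):
    """Vector sum: the overall direction in which things are moving."""
    return (sum(dx for dx, _ in vectors), sum(dy for _, dy in vectors))

# Two parts moving in opposite directions: plenty of motion, but no net motion.
flow = [(3, 4), (-3, -4)]
activity = total_motion(flow)
drift = net_motion(flow)
```

Plotting `total_motion` per frame pair over time should show the flat stretch while the players are set and the jump when the play begins.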
There are many uses for this data, and I'll leave them to your imagination.

How to decorrelate accelerometer data

Is it possible to decorrelate accelerometer data in real-time? If so, how is it done?
Background:
My application is receiving (X,Y,Z) accelerometer data in real-time (sample rate is 6.75Hz). The sensor is moving in a periodic motion but the motion is not necessarily along only one axis. The 3 signals x(t), y(t) and z(t) are therefore slightly correlated and I would like to know if I can find a rotation matrix (in real time) which can be used to rotate the measured (x,y,z) into a new vector (x*,y*,z*) so that the entire motion is along the z-axis?
I would like to implement the algorithm in C.
Thanks.
What you're trying to do is generally called "principal component analysis". The Wikipedia article is pretty good:
https://en.wikipedia.org/wiki/Principal_component_analysis
For static data you generally use the eigenvectors of the covariance matrix as your new coordinate basis.
PCA in real time is doable, but not super easy. See, for example: http://www.bio-conferences.org/articles/bioconf/pdf/2011/01/bioconf_skills_00055.pdf
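For the static (batch) case, a minimal sketch might look like the following. It builds the 3x3 covariance matrix and finds its dominant eigenvector by power iteration, which is a simple stand-in for a full eigendecomposition. The function names are hypothetical, and power iteration is assumed to start from a vector that is not orthogonal to the dominant axis. The question asks for C; the arithmetic ports over directly.

```python
import math

def covariance3(samples):
    """Mean and 3x3 sample covariance matrix of a list of (x, y, z) tuples."""
    n = len(samples)
    mean = [sum(s[i] for s in samples) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for s in samples:
        d = [s[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return mean, cov

def principal_axis(cov, iterations=200):
    """Dominant eigenvector of `cov` by power iteration (unit length)."""
    v = [1 / math.sqrt(3)] * 3
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # no variance at all: keep the current guess
        v = [c / norm for c in w]
    return v

def project(sample, mean, axis):
    """Coordinate of `sample` along the principal axis (the new 'z*')."""
    return sum((sample[i] - mean[i]) * axis[i] for i in range(3))

# Synthetic motion purely along the direction (1, 2, 2).
samples = [(t, 2 * t, 2 * t) for t in (-2, -1, 0, 1, 2)]
mean, cov = covariance3(samples)
axis = principal_axis(cov)
```

Projecting each new sample onto `axis` gives the motion along the dominant direction; the remaining two axes of the rotation can be any orthonormal pair perpendicular to it.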
I'd like to first of all emphasize that Matt Timmermans' answer has done exactly what people are actually doing when classifying accelerometer data from clinical studies (a project I worked on).
Then: you're observing a sampled signal. In general, if you have a sensor that gives you samples at a rate of 6.75Hz, the highest frequency of a signal you can detect is 6.75Hz/2 = 3.375Hz. Everything that has a frequency higher than that will inherently be aliased back and look like it was something with a frequency f with 0<=f<3.375Hz. If you've not considered this, please go and read up on the Nyquist–Shannon sampling theorem. In particular: shield your sensors (however you do that, e.g. by employing dampeners) from all input above that limit, otherwise your measurements might be worth very little or even nothing. If your sensor does this internally (that's absolutely possible, there are enough accelerometers with analog low-pass filters), this has been taken care of. However, document the characteristics of your sensor.
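The folding described above can be worked through with a small helper (the name `aliased_frequency` is hypothetical). For a real-valued tone at `f` Hz sampled at `fs` Hz it returns the frequency at which the tone will appear in the sampled data:

```python
def aliased_frequency(f, fs):
    """Apparent frequency, in the range 0..fs/2, of a tone at f Hz sampled at fs Hz."""
    f = abs(f) % fs          # sampling cannot tell f apart from f + k*fs
    return min(f, fs - f)    # a real signal folds around the Nyquist frequency fs/2

fs = 6.75
in_band = aliased_frequency(2.0, fs)    # below 3.375 Hz: unchanged
folded = aliased_frequency(5.0, fs)     # above 3.375 Hz: shows up at 1.75 Hz
```

So a 5 Hz vibration reaching the sensor would be indistinguishable from a genuine 1.75 Hz motion.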
Now, your case is a little bit easier because you know pretty well that your whole observation is going to be periodic, and it's measured along three orthogonal axes.
In this case, just doing three discrete Fourier transforms at once, extracting the "strongest" spectral component over all three channels, and finding the phase of that spectral component (which is just the complex argument of that DFT bin) in the two others would give you something that you can map to a periodic movement around a specific axis in 3D space. If you want to, remove these values (set the bins to 0), search for the strongest component again, and so on.
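A minimal sketch of that first step, using a direct (slow) DFT so that it stays self-contained. The function names are hypothetical, and skipping bin 0 is my own assumption, made so that a constant offset such as gravity does not win the search:

```python
import cmath
import math

def dft(signal):
    """Direct discrete Fourier transform (O(n^2), fine at these sample rates)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def strongest_component(x, y, z):
    """Find the bin with the largest magnitude in any channel, then report
    that bin's magnitude and phase in all three channels."""
    spectra = [dft(channel) for channel in (x, y, z)]
    n = len(x)
    # Assumption: ignore bin 0 (DC) and search only up to the Nyquist bin.
    best = max(range(1, n // 2 + 1), key=lambda k: max(abs(s[k]) for s in spectra))
    magnitudes = [abs(s[best]) for s in spectra]
    phases = [cmath.phase(s[best]) for s in spectra]
    return best, magnitudes, phases

# Synthetic data: 2 cycles per 16 samples, y lagging x by a quarter period.
n = 16
x = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
y = [0.5 * math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
z = [0.0] * n
best_bin, mags, phases = strongest_component(x, y, z)
```

The relative magnitudes and phase differences at `best_bin` are what describe the shape and orientation of the periodic movement.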
Discrete Fourier transforms can be done at staggering speed nowadays. With 6.75Hz, no PC in this world will ever get into trouble when you try this while you receive further samples. It's a hilariously low sampling rate.
Another, more elegant approach (read: you need fewer samples to compute this) would be to use a parametric estimator; in your case, a direction-of-arrival estimator from the world of RF technology with multiple antennas would, as far as I can think, map directly to detection of the rotational axis. The classical algorithms here are MUSIC and ESPRIT, and for your case (limited, known number of oscillating parts), ESPRIT might be the better choice.

Is a "rolling" FFT possible and could it be of use?

Lately I have been experimenting with audio and FFTs, specifically the Minim library in Processing (basically Java, not that it's particularly important for this question). What I have come to understand is that with a buffer/sample size N and sample rate K, after performing a forward FFT, I will get N frequency bins (only N/2 of them usable; in fact Minim only returns N/2 bins), linearly spaced and representing the spectrum from 0 to K/2 Hz.
With Minim (as well as other typical FFT implementations) you wait to gather N samples, and then perform the forward transformation, then wait for N more samples, and so on. In order to get a reasonable frame-rate (for audio visualizations, beat detection, etc.), I must use a small sample size relative to the sampling frequency.
The problem with this, though, is that a small sample size results in a very low resolution for the low end of the spectrum when I compute logarithmically spaced averages (Since a bass octave is much narrower than a high pitched octave).
I was wondering if a possible way to squeeze out more apparent resolution would be to perform FFTs more often than every N samples, on a slightly larger sample size than I am currently using (i.e. with an input buffer of size 2048, every 100 samples, add those samples to the input buffer, remove the oldest 100 samples, and perform an FFT). It seems like this would possibly create a rolling-average type of effect (which I can live with), but I'm not too sure.
What would be the pros and cons of this approach? Are there any other ways I could increase my apparent resolution while still being able to do real-time visualization and analysis?
That approach goes by the name Short-time Fourier transform. You can find all the answers to your question on Wikipedia: https://en.wikipedia.org/wiki/Short-time_Fourier_transform
It works great in practice, and you can even get better resolution out of it than you would expect from a rolling window by using the phase difference between the FFTs.
Here is one article that does pitch shifting of audio signals. The way how to get higher frequency resolution is well explained: http://www.dspdimension.com/admin/pitch-shifting-using-the-ft/
We use the approach you describe, which we call overlapping, to make sure all the rows of a spectral waterfall are filled in. Overlap can be used to provide spectra that are spaced as closely as a single sample interval.
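The overlapping scheme can be sketched as follows. This is an illustration only: the function names are hypothetical, a direct DFT stands in for a real FFT library such as Minim, the Hann window is my own choice, and the 44100 Hz sample rate used in the spacing calculation is an assumed value that does not come from the question.

```python
import cmath
import math

def overlapping_frames(samples, frame_size, hop):
    """Yield successive frames of `frame_size` samples, advancing by `hop`."""
    for start in range(0, len(samples) - frame_size + 1, hop):
        yield samples[start:start + frame_size]

def hann(n):
    """Hann window, used to reduce leakage from the frame edges."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame):
    """Magnitudes of the first n/2 DFT bins of a windowed frame."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann(n))]
    return [
        abs(sum(windowed[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]

def bin_spacing_hz(sample_rate, frame_size):
    """Frequency distance between adjacent bins; depends on frame size only."""
    return sample_rate / frame_size

# A tone with 4 cycles per 32 samples, analysed with 32-sample frames and hop 8.
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
spectra = [magnitude_spectrum(f) for f in overlapping_frames(signal, 32, 8)]
peaks = [max(range(len(s)), key=s.__getitem__) for s in spectra]
spacing = bin_spacing_hz(44100, 2048)
```

Note that the hop size changes how often a spectrum is produced, not the bin spacing: with a 2048-sample buffer the bins stay about 21.5 Hz apart at the assumed sample rate, whether the hop is 2048 or 100.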
The primary disadvantage is the extra processing to produce all those spectra.
On the positive side, while the time resolution of each spectrum is still constrained by the FFT size, looking at closely spaced adjacent spectra seems to provide a kind of visual interpolation that, I think, lets you see the data with higher precision.
One common way this is done is to use multiple lengths of windowed FFTs on the same data, short FFTs for good time resolution, much longer FFTs for better frequency resolution of lower frequencies. Then the problem for visualization becomes picking the best FFT result out of several possible at each plot point (such as the highest contrast sub-block, etc.) and blending them attractively.
Most modern processors (in PCs and mobile phones, etc.) can easily do multiple lengths (dozens) of FFTs still in real-time for audio.

Vehicle segmentation and tracking

I've been working on a project for some time to detect and track (moving) vehicles in video captured from UAVs. Currently I am using an SVM trained on bag-of-features representations of local features extracted from vehicle and background images. I then use a sliding-window detection approach to try to localise vehicles in the images, which I would then like to track. The problem is that this approach is far too slow, and my detector isn't as reliable as I would like, so I'm getting quite a few false positives.
So I have been considering segmenting the cars from the background to find their approximate position, in order to reduce the search space before applying my classifier. I am not sure how to go about this and was hoping someone could help.
Additionally, I have been reading about motion segmentation with layers, using optical flow to segment the frame by flow model. Does anyone have any experience with this method? If so, could you offer some input as to whether you think it would be applicable to my problem?
Below are two frames from a sample video:
frame 0:
frame 5:
Assuming your cars are moving, you could try to estimate the ground plane (road).
You may get a decent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since a plane in 3D moves according to a homography between two camera frames.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, not that it is planar.
Update
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' )=I( inv(H)[x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff=(Iw-K)*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H such as moving cars should stand out.
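The same consistency check can be illustrated on matched feature points instead of whole images, which keeps the example short. This is a swapped-in simplification, not the pixel-wise warp described above: each point from I is mapped through H and compared with its match in K, and matches with a large residual are treated as violating the ground-plane motion. Function names are hypothetical, and H is assumed to be a 3x3 matrix given as nested lists.

```python
import math

def apply_homography(H, x, y):
    """Map (x, y) through the 3x3 homography H, including the perspective divide."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w

def residuals(H, matches):
    """For each ((x, y) in I, (x, y) in K) match, the distance between the
    position predicted by H and the position actually observed in K."""
    out = []
    for (xi, yi), (xk, yk) in matches:
        px, py = apply_homography(H, xi, yi)
        out.append(math.hypot(px - xk, py - yk))
    return out

def violators(H, matches, max_error):
    """Matches that do not move according to H (candidate moving objects)."""
    return [m for m, r in zip(matches, residuals(H, matches)) if r > max_error]

# Ground moves by (+2, +3) between frames; the second match moved differently.
H = [[1, 0, 2], [0, 1, 3], [0, 0, 1]]
matches = [((1, 1), (3, 4)), ((5, 5), (10, 12))]
errors = residuals(H, matches)
moving = violators(H, matches, max_error=1.0)
```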
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but it's easy to implement for a first try, and it should be fast. For something more fancy, look at Clustering in Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame it should be pretty fast.
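A minimal sketch of the thresholding and connected-components step, assuming Diff is a plain list of rows of numbers. The morphological clean-up is left out, 4-connectivity is my own choice, and the function names are hypothetical.

```python
from collections import deque

def connected_components(diff, threshold):
    """Group 4-connected pixels whose value exceeds `threshold`.
    Returns a list of blobs, each a list of (x, y) coordinates."""
    h, w = len(diff), len(diff[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y0 in range(h):
        for x0 in range(w):
            if seen[y0][x0] or diff[y0][x0] <= threshold:
                continue
            # Breadth-first flood fill starting from this unvisited pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            blob = []
            while queue:
                x, y = queue.popleft()
                blob.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and not seen[ny][nx] and diff[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            blobs.append(blob)
    return blobs

def bounding_box(blob):
    """(min_x, min_y, max_x, max_y) of a blob: a candidate search region."""
    xs = [x for x, _ in blob]
    ys = [y for _, y in blob]
    return min(xs), min(ys), max(xs), max(ys)

diff = [
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 0],
    [0, 0, 0, 0, 8],
    [0, 0, 0, 8, 8],
]
blobs = connected_components(diff, threshold=5)
boxes = [bounding_box(b) for b in blobs]
```

The bounding boxes are the reduced search regions on which the existing classifier could then be run.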
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be interesting to combine it with my proposal:
You could try to learn and classify the cars in the difference image D instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of D with sufficiently high value.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: Regardless of which method you use, once you have found cars in one frame you should use this information to robustify your search for these cars in consecutive frames, giving a higher likelihood to detections close to the old ones (Kalman filter, etc). That's what tracking is all about!
If the number of cars in your field of view always remains the same but they move around, then you can use optical flow. It will give you good results against a still background. If the number of cars changes, then you need to call the goodFeaturesToTrack function in OpenCV after a certain number of frames and track the cars again using optical flow.
You can use background modelling to model the background, so that the cars are always your foreground. The simplest example is frame differencing: subtract the previous frame from the current frame, diff(x,y,k) = I(x,y,k) - I(x,y,k-1). As your cars move in each frame, you will get their positions.
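A minimal sketch of frame differencing on frames stored as lists of rows. Unlike the signed formula above, it takes the absolute value so that both brightening and darkening pixels count as change; the function names and the threshold are hypothetical.

```python
def frame_difference(current, previous):
    """Per-pixel |I(x, y, k) - I(x, y, k-1)|."""
    return [
        [abs(c - p) for c, p in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]

def moving_pixels(current, previous, threshold):
    """(x, y) coordinates whose change between frames exceeds `threshold`."""
    diff = frame_difference(current, previous)
    return [
        (x, y)
        for y, row in enumerate(diff)
        for x, value in enumerate(row)
        if value > threshold
    ]

# A bright "car" pixel moves one step to the right between frames.
previous = [[0, 90, 0, 0], [0, 0, 0, 0]]
current = [[0, 0, 90, 0], [0, 0, 0, 0]]
changed = moving_pixels(current, previous, threshold=30)
```

Both the position the object left and the position it arrived at show up as changed, which is typical for plain frame differencing.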
Both processes should work fine since, I presume, you have a still background. Check this link to find out what optical flow can do.
