Why don't we use motion vector data from video in optical flow? - image-processing

All the implementation i saw in optical flow in opencv uses video as an array of frames and then implement optical flow on each image. That involves slicing the image into NxN block and searching for velocity vector.
Although motion vector in video codec is misleading and it does not necessarily contain motion information, why don't we use it to check which block likely has motion and then run optical flow on those blocks ? Shouldn't that fasten the process ?

Motion vector calculated for video encoding is not 'True Motion Vector'. The calculation targets to find best matching block to achieve highest compression. It doesn't targets the best motion estimation. That's why it can't be readily used for motion estimation purpose.
However, the motion-vectors from decoder can be used in some way to make motion estimation faster and better process. e.g.
As a seed to your algorithm of motion estimation.
Try filter like median (depending on data) to remove outlier motion vectors and use it for better motion estimation.
One of above step can be used as first step in a two-step motion estimation algo.

OpenCV is a universal image processing framework. It takes in frames, not compressed video, for its algorithms.
You can certainly write a video decoder that also hands out info about displacement from the codec to openCV – but that will be very codec-specific and thus isn't in scope of openCV itself.

I distinctly remember having read academic work done on the use of "motion" vectors embedded in h.264 and similar codecs for optical flow/vision motion analysis. The easiest approach to do so, is for iterative estimations (such as Horn and Schunk). One can simply bootstrap the iterative algorithm with the vectors from the decoder.
This will not improve the estimation accuracy, at best I would guess it could speed up the rate of convergence, allowing for earlier stopping.
Not all h.264 encoding is of the same quality either. Especially from real-time systems with limited hardware. There are likely cases where the motion vectors will be outright awful, and actually be a detriment to the flow estimations instead of helping. I am thinking for example of a low-end IP camera that is designed for mostly static scenes, that is moving in varying lighting etc.

Related

best algorithm for face detection and pose estimation

I am looking for algorithms/publications on face detection. There are plenty in the web. But my scenario is somewhat specialized. I want to detect faces accurately in images taken by wearable devices (e.g. narrative clips), so there will be motion blur, and image quality will not be that good. I want to detect faces that are within 15 feet of the camera accurately. Next goal is to estimate the pose, primarily to find out if the person is looking toward the camera ( or better looking at the camera owner).
Any suggestion?
My go to for this would either be a deep-learning framework using convolutional layers for pixel classification, or K-means/ K-Nearest Neighbour algorithm.
This does depend on your data, however. From your post I am assuming that your data isn't labelled? meaning you are unable to feed in the 'truth' to the algorithm for classification.
you could perhaps use a CNN (convolutional neural network) for pixel classification (image segmentation) which should identify the location of a person. given this, perhaps you could run a 'local' CNN i a region close to the face identified to classify the region the body is located in as a certain pose.
This would probably be my first take on the problem but would depend on the exact structure of your data, and the structure of your labels (if you have any).
I have to say it does sound like a fun project!
I found OpenCV's Haar Cascades for Face Detection pretty accurate and robust for motion blur and "live" face recognition.
I'm saying that because I used them for implementing an Eye-Tracker in C++ with a laptop webcam (whose resolution was not excellent and motion blur was naturally always present).
They work in multiresolution and are therefore able to detect faces of any size, but you can easily tune them for your distance of interest.
They might not be your final optimal solution, but since they are already implemented and come with the OpenCV package, they could constitute a good starting point.

Block based Motion Estimation in Video Compression

As we know almosty all video encoders use some temporal coding. It uses block (Rectangular area) based motion estimation to find best macth of a block of pixels for a current frame in reference / previous frames. This gives the motion vector. This is fine if the motion is translational(i.e. if the block moves to left/right or up/down) What if the object rotates and if the object was rectangular in shape and it rotates, then motion estimation would not be so accurate and hence would not result in least presidual(original minus prediction).
So what methods does a video encoder adopt to deal with such rotational motions./movements.
Does it then handle such situation by coding that block as Intra block(Code as it is without any reference to any previous) within the P frame
or
are there any other tricks at hand to deal it while coding it as P macroblock itself?
As far as I know, video encoders don't have any special case for rotational movements. First, detection of rotational motion itself would consume a lot of time. Also, motion estimation is done at the macroblock level and therefore, there might be quite a few macroblocks in the frame that are not moving in a rotational manner, unless the whole frame itself is somehow rotating.
One "trick" that I can suggest is the following-
Calculate PSNR between predicted frame (P Frame) and actual frame. If PSNR is too low, it makes more sense to encode the frame as an information frame (I Frame). Note that this cannot be done for live transmissions because it would be time consuming. But it can be done when encoding time is not a factor. In that case you could simply use a Full Search.
The point of motion estimation is that it is a computationally cheap way of reducing 'typical' videos.
If you were to use motion based coding on something like a video of a waterfall it would fail to reduce the size.
A similar concept applies to JPEG photos. The JPEG compression only works because it takes advantage of the particular sensitivity of the human eye.
Ultimately, data is data and you cannot losslessly reduce the amount of it. The best you can do is to make some guesses about the source and destination and then try to recreate something that will be indistinguishable to the viewer, but which uses less data. That is why motion estimation WORKS. 99.99 percent of movies that humans watch have humans in them, moving around like humans do...left and right...up and down. And by WORKS, I mean, can be done in a quick enough time to make it worthwhile to do it for millions of hours of footage produced every year.
This, of course has something to do with Shannon entropy http://en.wikipedia.org/wiki/Entropy_(information_theory) , but that article makes my brain start to seep out through my eye sockets a bit...
First thing is the computational complexity which increases dramatically for every addition of a rotational direction. For example, the Motion estimation time is 'x' seconds. After adding say right hand 90 degrees, we have again 'x' seconds, since it needs to check the same reference frame search window again with the rotated block. Again after adding the left rotation 90 degrees, again it adds another x seconds to motion estimate, and so on. And the main issue here is that, in the entire encoder, typically, Motion Estimation is the block which consumes major part of encoding time.
Second issue is the complexity of motion compensation unit. If we have rotational block in estimation or prediction then we must generate the same transformation for generating the compensated frame, in the encoder and decoder too. The worst thing is that it adds much complexity in the decoder side also.
The third thing is the prediction unit for the support of variable block size. The standard always defines motion vectors for the block sizes which are fixed. If rotational block sizes are proposed, then the directions needs to be standardized in decoder also, where motion compensation unit, entropy encoder/decoder etc.
The fourth thing is the Motion Vector Coding. Since we add the rotational motion vectors, we need to add extra bits to motion vectors.So, put these things in a beam balance - "adding addition bits for each MV" and "improving compression efficiency using rotational Motion vectors", which one weighs more. If the balance is balanced, or if "adding extra bits for each MV" weighs more, then there is no use of using rotational MVs.
Fifth thing is about the deep understanding of the encoder block diagram. The encoder which we are using is analogous to adaptive Differential Pulse Code Modulator or any similar type with predictive coding. The video signal is always encoder differentially. When a video signal or any signal is coded differentially, the time difference between previous and the current sample is infinitesimally small (here 1/frame-rate), such that the individual blocks always follow translational direction.So, we get in use, the rotational MVs only if we are using multiple reference frame when reference frame if larger than frame-rate or at least larger than GOP-size. So, in this case rotational MVs could give very slight improvement in PSNR or increase Motion Estimation time dramatically.
Another thing is about the subjective and statistical study of the Motion direction.
Despite all these, there are some proposals in JCT-VC for implementing this, but finally not approved in current HEVC standard. May be they will try to figure it out and solve all the issues in future.

How to detect the voice from an audio stream

I need to determine when someone speaks in an audio stream. I applied the Hamming window and calculated the FFT. How do i detect the human voice from here?
If you want to experiment with your own voice activity detection algorithms, an FFT can be used as an initial stage. Next you might want to try subtracting any characterized stationary spectral noise background. Then you could try using the modified FFT results to calculate a cepstrum (or some weighted cepstral coefficients) for feature extraction. You could then do some statistical pattern matching on whatever feature vectors you decided to extract, and feed the results to a decision algorithm.
Each of the above steps has likely been a research topic, and a good implementation might involve studying dozens of published research papers, which perhaps can be found in your university library.
You don't need to do an FFT for this, you need to implement a Voice Activity Detection algorithm.

Finding path obstacles in a 2D image

what approach would you recommend for finding obstacles in a 2D image?
Here are some key points I came up with till now:
I doubt I can use object recognition based on "database of obstacles" search, since I don't know what might the obstruction look like.
I assume color recognition might be problematic if the path does not differ a lot from the object itself.
Possibly, adding one more camera and computing a 3D image (like a Kinect does) would work, but that would not run as smooth as I require.
To illustrate the problem; robot can ride either left or right side of the pavement. In the following picture, left side is the correct choice:
If you know what the path looks like, this is largely a classification problem. Acquire a bunch of images of path at different distances, illumination, etc. and manually label the ground in each image. Use this labeled data to train a classifier that classifies each pixel as either "road" or "not road." Depending upon the texture of the road, this could be as simple as classifying each pixels' RGB (or HSV) values or using OpenCv's built-in histogram back-projection (i.e. cv::CalcBackProjectPatch()).
I suggest beginning with manual thresholds, moving to histogram-based matching, and only using a full-fledged machine learning classifier (such as a Naive Bayes Classifier or a SVM) if the simpler techniques fail. Once the entire image is classified, all pixels that are identified as "not road" are obstacles. By classifying the road instead of the obstacles, we completely avoided building a "database of objects".
Somewhat out of the scope of the question, the easiest solution is to add additional sensors ("throw more hardware at the problem!") and directly measure the three-dimensional position of obstacles. In order of preference:
Microsoft Kinect: Cheap, easy, and effective. Due to ambient IR light, it only works indoors.
Scanning Laser Rangefinder: Extremely accurate, easy to setup, and works outside. Also very expensive (~$1200-10,000 depending upon maximum range and sample rate).
Stereo Camera: Not as good as a Kinect, but it works outside. If you cannot afford a pre-made stereo camera (~$1800), you can make a decent custom stereo camera using USB webcams.
Note that professional stereo vision cameras can be very fast by using custom hardware (Stereo On-Chip, STOC). Software-based stereo is also reasonably fast (10-20 Hz) on a modern computer.

Rapid motion and object detection in opencv

How can we detect rapid motion and object simultaneously, let me give an example,....
suppose there is one soccer match video, and i want to detect position of each and every players with maximum accuracy.i was thinking about human detection but if we see soccer match video then there is nothing with human detection because we can consider human as objects.may be we can do this with blob detection but there are many problems with blobs like:-
1) I want to separate each and every player. so if players will collide then blob detection will not help. so there will problem to identify player separately
2) second will be problem of lights on stadium.
so is there any particular algorithm or method or library to do this..?
i've seen some research paper but not satisfied...so suggest anything related to this like any article,algorithm,library,any method, any research paper etc. and please all express your views in this.
For fast and reliable human detection, Dalal and Triggs' Histogram of Gradients is generally accepted as very good. Have you tried playing with that?
Since you mentioned rapid motion changes, are you worried about fast camera motion or fast player/ball motion?
You can do 2D or 3D video stabilization to fix camera motion (try the excellent Deshaker plugin for VirtualDub).
For fast player motion, background subtraction or other blob detection will definitely help. You can use that to get a rough kinematic estimate and use that as an estimate of your blur kernel. This can then be used to deblur the image chip containing the player.
You can do additional processing to establish identify based upon OCRing jersey numbers, etc.
You mentioned concern about lights on the stadium. Is the main issue that it will cast shadows? That can be dealt with by the HOG detector. Blob detection to get blur kernel should still work fine with the shadow.
If you have control over the camera, you may want to reduce exposure times to reduce blur. Denoising techniques can be used to reduce CCD noise that occurs with extreme low light and dense optical flow approaches align the frames and boost the signal back up to something reasonable via adding the denoised frames.

Resources