In what precise respects are discrete images different from continuous images? - image-processing

Please suggest how I should approach this question:
In what precise respects are discrete images different from continuous images?

This is a very general question and I suggest reading about the details in any good textbook on digital image processing, e.g. "Digital Image Processing" by Gonzalez and Woods.
In the following I want to provide a rough overview. The relationship between a continuous image and its discrete counterpart is best described by sampling and quantization. Let f(x, y) be a continuous image. Sampling means taking values of f at discrete positions (x_1, y_1), (x_2, y_2), ... There is a vast body of literature on how to choose these samples; the most important result is probably the Nyquist-Shannon sampling theorem, which is often seen as the bridge between continuous and discrete signals.
After sampling, the taken values are still continuous, i.e. f(x_1, y_1), f(x_2, y_2), ... are continuous. The next step is therefore quantization: in order to store the values digitally, they are mapped to a finite set of levels. The quantization depends on the bit depth used to store images; in general, 8 bits per color channel are used (e.g. RGB images have 24 bits per pixel), which means every value f(x_i, y_i) is quantized into one of the 256 levels provided by 8-bit quantization. Together, sampling and quantization transform a continuous image into a discrete, or digital, image.
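To make the two steps concrete, here is a minimal sketch in Python/NumPy, assuming a synthetic "continuous" image defined by a formula (the formula and grid size are not part of the original question):

import numpy as np

# Synthetic continuous image model with values in [0, 1]
def f(x, y):
    return 0.5 * (1.0 + np.sin(2 * np.pi * (x + y)))

# Sampling: evaluate f on a regular 256 x 256 grid of positions
n = 256
xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
sampled = f(xs, ys)                      # values are still continuous (floats)

# Quantization: map the continuous values to 256 levels (8 bits)
quantized = np.round(sampled * 255).astype(np.uint8)

print(sampled.dtype, quantized.dtype)    # float64 uint8
print(quantized.min(), quantized.max())  # 0 255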
Note that many image processing techniques originate from the continuous image model and can successfully be transferred to the discrete domain (these include basic principles concerning convolution, Fourier analysis, histograms, etc.). However, the discrete model often introduces difficulties one has to be aware of, among them quantization errors, sampling issues (e.g. aliasing), and numerical stability.

Related

Why does ResNet double its feature channels after each stage rather than quadrupling?

I'm curious about the reasoning behind the common pattern in vision backbones like ResNet and others, where the number of feature channels is doubled at the end of each stage.
One might say that quadrupling would be more natural since this would keep the feature "size" consistent between stages.
i.e. 256 channels at 32x32 resolution is 262,144 features, but
512 channels at 16x16 resolution is 131,072 features (half as many)
Couldn't this limit the number of high-level features that a detector could make use of? Have there been experiments that explore this?
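For concreteness, a small sketch of the arithmetic above; the resolutions and starting channel count are just the example numbers from this question, not a specific ResNet configuration:

# Feature counts per stage: doubling halves the total (resolution drops 4x),
# quadrupling would keep it constant.
resolutions = [32, 16, 8, 4]
for i, res in enumerate(resolutions):
    doubled = 256 * 2 ** i * res * res
    quadrupled = 256 * 4 ** i * res * res
    print(f"{res}x{res}: doubling -> {doubled:>9,} features, "
          f"quadrupling -> {quadrupled:>11,} features")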
First of all, I don't see a good reason why you would want to keep the feature "size" consistent across stages. The idea behind deep CNNs is to use the representations of small image regions learned in the initial layers to build higher-level, more complex features in the deeper layers.
So it is hard to argue that a feature vector of length l in layer n encodes the same amount of information as a feature vector of the same length in layer n-1.
I suppose these decisions are much more experimental than analytical.
You can look at this paper to build more intuition about what you are asking.
https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf

metrics for feature detection/extraction methods

I wonder how we evaluate feature detection/extraction methods (SIFT, SURF, MSER, ...) for object detection and tracking of things like pedestrians, lanes, vehicles, etc. Are there standard metrics for comparison? I have read blog posts like http://computer-vision-talks.com/2011/07/comparison-of-the-opencvs-feature-detection-algorithms-ii/ and some research papers like this. The problem is that the more I learn, the more confused I am.
It is very hard to evaluate feature detectors per se, because features are only computational artifacts and not things that you are actually searching for in images. Feature detectors do not make sense outside their intended context, which for the descriptors you mentioned is affine-invariant image patch matching.
The very first usage of SIFT, SURF, and MSER was multi-view reconstruction and automatic 3D reconstruction pipelines. Thus, these features are usually assessed by the quality of the 3D reconstruction or image patch matching that they provide. Roughly speaking, you take a pair of images that are related by a known transform (an affine transform or a homography) and measure the difference between the homography estimated from the feature matches and the real one.
This is also the method used in the blog post that you quote by the way.
In order to assess the practical interest of a detector (and not only its precision in an ideal multi-view pipeline), some additional measurements of stability (under geometric and photometric changes) were added: does the number of detected features vary, does the quality of the estimated homography vary, etc.
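As an illustration of this kind of test, here is a rough Python/OpenCV sketch: warp an image by a known homography, re-detect and match SIFT features, and compare the estimated homography against the known one. The file name, the known homography, and the corner-error measure are placeholder assumptions.

import cv2
import numpy as np

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Known ground-truth homography used to warp the image
H_true = np.array([[1.0, 0.1, 10.0],
                   [0.0, 1.0,  5.0],
                   [0.0, 0.0,  1.0]])
warped = cv2.warpPerspective(img, H_true, (img.shape[1], img.shape[0]))

# Detect and match SIFT features between the original and warped images
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img, None)
kp2, des2 = sift.detectAndCompute(warped, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H_est, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

# Simple error measure: how far the estimated homography maps the image
# corners from where the true homography maps them
h, w = img.shape
corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
err = np.linalg.norm(cv2.perspectiveTransform(corners, H_true)
                     - cv2.perspectiveTransform(corners, H_est), axis=2)
print("mean corner error (px):", err.mean(), "| inlier ratio:", inliers.mean())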
It also happens, somewhat accidentally, that these detectors can work (although it was not their design purpose) for object detection and tracking (in tracking-by-detection settings). In this case, their performance is classically evaluated on more-or-less standardized image datasets, and typically expressed in terms of precision (the probability of a correct answer, linked to the false alarm rate) and recall (the probability of finding an object when it is present). You can read, for example, Wikipedia on this topic.
Addendum: What exactly do I mean by accidentally?
Well, as written above, SIFT and the like were designed to match planar and textured image patches. This is why you always see examples with similar images from a graffiti dataset.
Their extension to detection and tracking was then developed in two different ways:
While doing multi-view matching (with a spherical rig), Furukawa and Ponce built a kind of locally-planar 3D object model, which they then applied to object detection in the presence of severe occlusions. This work exploits the fact that an interesting object is often locally planar and textured;
Other people developed a less original (but still efficient in good conditions) approach by assuming that a query image of the object to track is available. Individual per-frame detections are then performed by matching (using SIFT, etc.) the template image with the current frame. This exploits the facts that SIFT produces few false matches, that objects are usually observed at a distance (hence are usually almost planar in the image), and that they are textured. See for example this paper.

Can I use Image entropy in noise removal algorithms, in order to check their effectiveness?

I am working in the digital image restoration field. I have studied a number of image noise removal research papers, and all of them use PSNR to check the effectiveness of their algorithms. One thing I noticed from the SSIM page is that PSNR mainly depends on MSE, and one weakness of MSE is that it depends on the scaling of the variables, despite the fact that the image is invariant to scaling.
So now my question is this:
Can I use image entropy to check the effectiveness of a noise removal method?
Of course you can do that, see http://scholar.google.co.uk/scholar?q=image+denoising+entropy
The list shows that entropy is a measure that works better in some domains than in others. For example: if you know an optimal basis that efficiently represents your noise-free image (e.g. the Fourier basis or a wavelet basis) but cannot efficiently model noise, your transformed noise-free image will be sparse in the transform domain and your noisy image will not. Sparse signals have a low entropy, while dense signals have a high entropy.
If you know that all these things are true, then you can evaluate your denoising method using a transform-domain entropy measure.
You will need to do some extra work to calibrate your new error measure, and of course you cannot use entropy-based methods to do the denoising itself. That would be double dipping.
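For example, a minimal sketch of a transform-domain entropy measure; using the 2D Fourier transform as the "efficient" basis is an assumption, as are the variable names:

import numpy as np

def transform_entropy(img, eps=1e-12):
    # Normalize the Fourier coefficient magnitudes into a probability distribution
    coeffs = np.abs(np.fft.fft2(img))
    p = coeffs / (coeffs.sum() + eps)
    p = p[p > eps]
    return -np.sum(p * np.log2(p))   # Shannon entropy in bits

# Usage idea: a denoised image should be sparser in the transform domain,
# and therefore have lower entropy, than the noisy input:
# print(transform_entropy(noisy), transform_entropy(denoised))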

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y, z)), choosing a clustering method is quite intuitive. Because we can draw and visualize the points, we have a better sense of which clustering method is more suitable.
e.g. 1: If my 2D data set had the shape shown in the top-right corner of the posted figure, I would know that k-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. Most probably, we have high-dimensional tuples that cannot be visualized like that.
e.g. 2: I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution as before, so I will NOT be able to say that DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
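A small sketch of both visualizations for the 4-dimensional iris data, assuming pandas, matplotlib and scikit-learn are available:

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix, parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# Scatter plot matrix: all pairwise 2D projections of the 4 features
scatter_matrix(df.drop(columns="species"), figsize=(8, 8), diagonal="hist")
plt.show()

# Parallel coordinates: each sample is a line across the 4 feature axes
parallel_coordinates(df, class_column="species")
plt.show()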
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this, one either goes back to the original space and uses techniques that seem reasonable based on the observations in the reduced space, or performs clustering in the reduced space itself. The first approach uses all available information, but can be invalid due to differences induced by the reduction process, while the second ensures that your observations and choice are valid (as you reduce your problem to a nice 2D/3D one) but loses a lot of information due to the transformation used.
One tries many different algorithms and chooses the one with the best metrics (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (since reducing the dimensionality introduces information changes that follow from the transformation used).
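A rough sketch of both generic approaches with scikit-learn; the synthetic data, the algorithm parameters, and the choice of silhouette score as the evaluation metric are all placeholder assumptions:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                  # stand-in for your 4-D tuples

X_scaled = StandardScaler().fit_transform(X)   # preprocessing, as noted above
X_2d = PCA(n_components=2).fit_transform(X_scaled)  # reduced space for plotting or clustering

# Approach 2: try several algorithms and compare an internal validity metric
for name, model in [("k-means (k=3)", KMeans(n_clusters=3, n_init=10)),
                    ("DBSCAN", DBSCAN(eps=0.8, min_samples=5))]:
    labels = model.fit_predict(X_scaled)
    if len(set(labels)) > 1:                   # silhouette needs at least 2 labels
        print(name, "silhouette:", round(silhouette_score(X_scaled, labels), 3))
    else:
        print(name, "produced a single cluster; no score")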
It is true that high-dimensional data cannot easily be visualized in a Euclidean coordinate system, but it is not true that there are no visualization techniques for them.
In addition to this claim, I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 pairs in total) to figure out which relations occur between them (correlation and dependency, generally). Or you can even use a 3D space for three at a time.
Then, how do you get information from these visualizations? Well, it is not as easy as in a Euclidean space, but the point is to spot visually whether the data clusters into groups (e.g. near some values on an axis in a parallel coordinates diagram) and to think about whether the data is somehow separable (e.g. whether it forms regions like circles or linearly separable areas in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist even in N dimensions). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you have to choose the correct clustering algorithm accordingly.
I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform it, since you can change its topology with transformations, projections and reductions; check the other answer by lejlot for that), like Weka (plus you do not have to implement all the algorithms by yourself).
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results from different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with Euclidean distance, for example.

How to speed up svm.predict?

I'm writing a sliding window to extract features and feed it into CvSVM's predict function.
However, what I've stumbled upon is that the svm.predict function is relatively slow.
Basically the window slides through the image with a fixed stride length, over a number of image scales.
The time to traverse the image plus extract features for each window is around 1000 ms (1 sec).
Adding the weak classifiers trained by AdaBoost brings this to around 1200 ms (1.2 secs).
However, when I pass the features that the weak classifiers marked as positive to the svm.predict function, the overall time slows down to around 16000 ms (16 secs).
Collecting all 'positive' features first, then passing them to svm.predict using TBB's threads, resulted in 19000 ms (19 secs), probably due to the overhead of creating the threads, etc.
My OpenCV build was compiled to include both TBB (threading) and OpenCL (GPU) functions.
Has anyone managed to speed up OpenCV's SVM.predict function ?
I've been stuck on this issue for quite some time, since it's frustrating to run this detection algorithm through my test data for statistics and threshold adjustment.
Thanks a lot for reading through this!
(Answer posted to formalize my comments, above:)
The prediction algorithm for an SVM takes O(nSV * f) time, where nSV is the number of support vectors and f is the number of features. The number of support vectors can typically be reduced by increasing the penalty parameter C, i.e. by penalizing margin violations more heavily (possibly at a cost in predictive accuracy).
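As a quick illustration of the C / support-vector trade-off, here is a sketch using scikit-learn rather than OpenCV's CvSVM, on synthetic data, so only the principle carries over:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} support vectors")
# Fewer support vectors means proportionally faster predict().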
I'm not sure what features you are extracting, but from the size of your feature vector (3780) I would say you are extracting HOG. There is a very robust, optimized, and fast way of doing HOG "prediction" in the cv::HOGDescriptor class. All you need to do is:
extract your HOGs for training
put them in the svmLight format
use svmLight linear kernel to train a model
calculate the 3780 + 1 dimensional vector necessary for prediction
feed the vector to setSvmDetector() method of cv::HOGDescriptor object
use detect() or detectMultiScale() methods for detection
The following document has very good information about how to achieve what you are trying to do: http://opencv.willowgarage.com/wiki/trainHOG although I must warn you that there is a small problem in the original program, but it teaches you how to approach this problem properly.
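For reference, here is what the detection path looks like from OpenCV's Python bindings; the built-in people detector stands in for the custom (3780 + 1)-dimensional vector you would compute from your own svmLight model, and the image path is a placeholder:

import cv2

hog = cv2.HOGDescriptor()
# For a custom model, pass your own coefficient vector here instead
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("test.jpg")                       # placeholder path
rects, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)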
As Fred Foo has already mentioned, you have to reduce the number of support vectors. From my experience, 5-10% of the training set is enough to get a good level of prediction.
Other ways to make it faster:
reduce the size of the feature vector. 3780 is way too much. I'm not sure what a feature of this size describes in your case, but from my experience, for example, a description of an image like an automobile logo can effectively be packed into 150-200 dimensions:
PCA can be used to reduce the size of the feature as well as reduce its "noise". There are examples of how it can be used with SVM;
if that does not help, try other principles of image description, for example LBP and/or LBP histograms
LDA (alone or with SVM) can also be used.
Try a linear SVM first. It is much faster, and your feature size of 3780 dimensions provides more than enough "space" for good separation in higher dimensions if your sets are linearly separable in principle. If that is not good enough, try an RBF kernel with a fairly standard setup like C = 1 and gamma = 0.1. Only after that try POLY, the slowest one.
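A minimal sketch of that order of experiments, again with scikit-learn standing in for CvSVM and synthetic data, just to compare the training and prediction cost of the different kernels:

import time
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=2000, n_features=378, random_state=0)

for name, clf in [("linear", LinearSVC(C=1.0)),
                  ("rbf",    SVC(kernel="rbf", C=1.0, gamma=0.1)),
                  ("poly",   SVC(kernel="poly", degree=3, C=1.0))]:
    t0 = time.time()
    clf.fit(X, y)
    clf.predict(X)
    print(f"{name:>6}: {time.time() - t0:.2f} s (fit + predict)")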
