Preprocessing before CNN: Resizing vs Cropping - opencv

I'm using a simple neural network (similar to AlexNet) to classify images into categories. As a preprocessing stage, input images are resized to 256x256 before being fed into the network.
Lately, I have run into the following problem: Many of the images I deal with are of very high resolution (say, 2000x2000). In this case, doing a "hard resize" results in a severe loss of information. For example, a small 100x100 face, easily recognisable in the original image, would be unrecognisable in the resized version. In such cases, I may prefer taking several crops of the 2000x2000 image and run the classification on each crop.
I'm looking for a method to automatically determine which type of pre-processing is most adequate. Ideally, it would be able to recognize, for example, that a high resolution image of a single face should be resized, whereas a high resolution image of a crowd should be cropped several times. The basic requirements, on my part:
As computationally efficient as possible. Hence, something like a "sliding window" would be probably be ruled out (it is computationally cheaper to just crop all the images).
Ability to balance between recall and precision
What I considered thus far:
"Low-level" (image processing) approach: Implement an algorithm that uses local image information (like gradients) to distinguish between high resolution and low resolution images.
"High-level" (semantic) approach: Run the images through a pre-trained network for segmentation of some sort, and use its oputput to determine the appropriate pre-procssing.
I want to try the first option first, but not exactly sure how to go about it. Is there anything I can do in the Fourier domain? Something in OpenCv I can try? Does anyone have any suggestions/thoughts? Other ideas would be very welcome too. Thanks!

Related

semantic segmentation for large images

I am working on a limited number of large size images, each of which can have 3072*3072 pixels. To train a semantic segmentation model using FCN or U-net, I construct a large sample of training sets, each training image is 128*128.
In the prediction stage, what I do is to cut a large image into small pieces, the same as trainning set of 128*128, and feed these small pieces into the trained model, get the predicted mask. Afterwards, I just stitch these small patches together to get the mask for the whole image. Is this the right mechanism to perform the semantic segmentation against the large images?
Your solution is often used for this kind of problem. However, I would argue that it depends on the data if it truly makes sense. Let me give you two examples you can still find on kaggle.
If you wanted to mask certain parts of satellite images, you would probably get away with this approach without a drop in accuracy. These images are highly repetitive and there's likely no correlation between the segmented area and where in the original image it was taken from.
If you wanted to segment a car from its background, it wouldn't be desirable to break it into patches. Over several layers the network will learn the global distribution of a car in the frame. It's very likely that the mask is positive in the middle and negative in the corners of the image.
Since you didn't give any specifics what you're trying to solve, I can only give a general recommendation: Try to keep the input images as large as your hardware allows. In many situation I would rather downsample the original images than breaking it down into patches.
Concerning the recommendation of curio1729, I can only advise against training on small patches and testing on the original images. While it's technically possible thanks to fully convolutional networks, you're changing the data to an extend, that might very likely hurt performance. CNNs are known for their extraction of local features, but there's a large amount of global information that is learned over the abstraction of multiple layers.
Input image data:
I would not advice feeding the big image (3072x3072) directly into the caffe.
Batch of small images will fit better into the memory and parallel programming will too come into play.
Data Augmentation will also be feasible.
Output for big Image:
As for the output of big Image, you better recast the input size of FCN to 3072x3072 during test phase. Because, layers of FCN can accept inputs of any size.
Then you will get 3072x3072 segmented image as output.

Does image size matter when training with TensorFlow?

I was wondering if there is any benefit to training on high resolution images rather than low resolution. I understand that it will take longer to train on larger images and that the dimensions must be a multiple of 32. My current image set is 1440x1920. Would I be better off resizing to 480x640, or is bigger better?
It's certainly not a requirement that your images be powers of two. There may be some cases where it speeds things up (e.g. GPU allocation) but it's not critical.
Smaller images will train significantly faster, and possibly even converge quicker (all other factors held constant) as you will be able to train on bigger batches (e.g. 100-1000 images in one pass, which you might not be able to do on a single machine with high res imagery).
As to whether to resize, you need to ask yourself if every pixel in that image is critical to your task. Often this is not the case - you can probably resize a photo of a bus down to say 128x128 and still recognize that it's a bus.
Using smaller images can also help your network generalise better, too, as there is less data to overfit.
A technique often used in image classification networks is to perform distortions (e.g. random cropping, scaling & brightness adjustment) on images to (a) convert odd-sized images to a constant size, (b) synthesize more data and (c) encourage the network to generalise.
This depends largely on the application. As a rule of thumb, I'd ask myself the question: can I complete the task myself on the resized images? If so, I'd downsize to the lowest resolution before it makes the task more difficult for you yourself. If not... you're going to have to be -very- patient using images 1440 * 1920. I imagine you'll almost always be better off experimenting with more varied architectures and hyper-parameter sets on smaller images compared to fewer models on full resolution images.
Whatever size you choose, you'll have to design your network for the image size you have in mind. If you're using convolutional layers, a larger image will require larger strides, filter sizes and/or layers. The number of parameters will stay the same for each convolution, though the number of features will grow (along with batch normalisation parameters if you're using it).

is it possible to take low resolution image from street camera, increase it and see image details

I would like to know if it is possible to take low resolution image from street camera, increase it
and see image details (for example a face, or car plate number). Is there any software that is able to do it?
Thank you.
example of image: http://imgur.com/9Jv7Wid
Possible? Yes. In existence? not to my knowledge.
What you are referring to is called super-resolution. The way it works, in theory, is that you combine multiple low resolution images, and then combine them to create a high-resolution image.
The way this works is that you essentially map each image onto all the others to form a stack, where the target portion of the image is all the same. This gets extremely complicated extremely fast as any distortion (e.g. movement of the target) will cause the images to differ dramatically, on the pixel level.
But, let's you have the images stacked and have removed the non-relevant pixels from the stack of images. You are left hopefully with a movie/stack of images that all show the exact same image, but with sub-pixel distortions. A sub-pixel distortion simply means that the target has moved somewhere inside the pixel, or has moved partially into the neighboring pixel.
You can't measure if the target has moved within the pixel, but you can detect if the target has moved partially into a neighboring pixel. You can do this by knowing that the target is going to give off X amount of photons, so if you see 1/4 of the photons in one pixel and 3/4 of the photons in the neighboring pixel you know it's approximate location, which is 3/4 in one pixel and 1/4 in the other. You then construct an image that has a resolution of these sub-pixels and place these sub-pixels in their proper place.
All of this gets very computationally intensive, and sometimes the images are just too low-resolution and have too much distortion from image to image to even create a meaningful stack of images. I did read a paper about a lab in a university being able to create high-resolution images form low-resolution images, but it was a very very tightly controlled experiment, where they moved the target precisely X amount from image to image and had a very precise camera (probably scientific grade, which is far more sensitive than any commercial grade security camera).
In essence to do this in the real world reliably you need to set up cameras in a very precise way and they need to be very accurate in a particular way, which is going to be expensive, so you are better off just putting in a better camera than relying on this very imprecise technique.
Actually it is possible to do super-resolution (SR) out of even a single low-resolution (LR) image! So you don't have to hassle taking many LR images with sub-pixel shifts to achieve that. The intuition behind such techniques is that natural scenes are full of many repettitive patterns that can be use to enahance the frequency content of similar patches (e.g. you can implement dictionary learning in your SR reconstruction technique to generate the high-resolution version). Sure the enhancment may not be as good as using many LR images but such technique is simpler and more practicle.
Photoshop would be your best bet. But know that you cannot reliably inclrease the size of an image without making the quality even worse.

image resolution for training data for vehicle detection and tracking?

I'm new to computer vision. I'm working on a research project whose objective is (1) to detect vehicles from images and videos, and then later on (2) to be able to track moving vehicles.
I'm at the initial stage where I'm collecting training data, and I'm really concerned about getting images which are at an optimum resolution for detection and tracking.
Any ideas? The current dataset I've been given (from a past project) has images of about 1200x600 pixels. But I've been told this may or may not be an optimum resolution for the detection and tracking task. Apart from considering the fact that I will be extracting haar-like features from the images, I can't think of any factor to include in making a resolution decision. Any ideas of what a good resolution ought to be for training data images in this case?
First of all, feeding raw images directly to classifiers does not produce great results although sometimes useful such as face-detection. So you need to think about feature extraction.
One big issue is that a 1200x600 has 720,000 pixels. This defines 720,000 dimensions and it poses a challenge for training and classification because of dimension explosion.
So basically you need to scale down your dimensions particularly using feature extraction. What features to detect? It completely depends on the domain.
Another important aspect is the speed. Processing bigger images takes more time and this is especially important for processing real-time images which is something of 15-30 fps.
In my project (see my profile) which was real-time (15fps), I was working on 640x480 images and for some operations I had to scale down to improve performance.
Hope this helps.

find mosquitos' head in the image

I have images of mosquitos similar to these ones and I would like to automatically circle around the head of each mosquito in the images. They are obviously in different orientations and there are random number of them in different images. some error is fine. Any ideas of algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to recreate your training set. For this you would like to extract small images with examples of what is a mosquito head or what is not.
Then you can use those images to train a classification algorithm, be careful to have a balanced training set, since if your data is skewed to one class it would hit the performance of the algorithm. Since images are 2D and algorithms usually just take 1D arrays as input, you will need to arrange your images to that format as well (for instance: http://en.wikipedia.org/wiki/Row-major_order).
I normally use support vector machines, but other algorithms such as logistic regression could make the trick too. If you decide to use support vector machines I strongly recommend you to check libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings to several programming languages. Also they have a very easy to follow guide targeted to beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, you should be able to avoid tolerance to orientation. If you don't have enough data, then you could create more training rows with some samples rotated, so you would have a more representative training set.
As for the prediction what you could do is given an image, cut it using a grid where each cell has the same dimension that the ones you used on your training set. Then you pass each of this image to the classifier and mark those squares where the classifier gave you a positive output. If you really need circles then take the center of the given square and the radius would be the half of the square side size (sorry for stating the obvious).
So after you do this you might have problems with sizes (some mosquitos might appear closer to the camera than others) , since we are not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitos in the same scale, we still might miss some of them just because they didn't fit in our grid perfectly. To address this, we will need to repeat this procedure (grid cut and predict) rescaling the given image to different sizes. How many sizes? well here you would have to determine that through experimentation.
This approach is sensitive to the size of the "window" that you are using, that is also something I would recommend you to experiment with.
There are some research may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitoes before taking the picture, and use a white unmarked underground, perhaps even something illuminated from below. This will make separating the mosquitoes much easier.
Then threshold the image. For example here i did a quick try taking the red channel, then substracting the blue channel*5, then applying a threshold of 80:
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be moquitoes by Connected Component Labeling. If a blob is large enough to be two mosquitoes, cut it out, and apply some more dilation/erosion to it.
Once you have a single blob like this
you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is the thickest.

Resources