Neural networks can be trained to recognize an object, then detect occurrences of that object in an image, regardless of their position and apparent size. An example of doing this in PyTorch is at https://towardsdatascience.com/object-detection-and-tracking-in-pytorch-b3cf1a696a98
As the text observes,
Most of the code deals with resizing the image to a 416px square while maintaining its aspect ratio and padding the overflow.
So the idea is that the model always deals with 416px images, both in training and in the actual object detection. Detected objects, being only part of the image, will typically be smaller than 416px, but that's okay because the model has been trained to detect patterns in a scale-invariant way. The only thing fixed is the size in pixels of the input image.
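A rough sketch of that resize-and-pad ("letterbox") step, in case it helps to see it in code (PIL; the 416px target and the grey pad value of 128 follow a common YOLO convention, not necessarily the linked article's exact code):

```python
from PIL import Image

def letterbox(img, size=416, pad_color=(128, 128, 128)):
    # Scale the longer side down to `size`, keeping the aspect ratio,
    # then paste the result onto a square canvas of the padding color.
    w, h = img.size
    scale = size / max(w, h)
    resized = img.resize((round(w * scale), round(h * scale)))
    canvas = Image.new("RGB", (size, size), pad_color)
    canvas.paste(resized, ((size - resized.width) // 2, (size - resized.height) // 2))
    return canvas
```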
I'm looking at a context in which it is necessary to do the reverse: train to detect patterns of a fixed size, then detect them in a variable-sized image. For example, train to detect patterns 10px square, then look for them in an image that could be 500px or 1000px square, without resizing the image, but with the assurance that it is only necessary to look for 10px occurrences of the pattern.
Is there an idiomatic way to do this in PyTorch?
Even if you trained your detector on a fixed image size, you can use different sizes at inference time, because everything is convolutional in Faster R-CNN/YOLO architectures. On the other hand, if you only care about 10x10 bounding-box detections, you can easily define this as your anchors. I would recommend using the Detectron2 framework, which is implemented in PyTorch and is easily configurable/hackable.
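A minimal sketch of that idea with Detectron2 (this assumes Detectron2's standard config keys and the COCO Faster R-CNN baseline; the weights path is hypothetical, and whether a single 10px anchor actually suits your data is something to verify):

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[10]]           # propose only ~10 px boxes
cfg.MODEL.ANCHOR_GENERATOR.ASPECT_RATIOS = [[1.0]]  # square boxes only
cfg.INPUT.MIN_SIZE_TEST = 0                         # 0 disables test-time resizing
cfg.MODEL.WEIGHTS = "path/to/your/trained_weights.pth"  # hypothetical path

predictor = DefaultPredictor(cfg)
# The backbone is fully convolutional, so a 500x500 or 1000x1000 image can be
# passed at inference time without retraining:
# outputs = predictor(image_bgr)  # image_bgr: HxWx3 uint8 numpy array
```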
Related
I have image patches from DDSM Breast Mammography that are 150x150 in size. I would like to augment my dataset by randomly cropping each of these images twice to 120x120. So, if my dataset contains 6500 images, augmenting it with random crops should get me to 13000 images. The thing is, I do NOT want to lose potential information in the image or possibly change the ground truth label.
What would be the best way to do this? Should I crop them randomly from 150x150 to 120x120 and hope for the best, or maybe pad them first and then perform the cropping? What is the standard way to approach this problem?
If your ground truth contains the exact location of what you are trying to classify, use that location to crop your images in an informed way, and adjust the ground truth if the crop removes what you are trying to classify.
If you don't know the location of what you are classifying, you could
attempt to train a classifier on your un-augmented dataset,
find out which regions of the images your classifier reacts to,
make note of these locations,
crop your images in an informed way,
train a new classifier.
But how do you "find out which regions your classifier reacts to"?
Multiple ways are described in Visualizing and Understanding Convolutional Networks by Zeiler and Fergus:
Imagine your classifier classifies breast cancer vs. no breast cancer. Take an image that contains positive evidence of breast cancer and occlude part of it with some blank color (the grey square in the occlusion figure from Zeiler et al.), then predict cancer or not. Now move the occluding square around. In the end you get rough prediction scores for every part of your original image (panel (d) in that figure), because when you cover up the important part that is responsible for a positive prediction, you (should) get a negative cancer prediction.
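A minimal occlusion-sensitivity sketch in PyTorch (this assumes a binary classifier `model` that returns a single cancer logit per image; patch size, stride and fill value are arbitrary choices):

```python
import torch

def occlusion_map(model, image, patch=16, stride=8, fill=0.5):
    """Slide a blank square over `image` (C, H, W) and record the predicted
    probability each time; the lowest scores mark the important regions."""
    model.eval()
    _, H, W = image.shape
    heat = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    with torch.no_grad():
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = fill
                heat[i, j] = torch.sigmoid(model(occluded.unsqueeze(0))).item()
    return heat
```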
If you have someone who can actually recognize cancer in an image, this is also a good way to check for and guard against confounding factors.
BTW: You might want to crop on-the-fly and randomize how you crop even more to generate way more samples.
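For the on-the-fly part, a torchvision sketch (assuming 150x150 inputs and the 120x120 target size from the question):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(120),  # a different 120x120 window on every access
    T.ToTensor(),
])
# Pass `transform=train_transform` to your Dataset so the crops are generated
# lazily during training instead of materializing 13000 files on disk.
```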
If the 150x150 patch is already the region of interest (ROI), you could try the following data augmentations (a rough sketch follows the list):
use a larger patch, e.g. 170x170 that always contains your 150x150 patch
use a larger patch, e.g. 200x200, and scale it down to 150x150
add some Gaussian noise to the image
rotate the image slightly (by random amounts)
change image contrast slightly
artificially emulate whatever other (image-)effects you see in the original dataset
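A rough torchvision version of the rotation/contrast/noise items (the 5-degree range, the 0.1 contrast jitter and the 0.02 noise standard deviation are arbitrary values to tune against your data):

```python
import torch
import torchvision.transforms as T

def add_gaussian_noise(x, std=0.02):
    # small additive noise on the [0, 1] image tensor
    return (x + torch.randn_like(x) * std).clamp(0.0, 1.0)

augment = T.Compose([
    T.RandomRotation(5),           # rotate slightly, by a random amount
    T.ColorJitter(contrast=0.1),   # change contrast slightly
    T.ToTensor(),
    T.Lambda(add_gaussian_noise),  # Gaussian noise after tensor conversion
])
```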
I am using the standard AlexNet model with image data of size 3*224*224. I artificially construct these images, which consist of numerous sub-images.
I am trying to recognize small, simple sub-images (100*2) that might sit at the side or in a corner of the 224*224 space.
Is AlexNet likely to handle this well? Or should a sub-image really take up most of the 224*224?
After some testing, I have found that AlexNet does not see very small objects within an image well. I suspect that this is due to the stride size.
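A quick way to see how coarse AlexNet's feature grid is relative to the input (torchvision 0.13+ API; conv1 uses an 11x11 kernel with stride 4, and three stride-2 poolings follow, for an overall stride of 32):

```python
import torch
from torchvision.models import alexnet

model = alexnet(weights=None)
x = torch.zeros(1, 3, 224, 224)
feats = model.features(x)
print(feats.shape)  # torch.Size([1, 256, 6, 6]): the 224x224 input collapses
                    # to a 6x6 grid, so a ~100x2 px structure at the border
                    # barely influences a single row of cells
```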
I'm working on a graduation project on image forgery detection using a CNN. Most of the papers I have read downscale the images before feeding the dataset to the network. I want to know: how does this process affect the image information?
Images are resized/rescaled to a specific size for a few reasons:
(1) It allows the user to set the input size to their network. When designing a CNN you need to know the shape (dimensions) of your data at each step; so, having a static input size is an easy way to make sure your network gets data of the shape it was designed to take.
(2) Using a full resolution image as the input to the network is very inefficient (super slow to compute).
(3) In most cases the features you want to extract/learn from an image are still present after downsampling. So, in a way, resizing an image to a smaller size denoises it, filtering out many of the unimportant details for you.
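For reference, the fixed-size resize usually sits at the front of the input pipeline, roughly like this (224x224 and bilinear interpolation are just common defaults, not a rule):

```python
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((224, 224), interpolation=T.InterpolationMode.BILINEAR),
    T.ToTensor(),
])
# Every image now has the shape the network was designed for, at the cost of
# discarding fine detail present in the full-resolution original.
```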
Well, you change the image's size, so of course you change its information.
You cannot reduce image size without omitting information. Simple case: Throw away every second pixel to scale image to 50%.
Scaling up adds new pixels. In its simplest form you duplicate pixels, creating redundant information.
More complex methods create the new pixels (fewer or more of them) by averaging neighbouring pixels or interpolating between them.
Scaling up is reversible: it neither creates nor destroys information.
Scaling down divides the amount of information by the square of the downscaling factor*. Upscaling after downscaling results in a blurred image.
(*This is true in a first approximation. If the image doesn't have high frequencies, they are not lost, hence no loss of information.)
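A quick way to see the loss described above is to shrink an image, blow it back up, and compare with the original (the factor of 2 is arbitrary, and "example.jpg" is a hypothetical input file):

```python
from PIL import Image

original = Image.open("example.jpg")
small = original.resize((original.width // 2, original.height // 2))
restored = small.resize(original.size)  # upscaling cannot recover the detail
# Flipping between `original` and `restored` shows the lost high frequencies
# as blur.
```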
I am trying some DCNNs to recognize handwritten words (word spotting), where the images are binary, and I am wondering whether the computation time will be shorter than for DCNNs on grayscale or color images.
In addition, how can one equalize the image sizes, given that normalizing the word images will produce words at different scales?
Any suggestions?
The computation time for grayscale images is certainly shorter, but not because of the zeros; it's simply the input tensor size. Color images are [batch, width, height, 3], while grayscale images are [batch, width, height, 1]. The difference in depth, as well as in spatial size, affects the time spent in the first convolutional layer, which is usually one of the most time-consuming. That's why you should consider resizing the images as well.
You may also want to read about 1x1 convolution trick to speed up computation. Usually it's applied in the middle of the network when the number of filters becomes significantly large.
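The 1x1 "bottleneck" trick alluded to here, as a minimal PyTorch module (256 and 64 channels are just illustrative numbers):

```python
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # cheap channel reduction
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # the expensive 3x3 now sees 64 channels
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1),            # restore the channel count
)
# Roughly 256*64 + 64*64*9 + 64*256 ≈ 70k weights instead of 256*256*9 ≈ 590k
# for a plain 3x3 convolution at 256 channels.
```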
As for the second question (if I understand it correctly), you ultimately have to resize the images. If the images contain text in different font sizes, one possible strategy is resize + pad, or crop + resize. You have to know the font size in each particular image to select the right padding or crop size, so this method may need a fair amount of manual work.
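One way to realize the resize + pad option with torchvision (the 128x384 target shape and the white fill are arbitrary choices for word images):

```python
import torchvision.transforms.functional as F

def resize_and_pad(img, target_w=384, target_h=128, fill=255):
    # Scale the word image to fit inside the target box without distortion,
    # then pad the remainder with a constant background.
    w, h = img.size
    scale = min(target_w / w, target_h / h)
    img = F.resize(img, [round(h * scale), round(w * scale)])
    pad_w, pad_h = target_w - img.width, target_h - img.height
    return F.pad(img, [pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2], fill=fill)
```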
A completely different approach would be to ignore these differences and let the network learn OCR despite the font-size discrepancy. It is a viable solution that doesn't require a lot of manual pre-processing, but it simply needs more training data to avoid overfitting. If you examine the MNIST dataset, you'll notice the digits are not always the same size, yet CNNs achieve 99.5% accuracy on it pretty easily.
I am a beginner in image mining. I would like to know the minimum dimensions required for effective classification of textured images. As I see it, if an image is too small the feature extraction step will not extract enough features, and if the image size goes beyond a certain dimension the processing time will increase exponentially with image size.
This is a complex question that requires a bit of thinking.
Short answer: It depends.
Long answer: It depends on the type of texture you want to classify and the type of feature your classification is based on. If the extracted feature is, say, color only, you can use a "texture" as small as 1x1 pixel (in that case, using the word "texture" is a bit of an abuse). If you want to classify, say, characters, you can usually extract a lot of local information from edges (Hough transform, Gabor filters, etc.). The image plane just has to be big enough to hold the characters (say 16x16 pixels for the Latin alphabet).
If you want to be able to classify any kind of image, in any quantity, you can also base your classification on global information, like entropy, correlogram, energy, inertia, cluster shade, cluster prominence, color and correlation. Those features are used for content-based image retrieval.
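A sketch of a few of those global texture features via a grey-level co-occurrence matrix (scikit-image 0.19+ spells these graycomatrix/graycoprops, older releases use greycomatrix/greycoprops; contrast here plays the role of inertia):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_features(gray_uint8):
    # gray_uint8: 2-D uint8 array; returns a small global feature vector.
    glcm = graycomatrix(gray_uint8, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "energy", "correlation", "homogeneity"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])
```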
Off the top of my head, I would try textures as small as 32x32 pixels if the kind of texture is a priori unknown. If on the contrary the kind of texture is a priori known, I would choose one or more features that I know would classify the images according to my needs (1x1 pixel for color only, 16x16 pixels for characters, etc.). Again, it really depends on what you are trying to achieve. There isn't a unique answer to your question.