I am using the standard AlexNet model with input images of size 3*224*224. I construct these images artificially; each one consists of numerous sub-images.
I am trying to recognize small, simple sub-images (100*2) that might be at the side or in a corner of the 224*224 space.
Is AlexNet likely to handle this well? Or should the sub-image really take up most of the 224*224?
After some testing, I have found that AlexNet does not see very small objects within an image well. I suspect that this is due to the stride size.
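One way to examine the stride suspicion is to accumulate the strides layer by layer. The sketch below assumes the layer configuration of the torchvision AlexNet feature extractor; the kernel sizes, strides, and padding are written out by hand rather than read from a model, and `trace` is a hypothetical helper name. It reports the output size, cumulative stride, and receptive field after each layer.

```python
# (name, kernel, stride, padding) per layer. ASSUMPTION: these values match
# the torchvision AlexNet feature extractor; adjust them for other variants.
ALEXNET_FEATURES = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]


def trace(layers, input_size):
    """Return (name, output_size, cumulative_stride, receptive_field) per layer."""
    size, jump, field = input_size, 1, 1
    rows = []
    for name, kernel, stride, padding in layers:
        # Standard convolution/pooling output-size formula.
        size = (size + 2 * padding - kernel) // stride + 1
        # The receptive field grows by (kernel - 1) input-pixel steps of the
        # current spacing; the spacing itself grows by the layer's stride.
        field += (kernel - 1) * jump
        jump *= stride
        rows.append((name, size, jump, field))
    return rows


if __name__ == "__main__":
    for row in trace(ALEXNET_FEATURES, 224):
        print(row)
```

Under these assumptions the last feature map is 6x6 with a cumulative stride of 32 pixels, so a sub-image that is only 2 pixels thick covers a small fraction of one output cell. That is consistent with the suspicion above, though it does not prove it.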
Neural networks can be trained to recognize an object, then detect occurrences of that object in an image, regardless of their position and apparent size. An example of doing this in PyTorch is at https://towardsdatascience.com/object-detection-and-tracking-in-pytorch-b3cf1a696a98
As the text observes,
Most of the code deals with resizing the image to a 416px square while maintaining its aspect ratio and padding the overflow.
So the idea is that the model always deals with 416px images, both in training and in the actual object detection. Detected objects, being only part of the image, will typically be smaller than 416px, but that's okay because the model has been trained to detect patterns in a scale-invariant way. The only thing fixed is the size in pixels of the input image.
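The resize-and-pad step described above comes down to a little arithmetic. This is a minimal sketch, not the linked article's code: `letterbox_geometry` is a hypothetical helper, and centring the padding is an assumption.

```python
def letterbox_geometry(width, height, target=416):
    """Geometry for fitting an image into a target x target square.

    Returns (new_w, new_h, pad_left, pad_top, pad_right, pad_bottom).
    ASSUMPTION: the padding is split as evenly as possible on both sides.
    """
    # Scale so that the longer side becomes exactly `target` pixels.
    scale = target / max(width, height)
    new_w = min(target, round(width * scale))
    new_h = min(target, round(height * scale))
    # Whatever is left over on the shorter side is filled with padding.
    pad_w = target - new_w
    pad_h = target - new_h
    pad_left, pad_top = pad_w // 2, pad_h // 2
    return new_w, new_h, pad_left, pad_top, pad_w - pad_left, pad_h - pad_top


if __name__ == "__main__":
    print(letterbox_geometry(1000, 500))
    print(letterbox_geometry(640, 480))
```

The aspect ratio is preserved because both sides are multiplied by the same scale; only the padding differs between images.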
I'm looking at a context in which it is necessary to do the reverse: train to detect patterns of a fixed size, then detect them in a variable sized image. For example, train to detect patterns 10px square, then look for them in an image that could be 500px or 1000px square, without resizing the image, but with the assurance that it is only necessary to look for 10px occurrences of the pattern.
Is there an idiomatic way to do this in PyTorch?
Even if you trained your detector on fixed-size images, you can use a different size at inference time, because everything is convolutional in Faster R-CNN/YOLO architectures. On the other hand, if you only care about 10x10 bounding-box detections, you can easily define this as your anchors. I would recommend the detectron2 framework, which is implemented in PyTorch and is easily configurable/hackable.
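The point that convolutional models do not care about the input size can be illustrated without any framework. The sketch below is plain Python, not PyTorch: `score_map`, `make_image`, and `argmax2d` are hypothetical helpers, and the 3x3 pattern stands in for learned weights (a 10x10 window works the same way, just more slowly in pure Python). In PyTorch, a stack of `torch.nn.Conv2d` layers with no flatten or linear layer behaves the same way with respect to input size.

```python
def score_map(image, kernel):
    """Slide a fixed-size window over an image of any size (no padding).

    Returns one score per window position, so the output size follows the
    input size while the window (the "trained" part) stays fixed.
    """
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    scores = []
    for top in range(ih - kh + 1):
        row = []
        for left in range(iw - kw + 1):
            row.append(sum(kernel[dy][dx] * image[top + dy][left + dx]
                           for dy in range(kh) for dx in range(kw)))
        scores.append(row)
    return scores


def make_image(size, pattern, top, left):
    """Blank size x size image with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for dy, row in enumerate(pattern):
        for dx, value in enumerate(row):
            image[top + dy][left + dx] = value
    return image


def argmax2d(scores):
    """(row, column) of the highest score."""
    _, y, x = max((v, y, x) for y, row in enumerate(scores)
                  for x, v in enumerate(row))
    return y, x


# ASSUMPTION: a fixed 3x3 "plus" pattern plays the role of the learned detector.
PLUS = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]

if __name__ == "__main__":
    for size, (top, left) in [(8, (2, 3)), (12, (7, 1))]:
        scores = score_map(make_image(size, PLUS, top, left), PLUS)
        print(size, len(scores), argmax2d(scores))
```

The same fixed window is applied to the 8px and the 12px image without resizing either one; only the size of the score map changes.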
I'm working on a graduation project on image forgery detection using a CNN. In most of the papers I have read, the images are downscaled before the dataset is fed to the network. I want to know how this process affects the image information.
Images are resized/rescaled to a specific size for a few reasons:
(1) It allows the user to set the input size to their network. When designing a CNN you need to know the shape (dimensions) of your data at each step; so, having a static input size is an easy way to make sure your network gets data of the shape it was designed to take.
(2) Using a full resolution image as the input to the network is very inefficient (super slow to compute).
(3) In most cases, the features to be extracted/learned from an image are still present after downsampling. So, in a way, resizing an image to a smaller size denoises it, filtering out many of the unimportant features for you.
Well, you change the image's size. Of course that changes its information.
You cannot reduce image size without omitting information. Simple case: Throw away every second pixel to scale image to 50%.
Scaling up adds new pixels. In its simplest form you duplicate pixels, creating redundant information.
More complex solutions create new pixels (fewer or more) by averaging neighbouring pixels or interpolating between them.
Scaling up is reversible. It neither creates nor destroys information.
Scaling down divides the amount of information by the square of the downscaling factor*. Upscaling after downscaling results in a blurred image.
(*This is true in a first approximation. If the image doesn't have high frequencies, they are not lost, hence no loss of information.)
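The asymmetry between the two directions can be checked on a toy image. This is a minimal sketch assuming the two simplest schemes mentioned above (drop every second pixel; duplicate pixels); the function names are made up for the illustration.

```python
def downscale_half(image):
    """Scale to 50% by keeping every second pixel in both directions."""
    return [row[::2] for row in image[::2]]


def upscale_double(image):
    """Scale to 200% by duplicating every pixel into a 2x2 block."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


original = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Upscaling first loses nothing: downscaling again recovers the original.
up_then_down = downscale_half(upscale_double(original))
# Downscaling first keeps only 4 of the 16 values; upscaling cannot restore them.
down_then_up = upscale_double(downscale_half(original))

if __name__ == "__main__":
    print(up_then_down == original)
    print(down_then_up == original)
```

With a downscaling factor of 2, only one value in four survives, which matches the "square of the downscaling factor" rule for an image with no redundancy between neighbouring pixels.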
I wish to use transfer learning to process images, and my images have different sizes.
I think in general convolutional layers can take variable input size, but fully connected layers can only take input of specific size.
However, the Keras implementations of VGG-16 and ResNet50 can take any image size larger than 32x32, although they do have fully connected layers. How do the fully connected layers get a fixed input size for different image dimensions?
Thanks very much!
What you are saying is misleading. You can build a VGG/ResNet Keras model with any input image size larger than 32x32, but once the model is built, you can't change the input size, and that is usually the problem. So the model cannot really take variable-sized images.
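To see why the input size gets frozen when the model is built, it helps to compute what reaches the first fully connected layer. The sketch below assumes the standard VGG-16 layout (five 2x2 max-pooling stages, 512 channels in the last convolutional block); the function names are hypothetical. As far as I know, the 32x32 minimum in Keras applies when the fully connected top is left out (`include_top=False`), and the bundled top expects 224x224 input.

```python
def vgg16_feature_shape(height, width, channels=512, num_pools=5):
    """Shape of the last convolutional feature map for a given input size.

    ASSUMPTION: 'same'-padded convolutions and five 2x2/stride-2 poolings,
    so each pooling stage halves the height and width (rounding down).
    """
    for _ in range(num_pools):
        height //= 2
        width //= 2
    return height, width, channels


def flattened_length(height, width):
    """Input length of a dense layer placed after a Flatten layer."""
    h, w, c = vgg16_feature_shape(height, width)
    return h * w * c


def globally_pooled_length(height, width):
    """Input length of a dense layer placed after global pooling."""
    return vgg16_feature_shape(height, width)[2]


if __name__ == "__main__":
    for size in [(224, 224), (320, 480), (32, 32)]:
        print(size, flattened_length(*size), globally_pooled_length(*size))
```

The flattened length changes with the image size, so a dense layer built for one size does not fit another, whereas a globally pooled vector always has one value per channel.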
I am working with a limited number of large images, each of which can be 3072*3072 pixels. To train a semantic segmentation model using FCN or U-Net, I construct a large set of training samples; each training image is 128*128.
In the prediction stage, I cut a large image into small pieces of 128*128, the same size as in the training set, feed these pieces into the trained model, and get the predicted masks. Afterwards, I just stitch these small patches together to get the mask for the whole image. Is this the right mechanism for performing semantic segmentation on large images?
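The cut, predict, and stitch procedure described above can be sketched as follows. This is a minimal illustration with non-overlapping tiles; `predict_tiled` and `tile_origins` are hypothetical helpers, and `threshold_patch` is only a stand-in for the trained FCN/U-Net.

```python
def tile_origins(height, width, tile):
    """Top-left corners of non-overlapping tiles covering the image."""
    if height % tile or width % tile:
        raise ValueError("image size must be a multiple of the tile size")
    return [(top, left)
            for top in range(0, height, tile)
            for left in range(0, width, tile)]


def predict_tiled(image, tile, predict_patch):
    """Cut `image` into tiles, predict each one, stitch the masks together."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for top, left in tile_origins(height, width, tile):
        patch = [row[left:left + tile] for row in image[top:top + tile]]
        patch_mask = predict_patch(patch)
        # Write the patch mask back at the position the patch came from.
        for dy, mask_row in enumerate(patch_mask):
            mask[top + dy][left:left + tile] = mask_row
    return mask


def threshold_patch(patch):
    """STAND-IN for the trained model: marks pixels brighter than 127."""
    return [[1 if value > 127 else 0 for value in row] for row in patch]


if __name__ == "__main__":
    print(len(tile_origins(3072, 3072, 128)))  # number of 128x128 tiles
    image = [[(y * 8 + x) * 4 for x in range(8)] for y in range(8)]
    print(predict_tiled(image, 4, threshold_patch) == threshold_patch(image))
```

Because the stand-in looks at one pixel at a time, the stitched mask equals the whole-image result exactly. A real network sees a neighbourhood around each pixel, so its predictions can differ near tile borders.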
Your solution is often used for this kind of problem. However, I would argue that whether it truly makes sense depends on the data. Let me give you two examples that you can still find on Kaggle.
If you wanted to mask certain parts of satellite images, you would probably get away with this approach without a drop in accuracy. These images are highly repetitive and there's likely no correlation between the segmented area and where in the original image it was taken from.
If you wanted to segment a car from its background, it wouldn't be desirable to break it into patches. Over several layers the network will learn the global distribution of a car in the frame. It's very likely that the mask is positive in the middle and negative in the corners of the image.
Since you didn't give any specifics about what you're trying to solve, I can only give a general recommendation: try to keep the input images as large as your hardware allows. In many situations I would rather downsample the original images than break them down into patches.
Concerning the recommendation of curio1729, I can only advise against training on small patches and testing on the original images. While it's technically possible thanks to fully convolutional networks, you're changing the data to an extent that might very likely hurt performance. CNNs are known for their extraction of local features, but there's a large amount of global information that is learned over the abstraction of multiple layers.
Input image data:
I would not advise feeding the big image (3072x3072) directly into Caffe.
A batch of small images will fit better into memory, and parallel processing will also come into play.
Data Augmentation will also be feasible.
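A rough calculation shows why memory is the concern. The sketch below estimates the size of a single activation tensor; the 64-channel first layer and 32-bit floats are assumptions chosen for illustration, not taken from the model in question, and `activation_bytes` is a hypothetical helper.

```python
MIB = 1024 ** 2


def activation_bytes(height, width, channels, batch=1, bytes_per_value=4):
    """Memory needed to hold one activation tensor, in bytes."""
    return batch * height * width * channels * bytes_per_value


# ASSUMPTION: a first convolutional layer with 64 output channels that keeps
# the spatial resolution, storing 32-bit floats.
full_image = activation_bytes(3072, 3072, 64)
one_patch = activation_bytes(128, 128, 64)
patch_batch = activation_bytes(128, 128, 64, batch=32)

if __name__ == "__main__":
    print(full_image // MIB, "MiB for one 3072x3072 image")
    print(one_patch // MIB, "MiB for one 128x128 patch")
    print(patch_batch // MIB, "MiB for a batch of 32 patches")
```

This counts one layer's output only; training also has to keep the activations of every layer for the backward pass, so the real footprint is several times larger.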
Output for big Image:
As for the output for the big image, it is better to change the input size of the FCN to 3072x3072 during the test phase, because the layers of an FCN can accept inputs of any size.
Then you will get 3072x3072 segmented image as output.
I am a beginner in image mining. I would like to know the minimum image dimensions required for effective classification of textured images. My feeling is that if an image is too small, the feature extraction step will not extract enough features, and if the image size goes beyond a certain dimension, the processing time will increase exponentially with image size.
This is a complex question that requires a bit of thinking.
Short answer: It depends.
Long answer: It depends on the type of texture you want to classify and the type of feature your classification is based on. If the feature extracted is, say, color only, you can use a "texture" as small as 1x1 pixel (in that case, using the word "texture" is a bit of an abuse). If you want to classify, say, characters, you can usually extract a lot of local information from edges (Hough transform, Gabor filters, etc.). The image plane just has to be big enough to hold the characters (say 16x16 pixels for the Latin alphabet).
If you want to be able to classify any kind of images in any kind of number, you can also base your classification on global information, like entropy, correlogram, energy, inertia, cluster shade, cluster prominence, color and correlation. Those features are used for content based image retrieval.
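Several of the features listed above are commonly computed from a gray-level co-occurrence matrix (GLCM). The sketch below assumes a single horizontal offset of one pixel and uses one common set of definitions (energy as the sum of squared probabilities, inertia as the contrast term); definitions vary between libraries, and the function names are hypothetical.

```python
import math
from collections import Counter


def glcm_horizontal(patch):
    """Normalised co-occurrence of each pixel with its right-hand neighbour."""
    pairs = Counter((row[x], row[x + 1])
                    for row in patch for x in range(len(row) - 1))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}


def texture_features(patch):
    """Energy, entropy (in bits), and inertia of the horizontal GLCM.

    ASSUMPTION: these are one common choice of definitions, not the only one.
    """
    p = glcm_horizontal(patch)
    return {
        "energy": sum(v * v for v in p.values()),
        "entropy": sum(v * math.log2(1 / v) for v in p.values()),
        "inertia": sum((i - j) ** 2 * v for (i, j), v in p.items()),
    }


CHECKERBOARD = [[(x + y) % 2 for x in range(4)] for y in range(4)]
FLAT = [[0] * 4 for _ in range(4)]

if __name__ == "__main__":
    print(texture_features(CHECKERBOARD))
    print(texture_features(FLAT))
```

Even these 4x4 patches separate a flat region from a checkerboard, which fits the point that the minimum useful size depends on the texture and on the feature.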
Off the top of my head, I would try using textures as small as 32x32 pixels if the kind of texture you are using is a priori unknown. If, on the contrary, the kind of texture is known a priori, I would choose one or more features that I know would classify the images according to my needs (1x1 pixel for color-only, 16x16 pixels for characters, etc.). Again, it really depends on what you are trying to achieve. There isn't a unique answer to your question.