I am trying some DCNN to recognize handwriting words (word spotting) where the images are binary, and I am wondering if the computation time will be faster than using DCNNs with other gray-level or color images.
In addition, how one can equalize the image sizes, as normalizing the images of words will produce words with different scales.
Any suggestions?
The computation time for gray-scale images is certainly faster, but not due to zeros, it's simply the input tensor size. Color images are [batch, width, height, 3], while gray-scale images are [batch, width, height, 1]. The difference in depth, as well as in spatial size, affects the time spent on the first convolutional layer, which is usually one of the most time-consuming. That's why consider resizing the images as well.
You may also want to read about 1x1 convolution trick to speed up computation. Usually it's applied in the middle of the network when the number of filters becomes significantly large.
As for the second question (if I get it right), ultimately you have to resize the images. If the images contain the texts of different font sizes, one possible strategy is to resize + pad or crop + resize. You have to know the font size on each particular image to select the right padding or crop size. This method needs (possibly) fair amount of manual work.
A completely different way would to ignore these differences and let the network learn OCR, despite the font size discrepancy. It is a viable solution, doesn't require a lot of manual pre-processing, but simply needs more training data to avoid overfitting. If you examine MNIST dataset, you notice the digits are not always the same size, yet CNNs achieve 99.5% accuracy pretty easily.
Related
i'm work on graduation project for image forgery detection using CNN , Most of the paper i read before feed the data set to the network they Down scale the image size, i want to know how Does this process effect image information ?
Images are resized/rescaled to a specific size for a few reasons:
(1) It allows the user to set the input size to their network. When designing a CNN you need to know the shape (dimensions) of your data at each step; so, having a static input size is an easy way to make sure your network gets data of the shape it was designed to take.
(2) Using a full resolution image as the input to the network is very inefficient (super slow to compute).
(3) For most cases the features desired to be extracted/learned from an image are also present when downsampling the image. So in a way resizing an image to a smaller size will denoise the image, filtering out much of the unimportant features within the image for you.
Well you change the images size. Of course it changes it's information.
You cannot reduce image size without omitting information. Simple case: Throw away every second pixel to scale image to 50%.
Scaling up adds new pixels. In its simplest form you duplicate pixels, creating redundant information.
More complex solutions create new pixels (less or more) by averaging neighbouring pixels or interpolating between them.
Scaling up is reversible. It doesn't create nor destroy information.
Scaling down divides the amount of information by the square of the downscaling factor*. Upscaling after downscaling results in a blurred image.
(*This is true in a first approximation. If the image doesn't have high frequencies, they are not lost, hence no loss of information.)
How can size of an image effect training the model for this task?
My current training set holds images that are 2880 X 1800, but I am worried this may be too large to train. In total my sample size will be about 200-500 images.
Would this just mean that I need more resources (GPU,RAM, Distribution) when training my model?
If this is too large, how should I go about resizing? -- I want to mimic real-world photo resolutions as best as possible for better accuracy.
Edit:
I would also be using TFRecord format for the image files
Your memory and processing requirements will be proportional to the pixel size of your image. Whether this is too large for you to process efficiently will depend on your hardware constraints and the time you have available.
With regards to resizing the images there is no one answer, you have to consider how to best preserve information that'll be required for your algorithm to learn from your data while removing information that won't be useful. Reducing the size of your input images won't necessarily be a negative for accuracy. Consider two cases:
Handwritten digits
Here the images could be reduced considerably in size and maintain all the structural information necessary to be correctly identified. Have a look at the MNIST data set, these images are distributed at 28 x 28 resolution and identifiable to 99.7%+ accuracy.
Identifying Tree Species
Imagine a set of images of trees where individual leaves could help identify species. Here you might find that reducing the image size reduces small scale detail on leaf shape in a way that's detrimental to the model, but you might find that you get a similar result with a tight crop (which preserves individual leaves) rather than an image resize. If this is the case you may find that creating multiple crops from the same image gives you an augmented data set for training that considerably improves results (which is something to consider, if possible, given your training set is very small)
Deep learning models are achieving results around human level in many image classification tasks: if you struggle to identify your own images then it's less likely you'll train an algorithm to. This is often a useful starting point when considering the level of scaling that might be appropriate.
If you are using GPUs to train, this will def affect your training time. Tensorflow does most of the GPU allocation so you don't have to worry about that. But with big photos you will be experiencing long training time although your dataset is small. You should consider data-augmentation.
You could complement your resizing with the data-augmentation. Resize in equal dimensions and then perform reflection and translation (as in geometric movement)
If your images are too big, your GPU might run out of memory before it can start training because it has to store the convolution outputs on its memory. If that happens, you can do some of the following things to reduce memory consumption:
resize the image
reduce batch size
reduce model complexity
To resize your image, there are many scripts just one Google search away, but I will add that in your case 1440 by 900 is probably a sweet spot.
Higher resolution images will result in a higher training time and an increased memory consumption (mainly GPU memory).
Depending on your concrete task, you might want to reduce the image size in order to therefore fit a reasonable batch size of let's say 32 or 64 on the GPU - for stable learning.
Your accuracy is probably affected more by the size of your training set. So instead of going for image size, you might want to go for 500-1000 sample images. Recent publications like SSD - Single Shot MultiBox Detector achieve high accuracy values like an mAP of 72% on the PascalVOC dataset - with "only" using 300x300 image resolution.
Resizing and augmentation: SSD for instance just scales every input image down to 300x300, independent of the aspect ratio - does not seem to hurt. You could also augment your data by mirroring, translating, ... etc (but I assume there are built-in methods in Tensorflow for that).
I was wondering if there is any benefit to training on high resolution images rather than low resolution. I understand that it will take longer to train on larger images and that the dimensions must be a multiple of 32. My current image set is 1440x1920. Would I be better off resizing to 480x640, or is bigger better?
It's certainly not a requirement that your images be powers of two. There may be some cases where it speeds things up (e.g. GPU allocation) but it's not critical.
Smaller images will train significantly faster, and possibly even converge quicker (all other factors held constant) as you will be able to train on bigger batches (e.g. 100-1000 images in one pass, which you might not be able to do on a single machine with high res imagery).
As to whether to resize, you need to ask yourself if every pixel in that image is critical to your task. Often this is not the case - you can probably resize a photo of a bus down to say 128x128 and still recognize that it's a bus.
Using smaller images can also help your network generalise better, too, as there is less data to overfit.
A technique often used in image classification networks is to perform distortions (e.g. random cropping, scaling & brightness adjustment) on images to (a) convert odd-sized images to a constant size, (b) synthesize more data and (c) encourage the network to generalise.
This depends largely on the application. As a rule of thumb, I'd ask myself the question: can I complete the task myself on the resized images? If so, I'd downsize to the lowest resolution before it makes the task more difficult for you yourself. If not... you're going to have to be -very- patient using images 1440 * 1920. I imagine you'll almost always be better off experimenting with more varied architectures and hyper-parameter sets on smaller images compared to fewer models on full resolution images.
Whatever size you choose, you'll have to design your network for the image size you have in mind. If you're using convolutional layers, a larger image will require larger strides, filter sizes and/or layers. The number of parameters will stay the same for each convolution, though the number of features will grow (along with batch normalisation parameters if you're using it).
Hey I need a 320x240 8Bit gray scale image for some Computer Vision Algorithm (Orb Feature tracking). The Raspicam driver I'm using can provide different Image Sizes. Different Image Sizes are achieved by cropping and not down sampling from the driver. As my environment is not ideal lighted the Image is quite dark and noisy. Now I had the idea to take a 640x480 Image and down sample it to 320x240 by combining always 2x2 pixels to one. Normally I would of course divide by 4 to get the correct result. But what would be the effect of dividing it by two or even one (assuming 99% of the intensity values are not bigger then 64 (256/4)). Wouldn't that simulate the effect of larger CCD cells which could gather more light in less time.
The first tests I did showed some pretty good results. Meaning I detected more Features and could follow them better between two frames.
Here, you are not taking proper average of 2x2 blocks(divide by 4). Say, you have two blocks and they have Delta-I difference in intensity. If you divide the intensity of the two blocks by a larger number, the intensity difference will reduce and vice-versa for smaller number.
When you divide the difference(Delta-I) by 2(instead of 4), you are in a way increasing the contrast(intensity difference between background and foreground. As you mentioned that your image is in poor illumination, thereby division by smaller number increases the contrast which is improving tracking. This approach will come under contrast enhancement technique and is a variation of Linear contrast enhancement.
As I know, there are some functions in the CMOS Image Sensor ISP (Image Signal Processor).
Specifically, I'd like to know the difference between binning and sub-sampling. I think these purpose is same to reduce image size.
However, I'm not sure why these functions exist?
What is their purpose?
Binning and sub-sampling reduce the image size as you have suspected, but what they focus on are different things. Let's tackle each issue separately
Binning
Binning in image processing deals primarily with quantization. The closest thing I can think of is related to what is known as data binning. Basically, consider breaking up your image into distinct (non-overlapping) M x N tiles, where M and N are the rows and columns of a tile and M and N should be much smaller than the rows and columns of the image.
If you consider any grid of M x N pixels, all of these pixels get replaced with a representative colour. The way this representative colour is calculated is done in many ways... the average is a popular method. The reason why binning is performed is primarily as a data pre-processing technique which is used to reduce the effects of minor observation errors. This effectively reduces the amount of information that is representative of the image, and so it certainly reduces the image size by reducing the amount of unique colours that represent the image.
In addition, binning the data may also reduce the impact of noise that impacts the CMOS sensor on the final processed image, but at the cost of a lower dynamic range of colours.
Sub-sampling
Sub-sampling in the case of image processing mostly deals with image resizing. It's also called image scaling. The goal is to take an image and reduce its dimensions so that you get a smaller image as a result. Binning deals with keeping the image the same size (i.e. the same dimensions as the original) while reducing the amount of colours which ultimately reduces the amount of space the image takes up. Subsampling reduces the image size by removing information all together. Usually when you subsample, you also interpolate or smooth the image so that you reduce aliasing.
Sub-sampling has another application in video processing - especially in MPEG where video is encoded in YCbCr. Y is the luminance while Cb and Cr are the chrominance pairs. We tend to notice changes in luminance rather than chrominance, and so the chrominance is subsampled to reduce the amount of space taken up by the video. Specifically, the human visual system has poor acuity when it comes to colour information than we do with luminance / intensity. Usually, the chrominance values are filtered then subsampled by 1/2 or even 1/4 of that of the intensity. Even with a rather high subsampling rate, we don't notice any differences in terms of perceived image quality.
This is obviously a rather rough introduction on the differences between them both, but I hope this gives you enough of what you're after for your purposes.
Good luck!