Every pixel intensity matters - increasing image resolution to help image generation with diffusion model / GAN - machine-learning

I have been playing around with some generative models, namely StyleGAN2 and diffusion models for image generation, specifically a homegrown dataset of 64 x 64 resolution. I have roughly 260 000 data samples.
These images are all grayscale, and every pixel intensity really matters, e.g. if the intensity (out of 255) of a pixel is 160 when it should be 155 for the image to look correct this can have a big impact on the quality of the image overall. This is within the context of the other pixels, of course, otherwise, we'd always be generating the same image.
The self-referential scores (FID and Kernal Inception Distance, KID) can be very low at the end of training - i.e. when the loss has converged, because the distribution of generated samples and real samples are very similar. You can also see this visually, where the images look roughly correct. But unfortunately, I really need the pixel values to be narrower.
One solution I was thinking of was to increase the resolution to 128x128 and essentially duplicate the pixel values in each dimension. This way the "noise" of individual pixels can be averaged out over 4 pixels during generation. Does this make sense, or does this sound naive? Thanks!

Related

keras - flow_from_directory function - target_size parameter

Keras has this function called flow_from_directory and one of the parameters is called target_size. Here is the explanation for it:
target_size: Tuple of integers (height, width), default: (256, 256).
The dimensions to which all images found will be resized.
The thing that is unclear to me is whether it is just cropping the original image into 256x256 matrix (in this case we do not take the entire image) or it is just reducing the resolution of the image (while still showing us the entire image)?
If it is -let's say - just reducing the resolution:
Assume that I have some xray images with the size 1024x1024 each (for breast cancer detection). And if I want to apply transfer learning to a pretrained Convolutional Neural Network which only takes 224x224 input images, will I not be loosing important data/information when I reduce the size of the image (and resolution) from 1024x1024 down to 224x224? Isn't there any such risk?
Thank you in advance!
Reducing the resolution (risizing)
Yes, you are loosing data
The best way for you is to rebuild your CNN to work with your original image size, i.e. 1024*1024
It is reducing the resolution of the image (while still showing us the entire image)
That is true that you are losing data, but you can work with an image size a bit larger than 224224 like 512 * 512 512 as it will keep most of the information and will train in comparatively less time and resources than the original image(10241024).

Effect of Downsample but not fully divide

Hey I need a 320x240 8Bit gray scale image for some Computer Vision Algorithm (Orb Feature tracking). The Raspicam driver I'm using can provide different Image Sizes. Different Image Sizes are achieved by cropping and not down sampling from the driver. As my environment is not ideal lighted the Image is quite dark and noisy. Now I had the idea to take a 640x480 Image and down sample it to 320x240 by combining always 2x2 pixels to one. Normally I would of course divide by 4 to get the correct result. But what would be the effect of dividing it by two or even one (assuming 99% of the intensity values are not bigger then 64 (256/4)). Wouldn't that simulate the effect of larger CCD cells which could gather more light in less time.
The first tests I did showed some pretty good results. Meaning I detected more Features and could follow them better between two frames.
Here, you are not taking proper average of 2x2 blocks(divide by 4). Say, you have two blocks and they have Delta-I difference in intensity. If you divide the intensity of the two blocks by a larger number, the intensity difference will reduce and vice-versa for smaller number.
When you divide the difference(Delta-I) by 2(instead of 4), you are in a way increasing the contrast(intensity difference between background and foreground. As you mentioned that your image is in poor illumination, thereby division by smaller number increases the contrast which is improving tracking. This approach will come under contrast enhancement technique and is a variation of Linear contrast enhancement.

What is the difference between Binning and sub-sampling in Image Signal Processing?

As I know, there are some functions in the CMOS Image Sensor ISP (Image Signal Processor).
Specifically, I'd like to know the difference between binning and sub-sampling. I think these purpose is same to reduce image size.
However, I'm not sure why these functions exist?
What is their purpose?
Binning and sub-sampling reduce the image size as you have suspected, but what they focus on are different things. Let's tackle each issue separately
Binning
Binning in image processing deals primarily with quantization. The closest thing I can think of is related to what is known as data binning. Basically, consider breaking up your image into distinct (non-overlapping) M x N tiles, where M and N are the rows and columns of a tile and M and N should be much smaller than the rows and columns of the image.
If you consider any grid of M x N pixels, all of these pixels get replaced with a representative colour. The way this representative colour is calculated is done in many ways... the average is a popular method. The reason why binning is performed is primarily as a data pre-processing technique which is used to reduce the effects of minor observation errors. This effectively reduces the amount of information that is representative of the image, and so it certainly reduces the image size by reducing the amount of unique colours that represent the image.
In addition, binning the data may also reduce the impact of noise that impacts the CMOS sensor on the final processed image, but at the cost of a lower dynamic range of colours.
Sub-sampling
Sub-sampling in the case of image processing mostly deals with image resizing. It's also called image scaling. The goal is to take an image and reduce its dimensions so that you get a smaller image as a result. Binning deals with keeping the image the same size (i.e. the same dimensions as the original) while reducing the amount of colours which ultimately reduces the amount of space the image takes up. Subsampling reduces the image size by removing information all together. Usually when you subsample, you also interpolate or smooth the image so that you reduce aliasing.
Sub-sampling has another application in video processing - especially in MPEG where video is encoded in YCbCr. Y is the luminance while Cb and Cr are the chrominance pairs. We tend to notice changes in luminance rather than chrominance, and so the chrominance is subsampled to reduce the amount of space taken up by the video. Specifically, the human visual system has poor acuity when it comes to colour information than we do with luminance / intensity. Usually, the chrominance values are filtered then subsampled by 1/2 or even 1/4 of that of the intensity. Even with a rather high subsampling rate, we don't notice any differences in terms of perceived image quality.
This is obviously a rather rough introduction on the differences between them both, but I hope this gives you enough of what you're after for your purposes.
Good luck!

collect negative samples of adaboost algorithm for face detection

Viola-Jones' AdaBoost method is very popular for face detection? We need lots of positive and negative samples o train a face detector.
The rule for collecting positive sample is simple: the image which contains faces. But the rule for collecting negative sample is not very clear: the image which does not contains faces.
But there are so many scene that do not contain faces (which may be sky, river, house animals etc.). Which should I collect it? How can know I have collected enough negative samples?
Some suggested idea for negative samples: using the positive samples and crop the face region using the left part as negative samples. Is this work?
You have asked many questions inside your thread.
Amount of samples. As a rule of thumbs: When you train a detector you need roughly few thousands positive and negative examples per stage. Typical detector has 10-20 stages. Each stage reduces the amount of negative by a factor of 2. So you will need roughly 3,000 - 10,000 positive examples and ~5,000,000 to 100,000,000 negative examples.
Which negatives to take. A rule of thumb: You need to find a face in a given environment. So you need to take that environment as negative examples. For instance, if you try to detect faces of students sitting in a classroom than take as negative examples images from the classroom (walls, windows, human body, clothes etc). Taking images of the moon or of the sky will probably not help you. If you don't know your environment than just take as much as possible different natural images (under different light conditions).
Should you take facial parts (like an eye, or a nose) as negative? You can but this is definitely not enough (to take only those negatives). The real strength of the detector will come from the negative images which represent the typical background of the faces
How to collect/generate negative samples - You don't actually need many negative images. You can take 1000 images and generate 10,000,000 negative samples from them. Here is how you do it. Suppose you take a photo of a car of 1 mega pixel resolution 1000x1000 pixels. Suppose than you want to train face detector to work on resolution of 20x20 pixels (like openCV did). So you take your 1000x1000 big image and cut it to pieces of 20x20. You can get 2,500 pieces (50x50). So this is how from a single big image you generated 2,500 negative examples. Now you can take the same big image and cut it to pieces of size 10x10 pixels. You will now have additional 10,000 negative examples. Each example is of size 10x10 pixels and you can enlarge it by factor of 2 to force all the sample to have the same size. You can repeat this process as much as you want (cutting the input image to pieces of different size). Mathematically speaking, if your image is of size NxN - You can generate O(N^4) negative examples from it by taking each possible rectangle inside it.
In step 4, I described how to take a single big image and cut it to a large amount of negative examples. I must warn you that negative examples should not have high co-variance so I don't recommend taking only one image and generating 1 million negative examples from it. As a rule of thumb - create a library of 1000 images (or download random images from Google). Verify than none of the images contains faces. Crop about 10,000 negative examples from each image and now you have got a decent 10,000,000 negative examples. Train your detector. In the next step you can cut each image to ~50,000 (partially overlapping pieces) and thus enlarge your amount of negatives to 50 millions. You will start having very good results with it.
Final enhancement step of the detector. When you already have a rather good detector, run it on many images. It will produce false detections (detect face where there is no face). Gather all those false detections and add them to your negative set. Now retrain the detector once again. The more such iterations you do the better your detector becomes
Real numbers - The best face detectors today (like Facebooks) use hundreds of millions of positive examples and billions of negatives. As positive examples they take not only frontal faces but faces in many orientations, different facial expressions (smiling, shouting, angry,...), different age groups, different genders, different races (Caucasians, blacks, Thai, Chinese,....), with or without glasses/hat/sunglasses/make-up etc. You will not be able to compete with the best, so don't get angry if your detector misses some faces.
Good luck

More questions about OpenCV's HoG and CvSVM

I've managed to extract HoG features from positive and negative images (from INRIA's person dataset ) using OpenCV's HOGDescriptor::compute function.
I've also managed to pack the data correctly and feed it into CvSVM for training purposes.
I have several questions:
While extracting features, I used positive images with dimension of 96 x 128, while the negative images are on average 320 x 240. I have been using window size of 64 x 128 for HoG extraction, should I use other window size ?
The size of extracted features for positive images are around 28800 features, while the negative ones are around 500000+. I have been truncating the features from negative ones to 28800, I think this is wrong, since I believe I'm losing too much information when feeding these features to SVM. How should I go and tackle this ? (It seems like I can only feed the same sample size for negative and positive features)
While doing prediction on images bigger than 64 x 128 (or 96 x 160), should I use a sliding window to do prediction ? Since large negative images still gives me more than 500000 features, but I can't feed it into SVM due to sample size.
Why you can't just resize all your patches to the same size? Hog descriptor depends on windows size, blocks and cells sizes. You should try different combinations. With small cells you can capture small details, but you will lose in generality and vice versa.
1.) Don't understand the question
2.) Make all descriptors the same size, extracting hog from resized images.
3.) Don't understand the question

Resources