I want to feed an image that is stored in the YUV422 (YUYV) format into a CNN. YUV422 means that four bytes represent two pixels: the two pixels share the chroma (U and V) samples but each has its own luminance (Y).
I understand that spatial structure plays an important role for convolutional neural networks, i.e. the filters should "see" the luminance pixels together with their corresponding chroma pixels. So how would one approach this problem? Or is this no problem at all?
I want to avoid an additional preprocessing step for performance reasons.
Convolutional neural networks as implemented in common frameworks like TensorFlow, PyTorch, etc. store channels in a planar fashion. That is, each channel (R, G, B or Y, U, V) is stored in a contiguous region covering all pixels in the image (width x height), in contrast to formats where the channel data are interleaved inside each pixel. So you will need to de-interleave the YUYV data, upsample the subsampled U and V channels to match the size of the Y channel, and then feed the result to the network in the same way as RGB data.
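A minimal NumPy sketch of that de-interleaving and chroma upsampling, assuming the frame arrives as a packed YUYV byte buffer (Y0 U0 Y1 V0, ...); the function name and the nearest-neighbour upsampling are just illustrative choices, not a fixed recipe:

```python
import numpy as np

def yuyv_to_planar(buf: np.ndarray) -> np.ndarray:
    """Convert a packed YUYV frame (H x 2W bytes) to a planar 3 x H x W float array."""
    h, two_w = buf.shape
    w = two_w // 2
    y = buf[:, 0::2]          # every even byte is a luma sample
    u = buf[:, 1::4]          # U shared by each horizontal pixel pair
    v = buf[:, 3::4]          # V shared by each horizontal pixel pair
    # nearest-neighbour upsampling of the half-width chroma planes to full width
    u = np.repeat(u, 2, axis=1)
    v = np.repeat(v, 2, axis=1)
    return np.stack([y, u, v]).astype(np.float32) / 255.0   # C x H x W, scaled to [0, 1]

# usage: planar = yuyv_to_planar(np.frombuffer(raw, np.uint8).reshape(height, 2 * width))
```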
Others have found it to work OK, but not reaching the performance of RGB. See https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Colorspace.md
and the paper "Effect of image colourspace on performance of convolution neural networks" by K. Sumanth Reddy, Upasna Singh, and Prakash K. Uttam.
It is unlikely that the YUV to RGB conversion would be a bottleneck. RGB has the distinct advantage that one can reuse many excellent pretrained models (transfer learning).
1. In the paper YUVMultiNet: Real-time YUV multi-task CNN for autonomous driving, the Y channel and the UV channels are fed into separate convolution branches.
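A rough PyTorch sketch of that idea: separate convolution stems for the full-resolution Y plane and the half-resolution UV planes, merged before a shared backbone. The layer widths, strides, and merge strategy here are made-up illustrations, not taken from the paper:

```python
import torch
import torch.nn as nn

class YUVStem(nn.Module):
    """Separate stems for Y (H x W) and UV (H x W/2), merged at a common resolution."""
    def __init__(self):
        super().__init__()
        # luma stem: stride 2 halves the spatial size
        self.y_stem = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
        # chroma stem: keeps the (already half-width) UV planes
        self.uv_stem = nn.Sequential(nn.Conv2d(2, 8, 3, stride=1, padding=1), nn.ReLU())

    def forward(self, y, uv):
        fy = self.y_stem(y)                      # N x 16 x H/2 x W/2
        fuv = self.uv_stem(uv)                   # N x 8  x H   x W/2
        # bring chroma features to the same spatial size as the luma features
        fuv = nn.functional.interpolate(fuv, size=fy.shape[2:], mode="nearest")
        return torch.cat([fy, fuv], dim=1)       # N x 24 x H/2 x W/2

stem = YUVStem()
out = stem(torch.rand(1, 1, 64, 64), torch.rand(1, 2, 64, 32))
```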
2. RGB vs. YUV vs. HSV vs. Lab (1)
As mentioned by Jon Nordby, the benchmark linked above shows a comparison.
Interestingly, the learned RGB2GRAY conversion performs better than the one in OpenCV.
YCbCr seems to underperform RGB. The experiment was conducted on ImageNet-2012.
I will try it on COCO or other datasets later.
3. RGB vs. YUV vs. HSV vs. Lab (2)
In the paper Deep leaning approach with colorimetric spaces and vegetation indices for vine diseases detection in UAV images, YUV is generally better than RGB.
RGB and YUV color spaces have obtained the best performances in terms of discrimination between the four classes.
RGB:
remains sensitive to the size of the CNN's convolution filters, especially for the (64x64) patch size.
The HSV and LAB color spaces have performed less than other color spaces.
For HSV:
this is mainly due to the Hue (H) channel, which groups all the color information into a single channel; that makes it harder for the network to learn good color features from this color space. Hue carries most of the relevance for the classification, while the saturation and value channels did not contribute much.
For the LAB color space:
classification results were not conclusive. This may be because the a and b channels do not effectively represent the colors related to the diseased vineyard, and the L channel contributes little to the classification because it only represents the quantity of light in the color space.
From the results, YUV is more stable and consistent, which makes it more suitable for symptom detection. These good performances are related to the color information of the green colour corresponding to healthy vegetation, and the brown and yellow colours characterising diseased vegetation, present in the U and V channels.
The combination of different spaces produced lower scores than each space separately, except for a slight improvement for the (16x16) patch.
This is due to the fact that the CNN has not been able to extract good color features from multiple channels for discriminating between the healthy and diseased classes.
Related
In many examples across the web of facial recognition with OpenCV, I see images being converted to grayscale as part of the "pre-processing" for the facial recognition functionality. What would happen if a color image was used for facial recognition? Why do all examples turn images to grayscale first?
Many image processing and CV algorithms use grayscale images for input rather than color images. One important reason is that converting to grayscale separates the luminance plane from the chrominance planes. Luminance is also more important for distinguishing visual features in an image. For instance, finding edges based on both luminance and chrominance requires additional work. Color also doesn't really help us identify important features or characteristics of the image, although there may be exceptions.
Grayscale images only have one color channel as opposed to three in a color image (RGB, HSV). The inherent complexity of grayscale images is lower than that of color images as you can obtain features relating to brightness, contrast, edges, shape, contours, textures, and perspective without color.
Processing in grayscale is also much faster. If we make the assumption that processing a three-channel color image takes three times as long as processing a grayscale image then we can save processing time by eliminating color channels we don't need. Essentially, color increases the complexity of the model and in general slows down processing.
Most facial recognition algorithms rely on the general intensity distribution in the image rather than the color intensity information of each channel.
Grayscale images provide exactly this information about the general distribution of intensities in an image (high-intensity areas appearing as white / low-intensity areas as black). Calculating the grayscale image is simple and needs little computing time, you can calculate this intensity by averaging the values of all 3 channels.
In an RGB image, this information is divided across all 3 channels. Take for example a bright yellow with:
RGB (255,217,0)
While this is obviously a color of high intensity, we obtain this information by combining all channels, which is exactly what a grayscale image does. You could of course instead use each channel for your feature calculation and concatenate the results to use all the intensity information, but that would give essentially the same result as using the grayscale version while taking 3 times the computation time.
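A tiny sketch of the two common ways to collapse RGB into a single intensity value: the plain average mentioned above, and the perceptual luma weighting (the weights shown are the standard ITU-R BT.601 ones that OpenCV's RGB-to-gray conversion uses):

```python
import numpy as np

rgb = np.array([255, 217, 0], dtype=np.float64)      # the bright yellow from the example

gray_avg  = rgb.mean()                                # simple average of the 3 channels
gray_luma = rgb @ np.array([0.299, 0.587, 0.114])     # BT.601 luma weights (cv2.COLOR_RGB2GRAY)

print(gray_avg, gray_luma)   # ~157.3 vs ~203.6: either way, one "intensity" value per pixel
```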
I am trying to train a cnn model for face gender and age detection. My training set contains facial images both coloured and grayscale. How do I normalize this dataset? Or how do I handle a dataset with a mixture of grayscale and coloured images?
Keep in mind the network will just attempt to learn the relationship between your labels (gender/age) and your training data, in the way they are presented to the network.
The optimal choice depends on whether you expect the model to work on grayscale or colored images in the future.
If you want to predict on grayscale images only
You should train on grayscale images only!
You can use many approaches to convert the colored images to black and white:
simple average of the 3 RGB channels
more sophisticated transforms using cylindrical color spaces such as HSV or HSL, where you could use one of the channels as your gray value. Normally, the V channel corresponds better to human perception than the average of RGB
https://en.wikipedia.org/wiki/HSL_and_HSV
If you need to predict on colored images
Obviously, there is no easy way to reconstruct the colors from a grayscale image, so you must also use color images during training.
If your model accepts an MxNx3 image as input, then it will also accept the grayscale ones, given that you replicate the information across the 3 RGB channels (see the sketch below).
You should carefully evaluate the number of examples you have, and compare it to the usual training set sizes required by the model you want to use.
If you have enough color images, just do not use the grayscale cases at all.
If you don't have enough examples, make sure you have balanced training and test set for gray/colored cases, otherwise your net will learn to classify gray-scale vs colored separately.
Alternatively, you could consider using masking, and replace the missing color channels with a masking value.
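A minimal sketch of that channel-replication idea, assuming images are loaded with OpenCV; the masking variant would simply substitute a constant for the copied channels:

```python
import cv2
import numpy as np

def to_three_channels(path: str) -> np.ndarray:
    """Load an image and return an HxWx3 array, replicating gray images across 3 channels."""
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    if img.ndim == 2:                                   # grayscale: H x W
        img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)     # copy the single plane into B, G and R
    return img
```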
A further alternative you could consider:
- use a pre-trained CNN for feature extraction, e.g. the VGG variants widely available online, and then fine-tune the last layers
To me it feels that age and gender estimation would not be affected much by the presence/absence of color, and reducing the problem to grayscale images only may help convergence since there will be far fewer parameters to estimate.
You should probably rather consider normalizing your images in terms of pose, orientation, ...
To train a network you have to ensure the same shape for all training images, so convert them all to grayscale. To normalize, you can subtract the mean of the training set from each image. Do the same with the validation and testing images.
For a detailed procedure, go through the article below:
https://becominghuman.ai/image-data-pre-processing-for-neural-networks-498289068258
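A small sketch of that mean-subtraction step, assuming the images have already been converted to grayscale and resized to a common shape; the arrays here are placeholders:

```python
import numpy as np

# train_images, val_images: float32 arrays of shape (N, H, W), already grayscale and resized
train_images = np.random.rand(100, 64, 64).astype(np.float32)   # placeholder data
val_images   = np.random.rand(20, 64, 64).astype(np.float32)

mean_image = train_images.mean(axis=0)     # per-pixel mean computed on the training set only

train_norm = train_images - mean_image     # subtract the *training* mean everywhere,
val_norm   = val_images - mean_image       # including from validation / test images
```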
I have a lot of images of paper cards in different shades of colors. Like all blues, or all reds, etc. In the images, they are held up to different objects of that color.
I want to write a program to compare the color to the shades on the card and choose the closest shade to the object.
However, I realize that for future images my camera is going to be subject to lots of different lighting. I think I should convert to HSV space.
I'm also unsure of what type of distance measure I should use. Given some sort of blobs from the cards, I could average over the HSV and simply see which blob's average is the closest.
But I welcome any and all suggestions, I want to learn more about what I can do with OpenCV.
EDIT: A sample
Here I want to compare the filled in red of the 6th dot to see it is actually the shade of the 3rd paper rectangle.
I think one possibility is to do the following:
Color histograms from Hue and Saturation channels
compute the color histogram of the filled circle.
compute color histogram of the bar of paper.
compute a distance using histogram distance measures.
Possibilities here include:
Chi square,
Earthmover distance,
Bhattacharya distance,
Histogram intersection etc.
Check this opencv link for details on computing histograms
Check this opencv link for details on the histogram comparisons
Note that when computing the color histograms, convert your images to HSV colorspace as you yourself suggested. Then, there are 2 things to note here.
[EDITED to make this a suggestion rather than a must-do, because I believe the V channel might be necessary to differentiate the shades. Anyhow, try both and go with the one giving the better result. Apologies if this sent you off track.] One possibility is to only use the Hue and Saturation channels, i.e. you build a 2D histogram rather than a 3D one, consisting of values from the hue and saturation channels. The reason for doing so is that the variation in lighting is felt most in the V channel. This, together with the use of histograms, should hopefully make your comparisons more robust to lighting changes. There is some discussion on ignoring the V channel when building color histograms in this post here. You might find the references therein useful.
Normalize the histograms using the OpenCV functions. This is to account for the different sizes of the patches of material (your small circle and the huge color bar have different numbers of pixels).
You might also wish to consider performing some form of preprocessing to "stretch" the color in the image e.g. using histogram equalization or an "S curve" mapping so that the different shades of color get better separated. Then compute the color histograms on this processed image. Keep the information for the mapping and perform it on new test samples before computing their color histograms.
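A minimal OpenCV sketch of the histogram-comparison route described above (2D Hue-Saturation histogram, normalized, compared with one of the built-in distances). The bin counts, the Bhattacharyya distance, and the file names are only example choices:

```python
import cv2
import numpy as np

def hs_histogram(bgr: np.ndarray, mask: np.ndarray = None) -> np.ndarray:
    """2D Hue-Saturation histogram of a BGR image patch, L1-normalized."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], mask, [30, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist, alpha=1.0, norm_type=cv2.NORM_L1)

# circle_patch and card_patch would be cropped BGR regions of the dot and one paper rectangle
circle_patch = cv2.imread("dot.png")          # hypothetical file names
card_patch   = cv2.imread("card_red_3.png")

d = cv2.compareHist(hs_histogram(circle_patch),
                    hs_histogram(card_patch),
                    cv2.HISTCMP_BHATTACHARYYA)   # smaller distance = more similar shade
```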
Using ML for classification
Besides simply computing the distance and taking the closest one (i.e. a 1-nearest-neighbor classifier), you might want to consider training a classifier to do the classification for you. One reason for doing so is that the classifier will hopefully learn some way to differentiate between the different shades, since it has access to them during the training phase and is required to separate them. Notice that simply computing a distance, i.e. your suggested method, may not have this property. Hopefully this will give better classification.
The features used in training can still be the color histograms I mentioned above. That is, you compute color histograms as described above for your training samples and pass them to the classifier along with their class (i.e. which shade they are). Then, when you wish to classify a test sample, you likewise compute its color histogram and pass it to the classifier, which will return the class (shade of color, in your case) the test sample belongs to.
Potential problems when training a classifier rather than using a simple distance-based comparison as you have suggested are partly the added complexity of the program, as well as potentially getting bad results when the training data is not good. There will also be a lot of parameter tuning involved to get it to work well.
See the opencv machine learning tutorials here for more details. Note that in the examples in the link, the classifier only differentiates between 2 classes, whereas you have more than 2 shades of color. This is not a problem, as the classifiers can in general work with more than 2 classes.
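For completeness, a rough sketch of that classifier route using OpenCV's built-in SVM. The features here are placeholder histograms (in practice they would come from a helper like the hs_histogram sketch above), and the SVM parameters are left at illustrative defaults:

```python
import cv2
import numpy as np

# one flattened H-S histogram per training sample, plus an integer shade label per sample
train_hists  = [np.random.rand(30 * 32).astype(np.float32) for _ in range(60)]  # placeholders
train_labels = np.random.randint(0, 5, size=60).astype(np.int32)                # e.g. 5 shades

samples = np.vstack(train_hists)                 # one row per training sample

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)                    # multi-class classification
svm.setKernel(cv2.ml.SVM_RBF)
svm.train(samples, cv2.ml.ROW_SAMPLE, train_labels)

_, pred = svm.predict(np.random.rand(1, 30 * 32).astype(np.float32))   # predicted shade index
```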
Hope this helps.
The blending modes Screen, Color Dodge, Soft Light, etc., like in Photoshop, each have their own math that works for the range 0-1. I wonder how these blend modes work for HDR images?
Thanks
I am not familiar with Photoshop and its filters, but here is a general explanation of the math behind HDR filters.
Suppose you have 3 images (low light, medium, and over-exposed). You want to average those images, but (I1+I2+I3)/3 is a stupid way to do it. You want to give a higher weight to the image that captures more information in a given area.
So basically you average the images with a weight factor, and there are different types of algorithms to calculate the weights. Here are a few:
The simplest one uses the STD (standard deviation). For each pixel in each image, calculate the standard deviation of its 9 neighbours and use it as the weight (divided by the sum of the weights so it stays a weighted average):
HDR(i,j) = (I1(i,j)*std1(i,j) + I2(i,j)*std2(i,j) + I3(i,j)*std3(i,j)) / (std1(i,j) + std2(i,j) + std3(i,j))
Why is std used? When the std is high, there is high variation in the pixel intensities, which means more information was captured by the image.
Instead of STD you can use an entropy filter, edge detection, or anything else that represents how much information is encoded around the given pixel.
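A compact NumPy/OpenCV sketch of that std-weighted fusion, assuming the three exposures are already aligned single-channel float images; the 3x3 window and the epsilon are arbitrary illustrative choices:

```python
import cv2
import numpy as np

def local_std(img: np.ndarray, ksize: int = 3) -> np.ndarray:
    """Standard deviation over a ksize x ksize neighbourhood of each pixel."""
    mean = cv2.blur(img, (ksize, ksize))
    mean_sq = cv2.blur(img * img, (ksize, ksize))
    return np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))

def fuse_std_weighted(exposures):
    """Weighted average of aligned exposures, weight = local standard deviation."""
    weights = [local_std(e) + 1e-6 for e in exposures]   # epsilon avoids division by zero
    total = sum(weights)
    return sum(e * w for e, w in zip(exposures, weights)) / total

# usage with three aligned float32 exposures i1, i2, i3 in [0, 1]:
# hdr = fuse_std_weighted([i1, i2, i3])
```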
There are also slower but better ways to do HDR. Usually it is done with some kind of frequency or wavelet transformation, for example a Fourier transform: each image is converted to frequency space (coefficients of the frequencies), and then for each frequency the maximal coefficient of the 3 images is taken.
You can even combine the std-filter method with such transforms. For example, break the image into different frequencies, smooth the lower frequencies and take a stupid average (I1+I2+I3)/3, but for high frequencies use less smoothing and a std-weighted average. The act of smoothing the lower frequencies more is called 'blending'. It is heavily used when stitching 2 images of different light exposure into a panorama.
Look at this image: http://magazine.magix.com/en/wp-content/uploads/2012/05/Panorama-3.jpg
You can clearly see that the sky has a different color in each image, but since the sky is very low-frequency (almost no information and no small objects), it is heavily smoothed and averaged, allowing a gentle stitch.
Hope that answers your question
Here is the problem we are trying to solve:
Goal is to classify pixels of a colored image into 3 different classes.
We have a set of manually classified data for training purposes
Pixels almost do not correlate with each other (each has its own individual behaviour), so most likely classification is done on each individual pixel, based on its individual features.
The 3 classes can approximately be mapped to colors of the RED, YELLOW and BLACK color families.
We need the system to be semi-automatic, i.e. 3 parameters to control the probability of the presence of the 3 outcomes (for final fine-tuning).
Having this in mind:
Which classification technique will you choose?
What pixel features will you use for classification (RGB, Ycc, HSV, etc) ?
What modification functions will you choose for fine-tuning between the three outcomes?
My first try was based on:
Naive Bayes classifier
HSV (also tried RGB and Ycc)
(failed to find proper functions for fine-tuning)
Any suggestion?
Thanks
For each pixel in the image, try using the histogram of colors in the n x n window around that pixel as its features. For general-purpose color matching under varied lighting conditions, I have had good luck using two-dimensional histograms of hue and saturation with a relatively small number of bins along each dimension. Depending upon your lighting consistency, it might make sense for you to directly use the RGB values.
As for the classifier, the manual-tuning requirement is most easily expressed using class weights: parameters that specify the relative costs of false negatives versus false positives for each class. I have only used this functionality with SVMs, but I'm sure you can find implementations of other classifiers that support a similar concept.
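A small sketch of the class-weight idea using scikit-learn's SVC (chosen here only because it exposes class weights directly; the original answer does not name a specific library). The three weights play the role of the three tuning parameters, and the features and labels are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

# X: one feature vector (e.g. a flattened H-S histogram) per pixel window
# y: class index, e.g. 0 = RED family, 1 = YELLOW family, 2 = BLACK family
X = np.random.rand(300, 64)                # placeholder features
y = np.random.randint(0, 3, size=300)      # placeholder labels

# the three tuning knobs: raising a class's weight makes the classifier
# more reluctant to miss that class (fewer false negatives for it)
weights = {0: 1.0, 1: 1.5, 2: 0.8}

clf = SVC(kernel="rbf", class_weight=weights)
clf.fit(X, y)
pred = clf.predict(np.random.rand(5, 64))  # predicted class per new pixel window
```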