I have some background in machine learning and I have just completed a face-identification exercise using a support vector machine. I am now trying to convert this exercise to an HMM, but I am having problems understanding the notation and how to use it (I am using Kevin Murphy’s HMM package).
I am given about 50 grayscale images of 6 different people (numbered 1-6). Each image is 10 pixels by 10 pixels, and each pixel can take a value between 0 and 255 (8-bit grayscale). The goal is to be able to classify a new image as one of the 6 faces.
My approach is to flatten each image into a vector of 100 elements, each of which is a pixel value. Now I am getting to the confusing part. The notation I am using is as follows:
N : Number of hidden states - I understand that the hidden state is the person’s face (i.e. 1-6); therefore, there are 6 hidden states, so N=6.
T : Length of observation sequence – is this equal to 50? I am not sure what this represents.
M : Number of observation symbols – is this equal to 100? Does the term “observation symbol” refer to the number of elements in the vector representing the observation?
O : Number of observations – what does this represent? In every example they use a single binary observed value and set this to 2 (i.e. on or off). What would this be in my case?
I greatly appreciate the help.
I have 1024-bit binary representations of three handwritten digits: 0, 1, 8.
Basically, the rows of a 32x32 bitmap of a digit are concatenated to form a binary vector.
There are 50 binary vectors for each digit.
When we apply nearest neighbour to each digit, we can use the Hamming distance or some other metric, and then apply the algorithm to differentiate between the vectors.
Now I want to use another technique where, instead of looking at every bit of a vector, I analyse a smaller number of bits when comparing the vectors.
For example, I know that when one compares the bitmaps (size: 1024 bits) of the digits '8' and '0', there must be 1s in the middle of the vector for digit '8', as the digit 8 visually appears as a combination of two zeros stacked in a column.
So our algorithm would look for the intersection of the two zeros (which would be the middle of the digit).
That's the way I want to work. I want to convert the low-level representation (looking at the 1024-bit bitmap vector) to a high-level representation (consisting of two properties extracted from the bitmap).
Any suggestions? I hope the question is reasonably clear.
Idea 1: Flood fill
This idea does not use the 50 patterns you have per digit: it is based on the idea that usually a "1" has all 0-bits connected around that "1" shape, while a "0" separates the 0-bits inside it from those outside it, and an "8" has two such enclosed areas. So counting connected areas of 0-bits would identify which of the three it is.
So you could use a flood fill algorithm, starting at any 0 bit in the vector, and set all those connected 0-bits to 1. In a 1 dimensional array you need to take care to correctly identify connected bits (either horizontally: 1 position apart, but not crossing a 32 boundary, or vertically... 32 positions apart). Of course, this flood-filling will destroy the image - so make sure to use a copy. If after one such flood-fill there are still 0 bits (which were therefore not connected to those you turned into 1), then choose one of those and start a second flood-fill there. Repeat if necessary.
When all bits have been set to 1 in that way, use the number of flood-fills you had to perform, as follows:
One flood-fill? It's a "1", because all 0-bits are connected.
Two flood-fills? It's a "0", because the shape of a zero separates two areas (inside/outside)
Three flood-fills? It's an "8", because this shape separates three areas of connected 0-bits.
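The counting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bitmap is a row-major list of 0/1 integers, connectivity is 4-way (horizontal and vertical), and the function names `count_zero_regions` and `classify_by_regions` are made up for this example.

```python
from collections import deque

def count_zero_regions(bits, width=32):
    """Count 4-connected regions of 0-bits in a row-major bit vector."""
    grid = list(bits)                  # work on a copy: the fill overwrites 0s
    height = len(grid) // width
    regions = 0
    for start in range(len(grid)):
        if grid[start] != 0:
            continue
        regions += 1                   # a new, not yet filled area of 0-bits
        grid[start] = 1
        queue = deque([start])
        while queue:
            pos = queue.popleft()
            row, col = divmod(pos, width)
            neighbours = []
            if col > 0:
                neighbours.append(pos - 1)       # left, same row only
            if col < width - 1:
                neighbours.append(pos + 1)       # right, same row only
            if row > 0:
                neighbours.append(pos - width)   # up
            if row < height - 1:
                neighbours.append(pos + width)   # down
            for n in neighbours:
                if grid[n] == 0:
                    grid[n] = 1
                    queue.append(n)
    return regions

def classify_by_regions(bits, width=32):
    """Map the number of flood-fills to a digit, as described above."""
    return {1: "1", 2: "0", 3: "8"}.get(count_zero_regions(bits, width), "unknown")
```

Note that a stroke touching both the top and bottom edge of the bitmap would split the background in two, so this sketch assumes the digit has some margin around it.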
Of course, this process assumes that these handwritten digits are well-formed. For example, if an 8-shape would have a small gap, like here:
...then it will not be identified as an "8", but as a "0". This particular problem could be resolved by identifying "loose ends" of 1-bits (a "line" that stops). When you have two of those at a short distance from each other, increase the number you got from flood-fill counting by 1 (as if those two ends were connected).
Similarly, if a "0" accidentally has a small second loop, like here:
...it will be identified as an "8" instead of a "0". You could prevent this particular problem by requiring that each flood-fill finds a minimum number of 0-bits (like at least 10 0-bits) to count as one.
Idea 2: probability vector
For each digit, add up the 50 example vectors you have, so that for each position you have a count somewhere between 0 and 50. You would have one such "probability" vector per digit, so prob0, prob1 and prob8. If prob8[501] = 45, it means that it is highly probable (45/50) that an "8" vector will have a 1-bit at index 501.
Now transform these 3 probability vectors as follows: instead of storing a count per position, store the positions in order of decreasing count (probability). So if prob8[513] has the highest value (like 49), then that new array should start like [513, ...]. Let's call these new vectors A0, A8 and A1 (for the corresponding digit).
Finally, when you need to match a given input vector, simultaneously go through A0, A1 and A8 (always looking at the same index in the three vectors) and keep 3 scores. When the input vector has a 1 at the position specified in A0[i], then add 1 to score0. If it also has a 1 at the position specified in A1[i] (same i), then add 1 to score1. Same thing for score8. Increment i, and repeat. Stop this iteration as soon as you have a clear winner, i.e. when the highest score among score0, score1 and score8 has crossed a threshold difference with the second highest score among them. At that point you know which digit is being represented.
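A rough Python sketch of this ranking-and-scoring idea follows. The function names, the `margin` parameter (the "threshold difference") and the choice to return `None` when no clear winner appears are assumptions made for illustration; at least two labels are expected.

```python
def build_ranking(examples):
    """Positions ordered by decreasing count of 1-bits over the examples."""
    length = len(examples[0])
    counts = [sum(vec[i] for vec in examples) for i in range(length)]
    return sorted(range(length), key=lambda i: counts[i], reverse=True)

def classify(vector, rankings, margin=5):
    """rankings maps a label to its ranking (A0, A1, A8 in the text).

    All rankings are walked in lockstep; the walk stops as soon as the best
    score leads the second best by at least `margin`.
    """
    scores = {label: 0 for label in rankings}
    for i in range(len(vector)):
        for label, ranking in rankings.items():
            scores[label] += vector[ranking[i]]   # +1 if the input has a 1 there
        best, second = sorted(scores.values(), reverse=True)[:2]
        if best - second >= margin:
            return max(scores, key=scores.get)
    return None   # no clear winner before the rankings were exhausted
```

The early stop matters: each ranking visits every position exactly once, so if the walk runs to the end, every score equals the number of 1-bits in the input and nothing can be decided.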
I was reading some documentation about HSV histograms, and in several references the Saturation channel was quantized into 256 values. Why is that? Is there any reason behind choosing this number?
I have the same question for the Hue channel, which is often quantized into 180 values.
Disclaimer: Off-hand answers (i.e., not backed up by any documentation):
"256" is a popular number for a bin size because Programmers Like Round Numbers -- it fits in a single byte. And "180" because the HSB circle is "360 [degrees]", but "360" does not fit into a single byte.
For many image formats, the range of RGB values is limited to 0..255 per channel -- 3 bytes in total. To store the same amount of data (ignoring any artifacts of converting to another color model), Saturation and Brightness are often expressed in single bytes as well. The same could be done for Hue, by scaling the original range of 0..359 (as Hue is usually expressed as a value in degrees on the HSB Color Wheel) into the byte range 0..255. However, probably because it's easier to do calculations with a number close to the original 360° full circle, the range is halved to 0..179. That way the value can be stored in a single byte (and thus "HSB" uses as much memory as "RGB") and can be converted trivially back to (close to) its original value -- multiply by 2. Obviously, sticking to the storage space wins over fidelity.
Given 256 values for both S and B, and 180 for H, you end up with a color space of 256*256*180 = 11,796,480 colors. To inspect the number of colors, you build a histogram: an array where you can read out the total amount of pixels in a certain color or color range. Using a color range here, instead of actual values, significantly cuts down the memory requirements.
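To make the "color range instead of actual values" point concrete, here is a small Python sketch. It assumes hue already stored as 0..179 and saturation/brightness as 0..255; the bin counts (18, 4, 4) and the function names are arbitrary choices for illustration.

```python
def hue_to_byte(hue_degrees):
    """Map a hue in degrees (0..359) into the 0..179 range by halving."""
    return (int(hue_degrees) % 360) // 2

def hsv_histogram(pixels, h_bins=18, s_bins=4, v_bins=4):
    """Count pixels per (hue, saturation, brightness) range.

    pixels: iterable of (h, s, v) with h in 0..179 and s, v in 0..255.
    Returns a flat list of h_bins * s_bins * v_bins counters.
    """
    hist = [0] * (h_bins * s_bins * v_bins)
    for h, s, v in pixels:
        hi = h * h_bins // 180      # which hue range the pixel falls in
        si = s * s_bins // 256
        vi = v * v_bins // 256
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist
```

With these bin counts the histogram needs 288 counters instead of 11,796,480.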
For an RGB color image, with the colors fairly evenly distributed, you could shift down each channel a certain number of bits. This is how a straightforward conversion from 24-bit "true-color" RGB down to 15-bit RGB "high-color" space works: each channel gets divided by 8, reducing 256 values down to 32 (5 bits per channel). Conversion to a 16-bit high-color RGB space works the same; the bit that got left over in the 15-bit conversion is assigned to green. Thus, the range of colors for green is doubled, which is useful since the human eye is more perceptive for shades of green than for the other two primaries.
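The bit-shifting described above might look like this in Python. The packing order (red in the high bits, blue in the low bits) is an assumption; actual formats differ in how they lay out the channels.

```python
def rgb888_to_rgb555(r, g, b):
    """Pack 8-bit channels into 15 bits: 5 bits each (divide by 8)."""
    return ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3)

def rgb888_to_rgb565(r, g, b):
    """Pack 8-bit channels into 16 bits: green keeps 6 bits (divide by 4)."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
```

Shifting right by 3 is the same as an integer division by 8, so 256 levels per channel collapse to 32 (or 64 for green in the 16-bit case).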
It gets more complicated when the colors in the input image are not evenly distributed. A naive solution is to create an array of [256][256][256], initialize all to zero, then fill the array with the colors of the image, and finally sort them. There are better alternatives -- let me consult my old Computer Graphics [1] here. Hold on.
13.4 Reproducing Color mentions the names of two different approaches from Heckbert (Color Image Quantization for Frame Buffer Display, SIGGRAPH 82): the popularity and the median-cut algorithms. (Unfortunately, that's all they say about this topic. I assume efficient code for both can be googled for.)
A rough guess:
The size for each bin (H,S,B) should be reflected by what you are trying to use it for. This older SO question, for example, uses a large bin for hue -- color is considered the most important -- and only 3 different values for both saturation and brightness. Thus, bright images with some subdued areas (say, a comic book) will give a good spread in this histogram, but a real-color photograph will not so much.
The main limit is that the bin sizes, multiplied with each other, should use a reasonably small amount of memory, yet cover enough of each component to get evenly filled. Perhaps some trial-and-error comes into play here. You could initially evenly distribute all of H, S, and B components over the available memory in your histogram and process a small part of the image; say, 1 out of 4 pixels, horizontally and vertically. If you notice one of the component bins fills up too fast where others stay untouched, adjust the ranges and restart.
If you need to do an analysis of multiple pictures, make sure they are all alike in their color gamut. You cannot expect a reasonable bin size to work on all sorts of images; you would end up with an even distribution, where all matches are only so-so.
[1] Computer Graphics: Principles and Practice. (1997) J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes, 2nd ed., Reading, MA: Addison-Wesley.
I have a basic question regarding pattern learning, or pattern representation. Assume I have a complex pattern of this form. Could you please provide me with some research directions or concepts that I can follow to learn how to represent (mathematically describe) these forms of patterns? In general, the pattern does not have a closed contour, nor can it be represented with analytical objects like boxes, circles etc.
By mathematically describe I'm assuming you mean derive from the image a vector of values that represents the content of the image. In computer vision/image processing we call this an "image descriptor".
There are several image descriptors that could be applied to pixel-based data of the form you showed, which appears to have 1 value per pixel, i.e. greyscale images.
One approach is to perform "spatial gridding" where you divide the image up into a regular grid of a constant size e.g. a 4x4 grid. You then average the pixel values within each cell of the grid. Then concatenate these values to form a 16 element vector - this coarsely describes the pixel distribution of the image.
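As a minimal sketch of spatial gridding in Python, assuming the image is a list of rows of grey values and is at least as large as the grid (the function name `grid_descriptor` is made up here):

```python
def grid_descriptor(image, rows=4, cols=4):
    """Average the pixel values in each grid cell and concatenate them."""
    height, width = len(image), len(image[0])
    descriptor = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            descriptor.append(sum(cell) / len(cell))
    return descriptor   # rows * cols values, 16 for the default 4x4 grid
```

Two images can then be compared by, for example, the Euclidean distance between their descriptors.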
Another approach would be to use "image moments" which are 2D statistical moments. Use this equation:
where f(x,y) is the pixel value at coordinates (x,y). W and H are the image width and height. The mu_x and mu_y indicate the average x and y. The values i and j select the order of moment you want to compute. Various orders of moment can be combined in different ways; for example, in the "Hu moments" we can compute 7 numbers using combinations of image moments:
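The cool thing about the Hu moments is that you can translate, scale, or rotate the image and still get the same 7 values (a flip only changes the sign of the seventh), which makes this a robust image descriptor. Strictly speaking, this is invariance to translation, scale, and rotation rather than full affine invariance.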
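The equations themselves are not reproduced in this text, so the following Python sketch assumes the standard definition of a central moment (sum over all pixels of (x - mu_x)^i * (y - mu_y)^j * f(x,y), with mu_x and mu_y the intensity-weighted mean coordinates) and the usual first Hu moment, eta20 + eta02. Function names are illustrative only.

```python
def central_moment(image, i, j):
    """Central moment of order (i, j) of a greyscale image (list of rows)."""
    height, width = len(image), len(image[0])
    coords = [(x, y) for y in range(height) for x in range(width)]
    total = sum(image[y][x] for x, y in coords)
    mu_x = sum(x * image[y][x] for x, y in coords) / total   # mean x
    mu_y = sum(y * image[y][x] for x, y in coords) / total   # mean y
    return sum((x - mu_x) ** i * (y - mu_y) ** j * image[y][x]
               for x, y in coords)

def first_hu_moment(image):
    """eta20 + eta02, with eta_pq = mu_pq / mu_00 ** (1 + (p + q) / 2)."""
    mu00 = central_moment(image, 0, 0)
    eta20 = central_moment(image, 2, 0) / mu00 ** 2
    eta02 = central_moment(image, 0, 2) / mu00 ** 2
    return eta20 + eta02
```

For a vertical bar and the same bar rotated by 90 degrees, `first_hu_moment` returns the same value, which is the kind of invariance described above.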
Hope this helps as a general direction to read more in.
Given an image (like the one given below), I need to convert it into a binary image (black and white pixels only). This sounds easy enough, and I have tried two thresholding functions. The problem is that I can't get perfect edges using either of these functions. Any help would be greatly appreciated.
The filters I have tried are the Euclidean distance in the RGB and HSV spaces.
Sample image:
Here it is after running an RGB threshold filter. (At 40%; it shows more artefacts beyond this.)
Here it is after running an HSV threshold filter. (at 30% the paths become barely visible but clearly unusable because of the noise)
The code I am using is pretty straightforward: convert the input image to the appropriate color space and check the Euclidean distance to the color black.
sqrt(R*R + G*G + B*B)
since I am comparing with black (0, 0, 0)
Your problem appears to be the variation in lighting over the scanned image which suggests that a locally adaptive thresholding method would give you better results.
The Sauvola method calculates the value of a binarized pixel based on the mean and standard deviation of pixels in a window of the original image. This means that if an area of the image is generally darker (or lighter) the threshold will be adjusted for that area and (likely) give you fewer dark splotches or washed-out lines in the binarized image.
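A plain-Python sketch of the idea, assuming the commonly quoted Sauvola threshold T = mean * (1 + k * (std / R - 1)) computed over a square window. The values k = 0.5 and R = 128 are the ones usually cited for 8-bit images and should be treated as tunable assumptions; the function name is made up, and this brute-force version is far slower than an integral-image implementation.

```python
import math

def sauvola_binarize(image, window=15, k=0.5, r=128.0):
    """Binarize a greyscale image (list of rows, values 0..255).

    Returns 1 for pixels brighter than their local threshold, else 0.
    Windows are truncated at the image border.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    out = []
    for y in range(height):
        row_out = []
        for x in range(width):
            values = [image[yy][xx]
                      for yy in range(max(0, y - half), min(height, y + half + 1))
                      for xx in range(max(0, x - half), min(width, x + half + 1))]
            mean = sum(values) / len(values)
            std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            threshold = mean * (1 + k * (std / r - 1))
            row_out.append(1 if image[y][x] > threshold else 0)
        out.append(row_out)
    return out
```

Because the threshold follows the local mean, a generally darker corner of the scan gets a lower threshold than a brightly lit one.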
http://www.mediateam.oulu.fi/publications/pdf/24.p
I also found a method by Shafait et al. that implements the Sauvola method with greater time efficiency. The drawback is that you have to compute two integral images of the original, one at 8 bits per pixel and the other potentially at 64 bits per pixel, which might present a problem with memory constraints.
http://www.dfki.uni-kl.de/~shafait/papers/Shafait-efficient-binarization-SPIE08.pdf
I haven't tried either of these methods, but they do look promising. I found Java implementations of both with a cursory Google search.
Running an adaptive threshold over the V channel in the HSV color space should produce brilliant results. The best results would come with a window larger than 11x11; don't forget to choose a negative value for the threshold.
Adaptive thresholding basically is:
if (Pixel value + constant > Average pixel value in the window around the pixel )
Pixel_Binary = 1;
else
Pixel_Binary = 0;
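The rule above, written out as a runnable Python sketch over a greyscale image stored as a list of rows. The function name and the truncated windows at the border are assumptions of this example; `constant` is deliberately left without a default.

```python
def adaptive_mean_threshold(image, window, constant):
    """Apply: pixel + constant > mean of the window around the pixel."""
    height, width = len(image), len(image[0])
    half = window // 2
    out = []
    for y in range(height):
        row_out = []
        for x in range(width):
            values = [image[yy][xx]
                      for yy in range(max(0, y - half), min(height, y + half + 1))
                      for xx in range(max(0, x - half), min(width, x + half + 1))]
            mean = sum(values) / len(values)
            row_out.append(1 if image[y][x] + constant > mean else 0)
        out.append(row_out)
    return out
```

With the comparison written this way, the sign of the constant decides what happens in flat regions: a positive constant turns them into 1, a negative one into 0, so it is worth trying both against your own image.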
Due to the noise and the illumination variation you may need an adaptive local thresholding, thanks to Beaker for his answer too.
Therefore, I tried the following steps:
Convert it to grayscale.
Do the mean or the median local thresholding, I used 10 for the window size and 10 for the intercept constant and got this image (smaller values might also work):
Please refer to http://homepages.inf.ed.ac.uk/rbf/HIPR2/adpthrsh.htm if you need more information on this technique.
To make sure the thresholding was working fine, I skeletonized it to see if there is a line break. This skeleton may be the one needed for further processing.
To get rid of the remaining noise you can just find the longest connected component in the skeletonized image.
Thank you.
You probably want to do this as a three-step operation.
Use leveling, not just thresholding: take the input and scale the intensities (gamma correct) with parameters that simply dull the mid tones, without removing the darks or the lights (your RGB threshold is too strong, for instance; you lost some of your lines).
Edge-detect the resulting image using a small kernel convolution (5x5 for binary images should be more than enough). Use a simple [1 2 3 2 1 ; 2 3 4 3 2 ; 3 4 5 4 3 ; 2 3 4 3 2 ; 1 2 3 2 1] kernel (normalised).
Threshold the resulting image. You should now have a much better binary image.
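The three steps could be prototyped roughly as follows in Python. The gamma value, the threshold, the edge-replicating border handling and all function names are assumptions for illustration. Note that the kernel given above has only positive weights, so in this sketch the convolution step acts as a weighted blur.

```python
KERNEL = [[1, 2, 3, 2, 1],
          [2, 3, 4, 3, 2],
          [3, 4, 5, 4, 3],
          [2, 3, 4, 3, 2],
          [1, 2, 3, 2, 1]]

def level(image, gamma):
    """Step 1: gamma-style leveling on 0..255 values; 0 and 255 stay fixed."""
    return [[255.0 * (v / 255.0) ** gamma for v in row] for row in image]

def convolve_normalised(image, kernel=KERNEL):
    """Step 2: convolve with the kernel divided by the sum of its weights."""
    height, width = len(image), len(image[0])
    norm = sum(map(sum, kernel))           # 65 for the kernel above
    half = len(kernel) // 2
    out = []
    for y in range(height):
        row_out = []
        for x in range(width):
            acc = 0.0
            for ky, krow in enumerate(kernel):
                for kx, weight in enumerate(krow):
                    yy = min(max(y + ky - half, 0), height - 1)   # replicate edge
                    xx = min(max(x + kx - half, 0), width - 1)
                    acc += weight * image[yy][xx]
            row_out.append(acc / norm)
        out.append(row_out)
    return out

def threshold(image, t):
    """Step 3: 1 where the value exceeds t, else 0."""
    return [[1 if v > t else 0 for v in row] for row in image]
```

A possible call chain is `threshold(convolve_normalised(level(image, 0.8)), 128)`, with both numbers chosen by experiment.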
You could try a black top-hat transform. This involves subtracting the image from the closing of the image. I used a structuring element window size of 11 and a constant threshold of 0.1 (25.5 on a 255 scale).
You should get something like:
Which you can then easily threshold:
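For reference, the black top-hat step and the threshold that follows it can be sketched in plain Python like this. It assumes a square structuring element, windows truncated at the border, and a greyscale image stored as a list of rows; the helper names are made up for this example.

```python
def _window_op(image, size, op):
    """Apply op (max or min) over a size x size window around each pixel."""
    height, width = len(image), len(image[0])
    half = size // 2
    return [[op(image[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1)))
             for x in range(width)]
            for y in range(height)]

def black_top_hat(image, size=11):
    """Closing (dilation, then erosion) minus the original image."""
    closing = _window_op(_window_op(image, size, max), size, min)
    return [[c - v for c, v in zip(closing_row, row)]
            for closing_row, row in zip(closing, image)]

def threshold(image, t=25.5):
    """1 where the top-hat response exceeds t (0.1 of a 255 scale)."""
    return [[1 if v > t else 0 for v in row] for row in image]
```

Dark lines narrower than the window are removed by the closing, so they are what remains (as bright values) after the subtraction, while slow changes in background lighting largely cancel out.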
Best of luck.
Does anyone know the difference between gamma and exposure? And what is the difference between gamma correction and exposure adjustment in image processing?
Since you don't have an image processing background, I will start with the basics.
1) Every digital image has a dynamic range of gray levels. Gray levels are simply values that ultimately correspond to a color. For example, a monochrome (black-and-white) image has only 2 gray levels, 0 and 1, where 0 means black and 1 means white. Here the dynamic range is [0, 1], and each pixel is stored as a single bit.
Similarly, grayscale images contain shades of gray. Here each pixel is stored as 8 bits, so the dynamic range is [0, 255]. How? The maximum value is 2^n - 1, where n is the number of bits, i.e. 2^8 - 1 = 256 - 1 = 255.
Similarly, there are color images, which are 24-bit images. In general, the dynamic range of gray levels in an image is [0, L-1], where L is the number of gray levels.
2) Now that you understand what dynamic range is, let's look at gamma correction. Gamma correction is a function that compresses the dynamic range of an image so that we can view it properly. But why do we need to compress the dynamic range? A good everyday example is that we cannot see the stars during the day: the intensity of the sun is so large compared to the intensity of the stars that the stars are not visible. Similarly, when the dynamic range of an image is higher than that of the display device, we cannot see the image properly. Gamma correction can therefore be used to compress the dynamic range of the image.
3) Gamma correction can be written as g(x,y) = c * f(x,y)^γ, where γ is gamma, f(x,y) is the original image with high dynamic range, g(x,y) is the modified image, and c is a positive constant.
4) Exposure, as said in an earlier answer, is a phenomenon in the camera. I don't know much about it, as it is not covered in the image processing syllabus I am currently studying.
Gamma correction is a non-linear global function that compresses certain ranges in your image. It is mainly used to be more efficient from the point of view of human vision when values are stored in fixed-point format. It is absent in raw files, but exists in JPEG. Each pixel undergoes the following transformation:
y = x^p
Exposure is a physical phenomenon in your camera. Exposure adjustment, on the other hand, is a linear global function. It is used mainly to compensate for a lack or excess of exposure in the camera:
y = a*x
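The two formulas can be compared side by side in a short Python sketch. It assumes pixel values normalised to the range 0..1; clipping the exposure result at 1.0 is an added assumption to keep values displayable, and the function names are illustrative.

```python
def gamma_correct(x, p):
    """Non-linear: y = x^p. Keeps 0 and 1 fixed, bends the mid tones."""
    return x ** p

def adjust_exposure(x, a):
    """Linear: y = a*x, clipped to the displayable range."""
    return min(1.0, a * x)
```

For example, a dark pixel of 0.25 becomes 0.5 under both `gamma_correct(0.25, 0.5)` and `adjust_exposure(0.25, 2)`, but a bright pixel of 0.81 becomes 0.9 with the same gamma while the same exposure gain pushes it to the 1.0 clip. That difference in how highlights are treated is the practical distinction between the two adjustments.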
Exposure is an indication of the total quantity of light that reaches the CCD of your camera (or the silver ions on film). It can be expressed as the number of photons that hit your image-recording elements.
Films and CCD are calibrated to expect a certain quantity of light (certain number of photons) in order to be able to create an "average" image.
The higher the "expected" quantity of light, the lower the ISO number of your film (or camera setting) => in order to obtain a normal image, a film (or camera setting) of 100 ISO needs more light than a film of 3200 ISO, hence the use of 3200 ISO films for night photography.
Next step: the camera thing. When you want to make a picture (= have photons hit your CCD or film), you need to open the diaphragm of your camera. Depending on how much you open your diaphragm, the nature of your image will change (speaking from an artistic point of view here). If your diaphragm is wide open, most of the image which is not perfectly in focus will be blurred (e.g. as used in portrait photography). Conversely, if your diaphragm is only a little bit open during exposure, most of your image will be very sharp. This is used very often for landscape photography.
As your film (or CCD) expects a certain quantity of light at a given ISO value, it is obvious that a smaller diaphragm opening requires longer exposure times, whereas a wide-open diaphragm requires a very short time.
Good books about this subject are the series "The Camera", "The Negative" and "The Print" by Ansel Adams.
Conclusion: exposure and gamma correction are different things.
- Exposure is a part of the parameters you need to control while creating your initial image through the use of a camera.
- Gamma correction is related to subsequent manipulation of your image file. I'm not sure if the notion of "gamma correction" is being used in the context of film.
Basically:
Gamma is a monitor thing.
Exposure is a camera thing.