While reading through various code bases I noticed that several different people refer to the learning rate as "alpha". Where does this come from? Is it common?
The update rule for a parameter/weight in a gradient descent algorithm is
θ ← θ − α · ∇J(θ)
i.e. we take a small multiple of the gradient and use it to adjust the current value of the parameters. How much of the gradient we take is determined by alpha: the higher the alpha, the larger the portion of the current gradient that is applied; the smaller the alpha, the smaller the step.
This alpha is called the learning rate because the higher the alpha, the faster we move, and the lower the alpha, the slower the movement.
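To make the effect of alpha concrete, here is a minimal sketch of the update loop on a made-up one-dimensional loss (the function and values are purely illustrative):

```python
# Toy loss J(theta) = (theta - 3)^2 with gradient dJ/dtheta = 2 * (theta - 3).
def grad_J(theta):
    return 2.0 * (theta - 3.0)

alpha = 0.1      # the learning rate
theta = 0.0      # initial parameter value
for _ in range(50):
    theta -= alpha * grad_J(theta)   # theta <- theta - alpha * gradient

print(theta)     # approaches 3.0; a larger alpha moves faster but can overshoot
```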
I am not sure about the exact historical origins, but in general it is very common to use Greek letters as shorthand in maths and computer science. Alpha is just the symbol α, the first letter of the Greek alphabet.
Currently I am learning dense optical flow by myself. To understand it, I conducted an experiment. I produced one image using Matlab: a box with a given gray value placed on a uniform background. In a second image, the box is translated by two pixels in the x and y directions. The two images are fed into an implementation of the algorithm called TV-L1. The generated motion vectors outside of the box are not zero. Is the reason that the gradient outside of the box is zero? Are those values filled in from the values with large gradients?
In Horn and Schunck's paper, it reads
In parts of the image where the brightness gradient is zero, the velocity estimates will simply be averages of the neighboring velocity estimates. There is no local information to constrain the apparent velocity of motion of the brightness pattern in these areas.
The progress of this filling-in phenomena is similar to the propagation effects in the solution of the heat equation for a uniform flat plate, where the time rate of change of temperature is proportional to the Laplacian.
Is it not possible to obtain correct motion vectors for pixels with small gradients? Or is the experiment simply not realistic, i.e. in practical applications this doesn't happen?
Yes, in so-called homogeneous image regions with very small gradients there is no information from which a motion can be derived. That's why the motion of your rectangle is propagated outside its border. If you give your background a texture, this effect will be less dominant. I know this problem from estimating the ego-motion of a car: the street causes a lot of problems because of its homogeneity.
Two pioneers in this field, Lucas & Kanade (LK) and Horn & Schunck (HS), developed methods for computing optical flow (OF). Both rely on the brightness constancy assumption: the pixel values at a feature location do not change between two consecutive frames. This constraint may be expressed as I(x+dx, y+dy, t+dt) = I(x, y, t). Using a Taylor series expansion of I(x+dx, y+dy, t+dt) we get I(x+dx, y+dy, t+dt) = I(x, y, t) + ∂I/∂x dx + ∂I/∂y dy + ∂I/∂t dt + …, which gives ∂I/∂x dx + ∂I/∂y dy + ∂I/∂t dt = 0. Letting u = dx/dt and v = dy/dt and combining these equations, we get the OF constraint equation: ∂I/∂x u + ∂I/∂y v + ∂I/∂t = 0. This single equation has two unknowns, so it has more than one solution, and the different techniques diverge here. The LK equations are derived assuming that pixels in a neighborhood of each tracked feature move with the same velocity as the feature. In OpenCV, a pyramidal implementation of LK is used to catch large motions with a small window size (to keep the "same local velocity" assumption).
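For reference, here is a minimal sketch of the pyramidal LK call in OpenCV applied to two frames; the file names are placeholders and the parameters are just reasonable defaults, not tuned values:

```python
import cv2

# Two consecutive frames; "frame1.png"/"frame2.png" are placeholder names.
prev_img = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
next_img = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Pick well-textured points to track. In a homogeneous region this returns
# few or no points, which is exactly the ambiguity discussed above.
prev_pts = cv2.goodFeaturesToTrack(prev_img, maxCorners=200,
                                   qualityLevel=0.01, minDistance=5)

# maxLevel > 0 enables the image pyramid, so large motions can be caught
# while the per-level search window stays small.
next_pts, status, err = cv2.calcOpticalFlowPyrLK(
    prev_img, next_img, prev_pts, None, winSize=(15, 15), maxLevel=3)

flow = (next_pts - prev_pts)[status.ravel() == 1]
print(flow.mean(axis=0))  # average displacement of the successfully tracked points
```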
I am confused about the need to change color space for color comparison. I have read about delta E and the Lab format, and I do understand that comparisons in the RGB color space will not seem appropriate to the human eye. However, my program uses a linear color scale to calculate velocity from a color flow Doppler signal. It takes the mean color of a sample region and compares it to the colors of the scale to find its nearest neighbor using Euclidean distance. I do that entirely in the BGR (OpenCV) color space, as in the example image below:
Here, I obtain seemingly correct velocity values for each color circle, but is it only by chance, or is my assumption correct that since the color comparisons take place internally, it does not matter what color space I am in?
Since you are searching for the nearest neighbour and operate on 3D points (in a color space), it does not matter which color space you choose; they will only be displayed in different ways.
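For illustration, a minimal sketch of that nearest-neighbour lookup done directly on BGR triplets; the scale colors and velocity values here are made up, not taken from the question:

```python
import numpy as np

# Hypothetical color bar: BGR entries and the velocity each one encodes.
scale_colors = np.array([[255,   0,   0],
                         [128, 128,   0],
                         [  0, 255,   0],
                         [  0, 128, 128],
                         [  0,   0, 255]], dtype=float)
scale_velocities = np.array([-0.6, -0.3, 0.0, 0.3, 0.6])  # illustrative units

def velocity_of(mean_bgr):
    # Euclidean nearest neighbour in whatever space the triplets live in.
    d = np.linalg.norm(scale_colors - np.asarray(mean_bgr, dtype=float), axis=1)
    return scale_velocities[np.argmin(d)]

print(velocity_of([10, 250, 5]))   # close to pure green, so roughly 0.0
```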
Comparison of colour is not straight forward. You need to decide what defines a colour being close to another and then pick the most appropriate colour space to support that.
For example, working in HSL will give you an easy way to assess colours based upon the hue. This is fine if you are happy to disregard, or at least reduce the relevance of saturation and luminance.
If, on the other hand, you want a point change in saturation to be as relevant as a point change in hue, working in RGB or perhaps CMYK would be more appropriate: measure the distance by treating the channels as three axes and computing the distance between the two colours. This has the downside that a 10-point shift in saturation has the same measured difference as a 10-point shift in hue, which visually will not make that much sense, as the perceived difference will not be equivalent to the mathematical one.
And that brings in another consideration. The human eye is more sensitive to colour variance around some colours than others. Green, for example, takes more variation to be noticeable than magenta. It all comes down to evolution, but it may have a bearing on your representation.
Personally I tend to work with RGB as it is needed for visual display, but I most commonly arrange colours by hue, so I keep a conversion to HSL/HSB handy.
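As a small illustration of the two options above, here is a sketch comparing a hue-only difference (via OpenCV's HLS conversion) with a plain Euclidean distance over the channels; the two colours are made-up BGR triplets:

```python
import cv2
import numpy as np

c1 = np.uint8([[[40, 40, 200]]])   # a reddish BGR colour
c2 = np.uint8([[[40, 80, 200]]])   # a slightly more orange one

# Hue-based comparison: convert to HLS and compare only the H channel.
h1 = int(cv2.cvtColor(c1, cv2.COLOR_BGR2HLS)[0, 0, 0])
h2 = int(cv2.cvtColor(c2, cv2.COLOR_BGR2HLS)[0, 0, 0])
hue_diff = min(abs(h1 - h2), 180 - abs(h1 - h2))   # hue wraps around (0..179 in OpenCV)

# Channel-space comparison: straight Euclidean distance over the three axes.
channel_dist = np.linalg.norm(c1.astype(float) - c2.astype(float))

print(hue_diff, channel_dist)
```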
The problem
I've been building a (very) simple OCR engine.
Since I'm trying to classify very small (pixel size) characters, I'm having some difficulties on segmentation. Here's an example, after best-effort image-wide thresholding:
What I've tried
Error detection:
large horizontal size of the segments. It works, mostly, but fails (false positives) for a few larger characters.
classify, and reject on low score. This seems a bit wasteful.
Error correction:
add pixels vertically (vertical histogram), find the minimum. It cuts many segments in the wrong place, in many of the samples.
What I haven't tried yet
Trying to classify on all possible segmentation points (pixels). This would be very wasteful, and be difficult to expand for a 3-merged-characters segment.
I've been reading up on morphology approaches to turn the characters into mathematical curves, but I don't really know where to start, or whether it's worth the effort.
Where to go from here?
I have no idea. Hence this question :)
Lean back and half close your eyes.
63 :-)
Now, if only it was so easy for a computer!
It's tantalisingly close to what double-patterning does (or un-does?) in silicon masks.
I would suggest oversampling (doubling or quadrupling the pixel count in each axis), filtering (probably low pass - or possibly bandpass where the passband = spatial frequency of a line), re-thresholding until they separate. Expensive, so only apply in problem areas.
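A minimal sketch of that oversample / low-pass / re-threshold idea with OpenCV; the file name is a placeholder and the kernel size and scale factor would need tuning:

```python
import cv2

# Placeholder input: a cropped problem area of the binarised text.
img = cv2.imread("glyphs.png", cv2.IMREAD_GRAYSCALE)

# Oversample: quadruple the pixel count in each axis.
big = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

# Low-pass filter to smooth away the thin bridges between touching characters.
blurred = cv2.GaussianBlur(big, (9, 9), 0)

# Re-threshold (Otsu picks the level); touching glyphs often separate here.
_, separated = cv2.threshold(blurred, 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("separated.png", separated)
```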
Reinvent your problem so you do not need segmentation.
Really, for this scale I think you'd better invest in other approaches. For example, if you OCR text (do you?), you can use the information of lines (character height). There are not many fonts that can be used for small (yet readable) characters. My approach would be an algorithm that scans lines in scanlines (from left to right, taking pixels from top to bottom) and tries to find correlations between trained text and scanlines (n, n-1 ... n-x).
And you probably need the information in the grayscale levels as well, so it is better not to threshold the images.
I have problems to fully understand the need for gamma correction. I hope you guys can help me.
Let's assume we want to display 256 neighboring pixels. These pixels should form a smooth gradient from black to white. To denote their colors, we use linear gray values 0..255. Due to the non-linearity of the human eye, the monitor must not just turn these values into linear luminance values. If the neighboring pixels had the luminance values (1/256)*I_max, (2/256)*I_max, et cetera, we would perceive too-large differences in brightness between two pixels in the darker area (the gradient would not be smooth).
Fortunately, a monitor has the reciprocal non-linearity to the human eye. That means, if we put linear gray values 0..255 into the frame buffer, then the monitor turns them into non-linear luminance values x^gamma. However, as our eye is non-linear the other way round, we perceive a smooth linear gradient. The non-linearity of the monitor and the one of our eye cancel each other out.
So, why do we need the gamma correction? I have read in books that we always want the monitor to produce linear luminance values. According to them, the non-linearity of the monitor must be compensated before writing the gray values to the frame buffer. That is done by the gamma correction. However, my problem here is that - as far as I understand it - we would not perceive linear brightness values (i.e. we would not perceive a smooth, steady gradient) when the monitor produces linear luminance values.
As far as I see it, it would be just perfect, if we put linear gray values into the frame buffer. The monitor turns these values into non-linear luminance values and our eye perceives linear brightness values again, because the eye is reciprocal non-linear. There would be no need to gamma correct the gray values in the frame buffer and no need to force the monitor to produce linear luminance values.
What is wrong with my way of looking at these things?
Thanks
Allow me to 'resurrect' this question, since I am struggling with similar questions right now and I think I have found the answer; it may be useful for someone else. Or I might be wrong and someone could tell me :)
I think there is nothing wrong with your way of thinking. Thing is, you don't need to gamma-correct all the time, if you know what you are doing. It depends on what you want to achieve. Let's see two different cases.
A) Light simulation (AKA rendering). You have a diffuse surface with a light pointing towards it. Then, the light's intensity is doubled.
Well. Let's see what happens in the real world in such a situation. Assuming a purely diffuse surface, the intensity of the reflected light is going to be the surface's albedo multiplied by the incoming light intensity and the cosine of the angle between the incoming light and the normal. Whatever. Thing is, when the incoming light intensity is doubled, the reflected light intensity will be doubled too. This is why light transport is said to be a linear process. Funny enough, you will not perceive the surface as twice as bright, because our perception is nonlinear (this is modelled by the so-called Stevens' power law). Put again: in the real world the reflected light is doubled, but you do not perceive it as twice as bright.
Now, how would we simulate this? Well, if we have an sRGB texture with the surface's albedo, we would need to linearize it (by de-correcting it, which means applying the 2.2 gamma). Now that it is linear, and we have the light intensity, we can use the formula above to compute the reflected light intensity. Since we are in a linear space, doubling the light intensity doubles the output, like in the real world. Now we gamma-correct our results. Because of this, when the screen displays the rendered image, it will apply its gamma and so it will have a linear response, meaning that the intensity of the light emitted by the screen will be twice as much when we simulate the twice-as-powerful light as when we simulate the first one. So the light that arrives at your eyes from your screen will have double the intensity. Exactly as would happen if you were looking at the real surface with real lights affecting it. You will not perceive the second render as twice as bright, of course, but, as we said earlier, this is exactly what would happen in the real situation. Same behavior in the real world and in the simulation means that the simulation (the render) was correct :)
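A minimal numeric sketch of case A; the albedo, light intensity and geometry are made up, and the pow(2.2) decode/encode is only an approximation of the exact sRGB transfer function:

```python
import numpy as np

albedo_srgb = np.array([0.5, 0.3, 0.2])   # texel value as stored, normalized 0..1
albedo_lin  = albedo_srgb ** 2.2          # "de-correct" to linear space

light_intensity = 1.0
cos_theta = 0.8                           # N.L for some made-up geometry

reflected    = albedo_lin * light_intensity * cos_theta         # linear result
reflected_x2 = albedo_lin * (2 * light_intensity) * cos_theta   # doubled light

out    = reflected ** (1 / 2.2)           # gamma-correct for the display
out_x2 = reflected_x2 ** (1 / 2.2)
print(out, out_x2)   # the screen then emits twice the linear intensity for out_x2
```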
B) A different case is precisely when you want a gradient that should 'look' (i.e. be perceived as) linear.
Since you want the nonlinear response of the screen to cancel out our nonlinear visual perception, you can skip gamma correction altogether (as you suggest). Or, more accurately, keep operating in linear space and gamma-correcting, but create your gradient not with consecutive pixel values (1, 2, 3, ..., 255), which would be perceived nonlinearly (because of Stevens' law), but with values transformed by the inverse of our perceptual brightness response (that is, applying an exponent of 1/0.5 = 2 to the normalized values; this is the reciprocal of Stevens' exponent for brightness).
As a matter of fact, if you look at a gamma-corrected linear gradient such as the one at http://scanline.ca/gradients/ you do not perceive it as linear at all: you see far more variation in the lower intensities than in the higher ones (as expected).
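A minimal sketch of case B under the assumptions above (a perceptual exponent of 0.5 and a display gamma of 2.2, both round figures):

```python
import numpy as np

perceptual_exponent = 0.5
steps = np.linspace(0.0, 1.0, 256)                  # the ramp we want to *perceive*
linear = steps ** (1.0 / perceptual_exponent)       # pre-distort with the inverse (exponent 2)
framebuffer = np.round(255 * linear ** (1 / 2.2))   # then gamma-encode as usual

print(framebuffer[:5], framebuffer[-5:])            # first and last few encoded values
```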
Well, at least this is my current understanding of the topic. I hope it helps anyone. And again, please, please, if it is wrong I would be really grateful if someone could point it out...
The problem is really when doing color calculations. For example, if you are blending two colors, you need to use the linear intensities to do the calculations. To actually display the proper result, you then have to convert the linear intensities back to the gamma-corrected intensities.
How your eyes perceive the intensities isn't relevant. To do color calculations correctly, they have to be done based on the physical principles of optics, which relies on linear luminance values. Once you have calculated a color, you want those luminance values to be output by your monitor, regardless of how it is perceived, so you have to compensate for the fact that the monitor doesn't directly produce the colors that you want.
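A minimal sketch of that blending point; the pow(2.2) conversion is used here as an approximation of the sRGB transfer curve, and the colours are arbitrary:

```python
import numpy as np

def to_linear(c):
    return (np.asarray(c, dtype=float) / 255.0) ** 2.2

def to_encoded(c):
    return np.round(255.0 * np.asarray(c, dtype=float) ** (1 / 2.2))

a, b = [255, 0, 0], [0, 255, 0]                    # pure red and pure green (RGB)
naive   = (np.asarray(a) + np.asarray(b)) / 2      # blending the encoded values
correct = to_encoded((to_linear(a) + to_linear(b)) / 2)  # blend in linear space
print(naive, correct)   # the linear-space blend is noticeably brighter when displayed
```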
To actually answer what is wrong with your way of looking at this: there is nothing really wrong with it. It WOULD be great to have a linear framebuffer, but, as you say, it's definitely not great to have an 8-bit linear frame buffer.
The fact that 8 bits are so easy to handle is pretty much the only justification for gamma-compressed frame buffers and color notations (think HTML's #888; wouldn't it be awkward if middle gray were #333 instead of #888?).
About the monitor: you want to be able to predict its response to your input, and you know from sRGB what it should be. Normally that's all you need to know. Some people think it's "correct" or something if the monitor produces "linear" output, which can be simulated if you compensate for the monitor's gamma. I advise steering clear of such a setup, which breaks all the apps that (correctly and sanely) assume standard gamma in favour of un-breaking ill-conceived linearity-assuming apps. Don't do that. Instead, fix the apps or dump them.
I am trying to extract numbers from a typical scoreboard that you would find at a high school gym. I have each number in a digital "alarm clock" font and have managed to perspective-correct, threshold and extract a given digit from the video feed.
Here's a sample of my template input
My problem is that no one classification method will accurately determine all digits 0-9. I have tried several methods:
1) Tesseract OCR - this one consistently messes up on 4 and frequently returns weird results. Just using the command line version. If I actually try to train it on an "alarm clock" font, I get unknown character every time.
2) kNearest with OpenCV - I search a database consisting of my template images (0-9) and see which one is nearest. I frequently get confusion between 3/1 and 7/1
3) cvMatchShapes - this one is fairly bad, it usually can't tell the difference between 2 of the digits for each input digit
4) Tangent Distance - This one is the closest, but the smallest tangent distance between the input and my templates ends up mapping "7" to "1" every time
I'm really at a loss to get a classification algorithm for such a simple problem. I feel I have cleaned up the input fairly well and it's a fairly simple case for classification but I can't get anything reliable enough to actually use in practice. Any ideas about where to look for classification algorithms, or how to use them correctly would be appreciated. Am I not cleaning up the input? What about a better input database? I don't know what else I'd use for input, each digit and template looks spot on at this point.
The classical digit recognition approach, which should work well in this case, is to crop the image just around the digit and resize it to 4x4 pixels.
A Discrete Cosine Transform (DCT) can be used to further slim down the search space. You could select the first 4-6 values.
With those values, train a classifier. SVM is a good one, readily available in OpenCV.
It is not as simple as emma's or martin's suggestions, but it's more elegant and, I think, more robust.
Given the width/height ratio of your input, you may choose a different resolution, like 3x4. Choose the smallest one that retains readable digits.
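A minimal sketch of that pipeline in OpenCV; the training data (cropped digit images and their labels) is assumed to exist and is not shown, and the parameter choices are illustrative:

```python
import cv2
import numpy as np

def features(img, size=(4, 4), n_coeffs=6):
    # Crop is assumed done; resize to a tiny patch and take the DCT.
    small = cv2.resize(img, size, interpolation=cv2.INTER_AREA).astype(np.float32)
    coeffs = cv2.dct(small)                 # 2D Discrete Cosine Transform
    return coeffs.flatten()[:n_coeffs]      # keep the first few coefficients

def train(train_imgs, train_labels):
    # train_imgs: list of grayscale digit crops, train_labels: their digits 0-9
    X = np.array([features(im) for im in train_imgs], dtype=np.float32)
    y = np.array(train_labels, dtype=np.int32)
    svm = cv2.ml.SVM_create()
    svm.setKernel(cv2.ml.SVM_RBF)
    svm.train(X, cv2.ml.ROW_SAMPLE, y)
    return svm

def predict(svm, img):
    sample = features(img).reshape(1, -1)
    return int(svm.predict(sample)[1][0, 0])
```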
Given the highly regular nature of your input, you could define a set of 7 target areas of the image to check. Each area should encompass some significant portion of one of the 7 segments of each digit of the display, but not overlap.
You can then check each area and average the color/brightness of the pixels in it to generate a probability for a given binary state. If your probability is high on all areas, you can then easily figure out what the digit is.
It's not as elegant as a pure ML type algorithm, but ML is far more suited to inputs which are not regular, and in this case that does not seem to apply - so you trade elegance for accuracy.
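A minimal sketch of that seven-segment check; the region boxes, the on/off threshold and the segment order are assumptions that would need adjusting to the real layout:

```python
import numpy as np

# Segment order assumed here: top, top-left, top-right, middle,
# bottom-left, bottom-right, bottom.
DIGITS = {
    (1, 1, 1, 0, 1, 1, 1): 0, (0, 0, 1, 0, 0, 1, 0): 1, (1, 0, 1, 1, 1, 0, 1): 2,
    (1, 0, 1, 1, 0, 1, 1): 3, (0, 1, 1, 1, 0, 1, 0): 4, (1, 1, 0, 1, 0, 1, 1): 5,
    (1, 1, 0, 1, 1, 1, 1): 6, (1, 0, 1, 0, 0, 1, 0): 7, (1, 1, 1, 1, 1, 1, 1): 8,
    (1, 1, 1, 1, 0, 1, 1): 9,
}

def read_digit(img, regions, on_threshold=0.5):
    """img: binarised digit crop (0/255); regions: seven (y0, y1, x0, x1) boxes."""
    states = []
    for y0, y1, x0, x1 in regions:
        fill = img[y0:y1, x0:x1].mean() / 255.0   # fraction of lit pixels in the area
        states.append(1 if fill > on_threshold else 0)
    return DIGITS.get(tuple(states))              # None if no segment pattern matches
```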
Might sound silly, but have you tried simply checking for black bars vertically and then horizontally in the top and bottom halves, left and right of the centerline?
If you are trying text recognition with Tesseract, try passing not one digit but a number of duplicated digits; sometimes it can produce better results. Here's the example.
However, if you're planning business software, you may want to have a look at a commercial OCR SDK. For example, try ABBYY FineReader Engine. It's not affordable for free-to-use applications, but when it comes to business, it can add good value to your product. As far as I know, ABBYY provides the best OCR quality; for example, check out http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
You want your scorecard image inputs S feeding an algorithm that maps them to {0,1,2,3,4,5,6,7,8,9}.
Let V denote the set of n-tuples of integers.
Construct an algorithm α that maps each image S to an n-tuple
(k1,k2,...,kn)
that can differentiate between two different scoreboard digits.
If you can specify the range of α then you only have to collect the vectors in V that correspond to a digit in order to solve the problem.
I've applied this idea using Martin Beckett's idea and it works. My initial attempt was a simple injection into a 2-tuple by vertical left-to-right summing, with the first integer an image column offset and the second integer the length of a 'nice' vertical line.
This did not work: images for 6 and 8 would map to the same vectors. So I needed another mini-info-capture for my digit input types (they are not scoreboard digits), and a 3-tuple info vector does the trick.
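A loose sketch of that kind of mapping; the particular features here (column and length of the tallest vertical run, plus transitions along the middle scanline) are only illustrative stand-ins for whatever mini-info-capture fits your digits:

```python
import numpy as np

def feature_tuple(img):
    """img: binarised digit crop with foreground pixels > 0; returns a 3-tuple in V."""
    fg = img > 0
    col_sums = fg.sum(axis=0)                 # vertical sums, left to right
    col = int(col_sums.argmax())              # column offset of the longest vertical run
    length = int(col_sums[col])               # its length
    mid = fg[fg.shape[0] // 2]                # middle scanline
    transitions = int(np.count_nonzero(np.diff(mid.astype(int))))
    return (col, length, transitions)

# Classification then reduces to a lookup: record the tuple of each known digit
# template, and match an unknown digit to the template with the same tuple.
```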