Tesseract OCR, pre-processing for subtitles extraction - image-processing

I am trying to do OCR on some grayscale video frames using tesseract. OCR for subtitles is trickier than document extraction as the text is over the background frame and the background introduce a lot of noise and can have a similar color as the subtitles. The font are genererally anti-aliased, which makes it harder to extract only the text with thresholding, and for even more complexity, the frames I am working on are grayscale and have been downscaled.
Here is a sample:
The result I get after thresholding at 220:
It is quite alright already, but as you can see there is big chunks of the background remaining and Tesseact just can't dissociate it from the text.
There are techniques with some assumptions, like removing after thresholding black regions that touche the border, assuming the text doesn't, but this is quite specific and doesn't apply to a lot of case.
What I am looking for is a method that I can apply live and will generalise to other frames.

Related

refining captcha with a little noise

I'm trying to crack a particular web CAPTCHA. I'm planning to do it by segmenting the characters and passing them to an ANN (mostly for features, I will be using method of moments as it seems difficult to completely remove noise completely)
The captcha is very noisy, and unfortunately there is no color difference between the noise and the actual text, so separation based on color will not work. After quite some thought, I managed to implement a flood-fill style algorithm on the pixels of the captcha to separate small disconnected components, and after this I ended up with something like this:
Most of the noise is gone but some of it is left around the letters themselves (since it is touching the text).
I'm not an expert on image filters, and I'm finding it very difficult to find the right filter to reduce the remaining noise and enhance the characters.
Any Ideas on what filter(s) I could use for this purpose.
(Note: I'm not using any image manipulation tool/library for this. I'm writing raw pixel manipulation code, but I can implement most filters given their convolution kernel)
The problem is that due to this noise, it is becoming difficult to segment the characters. Clearly trying to find vertical lines with no dark pixels is not going to work, since there is noise and some of the letters are touching.
Any ideas on how I could segment these efficiently?
EDIT: Original image
what about trying morphological operators like closing and opening? they are very easy to implement and a simple but efficient tool.
After one closing with a 3x3 cross structuring element (kernel) and binarising the image the noise is almost gone:
I am sure just a bit more trying will render great results.
edit: to clear things up a little, the closing is a dilation followed by an erosion (other way around for opening). A dilation is assigning every pixel in your image the maximal value of all pixels in the kernel (structuring element) around it, conversly, the erosion assign every pixel the minimal value of all pixels in the kernel around it.
Also take a look at the wikipedia link and the external links in there.

OCR on antialiased text

I have to OCR table from PDF document. I wrote simple Python+opencv script to get individual cells. After that new problem arose. Text is antialiased and not good-quality.
Recognition rate of tesseract is very low. I've tried to preprocess images with adaptive thresholding but results weren't much better.
I've tried trial version of ABBYY FineReader and indeed it gives fine output, but I don't want to use non-free software.
I wonder if some preprocessing would solve issue or is it nessesary to write and learn other OCR system.
If you look closely at your antialiased text samples, you'll notice that the edges contain a lot of red and blue:
This suggests that the antialiasing is taking place inside your computer, which has used subpixel rendering to optimise the results for your LCD monitor.
If so, it should be quite easy to extract the text at a higher resolution. For example, you can use ImageMagick to extract images from PDF files at 300 dpi by using a command line like the following:
convert -density 300 source.pdf output.png
You could even try loading the PDF in your favourite viewer and copying the text directly to the clipboard.
Addendum:
I tried converting your sample text back into its original pixels and applying the scaling technique mentioned in the comments. Here are the results:
Original image:
After scaling 300% and applying simple threshold:
After smart scaling and thresholding:
As you can see, some of the letters are still a bit malformed, but I think there's a better chance of reading this with Tesseract.

Need advice on training Tesseract OCR (text with conversion/compression artifacts)

I need to do OCR on images that have gone through a digital to analog (interlaced video) to digital conversion, then jpeg compressed (resulting in compression artifacts). I have not been able to locate the exact fonts used, but we'll be looking at a mix of sans serif - e.g., Arial, Calibri, and Tiresias might work well as a training set. There is no way to get around the jpeg compression. These are text-only, white-on-black images at standard def resolution (720x480 deinterlaced).
An example is located here, resized at 1000%:
I've found a preprocessing pipeline that works fairly well for Tesseract:
Resize to 400-600%
Blur
Threshold (binarization)
Erode (get thinner stroke width)
One problem is that letters like 't' and 'f' end up with a diamond shape at the cross. Still, this process works well, but isn't quite perfect. So I'd like to train tesseract. My question:
How should I create the training set?
Should I try to emulate the analog-to-digital-to-analog by adding a small amount of noise, then compress with jpeg? Should I do preprocessing on my training set, similar to what I listed above? If I train with noisy jpeg compressed images to match my captured images, is it best to skip preprocessing on the captured images?
Additionally, any hints on getting rid of the conversion/compression artifacts without sacrificing the text would be appreciated.

why we should use gray scale for image processing

I think this can be a stupid question but after read a lot and search a lot about image processing every example I see about image processing uses gray scale to work
I understood that gray scale images use just one channel of color, that normally is necessary just 8 bit to be represented, etc... but, why use gray scale when we have a color image? What are the advantages of a gray scale? I could imagine that is because we have less bits to treat but even today with faster computers this is necessary?
I am not sure if I was clear about my doubt, I hope someone can answer me
thank you very much
As explained by John Zhang:
luminance is by far more important in distinguishing visual features
John also gives an excellent suggestion to illustrate this property: take a given image and separate the luminance plane from the chrominance planes.
To do so you can use ImageMagick separate operator that extracts the current contents of each channel as a gray-scale image:
convert myimage.gif -colorspace YCbCr -separate sep_YCbCr_%d.gif
Here's what it gives on a sample image (top-left: original color image, top-right: luminance plane, bottom row: chrominance planes):
To elaborate a bit on deltheil's answer:
Signal to noise. For many applications of image processing, color information doesn't help us identify important edges or other features. There are exceptions. If there is an edge (a step change in pixel value) in hue that is hard to detect in a grayscale image, or if we need to identify objects of known hue (orange fruit in front of green leaves), then color information could be useful. If we don't need color, then we can consider it noise. At first it's a bit counterintuitive to "think" in grayscale, but you get used to it.
Complexity of the code. If you want to find edges based on luminance AND chrominance, you've got more work ahead of you. That additional work (and additional debugging, additional pain in supporting the software, etc.) is hard to justify if the additional color information isn't helpful for applications of interest.
For learning image processing, it's better to understand grayscale processing first and understand how it applies to multichannel processing rather than starting with full color imaging and missing all the important insights that can (and should) be learned from single channel processing.
Difficulty of visualization. In grayscale images, the watershed algorithm is fairly easy to conceptualize because we can think of the two spatial dimensions and one brightness dimension as a 3D image with hills, valleys, catchment basins, ridges, etc. "Peak brightness" is just a mountain peak in our 3D visualization of the grayscale image. There are a number of algorithms for which an intuitive "physical" interpretation helps us think through a problem. In RGB, HSI, Lab, and other color spaces this sort of visualization is much harder since there are additional dimensions that the standard human brain can't visualize easily. Sure, we can think of "peak redness," but what does that mountain peak look like in an (x,y,h,s,i) space? Ouch. One workaround is to think of each color variable as an intensity image, but that leads us right back to grayscale image processing.
Color is complex. Humans perceive color and identify color with deceptive ease. If you get into the business of attempting to distinguish colors from one another, then you'll either want to (a) follow tradition and control the lighting, camera color calibration, and other factors to ensure the best results, or (b) settle down for a career-long journey into a topic that gets deeper the more you look at it, or (c) wish you could be back working with grayscale because at least then the problems seem solvable.
Speed. With modern computers, and with parallel programming, it's possible to perform simple pixel-by-pixel processing of a megapixel image in milliseconds. Facial recognition, OCR, content-aware resizing, mean shift segmentation, and other tasks can take much longer than that. Whatever processing time is required to manipulate the image or squeeze some useful data from it, most customers/users want it to go faster. If we make the hand-wavy assumption that processing a three-channel color image takes three times as long as processing a grayscale image--or maybe four times as long, since we may create a separate luminance channel--then that's not a big deal if we're processing video images on the fly and each frame can be processed in less than 1/30th or 1/25th of a second. But if we're analyzing thousands of images from a database, it's great if we can save ourselves processing time by resizing images, analyzing only portions of images, and/or eliminating color channels we don't need. Cutting processing time by a factor of three to four can mean the difference between running an 8-hour overnight test that ends before you get back to work, and having your computer's processors pegged for 24 hours straight.
Of all these, I'll emphasize the first two: make the image simpler, and reduce the amount of code you have to write.
I disagree with the implication that gray scale images are always better than color images; it depends on the technique and the overall goal of the processing. For example, if you wanted to count the bananas in an image of a fruit bowl image, then it's much easier to segment when you have a colored image!
Many images have to be in grayscale because of the measuring device used to obtain them. Think of an electron microscope. It's measuring the strength of an electron beam at various space points. An AFM is measuring the amount of resonance vibrations at various points topologically on a sample. In both cases, these tools are returning a singular value- an intensity, so they implicitly are creating a gray-scale image.
For image processing techniques based on brightness, they often can be applied sufficiently to the overall brightness (grayscale); however, there are many many instances where having a colored image is an advantage.
Binary might be too simple and it could not represent the picture character.
Color might be too much and affect the processing speed.
Thus, grayscale is chosen, which is in the mid of the two ends.
First of starting image processing whether on gray scale or color images, it is better to focus on the applications which we are applying. Unless and otherwise, if we choose one of them randomly, it will create accuracy problem in our result. For example, if I want to process image of waste bin, I prefer to choose gray scale rather than color. Because in the bin image I want only to detect the shape of bin image using optimized edge detection. I could not bother about the color of image but I want to see rectangular shape of the bin image correctly.

Image preprocessing for text recognition

What's the best set of image preprocessing operations to apply to images for text recognition in EmguCV?
I've included two sample images here.
Applying a low or high pass filter won't be suitable, as the text may be of any size. I've tried median and bilateral filters, but they don't seem to affect the image much.
The ideal result would be a binary image with all the text white, and most of the rest black. This image would then be sent to the OCR engine.
Thanks
There's nothing like the best set. Keep in mind that digital images can be acquired by different capture devices and each device can embed its own preprocessing system (filters) and other characteristics that can drastically change the image and even add noises to them. So every case would have to be treated (preprocessed) differently.
However, there are commmon operations that can be used to improve the detection, for instance, a very basic one would be to convert the image to grayscale and apply a threshold to binarize the image. Another technique I've used before is the bounding box, which allows you to detect the text region. To remove noises from images you might be interested in erode/dilate operations. I demonstrate some of these operations on this post.
Also, there are other interesting posts about OCR and OpenCV that you should take a look:
Simple Digit Recognition OCR in OpenCV-Python
Basic OCR in OpenCV
Now, just to show you a simple approach that can be used with your sample image, this is the result of inverting the color and applying a threshold:
cv::Mat new_img = cv::imread(argv[1]);
cv::bitwise_not(new_img, new_img);
double thres = 100;
double color = 255;
cv::threshold(new_img, new_img, thres, color, CV_THRESH_BINARY);
cv::imwrite("inv_thres.png", new_img);
Try morphological image processing. Have a look at this. However, it works only on binary images - so you will have to binarize the image( threshold?). Although, it is simple, it is dependent on font size, so one structure element will not work for all font sizes. If you want a generic solution, there are a number of papers for text detection in images - A search of this term in google scholar should provide you with some useful publications.

Resources