How to remove watermark from TIFFs to improve OCR - imagemagick

I have a bunch of uncompressed bitonal TIF document images. All of them have a watermark in the middle. When I run them through OCR, the text that overlaps with the watermark does not get recognized. I am trying to see if I can apply some type of cleanup to remove those watermarks to be able to recognize the missing text.
Again, the images are black and white, but when you look at the watermark it appears grey since it has a pattern of black and white pixels that makes the letters in the watermark less "dense" than regular text. At the same time, the watermark letters are very big, much bigger than the regular text.
An example of a somewhat similar image is this (except this one is color and the watermark characters in my case are a lot thicker and bigger; my watermarks are also a lot shorter: only 3 to 4 letters long)
It seems that there might be some sort of clean up filter that would be similar to removing large black borders from an image except borders are ually "denser" than a watermark so they appear "more black".
I have 3 tools at my disposal: GIMP, ImageMagick and IrfanView. Can you recommend any specific features of any subset of these tools that might help me?

Playing with contrast etc did not help, but I found a different way. As stated above, the regular text is a lot "denser" than the watermark text meaning that a regular black pixel has more surrounding black pixels than a watermark black pixel. So I devised a simple window-based filtering and thresholding algorithm.
Here's how I did it in Matlab, using a 5X5 window:
im=imread('imageWithWmark.tif');
imInv = ~im;
nr=size(imInv,1);
nc=size(imInv,2);
d = 2; % for 5X5 window
counts = zeros(nr,nc);
for rr = d+1 : nr-d-1
for cc = d+1 : nc-d-1
counts(rr,cc) = nnz(imInv(rr-d:rr+d,cc-d:cc+d));
end
end
thresh=10; % 10 out of 25 -- the larger the thresh the thinner the resulting letters are
imThresh = (counts>=thresh) & imInv;
imwrite(~imThresh,sprintf('Thresh_%d.tif',thresh),'Compression','none','Resolution',300);
Of course, the size of the window, the threshold and other parameters depend on the parameters of the regular text on the page (letter bigger/smaller, thicker/thinner etc) but even this initial version worked pretty well

Related

Remove color cast using libvips

I have sRGB images with color casts. To remove it manually I usually use Photoshop Level Adjustments. Photoshop also have tools for that: Auto Contrast or even better Auto Tone which also takes shadows, midtones & highlights into account.
If I remove the cast manually I adjust each of the RGB channels individually so that the darkest pixels are set to pure black and the lightest to pure white and then redistribute all other values (spreading the histogram). This is a simple approach but shows good results for my images.
In my node.js app I'm using sharp for image processing which uses libvips as its processing engine. I tried to remove the cast with .normalize() but this command works on all channels together and not individual for each of the RGB channels. So it doesn't work for me.
I also asked this question on the sharp project page. I tested the suggestion from lovell to try it with hist_local but the results are not useable for me.
Now I would like to find out how this could be done using the native libvips. I've played around with nip2 GUI and different commands but could not figure out how it could be achieved:
Histogram > Equalise Histogram > Global => Picture looks over saturated
Image > Levels > Scale to 0 - 255 => Channels ar not all spreading from 0 - 255 (I don't understand exactly what this command does?)
Thanks for every hint!
Addition
Here is a example with pictures from Photoshop to show what I want.
The source image is a picture of a frame from a film negative.
Image before processing
Step1 Invert image
Image after inversion
Step2 using Auto tone in Photoshop (works the same way as my description above about manually remove the color cast)
Image after Auto Tone
This last picture is ok for me.
nip2 has a menu item for this.
Load your image and mark a region on it containing the area you'd like to be neutral. It can be any lightness, it doesn't need to be white.
Use File / Open to get the file dialog and you should see the image loaded in your workspace as a thumbnail.
Doubleclick on the thumbnail to open an image view window.
In the view window, zoom and pan to the right spot. The user guide (press F1) has a section on image navigation.
Hold down CTRL and click and drag down and right to mark a rectangular region.
Back in the main window, click Toolkits / Tasks / Capture / White balance. You should see something like:
You can drag an resize your region to change the neutral point. Use the colour picker to set what white means. You can make other whites with (for example) Colour / New / Colour from CCT and link them together.
Click Colour / New / Colour from CCT to make a colour picker from CCT (correlated colour temperature) -- the temperature in Kelvin of that white.
Set it to something interesting, like 4800 for warm white.
Click on the formula for A5.white to edit it, and enter the cell of your CCT widget (A7 in this case).
Now you can drag the region to adjust the pixels to set the neutral from, and drag the CCT slider to set the temperature.
It can be annoying to find things in the toolkit menu. There's a thing for searching toolkits: in the main window, click View / Toolkit browser. You can enter something like "white" and it'll show related toolkit entries.
Here's another answer, but using pyvips and responding to the previous comments. I didn't want to delete the first answer as it still seemed useful.
This version finds the image histogram, searches for thresholds which will select 0.5% and 99.5% of pixels in each image band, then rescales the image so that those pixel values become 0 and 255.
import sys
import pyvips
# trim off this percentage of pixels from the top and bottom
trim_percent = 0.5
def percent(hist, percentage):
"""From a histogram, find the threshold above which lie
#percentage of pixels."""
# normalised cumulative histogram
norm = hist.hist_cum().hist_norm()
# column and row profile over percentage
c, r = (norm > norm.width * percentage / 100).profile()
return r.avg()
image = pyvips.Image.new_from_file(sys.argv[1])
# photographic negative
image = image.invert()
# find image histogram, split to set of separate bands
bands = image.hist_find().bandsplit()
# for each band, the low and high thresholds
low = [percent(band, trim_percent) for band in bands]
high = [percent(band, 100 - trim_percent) for band in bands]
# rescale image
scale = [255.0 / (h - l) for h, l in zip(high, low)]
image = (image - low) * scale
image.write_to_file(sys.argv[2])
It seems to give roughly similar results to the PS button. If I run:
$ ./autolevel.py ~/pics/before.jpg x.jpg
I see:
In the meantime I've found the Simplest Color Balance Algorithm which exactly describes the problem with color casts and there you can also find a C source code.
It is exactly the same solution as John describes in his second answer but as a small piece of c-code.
I'm now trying to use it as C/C++ addon with N-API under node.js.

Can morphology.remove_small_objects be used to remove small noise?

My binary image has lots of noise (small white blobs about 3-6 pixels in area). Can the function skimage.morphology.remove_small_objects() be used to remove these small blobs?
In my experimentation, the function leaves the image unchanged. Am I using the function incorrectly or is the function not suited to what I want to achieve?
src = cv2.imread('plan4.png')
src = cv2.GaussianBlur(src, (3,3), 1)
edges = get_edges(src.copy())
noise_reduced = morphology.remove_small_objects(edges .copy(), 2,)
cv2.imshow('src', src)
cv2.imshow('noise_reduced', noise_reduced)
cv2.imshow('edges ', edges )
Below is the original with small white blobs (that I want to remove) and the result of remove_small_objects() notice they are the same and no blobs are removed. *Note: morphological closing or opening the image would remove these small blobs but it also degrades my lines too much. I really prefer finding blobs whose area is ~6 pixels and deleting those.
When you pass in an integer image, scikit-image assumes that all the same-valued pixels belong to the same object, even if they are not connected. So, in your case, all the pixels are considered part of the same (big) object, so none are removed. Instead, you should do use
from skimage.measure import label
noise_reduced = morphology.remove_small_objects(label(edges), 2,)
Hope this helps!

How can I segment urine strip colors?

I'm trying to extract a urinalysis strip colors to analyze them and need to segment color areas to have a robust solution.
Currently, I'm using a hardcoded distance from top approximation.
I already tried using adaptative thresholding and can't segment colors correctly without detecting background noise, joining multiple colors or not detecting some colors at all.
I think you are a bit over complicating this: your problem is in essence a 1D problem: you can look at the average color per row of your image, and this should give you a clean and more robust version to work on:
img = imread('http://i.imgur.com/mhGA3hp.jpg');
img = im2double(img);
avg = mean(img,2);
imshow(bsxfun(#times, avg, ones(1,50,3)));
Results with:
I believe you will find it easier to work with the 1D clean version of your image.

How to remove non-periodic lines from binary image

Example Image
I want to remove the lines (shown in RED color) as they are out of order. Lines shown in black color are repeating at same period (approximately). Period is not known beforehand. Is there any way of deleting non-periodic lines( shown in red color) automatically?
NOTE: Image is binary ( back & while).. lines shown in red color only for illustration.
Of course there is any way. There is almost always some way to do something.
Infortunately you have not provided any particular problem. The entire thing is too broad to be answered here.
To help you getting started: (I highly recommend you start with pen, paper and your brain)
Detect the lines -> google or think, there are many standard ways to detect lines in an image. if you don't have noise in your binary image its trivial.
find any aequidistant sets -> think
delete the rest -> think ( you know what is good so everything else has to go away)
I assume, your lines are (almost) vertical.
The following should work
turn the image to a column sum histogram
try a Fourier transformation on the signal (potentially padding the image appropriately)
pick the maximum/peak from the Fourier spectrum as your base period
If you need the lines rather than the position of the lines, generate a mask with lines at appropriate intervals (as determined by your analysis before) and apply to the image.

Eliminating various backgrounds from image and segmenting object?

Let say I have this input image, with any number of boxes. I want to segment out these boxes, so I can eventually extract them out.
input image:
The background could anything that is continuous, like a painted wall, wooden table, carpet.
My idea was that the gradient would be the same throughout the background, and with a constant gradient. I could turn where the gradient is about the same, into zero's in the image.
Through edge detection, I would dilate and fill the regions where edges detected. Essentially my goal is to make a blob of the areas where the boxes are. Having the blobs, I would know the exact location of the boxes, thus being able to crop out the boxes from the input image.
So in this case, I should be able to have four blobs, and then I would be able to crop out four images from the input image.
This is how far I got:
segmented image:
query = imread('AllFour.jpg');
gray = rgb2gray(query);
[~, threshold] = edge(gray, 'sobel');
weightedFactor = 1.5;
BWs = edge(gray,'roberts');
%figure, imshow(BWs), title('binary gradient mask');
se90 = strel('disk', 30);
se0 = strel('square', 3);
BWsdil = imdilate(BWs, [se90]);
%figure, imshow(BWsdil), title('dilated gradient mask');
BWdfill = imfill(BWsdil, 'holes');
figure, imshow(BWdfill);
title('binary image with filled holes');
What a very interesting problem! Here's my solution in an attempt to solve this problem for you. This is assuming that the background has the same colour distribution throughout. First, transform your image from RGB to the HSV colour space with rgb2hsv. The HSV colour space is an ideal transform for analyzing colours. After this, I would look at the saturation and value planes. Saturation is concerned with how "pure" the colour is, while value is the intensity or brightness of the colour itself. If you take a look at the saturation and value planes for the image, this is what is shown:
im = imread('http://i.stack.imgur.com/1SGVm.jpg');
out = rgb2hsv(im);
figure;
subplot(2,1,1);
imshow(out(:,:,2));
subplot(2,1,2);
imshow(out(:,:,3));
This is what I get:
By taking a look at some locations in the gray background, it looks like the majority of the saturation are less than 0.2 as well as the elements in the value plane are greater than 0.3. As such, we want to find the opposite of those pixels to get our objects. As such, we find those pixels whose saturation is greater than 0.2 or those pixels with a value that is less than 0.3:
seg = out(:,:,2) > 0.2 | out(:,:,3) < 0.3;
This is what we get:
Almost there! There are some spurious single pixels, so I'm going to perform an opening with imopen with a line structuring element.
After this, I'll perform a dilation with imdilate to close any gaps, then use imfill with the 'holes' option to fill in the gaps, then use erosion with imerode to shrink the shapes back to their original form. As such:
se = strel('line', 3, 90);
pre = imopen(seg, c);
se = strel('square', 20);
pre2 = imdilate(pre, se);
pre3 = imfill(pre2, 'holes');
final = imerode(pre3, se);
figure;
imshow(final);
final contains the segmented image with the 4 candy boxes. This is what I get:
Try resizing the image. When you make it smaller, it would be easier to join edges. I tried what's shown below. You might have to tune it depending on the nature of the background.
close all;
clear all;
im = imread('1SGVm.jpg');
small = imresize(im, .25); % resize
grad = (double(imdilate(small, ones(3))) - double(small)); % extract edges
gradSum = sum(grad, 3);
bw = edge(gradSum, 'Canny');
joined = imdilate(bw, ones(3)); % join edges
filled = imfill(joined, 'holes');
filled = imerode(filled, ones(3));
imshow(label2rgb(bwlabel(filled))) % label the regions and show
If you have a recent version of MATLAB, try the Color Thresholder app in the image processing toolbox. It lets you interactively play with different color spaces, to see which one can give you the best segmentation.
If your candy covers are fixed or you know all the covers that are coming into the scene then Template matching is best for this. As it is independent of the background in the image.
http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html

Resources