tesseract for empty table cell - image-processing

I am using Tesseract to convert an image into text on a CentOS instance, but I am not able to handle blank cells.
Output that I get from tesseract:
Legal entity Project Category Mon 08/ 20 Tue 08/ 21 Wed 08/ 22 Thu
08/23 Fri 08/24 Sat 08/25 Sun 08/ 26 Total
test Development Improvements - Improvem 8.00 8.00 8.00 8.00 8.00
40.00 H '9
Please note that in the second line there is only a space after the last 8 and before the 40 (basically the Sat/Sun cells are empty).

I would try to locate the regions containing text before performing the OCR, making each one an ROI, and then run OCR on the ROIs instead of the whole image. For each ROI, check whether it contains contours: if it does, perform OCR; otherwise output a blank space. Hope it helps a bit, cheers!
Example:
import cv2
import numpy as np

img = cv2.imread('table_so.png')
res = cv2.resize(img, None, fx=0.8, fy=0.8, interpolation=cv2.INTER_CUBIC)
h, w, ch = res.shape

# Draw a thick black border so cells touching the image edge also form closed contours
cv2.rectangle(res, (0, 0), (w, h), (0, 0, 0), 10)

gray = cv2.cvtColor(res, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 220, 255, cv2.THRESH_BINARY)

# OpenCV 3.x signature; with OpenCV 4.x drop the first return value
_, contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)

# Sort the contours top-to-bottom, then left-to-right
sort_cnts = sorted(contours, key=lambda ctr: cv2.boundingRect(ctr)[0] + cv2.boundingRect(ctr)[1] * res.shape[1])

ROIs = []
for cnt in sort_cnts:
    x, y, w, h = cv2.boundingRect(cnt)
    # Keep only contours with roughly cell-like dimensions
    if 2000 > w > 70 and h < 100:
        ROI = res[y:y+h, x:x+w]
        ROIs.append(ROI)
        cv2.rectangle(res, (x, y), (x+w, y+h), (0, 255, 0), 2)

for roi in ROIs:
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 220, 255, cv2.THRESH_BINARY)
    _, contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    # More than one contour means the cell contains something to OCR
    if len(contours) > 1:
        print('DO OCR HERE')
    else:
        print('BLANK SPACE')
    cv2.imshow('img', gray)
    cv2.waitKey(0)

cv2.imshow('img', res)
cv2.waitKey(0)
cv2.destroyAllWindows()
Result:
(The green boxes are ROIs)
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
BLANK SPACE
BLANK SPACE
BLANK SPACE

You can either train Tesseract and make it recognize blank spaces (not recommended, since it can mess up the 100% output you're getting), or resolve the issue in code. Unfortunately, there's no way to just train Tesseract the way you want.
The best solution I see here is to display a 0 or something similar (any character you feel comfortable with) in the Saturday and Sunday cells, so that Tesseract can see them and you can react to that.
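A minimal sketch of that idea, assuming you already know the pixel coordinates of the empty Sat/Sun cells (the coordinates and file names below are placeholders):

import cv2

img = cv2.imread('table_so.png')

# Hypothetical coordinates of the empty Saturday/Sunday cells - adjust for your table
empty_cells = [(820, 95), (900, 95)]

for (x, y) in empty_cells:
    # Draw a visible "0" so Tesseract emits a token instead of collapsing the cell
    cv2.putText(img, '0', (x, y), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 0), 2)

cv2.imwrite('table_filled.png', img)  # run Tesseract on this image instead

After the OCR, any 0 coming from those columns can then be treated as an empty cell.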

Try setting preserve_interword_spaces to 1.
How to preserve document structure in tesseract
Tesseract - ambiguity in space and tab
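For example, a minimal sketch using pytesseract (assuming the pytesseract wrapper is installed; --psm 6, which treats the image as a single uniform block, is a guess for this table layout):

import pytesseract
from PIL import Image

# preserve_interword_spaces=1 keeps runs of spaces instead of collapsing them,
# so empty cells show up as wide gaps in the output
text = pytesseract.image_to_string(
    Image.open('table_so.png'),
    config='--psm 6 -c preserve_interword_spaces=1'
)
print(text)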

Related

Extract stripes from low contrast grayscale images

I want to extract stripes from this sample file, and the result should look like this similar result image. Then, I need to count the number of stripes on the right, and calculate the distance from the end of each left stripe to the end of each adjacent right stripe.
I tried the following code, but my result file is still a little bit away from my target. Here is what I do:
import numpy as np
import cv2
from matplotlib import pyplot as plt

# Read the 16-bit image unchanged to preserve precision
gray = cv2.imread('input_file.png', cv2.IMREAD_UNCHANGED)

# Second-order derivative in the Y direction to highlight the stripes
sobelY = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
sobelY2 = cv2.Sobel(sobelY, cv2.CV_32F, 0, 1, ksize=3)
sobelY2[sobelY2 < 0] = 0
mask = np.where(sobelY2 == 0, 0, 1)

sobelY2 = cv2.normalize(sobelY2, dst=None, alpha=0, beta=65535, norm_type=cv2.NORM_MINMAX).astype(np.uint16)

# Two CLAHE passes to balance the contrast
clahe = cv2.createCLAHE(clipLimit=6, tileGridSize=(8, 8))
sobelY2_clahe = clahe.apply(sobelY2)
sobelY2_clahe = clahe.apply(sobelY2_clahe)

# Restore the background pixels to 0 after CLAHE
result = np.where(mask != 0, sobelY2_clahe, 0)

fig = plt.figure(figsize=(10, 10))
ax = plt.subplot(121)
plt.imshow(gray, cmap='gray')
ax = plt.subplot(122)
plt.imshow(result, cmap='gray')
plt.show()
The input file is in 16-bit format, so I keep it unchanged for accuracy. I apply a second-order Sobel operation in the Y direction to highlight the stripes, and then I apply CLAHE twice to balance the contrast. To keep the background pixels at 0, I use a mask to set the values back after the CLAHE operations.
Any advice is appreciated!
For completeness, I am attaching another, more challenging input file for reference.
Edit:
The sobelY2 image pretty much reflects the stripes, but could we make it look better?
I just opened a new question about how to trim each of these stripes based on grayscale values: trim image based on grayscale values

contouring does not detect an object inside another object

I am trying to detect the number of contours on this image. Ideally there are supposed to be 3, but due to noise I was not getting the ideal result. Hence I tried to blur the image before thresholding it, as below:
import numpy as np
import cv2

img = cv2.imread('Inkedblueimagewithdot.jpg')
cv2.imshow('original', img)

# Smooth out noise before thresholding
blur = cv2.pyrMeanShiftFiltering(img, 21, 49)
gray_image = cv2.cvtColor(blur, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(gray_image, 70, 255, cv2.THRESH_BINARY)

# OpenCV 3.x signature; with OpenCV 4.x drop the first return value
_, contours, hierarchy = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
print(len(contours))

contourimage = cv2.drawContours(img, contours, -1, (255, 255, 255), 20)
cv2.imshow('contours', contourimage)
cv2.waitKey(0)
cv2.destroyAllWindows()
output is:
2
This is the input image:
This is the output image:
In order to obtain 3 contours, you could use cv2.RETR_LIST. It lists all the contours present in the binary image irrespective of any hierarchy, as mentioned here.
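For instance, a minimal sketch of that change, assuming the same thresholded image thresh from the question (OpenCV 3.x return signature):

# RETR_LIST returns every contour, including those nested inside other objects
_, contours, hierarchy = cv2.findContours(thresh, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
print(len(contours))  # per the answer, this should report 3 for the posted image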
To answer the second question, you could try setting an area constraint such that contours below a certain area are discarded. For the image provided I set an area threshold of 4000:
for i, c in enumerate(contours):
    if cv2.contourArea(c) > 4000:
        x, y, w, h = cv2.boundingRect(c)
        roi = image[y:y + h, x:x + w]
        cv2.imshow('cropped_region', roi)
        cv2.waitKey(0)
Expected result:

Splitting up digits in images

I've gotten access to a lot of reports which are filled out by hand. One of the columns in the report contains a timestamp, which I would like to attempt to identify without going through each report manually.
I am playing with the idea of splitting the times, e.g. 00:30, into four digits, and running these through a classifier trained on MNIST to identify the actual timestamps.
When I manually extract the four digits in Photoshop and run these through an MNIST classifier, it works perfectly. But so far I haven't been able to figure out how to programmatically split the number sequences into single digits. I tried different types of contour finding in OpenCV, but it didn't work very reliably.
Any suggestions?
I've added a screenshot of some of the relevant columns in the reports.
I would do something like this (no code for now since it is just an idea; you could test it to see if it works):
Extract each area for each group of numbers as Rick M. suggested above. So you will have many Kl [hour] rectangles as separate images.
For each of these rectangles, extract each ROI (using OpenCV's contour feature). Delete the Kl if you don't need it; you know the dimensions of this ROI (you can calculate them with img.shape) and they are more or less the same in every rectangle.
Extract all digits using the same script used above. You can take a look at my questions/answers to find some pieces of code which do this.
You will have a problem with the underline in some cases. Search for this on SO; there are a few solutions complete with code.
Now, about splitting up. We know the ROIs are in hour format, so hh:mm (or 4 digits). A simple (and very rudimentary) solution to split characters which are attached to each other is to cut the ROI that contains 2 digits in half, as sketched below. It's a crude solution, but it should perform well in your case because at most 2 digits are attached.
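A minimal sketch of that half-split, assuming roi is a crop known to contain exactly two touching digits:

# Split a two-digit ROI down the middle; crude, but fine when exactly two digits touch
h, w = roi.shape[:2]
left_digit = roi[:, :w // 2]
right_digit = roi[:, w // 2:]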
Some digits will output with "missing pieces". This can be avoided by using some erosion/dilation/skeletonization.
Here you don't have letters, only numbers, so MNIST should work well (not perfectly, keep this in mind).
In short, extracting the data is not the hard task, but recognizing the digits will make you sweat a bit.
I hope I can provide some code to show the steps above as soon as possible.
EDIT - code
This is some code I made. Final output is this:
The code works 100% with this image, so if something doesn't work for you, check your folders/paths/module installation.
Hope this helped.
import cv2
import numpy as np
# 1 - remove the vertical line on the left
img = cv2.imread('image.jpg', 0)
# gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(img, 100, 150, apertureSize=5)
lines = cv2.HoughLines(edges, 1, np.pi / 50, 50)
for rho, theta in lines[0]:
    a = np.cos(theta)
    b = np.sin(theta)
    x0 = a * rho
    y0 = b * rho
    x1 = int(x0 + 1000 * (-b))
    y1 = int(y0 + 1000 * (a))
    x2 = int(x0 - 1000 * (-b))
    y2 = int(y0 - 1000 * (a))
    cv2.line(img, (x1, y1), (x2, y2), (255, 255, 255), 10)
cv2.imshow('marked', img)
cv2.waitKey(0)
cv2.imwrite('image.png', img)
# 2 - remove horizontal lines
img = cv2.imread("image.png")
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_orig = cv2.imread("image.png")
img = cv2.bitwise_not(img)
th2 = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, -2)
cv2.imshow("th2", th2)
cv2.waitKey(0)
cv2.destroyAllWindows()
horizontal = th2
rows, cols = horizontal.shape
# inverse the image, so that lines are black for masking
horizontal_inv = cv2.bitwise_not(horizontal)
# perform bitwise_and to mask the lines with provided mask
masked_img = cv2.bitwise_and(img, img, mask=horizontal_inv)
# reverse the image back to normal
masked_img_inv = cv2.bitwise_not(masked_img)
cv2.imshow("masked img", masked_img_inv)
cv2.waitKey(0)
cv2.destroyAllWindows()
horizontalsize = int(cols / 30)
horizontalStructure = cv2.getStructuringElement(cv2.MORPH_RECT, (horizontalsize, 1))
horizontal = cv2.erode(horizontal, horizontalStructure, (-1, -1))
horizontal = cv2.dilate(horizontal, horizontalStructure, (-1, -1))
cv2.imshow("horizontal", horizontal)
cv2.waitKey(0)
cv2.destroyAllWindows()
# step1
edges = cv2.adaptiveThreshold(horizontal, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 3, -2)
cv2.imshow("edges", edges)
cv2.waitKey(0)
cv2.destroyAllWindows()
# step2
kernel = np.ones((1, 2), dtype="uint8")
dilated = cv2.dilate(edges, kernel)
cv2.imshow("dilated", dilated)
cv2.waitKey(0)
cv2.destroyAllWindows()
im2, ctrs, hier = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# sort contours
sorted_ctrs = sorted(ctrs, key=lambda ctr: cv2.boundingRect(ctr)[0])
for i, ctr in enumerate(sorted_ctrs):
    # Get bounding box
    x, y, w, h = cv2.boundingRect(ctr)
    # Getting ROI
    roi = img[y:y + h, x:x + w]
    # show ROI
    rect = cv2.rectangle(img_orig, (x, y), (x + w, y + h), (255, 255, 255), -1)

cv2.imshow('areas', rect)
cv2.waitKey(0)
cv2.imwrite('no_lines.png', rect)
# 3 - detect and extract ROI's
image = cv2.imread('no_lines.png')
cv2.imshow('i', image)
cv2.waitKey(0)
# grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow('gray', gray)
cv2.waitKey(0)
# binary
ret, thresh = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY_INV)
cv2.imshow('thresh', thresh)
cv2.waitKey(0)
# dilation
kernel = np.ones((8, 45), np.uint8) # values set for this image only - need to change for different images
img_dilation = cv2.dilate(thresh, kernel, iterations=1)
cv2.imshow('dilated', img_dilation)
cv2.waitKey(0)
# find contours
im2, ctrs, hier = cv2.findContours(img_dilation.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# sort contours
sorted_ctrs = sorted(ctrs, key=lambda ctr: cv2.boundingRect(ctr)[0])
for i, ctr in enumerate(sorted_ctrs):
    # Get bounding box
    x, y, w, h = cv2.boundingRect(ctr)
    # Getting ROI
    roi = image[y:y + h, x:x + w]
    # show ROI
    # cv2.imshow('segment no:' + str(i), roi)
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 255, 255), 1)
    # cv2.waitKey(0)
    # save only the ROIs which contain valid information
    if h > 20 and w > 75:
        cv2.imwrite('roi\\{}.png'.format(i), roi)

cv2.imshow('marked areas', image)
cv2.waitKey(0)
These are the next steps:
Understand what I wrote ;). It's the most important step.
Using pieces of the code above (especially step 3), you can delete the remaining Kl in the extracted images.
Create a folder for each image and extract the digits.
Using MNIST, recognize each digit.
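A rough sketch of that last step, assuming a Keras classifier already trained on MNIST and saved under a hypothetical file name mnist_cnn.h5:

import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('mnist_cnn.h5')  # hypothetical pre-trained MNIST classifier

def classify_digit(digit_roi):
    # MNIST expects a 28x28, white-on-black digit with values in [0, 1]
    digit = cv2.cvtColor(digit_roi, cv2.COLOR_BGR2GRAY) if digit_roi.ndim == 3 else digit_roi
    digit = cv2.bitwise_not(digit)  # invert, since the crops are dark digits on light paper
    digit = cv2.resize(digit, (28, 28)).astype('float32') / 255.0
    probs = model.predict(digit.reshape(1, 28, 28, 1))
    return int(np.argmax(probs))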
Breaking up text into individual characters is not as easy as it sounds at first. You can try to find some rules and manipulate the image by them, but there will just be too many exceptions. For example, you can try to find disjoint marks, but the fourth one in your image, 0715, has its "5" broken up into three pieces, and the 9th one, 17.00, has the two zeros overlapping.
You are very lucky with the horizontal lines - at least it's easy to separate the different entries. But you have to come up with a lot of ideas related to semi-fixed character width, a "soft" disjointness rule, etc.
I did a project like that two years ago and we ended up using an external open-source library called Tesseract. Here's an article on Roman numeral recognition with it, reaching about 90% accuracy. You might also want to look into the Lipi Toolkit, but I have no experience with that.
You might also want to consider just training a network to recognize the four digits at once. So the input would be the whole field with the four handwritten digits and the output would be the four numbers. Let the network sort out where the characters are. If you have enough training data, that's probably the easiest approach.
EDIT:
Inspired by @Link's answer, I just came up with this idea; you can give it a try. Once you have extracted the area between the two lines, trim the image to get rid of the white space all around. Then make an educated guess about how big the characters are - maybe use the height of the area? Then create a sliding window over the image and run the recognition all the way across. There will most likely be four peaks, which would correspond to the four digits.
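A rough sketch of that sliding-window idea (the classifier argument is a placeholder for whatever digit recognizer you use; the window width and step are guesses to tune):

import numpy as np

def sliding_window_scores(field_img, classifier, win_w=None, step=4):
    # Slide a fixed-width window across the cropped field and record the
    # classifier's confidence at each position; peaks should line up with the digits
    h, w = field_img.shape[:2]
    win_w = win_w or h  # guess: characters are roughly as wide as the field is tall
    scores = []
    for x in range(0, w - win_w + 1, step):
        window = field_img[:, x:x + win_w]
        digit, confidence = classifier(window)  # placeholder recognizer
        scores.append((x, digit, confidence))
    return scores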

Find the coordinate of a specific text in an image

I am trying to segment the questions in the image below. The only clue I have is the bold number which is indented by a tab space. I am trying to find the bold numbering (4, 5, 6 in this case) so that I can get their x and y coordinates and segment the image into 3 separate questions. How can I get these, or how should I approach this problem?
I am using scikit-image for image processing.
Your image looks quite simple, so the text can be segmented quite easily with contour detection around the dilated components. Here are the detailed steps:
1) Binarize the image and invert it for easy morphological operations.
2) Dilate the image in the horizontal direction only, using a long horizontal kernel, say a (20, 1) shaped kernel.
3) Find the contours of all the connected components and get their coordinates.
4) Use the bounding boxes' dimensions and coordinates to segment the questions.
Here is the Python implementation of the same:
# Text segmentation
import cv2
import numpy as np

rgb = cv2.imread(r'D:\Image\st4.png')
small = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)

# threshold the image
_, bw = cv2.threshold(small, 0.0, 255.0, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# get a horizontal mask of large size since text lines are horizontal components
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (20, 1))
connected = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel)

# find all the contours
_, contours, hierarchy = cv2.findContours(connected.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

# segment the text lines
for idx in range(len(contours)):
    x, y, w, h = cv2.boundingRect(contours[idx])
    cv2.rectangle(rgb, (x, y), (x + w - 1, y + h - 1), (0, 255, 0), 2)
Output image:

How to improve OCR of text written on vehicles?

I am trying to do OCR of vehicles such as trains or trucks to identify the numbers and characters written on them. (Please note this is not license plate identification OCR)
I took this image. The idea is to be able to extract the text - BN SF 721 734 written on it.
For pre-processing, I first converted this image to grayscale and then converted it to a binarized image which looks something like this
I wrote some code using the Tesseract API:
from PIL import Image
from tesserocr import PyTessBaseAPI

myimg = "image.png"
image = Image.open(myimg)
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    words = api.GetUTF8Text()
    print(words)
    print(api.AllWordConfidences())
This code gave me a blank output with a confidence value of 95 which means that tesseract was 95% confident that no text exists in this image.
Then I used the SetRectangle API in Tesseract to restrict OCR to a particular window within the image instead of trying to do OCR on the entire image.
myimg = "image.png"
image = Image.open(myimg)
with PyTessBaseAPI() as api:
    api.SetImage(image)
    # Restrict OCR to the rectangle containing the text
    api.SetRectangle(665, 445, 75, 40)
    api.Recognize()
    words = api.GetUTF8Text()
    print(words)
    print(api.AllWordConfidences())
    print("----")
The coordinates 665, 445, 75 and 40 correspond to a rectangle which contains the text BNSF 721 734 in the image.
665 - top, 445 - left, 75 - width and 40 - height.
The output I got was this:
an s
m,m
My question is: how do I improve the results? I played around with the values in the SetRectangle function and the results varied a bit, but all of them were equally bad.
Is there a way to improve this?
If you are interested in how I converted the images to binarized images, I used OpenCV:
import cv2

img = cv2.imread(image)
grayscale_img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
(thresh, im_bw) = cv2.threshold(grayscale_img, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
thresh = 127
binarized_img = cv2.threshold(grayscale_img, thresh, 255, cv2.THRESH_BINARY)[1]
I suggest finding the contours in your cropped rectangle and setting some parameters to match the contours of your characters, for example: contours with an area larger or smaller than some thresholds. Then draw the contours one by one on an empty bitmap and perform OCR.
I know it seems like a lot of work, but it gives you better and more robust results.
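A minimal sketch of that idea, assuming the cropped rectangle is saved as crop.png and pytesseract is installed (the area thresholds are placeholders to tune):

import cv2
import numpy as np
import pytesseract

crop = cv2.imread('crop.png', cv2.IMREAD_GRAYSCALE)
_, bw = cv2.threshold(crop, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# OpenCV 4.x signature; on 3.x findContours returns three values
contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Keep only contours whose area looks like a character (placeholder thresholds)
chars = [c for c in contours if 50 < cv2.contourArea(c) < 5000]

# Draw the kept contours filled on an empty bitmap, then OCR that cleaned image
canvas = np.zeros_like(bw)
cv2.drawContours(canvas, chars, -1, 255, thickness=cv2.FILLED)
canvas = cv2.bitwise_not(canvas)  # Tesseract prefers dark text on a light background

print(pytesseract.image_to_string(canvas, config='--psm 7'))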
Good luck!
