Extracting table structures from image - opencv

I have a bunch of images like
What would be a good way to extract just the table structure from the image? I'm only interested in extracting the straight lines.
I have been toying around with the OpenCV Finding Contours code sample and the results are quite promising. I'm just wondering if there is maybe a better way?

OpenCV has a nice way to detect line segments. Here is a code snippet in python:
import math
import numpy as np
import cv2
img = cv2.imread('page2.png')
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
lsd = cv2.createLineSegmentDetector(0)
dlines = lsd.detect(gray)
for dline in dlines[0]:
    x0 = int(round(dline[0][0]))
    y0 = int(round(dline[0][1]))
    x1 = int(round(dline[0][2]))
    y1 = int(round(dline[0][3]))
    cv2.line(img, (x0, y0), (x1, y1), 255, 1, cv2.LINE_AA)
    # print line segment length
    a = (x0 - x1) * (x0 - x1)
    b = (y0 - y1) * (y0 - y1)
    c = a + b
    print(math.sqrt(c))
cv2.imwrite('page2_lines.png', img)
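If you only need the table's horizontal and vertical rules rather than every segment, a simple follow-up is to filter the detected segments by angle. A minimal sketch, continuing from the dlines returned above (the 5-degree tolerance is an assumption and may need tuning for skewed scans):

import math

kept = []
for dline in dlines[0]:
    x0, y0, x1, y1 = dline[0]
    angle = abs(math.degrees(math.atan2(y1 - y0, x1 - x0))) % 180
    # keep only near-horizontal or near-vertical segments
    if angle < 5 or angle > 175 or abs(angle - 90) < 5:
        kept.append((int(x0), int(y0), int(x1), int(y1)))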

Kindly go through my Github repository Code for table extraction
The developed code detects the table and extracts the information while keeping the spatial coordinates intact.
The code detects lines from tables, as shown in the image below. I hope it solves your problem.
The extracted output in terms of a table is shown below.
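For readers who just want the gist without opening the repository, here is a rough sketch of the usual line-isolation step (this is not the repository's code; the file name and kernel sizes are assumptions):

import cv2

img = cv2.imread("page2.png", cv2.IMREAD_GRAYSCALE)  # assumed input file
binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Long, thin kernels keep only strokes that are long in one direction,
# i.e. the table rules, and erase the text.
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

table_structure = cv2.add(horizontal, vertical)
cv2.imwrite("table_lines.png", table_structure)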

Related

Image recognition difficulties with OCR - reading numbers from a picture

I am trying to develop a Python script which can read numbers from pictures; to be more exact, I am trying to get the gas consumption. The numbers' locations are always the same. There are two "types" of pics, bright and dark. (I am taking photos every 10 mins so I have a lot of examples if needed.)
I would like to get as a result 8 digits. e.g. 10974748 (from the dark pic)
I am mainly using Pytesseract and OpenCV2.
So far the best solution seems to be to first crop the needed part of the picture and then use pytesseract.image_to_string() with config = --psm 7. But unfortunately it is really not a reliable solution; it cannot recognize the same digit combinations on photos taken when there was no consumption.
import cv2
import numpy as np
import os
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract"
directory = r"C:\Users\user\Desktop\test_pcs\test"
for image in os.listdir(directory):
    OriginalImagePath = os.path.join(directory, image)
    OriginalImage = cv2.imread(OriginalImagePath)

    x_start, y_start = int(1110), int(445)
    x_end, y_end = int(1690), int(520)
    cropped_image = OriginalImage[y_start:y_end, x_start:x_end]

    text = (pytesseract.image_to_string(cropped_image, config="--psm 7 outputbase digits"))

    cv2.imshow("Cropped", cropped_image)
    cv2.waitKey(0)
    print(text + " " + OriginalImagePath)
cv2.destroyAllWindows()
After that I tried thresholding, but sadly I got worse results than with the simple image_to_string. Adaptive thresholding gives an output image that does not look bad, but Tesseract can't read it.
import cv2 as cv
import numpy as np
from matplotlib import pyplot as plt
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract"
img = cv.imread(r"C:\Users\user\Desktop\test_pcs\new2\2022-10-30_14-49-30.jpg",0)
img = cv.medianBlur(img,5)
ret,th1 = cv.threshold(img,127,255,cv.THRESH_BINARY)
# 'Adaptive Mean Thresholding'
th2 = cv.adaptiveThreshold(img, 255, cv.ADAPTIVE_THRESH_MEAN_C,
                           cv.THRESH_BINARY, 11, 2)
# 'Adaptive Gaussian Thresholding'
th3 = cv.adaptiveThreshold(img, 255, cv.ADAPTIVE_THRESH_GAUSSIAN_C,
                           cv.THRESH_BINARY, 11, 2)

images = [img, th2, th3]
for i in range(3):
    plt.subplot(2, 2, i + 1), plt.imshow(images[i], 'gray')
plt.show()
x_start, y_start = int(1110), int(450)
x_end, y_end = int(1690), int(520)
cropped_image = th2[y_start:y_end, x_start:x_end]
plt.imshow(cropped_image,'gray')
text = (pytesseract.image_to_string(cropped_image, config="--psm 7 outputbase digits"))
print("digits: " + text)
I also tried to read the digits character by character, but that failed as well.
Now I am trying to get better pictures somehow, but the options are quite limited.
I would be grateful for any suggestions, as I am doing this for my thesis.

Is there an equivalent function or an implementation of skimage.feature.peak_local_max in OpenCV?

I have been trying to segment biological cells in an image using the watershed algorithm. I found an excellent article on pyimagesearch which clearly gives an overview of the algorithm and its implementation in Python. The code uses both OpenCV and scikit-image for processing the image.
My goal is to convert the whole code into pure OpenCV. But the issue is that there's a function called skimage.feature.peak_local_max in scikit-image which does the job of finding local peaks in an image very efficiently. I couldn't find or devise such a function in OpenCV.
Original code (I have documented this snippet according to my understanding, please correct me if I am wrong):
# import the necessary packages
from skimage.feature import peak_local_max
from skimage.morphology import watershed
from scipy import ndimage
import numpy as np
import argparse
import imutils
import cv2
from matplotlib import pyplot as plt

# load the image and perform pyramid mean shift filtering
# to aid the thresholding step
image = cv2.imread("test2.png")
shifted = cv2.pyrMeanShiftFiltering(image, 21, 51)

# Apply grayscale
gray = cv2.cvtColor(shifted, cv2.COLOR_BGR2GRAY)

# Convert to binary
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

# Watershed starts from here
# compute the exact Euclidean distance from every binary
# pixel to the nearest zero pixel, then find peaks in this
# distance map
D = ndimage.distance_transform_edt(thresh)
localMax = peak_local_max(D, indices=False, min_distance=10, labels=thresh)

# perform a connected component analysis on the local peaks,
# using 8-connectivity, then apply the Watershed algorithm
markers = ndimage.label(localMax, structure=np.ones((3, 3)))[0]

# Apply segmentation
labels = watershed(-D, markers, mask=thresh)
print("[INFO] {} unique segments found".format(len(np.unique(labels)) - 1))
cv2.imwrite("labels.png", labels)

# Contouring
for label in np.unique(labels):
    # if the label is zero, we are examining the 'background'
    # so simply ignore it
    if label == 0:
        continue
    # otherwise, allocate memory for the label region and draw
    # it on the mask
    mask = np.zeros(gray.shape, dtype="uint8")
    mask[labels == label] = 255
    cnts = cv2.findContours(mask.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = imutils.grab_contours(cnts)
    c = max(cnts, key=cv2.contourArea)
    # draw a circle enclosing the object
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.018 * peri, True)
    cv2.drawContours(image, [approx], -1, (0, 0, 255), 2)

cv2.imwrite("output.jpg", image)
Pure OpenCV code up to computing the distance map:
# import the necessary packages
import numpy as np
import cv2
# load the image and perform pyramid mean shift filtering
# to aid the thresholding step
image = cv2.imread("1.png")
shifted = cv2.pyrMeanShiftFiltering(image, 21, 51)
# Apply grayscale
gray = cv2.cvtColor(shifted, cv2.COLOR_BGR2GRAY)
# Convert to binary
thresh = cv2.threshold(gray, 0, 255,cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
# Watershed starts from here
# compute the exact Euclidean distance from every binary
# pixel to the nearest zero pixel, then find peaks in this
# distance map
D = cv2.distanceTransform(thresh,cv2.DIST_L2,0)
Up to the point where D is computed, the original code and the pure OpenCV code I have tried produce exactly the same output. The issue is that I don't have a clear idea of how to implement peak_local_max in OpenCV so that it gives an identical result to scikit-image's function.
It would be really helpful if someone with relevant knowledge could explain how this function manages to find those peaks in such a fine-grained manner.
Input Image:
Peak Local max output in scikit-image(BGR format image):
Required output:
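One common way to approximate peak_local_max in pure OpenCV is a dilation-based maximum filter: a pixel is a peak if it equals the maximum over its neighbourhood. A minimal sketch (assuming a square footprint of side 2*min_distance + 1; plateau and border handling may differ slightly from scikit-image's result):

import cv2
import numpy as np

def peak_local_max_cv(dist, min_distance=10, mask=None):
    size = 2 * min_distance + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (size, size))
    # dilation replaces each pixel with the maximum of its neighbourhood
    dilated = cv2.dilate(dist, kernel)
    peaks = (dist == dilated) & (dist > 0)
    if mask is not None:
        peaks &= mask.astype(bool)
    return peaks

# Continuing from the snippet above (variable names assumed):
# localMax = peak_local_max_cv(D, min_distance=10, mask=thresh)
# markers = cv2.connectedComponents(localMax.astype(np.uint8))[1]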

Extract text information from PDF files with different layouts - machine learning

I need assistance with an ML project I am currently trying to create.
I receive a lot of invoices from a lot of different suppliers, each with its own unique layout. I need to extract 3 key elements from the invoices. These 3 elements are all located in the table / line items of every invoice.
The 3 elements are:
1: Tariff number (digit)
2: Quantity (always a digit)
3: Total line amount (monetary value)
Please refer to the screenshot below, where I have marked these fields on a sample invoice.
I started this project with a template approach based on regular expressions. This, however, was not scalable at all, and I ended up with tons of different rules.
I am hoping that machine learning can help me here - or maybe a hybrid solution?
The common denominator
In all of my invoices, despite the different layouts, each line item will always contain one tariff number. This tariff number is always 8 digits, and it is always formatted in one of the following ways:
xxxxxxxx
xxxx.xxxx
xx.xx.xx.xx
(Where "x" is a digit from 0 - 9).
Further, as you can see on the invoice there is both a Unit Price and a Total Amount per line. The amount I will need is always the highest for each line.
The output
For each invoice like the one above, I need the output for each line. This could for example be something like this:
{
    "line": "0",
    "tariff": "85444290",
    "quantity": "3",
    "amount": "258.93"
},
{
    "line": "1",
    "tariff": "85444290",
    "quantity": "4",
    "amount": "548.32"
},
{
    "line": "2",
    "tariff": "76109090",
    "quantity": "5",
    "amount": "412.30"
}
Where to go from here?
I am not sure whether what I am looking to do falls under machine learning and, if so, under which category. Is it computer vision? NLP? Named Entity Recognition?
My initial thought was to:
1. Convert the invoice to text. (The invoices are all in text-based PDFs, so I can use something like pdftotext to get the exact textual values.)
2. Create custom named entities for quantity, tariff and amount (a rough sketch of this step is shown right after this list).
3. Export the found entities.
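As a rough sketch of step 2 (an assumption about how one might start, not a full solution), a spaCy entity_ruler can tag the tariff and amount tokens by pattern, leaving the quantity, and the choice between Unit Price and Total Amount, to context or a trained model:

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "TARIFF", "pattern": [{"TEXT": {"REGEX": r"^(?:\d{8}|\d{4}\.\d{4}|\d{2}\.\d{2}\.\d{2}\.\d{2})$"}}]},
    {"label": "AMOUNT", "pattern": [{"TEXT": {"REGEX": r"^\d+\.\d{2}$"}}]},
])

# One extracted line from pdftotext (made-up example)
doc = nlp("1 Cable assembly 85444290 3 86.31 258.93")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('85444290', 'TARIFF'), ('86.31', 'AMOUNT'), ('258.93', 'AMOUNT')]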
However, I feel like I might be missing something.
Can anyone assist me in the right direction?
Edit:
Please see below for a few more examples of what an invoice table section can look like:
Sample invoice #2
Sample invoice #3
Edit 2:
Please see below for the three sample images, without the borders/bounding boxes:
Image 1:
Image 2:
Image 3:
Here's an attempt using OpenCV. The idea is:
Obtain binary image. We load the image, enlarge it using imutils.resize to help obtain better OCR results (see Tesseract improve quality), convert to grayscale, then apply Otsu's threshold to obtain a binary (1-channel) image.
Remove table grid lines. We create horizontal and vertical kernels and perform morphological operations to detect and remove the table grid lines, then morph close to combine adjacent text contours into a single contour. The idea is to extract each row ROI as one piece to OCR.
Extract row ROIs. We find contours, then sort them from top-to-bottom using imutils.contours.sort_contours. This ensures that we iterate through each row in the correct order. From here we iterate through the contours, extract the row ROI using Numpy slicing, OCR using Pytesseract, then parse the data.
Here's the visualization of each step:
Input image
Binary image
Morph close
Visualization of iterating through each row
Extracted row ROIs
Output invoice data result:
{'line': '0', 'tariff': '85444290', 'quantity': '3', 'amount': '258.93'}
{'line': '1', 'tariff': '85444290', 'quantity': '4', 'amount': '548.32'}
{'line': '2', 'tariff': '76109090', 'quantity': '5', 'amount': '412.30'}
Unfortunately, I get mixed results when trying this on the 2nd and 3rd images. This method does not produce great results on the other images since the layouts of the invoices are all different. However, this approach shows that it's possible to use traditional image processing techniques to extract the invoice information, with the assumption that you have a fixed invoice layout.
Code
import cv2
import numpy as np
import pytesseract
from imutils import contours
import imutils

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, enlarge, convert to grayscale, Otsu's threshold
image = cv2.imread('1.png')
image = imutils.resize(image, width=1000)
height, width = image.shape[:2]
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, 0, -1)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,50))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, 0, -1)

# Morph close to combine adjacent contours into a single contour
invoice_data = []
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (85,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)

# Find contours, sort from top-to-bottom
# Iterate through contours, extract row ROI, OCR, and parse data
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
(cnts, _) = contours.sort_contours(cnts, method="top-to-bottom")

row = 0
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    ROI = image[y:y+h, 0:width]
    ROI = cv2.GaussianBlur(ROI, (3,3), 0)
    data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
    parsed = [word.lower() for word in data.split()]
    if 'tariff' in parsed or 'number' in parsed:
        row_data = {}
        row_data['line'] = str(row)
        row_data['tariff'] = parsed[-1]
        row_data['quantity'] = parsed[2]
        # compare the two amounts numerically, not lexicographically
        row_data['amount'] = str(max(parsed[10], parsed[11], key=float))
        row += 1

        print(row_data)
        invoice_data.append(row_data)

    # Visualize row extraction
    '''
    mask = np.zeros(image.shape, dtype=np.uint8)
    cv2.rectangle(mask, (0, y), (width, y + h), (255,255,255), -1)
    display_row = cv2.bitwise_and(image, mask)

    cv2.imshow('ROI', ROI)
    cv2.imshow('display_row', display_row)
    cv2.waitKey(1000)
    '''

print(invoice_data)
cv2.imshow('thresh', thresh)
cv2.imshow('close', close)
cv2.waitKey()
I'm working on a similar problem in the logistics industry, and trust me when I say these document tables come in myriad layouts. Numerous companies that have somewhat solved, and are improving on, this problem are listed below:
Leaders: ABBYY, AntWorks, Kofax, and WorkFusion
Major Contenders: Automation Anywhere, Celaton, Datamatics, EdgeVerve, Extract Systems, Hyland, Hyperscience, Infrrd, and Parascript
Aspirants: Ikarus, Rossum, Shipmnts(Alex), Amazon(Textract), Docsumo, Docparser, Aidock
The category I would like to put this problem under is multi-modal learning, because both the textual and image modalities contribute a good deal to this problem. Though OCR tokens play a vital role in attribute-value classification, their position on the page, spacing and inter-character distances are very important features for detecting table, row and column boundaries. The problem gets all the more interesting when rows break across pages, or some columns carry non-empty values.
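To make the "OCR token plus position" point concrete, pytesseract can return each token together with its bounding box; a small sketch (the file name is a placeholder):

import cv2
import pytesseract
from pytesseract import Output

image = cv2.imread("invoice_page.png")  # placeholder file name
data = pytesseract.image_to_data(image, output_type=Output.DICT)

tokens = []
for i, text in enumerate(data["text"]):
    if text.strip():
        tokens.append({
            "text": text,
            "left": data["left"][i],
            "top": data["top"][i],
            "width": data["width"][i],
            "height": data["height"][i],
        })

# `tokens` now pairs every word with its page coordinates, which is the
# positional signal a layout-aware model or row/column grouping needs.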
The academic world and conferences use the term Intelligent Document Processing as an umbrella for extracting both singular fields and tabular data. In the research literature, the former is better known as attribute-value classification and the latter as table extraction or repeated-structure extraction.
In our foray into processing these semi-structured documents over the last 3 years, I feel that achieving both accuracy and scalability is a long and arduous journey. The solutions that offer scalability / a 'template-free' approach rely on an annotated corpus of semi-structured business documents in the order of tens of thousands, if not millions. Though this approach is a scalable solution, it's only as good as the documents it has been trained on. If your documents hail from the logistics or insurance sector, which are known for their complex layouts, and need to be super-accurate owing to the compliance procedures, a 'template-based' solution would be the panacea to your ills. It is guaranteed to give more accuracy.
If you need links to existing research, do mention it in the comments below and I'd be happy to share them.
Also, I would recommend using pdfparser1 over pdf2text or pdfminer, because the former gives character-level information for digital files with significantly better performance.
Would be happy to incorporate any feedback, as this is my first answer here.

Count number of objects using watershed algorithm - Scikit-image

I am trying to find the number of objects in a given image using watershed segmentation. Consider for example the coins image. Here I would like to know the number of coins in the image. I implemented the code available in the Scikit-image documentation and tweaked it a little, and got results similar to those displayed on the documentation page.
After looking in detail at the functions used in the code, I found out that ndimage.label() also returns the number of unique objects found in the image (mentioned in its documentation), but when I print that value I get 53, which is very high compared to the number of coins in the actual image.
Can somebody suggest a method to find the number of objects in an image?
Here is a version of your code that counts the coins in one of two ways: a) by directly segmenting the distance image and b) by doing watershed first and rejecting tiny intersecting regions.
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from skimage import io, color, filter as filters
from scipy import ndimage
from skimage.morphology import watershed
from skimage.feature import peak_local_max
from skimage.measure import regionprops, label

image = color.rgb2gray(io.imread('water_coins.jpg', plugin='freeimage'))
image = image < filters.threshold_otsu(image)
distance = ndimage.distance_transform_edt(image)

# Here's one way to measure the number of coins directly
# from the distance map
coin_centres = (distance > 0.8 * distance.max())
print('Number of coins (method 1):', np.max(label(coin_centres)))

# Or you can proceed with the watershed labeling
local_maxi = peak_local_max(distance, indices=False, footprint=np.ones((3, 3)),
                            labels=image)
markers, num_features = ndimage.label(local_maxi)
labels = watershed(-distance, markers, mask=image)

# ...but then you have to clean up the tiny intersections between coins
regions = regionprops(labels)
regions = [r for r in regions if r.area > 50]
print('Number of coins (method 2):', len(regions) - 1)

fig, axes = plt.subplots(ncols=3, figsize=(8, 2.7))
ax0, ax1, ax2 = axes

ax0.imshow(image, cmap=plt.cm.gray, interpolation='nearest')
ax0.set_title('Overlapping objects')
ax1.imshow(-distance, cmap=plt.cm.jet, interpolation='nearest')
ax1.set_title('Distances')
ax2.imshow(labels, cmap=plt.cm.spectral, interpolation='nearest')
ax2.set_title('Separated objects')

for ax in axes:
    ax.axis('off')

fig.subplots_adjust(hspace=0.01, wspace=0.01, top=1, bottom=0, left=0,
                    right=1)
plt.show()

Horizontal Histogram in OpenCV

I am a newbie to OpenCV; I am working on a senior project related to image processing. I have a question: can I make a horizontal or vertical histogram with some functions of OpenCV?
Thanks,
Truong
The most efficient way to do this is by using the cvReduce function. There's a parameter that lets you select whether you want a horizontal or vertical projection.
You can also do it by hand with the functions cvGetCol and cvGetRow combined with cvSum.
Based on the link you provided in a comment, this is what I believe you're trying to do.
You want to create an array with n elements, where n is the number of columns in the input image. The value of the nth element of the array is the sum of all the pixels in the nth column.
You can calculate this array by looping over the columns of the input image, using cvGetSubRect to access the pixels in that column, and cvSum to sum those pixels.
Here is some Python code that does that, assuming a grayscale image:
import cv

def verticalProjection(img):
    "Return a list containing the sum of the pixels in each column"
    (w, h) = cv.GetSize(img)
    sumCols = []
    for j in range(w):
        col = cv.GetSubRect(img, (j, 0, 1, h))
        sumCols.append(cv.Sum(col)[0])
    return sumCols
Updating carnieri's answer (some of the old cv functions no longer work today):
import numpy as np
import cv2

def verticalProjection(img):
    "Return a list containing the sum of the pixels in each column"
    (h, w) = img.shape[:2]
    sumCols = []
    for j in range(w):
        col = img[0:h, j:j+1]  # y1:y2, x1:x2
        sumCols.append(np.sum(col))
    return sumCols
Regards.
An example of using cv2.reduce with OpenCV 3 in Python:
import numpy as np
import cv2

img = cv2.imread("test_1.png")

# dim=0 collapses the rows: one sum per column (vertical projection)
x_sum = cv2.reduce(img, 0, cv2.REDUCE_SUM, dtype=cv2.CV_32S)
# dim=1 collapses the columns: one sum per row (horizontal projection)
y_sum = cv2.reduce(img, 1, cv2.REDUCE_SUM, dtype=cv2.CV_32S)
