OCR with grouped text based on solid rectangles - opencv

I can read text from an image using OCR. However, it works line by line.
I want to now group text based on solid lines surrounding the text.
For example, consider I have below rectangle banners. I can read text line by line. Fine! Now I want to group them by Board A,B,C and hold them in some data structure, so that I can identify, which lines belong to which board. It is given that images would be diagrams like this with solid lines around each block of text.
Please guide me on the right approach.

As mentioned in the comments by Yunus, you need to crop sub-images and feed them to an OCR module individually. An additional step could be ordering of the contours.
Approach:
Obtain binary image and invert it
Find contours
Crop sub-images based on the bounding rectangle for each contour
Feed each sub-image to OCR module (I used easyocr for demonstration)
Store text for each board in a dictionary
Code:
# Libraries import
import cv2
from easyocr import Reader
reader = Reader(['en'])
img = cv2.imread('board_text.jpg',1)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# inverse binary
th = cv2.threshold(gray,127,255,cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)[1]
# find contours and sort them from left to right
contours, hierarchy = cv2.findContours(th, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
contours = sorted(contours, key=lambda x: [cv2.boundingRect(x)[0], cv2.boundingRect(x)[1]])
#initialize dictionary
board_dictionary = {}
# iterate each contour and crop bounding box
for i, c in enumerate(cnts):
x,y,w,h = cv2.boundingRect(c)
crop_img = img[y:y+h, x:x+w]
# feed cropped image to easyOCR module
results = reader.readtext(crop_img)
# result is output per line
# create a list to append all lines in cropped image to it
board_text = []
for (bbox, text, prob) in results:
board_text.append(text)
# convert list of words to single string
board_para = ' '.join(board_text)
#print(board_para)
# store string within a dictionary
board_dictionary[str(i)] = board_para
Dictionary Output:
board_dictionary
{'0': 'Board A Board A contains Some Text, That goes Here Some spaces and then text again', '1': 'Board B Board B has some text too but sparse.', '2': 'Board €C Board C is wide and contains text with white spaces '}
Drawing each contour
img2 = img.copy()
for i, c in enumerate(cnts):
x,y,w,h = cv2.boundingRect(c)
img2 = cv2.rectangle(img2, (x, y), (x + w, y + h), (0,255,0), 3)
Note:
While working on different images make sure the ordering is correct.
Choice of OCR module is yours pytesseract and easyocr are the options I know.

This can be done by performing following steps:
Find the shapes.
Compute the shape centers.
Find the text boxes.
Compute the text boxes centers.
Associate the textboxes with shapes based on distance.
The code is as follows:
import cv2
from easyocr import Reader
import math
shape_number = 2
image = cv2.imread("./ueUco.jpg")
deep_copy = image.copy()
image_gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(image_gray, 150, 255, cv2.THRESH_BINARY)
thresh = 255 - thresh
shapes, hierarchy = cv2.findContours(image=thresh, mode=cv2.RETR_EXTERNAL, method=cv2.CHAIN_APPROX_SIMPLE)
cv2.drawContours(image=deep_copy, contours=shapes, contourIdx=-1, color=(0, 255, 0), thickness=2, lineType=cv2.LINE_AA)
shape_centers = []
for shape in shapes:
row = int((shape[0][0][0] + shape[3][0][0])/2)
column = int((shape[3][0][1] + shape[2][0][1])/2)
center = (row, column, shape)
shape_centers.append(center)
# cv2.imshow('Shapes', deep_copy)
# cv2.waitKey(0)
# cv2.destroyAllWindows()
languages = ['en']
reader = Reader(languages, gpu = True)
results = reader.readtext(image)
def cleanup_text(text):
return "".join([c if ord(c) < 128 else "" for c in text]).strip()
for (bbox, text, prob) in results:
text = cleanup_text(text)
(tl, tr, br, bl) = bbox
tl = (int(tl[0]), int(tl[1]))
tr = (int(tr[0]), int(tr[1]))
br = (int(br[0]), int(br[1]))
bl = (int(bl[0]), int(bl[1]))
column = int((tl[0] + tr[0])/2)
row = int((tr[1] + br[1])/2)
center = (row, column, bbox)
distances = []
for iteration, shape_center in enumerate(shape_centers):
shape_row = shape_center[0]
shape_column = shape_center[1]
dist = int(math.dist([column, row], [shape_row, shape_column]))
distances.append(dist)
min_value = min(distances)
min_index = distances.index(min_value)
if min_index == shape_number:
cv2.rectangle(image, tl, br, (0, 255, 0), 2)
cv2.putText(image, text, (tl[0], tl[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
cv2.imshow("Image", image)
cv2.waitKey(0)
cv2.imwrite(f"image_{shape_number}.jpg", image)
cv2.destroyAllWindows()
The output looks like this.
Please note that this solution is almost complete. You just have to compute the text embodied in each shape and put it in your desired data structure.
Note: shape_number represents the shape that you want to consider.
There is another solution that I would like you to work on.
Find all the text boxes.
Compute the centers for text boxes.
Run k-means clustering on the centers.
I would prefer the second solution but for the time being, I implemented the first one.

Related

Image Processing: Mapping a scanned image on a template image with many identical features

Problem description
We are trying to match a scanned image onto a template image:
Example of a scanned image:
Example of a template image:
The template image contains a collection of hearts varying in size and contour properties (closed, open left and open right). Each heart in the template is a Region of Interest for which we know the location, size, and contour type. Our goal is to match a scanned onto the template so that we can extract these ROIs in the scanned image. In the scanned image, some of these hearts are crossed, and they will be presented to a classifier that decides if they are crossed or not.
Our approach
Following a tutorial on PyImageSearch, we have attempted to use ORB to find matching keypoints (code included below). This should allow us to compute a perspective transform matrix that maps the scanned image on the template image.
We have tried some preprocessing steps such as thresholding and/or blurring the scanned image. We have also tried to increase the maximum number of features as much as possible.
The problem
The method fails to work for our image set. This can be seen in the following image:
It appears that a lot of keypoints are mapped to the wrong part of the template image, so the transform matrix is not calculated correctly.
Is ORB the right technique to use here, or are there parameters of the algorithm that could be fine-tuned to improve performance? It feels like we are missing out on something simple that should make it work, but we really don't know how to go forward with this approach :).
We are trying out an alternative technique where we cross-correlate the scan with individual heart shapes. This should give an image with peaks at the heart locations. By drawing a bounding box around these peaks we hope to map that bounding box on the bounding box of the template (I can elaborat on this upon request)
Any suggestions are greatly appreciated!
import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np
# Preprocessing parameters
THRESHOLD = True
BLUR = False
# ORB parameters
MAX_FEATURES = 4048
KEEP_PERCENT = .01
SHOW_DEBUG = True
# Convert both the input image and template to grayscale
scan_file = r'scan.jpg'
template_file = r'template.jpg'
scan = cv.imread(scan_file)
template = cv.imread(template_file)
scan_gray = cv.cvtColor(scan, cv.COLOR_BGR2GRAY)
template_gray = cv.cvtColor(template, cv.COLOR_BGR2GRAY)
if THRESHOLD:
_, scan_gray = cv.threshold(scan_gray, 127, 255, cv.THRESH_BINARY)
_, template_gray = cv.threshold(template_gray, 127, 255, cv.THRESH_BINARY)
if BLUR:
scan_gray = cv.blur(scan_gray, (5, 5))
template_gray = cv.blur(template_gray, (5, 5))
# Use ORB to detect keypoints and extract (binary) local invariant features
orb = cv.ORB_create(MAX_FEATURES)
(kps_template, desc_template) = orb.detectAndCompute(template_gray, None)
(kps_scan, desc_scan) = orb.detectAndCompute(scan_gray, None)
# Match the features
#method = cv.DESCRIPTOR_MATCHER_BRUTEFORCE_HAMMING
#matcher = cv.DescriptorMatcher_create(method)
#matches = matcher.match(desc_scan, desc_template)
bf = cv.BFMatcher(cv.NORM_HAMMING)
matches = bf.match(desc_scan, desc_template)
# Sort the matches by their distances
matches = sorted(matches, key = lambda x : x.distance)
# Keep only the top matches
keep = int(len(matches) * KEEP_PERCENT)
matches = matches[:keep]
if SHOW_DEBUG:
matched_visualization = cv.drawMatches(scan, kps_scan, template, kps_template, matches, None)
plt.imshow(matched_visualization)
Based on the clarifications provided by #it_guy, I have attempted to find all the crossed hearts using just the scanned image. I would have to try the algorithm on more images to check whether this approach will generalize or not.
Binarize the scanned image.
gray_image = cv2.cvtColor(rgb_image, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(gray_image, 180, 255, cv2.THRESH_BINARY_INV)
Perform dilation to close small gaps in the outline of the hearts, and the curves representing crosses. Note - The structuring element np.ones((1,2), np.uint8 can be changed by running the algorithm through multiple images and finding the most suitable structuring element.
closing_original = cv2.morphologyEx(original_binary, cv2.MORPH_DILATE, np.ones((1,2), np.uint8)).
Find all the contours in the image. The contours include all hearts and the triangle at the bottom. We eliminate other contours like dots by placing constraints on the height and width of contours to filter them. Further, we also use contour hierachies to eliminate inner contours in cross hearts.
contours_original, hierarchy_original = cv2.findContours(closing_original, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
We iterate through each of the filtered contours.
Contour with normal heart -
Contour with crossed heart -
Let us observe the difference between these two types of hearts. If we look at the transition from white-to-black pixel and black-to-white pixel ( from top to bottom ) inside the normal heart, we see that for majority of the image columns the number of such transitions are 4. ( Top border - 2 transitions, bottom border - 2 transitions )
white-to-black pixel - (255, 255, 0, 0, 0)
black-to-white pixel - (0, 0, 255, 255, 255)
But, in the case of the crossed heart, the number of transitions for majority of the columns must be 6, since the crossed curve / line adds two more transitions inside the heart (black-to-white first, then white-to-black). Hence, among all image columns which have greater than or equal to 4 such transitions, if more than 40% of the columns have 6 transitions, then the given contour represents a crossed contour. Result -
Code -
import cv2
import numpy as np
def convert_to_binary(rgb_image):
gray_image = cv2.cvtColor(rgb_image, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(gray_image, 180, 255, cv2.THRESH_BINARY_INV)
return gray_image, thresh
original = cv2.imread('original.jpg')
height, width = original.shape[:2]
original_gray, original_binary = convert_to_binary(original) # Get binary image
cv2.imwrite("binary.jpg", original_binary)
closing_original = cv2.morphologyEx(original_binary, cv2.MORPH_DILATE, np.ones((1,2), np.uint8)) # Close small gaps in the binary image
cv2.imwrite("closed.jpg", closing_original)
contours_original, hierarchy_original = cv2.findContours(closing_original, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE) # Get all the contours
bounding_rects_original = [cv2.boundingRect(c) for c in contours_original] # Get all contour bounding boxes
orig_boxes = list()
all_contour_image = original.copy()
for i, (x, y, w, h) in enumerate(bounding_rects_original):
if h > height / 2 or w > width / 2: # Eliminate extremely large contours
continue
if h < w / 2 or w < h / 2: # Eliminate vertical / horuzontal lines
continue
if w * h < 200: # Eliminate small area contours
continue
if hierarchy_original[0][i][3] != -1: # Eliminate contours created by heart crosses
continue
orig_boxes.append((x, y, w, h))
cv2.rectangle(all_contour_image, (x,y), (x + w, y + h), (0, 255, 0), 3)
# cv2.imshow("warped", closing_original)
cv2.imwrite("all_contours.jpg", all_contour_image)
final_image = original.copy()
for x, y, w, h in orig_boxes:
cropped_image = closing_original[y - 2 :y + h + 2, x: x + w] # Get the heart binary image
col_pixel_diffs = np.abs(np.diff(cropped_image.T.astype(np.int16))/255) # Obtain all consecutive pixel differences in all the columns
column_sums = np.sum(col_pixel_diffs, axis=1) # Get the sum of each column's transitions. This results in an array of size equal
# to the number of columns, each element representing the number of black-white and white-black transitions.
percent_crosses = np.sum(column_sums >= 6)/ np.sum(column_sums >= 4) # Percentage of columns with 6 transitions among columns with 4 transitions
if percent_crosses > 0.4: # Crossed heart criterion
cv2.rectangle(final_image, (x,y), (x + w, y + h), (0, 255, 0), 3)
cv2.imwrite("crossed_heart.jpg", cropped_image)
else:
cv2.imwrite("normal_heart.jpg", cropped_image)
cv2.imwrite("all_crossed_hearts.jpg", final_image)
This approach can be tested on more images to find its accuracy.

How do I measure text size to extract only certain letters from an image in OpenCV?

I am looking to extract text from a license plate. For now I have been using pytesseract with opencv to zero in on the relevant contours and pull out text. This works decently for non-American plates, but I am curious about applying this to American plates which come with a lot of little letters surrounding the big plate id ones. My thoughts were to use font size to filter out letters under a certain threshold. Is that the best approach?
below is code so far:
import cv2
import pytesseract
import imutils
#read image
image = cv2.imread('plateTest2.jpeg')
#RGB to Gray Scale converstion
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
#noise removal
gray = cv2.bilateralFilter(gray,11,17,17)
#find edges of the grayscale image
edged = cv2.Canny(gray, 170,200)
#Find contours based on Edges
_,cnts, new = cv2.findContours(edged.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
#Create copy of original image to draw all contours
img1 = image.copy()
cv2.drawContours(img1, cnts, -1, (0,255,0), 3)
#sort contours based on their area keeping minimum required area as '30' (anything smaller than this will not be considered)
cnts=sorted(cnts, key = cv2.contourArea, reverse = True)[:30]
NumberPlateCnt = None #we currently have no Number plate contour
#Top 30 Contours
img2 = image.copy()
cv2.drawContours(img2, cnts, -1, (0,255,0), 3)
idx='plateTest2.jpg'
for c in cnts:
peri = cv2.arcLength(c, True)
approx = cv2.approxPolyDP(c, 0.02 * peri, True)
# print ("approx = ",approx)
if len(approx) == 4: # Select the contour with 4 corners
NumberPlateCnt = approx #This is our approx Number Plate Contour
# Crop those contours and store it in Cropped Images folder
x, y, w, h = cv2.boundingRect(c) #This will find out co-ord for plate
new_img = gray[y:y + h, x:x + w] #Create new image
cv2.imwrite('/' + 'cropped_' + str(idx), new_img) #Store new image
#idx+=1
break
#Drawing the selected contour on the original image
#print(NumberPlateCnt)
cv2.drawContours(image, [NumberPlateCnt], -1, (0,255,0), 3)
Cropped_img_loc = '/' + 'cropped_' + str(idx)#'cropped_images/8.png'
#Use tesseract to covert image into string
text = pytesseract.image_to_string(Cropped_img_loc, lang='eng')
text = text.replace('.', '')
text = text.replace(' ','')
return text
Here is a picture of a plate that returns too much text where things like 'SUNSHINESTATE' show up:
Should I rely on pytesseract to identify font size and filter out smaller characters? Or should I be filtering before using contour size? Appreciate the help.

How to get the area of the contours?

I have a picture like this:
And then I transform it into binary image and use canny to detect edge of the picture:
gray = cv.cvtColor(image, cv.COLOR_RGB2GRAY)
edge = Image.fromarray(edges)
And then I get the result as:
I want to get the area of 2 like this:
My solution is to use HoughLines to find lines in the picture and calculate the area of triangle formed by lines. However, this way is not precise because the closed area is not a standard triangle. How to get the area of region 2?
A simple approach using floodFill and countNonZero could be the following code snippet. My standard quote on contourArea from the help:
The function computes a contour area. Similarly to moments, the area is computed using the Green formula. Thus, the returned area and the number of non-zero pixels, if you draw the contour using drawContours or fillPoly, can be different. Also, the function will most certainly give a wrong results for contours with self-intersections.
Code:
import cv2
import numpy as np
# Input image
img = cv2.imread('images/YMMEE.jpg', cv2.IMREAD_GRAYSCALE)
# Needed due to JPG artifacts
_, temp = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)
# Dilate to better detect contours
temp = cv2.dilate(temp, cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
# Find largest contour
cnts, _ = cv2.findContours(temp, cv2.RETR_EXTERNAL , cv2.CHAIN_APPROX_NONE)
largestCnt = []
for cnt in cnts:
if (len(cnt) > len(largestCnt)):
largestCnt = cnt
# Determine center of area of largest contour
M = cv2.moments(largestCnt)
x = int(M["m10"] / M["m00"])
y = int(M["m01"] / M["m00"])
# Initiale mask for flood filling
width, height = temp.shape
mask = img2 = np.ones((width + 2, height + 2), np.uint8) * 255
mask[1:width, 1:height] = 0
# Generate intermediate image, draw largest contour, flood filled
temp = np.zeros(temp.shape, np.uint8)
temp = cv2.drawContours(temp, largestCnt, -1, 255, cv2.FILLED)
_, temp, mask, _ = cv2.floodFill(temp, mask, (x, y), 255)
temp = cv2.morphologyEx(temp, cv2.MORPH_OPEN, cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
# Count pixels in desired region
area = cv2.countNonZero(temp)
# Put result on original image
img = cv2.putText(img, str(area), (x, y), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, 255)
cv2.imshow('Input', img)
cv2.imshow('Temp image', temp)
cv2.waitKey(0)
Temporary image:
Result image:
Caveat: findContours has some problems one the right side, where the line is very close to the bottom image border, resulting in possibly omitting some pixels.
Disclaimer: I'm new to Python in general, and specially to the Python API of OpenCV (C++ for the win). Comments, improvements, highlighting Python no-gos are highly welcome!
There is a very simple way to find this area, if you take some assumptions that are met in the example image:
The area to be found is bounded on top by a line
Any additional lines in the image are above the line of interest
There are no discontinuities in the line
In this case, the area of the region of interest is given by the sum of the lengths from the bottom of the image to the first set pixel. We can compute this with:
import numpy as np
import matplotlib.pyplot as pp
img = pp.imread('/home/cris/tmp/YMMEE.jpg')
img = np.flip(img, axis=0)
pos = np.argmax(img, axis=0)
area = np.sum(pos)
print('Area = %d\n'%area)
This prints Area = 22040.
np.argmax finds the first set pixel on each column of the image, returning the index. By first using np.flip, we flip this axis so that the first pixel is actually the one on the bottom. The index corresponds to the number of pixels between the bottom of the image and the line (not including the set pixel).
Thus, we're computing the area under the line. If you need to include the line itself in the area, add pos.shape[0] to the area (i.e. the number of columns).

Character segmentation in python

I am working on detecting handwritten symbols using computer vision in python. I trained a cnn on a dataset of individual characters, but now I want to be able to extract characters from an image in order to make predictions on the individual characters. What is the best way to do this? The handwritten text that I will be working with will not be cursive and there will be an obvious separation between the characters.
In the below snippet,the boxes variable has dimensions for each character in the image.
import cv2
import pytesseract
file = '/content/Captchas/image22.jpg'
img = cv2.imread(file)
h, w, _ = img.shape
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
b = b.split(' ')
img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
cv2_imshow(img)
print(boxes)
you can use find contours and bound them with a box.
image = cv2.imread("filename")
image = cv2.fastNlMeansDenoisingColored(image,None,10,10,7,21)
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
res,thresh = cv2.threshold(gray,150,255,cv2.THRESH_BINARY_INV) #threshold
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS,(3,3))
dilated = cv2.dilate(thresh,kernel,iterations = 5)
val,contours, hierarchy =
cv2.findContours(dilated,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_NONE)
coord = []
for contour in contours:
[x,y,w,h] = cv2.boundingRect(contour)
if h>300 and w>300:
continue
if h<40 or w<40:
continue
coord.append((x,y,w,h))
coord.sort(key=lambda tup:tup[0]) # if the image has only one sentence sort in one axis
count = 0
for cor in coord:
[x,y,w,h] = cor
t = image[y:y+h,x:x+w,:]
cv2.imwrite(str(count)+".png",t)
print("number of char in image:", count)

OpenCV - Extracting lines on a graph

I would like to create a program that is able to extract lines from a graph.
For example, if a graph like this is inputted, I would just want the red line to be outputted.
Below I have tried to do this using a hough line transformation, however, I do not get very promising results.
import cv2
import numpy as np
graph_img = cv2.imread("/Users/2020shatgiskessell/Desktop/Graph1.png")
gray = cv2.cvtColor(graph_img, cv2.COLOR_BGR2GRAY)
kernel_size = 5
#grayscale image
blur_gray = cv2.GaussianBlur(gray,(kernel_size, kernel_size),0)
#Canny edge detecion
edges = cv2.Canny(blur_gray, 50, 150)
#Hough Lines Transformation
#distance resoltion of hough grid (pixels)
rho = 1
#angular resolution of hough grid (radians)
theta = np.pi/180
#minimum number of votes
threshold = 15
#play around with these
min_line_length = 25
max_line_gap = 20
#make new image
line_image = np.copy(graph_img)
#returns array of lines
lines = cv2.HoughLinesP(edges, rho, theta, threshold, np.array([]),
min_line_length, max_line_gap)
for line in lines:
for x1,y1,x2,y2 in line:
cv2.line(line_image,(x1,y1),(x2,y2),(255,0,0),2)
lines_edges = cv2.addWeighted(graph_img, 0.8, line_image, 1, 0)
cv2.imshow("denoised image",edges)
if cv2.waitKey(0) & 0xff == 27:
cv2.destroyAllWindows()
This produces the output image below, which does not accurately recognize the graph line. How might I go about doing this?
Note: For now, I am not concerned about the graph titles or any other text.
I would also like the code to work for other graph images aswell, such as:
etc.
If the graph does not have many noises around it (like your example) I would suggest to threshold your image with Otsu threshold instead of looking for edges . Then you simply search the contours, select the biggest one (graph) and draw it on a blank mask. After that you can perform a bitwise operation on image with the mask and you will get a black image with the graph. If you like the white background better, then simply change all black pixels to white. Steps are written in the example. Hope it helps a bit. Cheers!
Example:
import numpy as np
import cv2
# Read the image and create a blank mask
img = cv2.imread('graph.png')
h,w = img.shape[:2]
mask = np.zeros((h,w), np.uint8)
# Transform to gray colorspace and threshold the image
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray,0,255,cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)
# Search for contours and select the biggest one and draw it on mask
_, contours, hierarchy = cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_NONE)
cnt = max(contours, key=cv2.contourArea)
cv2.drawContours(mask, [cnt], 0, 255, -1)
# Perform a bitwise operation
res = cv2.bitwise_and(img, img, mask=mask)
# Convert black pixels back to white
black = np.where(res==0)
res[black[0], black[1], :] = [255, 255, 255]
# Display the image
cv2.imshow('img', res)
cv2.waitKey(0)
cv2.destroyAllWindows()
Result:
EDIT:
For noisier pictures you could try this code. Note that different graphs have different noises and may not work on every graph image since the denoisiation process would be specific in every case. For different noises you can use different ways to denoise it, for example histogram equalization, eroding, blurring etc. This code works well for all 3 graphs. Steps are written in comments. Hope it helps. Cheers!
import numpy as np
import cv2
# Read the image and create a blank mask
img = cv2.imread('graph.png')
h,w = img.shape[:2]
mask = np.zeros((h,w), np.uint8)
# Transform to gray colorspace and threshold the image
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray,0,255,cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)
# Perform opening on the thresholded image (erosion followed by dilation)
kernel = np.ones((2,2),np.uint8)
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
# Search for contours and select the biggest one and draw it on mask
_, contours, hierarchy = cv2.findContours(opening,cv2.RETR_TREE,cv2.CHAIN_APPROX_NONE)
cnt = max(contours, key=cv2.contourArea)
cv2.drawContours(mask, [cnt], 0, 255, -1)
# Perform a bitwise operation
res = cv2.bitwise_and(img, img, mask=mask)
# Threshold the image again
gray = cv2.cvtColor(res,cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray,0,255,cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)
# Find all non white pixels
non_zero = cv2.findNonZero(thresh)
# Transform all other pixels in non_white to white
for i in range(0, len(non_zero)):
first_x = non_zero[i][0][0]
first_y = non_zero[i][0][1]
first = res[first_y, first_x]
res[first_y, first_x] = 255
# Display the image
cv2.imshow('img', res)
cv2.waitKey(0)
cv2.destroyAllWindows()
Result:

Resources