I have an image of a text document that contains both text and block schemes. The main problem is detecting the block schemes. I see two possible approaches: 1) detect the geometric primitives that make up a scheme; 2) detect the whole scheme at once.
How can I solve this task? Please suggest some approaches.
I am trying to detect where in the document a block scheme is placed. An example is shown in the picture below. I have not tried to detect the text inside the block scheme.
UPDATE 2: The main problem is that I need to find block schemes in many varieties, even when only part of a block scheme is present.
You can use either 1) object detection or 2) semantic segmentation. I would suggest segmentation because boundary extraction is crucial for your application.
I'm assuming you have the pages of the documents as images.
The following steps are typical for a segmentation project.
Collect the images of the pages needed to solve your problem and apply preprocessing steps such as resizing, so that all images in your dataset share a common shape and the number of computations is reduced. Be sure to maintain variability in your samples.
Next, annotate the regions of the images you are interested in and give each one a name. This assigns a class to certain regions of the image, much as in classification. You can use the following tools for this.
Labelme -- (my recommendation)
VGG Annotation tool -- (highly portable tool written in HTML, but with fewer features than Labelme)
You can use a U-Net model for your task (see the U-Net paper). It is very easy to implement and performs robustly on most real-world tasks such as yours.
We have done something similar at work. This is the blog post, in which we explain in detail the steps of the pipeline, from the data collection stage to the results.
Literature on Document Layout Analysis.
https://arxiv.org/pdf/1804.10371.pdf -- They used U-Net with ResNet-50 as the encoder and achieved very good results compared to previous approaches.
https://github.com/leonlulu/DeepLayout -- This is a Python implementation of a page layout analysis tool that uses a DeepLab v2 model for semantic segmentation.
The approach presented here might seem tedious and time consuming but it is robust to variability in the documents when you are testing. Comment below if you have any questions.
I would prefer more examples of the types of diagram you are searching for, but based on the example you have given, here is my attempt at solving it naively.
1) Resize the image to a manageable size to improve speed and reduce operations.
2) Use a morphological open to cluster all the dark objects together.
3) Binarize the dark objects.
4) Label the objects using OpenCV connected components. This gives the bounding box of each region.
5) Cluster overlapping bounding boxes together.
6) Analyze each bounding box to find the one containing a diagram. Here you could apply a more sophisticated algorithm such as box detection or even arrow detection, but for your example I think a simple box ratio is sufficient.
Here is the code for the implementation:
import cv2
import numpy as np

# Fill every bounding box (except the background, label 0) with white
def fill_rects(image, stats):
    for i, stat in enumerate(stats):
        if i > 0:
            p1 = (int(stat[0]), int(stat[1]))
            p2 = (int(stat[0] + stat[2]), int(stat[1] + stat[3]))
            cv2.rectangle(image, p1, p2, 255, -1)

# image name
img_name = 'test_image.png'

# Load image file as grayscale and smooth it
diagram = cv2.imread(img_name, 0)
diagram = cv2.blur(diagram, (5, 5))
fScale = 0.25

# Make it smaller to speed up everything and easier to cluster
small_img = cv2.resize(diagram, (0, 0), fx=fScale, fy=fScale)
img_h, img_w = np.shape(small_img)

# Morphological open on a light background grows the dark objects first,
# which clusters nearby dark objects together
fat_img = cv2.morphologyEx(small_img, cv2.MORPH_OPEN, None, iterations=1)

# Threshold strong signals (dark objects become white)
_, bin_img = cv2.threshold(fat_img, 210, 255, cv2.THRESH_BINARY_INV)

# Analyse connected components
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(bin_img)

# Cluster all the intersecting bounding boxes together by filling them
rsmall, csmall = np.shape(small_img)
new_img1 = np.zeros((rsmall, csmall), dtype=np.uint8)
fill_rects(new_img1, stats)

# Analyse the new connected components to get the filled regions
num_labels_new, labels_new, stats_new, centroids_new = cv2.connectedComponentsWithStats(new_img1)

# Check for regions that satisfy the conditions corresponding to a diagram
min_dia_width = img_w * 0.1
dia_regions = []
for i, stat in enumerate(stats_new):  # use the clustered boxes
    if i > 0:
        # get basic dimensions
        x, y, w, h = stat[0:4]
        # calculate ratio
        ratio = w / float(h)
        # if condition met, save in list (scaled back to the original size)
        if ratio < 1 and w > min_dia_width:
            dia_regions.append((x / fScale, y / fScale, w / fScale, h / fScale))

# For display purposes
diagram_disp = cv2.imread(img_name)
for region in dia_regions:
    x, y, w, h = (int(v) for v in region)
    cv2.rectangle(diagram_disp, (x, y), (x + w, y + h), (0, 255, 0), 2)

labels_disp = np.uint8(200 * labels / np.max(labels)) + 50
labels_disp2 = np.uint8(200 * labels_new / np.max(labels_new)) + 50

cv2.imshow('diagram', diagram_disp)
cv2.waitKey(0)
Here is the result for another type of input.
I am loading a YOLO model with OpenCV in Python using cv2.dnn_DetectionModel(cfg,weights) and then calling net.detect(img). I think I can get a speed-up per image using batches, but I don't see any support for a batch size other than one.
Is it possible to set the batch size?
net.detect does not support batch size > 1.
However, it's possible to do inference with batch size > 1 on darknet models with some extra work. Here is some partial code:
net = cv2.dnn.readNetFromDarknet(cfg,weights)
# one blob for the whole batch; scale factor, size and channel order must match what the model expects
blob = cv2.dnn.blobFromImages(image_list)
net.setInput(blob)
results = net.forward(net.getUnconnectedOutLayersNames())
Now loop over all the images. For each layer output in results, extract the boxes, confidences and class IDs that belong to the current image; once you have collected them from every layer, pass them through cv2.dnn.NMSBoxes. This part is non-trivial, but doable.
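The per-image post-processing described above could look roughly like the sketch below. It makes several assumptions that are not in the original answer: each layer output is shaped (batch, N, 5 + num_classes) with normalised centre-x, centre-y, width, height in the first four columns, and decode_batch and greedy_nms are hypothetical helper names. greedy_nms is a plain NumPy stand-in so the example runs without a model; in real code you would pass the same boxes and scores to cv2.dnn.NMSBoxes.

```python
import numpy as np

def iou(box, others):
    """IoU between one [x, y, w, h] box and an array of such boxes."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[0] + box[2], others[:, 0] + others[:, 2])
    y2 = np.minimum(box[1] + box[3], others[:, 1] + others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = box[2] * box[3] + others[:, 2] * others[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def greedy_nms(boxes, scores, nms_thr):
    """Stand-in for cv2.dnn.NMSBoxes: indices of the boxes that survive."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= nms_thr]
    return keep

def decode_batch(layer_outputs, sizes, conf_thr=0.5, nms_thr=0.4):
    """layer_outputs: list of arrays shaped (batch, N, 5 + num_classes).
    sizes: one (width, height) tuple per image in the batch."""
    detections = []
    for b, (img_w, img_h) in enumerate(sizes):
        # Gather the rows of every output layer that belong to image b.
        rows = np.concatenate([out[b] for out in layer_outputs], axis=0)
        class_scores = rows[:, 5:]
        class_ids = class_scores.argmax(axis=1)
        scores = class_scores[np.arange(len(rows)), class_ids]
        mask = scores > conf_thr
        rows, class_ids, scores = rows[mask], class_ids[mask], scores[mask]
        # Convert normalised centre/size to pixel [x, y, w, h].
        wh = rows[:, 2:4] * [img_w, img_h]
        xy = rows[:, 0:2] * [img_w, img_h] - wh / 2
        boxes = np.hstack([xy, wh])
        keep = greedy_nms(boxes, scores, nms_thr) if len(boxes) else []
        detections.append([(boxes[i], float(scores[i]), int(class_ids[i])) for i in keep])
    return detections

# Fake output for a batch of two images, one layer, two classes.
out = np.zeros((2, 3, 7), dtype=np.float32)
out[0, 0] = [0.50, 0.5, 0.2, 0.2, 0.9, 0.9, 0.0]
out[0, 1] = [0.51, 0.5, 0.2, 0.2, 0.8, 0.8, 0.0]  # overlaps the first box
out[0, 2] = [0.10, 0.1, 0.1, 0.1, 0.9, 0.0, 0.1]  # below the confidence threshold
dets = decode_batch([out], [(100, 100), (200, 100)])
```

Here dets holds one list per image, each entry being (box, score, class_id), so the second image simply gets an empty list.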
One idea is to combine the images manually into a single image, pass that to the net, and separate the results afterwards:
h1, w1 = im1.shape[:2]
h2, w2 = im2.shape[:2]
#create empty matrix
vis = np.zeros((max(h1, h2), w1+w2,3), np.uint8)
#combine 2 images
vis[:h1, :w1,:3] = im1
vis[:h2, w1:w1+w2,:3] = im2
After inference, you can separate them again (this assumes pred has the same height and width as vis):
result1=pred[:h1, :w1,:]
result2=pred[:h2, w1:w1+w2,:]
My training data is imbalanced, so I decided to resample my dataset. I want to make slight changes while resampling: I'd like to apply a horizontal flip and a Gaussian filter to the minority classes to make all classes equal in size.
To do so, I'd like to use pure image processing techniques to increase the number of samples in the minority classes. I run this code on the classes that have fewer images:
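import os
import cv2

directory = ''  # folder of the minority class
for file in os.listdir(directory):
    img = cv2.imread(os.path.join(directory, file))
    # cv2.flip(img, 0) would flip vertically instead
    Flip_Horizontal = cv2.flip(img, 1)  # 1 means horizontal flip
    # saving now, e.g. cat.jpg -> cat_flip.jpg
    name, _ = os.path.splitext(file)
    cv2.imwrite(name + '_flip' + '.jpg', Flip_Horizontal)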
However, I have seen some tutorials that use Keras libraries to do image augmentation, like the following blog post:
In my case, can I use the first technique (pure image processing, i.e. manually copying the data with slight changes)? Or should I use the libraries available in Keras or PyTorch?
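Whichever route is taken, it can help to first work out how many extra images each minority class needs. A minimal sketch in plain Python; the class names and counts are made up for illustration, and augmentation_plan is a hypothetical helper:

```python
import math

def augmentation_plan(class_counts):
    """For each class, return (extra images needed, variants needed per source image)
    to reach the size of the largest class."""
    target = max(class_counts.values())
    plan = {}
    for name, count in class_counts.items():
        extra = target - count
        plan[name] = (extra, math.ceil(extra / count))
    return plan

counts = {"cat": 500, "dog": 120, "bird": 260}  # hypothetical class sizes
plan = augmentation_plan(counts)
```

A flip, a blur, and flip plus blur give only three distinct variants per source image, so a class that needs more variants than that per image would need further transforms.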
So I'm a very experienced developer trying to get into some machine learning/neural networking code.
Essentially I need a HUGE dataset so my first problem is that I need to find a way of labelling a lot of images quickly. So take this as the example.
I was thinking I could use template matching on the main image with the image below it? So that way I would simply need to get permission to use this data and I could label it very quickly.
When using OpenCV (following the examples) I get some very odd results that don't find the plate in the image. It does draw boxes, but not around the plate. In testing it got very close a few times, but not often. The code is:
import cv2 as cv
import numpy as np
from matplotlib import pyplot as plt

img = cv.imread('./image2.jpg',0)
img2 = img.copy()
template = cv.imread('./Plate2.test.png',0)
w, h = template.shape[::-1]

# All the 6 methods for comparison in a list
methods = ['cv.TM_CCOEFF', 'cv.TM_CCOEFF_NORMED', 'cv.TM_CCORR',
            'cv.TM_CCORR_NORMED', 'cv.TM_SQDIFF', 'cv.TM_SQDIFF_NORMED']

for meth in methods:
    img = img2.copy()
    method = eval(meth)

    # Apply template Matching
    res = cv.matchTemplate(img,template,method)
    min_val, max_val, min_loc, max_loc = cv.minMaxLoc(res)

    # If the method is TM_SQDIFF or TM_SQDIFF_NORMED, take minimum
    if method in [cv.TM_SQDIFF, cv.TM_SQDIFF_NORMED]:
        top_left = min_loc
    else:
        top_left = max_loc
    bottom_right = (top_left[0] + w, top_left[1] + h)

    cv.rectangle(img,top_left, bottom_right, 255, 2)

    plt.subplot(121),plt.imshow(res,cmap = 'gray')
    plt.title('Matching Result'), plt.xticks([]), plt.yticks([])
    plt.subplot(122),plt.imshow(img,cmap = 'gray')
    plt.title('Detected Point'), plt.xticks([]), plt.yticks([])
    plt.suptitle(meth)

    plt.show()
First, I'm guessing this isn't working because the main image in which we're looking for the template is oriented differently.
Second, I should point out that I am NOT a Python programmer, so I'm learning that too, and this is my first time touching OpenCV. I'm trying to apply what I DO understand about object detection to things I don't properly understand.
What I want to do is get the coordinates of a bounding box in the MAIN image from the smaller plate image. That way I can (with permission) create a decent dataset to train on really quickly; otherwise I have to do it manually :-(
Any help would be greatly appreciated, I have a lot of examples working but this was an interesting problem I didn't find any reading on.
In my mind the steps are:
1) Find the plate and create a bounding box.
2) Train on as many images as possible for object detection on said plates.
3) When testing, the plate needs to be extracted from the main image and a perspective transform applied.
4) If you wanted to, you'd then do text extraction once the plate has been flattened out.
So I tried SIFT from here, and the results are as follows (note that this image is already in the public domain from the above website). Not quite on target!
I've managed to cobble together a solution from an article, as suggested by JD in the comments. Essentially it lets me label enough images to create a neural network that in turn is much better at detecting them. I'll post an update soon with the answer.
I'm setting up the new Tensorflow Object Detection API to find small objects in large areas of satellite imagery. It works quite well - it finds all 10 objects I want, but I also get 50-100 false positives [things that look a little like the target object, but aren't].
I'm using the sample config from the 'pets' tutorial, to fine-tune the faster_rcnn_resnet101_coco model they offer. I've started small, with only 100 training examples of my objects (just 1 class). 50 examples in my validation set. Each example is a 200x200 pixel image with a labeled object (~40x40) in the center. I train until my precision & loss curves plateau.
I'm relatively new to using deep learning for object detection. What is the best strategy to increase my precision? For example, hard-negative mining? Increasing my training dataset size? I've yet to try the most accurate model they offer, faster_rcnn_inception_resnet_v2_atrous_coco, as I'd like to maintain some speed, but I will do so if needed.
Hard-negative mining seems to be a logical step. If you agree, how do I implement it w.r.t setting up the tfrecord file for my training dataset? Let's say I make 200x200 images for each of the 50-100 false positives:
Do I create 'annotation' xml files for each, with no 'object' element?
...or do I label these hard negatives as a second class?
If I then have 100 negatives to 100 positives in my training set - is that a healthy ratio? How many negatives can I include?
I've revisited this topic recently in my work and thought I'd update with my current learnings for any who visit in the future.
The topic appeared on Tensorflow's Models repo issue tracker. SSD allows you to set the ratio of negative:positive examples to mine (max_negatives_per_positive: 3), but you can also set a minimum number for images with no positives (min_negatives_per_image: 3). Both of these are defined in the model-ssd-loss config section.
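As I understand the two settings, they combine into a per-image cap on mined negatives roughly as sketched below. This is my reading of the behaviour, not code from the library, and num_negatives_to_keep is a hypothetical helper:

```python
def num_negatives_to_keep(num_positives, num_negatives,
                          max_negatives_per_positive=3,
                          min_negatives_per_image=3):
    """How many of the hardest negatives would be kept for one image."""
    # The ratio sets the cap, but images with few or no positives
    # still get at least min_negatives_per_image negatives.
    wanted = max(min_negatives_per_image,
                 int(max_negatives_per_positive * num_positives))
    # Cannot keep more negatives than the image actually has.
    return min(wanted, num_negatives)
```

With the values quoted above, an image with 2 positives would keep up to 6 negatives, while an image with no positives would still contribute 3.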
That said, I don't see the same option in Faster-RCNN's model configuration. It's mentioned in the issue that models/research/object_detection/core/balanced_positive_negative_sampler.py contains the code used for Faster-RCNN.
One other option discussed in the issue is creating a second class specifically for lookalikes. During training, the model will attempt to learn class differences which should help serve your purpose.
Lastly, I came across this article on Filter Amplifier Networks (FAN) that may be informative for your work on aerial imagery.
The following paper describes hard negative mining for the same purpose you describe:
Training Region-based Object Detectors with Online Hard Example Mining
In section 3.1 they describe using a foreground and background class:
Background RoIs. A region is labeled background (bg) if its maximum
IoU with ground truth is in the interval [bg_lo, 0.5). A lower
threshold of bg_lo = 0.1 is used by both FRCN and SPPnet, and is
hypothesized in [14] to crudely approximate hard negative mining; the
assumption is that regions with some overlap with the ground truth are
more likely to be the confusing or hard ones. We show in Section 5.4
that although this heuristic helps convergence and detection accuracy,
it is suboptimal because it ignores some infrequent, but important,
difficult background regions. Our method removes the bg_lo threshold.
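To make the quoted rule concrete, here is a small sketch of labelling a region by its maximum IoU with the ground truth. The function names are hypothetical, boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, and the 0.5 foreground threshold is taken from the same section of the paper:

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_roi(roi, gt_boxes, bg_lo=0.1, fg_thr=0.5):
    """Return 'fg', 'bg' or 'ignored' for one region of interest."""
    best = max((iou(roi, gt) for gt in gt_boxes), default=0.0)
    if best >= fg_thr:
        return "fg"
    if best >= bg_lo:
        return "bg"  # IoU in [bg_lo, fg_thr)
    return "ignored"  # too little overlap to count as background
```

Calling label_roi with bg_lo=0.0 corresponds to removing the lower threshold, so regions with no overlap at all also become background candidates.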
In fact this paper is referenced and its ideas are used in Tensorflow's object detection losses.py code for hard mining:
class HardExampleMiner(object):
  """Hard example mining for regions in a list of images.

  Implements hard example mining to select a subset of regions to be
  back-propagated. For each image, selects the regions with highest losses,
  subject to the condition that a newly selected region cannot have
  an IOU > iou_threshold with any of the previously selected regions.
  This can be achieved by re-using a greedy non-maximum suppression algorithm.
  A constraint on the number of negatives mined per positive region can also be
  enforced.

  Reference papers: "Training Region-based Object Detectors with Online
  Hard Example Mining" (CVPR 2016) by Srivastava et al., and
  "SSD: Single Shot MultiBox Detector" (ECCV 2016) by Liu et al.
  """
Based on your model config file, the HardMinerObject is returned by losses_builder.py in this bit of code:
def build_hard_example_miner(config,
                             classification_weight,
                             localization_weight):
  """Builds hard example miner based on the config.

  Args:
    config: A losses_pb2.HardExampleMiner object.
    classification_weight: Classification loss weight.
    localization_weight: Localization loss weight.

  Returns:
    Hard example miner.
  """
  loss_type = None
  if config.loss_type == losses_pb2.HardExampleMiner.BOTH:
    loss_type = 'both'
  if config.loss_type == losses_pb2.HardExampleMiner.CLASSIFICATION:
    loss_type = 'cls'
  if config.loss_type == losses_pb2.HardExampleMiner.LOCALIZATION:
    loss_type = 'loc'

  max_negatives_per_positive = None
  num_hard_examples = None
  if config.max_negatives_per_positive > 0:
    max_negatives_per_positive = config.max_negatives_per_positive
  if config.num_hard_examples > 0:
    num_hard_examples = config.num_hard_examples
  hard_example_miner = losses.HardExampleMiner(
      num_hard_examples=num_hard_examples,
      iou_threshold=config.iou_threshold,
      loss_type=loss_type,
      cls_loss_weight=classification_weight,
      loc_loss_weight=localization_weight,
      max_negatives_per_positive=max_negatives_per_positive,
      min_negatives_per_image=config.min_negatives_per_image)
  return hard_example_miner
which is returned by model_builder.py and called by train.py. So basically, it seems to me that simply generating your true positive labels (with a tool like LabelImg or RectLabel) should be enough for the train algorithm to find hard negatives within the same images. The related question gives an excellent walkthrough.
In the event you want to feed in data that has no true positives (i.e. nothing should be classified in the image), just add the negative image to your tfrecord with no bounding boxes.
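If your labelling tool writes Pascal VOC style XML, a negative image can be described by an annotation that simply has no object element. A minimal sketch using only the standard library; the file name and image size are hypothetical, empty_voc_annotation is an illustrative helper, and whether your XML-to-tfrecord script accepts such a file depends on that script:

```python
import xml.etree.ElementTree as ET

def empty_voc_annotation(filename, width, height, depth=3):
    """Build a VOC-style annotation tree with image metadata and no objects."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = str(depth)
    # No <object> children: the image contains nothing to detect.
    return root

root = empty_voc_annotation("negative_001.jpg", 200, 200)
xml_text = ET.tostring(root, encoding="unicode")
```

When such a file is converted, the bounding-box and class lists for that example should end up empty, which is what "no bounding boxes" means in the tfrecord.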
I think I went through the same or a very similar scenario, and it's worth sharing with you.
I managed to solve it by passing images without annotations to the trainer.
In my scenario I'm building a project to detect assembly failures in my client's products in real time.
I achieved very robust results (good enough for a production environment) by using detection + classification for components that have an explicit negative pattern (e.g. a screw that is either screwed in or missing, leaving just the hole) and detection only for things that don't have a negative pattern (e.g. a tape that can be placed anywhere).
The system requires the user to record two videos, one containing the positive scenario and another containing the negative one (or n videos containing n patterns of positive and negative, so that the algorithm can generalize).
After testing for a while, I found that if I registered only tape for detection, the detector gave very confident (0.999) false positive detections of tape. It was learning the pattern of the place where the tape was inserted instead of the tape itself. When I had another component (like a screw in its negative form), I was passing the negative pattern of the tape without being explicitly aware of it, so the false positives didn't happen.
So I found that, in this scenario, I had to pass images without tape so that the detector could differentiate between tape and no tape.
I considered two alternatives to experiment with in order to solve this behavior:
Train with a considerable number of images that don't have any annotation (10% of all my negative samples) along with all the images that have real annotations.
On the images that have no annotation, create a dummy annotation with a dummy label to force the detector to train on that image (thus learning the no-tape pattern). Later on, when the dummy predictions come back, just ignore them.
I concluded that both alternatives worked perfectly in my scenario.
The training loss got a little messy but the predictions work with robustness for my very controlled scenario (the system's camera has its own box and illumination to decrease variables).
I had to make two little modifications for the first alternative to work:
For all images that didn't have any annotation, I passed a dummy annotation (class=None, xmin/ymin/xmax/ymax=-1).
When generating the tfrecord files, I use this information (xmin == -1, in this case) to add an empty list for the sample:
def create_tf_example(group, path, label_map):
    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []

    for index, row in group.object.iterrows():
        # Skip missing rows and the dummy annotation (xmin == -1), so that
        # images without objects keep empty lists
        if not pd.isnull(row.xmin):
            if not row.xmin == -1:
                xmins.append(row['xmin'] / width)
                xmaxs.append(row['xmax'] / width)
                ymins.append(row['ymin'] / height)
                ymaxs.append(row['ymax'] / height)
                # class name and id for each kept box (assumes a 'class' column)
                classes_text.append(row['class'].encode('utf8'))
                classes.append(label_map[row['class']])

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example
Part of the training progress:
Currently I'm using tensorflow object detection along with tensorflow==1.15, using faster_rcnn_resnet101_coco.config.
I hope it will solve someone's problem, as I didn't find any solution on the internet. I read a lot of people saying that faster_rcnn is not suited to negative training for false-positive reduction, but my tests proved the opposite.
Can anybody please show me how to use the RANSAC algorithm to select common feature points in two images that have a certain portion of overlap? The problem arose from feature-based image stitching.
I implemented an image stitcher a couple of years back. The Wikipedia article on RANSAC describes the general algorithm well.
When using RANSAC for feature-based image matching, what you want is to find the transform that best maps the first image to the second image. This is the model described in the Wikipedia article.
If you have already got the features for both images and have found which features in the first image best match which features in the second image, RANSAC would be used something like this.
The input to the algorithm is:
n - the number of random points to pick every iteration in order to create the transform. I chose n = 3 in my implementation.
k - the number of iterations to run
t - the threshold on the squared distance for a point to be considered a match
d - the number of points that need to be matched for the transform to be valid
image1_points and image2_points - two arrays of points of the same size. Assumes that image1_points[x] is best mapped to image2_points[x] according to the computed features.
best_model = null
best_error = Inf
for i = 0:k
    rand_indices = n distinct random integers from 0:num_points
    base_points = image1_points[rand_indices]
    input_points = image2_points[rand_indices]
    maybe_model = find best transform from input_points -> base_points

    consensus_set = 0
    total_error = 0
    for j = 0:num_points
        error = square distance of the difference between image2_points[j] transformed by maybe_model and image1_points[j]
        if error < t
            consensus_set += 1
            total_error += error

    if consensus_set > d && total_error < best_error
        best_model = maybe_model
        best_error = total_error
The end result is the transform that best maps the points in image2 to image1, which is exactly what you want when stitching.
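For reference, the loop above can be sketched in Python with NumPy. This assumes the transform is a 2D affine map (which is what n = 3 point pairs determine) fitted by least squares; fit_affine, apply_affine and ransac_affine are hypothetical names, and the default parameter values and synthetic points are only for illustration:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix A mapping src points onto dst points."""
    X = np.hstack([src, np.ones((len(src), 1))])   # (m, 3)
    sol, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (3, 2)
    return sol.T                                   # (2, 3)

def apply_affine(A, pts):
    """Apply a 2x3 affine matrix to an (m, 2) array of points."""
    return pts @ A[:, :2].T + A[:, 2]

def ransac_affine(image1_points, image2_points, n=3, k=200, t=4.0, d=10, seed=0):
    rng = np.random.default_rng(seed)
    best_model, best_error = None, np.inf
    num_points = len(image1_points)
    for _ in range(k):
        # Fit a candidate transform image2 -> image1 on n random matches.
        idx = rng.choice(num_points, size=n, replace=False)
        maybe_model = fit_affine(image2_points[idx], image1_points[idx])
        # Squared distance of every transformed point to its match.
        diff = apply_affine(maybe_model, image2_points) - image1_points
        sq_err = np.sum(diff ** 2, axis=1)
        inliers = sq_err < t
        consensus = int(inliers.sum())
        total_error = float(sq_err[inliers].sum())
        if consensus > d and total_error < best_error:
            best_model, best_error = maybe_model, total_error
    return best_model

# Synthetic check: 30 exact matches plus 10 gross outliers.
rng = np.random.default_rng(1)
pts2 = rng.uniform(0, 100, (40, 2))
A_true = np.array([[0.9, -0.2, 5.0], [0.2, 0.9, -3.0]])
pts1 = apply_affine(A_true, pts2)
pts1[30:] += rng.uniform(20, 50, (10, 2))
model = ransac_affine(pts1, pts2)
```

For real stitching with perspective changes a homography is the more usual model; the loop stays the same and only the fitting step changes.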