Extracting Semi-structured Text from a complex UI (Golf Simulator) - opencv

I'm fairly new to the world of OCR, OpenCV, Tesseract etc and was hoping to get some advice or a nudge in the right direction for a project I'm working on. For context, I practice golf at an indoor simulator that is powered by Full Swing Golf. My goal is to build an app (preferably iphone, but desktop is fine too) that will be able to grab the data provided by the simulator and process it however I'd like. The overall workflow would look something like:
Set up iPhone or laptop camera to watch the simulator screen.
Hit ball
Statistics Screen is displayed that looks more or less like:
Detect that the Statistics Screen has been displayed and grab all relevant data:
| Distance | Launch | Back Spin | Club Speed | Carry | To Pin | Direction | Ball Speed | Side Spin | Club Face | Club Path |
|----------|--------|-----------|------------|-------|--------|-----------|------------|-----------|-----------|-----------|
| 345 | 13 | 3350 | 135 | 335 | 80 | 2.4 | 190 | 350 | 4.3 | 1.6 |
5-?: Save the data to my app, keep track of it over time etc...
Attempts So Far:
It seemed like OpenCV's matchTemplate would be a simple way to find all of the headings in the image (Distance, Launch etc...) and it does seem to work when the image and template are both the perfect resolution. However, as this will be an iPhone app, the quality is not something I can really guarantee (within reason). Moreso, the screen will almost never be straight-on as it appears above. Most likely, the camera will be off to the side and we will have to de-skew accordingly. I've attempted to use the following image to work on my deskewing logic to no avail:
Finding the reference points in order to deskew via getPerspectiveTransform and warpPerspective has proven to be incredibly difficult due to the above issues with matching templates.
I've also tried dynamically adjusting for scale with code resembling the following:
def findTemplateLocation(image_path):
template = cv2.imread(image_path)
template = cv2.cvtColor(template, cv2.COLOR_BGR2GRAY)
w, h = template.shape[::-1]
threshold = 0.65
loc = []
for scale in np.linspace(0.1, 2, 20)[::-1]:
resized = imutils.resize(template, width=int(template.shape[1] * scale))
w, h = resized.shape[::-1]
res = cv2.matchTemplate(image_gray, resized, cv2.TM_CCOEFF_NORMED)
loc = np.where(res >= threshold)
if len(list(zip(*loc[::-1]))) > 0:
break
if loc and len(list(zip(*loc[::-1]))) > 0:
adjusted_w = int(w/scale)
adjusted_h = int(h/scale)
print(str(adjusted_w) + " " + str(adjusted_h) + " " + str(scale))
ret = []
for pt in zip(*loc[::-1]):
ret.append({'width': w, 'height': h, 'location': pt})
return ret
return None
This still returns a ton of false positives.
I'm hoping to get some advice on how to approach this problem with a clean slate. I'm open to any language / workflow.
If it does seem that I'm on the right track, my current code is at https://gist.github.com/naderhen/9ec8d45f13d92507131d5bce0e84fad8 . Would really appreciate any suggestions for the best next steps.
Thanks for any help you can provide!
EDIT: Additional Resources
I've uploaded a number of videos and still photos from my time at the indoor simulator this weekend: https://www.dropbox.com/sh/5vub2mi4rvunyaw/AAAY1_7Q_WBV4JvmDD0dEiTDa?dl=0
I tried to get a number of different angles, with different lighting etc. Please let me know if I can provide any other resources that may help.

So, I tried two different methods:
Contour detection - This seemed to be the most obvious method since the statistics screen is the primary part of the image and is present in all your images. Although it does work with two of the three images, it might not be very robust with the parameters. Here are the steps that I tried for contour:
First, get the the image in grayscale or take one of the Value channel in HSV. Then, threshold the image using either Otsu or Adaptive Thresholding. After playing with a lot of the associated parameters, I got satisfactory results, which would basically mean nice whole statistics screen in white on a black background. After this, sort the contours like this:
contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[1]
# Sort the contours to avoid unnecessary comparison in the for loop below
cntsSorted = sorted(contours, key=lambda x: cv2.contourArea(x), reverse=True)
for cnt in cntsSorted[0:20]:
peri = cv2.arcLength(cnt, True)
approx = cv2.approxPolyDP(cnt, 0.04 * peri, True)
if len(approx) == 4 and peri > 10000:
cv2.drawContours(sorted_image, cnt, -1, (0, 255, 0), 10)
Feature Detection and Matching: Since using contours wasn't robust enough, I tried another method which I've worked on for a similar problem as yours. This method is fairly robust, much faster (I tried this on an android phone 2 years back and it would do the job in less than a second for a 1280 x 760 image). However, after trying on your work cases, I figured that your images are pretty vague. What I mean by that is, you have two images in your question that have fairly similar primaries and it works for that but the images you posted in comments are very different from these and hence it doesn't find suitable number of good matches (at least 10 in my case). If you can post a nice set of images that you actually will encounter, I will update this answer with my results on the new set. More importantly, the images of the scene evidently have change in perspective which shouldn't be an issue assuming you are able to get a very good source image (as the first one in your question). However, the change in lighting conditions can be a pain. I'd suggest either using different color spaces such as HSV, Lab and Luv instead of BGR.
Here is where you can find a working example of how to implement your own feature matcher. There are some code changes required depending on the version of OpenCV you are using but I am sure you can find the solutions ( I did ;) ).
A good example:
Some suggestions:
Try getting as clean an image as possible for the image you are using to match with the others (your first image in my case). Hopefully, this would require you to do less processing.
Try using unsharp mask before finding Keypoints.
My results are from using ORB. You can also try with other detectors/descriptors like SURF, SIFT and FAST.
Finally, your approach of template matching should work in cases where there is a change only in the scaling and not the perspective.
Hope this helps! Write a comment if you have any additional questions and/or when you have a good image set ready (rubs palms). Cheers!
Edit 1: This is the code that I used for the Feature Detection and Matching in Opencv 3.4.3 and Python 3.4
def unsharp_mask(im):
# This is used to sharpen images
gaussian_3 = cv2.GaussianBlur(im, (3, 3), 3.0)
return cv2.addWeighted(im, 2.0, gaussian_3, -1.0, 0, im)
def screen_finder2(image, source, num=0):
def resize(im, new_width):
r = float(new_width) / im.shape[1]
dim = (new_width, int(im.shape[0] * r))
return cv2.resize(im, dim, interpolation=cv2.INTER_AREA)
width = 300
source = resize(source, new_width=width)
image = resize(image, new_width=width)
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2LUV)
image, u, v = cv2.split(hsv)
hsv = cv2.cvtColor(source, cv2.COLOR_BGR2LUV)
source, u, v = cv2.split(hsv)
MIN_MATCH_COUNT = 10
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(image, None)
kp2, des2 = orb.detectAndCompute(source, None)
flann = cv2.DescriptorMatcher_create(cv2.DescriptorMatcher_FLANNBASED)
# Without the below 2 lines, matching doesn't work
des1 = np.asarray(des1, dtype=np.float32)
des2 = np.asarray(des2, dtype=np.float32)
matches = flann.knnMatch(des1, des2, k=2)
# store all the good matches as per Lowe's ratio test
good = []
for m, n in matches:
if m.distance < 0.7 * n.distance:
good.append(m)
if len(good) >= MIN_MATCH_COUNT:
src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1,
1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1,
1, 2)
M, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
matchesMask = mask.ravel().tolist()
h,w = image.shape
pts = np.float32([[0, 0], [0, h-1], [w-1, h-1], [w-1, 0]]).reshape(-1,
1, 2)
dst = cv2.perspectiveTransform(pts, M)
source_bgr = cv2.cvtColor(source, cv2.COLOR_GRAY2BGR)
img2 = cv2.polylines(source_bgr, [np.int32(dst)], True, (0,0,255), 3,
cv2.LINE_AA)
cv2.imwrite("out"+str(num)+".jpg", img2)
else:
print("Not enough matches." + str(len(good)))
matchesMask = None
draw_params = dict(matchColor=(0, 255, 0), # draw matches in green color
singlePointColor=None,
matchesMask=matchesMask, # draw only inliers
flags=2)
img3 = cv2.drawMatches(image, kp1, source, kp2, good, None, **draw_params)
cv2.imwrite("ORB"+str(num)+".jpg", img3)
match_image = unsharp_mask(cv2.imread("source.jpg"))
image_1 = unsharp_mask(cv2.imread("Screen_1.jpg"))
screen_finder2(match_image, image_1, num=1)

Related

Gabor filter parametrs for fingerprint image enhancement?

i am biggner in image processing and in gabor filter and i want to use this filter to enhance fingerprint image
i read many articles about fingerprint image enhancement and i know that the steps for that is
read image -> noramalize -> get orientation map -> gabor filter -> binarize -> skeleton
now i am in step 4 , my question is how to get the right values for ( lambds and gamma ) for gabor
filter
my image :
my code :
1- read image and get the orientation map using HOG features
imgc = imread(r'C:\Users\iP\Desktop\printe.jpg',as_gray=True)
imgc = resize(imgc, (64*3,128*3))
rows,cols=imgc.shape
offset=24
ori=9 # to get angels (0,45,90,135) only
fd, hog_image = hog(imgc, orientations=ori, pixels_per_cell=(offset, offset),
cells_per_block=(1, 1), visualize=True, multichannel=None,feature_vector=False
)
orientation map :
2- reshape the orientation map from (8, 16, 1, 1, 9) to (8, 16, 9),,,
8 ->rows , 16 -> cols , 9 orientation
fd=np.array(fd)
fd=np.reshape(fd,(fd.shape[0],fd.shape[1],ori))
# from (8, 16, 9) to (8, 16, 1)
# Choose the angle that has the most potential ( biggest magntude )
angels=np.zeros((fd.shape[0],fd.shape[1],1))
for r in range(fd.shape[0]):
for c in range(fd.shape[1]):
bloc_prop = fd[r,c]
angelss=bloc_prop.reshape((1,ori))
angel=np.argmax(angelss)
angels[r,c]=angel
angels=angels.astype(np.int32)
3- the convolve function
def conv_gabor(img,orient_map,gabor_kernel_shape):
#
# loop on all pixels in the image and convolve it with it's angel in the orientation map
#
roo,coo=img.shape
#to get the padding value for immage before convolving it with kernels
pad=(gabor_kernel_shape-1)
padded=np.zeros((img.shape[0]+pad,img.shape[1]+pad)) # adding the cols and rows
padded[int(pad/2):-int(pad/2),int(pad/2):-int(pad/2)]=img # copy image to inside the padded
image
#result image
dst=padded.copy()
# start from the image that inside the padded
for r in range(int(pad/2),int(pad/2)+roo):
for c in range(int(pad/2),int(pad/2)+coo):
# get the angel from the orientation map
ro=(r-int(pad/2))//offset
co=(c-int(pad/2))//offset
ang=angels[ro,co]
real_angel=(((180/ori)*ang))
# bloack around the pixe to convolve it
block=padded[r-int(pad/2):r+int(pad/2)+1,c-int(pad/2):c+int(pad/2)+1]
# get Gabor kernel
# here is my question ->> what to get the parametres values for ( lambda and gamma
and phi)
ker= cv2.getGaborKernel( (gabor_kernel_shape,gabor_kernel_shape), 3,
np.deg2rad(real_angel),np.pi/4,0.001,0 )
dst[r,c]=np.sum((ker*block))
return dst
dst=conv_gabor(imgc,angels,11)
dst :
you see the image is too bad i dont know why this , i think because the lambda and gamma or what ?
but when i filter with one angel only 45 :
ker= cv2.getGaborKernel( (11,11), 2, np.deg2rad(45),np.pi/4,0.5,0 )
filt = cv2.filter2D(imgc,cv2.CV_64F,ker)
plt.imshow(filt,'gray')
reslut :
you see the edges that has 45 on the left is good quality
can anyone help me please , and tell me what should i do in this probelm ?
thanks all :)
EDIT:
i searched for another way and i found that i can use gabor fiter bank with many orientation and get best score in filtred images , so how can i find best score for pixels from filtred images
this is the output when i use gabor fiter bank with 45,60,65,90,135 angels and divide the filtered images to 16*16 and find the highest standard deviation (best score -> i use standard deviation as the score) for each block and get the best filtred image
so as you can see there are good and bad parts in the image ,i think using standard deviation alone is ineffective in some parts of the image , so my new question is what is best score function that gives me good output parts in the image
original image :
In my opinion, weighting the filtered images might be enough for your task. Considering your filter orientations, the filters with angle 45 and 135 respond quite well at different regions of the image. So, you can calculate the weighted sum to get the best filter result.
img = cv2.imread('fingerprint.jpg',0)
w_45 = 0.5
w_135 = 0.5
img_45 = cv2.filter2D(img,cv2.CV_64F,cv2.getGaborKernel( (11,11), 2, np.deg2rad(45),np.pi/4,0.5,0 ))
img_135 = cv2.filter2D(img,cv2.CV_64F,cv2.getGaborKernel( (11,11), 2, np.deg2rad(135),np.pi/4,0.5,0 ))
result = img_45*w_45+img_135*w_135
result = result/np.amax(result)*255
plt.imshow(result,cmap='gray')
plt.show()
Feel free to play with the weights. The result totally depends on what your next step is.

OpenCV Best way to match the spot patterns

I'm trying to write an app for wild leopard classification and conservation in South Asia. For this, I have the main challenge to identify the leopards by their spot pattern in the forehead.
The current approach I am using is,
Store the known leopard forehead images as a base list
Get the user-provided leopard image and crop the forehead of the leopard
Pre-process the images with the bilateral filter to reduce the noise
Identify the keypoints using the SIFT algorithm
Use FLANN matcher to get KNN matches
Select good matches based on the ratio threshold
Sample code:
# Pre-Process & reduce noise.
img1 = cv.bilateralFilter(baseImg, 9, 75, 75)
img2 = cv.bilateralFilter(userImage, 9, 75, 75)
detector = cv.xfeatures2d_SIFT.create()
keypoints1, descriptors1 = detector.detectAndCompute(img1, None)
keypoints2, descriptors2 = detector.detectAndCompute(img2, None)
# FLANN parameters
FLANN_INDEX_KDTREE = 1
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
search_params = dict(checks=50) # or pass empty dictionary
matcher = cv.FlannBasedMatcher(index_params, search_params)
knn_matches = matcher.knnMatch(descriptors1, descriptors2, 2)
allmatchpointcount = len(knn_matches)
ratio_thresh = 0.7
good_matches = []
for m, n in knn_matches:
if m.distance < ratio_thresh * n.distance:
good_matches.append(m)
goodmatchpointcount = len(good_matches)
print("Good match count : ", goodmatchpointcount)
matchsuccesspercentage = goodmatchpointcount/allmatchpointcount*100
print("Match percentage : ", matchsuccesspercentage)
Problems I have with this approach:
The method has a medium-low success rate and tends to break when there is a new user image.
The user images are sometimes taken from different angles where some key patterns are not visible or warped.
The user image quality affects the match result significantly.
I appreciate any suggestions to get this improved in any manner.
Sample Images
Base Image
Above is matching to below: (Incorrect pattern matched)
More sample images as requested.

Extract text information from PDF files with different layouts - machine learning

I need assistance with a ML project I am currently trying to create.
I receive a lot of invoices from a lot of different suppliers - all in their own unique layout. I need to extract 3 key elements from the invoices. These 3 elements are all located in a table/line items for all the invoices.
The 3 elements are:
1: Tariff number (digit)
2: Quantity (always a digit)
3: Total line amount (monetary value)
Please refer to below screenshot, where I have marked these field on a sample invoice.
I started this project with a template approach, based on regular expressions. This, however, was not scaleable at all and I ended up with tons of different rules.
I am hoping that machine learning can help me here - or maybe a hybrid solution?
The common denominator
In all of my invoices, despite of the different layouts, each line item will always consist of one tariff number. This tariff number is always 8 digits, and is always formatted in one the ways like below:
xxxxxxxx
xxxx.xxxx
xx.xx.xx.xx
(Where "x" is a digit from 0 - 9).
Further, as you can see on the invoice there is both a Unit Price and a Total Amount per line. The amount I will need is always the highest for each line.
The output
For each invoice like the one above, I need the output for each line. This could for example be something like this:
{
"line":"0",
"tariff":"85444290",
"quantity":"3",
"amount":"258.93"
},
{
"line":"1",
"tariff":"85444290",
"quantity":"4",
"amount":"548.32"
},
{
"line":"2",
"tariff":"76109090",
"quantity":"5",
"amount":"412.30"
}
Where to go from here?
I am not sure of what I am looking to do falls under machine learning and if so, under which category. Is it computer vision? NLP? Named Entity Recognition?
My initial thought was to:
Convert the invoice to text. (The invoices are all in textable PDFs, so I can use something like pdftotext to get the exact textual values)
Create custom named entities for quantity, tariff and amount
Export the found entities.
However, I feel like I might be missing something.
Can anyone assist me in the right direction?
Edit:
Please see below for a few more examples of how an invoice table section can look like:
Sample invoice #2
Sample invoice #3
Edit 2:
Please see below for the three sample images, without the borders/bounding boxes:
Image 1:
Image 2:
Image 3:
Here's an attempt using OpenCV, the idea is:
Obtain binary image. We load the image, enlarge using
imutils.resize to help obtain better OCR results (see Tesseract improve quality), convert to grayscale, then Otsu's threshold to obtain a binary image (1-channel).
Remove table grid lines. We create a horizontal and vertical kernels then perform morphological operations to combine adjacent text contours into a single contour. The idea is to extract a ROI row as one piece to OCR.
Extract row ROIs. We find contours then sort from top-to-bottom using imutils.contours.sort_contours. This ensures that we iterate through each row in the correct order. From here we iterate through the contours, extract the row ROI using Numpy slicing, OCR using Pytesseract, then parse the data.
Here's the visualization of each step:
Input image
Binary image
Morph close
Visualization of iterating through each row
Extracted row ROIs
Output invoice data result:
{'line': '0', 'tariff': '85444290', 'quantity': '3', 'amount': '258.93'}
{'line': '1', 'tariff': '85444290', 'quantity': '4', 'amount': '548.32'}
{'line': '2', 'tariff': '76109090', 'quantity': '5', 'amount': '412.30'}
Unfortunately, I get mixed results when trying on the 2nd and 3rd image. This method does not produce great results on the other images since the layout of the invoices are all different. However, this approach shows that it's possible to use traditional image processing techniques to extract the invoice information with the assumption that you have a fixed invoice layout.
Code
import cv2
import numpy as np
import pytesseract
from imutils import contours
import imutils
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Load image, enlarge, convert to grayscale, Otsu's threshold
image = cv2.imread('1.png')
image = imutils.resize(image, width=1000)
height, width = image.shape[:2]
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
cv2.drawContours(thresh, [c], -1, 0, -1)
# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,50))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
cv2.drawContours(thresh, [c], -1, 0, -1)
# Morph close to combine adjacent contours into a single contour
invoice_data = []
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (85,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)
# Find contours, sort from top-to-bottom
# Iterate through contours, extract row ROI, OCR, and parse data
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
(cnts, _) = contours.sort_contours(cnts, method="top-to-bottom")
row = 0
for c in cnts:
x,y,w,h = cv2.boundingRect(c)
ROI = image[y:y+h, 0:width]
ROI = cv2.GaussianBlur(ROI, (3,3), 0)
data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
parsed = [word.lower() for word in data.split()]
if 'tariff' in parsed or 'number' in parsed:
row_data = {}
row_data['line'] = str(row)
row_data['tariff'] = parsed[-1]
row_data['quantity'] = parsed[2]
row_data['amount'] = str(max(parsed[10], parsed[11]))
row += 1
print(row_data)
invoice_data.append(row_data)
# Visualize row extraction
'''
mask = np.zeros(image.shape, dtype=np.uint8)
cv2.rectangle(mask, (0, y), (width, y + h), (255,255,255), -1)
display_row = cv2.bitwise_and(image, mask)
cv2.imshow('ROI', ROI)
cv2.imshow('display_row', display_row)
cv2.waitKey(1000)
'''
print(invoice_data)
cv2.imshow('thresh', thresh)
cv2.imshow('close', close)
cv2.waitKey()
I'm working on a similar problem in the logistics industry and trust me when I say these document tables come in myriad layouts. Numerous companies that have somewhat solved and are improving on this problem are mentioned as under
Leaders: ABBYY, AntWorks, Kofax, and WorkFusion
Major Contenders: Automation Anywhere, Celaton, Datamatics, EdgeVerve, Extract Systems, Hyland, Hyperscience, Infrrd, and Parascript
Aspirants: Ikarus, Rossum, Shipmnts(Alex), Amazon(Textract), Docsumo, Docparser, Aidock
The category I would like to put this problem under would be multi-modal learning, because both textual and image modalities contribute a good deal in this problem. Though OCR tokens play a vital role in attribute-value classification, their position on the page, spacing and inter-character distances hold as very important features in detecting table, row and column boundaries. The problem gets all the more interesting when rows break across pages, or some columns carry non-empty values.
While the academic world and conferences uses the term Intelligent Document Processing, in general for extracting both singular fields and tabular data. The former is more known by attribute-value classification and the latter is famous by table extraction or repeated-structure extraction, in research literature.
In our foray in processing these semi-structured documents over the 3 years, I feel that achieving both accuracy and scalability is a long and arduous journey. The solutions that offer scalability / 'template free' approach do have annotated corpus of semi-structured business documents in the order of tens of thousand, if not millions. Though this approach is a scalable solution, it's as good as the documents it has been trained on. If your documents hail from the logistics or insurance sector, which are known for their complex layouts, and need to be super-accurate owing to the compliance procedures, a 'template-based' solution would be the panacea to your ills. It is guaranteed to give more accuracy.
If you need links to existing research, do mention in the comments below and I'ld be happy to share them.
Also, I would recommend using pdfparser1 over pdf2text or pdfminer because the former gives character level information in digital files at significantly better performance.
Would be happy to incorporate any feedback, as this is my first answer here.

Searching for objects from database on image

Let's suppose I have database with thousands of images with different forms and sizes (smaller than 100 x 100px) and it's guaranted that every of images shows only one object - symbol, logo, road sign, etc. I would like to be able to take any image ("my_image.jpg") from the Internet and answer the question "Do my_image contains any object (object can be resized, but without deformations) from my database?" - let's say with 95% reliability. To simplify my_images will have white background.
I was trying use imagehash (https://github.com/JohannesBuchner/imagehash), which would be very helpful, but to get rewarding results I think I have to calculate (almost) every possible hash of my_image - the reason is I don't know object size and location on my_image:
hash_list = []
MyImage = Image.open('my_image.jpg')
for x_start in range(image_width):
for y_start in range(image_height):
for x_end in range(x_start, image_width):
for y_end in range(y_start, image_height):
hash_list.append(imagehash.phash(MyImage.\
crop(x_start, y_start, x_end, y_end)))
...and then try to find similar hash in database, but when for example image_width = image_height = 500 this loops and searching will take ages. Of course I can optymalize it a little bit but it still looks like seppuku for bigger images:
MIN_WIDTH = 30
MIN_HEIGHT = 30
STEP = 2
hash_list = []
MyImage = Image.open('my_image.jpg')
for x_start in range(0, image_width - MIN_WIDTH, STEP):
for y_start in range(0, image_height - MIN_HEIGHT, STEP):
for x_end in range(x_start + MIN_WIDTH, image_width, STEP):
for y_end in range(y_start + MIN_HEIGHT, image_height, STEP):
hash_list.append(...)
I wonder if there is some nice way to define which parts of my_image are profitable to calculate hashes - for example cutting edges looks like bad idea. And maybe there is an easier solve? It will be great if the program could give the answer in max 20 minutes. I would be gratefull for any advice.
PS: sorry for my English :)
This looks like an image retrieval problem to me. However, in your case, you are more interested in a binary YES / NO answer which tells if the input image (my_image.jpg) is of an object which is present in your database.
The first thing which I can suggest is that you can resize all the images (including input) to a fixed size, say 100 x 100. But if an object in some image is very small or is present in a specific region of image (for e.g., top left) then resizing can make things worse. However, it was not clear from your question that how likely this is in you case.
About your second question for finding out the location of object, I think you were considering this because your input images are of large size, such as 500 x 500? If so, then resizing is better idea. However, if you asked this question because objects a localized to particular regions in images, then I think you can compute a gradient image which will help you to identify background regions as follows: since background has no variation (complete white) gradient values will be zero for pixels belonging to background regions.
Rather than calculating and using image hash, I suggest you to read about bag-of-visual-words (for e.g., here) based approaches for object categorization. Although your aim is not to categorize objects, but it will help you come up with a different approach to solve your problem.
After all I found solution that looks really nice for me and maybe it will be useful for someone else:
I'm using SIFT to detect "best candidates" from my_image:
def multiscale_template_matching(template, image):
results = []
for scale in np.linspace(0.2, 1.4, 121)[::-1]:
res = imutils.resize(image, width=int(image.shape[1] * scale))
r = image.shape[1] / float(res.shape[1])
if res.shape[0] < template.shape[0] or res.shape[1] < template.shape[1];
break
## bigger correlation <==> better matching
## template_mathing uses SIFT to return best correlation and coordinates
correlation, (x, y) = template_matching(template, res)
coordinates = (x * r, y * r)
results.appent((correlation, coordinates, r))
results.sort(key=itemgetter(0), reverse=True)
return results[:10]
Then for results I'm calculating hashes:
ACCEPTABLE = 10
def find_best(image, template, candidates):
template_hash = imagehash.phash(template)
best_result = 50 ## initial value must be greater than ACCEPTABLE
best_cand = None
for cand in candidates:
cand_hash = get_hash(...)
hash_diff = template_hash - cand_hash
if hash_diff < best_result:
best_result = hash_diff
best_cand = cand
if best_result <= ACCEPTABLE:
return best_cand, best_result
else:
return None, None
If result < ACCEPTABLE, I'm almost sure the answer is "GOT YOU!" :) This solve allows me to compare my_image with 1000 of objects in 7 minutes.

OpenCV: Essential Matrix Decomposition

I am trying to extract Rotation matrix and Translation vector from the essential matrix.
<pre><code>
SVD svd(E,SVD::MODIFY_A);
Mat svd_u = svd.u;
Mat svd_vt = svd.vt;
Mat svd_w = svd.w;
Matx33d W(0,-1,0,
1,0,0,
0,0,1);
Mat_<double> R = svd_u * Mat(W).t() * svd_vt; //or svd_u * Mat(W) * svd_vt;
Mat_<double> t = svd_u.col(2); //or -svd_u.col(2)
</code></pre>
However, when I am using R and T (e.g. to obtain rectified images), the result does not seem to be right(black images or some obviously wrong outputs), even so I used different combination of possible R and T.
I suspected to E. According to the text books, my calculation is right if we have:
E = U*diag(1, 1, 0)*Vt
In my case svd.w which is supposed to be diag(1, 1, 0) [at least in term of a scale], is not so. Here is an example of my output:
svd.w = [21.47903827647813; 20.28555196246256; 5.167099204708699e-010]
Also, two of the eigenvalues of E should be equal and the third one should be zero. In the same case the result is:
eigenvalues of E = 0.0000 + 0.0000i, 0.3143 +20.8610i, 0.3143 -20.8610i
As you see, two of them are complex conjugates.
Now, the questions are:
Is the decomposition of E and calculation of R and T done in a right way?
If the calculation is right, why the internal rules of essential matrix are not satisfied by the results?
If everything about E, R, and T is fine, why the rectified images obtained by them are not correct?
I get E from fundamental matrix, which I suppose to be right. I draw epipolar lines on both the left and right images and they all pass through the related points (for all the 16 points used to calculate the fundamental matrix).
Any help would be appreciated.
Thanks!
I see two issues.
First, discounting the negligible value of the third diagonal term, your E is about 6% off the ideal one: err_percent = (21.48 - 20.29) / 20.29 * 100 . Sounds small, but translated in terms of pixel error it may be an altogether larger amount.
So I'd start by replacing E with the ideal one after SVD decomposition: Er = U * diag(1,1,0) * Vt.
Second, the textbook decomposition admits 4 solutions, only one of which is physically plausible (i.e. with 3D points in front of the camera). You may be hitting one of non-physical ones. See http://en.wikipedia.org/wiki/Essential_matrix#Determining_R_and_t_from_E .

Resources