How is Spark reading my image using the image format? - image-processing

It might be a silly question, but I can't figure out how Spark reads my image using the spark.read.format("image").load(....) call.
After importing my image, I get the following:
>>> image_df.select("image.height","image.width","image.nChannels", "image.mode", "image.data").show()
+------+-----+---------+----+--------------------+
|height|width|nChannels|mode| data|
+------+-----+---------+----+--------------------+
| 430| 470| 3| 16|[4D 55 4E 4C 54 4...|
+------+-----+---------+----+--------------------+
I arrive at the conclusion that:
my image is 430x470 pixels,
my image is colored (RGB, since nChannels = 3), which is an OpenCV-compatible type,
my image mode is 16, which corresponds to a particular OpenCV byte order.
Does someone know which website/documentation I could browse to learn more about this?
The data in the data column is of type Binary, but:
when I run image_df.select("image.data").take(1) I get an output which seems to be only one array (see below).
>>> image_df.select("image.data").take(1)
# 1/ Here are the last elements of the result
....<<One Eternity Later>>....x92\x89\x8a\x8d\x84\x86\x89\x80\x84\x87~'))]
# 2/ I also got several parts of the result which look like:
.....\x89\x80\x80\x83z|\x7fvz}tpsjqtkrulsvmsvmsvmrulrulrulqtkpsjnqhnqhmpgmpgmpgnqhnqhn
qhnqhnqhnqhnqhnqhmpgmpgmpgmpgmpgmpgmpgmpgnqhnqhnqhnqhnqhnqhnqhnqhknejmdilcilchkbh
kbilcilckneloflofmpgnqhorioripsjsvmsvmtwnvypx{ry|sz}t{~ux{ry|sy|sy|sy|sz}tz}tz}tz}
ty|sy|sy|sy|sz}t{~u|\x7fv|\x7fv}.....
What comes next is linked to the results displayed above. These points might be due to my lack of knowledge of OpenCV (or something else). Nonetheless:
1/ I don't understand the fact that, if I have an RGB image, I should have 3 matrices, yet the output finishes with .......\x84\x87~'))]. I was expecting something more like [(...),(...),(...\x87~')].
2/ Does this part have a special meaning? Like a separator between each matrix, or something?
To be clearer about what I'm trying to achieve: I want to process images to do pixel comparisons between images. Therefore, I want to know the pixel values for a given position in my image (I assume that if I have an RGB image, I shall have 3 pixel values for a given position).
Example: let's say that I have a webcam pointing at the sky only during the day, and I want to know the values of a pixel at a position corresponding to the top-left sky part. I found out that the concatenation of those values gives the colour Light Blue, which says that the photo was taken on a sunny day. Let's say that the only possibility is that a sunny day takes the colour Light Blue.
Next I want to compare the previous concatenation with another concatenation of pixel values at the exact same position, but from a picture taken the next day. If I find that they are not equal, then I conclude that the given picture was taken on a cloudy/rainy day. If they are equal, then it was a sunny day.
Any help on that would be highly appreciated. I have simplified my example for a better understanding, but my goal is pretty much the same. I know that ML models exist to achieve this kind of thing, but I would be happy to try this first. My first goal is to split this column into 3 columns corresponding to each color code: a red matrix, a green matrix, a blue matrix.

I think I have the logic. I used the keras.preprocessing.image.img_to_array() function to understand how the values are laid out (since I have an RGB image, I must have 3 matrices: one for each color, R G B). Posting this in case someone wonders how it works; I might be wrong, but I think I have something:
from keras.preprocessing import image
import numpy as np
from PIL import Image
# Using spark built-in data source
first_img = spark.read.format("image").schema(imageSchema).load(".....")
raw = first_img.select("image.data").take(1)[0][0]
np.shape(raw)
(606300,) # which is 470*430*3
# Using keras function
img = image.load_img(".../path/to/img")
yy = image.img_to_array(img)
>>> np.shape(yy)
(430, 470, 3) # the shape is right, but there is an ordering problem, since:
>>> raw[0], raw[1], raw[2]
(77, 85, 78)
>>> yy[0][0]
array([78., 85., 77.], dtype=float32)
# Therefore I used the numpy reshape function directly on raw
# to get 430 rows of 470 pixels, each holding 3 channel values:
array = np.reshape(raw, (430,470,3))
xx = image.img_to_array(array) # OPTIONAL and not used here
>>> array[0][0] == (raw[0],raw[1],raw[2])
array([ True, True, True])
>>> array[0][1] == (raw[3],raw[4],raw[5])
array([ True, True, True])
>>> array[0][2] == (raw[6],raw[7],raw[8])
array([ True, True, True])
>>> array[0][3] == (raw[9],raw[10],raw[11])
array([ True, True, True])
So, if I understood well, Spark reads the image as one big flat array - (606300,) here - where the elements are ordered pixel by pixel and, within each pixel, channel by channel. Note that the channel order appears to be reversed compared to Keras ((77, 85, 78) vs. [78., 85., 77.] above), which is consistent with OpenCV-style BGR rather than RGB.
After my small transformation, I obtain 430 matrices, one per row of pixels (i.e. per height position). Since my image is 470x430 (width x height), each matrix has 470 lines, one per width position, with 3 columns holding the channel values.
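If the goal is the three per-channel matrices mentioned in the question, a minimal sketch building on the same reshape (and assuming the OpenCV-style BGR ordering noted above) could be:

import numpy as np

# raw is the flat (606300,) buffer obtained earlier with take(1)[0][0]
pixels = np.reshape(raw, (430, 470, 3))   # (height, width, channels)
blue, green, red = pixels[:, :, 0], pixels[:, :, 1], pixels[:, :, 2]   # one 430x470 matrix per channel
# pixel at row y, column x: pixels[y, x] -> array([B, G, R])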
Hope that helps someone :)!

Related

Difference between absdiff and normal subtraction in OpenCV

I am currently planning on training a binary image classification model. The images I want to train on are the difference between two original pictures. In other words, for each data entry, I start out with 2 pictures, take their difference, and then label that difference as a 0 or 1. My question is: what is the best way to find this difference? I know about cv2.absdiff and normal subtraction of images - what is the most effective way to go about this?
About the data: the images I'm training on are screenshots that are usually the same but may have small differences. I found that normal subtraction seems to show the differences less than absdiff does.
This is the code I use for absdiff:
import cv2
import numpy as np

diff = cv2.absdiff(img1, img2)
mask = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
th = 1
imask = mask > th
canvas = np.zeros_like(img2, np.uint8)
canvas[imask] = img2[imask]
And then this for normal subtraction:
def extract_diff(self, imageA, imageB, image_name, path):
    subtract = imageB.astype(np.float32) - imageA.astype(np.float32)
    mask = cv2.inRange(np.abs(subtract), (30, 30, 30), (255, 255, 255))
    th = 1
    imask = mask > th
    canvas = np.zeros_like(imageA, np.uint8)
    canvas[imask] = imageA[imask]
Thanks!
A difference can be negative or positive.
For some number types, such as uint8 (unsigned 8-bit int), which can't be negative (have no sign), a negative value wraps around and the value would make no sense anymore. Other types can be signed (e.g. floats, signed ints), so a negative value can be represented correctly.
That's why cv.absdiff exists. It always gives you absolute differences, and those are okay to represent in an unsigned type.
Example with numbers: a = 4, b = 6. a-b should be -2, right?
That value, as an uint8, will wrap around to become 0xFE, or 254 in decimal. The 254 value has some relation to the true -2 difference, but it also incorporates the range of values of the data type (8 bits: 256 values), so it's really just "code".
cv.absdiff would give you the absolute value of the difference (-2), which is 2.
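A small numpy/OpenCV illustration of that wrap-around (a sketch using single-element arrays):

import numpy as np
import cv2

a = np.array([4], dtype=np.uint8)
b = np.array([6], dtype=np.uint8)

print(a - b)              # [254] -> the -2 wrapped around into the uint8 range
print(cv2.absdiff(a, b))  # [[2]] -> the true absolute difference, safe to store as uint8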

Extract text information from PDF files with different layouts - machine learning

I need assistance with an ML project I am currently trying to create.
I receive a lot of invoices from many different suppliers, all in their own unique layouts. I need to extract 3 key elements from each invoice. These 3 elements are all located in the table / line items of every invoice.
The 3 elements are:
1: Tariff number (digit)
2: Quantity (always a digit)
3: Total line amount (monetary value)
Please refer to the screenshot below, where I have marked these fields on a sample invoice.
I started this project with a template approach based on regular expressions. This, however, was not scalable at all, and I ended up with tons of different rules.
I am hoping that machine learning can help me here - or maybe a hybrid solution?
The common denominator
In all of my invoices, despite the different layouts, each line item will always contain one tariff number. This tariff number is always 8 digits, and is always formatted in one of the ways below (a possible regex for these formats is sketched after the list):
xxxxxxxx
xxxx.xxxx
xx.xx.xx.xx
(Where "x" is a digit from 0 - 9).
Further, as you can see on the invoice, there are both a Unit Price and a Total Amount per line. The amount I need is always the higher of the two for each line.
The output
For each invoice like the one above, I need the output for each line. This could for example be something like this:
{
    "line": "0",
    "tariff": "85444290",
    "quantity": "3",
    "amount": "258.93"
},
{
    "line": "1",
    "tariff": "85444290",
    "quantity": "4",
    "amount": "548.32"
},
{
    "line": "2",
    "tariff": "76109090",
    "quantity": "5",
    "amount": "412.30"
}
Where to go from here?
I am not sure whether what I am looking to do falls under machine learning and, if so, under which category. Is it computer vision? NLP? Named Entity Recognition?
My initial thought was to:
Convert the invoice to text. (The invoices are all text-based PDFs, so I can use something like pdftotext to get the exact textual values.)
Create custom named entities for quantity, tariff and amount
Export the found entities.
However, I feel like I might be missing something.
Can anyone assist me in the right direction?
Edit:
Please see below for a few more examples of how an invoice table section can look:
Sample invoice #2
Sample invoice #3
Edit 2:
Please see below for the three sample images, without the borders/bounding boxes:
Image 1:
Image 2:
Image 3:
Here's an attempt using OpenCV; the idea is:
Obtain binary image. We load the image, enlarge it using imutils.resize to help obtain better OCR results (see Tesseract improve quality), convert to grayscale, then apply Otsu's threshold to obtain a binary (1-channel) image.
Remove table grid lines. We create horizontal and vertical kernels and perform morphological operations to remove the grid lines, then morph close to combine adjacent text contours into a single contour. The idea is to extract each row ROI as one piece to OCR.
Extract row ROIs. We find contours, then sort from top to bottom using imutils.contours.sort_contours. This ensures that we iterate through each row in the correct order. From here we iterate through the contours, extract the row ROI using Numpy slicing, OCR it using Pytesseract, then parse the data.
Here's the visualization of each step:
Input image
Binary image
Morph close
Visualization of iterating through each row
Extracted row ROIs
Output invoice data result:
{'line': '0', 'tariff': '85444290', 'quantity': '3', 'amount': '258.93'}
{'line': '1', 'tariff': '85444290', 'quantity': '4', 'amount': '548.32'}
{'line': '2', 'tariff': '76109090', 'quantity': '5', 'amount': '412.30'}
Unfortunately, I get mixed results when trying it on the 2nd and 3rd images. This method does not produce great results on the other images since the layouts of the invoices are all different. However, this approach shows that it's possible to use traditional image processing techniques to extract the invoice information, under the assumption that you have a fixed invoice layout.
Code
import cv2
import numpy as np
import pytesseract
from imutils import contours
import imutils

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, enlarge, convert to grayscale, Otsu's threshold
image = cv2.imread('1.png')
image = imutils.resize(image, width=1000)
height, width = image.shape[:2]
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, 0, -1)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,50))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, 0, -1)

# Morph close to combine adjacent contours into a single contour
invoice_data = []
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (85,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)

# Find contours, sort from top-to-bottom
# Iterate through contours, extract row ROI, OCR, and parse data
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
(cnts, _) = contours.sort_contours(cnts, method="top-to-bottom")

row = 0
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    ROI = image[y:y+h, 0:width]
    ROI = cv2.GaussianBlur(ROI, (3,3), 0)
    data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
    parsed = [word.lower() for word in data.split()]
    if 'tariff' in parsed or 'number' in parsed:
        row_data = {}
        row_data['line'] = str(row)
        row_data['tariff'] = parsed[-1]
        row_data['quantity'] = parsed[2]
        row_data['amount'] = str(max(parsed[10], parsed[11]))
        row += 1

        print(row_data)
        invoice_data.append(row_data)

    # Visualize row extraction
    '''
    mask = np.zeros(image.shape, dtype=np.uint8)
    cv2.rectangle(mask, (0, y), (width, y + h), (255,255,255), -1)
    display_row = cv2.bitwise_and(image, mask)

    cv2.imshow('ROI', ROI)
    cv2.imshow('display_row', display_row)
    cv2.waitKey(1000)
    '''

print(invoice_data)
cv2.imshow('thresh', thresh)
cv2.imshow('close', close)
cv2.waitKey()
I'm working on a similar problem in the logistics industry, and trust me when I say these document tables come in myriad layouts. Numerous companies that have somewhat solved, and are improving on, this problem are listed below:
Leaders: ABBYY, AntWorks, Kofax, and WorkFusion
Major Contenders: Automation Anywhere, Celaton, Datamatics, EdgeVerve, Extract Systems, Hyland, Hyperscience, Infrrd, and Parascript
Aspirants: Ikarus, Rossum, Shipmnts(Alex), Amazon(Textract), Docsumo, Docparser, Aidock
The category I would like to put this problem under is multi-modal learning, because both the textual and image modalities contribute a good deal to this problem. Though OCR tokens play a vital role in attribute-value classification, their position on the page, spacing and inter-character distances are very important features in detecting table, row and column boundaries. The problem gets all the more interesting when rows break across pages, or when some columns carry non-empty values.
The academic world and conferences generally use the term Intelligent Document Processing for extracting both singular fields and tabular data. In the research literature, the former is better known as attribute-value classification and the latter as table extraction or repeated-structure extraction.
In our foray into processing these semi-structured documents over the last 3 years, I feel that achieving both accuracy and scalability is a long and arduous journey. The solutions that offer scalability / a 'template free' approach have an annotated corpus of semi-structured business documents in the order of tens of thousands, if not millions. Though this approach is a scalable solution, it's only as good as the documents it has been trained on. If your documents hail from the logistics or insurance sector, which are known for their complex layouts, and need to be super-accurate owing to compliance procedures, a 'template-based' solution would be the panacea to your ills. It is guaranteed to give more accuracy.
If you need links to existing research, do mention it in the comments below and I'd be happy to share them.
Also, I would recommend using pdfparser over pdf2text or pdfminer, because the former gives character-level information in digital files at significantly better performance.
Would be happy to incorporate any feedback, as this is my first answer here.

Using Keras ImageDataGenerator in a regression model

I want to use the flow_from_directory method of the ImageDataGenerator
to generate training data for a regression model, where the target value can be any float value between 1 and -1. flow_from_directory has a "class_mode" parameter with the description
class_mode: one of "categorical", "binary", "sparse" or None. Default:
"categorical". Determines the type of label arrays that are returned:
"categorical" will be 2D one-hot encoded labels, "binary" will be 1D
binary labels, "sparse" will be 1D integer labels.
Which of these values should I take? None of them seems to really fit...
With Keras 2.2.4 you can use flow_from_dataframe, which solves what you want to do by allowing you to flow images from a directory for regression problems. You should store all your images in a folder, load a dataframe containing the image IDs in one column and the regression scores (labels) in another, and set class_mode='other' in flow_from_dataframe.
Here is an example where the images are in image_dir, and the dataframe with the image IDs and the regression scores is loaded with pandas from the "train file":
train_label_df = pd.read_csv(train_file, delimiter=' ', header=None, names=['id', 'score'])

train_datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True,
                                   fill_mode="nearest", zoom_range=0.2,
                                   width_shift_range=0.2, height_shift_range=0.2,
                                   rotation_range=30)

train_generator = train_datagen.flow_from_dataframe(dataframe=train_label_df, directory=image_dir,
                                                    x_col="id", y_col="score", has_ext=True,
                                                    class_mode="other", target_size=(img_width, img_height),
                                                    batch_size=bs)
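From there, the generator can be passed to the model as usual. A minimal sketch, assuming model is an already-compiled Keras regression model (e.g. a single linear output unit with an mse loss):

model.fit_generator(train_generator,
                    steps_per_epoch=len(train_label_df) // bs,
                    epochs=10)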
I think that organizing your data differently, using a DataFrame (without necessarily moving your images to new locations) will allow you to run a regression model. In short, create columns in your DataFrame containing the file path of each image and the target value. This allows your generator to keep regression values and images properly synced even when you shuffle your data at each epoch.
Here is an example showing how to link images with binomial targets, multinomial targets and regression targets just to show that "a target is a target is a target" and only the model might change:
df['path'] = df.object_id.apply(file_path_from_db_id)
df
object_id bi multi path target
index
0 461756 dog white /path/to/imgs/756/61/blah_461756.png 0.166831
1 1161756 cat black /path/to/imgs/756/61/blah_1161756.png 0.058793
2 3303651 dog white /path/to/imgs/651/03/blah_3303651.png 0.582970
3 3367756 dog grey /path/to/imgs/756/67/blah_3367756.png -0.421429
4 3767756 dog grey /path/to/imgs/756/67/blah_3767756.png -0.706608
5 5467756 cat black /path/to/imgs/756/67/blah_5467756.png -0.415115
6 5561756 dog white /path/to/imgs/756/61/blah_5561756.png -0.631041
7 31255756 cat grey /path/to/imgs/756/55/blah_31255756.png -0.148226
8 35903651 cat black /path/to/imgs/651/03/blah_35903651.png -0.785671
9 44603651 dog black /path/to/imgs/651/03/blah_44603651.png -0.538359
10 49557622 cat black /path/to/imgs/622/57/blah_49557622.png -0.295279
11 58164756 dog grey /path/to/imgs/756/64/blah_58164756.png 0.407096
12 95403651 cat white /path/to/imgs/651/03/blah_95403651.png 0.790274
13 95555756 dog grey /path/to/imgs/756/55/blah_95555756.png 0.060669
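For completeness, one possible way to feed such a DataFrame to a generator is sketched below (assuming a keras-preprocessing version of flow_from_dataframe that accepts absolute paths in x_col; older versions need a directory argument instead, and the target_size/batch_size values here are placeholders):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255)
regression_generator = datagen.flow_from_dataframe(dataframe=df,
                                                   x_col='path', y_col='target',
                                                   class_mode='other',
                                                   target_size=(224, 224),
                                                   batch_size=32)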
I describe how to do this in great detail with examples here:
https://techblog.appnexus.com/a-keras-multithreaded-dataframe-generator-for-millions-of-image-files-84d3027f6f43
At this moment (newest version of Keras from January 21st 2017) the flow_from_directory could only work in the following manner:
You need to have directories structured in the following manner:
directory with images\
    1st label\
        1st picture from 1st label
        2nd picture from 1st label
        3rd picture from 1st label
        ...
    2nd label\
        1st picture from 2nd label
        2nd picture from 2nd label
        3rd picture from 2nd label
        ...
    ...
flow_from_directory returns batches of a fixed size in the format (picture, label).
So, as you can see, it can only be used for the classification case, and all the options provided in the documentation specify only the way in which the class is provided to your classifier. But there is a neat hack which can make flow_from_directory useful for a regression task:
You need to structure your directory in the following manner:
directory with images\
    1st value (e.g. -0.95423)\
        1st picture from 1st value
        2nd picture from 1st value
        3rd picture from 1st value
        ...
    2nd value (e.g. -0.9143242)\
        1st picture from 2nd value
        2nd picture from 2nd value
        3rd picture from 2nd value
        ...
    ...
You also need to have a list list_of_values = [1st value, 2nd value, ...]. Then your generator is defined in the following manner:
def regression_flow_from_directory(flow_from_directory_gen, list_of_values):
    for x, y in flow_from_directory_gen:
        yield x, list_of_values[y]
It's crucial for flow_from_directory_gen to have class_mode='sparse' to make this work. Of course, this is a little bit cumbersome, but it works (I used this solution :) ).
There's just one glitch in the accepted answer that I would like to point out. The above code fails with an error message like:
TypeError: only integer scalar arrays can be converted to a scalar index
This is because y is an array. The fix is simple:
def regression_flow_from_directory(flow_from_directory_gen, list_of_values):
    for x, y in flow_from_directory_gen:
        values = [list_of_values[y[i]] for i in range(len(y))]
        yield x, values
The method to generate the list_of_values can be found in https://stackoverflow.com/a/47944082/4082092
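Put together, a sketch of how the corrected wrapper might be wired up (folder names, image size and rescaling are illustrative):

from keras.preprocessing.image import ImageDataGenerator

base_gen = ImageDataGenerator(rescale=1./255).flow_from_directory(
    'directory_with_images', target_size=(224, 224), class_mode='sparse')

# class_indices maps folder name -> integer label, so order list_of_values accordingly
list_of_values = [0.0] * len(base_gen.class_indices)
for folder_name, idx in base_gen.class_indices.items():
    list_of_values[idx] = float(folder_name)

regression_gen = regression_flow_from_directory(base_gen, list_of_values)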

Image filtering - wrong results?

I'm experimenting with convolving an image with a user-supplied mask, in this case
u = array([[-2,-2,-2],[-2,25,-2],[-2,-2,-2]])/9
using the commands
In[1]: import scipy.ndimage as ndi
In[2]: import skimage.io as io
In[3]: c = io.imread('cameraman.png')
In[4]: cu = ndi.convolve(c,u)
In[5]: io.imshow(cu)
I'm checking this against commands in GNU Octave:
Octave-3.8: 1> c = imread('cameraman.png');
Octave-3.8: 2> u = [-2 -2 -2;-2 25 -2;-2 -2 -2]/9
Octave-3.8: 3> cu = imfilter(c,u)
Octave-3.8: 4> imshow(cu)
But here's the thing: Octave seems to give the correct result, but Python doesn't, even though the commands convolve and imfilter are supposed to be implementing the same algorithm. (Well in fact imfilter performs a correlation, which in this case is the same as a convolution.)
The Octave output is: (image omitted)
and the Python output is: (image omitted)
which, as you can see, is very different from the Octave result. Does anybody know what's going on here? Or is there a better way of convolving with a user-supplied linear filter than using convolve?
The problem may be the result of your convolution taking your image luminance values out of bounds. I ran the example below in Matlab (~= Octave): for an image that initially has grey values 0-255, so in the normalised range [0, 1], the result ends up with pixels in the range [-0.88, 2.03].
>> img=double(imread('cameraman.tif'))./255;
>> K=[-2 -2 -2 ; -2 25 -2; -2 -2 -2]/9;
>> out=conv2(img,K,'same');
>> max(max(out))
ans =
2.0288
>> min(min(out))
ans =
-0.8776
It could be that Python has a problem visualising images with out-of-range grey values (< 0 or > 255), and this is causing a clamping of values, resulting in the black/white halos in those areas. Perhaps Octave normalises or saturates the image prior to displaying it, resulting in fewer artifacts. If you normalise your image in Python prior to displaying it, do you still have this problem?
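One way to check this is to do the convolution in floating point and clamp (or rescale) the result yourself before displaying it. A sketch:

import numpy as np
import scipy.ndimage as ndi
import skimage.io as io

c = io.imread('cameraman.png').astype(np.float64) / 255.0
u = np.array([[-2, -2, -2], [-2, 25, -2], [-2, -2, -2]]) / 9.0
cu = ndi.convolve(c, u)

print(cu.min(), cu.max())          # values outside [0, 1] confirm the out-of-bounds issue
io.imshow(np.clip(cu, 0.0, 1.0))   # clamp to the displayable range before showing
io.show()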

Out of memory, plotting 24 images in 1 plot

Hi, I want to plot 24 images in 1 figure using subplot.
I've already made the empty plots using this method:
# Import everything from matplotlib (numpy is accessible via 'np' alias)
from pylab import *

# create new figure of A3 size
figure(figsize=(16.5, 11.7), dpi=300)

# do plotting for 24 figs in 1 plot
for i in range(1, 25):
    # print i
    subplot(4, 6, i)
Now I want to fill my subplots with the same data in every plot (a background to plot against) as a line plot.
I do this using the following line:
plot(myData)
Once I run the program, it crashes, telling me:
"_tkinter.TclError: not enough free memory for image buffer"
So after searching the web I read that I need to close the plots after I make them so that the memory can be reused.
However, how do I do this when using subplots?
Frank
Edit:
I think it would be easily solved if I could get 2 lists: one with each unique item in myData, and a second with the number of occurrences of that unique item. Anyone got tips on that?
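For the edit: numpy can produce both lists in one call (a sketch, assuming myData is a flat array or list):

import numpy as np

unique_items, counts = np.unique(myData, return_counts=True)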
