Extract text information from PDF files with different layouts - machine learning

I need assistance with an ML project I am currently working on.
I receive a lot of invoices from many different suppliers, each with its own unique layout, and I need to extract 3 key elements from each invoice. These 3 elements are all located in the table / line items of every invoice.
The 3 elements are:
1: Tariff number (digit)
2: Quantity (always a digit)
3: Total line amount (monetary value)
Please refer to the screenshot below, where I have marked these fields on a sample invoice.
I started this project with a template approach based on regular expressions. This, however, was not scalable at all, and I ended up with tons of different rules.
I am hoping that machine learning can help me here - or maybe a hybrid solution?
The common denominator
In all of my invoices, despite the different layouts, each line item will always contain one tariff number. This tariff number is always 8 digits, and is always formatted in one of the following ways (a small regex sketch covering these formats follows the list):
xxxxxxxx
xxxx.xxxx
xx.xx.xx.xx
(Where "x" is a digit from 0 - 9).
Further, as you can see on the invoice, there are both a Unit Price and a Total Amount per line. The amount I need is always the higher of the two for each line.
The output
For each invoice like the one above, I need the output for each line. This could for example be something like this:
{
"line":"0",
"tariff":"85444290",
"quantity":"3",
"amount":"258.93"
},
{
"line":"1",
"tariff":"85444290",
"quantity":"4",
"amount":"548.32"
},
{
"line":"2",
"tariff":"76109090",
"quantity":"5",
"amount":"412.30"
}
Where to go from here?
I am not sure whether what I am looking to do falls under machine learning and, if so, under which category. Is it computer vision? NLP? Named Entity Recognition?
My initial thought was to:
Convert the invoice to text. (The invoices are all text-based PDFs, so I can use something like pdftotext to get the exact textual values.)
Create custom named entities for quantity, tariff and amount
Export the found entities. (A rough sketch of these steps is shown below.)
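A rough sketch of what I have in mind, assuming pdftotext (from poppler-utils) is installed and reusing the tariff pattern from above; the amount regex and the extract_line_items helper are just placeholders, and quantity is harder to pin down with a plain regex, which is why I'm looking at NER:
import re
import subprocess

TARIFF_RE = re.compile(r"\b(?:\d{8}|\d{4}\.\d{4}|\d{2}\.\d{2}\.\d{2}\.\d{2})\b")
AMOUNT_RE = re.compile(r"\d+(?:[.,]\d{2})\b")

def extract_line_items(pdf_path):
    # -layout keeps the column alignment of the original PDF
    text = subprocess.run(["pdftotext", "-layout", pdf_path, "-"],
                          capture_output=True, text=True, check=True).stdout
    items = []
    for raw_line in text.splitlines():
        tariff = TARIFF_RE.search(raw_line)
        if not tariff:
            continue  # only keep lines that contain a tariff number
        amounts = [float(a.replace(",", ".")) for a in AMOUNT_RE.findall(raw_line)]
        items.append({"line": str(len(items)),
                      # normalize dotted tariffs to the 8-digit form
                      "tariff": tariff.group().replace(".", ""),
                      "amount": "%.2f" % max(amounts) if amounts else None})
    return items

print(extract_line_items("invoice.pdf"))   # hypothetical path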
However, I feel like I might be missing something.
Can anyone point me in the right direction?
Edit:
Please see below for a few more examples of what an invoice table section can look like:
Sample invoice #2
Sample invoice #3
Edit 2:
Please see below for the three sample images, without the borders/bounding boxes:
Image 1:
Image 2:
Image 3:

Here's an attempt using OpenCV. The idea is:
Obtain binary image. We load the image, enlarge it using
imutils.resize to help obtain better OCR results (see Tesseract improve quality), convert to grayscale, then apply Otsu's threshold to obtain a binary (1-channel) image.
Remove table grid lines. We create horizontal and vertical kernels and perform morphological operations to detect and remove the grid lines, then morph close to combine adjacent text contours into a single contour. The idea is to extract each row as one ROI to OCR.
Extract row ROIs. We find contours, then sort them from top to bottom using imutils.contours.sort_contours. This ensures that we iterate through each row in the correct order. From here we iterate through the contours, extract the row ROI using NumPy slicing, OCR it with Pytesseract, then parse the data.
Here's the visualization of each step:
Input image
Binary image
Morph close
Visualization of iterating through each row
Extracted row ROIs
Output invoice data result:
{'line': '0', 'tariff': '85444290', 'quantity': '3', 'amount': '258.93'}
{'line': '1', 'tariff': '85444290', 'quantity': '4', 'amount': '548.32'}
{'line': '2', 'tariff': '76109090', 'quantity': '5', 'amount': '412.30'}
Unfortunately, I get mixed results when trying this on the 2nd and 3rd images. This method does not produce great results on the other images since the layouts of the invoices are all different. However, this approach shows that it's possible to use traditional image processing techniques to extract the invoice information, with the assumption that you have a fixed invoice layout.
Code
import cv2
import numpy as np
import pytesseract
from imutils import contours
import imutils

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, enlarge, convert to grayscale, Otsu's threshold
image = cv2.imread('1.png')
image = imutils.resize(image, width=1000)
height, width = image.shape[:2]
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, 0, -1)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,50))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, 0, -1)

# Morph close to combine adjacent contours into a single contour
invoice_data = []
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (85,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)

# Find contours, sort from top-to-bottom
# Iterate through contours, extract row ROI, OCR, and parse data
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
(cnts, _) = contours.sort_contours(cnts, method="top-to-bottom")

row = 0
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    ROI = image[y:y+h, 0:width]
    ROI = cv2.GaussianBlur(ROI, (3,3), 0)
    data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
    parsed = [word.lower() for word in data.split()]
    if 'tariff' in parsed or 'number' in parsed:
        row_data = {}
        row_data['line'] = str(row)
        row_data['tariff'] = parsed[-1]
        row_data['quantity'] = parsed[2]
        # Take the numerically larger of the two candidate amounts
        # (a plain string max would compare lexicographically)
        row_data['amount'] = max(parsed[10], parsed[11], key=float)
        row += 1

        print(row_data)
        invoice_data.append(row_data)

    # Visualize row extraction
    '''
    mask = np.zeros(image.shape, dtype=np.uint8)
    cv2.rectangle(mask, (0, y), (width, y + h), (255,255,255), -1)
    display_row = cv2.bitwise_and(image, mask)

    cv2.imshow('ROI', ROI)
    cv2.imshow('display_row', display_row)
    cv2.waitKey(1000)
    '''

print(invoice_data)
cv2.imshow('thresh', thresh)
cv2.imshow('close', close)
cv2.waitKey()

I'm working on a similar problem in the logistics industry, and trust me when I say these document tables come in myriad layouts. Numerous companies that have somewhat solved, and are improving on, this problem are listed below:
Leaders: ABBYY, AntWorks, Kofax, and WorkFusion
Major Contenders: Automation Anywhere, Celaton, Datamatics, EdgeVerve, Extract Systems, Hyland, Hyperscience, Infrrd, and Parascript
Aspirants: Ikarus, Rossum, Shipmnts(Alex), Amazon(Textract), Docsumo, Docparser, Aidock
The category I would put this problem under is multi-modal learning, because both the textual and image modalities contribute a good deal to this problem. Though OCR tokens play a vital role in attribute-value classification, their position on the page, the spacing and the inter-character distances are very important features for detecting table, row and column boundaries. The problem gets all the more interesting when rows break across pages, or some columns carry non-empty values.
While the academic world and conferences use the term Intelligent Document Processing in general for extracting both singular fields and tabular data, in the research literature the former is better known as attribute-value classification and the latter as table extraction or repeated-structure extraction.
In our foray into processing these semi-structured documents over the past 3 years, I feel that achieving both accuracy and scalability is a long and arduous journey. The solutions that offer scalability / a 'template free' approach do have an annotated corpus of semi-structured business documents in the order of tens of thousands, if not millions. Though this approach is a scalable solution, it's only as good as the documents it has been trained on. If your documents hail from the logistics or insurance sector, which are known for their complex layouts, and need to be super-accurate owing to compliance procedures, a 'template-based' solution would be the panacea for your ills. It is guaranteed to give more accuracy.
If you need links to existing research, do mention it in the comments below and I'd be happy to share them.
Also, I would recommend using pdfparser over pdftotext or pdfminer, because the former gives character-level information in digital files at significantly better performance.
Would be happy to incorporate any feedback, as this is my first answer here.

Related

How is Spark reading my image using the image format?

It might be a silly question, but I can't figure out how Spark reads my image using the spark.read.format("image").load(....) argument.
After importing my image, I get the following:
>>> image_df.select("image.height","image.width","image.nChannels", "image.mode", "image.data").show()
+------+-----+---------+----+--------------------+
|height|width|nChannels|mode| data|
+------+-----+---------+----+--------------------+
| 430| 470| 3| 16|[4D 55 4E 4C 54 4...|
+------+-----+---------+----+--------------------+
I arrive at the conclusion that:
my image is 430x470 pixels,
my image is colored (RGB due to nChannels = 3), which is an openCV-compatible type,
my image mode is 16, which corresponds to a particular openCV byte-order.
Does someone know which website/documentation I could browse to learn more about this?
The data in the data column is of type Binary, but:
when I run image_df.select("image.data").take(1) I get an output which seems to be only one array (see below).
>>> image_df.select("image.data").take(1)
# **1/** Here are the last elements of the result
....<<One Eternity Later>>....x92\x89\x8a\x8d\x84\x86\x89\x80\x84\x87~'))]
# 2/ I got also several part of the result which looks like:
.....\x89\x80\x80\x83z|\x7fvz}tpsjqtkrulsvmsvmsvmrulrulrulqtkpsjnqhnqhmpgmpgmpgnqhnqhn
qhnqhnqhnqhnqhnqhmpgmpgmpgmpgmpgmpgmpgmpgnqhnqhnqhnqhnqhnqhnqhnqhknejmdilcilchkbh
kbilcilckneloflofmpgnqhorioripsjsvmsvmtwnvypx{ry|sz}t{~ux{ry|sy|sy|sy|sz}tz}tz}tz}
ty|sy|sy|sy|sz}t{~u|\x7fv|\x7fv}.....
What comes next is linked to the results displayed above. These questions might be due to my lack of knowledge of openCV (or something else). Nonetheless:
1/ I don't understand why, if I have an RGB image, I don't get 3 matrices: the output finishes with .......\x84\x87~'))]. I was expecting something more like [(...),(...),(...\x87~')].
2/ Does this part have a special meaning? Like a separator between each matrix or something?
To be clearer about what I'm trying to achieve, I want to process images to do pixel comparisons between images. Therefore, I want to know the pixel values for a given position in my image (I assume that if I have an RGB image, I shall have 3 pixel values for a given position).
Example: let's say that I have a webcam pointing at the sky only during the day, and I want to know the values of a pixel at a position corresponding to the top-left sky part. I found out that the concatenation of those values gives the colour Light Blue, which says that the photo was taken on a sunny day. Let's say that the only possibility is that a sunny day takes the colour Light Blue.
Next I want to compare the previous concatenation with another concatenation of pixel values at the exact same position, but from a picture taken the next day. If I find out that they are not equal, then I conclude that the given picture was taken on a cloudy/rainy day. If equal, then a sunny day.
Any help on that would be highly appreciated. I have simplified my example for better understanding, but my goal is pretty much the same. I know that an ML model could exist to achieve this, but I would be happy to try this first. My first goal is to split this column into 3 columns corresponding to each color code: a red matrix, a green matrix, a blue matrix.
I think I have the logic. I used the keras.preprocessing.image.img_to_array() function to understand how the values are ordered (since I have an RGB image, I must have 3 matrices: one for each color R G B). Posting this in case someone wonders how it works; I might be wrong, but I think I have something:
from keras.preprocessing import image
import numpy as np
from PIL import Image
# Using spark built-in data source
first_img = spark.read.format("image").schema(imageSchema).load(".....")
raw = first_img.select("image.data").take(1)[0][0]
np.shape(raw)
(606300,) # which is 470*430*3
# Using keras function
img = image.load_img(".../path/to/img")
yy = image.img_to_array(img)
>>> np.shape(yy)
(430, 470, 3) # the form is good but I have a problem of order since:
>>> raw[0], raw[1], raw[2]
(77, 85, 78)
>>> yy[0][0]
array([78., 85., 77.], dtype=float32)
# Therefore I used the numpy reshape function directly on raw
# to get 430 matrices of 470 lines and 3 columns:
array = np.reshape(raw, (430,470,3))
xx = image.img_to_array(array) # OPTIONAL and not used here
>>> array[0][0] == (raw[0],raw[1],raw[2])
array([ True, True, True])
>>> array[0][1] == (raw[3],raw[4],raw[5])
array([ True, True, True])
>>> array[0][2] == (raw[6],raw[7],raw[8])
array([ True, True, True])
>>> array[0][3] == (raw[9],raw[10],raw[11])
array([ True, True, True])
So, if I understood well, Spark reads the image as one big array - (606300,) here - where each element is ordered and corresponds to its respective color channel (in OpenCV's BGR order, which is why the values appear reversed compared with the Keras output above).
After doing my little transformations, I obtain 430 matrices of 470 lines x 3 columns. Since my image is 470x430 (Width x Height), each matrix corresponds to a pixel height position, and inside each: 3 columns for each color and 470 lines for each width position.
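As a small follow-up sketch for the original goal of getting one matrix per colour (reusing raw from above, and keeping in mind the BGR order):
import numpy as np

pixels = np.reshape(raw, (430, 470, 3))   # (height, width, channels)
blue  = pixels[:, :, 0]                   # each of shape (430, 470)
green = pixels[:, :, 1]
red   = pixels[:, :, 2]
print(blue.shape, green.shape, red.shape)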
Hope that helps someone :)!

Extracting Semi-structured Text from a complex UI (Golf Simulator)

I'm fairly new to the world of OCR, OpenCV, Tesseract etc. and was hoping to get some advice or a nudge in the right direction for a project I'm working on. For context, I practice golf at an indoor simulator that is powered by Full Swing Golf. My goal is to build an app (preferably iPhone, but desktop is fine too) that will be able to grab the data provided by the simulator and process it however I'd like. The overall workflow would look something like:
Set up iPhone or laptop camera to watch the simulator screen.
Hit ball
Statistics Screen is displayed that looks more or less like:
Detect that the Statistics Screen has been displayed and grab all relevant data:
| Distance | Launch | Back Spin | Club Speed | Carry | To Pin | Direction | Ball Speed | Side Spin | Club Face | Club Path |
|----------|--------|-----------|------------|-------|--------|-----------|------------|-----------|-----------|-----------|
| 345 | 13 | 3350 | 135 | 335 | 80 | 2.4 | 190 | 350 | 4.3 | 1.6 |
5-?: Save the data to my app, keep track of it over time etc...
Attempts So Far:
It seemed like OpenCV's matchTemplate would be a simple way to find all of the headings in the image (Distance, Launch etc...), and it does seem to work when the image and template are both at the perfect resolution. However, as this will be an iPhone app, the quality is not something I can really guarantee (within reason). Moreover, the screen will almost never be straight-on as it appears above. Most likely, the camera will be off to the side and we will have to de-skew accordingly. I've attempted to use the following image to work on my deskewing logic, to no avail:
Finding the reference points in order to deskew via getPerspectiveTransform and warpPerspective has proven to be incredibly difficult due to the above issues with matching templates.
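For what it's worth, the deskew step itself is mechanically simple once four reference corners are known; it's finding those corners reliably that I'm stuck on. A minimal sketch with made-up corner coordinates and filenames:
import cv2
import numpy as np

img = cv2.imread("simulator_photo.jpg")   # hypothetical filename
# Corners of the statistics screen in the photo (TL, TR, BR, BL) - made up here
src = np.float32([[112, 80], [1180, 95], [1195, 620], [98, 640]])
dst_w, dst_h = 1280, 720
dst = np.float32([[0, 0], [dst_w, 0], [dst_w, dst_h], [0, dst_h]])

M = cv2.getPerspectiveTransform(src, dst)
deskewed = cv2.warpPerspective(img, M, (dst_w, dst_h))
cv2.imwrite("deskewed.jpg", deskewed)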
I've also tried dynamically adjusting for scale with code resembling the following:
def findTemplateLocation(image_path):
    # Note: image_gray (the grayscale scene image) is defined elsewhere in the original script
    template = cv2.imread(image_path)
    template = cv2.cvtColor(template, cv2.COLOR_BGR2GRAY)
    w, h = template.shape[::-1]
    threshold = 0.65
    loc = []
    for scale in np.linspace(0.1, 2, 20)[::-1]:
        resized = imutils.resize(template, width=int(template.shape[1] * scale))
        w, h = resized.shape[::-1]
        res = cv2.matchTemplate(image_gray, resized, cv2.TM_CCOEFF_NORMED)
        loc = np.where(res >= threshold)
        if len(list(zip(*loc[::-1]))) > 0:
            break
    if loc and len(list(zip(*loc[::-1]))) > 0:
        adjusted_w = int(w/scale)
        adjusted_h = int(h/scale)
        print(str(adjusted_w) + " " + str(adjusted_h) + " " + str(scale))
        ret = []
        for pt in zip(*loc[::-1]):
            ret.append({'width': w, 'height': h, 'location': pt})
        return ret
    return None
This still returns a ton of false positives.
I'm hoping to get some advice on how to approach this problem with a clean slate. I'm open to any language / workflow.
If it does seem that I'm on the right track, my current code is at https://gist.github.com/naderhen/9ec8d45f13d92507131d5bce0e84fad8 . Would really appreciate any suggestions for the best next steps.
Thanks for any help you can provide!
EDIT: Additional Resources
I've uploaded a number of videos and still photos from my time at the indoor simulator this weekend: https://www.dropbox.com/sh/5vub2mi4rvunyaw/AAAY1_7Q_WBV4JvmDD0dEiTDa?dl=0
I tried to get a number of different angles, with different lighting etc. Please let me know if I can provide any other resources that may help.
So, I tried two different methods:
Contour detection - This seemed to be the most obvious method, since the statistics screen is the primary part of the image and is present in all your images. Although it does work with two of the three images, it might not be very robust with respect to the parameters. Here are the steps that I tried for contours:
First, get the image in grayscale or take the Value channel in HSV. Then, threshold the image using either Otsu or Adaptive Thresholding. After playing with a lot of the associated parameters, I got satisfactory results, which basically means a nice whole statistics screen in white on a black background. After this, sort the contours like this:
contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[1]
# Sort the contours to avoid unnecessary comparison in the for loop below
cntsSorted = sorted(contours, key=lambda x: cv2.contourArea(x), reverse=True)
for cnt in cntsSorted[0:20]:
    peri = cv2.arcLength(cnt, True)
    approx = cv2.approxPolyDP(cnt, 0.04 * peri, True)
    if len(approx) == 4 and peri > 10000:
        cv2.drawContours(sorted_image, cnt, -1, (0, 255, 0), 10)
Feature Detection and Matching: Since using contours wasn't robust enough, I tried another method which I've worked on for a similar problem to yours. This method is fairly robust and much faster (I tried this on an Android phone 2 years back and it would do the job in less than a second for a 1280 x 760 image). However, after trying it on your work cases, I figured that your images are pretty varied. What I mean by that is, you have two images in your question that have fairly similar primaries and it works for those, but the images you posted in the comments are very different from these, and hence it doesn't find a suitable number of good matches (at least 10 in my case). If you can post a nice set of images that you actually will encounter, I will update this answer with my results on the new set. More importantly, the images of the scene evidently have a change in perspective, which shouldn't be an issue assuming you are able to get a very good source image (as the first one in your question). However, the change in lighting conditions can be a pain. I'd suggest using different color spaces such as HSV, Lab and Luv instead of BGR.
Here is where you can find a working example of how to implement your own feature matcher. There are some code changes required depending on the version of OpenCV you are using but I am sure you can find the solutions ( I did ;) ).
A good example:
Some suggestions:
Try getting as clean an image as possible for the image you are using to match with the others (your first image in my case). Hopefully, this would require you to do less processing.
Try using an unsharp mask before finding keypoints.
My results are from using ORB. You can also try with other detectors/descriptors like SURF, SIFT and FAST.
Finally, your approach of template matching should work in cases where there is a change only in the scaling and not the perspective.
Hope this helps! Write a comment if you have any additional questions and/or when you have a good image set ready (rubs palms). Cheers!
Edit 1: This is the code that I used for the Feature Detection and Matching in OpenCV 3.4.3 and Python 3.4
import cv2
import numpy as np

def unsharp_mask(im):
    # This is used to sharpen images
    gaussian_3 = cv2.GaussianBlur(im, (3, 3), 3.0)
    return cv2.addWeighted(im, 2.0, gaussian_3, -1.0, 0, im)

def screen_finder2(image, source, num=0):
    def resize(im, new_width):
        r = float(new_width) / im.shape[1]
        dim = (new_width, int(im.shape[0] * r))
        return cv2.resize(im, dim, interpolation=cv2.INTER_AREA)

    width = 300
    source = resize(source, new_width=width)
    image = resize(image, new_width=width)

    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2LUV)
    image, u, v = cv2.split(hsv)
    hsv = cv2.cvtColor(source, cv2.COLOR_BGR2LUV)
    source, u, v = cv2.split(hsv)

    MIN_MATCH_COUNT = 10
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(image, None)
    kp2, des2 = orb.detectAndCompute(source, None)

    flann = cv2.DescriptorMatcher_create(cv2.DescriptorMatcher_FLANNBASED)
    # Without the below 2 lines, matching doesn't work
    des1 = np.asarray(des1, dtype=np.float32)
    des2 = np.asarray(des2, dtype=np.float32)
    matches = flann.knnMatch(des1, des2, k=2)

    # Store all the good matches as per Lowe's ratio test
    good = []
    for m, n in matches:
        if m.distance < 0.7 * n.distance:
            good.append(m)

    if len(good) >= MIN_MATCH_COUNT:
        src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        M, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
        matchesMask = mask.ravel().tolist()
        h, w = image.shape
        pts = np.float32([[0, 0], [0, h-1], [w-1, h-1], [w-1, 0]]).reshape(-1, 1, 2)
        dst = cv2.perspectiveTransform(pts, M)
        source_bgr = cv2.cvtColor(source, cv2.COLOR_GRAY2BGR)
        img2 = cv2.polylines(source_bgr, [np.int32(dst)], True, (0, 0, 255), 3,
                             cv2.LINE_AA)
        cv2.imwrite("out" + str(num) + ".jpg", img2)
    else:
        print("Not enough matches." + str(len(good)))
        matchesMask = None

    draw_params = dict(matchColor=(0, 255, 0),  # draw matches in green color
                       singlePointColor=None,
                       matchesMask=matchesMask,  # draw only inliers
                       flags=2)
    img3 = cv2.drawMatches(image, kp1, source, kp2, good, None, **draw_params)
    cv2.imwrite("ORB" + str(num) + ".jpg", img3)

match_image = unsharp_mask(cv2.imread("source.jpg"))
image_1 = unsharp_mask(cv2.imread("Screen_1.jpg"))
screen_finder2(match_image, image_1, num=1)

How to correctly implement dropout for convolution in TensorFlow

According to the original paper on Dropout, said regularisation method can be applied to convolution layers, often improving their performance. The TensorFlow function tf.nn.dropout supports this by having a noise_shape parameter to allow the user to choose which parts of the tensor will drop out independently. However, neither the paper nor the documentation gives a clear explanation of which dimensions should be kept independent, and the TensorFlow explanation of how noise_shape works is rather unclear:
only dimensions with noise_shape[i] == shape(x)[i] will make independent decisions.
I would assume that for a typical CNN layer output of the shape [batch_size, height, width, channels] we don't want individual rows or columns to drop out by themselves, but rather whole channels (which would be equivalent to a node in a fully connected NN) independently of the examples (i.e. different channels could be dropped for different examples in a batch). Am I correct in this assumption?
If so, how would one go about implementing dropout with such specificity using the noise_shape parameter? Would it be:
noise_shape=[batch_size, 1, 1, channels]
or:
noise_shape=[1, height, width, 1]
From the tf.nn.dropout documentation:
For example, if shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n], each batch and channel component will be kept independently and each row and column will be kept or not kept together.
The code may help explain this.
noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
# uniform [keep_prob, 1.0 + keep_prob)
random_tensor = keep_prob
random_tensor += random_ops.random_uniform(noise_shape,
                                           seed=seed,
                                           dtype=x.dtype)
# 0. if [keep_prob, 1.0) and 1. if [1.0, 1.0 + keep_prob)
binary_tensor = math_ops.floor(random_tensor)
ret = math_ops.div(x, keep_prob) * binary_tensor
ret.set_shape(x.get_shape())
return ret
The line random_tensor += relies on broadcasting. When noise_shape[i] is set to 1, all elements in that dimension add the same random value ranging from 0 to 1. So with noise_shape=[k, 1, 1, n], each row and column in a feature map will be kept or dropped together, while each example (batch dimension) and each channel receive different random values and are kept or dropped independently.
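So, for the channel-wise dropout described in the question (drop whole channels, independently per example), a minimal sketch in TF 1.x style would be something like the following; the shapes here are arbitrary:
import tensorflow as tf

# Feature map of shape [batch_size, height, width, channels]; the two spatial
# dimensions of noise_shape are 1, so every pixel of a given (example, channel)
# pair shares the same keep/drop decision.
batch_size, height, width, channels = 32, 28, 28, 64
x = tf.random.uniform([batch_size, height, width, channels])
dropped = tf.nn.dropout(x, keep_prob=0.75,
                        noise_shape=[batch_size, 1, 1, channels])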

Scikit-learn PCA .fit_transform shape is inconsistent (n_samples << m_attributes)

I am getting different shapes for my PCA using sklearn. Why isn't my transformation resulting in an array of the same dimensions as the docs say?
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:
X_new : array-like, shape (n_samples, n_components)
Check this out with the iris dataset which is (150, 4) where I'm making 4 PCs:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)

Se_targets = pd.Series(load_iris().target,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index = DF_data.index,
                           columns = DF_data.columns)

# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = m

# PCA (How I tend to set it up)
M_PCA = decomposition.PCA()
A_components = M_PCA.fit_transform(DF_standard)

#DF_standard.shape, A_components.shape
#((150, 4), (150, 4))
But then, when I use the same exact approach on my actual dataset of shape (76, 1989), i.e. 76 samples and 1989 attributes/dimensions, I get a (76, 76) array instead of (76, 1989):
DF_centered = normalize(DF_mydata, method="center", axis=0)
m = DF_centered.shape[1]
# print(m)
# 1989
M_PCA = decomposition.PCA(n_components=m)
A_components = M_PCA.fit_transform(DF_centered)
DF_centered.shape, A_components.shape
# ((76, 1989), (76, 76))
normalize is just a wrapper I made that subtracts the mean from each dimension.
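For completeness, the wrapper is roughly equivalent to this (the actual implementation isn't shown here):
import pandas as pd

def normalize(df, method="center", axis=0):
    # "center": subtract the per-column mean so each dimension has zero mean
    if method == "center":
        return df - df.mean(axis=axis)
    raise ValueError("unsupported method: %s" % method)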
(Note: this answer is adapted from my answer on Cross Validated here: Why are there only n−1 principal components for n data points if the number of dimensions is larger or equal than n?)
PCA (as most typically run) creates a new coordinate system by:
shifting the origin to the centroid of your data,
squeezing and/or stretching the axes to make them equal in length, and
rotating your axes into a new orientation.
(For more details, see this excellent CV thread: Making sense of principal component analysis, eigenvectors & eigenvalues.) However, step 3 rotates your axes in a very specific way. Your new X1 (now called "PC1", i.e., the first principal component) is oriented in your data's direction of maximal variation. The second principal component is oriented in the direction of the next greatest amount of variation that is orthogonal to the first principal component. The remaining principal components are formed likewise.
With this in mind, let's examine a simple example (suggested by @amoeba in a comment). Here is a data matrix with two points in a three-dimensional space:
X = [ 1 1 1
      2 2 2 ]
Let's view these points in a (pseudo) three dimensional scatterplot:
So let's follow the steps listed above. (1) The origin of the new coordinate system will be located at (1.5,1.5,1.5). (2) The axes are already equal. (3) The first principal component will go diagonally from what used to be (0,0,0) to what was originally (3,3,3), which is the direction of greatest variation for these data. Now, the second principal component must be orthogonal to the first, and should go in the direction of the greatest remaining variation. But what direction is that? Is it from (0,0,3) to (3,3,0), or from (0,3,0) to (3,0,3), or something else? There is no remaining variation, so there cannot be any more principal components.
With N=2 data, we can fit (at most) N−1=1 principal components.
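To see this with sklearn, here is a quick sketch with random data of the same shape as yours: PCA can return at most min(n_samples, n_features) = 76 components, and after centering the last of those carries essentially zero variance, in line with the N−1 argument above.
import numpy as np
from sklearn import decomposition

rng = np.random.RandomState(0)
X = rng.randn(76, 1989)                      # same shape as the real dataset

pca = decomposition.PCA(n_components=min(X.shape))
X_new = pca.fit_transform(X)
print(X_new.shape)                           # (76, 76)
print(pca.explained_variance_[-1])           # ~0: only 75 informative PCs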

Blockproc in MATLAB with two output variables

I have the following problem. I have to compute dense SIFT interest points in a very high-resolution image (182 MP). When I run the code on the full image, MATLAB always closes suddenly. So I decided to run the code on image patches.
The code
I tried to use blockproc in MATLAB to call the C++ function that performs the dense SIFT interest point detection this way:
fun = @(block_struct) denseSIFT(block_struct.data, options);
[dsift , infodsift] = blockproc(ndvi,[1000 1000],fun);
where dsift contains the SIFT descriptors (vectors) and infodsift has the information about the interest points, such as the x and y coordinates.
The problem
The problem is that blockproc only allows one output, but I want both outputs. MATLAB gives the following error when I run the code:
Error using blockproc
Too many output arguments.
Is there a way for me doing this?
Would it be a problem for you to "hard code" a version of blockproc?
Assuming for a moment that you can divide your image into NxM smaller images, you could loop around as follows:
bigImage = someFunction();
sz = size(bigImage);
smallSize = sz ./ [N M];
dsift = cell(N,M);
infodsift = cell(N,M);
for ii = 1:N
    for jj = 1:M
        smallImage = bigImage((ii-1)*smallSize(1) + (1:smallSize(1)), (jj-1)*smallSize(2) + (1:smallSize(2)));
        [dsift{ii,jj}, infodsift{ii,jj}] = denseSIFT(smallImage, options);
    end
end
The results will then be in the two cell arrays. No real need to pre-allocate, but it's tidier if you do. If the individual matrices are the same size, you can convert into a single large matrix with
dsiftFull = cell2mat(dsift);
Almost magic. This won't work if your matrices are different sizes - but then, if they are, I'm not sure you would even want to put them all in a single one (unless you decide to horzcat them).
If you do decide you want a list of "all the columns as a giant matrix", then you can do
giantMatrix = [dsift{:}];
This will return a matrix with (in your example) 128 rows, and as many columns as there were "interest points" found. It's shorthand for
giantMatrix = [dsift{1,1} dsift{2,1} dsift{3,1} ... dsift{N,M}];
