Examples/tutorials of Adaptive Metropolis for images using PyMC - image-processing

I am looking for examples or tutorials of the AdaptiveMetropolis step method used for image processing.
The only vaguely image-related resource I have found so far is this astronomy dissertation and the related GitHub repo.
This wider question does not seem to provide PyMC example code.
What about finding the peak on this simulated array?
import numpy as np
from matplotlib import pyplot as plt
sz = (12,18)
data_input = np.random.normal( loc=5.0, size=sz )
data_input[7:10, 2:6] = np.random.normal( loc=100.0, size=(3,4) )
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
im = ax.imshow( data_input )
ax.set_title("input")

The closest I know of is here: http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter5_LossFunctions/LossFunctions.ipynb (see the example "Kaggle contest on Observing Dark Worlds").
In this thread you asked a specific question about finding an array in an image: https://github.com/pymc-devs/pymc/issues/653. Here is a first attempt at a model:
In that case it seems like you are trying to estimate a 2-D uniform distribution with Gaussian noise. You'll have to translate this into an actual model, but this would be one idea:
lower_x ~ DiscreteUniform(0, 20)
upper_x ~ DiscreteUniform(0, 20)
lower_y ~ DiscreteUniform(0, 20)
upper_y ~ DiscreteUniform(0, 20)
height ~ Normal(100, 1)
noise ~ InvGamma(1, 1)
means = zeros((20, 20))
means[lower_x:upper_x, lower_y:upper_y] = height  # this needs to be a deterministic
data ~ Normal(mu=means, sd=noise)
It might be better to code upper_x as an offset and then use lower_x:lower_x+offset_x; otherwise you need a potential to enforce lower_x < upper_x.
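A minimal sketch of that model in PyMC 2.x syntax (untested; the array size matches the simulated data above, and the offset parametrization follows the suggestion just made, so no potential is needed):
import numpy as np
import pymc

# The simulated array from the question.
sz = (12, 18)
data_input = np.random.normal(loc=5.0, size=sz)
data_input[7:10, 2:6] = np.random.normal(loc=100.0, size=(3, 4))

# Offsets instead of upper bounds avoid the lower < upper constraint.
lower_x = pymc.DiscreteUniform('lower_x', lower=0, upper=sz[0] - 1)
offset_x = pymc.DiscreteUniform('offset_x', lower=1, upper=sz[0])
lower_y = pymc.DiscreteUniform('lower_y', lower=0, upper=sz[1] - 1)
offset_y = pymc.DiscreteUniform('offset_y', lower=1, upper=sz[1])
height = pymc.Normal('height', mu=100.0, tau=1.0)
noise_sd = pymc.InverseGamma('noise_sd', alpha=1.0, beta=1.0)

@pymc.deterministic
def means(lx=lower_x, ox=offset_x, ly=lower_y, oy=offset_y, h=height):
    m = np.zeros(sz)
    m[lx:lx + ox, ly:ly + oy] = h  # numpy clips slices at the array edge
    return m

@pymc.deterministic
def tau(sd=noise_sd):
    return 1.0 / sd ** 2  # the pseudo-model's sd expressed as a precision

data = pymc.Normal('data', mu=means, tau=tau,
                   value=data_input, observed=True)

M = pymc.MCMC([lower_x, offset_x, lower_y, offset_y,
               height, noise_sd, means, tau, data])
# AdaptiveMetropolis on the continuous parameters; the discrete corner and
# offset variables keep the default Metropolis steppers.
M.use_step_method(pymc.AdaptiveMetropolis, [height, noise_sd])
M.sample(iter=20000, burn=10000)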

Related

Extracting table structures from image

I have a bunch of images like this one.
What would be a good way to extract just the table structure from the image? I'm only interested in extracting the straight lines.
I have been toying around with the OpenCV "Finding Contours" code sample and the results are quite promising. I'm just wondering if there is maybe a better way?
OpenCV has a nice way to detect line segments. Here is a code snippet in python:
import math
import numpy as np
import cv2

img = cv2.imread('page2.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
lsd = cv2.createLineSegmentDetector(0)
dlines = lsd.detect(gray)
for dline in dlines[0]:
    x0 = int(round(dline[0][0]))
    y0 = int(round(dline[0][1]))
    x1 = int(round(dline[0][2]))
    y1 = int(round(dline[0][3]))
    cv2.line(img, (x0, y0), (x1, y1), 255, 1, cv2.LINE_AA)
    # print line segment length
    a = (x0 - x1) * (x0 - x1)
    b = (y0 - y1) * (y0 - y1)
    c = a + b
    print(math.sqrt(c))
cv2.imwrite('page2_lines.png', img)
Kindly go through my GitHub repository: code for table extraction.
The code detects tables and extracts the information while keeping the spatial coordinates intact.
It detects the lines of a table, as shown in the image below. I hope it solves your problem.
The extracted output, in the form of a table, is shown below as well.
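The repository itself is not reproduced here, but the usual morphology-based idea behind this kind of table-line extraction looks roughly like the following sketch (the file name, block size, and kernel lengths are placeholder assumptions):
import cv2

# Binarize with the text/lines as white on black.
img = cv2.imread('page2.png', cv2.IMREAD_GRAYSCALE)
binary = cv2.adaptiveThreshold(~img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 15, -2)

# Long, thin structuring elements keep only strokes that look like
# horizontal or vertical rules; everything else is eroded away.
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)
vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel)

# The union of both masks is the table skeleton.
table_mask = cv2.add(horizontal, vertical)
cv2.imwrite('page2_table_mask.png', table_mask)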

Scikit-learn PCA .fit_transform shape is inconsistent (n_samples << m_attributes)

I am getting different shapes for my PCA using sklearn. Why isn't my transformation resulting in an array of the same dimensions as the docs say?
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:
X_new : array-like, shape (n_samples, n_components)
Check this out with the iris dataset, which is (150, 4), where I'm making 4 PCs:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid': False})
%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns=load_iris().feature_names)
Se_targets = pd.Series(load_iris().target,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name="Species")

# Scaling: mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index=DF_data.index,
                           columns=DF_data.columns)

# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = m

# PCA (How I tend to set it up)
M_PCA = decomposition.PCA()
A_components = M_PCA.fit_transform(DF_standard)
#DF_standard.shape, A_components.shape
#((150, 4), (150, 4))
but then when I use the exact same approach on my actual dataset, which is (76, 1989), i.e. 76 samples and 1989 attributes/dimensions, I get a (76, 76) array instead of (76, 1989):
DF_centered = normalize(DF_mydata, method="center", axis=0)
m = DF_centered.shape[1]
# print(m)
# 1989
M_PCA = decomposition.PCA(n_components=m)
A_components = M_PCA.fit_transform(DF_centered)
DF_centered.shape, A_components.shape
# ((76, 1989), (76, 76))
normalize is just a wrapper I made that subtracts the mean from each dimension.
(Note: this answer is adapted from my answer on Cross Validated here: Why are there only n−1 principal components for n data points if the number of dimensions is larger or equal than n?)
PCA (as most typically run) creates a new coordinate system by:
shifting the origin to the centroid of your data,
squeezing and/or stretching the axes to make them equal in length, and
rotating your axes into a new orientation.
(For more details, see this excellent CV thread: Making sense of principal component analysis, eigenvectors & eigenvalues.) However, step 3 rotates your axes in a very specific way. Your new X1 (now called "PC1", i.e., the first principal component) is oriented in your data's direction of maximal variation. The second principal component is oriented in the direction of the next greatest amount of variation that is orthogonal to the first principal component. The remaining principal components are formed likewise.
With this in mind, let's examine a simple example (suggested by @amoeba in a comment). Here is a data matrix with two points in a three-dimensional space:
X = [ 1 1 1
      2 2 2 ]
Let's view these points in a (pseudo) three dimensional scatterplot:
So let's follow the steps listed above. (1) The origin of the new coordinate system will be located at (1.5,1.5,1.5). (2) The axes are already equal. (3) The first principal component will go diagonally from what used to be (0,0,0) to what was originally (3,3,3), which is the direction of greatest variation for these data. Now, the second principal component must be orthogonal to the first, and should go in the direction of the greatest remaining variation. But what direction is that? Is it from (0,0,3) to (3,3,0), or from (0,3,0) to (3,0,3), or something else? There is no remaining variation, so there cannot be any more principal components.
With N=2 data, we can fit (at most) N−1=1 principal components.
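To see the same behavior in miniature, here is a small sketch (random data stands in for the 76 x 1989 matrix; the printed shapes are the point):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(5, 100))    # n_samples=5 << n_features=100

pca = PCA()                      # n_components defaults to min(n_samples, n_features)
scores = pca.fit_transform(X)
print(scores.shape)              # (5, 5), not (5, 100)
print(pca.explained_variance_)   # the last value is ~0: centering leaves only
                                 # n_samples - 1 directions of variation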

Healpy plotting: How do I make a figure with subplots using the healpy.mollview projection?

I've just recently started trying to use healpy and I can't figure out how to make subplots to contain my maps. I have a thermal emission map of a planet as a function of time, and I need to look at it at several moments in time (let's say 9 different times) and superimpose some coordinates, to check that my planet is rotating the right way.
So far, I can do two things:
Make 9 different figures with the superimposed coordinates.
Make a figure with 9 subplots containing 9 different maps, but that superimposes all of my coordinates on all of my subplots, instead of just the time-appropriate ones.
I'm not sure if this is a very simple problem, but it's been driving me crazy and I can't find anything that works.
I'll show you what I mean:
OPTION 1:
import numpy as np
import healpy as hp
import matplotlib.pyplot as plt

MAX = 10**23
MIN = 10**10
for i in range(9):
    t = 4000 + 10*i
    hp.visufunc.mollview(Fmap_wvpix[t,:],
                         title="Map at t="+str(t), min=MIN, max=MAX)
    hp.visufunc.projplot(d[t, np.where(np.abs(d[t,:,2]-SSP[t]) < 0.5), 1],
                         d[t, np.where(np.abs(d[t,:,2]-SSP[t]) < 0.5), 2],
                         'k*', markersize=6)
    hp.visufunc.projplot(d[t, np.where(np.abs(d[t,:,2]-SOP[t]) < 0.2), 1],
                         d[t, np.where(np.abs(d[t,:,2]-SOP[t]) < 0.2), 2],
                         'r*', markersize=6)
This makes 9 figures that look pretty much like this:
[Image: flux map superimposed with some stars at time = t]
But I need a lot of them, so I want to make an image that contains 9 subplots that look like this one.
OPTION 2:
fig = plt.figure(figsize=(10, 8))
for i in range(9):
    t = 4000 + 10*i
    hp.visufunc.mollview(Fmap_wvpix[t,:],
                         title="Map at t="+str(t), min=MIN, max=MAX,
                         sub=int('33'+str(i+1)))
    hp.visufunc.projplot(d[t, np.where(np.abs(d[t,:,2]-SSP[t]) < 0.5), 1],
                         d[t, np.where(np.abs(d[t,:,2]-SSP[t]) < 0.5), 2],
                         'k*', markersize=6)
    hp.visufunc.projplot(d[t, np.where(np.abs(d[t,:,2]-SOP[t]) < 0.2), 1],
                         d[t, np.where(np.abs(d[t,:,2]-SOP[t]) < 0.2), 2],
                         'r*', markersize=6)
This gives me subplots, but it draws all the projplot stars on all of my subplots! (See the following image.)
[Image: subplots with too many stars]
I know that I need a way to select the axes that hold the time = t map and draw the stars for time = t on that map only, but everything I've tried so far has failed. I've mostly tried to use projaxes, thinking I could define a matplotlib axes object and draw the stars on it, but it doesn't work. Any advice?
Also, I would like to draw some lines on my map as well, but I can't figure out how to do that either. The documentation says projplot, but it won't draw anything if I don't tell it I want a marker.
PS: This code is probably useless to you, as it won't run if you don't have my arrays. Here's a simpler version that should run:
import numpy as np
import healpy as hp
import matplotlib.pyplot as plt
NSIDE = 8
m = np.arange(hp.nside2npix(NSIDE))*1
MAX = 900
MIN = 0
fig = plt.figure(figsize=(10, 8))
for i in range(9):
    t = 4000 + 10*i
    hp.visufunc.mollview(m + 100*i, title="Map at t="+str(t), min=MIN, max=MAX,
                         sub=int('33'+str(i+1)))
    hp.visufunc.projplot(1.5, 0 + 30*i, 'k*', markersize=16)
So this is supposed to give me one star for each frame, and the star is supposed to be moving. But instead it draws all the stars on all the frames.
What can I do? I don't understand the documentation.
If you want to have healpy plots in matplotlib subplots, the following would be the way to go. The key is to use plt.axes() to select the active subplot and to use the hold=True keyword in the healpy functions.
import healpy as hp
import numpy as np
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(ncols=2)
plt.axes(ax1)
hp.mollview(np.random.random(hp.nside2npix(32)), hold=True)
plt.axes(ax2)
hp.mollview(np.arange(hp.nside2npix(32)), hold=True)
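Applied to the 3x3 toy example from the question, that approach might look like the following untested sketch (one projplot call per frame, so each star lands only on its own map):
import numpy as np
import healpy as hp
import matplotlib.pyplot as plt

NSIDE = 8
m = np.arange(hp.nside2npix(NSIDE)) * 1.0

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(10, 8))
for i, ax in enumerate(axes.flat):
    t = 4000 + 10 * i
    plt.axes(ax)  # make this subplot current (plt.sca(ax) on newer matplotlib)
    hp.mollview(m + 100 * i, title="Map at t=" + str(t),
                min=0, max=900, hold=True)
    # projplot draws on the mollview axes that was just created
    hp.projplot(1.5, 0 + 30 * i, 'k*', markersize=16)
plt.show()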
I have just encountered this question while looking for a solution to the same problem, but I managed to find one in the documentation of mollview (here).
As you can see there, 'sub' receives the same syntax as matplotlib's subplot function. This format is:
(# of rows, # of columns, # of current subplot)
E.g., to make your plot, the value sub wants to receive in each iteration is
sub=(3, 3, i)
where i runs from 1 to 9 (3*3).
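Adapting the simpler loop from the question (an untested sketch reusing m, MIN, and MAX from above):
for i in range(9):
    t = 4000 + 10 * i
    hp.visufunc.mollview(m + 100 * i, title="Map at t=" + str(t),
                         min=MIN, max=MAX, sub=(3, 3, i + 1))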
This worked for me. I haven't tried it with your code, but it should work.
Hope this helps!

Count number of objects using watershed algorithm - Scikit-image

I am trying to find the number of objects in a given image using watershed segmentation. Consider for example the coins image. Here I would like to know the number of coins in the image. I implemented the code available at Scikit-image documentation and tweaked with it a little and got results similar to those displayed on the documentation page.
After looking in detail at the functions used in the code, I found out that ndimage.label() also returns the number of unique objects found in the image (mentioned in its documentation), but when I print that value I get 53, which is far higher than the number of coins actually in the image.
Can somebody suggest a method to find the number of objects in an image?
Here is a version of your code that counts the coins in two ways: (a) by directly segmenting the distance image, and (b) by doing the watershed first and rejecting tiny intersecting regions.
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage
from skimage import io, color, filters  # `filters` was named `filter` in old scikit-image
from skimage.morphology import watershed
from skimage.feature import peak_local_max
from skimage.measure import regionprops, label

image = color.rgb2gray(io.imread('water_coins.jpg'))
image = image < filters.threshold_otsu(image)
distance = ndimage.distance_transform_edt(image)

# Here's one way to measure the number of coins directly
# from the distance map
coin_centres = (distance > 0.8 * distance.max())
print('Number of coins (method 1):', np.max(label(coin_centres)))

# Or you can proceed with the watershed labeling
local_maxi = peak_local_max(distance, indices=False, footprint=np.ones((3, 3)),
                            labels=image)
markers, num_features = ndimage.label(local_maxi)
labels = watershed(-distance, markers, mask=image)

# ...but then you have to clean up the tiny intersections between coins
regions = regionprops(labels)
regions = [r for r in regions if r.area > 50]
print('Number of coins (method 2):', len(regions) - 1)

fig, axes = plt.subplots(ncols=3, figsize=(8, 2.7))
ax0, ax1, ax2 = axes
ax0.imshow(image, cmap=plt.cm.gray, interpolation='nearest')
ax0.set_title('Overlapping objects')
ax1.imshow(-distance, cmap=plt.cm.jet, interpolation='nearest')
ax1.set_title('Distances')
ax2.imshow(labels, cmap=plt.cm.nipy_spectral, interpolation='nearest')
ax2.set_title('Separated objects')
for ax in axes:
    ax.axis('off')
fig.subplots_adjust(hspace=0.01, wspace=0.01, top=1, bottom=0, left=0,
                    right=1)
plt.show()

Implementing a linear, binary SVM (support vector machine)

I want to implement a simple SVM classifier, in the case of high-dimensional binary data (text), for which I think a simple linear SVM is best. The reason for implementing it myself is basically that I want to learn how it works, so using a library is not what I want.
The problem is that most tutorials go up to an equation that can be solved as a "quadratic problem", but they never show an actual algorithm! So could you point me either to a very simple implementation I could study, or (better) to a tutorial that goes all the way to the implementation details?
Thanks a lot!
Some pseudocode for the Sequential Minimal Optimization (SMO) method can be found in this paper by John C. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. There is also a Java implementation of the SMO algorithm, which was developed for research and educational purposes (SVM-JAVA).
Other commonly used methods to solve the QP optimization problem include:
constrained conjugate gradients
interior point methods
active set methods
But be aware that some math knowledge is needed to understand these things (Lagrange multipliers, Karush–Kuhn–Tucker conditions, etc.).
Are you interested in using kernels or not? Without kernels, the best way to solve these kinds of optimization problems is through various forms of stochastic gradient descent. A good version is described in http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf, and it gives an explicit algorithm.
The explicit algorithm does not work with kernels, but it can be modified; however, that would make it more complex, both in terms of code and runtime.
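For reference, here is a minimal sketch of that primal sub-gradient method (the Pegasos-style update from the linked paper, without the optional projection step); treat it as an illustration rather than a tuned implementation:
import numpy as np

def pegasos(X, y, lam=0.01, T=100000, seed=0):
    """Linear SVM via stochastic sub-gradient descent on the primal.
    X: (n, d) array; y: labels in {-1, +1}; lam: regularization strength."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.randint(n)
        eta = 1.0 / (lam * t)           # step-size schedule from the paper
        if y[i] * np.dot(w, X[i]) < 1:  # margin violated: hinge sub-gradient
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                           # margin satisfied: shrink only
            w = (1 - eta * lam) * w
    return w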
Have a look at liblinear, and for non-linear SVMs have a look at libsvm.
The following paper, "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM" (top of page 11), describes the Pegasos algorithm also for kernels. It can be downloaded from http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
It appears to be a hybrid of coordinate descent and subgradient descent. Also, line 6 of the algorithm is wrong: in the predicate, the second appearance of y_i_t should be replaced with y_j.
I would like to add a small supplement to the answer about Platt's original work.
There is a somewhat simplified version presented in the Stanford lecture notes, but the derivation of all the formulas has to be found somewhere else (e.g. these random notes I found on the Internet).
If it's OK to deviate from the original implementations, I can propose my own variation of the SMO algorithm, which follows.
import numpy as np

class SVM:
    def __init__(self, kernel='linear', C=10000.0, max_iter=100000, degree=3, gamma=1):
        self.kernel = {'poly': lambda x, y: np.dot(x, y.T)**degree,
                       'rbf': lambda x, y: np.exp(-gamma*np.sum((y - x[:, np.newaxis])**2, axis=-1)),
                       'linear': lambda x, y: np.dot(x, y.T)}[kernel]
        self.C = C
        self.max_iter = max_iter

    def restrict_to_square(self, t, v0, u):
        # clip the step so that both lambdas stay inside the box [0, C]^2
        t = (np.clip(v0 + t*u, 0, self.C) - v0)[1]/u[1]
        return (np.clip(v0 + t*u, 0, self.C) - v0)[0]/u[0]

    def fit(self, X, y):
        self.X = X.copy()
        self.y = y * 2 - 1  # map labels {0, 1} -> {-1, +1}
        self.lambdas = np.zeros_like(self.y, dtype=float)
        self.K = self.kernel(self.X, self.X) * self.y[:, np.newaxis] * self.y
        for _ in range(self.max_iter):
            for idxM in range(len(self.lambdas)):
                idxL = np.random.randint(0, len(self.lambdas))
                Q = self.K[[[idxM, idxM], [idxL, idxL]], [[idxM, idxL], [idxM, idxL]]]
                v0 = self.lambdas[[idxM, idxL]]
                k0 = 1 - np.sum(self.lambdas * self.K[[idxM, idxL]], axis=1)
                u = np.array([-self.y[idxL], self.y[idxM]])
                t_max = np.dot(k0, u) / (np.dot(np.dot(Q, u), u) + 1E-15)
                self.lambdas[[idxM, idxL]] = v0 + u * self.restrict_to_square(t_max, v0, u)
        idx, = np.nonzero(self.lambdas > 1E-15)  # support vectors
        self.b = np.sum((1.0 - np.sum(self.K[idx]*self.lambdas, axis=1)) * self.y[idx]) / len(idx)

    def decision_function(self, X):
        return np.sum(self.kernel(X, self.X) * self.y * self.lambdas, axis=1) + self.b
In simple cases it works not much worse than sklearn.svm.SVC; a comparison is shown below (I have posted the code that generates these images on GitHub).
I used quite a different approach to derive the formulas; you may want to check my preprint on ResearchGate for details.
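As a quick usage sketch (untested; make_blobs and the small max_iter are stand-ins, and fit expects labels in {0, 1}):
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVM(kernel='linear', max_iter=100)
clf.fit(X, y)
pred = (clf.decision_function(X) > 0).astype(int)
print('train accuracy:', (pred == y).mean())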

Resources