I am trying to understand image segmentation using SegNet implementation in keras. I have read the original paper using the Conv and Deconv architechture and also using the Dilated conv layers. However, I have trouble understanding how the labelling of the pixel works.
I am considering the following implementation:
https://github.com/nicolov/segmentation_keras
Here the pascal dataset attributes are used:
21 Classes:
# 0=background
# 1=aeroplane, 2=bicycle, 3=bird, 4=boat, 5=bottle
# 6=bus, 7=car, 8=cat, 9=chair, 10=cow
# 11=diningtable, 12=dog, 13=horse, 14=motorbike, 15=person
# 16=potted plant, 17=sheep, 18=sofa, 19=train, 20=tv/monitor
The classes are represented by:
pascal_nclasses = 21
pascal_palette = np.array([(0, 0, 0)
, (128, 0, 0), (0, 128, 0), (128, 128, 0), (0, 0, 128), (128, 0, 128)
, (0, 128, 128), (128, 128, 128), (64, 0, 0), (192, 0, 0), (64, 128, 0)
, (192, 128, 0), (64, 0, 128), (192, 0, 128), (64, 128, 128), (192, 128, 128)
, (0, 64, 0), (128, 64, 0), (0, 192, 0), (128, 192, 0), (0, 64, 128)], dtype=np.uint8)
I was trying to open the labelled images for cat and boat, as cat is in only in R space and boat only in blue. I used following to show the labelled images:
For boat:
label = cv2.imread("2008_000120.png")
label = np.multiply(label, 100)
cv2.imshow("kk", label[:,:,2])
cv2.waitKey(0)
For cat:
label = cv2.imread("2008_000056.png")
label = np.multiply(label, 100)
cv2.imshow("kk", label[:,:,0])
cv2.waitKey(0)
However, it doesnt matter which space I choose both images always gives same results. i.e. the following code also gives same results
For boat:
label = cv2.imread("2008_000120.png")
label = np.multiply(label, 100)
cv2.imshow("kk", label[:,:,1]) # changed to Green space
cv2.waitKey(0)
For cat:
label = cv2.imread("2008_000056.png")
label = np.multiply(label, 100)
cv2.imshow("kk", label[:,:,1]) # changed to Green space
cv2.waitKey(0)
My assumption was that I will see the cat only in Red color space and boat only in blue. However, the output in all cases:
I am confused now how these pixels are labelled and how are they read and uniquely used to pair with categories in the process of creating the logits.
It will be great if someone can explain or put some relevant links to understand this process. I tried to search but most of the tutorials only discuss the CNN architecture, not the labelling process or how these labels are used within the CNN.
I have attached the labelled images of cat and boat for reference.
The labels are just binary image masks so single channel images. The pixel value at each location of your label image changes depending on the class present at each pixel. So it will be value 0 when there is no object at a pixel and a value 1-20 depending on the class otherwise.
Semantic segmentation is a classification task so you are trying to classify each pixel with a class ( in this case class labels 0-20).
Your model will produce an output image and you want to perform softmax cross entropy between each output image pixel and each label image pixel.
In the multiclass case where you have K classes (like here K=21) each pixel will have K channels and you perform softmax cross entropy across the channels at each pixel. Why a channel for each class? Think about in classification we produce a vector of length K for K classes and this is compared to a one hot vector of length K.
Related
I have done image segmentation of the image using PyTorch. I am trying to get the pixel count of Boat class to measure the area. As an example in the image I want to get the pixel count to measure the boat. How do I do that? from the pixel count is it possible to measure the are of the boat?
I am confused and trying to find a way. I would appreciate if anybody can guide me for that.
**The coding is as below:
**
from torchvision import models
fcn = models.segmentation.fcn_resnet101(pretrained=True).eval()
from PIL import Image
import matplotlib.pyplot as plt
import torch
img = Image.open('boat.jpg')
plt.imshow(img)
plt.show()
# Apply the transformations needed
#Resize the image to (256 x 256)
#CenterCrop it to (224 x 224)
import torchvision.transforms as T
trf = T.Compose([T.Resize(256),
T.CenterCrop(224),
T.ToTensor(),
T.Normalize(mean = [0.485, 0.456, 0.406],
std = [0.229, 0.224, 0.225])])
inp = trf(img).unsqueeze(0)
out = fcn(inp)['out']
print (out.shape)
#now this 21 channeled output into a 2D image or a 1 channeled image, where each pixel of that image corresponds to a class.
import numpy as np
om = torch.argmax(out.squeeze(), dim=0).detach().cpu().numpy()
print (om.shape)
print (np.unique(om))
# Define the helper function
def decode_segmap(image, nc=21):
label_colors = np.array([(0, 0, 0), # 0=background
# 1=aeroplane, 2=bicycle, 3=bird, 4=boat, 5=bottle
(128, 0, 0), (0, 128, 0), (128, 128, 0), (0, 0, 128), (128, 0, 128),
# 6=bus, 7=car, 8=cat, 9=chair, 10=cow
(0, 128, 128), (128, 128, 128), (64, 0, 0), (192, 0, 0), (64, 128, 0),
# 11=dining table, 12=dog, 13=horse, 14=motorbike, 15=person
(192, 128, 0), (64, 0, 128), (192, 0, 128), (64, 128, 128), (192, 128, 128),
# 16=potted plant, 17=sheep, 18=sofa, 19=train, 20=tv/monitor
(0, 64, 0), (128, 64, 0), (0, 192, 0), (128, 192, 0), (0, 64, 128)])
r = np.zeros_like(image).astype(np.uint8)
g = np.zeros_like(image).astype(np.uint8)
b = np.zeros_like(image).astype(np.uint8)
for l in range(0, nc):
idx = image == l
r[idx] = label_colors[l, 0]
g[idx] = label_colors[l, 1]
b[idx] = label_colors[l, 2]
rgb = np.stack([r, g, b], axis=2)
return rgb
rgb = decode_segmap(om)
plt.imshow(rgb); plt.show()
I want to find some guidance
You are looking for skimage.measure.regionprops. Once you have the predicted label map (om in your code) you can apply regionprops to it and get the area of each region.
According to your code snippet, the output om is a tensor of category indices (0 - background, 1 - aeroplane, 2 - bicycle,....).
In order to get the area of a specific category, you just need to compare the output map with the corresponding index, then sum up the results.
For example, with the category boat with the index 4:
BOAT_INDEX = 4
area = torch.sum(om == BOAT_INDEX).item()
I am currently working on lines extraction from a binary image. I initially performed a few image processing steps including threshold segmentation and obtained the following binary image.
As can be seen in the binary image the lines are splitted or broken. And I wanted to join the broken line as shown in the image below marked in red. I marked the red line manually for a demonstration.
FYI, I used the following code to perform the preprocessing.
img = cv2.imread('original_image.jpg') # loading image
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # coverting to gray scale
median_filter = cv2.medianBlur (gray_image, ksize = 5) # median filtering
th, thresh = cv2.threshold (median_filter, median_filter.mean(), 255, cv2.THRESH_BINARY) # theshold segmentation
# small dots and noise removing
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(thresh, None, None, None, 8, cv2.CV_32S)
areas = stats[1:,cv2.CC_STAT_AREA]
result = np.zeros((labels.shape), np.uint8)
min_size = 150
for i in range(0, nlabels - 1):
if areas[i] >= min_size: #keep
result[labels == i + 1] = 255
fig, ax = plt.subplots(2,1, figsize=(30,20))
ax[0].imshow(img)
ax[0].set_title('Original image')
ax[1].imshow(cv2.cvtColor(result, cv2.COLOR_BGR2RGB))
ax[1].set_title('preprocessed image')
I would really appreciate it if you have any suggestions or steps on how to connect the lines? Thank you
Using the following sequence of methods I was able to get a rough approximation. It is a very simple solution and might not work for all cases.
1. Morphological operations
To merge neighboring lines perform morphological (dilation) operations on the binary image.
img = cv2.imread('image_path', 0) # grayscale image
img1 = cv2.imread('image_path', 1) # color image
th = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)[1]
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (19, 19))
morph = cv2.morphologyEx(th, cv2.MORPH_DILATE, kernel)
2. Finding contours and extreme points
My idea now is to find contours.
Then find the extreme points of each contour.
Finally find the closest distance among these extreme points between neighboring contours. And draw a line between them.
cnts1 = cv2.findContours(morph, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts1[0] # storing contours in a variable
Lets take a quick detour to visualize where these extreme points are present:
# visualize extreme points for each contour
for c in cnts:
left = tuple(c[c[:, :, 0].argmin()][0])
right = tuple(c[c[:, :, 0].argmax()][0])
top = tuple(c[c[:, :, 1].argmin()][0])
bottom = tuple(c[c[:, :, 1].argmax()][0])
# Draw dots onto image
cv2.circle(img1, left, 8, (0, 50, 255), -1)
cv2.circle(img1, right, 8, (0, 255, 255), -1)
cv2.circle(img1, top, 8, (255, 50, 0), -1)
cv2.circle(img1, bottom, 8, (255, 255, 0), -1)
(Note: The extreme points points are based of contours from morphological operations, but drawn on the original image)
3. Finding closest distances between neighboring contours
Sorry for the many loops.
First, iterate through every contour (split line) in the image.
Find the extreme points for them. Extreme points mean top-most, bottom-most, right-most and left-most points based on its respective bounding box.
Compare the distance between every extreme point of a contour with those of every other contour. And draw a line between points with the least distance.
for i in range(len(cnts)):
min_dist = max(img.shape[0], img.shape[1])
cl = []
ci = cnts[i]
ci_left = tuple(ci[ci[:, :, 0].argmin()][0])
ci_right = tuple(ci[ci[:, :, 0].argmax()][0])
ci_top = tuple(ci[ci[:, :, 1].argmin()][0])
ci_bottom = tuple(ci[ci[:, :, 1].argmax()][0])
ci_list = [ci_bottom, ci_left, ci_right, ci_top]
for j in range(i + 1, len(cnts)):
cj = cnts[j]
cj_left = tuple(cj[cj[:, :, 0].argmin()][0])
cj_right = tuple(cj[cj[:, :, 0].argmax()][0])
cj_top = tuple(cj[cj[:, :, 1].argmin()][0])
cj_bottom = tuple(cj[cj[:, :, 1].argmax()][0])
cj_list = [cj_bottom, cj_left, cj_right, cj_top]
for pt1 in ci_list:
for pt2 in cj_list:
dist = int(np.linalg.norm(np.array(pt1) - np.array(pt2))) #dist = sqrt( (x2 - x1)**2 + (y2 - y1)**2 )
if dist < min_dist:
min_dist = dist
cl = []
cl.append([pt1, pt2, min_dist])
if len(cl) > 0:
cv2.line(img1, cl[0][0], cl[0][1], (255, 255, 255), thickness = 5)
4. Post-processing
Since the final output is not perfect, you can perform additional morphology operations and then skeletonize it.
I have an image something like the image below (on the left):
I want to extract only the pixels in red on the right: the pixels that belong to a 1px vertical line, but not to any thicker line or other region with more than 1 adjacent black pixel. The image is bitonal.
I have so far tried a morphology OPEN with a vertical (10px, which is find for my purposes) and horizontal kernel and taken the difference, but this needs an awkward shift and leaves some "speckles":
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 10))
vertical_mask1 = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel,
iterations=1)
horz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 1))
horz_mask = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horz_kernel,
iterations=1)
M = np.float32([[1,0,-1],[0,1,1]])
rows, cols = horz_mask.shape
vertical_mask = cv2.warpAffine(horz_mask, M, (cols, rows))
result = cv2.bitwise_and(thresh, cv2.bitwise_not(horz_mask))
What is the correct way to isolate the 1px lines (and only the 1px lines)?
In the general case, for other kernels, this question is: how do I find all pixels in the image that are in regions that the kernel "fits inside" (and then a subtraction to get my desired result)?
That's basically (binary) template matching. You need to derive proper templates from your "kernels". For larger "kernels", that might involve using masks for these templates, too, cf. cv2.matchTemplate.
What's the most important feature for a single pixel vertical line? The left and right neighbour of the current pixel must be 0. So, the template to match is [0, 1, 0]. By using the TemplateMatchMode cv2.TM_SQDIFF_NORMED, perfect matches will lead to close to 0 values in the result array.
You can mask those locations, and dilate according to the size of your template. Then, you use bitwise_and to extract the actual pixels that belong to your template.
Here's some code with a few template ("kernels"):
import cv2
import numpy as np
img = cv2.imread('AapJk.png', cv2.IMREAD_GRAYSCALE)[:, :50]
vert_line = np.array([[0, 1, 0]], np.uint8)
cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], np.uint8)
corner = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 1]], np.uint8)
for i_k, k in enumerate([vert_line, cross, corner]):
m, n = k.shape
img_tmp = 1 - img // 255
mask = cv2.matchTemplate(img_tmp, k, cv2.TM_SQDIFF_NORMED) < 10e-6
mask = cv2.dilate(mask.astype(np.uint8), np.ones((m, n)), anchor=(n-1, m-1))
m, n = mask.shape
mask = cv2.bitwise_and(img_tmp[:m, :n], mask)
out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
roi = out[:m, :n]
roi[mask.astype(bool), :] = [0, 0, 255]
cv2.imwrite('{}.png'.format(i_k), out)
Vertical line:
Cross:
Bottom right corner 3 x 3:
Larger templates ("kernels") most likely will require additional masks, depending on how many or which neighbouring pixels should be considered or not.
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.19041-SP0
Python: 3.9.1
PyCharm: 2021.1.3
NumPy: 1.20.3
OpenCV: 4.5.2
----------------------------------------
I did an object detection using opencv by loading pre-trained MobileNet SSD model. from this post.
It reads a video and detects objects without any problem. But I would like to use readNet (or readFromDarknet) instead of readNetFromCaffe
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
because I have pre-trained weights and cfg file of my own objects only in Darknet framework. Therefore I simply changed readNetFromCaffe into readNet in above post and got an error:
Traceback (most recent call last):
File "people_counter.py", line 124, in <module>
for i in np.arange(0, detections.shape[2]):
IndexError: tuple index out of range
Here detections is an output from
blob = cv2.dnn.blobFromImage(frame, 1.0/255.0, (416, 416), True, crop=False)
net.setInput(blob)
detections = net.forward()
Its shape is (1, 1, 100, 7) tuple (when using readNetFromCaffe).
I was kinda expecting it wouldn't work just by changing the model. Then I decided to look for an object detector code where readNet was used and I found it here. I read through the code and found the same lines as follows:
blob = cv2.dnn.blobFromImage(image, scale, (416,416), (0,0,0), True, crop=False)
net.setInput(blob)
outs = net.forward(get_output_layers(net))
Here, the shape of outs is (1, 845, 6) list. But in order for me to be able to use it right away (here), outs should be of the same size with detections. I've come up to this part and have no clue about how I should proceed.
If something isn't clear, I just need help to use readNet (or readFromDarknet) instead of readNetFromCaffe in this post
If we look at the code closely we can see that everying is dependent on the outputs of detections, line 121, and we should tweak its outputs to match them with the outs of this, line 63. After spending almost a day, I came to a reasonable (not the perfect) solution. Basically, it is all about output blobs of readNetFromCaffe and readFromDarknet, because they output a blob with a shape 1x1xNx7 and NxC, respectively. Here Ns are the number of detections, but with different size vectors, namely, N in 1x1xNx7 is is a number of detections and an every detection is a vector of values
[batchId, classId, confidence, left, top, right, bottom] and N in NxC a number of
detected objects and C is a number of classes + 4 where the first 4 numbers are [center_x, center_y, width, height]. After analyzing these, we may replace (124-130 lines)
for i in np.arange(0, detections.shape[2]):
confidence = detections[0, 0, i, 2]
if confidence > args["confidence"]:
idx = int(detections[0, 0, i, 1])
if CLASSES[idx] != "person":
continue
box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])
(startX, startY, endX, endY) = box.astype("int")
with equivalent lines
for i in np.arange(0, detections.shape[0]):
scores = detections[i][5:]
classId = np.argmax(scores)
confidence = scores[classId]
if confidence > args["confidence"]:
idx = int(classId)
if CLASSES[idx] != "person":
continue
center_x = int(detections[i][0] * 416)
center_y = int(detections[i][1] * 416)
width = int(detections[i][2] * 416)
height = int(detections[i][3] * 416)
left = int(center_x - width / 2)
top = int(center_y - height / 2)
right = width + left - 1
bottom = height + top - 1
box = [left, top, width, height]
(startX, startY, endX, endY) = box
This way we can keep track of "person" class using Darknet's cfg and weights and count them up/down with a visualiation line.
Again, there might be some other more simpler ways of tracking the detections of Darknet weights file, but this works for this particular case.
A reference:
more about blobs output by readNetFromCaffe and readFromDarknet
I read this article about using "resize convolutions" rather than the "deconvolution" (i.e. transposed convolution) method for generating images with neural networks. It's clear how this works with a stride size of 1, but how would you implement it for a stride size >1?
Here is how I've implemented this in TensorFlow. Note: This is the second "deconvolution" layer in the decoder part of an autoencoder network.
h_d_upsample2 = tf.image.resize_images(images=h_d_conv3,
size=(int(self.c2_size), int(self.c2_size)),
method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
h_d_conv2 = tf.layers.conv2d(inputs=h_d_upsample2,
filters=FLAGS.C2,
kernel_size=(FLAGS.c2_kernel, FLAGS.c2_kernel),
padding='same',
activation=tf.nn.relu)
Resizing images really not a viable option for intermediate layers of network. you may try conv2d_transpose
how would you implement it for a stride size >1?
# best practice is to use the transposed_conv2d function, this function works with stride >1 .
# output_shape_width_height = stride * input_shape_width_height
# input_shape = [32, 32, 48], output_shape = [64, 64, 128]
stride = 2
filter_size_w =filter_size_h= 2
shape = [filter_size_w, filter_size_h, output_shape[-1], input_shape[-1]]
w = tf.get_variable(
name='W',
shape=shape,
initializer=tf.contrib.layers.variance_scalling_initializer(),
trainable=trainable)
output = tf.nn.conv2d_transpose(
input, w, output_shape=output_shape, strides=[1, stride, stride, 1])