I have a dataset of contour images.
In my dataset, each image contain SINGLE object (on black background) which corresponds to a contour-image (i.e. image corresponding to a particular detected contour earlier), retrieved earlier.
I just have these images, and no other contour information.
I need to get contour polygon (height, width, polygon coordinates) for each image so that I can use this dataset for training in Tensorflow models.
Will running cv2.findContours() make sense (as each image is already a single contour) or is there another faster way to extract the bounding polygon from the contour images ?
Thank you so much in advance.
I want to cluster images based on colour similarity. For that I need a good similarity metric between two 3D histograms. A 3D histogram of an image is just a 3 dimensional space where each axis represents one of the base colours. The range of each axis is 0-255 since this are the possible values of the base colours for each pixel.
The histogram is represented as a 256X256X256 matrix and each entry in the matrix represents the count of pixels with that specific colour in the image. For example:
If the value of the matrix element M[0][0][0] = 1150 it means that there are 1150 black pixels in the image (RGB(0,0,0) represents the colour black)
I am looking for the most sensible similarity metric for this kind of problem. The metric will be used in the clustering algorithm (DBSCAN probably) to evaluate image similarity.
Use the L*a*b* (CIELAB) color space, where euclidean distance is indeed similarity, as it is designed to model human eye perceptions non-linearities.
"Our system divides the input image into an S * S grid.
If the center of an object falls into a grid cell, that grid cell
is responsible for detecting that object."
This is from YOLO paper, the input images are divided into S*S grid,which means that the output of conv is the size of S * S, right?
If so, how do these small cells(7 * 7) connect to the original region of input image? I know how conv works, but how the bounding boxes do regression?
The ground truth in original size will be "resized" to SxS, in this case 7x7 in Yolov1 or 13x13 in Yolov2
I think this Yolo implementation could be useful for you to understand how yolo built:
https://github.com/1991viet/Yolo-pytorch
I have a customized camera, which contains 3 individual lens+filters arranged in a triangle so in every shot I get 3 single band grayscale images (r, g, b). I want to merge them to get an RGB.
The problem is, since the 3 lens are physically separated, the image captured by them are not aligned. As a result, when I use command qdal_merge in the software pack QGIS, the result looks weird. I may also need to adjust the weight of the r,g,b. I put the raw r,g,b images and the output I generated using qgis in this dropbox folder.
Is there existing open-source tool to do the alignment and merge? If not, how can I do it using opencv?
Combining R,G,B images is possible using a simple pixel intensity distance metric like Sum of Squared Distances (SSD). A better metric is the Normalized Cross-Correlation (NCC) (see Wikipedia) which first normalizes an image matrix into a unit vector, and computes the dot product of such unit vectors (from 2 input images). The higher the NCC value, the greater the similarity of the two input images.
However, NCC similarity may be insufficient for computing the best alignment of two high resolution images, such as the TIFF images you provide. One should therefore use a downsampling method as described below
to align two input images at a smaller size and then simply compute the offset as you rescale the images.
So for the input images, red, green and blue, there are two approaches to align them into a single RGB image:
Consider the blue image as the reference image for example, w.r.t. which we align the red and green images. Now consider red and blue images. Within a certain window, compute the best alignment offset of the red and blue images using the NCC similarity metric, and find the shifted_red image. Do the same for the green and blue images. Now combine the shifted_red, shifted_green and blue images to get the final RGB image.
For high-resolution images, decide a scale_count. Recursively, at each step resize the image by half, compute the offset of the red image w.r.t. the blue image, rescale the offset and apply it. The benefit of doing such a recursive multi-scale alignment is decrease in computation time and increase in accuracy of alignment (you don't know the best window size for searching for alignment offsets for solution (1), so this will work better). Repeat this approach for computing the alignment for green and blue channels, and then combine the final results as in (1).
Since this problem is common in assignments of computational photography courses, I am not going to share any code. I have, however implemented the two approaches and experimented with the images you provide. I don't know which of the input images is red, so I have two results (rescaled to decrease file size):
If IMG_0290_1.tif is Red, IMG_0290_2.tif is Green and IMG_0290_3.tif is blue:
RGB result if red:1, green:2, blue:3
If IMG_0290_3.tif is Red, IMG_0290_2.tif is Green and IMG_0290_1.tif is blue (this looks more correct to me):
RGB result if red:3, green:2, blue:1
I know that the handwritten digit images in the mnist dataset are 28×28,but why the input in LeNet5 is 32×32?
Your question is answered in the original paper:
The convolution step always takes a smaller input than the feature maps of the previous layer (and this holds true for the 1st layer - the input - as well):
Layer C1 is a convolutional layer with 6 feature maps.
Each unit in each feature map is connected to a 5x5 neighborhood in the input. The size of the feature maps is 28x28
which prevents connection from the input from falling off
the boundary.
This means that using a 5x5 neighborhood on a 32x32 input, you'll get 6 features maps of size 28x28 because there's pixels you won't use at the image boundary (you will always have a remainder with these numbers).
Of course they could have an exception for the first layer. The reason they're still using 32x32 images is:
The input is a 32x32 pixel image. This is significantly larger
than the largest character in the database (at most 20x20
pixels centered in a 28x28 field). The reason is that it is
desirable that potential distinctive features such as stroke
end-points or corner can appear in the center of the receptive field of the highest-level feature detectors.