Image segmentation with UNet: the output is mainly white

I'm working with the MONAI library to segment brain tumors in MRI images. My dataset consists of 500 patients, and for each patient there are three types of images plus the segmentation mask.
I've trained my network (a UNet architecture), but I have a problem during inference.
If the input image has only a few non-zero pixels, the network's output is mainly white. Why does this happen? Can someone help me?
Example of input image with a few non-zero pixels
Example of the predicted segmentation
I've already checked the output values before discretization, and they are all equal to 1.
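For reference, this is roughly how I run inference and inspect the raw output (a minimal sketch; model, image, and roi_size are placeholders for my actual setup):
import torch
from monai.inferers import sliding_window_inference
from monai.transforms import Activations, AsDiscrete

model.eval()
post_act = Activations(sigmoid=True)    # map raw logits to probabilities in [0, 1]
post_disc = AsDiscrete(threshold=0.5)   # binarize only after the activation

with torch.no_grad():
    # image: tensor of shape (batch, channels, H, W, D)
    logits = sliding_window_inference(image, roi_size=(128, 128, 64),
                                      sw_batch_size=4, predictor=model)
    probs = post_act(logits)
    mask = post_disc(probs)

# sanity check: on a nearly empty slice these should not all be ~1
print(probs.min().item(), probs.max().item(), probs.mean().item())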

Related

Calculate similarity of a picture and its sketch

I'm trying to develop an algorithm that returns a similarity score for two given black and white images: the original one and its sketch, drawn by a human:
All original images have the same style, but there is no fixed, limited set of them; their content could be totally different.
I've tried a few approaches, but none of them has been successful yet:
OpenCV template matching
OpenCV matchTemplate is not able to calculate a similarity score for two images. It can only tell me the count of matched pixels, and this value is usually quite low because of the imperfect proportions of the human's sketch.
OpenCV feature matching
I failed with this method because I couldn't find good algorithms for extracting significant features from a human's sketch. The algorithms from OpenCV's tutorials are good at extracting corners and blobs as features. But here, in sketches, we have a lot of strokes; each of them produces many insignificant, junk features and leads to fuzzy results.
Neural Network Classification
I also took a look at neural networks. They are good at image classification, but they need a training set for each class, and that is impossible here because we have an unlimited set of possible images.
Which methods and algorithms would you use for this kind of task?
METHOD 1
Cosine similarity gives a similarity score in the range [0, 1].
I first converted the images to grayscale and binarized them. I cropped the original image to half its size and excluded the text, as shown below:
I then converted the image arrays to 1D arrays using flatten(). I used the following to compute the cosine similarity (note that scipy's cosine() returns the cosine distance, so the similarity is 1 minus that value):
from scipy import spatial

# spatial.distance.cosine returns the cosine distance; similarity = 1 - distance
result = 1 - spatial.distance.cosine(im2, im1)
print(result)
The result I obtained was 0.999999988431, meaning the images are similar to each other by this score.
EDIT
METHOD 2
I had time to check out another solution and found that OpenCV's cv2.matchTemplate() function can do a similar job.
If you check out THIS DOCUMENTATION PAGE you will come across the different comparison methods that can be used.
I used the cv2.TM_SQDIFF_NORMED method (which gives the normalized squared difference between the two images).
import cv2

# th1 and th2 are the binarized images; for equally sized images the result is a 1x1 array
res = cv2.matchTemplate(th1, th2, cv2.TM_SQDIFF_NORMED)
print(1 - res)
For the given images I obtained a similarity score of: 0.89689457

Poor performance on digit recognition with CNN trained on MNIST dataset

I trained a CNN (in TensorFlow) for digit recognition using the MNIST dataset.
Accuracy on the test set was close to 98%.
I then wanted to predict digits from data I created myself, and the results were bad.
What did I do to the images I wrote myself?
I segmented out each digit, converted it to grayscale, resized it to 28x28, and fed it to the model.
Why do I get such low accuracy on my own data set but such high accuracy on the test set?
Are there other modifications that I'm supposed to make to the images?
EDIT:
Here is the link to the images and some examples:
Excluding bugs and obvious errors, my guess would be that your problem is that you are capturing your handwritten digits in a way that is too different from your training set.
When capturing your data, you should try to mimic as closely as possible the process used to create the MNIST dataset.
From the official MNIST dataset website:
The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
If your data is processed differently at training time than at prediction time, your model will not be able to generalize from the training data to your own images.
So I have two pieces of advice for you:
Try to capture and process your digit images so that they look as similar as possible to the MNIST images (see the sketch below);
Add some of your own examples to your training data so that your model can train on images similar to the ones you are classifying.
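As a rough illustration (my own sketch, not a drop-in solution), the following mimics the MNIST pipeline quoted above: fit the digit into a 20x20 box while preserving its aspect ratio, then center it by center of mass inside a 28x28 field. It assumes img is a grayscale array with a dark digit on a light background:
import cv2
import numpy as np

def mnist_like(img):
    # MNIST digits are light on a dark background, so invert first
    img = 255 - img
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    ys, xs = np.nonzero(img)
    if len(xs) == 0:
        return np.zeros((28, 28), dtype=np.float32)
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]   # tight crop around the digit

    # fit into a 20x20 box while preserving the aspect ratio
    h, w = img.shape
    scale = 20.0 / max(h, w)
    new_w = max(1, int(round(w * scale)))
    new_h = max(1, int(round(h * scale)))
    img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_AREA)

    # paste into a 28x28 field, then shift so the center of mass is at the center
    canvas = np.zeros((28, 28), dtype=np.float32)
    top, left = (28 - new_h) // 2, (28 - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = img
    cy, cx = np.array(np.nonzero(canvas)).mean(axis=1)
    M = np.float32([[1, 0, 14 - cx], [0, 1, 14 - cy]])
    canvas = cv2.warpAffine(canvas, M, (28, 28))

    return canvas / 255.0   # scale to [0, 1] like common MNIST loaders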
For those who still have a hard time with the poor quality of CNN-based models for MNIST:
https://github.com/christiansoe/mnist_draw_test
Normalization was the key.

Image per-pixel Scene labeling output issue (using FCN-32s Semantic Segmentation)

I'm looking for a way, given an input image and a neural network, to output a class label for each pixel in the image (sky, grass, mountain, person, car, etc.).
I've set up Caffe (the future branch) and successfully run the FCN-32s Fully Convolutional Semantic Segmentation model on PASCAL-Context. However, I'm unable to produce clearly labeled images with it.
Images that visualize my problem:
Input image
Ground truth
And my result:
This might be some resolution issue. Any idea of where I'm going wrong?
It seems like the 32s model uses large strides and thus works at a coarse resolution. Can you try the 8s model, which performs less resolution reduction?
Looking at J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015 (especially figure 4), it seems the 32s model is not designed to capture fine details of the segmentation.
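As a rough illustration (not a definitive recipe), this is the usual pattern for running one of the FCN models in pycaffe and taking a per-pixel argmax over the class scores; the file paths and the 'score' blob name are assumptions that depend on which deploy.prototxt and weights you use:
import numpy as np
from PIL import Image
import caffe

im = Image.open('input.jpg')
in_ = np.array(im, dtype=np.float32)[:, :, ::-1]        # RGB -> BGR
in_ -= np.array((104.007, 116.669, 122.679))            # mean subtraction
in_ = in_.transpose((2, 0, 1))                          # HWC -> CHW

# paths are placeholders for the FCN-8s deploy file and weights
net = caffe.Net('fcn8s/deploy.prototxt', 'fcn8s/weights.caffemodel', caffe.TEST)
net.blobs['data'].reshape(1, *in_.shape)
net.blobs['data'].data[...] = in_
net.forward()

labels = net.blobs['score'].data[0].argmax(axis=0)      # per-pixel class labels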

How should I set up my input neurons to receive my input

I need to be able to determine whether a shape was drawn correctly or incorrectly.
I have sample data for the shape that holds the shape and the order of the pixels (denoted by the color of each pixel).
For example, you can see the downsampled image and the color variation below.
I'm having trouble figuring out the network I need to define that will accept this kind of input for training.
Should I convert the downsampled image to a matrix and feed that in? Let's say my image is 64x64; I would need 64x64 input neurons (and that's if I ignore the color of the pixels, I think). Is that a feasible solution?
If you have any guidance, I could use it :)
I'll give you an example below.
It is a binarized 4x4 image of the letter c. You can concatenate either the rows or the columns; I am concatenating by columns, as shown in the figure. Each pixel is then mapped to an input neuron (16 input neurons in total). In the output layer, I have 26 outputs, one for each letter a to z.
Note that, in the figure, I did not connect all the nodes from layer i to layer i+1, for simplicity; in practice you would normally connect them all.
At the output layer, I highlight the node for c to indicate that, for this training instance, c is the target label. The expected input and output vectors are listed at the bottom of the figure.
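Roughly, in code (the pixel values here are made up just to illustrate the shape of the vectors):
import numpy as np

# a made-up 4x4 binarized image of a "c"; 1 = dark pixel
image = np.array([[0, 1, 1, 1],
                  [1, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 1, 1, 1]])

x = image.flatten(order='F')   # concatenate the columns -> 16 input values

y = np.zeros(26)               # one output node per letter a..z
y[ord('c') - ord('a')] = 1     # mark "c" as the target label

print(x.shape, y.argmax())     # (16,) 2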
If you want to keep the color intensity, e.g. R/G/B, then you have to triple the number of inputs: each single pixel is replaced with three neurons.
Hope this helps. For further reading, I strongly suggest the deep learning tutorial by Andrew Ng here: UFLDL. It is the state of the art for this kind of image recognition problem. In the exercises that come with the tutorial, you will get intensive practice preprocessing images and applying a lot of engineering tricks for image processing, together with the deep learning algorithms themselves, end to end.

Conceptual queries on retrieving 'visually similar' images: dense SIFT or another descriptor?

I am posting 3 images from my dataset to show how my images look visually:
http://s1306.photobucket.com/user/Bidisha_Chakraborty/library/?page=1
I am using the VLFeat DSIFT implementation. I am using 4 orientations per descriptor instead of 8, so in my case each descriptor is a 64-dimensional vector instead of 128. I am using the original scale for the images, since my image data was taken from a fixed distance. I am computing descriptors densely at 4/8-pixel intervals. I have conducted several experiments varying the window size from 80x80 pixels down to 20x20 pixels. I used a clustering approach with various numbers of cluster centers, and finally I used the Earth Mover's Distance to compute the similarity metric.
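For concreteness, here is a rough sketch of that kind of pipeline using OpenCV and scikit-learn instead of VLFeat (so 128-dimensional descriptors rather than the 64-dimensional, 4-orientation variant I describe above); images stands for my list of grayscale arrays:
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(gray, step=8, size=8):
    # SIFT descriptors computed on a regular grid (dense SIFT)
    sift = cv2.SIFT_create()
    grid = [cv2.KeyPoint(float(x), float(y), size)
            for y in range(0, gray.shape[0], step)
            for x in range(0, gray.shape[1], step)]
    _, desc = sift.compute(gray, grid)
    return desc

def bow_histogram(desc, kmeans):
    # quantize descriptors against the vocabulary and return a normalized histogram
    words = kmeans.predict(desc)
    hist, _ = np.histogram(words, bins=np.arange(kmeans.n_clusters + 1))
    return hist.astype(np.float32) / hist.sum()

all_desc = [dense_sift(img) for img in images]                 # images: grayscale arrays
kmeans = KMeans(n_clusters=100, n_init=10).fit(np.vstack(all_desc))
hists = [bow_histogram(d, kmeans) for d in all_desc]

# cv2.EMD expects signatures of the form [weight, coordinate];
# using the word index as the coordinate is a crude choice of ground distance
sigs = [np.hstack([h[:, None], np.arange(len(h), dtype=np.float32)[:, None]])
        for h in hists]
emd_12, _, _ = cv2.EMD(sigs[0], sigs[1], cv2.DIST_L2)
emd_13, _, _ = cv2.EMD(sigs[0], sigs[2], cv2.DIST_L2)
print(emd_12, emd_13)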
After tuning the various parameters (window size, number of words), I see that even when I have nearly similar images, like 1 and 3, the distance metric says image 1 is more similar to image 2 than image 1 is to image 3.
I did Principal Component Analysis to look at the variance of the data. I expected images 1 and 2 to form separated clusters and images 1 and 3 to form overlapping clusters. Since I plotted only the first 3 dimensions, and these 3 dimensions accounted for less than 30 percent of the variance, I am sure that including all dimensions (which I of course could not visualize) would give worse results.
Should I conclude that SIFT is not the best choice for my application, or am I missing something? I already tried GLCM for these images and did not get a good result.
Any suggestion for any other feature space is most welcome.
Thanks for any kind of insight.
