I am experimenting with the generalised Dice loss implemented in NiftyNet to segment MRI volumes containing 4 classes (1 background and 3 regions of interest) using the V-Net. I tried to format the labels in two ways:
1. Spatial dimensions only, with 0 being background and 1, 2, 3 being the labels for the regions of interest.
2. 5-dimensional images ([3 spatial], 1, 4) storing a binary volume for each class in the 5th dimension.
Inference in the second case produced a 3D volume in which only the class with label '3' was detected, while in the first case the loss didn't decrease at all during training. Am I storing the labels in the correct format?
I think the first format is the correct one.
You might need to clip the gradients in the segmentation application code. Does the loss decrease when you use a standard Dice loss?
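For illustration, here is a small numpy sketch of the two layouts (the shapes are made up and much smaller than real volumes); as far as I know, the integer-label volume (first format) is the one to save to disk, and a one-hot layout can always be derived from it if a particular loss needs it:

import numpy as np

# Illustrative shapes only; your volumes are larger.
H, W, D, n_classes = 64, 64, 32, 4

# Format 1: a single integer label per voxel (0 = background, 1-3 = regions).
labels_int = np.random.randint(0, n_classes, size=(H, W, D)).astype(np.int32)

# Format 2: one binary volume per class, stacked in a trailing axis -> (H, W, D, 1, 4).
labels_onehot = np.stack(
    [(labels_int == c).astype(np.float32) for c in range(n_classes)], axis=-1
)[..., np.newaxis, :]  # insert the singleton "modality" axis

assert labels_onehot.shape == (H, W, D, 1, n_classes)
# Save the integer volume (format 1); a one-hot encoding like format 2 can be
# derived from it on the fly if a loss requires it.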
I am a newbie in deep learning and am doing my Final Year Project on deep learning. I know that we use Conv2D for image-related tasks, but my professor asked me why we don't use Conv1D or Conv3D instead. Why do we specifically use Conv2D here? I've searched the whole internet for a proper answer to this question but can't seem to find a solid one. Please help me with this question, because I am very confused.
Thank you!
In a 1-dimensional CNN, the kernel moves in 1 direction. The input and output data of a 1D CNN are 2-dimensional. It is mostly used on time-series data, since the kernel can only move left or right (x).
In a 2-dimensional CNN, the kernel moves in 2 directions. The input and output data of a 2D CNN are 3-dimensional. As you mentioned, it is widely used for image-related tasks, since in addition to left and right the kernel can also move up and down (x, y).
In a 3-dimensional CNN, the kernel moves in 3 directions. The input and output data of a 3D CNN are 4-dimensional. Since the kernel slides in 3 dimensions, you have (x, y, z) possible movements. One example use case is medical imaging, where scans are 3-dimensional volumes acquired slice by slice and then reconstructed. All the slices together must be analysed as a whole, so it makes little sense to take single slices and apply a 2-dimensional convolution, because the relationships between slices are lost; you need to stack the slices into a 3D representation and analyse it with 3-dimensional convolutions.
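A small Keras sketch (shapes are illustrative only) shows how the input rank grows with the dimensionality of the convolution:

import numpy as np
from tensorflow import keras

x1 = np.random.rand(8, 100, 1).astype("float32")         # (batch, time steps, channels)          -> Conv1D
x2 = np.random.rand(8, 64, 64, 3).astype("float32")      # (batch, height, width, channels)       -> Conv2D
x3 = np.random.rand(8, 32, 64, 64, 1).astype("float32")  # (batch, depth, height, width, channels) -> Conv3D

y1 = keras.layers.Conv1D(16, kernel_size=3)(x1)   # kernel slides along x
y2 = keras.layers.Conv2D(16, kernel_size=3)(x2)   # kernel slides along x and y
y3 = keras.layers.Conv3D(16, kernel_size=3)(x3)   # kernel slides along x, y and z

print(y1.shape, y2.shape, y3.shape)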
I've created a feedforward neural network using DL4J in Java.
Hypothetically and to keep things simple, assume this neural network is a binary classifier of squares and circles.
The input, a feature vector, would be composed of say... 5 different variables:
[number_of_corners,
number_of_edges,
area,
height,
width]
So far, my binary classifier can tell the two shapes apart quite well, as I'm giving it a complete feature vector.
My question: is it possible to input only 2 or 3 of these features, or even just 1? I understand the results will be less accurate; I just need to be able to do so.
If it is possible, how?
How would I do it for a neural network with 213 different features in the input vector?
Let's assume, for example, that you know the area, height, and width features (so you don't know the number_of_corners and number_of_edges features).
If you know that a shape can have, say, a maximum of 10 corners and 10 edges, you could input 10 feature vectors with the same area, height and width but where each vector has a different value for the number_of_corners and number_of_edges features (or all 10 × 10 = 100 combinations, if corners and edges vary independently). Then you can just average over the outputs of the network and round to the nearest integer (so that you still get a binary value).
Similarly, if you only know the area feature you could average over the outputs of the network given several random combinations of input values, where the only fixed value is the area and all the others vary. (I.e. the area feature is the same for each vector but every other feature has a random value.)
This may be a "trick" but I think that the average will converge to a value as you increase the number of (almost-)random vectors.
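In case it helps, here is a framework-agnostic numpy sketch of this averaging idea; predict is only a hypothetical stand-in for your trained DL4J model, and the ranges for the unknown features are assumptions:

import numpy as np

def predict(x):
    # Placeholder for your trained network (e.g. the DL4J model);
    # here it just returns a dummy probability so the sketch runs.
    return float(x.sum() % 1.0)

def predict_with_missing(known, missing_ranges, n_samples=100, seed=0):
    """known: {feature_index: value}; missing_ranges: {feature_index: (low, high)}."""
    rng = np.random.default_rng(seed)
    n_features = len(known) + len(missing_ranges)
    outputs = []
    for _ in range(n_samples):
        x = np.zeros(n_features)
        for i, v in known.items():
            x[i] = v                          # keep the features you actually know
        for i, (lo, hi) in missing_ranges.items():
            x[i] = rng.uniform(lo, hi)        # fill unknown features with random values
        outputs.append(predict(x))
    return int(np.rint(np.mean(outputs)))     # average, then round to a binary label

# Example: area, height, width known (indices 2-4); corners, edges unknown (indices 0-1).
label = predict_with_missing({2: 25.0, 3: 5.0, 4: 5.0}, {0: (0, 10), 1: (0, 10)})
print(label)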
Edit
My solution would not be a good choice if you have a lot of features. In that case you could instead try a Deep Belief Network (DBN) or an autoencoder to infer the values of the missing features from the small number you do know. For example, a DBN can "reconstruct" a noisy input (if you train it enough, of course); you could then give the reconstructed input vector to your feed-forward network.
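As a rough illustration of the autoencoder idea (the layer sizes are arbitrary assumptions, and X_complete / X_zero_filled / classifier are hypothetical):

import numpy as np
from tensorflow import keras

n_features = 213
inputs = keras.Input(shape=(n_features,))
hidden = keras.layers.Dense(64, activation="relu")(inputs)
outputs = keras.layers.Dense(n_features)(hidden)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Train on complete feature vectors so the network learns the correlations
# between features, then reconstruct vectors whose unknown entries were
# zero-filled, and feed the reconstruction to the classifier:
# autoencoder.fit(X_complete, X_complete, epochs=50, batch_size=32)
# X_reconstructed = autoencoder.predict(X_zero_filled)
# prediction = classifier.predict(X_reconstructed)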
I have a dataset consisting of fMRI images. Each image belongs to one class. The dataset is as follows:
Class 1: 9 images
Class 2: 10 images
Class 3: 6 images
Class 4: 12 images
Each image is 4D (time series), i.e. 90x60x10x350 where 350 is the time dimension (i.e. 350 3D volumes). I want to train a classifier on this data.
Now I want to first extract features, then apply feature selection (e.g. PCA), and then do clustering, as described in the paper "Principal Feature Analysis: A Multivariate Feature Selection Method for fMRI Data" (http://www.hindawi.com/journals/cmmm/2013/645921/). For feature extraction I see the following possibilities:
1. Each voxel is a feature and the average of each voxel's time series is taken. Each image then has exactly one feature vector of dimension 90*60*10 = 54'000.
2. Each voxel is a feature and each time point (i.e. each 3D volume) is a data point. Each image then has 350 feature vectors of dimension 90*60*10 = 54'000 each (see the sketch below).
3. Put all voxels of the whole time series of an image into one feature vector of size 90*60*10*350 = 18'900'000. Each image has only one feature vector.
4. Take the correlation values between the voxels as feature values. But this is computationally not doable.
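To make options 1 and 2 concrete, here is a minimal numpy sketch (assuming each image has already been loaded as a 90x60x10x350 array, e.g. with nibabel; the data below is just a random placeholder):

import numpy as np

img = np.random.rand(90, 60, 10, 350)            # one fMRI image (placeholder data)

# Option 1: average each voxel's time series -> one 54000-dim vector per image
features_mean = img.mean(axis=3).reshape(-1)     # shape: (54000,)

# Option 2: each 3D volume (time point) is a data point
# -> 350 vectors of 54000 features per image
features_per_timepoint = img.reshape(-1, 350).T  # shape: (350, 54000)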
I prefer option 2, but I'm not sure whether it is a good idea.
How would you do the feature extraction? And how could a correlation-based approach work in a computationally feasible way?
Last but not least, how would you do cross-validation on this dataset? The problem is that the classes are imbalanced.
Thank you very much in advance for your answers.
I'm new to neural networks and trying to get the hang of it by solving the following task:
Given a semicircle that defines an area above the x-axis, I would like to teach an ANN to output the length of the vector pointing to any position within that area. In addition, I would also like to know the angle between that vector and the x-axis.
I thought of this as a classical example of supervised learning and used backpropagation to train a feed-forward network. The network consists of two input neurons, two output neurons, and a variable number of hidden neurons organised in a variable number of hidden layers.
My training data is a random, unsorted sample of points within that area together with the respective desired values. The coordinates of the points serve as the input of the net, while the calculated length and angle are used as targets to minimise the error.
However, even after thousands of training iterations and empirical changes to the network's topology, I am unable to get the error below ~0.2 (radius: 20.0, topology: 2/4/2).
Are there any obvious pitfalls I'm failing to see or does the chosen approach just not fit the task? Which other network types and/or learning techniques could be used to complete the task?
I wouldn't use a variable number of hidden layers; I would use just one.
Then, I wouldn't use two output neurons; I would use two separate ANNs, one for each of the values you're after. This should work better, since in my opinion your two outputs aren't closely related.
Then, I would experiment with the number of hidden neurons (between 2 and 10) and with different activation functions (logistic, tanh, maybe ReLUs).
After that, do you scale your data? It might be worth scaling both your inputs and outputs. Sigmoid units return small numbers, so it is good if you can adapt your outputs to be small as well (in [-1, 1] or [0, 1]). For example, if you want your angles in degrees, divide all of your targets by 360 before training the ANN on them. Then, when the ANN returns a result, multiply it by 360 and see if that helps.
Finally, there are a number of ways to train your neural network. Gradient descent is the classic, but probably not the best. Better methods include conjugate gradient, BFGS, etc. See here for optimizers if you're using Python; even if not, they might give you an idea of what to search for in your language.
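To tie these suggestions together, here is a quick scikit-learn sketch (the layer size, sampling scheme and radius are my own assumptions): one small network per target, inputs and outputs scaled to small ranges, and an L-BFGS optimizer instead of plain gradient descent.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
R = 20.0

# random training points in the half-disc above the x-axis
r = R * np.sqrt(rng.uniform(size=5000))
theta = rng.uniform(0, np.pi, size=5000)
X = np.c_[r * np.cos(theta), r * np.sin(theta)] / R   # inputs scaled to [-1, 1]
y_len = r / R                                          # length target scaled to [0, 1]
y_ang = theta / np.pi                                  # angle target scaled to [0, 1]

net_len = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh", solver="lbfgs", max_iter=2000)
net_ang = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh", solver="lbfgs", max_iter=2000)
net_len.fit(X, y_len)
net_ang.fit(X, y_ang)

# Undo the scaling when reading predictions:
point = np.array([[5.0, 5.0]]) / R
print(net_len.predict(point) * R, net_ang.predict(point) * 180)  # length, angle in degrees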
My project is to create software that recognizes certain objects, like an apple or a coin. I want to use Kinect. My question is: do I need a machine learning algorithm, like a Haar classifier, to recognize an object, or can Kinect itself do that?
Kinect itself cannot recognize objects. It will give you a dense depth map. You can then use depth features along with some simple appearance features (in your case, color features or gradient features would probably do the job). You feed those features to a classifier (an SVM or a Random Forest, for example) to train the system, and then use the trained model to classify new samples.
Regarding Haar features, I think they could do the job, but you would need a sufficiently large training database. It all depends on what you want to detect; in the case of an apple versus a coin, color alone would suffice.
Refer to this paper to get an idea of how to perform human pose recognition using the Kinect camera. Just pay attention to their depth features and their classifiers; do not apply their approach directly, as your problem is simpler.
Edit: simple gradient orientation histogram
Gradient orientations can give you a coarse idea of the object's shape (it is not a proper shape feature, and better shape features exist, but this one is extremely fast to compute).
Code snippet:
%calculate gradient
[dx,dy] = gradient(double(img));
A = (atan(dy./(dx+eps))*180)/pi; %eps added to avoid division by zero.
A will contain the orientation for each pixel. Segment your original image according to the depth values. For each segment with similar depth values, calculate a color histogram. Extract the pixel orientations corresponding to that region, call it A_r, and calculate a 9-bin histogram (you can use more bins; nine bins mean each bin covers 180/9 = 20 degrees). Concatenate the color features and the gradient histogram. Do this for a sufficient number of training samples. Then you can give this to a classifier for training.
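If it helps, here is a rough numpy translation of those feature-building steps (the bin counts, the RGB value range, and the seg_orient / seg_rgb arrays are assumptions for illustration):

import numpy as np

def orientation_histogram(seg_orient, n_bins=9):
    # atan-based orientations lie in (-90, 90) degrees -> 9 bins of 20 degrees each
    hist, _ = np.histogram(seg_orient, bins=n_bins, range=(-90.0, 90.0))
    return hist / max(hist.sum(), 1)          # normalise so segment size doesn't matter

def color_histogram(seg_rgb, n_bins=8):
    # seg_rgb: (N, 3) array of RGB values in [0, 255] for one depth segment
    hists = [np.histogram(seg_rgb[:, c], bins=n_bins, range=(0, 256))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / max(h.sum(), 1)

# Feature vector for one segment = colour histogram + orientation histogram;
# collect one such vector per training sample and feed them to the classifier.
# feature = np.concatenate([color_histogram(seg_rgb), orientation_histogram(seg_orient)])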
Edit: This is a reply to a comment below.
Regarding the maxDepth parameter in opencv_traincascade
The documentation says, "Maximal depth of a weak tree. A decent choice is 1, that is case of stumps". When you perform binary classification, the weak learner takes the form:
if yourFeatureValue>=learntThresh
class=1;
else
class=0;
end
The above type of classifier, which thresholds a single feature value (a scalar), is called a decision stump. There is only one split between the positive and the negative class (therefore maxDepth is 1). It works, for example, in the following scenario. Imagine you have a 1-D feature:
f=[1 2 3 4 -1 -2 -3 -4]
The first 4 are class 1, the rest are class 0. A decision stump gets 100% accuracy on this data by setting the threshold to zero. Now imagine a more complicated feature space such as:
f=[1 2 3 4 5 6 7 8 9 10 11 12];
The first 4 and the last 4 are class 1, the rest are class 0. Here you cannot get 100% classification accuracy with a decision stump; you need two thresholds/splits. Therefore, you can construct a tree with depth 2, and you will have 2^(2-1) = 2 thresholds. For depth = 3 you get 4 thresholds, for depth = 4 you get 8 thresholds, and so on. (Here I assume a tree with a single node has height 1.)
You may feel that the more levels you add, the higher the accuracy you can achieve, but then you run into overfitting (and increased computation, memory usage, etc.). Therefore, you have to choose a good value for the depth. I usually set it to 3.
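As a tiny illustration of the effect of the depth parameter, here is a scikit-learn sketch (DecisionTreeClassifier is only a stand-in for the weak trees used inside the boosted cascade):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

f = np.arange(1, 13).reshape(-1, 1)          # the 1-D feature [1 .. 12]
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1])

stump = DecisionTreeClassifier(max_depth=1).fit(f, y)   # one threshold
tree2 = DecisionTreeClassifier(max_depth=2).fit(f, y)   # up to two levels of thresholds

print(stump.score(f, y))   # < 1.0: a single split cannot separate the classes
print(tree2.score(f, y))   # 1.0: two splits (around 4.5 and 8.5) are enough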