How is 3d convolution carried out in practice? - machine-learning

So in 2d convolution when I define a 3x3 kernel, the operation is actually carried out using a 3x3xn kernel, n being the number of input channels.
Is this the same in 3d convolution? That is to say, if I define a 3x3x3 kernel on an input of dimensions (128,128,128,3) (width,height,depth,channels), then is the operation carried out with a kernel of dimensions 3x3x3x3 where the last three is determined by the number of input channels?

This is a good question. Stereo 3D cameras work by capturing two flat images side by side. I'm not sure how that would look in tensor form, but you would need the typical 1080x1080x3 dimensions for one photo and the same for the other photo, and the two would have to be associated with each other somehow. Facebook also recently released a library for working with 3D data called PyTorch3D.
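To the question as asked: in the usual deep-learning convention (an assumption here, since the question does not name a framework), yes. Each filter of a 3x3x3 layer spans all input channels, so on a (128,128,128,3) input a single filter holds 3x3x3x3 weights and produces one output value per spatial position. Below is a minimal NumPy sketch of that, on a smaller 8x8x8x3 volume to keep the loops fast. The function name `conv3d_single_filter` is made up for illustration, and the operation is the unflipped cross-correlation that most libraries call "convolution".

```python
import numpy as np

def conv3d_single_filter(volume, kernel):
    """Valid, stride-1 3D cross-correlation with one multi-channel filter.

    volume: (W, H, D, C), kernel: (kw, kh, kd, C) -> output (W-kw+1, H-kh+1, D-kd+1)
    """
    W, H, D, C = volume.shape
    kw, kh, kd, kc = kernel.shape
    assert kc == C, "the filter must span every input channel"
    out = np.zeros((W - kw + 1, H - kh + 1, D - kd + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                patch = volume[x:x + kw, y:y + kh, z:z + kd, :]
                # Sum over the three spatial axes AND the channel axis.
                out[x, y, z] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8, 3))   # small stand-in for (128, 128, 128, 3)
kernel = rng.standard_normal((3, 3, 3, 3))   # 3x3x3 spatial extent x 3 input channels
out = conv3d_single_filter(volume, kernel)

# Equivalent view: convolve each channel with its own 3x3x3 slice, then add them up.
per_channel_sum = sum(
    conv3d_single_filter(volume[..., c:c + 1], kernel[..., c:c + 1]) for c in range(3)
)
print(out.shape, np.allclose(out, per_channel_sum))
```

A layer with several output filters simply stacks several such 3x3x3x3 kernels, one per output channel.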

Related

Backward pass on convolutional layer with 3-channel images

I know that deconvolution is basically convolution of the output with flipped filters, and I have implemented it for 2D data. But I am not able to generalize it to 3D data. For example, consider an input of dimension 3x5x5, a filter of dimension 3x3x3, and a stride of 1. So the output will be of dimension 1x3x3. What I don't understand is how to calculate the deconvolution for this output. The flipped filter will again be of dimension 3x3x3 and the output of the convolution is of dimension 1x3x3, which are incompatible for convolution. So how can we calculate the deconvolution?
Perhaps this post will help you out a bit. You are correct in saying that a filter of the same size cannot fit the deconvolution dimensions. To remedy that, the 1x3x3 output is padded (with zeros, mean values, nearest-neighbour values, etc.) until it reaches the size you require. Depth can be handled the same way. In your example, you want a 3x3x3 filter to 'deconvolve' the 1x3x3 into a 3x5x5. So we pad the 1x3x3 out to 5x7x7, that is, by the filter size minus one (here 2) on each side of every axis, and apply the filter with stride 1, which gives (5-3+1)x(7-3+1)x(7-3+1) = 3x5x5. If what you need is the exact gradient for the backward pass, the padding has to be zeros. When the same procedure is used to reconstruct an input instead, there are definite drawbacks, stemming from the fact that you're trying to extrapolate more information from less.
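For the backward-pass case, the padding recipe above can be checked numerically. The sketch below assumes a stride-1, unpadded forward pass computed as cross-correlation, zero padding of the upstream gradient, and a filter flipped along all three axes. The helper `correlate_valid` is made up for this example. The result is compared against a direct "scatter" accumulation of the gradient.

```python
import numpy as np

def correlate_valid(x, w):
    """Valid, stride-1 cross-correlation of a 3D array x with a 3D filter w."""
    od, oh, ow = (x.shape[i] - w.shape[i] + 1 for i in range(3))
    out = np.zeros((od, oh, ow))
    for d in range(od):
        for i in range(oh):
            for j in range(ow):
                window = x[d:d + w.shape[0], i:i + w.shape[1], j:j + w.shape[2]]
                out[d, i, j] = np.sum(window * w)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))    # input: 3 channels of 5x5
w = rng.standard_normal((3, 3, 3))    # filter
y = correlate_valid(x, w)             # forward output, shape (1, 3, 3)
dy = rng.standard_normal(y.shape)     # upstream gradient, same shape as y

# Backward pass: zero-pad dy by (filter size - 1) on each side of every axis,
# giving (5, 7, 7), then correlate with the filter flipped along all three axes.
dy_padded = np.pad(dy, [(k - 1, k - 1) for k in w.shape])
dx = correlate_valid(dy_padded, w[::-1, ::-1, ::-1])   # shape (3, 5, 5)

# Reference: each output element sends dy * w back to the window it was read from.
dx_ref = np.zeros_like(x)
for d in range(y.shape[0]):
    for i in range(y.shape[1]):
        for j in range(y.shape[2]):
            dx_ref[d:d + 3, i:i + 3, j:j + 3] += dy[d, i, j] * w

print(dy_padded.shape, dx.shape, np.allclose(dx, dx_ref))
```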

Having a neural network output a gaussian distribution rather than one single value?

Let's consider I have a neural network with one single output neuron. To outline the scenario: the network gets an image as input and should find one single object in that image. For simplifying the scenario, it should just output the x-coordinate of the object.
However, since the object can be at various locations, the network's output will certainly have some noise on it. Additionally the image can be a bit blurry and stuff.
Therefore I thought it might be a better idea to have the network output a Gaussian distribution of the object's location.
Unfortunately I am struggling to model this idea. How would I design the output? A flattened 100-dimensional vector if the image has a width of 100 pixels, so that the network can fit a Gaussian distribution in this vector and I just need to locate the peak to get the object's approximate location?
Additionally, I am failing to figure out the cost function and teacher signal. Would the teacher signal be a perfect Gaussian distribution centered on the exact x-coordinate of the object?
How to model the cost function, then? Currently I have a softmax cross entropy or simply a squared error: network's output <-> real x coordinate.
Is there maybe a better way to handle this scenario? Like a better distribution or any other way to have the network not output a single value without any information of the noise and so on?
Sounds like what you really need is a convolutional network.
You could train a network to recognize your target object when it's positioned in the center of the network's receptive field. You can then create a moving window, at each step feeding the portion of the larger image under that window into the net. If you keep track of the outputs of the trained network for each (x,y) position of the window, some locations of the window will produce better matches than others. Once you've covered the whole image, you can pick the position with the maximum network output as the position where the target object is most likely located.
To handle scale and rotation variations, consider creating an image pyramid, or sets of images at different scales and rotations that are versions of the original image. Then scan over those images to find the target object.
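A rough sketch of the moving-window search described above. The trained network is not available here, so `score_window` is a hypothetical stand-in that scores a patch by its negative squared difference from a fixed template. In practice that call would be replaced by the network's output for the patch.

```python
import numpy as np

def score_window(patch, template):
    # Hypothetical stand-in for the trained network's output: higher is a better match.
    return -np.sum((patch - template) ** 2)

def sliding_window_argmax(image, template):
    """Slide a template-sized window over the image and keep the best-scoring position."""
    th, tw = template.shape
    best_score, best_pos = -np.inf, None
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            s = score_window(image[y:y + th, x:x + tw], template)
            if s > best_score:
                best_score, best_pos = s, (y, x)
    return best_pos, best_score

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 0.2, size=(40, 60))   # dim background
template = np.ones((5, 5))                      # bright square as the "object"
image[12:17, 30:35] = template                  # plant the object at row 12, column 30

best_pos, best_score = sliding_window_argmax(image, template)
print(best_pos)  # (12, 30)
```

For the pyramid variant, the same search would be repeated on rescaled or rotated copies of the image, keeping the best score across all copies.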

Noise and Blur in Cuda

I'm trying to add noise and blur functions to my project in CUDA, and after quite some research I've hit a bit of a stumbling block. I've read up on the Gaussian blur matrix, but I'm still having trouble getting a working piece of code that can blur certain parts of an image. I have managed to get a form of noise to show.
If anyone could help, either by explaining how to implement a Gaussian or a simpler blur method, or by providing a bit of code that implements blurring, it would be greatly appreciated!
Gaussian blur is a separable filter, so you can apply the 1D kernel first to all the rows in your ROI and then to the columns of the blurred rows.
The tricky part with CUDA is that this is a neighbourhood operation, so typically you will need to have each block overlap by half the kernel size in order to get the required neighbourhood pixels into shared memory.
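The row-then-column idea can be prototyped on the CPU before writing the kernels. This is a NumPy sketch, not CUDA code. The `radius` of padding added around the region plays the same role as the overlap each CUDA block needs, and the edge-replicating border is an assumption (other border policies work the same way).

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1D Gaussian with 2 * radius + 1 taps."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def blur_separable(image, sigma, radius):
    """Gaussian blur as two 1D passes: along every row, then along every column."""
    k = gaussian_kernel_1d(sigma, radius)
    # Pad by the kernel radius so border pixels have a full neighbourhood.
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

rng = np.random.default_rng(0)
image = rng.random((32, 48))
blurred = blur_separable(image, sigma=1.5, radius=4)
print(blurred.shape)
```

With a kernel of 2r+1 taps, the two passes cost about 2(2r+1) multiplications per pixel instead of (2r+1)^2 for the full 2D kernel.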
FYI, these are two separate questions and should be asked separately on this site.
Regarding the blur - for large blur kernels (strong blurs) the best approach is to take the FFT of the image and of a Gaussian kernel image, multiply the results element-wise (complex multiplication), and apply the inverse FFT to the product. You will have to implement an FFT-shift function yourself, and if you are using color images you will have to split the image into a separate buffer per channel.
For small blur kernels (gentle blurs) the simplest approach is, for each pixel in the result image, to sum the nearby pixels in the source image weighted by a Gaussian weight function.
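The FFT route for large kernels can be sketched in NumPy as follows. This is a CPU illustration of the steps, not CUDA code. The function names are made up for this example, and the shift is done with `np.fft.ifftshift`, which moves the kernel's centre to index (0, 0) so the blurred image is not translated.

```python
import numpy as np

def gaussian_kernel_image(shape, sigma):
    """Gaussian the same size as the image, centred at (h // 2, w // 2), summing to 1."""
    h, w = shape
    y = np.arange(h) - h // 2
    x = np.arange(w) - w // 2
    g = np.exp(-(y[:, None] ** 2 + x[None, :] ** 2) / (2.0 * sigma**2))
    return g / g.sum()

def blur_fft(image, sigma):
    """Blur one channel by multiplying spectra (this is a circular convolution)."""
    kernel = gaussian_kernel_image(image.shape, sigma)
    # Move the kernel centre to the origin before transforming it.
    kernel_fft = np.fft.fft2(np.fft.ifftshift(kernel))
    product = np.fft.fft2(image) * kernel_fft        # complex element-wise product
    return np.real(np.fft.ifft2(product))

image = np.zeros((64, 96))
image[10, 20] = 1.0                                   # single bright pixel
blurred = blur_fft(image, sigma=3.0)
peak = np.unravel_index(np.argmax(blurred), blurred.shape)
print(peak, blurred.sum())
```

Because the product of spectra corresponds to a circular convolution, content near one edge leaks into the opposite edge. Padding the image before the transform avoids that. A color image would be processed one channel at a time.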
Regarding the noise - the easiest approach is to pre-generate an image of pseudo-random numbers, transform it from uniformly distributed to normally distributed values, and load it into CUDA. See e.g. this question.
A correctly sized region of the random image should then be multiplied by the noise sigma and added to the source image to obtain the result.
Last time I checked there was no random buffer generation solution for CUDA, however, that was a few years ago.
Update: CUDA now has cuRAND, so you should be able to generate random numbers on the device instead of using a pre-generated random buffer.
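A NumPy sketch of the pre-generated-buffer approach, assuming the Box-Muller transform as the way to turn uniform numbers into normal ones (the answer does not say which transform it has in mind). The function names are made up for this example.

```python
import numpy as np

def uniform_to_normal(u1, u2):
    """Box-Muller transform: two independent uniform arrays -> one standard normal array."""
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

def add_gaussian_noise(image, noise_field, sigma):
    """Scale a correctly sized region of the noise buffer by sigma and add it to the image."""
    h, w = image.shape
    return image + sigma * noise_field[:h, :w]

rng = np.random.default_rng(0)
u1 = 1.0 - rng.random((256, 256))   # values in (0, 1], so log(u1) is always finite
u2 = rng.random((256, 256))
noise_field = uniform_to_normal(u1, u2)   # pre-generated buffer, reusable across images

image = np.full((100, 120), 0.5)
noisy = add_gaussian_noise(image, noise_field, sigma=0.1)
print(noisy.shape, noise_field.mean(), noise_field.std())
```

Reusing the same region of the buffer for every frame produces a fixed noise pattern, so a different offset into the buffer per frame is a common workaround.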

The blurring kernel of a low-quality camera

I am doing some image enhancement experiments, so I take photos with my cheap camera. The camera has mosaic artifacts and all images look like a grid. I think a pillbox (out-of-focus) kernel or a Gaussian kernel would not be the best candidates. Any suggestions?
EDIT:
Sample
I suspect this cannot be done via a constant kernel, because the effects on pixels are not the same (so there are "grids").
The effects are nonlinear (and probably non-stationary), so you cannot simply invert the convolution and enhance the image; if you could, the camera chip would do it on-board.
The best way to work out what the convolution is (or at least an approximation to it) might be to take photos of known patterns, compute the spectra of the patterns and of the photos, and, working in the 2D frequency (Fourier/Laplace) domain, divide the resulting spectra to get a linear approximation to the filter.
I suspect that the convolution you discover by doing this will be very context-dependent, so the best way to enhance an image might be to divide it into tiles, classify each region of the image as belonging to a different set (for each of which you could work out a different linear approximation to the convolution, based on test data), and then deconvolve each separately.
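The spectral division step might look like the sketch below. It runs on synthetic data: the "photo" is produced by circularly convolving a known random pattern with a small made-up kernel, so the linear assumption holds exactly, which it will not for the real camera. The `eps` term is an added regularizer (an assumption, not part of the answer) that keeps the division stable where the pattern has little energy.

```python
import numpy as np

def estimate_filter(pattern, photo, eps=1e-3):
    """Estimate the filter's frequency response H from Y = H * X by regularized division."""
    X = np.fft.fft2(pattern)
    Y = np.fft.fft2(photo)
    # Equivalent to Y / X when eps is negligible, but safe where |X| is close to zero.
    return Y * np.conj(X) / (np.abs(X) ** 2 + eps)

rng = np.random.default_rng(0)
pattern = rng.standard_normal((32, 32))     # known test pattern with a broad spectrum

true_kernel = np.zeros((32, 32))            # made-up blur: the pixel plus two neighbours
true_kernel[0, 0], true_kernel[0, 1], true_kernel[1, 0] = 0.6, 0.2, 0.2
photo = np.real(np.fft.ifft2(np.fft.fft2(pattern) * np.fft.fft2(true_kernel)))

H = estimate_filter(pattern, photo, eps=1e-9)   # tiny eps: the synthetic photo is noise-free
est_kernel = np.real(np.fft.ifft2(H))
print(np.round(est_kernel[:2, :2], 3))
```

With real photos, a larger `eps` and averaging the estimate over several patterns would be needed, and per the tiling suggestion above a separate estimate could be made for each class of region.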

Projecting from 60D (shape context) space to 2D for visual analysis

I have a set of 60D shape context vectors. These were constructed using a sample of 400 edge points from a silhouette using 5 radial bins and 12 angular bins (thus, I have 400 shape context vectors of 60D).
I would like to analyse just how descriptive these vectors are in representing the overall shape of the underlying silhouette. To do this, I would like to project the 60D shape context vectors back into 2D space and visually inspect the result -- what I am hoping to see is a set of points that roughly resemble the original silhouette's shape.
One approach is to project onto the first two principal components (PCA). With my implementation, the projected points did not resemble the silhouette's shape. I can see two main reasons for this (assuming for the time being that my implementation is correct): (1) shape context is either not appropriate as a descriptor for these silhouettes, or its parameters need to be better tuned; (2) this analysis method is flawed or not valid.
My question is whether this is the right approach for analysing the descriptiveness of shape contexts in relation to my silhouette's shape? If not, can someone please explain why and propose an alternative method?
Thanks,
Josh
A good way to check whether features are descriptive is to train a classifier (SVM, naive Bayes, decision tree, whatever) on them and check its cross-validated precision, recall, etc. You can also filter your feature vector with a feature selector such as chi-squared or information gain.
Other than PCA, you can visualize your data with a self-organizing map (SOM), or by clustering.
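For reference, the PCA projection discussed in the question can be written in a few lines of NumPy. The `descriptors` array below is random placeholder data standing in for the 400 shape context vectors of 60 dimensions, so the output only shows the mechanics.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their first principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # fraction of variance per component
    return centered @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
descriptors = rng.random((400, 60))          # placeholder for the real 400 x 60 descriptors
projected, explained = pca_project(descriptors)
print(projected.shape, explained)
```

The two output axes are the directions of largest variance in descriptor space, not image coordinates, so nearby points in the plot are edge points with similar descriptors rather than edge points that are close together on the silhouette.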
I think this analysis method is flawed/not valid. I think this would be a similar reasoning: I can reconstruct the view from above on a football field by doing PCA on what each football player sees. It just isn't reasonable to expect that.
I think the simplest way to analyze the descriptiveness of shape context is to download MNIST or some other databases of written digits, and compute the 10x10 matrix of the shape similarities of 5 ones and 5 twos, and then draw this graph using (say) graphviz.
