Analysis of 3D image data as 2D

I have a TIFF image of roughly 10 GB. I need to perform object classification or pixel classification on this image. The image data is in zyx order, and my voxel size is x=0.6, y=0.6, z=1.2, where z is the depth of the object. My RAM cannot hold the whole image.
If I classify the pixels in each Z plane separately and then merge the results to get the final shape and volume of the object, would I lose any information, and would the final shape or volume be wrong?

@ankit agrawal, you have probably found the answer by now, but my advice would definitely not be that you simply need more memory.
I had a similar problem, and if anyone else comes across this, the options below will help.
Options
The answer about splitting into z planes alone is correct: you could lose information along z. The idea isn't a bad one, though. Instead, take Regions of Interest (ROIs), i.e. split your image into chunks along all axes (say x/2, y/2 and z/2) so that each chunk fits in memory, process the chunks, and stack the data back up later.
Use the library [Dask](https://dask.org/); it handles all of this for you. It's designed for parallelism and can be scaled on a single computer or a cluster. The dask.array module lets you work with large arrays as collections of chunked NumPy arrays. Even better, use dask-image (linked from the Dask documentation), which wraps many scipy.ndimage functions around dask.array. Finally, when the file is split appropriately, the computation can be faster because of the parallelism -- not always, but I have easily worked with 20 GB data sets on a laptop with 16 GB of RAM. The files were 8-bit, and many libraries and functions up-cast to float, which blows up your memory; Dask lets you keep a handle on all of it. If you stick to the core functions it will work fine; it gets harder when you work with mapped blocks.
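A minimal sketch of the chunked approach, assuming the dask and dask-image packages are installed; the file name, chunk shape, sigma and threshold are illustrative values, not recommendations:

from PIL import Image  # not needed here, just showing the usual stack
import dask_image.imread
import dask_image.ndfilters

# Lazily open the volume as a chunked dask array (z, y, x); nothing is loaded yet
volume = dask_image.imread.imread('volume.tif')

# Re-chunk so each block fits comfortably in RAM
volume = volume.rechunk((32, 512, 512))

# Apply a scipy.ndimage-style filter lazily, block by block
smoothed = dask_image.ndfilters.gaussian_filter(volume, sigma=(1, 2, 2))

# Example reduction: count foreground voxels without ever holding
# the whole volume in memory at once
foreground_voxels = (smoothed > 100).sum().compute()
print(foreground_voxels)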
I hope this helps if you still have this issue.

The issue with doing the classification in each z-plane individually is that you might not be able to classify objects with that sort of restricted information.
Think of it the same way as a 2D face-detection problem where you try to detect the face in each row individually: that is probably not going to be very robust, and you will lose valuable spatial information. In the end you'll probably end up with no detections to merge.
Solution proposal:
My advice would be to increase the size of your voxels until the volume can be processed by your processing unit, i.e. decrease the resolution of your data, and run a classification with a low confidence threshold. Then come back and run another classification only on the volumes that contained detections, this time aiming for a higher confidence threshold. This can be done iteratively as needed.
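A rough sketch of that coarse-to-fine idea, assuming a hypothetical classify() function that returns a per-voxel confidence map; the zoom factor and thresholds are placeholders:

import numpy as np
from scipy.ndimage import zoom

def coarse_to_fine(volume, classify, factor=4, low_thr=0.3, high_thr=0.7):
    """Two-pass classification: cheap low-resolution pass, then a
    full-resolution pass restricted to blocks with candidate detections."""
    # Pass 1: classify a downsampled copy with a low confidence threshold
    coarse = zoom(volume, 1.0 / factor, order=1)
    candidates = classify(coarse) > low_thr          # boolean mask of coarse hits

    # Pass 2: revisit only the candidate blocks at full resolution
    labels = np.zeros(volume.shape, dtype=bool)
    for z, y, x in zip(*np.nonzero(candidates)):
        block = (slice(z * factor, (z + 1) * factor),
                 slice(y * factor, (y + 1) * factor),
                 slice(x * factor, (x + 1) * factor))
        labels[block] = classify(volume[block]) > high_thr
    return labels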

I think breaking the image along any (x/y/z) plane rather defeats the point of the voxel representation, because the three-dimensional object is flattened and you lose the spatial relational data.
I think a couple of options are:
- Use a distributed computing cluster, like Hadoop.
- Look into storing the image in a geospatial database like GeoMesa, so it can be queried efficiently; then you only hold in memory what you need to train locally.
- 10 GB isn't that large, so perhaps upgrade your memory capacity?

Related

Can a neural network optimize traditional image processing algorithms?

I don't mean that a neural network should replace the traditional image processing algorithm. What I want to ask is whether there exists a kind of neural network that can take the parameters of the traditional method as input and output more universal parameters that don't require manual adjustment. Intuitively, my idea is less efficient than using neural networks directly, but I don't know much about the mathematics of neural networks.
If I understood correctly, what you mean is that for a traditional method (let's say thresholding), you want to find the best parameters using an ANN. It is possible, but you would have to supply so much training data, which needs to be created, processed and evaluated, that it will take a lot of time. AFAIK many mobile phones with AI-assisted cameras use this approach to find the best aperture, exposure, etc.
First of all, thank you very much. I still have two things to figure out. First, if I wanted to get a (or a set of) relatively optimal parameters, what data set would I need to build (for example, some kind of error between input and output and the threshold)? Second, taking your example, is selecting the optimal threshold through a neural network more efficient or better in practice than traversal (exhaustive search) or Otsu? To be honest, I wonder whether this is really more efficient than training on the input and output directly with a neural network.
For your second question: Otsu only works in cases where the histogram has two distinct peaks. Thresholding is a simple function, but the cut-off value is based on your objective; there is no single "best" value valid for every case. So if you want to train a model for thresholding, I think you have to come up with separate models for each case (a model for thresholding bright objects, another for darker ones, etc.). Maybe an additional output parameter for determining the aim would work, but I am not sure. Will it be more efficient and better? It depends on the case (and your definition of better). Otsu, traversal or adaptive thresholding does not work all the time (actually Otsu has very specific use cases). If they work for your case, excellent; if not, things get messy. So to answer your question: it depends on the problem at hand.
For the first question: to be fair, it is quite difficult to work with raw images in traditional ANNs. Images have a lot of pixels, so standard ANNs struggle with such large inputs. Moreover, when the location or scale of an object in the image changes, the whole pixel data changes even though the content is the same (these are the reasons why CNNs are superior to plain ANNs for images). For these reasons it is better to use processed metrics which contain condensed, location-invariant information. E.g. for thresholding, you can feed in the histogram and have the network return a threshold value. You would then need an ANN with 256 input neurons (for the intensity histogram of an 8-bit grayscale image), 1 output neuron, and 1-2 hidden layers of densely connected neurons (maybe 128 each). Your training data will be a bunch of histograms as input and the corresponding best threshold value for each histogram. Once training is finished, you can give the ANN a histogram it has never seen before and it will tell you the optimal threshold value based on its training.
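A minimal sketch of that network in Keras (layer sizes follow the description above; hist_train and thr_train are assumed to be data you prepare yourself):

from tensorflow import keras

# 256 histogram bins in, one threshold value out, two dense hidden layers
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(256,)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# hist_train: (N, 256) normalized histograms; thr_train: (N,) "best" thresholds
# model.fit(hist_train, thr_train, epochs=50, validation_split=0.1)

# After training: predict a threshold for an unseen histogram
# predicted_threshold = model.predict(new_histogram.reshape(1, 256))[0, 0]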
What I want is a model that can output different parameters (or parameter sets) based on different input images, so I think that with a good enough data set it should be somewhat universal.
Most likely, but your data set should be quite inclusive of expected images (in terms of metrics and features), which means it has to be large.
Also, I don't know much about modeling -- can I use a function of the output/parameters (which might be a function of the traditional method's result) as the error in back-propagation, by creating a custom loss function?
I think so, but training the model will be more involved than using predefined loss functions because, well, you have to write them, and you also have to test that they work as expected.
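For illustration, a minimal sketch of plugging a custom loss into Keras; the penalty term here is made up purely to show the mechanism, not taken from any particular method:

import tensorflow as tf

def custom_parameter_loss(y_true, y_pred):
    # Ordinary squared error on the predicted parameters...
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    # ...plus an example penalty keeping a threshold-like parameter in [0, 255]
    range_penalty = tf.reduce_mean(tf.nn.relu(y_pred - 255.0) + tf.nn.relu(-y_pred))
    return mse + 0.1 * range_penalty

# model.compile(optimizer='adam', loss=custom_parameter_loss)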

DBSCAN: How to Cluster Large Dataset with One Huge Cluster

I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI and that took an hour. I have used Spark before but unfortunately it does not have DBSCAN available.
Therefore, my first question is if anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way?
Next, the nature of my data is that ~85% of it lies in one huge cluster (this is for anomaly detection). The only technique I have been able to come up with that lets me process more data is to replace a big chunk of that huge cluster with a single data point in a way that it can still reach all its neighbours (the deleted chunk is smaller than epsilon).
Can anyone provide any tips on whether I'm doing this right, or whether there is a better way to reduce the complexity of DBSCAN when you know that most of the data is in one cluster centered around (0.0, 0.0)?
Have you added an index to ELKI, and tried the parallel version? Except for the git version, ELKI will not automatically add an index; and even then, fine-tuning the index for the problem can help.
DBSCAN is not a good approach for anomaly detection - noise is not the same as anomalies. I'd rather use a density-based anomaly detection. There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.
If you already know that most of your data is in one huge cluster, why don't you directly model that big cluster, and remove it / replace it with a smaller approximation.
Subsample. There usually is next to no benefit to using the entire data. Even (or in particular) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data into, e.g., 32 subsets, clustering each of these subsets, and joining the results back together. These 32 parts can be trivially processed in parallel on separate cores or computers; and because the underlying problem is quadratic in nature, the speedup will be anywhere between 32 and 32*32=1024.
This in particular holds for DBSCAN: larger data usually means you also want to use much larger minPts. But then the results will not differ much from a subsample with smaller minPts.
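A minimal sketch of that split-and-cluster strategy with scikit-learn's DBSCAN (eps, min_samples and the number of parts are placeholders; reconciling clusters that span several parts is deliberately left out):

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_in_parts(X, n_parts=32, eps=0.05, min_samples=10):
    rng = np.random.default_rng(0)
    parts = np.array_split(rng.permutation(len(X)), n_parts)
    labels = np.full(len(X), -1, dtype=int)   # -1 = noise, as in sklearn
    offset = 0
    for part in parts:                        # each part could run on its own core
        part_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[part])
        keep = part_labels >= 0
        labels[part[keep]] = part_labels[keep] + offset  # keep cluster ids unique
        if keep.any():
            offset += part_labels.max() + 1
    return labels   # merging clusters across parts is a separate step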
But by any means: before scaling to larger data, make sure your approach solves your problem, and is the smartest way of solving this problem. Clustering for anomaly detection is like trying to smash a screw into the wall with a hammer. It works, but maybe using a nail instead of a screw is the better approach.
Even if you have "big" data, and are proud of doing "big data", always begin with a subsample. Unless you can show that the result quality increases with data set size, don't bother scaling to big data, the overhead is too high unless you can prove value.

Training with duplicates in dataset

I have a dataset of images for classification purposes. The dataset is very large and most of the images are duplicates of each other. So essentially, the same image occurs multiple times. Moreover, the dataset is unbalanced.
I understand the motivation of cleaning the dataset of duplicates. But it is extensive and very time consuming to do so.
Is there a way to train a net on this dataset, and not overfit the model?
Could enforcing harsher regularization, dropout, or loss penalties still produce a usable model?
As suggested by Jon.H in the comments, instead of training your model on a dataset with duplicates, you could use image hashing to detect and remove them from the dataset. Although cryptographic hashing (like MD5 and SHA-1) will suffice to find exact duplicates, according to your comment you would also like to get rid of similar images, not just exact duplicates (do you really want to do this? Having a bigger dataset is usually better for training, and keeping similar images with small variations, e.g. in color, is not necessarily a bad thing -- see "data augmentation").
"Generating a hash for images is not robust to slight changes in pixel values, say minor lighting changes which aren't visible to the eye but the pixel value differs." - Ronica Jethwa
One solution to this is to use perceptual hashing, which is quite robust to minor differences in color, rotation, aspect ratio, etc. In particular, I would suggest you try the pHash algorithm based on the Discrete Cosine Transform, as described in Looks-Like-It. There is a Python library that implements it, called imagehash. Here's how to use it:
from PIL import Image
import imagehash
# Compute the perceptual hash values (64 bits) for two images
phash_1 = imagehash.phash(Image.open('image_1')) # e.g. d58e11ce51ee15aa
phash_2 = imagehash.phash(Image.open('image_2')) # e.g. d58e01ae519e559e
# Compare the images using the Hamming distance of their perception hashes
dist = phash_1 - phash_2
Then it's up to you to choose the similarity threshold for the Hamming distance.
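For example, images whose hashes differ by only a few bits can be treated as duplicates; the cut-off of 8 bits below is just a common starting point, not a universal value:

max_distance = 8                      # Hamming distance cut-off (illustrative)
if phash_1 - phash_2 <= max_distance:
    print('Probably duplicates')
else:
    print('Probably distinct images')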
Duplicates don't imply over-fitting; they just give that image more weight in the training. Yes, you can train on the data set; the results will be valid. For instance, if you have the same quantity of duplicates (say, 10 of everything), then you'll get the same results as if you had just one of each -- or almost: the shuffling order can slightly affect the balance of training, since a single image can now appear multiple times near the start of epoch 1.
The various counter-measures you list are good tools against over-fitting, but your main danger is simply what you have anyway: the potential downside of a small set of unique examples.
Adding my two cents to this old question.
During training the problem arises only if you have a high chance of having many duplicates in a single batch.
Let's say you choose a batch size of 64; since you will randomly sample the images to compose the batch, it could be that on average you have only 2 duplicates. This really depends on how many times (on average) an image is duplicated in proportion to the total number of images.
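If you want a feel for the numbers, a quick simulation like the one below (dataset composition and batch size are made-up values) estimates how many duplicate images land in a batch on average:

import numpy as np

n_unique, copies_each, batch_size = 10_000, 5, 64
dataset = np.repeat(np.arange(n_unique), copies_each)   # 50k images, 5 copies each

rng = np.random.default_rng(0)
extras = []
for _ in range(1_000):
    batch = rng.choice(dataset, size=batch_size, replace=False)
    extras.append(batch_size - len(np.unique(batch)))   # repeated images in the batch
print(np.mean(extras))   # average number of duplicate slots per batch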
Anyway, the problem is alleviated by using (online) data augmentation, which introduces some differences even between identical images.
The biggest problem is on the test set because the accuracy estimation will be biased towards the images with more duplicates, so I would embrace the effort and deduplicate the test (and validation) sets.
If you have the same images in the validation set as in the training set, but different ones in the test set, validation will give a better (accuracy) score than test; in that case it will look like overfitting. Duplicates occur naturally everywhere, so some amount of them should be okay.
Train with the duplicate data, then use the representation vectors, i.e. the output of the last convolutional layer (if you are using a pretrained CNN model, use its final output). Apply kNN or clustering on the representation vectors to identify duplicates, remove the duplicates, and retrain your model.
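A rough sketch of that idea with a pretrained CNN and nearest-neighbour search; the ResNet50 backbone, the cosine metric and the 0.05 cut-off are illustrative choices, and `images` is assumed to be a preprocessed (N, 224, 224, 3) array you have already loaded:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# Pretrained backbone used only as a feature extractor
backbone = ResNet50(weights='imagenet', include_top=False, pooling='avg')
embeddings = backbone.predict(preprocess_input(images))   # shape (N, 2048)

# For each image, find its closest other image in embedding space
nn = NearestNeighbors(n_neighbors=2, metric='cosine').fit(embeddings)
dists, _ = nn.kneighbors(embeddings)
near_duplicates = np.where(dists[:, 1] < 0.05)[0]   # indices of likely duplicates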

Do you have any suggestions for a Machine Learning method that may actually learn to distinguish these two classes?

I have a dataset whose classes overlap a lot. So far my results with SVM are not good. Do you have any recommendations for a model that may be able to differentiate between these 2 classes?
[Figure: scatter plot of both classes]
It is easy to fit this dataset by interpolating one of the classes and predicting the other class everywhere else. The problem with this approach is that it will not generalize well. The question you have to ask yourself is whether you can predict the class of a point given its attributes. If not, then every ML algorithm will also fail to do so.
In that case, the only reasonable thing you can do is to collect more data and more attributes for every point. Maybe by adding a third dimension you can separate the data more easily.
If the data overlaps this much, the points should effectively belong to the same class, but we know they do not. So there must be some feature(s) or variable(s) that separate these data points into two classes. Try to add more features to the data.
And sometimes, just transforming the data into a different scale can help.
The classes need not be equally distributed; a skewed data distribution can be handled separately.
First of all, what is your criterion for "good results"? What kind of SVM did you use? A simple linear kernel will certainly fail for most notions of "good", but a seriously convoluted Gaussian (RBF) kernel might dredge something out of the handfuls of contiguous points in the upper regions of the plot.
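If you want to probe that quickly, something like the following scikit-learn search over an RBF kernel is a cheap first step (X and y are assumed to be your 2D points and class labels; the grid values are arbitrary starting points):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)   # X: (n_samples, 2) features, y: class labels
print(search.best_params_, search.best_score_)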
I suggest that you run some basic statistics on the data you've presented, to see whether they're actually as separable as you'd want. I suggest a T-test for starters.
If you have other dimensions, I strongly recommend that you use them. Start with the greatest amount of input you can handle, and reduce from there (principal component analysis). Until we know the full shape and distribution of the data, there's not much hope of identifying a useful algorithm.
That said, I'll make a pre-emptive suggestion that you look into spectral clustering algorithms when you add the other dimensions. Some are good with density, some with connectivity, while others key on gaps.

Are GPUs good for case-based image filtering?

I am trying to figure out whether a certain problem is a good candidate for using CUDA to put the problem on a GPU.
I am essentially doing a box filter that changes based on some edge detection. So there are basically 8 cases that are tested for each pixel, and then the rest of the operations happen -- typical mean calculations and such. Is the presence of these switch statements in my loop going to make this problem a bad candidate for the GPU?
I am not sure really how to avoid the switch statements, because this edge detection has to happen at every pixel. I suppose the entire image could have the edge detection part split out from the processing algorithm, and you could store a buffer corresponding to which filter to use for each pixel, but that seems like it would add a lot of pre-processing to the algorithm.
Edit: Just to give some context - this algorithm is already written, and OpenMP has been used to pretty good effect at speeding it up. However, the 8 cores on my development box pales in comparison to the 512 in the GPU.
Edge detection, mean calculations and cross-correlation can all be implemented as 2D convolutions, and convolutions can be implemented very effectively on the GPU (speed-ups > 10x, up to 100x relative to the CPU), especially for large kernels. So yes, it may make sense to rewrite your image filtering for the GPU.
I wouldn't use the GPU as a development platform for such a method, though.
Typically, unless you are on the newer CUDA architectures, you will want to avoid branching. Because GPUs are basically SIMD machines, the pipeline is extremely vulnerable to, and suffers tremendously from, stalls due to branch misprediction.
If you think there are significant benefits to be gained by using a GPU, do some preliminary benchmarks to get a rough idea.
If you want to learn a bit about how to write non-branching code, head over to http://cellperformance.beyond3d.com/ and have a look.
Furthermore, investigating whether this problem can run on multiple CPU cores might also be worth it, in which case you will probably want to look into either OpenCL or the Intel performance libraries (such as TBB).
Another go-to source for problems targeting the GPU, be it graphics, computational geometry or otherwise, is IDAV, the Institute for Data Analysis and Visualization: http://idav.ucdavis.edu
Branching is actually not that bad, if there is spatial coherence in the branching. In other words, if you are expecting chunks of pixels next to each other in the image to go through the same branch, the performance hit is minimized.
Using a GPU for processing can often be counter-intuitive; things that are obviously inefficient if done in normal serial code, are actually the best way to do it in parallel using the GPU.
The pseudo-code below looks inefficient (since it computes 8 filtered values for every pixel) but will run efficiently on a GPU:
# Compute the 8 possible filtered values for each pixel
for i = 1..8
    # filter[i] is the box filter that you want to apply
    # to pixels of the i'th edge-type
    result[i] = GPU_RunBoxFilter(filter[i], Image)

# Compute the edge type of each pixel
# This is the value you would normally use to 'switch' with
edge_type = GPU_ComputeEdgeType(Image)

# Set up an empty result image
final_result = zeros(sizeof(Image))

# For each possible switch value, replace all pixels of that edge-type
# with its corresponding filtered value
for i = 1..8
    final_result = GPU_ReplacePixelIfTrue(final_result, result[i], edge_type == i)
Hopefully that helps!
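The same compute-everything-then-select pattern can be prototyped on the CPU with NumPy before touching CUDA. In this sketch the eight box filters are stood in for by uniform filters of different sizes, and edge_type is assumed to be an integer image with values 0..7:

import numpy as np
from scipy.ndimage import uniform_filter

def branch_free_filter(image, edge_type, sizes=(3, 5, 7, 9, 11, 13, 15, 17)):
    # Compute all eight candidate filtered images up front
    candidates = [uniform_filter(image, size=s) for s in sizes]
    # Per pixel, keep the candidate matching that pixel's edge type (0..7);
    # the default is never used because edge_type always falls in that range
    return np.select([edge_type == i for i in range(len(sizes))],
                     candidates, default=0)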
Yep, control flow usually carries a performance penalty on the GPU, be it if / switch / ternary operators, because with control-flow operations the GPU can't run its threads optimally. So the usual tactic is to avoid branching as much as possible. In some cases an if can be replaced by a formula, where the if conditions map to formula coefficients. But the concrete solution/optimization depends on the concrete GPU kernel... Maybe you can show the exact code, so it can be analyzed further by the Stack Overflow community.
EDIT:
Just in case you are interested, here is a convolution pixel shader that I wrote.
