How big of a Dataset can ELKI handle? - machine-learning

I have 100,000 points that I would like to cluster using the OPTICS algorithm in ELKI. I have a upper triangular distance matrix of about 5 billion entries for this point set. In the format that ELKI wants the matrix, it will take about 100GB in memory. I am wondering does ELKI handle that sort of data load? Can any one confirm if you have made this work before?

I frequently use ELKI with 100k points, up to 10 million.
However, for this to be fast you should use indexes.
For obvious reasons, any dense matrix based approach will scale at best O(n^2), and need O(n^2) memory. Which is why I cannot process these data sets with R or Weka or scipy. They usually first try to compute the full distance matrix, and either fail halfway through, or run out of memory halfway through, or fail with negative allocation size (Weka, when your data set overflows the 2^31 positive integers, i.e. is around 46k objects).
In the binary format, with float precision, the ELKI matrix format should be around 100000*999999/2*4 + 4 bytes, maybe add another 4 bytes for size information. This is 20 GB.
If you use the "easy to use" ascii format, then it will indeed be more. But if you use gzip compression it may end up being about the same size. It's common to have gzip compress such data to 10-20% of the raw size. In my experience gzip compressed ascii can be as small as binary encoded doubles.
The main benefit of the binary format is that it will actually reside on disk, and memory caching will be handled by your operating system.
Either way, I recommend to not compute distance matrixes at all in the first place.
Because if you decide to go from 100k to 1 million, the raw matrix would grow to 2 TB, and when you go to 10 million it will be 200 TB. If you want double precision, double that.
If you are using distance matrixes, your method will be at best O(n^2), and thus not scale. Avoiding computing all pairwise distances in the first place is an important speed factor.
I use indexes for everything. For kNN or radius bound approaches (for OPTICS, use the epsion parameter to make indexes effective! Choose a low epsilon!) you can precompute these queries once, if you are going to need them repeatedly.
On a data set I frequently use, with 75k instances and 27 dimensions, the file storing the precomputed 101 nearest neighbors + ties, with double precision, is 81 MB (note: this can be seen as a sparse similarity matrix). By using an index for precomputing this cache, it takes just a few minutes to compute; and then I can ran most kNN based algorithms such as LOF on this 75k dataset in 108 ms (+262 ms for loading the kNN cache + parsing the raw input data 2364 ms, for a total runtime of 3 seconds; dominated by parsing double values).

Related

What are some augmentation techniques when I only have 3 numbers for each datapoint?

I have 300 datapoints/samples.
They are either 'healthy' or 'diseased'.
For each point I have some NIR spectral data which is made up of 100 different wavelengths.
I have augmented this with multiplicative noise.
For each point I also have 3 numbers (elemental data) which indicate the quantity of Oxygen, Nitrogen, and Hydrogen in the sample.
I am unsure how to augment the elemental data, should I just add random noise again?
Is there anything smarter I can do?
Adding random noise is pretty easy, but seems like with such a small amount of samples and information that I'm making the dangerous assumption that any elemental data within x% should be considered healthy when I'm not sure if that IS the case.

How expensive is SIFT in terms of memory?

I'm trying to create the dataset of SIFT descriptors from the Oxford building dataset.
It's around 5k images and using the default with the largest size (width or height) of 1024pxs. Using the default VLFeat implementation, it generates on average 10k keypoints per image.
Now, let's do some math to compute the needed memory to store all the descriptors in memory:
10^3 (avg # keypoints per image) * 5 * 10^3 (images) * 128 (descriptor dimension) * 32 (float representation) / 8 (bytes) / 10^6 (GBs) = 2560 GBs
Wow, that's a lot of memory! :D So my question is pretty simple: did I write something wrong in the previous line, or we really need all this memory? It's weird because this is a really famous algorithm for object recognition. What's wrong?
In this answer, I will first say something about how you can reduce the number of features in a single image to a more reasonable amount than 10k. Next I try to explain how you can get around saving all extracted features. Oh, and to answer your question: I think you made a mistake when converting bytes to gigabytes: 1 gigabyte is 1024^3 bytes, so I calculate ~2.4 GB.
1. Reducing the number of features
You probably won't need 10k SIFT features per image for a reliable matching. Many of the detected features will correspond to very small gradients, and most likely be noise. The documentation of the vl_sift function gives you two ways to control the number of descriptor points:
using a peak threshold PeakThresh. This is a threshold which is applied on the difference of gradient (DoG) image, so only points with a large peak in the DoG, i.e. large and distinct gradients, are considered as features.
using the edge threshold edgethresh. As we only want key points and not whole edges, these are removed. This process can be controlled with the edgethresh parameter, where a higher value gives you more features.
This helps you reduce the number of features to a reasonable amount, i.e. an amount you can handle (how much this is depends on your infrastructure, your needs, and your patience.). This does work for small-scale retrieval applications. For large-scale image retrieval, however, you might have thousands or millions of images. Even if you only had 10 descriptors per image, you still would have difficulties to handle this.
Clustering into visual words
To solve this problem, you run a k-means clustering on the extracted descriptors, which gives you a number k of so-called "Visual Words" (following the terminology of text retrieval). Next, you build an inverse index, which is exactly the same as a book index: you save, which word is mentioned on which page of the book - or here: which visual word appears in which image.
For a new image, all you have to do is find the nearest visual word for each descriptor, and use the inverse index to see which database image contains the same visual words as your query image.
That way, all you have to save is the centroids of your visual words. The number of visual words can be anything, though I'd say something in the range between 100 and 10k is often reasonable.

How do the cuSPARSE and cuBLAS libraries deal with memory allocated using cudaMallocPitch?

I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed in pitched memory into the cuSPARSE routine but the results were incorrect (as expected, since there is no way to pass in the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function, to enforce that the memory not be allocated using pitched mode?
For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations most operations don't directly support a "pitch" to the data per se, however various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments could be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat here is that the pitch is evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" parameter specified in matrix elements. I would expect that this would be typically possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised, if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
If 99% of your use-cases don't use pitched data, I would either not support pitched data at all, or else, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched to unpitched copy can run at approximately the rate of memory bandwidth, so it might be fast enough to not be a serious issue for 1% of the use cases.

What are the different strategies for detecting noisy data in a pile of text?

I have around 10 GB of text from which I extract features based on bag of words model. The problem is that the feature space is very high dimensional(1 million words) and I can not discard words based on the count of each word as both the most and least occurring words are important of the model to perform better. What are the different strategies for reducing the size of the training data and number of features while still maintaining/improving the model performance?
Edit:
I want to reduce the size of the training data both because of overfitting and training time. I am using FastRank(Boosted trees) as my ML model. My machine has a core i5 processor running with 8GB RAM. The number of training instances are of the order of 700-800 million. Along with processing it takes more than an hour for the model to train. I currently do random sampling of the training and test data so as to reduce the size to 700MB or so, so that the training of the model finishes in minutes.
I'm not totally sure if this will help you because I dont know what your study is about, but if there is a logical way to divide up the 10Gigs of Text, (into documents or paragraphs) perhaps, you can try tf-idf. http://en.wikipedia.org/wiki/Tf%E2%80%93idf
This will allow you to discard words that appear very often across all partitions, and usually(the understanding is) that they dont contribute significant value to the overall document/paragraph etc.
And if your only requirement is to keep the most and least frequent words - would a standard distribution of the word frequencies help? Get rid of the average and 1 standard deviation(or whatever number you see fit).

How to speed up svm.predict?

I'm writing a sliding window to extract features and feed it into CvSVM's predict function.
However, what I've stumbled upon is that the svm.predict function is relatively slow.
Basically the window slides thru the image with fixed stride length, on number of image scales.
The speed traversing the image plus extracting features for each
window takes around 1000 ms (1 sec).
Inclusion of weak classifiers trained by adaboost resulted in around
1200 ms (1.2 secs)
However when I pass the features (which has been marked as positive
by the weak classifiers) to svm.predict function, the overall speed
slowed down to around 16000 ms ( 16 secs )
Trying to collect all 'positive' features first, before passing to
svm.predict utilizing TBB's threads resulted in 19000 ms ( 19 secs ), probably due to the overhead needed to create the threads, etc.
My OpenCV build was compiled to include both TBB (threading) and OpenCL (GPU) functions.
Has anyone managed to speed up OpenCV's SVM.predict function ?
I've been stuck in this issue for quite sometime, since it's frustrating to run this detection algorithm thru my test data for statistics and threshold adjustment.
Thanks a lot for reading thru this !
(Answer posted to formalize my comments, above:)
The prediction algorithm for an SVM takes O(nSV * f) time, where nSV is the number of support vectors and f is the number of features. The number of support vectors can be reduced by training with stronger regularization, i.e. by increasing the hyperparameter C (possibly at a cost in predictive accuracy).
I'm not sure what features you are extracting but from the size of your feature (3780) I would say you are extracting HOG. There is a very robust, optimized, and fast way of HOG "prediction" in cv::HOGDescriptor class. All you need to do is to
extract your HOGs for training
put them in the svmLight format
use svmLight linear kernel to train a model
calculate the 3780 + 1 dimensional vector necessary for prediction
feed the vector to setSvmDetector() method of cv::HOGDescriptor object
use detect() or detectMultiScale() methods for detection
The following document has very good information about how to achieve what you are trying to do: http://opencv.willowgarage.com/wiki/trainHOG although I must warn you that there is a small problem in the original program, but it teaches you how to approach this problem properly.
As Fred Foo has already mentioned, you have to reduce the number of support vectors. From my experience, 5-10% of the training base is enough to have a good level of prediction.
The other means to make it work faster:
reduce the size of the feature. 3780 is way too much. I'm not sure what this size of feature can describe in your case but from my experience, for example, a description of an image like the automobile logo can effectively be packed into size 150-200:
PCA can be used to reduce the size of the feature as well as reduce its "noise". There are examples of how it can be used with SVM;
if not helping - try other principles of image description, for example, LBP and/or LBP histograms
LDA (alone or with SVM) can also be used.
Try linear SVM first. It is much faster and your feature size 3780 (3780 dimensions) is more than enough of "space" to have good separation in higher dimensions if your sets are linearly separatable in principle. If not good enough - try RBF kernel with some pretty standard setup like C = 1 and gamma = 0.1. And only after that - POLY - the slowest one.

Resources