DBSCAN: How to Cluster Large Dataset with One Huge Cluster

I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI and that took an hour. I have used Spark before but unfortunately it does not have DBSCAN available.
Therefore, my first question is if anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way?
Next, the nature of my data is that ~85% of it lies in one huge cluster (the goal is anomaly detection). The only technique I have been able to come up with that lets me process more data is to replace a big chunk of that huge cluster with a single data point, in such a way that it can still reach all its neighbours (the deleted chunk is smaller than epsilon).
Can anyone provide any tips whether I'm doing this right or if there is a better way to reduce the complexity of DBSCAN when you know that most data is in one cluster centered around (0.0,0.0)?

Have you added an index to ELKI, and tried the parallel version? Except for the git version, ELKI will not automatically add an index; and even then, fine-tuning the index for the problem can help.
DBSCAN is not a good approach for anomaly detection: noise is not the same as anomalies. I'd rather use a density-based anomaly detection method. There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.
If you already know that most of your data is in one huge cluster, why don't you directly model that big cluster, and remove it or replace it with a smaller approximation?
Subsample. There usually is next to no benefit to using the entire data. Even (or in particular) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data in, e.g., 32 subsets, then cluster each of these subsets, and join the results back. These 32 parts can be trivially processed in parallel on separate cores or computers; but because the underlying problem is quadratic in nature, the speedup will be anywhere between 32 and 32*32=1024.
This in particular holds for DBSCAN: larger data usually means you also want to use much larger minPts. But then the results will not differ much from a subsample with smaller minPts.
In any case: before scaling to larger data, make sure your approach solves your problem, and is the smartest way of solving this problem. Clustering for anomaly detection is like trying to smash a screw into the wall with a hammer. It works, but maybe using a nail instead of a screw is the better approach.
Even if you have "big" data, and are proud of doing "big data", always begin with a subsample. Unless you can show that the result quality increases with data set size, don't bother scaling to big data; the overhead is too high unless you can prove value.
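As a rough sketch of the random-split strategy described above (not ELKI's or any other library's implementation), the snippet below shuffles point indices into 32 parts and counts pairwise distance computations for a naive, index-free neighbour search. The point count of 32,000 and the helper names `random_split` and `pairwise_cost` are assumptions chosen purely for illustration.

```python
import random

def random_split(n_points, n_parts, seed=0):
    """Randomly assign point indices to n_parts roughly equal subsets."""
    rng = random.Random(seed)
    indices = list(range(n_points))
    rng.shuffle(indices)
    # Striding over the shuffled list yields disjoint, near-equal parts.
    return [indices[i::n_parts] for i in range(n_parts)]

def pairwise_cost(n):
    """Distance computations for a naive, index-free neighbour search."""
    return n * n

n_points, n_parts = 32_000, 32  # hypothetical sizes for illustration
parts = random_split(n_points, n_parts)

full_cost = pairwise_cost(n_points)
part_costs = [pairwise_cost(len(p)) for p in parts]

# One machine, parts processed one after another: total work shrinks ~32x.
speedup_sequential = full_cost // sum(part_costs)
# One part per core: wall-clock time is that of the largest part (~1024x).
speedup_parallel = full_cost // max(part_costs)

print(speedup_sequential, speedup_parallel)
```

Each part would then be clustered on its own and the labels joined back. With a working index the cost per part is below quadratic, so the real gain is likely to sit nearer the lower end of that range.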

Related

Applying machine learning to training data parameters

I'm new to machine learning, and I understand that there are parameters and choices that apply to the model you attach to a certain set of inputs, which can be tuned/optimised, but those inputs obviously tie back to fields you generated by slicing and dicing whatever source data you had in a way that makes sense to you. But what if the way you decided to model and cut up your source data, and therefore training data, isn't optimal? Are there ways or tools that extend the power of machine learning into, not only the model, but the way training data was created in the first place?
Say you're analysing the accelerometer, GPS, heart rate and surrounding topography data of someone moving. You want to try to determine where this person is likely to become exhausted and stop, assuming they'll continue moving in a straight line based on their trajectory, and that going up any hill will increase their heart rate to some point where they must stop. Whether they're running or walking obviously modifies these things.
So you cut up your data, and feel free to correct how you'd do this, but it's less relevant to the main question:
Slice up raw accelerometer data along X, Y, Z axis for the past A number of seconds into B number of slices to try and profile it, probably applying a CNN to it, to determine if running or walking
Cut up the recent C seconds of raw GPS data into a sequence of D (Lat, Long) pairs, each pair representing the average of E seconds of raw data
Based on the previous sequence, determine speed and trajectory, and determine the upcoming slope, by slicing the next F distance (or seconds, another option to determine, of G) into H number of slices, profiling each, etc...
You get the idea. How do you effectively determine A through H, some of which would completely change the number and behaviour of model inputs? I want to take out any bias I may have about what's right, and let it determine end-to-end. Are there practical solutions to this? Each time it changes the parameters of data creation, go back, re-generate the training data, feed it into the model, train it, tune it, over and over again until you get the best result.

What you call your bias is actually the greatest strength you have. You can include your knowledge of the system. Machine learning, including glorious deep learning, is, to put it bluntly, stupid. Although it can figure out features for you, interpreting them will be difficult.
Deep learning in particular has a great capacity to memorise (not learn!) patterns, making it easy to overfit to the training data. Making machine learning models that generalise well in the real world is tough.
In most successful approaches (look at what Kaggle masters do) people create features. In your case I'd probably want to calculate the magnitude and vector of the force. Depending on the type of scenario, I might transform (Lat, Long) into a distance from a specific point (say, the point of origin / activation, or a reference re-established every 1 minute), or maybe use a different coordinate system.
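A minimal sketch of the two hand-made features mentioned above, assuming raw accelerometer readings along three axes and GPS fixes in decimal degrees. The function names are hypothetical, and the haversine great-circle formula is just one reasonable choice for "distance from a specific point".

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def acceleration_magnitude(ax, ay, az):
    """Length of the acceleration vector, independent of sensor orientation."""
    return math.sqrt(ax * ax + ay * ay + az * az)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical usage: distance of the current fix from the point of origin.
origin = (0.0, 0.0)
current = (0.0, 1.0)
print(acceleration_magnitude(3.0, 4.0, 0.0))
print(haversine_km(*origin, *current))
```

Features like these are easy to inspect and troubleshoot, which is the point of building them by hand.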
Since your data is a time series, I'd probably use something well suited for time series modelling that you can understand and troubleshoot. CNNs and the like are typically your last resort in the majority of cases.
If you really would like to automate it, check e.g. Auto Keras or ludwig. When it comes to learning which features matter most, I'd recommend going with gradient boosting (GBDT).
I'd recommend reading this article from AirBnB, which takes a deeper dive into the journey of building such systems and into feature engineering.

Analysis of 3d image data as 2d

I have a tif image of around 10 GB. I need to perform object classification or pixel classification on this image. The image data has zyx dimension order. My voxel size is x=0.6, y=0.6 and z=1.2. Z is the depth of the object. My RAM cannot hold the whole image.
If I classify the pixels in each Z plane separately and then merge the results to get the final shape and volume of the object,
would I lose any information, and would my final shape or volume of the object be wrong?

@ankit agrawal you have probably found the answer already, but my advice would definitely not be to say that you need more memory.
I have had a similar problem, and if anyone else comes across it the options below should help.
Options
The answer about splitting into just z planes is correct: you could lose information along z. The idea isn't a bad one, but you could instead take Regions of Interest (ROIs) or split your image into chunks so that they are more manageable, say by splitting at x/2, y/2 and z/2. Then you get a bunch of chunks that fit in memory, and you can stack the data back up later.
Use the library Dask (https://dask.org/), which creates all this for you. It's designed for parallelism and can be scaled on a single computer or a cluster. Using the dask.array part lets you create lots of chunked numpy arrays. Even better, use dask-image (there should be a link to it from Dask). It's a wrapper around dask.array and many scipy ndimage functions. Lastly, when the file is split appropriately the computation can be faster because of parallelism. Not always, but I have easily worked with 20 GB data sets on a laptop with 16 GB of RAM. The files were 8-bit, so memory blows up when libraries and functions convert the data to float; Dask lets you keep a handle on it all. If you stick to the core functions it will work fine. It gets harder when you work with mapped blocks.
If you still have this issue.
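To make the x/2, y/2, z/2 split concrete, here is a small standard-library sketch that only computes the chunk boundaries; it is not how Dask does it internally. The volume shape and the helper name `chunk_slices` are assumptions for illustration.

```python
from itertools import product
from math import prod

def chunk_slices(shape, splits):
    """Return one tuple of slices per chunk, cutting each axis into `splits` parts."""
    per_axis = []
    for size, n_chunks in zip(shape, splits):
        step = -(-size // n_chunks)  # ceiling division
        per_axis.append([slice(s, min(s + step, size)) for s in range(0, size, step)])
    return list(product(*per_axis))

shape = (120, 2048, 2048)                # hypothetical (z, y, x) volume
chunks = chunk_slices(shape, (2, 2, 2))  # halve every axis: 8 chunks

# Sanity check: the chunks tile the volume without gaps or overlap.
voxels = sum(prod(sl.stop - sl.start for sl in chunk) for chunk in chunks)
print(len(chunks), voxels == prod(shape))
```

Each tuple can be used to index a lazily loaded or memory-mapped array so that only one chunk is in RAM at a time. Objects that cross a chunk border still need care when the results are stacked back together.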

The issue with doing a classification in each z-plane individually is that you might not be able to classify objects with that sort of restricted information.
Think of a 2D face detection problem where you try to detect the face in each row of pixels individually: that is probably not going to be very robust, and you will lose valuable spatial information. In the end you'll probably end up with no detections to merge.
Solution proposal:
My advice would be to increase the size of your voxels until the data can be handled by your processing unit; that is, decrease the resolution of your data and do a classification with a low confidence threshold. Then come back and do another classification on the volumes with detections in them, this time aiming for a higher confidence threshold. This can be done iteratively as needed.

I think breaking the image along any (x/y/z) plane kind of defeats the point of the voxel concept, because the representation of a three-dimensional object is flattened and you lose the spatial relational data.
I think a couple of options are:
Use a distributed computing cluster, like Hadoop.
Look into storing that image in a geospatial database like GeoMesa, so it may be queried efficiently, then you can just hold in memory what you need to train locally.
10GB isn't so large, so perhaps upgrade your memory capacity?

How to differentiate between real improvement and random noise?

I am building an automatic translator in Moses. To improve its performance, I use log-linear weight optimisation. This technique has a random component, which can slightly affect the final result (but I do not know exactly how much).
Suppose that the current performance of the model is 25 BLEU.
Suppose now I modify the language model (e.g. change the smoothing), and I get a performance of 26 BLEU.
My question is: how can I know whether the improvement is because of the modification, or is just noise from the random component?

This is pretty much what statistics is all about. You can basically do one of two things (from the basic set of solutions; of course there are many more advanced ones):
Try to measure/model/quantify the effect of randomness. If you know what is causing it, you might be able to actually compute how much it can affect your model. If an analytical solution is not possible, you can always train 20 models with the same data/settings, gather the results and estimate the noise distribution. Once you have this you can perform statistical tests to check whether the improvement is statistically significant (for example by ANOVA tests).
A simpler approach (but more expensive in terms of data/time) is to simply reduce the variance by averaging. In short: instead of training one model (or evaluating the model once), which has this hard-to-determine noise component, do it many times, 10 or 20, and average the results. This way you reduce the variance of the results in your analysis. This can (and should) be combined with the previous option: since you now have 20 results per setting, you can again use statistical tests to see whether they are significantly different.
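A minimal sketch of the repeated-runs idea above, using only the standard library. It computes a one-way ANOVA F statistic for two settings and, as a swapped-in substitute for looking up the F distribution, estimates a p-value with a permutation test. The BLEU scores are made-up placeholder numbers, not real results, and five runs per setting are used only to keep the example short.

```python
import random
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F: between-group variance over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def permutation_p_value(groups, n_shuffles=2000, seed=0):
    """Share of random relabelings whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical BLEU scores from repeated tuning runs (placeholders only).
baseline = [25.0, 24.8, 25.3, 24.9, 25.1]
modified = [26.0, 25.7, 26.2, 25.9, 26.1]

f_value = f_statistic([baseline, modified])
p_value = permutation_p_value([baseline, modified])
print(f_value, p_value)
```

A small p-value suggests the gap between the two settings is larger than what the random component alone tends to produce; with real runs the spread may be much wider than in these placeholder numbers.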

Genetic Algorithm, large population vs small one

I'm wondering if there is a general rule of thumb for population sizing. I've read in a book that 2x the chromosome length is a good starting point. Am I correct in assuming, then, that if I had an equation with 5 variables, I should have a population of 10?
I'm also wondering if the following is correct:
Larger Population Size.
Pros:
Larger diversity so more likely to pick up on traits which return a good fitness.
Cons:
Requires longer to process.
vs
Smaller Population Size.
Pros:
Larger number of generations experienced per unit time.
Cons:
Mutation will have to be more prominent in order to compensate for smaller population??
EDIT
A little additional info: say I have an equation which has 5 unknown parameters. For each parameter I have anywhere between 10 and 50 values I would like to try assigning to it. So for example
variable1 = 20 different values
variable2 = 15 different values
...
I thought a GA would be a decent approach to such a problem as the search space is quite large, i.e. the worst case for the above would be 312,500,000 permutations (unless I have screwed up?): n!/(n-k)! where n = 50 and k = 1 => 50 * 50 * 50 * 50 * 50
Unfortunately the number of parameters and the range of values to check can vary a lot, so I was looking for some sort of rule of thumb as to how large I should set the population.
Thanks for your help. If there is any more info you need, or you prefer to discuss in one of the chatrooms, just give me a shout.
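The worst-case figure above can be checked in a couple of lines. Since each parameter is chosen independently, the size of the search space is simply the product of the number of candidate values per parameter. The list below assumes the stated worst case of 50 values for each of the 5 parameters.

```python
import math

# Worst case from the question: 5 parameters, up to 50 candidate values each.
values_per_parameter = [50, 50, 50, 50, 50]

# Independent choices multiply: 50 * 50 * 50 * 50 * 50.
search_space = math.prod(values_per_parameter)
print(search_space)  # 312500000
```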

I'm not sure where you read that 2x the chromosome length is a good starting point, but I'm guessing it's a book that concentrated on larger problems.
If you only have five variables, a genetic algorithm is probably not the right choice for converging upon a solution. With a chromosome length of five you're probably going to find that you very quickly reach a non-deterministic (it will change in subsequent runs) local minimum and slowly iterate around that space until you find the true local minimum.
However, if you are insistent on using a GA, I would suggest abandoning that rule of thumb for this problem and really thinking about the starting population as a measure of how far from the final solution you expect a random solution to be.
The reason that many rules of thumb depend on chromosome length is that it's a decent proxy for this: if I have a hundred variables, any given randomly generated DNA sequence is going to be further from ideal than if I had only one variable.
Additionally, if you're worried about computation intensity I'm going to go ahead and say that it shouldn't be an issue since you're dealing with such a small solution set. I think a better rule of thumb for smaller sets like this would be along the lines of:
(ln(chromosome_length*(solution_space/granularity)/mutation_rate))^2
Probably with a constant thrown in to scale for the particular problem.
It's definitely not a great rule of thumb (no rule is) but here's my logic for it:
Chromosome length is just a proxy for size of solution space, so taking into account the size of the solution space will necessarily increase the accuracy of this proxy
A smaller mutation rate necessitates a larger population size to compensate for the fact that you are more prone to get caught in local minima
Any rule of thumb should scale logarithmically since a genetic algorithm is akin to a tree search of your solution space.
The squared term was mostly the result of trying this out, but it looks like the logarithmic scaling was a little aggressive, though the general shape seemed right.
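Written as a function, the rule of thumb above might look like the sketch below. The `scale` argument stands in for the problem-specific constant mentioned in the text, and the sample inputs (a granularity of 1 and mutation rates of 0.01 and 0.001) are hypothetical values picked only to show how the number moves.

```python
import math

def population_size_rule(chromosome_length, solution_space, granularity,
                         mutation_rate, scale=1.0):
    """(ln(chromosome_length * (solution_space / granularity) / mutation_rate))^2,
    multiplied by an optional problem-specific constant."""
    ratio = chromosome_length * (solution_space / granularity) / mutation_rate
    return scale * math.log(ratio) ** 2

# Hypothetical inputs: 5 genes, the 312,500,000-point search space from the
# question, granularity 1, and two candidate mutation rates.
size_high_mutation = population_size_rule(5, 312_500_000, 1, 0.01)
size_low_mutation = population_size_rule(5, 312_500_000, 1, 0.001)
print(round(size_high_mutation), round(size_low_mutation))
```

As the reasoning above suggests, lowering the mutation rate pushes the recommended population size up, while the logarithm keeps the growth slow even for very large search spaces.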
However I think a better choice would be to start at a reasonable number (100) and try iterating up and down until you find a population size that seems to balance accuracy with execution speed.

As with most genetic algorithm parameters, population size is highly dependent on the problem. There are certain factors that can help to point in the direction of whether you should have a large or small population size, but a lot of the time, testing different values against a known solution before running it on your problem is a good idea (if this is possible, of course).
A population size of 10 does seem rather small though. You say you have an equation with five variables. Is your problem represented by a chromosome of 5 values? It seems small for a chromosome and if this is the case it's likely that using a genetic algorithm may not be the best way to solve the problem. Perhaps if you give a bit more detail on your problem and how you are representing it people may have a better idea of how to advise you.
I'd also add that your cons for large and small population sizes aren't exactly correct. A larger population does take longer to process per generation than a small one, but since it can often solve the problem in fewer generations, the overall processing time isn't necessarily longer. Again, it's highly dependent on the problem. With a smaller population size, mutation shouldn't have to be more prominent. Mutation is generally used to stop the genetic algorithm from becoming stuck in a local maximum and should usually be a very small value. A small population is more likely to become stuck in a local maximum, but if you have a mutation value which is too high you may be nullifying the natural improvement of the genetic algorithm.

Image labeling performance using CRF

I need to develop an image labeling application. For this task I'm considering using Conditional Random Fields (CRF) over a set of superpixels; there exist quite a few papers that point to this technology as the state of the art for this task. As usual, the task could be divided into two parts:
Training model: which for this problem would be obtaining the parameter vector 'w', using for example
Testing: which would be obtaining the most probable label assignment for a given set of superpixels, i.e. argmax(P(y|x))
I'm aware that training time is quite high; however, I have not found anything about testing time or performance. Does anyone have an idea of how much time the testing problem could take? I suppose it will depend on the number of labels, image size, implementation, hardware, etc.

Testing is slowish because you still have to solve a graph cuts problem (but nothing like training). There is an implementation you can try out at http://drwn.anu.edu.au/drwnProjMultiSeg.html (you have probably seen Stephen Gould's papers).
I still have the log file, but it is a bit hard to interpret, so the following may not be totally accurate. On a super fast machine, I think it took about:
4.5 hours cpu time to train 20 classes on 276 images from MSRC dataset
50 mins cpu time to classify 256 images, most of which was spent doing alpha expansion
