I have univariate time series data and I need to run an anomaly detection algorithm on it. Can anyone suggest a standard algorithm for anomaly detection that works in most cases?
There is no such algorithm "which works in most cases". The task depends heavily on the specifics of your case, e.g. whether you need to find local anomalies (a point that differs from the points near it) or global ones (a point that does not look similar to any other point in the dataset).
A very good review of anomaly detection algorithms can be found here.
Perhaps you can easily try a one-class SVM, which is available in many libraries and programming languages. For instance, in Python you can use scikit-learn.
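As a minimal sketch of that route, something like the following could work on a univariate series; the window size, nu value, and synthetic data are purely illustrative assumptions, not recommendations:

```python
# Minimal sketch: one-class SVM anomaly detection on a univariate series.
# Window size and nu are illustrative choices, not recommendations.
import numpy as np
from sklearn.svm import OneClassSVM

series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.randn(500)
series[250] += 3.0  # inject an artificial anomaly

window = 10
# turn the series into overlapping windows so each sample has some context
X = np.array([series[i:i + window] for i in range(len(series) - window)])

model = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale")
labels = model.fit_predict(X)          # +1 = inlier, -1 = outlier
anomalous_windows = np.where(labels == -1)[0]
print(anomalous_windows)
```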
Machine Learning (ML) can do two things with vibration/acoustic signals for Condition Based Monitoring (CBM):
1. Feature Extraction (FE), and
2. Classification
But if we look through the research, why are signal processing techniques used for pre-processing and ML only for the rest, i.e. classification?
We could use ML alone for all of this, but I keep seeing models that merge the two techniques: a conventional signal processing approach plus ML.
I want to know the specific reason for that. Why do researchers use both, when they could do it with ML only?
Yes you can do so. However, the task becomes more complicated.
The FFT, for example, transforms the input space into a more meaningful representation. If you have rotating equipment, you would expect the spectrum to be concentrated around the frequency of rotation. However, if there is a problem, the spectrum changes. This can often be detected by, for example, SVMs.
If you don't do the FFT but only give the raw signal, SVMs have a hard time.
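To make the idea concrete, here is a rough sketch of that pipeline; the synthetic "healthy" vs. "faulty" signals, the frequencies, and the label setup are all assumptions for illustration:

```python
# Minimal sketch: FFT magnitude spectrum as features for an SVM classifier.
# The synthetic "healthy" vs. "faulty" signals are purely illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
fs, n = 1000, 1000            # sample rate and samples per snippet
t = np.arange(n) / fs

def snippet(faulty):
    sig = np.sin(2 * np.pi * 50 * t)              # rotation frequency
    if faulty:
        sig += 0.5 * np.sin(2 * np.pi * 120 * t)  # fault adds a new component
    return sig + 0.2 * rng.standard_normal(n)

X_raw = np.array([snippet(f) for f in ([False] * 50 + [True] * 50)])
y = np.array([0] * 50 + [1] * 50)

# spectrum features: magnitude of the real FFT of each snippet
X_fft = np.abs(np.fft.rfft(X_raw, axis=1))

print(cross_val_score(SVC(kernel="rbf"), X_fft, y, cv=5).mean())
```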
Nevertheless, I've seen recent practical examples using deep convolutional networks that learned to predict problems from raw vibration data. The disadvantage, however, is that you need more data. More data is not a problem in general, but if you take a wind turbine, for example, getting more failure data is obviously -- or hopefully ;-) -- a problem.
The other thing is that the ConvNet learned the FFT all by itself. But why not use prior knowledge if you have it?
I have little background knowledge of Machine Learning, so please forgive me if my question seems silly.
Based on what I've read, the best model-free reinforcement learning algorithm to date is Q-learning, where each (state, action) pair in the agent's world is given a q-value, and at each state the action with the highest q-value is chosen. The q-value is then updated as follows:
Q(s,a) ← (1-α)·Q(s,a) + α·(R(s,a,s') + γ·max_{a'} Q(s',a')), where α is the learning rate and γ is the discount factor.
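For concreteness, a single tabular update of that rule might look like this in Python (the dict-based table and the parameter values are just illustrative):

```python
# One tabular Q-learning update, matching the formula above.
# Q is a dict mapping (state, action) -> value; gamma is the discount factor.
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.1, 0.9

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```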
Apparently, for problems with high dimensionality, the number of states becomes astronomically large, making q-value table storage infeasible.
So practical implementations of Q-learning require Q-value approximation via generalization of states, a.k.a. features. For example, if the agent were Pacman, the features could be:
Distance to closest dot
Distance to closest ghost
Is Pacman in a tunnel?
And then, instead of q-values for every single state, you would only need q-values for each feature.
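In that approximate setting the q-value becomes a weighted sum of the feature values, and learning updates the weights rather than a table. A rough sketch, where the Pacman-style feature functions are hypothetical placeholders:

```python
# Sketch of linear Q-value approximation: Q(s, a) = sum_i w[i] * f_i(s, a).
# The feature function below is a hypothetical stand-in for the Pacman example;
# real feature functions would usually depend on the action as well.
import numpy as np

def features(state, action):
    # e.g. distance to closest dot, distance to closest ghost, in-tunnel flag
    return np.array([state["dot_dist"], state["ghost_dist"], float(state["in_tunnel"])])

w = np.zeros(3)
alpha, gamma = 0.1, 0.9

def q_value(state, action):
    return w @ features(state, action)

def update(state, action, reward, next_state, next_actions):
    global w
    target = reward + gamma * max(q_value(next_state, a) for a in next_actions)
    td_error = target - q_value(state, action)
    w += alpha * td_error * features(state, action)  # gradient step on the weights
```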
So my question is:
Is it possible for a reinforcement learning agent to create or generate additional features?
Some research I've done:
This post mentions A. Geramifard's iFDD method:
http://www.icml-2011.org/papers/473_icmlpaper.pdf
http://people.csail.mit.edu/agf/Files/13RLDM-GQ-iFDD+.pdf
which is a way of "discovering feature dependencies", but I'm not sure if that is feature generation, as the paper assumes that you start off with a set of binary features.
Another paper that I found was apropos is Playing Atari with Deep Reinforcement Learning, which "extracts high level features using a range of neural network architectures".
I've read over the paper but still need to flesh out/fully understand their algorithm. Is this what I'm looking for?
Thanks
It seems like you already answered your own question :)
Feature generation is not part of the Q-learning (or SARSA) algorithm itself. However, in a preprocessing step you can use a wide array of algorithms (you showed some) to generate/extract features from your data. Combining different machine learning algorithms results in hybrid architectures, a term you might look into when researching what works best for your problem.
Here is an example of using features with SARSA (which is very similar to Q-learning).
Whether the papers you cited are helpful for your scenario, you'll have to decide for yourself. As always with machine learning, your approach is highly problem-dependent. If you're in robotics and it's hard to define discrete states manually, a neural network might be helpful. If you can think of heuristics yourself (as in the Pacman example), then you probably won't need it.
Many machine learning competitions are held on Kaggle, where you are given a training set with a set of features, and a test set whose output labels are to be predicted using the training set.
It is pretty clear that supervised learning algorithms like decision trees, SVMs, etc. are applicable here. My question is how I should start to approach such problems: should I begin with a decision tree, an SVM, or some other algorithm, or is there another approach entirely? In short, how do I decide?
So, I had never heard of Kaggle until reading your post -- thank you so much, it looks awesome. Upon exploring their site, I found a portion that will guide you well. On the competitions page (click "all competitions"), you see Digit Recognizer and Facial Keypoints Detection, both of which are competitions that exist for educational purposes, with tutorials provided (a tutorial isn't available for Facial Keypoints Detection yet, as the competition is in its infancy). In addition to the general forums, competitions have their own forums as well, which I imagine is very helpful.
If you're interested in the mathematical foundations of machine learning, and are relatively new to it, may I suggest Bayesian Reasoning and Machine Learning. It's no cakewalk, but it's much friendlier than its counterparts, without a loss of rigor.
EDIT:
I found the tutorials page on Kaggle, which seems to be a summary of all of their tutorials. Additionally, scikit-learn, a Python library, offers a ton of descriptions/explanations of machine learning algorithms.
This cheat sheet (http://peekaboo-vision.blogspot.pt/2013/01/machine-learning-cheat-sheet-for-scikit.html) is a good starting point. In my experience, using several algorithms at the same time can often give better results, e.g. logistic regression and SVM where the results of each one have a predefined weight. And test, test, test ;)
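A quick sketch of that kind of weighted combination using scikit-learn's VotingClassifier; the dataset is synthetic and the weights are arbitrary illustrative choices:

```python
# Sketch: weighted combination of logistic regression and an SVM.
# The weights (2 and 1) are arbitrary illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True))],  # probability=True enables soft voting
    voting="soft",
    weights=[2, 1],
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```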
There is No Free Lunch in data mining. You won't know which methods work best until you try lots of them.
That being said, there is also a trade-off between understandability and accuracy in data mining. Decision Trees and KNN tend to be understandable, but less accurate than SVM or Random Forests. Kaggle looks for high accuracy over understandability.
It also depends on the number of attributes. Some learners can handle many attributes, like SVM, whereas others are slow with many attributes, like neural nets.
You can shrink the number of attributes by using PCA, which has helped in several Kaggle competitions.
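A minimal sketch of that idea, assuming synthetic data and a 95% explained-variance cutoff chosen purely for illustration:

```python
# Sketch: reduce the number of attributes with PCA before fitting a learner.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)

# keep enough components to explain ~95% of the variance
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC())
print(cross_val_score(model, X, y, cv=5).mean())
```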
I have my own Java-based implementation of clustering (k-NN). However, I am facing scalability issues. I do not plan to use Mahout because my requirements are very simple and Mahout requires a lot of work. I am looking for a Java-based canopy clustering implementation that I can plug into my algorithm for parallel processing.
Mahout-based canopy libraries are coupled to Vectors and indexes and do not work on plain strings. If you know of a way I can use canopy clustering on strings with a simple library, that would fix my issue.
My requirement is to pass a list of strings (say 10K) to the canopy clustering algorithm and have it return sublists based on the T1 and T2 thresholds.
Canopy clustering is mostly useful as a preprocessing step for parallelization. I'm not sure how much it will get you on a single node. I figure you might as well compute the actual algorithm right away, or build an index such as an M-tree.
The strength of Canopy clustering is that you can run it independently on a number of nodes and then just overlap their results.
Also check whether it is actually compatible with your approach. I figure canopy may need metric properties to be correct. Is your string distance a proper metric (i.e. does it satisfy the triangle inequality)?
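For reference, the canopy assignment itself is only a few lines. A rough sketch over strings, using a simple edit-similarity distance from the Python standard library; the distance choice and the T1/T2 values are placeholders, and this particular distance is not guaranteed to be a proper metric:

```python
# Rough sketch of canopy clustering over strings, with T1 > T2.
# The difflib-based distance and the threshold values are illustrative placeholders.
from difflib import SequenceMatcher

def distance(a, b):
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def canopy(points, t1, t2):
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # inside loose threshold: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # outside tight threshold: stays available
        remaining = still_remaining
        canopies.append(members)
    return canopies

print(canopy(["apple", "apples", "banana", "bananas", "cherry"], t1=0.5, t2=0.2))
```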
10,000 data points, if that's all you're concerned with, should be no problem with standard k-means. I'd look at optimising that before you consider canopy clustering (which is really designed for millions or even billions of examples). Some things you may have missed:
pre-compute the feature vectors for each string. Don't do it every time you want to compare s_1 to s_2 or s_1 to a cluster centroid
you only need to keep the summary statistics in memory: the sum of all points assigned to a cluster and the number of points assigned to it. When you're done with an iteration, divide the sums by the counts and you have your new centroids (see the sketch after this list)
what's the dimensionality of your feature space? Be aware that you should use a distance metric in which dimensions where both vectors are zero have no impact, so you only need to compute over the non-zero dimensions. Store your points as sparse vectors to facilitate this.
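A rough sketch of that summary-statistics update, using dense NumPy arrays for brevity (a sparse representation follows the same logic):

```python
# Sketch of one k-means iteration keeping only per-cluster sums and counts.
# Dense arrays are used for brevity; sparse vectors follow the same logic.
import numpy as np

def kmeans_iteration(X, centroids):
    k = centroids.shape[0]
    sums = np.zeros_like(centroids)
    counts = np.zeros(k)
    for x in X:
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))  # nearest centroid
        sums[j] += x                                           # summary statistics only
        counts[j] += 1
    counts[counts == 0] = 1                                    # guard against empty clusters
    return sums / counts[:, None]                              # new centroids

X = np.random.rand(100, 20)
centroids = kmeans_iteration(X, X[:3].copy())
```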
Can you do some analysis and determine where the bottle-neck in your implementation is? I'm a little perplexed by your comment about Mahout not working with plain strings.
You should give the clustering algorithms in ELKI a try. Sorry for so shamelessly promoting a project I'm closely affiliated with. But it is the largest collection of clustering and outlier detection algorithms that are implemented in a comparable fashion. (If you'd take all the clustering algorithms available in some R package, you might end up with more algorithms, but they won't be really comparable because of implementation differences)
And benchmarking showed enormous speed differences with different implementations of the same algorithm. See our benchmarking web site on how much performance can vary even on simple algorithms such as k-means.
We do not yet have canopy clustering. The reason is that it's more of a preprocessing index than an actual clustering algorithm, kind of like a primitive variant of the M-tree, or of DBSCAN clustering. However, we would like to see a contributed canopy clustering as such a preprocessing step.
ELKI's abilities to process strings are also a bit limited so far. You can load typical TF-IDF vectors just fine, and we have somewhat optimized sparse vector classes and similarity functions. They don't fully exploit sparsity for k-means yet, though, and there is no spherical k-means yet either. But there are various reasons why k-means results on sparse vectors cannot be expected to be very meaningful; it's more of a heuristic.
But it would be interesting if you could give it a try for your problem and report back your experiences. Was the performance somewhat competitive with your implementation? And we would love to see contributed modules for text processing, such as e.g. further optimized similarity functions, or a spherical k-means variant.
Update: ELKI now actually includes canopy clustering: CanopyPreClustering (it will be part of 0.6.0). But as of now, it's just another clustering algorithm and is not yet used to accelerate other algorithms such as k-means. I need to check how best to use it as some kind of index to accelerate algorithms. I can imagine it would also help speed up DBSCAN if you set T1=epsilon and T2=0.5*T1. The big issue with canopy clustering, IMHO, is how to choose a good radius.
I have two dependent continuous variables, and I want to use their combined values to predict the value of a third, binary variable. How do I go about discretizing/categorizing the values? I am not looking for clustering algorithms; I'm specifically interested in obtaining 'meaningful' discrete categories I can subsequently use in a Bayesian classifier.
Pointers to papers, books, online courses, all very much appreciated!
That is the essence of machine learning and one of the most studied problems.
Least-squares regression, logistic regression, SVMs, and random forests are widely used for this type of problem, which is called binary classification.
If your goal is to pragmatically classify your data, several libraries are available, like scikit-learn in Python and Weka in Java. They have great documentation.
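A tiny example of the scikit-learn route with two continuous predictors and a binary target; the data here is synthetic, so substitute your own two columns and labels:

```python
# Tiny example: predict a binary variable from two continuous ones with scikit-learn.
# The data is synthetic; substitute your own two columns and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                    # two dependent continuous variables
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

clf = LogisticRegression()
print(cross_val_score(clf, X, y, cv=5).mean())
```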
But if you want to understand the intrinsics of machine learning, just search (here or on Google) for machine learning resources.
If you wanted to be a real nerd, you could generate a bunch of different possible discretizations, train a classifier on each, then characterize the discretizations by features, run a classifier on that, and see what sort of discretizations work best!
In general, discretizing is more of an art, and relies on having a good understanding of what the input variable ranges mean.
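A small sketch of the "try several discretizations" idea from above, binning both variables with different bin counts and scoring a naive Bayes classifier on each; the synthetic data and the bin counts are arbitrary illustrative choices:

```python
# Sketch: try several discretizations and score a naive Bayes classifier on each.
# Synthetic data and arbitrary bin counts, purely for illustration.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # the two continuous variables
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # the binary target

for n_bins in (3, 5, 10):
    binner = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    X_binned = binner.fit_transform(X)
    score = cross_val_score(CategoricalNB(min_categories=n_bins), X_binned, y, cv=5).mean()
    print(n_bins, round(score, 3))
```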