Random Cut Forest anomaly detection on multivariate non-time-series data - random-forest

I went through various articles/blogs on the AWS SageMaker unsupervised ML algorithm called Random Cut Forest, and all the examples I saw are based on time series data. I have a doubt: does Random Cut Forest detect anomalies only on time series data, or can it also detect anomalies in multivariate, non-time-series data samples?
My use case is to detect anomalies based on a sudden increase in a specific event, e.g.:
event1,event2,event3,device
100,1,1,device1
1,100,100,device2
1,1,1,device3
In this case, the anomaly detection algorithm should flag device1 and device2 as anomalous.

Yes, Random Cut Forest should work well for your use case. I agree that the samples are a bit limited; in practice you can apply this algorithm to any multivariate numerical dataset.
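To see this concretely without an AWS setup, here is a minimal sketch using scikit-learn's IsolationForest as a stand-in for SageMaker's Random Cut Forest (both are tree ensembles that isolate anomalies via random cuts), applied to the device table from the question, padded with a few made-up "normal" rows:

```python
# Sketch: tree-ensemble anomaly detection on multivariate, non-time-series
# data. IsolationForest stands in here for SageMaker's Random Cut Forest;
# the extra "normal" device rows below are made up for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

# Columns: event1, event2, event3 counts per device
X = np.array([
    [100, 1, 1],    # device1: spike in event1
    [1, 100, 100],  # device2: spike in event2 and event3
    [1, 1, 1],      # device3: normal
    [2, 1, 1], [1, 2, 1], [1, 1, 2], [2, 2, 1],  # more normal devices
])

model = IsolationForest(contamination=0.3, random_state=0).fit(X)
labels = model.predict(X)  # -1 = anomaly, 1 = normal
print(labels)  # device1 and device2 should come back as -1
```

Note that the model never sees a timestamp; it works purely on the feature vectors, which is the point of the answer above.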

Related

Anomaly detection on time series with Xgboost algorithm

Why is the XGBoost algorithm not useful for anomaly detection on time series?
There are some examples of forecasting on time series (https://www.kaggle.com/code/robikscube/tutorial-time-series-forecasting-with-xgboost).
Is there an implementation we could use to do anomaly detection and forecasting together on time series data?
Anomalies are, by definition, rare. Standard classification algorithms generally struggle when one of the classes is rare, because of their objective function.
If you want to detect anomalies, one thing you can try is to use XGBoost to predict the time series, and then use the residuals to determine which points are "poorly" predicted by the model and therefore anomalous.
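The residual idea can be sketched in a few lines. To keep the example self-contained, a simple rolling-mean forecast stands in for the XGBoost regressor (in practice you would fit `xgboost.XGBRegressor` on lagged features); the series, the injected anomaly, and the 4-sigma threshold are all assumptions for illustration:

```python
# Sketch of residual-based anomaly detection: forecast each point from its
# recent history, then flag points the model predicts poorly. A rolling-mean
# forecast stands in here for an XGBoost model; the scoring logic is the same.
import numpy as np

rng = np.random.default_rng(42)
series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)
series[120] += 5.0  # inject one anomaly

window = 10
preds = np.array([series[i - window:i].mean() for i in range(window, len(series))])
resid = np.abs(series[window:] - preds)

# Flag residuals more than 4 standard deviations above the mean residual
threshold = resid.mean() + 4 * resid.std()
anomalies = np.where(resid > threshold)[0] + window
print(anomalies)  # index 120 should be among the flagged points
```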

train/validate/test split for time series anomaly detection

I'm trying to perform a multivariate time series anomaly detection. I have training data that consists of "normal" data. I train on this data and detect anomalies on the test set that contains normal + anomalous data. My understanding is that it would be wrong to tweak the model hyperparameters based on the results from the test set.
What would the train/validate/test set look like to train and evaluate a time-series anomaly detector?
Nothing very specific to anomaly detection here. You need to split the held-out data into one or more validation and test sets, while making sure they are reasonably independent (no information leakage between them).
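For time series, "reasonably independent" usually means splitting chronologically rather than shuffling, so the validation and test sets contain no information from the future of the training period. A minimal sketch (the 60/20/20 ratios are arbitrary assumptions):

```python
# Sketch: chronological train/validation/test split for time series.
# Splits respect time order -- never shuffle -- so that no "future"
# information leaks into the training set.
def chronological_split(series, train_frac=0.6, val_frac=0.2):
    n = len(series)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]

data = list(range(100))  # stand-in for time-ordered observations
train, val, test = chronological_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```

For the setup in the question, the training split would hold only normal data, while hyperparameters are tuned on a validation split that contains some anomalies, leaving the test split untouched until the end.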

How to evaluate unsupervised anomaly detection

I am trying to solve a regression problem by predicting a continuous value using machine learning. I have a dataset composed of 6 float columns.
The data come from cheap sensors, which explains why we will very likely have values that can be considered out of the ordinary. To deal with this, and before predicting my continuous target, I want to detect data anomalies and use that as a data filter. But the data I have is not labeled, which means I have an unsupervised anomaly detection problem.
The algorithms used for this task are Local Outlier Factor, One Class SVM, Isolation Forest, Elliptic Envelope and DBSCAN.
After fitting those algorithms, it is necessary to evaluate them to choose the best one.
Does anyone have an idea of how to evaluate an unsupervised algorithm for anomaly detection?
The only way is to generate synthetic anomalies, i.e., to introduce outliers yourself, using your knowledge of what a typical outlier looks like.
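A sketch of that evaluation loop, using IsolationForest (one of the detectors listed in the question) and synthetic outliers whose range and count are assumptions for illustration:

```python
# Sketch: evaluating an unsupervised detector by injecting synthetic
# outliers, then scoring against the (now known) injected labels.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 6))     # 6 float columns, as in the question
outliers = rng.uniform(6, 10, size=(25, 6))  # synthetic anomalies, far from normal
X = np.vstack([normal, outliers])
y = np.array([0] * 500 + [1] * 25)           # 1 = injected anomaly

model = IsolationForest(random_state=0).fit(X)
scores = -model.score_samples(X)  # higher = more anomalous
print(roc_auc_score(y, scores))   # near 1.0 for well-separated anomalies
```

Repeating this for each candidate detector (LOF, One-Class SVM, etc.) gives a comparable AUC per algorithm, with the caveat that the result only reflects the kind of outlier you chose to inject.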

Sentiment Analysis using classification and clustering algorithms: Which is better?

I am trying to do a Sentiment Analysis on Song Lyrics using Python.
After studying many simple classification problems, with known labels (such as Email classification Spam/Not Spam), I thought that the Lyrics Sentiment Analysis lies on the Classification field.
While actually coding it, I discovered that I had to compute the sentiment for each song's lyrics, probably adding a column to the original dataset marking it positive or negative, or storing the actual sentiment score.
Couldn't this be done using a clustering approach? Since we don't know each song's class in the first place (positive/negative sentiment), the algorithm would cluster the data by sentiment.
Clustering usually won't produce sentiments.
It is more likely to produce, e.g., a cluster for rap and one for non-rap, or one for lyrics with an even length and one for odd length.
There is more in the data than sentiment, so why would clustering produce sentiment clusters specifically?
If you want particular labels (positive sentiment, negative sentiment) then you need to provide training data and use a supervised approach.
You are thinking of clustering without supervision, i.e., unsupervised clustering, which might give low-accuracy results because you don't actually know the threshold score that separates the positive and negative classes. So first try to find that threshold, which will be the parameter that separates your classes. Use supervised learning to find the threshold.
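A minimal sketch of the supervised route, assuming you can hand-label a small sample of lyrics first. The tiny corpus and its labels below are entirely made up for illustration:

```python
# Sketch: supervised sentiment classification on a hand-labeled sample.
# Corpus and labels are invented; a real dataset would be far larger.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lyrics = [
    "love and sunshine fill my heart",
    "dancing happy under golden skies",
    "joy and laughter all night long",
    "tears and sorrow drown my soul",
    "broken hearted lonely and cold",
    "pain and darkness haunt my dreams",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(lyrics, labels)
print(clf.predict(["sunshine and joy in my heart"]))  # likely positive (1)
print(clf.predict(["lonely tears in the darkness"]))  # likely negative (0)
```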

Anomaly detection - what to use

What system to use for Anomaly detection?
I see that systems like Mahout do not list anomaly detection, but problems like classification, clustering, recommendation...
Any recommendations as well as tutorials and code examples would be great, since I haven't done it before.
There is an anomaly detection implementation in scikit-learn, which is based on One-class SVM. You can also check out the ELKI project which has spatial outlier detection implemented.
In addition to "anomaly detection", you can also expand your search with "outlier detection", "fraud detection", "intrusion detection" to get some more results.
There are three categories of outlier detection approaches, namely, supervised, semi-supervised, and unsupervised.
Supervised: Requires fully labeled training and testing datasets. An ordinary classifier is trained first and applied afterward.
Semi-supervised: Uses training and test datasets, whereas training data only consists of normal data without any outliers. A model of the normal class is learned and outliers can be detected afterward by deviating from that model.
Unsupervised: Does not require any labels; there is no distinction between a training and a test dataset. Data is scored solely based on intrinsic properties of the dataset.
If you have unlabeled data the following unsupervised anomaly detection approaches can be used to detect abnormal data:
Use an autoencoder that captures a feature representation of the data and flags as outliers the data points that are not well explained by that new representation. The outlier score for a data point is calculated from the reconstruction error (i.e., the squared distance between the original data and its projection). You can find implementations in H2O and TensorFlow.
Use a clustering method, such as Self-Organizing Maps (SOM) or k-prototypes, to cluster your unlabeled data into multiple groups. You can then detect external and internal outliers: external outliers are records positioned in the smallest cluster, while internal outliers are records distantly positioned inside a cluster. You can find code for SOM and k-prototypes.
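The reconstruction-error idea from the autoencoder approach above can be sketched without a deep-learning framework: PCA is the linear special case of an autoencoder, and the scoring logic is identical. The synthetic 5-D data and the injected anomaly below are assumptions for illustration:

```python
# Sketch of the reconstruction-error idea: project the data onto a low-rank
# subspace and flag points that reconstruct poorly. PCA stands in here for
# an autoencoder (it is the linear special case); the scoring is the same.
import numpy as np

rng = np.random.default_rng(1)
# Normal data varies mostly in the first two dimensions of a 5-D space
latent = rng.normal(size=(300, 2)) * [3.0, 2.0]
X = np.hstack([latent, rng.normal(0, 0.05, size=(300, 3))])
X[0, 4] += 4.0  # push one point off the low-rank subspace -> anomaly

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                         # top-2 principal directions
recon = (Xc @ components.T) @ components    # project and reconstruct
errors = np.square(Xc - recon).sum(axis=1)  # squared reconstruction error

print(errors.argmax())  # the shifted point should score highest (index 0)
```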
If you have labeled data, there are plenty of supervised classification approaches that you can try to detect outliers. Examples are Neural Networks, Decision Tree, and SVM.
