Anomaly detection on time series with the XGBoost algorithm - time-series

Why is the XGBoost algorithm not considered useful for anomaly detection on time series?
There are examples of using it for time-series forecasting (https://www.kaggle.com/code/robikscube/tutorial-time-series-forecasting-with-xgboost).
Is there an implementation where we could use this algorithm for anomaly detection and forecasting together on time-series data?

Anomalies are, by definition, rare. Standard classification algorithms generally struggle when one of the classes is rare, because of how their objective functions are formulated.
If you want to detect anomalies, one thing you can try is to use XGBoost to predict the time series, and then use the residuals to determine which points are "poorly" predicted by the algorithm and therefore anomalous.
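A minimal sketch of that residual approach (the toy series, the 24 lag features, and the 3-sigma threshold are illustrative choices, not prescribed by the answer):

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# Toy hourly series with one injected spike.
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=1000, freq="h")
y = pd.Series(np.sin(np.arange(1000) * 2 * np.pi / 24) + rng.normal(0, 0.1, 1000), index=idx)
y.iloc[500] += 5.0  # the anomaly we hope to recover

# Predict each point from its recent past (24 lag features).
X = pd.concat({f"lag_{k}": y.shift(k) for k in range(1, 25)}, axis=1).dropna()
y_target = y.loc[X.index]

model = XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X, y_target)

# Points the model predicts poorly are candidate anomalies (3-sigma rule as one choice).
residuals = y_target - model.predict(X)
print(y_target[np.abs(residuals) > 3 * residuals.std()])
```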

Related

Random Cut Forest anomaly detection on multivariate non-time-series data

I went through various articles/blogs on the AWS SageMaker unsupervised ML algorithm called Random Cut Forest, and all the examples I saw are based on time-series data. My doubt is: does Random Cut Forest detect anomalies only on time-series data, or can it also detect anomalies in multivariate, non-time-series data?
My use case is to detect anomalies based on a sudden increase in a specific event, e.g.
event1,event2,event3,device
100,1,1,device1
1,100,100,device2
1,1,1,device3
In this case, the anomaly detection algorithm should flag device1 and device2 as anomalies.
Yes, Random Cut Forest should work well for your use case. I agree that the samples are a bit limited; in practice you can apply this algorithm to any multivariate numerical dataset.
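A minimal sketch, assuming the open-source rrcf package (a Robust Random Cut Forest implementation, not SageMaker itself) and padding the three sample rows with synthetic "normal" devices:

```python
import numpy as np
import rrcf  # pip install rrcf

rng = np.random.default_rng(0)
# 100 "normal" devices with low event counts plus the two spiky devices from
# the question; a little jitter keeps the points distinct.
X = np.vstack([rng.poisson(1.0, size=(100, 3)),
               [[100, 1, 1], [1, 100, 100]]]).astype(float)
X += rng.normal(0, 0.01, X.shape)

num_trees = 40
forest = [rrcf.RCTree() for _ in range(num_trees)]
for tree in forest:
    for i, point in enumerate(X):
        tree.insert_point(point, index=i)

# Average collusive displacement across trees; higher means more anomalous.
avg_codisp = np.array([np.mean([tree.codisp(i) for tree in forest])
                       for i in range(len(X))])
print("most anomalous rows:", np.argsort(avg_codisp)[-2:])  # expect 100 and 101
```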

How to replace the anomalous data in time-series analysis?

I applied an isolation forest algorithm to identify the anomalous data in my time series. Now I want to replace those outliers before feeding the series into a machine learning model. How can I replace those outliers in time-series analysis?
It actually depends on the kind of data and what you want to do. Consider two time-dependent scenarios.
Predicting a target variable from sensor measurements. In this case, you can discard the sensor transmission errors to create a cleaner dataset for the downstream algorithms.
Fraud detection. In this case, you want to detect the pattern that produces the anomaly, so you can't drop or replace the outliers, because the outliers themselves are what you are analyzing.
R's forecast package provides tsclean(). The tsclean() function fits a robust trend using loess (for non-seasonal series), or a robust trend and seasonal components using STL (for seasonal series).
For non-seasonal time series, outliers are replaced by linear interpolation. For seasonal time series, the seasonal component from the STL fit is removed and the seasonally adjusted series is linearly interpolated to replace the outliers, before adding back the trend and seasonal components to the result.
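tsclean() lives in R, but the non-seasonal recipe (flag outliers, then linearly interpolate over them) is easy to approximate in Python. A rough sketch using an isolation forest as the detector, as in the question; the toy data and contamination level are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy series with an injected spike.
rng = np.random.default_rng(0)
s = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.1, 500))
s.iloc[100] = 8.0  # outlier

# Flag outliers, blank them out, and fill the gaps by linear interpolation.
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(s.to_frame())
cleaned = s.mask(labels == -1).interpolate(method="linear")
```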

How to evaluate unsupervised anomaly detection

I am trying to solve a regression problem, predicting a continuous value with machine learning. I have a dataset composed of 6 float columns.
The data come from low-cost sensors, which explains why we will very likely have values that can be considered out of the ordinary. To address this, before predicting my continuous target I will detect the anomalous data and use that as a filter. But the data I have are not labeled, which means I have an unsupervised anomaly detection problem.
The algorithms used for this task are Local Outlier Factor, One Class SVM, Isolation Forest, Elliptic Envelope and DBSCAN.
After fitting those algorithms, it is necessary to evaluate them to choose the best one.
Does anyone have an idea of how to evaluate an unsupervised algorithm for anomaly detection?
The only way is to generate synthetic anomalies, which means introducing outliers yourself, using your knowledge of what a typical outlier looks like.
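A minimal sketch of that idea: inject synthetic outliers, then score each detector against the injected labels (here with ROC AUC; the normal and outlier distributions are assumptions about what a typical outlier looks like):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, size=(1000, 6))  # six float columns, as in the question
X_anom = rng.uniform(-8, 8, size=(20, 6))    # injected synthetic outliers
X = np.vstack([X_normal, X_anom])
y = np.r_[np.zeros(len(X_normal)), np.ones(len(X_anom))]

# Score one candidate detector; repeat for LOF, One-Class SVM, etc. and compare.
iso = IsolationForest(random_state=0).fit(X)
scores = -iso.score_samples(X)  # higher = more anomalous
print("ROC AUC against the injected labels:", roc_auc_score(y, scores))
```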

time series or SVM for forecasting

I am trying to apply a machine learning algorithm to a dataset that consists of the emission of a pollutant gas from an engine, SO2 (the target variable), collected over 6 months at 15-minute intervals. The dataset also has other independent variables, such as pressure, vapour, etc., recorded over time.
Now the question is:
Should I go for time-series modelling like ARIMA for forecasting the SO2?
Or should I go for random forest or SVM for forecasting?
Thanks
I suggest that you go for time-series modelling instead of SVM. An SVM would treat the samples as i.i.d. (independent and identically distributed), and wouldn't use the information encapsulated across time.
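If you go the time-series route, a minimal sketch with statsmodels' SARIMAX, which also lets you feed in the other variables (pressure, vapour, ...) as exogenous regressors; the synthetic data and the (1, 0, 1) order are placeholders:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic stand-in for the SO2 data: 15-minute samples plus one exogenous variable.
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=500, freq="15min")
pressure = pd.Series(rng.normal(1.0, 0.1, len(idx)), index=idx)
so2 = 5 + 2 * pressure + rng.normal(0, 0.2, len(idx))

res = SARIMAX(so2, exog=pressure, order=(1, 0, 1)).fit(disp=False)

# Forecasting needs future values of the exogenous variables; here we simply
# reuse the last observed ones as placeholders.
future_exog = pressure.values[-4:].reshape(-1, 1)
print(res.forecast(steps=4, exog=future_exog))
```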

Gradient boosting predictions in low-latency production environments?

Can anyone recommend a strategy for making predictions using a gradient boosting model in the <10-15ms range (the faster the better)?
I have been using R's gbm package, but the first prediction takes ~50ms (subsequent vectorized predictions average to 1ms, so there appears to be overhead, perhaps in the call to the C++ library). As a guideline, there will be ~10-50 inputs and ~50-500 trees. The task is classification and I need access to predicted probabilities.
I know there are a lot of libraries out there, but I've had little luck finding information even on rough prediction times for them. The training will happen offline, so only predictions need to be fast -- also, predictions may come from a piece of code / library that is completely separate from whatever does the training (as long as there is a common format for representing the trees).
I'm the author of the scikit-learn gradient boosting module, a Gradient Boosted Regression Trees implementation in Python. I put some effort into optimizing prediction time, since the method was targeted at low-latency environments (in particular ranking problems); the prediction routine is written in C, but there is still some overhead due to Python function calls. Having said that: prediction time for single data points with ~50 features and about 250 trees should be << 1 ms.
In my use cases, prediction time is often dominated by the cost of feature extraction. I strongly recommend profiling to pinpoint the source of the overhead (if you use Python, I can recommend line_profiler).
If the source of the overhead is prediction rather than feature extraction, you might check whether it's possible to do batch predictions instead of predicting single data points, thus limiting the overhead of the Python function call (e.g., in ranking you often need to score the top-K documents, so you can do the feature extraction first and then run predict once on the K x n_features matrix), as sketched below.
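For illustration, a small sketch with scikit-learn's GradientBoostingClassifier; any tree-ensemble library with a vectorized predict behaves similarly, and the data here are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
model = GradientBoostingClassifier(n_estimators=250, random_state=0).fit(X, y)

X_topk = X[:20]  # e.g. the features of the top-K documents to score

# Slow: one Python call (and its fixed overhead) per document.
p_single = np.vstack([model.predict_proba(row.reshape(1, -1)) for row in X_topk])

# Fast: one call on the whole K x n_features matrix.
p_batch = model.predict_proba(X_topk)
assert np.allclose(p_single, p_batch)
```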
If this doesn't help either, you should try to limit the number of trees, because the runtime cost of prediction is basically linear in the number of trees.
There are a number of ways to limit the number of trees without affecting the model accuracy:
Proper tuning of the learning rate: the smaller the learning rate, the more trees are needed, and thus the slower prediction becomes.
Post-process the GBM with L1 regularization (Lasso): see Elements of Statistical Learning, Section 16.3.1. Use the predictions of each tree as new features, run that representation through an L1-regularized linear model, and remove the trees that get zero weight (see the sketch after this list).
Fully-corrective weight updates: instead of doing the line search/weight update only for the most recent tree, update all trees (see [Warmuth2006] and [Johnson2012]). Better convergence means fewer trees.
If none of the above does the trick, you could investigate cascades or early-exit strategies (see [Chen2012]).
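As an illustration of the second item, a sketch of the Lasso post-processing on scikit-learn's gradient boosting; the dataset and the alpha value are placeholders, and for simplicity the class labels are treated as a regression target:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Lasso

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05,
                                 random_state=0).fit(X, y)

# One feature per tree: the raw prediction of each individual tree (ESL 16.3.1).
tree_preds = np.column_stack([t.predict(X) for t in gbm.estimators_[:, 0]])

lasso = Lasso(alpha=0.001).fit(tree_preds, y)
kept = np.flatnonzero(lasso.coef_)
print(f"{kept.size} of {gbm.n_estimators} trees get non-zero weight;"
      " the rest can be dropped at prediction time")
```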
References:
[Warmuth2006] M. Warmuth, J. Liao, and G. Rätsch. Totally Corrective Boosting Algorithms that Maximize the Margin. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[Johnson2012] R. Johnson and T. Zhang. Learning Nonlinear Functions Using Regularized Greedy Forest. arXiv, 2012.
[Chen2012] M. Chen, Z. Xu, K. Weinberger, O. Chapelle, and D. Kedem. Classifier Cascade for Minimizing Feature Evaluation Cost. JMLR W&CP 22:218-226, 2012.
