How to replace the anomalous data in time-series analysis? - machine-learning

I applied an isolation forest algorithm to identify the anomalous data in my time series. Now I want to replace those outliers before feeding them into a machine learning model. How can we replace those outliers in time series analysis?

It depends on the kind of data and on what you want to do with it. Consider two time-dependent scenarios:
1. Predicting a target variable from sensor measurements. Here you can discard or replace readings caused by sensor transmission errors, which gives downstream algorithms a cleaner dataset.
2. Fraud detection. Here you want to learn the pattern that leads up to an anomaly, so you can't drop or replace the outliers: the outlier itself is what you are analyzing.
The forecast package in R provides the tsclean() function. It fits a robust trend using loess (for non-seasonal series), or robust trend and seasonal components using STL (for seasonal series).
For non-seasonal time series, outliers are replaced by linear interpolation. For seasonal time series, the seasonal component from the STL fit is removed and the seasonally adjusted series is linearly interpolated to replace the outliers, before adding back the trend and seasonal components to the result.
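The interpolation step for the non-seasonal case can be sketched in Python. This is not a port of tsclean(): it assumes the outlier flags already exist (for example from the isolation forest in the question), and the function name `interpolate_outliers` and the sample values are made up for illustration.

```python
def interpolate_outliers(values, is_outlier):
    """Replace flagged points by linear interpolation between the nearest
    unflagged neighbours. Flagged runs at either edge copy the nearest
    unflagged value, since there is nothing to interpolate towards."""
    if len(values) != len(is_outlier):
        raise ValueError("values and is_outlier must have the same length")
    good = [i for i, flag in enumerate(is_outlier) if not flag]
    if not good:
        raise ValueError("every point is flagged; nothing to interpolate from")

    cleaned = list(values)
    for i, flag in enumerate(is_outlier):
        if not flag:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            cleaned[i] = values[right]
        elif right is None:
            cleaned[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            cleaned[i] = values[left] + weight * (values[right] - values[left])
    return cleaned


# Hypothetical series; the mask marks the points a detector flagged.
series = [10.0, 11.0, 95.0, 13.0, 14.0, -40.0, -35.0, 17.0]
mask = [False, False, True, False, False, True, True, False]
cleaned = interpolate_outliers(series, mask)
```

For a seasonal series, the same step would be applied to the seasonally adjusted values, with the seasonal component added back afterwards, as described above.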

Related

Which SMOTE algorithm should I use for Augmentation of Time Series dataset?

I am working on a time series dataset where I want to do both forecasting and prediction. If you have any suggestions, please share. Thank you!
T-Smote
This allows one to both impute fully missing observations to allow uniform time series classification across the entire data and, in special cases, to impute individually missing features. To do so, we slightly generalize the well-known class imbalance algorithm SMOTE to allow component wise nearest neighbor interpolation that preserves correlations when there are no missing features. We visualize the method in the simplified setting of 2-dimensional uncoupled harmonic oscillators. Next, we use tSMOTE to train an Encoder/Decoder long-short term memory (LSTM) model with Logistic Regression for predicting and classifying distinct trajectories of different 2D oscillators.

Anomaly detection on time series with Xgboost algorithm

Why is the XGBoost algorithm not useful for anomaly detection on time series?
There are examples of using it for time series forecasting (https://www.kaggle.com/code/robikscube/tutorial-time-series-forecasting-with-xgboost).
Is there an implementation that uses this algorithm for anomaly detection and forecasting together on time series data?
Anomalies are rare by definition. Standard classification algorithms generally struggle when one of the classes is rare, because the objective function is dominated by the majority class.
If you wanted to detect anomalies, one of the things that you can try is to use xgboost to predict the time series, and then use the residual to determine which are "poorly" predicted by the algorithm and therefore are anomalous.
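A minimal sketch of the residual idea, assuming the forecasts have already been produced by some regressor such as XGBoost. The `predicted` list below is a hypothetical stand-in for model output, and the robust z-score rule (median absolute deviation, cut-off 3.5) is one possible choice of threshold, not something the answer prescribes.

```python
import statistics


def flag_by_residual(actual, predicted, threshold=3.5):
    """Flag points whose forecast residual is unusually large.

    Uses a modified z-score built from the median and the median absolute
    deviation (MAD), so a few large residuals do not inflate the scale.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    median = statistics.median(residuals)
    mad = statistics.median(abs(r - median) for r in residuals)
    if mad == 0:
        # More than half of the residuals are identical: flag any deviation.
        return [r != median for r in residuals]
    scores = [0.6745 * abs(r - median) / mad for r in residuals]
    return [score > threshold for score in scores]


# Hypothetical observations and forecasts (the forecasts would come from the model).
actual = [20.1, 20.4, 19.8, 20.0, 35.0, 20.3, 19.9, 20.2]
predicted = [20.0, 20.2, 20.0, 20.1, 20.2, 20.1, 20.0, 20.0]
flags = flag_by_residual(actual, predicted)
```

Only the fifth point is flagged here, because its residual is far larger than the typical forecast error.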

Should we always first perform feature normalization and then the feature reduction?

Sometimes we could first reduce the number of features with a method like PCA and then scale only the variables that remain. Is there a rule that normalization/scaling must come first, followed by feature reduction?
I would suggest normalizing/scaling your feature data first and then performing feature selection. Most feature selection techniques require a meaningful representation of your data. After normalization your features have the same order of magnitude and spread, which makes it easier to find which of them are more relevant.
For example, PCA uses the standard deviation (SD) of your features to find the relevant axes of a new projection of your data. If you do not normalize your data, features with a high SD get a higher weight than features with a small SD, which distorts their relevance when computing the PCA.
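The effect can be seen in a small two-feature sketch. The data values and function names are made up for illustration, and the first principal component is computed from the closed-form eigenvector of the 2x2 covariance matrix rather than with a PCA library.

```python
import math
import statistics


def first_pc_direction(x, y):
    """Unit vector of the first principal component for two features."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    var_x = statistics.variance(x)
    var_y = statistics.variance(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    # Major-axis angle of a symmetric 2x2 matrix [[var_x, cov], [cov, var_y]].
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    return math.cos(angle), math.sin(angle)


def standardize(values):
    """Rescale to zero mean and unit sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical features on very different scales.
income = [30000.0, 42000.0, 51000.0, 64000.0, 80000.0]
rating = [2.0, 3.5, 3.0, 4.5, 5.0]

raw_pc = first_pc_direction(income, rating)
scaled_pc = first_pc_direction(standardize(income), standardize(rating))
```

On the raw data the first component points almost entirely along `income`, simply because its SD is thousands of times larger. After standardization both features load equally.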

Isolation Forest for time series data

I just wonder if the isolation Forest (iForest) can work with time-series data. As far as I know, iForest is used for anomaly detection and it is based on randomization techniques to randomly and recursively partition the data and then save the partition in a tree structure.
My question is theoretical: since iForest relies on randomization, can it work with time series data at all? Wouldn't the random partitioning break the time dependencies that characterize a time series?
Isolation forest will help with detecting point anomalies by default, since in principle it is just working on the rarity of these observations.
But let’s say I am interested in anomalies in time series data. Isolation forest will be able to pick out the extreme peaks and troughs that occur as point anomalies. For collective anomalies, you may need to transform the data so that each observation represents a collection of observations, for example with rolling window operations.
The reason is that in time series data you are interested in additive outliers or temporal changes, so each individual observation must capture that behavior if you plan to use isolation forest. You can also try other techniques such as STL decomposition, ARIMA, regression trees, or exponential smoothing. You should find a lot of material on how to use these for anomaly detection in time series.
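The rolling window transformation mentioned above might look like the sketch below. The choice of summary features (mean, range, net change) and the function names are assumptions for illustration. The resulting rows are what you would hand to an isolation forest, for example scikit-learn's IsolationForest, which is not called here.

```python
def rolling_windows(series, width):
    """Return every contiguous window of the given width, in time order."""
    if width < 1 or width > len(series):
        raise ValueError("width must be between 1 and len(series)")
    return [list(series[i:i + width]) for i in range(len(series) - width + 1)]


def window_features(window):
    """Summarise one window so a single row describes local behaviour."""
    mean = sum(window) / len(window)
    value_range = max(window) - min(window)
    net_change = window[-1] - window[0]
    return [mean, value_range, net_change]


# Hypothetical series with a level shift halfway through.
series = [5.0, 5.1, 4.9, 5.0, 9.0, 9.1, 8.9, 9.0]
rows = [window_features(w) for w in rolling_windows(series, width=4)]
# Each row now represents four consecutive observations; the windows that
# straddle the shift have a much larger range and net change than the rest.
```

Because each row summarizes a stretch of time rather than a single reading, a window that spans a temporal change becomes rare in feature space, which is the property isolation forest relies on.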

Measuring the accuracy of a model and the importance of a feature in SVM

I'm starting to use LIBSVM for regression analysis. My world has about 20 features and thousands to millions of training samples.
I'm curious about two things:
Is there a metric that indicates the accuracy or confidence of the model, perhaps in the .model file or elsewhere?
How can I determine whether or not a feature is significant? E.g., if I'm trying to predict body weight as a function of height, shoulder width, gender and hair color, I might discover that hair color is not a significant feature in predicting weight. Is that reflected in the .model file, or is there some way to find out?
libSVM calculates p-values for test points based upon the certainty of the classifier (i.e., how far is the test point from the decision boundary and how wide are the margins).
I think you should consider the determination of feature importance a separate problem from training your SVMs. There are tons of approaches for "feature selection" (just open any text book) but one easy to understand, straightforward approach would be a simple cross-validation as follows:
1. Divide your dataset into k folds (e.g., k = 10 is common).
2. For each of the k folds:
   a. Separate your data into train/test sets (the current fold is the test set, the rest are the training set).
   b. Train your SVM classifier using only n-1 of your n features.
   c. Measure the prediction performance.
3. Average the performance of your n-1 feature classifier over all k test folds.
4. Repeat steps 1-3 for each of the remaining features.
You could also do the reverse where you test each of the n features separately but you will likely miss out on important second and higher order interactions between the features.
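The leave-one-feature-out procedure can be sketched as follows. To keep the example self-contained it uses a 1-nearest-neighbour regressor as a stand-in for the SVM. In practice you would pass your own `fit_predict` callable that trains and applies the SVM. All names and the synthetic data are hypothetical.

```python
import random


def nearest_neighbour_predict(train_X, train_y, test_X):
    """Stand-in model (NOT an SVM): predict the target of the closest training row."""
    predictions = []
    for row in test_X:
        best = min(
            range(len(train_X)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, train_X[j])),
        )
        predictions.append(train_y[best])
    return predictions


def drop_column(X, column):
    """Copy of X without the given feature column."""
    return [[v for c, v in enumerate(row) if c != column] for row in X]


def cv_error(X, y, k, fit_predict):
    """Mean squared error averaged over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        errors = [(p - y[i]) ** 2 for p, i in zip(preds, test_idx)]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


def leave_one_feature_out(X, y, k=5, fit_predict=nearest_neighbour_predict):
    """Map each feature index to the CV error obtained WITHOUT that feature."""
    return {f: cv_error(drop_column(X, f), y, k, fit_predict) for f in range(len(X[0]))}


# Synthetic data: feature 0 determines the target, feature 1 is pure noise.
rng = random.Random(0)
X = [[i / 60, rng.random()] for i in range(60)]
y = [10 * row[0] for row in X]
errors = leave_one_feature_out(X, y)
```

A feature whose removal raises the error sharply (feature 0 here) matters to the model. A feature whose removal changes little (feature 1) is a candidate for dropping.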
In general, however, SVMs are good at ignoring irrelevant features.
You may also want to try and visualize your data using Principal Components Analysis to get a feel for how the data is distributed.
The F-score is a metric commonly used for features selection in Machine Learning.
Since version 3.0, the LIBSVM library includes a directory called tools. In that directory is a Python script called fselect.py, which calculates the F-score. To use it, run it from the command line and pass in the training data file (and optionally a testing data file).
python fselect.py data_training data_testing
The output is comprised of an fscore for each of the features in your data set which corresponds to the importance of that feature to the model result (regression score).
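For intuition, here is a sketch of one common two-class definition of the F-score: the separation of the class means from the overall mean, divided by the sum of the within-class sample variances. Whether fselect.py computes exactly this is an assumption. The code is not taken from that script, and the data values are invented.

```python
import statistics


def f_score(feature_values, labels):
    """Two-class F-score of a single feature (labels: 1 = positive, else negative)."""
    positives = [v for v, label in zip(feature_values, labels) if label == 1]
    negatives = [v for v, label in zip(feature_values, labels) if label != 1]
    if len(positives) < 2 or len(negatives) < 2:
        raise ValueError("need at least two samples in each class")
    overall = statistics.mean(feature_values)
    between = ((statistics.mean(positives) - overall) ** 2
               + (statistics.mean(negatives) - overall) ** 2)
    within = statistics.variance(positives) + statistics.variance(negatives)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


labels = [1, 1, 1, -1, -1, -1]
separating = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]     # class means differ clearly
uninformative = [1.0, 5.0, 3.0, 2.0, 4.0, 3.0]  # class means coincide

score_separating = f_score(separating, labels)
score_uninformative = f_score(uninformative, labels)
```

A larger score means the feature separates the two classes better on its own. Like the single-feature test mentioned earlier, it ignores interactions between features.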
