Do I need stationary data for Isolation Forest Model? - time-series

I am looking to do some outlier detection on time series data using the Isolation Forest Model. Do I have to remove trends/seasonality from my data and make sure that the data is stationary?
Thanks

Related

train/validate/test split for time series anomaly detection

I'm trying to perform a multivariate time series anomaly detection. I have training data that consists of "normal" data. I train on this data and detect anomalies on the test set that contains normal + anomalous data. My understanding is that it would be wrong to tweak the model hyperparameters based on the results from the test set.
What would the train/validate/test set look like to train and evaluate a time-series anomaly detector?
Nothing here is very specific to anomaly detection. You need to split the data into one or more validation and test sets, while making sure they are reasonably independent (no information leakage between them). For time series, that usually means splitting chronologically, as in the sketch below.
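A minimal sketch of such a split, assuming data is a single array ordered by time; the 60/20/20 boundaries are purely illustrative:

    n = len(data)
    train = data[: int(0.6 * n)]               # "normal" data for fitting
    val = data[int(0.6 * n) : int(0.8 * n)]    # for tuning hyperparameters
    test = data[int(0.8 * n) :]                # untouched until final evaluation

Because the slices are contiguous in time and later data never feeds into earlier splits, tuning on val does not leak information from test.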

Isolation Forest for time series data

I just wonder if the Isolation Forest (iForest) can work with time-series data. As far as I know, iForest is used for anomaly detection, and it is based on randomization techniques that randomly and recursively partition the data and then save the partitions in a tree structure.
My question is theoretical: since iForest relies on randomization, would applying it to time series violate the time-series characteristics, as the randomization may break the time dependencies?
Isolation Forest will help with detecting point anomalies by default, since in principle it just works on the rarity of observations.
But say you are interested in anomalies in time series data. Isolation Forest will be able to pick out the extreme peaks and troughs that occur as point anomalies, but for collective anomalies you may need to transform the data so that each observation represents a collection of observations (rolling-window operations, etc.), as sketched below.
The reason is that in time series data you are interested in additive outliers or temporal changes, so each observation must carry that temporal context individually if you plan to use Isolation Forest. You can also try other techniques such as STL decomposition, ARIMA, regression trees, or exponential smoothing; you should find a lot of material on how to use these for anomaly detection in time series.
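For instance, a rolling-window transformation followed by Isolation Forest could look like this (the series name s, the window length of 24, and the contamination rate are all assumptions for illustration):

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # turn each observation into a small vector of context features,
    # so collective anomalies show up as unusual feature combinations
    features = pd.DataFrame({
        "value": s,
        "roll_mean": s.rolling(24).mean(),
        "roll_std": s.rolling(24).std(),
        "diff": s.diff(),
    }).dropna()

    iforest = IsolationForest(contamination=0.01, random_state=0)
    labels = iforest.fit_predict(features)  # -1 marks anomalies

The rolling mean/std capture local level and volatility, so a point that is normal globally but abnormal relative to its neighborhood can still be isolated.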

Why do we need test_generator and val_generator for data augmentation

Data augmentation is applied during training only. I'm wondering why several tutorials create a test_generator and a val_generator. Why don't we create only a train_generator?
Actually, it is good practice to keep the training data and validation data separate. If you create just one generator, there is a high chance that you validate your model on the same augmented data, which introduces a bias into your accuracy. Moreover, we normally use data augmentation when we have a small amount of training data, which makes things even worse and can end up producing a highly biased model. Therefore, we should separate the data and make sure the model has never been exposed to any of the validation data, so that it does not bias your performance estimate.
For example, you may end up training the model on picture-1 rotated clockwise and validating it on picture-1 rotated anti-clockwise. The validation accuracy that we normally use to detect overfitting is then biased, and you may end up with an overfitted model without knowing when it happened during training.
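For instance, with Keras' ImageDataGenerator the usual pattern is to put the augmentation parameters only on the training generator; the directory paths and parameter values below are placeholders:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # augmentation only for training
    train_gen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=20,
                                   horizontal_flip=True)
    # validation/test generators only rescale, so evaluation
    # sees unmodified images
    plain_gen = ImageDataGenerator(rescale=1./255)

    train_flow = train_gen.flow_from_directory("data/train", target_size=(224, 224))
    val_flow = plain_gen.flow_from_directory("data/val", target_size=(224, 224))
    test_flow = plain_gen.flow_from_directory("data/test", target_size=(224, 224),
                                              shuffle=False)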

Clustering models like DBSCAN, OPTICS, KMeans

I have a doubt: after clustering with any algorithm, is it possible to segment new data based on what was learned from the previous data?
The issue is that clustering algorithms are unsupervised learning algorithms. They don't need a dependent variable and don't predict classes; they are used to find structures/similarities in the data points. What you can do is treat the clustered data as your supervised data.
The approach would be to cluster the training data and assign the cluster labels, treat it as a multi-class classification dataset, train a new multi-class classification model on it, and validate it on the test data.
Let train and test be the datasets. A runnable version of the idea with sklearn (the cluster count and the classifier choice are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier

    # cluster the training data and use the cluster ids as labels
    labels = KMeans(n_clusters=5, random_state=0).fit_predict(train)
    # fit a supervised classifier on those pseudo-labels
    model = RandomForestClassifier().fit(train, labels)
    # assign new data to the learned clusters
    prediction = model.predict(test)
Interestingly, however, KMeans in sklearn provides fit and predict methods, so using KMeans from sklearn you can predict on new data directly. DBSCAN, on the other hand, doesn't have predict, which is quite obvious from its working mechanism: there are no centroids to assign new points to.
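A minimal illustration of that difference, reusing the train and test arrays from above (parameter values are placeholders):

    from sklearn.cluster import KMeans, DBSCAN

    km = KMeans(n_clusters=5, random_state=0).fit(train)
    km.predict(test)      # works: each point goes to its nearest centroid

    db = DBSCAN(eps=0.5).fit(train)
    # db.predict(test)    # AttributeError: DBSCAN defines no predict method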
Clustering is an unsupervised mechanism where the number of clusters and the identity of the segments to be clustered are not known to the system.
Hence, what you can do is take the learning of a model that was trained for clustering, classification, identification, or verification and apply that learning to your use case of clustering.
If the new data is from the same domain as the training data, you will most probably end up with better clustering accuracy. (You need to choose the clustering methodology properly based on the type of data; e.g., for voice clustering, dominant sets and hierarchical clustering are the strongest candidates.)
If the new data is from a different domain, then the selected model may fail, as it learned features corresponding to the domain of the training data.

k-means clustered data: how to label newly incoming data

I have a data set with labels that were produced by a k-means clustering algorithm. Now there is some data (with the same data structure) from another source, and I wonder what the most sensible way is to label this new, yet unseen data. I was thinking about either
calculating the distance to the prior k-means centroids and labeling each point with the nearest centroid accordingly, or
running a new algorithm (e.g. SVM) on the new data using the old data as the training set.
Unfortunately, I couldn't find anything about this particular problem. There are only a few questions about the general use of k-means as a classification model:
Can k-means clustering do classification?
How to segment new data with existing K-means model?
Thanks in advance.
Uli
You don't need the SVM approach; the first way is more convenient. If you are using sklearn, there is an example at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html; the predict function will do the job.
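A minimal sketch, assuming old_data was used for the original clustering and new_data has the same feature layout (both names are placeholders):

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=5, random_state=0).fit(old_data)
    # predict() implements exactly your first option: each new point
    # is labeled with the nearest of the fitted centroids
    new_labels = km.predict(new_data)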
