Does SVR handle outliers and seasonality? - time-series

I have time-series data and am trying to build a model using Support Vector Regression (SVR). If I use SVR to build the model, should I be worried about seasonality, trend, and outliers? If I should care about these things, how can I deal with trends, seasonality, and outliers while building a model using SVR?
Thank you

Related

Which ML algorithm should I use for this dataset

I have a dataset, let's say data1, data2, data3... The output or predicted variable should be the names of people, based on the given dataset. I have a training dataset but I am not sure which ML algorithm to use. The list of people's names does not change.
It sounds like you are doing a classification task, so preferably you should use a classification algorithm. Which algorithm to use really depends on the quality and structure of your data and its decision boundaries. Typically, before you embark on a classification task, you must identify your data's outliers, noise, class imbalances, missing values, and other data quality issues. From there, you should select a model that best suits your needs.
For example, if your data contains lots of outliers and missing values, a decision tree might be preferable. However, if you have a large class imbalance, anomaly detection may be better suited. If your decision boundary is linear, you could make use of support vector machines, while if you have non-linear decision boundaries you'll need to look into more complex models such as Gaussian discriminative models, self-organizing maps, or neural networks.
In summary, it is entirely dependent on your data.
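To make that concrete, here is a minimal sketch (using scikit-learn; the synthetic features and the three name classes are hypothetical stand-ins for your data) of comparing a few candidate classifiers with cross-validation before committing to one:

```python
# Minimal sketch: compare a few candidate classifiers with cross-validation.
# X and y are synthetic stand-ins for your features and name labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "linear SVM": SVC(kernel="linear"),
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```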

Application and Deployment of K-Fold Cross-Validation

K-Fold Cross-Validation is a technique for splitting the data into K folds for training and testing. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, once on each training fold, and then tested on the corresponding test fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 Folds. Suppose after training each model on each of the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has a higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment. Do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K-Folds?
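For concreteness, a minimal sketch of that comparison in scikit-learn might look like the following (the breast-cancer dataset is just a stand-in for the arbitrary dataset mentioned above):

```python
# Sketch of the 10-fold CV comparison: a decision tree vs. logistic regression.
# The breast-cancer dataset stands in for the arbitrary dataset in the question.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

tree_scores = cross_val_score(tree, X, y, cv=10)
logreg_scores = cross_val_score(logreg, X, y, cv=10)

print("decision tree mean accuracy:      ", np.mean(tree_scores))
print("logistic regression mean accuracy:", np.mean(logreg_scores))
```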
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data you have and deploy that model.
The other, more obvious option is to deploy one of the CV models.
About the ensemble option, I believe it should not give significantly better results than a model trained on all the data: each model is trained for the same amount of time, with similar parameters and the same architecture, only on slightly different training data, so they shouldn't show very different performance. In my experience, an ensemble helps when the outputs of the models differ because of architecture or input data (like different image sizes).
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
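A minimal sketch of that standard workflow (scikit-learn, with a stand-in dataset and the logistic-regression model selected by CV) would be:

```python
# Sketch: after CV has selected logistic regression, discard the 10 fold models
# and refit the chosen model on the full training data before deployment.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

final_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
final_model.fit(X_train, y_train)  # single model trained on all training data
print("held-out test accuracy:", final_model.score(X_test, y_test))
```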

Evaluate CNN model for multiclass image classification

I want to ask which metrics can be used to evaluate my CNN model for multi-class classification. I have 3 classes for now and I'm just using accuracy and a confusion matrix, and also plotting the model's loss. Are there any other metrics I can use to evaluate my model's performance?
Evaluating the performance of a model is one of the most crucial phases of any machine learning project cycle and must be done effectively. Since you have mentioned that you are using accuracy and a confusion matrix for evaluation, I would like to add some points for developing a better evaluation strategy:
Consider that you are developing a classifier that classifies an email as SPAM or NON-SPAM (HAM). One possible evaluation criterion is the false positive rate, because it can be really annoying if a non-spam email ends up in the spam category (which means you will miss a valuable email).
So, I recommend you consider metrics based on the problem you are targeting. There are many metrics, such as F1 score, recall, and precision, that you can choose based on the problem you are facing.
You can visit: https://medium.com/apprentice-journal/evaluating-multi-class-classifiers-12b2946e755b for better understanding.
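As a concrete starting point, scikit-learn's classification_report prints per-class precision, recall, and F1 alongside the confusion matrix; here is a minimal sketch with hypothetical true and predicted labels for a 3-class problem:

```python
# Sketch: per-class precision, recall, and F1 for a 3-class problem.
# y_true and y_pred are hypothetical labels and CNN predictions.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 1, 2, 1, 0, 2, 0, 0, 2]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
```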

How to use over-sampled data in cross validation?

I have an imbalanced dataset. I am using SMOTE (Synthetic Minority Oversampling Technique) to perform oversampling. When performing the binary classification, I use 10-fold cross-validation on this oversampled dataset.
However, I recently came across this paper, Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models, which mentions that it is incorrect to use the oversampled dataset during cross-validation, as it leads to overoptimistic performance estimates.
What is the correct approach/procedure for using over-sampled data in cross-validation?
To avoid overoptimistic performance estimates from cross-validation in Weka when using a supervised filter, use FilteredClassifier (in the meta category) and configure it with the filter (e.g. SMOTE) and classifier (e.g. Naive Bayes) that you want to use.
For each cross-validation fold Weka will use only that fold's training data to parameterise the filter.
When you do this with SMOTE you won't see a difference in the number of instances in the Weka results window. What is happening is that Weka builds the model on the SMOTE-applied dataset but shows the output of evaluating it on the unfiltered training set, which makes sense in terms of understanding the real performance. Try changing the SMOTE filter settings (e.g. the -P setting, which controls how many additional minority-class instances are generated as a percentage of the number in the dataset) and you should see the performance change, showing you that the filter is actually doing something.
The use of FilteredClassifier is illustrated in this video and these slides from the More Data Mining with Weka online course. In this example the filtering operation is supervised discretisation, not SMOTE, but the same principle applies to any supervised filter.
If you have further questions about the SMOTE technique I suggest asking them on Cross Validated and/or the Weka mailing list.
The correct approach is to first split the data into multiple folds and then apply sampling only to the training data, leaving the validation data as is. The image below illustrates how the dataset should be resampled in a K-fold fashion.
If you want to achieve this in Python, there is a library for that:
Link to the library: https://pypi.org/project/k-fold-imblearn/
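If you work with scikit-learn style code rather than that library, the same principle (oversample only the training portion of each fold) can be sketched with imbalanced-learn's Pipeline, which refits SMOTE inside every fold during cross-validation; the imbalanced dataset below is synthetic and purely illustrative:

```python
# Sketch: SMOTE applied only to the training portion of each CV fold via an
# imbalanced-learn Pipeline; the validation fold is left untouched.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

pipeline = Pipeline([
    ("smote", SMOTE(random_state=0)),          # fitted on the training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, scoring="f1", cv=cv)
print("mean F1 across folds:", scores.mean())
```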

Multiple sensors = multiple deep learning models?

Let's say I have 30,000 vibration sensors monitoring 30,000 drills (1 sensor per drill) in different workplaces. I need to detect anomalies in vibration patterns.
Given we have enough historical data, how would you go about creating models for this problem?
This is a somewhat ambiguous question; however, you can follow these broad steps to perform anomaly detection:
Load the data into your computing environment, e.g. Python, MATLAB, or R. This assumes your data can fit into memory; otherwise, you may want to consider setting up a Hadoop or Spark cluster on Amazon EC2 or another virtual cluster.
You should perform some EDA to understand your data better. This will reveal more about the underlying structure of the data, what kind of distribution it comes from, etc.
Make rough visual plots of your data if possible. These will come in handy when you need to polish some final plots for a presentation when reporting your analysis.
Based on the EDA, you can then intuitively prepare your data for processing. You may need to transform, rescale, or standardize the dataset before applying any machine learning technique for anomaly detection.
For supervised datasets (i.e. labels are provided), you may consider algorithms such as SVMs, neural networks, XGBoost, or any other appropriate supervised technique. However, great care must be taken in evaluating the results because, as is typical of anomaly detection datasets, there is more often than not a very small number of positive examples (y = 1) relative to the total number of examples. This is called class imbalance. There are various ways of mitigating this problem; see Class Imbalance Problem.
For unsupervised datasets, you can use techniques such as density-based methods (i.e. Local Outlier Factor (LOF) and its variants, or k-Nearest Neighbours (kNN), which is a very popular method), One-class SVM, etc.; a minimal sketch of this route is given after the notes below. A comparative overview of unsupervised methods for anomaly detection is given in the study A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.
N.B.:
- Don't forget to consider rudimentary ML practices when building your models, such as splitting into training/test sets or exploring resampling methods such as k-fold CV, LOOCV, etc., to control bias/variance in your results.
- Explore other techniques such as ensemble methods (i.e. boosting and bagging algorithms) to improve model accuracy.
- Deep learning techniques such as the Multi-layer Perceptron can be explored for this problem. If there is a time-series component, a Recurrent Neural Network (RNN) can be explored.
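As promised above, here is a minimal sketch of the unsupervised route using scikit-learn's Local Outlier Factor; the vibration features are synthetic stand-ins for windows of sensor readings:

```python
# Sketch: unsupervised anomaly detection with Local Outlier Factor.
# X stands in for a matrix of vibration features (one row per time window).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 8))    # typical vibration windows
anomalous = rng.normal(loc=6.0, scale=1.0, size=(10, 8))  # unusual vibration windows
X = np.vstack([normal, anomalous])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)  # -1 marks detected anomalies, 1 marks inliers
print("detected anomalies:", int(np.sum(labels == -1)))
```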
