I am trying to apply a machine learning algorithm to a dataset of pollutant gas emissions from an engine, namely SO2 (the target variable), collected over 6 months at 15-minute intervals. The dataset also has other independent variables, such as pressure and vapour, recorded over time.
Now the question is:
should I go for time-series modelling like ARIMA for forecasting the SO2?
or should I go for random forest or SVM for forecasting?
Thanks
I suggest that you go for time-series modeling instead of SVM. An SVM assumes i.i.d. (independent and identically distributed) samples, and would not use the information encapsulated across time.
Related
Why is the xgboost algorithm not useful for anomaly detection on time series?
There are some examples of forecasting on time series (https://www.kaggle.com/code/robikscube/tutorial-time-series-forecasting-with-xgboost).
Is there an implementation where we could use this algorithm for anomaly detection and forecasting together on time series data?
Anomalies are, by definition, rare. Standard classification algorithms generally have issues when one of the classes is rare, because of their objective function.
If you want to detect anomalies, one thing you can try is to use xgboost to predict the time series, and then use the residuals to determine which points are "poorly" predicted by the algorithm and therefore anomalous.
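A minimal sketch of that residual-based approach on synthetic data. Here `GradientBoostingRegressor` stands in for xgboost (the idea is identical: fit on a clean stretch of the series, then flag points with unusually large residuals); the lag count, split point, and threshold are illustrative choices, not anything prescribed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
y = np.sin(np.arange(300) * 0.1) + rng.normal(0, 0.05, size=300)
y[200] += 3.0  # inject one obvious anomaly into the later part

# Lagged features: predict y[t] from y[t-5..t-1]
lags = 5
X = np.column_stack([y[i:len(y) - lags + i] for i in range(lags)])
target = y[lags:]

# Fit on the earlier (clean) part, score residuals on the rest
split = 150
model = GradientBoostingRegressor(random_state=0).fit(X[:split], target[:split])
residuals = np.abs(target[split:] - model.predict(X[split:]))

# Flag points whose residual is far outside the typical range
threshold = residuals.mean() + 4 * residuals.std()
anomalous_t = np.where(residuals > threshold)[0] + split + lags
print(anomalous_t)
```

The injected spike at t=200 produces a residual around 3 against a noise floor of about 0.05, so it clears the threshold easily; in practice the threshold would be tuned to your tolerance for false alarms.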
I have hourly data on the number of minutes people spend online, covering 2 years. The values are therefore distributed between 0 and 60, and most of the data is either 0 or 60. My goal is to predict the number of minutes a person will spend online in the future (next day/hour/month etc.). What kind of approach or machine learning model can I use to predict this data? Can it be modelled as a regression/forecasting problem in spite of the skewness?
For time-series data and its prediction, a regression model is a better fit than a classification or clustering model, because the task is to estimate specific numeric values.
It can be modeled as a regression problem to some extent, but more skewness means moving further from a normal distribution, which can hurt how the data is expressed in the model, lower prediction accuracy, and so forth. In any case, data with significant skewness is hard to regard as well-prepared, so you may need to transform or resample the data so that its skewness decreases.
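As a generic illustration of reducing skewness with a monotone transform (this uses a right-skewed synthetic sample; for the bounded 0-to-60 data described above, resampling or a two-part model may be a better fit):

```python
import numpy as np

def skewness(x):
    # Sample skewness: the third standardized moment
    x = np.asarray(x, dtype=float)
    return float(np.mean(((x - x.mean()) / x.std()) ** 3))

rng = np.random.default_rng(1)
data = rng.exponential(scale=10, size=10_000)  # heavily right-skewed

print(skewness(data))            # strongly positive (around 2 for an exponential)
print(skewness(np.log1p(data)))  # much closer to 0
```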
I need to create a prediction model that predicts the quantity of an item per day...
This is how my data looks in the DB:
item id | date       | quantity
1000    | 2020-02-03 | 5
What I did is convert the date to:
year number
number of the week in year
weekday number
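For reference, that decomposition can be done with Python's standard library (ML.NET's DateTime exposes equivalent properties); the row from the table above serves as the example:

```python
from datetime import date

d = date(2020, 2, 3)

year_number = d.year
week_of_year = d.isocalendar()[1]  # ISO week number
weekday_number = d.weekday()       # Monday == 0

print(year_number, week_of_year, weekday_number)  # 2020 6 0
```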
I trained this model on a dataset of 100,000 items with RegressionFastForest, RegressionFastTree, LbfgsPoissonRegression, and FastTreeTweedie,
but the results are not so good (an RMSE score of 3.5 to 4).
Am I doing this wrong?
I am using ML.NET, if it matters.
Thanks
There are several techniques for time-series forecasting, but the main point is this: we don't look for a dependency of the value on the date. Instead, we look for a dependence of value[i] on value[i-1].
The most common techniques are the ARIMA family of models and recurrent neural networks; I would recommend reading about them. But if you don't have much time, there is something that can help: auto-ARIMA models.
Implementations of auto-ARIMA exist at least in Python and R. Here is the Python version (note that the pyramid package has since been renamed to pmdarima):
import pmdarima as pm
model = pm.auto_arima(y)
where y is your time series.
P.S. Even though it is called an auto model (meaning the algorithm chooses the best hyperparameters by itself), you should still understand what p, q, P, Q and S mean.
There are several problems with directly applying linear regression to your data.
1) If item id is just an index and does not reflect physical properties of the item, then it is a categorical feature. Use one-hot encoding to replace it with regression-friendly labels.
2) If you assume your data may have a cyclical dependence on the time of day/week/month, use the sin and cos of those features. This will not work with the year, as it is not periodic. There are good guides with examples in Python.
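A minimal sketch of that cyclical encoding, assuming hourly data with period 24:

```python
import numpy as np

hour = np.arange(24)
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)

# In the (sin, cos) plane, hour 23 sits right next to hour 0,
# whereas the raw integers put them 23 apart
raw_gap = abs(23 - 0)
encoded_gap = float(np.hypot(hour_sin[23] - hour_sin[0],
                             hour_cos[23] - hour_cos[0]))
print(raw_gap, round(encoded_gap, 3))  # 23 0.261
```

The same construction works for day-of-week (period 7) or month (period 12); the regressor then sees the periodicity directly instead of an artificial jump at the wrap-around.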
Good luck!
P.S. I usually use LogisticRegression on sparse one-hot representations of categorical features as a benchmark. It will not be as good as a state-of-the-art NN solution, but it gives me a clue what the baseline looks like.
I have time-series data of size 100000*5: 100,000 samples and five variables. I have labeled each of the 100,000 samples as either 0 or 1, i.e. binary classification.
I want to train an LSTM on it, because of the time-series nature of the data. I have seen examples of LSTMs for time-series prediction; is it suitable to use one in my case?
Not sure about your needs.
An LSTM is best suited to sequence models, like the time series you mention, but your description doesn't look like a time series.
In any case, you can use an LSTM on a time series not for prediction but for classification, as in this article.
In my experience, for binary classification with only 5 features you could find better methods: an LSTM will consume more memory than other methods and could give worse results.
First of all, you can look at it from a different perspective: instead of having 100,000 labeled samples of 5 variables, you can treat it as 100,000 unlabeled samples of 6 variables, where the 6th variable is the label.
You can then train your LSTM as a multivariate predictor of that 6th variable (the sample label), and compare with the ground truth during testing to evaluate its performance.
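A sketch of that reshaping with NumPy (a smaller n for brevity; the window length w is an illustrative choice). The resulting windows/targets pair is what an LSTM-style trainer would consume:

```python
import numpy as np

n, n_features = 1_000, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n, n_features))
labels = rng.integers(0, 2, size=n)

# Treat the label as a 6th variable of the multivariate series
series = np.column_stack([X, labels])  # shape (1000, 6)

# Sliding windows: predict the label at step t from the w previous steps
w = 10
windows = np.stack([series[t - w:t] for t in range(w, n)])
targets = series[w:, -1]  # the label column, one step ahead

print(windows.shape, targets.shape)  # (990, 10, 6) (990,)
```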
I am considering using random forest for a classification problem. The data comes in sequences. I plan to use the first N (500) to train the classifier, then use the classifier to classify the data after that. It will make mistakes, and the mistakes can sometimes be recorded.
My question is: can I use those misclassified data points to retrain the original classifier, and how? If I simply add the misclassified ones to the original training set of size N, their importance will be exaggerated, since the correctly classified ones are ignored. Do I have to retrain the classifier using all the data? What other classifiers can do this kind of learning?
What you describe is a basic version of the boosting meta-algorithm.
It's better if your underlying learner has a natural way to handle sample weights. I have not tried boosting random forests (boosting is generally used on individual shallow decision trees with a depth limit between 1 and 3); it might work, but will likely be very CPU intensive.
Alternatively, you can train several independent boosted decision stumps in parallel with different PRNG seed values and then aggregate the final decision function as you would with a random forest (e.g. voting or averaging class probability assignments).
If you are using Python, you should have a look at the scikit-learn documentation on the topic.
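In scikit-learn, that reweighting scheme is AdaBoostClassifier, whose default base learner is a depth-1 decision stump. A quick sketch on a synthetic dataset (the sample counts and estimator count are illustrative):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, random_state=0)

# AdaBoost reweights samples so each new stump focuses on earlier mistakes
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X[:500], y[:500])

acc = clf.score(X[500:], y[500:])
print(acc)
```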
Disclaimer: I am a scikit-learn contributor.
Here is my understanding of your problem.
You have a dataset and create two subsets from it: a training dataset and an evaluation dataset. How can you use the evaluation dataset to improve classification performance?
The point of this problem isn't to find a better classifier but to find a good way to evaluate, and then have a good classifier in the production environment.
Evaluation purpose
Since the evaluation dataset has been tagged for evaluation, there is no way to do this. You must use another scheme for training and evaluation.
A common way to do this is cross-validation:
Randomize the samples in your dataset and create ten partitions from it. Then do ten iterations of the following:
take all partitions but the n-th for training, and evaluate on the n-th.
After this, take the median of the errors of the ten runs.
This will give you the error rate of your classifiers.
The worst run gives you the worst case.
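With scikit-learn, that whole procedure is a few lines (synthetic data here, just for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# 10-fold CV: each partition is held out once while the other nine train
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
errors = 1 - scores

print(np.median(errors), errors.max())  # typical error and the worst fold
```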
Production purpose
(no more evaluation)
You no longer care about evaluation. So take all the samples from your whole dataset and give them to your classifier for training (re-run a complete, simple training). The result can be used in the production environment, but can no longer be evaluated with any of your data. It should be at least as good as the worst case seen in the previous partition runs.
Flow sample processing
(production or learning)
When you are in a flow where new samples are produced over time, you will face cases where some samples correct earlier errors. This is the desired behavior, because we want the system to improve itself. But if you just correct the erroneous leaves in place, after some time your classifier will have nothing in common with the original random forest: you will be doing a form of greedy learning, like a meta tabu search. Clearly we don't want this.
If we try to reprocess the whole dataset plus the new sample every time a new sample arrives, we will suffer terrible latency. The solution is to do it the way humans do: from time to time a background process runs (when the service is under low usage) and all the data gets a complete re-learning; at the end, the old and new classifiers are swapped.
Sometimes the idle window is too short for a complete re-learning, so you have to use a compute cluster. That costs a lot of development effort, because you will probably need to rewrite the algorithms; but by then you already have the biggest computer you could have found.
Note: the swap process is very important to master. You should already have it in your production plan. What do you do if you want to change algorithms? Backup? Benchmark? Power cut? etc.
I would simply add the new data and retrain the classifier periodically if it weren't too expensive.
A simple way to keep things in balance is to add weights.
If you weigh all positive samples by 1/n_positive and all negative samples by 1/n_negative ( including all the new negative samples you're getting ), then you don't have to worry about the classifier getting out of balance.
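A minimal sketch of that weighting, assuming 90 positive and 10 negative samples:

```python
import numpy as np

labels = np.array([1] * 90 + [0] * 10)  # imbalanced classes

# Inverse-frequency weights: 1/n_positive for positives, 1/n_negative for negatives
n_pos = int((labels == 1).sum())
n_neg = int((labels == 0).sum())
weights = np.where(labels == 1, 1.0 / n_pos, 1.0 / n_neg)

# Each class now contributes the same total weight, so new negative
# samples arriving over time cannot tip the balance
print(weights[labels == 1].sum(), weights[labels == 0].sum())  # both ~1.0
```

Most scikit-learn classifiers accept such per-sample weights via the `sample_weight` argument of `fit`.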