Is there an R or Python autoML library which supports rolling cross-validation? - time-series

I have been using h2o AutoML with the default cross-validation (nfolds=5) to forecast commodity exchange prices. Recently I realized that the cross-validation MAE of the leader model is far better than anything I can achieve on the test set.
I researched and found that the "classic" cross-validation method (and its metrics) is not the proper way to assess the accuracy of a time series model; it would be better to use rolling cross-validation. But h2o AutoML does not offer such an option.
So far I have not found an R/Python package/library/solution that supports both AutoML and rolling cross-validation at the same time. Any suggestions?
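While I can't point to an AutoML library with built-in rolling cross-validation, the rolling scheme itself is easy to sketch with scikit-learn's `TimeSeriesSplit`, which produces expanding-window splits where the training fold always precedes the test fold. This is a minimal illustration on synthetic data with `LinearRegression` as a stand-in model; wrapping an AutoML leader model the same way is an exercise left to the specific library.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic price series: 101 observations, one lagged feature
rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=101))
X = prices[:-1].reshape(-1, 1)   # yesterday's price as the feature
y = prices[1:]                   # today's price as the target

# Rolling (expanding-window) splits: train data always precedes test data
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=tscv, scoring="neg_mean_absolute_error")
mae = -scores.mean()  # rolling-CV MAE, comparable to a true out-of-sample MAE
```

Because no fold ever trains on future observations, this MAE should track test-set performance far more honestly than shuffled k-fold.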

Related

Phishing Website Detection using Machine Learning

I have a semester project where I have to detect phishing websites using ML. I have been using a support vector binary classifier, trained on an existing dataset, to predict whether a website is legitimate or not. The problem is that SVMs are computationally expensive to train and sensitive to noisy data, so there is a high risk of overfitting. Is there another classification model that would help optimize my model?
I did a similar project during my engineering studies; I used a Naive Bayes classifier.
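To make the Naive Bayes suggestion concrete, here is a minimal sketch using scikit-learn's `GaussianNB`. The feature matrix is random stand-in data; in a real phishing detector the columns would be numeric features extracted from each URL (e.g. URL length, number of dots, an HTTPS flag), which are assumptions here, not part of the original question's dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in numeric features; replace with real URL-derived features
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic "phishing or not" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)        # training is a single cheap pass
acc = accuracy_score(y_te, clf.predict(X_te))
```

Unlike an SVM, Naive Bayes trains in one pass over the data, which is why it scales well when compute is the bottleneck.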

How to evaluate machine learning model performance on brand new datasets, in addition to train, validation, and test datasets?

The Scenario:
Our data science team builds machine learning models for classification tasks. We evaluate our model performance on train, validation and test datasets. We use precision, recall and F1 score.
We then run the models on brand-new datasets in the production environment and make predictions. One week later, we get feedback on how well our predictive models have performed.
The question:
When we evaluate the performance of our models on the real datasets, what metrics should we use? Is prediction accuracy a better metric in this context?
I think you should either measure the same metrics, or some business metrics.
Models are usually optimized for a certain loss/metric, which means a model with a high value on one metric can have a worse value on a different one.
Accuracy is heavily influenced by the class balance in the data, so it should be used with care.
So I suggest using the same metrics.
Another approach is to use business metrics, for example the revenue these models brought in.
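Computing the same metrics on production feedback is straightforward once the ground-truth labels arrive a week later. A minimal sketch with scikit-learn, using made-up labels and predictions purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Ground truth collected a week later vs. the predictions made in production
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
accuracy = accuracy_score(y_true, y_pred)    # use with care if classes are imbalanced
```

Reporting the same precision/recall/F1 used during validation keeps the production numbers directly comparable to the numbers that drove model selection.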
Model evaluation
Take a look at this paper. It is fairly easy to follow and covers everything you need to know about machine learning model validation.

Logistic regression Machine Learning?

I have a dataset of 300 respondents (hours studied vs. grade). I load the dataset into Excel, run the Data Analysis add-in, and run a linear regression. I get my results.
So the question is: am I doing statistical analysis or machine learning? I know the question may seem simple, but I think it should spark some debate.
Maybe your question is better suited for Data Science, as it is not related to app/program development. Running formulas in Excel through an add-in is not really considered anywhere close to "programming".
Statistical analysis is when you compute statistical metrics of your data, like the mean, standard deviation, confidence interval, p-value, and so on.
Supervised machine learning is when you try to classify or predict something. For these problems you use features as input to a model in order to predict a class or a value.
In this case you are doing machine learning, because you use the hours-studied feature to predict the student's grade.
In the proper context, you're actually doing statistical analysis... (which is part of machine learning).
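The same fit the Excel add-in produces can be reproduced in one line of scikit-learn, which illustrates why the distinction is mostly one of framing: the coefficients are identical ordinary-least-squares estimates either way. The hours/grade numbers below are invented for illustration, not the asker's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied vs. final grade
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float).reshape(-1, 1)
grades = np.array([52, 55, 61, 64, 70, 74, 79, 83], dtype=float)

model = LinearRegression().fit(hours, grades)
# The same slope and intercept Excel's regression add-in would report
slope, intercept = model.coef_[0], model.intercept_
```

Call it statistics when you interpret the slope, and machine learning when you call `model.predict()` on a new student; the mathematics underneath does not change.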

Get training data little by little

I am working on CIFAR-10 with Azure ML, but training takes too long because there is too much data. TensorFlow has a next_batch function to fetch training data a little at a time. I would also like to do that in Azure ML. How can I load data incrementally and speed up learning per epoch?
Incremental training of DNNs is not supported in Azure ML Studio. I suggest taking a look at Azure ML Workbench, which gives you programmatic access to the algorithms to do minibatch training.
See here: https://learn.microsoft.com/en-us/azure/machine-learning/preview/how-to-use-gpu
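Independently of the Azure tooling, the minibatch pattern that `next_batch` implements is simple to write yourself once you have programmatic access to the training loop. This is a framework-agnostic sketch in plain NumPy; the array shapes are a toy stand-in for CIFAR-10, not real data.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled (X, y) minibatches, mimicking TensorFlow's next_batch."""
    idx = rng.permutation(len(X))  # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# Toy stand-in for CIFAR-10: 100 flattened 32x32x3 images, 10 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32 * 32 * 3))
y = rng.integers(0, 10, size=100)

batches = list(minibatches(X, y, batch_size=32, rng=rng))
```

Each training step then sees only one small batch, so a gradient update never has to touch the full dataset at once.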

How to improve classification accuracy for machine learning

I have used an extreme learning machine for classification and found that my accuracy is only around 70%, which led me to try an ensemble method: creating more classification models and classifying test data by the majority vote of the models. However, this only increased accuracy by a small margin. May I ask what other methods can be used to improve classification accuracy on a two-dimensional, linearly inseparable dataset?
Your question is very broad... There's no way to help you properly without knowing the real problem you are dealing with. But, generally speaking, some methods to improve classification accuracy are:
1 - Cross-validation: split your training dataset into groups, always hold one group out for prediction, and change the groups in each run. Then you will know which data trains a more accurate model.
2 - Cross-dataset: the same as cross-validation, but using different datasets.
3 - Tuning your model: basically, changing the parameters you use to train your classification model (I don't know which classification algorithm you're using, so it's hard to help more).
4 - Improving, or adding (if you're not using one), a normalization process: discover which techniques (changing the geometry, colors, etc.) will give you more consistent data to train on.
5 - Understanding the problem better... Try to implement other methods to solve the same problem. There is almost always more than one way, and you may not be using the best approach.
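Point 3 (tuning) combined with point 1 (cross-validation) can be sketched with scikit-learn's `GridSearchCV`. Since the asker's data is 2-D and linearly inseparable, the example below uses `make_moons` as a stand-in dataset and an RBF-kernel SVM as a stand-in model; these choices are illustrative assumptions, not the asker's actual setup.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# Stand-in 2-D, linearly inseparable data
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Grid-search C and gamma with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
best_score = search.best_score_  # mean CV accuracy of the best (C, gamma) pair
```

The RBF kernel lets the SVM draw a curved decision boundary, which is exactly what a linearly inseparable 2-D dataset needs.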
Enhancing a model's performance can be challenging at times. I'm sure a lot of you would agree if you've found yourself stuck in a similar situation: you try all the strategies and algorithms that you've learnt, yet you fail to improve the accuracy of your model. You feel helpless and stuck, and this is where 90% of data scientists give up. Let's dig deeper and check out proven ways to improve the accuracy of a model:
Add more data
Treat missing and Outlier values
Feature Engineering
Feature Selection
Multiple algorithms
Algorithm Tuning
Ensemble methods
Cross Validation
If you feel this information is lacking, this link should help you learn more: https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/
Sorry if the information I give is less than satisfactory.
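The "Ensemble methods" item in the list above, which the asker was already attempting by hand, has a ready-made implementation in scikit-learn's `VotingClassifier`. This sketch again uses `make_moons` as an assumed stand-in for the asker's 2-D inseparable data, and combines three different model families under hard (majority) voting:

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

# Stand-in 2-D, linearly inseparable data
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Majority vote over three different model families
ensemble = VotingClassifier([
    ("lr", LogisticRegression()),
    ("rf", RandomForestClassifier(random_state=0)),
    ("svc", SVC(kernel="rbf")),
], voting="hard")

score = cross_val_score(ensemble, X, y, cv=5).mean()
```

Ensembling helps most when the member models make *different* mistakes, which is why mixing model families usually beats voting among copies of the same algorithm.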
