Is a Test dataset necessary? [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I am training a CNN machine learning model which detects and classifies cardiac arrhythmia into various categories. I have however used the test set for my validation set and now I have a validation accuracy of 98%. Do I need to have a test set or can I just use my validation accuracy as a final indication of how good my model is?

In general it is best to have a training, a validation, and a test set. The validation accuracy alone can give a good estimate of how your model generalizes to images it has not seen before, provided you did not bias the model toward the validation set. For example, if you adjust the learning rate by monitoring the validation loss, you are to a degree introducing a "bias" in your model toward that specific validation set. In that case it would be best to test your model against an independent test set. The two will probably show similar accuracy, but this is not always the case. If the probability distribution of your validation set is not representative of the full range of potential class images, an independent test set with a more encompassing distribution may yield less accurate results.
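As a rough illustration of the three-way split described here, the sketch below uses scikit-learn's train_test_split twice. The data is random stand-in data, and the sizes (700/150/150) and variable names are assumptions chosen for the example, not anything taken from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 16 features, 4 classes (all made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = rng.integers(0, 4, size=1000)

# First carve off the test set and do not touch it again until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=0
)

# Then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=150, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is then free to drive learning-rate schedules or early stopping, while the test set is scored once after all tuning is done.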

Related

ML models can only be used for making future predictions? [closed]

Closed 7 months ago.
I'm aware that ML models are used to make future predictions, but can they also be used to make predictions about the past?
I have a model that predicts accident-prone zones for a given location, date, and time. The model was developed from the previous two years of data (2020 and 2021). I have a few datasets from 2019 that I am required to predict on, in order to verify whether the predictions actually tally.
Now, would it be feasible to use this ML model to test on the dataset for the year 2019?
I'm using sklearn and the model used is Random forest.
Theoretically it is possible. It doesn't matter which direction you go. For example, if a trend is seen to increase into the future, it was probably lower in the past. So for the model it doesn't matter much: it is going to predict a decrease (for example). How relevant that prediction is, however, is something you still have to check.
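A minimal sketch of what such a check could look like with sklearn, assuming labelled 2019 data is available to compare against. Everything here is made up for illustration: the features (hour, weekday, zone id), the labelling rule, and the helper make_year are hypothetical stand-ins for the real accident data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

def make_year(n):
    """Generate hypothetical records: hour of day, day of week, zone id."""
    X = np.column_stack([
        rng.integers(0, 24, n),
        rng.integers(0, 7, n),
        rng.integers(0, 10, n),
    ])
    # Made-up rule: late hours in low-numbered zones are "accident prone".
    y = ((X[:, 0] >= 20) & (X[:, 2] < 5)).astype(int)
    # Flip 5% of labels to mimic noise.
    flip = rng.random(n) < 0.05
    y[flip] = 1 - y[flip]
    return X, y

X_train, y_train = make_year(2000)  # stands in for the 2020-2021 data
X_2019, y_2019 = make_year(500)     # stands in for the 2019 data

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Score the model on the earlier year exactly as you would on any held-out set.
backtest_acc = accuracy_score(y_2019, model.predict(X_2019))
print(f"accuracy on 2019 data: {backtest_acc:.3f}")
```

In this toy setup both years come from the same distribution, so the score says nothing about real drift between 2019 and 2020-2021; with real data that gap is exactly what the check would reveal.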

Improving Machine learning model for trading and trend prediction [closed]

Closed 1 year ago.
I am working on making predictions and decisions based on stocks and crypto data.
First I implemented a decision tree model and got a model accuracy of 0.5. After some research I found out that a decision tree alone is not enough, so I tried to improve on it with random forest and AdaBoost.
After that I noticed that I have the three algorithms mentioned above, with the same training and test data, and I get three different results.
Now the question is if it is possible to make the three algorithms work together by combining them in some way and benefit from the previous result?
You can combine classifiers, yes. This is considered an ensemble. It's a bit weird to make an ensemble from a decision tree and a random forest, though. A random forest is an ensemble of decision trees. That's why it's called a forest.
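One possible way to combine the three, sketched with sklearn's VotingClassifier and majority ("hard") voting. The synthetic dataset and all hyperparameters are assumptions for the example; nothing here reflects the asker's stock or crypto data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each model votes for a class; the majority wins.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

ensemble_acc = ensemble.score(X_test, y_test)
print(f"ensemble accuracy: {ensemble_acc:.3f}")
```

Since all three members are tree-based, their errors tend to be correlated, so the vote may not beat the random forest on its own.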

Is it a bad idea to always standardize all features by default? [closed]

Closed 1 year ago.
Is there a reason not to standardize all features by default? I realize it may not be necessary for, e.g., decision trees, but it is for certain algorithms such as KNN, SVM and K-Means. Would there be any harm in routinely doing this for all of my features?
Also, it seems the consensus is that standardization is preferable to normalization? When would this not be a good idea?
Standardization and normalization, in my experience, have the most (positive) impact when your dataset consists of features that have very different ranges (for instance, age vs. number of dollars per house).
In my professional experience, while working on a project with time-series data from car sensors, I noticed that normalization (min-max scaling), even when applied to a neural network, had a negative impact on the training process and, of course, on the final results. Admittedly, the sensor feature values were very close to one another. It was an interesting result, considering that I was working with time series, where most data scientists resort to scaling by default (the models are neural networks in the end, so scaling goes along with the theory).
In principle, standardization is the better choice when the dataset has specific outliers, since normalization generates smaller standard deviation values. To my knowledge this is the main reason standardization tends to be favored over normalization: its robustness to outliers.
Three years ago, if someone asked me this question, I would have said "standardization" is the way to go. Now I say, follow the principles, but test every hypothesis prior to jumping to a certain conclusion.
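To make the outlier point concrete, here is a small sketch comparing sklearn's StandardScaler and MinMaxScaler on a toy column with one extreme value. The five numbers are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with four ordinary values and a single outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(values)  # zero mean, unit variance
normalized = MinMaxScaler().fit_transform(values)      # rescaled into [0, 1]

print("standardized:", standardized.ravel())
print("normalized:  ", normalized.ravel())
print("std after standardization:", standardized.std())
print("std after normalization:  ", normalized.std())
```

After min-max scaling the four ordinary values are squeezed into a narrow band near 0 because the outlier defines the top of the range, and the resulting standard deviation is below the unit standard deviation that standardization produces.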

Which supervised machine learning classification method suits for randomly spread classes? [closed]

Closed 2 years ago.
If the classes are randomly spread or the data is noisy, which type of supervised ML classification model will give better results, and why?
It is difficult to say which classifier will perform best on general problems. It often requires testing of a variety of algorithms on a given problem in order to determine which classifier performs best.
Best performance also depends on the nature of the problem. There is a great answer in this Stack Overflow question which looks at various scoring metrics. For each problem, one needs to understand and consider which scoring metric will be best.
All of that said, neural networks, Random Forest classifiers, Support Vector Machines, and a variety of others are all candidates for creating useful models given that classes are, as you indicated, equally distributed. When classes are imbalanced, the rules shift slightly, as most ML algorithms assume balance.
My suggestion would be to try a few different algorithms, and tune the hyperparameters, to compare them for your specific application. You will often find one algorithm is better, but not remarkably so. In my experience, how your data are preprocessed and how your features are prepared is often of far greater importance. Once again, this is a highly generic answer, as it depends greatly on your given application.
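A hedged sketch of that workflow with sklearn's GridSearchCV: two candidate algorithms, a small hyperparameter grid for each, and one explicitly chosen scoring metric. The noisy synthetic data (flip_y=0.2), the grids, and the choice of F1 are all assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with 20% of labels randomly reassigned to mimic noisy classes.
X, y = make_classification(
    n_samples=400, n_features=10, flip_y=0.2, random_state=0
)

# Each candidate is a pipeline (preprocessing + model) with its own grid.
candidates = {
    "svm": (
        Pipeline([("scale", StandardScaler()), ("clf", SVC())]),
        {"clf__C": [0.1, 1, 10]},
    ),
    "forest": (
        Pipeline([("clf", RandomForestClassifier(n_estimators=50, random_state=0))]),
        {"clf__max_depth": [3, None]},
    ),
}

results = {}
for name, (pipe, grid) in candidates.items():
    # 5-fold cross-validated search, scored with F1 instead of plain accuracy.
    search = GridSearchCV(pipe, grid, scoring="f1", cv=5)
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the preprocessing is evaluated together with the model rather than leaking information from the validation folds.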

How to choose which model to fit to data? [closed]

Closed 2 years ago.
My question is given a particular dataset and a binary classification task, is there a way we can choose a particular type of model that is likely to work best? e.g. consider the titanic dataset on kaggle here: https://www.kaggle.com/c/titanic. Just by analyzing graphs and plots, are there any general rules of thumb to pick Random Forest vs KNNs vs Neural Nets or do I just need to test them out and then pick the best performing one?
Note: I'm not talking about image data since CNNs are obv best for those.
No, you need to test different models to see how they perform.
The top algorithms, based on the papers and Kaggle, seem to be boosting algorithms (XGBoost, LightGBM, AdaBoost), a stack of all of those together, or just Random Forests in general. But there are instances where Logistic Regression can outperform them.
So just try them all. If the dataset is <100k, you're not gonna lose that much time, and you might learn something valuable about your data.
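"Try them all" can be as short as a loop over models with cross-validation. The sketch below is one way to do it with sklearn only; XGBoost and LightGBM are separate packages, so sklearn's GradientBoostingClassifier stands in for the boosting family here, and the synthetic dataset is an assumption rather than the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for a real tabular dataset.
X, y = make_classification(n_samples=500, n_features=12, random_state=1)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

# Print the candidates from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```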
