Machine learning - Evolving intelligence [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have some understanding of the concepts and steps involved in building an ML model, covering these aspects:
Understanding and categorising the problem: supervised or unsupervised; regression, classification, clustering, etc.
Feature design, i.e. which features/input parameters to consider.
Splitting the data into train and test sets (cross-validation is another important concept here).
Comparing various models (KNN, SVM, Random Forest, etc.) and understanding which fares well; basically, cross-validating the scores and comparing prediction capabilities.
Question:
How is newer data fed to the model to keep it updated and improve its predictions?

Nothing has to be finalized. Once you get new data, you can either retrain your model on all relevant data, or update the model in place (another iteration of gradient descent for linear regression, for example).
If the new data is relevant, i.e. drawn from the same distribution, it shouldn't "hurt" the model.
The details depend on the model: some models you have to retrain on all relevant data, while others can simply be updated with the new examples.
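As a concrete sketch of the "update your model" option, here is one batch of gradient-descent steps for a simple linear regression in plain Python; the toy data, learning rate, and step count are illustrative, not from the original answer:

```python
# Updating an already-trained linear model with fresh data: instead of
# refitting from scratch, take extra gradient steps on the new examples.

def gd_step(w, b, xs, ys, lr=0.01):
    """One gradient-descent step on mean squared error; returns (w, b)."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - lr * grad_w, b - lr * grad_b

def mse(w, b, xs, ys):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Parameters from a model previously trained on old data
# (the true relation here is y = 2x + 1):
w, b = 1.8, 0.9

# "New" examples arrive from the same distribution; update in place.
new_x = [0.0, 1.0, 2.0, 3.0]
new_y = [1.0, 3.0, 5.0, 7.0]
before = mse(w, b, new_x, new_y)
for _ in range(100):
    w, b = gd_step(w, b, new_x, new_y, lr=0.05)
after = mse(w, b, new_x, new_y)
```

Because the new data comes from the same distribution, the update nudges the parameters toward the same optimum the original training was approaching, rather than "hurting" the model.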

Related

How can I normalize data for Reinforcement Learning when outliers are present? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 10 months ago.
I have to train a reinforcement learning agent (represented by a neural network) whose environment has a dataset where outliers are present.
How can I deal with normalization, given that I want to scale the data to the range [-1, 1]?
I need to keep the outliers in the dataset because they're critical: they can be genuinely significant in some circumstances despite being outside the normal range. So completely deleting those rows is not an option.
Currently, I'm trying to normalize the dataset using the IQR method.
I fear that with the outliers still present, the agent will take certain actions only when it encounters them; I have already observed a trained agent that always took the same actions and excluded the others.
What does your experience suggest?
After some tests, I took this road:
I applied a z-score normalization with the "robust" option (median and IQR rather than mean and standard deviation), so I get approximately mean = 0 and sd = 1.
I calculated (min_range(feature) + max_range(feature)) / 2.
I divided all the feature data by the value calculated in step 2.
The agent learned pretty well.
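One way to read these steps as code, in plain Python. The helper names are mine, and I substitute a max-absolute-value division for the range-midpoint division of step 3, since only the former guarantees the output actually lands in [-1, 1]; treat this as a sketch of the idea, not the poster's exact procedure:

```python
# Outlier-tolerant normalization to [-1, 1]. Step 1 is a robust z-score
# (median and IQR instead of mean and standard deviation, so outliers
# barely affect the centre and scale); step 2 divides by the maximum
# absolute value so everything fits in [-1, 1] while the outliers stay
# in the data.

def robust_scale(xs):
    s = sorted(xs)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = (q3 - q1) or 1.0          # guard against zero IQR
    return [(x - median) / iqr for x in xs]

def to_unit_range(xs):
    m = max(abs(x) for x in xs) or 1.0
    return [x / m for x in xs]

feature = [0.1, 0.2, 0.15, 0.3, 0.25, 9.0]   # 9.0 is a kept outlier
scaled = to_unit_range(robust_scale(feature))
```

The outlier ends up at the edge of the range while the bulk of the data keeps its relative spread, which is the behaviour the IQR-based approach is after.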

When should I train my own models and when should I use pretrained models? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
Is it recommended to train my own models for things like sentiment analysis despite only having a very small dataset (5,000 reviews), or is it better to use pretrained models that were trained on far larger datasets but aren't "specialized" for my data?
Also, how could I train my model on my data and then later use it on that data too? I was thinking of an iterative approach where the training data would be a randomly selected subset of my total data for each learning epoch.
I would go about it like this:
Try the pre-trained model and see how it goes.
If the results are unsatisfactory, you can fine-tune it (see this tutorial). Basically, you use your own examples to change the weights of the pre-trained model. This should improve the results, but it depends on what your data looks like and how many examples you can provide: the more you have, the better it should be (I would try to use 10-20k at least).
Also, how could I train my model on my data and then later use it on it too?
Be careful to distinguish between pre-training and fine-tuning.
Pre-training needs a huge amount of text (billions of characters) and is very resource-demanding; you typically don't want to do it unless you have a very good reason (for example, no model exists for your target language).
Fine-tuning requires far fewer examples (some tens of thousands), typically takes less than a day on a single GPU, and lets you exploit a pre-trained model created by someone else.
From what you write, I would go with fine-tuning.
Of course you can save the model for later, as you can see in the tutorial I linked above:
model.save_pretrained("my_imdb_model")
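For a classical (non-pretrained) model, the same train-save-reuse pattern can be sketched with scikit-learn and pickle; the toy reviews, labels, and file name below are placeholders, not from the tutorial:

```python
# Train a small custom sentiment model, persist it, and reload it later.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great movie", "loved it", "terrible plot", "awful acting",
           "wonderful cast", "boring and bad"]
labels = [1, 1, 0, 0, 1, 0]                      # 1 = positive

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

with open("my_sentiment_model.pkl", "wb") as f:  # save for later use...
    pickle.dump(model, f)

with open("my_sentiment_model.pkl", "rb") as f:  # ...then reload and predict
    reloaded = pickle.load(f)
pred = reloaded.predict(["what a great cast"])
```

With only 5,000 reviews this kind of simple pipeline is a reasonable baseline to compare against a fine-tuned pretrained model.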

How to choose which model to fit to data? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Given a particular dataset and a binary classification task, is there a way to choose the type of model that is likely to work best? For example, consider the Titanic dataset on Kaggle: https://www.kaggle.com/c/titanic. Just by analyzing graphs and plots, are there any general rules of thumb for picking Random Forests vs KNNs vs neural nets, or do I just need to test them all and then pick the best-performing one?
Note: I'm not talking about image data, since CNNs are obviously best for those.
No, you need to test different models to see how they perform.
Based on papers and Kaggle, the top algorithms tend to be boosting methods (XGBoost, LightGBM, AdaBoost, or a stack of all of those together) or plain Random Forests, but there are instances where Logistic Regression can outperform them.
So just try them all. Unless the dataset is very large (say, over 100k rows), you're not going to lose much time, and you might learn something valuable about your data.
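The "just try them all" advice can be sketched with scikit-learn's `cross_val_score`; the candidate models and the synthetic dataset below are illustrative:

```python
# Cross-validate several candidate classifiers on the same data and
# compare their mean scores instead of guessing from plots.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)
```

Because every model is scored with the same cross-validation splits, the comparison is fair, and inspecting where the models disagree can itself teach you something about the data.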

Interactive learning [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I'm new to NLP and text mining, and I'm trying to build a document classifier.
Once the model is trained, we test it on new documents (the test data has no labels). The model is not expected to be 100% accurate, so for misclassified documents we want to interact with a user to correct the bad predictions.
I have two ideas:
Retrain the model with train_data = old_train_data + data corrected by the user.
After each user correction, update the model parameters.
Does this sound correct? In the second case, which kinds of algorithms should I use? How efficiently can we solve this problem?
You can do this, but it will be a very intensive task if you plan on retraining the model on the whole dataset again and again, say on a daily basis. Instead of retraining the model completely, try transfer learning: save your model, then load it back and train it only on the data corrected by the user. The model can correct its mistakes without losing what it has already learned. The drawback of this approach is that after some time the model becomes fine-tuned to the new data and you will have to retrain it from scratch; but this is still far better than retraining the model every day.
You should have proper metrics in place to check whether your model's accuracy starts dropping on the old data after several iterations of this "transfer learning". If the accuracy drops, retrain the model on all of the data to date and you will be good to go.

Should I need to normalize (or scale) the data for Random forest (drf) or Gradient Boosting Machine (GBM) in H2O or in general? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I am creating classification and regression models using Random Forest (DRF) and GBM in H2O.ai. I believe that I don't need to normalize (or scale) the data, as it's unnecessary and possibly even harmful, since it might smooth out the nonlinear nature of the model. Could you please confirm whether my understanding is correct?
You don't need to do anything to your data when using H2O: all algorithms handle numeric/categorical/string columns automatically. Some methods do internal standardization automatically, but the tree methods don't, and don't need to (a split such as age > 5 or income < 100000 works fine either way). Whether scaling is "harmful" depends on what you're doing; usually it's a good idea to let the algorithm handle standardization, unless you know exactly what you are doing. One example is clustering, where distances depend on the scaling (or lack thereof) of the data.
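The claim that tree methods don't care about feature scaling is easy to check empirically; a small sketch with scikit-learn (synthetic data and arbitrary scaling factors, chosen for illustration):

```python
# A decision tree trained on raw features and one trained on rescaled
# features make identical predictions: a split like "age > 5" simply
# becomes "age/10 > 0.5", so monotone per-feature scaling changes the
# thresholds but not the decisions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_scaled = X * [1000.0, 0.001, 1.0, 5.0]     # per-feature rescaling

t1 = DecisionTreeClassifier(random_state=0).fit(X, y)
t2 = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)
same = bool((t1.predict(X) == t2.predict(X_scaled)).all())
```

Distance- or gradient-based methods (k-means, KNN, neural networks) would not pass this check, which is why the scaling question matters for them but not for DRF or GBM.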
