I am experimenting with Random Forests for a classification problem and am wondering about calibration for individual features. For example, consider a 2-level factor such as home-field advantage in a sporting event, which we can be certain has an average effect of around +5% on win rate and whose effect is not captured by any other feature in the data.
It appears that the nature of random forests (considering only a random subset of the features at each split) will not allow the model to fully capture the effect of any one particular feature like this. Setting the max_features parameter to None, so that every split considers all features, appears to solve the problem, but then the forest loses the benefit of diversity between trees.
Are there any good methods for dealing with this type of feature, whose effect we want fully captured based on domain knowledge we have about the problem?
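To make this concrete, here is a rough sketch of what I mean (scikit-learn assumed; the data is synthetic, with one binary feature carrying roughly the +5% effect and everything else being noise):

```python
# Compare max_features settings on a synthetic problem where a single binary
# feature shifts the win rate by ~5%. All data and numbers here are made up.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10000
X_noise = rng.normal(size=(n, 20))          # 20 uninformative features
home = rng.integers(0, 2, size=n)           # the 2-level "home field" factor
p_win = 0.50 + 0.05 * home                  # +5% win rate when home == 1
y = rng.binomial(1, p_win)
X = np.column_stack([home, X_noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for mf in ["sqrt", None]:
    rf = RandomForestClassifier(n_estimators=200, max_features=mf, random_state=0)
    rf.fit(X_tr, y_tr)
    proba = rf.predict_proba(X_te)[:, 1]
    # Difference in predicted win probability between home and away rows;
    # the question is whether either setting recovers the full +0.05 gap.
    gap = proba[X_te[:, 0] == 1].mean() - proba[X_te[:, 0] == 0].mean()
    print(f"max_features={mf}: predicted home-vs-away gap = {gap:.3f}")
```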
I want to build a website project that uses machine learning to optimize car throughput in a city. It would show a cartoonish grid of dots attempting to navigate a grid of streets with stoplights at each intersection. However, I have not been able to find the right resources for learning about this type of ML optimization.
The idea, to start, is that the grid of stoplights is given the same set of cars each epoch, and each stoplight guesses its own frequency of green/red to maximize traffic flow. So the metric the model learns against would be the number of cars through the light (or the time for all cars to clear the city; I'm not sure yet).
I have worked through the Google ML Crash Course and the book A Programmer's Guide to Artificial Intelligence, but I have yet to find the type of ML I am looking for: a resource on training a model with no labeled data, only a metric to optimize.
Reinforcement learning was what I was looking for, and I'm now looking into the TensorFlow documentation for how a virtual light signal can take actions and receive rewards from a model.
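As a rough illustration of the idea, here is a toy tabular Q-learning sketch for a single light. Every detail of the environment (queue sizes, arrival rates, the reward) is invented purely for illustration and is not from the question:

```python
# One stoplight chooses which direction gets green each step and is rewarded
# for the number of cars that clear the intersection (a made-up toy environment).
import random

ACTIONS = [0, 1]               # 0 = green for north-south, 1 = green for east-west
Q = {}                         # Q[(state, action)] -> value

def get_q(state, action):
    return Q.get((state, action), 0.0)

def step(queues, action):
    """Let a few cars through on the green direction, then add random arrivals."""
    ns, ew = queues
    if action == 0:
        served = min(ns, 3); ns -= served
    else:
        served = min(ew, 3); ew -= served
    ns = min(ns + random.randint(0, 2), 10)   # cap queues so the state space stays tiny
    ew = min(ew + random.randint(0, 2), 10)
    return (ns, ew), served                   # reward = cars that got through

alpha, gamma, eps = 0.1, 0.9, 0.1
state = (0, 0)
for _ in range(50000):
    # Epsilon-greedy action selection, then a standard Q-learning update.
    action = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda a: get_q(state, a))
    next_state, reward = step(state, action)
    best_next = max(get_q(next_state, a) for a in ACTIONS)
    Q[(state, action)] = get_q(state, action) + alpha * (reward + gamma * best_next - get_q(state, action))
    state = next_state

# With a long north-south queue, the learned policy should usually prefer action 0.
print(max(ACTIONS, key=lambda a: get_q((8, 1), a)))
```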
I am currently learning ML, and I notice that in multiple linear regression we don't need scaling for our independent variables. Why is that?
Whether feature scaling is useful or not depends on the training algorithm you are using.
For example, to find the best parameter values of a linear regression model, there is a closed-form solution, called the Normal Equation. If your implementation makes use of that equation, there is no stepwise optimization process, so feature scaling is not necessary.
However, you could also find the best parameter values with a gradient descent algorithm. This can be the better choice when there are many features, or too many training instances to fit in memory, since the Normal Equation becomes expensive in those cases. If you use gradient descent, feature scaling is recommended, because otherwise the algorithm might take much longer to converge.
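To make the difference concrete, here is a small NumPy sketch on synthetic data: the closed-form solution, theta = (X^T X)^(-1) X^T y, ignores feature scale entirely, while plain batch gradient descent struggles when one feature is on a much larger scale than the others:

```python
# Contrast the closed-form normal equation with batch gradient descent on
# synthetic data where one feature has a much larger scale (made-up example).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 500)          # small-scale feature
x2 = rng.uniform(0, 1000, 500)       # large-scale feature
y = 3.0 + 2.0 * x1 + 0.05 * x2 + rng.normal(0, 0.1, 500)
X = np.column_stack([np.ones(500), x1, x2])

# Normal equation: no iterative optimization, so feature scale does not matter.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: the large-scale feature forces a very small learning
# rate, so the other coefficients barely move unless features are standardized.
theta = np.zeros(3)
lr = 1e-7                            # must be kept tiny because of the large-scale feature
for _ in range(100_000):
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= lr * grad

print("normal equation :", np.round(theta_closed, 3))
print("gradient descent:", np.round(theta, 3))   # intercept and x1 still far from the solution
```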
My question is: given a particular dataset and a binary classification task, is there a way to choose the type of model that is likely to work best? For example, consider the Titanic dataset on Kaggle: https://www.kaggle.com/c/titanic. Just by analyzing graphs and plots, are there any general rules of thumb for picking Random Forests vs. KNNs vs. neural nets, or do I just need to test them all and pick the best-performing one?
Note: I'm not talking about image data, since CNNs are obviously best suited for that.
No, you need to test different models to see how they perform.
The top algorithms based on papers and Kaggle results tend to be boosting methods (XGBoost, LightGBM, AdaBoost), stacks of those together, or just Random Forests in general. But there are instances where Logistic Regression can outperform them.
So just try them all. Unless the dataset is very large (say, more than 100k rows), you're not going to lose much time, and you might learn something valuable about your data.
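Here is a rough sketch of "just try them all" with cross-validation (scikit-learn assumed; the data below is a synthetic stand-in for something like the Titanic feature matrix):

```python
# Score several candidate classifiers with 5-fold cross-validation and compare.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; replace with your own preprocessed feature matrix and labels.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")
```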
One anomaly detection approach is to use a multivariate Gaussian to construct a probability density, as in Andrew Ng's Coursera lecture.
What if the data shows cluster structure (not a single chunk)? In that case, do we resort to unsupervised clustering to construct the density? If so, how? Are there other systematic ways to discover whether such a case exists?
You can just use a regular GMM (Gaussian mixture model) and put a threshold on the likelihood to identify outliers: points that don't fit the model well are outliers.
This works okay as long as your data really is composed of Gaussians.
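A minimal sketch of that idea (scikit-learn assumed, synthetic two-cluster data): fit a Gaussian mixture and flag points whose log-likelihood falls below a chosen threshold:

```python
# Fit a 2-component Gaussian mixture and flag the lowest-likelihood points.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0, 0], scale=0.5, size=(500, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.5, size=(500, 2))
X = np.vstack([cluster_a, cluster_b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

log_density = gmm.score_samples(X)                 # per-point log-likelihood
threshold = np.percentile(log_density, 1)          # e.g. flag the lowest 1%
outliers = X[log_density < threshold]
print(f"flagged {len(outliers)} points as outliers")
```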
Furthermore, clustering is fairly expensive. It will usually be faster to use a nonparametric outlier model directly, such as kNN distance, LOF, or LoOP.
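For comparison, here is a small sketch of one such nonparametric alternative (scikit-learn's Local Outlier Factor, on made-up two-cluster data):

```python
# LOF scores points by local density, with no Gaussian assumption.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (500, 2)),
               rng.normal([5, 5], 0.5, (500, 2)),
               [[2.5, 2.5]]])                      # one point far from both clusters

# contamination=0.01 flags roughly the most isolated 1% of points (-1 = outlier).
labels = LocalOutlierFactor(n_neighbors=20, contamination=0.01).fit_predict(X)
print(f"LOF flagged {(labels == -1).sum()} of {len(X)} points")
```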
While looking for a good regression algorithm for my problem, I found out that this can also be done with simple decision trees, which are usually used for classification. The output would look something like the plot I attached, where the red, noise-like curve shows the predictions of such a tree or forest.
Now my question is: why use this method at all, when there are alternatives that really try to figure out the underlying equation (such as the famous support vector machines, SVMs)? Are there any positive or unique aspects, or is a regression tree more of a nice-to-have algorithm?
The image you posted conveys a smooth function of y in x. A regression tree is certainly not the best technique to estimate such a function and I probably wouldn't use SVMs either. This looks like a good application for splines, e.g., by using a GAM (generalized additive model).
A regression tree on the other hand is a handy tool if you haven't got such smooth functions and if you don't know which explanatory variable will have which effect on the response. It will be particularly useful if there are jumps in the response or interactions - especially if the jump points and interaction patterns are not known in advance.
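A small sketch of that point (scikit-learn assumed, synthetic data with a jump at x = 5): the tree recovers the step, while a smooth model such as an SVR blurs it:

```python
# Fit a regression tree and an SVR to data with a jump in the response.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300).reshape(-1, 1)
y = np.where(x.ravel() < 5, 1.0, 4.0) + rng.normal(0, 0.1, 300)   # step at x = 5

tree = DecisionTreeRegressor(max_depth=3).fit(x, y)
svr = SVR().fit(x, y)

grid = np.linspace(0, 10, 5).reshape(-1, 1)
print("tree:", np.round(tree.predict(grid), 2))   # recovers the two plateaus sharply
print("svr :", np.round(svr.predict(grid), 2))    # smooths across the jump
```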