How to choose which model to fit to data? [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
My question is: given a particular dataset and a binary classification task, is there a way to choose a type of model that is likely to work best? For example, consider the Titanic dataset on Kaggle: https://www.kaggle.com/c/titanic. Just by analyzing graphs and plots, are there any general rules of thumb for picking Random Forests vs. KNN vs. neural nets, or do I just need to test them all and then pick the best-performing one?
Note: I'm not talking about image data, since CNNs are obviously best for those.

No, you need to test different models to see how they perform.
Judging by published papers and Kaggle results, the top algorithms seem to be boosting algorithms (XGBoost, LightGBM, AdaBoost), stacks of several of those together, or Random Forests in general. But there are cases where Logistic Regression outperforms them.
So just try them all. If the dataset has fewer than 100k rows, you're not going to lose much time, and you might learn something valuable about your data.
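The "try them all" advice can be sketched with cross-validation. This is only an illustration under stated assumptions: it uses scikit-learn, a synthetic dataset from `make_classification` in place of real data such as Titanic, and scikit-learn's own `GradientBoostingClassifier` as a stand-in for the boosting libraries named above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a tabular binary classification dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Candidate models; the hyperparameters are illustrative defaults, not tuned.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Print the candidates from best to worst.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking this prints holds only for the synthetic data; on a real dataset the order can easily change, which is the point of testing.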

Related

Improving Machine learning model for trading and trend prediction [closed]

Closed 1 year ago.
I am working on making predictions and decisions based on stocks and crypto data.
First I implemented a decision tree model and got a model accuracy of 0.5. After some research I found that a single decision tree is not enough, so I tried to improve on it with a random forest and AdaBoost.
I then noticed that these three algorithms, given the same training and test data, produce three different results.
Now the question is whether it is possible to make the three algorithms work together by combining them in some way, so that they benefit from each other's results?
You can combine classifiers, yes. This is considered an ensemble. It's a bit weird to make an ensemble from a decision tree and a random forest, though. A random forest is an ensemble of decision trees. That's why it's called a forest.
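As a rough sketch of such an ensemble, scikit-learn's `VotingClassifier` can average the predicted probabilities of the three models from the question. The data here are synthetic (an assumption, not stock or crypto data), and the hyperparameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with your own features and labels.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Soft voting averages the class probabilities predicted by each member.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)

# Compare each fitted member against the combined model on the held-out split.
for name, member in ensemble.named_estimators_.items():
    print(f"{name}: {member.score(X_test, y_test):.3f}")
print(f"ensemble: {ensemble_accuracy:.3f}")
```

Because the forest already averages many trees, adding a single tree to the vote may contribute little; ensembles tend to help most when the members make different kinds of errors.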

Which supervised machine learning classification method suits for randomly spread classes? [closed]

Closed 2 years ago.
If the classes are randomly spread or the data are noisy, which type of supervised ML classification model will give better results, and why?
It is difficult to say which classifier will perform best on general problems. It often requires testing of a variety of algorithms on a given problem in order to determine which classifier performs best.
Best performance is also dependent on the nature of the problem. There is a great answer in this stackoverflow question which looks at various scoring metrics. For each problem, one needs to understand and consider which scoring metric will be best.
All of that said, neural networks, Random Forest classifiers, Support Vector Machines, and a variety of others are all candidates for creating useful models given that classes are, as you indicated, equally distributed. When classes are imbalanced, the rules shift slightly, as most ML algorithms assume balance.
My suggestion would be to try a few different algorithms, tune their hyperparameters, and compare them on your specific application. You will often find that one algorithm is better, but not remarkably so. In my experience, how your data are preprocessed and how your features are prepared is often of far greater importance. Once again, this is a highly generic answer, as it depends greatly on your given application.
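A minimal sketch of that workflow, assuming scikit-learn and synthetic noisy data: each candidate is tuned with `GridSearchCV` and scored with the same metric. The parameter grids and the choice of F1 as the metric are illustrative assumptions, and the scaler is kept inside the pipeline so that preprocessing is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with 10% label noise (flip_y) as a stand-in for a noisy problem.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, flip_y=0.1, random_state=0
)

searches = {
    # The SVM is scale-sensitive, so scaling is part of its pipeline.
    "svm": GridSearchCV(
        Pipeline([("scale", StandardScaler()), ("clf", SVC())]),
        param_grid={"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]},
        scoring="f1",
        cv=5,
    ),
    "random_forest": GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
        scoring="f1",
        cv=5,
    ),
}

# Best cross-validated score and parameters for each candidate.
best = {}
for name, search in searches.items():
    search.fit(X, y)
    best[name] = (search.best_score_, search.best_params_)
    print(f"{name}: F1={search.best_score_:.3f} with {search.best_params_}")
```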

anomaly detection with clustering? [closed]

Closed 4 years ago.
According to Andrew Ng's Coursera lecture, one anomaly detection approach is to use a multivariate Gaussian to construct a probability density.
What if the data show cluster structure (not a single chunk)? In that case, do we resort to unsupervised clustering to construct the density? If yes, how is it done? Are there other systematic ways to discover whether such structure exists?
You can just use a regular Gaussian mixture model (GMM) and put a threshold on the likelihood to identify outliers. Points that don't fit the model well are outliers.
This works okay as long as your data really is composed of Gaussians.
Furthermore, clustering is fairly expensive. Usually it will be faster to directly use a nonparametric outlier model like KNN, LOF, or LoOP.
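Both ideas can be sketched with scikit-learn on synthetic data. Everything specific here is an assumption for illustration: the two Gaussian clusters, the three injected outliers, the 1% log-density cutoff, and `n_neighbors=20` for LOF.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import LocalOutlierFactor

# Two synthetic Gaussian clusters plus three hand-placed outliers (assumption).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))
outliers = np.array([[2.5, 2.5], [-4.0, 6.0], [9.0, -1.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

# GMM approach: fit a mixture, then flag points with low log-density.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)
threshold = np.percentile(log_density, 1)  # illustrative cutoff: lowest 1%
gmm_flags = log_density < threshold

# Nonparametric alternative: Local Outlier Factor labels outliers as -1.
lof = LocalOutlierFactor(n_neighbors=20)
lof_flags = lof.fit_predict(X) == -1

print("GMM flagged indices:", np.flatnonzero(gmm_flags))
print("LOF flagged indices:", np.flatnonzero(lof_flags))
```

In practice the threshold is usually chosen from a validation set or an expected contamination rate rather than a fixed percentile.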

machine learning (unsupervised method) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I have a question about reinforcement learning. If an unsupervised method uses a mechanism that observes the response of the environment in order to improve its performance, is the method still unsupervised?
In other words, does using the response of the environment make a method supervised, or can it still be done in an unsupervised manner? If so, how?
I have to disagree with @phs. Reinforcement learning is treated in the literature in one of two ways:
as a completely separate, third method of training, so it is neither supervised nor unsupervised, it is simply reinforcement learning;
or it is sometimes classed as supervised, due to its much stronger similarities to that paradigm.
So, if the algorithm is trained in both a reinforcement and an unsupervised fashion, you can call it an unsupervised-reinforcement hybrid or something similar, but no longer "unsupervised", because reinforcement learning requires some additional knowledge about the world beyond what is encoded in the data representation (the feedback is not stored in the data representation; it is much more like "true labels").
Unsupervised learning describes a class of problems where the model is not provided "answers" during its training phase, whatever that might mean in the current context.
Clustering is a canonical example. In a clustering problem one is only looking for inherent structure or grouping in the training data, and not seeking to distinguish "right" data points from "wrong" ones.
Your question is vague, but I believe you are asking whether we can call a training method unsupervised even if we have a prescribed algorithm for performing the training. The answer is yes; the word is just a word. All learning algorithms have inherent prescribed structure (the algorithm) and so are in some sense "supervised".

Survey to determine satisfaction: how to find the questions that mattered? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
If a survey is given to determine overall customer satisfaction, and there are 20 general questions and a final summary question: "What's your overall satisfaction 1-10", how could it be determined which questions are most significantly related to the summary question's answer?
In short, which questions actually mattered and which ones were just wasting space on the survey...
Information about the relevance of individual features is given by the weights that linear classification and regression models assign to those features.
For your specific application, you could try training an L1 or L0 regularized regressor (http://en.wikipedia.org/wiki/Least-angle_regression, http://en.wikipedia.org/wiki/Matching_pursuit). These regularizers force many of the regression weights to zero, which means that the features associated with these weights can be effectively ignored.
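A hedged sketch of the L1 idea, using scikit-learn's `LassoCV` (coordinate-descent Lasso, not the least-angle or matching-pursuit solvers linked above). The survey responses are simulated, and the assumption that only questions 0, 3, and 7 drive the overall score is invented purely for the illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Simulated survey: 300 respondents, 20 questions answered on a 1-5 scale.
rng = np.random.default_rng(0)
n_respondents, n_questions = 300, 20
answers = rng.integers(1, 6, size=(n_respondents, n_questions)).astype(float)

# Hypothetical ground truth: only questions 0, 3 and 7 affect overall satisfaction.
overall = (
    1.0
    + 0.9 * answers[:, 0]
    + 0.6 * answers[:, 3]
    + 0.4 * answers[:, 7]
    + rng.normal(0.0, 0.5, size=n_respondents)
)

# Standardize so the coefficient magnitudes are comparable across questions.
X = StandardScaler().fit_transform(answers)

# LassoCV picks the L1 penalty strength by cross-validation.
model = LassoCV(cv=5, random_state=0).fit(X, overall)

# Rank questions by absolute weight; zero weights mean "effectively ignored".
ranked = np.argsort(-np.abs(model.coef_))
for idx in ranked[:5]:
    print(f"question {idx}: weight {model.coef_[idx]:+.3f}")
```

With real survey data, a few small non-zero weights on irrelevant questions are normal; the ranking matters more than the exact cutoff.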
There are many different approaches for answering this question and at varying levels of sophistication. I would start by calculating the correlation matrix for all pair-wise combinations of answers, thereby indicating which individual questions are most (or most negatively) correlated with the overall satisfaction score. This is pretty straightforward in Excel with the Analysis ToolPak.
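The same correlation step can be sketched outside Excel with NumPy. The responses below are simulated, and the assumption that question 2 raises and question 5 lowers the overall score exists only to make the output easy to read.

```python
import numpy as np

# Simulated survey: 300 respondents, 20 questions on a 1-5 scale (assumption).
rng = np.random.default_rng(1)
answers = rng.integers(1, 6, size=(300, 20)).astype(float)

# Hypothetical overall score: driven up by question 2 and down by question 5.
overall = 0.9 * answers[:, 2] - 0.7 * answers[:, 5] + rng.normal(0.0, 1.0, size=300)

# Correlation matrix over all 21 columns (20 questions + overall score).
data = np.column_stack([answers, overall])
corr = np.corrcoef(data, rowvar=False)

# Last column holds each question's correlation with the overall score.
corr_with_overall = corr[:-1, -1]
order = np.argsort(-np.abs(corr_with_overall))

for idx in order[:5]:
    print(f"question {idx}: r = {corr_with_overall[idx]:+.2f}")
```

Correlation only captures pairwise linear association, so questions that matter only in combination with others would not stand out here.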
Next, I would look into clustering techniques starting simple and moving up in sophistication only if necessary. Not knowing anything about the domain to which this survey data applies it is hard to say which algorithm would be the most effective, but for starters I would look at k-means and variants if your clusters are likely to all be similarly-sized. However, if a vast majority of the responses are very similar, I would look into expectation-maximization-based algorithms. A good open-source toolkit for exploring data and testing the efficacy of various algorithms is called Weka.
