How to apply a genetic algorithm to linear regression datasets? - machine-learning

Can we apply only a genetic algorithm model to a dataset for linear regression?
For example:
Assume we have a dataset with features such as TOEFL score, CGPA, GRE score, etc., and output values giving the chance of admission. We have to predict the chance of admission based on these features. Link to the dataset

A lot of things are possible with genetic algorithms. You just have to be sure that you are using the correct dataset, you have to know what you want to get from it and, last but not least, you have to know exactly what you are doing, which means you need a correct fitness function :)
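To make the point about the fitness function concrete, here is a minimal sketch of a genetic algorithm that searches for linear regression weights. It assumes the fitness is the negative mean squared error, and it runs on made-up toy data rather than the admissions dataset. The function names (predict, fitness, evolve) and all hyperparameters are illustrative choices, not something given in the question.

```python
import random

def predict(weights, row):
    # weights[0] is the intercept; the remaining weights pair up with the features.
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))

def fitness(weights, X, y):
    # Higher is better, so use the negative mean squared error.
    errors = [(predict(weights, row) - target) ** 2 for row, target in zip(X, y)]
    return -sum(errors) / len(errors)

def evolve(X, y, pop_size=60, generations=300, mutation_scale=0.3, seed=0):
    rng = random.Random(seed)
    n_weights = len(X[0]) + 1
    population = [[rng.uniform(-1, 1) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=lambda w: fitness(w, X, y), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Mutation: add Gaussian noise to one randomly chosen gene.
            i = rng.randrange(n_weights)
            child[i] += rng.gauss(0, mutation_scale)
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: fitness(w, X, y))

# Made-up toy data following y = 2 + 3*x1 - 1*x2 (an assumption for the demo).
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(40)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]

best = evolve(X, y)
print("weights:", [round(w, 2) for w in best])
print("MSE:", -fitness(best, X, y))
```

For plain linear regression, ordinary least squares gives the exact answer far more cheaply, so a sketch like this is mainly useful when the fitness function is something a closed-form solver cannot handle.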

Related

Which SMOTE algorithm should I use for Augmentation of Time Series dataset?

I am working on a time series dataset where I want to do both forecasting and prediction. So, if you have any suggestions, please share. Thank you!
T-SMOTE
This allows one to both impute fully missing observations to allow uniform time series classification across the entire data and, in special cases, to impute individually missing features. To do so, we slightly generalize the well-known class imbalance algorithm SMOTE to allow component-wise nearest neighbor interpolation that preserves correlations when there are no missing features. We visualize the method in the simplified setting of 2-dimensional uncoupled harmonic oscillators. Next, we use tSMOTE to train an Encoder/Decoder long short-term memory (LSTM) model with Logistic Regression for predicting and classifying distinct trajectories of different 2D oscillators.

Metric for ML algorithm evaluation

I have a question. Is the best score from GridSearchCV, which corresponds to the mean cross-validation score, the right metric to evaluate an algorithm trained on unbalanced data?
GridSearchCV can be used to find appropriate parameter values for your model. Its best score is the mean cross-validation score under whichever scoring metric you give it, so the real question is which metric to use.
To evaluate an algorithm trained on unbalanced data, you want to look at the area under the precision-recall curve (PR AUC) or 'average precision', or maybe even a cost-sensitive metric (Jason Brownlee has a number of blog posts on this topic).
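To see why a precision-recall based metric can tell a different story from accuracy on unbalanced data, here is a small sketch that computes average precision by hand. The labels and scores are made up for illustration, and the helper names are hypothetical. The hand-rolled version ignores tied scores, which a library implementation would handle more carefully.

```python
def average_precision(y_true, scores):
    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos

def accuracy(y_true, scores, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

# Made-up example: 2 positives among 10 samples, with poorly ranked scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.45, 0.2, 0.1, 0.4, 0.35, 0.15]

acc = accuracy(y_true, scores)
ap = average_precision(y_true, scores)
print("accuracy:", acc)
print("average precision:", ap)
```

Here the model never predicts the positive class, yet accuracy still looks respectable because negatives dominate, while average precision exposes the weak ranking. In scikit-learn, passing scoring="average_precision" to GridSearchCV makes the reported best score use this kind of metric instead of the estimator's default.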

How to evaluate unsupervised anomaly detection

I am trying to solve a regression problem by predicting a continuous value using machine learning. I have a dataset composed of 6 float columns.
The data come from low-cost sensors, which explains why we will very likely have values that can be considered out of the ordinary. To fix the problem, before predicting my continuous target, I will predict data anomalies and use that prediction as a data filter. However, the data I have is not labeled, which means I have an unsupervised anomaly detection problem.
The algorithms used for this task are Local Outlier Factor, One-Class SVM, Isolation Forest, Elliptic Envelope and DBSCAN.
After fitting those algorithms, it is necessary to evaluate them to choose the best one.
Does anyone have an idea how to evaluate an unsupervised algorithm for anomaly detection?
The only way is to generate synthetic anomalies, which means introducing outliers yourself, using your knowledge of what a typical outlier looks like.
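The idea of injecting synthetic anomalies can be sketched as follows. Everything here is an assumption made for illustration: the "normal" readings are simulated, a typical fault is assumed to be a large spike in a single column, and the detector is a simple stand-in based on median and MAD (median absolute deviation) rather than any of the algorithms listed in the question. The evaluation part (known labels, then precision and recall) is what carries over to a real detector.

```python
import random
import statistics

rng = random.Random(0)

# Simulated "normal" sensor readings: 500 rows, 6 float columns (made up).
normal = [[rng.gauss(20.0, 2.0) for _ in range(6)] for _ in range(500)]

def make_spike(row):
    # Assumed fault model: one column jumps far away from its usual value.
    row = list(row)
    col = rng.randrange(6)
    row[col] += rng.choice([-1, 1]) * rng.uniform(15.0, 30.0)
    return row

anomalies = [make_spike(rng.choice(normal)) for _ in range(25)]

data = normal + anomalies
# Labels are known only because we injected the anomalies ourselves.
labels = [0] * len(normal) + [1] * len(anomalies)

def fit_detector(rows):
    # Per-column median and median absolute deviation (MAD).
    cols = list(zip(*rows))
    medians = [statistics.median(c) for c in cols]
    mads = [statistics.median([abs(v - m) for v in c])
            for c, m in zip(cols, medians)]
    return medians, mads

def anomaly_score(row, medians, mads):
    # Largest robust z-score over the columns; 0.6745 rescales the MAD
    # so that it is comparable to a standard deviation for Gaussian data.
    return max(0.6745 * abs(v - m) / d for v, m, d in zip(row, medians, mads))

medians, mads = fit_detector(data)
flagged = [1 if anomaly_score(r, medians, mads) > 4.5 else 0 for r in data]

true_pos = sum(1 for f, l in zip(flagged, labels) if f == 1 and l == 1)
precision = true_pos / max(sum(flagged), 1)
recall = true_pos / sum(labels)
print("precision:", precision, "recall:", recall)
```

To compare real detectors, replace fit_detector and anomaly_score with each candidate and keep the injected labels fixed. The scores are only as meaningful as the assumed fault model, so it is worth injecting several kinds of outliers.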

Gradient Boosting vs Random forest

According to my understanding, RF selects features randomly and hence is hard to overfit. But in sklearn, Gradient Boosting also offers the max_features option, which can help to prevent overfitting. So, why would anyone use Random Forest?
Can anyone explain when to use Gradient Boosting vs Random Forest based on the given data?
Any help is highly appreciated.
According to my personal experience, Random Forest could be a better choice when:
You train a model on a small data set.
Your data set has few features to learn from.
Your data set has a low count of positive (Y) labels, or you are trying to predict an event that has a low chance of occurring or rarely occurs.
In these situations, Gradient Boosting algorithms like XGBoost and LightGBM can overfit (even when their parameters are tuned), while simpler algorithms like Random Forest or even Logistic Regression may perform better. To illustrate, for XGBoost and LightGBM the ROC AUC on the test set may be higher than for Random Forest, but the gap between it and the ROC AUC on the training set may be too large.
Despite the sharp predictions from Gradient Boosting algorithms, in some cases Random Forest takes advantage of the model stability that comes from the bagging methodology (random selection) and outperforms XGBoost and LightGBM. However, Gradient Boosting algorithms perform better in general situations.
Similar question asked on Quora:
https://www.quora.com/How-do-random-forests-and-boosted-decision-trees-compare
I agree with the author at the link that random forests are more robust: they don't require much problem-specific tuning to get good results. Besides that, a couple of other items based on my own experience:
Random forests can perform better on small data sets; gradient boosted trees are data hungry.
Random forests are easier to explain and understand. This perhaps seems silly, but it can lead to better adoption of a model if it needs to be used by less technical people.
I think that's also true. I have also read the page How Random Forest Works, which explains the advantages of random forest like this:
For applications in classification problems, the Random Forest algorithm will avoid the overfitting problem.
For both classification and regression tasks, the same Random Forest algorithm can be used.
The Random Forest algorithm can be used for identifying the most important features from the training dataset, in other words, feature engineering.

What's the meaning of logistic regression dataset labels?

I've been learning logistic regression for a few days, and I think the labels of a logistic regression dataset need to be 1 or 0. Is that right?
But when I looked up the libSVM library's regression datasets, I saw that the label values are continuous numbers (e.g. 1.0086, 1.0089, ...). Did I miss something?
Note that the libSVM library can be used for regression problems.
Thanks so much!
Contrary to its name, logistic regression is a classification algorithm and it outputs class probability conditioned on the data point. Therefore the training set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is a classification algorithm that uses the input labels -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It can also be adapted to regression.
Are you using a third-party library or programming this yourself? Generally the labels are used as ground truth so you can see how effective your approach was.
For example, if your algorithm is trying to predict what a particular instance is, it might output -1 while the ground truth label is +1, which means you did not successfully classify that particular instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor all of the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real valued, it means you're probably going to use linear regression or similar, or else convert those real valued labels to categorical labels (e.g. via thresholds or bins) if you want to in fact use logistic regression. There is potentially a big difference in the quality and interpretation of your results though, if you try to convert from one such problem setup to another.
See also Regression Analysis.
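As a rough illustration of converting real-valued labels into categorical ones via a threshold, the sketch below binarizes some made-up labels and fits a one-feature logistic regression with plain gradient descent. The data, the threshold of 2.0, the learning rate and the iteration count are all hypothetical choices. The feature is centered first only to keep gradient descent well behaved.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up feature values and real-valued labels (regression-style data).
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y_real = [1.01, 1.20, 1.45, 1.90, 2.60, 2.95, 3.40, 4.10]

# Convert the real-valued labels to binary classes with a chosen threshold.
threshold = 2.0
y_binary = [1 if v >= threshold else 0 for v in y_real]

# Center the feature so the intercept and slope are easier to fit.
x_mean = sum(x) / len(x)
xc = [xi - x_mean for xi in x]

# Fit p(y=1 | x) = sigmoid(w * xc + b) by gradient descent on the log loss.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    residuals = [sigmoid(w * xi + b) - yi for xi, yi in zip(xc, y_binary)]
    grad_w = sum(r * xi for r, xi in zip(residuals, xc)) / len(xc)
    grad_b = sum(residuals) / len(xc)
    w -= lr * grad_w
    b -= lr * grad_b

# The model outputs class probabilities, not the original real values.
probs = [sigmoid(w * xi + b) for xi in xc]
for xi, yr, yb, p in zip(x, y_real, y_binary, probs):
    print(xi, yr, yb, round(p, 3))
```

The fitted model can only say how likely a sample is to fall above the threshold. The information about how far above or below it the original label was is lost, which is the change in interpretation mentioned above.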
