Feature Selection Process For Regression - machine-learning

I am trying to solve a regression problem (determining next month's expected revenue). I came to know about different feature selection techniques like
Filter Method
Wrapper Method
Embedded Method
Q1: Now the problem is, I think those methods are for classification-type problems. So how can we use feature selection for a regression problem?
Q2: I came to know about "Regularization". Is it the only way to do feature selection for a regression problem?

I don't know the filter methods you mentioned, but you can use:
sklearn.feature_selection.RFE (Recursive Feature Elimination)
or
sklearn.decomposition.PCA (Principal Component Analysis)
I'm pretty sure you can use them for classification or regression (strictly speaking, PCA is feature extraction rather than feature selection, since it builds new components instead of keeping original features).
Here's an example of using RFE with LinearRegression: https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b
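For instance, a minimal sketch of RFE wrapped around LinearRegression; the synthetic dataset here is just to make it runnable:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 10 features, only 4 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Keep the 4 strongest features, eliminating the weakest one per iteration
selector = RFE(estimator=LinearRegression(), n_features_to_select=4)
selector.fit(X, y)

print(selector.support_)   # boolean mask of selected features
print(selector.ranking_)   # 1 = selected; higher = eliminated earlier
```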

Related

Which data visualization techniques to use to analyse data while solving a classification problem?

I am solving a classification problem and I cannot find a good visualization method to analyse my data. Usually while dealing with prediction problems I use barplots, distplots, scatterplots, line graphs, etc. I want to know some common data visualization techniques for classification problems.
Hi guys, I figured out that countplot is the equivalent of a histogram for categorical data: https://seaborn.pydata.org/generated/seaborn.countplot.html
Example of countplot
Example of catplot
Update: catplot is actually the combination of FacetGrid and countplot.
So if you want to do something simple then countplot will do the work for you, but if you want grids then use catplot.
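For instance, a quick sketch using seaborn's built-in "titanic" dataset (an illustrative choice, not from the question):

```python
import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset("titanic")

# Histogram-style counts of a single categorical column
sns.countplot(x="class", data=titanic)
plt.show()

# Same counts, but gridded by a second variable via FacetGrid
sns.catplot(x="class", col="sex", kind="count", data=titanic)
plt.show()
```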

How can I re-train my logistic model using pymc3?

I have a binary classification problem with around 15 features, which I have chosen using some other model. Now I want to perform Bayesian logistic regression on these features. My target classes are highly imbalanced (the minority class is 0.001%) and I have around 6 million records. I want to build a model that can be re-trained nightly or on weekends using Bayesian logistic regression.
Currently, I have divided the data into 15 parts; I train my model on the first part, test on the last part, then update my priors using the Interpolated method of pymc3 and rerun the model on the 2nd set of data. I check the accuracy and other metrics (ROC, f1-score) after each run.
Problems:
My score is not improving.
Am I using the right approach?
This process is taking too much time.
If someone can guide me with the right approach and code snippets it would be very helpful.
You can use variational inference. It is faster than sampling and produces broadly similar results. pymc3 itself provides methods for VI, so you can explore that.
That is the only part of the question I can address. If you elaborate on your problem a bit further, maybe I can help more.
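For example, a minimal sketch of Bayesian logistic regression fitted with ADVI in pymc3; the tiny synthetic X and y stand in for your data, and the priors are illustrative choices:

```python
import numpy as np
import pymc3 as pm

# Placeholder data standing in for your features and binary target
X = np.random.randn(1000, 15)
y = (np.random.rand(1000) < 0.5).astype(int)

with pm.Model():
    # Illustrative priors over the intercept and coefficients
    intercept = pm.Normal("intercept", mu=0.0, sigma=10.0)
    coefs = pm.Normal("coefs", mu=0.0, sigma=1.0, shape=X.shape[1])
    # Bernoulli likelihood through a logistic link
    p = pm.math.sigmoid(intercept + pm.math.dot(X, coefs))
    pm.Bernoulli("obs", p=p, observed=y)
    # ADVI is variational inference: much faster than NUTS sampling
    approx = pm.fit(n=20000, method="advi")
    trace = approx.sample(1000)  # draws from the fitted approximation
```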

What is Weka's InfoGainAttributeEval formula for evaluating Entropy with continuous values?

I'm using Weka's attribute selection function for Information Gain and I'm trying to figure out the specific formula Weka uses when dealing with continuous data.
I understand the usual formula for entropy, H = -sum_i p_i * log2(p_i), applies when the values in the data are discrete, and that when dealing with continuous data one can either use differential entropy or discretize the values. I've tried looking at Weka's explanation of InfoGainAttributeEval and have looked through many other references, but I can't find anything.
Maybe it's just me, but would anyone know how Weka implements this case?
Thanks!
I asked the author, Mark Hall, and he said:
It uses the supervised MDL-based discretization method of Fayyad and Irani. See the javadocs:
http://weka.sourceforge.net/doc.stable-3-8/weka/attributeSelection/InfoGainAttributeEval.html
You can also see this link for the discretization method:
http://weka.sourceforge.net/doc.stable-3-8/weka/filters/supervised/attribute/Discretize.html
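In other words, once the Fayyad-Irani MDL step has binned a continuous attribute, the usual discrete formula applies. A minimal sketch of that discrete computation in Python (illustrative names, not Weka's internals):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_bins, labels):
    """H(class) - H(class | binned feature), for an already-discretized attribute."""
    total = entropy(labels)
    n = len(labels)
    conditional = 0.0
    for b in np.unique(feature_bins):
        mask = feature_bins == b
        conditional += mask.sum() / n * entropy(labels[mask])
    return total - conditional
```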

How to remove redundant features using weka

I have around 300 features and I want to find the best subset of features by using feature selection techniques in Weka. Can someone please tell me what method to use to remove redundant features in Weka :)
There are mainly two types of feature selection techniques that you can use in Weka:
Feature selection with the wrapper method:
"Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.
The search process may be methodical such as a best-first search, it may be stochastic such as a random hill-climbing algorithm, or it may use heuristics, like forward and backward passes to add and remove features.
An example of a wrapper method is the recursive feature elimination algorithm." [From http://machinelearningmastery.com/an-introduction-to-feature-selection/]
Feature selection with the filter method:
"Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable.
Examples of filter methods include the chi-squared test, information gain and correlation coefficient scores." [From http://machinelearningmastery.com/an-introduction-to-feature-selection/]
If you are using the Weka GUI, then you can take a look at two of my video casts here and here.
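Weka specifics aside, the core idea of a redundancy filter is easy to sketch. Here is a correlation-based version in Python (not a Weka API; the 0.9 cutoff is an arbitrary illustrative threshold):

```python
import numpy as np
import pandas as pd

def drop_redundant(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```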

Feature selection

I am trying to find a useful feature selection method for a set of 20000 genes from an expression set (microarray), to get a model with only the useful genes.
I tried using RFE from caret, but I got a stack overflow error, since backward selection does not support data where n(predictors) > n(samples).
Could anyone suggest a reasonable method for this, or a fix for the RFE selection approach?
Thanks in advance.
Did you try using genetic algorithms for feature selection? There are different packages to do this - GA, genalg, caret (in R).
Take a look at this page, where feature selection using genetic algorithms is explained with an example - http://topepo.github.io/caret/GA.html
Hope it helps.
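The packages above are R; as an illustration of the same idea in Python, here is a minimal genetic-algorithm selection sketch. The population size, generations, mutation rate and the LogisticRegression fitness model are all stand-in choices, not anything from those packages:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Score a feature subset by cross-validated accuracy of a simple model
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

def ga_select(X, y, pop_size=20, generations=15, mutation_rate=0.01):
    n_features = X.shape[1]
    # Start from sparse random masks: only a few genes selected per individual
    pop = rng.random((pop_size, n_features)) < 0.05
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # keep fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, n_features))            # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_features) < mutation_rate   # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents] + children)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[int(np.argmax(scores))]  # boolean mask of selected features
```

Unlike backward elimination, this never has to fit a model on all 20000 predictors at once, so it sidesteps the p >> n problem that broke RFE.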
