I am working on a basic machine learning linear regression model.
I have categorical features with heavily skewed counts, like:
AllPub 1459
NoSeWa 1
Name: Utilities, dtype: int64
As one can see, AllPub is the value that contributes almost everything. So is this feature useful in model creation? Shall I use it or not?
As you can see, most of the values are AllPub and only one value is NoSeWa, so it will not make much difference whether you keep the feature or remove it.
Another way of thinking about it is as an outlier. Since there is a count of only one, it might have been entered incorrectly. You can impute that value with the mode.
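To make this concrete, here is a minimal pandas sketch of both options, assuming the feature lives in a DataFrame named df (the toy data below just recreates the counts from the question):

    import pandas as pd

    # Toy data recreating the skewed counts from the question.
    df = pd.DataFrame({'Utilities': ['AllPub'] * 1459 + ['NoSeWa']})
    print(df['Utilities'].value_counts())

    # Option 1: drop the near-constant column, since it carries almost no signal.
    df_dropped = df.drop(columns=['Utilities'])

    # Option 2: treat the lone NoSeWa as a likely data-entry error and
    # impute it with the mode (the most frequent value).
    mode_value = df['Utilities'].mode()[0]
    df['Utilities'] = df['Utilities'].replace('NoSeWa', mode_value)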
I'm working on a regression problem to predict the selling price of a product. The features are a 4-level product hierarchy and a proposed price; in short, four categorical features and one numerical feature. There are about 1000K (one million) rows in total.
I think a decision tree or random forest would work better than regression in this scenario, since there is only one numerical feature. I also plan to convert the numerical feature (proposed price) into price buckets, making it another categorical feature.
Does my reasoning make sense? Is there any other algorithm that might be worth trying? Is there any other clever feature engineering worth trying?
Note 1: This is actually a challenge problem (like Kaggle), so the features have been masked and encoded. Looking at the data, I can say for sure that there is a 4-level product hierarchy, but I'm not so sure about the one numerical feature (which I think is the proposed price), because in some cases it differs a lot from the sold price (the y variable). Also, there are a lot of outliers in this column (probably introduced deliberately to confuse).
I would not recommend binning the proposed-price variable, as one would expect that variable to carry most of the information needed to predict the selling price. Binning is advantageous when a variable is noisy, but it comes at a cost, since you throw away valuable information. You do not have to bin your continuous variable anyway: trees will do it for you (and RFs likewise). If your categorical variables are ordinal, you do not have to do anything; if they are not, you may consider encoding them (mapping the distinct values to, say, one-hot vectors such as 0,0,1) and trying other regressors that way, such as SVR from https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html (in this case you may also consider scaling the variables to [0, 1]; see the sketch below).
edit: RFs generally outperform single trees; just make sure you know what you're doing, and understand that an RF is many trees ensembled together.
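To illustrate the one-hot + SVR route suggested above, here is a minimal scikit-learn sketch; the column names are hypothetical, since the real features are masked:

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
    from sklearn.svm import SVR

    # Hypothetical column names for the masked features.
    categorical_cols = ['level1', 'level2', 'level3', 'level4']
    numeric_cols = ['proposed_price']

    preprocess = ColumnTransformer([
        # One-hot encode the non-ordinal hierarchy levels.
        ('cats', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        # Scale the price to [0, 1], as suggested for SVR.
        ('num', MinMaxScaler(), numeric_cols),
    ])

    model = Pipeline([('prep', preprocess), ('svr', SVR())])
    # model.fit(X_train, y_train)  # X_train: DataFrame with the columns above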
I loaded a dataset with 156 variables for a project. The goal is to figure out a model to predict a test dataset. I am confused about where to start. Normally I would start with a basic linear regression model, but with 156 columns/variables, how should one start building a model? Thank you!
The question here is pretty open-ended.
You need to confirm whether you are solving a regression or a classification problem.
You need to go through some descriptive statistics of your dataset to find out what types of values you have. Are there outliers or missing values? Are there columns whose values are in the billions alongside columns whose values are tiny fractions?
If you have categorical data, what types of categories do you have, and what is the frequency count of each categorical value?
Accordingly, you clean the data (if required).
After this, you may want to understand the correlation (via Pearson's correlation or chi-square, depending on the data types of your variables) among these 156 variables and see how correlated they are.
You may then choose to get rid of certain variables after looking at the correlations, or by performing a PCA (which reduces the dataset to fewer dimensions while retaining most of the variance).
You may then look at fitting regression or classification models (depending on your need), starting with a simpler model and then adjusting things as you work on improving your accuracy (or minimizing the loss).
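Here is a minimal scikit-learn sketch of that workflow, assuming the 156 variables are numeric and already loaded into a pandas DataFrame (random data stands in for the real dataset):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Stand-in for the real dataset: 500 rows, 156 numeric columns.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(500, 156)))

    # Descriptive statistics: ranges, missing values, outliers.
    print(df.describe())
    print(df.isna().sum().sort_values(ascending=False).head())

    # Pairwise Pearson correlation to spot redundant variables.
    corr = df.corr()

    # PCA after scaling (scaling matters when some columns are in the
    # billions and others are tiny fractions); keep enough components
    # to retain 95% of the variance.
    X = StandardScaler().fit_transform(df)
    X_reduced = PCA(n_components=0.95).fit_transform(X)
    print(X_reduced.shape)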
I met a tricky issue when trying to vectorize a feature. I have a feature like this:
most of it is numeric, like 0, 1, 33.3, 100, etc.
some of it is empty, which represents "not provided".
some of it is "auto", which means it adapts to the context.
Now my question is: how do I encode this feature into vectors effectively? One thing I could do is treat all the numerical values as categorical too, but that would cause an explosion of the feature space, and it is also not good for representing similar data points. What should I do?
Thanks!
--- THE ALGORITHM/MODEL I'M USING ---
It's an LSTM (Long Short-Term Memory) neural network. Currently I'm going with the following approach. Say I have 2 data points:
col1
entry1: 1.0
entry2: auto
It'll be encoded into:
col1-a col1-b
entry1: 1.0 0
entry2: dummy 1
So col1-b will represent whether it's auto or not. The dummy number will be the median of all the numeric data. Will this work?
Also, each numeric value has a unit associated with it, so there's another column with values like 'px' and 'pt'. In this case, does the numeric value still have meaning if I extract the unit into another column? The two have their actual meaning when taken together (numeric + unit), but can the NN notice that if they are in different dimensions?
That depends on what type of algorithm you will be using. If you want to use something like association rule classification, then you will have to treat all of your variables as categorical data. If you want to use logistic regression, that isn't needed. You'd have to provide more details to get a better answer.
edit
I made some edits after reading your edit.
It sounds like what you have is at least reasonable. I've read books where people use the mean/median/mode to fill in missing values for numeric data. As for which specific one works best for you, I don't know. Can you try training your classifier with each version?
As for your issue with the "auto" column, it sounds like you want to do something similar to running a regression with categorical data. I don't have much experience with neural networks, but I know that if you were to use something like logistic regression, this is the approach you would want to take. Hopefully this gives you an idea of what to research.
As far as treating all of your numerical data as categorical data goes, you can do that as well, but you have to normalize it first. You can do something like min-max normalization and then just take the integer part of the number. Your data will then be in the same form as categorical data.
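To tie this together, here is a minimal pandas sketch of the encoding discussed above (median fill plus indicator columns); the toy values mirror the ones in the question:

    import numpy as np
    import pandas as pd

    # Toy feature: mostly numeric, with 'auto' and empty entries.
    col1 = pd.Series(['1.0', 'auto', '', '33.3', '100'])

    is_auto = (col1 == 'auto').astype(int)    # the col1-b indicator
    is_missing = (col1 == '').astype(int)     # flag for "not provided"

    # Convert to numbers, then fill the special cases with the median
    # of the genuinely numeric entries (the "dummy" value).
    numeric = pd.to_numeric(col1.replace({'auto': np.nan, '': np.nan}))
    numeric = numeric.fillna(numeric.median())

    encoded = pd.DataFrame({'col1_value': numeric,
                            'col1_is_auto': is_auto,
                            'col1_is_missing': is_missing})
    print(encoded)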
I am not quite sure what the differences are between classification and regression.
From what I understand, classification deals with something categorical: it's either this or it's that.
Regression is more of a prediction.
Both of the problems above would be more of a regression problem, right? Both use a learning algorithm to predict. Could anyone give an example of classification vs. regression?
You are correct: given some data point, classification assigns a label (or 'class') to that point. This label is, as you said, categorical. One example might be, say, malware classification: given some file, is it malware or is it not? (The "label" will be the answer to this question: 'yes' or 'no'.)
But in regression, the goal is instead to predict a real value (i.e. not categorical). An example here might be, given someone's height and age, predict their weight.
So in either of the questions you've quoted, the answer comes down to what you are trying to get out of your prediction: a category, or a real value?
(A side note: there are connections and relations between the two problems, and you could, if you wanted, see regression as an extension of classification to the case where the labels are ordinal and there are infinitely many of them.)
1. Classification is the process of organizing data into categories for its most effective and efficient use, whereas regression is the process of identifying the relationship between variables and the effect of that relationship on the future value of an outcome.
2. Classification is used to predict discrete, categorical labels, whereas regression is used to predict continuous, numerical values.
Classification example:
Predicting whether a share of a company is good to buy or not, given the previous history of the company along with buyers' reviews saying yes or no to buying the share. (Discrete answer: Buy - Yes/No)
Regression example:
Predicting the best price at which one should buy a share of a company, given the previous history of the company along with the prices at which buyers bought the share in the past. (Continuous answer: a price range)
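A minimal scikit-learn sketch of the contrast, using hypothetical toy share data (a past return and a review score as features):

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = np.array([[0.05, 4], [-0.10, 2], [0.12, 5], [-0.03, 1]])

    # Classification: predict a discrete label (buy: 1 = yes, 0 = no).
    y_class = np.array([1, 0, 1, 0])
    clf = LogisticRegression().fit(X, y_class)
    print(clf.predict([[0.08, 3]]))  # -> a class, 0 or 1

    # Regression: predict a continuous value (a price to pay).
    y_reg = np.array([102.5, 88.0, 110.3, 91.7])
    reg = LinearRegression().fit(X, y_reg)
    print(reg.predict([[0.08, 3]]))  # -> a real number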
Suppose that for a given ML problem, we have a feature indicating which car a person possesses. We can encode this information in one of the following ways:
Assign an ID to each car. Make a column 'CAR_POSSESSED' and put the car's ID as the value.
Make a column for each car and put 0 or 1 according to whether that car is possessed by the given sample. The columns will be like "BMW_POSSESSED", "AUDI_POSSESSED".
In my experiments the 2nd way performed much better than the 1st when tried with an SVM.
How does the encoding affect model learning, and are there resources in which the effect of encoding has been studied? Or do we need to use trial and error to check which performs best?
The problem with the first way is that you use arbitrary numbers to represent the features (e.g. BMW=2, etc.), and the SVM takes those numbers seriously, as if they had an order: e.g. it may try to use cases with CAR_POSSESSED > 3 for the prediction.
So the second way is better.
Chapter 2.1 Categorical Features:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
You'll find many more references if you search for "SVM categorical features".
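To see the difference concretely, here is a minimal scikit-learn sketch of the two encodings, with hypothetical car labels:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    cars = np.array([['BMW'], ['Audi'], ['Tesla'], ['BMW']])

    # Way 1: one arbitrary ID per car (the CAR_POSSESSED column).
    # The SVM would read these IDs as ordered quantities.
    ids = OrdinalEncoder().fit_transform(cars)
    print(ids.ravel())  # e.g. [1. 0. 2. 1.] -- implies a false order

    # Way 2: one 0/1 column per car (BMW_POSSESSED, AUDI_POSSESSED, ...).
    onehot = OneHotEncoder().fit_transform(cars).toarray()
    print(onehot)  # rows like [0. 1. 0.] -- no implied order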