Categorical Variable in EditMetadata module of Azure ML - machine-learning

Could anyone please explain the purpose of marking a variable as Categorical in the Edit Metadata module of Azure Machine Learning, ideally with an example? Also, is it applicable to both features and the label?

There are cases where variables are treated as strings instead of categorical values, and this affects model accuracy depending on how the variable is handled. In such cases, the user may want the system to treat these variables as categorical rather than string, and the Edit Metadata module enforces this behavior.
https://msdn.microsoft.com/library/azure/370b6676-c11c-486f-bf73-35349f842a66?f=255&MSPPError=-2147217396

Related

Alternatives of LabelEncoder() for target variable while implementing in a pipeline

I am developing a classification baseline model. I have used ColumnTransformer and Pipeline for feature engineering, feature selection, model selection, and everything else. I wanted to encode my categorical target (dependent) variable to numeric inside the pipeline. I learned that LabelEncoder cannot be used inside either a ColumnTransformer or a Pipeline, because its fit only takes y and throws the error 'TypeError: fit_transform() takes 2 positional arguments but 3 were given.' What are the alternatives for the target variable? I found many similar questions, but they were about features, and the recommendations were to use OneHotEncoder or OrdinalEncoder.
Basically, don't.
All (or at least most) sklearn classifiers will encode the target internally, and they produce more useful information when trained directly on the "real" target values. (E.g., predict will return the actual target values without you having to decode the mapping.)
(As for regression, if the target is actually ordinal in nature, you may be able to use TransformedTargetRegressor. Whether this makes sense probably depends on the model type.)
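To illustrate the point above, here is a minimal sketch (toy data, scikit-learn assumed available) showing that a classifier can be fitted on string targets directly, with no LabelEncoder step:

```python
# Sketch: sklearn classifiers accept string targets as-is,
# so the target usually needs no manual encoding in a pipeline.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [0], [1], [1]]
y = ["no", "no", "yes", "yes"]  # categorical target left as strings

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[1]]))  # predictions come back as the original labels
print(clf.classes_)        # the internal label mapping, if you ever need it
```

predict here returns `['yes']` directly, with no decoding step required.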

Best way to treat (too) many classes in one categorical variable

I'm working on a ML prediction model and I have a dataset with a categorical variable (let's say product id) and I have 2k distinct products.
If I convert this variable to dummy variables with a one-hot encoder, the dataset may grow to 2k columns times the number of examples (millions of examples), which is too much to process.
How is this usually handled?
Should I just use the variable without any conversion?
Thanks.
High cardinality of categorical features is a well-known problem, and "the best" way to handle it typically depends on the prediction task and requires a trial-and-error approach. Whether you can even find a strategy that is clearly better than the others is case-dependent.
Addressing your first question, a good collection of different encoding strategies is provided by the category_encoders library:
A set of scikit-learn-style transformers for encoding categorical variables into numeric
They follow the scikit-learn API for transformers, and a simple example is provided as well. Again, which one gives the best results depends on your dataset and the prediction task. I suggest incorporating them in a pipeline and testing (some or all of) them.
In regard to your second question, you would then continue to use the encoded features for your predictions and analysis.
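As a hedged sketch of the "try encoders inside a pipeline" idea: category_encoders transformers plug into a Pipeline the same way, but for a self-contained example the stand-in below uses scikit-learn's own OneHotEncoder on toy data.

```python
# Sketch: an encoder as the first step of a pipeline, so any encoder
# (one-hot, target, hashing, ...) can be swapped in and cross-validated.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

X = [["p1"], ["p2"], ["p1"], ["p3"]]   # toy stand-in for a product-id column
y = [0, 1, 0, 1]

pipe = Pipeline([
    # handle_unknown="ignore" encodes unseen levels as all zeros
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict([["p1"], ["p2"]]))
```

Swapping the `"encode"` step is all it takes to compare different encoding strategies under the same model.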

Categorical variables with too many levels in machine learning

I have a machine learning problem where the dependent variable is binomial (Yes/No) and some of the independent variables are categorical (with more than 100 levels). I'm not sure whether dummy coding these categorical variables and then passing them to the machine learning model is an optimal solution.
Is there a way to deal with this problem?
Thanks!
You may try creating dummy variables from the categorical variables. Before that, try to combine some of the levels of each categorical variable.
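One common way to "combine levels" is to lump rare categories into a single OTHER bucket before dummy coding, so the number of dummy columns stays manageable. A minimal sketch (the threshold and the OTHER label are arbitrary choices):

```python
# Sketch: replace categories seen fewer than min_count times with OTHER,
# shrinking the number of levels before one-hot/dummy encoding.
from collections import Counter

def lump_rare(values, min_count=2, other="OTHER"):
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

jobs = ["A", "A", "B", "C", "A", "B", "D"]
print(lump_rare(jobs))  # ['A', 'A', 'B', 'OTHER', 'A', 'B', 'OTHER']
```

After this step, dummy coding produces 3 columns (A, B, OTHER) instead of 4.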

Many-state nominal variables modelling

I was reading about neural networks and found this:
"Many-state nominal variables are more difficult to handle. ST Neural Networks has facilities to convert both two-state and many-state nominal variables for use in the neural network. Unfortunately, a nominal variable with a large number of states would require a prohibitive number of numeric variables for one-of-N encoding, driving up the network size and making training difficult. In such a case it is possible (although unsatisfactory) to model the nominal variable using a single numeric index; a better approach is to look for a different way to represent the information."
This is exactly what happens as I build my input layer: one-of-N encoding makes the model very complex to design. However, the passage mentions that you can use a single numeric index, and I am not sure what is meant by that. What is a better way to represent the information? Can neural networks solve a problem with many-state nominal variables?
References:
http://www.uta.edu/faculty/sawasthi/Statistics/stneunet.html#gathering
Solving this task well is often crucial for modeling. Depending on the complexity of the distribution of the nominal variable, it is often truly important to find a proper embedding from its values into R^n for some n.
One of the most successful examples of such an embedding is word2vec, where a mapping between words and vectors is learned. In other cases, you should either use a ready-made solution if one exists, or build your own via representation learning (e.g., with autoencoders or RBMs).
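As an illustration of the embedding idea (this is not word2vec itself, just its shape): each nominal value maps to an integer index, and the index selects a small dense vector from a matrix. The matrix below is random; in a real model it would be learned jointly with the network.

```python
# Sketch: an embedding replaces a length-N one-of-N vector with a
# small dense vector looked up by integer index.
import numpy as np

levels = ["student", "programmer", "artist"]   # the nominal values
index = {v: i for i, v in enumerate(levels)}   # value -> integer index
rng = np.random.default_rng(0)
E = rng.normal(size=(len(levels), 2))          # 3 levels -> 2-d vectors

vec = E[index["programmer"]]                   # the embedding lookup
print(vec.shape)  # (2,) instead of a length-3 one-hot vector
```

For 2k distinct levels, the same lookup would map into, say, a 16-dimensional space instead of 2,000 one-hot columns.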

Nominal valued dataset in machine learning

What's the best way to use nominal values, as opposed to real or boolean ones, in a feature vector for machine learning?
Should I map each nominal value to real value?
For example, suppose I want my program to learn a predictive model for web service users whose input features include
{ gender(boolean), age(real), job(nominal) }
where the dependent variable may be the number of website logins.
The variable job may be one of
{ PROGRAMMER, ARTIST, CIVIL SERVANT... }.
Should I map PROGRAMMER to 0, ARTIST to 1 and etc.?
Do a one-hot encoding, if anything.
If your data has categorical attributes, it is recommended to use an algorithm that can handle such data well without the hack of encoding, e.g., decision trees and random forests.
If you read the book "Machine Learning with Spark", the author wrote:

"Categorical features cannot be used as input in their raw form, as they are not numbers; instead, they are members of a set of possible values that the variable can take. In the example mentioned earlier, user occupation is a categorical variable that can take the value of student, programmer, and so on. […] To transform categorical variables into a numerical representation, we can use a common approach known as 1-of-k encoding. An approach such as 1-of-k encoding is required to represent nominal variables in a way that makes sense for machine learning tasks. Ordinal variables might be used in their raw form but are often encoded in the same way as nominal variables."
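The 1-of-k encoding described in that quote can be sketched in plain Python for the `job` variable from the question (library encoders such as scikit-learn's OneHotEncoder produce the same mapping):

```python
# Sketch: 1-of-k (one-hot) encoding of a nominal variable by hand.
jobs = ["PROGRAMMER", "ARTIST", "CIVIL SERVANT"]
k = {v: i for i, v in enumerate(sorted(set(jobs)))}  # value -> position

def one_of_k(value):
    vec = [0] * len(k)
    vec[k[value]] = 1
    return vec

print(one_of_k("PROGRAMMER"))  # exactly one position is 1
```

This avoids imposing a spurious order such as PROGRAMMER=0, ARTIST=1, which a plain integer mapping would.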
I had exactly the same thought. I think that if there is a meaningful (well-designed) transformation function that maps categorical (nominal) values to real values, I may also use learning algorithms that only take numerical vectors. I have actually done some projects that way, and no issues were raised concerning the performance of the learning system.