Nominal valued dataset in machine learning - machine-learning

What's the best way to include nominal values, as opposed to real or boolean ones, in a feature vector for machine learning?
Should I map each nominal value to a real value?
For example, suppose I want my program to learn a predictive model for the users of a web service, whose input features may include
{ gender(boolean), age(real), job(nominal) }
where the dependent variable may be the number of website logins.
The variable job may be one of
{ PROGRAMMER, ARTIST, CIVIL SERVANT... }.
Should I map PROGRAMMER to 0, ARTIST to 1, and so on?

Do a one-hot encoding, if anything.
If your data has categorical attributes, it is recommended to use an algorithm that can handle such data well without the hack of encoding, e.g. decision trees and random forests.
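As a minimal sketch of what one-hot encoding means for the example in the question, the snippet below turns { gender, age, job } into a purely numeric vector. The job list is only the three values named in the question, and the helper names `one_hot` and `encode_user` are made up for illustration.

```python
# Hypothetical category list, taken from the example in the question.
JOBS = ["PROGRAMMER", "ARTIST", "CIVIL SERVANT"]

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def encode_user(is_male, age, job):
    """Build a numeric feature vector: [gender, age, job one-hot...]."""
    return [int(is_male), float(age)] + one_hot(job, JOBS)

vector = encode_user(True, 31, "ARTIST")
print(vector)  # [1, 31.0, 0, 1, 0]
```

Unlike mapping PROGRAMMER to 0 and ARTIST to 1, this does not impose an artificial order or distance between jobs.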

If you read the book called "Machine Learning with Spark", the author
wrote,
Categorical features
Categorical features cannot be used as input in their raw form, as they are not
numbers; instead, they are members of a set of possible values that the variable can take. In the example mentioned earlier, user occupation is a categorical variable that can take the value of student, programmer, and so on.
:
To transform categorical variables into a numerical representation, we can use a
common approach known as 1-of-k encoding. An approach such as 1-of-k encoding
is required to represent nominal variables in a way that makes sense for machine
learning tasks. Ordinal variables might be used in their raw form but are often
encoded in the same way as nominal variables.
:
I had exactly the same thought.
I think that if there is a meaningful (well-designed) transformation function that maps categorical (nominal) values to real values, I can also use learning algorithms that only take numerical vectors.
Actually, I've done some projects where I had to do it that way, and
no issues were raised concerning the performance of the learning system.
To someone who took a vote against my question,
please cancel your evaluation.

Related

Best way to treat (too) many classes in one categorical variable

I'm working on an ML prediction model and I have a dataset with a categorical variable (let's say product id) with 2k distinct products.
If I convert this variable to dummy variables with something like a one-hot encoder, the dataset would have about 2k columns times the number of examples (millions of examples), which is too much to process.
How is this usually handled?
Should I just use the variable without the conversion?
Thanks.
High cardinality of categorical features is a well-known problem, and "the best" way typically depends on the prediction task and requires a trial-and-error approach. Whether you can even find a strategy that is clearly better than the others depends on the case.
Addressing your first question, a good collection of different encoding strategies is provided by the category_encoders library:
A set of scikit-learn-style transformers for encoding categorical variables into numeric
They follow the scikit-learn API for transformers, and a simple example is provided as well. Again, which one will provide the best results depends on your dataset and the prediction task. I suggest incorporating them in a pipeline and testing some or all of them.
In regard to your second question, you would then continue to use the encoded features for your predictions and analysis.
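To give a feel for why alternatives to one-hot encoding help with 2k products, here is a minimal pure-Python sketch of the hashing trick, one common strategy for high-cardinality variables. It does not use the category_encoders API; the function name `hash_encode` and the bucket count are assumptions chosen for illustration.

```python
import hashlib

def hash_encode(value, n_buckets=8):
    """Map a category string to a fixed-length 0/1 vector via a stable hash."""
    # md5 is used only because it is deterministic across runs,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_buckets
    vector = [0] * n_buckets
    vector[index] = 1
    return vector

# Hypothetical product id; the vector width stays at n_buckets
# no matter how many distinct products exist.
row = hash_encode("product-1234")
print(len(row), sum(row))  # 8 1
```

The trade-off is that different products can collide in the same bucket, so the bucket count is something to tune against validation performance.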

Best practices for using categorical variables in H2O?

I'm trying to use H2O's Random Forest for a multinomial classification into 71 classes with 38,000 training set examples. I have one feature that is a string and is in many cases predictive, so I want to use it as a categorical feature.
The hitch is that even after canonicalizing the strings (uppercase, stripping out numbers, punctuation, etc.), I still have 7,000 different strings (some due to spelling or OCR errors, etc.). I have code to remove strings that are relatively rare, but I'm not sure what a reasonable cutoff value is. (I can't seem to find any help in the documentation.)
I'm also not sure what to do with the nbins_cats hyperparameter. Should I make it equal to the number of different categorical values I have? [added: default for nbins_cats is 1024 and I'm well below that at around 300 different categorical values, so I guess I don't have to do anything with this parameter]
I'm also thinking perhaps if a categorical value is associated with too many different categories that I'm trying to predict, maybe I should drop it as well.
I'm also guessing I need to increase the tree depth to handle this better.
Also, is there a special value to indicate "don't know" for the strings that I am filtering out? (I'm mapping it to a unique string but I'm wondering if there is a better value that indicates to H2O that the categorical value is unknown.)
Many thanks in advance.
High cardinality categorical predictors can sometimes hurt model performance. Specifically, in the case of tree-based models, the tree ensemble (GBM or Random Forest) can end up memorizing the training data, and the model then generalizes poorly to validation data.
A good indication of whether this is happening is if your string/categorical column has very high variable importance. This means that the trees are continuing to split on this column to memorize the training data. Another indication is if you see much smaller error on your training data than on your validation data. This means the trees are overfitting to the training data.
Some methods for handling high cardinality predictors are:
removing the predictor from the model
performing categorical encoding [pdf]
performing grid search on nbins_cats and categorical_encoding
There is a Python example in the H2O tutorials GitHub repo that showcases the effects of removing the predictor from the model and performing grid search here.
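The rare-string filtering described in the question can be sketched in plain Python as below. This is not H2O code; `collapse_rare`, the `__OTHER__` placeholder, and the cutoff of 2 are all assumptions for illustration, and the cutoff itself is something to tune against validation error rather than a recommended value.

```python
from collections import Counter

UNKNOWN = "__OTHER__"  # hypothetical placeholder label for filtered strings

def collapse_rare(values, min_count):
    """Replace every category seen fewer than `min_count` times with UNKNOWN."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else UNKNOWN for v in values]

# Made-up column with one misspelling ("ACNE") and one rare value.
column = ["ACME", "ACME", "ACME", "ACNE", "GLOBEX", "GLOBEX", "INITECH"]
print(collapse_rare(column, min_count=2))
# ['ACME', 'ACME', 'ACME', '__OTHER__', 'GLOBEX', 'GLOBEX', '__OTHER__']
```

Trying several cutoffs and comparing training and validation error for each is one way to pick a value that reduces memorization without discarding predictive strings.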

Should 'deceptive' training cases be given to a Naive Bayes Classifier

I am setting up a Naive Bayes Classifier to try to determine sameness between two records of five string properties. I am only comparing each pair of properties exactly (i.e., with a Java .equals() method). I have some training data, both TRUE and FALSE cases, but let's just focus on the TRUE cases for now.
Let's say there are some TRUE training cases where all five properties are different. That means every comparator fails, but the records are actually determined to be the 'same' after some human assessment.
Should this training case be fed to the Naive Bayes Classifier? On the one hand, considering that NBC treats each variable separately, these cases shouldn't totally break it. However, it certainly seems true that feeding in enough of these cases wouldn't be beneficial to the classifier's performance. I understand that seeing a lot of these cases would mean better comparators are required, but I'm wondering what to do in the meantime. Another consideration is that the flip side is impossible; that is, there's no way all five properties could be the same between two records and still have them be 'different' records.
Is this a matter of preference, or is there a definitive accepted practice for handling this?
Usually you will want a training data set that is as representative as feasible of the domain from which you hope to classify observations (though this is often difficult). An unrepresentative set may lead to a poorly functioning classifier, particularly in a production environment where varied data are received. That being said, preprocessing may be used to limit the exposure of a classifier trained on a particular subset of data, so it is quite dependent on the purpose of the classifier.
I'm not sure why you wish to exclude some elements though. Parameter estimation/learning should account for the fact that two different inputs may map to the same output --- that is why you would use machine learning instead of simply using a hashmap. Considering that you usually don't have 'all data' to build your model, you have to rely on this type of inference.
Have you had a look at NLTK? It is in Python, but it seems that OpenNLP may be a suitable substitute in Java. You can employ better feature extraction techniques that lead to a model that accounts for minor variations in input strings (see here).
Lastly, it seems to me that you want to learn a mapping from input strings to the classes 'same' and 'not same'; in other words, you seem to want to infer a distance measure (just checking). It would make more sense to invest effort in directly finding a better measure (e.g. for character transposition issues you could use edit distances). I'm not sure that NB is well suited to your problem, as it attempts to determine a class given one or more observations (or their features). This class will have to be discernible over many different strings (I'm assuming you are going to concatenate string1 and string2 and offer them to the classifier). Will there be enough structure present to derive such a widely applicable property? This classifier is basically going to need to deal with all pair-wise 'comparisons', unless you build NBs for each one-vs-many pairing. This does not seem like a simple approach.
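As a rough illustration of the edit-distance idea, here is a minimal sketch of the classic Levenshtein distance, written in Python for brevity even though the question uses Java. The `similarity` helper and its normalization are assumptions, not something the answer prescribes; such a score could replace the exact-match comparator for each property.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def similarity(a, b):
    """Hypothetical normalized score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))  # 3
```

Note that plain Levenshtein counts a swap of two adjacent characters as two edits; the Damerau-Levenshtein variant treats it as one, which may suit transposition errors better.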

Many-state nominal variables modelling

I was reading about neural networks and found this:
"Many-state nominal variables are more difficult to handle. ST Neural Networks has facilities to convert both two-state and many-state nominal variables for use in the neural network. Unfortunately, a nominal variable with a large number of states would require a prohibitive number of numeric variables for one-of-N encoding, driving up the network size and making training difficult. In such a case it is possible (although unsatisfactory) to model the nominal variable using a single numeric index; a better approach is to look for a different way to represent the information."
This is exactly what is happening when I build my input layer: one-of-N encoding makes the model very complex to design. The passage above mentions that you can use a numeric index, but I am not sure what the author means by that. What is a better approach to represent the information? Can neural networks solve a problem with many-state nominal variables?
References:
http://www.uta.edu/faculty/sawasthi/Statistics/stneunet.html#gathering
Solving this task is very often crucial for modeling. Depending on the complexity of the distribution of this nominal variable, it is very often important to find a proper embedding between its values and R^n for some n.
One of the most successful examples of such an embedding is word2vec, where a function between words and vectors is obtained. In other cases, you should either use a ready-made solution if one exists or prepare your own by representation learning (e.g. with autoencoders or RBMs).
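Structurally, an embedding is just a lookup table from each state to a short vector in R^n. The sketch below shows only that structure: the vectors are random placeholders, whereas in word2vec or an autoencoder they would be learned. The function name, the 1000 states, and n = 4 are all assumptions for illustration.

```python
import random

def make_embedding_table(categories, dim, seed=0):
    """Assign each category a `dim`-dimensional vector (random, not learned)."""
    rng = random.Random(seed)
    return {c: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for c in categories}

# Hypothetical many-state nominal variable with 1000 states.
states = [f"state_{i}" for i in range(1000)]
table = make_embedding_table(states, dim=4)

# The network now sees 4 inputs for this variable instead of a
# 1000-wide one-of-N vector.
vector = table["state_42"]
print(len(vector))  # 4
```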

Difference between parameters, features and class in Machine Learning

I am a newbie in Machine learning and Natural language processing.
I am always confused about what these three terms mean.
From my understanding:
class: The various categories our model outputs. For example, given a person's name, identify whether the person is male or female.
Let's say I am using a Naive Bayes classifier.
What would be my features and parameters?
Also, what are some aliases of the above words that are used interchangeably?
Thank you
Let's use the example of classifying the gender of a person. Your understanding about class is correct! Given an input observation, our Naive Bayes Classifier should output a category. The class is that category.
Features: Features in a Naive Bayes Classifier, or any general ML classification algorithm, are the data points we choose to define our input. For the example of a person, we can't possibly input all data points about a person; instead, we pick a few features to define a person (say "Height", "Weight", and "Foot Size"). Specifically, in a Naive Bayes Classifier, the key assumption we make is that these features are independent of each other given the class (they don't affect each other): a person's height doesn't affect weight, which doesn't affect foot size. This assumption may or may not be true, but for Naive Bayes, we assume that it is. In the particular case of your example, where the input is just the name, features might be the frequency of letters, the number of vowels, the length of the name, or suffixes/prefixes.
Parameters: Parameters in Naive Bayes are the estimates of the true distribution of whatever we're trying to classify. For example, we could say that roughly 50% of people are male, and the distribution of male height is a Gaussian distribution with mean 5' 7" and standard deviation 3". The parameters would be the 50% estimate, the 5' 7" mean estimate, and the 3" standard deviation estimate.
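To make the role of those three parameters concrete, the sketch below plugs them into the Gaussian density that a Gaussian Naive Bayes classifier would evaluate for the "male" class. Heights are converted to inches (5' 7" = 67"); the function names are made up, and only the male side is shown because the example gives no parameters for the other class.

```python
import math

# Parameters from the example above, in inches (5'7" = 67").
PRIOR_MALE = 0.5
MALE_HEIGHT_MEAN = 67.0
MALE_HEIGHT_STD = 3.0

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def male_score(height):
    """Unnormalized score: P(male) * p(height | male)."""
    return PRIOR_MALE * gaussian_pdf(height, MALE_HEIGHT_MEAN, MALE_HEIGHT_STD)

# The score is highest at the mean and falls off with distance from it.
print(male_score(67.0) > male_score(73.0))  # True
```

A full classifier would compute the same kind of score for every class, multiply in one density per feature, and pick the class with the largest product.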
Aliases: Features are also referred to as attributes. I'm not aware of any common replacements for 'parameters'.
I hope that was helpful!
@txizzle explained the case of Naive Bayes well. In a more general sense:
Class: The output category of your data. You can call these categories as well. The labels on your data will point to one of the classes (if it's a classification problem, of course.)
Features: The characteristics that define your problem. These are also called attributes.
Parameters: The variables your algorithm is trying to tune to build an accurate model.
As an example, let us say you are trying to decide whether to admit a student to grad school based on various factors like his/her undergrad GPA, test scores, scores on recommendations, projects, etc. In this case, the factors mentioned above are your features/attributes, whether the student is admitted or not gives your 2 classes, and the numbers that decide how these features combine to produce your output are your parameters. What the parameters actually represent depends on your algorithm. For a neural net, they are the weights on the synaptic links. Similarly, for a regression problem, the parameters are the coefficients of your features when they are combined.
Take a simple linear classification problem:
y = 0 if 5x - 3 >= 0 else 1
Here y is the class, x is the feature, and 5 and 3 are the parameters.
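The same rule as runnable Python, with the two parameters exposed as arguments so it is clear what a learning algorithm would be adjusting. The names `w` and `b` are my own labels for 5 and 3.

```python
def classify(x, w=5.0, b=3.0):
    """Return class 0 if w*x - b >= 0, otherwise class 1."""
    return 0 if w * x - b >= 0 else 1

print(classify(1.0))  # 0, because 5*1 - 3 = 2 >= 0
print(classify(0.0))  # 1, because 5*0 - 3 = -3 < 0
```

Changing `w` or `b` moves the decision boundary, while `x` comes from the data and `y` is what the function returns.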
I just wanted to add a definition that distinguishes between attributes and features, as these are often used interchangeably, and it may not be correct to do so. I'm quoting 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'.
In Machine Learning an attribute is a data type (e.g., “Mileage”),
while a feature has several meanings depending on the context, but
generally means an attribute plus its value (e.g., “Mileage =
15,000”). Many people use the words attribute and feature interchangeably,
though.
I like the definition in “Hands-On Machine Learning with Scikit-Learn and TensorFlow” (by Aurélien Géron), where
ATTRIBUTE = DATA TYPE (e.g., Mileage)
FEATURE = DATA TYPE + VALUE (e.g., Mileage = 50000)
Regarding FEATURE versus PARAMETER, based on the definition in Géron’s book, I used to interpret the FEATURE as the variable and the PARAMETER as the weight or coefficient, as in the model below
Y = a + b*X
X is the FEATURE
a, b are the PARAMETERS
However, in some publications I have seen the following interpretation:
X is the PARAMETER
a, b are the WEIGHTS
So, lately, I’ve begun to use the following definitions:
FEATURE = variables of the RAW DATA (e.g., all columns in the spreadsheet)
PARAMETER = variables used in the MODEL (i.e., after selecting the features that will be in the model)
WEIGHT = coefficients of the parameters of the MODEL
Thoughts?
Let's see if this works :)
Imagine you have an Excel spreadsheet that has data about specific products and the presence of 7 atomic elements in each of them.
[product] [calcium] [magnesium] [zinc] [iron] [potassium] [nitrogen] [carbon]
Features are all the columns except the product, because those columns are independent, coexist, and have a measurable impact on the target, i.e. the product. You can even choose to combine some of them into a single feature called Essential Elements, i.e. dimension reduction, to make the data more appropriate for analysis. The term "Dimension Reduction" is used strictly for explanation here and is not to be confused with the PCA technique in unsupervised learning. Features are relevant for supervised learning techniques.
Now, imagine a cool machine that has the capability of looking at the data above and inferring what the product is.
Parameters are like levers and stopcocks specific to that machine, which you can adjust to make sure that if the machine says "It's soap scum", it really is. Think of yourself practising at a dartboard: what adjustments would you make to get closer to the bullseye (balancing bias and variance)?
Hyperparameters are like parameters, BUT external to the machine we're talking about. What if the machine's parts/mechanical elements are made of a specific compound, e.g. carbon fibre or a magnesium poly-alloy? How would this change what the machine can or can't do better?
I suppose it's an oversimplification of what things are, but hopefully acceptable?

Resources