How to tackle classification with string features?

How to tackle classification with string features? - machine-learning

I am working on a ad-click recommendation system in which I have to predict whether a user will click on a Advertisement. I have 98 features in total having both USER features and ADVERTISEMENT features. Some of the features which are very important for the prediction are having string values like this.
**FEATURE**
Inakdtive Kunmden
Stammkfunden
Stammkdunden
Stammkfunden
guteg Quartialskunden
gutes Quartialskunden
guteg Quartialskunden
gutes Quartialskunden
There are 14 different string value like this in whole data column. My model cannot take string values as input so I have to convert them to categorical int values. I have no idea how to do this and make these features useful. I am using K-MEANS CLUSTERING & RANDOMFOREST ALGORITHM.

Be careful in turning a list of string values into categorical ints as the model will likely interpret the integers as being numerically significant, but they probably are not.
For instance, if:
'Dog'=1,'Cat'=2,'Horse'=3,'Mouse'=4,'Human'=5
Then the distance metric in your clustering algorithm would think that humans are more like mice than they are like dogs. It is usually more useful to turn them into 14 binary values e.g.
Turn this:
'Dog'
'Cat'
'Human'
'Mouse'
'Dog'
Into this:
'Dog' 'Cat' 'Mouse' 'Human'
1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0
1 0 0 0
Not this:
'Species'
1
2
5
4
1
However, if the data are going to be the 'targets' that you are classifying and not the data 'features', you can leave them as ints in most multi-classification algorithms in SciKit-Learn.

I like user1745038's answer and it should give you reasonably good results. However, if you want to extract more meaningful features out of your strings (specially if the number of strings increases significantly), consider using some NLP techniques. For example, 'Dog' and 'Cat' are more similar than 'Dog' and 'Mouse'.
Good luck

Related

replace missing values in categorical data

Let's suppose I have a column with categorical data "red" "green" "blue" and empty cells
red
green
red
blue
NaN
I'm sure that the NaN belongs to red green blue, should I replace the NaN by the average of the colors or is a too strong assumption? It will be
col1 | col2 | col3
1 0 0
0 1 0
1 0 0
0 0 1
0.5 0.25 0.25
Or even scale the last row but keeping the ratio so these values have less influence? Usually what is the best practice?
0.25 0.125 0.125

The simplest strategy for handling missing data is to remove records that contain a missing value.
The scikit-learn library provides the Imputer() pre-processing class that can be used to replace missing values. Since it is categorical data, using mean as replacement value is not recommended. You can use
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
The Imputer class operates directly on the NumPy array instead of the DataFrame.
Last but not least, not ALL ML algorithm cannot handle missing value. Different implementations of ML also different.

It depends on what you want to do with the data.
Is the average of these colors useful for your purpose?
You are creating a new possible value doing that, that is probably not wanted. Especially since you are talking about categorical data, and you are handling it as if it was numeric data.
In Machine Learning you would replace the missing values with the most common categorical value regarding a target attribute (what you want to predict).
Example: You want to predict if a person is male or female by looking at their car, and the color feature has some missing values. If most of the cars from male(female) drivers are blue(red), you would use that value to fill missing entries of cars from male(female) drivers.

In addition to Lan's answer's approach, which seems most commonly used, you can use something based on matrix factorization. For example there is a variant of Generalized Low Rank Models that can impute such data, just as probabilistic matrix factorization is used to impute continuous data.
GLRMs can be used from H2O which provides bindings for both Python and R.

Store textual dataset for binary classification

I am currently working on a machine learning project, and am in the process of building the dataset. The dataset will be comprised of a number of different textual features, of varying length from 1 sentence to around 50 sentences(including punctuation). What is the best way to store this data to then pre-process and use for machine learning using python?

In most cases, you can use a method called Bag of Word, however, in some cases when you are performing more complicated task like similarity extraction or want to make comparison between sentences, you should use Word2Vec
Bag of Word
You may use the classical Bag-Of-Word representation, in which you encode each sample into a long vector indicating the count of all the words from all samples. For example, if you have two samples:
"I like apple, and she likes apple and banana.",
"I love dogs but Sara prefer cats.".
Then all the possible words are(order doesn't matter here):
I she Sara like likes love prefer and but apple banana dogs cats , .
Then the two samples will be encoded to
First: 1 1 0 1 1 0 0 2 0 2 1 0 0 1 1
Second: 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1
If you are using sklearn, the task would be as simple as:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
# Now you can feed X into any other machine learning algorithms.
Word2Vec
Word2Vec is a more complicated method, which attempts to find the relationship between words by training a embedding neural network underneath. An embedding, in plain english, can be thought of the mathematical representation of a word, in the context of all the samples provided. The core idea is that words are similar if their contexts are similar.
The result of Word2Vec are the vector representation(embeddings) of all the words shown in all the samples. The amazing thing is that we can perform algorithmic operations on the vector. A cool example is: Queen - Woman + Man = King reference here
To use Word2Vec, we can use a package called gensim, here is a basic setup:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.50882536), ...]
Here sentences is your data, size is the dimension of the embeddings, the larger size is, the more space is used to represent a word, and there is always overfitting we should think about. window is the size of the context we are cared about, it is the number of words before the target word we are looking at when we are predicting the target from its context, when training.

One common way is to create your dictionary(all the posible words) and then encode every of your examples in function of this dictonary, for example(this is a very small and limited dictionary just for example) you could have a dictionary : hello ,world, from, python . Every word will be associated to a position, and in every of your examples you define a vector with 0 for inexistence and 1 for existence, for example for the example "hello python" you would encode it as: 1,0,0,1

Categorical and ordinal feature data difference in regression analysis?

I am trying to completely understand difference between categorical and ordinal data when doing regression analysis. For now, what is clear:
Categorical feature and data example:
Color: red, white, black
Why categorical: red < white < black is logically incorrect
Ordinal feature and data example:
Condition: old, renovated, new
Why ordinal: old < renovated < new is logically correct
Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-Hot encoding for categorical data
Arbitrary numbers for ordinal data
Example for categorical:
data = {'color': ['blue', 'green', 'green', 'red']}
Numeric format after One-Hot encoding:
color_blue color_green color_red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Example for ordinal:
data = {'con': ['old', 'new', 'new', 'renovated']}
Numeric format after using mapping: Old < renovated < new → 0, 1, 2
0 0
1 2
2 2
3 1
In my data price increases as condition changes from "old" to "new". "Old" in numeric was encoded as '0'. 'New' in numeric was encoded as '2'. So, as condition increases, then price also increases. Correct.
Now lets have a look at 'color' feature. In my case, different colors also affect price. For example, 'black' will be more expensive than 'white'. But from above mentioned numeric representation of categorical data, I do not see increasing dependancy as it was with 'condition' feature. Does it mean that change in color does not affect price in regression model if using one-hot encoding? Why to use one-hot encoding for regression if it does not affect price anyway? Can you clarify it?
UPDATE TO QUESTION:
First I introduce formula for linear regression:
Let have a look at data representations for color:
Let's predict price for 1-st and 2-nd item using formula for both data representations:
One-hot encoding:
In this case different thetas for different colors will exist and prediction will be:
Price (1 item) = 0 + 20*1 + 50*0 + 100*0 = 20$ (thetas are assumed for example)
Price (2 item) = 0 + 20*0 + 50*1 + 100*0 = 50$ (thetas are assumed for example)
Ordinal encoding for color:
In this case all colors have common theta but multipliers differ:
Price (1 item) = 0 + 20*10 = 200$ (theta assumed for example)
Price (2 item) = 0 + 20*20 = 400$ (theta assumed for example)
In my model White < Red < Black in prices. Seem to be that it is logical predictions in both cases. For ordinal and categorical representations. So I can use any encoding for my regression regardless of the data type (categorical or ordinal)? This division is just a matter of conventions and software-oriented representations rather than a matter of regression logic itself?

You will see not increasing dependency. The whole point of this discrimination is that colour is not a feature you can meaningfully place on a continuum, as you've already noted.
The one-hot encoding makes it very convenient for the software to analyze this dimension. Instead of having a feature "colour" with the listed values, you have a set of boolean (present / not-present) features. For instance, your row 0 above has features color_blue = true, color_green = false, and color_red = false.
The prediction data you get should show each of these as a separate dimension. For instance, presence of color_blue may be worth $200, while green is -$100.
Summary: don't look for a linear regression line running across a (non-existent) color axis; rather, look for color_* factors, one for each color. As far as your analysis algorithm is concerned, these are utterly independent features; the "one-hot" encoding (a term from digital circuit design) is merely our convention for dealing with this.
Does this help your understanding?
After your edit of the question 02:03 Z 04 Dec 2015:
No, your assumption is not correct: the two representations are not merely a matter of convenience. The ordering of colors works for this example -- because the effect happens to be a neat, linear function of the chosen encoding. As your example shows, your simpler encoding assumes that White-to-Red-to-Black pricing is a linear progression. What do you do when Green, Blue, and Brown are all $25, the rare Yellow is worth $500, and Transparent reduces the price by $1,000?
Also, how is it that you know in advance that Black is worth more than White, in turn worth more than Red?
Consider the case of housing prices based on elementary school district, with 50 districts in the area. If you use a numerical coding -- school district number, ordinal position alphabetically, or some other arbitrary ordering -- the regression software will have great trouble finding a correlation between that number and the housing price. Is PS 107 a more expensive district than PS 32 or PS 15? Are Addington and Bendemeer preferred to Union City and Ventura?
Splitting these into 50 different features under that one-hot principle decouples the feature from the encoding, and allows the analysis software to treat with them in a mathematically meaningful manner. It's not perfect by any means -- expanding from, say, 20 features to 70 means that it will take longer to converge -- but we do get meaningful results for the school district.
If you wish, you could now encode that feature in the expected order of value, and get a reasonable fit with little loss of accuracy and faster prediction from your model (fewer variables).

You cannot use ordinal encoding for a categorical variable where order doesn't matter. Main purpose of building a regression model is to see how much change in one variable has how much effect on the response variable. When you obtain the regression formula this is how you read it: "1 unit change in variable X causes theta_x change in response variable".
For example, let's say you built a regression model on housing prices and you got this: price = 1000 + (-50)*age_of_house. This means 1 year increase in the age of the house causes the price go down by 50.
When you have a categorical variable you cannot mention a unit change in that variable. You cannot say 1 unit increase/decrease in the color... etc. So, one-hot encoding, as Prune said in his/her answer, is merely a convention for dealing with categorical variables. It allows you to interpret the results like, if the house is white it adds $200 to the value when coefficient of color_white in your final model is +200. If the house is not white, that variable has no impact on your response variable because the value will be 0.
Don't forget that "Linear Regression" models can only explain linear relations between variables.
I hope this helps.

SVM Classifying binary data DNA

I'm working SVM in R software and I would appreaciate any input you may provide.
I have a data set that I need to train with SVM, the format of the data is the following
ToPredict Data1 Data2 Data3 Data4 DNA
S 1 12 1 11 000000000100
B -1 17 14 3 11011110111110111
S 1 4 0 4 0000
The question that I have is regarding the DNA column.
SVM is able to get an input like DNA and still calculate reliable predictions?
For my data set, 0≠00 or 1≠001 therefore, it cannot be taken as integers.Every value represents information that needs to be processed and the order is very important, it's a string of binary values, either is 1 or 0.
The 0101 information could be displayed as ABAB etc. (A=0, B=1)
How can I train a SVM with the data above?
Thank you.

For SVMs to work, "all" you need to have a kernel function.
So what is a sensible kernel function for your "DNA strings"? You probably don't need to be able to prove it is a proper kernel, but you can get away with a good similarity measure.
How would you evaluate similarity of your sequences? I cannot help you on that, because I don't know what the data means; this is up to the user (i.e. you) to specify.

How to cluster data with discrete binary attributes?

In my data, there are ten millions of binary attributes,
But only some of them are informative, most of them are zeros.
Format is like as following:
data attribute1 attribute2 attribute3 attribute4 .........
A 0 1 0 1 .........
B 1 0 1 0 .........
C 1 1 0 1 .........
D 1 1 0 0 .........
What is a smart way to cluster this?
I know K-means clustering. But I don't think it's suitable in this case.
Because the binary value makes distances less obvious.
And it will suffer form the curse of high-dimensionality.
Eeve if I cluster based on those few informative attribute, it's still to many attributes.
I think the decision tree is nice to cluster this data.
But it's a Classification algorithm!
What can I do?

Have you considered frequent itemset mining instead?
K-means definitely is a bad idea, but hierarchical clustering may work when using an appropriate distance function such as jaccard, hamming, dice, ...
Anyway, what is a cluster? The choice of algorithm needs to fit to the kind of cluster you want to find. On binary data, centroid-based methods such as k-means don't make sense, as centroids are not too meaningful.
If the data are "shopping cart" type of information, consider using frequent itemset mining, as it allows discovering overlapping subsets.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart