Logistic regression for a text feature - machine-learning

As far as I know, we can apply logistic regression to numerical values. What if I have a text feature that I should reject (0) or approve (1)?
For instance:
Feature   y
Text1     1
Text2     0
Text3     0
...
Textn     1
What kind of approach should I use?

Related

Machine Learning Classification

My target is in the range 1 to 5. Is there a way to force the model to predict only within this range?
Regardless of the model I use, I sometimes get negative values and values greater than 5.
You can use a model that supports multi-class classification, such as softmax regression. This algorithm is a generalization of logistic regression that can classify N classes where N > 2.
The hard prediction of your model can be:
1 2 3 4 5
0 0 0 1 0
Which means that the prediction is 4
or it can be a soft prediction:
1 2 3 4 5
0.1 0.1 0.6 0.1 0.1
Which gives a probability for each class, so you can also see how confident your model is.
Scikit-learn implements softmax regression within the LogisticRegression estimator itself, by specifying the parameter multi_class="multinomial".
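For instance, a minimal sketch (with made-up toy data, not the asker's) of multinomial logistic regression in scikit-learn:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, purely illustrative: 4 numeric features, labels in {1, ..., 5}
X = np.random.rand(100, 4)
y = np.random.randint(1, 6, size=100)

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
clf.fit(X, y)

print(clf.predict(X[:3]))        # hard predictions: always one of 1..5
print(clf.predict_proba(X[:3]))  # soft predictions: one probability per class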

Use Vowpal Wabbit with probabilities as labels to predict probabilities

I'm trying to use Vowpal Wabbit to predict probabilities given an existing set of statistics. My txt file looks like this:
0.22 | Features1
0.28 | Features2
Now, given this example, I want to predict the label (probability) for Features3. I'm trying to use logistic regression:
vw -d ds.vw.txt -f model.p --loss_function=logistic --link=logistic -p probs.txt
But I get the error:
You are using label 0.00110011 not -1 or 1 as loss function expects!
You are using label 0.00559702 not -1 or 1 as loss function expects!
etc..
How can I use these statistics as labels to predict probabilities?
To predict a continuous label you need to use one of the following loss functions:
--loss_function squared # optimizes for min loss vs mean
--loss_function quantile # optimizes for min loss vs median
--loss_function squared is the vw default, so you may leave it out.
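For instance, following the first suggestion, you can keep the probabilities as plain continuous labels and train with the default squared loss. A minimal sketch (file and model names are just placeholders) that writes such a file from Python:
# Write continuous probability labels in VW's plain-text format.
examples = [(0.22, "Features1"), (0.28, "Features2")]
with open("ds.vw.txt", "w") as out:
    for prob, feats in examples:
        out.write(f"{prob} | {feats}\n")
# Then train a squared-loss regressor on the command line
# (--loss_function squared is the default, so it can be omitted):
#   vw -d ds.vw.txt -f model.reg -p preds.txt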
Another trick you may use is to map your probability range into [-1, 1] by mapping the mid-point 0.5 to 0.0 using the function (2*probability - 1). You can then use --loss_function logistic which requires binary labels (-1 and 1), but follow the labels with abs(probability) as a floating point weight:
1 0.22 | features...
-1 0.28 | features...
This may or may not work better for your particular data (you'll have to hold out some of your data and test your different models for accuracy).
Background regarding binary outcomes: vw's "starting point" (i.e., the null, or initial, model) is 0.0 weights everywhere. This is why, when you're doing logistic regression, the negative and positive labels must be -1 and 1 (rather than 0 and 1), respectively.

Store textual dataset for binary classification

I am currently working on a machine learning project and am in the process of building the dataset. The dataset will be comprised of a number of different textual features, varying in length from 1 sentence to around 50 sentences (including punctuation). What is the best way to store this data so it can be pre-processed and used for machine learning in Python?
In most cases you can use a method called Bag of Words; however, for more complicated tasks like similarity extraction or comparing sentences, you should use Word2Vec.
Bag of Words
You can use the classical Bag-of-Words representation, in which you encode each sample into a long vector holding the counts of all the words from all samples. For example, if you have two samples:
"I like apple, and she likes apple and banana.",
"I love dogs but Sara prefer cats.".
Then all the possible words are (order doesn't matter here):
I she Sara like likes love prefer and but apple banana dogs cats , .
Then the two samples will be encoded as
First: 1 1 0 1 1 0 0 2 0 2 1 0 0 1 1
Second: 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1
If you are using sklearn, the task would be as simple as:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
# Now you can feed X into any other machine learning algorithm.
Word2Vec
Word2Vec is a more complicated method, which attempts to capture the relationships between words by training an embedding neural network underneath. An embedding, in plain English, can be thought of as the mathematical representation of a word in the context of all the samples provided. The core idea is that words are similar if their contexts are similar.
The result of Word2Vec is a vector representation (embedding) for every word that appears in the samples. The amazing thing is that we can perform arithmetic operations on these vectors. A classic example is: Queen - Woman + Man ≈ King.
To use Word2Vec, we can use a package called gensim; here is a basic setup:
from gensim.models import Word2Vec

# sentences is an iterable of tokenized documents, e.g. [['i', 'like', 'apple'], ...]
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)  # size is called vector_size in gensim >= 4.0
model.most_similar(positive=['woman', 'king'], negative=['man'])  # use model.wv.most_similar in gensim >= 4.0
>>> [('queen', 0.50882536), ...]
Here sentences is your data (an iterable of tokenized sentences), and size is the dimension of the embeddings: the larger size is, the more capacity there is to represent each word, but also the more risk of overfitting. window is the size of the context we care about: the number of words around the target word that are used when predicting the target from its context during training.
One common way is to create a dictionary (of all the possible words) and then encode each of your examples with respect to that dictionary. For example (this is a very small and limited dictionary, just for illustration), you could have the dictionary: hello, world, from, python. Every word is associated with a position, and for each of your examples you define a vector with 0 for absence and 1 for presence. The example "hello python" would then be encoded as: 1,0,0,1
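A small sketch of that presence/absence encoding, using the toy dictionary above:
# Encode a text as 0/1 presence flags over a fixed dictionary.
dictionary = ["hello", "world", "from", "python"]

def encode(text, vocab):
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(encode("hello python", dictionary))  # [1, 0, 0, 1]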

How can I use one-hot encoded labels with some sklearn classifiers?

I have a multiclass classification task with 10 classes. As such, I used sklearn's OneHotEncoder to transform the one-column labels into 10-column labels. I was trying to fit the training data. Although I was able to do this with RandomForestClassifier, I got the error message below when fitting with GaussianNB:
ValueError: bad input shape (1203L, 10L)
I understand the allowed shape of y in these two classifiers is different:
GaussianNB:
y : array-like, shape (n_samples,)
RandomForest:
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The question is, why is this? Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"? Any way to go around it? Thanks!
The question is, why is this?
It is because of a slight misunderstanding: in scikit-learn you do not one-hot encode the labels, you pass them as a one-dimensional vector of labels. Thus, instead of
1 0 0
0 1 0
0 0 1
you literally pass
1 2 3
So why does random forest accept a different shape? Because it is not for the multiclass setting! It is for the multi-label setting, where each instance can have many labels, like
1 1 0
1 1 1
0 0 0
Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"?
On the contrary, it is the easiest solution: never ask for one-hot labels unless the problem is multi-label.
Any way to go around it?
Yup, just do not encode - pass raw labels :-)
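For example, a minimal sketch (with made-up toy data) of recovering raw labels from a one-hot matrix and fitting GaussianNB:
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(20, 3)                             # toy features
Y_onehot = np.eye(10)[np.random.randint(0, 10, 20)]   # toy one-hot labels, 10 classes

y = Y_onehot.argmax(axis=1)   # back to a 1-D vector of class indices
GaussianNB().fit(X, y)        # works; fitting on Y_onehot raises the shape error above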

How to tackle classification with string features?

I am working on an ad-click recommendation system in which I have to predict whether a user will click on an advertisement. I have 98 features in total, including both USER features and ADVERTISEMENT features. Some of the features which are very important for the prediction have string values like this:
FEATURE
Inakdtive Kunmden
Stammkfunden
Stammkdunden
Stammkfunden
guteg Quartialskunden
gutes Quartialskunden
guteg Quartialskunden
gutes Quartialskunden
There are 14 different string values like this in the whole data column. My model cannot take string values as input, so I have to convert them to categorical int values. I have no idea how to do this and make these features useful. I am using K-MEANS CLUSTERING & the RANDOMFOREST ALGORITHM.
Be careful when turning a list of string values into categorical ints, as the model will likely interpret the integers as being numerically significant, but they probably are not.
For instance, if:
'Dog'=1,'Cat'=2,'Horse'=3,'Mouse'=4,'Human'=5
Then the distance metric in your clustering algorithm would think that humans are more like mice than they are like dogs. It is usually more useful to turn them into 14 binary values (see the sketch after this answer), e.g.
Turn this:
'Dog'
'Cat'
'Human'
'Mouse'
'Dog'
Into this:
'Dog' 'Cat' 'Mouse' 'Human'
1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0
1 0 0 0
Not this:
'Species'
1
2
5
4
1
However, if these strings are going to be the 'targets' that you are classifying, and not the data 'features', you can leave them as ints in most multiclass classification algorithms in scikit-learn.
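A hedged sketch of that binary encoding with pandas (the column and value names are just the toy example's):
import pandas as pd

df = pd.DataFrame({"Species": ["Dog", "Cat", "Human", "Mouse", "Dog"]})
binary_cols = pd.get_dummies(df["Species"], dtype=int)   # one 0/1 column per distinct string
print(binary_cols)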
I like user1745038's answer and it should give you reasonably good results. However, if you want to extract more meaningful features out of your strings (especially if the number of distinct strings increases significantly), consider using some NLP techniques. For example, 'Dog' and 'Cat' are more similar than 'Dog' and 'Mouse'.
Good luck
