One-to-one matching to labels for text classification - machine-learning

I am using scikit-learn for a text classification problem and I would like to know if there is a machine learning technique that uses a one-to-one, mutually exclusive mapping for labeling.
For example, say I want to label three documents based on what city they represent. My label choices are New York, Detroit and Los Angeles. My documents are "The Big Apple," "The Big City," and "City of Angels." Let's say just for this example that "City of Angels" most closely maps to Los Angeles, while both "The Big Apple" and "The Big City" should map most closely to New York. However, I want one to map to New York ("The Big Apple" because let's say that has a better fit) and one to map to Detroit because New York has already been used, and Detroit is the only choice that's left and it still fits in some sense.
I want to tell the predictor that if it has used one label, it cannot use it again, so it needs to make the best guess for that label since it can only be used once.
Does scikit-learn or another library have a feature for handling this one-to-one (and only one) text classification like I would like to do?

To achieve this kind of functionality, I'd suggest you do the following:
I'd assume that in your text classification algorithm, you obtain a probability score for each document for every label.
e.g.:
Label           "The Big Apple"  "The Big City"  "City of Angels"
"New York"           0.45             0.45             0.10
"Detroit"            0.40             0.50             0.10
"Los Angeles"        0.15             0.05             0.80
You might now see where I am heading with this.
Use the argmax function (for each label, it returns the document with the maximum probability).
In this case, argmax would return the documents "The Big Apple" and "The Big City" for the label "New York" (a tie at 0.45), the document "The Big City" for the label "Detroit", and the document "City of Angels" for the label "Los Angeles".
Since there is a conflict (I'd rather not call it a conflict) for the label "New York" — you require a one-to-one mapping and two documents tie for it — I'd say you go to the next label. The label "Detroit" can clearly be assigned to the document "The Big City", since that document has the maximum probability (matching) for it; you then remove both the label "Detroit" and the document "The Big City" from consideration (remaining labels -> "New York" and "Los Angeles"). You then move on to the next label, "Los Angeles", and the argmax function tells you that the document "City of Angels" has the highest probability (maximum matching) of having the label "Los Angeles". You then remove "Los Angeles" from the remaining labels. At this point, remaining labels -> "New York". You go to the label "New York" and see that the only document it can still be assigned to is "The Big Apple", and you have a one-to-one mapping between the documents and the labels.
I have done this before in two ways: breaking a tie by assigning a label to a document randomly, or breaking the tie using the probabilities for the next label. A similar idea is used in decision tree algorithms to find the most suitable attribute at a given level in the tree, via the entropy or information gain of that attribute; this implementation is a simpler version of the information gain used in the ID3 decision tree algorithm.
More about the ID3 decision tree algorithm here.
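A minimal sketch of the greedy assignment described above, using the hypothetical probability table from this answer (the function name `greedy_one_to_one` is just an illustration):

```python
import numpy as np

labels = ["New York", "Detroit", "Los Angeles"]
docs = ["The Big Apple", "The Big City", "City of Angels"]
# Probability matrix from the table above: rows = labels, columns = documents.
proba = np.array([[0.45, 0.45, 0.10],
                  [0.40, 0.50, 0.10],
                  [0.15, 0.05, 0.80]])

def greedy_one_to_one(proba, labels, docs):
    """Repeatedly take the highest remaining (label, document) score,
    then retire both that label and that document."""
    p = proba.astype(float).copy()
    assignment = {}
    for _ in range(min(len(labels), len(docs))):
        i, j = np.unravel_index(np.argmax(p), p.shape)
        assignment[docs[j]] = labels[i]
        p[i, :] = -np.inf  # this label is used up
        p[:, j] = -np.inf  # this document is assigned
    return assignment

print(greedy_one_to_one(proba, labels, docs))
# → {'City of Angels': 'Los Angeles', 'The Big City': 'Detroit',
#    'The Big Apple': 'New York'}
```

Note that a greedy pass is not guaranteed to maximize the total probability; if you need the globally optimal one-to-one assignment, `scipy.optimize.linear_sum_assignment` solves that exactly.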

Related

Second word completion with python

I'm new to machine learning and I'm trying to come up with a model that will complete all second words in phrases. I couldn't find a solution to this exact problem, although there are lots of tutorials on generating text with RNNs.
So, consider you have the 2 following files:
1) a word dictionary for training
Say we have a table with 2 columns of word pairs: 'complete' and 'sample' such that the first column includes different word pairs ("Hello dear", "my name", "What time", "He goes", etc.) and the second one includes first words and only a part (> 2 letters) of second words ("Hello de", "my nam", "What ti", "He goe", etc.).
2) a table for testing
It's a table that consists of only 'sample' column.
The aim is to add 'complete' column to the second table with complete pairs of words.
I came up with the only way to do this:
compute the frequencies of all first words (P(w1))
compute the frequencies of all complete second words (P(w2))
compute the frequencies of all first words given complete second words (P(w1|w2))
predict the complete second word using Bayes' rule:
w2* = argmax_{w2} P(w2|w1) = argmax_{w2} P(w1|w2) * P(w2)
for each w1 in the test table, the predicted w2 is the most probable w2, or the overall most frequent w2 if w1 is not in the dict.
The problem is that this algorithm doesn't work sufficiently well. How can I somehow optimise the probabilities (maybe gradient descent might be helpful)? Is there any other way to address this task?
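The counting scheme from the question could be sketched like this (the training pairs are made up, and `complete` is a hypothetical helper name):

```python
from collections import Counter

# Toy training pairs: (first word, complete second word).
pairs = [("Hello", "dear"), ("my", "name"), ("What", "time"),
         ("He", "goes"), ("Hello", "dearest"), ("my", "name")]

w2_counts = Counter(w2 for _, w2 in pairs)  # count(w2), proportional to P(w2)
joint_counts = Counter(pairs)               # count(w1, w2)

def complete(w1, prefix):
    """Pick argmax over w2 starting with prefix of P(w1|w2) * P(w2).

    Since P(w1|w2) * P(w2) = (count(w1,w2)/count(w2)) * (count(w2)/n)
    = count(w1,w2)/n, the joint count alone decides the argmax.
    """
    candidates = [w2 for w2 in w2_counts if w2.startswith(prefix)]
    if not candidates:  # unseen prefix: back off to the most frequent w2
        return w2_counts.most_common(1)[0][0]
    return max(candidates, key=lambda w2: joint_counts[(w1, w2)])

print(complete("my", "nam"))  # → name
```

This also makes visible why the method is limited: with only counts, there is no smoothing for rare pairs, which is one place it can be improved.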

How can a machine learning model handle unseen data and unseen label?

I am trying to solve a text classification problem. I have a limited number of labels that capture the category of my text data. If the incoming text data doesn't fit any label, it is tagged as 'Other'. In the example below, I built a text classifier to classify text data as 'breakfast' or 'italian'. In the test scenario, I included a couple of text samples that do not fit into the labels that I used for training. This is the challenge that I'm facing. Ideally, I want the model to say 'Other' for 'i like hiking' and 'everyone should understand maths'. How can I do this?
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train = np.array(["coffee is my favorite drink",
                    "i like to have tea in the morning",
                    "i like to eat italian food for dinner",
                    "i had pasta at this restaurant and it was amazing",
                    "pizza at this restaurant is the best in nyc",
                    "people like italian food these days",
                    "i like to have bagels for breakfast",
                    "olive oil is commonly used in italian cooking",
                    "sometimes simple bread and butter works for breakfast",
                    "i liked spaghetti pasta at this italian restaurant"])
y_train_text = ["breakfast", "breakfast", "italian", "italian", "italian",
                "italian", "breakfast", "italian", "breakfast", "italian"]
X_test = np.array(['this is an amazing italian place. i can go there every day',
                   'i like this place. i get great coffee and tea in the morning',
                   'bagels are great here',
                   'i like hiking',
                   'everyone should understand maths'])

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])
classifier.fit(X_train, y_train_text)

predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)
['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
[0.52943091 0.47056909]
[0.52669142 0.47330858]
[0.42787443 0.57212557]
[0.4 0.6 ]]
I consider the 'Other' category as noise and I cannot model this category.
I think Kalsi might have suggested this, but it was not clear to me: you could define a confidence threshold for your classes. If the predicted probability does not reach the threshold for any of your classes ('italian' and 'breakfast' in your example), you were not able to classify the sample, yielding the 'other' "class".
I say "class" because 'other' is not exactly a class. You probably don't want your classifier to be good at predicting "other", so this confidence threshold might be a good approach.
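A minimal sketch of that threshold idea, using a cut-down version of the pipeline from the question (the 0.6 threshold is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train = np.array(["coffee is my favorite drink",
                    "i like to have bagels for breakfast",
                    "i like to eat italian food for dinner",
                    "pizza at this restaurant is the best in nyc"])
y_train = ["breakfast", "breakfast", "italian", "italian"]

classifier = Pipeline([("vectorizer", CountVectorizer()),
                       ("tfidf", TfidfTransformer()),
                       ("clf", MultinomialNB())])
classifier.fit(X_train, y_train)

X_test = np.array(["pizza at this italian restaurant is amazing",
                   "i like hiking"])

THRESHOLD = 0.6  # arbitrary cut-off chosen for this sketch

proba = classifier.predict_proba(X_test)
# Keep the argmax label only when the model is confident enough;
# otherwise fall back to the pseudo-class "other".
predicted = [classifier.classes_[row.argmax()] if row.max() >= THRESHOLD
             else "other"
             for row in proba]
print(predicted)
```

The right threshold depends on the data; in practice you would tune it on a validation set rather than pick it by hand.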
You cannot do that.
You have trained the model to predict only two labels, i.e., breakfast or italian. So the model doesn't have any idea about a third label, or a fourth, etc.
You and I know that "i like hiking" is neither breakfast nor italian. But how would a model know that? It only knows breakfast & italian. So there has to be a way to tell the model: if you get confused between breakfast & italian, then predict the label as other.
You can achieve this by training a model that has other as a label, with some texts like "i like hiking" etc.
But in your case, a little hack can be done as follows.
What does it mean when a model predicts a label with 0.5 probability (or approximately 0.5)? It means that the model is getting confused between the labels breakfast and italian. So here you can take advantage of this.
You can take all the predicted probability values & assign the label other if the probability value is between 0.45 & 0.55. In this way you can predict the other label (obviously with some errors) without letting the model know there is a label called other.
You can try setting class priors when creating the MultinomialNB. You could create a dummy "Other" training example, and then set the prior high enough for "Other" so that instances default to "Other" when there isn't enough evidence to select the other classes.
No, you cannot do that.
You have to define a third category, "other" or whatever name suits you, and give your model some data related to that category. Make sure that the number of training examples for all three categories is somewhat equal; otherwise "other", being a very broad category, could skew your model towards the "other" category.
Another way to approach this is to extract noun phrases from all your sentences for the different categories, including other, and then feed those into the model; consider this a feature selection step for your machine learning model. This way, the noise added by irrelevant words is removed, giving better performance than tf-idf alone.
If you have huge data, go for deep learning models, which do feature selection automatically.
Don't go with the approach of manipulating probabilities yourself: a 50-50% probability means that the model is confused between the two classes which you have defined; it has no idea about a third "other" class.
Let's say the sentence is "I want italian breakfast". The model will be confused whether this sentence belongs to the "italian" or the "breakfast" category, but that doesn't mean it belongs to the "other" category.

Using SVM to predict text with label

I have data in a csv file in the following format
Name    Power   Money
Jon     Red     30
George  blue    20
Tom     Red     40
Bob     purple  10
I consider values like "jon", "red" and "30" as inputs. Each input has a label. For instance, the inputs [jon, george, tom, bob] have the label "name". The inputs [red, blue, purple] have the label "power". This is basically how I have my training data: a bunch of values that are each mapped to a label.
Now I want to use SVM to train a model based on my training data that can accurately identify, given a new input, its correct label. So, for instance, if the input provided is "444", the model should be smart enough to categorize it under the "Money" label.
I have installed Python and also installed sklearn. I have completed the following tutorial as well. I am just not sure how to prepare the input data to train the model.
Also, I am new to machine learning, so if I have said something that sounds wrong or odd, please point it out, as I will be happy to learn.
With how your current question is formulated, you are not dealing with a typical machine learning problem. Currently, you have column-wise data:
Name    Power   Money
Jon     Red     30
George  blue    20
Tom     Red     40
Bob     purple  10
If a user now inputs "Jon", you know it is going to be of type "Name", by a simple hash-map lookup, e.g.:
hashmap["Jon"] -> "Name"
The main reason people are saying it is not a machine-learning problem is your "categorisation" or "prediction" is being defined by your column names. Machine learning problems, instead (typically), will be predicting some response variable. For example, imagine instead you had asked this:
Name    Power   Money   Bought_item
Jon     Red     30      yes
George  blue    20      no
Tom     Red     40      no
Bob     purple  10      yes
We could build a model to predict Bought_item using the features Name, Power, and Money using SVM.
Your problem would have to look more like:
Feature1   Feature2   Feature3   Category
1.0        foo        bar        Name
3.1        bar        foo        Name
23.4       abc        def        Money
22.22      afb        dad        Power
223.1      dad        vxv        Money
You then use Feature1, Feature2, and Feature3 to predict Category. At the moment, your question does not give enough information for anyone to really understand what you need or what you have; you would have to reformulate it this way, or consider an unsupervised approach.
Edit:
So frame it this way:
Name    Power   Money   Label
Jon     Red     30      Foo
George  blue    20      Bar
Tom     Red     40      Foo
Bob     purple  10      Bar
OneHotEncode Name and Power, so you now have a variable for each name that can be 0/1.
Standardise Money so that it lies, approximately, in the range -1 to 1.
LabelEncode your labels so that they are 0,1,2,3,4,5,6 and so on.
Use a One vs. All classifier, http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html.
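Those four steps might look roughly like this with pandas and scikit-learn (the DataFrame and the Foo/Bar labels are the toy ones from the table above, and `pandas.get_dummies` stands in for the one-hot encoding step):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy frame mirroring the table above ("Foo"/"Bar" are placeholder labels).
df = pd.DataFrame({"Name": ["Jon", "George", "Tom", "Bob"],
                   "Power": ["Red", "blue", "Red", "purple"],
                   "Money": [30, 20, 40, 10],
                   "Label": ["Foo", "Bar", "Foo", "Bar"]})

# Step 1: one-hot encode the categorical columns Name and Power.
X = pd.get_dummies(df[["Name", "Power"]])
# Step 2: standardise Money to roughly the -1..1 range.
X["Money"] = StandardScaler().fit_transform(df[["Money"]]).ravel()
# Step 3: encode the string labels as integers 0, 1, ...
y = LabelEncoder().fit_transform(df["Label"])
# Step 4: one-vs-rest wrapper around a linear SVM.
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(clf.predict(X))
```

With data this small the model just memorises the rows; the point is only to show how the four preprocessing steps fit together.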

What is the difference between a feature and a label? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I'm following a tutorial about machine learning basics and there is mentioned that something can be a feature or a label.
From what I know, a feature is a property of data that is being used. I can't figure out what the label is, I know the meaning of the word, but I want to know what it means in the context of machine learning.
Briefly, feature is input; label is output. This applies to both classification and regression problems.
A feature is one column of the data in your input set. For instance, if you're trying to predict the type of pet someone will choose, your input features might include age, home region, family income, etc. The label is the final choice, such as dog, fish, iguana, rock, etc.
Once you've trained your model, you will give it sets of new input containing those features; it will return the predicted "label" (pet type) for that person.
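As a minimal sketch of this input/output split (the ages, incomes, and pet choices below are invented for illustration, and the classifier choice is arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# Features: each row is [age, family_income] for one person.
X = [[8, 40_000], [35, 90_000], [16, 55_000], [60, 30_000]]
# Labels: the pet each person chose.
y = ["fish", "dog", "iguana", "dog"]

model = DecisionTreeClassifier().fit(X, y)
# A new set of input features yields a predicted label.
print(model.predict([[30, 85_000]]))
```

Whatever the algorithm, the shape is the same: features go in, a label comes out.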
Feature:
In Machine Learning feature means property of your training data. Or you can say a column name in your training dataset.
Suppose this is your training dataset
Height  Sex  Age
61.5    M    20
55.5    F    30
64.5    M    41
55.5    F    51
...
Then here Height, Sex and Age are the features.
Label:
The output you get from your model after training it is called a label.
Suppose you feed the above dataset to some algorithm and it generates a model to predict gender as Male or Female. To that model you pass features like age, height, etc.
After computing, it will return the gender as Male or Female. That's called a label.
Here comes a more visual approach to explain the concept. Imagine you want to classify the animal shown in a photo.
The possible classes of animals are e.g. cats or birds.
In that case the label would be the possible class association, e.g. cat or bird, that your machine learning algorithm will predict.
The features are patterns, colors, and forms that are part of your images, e.g. fur, feathers, or, in a more low-level interpretation, pixel values.
Label: Bird
Features: Feathers
Label: Cat
Features: Fur
Prerequisite: Basic Statistics and exposure to ML (Linear Regression)
It can be answered in a sentence -
They are alike, but their definition changes according to the problem at hand.
Explanation
Let me explain my statement. Suppose you have a dataset; for this purpose consider exercise.csv. Each column in the dataset is called a feature. Gender, Age, Height, Heart_Rate, Body_Temp, and Calories might be among its various columns. Each column represents a distinct feature or property.
exercise.csv
User_ID   Gender  Age  Height  Weight  Duration  Heart_Rate  Body_Temp  Calories
14733363  male    68   190.0   94.0    29.0      105.0       40.8       231.0
14861698  female  20   166.0   60.0    14.0      94.0        40.3       66.0
11179863  male    69   179.0   79.0    5.0       88.0        38.7       26.0
To solidify the understanding and clear up the puzzle, let us take two different problems (prediction cases).
CASE 1: In this case we might consider using Gender, Height, and Weight to predict the Calories burnt during exercise. The prediction (Y), Calories, is here the label: it is the column that you want to predict using various features like x1: Gender, x2: Height and x3: Weight.
CASE 2: In the second case we might want to predict the Heart_Rate using Gender and Weight as features. Here Heart_Rate is the label, predicted using the features x1: Gender and x2: Weight.
Once you have understood the above explanation you won't really be confused with Label and Features anymore.
Let's take an example where we want to detect the alphabet using handwritten photos. We feed these sample images into the program, and the program classifies the images on the basis of the features extracted from them.
An example of a feature in this context: the letter 'C' can be thought of as a curve that is concave facing right.
A question now arises: how do we store these features? We need to name them. Here the label comes into play. A label is given to such features to distinguish them from other features.
Thus, we obtain labels as output when provided with features as input.
Labels are not associated with unsupervised learning.
A feature, briefly explained, is the input you feed to the system, and the label is the output you expect. For example, if you have fed in many features of a dog, like its height, fur color, etc., then after computing, the model will return the breed of the dog you want to know.
Suppose you want to predict the climate; then the features given to you would be historic climate data, current weather, temperature, wind speed, etc., and the labels would be months.
The above combination can help you derive predictions.

Character n-gram vs word features in NLP

I'm trying to predict whether reviews on Yelp are positive or negative by performing linear regression using SGD. I tried two different feature extractors. The first was character n-grams and the second was separating words by spaces. I tried different n values for the character n-grams and found the n value that gave me the best test error. I noticed that this test error (0.27 on my test data) was nearly identical to the test error from extracting words separated by spaces. Is there a reason behind this coincidence? Shouldn't the character n-grams have a lower test error, since they extract more features than the word features?
Character n-gram: ex. n=7
"Good restaurant" => "Goodres" "oodrest" "odresta" "drestau" "restaur" "estaura" "stauran" "taurant"
Word features:
"Good restaurant" => "Good" "restaurant"
It looks like the n-gram method simply produced a lot of redundant, overlapping features which do not contribute to the precision.
