How can a machine learning model handle unseen data and unseen label? - machine-learning

I am trying to solve a text classification problem. I have a limited number of labels that capture the category of my text data. If the incoming text data doesn't fit any label, it is tagged as 'Other'. In the below example, I built a text classifier to classify text data as 'breakfast' or 'italian'. In the test scenario, I included couple of text data that do not fit into the labels that I used for training. This is the challenge that I'm facing. Ideally, I want the model to say - 'Other' for 'i like hiking' and 'everyone should understand maths'. How can I do this?
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
X_train = np.array(["coffee is my favorite drink",
"i like to have tea in the morning",
"i like to eat italian food for dinner",
"i had pasta at this restaurant and it was amazing",
"pizza at this restaurant is the best in nyc",
"people like italian food these days",
"i like to have bagels for breakfast",
"olive oil is commonly used in italian cooking",
"sometimes simple bread and butter works for breakfast",
"i liked spaghetti pasta at this italian restaurant"])
y_train_text = ["breakfast","breakfast","italian","italian","italian",
"italian","breakfast","italian","breakfast","italian"]
X_test = np.array(['this is an amazing italian place. i can go there every day',
'i like this place. i get great coffee and tea in the morning',
'bagels are great here',
'i like hiking',
'everyone should understand maths'])
classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB())])
classifier.fit(X_train, y_train_text)
predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)
['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
[0.52943091 0.47056909]
[0.52669142 0.47330858]
[0.42787443 0.57212557]
[0.4 0.6 ]]
I consider the 'Other' category as noise and I cannot model this category.

I think Kalsi might have suggested this but it was not clear to me. You could define a confidence threshold for your classes. If the predicted probability does not achieve the threshold for any of your classes ('italian' and 'breakfast' in your example), you were not able to classify the sample yielding the 'other' "class".
I say "class" because other is not exactly a class. You probably don't want your classifier to be good at predicting "other" so this confidence threshold might be a good approach.

You cannot do that.
You have trained the model to predict only two labels i.e., breakfast or italian. So the model doesn't have any idea about the third label or the fourth etc.
You and me know that "i like hiking" is neither breakfast nor italian. But how a model a would know that ? It only knows breakfast & italian. So there has to be a way to tell the model that: If you get confused between breakfast &italian, then predict the label as other
You can achieve this by training the model which is having other as label with some texts like "i like hiking" etc
But in your case, a little hack can be done as follows.
So what does it mean when a model predicts a label with 0.5 probability (or approximately 0.5)? It means that model is getting confused between the labels breakfast and italian. So here you can take advantage of this.
You can take all the predicted probability values & assign the label other if the probability value is between 0.45 & 0.55 . In this way you can predict the other label (obviously with some errors) without letting the model knowing there is a label called other

You can try setting class priors when creating the MultinomialNB. You could create a dummy "Other" training example, and then set the prior high enough for Other so that instances default to Other when there aren't enough evidence to select the other classes.

No, you cannot do that.
You have to define a third category "other" or whatever name that suits you and give your model some data related to that category. Make sure that number of training examples for all three categories are somewhat equal, otherwise "other" being a very broad category could skew your model towards "other" category.
Other way to approach this, is to get noun phrases from all your sentences for different categories including other and then feed into the model, consider this as a feature selection step for your machine learning model. In this way noise added by irrelevant words will be removed, better performance than tf-idf.
If you have huge data, go for deep learning models which does feature selection automatically.
Dont go with manipulating probabilities by yourself approach, 50-50% probability means that the model is confused between two classes which you have defined, it has no idea about the third "other class".
Lets say the sentence is "I want italian breakfast", the model will be confused whether this sentence belongs to "italian" or "breakfast" category but that doesnt mean it belongs to "other" category".

Related

NLP / Rails sentiment search

I am building a tool from scratch that takes a sample of text and turns it into a list of categories. I am not using any libraries for this at the moment but am interested if anyone has experience in this territory as the hardest part that I'm struggling with is building in sentiment to the search. It's easy to word match but sentiment is much more challenging.
The goal would be to take something like this paragraph;
"Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"
and turn it into
categories = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
If possible I'd like to end up adding a filter for negative sentiment so that if the text said;
"I hate cooking"
'cooking' wouldn't be included in the categories.
Any help is greatly appreciated. TIA 👍
You seem to have at least two tasks: 1. Sequence classification by topics; 2. Sentiment analysis. [Edit, I only noticed now that you are using Ruby/Rails, but the code below is in Python. But maybe this answer is still useful for some people and the steps can be applied in any language.]
1. For sequence classification by topics, you can either define categories simply with a list of words as you said. Depending on the use-case, this might be the easiest option. If that list of words were too time-intensive to create, you can use a pre-trained zero-shot classifier. I would recommend the zero-shot classifier from HuggingFace, see details with code here.
Applied to your use-case, this would look like this:
# pip install transformers # pip install in terminal
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence = ["Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"]
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
classifier(sequence, candidate_labels, multi_class=True)
# output:
{'labels': ['photography', 'spain', 'chocolate', 'travel', 'father', 'cooking'],
'scores': [0.9802802205085754, 0.7929317951202393, 0.7469273805618286, 0.6030028462409973, 0.08006269484758377, 0.005216470453888178]}
The classifier returns scores depending on how certain it is that a each candidate_label is represented in your sequence. It doesn't catch everything, but it works quite well and is fast to put into practice.
2. For sentiment analysis you can use HuggingFace's sentiment classification pipeline. In your use-case, this would look like this:
classifier = pipeline("sentiment-analysis")
sequence = ["I hate cooking"]
classifier(sequence)
# Output
[{'label': 'NEGATIVE', 'score': 0.9984041452407837}]
Putting 1. and 2. together:
I would probably probably (a) first take your entire text and split it into sentences (see here how to do that); then (b) run the sentiment classifier on each sentence and discard those that have a high negative sentiment score (see step 2. above) and then (c) run your labeling/topic classification on the remaining sentences (see 1. above).

Splitting train and test in multilabel classification to avoid missing data in the train set

I have a dataset (600 rows) composed of two columns:
-Summary: which contains the text of a document
-Keywords: which contains the keywords that identify that document.
Summary KeyWords_in_Array_wo_insurance
0 court sanction scheme transfer insur reinsur b... [insolvency]
1 immedi custodi sentenc month week impos direct... [administration of justice, civil evidence, se...
2 motorist injur hit run collis car identifi dri... [negligence, road traffic]
3 claimant given permiss continu claim compani a... [insolvency, civil procedure]
4 court gave guidanc approach taken applic relea... [civil procedure, costs]
5 plaintiff solicitor entitl declar life critic ... [trusts]
6 claimant insur establish requir standard road ... [personal injury, torts]
7 minimum indemn requir institut charter account... [arbitration, civil procedure, costs, accounta...
8 applic secur cost court could take account eve... [civil procedure, insolvency, cpr, costs]
I want to predict which keywords identify a summary.
The keywords are not mutually exclusive.
My code is:
X_train, X_test, y_train, y_test = train_test_split(df_final["Summary"], df_final["KeyWords_in_Array_wo_insurance"], test_size=0.20, random_state=42)
mlb = MultiLabelBinarizer()
y_train_mlb = mlb.fit_transform(y_train)
classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train_mlb)
y_predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(y_predicted)
y_test_mlb = mlb.transform(y_test)
print("Accuracy = ", accuracy_score(y_test_mlb,y_predicted))
I am getting a low accuracy score: 20%
Therefore I am thinking that my classification is not good enough.
The reason it might be that some keywords are used only once.
For example, the keywords "animal" or "partnership" or "succession" are used only in 1 row. (Meaning, they are assigned to only one summary)
I think, therefore, that when I split the dataset in training and test, some "lonely keywords" enter the test dataset but not the training one. Thus, the model will never be trained on them.
Is this the reason why my accuracy is so low?
Or am I doing something else wrong?
I think maybe this is because you don't have a lot of data for all the different classes you are trying to obtain. If you are training something with two classes, you want thousands of text for each. Maybe you don't have that.
Also civil procedure and civil evidence are counted as different classes for you. I would imagine if you lemmatize all of them and make smaller categories (join these two as 'civil' for example) you could get a better result?

What is the difference between a feature and a label? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I'm following a tutorial about machine learning basics and there is mentioned that something can be a feature or a label.
From what I know, a feature is a property of data that is being used. I can't figure out what the label is, I know the meaning of the word, but I want to know what it means in the context of machine learning.
Briefly, feature is input; label is output. This applies to both classification and regression problems.
A feature is one column of the data in your input set. For instance, if you're trying to predict the type of pet someone will choose, your input features might include age, home region, family income, etc. The label is the final choice, such as dog, fish, iguana, rock, etc.
Once you've trained your model, you will give it sets of new input containing those features; it will return the predicted "label" (pet type) for that person.
Feature:
In Machine Learning feature means property of your training data. Or you can say a column name in your training dataset.
Suppose this is your training dataset
Height Sex Age
61.5 M 20
55.5 F 30
64.5 M 41
55.5 F 51
. . .
. . .
. . .
. . .
Then here Height, Sex and Age are the features.
label:
The output you get from your model after training it is called a label.
Suppose you fed the above dataset to some algorithm and generates a model to predict gender as Male or Female, In the above model you pass features like age, height etc.
So after computing, it will return the gender as Male or Female. That's called a Label
Here comes a more visual approach to explain the concept. Imagine you want to classify the animal shown in a photo.
The possible classes of animals are e.g. cats or birds.
In that case the label would be the possible class associations e.g. cat or bird, that your machine learning algorithm will predict.
The features are pattern, colors, forms that are part of your images e.g. furr, feathers, or more low-level interpretation, pixel values.
Label: Bird
Features: Feathers
Label: Cat
Features: Furr
Prerequisite: Basic Statistics and exposure to ML (Linear Regression)
It can be answered in a sentence -
They are alike but their definition changes according to the necessities.
Explanation
Let me explain my statement. Suppose that you have a dataset, for this purpose consider exercise.csv. Each column in the dataset are called as features. Gender, Age, Height, Heart Rate, Body_temp, and Calories might be one among various columns. Each column represents distinct features or property.
exercise.csv
User_ID Gender Age Height Weight Duration Heart_Rate Body_Temp Calories
14733363 male 68 190.0 94.0 29.0 105.0 40.8 231.0
14861698 female 20 166.0 60.0 14.0 94.0 40.3 66.0
11179863 male 69 179.0 79.0 5.0 88.0 38.7 26.0
To solidify the understanding and clear out the puzzle let us take two different problems (prediction case).
CASE1: In this case we might consider using - Gender, Height, and Weight to predict the Calories burnt during exercise. That prediction(Y) Calories here is a Label. Calories is the column that you want to predict using various features like - x1: Gender, x2: Height and x3: Weight .
CASE2: In the second case here we might want to predict the Heart_rate by using Gender and Weight as a feature. Here Heart_Rate is a Label predicted using features - x1: Gender and x2: Weight.
Once you have understood the above explanation you won't really be confused with Label and Features anymore.
Let's take an example where we want to detect the alphabet using handwritten photos. We feed these sample images in the program and the program classifies these images on the basis of the features they got.
An example of a feature in this context is: the letter 'C' can be thought of like a concave facing right.
A question now arises as to how to store these features. We need to name them. Here's the role of the label that comes into existence. A label is given to such features to distinguish them from other features.
Thus, we obtain labels as output when provided with features as input.
Labels are not associated with unsupervised learning.
A feature briefly explained would be the input you have fed to the system and the label would be the output you are expecting. For example, you have fed many features of a dog like his height, fur color, etc, so after computing, it will return the breed of the dog you want to know.
Suppose you want to predict climate then features given to you would be historic climate data, current weather, temperature, wind speed, etc. and labels would be months.
The above combination can help you derive predictions.

How To Train Stanford NER For Names That Include Spaces?

The example training excersize labels single-term names after tokenizing with something like a simple split(' ').
I need to train for and recognize names that include spaces. How do I train the recognizer?
Example: "I saw a Big Red Apple Tree." -- How would I tokenize for training and then recognize "Big Red Apple Tree" instead of recognizing four separate words?
Will this work for the training data?
I\tO
saw\tO
a\tO
Big Red Apple Tree\tMyName
.\tO
Would the output from the recognizer look the same as that?
The training section in the FAQ says "The training file parser isn't very forgiving: You should make sure each line consists of solely content fields and tab characters. Spaces don't work."
The problem you are trying to solve belongs to phrase identification. There are different ways with which you can tag the words. For example, You can tag the words with IOB tags. Train the stanford ner model onto this newly created data. Write a post processing step to concatenate the predicted data.
For Example :
your training data should look like this:
I\tO
saw\tO
a\tO
Big\tB-MyName
Red\tI-MyName
Apple\tI-MyName
Tree\tO-MyName
.\tO<br/>
So Basically, you are using [ 0, B-MyName , I-MyName , O-MyName ] as tags.
I have solved similar problem and it works great. But make sure you have enough data to train it on.

Classification with numerical label?

I know of a couple of classification algorithms such as decision trees, but I can't use any of them to the problem I have at hands.
I have a dataset in which each row contains information about a purchase. It's columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to make a prediction that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner, and I'm experimenting with some of its features. Everything that I've tried there doesn't allow me to have a real number (amount spent) as a label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn import cross_validation
from sklearn import metrics
from sklearn import grid_search
def fit_predict_model(data_import):
"""Find and tune the optimal model. Make a prediction on housing data."""
# Get the features and labels from your data
X, y = data_import.data, data_import.target
# Setup a Decision Tree Regressor
regressor = DecisionTreeRegressor()
parameters = {'max_depth':(4,5,6,7), 'random_state': [1]}
scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
## fit your data to it ##
reg = grid_search.GridSearchCV(estimator = regressor, param_grid = parameters, scoring=scoring_function, cv=10, refit=True)
fitted_data = reg.fit(X, y)
print "Best Parameters: "
print fitted_data.best_params_
# Use the model to predict the output of a particular sample
x = [## input a test sample in this list ##]
y = reg.predict(x)
print "Prediction: " + str(y)
fit_predict_model(##your data in here)
I took this from a project I was working on almost directly to predict housing prices so there are probably some unnecessary libraries and without doing validation you have no clue how accurate this case would be, but this should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Yes, as comments have pointed out it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
In RapidMiner type regression into the Operators menu and you'll see several options under Modelling-> Functions. Linear Regression, Polynomical, Vector, etc. (There's more, but as a beginner let's start here).
Right click any of these operators and press Show Operator Info and you'll see numerical labels are allowed.
Next scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, but it's good to get you started with an example.
Let me know if you need any help.

Resources