What is the difference between a feature and a label? [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I'm following a tutorial about machine learning basics and there is mentioned that something can be a feature or a label.
From what I know, a feature is a property of data that is being used. I can't figure out what the label is, I know the meaning of the word, but I want to know what it means in the context of machine learning.

Briefly, feature is input; label is output. This applies to both classification and regression problems.
A feature is one column of the data in your input set. For instance, if you're trying to predict the type of pet someone will choose, your input features might include age, home region, family income, etc. The label is the final choice, such as dog, fish, iguana, rock, etc.
Once you've trained your model, you will give it sets of new input containing those features; it will return the predicted "label" (pet type) for that person.

Feature:
In Machine Learning feature means property of your training data. Or you can say a column name in your training dataset.
Suppose this is your training dataset
Height Sex Age
61.5 M 20
55.5 F 30
64.5 M 41
55.5 F 51
. . .
. . .
. . .
. . .
Then here Height, Sex and Age are the features.
label:
The output you get from your model after training it is called a label.
Suppose you fed the above dataset to some algorithm and generates a model to predict gender as Male or Female, In the above model you pass features like age, height etc.
So after computing, it will return the gender as Male or Female. That's called a Label

Here comes a more visual approach to explain the concept. Imagine you want to classify the animal shown in a photo.
The possible classes of animals are e.g. cats or birds.
In that case the label would be the possible class associations e.g. cat or bird, that your machine learning algorithm will predict.
The features are pattern, colors, forms that are part of your images e.g. furr, feathers, or more low-level interpretation, pixel values.
Label: Bird
Features: Feathers
Label: Cat
Features: Furr

Prerequisite: Basic Statistics and exposure to ML (Linear Regression)
It can be answered in a sentence -
They are alike but their definition changes according to the necessities.
Explanation
Let me explain my statement. Suppose that you have a dataset, for this purpose consider exercise.csv. Each column in the dataset are called as features. Gender, Age, Height, Heart Rate, Body_temp, and Calories might be one among various columns. Each column represents distinct features or property.
exercise.csv
User_ID Gender Age Height Weight Duration Heart_Rate Body_Temp Calories
14733363 male 68 190.0 94.0 29.0 105.0 40.8 231.0
14861698 female 20 166.0 60.0 14.0 94.0 40.3 66.0
11179863 male 69 179.0 79.0 5.0 88.0 38.7 26.0
To solidify the understanding and clear out the puzzle let us take two different problems (prediction case).
CASE1: In this case we might consider using - Gender, Height, and Weight to predict the Calories burnt during exercise. That prediction(Y) Calories here is a Label. Calories is the column that you want to predict using various features like - x1: Gender, x2: Height and x3: Weight .
CASE2: In the second case here we might want to predict the Heart_rate by using Gender and Weight as a feature. Here Heart_Rate is a Label predicted using features - x1: Gender and x2: Weight.
Once you have understood the above explanation you won't really be confused with Label and Features anymore.

Let's take an example where we want to detect the alphabet using handwritten photos. We feed these sample images in the program and the program classifies these images on the basis of the features they got.
An example of a feature in this context is: the letter 'C' can be thought of like a concave facing right.
A question now arises as to how to store these features. We need to name them. Here's the role of the label that comes into existence. A label is given to such features to distinguish them from other features.
Thus, we obtain labels as output when provided with features as input.
Labels are not associated with unsupervised learning.

A feature briefly explained would be the input you have fed to the system and the label would be the output you are expecting. For example, you have fed many features of a dog like his height, fur color, etc, so after computing, it will return the breed of the dog you want to know.

Suppose you want to predict climate then features given to you would be historic climate data, current weather, temperature, wind speed, etc. and labels would be months.
The above combination can help you derive predictions.

Related

How can a machine learning model handle unseen data and unseen label?

I am trying to solve a text classification problem. I have a limited number of labels that capture the category of my text data. If the incoming text data doesn't fit any label, it is tagged as 'Other'. In the below example, I built a text classifier to classify text data as 'breakfast' or 'italian'. In the test scenario, I included couple of text data that do not fit into the labels that I used for training. This is the challenge that I'm facing. Ideally, I want the model to say - 'Other' for 'i like hiking' and 'everyone should understand maths'. How can I do this?
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
X_train = np.array(["coffee is my favorite drink",
"i like to have tea in the morning",
"i like to eat italian food for dinner",
"i had pasta at this restaurant and it was amazing",
"pizza at this restaurant is the best in nyc",
"people like italian food these days",
"i like to have bagels for breakfast",
"olive oil is commonly used in italian cooking",
"sometimes simple bread and butter works for breakfast",
"i liked spaghetti pasta at this italian restaurant"])
y_train_text = ["breakfast","breakfast","italian","italian","italian",
"italian","breakfast","italian","breakfast","italian"]
X_test = np.array(['this is an amazing italian place. i can go there every day',
'i like this place. i get great coffee and tea in the morning',
'bagels are great here',
'i like hiking',
'everyone should understand maths'])
classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB())])
classifier.fit(X_train, y_train_text)
predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)
['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
[0.52943091 0.47056909]
[0.52669142 0.47330858]
[0.42787443 0.57212557]
[0.4 0.6 ]]
I consider the 'Other' category as noise and I cannot model this category.
I think Kalsi might have suggested this but it was not clear to me. You could define a confidence threshold for your classes. If the predicted probability does not achieve the threshold for any of your classes ('italian' and 'breakfast' in your example), you were not able to classify the sample yielding the 'other' "class".
I say "class" because other is not exactly a class. You probably don't want your classifier to be good at predicting "other" so this confidence threshold might be a good approach.
You cannot do that.
You have trained the model to predict only two labels i.e., breakfast or italian. So the model doesn't have any idea about the third label or the fourth etc.
You and me know that "i like hiking" is neither breakfast nor italian. But how a model a would know that ? It only knows breakfast & italian. So there has to be a way to tell the model that: If you get confused between breakfast &italian, then predict the label as other
You can achieve this by training the model which is having other as label with some texts like "i like hiking" etc
But in your case, a little hack can be done as follows.
So what does it mean when a model predicts a label with 0.5 probability (or approximately 0.5)? It means that model is getting confused between the labels breakfast and italian. So here you can take advantage of this.
You can take all the predicted probability values & assign the label other if the probability value is between 0.45 & 0.55 . In this way you can predict the other label (obviously with some errors) without letting the model knowing there is a label called other
You can try setting class priors when creating the MultinomialNB. You could create a dummy "Other" training example, and then set the prior high enough for Other so that instances default to Other when there aren't enough evidence to select the other classes.
No, you cannot do that.
You have to define a third category "other" or whatever name that suits you and give your model some data related to that category. Make sure that number of training examples for all three categories are somewhat equal, otherwise "other" being a very broad category could skew your model towards "other" category.
Other way to approach this, is to get noun phrases from all your sentences for different categories including other and then feed into the model, consider this as a feature selection step for your machine learning model. In this way noise added by irrelevant words will be removed, better performance than tf-idf.
If you have huge data, go for deep learning models which does feature selection automatically.
Dont go with manipulating probabilities by yourself approach, 50-50% probability means that the model is confused between two classes which you have defined, it has no idea about the third "other class".
Lets say the sentence is "I want italian breakfast", the model will be confused whether this sentence belongs to "italian" or "breakfast" category but that doesnt mean it belongs to "other" category".

Estimating both the category and the magnitude of output using neural networks

Let's say I want to calculate which courses a final year student will take and which grades they will receive from the said courses. We have data of previous students'courses and grades for each year (not just the final year) to train with. We also have data of the grades and courses of the previous years for students we want to estimate the results for. I want to use a recurrent neural network with long-short term memory to solve this problem. (I know this problem can be solved by regression, but I want the neural network specifically to see if this problem can be properly solved using one)
The way I want to set up the output (label) space is by having a feature for each of the possible courses a student can take, and having a result between 0 and 1 in each of those entries to describe whether if a student will attend the class (if not, the entry for that course would be 0) and if so, what would their mark be (ie if the student attends class A and gets 57%, then the label for class A will have 0.57 in it)
Am I setting the output space properly?
If yes, what optimization and activation functions I should use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want that the network is given the history of a student, and then outputs one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking the course, 1 for taking the course), and also give the expected grade? Then the interpretation of the output for a single course would be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise to rethink it.
Some obviously realistic cases do not fit into this pattern. For example, how would you represent an (A+)-student is "unlikely" to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum amount of points if (s)he takes the course, OR should the network output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values between [0,1] for each student and each course.
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value, and simple square error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, square and add for p=2).
Few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
There is more than one way to solve this problem. Andrey's answer gives a one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French Freshman year and did not continue with French after that.

Using SVM to predict text with label

I have data in a csv file in the following format
Name Power Money
Jon Red 30
George blue 20
Tom Red 40
Bob purple 10
I consider values like "jon", "red" and "30 as inputs. Each input as a label. For instance inputs [jon,george,tom,bob] have label "name". Inputs [red,blue,purple] have label "power". This is basically how I have training data. I have bunch of values that are each mapped to a label.
Now I want to use svm to train a model based on my training data to accurately identify given a new input what is its correct label. so for instance if the input provided is "444" , the model should be smart enough to categorize it as a "Money" label.
I have installed py and also installed sklearn. I have completed the following tutorial as well. I am just not sure on how to prepare input data to train the model.
Also I am new to machine learning if i have said something that sounds wrong or odd please point it out as I will be happy to learn the correct.
With how your current question is formulated, you are not dealing with a typical machine learning problem. Currently, you have column-wise data:
Name Power Money
Jon Red 30
George blue 20
Tom Red 40
Bob purple 10
If a user now inputs "Jon", you know it is going to be type "Name", by a simple hash-map look up, e.g.,:
hashmap["Jon"] -> "Name"
The main reason people are saying it is not a machine-learning problem is your "categorisation" or "prediction" is being defined by your column names. Machine learning problems, instead (typically), will be predicting some response variable. For example, imagine instead you had asked this:
Name Power Money Bought_item
Jon Red 30 yes
George blue 20 no
Tom Red 40 no
Bob purple 10 yes
We could build a model to predict Bought_item using the features Name, Power, and Money using SVM.
Your problem would have to look more like:
Feature1 Feature2 Feature3 Category
1.0 foo bar Name
3.1 bar foo Name
23.4 abc def Money
22.22 afb dad Power
223.1 dad vxv Money
You then use Feature1, Feature2, and Feature3 to predict Category. At the moment your question does not give enough information for anyone to really understand what you need or what you have to reformulate it this way, or consider an unsupervised approach.
Edit:
So frame it this way:
Name Power Money Label
Jon Red 30 Foo
George blue 20 Bar
Tom Red 40 Foo
Bob purple 10 Bar
OneHotEncode Name and Power, so you now have a variable for each name that can be 0/1.
Standardise Money so that it has a range between, approximately, -1/1.
LabelEncode your labels so that they are 0,1,2,3,4,5,6 and so on.
Use a One vs. All classifier, http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html.

machine learning, why do we need to weight data

This my sound as very naive question. I checked on google and many YouTube videos for beginners and pretty much, all explain data weighting as something the most obvious. I still do not understand why data is being weighted.
Let's assume I have four features:
a b c d
1 2 1 4
If I pass each value to Sigmond function, I'll receive -1 >< 1 value already.
I really don't understand why data needs or it is recommended to be weighted first. If you could explain to me this in very simple manner, I would appreciate it a lot.
I think you are not talking about weighing data but features.
A feature is a column in your table and as data I would understand rows.
The confusion comes now from the fact that weighing rows is also sometimes sensible, e.g., if you want to punish misclassification of positive class more.
Why do we need to weigh features?
I assume you are talking about a modle like
prediction = sigmoid(sum_i weight_i * feature_i) > base
Let's assume you want to predict whether a person is overweight based on Bodyweight, height, and age.
In R we can generate a sample dataset as
height = rnorm(100,1.80,0.1) #normal distributed mean 1.8,variance 0.1
weight = rnorm(100,70,10)
age = runif(100,0,100)
ow = weight / (height**2)>25 #overweight if BMI > 25
data = data.frame(height,weight,age,bc,ow)
if we now plot the data you can see that at least my sample of the data can be separated with a straight line in weight/height. However, age does not provide any value. If we weight it prior to the sum/sigmoid you can put all factors into relation.
Furthermore, as you can see from the following plot the weight/height have a very different domain. Hence, they need to be put into relation, such that the line in the following plot has the right slope, as the value of weight have are one order of magnitude larger

How to use Odds ratio feature selection with Naive bayes Classifier

I want to classify documents (composed of words) into 3 classes (Positive, Negative, Unknown/Neutral). A subset of the document words become the features.
Until now, I have programmed a Naive Bayes Classifier using as a feature selector Information gain and chi-square statistics. Now, I would like to see what happens if I use Odds ratio as a feature selector.
My problem is that I don't know hot to implement Odds-ratio. Should I:
1) Calculate Odds Ratio for every word w, every class:
E.g. for w:
Prob of word as positive Pw,p = #positive docs with w/#docs
Prob of word as negative Pw,n = #negative docs with w/#docs
Prob of word as unknown Pw,u = #unknown docs with w/#docs
OR(Wi,P) = log( Pw,p*(1-Pw,p) / (Pw,n + Pw,u)*(1-(Pw,n + Pw,u)) )
OR(Wi,N) ...
OR(Wi,U) ...
2) How should I decide if I choose or not the word as a feature ?
Thanks in advance...
Since it took me a while to independently wrap my head around all this, let me explain my findings here for the benefit of humanity.
Using the (log) odds ratio is a standard technique for filtering features prior to text classification. It is a 'one-sided metric' [Zheng et al., 2004] in the sense that it only discovers features which are positively correlated with a particular class. As a log-odds-ratio for the probability of seeing a feature 't' given the class 'c', it is defined as:
LOR(t,c) = log [Pr(t|c) / (1 - Pr(t|c))] : [Pr(t|!c) / (1 - Pr(t|!c))]
= log [Pr(t|c) (1 - Pr(t|!c))] / [Pr(t|!c) (1 - Pr(t|c))]
Here I use '!c' to mean a document where the class is not c.
But how do you actually calculate Pr(t|c) and Pr(t|!c)?
One subtlety to note is that feature selection probabilities, in general, are usually defined over a document event model [McCallum & Nigam 1998, Manning et al. 2008], i.e., Pr(t|c) is the probability of seeing term t one or more times in the document given the class of the document is c (in other words, the presence of t given the class c). The maximum likelihood estimate (MLE) of this probability would be the proportion of documents of class c that contain t at least once. [Technically, this is known as a Multivariate Bernoulli event model, and is distinct from a Multinomial event model over words, which would calculate Pr(t|c) using integer word counts - see the McCallum paper or the Manning IR textbook for more details, specifically on how this applies to a Naive Bayes text classifier.]
One key to using LOR effectively is to smooth these conditional probability estimates, since, as #yura noted, rare events are problematic here (e.g., the MLE of Pr(t|!c) could be zero, leading to an infinite LOR). But how do we smooth?
In the literature, Forman reports smoothing the LOR by "adding one to any zero count in the denominator" (Forman, 2003), while Zheng et al (2004) use "ELE [Expected Likelihood Estimation] smoothing" which usually amounts to adding 0.5 to each count.
To smooth in a way that is consistent with probability theory, I follow standard practices in text classification with a Multivariate Bernoulli event model. Essentially, we assume that we have seen each presence count AND each absence count B extra times. So our estimate for Pr(t|c) can be written in terms of #(t,c): the number of times we've seen t and c, and #(t,!c): the number of times we've seen t without c, as follows:
Pr(t|c) = [#(t,c) + B] / [#(t,c) + #(t,!c) + 2B]
= [#(t,c) + B] / [#(c) + 2B]
If B = 0, we have the MLE. If B = 0.5, we have ELE. If B = 1, we have the Laplacian prior. Note this looks different than smoothing for the Multinomial event model, where the Laplacian prior leads you to add |V| in the denominator [McCallum & Nigam, 1998]
You can choose 0.5 or 1 as your smoothing value, depending on which prior work most inspires you, and plug this into the equation for LOR(t,c) above, and score all the features.
Typically, you then decide on how many features you want to use, say N, and then choose the N highest-ranked features based on the score.
In a multi-class setting, people have often used 1 vs All classifiers and thus did feature selection independently for each classifier and thus each positive class with the 1-sided metrics (Forman, 2003). However, if you want to find a unique reduced set of features that works in a multiclass setting, there are some advanced approaches in the literature (e.g. Chapelle & Keerthi, 2008).
References:
Zheng, Wu, Srihari, 2004
McCallum & Nigam 1998
Manning, Raghavan & Schütze, 2008
Forman, 2003
Chapelle & Keerthi, 2008
Odd ratio is not good measure for feature selection, because it is only shows what happen when feature present, and nothing when it is not. So it will not work for rare features and almost all features are rare so it not work for almost all features. Example feature with 100% confidence that class is positive which present in 0.0001 is useless for classification. Therefore if you still want to use odd ratio add threshold on frequency of feature, like feature present in 5% of cases. But I would recommend better approach - use Chi or info gain metrics which automatically solve those problems.

Resources