I am a newbie in machine learning and natural language processing.
I am always confused by three terms: class, feature, and parameter.
From my understanding:
Class: the various categories our model can output. For example, given the name of a person, identify whether he/she is male or female.
Let's say I am using a Naive Bayes classifier.
What would be my features and parameters?
Also, what are some aliases for the above words that are used interchangeably?
Thank you
Let's use the example of classifying the gender of a person. Your understanding of class is correct! Given an input observation, our Naive Bayes classifier should output a category. The class is that category.
Features: Features in a Naive Bayes classifier, or any general ML classification algorithm, are the data points we choose to define our input. For the example of a person, we can't possibly input all data points about a person; instead, we pick a few features to define a person (say "Height", "Weight", and "Foot Size"). Specifically, in a Naive Bayes classifier, the key assumption we make is that these features are independent (they don't affect each other): a person's height doesn't affect their weight, which doesn't affect their foot size. This assumption may or may not be true, but in Naive Bayes, we assume that it is. In the particular case of your example, where the input is just the name, features might be the frequency of letters, the number of vowels, the length of the name, or suffixes/prefixes.
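As a rough sketch of what such name features could look like in Python (the particular features chosen here are just illustrative, not a standard set):

    def name_features(name):
        """Turn a raw name into a small feature dictionary.
        These particular features are illustrative choices only."""
        name = name.lower()
        return {
            "length": len(name),
            "num_vowels": sum(ch in "aeiou" for ch in name),
            "last_letter": name[-1],
            "last_two": name[-2:],
        }

    print(name_features("Charlotte"))
    # {'length': 9, 'num_vowels': 3, 'last_letter': 'e', 'last_two': 'te'}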
Parameters: Parameters in Naive Bayes are the estimates of the true distribution of whatever we're trying to classify. For example, we could say that roughly 50% of people are male, and the distribution of male height is a Gaussian distribution with mean 5' 7" and standard deviation 3". The parameters would be the 50% estimate, the 5' 7" mean estimate, and the 3" standard deviation estimate.
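For instance, with scikit-learn's GaussianNB on the height/weight/foot-size example (a minimal sketch with made-up numbers), the fitted parameter estimates are exposed directly as attributes:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # Toy training data: [height_inches, weight_lbs, foot_size] (made-up values)
    X = np.array([[72, 180, 11], [70, 170, 10], [64, 120, 7], [62, 110, 6]])
    y = np.array(["male", "male", "female", "female"])

    model = GaussianNB().fit(X, y)

    print(model.class_prior_)  # estimated P(class), here [0.5, 0.5]
    print(model.theta_)        # per-class Gaussian mean estimate for each feature
    print(model.var_)          # per-class variance estimates (sigma_ in older scikit-learn)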
Aliases: Features are also referred to as attributes. I'm not aware of any common replacements for 'parameters'.
I hope that was helpful!
#txizzle explained the case of Naive Bayes well. In a more general sense:
Class: The output category of your data. You can also call these categories. The labels on your data will point to one of the classes (if it's a classification problem, of course).
Features: The characteristics that define your problem. These are also called attributes.
Parameters: The variables your algorithm is trying to tune to build an accurate model.
As an example, let us say you are trying to decide whether to admit a student to grad school based on various factors like his/her undergrad GPA, test scores, scores on recommendations, projects, etc. In this case, the factors mentioned above are your features/attributes; whether the student is admitted or not becomes your two classes; and the numbers that decide how these features combine to produce your output become your parameters. What the parameters actually represent depends on your algorithm. For a neural net, they are the weights on the synaptic links. Similarly, for a regression problem, the parameters are the coefficients of your features when they are combined.
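A minimal sketch of that admission example with scikit-learn (all data made up for illustration); the fitted coefficients are the parameters:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Features: [undergrad_gpa, test_score, recommendation_score] (made-up data)
    X = np.array([[3.9, 168, 5], [3.2, 150, 3], [3.7, 160, 4], [2.8, 140, 2]])
    y = np.array([1, 0, 1, 0])  # the two classes: 1 = admit, 0 = reject

    model = LogisticRegression().fit(X, y)

    # The parameters: the numbers that decide how the features combine.
    print(model.coef_, model.intercept_)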
Take a simple linear classification problem:
y = 0 if 5x - 3 >= 0, else 1
Here y is the class, x is the feature, and 5 and 3 are the parameters.
I just wanted to add a definition that distinguishes between attributes and features, as these are often used interchangeably, and it may not be correct to do so. I'm quoting 'Hands-On Machine Learning with Scikit-Learn and TensorFlow':
In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.
I like the definition in “Hands-On Machine Learning with Scikit-Learn and TensorFlow” (by Aurélien Géron), where
ATTRIBUTE = DATA TYPE (e.g., Mileage)
FEATURE = DATA TYPE + VALUE (e.g., Mileage = 50000)
Regarding FEATURE versus PARAMETER, based on the definition in Géron's book, I used to interpret FEATURE as the variable and PARAMETER as the weight or coefficient, as in the model below:
Y = a + b*X
X is the FEATURE
a, b are the PARAMETERS
However, in some publications I have seen the following interpretation:
X is the PARAMETER
a, b are the WEIGHTS
So, lately, I’ve begun to use the following definitions:
FEATURE = variables of the RAW DATA (e.g., all columns in the spreadsheet)
PARAMETER = variables used in the MODEL (i.e., after selecting the features that will be in the model)
WEIGHT = coefficients of the parameters of the MODEL
Thoughts?
Let's see if this works :)
Imagine you have an Excel spreadsheet with data about a specific product and the presence of 7 atomic elements in it.
[product] [calcium] [magnesium] [zinc] [iron] [potassium] [nitrogen] [carbon]
Features are each column except the product, because all the other columns are independent, coexist, and have a measurable impact on the target, i.e., the product. You can even choose to combine some of them into, say, "Essential Elements", i.e., dimension reduction, to make the data more suitable for analysis. (The term "dimension reduction" is used loosely here for explanation, not to be confused with the PCA technique in unsupervised learning.) Features are relevant for supervised learning techniques.
Now, imagine a cool machine that has the capability of looking at the data above and inferring what the product is.
Parameters are like levers and stopcocks specific to that machine, which you can juggle with to make sure that when the machine says "it's soap scum", it really/truly is. If you think of yourself doing dart-board practice, what would you adjust to get closer to the bullseye (balancing bias/variance)?
Hyperparameters are like parameters, BUT external to the machine we're talking about. What if the machine's parts/mechanical elements were made of a different compound, e.g., carbon fibre or a magnesium poly-alloy? How would that change what the machine can and can't do well?
I suppose it's an oversimplification of what things are, but hopefully acceptable?
Related
I am a beginner in data science and need help with a topic.
I have a data set about the customers of an institution. My goal is to first find out which customers will pay this institution, and then find out how much money the paying customers will pay.
In this context, I think I can first find out which customers will pay via "classification", and then how much they will pay by applying "regression".
So, first I want to apply "classification" and then apply "regression" to this output. How can I do that?
Sure, you can definitely apply a classification method followed by regression analysis. This is actually a common pattern during exploratory data analysis.
For your use case, based on the basic info you are sharing, I would intuitively go for 1) logistic regression and 2) multiple linear regression.
Logistic regression is actually a classification tool, even though the name suggests otherwise. In a binary logistic regression model, the dependent variable has two levels (categorical), which is what you need to predict whether your customers will pay vs. will not pay (a binary decision).
The multiple linear regression, applied to the same independent variables from your available dataset, will then give you a linear model to predict how much your customers will pay (i.e., the output of the inference will be a continuous variable: the actual expected dollar value).
That would be the approach I recommend implementing, since you are new to this field. There are obviously many other ways to define these models, based on the available data, the nature of the data, customer requirements, and so on, but the logistic + multiple regression approach should be a sure bet to get you going.
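A minimal sketch of that two-stage approach with scikit-learn; the file name and the "will_pay"/"amount" columns are hypothetical stand-ins for your actual data:

    import pandas as pd
    from sklearn.linear_model import LinearRegression, LogisticRegression

    df = pd.read_csv("customers.csv")  # hypothetical dataset
    features = [c for c in df.columns if c not in ("will_pay", "amount")]

    # Stage 1: classification -- which customers will pay?
    clf = LogisticRegression().fit(df[features], df["will_pay"])

    # Stage 2: regression -- how much will the payers pay?
    # Trained only on the customers who actually paid.
    payers = df[df["will_pay"] == 1]
    reg = LinearRegression().fit(payers[features], payers["amount"])

    # Cascade at prediction time: predicted non-payers get amount 0.
    pred_amount = reg.predict(df[features]) * clf.predict(df[features])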
Another approach would be to make it a regression-only problem, without a cascade of models, which will be simpler to handle.
For example, you could assign a spent amount of 0 to the people who are not willing to pay, and fit the model on those instances as well.
For the business side, you could then apply a threshold: if the predicted amount is under a more or less fixed threshold, you classify the user as "not willing to pay".
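A sketch of that regression-only variant (same hypothetical column names as above; the threshold value is something you would tune for the business case):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("customers.csv")  # hypothetical dataset
    features = [c for c in df.columns if c not in ("will_pay", "amount")]

    # Non-payers get amount = 0, so a single regression covers everyone.
    df["amount"] = df["amount"].where(df["will_pay"] == 1, 0)
    reg = LinearRegression().fit(df[features], df["amount"])

    THRESHOLD = 10.0  # tune for the business case
    pred = reg.predict(df[features])
    will_pay = pred >= THRESHOLD  # below the threshold -> "not willing to pay"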
Of course you can do it by vertically stacking models. Assuming you are using binary classification, after prediction you will have a dataframe with target values 0 and 1. Filter where target == 1, create a new dataframe, and then run the regression on it.
Also, rather than classification, you can use clustering if you don't have labels, since the cost is lower.
When I look at machine learning, especially classification, I find that most algorithms (for example, decision trees) are designed to classify without the consideration described next:
For a two-category problem, with categories A and B, people may be interested in one special category, say A. Assume we have 100 instances of A and 1000 of B. A good classifier might produce one part that mixes 100 A with 100 B and leave 900 B in the other part. That is good from a classification standpoint. But is there an algorithm that can instead put, say, 50 A and 5 B into one part, and 50 A and 995 B into the other? That may not look as good as classification, but if someone is interested in category A, the first part is much purer in A, so for them it is better.
In short, is there an algorithm that can purify a special category, rather than classifying all categories without bias?
If scikit-learn includes such an algorithm, even better.
Look into a matching algorithm such as the "Stable Marriage Problem."
https://en.wikipedia.org/wiki/Stable_marriage_problem
If I understand you correctly, I think you're asking for a machine learning algorithm that gives a higher weight to certain classes and is therefore proportionally more likely to predict those "special" classes.
If that's what you're asking, you could use any algorithm that outputs a probability for each class during prediction. Most algorithms actually take that approach, and neural nets in particular do. Then you can either train the network on proportionally more data from the "special" classes, or manually post-process the prediction output (the array of per-class probabilities) to adapt the probabilities to your specification.
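A hedged sketch of that post-processing idea: multiply the predicted class probabilities by per-class preference weights and renormalize (the weight values here are arbitrary illustrations):

    import numpy as np

    def reweight_probs(probs, class_weights):
        """Bias predicted class probabilities toward "special" classes.
        probs: (n_samples, n_classes) array, e.g. from model.predict_proba(X)
        class_weights: per-class multipliers; > 1 favors that class."""
        adjusted = probs * np.asarray(class_weights)
        return adjusted / adjusted.sum(axis=1, keepdims=True)

    probs = np.array([[0.3, 0.7]])            # model output for one sample
    print(reweight_probs(probs, [3.0, 1.0]))  # class 0 is the "special" one
    # [[0.5625 0.4375]] -> the prediction flips toward the special class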
I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The predictors (independent variables) are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with very high accuracy (more than 90%), while removing it drops the accuracy to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you're asking whether a feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon occurs only in the training set and not in real data. But in that case you have the wrong data, which does not represent the underlying data density, and you should gather better data or "clean" the current data so it has characteristics analogous to the "real" data.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm on will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
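A short sketch of that ranking (the toy corpus below is invented, standing in for your paper phrases; with real data you would reuse your fitted model):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy corpus standing in for the paper phrases (made-up data).
    docs = ["paper accepted on march", "novel method strong results",
            "weak results rejected", "method unclear rejected"]
    labels = [1, 1, 0, 0]  # 1 = accepted, 0 = rejected

    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(docs)
    clf = LogisticRegression().fit(X, labels)

    # Rank features by their weight in the logistic regression parameter vector.
    names = vec.get_feature_names_out()
    order = np.argsort(clf.coef_[0])
    print("Most indicative of rejection:", names[order[:5]])
    print("Most indicative of acceptance:", names[order[-5:]])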
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
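And a compact sketch of that greedy elimination loop, using cross-validation in place of a single held-out set (scikit-learn's RFE automates a similar idea):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    base = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    for j in range(X.shape[1]):
        X_minus = np.delete(X, j, axis=1)  # drop feature j
        acc = cross_val_score(LogisticRegression(), X_minus, y, cv=5).mean()
        # A big drop means feature j was strongly predictive of the class.
        print(f"feature {j}: accuracy drop = {base - acc:.3f}")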
I have 20 attributes and one target feature. All the attributes are binary (present or not present) and the target feature is multinomial (5 classes).
But for each instance, apart from the presence of some attributes, I also have information about how much effect (on a scale of 1-5) each present attribute had on the target feature.
How do I make use of this extra information to build a classification model that gives better predictions for the test instances?
Why not just use the weights as the features, instead of a binary presence indicator? You can code the absence of an attribute as a 0 on the continuous scale.
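A tiny sketch of that encoding for one instance (all values invented):

    import numpy as np

    present = np.array([1, 0, 1, 1, 0])  # binary presence of 5 attributes
    effect = np.array([4, 0, 2, 5, 0])   # 1-5 effect scores; absent -> 0

    # Feed the weighted values to the classifier instead of the 0/1 indicators.
    x = present * effect  # -> [4, 0, 2, 5, 0]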
EDIT:
The classifier you choose will learn optimal weights on the features during training to separate the classes... thus I don't believe you can do any better if you do not have access to the weights at test time. Essentially, a linear classifier learns a rule of the form:
c_i = sgn(w . x_i)
You're saying you have access to weights, but without an example of what the data look like, and an explanation of where the weights come from, I'd have to say I don't see how you'd use them (or even why you'd want to: is standard classification with binary features not working well enough?)
This clearly depends on the actual algorithms that you are using.
For decision trees, the information is useless. They are meant to learn which attributes have how much effect.
Similarly, support vector machines will learn the best linear split, so any kind of weight will disappear since the SVM already learns this automatically.
However, if you are doing NN classification, just scale the attributes as desired, to emphasize differences in the influential attributes.
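If "NN" here means nearest-neighbour classification (where input scaling directly changes the distance computation), a minimal sketch of that idea with made-up numbers:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]], dtype=float)  # binary attributes
    y = np.array([0, 1, 0])
    importance = np.array([5.0, 1.0, 3.0])  # per-attribute 1-5 effect scores

    # Scale each attribute by its importance so influential attributes
    # dominate the distance computation.
    knn = KNeighborsClassifier(n_neighbors=1).fit(X * importance, y)
    print(knn.predict(np.array([[1, 0, 0]]) * importance))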
Sorry, you need to look at other algorithms yourself. There are just too many.
Use the knowledge as a prior over the feature weights. You can then compute the posterior estimate from the data and obtain the final model.
I want to do text classification based on the keywords that appear in the text, because I do not have sample data to train Naive Bayes for text classification.
Example:
my document has a few words such as "family, mother, father, children, ...", so the category of the document is family. Or it has "football, tennis, score, ...", so the category is sport.
What is the best algorithm in this case? And is there a Java API for this problem?
What you have are feature labels, i.e., labels on features rather than instances. There are a few methods for exploiting these, but usually it is assumed that one has instance labels (i.e., labels on documents) in addition to feature labels. This paradigm is referred to as dual-supervision.
Anyway, I know of at least two ways to learn from labeled features alone. The first is Generalized Expectation Criteria, which penalizes model parameters for diverging from a priori beliefs (e.g., that "mother" ought usually to correlate with "family"). This method has the disadvantage of being somewhat complex, but the advantage of having a nicely packaged, open-source Java implementation in the Mallet toolkit (see here, specifically).
A second option would basically be to use Naive Bayes and give large priors to the known word/class associations, e.g., P("family" | "mother") = 0.8, or whatever. All unlabeled words would be assigned some prior, presumably reflecting the class distribution. You would then effectively be making decisions based only on the prevalence of classes and the labeled term information. Settles proposed a model like this recently, and there is a web tool available.
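As a very rough sketch of this second idea (not the Settles model itself; all seed words and probabilities below are invented):

    import math

    # Hand-set P(class | word) for a few seed words (invented numbers).
    seed = {
        "family": {"family": 0.8, "mother": 0.8, "father": 0.8, "children": 0.7},
        "sport":  {"football": 0.8, "tennis": 0.8, "score": 0.7},
    }
    classes = list(seed)
    default = 1.0 / len(classes)  # unlabeled words are uninformative

    def classify(text):
        words = text.lower().split()
        # Sum log "votes": seed words pull toward their class,
        # unknown words contribute the uniform default.
        scores = {c: sum(math.log(seed[c].get(w, default)) for w in words)
                  for c in classes}
        return max(scores, key=scores.get)

    print(classify("mother and father took the children to play"))  # -> family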
You will likely need an auxiliary data set for this. You cannot rely on your data set to convey the information that "dad", "father", and "husband" have a similar meaning.
You can try to mine for co-occurrences to detect near-synonyms, but this is not very reliable.
WordNet and similar resources are probably a good place to disambiguate such words.
You can download the Freebase topic collection: http://wiki.freebase.com/wiki/Topic_API.