Classification vs Regression? - machine-learning

I am not quite sure what the differences are between classification and regression.
From what I understand, classification is something categorical: it's either this or it's that.
Regression is more of a prediction.
Wouldn't both of the problems above be regression problems, then? Both use a learning algorithm to predict. Could anyone give an example of classification vs regression?

You are correct: given some data point, classification assigns a label (or 'class') to that point. This label is, as you said, categorical. One example might be, say, malware classification: given some file, is it malware or is it not? (The "label" will be the answer to this question: 'yes' or 'no'.)
But in regression, the goal is instead to predict a real value (i.e. not categorical). An example here might be, given someone's height and age, predict their weight.
So in either of the questions you've quoted, the answer comes down to what you are trying to get out of your prediction: a category, or a real value?
(A side note: there are connections and relations between the two problems, and you could, if you wanted, see regression as an extension of classification to the case where the labels are ordinal and there are infinitely many of them.)
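To make the distinction concrete, here is a minimal sketch using scikit-learn; the height/age numbers and the labels are made up purely for illustration and are not from the answer above. The point is only that a classifier returns a category while a regressor returns a real value.

```python
# A minimal sketch: the data below is invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Features: [height_cm, age_years]
X = np.array([[170, 25], [180, 30], [160, 22], [175, 40], [165, 35]])

# Classification: the target is a categorical label (here 0 = 'no', 1 = 'yes').
y_label = np.array([0, 1, 0, 1, 0])
clf = LogisticRegression().fit(X, y_label)
print(clf.predict([[172, 28]]))   # -> a class, e.g. array([0])

# Regression: the target is a real value (here weight in kg).
y_value = np.array([65.0, 82.0, 55.0, 78.0, 60.0])
reg = LinearRegression().fit(X, y_value)
print(reg.predict([[172, 28]]))   # -> a continuous number
```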

1. Classification is the process of organizing data into categories for its most effective and efficient use, whereas regression is the process of identifying a relationship and the effect of that relationship on the future value of an outcome.
2. Classification is used to predict categorical labels, whereas regression is used to predict numerical (continuous) values.

Classification example:
Predicting whether a share of a company is good to buy or not, given the company's previous history along with buyers' reviews saying yes or no to buying the share. (Discrete answer: Buy - Yes/No)
Regression example:
Predicting the best price at which one should buy a company's share, given the company's previous history along with the prices at which buyers bought the share in the past. (Continuous answer: a price range)
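A hedged sketch of these two share examples with the same features feeding either task; the feature columns and numbers are invented for illustration only.

```python
# Sketch only: invented data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Features: [last_quarter_return_pct, fraction_of_positive_buyer_reviews]
X = np.array([[5.0, 0.8], [-2.0, 0.3], [3.5, 0.9], [-4.0, 0.2], [1.0, 0.6]])

# Classification target: buy the share? (1 = yes, 0 = no) -> discrete answer
buy = np.array([1, 0, 1, 0, 1])
print(DecisionTreeClassifier().fit(X, buy).predict([[2.0, 0.7]]))

# Regression target: a good purchase price -> continuous answer
price = np.array([102.5, 88.0, 110.0, 80.0, 95.0])
print(DecisionTreeRegressor().fit(X, price).predict([[2.0, 0.7]]))
```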

Related

Tree vs Regression algorithm- which works better for a model with mostly categorical features?

I'm working on a regression problem to predict the selling price of a product. The features are a 4-level product hierarchy and a proposed price. In summary, there are 4 categorical features and one numerical feature. There are about 1000K rows in total.
I think a decision tree or random forest would work better than regression in this scenario. The reasoning is that there is only one numerical feature. Also, I plan to convert the numerical feature (proposed price) into price buckets, making it another categorical feature.
Does my reasoning make sense? Are there any other algorithms that might be worth trying? Is there any other clever feature engineering worth trying?
Note 1: This is actually a challenge problem (like Kaggle), so the features have been masked and encoded. Looking at the data, I can say for sure that there is a 4-level product hierarchy, but I'm not so sure about the one numerical feature (which I think is the proposed price), because in some scenarios there is a big difference between this number and the sold price (the y variable). Also, there are a lot of outliers (probably introduced deliberately to confuse) in this column.
I would not recommend binning the proposed price variable, as one would expect that variable to carry most of the information needed to predict the selling price. Binning is advantageous when the variable is noisy, but it comes at a cost, since you throw away valuable information. You do not have to bin your continuous variable: trees will do it for you (and random forests likewise). If your categorical variables are ordinal you do not have to do anything; if they are not, you may consider encoding them (mapping the distinct values to, say, one-hot vectors such as 0,0,1) and trying other regressors that way, such as SVR from https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html (in that case you may consider scaling the variables to [0,1]).
edit: RFs are better than single trees in general; just make sure you know what you're doing, and that you understand an RF is many trees ensembled together.
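A minimal sketch of the encoding-plus-regressor idea, assuming a toy dataframe with four categorical hierarchy columns and one numeric proposed price (all column names and values invented here, not the challenge data):

```python
# Sketch only: column names and data are invented for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVR

df = pd.DataFrame({
    "cat1": ["A", "A", "B", "B"], "cat2": ["x", "y", "x", "y"],
    "cat3": ["p", "p", "q", "q"], "cat4": ["m", "n", "m", "n"],
    "proposed_price": [10.0, 12.0, 9.5, 11.0],
    "sold_price": [11.0, 12.5, 9.0, 11.5],
})
X, y = df.drop(columns="sold_price"), df["sold_price"]
cat_cols = ["cat1", "cat2", "cat3", "cat4"]

# Random forest: one-hot the categoricals, leave the raw price unbinned.
rf = make_pipeline(
    ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
                      remainder="passthrough"),
    RandomForestRegressor(n_estimators=100, random_state=0),
).fit(X, y)

# SVR: same encoding, but scale the remaining numeric column to [0, 1].
svr = make_pipeline(
    ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
                      remainder=MinMaxScaler()),
    SVR(),
).fit(X, y)
```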

Can I apply "classification" first and then "regression" to the same data set?

I am a beginner in data science and need help with a topic.
I have a data set about the customers of an institution. My goal is to first find out which customers will pay to this institution and then find out how much money the paying customers will pay.
In this context, I think that I can first find out which customers will pay by "classification" and then how much they will pay by applying "regression".
So, first I want to apply "classification" and then apply "regression" to this output. How can I do that?
Sure, you can definitely apply a classification method followed by regression analysis. This is actually a common modeling pattern.
For your use case, based on the basic info you are sharing, I would intuitively go for 1) logistic regression and 2) multiple linear regression.
Logistic regression is actually a classification tool, even though the name suggests otherwise. In a binary logistic regression model, the dependent variable has two levels (categorical), which is exactly what you need to predict whether your customers will pay vs. will not pay (a binary decision).
Multiple linear regression, applied to the same independent variables from your available dataset, will then give you a linear model to predict how much your customers will pay (i.e. the output of the inference will be a continuous variable, the actual expected dollar value).
That is the approach I would recommend implementing, since you are new to this field. There are obviously many other ways to define these models, depending on the available data, the nature of the data, customer requirements and so on, but the logistic + multiple regression approach should be a sure bet to get you going.
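A minimal sketch of this two-stage approach, assuming made-up customer features (income in thousands, months as a customer); none of this comes from the question's dataset.

```python
# Sketch only: invented customer data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[30, 12], [55, 3], [42, 24], [61, 8], [25, 6]])
paid = np.array([0, 1, 1, 1, 0])                      # did the customer pay?
amount = np.array([0.0, 120.0, 80.0, 150.0, 0.0])     # how much they paid

# Stage 1: logistic regression decides who will pay.
clf = LogisticRegression().fit(X, paid)

# Stage 2: linear regression, trained only on customers who actually paid,
# estimates how much they will pay.
reg = LinearRegression().fit(X[paid == 1], amount[paid == 1])

# Inference: apply the regressor only to customers flagged as payers.
X_new = np.array([[48, 10], [27, 2]])
will_pay = clf.predict(X_new).astype(bool)
expected = np.zeros(len(X_new))
if will_pay.any():
    expected[will_pay] = reg.predict(X_new[will_pay])
print(expected)
```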
Another approach would be to make it a pure regression problem, without a cascade of models, which is simpler to handle.
For example, you could assign the value 0 as the spent amount for the people who are not willing to pay, and fit the model on those instances as well.
For the business, you could then apply a threshold: if the predicted amount is under a more or less fixed threshold, you classify the user as "not willing to pay". A sketch of this follows below.
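A hedged sketch of this regression-only alternative: non-payers get a spent amount of 0, a single regressor is fit on everyone, and a business threshold turns the prediction back into a pay / no-pay decision. The data and the threshold value are invented.

```python
# Sketch only: invented data and an assumed business cut-off.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[30, 12], [55, 3], [42, 24], [61, 8], [25, 6]])
amount = np.array([0.0, 120.0, 80.0, 150.0, 0.0])   # 0 for non-payers

reg = RandomForestRegressor(random_state=0).fit(X, amount)

predicted = reg.predict(np.array([[48, 10], [27, 2]]))
THRESHOLD = 10.0                 # assumed cut-off, not from the answer
will_pay = predicted >= THRESHOLD
print(predicted, will_pay)
```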
Of course you can do it by vertically stacking models. Assuming you are using binary classification, after prediction you will have a dataframe with target values 0 and 1. Filter the rows where target == 1 into a new dataframe, then run the regression on it.
Also, rather than classification, you could use clustering if you don't have labels, since the cost is lower.

How do I create a feature vector if I don’t have all the data?

So say for each of my ‘things’ to classify I have:
{house, flat, bungalow, electricityHeated, gasHeated, ... }
Which would be made into a feature vector:
{1,0,0,1,0,...} which would mean a house that is heated by electricity.
For my training data I would have all of this data, but for the actual thing I want to classify I might only have what kind of house it is and a couple of other things, i.e. not all the data:
{1,0,0,?,?,...}
So how would I represent this?
I would want to find the probability that a new item would be gasHeated.
I would be using an SVM linear classifier. I don't have any code to show because this is purely theoretical at the moment. Any help would be appreciated :)
When I read this question, it seems that you may have confused features with labels.
You said that you want to predict whether a new item is "gasHeated", then "gasHeated" should be a label rather than a feature.
By the way, one of the most common ways to deal with a missing value is to set it to zero (or some unused value, say -1). But normally you should have missing values in both the training data and the testing data for this trick to be effective. If they only appear in your testing data and not in your training data, it means that your training data and testing data are not from the same distribution, which violates a basic assumption of machine learning.
Let's say you have a trained model and a testing sample {?,0,0,0}. Then you can create two new testing samples, {1,0,0,0}, {0,0,0,0}, and you will have two predictions.
I personally don't think an SVM is a good approach if you have missing values in your testing dataset. As mentioned above, you can get two new predictions, but what if they disagree? In my opinion it is difficult to assign a probability to the results of an SVM, unless you use logistic regression or Naive Bayes. I would prefer a Random Forest in this situation.
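A hedged sketch of the points above: a sentinel value (-1) marks "unknown" in both training and test data, gasHeated is treated as the label rather than a feature, and a random forest's predict_proba gives the probability of gasHeated. All rows here are invented for illustration.

```python
# Sketch only: invented data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Features: [house, flat, bungalow, electricityHeated]   (-1 = value unknown)
X_train = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, -1],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
])
y_train = np.array([0, 1, 1, 1, 0])   # label: gasHeated yes/no

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Test sample where electricityHeated is unknown, encoded with the same -1.
x_test = np.array([[1, 0, 0, -1]])
print(clf.predict_proba(x_test)[:, 1])   # estimated probability of gasHeated
```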

Can we use Logistic Regression to predict a numerical (continuous) variable, i.e. the revenue of a restaurant?

I have been given a task to predict the revenue of a restaurant based on some variables. Can I use logistic regression to predict the revenue?
The dataset is from the Kaggle Restaurant Revenue Prediction project.
PS: I have been told to use logistic regression; I know it is not the correct algorithm for this problem.
Yes... you can!
Prediction using logistic regression can be done with numerical variables. The data you have right now contains all the independent variables, and the outcome will be dichotomous (a dependent variable having the value TRUE/1 or FALSE/0).
You can then use the log odds to obtain a probability (range 0-1).
For a reference, you can have a look at this.
-------------------UPDATE-------------------------------
Let me give you an example from my work last year: we had to predict whether a student would qualify in campus placement, given three years of historical test results and the students' final success or failure. (NOTE: this is dichotomous; more on that later.)
The sample data was the students' marks in academics and in an aptitude test held at the college, plus their status as placed or not.
But in your case, you have to predict the revenue (which is NOT dichotomous). So what to do? It seems that my case was simple, right?
Nope!
We were not asked just to predict whether a student would qualify or not; we were asked to predict each individual student's chances of getting placed, which is not at all dichotomous. Looks like your scenario, right?
So, what you can do is first work out, for your input variables, which dichotomous output variable you can define (one that will help in the revenue calculation).
For example: use the data to find out whether the restaurant will make a profit or a loss, then combine that with some other algorithm to arrive at an approximate revenue prediction.
I'm not sure whether an algorithm identical to your need already exists, but I'm sure you can do much better by putting more effort into research and analysis on this topic.
TIP: never think "Will logistic regression alone solve my problem?" Rather, ask "What can logistic regression do better if used together with some other technique?"
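A hedged sketch of the combination hinted at above, not the Kaggle solution: logistic regression gives the probability of a dichotomous outcome (derived here as "profitable or not"), while a separate regressor is still needed for the continuous revenue. The features, the cut-off and all numbers are invented.

```python
# Sketch only: invented restaurant data, not the Kaggle dataset.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Features per restaurant, e.g. [city_population_in_100k, number_of_seats]
X = np.array([[12, 80], [3, 40], [25, 120], [6, 60], [18, 90]])
revenue = np.array([4.2, 1.1, 7.5, 2.0, 5.3])         # made up, in millions

profitable = (revenue > 3.0).astype(int)              # arbitrary cut-off

clf = LogisticRegression().fit(X, profitable)
print(clf.predict_proba([[10, 70]])[:, 1])            # P(profitable), in [0, 1]

# The actual revenue figure still needs a regression model:
reg = LinearRegression().fit(X, revenue)
print(reg.predict([[10, 70]]))                        # continuous revenue estimate
```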

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The independent variables, or predictors, are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with very high accuracy (more than 90%), while removing it results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you ask whether some feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon occurs only in the training set and not in real data. But in that case you have the wrong data, which does not represent the underlying data distribution, and you should gather better data or "clean" the current data so that it has characteristics analogous to the "real" ones.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm on will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
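A hedged sketch of ranking n-gram features by their logistic-regression weights; the toy "papers" and labels below are invented for illustration.

```python
# Sketch only: invented documents and labels.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["novel method improves results", "paper accepted on revision",
        "results were inconclusive", "method rejected by reviewers"]
labels = np.array([1, 1, 0, 0])                  # 1 = accepted, 0 = rejected

vec = CountVectorizer(ngram_range=(1, 3))        # unigrams, bigrams, trigrams
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Large positive weights point to 'accepted', large negative to 'rejected'.
order = np.argsort(clf.coef_[0])
features = vec.get_feature_names_out()
print("most 'rejected'-like:", features[order[:3]])
print("most 'accepted'-like:", features[order[-3:]])
```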
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
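And a hedged sketch of the greedy leave-one-feature-out elimination described above, on random synthetic data (feature 0 is made predictive on purpose). As noted, this does not scale well to large feature sets.

```python
# Sketch only: synthetic data, small feature count.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

base = LogisticRegression().fit(X_tr, y_tr).score(X_val, y_val)
for j in range(X.shape[1]):
    keep = [k for k in range(X.shape[1]) if k != j]
    acc = LogisticRegression().fit(X_tr[:, keep], y_tr).score(X_val[:, keep], y_val)
    print(f"without feature {j}: accuracy change {acc - base:+.3f}")
# The feature whose removal causes the biggest drop is the strongest predictor.
```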
