difference between classification and regression in k-nearest neighbor? [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
what is the difference between using K-nearest neighbor in classification and using it in regression?
and when using KNN in recommendation system. Does it concerned as classification or as regression?

In classification tasks, the user seeks to predict a category, which is usually represented as an integer label, but represents a category of "things". For instance, you could try to classify pictures between "cat" and "dog" and use label 0 for "cat" and 1 for "dog".
The KNN algorithm for classification will look at the k nearest neighbours of the input you are trying to make a prediction on. It will then output the most frequent label among those k examples.
In regression tasks, the user wants to output a numerical value (usually continuous). It may be for instance estimate the price of a house, or give an evaluation of how good a movie is.
In this case, the KNN algorithm would collect the values associated with the k closest examples from the one you want to make a prediction on and aggregate them to output a single value. usually, you would choose the average of the k values of the neighbours, but you could choose the median or a weighted average (or actually anything that makes sense to you for the task at hand).
For your specific problem, you could use both but regression makes more sense to me in order to predict some kind of a "matching percentage ""between the user and the thing you want to recommand to him.

Related

What are the performance metrics for Clustering Algorithms? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Improve this question
I'm working on Kmeans clustering but unlike supervised learning I cannot figure the performance metrics for clustering algorithms. How to perform the accuracy after training the data?
For kmeans you can find the inertia_ of it. Which can give you an idea how well kmeans algorithm has worked.
kmeans = KMeans(...)
# Assuming you already have fitted data on it.
kmeans.inertia_ # lesser is better
Or, alternatively if you call score() function, which will give you the same but the sign will be negative. As we assume bigger score means better but for kmeans lesser inertia_ is better. So, to make them consistent an extra negation is applied on it.
# Call score with data X
kmeans.score(X) # greater is better
This is the very basic form of analyzing performance of kmeans. In reality if you take the number of clusters too high the score() will increase accordingly (in other words inertia_ will decrease), because inertia_ is nothing but the summation of the squared distances from each point to its corresponding cluster's centroid to which cluster it is assigned to. So if you increase the number of the clusters too much, the overall distances' squared summation will decrease as each point will get a centroid very near to it. Although, the quality of clustering is horrible in this case. So, for better analysis you should find out silhouette score or even better use silhouette diagram in this case.
You will find all of the implementations in this notebook: 09_unsupervised_learning.ipynb
The book corresponding to this repository is: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It is a great book to learn all of these details.

How can I get the right balance between classification loss and a regularizer? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I'm working on a deep learning classifier (Keras and Python) that classifies time series into three categories. The loss function that I'm using is the standard categorical cross-entropy. In addition to this, I also have an attention map which is being learnt within the same model.
I would like this attention map to be as small as possible, so I'm using a regularizer. Here comes the problem: how do I set the right regularization parameter? What I want is the network to reach its maximum classification accuracy first, and then starts minimising the intensity attention map. For this reason, I train my model once without regulariser and a second time with the regulariser on. However, if the regulariser parameter (lambda) is too high, the network loses completely accuracy and only minimises the attention, while if the regulariser is too small, the network only cares about the classification error and won't minimise the attention, even when the accuracy is already the maximum.
Is there a smarter way to combine the categorical cross-entropy with the regulariser? Maybe something that considers the variation of categorical cross-entropy in time, and if it doesn't go down for, say N iterations, it only considers the regulariser?
Thank you
Regularisation is a way to fight with overfitting. So, you should understood if your model overfits. A simple way to do it: you can compare f1 score for train and test. If f1 score for train is high and for test is low, seems, you have overfitting - so you need add some regularisation.

How to round a prediction when it should be a (non-categorical) integer? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 4 years ago.
Improve this question
Say I am trying to predict a variable y which is a score from 0 to 10 (integer numbers only), and I am using a linear regression model. The model actually produces real numbers in that interval.
I am using regression, and not classification, because I want to be able to say that missing the correct prediction by (say) 2 is worse than missing it by 1. Currently I am using the average absolute error as the evaluation metric.
Given the prediction from the model is a real number, what is the best way to constraint it to be in the allowed set of integers (from 0 to 10)? Should I just round the prediction to the nearest integer, or any better way?
You can also use a multinomial logistic regression model and one can go for classification accuracy for the measure of the performance of the model.
Have a range from 0 to 11, and round to the nearest .5 number. This gives you 10 evenly spaced, equally sized categories. If you can, weigh the regression on how close it was to the .5 mark, as the results should ideally not be close enough to the boundary to cause ambiguity.
Alternatively, have a range from -0.5 to 10.5 and use the integers as the target. It makes no difference but is compatible with your existing network.

Difference between Regression and classification in Machine Learning? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am new to Machine Learning. Can anyone tell me the major difference between classification and regression in machine learning?
Regression aims to predict a continuous output value. For example, say that you are trying to predict the revenue of a certain brand as a function of many input parameters. A regression model would literally be a function which can output potentially any revenue number based on certain inputs. It could even output revenue numbers which never appeared anywhere in your training set.
Classification aims to predict which class (a discrete integer or categorical label) the input corresponds to. e.g. let us say that you had divided the sales into Low and High sales, and you were trying to build a model which could predict Low or High sales (binary/two-class classication). The inputs might even be the same as before, but the output would be different. In the case of classification, your model would output either "Low" or "High," and in theory every input would generate only one of these two responses.
(This answer is true for any machine learning method; my personal experience has been with random forests and decision trees).
Regression - the output variable takes continuous values.
Example :Given a picture of a person, we have to predict their age on the basis of the given picture
Classification - the output variable takes class labels.
Example: Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
I am a beginner in Machine Learning field but as far as i know, regression is for "continuous values" and classification is for "discrete values". With regression, there is a line for your continuous value and you can see that your model is good or bad fit. On the other hand, you can see how discrete values gains some meaning "discretely" with classification. If i am wrong please feel free to make correction.

Best way to detect features based on text [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I have a "simple" problem: I have text sections and based on this it should be decided its whether "Category A" or "Category B".
As training data I have classified sections of text, which the algorithm can be trained.
The text sections look something like this:
Category A
a blue car drives
or
the blue bus stops
or
the blue bike drives
Category B
a red bike drives
or
the red bus stops
(The section text contains up to 20 words and the vary is massive)
If I have trained the algorithm with this example data, it should decide if text contains "blue" it's Categorie A, if its contains "red" it's Categorie B and so on.
The algorithm should learn based on training data if the frequency of a word is it likely more Category A or B.
Whats the best way to do this and which tool should I use?
You can try Fisher method, in which the probability of both positive (A) and negative (B) for each feature word (red, blue) in the document are calculated. The probability that a sentence with each of the two specified word (red, blue) belongs in the specified category (A, B) is obtained, assuming there will be an equal number of items in each category. Then a combined probability is obtained.
Since the features are not independent, this won’t be a real probability, but it works much like the Bayesian classifier. The value returned by the Fisher method is a much better estimate of probability, which can be very useful when reporting results or deciding cutoffs.
I think a fisrt try should be Logistic Regression as you have a binary classification problem. As soon as you have defined you caracteristic vector (e.g., frequency of a set of determined words) you can optimize the parameters of cost function that is for binary classification (e.g, the sigmoid funcion ).
An step that you probably will need is to eliminate 'stop words'.
I really recommend the Coursera Machine Learning classes.

Resources