This is my homework. I'm not asking you to do it for me here; I just need a hint to keep going.
I know what the k-nearest-neighbors algorithm is, but I have always seen it applied on graphs, not like this. Can you tell me what I should do? I've been trying to figure out how to start, but I couldn't. I would appreciate a small hint.
This assignment helps you understand the steps in KNN.
KNN is based on distances: find the K nearest neighbors and then, for a classification problem, let them vote.
Your training data can be viewed as (x1, x2, y): age and profit are the features (x1, x2), while BUY or NOT BUY is the label/output y.
To apply KNN you need to calculate distances, which are based on the features. Since the two features have different units (years, USD), you should convert them into unit-free features; this is called normalization, part 4.1 in your handout. After that, a feature vector will look like (-0.4, -0.8). The numbers should be between -1 and 0 if the formula suggested in part 4.1 is used.
Then use the normalized features to calculate the distances (Euclidean in the handout) between every training data point and the company you are interested in (normalized as well). This is required in 4.2.
The last step is to pick the K nearest neighbors and decide BUY or NOT BUY from the outputs of those neighbors (a simple vote, maybe?).
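If it helps to see the steps end to end, here is a minimal Python sketch of the whole pipeline. The training rows, the min-max normalization, and K=3 are placeholders I made up; swap in the exact normalization formula and data from part 4.1 of your handout.

import math

# Toy training data: (age_in_years, profit_in_usd, label) -- values are
# made up for illustration; substitute the table from your handout.
train = [
    (10, 50_000, "BUY"),
    (3,  12_000, "NOT BUY"),
    (25, 80_000, "BUY"),
    (7,  20_000, "NOT BUY"),
]
query = (8, 30_000)  # the company you want to classify
K = 3

# 4.1 Normalization: here a simple min-max rescaling per feature.
# Your handout's formula may differ (e.g. it may map values into [-1, 0]);
# use whatever formula part 4.1 prescribes.
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages    = min_max([r[0] for r in train] + [query[0]])
profits = min_max([r[1] for r in train] + [query[1]])
norm_train = list(zip(ages[:-1], profits[:-1], [r[2] for r in train]))
norm_query = (ages[-1], profits[-1])

# 4.2 Euclidean distance on the normalized features.
def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

neighbors = sorted(norm_train, key=lambda r: euclidean(r, norm_query))[:K]

# Final step: simple majority vote among the K nearest neighbors.
votes = [label for _, _, label in neighbors]
print(max(set(votes), key=votes.count))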
I have features like Amount (20-100K$), Percent1 (0-1), and Percent2 (0-1). The Amount values range from 20 to 100,000 US dollars, and the percent columns hold decimals between 0 and 1. These features are positively skewed, so I applied a log transformation to Amount and Yeo-Johnson (via PowerTransformer) to the Percent1 and Percent2 columns.
Is it right to apply different transformations to different columns? Will it affect model performance, or should I apply the same transformation to all columns?
There are some things that need to be known before we can answer the question.
The answer depends on the model you are using. In some models it's better if the ranges of the different inputs are the same; some models are agnostic to that. And of course, sometimes one might also want to assign different priorities to the inputs.
To get back to your question: depending on the model, there might be absolutely no harm in applying different transformations, or there could be performance differences.
For example, linear regression models would be greatly affected by the feature transformation, whereas supervised neural networks most likely wouldn't.
You might want to check this Stack Exchange question about the benefits of transformation: https://stats.stackexchange.com/questions/397227/why-feature-transformation-is-needed-in-machine-learning-statistics-doesnt-i
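As for the mechanics of applying a different transformation per column, here is a rough scikit-learn sketch using ColumnTransformer; the data frame and its values are made up, and whether this helps at all depends on your model, as noted above.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, FunctionTransformer

# Hypothetical data frame with the columns described in the question.
df = pd.DataFrame({
    "Amount":   [25_000, 40_000, 99_000, 62_000],
    "Percent1": [0.10, 0.35, 0.80, 0.55],
    "Percent2": [0.05, 0.60, 0.20, 0.90],
})

# Different transformation per column: log on Amount,
# Yeo-Johnson on the two percentage columns.
preprocess = ColumnTransformer([
    ("log_amount", FunctionTransformer(np.log1p), ["Amount"]),
    ("yeo_johnson", PowerTransformer(method="yeo-johnson"), ["Percent1", "Percent2"]),
])

X = preprocess.fit_transform(df)
print(X)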
It's about understanding the benefits of transformation. Say we have an equation like f(x1, x2) = w1*x1 + w2*x2, where x1 is on the order of 100,000 (like the amount) and x2 is on the order of 1.0 (like the percent). When you update the weights with gradient descent, the updates look roughly like

w1 = w1 - lr * error * x1
w2 = w2 - lr * error * x2

where lr is the learning rate and error is the prediction error. Because each weight's update is proportional to its feature value, w1 gets updated about 100,000 times faster than w2; effectively you are saying the amount feature matters far more than the percent feature. That's why one usually transforms the features onto a comparable scale, so that no feature dominates just because of its units.
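Here is a small numpy illustration of that point, with made-up numbers: the gradient component for the dollar-scale feature dwarfs the one for the percentage, so the unscaled feature dominates the weight updates.

import numpy as np

# One training example with raw (unscaled) features like the question's:
# x1 ~ amount in dollars, x2 ~ a percentage. Numbers are illustrative.
x = np.array([100_000.0, 0.5])
y = 1.0
w = np.array([0.0, 0.0])

# Gradient of squared error for f(x) = w1*x1 + w2*x2:
# dL/dw_i = (f(x) - y) * x_i, so each weight's update is proportional to x_i.
error = w @ x - y
print(error * x)            # first component ~200,000x larger than the second

# After rescaling both features to a comparable range, the updates balance out.
x_scaled = x / np.array([100_000.0, 1.0])   # crude rescaling for illustration
error_s = w @ x_scaled - y
print(error_s * x_scaled)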
I am trying to implement clustering for bank transaction data. The dataset contains Vendor and MCC columns, which are strings. There are too many distinct values in those columns, and I want to cluster them using some similarity metric such as cosine similarity on Vendor or MCC (for example, 'Hotel A' and 'Hotel B' could end up in the same cluster). I don't think Levenshtein distance is sufficient for this.
I am thinking about finding a corpus for MCC and building a model to find similarity between the words. Is this method good for this problem? If not, how can I handle those columns? If yes, is there a corpus for this?
Data source: https://data.world/oklahoma/purchase-card-fiscal-year
I've done something similar to this problem using GloVe word embeddings.
One way to cluster a categorical text feature is to convert each unique value into an average word vector (after removing stopwords). Then you can compare the vectors via cosine similarity, and use clustering methods based on the similarity matrix. If this approach is too computationally complex, convert the values to vectors and get top-n closest items by cosine similarity.
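Here is a rough sketch of that idea using gensim's pre-trained GloVe vectors and scikit-learn; the vendor strings and the cluster count are placeholders, and you would plug in the distinct Vendor/MCC values from your data.

import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

# Hypothetical vendor strings; substitute the distinct values from your data.
vendors = ["Hotel A", "Hotel B", "Airline X", "Airline Y", "Coffee Shop Z"]

# Pre-trained GloVe vectors (50-dim); the first call downloads the model.
glove = api.load("glove-wiki-gigaword-50")

def avg_vector(text):
    # Average the word vectors of the tokens found in the GloVe vocabulary.
    words = [w for w in text.lower().split() if w in glove]
    if not words:
        return np.zeros(glove.vector_size)
    return np.mean([glove[w] for w in words], axis=0)

vecs = np.vstack([avg_vector(v) for v in vendors])

# Cosine similarity matrix, turned into a distance matrix for clustering.
sim = cosine_similarity(vecs)
dist = np.clip(1.0 - sim, 0.0, None)

# 'metric="precomputed"' is the current parameter name; older scikit-learn
# versions use 'affinity="precomputed"' instead.
labels = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(dist)
print(dict(zip(vendors, labels)))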
Please tell me how to split a node on a numerical feature. Suppose my parent node is temperature and it has numerical values such as 45.20, 33.10, 11.00, etc. How should I split on such values? If I had a categorical column, say temperature with 'low' and 'high' values, I would send 'low' to the left branch and 'high' to the right. But how should I split the column if it is numeric?
There are discretization methods for converting numerical features into categories, e.g. for use in decision trees. There are many supervised and unsupervised algorithms, from simple binning to information-theoretic approaches like the one Fayyad & Irani proposed. Follow this tutorial to learn how to discretize your features; the algorithm by Fayyad and Irani is explained in this course.
Disclaimer: I am the instructor of that course.
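For intuition, here is a minimal Python sketch of the usual trick for numeric features: sort the values, take the midpoints between consecutive distinct values as candidate thresholds, and keep the threshold with the best information gain (the entropy-based criterion that methods like Fayyad & Irani's build on). The temperatures and labels are invented for illustration.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_numeric_split(values, labels):
    # Try midpoints between consecutive sorted values as thresholds and
    # return the one with the highest information gain.
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_thr = -1.0, None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        thr = (v1 + v2) / 2
        left  = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

# Made-up temperatures with made-up labels, just to show the mechanics.
temps  = [45.20, 33.10, 11.00, 22.50, 38.00]
labels = ["hot", "hot", "cold", "cold", "hot"]
print(best_numeric_split(temps, labels))   # threshold around 27.8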
Say I am trying to predict a variable y, which is an integer score from 0 to 10, and I am using a linear regression model. The model actually produces real numbers in that interval.
I am using regression rather than classification because I want to be able to say that missing the correct prediction by (say) 2 is worse than missing it by 1. Currently I am using the mean absolute error as the evaluation metric.
Given that the model's prediction is a real number, what is the best way to constrain it to the allowed set of integers (0 to 10)? Should I just round the prediction to the nearest integer, or is there a better way?
You could also use a multinomial logistic regression model and measure its performance with classification accuracy.
Have a range from 0 to 11 and round to the nearest .5 number. This gives you evenly spaced, equally sized categories, one per integer score. If you can, weight the regression by how close it was to the .5 mark, as the results should ideally not be close enough to a boundary to cause ambiguity.
Alternatively, use a range from -0.5 to 10.5 with the integers as the targets. It makes no difference mathematically but is compatible with your existing network.
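If you do stay with simple rounding, as the question suggests, the usual recipe is to round and then clip to the valid range; a tiny numpy sketch with made-up predictions:

import numpy as np

# Hypothetical raw regression outputs.
raw_pred = np.array([-0.7, 3.2, 5.5, 9.8, 11.4])

# Round to the nearest integer and clip into the allowed range [0, 10].
constrained = np.clip(np.rint(raw_pred), 0, 10).astype(int)
print(constrained)   # [ 0  3  6 10 10]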
As given in the textbook Machine Learning by Tom M. Mitchell, the first statement about decision trees says that "decision tree learning is a method for approximating discrete-valued functions". Could someone kindly elaborate on this statement, and ideally justify it with an example? Thanks in advance :)
As a simple example, consider observation rows with two attributes; the training data contains a classification (a discrete value) based on a combination of those attributes. The learning phase has to determine which attributes to consider in which order, so that it can do well at the desired modelling.
For instance, consider a model that answers "What should I order for dinner?" given the inputs of desired price range, cuisine, and spiciness. The training data would contain your history from a variety of restaurant experiences. The model has to determine which order of checks is most effective in reaching a good entrée classification: eliminate restaurants based on cuisine first, then consider price, and finally tune the choice according to Scoville units; or perhaps check the spiciness first and start by dumping choices that aren't spicy enough before going on to the other two factors.
Does that explain what you need?
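If it helps to see "discrete-valued function" concretely, here is a toy scikit-learn sketch along the lines of that dinner example. The data is invented, and the ordinal encoding of the categorical attributes is just a shortcut for illustration: the point is that the learned tree maps discrete attribute values to a discrete class.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented dinner history: three discrete attributes and a discrete label.
history = pd.DataFrame({
    "price":   ["cheap", "cheap", "pricey", "pricey", "cheap"],
    "cuisine": ["thai", "italian", "thai", "italian", "thai"],
    "spicy":   ["yes", "no", "yes", "no", "no"],
    "order":   ["curry", "pasta", "curry", "risotto", "pad thai"],
})

enc = OrdinalEncoder()
X = enc.fit_transform(history[["price", "cuisine", "spicy"]])
y = history["order"]

# The tree learns a discrete-valued function: attribute values in, class out.
tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["price", "cuisine", "spicy"]))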