As given in the textbook Machine Learning by Tom M. Mitchell, the first statement about decision trees says that "decision tree learning is a method for approximating discrete-valued functions". Could someone kindly elaborate on this statement, and perhaps justify it with an example? Thanks in advance :)
As a simple example, consider observations (rows) with two attributes; the training data contains a classification (a discrete value) based on a combination of those attributes. The learning phase has to determine which attributes to consider, and in which order, so that it can model the desired function effectively.
For instance, consider a model that answers "What should I order for dinner?" given the inputs of desired price range, cuisine, and spiciness. The training data will contain your history from a variety of restaurant experiences. The model has to determine which strategy is most effective in reaching a good entrée classification: eliminate restaurants based on cuisine first, then consider price, and finally tune the choice according to Scoville units; or perhaps check the spiciness first, discarding choices that aren't spicy enough before going on to the other two factors.
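To make this concrete, here is a minimal sketch using a tiny made-up dinner history (all the attribute values and menu labels are invented for illustration). A scikit-learn DecisionTreeClassifier learns a discrete-valued function: attribute values in, one discrete label out.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy dinner history: each row is one past restaurant experience.
data = pd.DataFrame({
    "price":   ["low", "low", "high", "high", "low"],
    "cuisine": ["thai", "italian", "thai", "italian", "thai"],
    "spicy":   ["yes", "no", "yes", "no", "no"],
    "order":   ["curry", "pasta", "curry", "risotto", "pad thai"],  # discrete target
})

# scikit-learn trees need numeric inputs, so one-hot encode the attributes.
X = pd.get_dummies(data[["price", "cuisine", "spicy"]])
y = data["order"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The learned tree maps a new combination of attributes to one discrete label.
new_dinner = pd.get_dummies(
    pd.DataFrame({"price": ["high"], "cuisine": ["thai"], "spicy": ["yes"]})
).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new_dinner))  # one of the discrete menu labels
```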
Does that explain what you need?
I have to evaluate a logistic regression model. The model is meant to detect fraud, so in real life the algorithm will face highly imbalanced data.
Some people say that I need to balance the training set only, while the test set should remain similar to real-life data. On the other hand, many people say the model must be trained and tested on balanced samples.
I tried testing my model on both balanced and unbalanced sets and got the same ROC AUC (0.73), but different precision-recall AUCs: 0.40 (unbalanced) and 0.74 (balanced).
What should I choose?
And what metrics should I use to evaluate my model's performance?
Since you are dealing with an imbalanced problem (a disproportionately greater amount of non-fraud over fraud), I recommend you use the F-score with a test set whose imbalance matches the real world. This lets you compare models without having to balance your test set, since balancing could mean overrepresenting fraud and underrepresenting non-fraud cases in your test set.
Here are some references, including how to implement it in sklearn:
https://en.wikipedia.org/wiki/F-score
https://deepai.org/machine-learning-glossary-and-terms/f-score
https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
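As a quick sketch of scoring on a realistically imbalanced test set (the labels below are made up: 95 non-fraud, 5 fraud), using sklearn.metrics:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical predictions on an imbalanced test set (1 = fraud).
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 93 + [1] * 2   # two false alarms
                  + [1] * 3 + [0] * 2)  # three frauds caught, two missed

# F1 on the positive (fraud) class summarizes precision and recall in one number.
print(precision_score(y_true, y_pred))  # 0.6  (3 of 5 flagged cases are fraud)
print(recall_score(y_true, y_pred))     # 0.6  (3 of 5 frauds are caught)
print(f1_score(y_true, y_pred))         # 0.6
```

By default `f1_score` scores the positive class only (`average="binary"`), which is usually what you want for fraud detection; accuracy on the same predictions would be a misleading 0.96.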
I want to classify data into two classes based on the parameters given. My data consists of publications from two different sources, and I want to classify it into "match" or "non-match" when comparing dataset1 with dataset2. The datasets are unlabelled text data with five attributes (id, title, authors, venue, year), so if I apply unsupervised algorithms they will not produce my target classes directly. On the other hand, supervised algorithms need labelled data, which is unavailable and time-consuming to produce.
What is the best and easiest method to do that in python?
The easiest and, AFAIK, most effective method is as follows:
Use clustering algorithms like K-Means, to cluster your data points into 2 clusters.
Now, manually examine a few samples from one of the clusters and label them accordingly.
Suppose you randomly pick 10 data points from the first cluster and they all fall in the match class. All you need to do then is label every data point in that cluster as match and every data point in the other cluster as non-match.
This would give you the required classification.
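A minimal sketch of those steps, assuming the text attributes are concatenated into one string per record (the toy records, TF-IDF vectorizer, and k-means settings are my choices, not prescribed by the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy records: title + authors + venue + year flattened into one string each.
records = [
    "deep learning methods nlp smith acl 2019",
    "deep learning nlp j smith acl 2019",
    "database index structures jones vldb 2001",
    "database index structures p jones vldb 2001",
]

# Step 1: vectorize the text and cluster into 2 groups.
X = TfidfVectorizer().fit_transform(records)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Steps 2-3: inspect a few members of each cluster by hand,
# then name one cluster "match" and the other "non-match".
for rec, lab in zip(records, labels):
    print(lab, rec)
```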
I am trying to implement clustering for bank transaction data. The dataset contains Vendor and MCC columns, which are strings. There are too many distinct values in those columns, so I want to cluster them based on some metric such as cosine similarity over the Vendor or MCC text (for example, 'Hotel A' and 'Hotel B' could end up in the same cluster). I think Levenshtein distance is not sufficient for this.
I am thinking about finding a corpus for MCC and creating a model to measure similarity between the words. Is this method a good fit for the problem? If not, how can I handle those columns? If yes, is there a corpus for this?
Data source: https://data.world/oklahoma/purchase-card-fiscal-year
I've done something similar to this problem using GloVe word embeddings.
One way to cluster a categorical text feature is to convert each unique value into an average word vector (after removing stopwords). Then you can compare the vectors via cosine similarity, and use clustering methods based on the similarity matrix. If this approach is too computationally complex, convert the values to vectors and get top-n closest items by cosine similarity.
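A rough sketch of the averaging approach. The tiny 3-dimensional `embeddings` dict below is a made-up stand-in; in practice you would load real pretrained GloVe vectors there. Normalizing the averaged vectors to unit length makes ordinary euclidean k-means behave like clustering on cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Stand-in word vectors; replace with pretrained GloVe embeddings in practice.
embeddings = {
    "hotel":    np.array([0.9, 0.1, 0.0]),
    "inn":      np.array([0.8, 0.2, 0.1]),
    "airlines": np.array([0.0, 0.9, 0.3]),
    "airways":  np.array([0.1, 0.8, 0.4]),
}

def avg_vector(name):
    # Average the vectors of the words we actually have embeddings for.
    words = [w for w in name.lower().split() if w in embeddings]
    return np.mean([embeddings[w] for w in words], axis=0)

vendors = ["Hotel A", "Grand Inn", "Acme Airlines", "Blue Airways"]
vecs = np.vstack([avg_vector(v) for v in vendors])

# Unit-length vectors: euclidean k-means ~ cosine-similarity clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(normalize(vecs))
print(dict(zip(vendors, labels)))
```

For the top-n variant, you would skip the clustering and just rank rows of `cosine_similarity(vecs)` for each query vendor.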
Please tell me how to split a node on a numerical attribute. Suppose my parent node is temperature and it has numerical values such as 45.20, 33.10, 11.00, etc. How should I split on such values? If I had a categorical column, like a temperature attribute with low and high values, I would send low down the left side and high down the right. But how should I split the column when it is numeric?
There are discretization methods for converting numerical features into categories, e.g. for use in decision trees. There are many supervised and unsupervised algorithms, from simple binning to information-theoretic approaches like the one Fayyad & Irani proposed. Follow this tutorial to learn how to discretize your features. The algorithm by Fayyad and Irani is explained in this course.
Disclaimer: I am the instructor of that course.
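For instance, a simple unsupervised equal-width binning with scikit-learn's KBinsDiscretizer (the temperatures follow the question; Fayyad & Irani's supervised method would need a separate implementation):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

temps = np.array([[45.20], [33.10], [11.00], [28.50], [39.90]])

# Equal-width binning into 3 categories (think low / medium / high).
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
codes = disc.fit_transform(temps)

print(codes.ravel())       # [2. 1. 0. 1. 2.] -- bin index per temperature
print(disc.bin_edges_[0])  # the cut points a tree could split on
```

A decision tree can then branch on the bin index, exactly as it would on a low/medium/high categorical column. (CART-style trees can also split raw numeric features directly by testing thresholds like `temperature <= 33.8`.)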
Hello,
How can I choose the best-fitting feature selection method for a given dataset (textual data)?
In Weka, for example, there are several attribute selection methods (CfsSubsetEval, ChiSquaredAttributeEval, etc.) and several search methods (BestFirst, Greedy, Ranker, etc.).
My question: how can I know which attribute selection method and which search method are best for a given dataset?
My guess: should I use cross-validation to test the dataset after applying each feature selection filter? If I have 10 attribute selection methods and 10 search methods, that means performing 100 cross-validation tests and then picking the configuration with the highest accuracy, and that assumes I am testing against one classifier only. So if I have 2 classifiers (SMO and J48), will I need to perform 200 cross-validation tests?
Please correct me if I misunderstood something ...
You can try information gain or principal component analysis to determine which features add the most to your classification (information gain) or have the highest variance (PCA).
You can also use the techniques you mentioned. But whatever you do, you will have to evaluate it to see how effective it was, this could be quite a pain or a lot of fun depending on your outlook :-)
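Both suggestions can be sketched in scikit-learn (mutual information is the usual information-gain analogue there; the synthetic data is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Information-gain style: keep the k features with highest mutual information.
selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features

# Variance style: project onto the top principal components instead.
X_pca = PCA(n_components=3).fit_transform(X)
print(X_pca.shape)
```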
There are different kinds of feature selection, including filter and wrapper methods. Filter methods are classifier-independent techniques that select features based on distance, correlation or mutual information. I would advise that you check the FEAST toolbox and mRMR.
As for wrapper methods, which are based on the performance of a particular classifier, you do not need to enumerate every search method you have: fix one search method and apply the comparison proposed in your post.
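That comparison loop can be sketched with a scikit-learn Pipeline and cross-validation; the two filter score functions and the classifier below are placeholders for whichever filter/search/classifier combinations you actually want to compare:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values

# One cross-validated score per feature-selection configuration,
# with the classifier held fixed.
results = {}
for name, score_func in [("chi2", chi2), ("anova", f_classif)]:
    pipe = Pipeline([("select", SelectKBest(score_func, k=5)),
                     ("clf", LogisticRegression(max_iter=1000))])
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()

print(results)  # pick the configuration with the best score
```

Running the selector inside the pipeline matters: it is refit on each training fold, so the cross-validation estimate is not contaminated by the test fold.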
You should build a model on the whole dataset, then perform feature selection (FS). If you have more than one model, you can scale the feature importances by referring to RMSE or MSE. If you are familiar with R, try searching Google for "random forest AND feature selection".