Handing high Cardinality features with supervised ratio and weight of evidence [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
Improve this question
Say a data set has a categorical feature with high cardinality. Say zipcodes, or cities. Encoding this feature would give hundreds of feature columns. Different approaches such as supervised_ratio, weight of evidence seems to give better performance.
The question is, these supervised_ratio and WOE are to be calculated on the training set, right ? So I get the training set and process it and calcuate the SR and WOE and update the training set with the new values and keep the calculated values to be used in test set as well. But what happens if the test set has zip codes which were not in training set ? when there is no SR or WOE value to be used? (Practically this is possible if the training data set is not covering all the possible zip codes or if there are only one or two records from certain zip codes which might fall in to either training set or test set).
(Same will happen with encoding approach also)
I am more interested in the question, is SR and/or WOE the recommended way to handle a feature with high cardinality? if so what do we do when there are values in test set which were not in training set?
If not, what are the recommended ways to handling high cardinality features and which algorithms are more robust to them ? Thank you

This is a well-known problem when applying value-wise transformations to a categorical feature. The most common workaround is to have a set of rules to translate unseen values into values known by your training set.
This can be just a single 'NA' value (or 'others', as another answer is suggesting), or something more elaborate (e.g. in your example, you can map unseen zip codes to the closest know one in the training set).
Another posible solution in some scenarios is to have the model refusing to made a prediction in those cases, and just return an error.
For your second question, there is not really a recommended way of encoding high cardinality features (there are many methods and some may work better than others depending on the other features, the target variable, etc..); but what we can recommend you is to implement a few and experiment which one is more effective for your problem. You can consider the preprocessing method used as just another parameter in your learning algorithm.

That's a great question, thanks for asking!
When approaching this kind of problem of handle a feature with high cardinality, like zip codes, I keep in my training set just the most frequent ones and put all others in new category "others", then I calculate their WOE or any metric.
If some unseen zip code are found the test set, they falls to 'others' category. In general, this approach works well in practice.
I hope this nayve solution can help you!

Related

Can anyone please explain me the functioning of standard scalers in python sklearn [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I read about them and found that, they basically scale up the values.So dont they change up the values of the records? ok if they scale up/down the values,so there graph must look same everytime,but i saw changes in the graph as per selection of scaler.Please let me know this as I am new to this.
Standardizing the features around the center and 0 with a standard deviation of 1 is important when we compare measurements that have different units. Variables that are measured at different scales do not contribute equally to the analysis and might end up creating a bais.However, the minimum and maximum values vary according to how spread out the variable was, to begin with, and is highly influenced by the presence of outliers.
For example, A variable that ranges between 0 and 1000 will outweigh a variable that ranges between 0 and 1. Using these variables without standardization will give the variable with the larger range weight of 1000 in the analysis. Transforming the data to comparable scales can prevent this problem. Typical data standardization procedures equalize the range and/or data variability.
Note in particular that because the outliers on each feature have different magnitudes, the spread of the transformed data on each feature is very different.StandardScaler cannot guarantee balanced feature scales in the presence of outliers.
As you saw changes in the graph as per selection of scaler, one resion can you used StandardScaler() to standardize data so far doesn't work with NaNs (missing values).It's not exactly that simple to deal with NaN values. It requires analyses of the data before taking any further step to deal with the NaN values. There are various ways you can deal with these missing values (the following is not an exhaustive list):
Ignore missing values altogther : the problem with this approach is that the missing rows might contain important information in other
columns and ignoring them would lead to incomplete analyses
Replace them with another value : this one of the commonly used approaches, but the choice of the value that you will use to replace
will affect your overall analysis. You could replace them with say
mean, or say a placeholder value (like -1) which you know never
occurs throughout the column.
Using regression to substitute the values
**Using KNN to substitue values **

Cut points from which to choose the best split in Decision Tree regressor with continious feature?

I understand, that in the Decision Tree algorithm, when the splitting is decided, we choose the best split based on some criterion. And when you're looking for the best split, you have to iterate through some list of values. But it seems very computationally expensive to consider every value of the feature as the possible threshold (or, so called, cut point). Thus, there is a necessity for some heuristic for choosing these thresholds. For example, if we have continuous feature and categorical target (i.e, we are dealing with classification problem), we can do the following: sort dataset by given feature and consider for splitting only values, where target variable is changing it's value.
But what do you do if you have regression task, i.e. both feature and target are continuous variables? I realize, that I have to calculate, for example, the mean variance or mean median deviation in both branches for each split. But how do you decide from which values you're choosing you best split? People surely have came up with some optimal solution in order to avoid iterating over every value of the feature in the training set.
I've done some research, but most sources only focuses on different criteria and questions of how you determine whether your split is suitable. Which is not really answering my question.
I've found this question, but Predictor only suggests, that it can be done using the percentiles. And I think, that there is no guarantee, that this is how it really done in real life.
I've also found this question, but for me geledek's answer is not very clear (obviously, dude just copy-pasted his answer from presentation, that he is referring to). I'm pretty much fine with the Method 1, but I would really appreciate if someone could explain Method 2 in more details. Or, perhaps, provide some different source or explanation of your own.
UPD: I've also looked up to the scikit-learn repo at GitHub, and found this line. I can't quite understand the overall code, but it seems that this particular line implies that thresholds are chosen as the averages of the neighboring feature values (which corresponds with the aforementioned Method 1 from the question above). Is that correct? I also don't understand this comment: # sum of halves is used to avoid infinite value. How exactly does dividing by two prevent from getting infinite values? Don't you get infinity only when you are dividing by zero? Is dividing by two necessary, because this way we are getting average value (and not because we don't want to get infinitely)?

Confusion about test & validation set labels in machine learning [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I have a question with regards to the training and validation of a dataset.
I understand the concept of labels for training data i.e. y_train. What I don't get is that why should our testing/validation samples have labels as well.
I assume that by giving labels to the test samples, we define what they are before putting them through the algorithm right?
Let me put it this way, if I have a dataset of pictures of dogs and cats, and I label them 1 and 2, respectively. Then if I want to throw a picture (dog) to test my model, which was not in my training dataset, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
Can I have a testing/validation dataset without label?
Validation dataset is used to finetune the parameters in your model while the test set is used to check the accuracy. Without the label how can claim the correctness of your model. This concept is valid in supervised learning so one needs to have labels with testing and validation dataset.
The purpose of a test set is, as its name implies, to test the performance of your model in data that were not seen during training. And in order to get this performance indication, you certainly need data with known labels, in order to compare these labels (ground truth) with the corresponding model predictions, and to arrive to some quantitative measure (e.g. accuracy) of your model performance - something you can certainly not do without these labels being available in the test set.
if I want to throw a picture (dog) to test my model, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
You are using the term "test" very loosely here - this is not its meaning in the context of a test set (which context I just described above). Notice also that, the fact that the test labels are available, does not mean that they are being used by the model during prediction (they are certainly not - they are only used for comparison with the model predictions, as described above). Plus, you are referring to a very specific problem where the answer (cat/dog) is obvious to a human observer - try using the same rationale e.g. in a genomics problem, or in one that requests numeric predictions for, say, house prices, and you'll see that the situation is not that simple and straightforward (could you possibly name the price of a house by just looking at a row of numbers?)...
The same applies for a validation set, only the objective here is different (i.e. not model assessment, but model tuning).
Admittedly, some people use the term "test data" to mean in general any unseen data, but this is not correct; after you have build & assess your model using your training, validation, and test sets, you deploy it feeding it with new and obviously unseen data, for which it is certainly not expected to already know the labels...
There are literally dozens of online tutorials on the subject, and SO is arguably not the most appropriate forum for this kind of questions - I just hope I have given you a first good-enough general idea...

Imbalanced Data for Random ferns [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
For a Multiclass problem, should the data be balanced for machine learning algorithms such as Random forests and Random ferns or is it ok for it to be imbalanced for a certain extent?
The issue with imbalanced classes raises when the disproportion alters the separability of the classes instances. But this does not happen in ever imbalanced dataset: sometimes the more data you have from one class the better you can differentiate the scarse data from it since it lets you find more easily which features are meaningful to create an discriminating plane (even though you are not using discriminative analysis the point is to classify-separate the instances according to classes).
For example I can remember the KDDCup2004 protein classification task in which one class had 99.1% of the instances in the training set but if you tried to use under sampling methods to alleviate the imbalance you would only get worse results. That meaning that the large amount of data from the first class defined the data in the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature that can partitionate the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default ( = always) but only when the imbalance is not representative of the classes real distributions.
So I suggest that you first run a multivariate analysis to try to get the extent of imbalance among classes in your dataset and the run a series of experiments with different undersampling ratios if you still ar ein doubt.
I have used Random Forrests in my task before. Although the data don't need be balanced, however if the positive samples are too few, the pattern of the data maybe drown in the noise. Most of classify methods even (random forrests and ada boost) should have this flaw more or less.'Over sample' may be a good idea to deal with this problem.
Perhaps the paper Logistic Regression in rareis useful with this sort of problem, although its topic is logistic regression.

what should I do when training set contains some error data in supervised classification?

I am working on a project which performs text auto-classification, I have a lot of data set like as below:
Text | CategoryName
xxxxx... | AA
yyyyy... | BB
zzzzz... | AA
then, I will use the above data set to generate a classifier, once new text coming, the classifier can label new text with correct CategoryName
(text is natural language, size between 10-10000)
Now, the problem is, the original data set contains some incorrect data, (E.g. AAA should be labeled as Category AA, but it is labeled as Category BB accidentally ) because these data are classified manually. And I don't know which label is wrong and how many percentages are wrong because I can't review all data manually...
So my question is, what should I do?
Can I find the wrong labels via some automatic way?
How to increase precision and recall when new data coming?
How to evaluate the impact of wrong data? (since I don't know how many percentage data is wrong)
Any other suggestions?
Obviously, there is no easy way to solve your problem - after all, why build a classifier if you already have a system that can detect wrong classifications.
Do you know how much the erroneous classifications affect your learning? If there are only a small percentage of them, they should not hurt the performance much. (Edit. Ah, apparently you don't. Anyway, I suggest you try it out - at least if you can identify a false result when you see one.)
Of course, you could always first train your system and then have it suggest classifications for the training data. This might help you identify (and correct) your faulty training data. This obviously depends on how much training data you have, and if it is sufficiently broad to allow your system to learn correct classification despite the faulty data.
Can you review any of the data manually to find some mislabeled examples? If so, you might be able to train a second classifier to identify mislabeled data, assuming there is some kind of pattern to the mislabeling. It would be useful for you to know if mislabeling is a purely random process (it is just noise in the training data) or if mislabeling correlates with particular features of the data.
You can't evaluate the impact of mislabeled data on your specific data set if you have no estimate regarding what fraction of your training set is actually mislabeled. You mention in a comment that you have ~5M records. If you can correctly manually label a few hundred, you could train your classifier on that data set, then see how the classifier performs after introducing random mislabeling. You could do this multiple times with varying percentages of mislabeled data to see the impact on your classifier.
Qualitatively, having a significant quantity of mislabeled samples will increase the impact of overfitting so it is even more important that you do not overfit your classifier to the data set. If you have a test data set (assuming it also suffers from mislabling), then you might consider training your classifier to less-than-maximal classification accuracy on the test data set.
People usually deal with the problem you a describing by having multiple annotators and computing their agreement (e.g. Fleiss' kappa). This is often seen as the upper bound on the performance of any classifier. If three people give you three different answers, you know the task is quite hard and your classifier stands no chance.
As a side note:
If you do not know how many of your records have been labelled incorrectly, you do not understand one of the key properties of the problem. Select 1000 records at random and spend the day reviewing their labels to get an idea. It really is time well spent. For example, I found I can easily review 500 labelled tweets per hour. Health warning: it is very tedious, but a morning spent reviewing gives me a good idea of how distracted my annotators were. If 5% of the records are incorrect, it is not such a problem. If 50 are incorrect, you should go back you your boss and tell them it can't be done.
As another side note:
Someone mentioned active learning. I think it is worth looking into options from the literature, keeping in mind labels might have to change. You said that it hard.

Resources