Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
Hello,
How can I choose the best-fitting feature selection method for a given dataset (textual data)?
In Weka, for example, there are several attribute selection methods (CfsSubsetEval, ChiSquaredAttributeEval, etc.) and several search methods (BestFirst, greedy, Ranker, etc.).
My question: how can I know which attribute selection method and search method is best for a given dataset?
My guess: should I use cross-validation to test the dataset after applying the feature selection filter? For example, if I have 10 attribute selection methods and 10 search methods, I would need to perform 100 cross-validation tests and then pick the configuration with the highest accuracy. And that assumes I am testing against one classifier only. So if I have 2 classifiers (SMO and J48), will I need to perform 200 cross-validation tests?
Please correct me if I misunderstood something ...
You can try information gain or principal component analysis to determine which features add the most to your classification (information gain) or have the highest variance (PCA).
You can also use the techniques you mentioned. But whatever you do, you will have to evaluate it to see how effective it was. This could be quite a pain or a lot of fun, depending on your outlook :-)
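The evaluation step can be sketched in Python with scikit-learn rather than Weka (that swap is an assumption of this example, as are the toy corpus and the choice of MultinomialNB as the single classifier). Each combination of scoring function and number of kept features is cross-validated, and the selection step sits inside the pipeline so it is refit on every training fold:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus made up for illustration: 1 = sports, 0 = cooking.
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "fans cheered as the team scored",
    "the coach praised the goal keeper",
    "a football match ended in a draw",
    "the team trained before the match",
    "simmer the sauce with fresh garlic",
    "bake the bread in a hot oven",
    "chop the garlic and stir the sauce",
    "the oven roasted the fresh bread",
    "stir the soup and add fresh herbs",
    "add garlic to the soup and simmer",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorize, select features, classify. Because selection is a pipeline
# step, it is fitted only on the training part of each fold.
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("select", SelectKBest(score_func=chi2)),
    ("clf", MultinomialNB()),
])

# 2 scoring functions x 2 values of k = 4 configurations to cross-validate.
param_grid = {
    "select__score_func": [chi2, mutual_info_classif],
    "select__k": [3, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

Adding a second classifier would mean adding it as another grid entry, so the number of cross-validation runs grows multiplicatively, much as the question suspects.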
There are different kinds of feature selection, including filter and wrapper methods. Filter methods are classifier-independent techniques that select features based on distance, correlation or mutual information. I would advise you to check out the FEAST tool and mRMR.
Regarding the wrapper models which are based on the performance of a particular classifier, you do not need to enumerate all the search methods you have. You fix one search method and apply the comparison proposed in your post.
You should build a model on the whole dataset, then perform feature selection (FS). If you have more than one model, you can scale the feature importances by referring to RMSE or MSE. If you are familiar with R, try searching for "random forest AND feature selection" on Google.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Why does batch normalization work on different samples of the same feature instead of on different features of the same sample? Shouldn't it normalize across different features? In the diagram, why do we use the first row and not the first column?
Could someone help me?
Because different features of the same object mean different things, and it's not logical to calculate statistics over these values. They can have different ranges, means, standard deviations, etc. For example, one of your features could be the age of a person and another one the height of that person. If you calculate the mean of these values, you will not get any meaningful number.
In classic machine learning (especially in linear models and KNN) you should normalize your features (i.e. calculate the mean and std of each feature over the entire dataset and transform the feature to (X-mean(X)) / std(X)). Batch normalization is the analogue of this for stochastic optimization methods like SGD (it's not meaningful to use global statistics on a mini-batch; furthermore, you want to use batch norm more often than just before the first layer). More fundamental ideas can be found in the original paper.
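As a minimal NumPy sketch of the point above, assuming a made-up mini-batch with age and height columns and leaving out the learnable scale and shift that a real batch norm layer also has: the statistics are taken per feature (over the batch axis), not per sample.

```python
import numpy as np

# Hypothetical mini-batch: 4 samples (rows) x 2 features (age in years, height in cm).
batch = np.array([
    [25.0, 180.0],
    [35.0, 165.0],
    [45.0, 170.0],
    [55.0, 185.0],
])
eps = 1e-5  # small constant for numerical stability

# Per-feature statistics: average over the batch axis (axis=0).
mean = batch.mean(axis=0)
var = batch.var(axis=0)
normalized = (batch - mean) / np.sqrt(var + eps)

# For contrast, a per-sample mean (axis=1) averages an age with a height.
per_sample_mean = batch.mean(axis=1)

print(mean)             # one mean per feature
print(per_sample_mean)  # one number per person, mixing years and centimetres
```

After normalization each column has roughly zero mean and unit variance, whereas the per-sample means are averages of years and centimetres and have no natural interpretation.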
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I read about them and found that they basically scale the values. So don't they change the values of the records? And if they only scale the values up or down, the graph should look the same every time, but I saw the graph change depending on which scaler I selected. Please explain this, as I am new to this.
Standardizing the features so that they are centered at 0 with a standard deviation of 1 is important when we compare measurements that have different units. Variables that are measured at different scales do not contribute equally to the analysis and might end up creating a bias. However, the minimum and maximum values vary according to how spread out the variable was to begin with, and are highly influenced by the presence of outliers.
For example, a variable that ranges between 0 and 1000 will outweigh a variable that ranges between 0 and 1. Using these variables without standardization will give the variable with the larger range roughly 1000 times the weight in the analysis. Transforming the data to comparable scales can prevent this problem. Typical data standardization procedures equalize the range and/or data variability.
Note in particular that because the outliers on each feature have different magnitudes, the spread of the transformed data on each feature is very different. StandardScaler cannot guarantee balanced feature scales in the presence of outliers.
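To make this concrete, here is a small NumPy sketch using hand-written versions of the two formulas that scikit-learn's StandardScaler and MinMaxScaler apply by default (the helper names and the data are made up for illustration). Scaling changes the values but not their order, and a single outlier squeezes the remaining points together:

```python
import numpy as np

def standardize(x):
    """(x - mean) / std, using the population standard deviation."""
    return (x - x.mean()) / x.std()

def min_max(x):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# The numbers change, but the ordering of the records does not.
print(standardize(values))
print(min_max(values))

# With an outlier, the first four points end up crammed near 0.
print(min_max(with_outlier))
print(standardize(with_outlier))
```

Because each scaler uses different statistics (mean and standard deviation versus minimum and maximum), the axis ranges and the apparent spread of a plot differ from scaler to scaler even though the relative order of the points is unchanged.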
As for the changes you saw in the graph depending on the scaler: one possible reason is that your data contains NaNs (missing values), which StandardScaler() does not fill in for you. Dealing with NaN values is not exactly simple. It requires analysing the data before taking any further steps. There are various ways you can deal with these missing values (the following is not an exhaustive list):
Ignore missing values altogether: the problem with this approach is that the rows with missing values might contain important information in other columns, and ignoring them would lead to incomplete analyses.
Replace them with another value: this is one of the most commonly used approaches, but the choice of replacement value will affect your overall analysis. You could replace them with, say, the mean, or a placeholder value (like -1) which you know never occurs in the column.
Use regression to substitute the values.
Use KNN to substitute the values.
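The first two options in the list above can be sketched with NumPy on a single made-up column (the regression and KNN variants are left out to keep the sketch short):

```python
import numpy as np

# Hypothetical column with two missing entries.
column = np.array([2.0, np.nan, 4.0, 6.0, np.nan])
missing = np.isnan(column)

# Option 1: ignore (drop) the missing entries.
dropped = column[~missing]

# Option 2a: replace missing entries with the mean of the observed values.
mean_filled = np.where(missing, np.nanmean(column), column)

# Option 2b: replace missing entries with a placeholder that never occurs.
placeholder_filled = np.where(missing, -1.0, column)

print(dropped)
print(mean_filled)
print(placeholder_filled)
```

Note that the two replacement strategies give the column a different mean and spread, so a scaler applied afterwards will also produce different results.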
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a question with regards to the training and validation of a dataset.
I understand the concept of labels for training data, i.e. y_train. What I don't get is why our testing/validation samples should have labels as well.
I assume that by giving labels to the test samples, we define what they are before putting them through the algorithm, right?
Let me put it this way, if I have a dataset of pictures of dogs and cats, and I label them 1 and 2, respectively. Then if I want to throw a picture (dog) to test my model, which was not in my training dataset, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
Can I have a testing/validation dataset without labels?
The validation dataset is used to fine-tune the parameters of your model, while the test set is used to check its accuracy. Without labels, how can you claim that your model is correct? This concept applies to supervised learning, so you need labels for both the testing and the validation dataset.
The purpose of a test set is, as its name implies, to test the performance of your model on data that were not seen during training. To get this performance indication you certainly need data with known labels, so that you can compare these labels (the ground truth) with the corresponding model predictions and arrive at some quantitative measure (e.g. accuracy) of your model's performance. You certainly cannot do that without these labels being available in the test set.
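A tiny pure-Python sketch of that comparison, assuming a hypothetical stand-in for a trained model (a single weight threshold) and a made-up test set: the model only ever sees the features, and the labels come in afterwards to score its predictions.

```python
def predict(features):
    """Hypothetical stand-in for a trained model; it receives features only."""
    (weight_kg,) = features
    return 1 if weight_kg > 8 else 2  # 1 = dog, 2 = cat


# Made-up test set: features and their known labels are kept separately.
x_test = [(30,), (4,), (6,), (12,), (9,)]
y_test = [1, 2, 2, 1, 2]

# Prediction step: the labels are not passed to the model.
predictions = [predict(features) for features in x_test]

# Evaluation step: the labels are used only to compare against predictions.
correct = sum(p == y for p, y in zip(predictions, y_test))
accuracy = correct / len(y_test)

print(predictions, accuracy)
```

Without `y_test` the last two statements could not be computed, which is exactly why an unlabeled test set cannot give a performance estimate.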
if I want to throw a picture (dog) to test my model, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
You are using the term "test" very loosely here; this is not its meaning in the context of a test set (which I just described above). Notice also that the fact that the test labels are available does not mean that they are used by the model during prediction (they are certainly not; they are only used for comparison with the model predictions, as described above). Plus, you are referring to a very specific problem where the answer (cat/dog) is obvious to a human observer. Try using the same rationale in, say, a genomics problem, or in one that asks for numeric predictions such as house prices, and you'll see that the situation is not that simple and straightforward (could you possibly name the price of a house by just looking at a row of numbers?)...
The same applies for a validation set, only the objective here is different (i.e. not model assessment, but model tuning).
Admittedly, some people use the term "test data" to mean any unseen data in general, but this is not correct; after you have built and assessed your model using your training, validation, and test sets, you deploy it and feed it new and obviously unseen data, for which you are certainly not expected to already know the labels...
There are literally dozens of online tutorials on the subject, and SO is arguably not the most appropriate forum for this kind of question. I just hope I have given you a good-enough first general idea...
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
Say a dataset has a categorical feature with high cardinality, such as zip codes or cities. Encoding this feature would give hundreds of feature columns. Different approaches, such as supervised ratio (SR) or weight of evidence (WOE), seem to give better performance.
The question is: SR and WOE are to be calculated on the training set, right? So I take the training set, calculate the SR and WOE, update the training set with the new values, and keep the calculated values to be used on the test set as well. But what happens if the test set has zip codes which were not in the training set, so there is no SR or WOE value to use? (In practice this is possible if the training set does not cover all possible zip codes, or if there are only one or two records for certain zip codes, which might fall into either the training set or the test set.)
(The same will happen with the encoding approach.)
I am more interested in this question: is SR and/or WOE the recommended way to handle a feature with high cardinality? If so, what do we do when there are values in the test set which were not in the training set?
If not, what are the recommended ways of handling high-cardinality features, and which algorithms are more robust to them? Thank you
This is a well-known problem when applying value-wise transformations to a categorical feature. The most common workaround is to have a set of rules to translate unseen values into values known by your training set.
This can be just a single 'NA' value (or 'others', as another answer suggests), or something more elaborate (e.g. in your example, you can map unseen zip codes to the closest known one in the training set).
Another possible solution in some scenarios is to have the model refuse to make a prediction in those cases and just return an error.
For your second question, there is not really a single recommended way of encoding high-cardinality features (there are many methods, and some may work better than others depending on the other features, the target variable, etc.). What we can recommend is to implement a few and find out experimentally which one is most effective for your problem. You can treat the preprocessing method as just another parameter of your learning algorithm.
That's a great question, thanks for asking!
When handling a feature with high cardinality, like zip codes, I keep just the most frequent values in my training set and put all the others in a new category, "others"; then I calculate their WOE or any other metric.
If some unseen zip codes are found in the test set, they fall into the 'others' category. In general, this approach works well in practice.
I hope this naive solution can help you!
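The "others" bucket can be sketched in pure Python as follows. This uses the supervised ratio (share of positive targets per category) rather than WOE to keep it short; the function names, the `min_count` threshold and the data are assumptions made for illustration only.

```python
from collections import Counter

OTHER = "others"


def fit_supervised_ratio(categories, targets, min_count=2):
    """Learn positive-rate per frequent category; rare ones share one bucket."""
    counts = Counter(categories)
    frequent = {c for c, n in counts.items() if n >= min_count}

    positives, totals = Counter(), Counter()
    for cat, y in zip(categories, targets):
        key = cat if cat in frequent else OTHER
        totals[key] += 1
        positives[key] += y

    ratios = {key: positives[key] / totals[key] for key in totals}
    # If no training row was rare, fall back to the overall positive rate.
    ratios.setdefault(OTHER, sum(targets) / len(targets))
    return ratios


def transform(categories, ratios):
    """Encode categories; anything not learned maps to the 'others' value."""
    return [ratios.get(c, ratios[OTHER]) for c in categories]


# Made-up training data: zip codes and a binary target.
train_zips = ["10001", "10001", "10001", "94105", "94105", "60601", "73301"]
train_y = [1, 1, 0, 0, 1, 1, 0]
ratios = fit_supervised_ratio(train_zips, train_y)

# "99999" was never seen; "60601" was seen only once. Both use the bucket.
encoded = transform(["10001", "99999", "60601"], ratios)
print(ratios)
print(encoded)
```

The statistics are learned from the training set only and then reused unchanged on the test set, so unseen zip codes never cause a lookup failure.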
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
In the textbook Machine Learning by Tom M. Mitchell, the first statement about decision trees is: "Decision tree learning is a method for approximating discrete valued functions". Could someone kindly elaborate on this statement, and perhaps even justify it with an example? Thanks in advance :) :)
In a simple example, consider rows of observations with two attributes; the training data contains a classification (discrete values) based on a combination of those attributes. The learning phase has to determine which attributes to consider, and in which order, so that the resulting model classifies well.
For instance, consider a model that will answer "What should I order for dinner?" given the inputs of desired price range, cuisine, and spiciness. The training data will contain your history from a variety of restaurant experiences. The model will have to determine which order is most effective for reaching a good entrée classification: eliminate restaurants based on cuisine first, then consider price, and finally tune the choice according to Scoville units; or perhaps check the spiciness first and start by dumping choices that aren't spicy enough before going on to the other two factors.
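To illustrate what "discrete-valued function" means here, the sketch below writes out one possible learned tree by hand as nested tests. The tree, the entrée names and the 0 to 5 spiciness scale are all made up; a real learner would derive the order of the tests from the training data.

```python
def choose_entree(cuisine, price_range, spiciness):
    """Hand-written decision tree: each path of tests ends in one discrete label."""
    if cuisine == "thai":
        # Within Thai food, spiciness is the next most useful test.
        if spiciness >= 3:
            return "green curry"
        return "pad thai"
    if cuisine == "italian":
        # Within Italian food, price decides.
        if price_range == "low":
            return "margherita pizza"
        return "seafood risotto"
    # Default leaf for cuisines that never appeared in the history.
    return "house salad"


print(choose_entree("thai", "low", 4))
print(choose_entree("italian", "high", 0))
```

Whatever the inputs, the output is always one of a finite set of labels, which is the sense in which the tree approximates a discrete-valued function.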
Does that explain what you need?