I don't understand the difference between these two model selection procedures: grid search and best subset selection.
Both of them pick a subset of parameters and try all possible combinations, so I don't see where the difference lies.
Subset selection in machine learning is used for feature selection: it means evaluating subsets of features (out of all potential features) as a group and then selecting the best subset by some criterion.
There are many different algorithms for doing so; one of them is an exhaustive (grid) search over all possible subsets.
Grid search is just the simplest way to search over hyperparameters: it is an exhaustive search through a manually specified subset of the hyperparameter space.
So, best subset selection can be done by grid search, but there are also other approaches.
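For concreteness, here is a minimal sketch of best subset selection as an exhaustive search over all feature subsets; the iris dataset and logistic regression are just stand-ins:

```python
# Minimal sketch: best subset selection as an exhaustive (grid) search
# over all non-empty feature subsets, scored by cross-validation.
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = -1.0, None
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        # Evaluate this subset of columns with 5-fold cross-validation.
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "score:", best_score)
```

Note that this is exactly an exhaustive search over 2^n - 1 subsets, which is why it is only feasible for a small number of features.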
I have recently been looking into different filter feature selection approaches and have noticed that some are better suited to numerical data (e.g., Pearson correlation) and some to categorical data (e.g., the Chi-squared test).
I am working with a dataset containing a mixture of both data types and am unsure what the best practice is for applying filter methods.
Is it best to split the dataset into categorical and numerical parts, perform a different filter method on each set, and then join the results?
Or should a single filter method be applied to the whole dataset?
You can have a look at permutation importance. The idea is to randomly shuffle the values of a feature and observe the change in error: if the feature is important, the error should increase. Unlike some statistical tests, it does not depend on the data type of the feature. It is also very straightforward to implement and analyze. link1, link2
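A minimal sketch with scikit-learn's permutation_importance (the breast-cancer dataset and random forest here are just stand-ins; any fitted model works):

```python
# Minimal sketch of permutation importance: shuffle one feature at a time
# on held-out data and measure how much the model's score drops.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# n_repeats shuffles each feature several times for a stable estimate.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
print(result.importances_mean)
```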
I am working to set up data for an unsupervised learning algorithm. The goal of the project is to group (cluster) different customers based on their behavior on the website. Obviously, some sort of clustering algorithm is best for discovering patterns in the data that we can't see as humans.
However, the database contains multiple rows per customer (in chronological order), one for each action the customer took on the website during that visit. For example, customer ID 123 clicked page 1 at time X, which is one row in the database; then the same customer clicked another page at time Y, which makes another row.
My question is: what algorithm or approach would you use for clustering in this scenario? K-means is really popular for this type of problem, but I don't know whether it can be used here because of the grouping. Is it somehow possible to do cluster analysis around one specific ID that spans multiple rows?
Any help or direction on which unsupervised learning approach I should take is appreciated.
In short:
(1) Learn a fixed-length embedding (representation) of each event;
(2) learn a way to combine a sequence of such embeddings into a single representation per customer, then use your favorite unsupervised methods.
For (1), you can do it either manually or with an encoder/decoder.
For (2), there is a range of options, from simply averaging the embeddings of all events, to training an encoder-decoder to reconstruct the original sequence of events and taking the intermediate representation (the one the decoder uses to reconstruct the original sequence); a minimal sketch of the simplest variant follows.
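Assuming a hypothetical click log with customer_id and page_id columns, one-hot event embeddings are averaged per customer and then clustered with K-means:

```python
# Minimal sketch: one-hot event embeddings averaged per customer,
# then K-means on the resulting fixed-length customer vectors.
# The click log below is a hypothetical stand-in for your database rows.
import pandas as pd
from sklearn.cluster import KMeans

log = pd.DataFrame({
    "customer_id": [123, 123, 123, 456, 456, 789],
    "page_id": ["home", "pricing", "signup", "home", "blog", "pricing"],
})

# (1) A manual fixed-length embedding: one-hot encode each event's page.
event_embeddings = pd.get_dummies(log["page_id"])

# (2) Combine each customer's sequence of events by averaging.
customer_vectors = event_embeddings.groupby(log["customer_id"]).mean()

# One row per customer now, so K-means applies directly.
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(customer_vectors)
print(dict(zip(customer_vectors.index, labels)))
```

Averaging discards the order of events; the encoder-decoder route in (2) is what preserves sequence information.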
A good read on this topic (though a bit old; you now also have the option of Transformer networks):
Representations for Language: From Word Embeddings to Sentence Meanings
I am quite new to Weka, and I have a dataset of 111 cases with 109 attributes. I am using the feature selection tab in Weka with the CfsSubsetEval evaluator and the BestFirst search method, with leave-one-out cross-validation.
So, how many features does Weka pick, or what is the stopping criterion for the number of features this method selects in each step of the cross-validation?
Thanks,
Gopi
The CfsSubsetEval algorithm searches for a subset of features that work well together (low correlation between the features and high correlation with the target label). The score of a subset is called the merit (you can see it in the output).
The BestFirst search won't let you fix the number of features to select. However, you can use other methods such as GreedyStepwise, or use the InfoGain/GainRatio evaluators with the Ranker search method, and define the size of the feature set.
Another option that influences the size of the set is the direction of the search (forward, backward, ...).
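For intuition, here is a minimal sketch of how the merit of a subset of k features is computed. Note that Weka actually measures the correlations with symmetrical uncertainty on discretized attributes; plain Pearson correlation stands in here for illustration:

```python
# Minimal sketch of the CFS merit score for a feature subset:
# merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean
# feature-target correlation and r_ff the mean feature-feature correlation.
import numpy as np

def cfs_merit(X_subset, y):
    k = X_subset.shape[1]
    r_cf = np.mean([abs(np.corrcoef(X_subset[:, j], y)[0, 1])
                    for j in range(k)])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X_subset[:, i], X_subset[:, j])[0, 1])
                    for i in range(k) for j in range(i + 1, k)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```

The search (BestFirst, GreedyStepwise, ...) simply explores subsets and keeps the one with the highest merit, so the size of the selected set is whatever maximizes the merit; that is why BestFirst gives you no fixed feature count.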
Good luck
I have around 300 features and I want to find the best subset of features using feature selection techniques in Weka. Can someone please tell me which method to use to remove redundant features in Weka? :)
There are mainly two types of feature selection techniques that you can use in Weka:
Feature selection with wrapper methods:
"Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.
The search process may be methodical such as a best-first search, it may be stochastic such as a random hill-climbing algorithm, or it may use heuristics, like forward and backward passes to add and remove features.
An example of a wrapper method is the recursive feature elimination algorithm." [From http://machinelearningmastery.com/an-introduction-to-feature-selection/]
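As an illustration outside Weka, here is a minimal scikit-learn sketch of a wrapper method, recursive feature elimination, on synthetic data (the model and sizes are arbitrary stand-ins):

```python
# Minimal sketch of a wrapper method: recursive feature elimination (RFE)
# repeatedly fits a model and drops the weakest features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=300,
                           n_informative=10, random_state=0)

# Drop 10 features per iteration until 20 remain.
selector = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=20, step=10)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```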
Feature selection with filter methods:
"Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable.
Examples of filter methods include the Chi-squared test, information gain and correlation coefficient scores." [From http://machinelearningmastery.com/an-introduction-to-feature-selection/]
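And a minimal scikit-learn sketch of a filter method: score every feature with a Chi-squared test and keep the top k, roughly the analogue of a Chi-squared evaluator with the Ranker search in Weka:

```python
# Minimal sketch of a filter method: rank features by a Chi-squared test
# and keep the k best, independently of any predictive model.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # chi2 requires non-negative features

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("chi2 scores:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
```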
If you are using the Weka GUI, then you can take a look at two of my video casts here and here.
I am currently working on an existing system that recommends items similar to items the user has previously liked.
It uses alternating least squares (ALS) collaborative filtering to find feature vectors for users and items. It then takes the items' feature vectors and uses the cosine similarity measure to find items similar to a given item.
However, I would like some clarification as to whether this is item-based CF or content-based filtering. My inclination is that it is both: it uses a similarity measure to compare items, but the comparison operates on the content of the feature vectors.
Thanks,
If I understand correctly that you extract the feature vectors for the items from users-like-items data, then it is pure item-based CF.
For it to be content-based filtering, features of the items themselves would have to be used: for example, if the items are movies, content-based filtering would use features such as the length of the movie or its director, and so on, but not features derived from other users' preferences.
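To make the distinction concrete, here is a minimal sketch of item-item cosine similarity computed purely from latent item factors (item_factors is a hypothetical stand-in for ALS output; since no item content is involved, this is item-based CF):

```python
# Minimal sketch: item-item cosine similarity on latent item factors,
# e.g. the item matrix produced by ALS. No item content is used.
import numpy as np

rng = np.random.default_rng(0)
item_factors = rng.normal(size=(5, 8))  # hypothetical (n_items, n_factors)

# Normalize rows so the full similarity matrix is one dot product.
unit = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
similarity = unit @ unit.T

# Most similar item to item 0, excluding item 0 itself.
print("nearest neighbour of item 0:", np.argsort(similarity[0])[::-1][1])
```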
I guess your inclination is right: you are combining both content-based and collaborative filtering. In the content-based view, the vectors of items and users can be treated as the data points x_i, whereas A_ij, the cell in the input matrix stating what rating user i has given to item j, plays the role of the target y_i.
You are using cosine similarity to find item-item and user-user similarity.
I guess in your scenario you should go for the collaborative approach.
Try building an item-item matrix and then calculating the cosine similarity.