How to weigh features or determine feature importance in unsupervised learning

I have two sets, each with 15-20 attributes. I am using similarity/distance metrics like Jaccard or Hamming to find the similarity/distance between the two sets.
I am looking for a way to weight the features before computing the similarity of the two sets; for example, attribute 1 would carry more weight than attribute 2 in determining the similarity.
I understand that feature importance can be determined when we have a target variable, but how can this be done when we do not have a target?
Will options like PCA, or filter methods such as calculating the variance, help here? If yes, are there any references?
The attributes are mostly categorical, both nominal and ordinal.
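For illustration, here is a minimal sketch of weighting the attributes directly inside the distance computation, assuming label-encoded nominal attributes and hand-picked weights (both `scipy.spatial.distance.hamming` and `jaccard` accept a `w` argument; the values below are made up):

```python
import numpy as np
from scipy.spatial.distance import hamming

# Two hypothetical records with five label-encoded categorical attributes.
a = np.array([0, 2, 1, 3, 0])
b = np.array([0, 1, 1, 3, 2])

# Assumed per-attribute weights: attribute 1 counts twice as much as the rest.
w = np.array([2.0, 1.0, 1.0, 1.0, 1.0])

# Each position's mismatch indicator is weighted before averaging,
# giving a weighted proportion of disagreeing attributes.
print(hamming(a, b, w=w))
```

The open question, of course, is how to choose `w` without a target, which is what the related threads below address.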

Related

Sklearn k-means clustering (weighted), determining optimum sample weight for each feature?

K-means clustering in sklearn, number of clusters is known in advance (it is 2).
There are multiple features. Feature values initially have no weights assigned, i.e. they are treated as equally weighted. However, the task is to assign custom weights to each feature in order to get the best possible cluster separation.
How can one determine optimum weights (sample_weight) for each feature in order to get the best possible separation of the two clusters?
If this is not possible for k-means, or for sklearn, I am interested in any alternative clustering solution; the point is that I need a method that automatically determines appropriate weights for multivariate features in order to maximize cluster separation.
In the meantime, I have implemented the following: clustering on each component separately, then calculating the silhouette score, Calinski-Harabasz score, Dunn score, and inverse Davies-Bouldin score for each component (feature) separately, then scaling those scores to the same magnitude and reducing them to one feature with PCA. This produced weights for each component. It seems this approach produces reasonable results. I suppose a better approach would be a full factorial experiment (DOE), but this simple approach seems to produce satisfactory results as well.
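A minimal sketch of the per-feature scoring idea on synthetic data, using only the silhouette score (the full approach above would also compute the Calinski-Harabasz, Dunn, and inverse Davies-Bouldin scores and combine them via PCA). Note that sklearn's `sample_weight` weights samples, not features, which is why the weighting happens outside the estimator here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: 200 samples, 3 features, two underlying groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))])

scores = []
for j in range(X.shape[1]):
    xj = X[:, [j]]  # cluster on a single feature at a time
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(xj)
    scores.append(silhouette_score(xj, labels))

# Rescale the per-feature separation scores into weights summing to 1.
weights = np.array(scores) / np.sum(scores)
print(weights)
```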

Feature selection - how to go about it when you have way too many features?

Let's assume you have 1,400 columns/data points for 200k entries and your goal is to determine which of these columns show the most signal towards a simple classification task.
I've already removed columns based on a null-value threshold and low variance, plus categorical columns with bad or too many levels, and I still have 900+ columns.
I can use lasso if I only include the 500+ numerical columns, but if I try to include the categorical as well I keep crashing, it's too much data to process.
How would you go about further reducing features in that case? My goal, more than the classification itself, is to identify the features that bring in the most information towards the classification task.
You could use a data-driven approach; the simplest one would be to apply L1 regularisation to a logistic regression (fit on your simple classification task) and, looking at the weights, select the features whose weights are not zero or close to zero.
Basically, the L1 norm on the model weights enforces sparsity of the weight vector, and in doing so, the only surviving weights are the ones corresponding to the "important" features.
In any case, be careful to normalise the data before using this technique, and also be careful about mixing categorical and scalar features...
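A minimal sketch of this L1 selection on made-up numeric data (the data, the `C` value, and the threshold are all assumptions to be tuned to your problem):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data; only the first two columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Normalise first, then fit an L1-penalised logistic regression.
# Smaller C means stronger sparsity (fewer surviving weights).
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
)
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_.ravel()
print("surviving features:", np.flatnonzero(np.abs(coefs) > 1e-6))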
You can also use a neural network and then compute the gradient w.r.t. the input to see what influences the decision most.
Or some other technique like: https://link.springer.com/chapter/10.1007/978-3-030-33778-0_24
Alternatively, you can use a Random Forest model and compute feature importances, as in: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
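For instance, a short sketch of the forest-based ranking on the same kind of made-up data as above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data reusing the setup above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, one value per column; sort descending.
ranking = np.argsort(forest.feature_importances_)[::-1]
print("top features:", ranking[:5])
```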

How to get the final equation that the Random Forest algorithm uses on your independent variables to predict your dependent variable?

I am working on optimizing a manufacturing based dataset which consists of a huge number of controllable parameters. The goal is to attain the best run settings of these parameters.
I familiarized myself with several predictive algorithms while doing my research. If I, say, use Random Forest to predict my dependent variable in order to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest that explains how your covariates affect the dependent variable. For that you can use a different, more suitable model, e.g., linear regression (perhaps with kernel functions) or a decision tree. Note that you can use one model for prediction and another for descriptive analysis; there is no inherent reason to stick with a single model.
"use Random Forest to predict my dependent variable to understand how important each independent variable is"
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely the actual relationship. Most random forest packages have a method quantifying how much each covariate affected the model over the training set.
There are a number of methods to estimate feature importance from a trained model. For Random Forest, the best-known methods are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy). Many popular ML libraries support feature importance estimation out of the box for Random Forest.
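As a sketch of both methods in scikit-learn, on hypothetical manufacturing-style data (all values below are made up; `feature_importances_` gives MDI, while `permutation_importance` is an MDA-style measure):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data: X are controllable parameters, y is the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# MDI: impurity-based importances, available directly on the fitted model.
print("MDI:", forest.feature_importances_)

# MDA-style: permutation importance on held-out data (the drop in score
# when each column is shuffled).
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
print("MDA:", result.importances_mean)
```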

Machine Learning: Weighting Training Points by Importance

I have a set of labeled training data, and I am training a ML algorithm to predict the label. However, some of my data points are more important than others. Or, analogously, these points have less uncertainty than the others.
Is there a general method to include an importance-representing weight to each training point in the model? Are there instead some specific models which are capable of this while others are not?
I can imagine duplicating these points (and perhaps smearing their features slightly to avoid exact duplicates), or downsampling the less important points. Is there a more elegant way to approach this problem?
Scikit-learn allows you to pass an array of sample weights while fitting the model (the sample_weight argument of fit). Vowpal Wabbit (an online ML library) also has this option.
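A minimal sketch with made-up data, where the last 20 points are assumed to be twice as important:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data; the last 20 points are deemed twice as important
# (e.g. measured with less uncertainty).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

weights = np.ones(len(y))
weights[-20:] = 2.0  # up-weight the more trusted points

# Most scikit-learn estimators accept sample_weight in fit().
clf = LogisticRegression().fit(X, y, sample_weight=weights)
```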

How to preprocess high cardinality categorical features?

I have a data file with features of different mobile devices. One categorical column has 1421 distinct values. I am trying to train a logistic regression model on this column along with the other data that I have.
My question is: Will the high cardinality column described above affect the model I am training? If yes, how do I go about preprocessing this column so that it has lower number of distinct values?
You can calculate the weight of evidence (WOE) to transform your numeric or categorical variable. Refer to this link for an explanation of WOE: http://www.kdnuggets.com/2016/08/include-high-cardinality-attributes-predictive-model.html
The best you could do here is to group the categories using whatever domain knowledge you have, for example grouping phones by brand. If you do not have that information, you could group the categories by frequency: any category that does not represent more than 5% of the data could be grouped as "other". You can use both of these methods together as well; a sketch of the frequency-based grouping follows below. For more information, please refer to this article.
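A minimal sketch of the frequency-based grouping in pandas; `device_type` is a hypothetical high-cardinality column and the 5% cutoff is the threshold suggested above:

```python
import numpy as np
import pandas as pd

# Made-up column with two common categories and many rare ones.
rng = np.random.default_rng(0)
values = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
df = pd.DataFrame({"device_type": rng.choice(values, size=1000)})

freq = df["device_type"].value_counts(normalize=True)
rare = freq[freq < 0.05].index  # categories under the 5% threshold

# Replace rare categories with a single "other" level.
df["device_grouped"] = df["device_type"].where(
    ~df["device_type"].isin(rare), other="other"
)
print(df["device_grouped"].value_counts())
```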
Since logistic regression is a distance-based model (mostly least-squares-like), it suffers from the curse of dimensionality.
Typically, dimensionality reduction techniques (such as PCA and FA) are performed in order to decide which features are the most significant.
For example, in the case of PCA, which is the most popular and most easily employed dimensionality reduction technique, significance is defined by the largest variation of values.
By performing PCA, you "wash out" variables that are insignificant yet can cause overfitting. I suggest you familiarize yourself with topics such as PCA, FA, and SVD.
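A minimal sketch of PCA in scikit-learn on a hypothetical numeric matrix (categorical columns would need to be encoded first; the 95% variance threshold is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric matrix; standardise first so no single column
# dominates the variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```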
