Feature extraction for multiple sub-features - machine-learning

I would like to conduct some feature extraction(or clustering) for dataset containing sub-features.
For example, dataset is like below. The goal is to classify the type of robot using the data.
Samples : 100 robot samples [Robot 1, Robot 2, ..., Robot 100]
Classes : 2 types [Type A, Type B]
Variables : 6 parts, and 3 sub-features for each parts (total 18 variables)
[Part1_weight, Part1_size, Part1_strength, ..., Part6_size, Part6_strength, Part6_weight]
I want to conduct feature extraction with [weight, size, strength], and use extracted feature as a representative value for the part.
In short, my aim is to reduce the feature to 6 - [Part1_total, Part2_total, ..., Part6_total] - and then, classify the type of robot with those 6 features. So, make combined feature with 'weight', 'size', and 'strength' is the problem to solve.
First I thought of applying PCA (Principal Component Analysis), because it is one of the most popular feature extraction algorithm. But it considers all 18 features separately, so 'Part1_weight' can be considered as more important than 'Part2_weight'. But what I have to know is the importance of 'weights', 'sizes', and 'strengths' among samples, so PCA seems to be not applicable.
Is there any supposed way to solve this problem?

If you want to have exactly one feature per part I see no other way than performing the feature reduction part-wise. However, there might be better choices than simple PCA. For example, if the parts are mostly solid, their weight is likely to correlate with the third power of the size, so you could take the cubic root of the weight or the cube of the size before performing the PCA. Alternatively, you can take a logarithm of both values, which again results in a linear dependency.
Of course, there are many more fancy transformations you could use. In statistics, the Box-Cox Transformation is used to achieve a normal-looking distribution of the data.
You should also consider normalising the transformed data before performing the PCA, i.e. subtracting the mean and dividing by the standard deviations of each variable. It will remove the influence of units of measurement. I.e. it won't matter whether you measure weight in kg, atomic units, or Sun masses.

If the Part's number makes them different from one another (e.g Part1 is different from Part2, doesn't matter if their size, weight, strength parameters are identical), you can do PCA once for each Part. Using only the current Part's size, weight and strength as parameters in the current PCA.
Alternatively, if the Parts array order doesn't matter, you can do only one PCA using all (size, weight, strength) parameter triples, not differing them by their part number.

Related

How to select features for clustering?

I had time-series data, which I have aggregated into 3 weeks and transposed to features.
Now I have features: A_week1, B_week1, C_week1, A_week2, B_week2, C_week2, and so on.
Some of features are discreet, some - continuous.
I am thinking of applying K-Means or DBSCAN.
How should I approach the feature selection in such situation?
Should I normalise the features? Should I introduce some new ones, that would somehow link periods together?
Since K-means and DBSCAN are unsupervised learning algorithms, selection of features over them are tied to grid search. You may want to test them to evaluate such algorithms based on internal measures such as Davies–Bouldin index, Silhouette coefficient among others. If you're using python you can use Exhaustive Grid Search to do the search. Here is the link to the scikit library.
Formalize your problem, don't just hack some code.
K-means minimizes the sum of squares. If the features have different scales they get different influence on the optimization. Therefore, you carefully need to choose weights (scaling factors) of each variable to balance their importance the way you want (and note that a 2x scaling factor does not make the variable twice as important).
For DBSCAN, the distance is only a binary decision: close enough, or not. If you use the GDBSCAN version, this is easier to understand than with distances. But with mixed variables, I would suggest to use the maximum norm. Two objects are then close if they differ in each variable by at most "eps". You can set eps=1, and scale your variables such that 1 is a "too big" difference. For example in discrete variables, you may want to tolerate one or two discrete steps, but not three.
Logically, it's easy to see that the maximum distance threshold decomposes into a disjunction of one-variablea clauses:
maxdistance(x,y) <= eps
<=>
forall_i |x_i-y_i| <= eps

If a dataset has multiple columns all in different formats, what would be the best approach to deal with such data?

Say, a dataset has columns like length and width which can be float, and it can also have some binary elements (yes/no) or discrete numbers (categories transformed into numbers). What would it be wise to simply use all these as features without having to worry about the formats (or more like the nature of the features)? When doing normalization, can we just normalize the discrete numbers the same way as continuous numbers? I'm really confused on dealing with multiple formats.....
Yes, you can normalize discrete values. But it ought have no real
effect on learning - normalization would be required if you are
doing some form of a similarity measurement, which is not the case
for factor variables. There are some special cases like neural
networks, which are sensible to the scale of input\output and the
size of weights (see 'vanishing\exploding gradient' topic). Also it
would make some sense if you are doing a clustering on your data.
Clustering uses some kind of a distance measure so it would be
better to have all features on the same scale.
There is nothing special with categorical stuff, except that some of
the learning methods are especially good at using categorical
features, some - at using real-valued features, and some are good at
both.
My first choice for mix of categorical and real-valued features would be to use some tree-based methods (RandomForest or Gradient Boosting Machine) and the second one - ANNs.
Also, extremely good approach at handling factors (categorical variables) is to convert them into set of Boolean variables. For example if you have a factor with five levels (1,2,3,4 and 5) a good way to go would be to convert it into 5 features with 1 in a column representing one of the levels.

Fuzzy clustering using unsupervised dimensionality reduction

An unsupervised dimensionality reduction algorithm is taking as input a matrix NxC1 where N is the number of input vectors and C1 is the number of components for each vector (the dimensionality of the vector). As a result, it returns a new matrix NxC2 (C2 < C1) where each vector has a lower number of component.
A fuzzy clustering algorithm is taking as input a matrix N*C1 where N, here again, is the number of input vectors and C1 is the number of components for each vector. As a result, it returns a new matrix NxC2 (C2 usually lower than C1) where each component of each vector is indicating the degree to which the vector belongs to the corresponding cluster.
I noticed that input and output of both classes of algorithms are the same in structure, only the interpretation of the results changes. Moreover, there no fuzzy clustering implementation in scikit-learn, hence the following question:
Does it make sense to use a dimensionality reduction algorithm to perform fuzzy clustering?
For instance, is it a non-sense to apply FeatureAgglomeration or TruncatedSVD to a dataset built from TF-IDF vectors extracted from textual data, and interpret the results as a fuzzy clustering?
In some sense, sure. It kind of depends on how you want to use the results downstream.
Consider SVD truncation or excluding principal components. We have projected into a new, variance-preserving space with essentially few other restrictions on the structure of the new manifold. The new coordinate representations of the original data points could have large negative numbers for some elements, which is a little weird. But one could shift and rescale the data without much difficulty.
One could then interpret each dimension as a cluster membership weight. But consider a common use for fuzzy clustering, which is to generate a hard clustering. Notice how easy this is with fuzzy cluster weights (e.g. just take the max). Consider a set of points in the new dimensionally-reduced space, say <0,0,1>,<0,1,0>,<0,100,101>,<5,100,99>. A fuzzy clustering would given something like {p1,p2}, {p3,p4} if thresholded, but if we took the max here (i.e. treat the dimensionally reduced axes as membership, we get {p1,p3},{p2,p4}, for k=2, for instance. Of course, one could use a better algorithm than max to derive hard memberships (say by looking at pairwise distances, which would work for my example); such algorithms are called, well, clustering algorithms.
Of course, different dimensionality reduction algorithms may work better or worse for this (e.g. MDS which focuses on preserving distances between data points rather than variances is more naturally cluster-like). But fundamentally, many dimensionality reduction algorithms implicitly preserve data about the underlying manifold that the data lie on, whereas fuzzy cluster vectors only hold information about the relations between data points (which may or may not implicitly encode that other information).
Overall, the purpose is a little different. Clustering is designed to find groups of similar data. Feature selection and dimensionality reduction are designed to reduce the noise and/or redundancy of the data by changing the embedding space. Often we use the latter to help with the former.

Reducing a matrix of feature vectors to a single, meaningful vector

I have matrices of feature vectors - 200 features long, in which the feature vectors within a matrix are temporally related, but I wish to reduce each matrix to a single, meaningful vector. I have applied PCA to the matrix in order to reduce its dimensionality to one with high variance, and am considering concatenating its rows together into one feature vector to summarize the data.
Is this a sensible approach, or are there better ways of achieving this?
So you have an n x 200 feature matrix, where n is your number of samples, and 200 features per sample, and each feature is temporally related to all others? Or you have individual feature matrices, one for each time point, and you want to run PCA on each of these individual feature matrices to find a single eigenvector for that time point, and then concatenate those together?
PCA seems more useful in the second case.
While this is doable, this is maybe not the best way to go about it because you lose temporal sensitivity by collapsing together features from different times. Even if each feature in your final feature matrix represents a different time, most classifiers cannot learn about the fact that feature 2 follows feature 1 etc. So you lose the natural temporal ordering by doing this.
If you care about the the temporal relationship between these features you may want to take a look at recurrent neural networks, which allow you feed information from t-1 into a node, at the same time as feeding in your current t features. So in a sense they learn about the relationship between t-1 and t features which will help you preserve temporal ordering. See this for an explanation: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
If you don't care about time and just want to group everything together, then yes PCA will help reduce your feature count. Ultimately it depends what type of information you think is more relevant to your problem.

How to pre-process dataset for maximum effectiveness with LibSVM Weka implementation

So I read a paper that said that processing your dataset correctly can increase LibSVM classification accuracy dramatically...I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a lot smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Please advice on these questions and anything else you think I might have missed...
Thanks in advance!!
Normalization is very important, as it influences the concept of distance which is used by SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is the most common approach by far. It is necessary to prevent some input dimensions to completely dominate others. Recommended by the LIBSVM authors in their beginner's guide (Appendix B for examples).
Scale each instance to a given length. This is common in text mining / computer vision.
As to handling types of inputs:
Continuous: no work needed, SVM works on these implicitly.
Ordinal: treat as continuous variables. For example cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that a dog and bird are more similar than a cat and bird which is nonsense.
Normalization must be done after substituting inputs where necessary.
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data
values?
Yes, standardize all (final) input dimensions to the same interval (including dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a lot smaller number of
bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No, SVM has no concept of binary variables and treats everything as numeric. So converting it will just lead to an extra type-cast internally.
Should I normalize the Day values? Does it even make sense to do
that?
If you want to use 1 input dimension, you must normalize it just like all others.
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Nominal to binary, using one-hot encoding.

Resources