Non-uniform feature space - machine learning

I'm building a machine learning algorithm for predicting interest rates. The problem I'm dealing with is that one of the features has a different format for old records.
In more detail: for old contracts (records), one of the features, a matrix of prices, has 15x12 cells, whereas for the new ones it has 14x15 elements. The reason is changes in the financial markets.
The problem is that until now I was doing PCA on a subset of uniform matrices in order to do standardization and dimensionality reduction, and there were no problems. Does anyone know how I can reach the same objectives in this new context?
Thanks for any suggestion!
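
For reference, a minimal sketch of the uniform-case pipeline the question describes (flattening each 15x12 price matrix, standardizing, then running PCA); the array names, sample count, and component count are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical stack of old-format contracts: 500 price matrices of shape 15x12.
old_matrices = np.random.rand(500, 15, 12)

# Flatten each matrix into one feature vector (15 * 12 = 180 features per contract).
X = old_matrices.reshape(len(old_matrices), -1)

# Standardize every feature, then reduce dimensionality with PCA.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=20)           # the number of components is an arbitrary choice
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape)               # (500, 20)
```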

Related

Machine Learning - Feature Selection or Dimensionality Reduction?

I am still exploring this area of Machine Learning and although I know the difference between Feature Selection and Dimensionality Reduction, I am finding it difficult to grasp when to do Feature Selection or Dimensionality Reduction (or both together).
Assuming that I have a dataset with around 40 features, is it good practice to perform Dimensionality Reduction alone or Feature Selection alone? Or should there be a hybrid of both approaches (i.e. do feature selection first and then dimensionality reduction, or vice versa)?
The term feature selection is a bit misleading. It can have two meanings:
Selecting features by incorporating domain knowledge (this may involve constructing new features as well).
For example, finding the rotation invariant points in an image data set or creating BMI as a new feature when you have height and weight as features.
Keeping only the features of high importance according to some measure.
This is one step of the dimensionality reduction process, which actually involves two steps:
Transforming the original features to new (artificial) features by changing the basis.
e.g. PCA does this by finding a set of orthogonal axes such that the variance along each axis is maximized.
Keeping only the most important features (importance defined by some measure) resulting from the above step. This is actually a feature selection step.
e.g. in PCA, this is achieved by keeping only the top-k components with the highest explained variance.
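
A rough illustration of these two steps with scikit-learn's PCA (the toy data and the choice of k are made up for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))       # toy data set with 40 original features

# Step 1: change of basis - project onto orthogonal directions of maximal variance.
pca_full = PCA().fit(X)

# Step 2: feature selection on the new axes - keep only the top-k components.
k = 10
X_reduced = PCA(n_components=k).fit_transform(X)

print(pca_full.explained_variance_ratio_[:k].sum())  # fraction of variance kept by the top-k components
print(X_reduced.shape)                                # (200, 10)
```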
As for the order in which feature selection and dimensionality reduction should happen: I think this is problem dependent.
If there's enough domain knowledge to construct/select features to cater to the problem at hand, we should do the manual feature engineering (plus selection) first. If this feature engineering/selection process still results in a large number of features, then the so-called dimensionality reduction can be done to find a subspace that can represent the data with an even smaller number of entirely new features that have almost no real-world meaning.
If domain knowledge can't add anything to the data set, just doing dimensionality reduction would be fine, since it actually contains a feature selection step within it.
In a broad sense, we can think of feature selection as a special case of dimensionality reduction in which no basis change is applied to the original data set.

Feature selection & significant features in kNN

I am developing a recommendation engine with the help of kNN. The data, though, is sparse: I have around 1500 samples and around 200 features. I have a binary target with values 1 or 0.
What would be the techniques to do feature selection for it? I am assuming that if I choose random forest for feature selection, then the selected features may be different from the features kNN would consider important.
Also, is there any restriction on the number of features, given that I have so few samples?
Feature selection techniques aim to exclude irrelevant features and/or redundant ones. One proven technique is to use supervised discretization based on entropy (a more general explanation can be found here) to meaningfully reduce the size of your data, and then use information gain to get the top k features most correlated with the target variable. There are at least 5 different methods you can try; it also depends on the ML library/framework you are using to implement your app.
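
A small sketch of that ranking idea, using scikit-learn's mutual information estimator as a stand-in for information gain (the data and k are placeholders, and the entropy-based discretization step is not shown):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((1500, 200))                # toy data: 1500 samples, 200 features
y = rng.integers(0, 2, size=1500)          # binary target

# Rank features by mutual information with the target and keep the top k.
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (1500, 20)
print(selector.get_support(indices=True))  # indices of the selected features
```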
I would try the Relief algorithm, since its core part is a nearest-neighbour search.

Principal component analysis vs feature removal

I'm quite new to machine learning and just got introduced to principal component analysis as a dimensionality reduction method. What I don't understand is: in which circumstances is PCA any better than simply removing some features from the model? If the aim is to obtain lower-dimensional data, why don't we just group the features that are correlated and retain a single feature from each group?
There is a fundamental difference between feature reduction (such as PCA) and feature selection (which you describe). The crucial difference is that feature reduction (PCA) maps your data to a lower-dimensional space through some projection of all the original dimensions; for example, PCA uses a linear combination of them all. So the final data embedding has information from all features. If you perform feature selection you discard information: you completely lose anything that was present in the dropped features. Furthermore, PCA guarantees that you retain a given fraction of the data variance.
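
For instance, scikit-learn's PCA can be told directly what fraction of the variance to retain, and every component it keeps is a linear combination of all original features (synthetic data, for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))

# Keep as many components as needed to retain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Each retained component mixes all 30 original features.
print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())   # >= 0.95
```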

How to calculate distance when we have sparse dataset in K nearest neighbour

I am implementing the K nearest neighbour algorithm for very sparse data. I want to calculate the distance between a test instance and each sample in the training set, but I am confused, because most of the features in the training samples don't exist in the test instance, or vice versa (missing features).
How can I compute the distance in this situation?
To make sure I'm understanding the problem correctly: each sample forms a very sparsely filled vector. The missing data differs between samples, so it's hard to use Euclidean or any other distance metric to gauge the similarity of samples.
If that is the scenario, I have seen this problem show up before in machine learning - in the Netflix Prize contest, though not specifically applied to kNN. The scenario there was quite similar: each user profile had ratings for some movies, but almost no user had seen all 17,000 movies. The average user profile was quite sparse.
Different folks had different ways of solving the problem, but the way I remember it, they plugged in dummy values for the missing entries, usually the mean of that feature across all samples that had it. Then they used Euclidean distance, etc. as normal. You can probably still find discussions of this missing-value problem on those forums. It was a particularly common problem for those trying to implement singular value decomposition, which became quite popular and so was discussed quite a bit, if I remember right.
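
A minimal sketch of that mean-imputation approach, assuming missing values are encoded as NaN (the tiny arrays are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import NearestNeighbors

# Toy training set and test instance with missing entries marked as NaN.
X_train = np.array([[5.0, np.nan, 3.0],
                    [np.nan, 2.0, np.nan],
                    [4.0, 1.0, np.nan]])
x_test = np.array([[np.nan, 2.0, 4.0]])

# Plug in the per-feature mean for every missing value, then use Euclidean distance as usual.
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)
x_test_filled = imputer.transform(x_test)

nn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X_train_filled)
distances, indices = nn.kneighbors(x_test_filled)
print(distances, indices)
```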
You may wish to start here:
http://www.netflixprize.com//community/viewtopic.php?id=1283
You're going to have to dig for a bit. Simon Funk had a slightly different approach to this, but it was more specific to SVDs. You can find it here: http://www.netflixprize.com//community/viewtopic.php?id=1283
He calls them blank spaces if you want to skip to the relevant sections.
Good luck!
If you work in a very high-dimensional space, it is better to do dimensionality reduction using SVD, LDA, pLSV or similar on all available data, and then train the algorithm on data transformed that way. Some of those algorithms are scalable, so you can find implementations in the Mahout project. Personally, I prefer using more general features rather than such transformations, because debugging and feature selection are easier. For that purpose, combine some features, use stemmers, think more generally.
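
A brief sketch of that idea using truncated SVD on a sparse matrix (scikit-learn and scipy stand in for Mahout here purely for illustration; shapes, density and component count are arbitrary):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

# Very sparse toy data: 1000 samples, 5000 features, roughly 1% non-zero entries.
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=1000)

# Reduce the space with truncated SVD (works directly on sparse input),
# then train the learner on the transformed data.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_reduced, y)
print(X_reduced.shape)               # (1000, 100)
```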

Machine learning - SVM feature fusion technique

For my final thesis I am trying to build a 3D face recognition system by combining color and depth information. The first step I did is to realign the data head to a given model head using the iterative closest point algorithm. For the detection step I was thinking about using libsvm, but I don't understand how to combine the depth and the color information into one feature vector. They are dependent information (each point consists of color (RGB), depth information and also scan quality). What do you suggest? Something like weighting?
edit:
Last night I read an article about SURF/SIFT features and I would like to use them! Could it work? The concept would be the following: extract these features from the color image and the depth image (range image), and use each feature as a single feature vector for the SVM?
Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application, you will very likely have to employ some multi-class strategy (one-against-all or one-against-one).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.
What you arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example, you can concatenate and do PCA (or some non-linear dimensionality reduction method). But it is kind of hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. This alone might be unreasonable for vision data that may have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
An SVM is characterized by the normal to the maximum-margin hyperplane, whose components specify the weights/importance of the features, such that features with higher absolute weights have a larger impact on the decision function. Thus the SVM assigns weights to the features all on its own.
In order for this to work, obviously you would have to normalize all the attributes to the same scale (say, transform all features to be in the range [-1,1] or [0,1]).
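
A brief sketch of that recipe: concatenate the color and depth feature vectors, rescale them to a common range, and let a linear SVM weight the result (all arrays here are invented placeholders):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
color_features = rng.random((200, 64))   # e.g. descriptors from the color image
depth_features = rng.random((200, 32))   # e.g. descriptors from the range image
y = rng.integers(0, 2, size=200)         # binary labels for the toy example

# Concatenate the two feature sets into one vector per sample.
X = np.hstack([color_features, depth_features])

# Normalize all attributes to the same scale, e.g. [0, 1].
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# The learned weight vector (the normal to the separating hyperplane) shows how
# much each concatenated feature contributes to the decision function.
svm = LinearSVC().fit(X_scaled, y)
print(svm.coef_.shape)                   # (1, 96)
```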

Resources