Choosing features and restoring features using K-Means in scikit-learn - machine-learning

I want to do some K-Means clustering in scikit-learn. I have 9 features, but I only want to use four of them for clustering. Also, since each of those four features is measured on a different scale, I want to normalize them before clustering. However, I want to list each data point in its original form together with its assigned cluster. What should I do?

You can always use the original data points.
Either recompute the centroids in the original data, or apply the inverse normalization (z-normalization is reversible!); but the latter will only give you values for the four attributes you used.
Recomputing the centroids in the original data is trivial, and will give you information on the other attributes as well (provided you can compute a mean, i.e. they aren't e.g. categorical; for categorical attributes you might want to look at the mode instead).
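A minimal sketch of both options, assuming the data sit in a pandas DataFrame df whose nine columns include the four (hypothetically named f1..f4 here) that you cluster on:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cols = ["f1", "f2", "f3", "f4"]                    # the four features used for clustering
scaler = StandardScaler()
X = scaler.fit_transform(df[cols])                 # z-normalize only those four columns

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
df["cluster"] = km.labels_                         # each original, unscaled row keeps its cluster label

centroids_4d = scaler.inverse_transform(km.cluster_centers_)      # centroids back in original units (4 features)
centroids_all = df.groupby("cluster").mean(numeric_only=True)     # centroids recomputed over all numeric columns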

Related

Understanding multiple Linear regression

I am working on a multiple regression problem. I have the data set below.
rank--discipline--yrs.since.phd--yrs.service--sex--salary
[ 1 1 19 18 1 139750],......
I am taking salary as the dependent variable and the other variables as independent variables. After preprocessing the data, I ran gradient descent on the regression model and estimated the bias (intercept) and a coefficient for each independent feature.
I want to do a scatter plot of the actual values and draw the regression line for the hypothesis I predicted. Since we have more than one feature here, I have the questions below.
While plotting the actual values (scatter plot), how do I decide the x-axis values? Meaning, I have a list of values, for example the first row [1,1,19,18,1] => 139750. How do I transform or map [1,1,19,18,1] to the x-axis? I need to somehow reduce [1,1,19,18,1] to one value so I can mark a point (x, y) in the plot.
While plotting the regression line, what would the feature values be, so I can calculate the hypothesis value? Meaning, I now have the intercept and the weights of all features, but I don't have the feature values. How do I decide upon the feature values?
I want to calculate the points and use matplotlib to do the job. I am aware that there are a lot of tools available, including matplotlib, that can do this, but I want to get a basic understanding.
Thanks.
I am still not sure I completely understand your question, so if something is not what you expected, comment below and we will work it out.
Now,
Query 1: In all your datasets you are going to have multiple inputs, and there is no way to view the target variable (salary in your case) with respect to all of them in a single graph. What is usually done is either to apply dimensionality reduction to your data using t-SNE (link) or principal component analysis (PCA), making your output a function of two or three variables, and then plot it; or, the technique I prefer, to plot the target vs. each variable separately as subplots. The reason is that we simply have no way to visualize data in more than three dimensions.
Query 2: If you are not determined to use matplotlib, I would suggest seaborn.regplot(), but let's also do it in matplotlib. Suppose the pair you want to plot first is 'discipline' vs 'salary'.
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
X = df[['discipline']]   # single predictor, kept two-dimensional for scikit-learn
Y = df['salary']         # target variable
lm.fit(X, Y)
After running this, lm.coef_ will give you the coefficient and lm.intercept_ will give you the intercept of the linear equation for this variable; then you can easily plot the data points and the fitted line with matplotlib.
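For instance, a small sketch of the plot itself, assuming the df and lm from the snippet above:
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(df["discipline"], df["salary"], label="actual")          # actual data points

x_line = np.linspace(df["discipline"].min(), df["discipline"].max(), 100)
y_line = lm.intercept_ + lm.coef_[0] * x_line                        # fitted line from intercept + coefficient
plt.plot(x_line, y_line, color="red", label="fitted line")

plt.xlabel("discipline")
plt.ylabel("salary")
plt.legend()
plt.show()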
What you can do is ->
from pandas import plotting as pdplt
pdplt.scatter_matrix(dataframe, figsize=(10, 10))   # plus any other parameters you need
This gives you a matrix of plots (in your case 6x6) which shows exactly how each column in your dataframe relates to the other columns, so you can clearly visualise which features dominate the result and how the features are correlated with each other.
If you ask me, this is the first thing I do with this type of problem; I then remove all correlated features and select the features which best approximate the output.
But since you have to produce a 2D plot, and with the above approach you might find more than a single feature dominating the output, what you can do next is a miracle named PCA.
If you ask me, PCA is one of the most beautiful things in machine learning. What it does is merge all your features in some magical ratio to generate the principal components of your data. Principal components are the components that make the major contribution to your model. You apply PCA by simply importing it from sklearn and then selecting the first principal component (as you need a 2D plot), or you might select 2 principal components and plot a 3D graph. But always remember that these principal components are not the real features of your model; they are some magical combination, and how PCA finds them is very interesting (using concepts like eigenvalues and eigenvectors), so you can also build it on your own.
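A rough sketch of applying PCA here, assuming the predictors and 'salary' are in a DataFrame df (standardizing first, which is generally recommended before PCA):
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df.drop(columns=["salary"]))
pc1 = PCA(n_components=1).fit_transform(X)      # keep only the first principal component

plt.scatter(pc1, df["salary"])
plt.xlabel("first principal component")
plt.ylabel("salary")
plt.show()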
Apart from all of these, you can apply singular value decomposition (SVD) to your model, which is at the heart of linear algebra and is a matrix decomposition that exists for every matrix. What it does is decompose your matrix into three matrices, one of which is a diagonal matrix of singular values (scaling factors) in descending order; what you have to do is select the top singular values (in your case only the first one, having the highest magnitude) and reconstruct a feature matrix reduced from 5 columns to 1 column, then plot that. You can do SVD using numpy.linalg.
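And a hedged numpy.linalg sketch of the SVD route, again assuming the predictors are in df (columns centred first):
import numpy as np

A = df.drop(columns=["salary"]).to_numpy(dtype=float)
A = A - A.mean(axis=0)                             # centre each column

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s holds the singular values in descending order
top1 = A @ Vt[0]                                   # 1-column feature built from the top singular vector
# top1 can then be plotted against salary just like the PCA component above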
Once you have applied any one of these methods, you can learn your hypothesis with only the single most important selected feature and finally plot the graph. But take a tip: you should not drop the other important features just for the sake of a 2D plot, because you may have 3 principal components all with almost the same contribution, or the top three singular values may be very close to each other. So take my word for it: take all important features into account, and if you need a visualisation of these important features, use the scatter matrix.
Summary ->
All I want to say is that you can follow the same process with any of these techniques, and you can also invent your own statistical or mathematical model for compressing your feature space.
As for me, I prefer to go with PCA, and in this type of problem I first plot the scatter matrix to get a visual intuition of the data. PCA and SVD also help to remove redundancy and hence overfitting.
For rest details refer to docs.
Happy machine learning...

How to Intelligently Sample Parameter Space while Training a Statistical Classifier

I'm interested in a statistical classification problem. Given a feature vector X, I would like to classify X as either "yes" or "no". However, the training data will be fed in real-time based on human input. For instance, if the user sees feature vector X, the user will assign "yes" or "no" based on their expertise.
Rather than doing grid search on parameter space, I would like to more intelligently explore the parameter space based on the previously submitted data. For example, if there is a dense cluster of "no's" in part of the parameter space, it probably doesn't make sense to keep sampling there - it's probably just going to be more "no's".
How can I go about doing this? The C4.5 algorithm seems to be up this alley, but I'm unsure if this is the way to go.
An additional subtlety is that some of the features might be specifying random data. Suppose that the first two attributes in the feature vector specify the mean and variance of a Gaussian distribution. The data the user classifies could be significantly different, even if all parameters are held equal.
For example, let's say the algorithm displays a sine wave with Gaussian noise added, where the Gaussian distribution is specified by the mean and variance in the feature vector. The user is asked "does this graph represent a sine wave?" Two very similar values of mean or variance could still produce significantly different graphs.
Is there an algorithm designed to handle such cases?
The setting that you're talking about fits in the broad area of Active Learning. This topic addresses the iterative process of model building, and choosing which training examples to query next in order to optimize model performance. Here, the training cost of each data point is roughly the same, and there are no additional variable rewards in the learning phase.
However, in each iteration, if you have a variable reward which is a function of the data point chosen, you would want to look at Multi-Armed Bandits and Reinforcement Learning.
The other issue that you're talking about is one of finding the right features to represent your data points, and should be handled separately.
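For illustration only, here is a minimal sketch of uncertainty sampling, one common active-learning query strategy: fit a probabilistic classifier on the labelled points collected so far and ask the human about the pool point the model is least sure of (all names here are hypothetical):
import numpy as np
from sklearn.linear_model import LogisticRegression

def next_query_index(X_labeled, y_labeled, X_pool):
    """Return the index of the pool point the current model is least certain about."""
    clf = LogisticRegression().fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_pool)[:, 1]          # estimated P(label == "yes")
    uncertainty = -np.abs(proba - 0.5)               # probabilities near 0.5 are most uncertain
    return int(np.argmax(uncertainty))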

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y, z)), choosing a clustering method is quite intuitive. Because we can draw and visualize them, we have a much better sense of which clustering method is more suitable.
e.g. 1: If my 2D data set has the shape shown in the top right corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. Most probably, we have high-dimensional tuples that cannot be visualized like that.
e.g. 2: I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
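For instance, a small sketch that produces such a scatter plot matrix for iris with pandas and scikit-learn:
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
scatter_matrix(iris.data, c=iris.target, figsize=(8, 8), diagonal="hist")   # colour points by class
plt.show()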
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high-dimensional data; there are dozens of popular solutions, including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this, one goes back to the original space and uses techniques that seem reasonable based on observations in the reduced space, or performs clustering in the reduced space itself. The first approach uses all available information, but can be invalid due to differences induced by the reduction process, while the second one ensures that your observations and choice are valid (as you reduce your problem to the nice 2d/3d one) but loses a lot of information due to the transformation used.
One tries many different algorithms and chooses the one with the best metric (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (since reducing the dimensionality introduces changes in the information that follow from the transformation used).
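As a rough sketch of that second approach (assuming your 4-D data is in an array X), one could compare a few algorithms with the silhouette score:
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

Xs = StandardScaler().fit_transform(X)               # X: your (n_samples, 4) data
candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.7, min_samples=5),        # eps/min_samples need tuning per data set
}
for name, model in candidates.items():
    labels = model.fit_predict(Xs)
    if len(set(labels)) > 1:                         # silhouette needs at least two clusters
        print(name, round(silhouette_score(Xs, labels), 3))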
It is true that high-dimensional data cannot be easily visualized in a Euclidean coordinate system, but it is not true that there are no visualization techniques for them.
In addition to this I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 pairs in total) to try to figure out which relations hold between the two (correlation and dependency, generally). Or you can even use a 3d space for three at a time.
Then, how do you get some information from these visualizations? Well, it is not as easy as in a Euclidean space, but the point is to spot visually whether the data clusters into groups (e.g. near certain values on an axis in a parallel coordinates diagram) and to think about whether the data is somehow separable (e.g. whether it forms regions like circles or linearly separable areas in the scatter plots).
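As a small sketch of the parallel coordinates idea with pandas, using the 4-dimensional iris data as a stand-in for your tuples (the class column is only used for colouring):
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
frame = iris.data.copy()
frame["species"] = iris.target_names[iris.target]    # label column used only for colouring
parallel_coordinates(frame, class_column="species", colormap="viridis")
plt.show()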
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist in N dimensions as well). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you have to choose the corresponding clustering algorithm.
I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform them, since you can change their topology with transformations, projections and reductions; check the other answer by lejlot for that), like Weka (plus you do not have to implement all the algorithms by yourself).
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results from different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with Euclidean distance, for example.

Most appropriate normalization / transformation method for skewed features?

I am trying to pre-process biological data to train a neural network, and despite an extensive search and repeated presentations of the various normalization methods, I am none the wiser as to which method should be used when. In particular, I have a number of input variables which are positively skewed, and I have been trying to establish whether there is a normalisation method that is most appropriate.
I was also worried about whether the nature of these inputs would affect the performance of the network, and as such I have experimented with data transformations (log transformation in particular). However, some inputs have many zeros but may also contain small decimal values, and they seem to be highly affected by log(x + 1) (or with any constant from 1 down to 0.0000001, for that matter), with the resulting distribution failing to approach normal (it either remains skewed or becomes bimodal with a sharp peak at the minimum value).
Is any of this relevant to neural networks? I.e. should I be using specific feature transformation / normalization methods to account for the skewed data, or should I just ignore it, pick a normalization method and push ahead?
Any advice on the matter would be greatly appreciated!
Thanks!
As the features in your input vector are of a different nature, you should use a different normalization algorithm for each feature. The network should be fed uniformly scaled data on every input for better performance.
As you wrote that some data is skewed, I suppose you can run some algorithm to "normalize" it. If applying a logarithm does not work, perhaps other functions and methods, such as rank transforms, can be tried out.
If the small decimal values occur entirely in a specific feature, then just normalize that feature in a specific way, so that the values get transformed into your working range: either [0, 1] or [-1, +1], I suppose.
If some inputs have many zeros, consider removing them from the main neural network and creating an additional neural network which will operate on vectors with the non-zeroed features. Alternatively, you may try to run Principal Component Analysis (for example, via an autoassociative network with structure N-M-N, M < N) to reduce the input space dimension and thereby eliminate the zeroed components (they will effectively still be taken into account in the new combined inputs). As a bonus, the new M inputs will be automatically normalized. Then you can pass the new vectors to your actual worker neural network.
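As a non-authoritative sketch of what trying a couple of such transforms might look like on a zero-heavy, positively skewed feature (synthetic data, purely for illustration):
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
x = np.concatenate([np.zeros(300), rng.lognormal(mean=0.0, sigma=1.5, size=700)])

x_log = np.log1p(x)                                        # log(x + 1): compresses the long tail
qt = QuantileTransformer(output_distribution="normal", n_quantiles=200)
x_rank = qt.fit_transform(x.reshape(-1, 1)).ravel()        # rank/quantile transform towards a normal shape

for name, v in [("raw", x), ("log1p", x_log), ("quantile", x_rank)]:
    print(name, "mean - median:", round(float(v.mean() - np.median(v)), 3))   # rough skew indicator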
This is an interesting question. Normalization is meant to keep features' values on a common scale to facilitate the optimization process.
I would suggest the following:
1- Check whether you need to normalize your data at all. If, for example, the means of the variables or features are within the same scale of values, you may proceed without normalization. MSVMpack uses such a normalization check condition in its SVM implementation. If, however, you do need to normalize, you are still advised to also run the models on the data without normalization for comparison.
2- If you know the actual maximum or minimum values of a feature, use them to normalize the feature. I think this kind of normalization would preserve the skewness of the values.
3- Try decimal scaling normalization for other features, if applicable.
Finally, you are still advised to apply different normalization techniques and compare the MSE for every technique, including the z-score, which may harm the skewness of your data.
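For example, min-max scaling with known bounds (point 2) versus the z-score could be compared like this (lo and hi are hypothetical known bounds):
import numpy as np

def minmax_known(x, lo, hi):
    """Scale into [0, 1] using known bounds rather than the sample min/max."""
    return (x - lo) / (hi - lo)

def zscore(x):
    return (x - x.mean()) / x.std()

x = np.array([0.0, 0.1, 0.2, 5.0, 9.5])
print(minmax_known(x, lo=0.0, hi=10.0))   # preserves the shape (and skew) of the values
print(zscore(x))                          # centres and rescales around the mean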
I hope that I have answered your question and provided some support.

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning, and I only found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a more useful form? I mean, how can I use it in practice (in an application)?
Dimensionality Reduction is about converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example - suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You could perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes each genre as a whole, and in this way reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people tend to like movies only in their preferred genres.
However, it's not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
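A toy sketch of that genre idea (the data and the genre assignment per movie are made up here):
import numpy as np

rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(1000, 100))      # 1000 people x 100 movies, 1 = like
genre_of_movie = rng.integers(0, 5, size=100)     # which of the 5 genres each movie belongs to

# fraction of liked movies per genre -> a 5-dimensional representation of each person
genre_scores = np.stack(
    [likes[:, genre_of_movie == g].mean(axis=1) for g in range(5)], axis=1
)
print(genre_scores.shape)                         # (1000, 5)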
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis, which does something similar (and, incidentally, plotting its results was my first real-world programming task).
It's a neat but clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables; to analyse the relationship between these, one obviously plots in two dimensions and might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resultant multidimensional graph to find the set of two or three axes within the graph which contain the largest amount of information. For example, the first principal coordinate will be a composite axis (i.e. at some angle through n-dimensional space) which carries the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.
Plotting the resultant graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. For the technique to be considered valid, it's usual to look for a representation that contains around 70% of the original variation - enough to visualize relationships with some confidence that would otherwise not be apparent in the raw statistics. Notice that the technique requires that all factors have the same weight; but given that, it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980 - which is about as powerful as an iPhone).
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix factorization based methods, etc.
Here's a simple example. You have a lot of text files, and each file consists of some words. These files can be classified into two categories. You want to visualize a file as a point in a 2D/3D space so that you can see the distribution clearly. So you need to do dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
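A hedged sketch of that text example, using TF-IDF vectors and TruncatedSVD (one common choice for sparse text data) to get down to 2 dimensions:
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats",
        "stock prices fell today", "the market rallied after earnings"]

X = TfidfVectorizer().fit_transform(docs)             # one high-dimensional vector per file
X_2d = TruncatedSVD(n_components=2).fit_transform(X)  # reduce to 2 dimensions for plotting

plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.show()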
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space will be 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring longitude, latitude and height measurements to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get on a certain quantity of fuel, it would be far easier to work with the 1-dimensional data than with the 3-dimensional version.
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays) - so any data with more than 3 dimensions needs to somehow be compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed. So we could say that the database is going to have large dimensions.
AAMOF, each database record will actually include a measure of the person's IQ and their shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQ, shoe size is easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data until later. We would still be able to estimate IQs using shoe sizes, because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.

Resources