Handling geospatial coordinates in machine learning

I'm building a machine learning model where some columns are physical addresses (which I can translate into X / Y coordinates) but I'm a little bit confused on how this will be handled by the ML algorithm.
Is there a particular way to translate a geolocation into columns for use in ML (classification and/or regression)?
The choice of features generally depends on what kind of relationship you anticipate between the features and the target variable. You are right that the postcode number itself bears no relation to the target; here the postcode is simply a string, or a category. What kind of model are you planning to use? Linear regression and decision trees are two examples, and they capture relationships in different ways. As an example of a feature, you could compute the straight-line distance between the source and destination and use that in the model, since intuitively, the farther apart they are, the higher the transit time is likely to be. What else does the transit time depend on? See if you can relate the factors influencing the travel time to the information that you have, i.e. the postcodes / XY coordinates, in some way.
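To make the distance feature concrete, here is a minimal sketch. It assumes the X / Y coordinates are latitude and longitude in degrees and uses the haversine (great-circle) formula; the function name, the trip tuples and the Earth radius constant are my own illustrative choices, not something from the question.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical rows: (source_lat, source_lon, dest_lat, dest_lon)
trips = [
    (51.5074, -0.1278, 48.8566, 2.3522),     # London -> Paris
    (40.7128, -74.0060, 40.7128, -74.0060),  # source == destination
]

# One new numeric column that a regression or tree model can use directly
distance_feature = [haversine_km(*trip) for trip in trips]
print(distance_feature)
```

If your coordinates are already projected (e.g. metres on a local grid), a plain Euclidean distance would do instead.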

This summarizes the answer we ended up with in the comments on the question:
This transformation from ZIP codes to geo-coordinates should not be seen as a "split" but only as a way to represent your data in a multidimensional way (in this case the dimension will be 2).
Machine learning algorithms exist for both unidimensional and multidimensional data. The two dimensions can be correlated or uncorrelated, depending on how you define the parameters of the model you choose afterwards.
Moreover, the correlation does not have to be set explicitly in most cases. An initial value may be useful, but many algorithms also rely on random initialization or other simple methods that estimate it from a subset of your data. So, for clarity's sake: if you model your data with a Gaussian, for example, then when you estimate its parameters the covariance matrix will have non-zero off-diagonal terms, which represent the correlation in the data. You just need to avoid assuming that the two dimensions are uncorrelated!
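As a small illustration of the Gaussian remark, the sketch below estimates the mean and the full 2x2 covariance matrix of some coordinates. The data are synthetic and made up for the example (y is deliberately tied to x), and the helper names are mine.

```python
import random

random.seed(0)

# Synthetic coordinates: y depends on x, so the two dimensions are correlated
xs = [random.gauss(0.0, 1.0) for _ in range(500)]
ys = [0.8 * x + random.gauss(0.0, 0.3) for x in xs]


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance with the usual n - 1 denominator."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


# Parameters of a 2-D Gaussian fitted to the data
mu = (mean(xs), mean(ys))
sigma = [
    [covariance(xs, xs), covariance(xs, ys)],
    [covariance(ys, xs), covariance(ys, ys)],
]

# The off-diagonal entries are far from zero: the fit picks up the correlation
print(sigma)
```

Forcing the off-diagonal entries to zero (a "diagonal covariance" model) is exactly the uncorrelated assumption mentioned above.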


Is it a good idea to exclude noisy data (which may reduce model accuracy or cause unexpected output on the test dataset) from a dataset when generating the training and validation datasets?
Assumption: the noisy data is known to us in advance.
It depends on your application. If the noisy data is valid, then definitely include it to find the best model.
However, if the noisy data is invalid, then it should be cleaned out before fitting your model.
Noise is a broad term; it is better to think of the points as inliers or outliers instead.
Most outlier detection algorithms specify a threshold and rank the outlier candidates according to some score. In this case, you can choose to remove the most extreme values, for example those more than 3 standard deviations from the mean (assuming, of course, that your data set is roughly Gaussian-distributed).
So my suggestion is to base your judgement on two things:
1. Your business concept and logic about validity vs invalidity. For example: a house size, area or price cannot be a negative number.
2. Your mathematical / algorithmic logic. For example: detect extreme values based on some threshold to decide (together with, or without, point no. 1) whether an observation is valid or not.
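The two kinds of check can be sketched as follows. The prices are synthetic (200 plausible values plus one invalid and one extreme value that I planted), and the 3-standard-deviation cut-off assumes roughly Gaussian data, so treat this as an illustration rather than a recipe.

```python
import random
import statistics

random.seed(1)

# Synthetic house prices (in thousands), plus one invalid and one extreme value
prices = [random.gauss(220.0, 20.0) for _ in range(200)]
prices += [-5.0, 5000.0]

# Check 1 (domain logic): a price cannot be negative or zero
valid = [p for p in prices if p > 0]

# Check 2 (statistical logic): flag values more than 3 standard deviations from the mean
mu = statistics.mean(valid)
sd = statistics.stdev(valid)
kept = [p for p in valid if abs(p - mu) <= 3 * sd]
flagged = [p for p in valid if abs(p - mu) > 3 * sd]

print(len(kept), flagged)
```

Note that the mean and standard deviation are themselves pulled around by the outliers, so on small samples a robust alternative (e.g. median and MAD) may behave better.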
Noisy data does not cause a huge problem by itself. The extremely noisy data (i.e. extreme values / outliers) are what you should really be concerned about!
Such points would shift the hypothesis of your model while it is fitting the data, so results might be drastically shifted / incorrect.
Finally, you can look at PyOD, an open-source Python toolbox that implements many different outlier detection algorithms off the shelf. (You can choose more than one algorithm and create a voting pool to decide how extreme the observations are.)
You can use a multivariate Gaussian distribution for outlier detection in Python. It is a good choice when the data are roughly Gaussian.
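A minimal sketch of the multivariate Gaussian idea, in two dimensions and with the standard library only: fit a mean and covariance, then flag points whose squared Mahalanobis distance exceeds a chi-square quantile. The data, the 99% level and the variable names are my own assumptions for the example.

```python
import math
import random

random.seed(2)

# Synthetic 2-D data plus one planted outlier
points = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(300)]
points.append((8.0, -8.0))

n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
det = sxx * syy - sxy ** 2


def mahalanobis_sq(p):
    """Squared Mahalanobis distance, using the closed-form inverse of the 2x2 covariance."""
    dx, dy = p[0] - mx, p[1] - my
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


# For 2 degrees of freedom the chi-square quantile has a closed form: -2 * ln(1 - q)
threshold = -2.0 * math.log(1.0 - 0.99)
outliers = [p for p in points if mahalanobis_sq(p) > threshold]
print(outliers)
```

With a 99% level you should still expect roughly 1% of perfectly ordinary points to be flagged, so the threshold is a judgement call.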

How to derive the top contributing factors in a binary classification problem

I have a binary classification problem with about 30 features and an ultimate pass/fail label. I first trained a classifier to be able to predict if new instances will pass or fail but now I want to get a deeper understanding.
How can I derive some analysis about why these items pass or fail based on their features? I would ideally like to be able to show the top contributing factors with a weight associated with each one. Complicating this is that my features are not necessarily statistically independent of each other. What sorts of methods should I look into, what keywords will point me in the right direction?
Some initial thoughts: Use a decision tree classifier (ID3 or CART) and look at the top of the tree for top factors. I am not sure how robust this approach would be and it isn't immediately clear to me how one can assign the importance of each factor (one would just get an ordered list).
If I understand your objectives correctly, you might want to consider a Random Forest model. Random forests have the advantage of naturally providing an importance to the features by virtue of how the algorithm works.
In Python's scikit-learn, check out sklearn.ensemble.RandomForestClassifier. After fitting, its feature_importances_ attribute holds the "weights" I believe you're looking for. Check out the example in the documentation.
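Here is a small sketch of what that looks like, assuming scikit-learn and NumPy are installed. The data are synthetic (the label depends strongly on f0 and weakly on f2) and the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 600

# Synthetic features; names are placeholders for your ~30 real columns
feature_names = ["f0", "f1", "f2", "f3", "f4"]
X = rng.normal(size=(n_samples, len(feature_names)))

# Pass/fail label driven mostly by f0, slightly by f2, plus noise
y = (2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Impurity-based importances, normalised to sum to 1
ranking = sorted(zip(feature_names, clf.feature_importances_), key=lambda item: item[1], reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

Since your features are not independent, bear in mind that importance tends to be shared among correlated features; sklearn.inspection.permutation_importance is another option worth comparing against.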
Alternatively, you can use R's randomForest package. After constructing the model, you can use importance() to extract the feature importance values.

Clustering data based on relationship patterns between independent variable and dependent variable(s)

I am interested in clustering 2-dimensional input data with a 1-D output, based on the relationship between the dependent variable and the independent variables.
For example, suppose the two independent dimensions are x and y, the dependent variable is z, and the relationship between (x, y) and z differs between regions of the xy-space. I would like to cluster the data such that regions of the xy-space that exhibit the same functional relationship with z fall into one cluster. The functional relationships that can exist between the xy-space and z are unknown a priori.
It would be great if someone could point me to directions/references for machine learning techniques that can be used as is, or modified, to fit this problem.
There is no good answer to this question, as it is the core concept of the whole field of hybridization between clustering and classification techniques. As a result, dozens of approaches have been proposed, ranging from clustering the initial data (the whole XYZ space in your case), through independent analysis of the possible behaviour of classification models in each cluster, to fully merging both processes into one big optimization problem. In my opinion it is almost as wide as asking "I have data in the form (x, f(x)) and want to reconstruct f; how do I do it?"
So for references, try searching for anything related to clustering and classification hybrids, as the problem you are asking about is equivalent to finding a good clustering for modeling the (partially) independent classification/regression tasks.
Of course, if you know something about the form of this functional relationship, the whole problem can be quite easy to solve. For example, if you know that your functional relationship is more or less a Gaussian function, you could simply fit a Gaussian mixture model to your data. In general, EM (expectation maximization) would be a good choice given some knowledge about the function.
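As a rough sketch of the mixture-model route, the example below fits a two-component Gaussian mixture (trained by EM) to the joint (x, y, z) data using scikit-learn's GaussianMixture. The data are synthetic, with two well-separated regions that I made up so that z rises with x in one and falls with x in the other; real data will rarely be this clean.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 300

# Region A: x in [0, 1], z rises with x. Region B: x in [2, 3], z falls with x.
xa, ya = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
xb, yb = rng.uniform(2, 3, n), rng.uniform(0, 1, n)
za = 2.0 * xa + rng.normal(scale=0.05, size=n)
zb = -2.0 * xb + 10.0 + rng.normal(scale=0.05, size=n)

data = np.column_stack([
    np.concatenate([xa, xb]),
    np.concatenate([ya, yb]),
    np.concatenate([za, zb]),
])

# Full covariances let each component encode its own x-z relationship
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(data)

# Slope of z on x implied by each component: cov(x, z) / var(x)
slopes = [cov[0, 2] / cov[0, 0] for cov in gmm.covariances_]
print(slopes)
```

Each component's covariance then summarises the local (linear) relationship in its region, which is one simple reading of "same functional relationship".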

Centroid algorithm for document classification, threshold detection

I have a collection of documents related to a particular domain and have trained a centroid classifier on that collection. I will be feeding the classifier documents from different domains and want to determine how relevant they are to the trained domain. I can use cosine similarity to get a numerical value, but my question is: what is the best way to determine the threshold value?
For this, I could download several documents from different domains and inspect their similarity scores to determine the threshold value. But is this the way to go? Is it statistically sound? What other approaches are there?
Actually there is another issue with centroids of sparse vectors: they are usually significantly less sparse than the original data. For example, this increases computation costs. It can also yield vectors that are themselves atypical, because they have a different sparsity pattern. This effect is similar to using arithmetic means of discrete data: say the mean number of doors on a car is 3.4; obviously no car exists that actually has 3.4 doors. In particular, there will be no car with a Euclidean distance of less than 0.4 to the centroid! So how "central" is the centroid, really?
Sometimes it helps to use medoids instead of centroids, because they actually are proper objects of your data set.
Make sure you control such effects on your data!
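To see the sparsity effect and the medoid alternative side by side, here is a toy sketch. The four bag-of-words documents are made up, and the medoid is taken as the document with the smallest total cosine distance to all documents, which is one common definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up documents, each using at most 3 distinct terms
docs = [
    {"engine": 2, "wheel": 1},
    {"engine": 1, "brake": 1},
    {"wheel": 2, "tyre": 1},
    {"engine": 1, "wheel": 1, "brake": 1},
]

# Centroid: mean weight per term over all documents (denser than any single document)
vocabulary = set().union(*docs)
centroid = {term: sum(d.get(term, 0) for d in docs) / len(docs) for term in vocabulary}


def total_distance(i):
    return sum(1.0 - cosine(docs[i], d) for d in docs)


# Medoid: an actual document, the one closest on average to all the others
medoid_index = min(range(len(docs)), key=total_distance)
medoid = docs[medoid_index]

print(len(centroid), medoid_index, medoid)
```

Here the centroid has 4 non-zero terms while no document has more than 3, and the medoid keeps the sparsity pattern of a real document.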
A simple method to try would be to employ various machine-learning algorithms - and in particular, tree-based ones - on the distances from your centroids.
As mentioned in another answer (by Anony-Mousse), this won't necessarily provide you with good or usable answers, but it just might. Using an ML framework for this procedure, e.g. WEKA, will also help you estimate your accuracy in a more rigorous manner.
Here are the steps to take, using WEKA:
1. Generate a training set by finding a decent number of documents representing each of your classes (to get valid estimates, I'd recommend at least a few dozen per class).
2. Calculate the distance from each document to each of your centroids.
3. Generate a feature vector for each such document, composed of the distances from this document to the centroids. You can either use a single feature, the distance to the nearest centroid, or use all distances if you'd like to try a more elaborate thresholding scheme. For example, if you chose the simpler method of using a single feature, the vector representing a document with a distance of 0.2 to the nearest centroid, belonging to class A, would be: "0.2,A"
4. Save this set in ARFF or CSV format, load it into WEKA, and try classifying, e.g. using a J48 tree.
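The feature-generation part of the steps above could be sketched in Python like this, producing CSV text that WEKA can load. The centroids, documents, class names and column names are all hypothetical, and cosine distance is my assumption for "distance".

```python
import csv
import io
import math


def cosine_distance(u, v):
    """1 - cosine similarity for sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical centroids (one per class) and labelled documents
centroids = {
    "A": {"engine": 0.8, "wheel": 0.5, "brake": 0.3},
    "B": {"goal": 0.7, "match": 0.6, "team": 0.4},
}
labelled_docs = [
    ({"engine": 2, "wheel": 1}, "A"),
    ({"brake": 1, "engine": 1}, "A"),
    ({"goal": 3, "team": 1}, "B"),
    ({"match": 2, "wheel": 1}, "B"),
]


def distance_features(doc, use_all=False):
    """Either all centroid distances, or only the distance to the nearest centroid."""
    distances = [cosine_distance(doc, c) for c in centroids.values()]
    return distances if use_all else [min(distances)]


# Write rows such as "0.2,A" into an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["nearest_centroid_distance", "class"])
for doc, label in labelled_docs:
    writer.writerow([f"{distance_features(doc)[0]:.3f}", label])

csv_text = buffer.getvalue()
print(csv_text)
```

To feed WEKA you would write csv_text to a file; passing use_all=True gives the richer feature vector for the more elaborate thresholding scheme.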
The results would provide you with an overall accuracy estimation, with a detailed confusion matrix, and - of course - with a specific model, e.g. a tree, you can use for classifying additional documents.
These results can be used to iteratively improve the models and thresholds by collecting additional training documents for problematic classes, and then either recreating the centroids or retraining the threshold classifier.

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning and just found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a useful form? I mean, how can I use it in practice (in an application)?
Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly. But instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes the entire genre. In this way you reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people probably like movies only in their preferred genres.
However, it's not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
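The movie example can be written out in a few lines. The genre assignment (20 consecutive movies per genre) and the majority rule for "likes the genre" are assumptions I made to have something concrete; the person is made up too.

```python
N_MOVIES, N_GENRES = 100, 5

# Hypothetical assignment: movies 0-19 are genre 0, 20-39 are genre 1, and so on
genre_of = [movie // 20 for movie in range(N_MOVIES)]
genre_size = [genre_of.count(g) for g in range(N_GENRES)]


def reduce_to_genres(likes):
    """Map a 100-long like/dislike vector to a 5-long genre vector (majority rule)."""
    liked_per_genre = [0] * N_GENRES
    for movie, liked in enumerate(likes):
        liked_per_genre[genre_of[movie]] += liked
    return [1 if 2 * liked_per_genre[g] > genre_size[g] else 0 for g in range(N_GENRES)]


# A made-up person: likes genres 0 and 3, except one genre-0 movie, plus one stray genre-4 movie
person = [1 if genre_of[movie] in (0, 3) else 0 for movie in range(N_MOVIES)]
person[5] = 0
person[85] = 1

reduced = reduce_to_genres(person)
print(reduced)  # [1, 0, 0, 1, 0]
```

The two exceptions (movies 5 and 85) are exactly the information the reduced vector loses.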
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis, which does something similar (and, incidentally, plotting its results was my first real-world programming task).
It's a neat and clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables. To analyse the relationship between these, one obviously plots in two dimensions, and you might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resultant multidimensional graph to find the set of two or three axes within the graph which contain the largest amount of information. For example, the first principal component will be a composite axis (i.e. at some angle through n-dimensional space) which has the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, etc.
Plotting the resultant graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. For the technique to be considered valid, it's usual to look for a representation that contains around 70% of the variation in the original data: enough to visualize, with some confidence, relationships that would otherwise not be apparent in the raw statistics. Notice that the technique requires that all factors have the same weight. Even so, it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages. (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone.)
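For anyone who wants to try this, here is a minimal PCA sketch using NumPy's SVD. The data are synthetic (10 variables driven by 2 hidden factors, which I chose so that two axes capture most of the variation), and the standardisation step is one way of giving every variable the same weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 observed variables driven by 2 hidden factors
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# Standardise so that every variable carries the same weight
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows of Vt are the principal axes, ordered by how much variation they capture
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)

# Coordinates of each sample on the first two axes, ready for a 2-D scatter plot
scores = Xs @ Vt[:2].T

print(f"variation captured by the first two axes: {explained[:2].sum():.2f}")
```

If the printed fraction is well below the roughly 70% mentioned above, a 2-D plot of scores should be read with caution.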
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix factorization based methods, etc.
Here's a simple example. You have a lot of text files, and each file consists of some words. These files can be classified into two categories. You want to visualize a file as a point in a 2D/3D space so that you can see the distribution clearly. So you need to do dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space will be 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring a longitude, latitude and height measurement to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get given a certain quantity of fuel, then it would be far easier to work with the 1-dimensional data than the 3-dimensional version.
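The train example amounts to replacing each 3-D position with its cumulative distance along the track. A tiny sketch, with made-up waypoints in arbitrary units and the track approximated by straight segments:

```python
import math

# Made-up waypoints along the track: (x, y, height)
track = [(0, 0, 0), (3, 4, 0), (3, 4, 12), (6, 8, 12)]


def distance_along_track(points):
    """Cumulative straight-segment distance from the start to each waypoint."""
    totals = [0.0]
    for a, b in zip(points, points[1:]):
        totals.append(totals[-1] + math.dist(a, b))
    return totals


# One number per waypoint instead of three
s = distance_along_track(track)
print(s)  # [0.0, 5.0, 17.0, 22.0]
```

A fuel model can then work with s alone, since every position on the track corresponds to exactly one value of s.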
It's a data mining technique. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays), so any data with more than 3 dimensions needs to be somehow compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed. So we could say that the database is going to have a large number of dimensions.
As a matter of fact, each database record will actually include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes may be easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data until later. We would still be able to estimate IQs using shoe sizes because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of the records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.
