How to combine two confusion matrices - machine-learning

In a machine learning context, I am working on a binary classification problem. There is a source of truth T for labels, and a labeling process A which is not perfect and makes errors compared to T, according to a confusion matrix C(T,A). There is then a second labeling process B, and a second confusion matrix C(A,B) between A and B.
Is it possible to calculate C(T,B) from C(T,A) and C(A,B)? If so, what would that calculation be?

Related

Why does k-means in scikit-learn have a predict function but DBSCAN/agglomerative doesn't?

The scikit-learn implementation of K-means has a predict() function which can be applied to unseen data, whereas DBSCAN and Agglomerative do not have a predict() function.
All three algorithms have fit_predict(), which fits the model and then predicts. But k-means also has predict(), which can be used directly on unseen data; this is not the case for the other algorithms.
I am well aware that these are clustering algorithms and, in my opinion, predict() should not be there for K-means either.
What is the possible intuition/reason behind this discrepancy? Is it only because k-means performs "1NN classification" that it has a predict() function?
My interpretation is that the difference comes from the way the clusters are computed. In KMeans there is a native way to assign a new point to a cluster, while there is none in DBSCAN or Agglomerative clustering.
A) KMeans
In KMeans, during the construction of the clusters, a data point is assigned to the cluster with the closest centroid, and the centroids are updated afterwards. "Predicting" in the KMeans algorithm is actually doing the assignment step without updating the clusters.
If you assume that the new data points are drawn from the same distribution as the "training" set, and that your "training" set was representative enough, it is reasonable to think that one can assign the new data points following the heuristic of the algorithm without updating the cluster centroids, thus making predictions.
Of course, if the distribution of the data points is likely to change, one should rerun the KMeans clustering on the updated dataset.
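To make the "assignment step without updating the clusters" concrete, here is a minimal sketch. The two blobs and the new points are made up for illustration, and the manual nearest-centroid computation is my own restatement of the idea, compared against predict() on this toy data only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic "training" data: two well-separated blobs (made up for illustration).
X_train = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# New points, assumed to come from the same distribution as the training set.
X_new = np.array([[0.2, -0.1], [4.8, 5.3], [1.0, 1.0]])

# Manual assignment step: index of the closest centroid, centroids left untouched.
dists = np.linalg.norm(X_new[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
manual_labels = dists.argmin(axis=1)

# Labels returned by the fitted model for the same new points.
predicted_labels = km.predict(X_new)
```

On this example manual_labels and predicted_labels agree, which is what you would expect if predicting is just the assignment step.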
B) DBSCAN
DBSCAN creates the cluster by finding high density areas of the dataset (parametrized by the parameters epsilon and min_points). This is done by computing point-level properties (whether the point is a core point, a directly reachable point, a reachable point or a noise point). Adding a new data point can modify the definition of the neighboring points, and thus make the computed clusters obsolete.
As an example, consider the DBSCAN illustration from Wikipedia. In that image there is one cluster (red+yellow points) and one noise point N (blue). Red points, such as A, are core points and yellow points are reachable points.
Consider two cases:
Adding a new point halfway between A and N would make N a reachable point from A, and N would thus belong to the cluster.
Adding (min_points-1) new points in the epsilon-neighborhood of N, but in no other epsilon-neighborhood (for example at the top of the picture), would change the status of N, which would become a core point and form a new cluster with the newly added points.
Here adding new data points clearly requires recomputing the clusters.
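The first case can be reproduced on a small toy dataset. This is only a sketch: the 1-D points and the values of eps and min_samples are made up, and min_samples plays the role of min_points here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 1-D data (made up): a dense group plus one isolated point N at x = 3.0.
X = np.array([[0.0], [0.5], [1.0], [1.5], [3.0]])
params = dict(eps=1.1, min_samples=3)

# Clustering of the original data: N has too few neighbours and is noise.
labels_before = DBSCAN(**params).fit_predict(X)

# Add a single point between the dense group and N, then refit from scratch.
X_plus = np.vstack([X, [[2.25]]])
labels_after = DBSCAN(**params).fit_predict(X_plus)

print(labels_before)     # the last entry (N) is -1, i.e. noise
print(labels_after[:5])  # N now carries the same label as the dense group
```

The label of an already clustered point changed only because a new point arrived, so there is no fixed model to "predict" with.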
C) Agglomerative clustering
Agglomerative clustering iteratively builds the cluster starting from points and merges them according to a linkage measure. Similarly to DBSCAN, adding new data points can entirely modify the final clusters because it can trigger different mergings.
As an example, if the linkage strategy you choose in sklearn is "single", clusters are merged if the minimum distance between all elements of the two clusters is below a chosen threshold. You can easily figure out that a single well placed new data point can trigger a merge between two clusters that would have been separated otherwise.
Thus predicting here also requires recomputing the clusters.
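Here is a small sketch of the single-linkage case. The points and the threshold of 2.0 are made up, and the helper n_single_linkage_clusters is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def n_single_linkage_clusters(X, threshold=2.0):
    """Number of clusters found by single linkage with a distance threshold."""
    model = AgglomerativeClustering(
        n_clusters=None,               # let the threshold decide the number of clusters
        distance_threshold=threshold,
        linkage="single",
    )
    model.fit(X)
    return model.n_clusters_

# Two groups whose closest members are 3.0 apart (above the threshold).
X = np.array([[0.0], [1.0], [4.0], [5.0]])
clusters_before = n_single_linkage_clusters(X)

# One well-placed point at 2.5 is within 1.5 of both groups and bridges them.
X_plus = np.vstack([X, [[2.5]]])
clusters_after = n_single_linkage_clusters(X_plus)

print(clusters_before, clusters_after)
```

With these values the two groups stay separate before the new point is added and collapse into one cluster afterwards.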

Linear Regression: Is there a difference in the model between using ML instead of MSE?

We know we need 4 things for building a machine learning algorithm:
A Dataset
A Model
A cost function
An optimization procedure
Taking the example of linear regression (y = m*x + q), there are two common ways of finding the best parameters: using ML or MSE as the cost function.
We hypothesize that the data are Gaussian-distributed when using ML.
Is this assumption part of the model, also?
If it's not, why? Is it part of the cost function?
I can't see the "edge" of the model, in this case.
Is this assumption part of the model, also?
Yes, it is. The different loss functions are derived from the nature of the problem, and consequently from the nature of the model.
MSE by definition calculates the mean of the squares of the errors (the error being the difference between the real y and the predicted y), which in turn will be high if the data is not Gaussian-like distributed. Just imagine a few extreme values among the data: what will happen to the slope of the line and consequently to the residual error?
It is worth mentioning the assumptions of Linear Regression:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
If it's not, why? Is it part of the cost function?
As far as I have seen, the assumption is not directly related to the cost function itself but rather, as mentioned above, to the model itself.
For example, the idea behind a Support Vector Machine is the separation of classes, that is, finding a line/hyperplane (in multidimensional space) that separates the classes; thus its cost function is the Hinge Loss, which gives a "maximum-margin" classification.
On the other hand, Logistic Regression uses Log-Loss (related to cross-entropy) because the model is binary and works on the probability of the output (0 or 1). And the list goes on...
The assumption that the data is Gaussian-distributed is part of the model in the sense that, for Gaussian-distributed data, the minimal Mean Squared Error also yields the maximum likelihood solution for the data, given the model parameters. (This is a common proof; you can look it up if you are interested.)
So you could say that the Gaussian distribution assumption justifies the choice of least squares as the loss function.
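A quick numerical check of this equivalence, as a sketch only: the data are synthetic, the noise level sigma is assumed known, and the parameter grid is arbitrary. The Gaussian negative log-likelihood is just a constant plus a positive multiple of the MSE, so both are minimized by the same (m, q).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5                                   # assumed (known) noise level
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=sigma, size=200)  # synthetic data
n = x.size

# Grid of candidate parameters (m, q) for the line y = m*x + q.
ms = np.linspace(1.5, 2.5, 101)
qs = np.linspace(0.0, 2.0, 101)
M, Q = np.meshgrid(ms, qs, indexing="ij")

# Residuals for every (m, q) pair, shape (101, 101, n).
resid = y - (M[..., None] * x + Q[..., None])

# Cost 1: mean squared error.
mse_grid = np.mean(resid ** 2, axis=-1)

# Cost 2: negative log-likelihood under Gaussian noise with standard deviation sigma.
nll_grid = np.sum(0.5 * np.log(2 * np.pi * sigma ** 2) + resid ** 2 / (2 * sigma ** 2), axis=-1)

best_mse = np.unravel_index(mse_grid.argmin(), mse_grid.shape)
best_nll = np.unravel_index(nll_grid.argmin(), nll_grid.shape)
print(ms[best_mse[0]], qs[best_mse[1]])
print(ms[best_nll[0]], qs[best_nll[1]])
```

Both costs pick the same grid point; with a different noise assumption (say Laplace) the likelihood would correspond to a different loss.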

Understanding multiple Linear regression

I am working on a multiple regression problem. I have the data set below.
rank--discipline--yrs.since.phd--yrs.service--sex--salary
[ 1 1 19 18 1 139750],......
I am taking salary as the dependent variable and the other variables as independent variables. After data preprocessing, I ran a gradient descent regression model. I estimated the bias (intercept) and the coefficients for all independent features.
I want to make a scatter plot of the actual values and the regression line for the hypothesis I predicted. Since we have more than one feature here, I have the questions below.
While plotting the actual values (scatter plot), how do I decide the x-axis values? Meaning, I have a list of values; for example, the first row is [1,1,19,18,1]=>139750. How do I transform or map [1,1,19,18,1] to the x-axis? I need to somehow reduce [1,1,19,18,1] to one value, so I can mark a point (x,y) in the plot.
While plotting the regression line, what would the feature values be, so I can calculate the hypothesis value?
Meaning, I now have the intercept and the weights of all features, but I don't have the feature values. How do I decide upon the feature values?
I want to calculate the points and use matplotlib to do the job. I am aware that there are a lot of tools available, including matplotlib, to do the job. But I want to get the basic understanding.
Thanks.
I am still not sure I completely understand your question, so if something is not what you expected comment below and we will work it out.
Now,
Query 1: In all your datasets you are going to have multiple inputs, and there is no way to view the target variable (salary in your case) with respect to all of them in a single graph. What is usually done is to apply dimensionality reduction to your data, using t-SNE or principal component analysis (PCA), so that your output becomes a function of two or three variables which you can then plot. The other technique, which I prefer, is to plot the target vs each variable separately as subplots. The reason for this is that we have no way to comprehend data in more than three dimensions.
Query 2: If you are not determined to use matplotlib, I will suggest seaborn.regplot(), but let's also do it in matplotlib. Suppose the variable you want to use first is 'discipline' vs 'salary'.
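# df is assumed to be a pandas DataFrame holding the dataset
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['discipline']]  # 2-D feature matrix with a single column
Y = df['salary']        # target
lm.fit(X, Y)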
After running this, lm.coef_ will give you the coefficient and lm.intercept_ will give you the intercept of the linear equation fitted for this variable. You can then easily plot the data for the two variables, together with the line, using matplotlib.
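Here is a rough sketch of the "target vs each variable as subplots" idea in plain matplotlib. The data are a synthetic stand-in for the salary dataset (all values are made up), and each panel gets its own one-feature least-squares line via np.polyfit rather than the full multi-feature model.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100

# Synthetic stand-in for the dataset in the question (values are made up).
features = {
    "rank": rng.integers(1, 4, n),
    "discipline": rng.integers(0, 2, n),
    "yrs.since.phd": rng.integers(1, 40, n),
    "yrs.service": rng.integers(0, 35, n),
    "sex": rng.integers(0, 2, n),
}
salary = (60000 + 15000 * features["rank"]
          + 800 * features["yrs.since.phd"] + rng.normal(0, 8000, n))

# One panel per feature: scatter of the actual values plus a fitted line.
fig, axes = plt.subplots(1, len(features), figsize=(20, 4), sharey=True)
for ax, (name, values) in zip(axes, features.items()):
    slope, intercept = np.polyfit(values, salary, deg=1)  # single-feature fit
    xs = np.linspace(values.min(), values.max(), 50)
    ax.scatter(values, salary, s=10)
    ax.plot(xs, slope * xs + intercept, color="red")
    ax.set_xlabel(name)
axes[0].set_ylabel("salary")
fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards, depending on how you want to view the figure.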
What you can do is ->
from pandas import plotting as pdplt
# dataframe is your pandas DataFrame; pass any further keyword arguments you need
pdplt.scatter_matrix(dataframe)
With this you will get a matrix of plots (in your case 6x6) which shows how each column in your dataframe relates to the other columns, so you can clearly visualise which feature dominates the result and also how the features are correlated with each other.
If you ask me, this is the first thing I do with this type of problem; I then remove all correlated features and select the features which best approximate the output.
But you have to plot a 2d plot, and with the above approach you might find more than a single feature dominating the output. In that case what you can use is a miracle named PCA.
If you ask me, PCA is one of the most beautiful things in machine learning. What it does is merge all your features in some magical ratio to generate the principal components of your data. Principal components are the components which account for the major share of the variance in your data. You apply PCA by simply importing it from sklearn and then selecting the first principal component (as you need a 2d plot), or you might select 2 principal components and plot a 3d graph. But always remember that these principal components are not the real features of your model; they are some magical combination of them. How PCA does this is very interesting (it uses concepts like eigenvalues and eigenvectors), and you can also build it on your own.
Apart from all these, you can apply Singular Value Decomposition (SVD) to your data. It is the essence of linear algebra: a type of matrix decomposition which exists for every matrix. It decomposes your matrix into three matrices, one of which is a diagonal matrix consisting of the singular values (scaling factors) in descending order. What you have to do is select the top singular values (in your case only the first one, which has the highest magnitude), reduce the feature matrix from 5 columns to 1 column, and then plot that. You can do SVD using numpy.linalg.
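A minimal sketch of that reduction with numpy.linalg.svd follows. The 50x5 matrix is random stand-in data, and centering the columns first is my assumption (it makes the result coincide with the first principal component).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))          # stand-in for a 5-feature matrix (made up)

# Center each column, then decompose: X_centered = U @ diag(S) @ Vt.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep only the direction of the largest singular value: 5 columns -> 1 column.
x_1d = X_centered @ Vt[0]             # shape (50,), usable as the x-axis of a 2d plot

# Share of the total variance carried by each singular direction.
explained = S ** 2 / np.sum(S ** 2)
print(explained)
```

If the first few entries of explained are close to each other, a single column is a poor summary of the data.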
Once you have applied any one of these methods, you can learn your hypothesis with only the single most important selected feature and finally plot the graph. But take a tip: just for the sake of plotting a 2d graph you should not discard other important features, because maybe you have 3 principal components all having almost the same contribution, and maybe the top three singular values are very close to each other. So take my word for it and take all important features into account, and if you need a visualisation of these important features, use the scatter matrix.
Summary ->
All I want to mention is that you can do the same process with any of these techniques, and you can also invent your own statistical or mathematical model for compressing your feature space.
But I prefer to go with PCA, and for this type of problem I first plot the scatter matrix to get a visual intuition for the data. Also, PCA and SVD help to remove redundancy and hence can reduce overfitting.
For the remaining details, refer to the docs.
Happy machine learning...

Fuzzy clustering using unsupervised dimensionality reduction

An unsupervised dimensionality reduction algorithm takes as input an NxC1 matrix, where N is the number of input vectors and C1 is the number of components of each vector (the dimensionality of the vector). As a result, it returns a new NxC2 matrix (C2 < C1) where each vector has a lower number of components.
A fuzzy clustering algorithm takes as input an NxC1 matrix where N, here again, is the number of input vectors and C1 is the number of components of each vector. As a result, it returns a new NxC2 matrix (C2 usually lower than C1) where each component of each vector indicates the degree to which the vector belongs to the corresponding cluster.
I noticed that the input and output of both classes of algorithms are the same in structure; only the interpretation of the results changes. Moreover, there is no fuzzy clustering implementation in scikit-learn, hence the following question:
Does it make sense to use a dimensionality reduction algorithm to perform fuzzy clustering?
For instance, is it nonsense to apply FeatureAgglomeration or TruncatedSVD to a dataset built from TF-IDF vectors extracted from textual data, and interpret the results as a fuzzy clustering?
In some sense, sure. It kind of depends on how you want to use the results downstream.
Consider SVD truncation or excluding principal components. We have projected into a new, variance-preserving space with essentially few other restrictions on the structure of the new manifold. The new coordinate representations of the original data points could have large negative numbers for some elements, which is a little weird. But one could shift and rescale the data without much difficulty.
One could then interpret each dimension as a cluster membership weight. But consider a common use for fuzzy clustering, which is to generate a hard clustering. Notice how easy this is with fuzzy cluster weights (e.g. just take the max). Consider a set of points in the new dimensionally-reduced space, say p1=<0,0,1>, p2=<0,1,0>, p3=<0,100,101>, p4=<5,100,99>. A fuzzy clustering would give something like {p1,p2}, {p3,p4} if thresholded, but if we took the max here (i.e. treated the dimensionally reduced axes as memberships), we would get {p1,p3}, {p2,p4}, for k=2, for instance. Of course, one could use a better algorithm than max to derive hard memberships (say by looking at pairwise distances, which would work for my example); such algorithms are called, well, clustering algorithms.
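The four points above can be checked directly. This is just a sketch of the two groupings: argmax over the reduced axes versus a crude distance-based grouping (nearest neighbour), which stands in here for a real clustering algorithm.

```python
import numpy as np

# p1..p4 from the example, in the dimensionally reduced space.
points = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [0, 100, 101],
    [5, 100, 99],
], dtype=float)

# Treat each axis as a membership weight and take the max.
argmax_labels = points.argmax(axis=1)

# Distance-based view: for each point, the index of its nearest other point.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)        # ignore the distance of a point to itself
nearest = dist.argmin(axis=1)

print(argmax_labels)  # p1 and p3 share an axis, p2 and p4 share another
print(nearest)        # p1 <-> p2 and p3 <-> p4 are mutual nearest neighbours
```

So the max rule groups {p1,p3} and {p2,p4}, while distances group {p1,p2} and {p3,p4}.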
Of course, different dimensionality reduction algorithms may work better or worse for this (e.g. MDS which focuses on preserving distances between data points rather than variances is more naturally cluster-like). But fundamentally, many dimensionality reduction algorithms implicitly preserve data about the underlying manifold that the data lie on, whereas fuzzy cluster vectors only hold information about the relations between data points (which may or may not implicitly encode that other information).
Overall, the purpose is a little different. Clustering is designed to find groups of similar data. Feature selection and dimensionality reduction are designed to reduce the noise and/or redundancy of the data by changing the embedding space. Often we use the latter to help with the former.

How do I use principal component analysis in supervised machine learning classification problems?

I have been working through the concepts of principal component analysis in R.
I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting out the most interesting first few principal components as numeric variables from my matrix.
The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.
More specifically, here's my real question:
I respect that principal components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone ever use PCA as a way to reduce dimensionality of a dataset, and THEN, use these components with a supervised learner, say, SVM?
I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" or "No" I haven't come across!)
Please step in and set me straight if you have the time and wherewithal. Thanks in advance.
Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in your same shoes and had to hunt down the answer myself.
The goal of PCA is to represent your data X in an orthonormal basis W; the coordinates of your data in this new basis are Z, as expressed below:
$$X = Z W^T$$
Because of orthonormality, we can invert W simply by transposing it and write:
$$Z = X W$$
Now to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest (i.e., the eigenvector corresponding to the largest eigenvalue is first, etc.), this amounts to simply keeping the first k columns of W, which we can call W_k:
$$Z = X W_k$$
Now we have a k-dimensional representation of our training data X. Now you run some supervised classifier using the new features in Z.
The key is to realize that W_k is in some sense a canonical transformation from our space of p features down to a space of k features (or at least the best transformation we could find using our training data). Thus, we can hit our test data with the same transformation, resulting in a k-dimensional set of test features:
$$Z_{test} = X_{test} W_k$$
We can now use the same classifier trained on the k-dimensional representation of our training data to make predictions on the k-dimensional representation of our test data:
$$\hat{y}_{test} = f(Z_{test})$$
where f denotes the classifier trained on Z.
The point of going through this whole procedure is that you may have thousands of features, but (1) not all of them are going to have a meaningful signal and (2) your supervised learning method may be far too complex to train on the full feature set (either it would take too long or your computer wouldn't have enough memory to process the calculations). PCA allows you to dramatically reduce the number of features it takes to represent your data without eliminating features of your data that truly add value.
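The whole procedure fits in a few lines of numpy. This is a sketch under stated assumptions: the two-class data are synthetic, the helper make_data is hypothetical, the columns are centered with the training mean, and a simple nearest-centroid rule stands in for the supervised classifier (an SVM or any other learner would be used the same way). Note that the labels are never transformed; only the features are.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 20, 2                                   # original and reduced dimensionality

def make_data(n):
    """Synthetic two-class data (made up): classes differ in the first two features."""
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, p))
    X[:, 0] += 6.0 * y
    X[:, 1] -= 6.0 * y
    return X, y

X_train, y_train = make_data(200)
X_test, y_test = make_data(100)

# Fit PCA on the training data only: rows of Vt are the principal directions.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
W_k = Vt[:k].T                                 # first k basis vectors, shape (p, k)

# Project training and test data with the SAME mean and the SAME W_k.
Z_train = (X_train - mean) @ W_k
Z_test = (X_test - mean) @ W_k

# Train a classifier on (Z_train, y_train): here, one centroid per class.
centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in (0, 1)])

# Predict test labels from the k-dimensional test features.
dists = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
y_pred = dists.argmin(axis=1)
accuracy = np.mean(y_pred == y_test)
print(accuracy)
```

The classifier only ever sees k columns instead of p, and the labels y_train and y_test are used exactly as they were before the projection.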
After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.
This is useful when the intrinsic dimensionality of your data is much smaller than the number of components and the gain in performance you get during classification is worth the loss in accuracy and the cost of PCA. Also, keep in mind the limitations of PCA:
In performing a linear transformation, you implicitly assume that all components are expressed in equivalent units.
Beyond variance, PCA is blind to the structure of your data. It may very well happen that the data splits along low-variance dimensions. In that case, the classifier won't learn from transformed data.
