How to classify MNIST data set using k-means clustering? - machine-learning

I am applying k-means clustering to the MNIST dataset. How can I then predict the values of my test set from the resulting model?

Well, k-means is an unsupervised technique, so strictly speaking you don't use it to "classify": a k-means model isn't supplied with labeled data (if it is, it doesn't use the class labels), and moreover it doesn't return a prediction as a class label (e.g., "1").
so to use k-means to predict the single digit encoded in a given data instance:
your k-means model is comprised of a set of centroids (I assume you chose 10 centroids to correspond to the digits 0 - 9 in base 10)
each centroid represents the geometric center of one cluster, with one cluster per digit
calculate the Euclidean distance (vector norm of the difference) between your unknown data point and each centroid in your k-means model (the centroid values from the final iteration, obviously)
the cluster whose centroid is the least distance from the unknown data point is the cluster to which the unknown data point belongs
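The nearest-centroid step described above can be sketched as follows. This is only a minimal illustration with NumPy: the toy centroids, the `nearest_centroid` helper and the `cluster_to_digit` map are assumptions made for the example. Since k-means clusters carry no labels of their own, some mapping from cluster index to digit (for instance, a majority vote over labeled training points in each cluster) is assumed to exist.

```python
import numpy as np

def nearest_centroid(centroids, x):
    """Return the index of the centroid closest to x (Euclidean distance)."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances))

# Toy stand-in for fitted k-means centroids: 3 clusters in 2-D.
# For MNIST this would be 10 centroids, each of length 784.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

# Hypothetical cluster-to-digit map, e.g. obtained by majority vote
# over the labeled training points that fell into each cluster.
cluster_to_digit = {0: 7, 1: 3, 2: 1}

x_new = np.array([4.2, 4.9])
cluster = nearest_centroid(centroids, x_new)
predicted_digit = cluster_to_digit[cluster]
```

Here `x_new` lies closest to the second centroid, so it is assigned to cluster 1 and reported as whatever digit that cluster was mapped to.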


Can a machine learning model provide information about mean and standard deviation of data on which it was trained?

Consider a parametric binary classifier (such as Logistic Regression, SVM etc.) trained on a dataset (say containing two features for e.g. Blood Pressure and Cholesterol level). The dataset is thrown away and the trained model can only be used as a black box (no tweaks and inside information can be gathered from the trained model). Only a set of data points can be provided and their labels predicted.
Is it possible to get information about the mean and/or standard deviation and/or range of the features of the dataset on which this model was trained? If yes, how? And if no, why not?
Thank you for your response! :)
An SVM does not provide any information about the data statistics. It is a maximum-margin classifier that finds the best separating hyperplane between the two classes in the feature space, expressed as a linear combination of "support vectors". If you use kernel functions, this combination lives in the kernel space, not even in the original feature space. An SVM does not have a straightforward probabilistic interpretation.
Logistic regression is a discriminative classifier that models the conditional probability p(y|x,w), where y is your label, x is your data and w are the weights. After maximum likelihood training you are left with w, which again defines a discriminator (a hyperplane) in the feature space, so once more you do not recover the statistics of the features.
The following can be considered: use a Gaussian classifier. Assume that the class is drawn from the prior class probability p(y), and that a class-conditional density p(x|y,w) then produces your data. By Bayes' rule you have p(y|x,w) = p(y) p(x|y,w) / p(x). If you define the class-conditional density p(x|y,w) as Gaussian, its parameter set w will consist of the mean vector m and covariance matrix C of x, assuming x is produced by class y. But remember that this only works under the assumption that the current data vector belongs to a specific class. Conditioned on w, a better option for the mean vector is E[x|w], the expectation of x with respect to p(x|w). It comes down to a weighted average of the mean vectors for classes y=0 and y=1, weighted by their prior class probabilities. The same should work for the covariance as well, but it needs to be derived properly; I am not 100% sure right now.
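As a rough sketch of the weighted-average idea, assuming the priors and the class-conditional Gaussian parameters are available (all numbers below are made up for illustration), the overall mean is the prior-weighted average of the class means. For the covariance, the law of total covariance adds a between-class term to the weighted class covariances.

```python
import numpy as np

# Hypothetical priors p(y) and class-conditional Gaussian parameters
# for two features (illustrative values only).
priors = np.array([0.6, 0.4])
means = np.array([[120.0, 180.0],    # m for y = 0
                  [150.0, 240.0]])   # m for y = 1
covs = np.array([
    [[100.0, 20.0], [20.0, 400.0]],  # C for y = 0
    [[150.0, 30.0], [30.0, 500.0]],  # C for y = 1
])

# E[x] = sum_y p(y) * m_y
overall_mean = priors @ means

# Cov[x] = sum_y p(y) * (C_y + (m_y - E[x]) (m_y - E[x])^T)
diffs = means - overall_mean
overall_cov = sum(p * (C + np.outer(d, d))
                  for p, C, d in zip(priors, covs, diffs))
```

Note that this relies on having access to the model parameters, which a purely black-box classifier would not expose.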

Is there a need to normalise input vector for prediction in SVM?

For input data of different scales, I understand that the values used to train the classifier have to be normalized for correct classification (SVM).
So does the input vector for prediction also need to be normalized?
The scenario I have is that the training data is normalized, serialized and saved in the database. When a prediction has to be done, the serialized data is deserialized to get the normalized numpy array, the classifier is fit on that array, and the input vector is then passed to the classifier for prediction. So does this input vector also need to be normalized? If so, how do I do it, since at the time of prediction I don't have the original training data to normalize against?
Also I am normalizing along axis=0 , i.e. along the column.
my code for normalizing is :
preprocessing.normalize(data, norm='l2',axis=0)
Is there a way to serialize preprocessing.normalize?
For SVMs, using a scaler is recommended for several reasons.
Many optimization methods work better when all features are on the same scale.
Many kernel functions internally use a Euclidean distance to compare two samples (in the Gaussian kernel, the Euclidean distance appears in the exponential term). If every feature has a different scale, the Euclidean distance is dominated by the features with the largest scale.
To put the features on the same scale, subtract the mean of each feature and divide by its standard deviation.
xi -> (xi - mi) / si
where mi is the mean and si the standard deviation of feature i.
You must store the mean and standard deviation of every feature in the training set so that the same operation can be applied to future data.
In Python there are functions that do this for you (from sklearn import preprocessing).
To obtain the means and standard deviations:
scaler = preprocessing.StandardScaler().fit(X)
To then normalize the training set (X is a matrix where every row is a sample and every column a feature):
X = scaler.transform(X)
After training, you must normalize any future data in the same way before classification:
newData = scaler.transform(newData)
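To illustrate why only the per-feature statistics need to be kept (not the training data itself), here is a minimal NumPy sketch of the same standardization. The toy values and the choice of JSON for storage are assumptions made for the example, not something prescribed above.

```python
import json
import numpy as np

# Toy training matrix: rows are samples, columns are features
# (hypothetical values).
X_train = np.array([[120.0, 180.0],
                    [140.0, 220.0],
                    [160.0, 260.0]])

# Per-feature statistics computed along axis=0 (population std, ddof=0).
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std

# Persist only the statistics; the training data is not needed later.
saved = json.dumps({"mean": mean.tolist(), "std": std.tolist()})

# At prediction time: reload the statistics and scale the new vector.
stats = json.loads(saved)
new_data = np.array([[150.0, 200.0]])
new_scaled = (new_data - np.array(stats["mean"])) / np.array(stats["std"])
```

A fitted StandardScaler object holds the same statistics, so it can likewise be serialized (for example with pickle) and reloaded before calling transform on new data.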

Where do dimensions in Word2Vec come from?

I am using the word2vec model to train a neural network and build a neural embedding for finding similar words in the vector space. My question is about the dimensions of the word and context embeddings (matrices), which we initialise with random numbers (vectors) at the beginning of training, like this
Let's say we want to display the words {book, paper, notebook, novel} on a graph. First of all we build a matrix with dimensions 4x2 or 4x3 or 4x4 etc. I know the first dimension of the matrix is the size of our vocabulary |v|. But what about the second dimension of the matrix (the number of dimensions of each vector)? For example, if the vector for the word "book" is [0.3,0.01,0.04], what are these numbers? Do they have any meaning? For example, is 0.3 related to the relation between the words "book" and "paper" in the vocabulary, 0.01 to the relation between "book" and "notebook", etc.?
Just like TF-IDF or co-occurrence matrices, where each dimension (column) Y has a meaning: it is a word or document related to the word in row X.
The word2vec model uses a network architecture to represent the input word(s) and most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights and biases that allow the network to compute its internal representation of the function mapping the input vector (e.g. “cat” in the linked example) to the output vector (e.g. “climbed”).
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
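A small sketch may make this more concrete. It assumes a single hidden layer without bias and an embedding size of 3 chosen arbitrarily; the vocabulary and random values are purely illustrative. Multiplying a one-hot input by the |V| x d weight matrix just selects one row, so the word vector is a row of learned weights, and words are usually compared through the whole vector (e.g. cosine similarity) rather than through any single coordinate.

```python
import numpy as np

vocab = ["book", "paper", "notebook", "novel"]
embedding_dim = 3  # the "second dimension": a chosen hyper-parameter

# Randomly initialised |V| x d weight matrix (before any training).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

# One-hot input for "book"; the hidden layer is one_hot @ W.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("book")] = 1.0
hidden = one_hot @ W  # identical to the row W[0]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

similarity = cosine(W[vocab.index("book")], W[vocab.index("novel")])
```

With untrained weights the similarity value is meaningless; only after training do the relative positions of the rows start to reflect word usage.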
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias being introduced to the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
I have wondered the same thing and put in a vector like (1 0 0 0 0 0...) to see what terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning, but were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.

Regression, classification on Machine Learning

I have a classification and regression question on machine learning.
First question, the following dataset
Can we say, the data set is linearly separable?
In order to apply a linear model for classification, a transformation of the input space is not needed for this dataset, or is not possible for this dataset?
My answer is no, but I am not sure about the second; I am not sure a transformation is possible for this dataset.
Second question, about the regression problem:
Given the following data set f : R -> R
Can we say that:
A linear model for regression can be used to learn the function associated to this data set?
Given this data set, it is not possible to determine an optimal configuration of the linear model?
I am reading Tom Mitchell's book Machine Learning and Bishop's Pattern Recognition and Machine Learning, but I still have trouble giving the right answer.
Thanks in advance.
Neither of these datasets can be modeled using linear classification/regression.
As for the "input data transformation": as long as the dataset is consistent (there are no two identical points with different labels), there always exists a transformation after which the data is linearly separable. In particular, one can construct it with:
phi(x) = 1 iff label of x is "1"
In other words, you map all positive samples to "1" and all negatives to "0", so your data is now trivially linearly separable. Or simply map your N points to N unit vectors in R^N in such a way that the i-th point is mapped to [0 0 0 ... 1 ... 0 0 0]^T, where the "1" appears in the i-th place. Such a dataset is trivially linearly separable for any labeling.
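The unit-vector construction can be checked with a short sketch. The labeling below is arbitrary and made up for illustration; using labels in {+1, -1} and taking the weight vector equal to the label vector is one convenient choice, not the only one.

```python
import numpy as np

# Arbitrary labeling of N = 5 points (illustrative).
labels = np.array([1, -1, -1, 1, -1])
N = len(labels)

# Map the i-th point to the i-th unit vector in R^N.
Phi = np.eye(N)

# A linear separator through the origin: w_i is the label of point i.
w = labels.astype(float)
predictions = np.sign(Phi @ w)
```

Because each mapped point touches exactly one weight, the hyperplane classifies every point correctly regardless of how the labels were assigned. Of course, such a mapping memorizes the training set and says nothing about unseen points.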

how to use weight when training a weak learner for adaboost

The following is adaboost algorithm:
It mentions "using weights wi on the training data" at part 3.1.
I am not very clear about how to use the weights. Should I resample the training data?
It depends on what classifier you are using.
If your classifier can take instance weights (weighted training examples) into account, then you don't need to resample the data. An example would be a naive Bayes classifier that accumulates weighted counts, or a weighted k-nearest-neighbor classifier.
Otherwise, you want to resample the data using the instance weights, i.e., instances with larger weights may be sampled multiple times, while instances with small weights might not appear in the training data at all. Most other classifiers fall into this category.
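The resampling option could look roughly like this with NumPy. The data, labels and weights are invented for the example; the only requirement is that the weights are non-negative and sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set with current boosting weights (illustrative values).
X = np.arange(10).reshape(5, 2)
y = np.array([1, -1, 1, -1, 1])
weights = np.array([0.05, 0.05, 0.60, 0.25, 0.05])  # sums to 1

# Draw a new training set of the same size, with replacement,
# where each instance is picked with probability equal to its weight.
idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
X_resampled, y_resampled = X[idx], y[idx]
```

The weak learner is then trained on `X_resampled` and `y_resampled` as if they were ordinary unweighted data. When a classifier accepts weights directly (many scikit-learn estimators take a `sample_weight` argument in `fit`), this step can be skipped.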
In Practice
Actually in practice, boosting performs better if you only rely on a pool of very naive classifiers, e.g., decision stumps or linear discriminants. In this case, the algorithm you listed has an easy-to-implement form (see here for details):
Where alpha is chosen by (epsilon is defined similarly as yours).
An Example
Define a two-class problem in the plane (for example, a circle of points inside a square) and build a strong classifier out of a pool of randomly generated linear discriminants of the type sign(ax1 + bx2 + c).
The two class labels are represented with red crosses and blue dots. Here we use a bunch of linear discriminants (yellow lines) to construct the pool of naive/weak classifiers. We generate 1000 data points for each class in the graph (inside the circle or not), and 20% of the data is reserved for testing.
This is the classification result (in the test dataset) I got, in which I used 50 linear discriminants. The training error is 1.45% and the testing error is 2.3%
The weights are the values applied to each example (sample) in step 2. These weights are then updated at step 3.3 (wi).
So initially all weights are equal (step 2); they are then increased for wrongly classified data and decreased for correctly classified data. So in step 3.1 you have to take these values into account when determining a new classifier, giving more importance to examples with higher weights. If you did not change the weights, you would produce exactly the same classifier each time you execute step 3.1.
These weights are only used for training purposes; they're not part of the final model.
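A minimal sketch of one reweighting round is shown below. It assumes labels in {+1, -1} and the common discrete AdaBoost choice alpha = 0.5 * ln((1 - err) / err); the algorithm referenced in the question is not reproduced here, so its exact constants may differ. The helper names are hypothetical.

```python
import numpy as np

def weighted_error(w, y, pred):
    """Sum of the weights of the misclassified examples."""
    return float(np.sum(w[pred != y]))

def reweight(w, y, pred):
    """One boosting round: return (alpha, updated normalized weights).

    Assumes 0 < weighted error < 1 and labels/predictions in {+1, -1}.
    """
    err = weighted_error(w, y, pred)
    alpha = 0.5 * np.log((1.0 - err) / err)
    # y * pred is +1 for correct and -1 for wrong predictions, so
    # wrong examples are scaled up and correct ones scaled down.
    new_w = w * np.exp(-alpha * y * pred)
    return alpha, new_w / new_w.sum()

y = np.array([1, 1, -1, -1])
pred = np.array([1, 1, -1, 1])  # the last example is misclassified
w = np.full(4, 0.25)            # equal initial weights

alpha, w_next = reweight(w, y, pred)
```

After this round the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.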
