Feature scaling (normalization) for clustering algorithms (as Kmeans & EM) - machine-learning

I want to use the KMeans clustering algorithm to analyze some profile data. The sample data is in the following format:
Features: name ISBN Date ID price ....
'A' '31NDB' '05/18/2014' 'CBDDN' 12.00
'B' '3241B' '08/19/2012' 'ABCDE' 33.08
These are just examples; the real data is not necessarily in this format. But if I need to apply a clustering algorithm to this data set, how do I do the feature scaling (i.e. normalization) part? How should I treat the string values, the date values, and the price (double) values? Is there a relationship between these values? I'm confused...
Any idea?

K-means and EM are for numeric data only.
It does not make much sense to apply them on name/date/price typed data.
As the name indicates, the algorithm needs to compute means. How would you compute a mean in your "name" column? You can hack something for the date, but not for the name.
Wrong tool for your job.

You will have to encode the non-numeric features as numbers. This is the case for categorical or ordinal features.
Also, if certain features are unimportant to your analysis, consider throwing them away. For example, if you are trying to cluster books, then the purchase date might not be important (or it might be, depending on what you are concerned with), in which case adding the date won't make sense.
As an example, to encode a variable with 3 categories, you could encode it as 3 variables [1, 0, 0], [0, 1, 0], [0, 0, 1], or as 2 variables [0, 0], [1, 0], [0, 1].
There is a bit more discussion on this here.
Note that because your KMeans/GMM (since you alluded to EM) is going to compute distances between points, proper encoding is especially important. Understand what each encoding entails, especially when combined with different feature normalization schemes, and try different ones to see the results.
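As a rough illustration of the encoding and scaling steps described above, here is a minimal sketch in plain Python. The rows and the "category" column are hypothetical (the question's name/ISBN/ID columns are identifiers rather than features), and min-max scaling is only one of several normalization schemes you could try.

```python
from datetime import date

# Hypothetical rows: (category, purchase date, price). Not the asker's real data.
rows = [
    ("fiction", date(2014, 5, 18), 12.00),
    ("science", date(2012, 8, 19), 33.08),
    ("history", date(2013, 1, 2), 20.50),
]

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]

def min_max(values):
    """Rescale numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

categories = sorted({r[0] for r in rows})
days = min_max([r[1].toordinal() for r in rows])  # date -> day count -> [0, 1]
prices = min_max([r[2] for r in rows])            # price -> [0, 1]

# One numeric vector per row: one-hot category, scaled date, scaled price.
features = [
    one_hot(r[0], categories) + [d, p]
    for r, d, p in zip(rows, days, prices)
]
```

Every column now lies in [0, 1], so no single feature dominates the Euclidean distances purely because of its units.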

Related

How do I match samples with their predictions when doing inference with PyTorch's DistributedSampler?

I have trained a torch model for NLP tasks and would like to perform some inference using a multi GPU machine (in this case with two GPUs).
Inside the processing code, I use this
dataset = TensorDataset(encoded_dict['input_ids'], encoded_dict['attention_mask'])
sampler = DistributedSampler(
dataset, num_replicas=args.nodes * args.gpus, rank=args.node_rank * args.gpus + gpu_number, shuffle=False
)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
For those familiar with NLP, encoded_dict is the output from the tokenizer.batch_encode_plus function where the tokenizer is an instance of transformers.BertTokenizer.
The issue I’m having is that when I call the code through the torch.multiprocessing.spawn function, each GPU is doing predictions (i.e. inference) on a subset of the full dataset, and saving the predictions separately; for example, if I have a dataset with 1000 samples to predict, each GPU is predicting 500 of them. As a result, I have no way of knowing which samples out of the 1000 were predicted by which GPU, as their order is not preserved, therefore the model predictions are meaningless as I cannot trace each of them back to their input sample.
I have tried saving the dataloader instance (as a pickle) together with the predictions and then extracting the input_ids via dataloader.dataset.tensors. However, this requires a tokenizer decoding step which I would rather avoid, as the tokenizer will have slightly changed the text (for example, double whitespaces are removed, words with dashes are split, and so on).
What is the cleanest way to save the input text samples together with their predictions when doing inference in distributed mode, or alternatively to keep track of which prediction refers to which sample?
As I understand it, basically your dataset returns for an index idx [data,label] during training and [data] during inference. The issue with this is that the idx is not preserved by the dataloader object, so there is no way to obtain the idx values for the minibatch after the fact.
One way to handle this issue is to define a very simple custom dataset object that also returns [data,id] instead of only data during inference. Probably the easiest way to do this is to make the dataset return a dictionary object with keys id and data. The dictionary return type is convenient because Pytorch collates (converts data structures to batches) this type automatically, otherwise you'd have to define a custom collate_fn and pass it to the dataloader object, which is itself not very hard but is an extra step.
In any case, here's how I would define a new dataset object, which should be almost a one-to-one substitute for your current dataset (I believe):
import torch

class TensorDictDataset(torch.utils.data.Dataset):
    def __init__(self, ids, attention_mask):
        self.ids = ids
        self.mask = attention_mask

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        # Return a dict so the default collate function batches it by key.
        datum = {
            "mask": self.mask[idx],
            "id": self.ids[idx],
        }
        return datum
The only change you'll then have to make is that rather than returning mask your dataset will now return dict{"mask":mask,"id":id} so you'll have to parse that appropriately.
thanks for your answer. I have done further debugging and found another solution and wanted to post it.
Your solution is quite elegant (there was one minor misunderstanding, in that the predictions contain only the predicted labels and not the data, contrary to what you understood, but this doesn't affect your answer anyway). Mask in NLP also means something else, and instead of having the mask tokens together with the predictions I would like to have the untokenized text string. This is not so easy to achieve because the splitting of the data across GPUs happens AFTER the tokenisation; however, I believe that with a slight adaptation of your answer it could work.
However, I’ve done some further debugging and I’ve noticed that the data are not actually randomly split across GPUs as I thought. If I set shuffle=False in the DistributedSampler then this happens:
in the case of two GPUs, GPU 0 and GPU 1, all the samples with even index (starting from 0) will be passed to GPU 0, and all those with odd index will be passed to GPU 1.
So for example, if you have 10 samples, whose indices are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], then samples 0, 2, 4, 6, 8 will go to GPU 0 and samples 1, 3, 5, 7, 9 will go to GPU 1. This allows me to map the predictions back to the original text string samples just by using this ordering. Not sure if this is the best solution, as keeping the original text string next to its prediction would be ideal, but at least it works.
N.B. Special case: As the two GPUs must be passed the SAME number of inputs, if the number of inputs is odd, for example 9 samples with indices [0, 1, 2, 3, 4, 5, 6, 7, 8], then GPU 0 will be passed samples 0, 2, 4, 6, 8 and GPU 1 will be passed samples 1, 3, 5, 7, 0 (in this exact order). In other words, the first sample (index 0) is repeated at the very end of the dataset to make sure each GPU has the same number of samples, in which case we can write some code that drops the last prediction from GPU 1, as it is redundant.
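The ordering described above can be sketched in plain Python without touching PyTorch. The helper names shard_indices and merge_predictions are hypothetical, and the code assumes shuffle=False, padding by repeating samples from the start of the dataset (as observed above), and at least as many samples as replicas. It is a sketch of that observed behaviour, not a replacement for checking your PyTorch version.

```python
import math

def shard_indices(num_samples, num_replicas, rank):
    """Indices one rank would see, assuming shuffle=False and padding by
    repeating samples from the start of the dataset."""
    per_rank = math.ceil(num_samples / num_replicas)
    total = per_rank * num_replicas
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # pad so every rank gets per_rank items
    return indices[rank:total:num_replicas]

def merge_predictions(per_rank_preds, num_samples):
    """Put per-rank prediction lists back into the original dataset order."""
    num_replicas = len(per_rank_preds)
    merged = [None] * num_samples
    seen = set()
    for rank, preds in enumerate(per_rank_preds):
        for idx, pred in zip(shard_indices(num_samples, num_replicas, rank), preds):
            if idx not in seen:  # skip predictions for padded duplicates
                merged[idx] = pred
                seen.add(idx)
    return merged

print(shard_indices(9, 2, 0))
print(shard_indices(9, 2, 1))
```

With the per-rank prediction files loaded as lists, merge_predictions returns one list aligned with the original text samples, with the redundant padded prediction dropped.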

How to understand SpatialDropout1D and when to use it?

Occasionally I see some models are using SpatialDropout1D instead of Dropout. For example, in the Part of speech tagging neural network, they use:
model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2))  # this layer
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))
According to Keras' documentation, it says:
This version performs the same function as Dropout, however it drops
entire 1D feature maps instead of individual elements.
However, I am unable to understand the meaning of "entire 1D feature maps". More specifically, I am unable to visualize SpatialDropout1D in the same model explained in quora.
Can someone explain this concept by using the same model as in quora?
Also, under what situation we will use SpatialDropout1D instead of Dropout?
To keep it simple, I would first note that the so-called feature maps (1D, 2D, etc.) are just our regular channels. Let's look at examples:
Dropout(): Let's define a 2D input: [[1, 1, 1], [2, 2, 2]]. Dropout will consider every element independently, and may result in something like [[1, 0, 1], [0, 2, 2]].
SpatialDropout1D(): In this case the result will look like [[1, 0, 1], [2, 0, 2]]. Notice that the 2nd channel was zeroed at every position.
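To make the difference concrete, here is a small pure-Python sketch of both behaviours on the same 2D input. It is an illustration, not Keras code: the function names are made up, rows are treated as timesteps and columns as channels, and kept values are rescaled by 1/(1 - rate), which the simplified example above leaves out.

```python
import random

def elementwise_dropout(x, rate, rng):
    """Drop each element independently; kept values are scaled by 1/(1-rate)."""
    scale = 1.0 / (1.0 - rate)
    return [[v * scale if rng.random() >= rate else 0.0 for v in row] for row in x]

def spatial_dropout_1d(x, rate, rng):
    """Draw one keep/drop decision per channel and reuse it for every timestep."""
    scale = 1.0 / (1.0 - rate)
    n_channels = len(x[0])
    keep = [rng.random() >= rate for _ in range(n_channels)]
    return [[v * scale if keep[c] else 0.0 for c, v in enumerate(row)] for row in x]

x = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]  # rows = timesteps, columns = channels
rng = random.Random(0)
print(elementwise_dropout(x, 0.5, rng))  # zeros may appear anywhere
print(spatial_dropout_1d(x, 0.5, rng))   # zeros only as whole columns
```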
The noise shape
In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor has shape [2, 2, 2], each of its 8 elements can be zeroed out depending on a random coin flip (with a certain "heads" probability); in total, there will be 8 independent coin flips and any number of values may become zero, from 0 to 8.
Sometimes there is a need to do more than that. For example, one may need to drop the whole slice along axis 0. The noise_shape in this case is [1, 2, 2] and the dropout involves only 4 independent random coin flips. The two elements that differ only in their index along axis 0 will either be kept together or be dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8. It cannot be 1 or 5.
Another way to view this is to imagine that the input tensor in fact has shape [2, 2], but each value is double-precision (or multi-precision). Instead of dropping the bytes in the middle, the layer drops the full multi-byte value.
Why is it useful?
The example above is just for illustration and isn't common in real applications. More realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, each batch and channel component will be kept independently, but each row and column will be kept or not kept together. In other words, the whole [l, m] feature map will be either kept or dropped.
You may want to do this to account for the correlation between adjacent pixels, especially in the early convolutional layers. Effectively, you want to prevent co-adaptation of pixels with their neighbors across the feature maps, and make them learn as if no other feature maps existed. This is exactly what SpatialDropout2D is doing: it promotes independence between feature maps.
The SpatialDropout1D is very similar: given shape(x) = [k, l, m] it uses noise_shape = [k, 1, m] and drops entire 1-D feature maps.
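The noise-shape idea can be sketched for a 3-D tensor stored as nested lists. This is a hypothetical helper, not the Keras implementation: it assumes each entry of noise_shape is either 1 or equal to the matching dimension of x, and it rescales kept values by 1/(1 - rate).

```python
import random

def dropout_with_noise_shape(x, noise_shape, rate, rng):
    """Dropout on a 3-D nested list; axes where noise_shape is 1 share one coin flip."""
    k, l, m = len(x), len(x[0]), len(x[0][0])
    nk, nl, nm = noise_shape
    scale = 1.0 / (1.0 - rate)
    # One independent keep/drop decision per entry of the (smaller) noise tensor.
    keep = [[[rng.random() >= rate for _ in range(nm)]
             for _ in range(nl)]
            for _ in range(nk)]
    # Broadcast the mask: an index modulo 1 is always 0, so size-1 axes are shared.
    return [[[x[a][b][c] * scale if keep[a % nk][b % nl][c % nm] else 0.0
              for c in range(m)]
             for b in range(l)]
            for a in range(k)]

# shape(x) = [k, l, m] = [2, 3, 4]; SpatialDropout1D corresponds to noise_shape [k, 1, m].
x = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]
out = dropout_with_noise_shape(x, (2, 1, 4), 0.5, random.Random(0))
for sample in out:
    print(sample)
```

Here only 2 * 4 = 8 coin flips are made instead of 24, and within each sample a channel is either kept at every one of the 3 positions or zeroed at all of them.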
Reference: Efficient Object Localization Using Convolutional Networks by Jonathan Tompson et al.

Can artificial neural networks work with mathematical sets?

I know that using neural networks for anything text-related is difficult as they have problems with non-numerical input data.
But I'm not sure about mathematical sets. And sets of sets.
Like [0, 1, 2] and [3, 4, 5] or [[0, 1], [2, 3]] and [[4, 5], [6, 7]]
It should be possible to compute distances between these by computing the distances between all corresponding elements, right? I can't really find any information on that and don't want to start using neural networks without being sure.
(Googling anything with 'set' just isn't promising because all you get as result is the term 'data set'..)
EDIT:
First: The assignment specifically asks for a neural network, so I can't use k-means or any other clustering methods.
So the original question wasn't really addressing the actual problem. I don't have to think of a distance metric but of a way to add the sets to the activation function and for that of how to map them to a single value. But, regarding the distance metric, I'm actually not really sure at what point of the neural network I need it.. I guess that's a basic comprehension problem.
I will just write down some thoughts now.
The thing that confuses me is standardization of categories. Having three categories 'red', 'green' and 'blue' you can map them to numbers 1 to 3, but that would mean that 'red' would have a larger distance to 'blue' than 'green' does and that's not the case. So the categories are encoded as (1, 0, 0) and (0, 1, 0) and (0, 0, 1) which gives them all the same distance.
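The distance argument in the paragraph above can be checked numerically. This is a small sketch using Euclidean distance; the dictionary names are made up for the illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

integer_codes = {"red": [1], "green": [2], "blue": [3]}
one_hot_codes = {"red": [1, 0, 0], "green": [0, 1, 0], "blue": [0, 0, 1]}

def pairwise(codes):
    """Distance for every unordered pair of category names."""
    names = sorted(codes)
    return {(p, q): euclidean(codes[p], codes[q])
            for i, p in enumerate(names) for q in names[i + 1:]}

print(pairwise(integer_codes))  # red/blue are twice as far apart as red/green
print(pairwise(one_hot_codes))  # every pair is the same distance apart
```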
So it must be possible to add these to the activation function somehow. I could imagine that they are interpreted as binary numbers, so that (1,0,0)=100=4, (0,1,0)=010=2 and (0,0,1)=001=1. That would be a distinct mapping. But the numbers 1 to 3 are distinct too, so as mentioned above, the distance metric must be necessary at some point.
So the problem still is how to map a set to a single value. I can do that right before I add it to the function, so I don't have to choose a mapping that also maintains a logical distance between the sets because when getting to the point of applying the distance metric I can still apply it to the original sets and don't have to use the mapped value. Is that correct? Or am I still missing something?
Neural nets, in general, have no such problem. Image recognition and language translation are well within their domains. What you do need is the metrics and manipulations to relate your inputs to the ground truth in a well-ordered fashion -- which your distance metric will do quite nicely.
Go right ahead and build your neural network. Supply it with the appropriate distance function, and let it train away. Do make sure to put in some tracking instrumentation (e.g. print statements) to trace the operation for a few iterations before you turn it entirely loose.

Multi Label classification with Sklearn

I have tried using the OneVsRest with Logistic Regression from Sklearn, but it gives empty labels for some samples (i.e. doesn't predict any out), even though I do not have any unlabelled training data.
Any idea what might be causing this or how to fix this?
clf = OneVsRestClassifier(LogisticRegression(multi_class='ovr',max_iter=1000,solver='lbfgs'))
clf.fit(X,Y)
self.classifier=clf
self.classifier.predict(test_data)
Whenever you are performing MultiLabel classification, according to the OneVsRestClassifier the targets need to be "a sequence of sequences of labels".
Moreover, depending on how you encode these labels you may get the following warning: "DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation."
So, a neat way to encode your labels is:
from sklearn import preprocessing
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1,2), (1,2),(4,)])
# this means sample one belongs to classes {1,2} and so on.
# Take into account the format if only one class is needed, (4,) not (4)
so Y turns out to be:
array([[1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]])
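On the empty predictions themselves: as far as I understand, a one-vs-rest model decides each label with its own binary classifier, so a sample for which no classifier is confident ends up with an all-zero row even though every training sample had labels. One possible workaround, sketched below in plain Python, is to threshold per-label probabilities yourself and fall back to the most probable label when nothing passes. The function name, the 0.5 threshold and the fallback rule are choices made for this sketch, not scikit-learn behaviour, and the probabilities are made-up stand-ins for what clf.predict_proba(test_data) might return.

```python
def predict_labels(probabilities, threshold=0.5):
    """Threshold each label independently; if no label passes, keep the single
    most probable one instead of returning an empty row."""
    rows = []
    for probs in probabilities:
        row = [1 if p >= threshold else 0 for p in probs]
        if not any(row):
            best = max(range(len(probs)), key=probs.__getitem__)
            row[best] = 1
        rows.append(row)
    return rows

# Hypothetical per-label probabilities for two samples and three labels.
probabilities = [[0.9, 0.7, 0.1], [0.4, 0.3, 0.2]]
print(predict_labels(probabilities))
```

The second sample would get no labels under plain thresholding; with the fallback it gets the label with probability 0.4.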

How to evaluate predictions from incomplete data, where not all data is incomplete

I am using Non-negative Matrix Factorization and Non-negative Least Squares for predictions, and I want to evaluate how good the predictions are depending on the amount of data given. For example the original Data was
original = [1, 1, 0, 1, 1, 0]
And now I want to see how well I can reconstruct the original data when the given data is incomplete:
incomplete1 = [1, 1, 0, 1, 0, 0],
incomplete2 = [1, 1, 0, 0, 0, 0],
incomplete3 = [1, 0, 0, 0, 0, 0]
And I want to do this for every example in a big dataset. Now the problem is that the original data varies in the number of positives: in the original above there are 4, but for other examples in the dataset it could be more or fewer. Let's say I make an evaluation round with 4 positives given, but half of my dataset only has 4 positives, while the other half has 5, 6 or 7. Should I exclude the half with 4 positives, because they have no data missing, which makes the "prediction" much better? On the other hand, I would change the training set if I excluded data. What can I do? Or shouldn't I evaluate with 4 at all in this case?
EDIT:
Basically I want to see how well I can reconstruct the input matrix. For simplicity, say the "original" stands for a user who watched 4 movies. I then want to know how well I can predict each user, based on just 1 movie that the user actually watched. I get a prediction for lots of movies. Then I plot a ROC and a Precision-Recall curve (using the top-k of the prediction). I will repeat all of this with n movies that the users actually watched, and get a ROC curve in my plot for every n. When I come to the point where I use e.g. 4 movies that the user actually watched to predict all movies he watched, but he only watched those 4, the results get too good.
The reason why I am doing this is to see how many "watched movies" my system needs to make reasonable predictions. If it returned good results only once 3 movies had already been watched, it would not be very useful in my application.
I think it's first important to be clear what you are trying to measure, and what your input is.
Are you really measuring ability to reconstruct the input matrix? In collaborative filtering, the input matrix itself is, by nature, very incomplete. The whole job of the recommender is to fill in some blanks. If it perfectly reconstructed the input, it would give no answers. Usually, your evaluation metric is something quite different from this when using NNMF for collaborative filtering.
FWIW I am commercializing exactly this -- CF based on matrix factorization -- as Myrrix. It is based on my work in Mahout. You can read the docs about some rudimentary support for tests like Area under curve (AUC) in the product already.
Is "original" here an example of one row, perhaps for one user, in your input matrix? When you talk about half, and excluding, what training/test split are you referring to? splitting each user, or taking a subset across users? Because you seem to be talking about measuring reconstruction error, but that doesn't require excluding anything. You just multiply your matrix factors back together and see how close they are to the input. "Close" means low L2 / Frobenius norm.
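For the reconstruction-error reading of the question, the "multiply the factors back together" step can be sketched as follows. The tiny matrices are made up for illustration; V stands for the input and W, H for its non-negative factors.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def frobenius_error(v, w, h):
    """||V - W H||_F: square root of the summed squared entry-wise differences."""
    wh = matmul(w, h)
    return math.sqrt(sum((v[i][j] - wh[i][j]) ** 2
                         for i in range(len(v)) for j in range(len(v[0]))))

V = [[1, 1, 0], [0, 1, 1]]
W = [[1, 0], [0, 1]]
H_exact = [[1, 1, 0], [0, 1, 1]]
H_rough = [[1, 1, 0], [0, 1, 0]]

print(frobenius_error(V, W, H_exact))  # perfect reconstruction
print(frobenius_error(V, W, H_rough))  # one entry is off by 1
```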
But for conventional recommender tests (like AUC or precision/recall), which are something else entirely, you would either split your data into test/training by time (recent data is the test data) or by value (most-preferred or associated items are the test data). If I understand the 0s to be missing elements of the input matrix, then they are not really "data". You wouldn't ever have a situation where the test data were all the 0s, because they're not input to begin with. The question is, which 1s are for training and which 1s are for testing.
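One way to read "which 1s are for training and which 1s are for testing" is a per-user hold-out, sketched below. The function name and the rule of skipping users with nothing left to hold out are assumptions made for this sketch; a time-based or value-based split, as mentioned above, would pick the held-out 1s differently.

```python
import random

def split_positives(row, n_given, rng):
    """Keep `n_given` randomly chosen 1s as input and hold out the remaining 1s
    as test targets. Returns None when no 1 would be left over to test on."""
    positives = [i for i, v in enumerate(row) if v == 1]
    if len(positives) <= n_given:
        return None
    given = set(rng.sample(positives, n_given))
    train_row = [1 if i in given else 0 for i in range(len(row))]
    held_out = [i for i in positives if i not in given]
    return train_row, held_out

original = [1, 1, 0, 1, 1, 0]
rng = random.Random(0)
print(split_positives(original, 1, rng))  # 1 movie given, 3 held out for testing
print(split_positives(original, 4, rng))  # None: nothing left to predict
```

Under this rule a user with exactly 4 positives simply does not contribute to the "4 given" evaluation round, since there is no held-out 1 to score against.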
