data preprocessing tricks for auto-encoder - machine-learning

Recently, I try to use auto-encoder to find anomalies, but some of the input features are count data(e.g. number of clicks or number of shows). Do I need normalizing or scaling before training?

Yes you will. The most common way to do it is to subtract the mean and divide by the standard deviation. Each one of your click items should be normalized separately. For example, if you have number of 'nb_click_banner' and 'nb_click_sidebar' you should normalize both independently. This helps the network train faster, but it also gives all the features the same weighting at the input and doesn't require the network to learn to divide the weights for those by some factor to give it the same effect on the output.

I would assume that any kind of numerical feature would require normalization and scale data preprocessing, otherwise you could be in a situation where one feature influences the classification process more than the others simply because of the range of data it can hold.

Related

Is there a way to find the most representative set of samples of the entire dataset?

I'm working on text classification and I have a set of 200.000 tweets.
The idea is to manually label a short set of tweets and train classifiers to predict the labels of the rest. Supervised learning.
What I would like to know is if there is a method to choose what samples to include in the train set in a way that this train set is a good representation of the whole data set, and because the high diversity included in the train set, the trained classifiers have considerable trust to be applied on the rest of tweets.
This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?
If it's the first scenario, I think the steps in order of importance would be:
Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
Stratify by features you plan to use
Stratify by tweet length/vocabulary etc.
If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as a feature, coupled with a dimensionality reduction or clustering approach. For example:
Use something like PCA or t-SNE to maximize distance between tweets (or a large subset), then pick candidates from different regions of the projected space
Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch)
Use something like LDA to find n topics, then sample stratified by topic
Hope this helps!
It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample - because you don't know in advance what to stratify on.
After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset, and sample some tweets in which your classifier is least condfident. Label them, retrain the classifier, and repeat.
Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.

Standardization before or after categorical encoding?

I'm working on a regression algorithm, in this case k-NearestNeighbors to predict a certain price of a product.
So I have a Training set which has only one categorical feature with 4 possible values. I've dealt with it using a one-to-k categorical encoding scheme which means now I have 3 more columns in my Pandas DataFrame with a 0/1 depending the value present.
The other features in the DataFrame are mostly distances like latitud - longitude for locations and prices, all numerical.
Should I standardize (Gaussian distribution with zero mean and unit variance) and normalize before or after the categorical encoding?
I'm thinking it might be benefitial to normalize after encoding so that every feature is to the estimator as important as every other when measuring distances between neighbors but I'm not really sure.
Seems like an open problem, thus I'd like to answer even though it's late. I am also unsure how much the similarity between the vectors would be affected, but in my practical experience you should first encode your features and then scale them. I have tried the opposite with scikit learn preprocessing.StandardScaler() and it doesn't work if your feature vectors do not have the same length: scaler.fit(X_train) yields ValueError: setting an array element with a sequence. I can see from your description that your data have a fixed number of features, but I think for generalization purposes (maybe you have new features in the future?), it's good to assume that each data instance has a unique feature vector length. For instance, I transform my text documents into word indices with Keras text_to_word_sequence (this gives me the different vector length), then I convert them to one-hot vectors and then I standardize them. I have actually not seen a big improvement with the standardization. I think you should also reconsider which of your features to standardize, as dummies might not need to be standardized. Here it doesn't seem like categorical attributes need any standardization or normalization. K-nearest neighbors is distance-based, thus it can be affected by these preprocessing techniques. I would suggest trying either standardization or normalization and check how different models react with your dataset and task.
After. Just imagine that you have not numerical variables in your column but strings. You can't standardize strings - right? :)
But given what you wrote about categories. If they are represented with values, I suppose there is some kind of ranking inside. Probably, you can use raw column rather than one-hot-encoded. Just thoughts.
You generally want to standardize all your features so it would be done after the encoding (that is assuming that you want to standardize to begin with, considering that there are some machine learning algorithms that do not need features to be standardized to work well).
So there is 50/50 voting on whether to standardize data or not.
I would suggest, given the positive effects in terms of improvement gains no matter how small and no adverse effects, one should do standardization before splitting and training estimator

Reducing a matrix of feature vectors to a single, meaningful vector

I have matrices of feature vectors - 200 features long, in which the feature vectors within a matrix are temporally related, but I wish to reduce each matrix to a single, meaningful vector. I have applied PCA to the matrix in order to reduce its dimensionality to one with high variance, and am considering concatenating its rows together into one feature vector to summarize the data.
Is this a sensible approach, or are there better ways of achieving this?
So you have an n x 200 feature matrix, where n is your number of samples, and 200 features per sample, and each feature is temporally related to all others? Or you have individual feature matrices, one for each time point, and you want to run PCA on each of these individual feature matrices to find a single eigenvector for that time point, and then concatenate those together?
PCA seems more useful in the second case.
While this is doable, this is maybe not the best way to go about it because you lose temporal sensitivity by collapsing together features from different times. Even if each feature in your final feature matrix represents a different time, most classifiers cannot learn about the fact that feature 2 follows feature 1 etc. So you lose the natural temporal ordering by doing this.
If you care about the the temporal relationship between these features you may want to take a look at recurrent neural networks, which allow you feed information from t-1 into a node, at the same time as feeding in your current t features. So in a sense they learn about the relationship between t-1 and t features which will help you preserve temporal ordering. See this for an explanation: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
If you don't care about time and just want to group everything together, then yes PCA will help reduce your feature count. Ultimately it depends what type of information you think is more relevant to your problem.

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I've potentially got a huge number of features. Essentially a mountain of data points (for discussion let's say it's in the millions of features). I don't know what data points are useful and what are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and I can't generate as many training examples with the mountain of data as I could if I were to identify the relevant features, cut down how many data points I collect, and increase the number of training examples. I expect that I would get better classifiers with more training examples given fewer feature data points (while maintaining the relevant ones).
What machine learning algorithms should I focus on to, first,
identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
Maybe you can use PCA or Maximum entropy algorithm in order to reduce the data set...
You can go for Chi-Square tests or Entropy depending on your data type. Supervized discretization highly reduces the size of your data in a smart way (take a look into Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want to do is feature screening, a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what is the sane-size can be tricky, and I don't have a magic answer for that, but you can prioritize what order you'd want to include the features by
1) for each feature, split the data in two groups by the binary response
2) find the Komogorov-Smirnov statistic comparing the two sets
The features with the highest KS statistic are most useful in modeling.
There's a paper "out there" titled "A selctive overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li, I'm sure a free copy is floating around the web somewhere.
4 years later I'm now halfway through a PhD in this field and I want to add that the definition of a feature is not always simple. In the case that your features are a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network, for example, a feature is not one pixel of the input, rather it's much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/#ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Most appropriate normalization / transformation method for skewed features?

I am trying to pre-process biological data to train a neural network and despite an extensive search and repetitive presentation of the various normalization methods I am none the wiser as to which method should be used when. In particular I have a number of input variables which are positively skewed and have been trying to establish whether there is a normalisation method that is most appropriate.
I was also worried about whether the nature of these inputs would affect performance of the network and as such have experimented with data transformations (log transformation in particular). However some inputs have many zeros but may also be small decimal values and seem to be highly affected by a log(x + 1) (or any number from 1 to 0.0000001 for that matter) with the resulting distribution failing to approach normal (either remains skewed or becomes bimodal with a sharp peak at the min value).
Is any of this relevant to neural networks? ie. should I be using specific feature transformation / normalization methods to account for the skewed data or should I just ignore it and pick a normalization method and push ahead?
Any advice on the matter would be greatly appreciated!
Thanks!
As features in your input vector are of different nature, you should use different normalization algorithms for every feature. Network should be feeded by uniformed data on every input for better performance.
As you wrote that some data is skewed, I suppose you can run some algoritm to "normalize" it. If applying logarithm does not work, perhaps other functions and methods such as rank transforms can be tried out.
If the small decimal values do entirely occur in a specific feature, then just normalize it in specific way, so that they get transformed into your work range: either [0, 1] or [-1, +1] I suppose.
If some inputs have many zeros, consider removing them from main neural network, and create additional neural network which will operate on vectors with non-zeroed features. Alternatively, you may try to run Principal Component Analysis (for example, via Autoassociative memory network with structure N-M-N, M < N) to reduce input space dimension and so eliminate zeroed components (they will be actually taken into account in the new combined inputs somehow). BTW, new M inputs will be automatically normalized. Then you can pass new vectors to your actual worker neural network.
This is an interesting question. Normalization is meant to keep features' values in one scale to facilitate the optimization process.
I would suggest the following:
1- Check if you need to normalize your data. If, for example, the means of the variables or features are within same scale of values, you may progress with no normalization. MSVMpack uses some normalization check condition for their SVM implementation. If, however, you need to do so, you are still advised to run the models on the data without Normalization.
2- If you know the actual maximum or minimum values of a feature, use them to normalize the feature. I think this kind of normalization would preserve the skewedness in values.
3- Try decimal value normalization with other features if applicable.
Finally, you are still advised to apply different normalization techniques and compare the MSE for evey technique including z-score which may harm the skewedness of your data.
I hope that I have answered your question and gave some support.

Resources