Machine Learning data preprocessing [closed]

I have a question regarding data preprocessing for machine learning, specifically transforming the data so it has zero mean and unit variance.
I have split my data into two datasets (I know I should have three, but for the sake of simplicity let's just say I have two). Should I transform my training set so that the entire training set has zero mean and unit variance, and then, when testing the model, transform each test input vector so that that particular vector has zero mean and unit variance? Or should I transform the entire dataset (training and testing) together so that the whole thing has zero mean and unit variance? My belief is that I should do the former, so that I don't introduce an undue amount of bias into the test data set. But I am no expert, hence my question.

Fitting your preprocessor should be done on the training set only; the fitted mean and variance are then used to transform the test set. Computing these statistics on train and test together leaks information about the test set.
Let me link you to a good Deep Learning course and quote it (both the course and the quote are from Andrej Karpathy):
Common pitfall. An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
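In scikit-learn terms, that means fitting the scaler on the training split only and re-using it on the test split. A minimal sketch (the toy data and variable names are placeholders, not from the original question):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Toy feature matrix standing in for the real data.
    X = np.random.randn(200, 5) * 3.0 + 10.0
    X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)  # mean/variance computed on the training split only
    X_test_std = scaler.transform(X_test)        # the same statistics are re-used, never re-fitted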

Should I balance the test set for evaluating a model? [closed]

I have to evaluate a logistic regression model. The model is aimed at detecting fraud, so in real life the algorithm will face highly imbalanced data.
Some people say that I need to balance the training set only, while the test set should remain similar to real-life data. On the other hand, many people say that the model must be trained and tested on balanced samples.
I tried testing my model on both a balanced and an unbalanced test set and got the same ROC AUC (0.73) but different precision-recall AUCs: 0.4 for the unbalanced set and 0.74 for the balanced one.
Which should I choose?
And what metrics should I use to evaluate my model's performance?
Since you are dealing with a problem that has an imbalanced concept (disproportionately more non-fraud cases than fraud cases), I recommend using the F-score together with an unbalanced test set that matches the real-world distribution. This lets you compare models without having to balance your test set, since balancing would mean overrepresenting fraud and underrepresenting non-fraud cases in your test set.
Here are some references and notes on how to implement it in sklearn (see the sketch after these links):
https://en.wikipedia.org/wiki/F-score
https://deepai.org/machine-learning-glossary-and-terms/f-score
https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
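A minimal sketch of evaluating on a realistic, unbalanced test set with scikit-learn (the synthetic data and the 2% fraud rate are placeholder assumptions; 1 = fraud):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score, average_precision_score
    from sklearn.model_selection import train_test_split

    # Synthetic, heavily imbalanced data standing in for the real fraud set.
    X, y = make_classification(n_samples=20000, weights=[0.98, 0.02], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Score on the untouched, real-world-like (imbalanced) test set.
    print("F1 (fraud class):", f1_score(y_test, model.predict(X_test)))
    print("PR AUC:", average_precision_score(y_test, model.predict_proba(X_test)[:, 1]))

average_precision_score summarises the precision-recall curve, which, as you observed, is far more sensitive to class imbalance than ROC AUC.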

How can I get the right balance between classification loss and a regularizer? [closed]

I'm working on a deep learning classifier (Keras and Python) that classifies time series into three categories. The loss function that I'm using is the standard categorical cross-entropy. In addition to this, I also have an attention map which is being learnt within the same model.
I would like this attention map to be as small as possible, so I'm using a regulariser. Here comes the problem: how do I set the right regularisation parameter? What I want is for the network to reach its maximum classification accuracy first, and then start minimising the intensity of the attention map. For this reason, I train my model once without the regulariser and a second time with the regulariser on. However, if the regularisation parameter (lambda) is too high, the network completely loses accuracy and only minimises the attention, while if it is too small, the network only cares about the classification error and won't minimise the attention, even when the accuracy is already at its maximum.
Is there a smarter way to combine the categorical cross-entropy with the regulariser? Maybe something that tracks the variation of the categorical cross-entropy over time and, if it doesn't go down for, say, N iterations, only then brings in the regulariser?
Thank you
Regularisation is a way to fight overfitting, so you should first determine whether your model overfits. A simple way to do this is to compare the F1 score on the training and test sets: if the F1 score is high on the training set and low on the test set, you likely have overfitting and need to add some regularisation.
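If you still want to experiment with the weighted combination described in the question, one possible sketch in Keras is to attach an L1 activity regulariser to the layer that produces the attention map, so the penalty on its intensity sits alongside the categorical cross-entropy (the architecture, layer names, and lambda value below are purely illustrative assumptions, not the original model):

    from tensorflow.keras import Model, layers, regularizers

    lam = 1e-3  # illustrative regularisation weight; needs tuning per problem

    inputs = layers.Input(shape=(100, 8))  # (timesteps, channels), placeholder shape
    # The L1 activity regulariser pushes the attention map's intensity towards zero.
    attention = layers.Dense(1, activation="sigmoid",
                             activity_regularizer=regularizers.l1(lam),
                             name="attention")(inputs)
    weighted = layers.Multiply()([inputs, attention])
    features = layers.GlobalAveragePooling1D()(weighted)
    outputs = layers.Dense(3, activation="softmax")(features)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

The "accuracy first, attention later" behaviour the question asks about could then be approximated much as the question already does: train once with the regulariser disabled (or lam = 0), then rebuild the model with the regulariser enabled and fine-tune it from the learned weights.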

Word embedding training [closed]

I have one corpus for word embeddings. Using this corpus, I trained my word embeddings. However, every time I train them, the results are quite different (the results are based on K-Nearest Neighbours (KNN)). For example, in the first training run, the nearest-neighbour words of 'computer' are 'laptops', 'computerized', 'hardware'. But in the second run, the nearest neighbours are 'software', 'machine', ... ('laptops' is ranked low!). All runs are trained independently for 20 epochs, and the hyper-parameters are all the same.
I want my word embeddings to come out very similar across runs (e.g., 'laptops' consistently ranked high). How should I do that? Should I adjust the hyper-parameters (learning rate, initialization, etc.)?
You didn't say what word2vec software you're using, which might change the relevant factors.
The word2vec algorithm inherently uses randomness, in both initialization and several aspects of its training (like the selection of negative-examples, if using negative-sampling, or random downsampling of very-frequent words). Additionally, if you're doing multithreaded training, the essentially-random jitter in the OS thread scheduling will change the order of training examples, introducing another source of randomness. So you shouldn't necessarily expect subsequent runs, even with the exact same parameters and corpus, to give identical results.
Still, with enough good data, suitable parameters, and a proper training loop, the relative-neighbors results should be fairly similar from run-to-run. If it's not, more data or more iterations might help.
Wildly-different results would be most likely if the model is overlarge (too many dimensions/words) for your corpus – and thus prone to overfitting. That is, it finds a great configuration for the data, through essentially memorizing its idiosyncrasies, without achieving any generalization power. And if such overfitting is possible, there are typically many equally-good such memorizations – so they can be very different from run-to-run. Meanwhile, a right-sized model with lots of data will instead be capturing true generalities, and those would be more consistent from run-to-run, despite any randomization.
Getting more data, using smaller vectors, using more training passes, or upping the minimum-count of word-occurrences to retain/train a word all might help. (Very-infrequent words don't get high-quality vectors, so wind up just interfering with the quality of other words, and then randomly intruding in most-similar lists.)
To know what else might be awry, you should clarify in your question things like:
software used
modes/metaparameters used
corpus size, in number of examples, average example size in words, and unique-words count (both in the raw corpus, and after any minimum-count is applied)
methods of preprocessing
code you're using for training (if you're managing the multiple training-passes yourself)
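For reference, if you happen to be using gensim, the determinism-related knobs mentioned above look roughly like this (a sketch assuming gensim 4.x; the corpus and parameter values are placeholders):

    from gensim.models import Word2Vec

    sentences = [["the", "computer", "has", "new", "hardware"],
                 ["laptops", "are", "portable", "computers"]]  # placeholder corpus

    model = Word2Vec(
        sentences,
        vector_size=100,  # smaller vectors help when the corpus is small
        min_count=1,      # raise this on a real corpus to drop rare, noisy words
        workers=1,        # a single worker removes thread-scheduling randomness
        seed=42,          # fixes initialization and sampling randomness
        epochs=20,
    )
    print(model.wv.most_similar("computer", topn=5))

Note that even with a fixed seed, fully identical runs generally also require a single worker thread and a fixed PYTHONHASHSEED; run-to-run similarity of the neighbour lists, rather than bit-identical vectors, is the more realistic goal.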

Difference between Regression and classification in Machine Learning? [closed]

I am new to Machine Learning. Can anyone tell me the major difference between classification and regression in machine learning?
Regression aims to predict a continuous output value. For example, say that you are trying to predict the revenue of a certain brand as a function of many input parameters. A regression model would literally be a function which can output potentially any revenue number based on certain inputs. It could even output revenue numbers which never appeared anywhere in your training set.
Classification aims to predict which class (a discrete integer or categorical label) the input corresponds to. E.g. let us say that you had divided the sales into Low and High sales, and you were trying to build a model which could predict Low or High sales (binary/two-class classification). The inputs might even be the same as before, but the output would be different. In the case of classification, your model would output either "Low" or "High," and in theory every input would generate only one of these two responses.
(This answer is true for any machine learning method; my personal experience has been with random forests and decision trees).
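A small scikit-learn illustration of the two setups (synthetic data, purely illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))  # e.g. price, marketing spend, season index
    revenue = X @ np.array([3.0, -1.5, 0.5]) + rng.normal(scale=0.1, size=100)
    high_sales = (revenue > np.median(revenue)).astype(int)  # 1 = "High", 0 = "Low"

    reg = LinearRegression().fit(X, revenue)       # regression: continuous output
    clf = LogisticRegression().fit(X, high_sales)  # classification: discrete output

    print(reg.predict(X[:1]))  # any continuous revenue value
    print(clf.predict(X[:1]))  # one of exactly two classes: 0 or 1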
Regression - the output variable takes continuous values.
Example: Given a picture of a person, we have to predict their age on the basis of the given picture.
Classification - the output variable takes class labels.
Example: Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
I am a beginner in the machine learning field, but as far as I know, regression is for continuous values and classification is for discrete values. With regression, you fit a line to your continuous values and can see whether your model is a good or bad fit. With classification, on the other hand, you see how discrete values take on meaning "discretely" through their class assignments. If I am wrong, please feel free to correct me.

Imbalanced Data for Random ferns [closed]

For a multiclass problem, should the data be balanced for machine learning algorithms such as random forests and random ferns, or is it OK for it to be imbalanced to a certain extent?
The issue with imbalanced classes arises when the disproportion alters the separability of the class instances. But this does not happen in every imbalanced dataset: sometimes, the more data you have from one class, the better you can distinguish the scarce data of the other, since the abundant class makes it easier to find which features are meaningful for building a discriminating plane (even if you are not using discriminant analysis, the point is still to separate the instances according to class).
For example, I can remember the KDD Cup 2004 protein classification task, in which one class held 99.1% of the instances in the training set, yet if you tried to use undersampling methods to alleviate the imbalance you would only get worse results. That means the large amount of data from the majority class actually helped define the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature to partition the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default (= always), but only when the imbalance is not representative of the classes' real distributions.
So I suggest that you first run a multivariate analysis to gauge the extent of imbalance among the classes in your dataset, and then run a series of experiments with different undersampling ratios if you are still in doubt.
I have used random forests for this kind of task before. Although the data need not be balanced, if the positive samples are too few, their pattern may be drowned out by the noise. Most classification methods (even random forests and AdaBoost) have this flaw to some extent. Oversampling may be a good way to deal with this problem.
Perhaps the paper "Logistic Regression in Rare Events Data" is useful for this sort of problem, although its topic is logistic regression.
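As a concrete starting point, scikit-learn's random forest lets you reweight classes instead of resampling, which makes it easy to compare against the plain imbalanced fit (a sketch with synthetic multiclass data; whether either variant helps depends on your dataset, as discussed above):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic 3-class problem with one dominant class.
    X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                               weights=[0.85, 0.10, 0.05], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    for cw in (None, "balanced"):
        rf = RandomForestClassifier(n_estimators=200, class_weight=cw, random_state=0)
        rf.fit(X_train, y_train)
        print("class_weight =", cw)
        print(classification_report(y_test, rf.predict(X_test)))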
