spam classification - machine learning - machine-learning

I have to do spam detection application using a few classifiers(e.g. Naive Bayes, SVM and another one yet) and compare them efficiency but unfortunately I don't know what should I do exactly.
Is this correct:
Firstly I should have corpus spam such as trec2005, spamassasin or enron-spam.
Then, I do text pre-processing like stemming, stop words removal, tokenize, etc.
After that I can measure weight my features/terms in spam emails using tf-idf .
Next I remove these features with very low and very high frequencies.
And I can classify my emails then. Right?
After that I can measure my correct classifications by true-positive, false-positive, etc..
If 10fold cross validation is required for something ?
How should I use it?
Could you tell me if these steps for email classifications are OK?
If not, please explain what are the correct steps for spam classification.

Here is roughly the steps you need to build a spam classifier:
1- Input: a labeled training set that contains enough samples of spam and legitimate e-mails
2- Feature Extraction: convert your e-mail text into useful features for training e.g. stemming, remove stop words, words frequency. Then evaluate these features (i.e. apply attribute selection method) to select the most significant ones.
3- If you have large enough dataset, split it into training, validation and testing set. If not you can use your entire dataset for training and do cross validation to evaluate the classifier performance
4- Train your classifier and either use the testing dataste to evaluate its performance or do cross validation
5- Use the trained model to classify new e-mails. Done.
The use of cross validation is to evaluate your model performance on new/unseen data. So if you have an independent testing dataset you might not need cross validation at all, because you can evaluate the model performance on the testing dataset. However when your dataset is small you can divide it to subsets (e.g. 10 folds) and then repeat the training 10 times, every time you will use only 90% of your data and test on the remaining 10% and so on.
You will end up with 10 estimates of the classifier error average them to get the mean square or absolute error

Related

Why does my Random Forest Classifier perform better on test and validation data than on training data?

I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong and do you have any advice for how I should improve my approach because I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach:
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher than this at around 36% (mostly driven by a much better precision than the training data e.g. 60% vs. 30%) and then the testing set is getting an F1 score of between 45% and 50% which is again driven by a better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've had my stuck in this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) of the stability of metrics between folds in the kfold validation and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because try as I might I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions and so I run a loop with combinations of hyperparameters, but I'm confident that this isn't impacting the results I've talked about above in an adverse way
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is same with any machine learning algorithm. You have a dataset based on which you try to learn "the" function f(x), which actually generated the data. In real life datasets, it is impossible to get the original function from the data, and therefore we approximate it using something g(x).
The main goal of any machine learning algorithm is to predict unseen data as best as possible using the function g(x).
Given a dataset D you can always train a model, which will perfectly classify all the datapoints (you can use a hashmap to get 0 error on the train set), but which is overfitting or memorization.
To avoid such things, you yourself have to make sure that the model does not memorise and learns the function. There are a few things which can be done. I am trying to write them down in an informal way (with links).
Train, Validation, Test
If you have large enough dataset, use Train, Validation, Test splits. Split the dataset in three parts. Typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need, also in case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split). Next, forget about the Test partition, keep it somewhere safe, don't touch it. Your model, will be trained using the Training partition. Once you have trained the model, evaluate the performance of the model using the Validation set. Then select another set of hyper-parameter configuration for your model (eg. number of hidden layer, learaning algorithm, other parameters etc.) and then train the model again, and evaluate based on Validation set. Keep on doing this for several such models. Then select the model, which got you the best validation score.
The role of validation set here is to check what the model has learned. If the model has overfit, then the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind, although you did not use the Validation set to train the model, directly, but the Validation set was used indirectly to select the model.
Once you have selected a final model based on Validation set. Now take out your Test set, as if you just got new dataset from real life, which no one has ever seen. The prediction of the model on this Test set will be an indication how well your model has "learned" as it is now trying to predict datapoints which it has never seen (directly or indirectly).
It is key to not go back and tune your model based on the Test score. This is because once you do this, the Test set will start contributing to your mode.
Crossvalidation and bootstrap sampling
On the other hand, if your dataset is small. You can use bootstrap sampling, or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation, if k=5, then you split the dataset in 5 parts (also be carefull about stratified sampling). Let's name the parts a,b,c,d,e. Use the partitions [a,b,c,d] to train and get the prediction scores on [e] only. Next, use the partitions [a,b,c,e] and use the prediction scores on [d] only, and continue 5 times, where each time, you keep one partition alone and train the model with the other 4. After this, take an average of these scores. This is indicative of that your model might perform if it sees new data. It is also a good practice to do this multiple times and perform an average. For example, for smaller datasets, perform a 10 time 10-folds cross-validation, which will give a pretty stable score (depending on the dataset) which will be indicative of the prediction performance.
Bootstrap sampling is similar, but you need to sample the same number of datapoints (depends) with replacement from the dataset and use this sample to train. This set will have some datapoints repeated (as it was a sample with replacement). Then use the missing datapoins from the training dataset to evaluate the model. Perform this multiple times and average the performance.
Others
Other ways are to incorporate regularisation techniques in the classifier cost function itself. For example in Support Vector Machines, the cost function enforces conditions such that the decision boundary maintains a "margin" or a gap between two class regions. In neural networks one can also do similar things (although it is not same as in SVM).
In neural network you can use early stopping to stop the training. What this does, is train on the Train dataset, but at each epoch, it evaluates the performance on the Validation dataset. If the model starts to overfit from a specific epoch, then the error for Training dataset will keep on decreasing, but the error of the Validation dataset will start increasing, indicating that your model is overfitting. Based on this one can stop training.
A large dataset from real world tends not to overfit too much (citation needed). Also, if you have too many parameters in your model (to many hidden units and layers), and if the model is unnecessarily complex, it will tend to overfit. A model with lesser pameter will never overfit (though can underfit, if parameters are too low).
In the case of you sin function task, the neural net has to overfit, as it is ... the sin function. These tests can really help debug and experiment with your code.
Another important note, if you try to do a Train, Validation, Test, or k-fold crossvalidation on the data generated by the sin function dataset, then splitting it in the "usual" way will not work as in this case we are dealing with a time-series, and for those cases, one can use techniques mentioned here
First of all, I think it's a great project to approximate sin(x). It would be great if you could share the snippet or some additional details so that we could pin point the exact problem.
However, I think that the problem is that you are overfitting the data hence you are not able to generalize well to other data points.
Few tricks that might work,
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy is mostly not good on training set.

Text Classification Technique for this scenario

I am completely new to Machine Learning algorithms and I have a quick question with respect to Classification of a dataset.
Currently there is a training data that consists of two columns Message and Identifier.
Message - Typical message extracted from Log containing timestamp and some text
Identifier - Should classify the category based on the message content.
The training data was prepared by extracting a particular category from the tool and labelling it accordingly.
Now the test data contains just the message and I am trying to obtain the Category accordingly.
Which approach is most helpful in this scenario ? Is it the Supervised or Unsupervised Learning ?
I have a trained dataset and I am trying to predict the Category for the Test Data.
Thanks in advance,
Adam
If your labels are exact then you can classify using ANN, SVM etc. But labels are not exact you have to cluster data with respect to the features you have in data. K-means or nearest neighbour can be the starting point for clustering.
It is supervised learning, and a classification problem.
However, obviously you do not have the label column (the to-be-predicted value) for your testset. Thus, you cannot calculate error measures (such as False Positive Rate, Accuracy etc) for that test set.
You could, however, split the set of labeled training data that you do have into a smaller training set and a validation set. Split it 70%/30%, perhaps. Then build a prediction model from your smaller 70% training dataset. Then tune it on your 30% validation set. When accuracy is good enough, then apply it on your testset to obtain/predict the missing values.
Which techniques / algorithms to use is a different question. You do not give enough information to answer that. And even if you did you still need to tune the model yourself.
You have labels to predict, and training data.
So by definition it is a supervised problem.
Try any classifier for text, such as NB, kNN, SVM, ANN, RF, ...
It's hard to predict which will work best on your data. You willhave to try and evaluate several.

progressive random forest?

I am considering using random forest for a classification problem. The data comes in sequences. I plan to use first N(500) to train the classifier. Then, use the classifier to classify the data after that. It will make mistakes and the mistakes sometimes can be recorded.
My question is: can I use those mis-classified data to retrain the original classifier and how? If I simply add the mis-classified ones to original training set with size N, then the importance of the mis-classified ones will be exaggerated as the corrected classified ones are ignored. Do I have to retrain the classifier using all data? What other classifiers can do this kind of learning?
What you describe is a basic version of the Boosting meta-algorithm.
It's better if your underlying learner have a natural way to handle samples weights. I have not tried boosting random forests (generally boosting is used on individual shallow decision trees with a depth limit between 1 and 3) but that might work but will likely be very CPU intensive.
Alternatively you can train several independent boosted decision stumps in parallel with different PRNG seed values and then aggregate the final decision function as you would do with a random forests (e.g. voting or averaging class probability assignments).
If you are using Python, you should have a look at the scikit-learn documentation on the topic.
Disclaimer: I am a scikit-learn contributor.
Here is my understanding of your problem.
You have a dataset and create two subdata set with it say, training dataset and evaluation dataset. How can you use the evaluation dataset to improve classification performance ?
The point of this probleme is'nt to find a better classifier but to find a good way for the evaluation, then have a good classifier in the production environnement.
Evaluation purpose
As the evaluation dataset has been tag for evaluation there is now way yo do this. You must use another way for training and evaluation.
A common way to do is cross-validation;
Randomize your samples in your dataset. Create ten partitions from your initial dataset. Then do ten iteration of the following :
Take all partitions but the n-th for training and do the evaluation with the n-th.
After this take the median of the errors of the ten run.
This will give you the errors rate of yours classifiers.
The least run give you the worst case.
Production purpose
(no more evaluation)
You don't care anymore of evaluation. So take all yours samples of all your dataset and give it for training to your classifier (re-run a complet simple training). The result can be use in production environnement, but can't be evaluate any more with any of yours data. The result is as best as the worst case in previous partitions set.
Flow sample processing
(production or learning)
When you are in a flow where new samples are produce over time. You will face case where some sample correct errors case. This is the wanted behavior because we want the system to
improve itself. If you just correct in place the leaf in errors, after some times your
classifier will have nothing in common with the original random forest. You will be doing
a form of greedy learning, like meta taboo search. Clearly we don't wanna this.
If we try to reprocess all the dataset + the new sample every time a new sample is available we will experiment terrible low latency. The solution is like human, sometime
a background process run (when service is on low usage), and all data get a complet
re-learning; and at the end swap old and new classifier.
Sometime the sleep time is too short for a complet re-learning. So you have to use node computing clusturing like that. It cost lot of developpement because you probably need to re-write the algorithms; but at that time you already have the bigest computer you could have found.
note : Swap process is very important to master. You should already have it in your production plan. What do you do if you want to change algorithms? backup? benchmark? power-cut? etc...
I would simply add the new data and retrain the classifier periodically if it weren't too expensive.
A simple way to keep things in balance is to add weights.
If you weigh all positive samples by 1/n_positive and all negative samples by 1/n_negative ( including all the new negative samples you're getting ), then you don't have to worry about the classifier getting out of balance.

measuring the accuracy of a model and the importance of a feature in SVM

I'm starting to use LIBSVM for regression analysis. My world has about 20 features and thousands to millions of training samples.
I'm curious about two things:
Is there a metric that indicates the accuracy or confidence of the model, perhaps in the .model file or elsewhere?
How can I determine whether or not a feature is significant? E.g., if I'm trying to predict body weight as a function of height, shoulder width, gender and hair color, I might discover that hair color is not a significant feature in predicting weight. Is that reflected in the .model file, or is there some way to find out?
libSVM calculates p-values for test points based upon the certainty of the classifier (i.e., how far is the test point from the decision boundary and how wide are the margins).
I think you should consider the determination of feature importance a separate problem from training your SVMs. There are tons of approaches for "feature selection" (just open any text book) but one easy to understand, straightforward approach would be a simple cross-validation as follows:
Divide your dataset into k folds (e.g., k = 10 is common)
For each of the k folds:
Separate your data into train/test sets (the current fold is the test set, the rest are the training set)
Train your SVM classifier using only n-1 of your n features
Measure the prediction performance
Average the performance of your n-1 feature classifier for all k test folds
Repeat 1-3 for all remaining features
You could also do the reverse where you test each of the n features separately but you will likely miss out on important second and higher order interactions between the features.
In general, however, SVMs are good at ignoring irrelevant features.
You may also want to try and visualize your data using Principal Components Analysis to get a feel for how the data is distributed.
The F-score is a metric commonly used for features selection in Machine Learning.
Since version 3.0, LIBSVM library includes a directory called tools. In that directory is a python script called fselect.py, which calculates F-score. To use it, just execute from the command line and pass in the file comprised of training data (and optionally a testing data file).
python fselect.py data_training data_testing
The output is comprised of an fscore for each of the features in your data set which corresponds to the importance of that feature to the model result (regression score).

Resources