What is the difference between classification and prediction? - machine-learning

What is the difference between classification and prediction in machine learning?

Classification is the prediction of a categorical variable within a predefined vocabulary, based on training examples.
The prediction of numerical (continuous) variables is called regression.
In summary, classification is one kind of prediction, but there are others. Hence, prediction is a more general problem.

Functionality
Classification is about determining a (categorical) class (or label) for an element in a dataset.
Prediction is about predicting a missing/unknown element (a continuous value) of a dataset.
Working Strategy
In classification, data is grouped into categories based on a training dataset.
In prediction, a classification/regression model is built to predict the outcome (a continuous value).
Example
In a hospital, grouping patients based on their medical records or treatment outcomes is considered classification, whereas using a classification model to predict the treatment outcome for a new patient is considered prediction.
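A minimal sketch of the distinction in scikit-learn (the synthetic data and model choices are illustrative assumptions, not part of the original answer):

    from sklearn.datasets import make_classification, make_regression
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Classification: the target is a categorical label from a predefined set.
    X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = LogisticRegression().fit(X_cls, y_cls)
    print(clf.predict(X_cls[:3]))  # discrete labels, e.g. [0 1 0]

    # Regression: the target is a continuous numerical value.
    X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)
    reg = LinearRegression().fit(X_reg, y_reg)
    print(reg.predict(X_reg[:3]))  # real numbers, e.g. [12.7 -3.4 88.1]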

Classification is the process of identifying the category or class label to which a new observation belongs.
Prediction is the process of identifying missing or unavailable numerical data for a new observation.
That is the key difference between classification and prediction: prediction is not concerned with a class label the way classification is.

Predictions can be made using both regression and classification models. Once a model is trained on the training data, the next phase is to make predictions for data whose real/ground-truth values are either unknown or kept aside to evaluate the performance of the model. If the nature of the problem is determining classes/labels/categories, then it is classification; if the problem is about determining real (numeric) values, then it is regression. In a nutshell, predictions are made with both classification and regression models on the test data set.

1. Prediction is saying something about what may happen in the future. A prediction may be a kind of classification.
2. Prediction is mostly based on assumptions about the future.
whereas
1. Classification is categorization of the things or data that we already have. This categorization can be based on any kind of technique or algorithm.
2. Classification is mostly based on our current or past data.

Related

Application and Deployment of K-Fold Cross-Validation

K-Fold Cross-Validation is a technique for splitting the data into K folds for testing and training. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, each time on K-1 training folds, and then tested on the corresponding held-out fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 folds. Suppose that after training each model on each of the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has the higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment. Do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K-Folds?
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data you have and deploy that model.
The other option, obviously, is to deploy one of the CV models.
About the ensemble option, I believe it should not give significantly better results than a model trained on all the data: each model trains for the same amount of time, with similar parameters and the same architecture, only on slightly different training data, so they shouldn't show different performance. In my experience, an ensemble helps when the outputs of the models differ due to architecture or input data (like different image sizes).
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
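A sketch of that standard workflow in scikit-learn (the synthetic stand-in dataset and model settings are illustrative assumptions): compare the candidates via cross-validation, then discard the fold models and retrain the winner on all the training data.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset

    models = {
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }

    # 10-fold CV is used only to estimate generalization performance.
    scores = {name: cross_val_score(m, X, y, cv=10).mean()
              for name, m in models.items()}
    best_name = max(scores, key=scores.get)
    print(scores, "->", best_name)

    # Retrain the selected model on the full training data for deployment;
    # the ten fold models themselves are discarded.
    final_model = models[best_name].fit(X, y)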

Word2vec is a Generalization or memorization algorithm?

I need to know whether word2vec is a generalization algorithm, like most ML algorithms, or a memorization algorithm like KNN.
Since we have two types of algorithms, model-based and memory-based, which category does word2vec fall into when it is used for most_similar_items?
Let me define generalization as the ability of a trained model to be effective in prediction across a whole range of inputs, including inputs that were not part of training. From that perspective, Word2Vec cannot predict for words that are not part of the training dataset, because it simply has not trained on their context to create an embedding. To qualify as a generalization method, it needs to be able to predict on an input that was not part of the training dataset.
A Word2Vec model maintains a dictionary from words to their corresponding embeddings/vectors; in short, it cannot predict on unknown words. This is one of the important differences between the fastText model and Word2Vec.
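A small sketch with gensim illustrates this (the toy corpus is made up, and the results are illustrative only): most_similar works for in-vocabulary words but raises a KeyError for a word that was never seen.

    from gensim.models import Word2Vec

    # Tiny toy corpus; a real corpus would be far larger.
    sentences = [
        ["machine", "learning", "is", "fun"],
        ["deep", "learning", "is", "machine", "learning"],
        ["word", "embeddings", "capture", "context"],
    ]
    model = Word2Vec(sentences, vector_size=10, min_count=1, seed=0)

    # Works: "learning" is in the training vocabulary.
    print(model.wv.most_similar("learning", topn=2))

    # Fails: the word was never seen, so no embedding exists for it.
    try:
        model.wv.most_similar("unseen_word")
    except KeyError as err:
        print("out-of-vocabulary:", err)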

Text Classification Technique for this scenario

I am completely new to Machine Learning algorithms and I have a quick question with respect to Classification of a dataset.
Currently there is a training data that consists of two columns Message and Identifier.
Message - Typical message extracted from Log containing timestamp and some text
Identifier - Should classify the category based on the message content.
The training data was prepared by extracting a particular category from the tool and labelling it accordingly.
Now the test data contains just the message and I am trying to obtain the Category accordingly.
Which approach is most helpful in this scenario ? Is it the Supervised or Unsupervised Learning ?
I have a trained dataset and I am trying to predict the Category for the Test Data.
Thanks in advance,
Adam
If your labels are exact, then you can classify using an ANN, SVM, etc. But if the labels are not exact, you have to cluster the data with respect to the features you have. K-means or nearest neighbour can be a starting point for clustering.
It is supervised learning, and a classification problem.
However, obviously you do not have the label column (the to-be-predicted value) for your testset. Thus, you cannot calculate error measures (such as False Positive Rate, Accuracy etc) for that test set.
You could, however, split the set of labeled training data that you do have into a smaller training set and a validation set. Split it 70%/30%, perhaps. Then build a prediction model from your smaller 70% training dataset. Then tune it on your 30% validation set. When accuracy is good enough, then apply it on your testset to obtain/predict the missing values.
Which techniques / algorithms to use is a different question. You do not give enough information to answer that. And even if you did you still need to tune the model yourself.
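For illustration, here is a sketch of that split-and-validate workflow with a simple text classifier in scikit-learn (the log messages, labels, and model choice are invented placeholders, not your actual data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder labelled training data: (Message, Identifier) pairs.
    messages = ["2021-01-01 12:00 disk full", "2021-01-01 12:05 login failed",
                "2021-01-02 09:30 disk almost full", "2021-01-02 10:00 bad password"]
    labels = ["storage", "auth", "storage", "auth"]

    # 70%/30% split of the labelled data into training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(
        messages, labels, test_size=0.3, random_state=0)

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

    # Once accuracy is good enough, predict categories for unlabelled test messages.
    print(model.predict(["2021-01-03 11:11 disk failure"]))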
You have labels to predict, and training data.
So by definition it is a supervised problem.
Try any classifier for text, such as NB, kNN, SVM, ANN, RF, ...
It's hard to predict which will work best on your data. You will have to try and evaluate several.

How to evaluate the performance of different model on one dataset?

I want to evaluate the performance of different models such as SVM, Random Forest, CNN, etc., but I only have one dataset. So I split the dataset into a training set and a testing set, train each model on the training data, and test it with the testing data.
My question: can I get the real performance of the different models from only one dataset? For example, I found that the SVM model gets the best result, so should I select the SVM as my final classification model?
It's probably a better idea to validate your models with different test samples through cross-validation to avoid biases. Also check your models against different evaluation metrics depending on your application type. For instance, use recall, accuracy, and AUC for each model if it is a classification problem.
Evaluation results can be pretty deceptive and require extensive validation.
You can plot a ROC curve for all the models. The model with the highest AUC will be the best model.
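A sketch of that comparison in scikit-learn (synthetic stand-in data; the two models are illustrative): score each candidate by mean AUC across cross-validation folds rather than from a single split.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset

    models = {
        "svm": SVC(random_state=0),
        "random_forest": RandomForestClassifier(random_state=0),
    }

    # Mean AUC over 5 folds is a less biased comparison than one fixed split.
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(name, "mean AUC:", round(auc, 3))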

Anomaly Detection vs Supervised Learning

I have very little data belonging to the positive class and a large set of data from the negative class. According to Prof. Andrew Ng (anomaly detection vs supervised learning), I should use anomaly detection instead of supervised learning because the data is highly skewed.
Please correct me if I am wrong, but both techniques look the same to me, i.e. in both (supervised) anomaly detection and standard supervised learning, we train on data with both normal and anomalous samples and test on unknown data. Is there any difference?
Should I just perform under-sampling of the negative class or over-sampling of the positive class to get both types of data to the same size? Does it affect the overall accuracy?
Actually, in supervised learning you have the data set labelled (e.g. good, bad) and you pass the labelled values as you train the model, so that it learns parameters that separate the 'good' from the 'bad' results.
In anomaly detection it is unsupervised, as you do not pass any labelled values. What you do is train using only the 'non-anomalous' data. You then select epsilon values and evaluate with a numerical metric (such as the F1 score) so that your model achieves a good balance of true positives.
Regarding over/under-sampling so your data is not skewed, there are two things.
Prof. Ng mentioned that if your positive class is only 10 examples out of 10k or 100k, then you should use anomaly detection, since your data is highly skewed.
Supervised learning makes sense if you know typically what 'bad' values are. If you only know what is 'normal'/'good' but your 'bad' value can really be very different every time then this is a good case for anomaly detection.
In anomaly detection you determine model parameters from the portion of the data which is well supported (as Andrew explains). Since your negative class has many instances, you would use these data for 'learning'. Kernel density estimation or GMMs are examples of approaches that are typically used. A model of 'normalcy' may thus be learnt, and thresholding may be used to detect instances which are considered anomalous with respect to the derived model. The difference between this approach and conventional supervised learning lies in the fact that you are using only a portion of the data (the negative class in your case) for training. You would expect your positive instances to be identified as anomalous after training.
As for your second question, under-sampling the negative class will result in a loss of information, whilst over-sampling the positive class doesn't add information. I don't think that following that route is desirable.
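A sketch of that density-based approach in scikit-learn (synthetic data; the epsilon choice here is an invented illustration and would normally be tuned on a validation set, e.g. with the F1 score):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X_normal = rng.normal(0, 1, size=(1000, 2))   # plentiful negative class
    X_anomaly = rng.normal(6, 1, size=(10, 2))    # scarce positive class

    # Fit a density model of 'normalcy' on the well-supported data only.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X_normal)

    # Flag points whose log-likelihood falls below a threshold epsilon.
    epsilon = np.percentile(gmm.score_samples(X_normal), 1)  # illustrative choice
    print((gmm.score_samples(X_anomaly) < epsilon).mean())   # fraction flagged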
