Choice supervised learning algorithm or unsupervised learning algorithm - machine-learning

I would like examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.
Should I using a supervised learning algorithm or an unsupervised learning algorithm ?
Thank you.

Supervised Learning. Look into Naive Bayes. It has been used to solve exactly this problem with great success in the past.

Related

How to use Reinforcement Learning for a Classification problem?

Using Python, I'm trying to predict if an individual is eligible for a loan or not. So, my output will be either a 1 or 0.
I want to use a Reinforcement Learning approach to learn and experiment. But, I haven't found any useful resources on how to use RL in classification problems.
My question is, is to suitable to use RL or is it too complex for my problem and it's not used in similar real-world problems?
If the answer is yes, how can I apply RL in classification problems?

How to determine, whether to use machine learning algorithms or data mining technique for a given scenario?

I have been reading so many articles on Machine Learning and Data mining from the past few weeks. Articles like the difference between ML and DM, similarities, etc. etc. But I still have one question, it may look like a silly question,
How to determine, when should we use ML algorithms and when should we use DM?
Because I have performed some practicals of DM using weka on Time Series Analysis(future population prediction, sales prediction), text mining using R/python, etc. Same can be done using ML algorithms also, like future population prediction using Linear regression.
So how to determine, that, for a given problem ML is best suitable or Dm is best suitable.
Thanks in advance.
Probably the closest thing to the quite arbitrary and meaningless separation of ML and DM is unsupervised methods vs. supervised learning.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.

held-out data and k-fold cross validation

I have learnt about holding some data out of the training set (development data for tuning model's parameters) and also k-fold cross validation. I got a question about them, Can we use them in all of the machine learning algorithms such as Decision Tree and Naïve Bayes?or there is restriction in using them? Is it better to use them in Decision Tree and Naïve Bayes rather than their pure algorithms in order to gain better results?
Any help would be highly appreciated.
Yes, you could use them in all machine learning algorithms. The approach is to test how best your learning algorithm can generalize and perform on an unseen dataset.

When to use supervised or unsupervised learning?

Which are the fundamental criterias for using supervised or unsupervised learning?
When is one better than the other?
Is there specific cases when you can only use one of them?
Thanks
If you a have labeled dataset you can use both. If you have no labels you only can use unsupervised learning.
It´s not a question of "better". It´s a question of what you want to achieve. E.g. clustering data is usually unsupervised – you want the algorithm to tell you how your data is structured. Categorizing is supervised since you need to teach your algorithm what is what in order to make predictions on unseen data.
See 1.
On a side note: These are very broad questions. I suggest you familiarize yourself with some ML foundations.
Good podcast for example here: http://ocdevel.com/podcasts/machine-learning
Very good book / notebooks by Jake VanderPlas: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb
Depends on your needs. If you have a set of existing data including the target values that you wish to predict (labels) then you probably need supervised learning (e.g. is something true or false; or does this data represent a fish or cat or a dog? Simply put - you already have examples of right answers and you are just telling the algorithm what to predict). You also need to distinguish whether you need a classification or regression. Classification is when you need to categorize the predicted values into given classes (e.g. is it likely that this person develops a diabetes - yes or no? In other words - discrete values) and regression is when you need to predict continuous values (1,2, 4.56, 12.99, 23 etc.). There are many supervised learning algorithms to choose from (k-nearest neighbors, naive bayes, SVN, ridge..)
On contrary - use the unsupervised learning if you don't have the labels (or target values). You're simply trying to identify the clusters of data as they come. E.g. k-Means, DBScan, spectral clustering..)
So it depends and there's no exact answer but generally speaking you need to:
Collect and see you data. You need to know your data and only then decide which way you choose or what algorithm will best suite your needs.
Train your algorithm. Be sure to have a clean and good data and bear in mind that in case of unsupervised learning you can skip this step as you don't have the target values. You test your algorithm right away
Test your algorithm. Run and see how well your algorithm behaves. In case of supervised learning you can use some training data to evaluate how well is your algorithm doing.
There are many books online about machine learning and many online lectures on the topic as well.
Depends on the data set that you have.
If you have target feature in your hand then you should go for supervised learning. If you don't have then it is a unsupervised based problem.
Supervised is like teaching the model with examples. Unsupervised learning is mainly used to group similar data, it plays a major role in feature engineering.
Thank you..

How to train an unsupervised neural network such as RBM?

Is this process correct?
Suppose We have a bunch of data such as MNIST.
We just feed all these data(without label) to RBM and resample each data from trained model.
Then output can be treated as new data for classification.
Do I understand it correctly?
What is the purpose of using RBM?
You are correct, RBMs are a form of unsupervised learning algorithm that are commonly used to reduce the dimensionality of your feature space. Another common approach is to use autoencoders.
RBMs are trained using the contrastive divergence algorithm. The best overview of this algorithm comes from Geoffrey Hinton who came up with it.
https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
A great paper about how unsupervised learning improves performance can be found at http://jmlr.org/papers/volume11/erhan10a/erhan10a.pdf. The paper shows that unsupervised learning provides better generalization and filters (if using CRBMs)

Resources