Clustering vs. supervised classification for a very small dataset [closed] - machine-learning

I'm trying to classify/cluster subjects into two classes, healthy and sick, according to 4 features.
Two things to know: I know the label/class of each subject, and I only have 40 subjects in total (training + testing sets combined).
What should I choose in this case, clustering or classification?

Clustering vs. classification is not a choice of method but a choice of problem. What is the problem at hand? You have labeled data and want a model that can label more data: this is, by definition, classification. Which specific classification method to use is a whole new, research-driven question rather than a simple programming issue. In particular, many classifiers will try to fit some sort of generative model to the data (and thus learn about its structure even without labels), but in the end the labels are there and should be used.

Clustering is a form of unsupervised learning, and classification is a form of supervised learning. Unsupervised learning is used when you don't have target labels; it groups the data into clusters. Supervised learning is used when you have labeled data.
You mention that you have labels, so go for a classification algorithm such as logistic regression or an SVM. With such a small dataset you also need to guard against overfitting; to do that, prefer simple algorithms.
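As a rough illustration of "simple algorithm plus care about overfitting", here is a minimal sketch that scores a nearest-centroid classifier with leave-one-out cross-validation, so every one of the 40 subjects is used for testing exactly once. The classifier choice, the function names and the synthetic data are all assumptions made for the example; they stand in for the real subjects and for whichever simple model you actually pick.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class mean is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(X, y):
    """Hold out each subject once, fit on the other n-1, score the held-out one."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)

# Synthetic stand-in for 40 subjects x 4 features (NOT real data).
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(20)]   # "healthy"
X += [[rng.gauss(2.0, 1.0) for _ in range(4)] for _ in range(20)]  # "sick"
y = ["healthy"] * 20 + ["sick"] * 20

accuracy = leave_one_out_accuracy(X, y)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

With so few subjects, an estimate like this is still noisy, but it avoids sacrificing a fixed chunk of the 40 rows to a single test set.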

Classification is a type of supervised learning. In classification, the algorithm has to predict one of a finite set of outputs. For example, suppose the input data describes people and an output column records whether each person took a credit card. The algorithm learns the pattern linking the inputs to that output column; once trained, it predicts for unseen data whether a person will take a credit card or not. There are only a finite number of outputs here (2 in this case: takes a credit card or not), so the problem can be solved using classification.
Clustering belongs to unsupervised learning. It mainly deals with data that is not labeled. A clustering algorithm separates the data into groups based on similar characteristics.
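To make the contrast concrete, here is a minimal k-means sketch (plain Lloyd's algorithm) that groups points without ever being given a label. The synthetic blobs, the function name `kmeans` and its parameters are assumptions for illustration only.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Two well-separated synthetic blobs; no labels are passed to kmeans.
rng = random.Random(1)
blob_a = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(30)]
blob_b = [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(30)]
points = blob_a + blob_b

centroids, assignment = kmeans(points, k=2)
print(assignment)
```

The cluster indices it returns are arbitrary (0 or 1); nothing in the output says which group means what, which is exactly the difference from a classifier trained on labels.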

Related

Training an anomaly detection model on large datasets and choosing the correct model [closed]

We are trying to build an anomaly detection model for application logs.
The preprocessing is already complete: we built our own word2vec model, trained on application log entries.
We now have training data of 1.5M rows × 100 columns, where each row is the vectorized representation of a log entry (each vector has length 100, hence 100 columns).
The problem is that most anomaly detection algorithms (LOF, SOS, SOD, SVM) do not scale to this amount of data. We reduced the training set to 500K rows, but these algorithms still hang. SVM, which performed best on the POC sample data, has no n_jobs option to run on multiple cores.
Some algorithms do finish, such as Isolation Forest (with low n_estimators), Histogram and Clustering, but they fail to detect the anomalies we deliberately put in the training data.
Does anyone have an idea of how to run anomaly detection algorithms on large datasets?
We could not find any option for batch training in standard anomaly detection techniques. Should we look into neural nets (autoencoders)?
Selecting the best model:
Given that this is unsupervised learning, the approach we are taking for selecting a model is the following:
In the log-entry training data, insert an entry from a novel (say, Lord of the Rings). The vector representation of this entry should differ from that of the rest of the log entries.
While running the dataset through various anomaly detection algorithms, see which ones detect the entry from the novel (which is an anomaly).
This approach worked when we ran anomaly detection on a very small dataset (1000 entries) where the log files were vectorized using the Google-provided word2vec model.
Is this approach a sound one? We are open to other ideas as well. Given that it's an unsupervised learning algorithm, we had to put in an anomalous entry and see which model was able to identify it.
The contamination ratio we set is 0.003.
From your explanation, it seems that you are dealing with a novelty detection problem. Novelty detection is usually treated as a semi-supervised problem (though approaches vary).
The problem with the huge matrix size can be solved with batch processing. This may help: https://scikit-learn.org/0.15/modules/scaling_strategies.html
Finally, yes: if you can use deep learning, your problem can be solved much better, using either unsupervised or semi-supervised learning (I recommend this).
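One way to read the batch-processing suggestion is to update the model chunk by chunk so the full matrix never has to sit in memory. The sketch below is only an illustration of that pattern: it swaps in a deliberately simple detector (per-feature running mean and variance via Welford's method, scored by the largest absolute z-score) rather than any of the algorithms named in the question. `RunningStats`, `batches`, the 5-feature synthetic data and the planted row are all hypothetical.

```python
import math
import random

class RunningStats:
    """Per-feature mean and variance, updated one batch at a time (Welford)."""

    def __init__(self, n_features):
        self.count = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations

    def update(self, batch):
        for row in batch:
            self.count += 1
            for j, value in enumerate(row):
                delta = value - self.mean[j]
                self.mean[j] += delta / self.count
                self.m2[j] += delta * (value - self.mean[j])

    def score(self, row):
        """Anomaly score: largest absolute z-score across features."""
        if self.count < 2:
            return 0.0  # not enough data to estimate a variance
        scores = []
        for j, value in enumerate(row):
            std = math.sqrt(self.m2[j] / (self.count - 1))
            scores.append(abs(value - self.mean[j]) / std if std > 0 else 0.0)
        return max(scores)

def batches(n_rows, n_features, batch_size, rng):
    """Stand-in for reading log vectors chunk by chunk from disk."""
    for start in range(0, n_rows, batch_size):
        size = min(batch_size, n_rows - start)
        yield [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(size)]

rng = random.Random(0)
stats = RunningStats(n_features=5)
for batch in batches(n_rows=2000, n_features=5, batch_size=250, rng=rng):
    stats.update(batch)  # only one batch is in memory at a time

normal_row = [0.1, -0.3, 0.2, 0.0, 0.4]
planted_row = [0.1, -0.3, 9.0, 0.0, 0.4]  # hypothetical "novel sentence" vector
print(stats.score(normal_row), stats.score(planted_row))
```

The same loop shape applies to scikit-learn estimators that expose `partial_fit` (for example `MiniBatchKMeans`), and the planted row mirrors the question's sanity check: a detector worth keeping should score the planted entry well above ordinary rows.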

Research paper has Supervised and Unsupervised Learning definition [closed]

I am looking for research papers or books that give a good, basic definition of what supervised and unsupervised learning are, so that I can quote these definitions in my project.
Thank you so much.
I would refer to the following book: Artificial Intelligence: A Modern Approach (3rd Edition) by Stuart Russell and Peter Norvig. Chapter 18, from page 693 onward, discusses supervised and unsupervised learning. About unsupervised learning:
In unsupervised learning, the agent learns patterns in the input even though no explicit feedback is supplied. The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. For example, a taxi agent might gradually develop a concept of “good traffic days” and “bad traffic days” without ever being given labeled examples of each by a teacher.
While for supervised:
In supervised learning, the agent observes some example input–output pairs and learns a function that maps from input to output. In component 1 above, the inputs are percepts and the output are provided by a teacher who says “Brake!” or “Turn left.” In component 2, the inputs are camera images and the outputs again come from a teacher who says “that’s a bus.” In 3, the theory of braking is a function from states and braking actions to stopping distance in feet. In this case the output value is available directly from the agent’s percepts (after the fact); the environment is the teacher.
The examples are mentioned in the text above.
Christopher M. Bishop, "Pattern Recognition and Machine Learning", p.3 (emphasis mine)
Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems...
In other pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
That is as good as you can get. Basically, the most noticeable difference is whether we have labels with respect to which we want the learning model to optimize. If only some of the labels are missing, it can still be described as weakly supervised learning. If no labels are available, the only thing left is to find some structure in the data.
Thanks @Pavel Tyshevskyi for the answer. Your answer is perfect, but it seems a little hard to understand for beginners like me.
After an hour of searching, I found my own version of the answer in the book "Machine Learning For Dummies, IBM Limited Edition", in the section "Approaches to Machine Learning" of Chapter 1, "Understanding Machine Learning". It has a simpler definition and an example that helped me understand a bit better. Link to the book: Machine Learning For Dummies, IBM Limited Edition
Supervised learning
Supervised learning typically begins with an established set of data and a certain understanding of how that data is classified. Supervised learning is intended to find patterns in data that can be applied to an analytics process. This data has labeled features that define the meaning of data. For example, there could be millions of images of animals and include an explanation of what each animal is and then you can create a machine learning application that distinguishes one animal from another. By labeling this data about types of animals, you may have hundreds of categories of different species. Because the attributes and the meaning of the data have been identified, it is well understood by the users that are training the modeled data so that it fits the details of the labels. When the label is continuous, it is a regression; when the data comes from a finite set of values, it known as classification. In essence, regression used for supervised learning helps you understand the correlation between variables. An example of supervised learning is weather forecasting. By using regression analysis, weather forecasting takes into account known historical weather patterns and the current conditions to provide a prediction on the weather.
The algorithms are trained using preprocessed examples, and at this point, the performance of the algorithms is evaluated with test data. Occasionally, patterns that are identified in a subset of the data can’t be detected in the larger population of data. If the model is fit to only represent the patterns that exist in the training subset, you create a problem called overfitting. Overfitting means that your model is precisely tuned for your training data but may not be applicable for large sets of unknown data. To protect against overfitting, testing needs to be done against unforeseen or unknown labeled data. Using unforeseen data for the test set can help you evaluate the accuracy of the model in predicting outcomes and results. Supervised training models have broad applicability to a variety of business problems, including fraud detection, recommendation solutions, speech recognition, or risk analysis.
Unsupervised learning
Unsupervised learning is best suited when the problem requires a massive amount of data that is unlabeled. For example, social media applications, such as Twitter, Instagram, Snapchat, and.....

When to use supervised or unsupervised learning?

What are the fundamental criteria for using supervised or unsupervised learning?
When is one better than the other?
Are there specific cases when you can only use one of them?
Thanks
1. If you have a labeled dataset, you can use both. If you have no labels, you can only use unsupervised learning.
2. It's not a question of "better". It's a question of what you want to achieve. E.g. clustering data is usually unsupervised: you want the algorithm to tell you how your data is structured. Categorizing is supervised, since you need to teach your algorithm what is what in order to make predictions on unseen data.
3. See 1.
On a side note: These are very broad questions. I suggest you familiarize yourself with some ML foundations.
Good podcast for example here: http://ocdevel.com/podcasts/machine-learning
Very good book / notebooks by Jake VanderPlas: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb
It depends on your needs. If you have a set of existing data that includes the target values you wish to predict (labels), then you probably need supervised learning (e.g. is something true or false? Does this data represent a fish, a cat or a dog?). Simply put, you already have examples of right answers and you are just telling the algorithm what to predict. You also need to decide whether you need classification or regression. Classification is when you need to sort the predicted values into given classes (e.g. is this person likely to develop diabetes, yes or no? In other words, discrete values), and regression is when you need to predict continuous values (1, 2, 4.56, 12.99, 23 etc.). There are many supervised learning algorithms to choose from (k-nearest neighbors, naive Bayes, SVM, ridge...).
On the contrary, use unsupervised learning if you don't have the labels (or target values). You're simply trying to identify clusters in the data as they come (e.g. k-means, DBSCAN, spectral clustering...).
So it depends and there's no exact answer, but generally speaking you need to:
1. Collect and look at your data. You need to know your data, and only then decide which way to go or which algorithm will best suit your needs.
2. Train your algorithm. Be sure to have clean, good data, and bear in mind that in the case of unsupervised learning you can skip this step, as you don't have target values; you test your algorithm right away.
3. Test your algorithm. Run it and see how well it behaves. In the case of supervised learning, you can use some of your labeled data, held out from training, to evaluate how well your algorithm is doing.
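The collect/train/test steps above can be sketched for the supervised case as follows. This is a minimal illustration under assumptions: the data is synthetic, the 80/20 split is an arbitrary choice, and k-nearest neighbors (one of the algorithms listed earlier) is hand-rolled here rather than taken from a library.

```python
import random
from collections import Counter

def knn_predict(train, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(
        train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# 1. Collect: synthetic labeled data standing in for a real dataset.
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "cat") for _ in range(50)]
data += [([rng.gauss(4, 1), rng.gauss(4, 1)], "dog") for _ in range(50)]

# 2. Train: shuffle, then keep 80% for fitting (k-NN just stores the points).
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 3. Test: score only on the 20% the model has never seen.
correct = sum(knn_predict(train, x) == label for x, label in test)
test_accuracy = correct / len(test)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Scoring on rows that were also used for fitting would overstate how well the model does on new data, which is why the test rows are kept apart.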
There are many books online about machine learning and many online lectures on the topic as well.
It depends on the dataset that you have.
If you have the target feature in hand, then you should go for supervised learning. If you don't, then it is an unsupervised problem.
Supervised learning is like teaching the model with examples. Unsupervised learning is mainly used to group similar data; it plays a major role in feature engineering.
Thank you..

Using Reinforcement Learning for Classification Problems [closed]

Can I use reinforcement learning on classification? Such as human activity recognition? And how?
There are two types of feedback. One is evaluative, which is used in reinforcement learning; the other is instructive, which is used in supervised learning and is what classification problems mostly rely on.
When supervised learning is used, the weights of the neural network are adjusted based on the correct labels provided in the training dataset. So, on selecting a wrong class, the loss increases and the weights are adjusted so that, for inputs of that kind, this wrong class is not chosen again.
However, in reinforcement learning, the system explores all the possible actions (in this case, class labels for the various inputs) and decides what is right and what is wrong by evaluating the reward. Until it finds the correct class label, it may keep giving a wrong class name because that is the best output it has found so far. So it doesn't make use of the specific knowledge we have about the class labels, and hence converges significantly more slowly than supervised learning.
You can use reinforcement learning for classification problems, but it won't give you any added benefit and will instead slow down your convergence.
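A toy sketch of the evaluative-feedback case may help. It treats classification as an epsilon-greedy contextual bandit: the learner is only told whether its guess earned a reward, never which class was correct. The task (four hypothetical sensor patterns, four activity classes), the function name and the parameter values are assumptions chosen for illustration, not a recommended setup for human activity recognition.

```python
import random

def train_bandit_classifier(true_label, n_classes, steps, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit. true_label plays the environment: it is used
    only to compute the 0/1 reward and is never shown to the learner."""
    rng = random.Random(seed)
    inputs = list(true_label)
    # value[x][c] = running average reward for answering class c on input x
    value = {x: [0.0] * n_classes for x in inputs}
    tries = {x: [0] * n_classes for x in inputs}
    for _ in range(steps):
        x = rng.choice(inputs)
        if rng.random() < epsilon:
            guess = rng.randrange(n_classes)  # explore a random class
        else:
            guess = max(range(n_classes), key=lambda c: value[x][c])  # exploit
        reward = 1.0 if guess == true_label[x] else 0.0  # evaluative feedback
        tries[x][guess] += 1
        value[x][guess] += (reward - value[x][guess]) / tries[x][guess]
    # Final policy: the best-valued class for each input.
    return {x: max(range(n_classes), key=lambda c: value[x][c]) for x in inputs}

# Hypothetical toy task: four sensor patterns, each mapped to one of four classes.
true_label = {"pattern_a": 2, "pattern_b": 0, "pattern_c": 3, "pattern_d": 1}

learned = train_bandit_classifier(true_label, n_classes=4, steps=3000)
print(learned == true_label)
```

The bandit has to stumble on each correct class through exploration, whereas a supervised learner handed the labels would need just one labeled example per pattern in this toy setting; that gap is the slower convergence described above.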
Short answer: Yes.
Detailed answer: yes, but it's overkill. Reinforcement learning is useful when you don't have a labeled dataset from which to learn the correct policy, so you need to develop the correct strategy based on rewards. It also allows you to backpropagate through non-differentiable blocks (which I suppose is not your case). The biggest drawback of reinforcement learning methods is that they typically take a VERY large amount of time to converge. So, if you have labels, it would be a LOT faster and easier to use regular supervised learning.
You may be able to develop an RL model that chooses which classifier to use: the ground-truth (gt) labels would be used to train the classifiers, and the change in performance of those classifiers would be the reward for the RL model. As others have said, it would probably take a very long time to converge, if it ever does. This idea may also require many tricks and tweaks to make it work. I would recommend searching for research papers on this topic.

Can an algorithm be classified as "unsupervised learning" if there is no "learning" involved?

Basically, my question is: since unsupervised learning is a type of machine learning, does there need to be some aspect of the machine "learning" and improving based on its discoveries? For example, if an algorithm is developed that takes unlabeled images and finds associations between them, does it need to improve itself based on those associations to be classified as "unsupervised learning", or is simply reporting those associations good enough to earn that classification?
For example, if an algorithm is developed that takes unlabeled images and finds associations between them...
That is the "learning" in "unsupervised learning," so yes, this would be considered unsupervised learning.
...does it need to improve itself based on those associations...
No, there's no requirement that the algorithm take what it has learned and improve itself to be considered unsupervised learning. Just analyzing the dataset and finding previously unknown associations is enough to be considered unsupervised machine learning. The "unsupervised" distinction is really just that the initial dataset is unlabeled.
