I have been working in machine learning and pattern recognition for the last 8-10 years, mostly on image classification and recognition. Recently, I started learning about data mining. How closely is data mining related to classification? Can I, as someone with experience in image classification, work on data mining?
Classification is one of many machine learning techniques used in data mining. But usually, you'd simply use the more precise "machine learning" category for classification.
Data mining is the explorative side: you want to understand the data. That can mean learning to predict, but mostly it means understanding what can be predicted (and what not) and how (which features, etc.).
In many cases, classification is used in a way that I'd not include in data mining. If you just want to recognize images as cars or not (but don't care about the "why") it's probably not data mining.
Related
I have been reading many articles on machine learning and data mining over the past few weeks: articles on the differences between ML and DM, their similarities, and so on. But I still have one question, and it may sound silly:
How do we determine when to use ML algorithms and when to use DM?
I have done some DM exercises using Weka on time series analysis (future population prediction, sales prediction) and text mining using R/Python. The same can be done with ML algorithms too, e.g. future population prediction using linear regression.
So how do I determine, for a given problem, whether ML or DM is the better fit?
Thanks in advance.
Probably the closest thing to the quite arbitrary and meaningless separation of ML and DM is unsupervised methods vs. supervised learning.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.
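To make that contrast concrete, here is a minimal sketch in Python with scikit-learn; the data is synthetic and purely illustrative:

```python
# Supervised ("ML") vs. exploratory ("DM") settings on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # 200 samples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # a known target function

# "ML": you have labels for the target, so you train a predictor.
clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))

# "DM": no target; you explore the structure of the data instead.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```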
I have some questions about SVMs:
1. Why use SVMs? In other words, what motivated their development?
2. What is the state of the art (2017)?
3. What improvements have been made?
SVMs work very well. In many applications, they are still among the best-performing algorithms.
We've seen progress in particular on linear SVMs, which can be trained much faster than kernel SVMs.
Read more of the literature; don't expect an exhaustive answer in this Q&A format, and show more effort on your part.
SVMs are most commonly used for classification problems where labeled data is available (supervised learning), and they are useful for modeling with limited data. For problems with unlabeled data (unsupervised learning), support vector clustering is a commonly employed algorithm. SVMs tend to perform better on binary classification problems, since the decision boundaries will not overlap. Your 2nd and 3rd questions are very ambiguous (and need a lot of work!), but suffice it to say that SVMs have found wide-ranging applicability in medical data science. Here's a link to explore more about this: Applications of Support Vector Machine (SVM) Learning in Cancer Genomics
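As a concrete starting point, here is a hedged sketch of a binary SVM classifier using scikit-learn's SVC; the dataset choice and hyperparameters are illustrative defaults, not a recommendation:

```python
# A binary SVM classifier on a small labeled dataset (supervised learning).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # binary labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature scaling matters for SVMs; an RBF kernel is a common default.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```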
What are the fundamental criteria for choosing between supervised and unsupervised learning?
When is one better than the other?
Are there specific cases where you can only use one of them?
Thanks
If you have a labeled dataset you can use both. If you have no labels, you can only use unsupervised learning.
It's not a question of "better"; it's a question of what you want to achieve. E.g. clustering data is usually unsupervised: you want the algorithm to tell you how your data is structured. Categorizing is supervised, since you need to teach your algorithm what is what in order to make predictions on unseen data.
See 1.
On a side note: These are very broad questions. I suggest you familiarize yourself with some ML foundations.
Good podcast for example here: http://ocdevel.com/podcasts/machine-learning
Very good book / notebooks by Jake VanderPlas: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb
It depends on your needs. If you have a set of existing data including the target values that you wish to predict (labels), then you probably need supervised learning (e.g. is something true or false, or does this data represent a fish, a cat, or a dog? Simply put, you already have examples of right answers and you are just telling the algorithm what to predict). You also need to distinguish whether you need classification or regression. Classification is when you need to sort the predicted values into given classes (e.g. is it likely that this person develops diabetes, yes or no? In other words, discrete values), and regression is when you need to predict continuous values (1, 2, 4.56, 12.99, 23, etc.). There are many supervised learning algorithms to choose from (k-nearest neighbors, naive Bayes, SVM, ridge regression, ...).
By contrast, use unsupervised learning if you don't have labels (target values). You're simply trying to identify clusters in the data as they come (e.g. k-means, DBSCAN, spectral clustering, ...).
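Here is a minimal sketch of the three settings just described, using scikit-learn on synthetic toy data; the algorithms shown are arbitrary picks from the lists above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # classification
from sklearn.linear_model import Ridge               # regression
from sklearn.cluster import DBSCAN                   # clustering (unsupervised)

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))

# Classification: a discrete target (e.g. diabetes yes/no).
y_class = (X[:, 0] > 0).astype(int)
print(KNeighborsClassifier().fit(X, y_class).predict(X[:3]))

# Regression: a continuous target (e.g. 4.56, 12.99, ...).
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=150)
print(Ridge().fit(X, y_reg).predict(X[:3]))

# Unsupervised: no target at all; just look for cluster structure.
print(DBSCAN(eps=0.8).fit_predict(X)[:10])
```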
So it depends and there's no exact answer, but generally speaking you need to:
Collect and inspect your data. You need to know your data, and only then can you decide which approach to take or which algorithm will best suit your needs.
Train your algorithm. Make sure you have clean, good data. Bear in mind that in the case of unsupervised learning there are no target values, so you can skip this step and test your algorithm right away.
Test your algorithm. Run it and see how well it behaves. In the case of supervised learning, you can hold out some of your labeled data to evaluate how well the algorithm is doing; a minimal sketch of this train/evaluate loop follows below.
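For steps 2 and 3, a minimal supervised train/evaluate sketch with scikit-learn might look like this; the dataset and algorithm are placeholders:

```python
# Train on one split of the labeled data, evaluate on the held-out rest.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = GaussianNB().fit(X_train, y_train)   # step 2: train
y_pred = model.predict(X_test)               # step 3: test
print("held-out accuracy:", accuracy_score(y_test, y_pred))
```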
There are many books online about machine learning and many online lectures on the topic as well.
It depends on the data set that you have.
If you have a target feature in hand, then you should go for supervised learning. If you don't, then it is an unsupervised problem.
Supervised learning is like teaching the model with examples. Unsupervised learning is mainly used to group similar data, and it plays a major role in feature engineering.
Thank you.
Basically, my question is: since unsupervised learning is a type of machine learning, does there need to be some aspect of the machine "learning" and improving based on its discoveries? For example, if an algorithm is developed that takes unlabeled images and finds associations between them, does it need to improve itself based on those associations to be classified as "unsupervised learning", or is simply reporting those associations good enough to earn that classification?
For example, if an algorithm is developed that takes unlabeled images and finds associations between them...
That is the "learning" in "unsupervised learning," so yes, this would be considered unsupervised learning.
...does it need to improve itself based on those associations...
No, there's no requirement that the algorithm take what it has learned and improve itself in order to be considered unsupervised learning. Just analyzing the data set and finding previously unknown associations is enough to count as unsupervised machine learning. The "unsupervised" distinction really just means that the initial data set is unlabeled.
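To illustrate, here is a minimal unsupervised sketch: the code below only groups unlabeled "images" and reports the groups, with no self-improvement loop, and that already qualifies. The feature extraction is a made-up placeholder:

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_features(images):
    # Placeholder: flatten each image into a feature vector. In practice
    # you might use color histograms or a pretrained CNN embedding.
    return np.array([img.ravel() for img in images])

rng = np.random.default_rng(0)
images = rng.random((100, 8, 8))          # 100 fake 8x8 "images"
X = extract_features(images)

# No labels, no self-improvement: finding and reporting associations
# between unlabeled images already counts as unsupervised learning.
groups = KMeans(n_clusters=5, n_init=10).fit_predict(X)
print("images per group:", np.bincount(groups))
```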
I'm trying to get into machine learning, so I wanted to try out text classification on tweets. I collected a small sample of tweets, but to perform any supervised learning I need to hand-label some of the tweets I collected. This becomes an arduous task when I scale up my data.
Is there any way to perform classification without hand-labeling a large number of tweets?
Or is unsupervised learning better suited for this task?
Semi-supervised learning methods were created for problems like this. The simplest approach has you manually label a few observations, run a supervised learning algorithm on the labeled data to obtain a classifier, use that classifier to label further observations, and repeat; a sketch follows below.
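A minimal sketch of this loop, assuming scikit-learn's SelfTrainingClassifier (which implements exactly this repeated pseudo-labelling; unlabeled points are marked with -1):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Pretend we only hand-labelled ~5% of the data; the rest get -1 (unlabeled).
rng = np.random.default_rng(0)
y_partial = np.where(rng.random(len(y)) < 0.05, y, -1)

base = SVC(probability=True)   # self-training needs confidence estimates
model = SelfTrainingClassifier(base).fit(X, y_partial)

print("hand-labelled points:", (y_partial != -1).sum())
print("accuracy vs. full labels:", accuracy_score(y, model.predict(X)))
```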
Tweets are short text. You should try a classifier tailored for short-text classification such as LibShortText: https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
This article explains certain properties of short text (title) vs full-text classification : https://www.csie.ntu.edu.tw/~cjlin/papers/title.pdf
Classification will always involve labeled data (active learning techniques can help with labeling datasets), but you can take advantage of new emerging techniques such as Snorkel (data programming) to alleviate some of the burden: https://github.com/HazyResearch/snorkel
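To give a feel for the data-programming idea (this is not Snorkel's actual API, just the core concept in plain Python): instead of hand-labelling every tweet, you write cheap heuristic labelling functions and combine their noisy votes into training labels:

```python
# Labeling-function sketch: each heuristic votes or abstains per tweet.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_has_happy_emoji(tweet):
    return POSITIVE if ":)" in tweet else ABSTAIN

def lf_has_angry_word(tweet):
    return NEGATIVE if "awful" in tweet.lower() else ABSTAIN

def majority_vote(tweet, lfs):
    # Combine the non-abstaining votes; Snorkel learns a weighted model
    # instead of this simple majority, but the idea is the same.
    votes = [lf(tweet) for lf in lfs if lf(tweet) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_has_happy_emoji, lf_has_angry_word]
tweets = ["great game :)", "that ref was awful", "just landed"]
print([majority_vote(t, lfs) for t in tweets])   # noisy training labels
```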