I'm trying to get into machine learning and so I wanted to try out text classification on tweets. I collected a small sample of tweets, but for me to perform any supervised learning I need to hand label some of the tweets I collected. This is an arduous task when I scale up my data.
Is there any way to perform classification without me hand labeling a large number of tweets?
Or is unsupervised learning better suited for this task?
Semi-supervised learning methods were created for problems like this. The simplest approach has you manually label a few observations, train a supervised learning algorithm on that labeled data, use the resulting classifier to label further observations, and repeat.
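A minimal self-training sketch with scikit-learn, assuming a handful of hand-labeled tweets (the tiny corpus, labels, and threshold below are made up for illustration; `-1` marks unlabeled samples):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Hypothetical tiny corpus: four hand-labeled tweets plus two unlabeled ones.
tweets = [
    "great game tonight", "what a win",                  # labeled: sports (1)
    "new phone launch today", "specs look good",         # labeled: tech (0)
    "the team played well", "battery life is amazing",   # unlabeled
]
labels = np.array([1, 1, 0, 0, -1, -1])  # -1 = unlabeled

X = TfidfVectorizer().fit_transform(tweets)

# Self-training: fit on the labeled subset, pseudo-label unlabeled samples
# the model is confident about, then refit, and repeat.
clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
clf.fit(X, labels)
print(clf.predict(X[4:]))  # predictions for the formerly unlabeled tweets
```

With a real dataset you would label a few hundred tweets, leave the rest marked `-1`, and tune the confidence threshold.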
Tweets are short text. You should try a classifier tailored for short-text classification such as LibShortText: https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
This article explains certain properties of short-text (title) vs. full-text classification: https://www.csie.ntu.edu.tw/~cjlin/papers/title.pdf
Classification will always involve labeled data (active learning techniques help with labeling datasets), but you can take advantage of emerging techniques such as Snorkel (data programming) to alleviate some of the issues: https://github.com/HazyResearch/snorkel
Related
Scenario: I have data without labels, but I can write a function that labels the data based on behavior, then deploy the model so I don't have to keep labeling by hand. Is this considered machine learning?
Objective: classify accounts with volume spikes into high/medium/low labels, to be deployed on big data (trillions of rows).
Data: the data I have includes the following attributes:
Account, Time, Date, Volume amount.
Method:
Create a new feature column called "spike" and write a pandas function to flag a spike greater than 5. Is this feature engineering?
Next, I create my label column and classify each spike as low, medium, or high.
Next, I train a machine learning classifier and deploy it to label future accounts with similar patterns in the big data.
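The labeling step I have in mind might look something like this in pandas (column names, values, and the medium/high thresholds are just illustrative):

```python
import pandas as pd

# Hypothetical data matching the attributes described: Account and Volume.
df = pd.DataFrame({
    "Account": ["A", "A", "B", "B"],
    "Volume":  [2.0, 7.5, 12.0, 4.0],
})

# Rule-based labeling: flag a "spike" when Volume exceeds 5,
# then bucket values into low / medium / high (thresholds made up here).
df["spike"] = df["Volume"] > 5

def spike_level(v):
    if v <= 5:
        return "low"
    elif v <= 10:
        return "medium"
    return "high"

df["label"] = df["Volume"].apply(spike_level)
print(df)
```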
Thoughts on this process? Is this approach correct for machine learning?
1st question:
If your algorithm makes the decision (that is, assigns a label to a sample) based on the set of samples you have, I'd say it's a machine learning algorithm. But if you write code that encodes your own experience with the data, I'd say it's not an ML method. In brief, ML looks at the data to extract patterns and insights from it. I don't know why you're doing this, but does it need to be an ML algorithm? Sometimes you can solve the problem in a very simple way, without using ML.
2nd question: I'm afraid not. Selecting your data attributes (e.g. Account, Time, Date, Volume), checking their correlations, trying to figure out whether one of them dominates, etc., is all pre-ML work. Feature engineering selects the best features to present to the algorithm in order to perform the classification (in your case).
3rd question: I think it's fair enough to start playing with some ML algorithms, such as k-NN, SVM, neural networks, decision trees, etc.
I have been working in machine learning and pattern recognition for the last 8-10 years, using it for image classification and recognition. Recently, I started learning about data mining. How closely is data mining related to classification? And can I, as someone with experience in image classification, work on data mining?
Classification is one of many machine learning techniques used in data mining. But usually, you'd simply use the more precise "machine learning" category for classification.
Data mining is the explorative side: you want to understand the data. That can mean learning to predict, but mostly it means understanding what can be predicted (and what cannot) and how (which features, etc.).
In many cases, classification is used in a way that I would not call data mining. If you just want to recognize images as cars or not (and don't care about the "why"), it's probably not data mining.
I have been reading many articles on machine learning and data mining over the past few weeks: articles on the differences between ML and DM, their similarities, and so on. But I still have one question; it may sound silly:
How do we determine when to use ML algorithms and when to use DM?
I have done some DM exercises using Weka on time series analysis (future population prediction, sales prediction), text mining using R/Python, etc. The same can be done with ML algorithms too, e.g. future population prediction using linear regression.
So how do I determine, for a given problem, whether ML or DM is the better fit?
Thanks in advance.
Probably the closest thing to the quite arbitrary and meaningless separation of ML and DM is unsupervised methods vs. supervised learning.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.
How can we calculate the importance of features in a dataset using machine learning? Which algorithm is best, and why?
There are several methods that fit a model to the data and, based on that fit, rank the features from most relevant to least relevant. If you want to know more, just google "feature selection".
I don't know which language you're using but here's a link to a python page about it:
http://scikit-learn.org/stable/modules/feature_selection.html
You can use this function:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE
This will eliminate the less meaningful features from your dataset based on a fit from a classifier; you can choose, for instance, logistic regression or an SVM, and select how many features you want to keep.
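A minimal sketch of RFE with a logistic regression estimator (the iris dataset and the choice of keeping two features are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursive feature elimination: fit the estimator, drop the weakest
# feature, refit, and repeat until the requested number remains.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # rank 1 = selected
```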
I think the choice of the best method depends on the data, so more information is necessary.
What are the fundamental criteria for choosing supervised or unsupervised learning?
When is one better than the other?
Are there specific cases where you can only use one of them?
Thanks
If you have a labeled dataset you can use both. If you have no labels you can only use unsupervised learning.
It's not a question of "better"; it's a question of what you want to achieve. E.g. clustering data is usually unsupervised: you want the algorithm to tell you how your data is structured. Categorizing is supervised, since you need to teach your algorithm what is what in order to make predictions on unseen data.
See 1.
On a side note: These are very broad questions. I suggest you familiarize yourself with some ML foundations.
Good podcast for example here: http://ocdevel.com/podcasts/machine-learning
Very good book / notebooks by Jake VanderPlas: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb
Depends on your needs. If you have a set of existing data that includes the target values you wish to predict (labels), then you probably need supervised learning (e.g. is something true or false, or does this data represent a fish, a cat, or a dog? Simply put: you already have examples of right answers and are just telling the algorithm what to predict). You also need to distinguish between classification and regression. Classification is when you need to sort the predicted values into given classes (e.g. is it likely that this person develops diabetes, yes or no? In other words, discrete values), and regression is when you need to predict continuous values (1, 2, 4.56, 12.99, 23, etc.). There are many supervised learning algorithms to choose from (k-nearest neighbors, naive Bayes, SVM, ridge regression, ...).
On the contrary, use unsupervised learning if you don't have the labels (target values). You're simply trying to identify clusters in the data as they come (e.g. k-means, DBSCAN, spectral clustering).
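A small sketch of the contrast on made-up data (the blob dataset and algorithm choices are just illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 60 points drawn from 3 clusters.
X, y = make_blobs(n_samples=60, centers=3, random_state=0)

# Supervised: labels y are available, so a classifier learns the mapping X -> y.
knn = KNeighborsClassifier().fit(X, y)

# Unsupervised: same X, but y is ignored; k-means groups points by similarity alone.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(knn.predict(X[:3]))  # predicted classes for the first three points
print(km.labels_[:3])      # cluster assignments (IDs are arbitrary)
```

Note that the cluster IDs from k-means carry no meaning on their own; interpreting them is up to you, which is exactly the explorative flavor described above.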
So it depends and there's no exact answer but generally speaking you need to:
Collect and inspect your data. You need to know your data, and only then decide which way to go or which algorithm will best suit your needs.
Train your algorithm. Be sure to have clean, good data, and bear in mind that in the unsupervised case you can skip this step, since you have no target values; you run your algorithm right away.
Test your algorithm. Run it and see how well it behaves. In the supervised case you can hold out some of the data to evaluate how well your algorithm is doing.
There are many books online about machine learning and many online lectures on the topic as well.
Depends on the dataset that you have.
If you have the target feature at hand, then you should go for supervised learning. If you don't, it is an unsupervised problem.
Supervised learning is like teaching the model with examples. Unsupervised learning is mainly used to group similar data, and it plays a major role in feature engineering.
Thank you..