Is NLTK's naive Bayes Classifier suitable for commercial applications? - machine-learning

I need to train a naive Bayes classifier on two corpora consisting of approx. 15,000 tokens each. I'm using a basic bag-of-words feature extractor with binary labeling, and I'm wondering whether NLTK is powerful enough to handle all this data without significantly slowing down at run time if such an application were to gain many users. The program would basically be classifying a regular stream of text messages from potentially thousands of users. Are there other machine learning packages you'd recommend integrating with NLTK if it isn't suitable?

Your corpora are not very big, so NLTK should do the job. However, I wouldn't recommend it in general: it is quite slow and buggy in places. Weka is a more powerful tool, but the fact that it can do so much more makes it harder to understand. If Naive Bayes is all you plan to use, it would probably be fastest to code it yourself.
EDIT (much later):
Try scikit-learn; it is very easy to use.
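For reference, a minimal scikit-learn sketch of the bag-of-words + Naive Bayes setup described in the question; the messages and labels below are just placeholders for your two corpora.

```python
# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "free prize, click now to claim",          # placeholder for corpus/class 1
    "are we still meeting tomorrow at noon?",  # placeholder for corpus/class 0
]
labels = [1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["claim your free prize now"]))
```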

Related

Simple machine learning for website classification

I am trying to write a Python program that determines whether a website is harmful (porn etc.).
First, I made a Python web scraping program that counts the number of occurrences of each word.
Result for harmful websites:
It's a key-value dictionary like
{ word : [ # of occurrences in harmful websites, # of websites that contain this word ] }.
Now I want my program to analyze the words from any website to check whether the site is safe or not, but I don't know which methods will suit my data.
The key thing here is your training data. You need some sort of supervised learning technique where the training data consists of the website's text itself (as a document) and its label (harmful or safe).
You can certainly use an RNN, but there are also other natural language processing techniques, and much faster ones.
Typically, you should apply a proper vectorizer to your training data (think of each site page as a text document), for example tf-idf (there are other possibilities too; if you use Python I would strongly suggest scikit-learn, which provides lots of useful machine learning techniques, and the mentioned sklearn.TfidfVectorizer is already included). The point is to vectorize your text documents in a sensible way. Imagine, for example, how many times the English word the typically appears in a text; you need to account for biases like that.
Once your training data is vectorized, you can use, for example, a stochastic gradient descent classifier and see how it performs on your test data (in machine learning terminology, testing simply means taking some new data and checking what your ML program outputs for it).
In either case you will need to experiment with the above options. There are many nuances, and you need to test on your data and see where you achieve the best results (depending on the ML algorithm's settings, the type of vectorizer, the ML technique itself, and so on). For example, Support Vector Machines are also a great choice for binary classification; you may want to play with them too and see if they perform better than SGD.
In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best-fitting classifier. On your way to finding the best one, you may also want to use cross-validation to determine how well your classifier behaves; again, that is already included in scikit-learn.
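A rough sketch of that tf-idf + SGD + cross-validation pipeline in scikit-learn, assuming your scraped page texts and their harmful/safe labels sit in two Python lists (the documents below are just placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

documents = [
    "explicit adult content and related terms ...",   # placeholder harmful page
    "family friendly recipes and cooking tips",       # placeholder safe page
    "graphic violence, gore and similar material",    # placeholder harmful page
    "local weather forecast and community news",      # placeholder safe page
]
labels = [1, 0, 1, 0]                                  # 1 = harmful, 0 = safe

# tf-idf down-weights very common words such as "the"; SGDClassifier trains a
# linear model with stochastic gradient descent.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), SGDClassifier())
print(cross_val_score(clf, documents, labels, cv=2).mean())
```

With real data you would raise cv (e.g. to 5) and compare this against an SVM or other classifiers, as discussed above.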
N.B. Don't forget about edge cases. For example, there may be a completely safe online magazine that mentions a harmful topic only in one article; that doesn't mean the website itself is harmful.
Edit: Come to think of it, if you don't have any experience with ML at all, it could be useful to take an online course, because beyond knowing the APIs and libraries you will still need to understand what they do and the math behind the curtain (at least roughly).
What you are trying to do is called sentiment classification and is usually done with recurrent neural networks (RNNs) or long short-term memory networks (LSTMs). This is not an easy topic to start machine learning with. If you are new, you should first have a look at linear/logistic regression, SVMs and basic neural networks (MLPs); otherwise it will be hard to understand what is going on.
That said, there are many libraries out there for constructing neural networks. Probably the easiest to use is Keras. While this library simplifies a lot of things immensely, it isn't a magic box that makes gold from trash; you need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically, determining whether a movie review is positive or not) with Keras.
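That example isn't reproduced here, but a bare-bones sketch of IMDB sentiment classification in Keras (an LSTM over an embedding layer; the hyperparameters are arbitrary choices, not the ones from the linked example) looks roughly like this:

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, max_len = 10_000, 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=max_len)   # reviews as padded word-index sequences
x_test = pad_sequences(x_test, maxlen=max_len)

model = Sequential([
    Embedding(vocab_size, 32),        # learn a 32-dim vector per word index
    LSTM(32),                         # recurrent layer reads the review
    Dense(1, activation="sigmoid"),   # positive vs. negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=128,
          validation_data=(x_test, y_test))
```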
For people who have no experience in NLP or ML, I recommend using a TF-IDF vectorizer instead of deep learning libraries. In short, it converts each sentence to a vector, with one dimension per word in the vocabulary (the value reflects the word's occurrence).
Then you can calculate cosine similarity between the resulting vectors.
To improve performance, use the stemming / lemmatizing / stopword support in the NLTK libraries.
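A small sketch of that idea, assuming NLTK for stemming/stopwords and scikit-learn for the TF-IDF vectors and cosine similarity (the sentences are placeholders):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def preprocess(text):
    # lowercase, drop stopwords, stem what remains
    return " ".join(stemmer.stem(w) for w in text.lower().split() if w not in stops)

sentences = ["The movie was really great", "I loved that film", "The weather is cold today"]
vectors = TfidfVectorizer().fit_transform([preprocess(s) for s in sentences])

# similarity of the first sentence to the other two
print(cosine_similarity(vectors[0], vectors[1:]))
```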

Incremental Learning of SVM

What are some real world applications where incremental learning of (machine learning) algorithms is useful?
Are SVMs preferred for such applications?
Is the solution more computationally intensive than retraining on the set containing the old support vectors and the new training vectors?
There is a well known incremental version of SVM:
http://www.isn.ucsd.edu/pubs/nips00_inc.pdf
However, there are not many existing implementations available; there may be something in Matlab:
http://www.isn.ucsd.edu/svm/incremental/
The advantage of that approach is that it offers exact leave-one-out evaluation of the generalization performance on the training data.
There is a trend towards large, "out of core" datasets, which are often streamed in from the network, disk, or a database. A real-world example is the popular NYC taxi dataset, which, at 330+ GB, cannot be easily tackled by desktop statistical models.
SVMs, as a "one batch" algorithm, must load the entire dataset into memory. As such, they are not preferred for incremental learning. Rather, learners like logistic regression, k-means, and neural nets, which are capable of partial learning, are preferred for such tasks.
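To illustrate the partial-learning point, here is a sketch using scikit-learn's SGDClassifier (a linear model, not a one-batch SVM) and its partial_fit method; the data "stream" is simulated with random chunks.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier()
classes = np.array([0, 1])            # all labels must be declared on the first call

for _ in range(100):                  # pretend each chunk is read from disk/network/DB
    X_chunk = rng.randn(1000, 20)
    y_chunk = (X_chunk[:, 0] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)   # update, never hold all data

X_test = rng.randn(1000, 20)
print(clf.score(X_test, (X_test[:, 0] > 0).astype(int)))
```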

NLP for reliable text classification on raspberry pi

Trying to get up and running my very own smart room.
As of now the system runs on a Raspberry Pi 3 with Google STT, naive Bayes for text classification, PoS/NER by nlp-compromise, a bunch of APIs, and then eSpeak (sure, there are a lot of other stages, but generally speaking).
One thing that is problematic, though, is the text classification. NB is doing a fair job, but there are issues.
Various text classification approaches rely heavily on having large corpora to train with. And this makes sense, particularly if the application is news categorisation, for example.
But here we are talking about spoken language. If the sentence is Tell me the weather, there's only so much corpus you can generate for the variations of that simple statement. And even then, someone will find some other way to ask for the weather.
I don't think there can be a large dataset of statements for each category that would help the device clearly distinguish between commands.
Question
What can I do here, since more categories (or skillsets) would mean more similar statements?
Since it is a classification problem, switching to an SVM or an RNN or any other trick should not make much of a difference, even if I have to rig up an external GPU for it. The corpus consists of spoken sentences for various categories, and the dataset can't be expected to be diverse enough to be really educative for the system.
But honestly I am not clear on what could be a reliable classification method for such a specific purpose.
PS - I have seen how Jasper works, but even that often does not lead to a better "understanding" of categories.

How to approach a machine learning programming competition

Many machine learning competitions are held on Kaggle, where a training set with a set of features is provided along with a test set whose output labels have to be predicted using the training set.
It is pretty clear that supervised learning algorithms like decision trees, SVMs, etc. are applicable here. My question is: how should I start to approach such problems? Should I start with a decision tree, an SVM, or some other algorithm, or is there another approach altogether, i.e. how do I decide?
So, I had never heard of Kaggle until reading your post -- thank you so much, it looks awesome. Upon exploring their site, I found a portion that will guide you well. On the competitions page (click "all competitions"), you see Digit Recognizer and Facial Keypoints Detection, both of which are competitions but are there for educational purposes, with tutorials provided (a tutorial isn't available for Facial Keypoints Detection yet, as the competition is in its infancy). In addition to the general forums, each competition has its own forum as well, which I imagine is very helpful.
If you're interested in the mathematical foundations of machine learning and are relatively new to it, may I suggest Bayesian Reasoning and Machine Learning. It's no cakewalk, but it's much friendlier than its counterparts, without a loss of rigor.
EDIT:
I found the tutorials page on Kaggle, which seems to be a summary of all of their tutorials. Additionally, scikit-learn, a Python library, offers a ton of descriptions/explanations of machine learning algorithms.
This cheatsheet http://peekaboo-vision.blogspot.pt/2013/01/machine-learning-cheat-sheet-for-scikit.html is a good starting point. In my experience, using several algorithms at the same time can often give better results, e.g. logistic regression and SVM, where the results of each one are given a predefined weight. And test, test, test ;)
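One possible way to sketch that weighted combination in scikit-learn is a soft-voting ensemble of logistic regression and an SVM; the weights and the toy data here are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True))],   # probability=True enables soft voting
    voting="soft",
    weights=[0.6, 0.4],                            # predefined weight for each model's vote
)
print(cross_val_score(ensemble, X, y, cv=5).mean())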
There is No Free Lunch in data mining. You won't know which methods work best until you try lots of them.
That being said, there is also a trade-off between understandability and accuracy in data mining. Decision Trees and KNN tend to be understandable, but less accurate than SVM or Random Forests. Kaggle looks for high accuracy over understandability.
It also depends on the number of attributes. Some learners can handle many attributes, like SVM, whereas others are slow with many attributes, like neural nets.
You can shrink the number of attributes by using PCA, which has helped in several Kaggle competitions.
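For instance, a quick scikit-learn sketch of PCA in front of an SVM, using the library's built-in digits data rather than a real Kaggle set:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                  # 64 pixel attributes per image

model = make_pipeline(PCA(n_components=30), SVC())   # 64 attributes -> 30 components
print(cross_val_score(model, X, y, cv=5).mean())
```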

training a decision tree

I am trying to get started with Machine Learning. I have some training data representing pixel values of digits in images and I am trying to train a decision tree out of this. What would be a good way of getting started? What tools should I consider (pointers on related documentation would help)? I also want to train a random forest on the data to compare performance versus decision tree. Any guidance would be of great help.
The best way to get started is probably Weka. Apart from offering implementations of a random forest classifier as well as several decision trees (among lots of other algorithms), it also provides tools for processing and visualizing the data. It comes with a relatively easy to use GUI.
The random forest uses trees, so I'd probably counsel you to get the trees working first. Once you know all about trees, you can read about forests and it will be very straightforward. However, you should start by trying to learn about machine learning rather than just jumping into a library. I would start by understanding how to use decision trees on Boolean features (much simpler), choosing splits by maximizing information gain (i.e. reducing entropy). Once you understand that algorithm well enough to run it by hand on a small dataset, read up on how to use decision trees on real-valued features. Then check out the library.
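If you later want to compare your hand-rolled tree against library implementations in Python rather than Weka, a small scikit-learn sketch on its built-in digit-pixel data might look like this:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)                   # 8x8 pixel values, digits 0-9

tree = DecisionTreeClassifier(criterion="entropy")    # splits chosen by information gain
forest = RandomForestClassifier(n_estimators=100)     # an ensemble of randomized trees

print("tree:  ", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```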

Resources