I want to build a text classifier where the category is determined by the text.
Which classifier should I work with?
I have been reading about Mahout. Is Mahout sufficient? I have about 1 million documents to train on.
I could not find a good example/tutorial for the Mahout classifier.
Does Mahout have an HTTP server where I can make a request and it gives me a response back?
If not, how do I embed Mahout in my web app (PHP)?
Please suggest some good tutorials on Mahout.
It seems that your data is not labeled, so I believe you are looking at a clustering problem.
I strongly suggest you start with the Mahout in Action book. The book covers Recommendations, Clustering and Classification. It should have all the information you need to get you started.
NaiveBayesClassifier
Mahout does not have an embedded HTTP server; you have to build your own. Your PHP code can then call that service as an HTTP client.
There is a demo implementation in Mahout in Action, but it is not based on HTTP.
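If it helps, here is a rough sketch of what "build your own" could look like, using the JDK's built-in com.sun.net.httpserver (Java 9+ for readAllBytes); the classify method below is just a placeholder for however you invoke your trained Mahout model:

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpServer;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class ClassifierService {

        // Placeholder: call your trained Mahout model here and return its label.
        static String classify(String text) {
            return "some-category";
        }

        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/classify", (HttpExchange exchange) -> {
                // Read the document text sent by the client (e.g. your PHP app).
                String body = new String(exchange.getRequestBody().readAllBytes(),
                        StandardCharsets.UTF_8);
                byte[] response = classify(body).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, response.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(response);
                }
            });
            server.start();
            System.out.println("Listening on http://localhost:8080/classify");
        }
    }

From PHP you would then POST the document text to http://localhost:8080/classify and read the predicted label from the response body.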
Good luck!
I am exploring methods and code for constructing a Bayesian network for information retrieval using a data-driven approach. I can only find very old papers, where the dataset and code are not available. I am new to this field and still exploring.
If anyone can provide a link to code, or suggest recent papers that would help with implementing Bayesian network construction, please share.
I am new to Apache Mahout, but I read an article which said that Apache Mahout 1.0 provides content-based recommendation (http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html). It now turns out that it does not give content-based recommendations; rather, it gives recommendations based on different user actions on a website.
Amazon and Netflix might be using content-based recommenders, and they probably implemented them from scratch, but my question is:
Is there any machine learning library which provides content-based recommendation, or do I have to implement it myself?
By content-based recommendation I mean that there is a feature vector for each item and a behaviour vector for each user, and by multiplying them we get a recommendation score for a particular user.
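Concretely, I mean something like this (a toy sketch with made-up vectors, just to illustrate the scoring I have in mind):

    public class ContentBasedScore {

        // Dot product of an item's feature vector and a user's behaviour vector.
        static double score(double[] itemFeatures, double[] userProfile) {
            double s = 0.0;
            for (int i = 0; i < itemFeatures.length; i++) {
                s += itemFeatures[i] * userProfile[i];
            }
            return s;
        }

        public static void main(String[] args) {
            // Made-up vectors over three content features (e.g. genres).
            double[] item = {1.0, 0.0, 0.5};
            double[] user = {0.8, 0.1, 0.3};
            System.out.println("score = " + score(item, user)); // 0.8 + 0 + 0.15 = 0.95
        }
    }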
Please recommend something to me,
Thanks in advance.
I'm at a crossroads: I've been using Mahout to classify some documents, and have stumbled across the OpenNLP document classifier.
They seem to do very similar things, and I can't figure out whether it's worth converting what I currently have written in Mahout to an OpenNLP implementation instead.
Are there any blatantly obvious advantages Mahout has over OpenNLP for document classification?
My situation is that I have several hundred thousand news articles, and I only want to extract a subset of them. Mahout does this reasonably well: I'm using Naive Bayes, with term counts weighted by TF-IDF, to determine which category each document falls into. The model is updated as and when new articles are found, so it improves consistently over time.
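Roughly, the weighting step I rely on looks like this (a simplified sketch of TF-IDF, not my actual Mahout code):

    public class TfIdf {

        // Classic TF-IDF weight: term frequency in the document, scaled by how
        // rare the term is across the whole corpus.
        static double tfIdf(int termCountInDoc, int docLength,
                            int docsContainingTerm, int totalDocs) {
            double tf = (double) termCountInDoc / docLength;
            double idf = Math.log((double) totalDocs / (1 + docsContainingTerm));
            return tf * idf;
        }

        public static void main(String[] args) {
            // A term that appears 5 times in a 100-word article and occurs in
            // 1,000 of 500,000 articles overall.
            System.out.println(tfIdf(5, 100, 1000, 500000));
        }
    }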
It seems the OpenNLP document classifier does something very similar (although I have not tested how accurate it is). Does anyone have experience using both, who can say definitively why one would be used over the other?
I don't have experience with these two, but while trying to figure out if one of them would make a difference in a personal project, I stumbled upon this blog, and I quote:
Data categorization with OpenNLP is another approach with more accuracy and performance rate as compared to mahout.
You can check the blog post here.
For a novice to machine learning, what are the learning prerequisites to using Apache Mahout in an efficient way?
I know that a committer to Mahout would need calculus, linear algebra, probability and machine learning before they can contribute anything useful. But does a "User" of Apache Mahout need all of this?
I'm asking because learning/revising all of the above would take me ages.
Mahout In Action provides a good overview of what you need to know to use Mahout.
Typically, scalable machine learning does not require advanced mathematics for use. It may require serious math to develop, but not necessarily to use.
The primary requirement is that you really understand your data and its origins and what you want to do with it. That understanding doesn't have to come all at once and can be developed over time.
Try googling the topics below:
Programming Collective Intelligence
Similarity calculation with vectors (see the sketch below)
The difference between clustering and classification
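For the vector-similarity topic, here is a minimal cosine-similarity sketch (the document vectors are made up for illustration):

    public class CosineSimilarity {

        // Cosine similarity: dot product divided by the product of magnitudes.
        // 1.0 means the vectors point the same way, 0 means they are orthogonal.
        static double cosine(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            // Two toy term-frequency vectors for two documents.
            double[] doc1 = {2, 0, 1, 3};
            double[] doc2 = {1, 1, 0, 2};
            System.out.println("similarity = " + cosine(doc1, doc2));
        }
    }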
I have been working on crawling web pages and extracting the elements of a website.
Example:
Given a website, the crawler should return the following sections: header, menu, footer, content, etc.
I was thinking it would be great if I could use machine learning to train the code to classify websites.
I tried looking at Python machine learning libraries (e.g. PyBrain), but the examples are very complex.
Can anyone please suggest a library and a tutorial with some simple examples for getting started with machine learning in Python?
Thanks!
MLPy may be a simpler start for you.
Here is a link to the documentation on classification. By the way, if you don't know in advance what the classes should be, you may need to cluster your pages rather than classify them.