Classifying website type from webpages - machine-learning

Are there any reliable, deployed approaches, algorithms, or tools for tagging a website's type by parsing some of its webpages?
For example: forums, blogs, press-release sites, news, e-commerce, etc.
I am looking for well-defined characteristics (static rules) from which the type can be determined. If there are none, I hope a machine learning model may help.
Suggestions or ideas?

If you approach this from a machine learning standpoint, a Naive Bayes classifier probably gives the best payoff for the work involved. A version of it is used in Winnow to categorize news articles.
You will need a collection of pages, each tagged with its proper category. Then you extract words or other relevant elements from each page and use them as features.
Dr. Dobb's has an article on implementing Naive Bayes.
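As a rough illustration of the approach described above, here is a minimal multinomial Naive Bayes sketch in plain Python. The class name, the whitespace tokenizer, the add-one smoothing, and the toy training pages are all assumptions made for this example; none of it is taken from Winnow or from the Dr. Dobb's article.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_doc_counts = Counter()        # pages seen per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocabulary = set()

    def train(self, text, label):
        self.class_doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            # log P(label) + sum over words of log P(word | label)
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-made training pages; real use needs many tagged pages.
classifier = NaiveBayes()
classifier.train("add to cart checkout price shipping", "ecommerce")
classifier.train("buy now price discount cart", "ecommerce")
classifier.train("reply thread posted by member quote", "forum")
classifier.train("new thread reply members online", "forum")

predicted = classifier.classify("price of item in cart")
```

For real pages you would probably want features beyond raw words (for example, the presence of a shopping cart form or a comment thread), but the training and scoring steps stay the same.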

If you're interested in pursuing the naïve Bayes approach (there are other machine learning options, after all), then I suggest the following document, which follows the coverage of this subject in "Data Mining: Practical Machine Learning Tools and Techniques", by Witten and Frank:
http://www.coli.uni-sb.de/~crocker/Teaching/Connectionist/lecture10_4up.pdf

Related

In machine learning, which algorithm should I use to recommend based on different features like rating, type, gender, etc.?

I am developing a website that will recommend recipes to visitors based on their data. I am collecting data from their profile, website activity, and Facebook.
Currently I have data like [username/userId, rating of recipes, age, gender, type (veg/non-veg), cuisine (Italian/Chinese, etc.)]. Using the above features, I want to recommend new recipes that they have not visited.
I have implemented the ALS (alternating least squares) algorithm in Spark. For this, we have to prepare a CSV that contains the columns [userId, RecipesId, Rating]. Then we train on this data and create the model by adjusting parameters like lambda, rank, and iterations. The model generates recommendations using pyspark:
model.recommendProducts(userId, numberOfRecommendations)
The ALS algorithm accepts only three features: userId, RecipesId, and Rating. I am unable to include any further features (like type, cuisine, gender, etc.). I want to include those features, then train the model and generate recommendations.
Is there any other algorithm in which I can include the above parameters and generate recommendations?
Any help would be appreciated. Thanks.
Yes, there are a couple of other algorithms. For your case, I would suggest that you use the Naive Bayes algorithm.
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
Since you are working on a web application, a JS solution would, I guess, come in handy.
(simple) https://www.npmjs.com/package/bayes
or for example:
(a bit more powerful) https://www.npmjs.com/package/naivebayesclassifier
There is a family of machine learning algorithms called recommender systems, which includes content-based recommender systems. They are mainly used to recommend products or movies based on customer reviews. You can apply the same approach, using customer reviews, to recommend recipes. For a better understanding of this approach, refer to these links:
https://www.youtube.com/watch?v=Bv6VkpvEeRw&list=PL0Smm0jPm9WcCsYvbhPCdizqNKps69W4Z&index=97
https://www.youtube.com/watch?v=2uxXPzm-7FY
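To make the content-based idea a little more concrete, here is a minimal sketch in plain Python. It is a simplification: it builds a user profile from recipe attributes (type, cuisine) weighted by that user's ratings, and ranks unseen recipes by cosine similarity. It does not use review text, and the recipe catalogue, the ratings, the neutral rating of 3, and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical recipe catalogue: each recipe is described by content features.
recipes = {
    "margherita": {"type:veg", "cuisine:italian"},
    "carbonara": {"type:nonveg", "cuisine:italian"},
    "pasta_primavera": {"type:veg", "cuisine:italian"},
    "veg_fried_rice": {"type:veg", "cuisine:chinese"},
    "kung_pao_chicken": {"type:nonveg", "cuisine:chinese"},
}

# Hypothetical ratings given by one user (recipe id -> rating from 1 to 5).
ratings = {"margherita": 5, "carbonara": 2}


def build_profile(ratings, recipes):
    """Sum feature indicators, weighted by how far each rating is from neutral (3)."""
    profile = defaultdict(float)
    for recipe_id, rating in ratings.items():
        for feature in recipes[recipe_id]:
            profile[feature] += rating - 3
    return profile


def cosine(profile, features):
    """Cosine similarity between a weighted profile and a binary feature set."""
    dot = sum(profile.get(f, 0.0) for f in features)
    norm_profile = math.sqrt(sum(w * w for w in profile.values()))
    norm_features = math.sqrt(len(features))
    if norm_profile == 0 or norm_features == 0:
        return 0.0
    return dot / (norm_profile * norm_features)


def recommend(ratings, recipes, n):
    """Rank recipes the user has not rated by similarity to the user's profile."""
    profile = build_profile(ratings, recipes)
    unseen = [r for r in recipes if r not in ratings]
    ranked = sorted(unseen, key=lambda r: cosine(profile, recipes[r]), reverse=True)
    return ranked[:n]


top = recommend(ratings, recipes, 2)
```

User attributes such as age or gender are not used here; one possible extension would be to add them as extra features on the profile side, but that is beyond this sketch.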
You can also go with powerful classification algorithms like:
->SVM: works very well if you have a large number of attributes.
->Logistic Regression: works well if you have a large amount of customer data.
You are looking for recommender systems that use algorithms like collaborative filtering. I would suggest going through Prof. Andrew Ng's short videos on the collaborative filtering algorithm, low-rank matrix factorization, and building recommender systems. They are part of Coursera's Machine Learning course offered by Stanford University.
The course link:
https://www.coursera.org/learn/machine-learning#%20
You can check week 9 for the content related to recommender systems.

Research papers classification on the basis of title of the research paper

Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then the paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers, so I want to know how I should start. I have tried to use tf-idf but could not get useful results. Does someone know of a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know the categories in advance, then it's not classification but clustering. Basically, you need to do the following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose the combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is one of the most popular and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of classes. If you can't predict the number of classes even approximately, other algorithms, such as hierarchical clustering or DBSCAN, may work better for you (see discussion here).
As for features, the words themselves normally work fine for clustering by topic. Just tokenize your text, then normalize and vectorize the words (see this if you don't know what it all means).
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you prefer another language, you will most probably be able to find similar libraries for it too.
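The three steps above (select an algorithm, extract features, apply the algorithm) can be sketched without any library, just to show the moving parts. This is a toy version under several assumptions: a tiny hand-picked stopword list, raw word counts as features, plain k-means with squared Euclidean distance, and the first k titles used as starting centroids. The titles are made up. For 3 million titles you would almost certainly use an existing implementation, such as the ones linked above.

```python
from collections import Counter

STOPWORDS = {"in", "of", "the", "a", "and", "for", "on"}  # tiny illustrative list


def vectorize(title):
    """Tokenize, lowercase, drop stopwords, and count the remaining words."""
    return Counter(w for w in title.lower().split() if w not in STOPWORDS)


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors (dicts)."""
    return sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in set(a) | set(b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: count / len(vectors) for k, count in total.items()}


def kmeans(vectors, k, iterations=10):
    # Naive initialisation (an assumption): the first k vectors are the centroids.
    centroids = [dict(v) for v in vectors[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return assignments


# Hypothetical paper titles.
titles = [
    "Routing protocols in computer network",
    "Gene expression in cancer cells",
    "Computer network congestion control",
    "Cancer gene mutation analysis",
]
labels = kmeans([vectorize(t) for t in titles], k=2)
```

The cluster numbers themselves carry no meaning; what matters is which titles end up sharing a number.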
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you could try a simple word frequency model (bag of words) and later move on to more complex feature extraction methods (string kernels). You can start by using SVMs (Support Vector Machines) to classify the data using LibSVM (the best SVM package).
Given that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover the clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.

Mahout Classifier v. OpenNLP Documentclassifier

I'm at a crossroads. I've been using Mahout to classify some documents, and have stumbled across the OpenNLP document classifier.
They seem to do very similar things, and I can't figure out whether it's worth converting what I currently have written in Mahout into an OpenNLP implementation instead.
Are there some blatantly obvious advantages Mahout has over OpenNLP for document classification?
My situation is that I have several hundred thousand news articles, and I only want to extract a subset of them. Mahout does this reasonably well: I'm using Naive Bayes for term counting, and then TF-IDF to determine which category the documents fall into. The model is updated as and when new articles are found, so it keeps improving over time.
It seems the OpenNLP document classifier does something very similar (although I have not tested how accurate it is). Does anyone have experience using both, who can say definitively why one would be used over the other?
I don't have experience with these two, but while trying to figure out if one of them would make a difference in a personal project, I stumbled upon this blog, and I quote:
Data categorization with OpenNLP is another approach with more accuracy and performance rate as compared to mahout.
You can check the blog post here.

What subjects, topics does a computer science graduate need to learn to apply available machine learning frameworks, esp. SVMs

I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
1. Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.).
2. Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand).
3. ... similar classification tasks on text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?
I understand that SVMs are well suited to (binary) classification tasks like mine, and open source frameworks like libSVM are fairly mature?
In that case, what subjects and topics does a computer science graduate need to learn right now, so that the above requirements can be solved by putting these frameworks to use?
I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch but, to begin with, to put the various available frameworks to use (I do not know enough to decide which, though), and I should be able to fix things should they go wrong.
Recommendations from you on learning specific portions of statistics and probability theory would not be unexpected, so say so if required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learning is the equivalent of having a model. The model can be, for example, a collection of support vectors, the layout and weights of a neural network, a decision tree, or something else. Which of these methods works best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML sites, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step because it requires domain knowledge, and you have to extract useful information, otherwise your classifiers will not be able to make good distinctions. Feature extraction will give you a dataset (a matrix with one row of features per example) from which you'll be able to create your model.
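As a small illustration of what such a feature extraction step might look like for HTML, here is a sketch using Python's standard `html.parser` module. The choice of features (link, form, and image counts, visible word count, links per word) is an assumption made for this example; which features are actually useful depends on your domain, as noted above. The class name, function name, and sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Collects tag counts and visible words from an HTML page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.words = Counter()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and stylesheet bodies; count everything else as text.
        if self._skip_depth == 0:
            self.words.update(data.lower().split())


def extract_features(html):
    """Turn one page into one row of numeric features."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total_words = sum(parser.words.values())
    num_links = parser.tag_counts["a"]
    return {
        "num_links": num_links,
        "num_forms": parser.tag_counts["form"],
        "num_images": parser.tag_counts["img"],
        "num_words": total_words,
        "links_per_word": num_links / total_words if total_words else 0.0,
    }


page = (
    "<html><body><h1>Shop</h1><a href='/cart'>View cart</a>"
    "<form></form><script>var x = 1;</script></body></html>"
)
features = extract_features(page)
```

Calling `extract_features` on every page gives the rows of the dataset; the labels come from your tagged examples.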
Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you use at the end to decide which method is best. It is of extreme importance that you keep the test set hidden until the very end of your modeling step! The test data basically gives you an estimate of the "generalization error" that your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with. Machine learners say that the model "overfits" the training data. Such overfitted models appear to be good, but this is just memorization.
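The point about memorization can be shown with a deliberately extreme sketch: a "model" that simply stores its training examples scores perfectly on the data it was trained on and fails on the held-out test set. The dataset (link count per page mapped to a category), the 80/20 split, and all names here are hypothetical.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, examples):
    """Fraction of (features, label) pairs that predict() gets right."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


# Hypothetical dataset: (number of links on the page, category).
data = [(n, "portal" if n > 50 else "article") for n in range(0, 100, 5)]
train, test = train_test_split(data)

# A "model" that only memorises the training pairs it has seen.
memory = dict(train)


def memoriser(n):
    return memory.get(n)  # None for anything not seen during training


train_accuracy = accuracy(memoriser, train)  # looks perfect
test_accuracy = accuracy(memoriser, test)    # reveals the lack of generalization
```

Real models are rarely this extreme, but the gap between the two numbers is exactly what the hidden test set is there to expose.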
While software support for preprocessing data is very sparse and highly domain dependent, as adam mentioned, Weka is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik, a co-inventor of SVMs, wrote "The Nature of Statistical Learning Theory". You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology might be helpful to you in finding your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there aren't also a lot of tools available for it as well.
For text-based classification right now Naive Bayes, Decision Trees (J48 in particular I think), and SVM approaches are giving the best results. However they are each more suited for slightly different applications. Off the top of my head I'm not sure which would suit you the best. With a tool like WEKA you could try all three approaches with some example data without writing a line of code and see for yourself.
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are just entering the machine learning field, so I'd really like to suggest having a look at this book: not only does it provide a deep and broad overview of the most common machine learning approaches and algorithms (and their variations), but it also provides a very good set of exercises and links to scientific papers. All of this is wrapped in insightful language, along with a minimal yet useful compendium on statistics and probability.

Naive Bayesian for Topic detection using "Bag of Words" approach

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a Naive Bayesian approach that I might be able to look up for this?
Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrences of words other than the ones that are already mapped, I want to add them to the mappings, thereby learning new words that map to a topic, and also update the probabilities of the words.
How should I go about doing this? Is my approach the right one?
Which programming language would be best suited for the implementation?
Existing Implementations of Naive Bayes
You would probably be better off just using one of the existing packages that supports document classification using naive Bayes, e.g.:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
Bootstrapping Classification from Keywords
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.
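To give a feel for the bootstrapping idea, here is a heavily simplified sketch in plain Python. It only covers the first two stages: label whatever the seed keywords can label, then train a Naive Bayes classifier on those preliminary labels so it can handle documents that contain no seed keyword. It leaves out the EM and shrinkage steps from the paper entirely, and the seed keywords, documents, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed keywords per topic (the hard-coded starting dictionary).
SEED_KEYWORDS = {
    "sports": {"football", "goal"},
    "finance": {"stock", "bank"},
}


def keyword_label(words):
    """Return the topic whose seed keywords match most often, or None."""
    hits = {t: sum(w in kws for w in words) for t, kws in SEED_KEYWORDS.items()}
    best = max(hits, key=hits.get)  # ties go to the first topic listed
    return best if hits[best] > 0 else None


def train_naive_bayes(labelled_docs):
    """Count documents and words per topic from (words, topic) pairs."""
    doc_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for words, topic in labelled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab


def classify(words, model):
    """Pick the topic with the highest smoothed log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())

    def log_score(topic):
        total = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)
        for w in words:
            score += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
        return score

    return max(doc_counts, key=log_score)


documents = [
    "the striker scored a late goal in the football match",
    "the bank raised rates and the stock market fell",
    "the striker missed the match through injury",
    "the market reacted to the rates decision",
]
tokenized = [d.split() for d in documents]

# Step 1: label what the keywords can label. Step 2: train on those labels.
seed_labelled = [(w, keyword_label(w)) for w in tokenized if keyword_label(w)]
model = train_naive_bayes(seed_labelled)

# Step 3: the classifier now also labels documents with no seed keyword.
labels = [classify(w, model) for w in tokenized]
```

Words such as "striker" or "market" were never in the seed dictionary, yet they now carry weight for a topic, which is roughly the "learning new words" behaviour asked about in the question.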
