Document clustering - machine-learning

Are there any Artificial Intelligence algorithms which can be applied to improve Document Clustering results? The algorithm for clustering can be hierarchical or any other.
Thank You

The Wikipedia article on document clustering includes a link to a 2007 paper by Nicholas Andrews and Edward Fox from Virginia Tech called "Recent Developments in Document Clustering". I'm not sure specifically what you would class as an "Artificial Intelligence algorithm" but scanning the paper's contents shows that they look at vector space models, extensions to kmeans, generative algorithms, spectral clustering, dimensionality reduction, phase-based models, and comparative analysis. It's a pretty mathematically dense treatment but they are careful to include references to the algorithms they talk about.

Clustering is indeed a type of problem in the AI domain. And if you want to go one level down you may say it is in the Machine Learning field. In this sense AI does not improve document clustering, but solves it! Dumbledad mentions some basic alternatives but the type of data you have each time may be treated better with different algorithm. There are a lot of k-means based approaches for the problem. Careful seeding is needed in such a case. Spherical k-means (search for the paper of Dhillon) is a simple and standard approach. Other extensions are k-synthetic prototypes.
Subspace clustering is also a good try and in general if you want to go further than "document clustering" literature check for "clustering in high dimensional and sparse data spaces".

Related

How to determine, whether to use machine learning algorithms or data mining technique for a given scenario?

I have been reading so many articles on Machine Learning and Data mining from the past few weeks. Articles like the difference between ML and DM, similarities, etc. etc. But I still have one question, it may look like a silly question,
How to determine, when should we use ML algorithms and when should we use DM?
Because I have performed some practicals of DM using weka on Time Series Analysis(future population prediction, sales prediction), text mining using R/python, etc. Same can be done using ML algorithms also, like future population prediction using Linear regression.
So how to determine, that, for a given problem ML is best suitable or Dm is best suitable.
Thanks in advance.
Probably the closest thing to the quite arbitrary and meaningless separation of ML and DM is unsupervised methods vs. supervised learning.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.

why using support vector machine?

I have some questions about SVM :
1- Why using SVM? or in other words, what causes it to appear?
2- The state Of art (2017)
3- What improvements have they made?
SVM works very well. In many applications, they are still among the best performing algorithms.
We've seen some progress in particular on linear SVMs, that can be trained much faster than kernel SVMs.
Read more literature. Don't expect an exhaustive answer in this QA format. Show more effort on your behalf.
SVM's are most commonly used for classification problems where labeled data is available (supervised learning) and are useful for modeling with limited data. For problems with unlabeled data (unsupervised learning), then support vector clustering is an algorithm commonly employed. SVM tends to perform better on binary classification problems since the decision boundaries will not overlap. Your 2nd and 3rd questions are very ambiguous (and need lots of work!), but I'll suffice it to say that SVM's have found wide range applicability to medical data science. Here's a link to explore more about this: Applications of Support Vector Machine (SVM) Learning in Cancer Genomics

Research papers classification on the basis of title of the research paper

Dear all I am working on a project in which I have to categories research papers into their appropriate fields using titles of papers. For example if a phrase "computer network" occurs somewhere in then title then this paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers. So I want to know how I should start. I have tried to use tf-idf but could not get actual results. Does someone know about a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know categories in advance, than it's not classification, but instead clustering. Basically, you need to do following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has enormous number of implementations, even in libraries not specialized in ML. Another popular choice is Expectation-Maximization (EM) algorithm. Both of them, however, require initial guess about number of classes. If you can't predict number of classes even approximately, other algorithms - such as hierarchical clustering or DBSCAN - may work for you better (see discussion here).
As for features, words themselves normally work fine for clustering by topic. Just tokenize your text, normalize and vectorize words (see this if you don't know what it all means).
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of tasks, but if you have another language of preference, you most probably will be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. For starting out, you can maybe try a simple word frequency model (bag of words) and later on move to more complex feature extraction methods (string kernels). You can start by using SVM's (Support Vector Machines) to classify the data using LibSVM (the best SVM package).
The fact, that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover the clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you, to give it a try.

How to approach a machine learning programming competition

Many machine learning competitions are held in Kaggle where a training set and a set of features and a test set is given whose output label is to be decided based by utilizing a training set.
It is pretty clear that here supervised learning algorithms like decision tree, SVM etc. are applicable. My question is, how should I start to approach such problems, I mean whether to start with decision tree or SVM or some other algorithm or is there is any other approach i.e. how will I decide?
So, I had never heard of Kaggle until reading your post--thank you so much, it looks awesome. Upon exploring their site, I found a portion that will guide you well. On the competitions page (click all competitions), you see Digit Recognizer and Facial Keypoints Detection, both of which are competitions, but are there for educational purposes, tutorials are provided (tutorial isn't available for the facial keypoints detection yet, as the competition is in its infancy. In addition to the general forums, competitions have forums also, which I imagine is very helpful.
If you're interesting in the mathematical foundations of machine learning, and are relatively new to it, may I suggest Bayesian Reasoning and Machine Learning. It's no cakewalk, but it's much friendlier than its counterparts, without a loss of rigor.
EDIT:
I found the tutorials page on Kaggle, which seems to be a summary of all of their tutorials. Additionally, scikit-learn, a python library, offers a ton of descriptions/explanations of machine learning algorithms.
This cheatsheet http://peekaboo-vision.blogspot.pt/2013/01/machine-learning-cheat-sheet-for-scikit.html is a good starting point. In my experience using several algorithms at the same time can often give better results, eg logistic regression and svm where the results of each one have a predefined weight. And test, test, test ;)
There is No Free Lunch in data mining. You won't know which methods work best until you try lots of them.
That being said, there is also a trade-off between understandability and accuracy in data mining. Decision Trees and KNN tend to be understandable, but less accurate than SVM or Random Forests. Kaggle looks for high accuracy over understandability.
It also depends on the number of attributes. Some learners can handle many attributes, like SVM, whereas others are slow with many attributes, like neural nets.
You can shrink the number of attributes by using PCA, which has helped in several Kaggle competitions.

How to categorize continuous data?

I have two dependent continuous variables and i want to use their combined values to predict the value of a third binary variable. How do i go about discretizing/categorizing the values? I am not looking for clustering algorithms, i'm specifically interested in obtaining 'meaningful' discrete categories i can subsequently use in in a Bayesian classifier.
Pointers to papers, books, online courses, all very much appreciated!
That is the essence of machine learning and problem one of the most studied problem.
Least-square regression, logistic regression, SVM, random forest are widely used for this type of problem, which is called binary classification.
If your goal is to pragmatically classify your data, several libraries are available, like Scikits-learn in python and weka in java. They have a great documentation.
But if you want to understand what's the intrinsics of machine learning, just search (here or on google) for machine learning resources.
If you wanted to be a real nerd, generate a bunch of different possible discretizations and then train a classifier on it, and then characterize the discretizations by features and then run a classifier on that, and see what sort of discretizations are best!?
In general discretizing stuff is more of an art and having a good understanding of what the input variable ranges mean.

Resources