I want to cluster text. I kinda understand the concept of clustering text-only content from Mahout in Action:
make a mapping (int -> term) of all terms in the input and store into a dictionary
convert all input documents into a normalized sparse vector
do clustering
I want to cluster text as well as other information like date-time, location, people I was with. For example, I want documents made in a 10-day visit to a distant place to be placed into a distinct cluster.
I know I must write my own tool for making vectors from date-time, location, tags and (natural) text. How do I approach this? Should I use built-in tools to vectorize text and then integrate that output to my own vectors? What about weighing the dimensions?

I can't give you full implementation details, as I'm not sure of them, but I can help with one piece of the puzzle. You will almost certainly need some context analysis to extract entities (such as locations, times/dates, and person names).
For this, take a look at OpenNLP.
In particular, look at the POS tagger and the name finder.
Once you have extracted the relevant entities, you may be able to do something with them using Mahout classification (once you have extracted enough entities to train your model), but I am not sure about this.
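On the weighting question, one possible approach (a minimal sketch, not a Mahout feature) is to L2-normalize the text part, rescale each metadata field to a comparable range, and multiply each block by a weight before concatenating. The function name `combine`, the scaling constants, the weights and the sample documents below are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)


def combine(text_tf, day, lat, lon, w_text=1.0, w_time=1.0, w_geo=1.0):
    """Build one sparse vector from term counts plus scaled metadata.

    The scales are assumptions: day-of-year is divided by 365 (wrap-around
    at new year is ignored), latitude by 90 and longitude by 180.
    """
    vec = {f"term:{t}": w_text * v for t, v in l2_normalize(text_tf).items()}
    vec["time:day"] = w_time * day / 365.0
    vec["geo:lat"] = w_geo * lat / 90.0
    vec["geo:lon"] = w_geo * lon / 180.0
    return vec


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def build_docs(w_meta):
    """Three made-up documents: two from the same trip, one written at home."""
    trip_a = combine({"beach": 2, "sunset": 1}, 200, -8.4, 115.2,
                     w_time=w_meta, w_geo=w_meta)
    trip_b = combine({"temple": 1, "rice": 1}, 204, -8.5, 115.3,
                     w_time=w_meta, w_geo=w_meta)
    home = combine({"beach": 1, "sunset": 1}, 40, 52.5, 13.4,
                   w_time=w_meta, w_geo=w_meta)
    return trip_a, trip_b, home


for w in (1.0, 5.0):
    a, b, h = build_docs(w)
    print(w, round(euclidean(a, b), 3), round(euclidean(a, h), 3))
```

With unit weights the text block dominates, so the home document (similar words) ends up closer to `trip_a` than the other trip document does. Raising the time and location weights reverses that, which is the behaviour wanted for the "10-day visit" cluster. The right weights depend on the data and need to be tuned by inspecting the resulting clusters.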
Text corpus clustering

I have 27000 free text elements, each of around 2-3 sentences. I need to cluster these by similarity. So far, I have pretty limited success. I have tried the following:
I used Python Natural Language Toolkit to remove stop words, lemmatize and tokenize, then generated semantically similar words for each word in the sentence before inserting them into a Neo4j graph database. I then tried querying that using the TF counts for each word and related word. That didn't work very well and only resulted in being able to easily calculate the similarity between two text items.
I then looked at GraphAware's NLP library to annotate, enrich and calculate the cosine similarity between each text item. After 4 days of processing similarity I checked the log and found that it would take 1.5 years to finish. Apparently the community version of the plugin isn't optimised, so I guess it's not appropriate for this volume of data.
I then wrote a custom implementation that took the same approach as the GraphAware plugin, but in a much simpler form. I used scikit-learn's TfidfVectorizer to calculate the cosine similarity between each text item and every other text item and saved those as relationships between the Neo4j nodes. However, with 27000 text items that creates 27000 * 27000 = 729,000,000 relationships! The intention was to take the graph into Gephi, select relationships above some similarity threshold X, and use modularity clustering to extract clusters. Processing for this takes around 4 days, which is much better. It is incomplete and currently running. However, I believe that Gephi has a max edge count of 1M, so I expect this to restrict what I can do.
So I turned to more conventional ML techniques using scikit-learn's KMeans, DBSCAN, and MeanShift algorithms. I am getting clusters, but when they are plotted on a scatter chart there is no separation (I can show code if that would help). Here is what I get with DBSCAN:
I get similar results with KMeans. These algorithms run within a few seconds, which obviously makes life easier, but the results seem poor.
So my questions are:
Is there a better approach to this?
Can I expect to find distinct clusters at all in free text?
What should my next move be?
Thank you very much.
I think your question is too general to be a good fit for Stack Overflow, but to give you some pointers...
What is your data? You discuss your methods in detail but not your data. What sort of clusters are you expecting?
Example useful description:
I have a bunch of short product reviews. I expect to be able to separate reviews of shoes, hats, and refrigerators.
Have you tried topic modelling? It's not fancy but it's a traditional method of sorting textual documents into clusters. Start with LDA if you're not familiar with anything.
Are you looking for duplicates? If you're looking for plagiarism or bot-generated spam, look into MinHash, SimHash, and the FuzzyWuzzy library for Python.
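If near-duplicate detection is the goal, the MinHash idea mentioned above can be sketched with the standard library alone. This is a toy illustration, not a tuned implementation: the shingle size, the number of hash functions, the use of salted SHA-256 and the sample sentences are all assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


original = "the quick brown fox jumps over the lazy dog near the river bank"
near_dup = "the quick brown fox jumps over the lazy dog near the river shore"
unrelated = "stock markets fell sharply after central banks raised interest rates"

sig = {name: minhash_signature(shingles(text))
       for name, text in [("original", original),
                          ("near_dup", near_dup),
                          ("unrelated", unrelated)]}

print(estimated_jaccard(sig["original"], sig["near_dup"]))
print(estimated_jaccard(sig["original"], sig["unrelated"]))
```

Signatures are short and fixed-length, so they can be compared (or bucketed with locality-sensitive hashing) without computing all 27000 * 27000 exact similarities.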

Entity Type Recognition: Finding an Entity's Dominant Type from its Description

I've been working on a research project. I have a database of Wikipedia descriptions of a large number of entities, including sportspersons, politicians, actors, etc. The aim is to determine the type of entity using the descriptions. I have access to some data with the predicted type of entity which is quite accurate. This will be my training data. What I would like to do is train a model to predict the dominant type of entity for rest of the data.
What I've done till now:
Extracted the first paragraph, H1, H2 headers of Wiki description of the entity.
Extracted the category list of the entity on the wiki page (the 'Categories' section at the bottom of any page, like here).
Finding the type of entity can be difficult for entities that are associated with two or more concepts, like an actor who later became a politician.
I want to ask as to how I create a model out of the raw data that I have? What are the variables that I should use to train the model?
Also are there any Natural Language Processing techniques that can be helpful for this purpose? I know POS taggers can be helpful in this case.
My search over the internet has not been much successful. I've stumbled across research papers and blogs like this one, but none of them have relevant information for this purpose. Any ideas would be appreciated. Thanks in advance!
The input data is the first paragraph of the Wikipedia page of the entity. For example, for this page, my input would be:
Alan Stuart Franken (born May 21, 1951) is an American comedian, writer, producer, author, and politician who served as a United States Senator from Minnesota from 2009 to 2018. He became well known in the 1970s and 1980s as a performer on the television comedy show Saturday Night Live (SNL). After decades as a comedic actor and writer, he became a prominent liberal political activist, hosting The Al Franken Show on Air America Radio.
My extracted information is, the first paragraph of the page, the string of all the 'Categories' (bottom part of the page), and all the headers of the page.
From what I gather you would like to have a classifier which takes text input and predicts from a list of predefined categories.
I am not sure what your level of expertise is, so I will give a high level overview if additional people would like to know about the subject.
Like all NLP tasks which use ML, you are going to have to transform your textual domain to a numerical domain by a process of featurization.
Process the text and labels
Determine the relevant features
Create numerical representation of features
Train and Test on a Classifier
Process the text and labels
The text might have some strange markers or things that need to be modified to make it more "clean". This is a standard text normalisation step.
Then you will have to keep the related categories as labels for the texts.
It will end up being something like the following:
For each wiki article:
Normalise wiki article text
Save associated categories labels with text for training
Determine the relevant features
Some features you seem to have mentioned are:
Dominant field (actor, politician)
Header information
Syntactic information (POS Tags) are local (token level), but can be used to extract specific features such as if words are proper nouns or not.
Create numerical representation of features
Luckily, there are ways of doing auto-encoding, such as doc2vec, which can make a document vector from the text. Then you can add additional bespoke features that seem relevant.
You will then have a vector representation of features relevant to this text as well as the labels (categories).
This will become your training data.
Train and Test on a Classifier
Now train and test on a classifier of your choice.
Your data is one-to-many as you will try to predict many labels.
Try something simple first, just to see if things work as you expect.
You should test your results with a cross validation routine such as k-fold validation using standard metrics (Precision, Recall, F1)
Just to help clarify, This task is not really a named entity recognition task. It is a kind of multi-label classification task, where the labels are the categories defined on the wikipedia pages.
Named Entity Recognition is finding meaningful named entities in a document such as people, places. Usually something noun-like. This is usually done on a token level whereas your task is on a document level it seems.
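The four steps above can be sketched with scikit-learn. This is only a minimal sketch under stated assumptions: TF-IDF is swapped in for doc2vec to keep the example short, the classifier is a one-vs-rest logistic regression chosen arbitrarily, and the texts and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Steps 1-2: cleaned first paragraphs and their category labels (toy data).
texts = [
    "american actor known for comedy films and television shows",
    "senator who served in the united states senate",
    "comedian and actor who later served as a senator",
    "footballer who played as a striker for the national team",
    "tennis player who won several grand slam titles",
    "film actress who won an academy award",
]
labels = [
    ["actor"],
    ["politician"],
    ["actor", "politician"],
    ["sportsperson"],
    ["sportsperson"],
    ["actor"],
]

# Step 3: numerical representation. Labels become a 0/1 indicator matrix,
# texts become TF-IDF vectors inside the pipeline.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Step 4: one binary classifier per label, so a text can get several labels.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(texts, Y)

pred = model.predict(["actor and comedian who was elected senator"])
print(mlb.classes_, pred)
```

On real data, the fit and predict calls would sit inside a k-fold loop, with precision, recall and F1 computed per label. Extra bespoke features (headers, category strings) can be added as further columns next to the text vector.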

Using Text Sentiment as feature in machine learning model?

I am researching what features I'll have for my machine learning model, with the data I have. My data contains a lot of text data, so I was wondering how to extract valuable features from it. Contrary to my previous belief, this often consists of a representation with bag-of-words, or something like word2vec.
Because my understanding of the subject is limited, I don't understand why I can't analyze the text first to get numeric values. (for example: textBlob.sentiment =, google Clouds Natural Language =)
Are there problems with this, or could I use these values as features for my machine learning model?
Thanks in advance for all the help!
Of course, you can convert text input to a single number with sentiment analysis and then use this number as a feature in your machine learning model. There is nothing wrong with this approach.
The question is what kind of information you want to extract from the text data. Sentiment analysis converts text input to a number between -1 and 1, and the number represents how positive or negative the text is. For example, you may want sentiment information from customers' comments about a restaurant to measure their satisfaction. In this case, it is fine to use sentiment analysis to preprocess the text data.
But again, sentiment analysis only gives an idea of how positive or negative the text is. You may want to cluster text data, and sentiment information is not useful in this case since it does not provide any information about the similarity of texts. Thus, other approaches such as word2vec or bag-of-words are used to represent text data in those tasks, because those algorithms provide a vector representation of the text instead of a single number.
In conclusion, the approach depends on what kind of information you need to extract from data for your specific task.
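To make the difference concrete, here is a small sketch comparing a single sentiment number with a bag-of-words vector. The four-word lexicon is a hypothetical stand-in for a real sentiment tool (it is not how TextBlob or Google's API compute their scores), and the example sentences are made up.

```python
import math
from collections import Counter

# Hypothetical toy lexicon, for illustration only.
LEXICON = {"great": 1.0, "love": 1.0, "terrible": -1.0, "hate": -1.0}


def sentiment(text):
    """Average lexicon score of the known words, in [-1, 1]."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def bag_of_words(text):
    """Sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


shoes_good = "great running shoes"
fridge_good = "love this refrigerator"
shoes_bad = "terrible running shoes"

# Same sentiment, completely different topic.
print(sentiment(shoes_good), sentiment(fridge_good))
print(cosine(bag_of_words(shoes_good), bag_of_words(fridge_good)))

# Opposite sentiment, same topic.
print(sentiment(shoes_bad))
print(cosine(bag_of_words(shoes_good), bag_of_words(shoes_bad)))
```

The sentiment score cannot tell shoes from refrigerators, while the bag-of-words vectors can. Nothing prevents using both: the sentiment value can simply be appended as one more feature next to the text vector.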

metric learning for information retrieval in semi-structured text?

I am interested in parsing semi-structured text. Assuming I have a text with labels of the kind: year_field, year_value, identity_field, identity_value, ..., address_field, address_value and so on.
These fields and their associated values can be anywhere in the text, but usually they are near each other. More generally, the text is organized in a (very) rough matrix, and quite often the value comes just after the associated field, possibly with some uninteresting information in between.
The number of different format can be up to several dozens, and is not that rigid (do not count on spacing, moreover some information can be added and removed).
I am looking toward machine learning techniques to extract all those (field,value) of interest.
I think metric learning and/or conditional random fields (CRF) could be of great help, but I have no practical experience with them.
Has anyone already encountered a similar problem?
Any suggestion or literature on this topic?
Your task, if I understand correctly, is to extract all pre-defined entities from a text. What you describe here is exactly named entity recognition.
Stanford has a Stanford Named Entity Recognizer that you can download and use (python/java and more)
Regarding the models you are considering (CRF, for example): the hard part here is getting the training data, that is, sentences with the entities already labeled. This is why you should consider getting a trained model, or using someone else's data to train your model (again, the model will recognize only the kinds of entities it saw during training).
A great choice for an already trained model in Python is nltk's Information Extraction module.
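To give an idea of what labeled training data for a sequence model looks like, here is a hedged sketch: tokens tagged in the common BIO scheme with the field names from the question, plus a per-token feature function. The feature set, the helper name `token_features` and the sample text are assumptions. Several CRF toolkits accept per-token feature dictionaries of roughly this shape, but check the documentation of the one you choose.

```python
import re


def token_features(tokens, i):
    """Features for token i, including its immediate neighbours as context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "looks_like_year": bool(re.fullmatch(r"(19|20)\d{2}", tok)),
        "is_title": tok.istitle(),
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# One hand-labeled example (made up). B- marks the first token of a span,
# I- a continuation, and O a token outside every span.
tokens = ["Year", ":", "2014", "Name", ":", "Jane", "Doe"]
labels = [
    "B-year_field", "O", "B-year_value",
    "B-identity_field", "O", "B-identity_value", "I-identity_value",
]

X = [token_features(tokens, i) for i in range(len(tokens))]
for features, label in zip(X, labels):
    print(label, features)
```

Because each token sees its neighbours, the model can learn that a value tends to follow its field even when the spacing or the text in between varies.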
Hope this sums it up

Mahout: RowSimilarity vs Clustering

I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id corresponding to a particular document for recommendations. So whenever I wanted to recommend documents similar to a particular document, I would query all the documents in a particular cluster and return n random documents from the cluster. However, returning any random document from the cluster did not seem appropriate and I read somewhere that we should be returning the documents nearest to the document in question.
So I started searching for calculating distance between documents and stumbled upon the RowSimilarity approach which returns 10 most similar documents to each document, ordered by distance. Now this approach relies on a similarity metric like LogLikelihood etc to calculate the distance between documents.
Now my question is this. How is clustering better/worse than RowSimilarity given that both the approaches use a similarity distance metric to calculate the distance between documents?
What I'm trying to achieve is that I'm trying to cluster products on the basis of their titles and other text properties to recommend similar products. Any help is appreciated.
Clustering is not just another variant of classification or recommendation. It is a different discipline.
When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.
Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.
For documents, Euclidean distance (and k-means is in fact optimizing Euclidean distances) is usually pretty much meaningless. The vectors are very sparse, and k-means cluster centers will then often resemble impossible (and thus nonsensical) "average documents".
And I haven't even started on the need to find an appropriate value of k, on the Mahout implementation likely being just an approximation of Lloyd's k-means approximation, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results: for example, almost all clusters contain 1 or 0 elements, and one mega-cluster contains the rest. In this situation, you might in fact be returning just random documents from your database...
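Checking cluster sizes takes only a few lines. In this sketch, `assignments` is assumed to be a list with one cluster id per document, however it was exported from the clustering job. The data below is made up to show a degenerate result.

```python
from collections import Counter


def cluster_size_report(assignments, k):
    """Return per-cluster sizes, the share of the largest cluster,
    and the number of clusters holding at most one document."""
    sizes = Counter(assignments)
    counts = [sizes.get(c, 0) for c in range(k)]
    largest_share = max(counts) / len(assignments)
    tiny = sum(1 for c in counts if c <= 1)
    return counts, largest_share, tiny


# Hypothetical output of k-means with k=5 on 100 documents.
assignments = [0] * 96 + [1, 2, 3, 3]

counts, largest_share, tiny = cluster_size_report(assignments, k=5)
print(counts, largest_share, tiny)
```

One cluster holding 96% of the documents and three clusters with at most one member would mean that "n random documents from the same cluster" is close to "n random documents".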
Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example if the clusters are in any way useful and sensible!
Similarity is not the same thing as distance -- one is big when the other is small. Clustering is not the same as computing distances either. First you should decide whether you have a clustering problem -- it does not sound like you do based on what you say. So, don't use k-means.
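For the "recommend similar products" use case, ranking by similarity directly can be sketched as follows. This is plain cosine similarity on term counts, written for illustration; it is not Mahout's RowSimilarity job and does not use the LogLikelihood measure. The product titles are made up.

```python
import math
from collections import Counter


def vectorize(title):
    """Sparse term-count vector for a product title."""
    return Counter(title.lower().split())


def cosine(a, b):
    """Cosine similarity: 1.0 for identical direction, 0.0 for no shared terms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(query_id, docs, n=2):
    """Return the ids of the n documents most similar to the query."""
    query = docs[query_id]
    scored = [(cosine(query, vec), doc_id)
              for doc_id, vec in docs.items() if doc_id != query_id]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:n]]


docs = {
    "p1": vectorize("red running shoes"),
    "p2": vectorize("blue running shoes"),
    "p3": vectorize("red leather wallet"),
    "p4": vectorize("steel kitchen knife"),
}

print(most_similar("p1", docs))
```

Each query gets its own ranked neighbours, so no choice of k and no cluster boundaries are involved. For large catalogues, the brute-force scan would need to be replaced by an index or a batch job.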
