Can some one explain (or quote a reference) to compare the scoring mechanism used by SOLR and LUCENE in simpler words.
Is there any difference in them;
I am not that good at solr/lucene but my finding showed as if they are different.
P.S: i just tries a simple query like "+Contents:risk" and didn't use any filter other stuff.
Lucene uses concepts from the Vector space model to compute the score of documents. In summary, queries and documents can be seen as vectors. To compute the score of a document for a particular query, Lucene calculates how near each document's vector are from the query's vector. The more a document is near the query in VSM, the higher the score. You can have more details by looking at Lucene's Similarity class and Lucene's Scoring document.
The actual formula can be found in the Similarity javadocs.
Here's a summary of the parameters involved and a brief description of what they mean.
Solr uses Lucene under the hood, and by default Solr uses the default Lucene similarity algorithm.
Related
I'm creating a search engine to search a list of roughly 20k English phrases, each one being a few words long.
I've looked into ways to create the search engine, and currently I am using a TfidfVectorizer from sklearn and Cosine Similarity to compute the ranking scores.
From what I understand in information retrieval you have retrieval and ranking phases, however I'm confused how you could use a data structure like an inverted index to speed up the search before using TfidfVectorizer? It seems like TfidfVectorizer creates a term-document matrix which is different to an index. Could you just store TF and IDF values in an inverted index and use cosine similarity at run time? Ideally I want autocomplete of phrases so I need to store edge ngrams as well, and a boolean model isn't useful here.
I have 27000 free text elements, each of around 2-3 sentences. I need to cluster these by similarity. So far, I have pretty limited success. I have tried the following:
I used Python Natural Language Toolkit to remove stop words, lemmatize and tokenize, then generated semantically similar words for each word in the sentence before inserting them into a Neo4j graph database. I then tried querying that using the TF counts for each word and related word. That didn't work very well and only resulted in being able to easily calculate the similarity between two text items.
I then looked at Graphawares NLP library to annotate, enrich and calculate the cosine similarity between each text item. After 4 days of processing similarity I checked the log to find that it would take 1.5 years to process. Apparently the community version of the plugin isn't optimised, so I guess it's not appropriate for this kind of volume of data.
I then wrote a custom implementation that took the same approach as the Graphaware plugin, but in much simpler form. I used scikitlearn's TfidfVectorizer to calculate the cosine similarity between each text item and every other text item and saved those as relationships between the Neo4j nodes. However, with 27000 text items that creates 27000 * 27000 = 729000000 relationships! The intention was to take the graph into Grephi selecting relationships of over X threshold of similarity and use modularity clustering to extract clusters. Processing for this is around 4 days which is much better. Processing is incomplete and is currently running. However, I believe that Grephi has a max edge count of 1M, so I expect this to restrict what I can do.
So I turned to more conventional ML techniques using scikitlearn's KMeans, DBSCAN, and MeanShift algorithms. I am getting clustering, but when it's plotted on a scatter chart there is no separation (I can show code if that would help). Here is what I get with DBSCAN:
I get similar results with KMeans. These algorithms run within a few seconds, which obviously makes life easier, but the results seem poor.
So my questions are:
Is there a better approach to this?
Can I expect to find distinct clusters at all in free text?
What should my next move be?
Thank you very much.
I think your question is too general to be a good fit for Stack Overflow, but to give you some pointers...
What is your data? You discuss your methods in detail but not your data. What sort of clusters are you expecting?
Example useful description:
I have a bunch of short product reviews. I expect to be able to separate reviews of shoes, hats, and refrigerators.
Have you tried topic modelling? It's not fancy but it's a traditional method of sorting textual documents into clusters. Start with LDA if you're not familiar with anything.
Are you looking for duplicates? If you're looking for plagiarism or bot-generated spam, look into MinHash, SimHash, and the FuzzyWuzzy library for Python.
Given an index or database with a lot of (short) documents (~ 1 million), I am trying to do some kind of novelty detection for each newly incoming document.
I know that I have to compute the similarity of the new document with each document in the index. If the similarity is below a certain threshold, one can consider this document as novel. One common approach - which I want to do - is to use a Vector Space Model and compute the cosine similarity (e.g. by using Apache Lucene).
But this approach has two shortcomings: 1) it is computationally expensive and 2) it does not incorporate the semantics of documents and words respectively.
In order to overcome these shortcomings, my idea was to either use an LDA topic distribution or named entities to augment the Lucene index and the query (i.e. the document collection and each new document) with semantics.
Now, I am completely lost regarding the concrete implementation. I have already trained an LDA topic model using Mallet and I am also able to do Named Entity Recognition on the corpus. But I do not know how to use these topics and named entities in order to realise novelty detection. More specifically, I do not know how to use these features for index and query creation.
For example, is it already sufficient to store all named entities of one document as a separate field in the index, add certain weights (i.e. boost them) and use a MultiFieldQuery? I do not think that this already adds some kind of semantics to the similarity detection. The same applies to LDA topics: is it sufficient to add the topic probability of each term as a Payload and implement a new similarity score?
I would be very happy if you could provide some hints or even code snippets on how to incorporate LDA topics or named entities in Lucene for some kind of novelty detection or semantic similarity measure.
Thank you in advance.
Most of the recommendation algorithm in mahout requires user-item preference. But I want to find similar items for a given item. My system doesn't have user inputs. i.e. for any movie these can be attribute which can be use to find similarity coefficient
Genre
Director
Actor
The attribute list can be modified in future to build more efficient system. But to find item similarity in mahout datamodel user preference for each item is required. Where as these movies can be clustered together and get closest items in cluster on given item.
Later on after introducing user based recommendation above result can be used to boost the result.
If product attribute has some fix values like Genre. Do I have to convert those values to numerical value. If yes how system will calculate distance between two items where genre-1 and genre-2 doesn't have any numeric relation.
Edit:
I have found few example from command line, but I want to do it in java and save the pre-computed values for later use.
I think in the case of features vectors, the best similarity measure is the ones with exact matches like jaccard similarity for example.
In jaccard, the similarity between two items vectors is calculated as:
number of features in intersection/ number of features in union.
So, converting the genre to a numerical value will not make a difference since the exact match ( that is used to find intersection) will be the same in non numerical values.
Take a look at this question for how to do it in mahout:
Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
It sounds like Mahout's spark-rowsimilarity algorithm, available since version 0.10.0, would be the perfect solution to your problem. It compares the rows of a given matrix (i.e: row vectors representing movies and their properties), looking for cooccurrences of values across those rows - or in your case: cooccurrences of Genres, Directors, and Actors. No user history or item interaction needed. The end result is another matrix mapping each of your movies to the top n most similar other movies in your collection, based on cooccurrence of genre, director, or actor.
The Apache Mahout site has a great write-up regarding how to do this from the command line, but if you want a deeper understanding of what's going on under the covers, read Pat Ferrel's machine learning blog Occam's Machete. He calls this type of similarity content or metadata similarity.
I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id corresponding to a particular document for recommendations. So whenever I wanted to recommend documents similar to a particular document, I would query all the documents in a particular cluster and return n random documents from the cluster. However, returning any random document from the cluster did not seem appropriate and I read somewhere that we should be returning the documents nearest to the document in question.
So I started searching for calculating distance between documents and stumbled upon the RowSimilarity approach which returns 10 most similar documents to each document, ordered by distance. Now this approach relies on a similarity metric like LogLikelihood etc to calculate the distance between documents.
Now my question is this. How is clustering better/worse than RowSimilarity given that both the approaches use a similarity distance metric to calculate the distance between documents?
What I'm trying to achieve is that I'm trying to cluster products on the basis of their titles and other text properties to recommend similar products. Any help is appreciated.
Clustering is not just another variant of classification or recommendation. It is a different discipline.
When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.
Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.
For documents, Euclidean distance (and k-means is in fact optimizing Euclidean distances) are usually pretty much meaningless. The vectors are very sparse, and k-means cluster centers will then often resemble impossible (and thus insensible) "average documents".
And I havn't started on the need to find an appropriate value of k, on the Mahout implementation likely just being an approximation of Lloyds k-means approximation, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results. For example, almost all clusters containing 1 or 0 elements, and a mega-cluster containing the rest. In this situation, you might in fact be returning just random documents from your database...
Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example if the clusters are in any way useful and sensible!
Similarity is not the same thing as distance -- one is big when the other is small. Clustering is not the same as computing distances either. First you should decide whether you have a clustering problem -- it does not sound like you do based on what you say. So, don't use k-means.