Bi-gram model to predict text - machine-learning

I am planning to implement a bi-gram model to predict search text. If a user has frequently searched "Test search word", then when the user types "Test" I want to automatically suggest "Test search word".
I have a list of previously searched text. I am trying a bi-gram approach so that even if the user types "Tast", it should still suggest "Test search word". I am implementing this in Java, and I am looking for a library to which I can supply my data so that, when I pass in the user's typed text, it returns the prediction.
After some research I found the links below:
https://www.javatips.net/api/Solbase-Lucene-master/contrib/analyzers/common/src/java/org/apache/lucene/analysis/shingle/ShingleFilter.java
https://opennlp.apache.org/docs/1.8.1/apidocs/opennlp-tools/opennlp/tools/ngram/NGramUtils.html
but they do not help in my case. Are there any Java libraries that suit my purpose?

I'm thinking of two solutions:
First
Index each of your users' string queries in a MARISA (Matching Algorithm with Recursively Implemented StorAge) trie, a data structure optimised for keyword search and autocomplete.
Prepare a Levenshtein distance measurement method to tolerate typos.
Now, for each new user query q, get all strings indexed in the MARISA trie that have q as a prefix (after typo tolerance).
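A minimal sketch of that approach, in Python for brevity since the same structure ports to Java (marisa_trie is an assumed PyPI package, and the edit-distance threshold of 1 is just a placeholder):
# Sketch only: marisa_trie is an assumed dependency (pip install marisa-trie);
# a hand-rolled trie or a sorted key list would serve the same purpose.
import marisa_trie

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class TypoTolerantSuggester:
    def __init__(self, past_queries):
        self.queries = sorted(set(past_queries))
        self.trie = marisa_trie.Trie(self.queries)

    def suggest(self, q, max_edits=1):
        # Fast path: exact prefix matches straight out of the trie.
        hits = self.trie.keys(q)
        if hits:
            return hits
        # Typo tolerance: keep stored queries whose prefix of the same
        # length is within max_edits of what the user typed.
        return [s for s in self.queries
                if levenshtein(s[:len(q)].lower(), q.lower()) <= max_edits]

suggester = TypoTolerantSuggester(["Test search word", "Testing tools"])
print(suggester.suggest("Tast"))   # -> ['Test search word', 'Testing tools']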
Second
Use an Elasticsearch suggester.
Documentation https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-suggesters.html#completion-suggester
Please note that parts of the suggest feature are still under development.
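For context, a hedged sketch of what those completion-suggester calls look like over HTTP (Python requests here for brevity; the index name search_history and field name suggest are assumptions, and the same requests can be issued from the Elasticsearch Java client):
# Sketch only, following the 7.x completion-suggester documentation linked above.
import requests

ES = "http://localhost:9200"

# 1. Create an index whose "suggest" field is of type "completion".
requests.put(f"{ES}/search_history", json={
    "mappings": {"properties": {"suggest": {"type": "completion"}}}
})

# 2. Feed past searches; "weight" can encode how often each one was searched.
requests.post(f"{ES}/search_history/_doc", json={
    "suggest": {"input": "Test search word", "weight": 12}
})
requests.post(f"{ES}/search_history/_refresh")

# 3. Suggest as the user types; the "fuzzy" option tolerates the "Tast" typo.
resp = requests.post(f"{ES}/search_history/_search", json={
    "suggest": {
        "q": {
            "prefix": "Tast",
            "completion": {"field": "suggest", "fuzzy": {"fuzziness": 1}}
        }
    }
})
for option in resp.json()["suggest"]["q"][0]["options"]:
    print(option["text"])   # expected: Test search word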

Related

Query-document similarity with doc2vec

Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc)
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
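For example, a minimal sketch of that combined-fields, PV-DBOW setup (gensim 4.x assumed; the two records, field names and hyperparameters are placeholders, and a real corpus needs far more documents):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

records = [
    {"id": "doc1", "title": "Deep Learning", "author": "Ian Goodfellow", "publisher": "MIT Press"},
    {"id": "doc2", "title": "Speech and Language Processing", "author": "Dan Jurafsky", "publisher": "Pearson"},
]

# One combined 'document' per record: all fields concatenated into a single token list.
corpus = [
    TaggedDocument(
        words=" ".join([r["title"], r["author"], r["publisher"]]).lower().split(),
        tags=[r["id"]],
    )
    for r in records
]

# dm=0 selects the fast PV-DBOW mode.
model = Doc2Vec(corpus, dm=0, vector_size=100, epochs=40, min_count=1, workers=4)

# Treat the query like a (tiny) document: infer a vector for it, then rank documents by similarity.
query_vec = model.infer_vector("language processing jurafsky".split())
print(model.dv.most_similar([query_vec], topn=2))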
You could then try testing the idea of making the fields separate docs, either instead of, or in addition to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
Then, provided you have enough training data and choose other model parameters well, your search interface might also preprocess queries similarly when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
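A tiny sketch of that per-field token preprocessing (field names are placeholders), which could feed the TaggedDocument words above:
def field_scoped_tokens(record, fields=("title", "author", "publisher")):
    """Emit both the naked token and a field-scoped "field:token" variant."""
    tokens = []
    for field in fields:
        for word in str(record.get(field, "")).lower().split():
            tokens.append(word)                 # naked original token
            tokens.append(f"{field}:{word}")    # field-scoped token, e.g. "author:john"
    return tokens

print(field_scoped_tokens({"author": "John Smith"}))
# -> ['john', 'author:john', 'smith', 'author:smith']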
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.

Recognizing language patterns in a list of sentences on Google Sheets

I am trying to analyze a series of sentences by identifying the most common adverb-adjective-noun strings. I have managed to get answers for how to do this with random words, but I think this is a standalone question and it might be better to deal with it separately.
In this case, I would like to omit common word types like personal pronouns, articles, prepositions and even verbs. Ideally, the results should produce:
Most common nouns
Most common adjectives
Most common adverbs
Most common adjective+noun strings
Most common adverb+noun strings
I understand there is a way to do this by using an online dictionary, but I have been unable to integrate that into my code to get the results I want. Is there any way of automating this without listing all the words you want omitted? How could it be done?
Here's a link to the spreadsheet I'm using (for this particular query, see page 2) and a screenshot of the types of text I would like to analyze with a manual color-coded visualization of what I want to achieve:
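As one hedged illustration of automating this without hand-listing words to omit: a part-of-speech tagger labels each word as noun, adjective, adverb, etc., so pronouns, articles, prepositions and verbs simply fall out of the counts. A minimal sketch with NLTK (an assumption; it runs outside Sheets, and the example sentences are placeholders):
# Sketch only: NLTK plus its tokenizer and tagger models are assumed dependencies.
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = [
    "The remarkably fast runner easily won the long race.",
    "A fast car passed the old bridge very quickly.",
]

nouns, adjectives, adverbs, adj_noun = Counter(), Counter(), Counter(), Counter()
for sentence in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:] + [("", "")]):
        if tag.startswith("NN"):
            nouns[word] += 1
        elif tag.startswith("JJ"):
            adjectives[word] += 1
            if next_tag.startswith("NN"):           # adjective immediately before a noun
                adj_noun[f"{word} {next_word}"] += 1
        elif tag.startswith("RB"):
            adverbs[word] += 1

print(nouns.most_common(5), adjectives.most_common(5), adverbs.most_common(5), adj_noun.most_common(5))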

How to best use Solr parser syntax in a specific business requirement

Just starting to learn Solr for a project at work and wondering how to go about this issue. Our application allows a user to search based on a business name. The business name comprises 3 different categories (English, French and Combined Name). Based on a single query entered by the user, how would one go about using Solr to provide the most relevant search results? I have looked into fuzzy and proximity searches, which seem reasonable enough, although fuzzy search only applies to a single term, which makes me believe I would need to split the query into single terms, apply fuzzy search to each, and merge the results. My question is how best to approach the problem. Thanks!
To provide relevancy for your documents, you need a combination of proper boosting queries and a clear idea of what relevance means for your use case. If regex-based search is part of the use case, you may go for n-grams; if exact search is what you are seeking, boosting is important. You can use parameters like phrase slop, mm, and the other eDisMax parameters to your advantage. You may use a combination of title and text-content search with a good combination of boosts. Solr also allows you to pass your query in parentheses, which functions like an SQL IN query and further boosts relevancy by sticking to the keywords mentioned in the query. And, at last, if all of this doesn't suffice, you may use custom function queries to meet your needs. While doing all this, just make sure the analyzers in schema.xml are right and serve the purpose of executing the above queries.
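For illustration, a hedged sketch of some of those eDisMax knobs as raw query parameters (shown as an HTTP call in Python for brevity; the collection name businesses, the field names and the boost values are all placeholders):
import requests

params = {
    "defType": "edismax",
    "q": "green apple bakery",
    "qf": "name_en^3 name_fr^3 name_combined^1.5",   # query fields with boosts
    "pf": "name_en^5 name_fr^5",                     # extra boost for phrase matches
    "ps": 2,                                         # phrase slop
    "mm": "2<75%",                                   # minimum-should-match
    "fl": "id,name_en,name_fr,score",
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/businesses/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc)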
You can go as far down this rabbit-hole as you have time for with respect to business-name search: fuzzy matching, sound-alikes, language-specific analysis, and weird compound terms used as a domain name (e.g. getting "EZBake" to match "easy bake", or "1-to-1" to match "one to one", is non-trivial).
Since this sounds like a pre-existing application, I typically look to query logs (when available) to sample the frequency of different types of mismatches (dig out the zero-result search terms and start manually categorizing the high-level issues behind the more common mismatches).
That will provide you with a backlog of "matching use cases to research how to implement" (in the order of maximal benefit, as determined by your sample).
Then you're ready to start burning them down, and asking much more specific questions about how to get Solr to jump through your domain-specific hoops.

What is the best approach for interpreting a text input for geocoding purposes?

Consider the following site:
http://maps.google.com
It has a main text input where the user can type businesses, countries, provinces, cities, addresses and zip codes. I wonder what the best way to implement a search like this is. I realize that Google Maps probably uses a full-text search with all kinds of data in the same table, and that it likely also has a parser which classifies the input (i.e. distinguishes numeric input, like zip codes and coordinates, from textual input, like businesses and addresses).
With the data spread across many tables and systems, a parser is essential. The parser could be built from regular expressions, or it could be built with AI tools like artificial neural networks and genetic algorithms.
Which approach would you recommend?
It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search site's syntax). The only drawback of using something like Lucene is that you would need to build its index. You wouldn't be querying your database directly (which could get very complicated; inverted indexes are pretty much designed for what you're trying to do), so you need to periodically gather new information from your database and add it to your index. It might also be necessary to rebuild your index to remove unneeded data.
With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the internet), it performs very well, and is not terribly complicated. By using Lucene, you avoid the hit of using regular expressions (which are not the most performant text searching mechanism), and you don't have to write your own parser. Should be a win-win, aside from a little learning curve to build a Lucene index generator and figure out how to query that index.
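Not Lucene itself, but a toy sketch of the inverted-index idea described above, with rows from different tables flattened into searchable documents (all names and values are placeholders):
from collections import defaultdict

index = defaultdict(set)   # token -> set of document ids
documents = {}

def add_document(doc_id, fields):
    """fields: a flattened row, e.g. {"type": "business", "name": "Argos", "city": "Manchester"}."""
    documents[doc_id] = fields
    for value in fields.values():
        for token in str(value).lower().split():
            index[token].add(doc_id)

def search(query):
    """Return ids of documents containing every query token (AND semantics)."""
    token_sets = [index.get(tok, set()) for tok in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

add_document(1, {"type": "business", "name": "Argos", "city": "Manchester"})
add_document(2, {"type": "address", "street": "10 Market Street", "zip": "M1 1AE", "city": "Manchester"})
print(search("argos manchester"))   # -> {1}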
I'd keep the data in one database. If the data got too big, or I knew it would be huge, I'd assign an id to each business, address, etc., and then have other tables which reference this data.
Regular Expressions would only be necessary if the user could define what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester (sorry, I'm English)? Maybe you could then get the location of the user based on their IP, but what happens if they say:
business: Argos Scotland
Now you don't know if the company name has two words, or if there is a location next to it. All of this has to be taken into consideration.
P.S. Sorry if that made no sense.
You will need to pre-process the query before doing a full-text search on it. If you are using a GIS database, then you will already have columns like city, area code, country, etc. Convert your query into tokens separated on spaces, commas, or both, then hit the individual columns to see which match. This way you will know which part of the query is the city, which is the area code, and so on.
You could also try some naive approximation approaches; for example, six consecutive digits will probably be an area code. Look for common words like "road", "restaurant", "street", etc., which will be part of many queries, and then use some approximation to figure out what the user is looking for. Hope this helps.
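A rough sketch of that pre-processing pass (the keyword list and digit-length heuristics are placeholder assumptions):
import re

PLACE_KEYWORDS = {"road", "street", "avenue", "restaurant", "cafe", "hotel"}

def classify_tokens(query):
    """Split the query on spaces/commas and take a first guess at each token's role."""
    tokens = [t for t in re.split(r"[,\s]+", query.strip()) if t]
    labelled = []
    for tok in tokens:
        if re.fullmatch(r"\d{6}", tok):
            labelled.append((tok, "area_code"))       # six consecutive digits
        elif re.fullmatch(r"\d{5}(-\d{4})?", tok):
            labelled.append((tok, "zip_code"))
        elif tok.lower() in PLACE_KEYWORDS:
            labelled.append((tok, "place_keyword"))
        else:
            labelled.append((tok, "free_text"))       # try the city / business columns
    return labelled

print(classify_tokens("Argos, Market Street Manchester 560001"))
# -> [('Argos', 'free_text'), ('Market', 'free_text'), ('Street', 'place_keyword'),
#     ('Manchester', 'free_text'), ('560001', 'area_code')]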

Using Ferret to build unique tag clouds

I've been using Ferret as my full-text search engine in a small project I'm working on.
Through the documentation and a few examples online, I've been able to pull together a tag cloud generator that uses the full-text index, via the IndexReader.terms method.
It's worked quite well up to now; the problem comes when I want to get term data based on a search result.
For example, if the user searches for "cake", I want to show them a tag cloud of terms used in association with the term "cake".
I've been looking for examples where the terms method can be used in association with a search result set, or something similar.
Currently I'm using the following method to generate my list of tags:
require 'ferret'

# Open a reader on the most recent index and collect every term in the
# :all_quotes field along with its document frequency.
reader = Ferret::Index::IndexReader.new(Scrape.find_last_index_version)
terms = []
reader.terms(:all_quotes).each do |term, doc_freq|
  terms << [term, doc_freq]
end
Cheers.
It sounds more like a term-frequency chart (like a Wordle) than a tag cloud; or are these in a tag field? Either way, the index doesn't keep track of term frequency within each possible document subset (such as the results of a search), so that method wouldn't be fast even if it existed. For a single document, you can get the TermFreqVector and provide suggested documents that are good matches for other frequent terms in that document. So you could take some of the top results, grab the term vectors from each one, and just add them up, but those aggregate functions don't exist natively (they generally try not to put slow operations in there).
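A language-agnostic sketch of that "grab the term vectors of the top hits and add them up" idea, shown in Python for brevity (the per-hit frequency dicts stand in for whatever Ferret's term-frequency vectors return):
from collections import Counter

def tag_cloud_for_hits(term_freqs_per_hit, top_k=30):
    """term_freqs_per_hit: one {term: frequency} dict per top-ranked search hit."""
    totals = Counter()
    for tf in term_freqs_per_hit:
        totals.update(tf)
    return totals.most_common(top_k)

print(tag_cloud_for_hits([{"cake": 4, "chocolate": 2}, {"cake": 1, "recipe": 3}], top_k=3))
# -> [('cake', 5), ('recipe', 3), ('chocolate', 2)]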
