Search a document in Elasticsearch by a list of Wildcarded statements on a single field - url

If I have documents in ElasticSearch that have a field called url, and the contents of the url field are strings like "http://www.foo.com" or "http://www.bar.com/some/url/segment/the-page.html", is it possible to search for documents matching a list of wildcarded url fragments, e.g., ["http://www.foo.*", "http://www.bar.com/*/segment/*.html", "*://*bar.com/*"]?
If it is possible, what is the best approach? I have explored the wildcard query, which only seems to support one fragment, not multiple. Filters don't seem to support wildcarding either; I have tried using * in a term filter without any luck.
To make it a little more complex, I'm also interested in being able to search by a large number of these fragments. I have come across the terms filter lookup, which seems like a good solution for dealing with many search terms, but I'm not sure wildcarding works with filters.
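For illustration, one way to combine several wildcard fragments (a minimal sketch, not from the question; the index name, host, and patterns are made up) is a bool query whose should clauses each hold one wildcard fragment:

import requests

fragments = [
    "http://www.foo.*",
    "http://www.bar.com/*/segment/*.html",
    "*bar.com/*",
]
query = {
    "query": {
        "bool": {
            # a document matches if any one wildcard pattern matches its url
            "should": [{"wildcard": {"url": frag}} for frag in fragments],
            "minimum_should_match": 1,
        }
    }
}
resp = requests.post("http://localhost:9200/pages/_search", json=query)
print(resp.json()["hits"]["total"])

Note that wildcard queries run against the indexed terms, so the url field would need to be stored as a single un-analyzed (keyword) term for whole-URL patterns like these to match.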
Any thoughts?

Related

How to best use Solr parser syntax in a specific business requirement

Just starting to learn Solr for a project at work and was wondering how to go about this issue. Our application allows a user to search based on a business name. The business name is comprised of 3 different categories (English, French and Combined Name). Based on a single query entered by the user, how would one go about using Solr to provide the most relevant search results? I have looked into fuzzy and proximity searches, which seem reasonable enough, although fuzzy search only applies to a single term, which makes me believe that I would need to split the query into single terms, apply fuzzy search to each, and merge the results if I were to use it. My question is how to best approach the problem? Thanks!
To provide relevancy for your documents, you need a combination of proper boosting queries and a clear sense of what relevance means for your use case. If regex-based search is part of the use case, you may go for NGrams; if exact search is what you are seeking, boosting is what matters. You can use parameters like phrase slop, mm, and other edismax parameters to your advantage. You may use a combination of title and text content search, with a good combination of boosts. Also, Solr allows you to pass your query in parentheses, which functions like a SQL IN query and further boosts relevancy by sticking to the keywords mentioned in the query. And, at last, if all of this doesn't suffice, you may use custom function queries to meet your needs. While doing all this, just make sure the analyzers in the schema.xml file are right and serve the purpose of executing the queries mentioned above.
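As a rough illustration of the edismax side of that advice (a hedged sketch; the core name, field names, boosts, and mm value are made-up examples, not taken from the answer):

import requests

params = {
    "defType": "edismax",
    "q": "acme bakery",
    # search all three name fields, weighting the combined name highest
    "qf": "name_en^2 name_fr^2 name_combined^4",
    # phrase boost: reward documents where the whole query appears as a phrase
    "pf": "name_combined^10",
    "ps": 2,           # phrase slop
    "mm": "2<75%",     # minimum-should-match for longer queries
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/businesses/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc)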
You can go as far down this rabbit hole as you have time for with respect to business-name search. (Fuzzy matching, sound-alike matching, language-specific analysis, and weird compounded terms used as domain names are all non-trivial; e.g., getting "EZBake" to match "easy bake", or "1-to-1" to match "one to one".)
Since this sounds like a pre-existing application, I typically look to query logs (when available) to sample the frequency of different types of mismatches (dig out the zero-result search terms and start manually categorizing the high-level issues behind the more common mismatches).
That will provide you with a backlog of "matching use cases to research how to implement" (in the order of maximal benefit, as determined by your sample).
Then you're ready to start burning them down, and asking much more specific questions about how to get Solr to jump through your domain-specific hoops.
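A tiny sketch of that log-mining step (the log format here is an assumption: one tab-separated "query<TAB>hit_count" pair per line):

from collections import Counter

zero_hits = Counter()
with open("search_queries.log", encoding="utf-8") as log:
    for line in log:
        query, _, hits = line.rstrip("\n").rpartition("\t")
        if query and hits.isdigit() and int(hits) == 0:
            zero_hits[query.strip().lower()] += 1

# the most frequent zero-result queries are the mismatches worth categorizing first
for query, count in zero_hits.most_common(25):
    print(f"{count:6d}  {query}")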

How to implement fuzzy search

I'm using the Neo4j 3 REST API and I have a node named customer with properties like name. I need to get search results on the customer name; e.g., I should get results for the name "john" when my input is "joan". How do I implement fuzzy search to get the desired results?
Thanks in advance
First off, I want to make sure you know that if you're using Neo4j 3.x, 3.x is currently in beta and isn't considered stable yet.
You have two options to implement a fuzzy search in Neo4j. You can use the legacy indexes to implement Lucene-based indexing. That should provide anything that Lucene can do, though you'd probably need to do a bit more work. You can also implement your own unmanaged extension, which will allow you to use Lucene a bit more directly.
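A hedged sketch of the legacy-index route (this assumes a manual Lucene index named "customers" already exists and has each customer's name added under the "name" key; the host, credentials, and the 0.7 similarity are also assumptions):

import requests

payload = {
    "statements": [{
        # START queries a legacy index with raw Lucene syntax; ~ enables fuzzy matching
        "statement": "START c=node:customers('name:joan~0.7') RETURN c.name AS name"
    }]
}
resp = requests.post(
    "http://localhost:7474/db/data/transaction/commit",
    json=payload,
    auth=("neo4j", "secret"),
)
for row in resp.json()["results"][0]["data"]:
    print(row["row"][0])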
Perhaps the easier alternative is to use Elasticsearch with Neo4j and have Elasticsearch do your full-text indexing. You might take a look at the Neo4j and ElasticSearch page on neo4j.com. There they link to a GitHub repository with a Neo4j plugin that automagically updates ElasticSearch with data from Neo4j and provides an endpoint for querying your graph fuzzily. There is also a video tutorial on how to do this.
You could try https://neo4j.com/developer/kb/how-to-perform-a-soundex-search/ which will work in this case. With a plain fuzzy search, if your input is "joan" you will not get "john" in the response, unless you give just "jo" as input, in which case you will get both. To get what you are expecting, you will have to use the soundex search.
Stepping back a little, what is the problem you are trying to solve with fuzzy matching?
My experience has been that misspellings and typos are far less common than you might think, and humans prefer exact matches whenever possible. If there is no exact match (often just missing a space between words), that's a good time to use a spellchecker, and that's where the fuzzy matching should kick in.
In addition, your example would match "joan" to "john", but some synonyms like "joanie" would be more useful. If you have a big corpus of content to work with, you may be able to extract some relationships, using fuzzy & machine learning to identify "joanne" and "joni" as possible synonyms and then submit that to a human curator. "Jon" looks like a related name but it's not, while "jo" and even "nonie" may or may not be nicknames in these groupings.

elasticsearch nGram/edgengram partial match?

I'm trying to make partial search work; a search for
"sw"
"swe"
"swed"
should match "Sweden"
I looked around and just can't get it to work
Rails Code
I'm using this code from the Tire repo as template code.
whole words still match!
I have reindexed and also tried using the edgengram filter.
I'm not a Ruby developer but found this article useful:
http://dev.af83.com/2012/01/19/autocomplete-with-tire.html
As it says:
Most databases handle that feature with a filter (with LIKE keyword in SQL, regular expression search with mongoDB). The strategy is simple: iterate on all results and keep only words which match the filter. This is brutal and hurts the hard drive. Elastic Search can do it too, with the prefix query.
With a small index, it plays well. For large indexes it will be more slow and painful.
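For reference, the prefix query the article mentions might look roughly like this (a sketch; the index and field names are made up):

import requests

query = {"query": {"prefix": {"name": "swe"}}}
resp = requests.post("http://localhost:9200/places/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["name"])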
You said:
whole words still match!
And what do you expect? Is "Sweden" not supposed to match "Sweden" but only "Swe", "Swed", or "Swede"?
That happens because your query against the field is analyzed too.
Use the edgengram token filter. That will get you what you're looking for.
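A minimal sketch of that setup against a recent Elasticsearch (the index and field names are made up, and the syntax follows current versions rather than the old Tire-era API): the field is indexed through an edge_ngram filter but searched with a plain analyzer, so the query itself is not chopped into ngrams and "sw", "swe", "swed" all match "Sweden".

import requests

index = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {"type": "edge_ngram", "min_gram": 2, "max_gram": 15}
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "autocomplete",      # index time: emit edge ngrams
                "search_analyzer": "standard",   # query time: whole terms only
            }
        }
    },
}
requests.put("http://localhost:9200/places", json=index)
requests.post("http://localhost:9200/places/_doc", json={"name": "Sweden"})
requests.post("http://localhost:9200/places/_refresh")

resp = requests.post(
    "http://localhost:9200/places/_search",
    json={"query": {"match": {"name": "swe"}}},
)
print(resp.json()["hits"]["total"])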

Lucene partial word matching

Lucene does not support partial word matching out of the box, so I need some help building my query.
Let's say I have a document with a field value "Develop".
I would like this document to be returned for the searches "Dev" and "lop".
Maybe by creating several queries: "*keyword" and "keyword*" and "keyword"?
How would you go about doing this with multiple words? Would you split the sentence/search into a word list and do the previous example for each word?
What you're asking is, if I understand you correctly, not feasible on any large-scale search engine.
Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see links at the bottom). Fully fledged substring matching might be very nice to have, but it does not scale: you would never be able to query a decently sized index (say, more than a couple of dozen or hundred documents) in an acceptable time.
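A toy illustration of that point (not from the answer): term lookups hit the inverted index's dictionary directly, but finding "lop" inside terms means scanning every term in the index.

# build a minimal inverted index: term -> set of document ids
documents = {1: "develop software", 2: "development cycle", 3: "loop detection"}
inverted = {}
for doc_id, text in documents.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

print(inverted.get("develop"))   # cheap dictionary lookup: {1}

# infix search has no such shortcut; every indexed term must be examined
print({doc for term, docs in inverted.items() if "lop" in term for doc in docs})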
Still, here are two ideas that might help...
Syllable tokenization
To come back to your example with 'Develop': as long as you are happy with letting users search for syllables, I guess you can do something.
You would have to use a tokenizer that splits up the words in your index according to their syllables and build an index over those syllables. (I am not sure there are built-in tokenizers for the English language that can do that, and writing one on your own might be tricky...)
An important thing to note:
If you index the full words AND the separate syllables, the size of your index will be much larger than if you only index one of the two.
However, I would not suggest indexing only syllables. If you also want to allow your users to search for the full word 'Develop' (which I guess you do), this would result in two queries with a logical AND between them, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries, they are very expensive. I have personally had some trouble in the past using logical queries in Lucene.
Stemming
Another way to get somewhere close to what you're trying to do could be to use a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words to their first syllable. (This would allow searching for 'dev' but not for 'lop'...)
Again, I don't think such a word stem feature is already in Lucene. Writing one for yourself will be a pain and involve working with/importing huge dictionaries.
Links
These might be worth looking into if you don't know about search engine internals:
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Inverted_file
http://en.wikipedia.org/wiki/Term-document_matrix
http://en.wikipedia.org/wiki/Tf-idf

Using Ferret to build unique tag clouds

I've been using Ferret as my full-text search engine in a small project I'm working on.
Through the documentation and a few examples online, I've been able to pull together a tag cloud generator that uses the full-text index via the IndexReader.terms method.
It's worked quite well up to now, but now I want to get term data based on a search result.
For example, if the user searches for "cake", I want to show them a tag cloud of terms used in association with the term "cake".
I've been looking for examples of how the terms method can be used in association with a search result set or similar.
Currently I'm using the following method to generate my list of tags:
reader = Ferret::Index::IndexReader.new(Scrape.find_last_index_version)
terms = []
# iterate over every term in the :all_quotes field along with its document frequency
reader.terms(:all_quotes).each do |term, doc_freq|
  terms << [term, doc_freq]
end
Cheers.
It's more like a term frequency chart (like a wordle) than a tag cloud? Or are these in a tag field? Anyway, the index doesn't keep track of term frequency within each possible document subset (such as the results of a search), so that method wouldn't be fast, even if it existed. For a single document, you can get the TermFreqVector and provide suggested documents that are good matches for other frequent terms in that document. So, you could take some of the top results, grab the term vectors from each one, and just add them up, but those aggregate functions don't exist natively (they generally try not to put slow operations in there.)
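The aggregation idea from that last sentence, sketched generically (this is illustrative Python, not Ferret's API; top_hits and term_frequencies() are stand-ins for whatever your search call and per-document term-vector lookup return):

from collections import Counter

def term_frequencies(doc_id):
    # placeholder: a real implementation would read the stored term vector
    # (term -> frequency) for this document from the index
    return {}

def tag_cloud_for(top_hits, limit=30):
    totals = Counter()
    for doc_id in top_hits:
        totals.update(term_frequencies(doc_id))  # "just add them up"
    return totals.most_common(limit)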
