EdgeNGram with Tire and ElasticSearch - ruby-on-rails

If I have two strings:
Doe, Joe
Doe, Jonathan
I want to implement a search such that:
"Doe" > "Doe, Joe", "Doe, Jonathan"
"Doe J" > "Doe, Joe", "Doe, Jonathan"
"Jon Doe" > "Doe, Jonathan"
"Jona Do" > "Doe, Jonathan"
Here's the code that I have:
settings analysis: {
filter: {
nameNGram: {
type: "edgeNGram",
min_gram: 1,
max_gram: 20,
}
},
tokenizer: {
non_word: {
type: "pattern",
pattern: "[^\\w]+"
}
},
analyzer: {
name_analyzer: {
type: "custom",
tokenizer: "non_word",
filter: ["lowercase", "nameNGram"]
},
}
} do
mapping do
indexes :name, type: "multi_field", fields: {
analyzed: { type: "string", index: :analyzed, index_analyzer: "name_analyzer" }, # for indexing
unanalyzed: { type: "string", index: :not_analyzed, :include_in_all => false } # for sorting
}
end
end
def self.search(params)
tire.search(:page => params[:page], :per_page => 20) do
query do
string "name.analyzed:" + params[:query], default_operator: "AND"
end
sort do
by "name.unanalyzed", "asc"
end
end
end
Unfortunately, this doesn't appear to be working... The tokenizing looks great, for "Doe, Jonathan" I get something like "d", "do", "doe", "j", "jo", "jon", "jona" etc. but if I search for "do AND jo", I get back nothing. If I, however, search for "jona", I get back "Doe, Jonathan." What am I doing wrong?

You should probably only use EdgeNGram if you want to build an autocomplete. I suspect that you want to use a pattern tokenizer to separate words by commas.
Something like this:
"tokenizer": {
"comma_pattern_token": {
"type": "pattern",
"pattern": ",",
"group": -1
}
}
If I am mistaken and you need edgeNGrams for some other reason then your problem is that your index analyzer is ignoring stop words (such as the word AND) and your search analyzer is not. You need to create a custom analyzer for your search_analyzer that does not include the stop word filter.
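As a rough sketch of that last suggestion, assuming the `non_word` tokenizer and `nameNGram` filter from the question stay as they are, the analysis settings could define a second analyzer that only lowercases and is used at search time. The name `name_search_analyzer` is hypothetical and does not appear in the original code.

```ruby
# Analysis settings as a plain Ruby hash (the shape passed to Tire's `settings`).
# `name_search_analyzer` is a hypothetical name introduced for illustration.
ANALYSIS = {
  filter: {
    nameNGram: { type: "edgeNGram", min_gram: 1, max_gram: 20 }
  },
  tokenizer: {
    non_word: { type: "pattern", pattern: "[^\\w]+" }
  },
  analyzer: {
    # Index time: split on non-word characters, lowercase, emit edge n-grams.
    name_analyzer: {
      type: "custom", tokenizer: "non_word", filter: ["lowercase", "nameNGram"]
    },
    # Search time: same tokenizer, lowercase only, so query terms stay whole.
    name_search_analyzer: {
      type: "custom", tokenizer: "non_word", filter: ["lowercase"]
    }
  }
}.freeze

# Field options that pair the two analyzers on the analyzed sub-field.
NAME_FIELD = {
  type: "string",
  index_analyzer: "name_analyzer",
  search_analyzer: "name_search_analyzer"
}.freeze
```

With this pairing, a query term such as "jo" is compared as-is against the edge n-grams stored at index time.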

Related

How to setup date and fuzzy title search on elasticsearch

I am building a Rails 5 app with an Angular 7 frontend.
In this app I am using Searchkick (an Elasticsearch gem) and I have indexed a model called Event that got attributes title (string) and starts_at (datetime).
I want to be able to build a query in the search controller where I am able to do the following:
Search the title with a fuzzy search, meaning it does not have to match 100% (which it currently requires).
Search with a date range matching starts_at for the indexed Events.
This is my controller index method
def index
args = {}
args[:eventable_id] = params[:id]
args[:eventable_type] = params[:type]
args[:title] = params[:title] if params[:title].present?
if params[:starts_at].present?
args[:starts_at] = {}
args[:starts_at][:gte] = params[:starts_at].to_date.beginning_of_day
args[:starts_at][:lte] = params[:ends_at].to_date.end_of_day
end
@events = Event.search where: args, page: params[:page], per_page: params[:per_page]
end
I have added this line to my Event model
searchkick text_middle: [:title]
This is the actual query that is run
{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": [{
"term": {
"eventable_id": "2"
}
}, {
"term": {
"eventable_type": "Space"
}
}, {
"term": {
"title": "nice event"
}
}, {
"range": {
"starts_at": {
"from": "2020-02-01T00:00:00.000Z",
"include_lower": true,
"to": "2020-02-29T23:59:59.999Z",
"include_upper": true
}
}
}]
}
},
"timeout": "11s",
"_source": false,
"size": 10000
}
The date search does not work (but I get no errors) and the title search must match 100% (even the case).
Thankful for all help!
Rather than using Fuzzy queries, I would recommend an ngram analyzer.
Here is an example of an ngram analyzer:
analyzer: {
ngram_analyzer: {
type: "custom",
tokenizer: "standard",
filter: ["lowercase", "ngram_filter"],
char_filter: [
"replace_dots"
]
}
},
filter: {
ngram_filter: {
type: "ngram",
min_gram: "3",
max_gram: "20",
}
}
You will also have to add this code to your settings index:
max_ngram_diff: 17
Then on your mapping, make sure you create two fields. 1 mapping for your regular field such as name and then another mapping for your ngram field such as name.ngram.
In my query, I like to give my name field a boost of 10 and my name.ngram field a boost of 5 so that the exact matches will be rendered first. You will have to play with this though.
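A minimal sketch of such a boosted query, written as a plain Ruby hash. The field names `name` and `name.ngram` follow the example above, the helper name `boosted_name_query` is made up for illustration, and the boosts are only starting points to tune.

```ruby
# Build a multi_match query that ranks exact matches above n-gram matches.
# `boosted_name_query` is a hypothetical helper; the boosts 10 and 5 come
# from the answer and will likely need tuning for real data.
def boosted_name_query(text, exact_boost: 10, ngram_boost: 5)
  {
    query: {
      multi_match: {
        query: text,
        fields: ["name^#{exact_boost}", "name.ngram^#{ngram_boost}"]
      }
    }
  }
end
```

Calling `boosted_name_query("nice event")` returns a body that can be sent to the search endpoint as JSON.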
In regard to your range query, I am using gte and lte. Here is an example:
query:{
bool: {
must: {
range: {date: {gte: params[:date], lte: params[:date], boost: 10}}
}
}
}
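For the question's `starts_at` field, a range clause covering whole days might be built like this. It is only a sketch: the helper name `starts_at_range` is hypothetical, and the timestamps are assumed to be stored in UTC.

```ruby
require "date"

# Build a range clause from the start of `from` to the end of `to`.
# Assumes UTC timestamps; `starts_at_range` is a hypothetical helper.
def starts_at_range(from, to)
  from_day = Date.parse(from.to_s)
  to_day   = Date.parse(to.to_s)
  {
    range: {
      starts_at: {
        gte: "#{from_day.iso8601}T00:00:00.000Z",
        lte: "#{to_day.iso8601}T23:59:59.999Z"
      }
    }
  }
end
```

Taking separate `from` and `to` arguments keeps the lower and upper bounds from accidentally being the same instant.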
I hope this helps.

Elasticsearch ngram index returns nothing

I'm attempting to build a custom analyzer using nGram and apparently it's working ok, I just can't query it for some reason.
I'm using elasticsearch-model in Ruby.
Here is how the index is defined:
include Elasticsearch::Model
index_name "stemmed_videos"
settings index: { number_of_shards: 5 },
analysis: {
analyzer: {
video_analyzer: {
tokenizer: :stemmer,
filter: [
"lowercase"
]
},
standard_lowercase: {
tokenizer: :standard,
filter: [
"lowercase"
]
}
},
tokenizer: {
stemmer: {
type: "nGram",
min_gram: 2,
max_gram: 10,
token_chars: [
"letter",
"digit",
"symbol"
]
}
}
} do
mappings do
indexes :title, type: 'string', analyzer: 'video_analyzer'
indexes :description, type: 'string', analyzer: 'standard_lowercase'
end
end
def as_indexed_json(options = {})
as_json(only: [:title, :description])
end
I've attempted to take one of the strings I'm trying to index and run it through "http://localhost:9200/stemmed_videos/_analyze?pretty=1&analyzer=video_analyzer&text=indiana_jones_4-tlr3_h640w.mov" and it's apparently doing the right thing.
But then, the only way I have to make a generic query is by adding wildcards, which is not what I'm expecting.
[8] pry(main)> Video.__elasticsearch__.search('*ind*').results.total
=> 4
[9] pry(main)> Video.__elasticsearch__.search('ind').results.total
=> 0
(4 is the right number of results in my test data).
What I'd love to accomplish is to get the right results without the wildcards because with what I have now I'd need to take the query string and add the wildcards in the code, which honestly is rather bad.
How can I accomplish this?
Thanks in advance.

Spell check Ngram for elastic Search not working with rails

I have used in my model to include spell check such that if the user inputs data like "Rentaal" then it should fetch the correct data as "Rental"
document.rb code
require 'elasticsearch/model'
class Document < ApplicationRecord
include Elasticsearch::Model
include Elasticsearch::Model::Callbacks
belongs_to :user
Document.import force: true
def self.search(query)
__elasticsearch__.search({
query: {
multi_match: {
query: query,
fields: ['name^10', 'service']
}
}
})
end
settings index: {
"number_of_shards": 1,
analysis: {
analyzer: {
edge_ngram_analyzer: { type: "custom", tokenizer: "standard", filter:
["lowercase", "edge_ngram_filter", "stop", "kstem" ] },
}
},
filter: {
edge_ngram_filter: { type: "edgeNGram", min_gram: "3", max_gram:
"20" }
}
} do
mapping do
indexes :name, type: "string", analyzer: "edge_ngram_analyzer"
indexes :service, type: "string", analyzer: "edge_ngram_analyzer"
end
end
end
search controller code:
def search
if params[:query].nil?
@documents = []
else
@documents = Document.search params[:query]
end
end
However, if I enter Rentaal or any misspelled word, it does not display anything.
In my console
@documents.results.to_a
gives an empty array.
What am I doing wrong here? Let me know if more data is required.
Try to add fuzziness in your multi_match query:
{
"query": {
"multi_match": {
"query": "Rentaal",
"fields": ["name^10", "service"],
"fuzziness": "AUTO"
}
}
}
Explanation
The kstem filter reduces words to their root forms, so it does not do what you expect here: it would correctly handle inputs like Renta or Rent, but not the misspelling you provided.
You can check how stemming works with following query:
curl -X POST \
'http://localhost:9200/my_index/_analyze?pretty=true' \
-d '{
"analyzer" : "edge_ngram_analyzer",
"text" : ["rentaal"]
}'
As a result I see:
{
"tokens": [
{
"token": "ren"
},
{
"token": "rent"
},
{
"token": "renta"
},
{
"token": "rentaa"
},
{
"token": "rentaal"
}
]
}
So typical misspellings are handled much better by applying fuzziness.
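In the model from the question, the same idea could be expressed as the hash passed to `__elasticsearch__.search`. This is a sketch only; the helper name `fuzzy_search_body` is hypothetical, and the fields are the ones from the question.

```ruby
# Query body with fuzziness enabled, mirroring the JSON above.
# `fuzzy_search_body` is a hypothetical helper name.
def fuzzy_search_body(query)
  {
    query: {
      multi_match: {
        query: query,
        fields: ["name^10", "service"],
        fuzziness: "AUTO"
      }
    }
  }
end
```

Inside `Document.search`, this would be used as `__elasticsearch__.search(fuzzy_search_body(query))`.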

Elasticsearch : Multi match query on nested fields

I am having a problem with multi-match query in RoR. I have Elastic Search configured and working however I am working on setting up aggregations which so far seem to work, but for whatever reason I am not able to search on the field which I am aggregating. This is the extract from my model:
settings :index => { :number_of_shards => 1 } do
mapping do
indexes :id, index: :not_analyzed
indexes :name
indexes :summary
indexes :description
indexes :occasions, type: 'nested' do
indexes :id, type: 'integer'
indexes :occasion_name, type: 'string', index: :not_analyzed
...
end
end
end
def as_indexed_json(options = {})
self.as_json(only: [:id, :name, :summary, :description],
include: {
occasions: { only: [:id, :occasion_name] },
courses: { only: [:id, :course_name] },
allergens: { only: [:id, :allergen_name] },
cookingtechniques: { only: [:id, :name] },
cuisine: { only: [:id, :cuisine_name]}
})
end
class << self
def custom_search(query)
__elasticsearch__.search(query: multi_match_query(query), aggs: aggregations)
end
def multi_match_query(query)
{
multi_match:
{
query: query,
type: "best_fields",
fields: ["name^9", "summary^8", "cuisine_name^7", "description^6", "occasion_name^6", "course_name^6", "cookingtechniques.name^5"],
operator: "and"
}
}
end
I am able to search on all fields specified in the multi_match_query apart from "occasion_name", which happens to be the field I am aggregating. I have checked that the field is correctly indexed (using the elasticsearch-head plugin). I am also able to display the facets with the aggregated occasion_names in my view. I tried everything I can think of, including removing the aggregation and searching on occasion_name, but still no luck.
(I am using the elasticsearch-rails gem)
Any help will be much appreciated.
Edit:
I got this ES query from rails:
@search=
#<Elasticsearch::Model::Searching::SearchRequest:0x007f91244df460
@definition=
{:index=>"recipes",
:type=>"recipe",
:body=>
{:query=>
{:multi_match=>
{:query=>"Christmas",
:type=>"best_fields",
:fields=>["name^9", "summary^8", "cuisine_name^7", "description^6", "occasion_name^6", "course_name^6", "cookingtechniques.name^5"],
:operator=>"and"}},
:aggs=>
{:occasion_aggregation=>
{:nested=>{:path=>"occasions"}, :aggs=>{:id_and_name=>{:terms=>{:script=>"doc['occasions.id'].value + '|' + doc['occasions.occasion_name'].join(' ')", :size=>35}}}}}}},
This is an example of all that gets indexed for 1 of my dummy recipes I use for testing (the contents are meaningless - I use this only for testing):
{
"_index": "recipes",
"_type": "recipe",
"_id": "7",
"_version": 1,
"_score": 1,
"_source": {
"id": 7,
"name": "Mustard-stuffed chicken",
"summary": "This is so good we'd be surprised if this chicken fillet recipe doesn't become a firm favourite. Save it to your My Good Food collection and enjoy",
"description": "Heat oven to 200C/fan 180C/gas 6. Mix the cheeses and mustard together. Cut a slit into the side of each chicken breast, then stuff with the mustard mixture. Wrap each stuffed chicken breast with 2 bacon rashers – not too tightly, but enough to hold the chicken together. Season, place on a baking sheet and roast for 20-25 mins.",
"occasions": [
{
"id": 9,
"occasion_name": "Christmas"
}
,
{
"id": 7,
"occasion_name": "Halloween"
}
,
{
"id": 8,
"occasion_name": "Bonfire Night"
}
,
{
"id": 10,
"occasion_name": "New Year"
}
],
"courses": [
{
"id": 9,
"course_name": "Side Dish"
}
,
{
"id": 7,
"course_name": "Poultry"
}
,
{
"id": 8,
"course_name": "Salad"
}
,
{
"id": 10,
"course_name": "Soup"
}
],
"allergens": [
{
"id": 6,
"allergen_name": "Soya"
}
,
{
"id": 7,
"allergen_name": "Nut"
}
,
{
"id": 8,
"allergen_name": "Other"
}
,
{
"id": 1,
"allergen_name": "Dairy"
}
],
"cookingtechniques": [
{
"id": 15,
"name": "Browning"
}
],
"cuisine": {
"id": 1,
"cuisine_name": "African"
}
}
}
EDIT 2:
I managed to make the search work for occasions as suggested by @rahulroc, but now I can't search on anything else...
def multi_match_query(query)
{
nested:{
path: 'occasions',
query:{
multi_match:
{
query: query,
type: "best_fields",
fields: ["name^9", "summary^8", "cuisine_name^7", "description^6", "occasion_name^6", "course_name^6", "cookingtechniques.name^5"],
operator: "and"
}
}
}
}
end
UPDATE: Adding multiple nested fields. I am trying to add the rest of my aggregations, but I am facing a similar problem as before. My end goal is to use the aggregations as filters, so I need to add about 4 more nested fields to my query (I would also like those fields to be searchable). Here is the working query as provided by @rahulroc, plus another nested field which I can't search on. As before, indexing works and I can display the aggregations for the newly added field, but I can't search on it. I tried different variations of this query but couldn't make it work (the rest of the fields are still working and searchable; the problem is just the new field):
def multi_match_query(query)
{
bool: {
should: [
{
nested:{
path: 'occasions',
query: {
multi_match:
{
query: query,
type: "best_fields",
fields: ["occasion_name"]
}
}
}
},
{
nested:{
path: 'courses',
query: {
multi_match:
{
query: query,
type: "best_fields",
fields: ["course_name"]
}
}
}
},
{
multi_match: {
query: query,
fields:["name^9", "summary^8", "cuisine_name^7", "description^6"],
}
}
]
}
}
end
You need to create a separate nested clause for matching a nested field
"query": {
"bool": {
"should": [
{
"nested": {
"path": "occasions",
"query": {
"multi_match": {
"query": "Christmas",
"fields": ["occasion_name^2"]
}
}
}
},
{
"multi_match": {
"query": "Christmas",
"fields":["name^9", "summary^8", "cuisine_name^7", "description^6","course_name^6"] }
}
]
}
}
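The same pattern extends to several nested fields: one `nested` clause per path, combined with the top-level `multi_match` in a `bool`/`should`. The following Ruby sketch assumes each path (`occasions`, `courses`) is mapped with `type: 'nested'`, which the `nested` query requires, and it uses full dotted field names, which is an assumption about the Elasticsearch version in use.

```ruby
# Map each nested path to the fields searched inside it.
# Paths and field names come from the question; the dotted names
# (e.g. "occasions.occasion_name") are an assumption.
NESTED_FIELDS = {
  "occasions" => ["occasions.occasion_name"],
  "courses"   => ["courses.course_name"]
}.freeze

TOP_LEVEL_FIELDS = ["name^9", "summary^8", "description^6"].freeze

def multi_match_query(query)
  # One nested clause per path, each with its own multi_match.
  nested_clauses = NESTED_FIELDS.map do |path, fields|
    { nested: { path: path, query: { multi_match: { query: query, fields: fields } } } }
  end
  top_level = { multi_match: { query: query, fields: TOP_LEVEL_FIELDS } }
  { bool: { should: nested_clauses + [top_level] } }
end
```

Adding another nested field then only means adding an entry to `NESTED_FIELDS`, provided the mapping declares that path as nested.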

Why does this elasticsearch/tire code not match partial words?

I'm trying to use Elasticsearch and Tire to index some data. I want to be able to search it on partial matches, not just full words. When running a query on the example model below, it will only match words in the "notes" field that are full word matches. I can't figure out why.
class Thingy
include Tire::Model::Search
include Tire::Model::Callbacks
# has some attributes
tire do
settings analysis: {
filter: {
ngram_filter: {
type: 'nGram',
min_gram: 2,
max_gram: 12
}
},
analyzer: {
index_ngram_analyzer: {
type: 'custom',
tokenizer: 'standard',
filter: ['lowercase']
},
search_ngram_analyzer: {
type: 'custom',
tokenizer: 'standard',
filter: ['lowercase', 'ngram_filter']
}
}
} do
mapping do
indexes :notes, :type => "string", boost: 10, index_analyzer: "index_ngram_analyzer", search_analyzer: "search_ngram_analyzer"
end
end
end
def to_indexed_json
{
id: self.id,
account_id: self.account_id,
created_at: self.created_at,
test: self.test,
notes: some_method_that_returns_string
}.to_json
end
end
The query looks like this:
@things = Thing.search page: params[:page], per_page: 50 do
query {
boolean {
must { string "account_id:#{account_id}" }
must_not { string "test:true" }
must { string "#{query}" }
}
}
sort {
by :id, 'desc'
}
size 50
highlight notes: {number_of_fragments: 0}, options: {tag: '<span class="match">'}
end
I've also tried this but it never returns results (and ideally I'd like the search to apply to all fields, not just notes):
must { match :notes, "#{query}" } # tried with `type: :phrase` as well
What am I doing wrong?
You almost got there! :) The problem is that you've swapped the role of index_analyzer and search_analyzer, in fact.
Let me explain briefly how it works:
You want to break document words into these ngram "chunks" during indexing, so when you are indexing a word like Martian, it gets broken into: ['ma', 'mar', 'mart', ..., 'ar', 'art', 'arti', ...]. You can try it with the Analyze API: http://localhost:9200/thingies/_analyze?text=Martian&analyzer=index_ngram_analyzer.
When people are searching, they are already using these partial ngrams, so to speak, since they search for "mar" or "mart" etc. So you don't break their phrases further with the ngram tokenizer.
That's why you (correctly) separate index_analyzer and search_analyzer in your mapping, so Elasticsearch knows how to analyze the notes attribute during indexing, and how to analyse any search phrase against this attribute.
In other words, do this:
analyzer: {
index_ngram_analyzer: {
type: 'custom',
tokenizer: 'standard',
filter: ['lowercase', 'ngram_filter']
},
search_ngram_analyzer: {
type: 'custom',
tokenizer: 'standard',
filter: ['lowercase']
}
}
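To see why the direction matters, here is a rough pure-Ruby imitation of the two analyzers. It is not Elasticsearch's implementation, only an illustration; the helper names `ngrams`, `index_tokens` and `search_tokens` are made up for this sketch.

```ruby
# All substrings of `word` with length between min and max (like an nGram filter).
def ngrams(word, min: 2, max: 12)
  word = word.downcase
  grams = []
  (0...word.length).each do |start|
    (min..max).each do |len|
      break if start + len > word.length # stop at the end of the word
      grams << word[start, len]
    end
  end
  grams.uniq
end

# Index side: split into words, lowercase, then n-gram every word.
def index_tokens(text)
  text.downcase.scan(/\w+/).flat_map { |word| ngrams(word) }.uniq
end

# Search side: split into words and lowercase only.
def search_tokens(text)
  text.downcase.scan(/\w+/)
end

indexed = index_tokens("Martial Partial Martian")
correct = search_tokens("art").all? { |t| indexed.include?(t) }

# Swapped roles: whole words in the index, n-grams at search time.
swapped = ngrams("art").any? { |t| search_tokens("Martial Partial Martian").include?(t) }

puts "ngrams at index time match: #{correct}"
puts "ngrams at search time match: #{swapped}"
```

With n-grams produced at index time, the term "art" is one of the stored tokens. With the roles swapped, the index only holds whole words, so none of the fragments of "art" can match.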
Full, working Ruby code is below. Also, I highly recommend migrating to the new elasticsearch-model Rubygem, which contains all the important features of Tire and is actively developed.
require 'tire'
Tire.index('thingies').delete
class Thingy
include Tire::Model::Persistence
tire do
settings analysis: {
filter: {
ngram_filter: {
type: 'nGram',
min_gram: 2,
max_gram: 12
}
},
analyzer: {
index_ngram_analyzer: {
type: 'custom',
tokenizer: 'standard',
filter: ['lowercase', 'ngram_filter']
},
search_ngram_analyzer: {
type: 'custom',
tokenizer: 'standard',
filter: ['lowercase']
}
}
} do
mapping do
indexes :notes, type: "string", index_analyzer: "index_ngram_analyzer", search_analyzer: "search_ngram_analyzer"
end
end
end
property :notes
end
Thingy.create id: 1, notes: 'Martial Partial Martian'
Thingy.create id: 2, notes: 'Venetian Completion Heresion'
Thingy.index.refresh
# Find 'art' in 'martial'
#
# Equivalent to: http://localhost:9200/thingies/_search?q=notes:art
#
results = Thingy.search do
query do
match :notes, 'art'
end
end
p results.map(&:notes)
# Find 'net' in 'venetian'
#
# Equivalent to: http://localhost:9200/thingies/_search?q=notes:net
#
results = Thingy.search do
query do
match :notes, 'net'
end
end
p results.map(&:notes)
The problem for me was that I was using the string query instead of the match query. The search should have been written like this:
@things = Thing.search page: params[:page], per_page: 50 do
query {
match [:prop_1, :prop_2, :notes], query
}
sort {
by :id, 'desc'
}
filter :term, account_id: account_id
filter :term, test: false
size 50
highlight notes: {number_of_fragments: 0}, options: {tag: '<span class="match">'}
end
