Custom Analyzer with elasticsearch-rails (Ruby on Rails)

I'm using elasticsearch-rails gem in my Rails app to simplify integration with Elasticsearch. I'm trying to use the phonetic analysis plugin, so I need to define a custom analyzer and a custom filter for my index.
I tried the following code to perform the custom analysis with a soundex phonetic filter, but it fails with an exception:
[!!!] Error when creating the index: Elasticsearch::Transport::Transport::Errors::BadRequest
[400] {"error":"MapperParsingException[mapping [call_sentence]]; nested: MapperParsingException[Analyzer [{tokenizer=standard, filter=[standard, lowercase, metaphoner]}] not found for field [phonetic]]; ","status":400}
# Set up index configuration and mapping
#
settings index: { number_of_shards: 1, number_of_replicas: 0 } do
  mapping do
    indexes :text, type: 'multi_field' do
      indexes :processed, analyzer: 'snowball'
      indexes :phone, {
        analyzer: {
          tokenizer: "standard",
          filter: ["standard", "lowercase", "metaphoner"]
        },
        filter: {
          metaphoner: {
            type: "phonetic",
            encoder: "soundex",
            replace: false
          }
        }
      }
      indexes :raw, analyzer: 'keyword'
    end
  end
end

The analyzer has to be defined in the index settings rather than inline in the mapping, so specify it in the settings call:
settings index: {
  number_of_shards: 1,
  number_of_replicas: 0,
  analysis: {
    filter: {
      metaphoner: {
        type: 'phonetic',
        encoder: 'doublemetaphone',
        replace: true
      }
    },
    analyzer: {
      phonetic_analyzer: {
        tokenizer: 'standard',
        filter: ["standard", "lowercase", "metaphoner"]
      }
    }
  }
} do
  mapping do
    indexes :text, type: 'multi_field' do
      indexes :processed, analyzer: 'snowball'
      indexes :phone, analyzer: 'phonetic_analyzer'
      indexes :raw, analyzer: 'keyword'
    end
  end
end

Alright, I modified the elasticsearch.yml config to include the phonetic analyzer:
#################################### Index ####################################
index:
  analysis:
    analyzer:
      phonetic_analyzer:
        tokenizer: standard
        filter: [metaphoner]
    filter:
      metaphoner:
        type: phonetic
        encoder: doublemetaphone
        replace: true

Shingle token Filter error "illegal_argument_exception"

I am using Elasticsearch with Ruby on Rails. I used
Elastic::Model::JobPosting.recreate_index! for indexing, and I get the error:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"In Shingle TokenFilter the difference between max_shingle_size and min_shingle_size (and +1 if outputting unigrams) must be less than or equal to: [3] but was [4].
But on further checking the shingle values I found that:
min_shingle_size: 2,
max_shingle_size: 5,
So the difference between them is 3, but why is it reported as 4? I can't seem to understand the reason for this.
My code for modules/elastic/utils is:
module ::Elastic::Utils
  def self.suggestion_field
    {
      analyzer: :name_analyzer,
      type: "text",
      fields: {
        :raw => { type: "keyword", index: true }
      }
    }
  end

  def self.optimized_suggestion_field
    {
      analyzer: :suggest_analyzer,
      type: "text"
    }
  end

  def self.optimized_settings
    {
      index: {
        number_of_shards: 2,
        number_of_replicas: 1
      },
      analysis: {
        analyzer: {
          suggest_analyzer: {
            tokenizer: 'keyword',
            type: 'custom',
            filter: %w(suggest_filter)
          }
        },
        filter: {
          suggest_filter: {
            type: 'edgeNGram',
            max_gram: 40,
            min_gram: 1
          }
        }
      }
    }
  end

  def self.common_settings
    {
      index: {
        number_of_shards: 2,
        number_of_replicas: 1
      },
      analysis: {
        analyzer: {
          name_analyzer: {
            tokenizer: 'whitespace',
            type: 'custom',
            filter: %w(lowercase multi_words name_filter)
          },
          lower_keyword: {
            tokenizer: 'keyword',
            type: 'custom',
            filter: ['lowercase']
          }
        },
        filter: {
          multi_words: {
            type: 'shingle',
            min_shingle_size: 2,
            max_shingle_size: 5
          },
          name_filter: {
            type: 'edgeNGram',
            max_gram: 40,
            min_gram: 1
          }
        }
      }
    }
  end

  def self.normalized_terms(name)
    terms = []
    terms = name.split(/\W+/) if name.present?
    normalized_terms = []
    norm_term = ''
    terms.reverse.each do |term|
      norm_term = "#{term}#{norm_term}"
      normalized_terms << norm_term
    end
    normalized_terms.reverse
  end

  def self.normalize(name)
    return nil if name.blank?
    name.downcase.gsub(/[^[:alnum:]]/, ' ').
      gsub(' and ', ' ').gsub(' of ', ' ').gsub(' in ', ' ').
      gsub('engineering', 'engg').gsub('technology', 'tech').
      gsub('bachelor ', 'b ').gsub('master ', 'm ').
      squish
  end
end
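The error message itself hints at the likely cause: the shingle filter outputs unigrams by default, and the check adds 1 for that, so the effective difference here is (5 - 2) + 1 = 4, over the default limit of 3. A minimal sketch of that arithmetic (effective_shingle_diff is an illustrative helper, not an Elasticsearch API):

```ruby
# Sketch of the check described by the error message: the allowed difference
# defaults to 3, and outputting unigrams (the shingle filter's default) adds 1.
def effective_shingle_diff(min_shingle_size, max_shingle_size, output_unigrams: true)
  max_shingle_size - min_shingle_size + (output_unigrams ? 1 : 0)
end

effective_shingle_diff(2, 5)                          # => 4  (rejected: > 3)
effective_shingle_diff(2, 4)                          # => 3  (accepted)
effective_shingle_diff(2, 5, output_unigrams: false)  # => 3  (accepted)
```

Under that reading, the fix would be output_unigrams: false on the multi_words filter, a smaller max_shingle_size, or raising the index.max_shingle_diff index setting.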

Elasticsearch ngram index returns nothing

I'm attempting to build a custom analyzer using nGram and apparently it's working ok, I just can't query it for some reason.
I'm using elasticsearch-model in Ruby.
Here is how the index is defined:
include Elasticsearch::Model

index_name "stemmed_videos"

settings index: { number_of_shards: 5 },
         analysis: {
           analyzer: {
             video_analyzer: {
               tokenizer: :stemmer,
               filter: ["lowercase"]
             },
             standard_lowercase: {
               tokenizer: :standard,
               filter: ["lowercase"]
             }
           },
           tokenizer: {
             stemmer: {
               type: "nGram",
               min_gram: 2,
               max_gram: 10,
               token_chars: ["letter", "digit", "symbol"]
             }
           }
         } do
  mappings do
    indexes :title, type: 'string', analyzer: 'video_analyzer'
    indexes :description, type: 'string', analyzer: 'standard_lowercase'
  end
end

def as_indexed_json(options = {})
  as_json(only: [:title, :description])
end
I've attempted to take one of the strings I'm trying to index and run it through "http://localhost:9200/stemmed_videos/_analyze?pretty=1&analyzer=video_analyzer&text=indiana_jones_4-tlr3_h640w.mov" and it's apparently doing the right thing.
But then, the only way I have to make a generic query is by adding wildcards, which is not what I'm expecting.
[8] pry(main)> Video.__elasticsearch__.search('*ind*').results.total
=> 4
[9] pry(main)> Video.__elasticsearch__.search('ind').results.total
=> 0
(4 is the right number of results in my test data).
What I'd love to accomplish is to get the right results without the wildcards because with what I have now I'd need to take the query string and add the wildcards in the code, which honestly is rather bad.
How can I accomplish this?
Thanks in advance.
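Instead of wrapping the query string in wildcards, the ngram-analyzed field can be queried directly with a match query, so the search term itself is matched against the stored ngram tokens. A sketch against the index above (build_match_query is an illustrative helper, not part of elasticsearch-model):

```ruby
# Build a match query for a single field; the search term is matched against
# the ngram tokens stored for that field, so no wildcards are needed.
def build_match_query(field, text)
  { query: { match: { field => text } } }
end

definition = build_match_query(:title, "ind")
# definition == { query: { match: { title: "ind" } } }

# Video.__elasticsearch__.search(definition).results.total
```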

Elasticsearch "Did you mean" auto-correction of misspelled words not working with Rails

I am trying to implement a full-text search engine for a document class in my Rails app using Elasticsearch. It should auto-correct misspelled words.
This is my document.rb
require 'elasticsearch/model'

class Document < ApplicationRecord
  include Elasticsearch::Model
  include Elasticsearch::Model::Callbacks

  belongs_to :user

  Document.import force: true

  def self.search(query)
    __elasticsearch__.search(
      {
        query: {
          multi_match: {
            query: query,
            fields: ['name^10', 'service']
          }
        }
      }
    )
  end

  settings index: {
    number_of_shards: 1,
    analysis: {
      analyzer: {
        string_lowercase: { tokenizer: 'keyword', filter: %w(lowercase ascii_folding) },
        did_you_mean: { filter: ['lowercase'], char_filter: ['html_strip'],
                        type: 'custom', tokenizer: 'standard' },
        autocomplete: { filter: ["lowercase", "autocompleteFilter"],
                        char_filter: ["html_strip"], type: "custom", tokenizer: "standard" },
        default: { filter: ["lowercase", "stopwords", "stemmer"],
                   char_filter: ["html_strip"], type: "custom", tokenizer: "standard" }
      },
      filter: {
        ascii_folding: { type: 'asciifolding', preserve_original: true },
        stemmer: { type: 'stemmer', language: 'english' },
        autocompleteFilter: { max_shingle_size: 5, min_shingle_size: 2, type: 'shingle' },
        stopwords: { type: 'stop', stopwords: ['_english_'] }
      }
    }
  } do
    mapping do
      indexes :autocomplete, type: "string", analyzer: "autocomplete"
      indexes :name, type: "string", copy_to: ["did_you_mean", "autocomplete"]
      indexes :did_you_mean, type: "string", analyzer: "did_you_mean"
      indexes :service, type: "string", copy_to: ["autocomplete", "did_you_mean"]
    end
  end
end
It lets me search the data; however, the "did you mean" phrase is not working.
What can I do to improve this code? I am using Elasticsearch for the very first time.
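"Did you mean" behaviour in Elasticsearch is usually built on the phrase suggester rather than on an analyzer alone. A hedged sketch of a suggest request body against the did_you_mean field from the mapping above (did_you_mean_body is an illustrative helper):

```ruby
# Build a phrase-suggester request body for a "did you mean" lookup against
# the did_you_mean field defined in the mapping.
def did_you_mean_body(text)
  {
    suggest: {
      didyoumean: {
        text: text,
        phrase: {
          field: "did_you_mean",
          size: 1
        }
      }
    }
  }
end

# Document.__elasticsearch__.search(did_you_mean_body("restarant"))
```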

elasticsearch 5.X + searchkick(rails) configuration

I'm in the process of upgrading my Elasticsearch instance from V1.7 to 5.3 and I'm running into some errors when starting my reindex. From what I can tell, most models index just fine, but a couple that used more advanced settings don't seem to be working. Here's an example of one of my models (brand.rb):
searchkick word_start: [:name],
           merge_mappings: true,
           mappings: searchkick_mappings,
           settings: searchkick_settings

def search_data
  attributes.merge(
    geography: self.geography ? self.geography.name : "",
    geography_breadcrumb: self.geography ? self.geography.breadcrumb : "",
    producer_name: self.producer.name
  )
end
the searchkick_mappings and searchkick_settings are defined in another file that is included in my model. Here's the code:
def searchkick_mappings
  {
    brand: {
      properties: {
        name: {
          type: 'text',
          analyzer: 'standard',
          fields: {
            autocomplete: {
              type: 'text',
              analyzer: 'autocomplete'
            },
            folded: {
              type: 'text',
              analyzer: 'folded'
            },
            delimited: {
              type: 'text',
              analyzer: 'delimited'
            }
          }
        }
      }
    }
  }
end

def searchkick_settings
  {
    analysis: {
      filter: {
        autocomplete_filter: {
          type: 'edge_ngram',
          min_gram: 1,
          max_gram: 20
        },
        delimiter_filter: {
          type: 'word_delimiter',
          preserve_original: true
        }
      },
      analyzer: {
        folded: {
          type: 'custom',
          tokenizer: 'standard',
          filter: ['standard', 'lowercase', 'asciifolding']
        },
        delimited: {
          type: 'custom',
          tokenizer: 'whitespace',
          filter: ['lowercase', 'delimiter_filter']
        },
        autocomplete: {
          type: 'custom',
          tokenizer: 'standard',
          filter: ['standard', 'lowercase', 'asciifolding', 'autocomplete_filter']
        }
      }
    }
  }
end
The only change I made from when it was working in V1.7 -> 5.3 is that I had to change the 'type' field from "string" to "text", since they removed the string type in favor of text and keyword types, where text is analyzed and keyword is not. The error I'm receiving when I run bundle exec searchkick:reindex:all says there is an unknown parameter 'ignore_above'. From reading the documentation, that parameter only applies to keyword fields, not text, but I am not adding it in my custom mappings, so I don't see why it would be there.
Let me know if you need to see more code/need to know more. I'll gladly edit OP/comment whatever is helpful.
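One way to track down where ignore_above comes from is to fetch the mapping Searchkick actually created (for example with the elasticsearch-ruby client's indices.get_mapping) and scan it for the offending key. The find_key helper below is a hypothetical utility, not part of Searchkick:

```ruby
# Recursively collect every dotted path in a nested mapping hash whose key
# matches the one being hunted for.
def find_key(hash, key, path = [])
  hash.flat_map do |k, v|
    current = path + [k.to_s]
    hits = k.to_s == key.to_s ? [current.join('.')] : []
    hits + (v.is_a?(Hash) ? find_key(v, key, current) : [])
  end
end

# A stand-in for the real mapping; in practice fetch it with something like:
#   Elasticsearch::Client.new.indices.get_mapping(index: Brand.searchkick_index.name)
mapping = {
  'brand' => {
    'properties' => {
      'name' => { 'type' => 'keyword', 'ignore_above' => 256 }
    }
  }
}

find_key(mapping, 'ignore_above')  # => ["brand.properties.name.ignore_above"]
```

The paths it returns show exactly which fields carry the parameter, which narrows down whether it came from merge_mappings merging Searchkick's defaults into the custom mapping.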

Why does this elasticsearch/tire code not match partial words?

I'm trying to use Elasticsearch and Tire to index some data. I want to be able to search it on partial matches, not just full words. When running a query on the example model below, it will only match words in the "notes" field that are full word matches. I can't figure out why.
class Thingy
  include Tire::Model::Search
  include Tire::Model::Callbacks

  # has some attributes

  tire do
    settings analysis: {
      filter: {
        ngram_filter: {
          type: 'nGram',
          min_gram: 2,
          max_gram: 12
        }
      },
      analyzer: {
        index_ngram_analyzer: {
          type: 'custom',
          tokenizer: 'standard',
          filter: ['lowercase']
        },
        search_ngram_analyzer: {
          type: 'custom',
          tokenizer: 'standard',
          filter: ['lowercase', 'ngram_filter']
        }
      }
    } do
      mapping do
        indexes :notes, type: "string", boost: 10,
                index_analyzer: "index_ngram_analyzer",
                search_analyzer: "search_ngram_analyzer"
      end
    end
  end

  def to_indexed_json
    {
      id: self.id,
      account_id: self.account_id,
      created_at: self.created_at,
      test: self.test,
      notes: some_method_that_returns_string
    }.to_json
  end
end
The query looks like this:
@things = Thing.search page: params[:page], per_page: 50 do
  query {
    boolean {
      must { string "account_id:#{account_id}" }
      must_not { string "test:true" }
      must { string "#{query}" }
    }
  }
  sort {
    by :id, 'desc'
  }
  size 50
  highlight notes: { number_of_fragments: 0 }, options: { tag: '<span class="match">' }
end
I've also tried this but it never returns results (and ideally I'd like the search to apply to all fields, not just notes):
must { match :notes, "#{query}" } # tried with `type: :phrase` as well
What am I doing wrong?
You almost got there! :) The problem is that you've swapped the roles of index_analyzer and search_analyzer.
Let me explain briefly how it works:
You want to break document words into these ngram "chunks" during indexing, so when you index a word like Martian, it gets broken into: ['ma', 'mar', 'mart', ..., 'ar', 'art', 'arti', ...]. You can try it with the Analyze API: http://localhost:9200/thingies/_analyze?text=Martian&analyzer=index_ngram_analyzer.
When people are searching, they are already using these partial ngrams, so to speak, since they search for "mar" or "mart" etc. So you don't break their phrases further with the ngram tokenizer.
That's why you (correctly) separate index_analyzer and search_analyzer in your mapping, so Elasticsearch knows how to analyze the notes attribute during indexing, and how to analyse any search phrase against this attribute.
In other words, do this:
analyzer: {
  index_ngram_analyzer: {
    type: 'custom',
    tokenizer: 'standard',
    filter: ['lowercase', 'ngram_filter']
  },
  search_ngram_analyzer: {
    type: 'custom',
    tokenizer: 'standard',
    filter: ['lowercase']
  }
}
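A plain-Ruby sketch can make the idea concrete (the ngrams helper below only illustrates what the filter produces; it is not how Elasticsearch implements it):

```ruby
# Emit all lowercase ngrams of a term between min_gram and max_gram characters,
# roughly what the ngram_filter above produces at index time.
def ngrams(term, min_gram, max_gram)
  term = term.downcase
  (min_gram..max_gram).flat_map do |n|
    (0..term.length - n).map { |i| term[i, n] }
  end
end

tokens = ngrams('Martian', 2, 12)
tokens.include?('mar')  # => true: the plain search term "mar" now matches
tokens.include?('art')  # => true
```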
Full, working Ruby code is below. Also, I highly recommend migrating to the new elasticsearch-model Rubygem, which contains all the important features of Tire and is actively developed.
require 'tire'

Tire.index('thingies').delete

class Thingy
  include Tire::Model::Persistence

  tire do
    settings analysis: {
      filter: {
        ngram_filter: {
          type: 'nGram',
          min_gram: 2,
          max_gram: 12
        }
      },
      analyzer: {
        index_ngram_analyzer: {
          type: 'custom',
          tokenizer: 'standard',
          filter: ['lowercase', 'ngram_filter']
        },
        search_ngram_analyzer: {
          type: 'custom',
          tokenizer: 'standard',
          filter: ['lowercase']
        }
      }
    } do
      mapping do
        indexes :notes, type: "string",
                index_analyzer: "index_ngram_analyzer",
                search_analyzer: "search_ngram_analyzer"
      end
    end
  end

  property :notes
end

Thingy.create id: 1, notes: 'Martial Partial Martian'
Thingy.create id: 2, notes: 'Venetian Completion Heresion'

Thingy.index.refresh

# Find 'art' in 'martial'
#
# Equivalent to: http://localhost:9200/thingies/_search?q=notes:art
#
results = Thingy.search do
  query do
    match :notes, 'art'
  end
end

p results.map(&:notes)

# Find 'net' in 'venetian'
#
# Equivalent to: http://localhost:9200/thingies/_search?q=notes:net
#
results = Thingy.search do
  query do
    match :notes, 'net'
  end
end

p results.map(&:notes)
The problem for me was that I was using the string query instead of the match query. The search should have been written like this:
@things = Thing.search page: params[:page], per_page: 50 do
  query {
    match [:prop_1, :prop_2, :notes], query
  }
  sort {
    by :id, 'desc'
  }
  filter :term, account_id: account_id
  filter :term, test: false
  size 50
  highlight notes: { number_of_fragments: 0 }, options: { tag: '<span class="match">' }
end
