Elasticsearch - aggregation filter for product options - ruby-on-rails

I have a product catalogue where every product is indexed as follows (sample queried from http://localhost:9200/products/_doc/1):
{
"_index": "products_20201202145032789",
"_type": "_doc",
"_id": "1",
"_version": 1,
"_seq_no": 0,
"_primary_term": 1,
"found": true,
"_source": {
"title": "Roncato Eglo",
"description": "Amazing LED light made of wood and description continues.",
"price": 3990,
"manufacturer": "Eglo",
"category": [
"Lights",
"Indoor lights"
],
"options": [
{
"title": "Mount type",
"value": "E27"
},
{
"title": "Number of bulbs",
"value": "4"
},
{
"title": "Batteries included",
"value": "true"
},
{
"title": "Light temperature",
"value": "warm"
},
{
"title": "Material",
"value": "wood"
},
{
"title": "Voltage",
"value": "230"
}
]
}
}
Each option can take many different values, so there are many Mount type values, Light temperature values, Material values, and so on.
How can I create an aggregation (filter) that lets customers choose between various Mount type options:
[ ] E27
[X] E14
[X] GU10
...
Or let them choose from different Material options displayed as checkboxes:
[X] Wood
[ ] Metal
[ ] Glass
...
I can handle it on the frontend once the buckets are created. Creating the separate buckets for these options is what I am struggling with.
I have successfully created, displayed, and used aggregations for Category, Manufacturer, and other basic fields. These product options are stored in has_many :through relationships in the database. I am using Rails + the searchkick gem, which allows me to send raw queries to Elasticsearch.

A prerequisite for this aggregation is that the options field is mapped as nested.
Sample index mapping:
PUT test
{
"mappings": {
"properties": {
"title": {
"type": "keyword"
},
"options": {
"type": "nested",
"properties": {
"title": {
"type": "keyword"
},
"value": {
"type": "keyword"
}
}
}
}
}
}
Sample docs:
PUT test/_doc/1
{
"title": "Roncato Eglo",
"options": [
{
"title": "Mount type",
"value": "E27"
},
{
"title": "Material",
"value": "wood"
}
]
}
PUT test/_doc/2
{
"title": "Eglo",
"options": [
{
"title": "Mount type",
"value": "E27"
},
{
"title": "Material",
"value": "metal"
}
]
}
Assumption: within a given document, each option title appears only once. For example, there can be only one nested document under options with the title Material.
Query for aggregation:
GET test/_search
{
"size": 0,
"aggs": {
"OPTION": {
"nested": {
"path": "options"
},
"aggs": {
"TITLE": {
"terms": {
"field": "options.title",
"size": 10
},
"aggs": {
"VALUES": {
"terms": {
"field": "options.value",
"size": 10
}
}
}
}
}
}
}
}
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"OPTION" : {
"doc_count" : 4,
"TITLE" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Material",
"doc_count" : 2,
"VALUES" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "metal",
"doc_count" : 1
},
{
"key" : "wood",
"doc_count" : 1
}
]
}
},
{
"key" : "Mount type",
"doc_count" : 2,
"VALUES" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "E27",
"doc_count" : 2
}
]
}
}
]
}
}
}
}
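Since the question mentions Rails and searchkick, here is a minimal Ruby sketch of the same idea: one helper builds the nested aggregation body as a plain hash, and another turns the "aggregations" part of the response into a structure that is easy to render as checkboxes. The helper names (option_aggs, option_facets) and the trimmed sample response are assumptions made for illustration; they are not part of searchkick or of the original answer.

```ruby
require "json"

# Build the nested terms aggregation shown above as a plain Ruby hash.
# `path` and `size` are illustrative parameters, not a fixed API.
def option_aggs(path: "options", size: 10)
  {
    size: 0,
    aggs: {
      OPTION: {
        nested: { path: path },
        aggs: {
          TITLE: {
            terms: { field: "#{path}.title", size: size },
            aggs: {
              VALUES: { terms: { field: "#{path}.value", size: size } }
            }
          }
        }
      }
    }
  }
end

# Convert the "aggregations" object of a response (string keys, as parsed
# from JSON) into { option title => { option value => doc_count } }.
def option_facets(aggregations)
  buckets = aggregations.dig("OPTION", "TITLE", "buckets") || []
  buckets.each_with_object({}) do |title_bucket, facets|
    values = title_bucket.dig("VALUES", "buckets") || []
    facets[title_bucket["key"]] = values.to_h { |b| [b["key"], b["doc_count"]] }
  end
end

# Trimmed copy of the aggregations from the sample response above.
sample = JSON.parse(<<~RESPONSE)
  {
    "OPTION": {
      "doc_count": 4,
      "TITLE": {
        "buckets": [
          { "key": "Material", "doc_count": 2,
            "VALUES": { "buckets": [
              { "key": "metal", "doc_count": 1 },
              { "key": "wood", "doc_count": 1 } ] } },
          { "key": "Mount type", "doc_count": 2,
            "VALUES": { "buckets": [
              { "key": "E27", "doc_count": 2 } ] } }
        ]
      }
    }
  }
RESPONSE

body = option_aggs
facets = option_facets(sample)

# Print one checkbox line per option value.
facets.each do |title, values|
  puts title
  values.each { |value, count| puts "  [ ] #{value} (#{count})" }
end
```

With searchkick, a hash like body could presumably be sent through its raw body option (for example Product.search(body: body), where Product is a hypothetical model name); check the gem's documentation for how aggregations are exposed on the result object.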

Related

Problem with Elasticsearch querying nested documents

I am learning ES and I am having problems with this query:
Given 2 products:
products/_source/1
{
"product_id": "58410-2",
"name": [
{
"locale": "en",
"translation": "CBC panel"
},
{
"locale": "vn",
"translation": "CBC panel VN"
}
],
"status": "active",
"category": {
"id": 8,
"name": [
{
"locale": "en",
"translation": "Hematology"
},
{
"locale": "vn",
"translation": "huyết học"
}
]
},
"children": [
{
"product_id": "6690-2",
"name": [
{
"locale": "en",
"translation": "Leukocytes"
},
{
"locale": "vn",
"translation": "Leukocytes vn"
}
],
"status": "active",
"category": {
"id": 8,
"name": [
{
"locale": "en",
"translation": "Hematology"
},
{
"locale": "vn",
"translation": "huyết học"
}
]
},
"children": []
}]}
and
products/_source/2
{
"product_id": "6690-2",
"name": [
{
"locale": "en",
"translation": "Leukocytes"
},
{
"locale": "vn",
"translation": "Leukocytes vn"
}
],
"status": "active",
"category": {
"id": 8,
"name": [
{
"locale": "en",
"translation": "Hematology"
},
{
"locale": "vn",
"translation": "huyết học"
}
]
},
"children": []
}
Here a product is a single document but can also be nested in the children array of other products. Both products are separate documents in the index.
And this is the index:
{
"products": {
"aliases": {},
"mappings": {
"dynamic": "false",
"properties": {
"category": {
"properties": {
"name": {
"properties": {
"locale": {
"type": "keyword"
},
"translation": {
"type": "text"
}
}
}
}
},
"children": {
"type": "nested"
},
"name": {
"properties": {
"locale": {
"type": "keyword"
},
"translation": {
"type": "text"
}
}
},
"product_id": {
"type": "keyword"
},
"status": {
"type": "keyword"
}
}
},
"settings": {
"index": {
"routing": {
"allocation": {
"include": {
"_tier_preference": "data_content"
}
}
},
"number_of_shards": "3",
"provided_name": "products",
"number_of_replicas": "1"
}
}
}
}
I want to be able to search for "Leuko" (or the category, or the product_id) and retrieve both products: the standalone product and the root product that contains it as a child.
I have tried object, nested, and flattened field types, but I think the real problem is that I don't know how to write the query properly. I have tried things like this (I am using a Ruby library, but I think it is easy to follow):
@query = {
query: {
query_string: {
fields: ['name.translation', 'children.name.translation', 'category.name.translation', 'children.product_id'],
query: "*#{text}*"
}
},
size: 50
}
@query = {
query: {
nested: {
path: 'children',
query: {
bool: {
should: [
term: { 'children.name.translation' => "*#{text}*" },
term: { 'name.translation' => "*#{text}*" }
]
}
}
}
}
}
but I think at some point I stopped knowing what I was doing, and I am just randomly trying different things from the documentation.
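Here is my suggested query. Note that I had to add the fields of the nested children object to the mapping: with "dynamic": "false", fields that are not declared in the mapping are not indexed and therefore cannot be searched.
Mapping: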
{
"mappings": {
"dynamic": "false",
"properties": {
"category": {
"properties": {
"name": {
"properties": {
"locale": {
"type": "keyword"
},
"translation": {
"type": "text"
}
}
}
}
},
"children": {
"type": "nested",
"properties": {
"product_id": {
"type": "keyword"
},
"category": {
"properties": {
"name": {
"properties": {
"locale": {
"type": "keyword"
},
"translation": {
"type": "text"
}
}
}
}
},
"name": {
"properties": {
"locale": {
"type": "keyword"
},
"translation": {
"type": "text"
}
}
},
"status": {
"type": "keyword"
}
}
},
"name": {
"properties": {
"locale": {
"type": "keyword"
},
"translation": {
"type": "text"
}
}
},
"product_id": {
"type": "keyword"
},
"status": {
"type": "keyword"
}
}
}
}
Query:
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{
"nested": {
"path": "children",
"query": {
"wildcard": {
"children.name.translation": "leuko*"
}
}
}
},
{
"wildcard": {
"name.translation": "leuko*"
}
}
]
}
}
}
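Because the question builds its queries as Ruby hashes, the query above can be sketched in the same style. The helper name build_product_query and its parameters are hypothetical. Two details are worth noting: each clause in should is its own hash (a single Ruby hash literal with two identical keys keeps only the last one), and the search text is lowercased because wildcard is a term-level query that is not analyzed, while the text fields here are indexed with the default standard analyzer, which lowercases tokens.

```ruby
# Build a bool/should query that matches either the parent product's name
# or the name of any nested child. Names and defaults are illustrative.
def build_product_query(text, size: 50)
  # wildcard patterns are not analyzed, so lowercase the input ourselves
  pattern = "#{text.downcase}*"
  {
    query: {
      bool: {
        minimum_should_match: 1,
        should: [
          # Clause 1: search inside the nested children documents.
          {
            nested: {
              path: "children",
              query: { wildcard: { "children.name.translation" => pattern } }
            }
          },
          # Clause 2: search the top-level product name.
          { wildcard: { "name.translation" => pattern } }
        ]
      }
    },
    size: size
  }
end

query = build_product_query("Leuko")
puts query.inspect
```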
Hint:
I see that you store translations as an array of locale/translation objects. Avoiding the array makes your queries simpler.
What I would do in your case is create a field for each language. That lets you use a different analyzer per language, and you work with a plain object instead of an array.
PUT test
{
"mappings": {
"properties": {
"name":{
"type": "text",
"fields": {
"en":{
"type": "text",
"analyzer":"english"
},
"vn":{
"type": "text"
}
}
}
}
}
}
POST test/_doc/
{
"name": "Leukocytes"
}
An example query using the language subfields:
GET test/_search
{
"query": {
"multi_match": {
"query": "Leukocytes",
"fields": ["name.en", "name.vn"]
}
}
}

Elasticsearch running in a Docker container does not return nearest-neighbour results as expected for an ANN query

I am following the approximate nearest neighbour (ANN) task found at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#approximate-knn
I am running Elasticsearch in insecure mode via Docker Compose, but the results of the ANN query are never close to the provided vector and seem to come back in the same order in which the vectors were bulk-inserted. The exact match never appears first, as I would reasonably expect.
I spin up ES using the following Docker Compose file:
version: "3.7"
services:
es:
container_name: ann_es
image: docker.elastic.co/elasticsearch/elasticsearch:8.6.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
ports:
- "9200:9200"
Then I create an index with a dense_vector field:
PUT https://localhost:9200/image-index
{
"mappings": {
"properties": {
"image-vector": {
"type": "dense_vector",
"dims": 3,
"index": true,
"similarity": "l2_norm"
},
"title": {
"type": "text"
},
"file-type": {
"type": "keyword"
}
}
}
}
I've noticed that when I GET the index, the mappings are not present. I'm not sure whether this is the root of the problem, but after I insert data (see below), a basic mapping appears that is not the dense_vector mapping I created.
GET http://localhost:9200/image-index
{
"image-index": {
"aliases": {},
"mappings": {},
"settings": {
"index": {
"routing": {
"allocation": {
"include": {
"_tier_preference": "data_content"
}
}
},
"number_of_shards": "1",
"provided_name": "image-index",
"creation_date": "1675178096730",
"number_of_replicas": "1",
"uuid": "HR521T_jQfeduY7cdMr-Jw",
"version": {
"created": "8060099"
}
}
}
}
}
I have a programme that inserts 10000 random vectors similar to these:
POST http://localhost:9200/image-index/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "image-vector": [1, 5, -20], "title": "moose family", "file-type": "jpg" }
{ "index": { "_id": "2" } }
{ "image-vector": [42, 8, -15], "title": "alpine lake", "file-type": "png" }
{ "index": { "_id": "3" } }
{ "image-vector": [15, 11, 23], "title": "full moon", "file-type": "jpg" }
When I retrieve the index, I can see that it now has a mapping (but image-vector is not a dense_vector):
GET http://localhost:9200/image-index
{
"image-index": {
"aliases": {},
"mappings": {
"properties": {
"file-type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"image-vector": {
"type": "long"
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"settings": {
"index": {
"routing": {
"allocation": {
"include": {
"_tier_preference": "data_content"
}
}
},
"number_of_shards": "1",
"provided_name": "image-index",
"creation_date": "1675178096730",
"number_of_replicas": "1",
"uuid": "HR521T_jQfeduY7cdMr-Jw",
"version": {
"created": "8060099"
}
}
}
}
}
When I invoke an ANN search query:
GET http://localhost:9200/image-index/_search
{
"knn": {
"field": "image-vector",
"query_vector": [42, 8, -15],
"k": 10,
"num_candidates": 1
},
"fields": [ "title", "file-type" ]
}
I always get totally inaccurate results, no matter how many vectors I insert, even when I set the query vector to be obviously near one of the inserted vectors:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "image-index",
"_id": "1",
"_score": 1.0,
"_source": {
"image-vector": [
1,
5,
-20
],
"title": "moose family",
"file-type": "jpg"
}
},
{
"_index": "image-index",
"_id": "2",
"_score": 1.0,
"_source": {
"image-vector": [
42,
8,
-15
],
"title": "alpine lake",
"file-type": "png"
}
},
{
"_index": "image-index",
"_id": "3",
"_score": 1.0,
"_source": {
"image-vector": [
15,
11,
23
],
"title": "full moon",
"file-type": "jpg"
}
}
]
}
}
Does anybody know what I am doing wrong?
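For comparison, here is a small brute-force baseline in Ruby showing the order one would expect for the three sample vectors if the field really were indexed as a dense_vector with l2_norm similarity. It relies on the documented scoring for that similarity, _score = 1 / (1 + l2_norm(query, vector)^2). The helper names are made up for this sketch, and it does not diagnose the problem by itself. It only makes visible that the response above (every hit scoring 1.0, in insertion order) is not what a working kNN search would return.

```ruby
# Squared Euclidean distance between two equal-length vectors.
def squared_l2(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Score documented for dense_vector fields with "similarity": "l2_norm":
#   1 / (1 + l2_norm(query, vector)^2)
def l2_score(query, vector)
  1.0 / (1.0 + squared_l2(query, vector))
end

# The three sample documents from the bulk request above.
docs = {
  "1" => [1, 5, -20],
  "2" => [42, 8, -15],
  "3" => [15, 11, 23]
}
query = [42, 8, -15]

# Exact (brute-force) ranking, highest score first.
ranked = docs.map { |id, vector| [id, l2_score(query, vector)] }
             .sort_by { |_, score| -score }

ranked.each { |id, score| puts format("%s  %.6f", id, score) }
```

Document 2 is identical to the query vector, so it scores exactly 1.0 and ranks first, followed by documents 1 and 3 with much smaller scores.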

Spell corrections not working when refining queries with aggregations

I'm searching for list items using the /search/query endpoint of MS Graph. I want to use aggregations and spell checking. This is my request:
{
"requests": [
{
"entityTypes": [
"listItem"
],
"query": {
"queryString": "inspring"
},
"fields": [
"title"
],
"aggregations": [
{
"field": "fileType",
"size": 20,
"bucketDefinition": {
"sortBy": "count",
"isDescending": "true",
"minimumCount": 0
}
}
],
"queryAlterationOptions": {
"enableModification": true
}
}
]
}
It returns no results, since the search term was not spell checked and modified:
{
"value": [
{
"searchTerms": [
"inspring"
],
"hitsContainers": [
{
"total": 0,
"moreResultsAvailable": false
}
]
}
],
"@odata.context": "https://graph.microsoft.com/beta/$metadata#Collection(microsoft.graph.searchResponse)"
}
However, when I remove the aggregations and use the following request, it works:
{
"requests": [
{
"entityTypes": [
"listItem"
],
"query": {
"queryString": "inspring"
},
"fields": [
"title"
],
"queryAlterationOptions": {
"enableModification": true
}
}
]
}
Response:
{
"value": [
{
"searchTerms": [
"inspiring"
],
"hitsContainers": [
{
"hits": [...],
"total": 64,
"moreResultsAvailable": true
}
],
"queryAlterationResponse": {
"originalQueryString": "inspring",
"queryAlteration": {
"alteredQueryString": "inspiring",
"alteredHighlightedQueryString": "inspiring",
"alteredQueryTokens": [
{
"offset": 0,
"length": 8,
"suggestion": "inspiring"
}
]
},
"queryAlterationType": "modification"
}
}
],
"@odata.context": "https://graph.microsoft.com/v1.0/$metadata#Collection(microsoft.graph.searchResponse)"
}
How do I have to change my request to make query alterations work with aggregations?

How to order by two different attributes on same document in elasticsearch?

I have documents like these:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "journeys-development-latest",
"_type" : "_doc",
"_id" : "1399",
"_score" : 1.0,
"_source" : {
"draft_recent_edit_at" : "2023-01-14T04:16:41.318Z",
"recent_edit_at" : "2022-09-23T14:13:41.246Z"
}
},
{
"_index" : "journeys-development-latest",
"_type" : "_doc",
"_id" : "1394",
"_score" : 1.0,
"_source" : {
"draft_recent_edit_at" : "2022-07-02T16:19:41.347Z",
"recent_edit_at" : "2022-12-26T10:12:41.333Z"
}
},
{
"_index" : "journeys-development-latest",
"_type" : "_doc",
"_id" : "1392",
"_score" : 1.0,
"_source" : {
"draft_recent_edit_at" : "2022-05-20T11:33:41.372Z",
"recent_edit_at" : "2021-12-21T03:36:41.359Z"
}
}
]
}
}
What I know is that if I do
{
"size": 12,
"from": 0,
"query": {
......,
......
},
"sort": [
{
"recent_edit_at": {
"order": "desc"
}
}
]
}
this will order by recent_edit_at in descending order.
Similarly, replacing recent_edit_at with draft_recent_edit_at will order by draft_recent_edit_at in descending order.
What I am struggling with is finding a way to take, for each document, the maximum of draft_recent_edit_at and recent_edit_at, and then order the documents by that value.
===========================Update===========================
After adding the sort proposed by HPringles, the output is:
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"Math.max(doc['draft_recent_edit_at'].value.toInstant().toEpochMilli(),\n doc['recent_edit_at'].value.toInstance().toEpochMilli())\n ",
" ^---- HERE"
],
"script": "\n Math.max(doc['draft_recent_edit_at'].value.toInstant().toEpochMilli(),\n doc['recent_edit_at'].value.toInstance().toEpochMilli())\n ",
"lang": "painless"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "journeys-development-latest",
"node": "GGAHq1ufQQmSqeLRyzka5A",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"Math.max(doc['draft_recent_edit_at'].value.toInstant().toEpochMilli(),\n doc['recent_edit_at'].value.toInstance().toEpochMilli())\n ",
" ^---- HERE"
],
"script": "\n Math.max(doc['draft_recent_edit_at'].value.toInstant().toEpochMilli(),\n doc['recent_edit_at'].value.toInstance().toEpochMilli())\n ",
"lang": "painless",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "dynamic method [org.elasticsearch.script.JodaCompatibleZonedDateTime, toInstance/0] not found"
}
}
}
]
},
"status": 400
}
If I'm understanding correctly, you can do this with a Painless script at query time.
See below:
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": """
Math.max(doc['draft_recent_edit_at'].value.toInstant().toEpochMilli(),
doc['recent_edit_at'].value.toInstant().toEpochMilli())
""",
"params": {
"factor": 1.1
}
},
"order": "asc"
}
}
This works out the maximum of the two dates and then sorts on that value.
As far as I know, you might also want to store the epoch values in long variables first.
Something like:
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": """
long draft_recent_edit_at = doc['draft_recent_edit_at'].value.toInstant().toEpochMilli();
long recent_edit_at = doc['recent_edit_at'].value.toInstant().toEpochMilli();
Math.max(draft_recent_edit_at, recent_edit_at);
"""
},
"order": "asc"
}
}
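To check what ordering the script sort should produce, the same "maximum of the two dates" logic can be reproduced locally in Ruby against the three sample documents from the question. This is only a sanity check run outside Elasticsearch; the helper names are hypothetical, and the order parameter is an assumption (the question's own examples sort descending, while the answers use "asc").

```ruby
require "time"

# The three sample documents from the question.
docs = [
  { id: "1399", draft_recent_edit_at: "2023-01-14T04:16:41.318Z", recent_edit_at: "2022-09-23T14:13:41.246Z" },
  { id: "1394", draft_recent_edit_at: "2022-07-02T16:19:41.347Z", recent_edit_at: "2022-12-26T10:12:41.333Z" },
  { id: "1392", draft_recent_edit_at: "2022-05-20T11:33:41.372Z", recent_edit_at: "2021-12-21T03:36:41.359Z" }
]

# Epoch milliseconds of an ISO 8601 timestamp.
def epoch_millis(iso)
  (Time.iso8601(iso).to_r * 1000).to_i
end

# Same value the Painless script computes: the later of the two edit times.
def latest_edit_millis(doc)
  [epoch_millis(doc[:draft_recent_edit_at]), epoch_millis(doc[:recent_edit_at])].max
end

# Sort clause as a Ruby hash, with the order left configurable.
def latest_edit_sort(order: "desc")
  source = <<~PAINLESS
    long draft = doc['draft_recent_edit_at'].value.toInstant().toEpochMilli();
    long recent = doc['recent_edit_at'].value.toInstant().toEpochMilli();
    Math.max(draft, recent);
  PAINLESS
  { _script: { type: "number", script: { lang: "painless", source: source }, order: order } }
end

newest_first = docs.sort_by { |d| -latest_edit_millis(d) }.map { |d| d[:id] }
oldest_first = docs.sort_by { |d| latest_edit_millis(d) }.map { |d| d[:id] }

puts "desc: #{newest_first.inspect}"
puts "asc:  #{oldest_first.inspect}"
```

For document 1394 the later date is recent_edit_at, while for the other two it is draft_recent_edit_at, so the sample exercises both branches of the maximum.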

Exact search getting less precedence than phonetic search?

I have an Elasticsearch index and am using the following query:
{
"_source": [
"title",
"content"
],
"size": 15,
"from": 0,
"query": {
"bool": {
"must": {
"multi_match": {
"query": "{{query}}",
"fields": [
"title",
"content"
],
"operator": "or"
}
},
"should": [
{
"multi_match": {
"query": "{{query}}",
"fields": [
"title.standard^16",
"content.standard^2"
],
"operator": "and"
}
},
{
"match_phrase": {
"content.standard": {
"query": "{{query}}",
"_name": "Phrase on title",
"boost": 1000
}
}
}
]
}
},
"highlight": {
"fields": {
"content": {}
},
"fragment_size": 100
}
}
Here is the mapping I set:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_metaphone"
]
}
},
"filter": {
"my_metaphone": {
"type": "phonetic",
"encoder": "metaphone",
"replace": true
}
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "my_analyzer",
"fields": {
"standard": {
"type": "text"
},
"stemmer": {
"type": "text",
"analyzer": "english"
}
}
},
"content": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "my_analyzer",
"fields": {
"standard": {
"type": "text"
},
"stemmer": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
Here is my logic with the query:
1) It gives the highest precedence to a phrase match, if there is one.
2) If not, it uses the standard analyzer (that is, the text as is) and gives that the next-highest precedence.
3) If nothing else matches, it falls back to the phonetic analyzer, which has the lowest precedence.
But obviously there is a fault in this, as it seems to give higher precedence to the phonetic analyzer than to the standard or phrase matches. For example, if I search for "Person of Indian Origin", the top results highlight "Pursuant" and "pursuing", and very few results contain "person of Indian origin", although I know a large number of them exist. How do I solve this?
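The three tiers described above can be written down explicitly as three should clauses with decreasing boosts. The Ruby sketch below only builds the request body. The helper name tiered_query and the boost values (100, 10, 1) are arbitrary assumptions, and moving the phonetic match from must into should is a design change relative to the original query. Scores of matching should clauses are added together, so this expresses the intended precedence without strictly guaranteeing it; whether it improves the ranking depends on the data.

```ruby
# Build a query with three explicit tiers (all names and boosts are
# illustrative assumptions, not a verified fix):
#   1. exact phrase on the standard-analyzed subfields (highest boost)
#   2. all terms present on the standard-analyzed subfields
#   3. phonetic fallback on the base fields (lowest boost)
def tiered_query(text, size: 15)
  standard_fields = ["title.standard", "content.standard"]
  phonetic_fields = ["title", "content"]
  {
    size: size,
    query: {
      bool: {
        minimum_should_match: 1,
        should: [
          { multi_match: { query: text, type: "phrase", fields: standard_fields, boost: 100 } },
          { multi_match: { query: text, fields: standard_fields, operator: "and", boost: 10 } },
          { multi_match: { query: text, fields: phonetic_fields, operator: "or", boost: 1 } }
        ]
      }
    }
  }
end

request = tiered_query("Person of Indian Origin")
puts request.inspect
```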
