Approximate nearest neighbors search returns too few results - yql

I have 1M records in my database with the following schema:
schema embeddings {
    document embeddings {
        field id type int {}
        field text_embedding type tensor<double>(d0[960]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 100
                }
            }
        }
    }
    rank-profile closeness {
        num-threads-per-search: 1
        inputs {
            query(query_embedding) tensor<double>(d0[960])
        }
        first-phase {
            expression: closeness(field, text_embedding)
        }
    }
}
My query for finding the nearest neighbors looks like this:
body = {
    'yql': 'select * from embeddings where ({approximate:true, targetHits:100} nearestNeighbor(text_embedding, query_embedding));',
    "hits": 100,
    'input': {
        'query(query_embedding)': [...],
    },
    "ranking": {
        "profile": "closeness",
        "softtimeout": {
            "enable": False
        }
    }
}
For some reason, certain query vectors return fewer results than targetHits. Changing timeouts does not help.
Here is the coverage section from the response:
"id": "toplevel",
"relevance": 1.0,
"fields": {
    "totalCount": 39
},
"coverage": {
    "coverage": 100,
    "documents": 1000000,
    "full": true,
    "nodes": 1,
    "results": 1,
    "resultsFull": 1
},
Is there any way to receive exactly (or at least no fewer than) targetHits results? There are obviously enough candidates, since closeness can be calculated for any other vector in the db.

When you ask for targetHits:100, Vespa will expose that many hits to first-phase ranking, per content node. If it does not, we would be very interested in how to reproduce it; that is best done by creating an issue on GitHub at vespa-engine/vespa. There is also support for dropping hits in first-phase ranking using rank-score-drop-limit, which can reduce the result set and totalCount, but that does not seem to be enabled here.
The hits parameter (or limit in YQL) controls how many hits are returned in the response.
Vespa's default timeout is 500ms, and if your system is heavily overloaded (or using exact search with approximate:false), you might see soft-timeouts where Vespa returns a partial result. This situation is reflected in the returned result coverage element.
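To tell these cases apart on the client side, something like the sketch below could inspect the root element of the response. It assumes the response JSON has already been parsed into a dict, and that partial results carry a degraded object inside coverage (my understanding of the result format, so check it against your Vespa version). The helper name diagnose is made up for illustration.

```python
def diagnose(root, target_hits):
    """Return human-readable notes about why a result may be short."""
    notes = []
    coverage = root.get("coverage", {})
    total = root.get("fields", {}).get("totalCount", 0)

    # Partial result: coverage below 100% or not marked as full.
    if not coverage.get("full", True) or coverage.get("coverage", 100) < 100:
        notes.append("partial coverage: %s%%" % coverage.get("coverage"))

    # Assumed field: reasons for degradation, e.g. {"timeout": true}.
    degraded = coverage.get("degraded")
    if degraded:
        reasons = [name for name, hit in degraded.items() if hit]
        notes.append("degraded: " + ", ".join(sorted(reasons)))

    # Full coverage but fewer matches than requested, as in the question.
    if not notes and total < target_hits:
        notes.append(
            "full coverage but only %d of %d target hits" % (total, target_hits)
        )
    return notes


# Root element from the question (children omitted).
root = {
    "id": "toplevel",
    "relevance": 1.0,
    "fields": {"totalCount": 39},
    "coverage": {
        "coverage": 100,
        "documents": 1000000,
        "full": True,
        "nodes": 1,
        "results": 1,
        "resultsFull": 1,
    },
}
print(diagnose(root, 100))
```

For the response in the question this reports full coverage with a short result, which is the case worth filing as an issue rather than a timeout.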

Related

Ruby - Sum all keys from a hash

I have the following hash and I want to sum all the totals. Using dig I'm able to get the value at the first index, but there are sometimes more entries depending on the date range. How can I sum all the amount values?
In this example results_by_time has 3 amount values.
value = {
  :next_page_token=>nil,
  :group_definitions=>nil,
  :results_by_time=>[
    {
      :time_period=>{
        :start=>"2022-10-01",
        :end=>"2022-11-01"
      },
      :total=>{
        "BlendedCost"=>{
          :amount=>"49.1803785401",
          :unit=>"USD"
        }
      },
      :groups=>[],
      :estimated=>false
    },
    {
      :time_period=>{
        :start=>"2022-11-01",
        :end=>"2022-12-01"
      },
      :total=>{
        "BlendedCost"=>{
          :amount=>"79.0698396954",
          :unit=>"USD"
        }
      },
      :groups=>[],
      :estimated=>false
    },
    {
      :time_period=>{
        :start=>"2022-12-01",
        :end=>"2022-12-03"
      },
      :total=>{
        "BlendedCost"=>{
          :amount=>"2.6918272089",
          :unit=>"USD"
        }
      },
      :groups=>[],
      :estimated=>true
    }
  ],
  :dimension_value_attributes=>[]
}
Here is the dig call, but it can only fetch the value at a specific index.
amount = value.dig(:results_by_time, 0, :total, 'BlendedCost', :amount).to_f
print amount

Elastic search scroll aggregations

I am trying to get a unique document count in an index based on an id property via the Elasticsearch web API. The thing is that I have millions of entries. How can I scroll on an aggregation?
This is the URL:
http://my.servers.ip:9200/index_name/doc_type/_search?scroll=1m
And this is the body:
{
  "_source": "false",
  "aggs": {
    "Ids": {
      "terms": {
        "field": "somePropertyIWantToGoupBy",
        "size": 100
      },
      "aggs": {
        "unique": {
          "cardinality": {
            "field": "someCategoryIWantUniqueCount"
          }
        }
      }
    }
  },
  "size": 0
}
I get the scroll ID, but where I expect the next call with that scroll ID to return the next 100 aggregation buckets, I get an empty result set instead.
Is it possible to scroll on aggregations?
What am I doing wrong?
There's no way to paginate a terms aggregation.
You should use the composite aggregation instead, but at the time of writing it's a beta aggregation and might be removed or changed in the future...
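A minimal sketch of paging with a composite aggregation, assuming an Elasticsearch version that supports it. The search argument is a hypothetical callable that posts a body to the index's _search endpoint (without the scroll parameter) and returns the parsed JSON response; the field names are the ones from the question. Each page passes the previous response's after_key back as after, falling back to the key of the last bucket when after_key is absent.

```python
def composite_body(after_key=None, page_size=100):
    """Build a size-0 search body that pages over the grouping field."""
    composite = {
        "size": page_size,
        "sources": [
            {"id": {"terms": {"field": "somePropertyIWantToGoupBy"}}}
        ],
    }
    if after_key is not None:
        composite["after"] = after_key  # resume after the last bucket seen
    return {
        "size": 0,
        "aggs": {
            "Ids": {
                "composite": composite,
                "aggs": {
                    "unique": {
                        "cardinality": {"field": "someCategoryIWantUniqueCount"}
                    }
                },
            }
        },
    }


def iter_buckets(search, page_size=100):
    """Yield every bucket, one page at a time, until a page comes back empty."""
    after_key = None
    while True:
        response = search(composite_body(after_key, page_size))
        agg = response["aggregations"]["Ids"]
        buckets = agg["buckets"]
        if not buckets:
            return
        yield from buckets
        # The key of the last bucket has the same shape as "after".
        after_key = agg.get("after_key", buckets[-1]["key"])
```

Each bucket's key is a dict such as {"id": ...}, and its unique.value holds the cardinality estimate for that group.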

iOS Swift 3 - filter array inside filter

I would like to filter an array inside a filter. First, I have a big array of Staff objects (self.bookingSettings.staffs). Inside this array I have multiple objects like this:
"staffs": [
  {
    "id": 1,
    "name": "Brian",
    "services": [
      { "id": 1 },
      { "id": 2 },
      { "id": 3 },
      { "id": 4 }
    ],
    "pos": 1
  },...
I would like to filter this array so that it keeps only the staff whose services include id = 3.
I managed to match the case where the first service has id 3 with this code:
self.bookingSettings.staffs.filter({ $0.services.first?.id == self.bookingService.id })
but that checks only the first item.
I think I have to filter inside my filter function, something like this, to loop over all the objects inside services:
self.bookingSettings.staffs.filter({ $0.services.filter({ $0.id == self.bookingService.id }) })
but I get the following error: Cannot convert value of type [BookingService] to closure result type Bool.
Is this a good idea? How can I achieve this?
You could use filter, which would look something like this:
self.bookingSettings.staffs.filter {
    !$0.services.filter { $0.id == self.bookingService.id }.isEmpty
}
However, this code constructs an entire array of filtered results, only to check whether it's empty and immediately discard it. Since filter returns all items in the list that match the predicate, it won't stop after it finds a match (which is really what you're looking for). So even if the first element out of a list of a million elements matches, it'll still go on to check 999,999 more elements. If the other 999,999 elements also match, they will all be copied into filter's result. That's silly, and can use far more CPU and RAM than necessary in this case.
You just need contains(where:):
self.bookingSettings.staffs.filter {
    $0.services.contains(where: { $0.id == self.bookingService.id })
}
contains(where:) short-circuits, meaning that it won't keep checking elements after a match is found. It stops and returns true as soon as it finds a match. It also doesn't bother copying matching elements into a new list.

custom sort order in tablesorter - jquery

I use tablesorter (https://mottie.github.io/tablesorter/docs/index.html) to sort my HTML tables.
There is one sort order I cannot figure out how to achieve, i.e.:
(4)
(dns)
1
2
3
5
dns
is to be sorted as:
1
2
3
(4)
5
(dns)
dns
In short: the parentheses are to be ignored, the sort is numeric, and numbers come first, followed by text in alphabetical order.
I have seen how to replace characters (treating them as "empty" doesn't work, because some of those entries rank too).
The parsers I have seen thus far require me to create one per header, with every known value to be replaced,
i.e.:
$.tablesorter.addParser({
  id: 'nummeriek',
  is: function(s) {
    return false;
  },
  format: function(s) {
    // format your data for normalization
    return s.toLowerCase().replace('dns',999).replace('(dns)',999).replace('(4)',4);
  },
  type: 'numeric'
});
$('.tablesorter').tablesorter({
  headers: {
    6: {
      sorter: 'nummeriek'
    }
  }
});
If I have to do this for every possible table content, I end up creating hundreds of replace() statements, as I have scores from 1 to 100, so (1) to (100) are possible too...
There must be an easier way. Any help is much appreciated.
The default digit parser "assumes" that numbers wrapped in parentheses are negative; this is a common method of indicating a negative number in accounting (ref).
To get around this, you will need to slightly modify the parser (demo)
$(function() {
  $.tablesorter.addParser({
    id: 'nummeriek',
    is: function(s) {
      return false;
    },
    format: function(str) {
      // format your data for normalization
      var s = str.replace(/[()]/g, ""),
        n = parseFloat(s);
      return !isNaN(n) && isFinite(n) ? n : s;
    },
    type: 'numeric'
  });
  $('.tablesorter').tablesorter({
    headers: {
      0: {
        sorter: 'nummeriek'
      }
    }
  });
});
Note: this parser always returns a non-numeric string without parentheses, e.g. "(dns)" will become "dns". I kept it this way so the "(dns)" entries will sort as if they are "dns".

Mongoid max and embedded collections

I have a Report collection that embeds submissions:
class Report
  include Mongoid::Document
  embeds_many :submissions
end

class Submission
  include Mongoid::Document
  embedded_in :report
  field :date_submitted, type: TimeWithZone
  field :mistakes, type: Integer
end
I am trying to create a scope on Report.
The scope query has two parts:
get the latest submission (given by the max date_submitted), and require that it also has zero mistakes.
I can create a scope for the mistakes part, but cannot work out how to get the latest submission:
scope :my_scope, where("submissions.mistakes" => 0)
So this report would be returned, as its last entry in submissions has zero mistakes:
Report
"submissions" : [
    {
        "date_submitted" : ISODate("2014-01-28T13:00:00Z"),
        "mistakes" : 11
    },
    {
        "date_submitted" : ISODate("2014-03-08T13:00:00Z"),
        "mistakes" : 0
    }
]
whereas this one wouldn't be returned:
Report
"submissions" : [
    {
        "date_submitted" : ISODate("2014-01-28T13:00:00Z"),
        "mistakes" : 0
    },
    {
        "date_submitted" : ISODate("2014-03-08T13:00:00Z"),
        "mistakes" : 11
    }
]
This is because you are not filtering the element of the embedded array but the document that contains that element.
There could be an $elemMatch clause here which allows you to combine the conditions on a single element. But find does not have any operation for getting the max value as it were. This is not to be confused with the $max query modifier, which actually clips the index in use to not search beyond those bounds.
So here you use aggregate:
db.collection.aggregate([
    // Optionally query to match and filter your documents.
    //{ "$match": { /* Same conditions as find */ } },
    // Unwind the array
    { "$unwind": "$submissions" },
    // Filter all but 0 mistakes
    { "$match": { "submissions.mistakes": 0 } },
    // Group the results, taking the max entry and presuming by document `_id`
    { "$group": {
        "_id": "$_id",
        "date_submitted": { "$max": "$submissions.date_submitted" }
    }}
])
That is the general process for filtering the elements of an array. You may need to look into your driver's implementation of aggregate, but the form is always a pipeline represented as an array of documents (hashes), as above. With Mongoid you can probably reach it through the Moped collection object, so something like:
Report.collection.aggregate([ /* stages */ ])
For more information on returning the original document form if that is what your requirement is then see here.

Resources