Unexpected Microsoft external search aggregation values - microsoft-graph-api

We have a Microsoft Search instance crawling a custom app through a Graph connector: https://learn.microsoft.com/en-us/microsoftsearch/connectors-overview
Query & display work as expected, but the aggregation returns wrong results.
Query JSON (POST to https://graph.microsoft.com/v1.0/search/query), selecting title + submitter and aggregating on submitter:
"fields": [
"title",
"submitter"
],
"aggregations": [
{
"field": "submitter",
"size": 1,
"bucketDefinition": {
"sortBy": "keyAsString",
"isDescending": true,
"minimumCount": 0
}
}
]
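For context, the fields/aggregations fragment above sits inside a full search request body roughly like the sketch below; the queryString and connection id are placeholders, not values from the original request:
{
  "requests": [
    {
      "entityTypes": [
        "externalItem"
      ],
      "contentSources": [
        "/external/connections/<connectionId>"
      ],
      "query": {
        "queryString": "business model design"
      },
      "fields": [
        "title",
        "submitter"
      ],
      "aggregations": [
        {
          "field": "submitter",
          "size": 1,
          "bucketDefinition": {
            "sortBy": "keyAsString",
            "isDescending": true,
            "minimumCount": 0
          }
        }
      ]
    }
  ]
}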
JSON response: the submitter property is correctly returned as "Firstname Lastname" in the hit, but the aggregation bucket key comes back lowercased with the space removed ("firstnamelastname").
"hitsContainers": [
{
"total": 1,
"moreResultsAvailable": false,
"hits": [
{
"hitId": "xxxx",
"contentSource": "ConnectionId",
"rank": 1,
"summary": "New service / <c0>business</c0> <c0>model</c0> <c0>design</c0> <ddd/>",
"resource": {
"#odata.type": "#microsoft.graph.externalConnectors.externalItem",
"properties": {
"title": "New service / business model design",
"submitter": "Firstname Lastname"
}
}
}
],
"aggregations": [
{
"field": "submitter",
"buckets": [
{
"key": "firstnamelastname",
"count": 1,
"aggregationFilterToken": "\"ǂǂ696c736573706f656c73747261\""
}
]
}
]
}
]
This is reproducible in Microsoft Graph Explorer (screenshots slightly obfuscated): the hit shows the submitter value with the space, while the aggregation bucket shows it concatenated in lowercase.

The root cause has been identified: the submitter property was not created with the refinable flag.
{
"name": "submitter",
"type": "String",
"isSearchable": true,
"isQueryable": true,
"isRetrievable": true,
"isRefinable": false
}
As a consequence, the aggregation output was incorrect.
Testing with isRefinable = true returns the correct aggregation value (screenshot 1 = non-refinable, screenshot 2 = refinable).
Small note: refinable properties cannot also be searchable.
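A minimal sketch of how the corrected property definition could look inside the connection's schema payload (the surrounding request and connection id are omitted; since refinable properties cannot be searchable, isSearchable is set to false here):
{
  "baseType": "microsoft.graph.externalItem",
  "properties": [
    {
      "name": "submitter",
      "type": "String",
      "isSearchable": false,
      "isQueryable": true,
      "isRetrievable": true,
      "isRefinable": true
    }
  ]
}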

Related

Microsoft Graph Search returning duplicates

I am querying https://graph.microsoft.com/v1.0/search/query with the following payload:
{
"requests": [
{
"entityTypes": [
"listItem"
],
"query": {
"queryString": "uniqueid:925211fd-fc7e-4ed8-95fb-0bd00f378e8b"
},
"trimDuplicates": true,
"fields": [
"uniqueid",
"originalpath"
]
}
]
}
Searching for UniqueID I would expect a single result, but instead I get the same item twice:
{
"value": [
{
"searchTerms": [],
"hitsContainers": [
{
"hits": [
{
"hitId": "925211fd-fc7e-4ed8-95fb-0bd00f378e8b",
"rank": 1,
"summary": "",
"resource": {
"#odata.type": "#microsoft.graph.listItem",
"fields": {
"uniqueid": "{925211fd-fc7e-4ed8-95fb-0bd00f378e8b}",
"originalpath": "https://tenant.sharepoint.com/sites/POC/POC Docs/Employee Contracts/JohnD Employee Contract.docx"
}
}
},
{
"hitId": "925211fd-fc7e-4ed8-95fb-0bd00f378e8b",
"rank": 2,
"summary": "",
"resource": {
"#odata.type": "#microsoft.graph.listItem",
"fields": {
"uniqueid": "{925211fd-fc7e-4ed8-95fb-0bd00f378e8b}",
"originalpath": "https://tenant.sharepoint.com/sites/POC/POC Docs/Employee Contracts/JohnD Employee Contract.docx"
}
}
}
],
"total": 2,
"moreResultsAvailable": false
}
]
}
],
"#odata.context": "https://graph.microsoft.com/v1.0/$metadata#Collection(microsoft.graph.searchResponse)"
}
I get duplicate results with other queries as well. It is not limited to this specific file. If I do the same search in SharePoint I only get a single result as expected.
Am I doing something wrong, or is this a bug?
Per my test, unfortunately I cannot reproduce your issue. In my tests, the same Graph API call returns only one result:
https://graph.microsoft.com/v1.0/search/query
My test result: (screenshot not included)
I suggest you create a feedback item for this issue so that the product team can investigate. Thank you for your understanding.
Feedback: https://feedbackportal.microsoft.com/feedback/forum/ebe2edae-97d1-ec11-a7b5-0022481f3c80

Filter empty object in odata

There is an OData service I want to call. We have a model that has a field Parent. The service tells me that Parent is an object containing an iv that is an array of strings. So we get the following model from the OData service:
{
"total": 0,
"items": [
{
"id": "string",
"data": {
"Title": {
"en": "string",
"nl": "string"
},
"Parent": {
"iv": [
"string"
]
}
}
}
]
}
Now when we get data back it looks like this:
{
"total": 2,
"items": [
{
"id": "6204c07d-1aef-4bd2-9646-e3ca36c63784",
"data": {
"Title": {
"en": "Test 1",
"nl": "Test 1"
},
"Parent": {
"iv": []
}
}
},
{
"id": "bfd1b084-4166-4fec-9032-08047c8313d2",
"data": {
"Title": {
"en": "Test 2",
"nl": "Test 2"
},
"Parent": {
"iv": [
"6204c07d-1aef-4bd2-9646-e3ca36c63784"
]
}
}
}
]
}
Now I want to filter on all the items where Parent is [] (so no parent selected). I've tried the following filters, all without success:
data/Parent/iv eq [] | Returns 0 results
data/Parent/iv eq null | Returns 0 results
data/Parent eq null | Returns 0 results
length(data/Parent/iv) eq 0 | OData operation is not supported
not data/Parent/any() | Any/All may only be used following a collection.
data/Parent/iv/$count eq 0 | Server error
I found that for this case the following works:
empty(data/Parent/iv)
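For illustration only, assuming the items endpoint accepts a standard $filter query option (the URL below is hypothetical, not from the original post), the working filter would be passed like this:
GET https://example.com/api/content/items?$filter=empty(data/Parent/iv)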

Building an OpenAPI response, including oneOf, and maybe allOf

I am trying to build up a response from a variety of schema components using OpenAPI 3. There are basically three parts to the response:
A shared component that other endpoints use (i.e. success/failure flags). - #/components/schemas/core_response_schema inside allOf.
Properties that all responses on this endpoint use (i.e., user_id) - the properties component of the below.
One of several schemas that will vary depending on the type of user. - the oneOf component.
I've determined that I have to use allOf to be able to mix properties (item 2) and the core response (item 1), though this feels wrong as there's only one item inside the allOf. I tried a plain $ref instead, but it didn't work.
The schema below passes three different OpenAPI linting tools, but in the example it builds, Swagger UI does not show the item 2 properties, and it shows all of the item 3 schemas at once (they should be oneOf).
"responses": {
"200": {
"description": "Operation successfully executed.",
"content": {
"application/json": {
"schema": {
"properties": {
"user_id": {
"$ref": "#/components/schemas/user_id"
},
"results": {
"type": "array",
"items": {
"$ref": "#/components/schemas/result_user_by_id"
}
}
},
"type": "object",
"allOf": [
{
"$ref": "#/components/schemas/core_response_schema"
}
],
"oneOf": [
{
"$ref": "#/components/schemas/user_type_a"
},
{
"$ref": "#/components/schemas/user_type_b"
},
{
"$ref": "#/components/schemas/user_type_c"
}
]
}
}
}
}
},
"components": {
"schemas": {
"core_response_schema": {
"properties": {
"success": {
"description": "A flag indicating whether the request was successfully completed or not.",
"type": "boolean"
},
"num_results": {
"description": "The number of results for this request",
"type": "integer"
}
},
"type": "object"
},
"user_id": {
"description": "Unique 10 character `user_id`.",
"type": "string",
"maxLength": 10,
"minLength": 10,
"example": "a1b2c3d4e5"
},
}
}
And here are example payloads for two users, Type A and Type B (it's a contrived example).
User Type A:
{
"success": true,
"num_results": 1,
"user_id": "c1b00cb714",
"results": [{
"user_type": "a",
"group_id": "e7a99e3769",
"name": null,
"title": null,
... (and so on until we get to the stuff that's unique to this type of user) ...
"favourite_artworks": [
"sunflowers",
"landscapes"
],
"artwork_urls": [
"http://sunflowers.example"
]
}
]
}
User Type B:
{
"success": true,
"num_results": 1,
"user_id": "c1b00cb715",
"results": [{
"user_type": "B",
"group_id": "e7a99e3769",
"name": null,
"title": null,
... (and so on until we get to the stuff that's unique to this type of user) ...
"supported_charities": [
"UN Foundations"
],
"charity_urls": [
"http://www.un.int"
],
}
]
}
What's the correct way to merge together different schemas and properties in OpenAPI? Is this right and Swagger UI just can't handle it?
And how do you mix a schema with properties without having to use allOf?
This suggests it's possible: Swagger Schema: oneOf, anyOf, allOf valid at the same time?
After further investigation, I've determined this is a bug in swagger-ui - https://github.com/swagger-api/swagger-ui/issues/3803 - they simply don't support oneOf (or anyOf) currently.
As far as at least three different linting tools are concerned, a mixture of anyOf, oneOf, and allOf can be used together in the same schema.
Redoc appears to have similar problems - https://github.com/Rebilly/ReDoc/issues/641
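For reference, here is a sketch of one way the same combination could be expressed so the intent is explicit: the core schema and the endpoint-specific properties are combined with allOf, and the oneOf is moved onto the items of results (this assumes the per-user-type schemas describe the objects inside results, as the example payloads suggest; whether a given renderer displays it correctly is a separate matter):
"schema": {
  "allOf": [
    { "$ref": "#/components/schemas/core_response_schema" },
    {
      "type": "object",
      "properties": {
        "user_id": { "$ref": "#/components/schemas/user_id" },
        "results": {
          "type": "array",
          "items": {
            "oneOf": [
              { "$ref": "#/components/schemas/user_type_a" },
              { "$ref": "#/components/schemas/user_type_b" },
              { "$ref": "#/components/schemas/user_type_c" }
            ]
          }
        }
      }
    }
  ]
}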

In Watson Discovery, limiting "return"ed fields to aggregation values

For the Discovery REST API, the argument/parameter "return" controls which fields are returned.
So if I pass these arguments (qopts) to the API:
{
"query": named_sector,
"count": "10",
"filter": filter_dates,
"aggregation" : "term(docSentiment.type,count:3)"
}
my_query = discovery.query(my_disc_environment_id, my_disc_collection_id, qopts)
print(json.dumps(my_query, indent=2))
I get the following:
{
"matching_results": 14779,
"aggregations": [
{
"type": "term",
"field": "docSentiment.type",
"count": 3,
"results": [
{
"key": "positive",
"matching_results": 4212
},
{
"key": "negative",
"matching_results": 3259
},
{
"key": "neutral",
"matching_results": 152
}
]
}
],
"results": [
{
"id": "6389715fe7e7f711e0bc09d4f1236639",
"score": 1.3689895,
"yyyymm": "201704",
"url": "https://seekingalpha.com/article/4060446-valuation-dashboard-consumer-discretionary-update",
"enrichedTitle": null,
"host": "seekingalpha.com",
"text": "Valuation Dashboard: Consumer Discretionary - Update\n\nSummary\n\nValuation metrics in Consumer Discretionary.\n\nEvolution since last month.\n\nA list of stocks loo ....
and thousands more lines. How do I restrict the output to just the aggregations section? Or is this a matter of me handling the returned JSON structure better?
thanks
If you change the count argument to 0, the returned JSON will only contain the aggregations.
Also, if you're using the Discovery web tooling, you can enter 0 for the "Number of results to return (Count)" field.
More details and an example can be found here: https://www.ibm.com/watson/developercloud/doc/discovery/using.html#building-aggregations
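For illustration, the query options from the question with count set to 0 (named_sector and filter_dates are the asker's own variables):
{
"query": named_sector,
"count": "0",
"filter": filter_dates,
"aggregation" : "term(docSentiment.type,count:3)"
}
my_query = discovery.query(my_disc_environment_id, my_disc_collection_id, qopts)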

How to make Elasticsearch sort/prefer hits with exactly matching strings first

I'm using default analyzers and indexing. So let's say I have this simple mapping:
"question": {
"properties": {
"title": {
"type": "string"
},
"answer": {
"properties": {
"text": {
"type": "string"
}
}
}
}
}
(that was an example. sorry if it has typos)
Now, I perform the following search.
GET _search
{
"query": {
"query_string": {
"query": "yes correct",
"fields": ["answer.text"]
}
}
}
The results list a text value like "yes correct." (doc id 1) ahead of simply "yes correct" (without a period, doc id 181). Both hits have the same score, but the hits array lists the one with the smaller doc id first. I understand that the default index option includes sorting by doc id, so how do I exclude that one attribute and still use the rest of the default options?
I'm not setting any custom analyzers, so everything is using default values for Elasticsearch 2.0.
This is probably a use case for Dis Max Query
A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
So following that, you need an exact-match version of your answer field that scores highest when the whole string matches. You'll have to use a custom analyzer for that. These would be your mappings:
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
}
}
}
},
"mappings": {
"question": {
"properties": {
"title": {
"type": "string"
},
"answer": {
"type": "object",
"properties": {
"text": {
"type": "string",
"analyzer": "my_keyword",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "standard"
}
}
}
}
}
}
}
}
}
Your test data:
PUT /test/question/1
{
"title": "title nr1",
"answer": [
{
"text": "yes correct."
}
]
}
PUT /test/question/2
{
"title": "title nr2",
"answer": [
{
"text": "yes correct"
}
]
}
Now when you're querying for "yes correct." with a query such as this:
POST /test/_search
{
"query": {
"dis_max": {
"tie_breaker": 0.7,
"boost": 1.2,
"queries": [
{
"match": {
"answer.text": {
"query": "yes correct.",
"type": "phrase"
}
}
},
{
"match": {
"answer.text.stemmed": {
"query": "yes correct.",
"operator": "and"
}
}
}
]
}
}
}
You get this output:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.37919715,
"hits": [
{
"_index": "test",
"_type": "question",
"_id": "1",
"_score": 0.37919715,
"_source": {
"title": "title nr1",
"answer": [
{
"text": "yes correct."
}
]
}
},
{
"_index": "test",
"_type": "question",
"_id": "2",
"_score": 0.11261705,
"_source": {
"title": "title nr2",
"answer": [
{
"text": "yes correct"
}
]
}
}
]
}
}
If you run the very same query without the trailing dot, i.e. "yes correct", you get this result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.37919715,
"hits": [
{
"_index": "test",
"_type": "question",
"_id": "2",
"_score": 0.37919715,
"_source": {
"title": "title nr2",
"answer": [
{
"text": "yes correct"
}
]
}
},
{
"_index": "test",
"_type": "question",
"_id": "1",
"_score": 0.11261705,
"_source": {
"title": "title nr1",
"answer": [
{
"text": "yes correct."
}
]
}
}
]
}
}
Hopefully this is what you're looking for.
By the way, I'd recommend always using the match query when performing full-text search. Taken from the documentation:
Comparison to query_string / field: The match family of queries does not go through a "query parsing" process. It does not support field name prefixes, wildcard characters, or other "advanced" features. For this reason, chances of it failing are very small / non-existent, and it provides an excellent behavior when it comes to just analyze and run that text as a query (which is usually what a text search box does). Also, the phrase_prefix type can provide a great "as you type" behavior to automatically load search results.
Elasticsearch, or rather Lucene, scoring does not take into account the relative positioning of the tokens. It uses 3 different criteria:
Term frequency - the frequency at which the search term appears in the document
Inverse document frequency - the number of occurrences of the search term across the entire index; the more common the term, the less importance it carries in scoring
Field length normalization - the number of tokens present in the target field
You can learn more about it here.
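If you want to see how these three factors combine for a particular hit, one option is Elasticsearch's explain flag; a minimal sketch reusing the test index and stemmed subfield from the answer above:
POST /test/_search
{
  "explain": true,
  "query": {
    "match": {
      "answer.text.stemmed": "yes correct"
    }
  }
}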
