Receiving responses of different formats for the same query in vespa - yql

I have the following schema:
schema embeddings {
    document embeddings {
        field id type int {}
        field text_embedding type tensor<double>(d0[960]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
        }
    }
    rank-profile closeness {
        num-threads-per-search: 1
        inputs {
            query(query_embedding) tensor<double>(d0[960])
        }
        first-phase {
            expression: closeness(field, text_embedding)
        }
    }
}
And the following services.xml:
...
<container id="query" version="1.0">
    <search/>
    <nodes>
        <node hostalias="query" />
    </nodes>
</container>
<content id='mind' version='1.0'>
    <redundancy>1</redundancy>
    <documents>
        <document type='embeddings' mode="index"/>
    </documents>
    <nodes>
        <node hostalias="content1" distribution-key="0"/>
    </nodes>
</content>
...
Then I have a number of queries, all of the same format:
{
    'yql': 'select * from embeddings where ({approximate:false, targetHits:100} nearestNeighbor(text_embedding, query_embedding));',
    'timeout': 5,
    'hits': 100,
    'input': {
        'query(query_embedding)': [...],
    },
    'ranking': {
        'profile': 'closeness',
    },
}
which are then run via app.query_batch(test_queries)
The problem is that some responses look like this (and contain the id field with the integer id I inserted):
{'id': 'id:embeddings:embeddings::786559', 'relevance': 0.5703559830732123, 'source': 'mind', 'fields': {'sddocname': 'embeddings', 'documentid': 'id:embeddings:embeddings::786559'}}
and others look like this (neither containing the int id I inserted, nor keeping the format of the previous example):
{'id': 'index:mind/0/b0dde169c545ce11e8fd1a17', 'relevance': 0.49024561522459087, 'source': 'mind'}
How can I make all responses look like the first one? Why are they different at all?

Some of them are filled with content and some are not, I suppose because the query timed out. Check the coverage info, and run with traceLevel=3 to see more details.
Some more background info on what's going on:
Searches are executed in two phases: first, minimal information on the best hits from each content node is returned up to the issuing container. These partial lists are then merged to produce the final list of (up to) hits matches. For those we execute phase two, which is to fill in the content of the final hits. This involves another request to each of the content nodes to get the relevant content.
If there's little time left, or lots of data, or expensive summary features to compute, or a slow disk subsystem or network, or a node in some kind of trouble, this fill phase may time out, leaving only some hits filled, which is what you are seeing.
Why are the ids not the true document ids in these cases? The text string id is stored in the document blob on disk but not in memory as an attribute, so it needs to be fetched in the fill phase too. If it is not, an internally generated unique id is used instead.
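To see how badly an individual query was degraded, you can re-run it with a longer timeout and tracing enabled and then inspect the coverage block and the unfilled hits. Below is a minimal sketch, assuming the pyvespa app object from the question and a pyvespa version where app.query(body=...) exposes the raw response JSON via .json; if your version differs, the same body can be POSTed directly to the /search/ endpoint.

query_body = {
    'yql': 'select * from embeddings where ({approximate:false, targetHits:100} nearestNeighbor(text_embedding, query_embedding));',
    'hits': 100,
    'timeout': '20s',   # give the fill phase more headroom than the original 5s
    'traceLevel': 3,    # the detailed trace ends up under result['trace']
    'input': {
        'query(query_embedding)': [0.0] * 960,   # placeholder; use a real embedding here
    },
    'ranking': {'profile': 'closeness'},
}

response = app.query(body=query_body)
result = response.json

# The coverage block reports whether the query was degraded (e.g. by a timeout).
print(result['root'].get('coverage', {}))

# Hits that were never filled come back without a 'fields' section.
children = result['root'].get('children', [])
unfilled = [h for h in children if 'fields' not in h]
print(f"{len(unfilled)} of {len(children)} hits are unfilled")

If coverage reports the result as not full, or degraded, raising the query timeout (or lowering hits and the summary cost) is usually the first thing to try.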


Is it possible to apply [Rule Chain] after [Data Converter]?

I am currently working on a POC by using ThingsBoard PE.
Our raw data contains [Asset] [Attributes].
Data flow:
[IoT cloud] --https webhook carry raw data--> [ThingsBoard PE HTTP INTEGRATION] --uplink--> [ThingsBoard PE Data Converter]
My question is: is it possible to apply a [Rule Chain] after the [ThingsBoard PE Data Converter]? That way, the device could auto-create a relationship with the [Asset] based on the [Attribute], instead of on the [AssetName].
Example data after data converter process:
{
    "deviceName": "ABC",
    "deviceType": "temperature",
    "attributes": {
        "asset_id": 6 // <-- the id is used in asset attribute
    },
    "telemetry": {
        "temperature": 39.43
    }
}
Answering your two separate questions:
is it possible to apply [Rule Chain] after [ThingsBoard PE Data Converter]?
Yes it is possible. Once your data is successfully integrated and you are receiving it, you can access it using the [Input] Rule Node (the default green one that is always there when you create a Rule) and route it to any other node you need.
Therefore, the device can auto create relationship with [Asset] by the [Attribute], instead of [AssetName].
So, you want the relationship to take your custom attribute and use that as the pattern that identifies the Asset you want to create the relationship from.
The PE edition already has the Create Relation Node. However, it seems that, as it is, it cannot do what you seek (it has no option to specify a custom Asset id).
However, you have two options:
Create a Custom Rule Node that does what you want. For that, try checking the Rule Node Development page from ThingsBoard. You can use the Create Relation Node as a base and work from there. This can be a longer solution than doing...
Enrich your incoming message's metadata, adding your desired attribute. The Create Relation Node allows you to use variables from your message's metadata in its Name and Type patterns.
This gives us a workaround for what you want to do: add a Script Transformation Node that copies attributes.asset_id into the metadata, for example as metadata.asset_id, so you can then use it as ${asset_id} in your Name and Type patterns.
For example, the Transform() method of such a Script Transformation Node should look something like this:
function Transform(msg, metadata, msgType) {
    // assuming your desired id is msg.attributes.asset_id, add it to the metadata
    metadata.asset_id = msg.attributes.asset_id;
    // return the message, in your case to the Create Relation Node
    return {msg: msg, metadata: metadata, msgType: msgType};
}
Finally, your Rule should be connected like this:
[Input] -> [Script Node] -> [Create Relation Node] -> [...whatever else you like]

Azure Data Factory: get data for "For Each" component from query

The situation is as follows: I have a table in my database that receives about 3 million rows each day. We want to archive this table on a regular basis, so that only the 8 most recent weeks are in the table. The rest of the data can be archived to Azure Data Lake.
I already found out how to do this one day at a time. But now I want to run this pipeline each week for the first seven days in the table. I assume I should do this with the "For Each" component. It should iterate over the seven distinct dates that are present in the dataset I want to back up. This dataset is copied from the source table to an archive table beforehand.
It's not difficult to get the distinct dates with a SQL query, but how do I get the result of this query into an array that can be used by the "For Each" component?
The issue is solved thanks to a co-worker.
What we have to do is assign a parameter to the dataset of the sink. It does not matter how you name it, and you do not have to assign a value to it. But let's assume this parameter is called "Date".
After that you can use this parameter in the filename of the sink (also in the dataset) by using "@dataset().Date".
After that you go back to the copy activity, and in the sink you assign the dataset property the value @item().DateSelect. (DateSelect is the field name from the array that is passed to the For Each activity.)
See also the answer from Bo Xioa, which forms part of the solution.
This way it works perfectly. It's just a shame that this is not well documented.
You can use the Lookup activity to fetch the column content, and the output will look like:
{
    "count": "2",
    "value": [
        {
            "Id": "1",
            "TableName": "Table1"
        },
        {
            "Id": "2",
            "TableName": "Table2"
        }
    ]
}
Then you can pass the value array to the ForEach activity's items field by using the pattern @activity('MyLookupActivity').output.value
ref doc: Use the Lookup activity result in a subsequent activity
I post this as an answer, because the error does not fit into a comment :D
I have seen another option to accomplish this, which is executing a pipeline from another pipeline. That way I can define the dates that I should iterate over as a parameter of the second pipeline (learn.microsoft.com/en-us/azure/data-factory/…). But unfortunately this leads to the same result as when just using the ForEach parameter, because in the filename of my data lake file I have to use: @{item().columname}. I can see in the monitoring view that the right values are passed in the iteration steps, but I keep getting an error:
{
"errorCode": "2200",
"message": "Failure happened on 'Sink' side. ErrorCode=UserErrorFailedFileOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The request to 'Unknown' failed and the status code is 'BadRequest', request id is ''. {\"error\":{\"code\":\"BadRequest\",\"message\":\"A potentially dangerous Request.Path value was detected from the client (:). Trace: cf3b4c3f-1681-4073-b225-17e1c07ec76d Time: 2018-08-02T05:16:13.2141897-07:00\"}} ,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Net.WebException,Message=The remote server returned an error: (400) Bad Request.,Source=System,'",
"failureType": "UserError",
"target": "CopyDancerDatatoADL"
}

Neo4j + PopotoJS: filter graph based on predefined constraints

I have a question about queries based on predefined constraints in PopotoJS. In this example, the graph can be filtered based on the constraints defined in the search boxes. In the sample file in that example's visualizations folder, a constraint is only defined for the "Person" node. It is specified in the sample HTML file like the following:
"Person": {
"returnAttributes": ["name", "born"],
"constraintAttribute": "name",
// Return a predefined constraint that can be edited in the page.
"getPredefinedConstraints": function (node) {
return personPredefinedConstraints;
},
....
In my graph I would like to apply that query function to more than one node. For example, I have two node types: Contact (which has a "name" attribute) and Delivery (which has an "address" attribute).
I managed this by defining two functions, one for each node type. However, I also had to put in two search box forms with different input ids (like constraint1 and constraint2), and I had to make the queries in the associated search boxes.
Is there a way to make queries that are defined for multiple nodes work from one search box? For example, searching Contact name and/or Delivery address in the same search box?
Thanks
First I’d like to specify that the predefined constraints feature is still experimental (but fully functional) and doesn’t have any documentation yet.
It is intended to be used in the configuration to filter the data displayed in nodes, and in the example the use of search boxes is just to show dynamically how it works.
A common use of this feature would be to add the list of predefined constraints you want in the configuration for every node type.
Let's take an example:
With the following configuration example the graph will be filtered to show only Person nodes having "born" attribute and only Movie nodes with title in the provided list:
"Person": {
"getPredefinedConstraints": function (node) {
return ["has($identifier.born)"];
},
...
}
"Movie": {
"getPredefinedConstraints": function (node) {
return ["$identifier.title IN [\"The Matrix\", \"The Matrix Reloaded\", \"The Matrix Revolutions\"]"];
},
...
}
The $identifier variable is then replaced during query generation with the corresponding node identifier. In this case the generated query would look like this:
MATCH (person:`Person`) WHERE has(person.born) RETURN person
In your case, if I understood your question correctly, you are trying to use this feature to implement a search box to filter the data. I'm still working on that feature but it won't be available soon :(
This is a workaround, but maybe it could work in your use case: you could keep the search box value in a variable:
var value = d3.select("#constraint")[0][0].value;
inputValue = value;
Then use it in the predefined constraint of all the nodes type you want.
In this example Person will be filtered based on the name attribute and Movie on title:
"Person": {
"getPredefinedConstraints": function (node) {
if (inputValue) {
return ["$identifier.name =~ '(?i).*" + inputValue + ".*'"];
} else {
return [];
}
},
...
}
"Movie": {
"getPredefinedConstraints": function (node) {
if (inputValue) {
return ["$identifier.title =~ '(?i).*" + inputValue + ".*'"];
} else {
return [];
}
},
...
}
Everything is in the HTML page of this example so you can view the full source directly on the page.
@Popoto, thanks for the descriptive reply. I tried your suggestion and it worked pretty well. With the actual code, when I make a query it shows only the queried node and sets the other node counts to zero. I wanted to make a query that filters only the related node while the counts of the other nodes stay the same.
I tried a temporary solution for my problem. What I did is:
Export all the node data to JSON files, search for my query constraint in the exported JSONs, and if the value exists in the JSON, run the query on the related node; if not, do nothing.
That way, of course, I needed to define many functions with different variable names (as many as there are node types). Anyhow, it is not a proper way, but it works for now.

MyBatis set fetch size on ResultSet as out parameter of procedure

I have a stored procedure that I need to call using MyBatis, and I managed to call it. The procedure has multiple OUT parameters, and one of them is an Oracle cursor. I need to iterate over that Oracle cursor, but when I do so without any fine-tuning of the JDBC driver via the fetchSize attribute, it goes row by row and is very slow.
I am able to set the fetchSize attribute on the procedure call:
<select id="getEvents" statementType="CALLABLE" parameterMap="eventInputMap" fetchSize="1000">
{call myProc(?, ?, ?, ?, ?)}
</select>
But this doesn't help at all. I think it doesn't work because of the multiple OUT parameters: the program doesn't know which OUT parameter the fetch size should be applied to. Is there any way to set the fetch size on the ResultSet (Oracle cursor)? When I use CallableStatement from the java.sql package directly, I am able to set the fetch size on the ResultSet.
Here are mapping files and procedure call from program:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper
PUBLIC "-//ibatis.apache.org//DTD Mapper 3.0//EN"
"http://ibatis.apache.org/dtd/ibatis-3-mapper.dtd">
<mapper namespace="mypackage.EventDao">
    <resultMap id="eventResult" type="Event">
        <result property="id" column="event_id" />
        <result property="name" column="event_name" />
    </resultMap>
    <parameterMap id="eventInputMap" type="map">
        <parameter property="pnNetworkId" jdbcType="NUMERIC" javaType="java.lang.Integer" mode="IN"/>
        <parameter property="pvUserIdentityId" jdbcType="VARCHAR" javaType="java.lang.String" mode="IN"/>
        <parameter property="result" resultMap="eventResult" jdbcType="CURSOR" javaType="java.sql.ResultSet" mode="OUT" />
        <parameter property="success" jdbcType="INTEGER" javaType="java.lang.Integer" mode="OUT"/>
        <parameter property="message" jdbcType="VARCHAR" javaType="java.lang.String" mode="OUT"/>
    </parameterMap>
    <select id="getEvents" statementType="CALLABLE" parameterMap="eventInputMap" fetchSize="1000">
        {call myProc(?, ?, ?, ?, ?)}
    </select>
</mapper>
And call from program:
SqlSession session = sqlSessionFactory.openSession();
Map<String, Object> eventInputMap = new HashMap<String, Object>();
try {
    EventDao ed = session.getMapper(EventDao.class);
    eventInputMap.put("pnNetworkId", networkId);
    eventInputMap.put("pvUserIdentityId", identityId);
    eventInputMap.put("success", 0);
    eventInputMap.put("message", null);
    ed.getEvents(eventInputMap);
    session.selectList("EventDao.getEvents", eventInputMap);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    session.close();
}
Thanks in advance!
The provided code works. I have even checked three ways to write it: like here with a parameterMap, without a parameterMap (mapping directly in the statement), and through annotations; everything works.
I used to think the fetchSize setting was not propagated from the main statement to the OUT parameter ResultSet, until I actually tested it recently.
To see whether the fetch size is used and how much effect it has, the result must contain a large enough number of rows. And of course, the worse the latency between the app and the DB, the more noticeable the effect.
For my test, the cursor returned by the procedure contained 5400 rows of 120 columns (but the most important factor is the row count).
To give an order of magnitude, I measured fetching times, i.e. from the stored procedure return to the statement return, with a result list filled with data fetched from the cursor. Then I log the instantiation of the first mapped object; this occurs near the beginning of the overall fetch, probably just after the first fetch:
public static boolean firstInstance = true;

public Item() {
    if (firstInstance) {
        LOGGER.debug("Item first instance");
        firstInstance = false;
    }
}
And I log again just at the end, after the session.selectList returns.
This is for test purposes only. Do not leave that in your code; find a cleaner way to do it.
Here are some timing depending on the configured fetch size:
- fetchSize=1 => 13000 ms
- fetchSize=10 => 5300 ms
- fetchSize=100 => 3800 ms
- fetchSize=300 => 3700 ms
- fetchSize=500 => 3650 ms
- fetchSize=1000 => 3600 ms
Oracle JDBC driver default fetchSize is 10.
Testing with fetchSize=1 proves the supplied setting is actually used.
With 100, about 30% is saved here. Beyond that, the gain is negligible (for this use case and environment).
Anyway, it would be interesting to be able to know when the procedure execution finishes and when the result fetch starts.
Unfortunately, MyBatis logs very little. I thought a custom result handler could help, but looking at the source code of the class org.apache.ibatis.executor.resultset.DefaultResultSetHandler, I notice that, unlike the method handleResultSet (used for simple select statements), which allows using a custom result handler, the method handleRefCursorOutputParameter (used here for the procedure OUT cursor) does not. So there is no point in trying to pass a custom result handler: it will be ignored.
I am interested in a solution if anyone has one. But it seems an evolution request will be required.

Storing graph-like structure in Couch DB or do include_docs yourself

I am trying to store a network layout in CouchDB, but my solution produces a rather randomized graph.
I store nodes with a document:
{_id ,
nodeName,
group}
and store links in the traditional way:
{_id, source_id, target_id, value}
Following multiple tutorials on handling joins and multiple relationships in CouchDB, I created this view:
function(doc) {
    if (doc.type == 'connection') {
        if (doc.source_id)
            emit("source", {'_id': doc.source_id});
        if (doc.target_id)
            emit("target", {'_id': doc.target_id});
    }
}
which should have emitted a sequence of source and target ids. I then pass it to a list function with include_docs=true which, assuming that source and target come in pairs, stitches everything back into a structure like this:
{
    "nodes": [
        {"nodeName": "Name 1", "group": "1"},
        {"nodeName": "Name 2", "group": "1"}
    ],
    "links": [
        {"source": 7, "target": 0, "value": 1},
        {"source": 7, "target": 5, "value": 1}
    ]
}
Although my list function produces proper JSON, the view map returns a number of rows of source docs and then the target docs.
So far I don't have any idea how to make this work properly. I am happy to fetch additional values from the document _id in the list function, but so far I haven't found any good examples.
Alternative ways of achieving the same goal are welcome. The _id values are standard CouchDB ones so far.
Update: while writing the question I came up with a different view which solved my immediate problem, but I would still like to see other options.
updated map:
function(doc) {
    if (doc.type == 'connection') {
        if (doc.source_id)
            emit([doc._id, 0, "source"], {'_id': doc.source_id});
        if (doc.target_id)
            emit([doc._id, 1, "target"], {'_id': doc.target_id});
    }
}
Your updated map function makes more sense. However, you don't need the 0 and 1 in your key, since you already have "source" and "target".
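If the list function keeps fighting you, the stitching can also be done client-side: querying the view with include_docs=true makes CouchDB resolve each emitted {'_id': ...} value into the linked node document. Here is a rough Python sketch of that approach, assuming your updated map function and hypothetical names for the database (network), design document (graph) and view (connections):

import requests

# Hypothetical database / design document / view names.
VIEW_URL = "http://localhost:5984/network/_design/graph/_view/connections"

rows = requests.get(VIEW_URL, params={"include_docs": "true"}).json()["rows"]

nodes, node_index, links = [], {}, []

def node_position(doc):
    # Register each node document once and return its index in the nodes list.
    if doc["_id"] not in node_index:
        node_index[doc["_id"]] = len(nodes)
        nodes.append({"nodeName": doc["nodeName"], "group": doc["group"]})
    return node_index[doc["_id"]]

# The map function emits a "source" row followed by a "target" row per
# connection document, so the rows can be walked in pairs.
for source_row, target_row in zip(rows[0::2], rows[1::2]):
    links.append({
        "source": node_position(source_row["doc"]),
        "target": node_position(target_row["doc"]),
        "value": 1,  # the connection's own value is not emitted by the view; placeholder
    })

graph = {"nodes": nodes, "links": links}

The same pairing logic would work inside your list function; the only real requirement is that the two rows of each connection stay adjacent, which your key already guarantees.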
