Dynamic targetHits in Vespa YQL

I'm trying to create a Vespa query where I would like to set the rate limit of the targetHits. For example, the query below has a constant targetHits of 3:
'yql': 'select id, title from sources * where \
([{"targetHits":3}]nearestNeighbor(embeddings_vector,query_embeddings_vector));'
Is there some way I can set this number dynamically in every call?
Also, what is the difference between hits and targetHits? Does it have to do with a minimum versus a desired number of hits?
Thanks a lot.

I'm not sure what you mean by the rate limit of targetHits, but generally:
targetHits is per content node if you run Vespa as a multi-node cluster.
targetHits is the number of hits you want to expose to the ranking profile's first-phase ranking function.
hits only controls how many hits to return in the SERP response. It's perfectly valid to ask for a targetHits of 500 per content node for ranking and finally return just the global best 10 (according to your ranking profile).

Is there some way I can set this number dynamically in every call?
You can modify the YQL you send of course, but a better way to do this is often to create a Searcher as part of your application that modifies the query programmatically, e.g.:
import com.yahoo.prelude.query.Item;
import com.yahoo.prelude.query.NearestNeighborItem;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;

public class TargetHitsSearcher extends Searcher {

    @Override
    public Result search(Query query, Execution execution) {
        Item root = query.getModel().getQueryTree().getRoot();
        if (root instanceof NearestNeighborItem) { // In general: search the tree recursively
            Integer target = query.properties().getInteger("myTarget"); // e.g. passed as &myTarget=200
            if (target != null)
                ((NearestNeighborItem) root).setTargetNumHits(target);
        }
        return execution.search(query);
    }
}
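With a searcher like this in the application's search chain, the value can be supplied per request as an ordinary query property, e.g. by adding myTarget=200 to the search request. A minimal usage sketch under those assumptions (the property name myTarget and the stub execution are illustrations, not something Vespa mandates):
// Hedged sketch: the property behaves like any other request parameter (&myTarget=200),
// and hits can still be set independently to control how many results are returned.
Query query = new Query("/search/?myTarget=200&hits=10");
// ... set up the YQL / nearestNeighbor query tree as usual, then run the chain:
Result result = new Execution(new TargetHitsSearcher(), Execution.Context.createContextStub()).search(query);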

Related

Write a particular PCollection to BigQuery

Suppose I create two output PCollections as a result of SideOutputs, and depending on some condition I want to write only one of them to BigQuery. How do I do this?
Basically my use case is that I'm trying to make Write_Append and Write_Truncate dynamic. I fetch the information (append/truncate) from a config table that I maintain in BigQuery. So depending on what I have in the config table, I must apply Truncate or Append.
So using SideOutputs I was able to create two PCollections (Append and Truncate respectively), one of which will be empty. The one which has all the rows must be written to BigQuery. Is this approach correct?
The code that I'm using:
// Tag for the main output, which receives rows when the config says "truncate".
final TupleTag<TableRow> truncate = new TupleTag<TableRow>(){};
// Tag for the additional output, which receives rows when the config says "append".
final TupleTag<TableRow> append = new TupleTag<TableRow>(){};

PCollectionTuple results = read.apply("convert to table row",
    ParDo.of(new DoFn<String, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String value = c.sideInput(configView).get(0).toString();
            LOG.info("config: " + value);
            if (value.equals("truncate")) {
                LOG.info("outputting to truncate");
                c.output(new TableRow().set("color", c.element())); // main output = truncate tag
            } else {
                LOG.info("outputting to append");
                c.output(append, new TableRow().set("color", c.element()));
            }
        }
    }).withSideInputs(configView).withOutputTags(truncate, TupleTagList.of(append)));
results.get(truncate).apply("truncate",BigQueryIO.writeTableRows()
.to("projectid:datasetid.tableid")
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
results.get(append).apply("append",BigQueryIO.writeTableRows()
.to("projectid:datasetid.tableid")
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
I need to perform only one of the two. If I do both, the table is going to get truncated anyway.
P.S. I'm using the Java SDK (Apache Beam 2.1).
I believe you are right: if your pipeline includes a write to a BigQuery table with WRITE_TRUNCATE at all, the table will currently get truncated even if there's no data. Feel free to file a JIRA to request more configurable behavior in this case.
So if you want it to conditionally not get truncated, you need to conditionally not include that write transform at all. Is there a way to push the condition to that level, or does the condition actually have to be computed from other data in the pipeline?
(The only workaround I can think of is to use DynamicDestinations to dynamically choose the name of the table to truncate, and truncate some other dummy empty table instead - I can elaborate on this more after your answer to the previous paragraph.)
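If the condition can indeed be pushed to pipeline-construction time, here is a minimal sketch of that first option, assuming the config value can be read with a plain BigQuery client call before the pipeline is built (readConfigValue(), rows, and schema are placeholders, not from the original post):
// Hedged sketch: decide the disposition before building the pipeline, so only one
// write transform is ever attached and no unconditional WRITE_TRUNCATE exists.
BigQueryIO.Write.WriteDisposition disposition =
        "truncate".equals(readConfigValue())  // hypothetical pre-pipeline config lookup
                ? BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE
                : BigQueryIO.Write.WriteDisposition.WRITE_APPEND;
rows.apply("write", BigQueryIO.writeTableRows()
        .to("projectid:datasetid.tableid")
        .withSchema(schema)
        .withWriteDisposition(disposition)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));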

How to reduce Azure Table Storage latency?

I have a rather huge (30 million rows, up to 5–100 KB each) table in Azure Table Storage.
Each RowKey is a Guid and the PartitionKey is the first part of the Guid, for example:
PartitionKey = "1bbe3d4b"
RowKey = "1bbe3d4b-2230-4b4f-8f5f-fe5fe1d4d006"
The table gets 600 reads and 600 writes (updates) per second with an average latency of 60 ms. All queries use both PartitionKey and RowKey.
BUT some reads take up to 3000 ms (!). On average, >1% of all reads take more than 500 ms, and there's no correlation with entity size (a 100 KB row may be returned in 25 ms and a 10 KB one in 1500 ms).
My application is an ASP.Net MVC 4 web-site running on 4-5 Large instances.
I have read all the MSDN articles regarding Azure Table Storage performance goals and have already done the following:
UseNagle is turned Off
Expect100Continue is also disabled
MaxConnections for the table client is set to 250 (setting it to 1000–5000 doesn't make any difference)
Also, I have checked that:
Storage account monitoring counters show no throttling errors
There are some kind of "waves" in performance, though they do not depend on load
What could be the reason for such performance issues, and how can I improve it?
I use the MergeOption.NoTracking setting on the DataServiceContext.MergeOption property for extra performance if I have no intention of updating the entity anytime soon. Here is an example:
var account = CloudStorageAccount.Parse(RoleEnvironment.GetConfigurationSettingValue("DataConnectionString"));
var tableStorageServiceContext = new AzureTableStorageServiceContext(account.TableEndpoint.ToString(), account.Credentials);
tableStorageServiceContext.RetryPolicy = RetryPolicies.Retry(3, TimeSpan.FromSeconds(1));
tableStorageServiceContext.MergeOption = MergeOption.NoTracking;
tableStorageServiceContext.AddObject(AzureTableStorageServiceContext.CloudLogEntityName, newItem);
tableStorageServiceContext.SaveChangesWithRetries();
Another problem might be that you are retrieving the entire entity with all its properties even though you intend to use only one or two of them - this is of course wasteful but can't be easily avoided. However, if you use Slazure then you can use query projections to retrieve only the entity properties that you are interested in from the table storage and nothing more, which would give you better query performance. Here is an example:
using SysSurge.Slazure;
using SysSurge.Slazure.Linq;
using SysSurge.Slazure.Linq.QueryParser;

namespace TableOperations
{
    public class MemberInfo
    {
        public string GetRichMembers()
        {
            // Get a reference to the table storage
            dynamic storage = new QueryableStorage<DynEntity>("UseDevelopmentStorage=true");

            // Build the table query and make sure it only returns members that earn more than $60k/yr
            // by using a "Where" query filter, and make sure that only the "Name" and
            // "Salary" entity properties are retrieved from the table storage to make the
            // query quicker.
            QueryableTable<DynEntity> membersTable = storage.WebsiteMembers;
            var memberQuery = membersTable.Where("Salary > 60000").Select("new(Name, Salary)");

            var result = "";

            // Cast the query result to a dynamic so that we can access its dynamic properties
            foreach (dynamic member in memberQuery)
            {
                // Show some information about the member
                result += "LINQ query result: Name=" + member.Name + ", Salary=" + member.Salary + "<br>";
            }

            return result;
        }
    }
}
Full disclosure: I coded Slazure.
You could also consider pagination if you are retrieving large data sets, example:
// Skip the first 50 members and retrieve the next 50
var memberQuery = membersTable.Where("Salary > 60000").Skip(50).Take(50);
Typically, if a specific query requires scanning a large number of rows, it will take longer. Is the behavior you are seeing specific to a query / data? Or are you seeing the performance vary for the same data and query?

How to get total number of db-hits from Cypher query within a Java code?

I am trying to get the total number of db-hits from my Cypher query. For some reason, I always get 0 when calling this:
String query = "PROFILE MATCH (a)-[r]-(b)-[p]-(c)-[q]-(a) RETURN a,b,c";
Result result = database.execute(query);
while (result.hasNext()) {
    result.next();
}
System.out.println(result.getExecutionPlanDescription().getProfilerStatistics().getDbHits());
The database seems to be OK. Is there something wrong with the way I am retrieving this value?
ExecutionPlanDescription is a tree-like structure. Most likely the top element does not directly hit the database by itself, e.g. a projection.
So you need to write a recursive function using ExecutionPlanDescription.getChildren() to drill down to the individual parts of the query plan. E.g. if one of the children (or sub*-children) is a plan of type Expand, you can use plan.getProfilerStatistics().getDbHits().
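For illustration, a minimal sketch of such a recursive function, assuming the embedded org.neo4j.graphdb.ExecutionPlanDescription API used in the question and simply summing db hits over every operator in the plan tree:
// Hedged sketch: walk the whole PROFILE plan tree and add up the db hits of each operator.
private static long totalDbHits(ExecutionPlanDescription plan) {
    long hits = plan.hasProfilerStatistics() ? plan.getProfilerStatistics().getDbHits() : 0;
    for (ExecutionPlanDescription child : plan.getChildren()) {
        hits += totalDbHits(child);
    }
    return hits;
}
// Usage, after the result has been fully consumed:
System.out.println(totalDbHits(result.getExecutionPlanDescription()));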

Filtering by aggregate function

I am trying to raise an event when the average value of a field is over a threshold for a minute. I have the object defined as:
class Heartbeat
{
public string Name;
public int Heartbeat;
}
My condition is defined as
select avg(Heartbeat) , Name
from Heartbeat.std:groupwin(Name).win:time(60 sec)
having avg(Heartbeat) > 100
However, the event never gets fired despite the fact that I fire a number of events with the Heartbeat value over 100. Any suggestions on what I have done wrong?
Thanks in advance
It confuses many people, but since time is the same for all groups you can simplify the query and remove the groupwin. The documentation note in this section explains why: http://esper.codehaus.org/esper-4.11.0/doc/reference/en-US/html_single/index.html#view-std-groupwin
The semantics with or without groupwin are the same.
I think you want group-by (and not groupwin) since group-by controls the aggregation level and groupwin controls the data window level.
New query:
select avg(Heartbeat) , Name from Heartbeat.win:time(60 sec) group by Name having avg(Heartbeat) > 100

Weighted Graph DijkstraShortestPath: getPath() does not return path with least cost

Thanks for the prompt response, chessofnerd and Joshua. I am sorry for the unclear logs and unclear question. Let me rephrase it.
Joshua:
I am storing my weights in a DB and retrieving them from the DB in the transformer.
I have 4 devices connected in my topology; between some devices there are multiple connections, and between 2 devices only a single connection, as shown below.
I am using an undirected weighted graph.
Initially all links are assigned a weight of 0. When I request a path between D1 and D4, I increase the weight of each link on the returned path by 1.
When a second request comes for another path, I am feeding all the weights through the Transformer.
When the request comes the second time, I am correctly feeding a weight of 1 for links L1, L2, L3 and 0 for the other links.
Since the weight of (L4,L5,L3), (L6,L7,L3) or (L8,L9,L3) is less than the weight of (L1,L2,L3), I am expecting to get one of those paths. But I am again getting (L1,L2,L3).
D1---L1-->D2---L2--->D3--L3--->D4
D1---L4-->D2---L5--->D3--L3--->D4
D1---L6-->D2---L7--->D3--L3--->D4
D1---L8-->D2---L9--->D3--L3---->D4
The transformer simply returns the weight previously stored for the link.
// Device is used here as a placeholder for the application's node type; Link is the edge type from the question.
Graph<Device, Link> topology = new UndirectedSparseMultigraph<Device, Link>();
DijkstraShortestPath<Device, Link> pathCalculator = new DijkstraShortestPath<Device, Link>(topology, wtTransformer);
List<Link> path = pathCalculator.getPath(node1, node2);

private final Transformer<Link, Number> wtTransformer = new Transformer<Link, Number>() {
    public Integer transform(Link link) {
        int weight = getWeightForLink(link, true); // weight previously stored in the DB for this link
        return weight;
    }
};
You're creating DijkstraShortestPath so that it caches results. Add a "false" parameter to the constructor to change this behavior.
http://jung.sourceforge.net/doc/api/edu/uci/ics/jung/algorithms/shortestpath/DijkstraShortestPath.html
(And no, the cache does not get poisoned if you change an edge weight; if you do that, it's your responsibility to create a new DSP instance, or not use caching in the first place.)
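Concretely, a minimal sketch of the non-caching construction, using the same placeholder Device node type and Link edge type as above:
// Hedged sketch: the third constructor argument disables result caching, so updated
// edge weights are honored on every getPath() call (see the Javadoc linked above).
DijkstraShortestPath<Device, Link> pathCalculator =
        new DijkstraShortestPath<Device, Link>(topology, wtTransformer, false);
List<Link> path = pathCalculator.getPath(node1, node2);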
