SAP HANA GraphScript - performance degradation when the result has many rows - stored-procedures

I've run into significant performance degradation of a stored procedure when using HANA GraphScript.
My task is the following: I'm doing a BFS traversal on a graph using the standard BFS feature of HANA SP03. My graph is pretty dense and the result can easily reach several thousand rows.
CREATE PROCEDURE "MY_PROC" (IN word VARCHAR(100), IN category VARCHAR(100), OUT res "RESULT" DEFAULT EMPTY)
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
    Graph g = Graph("SCHEMA1","MYGRAPH");
    Multiset<Edge> filteredEdges = Multiset<Edge>(:g);
    TRAVERSE BFS :g FROM Vertex(:g, :word)
    ON VISIT EDGE (Edge e) {
        Vertex sourceV = SOURCE(:e);
        IF (:sourceV."WORD" != :word) {
            filteredEdges = :filteredEdges UNION {:e};
        }
    };
    -- copy all results into the output object
    res = SELECT :e."TARGET", :e."CATEGORY_ID" FOREACH e IN :filteredEdges;
END;
I'm returning a TABLE type and copy the results with the SELECT ... FOREACH statement above, which is pretty much the simplest thing possible as per the tutorial.
It takes up to 10 seconds in my environment to prepare that result, which is obviously not acceptable. I've measured the running time of all the other parts combined and it is in the tens of milliseconds. When the result collection has only a few hundred records, the running time becomes moderate, around 100-200 milliseconds.
Is there a faster way of returning thousands of rows from GraphScript? I have a lot of liberty in my implementation, so I'll consider any approach that works. What I need in the OUT parameter is a collection of some attributes of vertices and edges.
Thanks in advance

I think I got the answer, thanks to the SAP HANA team.
There are several key ideas:
narrow the initial graph down to the smallest possible subgraph using Subgraph
use a custom BFS via NEIGHBORS instead of the standard TRAVERSE BFS; just set the depth limit big enough
use UNION ALL instead of UNION if the logic allows - it's faster
So the initial procedure transforms into something like this; the running time dropped to tens of milliseconds:
CREATE PROCEDURE "MY_PROC" (IN word VARCHAR(100), IN category VARCHAR(100), IN is_direct_category BOOLEAN, OUT res "TARGET_CATEGORY_RESULT" DEFAULT EMPTY)
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
    Graph g_all = Graph("SCHEMA1","MYGRAPH");
    Vertex startV = Vertex(:g_all, :word);
    Multiset<Vertex> m_reachable = NEIGHBORS(:g_all, :startV, 0, 100);
    Graph g = Subgraph(:g_all, :m_reachable);
    if (:is_direct_category == TRUE) {
        Multiset<Edge> properEdges = e in Edges(:g) where :e."CATEGORY_ID" == :category;
        Graph res_g = Subgraph(:g, :properEdges);
        Multiset<Edge> e_res = Edges(:res_g);
        res = SELECT :hypoEdge."TARGET", :hypoEdge."CATEGORY_ID" FOREACH hypoEdge IN :e_res;
    } else {
        Multiset<Edge> e_res = Edges(:g);
        res = SELECT :hypoEdge."TARGET", :hypoEdge."CATEGORY_ID" FOREACH hypoEdge IN :e_res;
    }
END;

Related

Performance of accessing table via reference vs ipairs loop

I'm modding a game and I'd like to optimize my code, if possible, for a frequently called function. The function will look into a dictionary table (consisting of an estimated 10-100 entries). I'm considering two patterns: a) direct reference and b) lookup with ipairs:
PATTERN A
tableA = { ["moduleName.propertyName"] = { some stuff } } -- the key is a string with dot inside, hence the quotation marks
result = tableA["moduleName.propertyName"]
PATTERN B
function lookup(type)
  local result
  for i, obj in ipairs(tableB) do
    if obj.type == type then
      result = obj
      break
    end
  end
  return result
end
***
tableB = {
  [1] = {
    type = "moduleName.propertyName",
    ... some stuff ...
  }
}
result = lookup("moduleName.propertyName")
Which pattern should be faster on average? I'd expect the 'native' referencing to be faster (it is certainly much neater), but maybe this is a silly assumption? I'm able to sort tableB (to some extent) in order of lookup frequency, whereas (as I understand it) tableA's internal order in Lua is effectively random by default, even if I declare the keys in a particular order.
A lookup table will always be faster than searching a table every time.
For 100 elements that's one indexing operation compared to up to 100 loop cycles, iterator calls, conditional statements...
It is questionable, though, whether you would notice a difference in your application with so few elements.
So if you build that data structure for this purpose only, go with a lookup table right away.
If you already have this data structure for other purposes and you just want to look something up once, traverse the table with a loop.
If you have this structure already and you need to look values up more than once, build a lookup table for that purpose.
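For example, a minimal sketch of that last case: building a one-time index over tableB from the question (keyed by its type field), so every later lookup is a single hash access like pattern A:
-- Build the index once; each entry of tableB is stored under its 'type' key.
local indexByType = {}
for _, obj in ipairs(tableB) do
  indexByType[obj.type] = obj
end
-- Later lookups are constant-time.
local result = indexByType["moduleName.propertyName"]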

How to properly use apoc.periodic.iterate to reduce heap usage for large transactions?

I am trying to use apoc.periodic.iterate to reduce heap usage when doing very large transactions in a Neo4j database.
I've been following the advice given in this presentation.
BUT, my results differ from those observed in the slides.
First, some notes on my setup:
Using Neo4j Desktop, graph version 4.0.3 Enterprise, with APOC 4.0.0.10
I'm calling queries using the .NET Neo4j Driver, version 4.0.1.
neo4j.conf values:
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g
Here is the Cypher query I'm running:
CALL apoc.periodic.iterate(
"UNWIND $nodes AS newNodeObj RETURN newNodeObj",
"CREATE(n:MyNode)
SET n = newNodeObj",
{batchSize:2000, iterateList:true, parallel:false, params: { nodes: $nodes_in } }
)
And the line of C#:
var createNodesResCursor = await session.RunAsync(createNodesQueryString, new { nodes_in = nodeData });
where createNodesQueryString is the query above, and nodeData is a List<Dictionary<string, object>> where each Dictionary has just three entries: 2 strings, 1 long.
When attempting to run this to create 1.3 million nodes, I observe the heap usage (via JConsole) going all the way up to the 4 GB available and bouncing back and forth between ~2.5 GB and 4 GB. Reducing the batch size makes no discernible difference, and raising heap.max_size just causes the heap usage to climb to almost that value. It's also really slow, taking 30+ minutes to create those 1.3 million nodes.
Does anyone have any idea what I may be doing wrong, or differently from the linked presentation? I understand my query is doing a CREATE whereas the presentation only updates an already loaded dataset, but I can't imagine that's the reason my heap usage is so high.
Thanks
My issue was that, although I was using apoc.periodic.iterate, I was still uploading the entire 1.3 million node data set to the database as a single query parameter!
Modifying my code to do the batching myself, as follows, fixed both the heap usage problem and the slowness:
const int batchSize = 2000;
for (int count = 0; count < nodeData.Count; count += batchSize)
{
    string createNodesQueryString = @"
        UNWIND $nodes_in AS newNodeObj
        CREATE (n:MyNode)
        SET n = newNodeObj";
    int length = Math.Min(batchSize, nodeData.Count - count);
    var createNodesResCursor = await session.RunAsync(createNodesQueryString,
        new { nodes_in = nodeData.GetRange(count, length) });
    var createNodesResSummary = await createNodesResCursor.ConsumeAsync();
}

SPARK - Joining two data streams - maintenance of cache

It is evident that the out-of-the-box join capability in Spark Streaming does not cover a lot of real-life use cases, the reason being that it joins only the data contained in the micro-batch RDDs.
The use case is to join data from two Kafka streams and enrich each object in stream1 with its corresponding object in stream2 in Spark, then save it to HBase.
The implementation would:
maintain a dataset in memory built from the objects in stream2, adding or replacing objects as and when they are received
for every element in stream1, access the cache to find a matching object from stream2, save it to HBase if a match is found, or put it back on the Kafka stream if not.
This question is about exploring Spark Streaming and its API to find a way to implement the above.
You can join the incoming RDDs to other RDDs -- not just the ones in that micro-batch. Basically you keep a "running total" RDD that you keep building up, something like this:
var globalRDD1: RDD[...] = sc.emptyRDD
var globalRDD2: RDD[...] = sc.emptyRDD
dstream1.foreachRDD(rdd => if (!rdd.isEmpty) globalRDD1 = globalRDD1.union(rdd))
dstream2.foreachRDD(rdd => if (!rdd.isEmpty) {
  globalRDD2 = globalRDD2.union(rdd)
  globalRDD1.join(globalRDD2).foreach(...) // etc, etc
})
A good start would be to look into mapWithState. This is a more efficient replacement for updateStateByKey. These are defined on PairDStreamFunctions, so assuming your objects of type V in stream2 are identified by some key of type K, your first point would go like this:
def stream2: DStream[(K, V)] = ???
def maintainStream2Objects(key: K, value: Option[V], state: State[V]): (K, V) = {
  value.foreach(state.update(_))
  (key, state.get())
}
val spec = StateSpec.function(maintainStream2Objects)
val stream2State = stream2.mapWithState(spec)
stream2State is now a stream that emits, for each key seen in a batch, a (K, V) pair carrying the latest value for that key; the full accumulated state is available via stream2State.stateSnapshots(). You can join this with stream1 to perform the further logic for your second point.
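For that second point, a minimal sketch under the same assumptions (types K and V as above, plus a hypothetical payload type W for stream1, which is assumed to be keyed the same way). stateSnapshots() is used so the join sees every key accumulated so far rather than only the keys updated in the current batch, and leftOuterJoin keeps unmatched stream1 elements so they can be sent back to Kafka:
def stream1: DStream[(K, W)] = ???

// Full accumulated (K, V) state, re-emitted once per batch.
val stream2Snapshot: DStream[(K, V)] = stream2State.stateSnapshots()

stream1.leftOuterJoin(stream2Snapshot).foreachRDD { rdd =>
  rdd.foreach {
    case (key, (obj1, Some(obj2))) => () // enrich obj1 with obj2 and save to HBase
    case (key, (obj1, None))       => () // no match yet: put obj1 back on the Kafka topic
  }
}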

Move properties from relation to node in Neo4J on large datasets

I'm trying to move a property I've set up on a relationship in Neo4j to one of its end nodes, because I want to index that property, and as of version 2.2.5, which I am using, indexing relationship properties is not possible.
However, when I try to move it via the Cypher command MATCH (k)-[l]->(m) SET m.key = l.key, my request fails due to a lack of memory. I have no possibility to add more memory either.
Does anyone know of a good way to do this without needing lots of memory when dealing with large datasets (ca. 20M rows)?
If it's a one-time operation, I highly recommend writing an unmanaged extension.
It will be much faster than Cypher.
Here is an example:
Label startNodeLabel = DynamicLabel.label("StartNode");
Label endNodeLabel = DynamicLabel.label("EndNode");
RelationshipType relationshipType = DynamicRelationshipType.withName("RelationshipType");
String nodeProperty = "nodeProperty";
String relationshipProperty = "relationshipProperty";

try (Transaction tx = database.beginTx()) {
    final ResourceIterator<Node> nodes = database.findNodes(startNodeLabel);
    for (Node startNode : IteratorUtil.asCollection(nodes)) {
        if (startNode.hasRelationship(relationshipType, Direction.OUTGOING)) {
            final Iterable<Relationship> relationships = startNode.getRelationships(relationshipType, Direction.OUTGOING);
            for (Relationship relationship : relationships) {
                final Node endNode = relationship.getOtherNode(startNode);
                if (endNode.hasLabel(endNodeLabel)) {
                    // Copy the relationship property onto the end node.
                    endNode.setProperty(nodeProperty, relationship.getProperty(relationshipProperty));
                }
            }
        }
    }
    tx.success();
}
If you do not want to go for an unmanaged extension because you are moving the properties as a one-time task, you can also write, for example, a shell script that calls curl in a loop with SKIP and LIMIT. This has the advantage that you don't need to move the values but can simply copy them.
MATCH (k)-[l]->(m)
WITH l, m SKIP 200000 LIMIT 100000
SET m.key = l.key
RETURN COUNT(*) AS nRows
Replace 200000 with the value of the loop variable.
You can use LIMIT to restrict the query to a specific number of rows, and then repeat the query until no more rows are returned. That will also limit the amount of memory used.
For example, if you also wanted to remove the key property from the relationship at the same time (and you wanted to process 100K rows each time):
MATCH (k)-[l]->(m)
WHERE HAS(l.key)
WITH l, m
LIMIT 100000
SET m.key = l.key
REMOVE l.key
RETURN COUNT(*) AS nRows;
This query will return an nRows value of 0 when you are done.

How to reduce Azure Table Storage latency?

I have a rather huge table on Azure (30 million rows, between 5 and 100 KB each).
Each RowKey is a GUID and the PartitionKey is the first part of the GUID, for example:
PartitionKey = "1bbe3d4b"
RowKey = "1bbe3d4b-2230-4b4f-8f5f-fe5fe1d4d006"
The table gets 600 reads and 600 writes (updates) per second with an average latency of 60 ms. All queries use both PartitionKey and RowKey.
BUT, some reads take up to 3000 ms (!). On average, more than 1% of all reads take more than 500 ms, and there's no correlation with entity size (a 100 KB row may be returned in 25 ms and a 10 KB one in 1500 ms).
My application is an ASP.NET MVC 4 web site running on 4-5 Large instances.
I have read all the MSDN articles regarding Azure Table Storage performance targets and have already done the following:
UseNagle is turned off
Expect100Continue is also disabled
MaxConnections for the table client is set to 250 (raising it to 1000–5000 makes no difference); these are applied as sketched below
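In code, roughly (a minimal sketch assuming the global ServicePointManager properties from System.Net; the same can also be set per service point for the table endpoint):
// Applied once at application start-up, before any table clients are created.
ServicePointManager.UseNagleAlgorithm = false;
ServicePointManager.Expect100Continue = false;
ServicePointManager.DefaultConnectionLimit = 250;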
Also I checked that:
Storage account monitoring counters have no throttling errors
There are "waves" of sorts in performance, though they do not depend on load
What could be the reason for such performance issues, and how can I improve them?
I use the MergeOption.NoTracking setting on the DataServiceContext.MergeOption property for extra performance if I have no intention of updating the entity anytime soon. Here is an example:
var account = CloudStorageAccount.Parse(RoleEnvironment.GetConfigurationSettingValue("DataConnectionString"));
var tableStorageServiceContext = new AzureTableStorageServiceContext(account.TableEndpoint.ToString(), account.Credentials);
tableStorageServiceContext.RetryPolicy = RetryPolicies.Retry(3, TimeSpan.FromSeconds(1));
tableStorageServiceContext.MergeOption = MergeOption.NoTracking;
tableStorageServiceContext.AddObject(AzureTableStorageServiceContext.CloudLogEntityName, newItem);
tableStorageServiceContext.SaveChangesWithRetries();
Another problem might be that you are retrieving the entire entity with all its properties even though you intend to use only one or two of them - this is of course wasteful but can't easily be avoided. However, if you use Slazure then you can use query projections to retrieve only the entity properties that you are interested in from the table storage and nothing more, which would give you better query performance. Here is an example:
using SysSurge.Slazure;
using SysSurge.Slazure.Linq;
using SysSurge.Slazure.Linq.QueryParser;
namespace TableOperations
{
    public class MemberInfo
    {
        public string GetRichMembers()
        {
            // Get a reference to the table storage
            dynamic storage = new QueryableStorage<DynEntity>("UseDevelopmentStorage=true");

            // Build the table query and make sure it only returns members that earn more than $60k/yr
            // by using a "Where" query filter, and make sure that only the "Name" and
            // "Salary" entity properties are retrieved from the table storage to make the
            // query quicker.
            QueryableTable<DynEntity> membersTable = storage.WebsiteMembers;
            var memberQuery = membersTable.Where("Salary > 60000").Select("new(Name, Salary)");

            var result = "";

            // Cast the query result to a dynamic so that we can access its dynamic properties
            foreach (dynamic member in memberQuery)
            {
                // Show some information about the member
                result += "LINQ query result: Name=" + member.Name + ", Salary=" + member.Salary + "<br>";
            }

            return result;
        }
    }
}
Full disclosure: I coded Slazure.
You could also consider pagination if you are retrieving large data sets, for example:
// Retrieve 50 members but also skip the first 50 members
var memberQuery = membersTable.Where("Salary > 60000").Take(50).Skip(50);
Typically, if a specific query requires scanning a large number of rows, it will take longer. Is the behavior you are seeing specific to a query or to particular data? Or are you seeing the performance vary for the same data and query?
