Neo4j Cypher: deadlock exception while inserting bulk relationships

Given the following use case:
(:Product)-[:HAS_PRICE]->(:Price)-[:HAS_CURRENCY]->(:Currency)
There are 1000 products and only one (1) supported currency, say (:Currency {code:'USD'}).
To insert a Price for each Product, a thousand (1000) relationships must be created against the single (:Currency {code:'USD'}) node.
With 3 worker threads receiving the prices and setting them up, the (:Currency {code:'USD'}) node is locked for the other two workers while one worker creates its -[:HAS_CURRENCY]-> edge, which leads to deadlocks.
Implementing a RETRY/BACK-OFF approach avoids some failures, but the retry threshold must be high enough (about 100 in my case) to avoid all deadlocks, and setting a long back-off delay is not worthwhile.
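For reference, a minimal sketch of the retry/back-off loop (the names are illustrative: create_price_tx stands in for the transaction that creates the -[:HAS_CURRENCY]-> edge, and DeadlockError for whatever deadlock exception the driver raises):

import random
import time

class DeadlockError(Exception):
    """Stand-in for the driver's deadlock/transient exception."""

def create_price_tx(product_id):
    """Stand-in for the transaction creating (:Price)-[:HAS_CURRENCY]->(:Currency)."""
    raise NotImplementedError

def with_retry(fn, *args, max_retries=100, base_delay=0.01):
    """Retry fn on deadlock, backing off exponentially with full jitter."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except DeadlockError:
            if attempt == max_retries - 1:
                raise
            # Jitter desynchronizes the workers so they stop colliding
            # on the same (:Currency) node in lockstep.
            time.sleep(base_delay * (2 ** min(attempt, 8)) * random.random())

# usage: with_retry(create_price_tx, product_id)

Even with jitter, every write still converges on the same (:Currency) node, which is why the threshold has to be so high.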
Chris Vest commented on changes to the locking and transaction-isolation behavior in a previous post.
Is there anything we can do to avoid this issue? Any tips around server configuration, data modeling, etc.?
Thanks in advance.

Related

One single Azure SQL query is consuming almost all query_stats.total_worker_time and query_stats.execution_count

I've been running a production website on Azure SQL for 4 years.
With the help of the 'Top Slow Request' query from alexsorokoletov on GitHub, I found one super slow query according to the Azure query stats.
The one on top is the one that uses a lot of CPU.
When looking at the LINQ query and the execution plans / live stats, I can't find the bottleneck yet.
[screenshots of the execution plan and the live stats omitted]
The join from results to project is not direct; there is a projectsession table in between that is not visible in the query, but it is probably added under the hood by Entity Framework.
Could I be affected by parameter sniffing? Can I reset a plan hash? Maybe the query plan was optimized back in 2014, and now the result table holds about 4 million rows and the plan is far from optimal?
If I run this query in Management Studio it's very fast!
Is it just the stats that are wrong?
Regards
Vincent - The Netherlands.
I would suggest you try adding OPTION (HASH JOIN) at the end of the query, if possible. Once you start getting into large arity, a loops join is not particularly efficient. That would prove out whether there is a more efficient plan (likely yes).
Without seeing more of the details (your screenshots are helpful but cut off whether auto-param or forced parameterization has kicked in and auto-parameterized your query), it is hard to confirm/deny this explicitly. You can read more about parameter sniffing in a blog post I wrote a bit longer ago than I care to admit ;) :
https://blogs.msdn.microsoft.com/queryoptteam/2006/03/31/i-smell-a-parameter/
Ultimately, if you update stats, run DBCC FREEPROCCACHE, or otherwise cause this plan to recompile, your odds of getting a faster plan in the cache are higher if this particular query + parameter values is executed often enough to be sniffed during plan compilation. Your other option is to add an OPTIMIZE FOR UNKNOWN hint, which will disable sniffing and direct the optimizer to use an average value for the frequency of any filters over parameter values. This will likely encourage more hash or merge joins instead of loops joins, since the cardinality estimates of the operators in the tree will likely increase.
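To make the placement concrete, the hints go in an OPTION clause at the very end of the statement. A sketch (using Python/pyodbc purely for illustration; the table and column names are guesses based on your description, not your real schema):

import pyodbc

# Placeholder connection string; adjust driver/server/credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server.database.windows.net;DATABASE=your-db;"
    "UID=user;PWD=secret"
)

sql = """
SELECT r.Id
FROM Results AS r
JOIN ProjectSessions AS ps ON ps.Id = r.ProjectSessionId
JOIN Projects AS p ON p.Id = ps.ProjectId
WHERE p.Id = ?
OPTION (OPTIMIZE FOR UNKNOWN);  -- or OPTION (HASH JOIN) to test the join strategy
"""

cursor = conn.cursor()
cursor.execute(sql, 42)  # 42 is an example project id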

Neo4j multiple node labels and performance

Because of my Spring Data Neo4j 4 (SDN4) class hierarchy, I have a lot of Neo4j nodes with ~7 labels on each node.
Should I worry about the performance of my application with this number of labels per node, or do Neo4j labels (and their usage in SDN 4) not impact performance?
Behind every label is an index, so a high number of labels per node will increase the write time for any such node. If you're doing mass updates this will be noticeable, but for a regular application you will hardly notice the difference on writes. For reads it makes no difference.
Hope this helps,
Tom

How to make ActiveRecord updates thread safe in Rails?

I have a Resque queue of jobs. Each job has a batch of events to be processed. A worker processes the jobs and counts the number of events that occurred in each minute. It uses ActiveRecord to get a "datapoint" for that minute, add the number of events to it, and save.
When I have multiple workers processing that queue, I believe there is a concurrency issue: a race condition between getting the datapoint from the database, adding the correct amount, and updating to the new value. I looked into transactions, but I think those only help if the query fails.
My current workaround is only using 1 Resque Worker. I'd like to scale and process jobs faster, though. Any ideas?
Edit: I originally had trouble finding key words to search Google for, but thanks to Robin, I found the answer to my question here.
The correct answer is to use increment_counter, or update_counters if you need to increment multiple attributes or increment by a value other than +1. Both are class methods on your ActiveRecord model.
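Under the hood, both of these issue a single UPDATE that lets the database do the arithmetic, which is why they are safe across multiple workers. A rough sketch of the equivalent statement (using sqlite3 and made-up table/column names purely for illustration):

import sqlite3

conn = sqlite3.connect("metrics.db")  # illustrative database
conn.execute(
    "CREATE TABLE IF NOT EXISTS datapoints "
    "(id INTEGER PRIMARY KEY, events_count INTEGER DEFAULT 0)"
)
conn.execute("INSERT OR IGNORE INTO datapoints (id, events_count) VALUES (1, 0)")

def add_events(datapoint_id, n):
    # The database performs the read-modify-write inside one statement,
    # so concurrent workers cannot lose each other's increments.
    conn.execute(
        "UPDATE datapoints SET events_count = events_count + ? WHERE id = ?",
        (n, datapoint_id),
    )
    conn.commit()

add_events(1, 5)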

Neo4j database very slow to add relationships

I have a Neo4j database with 7340 nodes. Each node has a label (neoplasm) and 2 properties (conceptID and fullySpecifiedName). Autoindexing is enabled on both properties, and I have created a schema index on neoplasm:conceptID and neoplasm:fullySpecifiedName. The nodes are concepts in a terminology tree. There is a single root node and the others descend often via several paths to a depth of up to 13 levels. From a SQL Server implementation, the hierarchy structure is as follows...
Depth   Relationship Count
0       1
1       37
2       360
3       1598
4       3825
5       6406
6       7967
7       7047
8       4687
9       2271
10      825
11      258
12      77
13      3
I am adding the relationships using a C# program and neo4jclient, which constructs and executes Cypher queries like this one...
MATCH (child:neoplasm), (parent:neoplasm)
WHERE child.conceptID = "448257000" AND parent.conceptID="372095001"
CREATE (child)-[:ISA]->(parent)
Adding the relationships up to level 3 was very fast, and level 4 itself was not bad, but at level 5 things started getting very slow, an average of over 9 seconds per relationship.
The example query above was executed through the http://localhost:7474/browser/ interface and took 12917 ms, so the poor execution times are not an artifact of the C# code or the neo4jclient API.
I thought graph databases were supposed to be blindingly fast and that the performance was independent of size.
So far I have added just 9033 out of 35362 relationships. Even if the speed does not degrade further as the number of relationships increases, it will take over three days to add the remainder!
Can anyone suggest why this performance is so bad? Or is write performance of this nature normal, and is it just read performance that is so good? A sample Cypher query to return the parents of a level 5 node returns a list of 23 fullySpecifiedName properties in less time than I can measure with a stopwatch (well under a second)!
When using multiple indexes on labels in the same query, Cypher does not (yet) choose them automatically to speed the query up. Instead, try giving hints to use them; see http://docs.neo4j.org/chunked/milestone/query-using.html#using-query-using-multiple-index-hints
PROFILE
MATCH (child:neoplasm), (parent:neoplasm)
USING INDEX child:neoplasm(conceptID)
USING INDEX parent:neoplasm(conceptID)
WHERE child.conceptID = "448257000" AND parent.conceptID = "372095001"
CREATE (child)-[:ISA]->(parent)
Does that improve things? Also, please post the PROFILE output for better insight.
You said you're using autoindexing. However, your query would use schema indexes, not autoindexes: autoindexes index nodes based on properties and are not tied to labels.
Schema indexes are a new and stunning feature of Neo4j 2.0.
So get rid of the autoindexes and, as Tatham suggested, create schema indexes using:
CREATE INDEX ON :neoplasm(conceptID)
Even with schema indexes, inserting relationships will become slower as your graph grows, since index lookups typically scale at a log(n) level. However, it should be much faster than the times you've observed.
I appear to have found the answer. I restarted the Neo4j database (Neo4j 2.0.0-M06) and got the usual message that Neo4j will be ready in a few seconds. Over half an hour later the status turned green. During that time I was monitoring the process, and it appeared to be rebuilding the Lucene indexes.
I have since tried loading more relationships, and they are now being added at an acceptable rate (~100 ms per relationship).
Thanks for the comments

Representing (and incrementing) relationship strength in Neo4j

I would like to represent the changing strength of relationships between nodes in a Neo4j graph.
For a static graph, this is easily done by setting a "strength" property on the relationship:
A --knows--> B
       |
   strength
       |
       3
However, for a graph that needs updating over time, there is a problem: incrementing the value of the property can't be done atomically via the REST interface, because a read-before-write is required. Incrementing (rather than merely updating) is necessary if the graph is being updated in response to incoming streamed data.
I would need to either ensure that only one REST client reads and writes at once (external synchronization), or stick to only the embedded API so I can use the built-in transactions. This may be workable but seems awkward.
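(If a newer Neo4j is an option, the transactional Cypher endpoint removes the read-before-write entirely, since a single statement can do the increment server-side inside one transaction. A sketch, with illustrative labels, property names, and credentials:)

import requests

statement = (
    "MATCH (a:Person {name: {src}})-[r:KNOWS]->(b:Person {name: {dst}}) "
    "SET r.strength = coalesce(r.strength, 0) + {delta}"
)
payload = {"statements": [{
    "statement": statement,
    "parameters": {"src": "A", "dst": "B", "delta": 1},
}]}
# The whole read-modify-write happens inside one server-side transaction.
resp = requests.post(
    "http://localhost:7474/db/data/transaction/commit",
    json=payload,
    auth=("neo4j", "password"),
)
resp.raise_for_status()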
One other solution might be to record multiple relationships, without any properties, so that the "strength" is actually the count of relationships, i.e.
A knows B
A knows B
A knows B
means a relationship of strength 3.
Disadvantage: only integer strengths can be recorded
Advantage: no read-before-write is required
Disadvantage: (probably) more storage required
Disadvantage: (probably) much slower to extract the value since multiple relationships must be extracted and counted
Has anyone tried this approach, and is it likely to run into performance issues, particularly when reading?
Is there a better way to model this?
Nice idea.
To reduce storage and multi-reads, those relationships could be aggregated into one by a batch job which runs transactionally.
Each rel could also carry an individual weight value, whose aggregated value is used as the weight. It doesn't have to be integer-based and could also be negative to represent decrements.
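A sketch of that aggregation as a single Cypher statement, collapsing parallel relationships into one whose strength is the summed weight (the labels and types are illustrative, and the later official Python driver is used just to show it running in one transaction):

from neo4j import GraphDatabase

AGGREGATE = """
MATCH (a:Person)-[r:KNOWS]->(b:Person)
WITH a, b, collect(r) AS rels, sum(coalesce(r.weight, 1.0)) AS total
WHERE size(rels) > 1
FOREACH (x IN rels | DELETE x)
CREATE (a)-[:KNOWS {strength: total}]->(b)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(AGGREGATE)  # auto-commit: the whole collapse runs in one transaction
driver.close()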
You could also write a small server extension for updating a weight value on a single relationship transactionally. It would probably even make sense as an addition to the REST API (alongside the "set single value" operation, a "modify single value" operation):
PUT http://localhost:7474/db/data/node/15/properties/mod/foo
The body contains the delta value (1.5, -10). Another idea would be to replace the mod keyword with the actual operation:
PUT http://localhost:7474/db/data/node/15/properties/add/foo
PUT http://localhost:7474/db/data/node/15/properties/or/foo
PUT http://localhost:7474/db/data/node/15/properties/concat/foo
What would "increment" mean in a non-integer case?
Hmm, a bit of a different approach, but you could consider using a queuing system. I'm using the Neo4j REST interface as well and am looking into storing a constantly changing relationship strength. The project is in Rails and uses Resque. Whenever an update to the Neo4j database is required, it's thrown into a Resque queue to be completed by a worker. I only have one worker working on the Neo4j Resque queue, so it never tries to perform more than one Neo4j update at once.
This has the added benefit of not making the user wait for the Neo4j updates when they perform an action that triggers one. However, it is only a viable solution if you don't need to use/display the Neo4j updates instantly (though depending on the speed of your worker and the size of your queue, it should only take a few seconds).
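The same single-consumer idea in a minimal, framework-free sketch (apply_update stands in for the actual Neo4j write):

import queue
import threading

updates = queue.Queue()

def apply_update(update):
    # Stand-in for the actual Neo4j write (REST call, Cypher, etc.).
    print("applying", update)

def worker():
    # A single consumer thread serializes all Neo4j writes, so no two
    # updates ever race on the same node or relationship.
    while True:
        update = updates.get()
        if update is None:  # shutdown sentinel
            break
        apply_update(update)
        updates.task_done()

threading.Thread(target=worker, daemon=True).start()

# Producers (e.g. web requests) just enqueue and return immediately.
updates.put({"from": "A", "to": "B", "delta": 1})
updates.join()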
Depends a bit on what read and write load you are targeting. How big is the total graph going to be?
