Neo4j and concurrent writes, how to handle deadlock issues, preferably server-side? - neo4j

I'm running into issues with deadlocks during concurrent merge operations (REST API). I have a function that processes text with some metadata, and for each item in the metadata dictionary, I'm performing a merge to add either a node or connect the text node with the metadata[n] node. Issues come up when the message rate is around 500-1000 per second.
In this particular function, there are 11 merges between 6 queries, which go something like this:
q1 = "MERGE (n:N { id: {id} }) ON CREATE SET ... ON MATCH SET "
"WITH . MERGE (...) "
"WITH ., . MERGE (.)-[:some_rel]->(.)"
params = {'the': 'params'}
cypher.execute(q1, params)
if some_condition:
q2 = "MATCH (n:N { id: {id} }) ... "
"WITH n, . MERGE (n)-[:some_rel]->(.)"
params = {'more': 'params'}
cypher.execute(q2, params)
if some_condition2:
q3
...
if some_condition_n:
qn
I'm running the above with Python, via Celery (for those not familiar with Celery, it's a distributed task queue). When the issue first came up, I was executing the above in a single transaction, and had a ton of failures due to deadlock exceptions. My initial thought was simply to implement a distributed blocking lock at the function level with Redis. This, however, causes a bottleneck in my app.
Next, I switched from a single Cypher transaction a few atomic transactions as in the above and removed the lock. This takes care of the bottleneck, as well as greatly reducing the number of deadlock exceptions, but they're still occurring, albeit at the reduced level.
Graph databases aren't really my thing, so I don't have a ton of experience with the in's and out's of Neo4j and Cypher. I have a secondary-index in Redis of the uuid's of existing nodes, so there is a pre-processing step prior to the merge's to try and keep the graph access down. Are there any obvious solutions that I should try? Maybe there's some way to queue the operations on the graph-side, or maybe I'm overlooking some server optimizations? Any advice on where to look would be appreciated. Thanks!

Okay, after thinking about this some more, I realized that the way my queries were executed was inefficient and could do with some refactoring. Since all the queries are within the same general context, there is no reason to execute them all individually, or even no reason to open a transaction and have them executed that way.
Instead, I changed function to go through the conditionals, and concatenate the query strings into one long string, and add the params that I need to the param dictionary. So, now, there's only one execution at the end, with one statement. This also takes out some of the 'MATCH' statements.
This fix doesn't wholly fix the issue, though, as there are still some deadlock exceptions being thrown.

I think I found the issue, mainly that there wasn't an issue to begin with. That is:
The Enterprise version of Neo4j has an alternative lock manager than
the Community version, meant to provide scalable locking on
high-CPU-count machines.
The Enterprise Lock Manager uses a deadlock detection algorithm that
does not require (much) synchronization, which gives it some very
desirable scalability attributes. The drawback is that it may
sometimes detect false-positives. This normally does not happen in
production usage, but becomes evident in stress testing individual
operations. These scenarios see much lower churn in CPU cache
invalidation, which the enterprise lock manager needs to communicate
across cores.
As a deadlock detection error is a safe-to-retry error and the user is
expected to handle these in all application code, since there may be
legitimate deadlocks at any time, this behavior is actually by design
to gain scalability.
I simply caught the exception and then retry after a few seconds and now:

Related

Guarantee Print Order After Parallelism

I have X amount of cores doing unique work in parallel, however, their output needs to be printed in order.
Object {
Data data
int order
}
I've tried putting the objects in a min heap after they're done with their parallel work, however, even that is too much of a bottleneck.
Is there any way I could have work done in parallel and guarantee the print order? Is there a known term for my problem? Have others encountered it before?
Is there any way I could have work done in parallel and guarantee the print order?
Needless to say, we design parallelized routines with focus on an efficiency, but not constraining the order of the calculations. The printing of the results at the end, when everything is done, should dictate the ordering. In fact, parallel routines often do calculations in such a way that they’re conspicuously not in order (e.g., striding on each thread) to minimize thread and synchronization overhead.
The only question is how you structure the results to allow efficient storage and efficient, ordered retrieval. I often just use a mutable buffer or a pre-populated array. It’s very efficient in terms of both storage and retrieval. Or you can use a dictionary, too. It depends upon the nature of your Data. But I’d avoid the order property pattern in your result Object.
Just make sure you’re using optimized build if using standard Swift collections, as this can have a material impact on performance.
Q : Is there a known term for my problem?
Yes, there is. A con·​tra·​dic·​tion:
Definition of contradiction…2a : a proposition, statement, or phrase that asserts or implies both the truth and falsity of something// … both parts of a contradiction cannot possibly be true …— Thomas Hobbes
2b : a statement or phrase whose parts contradict each other// a round square is a contradiction in terms
3a : logical incongruity
3b : a situation in which inherent factors, actions, or propositions are inconsistent or contrary to one anothersource: Merriam-Webster
Computer science, having borrowed the terms { PARALLEL | SERIAL | CONCURRENT } from the theory of systems, respects the distinctive ( and never overlapping ) properties of each such class of operations, where:
[PARALLEL] orchestration of units-of-work implies, that any and every work-unit: a) starts and b) gets executed and c) gets finished at the same time, i.e. all get into/out-of [PARALLEL]-section at once and being elaborated at the very same time, not otherwise.
[SERIAL] orchestration of units-of-work implies, that all work-units be processed in a one, static, known, particular order, starting work-unit(s) in such an order, just a (known)-next one after previous one has finished its work - i.e. one-after-another, not otherwise.
[CONCURRENT] orchestration of units-of-work permits to start more than one unit-of-work, if resources and system conditions permit (scheduler priorities obeyed), resulting in unknown order of execution and unknown time of completion, as both the former and the latter depend on unknown externalities (system conditions and (non)-availability of resources, that are/will be needed for a particular work-unit elaboration)
Whereas there is an a-priori known, inherently embedded sense of an ORDER in [SERIAL]-type of processing ( as it was already pre-wired into the units-of-work processing-orchestration-code ), it has no such meaning in either [CONCURRENT], where opportunistic scheduling makes a wished-to-have order an undeterministically random result from the system states, skewed by the coincidence of all other externalities, and the same wished-to-have order is principally singular value in true [PARALLEL] by definition, as all start/execute/finish at-the-same-time - so all units-of-work being executed in [PARALLEL] fashion have no other chance, but be both 1st and last at the same time.
Q : Is there any way I could have work done in parallel and guarantee the print order?
No, unless you intentionally or unknowingly violate the [PARALLEL] orchestration rules and re-enter a re-[SERIAL]-iser logic into the work-units, so as to imperatively enforce any such wished-to-have ordering, that is not known, the less natural for the originally [PARALLEL] work-units' orchestration ( as is a common practice in python - using a GIL-monopolist indoctrinated stepping - as an example of such step )
Q : Have others encountered it before?
Yes. Since 2011, each and every semester this or similar questions reappear here, on Stack Overflow at growing amounts every year.

Stream de-duplication on Dataflow | Running services on Dataflow services

I want to de-dupe a stream of data based on an ID in a windowed fashion. The stream we receive has and we want to remove data with matching within N-hour time windows. A straight-forward approach is to use an external key-store (BigTable or something similar) where we look-up for keys and write if required but our qps is extremely large making maintaining such a service pretty hard. The alternative approach I came up with was to groupBy within a timewindow so that all data for a user within a time-window falls within the same group and then, in each group, we use a separate key-store service where we look up for duplicates by the key. So, I have a few questions about this approach
[1] If I run a groupBy transform, is there any guarantee that each group will be processed in the same slave? If guaranteed, we can group by the userid and then within each group compare the sessionid for each user
[2] If it is feasible, my next question is to whether we can run such other services in each of the slave machines that run the job - in the example above, I would like to have a local Redis running which can then be used by each group to look up or write an ID too.
The idea seems off what Dataflow is supposed to do but I believe such use cases should be common - so if there is a better model to approach this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) Your welcome to run other services on the Dataflow VMs since they are general purpose but note that you will have to contend with resource requirements of the other applications on the host potentially causing out of memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your usecase.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
Window<T>into(Sessions.withGapDuration(Duration.standardMinutes(N hours))));
Each new element will keep extending the length of the window so it won't close until the gap closes. If you also apply an AfterCount speculative trigger of 1 with an AfterWatermark trigger on a downstream GroupByKey. The trigger would fire as soon as it could which would be once it has seen at least one element and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out an element which isn't an early firing based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
It is sort of unclear from your description, can you provide more details?
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have incoming data stream, each item of which can be uniquely identified by their fields. We also know that duplicates occur pretty far apart and for now, we care about those within a 6 hour window. And regarding the volume of data, we have atleast 100K events every second, which span across a million different users - so within this 6 hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
Window<KV<key,T>>into(Sessions.withGapDuration(Duration.standardMinutes(6 hours))));
where key is a combination of the 3 ids I had mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per-key. This would mean that Dataflow would have too many open sessions at any given time waiting for them to close and I was worried if it would scale or I would run into any bottle-necks.
[2] Once I perform Sessioning as above, I have already removed the duplicates based on the firings since I will only care about the first firing in each session which already destroys duplicates. I no longer need the RemoveDuplicates transform which I found was a combination of (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] going to be a problem, the alternative is to reduce the key for sessioning to be just user-id, session-id ignoring the sequence-id and then, running a RemoveDuplicates on top of each resulting window by sequence-id. This might reduce the number of open sessions but still would leave a lot of open sessions (#users * #sessions per user) which can easily run into millions. FWIW, I dont think we can session only by user-id since then the session might never close as different sessions for same user could keep coming in and also determining the session gap in this scenario becomes infeasible.
Hope my problem is a little more clear this time. Please let me know any of my approaches make the best use of Dataflow or if I am missing something.
Thanks
I tried out this solution at a larger scale and as long as I provide sufficient workers and disks, the pipeline scales well although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks into c.pane().getTiming() and only rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo and it looks like the ontime-panes are actually deduped more precisely than the early ones. I mean, the #early-firings still has some duplicates whereas the #ontime-firings is less than that and has more duplicates removed. Is there any reason this could happen? Also, is my approach towards deduping using a Combine+ParDo the right one or could I do something better?
events.apply(
WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
#Override
public java.lang.String apply(EventInfo input) {
return input.getUniqueKey();
}
})
)
.apply(
Window.named("sessioner").<KV<String, EventInfo>>into(
Sessions.withGapDuration(mSessionGap)
)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
)
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes()
);

Understanding atomic operations and race conditions

Lets say I have a collection with a views_count, males_views_count and females_views_count fields. I would like to be able to $inc (increment) the count fields respectively when a page is viewed.
The issue with this is I receive several concurrent connections, and a race condition may occur.
I've been reading about atomic operations in Mongodb. Where they typically have a succeed or fail approach. Either the record is written or it isn't. Does that mean I need to create the logic to determine whether the operation failed and if so, retry it?
Going back to my scenario. What if I wanted to make every view count (even when a race conditions occur). I know Mongodb locks are slightly different than traditional RDBMS'. Typically, I would implement an optimistic locking technique. Similar to:
begin
# .. do work .. determine if user is a male or a female
stat.save
rescue StaleDocumentError
stat.reload
retry
end
Or are atomic operations meant to prevent race conditions as it is the server who does the update and it is authoritative about what the truth is? If so, do we still need to implement optimistic/pessimistic locking techniques, or will Mongodb take care of that for us if we use atomic operations?
If you are using atomic operations such as $inc there is no race condition. If you have two increments from different threads they will be applied in the order received by the server and the end outcome on the document will be the same after both operations are applied.
If you are updating fields to specific values using $set, the "winning" value will be the last $set applied. If your application use case may lead to conflicting updates of the same field, then optimistic locking/versioning would be useful.
As far as exception handling goes, you should assume that any database operation could potentially result in a server exception (network error, duplicate key, ...) and retry as appropriate. There isn't any special logic required for atomic updates versus non-atomic updates, unless you choose to implement your own optimistic locking (in which case you would probably refetch the current version of the document and retry the operation).
The MongoDB manual covers a few different patterns in Isolate Sequence of Operations.

Neo4j In memory configurations, multithreading, and slow writes

How do I improve performance when writing to neo4j. I currently have neo4j set up on a server and I am currently running it in embedded more. I believe my configurations are storing all the content of my graph database in memory based upon configurations I've found online
neostore.nodestore.db.mapped_memory=0
neostore.relationship.db.mapped_memory=0
neostore.propertystore.db.mapped_memory=0
neostore.propertystore.db.strings.mapped_memory=0
neostore.propertystore.db.arrays.mapped_memory=0
neostore.propertystore.db.index.keys.mapped_memory=0
neostore.propertystore.db.index.mapped_memory=0
node_auto_indexing=true
node_keys_indexable=type,id
cache_type=strong
use_memory_mapped_buffers=false
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
Please let me know if this is incorrect. The problem that I am encountering is that when I try to persist information to the graph database. It appears that those times are not very quick in comparison to our MYSQL times of the samething(ex. to add 250 items would take about 3sec and in MYSQL it takes 1sec) . I read online that when you have multiple indexes that that can slow down performance on persisting data so I am working on that right now to see if that is my culprit. But, I just wanted to make sure that my configurations seem to be inline when it comes to running your graph database in memory.
Second question to this topic. Okay, if my configurations are good and my database is indeed in memory, then is there a way to optimize persisting data just in case this isn't the silver bullet. If we ran one thread against our test that executes this functionality, oppose to 10 threads, its seems like the times for execution bubbles up
ex.( thread 1 finishes 1s, thread 2 finishes 2s, thread 3 finishes 3s,etc). Is there some special multithreaded configuration that I am missing to improve the performance when mulitple threads are hitting it at one time.
Neo4J version
1.9.1-enterprise
My Jvm configs are
-Xms25G -Xmx25G -XX:+UseNUMA -XX:+UseSerialGC
My Machine Specs:
File system type ext3
You cache arguments are invalid.
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
These can only be used with the GCR cache. Setting the cache isn't going to put everything in memory for you at start up, you will have to write code to do this for you. Something like this:
GlobalGraphOperations ggo = GlobalGraphOperations.at(graphDatabaseFactory);
for (Node n : ggo.getAllNodes()) {
for (String propertyKey : n.getPropertyKeys()) {
n.getProperty(propertyKey);
}
for (Relationship relationship : n.getRelationships()) {
}
}
Beware with the strong cache, if you have a lot of nodes/relationships, eventually your cache will become large and performing GC against it will cause long pauses in your system.
My recommendation would be to use the memory mapped files, as this is an OS handled and will be outside of heap space. It doesn't provide near the speed of caching, but it will provide a speed up if you have to read from the neo store.

Trying to hack my way around SQLite3 concurrent writing, any better way to do this?

I use Delphi XE2 along with DISQLite v3 (which is basically a port of SQLite3). I love everything about SQLite3, except the lack of concurrent writing, especially that I extensively rely on multi-threading in this project :(
My profiler made it clear I needed to do something about it, so I decided to use this approach:
Whenever I need to insert a record in DB, Instead of doing an INSERT, I write the SQL query in a special foler, ie.
WriteToFile_Inline(SPECIAL_FOLDER_PATH + '\' + GUID, FileName + '|' + IntToStr(ID) + '|' + Hash + '|' + FloatToStr(ModifDate) + '|' + ...);
I added a timer (in the main app thread) that fires every minute, parse these files and then INSERT the queries using a transaction.
Delete those temporary files at the end.
The result is I have like 500% performance gain. Plus this technique is ACID, as I can always scan the SPECIAL_FOLDER_PATH after a power failure and execute the INSERTs I find.
Despite the good results, I'm not very happy with the method used (hackish to say the least), I keep thinking that if I could have a generics-like with fast lookup access, thread-safe, ACID list, this would be much cleaner (and possibly faster?)
So my question is: do you know anything like that for Delphi XE2?
PS. I trust many of you reading the code above be in shock and will start insulting me at this point! Please be my guest, but if you know a better (ie. faster) ACID approach, please share your thoughts!
Your idea of sending the inserts to a queue, which will rearrange the inserts, and join them via prepared statements is very good. Using a timer in the main thread or a separated thread is up to you. It will avoid any locking.
Do not forget to use a transaction, then commit it every 100/1000 inserts for instance.
About high performance using SQLite3, see e.g. this blog article (and graphic below):
In this graphic, best performance (file off) comes from:
PRAGMA synchronous = OFF
Using prepared statements
Inside a transaction
In WAL mode (especially in concurrency mode)
You may also change the page size, or the journal size, but settings above are the best. See https://stackoverflow.com/search?q=sqlite3+performance
If you do not want to use a background thread, ensure WAL is ON, prepare your statements, use batchs, and regroup your process to release the SQLite3 lock as soon as possible.
The best performance will be achieved by adding a Client-Server layer, just as we did for mORMot.
With files you organized an asynchronous job queue with persistance. It allows you to avoid one-by-one and use batch (records group) approach to insert the records. Comparing one-by-one and batch:
first works in auto-commit mode (probably) for each record, second wraps a batch into a single transaction and gives greatest performance gain.
first prepares an INSERT command each time when you need to insert a record (probably), second once per batch and gives second by value gain.
I dont think, that SQLite concurrency is a problem in your case (at least not the main issue). Because in SQLite a single insert is comparably fast and concurrency performance issues you will get with high workload. Probably similar results you will get with other DBMS, like Oracle.
To improve your batch approach, consider the following:
consider to set journal_mode to WAL and disable shared cache mode.
use a background thread to process your queue. Instead of a fixed time interval (1 min), check SPECIAL_FOLDER_PATH more often. And if the queue has more than X Kb of data, then start processing. Or use a count of queued records and event to notify the thread, that the queue should start processing.
use multy-record prepared INSERT instead of single-record INSERT. You can build an INSERT for 100 records and process your queue data in a single batch, but by 100 record chanks.
consider to write / read a binary field values instead of a text values.
consider to use a set of files with preallocated size.
etc
sqlite3_busy_timeout is pretty inefficient because it doesn't return immediately when the table it's waiting on is unlocked.
I would try creating a critical section (TCriticalSection?) to protect each table. If you enter the critical section before inserting a row and exit it immediately thereafter, you will create better table locks than SQLite provides.
Without knowing your access patterns, though, it's hard to say if this will be faster than batching up a minute's worth of inserts into single transactions.

Resources