bottleneck node taking long time to return - neo4j

we're on neo4j 2.1.4 soon to upgrade to 2.2.1.
We've been experiencing some slow downs with certain cypher queries and I think they are mostly centered around two to three nodes out of millions in the graph. These nodes were created with the intent on having some monitoring put in place to check the availability of the graph. I've since found out that a few apps that have been built are actually exercising these queries before actually performing their write operations on the graph. Then I found out that our load balancer was setup to actually do some tests through multiple apps that end up querying the same nodes. So we have a large mix of applications that are all either pulling or updating these same nodes. This has resulted in those two nodes taking anywhere from 8 to 40 seconds to be returned.
Is there any way to determine how many updates and how many queries are being issued against one node?

Since Neo4j 2.2 there's a config option to log queries taking longer than a given threshold, see the dbms.querylog.XXXX settings in http://neo4j.com/docs/stable/configuration-settings.html.
To get an update count for a given node you could setup a custom TransactionEventHandler that tracks write accesses to your given nodes.

Related

How to check the amount of times a neo4J index was used?

So I received a dated schema that used to work well at the beginning but it's experiencing some scaling issues.
Among of them, the space used by the indexes is catching my attention so I would like to know if they are being used, how many times, etc.
Other that explaining/profiling queries, is there anything else I could use to have this kind of information?
The information you are looking for would be under metrics monitoring, but index accesses is not one of the available metrics Neo4j provides. (Neo4j supports Prometheus, but I don't know if Prometheus captures that info either)
But there are some indirect ways you can get this data.
Assuming you have a test server that replicates production, with appropriate load tests, you can try removing the index and seeing how it affects the load tests. (This way is a bit cumbersome, but probably gives the most accurate measure of how varies DB changes affect performance, but only if the load tests accurately reflect production use.)
Alternatively, for a more static analysis, you should only be executing pre-defined, parameterized cyphers. So you can Profile/Explain those Cyphers against the DB at different scales, and compare those notes to the Cypher logs (either calling end, or using Neo4j metrics monitoring) to get an idea of how often each one is called.

How can I view only the context I'm working on?

In Neo4j, I created the database through the various exercises I'm doing.
When I run a query, for example MATCH (n) RETURN (n), until that database that was created in "Christmas of 1914" appears on the screen, making my interface ugly, polluted, loaded with unnecessary objects to work at that moment.
If I work with Northwind, I want to see only Northwind, if I work with Facebook, I just want to see Social, and so on. I do not want to see all the databases on the planet on my screen each time I run a query like MATCH (n) RETURN (n).
Neo4j doesn't really have a direct equivalent to multiple databases stored within the same server instance. There are three options for achieving this:
1) the closest match would be create run an additional instance of neo4j on the same server. You will need to edit the neo4j.conf file to give the new instance a new port number and a new data directory. This will give you isolation between the data and user accounts in the two databases. The downside is you will need to divide up the RAM on the box before running, effectively limiting both instances to half the RAM.
2) You can attach labels to your nodes to identify which bucket of data (database in the RDBMS world) each node belongs to. You can operate as if the two are isolated even though they really live in the same database instance. Neo4j won't do a lot to help you enforce this, you will need to do the work at the application level. There is a mechanism for you to restrict users to only being able interact with a subset of your graph but you have to write custom procedures and restrict the users to only using those. I haven't tried it but it sounds tedious.
https://neo4j.com/docs/operations-manual/current/security/authentication-authorization/subgraph-access-control/
3) If you are running on VMs or the cloud, you mind as well just create a new instance for your second database. It achieves the same effect as number one but with better isolation of resources.

Monitoring a causal cluster's redundancy

I'm on Neo4j 3.1.2. I'm trying to automate monitoring a causal cluster for proper redundancy, preferably over the http interface, dbms.cluster.overview being the most obvious call. But when they die, servers drop off this list without regard to how they exit. The operations manual says there is a difference between clean shutdowns and unclean ones. How do I figure out if a server left cleanly or uncleanly? Is there a procedure to clean up a unclean failure that's never coming back?
In general I would like to know the number of core servers Neo4j is checking for consensus. I don't see an API to find that number. That way I could tell how close to failure we are.
The expected_core_cluster size setting is used when bootstrapping the cluster on first formation. A cluster will not form without the configured amount of cores and this should in general be configured to the full and fixed amount.
This setting is then also used as the minimum consensus group size. The consensus group size (core machines successfully voted into the raft) can shrink and grow dynamically but bounded on the lower end at this number.
The intention is in almost all cases for users to leave this setting alone. If you have 5 machines then you can survive failures down to 3 remaining, e.g. with 2 dead members. The three remaining can still vote another replacement member in successfully up to a total of 6 (2 of which are still dead) and then after this, one of the superfluous dead members will be immediately and automatically voted out (so you are left with 5 members in the consensus group, 1 of which is currently dead). Operationally you can now bring the last machine up by bringing in another replacement or repairing the dead one.
If the intention really is to bring the expected_core_cluster size down to 3 then today you will have to update the setting and do a rolling restart. This is considered an uncommon scenario. Causal clustering optimizes operationally for repair/replace.
The only difference between a clean and an unclean shutdown is that the former leads to a quicker discovery of a core member disappearing, since that is based on a timeout.

Give users read-only access to Neo4j while doing Batch Update

This is just a general question, not too technical. We have this use-case wherein we are to load hundreds of thousands of records to an existing Neo4j database. Now, we cannot afford to make the database offline because of users who are accessing it. I know that Neo4j requires exclusive lock on the database while it's performing batch updates. Is there a way around my problem? I don't want to lock my database while doing updates. I still want my users to access it - even for just read-only access. Thanks.
Neo4j never requires exclusive lock on the database. It selectively locks portions of the graph that are affected by mutating operations. So there are some things you can do to achieve your goal. Are you a Neo4j Enterprise customer?
Option 1: If so, you can run your batch insert on the master node and route users to slaves for reading.
Option 2: Alternatively, you could do a "blue-green" style deployment where you:
take a backup (B) of your existing database (A), then mark the A database read-only
apply your batch inserts onto B either by starting a separate instance, or even better, using BatchInserters. That way, you'll insert your hundreds of thousands in a few seconds
start the new database B
flip a switch on a load-balancer, so that users start to be routed to the B instead of A
take A down
(Please let me know if you need some tips how to make a read-only DB.)
Option 3: If you can only afford to run one instance at any one time, then there are techniques you can employ to let your users access the database as usual and still insert large volumes of data. One of them could be using a single-threaded "writer" with a queue that batches write operations. Because one thread only ever writes to the database, you never run into deadlock scenarios and people can happily read from the database. For option 3, I suggest using GraphAware Writer.
I've assumed you are not trying to insert hundreds of thousands of nodes to a running Neo4j database using Cypher. If you are, I would start there and change it to use Java APIs or the BatchInserter API.

Does neo4j have a "trigger" mechanism via Cypher? (similar to percolators in ElasticSearch)

I am looking for a method to store cypher queries and when adding nodes and relations be notified when it matches said query? Can this be done currently? Something similar to ElasticSearch percolators would be great.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html
Update
The answer below was accurate in 2014. It's mostly accurate in 2018.
But now there is a way of implementing triggers in Neo4j provided by Max DeMarzi which is pretty good, and will get the job done.
Original answer below.
No, it doesn't.
You might be able to get something similar to what you want by using a TransactionEventHandler object, which basically lets you bind a piece of code (in java) to the processing of a transaction.
I'd be really careful with running cypher in this context though. Depending on what kind of matching you want to do, you could really slaughter performance by running that each time new data is created in the graph. Usually triggers in an RDBMS are specific to inserts or updates on a particular table. In Neo4J, the closest equivalent you might have is on creating/modifying a node of a certain type of label. If your app has any number of different node classes, it wouldn't make sense to run your trigger code whenever new relationships/nodes are created, because most of the time the node type probably wouldn't be relevant to the trigger code.
Related reading: Do graph databases support triggers? and a feature request for triggers in neo4j
Neo4j 3.5 supports triggers.
To use this functionality - Enable apoc.trigger.enabled=true in $NEO4J_HOME/config/neo4j.conf.
you have to add APOC to the server - it's not there by default.
In a trigger you register Cypher statements that are called when data in Neo4j is changed (created, updated, deleted). You can run them before or after commit.
Here is the help doc -
https://neo4j-contrib.github.io/neo4j-apoc-procedures/#triggers

Resources