As far as I understand it the IDs given by Neo4j (ID(node)) are unstable and behave somewhat like row numbers in SQL. Since IDs are mostly used for relations in SQL and these are easily modeled in Neo4j, there doesn't seem to be much use for IDs, but then how do you solve retrieval of specific nodes? Having a REST API which is supposed to have unique routes for each node (e.g. /api/concept/23) seems like a pretty standard case for web applications.
But despite this being so fundamental, the only viable ways I found were either via language-specific frameworks or via an unconnected node that maintains the increments:
// get unique id
MERGE (id:UniqueId{name:'Person'})
ON CREATE SET id.count = 1
ON MATCH SET id.count = id.count + 1
WITH id.count AS uid
// create Person node
CREATE (p:Person{id:uid,firstName:'Gabriel',lastName:'Smith'})
RETURN p AS person
Source: http://www.neo4j.org/graphgist?8012859
Is there really not a simpler way and if not, is there a particular reason for it? Is my approach an anti-pattern in the context of Neo4j?
Neo4j internal IDs are a bit more stable than SQL row IDs; for example, they will never change during a transaction.
And indeed, exposing them for external use is not recommended. I know there are some plans at Neo to implement such a feature.
People tend to use one of two solutions for this:
Use a UUID generator at the application level (for PHP, for example: https://packagist.org/packages/rhumsaa/uuid) and add a label/uuid uniqueness constraint on all nodes.
Use a handy Neo4j plugin such as https://github.com/graphaware/neo4j-uuid, which adds uuid properties on the fly. This removes the burden of handling it at the application level and makes it easier to manage the persistence state of your node objects.
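A minimal sketch of the first option, assuming the application generates the UUID and passes it in as a parameter. The Person label, the uuid property name and the {uuid} parameter are illustrative choices, not something prescribed by either library above.

```cypher
// One-time setup: the uniqueness constraint also creates an index on :Person(uuid).
CREATE CONSTRAINT ON (p:Person) ASSERT p.uuid IS UNIQUE;

// Per request: {uuid} is supplied by the application ($uuid on newer Neo4j versions).
CREATE (p:Person {uuid: {uuid}, firstName: 'Gabriel', lastName: 'Smith'})
RETURN p;

// Lookup behind a route such as /api/concept/<uuid>.
MATCH (p:Person {uuid: {uuid}})
RETURN p;
```

Because the lookup names both the label and the constrained property, it can be served from the index rather than from a scan over all nodes.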
I agree with Pavel Niedoba.
I came up with this, without a UniqueId node:
MATCH (a:Person)
WITH a ORDER BY a.id DESC LIMIT 1
CREATE (n:Person {id: a.id+1})
RETURN n
It requires an initial node with an id field, though.
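One way to drop the requirement for an initial node is to aggregate instead of sorting. This is only a sketch of that variation: an aggregation without grouping keys still returns a single row when nothing matches, so coalesce can supply a starting value.

```cypher
// max() yields null when there are no :Person nodes yet; coalesce turns that into 0.
MATCH (a:Person)
WITH coalesce(max(a.id), 0) + 1 AS nextId
CREATE (n:Person {id: nextId})
RETURN n;
```

Like the original query, this reads the current maximum and writes the new node in separate steps, so two concurrent writers can compute the same id. A uniqueness constraint on :Person(id) would at least make the second write fail instead of silently duplicating.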
Related
Suppose I have 2 types of nodes :Server and :Client.
(Client)-[:CONNECTED_TO]->(Server)
I want to find the Female clients connected to some Server ordered by age.
I did this
MATCH (s:Server {id:"S1"})<-[:CONNECTED_TO]-(c {gender:"F"}) RETURN c ORDER BY c.age DESC
Doing this, all the Client nodes linked to my Server node are traversed to find the highest age.
Is there a way to index the Client nodes on gender and age properties to avoid the full scan?
You can create an index on :Client(gender), as follows:
CREATE INDEX ON :Client(gender);
However, your particular query will probably benefit more from creating an index on :Server(id), since there are probably a lot of female clients but probably only a single Server with that id. So, you probably want to do this instead:
CREATE INDEX ON :Server(id);
But, even better, if every Server has a unique id property, you should create a uniqueness constraint (which also automatically creates an index for you):
CREATE CONSTRAINT ON (s:Server) ASSERT s.id IS UNIQUE;
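To check whether the planner actually uses the index or constraint, the query can be prefixed with PROFILE. The sketch below adds the :Client label to the pattern, which the question's query omitted, on the assumption that every client node carries it.

```cypher
// PROFILE runs the query and prints the execution plan with row counts per operator.
PROFILE
MATCH (s:Server {id: "S1"})<-[:CONNECTED_TO]-(c:Client {gender: "F"})
RETURN c
ORDER BY c.age DESC;
```

In the resulting plan, an operator such as NodeIndexSeek or NodeUniqueIndexSeek on :Server(id), rather than NodeByLabelScan, indicates that the starting node is found through the index.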
Currently, neo4j does not use indexes to perform ordering, but there are some APOC procedures that do support that. However, the procedures do not support returning results in descending order, which is what you want. So, if you really need to use indexing for this purpose, a workaround would be to add an extra minusAge property to your Client nodes that contains the negative value of the age property. If you do this, then first create an index:
CREATE INDEX ON :Client(minusAge);
and then use this query:
MATCH (s:Server{id:"S1"})<-[:CONNECTED_TO]-(cl:Client {gender:"F"})
CALL apoc.index.orderedRange('Client', 'minusAge', -200, 0, false, -1) YIELD node AS c
RETURN c;
The 3rd and 4th parameters of that procedure are for the minimum and maximum values you want to use for matching (against minusAge). The 5th parameter should be false for your purposes (and is actually currently ignored by the implementation). The last parameter is for the LIMIT value, and -1 means you do not want a limit.
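The minusAge property has to be populated and kept in step with age for this workaround to hold. A possible sketch, assuming Client nodes have an id property and that {clientId} and {age} are parameters supplied by the application (both names are hypothetical):

```cypher
// Backfill minusAge for clients that already have an age.
MATCH (c:Client)
WHERE c.age IS NOT NULL
SET c.minusAge = -c.age;

// Write both properties together whenever the age changes
// ($clientId / $age on newer Neo4j versions).
MATCH (c:Client {id: {clientId}})
SET c.age = {age}, c.minusAge = -1 * {age};
```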
If that is a request you're doing quite frequently, then you might want to write that data out. As you're experiencing, that query can be quite expensive and it won't get better the more clients you get, as in fact, all of the connected nodes have to be retrieved and get a property check/comparison run on them.
As a workaround, you can add another connection to your clients when you modify their age data.
As a suggestion, you can create an Age node and create a MATURE relationship to your oldest clients.
(:Server)<-[:CONNECTED_TO]-(:Client)-[:MATURE]->(:Age)
You can do this for all the ages, and run queries off the Age nodes (with an indexed/unique age property on them) as needed. If you have 100,000 clients but are only interested in the top 100 ordered by age, there's no need to get all the clients and order them... It really depends on your use case and client apps.
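A rough sketch of how such a relationship might be maintained and queried. The threshold of 60, the Client id property, the name property on the Age node and the parameter names are all assumptions made for illustration; the answer does not specify them.

```cypher
// Write path: when an age is stored, link sufficiently old clients to a shared Age node.
MATCH (c:Client {id: {clientId}})
SET c.age = {age}
WITH c
WHERE c.age >= 60
MERGE (a:Age {name: 'mature'})
MERGE (c)-[:MATURE]->(a);

// Read path: only clients already linked to the Age node are fetched and sorted.
MATCH (s:Server {id: "S1"})<-[:CONNECTED_TO]-(c:Client {gender: "F"})-[:MATURE]->(:Age)
RETURN c
ORDER BY c.age DESC;
```

The sort still happens at query time, but over a much smaller candidate set than all clients connected to the server.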
While this is certainly not a nice pattern, I've seen it work in production, and it is the only workaround I've seen do well across different production environments.
If this answer didn't solve your problem (I'd rather use an age property, too), I hope it gave you at least an idea on what to do next.
In our company, we return a list of IDs to clients through a web service. The IDs are unique across the system. They invoke other web services passing the IDs. We don't always know the label of the ID we receive.
This does not perform:
MATCH (n {id:{my_id}}) ...
While we have indexes on almost all label types, this query has no label and thus does not use an index, as far as I can tell.
Is it a bad idea to add a label called "GLOBAL" (or whatever) to all nodes so we can put a unique constraint on GLOBAL.id? Then the query above could be
MATCH(n: GLOBAL{id:{my_id}})...
and perform nicely.
Is there another way?
You can use Neo4j's internal ID to identify your resources, but it's not best practice; see Should we use the Neo4J internal id?
This is how to get a node using its Neo4j internal id:
START n=node({my_id}) return n
It's much faster than a MATCH clause on a property, because the query starts directly at one node and doesn't have to filter a property across a set of nodes.
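On versions where START is deprecated or no longer accepted, the same lookup can be written with MATCH and the id() function. As far as I know, the planner resolves this form by id as well rather than scanning all nodes; running it with PROFILE would confirm that on a given version.

```cypher
// {my_id} is the internal node id passed as a parameter ($my_id on newer versions).
MATCH (n)
WHERE id(n) = {my_id}
RETURN n;
```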
If you can handle the internal id limitations, it's the solution you are looking for.
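If those limitations are not acceptable, the shared-label idea from the question can be sketched as follows. This assumes every identifiable node already has an id property that is unique across the whole graph, as the question states; the Concept label in the last statement is purely illustrative.

```cypher
// Tag every node that carries an id (batch this on a large graph).
MATCH (n)
WHERE n.id IS NOT NULL
SET n:GLOBAL;

// The constraint creates the index; it will fail if duplicate ids exist.
CREATE CONSTRAINT ON (n:GLOBAL) ASSERT n.id IS UNIQUE;

// Label-agnostic lookup that can now use the index.
MATCH (n:GLOBAL {id: {my_id}})
RETURN n;

// New nodes need the extra label at creation time.
CREATE (n:Concept:GLOBAL {id: {my_id}})
RETURN n;
```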
Is using a tree with a counter on the root node, to be referenced and incremented when creating new nodes, a viable way of managing unique IDs in Neo4j? In a previous question on performance on this forum (Neo4j merge performance VS create/set), the approach was described, and it occurred to me it may suggest a methodology for unique ID management without having to extend the Neo4j database (and support that extension). However, I noticed this approach has not been mentioned in other discussions on best practice for unique ID management (Best practice for unique IDs in Neo4J and other databases?).
Can anyone help validate or reject this approach?
Thanks!
You can just create a singleton node (I'll give it the label IdCounter in my example) to hold the "next-valid ID counter" value. There is no need for it to be part of any "tree" or for it to have any relationships at all.
When you create the singleton, initialize it with the first id value that you want to use. For example:
CREATE (:IdCounter {nextId: 1});
Here is a simple example of how to use it when creating a new node.
MATCH (c:IdCounter)
CREATE (x {id: c.nextId})
SET c.nextId = c.nextId + 1
RETURN x;
Since all Cypher queries are transactional, if the node creation did not happen for any reason, then the nextId increment would also not be done, so you should never end up with any gaps in assigned id numbers.
However, to avoid re-using the same id number, you would have to write your queries carefully to ensure that the increment always happens whenever you create a new node (using CREATE, CREATE UNIQUE, or MERGE).
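One way to make the increment harder to forget is to put it first and derive the id from the updated counter, with a uniqueness constraint as a safety net. This is a sketch under assumptions: the Person label is illustrative, and how concurrent transactions interleave on the counter should be verified on the Neo4j version in use.

```cypher
// Safety net: a duplicate id makes the write fail instead of going unnoticed.
CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;

// Increment first, then use the value the counter held before the increment.
MATCH (c:IdCounter)
SET c.nextId = c.nextId + 1
WITH c.nextId - 1 AS uid
CREATE (x:Person {id: uid})
RETURN x;
```

Writing to the counter before creating the node means the transaction holds the write lock on the singleton for the rest of the statement, which also makes the counter a serialization point for all creations that use it.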
Setup:
Neo4j and Cypher version 2.2.0.
I'm querying Neo4j as an in-memory instance in Eclipse, created with new TestGraphDatabaseFactory().newImpermanentDatabase();.
I'm using this approach as it seems faster than the embedded version and I assume it has the same functionality.
My graph database is randomly generated programmatically with varying numbers of nodes.
Background:
I generate cypher queries automatically. These queries are used to try and identify a single 'target' node. I can limit the possible matches of the queries by using known 'node' properties. I only use a 'name' property in this case. If there is a known name for a node, I can use it to find the node id and use this in the start clause. As well as known names, I also know (for some nodes) if there are names known not to belong to a node. I specify this in the where clause.
The sorts of queries that I am running look like this...
START
nvari = node(5)
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION),
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE
NOT(nvarj.Name IN ['nf']) AND NOT(nvarm.Name IN ['nb','nj'])
RETURN DISTINCT target
Another way to think about this (if it helps), is that this is an isomorphism testing problem where we have some information about how nodes in a query and target graph correspond to each other based on restrictions on labels.
Question:
With regards to optimisation:
Would it help to include relation variables in the match clause? I took them out because the node variables are sufficient to distinguish between relationships but this might slow it down?
Should I restructure the match clause to have match/where couples including the where clauses from my previous example first? My expectation is that they can limit possible bindings early on. For example...
START
nvari = node(5)
MATCH
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarj.Name IN ['nf'])
MATCH
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarm.Name IN ['nb','nj'])
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION)
RETURN DISTINCT target
On the side:
(Less important but still an interest) If I make each relationship in a match clause an optional match except for relationships containing the target node, would cypher essentially be finding a maximum common sub-graph between the query and the graph data base with the constraint that the MCS contains the target node?
Thanks a lot in advance! I hope I have made my requirements clear but I appreciate that this is not a typical use-case for Neo4j.
I think querying with node properties is almost always preferable to using relationship properties (if you had a choice), as that opens up the possibility that indexing can help speed up the query.
As an aside, I would avoid using the IN operator if the collection of possible values only has a single element. For example, this snippet: NOT(nvarj.Name IN ['nf']), should be (nvarj.Name <> 'nf'). The current versions of Cypher might not use an index for the IN operator.
Restructuring a query to eliminate undesirable bindings earlier is exactly what you should be doing.
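A reduced sketch of that restructuring, using only two of the patterns from the question and applying the <> substitution suggested above. It returns nvarg rather than target simply because the remaining patterns are left out here.

```cypher
// Each WHERE is attached to the MATCH that introduces the variable it filters,
// so unwanted bindings are dropped before the next pattern is expanded.
MATCH (nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE nvarj.Name <> 'nf'
MATCH (nvarm:C3)-[:IN_LOCATION]->(nvarg)
WHERE NOT nvarm.Name IN ['nb', 'nj']
RETURN DISTINCT nvarg;
```

One thing to keep in mind when splitting a pattern across several MATCH clauses: Cypher only guarantees that a relationship is not matched twice within a single MATCH, so patterns that previously had to use distinct relationships may bind the same one after the split. Comparing the PROFILE output and result counts of both forms is a reasonable check.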
First of all, you would need to keep using MATCH for at least the first relationship in your query (which binds target), or else your result would contain a lot of null rows -- not very useful.
But, thinking clearly about this, if all the other relationships were placed in separate OPTIONAL MATCH clauses, you'd essentially be saying that you want a match even if none of the optional matches succeeded. Therefore, the logical equivalent would be:
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
RETURN DISTINCT target
I don't think this is a useful result.
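For comparison, this is roughly what the all-optional form would look like, shortened to two of the optional patterns from the question. Since an OPTIONAL MATCH never removes rows, the set of distinct targets is the same as in the two-line query above.

```cypher
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
// Unmatched optional patterns just bind their new variables to null.
OPTIONAL MATCH (nvara)-[:CONNECTED]->(nvarb:LOCATION)
OPTIONAL MATCH (nvara)-[:CONNECTED]->(nvarc:LOCATION)
RETURN DISTINCT target;
```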
I'm quite new to Neo4j. I'm developing a web app (using express.js and async) that handles a POST request, which in turn creates a triangle of nodes and relationships. So there are 6 queries, and I want to use the auto-increment ID (or rowID) of the created nodes (using id(a)) to create the relationships.
As I saw in another post (Node identifiers in neo4j), rowID should not be relied upon for reuse. But I have no other way of identifying my nodes (unless I create an index on all the properties, which is a pain).
Hence my question: can I use rowID for this use case? If not, what kind of use case is rowID better suited for?
If you only need an id to create the triangles, then you can use id(n), but probably you can just create the triangle with a single cypher statement.
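A sketch of the single-statement version, which avoids passing ids between queries at all. The Thing label, the LINKED_TO relationship type and the name parameters are placeholders, since the question does not describe the actual domain.

```cypher
// All three nodes and all three relationships are created in one statement,
// so the variables a, b and c take the place of the ids.
// {nameA}, {nameB}, {nameC} are request parameters ($nameA etc. on newer versions).
CREATE (a:Thing {name: {nameA}}),
       (b:Thing {name: {nameB}}),
       (c:Thing {name: {nameC}}),
       (a)-[:LINKED_TO]->(b),
       (b)-[:LINKED_TO]->(c),
       (c)-[:LINKED_TO]->(a)
RETURN id(a), id(b), id(c);
```

Because it is a single statement, the whole triangle is created in one transaction: either all six elements exist afterwards or none do.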
Perhaps you can share more of your code/domain?
Usually you should have a business-key / -id that you can use.
Create your own id and store it on the node as a property, if you have no unique id that you can use.