Global indexes in Neo - neo4j

In our company, we return a list of IDs to clients through a web service. The IDs are unique across the system. Clients then invoke other web services, passing those IDs. We don't always know the label of the node behind an ID we receive.
This does not perform well:
MATCH (n {id:{my_id}}) ...
While we have indexes on almost all label types, this query has no label and thus does not use an index, as far as I can tell.
Is it a bad idea to add a label called "GLOBAL" (or whatever) to all nodes so we can put a unique constraint on GLOBAL.id? Then the query above could be
MATCH(n: GLOBAL{id:{my_id}})...
and perform nicely.
Is there another way?

You can use Neo4j's internal ID to identify your resources, but it's not best practice; see Should we use the Neo4J internal id?
This is how to get a node by its Neo4j internal ID:
START n=node({my_id}) return n
It's much faster than a MATCH on a property, because the query starts directly at one node and doesn't have to filter a property across a set of nodes.
If you can handle the internal id limitations, it's the solution you are looking for.

Related

How to handle cypher query common stanzas

I'm writing a bunch of queries in order to build a tree inside Neo4j, but in order to add different types of new data, I'm writing the same opening stanzas for each of my queries.
Example: I want to be able to add Root(identifier=Root1)->A(identifier=1)->B(identifier=2)... without modifying the trees pointed to by other roots.
All of my queries start off with
Match
(root:`Root` {identifier: $identifier})
Create
(root)-[:`someRel`]->(a:`A` {identifier: $a_identifier})
Then some time passes and A needs a child:
Match
(root:`Root` {identifier: $identifier})
-[:`someRel`]->
(a:`A` {identifier: $a_identifier})
Create
(a)-[:`someOtherRel`]->(b:`B` {identifier: $b_identifier})
Then some other time passes and maybe B needs a child, and I have to use the same opening stanza to get to A and then add another one to get the correct B.
Is there some functionality that I'm missing that will allow me to not have to build up those opening stanzas every time I want to get to the correct B, (or C or D) or do I just need to do this using string concatenation?
String concatenation example: (python)
MATCH
{ROOT_LOOKUP_STANZA},
{A_LOOKUP_STANZA},
{B_LOOKUP_STANZA}
CREATE
(b)-[:`c_relationship`]->(c:`C` {...})
Some additional notes:
Root Nodes have to be uniquely identified
The rest of the nodes have to be uniquely identified with their parents. So the following is valid:
Root(root)->A(a)->B(b)
Root(root)->A(a1)->B(b)
In this case B(b) references two different nodes because their parents are different.
So your main problem is that children do not have unique ids; only the root nodes do. Neo4j does not (yet) have any mechanism to carry the final context of one query into the start of another, and makes no guarantee that a node's internal id will be the same between queries. So for your data as is, you must match the whole chain to be sure you match the correct node to append to. There are a few things you can do to make this unnecessary, though.
Add a UUID
By adding a universally unique id to each node (and indexing that property), you will be able to match on that id with the guarantee that there are no collisions and that it will be the same across queries. Any time a node's internal ID would be useful, that is a good sign you could use UUIDs in your data. (It also helps if the data is mirrored to other databases.)
Store the path as a Unique ID
It's possible you don't know the UUID assigned in Neo4j (because it's not in the source data), but in a tree you can create a unique ID in the format of <parent-ID>_<index><sorted-labels><source-id>. The idea here is that the parent is guaranteed to have a unique id, and you combine that id with the info that makes this child unique to that parent. This allows you to generate a deterministic unique ID. (It requires a tree data structure with a unique root id.) In most cases, you can probably leave the index part out (that is for cases of lists/arrays in the source data). In essence, you are storing the path from the root node to this node as the node's unique id. (Again, you will want an index on this id.)
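The path-as-ID idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name path_id, the "|" and ":" separators, and the sample identifiers are all made up here (the format above does not specify separators, so they are added to keep the parts from running together).

```python
def path_id(parent_id, labels, source_id, index=None):
    """Build a deterministic child ID from the parent's unique ID plus
    whatever makes the child unique under that parent.

    The separators ("_", "|", ":") are an assumption of this sketch.
    """
    parts = []
    if index is not None:          # only needed for list/array source data
        parts.append(str(index))
    parts.append(":".join(sorted(labels)))
    parts.append(str(source_id))
    return f"{parent_id}_" + "|".join(parts)


# Root(root)->A(a)->B(b) and Root(root)->A(a1)->B(b) from the question:
a = path_id("Root1", ["A"], "a")
a1 = path_id("Root1", ["A"], "a1")
b_under_a = path_id(a, ["B"], "b")
b_under_a1 = path_id(a1, ["B"], "b")
```

The two B(b) nodes end up with different IDs because their parents differ, so each can be matched directly by that one indexed property instead of by the whole chain.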
Batch the job
If this is all part of just one job, another option is to pool the changes you want to make and generate one Cypher query that does all of them while Neo4j already has everything fetched.
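One way to pool changes is to collect them as rows and send them as a single parameter to one query. The sketch below assumes things that are not in the original: a hypothetical :Node label with an indexed uid property, a hypothetical :CHILD relationship type, and parents that already exist before the batch runs. It only builds the query text and parameters; running them is left to whatever driver you use.

```python
# Hypothetical schema: (:Node {uid}) nodes joined by :CHILD relationships.
BATCH_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (parent:Node {uid: row.parent_uid}) "
    "CREATE (parent)-[:CHILD]->(:Node {uid: row.child_uid})"
)


def build_batch(pending):
    """Turn (parent_uid, child_uid) pairs into the $rows parameter."""
    return {"rows": [{"parent_uid": p, "child_uid": c} for p, c in pending]}


# Two children to attach to an already existing root.
pending_changes = [("Root1", "Root1_A|a"), ("Root1", "Root1_A|a1")]
params = build_batch(pending_changes)
```

Children whose parents are created in the same batch are better sent in a later batch, since the MATCH may not see nodes created earlier in the same statement.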

Use of index in Neo4j

Suppose I have 2 types of nodes :Server and :Client.
(Client)-[:CONNECTED_TO]->(Server)
I want to find the Female clients connected to some Server ordered by age.
I did this
Match (s:Server{id:"S1"})<-[:CONNECTED_TO]-(c{gender:"F"}) return c order by c.age DESC
Doing this, all the Client nodes linked to my Server node are traversed to find the highest age.
Is there a way to index the Client nodes on gender and age properties to avoid the full scan?
You can create an index on :Client(gender), as follows:
CREATE INDEX ON :Client(gender);
However, your particular query will probably benefit more from an index on :Server(id), since there are likely many female clients but only a single Server with that id. So you probably want to do this instead:
CREATE INDEX ON :Server(id);
But, even better, if every Server has a unique id property, you should create a uniqueness constraint (which also automatically creates an index for you):
CREATE CONSTRAINT ON (s:Server) ASSERT s.id IS UNIQUE;
Currently, neo4j does not use indexes to perform ordering, but there are some APOC procedures that do support that. However, the procedures do not support returning results in descending order, which is what you want. So, if you really need to use indexing for this purpose, a workaround would be to add an extra minusAge property to your Client nodes that contains the negative value of the age property. If you do this, then first create an index:
CREATE INDEX ON :Client(minusAge);
and then use this query:
MATCH (s:Server{id:"S1"})<-[:CONNECTED_TO]-(cl:Client {gender:"F"})
CALL apoc.index.orderedRange('Client', 'minusAge', -200, 0, false, -1) YIELD node AS c
RETURN c;
The 3rd and 4th parameters of that procedure are for the minimum and maximum values you want to use for matching (against minusAge). The 5th parameter should be false for your purposes (and is actually currently ignored by the implementation). The last parameter is for the LIMIT value, and -1 means you do not want a limit.
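Why the minusAge trick works can be seen without a database. The small Python sketch below uses made-up client data: reading entries in ascending order of minusAge, which is the only direction an ascending index scan offers, yields them in descending order of age.

```python
# Hypothetical sample data standing in for Client nodes.
clients = [
    {"name": "Ann", "age": 31},
    {"name": "Bea", "age": 47},
    {"name": "Cleo", "age": 25},
]

# Store the negated age next to the real one, as suggested above.
for client in clients:
    client["minusAge"] = -client["age"]

# An ascending walk over minusAge...
ascending_by_minus_age = sorted(clients, key=lambda c: c["minusAge"])

# ...visits the clients from oldest to youngest.
ages_descending = [c["age"] for c in ascending_by_minus_age]
```

The cost is that minusAge has to be kept in sync whenever age changes.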
If that is a request you're doing quite frequently, then you might want to write that data out. As you're experiencing, that query can be quite expensive and it won't get better the more clients you get, as in fact, all of the connected nodes have to be retrieved and get a property check/comparison run on them.
As a workaround, you can add another connection to your clients when you modify their age data.
As a suggestion, you can create an Age node and create a MATURE relationship to your oldest clients.
(:Server)<-[:CONNECTED_TO]-(:Client)-[:MATURE]->(:Age)
You can do this for all the ages, and run queries off the Age nodes (with an indexed/unique age property on them) as needed. If you have 100,000 clients but are only interested in the top 100 ordered by age, there's no need to get all the clients and order them... It really depends on your use case and client apps.
While this is certainly not a nice pattern, I've seen it work in production, and it is the only workaround I've seen hold up well across different production environments.
If this answer didn't solve your problem (I'd rather use an age property, too), I hope it gave you at least an idea on what to do next.

Set a transient property on a node neo4j

Is there a way to set a transient property on nodes returned by a Cypher query, such that it is only visible to the user running the query?
This would allow us to offload some controller logic directly into Neo4j and reduce business logic queries.
Currently I have a list that is returned by
List<Post> newsFeed (Long uid) {}
Post is a relationship between a User and News node.
I have two sub-classes of the Post object:
BroadcastedPost
MentionedPost
I have two cypher queries that return the posts that a user should see.
List broadcasts obtained from
MATCH (user:PlatformUser)-[:BROADCASTED]->post RETURN post;
List mentionedPost obtained from
MATCH (user:PlatformUser)-[:MENTIONED]->post RETURN post;
I then use Java instanceof to determine what kind of post this is. Depending on the type I am able to do some further application logic.
This, however, is inefficient, because I should be able to combine both queries into one super query using the UNION operator,
i.e. List newsFeed would be obtained directly by querying
MATCH (user:PlatformUser)-[:BROADCASTED]->post RETURN post UNION MATCH (user:PlatformUser)-[:MENTIONED]->post RETURN post;
However, how can I tell what kind of post each result is? I was hoping I could use the SET operator transiently to mark the kind of post, but I believe SET is used to persist a property.
Neo4j 2.2 recently added authentication, which it had lacked in previous releases, but it's still only really one user; you set a login/password to secure access to the database, but adding additional users takes extra work and isn't something obvious to do out of the box.
Now what you're asking for has to do with securing per-user access to particular types of data. Since neo4j doesn't have much of a user management feature right now, what you're asking for can't be done inside of neo4j because in order to secure this data away from Joe or Bob, the DBMS would have to know that it's dealing with Joe or Bob.
What you're trying to do is usually enforced by the application layer by people writing neo4j applications right now. So it can be done, but it's done within your custom code and not by the database directly.
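Separately from per-user security, if the goal is only to tell the two kinds of post apart in a combined result, one possibility is to return a literal column from each half of the UNION rather than setting anything on the node. The sketch below is an assumption-laden illustration: the column name kind, its values, and the sample rows are invented, and the rows simply stand in for whatever your driver returns.

```python
# Each half of the UNION returns the same columns; "kind" is a literal,
# so nothing is persisted on the nodes. Column name and values are made up.
NEWS_FEED_QUERY = (
    "MATCH (user:PlatformUser)-[:BROADCASTED]->(post) "
    "RETURN post, 'broadcast' AS kind "
    "UNION "
    "MATCH (user:PlatformUser)-[:MENTIONED]->(post) "
    "RETURN post, 'mention' AS kind"
)


def split_by_kind(rows):
    """Group result rows by the literal 'kind' column."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["kind"], []).append(row["post"])
    return grouped


# Stand-in for driver output: one dict per result row.
sample_rows = [
    {"post": "post-1", "kind": "broadcast"},
    {"post": "post-2", "kind": "mention"},
    {"post": "post-3", "kind": "broadcast"},
]
grouped_posts = split_by_kind(sample_rows)
```

The application can then branch on kind instead of running two queries and checking the class of each result.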

Auto increment property in Neo4j

As far as I understand it the IDs given by Neo4j (ID(node)) are unstable and behave somewhat like row numbers in SQL. Since IDs are mostly used for relations in SQL and these are easily modeled in Neo4j, there doesn't seem to be much use for IDs, but then how do you solve retrieval of specific nodes? Having a REST API which is supposed to have unique routes for each node (e.g. /api/concept/23) seems like a pretty standard case for web applications.
But despite it being so fundamental, the only viable ways I found were either via
language specific frameworks
as an unconnected node which maintains the increments:
// get unique id
MERGE (id:UniqueId{name:'Person'})
ON CREATE SET id.count = 1
ON MATCH SET id.count = id.count + 1
WITH id.count AS uid
// create Person node
CREATE (p:Person{id:uid,firstName:'Gabriel',lastName:'Smith'})
RETURN p AS person
Source: http://www.neo4j.org/graphgist?8012859
Is there really not a simpler way and if not, is there a particular reason for it? Is my approach an anti-pattern in the context of Neo4j?
Neo4j internal ids are a bit more stable than SQL row ids; for example, they will never change during a transaction.
And indeed exposing them for external usage is not recommended. I know there are some intentions at Neo internals to implement such a feature.
Basically people tend to use two solutions for this :
Using a UUID generator at the application level like for PHP : https://packagist.org/packages/rhumsaa/uuid and add a label/uuid unique constraint on all nodes.
Using a very handy Neo4j plugin like https://github.com/graphaware/neo4j-uuid that will add uuid properties on the fly, so it removes the burden of handling them at the application level and makes it easier to manage the persistence state of your node objects
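The first option, generating the UUID in the application, might look like the following in Python, using only the standard library. The helper name new_person_params and the query text are illustrative assumptions; the sketch builds the parameters for a parameterized CREATE and leaves execution to your driver.

```python
import uuid

# Parameterized query; assumes a unique constraint on :Person(uuid) exists.
CREATE_PERSON = (
    "CREATE (p:Person {uuid: $uuid, firstName: $firstName, "
    "lastName: $lastName}) RETURN p"
)


def new_person_params(first_name, last_name):
    """Return query parameters with a freshly generated random UUID."""
    return {
        "uuid": str(uuid.uuid4()),  # random (version 4) UUID as a string
        "firstName": first_name,
        "lastName": last_name,
    }


params = new_person_params("Gabriel", "Smith")
```

The uuid value can then be used in routes such as /api/concept/<uuid>, independent of Neo4j's internal ids.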
I agree with Pavel Niedoba.
I came up with this, without a UniqueId node:
MATCH (a:Person)
WITH a ORDER BY a.id DESC LIMIT 1
CREATE (n:Person {id: a.id+1})
RETURN n
It requires a first Node with an id field though.

Neo4j node property type

I'm playing around with neo4j, and I was wondering, is it common to have a type property on nodes that specify what type of Node it is? I've tried searching for this practice, and I've seen some people use name for a purpose like this, but I was wondering if it was considered a good practice or if indexes would be the more practical method?
An example would be a "User" node, which would have type: user, this way if the index was bad, I would be able to do an all-node scan and look for types of user.
Labels have been added to neo4j 2.0. They fix this problem.
You can create nodes with labels:
CREATE (me:American {name: "Emil"}) RETURN me;
You can match on labels:
MATCH (n:American)
WHERE n.name = 'Emil'
RETURN n
You can set any number of labels on a node:
MATCH (n)
WHERE n.name='Emil'
SET n :Swedish:Bossman
RETURN n
You can delete any number of labels on a node:
MATCH (n { name: 'Emil' })
REMOVE n:Swedish
Etc...
True, it does depend on your use case.
If you add a type property and then wish to find all users, then you're in potential trouble as you've got to examine that property on every node to get to the users. In that case, the index would probably do better- but not in cases where you need to query for all users with conditions and relations not available in the index (unless of course, your index is the source of the "start").
If you have graphs like mine, where a relation type implies two different node types like A-(knows)-(B) and A or B can be a User or a Customer, then it doesn't work.
So your use case is really important- it's easy to model graphs generically, but important to "tune" it as per your usage pattern.
IMHO you shouldn't have to put a type property on the node. Instead, a common way to reference all nodes of a specific "type" is to connect all user nodes to a node called "Users" maybe. That way starting at the Users node, you can very easily find all user nodes. The "Users" node itself can be indexed so you can find it easily, or it can be connected to the reference node.
I think it's really up to you. Some people like indexed type attributes, but I find that they're mostly useful when you have other indexed attributes to narrow down the number of index hits (search for all users over age 21, for example).
That said, as @Luanne points out, most of us try to solve the problem in-graph first. Another way to do that (and the more natural way, in my opinion) is to use the relationship type to infer a practical node type, i.e. "A - (knows) -> B", so A must be a user or some other thing that can "know", and B must be another user, a topic, or some other object that can "be known".
For client APIs, modeling the element type as a property makes it easy to instantiate the right domain object in your client-side code so I always include a type property on each node/vertex.
The "type" var name is commonly used for this, but in some languages like Python, "type" is a reserved word so I use "element_type" in Bulbs ( http://bulbflow.com/quickstart/#models ).
This is not needed for edges/relationships because they already contain a type (the label) -- note that Neo4j also uses the keyword "type" instead of label for relationships.
I'd say it's common practice. As an example, this is exactly how Spring Data Neo4j knows which entity type a given node represents. Each node has a "type" property that contains the qualified class name of the entity. These properties are automatically indexed in the "types" index, so nodes can be looked up really fast. You could implement your use case exactly like this.
Labels have recently been added to Neo4j 2.0 ( http://docs.neo4j.org/chunked/milestone/graphdb-neo4j-labels.html ). They are still under development at the moment, but they address this exact problem.
