Neo4j Relationship Schema Indexes

Using Neo4j 2.1.4 and SDN 3.2.0.RELEASE
I have a graph that connects nodes with relationships that have a UUID associated with them. External systems use the UUID as a means to identify the source and target of the relationship. Within Spring Data Neo4j (SDN) we have a @RelationshipEntity(type="LINKED_TO") class with a @StartNode, @EndNode and a String uuid field. The uuid field is @Indexed and the resulting schema definition in Neo4j shows up as
neo4j-sh (?)$ SCHEMA
==> Indexes
...
==> ON :Link(uuid) ONLINE
...
However, running a cypher query against the data, e.g.
MATCH ()-[r:LINKED_TO]->() WHERE r.uuid='XXXXXX' RETURN r;
does a full scan of the database and takes a long time.
If I try to use the index by running
MATCH ()-[r:LINKED_TO]->() USING INDEX r:Link(uuid) WHERE r.uuid='XXXXXX' RETURN r;
I get
SyntaxException: Type mismatch: expected Node but was Relationship.
As I understand it, Relationships are supposed to be first class citizens in Neo4j, but I can’t see how to utilize the index on the relationship to prevent the graph equivalent of a table scan on the database to locate the relationship.
I know there are posts like How to use relationship index in Cypher which ask similar things, but this Link is the relationship between the two nodes. If I converted the Link to a Node, we would be creating a Node to represent a Relationship which seems wrong when we are working in a graph database - I'd end up with ()-[:xxx]->(:Link)-[:xxx]->() to represent one relationship. It would make the model messy purely due to the fact that the Link couldn't be represented as a relationship.
The Link has got a unique, shared key attached to it that I want to use. The Schema output suggests that the index is there for that field - I just can't use it.
Does anyone have any suggestions?
Many thanks,
Dave

Schema indexes are only available for nodes. The only way to index relationships is using legacy indexes or autoindexes. Legacy indexes need to be used explicitly in the START clause:
START r=relationship:my_index_name(uuid=<myuuid>)
RETURN r
I'm not sure how this can be used in conjunction with SDN.
Side note: requiring a relationship index is almost always an indication of doing something wrong in your graph data model. Everything that is a thing or has an identity in your domain should be a node. So if a relationship requires a uuid, maybe the relationship refers to a thing and should therefore be converted into a node that has an inbound relationship from the previous start node and an outbound relationship to the previous end node.
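The remodelling described in the side note can be sketched in plain Python. This is only an illustration of the idea, not SDN or Neo4j code; the `reify` and `endpoints` helpers and the `LINK_FROM` / `LINK_TO` relationship names are assumptions made up for the example.

```python
def reify(relationships):
    """Turn each (start, uuid, end) relationship into a Link node
    with one inbound and one outbound relationship."""
    link_nodes = {}  # uuid -> Link node; plays the role of an index on :Link(uuid)
    edges = []       # (from, type, to) triples
    for start, uuid, end in relationships:
        link_nodes[uuid] = {"label": "Link", "uuid": uuid}
        edges.append((start, "LINK_FROM", uuid))  # previous start node -> Link
        edges.append((uuid, "LINK_TO", end))      # Link -> previous end node
    return link_nodes, edges


def endpoints(uuid, edges):
    """Return the (source, target) pair of the link with the given uuid.

    The edge list is scanned linearly here for brevity; in the database
    these are simply the two relationships attached to the Link node.
    """
    source = next(f for f, t, to in edges if t == "LINK_FROM" and to == uuid)
    target = next(to for f, t, to in edges if t == "LINK_TO" and f == uuid)
    return source, target


links, edges = reify([("a", "u-1", "b"), ("b", "u-2", "c")])
print(links["u-1"], endpoints("u-1", edges))
```

Once the link is a node, the uuid lookup is an ordinary indexed node lookup, and the source and target are one hop away in each direction.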

Related

What does indexing mean in neo4j and how does it affect performance

I have an idea of how indexing works in an RDBMS, but I can't work out how indexing works in neo4j. Also, what is schema indexing?
To quote from neo4j's free book, Graph Databases:
Indexes help optimize the process of finding specific nodes. Most of the time, when querying a graph, we’re happy to let the traversal process discover the nodes and relationships that meet our information goals. By following relationships that match a specific graph pattern, we encounter elements that contribute to a query’s result. However, there are certain situations that require us to pick out specific nodes directly, rather than discover them over the course of a traversal. Identifying the starting nodes for a traversal, for example, requires us to find one or more specific nodes based on some combination of labels and property values.
That same book does an extensive comparison between neo4j and relational databases as well.
As for what the above-mentioned indexes (also known as "schema indexes") index: they index the nodes that have a specific node label and node property combination.
There is also a different indexing mechanism called "manual" (or "legacy", or "explicit") indexing, which is now only recommended for special use cases.
[UPDATE]
As an example, suppose we have already created an index on :Person(firstname), like so:
CREATE INDEX ON :Person(firstname);
In that case, the following query can quickly start off by using the index to find the desired Person nodes. Once those nodes are found, neo4j can easily traverse their outgoing WORKS_AT relationships to find the related Company nodes:
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
WHERE p.firstname = 'Karan'
RETURN p, c;
Without that index, the query would have to either:
Scan through all Person nodes to find the right ones, before traversing their outgoing WORKS_AT relationships, or
Find all Company nodes, traverse their incoming WORKS_AT relationships, and compare the firstname values of every Person at the other end of the relationship.
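To make the difference concrete, here is a rough Python sketch (not Neo4j code) that contrasts scanning every Person node with an index lookup. The sample data and the `find_by_scan`, `build_index` and `find_by_index` helpers are made up for illustration; a real schema index is maintained by the database itself.

```python
from collections import defaultdict

# Hypothetical stand-in for all nodes carrying the :Person label.
people = [
    {"id": 1, "firstname": "Karan"},
    {"id": 2, "firstname": "Dave"},
    {"id": 3, "firstname": "Karan"},
]


def find_by_scan(nodes, firstname):
    """Label scan: inspect every Person node and compare the property."""
    return [n for n in nodes if n["firstname"] == firstname]


def build_index(nodes, key):
    """Build a property value -> nodes map, roughly what an index provides."""
    index = defaultdict(list)
    for n in nodes:
        index[n[key]].append(n)
    return index


def find_by_index(index, firstname):
    """Index lookup: jump straight to the matching nodes."""
    return index.get(firstname, [])


firstname_index = build_index(people, "firstname")
scan_ids = [n["id"] for n in find_by_scan(people, "Karan")]
index_ids = [n["id"] for n in find_by_index(firstname_index, "Karan")]
print(scan_ids, index_ids)
```

Both functions return the same nodes; the scan touches every Person, while the lookup touches only the matches, which is what matters once the label covers many nodes.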

How to get an instant match to a node if I know its <id>

I'm trying to speed up this query:
LOAD CSV FROM 'file:///path/to/file' AS line
MATCH (n:Organization{rc:'2051061'})-[:Ap]->(a:Person{numDc: toint(line[1])})
CREATE (a)-[:Af]->(n)
The CSV has about 100k rows, and the relationship (n:Organization)-[:Ap]->(a:Person) is unique for each Organization/Person pair. There are 50 nodes with the label :Organization and 200k with the label :Person.
So basically I take a value from the CSV, check whether a :Person that has an :Ap relationship with the :Organization with the given rc (2051061) has that value as its numDc, and then add another relationship between the Person and the Organization.
My query runs too slowly, even though I added indexes on :Person(numDc) and :Organization(rc).
I think the problem may be that I'm matching the organization for every row.
How can I get an instant match to that node if I know its <id>?
Thanks in advance.
Note: You may not actually need to create an Af relationship if it does not have any properties, since you can easily traverse an Ap relationship "backwards" from a to n.
If you really do need to create an Af relationship, you can improve your performance by forcing Cypher to use both of your indexes.
Using PROFILE on your query (with the 2 indexes), I see that the Cypher planner (I tried both planner types) uses the SchemaIndex operator (which takes advantage of an index) with only one of your indexes. In order to force Cypher to use both indexes, you can use the USING INDEX clause, like this:
LOAD CSV FROM 'file:///path/to/file' AS line
MATCH (n:Organization { rc:'2051061' })
USING INDEX n:Organization(rc)
MATCH (n)-[:Ap]->(a:Person { numDc: toint(line[1])})
USING INDEX a:Person(numDc)
CREATE (a)-[:Af]->(n);
The performance should be much improved.
It's better to use your own unique identifier instead of the node id, because you can't rely on the id. The node id is basically the address of the node's record in the file of node records.
You can add a unique id to your CSV file and import it into the database.
Or you can use GraphAware UUID module for creating UUID on the fly - https://github.com/graphaware/neo4j-uuid

Do we need to index on relationship properties to ensure that Neo4j will not search through all relationships

To clarify, let's assume that I have a relationship type, "connection". Connections have a property called "typeOfConnection", which can take on values in the domain:
{"GroupConnection", "FriendConnection", "BlahConnect"}.
When I query, I may want to qualify connection with one of these types. While there are not many types, there will be millions of connections with each property type.
Do I need to put an index on connection.typeOfConnection in order to ensure that all connections will not be traversed?
If so, I have been unable to find a simple cypher statement to do this. I've seen some stuff in the documentation describing how to do this in Java, but I'm interacting with Neo using Py2Neo, so it would be wonderful if there was a cypher way to do this.
This is a mixed granularity property graph data model. Totally fine, but you need to replace your relationship qualifiers with intermediate nodes. To do this, replace your relationships with one type node and 2 relationships so that you can perform indexing.
Your model has a graph with a coarse-grained granularity. The opposite extreme is referred to as fine-grained granularity, which is the foundation of the RDF model. With property graph you'll need to use nodes in place of relationships that have labels applied by their type if you're going to do this kind of coarse-grained graph.
For instance, let's assume you have:
MATCH (thing1:Thing { id: 1 })-->(group:Connection { type: "group" }),
(group)-->(thing2:Thing)
RETURN thing2
Then you can index on the label Connection by property type.
CREATE INDEX ON :Connection(type)
This allows you the flexibility of not typing your relationships if your application requires dynamic types of connections that prevent you from using a fine-grained granularity.
Whatever you do, don't work around your issue by dynamically generating typed relationships in your Cypher queries. This will prevent your query templates from being cached and decrease performance. Either type all your relationships or go with the intermediate node I've recommended above.

Cypher query to find a node based on a regexp on a property

I have a Neo4J database mapped with COMPANY as nodes and RELATED as edges. COMPANY has a CODE property. I want to query the database and get the first node that matches the regexp COMPANY.CODE =~ '12345678.*', i.e., a COMPANY whose CODE starts with a given 8-character string literal.
After several attempts, the best I could come up with was the following query:
START p=node(*) where p.CODE =~ '12345678.*' RETURN p;
The result is the following exception:
org.neo4j.cypher.EntityNotFoundException:
The property 'CODE' does not exist on Node[0]
It looks like Node[0] is a special kind of node in the database, that obviously doesn't have my CODE property. So, my query is failing because I'm not choosing the appropriate type of node to query upon. But I couldn't figure out how to specify the type of node to query on.
What's the query that returns what I want?
I think I need an index on CODE to run this query, but I'd like to know whether there's a query that can do the job without using such an index.
Note: I'm using Neo4J version 1.9.2. Should I upgrade to 2.0?
You can avoid the exception by checking for the existence of the property:
START p=node(*)
WHERE HAS(p.CODE) AND p.CODE =~ '12345678.*'
RETURN p;
You don't need an index for the query to work, but an index may increase performance. If you don't want indices there are several other options. If you keep working with Neo4j 1.9.x you may group all nodes representing companies under one or more sorting nodes. When you query for company nodes you can then retrieve them from their sorting node and filter on their properties. You can partition your graph by grouping companies with a certain range of values for .code under one sorting node, and a different range under another; and you can extend this partitioning as needed if your graph grows.
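The sorting-node idea can be sketched in plain Python (an illustration only, not Neo4j code). Here each bucket stands in for one sorting node, companies are partitioned by the first characters of their CODE, and the `build_sorting_nodes` and `find_first` helpers as well as the prefix length of 2 are assumptions chosen for the example.

```python
import re
from collections import defaultdict


def build_sorting_nodes(companies, prefix_len=2):
    """Group companies under one 'sorting node' per CODE prefix."""
    buckets = defaultdict(list)
    for company in companies:
        buckets[company["CODE"][:prefix_len]].append(company)
    return buckets


def find_first(buckets, code_prefix, prefix_len=2):
    """Return the first company whose CODE starts with code_prefix.

    Only the one bucket that can contain a match is inspected, and the
    regular expression must match the whole CODE, as =~ does in Cypher.
    """
    regex = re.compile(re.escape(code_prefix) + ".*")
    for company in buckets.get(code_prefix[:prefix_len], []):
        if regex.fullmatch(company["CODE"]):
            return company
    return None


companies = [
    {"CODE": "1234567800"},
    {"CODE": "1299999900"},
    {"CODE": "9900000000"},
]
buckets = build_sorting_nodes(companies)
print(find_first(buckets, "12345678"))
```

The filter still runs a regular expression, but only over the companies attached to one sorting node rather than over every node in the graph.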
If you upgrade to 2.0 (note: not released yet, so not suitable for production) you can benefit from labels. You can then assign a Company label to all those nodes and this label can maintain its own index and uniqueness constraints.
The zero node, called 'reference node', will not remain in future versions of Neo4j.

How can I port a relational database to Neo4j?

I am playing around with Neo4j but trying to get my head around the graph concepts. As a learning process I want to port a small Postgres relational database schema to Neo4j. Is there any way I can port it and issues "equivalent" relational queries to Neo4j?
Yes, you can port your existing schema to a graph database. Keep in mind that this is not necessarily the best model for your data, but it is a starting point.
How easy it is depends a lot on the quality of your existing schema.
The tables corresponding to entities in an entity-relationship diagram define your types of nodes. In the upcoming neo4j 2.0, you can label them with the name of the entity to make lookup easier. In older versions you can use an index or a manual label property.
Assuming a best case, where all relationships between your data are modelled using foreign keys, any 1:1 relationship between nodes can be identified and ported next.
For tables modelling n:m relationships, identify the corresponding nodes and add a direct relationship between them.
So as an example assume tables Author[id, name, publisher foreign key], Publisher[id, name] and Book[id, title] and written_by[author foreign key, book foreign key].
Every row in Author, Publisher and Book becomes a node.
Every Author node gets a relationship to the publisher identified by the foreign key relationship.
For every row in written_by you add a relationship between the referenced Author node and Book node.
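The three steps above can be sketched in plain Python, producing nodes and relationships as simple data structures rather than writing to Neo4j. The sample rows, the `port` function and the `published_by` relationship name are assumptions made up for the example; only `written_by` and the table names come from the text.

```python
# Hypothetical rows, shaped like the example tables.
authors = [{"id": 1, "name": "Hugh Laurie", "publisher": 10}]
publishers = [{"id": 10, "name": "Example Press"}]
books = [{"id": 100, "title": "Example Book"}]
written_by = [{"author": 1, "book": 100}]


def port(authors, publishers, books, written_by):
    nodes = {}          # (label, id) -> properties
    relationships = []  # (start_key, type, end_key)

    # Every row in an entity table becomes a node labelled with the table name.
    # The foreign key column is dropped because it becomes a relationship.
    tables = (("Author", authors), ("Publisher", publishers), ("Book", books))
    for label, rows in tables:
        for row in rows:
            props = {k: v for k, v in row.items() if k != "publisher"}
            nodes[(label, row["id"])] = props

    # The Author -> Publisher foreign key becomes a relationship.
    for row in authors:
        relationships.append(
            (("Author", row["id"]), "published_by", ("Publisher", row["publisher"]))
        )

    # Every row of the n:m join table becomes a direct relationship.
    for row in written_by:
        relationships.append(
            (("Author", row["author"]), "written_by", ("Book", row["book"]))
        )
    return nodes, relationships


nodes, relationships = port(authors, publishers, books, written_by)
print(len(nodes), len(relationships))
```

The join table disappears entirely: its rows carry no data of their own, so each one collapses into a single relationship.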
For queries in neo4j I recommend cypher due to its expressiveness. A (2.0) query looking for books by some author would look like:
MATCH (author:Author)-[:written_by]-(book:Book)
WHERE author.name='Hugh Laurie'
RETURN book.title
You actually have several options at hand:
use the Talend connector for Neo4J
export your schema+data in CSV files consumable by the batch importer
or you can do it programmatically
I'm afraid not. The relational data model and the graph data model are two different ways of modelling a real-world domain. It requires a human brain (at least as of 2013) to understand the domain in order to model it.
I suggest that you take a piece of paper and capture, using circles and arrows, what your entities are (nodes) and how they relate to each other (relationships). Then, have a look at that piece of paper. Voila, your new Neo4j data model.
Then, take a query that you want to be answered and try to figure out how you would do that without a computer, just by tracing your nodes and relationships with a finger on that piece of paper. Once you've figured that out, translate what you've done to a Cypher query.
Have a look at neo4j.org, there are plenty of examples.
Check this out:
The musicbrainz -> neo4j
https://github.com/redapple/sql2graph/tree/master/examples/musicbrainz
Neo4j Sql-importer
https://github.com/peterneubauer/sql-import
Good Luck!
This tool does exactly that.
Import any relational db into neo4j
https://github.com/jexp/neo4j-rdbms-import
