What is indexing means neo4j and how it effects performance - neo4j

I have a idea of indexing in rdbms but can't think how indexing works in neo4j and also what is schema indexing?

To quote from neo4j's free book, Graph Databases:
Indexes help optimize the process of finding specific nodes.
Most of
the time, when querying a graph, we’re happy to let the traversal
process discover the nodes and relationships that meet our
information goals. By following relationships that match a specific
graph pattern, we encounter elements that contribute to a query’s
result. However, there are certain situations that require us to pick
out specific nodes directly, rather than discover them over the course
of a traversal. Identifying the starting nodes for a traversal, for
example, requires us to find one or more specific nodes based on some
combination of labels and property values.
That same book does an extensive comparison between neo4j and relational databases as well.
As for what the above-mentioned indexes (also known as "schema indexes") index: they index the nodes that have a specific node label and node property combination.
There is also a different indexing mechanism called "manual" (or "legacy", or "explicit") indexing, which is now only recommended for special use cases.
[UPDATE]
As an example, suppose we have already created an index on :Person(firstname), like so:
CREATE INDEX ON :Person(firstname);
In that case, the following query can quickly start off by using the index to find the desired Person nodes. Once those nodes are found, neo4j can easily traverse their outgoing WORKS_AT relationships to find the related Company nodes:
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
WHERE p.firstname = 'Karan'
RETURN p, c;
Without that index, the query would have to either:
Scan through all Person nodes to find the right ones, before traversing their outgoing WORKS_AT relationships, or
Find all Company nodes, traverse their incoming WORKS_AT relationships, and compare the firstname values of every Person at the other end of the relationship.

Related

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graphing databases and can see how it applies to these problems. So, over the past couple of days I set up neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) as d
WHERE d:Order

Most efficient way to get all connected nodes in neo4j

The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH(n {name: "xxx") -[*]-> (o) RETURN o.name, but this seems to be extremely slow on even a small number of nodes (4000 nodes - takes 5s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?
As a first point, by not using labels and an indexed property for your starting node, this will already need to first find ALL the nodes in the graph and opening the PropertyContainer to see if the node has the property name with a value "xxx".
Secondly, if you now an approximate maximum depth of parentship, you may want to limit the depth of the search
I would suggest you add a label of your choice to your nodes and index the name property.
Use label, e.g. :Group for your starting point and an index for :Group(name)
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter then this query would be more efficient.
MATCH (n:Group)-[:IS_PARENT_OF|:PLAYS_ROLE_IN]->(m:Group)
RETURN n,m

An Example Showing the Necessity of Relationship Type Index and Related Execution Plan Optimization

Suppose I have a large knowledge base with many relationship types, e.g., hasChild, livesIn, locatedIn, capitalOf, largestCityOf...
The number of capicalOf relationships is relatively small (say, one hundred) compared to that of all nodes and other types of relationships.
I want to fetch any capital which is also the largest city in their country by the following query:
MATCH city-[:capitalOf]->country, city-[:largestCityOf]->country RETURN city
Apparently it would be wise to take the capitalOf type as clue, scan all 100 relationship with this type and refine by [:largestCityOf]. However the current execution plan engine of neo4j would do an AllNodesScan and Expand. Why not consider add an "RelationshipByTypeScan" operator into the current query optimization engine, like what NodeByLabelScan does?
I know that I can transform relationship types to relationship properties, index it using the legacy index and manually indicate
START r=relationship:rels(rtype = "capitalOf")
to tell neo4j how to make it efficient. But for a more complicated pattern query with many relationship types but no node id/label/property to start from, it is clearly a duty of the optimization engine to decide which relationship type to start with.
I saw many questions asking the same problem but getting answers like "negative... a query TYPICALLY starts from nodes... ". I just want to use the above typical scenario to ask why once more.
Thanks!
A relationship is local to its start and end node - there is no global relationship dictionary. An operation like "give me globally all relationships of type x" is therefore an expensive operation - you need to go through all nodes and collect matching relationships.
There are 2 ways to deal with this:
1) use a manual index on relationships as you've sketched
2) assign labels to your nodes. Assume all the country nodes have a Country label. Your can rewrite your query:
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country) RETURN city
The AllNodesScan is now a NodeByLabelScan. The query grabs all countries and matches to the cities. Since every country does have one capital and one largest city this is efficient and scales independently of the rest of your graph.
If you put all relationships into one index and try to grab to ~100 capitalOf relationships that operation scales logarithmically with the total number of relationships in your graph.

Neo4j Relationship Schema Indexes

Using Neo4j 2.1.4 and SDN 3.2.0.RELEASE
I have a graph that connects nodes with relationships that have a UUID associated with them. External systems use the UUID as a means to identify the source and target of the relationship. Within Spring Data Neo4j (SDN) we have a #RelationshipEntity(type=”LINKED_TO”) class with a #StartNode, #EndNode and a String uuid field. The uuid field is #Indexed and the resulting schema definition in Neo4j shows up as
neo4j-sh (?)$ SCHEMA
==> Indexes
...
==> ON :Link(uuid) ONLINE
...
However, running a cypher query against the data, e.g.
MATCH ()-[r:LINKED_TO]->() WHERE uuid=’XXXXXX’ RETURN r;
does a full scan of the database and takes a long time
If I try to use the index by running
MATCH ()-[r:LINKED_TO]->() USING INDEX r:Link(uuid) WHERE uuid=’XXXXXX’ RETURN r;
I get
SyntaxException: Type mismatch: expected Node but was Relationship.
As I understand it, Relationships are supposed to be first class citizens in Neo4j, but I can’t see how to utilize the index on the relationship to prevent the graph equivalent of a table scan on the database to locate the relationship.
I know there are posts like How to use relationship index in Cypher which ask similar things, but this Link is the relationship between the two nodes. If I converted the Link to a Node, we would be creating a Node to represent a Relationship which seems wrong when we are working in a graph database - I'd end up with ()-[:xxx]->(:Link)-[:xxx]->() to represent one relationship. It would make the model messy purely due to the fact that the Link couldn't be represented as a relationship.
The Link has got a unique, shared key attached to it that I want to use. The Schema output suggests that the index is there for that field - I just can't use it.
Does anyone have any suggestions?
Many thanks,
Dave
Schema indexes are only available for nodes. The only way to index relationships is using legacy indexes or autoindexes. Legacy indexes need to be used explicitly in the START clause:
START r=relationship:my_index_name(uuid=<myuuid>)
RETURN r
I'm not sure how this can be used in conjunction with SDN.
side note: requiring relationship index is almost always a indication of doing something wrong in your graph data model. Everything being a thing or having an identity in your domain should be a node. So if a relationship requires a uuid, maybe the relationship refers to a thing and therefore should be converted into a node having a inbound relationship to the previous start node and a outbound relationship to the previous end node.

Is a DFS Cypher Query possible?

My database contains about 300k nodes and 350k relationships.
My current query is:
start n=node(3) match p=(n)-[r:move*1..2]->(m) where all(r2 in relationships(p) where r2.GameID = STR(id(n))) return m;
The nodes touched in this query are all of the same kind, they are different positions in a game. Each of the relationships contains a property "GameID", which is used to identify the right relationship if you want to pass the graph via a path. So if you start traversing the graph at a node and follow the relationship with the right GameID, there won't be another path starting at the first node with a relationship that fits the GameID.
There are nodes that have hundreds of in and outgoing relationships, some others only have a few.
The problem is, that I don't know how to tell Cypher how to do this. The above query works for a depth of 1 or 2, but it should look like [r:move*] to return the whole path, which is about 20-200 hops.
But if i raise the values, the querys won't finish. I think that Cypher looks at each outgoing relationship at every single path depth relating to the start node, but as I already explained, there is only one right path. So it should do some kind of a DFS search instead of a BFS search. Is there a way to do so?
I would consider configuring a relationship index for the GameID property. See http://docs.neo4j.org/chunked/milestone/auto-indexing.html#auto-indexing-config.
Once you have done that, you can try a query like the following (I have not tested this):
START n=node(3), r=relationship:rels(GameID = 3)
MATCH (n)-[r*1..]->(m)
RETURN m;
Such a query would limit the relationships considered by the MATCH cause to just the ones with the GameID you care about. And getting that initial collection of relationships would be fast, because of the indexing.
As an aside: since neo4j reuses its internally-generated IDs (for nodes that are deleted), storing those IDs as GameIDs will make your data unreliable (unless you never delete any such nodes). You may want to generate and use you own unique IDs, and store them in your nodes and use them for your GameIDs; and, if you do this, then you should also create a uniqueness constraint for your own IDs -- this will, as a nice side effect, automatically create an index for your IDs.

Resources