Neo4j, do nodes without relationships affect performance? - neo4j

How do nodes without relationships affect performance?
The input stream contains duplicate nodes and once I've determined that a node is not of interest I'd like a short-hand way to know that I've already seen this node and want to disregard it.
If I store one instance of the node in the db without any relationships will it impact performance? Potentially the number of relationship-less nodes is very large.

Usually these don't affect performance, they take up space on disk but will not be loaded if you don't access them. And as you don't traverse them it doesn't matter that much.
I would still skip them, you can do it both with neo4j-import there is a --skip-duplicate-nodes option as well as LOAD CSV or Cypher in general, there is the MERGE clause which only creates a new node if it is not already there.

Related

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graphing databases and can see how it applies to these problems. So, over the past couple of days I set up neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) as d
WHERE d:Order

Creating/Managing millions of Vertex Tree in Neo4j 3.0.4

I'm doing some stuff with my University and I've been asked to create a system that builds Complete Trees with millions of nodes (1 or 2 million at least).
I was trying to create the Tree with a Load CSV Using a periodic commit and it worked well with the creation of just Nodes (70000 ms on a general purpose Notebook :P ). When I tried the same with the Edges, it didn't scale as well.
Using periodic commit LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
Merge (:Vertex {name:line.from})<-[:EDGE {attr1: toFloat(line.attr1), attr2:toFloat(line.attr2), attr3: toFloat(line.attr3), attr4: toFloat(line.attr4), attr5: toFloat(line.attr5)}]-(:Vertex {name:line.to})
I need to guarantee that a Tree is generated in no more than 5 minutes.
Is there a Faster method that can return such a performances?
P.S. : The task doesn't expect to use Neo4j, but just a Database (either SQL or NoSQL), but I found out this NoSQL Graph DB and I thought would be nice to implement with Neo4j as the graph data structure is given for free.
P.P.S : I'm using Cypher
I think you should read up on MERGE in the developer documentation again, to make sure you understand exactly what it's doing.
A few things in particular to be aware of...
If the pattern you are merging does not exist, all elements of the pattern will be merged, which could result in duplicate :Vertex nodes being created. If your :Vertexes are supposed to be in the database already, and if there are no relationships yet, and if you are sure that no relationship repeats itself in your CSV, I strongly urge you to MATCH on the start and end nodes, and then CREATE the relationship between them instead of the MERGE. Remember that doing a MERGE with a relationship with many attributes means it will try to match on that first, so as the number of relationships grow between nodes, there will be an increasing number of comparisons, which will slow your query down further. CREATE is a better choice if you know that no relationship will be duplicated, and if you are sure those relationships don't exist yet.
I also urge you to create an index on :Vertex(name), as that will significantly help matching on end nodes.

Neo4j labels and properties, and their differences

Say we have a Neo4j database with several 50,000 node subgraphs. Each subgraph has a root. I want to find all nodes in one subgraph.
One way would be to recursively walk the tree. It works but can be thousands of trips to the database.
One way is to add a subgraph identifier to each node:
MATCH(n {subgraph_id:{my_graph_id}}) return n
Another way would be to relate each node in a subgraph to the subgraph's root:
MATCH(n)-[]->(root:ROOT {id: {my_graph_id}}) return n
This feels more "graphy" if that matters. Seems expensive.
Or, I could add a label to each node. If {my_graph_id} was "BOBS_QA_COPY" then
MATCH(n:BOBS_QA_COPY) return n
would scoop up all the nodes in the subgraph.
My question is when is it appropriate to use a garden-variety property, add relationships, or set a label?
Setting a label to identify a particular subgraph makes me feel weird, like I am abusing the tool. I expect labels to say what something is, not which instance of something it is.
For example, if we were graphing car information, I could see having parts labeled "FORD EXPLORER". But I am less sure that it would make sense to have parts labeled "TONYS FORD EXPLORER". Now, I could see (USER id:"Tony") having a relationship to a FORD EXPLORER graph...
I may be having a bout of "SQL brain"...
Let's work this through, step by step.
If there are N non-root nodes, adding an extra N ROOT relationships makes the least sense. It is very expensive in storage, it will pollute the data model with relationships that don't need to be there and that can unnecessarily complicate queries that want to traverse paths, and it is not the fastest way to find all the nodes in a subgraph.
Adding a subgraph ID property to every node is also expensive in storage (but less so), and would require either: (a) scanning every node to find all the nodes with a specific ID (slow), or (b) using an index, say, :Node(subgraph_id) (faster). Approach (b), which is preferable, would also require that all the nodes have the same Node label.
But wait, if approach 2(b) already requires all nodes to be labelled, why don't we just use a different label for each subgroup? By doing that, we don't need the subgraph_id property at all, and we don't need an index either! And finding all the nodes with the same label is fast.
Thus, using a per-subgroup label would be the best option.

How much does the architecture of data affect the speed of a query

I have the following nodes and relationships in Neo4j database.
The grey and the pink node are furtherly connected with more nodes. Running the following query:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..3]-(z)
RETURN DISTINCT ID(z), z.id,n.id as InternalID"
I get a result very fast (the node n:RealNode is not one of the nodes in the image).
If I increase the depth to 4 like:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..4]-(z)
RETURN DISTINCT ID(z), z.id,n.id as InternalID"
The response gets extremely slow. I will never get a response with depth 5 etc.
The depth 4 is actually the relationship between the blue-pink node. So my question is: can the architecture of data (in this case) affect in such a great level the speed of the query? If yes what should I do?
I have tried to run the query also using parameters but the result was the same. Also the gid of n:RealNode is an indexed value.
The architecture of your data has a huge, no...massive impact on query performance. There's a lot you can do with improving performance by reformulating your query, but you can do even more than that by changing your data model.
The model needs to be chosen in a way that's an accurate depiction of the real-world domain, but it often also has to make certain concessions to usage patterns. If you know you're going to do certain queries over and over, it makes sense to choose a data model that makes it easy for the DBMS to answer that query. In the RDBMS world, that entire line of thinking gets summarized in the word "denormalization". In graph databases, the concept is the same but the way you go about it is different.
The thing to keep in mind when adjusting your data model is that neo4j is good at traversing relationships fast, and that with all queries, the less data you have to consider, the faster the query will go.
So in your case, I don't know how many nodes branch off of each other node by a :CONTAINS relationship, but I'm guessing that at each level of the hierarchy you have many items below it. So going from level 4 to level 5 probably doesn't just add a fixed number of additional nodes, but if say each level of the hierarchy has 3x the number of nodes as the level above, the deeper you go, the more you're multiplying how much data you have to consider. If it's 10x...then ouch.
You have many different options. One is to create short-cut relationships, and "pre-materialize" the query. Imagine creating :grandfather and :greatgrandfather relationships to "hop" levels of the tree. That would make it faster. Another way would be to filter intermediate nodes, or the return nodes, so that you're not considering everything, but some subset.
In the end, really huge queries will always take longer than really small ones. You must first begin with a careful understanding of what data you want, and how often you have to run this query. I would not attempt to optimize your data model for infrequently run queries, but if you do this all the time, you should look at your options. Your query to me looks like it's going to return a whole lot of data no matter what you do.

Cypher LOAD FROM CSV performance for relationship merge

I am merging large batches of ~500,000 relationships with the LOAD CSV command:
LOAD CSV WITH HEADERS FROM 'http://file.csv'
MATCH (a:Label {uid: csv.uid1}),(b:Otherlabel {uid: csv.uid2})
MERGE (a)-[:TYPE {key1: csv.key1}]->(b)
Both uid properties have a UNIQUE constraint.
The CSV file looks like:
uid1,uid2,key1
123,abc,some_value
456,def,some_value
This is usually very fast (< 1 min) when there are many different nodes on each side.
But performance drops dramatically when I load batches where a single a node is connected to many different b nodes. The uid1 is always the same but schema constraints are still there. ~30,000 relationships take ~8 min to load.
Am I missing something here? What could explain the huge performance difference in MERGEing 'many-to-many' relationships vs. 'one-to-many'?
As I mentioned in the comment on the question, I verified this behavior with a ~300,000 line CSV file that I created with unique random values for uid1 and uid2. #MartinPreusse then mentioned that if you change the query to use CREATE instead of MERGE, the query is fast. This observation made me realize what is going on.
The slowdown is caused by the need to scan the relationships list of the 'a' node each time a MERGE is performed. When a CREATE is performed, the relationship is added without testing first to see if the relationship already exists. When the relationship lists remain short (first case), scanning the relationship lists has little impact. When the relationship lists are growing long (second case), the repeated scanning of a growing list is dominating the process. In my test I linked all 300,000 nodes to a single node using a MERGE clause and it took hours.
If you don't have to worry about creating duplicate relationships, you can use CREATE without fear. Even if duplicates are an issue, it might be faster to use CREATE and then craft a query that will remove the duplicates.

Resources