faster way to create relationships between existing nodes? - neo4j

I am working on an application in which I have only 11,263 nodes in my Neo4j database.
I am using the following Cypher query to create the relationships between the nodes:
let CreateRelations(fromToList : FromToCount list)=
client.Cypher
.Unwind(fromToList, "fromToList")
.Match("(source)", "(target)")
.Where("source.Id= fromToList.SId and target.Id= fromToList.FId ")
.Merge("(source)-[relation:Fights_With]->(target)")
.OnCreate()
.Set("relation.Count= fromToList.Count,relation.Date= fromToList.Date")
.OnMatch()
.Set("relation.Count= (relation.Count+ fromToList.Count )")
.Set("relation.Date= fromToList.Date")
.ExecuteWithoutResults()
It takes almost 47 to 50 seconds to create, say, 1000 relationships in the Neo4j database.
I am new to Neo4j. Is there a more efficient way to do this?

The big thing slowing you down is that you're not using an index to look up your starting nodes. Your match on source performs a scan of all nodes in your db to find possible matches, once per row in your unwound list. Then it does the same thing for target.
You need to add labels on your nodes, and if they already have labels, use the labels in your query. You'll need either an index or unique constraint on the label and id property so the index will be used for lookup.
Best way to go about tuning your queries is to try them out in the browser, and use EXPLAIN to ensure you're using index lookups, and if it's still slow, use PROFILE on the query (it will execute the query) to see the rows generated and db hits as the query executes.
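For example, here is a minimal sketch in plain Cypher (Character is a hypothetical label standing in for whatever label your nodes actually carry, and fromToList is passed in as a parameter):
// One-time setup: a unique constraint on the label/property also creates an index
CREATE CONSTRAINT ON (c:Character) ASSERT c.Id IS UNIQUE;
// The relationship-building query, now anchored on the indexed label/property
UNWIND $fromToList AS row
MATCH (source:Character {Id: row.SId})
MATCH (target:Character {Id: row.FId})
MERGE (source)-[relation:Fights_With]->(target)
ON CREATE SET relation.Count = row.Count, relation.Date = row.Date
ON MATCH SET relation.Count = relation.Count + row.Count, relation.Date = row.Date
The same shape maps back onto your Neo4jClient calls by putting the labels inside .Match() and creating the constraint once up front; with it in place, EXPLAIN/PROFILE should show an index seek for source and target rather than an AllNodesScan per row.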

Related

Neo4j Query Optimization for Cartesian Product

I am trying to implement a user-journey analytics solution: simply put, analyzing on which screens which users leave the application.
For this, I have modeled the data like this:
I modeled each event as its own node because I want to index some of its attributes; relationship properties cannot be indexed in Neo4j.
With this model, I am trying to write a query that follows three successive event types:
MATCH (eventType1:EventType {eventName:'viewStart-home'})<--(event:EventNode)
<--(eventType2:EventType{eventName:'viewStart-payment'})
WITH distinct event.deviceId as eUsers, event.clientCreationDate as eDate
MATCH((eventType2)<--(event2:EventNode)
<--(eventType3:EventType{eventName:'viewStart-screen1'}))
WITH distinct event2.deviceId as e2Users, event2.clientCreationDate as e2Date
RETURN e2Users limit 200000
The execution plan is below.
I cannot figure out why it does this much work. Can you help me?
Your query is doing a lot more work than it needs to.
The first WITH clause is not needed at all, since its generated eUsers and eDate variables are never used. And the second WITH clause does not need to generate the unused e2Date variable.
In addition, you could first add an index for :EventType(eventName) to speed up the processing:
CREATE INDEX ON :EventType(eventName);
With these changes, your query's profile could be simpler and the processing would be faster.
Here is an updated query (that should use the index to quickly find the EventType node at one end of the path, to kick off the query):
MATCH (:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(:EventType{eventName:'viewStart-screen1'})
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Here is an alternate query that uses 2 USING INDEX hints to tell the planner to quickly find the :EventType nodes at both ends of the path to kick off the query. This might be even faster than the first query:
MATCH (a:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(b:EventType{eventName:'viewStart-screen1'})
USING INDEX a:EventType(eventName)
USING INDEX b:EventType(eventName)
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Try profiling them both on your DB, and pick the best one or keep tweaking further.

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graph databases and can see how they apply to these problems. So, over the past couple of days I set up Neo4j and parsed our data into it.
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
It's always a good idea to read through the developer manual completely.
To speed up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on whether the property value should be unique for a given label).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
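For example, a minimal sketch using the labels and properties already in your query (use a unique constraint where the value really is unique per node, and a plain index otherwise):
CREATE CONSTRAINT ON (o:Order) ASSERT o.refcode IS UNIQUE;
CREATE CONSTRAINT ON (w:Wallet) ASSERT w.wallet IS UNIQUE;
CREATE INDEX ON :Ip(ip);
CREATE INDEX ON :Email(email);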
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
WITH LAST(NODES(path)) as d
WHERE d:Order
RETURN distinct d

Neo4j: error in adding relationships among existing nodes

When I write a query to add relationships to existing nodes, Neo4j keeps warning me with this message:
"This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (e))"
If I run the query, it creates no relationships.
The query is:
match
(a{name:"Angela"}),
(b{name:"Carlo"}),
(c{name:"Andrea"}),
(d{name:"Patrizia"}),
(e{name:"Paolo"}),
(f{name:"Roberta"}),
(g{name:"Marco"}),
(h{name:"Susanna"}),
(i{name:"Laura"}),
(l{name:"Giuseppe"})
create
(a)-[:mother]->(b),
(a)-[:grandmother]->(c), (e)-[:grandfather]->(c), (i)-[:grandfather]->(c), (l)-[:grandmother]->(c),
(b)-[:father]->(c),
(e)-[:father]->(b),
(l)-[:father]->(d),
(i)-[:mother]->(d),
(d)-[:mother]->(c),
(c)-[:boyfriend]->(f),
(g)-[:brother]->(f),
(g)-[:brother]->(h),
(f)-[:sister]->(g), (f)-[:sister]->(h)
Can anyone help me?
PS: if I run the same query, but with just one or two relationships (and less nodes in the match clause), it creates the relationships correctly.
What is wrong here?
First of all, as I mentioned in my comments, you don't have any labels. That's really bad practice, because labels let you restrict a match to a specific set of nodes (if you match on a "name" property, you don't want to scan nodes that don't have a name; labels are there for that).
The second problem is that your query doesn't know how many nodes each pattern will match before it runs. If you have 500,000 nodes named "Angela" and 500,000 nodes named "Carlo", you will create a relationship from every Angela node to every Carlo node, which is quite a big query (500,000 * 500,000 relationships to create, if my maths aren't bad). Cypher is warning you about that.
Cypher will still show this warning because you aren't matching your nodes on unique properties; even with labels, you will still see it.
Solution?
Use unique properties to create and match your nodes, so you avoid cartesian product.
Always use labels, Neo4j without labels is like using one giant table in SQL to store all of your data.
If you want to know how your query will actually run, prefix it with PROFILE and look at the plan it produces.
Does every single one of those name strings exist? If not then you're not going to get any results because it's all one big match. You could try changing it to a MERGE.
But Supamiu is right, you really should have a label (say Person) and an index on :Person(name).
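For example, a minimal sketch (this assumes a name is unique enough to identify a person in your data set):
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;
MERGE (a:Person {name:"Angela"})
MERGE (b:Person {name:"Carlo"})
MERGE (a)-[:mother]->(b)
Each MERGE either finds the existing node or creates it, so a missing name no longer makes the whole statement match nothing; repeat the pattern for the other people and relationships.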

Cypher Query in Qlikview

I have connected Neo4j with Qlikview and I'm firing Cypher queries. Now I have two graphs with the same labels, so when I make a report in Qlikview I get the aggregate count of a particular label retrieved from both graphs. So, in order to get the data from only one graph, can I fire the query with the graph name, say 'emp', prefixed to the label, as below?
SQL MATCH a=(emp.cname)-[:has]->(emp.clusters) RETURN cname.name as cName, clusters.name as ClustName;
Here emp is the name of the graph from which I want to get the data, and cname and clusters are the labels.
Thanks in advance...
There is technically only one graph. It seems that you have created two sub-graphs in that space and need a way of distinguishing them. That is typically done by creating an anchor node that at least one node of each sub-graph is related to, or by adding a unique label to each set of nodes to distinguish them. If that is not an option, then you could run a preliminary query to determine the two start nodes (distinguished by some property/label, or by the fact that they are not connected in any way) and use one or the other for the query of interest.
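For example, a minimal sketch of the extra-label approach (Emp is a hypothetical label you would attach, at load time or afterwards, to every node belonging to the emp graph):
MATCH (c:cname:Emp)-[:has]->(cl:clusters:Emp)
RETURN c.name AS cName, cl.name AS ClustName;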

Is a DFS Cypher Query possible?

My database contains about 300k nodes and 350k relationships.
My current query is:
start n=node(3) match p=(n)-[r:move*1..2]->(m) where all(r2 in relationships(p) where r2.GameID = STR(id(n))) return m;
The nodes touched in this query are all of the same kind; they are different positions in a game. Each of the relationships contains a property "GameID", which is used to identify the right relationship when traversing the graph along a path. So if you start traversing the graph at a node and follow the relationship with the right GameID, there won't be another path starting at the first node with a relationship that fits that GameID.
There are nodes that have hundreds of in and outgoing relationships, some others only have a few.
The problem is, that I don't know how to tell Cypher how to do this. The above query works for a depth of 1 or 2, but it should look like [r:move*] to return the whole path, which is about 20-200 hops.
But if I raise the values, the queries won't finish. I think that Cypher looks at every outgoing relationship at every single path depth from the start node, but as I already explained, there is only one right path. So it should do some kind of DFS instead of a BFS. Is there a way to do that?
I would consider configuring a relationship index for the GameID property. See http://docs.neo4j.org/chunked/milestone/auto-indexing.html#auto-indexing-config.
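For example, a sketch of the relevant auto-indexing settings in conf/neo4j.properties (check the linked page for the exact option names in your version):
relationship_auto_indexing=true
relationship_keys_indexable=GameID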
Once you have done that, you can try a query like the following (I have not tested this):
START n=node(3), r=relationship:rels(GameID = 3)
MATCH (n)-[r*1..]->(m)
RETURN m;
Such a query would limit the relationships considered by the MATCH clause to just the ones with the GameID you care about. And getting that initial collection of relationships would be fast, because of the indexing.
As an aside: since neo4j reuses its internally-generated IDs (for nodes that are deleted), storing those IDs as GameIDs will make your data unreliable (unless you never delete any such nodes). You may want to generate and use your own unique IDs, store them in your nodes, and use them for your GameIDs; and, if you do this, then you should also create a uniqueness constraint for your own IDs -- this will, as a nice side effect, automatically create an index for your IDs.
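For example, a minimal sketch (Position and uid are hypothetical names for your position-node label and your own generated identifier):
CREATE CONSTRAINT ON (p:Position) ASSERT p.uid IS UNIQUE;
As noted above, the constraint doubles as an index, so lookups by that identifier stay fast.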
