neo4j - shortest path with conditions include plugin functions - neo4j

I have a problem with an implementation in Cypher. I have a database model, which is shown here as an overview: https://www.instpic.de/QTIhBbPgVHBHg5pKwVdk.PNG
A short explanation: the red nodes represent star systems and the yellow ones jump points. Each jump point has a certain size, which determines which bodies can pass through the point. The size is stored as a property on the relationship between the yellow nodes. Below the red nodes are other nodes that represent the orbital celestial bodies of a star system (planets, moons, stations, etc.). Now, from any point within a solar system (planet, station, moon), I would like to find the shortest path to another point in the same solar system or in a different one. In addition, I can calculate the distance between two celestial bodies within a system using a plugin that I have written. This value should be included when finding the path, so that I get the shortest path in the database and also the smallest distance between the celestial bodies within a solar system. I already have a query, but it partly fails because of its performance. I also think the paths here are highly variable, so I am open to changing the database model.
Here is part of the actual query I am using:
MATCH (origin:Marketplace)
WHERE origin.eid = 'c816c4fa501244a48292f5d881103d7f'
OPTIONAL MATCH (marketplace:Marketplace)-[:Sell]->(currentPrice:Price)-[:Content]->(product:Product)
OPTIONAL MATCH p = shortestPath((origin)-[:HasMoon|:HasStation|:HasLandingZone|:HasPlanet|:HasJumpPoint|:CanTravel*]-(marketplace))
WHERE SIZE([rel in relationships(p) WHERE EXISTS(rel.size)]) <= 3 AND ALL(rel IN [rel in relationships(p) WHERE EXISTS(rel.size)] WHERE rel.size IN ['small', 'medium', 'large'])
WITH origin, marketplace, p, currentPrice, product
CALL srt.getRoutes(origin, marketplace, p) YIELD node, jump_sizes, jump_gates, jump_distance, hops, distance
OPTIONAL MATCH (currentPrice)-[:CompletedVotes]->(:Wrapper)-[:CompletedVote]->(voteHistory:CompletedVote)
OPTIONAL MATCH (currentPrice)-[:CurrentVote]->(vote:Vote)-[:VotedPrices]->(currentVotings)
WITH node, currentPrice, product, jump_sizes, jump_gates, jump_distance, hops, distance, voteHistory, currentVotings, vote, origin
WITH {eid: product.eid, displayName: product.displayName, name: product.name, currentPrice: {eid: currentPrice.eid, price: currentPrice.price}, currentVoting: {approved: vote.approved, count: Count(currentVotings), declined: vote.declined, users: Collect(currentVotings.userId), votes: Collect(currentVotings.price), voteAvg: round(100 * avg(currentVotings.price)) / 100}, voteHistory: Collect({votings: voteHistory.votings, users: voteHistory.users, completed: voteHistory.completed,
vote: voteHistory.votes}), marketplace: {eid: node.eid, name: node.name, type: node.type, designation: node.designation}, travel: {jumpSizes: jump_sizes, jumpGate: jump_gates, jumpDistance: jump_distance, jumps: hops, totalDistance: distance}} as sellOptions, currentPrice ORDER BY currentPrice.price
WITH Collect(sellOptions) as sellOptions
For the moment, this query works pretty well, but now I want to filter (after the "... 'medium', 'large'])" predicate on line 5) by the minimum total distance you need to travel to reach your destination. I would like to do this with my own plugin, which calculates the total distance of a path (getTotalDistance (path AS PATH)).
Additional note 1: when I remove 'big' from the possible jump sizes, I get no result, but there is still a path in my graph that leads to the goal.
Additional note 2: I am working on Neo4j 3.3.1 and I have set this config:
cypher.forbid_shortestpath_common_nodes=false
which does not work in 3.3.3.
EDIT 1: (More detailed explanation)
I have a place where I am. Then I search for marketplaces that sell some product. For this I can specify further filters. For example, I can say that I can only travel through jump points of size "large". Also, I only want marketplaces that are 4 systems away.
Given the above restrictions, I then search the database for the shortest path to the marketplaces I found.
It may well be that several paths meet the conditions. If so, I would like to pick, out of all the shortest paths, the one with the smallest distance to cover within each solar system.
Is that precise enough? If not, please let me know.

The latest APOC releases may be able to help here, though the APOC path expanders work best with labels and relationship types, so a slight change to your model may be needed.
In particular, the size of your jump points. Right now this is a property on the relationships between them, but for APOC to work optimally, these might be better modeled with the size as a label on the :JumpPoint nodes themselves, so you might have :JumpPoint:Small, :JumpPoint:Medium, and :JumpPoint:Large (you can add this in addition to the rel properties if you like).
Keep in mind this approach will be more complex than shortestPath(), as the idea is we're trying to find systems within a certain number of jumps, then find :Marketplaces reachable at those star systems, then filter based on whether they sell the product we want, and we'll stitch the path together as we find the pieces.
MATCH localSystemPath = (origin:Marketplace)-[*]-(s:Solarsystem)
WHERE origin.eid = $originId
WITH origin, localSystemPath, s
LIMIT 1
// -1 means no jump limit; the factor 3 converts jumps to relationship hops
WITH origin, localSystemPath, s,
  CASE WHEN coalesce($maxJumps, -1) = -1
    THEN -1
    ELSE 3*$maxJumps
  END as maxJumps,
  // blacklist the jump point labels that are too small for the ship
  CASE $shipSize
    WHEN 'small' THEN ''
    WHEN 'medium' THEN '|-Small'
    ELSE '|-Small|-Medium'
  END as sizeBlacklist
CALL apoc.path.spanningTree(s,
  {relationshipFilter:'HasJumpPoint|CanTravel>', maxLevel:maxJumps,
   labelFilter:'>Solarsystem' + sizeBlacklist, filterStartNode:true}) YIELD path as jumpSystemPath
WITH origin, localSystemPath, jumpSystemPath, length(jumpSystemPath) / 3 as jumps, last(nodes(jumpSystemPath)) as destSystem
// stay inside the destination system: no jump point relationships allowed here
MATCH destSystemPath = (destSystem)-[*]-(marketplace:Marketplace)
WHERE none(rel in relationships(destSystemPath) WHERE type(rel) = 'HasJumpPoint')
  AND <insert predicate for filtering which :Marketplaces you want>
WITH origin, apoc.path.combine(apoc.path.combine(localSystemPath, jumpSystemPath), destSystemPath) as fullPath, jumps, destSystem, marketplace
CALL srt.getRoutes(origin, marketplace, fullPath) YIELD node, jump_sizes, jump_gates, jump_distance, hops, distance
...
This assumes parameter inputs of $shipSize for the minimum size of all jump gates to pass through, $originId as the id of the origin :Marketplace (plus you DEFINITELY need an index or unique constraint on :Marketplace(eid) for fast lookups here), and $maxJumps, for the maximum number of jumps to reach a destination system.
Keep in mind the expansion procedure used, spanningTree(), will only find the single shortest path to another system. If you need all possible paths, including multiple paths to the same system, then change the procedure to expandConfig() instead.
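For the "pick the shortest path with the smallest travel distance" part of the question, one possible approach is to collect all equally short paths and order them by the plugin's distance. This is only a sketch under assumptions: srt.getTotalDistance is assumed to be how the asker's getTotalDistance plugin function is registered (the question only gives the bare name), $destId is a hypothetical parameter for a known destination, and the jump size predicates from the question are left out for brevity.

```cypher
// Assumption: srt.getTotalDistance(path) is the asker's own plugin function.
// Assumption: $destId is the eid of an already chosen destination marketplace.
MATCH (origin:Marketplace {eid: $originId}), (marketplace:Marketplace {eid: $destId})
// allShortestPaths returns every path of minimal hop count, not just one
MATCH p = allShortestPaths((origin)-[:HasMoon|HasStation|HasLandingZone|HasPlanet|HasJumpPoint|CanTravel*]-(marketplace))
WITH p, srt.getTotalDistance(p) AS totalDistance
// among the equally short paths, keep the one with the smallest distance
ORDER BY totalDistance ASC
LIMIT 1
RETURN p, totalDistance
```

Since every path returned by allShortestPaths() has the same number of hops, ordering by the plugin value only breaks ties between them; it does not trade extra hops for less distance.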

Related

Neo4j variable-length pattern matching tuning

Query:
PROFILE
MATCH(node:Symptom) WHERE node.symptom =~ '.*adult male.*|.*151.*'
WITH node
MATCH (node)-[*1..2]-(result:Disease)
RETURN result
Profile: (query plan screenshot not included)
Problems:
There are over 40 thousand "Symptom" nodes in the database, and the query is very slow because of the part - "[*1..2]".
It only took 4 seconds when length is 1, i.e "[*1]", but it will take about 30 seconds when length is 2, i.e "[*1..2]".
Is there any way to tune this query?
Firstly, your query uses the regex operator, which can't use indexes. You should use the CONTAINS operator instead:
MATCH (node:Symptom)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN node
And you can create an index: CREATE INDEX ON :Symptom(symptom)
For the second part of your query, as it stands, there is nothing to be done: the cost comes from the complexity of what you are asking for.
So to get better performance, you should consider the following:
put the relationship type on the pattern to reduce the number of returned paths: (node)-[:MY_REL_TYPE*1..2]-(result:Disease)
put the direction on the relationship in the pattern to reduce the number of returned paths: (node)-[:MY_REL_TYPE*1..2]->(result:Disease)
find another way to reduce this complexity (filter on a property of the relationship, review your model, etc.)
For your information, you can write your query directly in one step (i.e. without the WITH, although in your case performance should be the same):
MATCH (node:Symptom)-[*1..2]-(result:Disease)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN result
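Putting the suggestions together, a query could look roughly like the sketch below. The relationship type HAS_SYMPTOM and its direction (from Disease towards Symptom) are assumptions made for illustration only; the question does not show the actual model.

```cypher
// Assumption: diseases point to symptoms via a hypothetical HAS_SYMPTOM type.
MATCH (node:Symptom)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
// typed and directed expansion keeps the number of explored paths down
MATCH (node)<-[:HAS_SYMPTOM*1..2]-(result:Disease)
RETURN DISTINCT result
```

DISTINCT is added here because the same disease can be reached over several paths; drop it if you need one row per path.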

Improve Neo4j query performance

I have a Neo4j query that searches across multiple entities, and I would like to pass parameters in batch using a nodes object. However, the query executes rather slowly. How can I optimize this query and improve its performance?
WITH $nodes as nodes
UNWIND nodes AS node
with node.id AS id, node.lon AS lon, node.lat AS lat
MATCH
(m:Member)-[mtg_r:MT_TO_MEMBER]->(mt:MemberTopics)-[mtt_r:MT_TO_TOPIC]->(t:Topic),
(t1:Topic)-[tt_r:GT_TO_TOPIC]->(gt:GroupTopics)-[tg_r:GT_TO_GROUP]->(g:Group)-[h_r:HAS]->
(e:Event)-[a_r:AT]->(v:Venue)
WHERE mt.topic_id = gt.topic_id AND
distance(point({ longitude: lon, latitude: lat}),point({ longitude: v.lon, latitude: v.lat })) < 4000 AND
mt.member_id = id
RETURN
distinct id as member_id,
lat as member_lat,
lon as member_lon,
g.group_name as group_name,
e.event_name as event_name,
v.venue_name as venue_name,
v.lat as venue_lat,
v.lon as venue_lon,
distance(point({ longitude: lon,
latitude: lat}),point({ longitude: v.lon, latitude: v.lat })) as distance
Query profiling looks like this:
So, your current plan has 3 parallel branches. One we can ignore for now because it has 0 db hits.
The biggest hit you are taking is the match for (mt:MemberTopics) ... WHERE mt.member_id = id. I'm guessing member_id is a unique id, so you will want to create an index on it: CREATE INDEX ON :MemberTopics(member_id). That will allow Cypher to do an index lookup instead of a node scan, which will reduce the DB hits from ~30 million to ~1. (Also, in some cases, in-lining property matches is faster for more complex queries, so (mt:MemberTopics {member_id:id}) is better. It makes explicit that this condition must always be true while matching, and will reinforce the use of the index lookup.)
The second biggest hit is the point-distance check. Right now, this is being done independently, because the node scan takes so long. Once you make the changes for MemberTopics, the planner should switch to finding all connected Venues and then doing the distance check only on those, so that should become cheaper as well.
Also, it looks like mt and gt are linked by a topic, and you are using a topic id to align them. If t and t1 are supposed to be the same Topic node, you could just use t for both nodes to enforce that, and then you don't need the id check to link mt and gt. If t and t1 are not the same node, the use of a foreign key in your node's properties is a sign that you should have a relationship between the two nodes and just travel along that edge. (Relationships can have properties too, but the context looks a lot like t and t1 are supposed to be the same node. You can also enforce this by saying WHERE t = t1, but at that point you should just use t for both nodes.)
Lastly, depending on the number of rows your query returns, you may want to use LIMIT and SKIP to page your results. This looks like info going to a user, and I doubt they need the full dump. So only return the top results, and only process the rest if the user wants to see more. Since you only have 21 results so far, this won't be an issue right now, but keep it in mind as you scale to 100,000+ results.
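As a rough illustration of these points, the query could be restructured as in the sketch below. It assumes that t and t1 really are the same Topic node (which the question does not confirm) and that the index on :MemberTopics(member_id) has been created first; run the CREATE INDEX statement separately from the query.

```cypher
// Run once, as its own statement:
CREATE INDEX ON :MemberTopics(member_id);

// Assumption: t and t1 are the same Topic node, so one variable is used.
UNWIND $nodes AS node
WITH node.id AS id, node.lat AS lat, node.lon AS lon
// in-lined property match so the planner can start from the index
MATCH (m:Member)-[:MT_TO_MEMBER]->(mt:MemberTopics {member_id: id})-[:MT_TO_TOPIC]->(t:Topic)
      -[:GT_TO_TOPIC]->(gt:GroupTopics)-[:GT_TO_GROUP]->(g:Group)-[:HAS]->(e:Event)-[:AT]->(v:Venue)
// compute the distance once and reuse it for the filter and the result
WITH id, lat, lon, g, e, v,
     distance(point({longitude: lon, latitude: lat}),
              point({longitude: v.lon, latitude: v.lat})) AS dist
WHERE dist < 4000
RETURN DISTINCT id AS member_id, lat AS member_lat, lon AS member_lon,
       g.group_name AS group_name, e.event_name AS event_name,
       v.venue_name AS venue_name, v.lat AS venue_lat, v.lon AS venue_lon,
       dist AS distance
```

Whether this is actually faster has to be checked with PROFILE on the real data; if t and t1 are different nodes, the pattern above is not equivalent to the original.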

neo4j cypher - Differing query plan behavior

Nodes with the Location node label have an index on Location.name.
Profiling the following query gives me a smart plan, with a NodeHashJoin between the two sides of the graph on either side of Trip nodes. Very clever. Works great.
PROFILE MATCH (rosen:Location)<-[:OCCURS_AT]-(ev:Event)<-[:HAS]-(trip:Trip)-[:OPERATES_ON]->(date:Date)
WHERE rosen.name STARTS WITH "U Rosent" AND
ev.scheduled_departure_time > "07:45:00" AND
date.date = '2015-11-20'
RETURN rosen.name, ev.scheduled_departure_time, trip.headsign
ORDER BY ev.scheduled_departure_time
LIMIT 20;
However, just changing one line of the query from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE id(rosen) = 4752371 AND
seems to alter the entire behavior of the query plan, which now appears to become more "sequential", losing the parallel execution of (Trip)-[:OPERATES_ON]->(Date)
Much slower. 6x more DB hits in total.
Question
Why does changing the retrieval of one, seemingly-unrelated Location node via a different index/mechanism alter the behavior of the whole query?
(I'm not sure how best to convey more information about the graph model, but please advise, and I'd be happy to add details that are missing)
Edit:
It gets better. Changing that query line from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE rosen.name = "U Rosenthaler Platz." AND
results in the same loss of parallelism in the query plan!
Seems odd that a LIKE query is faster than an = ?

Low performance of neo4j

I am a server engineer at a company that provides a dating service.
Currently I am building a PoC for our new recommendation engine.
I am trying to use neo4j, but the performance of this database does not meet our needs.
I have a strong feeling that I am doing something wrong and that neo4j can do much better.
So can someone give me advice on how to improve the performance of my Cypher query or how to tune neo4j the right way?
I am using neo4j-enterprise-2.3.1, which is running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users: LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties like countryCode, birthday and gender.
I imported all our users and relationships from an RDBMS into neo4j using the neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from the neo4j-import tool said that:
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it's a huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships.
I created 3 indexes in neo4j:
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users: (plan screenshot not included)
When I executed this query for a list of users, I got this result:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest run is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1 : plan with the expanded boxes :
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what is found via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.
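To make the suggestion concrete, the first part of the query without the :User label on similar and without the birthday index hint might look like the sketch below. It keeps the {param} syntax of Neo4j 2.3 used in the question and only covers the similar step; whether it is faster depends on how many me-to-similar paths there are compared to the index result.

```cypher
// No :User label and no USING INDEX hint on similar:
// the traversal from me runs first, the WHERE filters what it finds.
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar)
USING INDEX me:User(userId)
WHERE similar.birthday >= {target_age_gte} AND
      similar.birthday <= {target_age_lte} AND
      similar.countryCode = {target_country_code} AND
      similar.gender = {source_gender}
WITH similar, count(*) AS weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
RETURN similar, weight
```

Comparing the PROFILE output of this version with the original shows whether the plan now starts from me instead of from the birthday index.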

How to find nodes being contained in a node's properties interval?

I'm currently developing a kind of configurator using neo4j as a backend. Now I have run into a problem that I don't know how best to solve.
I've got nodes created like this:
(A:Product {name:'ProductA', minWidth:20, maxWidth:200, minHeight:10, maxHeight:400})
(B:Product {name:'ProductB', minWidth:40, maxWidth:100, minHeight:20, maxHeight:300})
...
There is an interface where the user can input a desired width and height, e.g. Width=30, Height=250. Now I'd like to check which products match the input criteria. As the input might be any long value, the approach used in http://neo4j.com/blog/modeling-a-multilevel-index-in-neoj4/ with dates doesn't seem suitable for me. How can I run a Cypher query that gives me all the nodes matching the input criteria?
I'm not sure I fully understand what you are asking for, but if I do, here is a simple query to get this:
Assuming the user wants width = 30 and height = 50
Match (p:Product)
WHERE
p.minWidth < 30 AND p.maxWidth > 30 AND
p.minHeight < 50 AND p.maxHeight > 50
RETURN
p
If this is not what you are looking for, feel free to say so in a comment.
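A parameterized variant of the same idea is sketched below. The parameter names $width and $height are made up for illustration, and the bounds are inclusive here on the assumption that a product with minWidth 20 should also accept a width of exactly 20; the query above uses strict comparisons instead.

```cypher
// Assumption: $width and $height are supplied by the caller,
// e.g. {width: 30, height: 250} for the input from the question.
MATCH (p:Product)
WHERE p.minWidth  <= $width  AND $width  <= p.maxWidth
  AND p.minHeight <= $height AND $height <= p.maxHeight
RETURN p
```

With the two example nodes and the input Width=30, Height=250, only ProductA falls inside both intervals, since ProductB has minWidth 40. The $param syntax needs Neo4j 3.0 or later; older versions use {width} and {height}.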

Resources