Improve Neo4j query performance

Improve Neo4j query performance - neo4j

I have a Neo4j query with searched multiple entities and I would like to pass parameters in batch using nodes object. However, I the speed of query execution is not quite high. How can I optimize this query and make its performance better?
WITH $nodes as nodes
UNWIND nodes AS node
with node.id AS id, node.lon AS lon, node.lat AS lat
MATCH
(m:Member)-[mtg_r:MT_TO_MEMBER]->(mt:MemberTopics)-[mtt_r:MT_TO_TOPIC]->(t:Topic),
(t1:Topic)-[tt_r:GT_TO_TOPIC]->(gt:GroupTopics)-[tg_r:GT_TO_GROUP]->(g:Group)-[h_r:HAS]->
(e:Event)-[a_r:AT]->(v:Venue)
WHERE mt.topic_id = gt.topic_id AND
distance(point({ longitude: lon, latitude: lat}),point({ longitude: v.lon, latitude: v.lat })) < 4000 AND
mt.member_id = id
RETURN
distinct id as member_id,
lat as member_lat,
lon as member_lon,
g.group_name as group_name,
e.event_name as event_name,
v.venue_name as venue_name,
v.lat as venue_lat,
v.lon as venue_lon,
distance(point({ longitude: lon,
latitude: lat}),point({ longitude: v.lon, latitude: v.lat })) as distance
Query profiling looks like this:

So, your current plan has 3 parallel threads. One we can ignore for now because it has 0db hits.
The biggest hit you are taking is the match for (mt:MemberTopics) ... WHERE mt.member_id = id. I'm guessing member_id is a unique id, so you will want to create an index on it CREATE INDEX ON :MemberTopics(member_id). That will allow Cypher to do an index lookup instead of a node scan, which will reduce the DB hits from ~30mill to ~1 (Also, in some cases, in-lining property matches is faster for more complex queries. So (mt:MemberTopics {member_id:id}) is better. It explicitly makes clear that this condition must always be true while matching, and will reinforce to use the index lookup)
The second biggest hit is the point-distance check. Right now, this is being done independently, because the node scan takes so long. Once you make the changes for MemberTopic, The planner should switch to finding all connected Venues, and then only doing the distance check on thous, so that should become cheaper as well.
Also, it looks like mt and gt are linked by a topic, and you are using a topic id to align them. If t and t1 are suppose to be the same Topic node, you could just use t for both nodes to enforce that, and then you don't need to do the id check to link mt and gt. If t and t1 are not the same node, the use of a foriegn key in your node's properties is a sign that you should have a relationship between the two nodes, and just travel along that edge (Relationships can have properties too, but the context looks a lot like t and t1 are suppose to be the same node. You can also enforce this by saying WHERE t = t1, but at that point, you should just use t for both nodes)
Lastly, Depending on the number of rows your query returns, you may want to use LIMIT and SKIP to page your results. This looks like info going to a user, and I doubt they need the full dump. So Only return the top results, and only process the rest if the user wants to see more. (Useful as results approach a metric ton) Since you only have 21 results so far, this won't be an issue right now, but keep in mind as you need to scale to 100,000+ results.

Related

neo4j - shortest path with conditions include plugin functions

I have a problem with the implementation in cypher. My problem is this: I have a database model, which is photographed here as an overview: https://www.instpic.de/QTIhBbPgVHBHg5pKwVdk.PNG
Short for the explanation. The red nodes simulate star systems, the yellow one jump points. Each jump point has a certain size, which determines which body can pass the point. The size is stored as a property at the relation between the yellow nodes. Among the red nodes are other nodes that represent the orbital celestial bodies of a star system. (Planets, moons, stations, etc.) Now, from any point within a solar system (planet, station, moon), I would like to find the shortest path to another lying point in the same solar system or another. In addition, I can calculate the distance of two celestial bodies within a system using the plugin that I have programmed. This value should now be included in finding the path, so I have the shortest path on the database and also the smallest distance between the celestial bodies within a solar system. I already have a query, unfortunately it fails partly because of its performance. I also think that the paths here are very variable, so a change to the database model is well considered.
Here is a part of my acutal query iam using:
MATCH (origin:Marketplace)
WHERE origin.eid = 'c816c4fa501244a48292f5d881103d7f'
OPTIONAL MATCH (marketplace:Marketplace)-[:Sell]->(currentPrice:Price)-[:Content]->(product:Product)
OPTIONAL MATCH p = shortestPath((origin)-[:HasMoon|:HasStation|:HasLandingZone|:HasPlanet|:HasJumpPoint|:CanTravel*]-(marketplace))
WHERE SIZE([rel in relationships(p) WHERE EXISTS(rel.size)]) <= 3 AND ALL(rel IN [rel in relationships(p) WHERE EXISTS(rel.size)] WHERE rel.size IN ['small', 'medium', 'large'])
WITH origin, marketplace, p, currentPrice, product
CALL srt.getRoutes(origin, marketplace, p) YIELD node, jump_sizes, jump_gates, jump_distance, hops, distance
OPTIONAL MATCH (currentPrice)-[:CompletedVotes]->(:Wrapper)-[:CompletedVote]->(voteHistory:CompletedVote)
OPTIONAL MATCH (currentPrice)-[:CurrentVote]->(vote:Vote)-[:VotedPrices]->(currentVotings)
WITH node, currentPrice, product, jump_sizes, jump_gates, jump_distance, hops, distance, voteHistory, currentVotings, vote, origin
WITH {eid: product.eid, displayName: product.displayName, name: product.name, currentPrice: {eid: currentPrice.eid, price: currentPrice.price}, currentVoting: {approved: vote.approved, count: Count(currentVotings), declined: vote.declined, users: Collect(currentVotings.userId), votes: Collect(currentVotings.price), voteAvg: round(100 * avg(currentVotings.price)) / 100}, voteHistory: Collect({votings: voteHistory.votings, users: voteHistory.users, completed: voteHistory.completed,
vote: voteHistory.votes}), marketplace: {eid: node.eid, name: node.name, type: node.type, designation: node.designation}, travel: {jumpSizes: jump_sizes, jumpGate: jump_gates, jumpDistance: jump_distance, jumps: hops, totalDistance: distance}} as sellOptions, currentPrice ORDER BY currentPrice.price
WITH Collect(sellOptions) as sellOptions
For the moment, this query works pretty well, but now I want to filter (after ".... dium ',' large '])" -> line 5) the minimum total distance you need to travel to reach your destination , I would like to realize this with my written plugin, which calculates the total distance in the path (getTotalDistance (path AS PATH))
For additional: when I cut of 'big' from the possible jump sizes, I get no result, but there is still a path in my graph that leads me to the goal.
For additional 2: iam working on neo4j 3.3.1 and i have set these config:
cypher.forbid_shortestpath_common_nodes=false
which not works in 3.3.3
EIDT 1: (More detailed explanation)
I have a place where I am. Then I search for marketplaces that sell some product. For this I can specify further filters. I can e.g. say that I can travel only through jump points of the size "large". Also, I only want marketplaces that are 4 system away.
Now, looking in the database for the above restrictions, I search for the shortest path to the market places I found.
It may well be that I have several paths that meet the conditions. If this is the case, I would like to filter out of all the shortest paths, the one in which one has to overcome the smallest distance within each solar system.
Is that accurate enough? Otherwise, please just report.

The latest APOC releases may be able to help here, though the APOC path expanders work best with labels and relationship types, so a slight change to your model may be needed.
In particular, the size of your jump points. Right now this is a property on the relationships between them, but for APOC to work optimally, these might be better modeled with the size as a label on the :JumpPoint nodes themselves, so you might have :JumpPoint:Small, :JumpPoint:Medium, and :JumpPoint:Large (you can add this in addition to the rel properties if you like).
Keep in mind this approach will be more complex than shortestPath(), as the idea is we're trying to find systems within a certain number of jumps, then find :Marketplaces reachable at those star systems, then filter based on whether they sell the product we want, and we'll stitch the path together as we find the pieces.
MATCH localSystemPath = (origin:Marketplace)-[*]-(s:Solarsystem)
WHERE origin.eid = $originId
WITH origin, localSystemPath, s
LIMIT 1
WITH origin, localSystemPath, s,
CASE WHEN coalesce($maxJumps, -1) = -1
THEN -1,
ELSE 3*$maxJumps
END as maxJumps,
CASE $shipSize
WHEN 'small' THEN ''
WHEN 'medium' THEN '|-Small'
ELSE '|-Small|-Medium'
END as sizeBlacklist
CALL apoc.path.spanningTree(s,
{relationshipFilter:'HasJumpPoint|CanTravel>', maxLevel:maxJumps,
labelFilter:'>Solarsystem' + sizeBlacklist, filterStartNode:true}) YIELD path as jumpSystemPath
WITH origin, localSystemPath, jumpSystemPath, length(jumpSystemPath) / 3 as jumps, last(nodes(jumpSystemPath)) as destSystem
MATCH destSystemPath = (destSystem)-[*]-(marketplace:Market)
WHERE none(rel in relationships(destSystemPath) WHERE type(rel) = 'HasJumpPoint')
AND <insert predicate for filtering which :Markets you want>
WITH origin, apoc.path.combine(apoc.path.combine(localSystemPath, jumpSystemPath), destSystemPath) as fullPath, jumps, destSystem, marketplace
CALL srt.getRoutes(origin, marketplace, fullPath) YIELD node, jump_sizes, jump_gates, jump_distance, hops, distance
...
This assumes parameter inputs of $shipSize for the minimum size of all jump gates to pass through, $originId as the id of the origin :Marketplace (plus you DEFINITELY need an index or unique constraint on :Marketplace(eid) for fast lookups here), and $maxJumps, for the maximum number of jumps to reach a destination system.
Keep in mind the expansion procedure used, spanningTree(), will only find the single shortest path to another system. If you need all possible paths, including multiple paths to the same system, then change the procedure to expandConfig() instead.

neo4j cypher - Differing query plan behavior

Nodes with the Location node label have an index on Label.name
Profiling the following query gives me a smart plan, with a NodeHashJoin between the two sides of the graph on either side of Trip nodes. Very clever. Works great.
PROFILE MATCH (rosen:Location)<-[:OCCURS_AT]-(ev:Event)<-[:HAS]-(trip:Trip)-[:OPERATES_ON]->(date:Date)
WHERE rosen.name STARTS WITH "U Rosent" AND
ev.scheduled_departure_time > "07:45:00" AND
date.date = '2015-11-20'
RETURN rosen.name, ev.scheduled_departure_time, trip.headsign
ORDER BY ev.scheduled_departure_time
LIMIT 20;
However, just changing one line of the query from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE id(rosen) = 4752371 AND
seems to alter the entire behavior of the query plan, which now appears to become more "sequential", losing the parallel execution of (Trip)-[:OPERATES_ON]->(Date)
Much slower. 6x more DB hits in total.
Question
Why does changing the retrieval of one, seemingly-unrelated Location node via a different index/mechanism alter the behavior of the whole query?
(I'm not sure how best to convey more information about the graph model, but please advise, and I'd be happy to add details that are missing)
Edit:
It gets better. Changing that query line from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE rosen.name = "U Rosenthaler Platz." AND
results in the same loss of parallelism in the query plan!
Seems odd that a LIKE query is faster than an = ?

Neo4j spatial withinDistance only returns one node

I am using the spatial server plugin for Neo4j 2.0 and manage to add Users and Cities with their geo properties lat/lon to a spatial index "geom". Unfortunately I cannot get the syntax right to get them back via Neo4jClient :( What I want is basically:
Translate the cypher query START n=node:geom('withinDistance:[60.0,15.0, 100.0]') RETURN n; to Neo4jClient syntax so I can get all the users within a given distance from a specified point.
Even more helpful would be if it is possible to return the nodes with their respective distance to the point?
Is there any way to get the nearest user or city from a given point without specify a distance?
UPDATE
After some trial and error I have solved question 1 and the problem communicating with Neo4j spatial through Neo4jClient. Below Neo4jClient query returns 1 user but only the nearest one even though the database contains 2 users who should be returned. I have also tried plain cypher through the web interface without any luck. Have I completely misunderstood what withinDistance is supposed to do? :) Is there really no one who can give a little insight to question 2 and 3 above? It would be very much appreciated!
var queryString = string.Format("withinDistance:[" + latitude + ", " + longitude + ", " + distance + "]");
var graphResults = graphClient.Cypher
.Start(new { user = Node.ByIndexQuery("geom", queryString) })
.Return((user) => new
{
EntityList = user.CollectAsDistinct<UserEntity>()
}).Results;

The client won't let you using the fluent system, the closest you could get would be something like:
var geoQuery = client.Cypher
.Start( new{n = Node.ByIndexLookup("geom", "withindistance", "[60.0,15.0, 100.0]")})
.Return(n => n.As<????>());
but that generates cypher like:
START n=node:`geom`(withindistance = [60.0,15.0, 100.0]) RETURN n
which wouldn't work, which unfortunately means you have two options:
Get the code and create a pull request adding this in
Go dirty and use the IRawGraphClient interface. Now this is VERY frowned upon, and I wouldn't normally suggest it, but I don't see you having much choice if you want to use the client as-is. To do this you need to do something like: (sorry Tatham)
((IRawGraphClient)client).ExecuteGetCypherResults<Node<string>>(new CypherQuery("START n=node:geom('withinDistance:[60.0,15.0, 100.0]') RETURN n", null, CypherResultMode.Projection));
I don't know the spatial system, so you'll have to wait for someone who does know it to get back to you for the other questions - and I have no idea what is returned (hence the Node<string> return type, but if you get that worked out, you should change that to a proper POCO.

After some trial and error and help from the experts in the Neo4j google group all my problems are now solved :)
Neo4jClient can be used to query withinDistance as below. Unfortunately withinDistance couldn't handle attaching parameters in the normal way so you would probably want to check your latitude, longitude and distance before using them. Also those metrics have to be doubles in order for the query to work.
var queryString = string.Format("withinDistance:[" + latitude + ", " + longitude + ", " + distance + "]");
var graphResults = graphClient.Cypher
.Start(new { city = Node.ByIndexQuery("geom", queryString) })
.Where("city:City")
.Return((city) => new
{
Entity = city.As<CityEntity>()
})
.Limit(1)
.Results;
Cypher cannot be used to return distance, you have to calculate it yourself. Obviously you should be able to use REST http://localhost:7474/db/data/index/node/geom?query=withinDistance:[60.0,15.0,100.0]&ordering=score to get the score (distance) but I didn't get that working and I want to user cypher.
No there isn't but limit the result to 1 as in the query above and you will be fine.
A last note regarding this subject is that you should not add your nodes to the spatial layer just the spatial index. I had a lot of problems and strange exceptions before figure this one out.

Get introduced [linkedin] like

I am using neo4j with people and companies as nodes and friend_of/works_at relationship between these.
I would like to know how to implement a get introduced to a second degree connection that linked in uses. The idea is to get your second degree connections at the company you wish to apply. If there are these second degree connections, then you would like to know who among your 1st deg connections can introduce y*ou to these 2nd deg connections.
For this I'm trying this query :
START from = node:Nodes(startNode), company = node:Nodes(endNode)
MATCH from-[:FRIEND_OF]->f-[:FRIEND_OF]-fof-[:WORKS_AT]->company
WHERE not(fof = from) and not (from-[:FRIEND_OF]->fof)
RETURN distinct f.name, fof.name, company.name
But, this returns duplicate friend of friend names (fof.name), since the distinct is applied on all the parameters that are returned as a whole. It could be like I have friends X and Y who are both connected to Z who works at company C. This way, I get both X-Z-C and Y-Z-C. But, I want to apply distinct on Z, such that I get either X-Z-C or Y-Z-C or maybe a list/collection/aggregate of all friends that connect to Z. This could like ["X","Y"..]->Z How should I modify my query?

http://console.neo4j.org/?id=s1m14g
start joe=node:node_auto_index(name = "Joe")
match joe-[:knows]->friend-[:knows]->friend_of_friend
where not(joe-[:knows]-friend_of_friend)
return collect(friend.name), friend_of_friend.name

Can Neo4j be effectively used to show a collection of nodes in a sortable and filterable table?

I realise this may not be ideal usage, but apart from all the graphy goodness of Neo4j, I'd like to show a collection of nodes, say, People, in a tabular format that has indexed properties for sorting and filtering
I'm guessing the Type of a node can be stored as a Link, say Bob -> type -> Person, which would allow us to retrieve all People
Are the following possible to do efficiently (indexed?) and in a scalable manner?
Retrieve all People nodes and display all of their names, ages, cities of birth, etc (NOTE: some of this data will be properties, some Links to other nodes (which could be denormalised as properties for table display's and simplicity's sake)
Show me all People sorted by Age
Show me all People with Age < 30
Also a quick how to do the above (or a link to some place in the docs describing how) would be lovely
Thanks very much!
Oh and if the above isn't a good idea, please suggest a storage solution which allows both graph-like retrieval and relational-like retrieval

if you want to operate on these person nodes, you can put them into an index (default is Lucene) and then retrieve and sort the nodes using Lucene (see for instance How do I sort Lucene results by field value using a HitCollector? on how to do a custom sort in java). This will get you for instance People sorted by Age etc. The code in Neo4j could look like
Transaction tx = neo4j.beginTx();
idxManager = neo4j.index()
personIndex = idxManager.forNodes('persons')
personIndex.add(meNode,'name',meNode.getProperty('name'))
personIndex.add(youNode,'name',youNode.getProperty('name'))
tx.success()
tx.finish()
'*** Prepare a custom Lucene query context with Neo4j API ***'
query = new QueryContext( 'name:*' ).sort( new Sort(new SortField( 'name',SortField.STRING, true ) ) )
results = personIndex.query( query )
For combining index lookups and graph traversals, Cypher is a good choice, e.g.
START people = node:people_index(name="E*") MATCH people-[r]->() return people.name, r.age order by r.age asc
in order to return data on both the node and the relationships.

Sure, that's easily possible with the Neo4j query language Cypher.
For example:
start cat=node:Types(name='Person')
match cat<-[:IS_A]-person-[born:BORN]->city
where person.age > 30
return person.name, person.age, born.date, city.name
order by person.age asc
limit 10
You can experiment with it in our cypher console.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart