I have a few tables loaded into Neo4j. After going through some tutorials, I came up with this Cypher query:
MATCH (n:car_detail)
RETURN COUNT(DISTINCT n.model_year), n.model, n.maker_name
ORDER BY COUNT(DISTINCT n.model_year) desc
This query gave me all the cars that were continued or discontinued. The logic is that a count of one means the model was discontinued, and anything higher means it was continued.
My table car_detail has cars which were built in different years. I want to create a relationship such as, for example:
"Audi A4 2011" - (:CONTINUED) -> "Audi A4 2015" - (:CONTINUED) -> "Audi A4 2016"
So it sounds like you want to match to the model and make of the car, ordered by the model year ascending, and to create relationships between those nodes.
We can use APOC Procedures as a shortcut for creating the linked list through the ordered and collected nodes. You'll want to install it (choosing the release that matches your Neo4j version) to take advantage of this capability, as the pure Cypher approach is quite ugly.
The query would look something like this:
MATCH (n:car_detail)
WITH n
ORDER BY n.model_year
WITH collect(n) as cars, n.model as model, n.maker_name as maker
WHERE size(cars) > 1
CALL apoc.nodes.link(cars, 'CONTINUED')
RETURN cars
The key here is that after we order the nodes, we aggregate the nodes with respect to the model and maker, which act as your grouping key (when aggregating, the non-aggregation variables become the grouping key for the aggregation). This means your ordered cars are going to be grouped per make and model, so all that's left is to use APOC to create the relationships linking the nodes in the list.
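To make the order-then-group-then-link step concrete, here is a minimal Python sketch of the same logic outside the database. The sample rows and the `link_by_model` helper are made up for illustration; the sketch only mimics what the ordered collect plus apoc.nodes.link achieves and says nothing about how APOC is implemented.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for :car_detail nodes.
cars = [
    {"maker_name": "Audi", "model": "A4", "model_year": 2015},
    {"maker_name": "Audi", "model": "A4", "model_year": 2011},
    {"maker_name": "Audi", "model": "A4", "model_year": 2016},
    {"maker_name": "Audi", "model": "TT", "model_year": 2012},
]

def link_by_model(rows):
    """Return (earlier_year, later_year, maker, model) tuples, one per CONTINUED link."""
    groups = defaultdict(list)
    # Sort first, then group: each group keeps the year order, like ORDER BY + collect().
    for row in sorted(rows, key=lambda r: r["model_year"]):
        groups[(row["maker_name"], row["model"])].append(row)
    links = []
    for (maker, model), group in groups.items():
        if len(group) > 1:  # same filter as WHERE size(cars) > 1
            # Pair each car with the next one in the ordered list.
            for earlier, later in zip(group, group[1:]):
                links.append((earlier["model_year"], later["model_year"], maker, model))
    return links

links = link_by_model(cars)
print(links)
```

The Audi TT group has a single car, so it produces no link, which matches the size(cars) > 1 filter in the query.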
You can just find both cars with MATCH and then connect them:
e.g.
MATCH (c1:car_detail)
where c1.model = 'Audi A4 2011'
MATCH (c2:car_detail)
where c2.model = 'Audi A4 2015'
CREATE (c1)-[:CONTINUED]->(c2);
etc.
I am a newbie who just started learning graph database and I have a problem querying the relationships between nodes.
My graph is like this:
There are multiple relationships from one node to another, and the IDs of these relationships are different.
How can I find pairs of nodes where the number of relationships between the two nodes is greater than 2? Or is there a problem with the model of this graph?
Just like on the graph, I want to query node d and node a.
I tried to use the following statement, but the result is incorrect:
match (from)-[r:INVITE]->(to)
with from, count(r) as ref
where ref >2
return from
It seems to count the number of relations issued by all from, not the relationship between from-->to.
To return nodes that have more than 2 relationships between them, you need to check the size of the collected rels for each pair of nodes, so both end nodes have to be part of the grouping key. Something like:
MATCH (x:Person)-[r:INVITE]-(p:Party)
WITH x, p, size(collect(r)) as inviteCount
WHERE inviteCount > 2
RETURN x, p
Aggregating functions like COLLECT and COUNT use non-aggregating terms in the same WITH (or RETURN) clause as "grouping keys".
So, here is one way to get pairs of nodes that have more than 2 INVITE relationships (in a specific direction) between them:
MATCH (from)-[r:INVITE]->(to)
WITH from, to, COUNT(r) AS ref
WHERE ref > 2
RETURN from, to
NOTE: Ideally (for clarity and efficiency), your nodes would have specific labels and the MATCH pattern would specify those labels.
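The effect of the grouping key can be checked with a small Python model. The relationship list below is invented for illustration (only the node names a and d come from the question); `Counter` plays the role of COUNT(r) with one key or with two.

```python
from collections import Counter

# Hypothetical INVITE relationships as (from, to) pairs; d invites a three times.
invites = [("d", "a"), ("d", "a"), ("d", "a"), ("d", "b"),
           ("c", "a"), ("c", "b"), ("c", "e")]

# Grouping by `from` only (like WITH from, count(r)) counts every outgoing invite.
per_sender = Counter(src for src, _ in invites)

# Grouping by the pair (like WITH from, to, count(r)) counts invites between two nodes.
per_pair = Counter(invites)

senders_over_2 = sorted(n for n, c in per_sender.items() if c > 2)
pairs_over_2 = sorted(p for p, c in per_pair.items() if c > 2)
print(senders_over_2, pairs_over_2)
```

Node c passes the first filter even though it never invites the same node twice, which is the incorrect result described in the question; only the (d, a) pair passes the second.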
I have a graph that looks like this.
I want to find all the items bought by the people, who bought the same items as Gremlin using cypher.
Basically I want to imitate the query in the gremlin examples that looks like this
g.V().has("name","gremlin")
.out("bought").aggregate("stash")
.in("bought").out("bought")
.where(not(within("stash")))
.groupCount()
.order(local).by(values,desc)
I was trying to do it like this
MATCH (n)-[:BOUGHT]->(g_item)<-[:BOUGHT]-(r),
(r)-[:BOUGHT]->(n_item)
WHERE
n.name = 'Gremlin'
AND NOT (n)-[:BOUGHT]->(n_item)
RETURN n_item.id, count(*) as frequency
ORDER by frequency DESC
but it seems it doesn't count frequencies properly - they seem to be twice as big.
4 - 4
5 - 2
3 - 2
While 3 and 5 were bought only once and 4 was bought 2 times.
What's the problem?
Cypher is interested in paths, and your MATCH finds the following:
2 paths to item 3 both through Rexter (via items 2 and 1)
2 paths to item 5 through Pipes (via items 1 and 2)
4 paths to item 4 via Rexter and Pipes (via items 1 and 2 for each person)
Basically the items are being counted multiple times because there are multiple paths to that same item per individual person via different common items with Gremlin.
To get accurate counts, you have two options. Either match to distinct r users first, and only then match out to the items those r users bought (as long as they aren't in the collection of items bought by Gremlin). Or do the entire match, but before doing the counts, get distinct items with respect to each person so each item occurs only once per person, and then get the count per item (counts across all persons).
Here's a query that uses the second approach:
MATCH (n:Person)-[:BOUGHT]->(g_item)
WHERE n.name = 'Gremlin'
WITH n, collect(g_item) as excluded
UNWIND excluded as g_item // now you have excluded list to use later
MATCH (g_item)<-[:BOUGHT]-(r)-[:BOUGHT]->(n_item)
WHERE r <> n AND NOT n_item in excluded
WITH DISTINCT r, n_item
WITH n_item, count(*) as frequency
RETURN n_item.id, frequency
ORDER by frequency DESC
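The difference between counting paths and counting buyers can be reproduced in a few lines of Python. The purchase data below is an assumption, reconstructed from the path counts listed above (Gremlin bought items 1 and 2, Rexter and Pipes both share them); the `recommend` helper is hypothetical.

```python
from collections import Counter

# Assumed purchases, reconstructed from the path counts described in the answer.
bought = {
    "Gremlin": {1, 2},
    "Rexter": {1, 2, 3, 4},
    "Pipes": {1, 2, 4, 5},
}

def recommend(name):
    mine = bought[name]
    rows = []  # one row per path: (shared item, other person, candidate item)
    for shared in mine:
        for person, items in bought.items():
            if person != name and shared in items:
                for item in items - mine:
                    rows.append((shared, person, item))
    # Counting rows directly counts paths, like count(*) on the raw MATCH.
    naive = Counter(item for _, _, item in rows)
    # Deduplicating (person, item) first counts buyers, like WITH DISTINCT r, n_item.
    distinct = Counter(item for _, item in {(p, i) for _, p, i in rows})
    return naive, distinct

naive, distinct = recommend("Gremlin")
print(dict(naive), dict(distinct))
```

With this data the path-based counts are exactly double the buyer-based ones, because every other person is reachable through two shared items.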
You should be using labels in your graph, and you should use them in your query in order to leverage indexes and quickly find a starting point in the graph. In your case, an index on :Person(name), and usage of the :Person label in the query, should make this quick even as more nodes and more :Persons are added to the graph.
EDIT
If you're just looking for conciseness of the query, and don't have a large enough graph where performance will be an issue, then you can use your original query but add one extra line to get distinct rows of r and n_item before you count the item. This ensures that you only count an item per person once when you get the count.
Note that this forgoes optimizations for handling excluded items (it will do a pattern match per item rather than aggregating the collection of bought items and doing a collection membership check), and it aggregates on items while doing property access rather than doing property access only after aggregating by the node.
MATCH (n:Person)-[:BOUGHT*2]-(r)-[:BOUGHT]->(n_item)
WHERE n.name = 'Gremlin'
WITH DISTINCT n, r, n_item
WHERE NOT (n)-[:BOUGHT]->(n_item)
RETURN n_item.id, count(*) as frequency
ORDER by frequency DESC
I am adding a quick shortcut in your match, using :BOUGHT*2 to indicate two :BOUGHT hops to r, since we don't really care about the item in-between.
I am making a named entity graph in Neo4j 3.2.0. I have ARTICLE and ENTITY as node types, and the relationship between them is CONTAINS, which represents the number of times the entity occurs in that article (as shown in the attached picture, "Simple graph for articles and entities"). So if an entity occurs 5 times in an article, there will be 5 edges between that article and that entity.
There are roughly 18 million articles and 40 thousand unique entities. The whole data is around 20 GB (including indices on IDs) and is loaded on a machine with 32 GB RAM.
I am using this graph to suggest/recommend the other entities. But my queries are taking too much time.
Use Case1: Find all entities present in the articles which have an entity from list ["A", "B"] and also having an entity "X" and an entity "Y" and an entity "Z" in the order of articles count.
Here is the cypher query I am running.
MATCH(e:Entity)-[:CONTAINS]-(a:Article)
WHERE e.EID in ["A","B"]
WITH a
MATCH (:Entity {EID:"X"})-[:CONTAINS]-(a)
WITH a
MATCH (:Entity {EID:"Y"})-[:CONTAINS]-(a)
WITH a
MATCH (:Entity {EID:"Z"})-[:CONTAINS]-(a)
WITH a
MATCH (a)-[:CONTAINS]-(e2:Entity)
RETURN e2.EID as EID, e2.Text as Text, e2.Type as Type ,count(distinct(a)) as articleCount
ORDER BY articleCount desc
Query Profile is here: Query Profile
This query gives me all first level entity neighbours of articles having X,Y,Z and at least one of A,B entities (I had to change the IDs in the query for content sensitivity).
I was just wondering if there is a better/fast way of doing it?
Another observation is that if I keep adding filters (more MATCH clauses like those for X, Y, Z), performance deteriorates, despite the fact that the result set is getting smaller and smaller.
You have a uniqueness constraint on :Entity(EID), so at least that optimization is already in place.
The following Cypher query is simpler, and generates a simpler execution plan. Hopefully, it also reduces the number of DB hits.
MATCH (e:Entity)-[:CONTAINS]-(a)
WHERE e.EID in ['A','B'] AND ALL(x IN ['X','Y','Z'] WHERE (:Entity {EID: x})-[:CONTAINS]-(a))
WITH a
MATCH (a)-[:CONTAINS]-(e2:Entity)
RETURN e2.EID as EID, e2.Text as Text, e2.Type as Type, COUNT(DISTINCT a) as articleCount
ORDER BY articleCount DESC;
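As a rough model of what this query computes, the filter is "at least one of A, B" plus "all of X, Y, Z", followed by a count of distinct matching articles per neighbouring entity. The Python sketch below uses an invented article-to-entity mapping and a hypothetical `cooccurring` helper; it illustrates the semantics only and says nothing about how Neo4j plans or executes the query.

```python
from collections import Counter

# Hypothetical article -> entity IDs, standing in for CONTAINS relationships.
articles = {
    "a1": {"A", "X", "Y", "Z", "Q"},
    "a2": {"B", "X", "Y", "Z", "Q", "R"},
    "a3": {"A", "X", "Y"},       # missing Z, so it is filtered out
    "a4": {"X", "Y", "Z", "R"},  # contains neither A nor B
}

def cooccurring(any_of, all_of):
    # Keep articles with at least one entity from any_of and every entity in all_of.
    matching = sorted(a for a, ents in articles.items()
                      if ents & any_of and all_of <= ents)
    # Using sets per article means each article counts once, like COUNT(DISTINCT a).
    counts = Counter(e for a in matching for e in articles[a])
    return matching, counts

matching, counts = cooccurring({"A", "B"}, {"X", "Y", "Z"})
print(matching, counts.most_common())
```

As in the Cypher query, the filter entities themselves (A, B, X, Y, Z) appear in the result alongside the other neighbours.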
I am using Neo4j CE 3.1.1 and I have a relationship WRITES between authors and books. I want to find the N (say N=10 for example) books with the largest number of authors. Following some examples I found, I came up with the query:
MATCH (a)-[r:WRITES]->(b)
RETURN r,
COUNT(r) ORDER BY COUNT(r) DESC LIMIT 10
When I execute this query in the Neo4j browser I get 10 books, but these do not look like the ones written by most authors, as they show only a few WRITES relationships to authors. If I change the query to
MATCH (a)-[r:WRITES]->(b)
RETURN b,
COUNT(r) ORDER BY COUNT(r) DESC LIMIT 10
Then I get the 10 books with the most authors, but I don't see their relationship to authors. To do so, I have to write additional queries explicitly stating the name of a book I found in the previous query:
MATCH ()-[r:WRITES]->(b)
WHERE b.title="Title of a book with many authors"
RETURN r
What am I doing wrong? Why isn't the first query working as expected?
Aggregations only have context based on the non-aggregation columns, and with your match, a unique relationship will only occur once in your results.
So your first query is asking for each relationship on a row, and the count of that particular relationship, which is 1.
You might rewrite this in a couple different ways.
One is to collect the authors and order on the size of the author list:
MATCH (a)-[:WRITES]->(b)
RETURN b, COLLECT(a) as authors
ORDER BY SIZE(authors) DESC LIMIT 10
You can always collect the author and its relationship, if the relationship itself is interesting to you.
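To see why the first query reports a count of 1 on every row, it may help to model the two groupings in Python. The WRITES triples below are invented sample data, with `Counter` standing in for COUNT(r).

```python
from collections import Counter

# Hypothetical WRITES relationships as (relationship id, author, book) triples.
writes = [
    (1, "Ann", "Graphs"), (2, "Bob", "Graphs"), (3, "Cy", "Graphs"),
    (4, "Ann", "Trees"), (5, "Dee", "Trees"),
    (6, "Bob", "Lists"),
]

# RETURN r, COUNT(r): the grouping key is the relationship itself, so every count is 1.
per_relationship = Counter(rel_id for rel_id, _, _ in writes)

# RETURN b, COUNT(r): the grouping key is the book, so the count is its number of authors.
per_book = Counter(book for _, _, book in writes)

top_2 = per_book.most_common(2)  # like ORDER BY COUNT(r) DESC LIMIT 2
print(set(per_relationship.values()), top_2)
```

Since every per-relationship count is 1, ordering by it is meaningless and LIMIT 10 simply returns ten arbitrary relationships.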
EDIT
If you happen to have labels on your nodes (you absolutely SHOULD have labels on your nodes), you can try a different approach by matching to all books, getting the size of the incoming :WRITES relationships to each book, ordering and limiting on that, and then performing the match to the authors:
MATCH (b:Book)
WITH b, SIZE(()-[:WRITES]->(b)) as authorCnt
ORDER BY authorCnt DESC LIMIT 10
MATCH (a)-[:WRITES]->(b)
RETURN b, a
You can collect on the authors and/or return the relationship as well, depending on what you need from the output.
You are very close: after sorting, it is necessary to rediscover the authors. For example:
MATCH (a:Author)-[r:WRITES]->(b:Book)
WITH b,
COUNT(r) AS authorsCount
ORDER BY authorsCount DESC LIMIT 10
MATCH (b)<-[:WRITES]-(a:Author)
RETURN b,
COLLECT(a) AS authors
ORDER BY size(authors) DESC
Say I have a database with orders node as below
Order{OrderId, Customer, Date, Quantity, Product}
Now I want to refactor this node in the database to look as below using a Cypher query:
(day)<-[:PLACED_ON]-(Order{OrderId, Quantity})-[:PLACED_BY]->(customer), (Order)-[:FOR_PRODUCT]->(product)
I understand that we can actually do such a thing directly in Cypher, without having to load all the nodes into my code and then make multiple Cypher calls to the database.
Would it be possible for someone to help me understand how such a refactoring can be done without introducing duplicates of the customer, product and day nodes?
Regards
Kiran
Yes, you can manipulate a Neo4j database with Cypher.
Guessing that your current Order node looks similar to:
CREATE (:ORDER {orderId:100,customer:'John',date:13546354,quantity:1,product:'pizza'})
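You could write the following, using MERGE for the day, customer and product nodes so that orders sharing a value reuse one node instead of creating duplicates:
MATCH (o:ORDER)
MERGE (d:DAY {timestamp: o.date})
MERGE (c:CUSTOMER {name: o.customer})
MERGE (p:PRODUCT {name: o.product})
CREATE (d)<-[:PLACED_ON]-(o)-[:PLACED_BY]->(c)
CREATE (o)-[:FOR_PRODUCT]->(p)
REMOVE o.product, o.customer, o.date
RETURN o as order, d as day, c as customer, p as product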
The query output would be:
Nodes created: 3
Relationships created: 3
Properties set: 6
Labels added: 3
Note that if you have a large dataset, migrating an entire database can be very time consuming! You might want to try the PERIODIC COMMIT feature in the 2.1.0 milestone release.
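Since the question asks specifically about avoiding duplicate customer, product and day nodes, here is a minimal Python sketch of the get-or-create idea, which is the behaviour Cypher's MERGE provides. The second order, the `nodes` dictionary and the `get_or_create` helper are all hypothetical and only model the refactoring in memory.

```python
# Hypothetical flat orders, mirroring the Order{OrderId, Customer, Date, Quantity, Product} shape.
orders = [
    {"orderId": 100, "customer": "John", "date": 13546354, "quantity": 1, "product": "pizza"},
    {"orderId": 101, "customer": "John", "date": 13546354, "quantity": 2, "product": "pasta"},
]

nodes = {}          # (label, key) -> node dict; this lookup is what prevents duplicates
relationships = []  # (orderId, relationship type, (label, key))

def get_or_create(label, key):
    # Reuse the node if one with this label and key exists, otherwise create it.
    return nodes.setdefault((label, key), {"label": label, "key": key})

for order in orders:
    for rel_type, label, prop in [("PLACED_ON", "DAY", "date"),
                                  ("PLACED_BY", "CUSTOMER", "customer"),
                                  ("FOR_PRODUCT", "PRODUCT", "product")]:
        # pop() mirrors REMOVE: the property moves off the order and onto its own node.
        node = get_or_create(label, order.pop(prop))
        relationships.append((order["orderId"], rel_type, (node["label"], node["key"])))

print(len(nodes), len(relationships), orders)
```

Both orders end up pointing at the same John and the same day, so two orders yield four nodes rather than six, while each order still gets its own three relationships.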