I am reading a big file into Neo4j with the script below:
WITH $dict.rows as rows UNWIND rows as row
WITH row WHERE row.object CONTAINS 'wikidata'
MERGE(e:Entity {wikidataId: replace(row.object,"http://www.wikidata.org/entity/","")})
SET e.dbpediaUri = row.subject
WITH distinct $dict.rows as rows UNWIND rows as row
MATCH(e:Entity) where e.dbpediaUri = row.subject
WITH row, e
CREATE(object:Property {value:row.object, type: "string"})
WITH row,e,object
CALL apoc.create.relationship(e, row.predicate, {source:"dbpedia", type:"uri"}, object) YIELD rel
RETURN null
Here I want to first merge entities with the given wikidata id (I need the WITH with WHERE so that I first get only the desired rows), and in the second loop I want to add relationships to that entity.
I'm wondering whether this code would end up producing a cartesian product. Will the second WITH ... UNWIND statement run inside the first one or not? If so, how can I achieve what I want to do in one query?
To my understanding, your second UNWIND on rows will run inside the first one. I assume you want to prevent this. After SET e.dbpediaUri = row.subject you need to close the loop by using a proper aggregation function; in your case, for example: WITH collect(e) AS entities
Caution: in case WITH row WHERE row.object CONTAINS 'wikidata' returns 0 records, the rest of the Cypher query will NOT be executed and the second UNWIND will never be reached. It is therefore safer to split your query into 2 different transactions.
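Putting that together, a minimal sketch of the single-query version (assuming the same $dict parameter shape as in your script; the caveat above about 0 matching rows still applies):
WITH $dict.rows AS rows
UNWIND rows AS row
WITH row WHERE row.object CONTAINS 'wikidata'
MERGE (e:Entity {wikidataId: replace(row.object, "http://www.wikidata.org/entity/", "")})
SET e.dbpediaUri = row.subject
// close the first loop with an aggregation before unwinding again
WITH collect(e) AS entities
UNWIND $dict.rows AS row
MATCH (e:Entity) WHERE e.dbpediaUri = row.subject
CREATE (object:Property {value: row.object, type: "string"})
CALL apoc.create.relationship(e, row.predicate, {source: "dbpedia", type: "uri"}, object) YIELD rel
RETURN count(rel)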
Best
Markus
I have a JSON document with history-based entity counts and relationship counts. I want to use this lookup data for entities and relationships in Neo4j. The lookup data has around 3000 rows. For the entity counts, I want to display the counts for two entities based on UUID. For relationships, I want to order by two relationship counts (related entities and related mutual entities).
For entities, I have started with the following:
// get JSON doc
with value.aggregations.ent.terms.buckets as data
unwind data as lookup1
unwind data as lookup2
MATCH (e1:Entity)-[r1:RELATED_TO]-(e2)
WHERE e1.uuid = '$entityId'
AND e1.uuid = lookup1.key
AND e2.uuid = lookup2.key
RETURN e1.uuid, lookup1.doc_count, r1.uuid, e2.uuid, lookup2.doc_count
ORDER BY lookup2.doc_count DESC // just to demonstrate
LIMIT 50
I'm noticing that the query takes about 10 seconds. What am I doing wrong, and how can I correct it?
Attaching explain plan:
Your query is very inefficient. You stated that data has 3,000 rows (let's call that number D).
So, your first UNWIND creates an intermediate result of D rows.
Your second UNWIND creates an intermediate result of D**2 (i.e., 9 million) rows.
If your MATCH (e1:Entity)-[r1:RELATED_TO]-(e2) clause finds N results, that generates an intermediate result of up to N*(D**2) rows.
Since your MATCH clause specifies a non-directional relationship pattern, it finds the same pair of nodes twice (in reverse order). So, N is actually twice as large as it needs to be.
Here is an improved version of your query, which should be much faster (with N/2 intermediate rows):
WITH apoc.map.groupBy(value.aggregations.ent.terms.buckets, 'key') as lookup
MATCH (e1:Entity)-[r1:RELATED_TO]->(e2)
WHERE e1.uuid = $entityId AND lookup[e1.uuid] IS NOT NULL AND lookup[e2.uuid] IS NOT NULL
RETURN e1.uuid, lookup[e1.uuid].doc_count AS count1, r1.uuid, e2.uuid, lookup[e2.uuid].doc_count AS count2
ORDER BY count2 DESC
LIMIT 50
The trick here is that the query uses apoc.map.groupBy to convert your buckets (a list of maps) into a single unified lookup map that uses the bucket key values as its property names. This allows the rest of the query to literally "look up" each uuid's data in the unified map.
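For illustration, here is a hypothetical run of that transformation (the uuids and counts are made up, not taken from your data):
// buckets is a list of maps; groupBy turns it into one map keyed by 'key'
WITH [{key: "uuid-1", doc_count: 42}, {key: "uuid-2", doc_count: 7}] AS buckets
RETURN apoc.map.groupBy(buckets, 'key') AS lookup
// lookup = {`uuid-1`: {key: "uuid-1", doc_count: 42}, `uuid-2`: {key: "uuid-2", doc_count: 7}}
// so lookup[e1.uuid].doc_count is a single map access instead of a scan over 3,000 rows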
Depending on the order in which I query 2 relationships, I get 2 different answers, despite the queries being equivalent (as far as I understand). The queries obviously are not equivalent, but I don't know why.
MATCH p1=(:Barrier {code: 'B2'})-[:REL1]->()
WITH count(DISTINCT p1) AS failed_B2
MATCH p2=(:Barrier {code: 'B2'})-[:REL2]->()
RETURN count(DISTINCT p2) AS worked_B2, failed_B2
Returns 1 and 0 - which is correct
But the other way round:
MATCH p1=(:Barrier {code: 'B2'})-[:REL2]->()
WITH count(DISTINCT p1) AS failed_B2
MATCH p2=(:Barrier {code: 'B2'})-[:REL1]->()
RETURN count(DISTINCT p2) AS worked_B2, failed_B2
Returns 0 and 0 - which is incorrect
I would like to combine the results of multiple queries, but UNION does not work because it forces the results into the same columns, which in my case would be incorrect. I need the results in different columns.
So this is an interesting thing that centers around what happens when rows get filtered out (such as when a MATCH fails or a WHERE condition filters the row).
But first we need to address what you observed in the second case: "Returns 0 and 0". I don't think that's really true, and I'd like to know what version of Neo4j you're using here. In this particular case, I would instead have expected no rows to be returned, which is ENTIRELY different from a row being returned with 0 values for both.
When Cypher queries execute, they build up records (or rows) of data, and Cypher operations execute per row. So when you do a MATCH at some point in your query, it is performed per row, and when the MATCH fails (no such pattern exists that adheres to your WHERE clause, if present), the row is filtered out. This is important, because it means that any other data in that record is gone and no longer addressable.
The second thing to keep in mind is that we allow certain aggregations, such as count() and collect(), to execute even when no rows are present. It is conceivable that you have a query where nothing matches, and getting a count of 0 (or an empty collection when you collect) is an entirely valid result and should be allowed. In these cases, where there may not be any rows left at all after a MATCH or filter (and because of no rows, nothing else would be able to execute, as Cypher operations execute per row, so it's a no-op if there are no rows), the count() or collect() causes a new row to emerge with that count of 0, or that empty collection. And since a row is now present, the remaining operators in the query have something to execute on, and the query can continue execution.
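You can see this standalone-aggregation behavior in isolation; even when the label does not exist in the graph (LabelThatDoesNotExist is just a placeholder here), this still returns one row:
MATCH (x:LabelThatDoesNotExist)
RETURN count(x)   // one row, with a value of 0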
This is what happens in your first case, where pattern p1 doesn't exist, but pattern p2 does (once). Here's the breakdown of what happens:
The first match fails to find anything. Rows go to 0. There is nothing left to execute subsequent operations upon.
You perform a standalone count() aggregation (with no other variables in scope, this is important). This emits a single row with a count of 0, which is correct: there are no occurrences of that pattern in the graph.
You perform the second MATCH, and there is a record/row for it to execute upon (with the value {failed_B2:0}), and it finds the single occurrence, and gets its count (1), and is able to output the expected answer (1, 0), with the 1 being the count of the pattern matches at the end of the query, p2, and the 0 being the count of the pattern matches from the first two lines of the query, p1.
Now let's see what happens when we reverse this.
In your second query, it is now pattern p1 that exists once in the graph, and pattern p2 that doesn't exist. Here's the breakdown that happens:
The first MATCH succeeds and finds the pattern.
You get the count of the patterns found: 1. You now have a single record/row with the value {failed_B2:1}
You execute the second MATCH, and the pattern isn't found. The record/row is filtered out. You now have no records/rows, so not only is there nothing to operate on, anything that was previously in the record/rows is gone. There IS no failed_B2 value anywhere to reference.
You attempt to get the count of p2 along with failed_B2. But this isn't allowed by Cypher: we only allow aggregation across 0 rows when it's a standalone count() or collect(). There IS no failed_B2 to reference; it was wiped out when the record/row that contained it got filtered out. There is no way to process that sanely, as that previously existing data is just not there (and this is correct behavior). The query should be returning no rows...which is NOT the same as 0, 0, as that implies you got a row returned (which is why I'm interested in clarifying that point with you).
As for how you should be correctly executing this, when you have to aggregate like this and you know that some patterns may not exist, use an OPTIONAL MATCH instead.
When you OPTIONAL MATCH, it doesn't filter out the row if no match is found. Instead, newly introduced variables in the pattern are set to null, and since count() and collect() ignore nulls, you get a correct count of 0 without wiping out the record/row that contains the failed_B2 value you also want to return at the end.
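A sketch of your second query rewritten this way (same labels and relationship types as in your example):
OPTIONAL MATCH p1=(:Barrier {code: 'B2'})-[:REL2]->()
WITH count(DISTINCT p1) AS failed_B2
OPTIONAL MATCH p2=(:Barrier {code: 'B2'})-[:REL1]->()
RETURN count(DISTINCT p2) AS worked_B2, failed_B2
// p2 is null when the pattern is absent, count() ignores the null, and failed_B2 survives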
This was inspired by these two SO threads:
Boolean value return from Neo4j cypher query without CASE
How to set property name and value on csv load?
I have a CSV with 3 columns:
first_name,prop_name,prop_value
John,weight,100
Paul,height,200
John,hair_color,blonde
Paul,weight,120
So, there are a number of people, and their properties are randomly scattered across different rows. My goal is to screen the rows and assign all found properties to their holders. For the sake of simplicity, let's focus on the 'weight' property only.
I do know how to make this work the long way:
LOAD CSV WITH HEADERS FROM
"file:///test.csv" AS line
WITH line
MERGE (m:Person {name:line.first_name})
WITH line, CASE line.prop_name WHEN "weight" THEN [1] ELSE [] END as loopW
UNWIND loopW as w
MATCH (p:Person {name: line.first_name})
SET p.weight = line.prop_value
But then, I tried to replace the CASE line with a shorter version
WITH line, collect(line.prop_name = "weight") as loopW
...which resulted in weird behavior, where the created nodes did get their 'weight' keys assigned, but sometimes with the wrong values. So, I could see something like (:Person {weight:blue})
What would be the right way to get rid of the CASE?
You should know that your current usage filters out all lines that don't have "weight" as the prop_name (the UNWIND of an empty collection wipes out all the other lines and they won't get processed).
What you really need is a better way to set dynamically named properties on your nodes so you can avoid the CASE usage completely.
If you can install APOC Procedures (please read the instructions at the top for how to modify your neo4j.conf to whitelist the procedures, and pay attention to the version matrix to ensure you get the version that corresponds to your Neo4j version), there is one that is a perfect fit for what you're trying to do: CALL apoc.create.setProperty([node,id,ids,nodes], key, value) YIELD node
Usage would be something like:
LOAD CSV WITH HEADERS FROM
"file:///test.csv" AS line
MERGE (m:Person {name:line.first_name})
CALL apoc.create.setProperty(m, line.prop_name, line.prop_value) YIELD node
RETURN count(distinct m)
EDIT
Expanding on what's wrong with the original query:
UNWIND produces rows in a multiplicative way, with respect to the number of elements in the collection. If the collection in the row had 5 elements, the single row would result in 5 rows, one for each element. If the collection was empty, the row would be removed instead, as there wouldn't be any elements in the collection for which to output rows. Because of this, any further WITH line, CASE ... lines in the query wouldn't do any good.
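You can see the empty-collection case in isolation:
UNWIND [1, 2, 3] AS x RETURN x   // produces 3 rows
UNWIND [] AS x RETURN x          // produces no rows; the originating row is eliminated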
Let's analyze the original query, going by the example input csv:
LOAD CSV WITH HEADERS FROM
"file:///test.csv" AS line
WITH line // redundant, this WITH is not needed
MERGE (m:Person {name:line.first_name}) // 4 rows corresponding with 2 nodes
WITH line, CASE line.prop_name WHEN "weight" THEN [1] ELSE [] END as loopW
// still 4 rows, 2 have [1] as loopW, the other 2 have [] as loopW
UNWIND loopW as w // 2 rows eliminated by unwinding empty collections
MATCH (p:Person {name: line.first_name})
SET p.weight = line.prop_value
// only 2 rows are for 'John,weight,100' and 'Paul,weight,120'
// any further repetitions of WITH line, CASE ... UNWIND for different props will fail and eliminate the remaining 2 rows.
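If APOC is not an option, a pure-Cypher sketch that avoids eliminating rows is the FOREACH conditional trick (still one clause per property name, so it only partially removes the repetition):
LOAD CSV WITH HEADERS FROM
"file:///test.csv" AS line
MERGE (m:Person {name: line.first_name})
// the list is [1] when the property matches and [] otherwise, so SET runs 0 or 1 times
FOREACH (ignoreMe IN CASE WHEN line.prop_name = "weight" THEN [1] ELSE [] END |
  SET m.weight = line.prop_value)
FOREACH (ignoreMe IN CASE WHEN line.prop_name = "height" THEN [1] ELSE [] END |
  SET m.height = line.prop_value)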
Let's say I have nodes that are connected by a FRIEND relationship.
I want to query 2 of them each time, so I use SKIP and LIMIT to do this.
However, if someone adds a FRIEND in between calls, this messes up my results (since suddenly the 'whole list' is pushed 1 index forward).
For example, let's say I had this list of friends (ordered by some parameter):
A B C D
I query the first time, so I get A B (skipped 0 and limited 2).
Then someone adds a friend named E, and the list is now E A B C D.
Now the second query will return B C (skipped 2 and limited 2). Notice that B is returned twice, because the skipping method is not aware of the changes in the DB.
Is there a way to return 2 each time, starting from where the previous query left off? For example, if I knew that B was the last friend returned, I could provide it to the query and it would return the NEXT 2, getting C D (which is correct) instead of B C.
I tried to find a solution and read about START and indexes, but I am not sure how to do this.
Thanks for your time!
You could store a timestamp when the FRIEND relationship was created and order by that property.
When the FRIEND relationship is created, add a timestamp property:
MATCH (a:Person {name: "Bob"}), (b:Person {name: "Mike"})
CREATE (a)-[r:FRIEND]->(b)
SET r.created = timestamp()
Then when you are paginating through friends two at a time you can order by the created property:
MATCH (a:Person {name: "Bob"})-[r:FRIEND]->(friends)
RETURN friends
ORDER BY r.created
SKIP {page_number_times_page_size} LIMIT {page_size}
You can parameterize this query with the page size (the number of friends to return) and the number of friends to skip based on which page you want.
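If you also want pages to stay stable when new FRIEND relationships are created between calls, a keyset-style variant is possible (a sketch; {last_created} is a hypothetical parameter holding the created value of the last friend you returned):
MATCH (a:Person {name: "Bob"})-[r:FRIEND]->(friends)
WHERE r.created > {last_created}
RETURN friends, r.created
ORDER BY r.created
LIMIT {page_size}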
Sorry if it's not exactly an answer to your question. On a previous project I had to modify a lot of data. It wasn't possible to modify everything with one query, so I needed to split it into batches. I first started with SKIP/LIMIT, but in some cases it behaved unpredictably (it did not modify all the data). When I got tired of looking for the reason, I changed my approach: I used Java to query the database, fetched all the ids I needed to modify in a first query, and then iterated over the stored ids.
How do I make the equivalent of this left join in Cypher, but returning a single row for each activity? (Assume only 1 optional target record and 1 optional source record per activity - both can be present or absent, etc.)
person-----activity--+--> source (optional table, left join)
+--> target(optional table, left join)
Of course I don't want two rows if both a source and a target exist, so this would be rolled up via a crosstab/pivot table query so that the final row would be
person...activity...(optional source node)...(optional target node)
I want to bring back the person, the activity, and optionally the source and target rows if they exist - if not, just the person and activity. I want one row per activity.
See the graph example at: http://console.neo4j.org/?id=rogg0w
Here is my guess as how to accomplish this
start n=node(1)
match n-[act_rel]->activity-[?sourcerel]-(source)
with n
match n-[act_rel]->activity-[?targetrel]-(target)
return n, activity, source.description! as description, target.description! as target_description
Basically I'd like to bring back a person's list of activities with optional source and target nodes on a single row for each activity - is this possible to do through Cypher alone? (Using ROR / REST.) What is wrong with the Cypher query above? Or must I get the person and activity data from Cypher and then look up nodes individually via code, which would be, IMO, a big performance hit.
You can use a comma to specify multiple parts of a pattern:
start n=node(1)
match n-[:act_rel]->activity-[?:sourcerel]-source, activity-[?:targetrel]-target
return n, activity, source.description as description, target.description as target_description
Make sense?
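Note that the -[?]- optional-relationship syntax comes from old Cypher and has since been removed; a sketch of the same idea in current Cypher uses OPTIONAL MATCH (the relationship type names are taken from the query above and assume they exist in your data):
MATCH (n)-[:act_rel]->(activity)
WHERE id(n) = 1
OPTIONAL MATCH (activity)-[:sourcerel]-(source)
OPTIONAL MATCH (activity)-[:targetrel]-(target)
RETURN n, activity, source.description AS description, target.description AS target_description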