I am importing a large CSV with a lot of information in it. Rather than slicing it up, duplicating a lot of data, and cleansing it, does Neo4j support multiple WHERE clauses in a single import? For instance:
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///registryDump.csv' AS line
WITH line
WHERE line[25] IS NOT NULL
MERGE (u:User {name: line[25]})
ON CREATE SET u.source = "Registry", u.type = "Owner"
Then additionally add another:
WHERE line[12] IS NOT NULL
MERGE (u:User {name: line[12]})
ON CREATE SET u.source = "Registry", u.type = "Steward"
Can these be combined into one much larger query?
Use a combination of CASE and FOREACH:
WITH [0,1,null,3] AS line
// ignoreMe is a throwaway loop variable; each body runs once or not at all
FOREACH(ignoreMe IN CASE WHEN line[0]=0 THEN [1] ELSE [] END |
  MERGE (U:User{name:0})
  ON CREATE SET U.source = "Registry", U.type = "Steward"
)
FOREACH(ignoreMe IN CASE WHEN line[1]<1 THEN [1] ELSE [] END |
  MERGE (U:User{name:1})
)
FOREACH(ignoreMe IN CASE WHEN line[2] IS NOT NULL THEN [1] ELSE [] END |
  MERGE (U:User{name:line[2]})
)
FOREACH(ignoreMe IN CASE WHEN line[3] IS NOT NULL THEN [1] ELSE [] END |
  MERGE (U:User{name:line[3]})
)
Taken from Mark Needham. Neo4j: LOAD CSV – Handling empty columns
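Applied to the import from the question, the same trick might look like the sketch below. It is untested and assumes the column positions (25 and 12) and property values given in the question. The point of the FOREACH wrapper is that a WITH ... WHERE filter drops the whole row, so a second WHERE would only ever see rows that passed the first, whereas each FOREACH is evaluated independently per row. It also avoids the error MERGE raises when the property value is null.

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///registryDump.csv' AS line
// Owner column: the body runs only when line[25] has a value
FOREACH (ignoreMe IN CASE WHEN line[25] IS NOT NULL THEN [1] ELSE [] END |
  MERGE (u:User {name: line[25]})
  ON CREATE SET u.source = "Registry", u.type = "Owner"
)
// Steward column: checked independently of the owner column
FOREACH (ignoreMe IN CASE WHEN line[12] IS NOT NULL THEN [1] ELSE [] END |
  MERGE (u:User {name: line[12]})
  ON CREATE SET u.source = "Registry", u.type = "Steward"
)
```

Note that if the same name appears in both columns, the node keeps whichever type was set when it was first created, because ON CREATE does not fire on later matches.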
I have created the following nodes in neo4j (1 million of them):
CREATE (p:Person { name: 'user1', email: ['user1#gmail.com', 'user1#yahoo.com'] }) RETURN p
CREATE (p:Person { name: 'user2', email: ['user2#gmail.com', 'user2#yahoo.com'] }) RETURN p
...
CREATE (p:Person { name: 'user1000000', email: ['user1000000#gmail.com', 'user1000000#yahoo.com'] }) RETURN p
I have created the following indexes:
CREATE BTREE INDEX i1 FOR (n:Person) ON (n.name)
CREATE BTREE INDEX i2 FOR (n:Person) ON (n.email)
With the above data, the following query takes 2ms to complete and I can concurrently execute about 2800 such queries per second on my desktop.
MATCH (p:Person) WHERE p.name = 'user10' RETURN DISTINCT p.name
But the following query takes 710ms to complete and I can concurrently execute only about 5 such queries per second on my desktop.
MATCH (p:Person) WHERE 'user10#gmail.com' IN p.email RETURN DISTINCT p.name
Is there any way to speed up the second query and also increase the throughput?
Edit 1:
I tried to use separate nodes for email as suggested by @jose_bacoy in his answer.
I created the following nodes:
CREATE (m1:mail { email: 'user1#gmail.com' })
CREATE (m2:mail { email: 'user1#yahoo.com' })
CREATE (p:Person { name: 'user1' })
CREATE (p) - [:attribute] -> (m1)
CREATE (p) - [:attribute] -> (m2)
RETURN p
...
CREATE (m1:mail { email: 'user1000000#gmail.com' })
CREATE (m2:mail { email: 'user1000000#yahoo.com' })
CREATE (p:Person { name: 'user1000000' })
CREATE (p) - [:attribute] -> (m1)
CREATE (p) - [:attribute] -> (m2)
RETURN p
and indexed them as follows:
CREATE BTREE INDEX i1 FOR (n:Person) ON (n.name)
CREATE BTREE INDEX i2 FOR (n:mail) ON (n.email)
The speed is good: latency is 4ms and throughput is 1850 queries per second.
The problem with this is that the following query performs very badly.
MATCH (p:Person) - [:attribute] -> (m1:mail)
MATCH (p) - [:attribute] -> (m2:mail)
WHERE m1.email = 'user10#gmail.com' OR m2.email = 'user10#yahoo.com'
RETURN DISTINCT p.name
On my desktop, the latency is about 5s and the throughput is less than 1 per second.
Edit 2:
I modified the query as suggested by Charchit Kapoor below. Following is the query I used.
MATCH (p:Person) - [:attribute] -> (m:mail)
WHERE m.email IN ['user10#gmail.com', 'user10#yahoo.com']
RETURN DISTINCT p.name
This query has a latency of about 4ms and a throughput of about 2600 queries per second.
Your data model is not aligned with your query. Email is stored as a list property on the Person node, and you are searching within that list. Below is a script that changes your data model from the Person.email property to a relationship: (Person)-[:HAS_EMAIL]->(Email). The APOC procedure apoc.periodic.iterate divides your Person nodes into batches and processes them in parallel for efficiency. Ensure that you have APOC installed.
The script creates the (Person)-[:HAS_EMAIL]->(Email) relationships and removes the email property from each Person as it goes. You can change the batch size (10k per batch) to suit your data. You will also want a uniqueness constraint (which is backed by an index) on Email.email. I will leave it up to you on how to do it.
CALL apoc.periodic.iterate(
"MATCH (p:Person) RETURN p as person;",
"WITH person
UNWIND person.email as email
MERGE (e:Email {email: email})
MERGE (person)-[:HAS_EMAIL]->(e)
SET person.email = null;",
{batchSize:10000, parallel:true, retries:3});
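For the uniqueness constraint on Email mentioned above, a minimal sketch follows. It assumes Neo4j 4.4 or later (earlier 4.x releases use the ON ... ASSERT form shown in the comment), and the constraint name email_unique is made up for illustration. Creating it before running the migration lets the MERGE on Email use the backing index and guards against duplicate Email nodes when batches run in parallel.

```cypher
// Assumed syntax for Neo4j 4.4+; the name email_unique is hypothetical.
CREATE CONSTRAINT email_unique IF NOT EXISTS
FOR (e:Email) REQUIRE e.email IS UNIQUE;

// Equivalent form on earlier 4.x releases:
// CREATE CONSTRAINT email_unique IF NOT EXISTS
// ON (e:Email) ASSERT e.email IS UNIQUE;
```

If the parallel run reports deadlocks, setting parallel:false in the iterate configuration is a slower but safer fallback.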
After doing this and creating the index on Email.email, profiling shows that the BTREE index is being used:
PROFILE MATCH (p:Person) -[:HAS_EMAIL] -> (e:Email)
WHERE e.email = 'user10#gmail.com'
RETURN DISTINCT p.name
BTREE INDEX e:Email(email) WHERE
email = $autostring_0
Previously, the plan showed a NodeByLabelScan followed by a Filter on $autostring_0 IN p.email. Even if you create an index on a list property, it is not used for this kind of membership check.
Your second query can be structured differently: first find all the relevant emails, and then find the related users:
MATCH (m1:mail)
WHERE m1.email IN ['user10#gmail.com', 'user10#yahoo.com']
MATCH (p)-[:attribute]->(m1)
RETURN DISTINCT p.name
I have 3 Nodes:
:User which has a unique id property
:Palette which has a unique name property
:Color which has a unique hex property
When a user saves a palette I would like to:
create a new palette if a palette with this name does not exist, and add a :CREATED relationship from the :User to the :Palette
create a :SAVED relationship from the :User to the :Palette if one does not exist
Afterwards I would like to delete all :INCLUDES relationships that this :Palette has to :Color nodes inside the database (in order to create new ones afterwards).
This is my query (I replaced the variables with hardcoded strings so it's easier to execute):
MATCH (u:User)
WHERE u.id = '4f3d1904'
MERGE (p:Palette {name: 'Test'})
ON CREATE
SET p.name = "Test"
MERGE (u)-[cr:CREATED]->(p)
MERGE (u)-[sa:SAVED]->(p)
MATCH (p:Palette {name: 'Test'})-[in:INCLUDES]->()
DELETE in
When running the query I get the following error:
WITH is required between MERGE and MATCH (line 8, column 1 (offset: 181))
"MATCH (p:Palette {name: 'Test'})-[in:INCLUDES]->()"
^
But if I add a WITH I get the following:
MATCH (u:User)
WHERE u.id = '4f3d1904'
MERGE (p:Palette {name: 'Test'})
ON CREATE
SET p.name = "Test"
MERGE (u)-[cr:CREATED]->(p)
MERGE (u)-[sa:SAVED]->(p)
WITH
MATCH (p:Palette {name: 'Test'})-[in:INCLUDES]->()
DELETE in
Invalid input ')': expected whitespace or a relationship pattern (line 9, column 32 (offset: 217))
"MATCH (p:Palette {name: 'Test'})-[in:INCLUDES]->()"
^
What am I doing wrong?
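A reading clause such as MATCH cannot directly follow an updating clause such as MERGE; a WITH is required between them, and it must list the variables you want to carry forward (a bare WITH with nothing after it does not parse). Two consecutive MATCH clauses do not need one.
In your case you can reuse the p that you already have, like this: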
...
WITH p
MATCH (p)-[in:INCLUDES]->()
DELETE in
So you won't need to find it again. Without the WITH, it is like two different queries.
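Put together with the query from the question, the whole statement could look like the sketch below. It is untested against the asker's data. The relationship variable is renamed from in to r only to avoid any clash with the IN keyword, and the redundant ON CREATE SET is dropped because MERGE already sets name when it creates the node.

```cypher
MATCH (u:User)
WHERE u.id = '4f3d1904'
MERGE (p:Palette {name: 'Test'})
MERGE (u)-[:CREATED]->(p)
MERGE (u)-[:SAVED]->(p)
// Carry p across the boundary between the updating and reading clauses
WITH p
MATCH (p)-[r:INCLUDES]->()
DELETE r
```

If the palette has no :INCLUDES relationships yet, the final MATCH simply returns no rows and nothing is deleted; the earlier MERGE writes are unaffected.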
I tried the following, but it threw this error. I only want to create a new Person node if there isn't already a Person node in the database with exactly the same properties.
org.neo4j.driver.exceptions.ClientException: Invalid input 'R': expected
MERGE (n:Person{id: abc.id})
MERGE (m:Place{place:def.id})
MERGE (o:Thing{id:abcd.id})
WITH n,m,o
OPTIONAL MATCH (n) – [:present_at] -> x with n,m,o, collect (distinct x) as known_place
OPTIONAL MATCH (m) – [:is] -> y with n,m,o, collect (distinct y) as known_thing
FOREACH (a in ( CASE WHEN NOT m IN known_place THEN [1] ELSE [] END ) CREATE (n)-[:present_at] ->(m))
FOREACH (a in ( CASE WHEN NOT o IN known_thing THEN [1] ELSE [] END ) CREATE (m)-[:is] ->(o))
That error was caused by a missing | in each of your FOREACH clauses. For example, this would fix that syntax error:
FOREACH (a in ( CASE WHEN NOT m IN known_place THEN [1] ELSE [] END ) | CREATE (n)-[:present_at] ->(m))
FOREACH (a in ( CASE WHEN NOT o IN known_thing THEN [1] ELSE [] END ) | CREATE (m)-[:is] ->(o))
However, your query would still have numerous other syntax errors.
In fact, the entire query could be refactored to be simpler and more efficient:
WITH {id: 123} AS abc, {id: 234} as def, {id: 345} AS abcd
MERGE (n:Person{id: abc.id})
MERGE (m:Place{place: def.id})
MERGE (o:Thing{id: abcd.id})
FOREACH (a in ( CASE WHEN NOT EXISTS((n)-[:present_at]->(m)) THEN [1] ELSE [] END ) | CREATE (n)-[:present_at]->(m))
FOREACH (a in ( CASE WHEN NOT EXISTS((m)-[:is]->(o)) THEN [1] ELSE [] END ) | CREATE (m)-[:is]->(o))
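Since the goal is "create the relationship only if it is not already there", MERGE on the relationship pattern does the same job without the FOREACH wrapper. A sketch, using the same placeholder maps (abc, def, abcd) as the refactored query above:

```cypher
WITH {id: 123} AS abc, {id: 234} AS def, {id: 345} AS abcd
MERGE (n:Person {id: abc.id})
MERGE (m:Place {place: def.id})
MERGE (o:Thing {id: abcd.id})
// With both end nodes already bound, MERGE only creates the
// relationship when no matching one exists between them.
MERGE (n)-[:present_at]->(m)
MERGE (m)-[:is]->(o)
```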
I want to MERGE a node:
MERGE (a: Article {URL: event.URL})
If the node does not exist, I need to do this:
ON CREATE FOREACH( site_name in CASE WHEN event.site_name is not null then [1] ELSE [] END |
MERGE (w: Website { value: event.site_name})
MERGE (w)-[:PUBLISHED]->(a))
// all of the tag creation
FOREACH( tag in CASE WHEN event.tags is not NULL then event.tags else [] END |
Merge (t: Article_Tag {value: tag})
CREATE (a)-[: HAS_ARICLE_TAG {date:event_datetime}]->(t))
I believe that ON CREATE only works with SET, but as shown above, I need to execute multiple statements. Is there a way to create multiple nodes and relationships with an ON CREATE clause?
EDIT: I have tried ON CREATE FOREACH(ignoreme in case when event.article is not null then [1] else [] end |... multiple statements but this does not escape the SET problem.
One way to do this is to wrap the statements in a FOREACH clause:
MATCH (a: Article {URL: event.URL})
FOREACH(ignoreme in case when a is not null then [1] else [] end |... Statement here...)
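To run the extra statements only when the MERGE actually created the article, one commonly used variation is to set a temporary flag in ON CREATE and test it in the FOREACH condition. This is a sketch under assumptions, not something taken from the question: the justCreated property and the literal event map are hypothetical, and the date property on the tag relationship is left out for brevity.

```cypher
// Hypothetical input row; in the question, event comes from elsewhere.
WITH {URL: 'https://example.com/a', site_name: 'Example', tags: ['x', 'y']} AS event
MERGE (a:Article {URL: event.URL})
  ON CREATE SET a.justCreated = true
// Runs once, and only for a newly created article that has a site name
FOREACH (ignoreMe IN CASE WHEN a.justCreated AND event.site_name IS NOT NULL THEN [1] ELSE [] END |
  MERGE (w:Website {value: event.site_name})
  MERGE (w)-[:PUBLISHED]->(a)
)
// Iterates the tags only for a newly created article; a missing tag list becomes []
FOREACH (tag IN CASE WHEN a.justCreated THEN coalesce(event.tags, []) ELSE [] END |
  MERGE (t:Article_Tag {value: tag})
  CREATE (a)-[:HAS_ARICLE_TAG]->(t)
)
// Clean up the temporary flag
REMOVE a.justCreated
```

When the article already exists, a.justCreated is null, both CASE expressions fall through to the empty list, and neither FOREACH body runs.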
I want to create some new nodes in my database if some condition is satisfied:
MATCH (u:User)-[:has]->(a:Account)-[:initiated]->(l:Process) WHERE ID(l)=984
UNWIND [{mobile:123,email:'a#b.com'}, {mobile:456, email:'a1#b1.com'}] as x
OPTIONAL MATCH (u1:User) WHERE u1.mobile = x.mobile OR u1.email = x.email CASE u1 WHEN u1 IS NULL THEN CREATE (u)-[:pending]->(p:Pending {mobile: x.mobile, email: x.email})
ELSE CREATE (u1)-[:pending]->(p:Pending {mobile: x.mobile, email: x.email})
END
I want to check whether any user exists with the given mobile or email. If one exists, I want to create the node (p) attached to their node (u1); otherwise I want to create the node attached to my own node, i.e. (u).
Somehow CREATE is not working inside CASE.
Currently the only way to do conditional writes is using the FOREACH/CASE WHEN trick. Based on your condition, you create either a one-element list or an empty list and iterate over it using FOREACH, e.g.
...
// The loop variable is named ignoreMe so it does not shadow the UNWIND variable x used in the body
FOREACH(ignoreMe IN CASE WHEN u1 IS NULL THEN [1] ELSE [] END |
  CREATE (u)-[:pending]->(p:Pending {mobile: x.mobile, email: x.email}))
FOREACH(ignoreMe IN CASE WHEN u1 IS NOT NULL THEN [1] ELSE [] END |
  CREATE (u1)-[:pending]->(p:Pending {mobile: x.mobile, email: x.email}))
See http://www.markhneedham.com/blog/2014/06/17/neo4j-load-csv-handling-conditionals/ for more details.
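In this particular case both branches create the same node and differ only in which user it hangs off, so the choice can also be made with coalesce before a single CREATE. A sketch, untested, reusing the data from the question; owner is a name introduced here for illustration:

```cypher
MATCH (u:User)-[:has]->(a:Account)-[:initiated]->(l:Process)
WHERE id(l) = 984
UNWIND [{mobile: 123, email: 'a#b.com'}, {mobile: 456, email: 'a1#b1.com'}] AS x
OPTIONAL MATCH (u1:User)
WHERE u1.mobile = x.mobile OR u1.email = x.email
// coalesce returns u1 when a matching user was found, otherwise falls back to u
WITH coalesce(u1, u) AS owner, x
CREATE (owner)-[:pending]->(:Pending {mobile: x.mobile, email: x.email})
```

As with the FOREACH version, if several users match the same mobile or email, one Pending node is created per match.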