I have the following two node types:
c:City {name: 'blah'}
s:Course {title: 'whatever', city: 'New York'}
Looking to create this:
(s)-[:offered_in]->(c)
I'm trying to get all courses that are NOT tied to cities and create the relationship to the city (creating the city if it doesn't exist). However, my dataset is about 5 million nodes, and any query I run times out (unless I work in increments of 10k).
Does anybody have any advice?
EDIT:
Here is the query for jobs that I'm running now. It has to be done in 10k chunks (out of millions) because it takes a few minutes as it is, and it creates the city if it doesn't exist:
match (j:Job)
where not has(j.merged) and has(j.city)
WITH j
LIMIT 10000
MERGE (c:City {name: j.city})
WITH j, c
MERGE (j)-[:in]->(c)
SET j.merged = 1
return count(j)
(For now I don't know of a good way to filter out the ones already matched, so I'm trying to do it by tagging them with a custom "merged" attribute that I already have an index on.)
Five million is a fair few nodes, and on your other question you suggested 90% were without the relationship that you want to create here, so it is going to take a bit of time. Without more knowledge of your system (spec, Neo4j setup, programming environment) and of when you are running this (on old data or on insert), this is just a best guess at a tidier solution:
MATCH (j:Job)
WHERE NOT (j)-[:IN]->() AND HAS(j.city)
MERGE (c:City {name: j.city})
MERGE (j)-[:IN]->(c)
return count(j)
Obviously you can add your limits back as required.
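For instance, adding the limit back with the question's 10k batch size might look like this (just a sketch of the same query, rerun until it creates nothing):
MATCH (j:Job)
WHERE NOT (j)-[:IN]->() AND HAS(j.city)
WITH j
LIMIT 10000
MERGE (c:City {name: j.city})
MERGE (j)-[:IN]->(c)
RETURN count(j)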
I'm trying to make a graph database from an edge list, and I'm fairly new to Neo4j, so I have this problem. First of all, the edge list I have looks like this:
geneId geneSymbol diseaseId diseaseName score
10 NAT2 C0005695 Bladder Neoplasm 0.245871429880008
10 NAT2 C0013182 Drug Allergy 0.202681755307501
100 ADA C0002170 Alopecia 0.2
100 ADA C0002880 Autoimmune hemolytic anemia 0.2
100 ADA C0004096 Asthma 0.21105290517153
I have a lot of relationships like that (165k) between genes and their associated diseases.
I want to make a bipartite network in which nodes are genes or diseases, so I load the data like this:
LOAD CSV WITH HEADERS FROM "file:///path/curated_gene_disease_associations.tsv" as row FIELDTERMINATOR '\t'
MERGE (g:Gene{geneId:row.geneId})
ON CREATE SET g.geneSymbol = row.geneSymbol
MERGE (d:Disease{diseaseId:row.diseaseId})
ON CREATE SET d.diseaseName = row.diseaseName
After a while (way longer than it takes in R to load the nodes using igraph), it's done and I have the nodes. I used MERGE because I don't want to duplicate genes/diseases. The problem is that I don't know how to create the relationships; the examples I've found always use something like:
MATCH (g:Gene {geneId: toInt(row.geneId)}), (d:Disease {diseaseId: toInt(row.geneId)})
CREATE (g)-[:RELATED_TO]->(d);
But when I run it, it says that there are no changes. I've seen the Neo4j tutorial, but they don't work with edge lists when creating relationships, so maybe the problem is in how I merged the nodes to avoid duplicates. I'd appreciate any help!
Looks like there might be two problems with your relationship query:
1) You're (probably) inserting geneId as a string (no toInt), but doing the MATCH as an integer (with toInt), so the values never match.
2) You're MATCHing the Disease node on row.geneId, not row.diseaseId.
Try the following modification:
MATCH (g:Gene {geneId: row.geneId}), (d:Disease {diseaseId: row.diseaseId})
CREATE (g)-[:RELATED_TO]->(d);
@DanielKitchener's answer seems to address your main question.
With respect to the slowness of creating the nodes, you should create indexes (or uniqueness constraints, which automatically create indexes as well) on these label/property pairs:
:Gene(geneId)
:Disease(diseaseId)
For example, execute these 2 statements separately:
CREATE INDEX ON :Gene(geneId);
CREATE INDEX ON :Disease(diseaseId);
Once the DB has those indexes, your MERGE clauses should be much faster, as they would not have to scan through all existing Gene or Disease nodes to find possible matches.
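If those ids should also be unique, the constraint form would be the following sketch (it creates the backing indexes automatically):
CREATE CONSTRAINT ON (g:Gene) ASSERT g.geneId IS UNIQUE;
CREATE CONSTRAINT ON (d:Disease) ASSERT d.diseaseId IS UNIQUE;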
I am experimenting with Neo4j using a simple dataset of Locations. A location can have a relation to another location:
a:Location - [rel] - b:Location
I already have the locations in the database (roughly 700,000+ Location entries).
Now I wanted to add the relation data (170M Edges), but I wanted to experiment with the import logic with a smaller set first, so I basically picked 2 nodes that are in the set and tried to create a relationship as follows.
MERGE p =(a:Location {locationid: 3616})-[w:WikiLink]->(b:Location {locationid: 467501})
RETURN p;
and I also tried the approach straight from the docs:
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r
I tried a directional merge, an undirectional merge, etc. I basically tried multiple variants of the above queries, and the result is that they run forever, seeming not to complete even after 15 minutes. Which is very odd.
Indexes
ON :Location(locationid) ONLINE (for uniqueness constraint)
Constraints
ON (location:Location) ASSERT location.locationid IS UNIQUE
This is what I am currently using:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///edgelist.csv' AS line WITH line
MATCH (a:Location {locationid: toInt(line.locationidone)}), (b:Location {locationid: toInt(line.locationidtwo)})
MERGE (a)-[w:WikiLink {weight: toFloat(line.edgeweight)}]-(b)
RETURN COUNT(w);
If you look at the terminal output below, you can see that Neo4j reports a 258 ms query execution time; the real time, however, is somewhat above that. This query already takes a few seconds too long in my opinion (the machine this runs on has 48 GB RAM, 16 cores and is relatively new).
I am currently running this query with LIMIT 1000 (before it was LIMIT 1), but the script has already been running for a few minutes. I wonder if I have to switch from MERGE to CREATE. The problem is, I cannot understand the call graph that EXPLAIN gives me in order to determine the bottleneck.
time /usr/local/neo4j/bin/neo4j-shell -file import-relations.cql
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| p |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [Node[758609]{title:"Tehran",locationid:3616,locationlabel:"NIL"},:WikiLink[9422418]{weight:1.2282325516616477E-7},Node[917147]{title:"Khorugh",locationid:467501,locationlabel:"city"}] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
Relationships created: 1
Properties set: 1
258 ms
real 0m1.417s
user 0m1.497s
sys 0m0.158s
If you haven't already, create the uniqueness constraint:
CREATE CONSTRAINT ON (loc:Location) ASSERT loc.locationid IS UNIQUE;
Then find both nodes, and create the relationship.
MATCH (a:Location {locationid: 3616}),(b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
or if the locations don't exist yet:
MERGE (a:Location {locationid: 3616})
MERGE (b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
You should also use parameters if you do that from a program.
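For instance, a parameterized version might look like this sketch (the parameter names fromId and toId are made up):
MATCH (a:Location {locationid: {fromId}}),(b:Location {locationid: {toId}})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;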
Have you indexed the Location nodes on locationid?
CREATE INDEX ON :Location(locationid)
I had a similar problem adding edges to a graph, and indexing the nodes made the linking run over 150x faster.
If the nodes aren't indexed, Neo4j will do a serial scan for the two nodes to link together.
USING PERIODIC COMMIT <value>:
Specifies the number of records (rows) to be committed per transaction. Since you have plenty of RAM, it is good to use a value greater than 100000. This will reduce the number of transactions committed and might further reduce the overall time.
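Applied to the query from the question, that might look like this (a sketch; 100000 is just an example value):
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM 'file:///edgelist.csv' AS line
MATCH (a:Location {locationid: toInt(line.locationidone)}), (b:Location {locationid: toInt(line.locationidtwo)})
MERGE (a)-[w:WikiLink {weight: toFloat(line.edgeweight)}]-(b);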
Let's say I have nodes that are connected by a FRIEND relationship.
I want to query 2 of them each time, so I use SKIP and LIMIT for pagination.
However, if someone adds a FRIEND between calls, this messes up my results (since suddenly the whole list is pushed 1 index forward).
For example, let's say I had this list of friends (ordered by some parameter):
A B C D
I query the first time, so I get A B (skipped 0 and limited 2).
Then someone adds a friend named E; the list is now E A B C D.
Now the second query will return B C (skipped 2 and limited 2). Notice that B is returned twice, because the skipping method is not aware of the changes in the DB.
Is there a way to return 2 each time, continuing from the previous query? For example, if I knew that B was the last result of the previous query, I could provide it to the query and it would fetch the 2 NEXT, getting C D (which is correct) instead of B C.
I tried to find a solution and read about START and indexes, but I am not sure how to apply them here.
Thanks for your time!
You could store a timestamp when the FRIEND relationship was created and order by that property.
When the FRIEND relationship is created, add a timestamp property:
MATCH (a:Person {name: "Bob"}), (b:Person {name: "Mike"})
CREATE (a)-[r:FRIEND]->(b)
SET r.created = timestamp()
Then when you are paginating through friends two at a time you can order by the created property:
MATCH (a:Person {name: "Bob"})-[r:FRIEND]->(friends)
RETURN friends
ORDER BY r.created
SKIP {page_number_times_page_size} LIMIT {page_size}
You can parameterize this query with the page size (the number of friends to return) and the number of friends to skip based on which page you want.
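Alternatively, to avoid the SKIP problem entirely, you could key each page off the last created value you saw; a sketch (the {last_seen_created} parameter is made up):
MATCH (a:Person {name: "Bob"})-[r:FRIEND]->(friends)
WHERE r.created > {last_seen_created}
RETURN friends
ORDER BY r.created
LIMIT {page_size}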
Sorry if it's not exactly an answer to your question. On a previous project I had experience modifying big data sets. It wasn't possible to modify everything with one query, so I needed to split the work into batches. I started with SKIP/LIMIT, but in some cases it behaved unpredictably (it did not modify all the data). When I got tired of looking for the reason, I changed my approach. I was using Java to query the database, so I fetched all the ids that needed modification in a first query, and then iterated over the stored ids.
I have a Neo4J DB up and running with currently 2 Labels: Company and Person.
Each Company Node has a Property called old_id.
Each Person Node has a Property called company.
Now I want to establish a relation between each Company and each Person where old_id and company share the same value.
I have already tried the suggestions from two similar questions: Find Nodes with the same properties in Neo4J and
Find Nodes with the same properties in Neo4J.
Following the first link, I tried:
MATCH (p:Person)
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
which resulted in no change at all. As suggested by the second link, I tried:
START
p=node(*), c=node(*)
WHERE
HAS(p.company) AND HAS(c.old_id) AND p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN p, c;
resulting in a runtime of more than 36 hours. At that point I had to abort the command without knowing whether it would eventually have worked. Therefore I'd like to ask whether it is theoretically correct and I am just impatient (the dataset is quite big, to be honest), or whether there is a more efficient way of doing it.
This simple console shows that your original query works as expected, assuming:
Your stated data model is correct
Your data actually has Person and Company nodes with matching company and old_id values, respectively.
Note that, in order to match, the values must be of the same type (e.g., both are strings, or both are integers, etc.).
So, check that #1 and #2 are true.
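As a quick diagnostic sketch, you can sample a few values from each side and compare them (and their types) by eye:
MATCH (p:Person) WHERE HAS(p.company) RETURN p.company LIMIT 5;
MATCH (c:Company) WHERE HAS(c.old_id) RETURN c.old_id LIMIT 5;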
Depending on the size of your dataset, you may want to page through it:
CREATE CONSTRAINT ON (c:Company) ASSERT c.old_id IS UNIQUE;
MATCH (p:Person)
WITH p SKIP 100000 LIMIT 100000
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN count(*);
Just increase the skip value from zero to your total number of people in 100k steps.
Usually I build relationships between nodes while loading from CSV files. Here is the Cypher statement I used this time to build relationships between nodes. There are 39K Language nodes and 2M Description nodes.
MATCH (d:Description),(l:Language)
WHERE d.description_language = l.language_name
CREATE (d)-[r:HAS_LANGUAGE]->(l);
After a long run, the error I got is:
Self-suppression not permitted
I have created indexes on the properties used in the relationship creation.
Indexes
...
ON :Description(woka_id) ONLINE
ON :Description(description_language) ONLINE
ON :Language(language_id) ONLINE (for uniqueness constraint)
ON :Language(language_name) ONLINE (for uniqueness constraint)
...
What am I doing wrong here that causes the relationship creation to take so long (more than 10 hours)?
You are dealing with a very large Cartesian product at the filter step:
WHERE d.description_language = l.language_name
You could try to MATCH the Descriptions, group them by their description_language and CREATE the relationships from there:
MATCH (d:Description)
WITH d.description_language AS dl, collect(d) as all_d_for_lang
MATCH (l:Language {language_name: dl})
UNWIND all_d_for_lang AS d
CREATE (l)-[:HAS_LANGUAGE]->(d)
If you look at the PROFILE of this query you'll see there are fewer DB hits (limit the number of descriptions in the first MATCH for testing, as in the sketch below).
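For instance, a capped test run might look like this (a sketch; the LIMIT value is arbitrary):
PROFILE
MATCH (d:Description)
WITH d LIMIT 10000
WITH d.description_language AS dl, collect(d) AS all_d_for_lang
MATCH (l:Language {language_name: dl})
UNWIND all_d_for_lang AS d
CREATE (l)-[:HAS_LANGUAGE]->(d)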
In general, I think the best way would be to use your CSV files to generate the relationships at the same time as you create the nodes, i.e. drive this from the import side rather than fixing it up afterwards in the database.
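A minimal sketch of that, assuming a CSV with one row per description and a language column (the file name and header names here are made up):
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///descriptions.csv' AS row
// the 'language' column name is an assumption
MERGE (l:Language {language_name: row.language})
CREATE (d:Description {description_language: row.language})
CREATE (d)-[:HAS_LANGUAGE]->(l);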
Since you are creating relationships from every Description node and there are 2M of them, I would just grab the descriptions that are not yet matched and process them in smaller batches.
Something like...
MATCH (d:Description)
WHERE NOT (d)-[:HAS_LANGUAGE]->()
WITH d
LIMIT 200000
MATCH (l:Language {language_name: d.description_language})
CREATE (d)-[:HAS_LANGUAGE]->(l)
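To know when the batches are done, you can count what is left and rerun the statement until that reaches zero (a small check, not part of the original answer):
MATCH (d:Description)
WHERE NOT (d)-[:HAS_LANGUAGE]->()
RETURN count(d);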