How to refactor the data in neo4j - neo4j

Say I have a database with order nodes like this:
Order{OrderId, Customer, Date, Quantity, Product}
I want to refactor these nodes in the database, using a Cypher query, so that they look like this:
(day)<-[:PLACED_ON]-(Order{OrderId, Quantity})-[:PLACED_BY]->(customer), (Order)-[:FOR_PRODUCT]->(product)
I understand that this can be done directly in Cypher, without loading all the nodes into my code and then making multiple Cypher calls to the database.
Could someone help me understand how to do such a refactoring without introducing duplicate customer, product and day nodes?
Regards
Kiran

Yes, you can manipulate a Neo4j database with cypher.
Guessing that your current Order node looks similar to:
CREATE (:ORDER {orderId:100,customer:'John',date:13546354,quantity:1,product:'pizza'})
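Because CREATE would add a new day, customer and product node for every order, use MERGE so that existing nodes are reused. You could write the following:
MATCH (o:ORDER)
MERGE (d:DAY {timestamp: o.date})
MERGE (c:CUSTOMER {name: o.customer})
MERGE (p:PRODUCT {name: o.product})
MERGE (o)-[:PLACED_ON]->(d)
MERGE (o)-[:PLACED_BY]->(c)
MERGE (o)-[:FOR_PRODUCT]->(p)
REMOVE o.product, o.customer, o.date
RETURN o as order, d as day, c as customer, p as product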
The query output would be:
Nodes created: 3
Relationships created: 3
Properties set: 6
Labels added: 3
Note that if you have a large dataset, migrating an entire database this way can be very time-consuming. You might want to try the PERIODIC COMMIT feature in the 2.1.0 milestone release.

Related

How do I self reference within a table in Neo4j?

I have a few tables loaded up in Neo4j. I have gone through some tutorials and come up with this cypher query.
MATCH (n:car_detail)
RETURN COUNT(DISTINCT n.model_year), n.model, n.maker_name
ORDER BY COUNT(DISTINCT n.model_year) desc
This query gave me all the cars that were continued or discontinued. The logic is that a count of one means discontinued and anything higher means continued.
My table car_detail has cars which were built in different years. I want to create relationships such as:
"Audi A4 2011" - (:CONTINUED) -> "Audi A4 2015" - (:CONTINUED) -> "Audi A4 2016"
So it sounds like you want to match to the model and make of the car, ordered by the model year ascending, and to create relationships between those nodes.
We can use APOC Procedures as a shortcut for creating the linked list through the ordered and collected nodes. You'll need to install APOC (the version matching your Neo4j version) to take advantage of this, as the pure Cypher approach is quite ugly.
The query would look something like this:
MATCH (n:car_detail)
WITH n
ORDER BY n.model_year
WITH collect(n) as cars, n.model as model, n.maker_name as maker
WHERE size(cars) > 1
CALL apoc.nodes.link(cars, 'CONTINUED')
RETURN cars
The key here is that after we order the nodes, we aggregate the nodes with respect to the model and maker, which act as your grouping key (when aggregating, the non-aggregation variables become the grouping key for the aggregation). This means your ordered cars are going to be grouped per make and model, so all that's left is to use APOC to create the relationships linking the nodes in the list.
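The same sort, group and link steps can be sketched outside the database. This is only an in-memory illustration in Python, not the APOC implementation; the function name link_by_model and the sample rows are hypothetical, while the property names come from the question.

```python
from collections import defaultdict

def link_by_model(cars):
    """Return (maker, model, from_year, to_year) edges between consecutive years."""
    groups = defaultdict(list)
    # Sort first, then group: each group's list stays ordered by model_year.
    for car in sorted(cars, key=lambda c: c["model_year"]):
        groups[(car["maker_name"], car["model"])].append(car)

    edges = []
    for (maker, model), group in groups.items():
        # A group of size 1 yields no pairs, mirroring WHERE size(cars) > 1.
        for earlier, later in zip(group, group[1:]):
            edges.append((maker, model, earlier["model_year"], later["model_year"]))
    return edges

# Hypothetical sample data using the property names from the question.
cars = [
    {"maker_name": "Audi", "model": "A4", "model_year": 2015},
    {"maker_name": "Audi", "model": "A4", "model_year": 2011},
    {"maker_name": "Audi", "model": "A4", "model_year": 2016},
    {"maker_name": "Audi", "model": "TT", "model_year": 2012},
]
edges = link_by_model(cars)
print(edges)
```

The A4 rows end up chained 2011 to 2015 to 2016, and the single TT row produces no edge.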
You can just find both cars with MATCH and then connect them, e.g.:
MATCH (c1:car_detail)
WHERE c1.model = 'Audi A4 2011'
MATCH (c2:car_detail)
WHERE c2.model = 'Audi A4 2015'
CREATE (c1)-[:CONTINUED]->(c2);
etc.

Import Edgelist from CSV Neo4J

I'm trying to build a graph database from an edge list and I'm fairly new to Neo4j, so I have run into this problem. First of all, the edge list I have looks like this:
geneId geneSymbol diseaseId diseaseName score
10 NAT2 C0005695 Bladder Neoplasm 0.245871429880008
10 NAT2 C0013182 Drug Allergy 0.202681755307501
100 ADA C0002170 Alopecia 0.2
100 ADA C0002880 Autoimmune hemolytic anemia 0.2
100 ADA C0004096 Asthma 0.21105290517153
I have a lot of relationships like that (165k) between genes and their associated diseases.
I want to build a bipartite network in which the nodes are genes or diseases, so I load the data like this:
LOAD CSV WITH HEADERS FROM "file:///path/curated_gene_disease_associations.tsv" as row FIELDTERMINATOR '\t'
MERGE (g:Gene{geneId:row.geneId})
ON CREATE SET g.geneSymbol = row.geneSymbol
MERGE (d:Disease{diseaseId:row.diseaseId})
ON CREATE SET d.diseaseName = row.diseaseName
After a while (far longer than it takes in R to load the nodes using igraph), it finishes and I have the nodes. I used MERGE because I don't want to repeat any gene or disease. The problem is that I don't know how to create the relationships. I've searched, and people always use something like:
MATCH (g:Gene {geneId: toInt(row.geneId)}), (d:Disease {diseaseId: toInt(row.geneId)})
CREATE (g)-[:RELATED_TO]->(d);
But when I run it, it says that there are no changes. I've seen the Neo4j tutorial, but its relationship examples don't work from edge lists, so maybe the problem is in how I merge the nodes so they don't repeat. I'd appreciate any help!
Looks like there might be two problems with your relationship query:
1) You're inserting (probably) as a string type (no toInt), and doing the MATCH query as an integer type (with toInt).
2) You're MATCHing the Disease node on row.geneId, not row.diseaseId.
Try the following modification:
MATCH (g:Gene {geneId: row.geneId}), (d:Disease {diseaseId: row.diseaseId})
CREATE (g)-[:RELATED_TO]->(d);
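Both problems can be reproduced without a database. The sketch below is a hypothetical in-memory stand-in (plain Python sets instead of Neo4j nodes, helper name count_matches invented for illustration); it assumes the IDs were stored as strings, which is how LOAD CSV delivers field values when no conversion is applied.

```python
# Hypothetical stand-in for the nodes created by MERGE without toInt:
# the IDs stay strings, exactly as they were read from the file.
rows = [
    {"geneId": "10", "diseaseId": "C0005695"},
    {"geneId": "100", "diseaseId": "C0002170"},
]
genes = {row["geneId"] for row in rows}
diseases = {row["diseaseId"] for row in rows}

def count_matches(gene_key, disease_key):
    """Count rows for which both lookups find a stored node."""
    return sum(
        1 for row in rows
        if gene_key(row) in genes and disease_key(row) in diseases
    )

# Problem 1: looking up an integer where a string was stored.
int_lookup = count_matches(lambda r: int(r["geneId"]), lambda r: r["diseaseId"])
# Problem 2: looking up the disease by the gene column.
wrong_column = count_matches(lambda r: r["geneId"], lambda r: r["geneId"])
# Corrected: same type, right column.
fixed = count_matches(lambda r: r["geneId"], lambda r: r["diseaseId"])
print(int_lookup, wrong_column, fixed)
```

Either mistake alone is enough to make every row miss, which matches the "no changes" result described in the question.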
@DanielKitchener's answer seems to address your main question.
With respect to the slowness of creating the nodes, you should create indexes (or uniqueness constraints, which automatically create indexes as well) on these label/property pairs:
:Gene(geneId)
:Disease(diseaseId)
For example, execute these 2 statements separately:
CREATE INDEX ON :Gene(geneId);
CREATE INDEX ON :Disease(diseaseId);
Once the DB has those indexes, your MERGE clauses should be much faster, as they would not have to scan through all existing Gene or Disease nodes to find possible matches.

Neo4j - return results from match results starting from specific node

Let's say I have nodes that are connected by a FRIEND relationship.
I want to query 2 of them at a time, so I use SKIP and LIMIT to do this.
However, if someone adds a FRIEND between calls, this messes up my results (since suddenly the whole list is pushed 1 index forward).
For example, let's say I had this list of friends (ordered by some parameter):
A B C D
I query the first time, so I get A B (skipped 0 and limited 2).
Then someone adds a friend named E, and the list is now E A B C D.
Now the second query will return B C (skipped 2 and limited 2). Notice that B is returned twice, because the skipping method is not aware of the changes made to the DB.
Is there a way to return 2 each time, starting from where the previous query ended? For example, if I knew that B was the last one returned, I could provide it to the query and it would return the NEXT 2, getting C D (which is correct) instead of B C.
I tried finding a solution and I read about START and indexes, but I am not sure how to do this.
Thanks for your time!
You could store a timestamp when the FRIEND relationship was created and order by that property.
When the FRIEND relationship is created, add a timestamp property:
MATCH (a:Person {name: "Bob"}), (b:Person {name: "Mike"})
CREATE (a)-[r:FRIEND]->(b)
SET r.created = timestamp()
Then when you are paginating through friends two at a time you can order by the created property:
MATCH (a:Person {name: "Bob"})-[r:FRIEND]->(friends)
RETURN friends
ORDER BY r.created
SKIP {page_number_times_page_size} LIMIT {page_size}
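You can parameterize this query with the page size (the number of friends to return) and the number of friends to skip based on which page you want. Because newly created relationships get later timestamps, they sort to the end of the list and do not shift the pages you have already read.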
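The question also asks whether the next page can start from the last item already returned. That is a different technique from SKIP/LIMIT, usually called keyset or cursor pagination. A minimal in-memory sketch, assuming each friend carries a unique, sortable key (the field name sort_key and the function next_page are hypothetical; r.created could serve as that key):

```python
def next_page(friends, page_size, after=None):
    """Return up to page_size friends whose key is strictly greater than the cursor."""
    ordered = sorted(friends, key=lambda f: f["sort_key"])
    if after is not None:
        # Filter by the last key seen instead of skipping a fixed count.
        ordered = [f for f in ordered if f["sort_key"] > after]
    return ordered[:page_size]

# The list from the question: A B C D, ordered by some parameter.
friends = [{"name": n, "sort_key": k} for k, n in enumerate("ABCD", start=1)]
first = next_page(friends, 2)

# Someone adds E, which sorts before A: the list is now E A B C D.
friends.append({"name": "E", "sort_key": 0})
second = next_page(friends, 2, after=first[-1]["sort_key"])

first_names = [f["name"] for f in first]
second_names = [f["name"] for f in second]
print(first_names, second_names)
```

The second call returns C and D rather than B and C, because it filters on the last key seen instead of counting rows. In Cypher the same idea would be a WHERE condition on the ordering property combined with ORDER BY and LIMIT, with no SKIP.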
Sorry if this is not exactly an answer to your question. On a previous project I had to modify a large amount of data. It wasn't possible to modify everything with one query, so I needed to split the work into batches. I started with SKIP and LIMIT, but in some cases it behaved unpredictably (not all of the data was modified). When I got tired of looking for the reason, I changed my approach. I was using Java to query the database, so I fetched all the ids that I needed to modify in a first query and then iterated over the stored ids.

Neo4j - Cypher statement to build relationships took near half of day to complete with error "Self-suppression not permitted"

Usually I build relationships between nodes while loading from CSV files. This time I used the following Cypher statement to build relationships between existing nodes. There are 39K Language nodes and 2M Description nodes.
MATCH (d:Description),(l:Language)
WHERE d.description_language = l.language_name
CREATE (d)-[r:HAS_LANGUAGE]->(l);
After a long run, the error I got was:
Self-suppression not permitted
I have created indexes on the properties used to build the relationship.
Indexes
...
ON :Description(woka_id) ONLINE
ON :Description(description_language) ONLINE
ON :Language(language_id) ONLINE (for uniqueness constraint)
ON :Language(language_name) ONLINE (for uniqueness constraint)
...
What am I doing wrong that makes the relationship creation take so long (more than 10 hours)?
You are dealing with a very large cartesian product at the filter step:
WHERE d.description_language = l.language_name
You could try to MATCH the Descriptions, group them by their description_language and CREATE the relationships from there:
MATCH (d:Description)
WITH d.description_language AS dl, collect(d) as all_d_for_lang
MATCH (l:Language {language_name: dl})
UNWIND all_d_for_lang AS d
CREATE (d)-[:HAS_LANGUAGE]->(l)
If you look at the PROFILE of this query you'll see there are fewer DB hits (limit the number of descriptions in the first MATCH for testing).
In general, I think the best way would be to use your CSV files to generate relationships when you create the nodes, i.e. do this application side, not on the database.
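To see why grouping helps, here is a rough application-side sketch that counts the work done by each strategy. It is an in-memory model only, not a measurement of Neo4j; the function names and the sample data are hypothetical, and the property names are taken from the question.

```python
from collections import defaultdict

def link_nested(descriptions, languages):
    """Compare every description with every language (cartesian product)."""
    comparisons, edges = 0, []
    for d in descriptions:
        for lang in languages:
            comparisons += 1
            if d["description_language"] == lang["language_name"]:
                edges.append((d["id"], lang["language_name"]))
    return edges, comparisons

def link_grouped(descriptions, languages):
    """Group descriptions by language, then do one lookup per distinct language."""
    by_lang = defaultdict(list)
    for d in descriptions:
        by_lang[d["description_language"]].append(d)
    # The dict plays the role of the index on :Language(language_name).
    index = {lang["language_name"]: lang for lang in languages}
    lookups, edges = 0, []
    for name, group in by_lang.items():
        lookups += 1
        if name in index:
            edges.extend((d["id"], name) for d in group)
    return edges, lookups

# Hypothetical sample: 3 languages, 6 descriptions alternating between two of them.
languages = [{"language_name": n} for n in ("en", "fr", "de")]
descriptions = [
    {"id": i, "description_language": ("en", "fr")[i % 2]} for i in range(6)
]
nested_edges, comparisons = link_nested(descriptions, languages)
grouped_edges, lookups = link_grouped(descriptions, languages)
print(comparisons, lookups, sorted(nested_edges) == sorted(grouped_edges))
```

Both strategies produce the same edges, but the nested version does one comparison per description and language pair, while the grouped version does one lookup per distinct language.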
Since you are creating a relationship from every Description node and there are 2M of them, I would select only the descriptions that are not yet linked and process them in smaller batches.
Something like...
MATCH (d:Description)
WHERE NOT (d)-[:HAS_LANGUAGE]->()
WITH d
LIMIT 200000
MATCH (l:Language {language_name: d.description_language})
CREATE (d)-[:HAS_LANGUAGE]->(l)
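A batched query like this has to be run repeatedly until nothing is left to link. The loop can be sketched as follows; this is an in-memory model with a hypothetical function name, and it assumes every description has a matching language (otherwise unmatched descriptions would be selected again on every pass).

```python
def link_in_batches(descriptions, language_names, batch_size):
    """Link up to batch_size unlinked descriptions per pass; return the pass count."""
    batches = 0
    while True:
        # Mirrors: WHERE NOT (d)-[:HAS_LANGUAGE]->() ... LIMIT batch_size
        batch = [d for d in descriptions if d.get("language") is None][:batch_size]
        if not batch:
            return batches
        for d in batch:
            name = d["description_language"]
            if name not in language_names:
                # Assumption violated: this row could never be linked.
                raise ValueError(f"no language named {name!r}")
            d["language"] = name
        batches += 1

# Hypothetical sample: 5 descriptions, batches of 2.
descriptions = [{"id": i, "description_language": "en"} for i in range(5)]
batches = link_in_batches(descriptions, {"en", "fr"}, batch_size=2)
print(batches)
```

With 5 descriptions and a batch size of 2, three passes are needed, and each pass only touches rows that are still unlinked.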

Get nodes that don't have certain relationship (cypher/neo4j)

I have the following two node types:
c:City {name: 'blah'}
s:Course {title: 'whatever', city: 'New York'}
Looking to create this:
(s)-[:offered_in]->(c)
I'm trying to get all courses that are NOT tied to cities and create the relationship to the city (the city gets created if it doesn't exist). However, my dataset is about 5 million nodes and any query I make times out (unless I do it in increments of 10k).
... does anybody have any advice?
EDIT:
Here is the query for jobs I'm running now. It has to be done in 10k chunks (out of millions) because it takes a few minutes as it is. It creates the city if it doesn't exist:
match (j:Job)
where not has(j.merged) and has(j.city)
WITH j
LIMIT 10000
MERGE (c:City {name: j.city})
WITH j, c
MERGE (j)-[:in]->(c)
SET j.merged = 1
return count(j)
(For now I don't know of a good way to filter out the ones already matched, so I'm tagging them with a custom "merged" attribute that I already have an index on.)
500000 is a fair few nodes, and on your other question you suggested 90% were without the relationship that you want to create here, so it is going to take a bit of time. Without more knowledge of your system (spec, Neo4j setup, programming environment) and of when you are running this (on old data or on insert), this is just a best guess at a tidier solution:
MATCH (j:Job)
WHERE NOT (j)-[:IN]->() AND HAS(j.city)
MERGE (c:City {name: j.city})
MERGE (j)-[:IN]->(c)
return count(j)
Obviously you can add your limits back as required.
