Neo4j relate nodes by same property - neo4j

I have a Neo4J DB up and running with currently 2 Labels: Company and Person.
Each Company Node has a Property called old_id.
Each Person Node has a Property called company.
Now I want to establish a relation between each Company and each Person where old_id and company share the same value.
Already tried suggestions from: Find Nodes with the same properties in Neo4J and
Find Nodes with the same properties in Neo4J
following the first link I tried:
MATCH (p:Person)
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
resulting in no change at all and as suggested by the second link I tried:
START
p=node(*), c=node(*)
WHERE
HAS(p.company) AND HAS(c.old_id) AND p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN p, c;
resulting in a runtime >36 hours. Now I had to abort the command without knowing if it would eventually have worked. Therefor I'd like to ask if its theoretically correct and I'm just impatient (the dataset is quite big tbh). Or if theres a more efficient way in doing it.

This simple console shows that your original query works as expected, assuming:
Your stated data model is correct
Your data actually has Person and Company nodes with matching company and old_id values, respectively.
Note that, in order to match, the values must be of the same type (e.g., both are strings, or both are integers, etc.).
So, check that #1 and #2 are true.

Depending on the size of your dataset you want to page it
create constraint on (c:Company) assert c.old_id is unique;
MATCH (p:Person)
WITH p SKIP 100000 LIMIT 100000
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN count(*);
Just increase the skip value from zero to your total number of people in 100k steps.

Related

depth in nodes of the same label in custom query

I have in my database Company with:
#Relationship(type = "COLLECTS_FROM")
private Set<Company> subCompanies
What I'm trying to achieve is to create two queries:
get the company with limited depth 2 for nodes of the same type (so company which collects from different companies, which collects from different companies)
get the company with full depth for nodes of the same type (similar to previous, but just all the way down without any limitations.
I had a few attempts to solve this and here are those:
first approach was to use findById(ID,2) and findById(ID,999999), but this query has a huge performance drop and I ran into StackOverflow exception, that's why I'm trying to create my own methods similar to what I found here.
second approach was to use custom query (which were actually working in case of different node types):
#Query("Match (company:Company {name:$name})-[relation:COLLECTS_FROM*0..2]->(secondCompany:Company) RETURN company,relation,secondCompany")
Company customerWithCustomDepth(String name);
Unfortunately while using the same node types I ran into the IncorrectResultSizeDataAccessException: Incorrect result size: expected at most 1 and when I changed return type to the list 3 records were returned (actually the first one of them is what I would like to get, but it seems like specifying secondCompany in the return statement populates nodes that I don't want to have in my return type.
on the third approach I did similar to the second, but it also returned 3 rows (the first one is what I wanted).
#Query("Match p = (company:Company {name:$name})-[relation:COLLECTS_FROM*0..2]->(secondCompany:Company) RETURN p")
and fourth one once again 3 companies, instead of a single one (the first one is what I wanted)
#Query("MATCH (company:Company {name:$name} ) WITH company MATCH p=(company)-[COLLECTS_FROM*0..2]->(m) RETURN p")
And that's basically it. I feel like I'm close, but a little detail is missing. Maybe instead of trying to fix the query, I should somehow limit the result to get the first row from the Query? Any help will be really appreciated.
update after receiving #cybersam answer:
The general purpose of this query is to create a tree graph, that's why I would like to get a company with their sub-companies inside (and companies in the sub-companies), not only the parent one.
I could use second, or third approach and just grab the first element on the server side:
result.stream()
.findFirst()
.orElseThrow();
but I believe it should be able to do it on the service side.
To return the Company with the specified name if it is connected by a path of 1 to 2 COLLECTS_FROM relationships to descendant Company nodes:
#Query("MATCH p=(c:Company {name:$name})-[:COLLECTS_FROM*..2]->(:Company) WHERE ALL(n IN NODES(p) WHERE 'Company' IN LABELS(n)) RETURN c")
Company customerWithCustomDepth(String name);
I assume that you do not want to get a result if there are no such descendants, which is why the variable-length relationship pattern does not use *0..2.
Note: If you want to test for paths of exactly length 2, change ..2 to just 2.
The corresponding query for paths of any length can be obtained by just changing ..2 to ... However, variable-length relationship pattern with no upper bound are not recommended, since they can take "forever" to run or run out of memory.

Cypher pattern for getting self related nodes

Given that I'm very new to Neo4j. I have a schema which looks like the below image:
Here Has nodes are different for example Passport, Merchant, Driving License, etc. and also these nodes are describing the customer node (looking for future scope of filtering customers based on these nodes).
SIMILAR is a self-relation meaning there exists a customer with ID:1 is related to another customer with ID:2 with a score of 2800.
I have the following questions:
Is this a good schema given the condition of the future scope I mentioned above, or getting all the properties in a single customer node is viable? (Different nodes may have array of items as well, for example: ()-[:HAS]->(Phone) having {active: "+91-1231241", historic_phone_numbers: ["+91-121213", "+91-1231421"]})
I want to get the customer along with describing nodes in relation to other customers. For that, I tried the below query (w/o number of relation more than 1):
// With number_of_relation > 1
MATCH (searched:Customer)-[r:SIMILAR]->(matched:Customer)
WHERE r.score > 2700
WITH searched, COLLECT(matched.customer_id) AS MatchedList, count(r) as cnt
WHERE cnt > 1
UNWIND MatchedList AS matchedCustomer
MATCH (person:Customer {customer_id: matchedCustomer})-[:HAS|:LIVES_IN|:IS_EMPLOYED_BY]->(related)
RETURN searched, person, related
Result what I got is below, notice one customer node not having its describing nodes:
// without number_of_relation > 1
// second attempt - for a sample customer_id
MATCH (matched)<-[r:SIMILAR]-(c)-[:HAS|:LIVES_IN|:IS_EMPLOYED_BY]->(b)
WHERE size(keys(b)) > 0
AND c.customer_id = "1b093559-a39b-4f95-889b-a215cac698dc"
AND r.score > 2700
RETURN b AS props, c AS src_cust, r AS relation, matched
Result I got are below, notice related nodes are not having their describing nodes:
If I had two describing nodes with some property (some may have a list) upon which I wanted to query and build the expected graph specified in point 2 above, how can I do that?
I want the database to find a similar customer given the describing nodes. Example: A customer {name: "Dave"} has phone {active_number: "+91-12345"} is similar to customer {name: "Mike"} has phone {active_number: "+91-12345"}. How can get started with this?
If something is unclear, please ask. I can explain with examples.
[EDITED]
Yes, the schema seems fine, except that you should not use the same HAS relationship type between different node label pairs.
The main problem with your first query is that its top MATCH clause uses a directional relationship pattern, ()-->(), which does not allow all Customer nodes to have a chance to be the searched node (because some nodes may only be at the tail end of SIMILAR relationships). This tweaked query should work better:
MATCH (searched:Customer)-[r:SIMILAR]-(matched:Customer)
WHERE r.score > 2700
WITH searched, COLLECT(matched) AS matchedList
WHERE SIZE(matchedList) > 1
UNWIND matchedList AS person
MATCH (person)-[:HAS|LIVES_IN|IS_EMPLOYED_BY]->(pDesc)
WITH searched, person, COLLECT(pDesc) AS personDescribers
MATCH (searched)-[:HAS|LIVES_IN|IS_EMPLOYED_BY]->(sDesc)
RETURN searched, person, personDescribers, COLLECT(sDesc) AS searchedDescribers
It's not clear what you want are trying to do.
To get all Customers who have the same phone number:
MATCH (c:Customer)-[:HAS_PHONE]-(p:Phone)
WHERE p.activeNumber = '+91-12345'
WITH p.activeNumber AS phoneNumber, COLLECT(c) AS customers
WHERE SIZE(customers) > 1
RETURN phoneNumber, customers

querying a DB for an unknown element with a given uuid

Lot of years ago, I discussed with some neo4j engineers about the ability to query an unknown object given it's uuid.
At that time, the answer was that there was no general db index in neo4j.
Now, I have the same problem to solve:
each node I create has an unique id (uuid in the form <nx:u-<uuid>-?v=n> where ns is the namespace, uuid is a unique uuid and v=n is the version number of the element.
I'd like to have the ability to run the following cypher query:
match (n) where n.about = 'ki:u-SSD-v5.0?v=2' return n;
which actually return nothing.
The following query
match (n:'mm:ontology') where n.about = 'ki:u-SSD-v5.0?v=2' return n;
returns what I need, despite the fact that at query time I don't know the element type.
Can anyone help on this?
Paolo
Have you considered adding a achema index to every node in the database for the about attribute?
For instance
Add a global label to all nodes in the graph (e.g. Node) that do not already have it. If your graph is overly large and/or heap overly small you will need to batch this operation. Something along the lines of the following...
MATCH (n)
WHERE NOT n:Node
WITH n
LIMIT 100000
SET n:Node
After the label is added then create an index on the about attribute for your new global label (e.g. Node). These steps can be performed interchangeably as well.
CREATE CONSTRAINT ON (node:Node) assert node.about IS UNIQUE
Then querying with something like the following
MATCH (n:Node)
WHERE n.about = 'ki:u-SSD-v5.0?v=2'
RETURN n;
will return the node you are seeking in a performant manner.

Neo4j / Cypher: Returning sum of value in relationship between nodes within the node itself

There are two node types, Account and Transfer. A Transfer signifies movement of funds between Account nodes. Transfer nodes may have any number of input and output nodes. For example, three Accounts could each send $40 ($120 combined) to sixteen other Accounts in any way they please and it would work.
The Transfer object, as is, does not have the sum of the funds sent or received - those are only stored in the relationships themselves. I'd like to calculate this in the cypher query and return it as part of the the returned Transfer object, not separately. (Similar to a SQL JOIN)
I'm rather new to Neo4j + Cypher; So far, the query I've got is this:
MATCH (tf:Transfer {id:'some_id'})
MATCH (tf)<-[in:IN_TO]-(in_account:Account)
MATCH (tf)-[out:OUT_TO]->(out_account:Account)
RETURN tf,in_account,in,out_account,out, sum(in.value) as sum_in, sum(out.value) as sum_out
If I managed this database, I'd just precalculate the sums and store it in the Transfer properties - but that's not an option at this time.
tl;dr: I'd like to store sum_in and sum_out in the returned tf object.
Tore Eschliman's answer is very insightful, especially on the properties of aggregated aliases.
I came up with a more hackish solution, that could work in this case.
Example data set:
CREATE
(a1:Account),
(a2:Account),
(a3:Account),
(tf:Transfer),
(a1)-[:IN_TO {value: 110}]->(tf),
(a2)-[:IN_TO {value: 230}]->(tf),
(tf)-[:OUT_TO {value: 450}]->(a3)
Query:
MATCH (in_account:Account)-[in:IN_TO]->(tf:Transfer)
WITH tf, SUM(in.value) AS sum_in
SET tf.sum_in = sum_in
RETURN tf
UNION
MATCH (tf:Transfer)-[out:OUT_TO]->(out_account:Account)
WITH tf, SUM(out.value) AS sum_out
SET tf.sum_out = sum_out
RETURN tf
Results:
╒═══════════════════════════╕
│tf │
╞═══════════════════════════╡
│{sum_in: 340, sum_out: 450}│
└───────────────────────────┘
Note that UNION performs a set union (as opposed to UNION ALL, which performs a multiset/bag union), hence we will not have duplicates in the results.
Update: as Tore Eschliman pointed out in the comments, this solution will modify the database. As a workaround, you can collect the results and abort the transaction afterwards.
When you use an aggregation like SUM, you have to leave the aggregated aliases out of your result row, or you'll end up with single-row sums. This should help you get something closer to what you want, including a workaround for your dynamic property assignment:
CREATE (temp)
WITH temp
MATCH (tf:Transfer {id:'some_id'})
MATCH (tf)<-[in:IN_TO]-(in_account:Account)
MATCH (tf)-[out:OUT_TO]->(out_account:Account)
SET temp += PROPERTIES(tf)
WITH temp, SUM(in.value) AS sum_in, SUM(out.value) AS sum_out, COLLECT(in_account) AS in_accounts, COLLECT(out_account) AS out_accounts
SET temp.sum_in = sum_in
SET temp.sum_out = sum_out
WITH temp, PROPERTIES(temp) AS props, in_accounts, out_accounts
DELETE temp
RETURN props, in_accounts, out_accounts
You are creating a dummy node to hold properties, because that's the only way to assign dynamic properties to existing Maps or Map-alikes, but the node won't ever be committed to the graph. This query should return a Map of the :Transfer's properties, with the in- and out-sums included, plus lists of the in- and out-accounts in case you need to do any additional work on them.

Neo4j indexing for large number of nodes

I am learning the basics of neo4j and I am looking at the following example with credit card fraud https://linkurio.us/stolen-credit-cards-and-fraud-detection-with-neo4j. Cypher query that finds stores where all compromised user shopped is
MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE r.status = “Disputed”
MATCH victim-[t:HAS_BOUGHT_AT]->(othermerchants)
WHERE t.status = “Undisputed” AND t.time < r.time
WITH victim, othermerchants, t ORDER BY t.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT t) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DESC
However, when the number of users increase (let's say to millions of users), this query may become slow since the initial query will have to traverse through all nodes labeled person. Is it possible to speed up the query by asigning properties to nodes instead of transactions? I tried to remove "status" property from relationships and add it to nodes (users, not merchants). However, when I run query with constraint WHERE victim.status="Disputed" query doesn't return anything. So, in my case person has one additional property 'status'. I assume I did a lot of things wrong, but would appreciate comments. For example
MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE victim.status = “Disputed”
returns the correct number of disputed transactions. The same holds for separately quering number of undisputed transactions. However, when merged, they yield an empty set.
If I made a mistake in my approach, how can I speed up queries for large number of nodes (avoid traversing all nodes in the first step). I will be working with a data set with similar properties, but will have around 100 million users, so I would like to index users on additional properties.
[Edited]
Moving the status property from the relationship to the person node does not seem to be the right approach, since I presume the same person can be a customer of multiple merchants.
Instead, you can reify the relationship as a node (let's label it purchase), as in:
(:person)-[:HAS_PURCHASE]->(:purchase)-[:BOUGHT_AT]->(merchant)
The purchase nodes can have the status property. You just have to create the index:
CREATE INDEX ON :purchase(status)
Also, you can put the time property in the new purchase nodes.
With the above, your query would become:
MATCH (victim:person)-[:HAS_PURCHASE]->(pd:purchase)-[:BOUGHT_AT]->(merchant)
WHERE pd.status = “Disputed”
MATCH victim-[:HAS_PURCHASE]->(pu:purchase)-[:BOUGHT_AT]->(othermerchants)
WHERE pu.status = “Undisputed” AND pu.time < pd.time
WITH victim, othermerchants, pu ORDER BY pu.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT pu) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DES

Resources