Neo4J: How can I repeat my query on all resulting nodes?

I've started working on a project for my Bachelor's thesis, and I'm looking for some help with Cypher since I haven't had any exposure to it yet!
I've got the BTC Blockchain as my DB and now I want to use the Multi Input Clustering Heuristic to identify all addresses that belong to a person. This means that I want to identify all Transactions from a BTC Wallet that have more than one input address. Once I have the transactions identified, I am looking for the wallets. This is what the following query does:
MATCH(a:Address{address:"3QQdfAaPhP1YqLYMBS59BqWjcpXjXVP1wi"})-[:SENDS]-(tx)-[:SENDS]-(a2)
RETURN a2
Now that I have these wallets, I want to repeat the exact same process on them, and on their resulting wallets, and so on. So I need a recursive query that returns all wallets that were used as input wallets at some point.
Note:
Example of an Address that has 2 transactions
This is how the graph of a BTC wallet that received and sent BTC looks. There are 3 types of nodes (Transactions, Addresses, Blocks).

I'm not sure I understand fully, but would the following work to return more than 1 input address?
MATCH (a:Address)-[:SENDS*2..]->(tx:Transaction)
WITH DISTINCT a
// Do something with these addresses
RETURN a.address // or whatever
For the record, it's not recommended to use an unbounded path like *2... You're better off setting a max value, like *2..10, which limits the returned paths to anywhere between 2 and 10 hops.
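Applied to the original question, a bounded version might look like this (a sketch only, assuming the :Address label and :SENDS relationship from the question; the bound of 10 hops is an arbitrary choice):

```cypher
// Follow SENDS hops out from the seed address, bounded at 10 hops
// to avoid an unbounded expansion over the whole chain
MATCH (a:Address {address: "3QQdfAaPhP1YqLYMBS59BqWjcpXjXVP1wi"})-[:SENDS*2..10]-(other:Address)
RETURN DISTINCT other.address
```

Raise the bound gradually if 10 hops doesn't reach all the wallets you care about.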

Related

How to optimise recursive query - Neo4j?

I am developing a contact tracing framework using Neo4j. There are 2 types of nodes, namely Person and Location. There exists a relationship VISITED between a Person and a Location, which has properties startTS and endTS. Example:
Now suppose person 1 is infected. I need to find all the persons who have been in contact with this person. For each person identified, I need to find all other persons who have been in contact with that person. This process is repeated until an identified person has not met anyone. Here is a working code:
MATCH path = (infected:Person {id:'1'})-[*]-(otherPerson:Person)
WITH relationships(path) as rels, otherPerson
WHERE all(i in range(1, size(rels)-1)
WHERE i % 2 = 0
OR (rels[i].endTS >= rels[i-1].startTS AND rels[i].startTS <= rels[i-1].endTS)
)
RETURN otherPerson
The problem is that the process is taking way too much time to complete with large datasets. Can the above query be optimised? Thank you for your help.
For this one, unfortunately, there are some limitations on our syntax for filtering these more complex conditions during expansion. We can cover post-expansion filtering, but you'd want an upper bound otherwise this won't perform well on a more complex graph.
To get what you need today (filtering during-expansion instead of after), you would need to implement a custom procedure in Java leveraging our traversal API, and then call the procedure in your Cypher query.
Advanced syntax that can cover these cases has already been proposed for GQL, and we definitely want that in Cypher. It's on our backlog.
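Until then, simply putting an upper bound on the variable-length pattern in the original query should already help (a sketch, with the bound of 10 hops chosen arbitrarily, and the relationship type pinned to :VISITED since that is the only relationship in the model):

```cypher
// Same post-expansion filter as before, but with the expansion capped
MATCH path = (infected:Person {id:'1'})-[:VISITED*..10]-(otherPerson:Person)
WITH relationships(path) AS rels, otherPerson
WHERE all(i IN range(1, size(rels)-1)
  WHERE i % 2 = 0
  OR (rels[i].endTS >= rels[i-1].startTS AND rels[i].startTS <= rels[i-1].endTS)
)
RETURN DISTINCT otherPerson
```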

Cypher query for list pattern

I have a schema which looks like below:
A customer is linked to another customer with a relationship SIMILAR having similarity score.
Example: (c1:Customer)-->(c2:Customer)
An Email node is connected to each customer with relationship MAIL_AT with the following node properties:
{
"active_email_address": "a#mail.com",
"cibil_email_addresses": [
"b#mail.com", "c#mail.com"
]
}
Example: (e1:Email)<-[:MAIL_AT]-(c1:Customer)-[:SIMILAR]->(c2:Customer)-[:MAIL_AT]->(e2:Email)
A Risk node with some risk-related properties (below) and is related to customer with relationship HAS_RISK:
{
"f0_score": 870.0,
"pta_score": 430.0
}
A Fraud node with some fraud-related properties (below) and is related to customer with relationship IS_FRAUD:
{
"has_commited_fraud": true
}
My Objectives:
To find the customers with common email addresses (regardless of whether the address is active or secondary).
My tentative solution:
MATCH (email:Email)
WITH email.cibil_email_addresses + email.active_email_address AS emailAddress, email
UNWIND emailAddress AS eaddr
WITH DISTINCT eaddr AS deaddr, email
UNWIND deaddr AS eaddress
MATCH (customer:Customer)-[]->(someEmail:Email)
WHERE eaddress IN someEmail.cibil_email_addresses + someEmail.active_email_address
WITH eaddress, COLLECT(customer.customer_id) AS customers
RETURN eaddress, customers
Problem: It is taking forever to execute this. I understand that working with lists takes time; however, I'm flexible about changing the schema (if suggested).
Should I break the email addresses into separate nodes? If yes, how do I break cibil_email_addresses into different nodes, since they can vary - should I create two nodes with the different cibil email addresses and connect both of them to the customer with a HAS_CIBIL_EMAIL relationship? (Is this a valid schema design?) Also, it is possible that a customer's active_email_address is present in another customer's cibil_email_addresses. I'm trying to find a synthetic identity attack.
PS: If there exists some APOC that can help achieve this and the below, do suggest it with an example.
In production, for a given customer with email addresses, risk values, similarity score, and also given other customers may or may not be tagged with fraud_status, I want to check whether this new person will fall in a fraud ring or not. PS: If I need to use any gds to solve this, please suggest with examples.
If I were to do this same exercise with some other node such as Address which may be partially matching and will be having same list of historical addresses in a list, what should be my ideal approach?
I know, I'm tagging someone in my question, but that person only seems to be active with respect to Cypher on StackOverflow. #cybersam any help?
Thanks.
This should work:
MATCH (e:Email)
UNWIND (e.cibil_email_addresses + e.active_email_address) AS address
WITH address, COLLECT(e) AS es
UNWIND es AS email
MATCH (email)<-[:MAIL_AT]-(cust)
RETURN address, COLLECT(cust) AS customers
The WITH clause takes advantage of the aggregating function COLLECT to automatically collect all the Email nodes containing the same address, using address as the grouping key.
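If you only want the addresses that are actually shared, a small variation (untested against your data, so treat it as a sketch) filters on the size of that collection before expanding:

```cypher
MATCH (e:Email)
UNWIND (e.cibil_email_addresses + e.active_email_address) AS address
WITH address, COLLECT(e) AS es
WHERE size(es) > 1   // keep only addresses that occur on more than one Email node
UNWIND es AS email
MATCH (email)<-[:MAIL_AT]-(cust)
RETURN address, COLLECT(cust) AS customers
```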
You should only ask one question at a time. You have a couple of other questions at the bottom. If you continue to need help with them, please create new questions.

DB model for logging ongoing transactions

I am used to having tables for ongoing activities from my former life as a relational DB guy. I am wondering how I would store ongoing information like transactions, logs, or whatever in a Neo4j DB. Let's assume I have an account which has been assigned to a user A:
(u:User {name:"A"})
I want to keep track on transactions he does, e.g. deducting or adding a value:
(t:Transaction {value:"-20", date:timestamp()})
Would I do for every transaction a new node and assign it to the user:
(u) -[r:changeBalance]-> (t)
In the end I might have lots of nodes assigned to the user, each holding one transaction, resulting in lots of nodes with only one piece of information each. I was pondering whether a query limited to the last 50 transactions (LIMIT 50, sorted by t.date) might still have to read all available transaction nodes to sort them before the limit applies - that seems rather inefficient.
How would you model a list of actions in a neo4j DB? Any hint is very appreciated.
If you used a simple query like the following, you would NOT be reading all Transaction nodes per User.
MATCH (u:User)-[r:ChangeBalance]->(t:Transaction)
RETURN u, t
ORDER BY t.date;
You'd only be reading the Transaction nodes that are directly related to each User (via a ChangeBalance relationship). So, the performance would not be as bad as you are afraid it might be.
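For example (assuming the relationship type and date property from the question):

```cypher
// Read only this user's transactions, newest first
MATCH (u:User {name: "A"})-[:ChangeBalance]->(t:Transaction)
RETURN t
ORDER BY t.date DESC
LIMIT 50
```

The ORDER BY still has to sort all of that user's transactions before the LIMIT applies, but it never touches transactions belonging to other users.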
Although everything is fine with your query - you are reading only the transactions that are related to this specific user - this approach can be improved.
Let's imagine that, for some reason, your application runs for 5 years and you have a user with 10 transactions per day. That results in ~18,250 transactions connected to a single node.
This is not a great idea from a data-model perspective. If you then want to filter the result (get the 50 latest transactions) on some non-indexed field, it will result in a full traversal of all 18,250 nodes.
This can be solved by adding additional relations to database.
Currently you have a graph like this: (user)-[:HAS]->(transaction)
( user )
/ | \
(transaction1) (transaction2) (transaction3)
You can add an additional relation between transactions to specify the sequence of events.
Like that: (transaction)-[:NEXT]->(transaction)
( user )
/ | \
(transaction1)-(transaction2)-(transaction3)
Note: there is no need for an additional PREVIOUS relation, because Neo4j stores relationship pointers in both directions, so traversing backwards is as fast as traversing forwards.
And maintain relations to the user's first and last transactions:
(user)-[:LAST_TRANSACTION]->(transaction)
(user)-[:FIRST_TRANSACTION]->(transaction)
This allows you to get the last transaction in 1 hop, and then the latest 50 with 50 additional hops.
So, at the cost of some additional complexity, you can traverse and manipulate your data in more efficient ways.
This idea comes from the EventStore database (and similar ones).
Moreover, with such a data model the user balance can be aggregated by folding over the sequence of transactions. This gives you a nice and fast way to get the user balance at any point.
Getting the latest 50 transactions in this model can look like this:
MATCH (user:User {id: 1}) WITH user
MATCH (user)-[:LAST_TRANSACTION]->(last_transaction:Transaction) WITH last_transaction
MATCH (last_transaction)<-[:NEXT*0..50]-(transactions:Transaction)
RETURN transactions
Getting total user balance can be:
MATCH (user:User {id: 1}) WITH user
MATCH (user)-[:FIRST_TRANSACTION]->(first_transaction:Transaction) WITH first_transaction
MATCH (first_transaction)-[:NEXT*]->(transactions:Transaction)
RETURN first_transaction.value + sum(transactions.value)
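For completeness, appending a new transaction in this model means linking the new node into the chain and moving the LAST_TRANSACTION pointer; a sketch (the property values are just examples):

```cypher
// Append a transaction: link it after the current last one
// and move the LAST_TRANSACTION pointer to it
MATCH (user:User {id: 1})-[old:LAST_TRANSACTION]->(prev:Transaction)
CREATE (t:Transaction {value: -20, date: timestamp()})
CREATE (prev)-[:NEXT]->(t)
CREATE (user)-[:LAST_TRANSACTION]->(t)
DELETE old
```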

Same Cypher Query has different performance on different DBs

I have a fullDB, (a graph clustered by Country) that contains ALL countries and I have various single country test DBs that contain exactly the same schema but only for one given country.
My query's "start" node is identified via a match on a given value for a property, e.g.
match (country:Country{name:"UK"})
and then proceeds to the main query, anchored on the variable country. So I am expecting the query times to be similar, given that we start from the same known node and traverse the same number of nodes related to it in both DBs.
But I am getting very different performance for my query depending on whether I run it on the full DB or on a single-country DB.
I immediately thought that I must have some kind of "Cartesian relationship" issue going on, so I profiled the query in the full DB and in a single-country DB, but the profile is exactly the same for each step in the plan. I was assuming that the profile would reveal a marked increase in db hits at some point in the plan, but the values are the same. Am I mistaken about what PROFILE displays?
Some sizing:
The full DB has 70k nodes and the test DB 672 nodes; the query takes 218,764 ms to complete on the full DB, versus circa 3,407 ms on the test DB.
While writing this I realised that there will be an increase in the number of outgoing relationships on certain nodes (suppliers can supply different countries) which I think is probably the cause, but the question remains as to why I am not seeing any indication of this in the profiling.
Any thoughts welcome.
What version are you using?
Both query times are way too long for your dataset size.
So you might check your configuration / disk.
Did you create an index/constraint for :Country(name) and is that index online?
And please share your query and your query plans.
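If the index is missing, it is cheap to create and check (syntax depends on your Neo4j version; the 4.x form is shown here, and the index name is arbitrary):

```cypher
CREATE INDEX country_name IF NOT EXISTS FOR (c:Country) ON (c.name);
SHOW INDEXES;   // verify the index state is ONLINE
```

On 3.x the equivalent would be CREATE INDEX ON :Country(name) and CALL db.indexes().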

How to query recommendation using Cypher

I'm trying to query Book nodes for recommendation by Cypher.
I want to recommend A:Book and C:Book for A:User.
I'm sorry, I need a graph to explain this question, but I couldn't upload a graph image because my reputation is too low for the upload function.
I wrote query below.
match (u1:User{uid:'1003'})-->(o1:Order)-->(b1:Book)<--(o2:Order)
<--(u2:User)-->(o3:Order)-->(b2:Book)
return b2
This query returns all Books (A, B, C, D) despite Cypher's uniqueness rule.
I expect it to return only A:Book and C:Book.
Is this behavior part of Neo4j's specification?
How do I get the expected return? Thanks, everyone.
environment:
Neo4j ver.v2.0.0-RC1
Using Neo4j Server with REST API
Without the sample graph it's hard to say why you get something back when you expected something else. You can share a sample graph by including a create statement that would generate said graph, or by creating it in Neo4j console and putting the link in your question. Here is an example of the latter: console.neo4j.org/r/fnnz6b
In the meantime, you probably want to declare the type of the relationships in your pattern. If a :User has more than one type of outgoing relationship, you will be excluding the other paths based on the labels of the nodes on the other end, which is much less efficient than only traversing the right relationships to begin with.
To my mind it's not clear whether (u:User)-->(o:Order)-->(b:Book) means that a user has one or more orders, and each order consists of one or more books; or if it means only that a user ordered a book. If you can share a sample, hopefully that will be clear too.
Edit:
Great, so looking at the graph: You get B and D back because others who bought B also bought D, and others who bought D also bought B, which is your criterion for recommendation. You can add a filter in the WHERE clause to exclude those books that the user has already bought, something like
WHERE NOT (u1)-[:BUY]->()-[:CONTAINS]->(b2)
This will give you A, C, C back, since there are two matching paths to C. It's probably not important to get two result items for C, so you can either limit the return to give only distinct values
RETURN DISTINCT(b2)
or group the return values by counting the matching paths for each result as a 'recommendation score'
RETURN b2, COUNT(b2) as score
Also, if each order only [CONTAINS] one book, you could try modelling without order, just (:User)-[:BOUGHT]->(:Book).
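Putting those pieces together, a full version might look like this (a sketch assuming the :BUY and :CONTAINS relationship types used above):

```cypher
MATCH (u1:User {uid:'1003'})-[:BUY]->(:Order)-[:CONTAINS]->(b1:Book)
      <-[:CONTAINS]-(:Order)<-[:BUY]-(u2:User)-[:BUY]->(:Order)-[:CONTAINS]->(b2:Book)
WHERE u2 <> u1
  AND NOT (u1)-[:BUY]->()-[:CONTAINS]->(b2)
RETURN b2, COUNT(*) AS score
ORDER BY score DESC
```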
