Cypher query for list pattern - neo4j

I have a schema which looks like below:
A customer is linked to another customer with a relationship SIMILAR having similarity score.
Example: (c1:Customer)-->(c2:Customer)
An Email node is connected to each customer with relationship MAIL_AT with the following node properties:
{
"active_email_address": "a@mail.com",
"cibil_email_addresses": [
"b@mail.com", "c@mail.com"
]
}
Example: (e1:Email)<-[:MAIL_AT]-(c1:Customer)-[:SIMILAR]->(c2:Customer)-[:MAIL_AT]->(e2:Email)
A Risk node with some risk-related properties (below) is related to the customer with relationship HAS_RISK:
{
"f0_score": 870.0,
"pta_score": 430.0
}
A Fraud node with some fraud-related properties (below) is related to the customer with relationship IS_FRAUD:
{
"has_commited_fraud": true
}
My Objectives:
To find the customers with common email addresses (irrespective of whether they are active or secondary).
My tentative solution:
MATCH (email:Email)
WITH email.cibil_email_addresses + email.active_email_address AS emailAddress, email
UNWIND emailAddress AS eaddr
WITH DISTINCT eaddr AS deaddr, email
UNWIND deaddr AS eaddress
MATCH (customer:Customer)-[]->(someEmail:Email)
WHERE eaddress IN someEmail.cibil_email_addresses + someEmail.active_email_address
WITH eaddress, COLLECT(customer.customer_id) AS customers
RETURN eaddress, customers
Problem: It takes forever to execute. I understand that working with lists takes time; however, I'm flexible about changing the schema if suggested. Should I break the email addresses into separate nodes? If yes, how do I break cibil_email_addresses into different nodes, given that their number can vary? Should I create one node per cibil email address (for example, two nodes for two addresses) and connect each of them to the customer with relationship HAS_CIBIL_EMAIL? Is this a valid schema design? Also, it is possible that one customer's active_email_address is present in another customer's cibil_email_addresses. I'm trying to detect a synthetic identity attack. PS: If there is an APOC procedure that can help achieve this and the items below, please suggest it with an example.
In production, for a given customer with email addresses, risk values, and similarity scores, and given that other customers may or may not be tagged with fraud_status, I want to check whether this new person falls into a fraud ring or not. PS: If I need to use GDS to solve this, please suggest it with examples.
If I were to do this same exercise with some other node, such as Address, which may match only partially and which also holds its historical addresses in a list, what would be my ideal approach?
I know I'm tagging someone in my question, but that person seems to be the only one active with respect to Cypher on StackOverflow. @cybersam, any help?
Thanks.

This should work:
MATCH (e:Email)
UNWIND (e.cibil_email_addresses + e.active_email_address) AS address
WITH address, COLLECT(e) AS es
UNWIND es AS email
MATCH (email)<-[:MAIL_AT]-(cust)
RETURN address, COLLECT(cust) AS customers
The WITH clause takes advantage of the aggregating function COLLECT to automatically collect all the Email nodes containing the same address, using address as the grouping key.
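To see what the UNWIND and COLLECT grouping computes without a database, here is a minimal plain-Python sketch of the same idea. The sample data, the `customers` field and the function name `customers_by_address` are assumptions made up for illustration; unlike COLLECT, the sketch also removes duplicate customers.

```python
from collections import defaultdict

# Hypothetical sample data: each Email node plus the ids of the
# customers linked to it via MAIL_AT.
emails = [
    {"active_email_address": "a@mail.com",
     "cibil_email_addresses": ["b@mail.com", "c@mail.com"],
     "customers": ["C1"]},
    {"active_email_address": "b@mail.com",
     "cibil_email_addresses": ["d@mail.com"],
     "customers": ["C2"]},
]


def customers_by_address(email_nodes):
    """Group customer ids by every address found on their Email nodes."""
    grouped = defaultdict(set)
    for node in email_nodes:
        # Mirrors UNWIND (cibil_email_addresses + active_email_address)
        addresses = node["cibil_email_addresses"] + [node["active_email_address"]]
        for address in addresses:
            # The address acts as the grouping key, as in WITH address, COLLECT(...)
            grouped[address].update(node["customers"])
    return {address: sorted(ids) for address, ids in grouped.items()}


# Keep only addresses used by more than one customer.
shared = {a: c for a, c in customers_by_address(emails).items() if len(c) > 1}
print(shared)  # {'b@mail.com': ['C1', 'C2']}
```

Here C1 lists b@mail.com as a secondary address and C2 uses it as the active one, which is the cross-field overlap the question describes.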
You should only ask one question at a time. You have a couple of other questions at the bottom. If you continue to need help with them, please create new questions.

Related

Neo4j Cypher exclude nodes where a specific relationship is missing

I am trying to implement a fraud detection system in neo4j, where I have a bunch of nodes for persons, bank accounts, credit cards, telephone numbers and addresses.
A basic idea of detecting fraud in bank systems is to find someone who has a bank account and a credit card, where the credit card is not linked to his own bank account.
I cannot figure out how to do this, because when I try to exclude these nodes with:
WHERE NOT (k)-[:VERKNUEPFT]-(b)
I still get the wrong nodes, but it just hides the VERKNUEPFT node.
Can someone show me the correct way to negate the condition, so that every node I do not need is excluded?
So simply said I need to get following output:
First I filtered out which nodes are needed at all:
MATCH (p:person)-[r:HAT_KONTO]->(b:bankkonto), (p)-[r2:NUTZT_KARTE]->(k:kreditkarte) return p,b,k,r,r2;
which gives me the following:
The nodes below Hermine and Ron are correct, so I want to exclude everything that is linked to them.
But when I try to do MATCH (p:person)-[r:HAT_KONTO]->(b:bankkonto), (p)-[r2:NUTZT_KARTE]->(k:kreditkarte) WHERE NOT (k)-[:VERKNUEPFT]-(b) return p,b,k,r,r2;
I get the following:
Only the bank account (the brown one) is missing.
When I test the same code with WHERE instead of WHERE NOT:
MATCH (p:person)-[r:HAT_KONTO]->(b:bankkonto), (p)-[r2:NUTZT_KARTE]->(k:kreditkarte) WHERE (k)-[:VERKNUEPFT]-(b) return p,b,k,r,r2;
I get the opposite of what I want.
I think you need to check whether all the credit cards held by a person are linked to any one of his bank accounts. Currently, you are checking whether each card is linked to one specific bank account. Try something along these lines:
MATCH (p:person)-[:HAT_KONTO]->(b:bankkonto)
WITH p, collect(b) AS banks
MATCH (p)-[r2:NUTZT_KARTE]->(k:kreditkarte)
WITH p, banks, collect(k) AS creditCards
WHERE ALL(card IN creditCards WHERE ANY(bank IN banks WHERE (card)-[:VERKNUEPFT]-(bank)))
UNWIND banks AS b
UNWIND creditCards AS k
MATCH (p)-[r:HAT_KONTO]->(b), (p)-[r2:NUTZT_KARTE]->(k)
RETURN p,r,b,r2,k
In the above query, we first collect the person's bank accounts and credit cards into two separate collections. Then we check whether every credit card is linked to at least one of those bank accounts, which keeps only the valid users, and return their details. To select the suspicious persons instead, negate the predicate with WHERE NOT ALL(...).
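The nested ALL/ANY predicate can be modelled in plain Python with `all` and `any`. This is only a sketch: the account and card ids, the person "Harry" and the `linked` pairs are hypothetical sample data, not taken from the question's graph.

```python
# Hypothetical sample data: accounts and cards per person, plus the
# VERKNUEPFT links as (card, account) pairs.
accounts = {"Hermine": ["K1"], "Ron": ["K2"], "Harry": ["K3"]}
cards = {"Hermine": ["V1"], "Ron": ["V2"], "Harry": ["V3"]}
linked = {("V1", "K1"), ("V2", "K2"), ("V3", "K1")}


def all_cards_linked(person):
    """True if every card of the person is linked to one of their own accounts."""
    return all(
        any((card, account) in linked for account in accounts[person])
        for card in cards[person]
    )


# ALL(...) keeps the valid persons, NOT ALL(...) keeps the suspicious ones.
valid = [p for p in accounts if all_cards_linked(p)]
suspicious = [p for p in accounts if not all_cards_linked(p)]
print(valid)       # ['Hermine', 'Ron']
print(suspicious)  # ['Harry']
```

Harry's card V3 is linked to an account, but not to one of his own, so he only shows up when the predicate is negated.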

Combining related information within a cypher query

I have created a knowledge graph with the nodes and relationships pictured. Each person has any number of jobs and skills connected to them, and each Job and Skill can have any number of People connected to them. I would like to be able to search for a particular job (e.g. Security Architect) and return a list of all the people who have been employed_as that job and all of the skills that each person is skilled_in. I have created a query which retrieves these results; however, a new row is created in the result for each skill, duplicating the person details each time. This is the query I have which retrieves those results.
MATCH (j:Job {job_title: "Security Architect"})<-[p_rel:employed_as]-(p:Person)-[:skilled_in]->(s:Skill) return p,s,p_rel
Is it possible to create a query that returns all of the skill nodes connected to a person as a single list with the details of that person?
Since you need all the skills in a single row, you can collect all the skills per person.
MATCH (j:Job {job_title: "Security Architect"})<-[p_rel:employed_as]-(p:Person)
-[:skilled_in]->(s:Skill)
RETURN p,p_rel, collect(s) as skills_per_person

Neo4J query to find same data link to different nodes

Following is what I created in Neo4j:
Nodes: Customer Names, Customer Address and Customer Contact
I linked these nodes based on the common relationships between all three.
I can see all three nodes linked in Neo4j. Contact contains email addresses and phone numbers, so in some cases a customer name node is connected to an email address, a phone number and an address.
As a learning exercise, I have been asked to show how many of the same contacts are used by different customer names, and how many of the same addresses are used by different customer names. With my limited experience I tried a few queries but couldn't get the results.
I tried the following query:
start n=node(*)
match n-[:CONTACT_AT]-()
return distinct n
CONTACT_AT is the relationship between customer name and Contact (email, phone) node.
Your question does not provide enough information about your data model. To save time, I will assume that it looks something like this (without showing all the properties):
(a:Address)<-[:ADDRESS_AT]-(p:Person {name: '...'})-[:CONTACT_AT]->(c:Contact)
With this model, this is how you'd get all the names of the people who have the same Contact:
MATCH (person:Person)-[:CONTACT_AT]->(contact:Contact)
RETURN contact, COLLECT(person.name) AS names;
And this is how you'd get all the names of the people who have the same Address:
MATCH (person:Person)-[:ADDRESS_AT]->(address:Address)
RETURN address, COLLECT(person.name) AS names;

graph modeling approach for node/edge user access control

Are there sets of best practices to approach how to model data in a graph database (I am considering arangodb right now but the question would apply to other platforms)? Here is a practical case to illustrate my question:
Assuming we are creating a centralised contact list for users. Each user has contacts but some contacts could be common to users e.g. John knows Mary, and Marc knows Mary. I would thus have 3 nodes (John, Mary and Marc) but John should only see his relationship to Mary, not Marc's relationship to Mary
So how should a full graph be designed in order to support user access to their information?
Option 1: Create 1 graph per user. That way, I know exactly who can see what (I could for example prefix all my collections with the user id). That would be simple but would duplicate a lot of data (e.g. if I put all my family in the db, my brother will do too, creating twice the same data, in different graphs)
Option 2: Create 1 general graph with Contact nodes, plus User nodes. I would have the contact John, Mary and Marc connected, but the User node representing John, would be linked to the Contact nodes John and Mary only. That way I would know to get only the contact nodes that are connected to the User node I am focusing on.
The problem is that edges cannot be linked to the User node (I cannot have an edge going from a node to an edge...can I?). So I would have to add an attribute of user_id to all the edges in order to only fetch the ones relevant to the current user.
This is slightly nicer as I do not have to duplicate nodes, but I would still have to duplicate edges as they would be user specific
Option 3: Do it SQL like with a Rights table, maintaining a list of Contact ids along with what user can see what Node and what Edge (heavy on joins)
Option 4: ???
As in everything, there are many ways to reach a solution but I was wondering what was considered best practice to balance cleanliness of approach and performance for insertion/deletion...knowing that performance might be platform dependent
I would suggest an Option 4:
First, I would not distinguish between User and Contact nodes; all of them should be Contact nodes.
If you create a new User, you basically create a new Contact for him (or use an existing one) and connect your application's authentication to this specific Contact.
Then you can use directed edges to create the contact list for a user.
Say you have two users, John and Mary. Then John can add Mary to his contact list, but Mary would not notice. If she wants to add John, you add a second edge.
If you want to have symmetrical contacts only (if John adds Mary to his list, he should automatically appear in her list), you simply ignore the direction in your queries.
If you now want to get the contacts for John, this can be done by selecting the neighbors of John.
In ArangoDB this can be realized with two collections, say Contact and Knows, where Knows holds the edges.
The following code pasted into arangosh creates the situation you described above:
db._create("Contact");
db._createEdgeCollection("Knows");
db.Contact.save({_key: "John", mail: "john@example.com"});
db.Contact.save({_key: "Mary", mail: "mary@somewhere.com"});
db.Contact.save({_key: "Marc", mail: "marc@somewhereelse.com"});
db.Knows.save("Contact/John", "Contact/Mary", {});
db.Knows.save("Contact/Marc", "Contact/Mary", {});
To query the contact list for user John:
db._query('RETURN NEIGHBORS(Contact, Knows, "John", "outbound")').toArray()
Should give Mary as result, no information about Marc.
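The effect of edge direction on the contact list can be sketched in plain Python. This is not ArangoDB code; the `neighbors` helper and its `direction` values are assumptions that only mimic the outbound lookup described above.

```python
# Directed Knows edges as (from, to) pairs, mirroring the arangosh example.
knows = [("John", "Mary"), ("Marc", "Mary")]


def neighbors(person, direction="outbound"):
    """Return the contacts one Knows edge away in the given direction."""
    result = set()
    for source, target in knows:
        if direction in ("outbound", "any") and source == person:
            result.add(target)
        if direction in ("inbound", "any") and target == person:
            result.add(source)
    return sorted(result)


print(neighbors("John"))         # ['Mary']
print(neighbors("Mary"))         # []
print(neighbors("Mary", "any"))  # ['John', 'Marc']
```

With "outbound", John sees Mary and learns nothing about Marc; ignoring the direction ("any") gives the symmetrical contact list instead.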
If you do not want to join Contacts and User Accounts as I suggested, you could also separate them into different collections. In this case you have to slightly modify the edges and the query:
db.Knows.save("User/John", "Contact/Mary", {});
db.Knows.save("User/Marc", "Contact/Mary", {});
db._query('RETURN NEIGHBORS(User, Knows, "John", "outbound")').toArray()
This should give the same result.
Edit:
Regarding your question in Option 2:
In ArangoDB it is actually possible to point edges to other edges; however, the built-in graph functionality will then treat the edges pointed to as if they were nodes. This means it does not follow their direction automatically. But you can use these resulting edges in further AQL statements and continue the search with AQL features.

Neo4j suggestion on large scale

I need to implement a suggestion system for my project.
In this system we should recommend people based on some parameters like current city, education, friends of friends, etc.
I have designed this by creating (or updating) may_know relations when users edit their profile or become friends with someone, and I retrieve them with MATCH u-[r:MAY_KNOW]-x RETURN * ORDER BY r.weight so that people can find those most similar to them.
But I think this is not best practice, because the may_know relations from/to every user can soon reach millions, and scanning and sorting them will be costly.
Do you have a better idea?
Depends a bit on the data-structure, I assume there are relationships to cities, education facilities and friends. So you don't actually have MAY_KNOW relationships as those are only inferred?
It also depends on whether you want to create a cross product between all your users (how many?) and how you would want to filter out non-related people.
Perhaps check out this blog post from Max: http://maxdemarzi.com/2013/04/19/match-making-with-neo4j/
So something like this query might work (depending on the data volume I'd rewrite it in the Java API).
match (p:Person {id:{user_id}})
match (p)-[:LIVES_IN]->(:City)<-[:LIVES_IN]-(other)
match (p)-[:GRADUATED]->(:School)<-[:GRADUATED]-(other)
match (p)-[:KNOWS]->(:Person)<-[:KNOWS]-(other)
RETURN other
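Because the three match clauses all constrain the same `other`, the query returns only people who satisfy every condition at once. A small plain-Python sketch of that intersection, using hypothetical ids and sample data, and excluding the user explicitly:

```python
# Hypothetical sample data keyed by person id.
city = {"u1": "Berlin", "u2": "Berlin", "u3": "Berlin", "u4": "Paris"}
school = {"u1": "TU", "u2": "TU", "u3": "FU", "u4": "TU"}
knows = {"u1": {"u5"}, "u2": {"u5"}, "u3": {"u5"}, "u4": {"u5"}, "u5": set()}


def suggestions(user):
    """People sharing city, school and at least one KNOWS target with user."""
    return sorted(
        other for other in city
        if other != user
        and city[other] == city[user]      # same LIVES_IN city
        and school[other] == school[user]  # same GRADUATED school
        and knows[other] & knows[user]     # at least one common KNOWS target
    )


print(suggestions("u1"))  # ['u2']
```

u3 shares the city and a friend but not the school, and u4 shares the school and a friend but not the city, so only u2 is suggested. If any single shared attribute should be enough, the conditions would have to be combined with "or" (or scored) rather than intersected.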
