Neo4j - Find triplets with most similar attributes (nodes and relationships)? - neo4j

I have a lot of triplets with different attributes for each user node (it can contain user_id, location, role for some nodes and others can be without location but with their marital status, use_car etc.) and data node.
Data node can contain location, size, origin and sometimes it will contain only the location.
I have a relation between these nodes - that has some attributes like folder_name, approved/rejected etc.
Given a new triplet, how can I find the most similar triplets by their attributes (user node + relation + data node)
Is there any functionality to do this? I will be happy to get a direction to check or minimal example.

Maybe you could use Node Similarity algorithm that uses the Jaccard similarity under the hood. It is available in the Neo4j Graph Data Science library. This algorithm will allow you to compare node similarity based on their attribute nodes as you call them. If you also want to compare node properties, you would have to create your own comparing sets for each node and then use the basic implementation of Jaccard similarity score: https://neo4j.com/docs/graph-data-science/current/alpha-algorithms/jaccard/

something along these lines should get you started
MATCH (u:User)-->(something)<--(other:User)
WHERE u <> other
WITH u, COUNT(something) AS sharedSomethings
ORDER BY sharedSomethings DESC

Related

Neo4j : propagate label through two types of nodes

I would like to apply label propagation to my data in Neo4j. My data looks like the image.
The relationship 'Appears_in' has the weight property and some articles nodes has seed label property.
I would like to propagate this seed labels to create clusters with the articles that speaks about the same topic, for example (politics cluster). More precisely I would like to propagate seed label to another article node but articles do not have a direct relationship between them. They are 'somewhere' connected through the words that they have in commun...
Is it possible to propagate the label from one article node to another article node through the words' node?
Yes, this is quite a frequent scenario. You want to project a bipartite graph to a monopartite graph. I wrote a blog post about a very similar scenario.
I will give you a solution that handles all in one step. It consists of two parts. We project a bipartite graph to a monopartite with the cypher projection. In the next step, you run the Label Propagation algorithm. You will be needing the Neo4j Graph Data Science library. Here is an example query:
CALL gds.labelPropagation.write({
nodeQuery:"MATCH (n:Article) RETURN id(n) as id, n.seed_property as seed",
relationshipQuery:"MATCH (a:Article)<-[:APPEARS_IN]-()-[:APPEARS_IN]->(b:Article)
RETURN id(a) as source, id(b) as target",
seedProperty: "seed",
writeProperty: "lpa"
})
I didn't include the relationship weights as I don't know exactly how you want them handled. You could also use any of the similarity algorithms instead to project a monopartite graph.
Hope this helps

How to calculate custom degree based on the node label or other conditions?

I have a scenario where I need to calcula a custom degree between the first node (:employee) where it should only be incremented to another node when this node's label is :natural or :relative, but not when it is :legal.
Example:
The thing is I'm having trouble generating this custom degree property as I needed it.
So far I've tried playing with FOREACH and CASE but had no luck. The closest I got to getting some sort of calculated custom degree is this:
match p = (:employee)-[*5..5]-()
WITH distinct nodes(p) AS nodes
FOREACH(i IN RANGE(0, size(nodes)) |
FOREACH(node IN [nodes[i]] |
SET node.degree = i
))
return *
limit 1
But even this isn't right, as despite having 5 distinct nodes, I get SIZE(nodes) = 6, as the :legal node is accounted for twice for some reason.
Does anyone know how to achieve my goal within a single cypher query?
Also, if you know why the :legal node is account for twice, please let me know. I suspect it is because it has 2 :natural nodes related to it, but don't know the inner workings that make it appear twice.
More context:
:employee nodes are, well, employees of an organization
:relative nodes are relatives to an employee
:natural nodes are natural persons that may or may not be related to a :legal
:legal nodes are companies (legal persons) that may, or may not, be related to an :employee, :relative, :natural or another :legal on an IS_PARTNER relationship when, in real life, they are part of the board of directors or are shareholders of that company (:legal).
custom degree is what I aim to create and will define how close one node is to another given some conditions to this project (specified below).
All nodes have a total_contracts property that are the total amount of money received through contracts.
The objective is to find any employees with relationships to another node that has total_contracts > 0 and are up to custom degree <= 3, as employees may be receiving money from external sources, when they shouldn't.
As for why I need this custom degree ignoring the distance when it is a :legal node, is because we threat companies as the same distance as the natural person that is a partner.
On the illustrated example above, the employee has a son, DIEGO, that is a shareholder of a company (ALLURE) and has 2 other business partners (JOSE and ROSIEL). When I ask what's the degree of the son to the employee, I should get 1, as they are directly related; when I ask whats the degree of JOSE to the employee I should get 2, as JOSE is related to DIEGO through ALLURE and we shouldn't increment the custom degree when it is a company, only when its a person.
The trick with this type of graph is making sure we avoid paths that loop back to the same nodes (which is definitely going to happen quite a lot because you're using multiple relationships between nodes instead of just one...you may want to make sure this is necessary in your model).
The easiest way to do that is via APOC Procedures, as you can adjust the uniqueness of traversals so that nodes are unique in each path.
So for example, for a specific start node (let's say the :employee has empId:1 just for the sake of mocking up a lookup of the node, we'll calculate a degree for all nodes within 5 hops of the starting node. The idea here is that we'll take the length of the path (the number of hops) - the number of :legal nodes in the path (by filtering the nodes in the path for just :legal nodes, then getting the size of that filtered list).
MATCH (e:employee {empId:1})
CALL apoc.path.expandConfig(e, {minLevel:1, maxLevel:5, uniqueness:'NODE_PATH'}) YIELD path
WITH e, last(nodes(path)) as endNode,
length(path) - size([x in nodes(path) WHERE x:legal]) as customDegree
RETURN e, endNode, customDegree

Most efficient way to get all connected nodes in neo4j

The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH(n {name: "xxx") -[*]-> (o) RETURN o.name, but this seems to be extremely slow on even a small number of nodes (4000 nodes - takes 5s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?
As a first point, by not using labels and an indexed property for your starting node, this will already need to first find ALL the nodes in the graph and opening the PropertyContainer to see if the node has the property name with a value "xxx".
Secondly, if you now an approximate maximum depth of parentship, you may want to limit the depth of the search
I would suggest you add a label of your choice to your nodes and index the name property.
Use label, e.g. :Group for your starting point and an index for :Group(name)
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter then this query would be more efficient.
MATCH (n:Group)-[:IS_PARENT_OF|:PLAYS_ROLE_IN]->(m:Group)
RETURN n,m

How to represent the lines of a network in Neo4j

I am trying to realize a datamodel in Neo4j. The model has points of interest in a city and streets. The streets connect the points.
Initially I thought that points and streets should both represented in the graph database as nodes.
Between these two different type of nodes there is a relationship ("point is connected with").
Now I am thinking the possibility that instead of representing the street as a node, perhaps is more correct to represent the street as relationship ("connects two points")
And this is my question actually. What is the more correct way to represent the network (line part) in a model: with nodes or with relationships?
The only major difference between relationships and nodes is that relationships must exist between two nodes. This means that you wouldn't be able to store a specific street if you didn't store two points of interest that it connects. So, if you see this being an issue, you may want to store streets as nodes. If you are certain that you will only want to store streets if there are points of interest in your database that exist on the street, then it'd make more sense to represent the streets as relationships.
In general, you should try to avoid storing properties in nodes that you only intend to use to find relationships between them. In this case, you mention possible storying a "point is connected with" property in each point of interest node. This would work, but is essentially just saying that a relationship exists between two points without actually using a relationship. Again, in the case where you want to be able to store streets that don't have points of interests existing on them, this may be necessary, and you could store streets that don't have points of interests on them by leaving the "point is connected with" property as NULL, but I would advise against this.
Another thing to think about is what you would store in the relationship. If you go with the model where streets are nodes, it becomes very difficult to represent quantities like distances between points of interest without adding relationships into your graph specifically for those properties, which may as well be properties of a street relationship.
UPDATE: Thought I'd add an example query to show how making the streets relationships can simplify your logic and make using your database much simpler and more intuitive.
Imagine you wish to find the path with the fewest points of interest between points A and B.
This is what the query would look like with the relationships model:
MATCH (a:Point {name: "foo"}), (b:Point {name: "bar"}),
p = shortestPath(a-[*:Street]-b)
RETURN p
By using relationships where appropriate, you enable the capabilities of Neo4j, allowing you to get a lot of work done with relatively simple queries. It's hard to think of a way to write this query in the model where you represent streets as nodes, but it would in all likelihood be much more complex and less efficient.

Nearest nodes to a give node, assigning dynamically weight to relationship types

I need to find the N nodes "nearest" to a given node in a graph, meaning the ones with least combined weight of relationships along the path from given node.
Is is possible to do so with a pure Cypher only solution? I was looking about path functions but couldn't find a viable way to express my query.
Moreover, is it possible to assign a default weight to a relationship at query time, according to its type/label (or somehow else map the relationship type to the weight)? The idea is to experiment with different weights without having to change a property for every relationship.
Otherwise I would have to change the weight property's value to each relationship and re-do it to before each query, which is very time-consuming (my graph has around 10M relationships).
Again, a pure Cypher solution would be the best, or please point me in the right direction.
Please use variable length Cypher queries to find the nearest nodes from a single node.
MATCH (n:Start { id: 0 }),
(n)-[:CONNECTED*0..2]-(x)
RETURN x
Note that the syntax [CONNECTED*0..2] is a range parameter specifying the min and max relationship distance from a given node, with relationship type CONNECTED.
You can swap this relationship type for other types.
In the case you wanted to traverse variably from the start node to surrounding nodes but constrain via a stop criteria to a threshold, that is a bit more difficult. For these kinds of things it is useful to get acquainted with Neo4j's spatial plugin. A good starting point to learn more about Neo4j spatial can be found in this blog post: http://neo4j.com/blog/neo4j-spatial-part1-finding-things-close-to-other-things
The post is a little outdated but if you do some Google searching you can find more updated materials.
GitHub repository: https://github.com/neo4j-contrib/spatial

Resources