I am new to graph databases and am having some issues modeling my data. I am working on a site that contains information about books.
I have categorized the books into categories such as arts, fiction, etc., and I have created a node for each category. The category nodes are not connected to each other, and that is where the first issue comes in. To solve it, I was going to index a node called category and connect all my category nodes to it, which leads to the second issue: a dense node, or super node.
How can I solve both of these issues?
You should use labels. The idea of super nodes was for Neo4j < 2.*
You can add a BookCategory label, and when you want to return all your BookCategory nodes, just specify the label in the MATCH query:
MATCH (n:BookCategory) RETURN n
Related
I have a graph with two types of nodes: Persons and Houses. In this graph, persons can sell houses (SELLS relationship), and houses are bought by persons (IS_BOUGHT_BY relationship). This could be represented as the following diagram:
(Person_1)-[:SELLS]->(House_1)-[:IS_BOUGHT_BY]->(Person_2)
A basic example of this diagram can be created in Neo4j by running:
CREATE(Person_1: Person {name:'Person 1'})
CREATE(House_1: House {name:'House 1', deal: 10238})
CREATE(Person_2: Person {name:'Person 2'})
CREATE
(Person_1)-[:SELLS]->(House_1),
(House_1)-[:IS_BOUGHT_BY]->(Person_2)
or can be created in Gremlin by executing:
g.
addV('Person').property('name', 'Person 1').as('p1').
addV('House').property('name', 'House 1').property('deal', 10238).as('h1').
addV('Person').property('name', 'Person 2').as('p2').
addE('SELLS').from('p1').to('h1').
addE('IS_BOUGHT_BY').from('h1').to('p2')
I want to make a query that returns these Person nodes connected with a fake edge called SELLS_HOUSE_TO without saving this relationship in the database. Also this SELLS_HOUSE_TO relationship must have as deal property the deal id of the house sold by Person 1 and bought by Person 2. In other words, I want the output of the query to follow the following diagram:
(Person_1)-[:SELLS_HOUSE_TO {deal: house_1_deal}]->(Person_2)
where house_1_deal is the deal property of the House_1 node, which connects Person_1 and Person_2 through the SELLS and IS_BOUGHT_BY relationships respectively.
This query can be easily done in Neo4j by using the vRelationship function provided by the APOC library:
MATCH (p1: Person)-[r1:SELLS]->(h1: House)-[r2:IS_BOUGHT_BY]->(p2: Person)
CALL apoc.create.vRelationship(p1, 'SELLS_HOUSE_TO', {deal: h1.deal}, p2) YIELD rel
RETURN p1, rel, p2
Is it possible to do something similar in Gremlin without storing the edges in the database at any step?
Currently, Gremlin has no way to inject virtual edges into a query result, except perhaps by using in-line code (a closure/lambda). This might be a good feature request to open as a Jira ticket for Apache TinkerPop.
As a short-term solution, I think the best you can do is to create the edges for the needs of a visualization, perhaps give them a unique property key and value, something like "virtual":true, and delete such edges when they are no longer needed.
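Another option that writes nothing at all is to build the virtual edges on the client from the rows returned by a read-only traversal. The sketch below is plain Python, not Gremlin. The row shape (`seller`, `deal`, `buyer`) and the helper name `to_virtual_edges` are assumptions made for illustration, and how the rows are fetched from the graph is left out.

```python
from typing import Dict, List


def to_virtual_edges(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Turn (seller, deal, buyer) rows into SELLS_HOUSE_TO edge records.

    Nothing is written to the database; the edges exist only in this list.
    """
    edges = []
    for row in rows:
        edges.append({
            "label": "SELLS_HOUSE_TO",
            "from": row["seller"],
            "to": row["buyer"],
            "properties": {"deal": row["deal"]},
        })
    return edges


# Hypothetical rows, shaped like the example graph in the question.
rows = [{"seller": "Person 1", "deal": 10238, "buyer": "Person 2"}]
virtual_edges = to_virtual_edges(rows)
print(virtual_edges)
```

The resulting records can be handed to whatever draws the graph, so the stored data never contains a SELLS_HOUSE_TO edge.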
I'm pretty new to Neo4j and graph DBs in general, and have been playing around with it for the last few days. I've now hit something I'm stumped on: I'm trying to create a "temporary" relationship between two disjoint nodes just for the sake of a RETURN, then not store this relationship within the DB afterwards.
The dataset I'm using is a graph of Movie and Person nodes provided in one of the basic Neo4j built-in tutorials. My query is currently as follows:
MATCH (p1:Person)-[r1:ACTED_IN]-(m1:Movie)-[r2:ACTED_IN]-(p2:Person)
WHERE p1.name="Kevin Bacon"
RETURN {start:p1,rel:"COSTAR",end:p2}
What I'd ultimately like to see is a central "Kevin Bacon" node with COSTAR relationships to a series of Person nodes around it, without any Movie nodes or ACTED_IN relationships being displayed. The query above does show the COSTAR relationship in the returned rows, but it does not appear on the graph itself; I've attached a few screenshots of what I'm seeing.
The only other idea I have is to use the MERGE keyword to create a COSTAR relationship, but (as I understand it) this actually stores the relationship in the DB which is what I'm trying to avoid.
Any suggestions would be greatly appreciated.
The neo4j Browser only visualizes nodes and relationships that actually exist in the DB. So, there is no way to do what you want without actually creating the COSTAR relationships, visualizing the result in the Browser, and then deleting all the COSTAR relationships.
As a workaround you could simply display the nodes of all of Kevin Bacon's costars, like this:
MATCH (p1:Person)-[:ACTED_IN]-(:Movie)-[:ACTED_IN]-(p2:Person)
WHERE p1.name="Kevin Bacon"
RETURN DISTINCT p2;
So you want the relationships to appear in the graph visualization in the Neo4j browser but not store these relationships in the graph itself? I can't think of a way to make that happen (without hacking it), but would deleting the relationships after you are done generating the visual work?
Query to create COSTAR relationships:
MATCH (p1:Person)-[r1:ACTED_IN]-(m1:Movie)-[r2:ACTED_IN]-(p2:Person)
WHERE p1.name="Kevin Bacon"
CREATE UNIQUE (p1)<-[:COSTAR]-(p2);
Execute your query to populate the graph in Neo4j Browser...
Then to delete the COSTAR relationships:
MATCH (:Person)-[r:COSTAR]-(:Person)
DELETE r;
The best way to achieve this (now... 6 years later) is with the gds.graph.create.* functions (assuming you load GDS)
https://neo4j.com/docs/graph-data-science/current/graph-create/
With a graph as simple as this, gds.graph.create(...) would be enough (creating COSTAR for all co-starrings)
Or, if you wanted to do some constraining, gds.graph.create.cypher(...)
The in-memory graph projection feels like what you wanted to achieve - it persists only as long as the DBMS is active, or until you call gds.graph.drop(...)
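To make the idea of a projection concrete, here is a minimal sketch in plain Python of what collapsing ACTED_IN pairs into COSTAR pairs amounts to. It does not use GDS or any Neo4j API; the function name `project_costars` and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def project_costars(acted_in):
    """Collapse (person, movie) pairs into an in-memory set of COSTAR pairs."""
    cast = defaultdict(set)
    for person, movie in acted_in:
        cast[movie].add(person)

    costars = set()
    for people in cast.values():
        # Every unordered pair of people in the same cast is a co-starring.
        for a, b in combinations(sorted(people), 2):
            costars.add((a, b))
    return costars


# Sample rows standing in for the result of an ACTED_IN query.
acted_in = [
    ("Kevin Bacon", "Apollo 13"),
    ("Tom Hanks", "Apollo 13"),
    ("Kevin Bacon", "A Few Good Men"),
    ("Tom Cruise", "A Few Good Men"),
]
projection = project_costars(acted_in)

# Keep only the pairs that involve Kevin Bacon and list the other person.
bacon_costars = sorted(
    b if a == "Kevin Bacon" else a
    for a, b in projection
    if "Kevin Bacon" in (a, b)
)
print(bacon_costars)
```

The projected pairs live only in the `projection` variable, which mirrors the point above: the derived relationships are never stored alongside the original data.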
The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH (n {name: "xxx"})-[*]->(o) RETURN o.name, but this seems to be extremely slow even on a small number of nodes (4000 nodes; it takes 5s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?
First, because you are not using a label and an indexed property for your starting node, the query has to visit ALL the nodes in the graph and open each one's PropertyContainer to see whether it has a name property with the value "xxx".
Second, if you know an approximate maximum depth of the parent hierarchy, you may want to limit the depth of the search.
I would suggest you add a label of your choice to your nodes and index the name property.
Use label, e.g. :Group for your starting point and an index for :Group(name)
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter then this query would be more efficient.
MATCH (n:Group)-[:IS_PARENT_OF|:PLAYS_ROLE_IN]->(m:Group)
RETURN n,m
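A variable-length pattern matches paths rather than just end nodes, so in a graph with cycles or many alternative routes the same node can be reached again and again. If the real question is "which nodes are reachable", a traversal that remembers what it has already seen visits each node once. The following is a minimal sketch in plain Python, not Neo4j code; the adjacency dictionary, the node names, and the `reachable` helper are all hypothetical.

```python
from collections import deque


def reachable(graph, start, max_depth=None):
    """Return every node reachable from start by following outgoing edges.

    A visited set guarantees termination even when the graph has cycles.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical group graph; "c" points back to "root", forming a cycle.
graph = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["root"],
}
print(sorted(reachable(graph, "root")))
print(sorted(reachable(graph, "root", max_depth=1)))
```

The optional `max_depth` argument corresponds to the suggestion above of limiting the search depth when an approximate maximum is known.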
Totally new to graph databases -- corrections welcome.
If I want to obtain a list of nodes labeled with the "User" label, does neo4j (or possibly other graph databases) need to search all nodes for that label or does it somehow auto-index nodes by label?
Without indexing, (horrible performance) every node is queried to see if any one of its labels matches "User," like so:
List<Node> userNodes = new ArrayList<>();
for (Node node : all_nodes)
{
    for (Label label : node.labels())
    {
        // compare string contents, not references
        if (label.name().equals("User"))
        {
            userNodes.add(node);
            // no need to look at other labels for this node
            break;
        }
    }
}
return userNodes;
With indexing, the system grabs some system-managed "node" that has all of the label names under it (search space of dozens instead of millions) and grabs its children:
List<Node> userNodes = new ArrayList<>();
for (Node labelNode : labels_node) // where labels_node is system-managed
{
    // compare string contents, not references
    if (labelNode.name().equals("User"))
    {
        // All children of the "User" node have the label "User"
        userNodes = labelNode.children();
        // No need to look at other labels
        break;
    }
}
return userNodes;
Ultimately, I think this question gets to this: if I am building a list of "things" for which I need to retrieve all of them by type of thing, should I use labels to accomplish this? Or should I instead create my own "Users" node, which points to all nodes that are users, and only use labels once I have found the subset of nodes I want?
It seems this question is similar though more vague but did not receive a satisfactory answer.
Terminology-wise, the docs talk about "labels and schema indexes". An "index" is something you attach to a property of a label, such as indexing the first_name property of all :Person nodes.
But for the purposes of your question, labels behave like indexes: the execution engine takes advantage of them and uses them the way you'd expect it to use an index, even though the documentation doesn't describe labels as indexes.
So, for a concrete example, suppose we had a graph of 1 million nodes, of which 5 of them had the label :Person. And suppose we had the following query:
MATCH (p:Person) RETURN p;
The question boils down to, how many nodes does cypher have to consider? The answer is 5, not 1 million.
Your second code snippet is more of a neo4j version 1.9 kind of approach; nowadays I wouldn't create these artificial "index nodes", and I wouldn't loop through all possible labels, I'd just match by label and be done with it.
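The difference between checking every node and reading a label lookup can be sketched with a toy store in plain Python. This is only an illustration of the idea, not how Neo4j is implemented; the `TinyGraph` class and its method names are hypothetical.

```python
from collections import defaultdict


class TinyGraph:
    """Toy node store that maintains a label -> node ids map on every insert."""

    def __init__(self):
        self.labels_by_node = {}
        self.nodes_by_label = defaultdict(set)

    def add_node(self, node_id, labels):
        self.labels_by_node[node_id] = set(labels)
        for label in labels:
            self.nodes_by_label[label].add(node_id)

    def full_scan(self, label):
        """Check every node: the work grows with the size of the graph."""
        return {n for n, labels in self.labels_by_node.items() if label in labels}

    def label_scan(self, label):
        """Read the precomputed set: the work grows with the number of matches."""
        return set(self.nodes_by_label.get(label, ()))


g = TinyGraph()
for i in range(1000):
    g.add_node(i, ["Person"] if i < 5 else ["Other"])

print(sorted(g.label_scan("Person")))
print(g.full_scan("Person") == g.label_scan("Person"))
```

Both methods return the same five nodes, but only `full_scan` has to look at all 1000 of them, which is the same contrast as the 5 versus 1 million in the example above.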
Yes, labels are indexed automatically, meaning that if you have 1000 user nodes of which 700 are active users, querying for the Active label returns just the 700 active users without looking at the others.
Connecting related nodes to a super node is (almost always) a bad idea.
Also, you should model your database for querying purposes; see this amazing answer:
Neo4J - Storing into relationship vs nodes
There is also a discussion of the difference between using labels and indexed properties on nodes; this blog post explains it very well:
http://graphaware.com/neo4j/2015/01/16/neo4j-graph-model-design-labels-versus-indexed-properties.html
You should also profile your queries, which also means it makes no sense to start by importing 1 million nodes; try with 100 and run some queries.
I heard an amazing sentence from someone at the neo4j hq :
Be faithful to your graph and the graph will be faithful to you
Find a way of doing it that solves your problem!
There is a dedicated method in GlobalGraphOperations:
GlobalGraphOperations ops = GlobalGraphOperations.at(gdb);
for (Node node : ops.getAllNodesWithLabel(DynamicLabel.label("User"))) {
    // do something with node
}
which uses the optimized label-scan-store behind the scenes.
I have read the Neo4j manual and seen the numerous short examples built on the movie graph. I have also installed it locally and played with Cypher.
Here is the setup:
I have the following nodes: Movies (with name and id, owned by friend), Actors (with name and id), Directors (with name and id), and Genres (with id and name).
The relationships are: Actors acted in Movies (1 movie, many actors), Directors directed a Movie (1 director per movie, but a director can direct many movies), and Movies have several Genres (many to many).
1) Owned by friend: following the LOAD CSV example, they made USA a node rather than a property. Is there a logical reason why it is better to make it a node rather than a property, as I did?
2)
What I want to search is similar to the answer given to this question:
Nearest nodes to a give node, assigning dynamically weight to relationship types
However, I do not have a weight on the relationships; it's more of a "go find the first give nodes connected to it".
Assume that each movie can only be owned by 1 person ("owned by friend").
Given the movie title "Spider-Man" (which, for the purposes of this example, is owned by Frank), go find the next occurrence of a movie that is owned by John.
So, after reading about Neo4j, I believe that I don't need to specify which relationship to traverse; I can just go find the next movie that meets my criteria, right?
So, following the link above:
MATCH (n:Start { title: 'Spider-Man' }),
(n)-[:CONNECTED*0..2]-(x)
RETURN x
So: go to the Spider-Man node and find me x as long as it is connected. But I got stumped by *0..2 because it's a range. What if I just want to say "go find me the first one that is owned by John"?
3) Following up on #2: how do I insert the filter "owned by John"?
There are a number of things in your question that don't quite make sense. Here's a stab at an answer.
1) Making 'USA' a node rather than a property is useful if you want to search based on country. If 'USA' is a node, you are able to limit your search by starting at the 'USA' node. If you don't care to do this, then it doesn't really matter. It may also save a small amount of space for longer country names to store the name once and link to it via relationships.
2) Your example doesn't match your described graph. I can't really speak to it without a better example.
3) This is probably easy to answer once you improve your example.
OK. Based on the comments to the answer, here's what you need. To find one movie owned by John that is connected via common actors, directors, etc to the movie Spider-man owned by Frank (that is, sub-graphs like (movie)<--(actor)-->(movie) ) you can write:
MATCH (n:Movie {title : 'Spider-Man', owned_by : 'Frank'})<-[*2]->(m:Movie {owned_by : 'John'})
RETURN m LIMIT 1
If you want more results, alter or remove the LIMIT on the RETURN clause. If you want to allow longer chains such as (movie)<--(actor)-->(movie)<--(director)-->(movie), you can increase the number of relationships matched (the *2) to 4, 6, 8, etc. You probably shouldn't write the relationship part of the MATCH clause as just -[*]-, because an unbounded match can explore a very large number of paths and become extremely slow.
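The "find the first movie owned by John within a bounded number of hops" idea can be sketched as a breadth-first search that stops at the first match. This is plain Python rather than Cypher; the `first_match` helper, the undirected adjacency dictionary, and every node name other than Spider-Man are hypothetical.

```python
from collections import deque


def first_match(neighbours, start, predicate, max_hops):
    """Breadth-first search that returns the nearest node satisfying predicate.

    Returns (node, hops), or None if nothing matches within max_hops.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node != start and predicate(node):
            return node, hops
        if hops < max_hops:
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return None


# Hypothetical data: relationships are treated as undirected.
neighbours = {
    "Spider-Man": ["Actor X"],
    "Actor X": ["Spider-Man", "Movie A"],
    "Movie A": ["Actor X", "Director Y"],
    "Director Y": ["Movie A", "Movie B"],
    "Movie B": ["Director Y"],
}
owners = {"Spider-Man": "Frank", "Movie A": "Frank", "Movie B": "John"}


def owned_by_john(node):
    return owners.get(node) == "John"


print(first_match(neighbours, "Spider-Man", owned_by_john, max_hops=2))
print(first_match(neighbours, "Spider-Man", owned_by_john, max_hops=4))
```

With a limit of 2 hops only Movie A is in range and it belongs to Frank, so nothing is found; raising the limit to 4 reaches Movie B, which is the same effect as raising the *2 in the query.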