neo4j - how to separate or organize my nodes? - neo4j

I'm new to Neo4J, so I'm using Rest API to insert data. Like this:
POST http://site.co:7474/db/data/node {"type":"user", "name":"mike"}
POST http://site.co:7474/db/data/node {"type":"admin", "name":"john"}
POST http://site.co:7474/db/data/node {"type":"photo", "url":"myfile.jpg"}
Is there any way to separate/organize my nodes by types and id's? Maybe folders or categories?
For example, in ElasticSearch I can separate by index, types and define the ID in URL, eg:
PUT http://site.co / objects / user / mike
PUT http://site.co / objects / admin / john
PUT http://site.co / objects / photo / myfile
|index| |type| |obj_id|

I suggest taking a look at Neo4j's free online training course at http://www.neo4j.org/training, which introduces you to some of the basic concepts behind the platform. One of those basic concepts is a label, which can be used to classify nodes. A node can have many different labels. They are used to group nodes together and are useful for querying the relationships between certain groups.
For instance, let's consider the movie domain.
MATCH (movie:Movie)<-[:ACTED_IN]-(actor:Person)
Here we are looking for actors who acted in a movie. (movie:Movie) represents the group of nodes that have the label Movie. (actor:Person) represents the group of nodes that are people, which is a classifier for an actor.
Take a look at the REST api documentation to understand usage for labels: http://docs.neo4j.org/chunked/stable/rest-api-node-labels.html

Related

Model source informations to maximize query performance

I am wondering about the best way (in terms of performance) to model data sources in Neo4j.
Consider the following scenario:
We are joining different datasets about the music domain in one graph. The data can range from different artists and styles to sales information. Important is to store the source of this information. E.g. do we have the data from a public source like DBpedia or some other private sources.
To be able to run queries only on certain datasets we have to include the source to each Node (and in the optimal way to each Relation). Of course one Node or Relation could have multiple sources.
There are three straight forward solutions:
Add a source property to each Node and Relation; index this property and use it in a cypher query. E.g.:
MATCH(n:Artist) WHERE n.source='DBpedia' return n
Add the source as Label to each Node and a Type to each Relation (can we have multiple types on one Relation?). E.g.:
CREATE (n:Artist:DBpediaSource:CustomerSource)
Create a separate Node for each Source and link all other Nodes to the corresponding Source Node. E.g.:
MATCH (n:Artist)-[:HASSOURCE]-(:DBpediaSource) return n
Of course for those examples the solution does not matter in terms of performance. However using the source in more complex queries and on a bigger graph (lets say with a few million Nodes and Relations) the way we model this challenge will have a significant influence on the performance.
One more complex example where the sources are also needed is the generation of a "sub graph".
We want to extract all Nodes and Relations from one or multiple Sources and for example export this to a new Neo4j instance, or restrict some graph algorithms such as PageRang to this "sub graph" without creating a separate Neo4j instance.
Does anyone in the community has experience with such a case? What is the best way to model this in terms of performance? Are there maybe other solutions?
Thanks for your help.

Collapse Relationships Neo4j?

Is it possible to "collapse" relationships in neo4j? I'm trying to graph relationships between people, and they can be related in multiple different ways - a shared course, jointly authored paper, RT or tweet mention. Right now I'm modeling people, courses, papers, and tweets all as nodes. But what I'm really interested in is modeling the person-person relationships that go through these intermediary nodes. Is it possible to graph the implicit relationship (person-course-person) explicit (person-person), while still keeping the course as a node? Something like this http://catalhoyuk.stanford.edu/network/teams/ - slide 2 and 3.
Any other data modeling suggestions welcome as well.
Yes, you can do it. The query
MATCH(a:Person)-->(:Course)<--(b:Person)
CREATE (a)-[:IMPLICIT_RELATIONSHIP]->(b)
will crate a relationship with type :IMPLICIT_RELATIONSHIP between all people that are related to the same course. But probably you don't need it since you can transverse from a to b and from b to a without this extra and not necessary relationship. Also if you want a virtual relationship at query time to use in a projection you can use the APOC procedure apoc.create.vRelationship.
The APOC procedures docs says:
Virtual Nodes and Relationships don’t exist in the graph, they are
only returned to the UI/user for representing a graph projection. They
can be visualized or processed otherwise. Please note that they have
negative id’s.

Neo4j graph modelling performance and querability, property to a node or as separate node plus relationship

I am teaching myself graph modelling and use Neo4j 2.2.3 database with NodeJs and Express framework.
I have skimmed through the free neo4j graph database book and learned how to model a scenario, when to use relationship and when to create nodes, etc.
I have modelled a vehicle selling scenario, with following structure
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), majority of the features of a vehicle is created as separate nodes and shared among vehicles with common feature with relationships.
Could you please advise whether roles (labels) like color, body type, driving side (left drive or right drive), gearbox and others should be seperate nodes or properties of vehicle node? Which option is more performance friendly, and easy to query?
I want to write a JS code that allows querying the graph with above structure with one or many search criteria. If majority of those features are properties of VEHICLE node then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However with existing graph model that I have it is tricky to search, specially when there are multiple criteria that are not necessarily a properties of VEHICLE node but separate nodes and linked via relationship.
Any ideas and advise in regards to existing structure of the graph to make it more query-able as well as performance friendly would be much appreciated. If we imagine a scenario with 1000 VEHICLE nodes that would generate 15000 relationship, sounds a bit scary and if it hits a million VEHICLE then at most 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs, it looks like you have a decent start.
Don't be concerned at all with the number of relationships. That's what graph databases are good at, so I wouldn't be too concerned about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's less fast to scan all nodes of a label and find the items where a property=a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There's manuals, and other types...if it's a property value, you won't later easily be able to decide to store 4 other sub-types/sub-aspects of "gearbox". If it were a node, that would later be easy because you could add more properties to the node, or relate other things.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.

Neo4j suggestion on large scale

i need to implement a suggestion system for my project
in this system we should recommend people base on some parameters like current city, education, friend of friends etc.
i have designed this by creating(update) may_know relations when users edit their profile or become friend with someone and i will retrieve them by MATCH u-[r:MAY_KNOW]-x RETURN * ORDER BY r.weight so people can find most like people to them
but i think this is not a best practice because soon may_know relation from/to every user can reach even milions and scan and sorting them will be heavy cost
do you have a better idea?
Depends a bit on the data-structure, I assume there are relationships to cities, education facilities and friends. So you don't actually have MAY_KNOW relationships as those are only inferred?
Also it depends if you want to create a cross products between all your users (how many) and how you would want to filter out non-related people.
Perhaps check out this blog post from Max: http://maxdemarzi.com/2013/04/19/match-making-with-neo4j/
So something like this query might work (depending on the data volume I'd rewrite it in the Java API).
match (p:Person {id:{user_id})
match (p)-[:LIVES_IN]->(:City)<-[:LIVES_IN]-(other)
match (p)-[:GRADUATED]->(:School)<-[:GRADUATED]-(other)
match (p)-[:KNOWS]->(:Person)<-[:KNOWS]-(other)
RETURN other

Why do relationships as a concept exist in neo4j or graph databases in general?

I can't seem to find any discussion on this. I had been imagining a database that was schemaless and node based and heirarchical, and one day I decided it was too common sense to not exist, so I started searching around and neo4j is about 95% of what I imagined.
What I didn't imagine was the concept of relationships. I don't understand why they are necessary. They seem to add a ton of complexity to all topics centered around graph databases, but I don't quite understand what the benefit is. Relationships seem to be almost exactly like nodes, except more limited.
To explain what I'm thinking, I was imagining starting a company, so I create myself as my first nodes:
create (u:User { u.name:"mindreader"});
create (c:Company { c.name:"mindreader Corp"});
One day I get a customer, so I put his company into my db.
create (c:Company { c.name:"Customer Company"});
create (u:User { u.name:"Customer Employee1" });
create (u:User { u.name:"Customer Employee2"});
I decide to link users to their customers
match (u:User) where u.name =~ "Customer.*"
match (c:Company) where c.name =~ "Customer.*
create (u)-[:Employee]->(c);
match (u:User where name = "mindreader"
match (c:Company) where name =~ "mindreader.*"
create (u)-[:Employee]->(c);
Then I hire some people:
match (c:Company) where c.name =~ "mindreader.*"
create (u:User { name:"Employee1"})-[:Employee]->(c)
create (u:User { name:"Employee2"})-[:Employee]->(c);
One day hr says they need to know when I hired employees. Okay:
match (c:Company)<-[r:Employee]-(u:User)
where name =~ "mindreader.*" and u.name =~ "Employee.*"
set r.hiredate = '2013-01-01';
Then hr comes back and says hey, we need to know which person in the company recruited a new employee so that they can get a cash reward for it.
Well now what I need is for a relationship to point to a user but that isn't allowed (:Hired_By relationship between :Employee relationship and a User). We could have an extra relationship :Hired_By, but if the :Employee relationship is ever deleted, the hired_by will remain unless someone remembers to delete it.
What I could have done in neo4j was just have a
(u:User)-[:hiring_info]->(hire_info:HiringInfo)-[:hired_by]->(u:User)
In which case the relationships only confer minimal information, the name.
What I originally envisioned was that there would be nodes, and then each property of a node could be a datatype or it could be a pointer to another node. In my case, a user record would end up looking like:
User {
name: "Employee1"
hiring_info: {
hire_date: "2013-01-01"
hired_by: u:User # -> would point to a user
}
}
Essentially it is still a graph. Nodes point to each other. The name of the relationship is just a field in the origin node. To query it you would just go
match (u:User) where ... return u.name, u.hiring_info.hiring_date, u.hiring_info.hired_by.name
If you needed a one to many relationship of the same type, you would just have a collection of pointers to nodes. If you referenced a collection in return, you'd get essentially a join. If you delete hiring_info, it would delete the pointer. References to other nodes would not have to be a disorganized list at the toplevel of a node. Furthermore when I query each user I will know all of the info about a user without both querying for the user itself and also all of its relationships. I would know his name and the fact that he hired someone in the same query. From the database backend, I'm not sure much would change.
I see quite a few questions from people asking whether they should use nodes or relationships to model this or that, and occasionally people asking for a relationship between relationships. It feels like the XML problem where you are wondering if a pieces of information should be its own tag or just a property its parent tag.
The query engine goes to great pains to handle relationships, so there must be some huge advantage to having them, but I can't quite see it.
Different databases are for different things. You seem to be looking for a noSQL database.
This is an extremely wide topic area that you've reached into, so I'll give you the short of it. There's a spectrum of database schemas, each of which have different use cases.
NoSQL aka Non-relational Databases:
Every object is a single document. You can have references to other documents, but any additional traversal means you're making another query. Times when you don't have relationships between your data very often, and are usually just going to want to query once and have a large amount of flexibly-stored data as the document that is returnedNote: These are not "nodes". Node have a very specific definition and implies that there are edges.)
SQL aka Relational Databases:
This is table land, this is where foreign keys and one-to-many relationships come into play. Here you have strict schemas and very fast queries. This is honestly what you should use for your user example. Small amounts of data where the relationships between things are shallow (You don't have to follow a relationship more than 1-2 times to get to the relevant entry) are where these excel.
Graph Database:
Use this when relationships are key to what you're trying to do. The most common example of a graph is something like a social graph where you're connecting different users together and need to follow relationships for many steps. (Figure out if two people are connected within a depth for 4 for instance)
Relationships exist in graph databases because that is the entire concept of a graph database. It doesn't really fit your application, but to be fair you could just keep more in the node part of your database. In general the whole idea of a database is something that lets you query a LOT of data very quickly. Depending on the intrinsic structure of your data there are different ways that that makes sense. Hence the different kinds of databases.
In strongly connected graphs, Neo4j is 1000x faster on 1000x the data than a SQL database. NoSQL would probably never be able to perform in a strongly connected graph scenario.
Take a look at what we're building right now: http://vimeo.com/81206025
Update: In reaction to mindreader's comment, we added the related properties to the picture:
RDBM systems are tabular and put more information in the tables than the relationships. Graph databases put more information in relationships. In the end, you can accomplish much the same goals.
However, putting more information in relationships can make queries smaller and faster.
Here's an example:
Graph databases are also good at storing human-readable knowledge representations, being edge (relationship) centric. RDF takes it one step further were all information is stored as edges rather than nodes. This is ideal for working with predicate logic, propositional calculus, and triples.
Maybe the right answer is an object database.
Objectivity/DB, which now supports a full suite of graph database capabilities, allows you to design complex schema with one-to-one, one-to-many, many-to-one, and many-to-many reference attributes. It has the semantics to view objects as graph nodes and edges. An edge can be just the reference attribute from one node to another or an edge can exist as an edge object that sits between two nodes.
An edge object can have any number of attribute and can have references off to other objects, as shown in the diagram below.
Being able to "hang" complex objects off of an edge allows Objectivity/DB to support weighted queries where the edge-weight can be calculated using a user-defined weight calculator operator. The weight calculator operator can build the weight from a static attribute on the edge or build the weight by digging down through the objects connected to the edge. In the picture, above, we could create a edge-weight calculator that computes the sum of the CallDetail lengths connected to the Call edge.

Resources