Working with cyclical graphs in RoR - ruby-on-rails

I haven't attempted to work with graphs in Rails before, and am curious as to the best approach. Some background:
I am making a Rails 3 site and thought it would be interesting to store certain objects and their relationships as a graph, where each object is a node and some are connected to show that the two objects are related. The graph does contain cycles, and there wouldn't be more than 100-150 nodes in the graph (probably only closer to 50). One node probably wouldn't have more than five edges, with an average of three to four edges per node.
I figured a simple join table with two columns (each the ID of the object) might be the easiest way to do it, but I doubt it's the best way. Another thought was to use a plugin such as acts_as_tree (which doesn't appear to be updated for Rails 3...) or acts_as_tree_with_dotted_ids, but I am unsure of their ability to work with cycles rather than hierarchical trees.
the most I would currently like is to easily traverse from one node to its siblings. I really can't think of a reason I would want to traverse to a node's sibling's sibling, which is why I was considering just making an SQL join table. I only want to have a section on the site to display objects related to a specified object, and this graph is one of the ways I am specifying relationships.
Advice? Things I should check out? Thanks!

I would use two SQL tables, node and link where a link is simply two foreign keys, source and target. This way you can get the set of inbound or outbound links to a node by performing an SQL select query by constraining the source or target node id. You could take it a step further by adding a "graph_id" column to both tables so you can retrieve all the data for a graph in two queries and build it as a post-processing step.
This strategy should be just as easy (if not easier) than finding, installing, learning to use, and implementing a plugin to do the same, IMHO.

Depending on whether your concern is primarily about operations on graphs, or on storage of graphs, what you need is potentially quite different. If you want convenient operations on graphs, investigate the gem "rgl" (ruby graph library). It has implementations of most of the basic classic traversal and search algorithms.
If you're dealing with something on the order of 150 nodes, you can probably get away with a minimalist adjacency list representation in the database itself, or incidence list. Then you can feed that into RGL for traversal and search operations.
If I remember correctly, RGL has enough abstraction that you may be able to work with an existing class structure and you simply provide methods to get adjacent nodes.

Assuming that it is a directed graph, use a mapping table such as
id | src | dest
where src and dest are FKs to your object table.
If your objects are not all of the same type, either have them all inherit a ruby class or have another table:
id | type | type_id
Where type is the type of object it is and type_id is its id in another table.
By doing this, you should be able to get an array of objects for each object that it points to using:
select dest
from maptable
where dest = self.id
If you need to know its inbound edges, you can preform the same type of query using src instead of dest.
From there, you should be able to easily write any graph algorithms that you want. If you need weights, you can modify the mapping table as such.
id | src | dest | weight

Related

Model source informations to maximize query performance

I am wondering about the best way (in terms of performance) to model data sources in Neo4j.
Consider the following scenario:
We are joining different datasets about the music domain in one graph. The data can range from different artists and styles to sales information. Important is to store the source of this information. E.g. do we have the data from a public source like DBpedia or some other private sources.
To be able to run queries only on certain datasets we have to include the source to each Node (and in the optimal way to each Relation). Of course one Node or Relation could have multiple sources.
There are three straight forward solutions:
Add a source property to each Node and Relation; index this property and use it in a cypher query. E.g.:
MATCH(n:Artist) WHERE n.source='DBpedia' return n
Add the source as Label to each Node and a Type to each Relation (can we have multiple types on one Relation?). E.g.:
CREATE (n:Artist:DBpediaSource:CustomerSource)
Create a separate Node for each Source and link all other Nodes to the corresponding Source Node. E.g.:
MATCH (n:Artist)-[:HASSOURCE]-(:DBpediaSource) return n
Of course for those examples the solution does not matter in terms of performance. However using the source in more complex queries and on a bigger graph (lets say with a few million Nodes and Relations) the way we model this challenge will have a significant influence on the performance.
One more complex example where the sources are also needed is the generation of a "sub graph".
We want to extract all Nodes and Relations from one or multiple Sources and for example export this to a new Neo4j instance, or restrict some graph algorithms such as PageRang to this "sub graph" without creating a separate Neo4j instance.
Does anyone in the community has experience with such a case? What is the best way to model this in terms of performance? Are there maybe other solutions?
Thanks for your help.

Is there a benefit to implementing singletons in Neo?

My business requirement says I need to add an arbitrary number of well-defined (AKA not dynamic, not unknown) attributes to certain types of nodes. I am pretty sure that while there could be 30 or 40 different attributes, a node will probably have no more than 4 or 5 of them. Of course there will be corner cases...
In this context, I am generically using 'attribute' as a tag wanted by the business, and not in the Neo4J sense.
I'll be expected to report on which nodes have which attributes. For example, I might have to report on which nodes have the "detention", "suspension", or "double secret probation" attributes.
One way is to simply have an array of appropriate attributes on each entity. But each query would require a search of all nodes. Or, I could create explicit attributes on each node. Now they could be indexed. I'm not seriously considering either of these approaches.
Another way is to implement each attribute as a singleton Neo node, and allow many (tens of thousands?) of other nodes to relate to these nodes. This implementation would have 10,000 nodes but 40,000 relationships.
Finally, the attribute nodes could be created and used by specific entity nodes on an as-needed basis. In this case, if 10,000 entities had an average of 4 attributes, I'd have a total of 50,000 nodes.
As I type this, I realize that in the 2nd case, I still have 40,000 relationships; the 'truth' of the situation did not change.
Is there a reason to avoid the 'singleton' implementation? I could put timestamps on the relationships. But those wouldn't be indexed...
For your simple use case, I'd suggest an approach you didn't list -- which is to use a node label for each "attribute".
Nodes can have multiple labels, and neo4j can quickly iterate through all the nodes with the same label -- making it very quick and easy to find all the nodes with a specific label.
For example:
MATCH (n:Detention)
RETURN n;

Neo4j graph modelling performance and querability, property to a node or as separate node plus relationship

I am teaching myself graph modelling and use Neo4j 2.2.3 database with NodeJs and Express framework.
I have skimmed through the free neo4j graph database book and learned how to model a scenario, when to use relationship and when to create nodes, etc.
I have modelled a vehicle selling scenario, with following structure
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), majority of the features of a vehicle is created as separate nodes and shared among vehicles with common feature with relationships.
Could you please advise whether roles (labels) like color, body type, driving side (left drive or right drive), gearbox and others should be seperate nodes or properties of vehicle node? Which option is more performance friendly, and easy to query?
I want to write a JS code that allows querying the graph with above structure with one or many search criteria. If majority of those features are properties of VEHICLE node then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However with existing graph model that I have it is tricky to search, specially when there are multiple criteria that are not necessarily a properties of VEHICLE node but separate nodes and linked via relationship.
Any ideas and advise in regards to existing structure of the graph to make it more query-able as well as performance friendly would be much appreciated. If we imagine a scenario with 1000 VEHICLE nodes that would generate 15000 relationship, sounds a bit scary and if it hits a million VEHICLE then at most 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs, it looks like you have a decent start.
Don't be concerned at all with the number of relationships. That's what graph databases are good at, so I wouldn't be too concerned about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's less fast to scan all nodes of a label and find the items where a property=a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There's manuals, and other types...if it's a property value, you won't later easily be able to decide to store 4 other sub-types/sub-aspects of "gearbox". If it were a node, that would later be easy because you could add more properties to the node, or relate other things.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.

Why do relationships as a concept exist in neo4j or graph databases in general?

I can't seem to find any discussion on this. I had been imagining a database that was schemaless and node based and heirarchical, and one day I decided it was too common sense to not exist, so I started searching around and neo4j is about 95% of what I imagined.
What I didn't imagine was the concept of relationships. I don't understand why they are necessary. They seem to add a ton of complexity to all topics centered around graph databases, but I don't quite understand what the benefit is. Relationships seem to be almost exactly like nodes, except more limited.
To explain what I'm thinking, I was imagining starting a company, so I create myself as my first nodes:
create (u:User { u.name:"mindreader"});
create (c:Company { c.name:"mindreader Corp"});
One day I get a customer, so I put his company into my db.
create (c:Company { c.name:"Customer Company"});
create (u:User { u.name:"Customer Employee1" });
create (u:User { u.name:"Customer Employee2"});
I decide to link users to their customers
match (u:User) where u.name =~ "Customer.*"
match (c:Company) where c.name =~ "Customer.*
create (u)-[:Employee]->(c);
match (u:User where name = "mindreader"
match (c:Company) where name =~ "mindreader.*"
create (u)-[:Employee]->(c);
Then I hire some people:
match (c:Company) where c.name =~ "mindreader.*"
create (u:User { name:"Employee1"})-[:Employee]->(c)
create (u:User { name:"Employee2"})-[:Employee]->(c);
One day hr says they need to know when I hired employees. Okay:
match (c:Company)<-[r:Employee]-(u:User)
where name =~ "mindreader.*" and u.name =~ "Employee.*"
set r.hiredate = '2013-01-01';
Then hr comes back and says hey, we need to know which person in the company recruited a new employee so that they can get a cash reward for it.
Well now what I need is for a relationship to point to a user but that isn't allowed (:Hired_By relationship between :Employee relationship and a User). We could have an extra relationship :Hired_By, but if the :Employee relationship is ever deleted, the hired_by will remain unless someone remembers to delete it.
What I could have done in neo4j was just have a
(u:User)-[:hiring_info]->(hire_info:HiringInfo)-[:hired_by]->(u:User)
In which case the relationships only confer minimal information, the name.
What I originally envisioned was that there would be nodes, and then each property of a node could be a datatype or it could be a pointer to another node. In my case, a user record would end up looking like:
User {
name: "Employee1"
hiring_info: {
hire_date: "2013-01-01"
hired_by: u:User # -> would point to a user
}
}
Essentially it is still a graph. Nodes point to each other. The name of the relationship is just a field in the origin node. To query it you would just go
match (u:User) where ... return u.name, u.hiring_info.hiring_date, u.hiring_info.hired_by.name
If you needed a one to many relationship of the same type, you would just have a collection of pointers to nodes. If you referenced a collection in return, you'd get essentially a join. If you delete hiring_info, it would delete the pointer. References to other nodes would not have to be a disorganized list at the toplevel of a node. Furthermore when I query each user I will know all of the info about a user without both querying for the user itself and also all of its relationships. I would know his name and the fact that he hired someone in the same query. From the database backend, I'm not sure much would change.
I see quite a few questions from people asking whether they should use nodes or relationships to model this or that, and occasionally people asking for a relationship between relationships. It feels like the XML problem where you are wondering if a pieces of information should be its own tag or just a property its parent tag.
The query engine goes to great pains to handle relationships, so there must be some huge advantage to having them, but I can't quite see it.
Different databases are for different things. You seem to be looking for a noSQL database.
This is an extremely wide topic area that you've reached into, so I'll give you the short of it. There's a spectrum of database schemas, each of which have different use cases.
NoSQL aka Non-relational Databases:
Every object is a single document. You can have references to other documents, but any additional traversal means you're making another query. Times when you don't have relationships between your data very often, and are usually just going to want to query once and have a large amount of flexibly-stored data as the document that is returnedNote: These are not "nodes". Node have a very specific definition and implies that there are edges.)
SQL aka Relational Databases:
This is table land, this is where foreign keys and one-to-many relationships come into play. Here you have strict schemas and very fast queries. This is honestly what you should use for your user example. Small amounts of data where the relationships between things are shallow (You don't have to follow a relationship more than 1-2 times to get to the relevant entry) are where these excel.
Graph Database:
Use this when relationships are key to what you're trying to do. The most common example of a graph is something like a social graph where you're connecting different users together and need to follow relationships for many steps. (Figure out if two people are connected within a depth for 4 for instance)
Relationships exist in graph databases because that is the entire concept of a graph database. It doesn't really fit your application, but to be fair you could just keep more in the node part of your database. In general the whole idea of a database is something that lets you query a LOT of data very quickly. Depending on the intrinsic structure of your data there are different ways that that makes sense. Hence the different kinds of databases.
In strongly connected graphs, Neo4j is 1000x faster on 1000x the data than a SQL database. NoSQL would probably never be able to perform in a strongly connected graph scenario.
Take a look at what we're building right now: http://vimeo.com/81206025
Update: In reaction to mindreader's comment, we added the related properties to the picture:
RDBM systems are tabular and put more information in the tables than the relationships. Graph databases put more information in relationships. In the end, you can accomplish much the same goals.
However, putting more information in relationships can make queries smaller and faster.
Here's an example:
Graph databases are also good at storing human-readable knowledge representations, being edge (relationship) centric. RDF takes it one step further were all information is stored as edges rather than nodes. This is ideal for working with predicate logic, propositional calculus, and triples.
Maybe the right answer is an object database.
Objectivity/DB, which now supports a full suite of graph database capabilities, allows you to design complex schema with one-to-one, one-to-many, many-to-one, and many-to-many reference attributes. It has the semantics to view objects as graph nodes and edges. An edge can be just the reference attribute from one node to another or an edge can exist as an edge object that sits between two nodes.
An edge object can have any number of attribute and can have references off to other objects, as shown in the diagram below.
Being able to "hang" complex objects off of an edge allows Objectivity/DB to support weighted queries where the edge-weight can be calculated using a user-defined weight calculator operator. The weight calculator operator can build the weight from a static attribute on the edge or build the weight by digging down through the objects connected to the edge. In the picture, above, we could create a edge-weight calculator that computes the sum of the CallDetail lengths connected to the Call edge.

Modelling alternatives and performance when traversing a tree structure in Neo4J

I modelled a tree structure using the Neo4J graph database. All nodes represent a category with a characterising name. So I have to traverse my tree very often from the root to a specific node / category. To which node depends on a list coming as input. This list contains strings representing the names of the categories from the root to the target node.
I wonder, if it would be effective to store these names as the types of the edges instead of a name property in the particular nodes.
I thought that when I do so, Neo4J doesn't have to look for the fitting name property of every child node every time going a step deeper in the tree. Instead Neo4J could lookup the name in the map that contains the outgoing edges.
What do you think?
Sounds sensible. How many different names do you have? If it is just categories those shouldn't be millions.
Did you load your data into the graph and run a performance comparison between both approaches? Is it a performance critical thing in your graph?

Resources