Rugged: Path and commit_id of tree - ruby-on-rails

I have a Rugged tree object and I want to find out its path (relative to the root) and the commit id under which that tree was written. For example:
tree = repo.lookup '7892eeee70c08fae4db63aef7000dea39f883b30' #sha/oid of tree
What operations should I perform on this tree object so that I get its path and commit id?

That information is not stored in the tree at all. Git uses Merkle trees: parents know which child trees they contain, but each tree can be contained in many commits (this is the typical situation, since some subdirectories are rarely touched).
A tree may also be accessible through many different paths, if those directories have the same contents.
The only way to figure out where a tree belongs is to look at each commit and recursively search from its root tree for the tree you were given. This is a very expensive operation.
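To illustrate why, here is a minimal sketch of that brute-force search in plain Ruby. The hash-based tree objects and the helpers `paths_to_tree`/`find_tree` are stand-ins made up for illustration, not the rugged API (in real code you would combine `Repository#walk` with `Tree#walk`):

```ruby
# Toy stand-ins for Rugged objects: a "tree" is {oid:, entries:}, where each
# entry maps a name to either a nested tree hash or a blob oid string.
def paths_to_tree(tree, target_oid, prefix = "")
  return [prefix] if tree[:oid] == target_oid
  tree[:entries].flat_map do |name, child|
    next [] unless child.is_a?(Hash)  # blobs can't contain the tree
    paths_to_tree(child, target_oid, prefix.empty? ? name : "#{prefix}/#{name}")
  end
end

# Every commit must be scanned, because many commits can share the same tree,
# so the answer is a list of (commit id, path) pairs, not a single location.
def find_tree(commits, target_oid)
  commits.flat_map do |commit|
    paths_to_tree(commit[:tree], target_oid).map { |path| [commit[:id], path] }
  end
end
```

Note that the cost is proportional to the number of commits times the size of each commit's tree, which is exactly why this is rarely worth doing.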
I would recommend you take a step back and figure out why you think you need to figure out where a tree is reachable from. It sounds like you've already decided many steps and you're asking about a detail, when you should be looking at it from a higher level.

Related

How to Copy Sub-Graph in Neo4j using Cypher

I am trying to simulate a file system using Neo4j, Cypher, and Python (Py2Neo).
I have created the data model as shown in the following screenshot.
Type=0 means folder and type=1 means file.
I am implementing functions like Copy and Move for files/folders. The Move function looks simple: I can create a new relationship and delete the old one. But copying a file/folder requires copying its sub-graph.
How can I copy the sub-graph?
I am creating a python module so trying to avoid apoc.
Even though you're trying to avoid APOC, it already has this feature implemented in the most recent release: apoc.refactor.cloneSubgraph()
For a non-APOC approach you'll need to accomplish the following:
1. MATCH the distinct nodes and relationships that make up the subgraph you want to clone. Keeping a separate list for each will make this easier to process.
2. Clone the nodes, and keep a mapping from each original node to its clone.
3. Process the relationships: find each one's start and end nodes, follow the mapping to the cloned nodes, create the same relationship type between the cloned start and end nodes, then copy the properties from the original relationship. This way no relationship touches the originals, only the clones.
4. Determine which nodes you want to reanchor (you probably don't want to clone the original anchor), and for any relationship that goes to/from such a node, create it (via step 3) against the node you want as the new anchor (for example, the new :File which should be the parent of the cloned directory tree).
All this is tough to do in Cypher (steps 3 and 4 in particular), thus the reason all this was encapsulated in apoc.refactor.cloneSubgraph().
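The clone-the-nodes and remap-the-relationships steps amount to "clone with a mapping". Here is a minimal sketch of that logic in plain Ruby, where hashes stand in for Neo4j nodes and relationships; `clone_subgraph` and the data shapes are made up for illustration and are not a driver API (reanchoring would then redirect mapped endpoints to the anchor node instead of a clone):

```ruby
# Nodes: {id:, props:}; relationships: {type:, start:, end:, props:},
# where start/end are node ids.
def clone_subgraph(nodes, rels, next_id)
  # Clone each node and remember original id -> clone id.
  mapping = {}
  cloned_nodes = nodes.map do |n|
    clone = { id: next_id, props: n[:props].dup }
    mapping[n[:id]] = next_id
    next_id += 1
    clone
  end
  # Recreate each relationship between the *cloned* endpoints, copying the
  # type and properties, so no edge touches the original nodes.
  cloned_rels = rels.map do |r|
    { type: r[:type],
      start: mapping.fetch(r[:start]),
      end: mapping.fetch(r[:end]),
      props: r[:props].dup }
  end
  [cloned_nodes, cloned_rels, mapping]
end
```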

Store a user's traversal path through a nested hierarchy

I have a nested tree stored in Neo4j, where each node can have a (n)-[:CHILD]->(c) relationship with other nodes, allowing you to query the entire tree from a given node down with MATCH (n)-[c:CHILD*]-(m).
What I am having trouble with, is figuring out how to store a path that a user takes as they walk through a tree. For instance a query to return the path would be (user)-[:USER_PATH*]->(node).
However, the path has to stay along :CHILD relationships; it cannot jump outside of its branch. A user path cannot jump from the leaf of one branch to a leaf of another branch without first retracing its way back up the path it came from until it finds a fork that leads down to the new node.
Also, I do not think shortest path will work, because it's not the shortest path I want; I want the user's actual path. But it should disregard relationships that were abandoned as the user backed out of any branches; it should not leave dead paths lying around.
How would I update the graph after each node is walked to, so that these rules stay intact?
It can be assumed that in drilling down into a new branch, it can only step through to one more set of siblings. However all branches it came through are still open, so their siblings can be selected.
Best I can figure is that it needs to:
- "prefer" walking the :USER_PATH relationships as long as it can,
- until it needs to break that path to reach the new node,
- at which point it creates any new relationships,
- then deletes any old relationships that are no longer on that path.
I have no idea how to accomplish that though.
I have spent a ton of time in trial and error and googling to no avail.
Thanks in advance!
Given the image below:
red node = User
green nodes = A valid node to be a new "target"
blue nodes = invalid target node
So if you were to back out of the leaf node it is in currently, it would delete that final :RATIONAL_PATH relation in the chain.
Also, the path should adjust to whichever green node is selected, but keep the existing :RATIONAL_PATH intact for as far as possible.
Personally I think removing the existing path and creating a new one with shortestPath() is probably the best way to go. The cost of reusing the existing path and performing cleanup is often going to be higher and more complex than simply starting over.
Otherwise, the approach to take would be to match down to the last node of your path, and then perform a shortestPath() to the new node, and create your path.
And then we'd have to perform cleanup. This would probably involve matching all paths along :RATIONAL_PATH relationships to the end node, giving a group of paths; the one with the shortest distance is the one we keep. We'd then need to collect its relationships, collect the relationships of the other, no-longer-valid paths, do some set subtraction to get the relationships not used in the shortest path, and delete them.
That's quite a bit of work that should probably be avoided.
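If you do want to reuse the existing path, the core of the cleanup is a longest-common-prefix computation: keep the relationships up to the fork, drop the dead tail, and create the new segment. A sketch of just that logic in Ruby (paths as arrays of node ids; `adjust_path` is a made-up helper, purely illustrative):

```ruby
# old_path and new_path are arrays of node ids from the root of the
# traversal down to the user's previous / new position.
def adjust_path(old_path, new_path)
  # Find the fork: the first index at which the two paths disagree.
  common = 0
  common += 1 while common < old_path.size &&
                    common < new_path.size &&
                    old_path[common] == new_path[common]
  {
    keep:   old_path[0...common],  # relationships left intact
    delete: old_path[common..],    # dead tail to remove
    create: new_path[common..]     # new segment to create
  }
end
```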

Neo4j data modeling for branching/merging graphs

We are working on a system where users can define their own nodes and connections, and can query them with arbitrary queries. A user can create a "branch" much like in SCM systems and later can merge back changes into the main graph.
Is it possible to create an efficient data model for that in Neo4j? What would be the best approach? Of course we don't want to duplicate all the graph data for every branch as we have several million nodes in the DB.
I have read Ian Robinson's excellent article on Time-Based Versioned Graphs and Tom Zeppenfeldt's alternative approach with Network versioning using relationnodes but unfortunately they are solving a different problem.
I would love to know what you guys think; any thoughts appreciated.
I'm not sure what your experience level is. Any insight into that would be helpful.
It would be my guess that this system would rely heavily on tags on the nodes. Maybe come up with 5-20 node types that are very broad, including the names and a few key properties. Then you could allow users to select from those base categories and create their own spin-offs by adding tags.
Say you had your basic categories of (:Thing{Name:"",Place:""}) and (:Object{Category:"",Count:4})
Your users would have a drop-down or something with "Thing" and "Object". They'd select "Thing" for instance, and type a new label (Say "Cool"), values for "Name" and "Place", and add any custom properties (IsAwesome:True).
So now you've got a new node (:Thing:Cool{Name:"Rock",Place:"Here",IsAwesome:True}), which allows you to query by broad categories or by a user's created categories. Hopefully this keeps each broad category to a proportional fraction of your overall node count.
Not sure if this is exactly what you're asking for. Good luck!
Hmm. While this isn't insane, think about the type of system you're replacing first: SQL. In SQL databases you wouldn't use branches, because it's data storage. If you're trying to get data from multiple sources into one DB, I'd suggest exporting them all to CSV files and using a MERGE statement in Cypher to bring them all into your DB at once.
This could work similarly to branching by having each person run a script on their own copy of the DB when you merge, taking all the nodes and edges in their copy and putting them into a CSV, e.g.:
MATCH (n)-[e]-(n2)
RETURN n, e, n2
Then comparing these CSV's as you pull them into your final DB to see what's already there from the other copies.
LOAD CSV WITH HEADERS FROM "file:///YourFile.CSV" AS file
MERGE (N:Node{Property1:file.Property1, Property2:file.Property2})
MERGE (N2:Node{Property1:file.Property3, Property2:file.Property4})
MERGE (N)-[E:Edge]->(N2)
This will work, as long as you're using node types that you already know about and each person isn't creating new data structures that you don't know about until the merge.
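The merge-by-key idea behind those MERGE statements can be sketched in plain Ruby: export each copy's edges as rows, then combine them keyed on their identifying values so duplicates collapse into one. The row format and `merge_edge_dumps` are made up for illustration:

```ruby
# Each dump is a CSV-like string, one edge per line: "src_key,edge_type,dest_key".
# Combining several dumps and de-duplicating on the full row is what MERGE
# does per node/edge pattern when the CSVs are loaded into the final DB.
def merge_edge_dumps(dumps)
  dumps
    .flat_map { |s| s.lines.map { |line| line.strip.split(",") } }
    .uniq
end
```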

Neo4j Traversal Framework Expander and Ordering

I am trying to understand the Neo4j Java Traversal API, but after a thorough reading I am stuck on certain points.
What I seem to know:
Difference between PathExpander and BranchOrderingPolicy. As per my understanding, the former tells what relationships are eligible to be explored from a particular position and the latter specifies the ordering in which they should be evaluated.
I would like to know the following things:
Whether, or to what extent, this understanding is correct, or how it should be altered to give the correct understanding.
If correct, how is PathExpander different from Evaluator.
How do PathExpander and BranchOrderingPolicy work? What I intend to ask is: is the PathExpander consulted every time a relationship is added to the traversal, and what does it do with the iterable of relationships returned? Similarly with branch ordering.
During traversal, how and when do the components Expander, BranchOrdering, Evaluator, and Uniqueness come into the picture? Basically I wish to know the template algorithm, where one would say: first the expander is asked for a collection of relationships to expand, then the ordering policy is consulted to select one of the eligible ones, and so on.
If correct, does the ordering policy specified by BranchOrderingPolicy apply to the eligible relationships only (after the expander has run)? Perhaps it must.
Please include anything else that might be helpful in understanding the API.
I'll try to describe these parts to the best of my ability.
As to the difference between PathExpander and BranchOrderingPolicy: a PathExpander is invoked for each traversal branch the first time the traversal continues from that branch. (A traversal branch is a node together with the path leading up to it; note that there may be many paths, i.e. many branches, to the same node, depending mostly on uniqueness.) The result of invoking the PathExpander is an Iterator<Relationship> which lazily provides new relationships off of that traversal branch when needed. That brings us to BranchOrderingPolicy, which looks at all alive traversal branches. By "alive" I mean a branch with one or more remaining relationships, such that more branches can be created from it. Given all alive branches, it picks one of them and follows its next relationship (retrieved from the relationship iterator on that branch; on the first call it initializes that iterator using the PathExpander, as described above).
Difference between PathExpander and Evaluator: that split is very much a matter of convenience and separation of concerns. PathExpander grows the number of branches and Evaluator filters, i.e. reduces the number of branches. An expander creates new branches that are evaluated by the Evaluator. With that said, you could write a PathExpander that does both things, and it could be more efficient that way. But the convenience of having them separated, where there can be multiple Evaluators, is quite useful.
See above (1)
Some of this is described in (1), but the broader picture is that the BranchOrderingPolicy is the driver of the traversal: out of every alive branch it picks one and follows it one relationship out to a new branch. Only branches that comply with the selected uniqueness are created. The relationships for a branch are retrieved the first time this happens for that branch, in the form of a lazy relationship iterator using the PathExpander. New branches are evaluated the first time they are selected; one result of the evaluation is whether the branch is a dead end, and the other is whether to include it in the result returned to the user.
I think the above explains that.
Is this sufficient information?
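Putting the pieces above together, the template algorithm can be sketched as a small loop. This is plain Ruby mimicking the roles of expander, ordering policy, evaluator, and uniqueness; `traverse` and its arguments are made up for illustration and are not the Neo4j API:

```ruby
require "set"

# graph: { node => [neighbor, ...] }. expander and evaluator are procs
# standing in for PathExpander and Evaluator; popping the last alive branch
# plays the role of BranchOrderingPolicy (depth-first here).
def traverse(graph, start, expander:, evaluator:)
  visited = [start].to_set        # NODE_GLOBAL-style uniqueness
  alive   = [[start]]             # alive branches, each a path
  results = []
  until alive.empty?
    path = alive.pop                            # ordering policy picks a branch
    results << path if evaluator.call(path)     # evaluator: include in result?
    expander.call(graph, path).each do |n|      # expander (lazy in the real API)
      next if visited.include?(n)               # uniqueness filters new branches
      visited << n
      alive << (path + [n])
    end
  end
  results
end
```

In the real framework the evaluator also decides whether a branch should be pruned (a dead end); that second decision is omitted here to keep the loop short.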

Working with cyclical graphs in RoR

I haven't attempted to work with graphs in Rails before, and am curious as to the best approach. Some background:
I am making a Rails 3 site and thought it would be interesting to store certain objects and their relationships as a graph, where each object is a node and some are connected to show that the two objects are related. The graph does contain cycles, and there wouldn't be more than 100-150 nodes in the graph (probably only closer to 50). One node probably wouldn't have more than five edges, with an average of three to four edges per node.
I figured a simple join table with two columns (each the ID of the object) might be the easiest way to do it, but I doubt it's the best way. Another thought was to use a plugin such as acts_as_tree (which doesn't appear to be updated for Rails 3...) or acts_as_tree_with_dotted_ids, but I am unsure of their ability to work with cycles rather than hierarchical trees.
The most I would currently like is to easily traverse from one node to its siblings. I really can't think of a reason I would want to traverse to a node's sibling's sibling, which is why I was considering just making an SQL join table. I only want to have a section on the site to display objects related to a specified object, and this graph is one of the ways I am specifying relationships.
Advice? Things I should check out? Thanks!
I would use two SQL tables, node and link where a link is simply two foreign keys, source and target. This way you can get the set of inbound or outbound links to a node by performing an SQL select query by constraining the source or target node id. You could take it a step further by adding a "graph_id" column to both tables so you can retrieve all the data for a graph in two queries and build it as a post-processing step.
This strategy should be just as easy (if not easier) than finding, installing, learning to use, and implementing a plugin to do the same, IMHO.
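That two-table idea is easy to prototype even before touching ActiveRecord. Here is a plain-Ruby sketch of the node/link scheme; the `LinkTable` class and its method names are made up for illustration:

```ruby
# Minimal stand-in for the link table: links are [source_id, target_id]
# pairs, and the inbound/outbound queries mirror the SELECTs you would run.
class LinkTable
  def initialize
    @links = []   # each entry is [source_id, target_id]
  end

  def add(source, target)
    @links << [source, target]
  end

  def outbound(node)   # SELECT target FROM links WHERE source = ?
    @links.select { |s, _| s == node }.map { |_, t| t }
  end

  def inbound(node)    # SELECT source FROM links WHERE target = ?
    @links.select { |_, t| t == node }.map { |s, _| s }
  end
end
```

An extra `graph_id` column would simply become a third element in each entry, filtered the same way.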
Depending on whether your concern is primarily about operations on graphs, or on storage of graphs, what you need is potentially quite different. If you want convenient operations on graphs, investigate the gem "rgl" (ruby graph library). It has implementations of most of the basic classic traversal and search algorithms.
If you're dealing with something on the order of 150 nodes, you can probably get away with a minimalist adjacency list representation in the database itself, or incidence list. Then you can feed that into RGL for traversal and search operations.
If I remember correctly, RGL has enough abstraction that you may be able to work with an existing class structure and you simply provide methods to get adjacent nodes.
Assuming that it is a directed graph, use a mapping table such as
id | src | dest
where src and dest are FKs to your object table.
If your objects are not all of the same type, either have them all inherit from a common Ruby class or add another table:
id | type | type_id
Where type is the type of object it is and type_id is its id in another table.
By doing this, you should be able to get an array of the objects each object points to using:
select dest
from maptable
where src = self.id
If you need to know its inbound edges, you can perform the same type of query, selecting src where dest = self.id.
From there, you should be able to easily write any graph algorithms you want. If you need weights, you can extend the mapping table like so:
id | src | dest | weight
