Efficient Gremlin query to find a cycle in a graph

I have a big TinkerGraph (~80,000 vertices, ~160,000 edges) and I need to detect whether there is a cycle in it using the Apache TinkerPop/Gremlin query language. If there is one, I would like to obtain the vertices of one such cycle.
Is there a way to write an O(|V| + |E|) Gremlin query to find a cyclic path in a graph?
I tried using the queries from here and here, but they are too slow and they time out. I suspect that they are not O(|V| + |E|), but I am still learning TinkerPop and cannot yet evaluate the time/memory complexity of the TinkerGraph implementation.
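For context, a typical recipe-style cycle query looks roughly like the following (a sketch; the exact steps vary). It enumerates simple paths, which is worst-case exponential rather than O(|V| + |E|), which would explain the timeouts:
g.V().as('a').
  repeat(out().simplePath()).
  emit().
  out().where(eq('a')).
  path().
  limit(1)
Each traverser remembers its start vertex as 'a', walks only non-repeating paths, and reports a path whenever an out() step leads back to 'a'; limit(1) stops at the first cycle found.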

Related

Execution time of neo4j cypher query

I'm trying to find the execution time of GDS algorithms using the Community Edition of Neo4j. Is there any way to find it other than query logging, since that facility is specific to the Enterprise Edition?
Update:
I did as suggested. Why is the result 0 for computeMillis and preProcessingMillis?
Update 2:
The following table indicates the time in ms required for running Yen's algorithm to retrieve one path for each topology. However, the time does not depend on the graph size. Why? Is it normal to get such results?
When you are executing the mutate or the write mode of the algorithm, you can YIELD the computeMillis property, which tells you the execution time of the algorithm. Note that some algorithms, like PageRank, have more properties available to be YIELD-ed (an example follows the list below):
preProcessingMillis - Milliseconds for preprocessing the graph.
computeMillis - Milliseconds for running the algorithm.
postProcessingMillis - Milliseconds for computing the centralityDistribution.
writeMillis - Milliseconds for writing result data back.
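For example, a minimal sketch assuming an already-projected graph named 'myGraph' (the graph name and writeProperty are placeholders; PageRank is used since it is mentioned above):
CALL gds.pageRank.write('myGraph', {
  writeProperty: 'pagerank'
})
YIELD preProcessingMillis, computeMillis, postProcessingMillis, writeMillis
Regarding the update: on small graphs these values can legitimately come back as 0, since timings are reported in whole milliseconds and sub-millisecond phases round down to 0.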

How to specify maximum cost when running BFS Neo4j?

The docs of Neo4j data science library state:
There are multiple termination conditions supported for the traversal,
based on either reaching one of several target nodes, reaching a
maximum depth, exhausting a given budget of traversed relationship
cost, or just traversing the whole graph.
But in the algorithm-specific parameters I could not find any parameter for constraining the maximum cost of the traversal (or simply the number of relationships, if each cost is 1). The only parameters listed are startNodeId, targetNodes, and maxDepth.
Any idea if this can actually be done, or if the docs are incorrect?
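For reference, with the listed parameters the closest constraint I can express is a depth bound rather than a cost budget, roughly like this (a sketch; 'myGraph' and the node id are placeholders, and parameter/yield names vary across GDS versions):
MATCH (source) WHERE id(source) = 42
CALL gds.alpha.bfs.stream('myGraph', {
  startNodeId: id(source),
  maxDepth: 3
})
YIELD path
RETURN path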
Here is the list of procedures and functions for your reference. As you can see, Breadth First Search is still in the Alpha stage and no estimate function is available for it yet. You can also see that functions in the Beta and Production stages have a corresponding *.estimate function. These functions will give you an idea of how much memory will be used when you run those data science functions. An example of gds.nodeSimilarity.write.estimate can be found below:
CALL gds.nodeSimilarity.write.estimate('myGraph', {
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
nodeCount  relationshipCount  bytesMin  bytesMax  requiredMemory
9          9                  2592      2808      "[2592 Bytes ... 2808 Bytes]"

How does Titan achieve constant time lookup using HBase / Cassandra?

In the O'Reilly book "Graph Databases", in chapter 6, which is about how Neo4j stores a graph database, it says:
To understand why native graph processing is so much more efficient
than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships.
To traverse a network of m steps, the cost of the indexed approach, at
O(m log n), dwarfs the cost of O(m) for an implementation that uses
index-free adjacency.
It is then explained that Neo4j achieves this constant time lookup by storing all nodes and relationships as fixed size records:
With fixed sized records and pointer-like record IDs, traversals are
implemented simply by chasing pointers around a data structure, which
can be performed at very high speed. To traverse a particular
relationship from one node to another, the database performs several
cheap ID computations (these computations are much cheaper than
searching global indexes, as we’d have to do if faking a graph in a
non-graph native database)
This last sentence triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains, or make up for their absence?
Neo4j only achieves O(1) when the data is in memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (it has a poor disk representation).
Titan only achieves O(1) when the data is in memory in the same JVM. When the data is on disk, Titan is faster than Neo4j because it has a better disk representation.
Please see the following blog post that explains the above quantitatively:
http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/
Thus, it's important to understand, when people say O(1), which part of the memory hierarchy they are in. When you are in a single JVM (a single machine), it's easy to be fast, as both Neo4j and Titan demonstrate with their respective caching engines. When you can't put the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.
Please see the following two blog posts for more information:
http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/
http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/
OrientDB uses a similar approach, where relationships are managed without indexes (index-free adjacency) but rather with direct pointers (LINKs) between vertices. It's like in-memory pointers, but on disk. In this way OrientDB achieves O(1) traversal, both in memory and on disk.
But if you have a vertex "City" with thousands of edges to "Person" vertices, and you're looking for all the people with age > 18, then OrientDB uses indexes, because a query is involved, so in this case it's O(log N).

Why doesn't my nine-step path Cypher query on a small database ever finish?

We are evaluating Neo4J for our application, testing it against a small test database with a total of around 20K nodes, 150K properties, and 100K relationships. The branching factor is ~100 relationships/node. Server and version information is below [1]. The Cypher query is:
MATCH p = ()-[r1:RATES]-(m1:Movie)-[r2:RATES]-(u1:User)-[r3:RATES]-(m2:Movie)-[r4:RATES]-()
RETURN r1.id as i_id, m1.id, r2.id, u1.id, r3.id, m2.id, r4.id as t_id;
(The first and last empty nodes aren't important to us, but I didn't see how to start with relationships.)
I killed it after a couple of hours. Maybe I'm expecting too much by hoping Neo4J would avoid combinatorial explosion. I tried tweaking some server parameters but got no further.
My main question is whether what I'm trying to do (a nine-step path query) is reasonable for Neo4J, or, for that matter, any graph database. I realize nine steps is a very deep search, and one that touches every node in the database multiple times, but unfortunately that's what our research needs to do.
Looking forward to your thoughts.
[1] System info:
The Linux server has 32 processors and 64GB of memory.
Neo4j - Graph Database Kernel (neo4j-kernel), version: 2.1.2.
java version "1.7.0_60", Java(TM) SE Runtime Environment (build 1.7.0_60-b19), Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
To answer your main question: Neo4j has no problem with a variable-length query, provided it does not result in a combinatorial explosion of the search space (exponential time complexity as a result of your branching factor).
There is however an optimization that can be done to your Cypher query.
MATCH ()-[r1:RATES]->(m1:Movie),
      (m1)<-[r2:RATES]-(u1:User),
      (u1)-[r3:RATES]->(m2:Movie),
      (m2)<-[r4:RATES]-()
RETURN r1.id as i_id, m1.id, r2.id, u1.id, r3.id, m2.id, r4.id as t_id;
That being said, Cypher currently has some limitations with these kinds of queries, which we call "graph global operations". When you run a query that touches the graph globally without a specific starting point, computation as well as reads from and writes to disk can cause performance bottlenecks. And when returning large payloads over HTTP REST, you'll run into your network's data transfer limits.
To test the difference between query response times due to network data transfer constraints, compare the previous query to the following:
MATCH ()-[r1:RATES]->(m1:Movie),
      (m1)<-[r2:RATES]-(u1:User),
      (u1)-[r3:RATES]->(m2:Movie),
      (m2)<-[r4:RATES]-()
RETURN count(*)
The difference between the queries in response time should be significant.
So what are your options?
Option 1:
Write a Neo4j unmanaged extension in Java that runs embedded in the JVM, using Neo4j's Java API. Your Cypher query can be translated imperatively into a traversal description that operates on your graph in memory (see the sketch after the link below). Given that you have 64GB of memory, your Java heap should be configured so that Neo4j has access to 70-85% of your available memory.
You can learn more about the Neo4j Java API here: http://docs.neo4j.org/chunked/stable/server-unmanaged-extensions.html
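A minimal sketch of such a traversal (assuming the Neo4j 2.x embedded Java API; the class name, depth evaluator, and printing are illustrative, not from the original post):
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.graphdb.traversal.TraversalDescription;

public class RatesTraversal {
    public static void printFourHopPaths(GraphDatabaseService db, Node start) {
        // Follow RATES relationships out to exactly four hops, mirroring
        // the four-relationship Cypher pattern above.
        TraversalDescription td = db.traversalDescription()
                .breadthFirst()
                .relationships(DynamicRelationshipType.withName("RATES"))
                .evaluator(Evaluators.atDepth(4));
        try (Transaction tx = db.beginTx()) {
            for (Path path : td.traverse(start)) {
                System.out.println(path); // process each path in-JVM
            }
            tx.success();
        }
    }
}
Because the traversal runs inside the same JVM as the data, it avoids the HTTP serialization and network transfer costs discussed above.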
Option 2:
Tune Neo4j's performance configuration to run your graph in memory, and optimize your Cypher queries to limit the amount of data transferred over the network. Performance will still be suboptimal for graph global operations.
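For instance, a sketch of the kind of tuning meant here, for the Neo4j 2.1 series (the values are illustrative and should be sized to your store files; exact setting names vary by version):
# conf/neo4j.properties: memory-map the store files
neostore.nodestore.db.mapped_memory=2G
neostore.relationshipstore.db.mapped_memory=8G
neostore.propertystore.db.mapped_memory=4G
# conf/neo4j-wrapper.conf: JVM heap size in MB
wrapper.java.initmemory=16384
wrapper.java.maxmemory=16384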

Poor performance of Neo4j Cypher query for transitive closure

I have a graph with ~89K nodes and ~1.2M relationships, and am trying to get the transitive closure of a single node via the following Cypher query:
start n=NODE(<id of a single node of interest>)
match (n)-[*1..]->(m)
where has(m.name)
return distinct m.name
Unfortunately, this query goes away and doesn't seem to come back (although to be fair I've only given it about an hour of execution time at this point).
Any suggestions on ways to optimise what I've got here, or better ways to achieve the requirement?
Notes:
Neo4J v2.0.0 (installed via Homebrew).
Mac OSX 10.8.5
Oracle Java 1.7.0_51
8GB physical RAM (neo4j JVM assigned whatever the default is)
Database is hosted on an SSD volume.
Query is submitted via the admin web UI's "Data browser".
"name" is an auto-indexed field.
CPU usage is fairly low - averaging around 20% of 8 cores.
I haven't gotten into the weeds of profiling the Neo4J server yet - my first attempt locked up VisualVM.
That's probably a combinatorial explosion of paths; care to try this?
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match p = shortestPath((n)-[*]->(m))
return m.name
Without shortestPath it would look like the following, but as you are only interested in the nodes reachable from n, the above should be good enough.
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match (n)-[*]->(m)
return distinct m.name
Try guery - https://code.google.com/p/gueryframework/ - this is a standalone library, but it has a Neo4j adapter. I.e., you will have to rewrite your queries in the guery format.
Better support for transitive closure was one of the main reasons for developing guery. We mainly use it in software analysis tools where we need reachability / pattern analysis (e.g., the antipattern queries in http://xplrarc.massey.ac.nz/ are computed using guery).
There is a brief discussion about this in the neo4j google group:
https://groups.google.com/forum/#!searchin/neo4j/jens/neo4j/n69ksEJxDtQ/29DNKyWKur4J
and an (older, not maintained) project with some benchmarking code:
https://code.google.com/p/graph-query-benchmarks/
Cheers, Jens
