Neo4j - Cypher vs Gremlin query language

I'm starting to develop with Neo4j using the REST API.
I saw that there are two options for performing complex queries - Cypher (Neo4j's query language) and Gremlin (the general purpose graph query/traversal language).
Here's what I want to know: is there any query or operation that can be done with Gremlin but not with Cypher, or vice versa?
Cypher seems much clearer to me than Gremlin, and in general it seems that the guys at Neo4j are going with Cypher.
But - if Cypher is limited compared to Gremlin - I would really like to know that in advance.

For general querying, Cypher is enough and is probably faster. The advantage of Gremlin over Cypher is when you get into high level traversing. In Gremlin, you can better define the exact traversal pattern (or your own algorithms) whereas in Cypher the engine tries to find the best traversing solution itself.
I personally use Cypher because of its simplicity and, to date, I have not had any situation where I had to use Gremlin (except when working with Gremlin's GraphML import/export functions). I expect, however, that even if I needed Gremlin, it would be for a specific query I found on the net and would never come back to.
You can always learn Cypher really fast (in days) and then continue with the (longer-run) general Gremlin.

We have to traverse thousands of nodes in our queries. Cypher was slow. The Neo4j team told us that implementing our algorithm directly against the Java API would be 100-200 times faster. We did so and easily got a factor of 60 out of it. As of now we do not have a single Cypher query in our system, due to lack of confidence. Easy Cypher queries are easy to write in Java, and complex queries won't perform. The problem is that when you have multiple conditions in your query, there is no way in Cypher to say in which order the traversals should be performed. So your Cypher query may go wild into the graph in the wrong direction first.
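To make the point about traversal order concrete, here is a minimal Python sketch on a toy in-memory graph. It is not Neo4j code: the graph, the relationship names KNOWS and LIVES_IN, and the functions from_person and from_city are all made up for illustration. Both functions answer the same question ("which of alice's contacts live in berlin?"), but they start from opposite ends of the pattern and count how many relationships they touch.

```python
from collections import defaultdict

def build_graph():
    """Build a toy graph (hypothetical data) indexed by outgoing and incoming edges."""
    out_edges = defaultdict(list)
    in_edges = defaultdict(list)

    def add(src, rel, dst):
        out_edges[(src, rel)].append(dst)
        in_edges[(dst, rel)].append(src)

    add("alice", "KNOWS", "bob")
    add("alice", "KNOWS", "carol")
    add("bob", "LIVES_IN", "berlin")
    add("carol", "LIVES_IN", "paris")
    # Berlin is a dense node: 1000 unrelated residents.
    for i in range(1000):
        add(f"p{i}", "LIVES_IN", "berlin")
    return out_edges, in_edges

def from_person(out_edges, person, city):
    """Start at the person, follow KNOWS, then check LIVES_IN."""
    touched, matches = 0, []
    for friend in out_edges[(person, "KNOWS")]:
        touched += 1
        for home in out_edges[(friend, "LIVES_IN")]:
            touched += 1
            if home == city:
                matches.append(friend)
    return matches, touched

def from_city(in_edges, person, city):
    """Start at the city, walk LIVES_IN backwards, then check KNOWS."""
    touched, matches = 0, []
    for resident in in_edges[(city, "LIVES_IN")]:
        touched += 1
        for knower in in_edges[(resident, "KNOWS")]:
            touched += 1
            if knower == person:
                matches.append(resident)
    return matches, touched

out_edges, in_edges = build_graph()
print(from_person(out_edges, "alice", "berlin"))
print(from_city(in_edges, "alice", "berlin"))
```

Both calls return the same match, but the second touches far more relationships because it starts at the dense node. With a hand-written traversal you pick the starting point yourself; with a declarative query that choice is left to the planner.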
I have not done much with Gremlin, but I could imagine you get much more execution control with Gremlin.

The Neo4j team's efforts on Cypher have been really impressive, and it's come a long way. The Neo team typically pushes people toward it, and as Cypher matures, Gremlin will probably get less attention. Cypher is a good long-term choice.
That said, Gremlin is a Groovy DSL. Using it through its Neo4j REST endpoint allows full, unfettered access to the underlying Neo4j Java API. It (and other script plugins in the same category) cannot be matched in terms of low-level power. Plus, you can run Cypher from within the Gremlin plugin.
Either way, there's a sane upgrade path where you learn both. I'd go with the one that gets you up and running faster. In my projects, I typically use Gremlin and then call Cypher (from within Gremlin or not) when I need tabular results or expressive pattern matching; both are a pain in the Gremlin DSL.

I initially started using Gremlin. However, at the time, the REST interface was a little unstable, so I switched to Cypher. It has much better support for Neo4j. However, there are some types of queries that are simply not possible with Cypher, or where Cypher can't quite optimize the way you can with Gremlin.
Gremlin is built on Groovy, so you can actually use it as a generic way to get Neo4j to execute 'Java' code and perform various tasks on the server, without having to take the HTTP hit from the REST interface. Among other things, Gremlin will let you modify data.
However, when all I want is to query data, I go with Cypher as it is more readable and easier to maintain. Gremlin is the fallback when a limitation is reached.

Gremlin queries can be generated programmatically.
(See http://docs.sqlalchemy.org/en/rel_0_7/core/tutorial.html#intro-to-generative-selects to know what I mean.)
This seems to be a bit more tricky with Cypher.
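For illustration, here is a rough Python sketch of what generative construction of a Cypher query could look like. The CypherQuery class and its methods are hypothetical and not part of any Neo4j driver; the sketch only assembles a query string plus a parameter map, and it assumes the `$name` parameter syntax used by recent Neo4j versions (older versions used `{name}`).

```python
class CypherQuery:
    """Hypothetical helper that composes a MATCH / WHERE / RETURN query step by step."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.conditions = []
        self.params = {}
        self.returns = []

    def where(self, prop, op, value):
        # Values go into a parameter map instead of being spliced into the text.
        name = f"p{len(self.params)}"
        self.conditions.append(f"{prop} {op} ${name}")
        self.params[name] = value
        return self

    def returning(self, *exprs):
        self.returns.extend(exprs)
        return self

    def build(self):
        if not self.returns:
            raise ValueError("a query needs at least one RETURN expression")
        parts = [f"MATCH {self.pattern}"]
        if self.conditions:
            parts.append("WHERE " + " AND ".join(self.conditions))
        parts.append("RETURN " + ", ".join(self.returns))
        return "\n".join(parts), dict(self.params)


query = (
    CypherQuery("(p:Person)-[:FRIEND]->(f)")
    .where("p.pid", "=", "56")
    .where("f.age", ">", 30)
    .returning("f.pid")
)
text, params = query.build()
print(text)
print(params)
```

Unlike SQLAlchemy's generative selects, this sketch mutates one object rather than returning a new one per call, and it covers only a single MATCH clause. It does show where the difficulty lies: the pattern itself is still an opaque string, so composing patterns programmatically is the harder part.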

Cypher only works for simple queries. When you start incorporating complex business logic into your graph traversals it becomes prohibitively slow or stops working altogether.
Neo4j clearly knows that Cypher isn't cutting it, because they also provide the APOC procedures, which include alternative path expanders (apoc.path.expand, apoc.path.subgraphAll, etc.).
Gremlin is harder to learn, but it's more powerful than Cypher and APOC. You can implement any logic you can think of in Gremlin.
I really wish Neo4j shipped with a toggleable Gremlin server (from reading around, this used to be the case). You can get Gremlin running against a live Neo4j instance, but it involves jumping through a lot of hoops. My hope is that since Neo4j's competitors are allowing Gremlin as an option, Neo4j will follow suit.

Cypher is a declarative query language for graph databases. The term declarative matters because it is a different way of programming from paradigms such as imperative programming.
In a declarative query language like Cypher or SQL, we tell the underlying engine what data we want to fetch; we do not specify how it should be fetched.
In Cypher, a user defines a subgraph of interest in the MATCH clause. The underlying engine then runs a pattern-matching algorithm to find occurrences of that subgraph in the graph database.
Gremlin has both declarative and imperative features. It is a graph traversal language in which the user gives explicit instructions on how the graph is to be navigated.
The difference between the languages in this case is that in Cypher we can use a Kleene star operator to find paths between any two given nodes in a graph database. In Gremlin, however, we have to define all such paths explicitly, although we can use the repeat operator to find multiple occurrences of such explicit paths. Iterating over explicit structures, on the other hand, is not possible in Cypher.

If you use Gremlin, you can migrate to a different graph database later.
Since most graph databases support Gremlin traversals, it is a good idea to choose Gremlin.

Long story short: use Cypher for queries and Gremlin for traversals. You will see the difference in response times yourself.

Related

setting cypher planner for a query

I'm trying to understand the Cypher planner, and I'm not sure about a few things.
Should I ever change it, or let the Cypher engine control it?
What is the difference between the COST and the RULE planner?
Since every version of neo4j may tweak the planners, the only way to know for sure which planner works better for a specific query and a specific neo4j version would be to use PROFILE and performance testing. Also, since the plan generated by the COST planner depends on the actual characteristics of your data, you may also want to periodically test query performance with both planners even when you do not upgrade to a newer neo4j version.
This neo4j blog entry provides some details on the planners.

Modifying Cypher Query Engine

I would like to modify the way Cypher processes queries sent to it for pattern matching. I have read about Execution plans and how Cypher chooses the best plan with the least number of operations and all. This is pretty good. However I am looking into implementing a Similarity Search feature that allows you to specify a Query graph that would be matched if not exact, close (similar). I have seen a few examples of this in theory. I would like to implement something of this sort for Neo4j. Which I am guessing would require a change in how the Query Engine deals with queries sent to it. Or Worse :)
Here are some links that demonstrate the idea
http://www.cs.cmu.edu/~dchau/graphite/graphite.pdf
http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper72.pdf
I am looking for ideas. Anything at all in relation to the topic would be helpful. Thanks in advance
(:I)<-[:NEEDING_HELP_FROM]-(:YOU)
From my point of view, you would be better off creating Unmanaged Extensions,
because they let you add your own custom functionality to the Neo4j server.
You cannot extend the Cypher language itself without maintaining your own fork of the source code.

Neo4j Traversal API vs. Cypher

When should I choose Neo4j’s traversal framework over Cypher?
For example, for a friend-of-a-friend query I would write a Cypher query as follows:
MATCH (p:Person {pid:'56'})-[:FRIEND*2..2]->(fof)
WHERE NOT (p)-[:FRIEND]->(fof)
RETURN fof.pid
And the corresponding Traversal implementation would require two traversals, for friends_at_depth_1 and friends_at_depth_2 (or a core API call to get the relationships), and then finding the difference of these two sets using plain Java constructs, outside of the traversal description. Correct me if I’m wrong here.
Any thoughts?
The key thing to remember about Cypher vs. the traversal API is that the traversal API is an imperative way of accessing a graph, and Cypher is a declarative way of accessing a graph. You can read more about that difference here but the short version is that in imperative access, you're telling the database exactly how to go get the graph. (E.g. I want to do a depth first search, prune these branches, stop when I hit certain nodes, etc). In declarative graph query, you're instead specifying what you want, and you're delegating all aspects of how to get it to the Cypher implementation.
I'd slightly revise your query:
MATCH (p:Person {pid:'56'})-[:FRIEND*2..2]->(fof)
WHERE NOT (p)-[:FRIEND]->(fof) AND
p <> fof
RETURN fof.pid
(I added making sure that p<>fof because friend links might go back to the original person)
To do this in a traverser, you wouldn't need two traversers, just one. You'd traverse only FRIEND relationships, stop at depth 2, and accumulate a set of results.
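As a rough illustration of that single traversal, here is a Python sketch over a plain adjacency dictionary. It is not the Neo4j Traversal API: the friends mapping and the friends_of_friends function are made up for this example, and edges are treated as directed, as in the Cypher pattern above.

```python
def friends_of_friends(friends, start):
    """Follow FRIEND edges to depth 2 and collect the nodes reached there.

    `friends` maps a person id to the ids they point to with a FRIEND edge
    (a hypothetical in-memory stand-in for the graph).
    """
    direct = set(friends.get(start, ()))
    result = set()
    for friend in direct:
        for fof in friends.get(friend, ()):
            # Mirrors WHERE NOT (p)-[:FRIEND]->(fof) AND p <> fof
            if fof != start and fof not in direct:
                result.add(fof)
    return result


friends = {
    "56": ["1", "2"],
    "1": ["2", "3", "56"],  # "1" links back to "56", hence the p <> fof check
    "2": ["4"],
    "3": ["5"],
}
print(sorted(friends_of_friends(friends, "56")))
```

The filtering happens while the results are accumulated, so no second traversal or separate set difference is needed.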
Now, I'm going to attempt to argue that you should almost always use Cypher, and never use the traversal API unless you have very specific circumstances. Here are my reasons:
Declarative query is very powerful, in that it frees you from thinking about the how. All you need to know is what you want. This means you spend more time focusing on what your code is supposed to do, and less time in implementation detail.
The Cypher query executor is getting better all the time (version 2.2 will have a cost-based planner), and of course they put a lot of effort into making sure Cypher exploits all available indexes. It's possible that for many queries Cypher would do a better job of finding your data than your traversal, unless you were very careful in coding the traversal.
Cypher is just way less code than writing your own traversal, which will frequently require you to implement certain classes to do specialized stop conditions, etc.
At present, Cypher can run in embedded databases or on the server. If you want to run a traversal, you can't send that remotely to a server to be executed; at best you could write a server extension that did the traversal. So I think Cypher is more flexible at present.
OK so when should you use traversal? Two key cases that I know of (others may suggest others)
Sometimes you need to execute a complex custom Java operation on everything you traverse. In this case, you're using the traverser as a "visitor function" of sorts, and sometimes traversals are more convenient to use than Cypher, depending on the nature of the Java code you're running on the nodes.
Sometimes your performance requirements are so intense, you need to hand-traverse the graph, because there's some aspect of graph structure that you can exploit in the traverser to make it go faster that Cypher can't take advantage of. This does happen, but going to this first usually isn't a good idea.
An excerpt from the book
Core API, Traversal Framework or Cypher?
The Core API allows developers to fine-tune their queries so that they exhibit high
affinity with the underlying graph. A well-written Core API query is often faster than
any other approach. The downside is that such queries can be verbose, requiring considerable
developer effort. Moreover, their high affinity with the underlying graph
makes them tightly coupled to its structure. When the graph structure changes, they
can often break. Cypher can be more tolerant of structural changes—things such as
variable-length paths help mitigate variation and change.
The Traversal Framework is both more loosely coupled than the Core API (because it
allows the developer to declare informational goals), and less verbose, and as a result
a query written using the Traversal Framework typically requires less developer effort
than the equivalent written using the Core API. Because it is a general-purpose
framework, however, the Traversal Framework tends to perform marginally less well
than a well-written Core API query.
If we find ourselves in the unusual situation of coding with the Core API or Traversal
Framework (and thus eschewing Cypher and its affordances), it’s because we are
working on an edge case where we need to finely craft an algorithm that cannot be
expressed effectively using Cypher’s pattern matching. Choosing between the Core
API and the Traversal Framework is a matter of deciding whether the higher abstraction/lower
coupling of the Traversal Framework is sufficient, or whether the
close-to-the-metal/higher coupling of the Core API is in fact necessary for implementing an
algorithm correctly and in accordance with our performance requirements.
Ref: Graph Databases, New Opportunities for Connected Data, p161
What is cypher?
The developer documentation defines it as follows: Cypher is a declarative, SQL-inspired language for describing patterns in graphs visually using an ASCII-art syntax.
You can find more about it here.
What is core API practically?
I found this page having following sentence:
Besides an object-oriented API to the graph database, working with Node, Relationship, and Path objects, it also offers highly customizable, high-speed traversal- and graph-algorithm implementations.
So practically speaking core API deals with basic objects such as Node, Relationship which belongs to org.neo4j.graphdb package.
You can find more at its developer guide.
What is traversal API practically?
Traversal API adds more interfaces to core API to help us conveniently perform traversal, instead of writing the whole traversal logic from scratch. These interfaces are contained in org.neo4j.graphdb.traversal package.
You can find more at its developer guide.
The relation between all three
According to this answer:
The Traversal API is built on the Core API, and Cypher is build on the Traversal API; So anything you can do in Cypher, can be done with the other 2.
Same example done with all three
This tutorial from 2012 shows all three in action for performing same task, with Core API being fastest. It includes a quote from Andres Taylor:
Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on. The focus in this first phase has been to explore the language, and learn about how our users use the query language, and to expand the feature set to a reasonable level.
I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. Like every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually the compiler catches up.
Article's conclusion:
So far I was only using the Java Core API working with neo4j and I will continue to do so.
If you are in a high speed scenario (I believe every web application is one) you should really think about switching to the neo4j Java core API for writing your queries. It might not be as nice looking as Cypher or the traverser Framework but the gain in speed pays off.
Also I personally like the amount of control that you have when traversing over the core yourself.

Is item-based collaborative filtering feasible with Neo4J/Cypher?

I'm testing Neo4j as a potentially more efficient alternative to (non-distributed) Mahout for item-based collaborative filtering (i.e. recommending items for a user based on his and others' preferences), and have seen excellent examples using Gremlin, e.g. http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/ - but none for Cypher.
Is this practical/feasible with Cypher, or should I just bite the bullet and start using Gremlin (and the REST API)?
We've done these kinds of recommendations using Neo4j, and Cypher in particular, and are really pleased with the results.
Of course it could span a couple of Cypher queries, depending on the complexity of your logic, but it's entirely doable.
I realize this is an extremely simplified approach, but it might help you compare the Gremlin and Cypher approaches:
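To show the kind of logic involved, here is a small Python sketch of co-occurrence-based item recommendation. It is not the answerer's implementation and uses no Neo4j API; the likes data and the recommend function are hypothetical. It scores each candidate item by the number of user -> item <- other user -> candidate paths, which is the same path counting a graph query would do.

```python
from collections import Counter

def recommend(likes, user, top_n=3):
    """Rank items the user has not liked by co-occurrence with items they have.

    `likes` maps each user to the set of items they like (hypothetical data).
    """
    mine = likes[user]
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        shared = len(mine & items)  # items both users like
        if shared == 0:
            continue
        for item in items - mine:
            # Each shared item contributes one path to this candidate.
            scores[item] += shared
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


likes = {
    "ann": {"a", "b"},
    "bob": {"a", "b", "c"},
    "cat": {"a", "d"},
    "dan": {"e"},
}
print(recommend(likes, "ann"))
```

Summing the shared-item counts per candidate is equivalent to adding up item-to-item co-occurrence counts over the user's own items. A real recommender would normally normalise these scores, which this sketch leaves out.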
Gremlin: http://blog.everymansoftware.com/2012/02/similarity-based-recommendation-engines.html
Cypher: http://thought-bytes.blogspot.in/2012/02/similarity-based-recommendations-with.html
Disclaimer: I am the author of the Thought Bytes post
Have you tried my open source project? Its name is reco4j; it is a graph-based recommender engine that uses Neo4j as its graph database.
It is in an early stage but it works for your use case.
Cheers,
Alessandro

cypher vs neo4j-sh : why do we have both?

I understand HOW the 2 differ:
neo4j-sh (not its real name, I'm guessing) works with a file-system-like abstraction
cypher is meant to be more of a SQL-like approach
But WHY do we have both?
I actually really like the ability to manipulate a data structure as a file system (like FUSE does with things like procfs) and would be happy to write all my important scripts in it.
But is it discouraged? The last thing I want to do is rely on a technology that will be unsupported or deprecated in the future.
I don't think the neo4j-shell is intended for use in an application, the intended use case is rather during development and debugging. Note that it also supports Cypher queries. I'd say go with Cypher wherever possible.
The neo4j-shell has been around since way before Cypher was invented, so that's why we currently have both.

Resources