Graph Database and Big Data


I've been working on a project that involves a graph database, and I used Neo4j as the tool.
Since I am on the verge of completing the project, I was thinking of integrating it with big data.
Is there any way of integrating or connecting them, and are there any current real-world examples that use that combination?

Take a look at this blog post about Mazerunner for Neo4j: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
It's still experimental and installation requires a VM deployment. It uses Apache Spark and HDFS to run PageRank and import the results back into Neo4j.
More graph algorithms will be added over time.
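
For readers unfamiliar with what the PageRank step computes, here is a minimal plain-Python sketch of the power-iteration idea. It is not Mazerunner's code (which runs on Spark); the function name, the damping value of 0.85, and the tiny edge list are assumptions made for illustration only.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        # Every node passes an equal share of its rank along each outgoing edge.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: "d" points at "a" but nothing points at "d".
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(edges)
```

In a setup like the one described above, the resulting score per node would be what gets written back to Neo4j as a node property.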

Related

Neo4j: difference between tinkerpop gremlin and aurelius gremlin

While searching the web for a Gremlin implementation for Neo4j, I found these two possible solutions:
https://github.com/thinkaurelius/neo4j-gremlin-plugin
http://tinkerpop.incubator.apache.org/docs/3.0.2-incubating/#neo4j-gremlin
Does anybody know what the difference between the two is in practice?
I saw that the first is a Neo4j plugin, while it's not really clear to me what the second is, or whether it would lock the entire database and so prevent other connections (I noticed that it requires the path to the data folder).
Which one is preferred in the Neo4j community?
Cheers,
Alberto

I'm not sure there's really a difference, as there isn't a direct comparison to be made. The second link is to the TinkerPop project, and specifically to the Neo4j implementation of the TinkerPop APIs. It runs in embedded mode and does not yet have support for HA (though we hope to have that soon). The Neo4j implementation can be run in Gremlin Server, which lets you send Gremlin to it through a REST, websockets, etc. endpoint.
The project in the first link you provided uses that implementation to allow you to send Gremlin to Neo4j Server - so the first project depends on the second.

Your rule of thumb should be activity in the source code.
neo4j-gremlin-plugin has 3 commits this year - https://github.com/thinkaurelius/neo4j-gremlin-plugin/commits/master
TinkerPop is much more active - https://github.com/apache/incubator-tinkerpop/commits/master/neo4j-gremlin/src/main/java/org/apache/tinkerpop/gremlin/neo4j
neo4j-gremlin-plugin: extends an existing Neo4j server with support for the Gremlin query language.
TinkerPop Neo4j-Gremlin: extends the Gremlin console with support for a Neo4j server.

spark streaming to neo4j

I need to feed Spark Streaming output into Neo4j as a graph in real time. Is there any way to do that? If so, can you share some example code? I have seen Mazerunner, but it only moves graph data from Neo4j to Spark GraphX. Thank you.

Mazerunner also writes data back.
Easiest would be to use a Neo4j connector to Neo4j server and write data back concurrently. Neo4j 2.2+ can sustain (quite) high concurrent write load.
For Scala you can use AnormCypher, and for Python, py2neo.
I'm currently looking into Spark integration for Neo4j, so it would help a lot if you could detail your use case a bit. E.g. do you use plain Spark (RDD / DStream) or GraphX?
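
As a rough illustration of writing a batch of edges back to Neo4j server, here is a minimal sketch that uses only the Python standard library against the transactional Cypher HTTP endpoint of Neo4j 2.x (`/db/data/transaction/commit`) rather than py2neo. The `Item` label, the `id` property, the `LINKS_TO` relationship type, the localhost URL, and the helper names are all assumptions for the example; it also assumes authentication is disabled on the server (otherwise an `Authorization` header is needed).

```python
import json
import urllib.request

# Hypothetical label, property, and relationship names; adapt to your model.
# The {rows} syntax is the Cypher parameter style used by Neo4j 2.x.
CYPHER = (
    "UNWIND {rows} AS row "
    "MERGE (a:Item {id: row.src}) "
    "MERGE (b:Item {id: row.dst}) "
    "MERGE (a)-[:LINKS_TO]->(b)"
)


def build_payload(edges):
    """Wrap a batch of (src, dst) pairs in one parameterized statement."""
    rows = [{"src": src, "dst": dst} for src, dst in edges]
    return {"statements": [{"statement": CYPHER, "parameters": {"rows": rows}}]}


def post_batch(edges, url="http://localhost:7474/db/data/transaction/commit"):
    """Send one batch in a single transaction (assumes auth is disabled)."""
    body = json.dumps(build_payload(edges)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the payload needs no running server; post_batch does.
payload = build_payload([("a", "b"), ("b", "c")])
```

With a DStream, something like `post_batch` could be called once per partition inside `foreachRDD` / `foreachPartition`, so that each Spark task writes its own batch concurrently.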

Which should I use to implement collaborative filtering on top of Neo4j?

I'm working on a project (a social network) that uses Neo4j (v1.9) as the underlying datastore, together with Spring Data Neo4j.
I'm trying to add a tag system to the project, and I'm searching for ways to efficiently implement tag recommendation using collaborative filtering strategies.
After a lot of research, I've come up with these options:
Cypher. It is the query language built into Neo4j. No other framework is needed, and the computation times may be better than with the others. I may also be able to implement the queries easily using Spring Data Neo4j.
Apache Mahout. It offers machine learning algorithms focused primarily on collaborative filtering, clustering and classification. However, it isn't designed for graph databases and could potentially be slow.
Apache Giraph. Open source counterpart of Google Pregel.
Apache Spark. It is a fast and general engine for large-scale data processing.
reco4j. It is the best-suited solution so far, but the project seems dead.
Apache Spark GraphX + Mazerunner. Suggested in the answer by #johnymontana. I'm reading up on it. The main issue is that I don't know whether it supports collaborative filtering.
Graphaware Reco. Suggested by #ChristopheWillemsen in a comment. According to the official site, it "is an extensible high-performance recommendation engine skeleton for Neo4j, allowing for computing and serving real-time as well as pre-computed recommendations."
However, I haven't yet understood whether it works with old versions of Neo4j (I can't upgrade the Neo4j version at the moment).
So, what do you suggest and why? Feel free to suggest other interesting frameworks not listed above.

Cypher is very fast when it comes to local traversals, but is not optimized for global graph operations. If you want to do something like compute similarity metrics between all pairs of users then using a graph processing framework (like Apache Spark GraphX) would be better. There is a project called Mazerunner that connects Neo4j and Spark that you might want to take a look at.
For a pure Cypher approach, here and here are a couple of recent blog posts demonstrating Cypher queries for recommendations.
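
To make the "similarity between all pairs of users" idea concrete, here is a small plain-Python sketch of user-based collaborative filtering for tags, using Jaccard similarity over each user's tag set. It is only an illustration of the kind of global computation meant above, not Mazerunner or GraphX code; the function names and the sample data are made up for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def all_pairs_similarity(user_tags):
    """Similarity for every unordered pair of users (the global step)."""
    return {
        (u, v): jaccard(user_tags[u], user_tags[v])
        for u, v in combinations(sorted(user_tags), 2)
    }


def recommend_tags(user, user_tags, top_n=3):
    """Score tags the user lacks by the similarity of the users who have them."""
    own = user_tags[user]
    scores = {}
    for other, tags in user_tags.items():
        if other == user:
            continue
        sim = jaccard(own, tags)
        if sim == 0:
            continue  # users with nothing in common contribute nothing
        for tag in tags - own:
            scores[tag] = scores.get(tag, 0.0) + sim
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_n]]


# Hypothetical sample data: which tags each user has used.
user_tags = {
    "alice": {"neo4j", "spark", "scala"},
    "bob": {"neo4j", "spark", "python"},
    "carol": {"cooking", "python"},
}
similarities = all_pairs_similarity(user_tags)
suggestions = recommend_tags("alice", user_tags)
```

The all-pairs step grows quadratically with the number of users, which is why a distributed framework becomes attractive once the user base is large, while the per-user recommendation step is the kind of local query Cypher handles well.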

Is there a nice Cypher desktop client for Neo4j?

I'm trying out a lot of Cypher queries on my Neo4j database and have found the web-based console to be a bit clumsy.
The Neo4j console is more useful, but I am not sure how to point it at my own database/dataset.
Is there a nice desktop client or tool for running Cypher queries as well as managing a Neo4j database, akin to SQL Server Management Studio?
I'd like to avoid using the web admin if possible.

When you start a Neo4j server, it also spins up a web-based admin tool that lets you submit ad hoc queries and see the results. Just go to your server's URL in your web browser.
There is also a decent visualization tool called Neoclipse. It's got some small bugs and a bit of a learning curve but it's pretty decent.

JCR (JackRabbit, ModeShape) vs. Graph (Neo4j)

In order to store hierarchical data, can a graph database (Neo4j) be viewed as an alternative to JCR-based solutions (ModeShape, JackRabbit)? Or do they belong to two different levels of abstraction, meaning that a JCR implementation could use Neo4j under the hood?
Thank you for your help.

Both. People are building CMS applications with Neo4j as the storage backend (see http://structr.org).
A JCR implementation could also be built on Neo4j; some people worked on that in the past. We also have a group using Neo4j as backend storage for Apache Shindig.

You might also want to take a look at OrientDB (http://www.orientdb.org/), which combines features of a graph database (like Neo4j) with those of a document database. There even seems to be a prototype that uses OrientDB as a storage adapter for Jackrabbit (https://github.com/eiswind/jackrabbit-orient), which illustrates how such a hybrid approach can be implemented.
