Spark Streaming to Neo4j

I need to input Spark Streaming output to Neo4j as a graph in real time. Is there any way to do that? If so, can you share some example code? I have seen Mazerunner, but it only imports graph data from Neo4j into Spark GraphX. Thank you.

Mazerunner also writes data back.
The easiest approach would be to use a Neo4j client to connect to the Neo4j server and write the data back concurrently; Neo4j 2.2+ can sustain quite a high concurrent write load.
For Scala you can use AnormCypher, and for Python, py2neo.
I'm currently looking into Spark integration for Neo4j, so it would help a lot if you could describe your use case in a bit more detail. E.g., do you use plain Spark (RDD/DStream) or GraphX?
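To make that concrete, below is a minimal Scala sketch of the concurrent-write pattern: each Spark Streaming partition opens its own connection and issues idempotent MERGE statements. It assumes the official Neo4j Bolt driver (which requires Neo4j 3.0+ and postdates this answer; on 2.2 you would go through AnormCypher or py2neo against the REST endpoint as suggested above), and the socket source, labels, and credentials are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

object StreamToNeo4j {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("stream-to-neo4j"), Seconds(5))

    // Example source: lines of "user,product" pairs from a socket.
    val events = ssc.socketTextStream("localhost", 9999)

    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Connections are created here, on the executors, because
        // driver and session objects are not serializable.
        val driver = GraphDatabase.driver("bolt://localhost:7687",
          AuthTokens.basic("neo4j", "password"))
        val session = driver.session()
        try {
          partition.foreach { line =>
            val Array(user, product) = line.split(",")
            // MERGE keeps the write idempotent if a batch is replayed.
            session.run(
              "MERGE (u:User {name: $user}) " +
              "MERGE (p:Product {name: $product}) " +
              "MERGE (u)-[:BOUGHT]->(p)",
              Values.parameters("user", user, "product", product))
          }
        } finally {
          session.close()
          driver.close()
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

A connection pool shared per executor (e.g., a lazily initialized singleton object) would avoid reopening the driver for every micro-batch partition.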

Related

Building and extending a Knowledge Graph with entity extraction, using Neo4j as my database

My goal is to build an automated Knowledge Graph. I have decided to use Neo4j as my database. I intend to load JSON files from my local directory into Neo4j. The data I will be using is the Yelp dataset (the JSON files are quite large).
I have seen some Neo4j examples with GraphAware and OpenNLP. I have read that Neo4j has good support for Java apps, and also that Neo4j supports Python (I intend to use NLTK). Is it advisable to use Neo4j with Java Maven/Gradle and OpenNLP? Or should I use it with py2neo and NLTK?
I am really sorry that I don't have any prior experience with these tools. Any advice or recommendation will be greatly appreciated. Thank you so much!
Welcome to Stack Overflow! Unfortunately, this is an opinion/recommendation question, so it isn't really appropriate for this forum.
However, this is an area I have worked in, so I can confidently say that Java (or Kotlin) is the best way to go for Neo4j. It is Neo4j's native language, and there is significantly more community support and a wider choice of libraries.
However, NLTK is much more powerful than OpenNLP. So, if your use case is simple enough for OpenNLP, then a pure Java/Kotlin approach is perfect (see the sketch after this answer). Alternatively, you can use Java as an interfacing layer for the stored graph, but use Python with NLTK for the language work feeding into the graph. This would, of course, dramatically increase the complexity of your project.
Ultimately, the best approach depends on your exact use-case and which trade-offs make the most sense for you.
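Below is a minimal sketch of the pure-JVM route mentioned above, written in Scala for brevity: OpenNLP's pre-trained person-name model extracts entities from a document, and the official Neo4j Bolt driver merges them into the graph. The model file, labels, credentials, and example text are all illustrative assumptions, not code from either project.

```scala
import java.io.FileInputStream

import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
import opennlp.tools.tokenize.SimpleTokenizer
import opennlp.tools.util.Span
import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

object EntitiesToGraph {
  def main(args: Array[String]): Unit = {
    // Pre-trained NER model from the OpenNLP site (path is illustrative).
    val model  = new TokenNameFinderModel(new FileInputStream("en-ner-person.bin"))
    val finder = new NameFinderME(model)

    val driver  = GraphDatabase.driver("bolt://localhost:7687",
      AuthTokens.basic("neo4j", "password"))
    val session = driver.session()
    try {
      val docId = "review-42"
      val text  = "Alice met Bob at the new cafe downtown."

      val tokens = SimpleTokenizer.INSTANCE.tokenize(text)
      val names  = Span.spansToStrings(finder.find(tokens), tokens)

      // MERGE makes re-processing the same document idempotent.
      names.foreach { name =>
        session.run(
          "MERGE (d:Document {id: $doc}) " +
          "MERGE (e:Entity {name: $name}) " +
          "MERGE (d)-[:MENTIONS]->(e)",
          Values.parameters("doc", docId, "name", name))
      }
    } finally {
      session.close()
      driver.close()
    }
  }
}
```

The Python-plus-NLTK variant would look much the same, with py2neo issuing the MERGE statements.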

Apache Flume vs Apache Flink difference

I need to read a stream of data from some source (in my case it's a UDP stream, but it shouldn't matter), transform each record, and write it to HDFS.
Is there any difference between using Flume or Flink for this purpose?
I know I can use Flume with a custom interceptor to transform each event.
But I am new to Flink, and it looks to me like Flink would do the same.
Which one is better to choose? Is there a difference in performance?
Please help!
Disclaimer: I'm a committer and PMC member of Apache Flink. I do not have detailed knowledge about Apache Flume.
Moving streaming data from various sources into HDFS is one of the primary use cases for Apache Flume as far as I can tell. It is a specialized tool and I would assume it has a lot of related functionality built in. I cannot comment on Flume's performance.
Apache Flink is a platform for data stream processing and is more generic and feature-rich than Flume (e.g., support for event time, advanced windowing, high-level APIs, fault-tolerant and stateful applications, ...). You can implement and execute many different kinds of stream processing applications with Flink, including streaming analytics and CEP.
Flink features a rolling file sink to write data streams to HDFS files and allows you to implement all kinds of custom behavior via user-defined functions. However, it is not a specialized tool for data ingestion into HDFS, so do not expect a lot of built-in functionality for this use case. Flink provides very good throughput and low latency.
If you do not need more than simple record-level transformations, I'd first try to solve your use case with Flume. I would expect Flume to come with a few features that you would need to implement yourself when choosing Flink. If you expect to do more advanced stream processing in the future, Flink is definitely worth a look.
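To make the comparison concrete, the Flink route would look roughly like the sketch below: a per-record map transformation followed by a rolling file sink. Flink has no built-in UDP source, so a socket source stands in here (a real UDP source would be a small custom SourceFunction); BucketingSink comes from flink-connector-filesystem and has since been superseded by StreamingFileSink in newer releases. Host, port, and paths are illustrative.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink

object UdpToHdfs {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in source; replace with a custom SourceFunction for UDP.
    val records = env.socketTextStream("localhost", 9999)

    // The per-record transformation the question asks about.
    val transformed = records.map(r => r.trim.toUpperCase)

    // Rolling part files in HDFS.
    transformed.addSink(new BucketingSink[String]("hdfs:///data/ingest"))

    env.execute("udp-to-hdfs")
  }
}
```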
Disclaimer: I'm a committer of Apache Flume. I do not have detailed knowledge about Apache Flink.
For the use case you have described, Flume could be the right choice.
You could use the Exec Source until a netcat UDP source gets committed to the codebase.
For the transformation, it's hard to provide suggestions, but you might want to take a look at Morphline Interceptor.
Regarding the channel, I would recommend Memory Channel, because if the source is UDP, some negligible data loss should be acceptable.
Sink-wise, HDFS Sink probably covers your needs.
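Putting those pieces together, a minimal flume.conf for the agent described above might look like this sketch (agent and component names, the port, and the HDFS path are illustrative; the Morphline Interceptor for the transformation would be attached to the source via its interceptors properties):

```
agent.sources  = udp-in
agent.channels = mem
agent.sinks    = hdfs-out

# Exec source as a stand-in until a netcat UDP source is available.
agent.sources.udp-in.type = exec
agent.sources.udp-in.command = nc -lu 5140
agent.sources.udp-in.channels = mem

# Memory channel: fast, but events are lost if the agent dies.
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = hdfs:///data/ingest
agent.sinks.hdfs-out.hdfs.fileType = DataStream
agent.sinks.hdfs-out.channel = mem
```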

Neo4j: difference between TinkerPop Gremlin and Aurelius Gremlin

As I was wandering the Web looking for a Gremlin implementation for Neo4j, I found these two possible solutions:
https://github.com/thinkaurelius/neo4j-gremlin-plugin
http://tinkerpop.incubator.apache.org/docs/3.0.2-incubating/#neo4j-gremlin
Does anybody know what is the difference between the two in practice?
I saw that the first is a Neo4j plugin, while it's not really clear to me what the second is, and whether it would lock the entire database, thus not allowing other connections (I noticed that it requires the path to the data folder).
Which one is preferred in the neo4j community?
Cheers,
Alberto
I'm not sure there's really a difference, as there isn't a direct comparison to be made. The second link is to the TinkerPop project, and specifically to the Neo4j implementation of the TinkerPop APIs. It runs in embedded mode and does not yet have support for HA (though we hope to have that soon). The Neo4j implementation can be run in Gremlin Server, which lets you send Gremlin to it as a REST, WebSockets, etc. endpoint.
The project in the first link you provided uses that implementation to allow you to send Gremlin to Neo4j Server - so the first project depends on the second.
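The embedded mode is also why the TinkerPop implementation asks for the data directory: it opens the store in-process and holds a lock on it while the JVM is running, as in this minimal sketch (the path is illustrative):

```scala
import org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph

object EmbeddedGremlin {
  def main(args: Array[String]): Unit = {
    // Opens the store in-process; the directory is locked until close().
    val graph = Neo4jGraph.open("/var/lib/neo4j/data/graph.db")
    val g = graph.traversal()

    val names = g.V().hasLabel("Person").values[String]("name").toList()
    names.forEach(n => println(n))

    graph.close()
  }
}
```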
Your rule of thumb should be activity in the source code.
neo4j-gremlin-plugin has 3 commits this year - https://github.com/thinkaurelius/neo4j-gremlin-plugin/commits/master
TinkerPop is much more active - https://github.com/apache/incubator-tinkerpop/commits/master/neo4j-gremlin/src/main/java/org/apache/tinkerpop/gremlin/neo4j
neo4j-gremlin-plugin
Extends an existing Neo4j server with support for the Gremlin query language.
TinkerPop Neo4j-Gremlin
Extends the Gremlin console with support for Neo4j.

Which should I use to implement collaborative filtering on top of Neo4j?

I'm working on a project (a social network) which uses Neo4j (v1.9) as the underlying datastore, together with Spring Data Neo4j.
I'm trying to add a tag system to the project and I'm searching for ways to efficiently implement tag recommendation using collaborative filtering strategies.
After a lot of research, I've come up with these options:
Cypher. It is the query language used by Neo4j. No other framework is needed, and computation times may well be better than with the alternatives. I could probably implement the queries easily using Spring Data Neo4j.
Apache Mahout. It offers machine learning algorithms focused primarily on the areas of collaborative filtering, clustering, and classification. However, it isn't designed for graph databases and could potentially be slow.
Apache Giraph. Open source counterpart of Google Pregel.
Apache Spark. It is a fast and general engine for large-scale data processing.
reco4j. It is the best suited solution until now, but the project seems dead.
Apache Spark GraphX + Mazerunner. Suggested in the answer by @johnymontana. I'm reading up on it. The main issue is that I don't know whether it supports collaborative filtering.
GraphAware Reco. Suggested by @ChristopheWillemsen in a comment. From the official site:
is an extensible high-performance recommendation engine skeleton for
Neo4j, allowing for computing and serving real-time as well as
pre-computed recommendations.
However, I haven't yet figured out whether it works with older versions of Neo4j (I can't upgrade Neo4j at the moment).
So, what do you suggest and why? Feel free to suggest other interesting frameworks not listed above.
Cypher is very fast when it comes to local traversals, but is not optimized for global graph operations. If you want to do something like compute similarity metrics between all pairs of users then using a graph processing framework (like Apache Spark GraphX) would be better. There is a project called Mazerunner that connects Neo4j and Spark that you might want to take a look at.
For a pure Cypher approach, here and here are a couple of recent blog posts demonstrating Cypher queries for recommendations.
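To illustrate the pure-Cypher option for the tag use case, here is a sketch of a classic user-based collaborative filtering query, wrapped in the official Bolt driver for completeness. The :TAGGED_WITH schema, labels, and credentials are illustrative assumptions, and on Neo4j 1.9 both the Cypher dialect and the access method (REST or embedded API rather than Bolt) would differ.

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

object RecommendTags {
  def main(args: Array[String]): Unit = {
    val driver = GraphDatabase.driver("bolt://localhost:7687",
      AuthTokens.basic("neo4j", "password"))
    val session = driver.session()
    try {
      // Users who applied the same tags as me also applied these other
      // tags; rank the candidates by how many such co-taggers they have.
      val result = session.run(
        """MATCH (me:User {id: $userId})-[:TAGGED_WITH]->(:Tag)
          |      <-[:TAGGED_WITH]-(other:User)-[:TAGGED_WITH]->(rec:Tag)
          |WHERE NOT (me)-[:TAGGED_WITH]->(rec)
          |RETURN rec.name AS tag, count(DISTINCT other) AS score
          |ORDER BY score DESC LIMIT 10""".stripMargin,
        Values.parameters("userId", "u42"))
      while (result.hasNext) {
        val row = result.next()
        println(s"${row.get("tag").asString()} (score ${row.get("score").asInt()})")
      }
    } finally {
      session.close()
      driver.close()
    }
  }
}
```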

Graph Database and Big Data

I've been working on a project that involves graph database and I used Neo4j as a tool.
Since I am on the verge of completing the project, I was thinking of integrating it with big data tools.
Is there any way of integrating or connecting them? Any current real-world examples that use that combo?
Take a look at this blog post about Mazerunner for Neo4j: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
It's still experimental and installation requires a VM deployment. It uses Apache Spark and HDFS to run PageRank and import the results back into Neo4j.
More graph algorithms will be added over time.
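Under the hood, the Spark side of such a pipeline boils down to something like the following GraphX sketch: load an edge list that was exported to HDFS, run PageRank, and write the scores back to HDFS for re-import into Neo4j. The paths are illustrative, and Mazerunner automates these steps rather than exposing them like this.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object PageRankJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank"))

    // Edge list exported from Neo4j, one "srcId dstId" pair per line.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///export/edges.txt")

    // Run PageRank until convergence within the given tolerance.
    val ranks = graph.pageRank(tol = 0.0001).vertices

    // Write "nodeId,rank" lines for re-import into Neo4j.
    ranks.map { case (id, rank) => s"$id,$rank" }
      .saveAsTextFile("hdfs:///export/pagerank")

    sc.stop()
  }
}
```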
